Unstract is an unstructured data extraction system and ETL pipeline orchestrator that uses large language models to convert documents, images, and scans into structured JSON. It provides a document extraction API for integrating these capabilities into external automation tools, and includes a Model Context Protocol (MCP) server so that AI agents can retrieve the structured information.
To improve data accuracy, the system includes a verification tool that combines dual-model verification with human-in-the-loop review, using coordinate-based highlighting to show where each value appears in the document. Extraction schemas are written in natural language and map unstructured content into predefined formats even when document layouts are inconsistent.
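As a rough illustration of the dual-model idea, the sketch below compares the fields returned by two independent extractions and routes any disagreement to a review queue, keeping the coordinates so a reviewer interface could highlight the region. It is not Unstract's implementation: `FieldResult`, `normalize`, `verify`, and the bounding-box layout are all hypothetical names and shapes chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    """One extracted field plus where it was found (hypothetical shape)."""
    value: str
    # Assumed bounding box layout: (page, x0, y0, x1, y1).
    bbox: tuple


def normalize(value: str) -> str:
    """Collapse whitespace and case so trivial differences do not count."""
    return " ".join(value.split()).lower()


def verify(primary: dict, challenger: dict):
    """Split fields into accepted values and items needing human review."""
    accepted = {}
    review = []
    for name in sorted(primary.keys() | challenger.keys()):
        a = primary.get(name)
        b = challenger.get(name)
        agree = (
            a is not None
            and b is not None
            and normalize(a.value) == normalize(b.value)
        )
        if agree:
            accepted[name] = a.value
        else:
            # Keep coordinates so a reviewer UI could highlight the region.
            source = a if a is not None else b
            review.append({
                "field": name,
                "primary": a.value if a is not None else None,
                "challenger": b.value if b is not None else None,
                "bbox": source.bbox,
            })
    return accepted, review
```

Fields on which both models agree are accepted automatically; a field that differs, or that only one model returned, lands in the review list with both candidate values side by side.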
The platform also covers the data movement around extraction: pipelines pull files from storage, process them, and load the results into databases or data warehouses. These workflows can be triggered on demand through a REST API or run on recurring cron-based schedules.
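The pull, process, and load flow can be pictured with a minimal sketch like the one below. It is only an analogy built on the Python standard library: a local folder stands in for file storage, SQLite stands in for the destination database, and `fake_extract` and `run_pipeline` are hypothetical names, with `fake_extract` replacing the LLM extraction step.

```python
import json
import sqlite3
from pathlib import Path
from typing import Callable


def fake_extract(path: Path) -> dict:
    """Placeholder for the extraction step; returns JSON-serializable data."""
    text = path.read_text(encoding="utf-8")
    return {"file": path.name, "characters": len(text)}


def run_pipeline(
    source_dir: Path,
    db_path: str,
    extract: Callable[[Path], dict] = fake_extract,
) -> int:
    """Pull files from source_dir, extract each, and load results into SQLite."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(file TEXT PRIMARY KEY, payload TEXT)"
        )
        loaded = 0
        for path in sorted(source_dir.glob("*.txt")):
            payload = json.dumps(extract(path))
            # Replace on conflict so a recurring run can reprocess a file.
            conn.execute(
                "INSERT OR REPLACE INTO results (file, payload) VALUES (?, ?)",
                (path.name, payload),
            )
            loaded += 1
        conn.commit()
        return loaded
    finally:
        conn.close()
```

Because the load step replaces rows by file name, the same function can be called once on demand or repeatedly by a scheduler (for example, a cron entry such as `0 * * * *` for an hourly run) without duplicating results.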
The entire application stack is provided as a dockerized deployment.