# zipstack/unstract

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/zipstack-unstract).**

6,669 stars · 633 forks · Python · AGPL-3.0

## Links

- GitHub: https://github.com/Zipstack/unstract
- Homepage: https://unstract.com
- awesome-repositories: https://awesome-repositories.com/repository/zipstack-unstract.md

## Topics

`ai-agents` `data-engineering` `document-ai` `generative-ai` `idp` `json-extraction` `llm` `mcp-server` `ocr` `pdf-extraction` `prompt-engineering` `structured-output`

## Description

Unstract is an unstructured data extraction system and ETL pipeline orchestrator that uses large language models to convert documents, images, and scans into structured JSON. It provides a document extraction API for integrating these capabilities into external automation tools and includes a Model Context Protocol server to connect AI agents to structured information retrieval.

The system supports data accuracy through a verification tool that combines dual-model verification, in which the outputs of two separate language model passes are compared, with human-in-the-loop review using coordinate-based document highlighting. It uses natural language extraction schemas to map unstructured content into predefined formats regardless of layout inconsistencies.
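The dual-model verification idea can be illustrated with a minimal sketch. This is not Unstract's implementation: the function names `compare_extractions` and `_normalize` and the whitespace and case normalization rule are assumptions made for illustration. The sketch compares two extraction results field by field and returns the fields that disagree, which could then be routed to a human reviewer.

```python
from typing import Any, Dict


def _normalize(value: Any) -> Any:
    """Collapse whitespace and ignore case for strings; leave other types as-is."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    return value


def compare_extractions(
    primary: Dict[str, Any], challenger: Dict[str, Any]
) -> Dict[str, Dict[str, Any]]:
    """Return the fields whose values differ between two extraction passes.

    Hypothetical helper: a field missing from one pass is treated as None,
    so it is reported as a disagreement unless the other pass also has None.
    """
    disagreements: Dict[str, Dict[str, Any]] = {}
    for field in sorted(set(primary) | set(challenger)):
        a = primary.get(field)
        b = challenger.get(field)
        if _normalize(a) != _normalize(b):
            disagreements[field] = {"primary": a, "challenger": b}
    return disagreements
```

Fields that appear in the returned mapping are candidates for manual review; an empty mapping means both passes agreed on every field under this simple comparison rule.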

The platform covers the full lifecycle of data movement, including pipelines that pull files from storage and load processed results into databases or warehouses. These workflows can be triggered manually, via REST API, or on recurring cron-based schedules.
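As a rough illustration of the pull, extract, and load flow described above, the following sketch reads files from a local directory, passes each one to an extraction callable, and stores the resulting JSON in SQLite. Everything here is an assumption for illustration: `run_pipeline`, the `extractions` table, and the `extract` callable are hypothetical stand-ins and do not reflect Unstract's connectors or API. In a real deployment the extractor would call an LLM-backed workflow and the destination would be a database or warehouse connector.

```python
import json
import sqlite3
from pathlib import Path
from typing import Any, Callable, Dict


def run_pipeline(
    source_dir: Path,
    extract: Callable[[bytes], Dict[str, Any]],
    conn: sqlite3.Connection,
) -> int:
    """Extract every file in source_dir and load the results into SQLite.

    Returns the number of files processed. Re-running the pipeline
    overwrites the row for a file that was already loaded.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extractions ("
        "file_name TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    loaded = 0
    for path in sorted(source_dir.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories in this simple sketch
        record = extract(path.read_bytes())  # hypothetical extraction step
        conn.execute(
            "INSERT OR REPLACE INTO extractions (file_name, payload) VALUES (?, ?)",
            (path.name, json.dumps(record)),
        )
        loaded += 1
    conn.commit()
    return loaded
```

A scheduler such as cron could invoke a script that calls `run_pipeline` on a recurring basis; the function itself is deliberately unaware of how it is triggered.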

The entire application stack is provided as a dockerized deployment.

## Tags

### Data & Databases

- [LLM-Integrated Extraction Pipelines](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration/data-engineering-pipelines/llm-integrated-extraction-pipelines.md) — Orchestrates complex data pipelines that chain file ingestion, LLM-based extraction, and database loading.
- [Document-to-JSON Converters](https://awesome-repositories.com/f/data-databases/document-to-json-converters.md) — Converts unstructured files such as PDFs, images, and scans into structured JSON using natural language prompts. ([source](https://cdn.jsdelivr.net/gh/zipstack/unstract@main/README.md))
- [Data Pipeline Orchestration](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration.md) — Provides a workflow engine for defining, scheduling, and monitoring document extraction and loading sequences.
- [ETL Workflows](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration/etl-workflows.md) — Orchestrates the full lifecycle of extracting document data and loading it into databases or data warehouses. ([source](https://docs.unstract.com/unstract/unstract_platform/quick_start))
- [Document and Unstructured Extraction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction.md) — Converts unstructured PDFs, images, and scans into structured JSON using large language models.
- [Prompt-Based Schema Mapping](https://awesome-repositories.com/f/data-databases/json-schema-modeling/llm-driven-schema-generation/prompt-based-schema-mapping.md) — Uses natural language prompts and LLMs to map unstructured document content into predefined JSON schemas.
- [Structured Data Extraction](https://awesome-repositories.com/f/data-databases/structured-data-extraction.md) — Reads unstructured files from a filesystem and converts them into structured formats based on defined extraction schemas. ([source](https://docs.unstract.com/unstract/unstract_platform/etl_pipeline/unstract_etl_pipeline_intro/))
- [Unstructured Data Transformation Tools](https://awesome-repositories.com/f/data-databases/unstructured-data-transformation-tools.md) — Uses large language models to convert narrative, unstructured documents into structured JSON schemas.
- [Model Context Protocol Servers](https://awesome-repositories.com/f/data-databases/graph-data-models/model-context-protocol-servers.md) — Implements a Model Context Protocol server that allows AI agents to process documents and receive structured results. ([source](https://docs.unstract.com/unstract/unstract_platform/mcp/unstract_platform_mcp_server/))

### Artificial Intelligence & ML

- [Extraction Endpoints](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-integration-apis/rest-endpoints/extraction-endpoints.md) — Enables converting defined extraction workflows into RESTful endpoints that process documents and return structured JSON. ([source](https://docs.unstract.com/unstract/unstract_platform/api_deployment/unstract_api_deployment_intro/))
- [Human-in-the-Loop Systems](https://awesome-repositories.com/f/artificial-intelligence-ml/human-in-the-loop-systems.md) — Provides a user interface for manual verification of extracted data with coordinate-based document highlighting.
- [Cross-Model Consistency Checks](https://awesome-repositories.com/f/artificial-intelligence-ml/model-output-verifications/cross-model-consistency-checks.md) — Validates extraction accuracy by comparing the outputs of two separate language model passes.
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Uses LLMs to analyze inconsistent document layouts and parse structural information into structured JSON. ([source](https://docs.unstract.com/unstract/unstract_platform/quick_start))
- [Extraction Field Specifications](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-schema-generation/extraction-field-specifications.md) — Allows specifying required fields and formatting for converting unstructured documents into structured data using natural language prompts. ([source](https://cdn.jsdelivr.net/gh/zipstack/unstract@main/README.md))
- [Model Context Protocol](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/model-integration-serving/model-integration-interfaces/model-context-protocol.md) — Integrates extraction capabilities with external AI agents through the standardized Model Context Protocol.
- [Model Context Protocol Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-assistant-integrations/model-context-protocol-integrations.md) — Integrates extraction capabilities with AI agents using the Model Context Protocol for structured information retrieval from documents. ([source](https://cdn.jsdelivr.net/gh/zipstack/unstract@main/README.md))
- [Model Context Protocol Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-context-protocol-servers.md) — Implements a Model Context Protocol server to connect AI agents to structured document retrieval tools.

### Content Management & Publishing

- [Document Automation Pipelines](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/document-automation-interfaces/document-automation-pipelines.md) — Implements automated pipelines to parse, transform, and manipulate document formats programmatically.

### Development Tools & Productivity

- [Document Processing APIs](https://awesome-repositories.com/f/development-tools-productivity/rest-apis/document-processing-apis.md) — Provides RESTful endpoints that accept unstructured documents and return structured data for external integration.
- [Cron Trigger Management](https://awesome-repositories.com/f/development-tools-productivity/cron-scheduling/cron-trigger-management.md) — Automates recurring ETL tasks by using a cron-based scheduling mechanism to trigger data extraction workflows.

### Part of an Awesome List

- [Extraction Verification Tools](https://awesome-repositories.com/f/awesome-lists/media/pdf/extraction-verification-tools.md) — Provides visual utilities with coordinate-based highlighting to verify the accuracy of extracted document data. ([source](https://cdn.jsdelivr.net/gh/zipstack/unstract@main/README.md))

### DevOps & Infrastructure

- [Workflow Execution Triggers](https://awesome-repositories.com/f/devops-infrastructure/pipeline-automation/workflow-execution-triggers.md) — Provides capabilities to trigger data workflows manually, via REST API, or on a recurring basis using cron triggers. ([source](https://docs.unstract.com/unstract/unstract_platform/etl_pipeline/unstract_etl_pipeline_intro/))

### Software Engineering & Architecture

- [Batch Document Processing](https://awesome-repositories.com/f/software-engineering-architecture/batch-document-processing.md) — Allows submitting multiple files in a single API call for independent, high-throughput automated data extraction. ([source](https://docs.unstract.com/unstract/unstract_platform/api_deployment/unstract_api_deployment_intro/))

### Web Development

- [Long-Running Operation Polling](https://awesome-repositories.com/f/web-development/rest-apis/api-response-validation/long-running-task-endpoints/long-running-operation-polling.md) — Utilizes a polling architecture to handle long-running document extractions in the background via status endpoints.
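
The polling pattern mentioned above can be sketched generically. This is a minimal illustration, not Unstract's client: the `poll_until_done` function, the `fetch_status` callable, and the status values `"COMPLETED"` and `"ERROR"` are assumptions chosen for the example, and the actual status endpoint and its response fields should be taken from the Unstract API deployment documentation.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], Any] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until the job finishes, fails, or times out.

    fetch_status is a caller-supplied function that queries a status endpoint
    and returns its decoded JSON body. The status strings below are
    hypothetical placeholders.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "COMPLETED":
            return status
        if state == "ERROR":
            raise RuntimeError(f"extraction failed: {status}")
        if clock() >= deadline:
            raise TimeoutError(f"no result after {timeout_s} seconds")
        sleep(interval_s)  # wait before asking again
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays; in normal use the defaults apply and only `fetch_status` needs to be supplied.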
