# ucbepic/docetl

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/ucbepic-docetl).**

3,597 stars · 379 forks · Python · MIT

## Links

- GitHub: https://github.com/ucbepic/docetl
- Homepage: https://docetl.org
- awesome-repositories: https://awesome-repositories.com/repository/ucbepic-docetl.md

## Topics

`agents` `data` `data-pipelines` `document-analysis` `document-processing` `elt` `etl` `llm` `python` `semantic-data` `unstructured-data` `unstructured-data-analysis` `workflow`

## Description

docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas.

The project distinguishes itself through a semantic data integration suite that joins datasets and resolves duplicate entities using embedding-based similarity. It includes an interactive prompt playground for developing and optimizing extraction logic, alongside tools for cost-accuracy trade-off analysis and model consistency calibration.
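The idea behind similarity-based entity resolution can be illustrated with a minimal sketch. This is not docetl's API: `embed`, `cosine`, and `resolve` are hypothetical names, and character-trigram counts stand in for a real embedding model so that the example runs without external dependencies.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character-trigram counts."""
    s = f"  {text.lower()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resolve(names: list[str], threshold: float = 0.5) -> dict[str, str]:
    """Map each name to the first earlier name it resembles (its canonical form)."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for name in names:
        # Compare against known canonical names only, not every previous name.
        match = next(
            (c for c in canonical if cosine(embed(name), embed(c)) >= threshold),
            None,
        )
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping
```

Here `resolve(["Acme Corp", "acme corp", "Globex"])` maps both spellings of the first company to `"Acme Corp"` and leaves `"Globex"` as its own entity. The threshold of 0.5 is an arbitrary illustrative value; a real system would tune it against labeled pairs.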

The system covers a broad range of data processing capabilities, including multi-stage reduction for information aggregation, recursive document clustering, and schema-constrained extraction. It supports mixed-format data loading and provides utilities for entity standardization and synthetic data generation.
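Schema-constrained extraction with retries can be sketched as follows. The schema format, the `validate` and `parse_with_retries` helpers, and the `call_model` callable are all assumptions made for illustration; they do not reproduce how docetl declares or enforces schemas.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"title": str, "year": int, "authors": list}


def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems


def parse_with_retries(call_model, prompt: str, schema: dict, max_attempts: int = 3) -> dict:
    """Call the model until its JSON output conforms to the schema."""
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Malformed JSON counts as a failed attempt.
        if isinstance(record, dict) and not validate(record, schema):
            return record
    raise ValueError(f"no schema-conforming output after {max_attempts} attempts")
```

Passing the model as a plain callable keeps the sketch testable: any function that takes a prompt string and returns text can be substituted, including a canned fake.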

The tool is implemented in Python, and its pipelines can run deterministic Python code alongside model calls for custom computational processing.
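As a rough illustration of how deterministic code can sit alongside model-driven steps in a map-reduce flow, consider the sketch below. It assumes nothing about docetl's actual operators: `extract` is a hypothetical stand-in for an LLM-backed map step, and `run_pipeline` is an illustrative name.

```python
from collections import defaultdict


def extract(doc: dict) -> dict:
    """Map step. Stand-in for an LLM call: treats the first word as the topic."""
    text = doc["text"].strip()
    return {"topic": text.split()[0].lower(), "length": len(text)}


def run_pipeline(docs: list[dict]) -> dict[str, dict]:
    # Filter: deterministic code, no model involved. Drops blank documents.
    docs = [d for d in docs if d["text"].strip()]
    # Map: one structured row per document.
    rows = [extract(d) for d in docs]
    # Reduce: group rows by topic and aggregate each group.
    groups: dict[str, list[int]] = defaultdict(list)
    for row in rows:
        groups[row["topic"]].append(row["length"])
    return {
        topic: {"count": len(lengths), "total_length": sum(lengths)}
        for topic, lengths in groups.items()
    }
```

Running the filter before the map step means blank documents never reach the (notionally expensive) model call, which is the kind of saving that motivates mixing deterministic code into such pipelines.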

## Tags

### Data & Databases

- [Document and Unstructured Extraction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction.md) — Transforms large collections of unstructured documents into structured, queryable tables using language models. ([source](https://ucbepic.github.io/docetl/tutorial/))
- [Parallel Map-Reduce Tools](https://awesome-repositories.com/f/data-databases/parallel-data-transformation/parallel-data-reducers/parallel-map-reduce-tools.md) — Coordinates data processing through parallel map, reduce, and filter operations to transform unstructured text into structured tables. ([source](https://ucbepic.github.io/docetl))
- [Structured Data Extraction](https://awesome-repositories.com/f/data-databases/structured-data-extraction.md) — Transforms large collections of unstructured documents into queryable tables using schema-constrained LLM extraction. ([source](https://docetl.org/))
- [Schema Definition](https://awesome-repositories.com/f/data-databases/data-governance-modeling/data-modeling-schemas/data-schemas/schema-definition.md) — Allows specifying the structure of extracted data using basic types and nested objects to ensure consistent formats. ([source](https://docetl.org/llms.txt#docetl-system-description-and-llm-instructions-short))
- [LLM Schema Outputs](https://awesome-repositories.com/f/data-databases/data-governance-modeling/data-modeling-schemas/data-schemas/schema-validated-data-structures/schema-enforced-output-parsers/llm-schema-outputs.md) — Uses predefined structural types and nested objects to constrain language model outputs into consistent formats.
- [Declarative Workflow Definitions](https://awesome-repositories.com/f/data-databases/data-pipeline-configurations/declarative-workflow-definitions.md) — Provides a way to define the end-to-end flow and transformation logic of document processing pipelines via configuration files.
- [LLM-Integrated Extraction Pipelines](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration/data-engineering-pipelines/llm-integrated-extraction-pipelines.md) — Orchestrates workflows that chain document ingestion and layout analysis with model-based structured data generation.
- [ETL Workflows](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration/etl-workflows.md) — Implements automated ETL workflows to extract, clean, and transform data from documents into predefined schemas.
- [Declarative Pipeline Construction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing-frameworks/declarative-pipeline-construction.md) — Implements a declarative interface for defining complex data operations and workflows to transform unstructured datasets into tables. ([source](https://docetl.org/llms.txt#docetl-system-description-and-llm-instructions-short))
- [Iterative Information Aggregation](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/document-llm-preparation/multi-stage-pipeline-processing/iterative-information-aggregation.md) — Condenses information from multiple documents into structured summaries through iterative aggregation steps.
- [Semantic Similarity Joins](https://awesome-repositories.com/f/data-databases/join-operations/semantic-similarity-joins.md) — Merges disparate datasets by calculating embedding-based similarity scores when exact primary keys are unavailable.
- [LLM Output Constraints](https://awesome-repositories.com/f/data-databases/custom-data-fields/custom-field-validation/llm-output-constraints.md) — Enforces specific requirements and logic-based constraints on data extracted from large language models. ([source](https://docetl.org/llms-full.txt))
- [Dataset Loading](https://awesome-repositories.com/f/data-databases/dataset-loading.md) — Imports data from JSON or CSV files and maps fields to variables for use in processing operations. ([source](https://docetl.org/llms.txt#docetl-system-description-and-llm-instructions-short))
- [Entity Resolution](https://awesome-repositories.com/f/data-databases/entity-resolution.md) — Resolves variations of the same entity into a single canonical name using embedding-based blocking and comparison. ([source](https://ucbepic.github.io/docetl/tutorial/))
- [Multi-Format Data Loading](https://awesome-repositories.com/f/data-databases/tabular-data-frameworks/csv-data-loaders/multi-source-csv-loading/multi-format-data-loading.md) — Imports data from standard files or custom parsing tools for non-standard formats like audio and PDFs. ([source](https://docetl.org/llms-full.txt))

### Artificial Intelligence & ML

- [Entity Clustering and Canonicalization](https://awesome-repositories.com/f/artificial-intelligence-ml/clustering-tools/entity-clustering-and-canonicalization.md) — Canonicalizes duplicate entities across multiple documents to ensure data consistency through clustering and synthesis. ([source](https://docetl.org/llms-full.txt))
- [LLM-Based Data Transformations](https://awesome-repositories.com/f/artificial-intelligence-ml/llm-based-data-transformations.md) — Applies prompt-based transformations to input items with automatic retries and strict schema validation. ([source](https://ucbepic.github.io/docetl/operators/map/))
- [LLM Orchestration](https://awesome-repositories.com/f/artificial-intelligence-ml/llm-orchestration.md) — Designs and executes complex map-reduce workflows that integrate language model transformations with deterministic Python code.
- [Agent Tool Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/integration-deployment/agent-frameworks/tool-use-and-execution/agent-tool-execution.md) — Allows the execution of external tools or functions during transformations to perform real-time lookups and actions. ([source](https://ucbepic.github.io/docetl/operators/map/))
- [Automated Data Validation](https://awesome-repositories.com/f/artificial-intelligence-ml/automated-data-validation.md) — Evaluates the accuracy and operational cost of extraction pipelines through iterative refinement and schema validation.
- [Recursive Document Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/clustering-algorithms/hierarchical-clustering/recursive-document-synthesis.md) — Groups unstructured documents into hierarchical clusters to generate synthesized summaries across varying granularities.
- [Document-Based](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-synthesis/document-based.md) — Creates high-quality datasets by aggregating, clustering, and resolving entities across collections of unstructured files.
- [Document Clustering Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/document-clustering-frameworks.md) — Groups unstructured documents into hierarchical clusters based on semantic similarity to generate synthesized summaries. ([source](https://docetl.org/llms-full.txt))
- [Document Summarization](https://awesome-repositories.com/f/artificial-intelligence-ml/document-summarization.md) — Condenses key information from multiple documents into structured summaries using a reduction process. ([source](https://docetl.org/llms-full.txt))
- [Prompt Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/profiling-and-benchmarking/model-performance-optimization/prompt-optimizers.md) — Refines model prompts and examples to improve the accuracy and reliability of the extraction pipeline. ([source](https://cdn.jsdelivr.net/gh/ucbepic/docetl@main/README.md))
- [Multi-Document Syntheses](https://awesome-repositories.com/f/artificial-intelligence-ml/multi-document-syntheses.md) — Aggregates and synthesizes information across multiple documents grouped by a common key into unified outputs. ([source](https://ucbepic.github.io/docetl/tutorial/))
- [Natural Language Entity Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-entity-extraction.md) — Uses natural language instructions and map-reduce operators to extract structured metrics and entities from unstructured text. ([source](https://cdn.jsdelivr.net/gh/ucbepic/docetl@main/README.md))
- [Workflow Performance Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/workflow-performance-optimizations.md) — Optimizes workflow efficiency by swapping models, rewriting prompts, and replacing expensive LLM tasks with deterministic code. ([source](https://ucbepic.github.io/docetl))

### User Interface & Experience

- [PDF and HTML Content Extraction](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/pdf-and-html-content-extraction.md) — Provides capabilities to extract text and metadata from PDF documents into standardized objects for further processing.

### Content Management & Publishing

- [LLM-Powered Document Transformations](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/llm-powered-document-transformations.md) — Processes individual documents with language models to perform data extraction, content filtering, and semantic ranking. ([source](https://docetl.org/llms-full.txt))
- [PDF Text Extraction](https://awesome-repositories.com/f/content-management-publishing/pdf-text-extraction.md) — Provides the capability to download and pull text from publicly accessible PDF URLs for processing. ([source](https://ucbepic.github.io/docetl/operators/map/))

### Development Tools & Productivity

- [Prompt Playgrounds](https://awesome-repositories.com/f/development-tools-productivity/human-in-the-loop-interfaces/interactive-prompts/prompt-playgrounds.md) — Ships an interactive environment for refining prompts and testing model parameters in real-time.
- [Extraction Pipeline Prototyping](https://awesome-repositories.com/f/development-tools-productivity/interactive-prototyping/code-prototyping/document-logic-prototyping/extraction-pipeline-prototyping.md) — Provides a real-time interface for developing and refining document processing workflows to test transformations instantly. ([source](https://ucbepic.github.io/docetl/playground/))
- [Natural Language Pipeline Generators](https://awesome-repositories.com/f/development-tools-productivity/natural-language-pipeline-generators.md) — Provides a natural language interface to generate data processing workflows for extracting information from unstructured files. ([source](https://ucbepic.github.io/docetl/quickstart-claude-code/))

### Software Engineering & Architecture

- [Deterministic Pipeline Step Execution](https://awesome-repositories.com/f/software-engineering-architecture/deterministic-pipeline-step-execution.md) — Supports the execution of custom Python scripts within pipelines for computational processing or external library integration. ([source](https://docetl.org/llms-full.txt))

### Testing & Quality Assurance

- [Pipeline Accuracy Optimizers](https://awesome-repositories.com/f/testing-quality-assurance/pipeline-accuracy-optimizers.md) — Includes an automated process to determine the best operational settings and thresholds to improve data extraction accuracy. ([source](https://ucbepic.github.io/docetl/tutorial/))

### Web Development

- [Iterative Refinement Loops](https://awesome-repositories.com/f/web-development/client-side-input-validators/schema-based-response-validation/ai-output-validation/iterative-refinement-loops.md) — Implements deterministic checks and refinement loops to correct errors and ensure LLM outputs comply with defined schemas.

### Part of an Awesome List

- [Data Ingestion and Parsing](https://awesome-repositories.com/f/awesome-lists/data/data-ingestion-and-parsing.md) — System for agentic data processing and ETL workflows.