Docetl | Awesome Repository

DocETL is an AI-powered document ETL tool and map-reduce orchestrator that uses language models to transform large collections of unstructured documents into structured, queryable tables. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas.

The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizing extraction logic, alongside tools for cost-accuracy trade-off analysis and model consistency calibration.
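As a rough illustration of what embedding-based duplicate resolution involves, the sketch below groups entity names whose vectors exceed a cosine-similarity threshold. It is not the project's API: the `resolve_entities` function, the 0.9 threshold, and the hand-written two-dimensional vectors are all assumptions made for the example, and a real pipeline would obtain its vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def resolve_entities(names, vectors, threshold=0.9):
    """Group names whose vectors are at least `threshold` similar.

    Hypothetical helper: uses union-find so that similarity is
    treated transitively (A~B and B~C puts A, B, C in one group).
    """
    parent = list(range(len(names)))

    def find(i):
        # Walk up to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; merge the two groups when similar enough.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


# Toy vectors standing in for real embeddings (assumed values).
names = ["Acme Corp", "ACME Corporation", "Globex"]
vectors = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
clusters = resolve_entities(names, vectors)
print(clusters)
```

The pairwise comparison is quadratic in the number of entities, so this form only suits small inputs; larger datasets would typically need some form of blocking or approximate nearest-neighbor search before comparison.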

The system covers a broad range of data processing capabilities, including multi-stage reduction for information aggregation, recursive document clustering, and schema-constrained extraction. It supports mixed-format data loading and provides utilities for entity standardization and synthetic data generation.
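To make "schema-constrained extraction" concrete, here is a minimal sketch of checking a model's JSON output against a predefined schema before accepting it. The `SCHEMA` mapping, the field names, and the `validate_record` helper are hypothetical and use only the standard library; they do not reflect how the tool itself declares or enforces schemas.

```python
import json

# Hypothetical target schema: field name -> expected Python type.
SCHEMA = {"title": str, "year": int}


def validate_record(raw_json, schema=SCHEMA):
    """Parse a JSON string and check it matches the schema exactly."""
    record = json.loads(raw_json)
    if set(record) != set(schema):
        raise ValueError(f"expected fields {sorted(schema)}, got {sorted(record)}")
    for key, expected_type in schema.items():
        if not isinstance(record[key], expected_type):
            raise ValueError(f"field {key!r} is not {expected_type.__name__}")
    return record


# A well-formed response passes; a malformed one is rejected.
good = validate_record('{"title": "Annual Report", "year": 2021}')
try:
    validate_record('{"title": "Annual Report", "year": "2021"}')
    rejected = False
except ValueError:
    rejected = True
```

Rejecting a malformed record at this point lets a pipeline retry or flag the document instead of writing an inconsistent row to the output table.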

The tool is implemented in Python and supports the execution of deterministic code within its pipelines for custom computational processing.

Features

Document and Unstructured Extraction - Transforms large collections of unstructured documents into structured, queryable tables using language models.
Parallel Map-Reduce Tools - Coordinates data processing through parallel map, reduce, and filter operations to transform unstructured text into structured tables.
Structured Data Extraction - Transforms large collections of unstructured documents into queryable tables using schema-constrained LLM extraction.
Entity Clustering and Canonicalization - Canonicalizes duplicate entities across multiple documents to ensure data consistency through clustering and synthesis.
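The map, filter, and reduce pattern named in the feature list can be sketched with the Python standard library alone. This is an illustrative outline rather than the tool's own interface: `extract` is a deterministic stand-in for what would be a language-model call, and `run_pipeline`, the `author: text` input format, and the `min_words` cutoff are assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def extract(doc):
    """Map step: turn one raw document into a structured row.

    Stand-in for an LLM extraction call (assumption); here it just
    splits an "author: text" string and counts words.
    """
    author, _, body = doc.partition(":")
    return {"author": author.strip(), "words": len(body.split())}


def run_pipeline(docs, min_words=2):
    # Map: process documents in parallel; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = list(pool.map(extract, docs))

    # Filter: drop rows that are too short to be useful.
    rows = [row for row in rows if row["words"] >= min_words]

    # Reduce: aggregate word counts per author.
    totals = defaultdict(int)
    for row in rows:
        totals[row["author"]] += row["words"]
    return dict(totals)


docs = ["ana: first short note", "bo: hi", "ana: another note here"]
result = run_pipeline(docs)
print(result)
```

Threads suit this shape of work because calls to a remote model are I/O-bound, so the map step spends most of its time waiting on responses rather than computing.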
