docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas.
The project distinguishes itself through a semantic data integration suite that joins datasets and resolves duplicate entities by embedding similarity. It includes an interactive prompt playground for developing and optimizing extraction logic, alongside tools for cost-accuracy trade-off analysis and model consistency calibration.
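As a rough illustration of similarity-based entity resolution, the sketch below groups names whose vectors are close under cosine similarity. It is not docetl's API: `embed`, `cosine`, `resolve`, and the 0.5 threshold are hypothetical, and the character-trigram counter is only a stand-in for a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): character trigram counts."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def resolve(names: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Group names linked by pairwise similarity >= threshold, using union-find."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        # Follow parent links to the group root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    vectors = [embed(n) for n in names]
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups: dict[int, list[str]] = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())
```

Calling `resolve(["Acme Corp", "ACME Corp.", "Globex"])` places the two Acme spellings in one group and leaves Globex alone. The pairwise comparison is quadratic in the number of names, so a real system would likely block or index candidates before comparing them.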
The system's data processing capabilities include multi-stage reduction for information aggregation, recursive document clustering, and schema-constrained extraction. It supports mixed-format data loading and provides utilities for entity standardization and synthetic data generation.
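One way to picture multi-stage reduction is as repeated batch-wise combining until a single result remains. The sketch below makes that assumption explicit. `multi_stage_reduce` and `merge_keywords` are hypothetical names, and the keyword merge stands in for whatever model call would aggregate a batch in practice.

```python
from typing import Callable


def multi_stage_reduce(
    items: list[str],
    combine: Callable[[list[str]], str],
    batch_size: int = 3,
) -> str:
    """Combine items in batches, then combine the results, until one remains."""
    if not items:
        raise ValueError("nothing to reduce")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    level = list(items)
    while len(level) > 1:
        # Each pass shrinks the list by roughly a factor of batch_size.
        level = [
            combine(level[i:i + batch_size])
            for i in range(0, len(level), batch_size)
        ]
    return level[0]


def merge_keywords(batch: list[str]) -> str:
    """Deterministic stand-in for a model call: union of comma-separated keywords."""
    keywords = set(",".join(batch).split(","))
    return ",".join(sorted(keywords))
```

With `batch_size=2`, the inputs `["a,b", "b,c", "c,d", "e"]` are first reduced to two partial results and then to one. A single input is returned as-is, without calling `combine`. Batching keeps each combine step small, which matters when the combiner has a bounded input size.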
The tool is implemented in Python and lets pipelines run deterministic code for custom computation.
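To show what a deterministic code step inside a pipeline might look like, here is a minimal sketch in which every step is a plain Python function over a list of records. It does not reflect docetl's actual configuration format or operator names: `run_pipeline`, `add_word_count`, and `drop_short` are illustrative assumptions.

```python
from typing import Callable

Record = dict
Step = Callable[[list[Record]], list[Record]]


def add_word_count(records: list[Record]) -> list[Record]:
    """Deterministic step: annotate each record with the word count of its text."""
    return [{**r, "word_count": len(r["text"].split())} for r in records]


def drop_short(records: list[Record], min_words: int = 3) -> list[Record]:
    """Deterministic step: keep only records with at least min_words words."""
    return [r for r in records if r["word_count"] >= min_words]


def run_pipeline(records: list[Record], steps: list[Step]) -> list[Record]:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        records = step(records)
    return records
```

Running `run_pipeline(docs, [add_word_count, drop_short])` on a two-record input where one text has a single word keeps only the longer record, now carrying a `word_count` field. Steps like these give the same output for the same input, which makes them easy to test and free to run compared with model calls.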