# docling-project/docling

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/docling-project-docling).**

61,674 stars · 4,310 forks · Python · MIT

## Links

- GitHub: https://github.com/docling-project/docling
- Homepage: https://docling-project.github.io/docling
- awesome-repositories: https://awesome-repositories.com/repository/docling-project-docling.md

## Topics

`ai` `convert` `document-parser` `document-parsing` `documents` `docx` `html` `markdown` `pdf` `pdf-converter` `pdf-to-json` `pdf-to-text` `pptx` `tables` `xlsx`

## Description

Docling is a modular framework for document parsing, layout analysis, and structured data extraction. It transforms files such as PDF, DOCX, PPTX, and XLSX, along with HTML web content, into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. Because every input format is normalized into the same internal representation, downstream code can process all document types uniformly.
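To make the idea of a hierarchical model concrete, here is a minimal sketch of a document tree whose nodes carry both semantic information (kind, text) and spatial information (page, bounding box). The `Node` class, the `BBox` alias, and `to_outline` are hypothetical names chosen for illustration; they are not Docling's own classes, whose actual shape is described in the project documentation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional, Tuple

# Hypothetical types for illustration only; these are NOT Docling's classes.
BBox = Tuple[float, float, float, float]  # (left, top, right, bottom) on the page


@dataclass
class Node:
    """One element of the document tree (section, paragraph, table, picture)."""
    kind: str
    text: str = ""
    page: Optional[int] = None
    bbox: Optional[BBox] = None
    children: list[Node] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[tuple[int, Node]]:
        """Yield (depth, node) pairs in reading order (pre-order traversal)."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def to_outline(root: Node) -> str:
    """Render the tree as an indented outline, one line per node."""
    return "\n".join(
        f"{'  ' * depth}{node.kind}: {node.text}".rstrip()
        for depth, node in root.walk()
    )


# A tiny example document: one section holding a paragraph and a table.
doc = Node("document", children=[
    Node("section", "Introduction", children=[
        Node("paragraph", "Docling parses documents.", page=1, bbox=(72, 90, 540, 120)),
        Node("table", "Results", page=1, bbox=(72, 140, 540, 300)),
    ]),
])
print(to_outline(doc))
```

Because every format would be normalized into the same tree, a single traversal such as `walk` can serve any input type.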

The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects and validates the extracted data against predefined templates, checking structural integrity and type safety. Its pipeline-based architecture supports pluggable processing backends, so specialized engines for optical character recognition and complex visual layout analysis can be loaded dynamically. Users control parsing behavior and extraction parameters through declarative configuration files, which simplifies integration into automated workflows and server-based architectures.
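The schema-driven idea can be sketched with the standard library alone: a typed template declares the expected fields, and raw extracted values are coerced and checked against it. The `Invoice` template, its field names, and the `extract` helper are assumptions made for this example and do not reflect Docling's extraction API.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class Invoice:
    """Hypothetical extraction template: field names and types are assumptions."""
    invoice_number: str
    total: float
    page_count: int


def extract(template: type, raw: Mapping[str, Any]):
    """Coerce raw extracted values to the template's field types, or raise."""
    values = {}
    for f in fields(template):
        if f.name not in raw:
            raise ValueError(f"missing field: {f.name}")
        try:
            # f.type is the annotated class (str, float, int) used as a converter.
            values[f.name] = f.type(raw[f.name])
        except (TypeError, ValueError) as exc:
            raise ValueError(
                f"field {f.name!r} is not a valid {f.type.__name__}"
            ) from exc
    return template(**values)
```

A call such as `extract(Invoice, {"invoice_number": "INV-7", "total": "129.50", "page_count": "3"})` returns a typed `Invoice`, while a value that cannot be converted raises a `ValueError` naming the offending field.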

The library provides both a programmatic interface and a command-line toolkit for automated document processing and format conversion. Optional dependency groups keep the installation modular: specific features, such as media rendering or advanced processing capabilities, are installed only when an application needs them.
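One common way to implement optional features in Python is to check for a dependency at the point of use and fail with an actionable message if it is absent. The sketch below illustrates that general pattern; the helper names and the extra name used in the example are hypothetical, and it is not taken from Docling's source.

```python
from importlib.util import find_spec


def available(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


def require(module_name: str, extra: str) -> None:
    """Raise a helpful error when an optional feature's dependency is missing."""
    if not available(module_name):
        raise ImportError(
            f"{module_name!r} is not installed; "
            f"install the {extra!r} extra to enable this feature"
        )
```

A feature entry point would call `require(...)` before importing the heavy dependency, so a base installation stays small and still reports clearly what is missing.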

## Tags

### Content Management & Publishing

- [Document Layout Analyzers](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/document-layout-analyzers.md) — Maps spatial relationships between text, tables, and images by applying computer vision and advanced text processing techniques to document layouts.
- [Hierarchical Document Models](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/hierarchical-document-models.md) — Organizes document content into a hierarchical tree structure that preserves the semantic and spatial relationships between individual elements. ([source](https://docling-project.github.io/docling/concepts/docling_document/))
- [Conversion Engines](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-conversion/conversion-engines.md) — Transforms diverse file formats and web content into structured models using both programmatic and command-line interfaces. ([source](https://docling-project.github.io/docling/usage/))

### Data & Databases

- [Structured](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/data-extraction/structured.md) — Extracts information from unstructured sources by applying schemas to identify and organize content into clean, typed data formats. ([source](https://docling-project.github.io/docling/examples/extraction/))
- [Document and LLM Preparation](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/document-llm-preparation.md) — Converts diverse file types and web content into unified, machine-readable formats specifically optimized for downstream model training and analysis.
- [Schema-Driven Extractors](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/data-extraction/schema-driven-extractors.md) — Maps document regions to strongly-typed objects by validating content against predefined structural templates.
- [Document Processing Pipelines](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/document-llm-preparation/document-processing-pipelines.md) — Ingests and parses unstructured files into a unified, hierarchical data model to facilitate standardized downstream processing.
- [Schema-Based](https://awesome-repositories.com/f/data-databases/data-governance-modeling/data-management-governance/data-integrity-validation/data-validation/schema-based.md) — Validates extracted document data against defined schemas to ensure structural integrity and type safety. ([source](https://docling-project.github.io/docling/examples/extraction/))
- [Intermediate Representations](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing-frameworks/intermediate-representations.md) — Normalizes diverse input formats into a consistent internal data model to enable uniform processing across different sources.
- [Structured Data Extractors](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing-frameworks/structured-data-extractors.md) — Identifies and transforms complex document layouts into standardized, machine-readable information.
- [Document Intelligence Pipelines](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/processing-pipelines/document-intelligence-pipelines.md) — Automates the ingestion, parsing, and structuring of unstructured files through a modular pipeline for downstream data analysis.
- [Extraction Configurations](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction/extraction-configurations.md) — Defines specific input types and file formats to ensure that documents are processed according to custom requirements. ([source](https://docling-project.github.io/docling/examples/extraction/))

### Artificial Intelligence & ML

- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Parses hierarchical document structures to identify and relate text, tables, and images for intelligent content analysis.

### Development Tools & Productivity

- [Document Conversion Toolkits](https://awesome-repositories.com/f/development-tools-productivity/project-scaffolding-config-code-generation/document-conversion-toolkits.md) — Ships programmatic utilities to convert diverse file formats into standardized outputs for automated data processing pipelines.
- [Automated Document Processing](https://awesome-repositories.com/f/development-tools-productivity/build-tooling/build-orchestration-logic/build-orchestration-configuration/build-automation-systems/automation/automated-document-processing.md) — Integrates document parsing capabilities into software pipelines to enable autonomous data handling within larger application workflows.
- [Automated Workflow Integration](https://awesome-repositories.com/f/development-tools-productivity/build-tooling/build-orchestration-logic/build-orchestration-configuration/build-automation-systems/workflow-orchestration/automated-workflow-integration.md) — Enables integration with automated agents and server-based architectures, allowing document processing tasks to be embedded directly into complex application workflows. ([source](https://docling-project.github.io/docling/usage/mcp/))

### Part of an Awesome List

- [AI Frameworks](https://awesome-repositories.com/f/awesome-lists/ai/ai-frameworks.md) — Library for parsing and ingesting diverse document formats for retrieval.
- [Data Preprocessing](https://awesome-repositories.com/f/awesome-lists/data/data-preprocessing.md) — Unified document parsing tool for complex layouts and multi-format support.
- [Document and File Processing](https://awesome-repositories.com/f/awesome-lists/data/document-and-file-processing.md) — Converts diverse document formats into structured data.
- [Document Parsing and Extraction](https://awesome-repositories.com/f/awesome-lists/data/document-parsing-and-extraction.md) — Prepares diverse document types for generative AI workflows.

### Software Engineering & Architecture

- [Processing Backends](https://awesome-repositories.com/f/software-engineering-architecture/integration-extensibility/extensibility/plugin-architectures/domain-specific/processing-backends.md) — Employs a modular architecture to dynamically load specialized engines for optical character recognition and complex visual layout analysis.
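
A pluggable backend architecture of this kind is often built around a registry that maps names to interchangeable engines. The following is a minimal sketch under that assumption; the registry, the decorator, and the two stand-in "OCR" functions are hypothetical and only demonstrate the lookup mechanism, not Docling's actual plugin system.

```python
from typing import Callable, Dict

# Hypothetical registry; backend names and signatures are assumptions.
OcrBackend = Callable[[bytes], str]
_BACKENDS: Dict[str, OcrBackend] = {}


def register(name: str) -> Callable[[OcrBackend], OcrBackend]:
    """Decorator that adds a backend to the registry under the given name."""
    def decorator(fn: OcrBackend) -> OcrBackend:
        _BACKENDS[name] = fn
        return fn
    return decorator


@register("dummy-upper")
def upper_backend(image: bytes) -> str:
    # Stand-in for a real OCR engine: decodes bytes and upper-cases them.
    return image.decode("utf-8").upper()


@register("dummy-reverse")
def reverse_backend(image: bytes) -> str:
    # A second stand-in, to show that backends are interchangeable.
    return image.decode("utf-8")[::-1]


def run_ocr(image: bytes, backend: str) -> str:
    """Look up the requested backend at call time and run it."""
    try:
        engine = _BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; choose from {sorted(_BACKENDS)}"
        ) from None
    return engine(image)
```

Because callers refer to a backend only by name, a new engine can be added without changing the pipeline code that calls `run_ocr`.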

### DevOps & Infrastructure

- [Declarative Configuration Schemas](https://awesome-repositories.com/f/devops-infrastructure/configuration-management/declarative-configuration-frameworks/declarative-configuration-schemas.md) — Allows users to define extraction parameters and processing rules through external configuration files to control document parsing behavior.
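
As an illustration of declarative configuration, the sketch below loads parsing options from a JSON document into a typed settings object and rejects keys it does not recognize. The `ParseConfig` class and its option names are invented for this example; Docling's real configuration keys and file format should be taken from its documentation.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ParseConfig:
    """Hypothetical options; these key names are not Docling's configuration schema."""
    ocr: bool = False
    extract_tables: bool = True
    max_pages: int = 100


def load_config(text: str) -> ParseConfig:
    """Parse a JSON document into ParseConfig, rejecting unknown keys."""
    data = json.loads(text)
    known = {f.name for f in fields(ParseConfig)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    # Keys that are absent fall back to the dataclass defaults.
    return ParseConfig(**data)
```

Rejecting unknown keys up front means a typo in a configuration file surfaces as an error instead of being silently ignored.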
