# opendatalab/pdf-extract-kit

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/opendatalab-pdf-extract-kit).**

9,724 stars · 733 forks · Python · AGPL-3.0

## Links

- GitHub: https://github.com/opendatalab/PDF-Extract-Kit
- Homepage: https://pdf-extract-kit.readthedocs.io/zh-cn/latest/index.html
- awesome-repositories: https://awesome-repositories.com/repository/opendatalab-pdf-extract-kit.md

## Description

PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as Markdown, HTML, and LaTeX. It functions as a multi-stage parsing framework that combines a document layout analyzer, a formula recognition engine, an OCR text extractor, and a table extraction system.

The project focuses on recovering complex document elements by translating images of mathematical formulas and tables into editable markup: LaTeX for formulas, and LaTeX, HTML, or Markdown for tables. It uses model-driven layout analysis to identify structural elements in reports and textbooks while remaining robust to noise such as watermarks and blur.

Custom parsing pipelines can be composed through configuration files, and the toolkit provides tools for benchmarking extraction models against datasets. It also offers optical character recognition that returns text together with its spatial coordinates, as well as vision-to-LaTeX translation for mathematical notation.
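To illustrate what configuration-driven pipeline composition can look like, here is a minimal Python sketch. It is not PDF-Extract-Kit's actual API: the `Page` class, the three stage functions, the `STAGES` registry, and the `stages` config key are all hypothetical stand-ins, and each stage returns placeholder values where a real model would run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    """Hypothetical per-page state handed from one stage to the next."""
    image_id: str
    regions: List[dict] = field(default_factory=list)


Stage = Callable[[Page], Page]


def detect_layout(page: Page) -> Page:
    # Stand-in for a layout model: pretend one text block and one formula were found.
    page.regions = [
        {"type": "text", "bbox": (0, 0, 100, 20)},
        {"type": "formula", "bbox": (0, 30, 100, 50)},
    ]
    return page


def recognize_formulas(page: Page) -> Page:
    # Stand-in for formula recognition: attach placeholder LaTeX to formula regions.
    for region in page.regions:
        if region["type"] == "formula":
            region["latex"] = "E = mc^2"
    return page


def run_ocr(page: Page) -> Page:
    # Stand-in for OCR: attach placeholder text to text regions.
    for region in page.regions:
        if region["type"] == "text":
            region["text"] = "placeholder text"
    return page


# Hypothetical registry mapping config names to stage callables.
STAGES: Dict[str, Stage] = {
    "layout_detection": detect_layout,
    "formula_recognition": recognize_formulas,
    "ocr": run_ocr,
}


def build_pipeline(config: dict) -> Stage:
    """Resolve the stage names listed in the config and chain them in order."""
    stages = [STAGES[name] for name in config["stages"]]

    def run(page: Page) -> Page:
        for stage in stages:
            page = stage(page)
        return page

    return run


# In practice this dict would be loaded from a configuration file.
config = {"stages": ["layout_detection", "formula_recognition", "ocr"]}
pipeline = build_pipeline(config)
result = pipeline(Page(image_id="page-1"))
print(result.regions)
```

Keeping the stage order in the configuration rather than in code means a stage can be dropped or reordered without touching the stage implementations, which is the general appeal of this style of composition.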

## Tags

### Content Management & Publishing

- [PDF Format Converters](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/format-conversion-toolkits/pdf-format-converters.md) — Converts PDF documents into structured Markdown, HTML, and LaTeX formats while preserving layout and content quality.
- [PDF to Markdown Converters](https://awesome-repositories.com/f/content-management-publishing/pdf-to-html-converters/pdf-to-markdown-converters.md) — Transforms PDF documents into structured Markdown format while preserving content quality and original layout. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Document Layout Analyzers](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/document-layout-analyzers.md) — Maps spatial relationships and structural elements within PDFs using layout detection, formula recognition, and OCR. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))

### Artificial Intelligence & ML

- [Document Layout](https://awesome-repositories.com/f/artificial-intelligence-ml/model-predictions/prediction-engines/document-layout.md) — Identifies structural elements in PDF reports and textbooks while ignoring noise like watermarks or blurring. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Uses deep learning models to identify structural document elements like tables and formulas within PDFs.
- [Optical Character Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition.md) — Extracts text from PDF documents through an OCR pipeline to enable digital analysis of visual content. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Extraction Model Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-evaluation-analysis/machine-learning-evaluation/model-comparison-interfaces/extraction-model-evaluation.md) — Evaluates parsing performance against comprehensive datasets to determine the most accurate extraction model for specific document types. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))

### Part of an Awesome List

- [Text Extractors](https://awesome-repositories.com/f/awesome-lists/media/pdf/text-extractors.md) — Provides an OCR pipeline to retrieve written text and precise spatial metadata from PDF layers.
- [Text Extraction and OCR](https://awesome-repositories.com/f/awesome-lists/more/text-extraction-and-ocr.md) — Extracts precise text content and spatial coordinates from PDF images and documents using optical character recognition.

### Data & Databases

- [Content Extraction](https://awesome-repositories.com/f/data-databases/content-extraction.md) — Implements a multi-stage pipeline that sequentially performs layout detection, formula recognition, and text extraction.
- [Document Parsing Pipelines](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/data-ingestion/document-parsing-pipelines.md) — Implements modular parsing pipelines that automate the extraction of data from documents for downstream translation or question answering. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Table Extraction Utilities](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/table-extraction-utilities.md) — Detects table structures in documents and extracts content into machine-readable formats like HTML or LaTeX. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Text Extraction](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction.md) — Recognizes and extracts text content and precise spatial coordinates from document images. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Output Format Rendering](https://awesome-repositories.com/f/data-databases/data-serialization-formats/data-formats/output-format-rendering.md) — Provides capabilities to render internal document representations into multiple target formats including Markdown, HTML, and LaTeX.
- [Table-to-Code Converters](https://awesome-repositories.com/f/data-databases/table-data-processing/table-to-html-converters/table-to-code-converters.md) — Transforms images of tables into structured source code using LaTeX, HTML, or Markdown formats. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))

### Scientific & Mathematical Computing

- [Formula Locators](https://awesome-repositories.com/f/scientific-mathematical-computing/formula-evaluators/symbolic-formula-parsers/formula-locators.md) — Locates mathematical formulas within multilingual documents to prepare them for subsequent recognition and extraction. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Formula Extractors](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/mathematical-typesetting-engines/mathematical-typesetting/formula-typesetters/formula-extractors.md) — Detects and recognizes mathematical notation within documents to convert complex formulas into digital text. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Formula Recognition Engines](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/mathematical-typesetting-engines/mathematical-typesetting/latex-math-rendering/formula-recognition-engines.md) — Translates images of mathematical formulas into editable source code using LaTeX formatting. ([source](https://github.com/opendatalab/pdf-extract-kit#readme))
- [Image-to-LaTeX Converters](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/mathematical-typesetting-engines/mathematical-typesetting/latex-math-rendering/image-to-latex-converters.md) — Converts images of mathematical formulas and tables into structured LaTeX code using specialized recognition models.

### Software Engineering & Architecture

- [Compositional Transformation Pipelines](https://awesome-repositories.com/f/software-engineering-architecture/compositional-transformation-pipelines.md) — Allows the construction of custom extraction workflows by chaining modular components into a sequential transformation pipeline.