# opendatalab/mineru

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/opendatalab-mineru).**

67,734 stars · 5,705 forks · Python · License: NOASSERTION (no SPDX identifier matched)

## Links

- GitHub: https://github.com/opendatalab/MinerU
- Homepage: https://opendatalab.github.io/MinerU/
- awesome-repositories: https://awesome-repositories.com/repository/opendatalab-mineru.md

## Topics

`ai4science` `document-analysis` `extract-data` `layout-analysis` `ocr` `parser` `pdf` `pdf-converter` `pdf-extractor-llm` `pdf-extractor-pretrain` `pdf-extractor-rag` `pdf-parser` `python`

## Description

MinerU is a document parsing pipeline that converts unstructured files such as PDFs into machine-readable, structured data. It uses deep learning models for layout analysis, identifying document regions and extracting complex content such as mathematical expressions. It then combines these model predictions with geometric heuristics to reconstruct the reading order and structural hierarchy of each document.
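
The idea of ordering detected regions with a geometric heuristic can be illustrated with a minimal sketch. This is not MinerU's algorithm: the `Block` class, the `reading_order` function, and the fixed two-column assumption are all hypothetical, chosen only to show how bounding-box coordinates can drive a reading order.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A detected layout region. Hypothetical type, not a MinerU class."""
    text: str
    x0: float
    y0: float  # top edge; y grows downward, as in image coordinates
    x1: float
    y1: float


def reading_order(blocks, page_width, n_columns=2):
    """Order blocks column by column, then top to bottom within a column.

    Assumption: the page has n_columns equal-width columns, and a block
    belongs to the column that contains its horizontal centre.
    """
    column_width = page_width / n_columns

    def sort_key(block):
        centre_x = (block.x0 + block.x1) / 2
        # Clamp so a block touching the right page edge stays in the last column.
        column = min(int(centre_x // column_width), n_columns - 1)
        return (column, block.y0, block.x0)

    return sorted(blocks, key=sort_key)


blocks = [
    Block("right column, top", 320, 50, 580, 120),
    Block("left column, bottom", 20, 200, 280, 300),
    Block("left column, top", 20, 50, 280, 180),
]
ordered = reading_order(blocks, page_width=600)
for position, block in enumerate(ordered, start=1):
    print(position, block.text)
```

A real pipeline would have to handle full-width titles, figures that span columns, and irregular layouts, which is where model predictions complement a rule like this one.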

Its multi-stage workflow integrates layout detection, optical character recognition, and formula extraction into a single pipeline. Extracted features and their spatial coordinates are serialized into a standardized format, so output stays consistent for downstream integration. For verification, the tool includes a diagnostic suite that generates visual overlays, letting users inspect segmentation boundaries and reading order directly against the original source files.
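
To make the overlay idea concrete, here is a small sketch that draws numbered bounding boxes as an SVG layer using only the standard library. It is an illustration under stated assumptions, not MinerU's diagnostic tooling: the `overlay_svg` function and the `(label, x0, y0, x1, y1)` tuple layout are hypothetical.

```python
from xml.sax.saxutils import escape


def overlay_svg(boxes, width, height):
    """Return an SVG string outlining each box with its reading-order index.

    boxes: list of (label, x0, y0, x1, y1) tuples, already in reading order.
    The tuple layout is an assumption made for this sketch.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for index, (label, x0, y0, x1, y1) in enumerate(boxes, start=1):
        # Outline the detected region.
        parts.append(
            f'<rect x="{x0}" y="{y0}" width="{x1 - x0}" height="{y1 - y0}" '
            f'fill="none" stroke="red"/>'
        )
        # Print the reading-order index and label just inside the top-left corner.
        parts.append(
            f'<text x="{x0 + 4}" y="{y0 + 14}" fill="red">'
            f"{index}: {escape(label)}</text>"
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = overlay_svg(
    [("title", 10, 10, 110, 40), ("paragraph", 10, 50, 190, 90)],
    width=200,
    height=100,
)
print(svg)
```

Placing such a layer on top of a rendered page image makes it easy to spot a region that was split, merged, or ordered incorrectly.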

MinerU organizes parsed elements into a page-based structure suited to large-scale information retrieval. It is distributed as a Python package; documentation and installation instructions are available in the repository.
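
A page-based structure of this kind can be sketched as follows. The field names (`page`, `type`, `bbox`, `text`, `page_index`, `blocks`) and the `to_page_document` helper are assumptions made for illustration and do not reproduce MinerU's actual output schema, which is described in the project's output files reference.

```python
import json
from collections import defaultdict


def to_page_document(elements):
    """Group a flat list of parsed elements into a page-based structure.

    Each element is a dict with hypothetical keys: page, type, bbox, text.
    Element order within a page is preserved.
    """
    pages = defaultdict(list)
    for element in elements:
        pages[element["page"]].append(
            {
                "type": element["type"],
                "bbox": element["bbox"],  # [x0, y0, x1, y1]
                "text": element["text"],
            }
        )
    return {
        "pages": [
            {"page_index": index, "blocks": pages[index]}
            for index in sorted(pages)
        ]
    }


elements = [
    {"page": 0, "type": "title", "bbox": [50, 40, 550, 80], "text": "Introduction"},
    {"page": 1, "type": "text", "bbox": [50, 40, 550, 300], "text": "Second page."},
    {"page": 0, "type": "formula", "bbox": [200, 120, 400, 160], "text": "E = mc^2"},
]
document = to_page_document(elements)
print(json.dumps(document, indent=2))
```

Keeping the bounding box next to each block's text lets downstream code filter or re-rank content by position without reopening the source file.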

## Tags

### Artificial Intelligence & ML

- [Deployment & Serving](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving.md) — Deploys deep learning models to classify content types and extract complex mathematical expressions from diverse visual inputs.
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Identifies document regions, tables, and text hierarchies to convert complex visual layouts into machine-readable data.
- [Visual Debugging Utilities](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/structured-document-extraction/visual-debugging-utilities.md) — Generates visual overlays that highlight detected text segments and reading order to verify parsing accuracy.

### Content Management & Publishing

- [Automated Data Extraction](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/automated-data-extraction.md) — Converts scanned or digital documents into structured data formats to enable large-scale information retrieval and analysis.
- [Layout Reconstruction Algorithms](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/layout-reconstruction-algorithms.md) — Applies geometric heuristics and spatial analysis to reassemble fragmented text blocks into a coherent reading order.

### Data & Databases

- [Structured Data Extractors](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing-frameworks/structured-data-extractors.md) — Transforms unstructured document content into standardized, machine-readable formats for automated information retrieval.
- [Document Processing Pipelines](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/document-llm-preparation/document-processing-pipelines.md) — Ingests unstructured files and normalizes them into structured data through a multi-stage deep learning pipeline.
- [Multi-Stage Pipeline Processing](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/document-llm-preparation/multi-stage-pipeline-processing.md) — Orchestrates sequential document analysis tasks including layout detection, optical character recognition, and formula extraction.
- [Document Schema Normalizers](https://awesome-repositories.com/f/data-databases/data-governance-modeling/data-standardization/document-schema-normalizers.md) — Organizes parsed document elements into a unified, page-based format to ensure consistent data structures for downstream applications. ([source](https://opendatalab.github.io/MinerU/reference/output_files/))
- [JSON-Schema](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-serialization/json-schema.md) — Encodes extracted document features and spatial coordinates into a standardized schema for seamless interoperability.
- [Structured Data Exporters](https://awesome-repositories.com/f/data-databases/data-serialization-formats/structured-data-exporters.md) — Exports parsing results as structured JSON files to facilitate deeper data analysis through automated scripts. ([source](https://opendatalab.github.io/MinerU/reference/output_files/))

### Part of an Awesome List

- [Document Parsing and Extraction](https://awesome-repositories.com/f/awesome-lists/data/document-parsing-and-extraction.md) — Converts PDF documents into structured Markdown and JSON.
