MinerU

MinerU is a document parsing pipeline that transforms unstructured files into machine-readable, structured data. It uses deep learning models for layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these model predictions with geometric heuristics, the system reconstructs the reading order and structural hierarchy of each document.
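
To make the idea of a geometric reading-order heuristic concrete, here is a minimal sketch in plain Python. It is not MinerU's algorithm: the `Block` class, the `reading_order` function, and the `column_gap` threshold are hypothetical names chosen for illustration, and the heuristic simply assumes that blocks with similar left edges belong to the same column.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A detected text region (hypothetical type); origin is the top-left corner."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks: list[Block], column_gap: float = 20.0) -> list[Block]:
    """Order blocks column by column, then top to bottom within each column.

    Assumed heuristic: a block joins the current column when its left edge
    lies within `column_gap` of the left edge of that column's first block.
    """
    columns: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.x0):
        if columns and abs(block.x0 - columns[-1][0].x0) <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered: list[Block] = []
    for column in columns:
        # Within a column, read from the top of the page downwards.
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


# Blocks arrive in arbitrary detection order on a two-column page.
page = [
    Block("right column, top", 320, 50, 580, 120),
    Block("left column, bottom", 40, 140, 300, 220),
    Block("left column, top", 40, 50, 300, 120),
    Block("right column, bottom", 322, 140, 580, 220),
]
ordered_texts = [b.text for b in reading_order(page)]
print(ordered_texts)
```

A real system would also need to handle full-width headings, figures, and rotated text, which this sketch deliberately ignores.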

The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recognition, and formula extraction into a unified pipeline. It serializes all extracted features and spatial coordinates into a standardized format, ensuring that output remains consistent for downstream integration. To support verification, the tool includes a diagnostic suite that generates visual overlays, allowing users to inspect segmentation boundaries and reading order directly against the original source files.
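
The serialization step described above could look roughly like the following sketch, which groups extracted blocks by page and writes them out as JSON together with their coordinates. The field names (`page`, `page_idx`, `type`, `bbox`, `content`) and the `to_page_records` helper are assumptions made for this example, not MinerU's documented schema.

```python
import json


def to_page_records(blocks):
    """Group flat block dicts into one record per page, sorted by page index."""
    pages = {}
    for block in blocks:
        page = pages.setdefault(
            block["page"], {"page_idx": block["page"], "blocks": []}
        )
        page["blocks"].append(
            {
                "type": block["type"],
                "bbox": block["bbox"],  # assumed layout: [x0, y0, x1, y1]
                "content": block["content"],
            }
        )
    return [pages[idx] for idx in sorted(pages)]


# Hypothetical extraction results, listed in no particular order.
blocks = [
    {"page": 1, "type": "text", "bbox": [40, 50, 300, 120], "content": "Second page."},
    {"page": 0, "type": "title", "bbox": [40, 30, 560, 60], "content": "A Title"},
    {"page": 0, "type": "formula", "bbox": [100, 200, 400, 240], "content": "E = mc^2"},
]
document = to_page_records(blocks)
serialized = json.dumps(document, ensure_ascii=False, indent=2)
print(serialized)
```

Keeping the bounding box next to each piece of content is what lets a later diagnostic step draw overlays against the original page.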

The software provides a comprehensive framework for automated data extraction, organizing parsed elements into a page-based structure suitable for large-scale information retrieval. It is distributed as a Python-based package, with documentation and installation instructions available in the repository.

Features

Deployment & Serving - Deploys deep learning models to classify content types and extract complex mathematical expressions from diverse visual inputs.
Document Layout Analysis - Identifies document regions, tables, and text hierarchies to convert complex visual layouts into machine-readable data.
Automated Data Extraction - Converts scanned or digital documents into structured data formats to enable large-scale information retrieval and analysis.
Layout Reconstruction Algorithms - Applies geometric heuristics and spatial analysis to reassemble fragmented text blocks into a coherent reading order.
Structured Data Extractors - Transforms unstructured document content into standardized, machine-readable formats for automated information retrieval.
Document Processing Pipelines - Ingests unstructured files and normalizes them into structured data through a multi-stage deep learning pipeline.
Multi-Stage Pipeline Processing - Orchestrates sequential document analysis tasks including layout detection, optical character recognition, and formula extraction.
Data Processing - Tool for extracting high-quality content from PDFs and web pages.
Data Processing Tools - Tool for high-quality extraction from PDFs and web pages.
Document Parsing and Extraction - Converts PDF documents into structured Markdown and JSON.
AI - Listed in the “AI 项目” (AI Projects) section of the Great Open Source Project awesome list.
Document Schema Normalizers - Organizes parsed document elements into a unified, page-based format to ensure consistent data structures for downstream applications.
Visual Debugging Utilities - Generates visual overlays that highlight detected text segments and reading order to verify parsing accuracy.
JSON-Schema - Encodes extracted document features and spatial coordinates into a standardized schema for seamless interoperability.
Structured Data Exporters - Exports parsing results as structured JSON files to facilitate deeper data analysis through automated scripts.
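
The multi-stage processing named in the feature list can be sketched as a chain of functions that each enrich a shared document record. This is a structural illustration only: the stage functions are placeholders that return fixed values instead of running models, and every name here (`detect_layout`, `recognize_text`, `extract_formulas`, `run_pipeline`) is hypothetical rather than part of MinerU's API.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def detect_layout(doc: dict) -> dict:
    # Placeholder: a real stage would run a layout detection model here.
    regions = [
        {"type": "text", "bbox": [40, 50, 300, 120]},
        {"type": "formula", "bbox": [100, 200, 400, 240]},
    ]
    return {**doc, "regions": regions}


def recognize_text(doc: dict) -> dict:
    # Placeholder: a real stage would run OCR on each text region.
    regions = [
        {**r, "content": "placeholder text"} if r["type"] == "text" else r
        for r in doc["regions"]
    ]
    return {**doc, "regions": regions}


def extract_formulas(doc: dict) -> dict:
    # Placeholder: a real stage would run a formula recognition model.
    regions = [
        {**r, "content": "x^2"} if r["type"] == "formula" else r
        for r in doc["regions"]
    ]
    return {**doc, "regions": regions}


def run_pipeline(doc: dict, stages: list[Stage]) -> dict:
    """Apply each stage in order and record which stages have run."""
    for stage in stages:
        doc = stage(doc)
        doc = {**doc, "stages_run": doc.get("stages_run", []) + [stage.__name__]}
    return doc


result = run_pipeline(
    {"source": "example.pdf"},
    [detect_layout, recognize_text, extract_formulas],
)
print(result["stages_run"])
print(result["regions"])
```

Because each stage takes and returns the same record shape, stages can be reordered, skipped, or tested in isolation.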

Name: opendatalab/mineru
Author: opendatalab

67,734 stars · 5,705 forks · Python · 16 views · opendatalab.github.io/MinerU

Open-source alternatives to MinerU

Similar open-source projects, ranked by how many features they share with MinerU.

docling-project/docling
61,674 stars
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t…
Python · ai · convert · document-parser
bytedance/Dolphin
8,820 stars
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers…
Python · document-analysis · layout-analysis · ocr
DS4SD/docling
62,172 stars
Docling is a multimodal content converter and document parser designed to transform PDFs, Office files, and HTML into structured Markdown or JSON for generative AI applications. It functions as an OCR document processor and a PDF layout analyzer that extracts tables, charts, and hierarchical structures while preserving the original page layout. The system operates as a local-first inference engine, allowing for the processing of sensitive data in air-gapped environments without external network connectivity. It can also be deployed as an API or a Model Context Protocol server to provide parsi…
Python

Frequently asked questions

What does opendatalab/mineru do?

What are the main features of opendatalab/mineru?

The main features of opendatalab/mineru are: Deployment & Serving, Document Layout Analysis, Automated Data Extraction, Layout Reconstruction Algorithms, Structured Data Extractors, Document Processing Pipelines, Multi-Stage Pipeline Processing, Data Processing.

What are some open-source alternatives to opendatalab/mineru?

Open-source alternatives to opendatalab/mineru include:
docling-project/docling - Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It…
bytedance/dolphin - Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital…
opendatalab/pdf-extract-kit - PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as…
ds4sd/docling - Docling is a multimodal content converter and document parser designed to transform PDFs, Office files, and HTML into…
funstory-ai/babeldoc - BabelDOC is a technical document translation system designed to translate PDF files while preserving their original…
opendcai/dataflow - DataFlow is an agent-based workflow orchestrator and data pipeline designed to synthesize, clean, and augment…
