30 open-source projects similar to opendatalab/mineru, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best MinerU alternative.
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers
Docling is a multimodal content converter and document parser designed to transform PDFs, Office files, and HTML into structured Markdown or JSON for generative AI applications. It functions as an OCR document processor and a PDF layout analyzer that extracts tables, charts, and hierarchical structures while preserving the original page layout. The system operates as a local-first inference engine, allowing for the processing of sensitive data in air-gapped environments without external network connectivity. It can also be deployed as an API or a Model Context Protocol server to provide parsi
BabelDOC is a technical document translation system designed to translate PDF files while preserving their original layout and styling. It functions as a layout-preserving translator that utilizes large language models to convert content into target languages, specifically tailored for scientific and technical documents. The system distinguishes itself through specialized handling of academic content, including the identification and preservation of mathematical formulas and complex layout structures. It ensures technical accuracy by employing glossary-driven terminology enforcement, using so
PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as Markdown, HTML, and LaTeX. It functions as a multi-stage parsing framework that combines a document layout analyzer, a formula recognition engine, an OCR text extractor, and a table extraction system. The project focuses on recovering complex document elements by translating images of mathematical formulas and tabular structures into editable source code. It utilizes model-driven layout analysis to identify structural elements in reports and textbooks while ignoring noise like wa
DataFlow is an agent-based workflow orchestrator and data pipeline designed to synthesize, clean, and augment large-scale datasets for training large language models. It functions as a synthetic data generator and text curation tool, utilizing an intelligent assistant to assemble modular processing operators into functional pipelines based on user requirements. The project distinguishes itself through a low-code approach, providing a web-based visual interface for designing and monitoring multi-stage execution flows. It features an operator-based registry system that allows for the integratio
This project is a PDF data extraction tool and document preprocessor designed to convert PDF files into structured formats such as Markdown, JSON, and HTML. It functions as an OCR document parser for scanned files, an accessibility automator for generating PDF/UA compliant metadata, and a loader for AI orchestration frameworks like LangChain. The software distinguishes itself through specialized handling of complex document elements, including the conversion of mathematical formulas into LaTeX and the generation of natural-language descriptions for charts and images. It utilizes recursive seg
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
Megaparse is a document parsing tool and RAG data preprocessor designed to convert PDFs, Word documents, and presentations into clean text formats. It functions as a vision-based document extractor that recovers high-fidelity information from images and complex layouts to optimize data for large language model ingestion. The system employs multimodal AI and vision models to perform schema-preserving parsing, which maintains structural hierarchies such as tables and headers. It utilizes lossless structural transformation to turn layout-heavy binary files into text sequences while preserving th
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabi
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts. The project distinguishes itself through a sophisticated document layout analysis f
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Zerox is a multimodal document parser and OCR tool that uses vision models to convert PDF files and images into structured Markdown text. It functions as a visual layout extraction engine, leveraging large multimodal models to digitize documents while maintaining their original structural formatting. The system differentiates itself through the use of coordinate-based element mapping and multimodal layout analysis to identify structural elements like tables, charts, and headers. It utilizes rasterization to convert vector PDF pages into high-resolution bitmaps, ensuring consistent input for t
Fast Multimodal Semantic Deduplication & Filtering
Sparrow is an LLM document extraction platform and vision-based inference engine designed to convert images and PDFs into validated structured data. It functions as an agentic workflow orchestrator that chains classification, extraction, and validation tasks into multi-step pipelines. The system distinguishes itself through a backend-agnostic inference layer that manages models across local GPUs, Apple Silicon, and cloud providers. It employs coordinate-based visual grounding to map extracted text to precise bounding box coordinates and utilizes hint-based model steering to guide attention an
Easy-dataset is a comprehensive platform designed for the end-to-end management of machine learning datasets, specifically tailored for language and vision model fine-tuning. It functions as a centralized environment for the entire data lifecycle, encompassing the automated generation of synthetic training data, the structural organization of document collections, and the systematic annotation of individual data points. The platform distinguishes itself through its integrated evaluation and orchestration capabilities. It provides a dedicated suite for benchmarking models, featuring blind side
OCRFlux is a lightweight yet powerful multimodal toolkit that significantly advances PDF-to-Markdown conversion, excelling in complex layout handling, complicated table parsing and cross-page content merging.
omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
This tool is a command-line processor designed for querying, updating, and transforming structured data files. It functions as a versatile engine for manipulating YAML, JSON, TOML, and XML documents, allowing users to perform complex operations directly from the terminal. By utilizing a path-based expression language, it enables precise navigation and modification of data structures within configuration files and infrastructure-as-code workflows. What distinguishes this tool is its ability to perform in-place document mutations while preserving original formatting, comments, and metadata. It
sChandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSON schemas, with citations back to source locations. The service supports form filling in PDF and image documents, document generation from Markdown, and extraction of tracked changes from Word files. The platform distinguishes itself with pipeline-based processing chains that combine multiple proces
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Manage scalable open LLM inference endpoints in Slurm clusters
Olmocr is a distributed document processing framework designed to convert PDF and image files into structured markdown. It functions as a vision-based document parser that utilizes multimodal neural networks to interpret complex visual layouts and translate them into standardized text representations. The system operates as a remote inference orchestrator, offloading heavy document analysis tasks to external servers or cloud APIs to minimize local computational requirements. By employing a stateless worker architecture, it decouples document ingestion from inference, allowing for the distribu
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"