Extract structured data and text from PDF documents for seamless integration into large language models.
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabilities include extracting images and structured data based on predefined schemas, as well as chunking documents for retrieval augmented generation pipelines. The project supports high-volume processing by distributing conversion tasks across multiple GPUs.
Marker is a specialized document parsing pipeline that converts complex PDFs into structured, LLM-ready formats like markdown and JSON, explicitly supporting table extraction, OCR, and layout analysis for RAG workflows.
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers spatial document layout mapping to identify bounding boxes and generate natural reading order sequences. It provides capabilities for granular content retrieval, allowing for the targeted extraction of specific document elements such as tables, formulas, and code blocks through prompt-based parsing.
Dolphin is a multimodal document parser that uses vision-language models to perform layout analysis, OCR, and structured extraction of tables and text, making it a highly capable tool for preparing PDF data for RAG pipelines.
PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as Markdown, HTML, and LaTeX. It functions as a multi-stage parsing framework that combines a document layout analyzer, a formula recognition engine, an OCR text extractor, and a table extraction system. The project focuses on recovering complex document elements by translating images of mathematical formulas and tabular structures into editable source code. It utilizes model-driven layout analysis to identify structural elements in reports and textbooks while ignoring noise like watermarks or blurring. The system supports the composition of custom parsing pipelines through configuration files and provides tools for benchmarking extraction model performance against datasets. Its broader capabilities include optical character recognition for extracting text and spatial coordinates, as well as vision-to-LaTeX translation for mathematical notation.
This toolkit is specifically engineered for high-fidelity PDF parsing, providing the layout analysis, OCR, and table extraction capabilities required to convert complex documents into LLM-ready structured formats like Markdown.
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences with geometric heuristics, the system reconstructs the reading order and structural hierarchy of documents to ensure accurate data representation. The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recognition, and formula extraction into a unified pipeline. It serializes all extracted features and spatial coordinates into a standardized format, ensuring that output remains consistent for downstream integration. To support verification, the tool includes a diagnostic suite that generates visual overlays, allowing users to inspect segmentation boundaries and reading order directly against the original source files. The software provides a comprehensive framework for automated data extraction, organizing parsed elements into a page-based structure suitable for large-scale information retrieval. It is distributed as a Python-based package, with documentation and installation instructions available in the repository.
MinerU is a comprehensive document parsing pipeline that performs layout analysis, OCR, and table extraction to convert complex PDFs into structured, LLM-ready formats specifically optimized for RAG workflows.
DeepSeek-OCR is a vision processing framework designed to convert image-based text into machine-readable tokens for large language models. It functions as a document inference pipeline that encodes visual data into compact representations, enabling automated optical character recognition and document analysis workflows. The system distinguishes itself through a high-throughput architecture that utilizes hardware-accelerated batch inference to process large volumes of visual data. It incorporates dynamic resolution scaling to manage the balance between visual detail and token consumption, ensuring that image content is compressed into optimized formats for efficient model ingestion. The framework includes comprehensive capabilities for scaling inference throughput across distributed backends to maintain consistent performance under heavy traffic. It also integrates automated benchmarking tools to evaluate the accuracy and speed of text extraction across diverse datasets, ensuring reliable output quality during system operations.
This framework functions as a document inference pipeline that converts visual document data into LLM-ready tokens, providing the core OCR and document analysis capabilities needed for RAG pipelines.
pdfplumber is a PDF data extraction library and layout analysis tool used to retrieve text, tables, and geometric objects from PDF files using precise coordinate-based analysis. It functions as a layout analyzer and table parser that identifies the bounding boxes and visual coordinates for every character and image on a page. The library distinguishes itself through visual debugging capabilities, allowing users to render PDF pages as images and draw annotations to verify the position of extracted data. It employs line and intersection analysis to identify cell structures and convert unstructured tabular data into organized lists. The tool covers broad capability areas including geometric object extraction, spatial filtering via page area cropping, and the retrieval of document metadata from file trailers. It also supports text data mining that preserves the visual arrangement of characters.
This library provides precise coordinate-based layout analysis and table extraction, making it a highly effective tool for preparing structured data for RAG pipelines, though it lacks built-in OCR capabilities for scanned documents.
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. Its broader capability surface covers optical character recognition for creating searchable text layers, detailed data extraction of tables and key-value pairs, and security operations including AES/RC4 encryption and permanent content redaction. The library also handles complex document geometry, layout analysis, and the generation of PDFs from HTML and CSS. The library supports multi-format document loading for PDF, EPUB, MOBI, SVG, and Office files, with the ability to process files via memory streams.
PyMuPDF is a high-performance library that provides the necessary layout analysis, OCR integration, and table extraction features to convert complex PDF documents into structured, LLM-ready formats for RAG pipelines.
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy through validation against predefined templates. Its pipeline-based architecture supports pluggable processing backends, allowing for the dynamic integration of specialized engines for optical character recognition and complex visual layout analysis. Users can control parsing behavior and extraction parameters through declarative configuration files, facilitating integration into automated workflows and server-based architectures. The library provides both a programmatic interface and a command-line toolkit to support automated document processing and format conversion. It utilizes optional dependency management to allow for modular installation of specific features, such as media rendering or advanced processing capabilities, depending on the requirements of the application.
Docling is a comprehensive document parsing framework that natively supports layout analysis, table extraction, and OCR, providing structured, LLM-ready output specifically designed for RAG pipelines.
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document structures and formatting requirements. This flexibility is supported by an integrated optical character recognition capability that ensures text recovery from embedded images during the conversion process. The system provides both a command-line interface and a programmatic library, facilitating automated batch processing and custom integration into data pipelines. To ensure consistent performance across different environments, the project supports deployment within containerized architectures that encapsulate all necessary system-level dependencies and binaries.
This tool is designed to convert various document formats, including PDFs, into structured Markdown suitable for LLM ingestion, effectively handling layout analysis and OCR to support RAG workflows.
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings for 18 programming languages, a Model Context Protocol (MCP) server for direct AI agent integration, and a REST API with an OpenAPI schema. The extraction pipeline is plugin-based and configurable, supporting multiple OCR backends (Tesseract, PaddleOCR, EasyOCR, and vision-language models) with quality-based fallback, parallel batch processing with work-stealing, and ONNX Runtime model inference with hardware acceleration for CPU, GPU, or NPU. Beyond core text extraction, Kreuzberg provides a document enrichment pipeline that includes page classification, named entity recognition, summarization, translation, captioning, and PII redaction. It prepares content for retrieval-augmented generation (RAG) workflows by chunking text, generating vector embeddings, and reranking results. The system also supports structured data extraction via LLMs, source code extraction from 306 programming languages, and transcription of audio and video files using Whisper ONNX models. The project is available as a library installable via standard package managers, a CLI tool installable via Homebrew or Docker, and a production-ready deployment option with a Helm chart for Kubernetes.
Kreuzberg is a comprehensive document extraction engine that natively supports table extraction, OCR, and layout analysis, specifically designed to output structured data for RAG pipelines.
Pix2Text is an optical character recognition system and document conversion tool designed to transform images and PDFs into Markdown. It functions as a multilingual OCR engine supporting over 80 languages, a LaTeX formula recognizer for mathematical notations, and a parser integrated with vision language models. The project utilizes a hybrid pipeline to separate plain text from mathematical formulas and tabular structures within a single pass. It converts recognized formulas into LaTeX expressions and transforms detected tables and layouts into structured Markdown formatting. The system includes a command line interface for document conversion and a local HTTP web API for programmatic image processing. It supports GPU acceleration to increase model inference speed.
This tool functions as a document parser that converts PDFs into structured Markdown, providing the layout analysis and table extraction capabilities necessary for preparing data for RAG pipelines.
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts. The project distinguishes itself through a sophisticated document layout analysis framework that employs a hybrid approach to resolve complex structures like multi-column text and tables. It offers extensive configurability, allowing users to refine recognition accuracy through custom linguistic models, user-defined dictionaries, and specialized training pipelines. The engine supports the generation of various structured outputs, including searchable PDFs with hidden text layers, and provides hardware-accelerated math kernels to optimize inference performance. Beyond core recognition, the system includes comprehensive tooling for image pre-processing, page segmentation, and the management of modular language data. It provides C and C++ APIs alongside various language-specific wrappers, enabling integration into diverse software environments. The engine is available as pre-built binary packages or can be compiled from source using standard system compilers.
This is a powerful OCR engine that provides the foundational text recognition and layout analysis needed for document processing, but it is a building block rather than a complete RAG-ready extraction pipeline.
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Docling is a purpose-built document processing tool that handles complex PDF layouts, table extraction, and OCR to produce structured, LLM-ready output specifically designed for RAG pipelines.
EasyOCR is a deep learning-based computer vision library designed to perform optical character recognition on images and video frames. It functions as a comprehensive pipeline that automates the transformation of visual text into machine-readable strings, enabling the digitization of physical documents, forms, and receipts into searchable data. The engine distinguishes itself through a multi-stage processing workflow that combines convolutional neural networks for spatial feature extraction with sequence-based decoding mechanisms. This architecture allows the system to identify and interpret text across a wide range of global languages without requiring explicit character segmentation. It further refines its output using geometric filtering to ensure that detected text regions maintain coherent structure and logical paragraph grouping. The library provides a unified interface for hardware-agnostic compute, allowing users to route operations between central processing units and graphics accelerators based on their available environment. It supports various configuration options for language selection, output detail levels, and model storage management to facilitate integration into diverse data extraction workflows.
This is a specialized OCR library for image-based text recognition, but it lacks the document-specific layout analysis and table extraction capabilities required to parse complex PDF structures for RAG pipelines.
Botasaurus is a Python web scraping framework and headless browser automation system used to build scalable data extraction tools. It functions as a web data extraction tool and OCR document parser, converting website content, images, and PDF files into structured formats such as JSON, CSV, and Excel. The framework distinguishes itself by providing a scraper management interface that allows Python functions to be wrapped in a web-based UI or deployed as standalone desktop applications. This enables non-technical users to trigger extraction jobs and manage tasks via a graphical interface or REST API without writing code. The system covers a broad range of capabilities including bot detection bypass, proxy rotation, and resource-aware parallel execution to manage large-scale data collection. It provides integrated utilities for session persistence, asynchronous task orchestration, and document text extraction via optical character recognition. Data management is supported through interchangeable database backends, result caching, and interactive filtering and sorting tools for viewing extracted data.
This framework provides a comprehensive system for web and document data extraction, including OCR and PDF parsing capabilities that can be structured into JSON for RAG pipelines.
Tesseract is an optical character recognition engine and tool designed to convert printed or handwritten text from images into machine-readable digital text. It functions as a multilingual text extractor and a document digitization pipeline that transforms scanned images into structured digital formats. The project includes a framework for training custom scripts and language-specific models, allowing the engine to recognize new languages or unique fonts through custom training data. Its capabilities cover automated text extraction, digital archive digitization, and the export of recognized text into formats such as plain text, PDF, and ALTO.
Tesseract is a foundational OCR engine for converting images to text, but it lacks the built-in layout analysis and table extraction logic required to parse complex PDF documents into structured data for RAG pipelines.
Stirling-PDF is a web-based PDF management suite used for editing, merging, splitting, and converting PDF documents. It functions as a self-hosted document manager, providing a centralized interface for users to manipulate files on a private server. The system features a workflow automation engine that allows for the creation of processing pipelines to handle large volumes of documents without writing custom code. It also includes an optical character recognition tool to convert scanned PDFs into searchable and editable text. Access is managed through single sign-on integration and OIDC compatibility, which supports secure authentication and the maintenance of audit logs for compliance. The application is delivered as a container-based deployment and exposes its functions through a REST API for external software integration.
This is a comprehensive PDF management and manipulation suite, but it focuses on document editing and file operations rather than the specialized structural parsing and layout analysis required for RAG pipelines.
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex queries through iterative processing and tool-calling, while its hybrid retrieval orchestration combines vector similarity and full-text search with re-ranking to improve the accuracy of retrieved context. The framework also features event-driven streaming, which delivers incremental results from long-running pipelines to the user interface in real-time. Beyond its core reasoning capabilities, the platform includes a suite of functional modules for the entire lifecycle of document-based applications. This includes multi-modal parsing for extracting text, tables, and visual elements from diverse file formats, as well as administrative tools for managing document collections, vector stores, and multi-user access. The system is designed to be interface-agnostic, allowing developers to wrap third-party libraries and external services into standardized, reusable processing units. The project provides a web-based user interface for interactive querying and configuration, and it supports deployment of private, isolated instances through predefined templates.
This is a comprehensive RAG orchestration platform that includes built-in modules for multi-modal PDF parsing, table extraction, and OCR, making it a suitable end-to-end solution for preparing document data for LLM pipelines.
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVino. It features a specialized ability to translate natural language queries into structured SQL or CSV formats by analyzing database schemas. The framework covers a broad range of capabilities including end-to-end retrieval-augmented generation pipelines, hybrid search engines, and multimodal content processing for PDFs, Office documents, audio, and images. It also incorporates tools for structured function calling, named entity recognition, and text risk classification to detect toxicity and prompt injections. The system integrates with various SQL and vector database backends to manage knowledge collection indexing and document embeddings.
This framework provides a comprehensive suite of document parsing and extraction tools specifically designed for RAG pipelines, including built-in support for table extraction, layout analysis, and OCR-ready processing.
This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasoning workflows. By integrating document intelligence with advanced retrieval pipelines, the platform enables the creation of grounded, verifiable responses supported by traceable citations. The platform distinguishes itself through deep document understanding and sophisticated knowledge orchestration. It supports complex document parsing, including the extraction of tables and images, and utilizes graph-based indexing to enhance reasoning over large document collections. Users can configure multiple recall strategies and fused re-ranking to optimize retrieval accuracy, while the system maintains context through multi-turn dialogue management and flexible tool-use frameworks. The architecture is built on a modular, containerized microservice foundation that supports both local inference engines and external language model APIs. It includes asynchronous task processing for document ingestion and indexing, ensuring system responsiveness during heavy workloads. The platform also provides a standardized interface for model abstraction, allowing for seamless integration with existing language model ecosystems. Developers can interact with the platform through a comprehensive suite of RESTful endpoints and Python client libraries, which cover the full lifecycle of agents, datasets, and knowledge graphs. The system is designed for flexible deployment, offering configurable environment settings and support for custom containerized environments to facilitate local development and infrastructure portability.
This platform provides a comprehensive RAG pipeline that includes built-in document parsing, table extraction, and layout analysis specifically designed to prepare unstructured data for LLM consumption.