High-performance open-source libraries and tools designed for recognizing and extracting text from digital documents.
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts. The project distinguishes itself through a sophisticated document layout analysis framework that employs a hybrid approach to resolve complex structures like multi-column text and tables. It offers extensive configurability, allowing users to refine recognition accuracy through custom linguistic models, user-defined dictionaries, and specialized training pipelines. The engine supports the generation of various structured outputs, including searchable PDFs with hidden text layers, and provides hardware-accelerated math kernels to optimize inference performance. Beyond core recognition, the system includes comprehensive tooling for image pre-processing, page segmentation, and the management of modular language data. It provides C and C++ APIs alongside various language-specific wrappers, enabling integration into diverse software environments. The engine is available as pre-built binary packages or can be compiled from source using standard system compilers.
Tesseract is a widely used, industry-standard OCR engine that provides multi-language support, layout analysis, and searchable PDF output, making it a strong fit for this category.
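As a rough illustration of the command-line workflow described above, the sketch below assembles a `tesseract` invocation that writes a searchable PDF. The helper name `build_tesseract_command` and the file names are hypothetical. The flags follow the usual `tesseract <image> <outputbase> -l <langs> --psm <mode> <config>` form, and the code only builds the argument list, so it assumes nothing about whether Tesseract is installed.

```python
from typing import List, Sequence


def build_tesseract_command(
    image: str,
    output_base: str,
    languages: Sequence[str] = ("eng",),
    psm: int = 3,
    output_format: str = "pdf",
) -> List[str]:
    """Assemble (but do not run) a tesseract command line.

    Hypothetical helper for illustration; only the CLI flags are Tesseract's.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    if not 0 <= psm <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    return [
        "tesseract",
        image,
        output_base,                 # tesseract appends the extension itself
        "-l", "+".join(languages),   # multiple languages are joined with '+'
        "--psm", str(psm),           # 3 = fully automatic page segmentation
        output_format,               # output config such as 'pdf', 'hocr', or 'tsv'
    ]


command = build_tesseract_command("scan.png", "scan_ocr", languages=("eng", "deu"))
print(" ".join(command))
```

On a machine with Tesseract and the relevant language data installed, the list could be passed to `subprocess.run(command, check=True)`.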
EasyOCR is a deep learning-based computer vision library designed to perform optical character recognition on images and video frames. It functions as a comprehensive pipeline that automates the transformation of visual text into machine-readable strings, enabling the digitization of physical documents, forms, and receipts into searchable data. The engine distinguishes itself through a multi-stage processing workflow that combines convolutional neural networks for spatial feature extraction with sequence-based decoding mechanisms. This architecture allows the system to identify and interpret text across a wide range of global languages without requiring explicit character segmentation. It further refines its output using geometric filtering to ensure that detected text regions maintain coherent structure and logical paragraph grouping. The library provides a unified interface for hardware-agnostic compute, allowing users to route operations between central processing units and graphics accelerators based on their available environment. It supports various configuration options for language selection, output detail levels, and model storage management to facilitate integration into diverse data extraction workflows.
EasyOCR is a comprehensive deep learning-based engine that provides multi-language support, text localization, and a flexible API for integrating OCR capabilities into Python-based document processing workflows.
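The description mentions sequence-based decoding without explicit character segmentation but does not say which decoder is used. One common scheme for this is greedy CTC decoding, sketched here in plain Python as an assumption rather than as EasyOCR's actual code: the recognizer emits one class index per horizontal step, repeated indices are collapsed, and a reserved blank index is dropped. The function name, alphabet, and sample indices are all made up for illustration.

```python
from typing import Sequence

BLANK = 0  # assumed index reserved for the CTC "no character" symbol


def ctc_greedy_decode(step_indices: Sequence[int], alphabet: str) -> str:
    """Collapse repeated indices, then drop blanks (greedy CTC decoding).

    step_indices holds the highest-scoring class per time step; class k > 0
    maps to alphabet[k - 1].
    """
    chars = []
    previous = BLANK
    for index in step_indices:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != BLANK:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)


# Made-up per-step predictions over the alphabet "act" (1='a', 2='c', 3='t').
steps = [0, 2, 2, 0, 1, 1, 3, 3, 0]
print(ctc_greedy_decode(steps, "act"))
```

Because no step is ever assigned to a pre-cut character box, a doubled letter is only recoverable when a blank separates the two runs, as in `[3, 0, 3]`.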
dots.ocr is a suite of software utilities for document layout analysis, multilingual optical character recognition, and scene text digitization. It functions as an engine for extracting digital text and structured layout data from images and PDFs across a range of writing systems. The project includes a specialized transformer for converting charts, diagrams, and chemical formulas from raster images into scalable vector graphics. It also provides a pipeline to transform extracted text and structural layout from documents and web screenshots into formatted Markdown files. The system identifies the bounding boxes and categories of layout elements and emits them as structured JSON. It further includes tools for scene text detection within natural images and an evaluation framework for measuring text and table extraction accuracy against ground truth data.
This repository provides a comprehensive OCR engine that natively supports document layout analysis, multilingual text recognition, and PDF processing, making it a complete solution for converting visual documents into structured machine-readable formats.
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers spatial document layout mapping to identify bounding boxes and generate natural reading order sequences. It provides capabilities for granular content retrieval, allowing for the targeted extraction of specific document elements such as tables, formulas, and code blocks through prompt-based parsing.
Dolphin is a vision-language model-based document parser that performs OCR and layout analysis to convert images into structured data, making it a highly capable engine for document-to-text extraction tasks.
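The parallel element parsing described above can be approximated with a worker pool. This is a minimal sketch under stated assumptions: `parse_element` is a hypothetical stand-in for a per-element model call, the element dictionaries are invented, and a thread pool on one machine replaces whatever distributed scheduling the project actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def parse_element(element: Dict[str, str]) -> str:
    """Hypothetical stand-in for a model call that parses one cropped element."""
    return f"[{element['kind']}] {element['content'].strip()}"


def parse_page(elements: List[Dict[str, str]], workers: int = 4) -> List[str]:
    """Parse elements concurrently.

    Executor.map() yields results in input order, so the reading order
    produced by layout analysis is preserved even if workers finish out of order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_element, elements))


# Invented elements, listed in reading order.
page = [
    {"kind": "text", "content": " Introduction "},
    {"kind": "table", "content": "a | b"},
    {"kind": "formula", "content": "E = mc^2"},
]
for line in parse_page(page):
    print(line)
```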
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy through validation against predefined templates. Its pipeline-based architecture supports pluggable processing backends, allowing for the dynamic integration of specialized engines for optical character recognition and complex visual layout analysis. Users can control parsing behavior and extraction parameters through declarative configuration files, facilitating integration into automated workflows and server-based architectures. The library provides both a programmatic interface and a command-line toolkit to support automated document processing and format conversion. It utilizes optional dependency management to allow for modular installation of specific features, such as media rendering or advanced processing capabilities, depending on the requirements of the application.
Docling is a comprehensive document parsing and layout analysis framework that integrates OCR engines as pluggable backends, making it a highly capable tool for converting complex documents into machine-readable structured data.
Pix2Text is an optical character recognition system and document conversion tool designed to transform images and PDFs into Markdown. It functions as a multilingual OCR engine supporting over 80 languages, a LaTeX formula recognizer for mathematical notations, and a parser integrated with vision language models. The project utilizes a hybrid pipeline to separate plain text from mathematical formulas and tabular structures within a single pass. It converts recognized formulas into LaTeX expressions and transforms detected tables and layouts into structured Markdown formatting. The system includes a command line interface for document conversion and a local HTTP web API for programmatic image processing. It supports GPU acceleration to increase model inference speed.
Pix2Text is a comprehensive OCR engine that supports multi-language text, complex document layout analysis, PDF processing, and provides a local HTTP API for integration, making it a complete solution for your requirements.
PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as Markdown, HTML, and LaTeX. It functions as a multi-stage parsing framework that combines a document layout analyzer, a formula recognition engine, an OCR text extractor, and a table extraction system. The project focuses on recovering complex document elements by translating images of mathematical formulas and tabular structures into editable source code. It utilizes model-driven layout analysis to identify structural elements in reports and textbooks while ignoring noise like watermarks or blurring. The system supports the composition of custom parsing pipelines through configuration files and provides tools for benchmarking extraction model performance against datasets. Its broader capabilities include optical character recognition for extracting text and spatial coordinates, as well as vision-to-LaTeX translation for mathematical notation.
This toolkit functions as a comprehensive document parsing framework that integrates OCR, layout analysis, and PDF processing to convert complex documents into structured formats.
OCRmyPDF is a tool for converting image-based PDF files into machine-readable documents by adding a searchable text layer via optical character recognition. It functions as a multi-language processor capable of detecting and extracting text in over 100 different languages using linguistic data packs. The software includes a PDF image optimizer to remove image artifacts and correct page skew to improve recognition accuracy. It also provides a converter to transform scanned documents into the PDF/A standard for long-term digital archiving. The system manages PDF optimization by compressing embedded raster images to reduce overall file size. It further supports extensibility through an interface that allows the integration of custom text recognition engines.
This tool specializes in transforming scanned PDFs into searchable documents via OCR, offering multi-language support, image preprocessing, and PDF/A conversion.
pdf-craft is an OCR-based document parser and structure extractor designed to convert PDF files into structured data, Markdown, or EPUB ebooks. It utilizes optical character recognition and statistical analysis to identify document hierarchies and extract text and structured content. The system features specialized rendering for mathematical formulas and tables, using heuristic reconstruction to convert tabular data into digital formats. It includes a document structure extractor that builds tables of contents by analyzing font sizes, linguistic patterns, and language model title detection. The pipeline supports offline processing through local model weight caching, ensuring that OCR and layout analysis can function without an internet connection.
This tool functions as an OCR-based document parser that performs layout analysis and text extraction from PDFs, making it a direct fit for your document processing needs.
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences with geometric heuristics, the system reconstructs the reading order and structural hierarchy of documents to ensure accurate data representation. The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recognition, and formula extraction into a unified pipeline. It serializes all extracted features and spatial coordinates into a standardized format, ensuring that output remains consistent for downstream integration. To support verification, the tool includes a diagnostic suite that generates visual overlays, allowing users to inspect segmentation boundaries and reading order directly against the original source files. The software provides a comprehensive framework for automated data extraction, organizing parsed elements into a page-based structure suitable for large-scale information retrieval. It is distributed as a Python-based package, with documentation and installation instructions available in the repository.
MinerU is a comprehensive document parsing pipeline that integrates OCR with advanced layout analysis and formula extraction to convert complex PDFs into structured, machine-readable data.
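To make the idea of a geometric reading-order heuristic concrete, here is a deliberately simplified sketch. It is not MinerU's algorithm: it assumes a plain two-column page, assigns each detected block to a column by the horizontal center of its bounding box, and then reads each column top to bottom. The `Block` type, the function name, and the coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Sort blocks left column first, then right column, each top to bottom."""

    def sort_key(block: Block) -> Tuple[int, float, float]:
        center_x = (block.x0 + block.x1) / 2
        column = 0 if center_x < page_width / 2 else 1
        return (column, block.y0, block.x0)

    return sorted(blocks, key=sort_key)


# Invented blocks in arbitrary detection order on a 600-unit-wide page.
detected = [
    Block("C", 320, 40, 550, 250),
    Block("B", 50, 300, 280, 500),
    Block("A", 50, 40, 280, 250),
    Block("D", 320, 300, 550, 500),
]
print([block.text for block in reading_order(detected, page_width=600)])
```

A real pipeline would also need to handle full-width headings, figures, and uneven column counts, which this heuristic ignores.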
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings for 18 programming languages, a Model Context Protocol (MCP) server for direct AI agent integration, and a REST API with an OpenAPI schema. The extraction pipeline is plugin-based and configurable, supporting multiple OCR backends (Tesseract, PaddleOCR, EasyOCR, and vision-language models) with quality-based fallback, parallel batch processing with work-stealing, and ONNX Runtime model inference with hardware acceleration for CPU, GPU, or NPU. Beyond core text extraction, Kreuzberg provides a document enrichment pipeline that includes page classification, named entity recognition, summarization, translation, captioning, and PII redaction. It prepares content for retrieval-augmented generation (RAG) workflows by chunking text, generating vector embeddings, and reranking results. The system also supports structured data extraction via LLMs, source code extraction from 306 programming languages, and transcription of audio and video files using Whisper ONNX models. The project is available as a library installable via standard package managers, a CLI tool installable via Homebrew or Docker, and a production-ready deployment option with a Helm chart for Kubernetes.
Kreuzberg is a comprehensive document extraction engine that natively integrates multiple OCR backends, supports PDF and image processing, and provides a robust API for structured text extraction, making it a complete solution for your requirements.
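The quality-based fallback between OCR backends mentioned above could work roughly as follows. This sketch does not use Kreuzberg's API: the backends are hypothetical callables returning a text and a confidence score, the threshold is an arbitrary assumption, and the two sample backends return canned values.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical backend signature: image bytes -> (text, confidence in [0, 1]).
OcrBackend = Callable[[bytes], Tuple[str, float]]


def ocr_with_fallback(
    image: bytes,
    backends: List[Tuple[str, OcrBackend]],
    min_confidence: float = 0.8,
) -> Tuple[str, str, float]:
    """Try backends in order and stop at the first confident result.

    If no backend reaches min_confidence, the most confident result seen
    is returned. The return value is (backend_name, text, confidence).
    """
    if not backends:
        raise ValueError("no OCR backends configured")
    best: Optional[Tuple[str, str, float]] = None
    for name, backend in backends:
        try:
            text, confidence = backend(image)
        except Exception:
            continue  # a failing backend simply triggers the next one
        if best is None or confidence > best[2]:
            best = (name, text, confidence)
        if confidence >= min_confidence:
            break
    if best is None:
        raise RuntimeError("every OCR backend failed")
    return best


def fast_backend(image: bytes) -> Tuple[str, float]:
    return ("He11o w0rld", 0.55)  # canned low-confidence result


def accurate_backend(image: bytes) -> Tuple[str, float]:
    return ("Hello world", 0.93)  # canned high-confidence result


chain = [("fast", fast_backend), ("accurate", accurate_backend)]
print(ocr_with_fallback(b"fake image bytes", chain))
```

Ordering the chain from cheapest to most expensive means the slower backend only runs on inputs the fast one handles poorly.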
Stirling-PDF is a self-hosted document processing suite designed for secure, private file management. It functions as a comprehensive transformation engine that executes complex operations—such as merging, splitting, converting, and redacting documents—directly on the host machine. The platform provides both a browser-based interface for interactive editing and a programmatic, API-first architecture that allows for the automation of document workflows through standard HTTP requests. The project distinguishes itself through its focus on private, infrastructure-agnostic deployment and granular security. It supports role-based access control and stateless session authentication, ensuring that sensitive operations remain protected within a user-controlled environment. By offering a unified interface for sequential file transformations, it enables users to chain multiple processing tasks into single, automated pipelines while maintaining full control over document integrity and security. The system covers a broad range of document manipulation capabilities, including optical character recognition, digital signature validation, and advanced layout operations like booklet imposition and page reorganization. It is built for flexible integration, supporting deployment across containerized environments, bare metal, or native desktop installations. Configuration is managed through environment variables, YAML files, or the web interface, allowing for consistent behavior across diverse infrastructure setups.
Stirling-PDF is a comprehensive document processing suite that includes built-in OCR capabilities for converting scanned documents into text, making it a suitable tool for your requirements despite its broader focus on general PDF manipulation.
Tesseract is an optical character recognition engine and tool designed to convert printed or handwritten text from images into machine-readable digital text. It functions as a multilingual text extractor and a document digitization pipeline that transforms scanned images into structured digital formats. The project includes a framework for training custom scripts and language-specific models, allowing the engine to recognize new languages or unique fonts through custom training data. Its capabilities cover automated text extraction, digital archive digitization, and the export of recognized text into formats such as plain text, PDF, and ALTO.
Tesseract is the industry-standard open-source OCR engine that provides comprehensive support for multi-language recognition, document layout analysis, PDF output, and extensive image preprocessing capabilities.
RapidOCR is an offline deep-learning OCR engine that detects and recognizes text in images using ONNX Runtime, operating entirely without an internet connection. It provides a unified inference pipeline that runs across multiple platforms including Windows, Linux, macOS, Android, and Raspberry Pi, with programming language bindings for Python, C++, Java, and C#. The engine separates text detection and recognition into independent modules that can be swapped or fine-tuned individually, and abstracts the inference backend behind a unified interface allowing seamless switching between ONNX Runtime, OpenVINO, PaddlePaddle, PyTorch, MNN, and TensorRT. It supports over 80 languages by combining language-specific recognition models with a unified text detection backbone, and offers both lightweight mobile-optimized and higher-accuracy server-grade model variants selected at runtime. The project includes a command-line tool for extracting text from images and URLs with bounding boxes and confidence scores, and provides structured programmatic output with separate fields for bounding boxes, recognized text, and confidence scores. It can classify text line orientation before recognition to improve accuracy, and visualize results by drawing detected text regions onto the original image. For deployment, the OCR engine can be packaged into a Docker container for consistent environments across platforms, or bundled into a standalone executable using PyInstaller that removes the Python runtime dependency. The project also includes utilities for converting PaddleOCR models to ONNX format and fine-tuning them on custom data for specialized text recognition scenarios.
RapidOCR is a comprehensive, offline OCR engine that provides multi-language support, modular detection and recognition, and flexible API bindings, making it a robust solution for converting images into machine-readable text.
PaddleOCR is a comprehensive optical character recognition framework designed for detecting and transcribing text from images and documents into structured, machine-readable formats. It provides a modular computer vision pipeline that decouples image preprocessing, text detection, and character recognition into independent, configurable stages. This architecture supports automated document digitization and multilingual text recognition, capable of identifying text in over one hundred languages across diverse environments ranging from scanned documents to industrial scenes. The framework distinguishes itself through a hardware-agnostic inference layer and a high-performance execution engine that enables consistent model deployment across CPUs, GPUs, and mobile hardware. It facilitates high-throughput production environments by utilizing static graph execution and distributed device orchestration, which allow for the scaling of recognition tasks across multiple hardware accelerators and network services. To support flexible integration, the system includes a cross-platform deployment toolkit and utilities for exporting models into universal formats. It provides granular control over resource utilization through multi-process parallelism and custom inference distribution, ensuring efficient performance for both local processing and remote network service deployment.
PaddleOCR is a comprehensive OCR framework that provides multi-language support, document layout analysis, and PDF processing capabilities, making it a robust engine for converting images and documents into machine-readable text.
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabilities include extracting images and structured data based on predefined schemas, as well as chunking documents for retrieval augmented generation pipelines. The project supports high-volume processing by distributing conversion tasks across multiple GPUs.
Marker is a specialized document parsing and OCR pipeline that excels at converting complex PDFs into structured formats like markdown and JSON, making it a highly effective tool for document-heavy data extraction tasks.
pdfminer.six is a programmatic tool for extracting text, layout information, and metadata from PDF documents into machine-readable formats. It functions as a document parser that converts internal PDF objects and structures into accessible data objects for analysis. The project includes utilities for decrypting RC4 and AES encrypted files to enable content extraction. It also provides a layout analyzer to identify fonts, colors, and text locations to determine the organizational structure of pages. The system covers a broad range of extraction capabilities, including the retrieval of embedded images, interactive form data, and tagged contents. It supports multilingual text processing for diverse character sets and vertical writing, and can transform document data into formats such as HTML, hOCR, or plain text.
This tool is a PDF parser designed to extract existing text and layout data from digital files, rather than an OCR engine capable of performing character recognition on images or scanned documents.
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. Its broader capability surface covers optical character recognition for creating searchable text layers, detailed data extraction of tables and key-value pairs, and security operations including AES/RC4 encryption and permanent content redaction. The library also handles complex document geometry, layout analysis, and the generation of PDFs from HTML and CSS. The library supports multi-format document loading for PDF, EPUB, MOBI, SVG, and Office files, with the ability to process files via memory streams.
PyMuPDF is a powerful document processing library that includes built-in OCR capabilities and layout analysis, making it a highly effective tool for extracting machine-readable text from PDFs and images.
Stirling-PDF is a web-based PDF management suite used for editing, merging, splitting, and converting PDF documents. It functions as a self-hosted document manager, providing a centralized interface for users to manipulate files on a private server. The system features a workflow automation engine that allows for the creation of processing pipelines to handle large volumes of documents without writing custom code. It also includes an optical character recognition tool to convert scanned PDFs into searchable and editable text. Access is managed through single sign-on integration and OIDC compatibility, which supports secure authentication and the maintenance of audit logs for compliance. The application is delivered as a container-based deployment and exposes its functions through a REST API for external software integration.
Stirling-PDF is a comprehensive document management suite that includes integrated OCR capabilities for converting scanned documents into searchable text, making it a functional tool for this purpose despite being a broader application rather than a standalone engine.
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document structures and formatting requirements. This flexibility is supported by an integrated optical character recognition capability that ensures text recovery from embedded images during the conversion process. The system provides both a command-line interface and a programmatic library, facilitating automated batch processing and custom integration into data pipelines. To ensure consistent performance across different environments, the project supports deployment within containerized architectures that encapsulate all necessary system-level dependencies and binaries.
This tool functions as a comprehensive document processing engine that integrates optical character recognition to extract text from images and various file formats, making it a suitable choice for converting visual data into machine-readable Markdown.