High-performance open-source libraries and tools designed for recognizing and extracting text from digital documents.
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts. The project distinguishes itself through a sophisticated document layout analysis f
Tesseract is a widely used open-source OCR engine that provides multi-language support, layout analysis, and searchable PDF output, making it a reference tool for this category.
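As a rough illustration of the command-line workflow described above, the sketch below assembles a Tesseract invocation from Python. The helper names `build_tesseract_command` and `run_ocr` and the file names are hypothetical; the CLI arguments (`-l`, `--psm`, and the `pdf` config name) are Tesseract's own. It assumes the `tesseract` binary is on the PATH and does nothing otherwise.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, languages=("eng",), psm=3,
                            searchable_pdf=False):
    """Assemble a Tesseract CLI invocation as an argument list (hypothetical helper)."""
    command = [
        "tesseract",
        str(image),
        str(output_base),            # Tesseract appends the file extension itself
        "-l", "+".join(languages),   # several languages are joined with '+'
        "--psm", str(psm),           # page segmentation mode; 3 is fully automatic
    ]
    if searchable_pdf:
        command.append("pdf")        # the 'pdf' config requests searchable PDF output
    return command


def run_ocr(image, output_base, **options):
    """Run Tesseract if it is installed; return the output path, or None if not."""
    if shutil.which("tesseract") is None:
        return None                  # binary not found on PATH, nothing to run
    subprocess.run(build_tesseract_command(image, output_base, **options), check=True)
    suffix = ".pdf" if options.get("searchable_pdf") else ".txt"
    return Path(str(output_base) + suffix)
```

Keeping command construction separate from execution makes the batch logic easy to check without the engine installed, which is convenient for large digitization jobs.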
EasyOCR is a deep learning-based computer vision library designed to perform optical character recognition on images and video frames. It functions as a comprehensive pipeline that automates the transformation of visual text into machine-readable strings, enabling the digitization of physical documents, forms, and receipts into searchable data. The engine distinguishes itself through a multi-stage processing workflow that combines convolutional neural networks for spatial feature extraction with sequence-based decoding mechanisms. This architecture allows the system to identify and interpret
EasyOCR is a comprehensive deep learning-based engine that provides multi-language support, text localization, and a flexible API for integrating OCR capabilities into Python-based document processing workflows.
dots.ocr is a suite of software utilities for document layout analysis, multilingual optical character recognition, and scene text digitization. It functions as an engine for extracting digital text and structured layout data from images and PDFs across various human scripts. The project includes a specialized transformer for converting charts, diagrams, and chemical formulas from raster images into scalable vector graphics. It also provides a pipeline to transform extracted text and structural layout from documents and web screenshots into formatted Markdown files. The system covers capabil
dots.ocr provides an OCR engine with built-in document layout analysis, multilingual text recognition, and PDF processing, covering the conversion of visual documents into structured machine-readable formats.
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers
Dolphin is a vision-language model-based document parser that performs OCR and layout analysis to convert images into structured data, making it a highly capable engine for document-to-text extraction tasks.
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
Docling is a comprehensive document parsing and layout analysis framework that integrates OCR engines as pluggable backends, making it a highly capable tool for converting complex documents into machine-readable structured data.
Pix2Text is an optical character recognition system and document conversion tool designed to transform images and PDFs into Markdown. It functions as a multilingual OCR engine supporting over 80 languages, a LaTeX formula recognizer for mathematical notations, and a parser integrated with vision language models. The project utilizes a hybrid pipeline to separate plain text from mathematical formulas and tabular structures within a single pass. It converts recognized formulas into LaTeX expressions and transforms detected tables and layouts into structured Markdown formatting. The system incl
Pix2Text is an OCR engine that supports multi-language text, complex document layout analysis, and PDF processing, and provides a local HTTP API for integration.
PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as Markdown, HTML, and LaTeX. It functions as a multi-stage parsing framework that combines a document layout analyzer, a formula recognition engine, an OCR text extractor, and a table extraction system. The project focuses on recovering complex document elements by translating images of mathematical formulas and tabular structures into editable source code. It utilizes model-driven layout analysis to identify structural elements in reports and textbooks while ignoring noise like wa
PDF-Extract-Kit is a document parsing framework that integrates OCR, layout analysis, and PDF processing to convert complex documents into structured formats.
OCRmyPDF is a tool for converting image-based PDF files into machine-readable documents by adding a searchable text layer via optical character recognition. It functions as a multi-language processor capable of detecting and extracting text in over 100 different languages using linguistic data packs. The software includes a PDF image optimizer to remove image artifacts and correct page skew to improve recognition accuracy. It also provides a converter to transform scanned documents into the PDF/A standard for long-term digital archiving. The system manages PDF optimization by compressing emb
OCRmyPDF is a specialized OCR tool, built on the Tesseract engine, that transforms scanned PDFs into searchable documents and offers multi-language support, image preprocessing, and PDF/A conversion.
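The features listed for OCRmyPDF map onto command-line options, which the following sketch assembles into an argument list. The helper name `build_ocrmypdf_command` and the file names are hypothetical; the flags `-l`, `--deskew`, `--output-type`, and `--optimize` belong to the OCRmyPDF CLI. The function only builds the command and does not run it.

```python
def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng",),
                           deskew=True, pdfa=True, optimize=1):
    """Assemble an OCRmyPDF invocation as an argument list (hypothetical helper)."""
    command = ["ocrmypdf", "-l", "+".join(languages)]  # languages joined with '+'
    if deskew:
        command.append("--deskew")                     # straighten skewed pages before OCR
    command += ["--output-type", "pdfa" if pdfa else "pdf"]  # PDF/A for archiving
    command += ["--optimize", str(optimize)]           # 0 disables image optimization
    command += [str(input_pdf), str(output_pdf)]
    return command


# Example: German and English scan, archived as PDF/A.
print(" ".join(build_ocrmypdf_command("scan.pdf", "searchable.pdf", ("deu", "eng"))))
```

The resulting list can be passed to `subprocess.run(..., check=True)` on a machine where OCRmyPDF is installed.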
pdf-craft is an OCR-based document parser and structure extractor designed to convert PDF files into structured data, Markdown, or EPUB ebooks. It utilizes optical character recognition and statistical analysis to identify document hierarchies and extract text and structured content. The system features specialized rendering for mathematical formulas and tables, using heuristic reconstruction to convert tabular data into digital formats. It includes a document structure extractor that builds tables of contents by analyzing font sizes, linguistic patterns, and language model title detection.
pdf-craft is an OCR-based document parser that performs layout analysis and text extraction from PDFs, which makes it a direct fit for PDF-centric document processing.
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences with geometric heuristics, the system reconstructs the reading order and structural hierarchy of documents to ensure accurate data representation. The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recogn
MinerU is a comprehensive document parsing pipeline that integrates OCR with advanced layout analysis and formula extraction to convert complex PDFs into structured, machine-readable data.
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Kreuzberg is a document extraction engine that integrates multiple OCR backends, supports PDF and image processing, and provides an API for structured text extraction.
Stirling-PDF is a self-hosted document processing suite designed for secure, private file management. It functions as a comprehensive transformation engine that executes complex operations—such as merging, splitting, converting, and redacting documents—directly on the host machine. The platform provides both a browser-based interface for interactive editing and a programmatic, API-first architecture that allows for the automation of document workflows through standard HTTP requests. The project distinguishes itself through its focus on private, infrastructure-agnostic deployment and granular
Stirling-PDF is a document processing suite with built-in OCR for converting scanned documents into text, which makes it usable for OCR tasks despite its broader focus on general PDF manipulation.
Tesseract is an optical character recognition engine and tool designed to convert printed or handwritten text from images into machine-readable digital text. It functions as a multilingual text extractor and a document digitization pipeline that transforms scanned images into structured digital formats. The project includes a framework for training custom scripts and language-specific models, allowing the engine to recognize new languages or unique fonts through custom training data. Its capabilities cover automated text extraction, digital archive digitization, and the export of recognized
Tesseract is the industry-standard open-source OCR engine that provides comprehensive support for multi-language recognition, document layout analysis, PDF output, and extensive image preprocessing capabilities.
RapidOCR is an offline deep-learning OCR engine that detects and recognizes text in images using ONNX Runtime, operating entirely without an internet connection. It provides a unified inference pipeline that runs across multiple platforms including Windows, Linux, macOS, Android, and Raspberry Pi, with programming language bindings for Python, C++, Java, and C#. The engine separates text detection and recognition into independent modules that can be swapped or fine-tuned individually, and abstracts the inference backend behind a unified interface allowing seamless switching between ONNX Runti
RapidOCR is a comprehensive, offline OCR engine that provides multi-language support, modular detection and recognition, and flexible API bindings, making it a robust solution for converting images into machine-readable text.
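The idea of separate, swappable detection and recognition stages can be sketched with plain Python callables. This is a toy model and not RapidOCR's API: the `OcrPipeline` class, the stage functions, and the list-of-strings "image" are all hypothetical stand-ins chosen so the example runs without any model files.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in toy units
ToyImage = Sequence[str]         # hypothetical stand-in: one string per pixel row


@dataclass
class OcrPipeline:
    """Two independent stages; either one can be replaced without touching the other."""
    detect: Callable[[ToyImage], List[Box]]
    recognize: Callable[[ToyImage, Box], str]

    def run(self, image: ToyImage) -> List[Tuple[Box, str]]:
        # Detection proposes regions; recognition reads each region in turn.
        return [(box, self.recognize(image, box)) for box in self.detect(image)]


def row_detector(image: ToyImage) -> List[Box]:
    """Toy detector: every non-empty row counts as one text region."""
    return [(0, y, len(row), y + 1) for y, row in enumerate(image) if row.strip()]


def slice_recognizer(image: ToyImage, box: Box) -> str:
    """Toy recognizer: read the characters covered by the box."""
    left, top, right, _ = box
    return image[top][left:right]


def upper_recognizer(image: ToyImage, box: Box) -> str:
    """Alternative recognizer, swapped in without changing the detector."""
    return slice_recognizer(image, box).upper()


page = ["INVOICE", "", "Total: 42"]
print(OcrPipeline(row_detector, slice_recognizer).run(page))
print(OcrPipeline(row_detector, upper_recognizer).run(page))
```

In a real engine each stage would wrap a neural model and an inference backend, but the contract between stages (boxes in, strings out) stays the same.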
PaddleOCR is a comprehensive optical character recognition framework designed for detecting and transcribing text from images and documents into structured, machine-readable formats. It provides a modular computer vision pipeline that decouples image preprocessing, text detection, and character recognition into independent, configurable stages. This architecture supports automated document digitization and multilingual text recognition, capable of identifying text in over one hundred languages across diverse environments ranging from scanned documents to industrial scenes. The framework disti
PaddleOCR is a comprehensive OCR framework that provides multi-language support, document layout analysis, and PDF processing capabilities, making it a robust engine for converting images and documents into machine-readable text.
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabi
Marker is a specialized document parsing and OCR pipeline that excels at converting complex PDFs into structured formats like markdown and JSON, making it a highly effective tool for document-heavy data extraction tasks.
pdfminer.six is a programmatic tool for extracting text, layout information, and metadata from PDF documents into machine-readable formats. It functions as a document parser that converts internal PDF objects and structures into accessible data objects for analysis. The project includes utilities for decrypting RC4 and AES encrypted files to enable content extraction. It also provides a layout analyzer to identify fonts, colors, and text locations to determine the organizational structure of pages. The system covers a broad range of extraction capabilities, including the retrieval of embedde
pdfminer.six is a PDF parser that extracts existing text and layout data from digital files; it is not an OCR engine and cannot perform character recognition on images or scanned documents.
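Because a parser of this kind only reads an existing text layer, a common pattern is to try parsing first and send pages with little or no embedded text to an OCR engine. The sketch below shows one such routing heuristic. The function names and the 20-character threshold are assumptions for illustration; the per-page strings are expected to come from whatever PDF parser is in use.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic (assumption): a page with almost no embedded text is probably a scan."""
    visible = "".join(extracted_text.split())  # drop all whitespace before counting
    return len(visible) < min_chars


def route_pages(page_texts, min_chars=20):
    """Split 1-based page numbers into those to parse directly and those to OCR."""
    plan = {"parse": [], "ocr": []}
    for page_number, text in enumerate(page_texts, start=1):
        target = "ocr" if needs_ocr(text, min_chars) else "parse"
        plan[target].append(page_number)
    return plan


# Example with hand-written strings standing in for parser output.
texts = ["A full paragraph of embedded text lives here.", "  \n", "p. 3"]
print(route_pages(texts))
```

The threshold is a tuning knob: pages that contain only a page number or a stamp in the text layer still get routed to OCR.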
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
PyMuPDF is a document processing library with layout analysis and OCR support (its OCR functions rely on a Tesseract installation), making it an effective tool for extracting machine-readable text from PDFs and images.
Stirling-PDF is a web-based PDF management suite used for editing, merging, splitting, and converting PDF documents. It functions as a self-hosted document manager, providing a centralized interface for users to manipulate files on a private server. The system features a workflow automation engine that allows for the creation of processing pipelines to handle large volumes of documents without writing custom code. It also includes an optical character recognition tool to convert scanned PDFs into searchable and editable text. Access is managed through single sign-on integration and OIDC comp
Stirling-PDF is a comprehensive document management suite that includes integrated OCR capabilities for converting scanned documents into searchable text, making it a functional tool for this purpose despite being a broader application rather than a standalone engine.
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
This tool functions as a comprehensive document processing engine that integrates optical character recognition to extract text from images and various file formats, making it a suitable choice for converting visual data into machine-readable Markdown.
BallonsTranslator is a software suite designed for extracting, translating, and replacing text within comic panels while preserving the original visual layout. It functions as an image translation tool that combines text region detection, optical character recognition, and deep learning inpainting to automate the localization of comics. The tool features a deep learning image inpainter that removes original text and restores backgrounds using generative neural networks and patch-matching algorithms. It also includes a rich-text translation editor for modifying translated dialogue with support
BallonsTranslator integrates an OCR pipeline specifically for comic text extraction and translation. It provides text recognition and layout-aware processing, although its primary focus is comic localization rather than general-purpose document scanning.
Umi-OCR is an optical character recognition engine designed to convert visual text from images and documents into machine-readable character data. It functions as a local-first toolkit, processing all visual data directly on the host machine using embedded neural network models to maintain privacy and offline availability. The project distinguishes itself through its focus on automated document digitization and integrated barcode and QR code decoding. By utilizing a modular, Python-based orchestration layer, it enables users to transform static image files and multi-page documents into search
Umi-OCR is a local-first optical character recognition engine that provides robust text extraction and batch processing capabilities, making it a suitable tool for converting images and documents into machine-readable text.
Tesseract.js is a JavaScript library that provides optical character recognition capabilities directly within web browsers and Node.js environments. It functions as a client-side engine, enabling the conversion of images containing printed text into machine-readable strings without the need for external APIs or server-side infrastructure. The library distinguishes itself by running the original C++ optical character recognition engine within the browser through WebAssembly modules. To maintain interface responsiveness during intensive computation, it utilizes background threads for parallel p
Tesseract.js runs the Tesseract OCR engine through WebAssembly, providing text recognition directly in the browser or Node.js without external APIs or server-side infrastructure.
Nougat is a neural OCR system and LLM document parser designed to convert images of academic PDF documents into structured markdown text and mathematical formulas. It functions as a PDF to markdown converter that uses deep learning to handle layout and formula recognition. The project provides a document training pipeline for generating datasets and training neural networks to recognize specific academic document styles. This includes utilities for training dataset generation, neural model training, and model checkpoint management to ensure reproducible deployment. The system covers a broad
Nougat is a specialized neural OCR engine for academic document parsing and PDF-to-markdown conversion, providing API integration and layout analysis capabilities.