Tesseract

Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts.
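As a rough illustration of the command-line usage described above, the following Python sketch assembles an argument list for the `tesseract` executable. The helper name `build_tesseract_command` and the file names are hypothetical. The sketch assumes the usual `tesseract IMAGE OUTPUTBASE -l LANG` calling convention, in which several language codes are joined with `+`. It only builds the command and does not run it.

```python
from typing import List, Sequence


def build_tesseract_command(
    image_path: str,
    output_base: str,
    languages: Sequence[str] = ("eng",),
) -> List[str]:
    """Assemble (but do not run) a tesseract CLI invocation.

    Hypothetical helper for illustration; it assumes the `tesseract`
    executable is on PATH when the command is eventually executed.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Several language models are combined with '+', e.g. "eng+deu".
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


# Example file names are placeholders, not files shipped with the project.
command = build_tesseract_command("scan.png", "scan_text", ["eng", "deu"])
print(" ".join(command))
```

On a machine with Tesseract installed, the resulting list could be handed to `subprocess.run(command, check=True)`; batch digitization would then amount to looping this over a directory of images.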

The project distinguishes itself through a sophisticated document layout analysis framework that employs a hybrid approach to resolve complex structures like multi-column text and tables. It offers extensive configurability, allowing users to refine recognition accuracy through custom linguistic models, user-defined dictionaries, and specialized training pipelines. The engine supports the generation of various structured outputs, including searchable PDFs with hidden text layers, and provides hardware-accelerated math kernels to optimize inference performance.
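The configurability described above can be sketched as a small command builder. This is a minimal sketch, not the project's own tooling: the function name, the file names, and the chosen subset of outputs are assumptions. It relies on the `--psm`, `--oem`, and `--user-words` command-line options and on output config names such as `pdf` and `tsv` being passed as trailing arguments, and it assumes the value ranges 0-13 for page segmentation modes and 0-3 for engine modes.

```python
from typing import List, Optional, Sequence

# Assumed subset of output config names accepted as trailing CLI arguments.
KNOWN_OUTPUTS = ("txt", "pdf", "hocr", "tsv")


def build_configured_command(
    image_path: str,
    output_base: str,
    *,
    psm: Optional[int] = None,
    oem: Optional[int] = None,
    user_words: Optional[str] = None,
    outputs: Sequence[str] = ("txt",),
) -> List[str]:
    """Assemble a tesseract invocation with optional tuning flags."""
    unknown = [name for name in outputs if name not in KNOWN_OUTPUTS]
    if unknown:
        raise ValueError(f"unsupported output config(s): {unknown}")

    command = ["tesseract", image_path, output_base]
    if psm is not None:
        # Page segmentation mode: how the page is split into blocks and lines.
        if not 0 <= psm <= 13:
            raise ValueError("psm is expected to be in the range 0-13")
        command += ["--psm", str(psm)]
    if oem is not None:
        # OCR engine mode: legacy engine, LSTM engine, or a combination.
        if not 0 <= oem <= 3:
            raise ValueError("oem is expected to be in the range 0-3")
        command += ["--oem", str(oem)]
    if user_words is not None:
        # Path to a user-defined word list, one word per line.
        command += ["--user-words", user_words]
    # Output config names come last, after all options.
    command += list(outputs)
    return command


command = build_configured_command(
    "invoice.png", "invoice",
    psm=6, oem=1, user_words="terms.txt", outputs=["pdf", "tsv"],
)
print(command)
```

Requesting `pdf` is what would produce a searchable PDF with a hidden text layer; asking for several outputs in one run avoids recognizing the same page twice.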

Beyond core recognition, the system includes comprehensive tooling for image pre-processing, page segmentation, and the management of modular language data. It provides C and C++ APIs alongside various language-specific wrappers, enabling integration into diverse software environments. The engine is available as pre-built binary packages or can be compiled from source using standard system compilers.
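To make the idea of modular language data concrete, here is a small sketch that lists the language models found in a data directory. It assumes models are stored as `<code>.traineddata` files and that the `TESSDATA_PREFIX` environment variable, when set, points directly at the directory containing them (the exact meaning of that variable has differed between releases). The function name `list_language_models` is hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Optional


def list_language_models(tessdata_dir: Optional[str] = None) -> List[str]:
    """Return the codes of all *.traineddata files in a directory.

    Falls back to the TESSDATA_PREFIX environment variable when no
    directory is given (assumed to point at the data directory itself).
    """
    directory = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not directory:
        raise ValueError("no directory given and TESSDATA_PREFIX is unset")
    # "eng.traineddata" -> "eng", "chi_sim.traineddata" -> "chi_sim"
    return sorted(path.stem for path in Path(directory).glob("*.traineddata"))


# Demonstration with a throwaway directory of empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for code in ("eng", "deu", "osd"):
        (Path(demo_dir) / f"{code}.traineddata").touch()
    (Path(demo_dir) / "README.txt").touch()  # ignored: not a model file
    demo_languages = list_language_models(demo_dir)

print(demo_languages)  # ['deu', 'eng', 'osd']
```

With a real installation, the `tesseract --list-langs` command reports the same information directly from the engine.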

Features

OCR Engines - Transforms scanned images and digital documents into machine-readable text using neural network-based recognition.
Automated Digitization Engines - Converts static images and physical scans into searchable, machine-readable formats for efficient archival and indexing.
OCR Command Line Interfaces - Executes character recognition tasks directly from the terminal by specifying input images, language models, and output requirements.
Command-Line Document Processors - Streamlines large-scale document digitization workflows by automating image processing, text extraction, and structured output generation.
Document Layout Analysis Tools - Detects page structure, column orientation, and table regions to enable precise text extraction from complex document layouts.
Optical Character Recognition Engines - Integrates robust visual text recognition into desktop, mobile, and server-side software environments.
Adaptive Recognition Models - Refines recognition accuracy by applying document-specific image and language models tailored to varying typefaces and vocabularies.
Document Layout Analysis - Parses complex document images by detecting tab-stops and structural cues to deduce reading order and column layout.
Recurrent Neural Networks - Models sequential dependencies in text across diverse scripts and languages using advanced neural network architectures.
Data Ingestion and Preparation - Provides specialized interfaces for preparing and editing raw image data to facilitate model training.
Model Fine-Tuning and Adaptation - Allows fine-tuning of recognition engines to improve performance for unique fonts, niche languages, and domain-specific terminology.
Multilingual Text Recognition Engines - Supports transcription across more than one hundred languages and multiple scripts using configurable linguistic models.
Table Detection Algorithms - Identifies and localizes tabular data regions within heterogeneous document layouts to assist in information extraction.
OCR API Bindings - Exposes programmatic interfaces for embedding document recognition and pattern matching capabilities into custom software.
Page Segmentation Optimizers - Optimizes recognition performance by allowing users to configure page segmentation modes for specific document structures.
PDF Generation Tools - Creates searchable PDF documents by overlaying a hidden text layer onto original image data.
Document Processing Pipelines - Automates document parsing and layout analysis to normalize static image content for downstream data integration.
Document Segmentation - Decomposes visual documents into hierarchical structures, including text blocks, lines, and individual characters.
Page Segmentation Modes - Defines how document layouts are parsed into blocks, lines, or characters through configurable segmentation settings.
AI and Machine Learning - Open-source optical character recognition engine.
Computer Vision Libraries - Open-source engine for optical character recognition.
Optical Character Recognition - Industry-standard open-source engine for character recognition.
Documentation and Knowledge - Open-source engine for optical character recognition.
Model Provider Adapters - Loads trained data files via system paths or environment variables to register specific recognition models for immediate use.
OCR Language Support - Maintains extensive language-specific character sets and script definitions to ensure compatibility with diverse global writing systems.
OCR Integration APIs - Embeds recognition capabilities directly into custom software using native C and C++ interfaces or various language-specific wrappers.
Image Pre-processing Utilities - Enhances image quality through rescaling, binarization, and noise reduction to prepare raw visual data for more accurate recognition.
OCR Interfaces - Facilitates manual proofreading and layout analysis by providing a backend for graphical tools designed to manage document digitization workflows.
Script and Orientation Detectors - Applies fast shape classifiers to connected components to determine the writing system and page rotation of input documents.
Mobile OCR Integrations - Enables real-time text extraction from camera-captured images on Android and iOS platforms through dedicated mobile SDKs.
Multilingual Text Recognition - Transcribes diverse linguistic content by configuring the engine with specific language codes to handle varied character sets.
OCR Data Export Formats - Generates structured output in formats like hOCR or TSV to enable seamless data exchange with external analysis and web-based pipelines.
Cloud Document Conversion - Offloads document conversion tasks to cloud-based services to transform images and PDFs into searchable text without local installation.
OCR Engine Selectors - Toggles between legacy algorithms and modern neural network backends to balance processing speed against character recognition accuracy.
Post-Processing Constraints - Applies linguistic constraints and dictionary lookups to refine raw classifier output into contextually accurate text sequences.
Text Orientation Detection - Uses script detection models to automatically identify and correct document orientation for improved text extraction results.
Table Extraction Utilities - Resolves complex grid-based document structures by applying specialized layout analysis methods to extract tabular data.
Specialized Recognition Data - Incorporates domain-specific data files to support advanced features like mathematical equation recognition and script detection.
Image Format Decoders - Decodes standard image formats such as PNG, JPEG, and TIFF to supply raw pixel data for subsequent text analysis.
Custom Dictionaries - Adjusts recognition accuracy for domain-specific terminology by utilizing user-defined word lists and custom patterns.
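Several of the features above concern structured output and the block, line, and word hierarchy. As a hedged sketch of how TSV output might be consumed downstream, the following Python code extracts word-level rows and filters them by confidence. The sample rows are invented for illustration and are not real engine output. The column names follow the layout this sketch assumes for the TSV format (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`), with level 5 assumed to denote words.

```python
import csv
import io
from typing import Dict, List, Union

# Invented sample data in the assumed TSV column layout.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t104\t92\t70\t24\t91.2\tworld\n"
)

WORD_LEVEL = 5  # assumed hierarchy: 1 page, 2 block, 3 paragraph, 4 line, 5 word


def extract_words(
    tsv_text: str, min_conf: float = 0.0
) -> List[Dict[str, Union[str, float, int]]]:
    """Return word rows whose confidence is at least `min_conf`."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        if int(row["level"]) != WORD_LEVEL:
            continue  # skip page, block, paragraph, and line rows
        conf = float(row["conf"])
        text = (row["text"] or "").strip()
        if text and conf >= min_conf:
            words.append({
                "text": text,
                "conf": conf,
                "left": int(row["left"]),
                "top": int(row["top"]),
            })
    return words


words = extract_words(SAMPLE_TSV)
print(" ".join(str(w["text"]) for w in words))  # Hello world
```

Quoting is disabled in the reader so that quotation marks appearing in recognized text are kept as literal characters rather than being interpreted as CSV quotes.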

jbarlow83/OCRmyPDF

33,901

OCRmyPDF is a tool for converting image-based PDF files into machine-readable documents by adding a searchable text layer via optical character recognition. It functions as a multi-language processor capable of detecting and extracting text in over 100 different languages using linguistic data packs. The software includes a PDF image optimizer to remove image artifacts and correct page skew to improve recognition accuracy. It also provides a converter to transform scanned documents into the PDF/A standard for long-term digital archiving. The system manages PDF optimization by compressing emb

RapidAI/RapidOCR

5,968

RapidOCR is an offline deep-learning OCR engine that detects and recognizes text in images using ONNX Runtime, operating entirely without an internet connection. It provides a unified inference pipeline that runs across multiple platforms including Windows, Linux, macOS, Android, and Raspberry Pi, with programming language bindings for Python, C++, Java, and C#. The engine separates text detection and recognition into independent modules that can be swapped or fine-tuned individually, and abstracts the inference backend behind a unified interface allowing seamless switching between ONNX Runti

VikParuchuri/marker

36,164

Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabi

UB-Mannheim/tesseract

4,111

Tesseract is an optical character recognition engine and tool designed to convert printed or handwritten text from images into machine-readable digital text. It functions as a multilingual text extractor and a document digitization pipeline that transforms scanned images into structured digital formats. The project includes a framework for training custom scripts and language-specific models, allowing the engine to recognize new languages or unique fonts through custom training data. Its capabilities cover automated text extraction, digital archive digitization, and the export of recognized
