# tesseract-ocr/tesseract

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/tesseract-ocr-tesseract).**

74,751 stars · 10,660 forks · C++ · Apache-2.0

## Links

- GitHub: https://github.com/tesseract-ocr/tesseract
- Homepage: https://tesseract-ocr.github.io/
- awesome-repositories: https://awesome-repositories.com/repository/tesseract-ocr-tesseract.md

## Topics

`hacktoberfest` `lstm` `machine-learning` `ocr` `ocr-engine` `tesseract` `tesseract-ocr`

## Description

Tesseract is a neural network-based optical character recognition engine that converts scanned images and digital documents into machine-readable, searchable text. It works both as a command-line utility for automating large-scale digitization workflows and as a cross-platform library that can be embedded into desktop, mobile, or server-side applications. Its recognizer is built on long short-term memory (LSTM) networks and supports more than one hundred languages and dozens of scripts.

The project distinguishes itself through a hybrid document layout analysis method that detects tab-stops and other structural cues to resolve complex structures like multi-column text and tables. It is highly configurable: users can refine recognition accuracy with custom linguistic models, user-defined dictionaries, and specialized training pipelines. The engine generates several structured outputs, including searchable PDFs with hidden text layers, and provides hardware-accelerated math kernels to optimize inference performance.
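
The configuration options described above are exposed on the command line. The following is a minimal sketch, not part of the project, of how a script might assemble a `tesseract` invocation. The helper name `build_tesseract_command` and the file names are hypothetical; the flags `-l`, `--psm`, `--oem`, and `--user-words` and the trailing output config names (`txt`, `pdf`, `hocr`, `tsv`) are the ones documented for the CLI. The helper only builds the argument list, so it runs without Tesseract installed.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    image_path: str,
    output_base: str,
    languages: Sequence[str] = ("eng",),
    psm: Optional[int] = None,
    oem: Optional[int] = None,
    user_words: Optional[str] = None,
    output_formats: Sequence[str] = ("txt",),
) -> List[str]:
    """Return an argv list for the tesseract CLI without running it.

    Hypothetical helper for illustration only.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Several languages are combined with '+', for example "eng+deu".
    command = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if psm is not None:
        # Page segmentation mode, for example 6 for a single uniform block of text.
        command += ["--psm", str(psm)]
    if oem is not None:
        # OCR engine mode, for example 1 for the LSTM engine only.
        command += ["--oem", str(oem)]
    if user_words is not None:
        # Path to a user-defined word list.
        command += ["--user-words", user_words]
    # Trailing config names select the output renderers.
    command += list(output_formats)
    return command


if __name__ == "__main__":
    argv = build_tesseract_command(
        "scan.png", "scan", ["eng", "deu"], psm=6, oem=1,
        output_formats=["pdf", "tsv"],
    )
    print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run` on a machine where the `tesseract` binary and the requested language data are installed.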

Beyond core recognition, the system includes comprehensive tooling for image pre-processing, page segmentation, and the management of modular language data. It provides C and C++ APIs alongside various language-specific wrappers, enabling integration into diverse software environments. The engine is available as pre-built binary packages or can be compiled from source using standard system compilers.
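
Because language data is modular, a deployment script often needs to confirm that the required `.traineddata` files are present before running recognition. The sketch below is an illustration under stated assumptions: the helper name `find_traineddata` is hypothetical, and it assumes that the `TESSDATA_PREFIX` environment variable points directly at the directory containing the `.traineddata` files, which may not hold for every Tesseract version or packaging.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def find_traineddata(
    languages: str,
    tessdata_dir: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Optional[Path]]:
    """Map each language code in e.g. "eng+deu" to its data file, or None.

    Hypothetical helper: an explicit directory wins over TESSDATA_PREFIX.
    """
    directory = tessdata_dir or env.get("TESSDATA_PREFIX")
    found: Dict[str, Optional[Path]] = {}
    for code in languages.split("+"):
        path = Path(directory) / f"{code}.traineddata" if directory else None
        # Report None when no directory is configured or the file is missing.
        found[code] = path if path is not None and path.is_file() else None
    return found


if __name__ == "__main__":
    for code, path in find_traineddata("eng+deu").items():
        print(code, "->", path if path else "missing")
```

Returning `None` for missing languages, instead of raising, lets the caller decide whether to download the data or fall back to another language.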

## Tags

### Artificial Intelligence & ML

- [OCR Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/ocr-engines.md) — Transforms scanned images and digital documents into machine-readable text using neural network-based recognition. ([source](https://cdn.jsdelivr.net/gh/tesseract-ocr/tesseract@main/README.md))
- [Automated Digitization Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/automated-digitization-engines.md) — Converts static images and physical scans into searchable, machine-readable formats for efficient archival and indexing.
- [OCR Command Line Interfaces](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/ocr-command-line-interfaces.md) — Executes character recognition tasks directly from the terminal by specifying input images, language models, and output requirements. ([source](https://cdn.jsdelivr.net/gh/tesseract-ocr/tesseract@main/README.md))
- [Adaptive Recognition Models](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/ocr-optimization/adaptive-recognition-models.md) — Refines recognition accuracy by applying document-specific image and language models tailored to varying typefaces and vocabularies. ([source](https://tesseract-ocr.github.io/docs/))
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Parses complex document images by detecting tab-stops and structural cues to deduce reading order and column layout. ([source](https://tesseract-ocr.github.io/docs/))
- [Recurrent Neural Networks](https://awesome-repositories.com/f/artificial-intelligence-ml/recurrent-neural-networks.md) — Models sequential dependencies in text across diverse scripts and languages using advanced neural network architectures.
- [Data Ingestion and Preparation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/data-ingestion-preparation.md) — Provides specialized interfaces for preparing and editing raw image data to facilitate model training. ([source](https://tesseract-ocr.github.io/tessdoc/AddOns.html))
- [Model Fine-Tuning and Adaptation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation.md) — Allows fine-tuning of recognition engines to improve performance for unique fonts, niche languages, and domain-specific terminology.
- [Multilingual Text Recognition Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis/multilingual-text-recognition-engines.md) — Supports transcription across more than one hundred languages and multiple scripts using configurable linguistic models.
- [Table Detection Algorithms](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis/table-detection-algorithms.md) — Identifies and localizes tabular data regions within heterogeneous document layouts to assist in information extraction. ([source](https://tesseract-ocr.github.io/docs/))
- [OCR API Bindings](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/ocr-api-bindings.md) — Exposes programmatic interfaces for embedding document recognition and pattern matching capabilities into custom software. ([source](https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html))
- [Page Segmentation Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/page-segmentation-optimizers.md) — Optimizes recognition performance by allowing users to configure page segmentation modes for specific document structures. ([source](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html))
- [Page Segmentation Modes](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis/page-segmentation-modes.md) — Defines how document layouts are parsed into blocks, lines, or characters through configurable segmentation settings. ([source](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html))
- [Model Provider Adapters](https://awesome-repositories.com/f/artificial-intelligence-ml/artificial-intelligence-tooling/language-model-integrations/model-provider-adapters.md) — Loads trained data files via system paths or environment variables to register specific recognition models for immediate use. ([source](https://tesseract-ocr.github.io/tessdoc/FAQ.html))
- [OCR Language Support](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/ocr-language-support.md) — Maintains extensive language-specific character sets and script definitions to ensure compatibility with diverse global writing systems. ([source](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html))
- [Script and Orientation Detectors](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis/script-and-orientation-detectors.md) — Applies fast shape classifiers to connected components to determine the writing system and page rotation of input documents. ([source](https://tesseract-ocr.github.io/docs/))
- [Mobile OCR Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/mobile-ocr-integrations.md) — Enables real-time text extraction from camera-captured images on Android and iOS platforms through third-party mobile SDKs. ([source](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html))
- [Multilingual Text Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/multilingual-text-recognition.md) — Transcribes diverse linguistic content by configuring the engine with specific language codes to handle varied character sets. ([source](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html))
- [OCR Data Export Formats](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/ocr-data-export-formats.md) — Generates structured output in formats like hOCR or TSV to enable data exchange with external analysis and web-based pipelines. ([source](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html))
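
As an illustration of consuming the TSV export format listed above, the following sketch filters word-level rows by confidence. The helper name `words_above` and the sample rows are made up for this example and are not real engine output; the column names follow the header that Tesseract's TSV renderer writes (`level`, `page_num`, ..., `conf`, `text`), where level 5 denotes a word.

```python
import csv
import io
from typing import List, Tuple

COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
           "word_num", "left", "top", "width", "height", "conf", "text"]

# Invented sample rows for illustration: one page-level row and two words.
SAMPLE_ROWS = [
    ["1", "1", "0", "0", "0", "0", "0", "0", "640", "480", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "36", "92", "60", "24", "96.5", "Hello"],
    ["5", "1", "1", "1", "1", "2", "104", "92", "70", "24", "42.0", "w0rld"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [COLUMNS] + SAMPLE_ROWS)


def words_above(tsv_text: str, min_conf: float) -> List[Tuple[str, float]]:
    """Return (text, confidence) for word-level rows at or above min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        # Level 5 rows describe single words; other levels are layout containers.
        if row["level"] != "5":
            continue
        conf = float(row["conf"])
        if conf >= min_conf and row["text"].strip():
            words.append((row["text"], conf))
    return words


if __name__ == "__main__":
    for text, conf in words_above(SAMPLE_TSV, 60.0):
        print(f"{text}\t{conf}")
```

Quoting is disabled in the reader because recognized text may itself contain quote characters that should be treated as ordinary data.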

### Content Management & Publishing

- [Command-Line Document Processors](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/document-automation-interfaces/command-line-document-processors.md) — Streamlines large-scale document digitization workflows by automating image processing, text extraction, and structured output generation.
- [Document Layout Analysis Tools](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/intelligent-extraction-frameworks/document-layout-analysis-tools.md) — Detects page structure, column orientation, and table regions to enable precise text extraction from complex document layouts.
- [Optical Character Recognition Engines](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/intelligent-extraction-frameworks/optical-character-recognition-engines.md) — Integrates robust visual text recognition into desktop, mobile, and server-side software environments.
- [PDF Generation Tools](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/format-conversion-toolkits/pdf-generation-tools.md) — Creates searchable PDF documents by overlaying a hidden text layer onto original image data. ([source](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html))
- [Cloud Document Conversion](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/format-conversion-toolkits/cloud-document-conversion.md) — Offloads document conversion tasks to third-party cloud-based services to transform images and PDFs into searchable text without local installation. ([source](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html))
- [OCR Engine Selectors](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/intelligent-extraction-frameworks/optical-character-recognition-engines/ocr-engine-selectors.md) — Toggles between legacy algorithms and modern neural network backends to balance processing speed against character recognition accuracy. ([source](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html))
- [Post-Processing Constraints](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/intelligent-extraction-frameworks/post-processing-constraints.md) — Applies linguistic constraints and dictionary lookups to refine raw classifier output into contextually accurate text sequences.
- [Text Orientation Detection](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/intelligent-extraction-frameworks/text-orientation-detection.md) — Uses script detection models to automatically identify and correct document orientation for improved text extraction results. ([source](https://tesseract-ocr.github.io/tessdoc/Planning.html))

### Data & Databases

- [Document Processing Pipelines](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/document-llm-preparation/document-processing-pipelines.md) — Automates document parsing and layout analysis to normalize static image content for downstream data integration. ([source](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html))
- [Table Extraction Utilities](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/table-extraction-utilities.md) — Resolves complex grid-based document structures by applying specialized layout analysis methods to extract tabular data. ([source](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html))
- [Specialized Recognition Data](https://awesome-repositories.com/f/data-databases/data-integration-synchronization/specialized-recognition-data.md) — Incorporates domain-specific data files to support advanced features like mathematical equation detection and script detection. ([source](https://tesseract-ocr.github.io/tessdoc/Data-Files.html))

### Graphics & Multimedia

- [Document Segmentation](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/computer-vision-pipelines/document-segmentation.md) — Decomposes visual documents into hierarchical structures, including text blocks, lines, and individual characters.
- [Image Pre-processing Utilities](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/image-processing-pipelines/image-pre-processing-utilities.md) — Enhances image quality through rescaling, binarization, and noise reduction to prepare raw visual data for more accurate recognition. ([source](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html))
- [Image Format Decoders](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/image-processing-pipelines/image-format-decoders.md) — Decodes standard image formats such as PNG, JPEG, and TIFF to supply raw pixel data for subsequent text analysis. ([source](https://tesseract-ocr.github.io/tessdoc/InputFormats))

### Part of an Awesome List

- [AI and Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/ai-and-machine-learning.md) — Open-source optical character recognition engine.
- [Optical Character Recognition](https://awesome-repositories.com/f/awesome-lists/ai/optical-character-recognition.md) — Industry-standard open source engine for character recognition.
- [Documentation and Knowledge](https://awesome-repositories.com/f/awesome-lists/productivity/documentation-and-knowledge.md) — Open-source engine for optical character recognition.

### Development Tools & Productivity

- [OCR Integration APIs](https://awesome-repositories.com/f/development-tools-productivity/api-development-sdks/software-development-kits/ocr-integration-apis.md) — Embeds recognition capabilities directly into custom software using native C and C++ interfaces or various language-specific wrappers. ([source](https://cdn.jsdelivr.net/gh/tesseract-ocr/tesseract@main/README.md))

### User Interface & Experience

- [OCR Interfaces](https://awesome-repositories.com/f/user-interface-experience/graphical-user-interfaces/ocr-interfaces.md) — Facilitates manual proofreading and layout analysis by serving as a backend for graphical tools designed to manage document digitization workflows. ([source](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html))

### Programming Languages & Runtimes

- [Custom Dictionaries](https://awesome-repositories.com/f/programming-languages-runtimes/programming-utilities/data-text-processing/custom-dictionaries.md) — Improves recognition accuracy for domain-specific terminology through user-defined word lists and custom patterns. ([source](https://tesseract-ocr.github.io/tessdoc/Planning.html))
