# ub-mannheim/tesseract

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/ub-mannheim-tesseract).**

4,111 stars · 523 forks · C++ · apache-2.0 · fork

## Links

- GitHub: https://github.com/UB-Mannheim/tesseract
- awesome-repositories: https://awesome-repositories.com/repository/ub-mannheim-tesseract.md

## Topics

`lstm` `ocr` `ocr-d` `ocr-d-mp` `tesseract-ocr` `windows-build`

## Description

Tesseract is an optical character recognition (OCR) engine and command-line tool that converts printed or handwritten text in images into machine-readable text. It recognizes text in many languages and can serve as the recognition stage of a document digitization pipeline that turns scanned images into structured digital formats.

The project includes a training framework for building models for additional scripts (writing systems) and specific languages, so the engine can learn new languages or unusual fonts from custom training data.

Typical uses are automated text extraction and the digitization of archives, with recognized text exported to formats such as plain text, PDF, hOCR, and ALTO.
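As a rough illustration of how the export formats are selected, the sketch below assembles a call to the `tesseract` command-line tool, which takes an input image, an output base name, an optional `-l` language list, and one or more output config names. The helper names `build_tesseract_command` and `run_ocr` are hypothetical and not part of the project; the sketch assumes a reasonably recent Tesseract with the matching language data installed.

```python
import shutil
import subprocess

# Output config names shipped with Tesseract (ALTO requires a recent release).
KNOWN_FORMATS = ("txt", "pdf", "hocr", "alto", "tsv")


def build_tesseract_command(image, output_base, languages=("eng",), formats=("txt",)):
    """Return the argv list for one tesseract run (hypothetical helper)."""
    unknown = [f for f in formats if f not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    if not languages:
        raise ValueError("at least one language code is required")
    # Several languages are combined with '+', for example "eng+deu".
    return ["tesseract", str(image), str(output_base),
            "-l", "+".join(languages), *formats]


def run_ocr(image, output_base, **kwargs):
    """Run tesseract if it is on PATH and return the command that was executed."""
    command = build_tesseract_command(image, output_base, **kwargs)
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    subprocess.run(command, check=True)
    return command
```

Calling `run_ocr("scan.png", "out", languages=("eng", "deu"), formats=("txt", "pdf", "alto"))` would ask Tesseract to write one output file per requested format, each named after the base name `out`. Keeping command construction separate from execution makes the argument list easy to inspect without Tesseract installed.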

## Tags

### Artificial Intelligence & ML

- [Optical Character Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition.md) — Provides a comprehensive system for converting printed or handwritten text from images into machine-readable digital text.
- [Image Text Translators](https://awesome-repositories.com/f/artificial-intelligence-ml/image-translation-pipelines/image-text-translators.md) — Extracts printed and handwritten text characters from image files using visual recognition. ([source](https://cdn.jsdelivr.net/gh/ub-mannheim/tesseract@main/README.md))
- [OCR Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/ocr-engines.md) — Acts as a complete visual recognition system that converts image text into digital characters and documents.
- [Multilingual Text Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/multilingual-text-recognition.md) — Recognizes and digitizes text across a wide variety of global languages and alphabets.
- [Multi-Stage Inference Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/architectures/computer-vision-segmentation-models/object-detection-models/multi-stage-inference-pipelines.md) — Implements a multi-stage inference pipeline that sequences layout analysis, line detection, and character recognition.
- [C++ Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/inference-engines/c-inference-backends/c-based-engines/c-based-image-engines/c-engines.md) — Uses a high-performance C++ core engine to handle computationally intensive image analysis tasks.
- [OCR Model Customizers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-customization/ocr-model-customizers.md) — Allows adaptation of the OCR engine to specific languages, scripts, or fonts through custom training.
- [OCR Language Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training/ocr-language-training.md) — Provides a framework to train the engine to recognize new languages or unique fonts using custom data. ([source](https://cdn.jsdelivr.net/gh/ub-mannheim/tesseract@main/README.md))
- [OCR Training Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training/ocr-training-frameworks.md) — Includes a framework for training models for additional scripts and specific languages to improve recognition accuracy.
- [OCR Data Export Formats](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition/ocr-data-export-formats.md) — Supports exporting recognized text into multiple digital formats including plain text, PDF, and ALTO. ([source](https://cdn.jsdelivr.net/gh/ub-mannheim/tesseract@main/README.md))

### Part of an Awesome List

- [Long Short-Term Memory Networks](https://awesome-repositories.com/f/awesome-lists/ai/neural-network-architectures/long-short-term-memory-networks.md) — Uses long short-term memory networks to recognize sequences of visual features as text characters.
- [Text Extraction and OCR](https://awesome-repositories.com/f/awesome-lists/more/text-extraction-and-ocr.md) — Automates the extraction of characters and lines of text from images for use in other applications.

### Business & Productivity Software

- [Digitization Pipelines](https://awesome-repositories.com/f/business-productivity-software/digitization-pipelines.md) — Transforms scanned images through a pipeline into structured digital formats like PDF, hOCR, and ALTO.
- [Document Digitization Tools](https://awesome-repositories.com/f/business-productivity-software/document-digitization-tools.md) — Processes large volumes of scanned documents into structured, searchable digital formats for archives.
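
For processing a large volume of scans, one possible first step is to plan the batch before running any recognition. The sketch below pairs each image in a folder with an output base name that could then be passed to the `tesseract` command-line tool. The function name `plan_batch` and the list of image suffixes are assumptions made for illustration, not something the project defines.

```python
from pathlib import Path

# File suffixes treated as scanned images (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def plan_batch(scan_dir, out_dir):
    """Return (image_path, output_base) pairs for every image in scan_dir."""
    scan_dir, out_dir = Path(scan_dir), Path(out_dir)
    jobs = []
    # Sort entries so repeated runs process files in the same order.
    for image in sorted(scan_dir.iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            # Tesseract appends the format extension to the output base itself.
            jobs.append((image, out_dir / image.stem))
    return jobs
```

Each pair can be turned into one OCR run. Files that are not images are skipped, and subfolders are not searched.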