Tessdata

This repository provides the pre-trained neural network and legacy data files that Tesseract uses to recognize and extract printed text from images. It is a multilingual collection of trained data, including Long Short-Term Memory (LSTM) models designed for high-accuracy optical character recognition (OCR) across many scripts and languages.

The data includes specialized models that analyze image layout to determine text rotation and script direction. It also provides the language-specific datasets and linguistic patterns that Tesseract OCR engines need in order to function.
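As a rough illustration of how the orientation and script data might be used, the sketch below builds a Tesseract command line that requests page segmentation mode 0 (orientation and script detection only). The helper name `build_osd_command` and the example paths are hypothetical. The command is only constructed, not executed; running it assumes a `tesseract` binary is installed and that `osd.traineddata` is present in the given directory.

```python
from pathlib import Path


def build_osd_command(image_path, tessdata_dir):
    """Build (but do not run) a command for orientation/script detection.

    Hypothetical helper for illustration only.
    --psm 0 selects "orientation and script detection (OSD) only", and
    "stdout" as the output base makes Tesseract print the result.
    """
    return [
        "tesseract",
        str(Path(image_path)),
        "stdout",
        "--tessdata-dir", str(Path(tessdata_dir)),
        "--psm", "0",
    ]


if __name__ == "__main__":
    # Placeholder paths, not files shipped with this repository.
    print(" ".join(build_osd_command("scan.png", "tessdata")))
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract is available.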

These files support multilingual text extraction and document digitization. The repository contains trained models for many specific languages and scripts, including Japanese, Korean, Portuguese, German, Latin, Filipino, and Armenian.
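To make the per-language layout concrete, here is a minimal sketch that maps the languages named above to the file names they would have under Tesseract's `<code>.traineddata` naming convention, and reports which ones are absent from a local directory. The three-letter codes are an assumption based on that convention, and `traineddata_name` and `missing_languages` are hypothetical helper names.

```python
from pathlib import Path

# Assumed three-letter codes for the languages listed in the text.
LANGUAGE_CODES = {
    "Japanese": "jpn",
    "Korean": "kor",
    "Portuguese": "por",
    "German": "deu",
    "Latin": "lat",
    "Filipino": "fil",
    "Armenian": "hye",
}


def traineddata_name(language):
    """Return the expected data file name for a language, e.g. 'jpn.traineddata'."""
    return f"{LANGUAGE_CODES[language]}.traineddata"


def missing_languages(tessdata_dir, languages):
    """Return the languages whose data file is not present in tessdata_dir."""
    directory = Path(tessdata_dir)
    return [
        language
        for language in languages
        if not (directory / traineddata_name(language)).is_file()
    ]


if __name__ == "__main__":
    # "tessdata" is a placeholder for wherever the files were downloaded.
    print(missing_languages("tessdata", list(LANGUAGE_CODES)))
```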

Features

Text Recognition - Provides a comprehensive collection of trained models for converting detected text images into machine-readable characters.

Optical Character Recognition - Digitizes printed characters from image files using trained language models.

Embedded Text Recognizers - Provides the essential trained models for extracting and interpreting text embedded within images across global scripts.

Multilingual Models - Provides pre-trained models that enable the extraction of written text across various global languages.

Multilingual Text Recognition - Extracts text from images using trained model files for a wide variety of global scripts and languages.

Multi-Language Recognition Models - Provides language-specific recognition models to identify and extract text from images.

OCR Language Datasets - Provides trained models and language-specific data files to enable text extraction from images.

OCR Training Datasets - Provides language-specific datasets and linguistic patterns used to improve OCR accuracy across global languages.

OCR Language Training - Defines characters and linguistic patterns via trained data files for specific language recognition.

Pre-trained OCR Language Data - Provides the pre-trained neural network and legacy data files used by Tesseract to recognize and extract text from images.

Multilingual Text Processing - Ships specialized language data files used for identifying and extracting text across multiple global scripts.

Optical Character Recognition - Provides the pre-trained neural network and legacy data files that enable the conversion of images of text into encoded text.

Tesseract Model Data - Provides the necessary trained LSTM and legacy model files that enable Tesseract OCR engines to function.

Long Short-Term Memory Networks - Uses Long Short-Term Memory networks to predict character sequences from image slices.

LSTM-Based OCR Models - Ships a collection of Long Short-Term Memory trained data for high-accuracy optical character recognition.

Text and Language Models - Supplies trained neural network files for identifying text in specific languages and scripts.

Optical Character Recognition - Provides language-specific data files for optical character recognition of printed text.

Model Binary Formats - Provides compiled binary files containing language-specific weights and neural network parameters.

Document Layout Analysis - Analyzes image layouts to determine text rotation and script direction.

Pattern-Matching Recognition - Provides non-neural pattern-matching models as a fallback for specific font styles.

Korean Text Recognition - Identifies and extracts Korean characters from images using trained linguistic models.

Japanese Text Recognition - Extracts printed Japanese characters from images using trained language model data.

Model Pruning - Implements model pruning to reduce the size and computational requirements of neural network weights.

Script and Orientation Detectors - Includes specialized models to identify the writing system and rotation angle of text within images.

Document Digitization Tools - Provides the linguistic and visual data necessary for converting physical document scans into searchable digital formats.

Text Orientation Detection - Ships trained data to detect and correct document text orientation and script direction.

Language Data Partitioning - Separates linguistic patterns and character sets into discrete files for optimized loading of required scripts.
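Several of the features above (separate per-language files, LSTM models, and the legacy pattern-matching models) correspond to options on the Tesseract command line. The sketch below assembles such a command under the assumption that a `tesseract` binary is installed: `-l` takes language codes joined with `+`, and `--oem` selects the engine (0 legacy only, 1 LSTM only, 2 both, 3 default). The helper `build_ocr_command` and the `ENGINE_MODES` names are hypothetical, and nothing is executed here.

```python
# Human-readable labels (hypothetical) for Tesseract's --oem values.
ENGINE_MODES = {
    "legacy": "0",       # legacy engine only
    "lstm": "1",         # neural network (LSTM) engine only
    "legacy+lstm": "2",  # both engines
    "default": "3",      # whatever is available
}


def build_ocr_command(image_path, languages, engine="default", tessdata_dir=None):
    """Build (but do not run) a Tesseract OCR command for the given languages."""
    if not languages:
        raise ValueError("at least one language code is required")
    command = [
        "tesseract",
        image_path,
        "stdout",                   # print recognized text instead of writing a file
        "-l", "+".join(languages),  # only these language files are loaded
        "--oem", ENGINE_MODES[engine],
    ]
    if tessdata_dir is not None:
        command += ["--tessdata-dir", tessdata_dir]
    return command


if __name__ == "__main__":
    # Placeholder image name; the language codes are assumed to be installed.
    print(" ".join(build_ocr_command("page.png", ["jpn", "kor"], engine="lstm")))
```

Because each language lives in its own file, a command like this only needs the data files for the codes passed to `-l`.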
