OCR Text Extraction Tools

Open-source software libraries and applications for converting images and scanned documents into machine-readable text.


dmMaze/BallonsTranslator
4,551 GitHub stars
BallonsTranslator is a software suite designed for extracting, translating, and replacing text within comic panels while preserving the original visual layout. It functions as an image translation tool that combines text region detection, optical character recognition, and deep learning inpainting to automate the localization of comics. The tool features a deep learning image inpainter that removes original text and restores backgrounds using generative neural networks and patch-matching algorithms. It also includes a rich-text translation editor for modifying translated dialogue with support for font presets, search-and-replace, and document exports. The system provides a multi-engine OCR pipeline for extracting text and font colors, and a layout-aware replacement system that matches font sizes and positioning. For automated workflows, a headless command-line interface allows for batch image translation and rendering without a graphical user interface.
This tool provides a specialized OCR pipeline for extracting and translating text from comic panels, offering batch processing and layout-aware features that align with the core requirements for document text extraction.
Python · Comic Panel Translators · Batch Processing · Comic Text Extraction
Frooodle/Stirling-PDF
81,168 GitHub stars
Stirling-PDF is a web-based PDF management suite used for editing, merging, splitting, and converting PDF documents. It functions as a self-hosted document manager, providing a centralized interface for users to manipulate files on a private server. The system features a workflow automation engine that allows for the creation of processing pipelines to handle large volumes of documents without writing custom code. It also includes an optical character recognition tool to convert scanned PDFs into searchable and editable text. Access is managed through single sign-on integration and OIDC compatibility, which supports secure authentication and the maintenance of audit logs for compliance. The application is delivered as a container-based deployment and exposes its functions through a REST API for external software integration.
Stirling-PDF is a comprehensive document management suite that includes built-in OCR capabilities for converting scanned PDFs into searchable text, making it a functional tool for your requirements despite being broader than a dedicated OCR engine.
Java · PDF Manipulation Utilities · OCR Engines · Optical Character Recognition
Less-relevant matches (scored below the primary cut)
pdfminer/pdfminer.six
6,906 GitHub stars
pdfminer.six is a programmatic tool for extracting text, layout information, and metadata from PDF documents into machine-readable formats. It functions as a document parser that converts internal PDF objects and structures into accessible data objects for analysis. The project includes utilities for decrypting RC4 and AES encrypted files to enable content extraction. It also provides a layout analyzer to identify fonts, colors, and text locations to determine the organizational structure of pages. The system covers a broad range of extraction capabilities, including the retrieval of embedded images, interactive form data, and tagged contents. It supports multilingual text processing for diverse character sets and vertical writing, and can transform document data into formats such as HTML, hOCR, or plain text.
This tool is a PDF parser designed to extract existing text and layout data from digital documents, rather than performing optical character recognition on images or scanned files.
Python · Document Layout Analysis · PDF Parsers · PDF Text Extraction
funstory-ai/BabelDOC
7,752 GitHub stars
BabelDOC is a technical document translation system designed to translate PDF files while preserving their original layout and styling. It functions as a layout-preserving translator that utilizes large language models to convert content into target languages, specifically tailored for scientific and technical documents. The system distinguishes itself through specialized handling of academic content, including the identification and preservation of mathematical formulas and complex layout structures. It ensures technical accuracy by employing glossary-driven terminology enforcement, using source-to-target mappings to maintain consistency across translated text. The software covers a broad range of document processing capabilities, including PDF content extraction, spatial-based text reconstruction, and layout detection. It supports both monolingual and bilingual PDF generation, allowing for side-by-side comparisons of original and translated content through coordinate-normalized layout reflow. The system uses TOML-based configuration files to manage processing pipelines and supports offline asset management for deployment in air-gapped environments.
This tool is designed for document translation and layout preservation rather than general-purpose OCR, as it focuses on extracting and translating existing text rather than converting images into machine-readable text.
Python · Document Layout Analysis · Document Structure Analysis
VERT-sh/VERT
13,999 GitHub stars
VERT is a media conversion platform designed to transform images, audio, video, and documents into various formats. It functions as a batch file processor that allows users to apply consistent conversion settings and custom naming patterns to multiple assets simultaneously, bundling the final outputs into compressed archives for streamlined organization. The system distinguishes itself through a distributed architecture that routes heavy media transcoding tasks across local hardware or remote server infrastructure. This approach optimizes performance by balancing computational workloads, allowing users to adjust processing intensity to prioritize either rapid output generation or higher fidelity results. Beyond core conversion, the platform provides granular control over digital asset optimization, including the ability to modify compression levels, bitrates, and sample rates. It also features metadata management, enabling the selective preservation or removal of technical information such as EXIF data during the transformation flow.
This is a general-purpose media conversion and transcoding platform for audio, video, and images, but it lacks the specific OCR engine capabilities required to extract text from scanned documents.
Svelte · Batch Processing
jgm/pandoc
44,822 GitHub stars
Pandoc is a universal document converter that translates content between a wide range of markup and binary formats. It functions by parsing input documents into a unified intermediate abstract syntax tree, which serves as the foundation for consistent manipulation and transformation across diverse output types. The system is distinguished by its modular reader-writer pipeline, which decouples input parsing from output generation to allow for granular control over document structure. Users can programmatically manipulate this intermediate tree through a robust filter system, supporting both external JSON-based interop and an integrated scripting environment for custom transformations. This architecture enables complex document processing tasks, such as automated scholarly publishing, where citations, bibliographies, and mathematical expressions are managed through a specialized toolchain. Beyond core conversion, the project provides a comprehensive templating engine that merges structured document data with customizable templates to produce final outputs with specific styling and layout requirements. It also offers a network-based server mode for API-driven and batch processing, allowing the tool to be integrated into automated technical content pipelines. The software is primarily operated via a command-line interface, which provides extensive configuration options for managing input formats, citation styles, and document metadata.
Pandoc is a universal document converter for markup and binary formats, but it lacks the optical character recognition engine required to extract text from images or scanned documents.
Haskell · Batch Processing
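The JSON-based filter interop mentioned in the Pandoc description can be sketched roughly as follows. This is a minimal illustration, not Pandoc's own code: it assumes the JSON AST layout in which each element is an object with a type tag `t` and optional contents `c`, and the helper names (`walk`, `upper_str`, `apply_filter`) and the sample document, including its API version numbers, are hypothetical.

```python
import json


def walk(node, fn):
    """Rebuild the tree bottom-up, applying fn to every element (a dict with a "t" key)."""
    if isinstance(node, list):
        return [walk(item, fn) for item in node]
    if isinstance(node, dict):
        rebuilt = {key: walk(value, fn) for key, value in node.items()}
        return fn(rebuilt) if "t" in rebuilt else rebuilt
    return node  # strings, numbers, and None pass through unchanged


def upper_str(element):
    """Example transformation: uppercase plain text, leave other elements alone."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return element


def apply_filter(doc_json: str) -> str:
    """Parse a JSON AST document, transform its blocks, and serialize it again."""
    doc = json.loads(doc_json)
    doc["blocks"] = walk(doc["blocks"], upper_str)
    return json.dumps(doc)


# Hypothetical sample document; the version numbers are illustrative only.
SAMPLE = json.dumps({
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "hello"},
            {"t": "Space"},
            {"t": "Str", "c": "world"},
        ]}
    ],
})

print(apply_filter(SAMPLE))
```

To use something like this as a real filter, `apply_filter` would be wired to standard input and output, since an external filter receives the AST as JSON on stdin and writes the modified AST to stdout.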
dataelement/bisheng
11,455 GitHub stars
Bisheng is an enterprise AI framework and LLM DevOps platform designed to manage the full lifecycle of large language models. It provides a unified system for dataset curation, supervised fine-tuning, model versioning, and performance evaluation. The platform features a visual workflow orchestrator for building retrieval-augmented generation pipelines and complex task sequences using flowcharts with conditional logic and human intervention points. It also includes an AI agent framework that uses a specialized guidance language to embed domain expertise and professional business logic into autonomous agents. The system covers comprehensive enterprise AI governance through role-based access control, single sign-on, and integrated observability tools for monitoring system health and traffic. Additional capabilities include layout-aware document parsing for extracting text and tables from printed or handwritten sources and high-availability infrastructure deployment.
While this platform includes layout-aware document parsing for RAG pipelines, it is an enterprise LLM DevOps and workflow orchestration framework rather than a dedicated OCR software tool.
TypeScript · Document Layout Analysis
microsoft/unilm
22,030 GitHub stars
This project is a comprehensive framework and toolkit for developing, optimizing, and deploying transformer-based models across multimodal, document intelligence, and natural language processing tasks. It provides a unified neural architecture that processes text, vision, audio, and document layout data through a shared set of weights, enabling researchers and developers to build foundational models that align cross-modal representations. The platform distinguishes itself through advanced training and inference strategies designed for large-scale deep learning. It incorporates specialized mechanisms such as retentive state processing for efficient sequence generation, differential attention for improved focus, and distributed weight partitioning to handle memory-intensive computations. These capabilities are complemented by techniques for sparse decoding and model compression, which maintain performance while reducing the computational footprint of large-scale architectures. The project covers a broad capability surface, including end-to-end pipelines for data curation, synthetic data generation, and tokenization across diverse modalities. It supports extensive workflows for pre-training, instruction tuning, and fine-tuning, with specific focus areas in document understanding, speech synthesis, and cross-lingual transfer. Diagnostic tools for attention analysis and benchmarking further assist in evaluating model performance on complex reasoning and retrieval tasks.
This is a research-oriented framework for building multimodal foundation models rather than a ready-to-use OCR application, though it includes underlying components like TrOCR and LayoutLM that could be used to develop such a tool.
Python · Document Layout Analysis · Multimodal Layout Analysis