OCRmyPDF is a tool for converting image-based PDF files into machine-readable documents by adding a searchable text layer via optical character recognition. It functions as a multi-language processor capable of detecting and extracting text in over 100 different languages using linguistic data packs. The software includes a PDF image optimizer to remove image artifacts and correct page skew to improve recognition accuracy. It also provides a converter to transform scanned documents into the PDF/A standard for long-term digital archiving. The system manages PDF optimization by compressing emb
RapidOCR is an offline deep-learning OCR engine that detects and recognizes text in images using ONNX Runtime, operating entirely without an internet connection. It provides a unified inference pipeline that runs across multiple platforms including Windows, Linux, macOS, Android, and Raspberry Pi, with programming language bindings for Python, C++, Java, and C#. The engine separates text detection and recognition into independent modules that can be swapped or fine-tuned individually, and abstracts the inference backend behind a unified interface allowing seamless switching between ONNX Runti
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabi
Tesseract is an optical character recognition engine and tool designed to convert printed or handwritten text from images into machine-readable digital text. It functions as a multilingual text extractor and a document digitization pipeline that transforms scanned images into structured digital formats. The project includes a framework for training custom scripts and language-specific models, allowing the engine to recognize new languages or unique fonts through custom training data. Its capabilities cover automated text extraction, digital archive digitization, and the export of recognized