High-performance open-source libraries and tools designed to recognize and extract text from digital documents.
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts. The project distinguishes itself through a sophisticated document layout analysis f
Tesseract is the flagship open-source OCR engine, supporting over 100 languages with neural-network-based text recognition, layout analysis, image preprocessing, and both CLI and library APIs, directly matching your need for accurate, trainable OCR from scanned documents and images.
Tesseract is an optical character recognition engine and tool designed to convert printed or handwritten text from images into machine-readable digital text. It functions as a multilingual text extractor and a document digitization pipeline that transforms scanned images into structured digital formats. The project includes a framework for training custom scripts and language-specific models, allowing the engine to recognize new languages or unique fonts through custom training data. Its capabilities cover automated text extraction, digital archive digitization, and the export of recognized
Tesseract is the industry-standard open-source OCR engine with built-in multilingual recognition and training capability, a command-line interface and API, and sufficient layout analysis for most scanned documents, directly matching this search's core need.
PaddleOCR is a comprehensive optical character recognition framework designed for detecting and transcribing text from images and documents into structured, machine-readable formats. It provides a modular computer vision pipeline that decouples image preprocessing, text detection, and character recognition into independent, configurable stages. This architecture supports automated document digitization and multilingual text recognition, capable of identifying text in over one hundred languages across diverse environments ranging from scanned documents to industrial scenes. The framework disti
PaddleOCR is a comprehensive open-source OCR framework that directly delivers multilingual text recognition, layout analysis and table extraction, image preprocessing, PDF and image input, a command-line interface and Python API, and support for model training, making it a full-featured engine for extracting text from scanned documents.
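The decoupled preprocessing, detection, and recognition stages described for PaddleOCR can be illustrated with a minimal sketch. This is not PaddleOCR's API: `OcrPipeline`, `Region`, `binarize`, `detect_runs`, and `describe_region` are hypothetical names, and the toy stages work on nested lists of pixel values rather than real images. The point is only that each stage is an independent callable that can be swapped without touching the others.

```python
from dataclasses import dataclass
from typing import Callable, List

# A grayscale image as rows of 0-255 pixel values (a stand-in for a real image type).
Image = List[List[int]]


@dataclass(frozen=True)
class Region:
    """A horizontal run of ink pixels: columns [start, end) on one row."""
    row: int
    start: int
    end: int


def binarize(image: Image, threshold: int = 128) -> Image:
    """Preprocessing stage: mark pixels darker than the threshold as ink (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def detect_runs(binary: Image) -> List[Region]:
    """Detection stage: report each contiguous run of ink pixels per row."""
    regions = []
    for r, row in enumerate(binary):
        start = None
        for c, value in enumerate(row):
            if value and start is None:
                start = c
            elif not value and start is not None:
                regions.append(Region(r, start, c))
                start = None
        if start is not None:
            regions.append(Region(r, start, len(row)))
    return regions


def describe_region(binary: Image, region: Region) -> str:
    """Recognition stage placeholder: a real model would return decoded text."""
    return "#" * (region.end - region.start)


@dataclass
class OcrPipeline:
    """Holds one callable per stage so any stage can be replaced independently."""
    preprocess: Callable[[Image], Image]
    detect: Callable[[Image], List[Region]]
    recognize: Callable[[Image, Region], str]

    def run(self, image: Image) -> List[str]:
        prepared = self.preprocess(image)
        return [self.recognize(prepared, region) for region in self.detect(prepared)]


pipeline = OcrPipeline(binarize, detect_runs, describe_region)
print(pipeline.run([[255, 0, 0, 255, 0]]))  # ['##', '#']
```

Swapping only the recognition stage, for example `OcrPipeline(binarize, detect_runs, my_recognizer)` with a hypothetical `my_recognizer`, leaves preprocessing and detection unchanged.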
chineseocr is an end-to-end deep learning pipeline for detecting and recognizing Chinese and English text in images. The project combines text region detection using YOLOv3 with sequence-based recognition via Convolutional Recurrent Neural Networks (CRNN) and dense OCR models, forming a complete optical character recognition workflow. The pipeline includes orientation detection to handle text rotated at 0, 90, 180, or 270 degrees before recognition, and supports structured field extraction from identity cards and train tickets. A multi-framework model converter enables trained models to be co
This repository is an end-to-end OCR pipeline for Chinese and English text using deep learning, which fits the OCR engine category, but it lacks the broad multilingual support, layout analysis, table extraction, and command-line interface this search is asking for.
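The orientation step mentioned for chineseocr, which handles text rotated by 0, 90, 180, or 270 degrees before recognition, can be sketched as follows. This is an illustration of the idea only, not the project's code: `rotate_cw` and `undo_rotation` are hypothetical helpers, the image is a plain nested list, and the detected angle is assumed to be the clockwise rotation already applied to the page.

```python
from typing import List

Grid = List[List[int]]


def rotate_cw(grid: Grid) -> Grid:
    """Rotate a pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def undo_rotation(grid: Grid, detected_angle: int) -> Grid:
    """Return the grid turned back upright.

    detected_angle is assumed to be the clockwise rotation, in degrees,
    that an orientation classifier reported for the input.
    """
    if detected_angle not in (0, 90, 180, 270):
        raise ValueError("angle must be one of 0, 90, 180, 270")
    # Undoing a clockwise turn of A degrees equals turning clockwise by 360 - A.
    turns = ((360 - detected_angle) // 90) % 4
    for _ in range(turns):
        grid = rotate_cw(grid)
    return grid


upright = [[1, 2, 3],
           [4, 5, 6]]
sideways = rotate_cw(upright)           # simulate a page scanned at 90 degrees
print(undo_rotation(sideways, 90) == upright)  # True
```

Restricting the angle to four values keeps the correction lossless, since quarter turns only permute pixels and never require interpolation.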
Pix2Text is an optical character recognition system and document conversion tool designed to transform images and PDFs into Markdown. It functions as a multilingual OCR engine supporting over 80 languages, a LaTeX formula recognizer for mathematical notations, and a parser integrated with vision language models. The project utilizes a hybrid pipeline to separate plain text from mathematical formulas and tabular structures within a single pass. It converts recognized formulas into LaTeX expressions and transforms detected tables and layouts into structured Markdown formatting. The system incl
Pix2Text is a multilingual OCR and document-conversion system that extracts text, math formulas, and tables from images and PDFs into Markdown, which directly matches your need for an open-source OCR engine with multilingual recognition, layout analysis, and table extraction.
dots.ocr is a suite of software utilities for document layout analysis, multilingual optical character recognition, and scene text digitization. It functions as an engine for extracting digital text and structured layout data from images and PDFs across various human scripts. The project includes a specialized transformer for converting charts, diagrams, and chemical formulas from raster images into scalable vector graphics. It also provides a pipeline to transform extracted text and structural layout from documents and web screenshots into formatted Markdown files. The system covers capabil
dots.ocr is a multilingual OCR engine that performs text extraction from images and PDFs with layout analysis, matching your need for an OCR engine; it may not expose all requested features like preprocessing or training, but it is the right kind of tool.
RapidOCR is an offline deep-learning OCR engine that detects and recognizes text in images using ONNX Runtime, operating entirely without an internet connection. It provides a unified inference pipeline that runs across multiple platforms including Windows, Linux, macOS, Android, and Raspberry Pi, with programming language bindings for Python, C++, Java, and C#. The engine separates text detection and recognition into independent modules that can be swapped or fine-tuned individually, and abstracts the inference backend behind a unified interface allowing seamless switching between ONNX Runti
RapidOCR is an offline deep-learning OCR engine with modular detection and recognition, cross-platform support, and language bindings. It is the right kind of tool for extracting text from images, though layout analysis and table extraction are not highlighted in its description.
chineseocr_lite is a lightweight Chinese optical character recognition engine designed to detect text regions, analyze orientation, and convert Chinese characters from images into digital text. It supports both horizontal and vertical reading layouts and can be deployed as a web service for image uploads and result visualization. The system utilizes a multi-backend inference framework that supports ncnn, mnn, and tnn, allowing it to run across diverse hardware and platforms. It is specifically engineered for lightweight deployment on mobile and desktop environments through the use of small mo
This is a lightweight OCR engine specialised for Chinese text recognition, supporting text detection, orientation analysis, and web service deployment; it fits the search for an OCR engine but focuses on Chinese rather than full multilingual support and lacks table extraction and training features.
EasyOCR is a deep learning-based computer vision library designed to perform optical character recognition on images and video frames. It functions as a comprehensive pipeline that automates the transformation of visual text into machine-readable strings, enabling the digitization of physical documents, forms, and receipts into searchable data. The engine distinguishes itself through a multi-stage processing workflow that combines convolutional neural networks for spatial feature extraction with sequence-based decoding mechanisms. This architecture allows the system to identify and interpret
EasyOCR is a deep-learning OCR library supporting over 80 languages and working on images and video frames, fitting the request for an OCR engine with multilingual recognition, though it lacks built-in layout analysis or PDF input.
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Kreuzberg is a document extraction engine that uses Tesseract for OCR, providing text extraction from images, PDFs, and Office files via CLI and API, with layout and table extraction support, directly matching the need for an OCR engine.
Umi-OCR is an optical character recognition engine designed to convert visual text from images and documents into machine-readable character data. It functions as a local-first toolkit, processing all visual data directly on the host machine using embedded neural network models to maintain privacy and offline availability. The project distinguishes itself through its focus on automated document digitization and integrated barcode and QR code decoding. By utilizing a modular, Python-based orchestration layer, it enables users to transform static image files and multi-page documents into search
Umi-OCR is a local-first OCR engine with multilingual recognition via PaddleOCR and a modular Python stack, fitting your search for an open-source tool to extract text from images and documents, though layout analysis and training capabilities are not prominently featured.
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
PyMuPDF is a PDF manipulation library that can also run OCR through its Tesseract integration, which gives it multilingual text recognition and layout analysis. That makes it a suitable but multi-purpose choice for extracting text from scanned documents and images.
Tesseract.js is a JavaScript library that provides optical character recognition capabilities directly within web browsers and Node.js environments. It functions as a client-side engine, enabling the conversion of images containing printed text into machine-readable strings without the need for external APIs or server-side infrastructure. The library distinguishes itself by running the original C++ optical character recognition engine within the browser through WebAssembly modules. To maintain interface responsiveness during intensive computation, it utilizes background threads for parallel p
Tesseract.js is an OCR engine that extracts text from images in browsers and Node.js and supports multiple languages. It is a JavaScript library rather than a standalone CLI tool, and it lacks built-in PDF support and training capabilities, so it fits the core request for an OCR engine but leaves some feature gaps.
Nougat is a neural OCR system and LLM document parser designed to convert images of academic PDF documents into structured markdown text and mathematical formulas. It functions as a PDF to markdown converter that uses deep learning to handle layout and formula recognition. The project provides a document training pipeline for generating datasets and training neural networks to recognize specific academic document styles. This includes utilities for training dataset generation, neural model training, and model checkpoint management to ensure reproducible deployment. The system covers a broad
Nougat is a neural OCR engine designed to convert academic PDFs into structured markdown, but its narrow focus on scholarly documents limits its use as a general-purpose OCR for diverse scanned images and multilingual text.
DocTR is a deep learning OCR library built on PyTorch that detects and transcribes text in document images using a two-stage detection-recognition pipeline. It provides a complete framework for building and deploying OCR pipelines with pretrained models available through the Hugging Face Hub, and supports exporting trained models to ONNX format for cross-runtime deployment. The library offers end-to-end OCR pipelines that combine text detection and recognition to extract all text from document images or PDFs, with support for rotated page handling and varied text orientations. It includes cap
DocTR is a deep learning OCR library that provides a full framework for building and deploying document text extraction pipelines, with support for training, PDF/image input, and pretrained models. It fits the core need for an OCR engine, though it may lack a built-in command-line interface and explicit layout/table analysis out of the box.
This project is a terminal-based optical character recognition engine that uses neural network models to extract text and spatial layout data from images. It functions as both a command-line utility for automated text processing and a library for integrating machine learning-powered recognition into broader workflows. The engine distinguishes itself through a modular processing pipeline that supports custom model loading and memory-mapped weight initialization for efficient execution. It preserves document structure by tracking precise geometric coordinates for every detected text element, an
robertknight/ocrs is a Rust library and CLI tool that directly performs optical character recognition from images, making it a native OCR engine; while it may have narrower built-in language and layout support than full-featured engines like Tesseract, it squarely matches the core need for extracting text from images.
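To show why per-element coordinates matter for preserving document structure, here is a minimal sketch that rebuilds reading order from word bounding boxes. It does not reflect the ocrs API: `WordBox`, `group_into_lines`, and the `tolerance` parameter are hypothetical, and the grouping rule (vertical centers within a fraction of the line's first box height) is a simple assumption rather than the project's method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WordBox:
    """A recognized word with its top-left corner and size, in pixels."""
    text: str
    x: float
    y: float
    width: float
    height: float


def group_into_lines(words: List[WordBox], tolerance: float = 0.5) -> List[str]:
    """Group words into text lines by vertical center, then order left to right."""
    lines: List[List[WordBox]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        center = word.y + word.height / 2
        for line in lines:
            ref = line[0]
            ref_center = ref.y + ref.height / 2
            # Same line if the centers differ by less than a fraction of the height.
            if abs(center - ref_center) <= tolerance * ref.height:
                line.append(word)
                break
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


words = [
    WordBox("world", x=60, y=11, width=40, height=10),
    WordBox("Hello", x=10, y=10, width=40, height=10),
    WordBox("Next", x=10, y=30, width=30, height=10),
]
print(group_into_lines(words))  # ['Hello world', 'Next']
```

Without the coordinates, the three words above would arrive in detection order ("world", "Hello", "Next") and the original line structure could not be recovered.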