What are the best open-source GitHub repositories for an OCR engine for extracting text from documents?

tesseract-ocr/tesseract is the closest match: Tesseract is the flagship open-source OCR engine, supporting over 100 languages with neural-network-based text recognition, layout analysis, image preprocessing, and both CLI and library APIs, which matches the need for accurate, trainable OCR from scanned documents and images. Other strong matches: UB-Mannheim/tesseract, PaddlePaddle/PaddleOCR, chineseocr/chineseocr, breezedeus/Pix2Text.

OCR Text Extraction Engines

High-performance open-source libraries and tools designed to recognize and extract text from digital documents.


tesseract-ocr/tesseract
74,751 (View on GitHub)
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts. The project distinguishes itself through a sophisticated document layout analysis f
Tesseract is the flagship open-source OCR engine, supporting over 100 languages with neural-network-based text recognition, layout analysis, image preprocessing, and both CLI and library APIs, directly matching your need for accurate, trainable OCR from scanned documents and images.
C++ | Multilingual Text Recognition | OCR Language Support | Script and Orientation Detectors
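As a rough illustration of the command-line route described in this entry, the sketch below assembles a `tesseract` invocation from Python. It assumes the `tesseract` binary is on the PATH and that the trained data for the requested language (for example `ara` for Arabic) is installed. The helper names `build_tesseract_command` and `run_ocr` are hypothetical and are not part of Tesseract itself.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(
    image_path: str, lang: str = "eng", psm: Optional[int] = None
) -> List[str]:
    """Build the argument list for: tesseract <image> stdout -l <lang> [--psm N]."""
    # "stdout" as the output base tells the CLI to print the text instead of writing a file.
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        # --psm selects the page segmentation mode.
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image_path: str, lang: str = "eng", psm: Optional[int] = None) -> Optional[str]:
    """Run the CLI and return the recognized text, or None if tesseract is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A call such as `run_ocr("scan.png", lang="ara")` would then return the recognized text for a hypothetical file `scan.png`, provided the Arabic language data is available.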
UB-Mannheim/tesseract
4,111 (View on GitHub)
Tesseract is an optical character recognition engine and tool designed to convert printed or handwritten text from images into machine-readable digital text. It functions as a multilingual text extractor and a document digitization pipeline that transforms scanned images into structured digital formats. The project includes a framework for training custom scripts and language-specific models, allowing the engine to recognize new languages or unique fonts through custom training data. Its capabilities cover automated text extraction, digital archive digitization, and the export of recognized
Tesseract is the industry-standard open-source OCR engine with built-in multilingual recognition and training capability, a command-line interface and API, and sufficient layout analysis for most scanned documents, directly matching this search's core need.
C++ | Multilingual Text Recognition | OCR Model Customizers
PaddlePaddle/PaddleOCR
82,412 (View on GitHub)
PaddleOCR is a comprehensive optical character recognition framework designed for detecting and transcribing text from images and documents into structured, machine-readable formats. It provides a modular computer vision pipeline that decouples image preprocessing, text detection, and character recognition into independent, configurable stages. This architecture supports automated document digitization and multilingual text recognition, capable of identifying text in over one hundred languages across diverse environments ranging from scanned documents to industrial scenes. The framework disti
PaddleOCR is a comprehensive open-source OCR framework that directly delivers multilingual text recognition, layout analysis and table extraction, image preprocessing, PDF and image input, a command-line interface and Python API, and support for model training, making it a full-featured engine for extracting text from scanned documents.
Python | Multilingual Text Recognition
chineseocr/chineseocr
6,113 (View on GitHub)
chineseocr is an end-to-end deep learning pipeline for detecting and recognizing Chinese and English text in images. The project combines text region detection using YOLOv3 with sequence-based recognition via Convolutional Recurrent Neural Networks (CRNN) and dense OCR models, forming a complete optical character recognition workflow. The pipeline includes orientation detection to handle text rotated at 0, 90, 180, or 270 degrees before recognition, and supports structured field extraction from identity cards and train tickets. A multi-framework model converter enables trained models to be co
This repository is an end-to-end OCR pipeline for Chinese and English text using deep learning, which fits the OCR engine category, but it lacks the broad multilingual support, layout analysis, table extraction, and command-line interface this search is asking for.
Python | Text Orientation Detection | Dense OCR Models
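The orientation step mentioned in this entry (text rotated at 0, 90, 180, or 270 degrees) can be pictured with a small sketch. This is not chineseocr's code: it assumes some detector has already reported the clockwise angle by which the page was rotated, and it represents the image as a nested list of pixel values so the example stays self-contained. The names `rotate_cw` and `undo_rotation` are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[int]]


def rotate_cw(grid: Sequence[Sequence[int]]) -> Grid:
    """Rotate a 2D grid by 90 degrees clockwise."""
    # Reversing the rows and then transposing yields a clockwise quarter turn.
    return [list(column) for column in zip(*grid[::-1])]


def undo_rotation(grid: Sequence[Sequence[int]], detected_angle: int) -> Grid:
    """Return the grid upright, given the clockwise angle it was rotated by."""
    if detected_angle not in (0, 90, 180, 270):
        raise ValueError("detected_angle must be 0, 90, 180, or 270")
    # Completing the full circle with clockwise quarter turns undoes the rotation.
    remaining_turns = (4 - detected_angle // 90) % 4
    result: Grid = [list(row) for row in grid]
    for _ in range(remaining_turns):
        result = rotate_cw(result)
    return result
```

In a real pipeline the same idea would be applied to an image array before the recognition model runs, so the recognizer only ever sees upright text.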
breezedeus/Pix2Text
3,012 (View on GitHub)
Pix2Text is an optical character recognition system and document conversion tool designed to transform images and PDFs into Markdown. It functions as a multilingual OCR engine supporting over 80 languages, a LaTeX formula recognizer for mathematical notations, and a parser integrated with vision language models. The project utilizes a hybrid pipeline to separate plain text from mathematical formulas and tabular structures within a single pass. It converts recognized formulas into LaTeX expressions and transforms detected tables and layouts into structured Markdown formatting. The system incl
Pix2Text is a multilingual OCR and document-conversion system that extracts text, math formulas, and tables from images and PDFs into Markdown, which directly matches your need for an open-source OCR engine with multilingual recognition, layout analysis, and table extraction.
Jupyter Notebook | Multilingual OCR Systems | Multilingual Text Recognition
rednote-hilab/dots.ocr
7,695 (View on GitHub)
dots.ocr is a suite of software utilities for document layout analysis, multilingual optical character recognition, and scene text digitization. It functions as an engine for extracting digital text and structured layout data from images and PDFs across various human scripts. The project includes a specialized transformer for converting charts, diagrams, and chemical formulas from raster images into scalable vector graphics. It also provides a pipeline to transform extracted text and structural layout from documents and web screenshots into formatted Markdown files. The system covers capabil
dots.ocr is a multilingual OCR engine that performs text extraction from images and PDFs with layout analysis, matching your need for an OCR engine; it may not expose all requested features like preprocessing or training, but it is the right kind of tool.
Python | Multilingual OCR Systems | Multilingual Text Recognition
RapidAI/RapidOCR
5,968 (View on GitHub)
RapidOCR is an offline deep-learning OCR engine that detects and recognizes text in images using ONNX Runtime, operating entirely without an internet connection. It provides a unified inference pipeline that runs across multiple platforms including Windows, Linux, macOS, Android, and Raspberry Pi, with programming language bindings for Python, C++, Java, and C#. The engine separates text detection and recognition into independent modules that can be swapped or fine-tuned individually, and abstracts the inference backend behind a unified interface allowing seamless switching between ONNX Runti
RapidOCR is an offline deep-learning OCR engine with modular detection and recognition, offering cross‑platform support and language bindings — the right kind of tool for extracting text from images, though layout analysis and table extraction are not highlighted in its description.
Python | OCR Language Support | OCR Model Customizers
DayBreak-u/chineseocr_lite
12,324 (View on GitHub)
chineseocr_lite is a lightweight Chinese optical character recognition engine designed to detect text regions, analyze orientation, and convert Chinese characters from images into digital text. It supports both horizontal and vertical reading layouts and can be deployed as a web service for image uploads and result visualization. The system utilizes a multi-backend inference framework that supports ncnn, mnn, and tnn, allowing it to run across diverse hardware and platforms. It is specifically engineered for lightweight deployment on mobile and desktop environments through the use of small mo
This is a lightweight OCR engine specialised for Chinese text recognition, supporting text detection, orientation analysis, and web service deployment; it fits the search for an OCR engine but focuses on Chinese rather than full multilingual support and lacks table extraction and training features.
C++ | Text Orientation Detection
JaidedAI/EasyOCR
29,615 (View on GitHub)
EasyOCR is a deep learning-based computer vision library designed to perform optical character recognition on images and video frames. It functions as a comprehensive pipeline that automates the transformation of visual text into machine-readable strings, enabling the digitization of physical documents, forms, and receipts into searchable data. The engine distinguishes itself through a multi-stage processing workflow that combines convolutional neural networks for spatial feature extraction with sequence-based decoding mechanisms. This architecture allows the system to identify and interpret
EasyOCR is a deep-learning OCR library supporting over 80 languages and working on images and video frames, fitting the request for an OCR engine with multilingual recognition, though it lacks built-in layout analysis or PDF input.
Python | Multilingual OCR Systems
kreuzberg-dev/kreuzberg
8,527 (View on GitHub)
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Kreuzberg is a document extraction engine that uses Tesseract for OCR, providing text extraction from images, PDFs, and Office files via CLI and API, with layout and table extraction support, directly matching the need for an OCR engine.
Rust | Document Extraction Engines | Multi-Format Parsers | Text Extraction
hiroi-sora/Umi-OCR
45,273 (View on GitHub)
Umi-OCR is an optical character recognition engine designed to convert visual text from images and documents into machine-readable character data. It functions as a local-first toolkit, processing all visual data directly on the host machine using embedded neural network models to maintain privacy and offline availability. The project distinguishes itself through its focus on automated document digitization and integrated barcode and QR code decoding. By utilizing a modular, Python-based orchestration layer, it enables users to transform static image files and multi-page documents into search
Umi-OCR is a local-first OCR engine with multilingual recognition via PaddleOCR and a modular Python stack, fitting your search for an open-source tool to extract text from images and documents, though layout analysis and training capabilities are not prominently featured.
Python | Optical Character Recognition | Local Inference Engines | Document Analysis Tools
pymupdf/PyMuPDF
9,086 (View on GitHub)
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
PyMuPDF is a PDF manipulation library that also functions as an OCR engine, supporting multilingual text recognition and layout analysis via Tesseract integration, making it a suitable but multi-purpose choice for extracting text from scanned documents and images.
Python | PDF Manipulation Utilities | Text Extraction | Annotation Data Extraction
naptha/tesseract.js
38,141 (View on GitHub)
Tesseract.js is a JavaScript library that provides optical character recognition capabilities directly within web browsers and Node.js environments. It functions as a client-side engine, enabling the conversion of images containing printed text into machine-readable strings without the need for external APIs or server-side infrastructure. The library distinguishes itself by running the original C++ optical character recognition engine within the browser through WebAssembly modules. To maintain interface responsiveness during intensive computation, it utilizes background threads for parallel p
Tesseract.js is an OCR engine that extracts text from images in browsers and Node.js, supporting multiple languages, but it is a JavaScript library rather than a standalone CLI tool and lacks built-in PDF support and training capabilities, so it fits the core request for an OCR engine with some feature gaps.
JavaScript | Optical Character Recognition Libraries | Web-Based Text Recognition | WebAssembly
facebookresearch/nougat
10,015 (View on GitHub)
Nougat is a neural OCR system and LLM document parser designed to convert images of academic PDF documents into structured markdown text and mathematical formulas. It functions as a PDF to markdown converter that uses deep learning to handle layout and formula recognition. The project provides a document training pipeline for generating datasets and training neural networks to recognize specific academic document styles. This includes utilities for training dataset generation, neural model training, and model checkpoint management to ensure reproducible deployment. The system covers a broad
Nougat is a neural OCR engine designed to convert academic PDFs into structured markdown, but its narrow focus on scholarly documents limits its use as a general-purpose OCR for diverse scanned images and multilingual text.
Python | PDF to Markdown Converters | End-to-End Document Parsers | Image-to-Text Transformers
mindee/doctr
6,149 (View on GitHub)
DocTR is a deep learning OCR library built on PyTorch that detects and transcribes text in document images using a two-stage detection-recognition pipeline. It provides a complete framework for building and deploying OCR pipelines with pretrained models available through the Hugging Face Hub, and supports exporting trained models to ONNX format for cross-runtime deployment. The library offers end-to-end OCR pipelines that combine text detection and recognition to extract all text from document images or PDFs, with support for rotated page handling and varied text orientations. It includes cap
DocTR is a deep learning OCR library that provides a full framework for building and deploying document text extraction pipelines, with support for training, PDF/image input, and pretrained models—fitting the core need for an OCR engine, though it may lack a built-in command-line interface and explicit layout/table analysis out of the box.
Python | OCR Libraries | Document Text Recognition Toolkits | End-to-End Pipelines
robertknight/ocrs
1,843 (View on GitHub)
This project is a terminal-based optical character recognition engine that uses neural network models to extract text and spatial layout data from images. It functions as both a command-line utility for automated text processing and a library for integrating machine learning-powered recognition into broader workflows. The engine distinguishes itself through a modular processing pipeline that supports custom model loading and memory-mapped weight initialization for efficient execution. It preserves document structure by tracking precise geometric coordinates for every detected text element, an
robertknight/ocrs is a Rust library and CLI tool that directly performs optical character recognition from images, making it a native OCR engine; while it may have narrower built-in language and layout support than full-featured engines like Tesseract, it squarely matches the core need for extracting text from images.
Rust | OCR Command Line Interfaces | Document Spatial Coordinate Outputs | Image Text Extractions
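To show why per-element coordinates are useful, here is a minimal sketch of turning word boxes into reading-order lines. It is not taken from ocrs: the `WordBox` type, the `group_into_lines` helper, and the "same line if the vertical centers are within half a box height" rule are all assumptions chosen for illustration, and the input is assumed to be upright, single-column text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    """A recognized word with its axis-aligned bounding box (hypothetical type)."""
    text: str
    x: float
    y: float
    width: float
    height: float

    @property
    def center_y(self) -> float:
        return self.y + self.height / 2


def group_into_lines(words: List[WordBox]) -> List[str]:
    """Group word boxes into text lines, top to bottom, then left to right."""
    lines: List[List[WordBox]] = []
    for word in sorted(words, key=lambda w: w.center_y):
        previous = lines[-1][-1] if lines else None
        # Treat the word as part of the current line when its vertical center
        # lies within half a box height of the previous word's center.
        if previous is not None and abs(word.center_y - previous.center_y) <= previous.height / 2:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words by their left edge.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]
```

Right-to-left scripts such as Arabic would need the within-line sort reversed, and multi-column pages would need a layout step before this grouping.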
