30 open-source projects similar to mindee/doctr, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best docTR alternative.
mmocr is a PyTorch-based optical character recognition framework designed for training and deploying text detection, recognition, and key information extraction models. It serves as a comprehensive toolbox for scene text detection and recognition, providing specialized libraries for locating text regions and converting visual text into machine-encoded strings. The project distinguishes itself through a research framework for key information extraction and advanced text spotting capabilities. These include point-based spotting using transformers and the use of parameterized Bezier curves to id
PaddleX is a PaddlePaddle-based framework for building, deploying, and fine-tuning AI model pipelines, with pre-built support for computer vision, OCR, document analysis, and time series tasks. It offers a toolkit of ready-to-use pipelines for image classification, object detection, segmentation, and pose estimation, alongside an end-to-end OCR document analysis pipeline that extracts text, tables, formulas, and layout information. The platform also includes a dedicated time series forecasting pipeline for analyzing historical data to detect anomalies, classify patterns, and predict future val
ddddocr is a Python library for automated image analysis, focused on extracting text and detecting objects from visual content. Its core capabilities include character recognition that can handle alphanumeric, Chinese, and special characters, as well as object detection that returns bounding box coordinates for targets within images. The library provides specialized support for solving slider CAPTCHAs by identifying the position of missing pieces using edge matching or image comparison algorithms. It also offers image preprocessing through color-based filtering to reduce noise from complex ba
This library provides a comprehensive collection of modular building blocks and research-backed architectures for implementing vision transformers within the PyTorch framework. It serves as a centralized repository for constructing, training, and analyzing attention-based models, offering a wide array of specialized variants designed for image classification and visual representation learning. The project distinguishes itself through a focus on architectural efficiency and flexibility, supporting diverse input formats including non-square images and volumetric data like video. It incorporates
YOLO-World is a vision-language framework and open-vocabulary object detection model. It identifies objects in images and video based on free-form text prompts without requiring predefined category labels. The system enables the identification of arbitrary objects by fusing image features with text embeddings. It includes a specialized tool for automated image labeling, which generates bounding box annotations for custom datasets using text-based prompts. The project provides a deployment pipeline for converting models into quantized ONNX and TFLite formats, supporting real-time inference on
This project is a library of pretrained computer vision architectures and backbones for image classification and feature extraction. It serves as a comprehensive model zoo and collection of standardized image encoders, including ResNet, Vision Transformers, and EfficientNet, for use in visual analysis and as backbones for object detection and image segmentation. The library provides a framework for distributed training and evaluation of image models using advanced data augmentation and optimization scripts. It includes a dedicated toolset for converting trained PyTorch vision models into the
This is a PyTorch implementation of EfficientNet convolutional neural networks. It serves as a computer vision model library providing architectures for image classification and high-level feature extraction, including pre-trained weights for immediate image categorization. The library supports transfer learning by allowing the modification of model architectures and output layers to accommodate a custom number of classes for new datasets. It also includes a model exporter to convert trained PyTorch weights into the ONNX format for production inference. The system covers broader computer vis
Kornia is a differentiable computer vision library and cross-framework tensor vision toolset. It implements vision operations as differentiable tensor operations to enable integration into deep learning pipelines and supports the transpilation of operations across PyTorch, TensorFlow, JAX, and NumPy. The project provides specialized toolsets for geometric vision and stereo depth, including algorithms for 3D scene reconstruction, camera calibration, and pose estimation. It further distinguishes itself as a differentiable image augmentation framework, applying random geometric and color transformations w
PaddleDetection is an object detection framework designed for the end-to-end development, training, and deployment of computer vision models. It provides a comprehensive library of modular neural network architectures and pipelines that support object detection, instance segmentation, and multi-object tracking tasks. The project distinguishes itself through a configuration-driven approach that decouples model components like backbones and heads, allowing for the flexible assembly of custom vision workflows. It incorporates advanced techniques such as anchor-free detection logic, joint detecti
SAHI is a sliced inference framework and computer vision pipeline designed to detect small objects in high-resolution images. It provides a system for dividing large images into overlapping patches to prevent the detail loss that typically occurs during standard model downscaling, alongside an image tiling utility and a COCO dataset toolkit. The project distinguishes itself by offering a model-agnostic prediction wrapper that standardizes different machine learning frameworks into a unified interface. This allows it to implement sliced inference and object detection across various model backe
mmaction2 is a PyTorch video understanding toolbox designed for training and evaluating deep learning models. It serves as a framework for action recognition, temporal localization, and spatio-temporal action detection, providing specialized tools for both pixel-based video analysis and skeleton-based action recognition. The project distinguishes itself through a modular architecture featuring registry-based component discovery and hierarchical, config-driven model assembly. It supports multi-modal feature fusion, integrating RGB frames, optical flow, and audio, and includes capabilities for
Darknet is a high-performance C-based inference engine and computer vision library designed for real-time object identification and localization. It serves as a neural network framework for training and deploying detection models using the YOLO architecture, providing a toolset for deep learning training and deployment. The project differentiates itself through a C and CUDA implementation that enables hardware acceleration for matrix multiplication and inference speed optimization. It provides a shared library interface for embedding detection capabilities into external applications and suppo
This project provides a deep learning architecture designed to identify and isolate distinct objects within images by generating precise pixel-level masks. It functions as a browser-based inference engine, enabling the execution of complex machine learning models directly within web environments without requiring server-side processing. The system distinguishes itself by utilizing hardware-accelerated execution and parallel processing to achieve real-time segmentation speeds. It supports prompt-based mask decoding, allowing users to generate spatial masks by providing specific points or boxes
This project is a modular research toolkit designed for developing, training, and evaluating deep learning models for object detection, segmentation, and video instance tracking. It provides a flexible training engine that manages complex neural network execution, including distributed training, custom lifecycle hooks, and weight optimization. The framework is built around a hierarchical configuration system that allows users to define architectures, data pipelines, and training hyperparameters through composable, inheritable files. The project distinguishes itself through its highly modular
GoCV is a computer vision library and Go language binding for OpenCV. It serves as an image processing toolkit and deep learning inference engine, providing programmatic access to a wide range of algorithms for image manipulation, object detection, and video analysis. The project differentiates itself through high-performance native bindings and hardware acceleration. It utilizes a foreign function interface to map Go calls to C++ functions and includes a hardware-agnostic backend dispatch to route neural network tasks to computation engines such as CUDA and OpenVINO. The library covers a br
RapidOCR is an offline deep-learning OCR engine that detects and recognizes text in images using ONNX Runtime, operating entirely without an internet connection. It provides a unified inference pipeline that runs across multiple platforms including Windows, Linux, macOS, Android, and Raspberry Pi, with programming language bindings for Python, C++, Java, and C#. The engine separates text detection and recognition into independent modules that can be swapped or fine-tuned individually, and abstracts the inference backend behind a unified interface allowing seamless switching between ONNX Runti
A fast, helpful, and open-source document parser
This is a structured deep learning curriculum for programmers, delivered as a collection of Jupyter notebooks. It teaches the fundamentals of training neural networks for computer vision, natural language processing, tabular data analysis, and collaborative filtering using PyTorch and the fastai library. The course is designed to be hands-on, guiding learners from building a training loop from scratch to fine-tuning pretrained models for a variety of practical tasks. The curriculum distinguishes itself by covering the full lifecycle of a deep learning project, from data preparation and augmen
AllenNLP is a PyTorch-based research library and deep learning language toolkit designed for developing and training neural network architectures for linguistic tasks. It provides a distributed training system that coordinates data and gradients across multiple GPUs and a framework for integrating pretrained transformer architectures. The system distinguishes itself with a dedicated algorithmic bias mitigation tool used to identify and reduce bias in linguistic model predictions. It also includes model influence analysis to interpret predictions by calculating the influence of specific traini
This project is a PyTorch person re-identification framework designed for training and evaluating models that identify individuals across different camera views. It provides a complete model training pipeline, a deep learning feature extractor for converting images into numeric vectors, and a suite of computer vision benchmarking tools to measure identity retrieval accuracy. The framework includes a specialized transfer learning toolkit that supports layer freezing, staged learning rate optimization, and differential learning rates for fine-tuning pretrained models. It distinguishes itself th
PandaOCR is a desktop application for extracting text from images and screen captures using optical character recognition. It functions as a mathematical formula digitizer, a table data extractor, a multilingual translation utility, and a text-to-speech interface. The project distinguishes itself through specialized recognition routing that distributes data across different providers based on whether the content is standard text, tables, or formulas. It provides real-time software interface localization by rendering translated text layers directly over active application windows using coordin
This repository provides the pre-trained neural network and legacy data files used by Tesseract to recognize and extract printed text from images. It serves as a multilingual training data repository and a collection of Long Short-Term Memory models designed for high-accuracy optical character recognition across various global scripts and languages. The data includes specialized models for analyzing image layouts to determine text rotation and script direction. It provides the necessary language-specific datasets and linguistic patterns required to enable Tesseract OCR engines to function. T
BigDL is a PyTorch acceleration framework and distributed inference engine designed for large language models. It provides a toolkit for running models on Intel hardware, integrating quantization tools and libraries for parameter-efficient fine-tuning. The project distinguishes itself through the use of pipeline parallelism to distribute model workloads across multiple hardware accelerators. It utilizes low-bit integer quantization and speculative decoding to reduce memory footprints and decrease text generation latency. The system covers broad capabilities in model optimization, including w
Kaolin is a PyTorch 3D deep learning library providing a comprehensive suite of tools for 3D geometry processing, physics simulation, data visualization, and gradient-based rendering for computer vision. The library includes a differentiable 3D renderer and a geometry processing toolkit for converting and transforming 3D representations such as meshes and point clouds. It also features a 3D physics simulation engine to calculate physical interactions and collisions between three-dimensional objects and scenes. The toolkit provides utilities for 3D data visualization, including the creation o
This project is a CAPTCHA solver browser extension that automatically detects and resolves image, text, and behavioral challenges using an AI inference engine. It functions as a bot detection bypass tool designed to overcome interactive web barriers and session timeouts to maintain access to protected websites. The extension provides a bridge between automated solving capabilities and external programming languages or browser automation frameworks via an API integration. It utilizes an AI-powered optical character recognition system to transcribe text from images and auditory challenges into
CV-Backbones is a computer vision backbone library and model zoo providing a collection of pre-defined neural network architectures for extracting visual features and processing image data. It serves as a PyTorch vision framework of reusable deep learning components designed for image analysis and visual representation learning. The library focuses on efficient neural network architectures to reduce computational overhead while maintaining feature extraction performance. This is achieved through the implementation of lightweight model designs such as GhostNet and MLP. The project covers a br
This project is a deep learning curriculum delivered as a collection of PyTorch tutorials. It provides a structured set of technical documents and runnable notebooks that translate theoretical machine learning concepts into executable code. The repository includes implementation guides for various neural network architectures, specifically covering convolutional, recurrent, and transformer-based models. It provides practical examples for building computer vision pipelines for object detection and semantic segmentation, as well as natural language processing tools f
chineseocr_lite is a lightweight Chinese optical character recognition engine designed to detect text regions, analyze orientation, and convert Chinese characters from images into digital text. It supports both horizontal and vertical reading layouts and can be deployed as a web service for image uploads and result visualization. The system utilizes a multi-backend inference framework that supports ncnn, mnn, and tnn, allowing it to run across diverse hardware and platforms. It is specifically engineered for lightweight deployment on mobile and desktop environments through the use of small mo
SimSwap is a deep learning face swapping framework and computer vision media processor built with PyTorch. It functions as an image synthesis tool designed to replace a person's identity in images and videos with a target face using a single trained model. The system operates as a video identity replacement tool that swaps identities across frames while preserving the original expressions and lighting of the source media. It enables digital identity manipulation and the production of synthetic media through automated facial feature mapping. The framework supports both the application of trai
chineseocr is an end-to-end deep learning pipeline for detecting and recognizing Chinese and English text in images. The project combines text region detection using YOLOv3 with sequence-based recognition via Convolutional Recurrent Neural Networks (CRNN) and dense OCR models, forming a complete optical character recognition workflow. The pipeline includes orientation detection to handle text rotated at 0, 90, 180, or 270 degrees before recognition, and supports structured field extraction from identity cards and train tickets. A multi-framework model converter enables trained models to be co