DocTR

DocTR is a deep learning OCR library built on PyTorch that detects and transcribes text in document images using a two-stage detection-recognition pipeline. It provides a complete framework for building and deploying OCR pipelines with pretrained models available through the Hugging Face Hub, and supports exporting trained models to ONNX format for cross-runtime deployment.
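
The two-stage design can be illustrated with a small, framework-free sketch. A detector callable returns text boxes in relative coordinates, each box is cropped from the image, and a recognizer callable transcribes the crop. The names `run_two_stage`, `WordPrediction`, `stub_detect`, and `stub_recognize` are hypothetical and only show the data flow. In docTR itself, the two stages are combined by `doctr.models.ocr_predictor`, which works on real image arrays and trained neural models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box format: relative (xmin, ymin, xmax, ymax), each value in [0, 1].
Box = Tuple[float, float, float, float]
Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class WordPrediction:
    box: Box
    text: str


def run_two_stage(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[Image], str],
) -> List[WordPrediction]:
    """Stage 1 locates text regions; stage 2 transcribes each cropped region."""
    height, width = len(image), len(image[0])
    predictions = []
    for box in detect(image):
        xmin, ymin, xmax, ymax = box
        # Scale relative coordinates to pixel indices, then crop the region.
        x0, x1 = int(xmin * width), int(xmax * width)
        y0, y1 = int(ymin * height), int(ymax * height)
        crop = [row[x0:x1] for row in image[y0:y1]]
        predictions.append(WordPrediction(box=box, text=recognize(crop)))
    return predictions


# Usage with stubs: a fixed two-box "detector" and a "recognizer" that reports
# the crop width, just to make the data flow visible.
def stub_detect(img: Image) -> List[Box]:
    return [(0.0, 0.0, 0.25, 1.0), (0.25, 0.0, 1.0, 1.0)]


def stub_recognize(crop: Image) -> str:
    return f"{len(crop[0])}px"


toy_image = [[0] * 8 for _ in range(4)]
print([p.text for p in run_two_stage(toy_image, stub_detect, stub_recognize)])
# → ['2px', '6px']
```

Keeping detection and recognition as separate callables is what lets either stage be swapped (for example, a different detector architecture) without touching the other.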

The library offers end-to-end OCR pipelines that combine text detection and recognition to extract all text from document images or PDFs, with support for rotated page handling and varied text orientations. It includes capabilities for document layout analysis using transformer-based detectors, key information extraction that combines detection, recognition, and layout analysis to extract structured data, and document image classification using standard CNN architectures. Text detection is performed using segmentation-based detectors like DBNet and LinkNet, while text recognition uses sequence recognition models such as CRNN, SAR, and MASTER, with optional vocabulary restriction for character set control.
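
Vocabulary restriction is easiest to see in CTC decoding, the scheme used by CRNN-style recognizers. At each time step, the model scores every vocabulary character plus a special blank. Greedy decoding keeps the best class per step, merges consecutive repeats, and drops blanks. The sketch below is a generic greedy CTC decoder, not docTR's implementation. The function name `ctc_greedy_decode`, the helper `one_hot`, and the convention that the blank is the last class are assumptions. Because output indices map only into `vocab`, a model built with a digits-only vocabulary can never emit letters. docTR's own recognition models take a `vocab` string, and predefined sets live in `doctr.datasets.VOCABS`. SAR and MASTER decode with attention rather than CTC, but the vocabulary bounds their output in the same way.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(scores: Sequence[Sequence[float]], vocab: str) -> str:
    """Greedy CTC decoding over per-time-step class scores.

    Assumption: each time step has len(vocab) + 1 scores, and the last
    index (len(vocab)) is the CTC blank symbol.
    """
    blank = len(vocab)
    decoded: List[str] = []
    previous: Optional[int] = None
    for step in scores:
        if len(step) != len(vocab) + 1:
            raise ValueError("expected one score per vocab character plus a blank")
        best = max(range(len(step)), key=lambda i: step[i])
        # Emit only non-blank classes that differ from the previous step;
        # a blank between two identical characters keeps both of them.
        if best != blank and best != previous:
            decoded.append(vocab[best])
        previous = best
    return "".join(decoded)


def one_hot(index: int, size: int = 11) -> List[float]:
    """Hypothetical helper: a fully confident score vector for the example."""
    return [1.0 if i == index else 0.0 for i in range(size)]


digits = "0123456789"  # restricted vocabulary: only digits can be produced
steps = [one_hot(1), one_hot(1), one_hot(10), one_hot(1), one_hot(2)]
print(ctc_greedy_decode(steps, digits))  # → 112
```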

DocTR provides multiple deployment options including FastAPI-based REST API serving for remote document processing, command-line tools for script-based analysis, and Docker container deployment for consistent environments. It supports document input from images, PDFs, and URLs through a unified loading interface, and offers post-processing capabilities including prediction visualization, document reconstruction, and structured JSON export. The library also includes model benchmarking tools for comparing custom architectures against pretrained models on standard datasets.
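
The word, line, block, page, and document hierarchy behind the structured JSON export can be modelled with nested dataclasses and serialized with the standard library. This is a simplified illustration only. The class and field names (`Word`, `Line`, `Block`, `Page`, `Document`, `value`, `confidence`, `geometry`, `page_idx`) and the methods `to_text` and `to_json` are assumptions chosen for readability. They may not match the exact schema produced by docTR's own export.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple

# Assumed geometry convention: relative ((xmin, ymin), (xmax, ymax)) in [0, 1].
Geometry = Tuple[Tuple[float, float], Tuple[float, float]]


@dataclass
class Word:
    value: str
    confidence: float
    geometry: Geometry


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)


@dataclass
class Block:
    lines: List[Line] = field(default_factory=list)


@dataclass
class Page:
    page_idx: int
    blocks: List[Block] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)

    def to_text(self) -> str:
        """Join words with spaces, lines with newlines, and blocks with blank lines."""
        return "\n\n".join(
            "\n".join(" ".join(word.value for word in line.words) for line in block.lines)
            for page in self.pages
            for block in page.blocks
        )

    def to_json(self) -> str:
        # asdict() recurses through the nested dataclasses; json turns tuples into lists.
        return json.dumps(asdict(self), indent=2)


hello = Word("Hello", 0.99, ((0.10, 0.10), (0.30, 0.15)))
world = Word("world", 0.97, ((0.32, 0.10), (0.50, 0.15)))
doc = Document(pages=[Page(page_idx=0, blocks=[Block(lines=[Line(words=[hello, world])])])])
print(doc.to_text())  # → Hello world
print(doc.to_json())
```

Storing geometry in relative coordinates keeps the exported structure independent of the resolution at which each page was rendered.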

Features

OCR Libraries - Ships a complete deep learning OCR library with pretrained detection and recognition models on PyTorch.

End-to-End Pipelines - Combines detection and recognition models into a complete end-to-end OCR pipeline.

PyTorch-Based Frameworks - All models are built and trained using PyTorch, with optional GPU acceleration.

OCR Frameworks - Provides a complete PyTorch-based framework for building and deploying OCR pipelines with pretrained models.

Segmentation-Based Detectors - Locates text regions in document images using segmentation-based detectors like DBNet and LinkNet with various backbones.

Text Recognition - Transcribes characters in cropped text regions into strings using a choice of pretrained recognition models.

Document Text Recognition Toolkits - Provides a full toolkit for extracting text from scanned documents using trainable detection and recognition models.

Two-Stage Pipelines - Implements a two-stage pipeline that first detects text regions then transcribes each one.

Detection Result Exporters - Provides visualization of detected text overlays and exports OCR predictions as structured JSON.

Rotated Detections - Detects and returns rotated bounding boxes for text to accommodate varied page orientations.

Document Information Extraction - Performs end-to-end key information extraction by combining detection, recognition, and layout analysis on document images.

ONNX Model Exporters - Converts trained PyTorch models into the ONNX interchange format for deployment in non-PyTorch environments.

Document Layout - Identifies structural regions like paragraphs, tables, and figures using a transformer-based detector.

Tree Representations - Structures detected text into a hierarchy of words, lines, blocks, pages, and documents.

Document Region Detectors - Detects structural regions with transformer-based detectors supporting straight or rotated bounding boxes.

PDF Document Importers - Reads PDFs from file paths, byte streams, or URLs and decodes each page into a numpy array.

Document Entity Extractors - Extracts structured entities such as dates and addresses from document images using combined detection and recognition.

OCR REST API Servers - Ships a FastAPI-based REST endpoint that accepts documents and returns OCR results.

OCR Deployments - Integrates the detection-recognition predictor into browser demos or API endpoints with minimal setup.

OCR Model Hubs - Loads pretrained OCR model checkpoints directly from the Hugging Face Hub for immediate use.

Transformer-Based Detectors - Uses transformer-based detectors for identifying structural regions in document layout analysis.

Computer Vision - Applies deep learning computer vision models to document text recognition and analysis.
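
Rotated detections are commonly described either as a center, size, and angle or as a four-point polygon. The sketch below converts the first form into the second with a standard 2D rotation matrix, then computes the smallest axis-aligned box that contains the polygon. The function names `rotated_box_corners` and `enclosing_straight_box` and the center/size/angle input format are assumptions for illustration. In docTR, passing `assume_straight_pages=False` to `ocr_predictor` makes the pipeline return rotated polygons instead of straight boxes.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotated_box_corners(cx: float, cy: float, w: float, h: float, angle_deg: float) -> List[Point]:
    """Corners of a w x h box centered at (cx, cy) and rotated by angle_deg.

    Corners are returned top-left, top-right, bottom-right, bottom-left
    (before rotation). In image coordinates, where y points down, a positive
    angle appears as a clockwise rotation.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    # Apply the 2D rotation matrix to each corner offset, then shift by the center.
    return [(cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t) for dx, dy in offsets]


def enclosing_straight_box(corners: List[Point]) -> Tuple[float, float, float, float]:
    """Smallest axis-aligned (xmin, ymin, xmax, ymax) box containing all corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


# A 4 x 2 word box rotated by 90 degrees spans 2 units horizontally and 4 vertically.
corners = rotated_box_corners(0.0, 0.0, 4.0, 2.0, 90.0)
print(enclosing_straight_box(corners))  # approximately (-1.0, -2.0, 1.0, 2.0)
```

The enclosing straight box is a lossy approximation. For strongly rotated text, it covers much more background than the polygon, which is why rotated polygons are useful on skewed pages.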

mindee/doctr
