DocTR is a deep learning OCR library built on PyTorch that detects and transcribes text in document images using a two-stage pipeline: a detection model first localizes text regions, and a recognition model then transcribes each region. It provides a framework for building and deploying OCR pipelines, with pretrained models available through the Hugging Face Hub, and supports exporting trained models to ONNX format for cross-runtime deployment.
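As a rough illustration of the two-stage pipeline, the sketch below builds a pretrained predictor and transcribes a list of images. It assumes docTR is installed with its PyTorch backend and that the `db_resnet50` and `crnn_vgg16_bn` architectures are available in the installed version. The helper names `build_predictor` and `transcribe_images` are hypothetical, and pretrained weights are downloaded on first use.

```python
from typing import List


def build_predictor(det_arch: str = "db_resnet50", reco_arch: str = "crnn_vgg16_bn"):
    """Build a two-stage OCR predictor: text detection followed by recognition.

    The architecture names are assumptions about the installed docTR version.
    """
    # Imported lazily so this module can be loaded even where docTR is absent.
    from doctr.models import ocr_predictor

    return ocr_predictor(det_arch=det_arch, reco_arch=reco_arch, pretrained=True)


def transcribe_images(image_paths: List[str]) -> str:
    """Run detection and recognition on the given images and return plain text."""
    from doctr.io import DocumentFile

    pages = DocumentFile.from_images(image_paths)  # one array per image
    predictor = build_predictor()
    result = predictor(pages)  # detection, then recognition on each text crop
    return result.render()  # text of all pages as a single string


# Example call (hypothetical file name):
# print(transcribe_images(["scanned_page.png"]))
```

Keeping the detection and recognition architectures as separate arguments mirrors the two-stage design: either stage can be swapped without touching the other.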
The library offers end-to-end OCR pipelines that combine text detection and recognition to extract all text from document images or PDFs, including rotated pages and text in varied orientations. It also covers document layout analysis using transformer-based detectors, key information extraction that combines detection, recognition, and layout analysis to produce structured data, and document image classification using standard CNN architectures. Text detection relies on segmentation-based detectors such as DBNet and LinkNet, while text recognition uses sequence models such as CRNN, SAR, and MASTER, with optional vocabulary restriction to control the set of characters a model can output.
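To make the idea of vocabulary restriction concrete, here is a minimal, library-independent sketch of greedy decoding over CTC-style scores (the kind of output CRNN-type models commonly produce) with an optional allowed character set. This is a generic illustration, not docTR's own mechanism, which may differ. The function name `greedy_ctc_decode`, the `allowed` parameter, and the example scores are all hypothetical.

```python
from typing import Optional, Sequence, Set

BLANK = 0  # index reserved for the CTC blank symbol in this sketch


def greedy_ctc_decode(
    scores: Sequence[Sequence[float]],
    vocab: str,
    allowed: Optional[Set[str]] = None,
) -> str:
    """Greedy CTC decoding, optionally restricted to a subset of the vocabulary.

    scores[t][0] is the blank score at time step t; scores[t][i + 1] is the
    score of vocab[i].
    """
    # Indices the decoder may choose from; the blank is always permitted.
    if allowed is None:
        candidates = list(range(len(vocab) + 1))
    else:
        candidates = [BLANK] + [i + 1 for i, ch in enumerate(vocab) if ch in allowed]

    decoded = []
    previous = BLANK
    for step in scores:
        best = max(candidates, key=lambda idx: step[idx])
        # Standard CTC collapse: skip blanks and consecutive repeats.
        if best != BLANK and best != previous:
            decoded.append(vocab[best - 1])
        previous = best
    return "".join(decoded)


# Hypothetical scores for three time steps over the vocabulary "ab1".
example_scores = [
    [0.1, 0.2, 0.1, 0.9],    # "1" scores highest
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # "b" scores highest
]

unrestricted = greedy_ctc_decode(example_scores, "ab1")
letters_only = greedy_ctc_decode(example_scores, "ab1", allowed={"a", "b"})
```

With the restriction in place, the first time step can no longer emit a digit, so the decoder falls back to the best-scoring allowed character instead.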
DocTR provides multiple deployment options including FastAPI-based REST API serving for remote document processing, command-line tools for script-based analysis, and Docker container deployment for consistent environments. It supports document input from images, PDFs, and URLs through a unified loading interface, and offers post-processing capabilities including prediction visualization, document reconstruction, and structured JSON export. The library also includes model benchmarking tools for comparing custom architectures against pretrained models on standard datasets.
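The structured JSON export lends itself to simple downstream processing. The sketch below walks an exported result, assuming a nested layout of pages, blocks, lines, and words in which each word carries a `value` and a `confidence`. The `sample_export` dictionary is hand-written for illustration rather than real model output, and the helper names are hypothetical.

```python
from typing import Any, Dict, List


def page_lines(export: Dict[str, Any]) -> List[List[str]]:
    """Return, for each page, its text lines with words joined by spaces."""
    pages = []
    for page in export.get("pages", []):
        lines = []
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words = [word["value"] for word in line.get("words", [])]
                if words:
                    lines.append(" ".join(words))
        pages.append(lines)
    return pages


def low_confidence_words(export: Dict[str, Any], threshold: float = 0.5) -> List[str]:
    """Collect words whose confidence falls below the threshold."""
    flagged = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    if word.get("confidence", 0.0) < threshold:
                        flagged.append(word["value"])
    return flagged


# Hand-written sample in the assumed export layout (not real model output).
sample_export = {
    "pages": [
        {
            "page_idx": 0,
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Invoice", "confidence": 0.98},
                                {"value": "No.", "confidence": 0.91},
                                {"value": "1O23", "confidence": 0.42},
                            ]
                        },
                        {"words": [{"value": "Total", "confidence": 0.97}]},
                    ]
                }
            ],
        }
    ]
}

lines_per_page = page_lines(sample_export)
needs_review = low_confidence_words(sample_export, threshold=0.5)
```

Flagging low-confidence words separately makes it easy to route uncertain transcriptions to manual review while keeping the rest of the extracted text intact.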