Donut | Awesome Repository

Donut is an OCR-free document transformer and end-to-end document parser. It is a neural network that converts unstructured document images directly into structured data or text, without relying on an external optical character recognition (OCR) engine.

The project includes a synthetic document generator to create artificial images and ground-truth labels for training. It employs a transformer model to perform visual question answering and document image classification based on visual layout and text.

The system covers several document understanding capabilities, including structured information extraction, document text transcription, and visual document question answering. It provides tools for transformer model fine-tuning and model accuracy evaluation.
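As a rough illustration of what accuracy evaluation for structured extraction can look like, the sketch below scores a predicted record against a ground-truth record using field-level F1. This is an assumption for illustration, not necessarily the metric the project ships: the function name `field_f1`, the flat key/value record shape, and the sample values are all hypothetical.

```python
def field_f1(predicted, expected):
    """Field-level F1 between two flat dicts (hypothetical helper).

    A field counts as correct only if both its key and its value match exactly.
    """
    predicted_fields = set(predicted.items())
    expected_fields = set(expected.items())
    true_positives = len(predicted_fields & expected_fields)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_fields)
    recall = true_positives / len(expected_fields)
    return 2 * precision * recall / (precision + recall)


# Made-up records: one of the two fields matches exactly.
predicted = {"total": "4.50", "date": "2024-01-01"}
expected = {"total": "4.50", "date": "2024-01-02"}
score = field_f1(predicted, expected)
print(score)
```

Exact matching is strict: a single wrong character in a value makes the whole field count as a miss, so an edit-distance-based score may be a better fit when partial credit matters.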

Features

Image-to-Text Transformers - Provides a transformer-based model that maps image pixels directly to structured text without external OCR engines.
Document Structure Transcription - Transcribes text sequences from document images into raw strings using vision models.
Document Information Extraction - Identifies and retrieves specific data fields from images of forms to automate data entry.
End-to-End Document Parsers - Provides an end-to-end parser that translates visual document representations directly into structured JSON.
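
The step from generated text to structured JSON can be sketched as follows. This is a minimal sketch, assuming the model emits each field wrapped in paired tags of the form `<s_key>...</s_key>`; the function name `tags_to_json`, the simplified tag grammar, and the sample string are illustrative assumptions, not the project's actual converter.

```python
import json
import re

# Matches one <s_key>...</s_key> pair; the backreference keeps open/close tags paired.
FIELD = re.compile(r"<s_(?P<key>[^>]+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tags_to_json(sequence):
    """Convert a tagged string into nested dicts (hypothetical helper).

    Text without any tags is returned as a stripped leaf string.
    Simplification: repeated keys at the same level overwrite each other,
    and a key nested directly inside itself is not handled.
    """
    fields = {}
    for match in FIELD.finditer(sequence):
        fields[match.group("key")] = tags_to_json(match.group("body"))
    return fields if fields else sequence.strip()


# Made-up decoder output for a receipt image.
generated = "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu><s_total>4.50</s_total>"
parsed = tags_to_json(generated)
print(json.dumps(parsed, indent=2))
```

Because the output is a plain dictionary, it can be serialized with `json.dumps` or compared field by field against ground-truth labels.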
