Donut is an OCR-free document transformer and end-to-end document parser. It is a neural network that converts unstructured document images directly into structured data or text, without relying on an external optical character recognition (OCR) engine.
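To make "structured data" concrete, the sketch below shows one way a model's text output could be turned into a nested record. It assumes the model emits fields wrapped in tags of the form <s_field>...</s_field>; that tag format, the function name tokens_to_dict, and the sample receipt fields are illustrative assumptions, not the project's actual implementation.

```python
import re

# Matches an opening field tag such as <s_total>; the tag format is an assumption.
OPEN_TAG = re.compile(r"<s_(\w+)>")


def tokens_to_dict(sequence: str) -> dict:
    """Convert a tagged sequence into a nested dict (hypothetical helper)."""
    result = {}
    pos = 0
    while True:
        start = OPEN_TAG.search(sequence, pos)
        if start is None:
            break
        key = start.group(1)
        close_tag = f"</s_{key}>"
        end = sequence.find(close_tag, start.end())
        if end == -1:
            # Unclosed field: skip this opening tag and keep scanning.
            pos = start.end()
            continue
        body = sequence[start.end():end]
        # Recurse when the body holds nested fields; otherwise keep the text.
        if OPEN_TAG.search(body):
            result[key] = tokens_to_dict(body)
        else:
            result[key] = body.strip()
        pos = end + len(close_tag)
    return result


sample = (
    "<s_store>ACME</s_store>"
    "<s_total><s_price>12.50</s_price><s_currency>USD</s_currency></s_total>"
)
record = tokens_to_dict(sample)
print(record)
```

This simplified parser does not handle repeated sibling fields or a field nested inside another field of the same name; a production parser would need to cover those cases.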
The project includes a synthetic document generator that produces artificial images and ground-truth labels for training. The transformer model itself performs visual question answering and document image classification based on a document's visual layout and text.
The system covers several document understanding capabilities, including structured information extraction, document text transcription, and visual document question answering. It provides tools for transformer model fine-tuning and model accuracy evaluation.
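As an illustration of what accuracy evaluation can look like for structured extraction, the sketch below computes a field-level F1 score between predicted and reference records. The text above does not say which metric the project uses, so this metric is an assumption, and the helper names flatten and field_f1 are hypothetical.

```python
from collections import Counter


def flatten(record: dict, prefix: str = "") -> list:
    """Flatten a nested dict into (dotted_path, value) pairs."""
    items = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            items.extend(flatten(value, path))
        else:
            items.append((path, str(value)))
    return items


def field_f1(predictions: list, references: list) -> float:
    """Micro-averaged F1 over exact (path, value) matches."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_fields = Counter(flatten(pred))
        ref_fields = Counter(flatten(ref))
        matched = sum((pred_fields & ref_fields).values())
        tp += matched
        fp += sum(pred_fields.values()) - matched
        fn += sum(ref_fields.values()) - matched
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


reference = [{"store": "ACME", "total": {"price": "12.50", "currency": "USD"}}]
prediction = [{"store": "ACME", "total": {"price": "12.50", "currency": "EUR"}}]
score = field_f1(prediction, reference)
print(f"field-level F1: {score:.3f}")
```

Exact matching on flattened fields is deliberately strict: a single wrong character counts the whole field as an error. A softer metric, such as one based on edit distance, may suit transcription tasks better.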