# clovaai/donut

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/clovaai-donut).**

6,789 stars · 552 forks · Python · MIT

## Links

- GitHub: https://github.com/clovaai/donut
- Homepage: https://arxiv.org/abs/2111.15664
- awesome-repositories: https://awesome-repositories.com/repository/clovaai-donut.md

## Topics

`computer-vision` `document-ai` `eccv-2022` `multimodal-pre-trained-model` `nlp` `ocr`

## Description

Donut (Document Understanding Transformer) is an OCR-free, end-to-end document parser. It is a neural network that converts document images directly into structured data or text, with no external optical character recognition (OCR) engine in the pipeline.
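To make "structured data" concrete, here is a minimal sketch of how a tag-delimited output string could be turned into a nested dictionary. The `<s_key>...</s_key>` tag format, the field names, and the `tokens_to_dict` helper are assumptions made for illustration. This is not the repository's own conversion code, and it does not handle a field nested inside another field of the same name.

```python
import re
from typing import Any, Dict, Union

# Hypothetical tag format: <s_key>value</s_key>, possibly nested.
_FIELD = re.compile(r"<s_(?P<key>[^>]+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tokens_to_dict(sequence: str) -> Union[Dict[str, Any], str]:
    """Parse a tag-delimited string into a nested dict (illustrative only)."""
    result: Dict[str, Any] = {}
    for match in _FIELD.finditer(sequence):
        key = match.group("key")
        # Recurse so that nested tags become nested dictionaries.
        value = tokens_to_dict(match.group("body"))
        if key in result:
            # A repeated key is collected into a list of values.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    # No tags found: this is a leaf, so return the plain text.
    return result if result else sequence.strip()


example = (
    "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu>"
    "<s_total>4.50</s_total>"
)
parsed = tokens_to_dict(example)
```

In this sketch, `parsed["menu"]["nm"]` holds the item name and `parsed["total"]` holds the total as a string; type conversion is left to the caller.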

The project includes a synthetic document generator (SynthDoG) that creates artificial document images and ground-truth labels for training. The model itself is a transformer that performs visual question answering and document image classification from a document's visual layout and text.

The system covers several document understanding capabilities, including structured information extraction, document text transcription, and visual document question answering. It provides tools for transformer model fine-tuning and model accuracy evaluation.
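As a rough illustration of accuracy evaluation for structured extraction, the sketch below computes a field-level F1 score between a predicted and a ground-truth dictionary. It is a simplified stand-in chosen for clarity, not the metric implemented in the repository; the `flatten` and `field_f1` helpers and the sample fields are hypothetical.

```python
from collections import Counter
from typing import Any, Iterator, Tuple


def flatten(obj: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_path, value) pairs for every leaf of a nested structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        # List items share the parent's path, so their order is ignored.
        for item in obj:
            yield from flatten(item, prefix)
    else:
        yield (prefix, str(obj).strip())


def field_f1(pred: Any, gold: Any) -> float:
    """F1 over (path, value) pairs; a field counts only on an exact match."""
    pred_fields = Counter(flatten(pred))
    gold_fields = Counter(flatten(gold))
    true_pos = sum((pred_fields & gold_fields).values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred_fields.values())
    recall = true_pos / sum(gold_fields.values())
    return 2 * precision * recall / (precision + recall)


gold = {"menu": {"nm": "Latte", "price": "4.50"}, "total": "4.50"}
pred = {"menu": {"nm": "Latte", "price": "4.00"}, "total": "4.50"}
score = field_f1(pred, gold)  # two of three fields match exactly
```

Exact string matching is strict: a single wrong character fails the whole field. An edit-distance-based score is a gentler alternative when partial credit matters.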

## Tags

### Artificial Intelligence & ML

- [Image-to-Text Transformers](https://awesome-repositories.com/f/artificial-intelligence-ml/image-to-text-transformers.md) — Provides a transformer-based model that maps image pixels directly to structured text without external OCR engines.
- [Document Structure Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-models/document-structure-transcription.md) — Transcribes text sequences from document images into raw strings using vision models. ([source](https://github.com/clovaai/donut/blob/master/README.md))
- [Document Information Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/document-information-extraction.md) — Identifies and retrieves specific data fields from images of forms to automate data entry.
- [End-to-End Document Parsers](https://awesome-repositories.com/f/artificial-intelligence-ml/end-to-end-document-parsers.md) — Provides an end-to-end parser that translates visual document representations directly into structured JSON.
- [Image Classification](https://awesome-repositories.com/f/artificial-intelligence-ml/image-classification.md) — Assigns category labels to document images based on their visual structure and textual content. ([source](https://github.com/clovaai/donut#readme))
- [Information Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/information-extraction.md) — Converts document images into structured data by identifying and extracting key information fields. ([source](https://github.com/clovaai/donut/blob/master/README.md))
- [Question Answering](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/question-answering.md) — Extracts specific text answers from document images using natural language questions. ([source](https://github.com/clovaai/donut#readme))
- [OCR-Free Transformers](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-data-encoders/document-image-transformations/ocr-free-transformers.md) — Implements a transformer model that extracts structured data from document images without using external OCR.
- [Encoder-Decoder Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-transformers/encoder-decoder-architectures.md) — Implements an encoder-decoder vision transformer to map image features to structured text sequences.
- [Visual Document Understanding](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-document-understanding.md) — Converts document images into structured data or text without relying on an external OCR engine.
- [Visual Question Answering](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-question-answering.md) — Produces text answers to natural language questions by analyzing the visual and spatial content of document images. ([source](https://github.com/clovaai/donut/blob/master/README.md))
- [Synthetic Dataset Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-generation/synthetic-dataset-generators.md) — Includes a pipeline for generating synthetic document images and matching labels for model training.
- [Document Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-generation/synthetic-dataset-generators/document-generation.md) — Generates artificial document images and labels to reduce the need for manual training data annotation. ([source](https://github.com/clovaai/donut/tree/master/synthdog))
- [Feature Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-extraction.md) — Extracts spatial and semantic features from document images using a transformer-based visual encoder.
- [Document Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/synthetic-content-generators/synthetic-media-generators/document-generators.md) — Provides a tool to create artificial document images and ground-truth labels for model training.
- [Transformer-Based Image Classifiers](https://awesome-repositories.com/f/artificial-intelligence-ml/image-classification/transformer-based-image-classifiers.md) — Uses a transformer-based classifier to assign categories to document images.
- [Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-fine-tuning.md) — Provides capabilities to fine-tune pretrained transformer models on specific document datasets using custom configurations. ([source](https://github.com/clovaai/donut/blob/master/README.md))
- [Multimodal Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-fine-tuning/multimodal-fine-tuning.md) — Employs multimodal fine-tuning to optimize pretrained weights for specific visual document extraction tasks.

### Part of an Awesome List

- [Sequence To Sequence Models](https://awesome-repositories.com/f/awesome-lists/ai/sequence-to-sequence-models.md) — Transforms visual document inputs into structured JSON-like strings using sequence-to-sequence mapping.