# bytedance/dolphin

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/bytedance-dolphin).**

8,820 stars · 739 forks · Python · other

## Links

- GitHub: https://github.com/bytedance/Dolphin
- awesome-repositories: https://awesome-repositories.com/repository/bytedance-dolphin.md

## Topics

`document-analysis` `layout-analysis` `ocr` `parser` `pdf` `pdf-converter` `pdf-parser` `python` `vlm-ocr`

## Description

Dolphin is a multimodal layout analyzer and image-to-structure converter that turns photographed or digital document images into machine-readable structured data. It works as an LLM document parser, using a vision-language model to predict both the spatial layout and the text content of a page.

The system is designed as a concurrent document processor: it parses multiple document elements in parallel across distributed compute nodes. Parsing elements in parallel rather than one at a time reduces the total time required to convert large volumes of images into structured formats.
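The parallel parsing described above can be illustrated with a minimal sketch. It runs on a single machine with a thread pool rather than across distributed nodes, and `parse_element` is a hypothetical stand-in for a per-element model call, not a function from the Dolphin codebase.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_element(element: dict) -> dict:
    """Hypothetical stand-in for a per-element model call (assumption)."""
    return {
        "id": element["id"],
        "type": element["type"],
        "text": f"<parsed {element['type']}>",
    }


def parse_elements_concurrently(elements: list[dict], max_workers: int = 4) -> list[dict]:
    """Parse all elements with a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so the
        # original element order is preserved.
        return list(pool.map(parse_element, elements))


elements = [
    {"id": 0, "type": "paragraph"},
    {"id": 1, "type": "table"},
    {"id": 2, "type": "formula"},
]
results = parse_elements_concurrently(elements)
```

Because the results come back in input order, the caller can zip them with the layout elements without tracking indices separately.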

Layout mapping identifies a bounding box for each document element and arranges the elements in natural reading order. Prompt-based parsing then supports granular content retrieval: specific element types such as tables, formulas, and code blocks can be extracted on their own.
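One way to picture prompt-based parsing is a lookup from element type to instruction text. The sketch below is illustrative only: the prompt wording, the `PROMPTS` table, and the `build_requests` helper are all assumptions and are not taken from the Dolphin repository.

```python
# Hypothetical prompt wording; the project's actual prompts may differ.
PROMPTS = {
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
    "code": "Read the code in the image.",
}
DEFAULT_PROMPT = "Read the text in the image."


def build_requests(elements, wanted_types=None):
    """Pair each layout element with a type-specific prompt.

    If wanted_types is given, only elements of those types are kept,
    which gives targeted extraction of, say, tables only.
    """
    requests = []
    for element in elements:
        if wanted_types is not None and element["type"] not in wanted_types:
            continue
        prompt = PROMPTS.get(element["type"], DEFAULT_PROMPT)
        requests.append({"bbox": element["bbox"], "prompt": prompt})
    return requests


layout = [
    {"type": "paragraph", "bbox": (0, 0, 100, 20)},
    {"type": "table", "bbox": (0, 30, 100, 80)},
    {"type": "formula", "bbox": (0, 90, 100, 110)},
]
table_requests = build_requests(layout, wanted_types={"table"})
```

Each request would then be sent to the model together with the image crop for its bounding box.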

## Tags

### Artificial Intelligence & ML

- [Vision-Language Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-language-inference.md) — Uses vision-language inference to simultaneously predict spatial layout and text content from document images.
- [Pixel Coordinate Mappings](https://awesome-repositories.com/f/artificial-intelligence-ml/bounding-box-regression/bounding-box-representations/bounding-box-coordinate-predictors/pixel-coordinate-mappings.md) — Maps high-level bounding boxes and regions to exact pixel coordinates for document layout identification.
- [Image-to-Text Transformers](https://awesome-repositories.com/f/artificial-intelligence-ml/image-to-text-transformers.md) — Uses transformer-based mapping to convert image pixels directly into structured text sequences.
- [Content Parsing Prompts](https://awesome-repositories.com/f/artificial-intelligence-ml/instructional-prompting/content-parsing-prompts.md) — Uses targeted text instructions to guide the model in isolating specific data types like tables or formulas.
- [Document Image Transformations](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-data-encoders/document-image-transformations.md) — Converts raw pixel data from photographs or digital scans into machine-readable structured text formats.
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Parses and extracts structural information from complex documents to identify text, tables, and layout hierarchies. ([source](https://github.com/bytedance/Dolphin/blob/master/README.md))
- [Structured Document Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/structured-document-extraction.md) — Converts visual document layouts into machine-readable formats like JSON or Markdown. ([source](https://github.com/bytedance/Dolphin/blob/master/pyproject.toml))
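
As an illustration of the pixel coordinate mapping tag above, the following sketch converts a bounding box into integer pixel coordinates. It assumes boxes are expressed as fractions of the image size in the range 0 to 1; the listing does not state the model's actual output format, and `to_pixel_box` is a hypothetical helper.

```python
def to_pixel_box(norm_box, image_width, image_height):
    """Map a normalized (x0, y0, x1, y1) box to pixel coordinates.

    Assumes each coordinate is a fraction of the image size (assumption).
    Values are rounded and clamped to the image bounds.
    """
    def clamp(value, upper):
        return max(0, min(upper, value))

    x0, y0, x1, y1 = norm_box
    return (
        clamp(round(x0 * image_width), image_width),
        clamp(round(y0 * image_height), image_height),
        clamp(round(x1 * image_width), image_width),
        clamp(round(y1 * image_height), image_height),
    )


pixel_box = to_pixel_box((0.1, 0.25, 0.9, 0.5), image_width=1000, image_height=800)
# pixel_box == (100, 200, 900, 400)
```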

### Content Management & Publishing

- [Document Layout Analyzers](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/document-layout-analyzers.md) — Provides a multimodal layout analyzer that identifies spatial arrangements and reading orders of text, tables, and figures in images.
- [Parallel Processing](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/document-automation-interfaces/document-parsing-services/parallel-processing.md) — Implements parallel document parsing across distributed nodes to reduce the total time required for large-volume image conversion. ([source](https://github.com/bytedance/Dolphin#readme))
- [Vision-Based Document Parsers](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/document-automation-interfaces/plugin-based-document-parsers/vision-based-document-parsers.md) — Uses multimodal vision models to interpret document layouts and convert them into structured text.
- [Reading Order Predictors](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/document-layout-analyzers/reading-order-predictors.md) — Analyzes text and spatial layout to determine the logical reading sequence of document elements. ([source](https://github.com/bytedance/Dolphin/tree/v1.0))
- [Parallel Processing](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/pdf-processing-engines/parallel-processing.md) — Employs concurrent execution of document transformation tasks to improve overall processing throughput.

### User Interface & Experience

- [Reading Order Reconstruction](https://awesome-repositories.com/f/user-interface-experience/information-architecture-resources/visual-reading-flow-patterns/reading-order-reconstruction.md) — Determines the natural reading order by linking disparate layout elements into a coherent structural sequence.
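
To make the idea of reading order concrete, here is a minimal geometric heuristic: group boxes into rows by their top edge, then read each row left to right. This is a simplified stand-in under stated assumptions, not the project's method, which per the description above predicts the sequence with a model. The `reading_order` helper and the `row_tolerance` parameter are hypothetical, and the heuristic does not handle multi-column pages.

```python
def reading_order(boxes, row_tolerance=10):
    """Order boxes top to bottom, then left to right within a row.

    Each box is a dict with a "bbox" of (x0, y0, x1, y1) in pixels.
    Boxes whose top edge lies within row_tolerance pixels of the first
    box in the current row are treated as part of that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b["bbox"][1]):
        if rows and abs(box["bbox"][1] - rows[-1][0]["bbox"][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])

    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b["bbox"][0]))
    return ordered


boxes = [
    {"id": "right", "bbox": (300, 102, 550, 200)},
    {"id": "footer", "bbox": (50, 400, 550, 420)},
    {"id": "title", "bbox": (50, 10, 550, 40)},
    {"id": "left", "bbox": (50, 100, 280, 200)},
]
order = [box["id"] for box in reading_order(boxes)]
# order == ["title", "left", "right", "footer"]
```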

### Part of an Awesome List

- [Complex Document Extraction](https://awesome-repositories.com/f/awesome-lists/data/document-parsing-and-extraction/complex-document-extraction.md) — Identifies and parses specific components such as tables, formulas, and paragraphs from complex document sources. ([source](https://github.com/bytedance/Dolphin#readme))
- [Multimodal Models](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-models.md) — Multimodal model for image and text integration.
- [Document Parsing and Extraction](https://awesome-repositories.com/f/awesome-lists/data/document-parsing-and-extraction.md) — Official repository for document image parsing via heterogeneous prompting.

### Data & Databases

- [Concurrent Data Processors](https://awesome-repositories.com/f/data-databases/concurrent-data-processors.md) — Distributes computational workloads across multiple cores to accelerate the conversion of images into structured data.
- [Document Processing Engines](https://awesome-repositories.com/f/data-databases/document-processing-engines.md) — Provides high-performance pipelines for converting large volumes of images into structured data through parallel execution.
- [Granular Content Extractions](https://awesome-repositories.com/f/data-databases/granular-content-extractions.md) — Parses specific content types, such as tables or code blocks, from images for detailed data retrieval. ([source](https://github.com/bytedance/Dolphin/blob/master/README.md))