Dolphin | Awesome Repository

Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It works as an LLM-based document parser: a vision-language model predicts spatial layout and text content at the same time.

The system is designed for concurrent processing: it parses multiple document elements in parallel across distributed compute nodes. This high-throughput approach reduces the total time needed to convert large volumes of images into structured formats.
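The idea of parsing elements in parallel can be illustrated with a minimal sketch. It assumes a local thread pool rather than distributed nodes, and parse_element is a hypothetical stand-in for a model call, not part of Dolphin's API.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_element(element):
    # Hypothetical stand-in for a per-element model call (assumption, not Dolphin's API).
    return {
        "id": element["id"],
        "type": element["type"],
        "text": f"<parsed {element['type']}>",
    }


def parse_elements_in_parallel(elements, max_workers=4):
    # Submit every element to the pool; Executor.map returns results in input order,
    # so the original element order is preserved in the output.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_element, elements))


if __name__ == "__main__":
    page = [
        {"id": 0, "type": "paragraph"},
        {"id": 1, "type": "table"},
        {"id": 2, "type": "formula"},
    ]
    for result in parse_elements_in_parallel(page):
        print(result)
```

Threads suit this sketch because a real model call would typically be I/O-bound or offloaded to an accelerator; a distributed setup would swap the pool for a job queue or a batch inference server.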

Dolphin maps the spatial layout of a document by identifying element bounding boxes and generating a natural reading order sequence. It also supports granular content retrieval: prompt-based parsing extracts specific document elements such as tables, formulas, and code blocks.
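To make the notions of reading order and targeted extraction concrete, here is a rough sketch over already-detected elements. The dictionary layout, the (x0, y0, x1, y1) box format, and the top-to-bottom, left-to-right heuristic are all assumptions for illustration; Dolphin predicts reading order with its model rather than with a geometric sort like this.

```python
def reading_order(elements, row_tolerance=10):
    # Naive heuristic: bucket elements into rows by their top edge (y0),
    # then order each row left to right by x0. Boxes whose tops straddle a
    # bucket boundary may land in different rows; this is only a sketch.
    return sorted(
        elements,
        key=lambda e: (e["bbox"][1] // row_tolerance, e["bbox"][0]),
    )


def select_elements(elements, kind):
    # Keep only the elements of one type, e.g. "table" or "formula".
    return [e for e in elements if e["type"] == kind]


if __name__ == "__main__":
    detected = [
        {"type": "table", "bbox": (300, 12, 580, 150)},
        {"type": "paragraph", "bbox": (10, 15, 280, 150)},
        {"type": "formula", "bbox": (10, 200, 580, 240)},
    ]
    ordered = reading_order(detected)
    print([e["type"] for e in ordered])
    print(select_elements(ordered, "table"))
```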

Features

Vision-Language Inference - Predicts spatial layout and text content simultaneously from document images.
Document Layout Analyzers - Identifies the spatial arrangement and reading order of text, tables, and figures in images.
Pixel Coordinate Mappings - Maps high-level bounding boxes and regions to exact pixel coordinates for document layout identification.
Image-to-Text Transformers - Uses transformer-based mapping to convert image pixels directly into structured text sequences.
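
The pixel coordinate mapping listed above can be sketched as follows. This assumes boxes arrive as normalized (x0, y0, x1, y1) values in the range 0 to 1, which is an illustrative convention and not a confirmed Dolphin output format; to_pixel_bbox is a hypothetical helper.

```python
def to_pixel_bbox(norm_bbox, image_width, image_height):
    # Scale normalized coordinates by the image size, round to whole pixels,
    # and clamp so the box never leaves the image.
    x0, y0, x1, y1 = norm_bbox

    def clamp(value, upper):
        return max(0, min(upper, value))

    return (
        clamp(round(x0 * image_width), image_width),
        clamp(round(y0 * image_height), image_height),
        clamp(round(x1 * image_width), image_width),
        clamp(round(y1 * image_height), image_height),
    )


if __name__ == "__main__":
    print(to_pixel_bbox((0.1, 0.2, 0.5, 0.9), 1000, 2000))
```

Clamping guards against predictions that fall slightly outside the image, which would otherwise break cropping downstream.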