Dolphin is a multimodal document image parser that converts photographed or digital document images into machine-readable structured data. It uses a vision-language model to predict both the spatial layout and the text content of a page.
The system is built for concurrent processing: it parses multiple document elements in parallel across distributed compute nodes, which shortens the total time needed to convert large volumes of images into structured formats.
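To make the idea of parsing elements in parallel concrete, here is a minimal sketch using a thread pool from the Python standard library. It is not Dolphin's implementation: `Element`, `parse_element`, and `parse_elements_concurrently` are hypothetical names, and the model call is replaced by a placeholder that returns a fake string.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One cropped region of a page (hypothetical structure, not Dolphin's)."""
    index: int   # position in reading order
    kind: str    # e.g. "table", "formula", "code", "text"
    crop: bytes  # stand-in for the cropped image data


def parse_element(element: Element) -> str:
    """Placeholder for a model call; returns a fake parse result."""
    return f"[{element.kind} #{element.index}: {len(element.crop)} bytes]"


def parse_elements_concurrently(elements, max_workers=4):
    """Parse all elements in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(parse_element, elements))


elements = [
    Element(0, "text", b"abc"),
    Element(1, "table", b"abcdef"),
    Element(2, "formula", b"a"),
]
results = parse_elements_concurrently(elements)
for line in results:
    print(line)
```

Because `Executor.map` preserves input order, the parsed results can be reassembled into a page without re-sorting. A real deployment would more likely batch element crops through the model on an accelerator or spread them over several workers; the thread pool here only illustrates the fan-out and ordered fan-in pattern.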
The project covers spatial layout mapping, which identifies bounding boxes for document elements and produces a natural reading order. It also supports granular content retrieval: prompt-based parsing can target specific document elements such as tables, formulas, and code blocks.
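The relationship between bounding boxes, reading order, and type-specific prompts can be sketched as follows. Everything here is an assumption made for illustration: `LayoutElement`, the prompt strings, and the top-to-bottom sort are hypothetical, and the sort is only a naive single-column heuristic standing in for the reading order that the model itself predicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutElement:
    """A detected region (hypothetical structure, not Dolphin's output format)."""
    kind: str    # e.g. "table", "formula", "code", "text"
    bbox: tuple  # (x0, y0, x1, y1) in pixels, origin at top-left


# Hypothetical prompt strings, one per element type.
PROMPTS = {
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
    "code": "Read the code in the image.",
}
DEFAULT_PROMPT = "Read the text in the image."


def reading_order(elements):
    """Naive single-column order: top-to-bottom, then left-to-right."""
    return sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))


def prompt_for(element):
    """Pick the prompt for an element, falling back to plain text."""
    return PROMPTS.get(element.kind, DEFAULT_PROMPT)


def select(elements, kind):
    """Return only elements of one kind, kept in reading order."""
    return [e for e in reading_order(elements) if e.kind == kind]


page = [
    LayoutElement("table", (50, 400, 550, 600)),
    LayoutElement("text", (50, 100, 550, 380)),
    LayoutElement("formula", (50, 620, 300, 660)),
    LayoutElement("text", (50, 40, 550, 90)),
]
ordered = reading_order(page)
tables = select(page, "table")
for element in ordered:
    print(element.kind, element.bbox, "->", prompt_for(element))
```

Keeping the element type separate from the prompt lookup means targeted extraction reduces to a filter followed by a dictionary lookup, so new element types only require a new entry in the prompt table.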