Sparrow

Sparrow is an LLM document extraction platform and vision-based inference engine designed to convert images and PDFs into validated structured data. It functions as an agentic workflow orchestrator that chains classification, extraction, and validation tasks into multi-step pipelines.
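The chaining of classification, extraction, and validation described above can be sketched as a list of steps that pass a shared state along. This is only an illustrative sketch, not Sparrow's API: PipelineState, classify, extract, validate, and run_pipeline are hypothetical names, and the model calls are replaced by trivial string handling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineState:
    """Carries the document text and accumulated results between steps."""
    document: str
    results: dict = field(default_factory=dict)


Step = Callable[[PipelineState], PipelineState]


def classify(state: PipelineState) -> PipelineState:
    # Hypothetical stand-in for a model call: a simple keyword check.
    is_invoice = "invoice" in state.document.lower()
    state.results["doc_type"] = "invoice" if is_invoice else "unknown"
    return state


def extract(state: PipelineState) -> PipelineState:
    # Hypothetical stand-in for extraction: collect "key: value" lines.
    fields = {}
    for line in state.document.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    state.results["fields"] = fields
    return state


def validate(state: PipelineState) -> PipelineState:
    # Assumed rule for illustration: a "total" field must be present.
    required = {"total"}
    missing = sorted(required - set(state.results.get("fields", {})))
    state.results["missing"] = missing
    state.results["valid"] = not missing
    return state


def run_pipeline(document: str, steps: List[Step]) -> PipelineState:
    """Run each step in order, feeding it the state from the previous one."""
    state = PipelineState(document=document)
    for step in steps:
        state = step(state)
    return state


state = run_pipeline("Invoice\nTotal: 42.00", [classify, extract, validate])
print(state.results)
```

Keeping each step as a plain function over a shared state makes it straightforward to reorder steps or insert a retry step for error recovery.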

The system distinguishes itself through a backend-agnostic inference layer that manages models across local GPUs, Apple Silicon, and cloud providers. It uses coordinate-based visual grounding to map extracted text to bounding box coordinates, and hint-based model steering to guide model attention and normalize data formats.

The platform covers document intelligence workflows, including specialized image-based table processing to maintain structural integrity and schema-driven validation to verify the correctness of extracted fields. It also provides a document analysis dashboard for monitoring API performance, usage analytics, and system health.
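As a rough illustration of schema-driven validation, the sketch below checks extracted fields for presence and type against a small schema. The schema layout, the field names, and validate_fields are assumptions made for this example and do not reflect Sparrow's actual schema format.

```python
# Hypothetical schema: field name -> expected Python type and whether it is required.
SCHEMA = {
    "invoice_number": {"type": str, "required": True},
    "total": {"type": float, "required": True},
    "notes": {"type": str, "required": False},
}


def validate_fields(extracted: dict, schema: dict) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for name, rule in schema.items():
        value = extracted.get(name)
        if value is None:
            # Absent optional fields are fine; absent required fields are not.
            if rule["required"]:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(value, rule["type"]):
            expected = rule["type"].__name__
            errors.append(f"wrong type for {name}: expected {expected}")
    return errors


print(validate_fields({"invoice_number": "INV-1", "total": 42.0}, SCHEMA))
print(validate_fields({"total": "42"}, SCHEMA))
```

Returning a list of errors rather than raising on the first failure lets a pipeline report every problem with a document in one pass.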

The architecture includes a plugin-based extension system for integrating third-party libraries used in indexing and orchestration.

Features

Intelligent Document Processing - Provides a platform for intelligent document processing, combining classification, extraction, and validation into multi-step pipelines.
Vision-Language Model Backends - Uses vision-capable language models to parse document layouts and convert visual content into structured data.
Agentic Workflow Pipelines - Implements automated pipelines that chain LLM instructions and external tools for complex document analysis and error recovery.
Hardware Acceleration Backends - Manages extraction pipelines across diverse hardware accelerators including local and cloud-based backends.
Image Text Extractions - Recognizes and extracts text and key-value pairs from images and PDFs as structured data.
Vision-Language Orchestrators - Manages model inference across local GPUs, Apple Silicon, and cloud providers to process visual document data.
Hardware-Agnostic Inference Layers - Provides a hardware-agnostic inference layer that routes processing to local GPUs, Apple Silicon, or cloud providers.
Structured Document Extraction - Converts visual document layouts into machine-readable structured formats using vision-capable models.
Vision-Language Inference - Implements a vision-language inference engine that executes multimodal models across various hardware backends.
Workflow Orchestration - Combines document classification and data extraction into a single AI workflow pipeline with visual monitoring.
Document Field Validations - Checks extracted document fields against schemas to verify the presence and correctness of required data.
Structured Data Extraction - Parses complex tables and text from documents into predefined schemas with bounding box coordinate mapping.
PDF Coordinate Extraction - Extracts precise bounding box coordinates for recognized text regions within PDF pages.
Agentic Workflow Orchestrators - Functions as an orchestrator that chains LLM classification, extraction, and validation tasks with integrated error recovery.
Pipeline Orchestrators - Chains classification, extraction, and validation tasks into sequenced pipelines with error recovery.
Schema-Driven Validations - Verifies extracted document fields against predefined structural definitions to ensure data correctness.
Extraction Coordinate Annotations - Generates bounding box coordinates for extracted elements to provide visual grounding for the data.
Visual Coordinate Mapping - Maps extracted text to precise bounding box coordinates for visual grounding within documents.
Tabular Data Extraction - Extracts complex tabular data from documents while maintaining structural integrity through specialized vision processing.
OCR Document Conversion - Converts images and PDFs into validated structured data using vision models and schema-based validation.
Local Model Backends - Supports running inference across a variety of backends including local GPUs, Apple Silicon, and cloud providers.
Attention Steering Hints - Uses hint-based configuration files to steer model attention and normalize extracted data formats.
Extraction Hinting - Uses hint-based model steering to guide attention and normalize data formats during the extraction process.
Document Processing Pipelines - Extracts and analyzes data across documents containing multiple pages using orchestrated pipelines.
Table Structure Detections - Identifies tabular grids and merged cells within document layouts and crops them for specialized inference.
Image-Based Table Extractors - Identifies tabular regions and crops them into images for specialized inference to preserve structural integrity.
Document Table Extractors - Maps large or multi-column tables from documents to structured schemas using an intermediate processing pipeline.
Document Analysis Dashboards - Provides a visual dashboard for monitoring API performance, usage analytics, and the operational health of extraction pipelines.
Data Processing - Solution for efficient data extraction from documents and images.
Data Processing Tools - Solution for efficient data extraction from documents and images.

datalab-to/chandra

4,833 View on GitHub

Chandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSON schemas, with citations back to source locations. The service supports form filling in PDF and image documents, document generation from Markdown, and extraction of tracked changes from Word files. The platform distinguishes itself with pipeline-based processing chains that combine multiple proces

kreuzberg-dev/kreuzberg

8,527 View on GitHub

Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo

bytedance/Dolphin

8,820 View on GitHub

Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers

run-llama/liteparse

10,782 View on GitHub

A fast, helpful, and open-source document parser

