sChandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSON schemas, with citations back to source locations. The service supports form filling in PDF and image documents, document generation from Markdown, and extraction of tracked changes from Word files. The platform distinguishes itself with pipeline-based processing chains that combine multiple proces
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers
A fast, helpful, and open-source document parser