4 repositories
High-performance pipelines for converting large volumes of narrative text into machine-readable data.
Distinguishing note: Focuses on the high-performance pipeline and parallel execution aspects of document processing.
Explore 4 GitHub repositories matching Data & Databases · Document Processing Engines.
LangExtract is a framework designed to transform unstructured text into structured, machine-readable data using language model orchestration. It provides a high-performance pipeline that processes large volumes of narrative text using parallel execution and sequential extraction passes. The library is built to handle complex data extraction tasks, including specialized support for clinical information and medical entity relationship recognition. The project distinguishes itself through a plugin-based architecture that supports both local hardware execution and cloud-hosted model endpoi
Executes parallel extraction passes to convert large volumes of narrative text into machine-readable data.
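The pattern described above (chunks processed in parallel, with extraction passes run one after another and their results merged) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the library's actual API: `chunk_text`, `toy_extractor`, and `run_pipeline` are hypothetical names, and the regex-based extractor stands in for what would be a language model call in a real pipeline.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Group whole sentences into chunks of roughly max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def toy_extractor(chunk):
    """Hypothetical stand-in for a model call: finds dose mentions like '500 mg'."""
    return [m.group(0) for m in re.finditer(r"\b\d+\s?mg\b", chunk)]


def run_pipeline(text, extractor, passes=2, max_chars=80, max_workers=4):
    """Run several sequential passes; within each pass, chunks run in parallel."""
    chunks = chunk_text(text, max_chars)
    seen = set()
    records = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(passes):  # passes are sequential
            # pool.map runs the extractor concurrently and preserves chunk order
            for index, found in enumerate(pool.map(extractor, chunks)):
                for item in found:
                    key = (index, item)
                    if key not in seen:  # merge passes without duplicates
                        seen.add(key)
                        records.append({"chunk": index, "text": item})
    return records


if __name__ == "__main__":
    note = (
        "The patient took 500 mg of ibuprofen. "
        "Later she received 250 mg of amoxicillin. "
        "No other drugs were given."
    )
    for record in run_pipeline(note, toy_extractor):
        print(record)
```

With a deterministic extractor the second pass finds nothing new; repeated passes only pay off when the extractor is non-deterministic, as a sampled language model would be. Threads suit this sketch on the assumption that the real work is I/O-bound (remote model calls).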
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabi
Employs high-performance pipelines to process large batches of PDF files in parallel on GPUs.
DeepSeek-OCR is a vision processing framework designed to convert image-based text into machine-readable tokens for large language models. It functions as a document inference pipeline that encodes visual data into compact representations, enabling automated optical character recognition and document analysis workflows. The system distinguishes itself through a high-throughput architecture that utilizes hardware-accelerated batch inference to process large volumes of visual data. It incorporates dynamic resolution scaling to manage the balance between visual detail and token consumption, ensu
Provides high-performance pipelines for batch processing and text extraction from documents.
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers
Provides high-performance pipelines for converting large volumes of images into structured data through parallel execution.