10 Repos
Tools for extracting and integrating information from both text and visual data sources for AI systems.
Distinguishing note: Focuses on the extraction of information from mixed-media documents for retrieval purposes.
Explore 10 awesome GitHub repositories matching artificial intelligence & ml · Multimodal Document Processing. Refine with filters or upvote what's useful.
LightRAG is a graph-based retrieval framework designed to build retrieval-augmented generation pipelines. It structures unstructured text into knowledge graphs, enabling multi-hop reasoning and complex query synthesis across large document collections. By integrating dense vector embeddings with structured knowledge graphs, the system facilitates both similarity-based and relationship-aware information retrieval. The framework distinguishes itself through a dual-level retrieval strategy that combines low-level keyword matching with high-level semantic graph traversal to capture both specific
Extract information from both text and images within diverse document types to improve the context and accuracy of answers generated by automated information retrieval systems.
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Extracts text and structure from images by sending visual data alongside text prompts to a compatible inference server.
Vercel is a cloud platform for building, deploying, and scaling web applications. It provides a unified infrastructure that automates the build process by detecting project frameworks and distributing static and dynamic content through a global content delivery network. The platform executes application logic using serverless functions that scale automatically based on real-time traffic demand. The platform distinguishes itself through a centralized AI gateway that proxies requests to multiple model providers, enabling standardized authentication, observability, and cost tracking. It supports
Supports visual analysis and document-based reasoning by processing images and PDFs alongside text.
Instructor is a framework designed for structured data extraction, validation, and language model integration. It functions as a library that transforms unstructured text into validated, type-safe objects by leveraging schema definitions and model-specific tool-calling capabilities. By acting as a validation middleware, the project ensures that language model outputs strictly conform to defined data structures. The library distinguishes itself through a robust validation-based retry loop that automatically re-submits failed responses with error feedback to iteratively correct schema complianc
Extracts semantic information from multimodal documents like images and PDFs to populate structured data models.
Paper-qa is a retrieval augmented generation system designed for question answering and analysis of scientific literature and technical documents. It functions as an LLM-powered research assistant that extracts grounded answers and summaries with citations from a document library. The system utilizes an agentic RAG orchestrator to iteratively refine search queries and gather evidence through multi-step tool calling. It features a multimodal document parser that extracts text, tables, and images from PDFs, alongside a vector-based indexer that embeds and caches document libraries for efficient
Provides a multimodal processing pipeline to extract text, tables, and images from PDFs for LLM consumption.
This project is a reference implementation and application template for Retrieval-Augmented Generation (RAG). It integrates Azure OpenAI with Azure AI Search to enable conversational chat interfaces that provide grounded responses based on private enterprise data. The system is distinguished by its multimodal AI interface, allowing it to process and reason over combined text, image, and PDF content. It employs a hybrid search architecture that combines vector and keyword retrieval with semantic reranking to prioritize the most relevant documents for prompt augmentation. The project covers a
Processes and reasons over combined text, image, and PDF content to extract structured information.
BERTopic is a topic modeling library used to extract interpretable themes from collections of text documents and images. It functions as a document clustering framework that transforms unstructured data into numerical vectors to group semantically similar content. The project distinguishes itself through a multimodal embedding tool that allows for joint clustering of text and images in a shared vector space. It also features a class-based TF-IDF representation engine to identify representative words for clusters and an integrated system for using large language models to generate natural lang
Groups mixed-media data by creating shared vector representations for both text and images in a single space.
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
Extracts metadata and converts complex, mixed-media documents into structured formats like JSON and Parquet.
ruby_llm is an LLM integration framework and AI agent orchestrator designed to connect applications to multiple large language model providers through a unified interface. It serves as a toolkit for building autonomous assistants with custom personas, managing structured output via JSON schemas, and implementing vector embedding engines for semantic search. The project distinguishes itself as an observability suite and multimodal toolkit. It provides specialized capabilities for tracking token usage, calculating model costs, and tracing workflows via OpenTelemetry, while supporting the proces
Processes images, videos, audio, and documents to extract information and summaries through a unified interface.
Das Synthetic Data Kit ist ein integriertes Framework, das darauf ausgelegt ist, Trainingsdatensätze für Sprachmodelle zu generieren, zu kuratieren und zu formatieren. Es bietet eine End-to-End-Pipeline, die rohe Quelldokumente in strukturierte Daten umwandelt, die für das Fine-Tuning, Reasoning und das Training von Tool-Use-Modellen geeignet sind. Das Framework zeichnet sich durch eine modulare Orchestrierungs-Engine aus, die den gesamten Lebenszyklus der Datenvorbereitung verwaltet. Es unterstützt multimodale Eingaben durch das Extrahieren von Text- und Bildinhalten aus verschiedenen Dateiformaten, während es kontextbewusstes Chunking einsetzt, um die semantische Kohärenz beizubehalten. Der Generierungsprozess wird durch vorlagenbasierte Prompt-Injektion gesteuert, und die resultierende Ausgabe wird durch ein automatisiertes Bewertungssystem validiert, das Sprachmodelle als Richter verwendet, um Qualität und Genauigkeit sicherzustellen. Das Projekt deckt ein breites Spektrum an Datenverarbeitungsfähigkeiten ab, einschließlich Dokumenten-Parsing, automatisierter Qualitätsfilterung und schema-agnostischer Serialisierung. Es unterstützt die Erstellung diverser Trainingsbeispiele, wie Reasoning-Traces und Tool-Use-Demonstrationen, und exportiert die finalen Datensätze in standardisierte Formate für die Kompatibilität mit Machine-Learning-Trainings-Frameworks. Nutzer verwalten den Generierungs-Workflow und die Pipeline-Phasen über zentrale Konfigurationsdateien und Befehlszeilenargumente.
Extracts text and image content from mixed-media documents to support synthetic data generation.