3 repos
Tools and utilities for extracting and chunking text content from various file formats for indexing.
Distinguishing note: Focuses on the extraction and chunking phase of data ingestion, distinct from general file storage.
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
LlamaIndex parses spreadsheet files into structured table regions and metadata. The workflow is to upload a file, start an extraction job, and download the resulting data files.
This project is an autonomous agent framework designed to integrate large language models with popular messaging platforms. It functions as a middleware platform that enables automated, multimodal interactions by decomposing complex user goals into sequential plans, executing them through external tools, and maintaining persistent context across sessions. The framework distinguishes itself through a modular skill architecture and a hybrid memory system. Users can extend system capabilities by installing custom logic modules from community hubs or generating them through natural language. The
The agent framework reads text, images, and PDF documents to supply the context needed for system tasks and user queries.
Quivr is a retrieval-augmented generation platform designed to transform raw documents into searchable knowledge bases. It functions as a centralized environment where users can ingest files, index them into vector databases, and interact with language models to receive contextually relevant, data-backed responses. The platform distinguishes itself through an agentic workflow orchestrator that sequences retrieval tasks, tool execution, and model interactions to resolve complex, multi-step queries. This engine is entirely configuration-driven, allowing users to define document ingestion, chunk
Quivr converts PDF files into smaller, manageable text chunks using dedicated processors, so the content can be indexed and retrieved efficiently within the system.
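To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap, applied to text that has already been extracted from a PDF. It is not taken from Quivr, LlamaIndex, or any repository listed here; the function name `chunk_text` and its parameters `chunk_size` and `overlap` are hypothetical, and real processors typically split on sentence or token boundaries rather than raw characters.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears whole in at least one chunk.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij"
    for piece in chunk_text(sample, chunk_size=4, overlap=1):
        print(piece)
```

The overlap trades a little index size for better recall: a sentence that straddles a chunk boundary can still be matched in the neighboring chunk.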