Document Embedding and Chunking Frameworks

Tools and libraries for splitting text documents into segments and generating vector embeddings for semantic search.

supermemoryai/supermemory
27,334 stars
Supermemory is an artificial intelligence memory management platform designed to provide autonomous agents with persistent, long-term knowledge bases. It functions as a centralized repository that synchronizes multimodal data, enabling agents to maintain context and historical information across complex, multi-session workflows. By serving as a knowledge graph engine and vector database orchestrator, the platform ensures that information remains accessible and relevant for automated tasks. The system distinguishes itself through its hybrid indexing approach, which combines vector similarity search with structured graph traversal to retrieve both semantic context and explicit relational data. It decomposes unstructured documents into granular, standalone facts and utilizes composable retrieval pipelines to refine information before it is injected into agent prompts. This architecture supports the creation of automated user profiles and fact hierarchies, allowing the system to learn and update information in real-time while managing the lifecycle of stored data. Beyond individual agent support, the platform facilitates enterprise knowledge sharing by maintaining collective repositories of project decisions and patterns. It automates data ingestion from diverse sources, including cloud storage, productivity platforms, and web content, using event-driven synchronization to ensure information freshness. The platform is designed for self-hosted, containerized deployment, providing users with full control over their data infrastructure and sovereignty.
Supermemory is a comprehensive platform for managing long-term knowledge bases that includes built-in document ingestion, semantic chunking, and vector database orchestration, making it a strong candidate for building RAG-ready pipelines.
TypeScript | Semantic Chunking
langroid/langroid
3,894 stars
Langroid is a multi-agent orchestration framework and tool integration suite designed for building complex AI applications. It serves as a multi-modal integration layer that connects diverse local and remote language models with an agentic retrieval-augmented generation system. The project distinguishes itself through a collaborative message-exchange paradigm, allowing specialized agents to delegate tasks hierarchically and coordinate via structured communication. It features an advanced state management system for conversational AI, including the ability to rewind and prune conversation history to correct errors and optimize token usage. The framework provides a broad set of capabilities for grounding model responses in factual data using vector databases, graph databases, and tabular datasets. It includes a schema-driven tool execution system that binds models to Python functions and external protocol servers, as well as a comprehensive observability suite for tracing message lineage and monitoring reasoning paths. The library provides installation guidance via import errors when optional dependencies are missing.
Langroid is a comprehensive multi-agent orchestration framework that includes built-in RAG capabilities, such as document ingestion, chunking, and vector database integration, making it a suitable tool for building document ingestion and embedding pipelines.
Python | Structured Data Extraction | Text Chunks | Local Embedding Generators
lancedb/lancedb
9,031 stars
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters into a single ranked result set. The project covers a broad range of capabilities, including automated vector embedding generation, multimodal data ingestion, and large-scale feature engineering. Its search surface includes approximate nearest neighbor indexing, precision reranking, and late-interaction multivector retrieval. Additionally, it provides tools for dataset curation, model evaluation, and zero-copy data streaming for training loops. The database is accessible via multi-language SDKs and a standardized REST API, supporting deployments across local filesystems and cloud object storage providers.
LanceDB is a high-performance vector database that provides the essential storage and retrieval backend for RAG pipelines, including built-in support for embedding generation and multimodal data ingestion.
HTML | Embedding Models | Structured Data Extraction | Local Embedding Generators
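The LanceDB entry above mentions merging dense vector similarity and BM25 full-text results into a single ranked result set. One common way to do this kind of merge is reciprocal rank fusion (RRF). The sketch below is pure Python and does not use LanceDB's API; whether LanceDB applies RRF by default is an assumption not confirmed by this listing, and the function name, the constant `k=60`, and the document ids are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense-vector ranking
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Rank-based fusion avoids having to normalize cosine scores and BM25 scores onto a common scale, which is why it is a frequent choice for hybrid search.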
nomic-ai/gpt4all
77,375 stars
GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a comprehensive ecosystem for managing the entire model lifecycle, including discovery, downloading, and configuration of local weights. What distinguishes the platform is its integrated retrieval-augmented generation engine, which allows users to index local documents into semantic vector spaces. This capability enables context-aware chat sessions where the model can reference private files, notes, and spreadsheets to provide grounded, relevant responses. The system also features a local HTTP server that exposes an OpenAI-compatible API, allowing developers to integrate these private, self-hosted models into existing applications and workflows. Beyond its core inference and retrieval capabilities, the project includes a graphical desktop interface for end-user interaction and a Python software development kit for programmatic access. These tools support advanced configuration of model parameters, performance monitoring, and the management of local embedding pipelines for custom semantic search tasks. The software is distributed as a unified application package, with documentation available to guide users through installation and local environment setup.
GPT4All provides a built-in RAG engine that handles local document indexing, text chunking, and vector embedding, making it a functional tool for building document ingestion pipelines despite its primary focus on local LLM inference.
C++ | Local Embedding Generators
vercel/ai
21,885 stars
This project is a comprehensive framework for building AI-powered applications, providing a unified toolkit for orchestrating language models, autonomous agents, and interactive user interfaces. It serves as a central library for managing the entire lifecycle of AI interactions, from initial prompt generation and model provider abstraction to complex, multi-step reasoning and tool execution. The framework distinguishes itself through its deep integration with frontend development, specifically by enabling generative user interfaces that render dynamic components directly from model outputs. It features a robust agentic execution engine that manages recursive reasoning loops, allowing developers to define custom stopping conditions, delegate tasks to subagents, and enforce structured workflows. By providing a standardized interface for streaming data and state management, it ensures that backend model responses and frontend UI components remain synchronized in real time. Beyond its core orchestration capabilities, the project covers a broad surface of AI integration features, including schema-driven data extraction, multi-modal input processing, and middleware-based request interception. It supports a wide range of operational needs such as persistent conversation history, retrieval-augmented generation, and comprehensive observability tools for monitoring token usage and execution flows. The library is designed for TypeScript environments and provides a collection of hooks and utilities that simplify the implementation of chat interfaces and agentic workflows.
This framework provides a comprehensive set of utilities for RAG pipelines, including embedding model configurations, multi-modal data management, and retrieval interfaces, though it focuses more on application orchestration than standalone document ingestion.
TypeScript | Structured Data Extraction
mastra-ai/mastra
21,221 stars
Mastra is an orchestration framework designed for building, deploying, and managing autonomous AI agents and multi-agent systems. It provides a comprehensive suite of primitives for creating resilient AI applications, including durable workflow orchestration, event-driven agent loops, and semantic memory management. By integrating these core components, the platform enables developers to build complex, multi-step processes that can reason about goals and execute tasks without manual intervention. The framework distinguishes itself through its focus on observability and secure, isolated execution. It features a built-in telemetry pipeline that captures structured execution traces, logs, and performance metrics, allowing for real-time debugging and evaluation of agent behavior. Furthermore, it utilizes sandboxed environments to isolate code execution and filesystem operations, ensuring that agent interactions remain secure and reproducible. Mastra covers a broad capability surface, including multi-agent delegation hierarchies, schema-validated tool execution, and real-time voice interaction. It supports advanced orchestration patterns such as human-in-the-loop approvals, persistent state management for long-running workflows, and retrieval-augmented generation using vector-based semantic memory. These features are designed to work together to support the entire lifecycle of AI-powered applications, from initial development and testing to production deployment. The project is built for TypeScript environments and provides a modular architecture that integrates with existing web stacks and infrastructure. It includes a client SDK for interacting with remote agents and supports various authentication providers to secure API endpoints and agent resources.
Mastra is an orchestration framework that includes built-in support for RAG pipelines, specifically offering semantic memory management and vector-based retrieval capabilities to handle document ingestion and embedding tasks within AI agent workflows.
TypeScript | Document Chunking Strategies | Structured Data Extraction
chroma-core/chroma
26,198 stars
Chroma is a specialized vector database designed to index and retrieve high-dimensional data representations for semantic similarity search. It functions as a comprehensive platform for information retrieval, enabling the storage and management of unstructured documents alongside structured metadata. By mapping data into numerical representations, the system facilitates rapid similarity lookups across large datasets. The platform distinguishes itself through a hybrid search infrastructure that combines dense vector embeddings with sparse keyword and regular expression matching to balance semantic relevance with exact term precision. It supports multi-modal data, allowing for the indexing and querying of text, images, and audio within a unified interface. Furthermore, the system provides an agentic retrieval framework that enables autonomous agents to perform iterative search cycles and refine results for complex, multi-step queries. Beyond its core search capabilities, the platform includes specialized tools for codebase analysis, utilizing syntax-aware chunking to preserve logical structure for development tasks. It features a pluggable embedding pipeline that decouples vector generation from storage, allowing integration with diverse third-party machine learning models. The system also supports metadata-filtered query execution, ensuring precise retrieval by applying boolean constraints to document attributes. Operational support is provided through a programmatic interface for managing database instances in both self-hosted and cloud-based environments, including automated provisioning for scalable deployments.
Chroma is a vector database that provides the essential storage and retrieval infrastructure for RAG pipelines, including built-in embedding pipelines and chunking capabilities, though it functions primarily as the database layer rather than a standalone ingestion framework.
Rust | Vector Databases | Hybrid Search Engines | Vector Search
microsoft/graphrag
33,792 stars
GraphRAG is a data processing pipeline and retrieval engine designed to transform unstructured text into interconnected knowledge graphs. By utilizing language models to extract entities and relationships, it builds structured representations of information that enable context-aware retrieval for downstream applications. The system distinguishes itself through hierarchical graph clustering and large-scale data synthesis, which organize massive document corpora into multi-level structures. This approach allows for both vector-based semantic searches and graph-based traversals, providing a comprehensive method for navigating complex datasets and identifying hidden connections between concepts. The platform includes a modular orchestration pipeline that manages the entire lifecycle of information, from initial ingestion and indexing to query execution. Users can refine the synthesis and retrieval processes by adjusting prompt templates and configuration arguments to align with specific data characteristics.
This framework provides a sophisticated ingestion and retrieval pipeline that transforms raw text into structured knowledge graphs, offering a powerful alternative to standard vector-only RAG approaches while supporting the core requirements of document processing and semantic retrieval.
Python | Graph-Based Retrieval Augmentation | Graph-Based Retrieval Engines | Context-Aware Retrieval
HKUDS/LightRAG
36,651 stars
LightRAG is a graph-based retrieval framework designed to build retrieval-augmented generation pipelines. It structures unstructured text into knowledge graphs, enabling multi-hop reasoning and complex query synthesis across large document collections. By integrating dense vector embeddings with structured knowledge graphs, the system facilitates both similarity-based and relationship-aware information retrieval. The framework distinguishes itself through a dual-level retrieval strategy that combines low-level keyword matching with high-level semantic graph traversal to capture both specific facts and broad thematic context. It supports incremental knowledge management, allowing the underlying graph structure to be updated dynamically as new data arrives without requiring a full re-indexing of the dataset. Additionally, the system functions as a multimodal information extractor, processing both text and visual data to create unified, searchable knowledge bases. The platform provides modular, prompt-driven pipeline orchestration to coordinate document parsing, knowledge extraction, and language model generation. These automated workflows allow for the synthesis of information across interconnected documents to provide context-aware responses to nuanced, multi-step inquiries.
LightRAG is a graph-based RAG framework that handles document ingestion and multi-level retrieval, though it focuses more on knowledge graph construction than traditional vector-only embedding pipelines.
Python | Knowledge Graph Retrieval Systems | Retrieval Augmented Generation Pipelines | Graph Reasoning Systems
opendatalab/MinerU
67,734 stars
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences with geometric heuristics, the system reconstructs the reading order and structural hierarchy of documents to ensure accurate data representation. The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recognition, and formula extraction into a unified pipeline. It serializes all extracted features and spatial coordinates into a standardized format, ensuring that output remains consistent for downstream integration. To support verification, the tool includes a diagnostic suite that generates visual overlays, allowing users to inspect segmentation boundaries and reading order directly against the original source files. The software provides a comprehensive framework for automated data extraction, organizing parsed elements into a page-based structure suitable for large-scale information retrieval. It is distributed as a Python-based package, with documentation and installation instructions available in the repository.
MinerU is a specialized document parsing and layout analysis pipeline that excels at converting complex PDFs into structured data, providing the essential ingestion layer required for RAG pipelines even though it focuses on extraction rather than vector database integration.
Python | Deployment & Serving | Document Layout Analysis | Automated Data Extraction
Less-relevant matches (scored below the primary cut)
langchain-ai/rag-from-scratch
7,393 stars
This project is an educational implementation guide and framework for building Retrieval Augmented Generation systems. It provides a workflow for constructing a knowledge base pipeline that partitions documents, indexes them as vectors, and provides external context for language model prompts. The system features a document chunking framework that uses recursive character splitting to fit text into model context windows. It includes an in-memory vector store and a similarity search system that retrieves relevant text segments by calculating the mathematical distance between dense embedding vectors. The project covers the end-to-end RAG pipeline development process, including custom data indexing, vector search implementation, and context management for large language models. The implementation is provided as a series of Jupyter Notebooks.
This repository is an educational guide and collection of notebooks demonstrating how to build RAG systems rather than a reusable library or tool for processing document ingestion pipelines.
Jupyter Notebook | Text Chunks | Recursive Character Splitting
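The recursive character splitting mentioned in the entry above can be sketched in a few lines. This is a minimal, dependency-free illustration of the general idea (try coarse separators first, fall back to finer ones only for pieces that are still too long); it is not the code from the repository's notebooks, it omits the chunk overlap that production splitters usually add, and the function name and separator order are assumptions.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most max_len characters.

    Separators are tried from coarsest to finest, so paragraph and line
    boundaries are preserved whenever the pieces already fit.
    """
    if len(text) <= max_len:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut at a fixed width.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # The part alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, max_len, rest or ("",)))
            current = ""
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("alpha beta gamma\n\ndelta epsilon", max_len=12)
# chunks == ["alpha beta", "gamma", "delta", "epsilon"]
```

In practice `max_len` would be chosen from the embedding model's context window, often measured in tokens rather than characters.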
unclecode/crawl4ai
68,644 stars
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages large-scale data collection via asynchronous task queuing. It employs adaptive crawling algorithms to determine when sufficient information has been gathered to satisfy specific requests, while simultaneously managing browser sessions, proxies, and authentication to navigate modern web environments. The system supports integration with autonomous agents through standardized communication protocols, allowing external tools to access live web data and browser capabilities directly. Beyond core extraction, the project provides a flexible pipeline that allows for custom logic injection through middleware hooks for specialized processing or authentication requirements. It includes tools for monitoring system health and performance during high-volume operations, ensuring reliable job management across diverse environments. The entire engine is packaged for containerized deployment, providing consistent execution across different hardware and hosting configurations.
This tool is a web crawling and data extraction engine designed to convert web content into structured formats, serving as a data-gathering building block for an ingestion pipeline rather than a complete document-to-vector embedding system.
Python | Markdown Converters | Structured
mozilla/pdf.js
53,454 stars
This project is a portable document rendering engine designed to parse and display complex document layouts directly within standard web browser environments. It functions as a web-native viewer that enables the presentation of documents without requiring external software or browser plugins. The engine utilizes a canvas-based rendering layer to map document page data onto standard web drawing surfaces, ensuring high-fidelity visual output. To maintain interface responsiveness, it offloads heavy parsing and object extraction tasks to background threads. The system also employs asynchronous byte-range fetching to retrieve only the necessary parts of a document on demand, allowing for immediate viewing without waiting for the entire file to download. The library provides a comprehensive set of tools for client-side processing, including text extraction and the ability to handle multi-page documents. It manages document data through low-level binary buffers and uses web-compatible font processing to ensure that text renders identically to the original file layout. Developers can integrate these capabilities to load remote documents, navigate through pages, and apply precise viewport transformations for custom display logic.
This is a specialized PDF parsing and rendering library that can extract raw text for an ingestion pipeline, but it lacks the built-in chunking, embedding, and vector database integration required for a complete ingestion system.
JavaScript | JavaScript Document Parsers
asg017/sqlite-vec
6,961 stars
sqlite-vec is a C-based vector library and SQLite extension that adds virtual tables for storing and querying high-dimensional embeddings. It functions as a database plugin for performing nearest neighbor searches using distance metrics such as L2, cosine, and Hamming distance. The project provides a portable embedding store that supports deployment across Android, iOS, desktop environments, and web browsers via WebAssembly. It distinguishes itself by converting numerical arrays into compact binary formats and utilizing quantization to reduce the memory footprint and storage size of vector indexes. The library covers a broad range of vector operations, including similarity querying, vector arithmetic, and data transformation. It also includes capabilities for metadata filtering, key-based index sharding, and the attachment of auxiliary data to vector records. The extension can be integrated into projects using C, C++, Go, Ruby, and Rust, and it is compatible with Datasette and distributed SQLite environments.
This is a specialized vector database extension for SQLite that handles storage and similarity search, but it lacks the document parsing, text chunking, and ingestion pipeline features required to process raw text into embeddings.
C | Vector Database Integrations
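The three distance metrics named in the sqlite-vec entry above, and the way quantization shrinks a vector, can be illustrated with plain Python. This sketch does not call sqlite-vec; the function names are hypothetical, cosine distance is assumed to follow the common convention of one minus cosine similarity, and sign-based binary quantization is shown as one simple quantization scheme.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def binary_quantize(v):
    """Keep only the sign of each component: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in v]


def hamming_distance(bits_a, bits_b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(bits_a, bits_b))


a = [1.0, 0.0, -1.0]
b = [0.0, 1.0, -1.0]
bits_a, bits_b = binary_quantize(a), binary_quantize(b)  # [1, 0, 0] and [0, 1, 0]
```

A binary-quantized vector needs one bit per dimension instead of a 32-bit float, and Hamming distance on those bits is a cheap, coarse stand-in for the full-precision metrics.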
camel-ai/camel
17,253 stars
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-evaluate reasoning traces, ensuring high-quality results. To maintain operational integrity, the system enforces schema-based output parsing for reliable workflow integration and utilizes sandboxed environments for secure, isolated code execution. Beyond its core orchestration capabilities, the project includes a suite of utilities for retrieval-augmented generation and synthetic data production. It supports persistent memory management via vector-based context retrieval and provides extensive tooling for web automation, API integration, and human-in-the-loop oversight. The platform is designed to be model-agnostic, offering a consistent interface for interacting with a wide range of proprietary and open-source language models.
This is a multi-agent orchestration framework that includes RAG utilities as a secondary feature, rather than a dedicated document ingestion and embedding pipeline tool.
Python | Markdown Converters | PDF Parsers | Structured Data Extraction
datawhalechina/tiny-universe
4,505 stars
Tiny Universe is an educational monorepo that delivers multiple independent implementations of core AI subsystems as self-contained Jupyter notebooks. It provides from-scratch constructions of foundational architectures including a complete Transformer model built from the original paper specification, a denoising diffusion probabilistic model for image generation, and a ReAct-style autonomous agent framework that equips an LLM with tools for planning and multi-step task execution. The project distinguishes itself by covering the full lifecycle of modern AI systems through hands-on implementations. It includes retrieval-augmented generation pipelines that combine vector databases with knowledge graphs, a GraphRAG system that constructs knowledge graphs from text and generates hierarchical community summaries, and a two-stage evaluation pipeline that scores model outputs against reference answers using metrics like F1, ROUGE, and accuracy. The repository also demonstrates reinforcement learning fine-tuning, automated document review workflows that detect deviations and generate revision suggestions, and iterative image optimization that evaluates and improves generated images against text prompts. Beyond these core areas, Tiny Universe explores the internal mechanisms of large language models with walkthroughs of grouped query attention, rotary position embeddings, and causal masking. It covers data processing techniques such as semantic chunking by sentence shifts, vector embedding pipelines for similarity-based retrieval, and hybrid search strategies that fuse sentence-level similarity with domain-specific term importance. The project also includes image quality evaluation using Inception Score and Fréchet Inception Distance, as well as image-text consistency checking with vision-language models. 
All implementations are delivered as self-contained Jupyter notebooks within a single repository, making the code directly runnable and inspectable for educational purposes.
This repository is an educational collection of Jupyter notebooks demonstrating AI concepts rather than a production-ready tool or library for building document ingestion and embedding pipelines.
Jupyter Notebook | Text Embeddings
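The "semantic chunking by sentence shifts" mentioned in the Tiny Universe entry can be read as starting a new chunk wherever adjacent sentences stop resembling each other. The sketch below follows that reading and is not taken from the repository. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the threshold value and all names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.15):
    """Group consecutive sentences; split where similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine_similarity(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]


sentences = [
    "Vector databases store embeddings.",
    "Embeddings power vector search.",
    "Bananas are rich in potassium.",
    "Potassium supports healthy muscles.",
]
result = semantic_chunks(sentences)
# result has two chunks: the two database sentences, then the two potassium sentences
```

With a real embedding model the same loop applies unchanged; only `embed` and the threshold need to be replaced and tuned.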
pymupdf/PyMuPDF
9,086 stars
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. Its broader capability surface covers optical character recognition for creating searchable text layers, detailed data extraction of tables and key-value pairs, and security operations including AES/RC4 encryption and permanent content redaction. The library also handles complex document geometry, layout analysis, and the generation of PDFs from HTML and CSS. The library supports multi-format document loading for PDF, EPUB, MOBI, SVG, and Office files, with the ability to process files via memory streams.
This is a high-performance document parsing and extraction library that serves as a foundational building block for RAG pipelines, but it lacks the built-in vector database integration and orchestration features required to be a complete ingestion and embedding pipeline.
Python | PDF Parsers | Structured Data Extraction
google-research/bert
39,869 stars
This project is a transformer-based language model and natural language processing toolkit designed to generate deep contextual representations of text. By utilizing a transformer-based encoder architecture, the system processes input sequences through stacked self-attention layers to capture the semantic meaning of tokens based on their surrounding sentence structure. The model distinguishes itself through bidirectional contextual processing, which analyzes text in both directions simultaneously, and masked language modeling, which trains the system by predicting hidden tokens within a sequence. It also employs next sentence prediction to understand relationships between text segments and utilizes shared parameter multilingualism to maintain a unified structure across diverse languages. Beyond these core capabilities, the toolkit provides utilities for subword-based tokenization to manage vocabulary and punctuation, as well as functionality for generating high-dimensional contextual embeddings. It supports the development of question answering systems by identifying specific start and end positions for text segments within a document.
This repository provides the foundational transformer model and tokenization utilities for generating embeddings, but it lacks the document ingestion, chunking, and vector database integration required for a complete RAG pipeline.
Python | Contextual Embedding Generation