Tools and libraries for splitting text documents into segments and generating vector embeddings for semantic search.
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabilities include extracting images and structured data based on predefined schemas, as well as chunking documents for retrieval-augmented generation pipelines. The project supports high-volume processing by distributing conversion tasks across multiple GPUs.
Marker is a specialized document parsing and ingestion engine that converts complex files into structured formats and provides built-in chunking for RAG pipelines, serving as a critical preprocessing step for embedding workflows.
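The chunking step mentioned above can be illustrated with a minimal sketch. This is not Marker's API; the function name chunk_text and its chunk_size and overlap parameters are hypothetical, and it assumes simple whitespace tokenization where a real pipeline would more likely split on layout blocks or model tokens.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; each chunk repeats the last
    `overlap` words of the previous one so context survives the boundary."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # otherwise a trailing chunk made only of overlap would be emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each resulting chunk would then be passed to an embedding model; the overlap trades some index size for a lower chance of cutting a relevant passage in half.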
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVINO. It features a specialized ability to translate natural language queries into structured SQL or CSV formats by analyzing database schemas. The framework covers a broad range of capabilities including end-to-end retrieval-augmented generation pipelines, hybrid search engines, and multimodal content processing for PDFs, Office documents, audio, and images. It also incorporates tools for structured function calling, named entity recognition, and text risk classification to detect toxicity and prompt injections. The system integrates with various SQL and vector database backends to manage knowledge collection indexing and document embeddings.
This framework provides a comprehensive suite for RAG pipelines, including built-in document parsing for various formats, text chunking, metadata extraction, and direct integration with multiple vector database backends.
This project is a retrieval-augmented generation pipeline designed for building custom ChatGPT plugins that allow language models to query private or professional documents. It implements a full retrieval workflow, from processing and indexing document chunks to retrieving relevant context for natural language queries. The system distinguishes itself through a hybrid retrieval approach that combines dense vector embeddings with sparse keyword matching, further refined by a two-stage semantic re-ranking process. It includes specialized data privacy tools for screening personally identifiable information and secures private data stores using OAuth-based user authentication. The capability surface covers multi-format file indexing for PDF, DOCX, and PPTX files, alongside document ingestion from JSON and ZIP archives. It supports multiple vector storage backends, including PostgreSQL with pgvector, Redis, and cloud-native services. The architecture is designed for containerized deployment via Docker and includes tools for metadata extraction and real-time data synchronization through webhooks. The project provides a local development server with pre-configured routing and security to verify plugin functionality before deployment.
This project provides a comprehensive document ingestion and embedding pipeline that handles multi-format parsing, text chunking, metadata extraction, and integration with multiple vector database backends, making it a complete solution for RAG workflows.
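The hybrid retrieval idea described above, combining a dense-vector ranking with a sparse keyword ranking, can be sketched as follows. The source does not say how this project merges the two result lists; reciprocal rank fusion (RRF) is used here only as one common, simple option, and the function name and the constant k = 60 are assumptions for illustration.

```python
from __future__ import annotations

from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a vector index and a keyword index.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

A re-ranking stage, as the description mentions, would typically rescore only the top few fused results with a more expensive model.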
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings for 18 programming languages, a Model Context Protocol (MCP) server for direct AI agent integration, and a REST API with an OpenAPI schema. The extraction pipeline is plugin-based and configurable, supporting multiple OCR backends (Tesseract, PaddleOCR, EasyOCR, and vision-language models) with quality-based fallback, parallel batch processing with work-stealing, and ONNX Runtime model inference with hardware acceleration for CPU, GPU, or NPU. Beyond core text extraction, Kreuzberg provides a document enrichment pipeline that includes page classification, named entity recognition, summarization, translation, captioning, and PII redaction. It prepares content for retrieval-augmented generation (RAG) workflows by chunking text, generating vector embeddings, and reranking results. The system also supports structured data extraction via LLMs, source code extraction from 306 programming languages, and transcription of audio and video files using Whisper ONNX models. The project is available as a library installable via standard package managers, a CLI tool installable via Homebrew or Docker, and a production-ready deployment option with a Helm chart for Kubernetes.
Kreuzberg is a comprehensive document ingestion and embedding engine that natively handles multi-format parsing, metadata extraction, text chunking, and vector embedding generation specifically for RAG pipelines.
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex queries through iterative processing and tool-calling, while its hybrid retrieval orchestration combines vector similarity and full-text search with re-ranking to improve the accuracy of retrieved context. The framework also features event-driven streaming, which delivers incremental results from long-running pipelines to the user interface in real-time. Beyond its core reasoning capabilities, the platform includes a suite of functional modules for the entire lifecycle of document-based applications. This includes multi-modal parsing for extracting text, tables, and visual elements from diverse file formats, as well as administrative tools for managing document collections, vector stores, and multi-user access. The system is designed to be interface-agnostic, allowing developers to wrap third-party libraries and external services into standardized, reusable processing units. The project provides a web-based user interface for interactive querying and configuration, and it supports deployment of private, isolated instances through predefined templates.
Kotaemon is a comprehensive RAG orchestration framework that natively handles the entire document ingestion and embedding pipeline, including multi-format parsing, chunking, and vector database integration.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture that supports directed acyclic graph orchestration, allowing users to chain complex transformation pipelines while maintaining metadata, spatial context, and hierarchical relationships across extracted elements. The system covers a broad capability surface, including extensive connectivity to cloud storage, databases, and collaboration platforms, alongside robust data export options for vector databases and search indices. It enforces enterprise security standards through isolated multi-tenant infrastructure, role-based access control, and private network connectivity, ensuring that sensitive data remains secure throughout the entire transformation lifecycle. Operational visibility is maintained through integrated job monitoring, event-driven notification systems, and audit logging. The platform is designed for deployment within private cloud environments, supporting scalable, asynchronous processing of high-volume document batches.
This platform is a comprehensive engine for document ingestion and transformation that natively handles chunking, multi-format parsing, and metadata extraction to prepare unstructured data for vector database integration in RAG pipelines.
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy through validation against predefined templates. Its pipeline-based architecture supports pluggable processing backends, allowing for the dynamic integration of specialized engines for optical character recognition and complex visual layout analysis. Users can control parsing behavior and extraction parameters through declarative configuration files, facilitating integration into automated workflows and server-based architectures. The library provides both a programmatic interface and a command-line toolkit to support automated document processing and format conversion. It utilizes optional dependency management to allow for modular installation of specific features, such as media rendering or advanced processing capabilities, depending on the requirements of the application.
Docling is a powerful document parsing and layout analysis framework that serves as a critical ingestion layer for RAG pipelines by converting complex, multi-format documents into structured, machine-readable data.
localGPT is a private AI knowledge base and retrieval-augmented generation application. It provides a local document indexer, a hybrid search engine, and an inference interface to enable chatting with private documents and managing a self-hosted information repository without sending data to external servers. The system distinguishes itself through a dual-pass verification pipeline that ensures generated answers are grounded in retrieved sources, accompanied by explicit source attribution. It employs a hybrid retrieval approach combining semantic vector search with keyword matching and reranking, and utilizes recursive query decomposition to break complex requests into smaller parallel sub-queries. The platform covers broad capability areas including multi-format document processing, dynamic query routing, and semantic query caching. It also manages conversation history tracking and provides a RESTful API for integrating document retrieval and language model functionality into external applications. The project integrates with open-source models across different hardware accelerators and includes system health monitoring via structured logs and health endpoints.
This is a comprehensive RAG application that includes built-in document ingestion, multi-format parsing, advanced chunking strategies, and vector database integration for local document processing.
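Semantic query caching, listed among the capabilities above, can be sketched roughly as below. This is not localGPT's implementation: the SemanticCache class, the 0.8 threshold, and the bag-of-words embed function are all hypothetical stand-ins, and a real system would use a learned embedding model in place of word counts.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (assumption for illustration)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored answer when a new query is close enough to an old one."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str) -> str | None:
        query_vec = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only reuse the answer if the closest cached query is similar enough.
        return best_answer if best_score >= self.threshold else None
```

On a cache miss the caller would run the full retrieval and generation path and then store the result with put.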
MaxKB is a self-hosted retrieval-augmented generation platform designed to connect internal document repositories with large language models. It functions as an enterprise knowledge management system that enables organizations to query private data through a conversational interface, providing automated responses based on uploaded files and internal business information. The platform distinguishes itself by normalizing diverse data sources into a unified index, which is then processed through chunking and vector-based retrieval to ensure context-aware results. It manages session state and prompt templates to maintain coherence across multi-turn interactions, allowing the system to serve as an automated customer support bot or an internal policy assistant. Beyond its core retrieval capabilities, the system supports the automation of administrative tasks and the generation of professional business content. It provides the infrastructure to deploy intelligent chatbots capable of resolving inquiries and accessing company guidelines without manual intervention.
MaxKB is a comprehensive RAG platform that provides built-in document parsing, text chunking, and vector database integration to transform raw files into searchable knowledge bases for LLM pipelines.
This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasoning workflows. By integrating document intelligence with advanced retrieval pipelines, the platform enables the creation of grounded, verifiable responses supported by traceable citations. The platform distinguishes itself through deep document understanding and sophisticated knowledge orchestration. It supports complex document parsing, including the extraction of tables and images, and utilizes graph-based indexing to enhance reasoning over large document collections. Users can configure multiple recall strategies and fused re-ranking to optimize retrieval accuracy, while the system maintains context through multi-turn dialogue management and flexible tool-use frameworks. The architecture is built on a modular, containerized microservice foundation that supports both local inference engines and external language model APIs. It includes asynchronous task processing for document ingestion and indexing, ensuring system responsiveness during heavy workloads. The platform also provides a standardized interface for model abstraction, allowing for seamless integration with existing language model ecosystems. Developers can interact with the platform through a comprehensive suite of RESTful endpoints and Python client libraries, which cover the full lifecycle of agents, datasets, and knowledge graphs. The system is designed for flexible deployment, offering configurable environment settings and support for custom containerized environments to facilitate local development and infrastructure portability.
This platform provides a complete end-to-end pipeline for document ingestion, including advanced parsing for complex formats, automated chunking, and integrated vector database support for RAG workflows.
This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to provide context-aware responses for chat and completion requests. The system distinguishes itself through a database-agnostic abstraction layer that supports various storage backends, ranging from local disk storage to enterprise-grade vector databases. It offers flexible deployment options, enabling users to run language models entirely on private hardware or connect to external cloud-based providers through a unified interface. To improve the quality of generated output, the engine incorporates reranking logic that refines retrieved document chunks before they are processed by the language model. The platform includes a comprehensive suite of tools for managing document intelligence pipelines, including automated parsing, text chunking, and embedding generation. Users can configure the system through environment-based profiles to match specific hardware capabilities, such as CPU or GPU-accelerated setups, and stream responses in real time to reduce latency. The application is configured via runtime settings files and environment variables, with support for building custom container images to suit specific deployment requirements.
PrivateGPT is a RAG pipeline that natively handles document ingestion, multi-format parsing, text chunking, embedding generation, and integration with various vector databases.
Reader is an AI data ingestion pipeline and web content parser designed to convert websites and documents into clean markdown for use with large language models. It functions as a headless browser content extractor and web-to-markdown converter, transforming URLs and PDF files into structured text formats while removing irrelevant web clutter. The system supports retrieval-augmented generation by retrieving web search results and re-ranking them to improve context relevance. It further enhances content accessibility by using vision models to generate descriptive captions for images and creating vector embeddings for semantic retrieval. The tool provides broad capabilities for document conversion, web content extraction, and data preprocessing. These include headless browser rendering for JavaScript execution, multi-format conversion of office documents, and bucket-based content caching to reduce latency. The conversion engine can be deployed as a self-hosted container including all necessary headless browsers and document processors.
This tool functions as a specialized ingestion and preprocessing layer for RAG pipelines by converting web content and documents into structured markdown, though it focuses more on extraction and parsing than on managing the full vector database lifecycle.
Verba is a retrieval-augmented generation interface and chatbot that uses Weaviate to provide factual answers based on private datasets. It functions as a vector database knowledge base, combining a hybrid search engine with an orchestration interface to connect various large language model providers and embedding services. The system differentiates itself through a RAG pipeline manager for adjusting text chunking rules and retrieval settings, alongside a 3D vector space visualization tool for analyzing the spatial organization and clustering of high-dimensional embeddings. It employs a modular provider system that allows for swapping between different local and cloud text generation and embedding services. The platform covers multi-modal data ingestion, processing unstructured documents, audio transcriptions, web crawls, and version control repositories into a searchable knowledge base. Its retrieval capabilities combine semantic and keyword search to extract relevant context from vector stores, utilizing configurable text chunking to optimize retrieval precision.
Verba is a RAG application that includes built-in document ingestion, configurable text chunking, and vector database integration, making it a complete pipeline for processing and querying private datasets.
Quivr is a retrieval-augmented generation platform designed to transform raw documents into searchable knowledge bases. It functions as a centralized environment where users can ingest files, index them into vector databases, and interact with language models to receive contextually relevant, data-backed responses. The platform distinguishes itself through an agentic workflow orchestrator that sequences retrieval tasks, tool execution, and model interactions to resolve complex, multi-step queries. This engine is entirely configuration-driven, allowing users to define document ingestion, chunking parameters, and workflow node sequences through structured schemas. By maintaining a unified knowledge management interface, the system tracks chat history alongside file storage, ensuring that interactions remain context-aware across diverse local and remote backends. Beyond its core orchestration, the system provides a comprehensive pipeline for document processing, including parsing for various file formats and asynchronous task execution to maintain responsiveness during data ingestion. It supports the development of specialized chatbots, including voice-enabled interfaces, by integrating speech-to-text and text-to-speech capabilities with its underlying retrieval systems. The project utilizes strict base classes to enforce configuration integrity, ensuring consistent data processing across all application settings.
Quivr is a RAG platform that provides a complete pipeline for document ingestion, multi-format parsing, text chunking, and vector database integration.
This platform serves as a comprehensive environment for managing private language models, document knowledge bases, and automated agent workflows within secure local infrastructure. It functions as a document-aware workspace that enables users to ingest diverse file formats into searchable repositories, ensuring that all data processing and model inference remain within private, local environments to maintain data sovereignty. The system distinguishes itself through a modular agentic engine that allows for the definition of custom skills and external tool execution. By utilizing a multi-model abstraction layer, it normalizes interactions across various local and cloud-based providers, while workspace-scoped management ensures that system prompts and knowledge bases remain isolated to meet specific operational requirements. Beyond core orchestration, the platform includes a document-parsing pipeline that converts files into structured text for semantic retrieval via local vector indexing. Users can further extend functionality through command-line triggers and persistent system instructions, standardizing how artificial intelligence behaves across different business contexts.
This platform provides a complete end-to-end RAG pipeline that handles document parsing, text chunking, and vector database integration, making it a comprehensive solution for processing raw text into searchable embeddings.
Claude-context is a retrieval-augmented generation pipeline and semantic code search tool. It functions as an LLM codebase indexer and RAG context provider, designed to index local directories and retrieve relevant code files to provide context for large language models. The system operates as a hybrid search engine that combines keyword matching with dense vector search. This allows for the retrieval of code snippets and logic using natural language queries based on meaning rather than exact text matches. The project covers codebase indexing and search index management, utilizing asynchronous processing and recursive directory traversal. It incorporates index filtering rules to manage which files are included and employs a combination of semantic encoding and local vector storage to maintain a searchable representation of the source code.
This tool provides a complete RAG pipeline specifically for codebase indexing, handling recursive directory traversal, semantic chunking, and vector storage to prepare context for LLMs.
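Recursive directory traversal with index filtering rules, as described above, might look roughly like this. It is a sketch under stated assumptions rather than Claude-context's actual code: collect_files, the include patterns, and the excluded directory names are all hypothetical.

```python
from __future__ import annotations

import fnmatch
import os


def collect_files(
    root: str,
    include: tuple[str, ...] = ("*.py",),
    exclude_dirs: tuple[str, ...] = ("node_modules", ".git"),
) -> list[str]:
    """Walk `root` recursively and return matching file paths relative to it."""
    matched: list[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in include):
                full_path = os.path.join(dirpath, name)
                matched.append(os.path.relpath(full_path, root))
    return sorted(matched)
```

The returned paths would then be read, split into chunks, and embedded; pruning ignored directories before descending keeps the walk cheap on large repositories.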
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document structures and formatting requirements. This flexibility is supported by an integrated optical character recognition capability that ensures text recovery from embedded images during the conversion process. The system provides both a command-line interface and a programmatic library, facilitating automated batch processing and custom integration into data pipelines. To ensure consistent performance across different environments, the project supports deployment within containerized architectures that encapsulate all necessary system-level dependencies and binaries.
This tool serves as a specialized document parsing and extraction engine that handles multi-format conversion and layout analysis, providing the essential ingestion layer required to prepare raw files for downstream embedding pipelines.
This project is a framework for building custom AI chatbots capable of PDF document analysis. It implements retrieval-augmented generation to connect a large language model to private document data. The system utilizes graph-based agent orchestration to control conversation flow and decision logic. It maintains context across interactions through thread-based state management and delivers AI responses to the user interface via real-time streaming. The project covers PDF document ingestion through chunk-based processing and vector-store retrieval. It includes query-based retrieval that extracts relevant excerpts from ingested documents to ground the model's answers.
This project provides a complete RAG pipeline for PDF document ingestion, including text chunking, vector database integration, and retrieval logic, making it a functional tool for processing documents into embeddings.
PocketFlow is a graph-based framework for designing and executing large language model operations and reasoning patterns. It serves as an orchestrator for building goal-oriented autonomous agents, multi-agent systems, and retrieval-augmented generation pipelines. The system is distinguished by its ability to coordinate autonomous AI agents that use shared memory and tools to solve complex goals, supported by a structured output engine that enforces schema-consistent responses. It utilizes graph-based workflow orchestration to manage sequences of model operations and supports supervisor-based coordination for task delegation and self-correction. The platform covers a broad range of capabilities, including asynchronous task runtimes, hierarchical workflow nesting, and map-reduce parallel execution for large-scale data processing. It integrates vector database management for semantic retrieval and includes observability tools such as execution stack tracing and workflow hierarchy visualization. Reliability is managed through automatic retry logic and response guardrails.
PocketFlow is an agentic workflow orchestrator that includes built-in RAG pipeline capabilities and vector database integration, making it a suitable framework for building document ingestion and embedding processes.
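The combination of map-reduce parallel execution and automatic retry logic mentioned above can be sketched in plain Python. This does not use PocketFlow's own classes; with_retry and map_reduce are hypothetical helpers, and a thread pool is assumed as the source of parallelism.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable


def with_retry(fn: Callable[[Any], Any], max_retries: int = 3) -> Callable[[Any], Any]:
    """Wrap `fn` so that a failing call is attempted up to `max_retries` times."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    def wrapper(item: Any) -> Any:
        for attempt in range(1, max_retries + 1):
            try:
                return fn(item)
            except Exception:
                # Re-raise only when the final attempt has also failed.
                if attempt == max_retries:
                    raise

    return wrapper


def map_reduce(
    items: Iterable[Any],
    map_fn: Callable[[Any], Any],
    reduce_fn: Callable[[list], Any],
    max_retries: int = 3,
) -> Any:
    """Apply `map_fn` to every item in parallel, then combine the results."""
    safe_map = with_retry(map_fn, max_retries)
    with ThreadPoolExecutor() as pool:
        # Executor.map preserves input order, so reduce_fn sees ordered results.
        mapped = list(pool.map(safe_map, items))
    return reduce_fn(mapped)
```

In a document pipeline, map_fn could embed one chunk and reduce_fn could collect the vectors for a bulk insert into a vector store.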
ruby_llm is an LLM integration framework and AI agent orchestrator designed to connect applications to multiple large language model providers through a unified interface. It serves as a toolkit for building autonomous assistants with custom personas, managing structured output via JSON schemas, and implementing vector embedding engines for semantic search. The project distinguishes itself as an observability suite and multimodal toolkit. It provides specialized capabilities for tracking token usage, calculating model costs, and tracing workflows via OpenTelemetry, while supporting the processing of images, audio, video, and documents through a consistent API. The framework covers a broad surface of AI infrastructure, including retrieval-augmented generation workflows, multi-step task orchestration, and the ability to expose local Ruby methods as tools for AI models to execute. It also provides utilities for content moderation, multimodal data extraction, and concurrent request management. The system includes tools to bootstrap AI infrastructure using database migrations and configuration files.
This framework provides a comprehensive toolkit for building RAG pipelines in Ruby, including support for vector embedding engines, semantic similarity calculations, and document-based metadata extraction.