Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment.
What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings for 18 programming languages, a Model Context Protocol (MCP) server for direct AI agent integration, and a REST API with an OpenAPI schema. The extraction pipeline is plugin-based and configurable, supporting multiple OCR backends (Tesseract, PaddleOCR, EasyOCR, and vision-language models) with quality-based fallback, parallel batch processing with work-stealing, and ONNX Runtime model inference with hardware acceleration for CPU, GPU, or NPU.
Beyond core text extraction, Kreuzberg provides a document enrichment pipeline that includes page classification, named entity recognition, summarization, translation, captioning, and PII redaction. It prepares content for retrieval-augmented generation (RAG) workflows by chunking text, generating vector embeddings, and reranking results. The system also supports structured data extraction via LLMs, source code extraction from 306 programming languages, and transcription of audio and video files using Whisper ONNX models.
The project is available as a library installable via standard package managers, a CLI tool installable via Homebrew or Docker, and a production-ready deployment option with a Helm chart for Kubernetes.
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
Langroid is a multi-agent orchestration framework and tool integration suite designed for building complex AI applications. It serves as a multi-modal integration layer that connects diverse local and remote language models with an agentic retrieval-augmented generation system. The project distinguishes itself through a collaborative message-exchange paradigm, allowing specialized agents to delegate tasks hierarchically and coordinate via structured communication. It features an advanced state management system for conversational AI, including the ability to rewind and prune conversation hist
pdfminer.six is a programmatic tool for extracting text, layout information, and metadata from PDF documents into machine-readable formats. It functions as a document parser that converts internal PDF objects and structures into accessible data objects for analysis. The project includes utilities for decrypting RC4 and AES encrypted files to enable content extraction. It also provides a layout analyzer to identify fonts, colors, and text locations to determine the organizational structure of pages. The system covers a broad range of extraction capabilities, including the retrieval of embedde
A fast, helpful, and open-source document parser