Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment.
What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings for 18 programming languages, a Model Context Protocol (MCP) server for direct AI agent integration, and a REST API with an OpenAPI schema. The extraction pipeline is plugin-based and configurable. It supports multiple OCR backends (Tesseract, PaddleOCR, EasyOCR, and vision-language models) with quality-based fallback, parallel batch processing with work stealing, and ONNX Runtime model inference on CPU, GPU, or NPU hardware.
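To make the idea of quality-based OCR fallback concrete, here is a minimal Python sketch. It is not Kreuzberg's API: the `OcrResult` type, the `ocr_with_fallback` function, the `min_quality` threshold, and the convention that a backend returns a `(text, quality)` pair are all assumptions made for illustration. The sketch tries backends in order, returns the first result that meets the threshold, and otherwise falls back to the best result seen.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class OcrResult:
    backend: str    # name of the backend that produced the text
    text: str
    quality: float  # hypothetical quality score in [0, 1]


# Assumption: a backend is any callable mapping image bytes to (text, quality).
Backend = Callable[[bytes], Tuple[str, float]]


def ocr_with_fallback(
    image: bytes,
    backends: Sequence[Tuple[str, Backend]],
    min_quality: float = 0.8,
) -> Optional[OcrResult]:
    """Try each backend in order; stop at the first acceptable result."""
    best: Optional[OcrResult] = None
    for name, run in backends:
        try:
            text, quality = run(image)
        except Exception:
            # A failing backend is skipped so the next one can be tried.
            continue
        result = OcrResult(name, text, quality)
        if quality >= min_quality:
            return result
        # Remember the best sub-threshold result as a last resort.
        if best is None or quality > best.quality:
            best = result
    return best
```

Ordering the backends from cheapest to most expensive means the costlier ones (for example, a vision-language model) only run when the cheaper ones produce low-quality output. The function returns `None` only if every backend fails.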
Beyond core text extraction, Kreuzberg provides a document enrichment pipeline that includes page classification, named entity recognition, summarization, translation, captioning, and PII redaction. It prepares content for retrieval-augmented generation (RAG) workflows by chunking text, generating vector embeddings, and reranking results. The system also supports structured data extraction via LLMs, source code extraction from 306 programming languages, and transcription of audio and video files using Whisper ONNX models.
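As an illustration of the chunking step in a RAG workflow, the following Python sketch splits text into fixed-size, overlapping character windows. It is a generic technique, not Kreuzberg's chunker: the function name `chunk_text` and the `max_chars` and `overlap` parameters are hypothetical, and a real chunker may split on tokens, sentences, or document structure instead.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of at most max_chars, sharing `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks: List[str] = []
    step = max_chars - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", max_chars=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary partially present in both neighboring chunks, which tends to help retrieval at the cost of some duplicated text in the index.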
Kreuzberg is distributed in three forms: as a library installable through standard package managers, as a CLI tool installable via Homebrew or Docker, and as a production-ready Kubernetes deployment packaged as a Helm chart.