Kreuzberg

Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment.

What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings for 18 programming languages, a Model Context Protocol (MCP) server for direct AI agent integration, and a REST API with an OpenAPI schema. The extraction pipeline is plugin-based and configurable, supporting multiple OCR backends (Tesseract, PaddleOCR, EasyOCR, and vision-language models) with quality-based fallback, parallel batch processing with work-stealing, and ONNX Runtime model inference with hardware acceleration for CPU, GPU, or NPU.
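The quality-based fallback across OCR backends mentioned above can be illustrated with a small, self-contained sketch. This is not Kreuzberg's API: `OcrResult`, `ocr_with_fallback`, the `min_quality` threshold, and the two stand-in backends are hypothetical names invented for this example. The idea is to try backends in order and accept the first result whose quality score clears the threshold, falling back to the best result seen otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class OcrResult:
    backend: str
    text: str
    quality: float  # hypothetical score: 0.0 (unusable) to 1.0 (clean)


# Assumption: a backend is any callable that maps image bytes to an OcrResult.
Backend = Callable[[bytes], OcrResult]


def ocr_with_fallback(
    image: bytes, backends: Sequence[Backend], min_quality: float = 0.8
) -> OcrResult:
    """Try backends in order; return the first result that meets
    min_quality, or the highest-quality result if none does."""
    if not backends:
        raise ValueError("at least one OCR backend is required")
    best: Optional[OcrResult] = None
    for backend in backends:
        result = backend(image)
        if result.quality >= min_quality:
            return result  # good enough, skip the remaining backends
        if best is None or result.quality > best.quality:
            best = result
    return best


# Stand-in backends that return canned results instead of running real OCR.
def fast_backend(image: bytes) -> OcrResult:
    return OcrResult("fast", "Inv0ice t0tal", 0.55)


def accurate_backend(image: bytes) -> OcrResult:
    return OcrResult("accurate", "Invoice total", 0.93)


chosen = ocr_with_fallback(b"fake image bytes", [fast_backend, accurate_backend])
print(chosen.backend, chosen.text)
```

Ordering the backends from cheapest to most expensive means the slower engine only runs when the faster one produces low-quality output.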

Beyond core text extraction, Kreuzberg provides a document enrichment pipeline that includes page classification, named entity recognition, summarization, translation, captioning, and PII redaction. It prepares content for retrieval-augmented generation (RAG) workflows by chunking text, generating vector embeddings, and reranking results. The system also supports structured data extraction via LLMs, source code extraction from 306 programming languages, and transcription of audio and video files using Whisper ONNX models.
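To make the chunking step for RAG workflows concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only and does not use Kreuzberg's chunker: `chunk_text` and its `size` and `overlap` parameters are hypothetical names, and sizes are measured in characters rather than tokens for simplicity.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `size` characters, where each
    chunk repeats the last `overlap` characters of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, size=12, overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

The overlap keeps text that straddles a chunk boundary visible in both neighbouring chunks, which helps retrieval when a relevant passage would otherwise be cut in half.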

The project is available as a library installable via standard package managers, a CLI tool installable via Homebrew or Docker, and a production-ready deployment option with a Helm chart for Kubernetes.

Features

Document Extraction Engines - Extracts text and metadata from PDFs, Office files, images, and 90+ formats using a Rust core with OCR and LLM support.

Multi-Format Parsers - Extracts text and structured data from PDFs, Office files, images, and other document formats for downstream processing.

Text Extraction - Imports the extraction engine into applications across 18 programming languages including Python and Rust.

AI Agent Servers - Exposes document extraction, embedding, and chunking tools through the Model Context Protocol for direct AI agent integration.

Document Extraction Tools - Exposes document extraction and processing tools to AI agents through the Model Context Protocol.

Document Chunking & Embedding - Chunks, embeds, and reranks document content to prepare it for retrieval-augmented generation workflows.

Document Processors - Chunks extracted text, generates vector embeddings, and reranks results for RAG pipelines.

Local Embedding Generators - Generates document embeddings locally with ONNX models for semantic search and RAG pipelines.

ONNX Runtime Inference - Runs all ML tasks, including OCR, embeddings, and reranking, through ONNX Runtime.

MCP Servers - Exposes extraction tools through the Model Context Protocol so AI agents can call them directly.

Hierarchical Representations - Returns a traversable tree of nodes with heading levels and inline annotations for knowledge graphs.

Tree Representations - Provides a traversable tree of document nodes with parent-child references for knowledge graph construction.

Vision-Language Model Backends - Leverages vision language models as an OCR backend and extracts structured JSON from documents using a schema, supporting 146 LLM providers.

Text Chunks - Splits text into chunks with heading paths for hierarchical context in RAG retrieval.

Custom OCR Backend Registrations - Recognizes text in images using Tesseract, PaddleOCR, EasyOCR, or VLM backends across 143 vision providers.

Optical Character Recognition - Processes image data through pluggable OCR backends supporting multiple languages and engines.

Semantic Chunking - Splits extracted text into sized chunks using recursive, semantic, or Markdown-aware strategies for LLM consumption.

Configurable Extraction APIs - Exposes document extraction as an HTTP API with configurable host, port, CORS, and upload limits.

Polyglot Language Bindings - Integrates document extraction into Python, TypeScript, Rust, Go, Java, and 12+ other languages via native bindings.

Document Text Extractors - Parses PDFs, Office files, images, HTML, email, archives, and academic formats into clean, structured text using a Rust core with SIMD acceleration.

Custom Format Extractors - Registers custom document extractors for proprietary or unsupported file formats through a plugin system.

Markup Extraction - Extracts text from HTML, Markdown, XML, and other markup files, converting structured content to plain text.

Text Extraction - Extracts text and metadata from a file path by detecting its MIME type from the extension and selecting the appropriate parser automatically.

Multi-Format Extractors - Extracts text and metadata from a file by detecting or validating the MIME type, selecting the appropriate extractor, and running a post-processing pipeline.

Core Crate Integrations - Integrates the core crate directly into Rust-native applications, CLI tools, API servers, or embedded systems.

Text Extraction and OCR - Recognizes text from scanned documents and images using multiple OCR backends with quality fallback.

Presentation Text Extractors - Extracts slide content, metadata, and embedded images from PowerPoint (PPTX) files.

Multi-Backend OCR Configurators - Selects from multiple OCR engines including Tesseract, PaddleOCR, EasyOCR, and VLM models to balance speed, accuracy, and language support.

Document Processing Pipelines - Processes thousands of documents with high throughput by calling directly into a compiled Rust core without subprocess or HTTP overhead.

Typed Element Extraction - Returns typed document elements like titles, paragraphs, and tables with page numbers for RAG pipelines.

Office Document Parsers - Extracts text from Excel, PowerPoint, and Word files using native Rust parsers, preserving structure like sheets, slides, and formatting.

Document Metadata Extraction - Pulls document properties like title, author, creation date, and format-specific metadata.

Multi-Format Extractors - Returns full document text with minimal formatting, per-page breakdowns, and structured tables and image metadata.

PDF Text Extraction - Extracts text content from PDF files using native Rust parsing, with support for metadata, images, and OCR fallback for scanned documents.

Semantic Element Extraction - Extracts logical content units with semantic classification, unique identifiers, and position metadata.

Document Table Extractors - Detects and extracts structured table data from documents, returning cells and markdown representation.

Extraction Configurations - Controls every extraction stage through a single configuration object loaded from TOML, YAML, or JSON.

Pipeline Configurations - Controls every stage of the extraction pipeline through a single configuration object.

Document Extraction Tools - Offers a CLI tool installable via Homebrew or Docker for extracting document content.

Multi-Format Document Ingestion - Ingests text from 96 formats including PDFs, Office docs, images, email, archives, source code, and niche formats.

Structured Data Extraction - Applies a JSON schema and an LLM to extracted text to return typed, structured output.

LLM-to-Structured Data Converters - Applies a JSON schema and an LLM to extracted text to return typed, structured output.

Byte Array Text Extractors - Extracts text and metadata from a byte array by validating the MIME type, selecting the appropriate extractor, and running a post-processing pipeline.

Extraction Configurations - Sets extraction options via programmatic config, TOML/YAML/JSON files, or environment variables.

MIME-Type Based Extractors - Registers a handler for a specific MIME type that extracts text from file paths or raw bytes with priority-based conflict resolution.

Single File Extraction - Extracts text content from a document file and prints it to standard output.

Text Chunking Parameters - Provides configurable text chunking with size, overlap, and chunker type for RAG pipelines.

CLI Tooling - Executes extraction tasks directly from the command line without writing code.

PDF Asset Extractions - Pulls metadata such as title and author from PDF files during extraction.

Language Bindings - Provides native packages for Python, TypeScript, Rust, Go, Java, C#, Ruby, PHP, and many other languages.

REST APIs - Starts an HTTP server that exposes extraction and embedding endpoints for language-agnostic access.

Document Processing APIs - Starts an HTTP server that exposes endpoints for extraction, batch processing, MIME detection, health checks, and cache management.

Polyglot FFI Exposures - Exposes a compiled Rust engine's API through FFI to 18+ languages.

Document Processing - Runs a full document extraction and analysis pipeline on private infrastructure with no data leaving the environment.

Self-Hosted Services - Runs the full extraction pipeline on private infrastructure so data never leaves the environment.

Model Context Protocol Servers - Exposes extraction tools through the Model Context Protocol for direct AI agent integration.

Private Data Processing Environments - Processes documents entirely on self-hosted infrastructure with no data leaving the environment.

File Format Detectors - Identifies file format by extension or MIME type, with fallback to content-based detection and manual override.

Extraction Pipeline Plugins - Builds extraction as a chain of typed registries for custom extractors and backends.

Document APIs - Serves a REST API for document extraction, embedding, and chunking on self-hosted infrastructure.

Integration SDKs - Integrates into any stack using native bindings for Python, TypeScript, Rust, Go, Java, C#, and many other languages.

Document Translators - Translates extracted text into a target language using any LLM provider with optional Markdown preservation.

Document Rerankers - Sorts documents by relevance to a query, optionally truncating to top-k results.

Cross-Encoder Rerankers - Selects pre-configured cross-encoder models for reranking by balancing speed and quality.

Document Structure Analysis - Builds a hierarchical tree of document elements with headings, tables, and content layers.

Document Summarization - Generates a prose summary of extracted content using a local TextRank or an LLM-powered abstractive backend.

Embedding Generators - Generates embeddings using a Rust-native engine with multiple presets.

GPU Acceleration - Offloads model inference to a GPU by pointing to a GPU-enabled runtime installation.

Language Detection Tools - Identifies languages present in extracted text using whatlang, supporting 60+ languages.

Document Chunking Strategies - Determines optimal chunking strategies for documents based on size, format, and content properties.

Model-Driven Text Extraction - Uses a Vision Language Model to extract Markdown text from image regions.

Vision-Language Region Extractors - Extracts Markdown text from image regions using a Vision Language Model.

Model Inference Accelerators - Selects a hardware-specific execution provider to speed up model-based processing tasks.

Named Entity Recognition - Identifies people, organizations, locations, and other entities in extracted text using ONNX or LLM providers.

Backend Selectors - Selects between ONNX engine or LLM backend with 143 providers for named entity recognition.

Heading Level Classifiers - Identifies heading levels (H1-H6) from font size clustering and semantic analysis in PDFs.

OCR - Uses pure-Rust VLM models like GLM-OCR and Hunyuan-OCR for OCR on complex layouts and low-quality scans.

OCR Command Line Interfaces - Performs OCR extraction on files directly from the terminal with configurable backends.

Result Reranking - Sends a query and documents to a POST endpoint and receives scored results in JSON.

Keyword Extraction - Identifies and ranks keywords from text using a configurable algorithm.

Automated Video Transcribers - Extracts speech-to-text from audio and video files using Whisper ONNX models, producing a plain-text transcript.

Per-Page Content Separators - Returns each page's content as a separate array entry in the extraction results.

Text Classification - Assigns labels to a plain text string using an LLM, without requiring a full extraction result.

LLM-Based Classifiers - Assigns labels to a single piece of plain text using a configured LLM, without requiring an extraction result.

Text Embedding Generators - Produces vector embeddings for a list of texts asynchronously, offloading ONNX inference to a blocking thread pool.

Text Summarization - Scores sentences and returns the top-N in original order to summarize extracted text.

Configurable Token Reduction - Applies configurable token reduction intensity to extracted text to lower costs when sending to language models.

Vector Embeddings - Converts text strings into numerical vector embeddings using configurable models for semantic search or RAG pipelines.

Image Captioning - Generates textual captions for images using a vision language model.

Custom Extractor Implementations - Provides a trait-based interface for implementing custom document extractors for new formats.

Browser-Based Text Extractors - Runs text extraction entirely in the browser or edge runtime via a WebAssembly build without server-side dependencies.

Email Text Extraction - Extracts text content and metadata from EML and MSG email files, including headers and attachments.

Structural Code Parsers - Parses source code files using tree-sitter to extract structure, imports, symbols, docstrings, and semantic chunks.

Enrichment Stages - Applies OCR, classification, summarization, translation, and redaction to extracted document content.

Extraction Confidence Scores - Combines OCR, text coverage, and other signals into a weighted confidence score for extraction results.

AI-Generated Captions - Generates text captions for extracted images using a vision-language model.

Per-Page Content Extractors - Returns text, tables, images, and hierarchy blocks for each individual page with byte offsets and bounding box coordinates.

Document Batch Processors - Processes multiple files concurrently for text extraction with concurrency management.

Document Furniture Filters - Strips headers, footers, watermarks, and repeating text from extracted document content.

Image Extractions - Pulls embedded images from documents with configurable DPI, dimension limits, and output format re-encoding.

Python Plugin Integrations - Integrates Python plugins with the Rust extraction pipeline via PyO3 with zero-copy buffers.

LLM-Based - Ships a configurable LLM-based page classification enrichment stage for document processing.

Output Format Rendering - Renders extracted text content as plain text, Markdown, HTML, or Djot markup.

Token-Efficient Serializations - Outputs extraction results as plain text, JSON, or TOON format optimized for LLM token efficiency.

Document Classification - Aggregates page-level classifications into a combined label set representing the whole document.

LLM-Based Document Classifiers - Aggregates page-level classifications across a document's text to produce a combined label set for the whole document.

Tree-Sitter Parsers - Identifies code structures from 306 programming languages via tree-sitter grammars.

Embedding Generation - Creates vector embeddings from document content using built-in ONNX models.

Hardware Acceleration - Selects execution provider (CPU, CoreML, CUDA, TensorRT) for ONNX Runtime model inference.

Parallel Batch Processing - Processes multiple files faster than sequential extraction using a work-stealing scheduler.

Document Batch Processors - Processes multiple file paths concurrently for text extraction in a single batch operation.

Structured Data File Extractors - Extracts text from JSON, YAML, TOML, CSV, and TSV files, preserving field names and tabular structure.

Text Vectorizers - Converts text into numerical embedding vectors using a configured ONNX model.

Archive Extraction Utilities - Extracts text content from files inside ZIP, TAR, 7-Zip, and Gzip archives without manual decompression.

Custom Plugin Registrations - Ships a trait-based plugin system for registering custom extractors, OCR backends, and validators.

Document Extraction Post-Processors - Modifies extraction output through ordered post-processing stages for cleanup, analysis, redaction, or reformatting without failing on errors.

Multi-Language Code Parsers - Parses functions, classes, imports, and symbols from code files across 306 programming languages for structured analysis.

Academic Format Text Extractors - Extracts text from LaTeX, EPUB, BibTeX, and Jupyter Notebooks among other academic formats.

Environment Variable Configurations - Sets runtime behavior via environment variables that override config files and defaults.

Document Extraction Containers - Mounts host directories and runs extraction commands inside a Docker container.

Custom Post-Processors - Registers custom post-processors that transform extraction results in early, middle, or late stages.

PDF to Image Rendering - Renders individual PDF pages as PNG images at configurable DPI for thumbnails, vision model input, or custom OCR pipelines.

Multi-Channel File Uploads - Uploads files through an API, SDK, CLI, or Docker for processing, supporting over 90 formats.

Email Content Parsing - Extracts headers, body content, and attachments from .eml and .msg email files.

Structured Result Deliveries - Returns a structured JSON response with full document structure and supports webhook delivery for asynchronous workflows.

Document Processing Schedulers - Distributes document extraction tasks across CPU cores using a work-stealing scheduler.

Document Content Redaction - Rewrites textual fields in extraction results by removing or masking patterns and NER-detected entities.

Batch Document Processing - Processes multiple documents concurrently with 2-5x throughput gains via parallelization.

Extraction Result Translators - Translates extracted document content into target languages using an LLM.

Plugin Extenders - Allows loading custom document extractors, OCR backends, and validators through typed registries.

Pipeline Plugin Registrations - Adds user-defined plugins to a typed registry for automatic dispatch during extraction.

LLM-Based Page Classifiers - Assigns labels to each page of an extraction result using an LLM, appending classifications and usage data to the result.

Extraction Validation Frameworks - Inspects extraction results against custom criteria and rejects them if they fail to meet requirements.

Document Extraction Timeouts - Aborts an extraction that exceeds a configured time limit and surfaces the timeout as a distinct error.

Extraction Quality Validators - Inspects extraction results against custom rules and halts the pipeline if the output fails to meet defined quality requirements.

Document Batch Processors - Processes multiple byte arrays concurrently for text extraction with concurrency management.

MIME Type Mappings - Identifies MIME type of a file using its extension and optionally its content.

Document Processing Runtimes - Enables fully local document processing in browsers via WebAssembly.

RAG Frameworks - Polyglot library for extracting text and metadata from diverse document formats.

Data Ingestion Pipelines - Polyglot library for document intelligence and extraction.

Document and File Processing - Extracts content from various document types using a Rust core.

Productivity and Collaboration - Extracts text, tables, and metadata from diverse document formats.
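The Text Summarization entry in the list above describes scoring sentences and returning the top-N in their original order. The sketch below shows that shape of algorithm under simplifying assumptions: it scores sentences by average word frequency, which is a stand-in and not TextRank, and `summarize` and `top_n` are hypothetical names that do not come from Kreuzberg.

```python
import re
from collections import Counter


def summarize(text: str, top_n: int = 2) -> str:
    """Return the top_n highest-scoring sentences, in original order."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, keep the best top_n ...
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # ... then restore document order so the summary reads naturally.
    keep = sorted(ranked[:top_n])
    return " ".join(sentences[i] for i in keep)


text = "Cats sleep. Dogs bark loudly at dogs. Dogs bark."
print(summarize(text, top_n=2))
```

Sorting the selected indices before joining is what keeps the summary in source order even when the highest-scoring sentence appears last.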

kreuzberg-dev/kreuzberg

Open-source alternatives to Kreuzberg

pymupdf/PyMuPDF

langroid/langroid

pdfminer/pdfminer.six

run-llama/liteparse
