Document Parsing and Extraction - Core document parser that extracts text and bounding boxes from PDFs and other formats into structured output.
Cost-Optimized Parsers - Routes each page to the cheapest suitable parsing tier automatically to balance accuracy and expense without manual configuration.
Open-Source Document Parsers - Extracts text, tables, and layout from PDFs and office files into Markdown or JSON using an open-source parser.
Content Parsing Prompts - Customizes parsing through prompts, steering extraction results with natural-language instructions or structured schemas.
Output Schema Instructions - Shapes extraction output through natural-language instructions or structured schemas.
Document Page Routing Optimizers - Provides automatic per-page routing to the cheapest suitable parsing tier for cost-efficient document extraction.
Structured Document Extraction - Converts PDFs and office documents into structured Markdown or JSON with spatial layout for direct use by language models.
Document Text Extractors - Parses documents to retrieve text alongside precise positional coordinates for each extracted element.
Spatial Text Extractors - Parses documents to retrieve text alongside precise positional coordinates for each extracted element.
OCR Document Parsers - Applies optical character recognition using either a bundled engine or an external HTTP server.
PDF Text Extractors - Parses PDF files and extracts text with spatial bounding boxes, returning structured Markdown, JSON, or plain text.
Text Extraction and OCR - Applies OCR to scanned or image-based PDFs to extract text with optional language selection.
Office and Documents - Extracts text from DOCX, XLSX, PPTX, PNG, JPG, and other file formats via automatic conversion.
Document Generation from Markdown - Reconstructs headings, tables, lists, images, and links from a PDF's spatial layout into structured Markdown.
Document Format Converters - Automatically converts more than 130 file types, including office documents and images, into PDF before extracting text and layout.
Document to Markdown Converters - Reconstructs headings, tables, lists, images, and links from spatial layout for LLM and RAG pipelines.
LLM-Ready Markdown Converters - Converts PDFs and office documents into structured Markdown optimized for language model and RAG pipeline consumption.
PDF to Markdown Conversion - Converts PDF documents into structured Markdown preserving headings, tables, lists, images, and links.
Document Table Extractors - Recovers table data from PDFs, scans, and images with cell structure intact for downstream use.
Browser-Based Parsers - Runs the entire parsing engine and OCR inside a web browser using WebAssembly for offline or serverless document extraction.
PDF Spatial Layout Parsers - Extracts text from PDFs while preserving its exact position on each page, including a bounding box for every line.
Multi-Format Document Ingestion - Handles PDF, DOCX, PPTX, XLSX, HTML, JPEG, PNG, XML, EPUB, and many other formats for flexible document ingestion.
Layout-Aware Extraction - Combines spatial layout analysis with OCR to extract text, tables, and charts while preserving document structure.
Layout Preservation - Extracts text, tables, and images from PDFs and office documents while preserving spatial layout and structure.
Multi-Format Document Parsing - Converts more than 130 file types, including office documents and images, into PDF before extracting text and layout.
Document Page Cost Optimizers - Automatically routes each page to the cheapest suitable parsing tier, reserving premium accuracy for complex layouts.
Document Layout Bounding Box Extractors - Returns precise coordinates for every text line and table cell, preserving document layout for downstream analysis.
OCR Document Conversion - Extracts text, tables, and charts from PDFs while preserving spatial layout and structure.
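
The per-page routing described in the cost-optimizer entries above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the product's actual routing logic: the tier names, prices, page features, and complexity scoring below are all hypothetical. The idea shown is simply to score each page, keep the tiers able to handle that score, and pick the cheapest of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str               # hypothetical tier name
    cost_per_page: float    # hypothetical price units
    max_complexity: int     # highest complexity score this tier is assumed to handle


@dataclass(frozen=True)
class Page:
    number: int
    is_scanned: bool = False
    has_tables: bool = False
    has_charts: bool = False


# Hypothetical tiers, for illustration only.
TIERS = [
    Tier("fast", 1.0, 0),
    Tier("balanced", 3.0, 2),
    Tier("premium", 10.0, 4),
]


def complexity(page: Page) -> int:
    """Assumed scoring: scans and tables add 1 each, charts add 2."""
    return int(page.is_scanned) + int(page.has_tables) + 2 * int(page.has_charts)


def route(page: Page, tiers=TIERS) -> Tier:
    """Return the cheapest tier whose capability covers the page's score."""
    score = complexity(page)
    suitable = [t for t in tiers if t.max_complexity >= score]
    if not suitable:
        # No tier is rated for this page: fall back to the most capable one.
        return max(tiers, key=lambda t: t.max_complexity)
    return min(suitable, key=lambda t: t.cost_per_page)


if __name__ == "__main__":
    pages = [
        Page(1),
        Page(2, has_tables=True),
        Page(3, is_scanned=True, has_tables=True, has_charts=True),
    ]
    for p in pages:
        print(p.number, route(p).name)
```

With these assumed tiers, a plain text page goes to the cheapest tier, and only pages that combine scans, tables, and charts reach the premium tier.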