run-llama/liteparse

10,782 stars · 710 forks · Rust · Apache-2.0 · 0 views · developers.llamaindex.ai/liteparse

Liteparse

A fast, helpful, and open-source document parser

Features

Document Parsing and Extraction - Core document parser that extracts text and bounding boxes from PDFs and other formats into structured output.

Cost-Optimized Parsers - Routes each page to the cheapest suitable parsing tier automatically to balance accuracy and expense without manual configuration.

Open-Source Document Parsers - An open-source document parser that extracts text, tables, and layout from PDFs and office files into Markdown or JSON.

Content Parsing Prompts - Provides prompt-based parsing customization that steers extraction results using natural-language instructions or structured schemas.

Output Schema Instructions - Ships prompt-driven output shaping that accepts natural-language instructions or structured schemas to steer extraction results.

Document Page Routing Optimizers - Provides automatic per-page routing to the cheapest suitable parsing tier for cost-efficient document extraction.

Structured Document Extraction - Converts PDFs and office documents into structured Markdown or JSON with spatial layout for direct use by language models.

Document Text Extractors - Parses documents to retrieve text alongside precise positional coordinates for each extracted element.

Spatial Text Extractors - Parses documents to retrieve text alongside precise positional coordinates for each extracted element.

OCR Document Parsers - Applies optical character recognition using a bundled engine or external HTTP server.

PDF Text Extractors - Parses PDF files and extracts text with spatial bounding boxes, returning structured Markdown, JSON, or plain text.

Text Extraction and OCR - Applies OCR to scanned or image-based PDFs to extract text with optional language selection.

Office and Documents - Extracts text from DOCX, XLSX, PPTX, PNG, JPG, and other file formats via automatic conversion.

Document Generation from Markdown - Reconstructs headings, tables, lists, images, and links from a PDF's spatial layout into structured Markdown.

Document Format Converters - Automatically converts over 130 file types including office documents and images into PDF before extracting text and layout.

Document to Markdown Converters - Reconstructs headings, tables, lists, images, and links from spatial layout for LLM and RAG pipelines.

LLM-Ready Markdown Converters - Converts PDFs and office documents into structured Markdown optimized for language model and RAG pipeline consumption.

PDF to Markdown Conversion - Converts PDF documents into structured Markdown preserving headings, tables, lists, images, and links.

Document Table Extractors - Recovers table data from PDFs, scans, and images with cell structure intact for downstream use.

Browser-Based Parsers - Runs the entire parsing engine and OCR inside a web browser using WebAssembly for offline or serverless document extraction.

PDF Spatial Layout Parsers - Extracts text from PDFs while preserving exact position on each page including bounding boxes for every line.

Multi-Format Document Ingestion - Handles PDF, DOCX, PPTX, XLSX, HTML, JPEG, PNG, XML, EPUB, and many other formats for flexible document ingestion.

Layout-Aware Extraction - Combines spatial layout analysis with OCR to extract text, tables, and charts preserving document structure.

Layout Preservation - Extracts text, tables, and images from PDFs and office documents while preserving spatial layout and structure.

Multi-Format Document Parsing - Converts over 130 file types including office documents and images into PDF before extracting text and layout.

Document Page Cost Optimizers - Automatically routes each page to the cheapest suitable parsing tier, reserving premium accuracy for complex layouts.

Document Layout Bounding Box Extractors - Returns precise coordinates for every text line and table cell, preserving document layout for downstream analysis.

OCR Document Conversion - Extracts text, tables, and charts from PDFs while preserving spatial layout and structure.

Document Spatial Coordinate Outputs - Extracts text items from a PDF and returns them with spatial coordinates for precise layout analysis.

Document Bounding Box Extractors - Returns spatial coordinates for every line of text extracted from documents for visualization or processing.

Document Bounding Box Extractors - Returns spatial bounding boxes for each text line, enabling visualization or further geometric processing.

Document JSON Bounding Box Outputs - Extracts text with bounding boxes from a PDF and outputs the result as structured JSON.

Document Output Shapers - Accepts natural-language instructions or structured schemas to steer document extraction results toward desired formats.

Custom OCR Backend Registrations - Accepts a user-defined OCR engine with a recognize method for custom text extraction.

Markdown RAG Pipeline Outputs - Reconstructs headings, tables, lists, images, and links from spatial layout for direct use in LLMs and RAG pipelines.

Multi-Runtime Libraries - Provides library APIs and CLI for Rust, Node.js/TypeScript, Python, and browser WASM environments.

REST and SDK Parsing Interfaces - Provides REST, Python, and TypeScript interfaces to upload documents and retrieve parsed results programmatically.

Command-Line Document Processors - Processes files from the command line with options for format, page range, and remote URLs.

Document Parsing Services - Integrates document parsing into applications through a library API accepting file paths or raw byte buffers.

WASM-Based PDF Parsers - Parses PDF documents entirely in the browser using WebAssembly, requiring no server or cloud calls.

Diagram Structure Parsing - Converts visual data from charts, plots, and diagrams into structured formats for numerical reasoning by LLMs.

Document Chart Parsers - Extracts charts, plots, and diagrams from documents into structured data for numerical reasoning by LLMs.

Browser-Based OCR Engines - Provides a JavaScript-side OCR engine with a recognize method for text extraction in WASM environments.

OCR REST API Servers - Sends OCR requests to remote HTTP services for higher accuracy or performance.

Document Page Rendering - Converts document pages into raster images for LLM agents to extract visual information.

PDF Page Image Generators - Renders PDF pages as raster images for use in LLM agents or visual workflows.

Browser Screenshot Capture - Generates page images as PNG byte buffers for use with LLMs or disk storage.

Browser-Based Runtimes - Runs the entire parsing engine and OCR inside a web browser using WebAssembly for offline document extraction.

Document Directory Parsers - Parses all documents in a given input folder and writes results to a specified output directory.

Batch Document Processing - Processes entire directories of documents efficiently with a single command, reusing the parsing engine.

Document Page Screenshot Capturers - Renders pages as high-quality PNG images to capture visual information that text alone cannot convey.
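
Several of the features above describe routing each page to the cheapest suitable parsing tier. The sketch below illustrates that idea only and is not Liteparse's implementation or API. The tier names, costs, complexity heuristic, and the `Tier`, `Page`, `complexity`, and `route` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """A hypothetical parsing tier (names and costs are illustrative)."""
    name: str
    cost_per_page: float
    max_complexity: int  # highest page complexity this tier is assumed to handle


@dataclass(frozen=True)
class Page:
    """Hypothetical per-page signals a router might inspect."""
    number: int
    has_text_layer: bool
    table_count: int


def complexity(page: Page) -> int:
    """Assumed heuristic: scanned pages and pages with tables are harder."""
    score = 0
    if not page.has_text_layer:
        score += 1  # no embedded text, so OCR would be needed
    if page.table_count > 0:
        score += 1  # tables need layout-aware handling
    return score


TIERS = [
    Tier("basic", cost_per_page=0.0, max_complexity=0),
    Tier("standard", cost_per_page=1.0, max_complexity=1),
    Tier("premium", cost_per_page=5.0, max_complexity=2),
]


def route(page: Page, tiers: list[Tier] = TIERS) -> Tier:
    """Return the cheapest tier whose capability covers the page."""
    needed = complexity(page)
    suitable = [t for t in tiers if t.max_complexity >= needed]
    if not suitable:
        raise ValueError(f"no tier can handle page {page.number}")
    return min(suitable, key=lambda t: t.cost_per_page)


pages = [
    Page(number=1, has_text_layer=True, table_count=0),
    Page(number=2, has_text_layer=False, table_count=0),
    Page(number=3, has_text_layer=False, table_count=2),
]

# Map each page number to the name of the tier chosen for it.
plan = {p.number: route(p).name for p in pages}
print(plan)
```

In this sketch the premium tier is reserved for pages that are both scanned and contain tables, and every other page falls through to a cheaper tier.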
