Marker | Awesome Repository

Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures.

The system uses large language models to refine OCR output, clean up mathematical notation, and merge tables that are split across pages. Model-based layout analysis predicts the type and bounding box of each block, which makes the conversion of individual document elements more precise.
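To make the idea of typed blocks and cross-page table merging concrete, here is a minimal sketch. It is not Marker's implementation: the `Block` class, its fields, and the column-count heuristic are all assumptions chosen for illustration, and a plain rule stands in for the LLM-based merging described above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical block record; field names are illustrative only.
    page: int
    block_type: str                          # e.g. "Table" or "Text"
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1)
    rows: list[list[str]]                    # table cells; empty for non-tables


def merge_split_tables(blocks: list[Block]) -> list[Block]:
    """Join a table with the table fragment that directly follows it on the next page.

    Heuristic (an assumption, not Marker's rule): two adjacent blocks belong
    to one table if both are tables, they sit on consecutive pages, and their
    first rows have the same number of columns.
    """
    merged: list[Block] = []
    for block in blocks:
        prev = merged[-1] if merged else None
        continues_table = (
            prev is not None
            and prev.block_type == "Table"
            and block.block_type == "Table"
            and block.page == prev.page + 1
            and len(prev.rows) > 0
            and len(block.rows) > 0
            and len(prev.rows[0]) == len(block.rows[0])
        )
        if continues_table:
            # Keep the first fragment's bbox, but advance the page so a
            # third fragment on the following page can still be attached.
            merged[-1] = Block(
                page=block.page,
                block_type="Table",
                bbox=prev.bbox,
                rows=prev.rows + block.rows,
            )
        else:
            merged.append(block)
    return merged
```

A rule like this fails on tables whose column count changes at the page break, which is one reason a model-based approach can be preferable.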

Capabilities include extracting images, extracting structured data according to predefined schemas, and chunking documents for retrieval-augmented generation pipelines. The project supports high-volume processing by distributing conversion tasks across multiple GPUs.
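One simple way to spread conversion work over several GPUs is to split the input files into one shard per device. The sketch below shows only that partitioning step; `shard_files` is a hypothetical helper, not part of Marker, and round-robin assignment is an assumption about how the work might be divided.

```python
def shard_files(paths: list[str], num_gpus: int) -> dict[int, list[str]]:
    """Assign files to GPU indices in round-robin order.

    Paths are sorted first so the assignment is reproducible across runs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    shards: dict[int, list[str]] = {gpu: [] for gpu in range(num_gpus)}
    for index, path in enumerate(sorted(paths)):
        shards[index % num_gpus].append(path)
    return shards
```

Each shard could then be handed to a separate worker process started with `CUDA_VISIBLE_DEVICES` set to the shard's key, so that every worker sees only its own device.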

Features

Markdown Converters - Transforms various document formats into clean markdown including formatted tables, equations, and code blocks.
LLM-Powered Parsers - Uses large language models to extract structured data from complex documents and to refine OCR accuracy.
Document Chunking Strategies - Creates flattened lists of document blocks with embedded HTML, optimized for retrieval-augmented generation pipelines.
Model-Driven Text Extraction - Applies layout and OCR models to extract raw text and layout data from PDF layers or images.
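The chunking feature can be pictured as turning a nested block tree into a flat list that a retrieval pipeline can index. The following is a rough sketch under stated assumptions: the dictionary keys (`id`, `block_type`, `html`, `children`) are hypothetical and may not match Marker's actual output schema.

```python
def flatten_blocks(node: dict) -> list[dict]:
    """Return the leaf blocks of a nested block tree, in reading order.

    Each node is assumed to be a dict with "id", "block_type" and "html"
    keys, plus an optional "children" list of nodes of the same shape.
    """
    children = node.get("children") or []
    if not children:
        # A leaf becomes one chunk that carries its own HTML.
        return [
            {
                "id": node["id"],
                "block_type": node["block_type"],
                "html": node["html"],
            }
        ]
    chunks: list[dict] = []
    for child in children:
        chunks.extend(flatten_blocks(child))
    return chunks
```

Keeping the HTML inside each chunk means a table or equation can be embedded and retrieved without consulting the rest of the tree.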
