# vikparuchuri/marker

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/vikparuchuri-marker).**

36,164 stars · 2,496 forks · Python · GPL-3.0

## Links

- GitHub: https://github.com/VikParuchuri/marker
- Homepage: https://www.datalab.to
- awesome-repositories: https://awesome-repositories.com/repository/vikparuchuri-marker.md

## Description

Marker is an LLM-powered document parser and OCR pipeline that converts PDFs and other unstructured files into structured Markdown, JSON, and HTML. It works as a data preprocessor, turning complex documents into machine-readable formats while preserving tables, equations, and layout structure.

The system uses large language models to improve OCR accuracy, clean up mathematical notation, and merge tables that are fragmented across multiple pages. Model-based layout analysis predicts the type and bounding box of each block, which makes the conversion of individual document elements more precise.

Marker can also extract images, extract structured data according to predefined schemas, and chunk documents for retrieval-augmented generation (RAG) pipelines. For high-volume processing, it distributes conversion tasks across multiple GPUs.
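
To make the multi-GPU idea concrete, here is a minimal sketch of one way to split a batch of files across GPUs using round-robin assignment. It is an illustration only, not Marker's actual scheduler: the function name `assign_to_gpus` and the sample file names are hypothetical, and the real project may divide work differently.

```python
from collections import defaultdict
from pathlib import Path


def assign_to_gpus(paths, num_gpus):
    """Round-robin file paths across GPU indices (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    shards = defaultdict(list)
    # Sort first so the assignment is deterministic across runs.
    for i, path in enumerate(sorted(paths)):
        shards[i % num_gpus].append(Path(path))
    return dict(shards)


# Hypothetical input files, used only for illustration.
files = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
shards = assign_to_gpus(files, num_gpus=2)

for gpu, shard in shards.items():
    # Each shard could then be handed to a worker process that is
    # restricted to one device, for example via CUDA_VISIBLE_DEVICES.
    print(gpu, [p.name for p in shard])
```

Round-robin keeps shard sizes within one file of each other, but it ignores file length; a batch with a few very large PDFs would probably be better served by a shared work queue.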

## Tags

### Content Management & Publishing

- [Markdown Converters](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/format-specific-parsers/markdown-converters.md) — Transforms various document formats into clean markdown including formatted tables, equations, and code blocks. ([source](https://github.com/vikparuchuri/marker#readme))
- [Optical Character Recognition Engines](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/intelligent-extraction-frameworks/optical-character-recognition-engines.md) — Provides a pipeline for extracting text and images from scanned documents with structural cleanup.
- [PDF to Markdown Converters](https://awesome-repositories.com/f/content-management-publishing/pdf-to-html-converters/pdf-to-markdown-converters.md) — Transforms PDF documents into clean markdown files while preserving tables, equations, and layout structures.
- [Document Conversion](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-conversion.md) — Transforms complex documents into a tree-like JSON structure reflecting the original block hierarchy. ([source](https://github.com/vikparuchuri/marker#readme))
- [Parallel Processing](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/pdf-processing-engines/parallel-processing.md) — Supports high-volume document conversion by distributing processing tasks across multiple GPUs in parallel. ([source](https://github.com/vikparuchuri/marker#readme))
- [Automated Data Extraction](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/automated-data-extraction.md) — Converts scanned or digital PDFs into structured JSON data formats for large-scale analysis.
- [Hierarchical Document Models](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/hierarchical-document-models.md) — Organizes converted document elements into a nested JSON tree that preserves structural and semantic relationships.
- [PDF to HTML Converters](https://awesome-repositories.com/f/content-management-publishing/pdf-to-html-converters/pdf-to-html-converters.md) — Generates structured HTML output that preserves images, equations, and code blocks from original documents. ([source](https://github.com/vikparuchuri/marker#readme))

### Data & Databases

- [LLM-Powered Parsers](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/document-processing-tools/llm-powered-parsers.md) — Provides an LLM-powered parser that extracts structured data from complex documents and refines OCR accuracy.
- [Image Extractions](https://awesome-repositories.com/f/data-databases/content-extraction/image-extractions.md) — Saves images from document files into local directories or replaces them with descriptive text. ([source](https://github.com/vikparuchuri/marker#readme))
- [Table Extraction Utilities](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/table-extraction-utilities.md) — Isolates and converts tables within documents into markdown or JSON, including cell bounding boxes. ([source](https://github.com/vikparuchuri/marker#readme))
- [Document Processing Engines](https://awesome-repositories.com/f/data-databases/document-processing-engines.md) — Employs high-performance pipelines to process large batches of PDF files in parallel via GPUs.
- [Structured Data Extraction](https://awesome-repositories.com/f/data-databases/structured-data-extraction.md) — Maps unstructured document text into specific JSON formats using predefined schemas and language models.

### Artificial Intelligence & ML

- [Document Chunking Strategies](https://awesome-repositories.com/f/artificial-intelligence-ml/language-model-orchestration/retrieval-augmented-generation/document-chunking-strategies.md) — Creates flattened lists of document blocks with embedded HTML optimized for retrieval-augmented generation pipelines. ([source](https://github.com/vikparuchuri/marker#readme))
- [Model-Driven Text Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/model-driven-text-extraction.md) — Uses model-driven techniques to extract raw text and layout data from PDF layers or images.
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Employs neural networks to predict block types and bounding boxes for identifying tables and equations.
- [Optical Character Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition.md) — Extracts text from images or PDFs using character recognition and allows for reprocessing by stripping existing layers. ([source](https://github.com/vikparuchuri/marker#readme))
- [RAG Data Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/rag-data-pipelines.md) — Prepares complex documents by converting them into structured chunks and HTML for use in RAG systems.
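
The relationship between a nested block tree and a flattened chunk list, as described in the tags above, can be sketched as follows. This is an assumption-laden illustration rather than Marker's real output schema: the field names (`id`, `block_type`, `page`, `html`, `children`), the sample document, and the function `flatten_blocks` are all hypothetical.

```python
def flatten_blocks(node, page=None):
    """Yield the leaf blocks of a nested block tree in reading order."""
    # Remember the page number so that every leaf can report it.
    if node.get("block_type") == "Page":
        page = node.get("page")
    children = node.get("children") or []
    if not children:
        # A node without children is treated as a chunk.
        yield {
            "id": node["id"],
            "block_type": node["block_type"],
            "page": page,
            "html": node.get("html", ""),
        }
        return
    for child in children:
        yield from flatten_blocks(child, page)


# Hypothetical document tree, used only for illustration.
document = {
    "id": "/doc",
    "block_type": "Document",
    "children": [
        {"id": "/page/0", "block_type": "Page", "page": 0, "children": [
            {"id": "/page/0/SectionHeader/0", "block_type": "SectionHeader",
             "html": "<h1>Intro</h1>"},
            {"id": "/page/0/Text/1", "block_type": "Text",
             "html": "<p>Hello</p>"},
        ]},
        {"id": "/page/1", "block_type": "Page", "page": 1, "children": [
            {"id": "/page/1/Table/0", "block_type": "Table",
             "html": "<table><tr><td>1</td></tr></table>"},
        ]},
    ],
}

chunks = list(flatten_blocks(document))
for chunk in chunks:
    print(chunk["page"], chunk["block_type"], chunk["html"])
```

Keeping the page number and block type on each chunk lets a retrieval pipeline filter or cite results without walking the tree again.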

### DevOps & Infrastructure

- [Document Content Refiners](https://awesome-repositories.com/f/devops-infrastructure/infrastructure/version-control-systems/git-based-repositories/git-based-code-analysis-platforms/llm-based-analysis/document-content-refiners.md) — Uses large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables.
- [Document Content Refinement](https://awesome-repositories.com/f/devops-infrastructure/infrastructure/version-control-systems/git-based-repositories/git-based-code-analysis-platforms/llm-based-analysis/document-content-refinement.md) — Uses large language models to merge fragmented tables across pages and clean up mathematical notation for higher precision. ([source](https://github.com/vikparuchuri/marker#readme))

### Web Development

- [Parallel GPU Schedulers](https://awesome-repositories.com/f/web-development/performance-optimizations/computational-parallelization/parallel-gpu-schedulers.md) — Distributes heavy document conversion tasks across multiple GPUs to accelerate large-scale file processing.

### Part of an Awesome List

- [Data Scraping Tools](https://awesome-repositories.com/f/awesome-lists/ai/data-scraping-tools.md) — Tool for converting PDF documents into clean markdown text.