# getomni-ai/zerox

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/getomni-ai-zerox).**

12,241 stars · 846 forks · TypeScript · MIT

## Links

- GitHub: https://github.com/getomni-ai/zerox
- Homepage: https://getomni.ai/ocr-demo
- awesome-repositories: https://awesome-repositories.com/repository/getomni-ai-zerox.md

## Topics

`ocr` `pdf`

## Description

Zerox is an OCR and document-parsing tool that uses vision models to convert PDF files and images into structured Markdown. It works as a visual layout extraction engine: large multimodal models digitize each document while preserving its original structure.

Zerox uses coordinate-based element mapping and multimodal layout analysis to identify structural elements such as tables, charts, and headers. It rasterizes vector PDF pages into high-resolution bitmaps so that the vision models receive consistent input, and the models then produce the final Markdown output.

Its capabilities include complex layout extraction and vision-based OCR. It works from the visual representation of a document to interpret the spatial relationship between text and data, then converts the result into machine-readable formats.
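The rasterize-then-transcribe flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Zerox's actual implementation or API: the `PageImage`, `Rasterizer`, and `VisionModel` types, the `pdfToMarkdown` function, and the two stand-in implementations are all hypothetical names introduced here. A real pipeline would plug in a PDF renderer and a vision-model client where the stand-ins are.

```typescript
// Hypothetical types for illustration only; these are not Zerox's API.
type PageImage = { pageNumber: number; png: Uint8Array };
type Rasterizer = (pdf: Uint8Array) => Promise<PageImage[]>;
type VisionModel = (image: PageImage) => Promise<string>;

// Rasterize every page, ask the model for Markdown per page image,
// then join the per-page results in page order.
async function pdfToMarkdown(
  pdf: Uint8Array,
  rasterize: Rasterizer,
  transcribe: VisionModel,
): Promise<string> {
  const pages = await rasterize(pdf);
  // Promise.all keeps results in the same order as the input pages.
  const markdownPages = await Promise.all(pages.map((page) => transcribe(page)));
  return markdownPages.map((md) => md.trim()).join("\n\n");
}

// Stand-ins so the sketch runs without a PDF renderer or a model API.
// The fake rasterizer pretends every PDF has two pages.
const fakeRasterize: Rasterizer = async (pdf) =>
  [1, 2].map((pageNumber) => ({ pageNumber, png: pdf }));

// The fake model returns a heading naming the page it was given.
const fakeTranscribe: VisionModel = async (image) =>
  `# Page ${image.pageNumber}\n`;
```

Passing the rasterizer and the model in as functions keeps the two stages independent, so either can be swapped (a different renderer, a different model provider) without touching the joining logic.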

## Tags

### Content Management & Publishing

- [PDF to Markdown Converters](https://awesome-repositories.com/f/content-management-publishing/pdf-to-html-converters/pdf-to-markdown-converters.md) — Transforms PDF files into Markdown text using vision models to preserve the original layouts, tables, and charts. ([source](https://cdn.jsdelivr.net/gh/getomni-ai/zerox@main/README.md))
- [Vision-Based Document Parsers](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/document-automation-interfaces/plugin-based-document-parsers/vision-based-document-parsers.md) — Provides a parser that uses multimodal vision models to interpret document layouts and convert them into structured text.
- [Visual-to-Markdown Pipelines](https://awesome-repositories.com/f/content-management-publishing/markdown-documentation/visual-to-markdown-pipelines.md) — Transforms visual document representations into structured text by mapping identified coordinates to Markdown formatting syntax.
- [PDF to Markdown Conversion](https://awesome-repositories.com/f/content-management-publishing/pdf-to-markdown-conversion.md) — Transforms visual document representations into structured Markdown text by mapping spatial coordinates to formatting syntax.
- [Coordinate-Based Layout Mapping](https://awesome-repositories.com/f/content-management-publishing/pdf-to-html-converters/pdf-to-html-converters/coordinate-based-layout-mapping.md) — Implements a coordinate-based mapping system to preserve the original document layout during the conversion process.

### Artificial Intelligence & ML

- [Multimodal Vision Models](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/architectures/multimodal-perception-models/multimodal-vision-models.md) — Identifies structural elements like tables and headers by processing document images through a large multimodal model.
- [Automated Digitization Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/automated-digitization-engines.md) — Converts physical or digital document scans into machine-readable formats using multimodal models to identify structural elements.
- [Multimodal Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/multimodal-layout-analysis.md) — Uses large vision models to identify structural document elements like tables and headers from image data.
- [Structured Document Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/structured-document-extraction.md) — Converts complex PDF files into structured Markdown while preserving tables, charts, and the original page formatting.
- [Optical Character Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/optical-character-recognition.md) — Extracts raw character data from images using optical character recognition to supplement semantic structural formatting.
- [Element Discrimination Prompts](https://awesome-repositories.com/f/artificial-intelligence-ml/prompt-engineering-guides/element-discrimination-prompts.md) — Uses specialized visual prompts to help models distinguish between standard body text and complex tabular data.
- [Multimodal Prompting](https://awesome-repositories.com/f/artificial-intelligence-ml/prompt-engineering-guides/multimodal-prompting.md) — Employs specialized visual prompts to help the model distinguish between body text and complex tabular data.

### Part of an Awesome List

- [AI Vision OCR Tools](https://awesome-repositories.com/f/awesome-lists/more/text-extraction-and-ocr/ai-vision-ocr-tools.md) — Provides a document extraction system that uses vision models to convert PDF files and images into structured Markdown text.
- [Text Extraction and OCR](https://awesome-repositories.com/f/awesome-lists/more/text-extraction-and-ocr.md) — Recovers raw character data from images using OCR before applying semantic structural formatting.
- [Data Processing](https://awesome-repositories.com/f/awesome-lists/data/data-processing.md) — Zero-shot PDF OCR using vision-capable language models.

### Data & Databases

- [Multimodal Document Ingestion](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/data-ingestion/multimodal-document-ingestion.md) — Uses vision-capable large language models to interpret and convert visual document representations into clean, structured text.
- [Layout-Aware Extraction](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/layout-aware-extraction.md) — Parses documents with intricate formatting to maintain the spatial relationship between text, headers, and tabular data.
- [Document Layout Extraction](https://awesome-repositories.com/f/data-databases/document-parsing-engines/web-document-parsing/visual-layout-parsing/document-layout-extraction.md) — Identifies structural elements in documents through coordinate-based mapping and vision-model analysis.

### Graphics & Multimedia

- [Vector Rasterizers](https://awesome-repositories.com/f/graphics-multimedia/graphics-engines-rendering/rendering/vector-rendering-pipelines/vector-graphics-renderers/vector-rasterizers.md) — Converts vector PDF pages into high-resolution bitmaps to provide consistent input for vision-based multimodal models.
- [PDF to Image Rendering](https://awesome-repositories.com/f/graphics-multimedia/pdf-to-image-rendering.md) — Converts vector-based PDF pages into high-resolution bitmaps to ensure compatibility with vision-based model inputs.