# oomol-lab/pdf-craft

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/oomol-lab-pdf-craft).**

4,867 stars · 315 forks · Python · mit

## Links

- GitHub: https://github.com/oomol-lab/pdf-craft
- Homepage: https://pdf.oomol.com/
- awesome-repositories: https://awesome-repositories.com/repository/oomol-lab-pdf-craft.md

## Topics

`deepseek-ocr` `document` `ocr` `pdf`

## Description

pdf-craft is an OCR-based document parser and structure extractor designed to convert PDF files into structured data, Markdown, or EPUB ebooks. It utilizes optical character recognition and statistical analysis to identify document hierarchies and extract text and structured content.

The system features specialized rendering for mathematical formulas and tables, using heuristic reconstruction to convert tabular data into digital formats. It includes a document structure extractor that builds tables of contents by analyzing font sizes, linguistic patterns, and language model title detection.

The pipeline supports offline processing through local model weight caching, ensuring that OCR and layout analysis can function without an internet connection.

## Tags

### Data & Databases

- [Document Parsers](https://awesome-repositories.com/f/data-databases/document-parsers.md) — Uses optical character recognition to extract text and structured data from PDF files for downstream processing.

### Artificial Intelligence & ML

- [Document Structure Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/document-structure-analysis.md) — Employs AI-driven extraction to build document hierarchies and tables of contents from PDFs.
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Uses OCR and spatial analysis to detect document structures, tables, and layout hierarchies.
- [Cross-Platform Offline OCR](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-assistants/offline/cross-platform-offline-ocr.md) — Ensures document conversion works without an internet connection by managing local OCR model weights.

### Content Management & Publishing

- [Structural Text Extractors](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/pdf-manipulation-utilities/pdf-editors/pdf-content-converters/structural-text-extractors.md) — Reconstructs logical document structures, including headings and tables of contents, from raw PDF data.
- [Multi-Format Document Exports](https://awesome-repositories.com/f/content-management-publishing/multi-format-document-exports.md) — Transforms extracted PDF content into multiple distinct output formats including EPUB and Markdown.
- [PDF to EPUB Converters](https://awesome-repositories.com/f/content-management-publishing/pdf-to-epub-converters.md) — Converts PDF documents into EPUB files with specialized rendering for mathematical formulas and tables. ([source](https://cdn.jsdelivr.net/gh/oomol-lab/pdf-craft@main/README.md))
- [PDF to Markdown Converters](https://awesome-repositories.com/f/content-management-publishing/pdf-to-html-converters/pdf-to-markdown-converters.md) — Transforms PDF documents into structured markdown syntax by recognizing layout and tables via OCR. ([source](https://cdn.jsdelivr.net/gh/oomol-lab/pdf-craft@main/README.md))
- [Statistical Heading Detection](https://awesome-repositories.com/f/content-management-publishing/content-management-systems/content-architecture-modeling/document-models/document-sectioning/document-content-structuring/flat-heading-hierarchies/statistical-heading-detection.md) — Builds document tables of contents by analyzing font sizes and linguistic patterns to identify headings.
- [Tabular Data Reconstruction](https://awesome-repositories.com/f/content-management-publishing/tabular-data-reconstruction.md) — Provides heuristic grid-based recognition to convert visual tabular data into structured digital formats.

### Web Development

- [Table of Contents Generators](https://awesome-repositories.com/f/web-development/table-of-contents-generators.md) — Automatically generates a document hierarchy and table of contents from extracted content. ([source](https://cdn.jsdelivr.net/gh/oomol-lab/pdf-craft@main/README.md))
