pdf-craft is an OCR-based document parser and structure extractor that converts PDF files into structured data, Markdown, or EPUB ebooks. It combines optical character recognition with statistical analysis to identify the document hierarchy and extract text and structured content.
The system has specialized rendering for mathematical formulas and tables, and uses heuristic reconstruction to convert tabular data into digital formats. Its document structure extractor builds a table of contents by analyzing font sizes, linguistic patterns, and titles detected by a language model.
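
The font-size part of that analysis can be illustrated with a minimal sketch. It is not pdf-craft's implementation: the `TocEntry` type, the `build_toc` function, and the `(text, font_size)` input shape are all hypothetical, and the rule used here (the most frequent size is body text, each larger size is a heading level) is an assumption chosen only to show the idea.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TocEntry:
    """One node of the inferred table of contents (hypothetical type)."""
    title: str
    level: int
    children: list["TocEntry"] = field(default_factory=list)


def build_toc(lines: list[tuple[str, float]]) -> list[TocEntry]:
    """Infer a nested table of contents from (text, font_size) pairs.

    Assumption: the most frequent font size is body text, and every
    distinct larger size is a heading level (largest size = level 1).
    """
    if not lines:
        return []

    # The most common font size is treated as body text.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]

    # Rank the distinct sizes above body text, largest first.
    heading_sizes = sorted(
        {size for _, size in lines if size > body_size}, reverse=True
    )
    level_of = {size: rank + 1 for rank, size in enumerate(heading_sizes)}

    roots: list[TocEntry] = []
    stack: list[TocEntry] = []  # open headings, outermost first
    for text, size in lines:
        level = level_of.get(size)
        if level is None:
            continue  # body text or smaller: not a heading
        entry = TocEntry(text.strip(), level)
        # Close headings at the same or a deeper level.
        while stack and stack[-1].level >= level:
            stack.pop()
        (stack[-1].children if stack else roots).append(entry)
        stack.append(entry)
    return roots
```

For example, `build_toc([("Chapter 1", 18.0), ("text", 10.0), ("more text", 10.0), ("Section 1.1", 14.0), ("Chapter 2", 18.0)])` yields two top-level entries, with "Section 1.1" nested under "Chapter 1". A real extractor would combine this signal with the linguistic and language-model cues mentioned above, since font size alone misfires on captions, pull quotes, and similar text.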
The pipeline supports offline processing by caching model weights locally, so OCR and layout analysis can run without an internet connection.
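
The caching pattern behind offline operation can be sketched roughly as follows. This is a generic illustration rather than pdf-craft's API: `resolve_weights`, its parameters, and the `download` callback are hypothetical names introduced for the example.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_weights(
    cache_dir: Path,
    model_name: str,
    offline: bool = True,
    download: Optional[Callable[[str, Path], None]] = None,
) -> Path:
    """Return the local path of a model's weights (hypothetical helper).

    A cached copy is always preferred. When offline is True, a missing
    file is an error and the network is never touched.
    """
    target = cache_dir / model_name
    if target.exists():
        return target  # cache hit: no network access needed

    if offline or download is None:
        raise FileNotFoundError(
            f"weights for {model_name!r} are not cached in {cache_dir}"
        )

    # Online mode: fetch once, then reuse the cached file on later runs.
    cache_dir.mkdir(parents=True, exist_ok=True)
    download(model_name, target)
    return target
```

Under this scheme, one run with network access populates the cache directory, after which the same directory can serve an air-gapped machine with `offline=True`.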