Pdf Craft | Awesome Repository

pdf-craft is an OCR-based document parser and structure extractor that converts PDF files into structured data, Markdown, or EPUB ebooks. It uses optical character recognition and statistical analysis to extract text and identify the document hierarchy.

The system features specialized rendering for mathematical formulas and tables, using heuristic reconstruction to convert tabular data into digital formats. It includes a document structure extractor that builds a table of contents by combining font-size analysis, linguistic patterns, and language-model-based title detection.
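To make the font-size part of that table-of-contents step concrete, here is a minimal sketch of one common heuristic: treat the most frequent font size as body text and map each larger size to a heading level. This is not pdf-craft's code or API. The `Line` type, its `font_size` field, and `build_outline` are hypothetical names chosen for illustration, and the sketch ignores the linguistic and language-model signals mentioned above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    # Hypothetical record for one recognized text line (not a pdf-craft type).
    text: str
    font_size: float  # in points, as a layout step might report it


def build_outline(lines):
    """Return (level, text) pairs for lines set larger than the body text."""
    if not lines:
        return []
    sizes = [round(line.font_size, 1) for line in lines]
    # Assumption: the most frequent font size belongs to the body text.
    body_size = Counter(sizes).most_common(1)[0][0]
    # Distinct sizes above the body size, largest first, become levels 1, 2, ...
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [
        (level_of[size], line.text)
        for size, line in zip(sizes, lines)
        if size in level_of
    ]


if __name__ == "__main__":
    sample = [
        Line("Introduction", 18), Line("Some body text.", 10),
        Line("More body text.", 10), Line("Background", 14),
        Line("Even more body text.", 10), Line("Methods", 18),
    ]
    for level, title in build_outline(sample):
        print("  " * (level - 1) + title)
```

A font-size rule alone misfires on captions, pull quotes, and scanned pages with inconsistent sizes, which is presumably why it is combined with other signals.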

The pipeline supports offline processing through local model weight caching, ensuring that OCR and layout analysis can function without an internet connection.
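The general idea behind offline operation with cached weights can be sketched as a lookup that succeeds only when the file is already on disk. This is an illustrative pattern using the Python standard library, not pdf-craft's actual interface. The function name `resolve_weights`, the `offline` flag, and the file names in the usage note are all assumptions.

```python
from pathlib import Path


def resolve_weights(cache_dir, model_name, offline=True):
    """Return the cached weight file path, or None if a download is needed.

    Hypothetical helper: in offline mode a missing file is an error,
    because no network fetch is allowed.
    """
    path = Path(cache_dir) / model_name
    if path.is_file():
        return path
    if offline:
        raise FileNotFoundError(
            f"{model_name} is not cached in {cache_dir}; "
            "cannot download in offline mode"
        )
    # Online mode: signal that the caller may fetch the weights into `path`.
    return None
```

With this shape, a machine without network access would call something like `resolve_weights("models", "layout.bin")` (both names are placeholders) after the weights were copied into the cache directory ahead of time.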

Features

Document Parsers - Uses optical character recognition to extract text and structured data from PDF files for downstream processing.
Document Structure Analysis - Employs AI-driven extraction to build document hierarchies and tables of contents from PDFs.
Document Layout Analysis - Uses OCR and spatial analysis to detect document structures, tables, and layout hierarchies.
Structural Text Extractors - Reconstructs logical document structures, including headings and tables of contents, from raw PDF data.
