30 open-source projects similar to pdf2htmlex/pdf2htmlex, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best pdf2htmlEX alternative.
BabelDOC is a technical document translation system designed to translate PDF files while preserving their original layout and styling. It functions as a layout-preserving translator that utilizes large language models to convert content into target languages, specifically tailored for scientific and technical documents. The system distinguishes itself through specialized handling of academic content, including the identification and preservation of mathematical formulas and complex layout structures. It ensures technical accuracy by employing glossary-driven terminology enforcement, using so
Zerox is a multimodal document parser and OCR tool that uses vision models to convert PDF files and images into structured Markdown text. It functions as a visual layout extraction engine, leveraging large multimodal models to digitize documents while maintaining their original structural formatting. The system differentiates itself through the use of coordinate-based element mapping and multimodal layout analysis to identify structural elements like tables, charts, and headers. It utilizes rasterization to convert vector PDF pages into high-resolution bitmaps, ensuring consistent input for t
pdfminer is a Python library for parsing PDF files to extract text, analyze layouts, decrypt content, and convert documents into HTML or XML formats. It functions as a text extraction engine and layout analysis tool designed to retrieve characters and words while preserving the structural organization of the original document. The project provides utilities for converting PDF content into structured HTML or XML to maintain visual layout and a decryption tool for unlocking restricted documents using encryption keys. It identifies the positions and groupings of text elements to reconstruct page
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as Markdown, HTML, and LaTeX. It functions as a multi-stage parsing framework that combines a document layout analyzer, a formula recognition engine, an OCR text extractor, and a table extraction system. The project focuses on recovering complex document elements by translating images of mathematical formulas and tabular structures into editable source code. It utilizes model-driven layout analysis to identify structural elements in reports and textbooks while ignoring noise like wa
pdf2htmlEX is a tool that converts PDF documents into HTML while preserving the original text, fonts, and layout. It uses CSS positioning and font embedding to replicate the PDF's appearance in a browser, producing output that works without JavaScript. The tool can generate a single self-contained HTML file with all resources embedded, or split the document into separate HTML files per page for individual loading and navigation. The converter offers extensive control over the output, including the ability to embed fonts directly into the HTML using base64-encoded Data URIs, or keep them as se
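The font-embedding option described for pdf2htmlEX can be illustrated with a small Python sketch. This is not pdf2htmlEX's own code; the function name `font_face_rule`, the family name `ff1`, the `font/woff` MIME type, and the placeholder bytes are all assumptions chosen for illustration. It only shows the general idea of inlining a font into CSS as a base64-encoded Data URI.

```python
import base64


def font_face_rule(family: str, font_bytes: bytes, mime: str = "font/woff") -> str:
    """Build a CSS @font-face rule with the font inlined as a base64 data URI.

    Hypothetical helper for illustration; not part of pdf2htmlEX.
    """
    payload = base64.b64encode(font_bytes).decode("ascii")
    return (
        f"@font-face {{ font-family: '{family}'; "
        f"src: url(data:{mime};base64,{payload}); }}"
    )


# Placeholder bytes stand in for the contents of a real font file.
rule = font_face_rule("ff1", b"\x00\x01\x02\x03")
print(rule)
```

Inlining the font this way keeps the page in a single file at the cost of a larger HTML payload, since base64 encoding grows the data by roughly a third.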
pdfplumber is a PDF data extraction library and layout analysis tool used to retrieve text, tables, and geometric objects from PDF files using precise coordinate-based analysis. It functions as a layout analyzer and table parser that identifies the bounding boxes and visual coordinates for every character and image on a page. The library distinguishes itself through visual debugging capabilities, allowing users to render PDF pages as images and draw annotations to verify the position of extracted data. It employs line and intersection analysis to identify cell structures and convert unstructu
PyPDF2 is a pure Python library for transforming, securing, and extracting data from PDF documents. It provides a comprehensive suite of tools to modify page layouts, manage document security, and retrieve embedded metadata without relying on external C libraries. The toolkit enables document assembly through the merging of multiple files and the splitting of documents into smaller parts. It also supports page-level transformations, including the ability to rotate pages and adjust visible crop areas. The library includes capabilities for security management via password-based encryption and
OpenPDF is a Java library and document processor used for creating, editing, rendering, and encrypting PDF documents. It functions as a toolkit for generating new files from scratch, modifying existing document structures, and extracting text content. The project includes a dedicated engine for transforming HTML and CSS content into PDF documents by parsing markup and applying styles. It also provides a rendering engine to convert PDF pages into image formats for thumbnails and previews, alongside a security utility for protecting content via document encryption. The library supports the add
React-pdf is a library of components designed to integrate document viewing and interaction into web applications. It provides a standardized interface for parsing and displaying portable document format files directly within a browser environment, supporting input from local files, remote web addresses, and encoded data strings. The library renders document content onto HTML5 canvas elements to ensure consistent visual display across browsers. To maintain interface responsiveness during document processing, it offloads parsing tasks to background threads. It also implements a layered approac
Jo is a command-line utility designed to construct and manipulate JSON objects and arrays directly from shell arguments and standard input. It functions as a data processing tool that transforms raw input into structured formats, enabling the generation of complex payloads for APIs, configuration files, and automated data pipelines. The tool distinguishes itself through its ability to resolve hierarchical data structures using delimiter-based path definitions and its integrated type-inference engine, which automatically casts input values into native boolean, numeric, or null types. Users can
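The two ideas in the Jo description, type inference and delimiter-based paths, can be sketched in Python as follows. This is a rough illustration under stated assumptions, not Jo's implementation: the helpers `infer` and `build`, the `key=value` input form, and the `.` delimiter are all hypothetical choices made for the example.

```python
import json


def infer(value: str):
    """Cast a raw string to bool, None, int, or float when it looks like one.

    Numeric detection follows Python's own int()/float() parsing rules,
    which is an assumption of this sketch.
    """
    if value == "true":
        return True
    if value == "false":
        return False
    if value == "null":
        return None
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value  # anything else stays a string


def build(pairs, sep="."):
    """Assemble a nested dict from key=value strings, splitting keys on sep."""
    root = {}
    for pair in pairs:
        key, _, raw = pair.partition("=")
        *parents, leaf = key.split(sep)
        node = root
        for part in parents:
            # Create intermediate objects on demand.
            node = node.setdefault(part, {})
        node[leaf] = infer(raw)
    return root


doc = build(["name=demo", "pages=12", "opts.embed=true", "opts.zoom=1.5", "opts.note=null"])
print(json.dumps(doc))
```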
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Sparrow is an LLM document extraction platform and vision-based inference engine designed to convert images and PDFs into validated structured data. It functions as an agentic workflow orchestrator that chains classification, extraction, and validation tasks into multi-step pipelines. The system distinguishes itself through a backend-agnostic inference layer that manages models across local GPUs, Apple Silicon, and cloud providers. It employs coordinate-based visual grounding to map extracted text to precise bounding box coordinates and utilizes hint-based model steering to guide attention an
Docling is a multimodal content converter and document parser designed to transform PDFs, Office files, and HTML into structured Markdown or JSON for generative AI applications. It functions as an OCR document processor and a PDF layout analyzer that extracts tables, charts, and hierarchical structures while preserving the original page layout. The system operates as a local-first inference engine, allowing for the processing of sensitive data in air-gapped environments without external network connectivity. It can also be deployed as an API or a Model Context Protocol server to provide parsi
Easy-dataset is a comprehensive platform designed for the end-to-end management of machine learning datasets, specifically tailored for language and vision model fine-tuning. It functions as a centralized environment for the entire data lifecycle, encompassing the automated generation of synthetic training data, the structural organization of document collections, and the systematic annotation of individual data points. The platform distinguishes itself through its integrated evaluation and orchestration capabilities. It provides a dedicated suite for benchmarking models, featuring blind side
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
The Open-Source Data Annotation Platform
Manage scalable open LLM inference endpoints in Slurm clusters
Fast Multimodal Semantic Deduplication & Filtering
Olmocr is a distributed document processing framework designed to convert PDF and image files into structured Markdown. It functions as a vision-based document parser that utilizes multimodal neural networks to interpret complex visual layouts and translate them into standardized text representations. The system operates as a remote inference orchestrator, offloading heavy document analysis tasks to external servers or cloud APIs to minimize local computational requirements. By employing a stateless worker architecture, it decouples document ingestion from inference, allowing for the distribu
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
Chandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSON schemas, with citations back to source locations. The service supports form filling in PDF and image documents, document generation from Markdown, and extraction of tracked changes from Word files. The platform distinguishes itself with pipeline-based processing chains that combine multiple proces
OCRFlux is a lightweight yet powerful multimodal toolkit that significantly advances PDF-to-Markdown conversion, excelling in complex layout handling, complicated table parsing and cross-page content merging.
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers