PyMuPDF

PyMuPDF is a Python library for PDF manipulation and document analysis. It extracts text, performs OCR, and converts pages to images, and it provides a programmatic interface to edit, merge, split, and optimize PDFs and to work with Office documents.

The project distinguishes itself through performance: it uses C bindings for low-level manipulation and parallelized page processing to speed up workloads. It also provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation (RAG) and large language model (LLM) pipelines.

Beyond that, it covers optical character recognition for creating searchable text layers, extraction of tables and key-value pairs, and security operations including AES/RC4 encryption and permanent content redaction. The library also handles complex document geometry, layout analysis, and the generation of PDFs from HTML and CSS.

The library supports multi-format document loading for PDF, EPUB, MOBI, SVG, and Office files, with the ability to process files via memory streams.
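
Loading from a memory stream might look like the sketch below. It is an illustration under stated assumptions, not the project's own example: the helper names `filetype_from_name` and `open_from_bytes` are hypothetical, and it assumes a PyMuPDF release recent enough to be imported as `pymupdf` (older releases use `fitz`). PyMuPDF's `open` accepts a `stream` of bytes together with a `filetype` hint, which the sketch derives from the file name.

```python
from pathlib import Path


def filetype_from_name(filename: str) -> str:
    """Derive a filetype hint such as "pdf" or "epub" from a file name.

    Hypothetical helper: it only takes the last extension, lowercases it,
    and strips the leading dot.
    """
    suffix = Path(filename).suffix.lower().lstrip(".")
    if not suffix:
        raise ValueError(f"cannot infer a file type from {filename!r}")
    return suffix


def open_from_bytes(data: bytes, filename: str):
    """Open a document held in memory, without writing it to disk."""
    # Imported lazily so the helper above works without PyMuPDF installed.
    # Assumption: a recent release; older ones use `import fitz` instead.
    import pymupdf

    return pymupdf.open(stream=data, filetype=filetype_from_name(filename))
```

With PyMuPDF installed, `open_from_bytes(payload, "report.pdf")` would return a document whose pages can be iterated and read with `page.get_text()`.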

Features

Text Extraction - Provides high-performance logic for retrieving raw text and structural metadata from PDF layers.
PDF Manipulation Utilities - Provides a comprehensive programmatic interface for merging, splitting, rotating, and restructuring PDF pages.
Document Layout Analysis - Performs layout analysis to identify functional areas such as pictures, text blocks, and tables.
Structured Document Extraction - Converts visual document layouts into machine-readable formats like JSON, HTML, or XML.
OCR Engines - Provides an OCR engine to create searchable text layers from images and scanned PDFs.
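
As a rough illustration of the splitting listed under PDF Manipulation Utilities, the sketch below cuts a PDF into fixed-size chunks. The function names `chunk_ranges` and `split_pdf` are hypothetical, and the import assumes a recent PyMuPDF release (older ones use `fitz`). It relies on `Document.insert_pdf`, whose `from_page` and `to_page` arguments are zero-based and inclusive.

```python
def chunk_ranges(page_count: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return inclusive, zero-based (first, last) page ranges per chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(data: bytes, chunk_size: int) -> list[bytes]:
    """Split an in-memory PDF into several PDFs of chunk_size pages each."""
    # Imported lazily so chunk_ranges stays usable without PyMuPDF installed.
    import pymupdf

    parts = []
    with pymupdf.open(stream=data, filetype="pdf") as src:
        for first, last in chunk_ranges(src.page_count, chunk_size):
            part = pymupdf.open()  # a new, empty PDF
            part.insert_pdf(src, from_page=first, to_page=last)
            parts.append(part.tobytes())
            part.close()
    return parts
```

Merging works the other way around: calling `insert_pdf` repeatedly on one target document appends the pages of each source in turn.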

Open-source alternatives to PyMuPDF

Similar open-source projects, ranked by how many features they share with PyMuPDF.

py-pdf/pypdf
GitHub stars: 9,818
pypdf is a Python library for parsing, manipulating, and generating PDF documents. It provides high-level operations for document processing, such as merging multiple files into one or splitting a single document into smaller files. The project includes specialized tools for managing interactive elements, including the creation and modification of annotations, hyperlinks, and form fields. It also supports advanced metadata management, allowing for the extraction and modification of standard document properties and XML-based XMP metadata. Beyond basic structural changes, the library covers pa
Tags: Python, help-wanted, pdf, pdf-documents
pdfminer/pdfminer.six
GitHub stars: 6,906
pdfminer.six is a programmatic tool for extracting text, layout information, and metadata from PDF documents into machine-readable formats. It functions as a document parser that converts internal PDF objects and structures into accessible data objects for analysis. The project includes utilities for decrypting RC4 and AES encrypted files to enable content extraction. It also provides a layout analyzer to identify fonts, colors, and text locations to determine the organizational structure of pages. The system covers a broad range of extraction capabilities, including the retrieval of embedde
Tags: Python, parser, pdf, python
kreuzberg-dev/kreuzberg
GitHub stars: 8,527
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Tags: Rust, document-intelligence, elixir, ffi
jsvine/pdfplumber
GitHub stars: 9,732
pdfplumber is a PDF data extraction library and layout analysis tool used to retrieve text, tables, and geometric objects from PDF files using precise coordinate-based analysis. It functions as a layout analyzer and table parser that identifies the bounding boxes and visual coordinates for every character and image on a page. The library distinguishes itself through visual debugging capabilities, allowing users to render PDF pages as images and draw annotations to verify the position of extracted data. It employs line and intersection analysis to identify cell structures and convert unstructu
Tags: Python, pdf, pdf-parsing, table-extraction

