PyMuPDF

PyMuPDF is a Python library for PDF manipulation and document analysis. It extracts text, runs OCR, and converts pages to images, and it provides a programmatic interface to edit, merge, split, and optimize PDF and Office documents.

The project emphasizes performance: it exposes low-level manipulation through C bindings and can parallelize page processing to speed up large workloads. It also provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation (RAG) and large language model (LLM) pipelines.

Other capabilities include optical character recognition for creating searchable text layers, extraction of tables and key-value pairs, and security operations such as AES/RC4 encryption and permanent content redaction. The library also handles document geometry, layout analysis, and the generation of PDFs from HTML and CSS.

The library supports multi-format document loading for PDF, EPUB, MOBI, SVG, and Office files, with the ability to process files via memory streams.

Features

Text Extraction - Provides high-performance logic for retrieving raw text and structural metadata from PDF layers.

PDF Manipulation Utilities - Provides a comprehensive programmatic interface for merging, splitting, rotating, and restructuring PDF pages.

Document Layout Analysis - Performs layout analysis to identify functional areas such as pictures, text blocks, and tables.

Structured Document Extraction - Converts visual document layouts into machine-readable formats like JSON, HTML, or XML.

OCR Engines - Provides an OCR engine to create searchable text layers from images and scanned PDFs.

Optical Character Recognition - Converts images of text from scanned documents into machine-encoded text and searchable PDF layers.

Document and File Processing - Loads and processes a wide variety of formats including PDF, EPUB, MOBI, SVG, and Office documents.

Character-Level Geometry Extraction - Retrieves individual coordinates and properties for every single character in the document.

Document Conversion - Transforms documents between various file types including PDF, SVG, Markdown, DOCX, and raster images.

PDF Format Converters - Transforms documents between PDF and various other formats including SVG, HTML, and Office files.

Internal Object Manipulation - Modifies individual PDF dictionary keys and object definitions using low-level xref identifiers.

Document Splitting and Merging - Merges multiple PDF files into a single document, with support for specific page ranges and password-protected inputs.

PDF Libraries - Serves as a comprehensive cross-platform library for parsing and manipulating PDF documents.

Document Rendering - Converts document pages into raster images or vector graphics with adjustable resolution.

Document Metadata Extraction - Retrieves standard descriptive properties such as author, creator, and creation date from the document.

Document Metadata Management - Provides comprehensive tools for updating XML metadata, catalogs, and trailers to manage file-level information.

HTML to PDF Converters - Converts HTML and CSS source code into a PDF document by flowing content into target rectangles.

Page Rearrangements - Provides operations for modifying document structure through splitting, merging, and cropping pages.

PDF Document Generation - Generates new PDF files by programmatically defining page dimensions and flushing content to a buffer.

Searchable PDF Generation - Performs optical character recognition on images to create a PDF with a hidden, searchable text layer.

PDF Layout Analysis Tools - Analyzes layout geometry and determines spatial coordinates and bounding boxes of PDF elements.

Content Extraction - Extracts content grouped by logical blocks, lines, or words accompanied by their bounding box coordinates.

Annotation Data Extraction - Retrieves text, embedded files, audio data, or visual pixmaps from a document annotation.

Image Extractions - Isolates and saves image assets from within complex documents or renders pages as images.

Content Type Detection - Identifies the actual document format using internal data heuristics regardless of the file extension.
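To illustrate the general idea of content-based detection, the sketch below checks well-known leading byte signatures. It is plain Python and is not PyMuPDF's actual heuristic; the function name `sniff_format` and the returned labels are hypothetical.

```python
# Well-known leading byte signatures ("magic numbers") for a few formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip-container"),  # EPUB and DOCX are ZIP archives
]


def sniff_format(data: bytes) -> str:
    """Guess a format from the first bytes, ignoring any file name (hypothetical)."""
    head = data[:16]
    for signature, name in SIGNATURES:
        if head.startswith(signature):
            return name
    # Text-based formats: tolerate leading whitespace before the first tag.
    if head.lstrip().startswith((b"<?xml", b"<svg")):
        return "xml-or-svg"
    return "unknown"
```

A file named `report.txt` that starts with `%PDF-` would still be reported as a PDF, which is the behavior the feature describes.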

Document Extraction Tools - Identifies and retrieves tabular data and key-value pairs from document pages.

PDF Parsers - Parses PDF documents to extract text, tables, images, and structural layout data.

Text Search - Locates specific text strings within document pages and returns their precise coordinates.

Document Area Redactions - Permanently removes text, images, or vector graphics from specific areas of a document.

Vector Rasterizers - Renders vector page content into pixel maps to enable image export, OCR processing, and visual analysis.

Image Insertion Utilities - Places image files or pixmaps into a document page at specific coordinates.

Optical Character Recognition - Implements an OCR engine to generate searchable text layers from scanned documents and images.

Vector Annotation Insertion - Inserts PDF annotations including text markers, geometric shapes, stamps, redactions, and freehand ink.

C-Bindings - Utilizes high-performance C-bindings to provide Python access to low-level document manipulation and rendering functions.

Content Redaction Tools - Implements a review-and-apply process to permanently remove sensitive information from documents.

Document Annotators - Provides capabilities to create and modify highlights, underlines, redactions, stamps, and ink within documents.

Spatial Key-Value Extraction - Provides the ability to extract structured data from documents based on the spatial proximity of labels and their values.

Retrieval-Augmented Generation Frameworks - Integrates document loading tools with external orchestration libraries to facilitate retrieval-augmented generation workflows.

Raster Image Analysis - Determines image dimensions, resolution, and color usage to analyze embedded document images.

Raster Image Exports - Saves raster images to files or byte streams in multiple supported formats.

PDF Form Filling - Reads existing values from PDF form fields and programmatically updates them.

Content Overlaying - Implements merging of pages or images on top of existing content for watermarking and overlays.

PDF Security Management - Applies password encryption and manages user permissions to protect sensitive document content.

Parallel Processing - Accelerates rendering and data extraction for large files by splitting document workloads across multiple CPU cores.

Document Generation Templates - Populates PDF reports with external data using HTML placeholders and clones.

PDF Compression - Cleans document contents and applies compression to reduce PDF file size while preserving integrity.

Document Area Definitions - Specifies rectangular regions using coordinates to target areas for analysis or modification.

Document Watermarking - Adds visible or invisible stamps and watermarks to digital documents to indicate ownership or status.

Table of Contents - Creates a hyperlinked list of sections by tracking the page positions of document headings.

Coordinate Mapping - Provides the ability to log coordinates of headings and hyperlinks for generating automated tables of contents.

Page Box Modifications - Adjusts page rotation and the visible cropbox area to change how the document is displayed.

Page Cropping - Allows redefining the visible area of a PDF page by setting new boundary coordinates.

Text Page Generation - Supports adding new pages containing arbitrary text with configurable fonts and colors.

Page Rotations - Enables changing the visual orientation of individual PDF pages in fixed increments.

PDF Repair - Recovers and cleans problematic or corrupt PDF files to produce a valid, non-corrupt version.

PDF Storage Optimizations - Writes documents to disk using garbage collection and stream compression to optimize file size.

PDF Text Composition - Prepares text spans with specific fonts and positions to be written onto pages.

PDF to Markdown Converters - Transforms PDF content into Markdown for use in LLM pipelines and retrieval-augmented generation.

Document Layout Engines - Implements engines to calculate the visual arrangement and formatting of multi-column layouts, grids, and tables of contents.

Text Style Analysis - Retrieves metadata for text spans, including font names, sizes, colors, and stylistic flags.

Repetitive Element Filtering - Omits repetitive page headers and footers during text extraction to remove noise.

Vector Graphics Extraction - Extracts drawing commands as lists of dictionaries containing precise geometry and color information.

Outline Extraction - Retrieves the hierarchical tree of bookmarks to reconstruct a functional table of contents.

Structured Data Extraction - Produces structured JSON output that captures the bounding boxes and structural hierarchy of page elements.

Layout Preservation - Extracts text while precisely preserving the original visual spatial arrangement and coordinates of the content.

Text Search and Marking - Locates specific text strings within a document and applies visual annotations like highlights or underlines.

Outline Management - Provides capabilities to create and modify the table of contents for document navigation.

Document Element Inspection - Identifies and iterates through all links, annotations, and form fields present on a document page.

LLM-Optimized Formats - Provides specialized formatting of document data to optimize it for ingestion by large language models and RAG pipelines.

Coordinate Transformations - Applies a transformation matrix to a point to recalculate its position in a plane.

Vector - Creates graphical elements like lines, polygons, and curves using a sequence of drawing commands.

Vector Content Removal - Removes specific drawings by applying redaction annotations to the bounding boxes of the graphics.

2D Image Transformations - Calculates 3x3 transformation matrices to determine scaling, rotation, and translation of images.

Image Format Encoding - Transforms images between various raster formats via a common pixmap representation.

PDF to Image Rendering - Renders PDF pages into high-resolution raster images or scalable vector graphics.

Text Insertion Utilities - Adds text to pages using specified fonts, colors, and positions with support for alignment.

Vector Graphics Export - Generates a scalable vector graphics (SVG) representation of a document page.

Document Access Permissions - Controls document-level functional restrictions such as printing, modifying, or extracting content.

Document Encryption - Provides mechanisms to authenticate and unlock password-protected document files.

Document Object Models - Builds a structured tree of nodes to define document content and layout through a programmatic interface.

Unicode Glyph Mapping - Maps binary glyph names to Unicode values and calculates character widths for specific fonts.

Font Metric Mappers - Retrieves character dimensions and glyph bounding boxes to ensure accurate text placement and layout.

Interactive PDF Form Fields - Inserts interactive PDF form widgets and fillable fields into document pages.

HTML Content Processing - Renders HTML and CSS within a page rectangle with support for complex text shaping.

HTML Layout Parsers - Parses HTML and CSS into a geometric box tree to render structured content onto document pages.

Link Destination Analysis - Detects whether a document link points to an internal location, an external URI, or another file.

Rich Text Annotations - Adds free-text annotations with styling, custom fonts, and call-out lines to highlight page areas.

Transformation Matrix Scaling - Applies 3x3 matrices to scale, rotate, and translate page elements within a document.

PDF Table Rendering - Produces PDF tables from HTML source with automatic column and row calculations.

Documentation and Processing - Advanced PDF manipulation library.
