pypdf

pypdf is a Python library for parsing, manipulating, and generating PDF documents. It provides high-level operations for document processing, such as merging multiple files into one or splitting a single document into smaller files.

The project includes specialized tools for managing interactive elements, including the creation and modification of annotations, hyperlinks, and form fields. It also supports advanced metadata management, allowing for the extraction and modification of standard document properties and XML-based XMP metadata.

Beyond basic structural changes, the library covers page management through rotation, cropping, and scaling, as well as text and image extraction with layout-preserving options. It provides security utilities for document encryption and decryption, and optimization tools to reduce file size by removing images or applying lossless compression.

Features

Document Splitting and Merging - Combines multiple PDF files into one or divides a single document into smaller files.

Document Processing - Divides single PDF files into smaller documents by extracting specific page ranges.

PDF Processing - Combines multiple PDF documents into one file while handling object cloning.

Text Extraction - Extracts text from PDF layers conditionally based on spatial coordinates or font properties.

PDF Navigational Bookmarks - Creates and modifies hierarchical PDF outlines and navigational bookmarks to organize document sections.

Document Content Structuring - Reorganizes content by splitting, merging, and transforming pages within a document.

Link Management - Implements methods for adding and removing internal and external hyperlinks within PDF documents.

PDF Manipulation Utilities - Provides utilities for merging multiple PDF files into a single output while preserving object links.

Bookmark Managers - Provides specialized tools for creating, removing, and nesting navigational bookmarks to organize document structure.

PDF Libraries - Parses internal PDF structures by locating cross-reference tables and decoding binary content streams.

Layout Reorganization - Splits, merges, and transforms pages to reorganize or resize the layout.

Page Counting - Determines the total number of pages within a document.

Page Index Retrieval - Provides the ability to access individual pages by index number for targeted operations on document sections.

Page Insertion - Adds empty or specific pages at designated positions to restructure the file.

Page Sequence Managers - Splits, merges, and transforms pages to reorganize document structure.

Page Rearrangements - Allows rearranging and resizing document content by splitting, merging, and cropping pages.

PDF Document Generation - Generates PDF documents by structuring headers, bodies, and internal cross-reference tables.

Content Extraction - Provides utilities for retrieving raw text and structural information from PDF documents.

Deep Object Manipulations - Performs recursive modifications on the hierarchical structure of PDF dictionaries and arrays.

Binary Stream Decoding - Processes compressed binary data blocks using filters like FlateDecode to retrieve text and images.
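FlateDecode is the zlib/deflate format, so the idea can be shown with the standard library alone. The snippet below is a conceptual sketch, not pypdf's internal code; the content stream is a made-up stand-in.

```python
import zlib

# A stand-in for a page content stream (PDF text-drawing operators).
raw_content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET"

# What a PDF producer stores when the stream dictionary says /Filter /FlateDecode.
encoded_stream = zlib.compress(raw_content)

# What a reader does to recover the operators before interpreting them.
decoded_stream = zlib.decompress(encoded_stream)
```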

Cross-Reference Table Mappings - Parses internal PDF cross-reference tables to enable random access to document objects.
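To make the random-access idea concrete, here is a simplified parser for a classic (non-stream) cross-reference section. It is an illustrative sketch only: the function name and sample offsets are invented, and real files also need trailer handling, incremental sections, and cross-reference streams.

```python
from typing import Dict, Tuple


def parse_xref_section(section: str) -> Dict[int, Tuple[int, int]]:
    """Map object number -> (byte offset, generation) for in-use entries."""
    lines = section.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("not a cross-reference section")
    table: Dict[int, Tuple[int, int]] = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number and entry count.
        first_obj, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first_obj, first_obj + count):
            offset, generation, kind = lines[i].split()
            if kind == "n":  # "n" = in use, "f" = free
                table[obj_num] = (int(offset), int(generation))
            i += 1
    return table


sample = "\n".join(
    [
        "xref",
        "0 3",
        "0000000000 65535 f",
        "0000000017 00000 n",
        "0000000081 00000 n",
    ]
)
xref = parse_xref_section(sample)
```

With such a table, a reader can seek straight to the byte offset of object 2 instead of scanning the whole file.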

PDF Security and Signing - Removes password protection from PDF files to enable reading and modification of contents.

PDF Permission Controls - Applies encryption and access restrictions to PDF files to prevent unauthorized editing or printing.

PDF Restriction Removers - Uses provided passwords to remove encryption and allow access to protected PDF content.

Document Encryption - Encrypts documents using user or owner passwords to restrict unauthorized access.

Symmetric Encryption - Implements symmetric AES encryption and decryption to secure document access and permissions.

Indirect Object Mapping - Tracks and manages shared PDF objects through a reference system to avoid duplicating data.

Glyph Mappings - Translates binary character codes into human-readable text using embedded font glyph maps.
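The translation step can be pictured with a toy mapping. The table below is hypothetical and stands in for what an embedded ToUnicode CMap would provide; it is not how pypdf stores its maps internally.

```python
# Hypothetical two-byte character codes -> Unicode text.
to_unicode = {0x0001: "H", 0x0002: "e", 0x0003: "l", 0x0004: "o"}


def decode_glyph_codes(data: bytes, cmap: dict) -> str:
    """Translate big-endian two-byte codes to text; unknown codes become U+FFFD."""
    chars = []
    for i in range(0, len(data) - 1, 2):
        code = int.from_bytes(data[i:i + 2], "big")
        chars.append(cmap.get(code, "\ufffd"))
    return "".join(chars)


# The bytes a content stream might pass to a text-showing operator.
shown_string = bytes.fromhex("00010002000300030004")
text = decode_glyph_codes(shown_string, to_unicode)
```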

Interactive PDF Form Fields - Manages interactive PDF form fields and the document table of contents.

Form Data Extraction - Parses interactive PDF form fields to retrieve names, values, and page locations.

Page Box Modifications - Adjusts crop, bleed, and trim box properties to control the visible area.

PDF Metadata Editors - Updates internal properties and viewing options to change how documents are identified.

PDF Metadata Managers - Manages and edits internal property fields and viewer preferences of PDF documents.

File Attachments - Retrieves embedded files and associated metadata from PDF structures for export.

Attachment Removal - Provides the ability to delete specific embedded files from a PDF document.

Internal Object Manipulation - Reads and clones generic PDF data structures like dictionaries and arrays to modify internal document properties.

Content Overlaying - Merges a page or image onto an existing page to create watermarks or overlays.

Content Scaling - Adjusts the size of page contents and the canvas to resize elements.

Dimension Adjustments - Adjusts page layout by cropping, rotating, or scaling content dimensions.

Document Metadata Extraction - Retrieves descriptive information such as title, author, and creation date from PDF files.

Document Metadata Management - Creates, updates, or removes standard information fields and custom metadata entries.

Document Property Modification - Adds custom metadata and viewing options to modify how a PDF file is displayed.

Document Watermarking - Adds or reads stamps and watermarks to customize the visual identification of documents.

File Size Optimizations - Reduces document file size by eliminating redundant objects and applying lossless compression.

Incremental PDF Updates - Appends new content to the end of the file to preserve original data and existing digital signatures.

Malformed Document Handling - Implements a best-effort approach to process malformed PDF files that violate official specifications.

Page Box Definitions - Defines boundaries like crop, bleed, and media boxes including their coordinates.

Page Cropping - Adjusts the visible area of a page by defining new boundary coordinates.

Page Rotations - Enables changing the visual orientation of PDF pages by increments of 90 degrees.

Page Overlays - Provides the ability to merge pages or images onto another using rotation, translation, and scaling transformations.

PDF Storage Optimizations - Reduces overall file size and ensures the document adheres to PDF standards.

PDF Viewer Preferences - Sets document-level flags to control viewer behaviors like menu visibility and duplex printing.

XMP Metadata Management - Creates and modifies XML-based metadata structures to store advanced properties.

Image Extractions - Provides tools for isolating and saving image assets from within PDF documents as external files.

In-Memory Document Processing - Reads and writes PDF content using byte streams to avoid the need for temporary intermediate files.

Layout Preservation - Retrieves text while preserving the visual layout, orientation, and spacing of the original document.

2D Vector Transformations - Modifies elements using 2D transformations like scaling, rotating, and translating.
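PDF expresses these transformations as six-number matrices (a, b, c, d, e, f). The standard-library sketch below shows how scaling, rotation, and translation compose under that convention; the helper names are illustrative and are not pypdf's API.

```python
import math
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)


def scale(sx: float, sy: float) -> Matrix:
    return (sx, 0.0, 0.0, sy, 0.0, 0.0)


def translate(tx: float, ty: float) -> Matrix:
    return (1.0, 0.0, 0.0, 1.0, tx, ty)


def rotate(degrees: float) -> Matrix:
    """Counterclockwise rotation about the origin."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r), -math.sin(r), math.cos(r), 0.0, 0.0)


def compose(first: Matrix, second: Matrix) -> Matrix:
    """Return the matrix that applies `first`, then `second`."""
    a1, b1, c1, d1, e1, f1 = first
    a2, b2, c2, d2, e2, f2 = second
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


# Scale by 2, then move by (10, 20): the point (1, 1) lands on (12, 22).
point = apply(compose(scale(2, 2), translate(10, 20)), 1, 1)
rotated = apply(rotate(90), 1, 0)  # approximately (0, 1)
```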

XMP Structure Manipulation - Modifies extensible metadata structures to define detailed properties and standards compliance.

PDF Interactive Elements - Adds interactive elements such as highlights, links, and geometric shapes to document pages.

Vector Annotation Insertion - Adds geometric shapes and free-text boxes with customizable colors and styles to PDF pages.

Metadata Extraction - Extracts embedded information and XMP metadata directly from PDF files.

XMP - Retrieves document information and XMP data to identify file versions.

XMP Data Retrieval - Reads Extensible Metadata Platform (XMP) data to retrieve authors, creation dates, and producer tools.

Decryption Utilities - Implements utilities to remove AES encryption from PDF files to enable content processing.

Document Access Permissions - Controls user access levels for printing, modifying, and extracting text from secured documents.

PDF Form and Annotation Tools - Embeds fillable form fields and interactive annotations into PDF files.

Form Field Configuration - Configures properties for interactive PDF form fields, including read-only status and password protection.

Hyperlink Managers - Provides tools for creating internal and external clickable hyperlinks within PDF documents.

Spatial Text Filtering - Filters text fragments based on position to exclude non-content areas like headers and footers.
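The filtering logic itself is simple once each fragment carries a position. The sketch below assumes fragments arrive as (text, y) pairs, which is a hypothetical shape chosen for the example; with pypdf, such pairs could be collected through the visitor_text callback of extract_text, which receives the text along with its transformation matrices.

```python
from typing import List, Tuple

Fragment = Tuple[str, float]  # (text, vertical position in page units)


def keep_body_text(
    fragments: List[Fragment], footer_top: float, header_bottom: float
) -> str:
    """Keep fragments strictly between the footer and header bands."""
    return "".join(
        text for text, y in fragments if footer_top < y < header_bottom
    )


fragments = [
    ("Annual Report\n", 760.0),  # running header
    ("Revenue grew in all regions.\n", 640.0),
    ("Costs stayed flat.\n", 620.0),
    ("Page 3\n", 30.0),  # footer
]
body = keep_body_text(fragments, footer_top=50.0, header_bottom=720.0)
```

The band limits depend on the page layout and usually need tuning per document family.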

Text Highlighting - Applies highlight annotations to specific rectangular areas to emphasize document content.

Document Annotators - Provides utilities for adding and modifying persistent notes and annotations within PDF documents.

Document and File Processing - Manipulates PDF pages through splitting, merging, and transformation.
