10 Repos
Tools for creating, manipulating, and extracting data from PDF files.
PDF Processing: 10 GitHub repositories drawn from a section of an awesome list.
Zotero is reference management software designed for collecting, organizing, and citing bibliographic research sources and digital documents for academic work. It functions as a web bibliographic collector, a citation generator, and a collaborative research platform. The system integrates tools for capturing metadata and archiving web pages into a centralized research library. It provides a specialized environment for reading and marking up PDF and EPUB files with highlights and notes linked directly to research sources. The software covers a broad range of capabilities including bibliograph
Extracts and manipulates data within PDF files to facilitate deep research analysis.
pypdf is a Python library for parsing, manipulating, and generating PDF documents. It provides high-level operations for document processing, such as merging multiple files into one or splitting a single document into smaller files. The project includes specialized tools for managing interactive elements, including the creation and modification of annotations, hyperlinks, and form fields. It also supports advanced metadata management, allowing for the extraction and modification of standard document properties and XML-based XMP metadata. Beyond basic structural changes, the library covers pa
Combines multiple PDF documents into one file while handling object cloning.
pdfminer.six is a programmatic tool for extracting text, layout information, and metadata from PDF documents into machine-readable formats. It functions as a document parser that converts internal PDF objects and structures into accessible data objects for analysis. The project includes utilities for decrypting RC4 and AES encrypted files to enable content extraction. It also provides a layout analyzer to identify fonts, colors, and text locations to determine the organizational structure of pages. The system covers a broad range of extraction capabilities, including the retrieval of embedde
Implements RC4 and AES decryption to enable programmatic extraction of content from protected PDF files.
OpenPDF is a Java library and document processor for creating, editing, rendering, and encrypting PDF documents. It serves as a toolkit for generating new files from scratch, modifying existing document structures, and extracting text content. The project includes a dedicated engine for transforming HTML and CSS content into PDF documents by parsing markup and applying styles. It also provides a rendering engine that converts PDF pages to image formats for thumbnails and previews, as well as a security utility for protecting content through document encryption. The library supports adding graphics, tables, and multi-page TIFF images. It handles complex typography through support for multi-byte characters, bidirectional text, and non-Latin scripts. The software runs on the cross-platform Java runtime and includes packages that enable document processing in Android environments.
Open-source fork for programmatic PDF creation.
pdf2docx is a suite of PDF utilities designed to transform static PDF documents into editable DOCX files. It functions as a multi-core processor capable of accelerating the conversion of large files by distributing page tasks across multiple CPU cores. The project includes specialized tools for decrypting password-protected PDF files and extracting tabular content as structured data. It also provides a layout analyzer to visually inspect and verify document structure during the conversion process. Conversion is accessible through both a graphical user interface and a command-line interface,
Removes encryption from PDF files to enable content processing and format conversion.
Pdfcraft is a containerized service for self-managed PDF processing, editing, and conversion. It provides a toolkit for document manipulation, a multi-format converter, and OCR software to transform scanned documents into searchable and editable text. The project features a visual, node-based workflow editor that allows users to build automated pipelines by chaining together various PDF conversion and optimization operations. The service covers a broad range of capabilities, including document management for merging and splitting files, format conversion between PDFs and office documents or
Provides a drag-and-drop visual editor for building automated PDF processing pipelines.
XML/XHTML and CSS 2.1 renderer in pure Java
Renders XML/XHTML and CSS 2.1 to documents.
Extract tables from PDF files
Extracts tabular data from existing PDF files.
An HTML to PDF library for the JVM, based on Flying Saucer and Apache PDFBox 3, with SVG image support and accessible PDF support (WCAG, Section 508, PDF/UA).
Modern PDF standard support based on existing rendering engines.
Java reporting library for creating dynamic report designs at runtime
Simplifies report generation based on JasperReports.