# artifexsoftware/pdf2docx

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/artifexsoftware-pdf2docx).**

3,305 stars · 471 forks · Python · agpl-3.0

## Links

- GitHub: https://github.com/ArtifexSoftware/pdf2docx
- Homepage: https://pdf2docx.readthedocs.io
- awesome-repositories: https://awesome-repositories.com/repository/artifexsoftware-pdf2docx.md

## Topics

`docx` `extract-table` `pdf-converter` `pdf-to-word` `pymupdf`

## Description

pdf2docx is a suite of PDF utilities for converting static PDF documents into editable DOCX files. For large files, it can speed up conversion by distributing page tasks across multiple CPU cores.
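As a rough illustration of the multi-core option, the sketch below wraps the library's `Converter` class. It assumes the `multi_processing` and `cpu_count` keyword arguments described in the pdf2docx documentation; `conversion_options` and `convert_pdf` are hypothetical helper names introduced here, not part of the library.

```python
import os


def conversion_options(workers=None):
    """Build keyword arguments for Converter.convert() (hypothetical helper).

    workers=None keeps the single-process default; a positive integer
    enables multi-processing, capped at the number of available cores.
    """
    if workers is None:
        return {}
    if workers < 1:
        raise ValueError("workers must be a positive integer")
    available = os.cpu_count() or 1
    return {"multi_processing": True, "cpu_count": min(workers, available)}


def convert_pdf(pdf_path, docx_path, workers=None):
    """Convert one PDF to DOCX, optionally spreading pages across cores."""
    # Imported lazily so this module still loads where pdf2docx is not installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        # Assumption: convert() accepts multi_processing / cpu_count as keywords.
        cv.convert(docx_path, **conversion_options(workers))
    finally:
        cv.close()
```

Because multi-processing starts worker processes, a call such as `convert_pdf("input.pdf", "output.docx", workers=4)` is best placed under an `if __name__ == "__main__":` guard. The file names are placeholders.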

The project includes specialized tools for decrypting password-protected PDF files and extracting tabular content as structured data. It also provides a layout analyzer to visually inspect and verify document structure during the conversion process.
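A minimal sketch of how decryption and table extraction might be combined, assuming `Converter` accepts a `password` argument and `extract_tables()` returns each table as a list of rows of cell text, with `None` for empty cells. The helpers `table_to_csv` and `extract_tables_as_csv` are illustrative names, not library functions.

```python
import csv
import io


def table_to_csv(table):
    """Render one extracted table (a list of rows) as CSV text.

    Assumption: empty cells arrive as None; they are written as blanks.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else str(cell).strip() for cell in row])
    return buffer.getvalue()


def extract_tables_as_csv(pdf_path, password=None, start=0, end=None):
    """Open a (possibly password-protected) PDF and return its tables as CSV strings."""
    # Imported lazily so table_to_csv can be used without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path, password=password)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv(table) for table in tables]
```

Keeping the CSV rendering separate from the extraction call makes it easy to swap in another output format, such as JSON, without touching the PDF handling.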

Conversion is accessible through both a graphical user interface and a command-line interface, which supports automated batch processing and scripting workflows.
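For scripted batch workflows, the same conversion can be driven from Python. The sketch below assumes the library's `parse(pdf_file, docx_file)` convenience function; `docx_path_for` and `convert_batch` are hypothetical helpers, and the directory layout is only an example.

```python
from pathlib import Path


def docx_path_for(pdf, input_dir, output_dir):
    """Mirror a PDF's location under input_dir into output_dir with a .docx suffix."""
    relative = Path(pdf).relative_to(Path(input_dir))
    return Path(output_dir) / relative.with_suffix(".docx")


def convert_batch(input_dir, output_dir):
    """Convert every PDF under input_dir; return (pdf, error) pairs for failures."""
    # Imported lazily so the path helper works without pdf2docx installed.
    from pdf2docx import parse

    failures = []
    for pdf in sorted(Path(input_dir).rglob("*.pdf")):
        docx = docx_path_for(pdf, input_dir, output_dir)
        docx.parent.mkdir(parents=True, exist_ok=True)
        try:
            parse(str(pdf), str(docx))
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            failures.append((pdf, exc))
    return failures
```

Collecting failures instead of stopping at the first error lets a long batch finish and report problem files at the end.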

## Tags

### Content Management & Publishing

- [PDF to DOCX Converters](https://awesome-repositories.com/f/content-management-publishing/pdf-to-html-converters/pdf-to-docx-converters.md) — Transforms static PDF documents into editable Word DOCX files while preserving original layout. ([source](https://pdf2docx.readthedocs.io))
- [Parallel Processing](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/pdf-processing-engines/parallel-processing.md) — Distributes individual page conversion tasks across multiple CPU cores to accelerate processing.
- [Table Extraction Utilities](https://awesome-repositories.com/f/content-management-publishing/documentation-knowledge-management/pdf-structural-elements/table-extraction-utilities.md) — Isolates and retrieves tabular content from PDF pages as structured data.
- [Tabular Data Reconstruction](https://awesome-repositories.com/f/content-management-publishing/tabular-data-reconstruction.md) — Identifies and reconstructs structural grids from unstructured PDF visuals into digital table formats.
- [Layered Decomposition](https://awesome-repositories.com/f/content-management-publishing/documentation-knowledge-management/pdf-structural-elements/layered-decomposition.md) — Decomposes PDF structures into primitive text and vector components before mapping them to Word elements.
- [PDF Layout Analysis Tools](https://awesome-repositories.com/f/content-management-publishing/pdf-layout-analysis-tools.md) — Programmatically determines the spatial coordinates and bounding boxes of elements within a PDF for verification.
- [Encrypted PDF Unlockers](https://awesome-repositories.com/f/content-management-publishing/pdf-text-extraction/encrypted-pdf-unlockers.md) — Decrypts password-protected PDF files to make them accessible for conversion or data extraction.

### Software Engineering & Architecture

- [Document Processing Parallelization](https://awesome-repositories.com/f/software-engineering-architecture/pipeline-optimization-techniques/multi-core-parallelization/document-processing-parallelization.md) — Uses multiple CPU cores to accelerate the conversion of large PDF files.

### User Interface & Experience

- [Tabular Data Extraction](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/pdf-and-html-content-extraction/tabular-data-extraction.md) — Isolates and retrieves tabular content from PDF pages as structured data. ([source](https://pdf2docx.readthedocs.io/en/latest/quickstart.cli.html))

### Part of an Awesome List

- [Decryption Utilities](https://awesome-repositories.com/f/awesome-lists/devtools/pdf-processing/decryption-utilities.md) — Removes encryption from PDF files to enable content processing and format conversion. ([source](https://pdf2docx.readthedocs.io/en/latest/quickstart.convert.html))

### Graphics & Multimedia

- [Coordinate-Based Layout Analysis](https://awesome-repositories.com/f/graphics-multimedia/coordinate-based-layout-analysis.md) — Analyzes physical page coordinates to determine document structure and group text fragments.
- [Layout Visualization Tools](https://awesome-repositories.com/f/graphics-multimedia/layout-visualization-tools.md) — Provides a visual tool to inspect and verify the accuracy of document layout during conversion. ([source](https://pdf2docx.readthedocs.io/en/latest/quickstart.cli.html))
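To make the idea of coordinate-based grouping concrete, here is a small standalone sketch that merges text fragments into lines when their bounding boxes overlap vertically. It illustrates the general technique only and is not pdf2docx's actual algorithm; the `Fragment` type, the `min_overlap` threshold, and the top-left coordinate origin are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its bounding box (y grows downward, by assumption)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(fragments, min_overlap=0.5):
    """Group fragments into lines by vertical overlap, then order each line by x."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y0, f.x0)):
        for line in lines:
            ref = line[0]
            # Height shared by the two boxes, relative to the shorter box.
            overlap = min(frag.y1, ref.y1) - max(frag.y0, ref.y0)
            shorter = min(frag.y1 - frag.y0, ref.y1 - ref.y0)
            if shorter > 0 and overlap / shorter >= min_overlap:
                line.append(frag)
                break
        else:
            # No existing line overlaps enough, so start a new one.
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x0)) for line in lines]
```

For example, two fragments at nearly the same height end up on one line regardless of the order in which they were extracted, while a fragment further down the page starts a new line.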

### Security & Cryptography

- [Input Stream Decryption](https://awesome-repositories.com/f/security-cryptography/message-decryption/input-stream-decryption.md) — Decrypts encrypted PDF byte streams during the read process to allow access to the document model.
