# ocrmypdf/ocrmypdf

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/ocrmypdf-ocrmypdf).**

33,898 stars · 2,339 forks · Python · MPL-2.0

## Links

- GitHub: https://github.com/ocrmypdf/OCRmyPDF
- Homepage: http://ocrmypdf.readthedocs.io/
- awesome-repositories: https://awesome-repositories.com/repository/ocrmypdf-ocrmypdf.md

## Topics

`image-processing` `ocr` `pdf` `python` `tesseract`

## Description

OCRmyPDF is a command-line tool that turns scanned documents into searchable, selectable PDF files. It works as a document processing pipeline: it adds a hidden text layer to image-based files and optimizes the document's file size and image quality along the way. The text layer makes digitized documents accessible to screen readers and search engines, while the visual appearance of the input is preserved.
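As a rough sketch of what a single run looks like, the helper below assembles an argument list for the `ocrmypdf` command without executing it. The function name `build_ocr_command` and the file names are hypothetical, chosen only for illustration. The `--deskew` and `--optimize` options belong to the OCRmyPDF command line, and the tool itself must be installed separately before the resulting command can be run.

```python
import shlex
from pathlib import Path


def build_ocr_command(src, dst, deskew=False, optimize=1):
    """Return the argv list for one OCRmyPDF run (hypothetical helper)."""
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be between 0 and 3")
    argv = ["ocrmypdf", "--optimize", str(optimize)]
    if deskew:
        argv.append("--deskew")  # straighten crooked scans before OCR
    # Input and output paths come last, in that order.
    argv += [str(Path(src)), str(Path(dst))]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even on a machine where OCRmyPDF is not installed.
    cmd = build_ocr_command("scan.pdf", "scan_ocr.pdf", deskew=True)
    print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute it. Building the list first keeps the command easy to inspect and avoids shell quoting problems with unusual file names.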

The project stands out for a modular architecture that supports custom plugins and external recognition engines, so users can tailor the processing workflow to unusual file formats or specific requirements. It supports multiple languages through configurable language packs and handles large document volumes through automated batch processing.
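To make the batch and multi-language points concrete, here is a minimal sketch that plans one `ocrmypdf` command per PDF in a folder. The helper name `plan_batch` and the directory names are hypothetical. The `-l` option with language codes joined by `+` follows the OCRmyPDF command line, and the sketch assumes the matching Tesseract language packs (here `eng` and `deu`) are already installed.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir, languages=("eng",)):
    """Yield one ocrmypdf argv list per PDF found directly in input_dir."""
    if not languages:
        raise ValueError("at least one language code is required")
    lang_arg = "+".join(languages)  # e.g. "eng+deu"
    out_root = Path(output_dir)
    # Sorted for a stable order; the pattern matches "*.pdf" names only.
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        target = out_root / pdf.name
        if target.exists():
            continue  # already produced by an earlier run
        yield ["ocrmypdf", "-l", lang_arg, str(pdf), str(target)]


if __name__ == "__main__":
    # Only prints the planned commands; nothing is executed here.
    for argv in plan_batch("scans", "searchable", languages=("eng", "deu")):
        print(" ".join(argv))
```

Skipping outputs that already exist makes the plan safe to re-run after an interruption, which matters for large archives where a full pass can take hours.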

The software manages complex system-level dependencies and external binary tools through containerized environments, ensuring consistent execution across different host operating systems. It is available for installation via standard Python package managers or native system package managers on Linux, macOS, and Windows, and includes comprehensive documentation covering API usage, performance tuning, and cloud deployment strategies.

## Tags

### Content Management & Publishing

- [Document Processing and Conversion](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion.md) — Generates searchable PDF files by layering text and optimizing content while maintaining visual fidelity. ([source](https://cdn.jsdelivr.net/gh/ocrmypdf/OCRmyPDF@main/README.md))

### Data & Databases

- [Batch Processing Systems](https://awesome-repositories.com/f/data-databases/batch-processing-systems.md) — Automates the standardization and text extraction of large document volumes for archival and indexing workflows.

### Development Tools & Productivity

- [Package Managers](https://awesome-repositories.com/f/development-tools-productivity/package-managers.md) — Provides support for installation and dependency management through standard Python package managers. ([source](https://ocrmypdf.readthedocs.io/en/latest/installation.html))
- [Plugin Architectures](https://awesome-repositories.com/f/development-tools-productivity/plugin-architectures.md) — Enables the integration of custom plugins to replace default engines with specialized document processing tools. ([source](https://cdn.jsdelivr.net/gh/ocrmypdf/OCRmyPDF@main/README.md))

### DevOps & Infrastructure

- [Container Environments](https://awesome-repositories.com/f/devops-infrastructure/container-environments.md) — Packages complex system-level dependencies and external binary tools into isolated environments to ensure consistent execution across different host operating systems.
- [Containerization Tools](https://awesome-repositories.com/f/devops-infrastructure/containerization-tools.md) — Provides pre-configured container images to simplify deployment and manage complex system dependencies automatically. ([source](https://ocrmypdf.readthedocs.io/en/latest/installation.html))
- [Installation Utilities](https://awesome-repositories.com/f/devops-infrastructure/installation-utilities.md) — Provides native installation support for Linux distributions to ensure compatibility and ease of deployment. ([source](https://ocrmypdf.readthedocs.io/en/latest/installation.html))

### Graphics & Multimedia

- [Image Analysis Tools](https://awesome-repositories.com/f/graphics-multimedia/image-editing-processing/image-analysis-tools.md) — Supports multi-language document processing by allowing the configuration and installation of specific recognition language packs. ([source](https://cdn.jsdelivr.net/gh/ocrmypdf/OCRmyPDF@main/README.md))

### Part of an Awesome List

- [AI and Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/ai-and-machine-learning.md) — Adds searchable text layers to scanned PDF documents.
- [Data Management Systems](https://awesome-repositories.com/f/awesome-lists/data/data-management-systems.md) — Adds searchable text layers to scanned PDF documents.

### Education & Learning Resources

- [API Documentation Guides](https://awesome-repositories.com/f/education-learning-resources/api-documentation-guides.md) — Documents the API for calling OCRmyPDF from other applications, with practical usage patterns. ([source](http://ocrmypdf.readthedocs.io/api.html))
- [Deployment Guides](https://awesome-repositories.com/f/education-learning-resources/deployment-guides.md) — Provides operational guides for deploying OCRmyPDF to cloud infrastructure. ([source](http://ocrmypdf.readthedocs.io/cloud.html))
