Open-source platforms for scanning, indexing, and organizing digital documents and paper records locally.
Stirling-PDF is a web-based PDF management suite for editing, merging, splitting, and converting PDF documents. It functions as a self-hosted document manager, giving users a centralized interface for manipulating files on a private server. A workflow automation engine lets users build processing pipelines that handle large volumes of documents without custom code, and a built-in optical character recognition tool converts scanned PDFs into searchable, editable text. Authentication is handled through OIDC-compatible single sign-on, and audit logs are maintained for compliance. The application is deployed as a container and exposes its functions through a REST API for integration with external software.
Stirling-PDF is a self-hosted PDF manipulation and processing suite that provides essential document management features such as OCR and automated workflows, though it focuses more on file transformation than on long-term document indexing and archival.
Paperless is a self-hosted document management system designed to digitize, index, and archive paper documents. It runs optical character recognition on scanned images and PDFs to build a searchable digital library, with a web-based interface for querying and retrieving documents from a database. An automated ingestion pipeline monitors specified directories and email inboxes and imports new documents without manual uploading. To keep the archive private, it offers on-disk encryption for sensitive files, and it organizes stored files on disk using metadata-driven filename templates. Before recognition, an image cleanup step removes speckles and corrects skew to improve OCR accuracy. It also provides tools for exporting archived documents to local directories for external backups and allows the user interface to be customized with custom styles and scripts. The application is packaged as a container image so that installation is consistent across environments.
Paperless is a comprehensive, self-hosted document management system that provides the full suite of required features, including OCR, automated ingestion, and full-text search for your digital archive.
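The metadata-driven filename templates mentioned for Paperless can be illustrated with a minimal sketch. It is not Paperless's own code or template syntax: the function name `render_filename`, the `{year}/{correspondent}/{title}` placeholders, and the sanitizing rules are all assumptions chosen for illustration, using only the Python standard library.

```python
import re
from pathlib import PurePosixPath


def render_filename(template: str, metadata: dict[str, str],
                    suffix: str = ".pdf") -> PurePosixPath:
    """Fill a path template from document metadata (hypothetical helper)."""

    def clean(value: str) -> str:
        # Collapse runs of characters that are awkward in file names into a
        # single dash, then trim leading/trailing dashes and dots so that a
        # value such as ".." cannot become a path segment of its own.
        cleaned = re.sub(r"[^\w.-]+", "-", value.strip()).strip("-.")
        return cleaned or "none"

    safe = {key: clean(str(value)) for key, value in metadata.items()}
    return PurePosixPath(template.format(**safe) + suffix)


if __name__ == "__main__":
    path = render_filename(
        "{year}/{correspondent}/{title}",
        {"year": "2024", "correspondent": "ACME Corp.", "title": "Invoice #42"},
    )
    print(path)  # 2024/ACME-Corp/Invoice-42.pdf
```

Sanitizing each field before substitution keeps the directory layout predictable even when metadata contains spaces or punctuation.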
Papra is a self-hosted document management system designed for digital archiving, organization, and retrieval. It serves as a centralized, security-focused platform for storing files: the archive is encrypted with AES-256-GCM, and documents and metadata can be managed programmatically through a REST API, an SDK, and command-line tools. An automated ingestion engine imports files via email forwarding, monitored folders, and webhook listeners. OCR extracts text from images and scanned documents so that full-text search covers all archived content. Other capabilities include identity management via OAuth2, role-based partitioning into organizations for collaborative spaces, content-based deduplication, multiple storage backends, encryption key rotation, and metadata filtering. The software is delivered as a container and can be installed and orchestrated with Docker.
Papra is a comprehensive, self-hosted document management system that provides the requested OCR indexing, full-text search, automated ingestion, and secure storage in a containerized, web-accessible platform.
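Content-based deduplication, listed among Papra's capabilities, generally means identifying files by a hash of their bytes rather than by name. The sketch below shows that general idea with SHA-256 from the Python standard library; it is an assumption about the technique, not Papra's implementation, and the helper names `content_digest` and `find_duplicates` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def content_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: Iterable[Path]) -> list[tuple[Path, Path]]:
    """Return (duplicate, first_seen) pairs for files with identical content."""
    seen: dict[str, Path] = {}
    duplicates: list[tuple[Path, Path]] = []
    for path in paths:
        key = content_digest(path)
        if key in seen:
            duplicates.append((path, seen[key]))
        else:
            seen[key] = path
    return duplicates
```

Because the key is derived from content alone, the same scan uploaded twice under different file names is detected as a duplicate.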
Paperless-ngx is a self-hosted document management server designed to turn physical paperwork into a searchable, organized digital archive. It functions as a private platform for storing, indexing, and retrieving documents, giving users full control over their data on local infrastructure or private cloud servers. An automated workflow engine categorizes, tags, and routes incoming files using content analysis and metadata extraction. To stay responsive during resource-intensive tasks such as optical character recognition, it runs them on an asynchronous task queue. The platform also includes a dedicated search engine for fast retrieval across large archives and stores documents in a structured, portable directory hierarchy on disk. Beyond core storage, the project acts as an integration hub by exposing its functionality through a comprehensive REST API. This enables automated document workflows, event-driven ingestion from monitored directories, and connectivity with a wide range of community-developed mobile applications, desktop clients, and automation scripts.
Paperless-ngx is a comprehensive, self-hosted document management system that natively provides OCR, automated classification, full-text search, and a web-based interface for managing your digital archive.
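The asynchronous task queue mentioned for Paperless-ngx follows a common pattern: the caller enqueues slow work and gets a handle back immediately, while a worker completes the job in the background. The sketch below shows that pattern with a thread pool from the Python standard library. It is a stand-in for illustration only, not the project's actual queue; `TaskQueue` and `slow_ocr` are hypothetical names, and the sleep simulates OCR.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def slow_ocr(document_name: str) -> str:
    """Hypothetical stand-in for an OCR pass; sleeps to simulate heavy work."""
    time.sleep(0.1)
    return f"text extracted from {document_name}"


class TaskQueue:
    """Minimal background queue: enqueue() returns without waiting for OCR."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def enqueue(self, document_name: str) -> Future:
        # submit() schedules the call on a worker thread and returns a Future
        # right away, so the caller (for example a web request) is not blocked.
        return self._pool.submit(slow_ocr, document_name)

    def shutdown(self) -> None:
        # Wait for queued jobs to finish before releasing the worker threads.
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    queue = TaskQueue()
    pending = [queue.enqueue(name) for name in ("scan-1.pdf", "scan-2.pdf")]
    for future in pending:
        print(future.result())  # blocks only when the result is needed
    queue.shutdown()
```

A production deployment would typically use a persistent queue with separate worker processes so that jobs survive restarts; the thread pool here only demonstrates the non-blocking hand-off.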
Stirling-PDF is a self-hosted document processing suite designed for secure, private file management. It functions as a transformation engine that executes operations such as merging, splitting, converting, and redacting documents directly on the host machine. The platform provides both a browser-based interface for interactive editing and an API-first architecture that allows document workflows to be automated through standard HTTP requests. The project distinguishes itself through its focus on private, infrastructure-agnostic deployment and granular security. It supports role-based access control and stateless session authentication, so sensitive operations stay protected within a user-controlled environment. A unified interface for sequential file transformations lets users chain multiple processing tasks into a single automated pipeline while retaining control over document integrity and security. Document manipulation capabilities include optical character recognition, digital signature validation, and advanced layout operations such as booklet imposition and page reorganization. It can be deployed in containers, on bare metal, or as a native desktop installation. Configuration is managed through environment variables, YAML files, or the web interface, allowing for consistent behavior across diverse infrastructure setups.
Stirling-PDF is a powerful PDF manipulation and transformation toolkit rather than a document management system: it lacks the indexing, metadata management, and long-term storage features required to organize a library of documents.