ArchiveBox

ArchiveBox is a self-hosted archiving tool designed for personal digital preservation and research data management. It functions as an automated web preservation engine that monitors URL inputs from bookmarks, browser history, or manual entries to capture and store permanent, offline copies of web content. By utilizing headless browser automation, the system renders dynamic web pages to ensure that captured snapshots, PDFs, and media assets remain accurate and accessible even if the original source disappears.

The project distinguishes itself through a modular extractor pipeline and a task-queue-based processing model, which allow it to handle long-running ingestion jobs reliably and at scale. It organizes all captured data into a predictable, file-system-based directory structure, ensuring that archives remain portable and accessible without the need for a dedicated database engine. This architecture supports the generation of static, self-contained archives that can be hosted on any standard web server.
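The extractor pipeline and file-system layout described above can be illustrated with a minimal sketch. It is not ArchiveBox's actual code: the extractor functions, the `archive/<snapshot_id>/` layout, and the `index.json` file name are all hypothetical, chosen only to show independent extractors writing into one predictable directory with no database involved.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Hypothetical extractor signature: takes a URL and an output directory,
# writes one file there, and returns the name of the file it wrote.
Extractor = Callable[[str, Path], str]


def save_title(url: str, out_dir: Path) -> str:
    # Stand-in for a real extractor; no network access is performed.
    (out_dir / "title.txt").write_text(f"Title placeholder for {url}\n")
    return "title.txt"


def save_html(url: str, out_dir: Path) -> str:
    # Stand-in for a real extractor; writes a placeholder HTML file.
    (out_dir / "page.html").write_text(f"<html><!-- placeholder for {url} --></html>\n")
    return "page.html"


# Hypothetical registry: adding an output format means adding one entry here.
EXTRACTORS: Dict[str, Extractor] = {"title": save_title, "html": save_html}


def archive_url(url: str, root: Path, snapshot_id: str) -> Path:
    """Run every extractor independently and record the results in index.json."""
    out_dir = root / "archive" / snapshot_id
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, extractor in EXTRACTORS.items():
        try:
            results[name] = {"status": "succeeded", "output": extractor(url, out_dir)}
        except Exception as exc:  # one failing extractor must not stop the others
            results[name] = {"status": "failed", "error": str(exc)}
    index = {"url": url, "results": results}
    (out_dir / "index.json").write_text(json.dumps(index, indent=2))
    return out_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snapshot_dir = archive_url("https://example.com", Path(tmp), "0001")
        print(sorted(p.name for p in snapshot_dir.iterdir()))
```

Because every result is a plain file next to a JSON index, the snapshot directory in this sketch can be copied or served by any static web server as-is.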

To maintain high fidelity across diverse web environments, the system includes configuration-driven dependency management that coordinates the necessary browser binaries and command-line tools. The platform provides a comprehensive suite of command-line interfaces, configuration options, and core modules to support operational management and integration. Detailed documentation is available to guide users through installation, dependency maintenance, and the security considerations of managing archived web content.
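As a rough illustration of what a dependency check involves, the sketch below looks up external tools on the system `PATH` using only the Python standard library. The list of binary names is an assumption made for the example, and the helper functions are hypothetical; ArchiveBox's own dependency management is configuration-driven and is not reproduced here.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Hypothetical list of tools; a real setup would read these from its configuration.
REQUIRED_BINARIES = ["wget", "curl", "git"]


def find_binaries(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each binary name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_binaries(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the sorted names of binaries that could not be located."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_binaries(REQUIRED_BINARIES)
    for name, path in found.items():
        print(f"{name}: {path or 'NOT FOUND'}")
    missing = missing_binaries(found)
    if missing:
        print("Missing dependencies:", ", ".join(missing))
```

Running the check before any archiving starts lets a tool of this kind skip or disable the extractors whose binaries are unavailable instead of failing midway through a job.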

Features

Web Content Archivers - Collects URLs from browser history, bookmarks, and manual inputs to trigger the automated process of capturing and saving web pages for future offline viewing and research.
Browser Automation Tools - Uses automated browser instances to render dynamic web pages and capture visual snapshots or media assets for offline storage.
Digital Preservation Tools - Provides tools for creating permanent, offline archives of web content to prevent data loss.
Digital Preservation Tools - Creates multiple versions of web pages, including screenshots, PDFs, and media files, to ensure that content remains readable and accessible for long-term digital preservation and reference.
Static Site Generators - Generates self-contained, static web archives that are deployable on any standard web server without backend dependencies.
Research Data Management - Provides structured storage and organization for diverse web assets like PDFs, media files, and screenshots to maintain a coherent record of reference materials.
Browser Automation Orchestrators - A management layer that coordinates headless browser engines and command-line tools to render and extract complex web content for archival.
Digital Archiving - A self-hosted Wayback Machine for archiving sites from various sources.
Collaboration And Storage - Self-hosted tool for archiving web pages and media.
Notes and Productivity - Self-hosted web archiving.
Data Processing Pipelines - Executes a series of independent plugins to generate multiple archive formats like PDFs, screenshots, and raw HTML from a single URL.
Task Queues - Manages long-running archiving jobs by distributing ingestion tasks across background workers to ensure reliable and scalable content capture.
Portable Data Formats - A structured file storage format that keeps archived web content and metadata accessible without requiring specialized software or active databases.
Command Line Interfaces - Provides a suite of command-line subcommands for managing archiving tasks and system operations.
Dependency Managers - Coordinates external command-line tools and browser binaries to ensure the environment is correctly prepared for diverse web archiving requirements.
API Documentation - Provides comprehensive technical reference documentation for interacting with the software's programmatic interface.
Documentation Guides - Provides a comprehensive configuration reference for managing web archiving workflows and system settings.
Technical Documentation - Provides comprehensive reference documentation for core system modules and API interfaces.
Web Security Analysis - Documents the security risks of viewing archived JavaScript.
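The Task Queues entry above describes distributing ingestion jobs across background workers. A minimal sketch of that general pattern, using only the Python standard library, might look like the following. The `run_jobs` and `fake_archive` names are hypothetical, the handler does no real archiving, and nothing here reflects ArchiveBox's actual queue implementation.

```python
import queue
import threading
from typing import Callable, List

_STOP = object()  # sentinel that tells a worker to exit


def run_jobs(urls: List[str], handler: Callable[[str], str], workers: int = 2) -> List[str]:
    """Process URLs on background worker threads and collect their results."""
    jobs: queue.Queue = queue.Queue()
    results: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = jobs.get()
            if item is _STOP:
                jobs.task_done()
                return
            try:
                outcome = handler(item)
            except Exception as exc:  # a failed job is recorded, not fatal
                outcome = f"failed: {item} ({exc})"
            with lock:
                results.append(outcome)
            jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(_STOP)  # one sentinel per worker
    jobs.join()
    for thread in threads:
        thread.join()
    return results


def fake_archive(url: str) -> str:
    # Hypothetical stand-in for a long-running archiving job.
    return f"archived: {url}"


if __name__ == "__main__":
    for line in sorted(run_jobs(["https://example.com/a", "https://example.com/b"], fake_archive)):
        print(line)
```

Results arrive in completion order rather than submission order, which is why the usage example sorts them before printing.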

Name: archivebox/archivebox
Author: ArchiveBox

26,876 stars | 1,483 forks | Python | MIT | 21 views | archivebox.io

Open-source alternatives to ArchiveBox

Similar open-source projects, ranked by how many features they share with ArchiveBox.

awesome-selfhosted/awesome-selfhosted
299,516 stars
This project is a community-curated directory of open-source software designed for deployment in private server environments and home labs. It serves as a comprehensive resource for discovering independent, self-hosted alternatives to mainstream cloud services, enabling users to maintain full data ownership and control over their digital infrastructure. The directory is structured through a hierarchical taxonomy that organizes a vast collection of applications into logical categories, ranging from media management and data analytics to private communication and team productivity tools. It dis…
awesome, awesome-list, cloud

squidfunk/mkdocs-material
26,949 stars
This project is a comprehensive documentation site framework and static site generator theme designed to transform markdown files into professional, responsive websites. It functions as a technical content platform that supports complex documentation projects, including multi-project management, blog workflows, and advanced content formatting. By processing source files through an extensible pipeline, it generates self-contained HTML sites that can be hosted on any web server without a database. What distinguishes this framework is its focus on developer experience and highly configurable bui…
Python, documentation, framework, material-design

pirate/ArchiveBox
27,721 stars
ArchiveBox is a self-hosted web archiving system designed to capture and preserve permanent static copies of webpages, media, and PDFs on personal infrastructure. It functions as a digital content curator and personal web archive manager, allowing users to import URLs from bookmarks, RSS feeds, and browser history to create a centralized, searchable knowledge base. The project is distinguished by its ability to archive private, paywalled, or login-protected content using browser cookies and authenticated session persistence. It ensures long-term availability by saving pages in multiple concur…
Python

karakeep-app/karakeep
26,248 stars
Karakeep is a self-hosted, open-source platform designed for personal knowledge management and web content archiving. It functions as a centralized repository where users can capture, organize, and preserve bookmarks, notes, and media files, ensuring long-term access to digital information even if original sources are removed or modified. The system distinguishes itself through its automated content processing and security-focused architecture. It utilizes headless browser crawling and optical character recognition to ingest and index web content, while a modular artificial intelligence pipel…
TypeScript, bookmark-manager, bookmarks, bookmarks-manager

Frequently asked questions

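What does archivebox/archivebox do?

ArchiveBox is a self-hosted archiving tool that captures and stores permanent, offline copies of web content from URLs supplied through bookmarks, browser history, or manual entries.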

What are the main features of archivebox/archivebox?

The main features of archivebox/archivebox are: Web Content Archivers, Browser Automation Tools, Digital Preservation Tools, Static Site Generators, Research Data Management, Browser Automation Orchestrators, Digital Archiving, Collaboration And Storage.

What are some open-source alternatives to archivebox/archivebox?

Open-source alternatives to archivebox/archivebox include:
awesome-selfhosted/awesome-selfhosted - This project is a community-curated directory of open-source software designed for deployment in private server…
squidfunk/mkdocs-material - This project is a comprehensive documentation site framework and static site generator theme designed to transform…
pirate/archivebox - ArchiveBox is a self-hosted web archiving system designed to capture and preserve permanent static copies of webpages,…
karakeep-app/karakeep - Karakeep is a self-hosted, open-source platform designed for personal knowledge management and web content archiving.…
payloadcms/payload - Payload is a headless content management system and application framework that uses a code-first approach to define…
datawhalechina/pumpkin-book - Pumpkin-book is an open-source educational textbook that provides annotated study materials and mathematical…
