Open-source software for capturing, preserving, and managing full-page snapshots of websites on your own infrastructure.
ArchiveBox is a self-hosted archiving tool designed for personal digital preservation and research data management. It functions as an automated web preservation engine that monitors URL inputs from bookmarks, browser history, or manual entries to capture and store permanent, offline copies of web content. By utilizing headless browser automation, the system renders dynamic web pages to ensure that captured snapshots, PDFs, and media assets remain accurate and accessible even if the original source disappears. The project distinguishes itself through a modular extractor pipeline and a task-queue-based processing model, which allow it to handle long-running ingestion jobs reliably and at scale. It organizes all captured data into a predictable, file-system-based directory structure, ensuring that archives remain portable and accessible without the need for a dedicated database engine. This architecture supports the generation of static, self-contained archives that can be hosted on any standard web server. To maintain high fidelity across diverse web environments, the system includes configuration-driven dependency management that coordinates the necessary browser binaries and command-line tools. The platform provides a comprehensive suite of command-line interfaces, configuration options, and core modules to support operational management and integration. Detailed documentation is available to guide users through installation, dependency maintenance, and the security considerations of managing archived web content.
ArchiveBox is a self-hosted web archiving engine that captures full-fidelity snapshots of pages, supports WARC formats, and provides automated crawling, making it a comprehensive solution for personal digital preservation.
ArchiveBox is a self-hosted web archiving system designed to capture and preserve permanent static copies of webpages, media, and PDFs on personal infrastructure. It functions as a digital content curator and personal web archive manager, allowing users to import URLs from bookmarks, RSS feeds, and browser history to create a centralized, searchable knowledge base. The project is distinguished by its ability to archive private, paywalled, or login-protected content using browser cookies and authenticated session persistence. It ensures long-term availability by saving pages in multiple concurrent formats, including HTML, PDF, and PNG, and can automatically mirror these local snapshots to external preservation services. The system includes capabilities for multimedia asset extraction, full-text archive indexing, and scheduled content mirroring. Users can manage their collections through a web-based interface, a command-line interface, or a remote API, with options to export the entire collection as a standalone static HTML site for offline browsing.
ArchiveBox is a comprehensive, self-hostable web archiving system that captures full-fidelity snapshots in multiple formats, supports automated crawling, and provides full-text search for your saved content.
Omnivore is an open-source, self-hostable read-it-later application designed to centralize web articles, newsletters, and digital documents into a personal library. It functions as a comprehensive content archiver that captures web pages and stores them locally, ensuring permanent access and readability regardless of internet connectivity. The platform distinguishes itself through an event-sourced synchronization engine that maintains a consistent state across multiple devices by replaying user actions. It utilizes a headless web scraping service to extract clean text and metadata from raw web pages, providing a uniform reading experience. Users can manage their collections through a research-oriented workflow that supports highlighting passages and attaching personal notes to saved content. The application provides a full suite of content management capabilities, including offline reading, cross-device progress synchronization, and structured data persistence. It is distributed as an open-source project, allowing users to maintain full control over their personal data and reading history.
Omnivore is a self-hostable read-it-later platform that captures and preserves web content for offline access, though it focuses on text-based article extraction rather than full-fidelity visual snapshots or WARC-based archival.
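The event-sourced synchronization mentioned for Omnivore can be illustrated with a minimal sketch. This is not Omnivore's actual implementation: the event types (`saved`, `progress`, `highlighted`, `deleted`), the `seq` ordering field, and the `Library` class are all hypothetical names chosen for illustration. The idea shown is only that each device rebuilds the same state by replaying the same ordered log of user actions.

```python
from dataclasses import dataclass, field


@dataclass
class Library:
    """Reading-list state rebuilt purely from the event log (hypothetical shape)."""
    items: dict = field(default_factory=dict)  # url -> {"progress", "highlights"}


def apply(state: Library, event: dict) -> Library:
    """Apply a single user action to the state."""
    kind, url = event["type"], event["url"]
    if kind == "saved":
        state.items.setdefault(url, {"progress": 0.0, "highlights": []})
    elif kind == "progress" and url in state.items:
        state.items[url]["progress"] = event["value"]
    elif kind == "highlighted" and url in state.items:
        state.items[url]["highlights"].append(event["text"])
    elif kind == "deleted":
        state.items.pop(url, None)
    return state


def replay(events: list) -> Library:
    """Rebuild state from scratch by replaying events in sequence order."""
    state = Library()
    for event in sorted(events, key=lambda e: e["seq"]):
        state = apply(state, event)
    return state


# Two devices receive the same events in different arrival orders.
log = [
    {"seq": 2, "type": "progress", "url": "https://example.com/a", "value": 0.4},
    {"seq": 1, "type": "saved", "url": "https://example.com/a"},
    {"seq": 3, "type": "highlighted", "url": "https://example.com/a", "text": "key passage"},
]
phone = replay(log)
laptop = replay(list(reversed(log)))
```

Because the state is derived only from the ordered log, both devices end up with identical libraries regardless of the order in which events arrived.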
Karakeep is a self-hosted, open-source platform designed for personal knowledge management and web content archiving. It functions as a centralized repository where users can capture, organize, and preserve bookmarks, notes, and media files, ensuring long-term access to digital information even if original sources are removed or modified. The system distinguishes itself through its automated content processing and security-focused architecture. It utilizes headless browser crawling and optical character recognition to ingest and index web content, while a modular artificial intelligence pipeline automatically generates summaries and metadata for saved items. To maintain privacy and security, the platform supports single sign-on authentication and includes robust network controls, such as proxy-based crawling and request forgery prevention, to protect internal infrastructure during automated tasks. Beyond core archival capabilities, the platform provides extensive tools for library maintenance and data portability. Users can manage their collections through a command-line interface, synchronize content across devices, and integrate external data sources like RSS feeds. The system also facilitates collaboration through shared collections and public link generation, while offering a comprehensive programmatic interface that allows external applications to interact with stored data via webhooks and authenticated requests. The application is designed for containerized deployment, providing a unified environment for managing services, database migrations, and external storage backends.
Karakeep is a self-hosted platform that captures and preserves web content using headless browser crawling, making it a capable tool for personal web archiving, although WARC format support is not explicitly mentioned.
SingleFile is a browser-based utility designed to preserve the visual state and functional integrity of web pages by capturing them as self-contained HTML files. It functions by traversing the document object model to embed external assets, such as images, stylesheets, and scripts, directly into a single document for reliable offline viewing. The tool distinguishes itself through its ability to handle complex, dynamic web content by executing custom scripts and managing cross-origin resource requests during the capture process. It utilizes isolated execution environments and shadow document fragments to ensure that annotations, highlights, and custom modifications remain intact and conflict-free within the archived file. Beyond basic archiving, the software supports automated workflows, including the scheduling of recurring page captures and the synchronization of saved files to remote cloud storage providers. These capabilities facilitate long-term content preservation and integration with personal knowledge management systems. The project is available as a browser extension and provides a command-line interface for automated web scraping and content management tasks.
SingleFile is a browser-based utility that captures web pages as self-contained HTML files, providing a reliable way to preserve content for offline access even though it does not natively support the WARC format.
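The asset-embedding step described for SingleFile can be sketched in a few lines. This is a simplified illustration, not SingleFile's code: it uses a regular expression instead of a real DOM traversal, handles only `<img src="...">` attributes, and takes a hypothetical `resources` dictionary in place of actual network fetches. It shows the core idea of replacing an external URL with a base64 `data:` URI so the HTML no longer depends on the remote asset.

```python
import base64
import re


def inline_images(html: str, resources: dict) -> str:
    """Replace each <img src="..."> URL found in `resources` with a data: URI.

    `resources` maps a URL to a (mime_type, raw_bytes) pair; it stands in for
    the fetches a real capture tool would perform.
    """
    def to_data_uri(match: re.Match) -> str:
        prefix, url, suffix = match.group(1), match.group(2), match.group(3)
        if url not in resources:
            return match.group(0)  # leave unknown assets untouched
        mime, payload = resources[url]
        encoded = base64.b64encode(payload).decode("ascii")
        return f"{prefix}data:{mime};base64,{encoded}{suffix}"

    # Capture everything up to the src value, the value itself, and the closing quote.
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', to_data_uri, html)


page = '<p>Logo: <img alt="logo" src="https://example.com/logo.png"></p>'
assets = {"https://example.com/logo.png": ("image/png", b"\x89PNG fake bytes")}
single_file = inline_images(page, assets)
```

A production tool would apply the same substitution to stylesheets, fonts, and scripts, and would parse the document properly rather than pattern-match it.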
Linkwarden is a self-hosted bookmark manager and web archiving platform designed to preserve permanent copies of online content. It functions as a centralized repository where users can capture, store, and organize web pages to ensure they remain accessible even if the original source is removed. The platform distinguishes itself through its focus on collaborative knowledge management and multi-platform capture. It enables teams to curate shared collections, apply custom tags, and annotate saved resources within a unified workspace. Users can integrate the service into their daily workflows via browser extensions and mobile device sharing, allowing for the direct archiving of links from various environments. The system provides a comprehensive suite of organization and administrative tools, including folder-based grouping, role-based access control, and programmatic management through a secure API. It supports scalable storage and user seat management, ensuring that both individual researchers and teams can maintain structured, searchable libraries of web-based information.
Linkwarden is a self-hosted bookmark manager that includes web archiving capabilities to preserve permanent copies of online content, making it a suitable tool for capturing and organizing web pages for offline access.
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live web research, interact with pages, and execute multi-step navigation tasks. It supports distributed crawling infrastructure, enabling users to scale data collection across multiple nodes while managing concurrency and long-running jobs through asynchronous queueing. The system also integrates with agentic frameworks via standardized protocols, allowing for seamless connection to AI-powered clients and automated pipelines. Beyond its core extraction capabilities, the project provides a suite of developer tools for site mapping, batch scraping, and web searching. It includes features for stateful session persistence, webhook-based notifications, and configurable crawl depth, allowing for granular control over how information is retrieved and processed. The project offers comprehensive API documentation and SDKs to facilitate integration into backend services and local development environments. Users can deploy the crawling infrastructure within their own private networks or utilize managed cloud services.
Firecrawl is a web scraping and data extraction platform designed for LLM ingestion rather than a web archiving tool, as it focuses on converting page content to markdown rather than preserving full-fidelity snapshots in standard archival formats like WARC.
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The system covers a broad range of capabilities, including traffic management with independent rate limiting and automatic request retries. It provides content processing tools for XML and HTML parsing via CSS selectors, as well as binary file downloading and character encoding normalization to standard UTF-8.
node-crawler is a programmable web scraping library for developers building custom data extraction tools, not a self-contained web archiving application with snapshot storage, WARC support, and search features.
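The combination of task deduplication and rate limiting described for node-crawler can be sketched as a small queue. node-crawler itself is a Node.js library; the Python below is only a language-neutral illustration, and the `CrawlQueue` class, its `min_interval` parameter, and the `fetch` callback are hypothetical names, not part of that library's API.

```python
import time
from collections import deque


class CrawlQueue:
    """FIFO URL queue that drops duplicates and spaces out requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # minimum seconds between request starts
        self._clock = clock
        self._sleep = sleep
        self._pending = deque()
        self._seen = set()
        self._last_start = None

    def add(self, url: str) -> bool:
        """Queue a URL unless it was already queued; return True if accepted."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def run(self, fetch) -> list:
        """Call fetch(url) for each queued URL, at most one per min_interval."""
        results = []
        while self._pending:
            if self._last_start is not None:
                wait = self.min_interval - (self._clock() - self._last_start)
                if wait > 0:
                    self._sleep(wait)
            self._last_start = self._clock()
            results.append(fetch(self._pending.popleft()))
        return results


queue = CrawlQueue(min_interval=0.01)
for url in ["https://example.com/", "https://example.com/about", "https://example.com/"]:
    queue.add(url)
visited = queue.run(lambda url: url)  # stand-in for a real HTTP request
```

The clock and sleep functions are injectable so the pacing logic can be exercised without real delays.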
Katana is a web crawler and spider designed for security reconnaissance and web application mapping. It functions as a utility for identifying endpoints, forms, and API structures across web targets by combining standard HTTP request traversal with headless browser automation to render dynamic, JavaScript-heavy content. The tool distinguishes itself through its ability to maintain authenticated sessions and handle complex web interactions, such as automated form submission and captcha resolution. It provides granular control over the discovery process, allowing users to define specific crawl scopes, throttle request rates, and apply custom filtering logic to refine datasets based on response attributes or status codes. Beyond basic navigation, the project supports advanced data extraction and monitoring capabilities. It can classify page content, store raw request and response pairs for auditing, and use pattern-based matching to isolate specific information from web traffic. The software is distributed as a single, statically compiled binary to ensure portability across different environments.
Katana is a security-focused web crawler designed for reconnaissance and endpoint discovery rather than a tool for preserving web pages in standard archival formats like WARC.
Markdownload is a browser extension that functions as a markdown web clipper, converting webpages and selected text into clean markdown files for offline storage and archiving. It operates as a content extractor that isolates the main document from the page while removing navigation elements and advertisements. The tool includes a template generator for injecting dynamic front-matter and metadata into documents via user-defined placeholders. It also serves as a local media downloader that saves remote images to the filesystem and updates links to reference those local files. Additionally, it acts as an integration tool to transfer captured web data and metadata directly into Obsidian vaults using custom URI schemes. The extension supports capturing content from all open browser tabs simultaneously and clipping specific highlighted text. Users can customize markdown styling for links and images, organize downloaded files into specific subfolders, and export media as formatted embeds or hyperlinks to the system clipboard.
Markdownload is a browser-based markdown clipper designed for personal knowledge management rather than a self-hostable web archiving server that preserves full-fidelity page snapshots in WARC format.
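The front-matter templating described for Markdownload can be illustrated with a short sketch. The placeholder syntax here is Python's own `string.Template` (`$name`), not the extension's, and the field names (`title`, `source`, `clipped`) and the `render_clip` function are assumptions made for the example.

```python
from datetime import date
from string import Template

# User-defined front-matter template with named placeholders (hypothetical fields).
FRONT_MATTER = Template(
    "---\n"
    "title: $title\n"
    "source: $url\n"
    "clipped: $clipped\n"
    "---\n\n"
)


def render_clip(title: str, url: str, body_markdown: str, clipped: date) -> str:
    """Fill the template with page metadata and prepend it to the clipped body."""
    header = FRONT_MATTER.substitute(title=title, url=url, clipped=clipped.isoformat())
    return header + body_markdown


note = render_clip(
    title="Example Article",
    url="https://example.com/article",
    body_markdown="# Example Article\n\nFirst paragraph.\n",
    clipped=date(2024, 1, 15),
)
```

A real clipper would also need to quote or escape metadata values that are not safe as plain YAML scalars, such as titles containing colons.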