30 open-source projects similar to mozilla/readability, ranked by the number of features they share with it. Compare stars, activity, and what each one does to find the best Readability alternative.
Defuddle is a command line web parser and content extractor designed to isolate the primary article body from web pages and convert the result into standardized markdown. It functions as a content cleaner that removes layout clutter, such as sidebars and headers, to retrieve the main text and associated metadata. The tool provides a terminal interface that processes content from remote URLs, local files, or piped HTML streams. It supports custom content targeting, allowing users to specify CSS selectors to manually define the main content area when automatic detection is insufficient. The sy
Trafilatura is a Python library and command-line tool for extracting clean, structured text and metadata from web pages. It downloads HTML content, identifies the main body of text, and strips away navigation, ads, and other boilerplate, returning the core article content along with fields like title, author, date, and URL. The tool can also extract user comments and test whether a page contains extractable text, making it a general-purpose web text extraction library. What distinguishes Trafilatura from simpler extractors is its configurable extraction pipeline, which offers high-speed, high
Postlight Parser is a command-line tool that extracts the main article content from any web page URL, returning clean structured data including the title, author, date, excerpt, and lead image while stripping away ads and clutter. It uses a readability-based heuristic that scores HTML elements on text density and structural cues to identify the article body, and can accept pre-fetched HTML strings directly for parsing instead of fetching the URL. The tool distinguishes itself through a modular architecture that supports domain-specific extractor overrides, allowing custom JavaScript modules t
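The scoring idea described above (text density plus structural cues) can be sketched roughly as follows. This is an illustrative approximation, not Postlight Parser's actual code: the candidate tags, the weights in TAG_BONUS, and the names BlockCollector, score, and best_block are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical candidate tags and structural weights, chosen for illustration only.
CANDIDATE_TAGS = {"div", "article", "section", "nav", "aside"}
TAG_BONUS = {"article": 25, "nav": -25, "aside": -25}


class BlockCollector(HTMLParser):
    """Collects text length, link-text length, and comma count per candidate block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished candidate blocks
        self._stack = []       # candidate blocks that are currently open
        self._link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATE_TAGS:
            ident = dict(attrs).get("id") or tag
            self._stack.append(
                {"tag": tag, "id": ident, "text": 0, "link": 0, "commas": 0}
            )
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in CANDIDATE_TAGS and self._stack and self._stack[-1]["tag"] == tag:
            self.blocks.append(self._stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        block = self._stack[-1]  # credit only the innermost open block
        block["text"] += len(text)
        block["commas"] += text.count(",")
        if self._link_depth:
            block["link"] += len(text)


def score(block):
    """Reward long, comma-rich text; penalize blocks that are mostly link text."""
    if block["text"] == 0:
        return 0.0
    link_density = block["link"] / block["text"]
    base = block["text"] / 100 + block["commas"] + TAG_BONUS.get(block["tag"], 0)
    return base * (1 - link_density)


def best_block(html):
    """Return the id (or tag name) of the highest-scoring block, or None."""
    collector = BlockCollector()
    collector.feed(html)
    if not collector.blocks:
        return None
    return max(collector.blocks, key=score)["id"]
```

A navigation block made entirely of links has a link density of 1 and scores 0, so a block of ordinary prose wins even when it is short.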
ReadYou is a self-hosted reading application and RSS feed aggregator that centralizes content from multiple web sources. It functions as a full-text RSS reader, extracting the complete body text from web pages to provide a distraction-free reading experience. The application includes specialized accessibility and speed tools, such as a bionic reading mode that uses pattern-based text highlighting to guide the eye and a text-to-speech system for audio content consumption. The project covers comprehensive subscription management through OPML import and export, feed categorization, and keyword-
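One common way to implement the pattern-based highlighting behind a bionic reading mode is to emphasize the leading letters of each word. The sketch below assumes that approach and uses Markdown bold markers as the output format; the function name bionic and the 0.5 ratio are hypothetical and not taken from ReadYou.

```python
import math
import re


def bionic(text, ratio=0.5):
    """Wrap the leading part of every word in ** markers."""

    def emphasize(match):
        word = match.group(0)
        # Always emphasize at least one letter, rounding the cut point up.
        cut = max(1, math.ceil(len(word) * ratio))
        return f"**{word[:cut]}**{word[cut:]}"

    # [^\W\d_]+ matches runs of letters only (no digits or underscores).
    return re.sub(r"[^\W\d_]+", emphasize, text)
```

For example, bionic("Reading is fast") returns "**Read**ing **i**s **fa**st".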
python-goose is a Python library for web scraping and content extraction. It functions as an HTML boilerplate remover and article parser designed to isolate primary text and metadata from web pages by stripping away navigation, layout noise, and non-essential elements. The tool features multilingual processing capabilities, utilizing language-specific stop-word analyzers to identify and extract primary content across different languages. It also identifies and collects embedded media, including source URLs and embed codes for lead images and videos associated with an article. The library cov
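Stop-word analysis of this kind can be approximated by counting how many common function words each text block contains, since running prose tends to have many and menus have few. The sketch below is a simplified illustration under that assumption, not python-goose's implementation; the tiny STOP_WORDS lists and the function names are made up for the example.

```python
import re

# Deliberately tiny, hypothetical stop-word lists; real analyzers use far larger ones.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es"},
}


def stop_word_count(text, lang):
    """Count the words in text that appear in the stop-word list for lang."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for word in words if word in STOP_WORDS[lang])


def pick_content(paragraphs, lang):
    """Return the paragraph with the most stop words, a rough proxy for prose."""
    return max(paragraphs, key=lambda p: stop_word_count(p, lang))
```

Swapping the lang argument changes which list is consulted, which is how the same scoring logic can be reused across languages.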
Spider is a web-based platform designed for automated data extraction, providing a centralized framework to collect, process, and route structured information from websites. It functions as a comprehensive pipeline that manages the entire lifecycle of data gathering, from initial configuration to final storage in external databases or message queues. The platform distinguishes itself through a visual configuration interface that allows users to define extraction rules and manage scraping templates without writing custom code. It supports both static and dynamic content retrieval by integratin
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
This project is a specialized TikTok API scraper and data extractor. It functions as a proxy-based web scraper designed to collect user metadata, video posts, and trend feeds, while providing a webhook data pipeline to route scraped information to external URLs via HTTP requests. The tool includes a watermark-free video downloader that saves high-definition content to local storage. It employs cryptographic request signing for server authentication and utilizes session cookie authentication combined with proxy rotation to manage network traffic and avoid rate limits. Capabilities cover bulk
Min is a minimalist, privacy-focused web browser designed to limit data collection and remove interface clutter. It serves as an ad-blocking tool that prevents tracking scripts and advertisements from loading to improve page speed and protect user identity. The browser differentiates itself through organized tab management, where related tabs are grouped into named tasks to separate work streams. It features a tag-based bookmark manager that replaces traditional hierarchical folders with custom labels and provides a simplified reader view to strip away non-essential page elements for focused
markdown-clipper is a browser extension that converts website content into markdown files for offline storage and personal knowledge bases. It functions as a content extractor and HTML to markdown converter that removes layout clutter to isolate primary text. The tool includes a specific integration for sending clipped web content directly into vaults and folders within the Obsidian note-taking application. It also supports batch processing to convert all open browser tabs into individual markdown files. The extension covers a broad range of extraction capabilities, including capturing selec
Web Clipper is a browser extension that captures web content and saves it directly to a variety of note-taking and productivity platforms. It strips distracting page elements like ads and sidebars before clipping, ensuring only the core article, recipe, or product information is stored cleanly. The extension supports saving content to multiple destinations including Notion, OneNote, Obsidian, Joplin, Confluence, and GitHub, allowing users to send web clippings to their preferred workspace with a single action. It uses a plugin-based architecture where each platform is wrapped in an adapter th
Feedbin is an RSS feed aggregator that collects and organizes updates from websites, video channels, and playlists into a chronological list. It functions as a centralized content manager, providing tools for feed aggregation and the organization of web-based information. The service distinguishes itself by converting email newsletters into feed entries via unique email aliases and offering a dedicated podcast manager that tracks playback progress across devices. It also includes a full-text extractor to retrieve complete articles when source feeds only provide snippets and a system to track
Wechatsync is a multi-platform content synchronizer and cross-platform publishing tool. It extracts articles from webpages and distributes them to multiple social media and blogging platforms simultaneously. The system utilizes a web content extractor with reader-mode logic to strip advertisements and navigation elements from source pages. The project employs a markdown content pipeline that converts extracted web content into a standardized format for editing before redistribution. It features an automated media migrator that performs host-to-host image migration, downloading images from sou
Feeder is an RSS and Atom feed reader that aggregates content into a single interface. It functions as a full-text content extractor that removes website clutter to isolate the main body of articles, and a self-hosted feed synchronizer that maintains subscription lists and read statuses across devices via a private backend server. The application integrates AI services and external API keys to translate and generate concise summaries of long-form articles. It also features a text-to-speech reader that uses system engines with automatic language detection to convert written content into spoken
Summarize is a command line tool and multimodal content extractor designed to generate concise summaries from web pages, documents, and media files. It functions as an orchestrator that connects developer tools to various language model providers to process and condense information. The system provides specialized capabilities for audio and video processing, including transcription with speaker identification and the extraction of timestamped visual markers from video slides. It also includes a translation utility to convert generated summaries and extracted text into different target languag
This project is a Go library and command-line utility designed for the retrieval and local archival of remote video content. It provides a programmatic interface for fetching media streams, allowing users to extract metadata and download video files directly to local storage. The library distinguishes itself through its ability to resolve playback restrictions by performing algorithmic transformations on obfuscated authentication tokens. This signature decryption process enables the tool to bypass standard access limitations, while its interface-driven design allows for the selection of speci
NewsBlur is an RSS feed aggregator and social news reader that collects and organizes stories from feeds, newsletters, and websites into a single interface. It functions as a feed synchronization service that maintains read states and subscription data across multiple devices and third-party applications. The platform distinguishes itself with AI-powered content summarization to generate briefings and answer questions about articles, alongside a system for training content classifiers. These classifiers learn user preferences for authors and tags to automatically highlight preferred topics or
This project is a multi-purpose REST API utility collection and developer suite. It serves as a centralized service for real-time information aggregation, data transformation, and a wide array of programmatic tools. The service distinguishes itself by providing a broad range of specialized content delivery endpoints, from curated daily summaries and global trending rankings to randomized entertainment content like jokes and quotes. It also functions as a real-time aggregator for environmental and network data, including weather forecasts, currency exchange rates, and public IP lookups. The c
HaE is a network traffic analysis tool designed to extract, classify, and highlight specific data fragments within network messages and HTTP traffic. It functions as an HTTP data extractor and traffic content filter, utilizing a network metadata aggregator to centralize highlighted data fragments and annotations for analysis. The tool identifies high-value network packets by mapping classification results to visual color markers and employs a modular classification system to isolate data fragments from binary or text streams. It distinguishes the severity of matched data by piping extracted c
This project is an LLM knowledge base builder and personal knowledge management tool. It is a desktop application designed to transform diverse documents into a persistent, interlinked wiki through LLM analysis and incremental ingestion. The system distinguishes itself with a knowledge graph visualizer that uses community detection algorithms to map relationships between concepts and identify topical clusters. It features a hybrid retrieval system that combines keyword matching, vector embeddings, and graph relevance to locate information. The platform covers a wide range of capabilities inc
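A hybrid retrieval score of the kind described above could be formed as a weighted sum of a keyword score, an embedding similarity, and a precomputed graph relevance value. The sketch below assumes exactly that; the weights, the document fields (text, vec, graph), and the function names are hypothetical and not taken from the project.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Fraction of query words that also occur in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def hybrid_score(query, query_vec, doc, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of keyword, vector, and graph signals (weights are assumed)."""
    w_keyword, w_vector, w_graph = weights
    return (
        w_keyword * keyword_overlap(query, doc["text"])
        + w_vector * cosine(query_vec, doc["vec"])
        + w_graph * doc["graph"]  # assumed to be precomputed in the range [0, 1]
    )


def rank(query, query_vec, docs):
    """Return docs ordered from most to least relevant."""
    return sorted(
        docs, key=lambda d: hybrid_score(query, query_vec, d), reverse=True
    )
```

Keeping the three signals separate until the final sum makes it easy to tune the weights or drop a signal that is unavailable for a given document.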
iflow-cli is a command-line interface and suite of AI tools designed for software engineering, workflow orchestration, and multimodal data analysis. It functions as an LLM command line interface that enables users to execute AI workflows, analyze codebase structures, and interact with large language models directly from the terminal. The project features a plugin-based agent architecture that allows for the integration of specialized domain experts and custom instruction sets from an external marketplace. It distinguishes itself through a multimodal AI terminal capable of processing visual da
bilibili-api is a Bilibili API wrapper and content scraper designed for programmatically accessing video metadata, user profiles, and content data. It functions as an anti-bot crawler framework and a WebSocket live chat client for retrieving platform information and real-time interaction data. The project incorporates tools to bypass anti-crawling measures and rate limits through the use of proxies and TLS fingerprint spoofing. It also includes logic for mapping and converting various video and content identifiers to ensure consistent data retrieval across different endpoints. Its capability
WeChat Moments Screenshot Generator is a social media mockup tool designed to create simulated screenshots of social posts. It functions as a mobile OS UI simulator, allowing users to generate realistic images that mimic the visual appearance of social media activity. The tool features a metadata fetcher that retrieves titles and cover images from public links to automatically populate shared post previews. It includes capabilities for simulating organic engagement through randomized likes and comments, and uses pattern-based mapping to insert platform-specific emoticons into post content. T
Tridactyl is a Vim-like Firefox extension that provides a comprehensive keyboard-driven interface for browsing, tab management, and page interaction. It replaces traditional mouse-based navigation with Vim-style keybindings, an ex-mode command line, and a hint overlay system for selecting and interacting with page elements. The extension is built around a core infrastructure that includes a modal command parser, a keybinding configuration system, and a content-script command bridge for executing commands in page context. The extension distinguishes itself through its deep integration with Fir
weiboSpider is a Python web scraper and social media crawler designed to extract user profiles, posts, and engagement metrics from Sina Weibo. It functions as an automated data pipeline for academic research and trend analysis, collecting long-form text and multimedia content. The tool distinguishes itself through the use of browser session cookies to authenticate requests and access protected profiles. It implements randomized request pacing and global pauses to manage traffic and avoid platform rate limits, while supporting incremental crawling to capture only new content based on timestamp
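Randomized pacing with periodic longer pauses, combined with timestamp-based incremental crawling, can be sketched as below. The delay ranges, the pause_every interval, the created_at field, and both function names are assumptions for illustration and do not reflect weiboSpider's actual configuration or code.

```python
import random
from datetime import datetime


def next_delay(request_count, rng, base=(1.0, 3.0), pause_every=5, pause=(10.0, 20.0)):
    """Return how many seconds to wait before the next request.

    Every pause_every requests a longer "global pause" is drawn instead of
    the short random delay.
    """
    if request_count > 0 and request_count % pause_every == 0:
        return rng.uniform(*pause)
    return rng.uniform(*base)


def new_posts(posts, last_seen):
    """Keep only posts created after the most recent timestamp already stored."""
    return [post for post in posts if post["created_at"] > last_seen]
```

A crawler loop would pass the returned value to time.sleep between requests and persist the newest created_at value after each run so the next run can skip older posts.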
This project is a browser extension that integrates real-time web search results and page content into large language model prompts to provide updated context. It functions as a prompt template manager and web content extractor, allowing users to fetch live data from search engines to overcome knowledge cutoff dates. The extension enables deep research by performing comprehensive searches and providing original source citations. It augments search engines by displaying AI-generated answers alongside traditional search results through a custom interface overlay. The system includes capabiliti
This project is a curated library of Python code examples, educational resources, and programming tutorials. It functions as an educational repository designed to teach Python language fundamentals through practical implementation tasks, real-world exercises, and functional code snippets. The collection covers a diverse range of implementation examples, including the development of interactive websites and message boards using web frameworks. It also features scripts for audio speech processing, automated media processing for images, and the extraction of data from web content. Additional ca