30 open-source projects similar to grangier/python-goose, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Python Goose alternative.
Defuddle is a command line web parser and content extractor designed to isolate the primary article body from web pages and convert the result into standardized markdown. It functions as a content cleaner that removes layout clutter, such as sidebars and headers, to retrieve the main text and associated metadata. The tool provides a terminal interface that processes content from remote URLs, local files, or piped HTML streams. It supports custom content targeting, allowing users to specify CSS selectors to manually define the main content area when automatic detection is insufficient. The sy
Wechatsync is a multi-platform content synchronizer and cross-platform publishing tool. It extracts articles from webpages and distributes them to multiple social media and blogging platforms simultaneously. The system utilizes a web content extractor with reader-mode logic to strip advertisements and navigation elements from source pages. The project employs a markdown content pipeline that converts extracted web content into a standardized format for editing before redistribution. It features an automated media migrator that performs host-to-host image migration, downloading images from sou
Readability is a JavaScript library designed for web content extraction. It functions as a DOM parsing utility and article metadata extractor that isolates the primary text of a webpage by removing clutter such as advertisements and navigation bars. The library employs a heuristic-based content detector to predict if a webpage contains a parseable article before performing full extraction. It uses a parsing workflow to convert complex HTML documents into a simplified format, facilitating the implementation of distraction-free reader views. The tool covers several capability areas, including
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
Postlight Parser is a command-line tool that extracts the main article content from any web page URL, returning clean structured data including the title, author, date, excerpt, and lead image while stripping away ads and clutter. It uses a readability-based heuristic that scores HTML elements on text density and structural cues to identify the article body, and can accept pre-fetched HTML strings directly for parsing instead of fetching the URL. The tool distinguishes itself through a modular architecture that supports domain-specific extractor overrides, allowing custom JavaScript modules t
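The text-density idea in the entry above can be illustrated with a minimal Python sketch. It is not Postlight Parser's code (that project is JavaScript); the `DensityScorer` class, the `best_block` helper, and the scoring rule (non-link characters minus link characters) are assumptions chosen only to show how a density heuristic separates an article body from navigation.

```python
from html.parser import HTMLParser


class DensityScorer(HTMLParser):
    """Score block elements by non-link text length minus link text length."""

    BLOCKS = {"div", "article", "section"}

    def __init__(self):
        super().__init__()
        self.stack = []    # open blocks as [label, text_chars, link_chars]
        self.scores = []   # (score, label) for every closed block
        self.in_link = 0   # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            label = dict(attrs).get("id") or tag
            self.stack.append([label, 0, 0])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCKS and self.stack:
            label, text_chars, link_chars = self.stack.pop()
            self.scores.append((text_chars - link_chars, label))
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        length = len(data.strip())
        if length and self.stack:
            # Credit only the innermost open block.
            column = 2 if self.in_link else 1
            self.stack[-1][column] += length


def best_block(html):
    """Return the label (id or tag name) of the highest-scoring block, or None."""
    scorer = DensityScorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.scores)[1] if scorer.scores else None


SAMPLE = (
    '<div id="nav"><a href="/">Home</a><a href="/about">About us</a></div>'
    '<div id="story"><p>A long paragraph of article text that carries the '
    'actual content.</p><a href="/more">more</a></div>'
)
```

Calling `best_block(SAMPLE)` picks the `story` block, because the navigation block consists only of link text. A production extractor would weigh many more structural cues than this.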
This project is a markdown web clipper and local-first web archiver. It functions as a browser extension that extracts web page content and highlights, saving them as structured markdown files for personal knowledge management and long-term preservation. The utility acts as a template-based content extractor, transforming raw website data into formatted notes. It uses custom variables and processing filters to organize how captured information is structured before it is sent to a local directory.
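A template with variables and filters, as described in the entry above, might work roughly like the following Python sketch. The `{{name|filter}}` placeholder syntax, the filter names, and the `render` function are hypothetical and are not taken from the extension itself.

```python
import re

# Hypothetical filters; the names are illustrative only.
FILTERS = {
    "upper": str.upper,
    "lower": str.lower,
    "slug": lambda s: re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-"),
}


def render(template, variables):
    """Replace {{name|filter|...}} placeholders with filtered variable values."""

    def substitute(match):
        name, *filters = [part.strip() for part in match.group(1).split("|")]
        value = str(variables.get(name, ""))  # unknown variables render as empty
        for filter_name in filters:
            value = FILTERS[filter_name](value)
        return value

    return re.sub(r"\{\{(.*?)\}\}", substitute, template)


note = render(
    "# {{title}}\nsource: {{url}}\nfile: {{title|slug}}.md",
    {"title": "Hello, World!", "url": "https://example.com/post"},
)
```

Here the same `title` variable is used twice: once verbatim for the heading and once through the `slug` filter to build a file name.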
markdown-clipper is a browser extension that converts website content into markdown files for offline storage and personal knowledge bases. It functions as a content extractor and HTML to markdown converter that removes layout clutter to isolate primary text. The tool includes a specific integration for sending clipped web content directly into vaults and folders within the Obsidian note-taking application. It also supports batch processing to convert all open browser tabs into individual markdown files. The extension covers a broad range of extraction capabilities, including capturing selec
MechanicalSoup is a Python web automation library and scraping framework designed to simulate browser sessions and navigate websites without requiring JavaScript execution. It functions as an HTML parsing tool and HTTP session manager, allowing for the programmatic retrieval of page content and the automation of web interactions. The library distinguishes itself by combining session persistence with automated form interaction. It maps user data to HTML input fields and selection boxes for programmatic submission and maintains authenticated states by managing cookies and user-agent headers acr
ReadYou is a self-hosted reading application and RSS feed aggregator that centralizes content from multiple web sources. It functions as a full-text RSS reader, extracting the complete body text from web pages to provide a distraction-free reading experience. The application includes specialized accessibility and speed tools, such as a bionic reading mode that uses pattern-based text highlighting to guide the eye and a text-to-speech system for audio content consumption. The project covers comprehensive subscription management through OPML import and export, feed categorization, and keyword-
Qwen2-VL is a multimodal large language model and vision language model designed to process and reason across text, images, and video content. It functions as a visual reasoning engine and a visual agent framework, capable of interpreting visual data to perform object detection, document parsing, and spatial reasoning. The model is distinguished by its ability to act as a video understanding model, processing hour-long videos with second-level indexing and event recall. It further differentiates itself through a visual agent capability that interacts with software interfaces and robotic hardw
Autoscraper is an automatic web scraping library and pattern-based data extractor that learns extraction rules from sample data. It identifies and retrieves text, URLs, and HTML elements from web pages by analyzing sample values to replicate data patterns across different URLs. The system functions as a web scraping model manager, allowing users to save and reload learned rules to maintain consistent data extraction. It supports the export and import of scraping rules to a local file system to avoid repeating the training process for the same website. The library covers automated web data ex
fake-useragent is a tool for generating realistic browser identification strings and parsing existing agents into structured metadata. It functions as an HTTP user agent generator and a web scraping utility designed to rotate browser identities to mimic different devices during automated data collection. The project provides capabilities for random user-agent generation and filtering based on specific browsers, operating systems, device platforms, or minimum version numbers. It also includes a user agent parser to extract detailed metadata, such as browser versions and device brands, from age
OpenCLI is an AI browser automation framework designed to automate web navigation, data extraction, and repetitive browser tasks. It functions as a browser-based CLI generator that converts website interfaces into command-line interactions by controlling authenticated web browser sessions. The project features a web-to-CLI adapter platform for mapping web elements to programmatic command-line inputs and outputs. It includes a browser profile manager to organize and switch between isolated session profiles to maintain different user identities. The toolkit provides capabilities for web conten
Dango-Translator is an OCR translation system and multi-engine translation client designed to extract text from images or screens and replace it with translated content. It functions as an image text translator and real-time screen translator, utilizing optical character recognition to convert text between different languages automatically. The software distinguishes itself through coordinate-based image typesetting and a glossary manager. These tools allow for the replacement of original image content with translated text in the same area and the use of specialized dictionaries to ensure con
Pretext is a canvas-based text layout engine designed to calculate precise text dimensions and line breaks for custom rendering. It serves as a rich text measurement tool and a cross-browser typography normalizer, enabling the determination of pixel-perfect widths and heights for mixed inline content without relying on browser CSS. The project distinguishes itself through its ability to handle complex typography and dynamic layouts. It implements language-specific segmentation rules for CJK and Hangul scripts and corrects emoji width variances between DOM and canvas rendering. Additionally, i
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate cont
Markdownload is a browser extension that functions as a markdown web clipper, converting webpages and selected text into clean markdown files for offline storage and archiving. It operates as a content extractor that isolates the main document from the page while removing navigation elements and advertisements. The tool includes a template generator for injecting dynamic front-matter and metadata into documents via user-defined placeholders. It also serves as a local media downloader that saves remote images to the filesystem and updates links to reference those local files. Additionally, it
Anti-Anti-Spider is an automated web scraping toolkit and CAPTCHA bypass framework. It uses convolutional neural networks to recognize characters and digits in image-based security challenges, enabling programmatic access to protected web content. The project functions as an image recognition model trainer, providing a workflow to preprocess labeled image datasets and train custom neural networks. Users can configure model architectures and hyperparameters to align the recognition system with the visual style of specific target websites. The toolkit covers capabilities for image data preproc
pdfminer.six is a programmatic tool for extracting text, layout information, and metadata from PDF documents into machine-readable formats. It functions as a document parser that converts internal PDF objects and structures into accessible data objects for analysis. The project includes utilities for decrypting RC4 and AES encrypted files to enable content extraction. It also provides a layout analyzer to identify fonts, colors, and text locations to determine the organizational structure of pages. The system covers a broad range of extraction capabilities, including the retrieval of embedde
ddgs is a metasearch engine and web content extractor that provides a toolkit for programmatically retrieving search results from DuckDuckGo. It functions as a search API server and a Model Context Protocol server to integrate web search capabilities directly into large language model environments. The project distinguishes itself by aggregating text, image, news, and video results from multiple providers into a single interface. It includes a utility for fetching URLs and converting HTML content into markdown, plain text, or structured data. The system covers a broad range of search capabil
DotnetSpider is a .NET web crawling framework and C# data extraction tool for automated page discovery and large-scale retrieval of structured data from websites. It is a high-level scraping library whose extracted content can feed local databases or the analysis of online information within the .NET ecosystem. The system utilizes a pipeline-based data
Yi is a bilingual language model and foundation model designed for natural language processing, reasoning, and reading comprehension in both English and Chinese. It is built as a transformer-based architecture capable of general purpose text generation and conversational tasks. The model is distinguished by its ability to function as a long context system, processing and analyzing extended input sequences up to 200k tokens. It also supports quantized versions that use low-bit precision to reduce memory footprints, enabling execution on consumer-grade hardware. The project covers a broad rang
Shiori is a self-hosted bookmark manager and webpage archiving tool. Written in Go, it functions as a backend service that allows users to save, organize, and search for web links while maintaining a private collection of online resources. The system ensures content availability by creating offline copies of saved pages, preventing data loss if the original source is removed. It is distributed as a containerized application to provide consistent installation and deployment across different operating systems. The software provides a dual-interface access model, featuring both a web-based mana
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
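Domain-based rate limiting of the kind mentioned in the entry above can be sketched in a few lines of Python. This is not Colly's API (Colly is a Go library); the `DomainRateLimiter` class and its injectable `clock` and `sleep` parameters are assumptions made so the idea is easy to read and to test.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}  # domain -> earliest time the next request may start

    def wait(self, url):
        """Block until a request to this URL's domain is allowed; return seconds waited."""
        domain = urlparse(url).netloc
        now = self.clock()
        ready_at = self.next_allowed.get(domain, now)
        waited = max(0.0, ready_at - now)
        if waited:
            self.sleep(waited)
        # Reserve the next slot for this domain only.
        self.next_allowed[domain] = max(now, ready_at) + self.delay
        return waited
```

A crawler would call `limiter.wait(url)` immediately before each fetch. Requests to different domains do not delay one another, because each domain keeps its own timestamp.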
wewe-rss is an RSS feed generator and scheduled aggregator that converts social media accounts and web content into standardized RSS, Atom, or JSON feeds. It functions as a full-text content extractor, retrieving the complete body of articles rather than short summaries. The system operates as an API-protected feed gateway, utilizing token-based authorization and request rate limiting to restrict access and maintain stability. It supports subscription portability by exporting tracked sources into standardized OPML files. The project manages content aggregation through automated polling via c
Scraperr is a self-hosted web scraping and crawling platform designed for extracting structured data from websites using XPath selectors. It functions as a containerized system for managing scraping jobs through a queue and analyzing the resulting content using artificial intelligence. The project differentiates itself through its Kubernetes-native architecture, allowing for scalable deployment and management via package managers. It includes a crawling engine capable of domain-level spidering to discover linked pages and a data analyzer that uses artificial intelligence to query extracted we
Super Video Downloader is an integrated application designed for capturing, managing, and playing streaming media from web sources. It functions as a comprehensive utility that combines a web browser with media extraction tools, allowing users to save video and audio content directly to local storage for offline access. The application distinguishes itself by incorporating a headless browser engine that automates navigation and interacts with dynamic web content. It includes built-in privacy and security features, such as proxy-based traffic routing and encrypted domain name queries, to prote
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
UserScripts is a collection of JavaScript browser userscripts designed to modify website behavior and add custom functionality to web browsers. It serves as a multi-purpose toolset for web page content automation, web interface enhancement, and specialized web scraping and downloading. The project distinguishes itself through a wide range of specialized utilities, including a browser-based text transformer for character encoding and terminology mapping, and tools for bypassing content censorship. It provides advanced web scraping capabilities such as deciphering obfuscated download links, agg
snscrape is a Python-based social media web scraper and crawler designed to extract public posts, profiles, and hashtags from social networks without the use of official APIs. It functions as an archival tool and a utility for open-source intelligence data collection, allowing for the gathering of publicly available information to investigate trends and people. The tool facilitates social media data extraction for research and archival purposes, enabling the creation of historical records of conversations and user activity. It supports workflows for academic social analysis and the export of