What are the best open-source alternatives to Python Goose?

30 open-source projects similar to grangier/python-goose, ranked by shared features. Top picks: kepano/defuddle, wechatsync/wechatsync, mozilla/readability, esbatmop/mnbvc, postlight/parser, obsidianmd/obsidian-clipper, deathau/markdown-clipper, mechanicalsoup/mechanicalsoup, readyouapp/readyou, qwenlm/qwen2-vl.

Is kepano/defuddle a good alternative to Python Goose?

Defuddle is a command-line web parser and content extractor designed to isolate the primary article body from web pages and convert the result into standardized markdown. It functions as a content cleaner that removes layout clutter, such as sidebars and headers, to retrieve the main text and assoc…

Is wechatsync/wechatsync a good alternative to Python Goose?

Wechatsync is a multi-platform content synchronizer and cross-platform publishing tool. It extracts articles from webpages and distributes them to multiple social media and blogging platforms simultaneously. The system utilizes a web content extractor with reader-mode logic to strip advertisements…

Is mozilla/readability a good alternative to Python Goose?

Readability is a JavaScript library designed for web content extraction. It functions as a DOM parsing utility and article metadata extractor that isolates the primary text of a webpage by removing clutter such as advertisements and navigation bars. The library employs a heuristic-based content de…

Is esbatmop/mnbvc a good alternative to Python Goose?

MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms,…

Is postlight/parser a good alternative to Python Goose?

Postlight Parser is a command-line tool that extracts the main article content from any web page URL, returning clean structured data including the title, author, date, excerpt, and lead image while stripping away ads and clutter. It uses a readability-based heuristic that scores HTML elements on t…

Is obsidianmd/obsidian-clipper a good alternative to Python Goose?

This project is a markdown web clipper and local-first web archiver. It functions as a browser extension that extracts web page content and highlights, saving them as structured markdown files for personal knowledge management and long-term preservation. The utility acts as a template-based conten…

Is deathau/markdown-clipper a good alternative to Python Goose?

markdown-clipper is a browser extension that converts website content into markdown files for offline storage and personal knowledge bases. It functions as a content extractor and HTML to markdown converter that removes layout clutter to isolate primary text. The tool includes a specific integrati…

Is mechanicalsoup/mechanicalsoup a good alternative to Python Goose?

MechanicalSoup is a Python web automation library and scraping framework designed to simulate browser sessions and navigate websites without requiring JavaScript execution. It functions as an HTML parsing tool and HTTP session manager, allowing for the programmatic retrieval of page content and the…

Is readyouapp/readyou a good alternative to Python Goose?

ReadYou is a self-hosted reading application and RSS feed aggregator that centralizes content from multiple web sources. It functions as a full-text RSS reader, extracting the complete body text from web pages to provide a distraction-free reading experience. The application includes specialized a…

Is qwenlm/qwen2-vl a good alternative to Python Goose?

Qwen2-VL is a multimodal large language model and vision language model designed to process and reason across text, images, and video content. It functions as a visual reasoning engine and a visual agent framework, capable of interpreting visual data to perform object detection, document parsing, a…


Open-source alternatives to Python Goose

30 open-source projects similar to grangier/python-goose, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Python Goose alternative.

kepano/defuddle
3,189 stars
Defuddle is a command-line web parser and content extractor designed to isolate the primary article body from web pages and convert the result into standardized markdown. It functions as a content cleaner that removes layout clutter, such as sidebars and headers, to retrieve the main text and associated metadata. The tool provides a terminal interface that processes content from remote URLs, local files, or piped HTML streams. It supports custom content targeting, allowing users to specify CSS selectors to manually define the main content area when automatic detection is insufficient. …
TypeScript
wechatsync/Wechatsync
4,866 stars
Wechatsync is a multi-platform content synchronizer and cross-platform publishing tool. It extracts articles from webpages and distributes them to multiple social media and blogging platforms simultaneously. The system utilizes a web content extractor with reader-mode logic to strip advertisements and navigation elements from source pages. The project employs a markdown content pipeline that converts extracted web content into a standardized format for editing before redistribution. It features an automated media migrator that performs host-to-host image migration, downloading images from …
TypeScript · blog · chrome · chrome-extension
mozilla/readability
11,298 stars
Readability is a JavaScript library designed for web content extraction. It functions as a DOM parsing utility and article metadata extractor that isolates the primary text of a webpage by removing clutter such as advertisements and navigation bars. The library employs a heuristic-based content detector to predict if a webpage contains a parseable article before performing full extraction. It uses a parsing workflow to convert complex HTML documents into a simplified format, facilitating the implementation of distraction-free reader views. The tool covers several capability areas, including …
JavaScript

Open-source alternatives to Python Goose

kepano/defuddle

wechatsync/Wechatsync

mozilla/readability

esbatmop/MNBVC

postlight/parser

obsidianmd/obsidian-clipper

deathau/markdown-clipper

MechanicalSoup/MechanicalSoup

ReadYouApp/ReadYou

QwenLM/Qwen2-VL

alirezamika/autoscraper

fake-useragent/fake-useragent

jackwener/OpenCLI

PantsuDango/Dango-Translator

chenglou/pretext

lapwinglabs/x-ray

deathau/markdownload

luyishisi/Anti-Anti-Spider

pdfminer/pdfminer.six

deedy5/ddgs

dotnetcore/DotnetSpider

01-ai/Yi

go-shiori/shiori

asciimoo/colly

cooderl/wewe-rss

jaypyles/Scraperr

alexch33/super-video-downloader

facebookresearch/fairseq

hoothin/UserScripts

JustAnotherArchivist/snscrape