What are the best open-source alternatives to Gerapy?

30 open-source projects similar to gerapy/gerapy, ranked by shared features. Top picks: adbar/trafilatura, alir3z4/html2text, alirezamika/autoscraper, antivanov/js-crawler, bda-research/node-crawler, binux/pyspider, blatzar/scraping-tutorial, brendonboshell/supercrawler, browser-use/browser-use, cgiffard/node-simplecrawler.

Is adbar/trafilatura a good alternative to Gerapy?

Trafilatura is a Python library and command-line tool for extracting clean, structured text and metadata from web pages. It downloads HTML content, identifies the main body of text, and strips away navigation, ads, and other boilerplate, returning the core article content along with fields like tit…

Is alir3z4/html2text a good alternative to Gerapy?

Convert HTML to Markdown-formatted text.

Is alirezamika/autoscraper a good alternative to Gerapy?

Autoscraper is an automatic web scraping library and pattern-based data extractor that learns extraction rules from sample data. It identifies and retrieves text, URLs, and HTML elements from web pages by analyzing sample values to replicate data patterns across different URLs. The system function…

Is bda-research/node-crawler a good alternative to Gerapy?

node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing thro…

Is binux/pyspider a good alternative to Gerapy?

PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping sc…

Is blatzar/scraping-tutorial a good alternative to Gerapy?

You want to start scraping? Well, this guide will teach you, and not some baby Selenium scraping. This guide only uses raw requests and has examples in both Python and Kotlin. Only basic programming knowledge in one of those languages is required to follow along.

Is brendonboshell/supercrawler a good alternative to Gerapy?

Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.

Is browser-use/browser-use a good alternative to Gerapy?

Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex,…

Is cgiffard/node-simplecrawler a good alternative to Gerapy?

simplecrawler is designed to provide a basic, flexible and robust API for crawling websites. It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.


Open-source alternatives to Gerapy

30 open-source projects similar to gerapy/gerapy, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Gerapy alternative.

adbar/trafilatura (5,319 stars)
Trafilatura is a Python library and command-line tool for extracting clean, structured text and metadata from web pages. It downloads HTML content, identifies the main body of text, and strips away navigation, ads, and other boilerplate, returning the core article content along with fields like title, author, date, and URL. The tool can also extract user comments and test whether a page contains extractable text, making it a general-purpose web text extraction library. What distinguishes Trafilatura from simpler extractors is its configurable extraction pipeline, which offers high-speed, high…
Language: Python. Topics: article-extractor, corpus-builder, corpus-tools.

Alir3z4/html2text (2,159 stars)
Convert HTML to Markdown-formatted text.
Language: Python. Topics: markdown, markdown-parser, python.

alirezamika/autoscraper (7,297 stars)
Autoscraper is an automatic web scraping library and pattern-based data extractor that learns extraction rules from sample data. It identifies and retrieves text, URLs, and HTML elements from web pages by analyzing sample values to replicate data patterns across different URLs. The system functions as a web scraping model manager, allowing users to save and reload learned rules to maintain consistent data extraction. It supports the export and import of scraping rules to a local file system to avoid repeating the training process for the same website. The library covers automated web data ex…
Language: Python.

antivanov/js-crawler (257 stars)
js-crawler
Language: TypeScript.

Open-source alternatives to Gerapy

adbar/trafilatura

Alir3z4/html2text

alirezamika/autoscraper

antivanov/js-crawler

bda-research/node-crawler

binux/pyspider

Blatzar/scraping-tutorial

brendonboshell/supercrawler

browser-use/browser-use

cgiffard/node-simplecrawler

chineking/cola

coleifer/micawber

D4Vinci/Scrapling

gawel/pyquery

hickford/MechanicalSoup

IonicaBizau/scrape-it

jmcarp/robobrowser

JustAnotherArchivist/snscrape

kurtmckee/feedparser

lapwinglabs/x-ray

lorien/awesome-web-scraping

lorien/grab

martinsbalodis/web-scraper-chrome-extension

matiasb/demiurge

MechanicalSoup/MechanicalSoup

mherrmann/helium

microsoft/playwright-python

miso-belica/sumy

MontFerret/ferret

my8100/scrapydweb