30 open-source projects similar to gerapy/gerapy, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Gerapy alternative.
Trafilatura is a Python library and command-line tool for extracting clean, structured text and metadata from web pages. It downloads HTML content, identifies the main body of text, and strips away navigation, ads, and other boilerplate, returning the core article content along with fields like title, author, date, and URL. The tool can also extract user comments and test whether a page contains extractable text, making it a general-purpose web text extraction library. What distinguishes Trafilatura from simpler extractors is its configurable extraction pipeline, which offers high-speed, high
Convert HTML to Markdown-formatted text.
Autoscraper is an automatic web scraping library and pattern-based data extractor that learns extraction rules from sample data. It identifies and retrieves text, URLs, and HTML elements from web pages by analyzing sample values to replicate data patterns across different URLs. The system functions as a web scraping model manager, allowing users to save and reload learned rules to maintain consistent data extraction. It supports the export and import of scraping rules to a local file system to avoid repeating the training process for the same website. The library covers automated web data ex
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
You want to start scraping? Well, this guide will teach you, and not some baby Selenium scraping. This guide only uses raw requests and has examples in both Python and Kotlin. Only basic programming knowledge in one of those languages is required to follow along.
Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows without relying on brittle selectors. The system functions as a headless browser controller, providing a programmatic interface to manage browser instances and execute granular interactions. The project distinguishes itself through its ability to translate high-level intent into
simplecrawler is designed to provide a basic, flexible and robust API for crawling websites. It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.
A small library for extracting rich content from URLs.
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It functions as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager to interact with websites programmatically. The library enables headless web interaction by mimicking a real user session. It manages persistent state through cookie handling and automatic redirect following, allowing for programmatic website navigation and the simulation of complex browser interactions. Its capabilities cover automated form population and submission using CSS selectors
scrape-it is a Node.js web scraper and HTML parser designed to extract structured data from websites and HTML files. It functions as a web data extraction tool that retrieves specific information from DOM elements and converts web content into usable data fields. The tool uses CSS selectors to target specific data points and employs schema-driven data mapping to organize unstructured web text into a consistent format. It supports custom value transformation to convert raw extracted strings into specific data formats. The system provides capabilities for web data extraction and automated cont
RoboBrowser: Your friendly neighborhood web scraper
snscrape is a Python-based social media web scraper and crawler designed to extract public posts, profiles, and hashtags from social networks without the use of official APIs. It functions as an archival tool and a utility for open-source intelligence data collection, allowing for the gathering of publicly available information to investigate trends and people. The tool facilitates social media data extraction for research and archival purposes, enabling the creation of historical records of conversations and user activity. It supports workflows for academic social analysis and the export of
Parse feeds in Python
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate cont
Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) describing how a website should be traversed and what should be extracted. Using these sitemaps, Web Scraper will navigate the site accordingly and extract all data.…
MechanicalSoup is a Python web automation library and scraping framework designed to simulate browser sessions and navigate websites without requiring JavaScript execution. It functions as an HTML parsing tool and HTTP session manager, allowing for the programmatic retrieval of page content and the automation of web interactions. The library distinguishes itself by combining session persistence with automated form interaction. It maps user data to HTML input fields and selection boxes for programmatic submission and maintains authenticated states by managing cookies and user-agent headers acr
Helium is a Python library and high-level wrapper for Selenium designed for browser automation, functional UI testing, and web scraping. It provides a simplified interface for interacting with web applications across different browser engines. The library distinguishes itself by allowing users to identify and interact with web elements using visible text labels rather than relying exclusively on technical identifiers like XPaths or CSS selectors. This approach enables the creation of automation scripts based on human-readable labels. The toolkit covers a broad range of browser automation cap
Playwright for Python is a browser automation framework designed for end-to-end testing, web scraping, and user interaction simulation. It functions as a headless browser controller that enables programmatic navigation, data extraction, and the execution of complex workflows across multiple rendering engines. The framework distinguishes itself through an actionability-aware interaction engine that automatically verifies element readiness before performing actions, significantly reducing test flakiness. It utilizes isolated browser contexts to maintain separate storage and cookies for parallel
Module for automatic summarization of text documents and HTML pages.
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.