30 open-source projects similar to venomous/cloudscraper, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Cloudscraper alternative.
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
CyberScraper-2077 is an AI-powered web scraping tool that uses large language models to extract and structure data from websites into organized formats. It functions as an LLM web scraper and AI content parser, transforming unstructured raw web text into specific data schemas. The project distinguishes itself through a suite of anonymity and evasion tools, including proxy rotation, SOCKS-based identity masking, and the ability to route traffic through the Tor network to access hidden onion services. It further includes a bot detection bypass system that employs stealth parameters and custom n
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
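The resource-aware scaling idea described in this entry can be illustrated with a minimal sketch. This is not Crawlee's actual controller: the function name `next_concurrency`, the `ConcurrencyLimits` type, and the 0.85 / 0.60 thresholds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConcurrencyLimits:
    """Hypothetical bounds on how many tasks may run at once."""
    minimum: int = 1
    maximum: int = 32


def next_concurrency(current, cpu_fraction, memory_fraction,
                     limits=ConcurrencyLimits(),
                     high_water=0.85, low_water=0.60):
    """Return the concurrency to use for the next interval.

    cpu_fraction and memory_fraction are utilisation readings in [0, 1].
    The thresholds are assumptions, not values taken from any library.
    """
    busiest = max(cpu_fraction, memory_fraction)
    if busiest > high_water:
        proposed = current - 1   # host is close to exhaustion: back off
    elif busiest < low_water:
        proposed = current + 1   # spare capacity: allow one more task
    else:
        proposed = current       # inside the comfort band: hold steady
    # Clamp to the configured bounds.
    return max(limits.minimum, min(limits.maximum, proposed))
```

Calling this once per sampling interval with fresh CPU and memory readings moves the task count up or down one step at a time, which avoids large swings when a single reading is noisy.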
req is a chainable HTTP client library for Go designed to simplify request configuration and automatic response decoding into structures. It provides a fluent-interface request builder that allows developers to incrementally define request properties and encapsulate HTTP logic into reusable API SDKs. The project distinguishes itself with a TLS fingerprint emulator that mimics browser network signatures to bypass bot detection and crawler filters. It also includes a concurrent file downloader that increases transfer speeds by fetching large remote files in parallel segments. The library cover
This project is a specialized TikTok API scraper and data extractor. It functions as a proxy-based web scraper designed to collect user metadata, video posts, and trend feeds, while providing a webhook data pipeline to route scraped information to external URLs via HTTP requests. The tool includes a watermark-free video downloader that saves high-definition content to local storage. It employs cryptographic request signing for server authentication and utilizes session cookie authentication combined with proxy rotation to manage network traffic and avoid rate limits. Capabilities cover bulk
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The
This project is an Amazon web scraper and e-commerce data extractor designed to retrieve product names, prices, and ratings. It functions as a headless browser crawler that converts unstructured web content from product listings into structured JSON and CSV formats. The tool incorporates anti-bot bypass capabilities to circumvent CAPTCHAs and security challenges. It achieves this through the use of residential proxy integration, automatic proxy rotation, and the modification of browser fingerprints to simulate human interaction patterns. The system provides broad web scraping capabilities, i
ai-goofish-monitor is an AI-driven marketplace monitor and containerized web scraper designed to track online listings. It uses multimodal large language models and natural language prompts to analyze product text and images, determining if items meet specific requirements. The system employs an anti-detection workflow that rotates network proxies and authenticated accounts to bypass rate limits. It captures browser cookies and session states to mimic real user behavior during automated requests. The project includes a task scheduler using cron expressions and an embedded SQLite database for
This project is a CAPTCHA solver browser extension that automatically detects and resolves image, text, and behavioral challenges using an AI inference engine. It functions as a bot detection bypass tool designed to overcome interactive web barriers and session timeouts to maintain access to protected websites. The extension provides a bridge between automated solving capabilities and external programming languages or browser automation frameworks via an API integration. It utilizes an AI-powered optical character recognition system to transcribe text from images and audio challenges into
pydoll is a Chrome DevTools Protocol automation library and headless browser controller used for web data extraction and parallel browser automation. It controls Chromium-based browsers via direct WebSocket connections, allowing it to manage isolated browser contexts and tabs while bypassing the overhead and detection associated with WebDriver. The project features an anti-bot evasion framework that mimics natural human behavior, including mouse movements generated via Bezier curves and variable typing patterns. It provides specialized stealth capabilities to bypass behavioral analysis and au
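As a rough illustration of mouse movement along a Bezier curve, the sketch below generates points on a cubic Bezier path between two screen positions. It does not use pydoll's API; the names `cubic_bezier` and `mouse_path`, and the `spread` parameter that randomises the control points, are hypothetical.

```python
import random


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u * u * t * p1[0] + 3 * u * t * t * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u * u * t * p1[1] + 3 * u * t * t * p2[1] + t**3 * p3[1]
    return (x, y)


def mouse_path(start, end, steps=20, spread=100.0, rng=None):
    """Return steps + 1 points curving from start to end.

    The two control points sit one third and two thirds of the way along
    the straight line, each nudged by up to `spread` pixels so that
    repeated paths between the same endpoints differ.
    """
    if rng is None:
        rng = random.Random()
    dx, dy = end[0] - start[0], end[1] - start[1]
    c1 = (start[0] + dx / 3 + rng.uniform(-spread, spread),
          start[1] + dy / 3 + rng.uniform(-spread, spread))
    c2 = (start[0] + 2 * dx / 3 + rng.uniform(-spread, spread),
          start[1] + 2 * dy / 3 + rng.uniform(-spread, spread))
    return [cubic_bezier(start, c1, c2, end, i / steps) for i in range(steps + 1)]
```

The curve always begins at `start` and ends at `end`; only the route between them varies. Passing a seeded `random.Random` makes a path reproducible for testing.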
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
Obscura is a web scraping infrastructure and headless browser server designed for AI agents. It provides a system for AI models to control browser sessions, interact with websites, and extract web data using a WebSocket implementation of the Chrome DevTools Protocol. The project focuses on bot detection evasion by randomizing browser fingerprints, masking native functions, and blocking tracking scripts to mimic human behavior. It further secures identities through a traffic layer that routes network requests via HTTP or SOCKS5 proxies. The system supports large-scale data extraction through
This project is a public proxy aggregator and directory providing curated lists of validated HTTP and SOCKS proxy servers. It features a machine-readable API service and tools designed for anonymous network routing and the automated rotation of outgoing IP addresses. The system distinguishes itself through a proxy rotation tool used to bypass rate limits and prevent detection by automated security systems. It provides a programmatic interface for retrieving and filtering verified proxies by country and protocol, delivering this data in JSON and text formats for integration into custom applica
curl_cffi is a Python HTTP client built on libcurl that focuses on browser fingerprint impersonation to evade anti-bot detection. By replacing default TLS handshake and HTTP/2 settings with those extracted from real browsers like Chrome and Firefox, it allows HTTP requests that closely mimic actual browser traffic, reducing the likelihood of being blocked by services that fingerprint automated clients. Beyond fingerprint impersonation, curl_cffi offers a dual API supporting both synchronous and asynchronous execution, with per-request proxy assignment, automatic retry with exponential backoff
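Retry with exponential backoff, mentioned in this entry, can be sketched generically as follows. This is not curl_cffi's own retry interface; `backoff_delays` and `call_with_retry` are hypothetical helpers, and the base delay, growth factor, and cap are assumed values.

```python
import time


def backoff_delays(retries, base=0.5, factor=2.0, cap=30.0):
    """Delay in seconds before each retry: base, base*factor, ... up to cap."""
    return [min(cap, base * factor**attempt) for attempt in range(retries)]


def call_with_retry(func, retries=3, base=0.5, sleep=time.sleep):
    """Call func(), retrying up to `retries` times on any exception.

    `sleep` is injectable so the waiting can be replaced in tests.
    The final failure is re-raised to the caller.
    """
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise                 # out of retries: surface the error
            sleep(delays[attempt])    # wait longer after each failure
```

In practice `func` would wrap a single HTTP request, and production code usually adds random jitter to each delay so that many clients do not retry in lockstep.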
nodriver is an asynchronous Chromium browser automation framework that provides headless control and web scraping capabilities. It functions as a Chrome DevTools Protocol client, allowing for granular engine control by attaching directly to the browser's debug port without the need for external driver binaries. The framework is specifically designed as an anti-bot detection bypass tool. It modifies browser fingerprints and protocol headers to evade automated security systems, handle security warnings, and bypass common obstacles like insecure connection alerts. The system covers a broad rang
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
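Task deduplication of the kind described here usually comes down to normalising each URL and remembering which ones have been queued. The sketch below shows that idea in Python rather than Node.js; `normalize` and `DedupQueue` are hypothetical names, and the normalisation rules (lowercase host, drop the fragment, treat an empty path as `/`) are assumptions, not node-crawler's behaviour.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a canonical form used as the deduplication key."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


class DedupQueue:
    """FIFO queue that silently ignores URLs it has already accepted."""

    def __init__(self):
        self._seen = set()
        self._pending = deque()

    def add(self, url):
        """Queue the URL; return False if an equivalent one was seen before."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def pop(self):
        """Return the oldest pending URL."""
        return self._pending.popleft()

    def __len__(self):
        return len(self._pending)
```

A URL stays in the seen set after it is popped, so a page that links back to an already-visited address is not fetched twice.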
Steel is a cloud browser automation platform that provides a REST API for launching and controlling remote Chrome browser sessions. It enables programmatic browsing and web scraping using standard automation tools like Puppeteer, Playwright, and Selenium, connecting to cloud-hosted browser instances via WebSocket and the Chrome DevTools Protocol. The platform supports both headless and headful browser sessions, with language-specific SDKs for TypeScript and Python. The service distinguishes itself through comprehensive anti-detection capabilities, including residential proxy rotation, CAPTCHA
Damaihelper is a ticketing automation bot and browser automation framework designed to monitor ticket availability and execute checkout processes. It utilizes a ticket purchasing script to automate the selection and purchase of tickets on web platforms based on predefined user criteria. The tool includes a graphical user interface for managing scripts and configuring automation parameters, allowing users to trigger tasks without using a command line. To maintain access, it employs browser session management to save and reuse authentication cookies, avoiding repetitive manual login procedures.
AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol. The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It utilizes language models and JSON schemas to pull specific information into validated formats and provides capabilities for AI page summarization and LLM-optimized content extraction. The system manages comprehensi
Anti-Anti-Spider is an automated web scraping toolkit and CAPTCHA bypass framework. It uses convolutional neural networks to recognize characters and digits in image-based security challenges, enabling programmatic access to protected web content. The project functions as an image recognition model trainer, providing a workflow to preprocess labeled image datasets and train custom neural networks. Users can configure model architectures and hyperparameters to align the recognition system with the visual style of specific target websites. The toolkit covers capabilities for image data preproc
Scraperr is a self-hosted web scraping and crawling platform designed for extracting structured data from websites using XPath selectors. It functions as a containerized system for managing scraping jobs through a queue and analyzing the resulting content using artificial intelligence. The project differentiates itself through its Kubernetes-native architecture, allowing for scalable deployment and management via package managers. It includes a crawling engine capable of domain-level spidering to discover linked pages and a data analyzer that uses artificial intelligence to query extracted we
ProxyBroker is a tool for scraping public HTTP and SOCKS proxy addresses, validating their connectivity, and managing a curated pool of functional proxies. It consists of a proxy scraper for discovery, a validation engine to check anonymity and response times, and a pool manager to maintain a filtered queue of servers. The project includes a local rotating proxy server that acts as a single entry point, automatically distributing incoming network traffic across a pool of validated external proxies. This infrastructure allows for the rotation of IP addresses to maintain resilience during web d
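The rotation behaviour described in this entry, handing out validated proxies in turn and dropping ones that stop working, can be sketched with a small round-robin pool. This is not ProxyBroker's API; the `ProxyPool` class and the example proxy addresses are hypothetical.

```python
from collections import deque


class ProxyPool:
    """Round-robin pool of proxy addresses with removal of failed entries."""

    def __init__(self, proxies):
        # dict.fromkeys removes duplicates while keeping the given order.
        self._proxies = deque(dict.fromkeys(proxies))

    def next(self):
        """Return the next proxy and move it to the back of the rotation."""
        if not self._proxies:
            raise LookupError("no working proxies left in the pool")
        proxy = self._proxies[0]
        self._proxies.rotate(-1)
        return proxy

    def discard(self, proxy):
        """Remove a proxy that failed validation; unknown entries are ignored."""
        try:
            self._proxies.remove(proxy)
        except ValueError:
            pass

    def __len__(self):
        return len(self._proxies)
```

A caller would ask for `pool.next()` before each outgoing request and call `pool.discard(proxy)` when a request through that proxy times out or is refused.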
CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed for extracting data from static and dynamic websites. It functions as a web data extraction pipeline and an HTTP request orchestrator, covering the full lifecycle of scraping applications from initial fetching to final data storage. The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network
This project is a railway booking automation tool designed to monitor ticket inventory and execute purchases on the 12306 platform. Its primary purpose is to secure high-demand train tickets by automating the login, booking, and checkout processes. The system utilizes automated CAPTCHA solving and headless session management to bypass security barriers and maintain user authentication. It employs a concurrent request queue and polling-based inventory monitoring to track seat availability and execute purchases immediately as they open. The automation surface includes waitlist management for r
This is a collection of Python scripts designed for extracting data from popular Chinese websites and mobile applications. It functions as a multi-platform data extraction toolkit, capable of automating tasks such as downloading videos from platforms like Bilibili and Douyin, scraping product reviews and images from e-commerce sites like Taobao and JD.com, and booking train tickets on the 12306 railway system. The project distinguishes itself through its focus on automating specific, high-value tasks within the Chinese internet ecosystem. It includes capabilities for solving Chinese CAPTCHA c
Twikit is a Python library and API wrapper designed for interacting with X (Twitter). It simulates browser requests and mimics private network traffic to enable programmatic access to the platform without requiring an official API key. The project focuses on social media automation and data extraction, featuring tools for scraping user profiles, trending topics, and chronological tweet histories. It includes a session manager that handles user authentication, two-factor authentication, and cookie persistence to maintain active account access. The library's capabilities cover a broad range of
Undetected-chromedriver is a framework for automated browser navigation designed to bypass anti-bot security measures. It functions by patching browser drivers at the binary level to obscure automation signals, allowing scripts to interact with protected websites without being flagged or blocked by security services. The project distinguishes itself through its ability to maintain stealth during automated sessions, including those executed in headless mode. It achieves this by injecting custom configurations to mimic human user behavior and by hooking into low-level browser debugging protocol
Pholcus is a distributed web crawler framework written in Go designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v