30 open-source projects similar to oxylabs/how-to-scrape-amazon-product-data, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best alternative to How To Scrape Amazon Product Data.
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
CyberScraper-2077 is an AI-powered web scraping tool that uses large language models to extract and structure data from websites into organized formats. It functions as an LLM web scraper and AI content parser, transforming unstructured raw web text into specific data schemas. The project distinguishes itself through a suite of anonymity and evasion tools, including proxy rotation, SOCKS-based identity masking, and the ability to route traffic through the Tor network to access hidden onion services. It further includes a bot detection bypass system that employs stealth parameters and custom n
Pholcus is a distributed web crawler framework written in Go and designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v
so-novel is a web novel downloader and scraping engine designed to extract structured text from websites and convert it into electronic book formats. It functions as a multi-interface content extractor, providing a shared backend accessible via a web-based management dashboard, a terminal user interface, and a command line interface. The system utilizes a rule-driven approach for data extraction, using CSS selectors and XPath rules defined in external configuration files to map web elements to specific data fields. To maintain access to content, it includes a proxy-routed request pipeline to
PythonSpiderNotes is a comprehensive instructional resource and framework for building web crawlers and extracting data using the Python programming language. It provides a set of methods for parsing unstructured HTML and JSON data into structured formats for persistent storage. The project includes detailed guides and tutorials on browser automation for retrieving dynamic content, as well as a framework for data extraction. It specifically covers anti-bot bypass techniques, such as rotating proxies and spoofing headers, to avoid IP blocks and detection systems. The capability surface extend
This project is an MCP browser automation server that connects large language models to headless cloud browsers. It functions as an autonomous web workflow engine and an LLM web agent interface, enabling the translation of natural language instructions into browser actions and structured data retrieval. The system distinguishes itself through a managed headless browser cloud API that supports concurrent Chromium sessions with integrated stealth modes, CAPTCHA solving, and proxy traffic routing. It utilizes self-healing element selection to maintain automation resilience when page structures c
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate cont
cloudscraper is a Python library designed to bypass Cloudflare anti-bot protections by resolving JavaScript challenges and mimicking browser fingerprints. It functions as a specialized tool for accessing websites that employ automated security systems to block scripts and headless browsers. The project differentiates itself through the use of interchangeable JavaScript runtimes, such as Node.js or V8, to execute challenge code and obtain security clearance tokens. It employs a fingerprint rotation engine and HTTP request emulation to rotate browser headers and device identifiers, mimicking hu
pydoll is a Chrome DevTools Protocol automation library and headless browser controller used for web data extraction and parallel browser automation. It controls Chromium-based browsers via direct WebSocket connections, allowing it to manage isolated browser contexts and tabs while bypassing the overhead and detection associated with WebDriver. The project features an anti-bot evasion framework that mimics natural human behavior, including mouse movements generated via Bezier curves and variable typing patterns. It provides specialized stealth capabilities to bypass behavioral analysis and au
Automa is a browser-based automation platform that enables users to build, schedule, and execute repetitive web tasks through a visual, no-code interface. By operating as a browser extension, it provides a canvas-based environment where users construct workflows by connecting functional blocks to interact with web elements, manage browser state, and process data. The platform distinguishes itself through its deep integration with the browser environment, allowing for complex orchestration such as event-driven triggers, cross-origin request handling, and the ability to package workflows as sta
JobSpy is a job board scraper and listing aggregator designed to extract employment opportunities from multiple websites and compile them into a unified dataset. It functions as a job search automation tool that programmatically collects vacancies based on keywords, locations, and specific filters. The project serves as a web scraping framework that utilizes proxy routing and user-agent rotation to bypass rate limits and avoid server-side blocking during data extraction. It includes infrastructure for concurrent request aggregation and schema-based data normalization to ensure consistent form
ai-goofish-monitor is an AI-driven marketplace monitor and containerized web scraper designed to track online listings. It uses multimodal large language models and natural language prompts to analyze product text and images, determining if items meet specific requirements. The system employs an anti-detection workflow that rotates network proxies and authenticated accounts to bypass rate limits. It captures browser cookies and session states to mimic real user behavior during automated requests. The project includes a task scheduler using cron expressions and an embedded SQLite database for
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
Damaihelper is a ticketing automation bot and browser automation framework designed to monitor ticket availability and execute checkout processes. It utilizes a ticket purchasing script to automate the selection and purchase of tickets on web platforms based on predefined user criteria. The tool includes a graphical user interface for managing scripts and configuring automation parameters, allowing users to trigger tasks without using a command line. To maintain access, it employs browser session management to save and reuse authentication cookies, avoiding repetitive manual login procedures.
bilibili-api is a Bilibili API wrapper and content scraper designed for programmatically accessing video metadata, user profiles, and content data. It functions as an anti-bot crawler framework and a WebSocket live chat client for retrieving platform information and real-time interaction data. The project incorporates tools to bypass anti-crawling measures and rate limits through the use of proxies and TLS fingerprint spoofing. It also includes logic for mapping and converting various video and content identifiers to ensure consistent data retrieval across different endpoints. Its capability
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The
This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data archiving. The toolset includes specialized mechanisms for bypassing anti-scraping measures through IP proxy pool rotation and multi-threaded crawlers. It also features capabilities for simulating browser sessions to handle authentication, intercepting session cookies, and decrypting network payloads
Stagehand is an AI-native browser automation framework that enables developers to build reliable web automations using a hybrid of natural language instructions and deterministic TypeScript code.
Maxun is an open-source web scraping and automation platform designed to transform dynamic website content into structured data. By leveraging artificial intelligence to interpret natural language prompts, the system identifies page elements and extracts information without requiring manual selector configuration. It serves as a bridge between raw web content and intelligent workflows, providing structured outputs in formats optimized for large language model ingestion and agent-based applications. The platform distinguishes itself through its ability to handle complex, authenticated, and dyn
curl_cffi is a Python HTTP client built on libcurl that focuses on browser fingerprint impersonation to evade anti-bot detection. By replacing the default TLS handshake and HTTP/2 settings with those extracted from real browsers like Chrome and Firefox, it sends HTTP requests that closely mimic actual browser traffic, reducing the likelihood of being blocked by services that fingerprint automated clients. Beyond fingerprint impersonation, curl_cffi offers a dual API supporting both synchronous and asynchronous execution, with per-request proxy assignment, automatic retry with exponential backoff
BrowserOS is an AI agent browser orchestrator and automation framework designed to manage browser state and execute complex web workflows. It functions as a local AI browser assistant and a Model Context Protocol controller, enabling the control of browser tabs, windows, and navigation through programmable AI agents and standardized context protocols. The system distinguishes itself through a graph-based visual workflow builder for creating repeatable automation sequences and the use of markdown-based files to define agent personalities and task recipes. It supports multi-provider orchestrati
Obscura is a web scraping infrastructure and headless browser server designed for AI agents. It provides a system for AI models to control browser sessions, interact with websites, and extract web data using a WebSocket implementation of the Chrome DevTools Protocol. The project focuses on bot detection evasion by randomizing browser fingerprints, masking native functions, and blocking tracking scripts to mimic human behavior. It further secures identities through a traffic layer that routes network requests via HTTP or SOCKS5 proxies. The system supports large-scale data extraction through
ECommerceCrawlers is an educational collection of Python-based crawler scripts designed to extract data from a variety of public websites, including e-commerce platforms, social media sites, news outlets, and multimedia sources. The project serves as a learning resource for web scraping techniques, offering ready-to-run examples that demonstrate practical data extraction methods. The toolkit covers a broad range of data types, including product listings and prices from online retail platforms, public posts and profiles from social networking sites, articles from news and blogging platforms, p
This is a tool for searching, downloading, and archiving articles and engagement metadata from WeChat official accounts. It functions as a web-based content scraper and data exporter, allowing for the automated retrieval of social media content and the collection of performance metrics. The project distinguishes itself through a system that captures session credentials and authentication cookies from desktop clients via a local proxy to access private engagement data. It utilizes a concurrent proxy-pool fetching mechanism to download large volumes of content while avoiding rate limits, and it
This project is a CAPTCHA solver browser extension that automatically detects and resolves image, text, and behavioral challenges using an AI inference engine. It functions as a bot detection bypass tool designed to overcome interactive web barriers and session timeouts to maintain access to protected websites. The extension provides a bridge between automated solving capabilities and external programming languages or browser automation frameworks via an API integration. It utilizes an AI-powered optical character recognition system to transcribe text from images and auditory challenges into
This project is a Model Context Protocol tool that connects local browser instances to AI agents, enabling programmatic control over web sessions. It functions as a browser automation framework, allowing for the navigation of pages, interaction with form elements, and the management of user data while maintaining existing authentication states and profiles. The utility distinguishes itself by enabling local analysis of browser content, including the extraction of text and the performance of semantic searches across open tabs without transmitting private data to external servers. It also provi
X-crawl is a Node.js-based web scraping framework designed to automate data collection from both static and dynamic websites. It integrates artificial intelligence to perform semantic parsing, allowing it to transform unstructured HTML into structured data formats that remain accurate even when website layouts or class names change. The project distinguishes itself through a comprehensive suite of stealth and reliability features. It manages crawler identity by randomizing device fingerprints and rotating proxy servers to bypass access restrictions. To handle complex, JavaScript-heavy interfa
This project is a Python library that wraps official NBA endpoints to retrieve player, team, and game statistics as structured data. It serves as a programmatic interface for fetching professional basketball league records and real-time scoreboards via HTTP requests. The library integrates with Pandas to transform raw JSON responses from sports servers into DataFrames for statistical analysis and data science. It functions as a data retrieval utility for tracking league-wide performance trends and scouting professional basketball players. The tool covers a broad range of capabilities includi
This project is a containerized search infrastructure designed to deploy a privacy-focused metasearch engine. It acts as a self-hosted search proxy that aggregates results from multiple external web, image, and academic search providers while anonymizing requests and stripping trackers to protect user identity. The system utilizes Docker to orchestrate the search instance, integrating caching mechanisms and reverse proxy support to ensure a private and efficient search environment. It employs a modular adapter-based integration to standardize diverse external API responses and a processing pi