30 open-source projects similar to dropsdevopsorg/ecommercecrawlers, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best ECommerceCrawlers alternative.
Helium is a Python library and high-level wrapper for Selenium designed for browser automation, functional UI testing, and web scraping. It provides a simplified interface for interacting with web applications across different browser engines. The library distinguishes itself by allowing users to identify and interact with web elements using visible text labels rather than relying exclusively on technical identifiers like XPaths or CSS selectors. This approach enables the creation of automation scripts based on human-readable labels. The toolkit covers a broad range of browser automation cap
Toutatis is an open-source intelligence tool designed to extract public account information, emails, and phone numbers from Instagram profiles. It functions as a command-line utility for gathering user details and contact information for research purposes. The tool provides capabilities for public identity verification and account intelligence by translating usernames into internal unique identifiers to retrieve associated profile data. The system handles data extraction through a command line interface, utilizing request simulation and session-based API interactions to parse structured data
This project is an Amazon web scraper and e-commerce data extractor designed to retrieve product names, prices, and ratings. It functions as a headless browser crawler that converts unstructured web content from product listings into structured JSON and CSV formats. The tool incorporates anti-bot bypass capabilities to circumvent CAPTCHAs and security challenges. It achieves this through the use of residential proxy integration, automatic proxy rotation, and the modification of browser fingerprints to simulate human interaction patterns. The system provides broad web scraping capabilities, i
This is a collection of Python scripts designed for extracting data from popular Chinese websites and mobile applications. It functions as a multi-platform data extraction toolkit, capable of automating tasks such as downloading videos from platforms like Bilibili and Douyin, scraping product reviews and images from e-commerce sites like Taobao and JD.com, and booking train tickets on the 12306 railway system. The project distinguishes itself through its focus on automating specific, high-value tasks within the Chinese internet ecosystem. It includes capabilities for solving Chinese CAPTCHA c
This project is a collection of Python implementations for web scraping, network traffic interception, data analysis, and sentiment analysis. It provides methods for extracting structured data from websites and mobile application interfaces. The collection includes tools for capturing and analyzing network packets from mobile applications to identify hidden internal API endpoints. It also features scripts for evaluating the emotional tone and public perception of text data. The project covers data manipulation and transformation of large datasets, as well as the generation of charts and grap
snscrape is a Python-based social media web scraper and crawler designed to extract public posts, profiles, and hashtags from social networks without the use of official APIs. It functions as an archival tool and a utility for open-source intelligence data collection, allowing for the gathering of publicly available information to investigate trends and people. The tool facilitates social media data extraction for research and archival purposes, enabling the creation of historical records of conversations and user activity. It supports workflows for academic social analysis and the export of
This project is a Model Context Protocol server that connects large language models to web scraping and crawling tools. It functions as a bridge, allowing LLM clients to utilize a web crawling engine and scraping utilities to extract and process web data. The server integrates a markdown web converter that transforms dynamic web pages and PDF documents into clean markdown to optimize consumption by AI models. It also provides a browser automation interface for controlling headless sessions and bypassing access restrictions. The system covers broad capabilities including large-scale website d
This project is a collection of Python scripts and source code examples designed for learning programming fundamentals through practical application. It serves as a toolkit for web scraping and browser automation, alongside a library of utilities for data processing. The repository includes scripts for simulating human interactions to automate repetitive web tasks and online booking processes. It also provides a structured database of administrative divisions, including provinces, cities, and districts, for geographic data management and address validation. The collection covers capabilities
img2dataset is a high-performance image dataset pipeline and preprocessing tool designed to download and process millions of images from URLs for machine learning training. It functions as a distributed image downloader and cloud storage data exporter, moving large visual datasets from web sources directly into structured formats. The system prioritizes high-throughput data acquisition by distributing workloads across multiple CPU cores and machines. It integrates directly with remote cloud storage buckets and employs a manifest-based tracking system to resume interrupted downloads without re
Instaloader is a Python library and command-line utility designed for the automated retrieval, archiving, and analysis of Instagram content. It provides a programmatic interface to fetch media, captions, and metadata from public or private profiles, hashtags, and stories, while maintaining persistent user sessions for authorized access. The tool distinguishes itself through robust archive management and traffic control mechanisms. It supports incremental synchronization, allowing users to resume interrupted downloads and update local collections without redundant requests. To ensure reliable
Osintgram is a command-line utility designed for open-source intelligence gathering and the extraction of public data from social media profiles. It functions as a framework for collecting and processing user information to assist in digital investigations and the mapping of digital footprints. The tool distinguishes itself through a modular architecture that organizes intelligence-gathering tasks into independent scripts, all sharing a unified session state and data processing pipeline. It utilizes headless browser automation and session-based interactions to mimic legitimate user behavior,
Social-analyzer is an open-source intelligence framework designed for the automated discovery, correlation, and verification of digital identities across online platforms. It functions as a comprehensive engine for gathering social media intelligence, utilizing distributed browser automation to extract metadata and profile information from hundreds of websites simultaneously. The platform distinguishes itself through its ability to perform cross-platform identity correlation using heuristic-based pattern matching and name permutation generation. It processes these findings through a confidenc
jd-assistant is an e-commerce automation bot designed for the JD.com platform. It functions as an automated checkout script and task runner that monitors product stock and executes purchase sequences for high-demand items. The project specializes in flash sale automation, combining real-time stock monitoring with clock-synced task scheduling to trigger orders at specific timestamps. It manages the end-to-end purchase flow, including automated cart management and the submission of orders using predefined regional identifiers. The system includes capabilities for account and session management
This project is a comprehensive educational guide and framework for building web scrapers using Python. It provides a course-based approach to data extraction, combining a Python crawler framework with tutorials on web reverse engineering and network traffic analysis. The project distinguishes itself by covering advanced extraction challenges, including the decryption of obfuscated JavaScript and the bypass of anti-scraping measures. It specifically addresses mobile application scraping through the simulation of user interactions and the interception of network traffic. The capability surfac
Haipproxy is a high-availability proxy gateway and distributed proxy pool manager. It consists of a system for storing and rotating verified IP proxy addresses using Redis, a web crawling system to discover anonymous proxies from public sources, and a validation engine that checks proxy functionality against specific target domains. The project implements a middleware layer that provides a stable entry point for requests by automatically rotating backend IP addresses. This includes a local proxy server that acts as a bridge between the client and the pool, decoupling the two by updating inter
This project is a distributed web crawling framework that enables the horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and utilizes a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings within the shared remote store. The framework also provides capabilities for distributed data pro
Twikit is a Python library and API wrapper designed for interacting with X (Twitter). It simulates browser requests and mimics private network traffic to enable programmatic access to the platform without requiring an official API key. The project focuses on social media automation and data extraction, featuring tools for scraping user profiles, trending topics, and chronological tweet histories. It includes a session manager that handles user authentication, two-factor authentication, and cookie persistence to maintain active account access. The library's capabilities cover a broad range of
PythonSpiderNotes is a comprehensive instructional resource and framework for building web crawlers and extracting data using the Python programming language. It provides a set of methods for parsing unstructured HTML and JSON data into structured formats for persistent storage. The project includes detailed guides and tutorials on browser automation for retrieving dynamic content, as well as a framework for data extraction. It specifically covers anti-bot bypass techniques, such as rotating proxies and spoofing headers, to avoid IP blocks and detection systems. The capability surface extend
Social Mapper is an open-source intelligence framework designed to gather and structure digital footprints from social media networks. It functions as a platform for correlating online identities across multiple platforms, enabling the construction of a unified digital profile for a specific subject. The tool distinguishes itself by integrating automated facial recognition to verify identities, comparing target photographs against profile pictures found during the search process. This capability allows for the filtering of search results to improve the accuracy of identity correlation before
GhostTrack is an open-source intelligence (OSINT) framework that aggregates geographic, network, and social identity information from public data sources. It functions as a digital footprint analyzer, collecting various pieces of publicly available information to build comprehensive profiles of target individuals. The framework combines multiple investigative capabilities into a single tool, including IP address geolocation, phone number intelligence, and social media username discovery. It distributes queries across external data services to maximize coverage and accuracy, resolving IP addre
Distribute crawler is a distributed web scraping framework that integrates with Scrapy to coordinate multiple crawler instances across clusters. It utilizes a centralized task queue to manage and scale concurrent data collection operations, enabling horizontal scaling of scraping tasks across multiple worker nodes. The framework distinguishes itself through its focus on large-scale data management and traffic control. It persists scraped items and binary assets into document-oriented database clusters, utilizing deduplication logic to optimize bandwidth and storage. To maintain consistent dat
Mr.Holmes is an open-source intelligence investigation framework designed to gather public data from phone numbers, usernames, IP addresses, and domains. It functions as a collection of tools for digital footprint analysis and social media reconnaissance. The system integrates several specialized capabilities, including a search engine dorking tool for uncovering hidden public records and a geolocation utility for identifying the physical location and ownership of network addresses. It also includes a social media reconnaissance system that scrapes and links public profiles using usernames an
Twint is an open-source intelligence and data extraction framework designed to gather public social media information. It functions as a command-line utility that retrieves posts, user profiles, and follower lists directly from web interfaces, bypassing the need for official platform developer credentials or authentication keys. The tool distinguishes itself by enabling automated, large-scale data collection through terminal-based orchestration. It supports granular filtering by keywords, geographic locations, time ranges, and account status, allowing researchers to build targeted datasets fo
Mechanize is a Ruby library for web browser automation and headless browser emulation. It allows for programmatically navigating websites and simulating human behavior without a graphical user interface. The library provides an automated interface for populating and submitting web forms, including text fields, checkboxes, and file uploads. It manages stateful sessions by automatically storing and sending cookies across multiple requests to maintain user authentication and identity. Additional capabilities include web data scraping, the ability to download remote web content, and the maintena
This project is a PHP implementation of a CSS selector engine that transforms CSS selector strings into compatible XPath expressions for locating elements within documents. It serves as a converter and expression generator that maps CSS selection logic to the XPath query language. The library processes selectors through a pipeline involving lexer-based tokenization and recursive descent parsing to create an abstract syntax tree. It utilizes pattern-matching logic to handle child and sibling relationships, translating CSS pseudo-classes and selectors into functional XPath logic. These capabil
This project is an administrative GIS toolset that provides a comprehensive dataset of China's administrative divisions, including provinces, cities, districts, and townships. It functions as a coordinate system transformer and a boundary converter for transforming geographic data into standard formats. The toolset distinguishes itself through the ability to convert administrative boundary data between CSV, GeoJSON, Shapefiles, and SQL. It includes specialized utilities for coordinate system transformation between GCJ-02, BD-09, WGS-84, and CGCS2000 standards to ensure accuracy across differe
This project is a public proxy aggregator and directory providing curated lists of validated HTTP and SOCKS proxy servers. It features a machine-readable API service and tools designed for anonymous network routing and the automated rotation of outgoing IP addresses. The system distinguishes itself through a proxy rotation tool used to bypass rate limits and prevent detection by automated security systems. It provides a programmatic interface for retrieving and filtering verified proxies by country and protocol, delivering this data in JSON and text formats for integration into custom applica
This project is an open source discovery resource that provides curated lists of reusable code and libraries to help developers find technical solutions for specific tasks. It utilizes a category-based indexing system to organize diverse software tools by their functional capabilities. The repository is structured as a collection of markdown-based documentation and static content, serving as a directory for manual discovery and reference. The directory covers a wide range of capability areas, including cross-platform application development, cybersecurity tool creation, network protocol impl
Hakuneko is a cross-platform manga downloader and multi-platform media scraper designed to save manga and anime images and videos from various websites. It functions as a tool for offline media consumption, allowing users to extract visual content from web sources and save it to local storage. The application enables cross-platform media archiving on Windows, Linux, and macOS. It focuses on web content scraping to create local archives of images and videos, ensuring content remains accessible without an internet connection. The system manages these tasks through a connector architecture and