Pholcus

Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection.
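The master-worker model described above can be illustrated with a small in-process sketch. This is not Pholcus's own API: `run_master`, `handler`, and the worker names are hypothetical, and threads stand in for the remote client nodes that a real deployment would reach over the network.

```python
import queue
import threading


def run_master(urls, worker_count, handler):
    """Hypothetical master: queues every task, then lets workers pull them."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker(name):
        # Each worker keeps pulling until the shared queue is empty, so
        # faster workers naturally end up handling more tasks.
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return
            outcome = handler(url)
            with lock:
                results[url] = (name, outcome)

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_master(urls, 3, fetch)` with any `fetch(url)` function returns a mapping from each URL to the worker that handled it and the result. Pull-based distribution is one simple way to balance load; the actual scheduling strategy used by the system is not specified here.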

The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior.
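As a rough illustration of hot-loading, the sketch below keeps extraction rules in a registry and swaps them while the process keeps running. The `RuleEngine` class and the convention that each rule defines an `extract(page)` function are assumptions made for this example, not the project's actual rule format.

```python
class RuleEngine:
    """Hypothetical registry whose rules can be replaced at runtime."""

    def __init__(self):
        self._rules = {}

    def load(self, name, source):
        # Compile the rule text into its own namespace. In practice the text
        # would typically be re-read from a rule file whenever it changes.
        namespace = {}
        exec(compile(source, f"<rule:{name}>", "exec"), namespace)
        extract = namespace.get("extract")
        if not callable(extract):
            raise ValueError(f"rule {name!r} must define extract(page)")
        # Replacing the entry takes effect on the next call to run().
        self._rules[name] = extract

    def run(self, name, page):
        return self._rules[name](page)
```

Loading a second version of a rule under the same name changes the behaviour of later `run` calls without a restart. Executing rule text this way is only safe for trusted rule sources.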

The platform includes a request deduplication pipeline and breakpoint-based recovery to maintain data integrity during system failures. Scraped content is routed through a pluggable data export layer to destinations such as databases, message queues, or flat files.
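Deduplication and breakpoint recovery can be sketched together as a request frontier that remembers fingerprints of requests it has already accepted and can serialize its state. The names `Frontier`, `fingerprint`, `snapshot`, and `restore` are hypothetical, and hashing the method plus URL is one assumed fingerprinting scheme among many.

```python
import hashlib
import json
from collections import deque


def fingerprint(method, url):
    """Stable identifier for a request (assumed scheme: method plus URL)."""
    return hashlib.sha256(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class Frontier:
    """Hypothetical request queue with deduplication and checkpointing."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def push(self, url, method="GET"):
        fp = fingerprint(method, url)
        if fp in self.seen:
            return False  # duplicate request, skipped
        self.seen.add(fp)
        self.pending.append([method, url])
        return True

    def pop(self):
        return self.pending.popleft() if self.pending else None

    def snapshot(self):
        # Serialize the state; a crawler would write this string to disk.
        return json.dumps({"pending": list(self.pending), "seen": sorted(self.seen)})

    @classmethod
    def restore(cls, text):
        state = json.loads(text)
        frontier = cls()
        frontier.pending = deque(state["pending"])
        frontier.seen = set(state["seen"])
        return frontier
```

After a failure, `Frontier.restore(saved_text)` resumes with the pending requests intact and still rejects URLs that were queued before the crash. This sketch does not track requests that were popped but not yet completed; a fuller version would need to record those as well.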

Spider selection, parameter configuration, and task execution are managed through a web interface or a command-line tool.

Features

Data Scraping Tools - Provides a distributed framework for high-concurrency web scraping and automated data extraction into structured formats.

Master-Worker Coordination - Employs a master-worker coordination model to distribute and balance scraping tasks across remote client nodes.

Dynamic Content Rendering - Uses headless browsers to render JavaScript and extract data from modern, dynamic web pages.

Proxy and Fingerprint Rotation - Implements proxy and fingerprint rotation to avoid rate limiting and bypass bot detection systems.

Extraction Rule Sets - Supports custom extraction rule sets that define how data and navigation logic are handled during scraping.

Anti-Bot Evasion - Employs anti-bot evasion techniques including proxy rotation and human behavior simulation to access protected data.

Task Coordination - Coordinates complex crawling workflows and task completion across a distributed network of client nodes.

Dynamic Rule Engines - Features a dynamic rule engine that allows extraction and navigation logic to be updated at runtime without restarts.

Request Reliability & Recovery - Implements automatic request deduplication, retry logic, and breakpoint recovery to maintain data integrity during failures.

Headless Rendering Engines - Utilizes headless rendering engines to execute JavaScript and bypass security checks on protected websites.

Headless Browser Orchestrators - Integrates headless browser orchestration to render JavaScript and extract data from dynamic web pages.

Web Crawling - Implements a high-concurrency distributed system for large-scale horizontal and vertical web crawling.

Distributed Crawling Infrastructures - Provides a scalable infrastructure for executing high-concurrency web data collection across multiple remote environments.

Data Exporters - Provides a pluggable export layer to route scraped content into databases, message queues, or flat files.

Web Data Pipelines - Provides automated web data pipelines that extract information and route it directly into structured storage.

Compiled Extraction Rules - Supports statically compiled rules for high-performance scraping as well as dynamic rule files that can be hot-loaded without restarting the system.

Multi-Destination Data Routing - Enables multi-destination data routing to persist scraped results into databases, queues, or various file formats.

Outbound IP Rotation - Rotates outbound IP addresses at defined frequencies to avoid rate limits and IP bans.

Automated Login Bypasses - Simulates login sequences to programmatically access protected web content.

Crawl State Recovery - Implements breakpoint-based recovery to resume large-scale data collection from the last successful state after system failures.

Request Deduplication - Includes a request deduplication pipeline to prevent redundant network calls and infinite crawling loops.

Crawl Task Managers - Offers crawl task management for pausing, canceling, and running scraping jobs concurrently in batches.

Data Processing and Machine Learning - Distributed framework for web crawling and data extraction.
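The "Outbound IP Rotation" entry above mentions rotating addresses at defined frequencies. A minimal sketch of time-based rotation follows; `ProxyRotator`, its parameters, and the injectable `clock` are hypothetical names chosen for illustration, and the proxy addresses in the usage note are placeholders.

```python
import itertools
import time


class ProxyRotator:
    """Hypothetical rotator that switches proxy after a fixed interval."""

    def __init__(self, proxies, interval_seconds, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._interval = interval_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._current = next(self._cycle)
        self._since = clock()

    def current(self):
        # Advance to the next proxy once the interval has elapsed, wrapping
        # around to the first proxy after the last one.
        now = self._clock()
        if now - self._since >= self._interval:
            self._current = next(self._cycle)
            self._since = now
        return self._current
```

A crawler could call `ProxyRotator(["http://10.0.0.1:8080", "http://10.0.0.2:8080"], 60).current()` before each request to pick the outbound proxy. Rotation here is checked lazily on each call rather than driven by a background timer, which keeps the sketch free of threads.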

andeya/pholcus
