Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection.
The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior.
The platform includes a request deduplication pipeline and breakpoint-based recovery to maintain data integrity during system failures. Scraped content is routed through a pluggable data export layer to destinations such as databases, message queues, or flat files.
Management of spider selection, parameter configuration, and task execution is handled via a web interface or a command-line tool.