Pholcus

Features

Data Scraping Tools - Provides a distributed framework for high-concurrency web scraping and automated data extraction into structured formats.
Master-Worker Coordination - Employs a master-worker coordination model to distribute and balance scraping tasks across remote client nodes.
Dynamic Content Rendering - Uses headless browsers to render JavaScript and extract data from modern, dynamic web pages.
Proxy and Fingerprint Rotation - Implements proxy and fingerprint rotation to avoid rate limiting and bypass bot detection systems.
Extraction Rule Sets - Supports custom extraction rule sets that define how data and navigation logic are handled during scraping.
Anti-Bot Evasion - Employs anti-bot evasion techniques including proxy rotation and human behavior simulation to access protected data.
Task Coordination - Coordinates complex crawling workflows and task completion across a distributed network of client nodes.
Dynamic Rule Engines - Features a dynamic rule engine that allows extraction and navigation logic to be updated at runtime without restarts.
Request Reliability & Recovery - Implements automatic request deduplication, retry logic, and breakpoint recovery to maintain data integrity during failures.
Headless Rendering Engines - Utilizes headless rendering engines to execute JavaScript and bypass security checks on protected websites.
Headless Browser Orchestrators - Integrates headless browser orchestration to render JavaScript and extract data from dynamic web pages.
Web Crawling - Implements a high-concurrency distributed system for large-scale horizontal and vertical web crawling.
Distributed Crawling Infrastructures - Provides a scalable infrastructure for executing high-concurrency web data collection across multiple remote environments.
Data Exporters - Provides a pluggable export layer to route scraped content into databases, message queues, or flat files.
Web Data Pipelines - Provides automated web data pipelines that extract information and route it directly into structured storage.
Compiled Extraction Rules - Supports statically compiled rules for high-performance scraping, as well as dynamic rule files that can be hot-loaded without restarting the system.
Multi-Destination Data Routing - Enables multi-destination data routing to persist scraped results into databases, queues, or various file formats.
Outbound IP Rotation - Rotates outbound IP addresses at defined frequencies to avoid rate limits and IP bans.
Automated Login Bypasses - Simulates login sequences to programmatically access protected web content.
Crawl State Recovery - Implements breakpoint-based recovery to resume large-scale data collection from the last successful state after system failures.
Request Deduplication - Includes a request deduplication pipeline to prevent redundant network calls and infinite crawling loops.
Crawl Task Managers - Offers crawl task management to pause, cancel, and run scraping jobs concurrently in batches.
Data Processing and Machine Learning - Provides a distributed framework for web crawling and data extraction.

Open-source alternatives to Pholcus

Similar open-source projects, ranked by how many features they share with Pholcus.

apify/crawlee
GitHub stars: 24,002
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Language: TypeScript; topics: apify, automation, crawler
apify/crawlee-python
GitHub stars: 8,097
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Language: Python; topics: apify, automation, beautifulsoup
henrylee2cn/pholcus
GitHub stars: 7,578
Pholcus is a distributed web crawler framework written in Go designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v
Language: Go
binux/pyspider
GitHub stars: 16,809
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Language: Python

See all 30 alternatives to Pholcus
