This project is a distributed web crawling framework that enables the horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and utilizes a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings within the shared remote store. The framework also provides capabilities for distributed data pro
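The shared queue and duplicate filter described above can be sketched as follows. This is a minimal illustration, not the framework's actual code: `SharedStore` is a hypothetical in-memory stand-in for the Redis instance, and `fingerprint`, `enqueue`, and `dequeue` are names assumed for this example. In a real deployment the membership check and insert would need to be a single atomic operation on the remote store, so that two workers cannot both claim the same page.

```python
import hashlib
import json
from collections import deque


class SharedStore:
    """Hypothetical in-memory stand-in for the shared remote store."""

    def __init__(self):
        self.queue = deque()  # plays the role of the shared request queue
        self.seen = set()     # plays the role of the duplicate-filter set


def fingerprint(method, url):
    """Reduce a request to a stable key so that duplicates collide."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


def enqueue(store, url, method="GET", meta=None):
    """Queue a request unless some worker has already seen it."""
    fp = fingerprint(method, url)
    if fp in store.seen:
        return False  # another worker already scheduled this page
    store.seen.add(fp)
    # Request state and metadata are persisted as a JSON string.
    store.queue.append(json.dumps({"url": url, "method": method, "meta": meta or {}}))
    return True


def dequeue(store):
    """Pop the next request and decode it, or return None when the queue is empty."""
    if not store.queue:
        return None
    return json.loads(store.queue.popleft())
```

Any number of workers can call `enqueue` and `dequeue` against the same store. The second attempt to enqueue a URL returns `False` and leaves the queue unchanged.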
BullMQ is a Redis-backed message queue library and background processor designed for distributed task queueing. It functions as a distributed queue manager and task scheduler, utilizing Redis to manage asynchronous job processing and persistence. The system distinguishes itself through its role as a job workflow orchestrator, enabling the definition of complex parent-child job dependencies and hierarchies for multi-step workflows. It provides sandboxed process execution to isolate heavy workloads and prevent event loop blocking, alongside distributed rate limiting to protect downstream servic
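The parent-child dependency idea can be illustrated with a small in-process sketch. This is not BullMQ's API: `Job`, `run_flow`, and the handler table are hypothetical names, and everything runs synchronously in one process rather than through Redis. It only shows the ordering guarantee, namely that a parent job runs after all of its children and receives their results.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job node; children must complete before the parent runs."""
    name: str
    children: list = field(default_factory=list)


def run_flow(job, handlers, order):
    """Run every child first, then the parent with the children's return values."""
    child_values = [run_flow(child, handlers, order) for child in job.children]
    value = handlers[job.name](child_values)
    order.append(job.name)  # record completion order for inspection
    return value
```

For example, a `report` job with two `fetch` children only runs once both fetches have returned, and it receives their values as a list.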
Faktory is an open-source work server that queues, dispatches, and manages background jobs across multiple programming languages. It stores job payloads as JSON hashes in a Redis-backed queue and provides language-specific client and worker libraries that enable any language to push jobs to the server or fetch and execute them. The server includes a batch workflow orchestrator that groups jobs into batches with completion tracking for coordinating multi-step asynchronous workflows. It features a configurable job uniqueness filter that prevents duplicate enqueues within a time window, an expon
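A uniqueness filter with a time window can be sketched as below. This is an assumption-laden illustration rather than Faktory's implementation: `UniquenessFilter`, `try_enqueue`, and the payload field names are hypothetical, and the lock table lives in local memory instead of the server's store. The clock is injectable so the window can be exercised without sleeping.

```python
import hashlib
import json
import time


class UniquenessFilter:
    """Hypothetical filter that rejects duplicate enqueues within a time window."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._locks = {}  # job key -> time at which the lock expires

    def _key(self, job_type, args):
        """Derive a stable key from the job type and its arguments."""
        payload = json.dumps([job_type, args], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def try_enqueue(self, job_type, args, queue):
        """Append the job as a JSON payload unless an identical one is still locked."""
        now = self.clock()
        key = self._key(job_type, args)
        expiry = self._locks.get(key)
        if expiry is not None and expiry > now:
            return False  # identical job enqueued within the window
        self._locks[key] = now + self.window
        queue.append(json.dumps({"job_type": job_type, "args": args}))
        return True
```

With a 10-second window, a second identical enqueue is rejected until the clock reaches the lock's expiry, while a job with different arguments is accepted immediately.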
Pholcus is a distributed web crawler framework written in Go designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v