Scrapy Redis

Features

Distributed Crawl Coordination - Coordinates multiple Scrapy spiders that share a Redis-backed request queue and deduplicate URLs across workers.
Redis-Backed Schedulers - Replaces Scrapy's in-memory scheduler with a Redis-backed one that coordinates request distribution across workers.
Redis - Pushes scraped data into Redis queues for downstream batch processing or consumption by separate services.
Web Crawl Schedulers - Implements a distributed scheduler that coordinates crawl request distribution across multiple Scrapy spider workers via Redis.
Scraped Item Sinks - Pushes scraped items into a Redis list for asynchronous consumption by downstream processors.
Scraped Data Storage - Persists each scraped item into a Redis list for later batch processing or consumption by other services.
Redis Item Queues - Pushes scraped items into a Redis queue so separate processes can consume and process them independently.
Visited URL Sets - Uses a Redis set to track seen URLs across all workers, preventing duplicate crawls.
Distributed Job Execution - Shares a Redis-backed request queue among multiple spider instances so each worker picks the next unprocessed URL.
Crawl Request Schedulers - Pushes new URLs into a Redis list and lets any connected spider consume them, enabling dynamic feed-in of crawl targets.
Crawl State Recovery - Saves crawled URLs and pending requests in Redis to survive restarts and enable resume.
Redis-Backed Queues - Stores pending crawl requests in Redis lists so multiple spiders can consume them concurrently.
Crawl Request Injectors - Reads JSON payloads from Redis and converts them into structured HTTP requests with metadata and cookies.
Crawl Request Queues - Provides a Redis-backed queue that stores and distributes HTTP crawl requests across multiple Scrapy spider workers.
URL Duplicate Filters - Ships a Redis-based duplicate filter that prevents the same URL from being crawled twice across distributed spider workers.
Crawl Request Deduplications - Uses a Redis set to filter duplicate URLs across all running spiders, preventing the same page from being crawled twice.
Dynamic URL Injections - Feeds new crawl targets into a shared Redis queue from external processes while spiders consume them in real time.
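
The scheduler, duplicate-filter, item-sink, and crawl-state features listed above are usually switched on through a project's Scrapy settings. The following is a minimal sketch of such a settings.py, not a definitive configuration: the class paths are the ones scrapy-redis documents, while the Redis URL and the pipeline priority of 300 are assumptions chosen for illustration.

```python
# settings.py (sketch): route scheduling, deduplication, and item storage through Redis.

# Replace Scrapy's in-memory scheduler with the Redis-backed one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share one duplicate filter, backed by a Redis set, across all workers.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and seen-set in Redis when a spider closes,
# so an interrupted crawl can resume after a restart.
SCHEDULER_PERSIST = True

# Push each scraped item into a Redis list for downstream consumers.
# The priority value 300 is an arbitrary choice for this sketch.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Assumed local Redis instance; replace with your own connection URL.
REDIS_URL = "redis://localhost:6379/0"
```

With settings along these lines, every spider process pointed at the same Redis instance shares one request queue and one seen-set, which is what allows workers to be added or restarted without re-crawling pages.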

Open-source alternatives to Scrapy Redis

Similar open-source projects, ranked by how many features they share with Scrapy Redis.

rolando/scrapy-redis
5,639 GitHub stars
This project is a distributed web crawling framework that enables the horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and utilizes a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings within the shared remote store. The framework also provides capabilities for distributed data pro
Python
taskforcesh/bullmq
8,432 GitHub stars
BullMQ is a Redis-backed message queue library and background processor designed for distributed task queueing. It functions as a distributed queue manager and task scheduler, utilizing Redis to manage asynchronous job processing and persistence. The system distinguishes itself through its role as a job workflow orchestrator, enabling the definition of complex parent-child job dependencies and hierarchies for multi-step workflows. It provides sandboxed process execution to isolate heavy workloads and prevent event loop blocking, alongside distributed rate limiting to protect downstream servic
TypeScript · background-jobs · elixir · nodejs
contribsys/faktory
6,089 GitHub stars
Faktory is an open-source work server that queues, dispatches, and manages background jobs across multiple programming languages. It stores job payloads as JSON hashes in a Redis-backed queue and provides language-specific client and worker libraries that enable any language to push jobs to the server or fetch and execute them. The server includes a batch workflow orchestrator that groups jobs into batches with completion tracking for coordinating multi-step asynchronous workflows. It features a configurable job uniqueness filter that prevents duplicate enqueues within a time window, an expon
Go
henrylee2cn/pholcus
7,578 GitHub stars
Pholcus is a distributed web crawler framework written in Go designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v
Go

See all 30 alternatives to Scrapy Redis

rmax/scrapy-redis
