30 open-source projects similar to rmax/scrapy-redis, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Scrapy Redis alternative.
This project is a distributed web crawling framework that enables the horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and utilizes a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings within the shared remote store. The framework also provides capabilities for distributed data pro
BullMQ is a Redis-backed message queue library and background processor designed for distributed task queueing. It functions as a distributed queue manager and task scheduler, utilizing Redis to manage asynchronous job processing and persistence. The system distinguishes itself through its role as a job workflow orchestrator, enabling the definition of complex parent-child job dependencies and hierarchies for multi-step workflows. It provides sandboxed process execution to isolate heavy workloads and prevent event loop blocking, alongside distributed rate limiting to protect downstream servic
Faktory is an open-source work server that queues, dispatches, and manages background jobs across multiple programming languages. It stores job payloads as JSON hashes in a Redis-backed queue and provides language-specific client and worker libraries that enable any language to push jobs to the server or fetch and execute them. The server includes a batch workflow orchestrator that groups jobs into batches with completion tracking for coordinating multi-step asynchronous workflows. It features a configurable job uniqueness filter that prevents duplicate enqueues within a time window, an expon
Pholcus is a distributed web crawler framework written in Go designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v
Osmedeus is a security workflow orchestration engine that coordinates AI agents, shell commands, and scanning tools through declarative YAML pipelines. It functions as a distributed security scanner, a declarative workflow automator, and an AI agent framework for security, enabling automated multi-step security analysis with conditional branching, parallel execution, and distributed workers. The engine distinguishes itself through a hybrid runner model that executes workflow steps on the local host, inside Docker containers, or over SSH to remote machines, selected per step or module. It supp
Bull is a Node.js library for managing distributed jobs and message queues using Redis as the primary data store. It functions as a distributed task worker, job scheduler, and priority queue manager designed to handle asynchronous workloads across multiple processes. The project distinguishes itself by providing a persistent communication channel that decouples servers through the exchange of serializable data objects. It improves reliability by detecting stalled jobs and recovering from process crashes so that every queued job is completed. The system covers a broad ran
This project is a comprehensive educational guide and framework for building web scrapers using Python. It provides a course-based approach to data extraction, combining a Python crawler framework with tutorials on web reverse engineering and network traffic analysis. The project distinguishes itself by covering advanced extraction challenges, including the decryption of obfuscated JavaScript and the bypass of anti-scraping measures. It specifically addresses mobile application scraping through the simulation of user interactions and the interception of network traffic. The capability surfac
Resque is a Ruby library for enqueueing and processing asynchronous tasks using Redis as a data store. It functions as a distributed task processor and queue manager, allowing long-running work to be moved out of the main request cycle. The system executes background jobs in isolated child processes to prevent memory leaks and provides a web-based dashboard for monitoring queue depths, worker activity, and failed job statistics. Capability areas include distributed worker coordination via signals, error handling with job retry mechanisms, and priority-ordered queue management. It also suppor
Sidekiq is a Ruby background processing framework and asynchronous task runner. It offloads heavy or time-consuming work from web requests to separate worker processes so that the main application remains responsive. Pending jobs are stored in Redis and processed concurrently by multiple threads. It supports distributed task queueing and asynchronous job scheduling to coordinate work across multiple server instances. The project covers Ruby application scaling by executing ba
rq is a distributed task queue and background worker system for Python that uses a Redis backend to decouple task submission from execution. It functions as a reliable message queue and task scheduler, allowing Python functions or asyncio coroutines to be processed asynchronously across multiple worker processes. The project distinguishes itself through reliable queuing mechanisms that prevent job loss during worker crashes using atomic operations. It provides specialized orchestration capabilities, including the prevention of duplicate jobs, job execution prioritization, and the ability to m
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Azkaban is a distributed workflow manager and DAG-based job orchestrator designed as an enterprise batch processor. It serves as a Java-based workflow engine that schedules and executes complex job sequences across a cluster of executor servers, with specific functionality for managing big data workloads on Hadoop clusters. The system distinguishes itself through a distributed executor model that coordinates state via a shared database to ensure high availability. It employs a plugin-based architecture that allows for custom job types and system functionality extensions, including the ability
Quartz.NET is a job scheduler for .NET applications designed to schedule and execute programmatic tasks. It functions as a distributed, enterprise-grade task orchestrator, capable of managing recurring jobs with concurrency limits and complex intervals. The system provides high availability through a clustered execution model that balances load and provides failover redundancy across multiple server instances. It utilizes a relational database job store to persist job and trigger states, ensuring that scheduled tasks survive application restarts. The framework includes capab
Asynq is a distributed background job processing framework for Go applications. It manages asynchronous task queues by offloading heavy operations to persistent storage, allowing the main application to remain responsive while background workers handle workloads. The system utilizes Redis to manage task state, concurrency, and message distribution across multiple worker instances. It employs atomic Lua scripting and sorted sets to ensure reliable job acquisition, precise scheduling of delayed tasks, and fault-tolerant processing through a two-stage acknowledgement flow. The framework support
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live
Kue is a Redis-backed job queue library for Node.js that provides a complete system for defining, scheduling, and processing background work. It stores job metadata and state in Redis lists and sorted sets, enabling persistent, in-memory operations with configurable concurrency control and priority-sorted processing. The library includes a RESTful HTTP API for managing jobs and a web-based monitoring dashboard for inspecting job status, progress, and logs. The system distinguishes itself through its event-driven worker model, where workers listen for job events via Redis pub/sub and process j
Hydro is an online judge platform and competitive programming management system. It provides the infrastructure to host programming contests, manage a library of programming problems, and evaluate code submissions against predefined test cases and time limits. The system utilizes a distributed code execution engine that scales judging tasks across multiple worker nodes to process high volumes of submissions. It is built as a modular judge framework, employing a plugin-based architecture that allows for the extension of system functionality without modifying the core source code. The platform
This is a distributed voting application designed to demonstrate a multi-service architecture. It uses stateless web frontends for submitting votes and viewing live results, with a Redis-backed queue to buffer incoming votes and a PostgreSQL database for persistent tallying. The application is built around asynchronous message queue processing, decoupling the vote submission from the tallying workflow. The project showcases how to deploy a multi-service application using container orchestration tools. It provides YAML-driven declarative deployment manifests for Docker Compose, Docker Swarm, a
Ignite is a distributed in-memory data grid and compute platform. It functions as a distributed SQL database and storage engine designed to store and process large datasets in RAM to minimize latency and increase calculation speed. The system is distinguished by a multi-tier storage engine that manages data placement across memory and disk to balance high-speed access with large capacity. It features a distributed compute grid that executes custom logic directly on the nodes where data resides to reduce network traffic. The platform provides a broad set of capabilities including ACID transac
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
pysheeet is a technical reference library providing a curated collection of code snippets and implementation patterns for advanced Python development, system integration, and high-performance computing. It serves as a comprehensive guide for implementing low-level network programming, native C extensions, and asynchronous and concurrent programming. The project provides specialized frameworks for the development and deployment of large language models, including tools for distributed GPU inference and high-performance serving. It also includes detailed patterns for high-performance computing
Hakrawler is a command-line web spider tool designed for security reconnaissance, built to crawl target websites and extract hyperlinks along with JavaScript file references. As a focused reconnaissance utility, it collects every discoverable URL and script source from a given domain, mapping the attack surface for penetration testing and vulnerability assessment. The tool differentiates itself through its concurrent architecture: a fixed-size goroutine pool fetches pages in parallel, while CSS selectors parse HTML to extract anchor and script references. A depth-aware recursion limiter preve
GoCD is a continuous delivery server and build automation platform designed to orchestrate software delivery pipelines. It functions as a CD pipeline orchestrator that manages the automated execution of build, test, and deployment stages to move code from commit to production. The system utilizes an agent-based job execution model where remote agents pull work from a central server via polling. It employs a state-machine approach to pipeline orchestration, tracking the progression of software through stages and managing immutable build outputs via a central artifact repository to ensure consi
Distribute crawler is a distributed web scraping framework that integrates with Scrapy to coordinate multiple crawler instances across clusters. It utilizes a centralized task queue to manage and scale concurrent data collection operations, enabling horizontal scaling of scraping tasks across multiple worker nodes. The framework distinguishes itself through its focus on large-scale data management and traffic control. It persists scraped items and binary assets into document-oriented database clusters, utilizing deduplication logic to optimize bandwidth and storage. To maintain consistent dat
This project is a learning curriculum and programming guide for Apache Spark, providing a structured set of educational resources and practical code examples for mastering distributed data processing. It serves as a course for building scalable data workflows and big data engineering pipelines. The repository provides practical source code and project layouts that demonstrate how to connect external data stores, process streaming data, and organize code for distributed environments. It includes implementation examples for scaling machine learning algorithms across clusters to handle large tra
Oban is a distributed background job processing system and task scheduler that uses PostgreSQL for transactional job storage and reliable execution across multiple nodes. It serves as a PostgreSQL-backed background worker and job queue, coordinating task execution and concurrency through a relational database to ensure delivery guarantees. The system differentiates itself through a distributed workflow orchestrator capable of managing multi-step processing pipelines, dependent job sequencing, and shared context. It provides advanced orchestration tools including job batching, chunked processi
Horizon is a background job orchestrator and worker manager for Redis queues. It provides a monitoring dashboard to track job throughput, wait times, and failure rates, alongside a system for managing job retries, execution timeouts, and worker distribution. The project distinguishes itself through a Redis-backed monitoring interface that identifies system bottlenecks and a queue alerting system that sends notifications when background job wait times exceed defined thresholds. Worker processes are managed via version-controlled configuration files to ensure consistent balancing and scaling ac
Dora is a robotics dataflow framework and distributed orchestrator used to build and manage processing pipelines. It enables the deployment of robotics workloads across clusters with remote node execution and provides a real-time data pipeline for predictable performance. The system is distinguished by its support for multi-language nodes written in Rust, Python, C, or C++ that interoperate within a single dataflow. It utilizes a zero-copy shared-memory transport and columnar formats to minimize latency for large payloads, and it includes bidirectional bridges to integrate with external ecosy
APScheduler is a Python task scheduler designed to execute functions at specific times or recurring intervals. It functions as an asynchronous background scheduler and distributed job dispatcher, allowing tasks to run concurrently with application lifecycles and web server request handling. The system distinguishes itself through a persistent job store that saves schedules and task states in external databases, ensuring continuity across process restarts. It separates task scheduling from execution by dispatching jobs to distributed workers in separate processes to prevent execution bottlenec