30 open-source projects similar to airbnb/airflow, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Airflow alternative.
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
Conductor is a durable workflow engine designed to orchestrate complex, long-running business processes and autonomous agent loops. It functions as a stateful execution platform that persists the entire history of a process, ensuring that workflows remain reliable and recoverable across infrastructure failures, system restarts, and transient network errors. By managing task lifecycles, worker polling, and state transitions, it provides a centralized coordination layer for distributed systems. The platform distinguishes itself through its specialized support for AI agent orchestration, allowin
Osmedeus is a security workflow orchestration engine that coordinates AI agents, shell commands, and scanning tools through declarative YAML pipelines. It functions as a distributed security scanner, a declarative workflow automator, and an AI agent framework for security, enabling automated multi-step security analysis with conditional branching, parallel execution, and distributed workers. The engine distinguishes itself through a hybrid runner model that executes workflow steps on the local host, inside Docker containers, or over SSH to remote machines, selected per step or module. It supp
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Kestra is a declarative workflow orchestrator designed to manage complex task dependencies and automated processes through versioned configuration files. It functions as a distributed platform that decouples task scheduling from execution by offloading computational workloads to a fleet of worker nodes. The system uses a reactive, event-driven engine to initiate workflows automatically in response to external signals, webhooks, schedules, or file system changes. The platform distinguishes itself through a modular plugin architecture that allows for the integration of custom tasks and external
Hatchet is an open-source durable workflow engine and task orchestration platform. It provides a framework for building and executing fault-tolerant, multi-step pipelines as directed acyclic graphs (DAGs), with automatic retries, scheduling, and real-time observability. The system is built around durable task checkpointing, which persists execution state after each step so work can resume from the last checkpoint after a worker crash or restart, and it supports event-driven task resumption that pauses a task until a matching external event arrives. The platform distinguishes itself through it
Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production. The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized proje
Bull is a Node.js library for managing distributed jobs and message queues using Redis as the primary data store. It functions as a distributed task worker, job scheduler, and priority queue manager designed to handle asynchronous workloads across multiple processes. The project distinguishes itself by providing a persistent communication channel that decouples servers through the exchange of serializable data objects. It ensures distributed system reliability by detecting stalled tasks and recovering from process crashes to ensure every queued job is completed. The system covers a broad ran
Azkaban is a distributed workflow manager and DAG-based job orchestrator designed as an enterprise batch processor. It serves as a Java-based workflow engine that schedules and executes complex job sequences across a cluster of executor servers, with specific functionality for managing big data workloads on Hadoop clusters. The system distinguishes itself through a distributed executor model that coordinates state via a shared database to ensure high availability. It employs a plugin-based architecture that allows for custom job types and system functionality extensions, including the ability
This project is a collection of learning resources and instructional guides for implementing asynchronous messaging patterns using RabbitMQ. It provides a series of tutorials and runnable code examples focused on the Advanced Message Queuing Protocol to help users decouple services via a message broker. The resources cover practical implementation patterns including request-reply, pub-sub, and stream processing. These guides demonstrate how to use official client libraries to balance worker loads, route messages across multiple consumers in a distributed system, and deploy high availability b
Machinery is a distributed task queue and asynchronous workflow engine. It provides a system for processing heavy workloads outside the main request flow using a network of distributed background workers and a message-based job orchestrator. The project manages complex task lifecycles through sequential chaining, where results are passed between tasks, and parallel coordination, which can trigger callback tasks upon the completion of a group. It supports periodic workflow scheduling for recurring jobs and delayed execution via specific timestamps. The system includes capabilities for result
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Kedro is a data science pipeline framework and production toolbox designed to build reproducible, modular workflows using software engineering best practices. It functions as a data engineering orchestrator and catalog manager, bridging the gap between interactive analysis and maintainable production pipelines. The framework distinguishes itself by using a data catalog to decouple data access from processing logic and providing tools to transition analysis from interactive notebooks into structured workflows. It includes a workflow visualization tool that generates visual maps of data pipelin
Great Expectations is a data quality testing framework and observability platform designed to monitor the reliability of data pipelines. It provides a structured environment for defining, documenting, and automating data quality assertions, allowing teams to validate datasets against expected structure and content before they move through downstream processes. The project distinguishes itself through a declarative domain-specific language that stores quality rules as version-controlled configuration files. It utilizes an execution engine abstraction to translate these high-level assertions in
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
MaaAssistantArknights is a cross-platform automation engine designed for mobile games, utilizing computer vision and input simulation to perform routine tasks. It functions as an Android emulator controller, managing game lifecycles, resource farming, and infrastructure optimization through structured, scripted workflows. The project distinguishes itself through a modular configuration system that allows users to define complex automation logic via external instruction files. This framework supports dynamic task modification, configuration inheritance, and schema validation, ensuring that cus
DVC is a data versioning tool and pipeline orchestrator designed to track large datasets and machine learning models using external storage and metadata pointers. It integrates with Git by utilizing placeholders to keep heavy artifacts out of the repository while maintaining a versioned link between code and data. The system manages remote data caches through a synchronization layer that connects local environments to cloud storage or network filesystems. It also functions as an experiment tracker, recording hyperparameters and metrics to compare the performance of different model iterations.
Dramatiq is a distributed task queue and workload manager used to offload function execution to background workers. It functions as an asynchronous task orchestrator that enables the distribution of computational tasks across a cluster using a pluggable transport layer supporting RabbitMQ and Redis. The framework provides specialized tools for complex task orchestration, including the ability to link background jobs into sequences, pipelines, and barriers. It further manages distributed concurrency through the use of shared mutexes, rate limiters, and exponential backoff retries to prevent re
Apache NiFi is a flow-based programming platform that enables the visual design, monitoring, and management of data pipelines. At its core, it provides a web-based visual dataflow designer where users build directed graphs of processors to route, transform, and mediate data movement between any source and destination without writing custom code. The system records fine-grained data provenance for every data item from ingestion to delivery, supporting audit, debugging, and replay of data lineage. The platform distinguishes itself through a zero-master cluster architecture that distributes proc
Streem is a stream-based programming language and data pipeline orchestrator. It provides a domain-specific language for defining concurrent data flows, allowing users to link data sources to destinations through a sequence of operations that transform and filter individual stream elements. The system uses a custom script syntax to define data-flow connections and pipeline definitions. This allows for the orchestration of concurrent data processing where multiple pipeline stages execute simultaneously to move data elements through the system. The platform covers functional data transformatio
Mage AI is a Python-based data pipeline orchestrator and self-hosted data integrated development environment. It is designed for building, scheduling, and monitoring data workflows using a block-based pipeline design and interactive notebook interface. The platform distinguishes itself by integrating generative AI capabilities, allowing users to connect large language model providers via API to incorporate artificial intelligence into automated data streams. It also functions as an Apache Spark data processor, managing the kernels and infrastructure required for high-volume analytics and larg
Orchest is a data pipeline orchestrator and containerized workflow manager. It provides a platform for designing, scheduling, and executing complex data processing sequences through a combination of a graphical interface and scripting. The platform distinguishes itself by using containers to manage software dependencies, ensuring consistent execution across different environments. It features a polyglot task scheduler capable of triggering jobs written in multiple programming languages and includes a version control system that tracks historical snapshots of project configurations and code.
Asynq is a distributed background job processing framework for Go applications. It manages asynchronous task queues by offloading heavy operations to persistent storage, allowing the main application to remain responsive while background workers handle workloads. The system utilizes Redis to manage task state, concurrency, and message distribution across multiple worker instances. It employs atomic Lua scripting and sorted sets to ensure reliable job acquisition, precise scheduling of delayed tasks, and fault-tolerant processing through a two-stage acknowledgement flow. The framework support
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Airbyte is a data integration platform designed to synchronize information between diverse applications, databases, and data warehouses. It functions as an extract, transform, and load orchestrator that manages automated data movement workflows across cloud, on-premise, and hybrid environments. The platform provides a standardized interface for connectors, enabling the movement of structured and unstructured data while maintaining stateful checkpoints for reliable incremental syncing. The platform distinguishes itself through a containerized architecture that isolates connectors to prevent de
DolphinScheduler is a distributed workflow orchestrator designed to manage and automate complex data processing pipelines. It functions as a data pipeline scheduler that coordinates multi-step tasks across distributed environments, ensuring reliable execution through defined dependencies and sequences. The platform utilizes a directed acyclic graph model to represent workflows, allowing users to define task relationships via a visual interface. It employs a master-worker architecture supported by a pluggable task plugin system, which enables the dynamic extension of task types without requiri
Joyagent-jdgenie is an automated data orchestrator designed to centralize the retrieval and processing of information from disparate remote sources. It functions as a framework for building repeatable data pipelines that fetch, clean, and normalize raw input into consistent, structured formats. The system utilizes a schema-driven engine to apply validation rules and structural templates to incoming data, ensuring compatibility across enterprise systems. By employing configuration-based workflow definitions, it allows for the orchestration of modular tasks into automated execution flows, separ
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task