Airflow

Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments.
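
The role of the directed acyclic graph can be illustrated with a small sketch. It uses only the Python standard library (`graphlib`, Python 3.9+) and is not Airflow's own API; the task names and the `execution_order` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}

def execution_order(deps):
    """Return one valid run order for the graph.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why the graph has to be acyclic.
    """
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

In this sketch "extract" always runs first and "load" always runs last, while "transform" and "validate" have no ordering between them and could run in parallel.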

The platform's provider-based architecture decouples core orchestration logic from external service integrations, so users can connect cloud services, databases, and storage systems through custom plugins and provider packages. A distributed task queue enables horizontal scaling, while a centralized scheduler and task state tracked in a metadata database provide fault tolerance and visibility across large-scale infrastructure.
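
The idea of dispatching tasks to a pool of workers while recording their state centrally can be sketched as follows. This is an in-process stand-in built on `queue` and `threading`, not Airflow's executor; the `run_with_workers` function and the state names are assumptions made for the example, and a plain dictionary stands in for the metadata store.

```python
import queue
import threading

def run_with_workers(tasks, num_workers=3):
    """Dispatch callables to a worker pool and record each task's final state."""
    work = queue.Queue()
    state = {name: "queued" for name in tasks}  # stand-in for the metadata store
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                work.task_done()
                return
            name, fn = item
            try:
                fn()
                result = "success"
            except Exception:
                result = "failed"
            with lock:
                state[name] = result
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for name, fn in tasks.items():
        work.put((name, fn))
    for _ in threads:
        work.put(None)  # one sentinel per worker
    work.join()
    for t in threads:
        t.join()
    return state
```

Raising `num_workers` is the in-process analogue of adding worker nodes: the queue and the state store stay the same, and a failing task is recorded as failed without stopping the other workers.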

Beyond core scheduling, the project provides a web-based interface for pipeline visualization, status tracking, and source code inspection. It integrates with external secret management services so that credentials need not be stored in its internal database, and it offers administrative control through both a command-line interface and a programmatic API. The system is designed for containerized deployment and provides tools for building optimized images and managing complex dependency environments.

Features

Data Pipeline Orchestrators - A platform that schedules, monitors, and manages complex sequences of data processing tasks across distributed computing environments.
Workflow Orchestration - Describe the sequence and dependencies of automated tasks using a structured configuration format to manage complex business processes across distributed environments.
Workflow Orchestration Engines - Managing the lifecycle of recurring business processes by executing code-defined task dependencies and handling state persistence across distributed environments.
Workflow Orchestrators - Provides a platform for authoring, scheduling, and monitoring complex data pipelines using directed acyclic graphs.
Batch Processing Schedulers - Define and monitor complex data pipelines using code-based configurations that support dynamic task generation to automate recurring business processes.
Data Processing Workflows - Execute complex data analysis and graph traversals against distributed stores to incorporate advanced insights directly into automated data processing workflows.
Distributed Task Schedulers - Distributing and managing the execution of batch processing jobs across large clusters to ensure reliable data transformation and efficient resource utilization.
Task Schedulers - A distributed execution environment that manages task distribution and resource allocation across containerized clusters and cloud-native infrastructure.
Batch Processing Engines - Orchestrates batch workflows defined as code with centralized monitoring.
Plugin Architectures - Build custom commands, task links, and connection types to add specialized features and third-party service integrations that meet unique operational requirements.
Data Integration Tools - Automate the movement of information between disparate storage locations and distributed file systems to ensure data pipelines remain consistent and up to date.
Workflow Authoring Frameworks - Define workflows and execute tasks in isolated subprocesses using native interfaces that separate business logic from the underlying execution environment.
Distributed Task Queues - Enables horizontal scaling by dispatching tasks to a pool of distributed workers.
Secret Management Integrations - Connect external secret management services to securely store and retrieve credentials and configuration settings instead of using an internal database.
Workflow Monitoring Systems - Tracking the status of automated processes through centralized logging, custom alert notifications, and system dashboards for improved visibility and troubleshooting.
Data Analysis and Processing - Platform for authoring and scheduling workflows.
Data Pipelines - Orchestrates complex data workflows with a scheduler and UI.
Workflow Orchestration - Programmatic platform for authoring and scheduling data workflows.
Automation - Platform for programmatically authoring, scheduling, and monitoring workflows.
Automation Tools - Listed in the “Automation Tools” section of the Awesome Selfhosted awesome list.
Build and CI/CD - Platform for programmatically authoring and scheduling workflows.
Data Engineering - Platform for authoring and monitoring data workflows.
DevOps and Infrastructure - Platform for authoring and scheduling data pipelines.
Infrastructure Management - Programmatically author, schedule, and monitor complex data workflows.
Job Schedulers - Programmatic authoring and monitoring of workflows.
Distributed Processing Engines - Submit and manage analytical queries and batch transformation jobs on remote clusters to handle large-scale data workloads efficiently and reliably.
Cloud Infrastructure Orchestration - Authenticate and manage resource allocation across cloud infrastructure providers to control remote computing tasks from a single centralized point.
Integration Frameworks - Connecting diverse external cloud services, databases, and storage systems through a modular architecture that supports custom plugins and provider packages.
Pipeline Monitoring Dashboards - Provides a web interface for visualizing pipeline status, asset dependencies, and source code.
Database Connectors - Execute queries and perform data operations across multi-model and distributed database instances to interact with persistent storage layers during task execution.
Command Line Interfaces - Perform system operations and monitor workflow status using a command-line interface to control environments directly from a terminal window.
Cloud Service Integrations - Connecting automated workflows to diverse cloud services and managed platforms to handle resource allocation, data movement, and job execution.
Administrative APIs - Exposes comprehensive programmatic interfaces for managing system operations and workflow configurations.
Connection Management - Create custom connection types with specialized forms and field handling logic to manage external service credentials and configuration settings securely.
Data Lake Management - Perform data retrieval and metadata operations within distributed file systems and data lakes to maintain organized and accessible information repositories.
Metadata Management Systems - Ensures fault tolerance and state persistence by tracking task execution status in a relational database.
Provider Integrations - Contribute to provider packages within the monorepo by understanding distribution structures, dependency management, and the integration of optional extras into the core system.
Secret Management - Retrieve and handle sensitive credentials from external security services during task execution to ensure authentication tokens remain protected throughout the workflow lifecycle.
Plugin Frameworks - Decouples core logic from external services using a modular provider-based framework.
Alerting Systems - Define custom notification channels to receive automated alerts and status updates regarding the execution progress of tasks and workflows.
Centralized Logging Systems - Save and retrieve task execution logs using centralized external services to simplify troubleshooting and log management across complex distributed computing environments.
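
The "Metadata Management Systems" entry above describes tracking task execution status in a relational database. A minimal sketch of that idea, using an in-memory SQLite database, might look like this. The `task_state` table, its columns, and the helper functions are hypothetical and do not reflect Airflow's actual schema.

```python
import sqlite3

def open_store():
    """Create an in-memory database with a hypothetical task_state table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_state (task_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn

def set_state(conn, task_id, state):
    """Insert or overwrite the recorded state of one task."""
    conn.execute(
        "INSERT OR REPLACE INTO task_state (task_id, state) VALUES (?, ?)",
        (task_id, state),
    )
    conn.commit()

def unfinished(conn):
    """List tasks that a restarted scheduler would still need to handle."""
    rows = conn.execute(
        "SELECT task_id FROM task_state WHERE state != 'success' ORDER BY task_id"
    )
    return [row[0] for row in rows]
```

Because every state change is written to the database, a scheduler that restarts after a failure could query `unfinished` instead of rerunning the whole pipeline.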

spotify/luigi

18,676 GitHub stars

Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t

PrefectHQ/prefect

21,640 GitHub stars

Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep

kestra-io/kestra

27,073 GitHub stars

Kestra is a declarative workflow orchestrator designed to manage complex task dependencies and automated processes through versioned configuration files. It functions as a distributed platform that decouples task scheduling from execution by offloading computational workloads to a fleet of worker nodes. The system uses a reactive, event-driven engine to initiate workflows automatically in response to external signals, webhooks, schedules, or file system changes. The platform distinguishes itself through a modular plugin architecture that allows for the integration of custom tasks and external

dagster-io/dagster

14,974 GitHub stars

Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
