Open-source libraries and distributed systems designed for building, orchestrating, and managing high-throughput data processing workflows.
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Airflow is a comprehensive workflow orchestration platform that uses code-as-configuration to manage complex ETL/ELT pipelines, featuring robust distributed execution, extensive data connectors, and built-in monitoring for large-scale data environments.
DolphinScheduler is a distributed workflow orchestrator designed to manage and automate complex data processing pipelines. It functions as a data pipeline scheduler that coordinates multi-step tasks across distributed environments, ensuring reliable execution through defined dependencies and sequences. The platform utilizes a directed acyclic graph model to represent workflows, allowing users to define task relationships via a visual interface. It employs a master-worker architecture supported by a pluggable task plugin system, which enables the dynamic extension of task types without requiri
DolphinScheduler is a comprehensive, distributed workflow orchestration platform that provides the visual DAG management, task scheduling, and observability features required to manage complex ETL/ELT pipelines at scale.
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Luigi is a mature Python-based orchestration framework that provides the core features required for managing complex, dependency-driven data pipelines, including distributed execution, monitoring, and robust retry logic.
Benthos is a stream processing engine and data integration pipeline used for routing, transforming, and connecting data streams between diverse sources and sinks. It functions as event routing middleware and a change data capture tool, streaming real-time database modifications as discrete events for downstream processing. The system utilizes a declarative pipeline configuration, where data flow and processing logic are defined in a single static file. It features a specialized domain-specific language for mapping, filtering, and enriching data payloads, allowing for complex transformations w
Benthos is a high-performance stream processing engine that handles data integration and transformation through declarative configuration, serving as a robust tool for building real-time data pipelines even though it focuses more on streaming than on traditional batch-oriented workflow orchestration.
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Prefect is a comprehensive workflow orchestration platform that enables distributed execution, complex pipeline management, and observability through a code-as-configuration approach, perfectly matching your requirements for scaling data operations.
Flyte is a distributed machine learning pipeline manager and MLOps workflow engine. It functions as a Kubernetes-native orchestrator used to coordinate data, models, and compute resources for executing machine learning pipelines and autonomous agents at scale. The platform provides specialized infrastructure for the full machine learning lifecycle, including a dedicated model serving platform to deploy trained models as scalable production-ready inference services. It also enables the coordination and state management of autonomous AI agents. The system manages scalable pipeline execution th
Flyte is a Kubernetes-native orchestration framework that manages complex, distributed data and machine learning workflows using DAGs, container-based isolation, and robust state management.
Tenacity is a Python retry library and fault tolerance framework designed to automatically re-execute failing functions based on custom conditions, wait intervals, and stop criteria. It provides a mechanism to apply retry logic to both synchronous functions and asynchronous coroutines. The library implements exponential backoff to increase delays between retries, helping to manage transient network failures and prevent the overloading of services. Its capabilities cover the definition of retry conditions based on exception types or return values, as well as the enforcement of duration limits
This is a specialized library for implementing retry logic and fault tolerance within individual functions, rather than a comprehensive framework for orchestrating complex, multi-step data pipelines.
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
This framework provides a declarative, code-based approach to orchestrating complex ETL pipelines specifically for unstructured document processing, though it is more specialized toward LLM-driven transformations than general-purpose distributed data orchestration.
Conductor is a durable workflow engine designed to orchestrate complex, long-running business processes and autonomous agent loops. It functions as a stateful execution platform that persists the entire history of a process, ensuring that workflows remain reliable and recoverable across infrastructure failures, system restarts, and transient network errors. By managing task lifecycles, worker polling, and state transitions, it provides a centralized coordination layer for distributed systems. The platform distinguishes itself through its specialized support for AI agent orchestration, allowin
Conductor is a durable workflow orchestration engine that provides the distributed execution, state management, and retry logic required for complex pipelines, though it is architected for general-purpose microservice and agent orchestration rather than being a specialized ETL-specific framework.
Hatchet is an open-source durable workflow engine and task orchestration platform. It provides a framework for building and executing fault-tolerant, multi-step pipelines as directed acyclic graphs (DAGs), with automatic retries, scheduling, and real-time observability. The system is built around durable task checkpointing, which persists execution state after each step so work can resume from the last checkpoint after a worker crash or restart, and it supports event-driven task resumption that pauses a task until a matching external event arrives. The platform distinguishes itself through it
Hatchet is a durable workflow engine that provides the distributed execution, DAG-based orchestration, and observability required for managing complex data pipelines, though it is designed more for general-purpose task orchestration than specifically for ETL/ELT data movement.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
This framework provides a specialized orchestration engine for document-centric ETL pipelines, offering the necessary workflow management, data connectors, and monitoring to process unstructured data for AI applications.
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
This framework provides a unified engine for real-time ETL and data transformation workflows, offering the distributed execution and incremental processing capabilities required for complex data pipelines.