These Python libraries provide declarative abstractions for defining, scheduling, and executing complex data processing workflows.
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
Apache Airflow is a comprehensive Python-native orchestration platform that allows you to define complex data pipelines as code, supporting distributed execution, backfilling, and robust monitoring out of the box.
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Pathway is a Python-native data processing framework that enables the definition of complex streaming and batch pipelines through code, though it focuses more on real-time dataflow and RAG applications than on traditional task-based orchestration features like backfilling.
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Luigi is a Python-native orchestration framework that uses declarative task dependencies to manage complex data pipelines, offering built-in support for distributed execution, retries, and dependency visualization.
Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production. The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized proje
Kedro is a Python-native framework that uses a declarative approach to define data pipelines through modular nodes and a centralized data catalog, providing the orchestration, dependency resolution, and deployment flexibility required for robust data engineering workflows.
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Airflow is a comprehensive Python-native orchestration platform that uses code-defined directed acyclic graphs to manage complex data pipelines, offering robust support for distributed execution, retries, and workflow monitoring.
Tenacity is a Python retry library and fault tolerance framework designed to automatically re-execute failing functions based on custom conditions, wait intervals, and stop criteria. It provides a mechanism to apply retry logic to both synchronous functions and asynchronous coroutines. The library implements exponential backoff to increase delays between retries, helping to manage transient network failures and prevent the overloading of services. Its capabilities cover the definition of retry conditions based on exception types or return values, as well as the enforcement of duration limits
This is a specialized library for implementing retry logic and fault tolerance in Python functions, rather than a comprehensive data pipeline orchestration framework for defining and executing complex workflows.
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
This is a Python-based framework that uses declarative pipelines to orchestrate document ETL and map-reduce workflows, fitting the core requirement for defining data processing through configuration.
Temporal is a distributed workflow orchestration engine designed to manage fault-tolerant, stateful, and long-running background processes. It functions as a platform for coordinating complex cross-service operations, ensuring consistency and reliability in distributed environments by decoupling workflow orchestration from task execution. The platform distinguishes itself through a deterministic, event-sourced execution model that reconstructs workflow state by re-executing code from an immutable event log. This approach isolates non-deterministic side effects into managed activities, allowin
Temporal is a robust distributed workflow engine that provides a Python SDK for defining complex, stateful, and fault-tolerant processes, though it is primarily designed for general-purpose microservice orchestration rather than specialized data pipeline tasks.
Conductor is a durable workflow engine designed to orchestrate complex, long-running business processes and autonomous agent loops. It functions as a stateful execution platform that persists the entire history of a process, ensuring that workflows remain reliable and recoverable across infrastructure failures, system restarts, and transient network errors. By managing task lifecycles, worker polling, and state transitions, it provides a centralized coordination layer for distributed systems. The platform distinguishes itself through its specialized support for AI agent orchestration, allowin
This is a robust workflow orchestration engine that supports complex, distributed task management and stateful execution, though it is built on Java rather than being a Python-native framework.
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Ray is a distributed execution engine that provides the underlying primitives for defining complex, dependency-aware workflows in Python, though it functions as a general-purpose compute framework rather than a specialized pipeline orchestrator with built-in lineage and backfilling features.
Hatchet is an open-source durable workflow engine and task orchestration platform. It provides a framework for building and executing fault-tolerant, multi-step pipelines as directed acyclic graphs (DAGs), with automatic retries, scheduling, and real-time observability. The system is built around durable task checkpointing, which persists execution state after each step so work can resume from the last checkpoint after a worker crash or restart, and it supports event-driven task resumption that pauses a task until a matching external event arrives. The platform distinguishes itself through it
Hatchet is a durable workflow engine that provides a Python SDK for defining task-based pipelines with support for distributed execution, retries, and local development, making it a strong fit for orchestrating complex data workflows.
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
dbt-core is a Python-based framework that uses declarative SQL and configuration to manage data transformation workflows, providing robust lineage tracking and dependency management for data pipelines.
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
This is a data processing engine that supports declarative, event-driven ETL workflows and distributed execution, making it a capable tool for building complex data pipelines despite its primary focus on real-time AI and RAG applications.