These Python libraries provide declarative abstractions for defining, scheduling, and executing complex data processing workflows.
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
Apache Airflow is a Python workflow orchestration and data pipeline engine that lets you define pipelines as DAGs in code, with built-in scheduling, operators, connectors, monitoring, and parallel execution — exactly matching the declarative, Python-native pipeline framework you’re looking for.
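The idea of a DAG fixing execution order can be sketched with the Python standard library alone. This is not Airflow's API; the task names and the `execution_order` helper are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return False when the dependency mapping contains a cycle."""
    try:
        execution_order(deps)
    except CycleError:
        return False
    return True


order = execution_order(dependencies)
print(order)
```

Because the example graph is a simple chain, only one order is valid. A graph with a cycle is rejected, which is why schedulers of this kind insist that the graph be acyclic.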
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Pathway is a Python-native data processing framework that lets you define batch and streaming pipelines in a declarative, dataflow style, covering DAG-based execution, transformations, connectors, and parallel processing, which directly matches what you're looking for.
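As a rough illustration of incremental processing, the sketch below maintains a word count from a stream of insertions and retractions and compares it with a full batch recomputation. It does not use Pathway or differential dataflow; the class `IncrementalWordCount` and the `(word, diff)` update format are assumptions made for this example.

```python
from collections import Counter


def batch_word_count(words):
    """Recompute the whole result from a static dataset."""
    return Counter(words)


class IncrementalWordCount:
    """Hypothetical aggregate that is updated one change at a time."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, word, diff):
        """Apply a change: diff is +1 for an insertion, -1 for a retraction."""
        self.counts[word] += diff
        if self.counts[word] == 0:
            del self.counts[word]  # keep the state free of empty entries


# A stream of updates: two insertions of "a", one insertion and one retraction of "b".
updates = [("a", +1), ("b", +1), ("a", +1), ("b", -1)]

streaming = IncrementalWordCount()
for word, diff in updates:
    streaming.apply(word, diff)

# The data that remains after all updates, processed as a static batch.
batch = batch_word_count(["a", "a"])
```

Both paths end in the same state, which is the property the description refers to when it says static datasets and event streams are handled with identical logic.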
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Luigi is a battle-tested Python framework that lets you define batch pipelines as DAGs of tasks, handling scheduling, parallel execution, and state management largely declaratively—exactly the kind of declarative pipeline tool this search is after.
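The behavior of skipping work whose output already exists can be sketched in plain Python. This is not Luigi's API; `FileTask` and `build` are hypothetical names, and the sketch assumes that a task counts as complete once its output file is present.

```python
import tempfile
from pathlib import Path


class FileTask:
    """Hypothetical task: considered complete once its output file exists."""

    def __init__(self, output_path, requires=()):
        self.output_path = Path(output_path)
        self.requires = list(requires)

    def complete(self):
        return self.output_path.exists()

    def run(self):
        self.output_path.write_text("done\n")


def build(task, executed):
    """Run dependencies first, then the task, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, executed)
    task.run()
    executed.append(task.output_path.name)


with tempfile.TemporaryDirectory() as workdir:
    raw = FileTask(Path(workdir) / "raw.txt")
    clean = FileTask(Path(workdir) / "clean.txt", requires=[raw])

    first_run = []
    build(clean, first_run)   # both outputs are missing, so both tasks run
    second_run = []
    build(clean, second_run)  # both outputs now exist, so nothing runs
```

Using the output file as the completion marker makes reruns idempotent: a pipeline that failed halfway can be restarted and only the missing pieces are rebuilt.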
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Dagster is a Python-native data orchestration platform that lets you define data pipelines declaratively using code-as-configuration, with built-in DAG execution, scheduling, monitoring, and parallel processing, making it a strong fit for this search.
Meltano is an open-source platform for building, running, and orchestrating ELT (Extract, Load, Transform) data pipelines. It provides a declarative, YAML-driven configuration system that defines entire pipeline workflows, including data connectors, schedules, and transformations, without requiring imperative code. The platform is built on the Singer specification for data connectors and integrates with dbt for SQL-based transformations and Apache Airflow for scheduling and orchestration. What distinguishes Meltano is its comprehensive approach to pipeline management, combining a curated cata
Meltano is a declarative ELT pipeline framework configured in YAML that integrates with Singer for connectors, dbt for transformations, and Airflow for scheduling, all within a Python-based ecosystem.
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
docetl is a Python library that offers a declarative pipeline framework for AI-powered document ETL, making it a direct fit for your search for a declarative data pipeline tool, though it may lack built-in scheduling and monitoring features.
Haystack is an orchestration framework designed for building complex search and generative AI pipelines. It functions as an agentic workflow engine, enabling the construction of automated sequences that allow AI agents to perform multi-step reasoning and data analysis. The framework utilizes a modular, component-based architecture that connects processing steps into directed acyclic graphs. By employing a provider-agnostic integration layer, it decouples core logic from specific external AI services and vector databases, allowing for the flexible exchange of underlying technologies. This desi
Haystack is a Python-native declarative pipeline framework built around a DAG architecture with modular components and connectors, which exactly matches the requested pattern — though its focus on AI and search pipelines makes it a narrower fit than a general-purpose data pipeline tool.
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
Dask lets you define data pipelines by building a lazy DAG of operations and then computing them, which is exactly the declarative style you want; it offers Python-native data transformations, parallel execution, scheduling, and observability, though its source/sink connectors are less explicit than in dedicated pipeline frameworks.
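Lazy evaluation of a task graph can be illustrated without Dask. In the sketch below, building the graph records what to do and nothing runs until `compute()` is called. The `Lazy` class and the `calls` log are hypothetical and only stand in for the idea; they are not Dask's implementation.

```python
class Lazy:
    """Hypothetical deferred value: records a function and its inputs, runs nothing."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve nested Lazy inputs first, then apply the function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # log of the functions that actually executed


def add(x, y):
    calls.append(("add", x, y))
    return x + y


def double(x):
    calls.append(("double", x))
    return 2 * x


graph = Lazy(double, Lazy(add, 1, 2))  # builds the graph only
calls_before = len(calls)              # nothing has executed yet
result = graph.compute()               # runs add first, then double
```

Deferring execution gives a scheduler the whole graph before any work starts, so it can decide what to run in parallel and where. This sketch runs everything sequentially in one process.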
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
ZenML is a Python-native ML orchestration framework that lets you define pipelines declaratively as DAGs, making it a clear fit for a declarative data pipeline framework, though its focus on machine learning may limit general-purpose data transformation and connector coverage.
Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production. The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized proje
Kedro is a Python framework that structures data pipelines as modular DAGs with a data catalog for connectors and supports parallel execution, observability, and orchestration — making it a solid fit for declarative pipeline definition, though its pipeline composition is code‑based rather than purely config‑driven.
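The separation of processing logic from data access can be sketched as follows. This is not Kedro's API; `MemoryDataset`, `Catalog`, and `run_node` are hypothetical names, and the sketch assumes datasets expose `load` and `save` methods and are referred to by name.

```python
class MemoryDataset:
    """Hypothetical in-memory dataset with a load/save interface."""

    def __init__(self, data=None):
        self._data = data

    def load(self):
        return self._data

    def save(self, data):
        self._data = data


class Catalog:
    """Hypothetical registry that maps dataset names to dataset objects."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def load(self, name):
        return self._datasets[name].load()

    def save(self, name, data):
        # Unregistered outputs default to in-memory storage.
        self._datasets.setdefault(name, MemoryDataset()).save(data)


def run_node(func, inputs, output, catalog):
    """Load the named inputs, apply a pure function, and save the named output."""
    args = [catalog.load(name) for name in inputs]
    catalog.save(output, func(*args))


def total_by_key(rows):
    """Pure transformation: it knows nothing about where the rows came from."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


catalog = Catalog({"sales": MemoryDataset([("a", 1), ("b", 2), ("a", 3)])})
run_node(total_by_key, ["sales"], "totals", catalog)
```

Since `total_by_key` only receives and returns plain Python values, swapping the in-memory dataset for a file-backed or cloud-backed one would not require touching the transformation.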
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Prefect is a Python-native workflow orchestration framework that lets you declare data pipelines as DAGs using decorators and function composition, which fits the core intent, though defining individual tasks requires some imperative code rather than a fully configuration-driven approach.
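A state-machine model with an event log, as described above, can be sketched in a few lines. The four states, the `ALLOWED` transition table, and the `TaskRun` class are simplified assumptions for illustration and are not Prefect's actual state model or API.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table: terminal states allow no further transitions.
ALLOWED = {
    State.PENDING: {State.RUNNING},
    State.RUNNING: {State.COMPLETED, State.FAILED},
    State.COMPLETED: set(),
    State.FAILED: set(),
}


class TaskRun:
    """Hypothetical task run that records every state transition."""

    def __init__(self, name):
        self.name = name
        self.state = State.PENDING
        self.events = []  # log of (old_state, new_state) pairs

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.name} -> {new_state.name} is not allowed")
        self.events.append((self.state, new_state))
        self.state = new_state


def execute(run, func):
    """Run func, moving the task run through its states as it goes."""
    run.transition(State.RUNNING)
    try:
        result = func()
    except Exception:
        run.transition(State.FAILED)
        return None
    run.transition(State.COMPLETED)
    return result


ok_run = TaskRun("ok")
ok_result = execute(ok_run, lambda: 42)

bad_run = TaskRun("bad")
bad_result = execute(bad_run, lambda: 1 / 0)
```

Restricting transitions to an explicit table means an illegal jump, such as restarting a completed run, fails loudly, and the event log shows how each run reached its final state.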
ZenML 🙏: One AI Platform from Pipelines to Agents. https://zenml.io.
ZenML is a Python-native framework for building ML pipelines declaratively by composing steps into DAGs, with built-in connectors, orchestration, and monitoring. That is exactly the kind of tool you want, though its focus on AI/ML pipelines rather than generic data may slightly narrow the scope.
The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
Ploomber is a Python library that lets you define data pipelines declaratively using YAML and Python, with DAG-based execution, built-in connectors, scheduling, and parallel processing — exactly the kind of declarative pipeline framework this search targets.
TFX is an end-to-end platform for deploying production ML pipelines.
TFX (TensorFlow Extended) is a production ML pipeline platform that supports declarative pipeline definition in Python with DAG execution, transformation operators, connectors, orchestration, and monitoring, which fits the intent for a declarative data pipeline framework, though its strong ML focus narrows its scope compared to a general-purpose pipeline library.
Dag-factory is a framework for constructing and managing Apache Airflow data pipelines through declarative configuration files. By replacing manual procedural code with structured YAML definitions, it enables the programmatic generation of complex workflow structures, task dependencies, and execution schedules. The project distinguishes itself by mapping configuration keys directly to Python class constructors and operators, allowing for the dynamic instantiation of objects and custom logic. It supports hierarchical configuration inheritance to standardize settings across environments and pro
dag-factory lets you define Airflow DAGs declaratively using YAML, giving you a Python library for declarative pipeline definition, though it relies on Airflow's ecosystem for execution and scheduling rather than being a standalone framework.
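Mapping configuration keys to class constructors can be sketched as below. A plain dictionary stands in for the parsed YAML, and the task classes, the `REGISTRY` lookup, and the key names `operator` and `dependencies` are assumptions for this example. It does not use dag-factory or Airflow.

```python
from dataclasses import dataclass, field


@dataclass
class EchoTask:
    """Hypothetical operator that would print a message."""
    task_id: str
    message: str
    upstream: list = field(default_factory=list)


@dataclass
class WaitTask:
    """Hypothetical operator that would pause for a number of seconds."""
    task_id: str
    seconds: int
    upstream: list = field(default_factory=list)


# Assumed lookup from the name used in configuration to the Python class.
REGISTRY = {"EchoTask": EchoTask, "WaitTask": WaitTask}

# Stand-in for a parsed YAML file.
config = {
    "tasks": {
        "greet": {"operator": "EchoTask", "message": "hello"},
        "pause": {"operator": "WaitTask", "seconds": 5, "dependencies": ["greet"]},
    }
}


def build_tasks(cfg):
    """Instantiate one object per task, passing the remaining keys as keyword arguments."""
    tasks = {}
    for task_id, spec in cfg["tasks"].items():
        params = dict(spec)  # copy so the configuration is left unchanged
        cls = REGISTRY[params.pop("operator")]
        upstream = list(params.pop("dependencies", []))
        tasks[task_id] = cls(task_id=task_id, upstream=upstream, **params)
    return tasks


tasks = build_tasks(config)
```

Because the remaining keys are passed straight through as keyword arguments, supporting a new operator only requires registering its class. A misspelled key surfaces as a `TypeError` from the constructor.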