What are the best open-source GitHub repositories for Python data pipeline frameworks?

apache/incubator-airflow is the closest match: Apache Airflow is a Python workflow orchestration and data pipeline engine that lets you define pipelines as DAGs in code, with built-in scheduling, operators, connectors, monitoring, and parallel execution, which matches the declarative, Python-native pipeline framework you're looking for. Other strong matches: pathwaycom/pathway, spotify/luigi, dagster-io/dagster, meltano/meltano.

Why does apache/incubator-airflow match “Python data pipeline frameworks”?

Apache Airflow is a Python workflow orchestration and data pipeline engine that lets you define pipelines as DAGs in code, with built-in scheduling, operators, connectors, monitoring, and parallel execution — exactly matching the declarative, Python-native pipeline framework you’re looking for.

Why does pathwaycom/pathway match “Python data pipeline frameworks”?

Pathway is a Python-native data processing framework that lets you define batch and streaming pipelines in a declarative, dataflow style, covering DAG-based execution, transformations, connectors, and parallel processing, which directly matches what you're looking for.

Why does spotify/luigi match “Python data pipeline frameworks”?

Luigi is a battle-tested Python framework that lets you define batch pipelines as DAGs of tasks, handling scheduling, parallel execution, and state management largely declaratively—exactly the kind of declarative pipeline tool this search is after.

Why does dagster-io/dagster match “Python data pipeline frameworks”?

Dagster is a Python-native data orchestration platform that lets you define data pipelines declaratively using code-as-configuration, with built-in DAG execution, scheduling, monitoring, and parallel processing, making it a strong fit for this search.

Why does meltano/meltano match “Python data pipeline frameworks”?

Meltano is a comprehensive declarative ELT pipeline framework that uses YAML configuration and integrates with Singer, dbt, and Airflow, covering all the requested features with a Python-native ecosystem.

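Declarative Python data pipeline frameworks

These Python libraries provide declarative abstractions for defining, scheduling, and executing complex data-processing workflows.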


apache/incubator-airflow (45,840 stars)
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
Python · Directed Acyclic Graph Engines · Pipeline Monitoring Dashboards
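The idea of a DAG that guarantees execution order, as in the Airflow description above, can be sketched with nothing but the standard library. This is not Airflow's API: the task names and the `execution_order` helper are hypothetical and only illustrate how declared dependencies determine a run order.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.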

pathwaycom/pathway (62,959 stars)
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Python · Declarative Pipeline Construction
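Processing updates incrementally, as the Pathway description puts it, means applying only the changes rather than recomputing from scratch. The sketch below is a plain-Python illustration under that assumption, not Pathway's API; `apply_updates` and the `(key, diff)` update format are hypothetical.

```python
from collections import Counter

def apply_updates(counts, updates):
    """Apply (key, diff) updates in place; diff is +1 for an insert, -1 for a delete."""
    for key, diff in updates:
        counts[key] += diff
        if counts[key] == 0:
            del counts[key]  # drop keys whose count has returned to zero
    return counts

counts = Counter()
# The same function handles an initial batch and a later stream of changes.
apply_updates(counts, [("error", 1), ("info", 1), ("error", 1)])
apply_updates(counts, [("info", -1), ("warn", 1)])
print(dict(counts))
```

Because the batch and the later changes go through the same function, the static and streaming cases share one piece of logic.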

spotify/luigi (18,676 stars)
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Python · Directed Acyclic Graph Engines · Pipeline Monitoring Dashboards
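The Luigi description mentions skipping work when the output target already exists. A minimal sketch of that check, assuming a file-based target and a hypothetical `run_if_missing` helper (this is not Luigi's own Task or Target interface):

```python
import tempfile
from pathlib import Path

def run_if_missing(output: Path, work) -> bool:
    """Run work() and write its result only when the output file is absent."""
    if output.exists():
        return False  # target already present, so the task is skipped
    output.write_text(work())
    return True

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "report.txt"
    first = run_if_missing(target, lambda: "done")   # does the work
    second = run_if_missing(target, lambda: "done")  # skipped on the rerun

print(first, second)
```

Rerunning a pipeline built this way repeats only the steps whose outputs are missing.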

dagster-io/dagster (14,974 stars)
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Python · Data Pipeline Orchestration · Declarative Orchestration · Workflow Orchestration Engines

meltano/meltano (2,534 stars)
Meltano is an open-source platform for building, running, and orchestrating ELT (Extract, Load, Transform) data pipelines. It provides a declarative, YAML-driven configuration system that defines entire pipeline workflows, including data connectors, schedules, and transformations, without requiring imperative code. The platform is built on the Singer specification for data connectors and integrates with dbt for SQL-based transformations and Apache Airflow for scheduling and orchestration. What distinguishes Meltano is its comprehensive approach to pipeline management, combining a curated cata
Python · Business Intelligence · Data Integration · Data Pipelines and Orchestration

ucbepic/docetl (3,597 stars)
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
docetl is a Python library that offers a declarative pipeline framework for AI-powered document ETL, making it directly fit your search for a declarative data pipeline tool—though it may lack built-in scheduling and monitoring features.
Python · Declarative Pipeline Construction

deepset-ai/haystack (24,253 stars)
Haystack is an orchestration framework designed for building complex search and generative AI pipelines. It functions as an agentic workflow engine, enabling the construction of automated sequences that allow AI agents to perform multi-step reasoning and data analysis. The framework utilizes a modular, component-based architecture that connects processing steps into directed acyclic graphs. By employing a provider-agnostic integration layer, it decouples core logic from specific external AI services and vector databases, allowing for the flexible exchange of underlying technologies. This desi
Haystack is a Python-native declarative pipeline framework built around a DAG architecture with modular components and connectors, which exactly matches the requested pattern — though its focus on AI and search pipelines makes it a narrower fit than a general-purpose data pipeline tool.
MDX · Directed Acyclic Graph Engines

dask/dask (13,746 stars)
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
Dask lets you define data pipelines by building a lazy DAG of operations and then computing them, which is exactly the declarative style you want; it offers Python-native data transformations, parallel execution, scheduling, and observability, though its source/sink connectors are less explicit than in dedicated pipeline frameworks.
Python · Directed Acyclic Graph Execution Engines · Execution Graphs · Parallel Execution
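Lazy evaluation, as described for Dask above, separates building the task graph from running it. The following is a toy sketch of that idea in plain Python, not Dask's implementation; the `Lazy` class and the `load` and `total` functions are hypothetical.

```python
class Lazy:
    """A deferred call: stores the function and inputs, runs nothing until compute()."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve upstream Lazy nodes first, then call this node's function.
        values = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*values)

calls = []  # records which functions actually ran, and in what order

def load(n):
    calls.append("load")
    return list(range(n))

def total(xs):
    calls.append("total")
    return sum(xs)

graph = Lazy(total, Lazy(load, 5))  # builds the graph, executes nothing
calls_before = list(calls)
result = graph.compute()            # runs load, then total
print(calls_before, result, calls)
```

Dask's own `dask.delayed` interface follows the same build-then-compute pattern, with a scheduler that can run independent nodes in parallel.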

maiot-io/zenml (5,452 stars)
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
ZenML is a Python-native ML orchestration framework that lets you define pipelines declaratively as DAGs, making it a clear fit for a declarative data pipeline framework, though its focus on machine learning may limit general-purpose data transformation and connector coverage.
Python · Directed Acyclic Graph Pipelines · Pipeline Monitoring Dashboards

kedro-org/kedro (10,889 stars)
Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production. The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized proje
Kedro is a Python framework that structures data pipelines as modular DAGs with a data catalog for connectors and supports parallel execution, observability, and orchestration — making it a solid fit for declarative pipeline definition, though its pipeline composition is code‑based rather than purely config‑driven.
Python · Data Catalogs · DAG-Based Dependency Resolution · Data Access Abstractions
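Separating processing logic from data access through a catalog, as the Kedro description outlines, can be illustrated in a few lines. This is a simplified sketch, not Kedro's data catalog API; the `Catalog` class, the JSON storage, and the dataset names are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

class Catalog:
    """Maps dataset names to file paths so pipeline functions never touch paths."""

    def __init__(self, entries):
        self.entries = entries

    def load(self, name):
        return json.loads(Path(self.entries[name]).read_text())

    def save(self, name, data):
        Path(self.entries[name]).write_text(json.dumps(data))

def clean(rows):
    """Pure processing logic: drop rows without a value."""
    return [r for r in rows if r.get("value") is not None]

with tempfile.TemporaryDirectory() as tmp:
    catalog = Catalog({"raw": f"{tmp}/raw.json", "clean": f"{tmp}/clean.json"})
    catalog.save("raw", [{"value": 1}, {"value": None}])
    catalog.save("clean", clean(catalog.load("raw")))
    cleaned = catalog.load("clean")

print(cleaned)
```

Because `clean` only sees Python objects, the storage location or format can change in the catalog without touching the processing code.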

PrefectHQ/prefect (21,640 stars)
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Prefect is a Python-native workflow orchestration framework that lets you declare data pipelines as DAGs using decorators and function composition, which fits the core intent, though defining individual tasks requires some imperative code rather than a fully configuration-driven approach.
Python · Data Pipeline Orchestration · Workflow Orchestration · Container-Native Infrastructure
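The combination of decorators and state tracking mentioned for Prefect can be sketched as a decorator that records each state transition of a call. This is not Prefect's API: the `task` decorator, the state names, and the `states` attribute here are hypothetical stand-ins.

```python
import functools

def task(func):
    """Wrap a function so each call records its state transitions."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.states.append("RUNNING")
        try:
            result = func(*args, **kwargs)
        except Exception:
            wrapper.states.append("FAILED")
            raise
        wrapper.states.append("COMPLETED")
        return result

    wrapper.states = ["PENDING"]  # initial state before the first call
    return wrapper

@task
def double(x):
    return x * 2

@task
def explode():
    raise ValueError("boom")

value = double(21)
try:
    explode()
except ValueError:
    pass  # the failure is still visible in explode.states

print(value, double.states, explode.states)
```

An orchestrator would persist these transitions to an event log rather than keep them on the function object.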

zenml-io/zenml (5,451 stars)
ZenML 🙏: One AI Platform from Pipelines to Agents. https://zenml.io.
ZenML is a Python-native framework for building ML pipelines declaratively by composing steps into DAGs, with built-in connectors, orchestration, and monitoring. That matches the search, though its focus on AI/ML pipelines rather than generic data processing narrows the scope somewhat.
Python · Data Pipelines · MLOps and Lifecycle · Project Documentation Examples

ploomber/ploomber (3,623 stars)
The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
Ploomber is a Python library that lets you define data pipelines declaratively using YAML and Python, with DAG-based execution, built-in connectors, scheduling, and parallel processing — exactly the kind of declarative pipeline framework this search targets.
Python · Data Pipelines · Interactive Notebooks · ML Ops

tensorflow/tfx (2,186 stars)
TFX is an end-to-end platform for deploying production ML pipelines
TFX (TensorFlow Extended) is a production ML pipeline platform that supports declarative pipeline definition in Python with DAG execution, transformation operators, connectors, orchestration, and monitoring. That fits the intent of a declarative data pipeline framework, though its strong ML focus makes it narrower than a general-purpose pipeline library.
Python · MLOps and Infrastructure · Training and Orchestration

astronomer/dag-factory (1,440 stars)
Dag-factory is a framework for constructing and managing Apache Airflow data pipelines through declarative configuration files. By replacing manual procedural code with structured YAML definitions, it enables the programmatic generation of complex workflow structures, task dependencies, and execution schedules. The project distinguishes itself by mapping configuration keys directly to Python class constructors and operators, allowing for the dynamic instantiation of objects and custom logic. It supports hierarchical configuration inheritance to standardize settings across environments and pro
dag-factory lets you define Airflow DAGs declaratively using YAML, giving you a Python library for declarative pipeline definition, though it relies on Airflow's ecosystem for execution and scheduling rather than being a standalone framework.
Python · DAG Workflow Executions · Declarative Pipeline Construction
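Mapping configuration keys directly to Python class constructors, as the dag-factory description states, can be approximated with `importlib`. The sketch below is an assumption about the general technique, not dag-factory's code; the `build` helper, the `class` key, and the config entries are hypothetical, and the dictionary stands in for a parsed YAML file.

```python
import importlib

def build(spec):
    """Instantiate the class named by spec['class'] with the remaining keys as kwargs."""
    module_name, _, class_name = spec["class"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    kwargs = {k: v for k, v in spec.items() if k != "class"}
    return cls(**kwargs)

# Hypothetical config, as it might look after parsing a YAML file.
config = {
    "retry_delay": {"class": "datetime.timedelta", "minutes": 30},
    "start_date": {"class": "datetime.date", "year": 2024, "month": 1, "day": 1},
}

objects = {name: build(spec) for name, spec in config.items()}
print(objects)
```

The same lookup works for any importable class, which is what lets a configuration file name operators and custom logic without procedural glue code.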
