The visitor is looking for a Python-based framework that allows for the definition of data processing workflows using declarative code or configuration.

apache/incubator-airflow is the closest match: Apache Airflow is a comprehensive Python-native orchestration platform that lets you define complex data pipelines as code, with distributed execution, backfilling, and monitoring out of the box. Other strong matches: pathwaycom/pathway, spotify/luigi, kedro-org/kedro, apache/airflow (the current name of the same Airflow project).

Why does apache/incubator-airflow match “a Python framework for data pipelines”?

Apache Airflow is a comprehensive Python-native orchestration platform that allows you to define complex data pipelines as code, supporting distributed execution, backfilling, and robust monitoring out of the box.

Why does pathwaycom/pathway match “a Python framework for data pipelines”?

Pathway is a Python-native data processing framework that enables the definition of complex streaming and batch pipelines through code, though it focuses more on real-time dataflow and RAG applications than on traditional task-based orchestration features like backfilling.

Why does spotify/luigi match “a Python framework for data pipelines”?

Luigi is a Python-native orchestration framework that uses declarative task dependencies to manage complex data pipelines, offering built-in support for distributed execution, retries, and dependency visualization.

Why does kedro-org/kedro match “a Python framework for data pipelines”?

Kedro is a Python-native framework that uses a declarative approach to define data pipelines through modular nodes and a centralized data catalog, providing the orchestration, dependency resolution, and deployment flexibility required for robust data engineering workflows.

Why does apache/airflow match “a Python framework for data pipelines”?

Airflow is a comprehensive Python-native orchestration platform that uses code-defined directed acyclic graphs to manage complex data pipelines, offering robust support for distributed execution, retries, and workflow monitoring.

Declarative Python Data Pipeline Frameworks

These Python libraries provide declarative abstractions for defining, scheduling, and executing complex data processing workflows.


apache/incubator-airflow
45,840 · View on GitHub
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
Apache Airflow is a comprehensive Python-native orchestration platform that allows you to define complex data pipelines as code, supporting distributed execution, backfilling, and robust monitoring out of the box.
Python · Backfill Managers
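Backfilling, mentioned in the summary above, means running a pipeline for past schedule intervals that have not yet been processed. The sketch below illustrates the idea with the standard library only; it is not Airflow's API, and the function name `missing_runs`, the daily schedule, and the sample dates are assumptions made for illustration.

```python
from datetime import date, timedelta


def missing_runs(start: date, end: date, completed: set[date]) -> list[date]:
    """Return every daily run date in [start, end] that has not completed yet."""
    runs = []
    day = start
    while day <= end:
        if day not in completed:
            runs.append(day)
        day += timedelta(days=1)
    return runs


# Hypothetical run history: only January 2 has already been processed.
done = {date(2024, 1, 2)}
to_backfill = missing_runs(date(2024, 1, 1), date(2024, 1, 4), done)
print(to_backfill)
```

A real scheduler would also track run state and concurrency; the sketch only shows how the set of dates to backfill can be derived.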
pathwaycom/pathway
62,959 · View on GitHub
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Pathway is a Python-native data processing framework that enables the definition of complex streaming and batch pipelines through code, though it focuses more on real-time dataflow and RAG applications than on traditional task-based orchestration features like backfilling.
Python · Declarative Pipeline Construction
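Incremental processing, as described above, means applying only the changes to an existing result instead of recomputing it from the full dataset. The following is a rough stdlib-only sketch of that idea, not Pathway's API; the class `IncrementalCount` and the `(key, diff)` change format are assumptions for illustration.

```python
from collections import Counter


class IncrementalCount:
    """Maintains per-key counts by applying (key, diff) changes."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, changes):
        """Apply a batch of changes; return the new count of each touched key."""
        touched = {}
        for key, diff in changes:
            self.counts[key] += diff
            touched[key] = self.counts[key]
        # Forget keys whose count dropped to zero.
        for key in [k for k, v in touched.items() if v == 0]:
            del self.counts[key]
        return touched


state = IncrementalCount()
state.apply([("a", +1), ("b", +1), ("a", +1)])  # initial (batch) load
delta = state.apply([("b", -1), ("c", +1)])      # later streaming update
print(delta, dict(state.counts))
```

The same `apply` call handles both the initial batch and later updates, which mirrors the unified batch and streaming logic the description refers to.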
spotify/luigi
18,676 · View on GitHub
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Luigi is a Python-native orchestration framework that uses declarative task dependencies to manage complex data pipelines, offering built-in support for distributed execution, retries, and dependency visualization.
Python · Python Data Pipeline Frameworks · Workflow Orchestration Engines · Batch Processing Schedulers
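The description above notes that redundant work is avoided by checking whether a task's output target exists before running it. A minimal sketch of that check, using only the standard library rather than Luigi itself, might look like this; `FileTask` and `build` are hypothetical names.

```python
import tempfile
from pathlib import Path


class FileTask:
    """Hypothetical task that is complete once its output file exists."""

    def __init__(self, path: Path, deps=()):
        self.path = path
        self.deps = list(deps)
        self.ran = False

    def complete(self) -> bool:
        return self.path.exists()

    def run(self):
        self.path.write_text("done")
        self.ran = True


def build(task: FileTask):
    """Satisfy dependencies first, then run the task if its output is absent."""
    if task.complete():
        return
    for dep in task.deps:
        build(dep)
    task.run()


workdir = Path(tempfile.mkdtemp())
extract = FileTask(workdir / "extract.txt")
report = FileTask(workdir / "report.txt", deps=[extract])

build(report)  # both outputs are missing, so both tasks run
first = (extract.ran, report.ran)

extract.ran = report.ran = False
build(report)  # outputs now exist, so nothing is re-run
second = (extract.ran, report.ran)
print(first, second)
```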
kedro-org/kedro
10,889 · View on GitHub
Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production. The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized proje
Kedro is a Python-native framework that uses a declarative approach to define data pipelines through modular nodes and a centralized data catalog, providing the orchestration, dependency resolution, and deployment flexibility required for robust data engineering workflows.
Python · Data Catalogs · DAG-Based Dependency Resolution · Data Access Abstractions
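To make the combination of nodes and a data catalog more concrete, here is a small stdlib-only sketch in which each node names its input and output datasets and the run order is derived from those names. It does not use Kedro's API; `run_pipeline`, the tuple layout of a node, and the dataset names are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical catalog: dataset names mapped to in-memory values.
catalog = {"raw": [3, 1, 2]}

# Each node is (function, input dataset names, output dataset name).
# They are deliberately listed out of order.
nodes = [
    (lambda xs: sum(xs), ["sorted"], "total"),
    (lambda xs: sorted(xs), ["raw"], "sorted"),
]


def run_pipeline(nodes, catalog):
    """Run nodes in dependency order, reading and writing via the catalog."""
    producers = {out: (fn, ins) for fn, ins, out in nodes}
    # An output depends on those of its inputs that other nodes produce.
    graph = {
        out: [name for name in ins if name in producers]
        for out, (fn, ins) in producers.items()
    }
    for name in TopologicalSorter(graph).static_order():
        fn, ins = producers[name]
        catalog[name] = fn(*(catalog[i] for i in ins))
    return catalog


result = run_pipeline(nodes, catalog)
print(result["sorted"], result["total"])
```

Because nodes only refer to datasets by name, the processing functions stay independent of where the data actually lives.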
apache/airflow
45,902 · View on GitHub
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Airflow is a comprehensive Python-native orchestration platform that uses code-defined directed acyclic graphs to manage complex data pipelines, offering robust support for distributed execution, retries, and workflow monitoring.
Python · Data Pipeline Orchestrators · Workflow Orchestration · Workflow Orchestration Engines
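Representing a workflow as a directed acyclic graph means an execution order can always be derived, and a cyclic definition can be rejected up front. A short sketch of both points using the standard library's `graphlib` follows; it is not Airflow code, and the task ids and the helper `execution_order` are assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return task ids ordered so that every task follows its upstream tasks."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: task id -> set of upstream task ids.
pipeline = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
order = execution_order(pipeline)

# A cycle makes the graph invalid, so no execution order exists.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True

print(order, cycle_rejected)
```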
jd/tenacity
8,375 · View on GitHub
Tenacity is a Python retry library and fault tolerance framework designed to automatically re-execute failing functions based on custom conditions, wait intervals, and stop criteria. It provides a mechanism to apply retry logic to both synchronous functions and asynchronous coroutines. The library implements exponential backoff to increase delays between retries, helping to manage transient network failures and prevent the overloading of services. Its capabilities cover the definition of retry conditions based on exception types or return values, as well as the enforcement of duration limits
This is a specialized library for implementing retry logic and fault tolerance in Python functions, rather than a comprehensive data pipeline orchestration framework for defining and executing complex workflows.
Python · Retry Policies
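The retry behavior described above (a retry condition, a growing wait between attempts, and a stop criterion) can be sketched with the standard library alone. This is not Tenacity's API; the decorator `retry` and its parameters are hypothetical, and the `sleep` argument is injectable so the example runs without real delays.

```python
import time
from functools import wraps


def retry(max_attempts=3, base_delay=0.1, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, doubling the wait after each failure."""

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise  # stop criterion reached: give up
                    sleep(base_delay * 2 ** attempt)  # exponential backoff

        return wrapper

    return decorator


waits = []          # records requested delays instead of sleeping
calls = {"n": 0}


@retry(max_attempts=4, base_delay=1.0, exceptions=(ConnectionError,), sleep=waits.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


outcome = flaky()
print(outcome, waits)
```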
ucbepic/docetl
3,597 · View on GitHub
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
This is a Python-based framework that uses declarative pipelines to orchestrate document ETL and map-reduce workflows, fitting the core requirement for defining data processing through configuration.
Python · Declarative Pipeline Construction
temporalio/temporal
18,411 · View on GitHub
Temporal is a distributed workflow orchestration engine designed to manage fault-tolerant, stateful, and long-running background processes. It functions as a platform for coordinating complex cross-service operations, ensuring consistency and reliability in distributed environments by decoupling workflow orchestration from task execution. The platform distinguishes itself through a deterministic, event-sourced execution model that reconstructs workflow state by re-executing code from an immutable event log. This approach isolates non-deterministic side effects into managed activities, allowin
Temporal is a robust distributed workflow engine that provides a Python SDK for defining complex, stateful, and fault-tolerant processes, though it is primarily designed for general-purpose microservice orchestration rather than specialized data pipeline tasks.
Go · Backfill Managers · Retry Policies
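The event-sourced model described above rebuilds workflow state by re-running deterministic workflow code against a recorded history, so completed side effects are not repeated. The sketch below is a simplified illustration of that idea and does not use the Temporal SDK; `Replayer`, `workflow`, and the activity names are assumptions.

```python
class Replayer:
    """Runs workflow code, recording each activity result in an event log.

    On replay, logged results are returned instead of re-running activities.
    """

    def __init__(self, log=None):
        self.log = list(log or [])
        self.cursor = 0
        self.executed = []  # names of activities actually run this time

    def activity(self, name, fn, *args):
        if self.cursor < len(self.log):
            result = self.log[self.cursor]  # replayed from history
        else:
            result = fn(*args)              # first execution: side effect runs
            self.log.append(result)
            self.executed.append(name)
        self.cursor += 1
        return result


def workflow(ctx):
    """Deterministic workflow: all side effects go through ctx.activity."""
    order_id = ctx.activity("create_order", lambda: 42)
    return ctx.activity("charge", lambda oid: f"paid-{oid}", order_id)


first = Replayer()
result1 = workflow(first)

# Simulate a crash after the first activity: only one event was persisted.
recovered = Replayer(log=first.log[:1])
result2 = workflow(recovered)
print(result1, result2, recovered.executed)
```

The recovered run reaches the same result while executing only the activity that had not been recorded, which is why the workflow code itself must be deterministic.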
conductor-oss/conductor
31,962 · View on GitHub
Conductor is a durable workflow engine designed to orchestrate complex, long-running business processes and autonomous agent loops. It functions as a stateful execution platform that persists the entire history of a process, ensuring that workflows remain reliable and recoverable across infrastructure failures, system restarts, and transient network errors. By managing task lifecycles, worker polling, and state transitions, it provides a centralized coordination layer for distributed systems. The platform distinguishes itself through its specialized support for AI agent orchestration, allowin
This is a robust workflow orchestration engine that supports complex, distributed task management and stateful execution, though it is built on Java rather than being a Python-native framework.
Java · Retry Policies · Retry Strategies
ray-project/ray
42,895 · View on GitHub
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Ray is a distributed execution engine that provides the underlying primitives for defining complex, dependency-aware workflows in Python, though it functions as a general-purpose compute framework rather than a specialized pipeline orchestrator with built-in lineage and backfilling features.
Python · Distributed Datasets
hatchet-dev/hatchet
6,622 · View on GitHub
Hatchet is an open-source durable workflow engine and task orchestration platform. It provides a framework for building and executing fault-tolerant, multi-step pipelines as directed acyclic graphs (DAGs), with automatic retries, scheduling, and real-time observability. The system is built around durable task checkpointing, which persists execution state after each step so work can resume from the last checkpoint after a worker crash or restart, and it supports event-driven task resumption that pauses a task until a matching external event arrives. The platform distinguishes itself through it
Hatchet is a durable workflow engine that provides a Python SDK for defining task-based pipelines with support for distributed execution, retries, and local development, making it a strong fit for orchestrating complex data workflows.
Go · Retry Policies · Workflow Definitions as Code
dbt-labs/dbt-core
13,051 · View on GitHub
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
dbt-core is a Python-based framework that uses declarative SQL and configuration to manage data transformation workflows, providing robust lineage tracking and dependency management for data pipelines.
Rust · Local Development Environments
pathwaycom/llm-app
59,341 · View on GitHub
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
This is a data processing engine that supports declarative, event-driven ETL workflows and distributed execution, making it a capable tool for building complex data pipelines despite its primary focus on real-time AI and RAG applications.
Jupyter Notebook · Data Processing Frameworks · Differential Dataflow Engines · Distributed State Management
