# Declarative Python Data Pipeline Frameworks

> Search results for `declarative Python framework for building data pipelines` on awesome-repositories.com. 119 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/declarative-python-framework-for-building-data-pipelines

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/declarative-python-framework-for-building-data-pipelines).**

## Results

- [deepset-ai/haystack](https://awesome-repositories.com/repository/deepset-ai-haystack.md) (24,253 ⭐) — Haystack is an orchestration framework designed for building complex search and generative AI pipelines. It functions as an agentic workflow engine, enabling the construction of automated sequences that allow AI agents to perform multi-step reasoning and data analysis.

The framework utilizes a modular, component-based architecture that connects processing steps into directed acyclic graphs. By employing a provider-agnostic integration layer, it decouples core logic from specific external AI services and vector databases, allowing for the flexible exchange of underlying technologies. This desi
- [scramjetorg/framework-python](https://awesome-repositories.com/repository/scramjetorg-framework-python.md) (35 ⭐) — Python port of Scramjet framework
- [dagster-io/dagster](https://awesome-repositories.com/repository/dagster-io-dagster.md) (14,974 ⭐) — Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality.

The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
- [pathwaycom/pathway](https://awesome-repositories.com/repository/pathwaycom-pathway.md) (62,959 ⭐) — Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources.

The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
- [taskflow/taskflow](https://awesome-repositories.com/repository/taskflow-taskflow.md) (12,013 ⭐) — Taskflow is a C++ task-parallel framework designed to build high-performance parallel workflows and complex dependency graphs. It provides a programming model that organizes computational work into directed acyclic graphs, enabling developers to manage concurrency, resource scheduling, and task dependencies across multi-core CPUs and GPU accelerators.

The framework distinguishes itself through its ability to orchestrate heterogeneous systems, allowing for the integration of hardware-accelerated kernels and memory operations into unified execution pipelines. It supports dynamic runtime subflow
- [awslabs/data-pipeline-samples](https://awesome-repositories.com/repository/awslabs-data-pipeline-samples.md) (472 ⭐) — This repository hosts sample pipelines
- [angular/angular](https://awesome-repositories.com/repository/angular-angular.md) (100,360 ⭐) — Angular is a platform for building web applications using a component-based architecture. It provides a comprehensive suite of tools for managing encapsulated UI units, including hierarchical dependency injection, a declarative template system, and fine-grained reactivity through signals. The framework supports complex application requirements such as client-side routing, form management, and internationalization.

The project includes a command-line interface for scaffolding and build automation, alongside a testing ecosystem for unit and integration verification. It offers multiple rendering
- [tektoncd/pipeline](https://awesome-repositories.com/repository/tektoncd-pipeline.md) (8,996 ⭐) — The Tekton Pipelines project provides k8s-style resources for declaring CI/CD-style pipelines.
- [fredkschott/snowpack](https://awesome-repositories.com/repository/fredkschott-snowpack.md) (19,329 ⭐) — Snowpack is an ESM-powered frontend build tool and development server that serves native ES modules directly to the browser. By eliminating the bundling process during development, it enables nearly instant server startup and unbundled frontend development.

The project features a framework-aware hot module reload system that preserves component state during updates, with specific Fast Refresh integration for React, Preact, Svelte, and Vue. It also acts as a modern web transpiler, automatically converting TypeScript, JSX, and CSS Modules into browser-compatible code without requiring manual co
- [mikefarah/yq](https://awesome-repositories.com/repository/mikefarah-yq.md) (14,913 ⭐) — This tool is a command-line processor designed for querying, updating, and transforming structured data files. It functions as a versatile engine for manipulating YAML, JSON, TOML, and XML documents, allowing users to perform complex operations directly from the terminal. By utilizing a path-based expression language, it enables precise navigation and modification of data structures within configuration files and infrastructure-as-code workflows.

What distinguishes this tool is its ability to perform in-place document mutations while preserving original formatting, comments, and metadata. It
- [teckkean/gtfs-data-pipeline-tfnsw-bus](https://awesome-repositories.com/repository/teckkean-gtfs-data-pipeline-tfnsw-bus.md) (8 ⭐) — Contents: Introduction; Data Availability Statement; Data Pipeline Directory Structure; Data Pipeline Operations (1.1 Convert .PB.GZ to .CSV Files; 1.2 Transform .CSV Files; 1.2A Transform .CSV Files by Agency (Daily to Monthly); 1.3 Prepare Cleaned Unique Datasets)
- [dragonflydb/dragonfly](https://awesome-repositories.com/repository/dragonflydb-dragonfly.md) (30,688 ⭐) — Dragonfly is a high-performance, multi-model in-memory data store designed to serve as a drop-in replacement for existing database infrastructures. By utilizing a multi-threaded, shared-nothing architecture and a fiber-based concurrency model, it maximizes CPU utilization and minimizes latency for read and write operations. The system supports a wide range of data structures, including strings, hashes, lists, sets, sorted sets, and JSON documents, while maintaining full compatibility with standard industry wire protocols and client libraries.

What distinguishes Dragonfly is its focus on effic
- [fastai/course22](https://awesome-repositories.com/repository/fastai-course22.md) (3,398 ⭐) — This is a structured deep learning curriculum for programmers, delivered as a collection of Jupyter notebooks. It teaches the fundamentals of training neural networks for computer vision, natural language processing, tabular data analysis, and collaborative filtering using PyTorch and the fastai library. The course is designed to be hands-on, guiding learners from building a training loop from scratch to fine-tuning pretrained models for a variety of practical tasks.

The curriculum distinguishes itself by covering the full lifecycle of a deep learning project, from data preparation and augmen
- [inosrahul/f1-data-pipeline](https://awesome-repositories.com/repository/inosrahul-f1-data-pipeline.md) (0 ⭐)
- [flutter/flutter](https://awesome-repositories.com/repository/flutter-flutter.md) (177,056 ⭐) — This project is a multi-platform UI framework designed for building applications that target mobile, web, and desktop environments from a single codebase. It utilizes a declarative paradigm where the user interface is defined as a function of application state, supported by a layered architecture that includes a high-performance rendering engine and a multi-platform compilation model.

The framework provides a comprehensive suite of developer tools, including hot reloading for real-time code injection and diagnostic utilities for monitoring application state and performance. It features a modu
- [ucbepic/docetl](https://awesome-repositories.com/repository/ucbepic-docetl.md) (3,597 ⭐) — docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas.

The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
- [kubeflow/pipelines](https://awesome-repositories.com/repository/kubeflow-pipelines.md) (4,154 ⭐) — Machine Learning Pipelines for Kubeflow
- [flutter-team-archive/plugins](https://awesome-repositories.com/repository/flutter-team-archive-plugins.md) (17,710 ⭐) — This project is a collection of official plugin packages and a native integration library designed to provide a consistent interface for accessing hardware and software functionality across different mobile and desktop platforms. It serves as a native platform bridge, enabling cross-platform applications to invoke native code and manage operating system dependencies.

The project utilizes a federated plugin architecture, splitting plugins into common interfaces and separate platform implementations to allow for independent development and extension. It further supports native integration throu
- [benthosdev/benthos](https://awesome-repositories.com/repository/benthosdev-benthos.md) (8,681 ⭐) — Benthos is a stream processing engine and data integration pipeline used for routing, transforming, and connecting data streams between diverse sources and sinks. It functions as event routing middleware and a change data capture tool, streaming real-time database modifications as discrete events for downstream processing.

The system utilizes a declarative pipeline configuration, where data flow and processing logic are defined in a single static file. It features a specialized domain-specific language for mapping, filtering, and enriching data payloads, allowing for complex transformations w
- [kivy/python-for-android](https://awesome-repositories.com/repository/kivy-python-for-android.md) (8,888 ⭐) — python-for-android is a toolchain that compiles Python applications and their dependencies into installable Android APK or AAB packages. It bundles a Python interpreter and standard library into an Android package, enabling Python code to run natively on mobile devices. The project provides a recipe-based build engine that automates dependency resolution, version pinning, and custom compilation steps for Android targets.

The system cross-compiles Python and native C-extension libraries for multiple Android CPU architectures, producing separate native binaries for each target and packaging the
- [apify/crawlee](https://awesome-repositories.com/repository/apify-crawlee.md) (24,002 ⭐) — Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture.

The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
- [alibaba/roll](https://awesome-repositories.com/repository/alibaba-roll.md) (2,844 ⭐) — ROLL is a distributed reinforcement learning framework and model alignment toolkit designed for large language models. It serves as a scalable training pipeline and GPU cluster manager, providing the infrastructure to align model behavior using reinforcement learning algorithms and preference optimization techniques.

The project distinguishes itself through an agentic rollout orchestrator that generates and collects multi-turn interaction trajectories between AI agents and simulated environments. It supports specialized alignment methods including Direct Preference Optimization, reinforcement
- [hkuds/lightrag](https://awesome-repositories.com/repository/hkuds-lightrag.md) (36,651 ⭐) — LightRAG is a graph-based retrieval framework designed to build retrieval-augmented generation pipelines. It structures unstructured text into knowledge graphs, enabling multi-hop reasoning and complex query synthesis across large document collections. By integrating dense vector embeddings with structured knowledge graphs, the system facilitates both similarity-based and relationship-aware information retrieval.

The framework distinguishes itself through a dual-level retrieval strategy that combines low-level keyword matching with high-level semantic graph traversal to capture both specific
- [esri/spatial-framework-for-hadoop](https://awesome-repositories.com/repository/esri-spatial-framework-for-hadoop.md) (376 ⭐) — The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis.
- [google-ai-edge/mediapipe](https://awesome-repositories.com/repository/google-ai-edge-mediapipe.md) (35,660 ⭐) — MediaPipe is a cross-platform machine learning framework designed for deploying vision, audio, and text processing models across mobile, desktop, and web environments. It functions as an on-device inference engine that executes complex models locally on edge hardware, ensuring low latency and privacy without requiring a constant internet connection.

The framework utilizes a graph-based pipeline orchestration system where data flows through a directed network of modular calculators to ensure synchronized and deterministic processing. It distinguishes itself through a unified runtime that provi
- [thephpleague/pipeline](https://awesome-repositories.com/repository/thephpleague-pipeline.md) (1,000 ⭐) — League\Pipeline
- [datalab-to/marker](https://awesome-repositories.com/repository/datalab-to-marker.md) (36,137 ⭐) — Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale.

The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized
- [hyfather/pipeline](https://awesome-repositories.com/repository/hyfather-pipeline.md) (61 ⭐) — Pipelines using goroutines
- [argoproj/argo](https://awesome-repositories.com/repository/argoproj-argo.md) (16,770 ⭐) — Argo is a cloud native CI/CD platform and Kubernetes workflow engine. It functions as a container pipeline orchestrator and job scheduler, managing multi-step sequences of containers as jobs using directed acyclic graphs within a cluster.

The system acts as a progressive delivery controller, reducing release risk through automated Canary and Blue-Green deployment strategies. It provides declarative GitOps synchronization to mirror the state of a git repository directly into the cluster environment for continuous delivery automation.

The platform covers a broad range of capabilities including
- [datalab-to/surya](https://awesome-repositories.com/repository/datalab-to-surya.md) (20,889 ⭐) — Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion.

The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks
- [atuinsh/atuin](https://awesome-repositories.com/repository/atuinsh-atuin.md) (30,266 ⭐) — Atuin is a command-line tool that replaces standard shell history with a searchable, encrypted SQLite database. By hooking into shell initialization scripts, it provides an interactive, keyboard-driven interface for real-time command filtering and retrieval. The platform ensures data privacy through a client-side encryption layer, securing sensitive history and configuration data before it is synchronized across multiple machines.

Beyond history management, Atuin functions as an executable documentation platform that enables teams to create and share interactive runbooks. These documents use
- [gaia-pipeline/gaia](https://awesome-repositories.com/repository/gaia-pipeline-gaia.md) (5,216 ⭐) — Build powerful pipelines in any programming language.
- [daytonaio/daytona](https://awesome-repositories.com/repository/daytonaio-daytona.md) (72,416 ⭐) — Daytona is a cloud-native development environment platform designed to orchestrate ephemeral, containerized workspaces. It provides a centralized system for managing reproducible coding environments as code, ensuring consistency across distributed teams by abstracting the underlying infrastructure. By utilizing declarative configuration, the platform automates the entire lifecycle of development sandboxes, from initial provisioning to resource governance.

The platform distinguishes itself through its infrastructure-agnostic runner layer, which allows development environments to be deployed ac
- [clickhouse/clickhouse](https://awesome-repositories.com/repository/clickhouse-clickhouse.md) (48,229 ⭐) — ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring.

The platform distinguishes itself through ad
- [yolain/comfyui-easy-use](https://awesome-repositories.com/repository/yolain-comfyui-easy-use.md) (2,567 ⭐) — ComfyUI-Easy-Use is a custom node suite and workflow optimizer designed to simplify Stable Diffusion generation pipelines. It provides a set of integrated tools to reduce visual clutter and streamline the process of creating images from text and existing image references.

The project distinguishes itself through a pipeline manager that consolidates models, conditioning, and latents into unified data pipes, eliminating complex wiring in the node graph. It also introduces a logical operator set that enables conditional if-else branching and for-loop structures directly within the visual program
- [tomnicholas/python-for-scientists](https://awesome-repositories.com/repository/tomnicholas-python-for-scientists.md) (359 ⭐) — A list of recommended Python libraries, and resources, intended for scientific Python users.
- [arktypeio/arktype](https://awesome-repositories.com/repository/arktypeio-arktype.md) (7,780 ⭐) — Arktype is a TypeScript runtime validation library and schema orchestrator. It synchronizes TypeScript types with runtime data validation, allowing users to define type-safe schemas that ensure unknown data adheres to specific structures during application execution.

The project distinguishes itself by using set-theory type analysis to determine intersections and subtype compatibility, alongside JIT-compiled validation functions for optimized performance. It supports advanced type modeling through branded type constraints, recursive alias resolution, and the ability to generate runtime valida
- [iamseancheney/python_for_data_analysis_2nd_chinese_version](https://awesome-repositories.com/repository/iamseancheney-python-for-data-analysis-2nd-chinese-version.md) (8,937 ⭐) — This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data.

The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
- [chopratejas/headroom](https://awesome-repositories.com/repository/chopratejas-headroom.md) (29,537 ⭐) — Headroom is an AI gateway proxy and token optimizer designed to reduce the cost and latency of large language model interactions. It functions as an intermediary that intercepts traffic between clients and providers to apply context compression, request routing, and format translation.

The system differentiates itself through a Model Context Protocol server implementation that delivers compression and retrieval tools to compatible AI hosts. It employs a content-aware compression pipeline and tiered importance scoring to trim redundant data from logs and tool outputs while preserving essential
- [atsushisakai/pythonrobotics](https://awesome-repositories.com/repository/atsushisakai-pythonrobotics.md) (29,772 ⭐) — PythonRobotics is a comprehensive collection of modular robotics algorithms and educational simulations designed for autonomous navigation, state estimation, and motion control. The project provides a library of standalone implementations for path planning, localization, mapping, and kinematics, serving as a resource for researchers and students to experiment with foundational and advanced robotic theories.

The project distinguishes itself through an algorithm-centric design where each module functions as an isolated script, allowing for independent testing and clear pedagogical demonstration
- [z-shell/declare-zsh](https://awesome-repositories.com/repository/z-shell-declare-zsh.md) (10 ⭐) — Declare-zsh is a parser for Zi commands in .zshrc.
- [microsoft/graphrag](https://awesome-repositories.com/repository/microsoft-graphrag.md) (33,792 ⭐) — GraphRAG is a data processing pipeline and retrieval engine designed to transform unstructured text into interconnected knowledge graphs. By utilizing language models to extract entities and relationships, it builds structured representations of information that enable context-aware retrieval for downstream applications.

The system distinguishes itself through hierarchical graph clustering and large-scale data synthesis, which organize massive document corpora into multi-level structures. This approach allows for both vector-based semantic searches and graph-based traversals, providing a comp
- [jd-opensource/joyagent-jdgenie](https://awesome-repositories.com/repository/jd-opensource-joyagent-jdgenie.md) (11,350 ⭐) — Joyagent-jdgenie is an automated data orchestrator designed to centralize the retrieval and processing of information from disparate remote sources. It functions as a framework for building repeatable data pipelines that fetch, clean, and normalize raw input into consistent, structured formats.

The system utilizes a schema-driven engine to apply validation rules and structural templates to incoming data, ensuring compatibility across enterprise systems. By employing configuration-based workflow definitions, it allows for the orchestration of modular tasks into automated execution flows, separ
- [krzjoa/awesome-python-data-science](https://awesome-repositories.com/repository/krzjoa-awesome-python-data-science.md) (3,468 ⭐) — Probably the best curated list of data science software in Python.
- [clearml/clearml](https://awesome-repositories.com/repository/clearml-clearml.md) (6,740 ⭐) — ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts.

The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
- [scikit-learn/scikit-learn](https://awesome-repositories.com/repository/scikit-learn-scikit-learn.md) (66,344 ⭐) — Scikit-learn is a machine learning library for predictive data analysis that provides a collection of algorithms for supervised and unsupervised learning. It functions as a comprehensive toolkit for data preprocessing, dimensionality reduction, and model selection, allowing users to classify data objects, predict continuous values, and cluster similar items based on historical patterns.

The project is defined by a unified interface design where objects either learn from data, transform data, or chain these operations into sequential workflows. To ensure performance on large or high-dimensiona
- [0xcert/framework](https://awesome-repositories.com/repository/0xcert-framework.md) (340 ⭐) — 0xcert Framework - JavaScript framework for building decentralized applications - build something unique
- [bruin-data/bruin](https://awesome-repositories.com/repository/bruin-data-bruin.md) (1,620 ⭐) — Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.
- [allegroai/clearml](https://awesome-repositories.com/repository/allegroai-clearml.md) (6,733 ⭐) — ClearML is a comprehensive MLOps platform designed to manage the entire machine learning lifecycle. It functions as an experiment tracking tool, a data versioning system, and a pipeline orchestrator, while providing infrastructure for GPU cluster management and model serving.

The platform is distinguished by its ability to handle hybrid-cloud compute scheduling and fractional GPU allocation, allowing multiple workloads to share a single hardware accelerator. It employs a metadata-based approach to data versioning, using virtual views to track large datasets and artifacts without duplicating r
- [kozistr/awesome-gans](https://awesome-repositories.com/repository/kozistr-awesome-gans.md) (763 ⭐) — Awesome-GANs is a curated resource list and research repository focused on the development and evaluation of generative adversarial networks. It serves as a structured index for academic literature and open-source implementations dedicated to the creation of synthetic data generators.

The project provides a framework for training competing neural networks to produce outputs that mimic the statistical properties of original datasets. It emphasizes the use of configuration-driven pipelines to manage model hyperparameters and dataset paths, facilitating reproducible research workflows and standa
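
Several entries above (for example Haystack, Taskflow, Argo, and MediaPipe) describe pipelines as directed acyclic graphs of processing steps. As a library-agnostic illustration of that idea, and not the API of any listed project, the following minimal sketch declares steps and their dependencies as plain data, then runs them in dependency order with Python's standard-library `graphlib`. The names `STEPS`, `DEPENDS_ON`, and `run_pipeline`, as well as the three toy steps, are hypothetical and exist only for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps: each receives a dict of upstream results keyed by step name.
def extract(_inputs):
    return [1, 2, 3, 4]

def double(inputs):
    return [x * 2 for x in inputs["extract"]]

def total(inputs):
    return sum(inputs["double"])

# Declarative part: the pipeline is described as data, not as control flow.
STEPS = {"extract": extract, "double": double, "total": total}
DEPENDS_ON = {
    "extract": set(),
    "double": {"extract"},
    "total": {"double"},
}

def run_pipeline(steps, depends_on):
    """Run every step after its dependencies and return all results by name."""
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(depends_on).static_order():
        upstream = {dep: results[dep] for dep in depends_on[name]}
        results[name] = steps[name](upstream)
    return results

results = run_pipeline(STEPS, DEPENDS_ON)
print(results["total"])
```

Keeping the graph in a plain mapping separates what the pipeline is from how it runs, which is roughly the property the "declarative" entries in this list advertise. Real frameworks typically add scheduling, retries, and persistence on top of this basic ordering.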
