# Scalable Data Pipeline Frameworks

> Search results for `a framework for building scalable data pipelines` on awesome-repositories.com. 112 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/a-framework-for-building-scalable-data-pipelines

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/a-framework-for-building-scalable-data-pipelines).**

## Results

- [binhnguyennus/awesome-scalability](https://awesome-repositories.com/repository/binhnguyennus-awesome-scalability.md) (71,779 ⭐) — This project is a curated knowledge repository that aggregates high-quality resources, technical documentation, and expert insights focused on distributed systems engineering. It serves as a community-driven learning resource designed to help developers navigate the complexities of building and maintaining large-scale software applications.

The repository distinguishes itself through a hierarchical taxonomy that organizes vast amounts of technical information into a structured, searchable format. By utilizing markdown-based content curation and static indexing, the collection remains version-
- [deepset-ai/haystack](https://awesome-repositories.com/repository/deepset-ai-haystack.md) (24,253 ⭐) — Haystack is an orchestration framework designed for building complex search and generative AI pipelines. It functions as an agentic workflow engine, enabling the construction of automated sequences that allow AI agents to perform multi-step reasoning and data analysis.

The framework utilizes a modular, component-based architecture that connects processing steps into directed acyclic graphs. By employing a provider-agnostic integration layer, it decouples core logic from specific external AI services and vector databases, allowing for the flexible exchange of underlying technologies. This desi
- [hazelcast/hazelcast](https://awesome-repositories.com/repository/hazelcast-hazelcast.md) (6,570 ⭐) — Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources.

What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
- [dagster-io/dagster](https://awesome-repositories.com/repository/dagster-io-dagster.md) (14,974 ⭐) — Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality.

The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
- [jenkinsci/pipeline-examples](https://awesome-repositories.com/repository/jenkinsci-pipeline-examples.md) (4,296 ⭐) — This project is a library of version-controlled workflow definitions and a collection of Groovy scripts and configuration snippets for implementing continuous integration and delivery automation in Jenkins. It serves as a reference for building automated pipelines using both declarative syntax and scripted logic.

The repository provides template collections and implementation patterns for creating software build and deployment workflows. It includes reusable functions and logic patterns designed to standardize pipeline behavior and eliminate code duplication across multiple projects through t
- [gaia-pipeline/gaia](https://awesome-repositories.com/repository/gaia-pipeline-gaia.md) (5,216 ⭐) — Gaia is a polyglot pipeline orchestrator and continuous integration and delivery automation platform. It functions as a multi-language workflow engine that coordinates the movement and transformation of data by executing tasks written in different programming languages through a dependency graph.

The platform distinguishes itself with a visual pipeline configurator for mapping function arguments via a management portal and a secure secret manager that uses ciphers to encrypt passwords and tokens. It further automates the software lifecycle by cloning repositories and recompiling applications
- [cake-build/cake](https://awesome-repositories.com/repository/cake-build-cake.md) (4,179 ⭐) — Cake is a cross-platform build automation system and scripting framework that allows users to define software build pipelines using C# scripts. It functions as a CI/CD pipeline orchestrator and build runner, providing a strongly-typed domain-specific language to simplify the orchestration of compilation, testing, and packaging processes across Windows, Linux, and macOS.

The system ensures reproducible build environments by pinning the versions of build tools, modules, and dependencies. It distinguishes itself by enabling a C# scripting workflow with full IDE support, including autocomplete, s
- [datajuicer/data-juicer](https://awesome-repositories.com/repository/datajuicer-data-juicer.md) (6,574 ⭐) — Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines.

The project distinguishes itself through a YAML-based data recipe sys
- [awslabs/data-pipeline-samples](https://awesome-repositories.com/repository/awslabs-data-pipeline-samples.md) (472 ⭐) — This repository hosts sample pipelines
- [dragonflydb/dragonfly](https://awesome-repositories.com/repository/dragonflydb-dragonfly.md) (30,688 ⭐) — Dragonfly is a high-performance, multi-model in-memory data store designed to serve as a drop-in replacement for existing database infrastructures. By utilizing a multi-threaded, shared-nothing architecture and a fiber-based concurrency model, it maximizes CPU utilization and minimizes latency for read and write operations. The system supports a wide range of data structures, including strings, hashes, lists, sets, sorted sets, and JSON documents, while maintaining full compatibility with standard industry wire protocols and client libraries.

What distinguishes Dragonfly is its focus on effic
- [mikefarah/yq](https://awesome-repositories.com/repository/mikefarah-yq.md) (14,913 ⭐) — This tool is a command-line processor designed for querying, updating, and transforming structured data files. It functions as a versatile engine for manipulating YAML, JSON, TOML, and XML documents, allowing users to perform complex operations directly from the terminal. By utilizing a path-based expression language, it enables precise navigation and modification of data structures within configuration files and infrastructure-as-code workflows.

What distinguishes this tool is its ability to perform in-place document mutations while preserving original formatting, comments, and metadata. It
- [tektoncd/pipeline](https://awesome-repositories.com/repository/tektoncd-pipeline.md) (8,996 ⭐) — Pipeline is a Kubernetes-native CI/CD framework and cloud-native pipeline orchestrator. It functions as a custom resource controller that translates declarative pipeline definitions into coordinated pod executions and managed workloads.

The system acts as a containerized task runner, allowing for the execution of standalone build steps and reusable tasks that process specific inputs to produce defined outputs. It enables the orchestration of complex workflows by running a sequence of independent containers as modular components within a cloud environment.

The platform covers automated softwa
- [benthosdev/benthos](https://awesome-repositories.com/repository/benthosdev-benthos.md) (8,681 ⭐) — Benthos is a stream processing engine and data integration pipeline used for routing, transforming, and connecting data streams between diverse sources and sinks. It functions as event routing middleware and a change data capture tool, streaming real-time database modifications as discrete events for downstream processing.

The system utilizes a declarative pipeline configuration, where data flow and processing logic are defined in a single static file. It features a specialized domain-specific language for mapping, filtering, and enriching data payloads, allowing for complex transformations w
- [angular/angular](https://awesome-repositories.com/repository/angular-angular.md) (100,360 ⭐) — Angular is a platform for building web applications using a component-based architecture. It provides a comprehensive suite of tools for managing encapsulated UI units, including hierarchical dependency injection, a declarative template system, and fine-grained reactivity through signals. The framework supports complex application requirements such as client-side routing, form management, and internationalization.

The project includes a command-line interface for scaffolding and build automation, alongside a testing ecosystem for unit and integration verification. It offers multiple rendering
- [astronomer/dag-factory](https://awesome-repositories.com/repository/astronomer-dag-factory.md) (1,440 ⭐) — Dag-factory is a framework for constructing and managing Apache Airflow data pipelines through declarative configuration files. By replacing manual procedural code with structured YAML definitions, it enables the programmatic generation of complex workflow structures, task dependencies, and execution schedules.

The project distinguishes itself by mapping configuration keys directly to Python class constructors and operators, allowing for the dynamic instantiation of objects and custom logic. It supports hierarchical configuration inheritance to standardize settings across environments and pro
- [simonlin1212/a-stock-data](https://awesome-repositories.com/repository/simonlin1212-a-stock-data.md) (5,603 ⭐) — This project is a comprehensive market data toolkit and financial analysis system specifically designed for China A-shares. It serves as a data pipeline for retrieving real-time quotes, aggregating corporate financial statements, and automating equity research.

The system distinguishes itself through specialized monitors for institutional capital movements, including Northbound fund flows, margin trading balances, and large block transactions. It also features a dedicated options Greeks calculator for ETF derivatives and tools to gauge market sentiment via retail popularity rankings and trend
- [teckkean/gtfs-data-pipeline-tfnsw-bus](https://awesome-repositories.com/repository/teckkean-gtfs-data-pipeline-tfnsw-bus.md) (8 ⭐) — Introduction; Data Availability Statement; Data Pipeline Directory Structure; Data Pipeline Operations: 1.1 Convert .PB.GZ to .CSV Files, 1.2 Transform .CSV Files, 1.2A Transform .CSV Files by Agency (Daily to Monthly), 1.3 Prepare Cleaned Unique Datasets
- [clickhouse/clickhouse](https://awesome-repositories.com/repository/clickhouse-clickhouse.md) (48,229 ⭐) — ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring.

The platform distinguishes itself through ad
- [pathwaycom/pathway](https://awesome-repositories.com/repository/pathwaycom-pathway.md) (62,959 ⭐) — Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources.

The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
- [inosrahul/f1-data-pipeline](https://awesome-repositories.com/repository/inosrahul-f1-data-pipeline.md) (0 ⭐)
- [google/clusterfuzz](https://awesome-repositories.com/repository/google-clusterfuzz.md) (5,574 ⭐) — ClusterFuzz is an automated platform that runs coverage-guided fuzzers at scale to find security and stability bugs in software. It orchestrates libFuzzer and AFL++ across distributed clusters of worker bots, collecting coverage feedback to guide input mutation and discover crashes. The platform provides a web-based dashboard for configuring fuzzing jobs, monitoring progress, and inspecting crash reports, with role-based access control to restrict sensitive features.

The system automates the full fuzzing lifecycle, from build pipeline integration and corpus management to crash triage and bug
- [fastai/course22](https://awesome-repositories.com/repository/fastai-course22.md) (3,398 ⭐) — This is a structured deep learning curriculum for programmers, delivered as a collection of Jupyter notebooks. It teaches the fundamentals of training neural networks for computer vision, natural language processing, tabular data analysis, and collaborative filtering using PyTorch and the fastai library. The course is designed to be hands-on, guiding learners from building a training loop from scratch to fine-tuning pretrained models for a variety of practical tasks.

The curriculum distinguishes itself by covering the full lifecycle of a deep learning project, from data preparation and augmen
- [fredkschott/snowpack](https://awesome-repositories.com/repository/fredkschott-snowpack.md) (19,329 ⭐) — Snowpack is an ESM-powered frontend build tool and development server that serves native ES modules directly to the browser. By eliminating the bundling process during development, it enables nearly instant server startup and unbundled frontend development.

The project features a framework-aware hot module reload system that preserves component state during updates, with specific Fast Refresh integration for React, Preact, Svelte, and Vue. It also acts as a modern web transpiler, automatically converting TypeScript, JSX, and CSS Modules into browser-compatible code without requiring manual co
- [esri/spatial-framework-for-hadoop](https://awesome-repositories.com/repository/esri-spatial-framework-for-hadoop.md) (376 ⭐) — The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis.
- [ucbepic/docetl](https://awesome-repositories.com/repository/ucbepic-docetl.md) (3,597 ⭐) — docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas.

The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
- [chonkie-inc/chonkie](https://awesome-repositories.com/repository/chonkie-inc-chonkie.md) (4,170 ⭐) — Chonkie is a text chunking library designed for retrieval-augmented generation pipelines. It functions as a semantic text splitter and RAG ingestion pipeline, transforming raw text into embedded segments for storage in vector databases.

The project distinguishes itself through specialized splitting strategies, including an AST-based code splitter for preserving logical boundaries in source code and a semantic text splitter that uses embedding models to determine boundaries based on meaning. It also provides a vector database ingestor to automate the generation of embeddings and their export t
- [taskflow/taskflow](https://awesome-repositories.com/repository/taskflow-taskflow.md) (12,013 ⭐) — Taskflow is a C++ task-parallel framework designed to build high-performance parallel workflows and complex dependency graphs. It provides a programming model that organizes computational work into directed acyclic graphs, enabling developers to manage concurrency, resource scheduling, and task dependencies across multi-core CPUs and GPU accelerators.

The framework distinguishes itself through its ability to orchestrate heterogeneous systems, allowing for the integration of hardware-accelerated kernels and memory operations into unified execution pipelines. It supports dynamic runtime subflow
- [kubeflow/pipelines](https://awesome-repositories.com/repository/kubeflow-pipelines.md) (4,154 ⭐) — This project is a containerized machine learning workflow engine and orchestrator designed to automate the end-to-end lifecycle of machine learning models on Kubernetes clusters. It functions as an MLOps pipeline compiler that transforms a domain-specific language into structured specifications for portable and scalable deployment.

The platform provides a multi-tenant environment with isolated namespaces and identity provider authentication. It distinguishes itself through a combination of container-based task isolation, strongly typed artifact management for data passing, and content-address
- [lotabout/let-s-build-a-compiler](https://awesome-repositories.com/repository/lotabout-let-s-build-a-compiler.md) (580 ⭐) — A C and x86 version of "Let's Build a Compiler" by Jack Crenshaw
- [infiniflow/ragflow](https://awesome-repositories.com/repository/infiniflow-ragflow.md) (82,922 ⭐) — This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasoning workflows. By integrating document intelligence with advanced retrieval pipelines, the platform enables the creation of grounded, verifiable responses supported by traceable citations.

The platform distinguishes itself through deep document understanding and sophisticated know
- [nathanmarz/storm](https://awesome-repositories.com/repository/nathanmarz-storm.md) (8,772 ⭐) — Storm is a distributed stream processing framework and fault-tolerant compute engine designed for executing real-time continuous computations across a cluster of machines. It functions as a stateful stream processor and cluster topology manager, enabling the deployment and monitoring of distributed data flow configurations.

The system can provide exactly-once semantics by using transactional state management, so that each message in a data stream is processed exactly once. It further operates as a distributed RPC system, allowing for the integration of non-native languages throu
- [thephpleague/pipeline](https://awesome-repositories.com/repository/thephpleague-pipeline.md) (1,000 ⭐) — League\Pipeline
- [hkuds/lightrag](https://awesome-repositories.com/repository/hkuds-lightrag.md) (36,651 ⭐) — LightRAG is a graph-based retrieval framework designed to build retrieval-augmented generation pipelines. It structures unstructured text into knowledge graphs, enabling multi-hop reasoning and complex query synthesis across large document collections. By integrating dense vector embeddings with structured knowledge graphs, the system facilitates both similarity-based and relationship-aware information retrieval.

The framework distinguishes itself through a dual-level retrieval strategy that combines low-level keyword matching with high-level semantic graph traversal to capture both specific
- [cube-js/cube](https://awesome-repositories.com/repository/cube-js-cube.md) (20,251 ⭐) — Cube is a semantic data layer that provides a unified framework for defining business metrics, dimensions, and relationships across diverse data sources. By acting as a headless business intelligence engine, it transforms raw data into a governed model that can be queried via SQL, REST, and GraphQL interfaces. This architecture ensures consistent data definitions and logic across all downstream analytical applications and reporting tools.

The platform distinguishes itself through its integrated conversational AI capabilities, which allow users to explore data using natural language. It orches
- [hyfather/pipeline](https://awesome-repositories.com/repository/hyfather-pipeline.md) (61 ⭐) — Pipelines using goroutines
- [grid-js/gridjs](https://awesome-repositories.com/repository/grid-js-gridjs.md) (4,692 ⭐) — Grid.js is a framework-agnostic JavaScript library for rendering interactive data grids. It allows for the display of structured information in tabular formats across different frontend environments, supporting data population from static arrays or JSON imports.

The library features a plugin system for extending user interface components and logic, as well as a custom data pipeline for transforming information before it is displayed. It includes built-in support for multilingual localization to translate interface elements and messages.

The project covers core data visualization capabilities
- [open-webui/pipelines](https://awesome-repositories.com/repository/open-webui-pipelines.md) (2,403 ⭐) — Pipelines: Versatile, UI-Agnostic OpenAI-Compatible Plugin Framework
- [willkoehrsen/machine-learning-project-walkthrough](https://awesome-repositories.com/repository/willkoehrsen-machine-learning-project-walkthrough.md) (1,281 ⭐) — This project is an educational resource and step-by-step guide for implementing end-to-end machine learning workflows. It provides a structured walkthrough for managing the entire lifecycle of a predictive modeling project, from initial data cleaning and feature engineering to final model training and performance assessment.

The repository utilizes interactive documents to interleave code, data visualizations, and narrative explanations, facilitating a reproducible approach to data science. By following this guided sequence, users can construct and orchestrate pipelines that transform raw dat
- [datahub-project/datahub](https://awesome-repositories.com/repository/datahub-project-datahub.md) (12,141 ⭐) — DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations.

The platform distinguishes itself through its focus on grounding artificial intelligence and autono
- [google-ai-edge/mediapipe](https://awesome-repositories.com/repository/google-ai-edge-mediapipe.md) (35,660 ⭐) — MediaPipe is a cross-platform machine learning framework designed for deploying vision, audio, and text processing models across mobile, desktop, and web environments. It functions as an on-device inference engine that executes complex models locally on edge hardware, ensuring low latency and privacy without requiring a constant internet connection.

The framework utilizes a graph-based pipeline orchestration system where data flows through a directed network of modular calculators to ensure synchronized and deterministic processing. It distinguishes itself through a unified runtime that provi
- [coollabsio/coolify](https://awesome-repositories.com/repository/coollabsio-coolify.md) (57,055 ⭐) — This project is a self-hosted platform-as-a-service that provides a centralized management interface for deploying, configuring, and monitoring containerized applications and databases on private infrastructure. It functions as a visual control plane, automating the end-to-end lifecycle of services from source code to production. By managing container orchestration, networking, and resource allocation, it allows users to maintain full control over their own hardware while streamlining the delivery of software.

The platform distinguishes itself through its agentless architecture, which uses se
- [0xcert/framework](https://awesome-repositories.com/repository/0xcert-framework.md) (340 ⭐) — 0xcert Framework - JavaScript framework for building decentralized applications - build something unique
- [dotnetcore/dotnetspider](https://awesome-repositories.com/repository/dotnetcore-dotnetspider.md) (4,137 ⭐) — DotnetSpider is a .NET web crawling framework and C# data extraction tool designed for automated web page discovery and the retrieval of structured data from the internet at scale. It functions as a high-level web scraping library for collecting information from various websites.

The framework provides capabilities for automated web crawling and large-scale data scraping. It enables web content extraction to facilitate the creation of local databases or the analysis of online information through programmatic web automation within the .NET ecosystem.

The system utilizes a pipeline-based data
- [apple/foundationdb](https://awesome-repositories.com/repository/apple-foundationdb.md) (16,446 ⭐) — FoundationDB is an ACID-compliant distributed transactional key-value store. It functions as a scalable database engine that ensures strict serializability and data consistency across a cluster of servers using a shared-nothing architecture.

The system is distinguished by its multi-region replication capabilities, allowing data to be synchronized across different datacenters for high availability and disaster recovery. It utilizes optimistic concurrency control to manage distributed transactions and employs a majority-based coordination system to maintain cluster state.

The platform provides
- [facebook/react](https://awesome-repositories.com/repository/facebook-react.md) (245,669 ⭐) — React is a JavaScript library for building user interfaces based on a component-driven architecture and unidirectional data flow.
- [khuyentran1401/detect-data-drift-pipeline](https://awesome-repositories.com/repository/khuyentran1401-detect-data-drift-pipeline.md) (0 ⭐)
- [apache/incubator-skywalking](https://awesome-repositories.com/repository/apache-incubator-skywalking.md) (24,832 ⭐) — SkyWalking is a comprehensive observability stack and application performance monitoring platform. It functions as a distributed tracing system and an AI application monitor, providing a centralized suite for collecting and analyzing logs, metrics, and traces to maintain the health of containerized architectures.

The platform distinguishes itself through a service topology visualizer that renders interactive maps of infrastructure dependencies and communication patterns. It also includes specialized capabilities for generative AI workflow observation to track the execution flow and performanc
- [llrizvanll/rn-scalable-rental-app](https://awesome-repositories.com/repository/llrizvanll-rn-scalable-rental-app.md) (46 ⭐) — A sample app written in React Native for understanding scalable app design; it showcases clean architecture and scalability with SOLID principles.
- [cloudflare/cloudflare-docs](https://awesome-repositories.com/repository/cloudflare-cloudflare-docs.md) (4,859 ⭐) — This repository is a technical documentation site and a collection of guides and references for implementing networking, security, and cloud infrastructure services. It functions as a static-site generated portal and a headless content platform, separating source files from the presentation layer to enable flexible rendering.

The project uses markdown-based documentation stored in a version-controlled Git repository. It provides specialized technical content, including AI platform documentation for building agents and managing inference, a cloud infrastructure guide for DNS and CDN conf
- [armbian/build](https://awesome-repositories.com/repository/armbian-build.md) (5,110 ⭐) — This repository is the Armbian build framework — an embedded Linux build system for generating custom operating system images tailored to single-board computers, primarily targeting ARM and RISC-V architectures. The build process is orchestrated by GNU Makefiles and relies on a chroot-based environment to assemble the root filesystem, manage cross-compilation toolchains, and aggregate binary firmware blobs for hardware compatibility. Kernel and bootloader source trees are fetched via git, with structured patches applied in a controlled sequence, while each supported board is described by a ded
