10 repositories
Systems for defining, scheduling, and executing complex sequences of data analysis and transformation tasks.
Distinguishing note: Focuses on the orchestration of analytical queries within automated pipelines rather than the storage engine itself.
Explore 10 awesome GitHub repositories matching data & databases · Data Processing Workflows.
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Execute complex data analysis and graph traversals against distributed stores to incorporate advanced insights directly into automated data processing workflows.
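To illustrate how a directed acyclic graph fixes task execution order, here is a minimal sketch using only the Python standard library. It is not Airflow's API; the task names and the `run_order` helper are hypothetical and exist only for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(deps):
    """Return one valid execution order for the dependency graph.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Because every task in this example has exactly one predecessor, only one ordering is valid; a graph with independent branches would admit several.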
Open-notebook is a collaborative workspace designed for knowledge management and structured data workflows. It functions as a centralized repository where users can document, refine, and retrieve information while interacting with artificial intelligence models to generate content and process complex data. The platform distinguishes itself through a local-first data persistence model that ensures offline availability and performance, paired with state-synchronized collaborative editing for real-time team sessions. It utilizes a virtualized rendering engine to maintain interface responsiveness
Organizes complex information processing tasks into collaborative workflows to simplify project tracking and team productivity.
Luigi is a Python framework designed for building and managing complex batch data pipelines. It functions as a workflow orchestration engine that organizes tasks into directed acyclic graphs, ensuring that jobs execute in the correct logical order based on their dependencies. By utilizing a centralized scheduler, the system coordinates task execution across distributed environments, tracks global workflow state, and prevents redundant processing by verifying the existence of output targets before triggering any work. The project distinguishes itself through a robust state-tracking mechanism t
Tracks data table and partition existence to coordinate dependencies within complex data processing workflows.
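The idea of skipping work when an output target already exists can be sketched with the standard library alone. This is an illustration of the general technique, not Luigi's API; `run_if_missing`, the file name, and the sample rows are assumptions made up for the example.

```python
import tempfile
from pathlib import Path


def run_if_missing(target: Path, work) -> bool:
    """Run work() only when target does not exist; return True if it ran."""
    if target.exists():
        return False
    # Write to a temporary sibling first, then rename, so a half-written
    # file is never mistaken for a finished output.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(work())
    tmp.replace(target)
    return True


workdir = Path(tempfile.mkdtemp())
target = workdir / "daily_counts.csv"  # hypothetical output target

first = run_if_missing(target, lambda: "day,count\n2024-01-01,3\n")
second = run_if_missing(target, lambda: "day,count\n2024-01-01,3\n")
print(first, second)
```

The second call does nothing because the target is already present, which is what makes re-running a pipeline cheap and safe.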
BigData-Notes is a big data learning resource and data engineering knowledge base. It provides a collection of guides, technical references, and documentation focused on the installation and configuration of distributed data processing technologies. The project covers a learning path for distributed systems, including the setup of large-scale data storage and computing clusters. It specifically addresses both batch and stream processing workflows and the implementation of data APIs for interacting with distributed messaging and storage systems. The materials are organized using markdown-base
Covers the execution and definition of batch and stream processing tasks using distributed computing engines.
This tool is a command-line processor designed for querying, updating, and transforming structured data files. It functions as a versatile engine for manipulating YAML, JSON, TOML, and XML documents, allowing users to perform complex operations directly from the terminal. By utilizing a path-based expression language, it enables precise navigation and modification of data structures within configuration files and infrastructure-as-code workflows. What distinguishes this tool is its ability to perform in-place document mutations while preserving original formatting, comments, and metadata. It
Automates complex data manipulation and aggregation tasks within shell-based scripting workflows.
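As a rough picture of path-based navigation and modification, the sketch below walks a dotted path through nested mappings. It does not reproduce the tool's expression language, and unlike the tool it makes no attempt to preserve comments or formatting; `set_path`, `get_path`, and the sample document are hypothetical.

```python
import json


def set_path(doc, path, value):
    """Set the value at a dotted path such as "a.b.c", creating missing mappings."""
    keys = path.split(".")
    node = doc
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return doc


def get_path(doc, path):
    """Return the value at a dotted path; raises KeyError if any key is missing."""
    node = doc
    for key in path.split("."):
        node = node[key]
    return node


# Hypothetical configuration document.
config = json.loads('{"service": {"name": "api", "replicas": 1}}')
set_path(config, "service.replicas", 3)
set_path(config, "service.resources.memory", "256Mi")
print(json.dumps(config, indent=2))
```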
DolphinScheduler is a distributed workflow orchestrator designed to manage and automate complex data processing pipelines. It functions as a data pipeline scheduler that coordinates multi-step tasks across distributed environments, ensuring reliable execution through defined dependencies and sequences. The platform utilizes a directed acyclic graph model to represent workflows, allowing users to define task relationships via a visual interface. It employs a master-worker architecture supported by a pluggable task plugin system, which enables the dynamic extension of task types without requiri
Coordinates large-scale data processing jobs across diverse infrastructure to ensure reliable data movement.
Automatisch is an open-source, self-hosted automation platform designed to orchestrate multi-stage workflows across various third-party web services. It functions as a no-code integration engine that allows users to connect disparate applications, enabling the automated movement of data and the execution of tasks without manual intervention. By running the platform on private infrastructure, users maintain full control over their data and ensure compliance with internal security policies. The platform distinguishes itself through a focus on secure, local credential management and flexible int
Processes trigger events and action responses within the system to facilitate data movement between workflow steps.
Great Expectations is a data quality testing framework and observability platform designed to monitor the reliability of data pipelines. It provides a structured environment for defining, documenting, and automating data quality assertions, allowing teams to validate datasets against expected structure and content before they move through downstream processes. The project distinguishes itself through a declarative domain-specific language that stores quality rules as version-controlled configuration files. It utilizes an execution engine abstraction to translate these high-level assertions in
Integrates validation steps directly into data processing workflows to ensure reliability during scheduled jobs.
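A validation step of this kind can be pictured as a set of declarative checks that run before data moves downstream. The sketch below is plain Python, not the Great Expectations API; the function names, column names, and sample rows are assumptions for illustration.

```python
def expect_not_null(rows, column):
    """Check that no row has a missing value in the given column."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "success": not bad, "failing_rows": bad}


def expect_between(rows, column, low, high):
    """Check that every value in the column lies within [low, high]."""
    bad = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is None or not (low <= row[column] <= high)
    ]
    return {"check": f"{low} <= {column} <= {high}", "success": not bad, "failing_rows": bad}


# Hypothetical batch of records arriving at a pipeline step.
rows = [
    {"order_id": 1, "amount": 19.5},
    {"order_id": 2, "amount": -4.0},
    {"order_id": None, "amount": 7.25},
]

results = [
    expect_not_null(rows, "order_id"),
    expect_between(rows, "amount", 0, 1000),
]
batch_ok = all(result["success"] for result in results)
print(batch_ok, results)
```

A scheduled job could halt or quarantine the batch when `batch_ok` is false instead of passing bad rows downstream.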
CUE is a constraint-based configuration language designed for data validation, schema definition, and code generation. At its core, it unifies types and values into a single concept, enabling compile-time validation that catches structural and value errors before runtime. The language treats data and constraints as the same thing, allowing a single definition to serve as both a schema and concrete configuration data. CUE distinguishes itself through its constraint-based unification engine, which combines multiple configuration sources into a single coherent result by merging their constraints
Orchestrates sequences of data processing steps driven by constraint unification.
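The notion that a schema and concrete data can be merged into one result can be approximated in a few lines. This is a deliberately simplified analogy in Python, not CUE's semantics: Python types stand in for constraints, and the `unify` function and sample values are hypothetical.

```python
def unify(a, b):
    """Merge two values; raise ValueError when they conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = unify(merged[key], value) if key in merged else value
        return merged
    if isinstance(a, type) and isinstance(b, type):
        if a is b:
            return a
        raise ValueError(f"conflicting constraints: {a.__name__} and {b.__name__}")
    if isinstance(a, type):
        a, b = b, a  # keep the concrete value in a, the constraint in b
    if isinstance(b, type):
        if isinstance(a, b):
            return a
        raise ValueError(f"{a!r} does not satisfy {b.__name__}")
    if a == b:
        return a
    raise ValueError(f"conflicting values: {a!r} and {b!r}")


schema = {"name": str, "port": int}          # constraints only
defaults = {"port": 8080}                    # partial concrete data
override = {"name": "api", "debug": False}   # more concrete data

config = unify(unify(schema, defaults), override)
print(config)
```

In this sketch the order of merging does not affect the resulting values, and two sources that disagree on a value raise an error rather than one silently winning.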
This project is a learning curriculum and programming guide for Apache Spark, providing a structured set of educational resources and practical code examples for mastering distributed data processing. It serves as a course for building scalable data workflows and big data engineering pipelines. The repository provides practical source code and project layouts that demonstrate how to connect external data stores, process streaming data, and organize code for distributed environments. It includes implementation examples for scaling machine learning algorithms across clusters to handle large tra
Guides the definition and execution of complex sequences of data analysis and transformation tasks.