Kedro

Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production.

The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized project structure to ensure consistency and maintainability across teams.
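The lazy-loading registry idea can be sketched in plain Python. This is a minimal illustration of the technique, not Kedro's implementation: `LazyCatalog`, `register`, `load`, and `read_companies` are hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict


class LazyCatalog:
    """Hypothetical registry mapping dataset names to loader callables."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        # Only the loader is stored here; no data is read yet.
        self._loaders[name] = loader

    def load(self, name: str) -> Any:
        # The loader runs on the first request; later requests reuse the result.
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]


calls = []  # records every time the underlying storage is actually touched


def read_companies():
    # Stand-in for reading a file or a cloud object (assumed for illustration).
    calls.append("companies")
    return [{"id": 1, "name": "acme"}]


def count_rows(rows):
    # Processing logic: it receives data and never sees where it came from.
    return len(rows)


catalog = LazyCatalog()
catalog.register("companies", read_companies)

n_rows = count_rows(catalog.load("companies"))
```

Because `count_rows` only receives rows, the storage backend behind `"companies"` can change without touching the processing function.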

The framework covers pipeline orchestration through automatic dependency resolution and visualization, alongside configuration management for environment-specific settings. It includes capabilities for multi-platform deployment across local machines and distributed clusters, as well as integration with interactive notebooks for data exploration.
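Automatic dependency resolution can be illustrated with the standard library alone. The sketch below is not Kedro's API: it assumes each node is a hypothetical `(function, input names, output name)` tuple and uses `graphlib.TopologicalSorter` (Python 3.9+) to derive the execution order from those names.

```python
from graphlib import TopologicalSorter


def clean(raw):
    return [x for x in raw if x is not None]


def total(cleaned):
    return sum(cleaned)


def report(total_value, cleaned):
    return f"{len(cleaned)} values, sum={total_value}"


# Nodes are declared out of order on purpose; the order is derived, not written.
nodes = [
    (report, ["total_value", "cleaned"], "summary"),
    (total, ["cleaned"], "total_value"),
    (clean, ["raw"], "cleaned"),
]


def run(nodes, data):
    # Map each output name to the node that produces it.
    producers = {out: (func, ins) for func, ins, out in nodes}
    # An output depends on every input that is itself produced by another node.
    graph = {
        out: {name for name in ins if name in producers}
        for out, (func, ins) in producers.items()
    }
    order = list(TopologicalSorter(graph).static_order())
    for out in order:
        func, ins = producers[out]
        data[out] = func(*(data[name] for name in ins))
    return order, data


order, results = run(nodes, {"raw": [1, None, 2, 3]})
```

Here `raw` is a free input supplied by the caller, so it never appears in the graph; only datasets produced by a node take part in the ordering.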

The toolkit provides a command line interface for workflow execution and includes utilities for benchmarking performance across commits and analyzing regressions.

Features

Data Catalogs - Provides a centralized data catalog manager to abstract data access and versioning across diverse file formats and cloud storage.

Data Science Frameworks - Provides a modular framework for building reproducible data engineering and data science workflows using software engineering best practices.

Data Access Abstractions - Implements an abstraction layer that decouples data access from processing logic by mapping datasets to specific storage backends.

Data Dependency Visualizers - Automatically resolves dependencies between functions to visualize the end-to-end flow of data through the project.

Data Pipeline Orchestration - Defines and executes modular data pipelines by automatically resolving dependencies between functions.

Dataset Versioning Platforms - Maintains versions of datasets to ensure reproducibility and enable loading of specific versions during execution.

Dataset Registries - Maintains a centralized registry of datasets supporting lazy initialization and pattern matching factories.

Python Data Pipeline Frameworks - Offers a Python-based framework for building and managing complex batch data pipelines and DAGs.

Environment Configuration Management - Loads and merges configuration files using variable interpolation and custom resolvers for environment-specific settings.

Production Data Science Toolboxes - Provides a framework to build reproducible and maintainable data workflows using software engineering practices.

MLOps Templates - Ships a standardized MLOps project template to enforce coding standards and reproducibility in machine learning projects.

Runtime Parameterization - Uses environment-specific YAML files and variable interpolation to inject settings into pipeline nodes at runtime.

DAG-Based Dependency Resolution - Determines task execution order by mapping function inputs and outputs to a directed acyclic graph.

Pipeline Component Modularization - Supports the design of isolated, reusable pipeline components that can be packaged and shared across projects.

Modular Program Composition - Enables nesting of independent pipeline objects to construct complex workflows from reusable functional units.

Project Structures - Enforces standardized directory layouts and organizational patterns to ensure consistency across data science projects.

Data Connection Registries - Implements a centralized registry that instantiates data connections only when they are requested during execution.

Interactive Data Exploration Tools - Integrates modular pipeline components with Jupyter notebooks to bridge the gap between research and production.

Lifecycle Hooks - Allows injecting custom behavior into the project lifecycle via registration hooks for pipelines, loaders, and catalogs.

Notebook Integrations - Provides a dedicated notebook extension and kernel to load project contexts and nodes into interactive environments.

Pipeline Execution CLIs - Ships a command line interface for executing specific pipelines or individual nodes with support for failure resumption.

Project Scaffolding Templates - Offers a system to generate new projects from official or custom starter templates to ensure consistency.

Data Workflow Execution - Provides the ability to execute data pipelines across local machines, distributed clusters, and cloud orchestrators.

Unified Multi-Platform Deployment - Enables running workflows across local machines, distributed clusters, or managed cloud orchestration platforms.

Hook-Based Extension Frameworks - Provides registration points to inject custom logic into project initialization and pipeline execution phases.

Project Bootstrapping Tools - Generates a standardized directory structure and configuration skeleton to enforce software engineering patterns.

Deep Learning Frameworks - Structures data science code into reproducible and modular pipelines.

Data Analysis and Processing - Toolbox for production-ready data science.

Data Pipelines - Builds robust, versioned, and reproducible data pipelines.

Data Engineering - Framework for reproducible and modular data science code.
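The environment-specific configuration and variable interpolation listed among the features above can be sketched as follows. This is an assumption-laden illustration in plain Python: the recursive merge policy, the `${name}` placeholder syntax handled by the hypothetical `interpolate` helper, and the example keys are chosen for this sketch and may differ from the behavior of Kedro's own configuration loader.

```python
import re


def merge(base, override):
    # Recursively overlay environment-specific values on the base settings.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def interpolate(value, variables):
    # Replace ${name} placeholders in every string, walking nested dicts.
    if isinstance(value, dict):
        return {k: interpolate(v, variables) for k, v in value.items()}
    if isinstance(value, str):
        return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), value)
    return value


# Stand-ins for a base configuration file and a production override file.
base = {"model": {"test_size": 0.2, "seed": 42}, "output_dir": "${root}/base"}
prod = {"model": {"test_size": 0.1}, "output_dir": "${root}/prod"}

params = interpolate(merge(base, prod), {"root": "/data"})
```

With these inputs, `params["model"]` keeps `seed` from the base settings while taking `test_size` from the production override, and `output_dir` resolves to `/data/prod`.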

kedro-org/kedro
