Metaflow is a Python machine learning framework and MLOps workflow orchestrator designed to manage the lifecycle of data pipelines from local prototyping to production. It serves as a distributed compute manager and an experiment tracking system, enabling the creation of reproducible pipelines that transition between development and high-availability production environments.
The framework distinguishes itself through an integrated checkpointing system that automatically persists intermediate data artifacts to remote storage, allowing failed runs to be resumed from the last successful step. It provides specialized compute orchestration for scaling workloads across cloud CPUs and GPUs using ephemeral clusters, vertical scaling for memory-intensive tasks, and spot instance management to optimize infrastructure costs.
The project covers a broad surface of pipeline capabilities, including DAG-based workflow orchestration with conditional routing and parallel execution. It provides tools for ML experiment tracking, metadata querying, and result visualization, alongside data management features for interacting with cloud object storage and data warehouses.
Workflows can be developed and executed within notebooks or via a command-line interface, with support for packaging local code and dependencies for consistent remote execution.