# treeverse/dvc

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/treeverse-dvc).**

15,679 stars · 1,302 forks · Python · Apache-2.0

## Links

- GitHub: https://github.com/treeverse/dvc
- Homepage: https://dvc.org
- awesome-repositories: https://awesome-repositories.com/repository/treeverse-dvc.md

## Topics

`ai` `data-science` `data-version-control` `developer-tools` `machine-learning` `reproducibility` `unstructured-data`

## Description

DVC is a data versioning tool and pipeline orchestrator that tracks large datasets and machine learning models through external storage and lightweight metadata pointers. It integrates with Git by committing small placeholder files in place of heavy artifacts, which keeps the artifacts out of the repository while preserving a versioned link between code and data.
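
The placeholder idea can be illustrated with a minimal sketch. This is not DVC's implementation: the `track` helper, the cache layout, and the JSON pointer format are all hypothetical, chosen only to show how a small file committed to Git can stand in for a large artifact stored elsewhere.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into a hash-addressed cache and write a small pointer file.

    Hypothetical helper for illustration; not DVC's API or file format.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copyfile(path, cached)
    # The pointer is what would be committed instead of the artifact.
    pointer = path.with_name(path.name + ".ptr")
    pointer.write_text(json.dumps({"path": path.name, "sha256": digest}))
    return pointer


def demo() -> dict:
    """Track a 1 MB file and report the size of the file and its pointer."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data.bin"
        data.write_bytes(b"x" * 1_000_000)
        pointer = track(data, root / "cache")
        return {
            "data_bytes": data.stat().st_size,
            "pointer_bytes": pointer.stat().st_size,
            "meta": json.loads(pointer.read_text()),
        }
```

In this sketch the pointer stays a few dozen bytes regardless of how large the tracked file grows, which is the property that keeps the repository small.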

The system manages remote data caches through a synchronization layer that connects local environments to cloud storage or network filesystems. It also functions as an experiment tracker, recording hyperparameters and metrics to compare the performance of different model iterations.
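
A synchronization layer of this kind can be sketched as a set difference between two object stores. The example below is an assumption-laden illustration, not DVC's code: plain directories stand in for the local cache and the remote, and `list_objects` and `push` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def list_objects(store: Path) -> set[str]:
    """Return the relative paths of all objects in a store directory."""
    return {p.relative_to(store).as_posix() for p in store.rglob("*") if p.is_file()}


def push(local: Path, remote: Path) -> list[str]:
    """Copy objects present locally but absent remotely; return what was sent.

    Assumes objects are named by content hash, so an equal name implies
    equal content and existing remote objects never need re-uploading.
    """
    missing = sorted(list_objects(local) - list_objects(remote))
    for name in missing:
        target = remote / name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local / name, target)
    return missing


def demo() -> tuple[list[str], list[str]]:
    """Push twice; the second push should find nothing left to transfer."""
    with tempfile.TemporaryDirectory() as tmp:
        local, remote = Path(tmp) / "local", Path(tmp) / "remote"
        (local / "ab").mkdir(parents=True)
        remote.mkdir()
        (local / "ab" / "cdef").write_bytes(b"model weights")
        first = push(local, remote)
        second = push(local, remote)
        return first, second
```

Because only missing objects are transferred, repeating a push is cheap, which matters when the cache holds many large files.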

The framework supports reproducible computational graphs by managing the dependencies between data, code, and the commands that process them. This makes it possible to track model lineage and to check, through hooks that run at commit time, that tracked data is consistent with the versioned code.
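
The dependency-graph idea can be sketched with Python's standard `graphlib` module. The stage names and the `stale_stages` helper are hypothetical and only illustrate the general technique of ordering stages and finding what must rerun after a change; they do not describe DVC's internals.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the stages it depends on.
stages = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "prepare"},
}


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def stale_stages(graph: dict[str, set[str]], changed: str) -> set[str]:
    """Return `changed` plus every stage downstream of it.

    Walking in topological order means a stage's dependencies are always
    classified before the stage itself, so a single pass is enough.
    """
    stale = {changed}
    for stage in execution_order(graph):
        if graph[stage] & stale:
            stale.add(stage)
    return stale
```

For example, `stale_stages(stages, "featurize")` marks `featurize`, `train`, and `evaluate` for rerunning while leaving `prepare` untouched.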

## Tags

### Data & Databases

- [Data Pipeline Orchestration](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration.md) — Orchestrates complex sequences of data processing tasks by managing dependencies between code and data.
- [Content-Addressable Storage](https://awesome-repositories.com/f/data-databases/content-addressable-storage.md) — Implements content-addressable storage using cryptographic hashes to ensure data integrity and deduplicate large artifacts.
- [Remote Dataset Caching](https://awesome-repositories.com/f/data-databases/data-caching/remote-dataset-caching.md) — Provides a synchronization layer to cache large remote datasets locally using hash-based integrity verification.
- [Dataset Versioning Platforms](https://awesome-repositories.com/f/data-databases/data-versioning/dataset-versioning-platforms.md) — Provides a workflow for tracking historical versions of large-scale datasets to ensure machine learning reproducibility.
- [Artifact Versioning](https://awesome-repositories.com/f/data-databases/large-scale-dataset-management/artifact-versioning.md) — Tracks large datasets and machine learning models using external caches and repository placeholders. ([source](https://github.com/treeverse/dvc#readme))
- [Cloud Storage Sync Tools](https://awesome-repositories.com/f/data-databases/cloud-storage-sync-tools.md) — Synchronizes local data caches with remote cloud storage providers using standard transfer protocols. ([source](https://github.com/treeverse/dvc#readme))

### Artificial Intelligence & ML

- [Model Lineage Trackers](https://awesome-repositories.com/f/artificial-intelligence-ml/data-lineage/model-lineage-trackers.md) — Maintains a consistent provenance link between specific data versions, code, and hyperparameters used to produce a model.
- [Machine Learning Experiment Trackers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-experiment-trackers.md) — Monitors metrics and hyperparameters across multiple model iterations to identify optimal performance.
- [Experiment Tracking](https://awesome-repositories.com/f/artificial-intelligence-ml/experiment-tracking.md) — Records hyperparameters and metrics to compare the performance of different model iterations and training workflows. ([source](https://github.com/treeverse/dvc#readme))

### Part of an Awesome List

- [Data-Code Version Linking](https://awesome-repositories.com/f/awesome-lists/devtools/git-and-version-control-tools/data-code-version-linking.md) — Links specific versions of large datasets and models to the exact Git commit of the code that produced them.

### Development Tools & Productivity

- [Git-Integrated Data Versioning](https://awesome-repositories.com/f/development-tools-productivity/git-integrated-data-versioning.md) — Integrates large file tracking with Git by using placeholders to keep heavy artifacts out of the repository.
- [Pointer-Based Tracking](https://awesome-repositories.com/f/development-tools-productivity/version-control-file-operations/pointer-based-tracking.md) — Uses lightweight pointer files in Git to track large binary assets stored in an external cache.

### DevOps & Infrastructure

- [Data Pipeline Definitions](https://awesome-repositories.com/f/devops-infrastructure/infrastructure/infrastructure-as-code/orchestration-and-workflows/infrastructure-as-code-workflows/data-pipeline-definitions.md) — Allows users to define data processing pipelines as version-controlled code to ensure reproducibility. ([source](https://github.com/treeverse/dvc#readme))
- [Backup Storage Backends](https://awesome-repositories.com/f/devops-infrastructure/backup-storage-backends.md) — Provides drivers and configurations to offload large data caches to cloud or network storage providers.

### Software Engineering & Architecture

- [Directed Acyclic Graph Engines](https://awesome-repositories.com/f/software-engineering-architecture/directed-acyclic-graph-engines.md) — Provides a DAG-based execution engine to manage computational dependencies between data and code.
- [ML Pipeline Reproducibility](https://awesome-repositories.com/f/software-engineering-architecture/reproducibility-verifiers/analytical-reproducibility/ml-pipeline-reproducibility.md) — Defines dependencies between data and code to ensure computational graphs are rebuilt reliably across environments.

### System Administration & Monitoring

- [Experiment Result Comparators](https://awesome-repositories.com/f/system-administration-monitoring/agent-observability/experimentation-sandboxes/experiment-result-comparators.md) — Records hyperparameters and performance metrics in structured files to enable comparative analysis of model iterations.
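
Comparing metrics stored in structured files can be sketched as below. The JSON layout, the metric names, and the `diff_metrics` function are assumptions made for illustration and are not taken from DVC.

```python
import json


def diff_metrics(baseline_json: str, candidate_json: str) -> dict[str, float]:
    """Return candidate minus baseline for every metric both runs report."""
    baseline = json.loads(baseline_json)
    candidate = json.loads(candidate_json)
    # Metrics reported by only one run are skipped, not treated as zero.
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


# Hypothetical metrics files from two model iterations.
baseline_run = '{"accuracy": 0.5, "loss": 1.0}'
candidate_run = '{"accuracy": 0.75, "loss": 0.5, "f1": 0.6}'
deltas = diff_metrics(baseline_run, candidate_run)
```

Keeping metrics in plain files like these lets them be versioned alongside the code, so a comparison between two iterations reduces to reading two files.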
