17 repositories
Systems for tracking and retrieving historical states of data.
Distinguishing note: Focuses on historical snapshots of web content.
Explore 17 awesome GitHub repositories matching data & databases · Data Versioning.
Changedetection.io is a self-hosted monitoring service designed to track web pages for content updates and notify users of changes. It functions as a centralized platform where users can manage tracking tasks, observe specific website elements, and receive automated alerts through various communication channels whenever modifications are detected. The service distinguishes itself through an integrated headless browser engine that executes interaction sequences, such as logins or form submissions, to access dynamic or restricted content. It maintains a historical record of page snapshots, util
Tracks historical snapshots of web pages to compare differences between versions.
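The snapshot-and-compare idea described for this entry can be sketched in a few lines of standard-library Python. This is an illustration only, not Changedetection.io's code or API: the `SnapshotHistory` class and its method names are hypothetical, and the page text is supplied by hand rather than fetched.

```python
import difflib
from datetime import datetime, timezone


class SnapshotHistory:
    """Keeps timestamped text snapshots of one page (hypothetical helper)."""

    def __init__(self):
        self._snapshots = []  # list of (timestamp, text), oldest first

    def record(self, text, when=None):
        """Store a snapshot only if the content changed; return True if stored."""
        if self._snapshots and self._snapshots[-1][1] == text:
            return False
        self._snapshots.append((when or datetime.now(timezone.utc), text))
        return True

    def diff_latest(self):
        """Unified diff between the two most recent snapshots."""
        if len(self._snapshots) < 2:
            return []
        (_, old), (_, new) = self._snapshots[-2:]
        return list(difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile="previous", tofile="current", lineterm=""))


history = SnapshotHistory()
history.record("Price: 10 EUR\nIn stock")
history.record("Price: 10 EUR\nIn stock")  # unchanged, so not stored
history.record("Price: 12 EUR\nIn stock")
changes = history.diff_latest()
print("\n".join(changes))
```

Storing a snapshot only when the text differs keeps the history short, and the diff of the last two entries is what a notification would carry.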
Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams. The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
Serves as a collaborative platform for publishing and versioning datasets to ensure research reproducibility.
Mastra is an orchestration framework designed for building, deploying, and managing autonomous AI agents and multi-agent systems. It provides a comprehensive suite of primitives for creating resilient AI applications, including durable workflow orchestration, event-driven agent loops, and semantic memory management. By integrating these core components, the platform enables developers to build complex, multi-step processes that can reason about goals and execute tasks without manual intervention. The framework distinguishes itself through its focus on observability and secure, isolated execut
Tracks mutations to dataset items, enabling experiment pinning and historical comparison of evaluation data.
Opik is an observability and evaluation platform designed for generative AI applications and agentic workflows. It provides a centralized environment for tracing execution flows, managing prompt templates, and monitoring production performance, allowing teams to gain visibility into complex model interactions and tool usage without requiring manual application code changes. The platform distinguishes itself through its integrated approach to the AI development lifecycle, combining distributed trace instrumentation with automated evaluation frameworks. It supports model-as-a-judge scoring, syn
Maintains snapshots of test cases and evaluation data to ensure reproducibility and auditability across experiment runs.
DVC is a data versioning tool and pipeline orchestrator designed to track large datasets and machine learning models. It functions as a system for managing large data artifacts by storing lightweight metadata in version control while keeping the actual binaries in a separate cache. The project serves as an experiment tracker and remote storage synchronizer, enabling the execution and comparison of machine learning iterations based on hyperparameters and performance metrics. It provides a bridge for pushing and pulling these large data artifacts between local environments and cloud or on-premi
Provides a platform for versioning large research datasets and ML models to ensure training reproducibility.
DVC is a data versioning tool and pipeline orchestrator designed to track large datasets and machine learning models using external storage and metadata pointers. It integrates with Git by utilizing placeholders to keep heavy artifacts out of the repository while maintaining a versioned link between code and data. The system manages remote data caches through a synchronization layer that connects local environments to cloud storage or network filesystems. It also functions as an experiment tracker, recording hyperparameters and metrics to compare the performance of different model iterations.
Provides a workflow for tracking historical versions of large-scale datasets to ensure machine learning reproducibility.
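The pointer-plus-cache pattern in the two DVC entries above can be illustrated with a small sketch. It assumes a SHA-256 content hash and a JSON pointer, both chosen for the example; it is not DVC's actual file format, hashing scheme, or command set, and `add_to_cache` and `restore` are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> str:
    """Copy a file into a content-addressed cache and return small pointer text.

    The pointer is what would be committed to version control; the heavy
    bytes stay in the cache.
    """
    payload = data_file.read_bytes()
    digest = hashlib.sha256(payload).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        cached.write_bytes(payload)
    return json.dumps({"path": data_file.name, "sha256": digest})


def restore(pointer_text: str, cache_dir: Path) -> bytes:
    """Resolve a pointer back to the bytes it refers to."""
    digest = json.loads(pointer_text)["sha256"]
    return (cache_dir / digest[:2] / digest[2:]).read_bytes()


workdir = Path(tempfile.mkdtemp())
cache = workdir / "cache"
data = workdir / "train.csv"

data.write_bytes(b"x,y\n1,2\n")
pointer_v1 = add_to_cache(data, cache)

data.write_bytes(b"x,y\n1,2\n3,4\n")
pointer_v2 = add_to_cache(data, cache)

# Either version can be recovered from its pointer alone.
print(restore(pointer_v1, cache))
print(restore(pointer_v2, cache))
```

Because each pointer is a short text record, keeping old pointers in version control history is enough to reach any earlier state of the data.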
This library provides a diagnostic toolkit for automated data profiling and exploratory analysis. It generates comprehensive statistical summaries and visual reports for tabular datasets, enabling users to identify distribution patterns, missing values, and quality anomalies through a unified interface. The project distinguishes itself by offering differential analysis, which allows for the comparison of two dataset versions to track structural and statistical changes over time. It supports large-scale data processing through lazy evaluation and provides interactive widgets that embed directl
Identifies structural and statistical differences between two versions of a dataset to track preprocessing impacts.
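As a rough illustration of differential analysis between two dataset versions, the sketch below profiles each version per column and reports what changed. The functions `profile` and `compare` and the sample rows are made up for the example; they do not reflect the library's real interface or its report format.

```python
from statistics import mean


def profile(rows):
    """Per-column summary: row count, missing values, mean of numeric values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            col = columns.setdefault(name, {"count": 0, "missing": 0, "values": []})
            col["count"] += 1
            if value is None:
                col["missing"] += 1
            elif isinstance(value, (int, float)) and not isinstance(value, bool):
                col["values"].append(value)
    return {
        name: {
            "count": col["count"],
            "missing": col["missing"],
            "mean": mean(col["values"]) if col["values"] else None,
        }
        for name, col in columns.items()
    }


def compare(old_rows, new_rows):
    """Structural changes (columns) and statistical changes (summaries)."""
    old, new = profile(old_rows), profile(new_rows)
    return {
        "added_columns": sorted(new.keys() - old.keys()),
        "removed_columns": sorted(old.keys() - new.keys()),
        "changed": {
            name: {"old": old[name], "new": new[name]}
            for name in sorted(old.keys() & new.keys())
            if old[name] != new[name]
        },
    }


raw = [{"age": 30, "city": "Lyon"}, {"age": None, "city": "Paris"}]
cleaned = [
    {"age": 30, "city": "Lyon", "age_imputed": False},
    {"age": 30, "city": "Paris", "age_imputed": True},
]
report = compare(raw, cleaned)
print(report)
```

Here the report shows one added column and one changed column (the missing `age` value was filled), which is the kind of preprocessing impact the entry refers to.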
Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production. The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized proje
Maintains versions of datasets to ensure reproducibility and enable loading of specific versions during execution.
Wandb is a centralized platform for machine learning experiment tracking, model registry management, and workflow orchestration. It provides a comprehensive suite of tools for logging, visualizing, and versioning training metrics, model artifacts, and hyperparameter sweeps to ensure reproducibility across development cycles. The platform also functions as an observability tool for large language model applications, enabling the tracing of execution steps, token usage, and reasoning processes. The project distinguishes itself through its event-driven automation capabilities, which allow users
Groups files and data objects into versioned collections to track assets throughout the machine learning lifecycle.
PyCaret is a Python AutoML platform and MLOps lifecycle manager designed to automate machine learning workflows. It functions as a low-code environment that leverages a scikit-learn native engine to execute preprocessing, training, and evaluation for tabular data. The platform distinguishes itself as an LLM-powered ML copilot, using large language model agents to analyze datasets, design experiment configurations, and explain model results. It also serves as a Kubernetes ML orchestrator and model registry, enabling the versioning of trained pipelines and their promotion to production API endp
Tracks historical versions of datasets using schema-aware versioning to ensure machine learning reproducibility.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Automatically tracks and manages historical versions of datasets to ensure machine learning reproducibility.
Oumi is a comprehensive large language model development platform designed for synthesizing data, fine-tuning models, and running performance evaluations. It serves as a unified environment for the entire model lifecycle, encompassing a training and fine-tuning suite, an evaluation framework, and tools for synthetic data generation and model distillation. The platform is distinguished by its iterative, failure-driven synthesis approach, which analyzes model weaknesses during evaluation to generate targeted training data. It utilizes an LLM-based judge framework to programmatically score respo
Allows reverting datasets to previous states and rerunning quality checks for reproducibility.
language-ext is a functional programming framework for C# that provides a suite of immutable data structures and monadic types. It enables the implementation of pure functional programming patterns, utilizing containers to manage side effects, optional values, and error handling. The library is distinguished by its advanced concurrency and state management tools, including a software transactional memory system and lock-free atomic references. It also provides specialized utilities for distributed systems, such as vector clocks for causality tracking and deterministic data conflict resolution
Provides historical version tracking for map entries to enable retrieval of previous states.
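The entry mentions vector clocks for causality tracking. The following is a generic Python sketch of that technique, not language-ext's C# API; the function names and the dictionary representation are assumptions made for the example.

```python
def increment(clock, node):
    """Return a new clock with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Classify clock `a` relative to `b`: before, after, equal, or concurrent."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a, b):
    """Element-wise maximum: the clock that has seen both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


base = increment({}, "A")      # node A writes first
left = increment(base, "A")    # A writes again
right = increment(base, "B")   # B writes without seeing A's second write
merged = merge(left, right)

print(compare(base, left))
print(compare(left, right))
print(compare(left, merged))
```

Two versions that compare as concurrent are the ones that need a conflict resolution rule; versions ordered as before or after can simply be replaced by the later one.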
ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts. The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
Provides a CLI for managing and versioning massive datasets stored on object storage or network drives.
Pachyderm is a containerized, versioned, and lineage-tracked data pipeline platform that runs natively on Kubernetes. It combines a distributed file system backend with immutable data versioning, so every commit to a data repository creates an auditable snapshot, and every pipeline step executes as an isolated container. The platform is defined by a data-centric pipeline model where pipelines are specified by their input and output data repositories rather than explicit task sequences, and provenance is recorded as a directed acyclic graph of commits linking output data to its input sources an
Automates multi-stage data pipelines with built-in version control and lineage tracking for every dataset and transformation.
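Provenance recorded as a directed acyclic graph of commits, as described for Pachyderm, can be pictured with a small model. This is a sketch under assumed names (`Commit`, `provenance`, and the repository names are hypothetical), not Pachyderm's data model or client API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    repo: str
    commit_id: str
    parents: tuple = ()  # commits this one was derived from


def provenance(commit):
    """All upstream commits reachable from `commit`, each listed once."""
    seen, order, stack = set(), [], list(commit.parents)
    while stack:
        current = stack.pop()
        key = (current.repo, current.commit_id)
        if key in seen:
            continue
        seen.add(key)
        order.append(current)
        stack.extend(current.parents)
    return order


# Two input repositories feed a feature step, which feeds a model step.
images = Commit("images", "c1")
labels = Commit("labels", "c2")
features = Commit("features", "c3", (images,))
model = Commit("model", "c4", (features, labels))

upstream = provenance(model)
print([f"{c.repo}@{c.commit_id}" for c in upstream])
```

Walking the parent links from an output commit answers the audit question "which exact inputs produced this result".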
lakeFS is a data lake versioning system that provides Git-like branches and commits for large datasets held in object storage. It acts as a version control layer, enabling immutable snapshots, atomic commits, and zero-copy branches that create isolated environments for data experimentation without duplicating physical files. The system serves as an S3-compatible storage gateway and an Iceberg REST catalog, allowing standard cloud storage protocols and compatible clients to manage versioned tables. It acts as a data quality gatekeeper, using an event-driven hook system to validate datasets against governance policies before changes are merged into production. The platform covers a broad range of data governance capabilities, including collaboration through pull requests, role-based access control, and data lineage tracking. It integrates with workflow orchestration, machine learning pipelines, and various big data compute engines, and supports multi-cloud storage connectivity and identity synchronization via SSO and SCIM. The software can be installed using binaries, containers, or Helm charts for deployment on Kubernetes.
Tracks and manages historical versions of large-scale research datasets and models to ensure reproducibility.
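Zero-copy branching, as described for lakeFS, comes down to branches being pointers to immutable commits that reference shared objects. The toy model below illustrates that idea in memory; the `Repo` class and its methods are hypothetical and unrelated to the lakeFS API or storage layout.

```python
import hashlib


class Repo:
    """Toy model: objects are stored once, commits map paths to object hashes."""

    def __init__(self):
        self.objects = {}               # content hash -> bytes
        self.commits = {}               # commit id -> {path: content hash}
        self.branches = {"main": None}  # branch name -> commit id

    def commit(self, branch, changes):
        """Apply {path: bytes} on top of the branch head and move the head."""
        parent = self.branches[branch]
        tree = dict(self.commits.get(parent, {}))
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects.setdefault(digest, data)
            tree[path] = digest
        raw_id = repr((parent, sorted(tree.items()))).encode()
        commit_id = hashlib.sha256(raw_id).hexdigest()[:12]
        self.commits[commit_id] = tree
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name, source):
        """Branching copies a pointer, never the underlying objects."""
        self.branches[name] = self.branches[source]

    def read(self, ref, path):
        """Read a path from a branch name or a commit id."""
        commit_id = self.branches.get(ref, ref)
        return self.objects[self.commits[commit_id][path]]


repo = Repo()
first = repo.commit("main", {"data/a.csv": b"1,2\n"})
repo.create_branch("experiment", "main")
objects_after_branch = len(repo.objects)

repo.commit("experiment", {"data/a.csv": b"1,2\n3,4\n"})
print(repo.read("main", "data/a.csv"))
print(repo.read("experiment", "data/a.csv"))
```

Creating the branch adds no objects, and the change on `experiment` leaves `main` and the earlier commit readable as they were.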
CML is a pipeline automation tool for training and evaluating machine learning models, functioning as a CI/CD system for machine learning. It serves as a cloud compute orchestrator and Git-based workflow manager that automates model training cycles through branch management, automated commits, and integrated reporting. The project distinguishes itself by provisioning ephemeral cloud instances or Kubernetes nodes to supply specialized hardware for compute-intensive tasks. It also manages remote compute runners, allowing self-hosted GPU clusters or on-premises machines to be connected to run containerized machine learning workflows. The system covers a wide range of capabilities, including ML experiment tracking, where performance metrics and visualizations are published directly in version control pull requests. It handles ML pipeline automation from initial data import and versioning through to the generation of formatted workflow reports and external visualization links. The tool provides additional utility for infrastructure management through SSH-based remote debugging and the ability to resume interrupted tasks.
Integrates data versioning tools directly into the ML pipeline to ensure datasets are synchronized across execution environments.