17 repositories
Systems for tracking and retrieving historical states of data.
Distinguishing note: Focuses on historical snapshots of web content.
Explore 17 awesome GitHub repositories matching data & databases · Data Versioning.
Changedetection.io is a self-hosted monitoring service designed to track web pages for content updates and notify users of changes. It functions as a centralized platform where users can manage tracking tasks, observe specific website elements, and receive automated alerts through various communication channels whenever modifications are detected. The service distinguishes itself through an integrated headless browser engine that executes interaction sequences, such as logins or form submissions, to access dynamic or restricted content. It maintains a historical record of page snapshots, util
Tracks historical snapshots of web pages to compare differences between versions.
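The snapshot comparison described above can be illustrated with a minimal sketch. It assumes two plain-text snapshots of the same page and uses Python's standard `difflib`; the function name `diff_snapshots` and the sample page text are hypothetical, and this is not how Changedetection.io itself is implemented.

```python
import difflib


def diff_snapshots(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two text snapshots of a page."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="snapshot_old",
            tofile="snapshot_new",
            lineterm="",
        )
    )


# Hypothetical snapshots captured at two different times.
old_snapshot = "Price: 10 EUR\nIn stock"
new_snapshot = "Price: 12 EUR\nIn stock"

changes = diff_snapshots(old_snapshot, new_snapshot)

# Keep only added/removed lines, skipping the "---"/"+++" file headers.
changed_lines = [
    line
    for line in changes
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(changed_lines)
```

A non-empty `changed_lines` list is the kind of signal a monitoring tool could use to decide whether to send a notification.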
Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams. The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
Serves as a collaborative platform for publishing and versioning datasets to ensure research reproducibility.
Mastra is an orchestration framework designed for building, deploying, and managing autonomous AI agents and multi-agent systems. It provides a comprehensive suite of primitives for creating resilient AI applications, including durable workflow orchestration, event-driven agent loops, and semantic memory management. By integrating these core components, the platform enables developers to build complex, multi-step processes that can reason about goals and execute tasks without manual intervention. The framework distinguishes itself through its focus on observability and secure, isolated execut
Tracks mutations to dataset items, enabling experiment pinning and historical comparison of evaluation data.
Opik is an observability and evaluation platform designed for generative AI applications and agentic workflows. It provides a centralized environment for tracing execution flows, managing prompt templates, and monitoring production performance, allowing teams to gain visibility into complex model interactions and tool usage without requiring manual application code changes. The platform distinguishes itself through its integrated approach to the AI development lifecycle, combining distributed trace instrumentation with automated evaluation frameworks. It supports model-as-a-judge scoring, syn
Maintains snapshots of test cases and evaluation data to ensure reproducibility and auditability across experiment runs.
DVC is a data versioning tool and pipeline orchestrator designed to track large datasets and machine learning models. It functions as a system for managing large data artifacts by storing lightweight metadata in version control while keeping the actual binaries in a separate cache. The project serves as an experiment tracker and remote storage synchronizer, enabling the execution and comparison of machine learning iterations based on hyperparameters and performance metrics. It provides a bridge for pushing and pulling these large data artifacts between local environments and cloud or on-premi
Provides a platform for versioning large research datasets and ML models to ensure training reproducibility.
DVC is a data versioning tool and pipeline orchestrator designed to track large datasets and machine learning models using external storage and metadata pointers. It integrates with Git by utilizing placeholders to keep heavy artifacts out of the repository while maintaining a versioned link between code and data. The system manages remote data caches through a synchronization layer that connects local environments to cloud storage or network filesystems. It also functions as an experiment tracker, recording hyperparameters and metrics to compare the performance of different model iterations.
Provides a workflow for tracking historical versions of large-scale datasets to ensure machine learning reproducibility.
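The idea of keeping heavy artifacts out of the repository while versioning a lightweight pointer can be sketched as follows. This is an illustrative toy, assuming a SHA-256 content hash and a JSON pointer file; the helper names `add_to_cache` and `restore` and the `.pointer.json` suffix are hypothetical and do not reflect DVC's actual file format or commands.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy a file into a content-addressed cache and write a small pointer file."""
    content = data_file.read_bytes()
    digest = hashlib.sha256(content).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache entry is named after the hash of its content.
    (cache_dir / digest).write_bytes(content)
    # Only this small pointer would be committed to version control.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> bytes:
    """Look up the cached content that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    return (cache_dir / meta["sha256"]).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "train.csv"
    data.write_bytes(b"id,label\n1,cat\n")
    ptr = add_to_cache(data, root / "cache")
    data.write_bytes(b"id,label\n1,dog\n")  # the working copy changes later
    restored = restore(ptr, root / "cache")  # the pinned version is still retrievable
    pointer_size = ptr.stat().st_size
```

Because the pointer records a content hash, checking out an older pointer from version control is enough to identify exactly which cached artifact belongs to that revision.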
This library provides a diagnostic toolkit for automated data profiling and exploratory analysis. It generates comprehensive statistical summaries and visual reports for tabular datasets, enabling users to identify distribution patterns, missing values, and quality anomalies through a unified interface. The project distinguishes itself by offering differential analysis, which allows for the comparison of two dataset versions to track structural and statistical changes over time. It supports large-scale data processing through lazy evaluation and provides interactive widgets that embed directl
Identifies structural and statistical differences between two versions of a dataset to track preprocessing impacts.
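As a rough illustration of comparing two versions of a dataset, the sketch below summarizes a single numeric column and reports which statistics changed. The functions `summarize` and `compare` and the sample values are hypothetical; the library described above produces far richer reports, and this is not its API.

```python
from statistics import mean


def summarize(column: list) -> dict:
    """Compute a few basic statistics for one column, treating None as missing."""
    present = [value for value in column if value is not None]
    return {
        "count": len(column),
        "missing": len(column) - len(present),
        "mean": mean(present) if present else None,
    }


def compare(old: list, new: list) -> dict:
    """Return only the statistics that differ, as (old, new) pairs."""
    before, after = summarize(old), summarize(new)
    return {key: (before[key], after[key]) for key in before if before[key] != after[key]}


# Hypothetical column before and after a mean-imputation preprocessing step.
raw_version = [1.0, 2.0, None, 3.0]
imputed_version = [1.0, 2.0, 2.0, 3.0]

differences = compare(raw_version, imputed_version)
print(differences)
```

Here the imputation removes the missing value without shifting the mean, so only the `missing` statistic shows up in the comparison.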
Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production. The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized proje
Maintains versions of datasets to ensure reproducibility and enable loading of specific versions during execution.
Wandb is a centralized platform for machine learning experiment tracking, model registry management, and workflow orchestration. It provides a comprehensive suite of tools for logging, visualizing, and versioning training metrics, model artifacts, and hyperparameter sweeps to ensure reproducibility across development cycles. The platform also functions as an observability tool for large language model applications, enabling the tracing of execution steps, token usage, and reasoning processes. The project distinguishes itself through its event-driven automation capabilities, which allow users
Groups files and data objects into versioned collections to track assets throughout the machine learning lifecycle.
PyCaret is a Python AutoML platform and MLOps lifecycle manager designed to automate machine learning workflows. It functions as a low-code environment that leverages a scikit-learn native engine to execute preprocessing, training, and evaluation for tabular data. The platform distinguishes itself as an LLM-powered ML copilot, using large language model agents to analyze datasets, design experiment configurations, and explain model results. It also serves as a Kubernetes ML orchestrator and model registry, enabling the versioning of trained pipelines and their promotion to production API endp
Tracks historical versions of datasets using schema-aware versioning to ensure machine learning reproducibility.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Automatically tracks and manages historical versions of datasets to ensure machine learning reproducibility.
Oumi is a comprehensive large language model development platform designed for synthesizing data, fine-tuning models, and running performance evaluations. It serves as a unified environment for the entire model lifecycle, encompassing a training and fine-tuning suite, an evaluation framework, and tools for synthetic data generation and model distillation. The platform is distinguished by its iterative, failure-driven synthesis approach, which analyzes model weaknesses during evaluation to generate targeted training data. It utilizes an LLM-based judge framework to programmatically score respo
Allows reverting datasets to previous states and rerunning quality checks for reproducibility.
language-ext is a functional programming framework for C# that provides a suite of immutable data structures and monadic types. It enables the implementation of pure functional programming patterns, utilizing containers to manage side effects, optional values, and error handling. The library is distinguished by its advanced concurrency and state management tools, including a software transactional memory system and lock-free atomic references. It also provides specialized utilities for distributed systems, such as vector clocks for causality tracking and deterministic data conflict resolution
Provides historical version tracking for map entries to enable retrieval of previous states.
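Historical version tracking for map entries can be sketched with an append-only history per key. language-ext is a C# library; the Python class below (`VersionedMap`, with `set` and `get`) is a hypothetical illustration of the concept only and does not mirror the library's types or method names.

```python
from typing import Any, Optional


class VersionedMap:
    """A map that keeps every value ever written to each key."""

    def __init__(self) -> None:
        self._history: dict = {}  # key -> list of (version, value) pairs
        self._version = 0

    def set(self, key: str, value: Any) -> int:
        """Record a new value and return the version number assigned to it."""
        self._version += 1
        self._history.setdefault(key, []).append((self._version, value))
        return self._version

    def get(self, key: str, at_version: Optional[int] = None) -> Any:
        """Return the latest value, or the value as of a given version."""
        for version, value in reversed(self._history.get(key, [])):
            if at_version is None or version <= at_version:
                return value
        raise KeyError(key)


settings = VersionedMap()
first = settings.set("threshold", 0.5)
settings.set("threshold", 0.8)

current_value = settings.get("threshold")
previous_value = settings.get("threshold", at_version=first)
```

Nothing is overwritten in place, so any earlier state stays retrievable for as long as the history is kept.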
ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts. The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
Provides a CLI for managing and versioning massive datasets stored on object storage or network drives.
Pachyderm is a containerized, versioned, and lineage-tracked data pipeline platform that runs natively on Kubernetes. It combines a distributed file system backend with immutable data versioning, so every commit to a data repository creates an auditable snapshot, and every pipeline step executes as an isolated container. The platform is defined by a data-centric pipeline model where pipelines are specified by their input and output data repositories rather than explicit task sequences, and provenance is recorded as a directed acyclic graph of commits linking output data to its input sources an
Automates multi-stage data pipelines with built-in version control and lineage tracking for every dataset and transformation.
lakeFS is a data lake versioning system that provides Git-like branching and commits for large datasets stored in object storage. It functions as a version control layer, enabling immutable snapshots, atomic commits, and zero-copy branching to create isolated environments for data experimentation without duplicating physical files. The system serves as an S3-compatible storage gateway and an Iceberg REST catalog, allowing standard cloud storage protocols and compatible clients to manage versioned tables. It acts as a data quality gatekeeper by using an event-based hook system to validate datasets against governance policies before changes are merged into production. The platform covers broad data governance capabilities, including collaboration through pull requests, role-based access control, and data lineage tracking. It provides integration for workflow orchestration, machine learning pipelines, and various big data compute engines, supporting multi-cloud storage connectivity and identity synchronization via SSO and SCIM. The software can be installed using binaries, containers, or Helm charts for deployment on Kubernetes.
Tracks and manages historical versions of large-scale research datasets and models to ensure reproducibility.
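Zero-copy branching, as described above, can be pictured as copying a pointer to a commit rather than copying any stored objects. The toy `Repo` class below is a hypothetical in-memory sketch of that idea under simplified assumptions (a single process, SHA-256 object ids); it is not the lakeFS API or storage layout.

```python
import hashlib


class Repo:
    """Toy object store with immutable commits and branch pointers."""

    def __init__(self) -> None:
        self.objects: dict = {}          # object id -> bytes, each stored once
        self.commits: dict = {0: {}}     # commit id -> {path: object id}
        self.branches: dict = {"main": 0}  # branch name -> commit id

    def commit(self, branch: str, changes: dict) -> int:
        """Create a new snapshot on a branch from a {path: bytes} mapping."""
        snapshot = dict(self.commits[self.branches[branch]])
        for path, content in changes.items():
            object_id = hashlib.sha256(content).hexdigest()
            self.objects.setdefault(object_id, content)
            snapshot[path] = object_id
        commit_id = len(self.commits)
        self.commits[commit_id] = snapshot
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name: str, source: str) -> None:
        """Branching copies a commit pointer, not the underlying data."""
        self.branches[name] = self.branches[source]

    def read(self, branch: str, path: str) -> bytes:
        return self.objects[self.commits[self.branches[branch]][path]]


repo = Repo()
repo.commit("main", {"data.csv": b"v1"})
objects_before = len(repo.objects)

repo.create_branch("experiment", "main")
objects_after_branch = len(repo.objects)  # unchanged: no files were duplicated

repo.commit("experiment", {"data.csv": b"v2"})
```

After the last commit, `main` still resolves `data.csv` to its original content while `experiment` sees the new one, which is the isolation property the description refers to.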
CML is a pipeline automation tool for training and evaluating machine learning models, functioning as a CI/CD system for machine learning. It serves as a cloud compute orchestrator and Git-based workflow manager that automates model training cycles through branch management, automated commits, and integrated reports. The project distinguishes itself by provisioning ephemeral cloud instances or Kubernetes nodes to provide specialized hardware for compute-intensive tasks. It also manages remote compute runners, allowing self-hosted GPU clusters or on-premise machines to be connected to run containerized machine learning workflows. The system covers a wide range of capabilities, including ML experiment tracking, where performance metrics and visualizations are posted directly to version control pull requests. It handles ML pipeline automation from initial data import and versioning through to the generation of formatted workflow reports and external visualization links. The tool provides additional utility for infrastructure management through SSH-based remote debugging and the ability to resume interrupted jobs.
Integrates data versioning tools directly into the ML pipeline to ensure datasets are synchronized across execution environments.