Open-source platforms for managing, storing, and serving consistent features for machine learning model training.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates materialization pipelines that move batch features from offline stores to online stores using configurable compute engines.
Feast distinguishes itself through its multi-protocol serving surface, exposing the same feature values simultaneously via REST, gRPC, and MCP protocols to support diverse client ecosystems including AI agents. It includes an on-demand transformation framework that applies Python-based feature transformations at retrieval time, combining precomputed features with request-time data for flexible serving. The project also provides entity-key collocated storage, storing all features for a single entity in one document to reduce online reads to a single lookup per request, and a background registry cache refresh that prevents serving requests from blocking on cache updates.
The platform covers the full lifecycle of feature management, including feature engineering and transformation from batch and streaming sources, governance and access control with application-level RBAC and OIDC authentication, real-time inference serving, and historical feature retrieval for training. It supports vector search and retrieval-augmented generation workflows by storing and querying embeddings for similarity search.
Feast integrates with a wide range of storage backends, compute engines, and data sources, and provides tooling for deployment on Kubernetes, monitoring with Prometheus and OpenTelemetry, and lineage tracking with OpenLineage.
Feast is a comprehensive, widely used feature store that provides the point-in-time joins, dual-storage architecture, and transformation pipelines needed for centralized machine learning data management.
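To make the point-in-time join mentioned above concrete, here is a minimal sketch in plain Python. It is not Feast's implementation or API; the function name point_in_time_join and the sample driver data are hypothetical. For each training row it picks the latest feature value observed at or before the row's event time, so no future information leaks into the training set.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_join(entity_rows, feature_rows):
    """Attach to each (entity_id, event_time) row the latest feature value
    observed at or before event_time. Hypothetical helper, not a Feast API."""
    # Group feature observations per entity, ordered by timestamp.
    by_entity = {}
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        times, values = by_entity.setdefault(entity_id, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for entity_id, event_time in entity_rows:
        times, values = by_entity.get(entity_id, ([], []))
        # Index of the first observation strictly after event_time.
        idx = bisect_right(times, event_time)
        # idx == 0 means nothing was known yet at event_time.
        joined.append((entity_id, event_time, values[idx - 1] if idx else None))
    return joined


# Illustrative data: (entity_id, timestamp, feature value).
features = [
    ("driver_1", datetime(2024, 1, 1), 0.5),
    ("driver_1", datetime(2024, 1, 3), 0.9),
    ("driver_2", datetime(2024, 1, 2), 0.2),
]
# Training rows: (entity_id, event_time).
entities = [
    ("driver_1", datetime(2024, 1, 2)),
    ("driver_1", datetime(2024, 1, 4)),
    ("driver_2", datetime(2024, 1, 1)),
]
training_rows = point_in_time_join(entities, features)
```

The third training row gets no value because the only observation for driver_2 arrives after its event time; a naive "latest value" join would have leaked it.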
Feast is a machine learning feature store and MLOps data infrastructure layer. It provides a centralized system for managing and serving features across offline training and online production environments, utilizing an online feature serving layer for low-latency retrieval. The project centers on a feature registry that acts as a central catalog for defining, governing, and discovering feature services. It employs a unified data access layer to decouple feature retrieval from physical storage and includes a point-in-time data generator to create historically accurate training datasets that prevent data leakage. The platform covers a broad range of capabilities including real-time model inference, streaming data feature engineering, and the generation of training datasets. It also supports vector embedding search for similarity-based retrieval and feature quality validation to maintain data integrity.
Feast is a comprehensive feature store that provides the required centralized management, dual-store architecture for online and offline retrieval, and point-in-time join capabilities for machine learning workflows.
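The dual-store idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Feast code: the offline store is modeled as a list of timestamped rows, the online store as a dict keyed by entity, and the names materialize_latest and get_online_features are hypothetical helpers defined here. Materialization copies only the newest value of each feature into the online store, and all features of one entity live in one document so a serving request needs a single lookup.

```python
from datetime import datetime


def materialize_latest(offline_rows, online_store):
    """Copy the newest value of each feature per entity into the online store.
    offline_rows holds (entity_id, feature_name, timestamp, value) tuples."""
    for entity_id, feature, ts, value in offline_rows:
        # One document per entity holds all of that entity's features.
        doc = online_store.setdefault(entity_id, {})
        current = doc.get(feature)
        if current is None or ts > current[0]:
            doc[feature] = (ts, value)
    return online_store


def get_online_features(online_store, entity_id, features):
    """Serve the requested features for one entity with a single lookup."""
    doc = online_store.get(entity_id, {})
    return {f: (doc[f][1] if f in doc else None) for f in features}


# Illustrative offline history, deliberately out of order.
offline = [
    ("driver_1", "trips_today", datetime(2024, 1, 2), 7),
    ("driver_1", "trips_today", datetime(2024, 1, 1), 3),
    ("driver_1", "rating", datetime(2024, 1, 1), 4.8),
]
online = materialize_latest(offline, {})
served = get_online_features(online, "driver_1", ["trips_today", "rating", "tips"])
```

Features that were never materialized come back as None rather than raising, which keeps the serving path simple in this sketch; a real system would need an explicit policy for missing and stale values.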
Hopsworks - Data-Intensive AI platform with a Feature Store
Hopsworks is a comprehensive MLOps platform that includes a dedicated feature store providing both online and offline storage, point-in-time joins, and versioning capabilities to support the full machine learning lifecycle.
Great Expectations is a data quality testing framework and observability platform designed to monitor the reliability of data pipelines. It provides a structured environment for defining, documenting, and automating data quality assertions, allowing teams to validate datasets against expected structure and content before they move through downstream processes. The project distinguishes itself through a declarative domain-specific language that stores quality rules as version-controlled configuration files. It utilizes an execution engine abstraction to translate these high-level assertions into native queries for various data processing frameworks, while a rendering engine automatically transforms these rules and validation outcomes into human-readable documentation for stakeholders. The platform supports a broad range of operational capabilities, including the ability to connect to diverse data sources and persist metadata and validation results across distributed environments. It integrates directly into existing orchestration pipelines to automate recurring quality checks, track data health trends over time, and trigger notifications when datasets deviate from established benchmarks.
Great Expectations is a data quality and validation framework used to monitor pipeline health, but it lacks the storage, serving, and feature-engineering capabilities required for a centralized machine learning feature store.
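The idea of storing quality rules as declarative configuration can be illustrated with a small plain-Python sketch. It does not use the Great Expectations API; the rule schema (type, column, min, max) and the validate function are assumptions made for this example. Rules are plain data that could live in a version-controlled file, and a small engine evaluates them against rows.

```python
def validate(rows, expectations):
    """Evaluate declarative rules against a list of dict rows.
    Hypothetical rule types: 'not_null' and 'between'."""
    results = []
    for exp in expectations:
        values = [row.get(exp["column"]) for row in rows]
        if exp["type"] == "not_null":
            failures = [v for v in values if v is None]
        elif exp["type"] == "between":
            # Missing values also count as failures for a range rule here.
            failures = [
                v for v in values
                if v is None or not (exp["min"] <= v <= exp["max"])
            ]
        else:
            raise ValueError(f"unknown expectation type: {exp['type']}")
        results.append({
            "expectation": exp,
            "success": not failures,
            "unexpected_count": len(failures),
        })
    return results


# Rules expressed as data, separate from the code that runs them.
rules = [
    {"type": "not_null", "column": "age"},
    {"type": "between", "column": "age", "min": 0, "max": 120},
]
rows = [{"age": 30}, {"age": None}, {"age": 150}]
report = validate(rows, rules)
```

Keeping the rules as data is what lets the same definitions be reviewed in version control, executed by different engines, and rendered into documentation.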
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It incorporates sensitive data governance by identifying and masking personally identifiable information, ensuring that generated reports remain compliant with security standards. Furthermore, the framework supports dataset drift detection by comparing multiple versions of data collections to pinpoint statistical shifts over time. Beyond its core profiling capabilities, the library offers a modular architecture that allows for schema-driven metadata enrichment and pluggable report rendering. It provides a broad surface for data quality monitoring, including the analysis of temporal trends and the export of metrics into standard formats for integration with other analytical tools.
Ydata-profiling is an automated exploratory data analysis and profiling tool for assessing data quality, rather than a feature store designed to manage, version, and serve machine learning features for training and inference.
The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
Featureform acts as a virtual feature store that orchestrates your existing data infrastructure to provide versioned feature management, transformation pipelines, and retrieval for both training and inference.