Hub is a multimodal AI data lake and vector database designed for storing and querying embeddings, text, audio, and images. It functions as a dataset version control system and a machine learning data streaming engine to support large-scale model training.
The system utilizes a serverless PostgreSQL vector store to index high-dimensional embeddings for semantic search. It provides a visual interface for inspecting multimodal datasets and viewing annotations such as bounding boxes and masks.
The platform handles cloud-agnostic storage synchronization and implements lazy, compressed data streaming to move datasets from remote sources into deep learning frameworks. It maintains dataset lineage and versioning to track iterations across the development lifecycle.