DataHub is a metadata management system and data catalog platform designed to provide a centralized directory for discovering, managing, and documenting datasets across a diverse data stack. It serves as a comprehensive framework for metadata management, incorporating a data governance framework to classify sensitive information and assign ownership for organizational accountability.
The platform distinguishes itself through AI-enabled data discovery, which connects large language models to a metadata graph to allow for natural language search and exploration of data assets. It also provides specialized data lineage tools that map column-level dependencies to track the flow of data from source to consumption.
The system covers a broad range of capabilities including universal metadata search, data quality monitoring for schema drift and freshness, and dataset profiling. It utilizes a plugin-based ingestion framework to automate the extraction of schemas and usage metrics from warehouses and business intelligence tools.