These open-source tools map and visualize data movement from raw sources through pipelines to dashboards.
DataHub is a metadata management system and data catalog platform designed to provide a centralized directory for discovering, managing, and documenting datasets across a diverse data stack. It serves as a comprehensive framework for metadata management, incorporating a data governance framework to classify sensitive information and assign ownership for organizational accountability. The platform distinguishes itself through AI-enabled data discovery, which connects large language models to a metadata graph to allow for natural language search and exploration of data assets. It also provides
DataHub is a comprehensive metadata management and data catalog platform that provides automated column-level lineage tracking, SQL parsing, and a robust API-first architecture, making it a flagship solution for visualizing data flow and observability.
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Dagster is a comprehensive data orchestration platform that natively treats data assets as first-class primitives, providing the automated lineage tracking, asset cataloging, and observability dashboards required to monitor data flow from origin to destination.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
DataHub is a comprehensive metadata management platform that provides automated data lineage, a centralized data catalog, and observability features, making it a flagship solution for tracking data flow and dependencies across complex ecosystems.
Amundsen is a data catalog and discovery platform that provides a centralized directory for indexing tables and dashboards. It functions as a metadata management system and search engine, allowing users to locate and understand available data assets across diverse distributed sources. The platform includes capabilities for data lineage tracking to map the origin and movement of datasets between systems. It also serves as a data profiling tool, calculating distribution and quality statistics for individual table columns to provide automated insights into the nature of the data. The system man
Amundsen is a data discovery and metadata management platform that provides essential data lineage tracking and cataloging features, though it focuses more on asset discovery than on end-to-end observability of data transformation flows.
Kedro is a data science pipeline framework and production toolbox designed to build reproducible, modular workflows using software engineering best practices. It functions as a data engineering orchestrator and catalog manager, bridging the gap between interactive analysis and maintainable production pipelines. The framework distinguishes itself by using a data catalog to decouple data access from processing logic and providing tools to transition analysis from interactive notebooks into structured workflows. It includes a workflow visualization tool that generates visual maps of data pipelin
Kedro is a data engineering framework that provides pipeline visualization and a data catalog to manage dependencies, though it functions primarily as an orchestration tool rather than a standalone observability platform for tracking data flow across external systems.
Gravitino is a federated metadata lake and unified data catalog designed to manage tables, files, and AI models across diverse data sources and cloud storage. It serves as a centralized interface for governing schemas, access controls, and tagging across relational databases, messaging queues, and object stores. The project distinguishes itself by unifying the management of AI assets, such as machine learning models and their version lineages, alongside traditional tabular data. It also implements the Iceberg REST specification to provide a standardized metadata server and proxy for lakehouse
This platform provides a unified data catalog and metadata management system that includes native support for tracking data lineage via the OpenLineage standard, making it a strong fit for managing and observing data flows across diverse sources.
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
dbt-core is a data transformation framework that natively tracks and visualizes data lineage through a directed acyclic graph, serving as a core engine for the observability and mapping features you require.
sqlglot is a SQL parser and transpiler that represents queries as abstract syntax trees to enable structural analysis, modification, and semantic transformation. It functions as a dialect translator and query optimizer, converting SQL code between different database engines and simplifying syntax trees through rule-based normalization. The project provides a framework for defining custom SQL dialects by overriding tokenizers, parsers, and generators. It includes a lineage analyzer to track data flow from source tables through complex queries to identify the origin of specific columns. Additi
This is a powerful SQL parsing and transformation library that provides the underlying logic for column-level lineage, but it is a developer-focused building block rather than a complete observability platform with a dashboard and data catalog.