26 open-source projects similar to elementary-data/elementary, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Elementary alternative.
Egeria provides the Apache-2.0 licensed open metadata and governance type system, frameworks, APIs, event payloads and interchange protocols to enable tools, engines and platforms to exchange metadata in order to get the best value from data, whilst ensuring it is properly governed.
OpenMetadata is an enterprise data catalog, metadata platform, and governance suite that functions as a knowledge graph for data assets. It serves as an AI-ready metadata layer, providing governed context and organizational memory to large language model agents via the Model Context Protocol. The platform distinguishes itself by capturing institutional knowledge, linking conversations, decisions, and remediation notes directly to data assets to preserve tribal knowledge. It integrates AI agents to automate metadata governance, such as suggesting descriptions and identifying sensitive data thr
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Next-Gen Data Discovery and Data Observability Platform
Gravitino is a federated metadata lake and unified data catalog designed to manage tables, files, and AI models across diverse data sources and cloud storage. It serves as a centralized interface for governing schemas, access controls, and tagging across relational databases, messaging queues, and object stores. The project distinguishes itself by unifying the management of AI assets, such as machine learning models and their version lineages, alongside traditional tabular data. It also implements the Iceberg REST specification to provide a standardized metadata server and proxy for lakehouse
Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
DataHub is a metadata management system and data catalog platform designed to provide a centralized directory for discovering, managing, and documenting datasets across a diverse data stack. It serves as a comprehensive framework for metadata management, incorporating a data governance framework to classify sensitive information and assign ownership for organizational accountability. The platform distinguishes itself through AI-enabled data discovery, which connects large language models to a metadata graph to allow for natural language search and exploration of data assets. It also provides
This project is an exploratory data analysis framework and profiling tool designed to generate comprehensive statistical reports from Pandas and Spark DataFrames. It functions as a data quality profiler that identifies missing values, duplicates, and high correlations within tabular datasets. The tool distinguishes itself through specialized capabilities for time-series analysis, extracting temporal statistics, seasonality, and auto-correlation plots. It also includes a dataset comparison utility to identify structural or content changes between different versions of a dataset. The analysis
Pandera is a data pipeline validation framework and statistical type validation tool. It functions as a library for defining and enforcing schemas on datasets to ensure data quality and consistency, specifically providing validation capabilities for Pandas dataframes. The project includes a schema inference tool that automates setup by analyzing existing dataset samples to generate validation schemas. It also serves as a synthetic data generator, creating artificial datasets based on predefined schemas to verify data-producing functions. The framework covers data engineering quality assuranc
OpenAddresses is an open-source geospatial data aggregator and directory that collects public domain and open-license address, parcel, and building datasets from governments and organizations worldwide. It functions as a global index and data warehouse for locating and distributing free geospatial records. The project operates a normalization pipeline that cleans and standardizes diverse source formats into a consistent global coordinate and attribute schema. This process includes a crowdsourced curation pipeline and programmatic quality validation to verify the spatial accuracy and formattin
This project is a collection of big data frameworks and pipelines, including an Apache Hive analysis framework, a behavioral data analytics platform, a predictive analytics engine, and real-time data pipelines. It provides the infrastructure for building Extract, Transform, Load (ETL) workflows to process large datasets for distributed storage and SQL-based analysis. The system supports diverse analytical implementations, such as a predictive engine using linear regression for value forecasting and a real-time architecture that moves data through message brokers for immediate reporting. It in
The purpose of the dq tool is to make simple storing test results and visualisation of these in a BI dashboard.
Apache Atlas - Open Metadata Management and Governance capabilities across the Hadoop platform and beyond
Apache Hamilton — portable & expressive data transformation DAGs
dbt-expectations is an extension package for dbt, inspired by the Great Expectations package for Python. The intent is to allow dbt users to deploy GE-like tests in their data warehouse directly from dbt, vs having to add another integration with their data warehouse.
CKAN is an open-source data management platform that provides the foundation for building data portals. It supports the full lifecycle of datasets—from creation and organization to publishing, cataloging with faceted search, and interactive data visualization—all through a web interface. The platform is built on a modular architecture that includes a plugin-based extensibility system, a harvesting framework for importing metadata from external sources, and a standardized RESTful JSON API for programmatic access to datasets and metadata. The web interface is rendered using the Jinja2 templatin
DBND an open source framework for building and tracking data pipelines. DBND is used for processes ranging from data ingestion, preparation, machine learning model training and production.
Data breaks. Servers break. Your toolchain breaks. Ensure your data team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems from end to end.
Samples, tutorials and other information about watsonx.data
Amundsen is a data catalog and discovery platform that provides a centralized directory for indexing tables and dashboards. It functions as a metadata management system and search engine, allowing users to locate and understand available data assets across diverse distributed sources. The platform includes capabilities for data lineage tracking to map the origin and movement of datasets between systems. It also serves as a data profiling tool, calculating distribution and quality statistics for individual table columns to provide automated insights into the nature of the data. The system man
A federated, open-source data catalog for all your big data and small data
Collect, aggregate, and visualize a data ecosystem's metadata
A detective for your data. Zero-config data quality monitoring — works with dbt, Postgres, BigQuery, Snowflake. No YAML.