8 repositories
Software for cataloging, indexing, and searching datasets across distributed systems.
Distinguishing note: Focuses on the discovery and exploration of data assets rather than database management or storage.
8 GitHub repositories in the Data & Databases · Data Discovery Tools category.
This project is a comprehensive, curated knowledge base designed to support the development and maintenance of production-grade machine learning systems. It serves as a centralized repository of industry-standard technical literature, engineering case studies, and research papers, providing a structured reference for practitioners navigating the complexities of modern data science and machine learning engineering. The resource distinguishes itself through a cross-domain approach that bridges the gap between academic research and practical implementation. By synthesizing proven industry archit
Locate and explore available datasets using specialized tools designed to catalog, index, and search for information across distributed storage systems.
Cube is a semantic data layer that provides a unified framework for defining business metrics, dimensions, and relationships across diverse data sources. By acting as a headless business intelligence engine, it transforms raw data into a governed model that can be queried via SQL, REST, and GraphQL interfaces. This architecture ensures consistent data definitions and logic across all downstream analytical applications and reporting tools. The platform distinguishes itself through its integrated conversational AI capabilities, which allow users to explore data using natural language. It orches
Exposes data model structures through a programmatic interface to simplify client-side integration.
This project is a curated directory of software, frameworks, and educational resources designed for building, scaling, and maintaining distributed data processing and storage architectures. It serves as a comprehensive index for the distributed computing ecosystem, helping users identify the appropriate tools for managing large-scale information systems. The repository functions as a central hub for data engineering, offering categorized access to technologies that support batch and stream processing, machine learning, and interactive querying. By organizing these resources, it assists in the
Helps identify and evaluate databases and processing frameworks for large-scale data infrastructure.
DataHub is a metadata management system and data catalog that provides a centralized directory for discovering, managing, and documenting datasets across a diverse data stack. Its data governance features classify sensitive information and assign ownership for organizational accountability. The platform distinguishes itself through AI-enabled data discovery, which connects large language models to a metadata graph to allow for natural language search and exploration of data assets. It also provides
Integrates large language models with a metadata graph to enable natural language search and discovery.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Provides a centralized interface for users to search and access data assets, reducing the time required to derive insights.
This project is a collection of educational resources and curricula designed for mastering AI pair programming and prompt engineering. It provides a structured training course and instructional materials for integrating AI assistants into the software development lifecycle. The materials cover the use of large language models to modernize legacy code and translate applications between programming languages. It includes a specific guide for crafting natural language queries to generate code and automate development workflows. The content addresses a broad range of capabilities, including AI-a
Provides workflows for using large language models to discover and explore data assets via natural language.
Amundsen is a data catalog and discovery platform that provides a centralized directory for indexing tables and dashboards. It functions as a metadata management system and search engine, allowing users to locate and understand available data assets across diverse distributed sources. The platform includes capabilities for data lineage tracking to map the origin and movement of datasets between systems. It also serves as a data profiling tool, calculating distribution and quality statistics for individual table columns to provide automated insights into the nature of the data. The system man
Implements a searchable interface and index for locating specific datasets across diverse distributed data sources.
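The column-level profiling mentioned in the Amundsen summary above (distribution and quality statistics per table column) can be illustrated with a minimal sketch. This is not Amundsen's implementation or API; the function name `profile_column` and the choice of statistics are assumptions made purely for illustration.

```python
from statistics import mean


def profile_column(values):
    """Compute simple quality and distribution statistics for one column.

    Hypothetical helper for illustration only; `None` is treated as a
    missing value.
    """
    non_null = [v for v in values if v is not None]
    stats = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }
    # Only report numeric statistics when every non-null value is a number
    # (booleans are excluded even though they subclass int).
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(non_null):
        stats["min"] = min(numeric)
        stats["max"] = max(numeric)
        stats["mean"] = mean(numeric)
    return stats
```

A catalog would typically store such statistics alongside the table's metadata so that users can judge a column's completeness before querying it.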
This project provides a curated catalog of community-contributed geospatial datasets designed for environmental analysis and mapping workflows. It functions as a centralized repository for discovering and retrieving geographic information, facilitating access to earth observation data without the need for manual preprocessing. Beyond its role as a data catalog, the project includes automation utilities for maintaining project documentation and monitoring repository health. It uses marker-based text injection to dynamically update documentation files and aggregates public engagement metrics, s
Functions as a searchable repository of geographic information systems data for environmental analysis.
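The marker-based text injection mentioned in the summary above can be sketched as follows. This is a generic illustration, not the project's actual code; the HTML-comment marker format and the function name `inject_between_markers` are assumptions.

```python
import re


def inject_between_markers(text, name, new_body):
    """Replace the content between a pair of named markers.

    Assumed (hypothetical) marker format:
        <!-- NAME:START --> ... <!-- NAME:END -->
    The markers themselves are kept, so the update can be re-run.
    """
    start = f"<!-- {name}:START -->"
    end = f"<!-- {name}:END -->"
    pattern = re.compile(
        re.escape(start) + r".*?" + re.escape(end), re.DOTALL
    )
    if not pattern.search(text):
        raise ValueError(f"markers for {name!r} not found")
    replacement = f"{start}\n{new_body}\n{end}"
    # A function replacement avoids backslash-escape processing of new_body.
    return pattern.sub(lambda _match: replacement, text, count=1)
```

Because the markers survive each update, running the injection repeatedly with the same body leaves the file unchanged, which suits scheduled documentation refreshes.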