30 open-source projects similar to amundsen-io/amundsen, ranked by how many features they share with it. Compare stars, activity, and what each one does to find the best Amundsen alternative.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
DataHub is a metadata management system and data catalog platform designed to provide a centralized directory for discovering, managing, and documenting datasets across a diverse data stack. It serves as a comprehensive framework for metadata management, incorporating a data governance framework to classify sensitive information and assign ownership for organizational accountability. The platform distinguishes itself through AI-enabled data discovery, which connects large language models to a metadata graph to allow for natural language search and exploration of data assets. It also provides
OpenMetadata is an enterprise data catalog, metadata platform, and governance suite that functions as a knowledge graph for data assets. It serves as an AI-ready metadata layer, providing governed context and organizational memory to large language model agents via the Model Context Protocol. The platform distinguishes itself by capturing institutional knowledge, linking conversations, decisions, and remediation notes directly to data assets to preserve tribal knowledge. It integrates AI agents to automate metadata governance, such as suggesting descriptions and identifying sensitive data thr
This project is an AWS pandas integration library and data pipeline framework designed to simplify the movement and transformation of data between local memory and AWS storage and analytics services. It functions as a cloud data lake toolkit and storage file manager, allowing users to read, write, and transform structured data across various cloud environments. The library distinguishes itself as a distributed compute orchestrator capable of managing clusters in environments such as EMR to process datasets that exceed the memory limits of a single machine. It also provides specialized capabil
Gravitino is a federated metadata lake and unified data catalog designed to manage tables, files, and AI models across diverse data sources and cloud storage. It serves as a centralized interface for governing schemas, access controls, and tagging across relational databases, messaging queues, and object stores. The project distinguishes itself by unifying the management of AI assets, such as machine learning models and their version lineages, alongside traditional tabular data. It also implements the Iceberg REST specification to provide a standardized metadata server and proxy for lakehouse
CKAN is an open-source data management platform that provides the foundation for building data portals. It supports the full lifecycle of datasets—from creation and organization to publishing, cataloging with faceted search, and interactive data visualization—all through a web interface. The platform is built on a modular architecture that includes a plugin-based extensibility system, a harvesting framework for importing metadata from external sources, and a standardized RESTful JSON API for programmatic access to datasets and metadata. The web interface is rendered using the Jinja2 templatin
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Daft is a distributed dataframe library and multimodal data processor designed to handle large-scale structured and unstructured data. It functions as a vectorized execution engine that processes tables alongside images, audio, and video, utilizing a unified schema to manage diverse data types. The project distinguishes itself by combining distributed data engineering with large-scale AI inference. It provides an AI data pipeline for batch-optimizing model prompts and generating high-dimensional text embeddings, while utilizing zero-copy memory sharing to execute custom Python functions witho
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
Feast is a machine learning feature store and MLOps data infrastructure layer. It provides a centralized system for managing and serving features across offline training and online production environments, utilizing an online feature serving layer for low-latency retrieval. The project centers on a feature registry that acts as a central catalog for defining, governing, and discovering feature services. It employs a unified data access layer to decouple feature retrieval from physical storage and includes a point-in-time data generator to create historically accurate training datasets that pr
Collect, aggregate, and visualize a data ecosystem's metadata
Apache Atlas provides open metadata management and governance capabilities across the Hadoop platform and beyond.
Kedro is a data science pipeline framework and orchestration tool designed to build reproducible and modular data engineering workflows. It functions as an MLOps project template and Python data workflow tool that enforces software engineering best practices to move projects from prototype to production. The system distinguishes itself through a centralized data catalog manager that abstracts data access and versioning across various file formats and cloud storage systems. It further separates processing logic from data access via a lazy-loading data registry and provides a standardized proje
EverythingPowerToys is a high-performance file and folder search tool for Windows that functions as a system-wide file indexer. It provides near-instant retrieval of files and directories across local storage using a centralized interface and a high-performance indexing engine. The utility specializes in advanced file querying by supporting regular expression patterns to locate files based on complex naming schemes. It also resolves system environment variables within search queries to find files in dynamic directory paths. The project covers a broad range of file management and search capab
This project is a vulnerability search engine and security knowledge base designed to collect and index public security disclosures. It functions as a vulnerability database crawler that extracts technical reports and security flaws from websites to create a searchable local archive. The system utilizes a security knowledge indexer and a full-text inverted index to convert unstructured crawled data into a structured format. This allows for keyword-based information retrieval, enabling the location of specific security flaws and technical details through a dedicated search interface. The plat
Pansou is a cloud storage search engine and distributed search aggregator designed to locate and retrieve files across multiple remote storage platforms. It functions by consolidating search results from various sources into a single interface, allowing users to find specific files through keyword-based queries. The system utilizes a plugin-based architecture that supports the development of custom search modules. This extensibility enables the integration of external artificial intelligence clients, which can interact with the platform to automate the discovery and refinement of file metadat
OpenSearch is a distributed search and analytics engine designed for indexing, searching, and analyzing massive volumes of structured and unstructured data in real time. It functions as a comprehensive platform that integrates enterprise-grade search capabilities, a vector database for high-dimensional similarity lookups, and a unified observability suite for monitoring logs, metrics, and traces across complex distributed environments. The platform distinguishes itself through its support for agentic workflow automation, allowing users to orchestrate multi-agent tasks and integrate foundation
Vizro is a low-code Python framework for building production-ready data visualization applications. It functions as a UI orchestrator that allows users to define multi-page analytical dashboards through structured configurations in Python, YAML, or JSON, reducing the need for extensive frontend engineering. The project distinguishes itself through generative AI integration, utilizing a Model Context Protocol server to translate natural language descriptions into validated dashboard configurations, charts, and layouts. It also features a decoupled data cataloging system that separates data sou
This project provides a system for managing agent context and session memory, featuring an agent context compactor, an AI session memory manager, and a tool output sandbox. It functions as a middleware layer and server extension for the Model Context Protocol to optimize context windows and reduce token usage. The system optimizes agent performance by sandboxing tool outputs and externalizing large data sets, replacing raw I/O with pointers and concise summaries. It employs a persistent knowledge base that indexes session history and tool outputs for retrieval via full-text search, ensuring s
Koel is a self-hosted music streaming server designed for hosting, managing, and streaming personal digital music collections via web and mobile applications. It functions as a personal audio streaming platform that allows users to organize local and cloud-based audio libraries with integrated user accounts and playlist management. The system distinguishes itself by acting as a cloud-integrated media server, enabling the connection of remote storage providers to serve music files without requiring local disk space. It provides a cross-platform playback experience, ensuring consistent access t
Nominatim is a self-hosted geospatial search engine and geocoding server that utilizes OpenStreetMap data. It provides a complete infrastructure for forward geocoding, converting addresses or place names into geographic coordinates, and reverse geocoding, translating coordinates into human-readable physical addresses. The project features a dedicated data importer that parses raw map data into a PostgreSQL geospatial database. It distinguishes itself through a configurable import pipeline that uses style files to filter map features and an importance-based ranking system to prioritize search
Kedro is a data science pipeline framework and production toolbox designed to build reproducible, modular workflows using software engineering best practices. It functions as a data engineering orchestrator and catalog manager, bridging the gap between interactive analysis and maintainable production pipelines. The framework distinguishes itself by using a data catalog to decouple data access from processing logic and providing tools to transition analysis from interactive notebooks into structured workflows. It includes a workflow visualization tool that generates visual maps of data pipelin
RediSearch is a Redis module that adds secondary indexing, full-text search, aggregation, and vector similarity search directly into the in-memory data store. It operates as an in-process search engine, extending the core key-value store with capabilities for indexing hash and JSON documents, enabling fast field-level lookups beyond primary key access. The module provides a full-text search engine built on inverted indexes, supporting stemming, fuzzy matching, and relevance scoring via TF-IDF. It also includes a vector similarity search engine using a Hierarchical Navigable Small World graph
This project is a framework-agnostic library for building accessible, search-as-you-type interfaces. It provides a headless logic layer that decouples search state management and result filtering from the visual presentation, allowing developers to maintain full control over the underlying HTML structure and styling. The library distinguishes itself through a highly modular architecture that supports multi-source data aggregation, enabling the combination of results from static arrays, remote APIs, and external indices into a single interface. It features a flexible rendering engine that inte
DjangoBlog is an open-source blog engine built with the Django web framework, designed as a full-featured content management system. It provides Markdown editing for articles and pages, supports social login through OAuth providers including Google, GitHub, Facebook, Weibo, and QQ, and offers full-text search powered by Elasticsearch or Whoosh with keyword highlighting in results. The blog distinguishes itself through several integrated capabilities. It includes a Redis-based page caching system that caches rendered responses and automatically invalidates them on content changes to reduce dat
JMComic-Crawler-Python is a high-performance asynchronous web scraper and API client designed to programmatically retrieve images and metadata from a comic hosting service. It functions as a media archiving tool for batch downloading albums and chapters, automating the process of saving content to a local filesystem. The project is distinguished by its ability to reverse server-side pixel obfuscation, using a decryption tool to reconstruct sliced and shuffled images. To maintain stable connectivity, it utilizes a network bypass utility featuring dynamic domain rotation and proxy routing to ci
lunr.js is a JavaScript full-text search library and client-side search engine. It creates in-memory search indexes for fast keyword retrieval and ranked document matching within browser or Node.js environments. The library utilizes a JSON serializable search index, allowing the search structure to be converted to and from JSON for storage and distribution of pre-built search data. This enables search functionality for static websites by indexing content into portable files. The system supports advanced querying capabilities, including fuzzy text matching to account for typos, field-scoped i
This project is an open-source search data index and a collection of historical search trend data provided as a public trends archive. It serves as an open dataset for analyzing global patterns and events through downloadable files. The repository provides an aggregated index of anonymized and normalized search and media datasets. These resources are designed for academic and professional analysis, allowing for the study of longitudinal trends across different regions and timeframes. The data supports global search trend analysis, market pattern analysis, and public interest research. It ena
zvec is an embedded vector database engine and indexing library designed for high-dimensional similarity search. It functions as a hybrid search engine and a retrieval-augmented generation knowledge base, allowing for the storage and retrieval of dense and sparse vectors. The system is distinguished by its hybrid retrieval pipeline, which fuses vector similarity, full-text keyword matching, and scalar metadata filtering into single query operations. It supports a plugin-based model integration system for registering custom embedding models and rerankers, as well as language bindings for nativ
Rudder Server is a customer data platform and event routing pipeline designed to collect, transform, and route customer event data from various sources to data warehouses and business tools. It functions as a customer identity resolver, linking identifiers from multiple sources to build a unified identity graph and comprehensive behavioral customer profiles. The system differentiates itself through reverse ETL capabilities, which push processed customer segments and audiences from data warehouses back into operational third-party applications. It also provides a containerized data plane for K