Databases

Databases and storage engines for structured, relational, NoSQL, vector, and time-series data management. Find robust, self-hosted solutions for your infrastructure.

dgraph-io/dgraph
21,700 stars on GitHub
Dgraph is a distributed graph database designed to store and query highly connected data. It organizes information as nodes and edges to represent complex relationships between entities, providing a platform for managing and analyzing deeply linked datasets. The system functions as a horizontally scalable cluster that partitions data across multiple nodes to maintain performance and availability as information volume increases. It utilizes a specialized query language built for low-latency navigation of interconnected data points, allowing for the execution of complex queries across large-scale information networks. The platform incorporates a graph-oriented storage engine and in-memory indexing to facilitate efficient traversal of relationships. It manages state changes and data consistency through a distributed consensus algorithm and predicate-based sharding, which enables the system to decompose and execute queries in parallel across the cluster.
A distributed graph database engine for managing and querying highly connected data.
Go | Distributed Databases | Graph Databases
valkey-io/valkey
24,875 stars on GitHub
Valkey is an in-memory, NoSQL database server designed for high-performance data storage and real-time state management. It operates as a distributed key-value store, maintaining datasets entirely within system memory to facilitate sub-millisecond response times for read and write operations. The system distinguishes itself through a single-threaded event loop that utilizes asynchronous I/O multiplexing to ensure high throughput. It supports high availability via master-replica replication and provides a decoupled communication model through a built-in publish-subscribe messaging pattern. To ensure data durability, the engine employs a copy-on-write mechanism to generate point-in-time snapshots of the dataset on disk. The platform offers extensive infrastructure customization, allowing users to compile binaries from source with specialized memory allocators and hardware-level configurations. These capabilities enable the deployment of scalable, distributed storage clusters tailored to specific performance and hardware requirements.
An in-memory, distributed NoSQL database server for high-performance key-value storage.
C | Databases | Distributed Databases | Key-Value Stores
scylladb/scylladb
15,355 stars on GitHub
ScyllaDB is a distributed NoSQL database engine designed for high-throughput data storage and low-latency performance at scale. It functions as a shard-aware platform that manages large-scale datasets across distributed clusters, providing a foundation for real-time applications that require consistent availability and operational stability. The system distinguishes itself through a shared-nothing architecture that distributes data across independent CPU cores to eliminate lock contention. It incorporates a user-space networking stack and an asynchronous event-driven engine to maximize hardware utilization. Furthermore, the database provides API compatibility with Apache Cassandra (CQL) and Amazon DynamoDB, allowing existing application workloads to migrate without source code modifications. Beyond its core storage capabilities, the platform supports specialized indexing for high-dimensional vector embeddings, enabling semantic search and retrieval-augmented generation for artificial intelligence tasks. It also handles high-velocity time-series data ingestion and provides tools for managing distributed cluster deployments, performance monitoring, and secure API access. The software is designed for deployment across cloud and on-premises environments, including support for containerized execution.
A high-throughput, distributed NoSQL database engine optimized for large-scale data storage.
C++ | Distributed Databases | NoSQL Databases | Time Series
mongodb/mongo
28,158 stars on GitHub
This project is a distributed, document-oriented database system designed to store information in flexible, hierarchical structures. It supports horizontal scaling through automated sharding and maintains high availability across global clusters using a multi-node replication protocol. By executing multi-document operations as atomic units, the system ensures data integrity and consistency across distributed environments. The platform distinguishes itself by integrating advanced vector-based indexing, which enables semantic similarity searches alongside traditional geospatial and lexical queries. It functions as an enterprise-grade data platform, incorporating granular access controls, encryption, and auditing mechanisms to meet the requirements of regulated production environments. These capabilities allow for the management of large-scale datasets while maintaining the flexibility of a schema-less storage model. The system provides a comprehensive suite of tools for database administration, including command-line utilities for infrastructure management, data migration, and performance monitoring. It supports integration with container orchestration platforms and offers standardized client libraries to facilitate connectivity across various programming languages and business intelligence tools.
A distributed, document-oriented database system designed for flexible data structures and horizontal scaling.
C++ | Distributed Databases | Document Databases
arangodb/arangodb
14,091 stars on GitHub
This project is a multi-model database system designed to store and manage information as documents, graphs, and key-value pairs within a single engine. It functions as a graph database and knowledge graph platform, providing the infrastructure to build, query, and visualize structured data models. By integrating vector search capabilities, the system serves as a vector database that supports retrieval-augmented generation for artificial intelligence applications. The platform distinguishes itself through a unified query language that allows users to perform document lookups, graph traversals, and vector searches across diverse data models simultaneously. It includes a dedicated graph analytics engine capable of executing structural algorithms, such as pathfinding and centrality analysis, to identify patterns and influential nodes within complex networks. These features enable the construction of knowledge graphs that ground generative AI models in verified enterprise context, reducing hallucinations and improving response accuracy. Beyond its core storage and retrieval capabilities, the system supports predictive machine learning by leveraging stored relationship data to classify elements and forecast connections. It provides an interactive web interface for the visual exploration and navigation of graph structures, facilitating the analysis of complex information networks. The software is documented and distributed as a comprehensive environment for managing multi-model data and building intelligent, context-aware systems.
A multi-model database system that manages documents, graphs, and key-value pairs in one engine.
C++ | Graph Databases | Knowledge Graphs | Vector Databases
cockroachdb/cockroach
32,207 stars on GitHub
CockroachDB is a distributed SQL database designed to scale horizontally across multiple nodes while maintaining strict ACID compliance and global data consistency. It functions as a relational database engine that automatically partitions data into ranges, rebalancing them across a cluster to accommodate growing storage and throughput requirements. By utilizing a distributed consensus protocol, the system ensures that all nodes agree on the order of operations, providing fault tolerance and continuous availability even in the event of hardware failures. The system distinguishes itself through a layered architecture that separates the relational SQL abstraction from a distributed key-value store. It achieves global consistency without requiring perfectly synchronized hardware clocks by using hybrid logical clocks. To support high-concurrency environments, it utilizes multi-version concurrency control and lock-free transaction execution, which allow for consistent snapshots and efficient conflict resolution. Furthermore, the engine is built for compatibility, implementing the PostgreSQL wire protocol to support existing relational database drivers and tools. Beyond its core transactional capabilities, the platform includes comprehensive tooling for cluster orchestration, security, and performance diagnostics. It supports a variety of deployment models, ranging from self-hosted on-premises configurations to fully managed cloud services. The system provides a command-line interface for session management and query execution, ensuring that administrators can monitor cluster health and manage workloads through standard relational interfaces.
A distributed SQL database engine designed for horizontal scaling and global ACID consistency.
Go | Distributed Relational Databases | Distributed SQL Databases | Distributed SQL Engines
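The CockroachDB entry above mentions hybrid logical clocks. As a rough illustration of the general technique (not CockroachDB's actual implementation), the Python sketch below pairs a physical clock reading with a logical counter so that timestamps stay monotonic even when the physical clock stalls or a peer's clock runs ahead. The names `Timestamp`, `HybridLogicalClock`, and the `physical_now` callable are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Timestamp:
    wall: int     # physical component, e.g. milliseconds since some epoch
    logical: int  # counter that orders events sharing the same wall value


class HybridLogicalClock:
    """Generic hybrid logical clock sketch (illustrative, not a real API)."""

    def __init__(self, physical_now):
        # physical_now: zero-argument callable returning an int clock reading
        self._physical_now = physical_now
        self._last = Timestamp(0, 0)

    def now(self):
        """Return a timestamp for a local event or an outgoing message."""
        pt = self._physical_now()
        if pt > self._last.wall:
            self._last = Timestamp(pt, 0)
        else:
            # Physical clock did not advance: bump the logical counter.
            self._last = Timestamp(self._last.wall, self._last.logical + 1)
        return self._last

    def update(self, remote):
        """Merge a timestamp received from another node."""
        pt = self._physical_now()
        wall = max(pt, self._last.wall, remote.wall)
        if wall == self._last.wall and wall == remote.wall:
            logical = max(self._last.logical, remote.logical) + 1
        elif wall == self._last.wall:
            logical = self._last.logical + 1
        elif wall == remote.wall:
            logical = remote.logical + 1
        else:
            logical = 0  # the local physical clock is ahead of both
        self._last = Timestamp(wall, logical)
        return self._last


# Usage: a node whose physical clock is stuck at 100 still issues
# strictly increasing timestamps, even after hearing from a faster peer.
clock = HybridLogicalClock(lambda: 100)
first = clock.now()
second = clock.now()
merged = clock.update(Timestamp(200, 5))
assert first < second < merged
```

Because `Timestamp` is ordered by `(wall, logical)`, comparing two timestamps is enough to order the events they label in this sketch.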
tikv/tikv
16,535 stars on GitHub
TiKV is a distributed transactional key-value store designed for horizontal scalability and high availability. It functions as a storage engine that maintains massive datasets across a cluster of physical nodes, ensuring that information remains accessible and consistent even when individual hardware components fail. The system utilizes a consensus-based replication model to synchronize data across nodes, ensuring that all replicas agree on the order of operations. It manages data distribution through a sharding mechanism that partitions large datasets into smaller groups, each governed by independent consensus instances. To handle concurrent access, the engine employs multi-version concurrency control, allowing for consistent reads without blocking ongoing write operations. The architecture supports complex distributed transactions by coordinating multi-stage voting processes to ensure that all participating nodes either commit or abort changes together. It maintains data integrity through a storage engine that organizes information into sorted files on disk to optimize performance. The cluster maintains a consistent view of its state and topology through peer-to-peer communication and centralized orchestration.
A distributed transactional key-value storage engine designed for horizontal scalability.
Rust | Distributed Databases | Storage Engines
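The TiKV entry above notes that multi-version concurrency control lets readers see a consistent view without blocking writers. A minimal single-process sketch of that idea follows. It is a generic illustration under simplifying assumptions (one global counter as the commit timestamp, no transactions, no garbage collection), and the `MVCCStore` class and its methods are hypothetical names rather than TiKV's API.

```python
import bisect


class MVCCStore:
    """Toy multi-version store: each write appends a new version."""

    def __init__(self):
        # key -> (sorted commit timestamps, values in the same order)
        self._versions = {}
        self._ts = 0  # assumed global, monotonically increasing timestamp

    def write(self, key, value):
        """Append a new version of key and return its commit timestamp."""
        self._ts += 1
        stamps, values = self._versions.setdefault(key, ([], []))
        stamps.append(self._ts)
        values.append(value)
        return self._ts

    def snapshot(self):
        """Return a timestamp that pins the current state for later reads."""
        return self._ts

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        stamps, values = self._versions.get(key, ([], []))
        i = bisect.bisect_right(stamps, snapshot_ts)
        return values[i - 1] if i else None


# Usage: a reader holding an old snapshot is unaffected by a later write.
store = MVCCStore()
store.write("balance", 10)
snap = store.snapshot()
store.write("balance", 25)
assert store.read("balance", snap) == 10
assert store.read("balance", store.snapshot()) == 25
```

Old versions are never overwritten here, which is what allows the read at `snap` to proceed without coordinating with the second write.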
neo4j/neo4j
15,928 stars on GitHub
Neo4j is a native graph database management system designed to store and query highly connected data using a property-graph model. It provides an ACID-compliant transaction engine that ensures data integrity, supported by a distributed cluster architecture that maintains causal consistency across nodes. Users interact with the system through a declarative query language, which allows for complex pattern matching and path traversal without requiring manual traversal logic. The platform distinguishes itself through its hybrid approach to data retrieval, combining traditional graph-based queries with high-dimensional vector indexing. This integration enables simultaneous semantic similarity searches and relational data analysis within a single environment. By supporting both structured graph patterns and vector embeddings, the system facilitates advanced analytical tasks such as community detection, pathfinding, and centrality calculations. The project covers a broad capability surface, including comprehensive database administration, security controls, and performance optimization tools. It provides extensive support for AI-augmented workflows, enabling the integration of large language models for retrieval-augmented generation, natural language query translation, and autonomous agent memory management. These features are accessible through standardized language drivers, HTTP interfaces, and native schema enforcement mechanisms. The software is distributed as a database engine with support for both self-managed and cloud-hosted infrastructure, offering command-line tools for provisioning, monitoring, and lifecycle management.
A native graph database management system for storing and querying property-graph data.
Java | Databases | Distributed Databases | Graph Databases
ClickHouse/ClickHouse
48,042 stars on GitHub
ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring. The platform distinguishes itself through advanced storage and execution techniques, including vectorized query processing and a merge tree storage engine that maintains performance during massive insertions. It features adaptive subcolumn mapping for semi-structured data and supports native vector search for machine learning and generative AI applications. To facilitate efficient data movement, the engine utilizes zero-copy shared memory buffers, minimizing overhead when interacting with external analytical tools or processing diverse file formats like Parquet, JSON, and Arrow. Beyond its core storage and processing capabilities, the project provides a comprehensive suite of tools for observability, security, and data integration. It includes built-in support for natural language querying, automated workflow orchestration for AI agents, and extensive diagnostic features for query plan inspection. The platform also offers robust cloud infrastructure management, including support for private networking, compliant deployment strategies, and integrated billing consolidation.
A high-performance, columnar analytical database engine for real-time data aggregation.
C++ | Distributed Query Engines | Storage Engines | Vector Databases
tursodatabase/libsql
16,389 stars on GitHub
LibSQL is a high-performance, distributed SQL database engine that extends SQLite to support remote network access, edge computing, and real-time synchronization. It functions as an embedded database library that integrates directly into application processes while providing the infrastructure to maintain consistency across multiple geographic regions. The platform distinguishes itself by enabling database interaction over standard HTTP protocols, allowing applications to query remote data sources in serverless and edge environments without requiring local filesystem access. It includes native support for high-dimensional vector similarity search and indexing, enabling AI and machine learning workflows to run directly within the database engine. The system provides a comprehensive suite of tools for managing data lifecycles, including database branching, point-in-time state restoration, and automated synchronization between local replicas and remote primary instances. It also incorporates granular security primitives, such as token-based access control and network-level restrictions, to protect database resources in multi-tenant environments. The project offers extensive observability and administrative features, including query performance monitoring, audit logging, and organizational management tools. It is designed for integration through language-specific drivers and supports advanced data processing through specialized modules for full-text and similarity search.
A high-performance, distributed SQL database engine that extends SQLite for network and edge use.
C | Distributed Databases | Distributed SQL Databases | Vector Databases
pingcap/tidb
40,166 stars on GitHub
TiDB is a horizontally scalable, distributed SQL database designed to provide consistent transactional storage and high-performance analytical processing within a single unified architecture. It utilizes a decoupled compute-storage design and a distributed key-value storage layer to ensure horizontal scalability and efficient range-based queries. By employing a consensus-based replication algorithm, the system maintains high availability and automatic failover across multiple nodes and geographical regions. The platform distinguishes itself through its hybrid transactional and analytical processing capabilities, which allow complex SQL queries to run against replicated columnar data without disrupting primary transactional workloads. It also integrates high-dimensional vector search functionality, enabling semantic similarity queries directly alongside traditional relational data. To support diverse operational needs, the system provides native tools for real-time data streaming, seamless migration from external database systems, and multi-region disaster recovery. The database is built for cloud-native environments, offering comprehensive lifecycle management through Kubernetes operators that automate deployment, scaling, and rolling upgrades. It maintains compatibility with standard SQL interfaces, allowing applications to connect using common drivers while managing complex concurrency through pessimistic transaction handling. Detailed documentation and command-line utilities are available to assist with cluster orchestration, performance troubleshooting, and the configuration of production-grade topologies.
A distributed SQL database engine providing unified transactional and analytical processing.
Go | Distributed Databases | Distributed SQL Databases | Vector Databases
taosdata/TDengine
24,734 stars on GitHub
TDengine is a distributed time-series database designed for the high-speed ingestion, compression, and retrieval of timestamped metrics and sensor data. It functions as a SQL-compatible analytics engine, allowing users to perform complex operations on massive volumes of time-ordered information using standard relational syntax. The platform is built to serve as a backend foundation for industrial IoT environments, managing real-time data streams and device metadata through a cluster-based architecture. The system distinguishes itself through a distributed sharding architecture that uses consistent hashing to ensure horizontal scalability and high-throughput ingestion. It employs a log-structured write path to minimize disk seek latency and utilizes super-table virtualization to provide a unified logical view across multiple physical tables. To maintain performance and cost-efficiency, the database features automated multi-tiered lifecycle management, which migrates data between high-performance memory and low-cost storage based on age and access frequency. Beyond its core storage capabilities, the platform provides robust tools for edge-to-cloud synchronization, ensuring consistent data states across geographically distributed infrastructure. It includes built-in support for real-time stream processing, allowing for the analysis of live data without requiring external message queues. The system also incorporates comprehensive security frameworks, including user access control, audit logging, and encrypted transport protocols to protect sensitive operational data. Developers can interact with the database through native client libraries that support connection pooling and query parameter binding. The system is documented with comprehensive error code diagnostics and provides command-line utilities for cluster administration, health monitoring, and configuration management.
A distributed time-series database engine optimized for high-speed ingestion of sensor data.
C | Distributed Databases | Time Series Databases
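The TDengine entry above refers to sharding with consistent hashing. The sketch below shows the generic technique only. It makes no claim about how TDengine lays out its ring, and `HashRing`, `replicas`, and the node names are assumptions made for illustration. Each node is placed at several points on a hash ring, and a key belongs to the first node point found clockwise from the key's hash.

```python
import bisect
import hashlib


def _hash(text):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative)."""

    def __init__(self, nodes=(), replicas=64):
        self._replicas = replicas  # virtual points per node
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._replicas):
            p = _hash(f"{node}#{i}")
            if p in self._owner:
                continue  # extremely unlikely collision: keep first owner
            self._owner[p] = node
            bisect.insort(self._points, p)

    def remove(self, node):
        for i in range(self._replicas):
            p = _hash(f"{node}#{i}")
            if self._owner.get(p) == node:
                del self._owner[p]
                self._points.remove(p)

    def lookup(self, key):
        """Return the node that owns key."""
        if not self._points:
            raise LookupError("ring is empty")
        # First point clockwise from the key, wrapping around at the end.
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]


# Usage: removing one node only moves the keys that node used to own.
ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"device-{i}" for i in range(100)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("node-b")
moved = [k for k in keys if ring.lookup(k) != before[k]]
assert all(before[k] == "node-b" for k in moved)
```

The property checked at the end is the reason the technique suits horizontal scaling: membership changes relocate only a fraction of the keys instead of reshuffling everything.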
citusdata/citus
12,562 stars on GitHub
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based distribution and schema-based sharding, which allows for the isolation of tenant data and the migration of high-volume workloads to dedicated nodes. To accelerate analytical performance, the system integrates columnar storage with data compression and supports pre-aggregated rollups, ensuring that large-scale datasets remain performant as the cluster grows. Beyond its core distribution capabilities, the project provides comprehensive tools for cluster administration and data lifecycle management. It automates shard rebalancing, schema propagation via a two-phase commit protocol, and the maintenance of time-based partitions. The system also includes diagnostic utilities for monitoring query performance, detecting resource contention, and analyzing index usage across the distributed environment.
A PostgreSQL extension that transforms a standard database into a distributed SQL engine.
C | Distributed Relational Databases | Distributed SQL Engines
dragonflydb/dragonfly
30,688 stars on GitHub
Dragonfly is a high-performance, multi-model in-memory data store designed to serve as a drop-in replacement for Redis and Memcached. By utilizing a multi-threaded, shared-nothing architecture and a fiber-based concurrency model, it maximizes CPU utilization and minimizes latency for read and write operations. The system supports a wide range of data structures, including strings, hashes, lists, sets, sorted sets, and JSON documents, while maintaining compatibility with the Redis and Memcached wire protocols and their existing client libraries. What distinguishes Dragonfly is its focus on efficiency and scalability through advanced memory management and request processing. It employs a lock-free, cache-friendly hash table structure and zero-copy serialization to reduce overhead during high-throughput operations. For durability, the system utilizes asynchronous, snapshot-based persistence that captures the state of the dataset without blocking active requests. Furthermore, it provides built-in support for horizontal scaling and cluster management, allowing for the distribution of large datasets across multiple nodes to ensure high availability. Beyond core storage, the platform includes a comprehensive suite of operational and analytical capabilities. It features integrated support for geospatial data management, real-time message brokering via publish-subscribe patterns, and full-text search. To handle massive datasets efficiently, the engine incorporates probabilistic data structures for cardinality estimation, frequency tracking, and membership testing. These features are complemented by robust administrative tools, including access control, request rate limiting, and detailed server monitoring.
A high-performance, multi-model in-memory data store designed as a drop-in replacement for Redis and Memcached.
C++ | Database Sharding Solutions | Key-Value Stores
surrealdb/surrealdb
32,397 stars on GitHub
SurrealDB is a multi-model database engine designed to store and query document, graph, relational, and vector data within a single ACID-compliant platform. It functions as an AI-native data store, integrating vector search, graph traversal, and machine learning model execution directly into its query layer. By providing a unified declarative query language, the platform eliminates the need for external middleware to synchronize data across different storage models. The platform distinguishes itself through its ability to manage agent memory and complex workflows natively. It allows developers to store agent memory, knowledge graphs, and structured data within a single transaction boundary, ensuring consistent state and permissions. Furthermore, the engine supports real-time reactive applications by pushing data updates directly to connected clients through live queries, removing the requirement for external message brokers or polling mechanisms. SurrealDB is built for versatility, operating as a portable database runtime that maintains a consistent interface across embedded, edge, and cloud environments. Its architecture includes a granular, record-level permission model that enforces security and multi-tenant isolation directly at the data layer. The system also features an isolated sandboxing environment for custom extensions, allowing for specialized data processing without compromising system stability or security. The project provides extensive documentation and learning resources, including a structured curriculum and hands-on projects, to assist with onboarding and architectural mastery. It is distributed as a single binary, facilitating deployment across diverse infrastructure ranging from resource-constrained devices to large-scale distributed cloud clusters.
A multi-model database engine supporting document, graph, relational, and vector data.
Rust | Databases | Distributed Databases | Graph Databases
tursodatabase/turso
17,434 stars on GitHub
Turso is a distributed SQL database platform that provides managed, edge-hosted SQLite instances. It functions as a serverless database provider, enabling the deployment of relational databases that synchronize data across multiple geographic regions to support high availability and performance. The platform distinguishes itself by utilizing a fork of SQLite as its core storage engine, which supports both local file storage and remote network-based replication. It employs an edge-optimized proxy to route queries through a global network, minimizing latency by connecting users to the nearest database replica. Communication is handled via a stateless, HTTP-based protocol that operates over standard web ports. The service includes comprehensive infrastructure for multi-tenant database orchestration, allowing for the dynamic provisioning of isolated instances without manual server management. Users can manage these remote databases, configure access permissions, and handle security credentials directly through a command-line interface.
A distributed SQL database platform providing managed, edge-hosted SQLite instances.
Rust | Distributed SQL Databases
milvus-io/milvus
44,804 stars on GitHub
Milvus is a specialized vector database engine designed for the indexing, management, and high-speed similarity retrieval of high-dimensional vector embeddings. It functions as a similarity search engine capable of identifying nearest neighbors within large-scale vector spaces, supporting the storage and retrieval of billions of data points while maintaining consistent performance. The system utilizes a distributed architecture that decouples storage, query, and coordination into independent services, allowing for horizontal scaling across clusters. It employs a global indexing mechanism that builds specialized data structures across immutable, independently indexed segments. This design, combined with a shared-storage decoupled model, enables compute and storage resources to scale independently in cloud environments, while a log-based persistence layer ensures data durability and state recovery. The platform supports a wide range of data retrieval patterns, including retrieval-augmented generation, hybrid search, and multimodal data retrieval for text, images, and graphs. Deployment options range from lightweight local instances for rapid prototyping to robust standalone setups and fully managed distributed clusters. Documentation includes sizing tools to assist in estimating hardware requirements based on specific data volumes and operational patterns.
A specialized vector database engine for high-speed similarity retrieval of embeddings.
Go | Vector Databases
rethinkdb/rethinkdb
26,996 stars on GitHub
RethinkDB is a distributed, document-oriented database designed to store and manage JSON-formatted data across scalable clusters. It utilizes a custom log-structured storage engine with B-Tree indexing to ensure high-performance disk I/O and data persistence. The system maintains high availability through automatic sharding and replication, employing a primary-replica voting consensus mechanism to handle node failures and ensure consistent cluster operations. A defining characteristic of the platform is its reactive changefeed engine, which allows applications to subscribe to live data updates. Instead of polling for changes, developers can maintain persistent cursors on tables to stream document modifications in real-time. This is complemented by a fluent, functional query language that translates native code constructs into optimized, parallelized execution plans. By embedding these queries directly into application code, the system provides a type-safe interface that helps prevent injection vulnerabilities while enabling complex data manipulation and aggregation. The platform provides a comprehensive suite of administrative tools for managing production environments, including granular user permissions, TLS network encryption, and visual cluster monitoring. It supports advanced data modeling through document embedding and cross-table linking, as well as specialized geospatial processing for proximity-based queries. The system is designed for integration with modern web frameworks and message brokers, facilitating real-time synchronization with external services and search engines. RethinkDB is configured via key-value files and command-line interfaces, with support for containerized deployment and automated infrastructure orchestration.
A distributed, document-oriented database engine for managing JSON data across clusters.
C++ | Databases | Document Databases | Storage Engines
typesense/typesense
25,254 stars on GitHub
Typesense is a distributed search engine designed to provide sub-millisecond query latency across massive datasets. It functions as both a high-performance indexing and retrieval engine and a comprehensive search experience platform, offering built-in typo tolerance and tools for managing relevance through synonym configuration, result curation, and complex filtering. The platform distinguishes itself by utilizing in-memory indexing to maintain high-throughput data retrieval and integrating vector database capabilities to support semantic similarity searches. It ensures data consistency and high availability across distributed clusters through a consensus-based coordination model and asynchronous snapshot replication. By combining traditional keyword matching with high-dimensional embedding support, it enables natural language understanding and similarity-based retrieval within application workflows. The system manages large-scale data through distributed indexing and log-structured merge trees, which optimize write performance and simplify incremental updates. Users can refine search outcomes by applying custom grouping logic and negation filters to improve discovery accuracy. Comprehensive documentation and community support channels are available to assist with integration and troubleshooting.
A distributed search engine that functions as a specialized indexing and retrieval platform.
C++ | Distributed Search Engines | Vector Databases