Databases & Data

Database management systems, storage engines, and data processing tools for building scalable, high-performance data architectures and analytical applications.

dgraph-io/dgraph
21,700 stars
Dgraph is a distributed graph database designed to store and query highly connected data. It organizes information as nodes and edges to represent complex relationships between entities, providing a platform for managing and analyzing deeply linked datasets. The system functions as a horizontally scalable cluster that partitions data across multiple nodes to maintain performance and availability as information volume increases. It utilizes a specialized query language built for low-latency navigation of interconnected data points, allowing for the execution of complex queries across large-scale information networks. The platform incorporates a graph-oriented storage engine and in-memory indexing to facilitate efficient traversal of relationships. It manages state changes and data consistency through a distributed consensus algorithm and predicate-based sharding, which enables the system to decompose and execute queries in parallel across the cluster.
A distributed graph database engine built for storing and querying highly connected data.
Go, Distributed Databases, Graph Databases
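The idea of predicate-based sharding mentioned in the Dgraph entry can be sketched roughly as follows: every triple that shares a predicate (edge type) is placed on the same group, so a query touching several predicates can be split into per-group sub-queries. The group count, the hash-based placement, and the names `group_for`, `shard`, and `plan` are assumptions made for illustration; the listing does not describe how Dgraph actually assigns or rebalances predicates.

```python
import hashlib
from collections import defaultdict

NUM_GROUPS = 3  # hypothetical cluster size, not a Dgraph default


def group_for(predicate: str) -> int:
    """Map a predicate name to a group with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(predicate.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_GROUPS


def shard(triples):
    """Place every (subject, predicate, object) triple on its predicate's group."""
    groups = defaultdict(list)
    for subject, predicate, obj in triples:
        groups[group_for(predicate)].append((subject, predicate, obj))
    return groups


def plan(query_predicates):
    """Split a query into per-group predicate lists that could run in parallel."""
    by_group = defaultdict(list)
    for predicate in query_predicates:
        by_group[group_for(predicate)].append(predicate)
    return dict(by_group)


triples = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "lives_in", "Paris"),
]
shards = shard(triples)
query_plan = plan(["follows", "lives_in"])
```

Because placement depends only on the predicate, a traversal over `follows` never has to consult more than one group in this toy model.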
tikv/tikv
16,535 stars
TiKV is a distributed transactional key-value store designed for horizontal scalability and high availability. It functions as a storage engine that maintains massive datasets across a cluster of physical nodes, ensuring that information remains accessible and consistent even when individual hardware components fail. The system utilizes a consensus-based replication model to synchronize data across nodes, ensuring that all replicas agree on the order of operations. It manages data distribution through a sharding mechanism that partitions large datasets into smaller groups, each governed by independent consensus instances. To handle concurrent access, the engine employs multi-version concurrency control, allowing for consistent reads without blocking ongoing write operations. The architecture supports complex distributed transactions by coordinating multi-stage voting processes to ensure that all participating nodes either commit or abort changes together. It maintains data integrity through a storage engine that organizes information into sorted files on disk to optimize performance. The cluster maintains a consistent view of its state and topology through peer-to-peer communication and centralized orchestration.
A distributed transactional key-value store designed for massive datasets and horizontal scalability.
Rust, Distributed Databases, Distributed Key-Value Stores, Key-Value
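To make the multi-version concurrency control idea in the TiKV entry more concrete, here is a minimal single-process sketch: each write appends a new timestamped version, and a reader pinned to an earlier timestamp keeps seeing the older value. The class `MVCCStore`, its logical clock, and its method names are hypothetical and only illustrate the general technique, not TiKV's transaction layer.

```python
import bisect


class MVCCStore:
    """Toy multi-version store: each key keeps (commit_ts, value) pairs in ascending order."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value)
        self._clock = 0      # logical clock standing in for a timestamp service

    def put(self, key, value):
        """Append a new version; older versions stay readable."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return a read timestamp; reads at it ignore later writes."""
        return self._clock

    def get(self, key, ts):
        """Return the newest value with commit_ts <= ts, or None if there is none."""
        versions = self._versions.get(key, [])
        timestamps = [commit_ts for commit_ts, _ in versions]
        idx = bisect.bisect_right(timestamps, ts)
        return versions[idx - 1][1] if idx else None


store = MVCCStore()
store.put("k", "v1")
snap = store.snapshot()   # reader pins its view here
store.put("k", "v2")      # a later write does not disturb the pinned reader
old_view = store.get("k", snap)
new_view = store.get("k", store.snapshot())
```

The reader never takes a lock: it simply ignores versions newer than its snapshot timestamp.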
scylladb/scylladb
15,355 stars
ScyllaDB is a distributed NoSQL database engine designed for high-throughput data storage and low-latency performance at scale. It functions as a shard-aware platform that manages large-scale datasets across distributed clusters, providing a foundation for real-time applications that require consistent availability and operational stability. The system distinguishes itself through a shared-nothing architecture that distributes data across independent CPU cores to eliminate lock contention. It incorporates a user-space networking stack and an asynchronous event-driven engine to maximize hardware utilization. Furthermore, the database provides native compatibility with established cloud-native and NoSQL protocols, allowing for the migration of existing application workloads without requiring source code modifications. Beyond its core storage capabilities, the platform supports specialized indexing for high-dimensional vector embeddings, enabling semantic search and retrieval-augmented generation for artificial intelligence tasks. It also handles high-velocity time-series data ingestion and provides tools for managing distributed cluster deployments, performance monitoring, and secure API access. The software is designed for deployment across cloud and on-premises environments, including support for containerized execution.
A high-throughput, distributed NoSQL database engine optimized for low-latency performance.
C++, Distributed Databases, NoSQL Databases, Time Series
valkey-io/valkey
24,875 stars
Valkey is an in-memory, NoSQL database server designed for high-performance data storage and real-time state management. It operates as a distributed key-value store, maintaining datasets entirely within system memory to facilitate sub-millisecond response times for read and write operations. The system distinguishes itself through a single-threaded event loop that utilizes asynchronous I/O multiplexing to ensure high throughput. It supports high availability via master-replica replication and provides a decoupled communication model through a built-in publish-subscribe messaging pattern. To ensure data durability, the engine employs a copy-on-write mechanism to generate point-in-time snapshots of the dataset on disk. The platform offers extensive infrastructure customization, allowing users to compile binaries from source with specialized memory allocators and hardware-level configurations. These capabilities enable the deployment of scalable, distributed storage clusters tailored to specific performance and hardware requirements.
An in-memory, distributed NoSQL database server built for high-performance state management.
C, Distributed Databases, Key-Value Stores, NoSQL Databases
typesense/typesense
25,254 stars
Typesense is a distributed search engine designed to provide sub-millisecond query latency across massive datasets. It functions as both a high-performance indexing and retrieval engine and a comprehensive search experience platform, offering built-in typo tolerance and tools for managing relevance through synonym configuration, result curation, and complex filtering. The platform distinguishes itself by utilizing in-memory indexing to maintain high-throughput data retrieval and integrating vector database capabilities to support semantic similarity searches. It ensures data consistency and high availability across distributed clusters through a consensus-based coordination model and asynchronous snapshot replication. By combining traditional keyword matching with high-dimensional embedding support, it enables natural language understanding and similarity-based retrieval within application workflows. The system manages large-scale data through distributed indexing and log-structured merge trees, which optimize write performance and simplify incremental updates. Users can refine search outcomes by applying custom grouping logic and negation filters to improve discovery accuracy. Comprehensive documentation and community support channels are available to assist with integration and troubleshooting.
A distributed search engine and indexing platform designed for sub-millisecond data retrieval.
C++, Vector Databases, Distributed Search Engines
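Typo tolerance, as mentioned in the Typesense entry, is commonly modeled with edit distance: a query term matches an indexed term if only a few single-character edits separate them. The sketch below uses plain Levenshtein distance; the listing does not say how Typesense implements this, so the functions `edit_distance` and `fuzzy_match` and the `max_typos` parameter are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(query, vocabulary, max_typos=1):
    """Return the vocabulary terms within max_typos edits of the query."""
    return [word for word in vocabulary if edit_distance(query, word) <= max_typos]


vocabulary = ["database", "dataset", "graph"]
one_typo = fuzzy_match("databse", vocabulary, max_typos=1)
two_typos = fuzzy_match("databse", vocabulary, max_typos=2)
```

A real engine would avoid scanning the whole vocabulary, but the matching rule itself is the same.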
arangodb/arangodb
14,091 stars
ArangoDB is a multi-model database system designed to store and manage information as documents, graphs, and key-value pairs within a single engine. It functions as a graph database and knowledge graph platform, providing the infrastructure to build, query, and visualize structured data models. By integrating vector search capabilities, the system serves as a vector database that supports retrieval-augmented generation for artificial intelligence applications. The platform distinguishes itself through a unified query language (AQL) that allows users to perform document lookups, graph traversals, and vector searches across diverse data models simultaneously. It includes a dedicated graph analytics engine capable of executing structural algorithms, such as pathfinding and centrality analysis, to identify patterns and influential nodes within complex networks. These features enable the construction of knowledge graphs that ground generative AI models in verified enterprise context, reducing hallucinations and improving response accuracy. Beyond its core storage and retrieval capabilities, the system supports predictive machine learning by leveraging stored relationship data to classify elements and forecast connections. It provides an interactive web interface for the visual exploration and navigation of graph structures, facilitating the analysis of complex information networks. The software is documented and distributed as a comprehensive environment for managing multi-model data and building intelligent, context-aware systems.
A multi-model database system that natively supports document, graph, and key-value data models.
C++, Graph Databases, Multi-Model Databases, Vector Databases
surrealdb/surrealdb
32,397 stars
SurrealDB is a multi-model database engine designed to store and query document, graph, relational, and vector data within a single ACID-compliant platform. It functions as an AI-native data store, integrating vector search, graph traversal, and machine learning model execution directly into its query layer. By providing a unified declarative query language, the platform eliminates the need for external middleware to synchronize data across different storage models. The platform distinguishes itself through its ability to manage agent memory and complex workflows natively. It allows developers to store agent memory, knowledge graphs, and structured data within a single transaction boundary, ensuring consistent state and permissions. Furthermore, the engine supports real-time reactive applications by pushing data updates directly to connected clients through live queries, removing the requirement for external message brokers or polling mechanisms. SurrealDB is built for versatility, operating as a portable database runtime that maintains a consistent interface across embedded, edge, and cloud environments. Its architecture includes a granular, record-level permission model that enforces security and multi-tenant isolation directly at the data layer. The system also features an isolated sandboxing environment for custom extensions, allowing for specialized data processing without compromising system stability or security. The project provides extensive documentation and learning resources, including a structured curriculum and hands-on projects, to assist with onboarding and architectural mastery. It is distributed as a single binary, facilitating deployment across diverse infrastructure ranging from resource-constrained devices to large-scale distributed cloud clusters.
A multi-model database engine that integrates document, graph, relational, and vector data storage.
Rust, Database Engines, Vector Databases, Distributed Databases
dolthub/dolt
19,907 stars
Dolt is a relational database engine that integrates version control directly into the database management layer. It functions as a version-controlled SQL database that tracks every row and schema change using a commit-based history, allowing users to branch, merge, and audit data modifications. By implementing a MySQL wire-protocol-compatible server, the system enables standard SQL clients and tools to interact with versioned data as if they were connecting to a traditional relational database. The platform distinguishes itself by applying repository-style workflows to data management, including support for forking, pull requests, and issue tracking. It utilizes a Merkle-tree-based storage engine to calculate structural and row-level differences between database states, surfacing merge conflicts as queryable relational tables. This architecture allows teams to isolate experimental changes in branches and maintain a tamper-evident history of all modifications that can be queried via SQL. Beyond its core versioning capabilities, the system provides comprehensive infrastructure for data engineering, including remote synchronization, replication, and automated workflow triggers. It supports standard SQL query execution and data import from common file formats, while offering granular access control and role-based permissions to secure database states. The software is designed to operate as a drop-in replacement for existing MySQL environments, maintaining compatibility with standard drivers and management tools.
A relational database engine that integrates Git-style version control directly into the data layer.
Go, Distributed SQL Databases, Distributed Databases, Relational Database Engines
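The Dolt entry mentions using Merkle-tree hashes to compute row-level differences between database states. The sketch below shows the general idea with a small two-level hash tree: rows are hashed into buckets, bucket hashes roll up into a root hash, and the diff only scans buckets whose hashes differ. The fan-out `BUCKETS`, the table layout (string primary keys mapped to tuples), and the function names are assumptions for illustration and do not reflect Dolt's actual storage format.

```python
import hashlib

BUCKETS = 4  # hypothetical fan-out of the tree


def h(data: str) -> str:
    """Hex SHA-256 of a string."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def build_tree(table: dict):
    """Build a two-level hash tree over {primary_key: row_tuple}."""
    buckets = [dict() for _ in range(BUCKETS)]
    for key, row in table.items():
        buckets[int(h(key), 16) % BUCKETS][key] = h(repr(row))
    bucket_hashes = [
        h("".join(f"{k}:{v};" for k, v in sorted(b.items()))) for b in buckets
    ]
    return {"root": h("".join(bucket_hashes)),
            "bucket_hashes": bucket_hashes,
            "buckets": buckets}


def diff(old, new):
    """Return {key: 'added' | 'removed' | 'modified'}, skipping unchanged buckets."""
    changes = {}
    if old["root"] == new["root"]:
        return changes  # identical roots mean identical tables
    for i in range(BUCKETS):
        if old["bucket_hashes"][i] == new["bucket_hashes"][i]:
            continue  # whole bucket unchanged, no need to look at its rows
        a, b = old["buckets"][i], new["buckets"][i]
        for key in a.keys() | b.keys():
            if key not in a:
                changes[key] = "added"
            elif key not in b:
                changes[key] = "removed"
            elif a[key] != b[key]:
                changes[key] = "modified"
    return changes


main = {"1": ("alice", 30), "2": ("bob", 25), "3": ("carol", 41)}
branch = {"1": ("alice", 31), "2": ("bob", 25), "4": ("dave", 19)}
changes = diff(build_tree(main), build_tree(branch))
```

Comparing hashes top-down is what lets a diff skip large unchanged regions instead of comparing every row.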
pingcap/tidb
40,166 stars
TiDB is a horizontally scalable, distributed SQL database designed to provide consistent transactional storage and high-performance analytical processing within a single unified architecture. It utilizes a decoupled compute-storage design and a distributed key-value storage layer to ensure horizontal scalability and efficient range-based queries. By employing a consensus-based replication algorithm, the system maintains high availability and automatic failover across multiple nodes and geographical regions. The platform distinguishes itself through its hybrid transactional and analytical processing capabilities, which allow complex SQL queries to run against replicated columnar data without disrupting primary transactional workloads. It also integrates high-dimensional vector search functionality, enabling semantic similarity queries directly alongside traditional relational data. To support diverse operational needs, the system provides native tools for real-time data streaming, seamless migration from external database systems, and multi-region disaster recovery. The database is built for cloud-native environments, offering comprehensive lifecycle management through Kubernetes operators that automate deployment, scaling, and rolling upgrades. It maintains compatibility with standard SQL interfaces, allowing applications to connect using common drivers while managing complex concurrency through pessimistic transaction handling. Detailed documentation and command-line utilities are available to assist with cluster orchestration, performance troubleshooting, and the configuration of production-grade topologies.
A horizontally scalable, distributed SQL database engine with a unified transactional and analytical architecture.
Go, Distributed SQL Databases, Distributed Databases, Distributed Key-Value Stores
neo4j/neo4j
15,928 stars
Neo4j is a native graph database management system designed to store and query highly connected data using a property-graph model. It provides an ACID-compliant transaction engine that ensures data integrity, supported by a distributed cluster architecture that maintains causal consistency across nodes. Users interact with the system through a declarative query language, which allows for complex pattern matching and path traversal without requiring manual traversal logic. The platform distinguishes itself through its hybrid approach to data retrieval, combining traditional graph-based queries with high-dimensional vector indexing. This integration enables simultaneous semantic similarity searches and relational data analysis within a single environment. By supporting both structured graph patterns and vector embeddings, the system facilitates advanced analytical tasks such as community detection, pathfinding, and centrality calculations. The project covers a broad capability surface, including comprehensive database administration, security controls, and performance optimization tools. It provides extensive support for AI-augmented workflows, enabling the integration of large language models for retrieval-augmented generation, natural language query translation, and autonomous agent memory management. These features are accessible through standardized language drivers, HTTP interfaces, and native schema enforcement mechanisms. The software is distributed as a database engine with support for both self-managed and cloud-hosted infrastructure, offering command-line tools for provisioning, monitoring, and lifecycle management.
A native graph database management system providing ACID-compliant storage for connected data.
Java, Distributed Databases, Graph Databases, Hybrid Vector-Graph Databases
tursodatabase/libsql
16,389 stars
LibSQL is a high-performance, distributed SQL database engine that extends SQLite to support remote network access, edge computing, and real-time synchronization. It functions as an embedded database library that integrates directly into application processes while providing the infrastructure to maintain consistency across multiple geographic regions. The platform distinguishes itself by enabling database interaction over standard HTTP protocols, allowing applications to query remote data sources in serverless and edge environments without requiring local filesystem access. It includes native support for high-dimensional vector similarity search and indexing, enabling AI and machine learning workflows to run directly within the database engine. The system provides a comprehensive suite of tools for managing data lifecycles, including database branching, point-in-time state restoration, and automated synchronization between local replicas and remote primary instances. It also incorporates granular security primitives, such as token-based access control and network-level restrictions, to protect database resources in multi-tenant environments. The project offers extensive observability and administrative features, including query performance monitoring, audit logging, and organizational management tools. It is designed for integration through language-specific drivers and supports advanced data processing through specialized modules for full-text and similarity search.
A distributed SQL database engine extending SQLite for edge computing and network synchronization.
C, Distributed SQL Databases, Distributed Databases, Vector Databases
taosdata/TDengine
24,734 stars
TDengine is a distributed time-series database designed for the high-speed ingestion, compression, and retrieval of timestamped metrics and sensor data. It functions as a SQL-compatible analytics engine, allowing users to perform complex operations on massive volumes of time-ordered information using standard relational syntax. The platform is built to serve as a backend foundation for industrial IoT environments, managing real-time data streams and device metadata through a cluster-based architecture. The system distinguishes itself through a distributed sharding architecture that uses consistent hashing to ensure horizontal scalability and high-throughput ingestion. It employs a log-structured write path to minimize disk seek latency and utilizes super-table virtualization to provide a unified logical view across multiple physical tables. To maintain performance and cost-efficiency, the database features automated multi-tiered lifecycle management, which migrates data between high-performance memory and low-cost storage based on age and access frequency. Beyond its core storage capabilities, the platform provides robust tools for edge-to-cloud synchronization, ensuring consistent data states across geographically distributed infrastructure. It includes built-in support for real-time stream processing, allowing for the analysis of live data without requiring external message queues. The system also incorporates comprehensive security frameworks, including user access control, audit logging, and encrypted transport protocols to protect sensitive operational data. Developers can interact with the database through native client libraries that support connection pooling and query parameter binding. The system is documented with comprehensive error code diagnostics and provides command-line utilities for cluster administration, health monitoring, and configuration management.
A specialized distributed time-series database engine for high-speed ingestion and analytics.
C, Distributed Databases, Time Series Databases
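The consistent hashing mentioned in the TDengine entry can be illustrated with a small hash ring: nodes are placed at many points on a ring, each key goes to the next point clockwise, and adding a node only relocates the keys that now fall on the new node. The class `HashRing`, the node names, and the number of virtual points per node are hypothetical choices for this sketch, not TDengine internals.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with several virtual points per node."""

    def __init__(self, nodes, points_per_node=64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        """Stable 64-bit position on the ring."""
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key's hash."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"device-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

In this model every relocated key lands on the newly added node, which is why scaling out does not reshuffle the whole dataset.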
memvid/memvid
13,160 stars
Memvid is an embedded memory framework designed to provide persistent, versioned context for intelligent agents. It functions as a local vector database library that stores all data within a single binary file, removing the need for external database infrastructure or network dependencies. The system distinguishes itself by integrating in-process vector indexing with append-only versioning, allowing for high-speed semantic similarity searches alongside the ability to track and roll back state changes over time. It includes built-in transparent data encryption and masking to secure sensitive information at rest, ensuring privacy and compliance during all storage operations. The framework provides a comprehensive set of tools for managing agent context, including programmatic SDK access for reading and writing data. By combining embedded key-value mapping with low-latency retrieval mechanisms, it enables applications to maintain consistent, long-term memory across sessions.
An embedded vector database library designed for persistent, versioned context in intelligent agents.
Rust, Key-Value Stores, Vector Databases
dragonflydb/dragonfly
30,688 stars
Dragonfly is a high-performance, multi-model in-memory data store designed to serve as a drop-in replacement for Redis and Memcached. By utilizing a multi-threaded, shared-nothing architecture and a fiber-based concurrency model, it maximizes CPU utilization and minimizes latency for read and write operations. The system supports a wide range of data structures, including strings, hashes, lists, sets, sorted sets, and JSON documents, while maintaining compatibility with the Redis and Memcached wire protocols and their existing client libraries. What distinguishes Dragonfly is its focus on efficiency and scalability through advanced memory management and request processing. It employs a lock-free, cache-friendly hash table structure and zero-copy serialization to reduce overhead during high-throughput operations. For durability, the system utilizes asynchronous, snapshot-based persistence that captures the state of the dataset without blocking active requests. Furthermore, it provides built-in support for horizontal scaling and cluster management, allowing for the distribution of large datasets across multiple nodes to ensure high availability. Beyond core storage, the platform includes a comprehensive suite of operational and analytical capabilities. It features integrated support for geospatial data management, real-time message brokering via publish-subscribe patterns, and full-text search. To handle massive datasets efficiently, the engine incorporates probabilistic data structures for cardinality estimation, frequency tracking, and membership testing. These features are complemented by robust administrative tools, including access control, request rate limiting, and detailed server monitoring.
A high-performance, multi-model in-memory data store designed as a drop-in database replacement.
C++, Key-Value Stores, Multi-Model Databases
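One of the probabilistic structures the Dragonfly entry alludes to for membership testing is a Bloom filter, which answers "definitely absent" or "possibly present" using a fixed-size bit array. The following is a generic sketch of that technique; the class `BloomFilter`, its sizing parameters, and its hashing scheme are assumptions and are not taken from Dragonfly's code.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        """Derive num_hashes bit positions by salting the item with a seed."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
for user in ("user:1", "user:2", "user:3"):
    seen.add(user)
```

The trade-off is memory for certainty: the filter never stores the items themselves, so a positive answer still needs confirmation against the real data when exactness matters.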
milvus-io/milvus
44,804 stars
Milvus is a specialized vector database engine designed for the indexing, management, and high-speed similarity retrieval of high-dimensional vector embeddings. It functions as a similarity search engine capable of identifying nearest neighbors within large-scale vector spaces, supporting the storage and retrieval of billions of data points while maintaining consistent performance. The system utilizes a distributed architecture that decouples storage, query, and coordination into independent services, allowing for horizontal scaling across clusters. It employs a global indexing mechanism that builds specialized data structures across immutable, independently indexed segments. This design, combined with a shared-storage decoupled model, enables compute and storage resources to scale independently in cloud environments, while a log-based persistence layer ensures data durability and state recovery. The platform supports a wide range of data retrieval patterns, including retrieval-augmented generation, hybrid search, and multimodal data retrieval for text, images, and graphs. Deployment options range from lightweight local instances for rapid prototyping to robust standalone setups and fully managed distributed clusters. Documentation includes sizing tools to assist in estimating hardware requirements based on specific data volumes and operational patterns.
A specialized vector database engine built for high-speed similarity retrieval of embeddings.
Go, Vector Databases
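The nearest-neighbor retrieval described in the Milvus entry can be illustrated with an exhaustive baseline: score every stored vector against the query by cosine similarity and keep the top results. Production vector databases rely on specialized index structures to avoid this full scan; the functions `cosine` and `top_k` and the sample vectors below are hypothetical and only show what the search is meant to compute.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    return heapq.nlargest(k, vectors, key=lambda vec_id: cosine(query, vectors[vec_id]))


vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
}
nearest = top_k([1.0, 0.0], vectors, k=2)
```

An approximate index trades a little recall for speed, but its results are judged against exactly this kind of brute-force answer.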
pubkey/rxdb
23,048 stars
RxDB is a reactive, offline-first NoSQL database engine designed for JavaScript applications. It provides a robust framework for managing application state by synchronizing data across browsers, mobile devices, and server-side runtimes. By treating local storage as the primary source of truth, it enables applications to remain functional without network connectivity, automatically reconciling changes with remote backends once a connection is restored. The database distinguishes itself through a modular architecture that supports cross-environment synchronization and high-performance data management. It features a bidirectional replication protocol that handles conflict resolution and state convergence, alongside a pluggable storage abstraction that allows developers to swap between engines like IndexedDB, SQLite, or in-memory stores without altering application logic. To ensure responsiveness, the system offloads storage operations to background worker threads and coordinates database access across multiple browser tabs through a leader election mechanism. The platform offers a comprehensive suite of capabilities for data integrity, performance, and security. It enforces strict data validation through schema-based definitions and optimizes storage footprints using transparent key compression. Developers can bind database query results directly to user interface components, enabling reactive state management where the UI automatically updates in response to local or remote data changes. The project is built for extensibility, offering a wide range of plugins for encryption, full-text search, and integration with various backend protocols including GraphQL, REST, and peer-to-peer channels. It provides extensive documentation and standardized interfaces to facilitate integration into diverse application architectures.
A reactive, offline-first NoSQL database engine for synchronizing application state across clients.
TypeScript, NoSQL
ClickHouse/ClickHouse
48,042 stars
ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring. The platform distinguishes itself through advanced storage and execution techniques, including vectorized query processing and a merge tree storage engine that maintains performance during massive insertions. It features adaptive subcolumn mapping for semi-structured data and supports native vector search for machine learning and generative AI applications. To facilitate efficient data movement, the engine utilizes zero-copy shared memory buffers, minimizing overhead when interacting with external analytical tools or processing diverse file formats like Parquet, JSON, and Arrow. Beyond its core storage and processing capabilities, the project provides a comprehensive suite of tools for observability, security, and data integration. It includes built-in support for natural language querying, automated workflow orchestration for AI agents, and extensive diagnostic features for query plan inspection. The platform also offers robust cloud infrastructure management, including support for private networking, compliant deployment strategies, and integrated billing consolidation.
A high-performance columnar analytical database engine designed for large-scale data aggregation.
C++, Vector Databases