27 Repos
Strategies for partitioning and distributing data across multiple nodes to improve scalability and performance.
Distinguishing note: Specifically addresses range-based partitioning of data based on primary keys.
27 GitHub repositories matching Data & Databases · Data Sharding.
This project is a comprehensive educational resource and study guide focused on distributed systems architecture and backend infrastructure design. It offers a structured curriculum for mastering the principles of scalability, reliability, and performance required to design complex software systems. The repository distinguishes itself through a methodical approach to technical interview preparation that integrates design patterns, architectural trade-offs, and spaced-repetition tools to help users retain complex concepts. It emphasizes constraint-driven analysis and teaches users how to weigh competing requirements such as latency, consistency, and availability when designing architectures. The content covers a broad range of system design skills, including strategies for database scaling, traffic management, and infrastructure optimization. It details techniques for horizontal scaling, multi-layer caching, asynchronous communication, and service discovery, while also providing frameworks for resource estimation and capacity planning. The documentation is organized as a study guide and offers a systematic path through the fundamentals of backend engineering and large-scale system design.
Covers data sharding strategies for distributing large datasets across physical servers to overcome capacity limits.
This project is a comprehensive Java backend engineering guide and technical reference focused on high-concurrency design, distributed systems, and microservices architecture. It provides detailed strategies for decomposing monolithic applications, managing service discovery, and implementing the architectural patterns required for scalable backend environments. The repository distinguishes itself through an extensive collection of big data algorithmic references and database scaling strategies. It covers memory-efficient techniques for analyzing massive datasets, such as Top-K element extrac
Distributes data across multiple master nodes using hash slots to ensure balanced storage and scalability.
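As a rough illustration of hash-slot distribution, the sketch below assumes the Redis Cluster convention (16384 slots, CRC16 of the key, optional `{...}` hash tags). The listing itself does not name the system, so treat that convention and the helper names `key_slot` and `node_for_slot` as assumptions, not as the repository's own code.

```python
import binascii

NUM_SLOTS = 16384  # slot count used by Redis Cluster (assumed convention)

def key_slot(key: str) -> int:
    """Map a key to a hash slot: CRC16 (XMODEM variant) modulo NUM_SLOTS."""
    data = key.encode()
    # Hash-tag rule: if the key has a non-empty {...} section, hash only that part,
    # so related keys can be forced into the same slot.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx implements CRC-CCITT with polynomial 0x1021; initial value 0
    return binascii.crc_hqx(data, 0) % NUM_SLOTS

def node_for_slot(slot: int, nodes: list[str]) -> str:
    """Hypothetical even split of the slot space into contiguous ranges, one per master."""
    return nodes[slot * len(nodes) // NUM_SLOTS]

if __name__ == "__main__":
    masters = ["master-a", "master-b", "master-c"]  # illustrative node names
    for k in ["user:1", "{user1000}.following", "{user1000}.followers"]:
        s = key_slot(k)
        print(k, s, node_for_slot(s, masters))
```

Because keys map to slots and slots map to nodes, rebalancing only requires moving slots between masters; the key-to-slot function never changes.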
RethinkDB is a distributed, document-oriented database designed to store and manage JSON-formatted data across scalable clusters. It utilizes a custom log-structured storage engine with B-Tree indexing to ensure high-performance disk I/O and data persistence. The system maintains high availability through automatic sharding and replication, employing a primary-replica voting consensus mechanism to handle node failures and ensure consistent cluster operations. A defining characteristic of the platform is its reactive changefeed engine, which allows applications to subscribe to live data update
RethinkDB partitions data across cluster nodes by distributing document ranges based on primary keys to ensure balanced storage and parallelized query execution.
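Range-based partitioning on a primary key can be sketched as a sorted list of split points plus a binary search. This is a minimal illustration, not RethinkDB's implementation; the split points, shard names, and function names below are hypothetical.

```python
from bisect import bisect_right

# Hypothetical split points: shard i holds keys k with SPLITS[i-1] <= k < SPLITS[i].
SPLITS = ["g", "n", "t"]
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(primary_key: str) -> str:
    """Return the shard whose key range contains primary_key."""
    return SHARDS[bisect_right(SPLITS, primary_key)]

def shards_for_range(low: str, high: str) -> list[str]:
    """Return every shard that may hold keys in the inclusive range [low, high]."""
    first = bisect_right(SPLITS, low)
    last = bisect_right(SPLITS, high)
    return SHARDS[first:last + 1]

if __name__ == "__main__":
    print(shard_for("mango"))
    print(shards_for_range("h", "p"))
```

Keeping adjacent keys on the same shard is what lets a range query touch only a few shards; the trade-off is that skewed key distributions can overload one range until the split points are adjusted.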
This project is a reactive, offline-first NoSQL database engine designed for JavaScript applications. It provides a robust framework for managing application state by synchronizing data across browsers, mobile devices, and server-side runtimes. By treating local storage as the primary source of truth, it enables applications to remain functional without network connectivity, automatically reconciling changes with remote backends once a connection is restored. The database distinguishes itself through a modular architecture that supports cross-environment synchronization and high-performance d
Distributes documents across multiple database instances to improve query performance and handle large datasets.
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Covers partitioning and sharding strategies for scaling data systems horizontally.
This project is a feature-rich Go client library designed for interacting with Redis. It serves as a comprehensive interface for managing remote data stores, enabling developers to execute standard database commands, handle complex data structures, and perform asynchronous operations within Go applications. The library distinguishes itself through its support for advanced Redis capabilities, including connection pooling, pipelining, and transactional integrity. It provides specialized primitives for managing distributed clusters, including automated topology updates and request routing to sha
Automatically partitions data across multiple server nodes to scale storage and throughput.
Serve is a multimodal AI orchestrator and inference server designed for deploying and scaling machine learning models as cloud-native services. It functions as a containerized workflow engine and distributed service mesh that routes multimodal data through connected execution units. The framework provides specialized capabilities for large language models, including a token streaming gateway that delivers generated text incrementally to reduce perceived latency. It distinguishes itself by enabling the chaining of executors into complex data processing pipelines and the orchestration of these
Implements data sharding to partition model weights across multiple executors to manage memory and improve performance.
Jina is a cloud-native framework for building and deploying multimodal AI applications that process text, images, and audio across distributed microservices. It functions as an inference orchestrator and a distributed model gateway, providing a containerized stack to organize AI executors into operational pipelines. The system manages large language model workloads through token-streamed response delivery and dynamic batching to increase hardware throughput. It utilizes a protocol-agnostic communication layer to route data across different machine learning frameworks. The framework covers hi
Implements data sharding to distribute multimodal datasets across multiple service instances for parallel processing.
Dgraph is a distributed graph database designed to store and query highly connected data. It organizes information as nodes and edges to represent complex relationships between entities, providing a platform for managing and analyzing deeply linked datasets. The system functions as a horizontally scalable cluster that partitions data across multiple nodes to maintain performance and availability as information volume increases. It utilizes a specialized query language built for low-latency navigation of interconnected data points, allowing for the execution of complex queries across large-sca
Partitions graph data across cluster nodes using predicate-based sharding to enable horizontal scaling.
Vitess is a distributed MySQL orchestrator and clustering system designed for horizontal database scaling. It functions as sharding middleware that distributes data and load across multiple MySQL instances to handle growth beyond the capacity of a single machine. The system provides a proxy layer that abstracts data distribution, allowing applications to query a cluster as a single logical database without knowing the physical location of the data. This is achieved through a routing mechanism that intercepts queries and directs them to the appropriate shards based on keyspace mappings. The p
Distributes data across multiple MySQL instances by mapping ranges of keyspace IDs, derived from each row's sharding key, to individual shards.
This project is a comprehensive technical interview preparation resource and computer science interview guide. It serves as an educational reference for developers to study core software engineering fundamentals and common coding patterns required for employment screenings. The repository provides detailed guides and references covering data structures and algorithms, networking and security, operating systems, and web development. It specifically focuses on the implementation and complexity analysis of sorting, searching, and graph algorithms. The material encompasses a wide breadth of comp
Describes strategies for partitioning large datasets into smaller shards to distribute load and improve scalability.
Twemproxy is a lightweight proxy that routes and distributes requests across multiple Redis and Memcached backend servers. It functions as a protocol translation gateway and distributed cache shard manager, partitioning data across clusters to balance load and storage capacity. The system acts as a high-availability cache orchestrator, employing health monitoring and automatic server ejection to maintain continuous access to cached data. It integrates with sentinels for dynamic master and replica discovery and utilizes consistent hashing and tag-based key grouping to manage data distribution
Partitions and distributes cache data across multiple nodes using hashing modes to improve scalability.
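Consistent hashing, one of the distribution modes mentioned above, can be sketched as a sorted ring of hashed points with several virtual points per server. This is a generic illustration under assumed choices (MD5, 100 virtual points per server, the `HashRing` class name); it is not Twemproxy's actual ketama implementation.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Place `vnodes` points per server on the ring, sorted by hash value.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of MD5 as an unsigned 64-bit integer (assumed hash choice)
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Pick the first ring point clockwise from the key's hash, wrapping around.
        idx = bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]

if __name__ == "__main__":
    full = HashRing(["cache-a", "cache-b", "cache-c"])
    reduced = HashRing(["cache-a", "cache-b"])  # cache-c ejected
    keys = [f"key-{i}" for i in range(1000)]
    moved = sum(full.node_for(k) != reduced.node_for(k) for k in keys)
    print(f"{moved} of {len(keys)} keys moved after ejecting cache-c")
```

The property that matters for a cache proxy is that ejecting one server remaps only the keys that lived on it, so the remaining servers keep their warm data.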
Nebula is a distributed graph database designed for storing and querying massive volumes of interconnected vertices and edges across a horizontally scalable cluster. It functions as a Kubernetes-native database and a distributed graph analytics engine, utilizing a Raft-based distributed store to ensure strong consistency and high availability. The system features an OpenCypher query engine for performing complex graph traversals and pattern matching. It distinguishes itself with a decoupled compute-storage architecture and a shared-nothing distributed design, allowing query processing and dat
Distributes graph data across nodes by hashing vertex IDs to balance load and enable scalability.
Garnet is a multi-threaded in-memory database and distributed key-value store. It functions as a high-performance remote cache store that implements the RESP wire protocol to maintain compatibility with existing Redis clients and libraries. The project is distinguished by a shared-memory architecture that enables parallel request processing across multiple cores for sub-millisecond latency. It features a tiered storage system that automatically offloads colder data from system memory to SSD or cloud storage layers, and includes a specialized vector search database for high-dimensional similar
Organizes data across multiple nodes using sharding and replication to ensure high availability and scalability.
SQLAlchemy is a comprehensive Python SQL toolkit and object-relational mapper that provides a full suite of tools for interacting with relational databases. It serves as a foundational layer for database connectivity, offering both a high-level object-oriented interface for data persistence and a programmatic SQL expression language for constructing complex, dialect-agnostic queries. The project distinguishes itself through its sophisticated unit of work persistence, which coordinates atomic transactions and tracks object state changes to minimize redundant database operations. It provides a
Routes database queries and operations to specific physical nodes based on defined partitioning logic to scale storage capacity.
This project provides educational materials and courseware focused on the theoretical and practical foundations of distributed systems design. It serves as a comprehensive curriculum covering the disciplines of consensus, data consistency, reliability engineering, and scalability. The instructional content focuses on achieving cluster agreement through consensus algorithms and managing system-wide state via coordination frameworks. It includes a dedicated guide to data theory, exploring replication strategies, consistency models, and data convergence. The courseware covers a broad capability
Covers strategies for distributing large datasets into independent shards to ensure linear scalability.
Mycat-Server is a MySQL database middleware system that functions as a sharding proxy, distributed database coordinator, and high availability manager. It acts as a proxy layer that routes SQL traffic between applications and multiple backend MySQL database instances to enable horizontal scaling. The system coordinates distributed transactions, generates global unique sequences to prevent primary key collisions, and executes distributed join queries across multiple database shards. It includes a load balancer that performs read-write splitting by directing traffic between primary and slave no
Distributes relational data across physical database nodes using configurable rules to enable horizontal scaling.
kube-prometheus is a monitoring stack deployment and orchestration framework. It uses an operator pattern to automate the installation and lifecycle management of Prometheus and Alertmanager via custom resource definitions. The project focuses on scaling data collection through hash-based target sharding and topology-aware distribution to reduce cross-zone traffic. It implements a sidecar-based configuration reloading mechanism and utilizes consistent hashing to distribute scrape targets across multiple instances. The system covers broad observability capabilities including metric data colle
Retains pods from scaled-down shards so historical metric data remains queryable until the retention period expires.
Kingshard is a MySQL database proxy and sharding middleware that routes SQL traffic between clients and multiple database nodes. It functions as a load balancer, read-write splitter, and SQL query firewall to manage how data is accessed and distributed across a database infrastructure. The system implements data sharding using hash, range, or date strategies to split tables across multiple nodes. It enables read-write splitting by directing data modification requests to a master node while distributing read-only queries across a pool of slave replicas. The proxy provides traffic management t
Splits table data across multiple nodes using hash, range, or date-based sharding keys to improve system scalability.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Partitions data into shards that distribute storage and computation across a cluster for linear scalability.