33 dépôts
Techniques for dividing large datasets into smaller, manageable segments for parallel processing.
Distinguishing note: Focuses on the storage organization strategy rather than the indexing algorithm.
Explore 33 awesome GitHub repositories matching data & databases · Data Partitioning. Refine with filters or upvote what's useful.
Ce projet est une ressource éducative et un guide d'étude complet axé sur l'architecture des systèmes distribués et la conception d'infrastructures backend. Il fournit un programme structuré pour maîtriser les principes de scalabilité, de fiabilité et de performance requis pour concevoir des systèmes logiciels complexes. Le dépôt se distingue en offrant une approche méthodique de la préparation aux entretiens techniques, intégrant des modèles de conception, des compromis architecturaux et des outils de répétition espacée pour aider les utilisateurs à retenir des concepts complexes. Il met l'accent sur l'analyse axée sur les contraintes, enseignant aux utilisateurs comment évaluer des exigences concurrentes comme la latence, la cohérence et la disponibilité lors de l'élaboration de conceptions architecturales. Le contenu couvre un large spectre de capacités de conception de systèmes, notamment des stratégies pour la mise à l'échelle des bases de données, la gestion du trafic et l'optimisation de l'infrastructure. Il détaille des techniques pour la mise à l'échelle horizontale, la mise en cache multicouche, la communication asynchrone et la découverte de services, tout en fournissant des cadres pour effectuer des estimations de ressources et la planification de la capacité. La documentation est organisée comme un guide d'étude, offrant un chemin systématique à travers les fondamentaux de l'ingénierie backend et de la conception de systèmes à grande échelle.
Explains techniques for partitioning large datasets across multiple database nodes to improve performance.
Milvus is a specialized vector database engine designed for the indexing, management, and high-speed similarity retrieval of high-dimensional vector embeddings. It functions as a similarity search engine capable of identifying nearest neighbors within large-scale vector spaces, supporting the storage and retrieval of billions of data points while maintaining consistent performance. The system utilizes a distributed architecture that decouples storage, query, and coordination into independent services, allowing for horizontal scaling across clusters. It employs a global indexing mechanism that
Partitions data into immutable segments to optimize memory usage and parallel search performance.
This project is a comprehensive educational resource focused on the principles, patterns, and trade-offs required to design scalable, reliable, and high-performance distributed systems. It provides a structured curriculum that covers the fundamental architectural strategies necessary for building modern software infrastructure, ranging from high-level system decomposition to low-level networking and data management. The repository distinguishes itself by offering deep dives into complex architectural patterns, such as microservices-based decomposition, event-driven communication, and command-
Details the architectural pattern of horizontal partitioning to scale database performance.
Qdrant is a high-performance vector similarity database designed to store, index, and search high-dimensional vectors alongside structured metadata. It functions as a distributed search engine that manages large-scale data clusters, providing low-latency retrieval and complex filtering capabilities. The system is built to serve as a specialized middleware layer, connecting machine learning pipelines and AI agents to persistent storage for intelligent information retrieval and recommendation tasks. The platform distinguishes itself through advanced retrieval techniques, including support for h
Isolates data by user or group using metadata to ensure tenant privacy.
RocksDB is a high-performance, embeddable persistent key-value library and storage engine based on Log-Structured Merge-trees. It is designed to provide durable storage for large-scale datasets, integrating directly into applications to manage data on flash and RAM-based hardware. The engine is distinguished by its focus on minimizing read and write amplification through multi-threaded compaction and custom memory allocators. It features specialized optimizations for flash storage, including support for zoned block devices, and provides the ability to extend store behavior via external plugin
Divides the keyspace into independent namespaces with separate memtables and compaction settings for logical data organization.
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Distributes large datasets into smaller segments across nodes to enable horizontal scaling.
This project is a comprehensive educational resource and technical documentation suite for learning and developing deep learning models. It serves as an open-source textbook, implementation manual, and framework tutorial designed to guide users through the mathematical foundations and practical application of neural networks. The resource provides detailed instructional content on building various model architectures, including convolutional and recurrent neural networks. It includes a dedicated distributed training guide and a learning path that covers the fundamentals of tensors, automatic
Explains techniques for partitioning datasets into batches for parallel processing across workers.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Configures hash or range partitioning schemes during table creation to optimize data distribution and query performance.
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Divides large datasets into logical slices to enable incremental processing, targeted re-runs, and efficient management of high-volume workflows.
Druid is a distributed columnar store and online analytical processing database designed for real-time analytics. It functions as a SQL analytics platform and a streaming data ingestion engine, allowing for the analysis of large datasets with low latency to support interactive dashboards and high-concurrency operational workloads. The system integrates a streaming data ingestion engine that loads information via batch or streaming processes to enable immediate analysis of arriving data. It provides high-performance analytical processing to execute slice-and-dice queries on massive data volume
Divides datasets into time-based chunks to enable parallel processing and querying across the cluster.
Dask est un framework de calcul parallèle et un planificateur de tâches distribué conçu pour mettre à l'échelle les flux de travail de science des données Python, des machines uniques aux grands clusters. Il fonctionne comme un gestionnaire de ressources de cluster qui orchestre la logique computationnelle en représentant les tâches et leurs dépendances sous forme de graphes acycliques dirigés. Cette architecture permet au système d'automatiser la distribution des charges de travail sur le matériel disponible tout en gérant des exigences d'exécution complexes. Le projet se distingue par un moteur d'évaluation paresseuse qui diffère les opérations sur les données jusqu'à ce qu'elles soient explicitement demandées, permettant une optimisation globale du graphe et une allocation efficace des ressources. Il intègre le déversement de données conscient de la mémoire pour éviter les plantages du système lors du traitement de jeux de données dépassant la mémoire disponible, et il utilise la fusion de graphes de tâches pour combiner des séquences d'opérations en étapes d'exécution uniques, minimisant la surcharge de planification et la communication entre nœuds. La plateforme fournit une surface de capacités complète pour l'analyse de données à grande échelle, incluant le support pour l'apprentissage automatique distribué, l'intégration du calcul haute performance et le traitement de données parallèle. Elle offre des outils étendus pour la gestion du cycle de vie des clusters, le profilage des performances et la surveillance en temps réel de l'exécution des tâches. Les utilisateurs peuvent déployer ces environnements sur diverses infrastructures, incluant le matériel local, les fournisseurs cloud, les systèmes conteneurisés et les clusters de calcul haute performance.
Divides large datasets into smaller, manageable blocks to optimize memory usage and parallel processing performance.
Electric is a Postgres data synchronization engine and replication proxy designed to enable local-first software. It replicates data from Postgres databases to client-side stores in real time using logical replication, allowing applications to maintain a local embedded database for offline access and low-latency updates. The system distinguishes itself by using shapes to filter and authorize specific subsets of database rows and columns before streaming them to clients or edge workers. It further supports multi-user collaboration by integrating a conflict-free replicated data type framework t
Enables subscribing to data from root partitioned tables or specific partitions to control sync volume.
Cassandra is a distributed NoSQL database and wide-column store designed for high availability and linear scalability. It functions as a fault-tolerant distributed system that utilizes an LSM-tree storage engine to optimize write throughput and manage massive datasets. The system is a CQL-compliant database, using a structured query language to manage and retrieve tabular data stored across multiple nodes. It organizes information into rows and columns based on a flexible schema and primary keys. The project provides capabilities for horizontal database scaling, distributed data partitioning
Automatically spreads tabular data across multiple machines to ensure linear scalability and avoid single points of failure.
pprof is a tool for visualizing and analyzing performance profiling data. It converts sampled call stacks into a directed graph rendered as an SVG, enabling visual identification of execution hotspots. The tool also parses Linux perf.data files, converting them into an internal profile representation for further analysis. Beyond visualization, pprof provides a command-line REPL for interactive exploration of profiling data, allowing users to filter, refine, and query performance information on the fly. It generates sorted text reports that highlight the most resource-intensive call stacks, an
Generates sorted text reports pinpointing the most resource-intensive call stacks.
This project is an open-source software development kit and framework for implementing the Matter smart home standard. It provides a universal IPv6-based application layer and a cluster-based data model to ensure interoperability between diverse smart home devices and controllers. The system is distinguished by its multi-transport network abstraction, which maps Bluetooth LE, Thread, and Wi-Fi implementations to a common layer. It includes specialized tooling for secure device commissioning via QR codes and NFC, as well as a comprehensive over-the-air firmware update system for distributing s
Defines memory areas in persistent storage to hold factory data and apply write protection.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Allows executing queries on specific cluster partitions to minimize network overhead and improve performance for localized data access.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Restricts data update operations to specific partitions by matching partition column values against segment metadata.
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Automatically creates partitions during data insertion, eliminating manual partition management.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Queries region statistics and partition metadata to find partitions receiving the most write traffic.
Algodeck is an open-source collection of flash cards designed for reviewing algorithms, data structures, and system design concepts, specifically curated for technical interview preparation. The project organizes knowledge into atomic question-and-answer pairs and incorporates spaced repetition scheduling to optimize long-term memory retention. The flash card catalog covers a broad range of computer science topics, including classic sorting algorithms like quicksort and mergesort, data structure operations for arrays, trees, heaps, tries, and graphs, as well as bit manipulation techniques for
Explains the many-small-partitions rebalancing strategy for distributed data systems.