18 dépôts
Methods for executing and parallelizing data queries across multiple nodes in a distributed environment.
Distinguishing note: Focuses on the execution plan and streaming of data from distributed sources.
Explore 18 awesome GitHub repositories matching data & databases · Distributed Query Processing. Refine with filters or upvote what's useful.
RethinkDB is a distributed, document-oriented database designed to store and manage JSON-formatted data across scalable clusters. It utilizes a custom log-structured storage engine with B-Tree indexing to ensure high-performance disk I/O and data persistence. The system maintains high availability through automatic sharding and replication, employing a primary-replica voting consensus mechanism to handle node failures and ensure consistent cluster operations. A defining characteristic of the platform is its reactive changefeed engine, which allows applications to subscribe to live data update
RethinkDB executes queries by transforming them into parallelized, lazy-evaluated execution plans that stream data chunks from multiple servers to the client for efficient processing.
Dgraph is a distributed graph database designed to store and query highly connected data. It organizes information as nodes and edges to represent complex relationships between entities, providing a platform for managing and analyzing deeply linked datasets. The system functions as a horizontally scalable cluster that partitions data across multiple nodes to maintain performance and availability as information volume increases. It utilizes a specialized query language built for low-latency navigation of interconnected data points, allowing for the execution of complex queries across large-sca
Decomposes complex queries into parallel sub-tasks executed across multiple nodes for efficient processing.
Vitess is a database clustering system for horizontal scaling of MySQL. It functions as a middleware layer that abstracts complex sharding and physical topology, allowing applications to interact with a distributed database environment through a unified interface. By intercepting and routing SQL queries across multiple shards, it enables large-scale data management while maintaining the appearance of a single database instance. The platform distinguishes itself through its ability to perform online schema migrations and distributed transaction coordination without requiring application downti
Directs incoming SQL requests to the correct database shards and aggregates results to hide complex topology from the application.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Executes query plans through a hierarchy of stages, tasks, and operators that transform and exchange data across the cluster.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Executes distributed analytical queries across multiple nodes to optimize performance for massive datasets.
Thanos is a distributed metrics query engine and monitoring scalability suite designed to provide a unified interface for aggregating data from multiple Prometheus servers and clusters. It functions as a high availability monitoring backend that eliminates single points of failure by deduplicating data from replicated instances. The system enables long-term retention by persisting time-series data to cloud-native object storage, allowing for unlimited historical archiving beyond the limits of local disks. It further optimizes this storage through a downsampling and retention manager that comp
Implements parallel execution of data queries across multiple distributed nodes to retrieve a unified result set.
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based
Routes and executes database operations across multiple nodes simultaneously to accelerate analytical workloads.
Nebula is a distributed graph database designed for storing and querying massive volumes of interconnected vertices and edges across a horizontally scalable cluster. It functions as a Kubernetes-native database and a distributed graph analytics engine, utilizing a Raft-based distributed store to ensure strong consistency and high availability. The system features an OpenCypher query engine for performing complex graph traversals and pattern matching. It distinguishes itself with a decoupled compute-storage architecture and a shared-nothing distributed design, allowing query processing and dat
Decomposes complex graph traversals into parallel sub-tasks executed concurrently across multiple storage nodes.
This project is a comprehensive learning resource and reference guide for software architecture and distributed systems design. It serves as a structured curriculum for engineers to study fundamental architectural patterns, scalability strategies, and distributed computing theory, specifically tailored to prepare for technical interviews and professional engineering roles. The repository distinguishes itself by providing a curated collection of industry-standard infrastructure tools and methodologies. It covers the selection and implementation of technologies for data storage, message brokeri
Provides methods for executing and parallelizing data queries across multiple nodes in a distributed environment.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Scales analytic workloads across a cluster by splitting and coordinating query fragments.
go-ibax is a blockchain protocol platform and decentralized application infrastructure used to deploy networks with custom governance and token economics. It provides a foundation for building decentralized applications through a framework that integrates identity management and on-chain data storage. The project features a multilingual virtual machine capable of executing smart contracts written in Go, Rust, and Solidity. It implements a sharded blockchain network to increase throughput and a privacy layer utilizing zero-knowledge proofs and homomorphic encryption to anonymize transaction da
Optimizes complex logic execution by storing parsed block data in a distributed database across nodes.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Transfers intermediate query results between parallel processing stages using hash or broadcast strategies to optimize execution speed.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Splits a query into sub-queries, dispatches them to relevant data nodes, and merges partial results into a single response.
Mimir est une base de données de séries temporelles multi-tenant et un magasin de métriques distribué conçu pour la télémétrie évolutive. Il sert de backend compatible avec Prometheus, fournissant un stockage à long terme et un moteur de requête évolutif pour des volumes massifs de données de séries temporelles. Le système est construit pour l'observabilité multi-tenant, isolant les données de télémétrie et les limites de ressources pour des équipes ou organisations indépendantes au sein d'un cluster unique. Il assure une haute disponibilité et durabilité en fragmentant et répliquant les données sur un cluster distribué, utilisant le stockage objet pour la persistance afin d'éliminer les dépendances aux bases de données externes. Le projet couvre de vastes capacités, incluant l'agrégation globale de métriques pour l'analyse inter-régions et l'exécution de requêtes distribuées via la parallélisation et la mise en cache. Il intègre également des outils d'observabilité tels que l'alerte fédérée, la surveillance synthétique et des flux de travail de résolution d'incidents pilotés par l'IA pour accélérer le dépannage. Les contrôles administratifs incluent des quotas de ressources par tenant, des surcharges de ressources par utilisateur et le shuffle-sharding pour l'isolation des charges de travail.
Executes and parallelizes data queries across multiple nodes, fetching from both memory and object storage.
OpenTSDB est une base de données de séries temporelles distribuée et un moteur de métriques conçu pour stocker et gérer des volumes massifs de métriques système à haute cardinalité. Il fonctionne comme un magasin de données et une plateforme d'analyse qui permet l'ingestion de métriques à grande échelle et la surveillance de la performance de l'infrastructure à travers un cluster distribué. Le système se distingue par une abstraction de stockage distribué qui supporte de multiples backends tels que HBase, Cassandra et Google Bigtable. Il utilise un arbre de métriques hiérarchique pour organiser les séries temporelles et emploie l'indexation par identifiant numérique pour réduire l'empreinte de stockage et accélérer les recherches pour les métriques taguées. Le projet couvre de larges domaines de capacités incluant l'analyse de données de séries temporelles avec des calculs de centiles distribués et le downsampling, ainsi qu'une gestion complète des métadonnées. Il fournit une intégration API pour l'ingestion et l'interrogation de données, le cache off-heap pour l'optimisation des performances, et des outils pour l'audit d'intégrité des données et l'analyse d'anomalies. Le système est géré via une interface en ligne de commande pour l'administration de la base de données et la synchronisation de l'arbre de métriques.
Parallelizes complex query requests across multiple nodes to reduce response latency.
m3 est une base de données de séries temporelles distribuée conçue pour les métriques haute résolution et la gestion de données à haute cardinalité. Elle fonctionne comme un système de stockage évolutif et un moteur de requête multi-cluster, fournissant un agrégateur de métriques distribué capable de sous-échantillonner et de résumer les données avant qu'elles ne soient validées dans le stockage. Le projet se distingue par un modèle de cluster coordonné utilisant etcd pour l'appartenance aux nœuds et le placement des shards. Il prend en charge plusieurs protocoles d'ingestion, notamment le protocole d'écriture à distance Prometheus, le protocole InfluxDB line et le protocole Graphite Carbon en texte brut, et fournit des interfaces de requête compatibles pour PromQL et Graphite. Le système couvre de larges domaines de capacités, notamment le stockage de séries temporelles en colonnes, la réplication synchrone des données et le fan-out de requêtes distribuées. Il intègre l'automatisation du cycle de vie des données, le réglage de la cohérence basé sur le quorum et l'indexation des séries basée sur les tags pour maintenir l'intégrité des données et la vitesse de récupération à travers des espaces de noms isolés. L'orchestration du cluster et le placement des composants sont gérés via des outils automatisés et des opérateurs pour assurer une haute disponibilité et une distribution équilibrée des données.
Fans out requests across multiple clusters and namespaces to provide a complete view of time-series data.
YDB est une base de données SQL distribuée et un moteur analytique conçu pour la scalabilité horizontale et une forte cohérence. Il fonctionne comme un système multi-modèle qui prend en charge les charges de travail transactionnelles et analytiques via une architecture distribuée fournissant des transactions ACID sérialisables. Le système se distingue par sa large compatibilité de protocole, implémentant le protocole wire PostgreSQL pour les pilotes SQL standard et le protocole Kafka pour la messagerie et le streaming. Il sert en outre de base de données vectorielle, prenant en charge les index vectoriels et les recherches de voisins les plus proches approximatifs pour la recherche sémantique et les embeddings. La plateforme gère les données en utilisant un modèle de stockage hybride avec des formats orientés lignes et orientés colonnes, utilisant l'exécution de requêtes vectorisées pour des analyses à l'échelle du pétaoctet. Sa surface opérationnelle inclut le streaming de capture de données modifiées (CDC), des files d'attente persistantes avec garantie d'exécution unique (exactly-once) et une haute disponibilité multi-zone. Le déploiement et la gestion du cycle de vie sont pris en charge via un opérateur Kubernetes et le provisionnement d'infrastructure as code.
Runs streaming queries that automatically restart on failure and use checkpoints to persist state.
Olric is a distributed data grid and in-memory key-value store that partitions and replicates data across a cluster of servers. It serves as a shared memory system for managing distributed maps, performing atomic operations, and acting as an in-memory data cache. The system provides a distributed locking mechanism for concurrency control and a pub-sub messaging system that broadcasts and routes messages over named channels across the cluster. The platform covers wide-ranging capabilities including cluster management and orchestration, data replication with configurable quorums, and automated
Executes distributed queries to scan and retrieve keys from maps across multiple cluster nodes.