18 个仓库
Methods for executing and parallelizing data queries across multiple nodes in a distributed environment.
Distinguishing note: Focuses on the execution plan and streaming of data from distributed sources.
Explore 18 awesome GitHub repositories matching data & databases · Distributed Query Processing. Refine with filters or upvote what's useful.
RethinkDB is a distributed, document-oriented database designed to store and manage JSON-formatted data across scalable clusters. It utilizes a custom log-structured storage engine with B-Tree indexing to ensure high-performance disk I/O and data persistence. The system maintains high availability through automatic sharding and replication, employing a primary-replica voting consensus mechanism to handle node failures and ensure consistent cluster operations. A defining characteristic of the platform is its reactive changefeed engine, which allows applications to subscribe to live data update
RethinkDB executes queries by transforming them into parallelized, lazy-evaluated execution plans that stream data chunks from multiple servers to the client for efficient processing.
Dgraph is a distributed graph database designed to store and query highly connected data. It organizes information as nodes and edges to represent complex relationships between entities, providing a platform for managing and analyzing deeply linked datasets. The system functions as a horizontally scalable cluster that partitions data across multiple nodes to maintain performance and availability as information volume increases. It utilizes a specialized query language built for low-latency navigation of interconnected data points, allowing for the execution of complex queries across large-sca
Decomposes complex queries into parallel sub-tasks executed across multiple nodes for efficient processing.
Vitess is a database clustering system for horizontal scaling of MySQL. It functions as a middleware layer that abstracts complex sharding and physical topology, allowing applications to interact with a distributed database environment through a unified interface. By intercepting and routing SQL queries across multiple shards, it enables large-scale data management while maintaining the appearance of a single database instance. The platform distinguishes itself through its ability to perform online schema migrations and distributed transaction coordination without requiring application downti
Directs incoming SQL requests to the correct database shards and aggregates results to hide complex topology from the application.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Executes query plans through a hierarchy of stages, tasks, and operators that transform and exchange data across the cluster.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Executes distributed analytical queries across multiple nodes to optimize performance for massive datasets.
Thanos is a distributed metrics query engine and monitoring scalability suite designed to provide a unified interface for aggregating data from multiple Prometheus servers and clusters. It functions as a high availability monitoring backend that eliminates single points of failure by deduplicating data from replicated instances. The system enables long-term retention by persisting time-series data to cloud-native object storage, allowing for unlimited historical archiving beyond the limits of local disks. It further optimizes this storage through a downsampling and retention manager that comp
Implements parallel execution of data queries across multiple distributed nodes to retrieve a unified result set.
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based
Routes and executes database operations across multiple nodes simultaneously to accelerate analytical workloads.
Nebula is a distributed graph database designed for storing and querying massive volumes of interconnected vertices and edges across a horizontally scalable cluster. It functions as a Kubernetes-native database and a distributed graph analytics engine, utilizing a Raft-based distributed store to ensure strong consistency and high availability. The system features an OpenCypher query engine for performing complex graph traversals and pattern matching. It distinguishes itself with a decoupled compute-storage architecture and a shared-nothing distributed design, allowing query processing and dat
Decomposes complex graph traversals into parallel sub-tasks executed concurrently across multiple storage nodes.
This project is a comprehensive learning resource and reference guide for software architecture and distributed systems design. It serves as a structured curriculum for engineers to study fundamental architectural patterns, scalability strategies, and distributed computing theory, specifically tailored to prepare for technical interviews and professional engineering roles. The repository distinguishes itself by providing a curated collection of industry-standard infrastructure tools and methodologies. It covers the selection and implementation of technologies for data storage, message brokeri
Provides methods for executing and parallelizing data queries across multiple nodes in a distributed environment.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Scales analytic workloads across a cluster by splitting and coordinating query fragments.
go-ibax is a blockchain protocol platform and decentralized application infrastructure used to deploy networks with custom governance and token economics. It provides a foundation for building decentralized applications through a framework that integrates identity management and on-chain data storage. The project features a multilingual virtual machine capable of executing smart contracts written in Go, Rust, and Solidity. It implements a sharded blockchain network to increase throughput and a privacy layer utilizing zero-knowledge proofs and homomorphic encryption to anonymize transaction da
Optimizes complex logic execution by storing parsed block data in a distributed database across nodes.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Transfers intermediate query results between parallel processing stages using hash or broadcast strategies to optimize execution speed.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Splits a query into sub-queries, dispatches them to relevant data nodes, and merges partial results into a single response.
Mimir 是一个多租户时间序列数据库和分布式指标存储,专为可扩展的遥测而设计。它作为 Prometheus 兼容的后端,为海量时间序列数据提供长期存储和可扩展的查询引擎。 该系统专为多租户可观测性而构建,在单个集群内为独立团队或组织隔离遥测数据和资源限制。它通过在分布式集群中分片和复制数据来确保高可用性和持久性,并利用对象存储进行持久化,从而消除了对外部数据库的依赖。 该项目涵盖了广泛的功能,包括用于跨区域分析的全局指标聚合,以及利用并行化和缓存的分布式查询执行。它还集成了可观测性工具,如联邦告警、合成监控和 AI 驱动的事件解决工作流,以加速故障排查。 管理控制功能包括租户资源配额、用户级资源覆盖以及用于工作负载隔离的洗牌分片(Shuffle-sharding)。
Executes and parallelizes data queries across multiple nodes, fetching from both memory and object storage.
OpenTSDB 是一个分布式时间序列数据库和指标引擎,专为存储和管理海量高基数系统指标而设计。它作为一个数据存储和分析平台,支持跨分布式集群的大规模指标摄取和基础设施性能监控。 该系统以其支持 HBase、Cassandra 和 Google Bigtable 等多个后端的分布式存储抽象而著称。它利用分层指标树来组织时间序列,并采用数字标识符索引来减少存储占用并加速标记指标的查找。 该项目涵盖了广泛的能力领域,包括具有分布式百分位数计算和降采样功能的时间序列数据分析,以及全面的元数据管理。它提供用于数据摄取和查询的 API 集成、用于性能优化的堆外缓存,以及用于数据完整性审计和异常分析的工具。 该系统通过用于数据库管理和指标树同步的命令行界面进行管理。
Parallelizes complex query requests across multiple nodes to reduce response latency.
m3 是一个分布式时间序列数据库,专为高分辨率指标和高基数数据管理而设计。它作为一个可扩展的存储系统和多集群查询引擎,提供了一个分布式指标聚合器,能够在数据提交到存储之前进行降采样和汇总。 该项目以其使用 etcd 进行节点成员管理和分片放置的协调集群模型而脱颖而出。它支持多种摄取协议,包括 Prometheus 远程写入协议、InfluxDB 行协议和 Graphite Carbon 纯文本协议,并提供与 PromQL 和 Graphite 兼容的查询接口。 该系统涵盖了广泛的功能领域,包括列式时间序列存储、同步数据复制和分布式查询扇出。它集成了数据生命周期自动化、基于法定人数 (Quorum) 的一致性调整,以及基于标签的序列索引,以在隔离的命名空间中保持数据完整性和检索速度。 集群编排和组件放置通过自动化工具和 Operator 进行管理,以确保高可用性和均衡的数据分布。
Fans out requests across multiple clusters and namespaces to provide a complete view of time-series data.
YDB 是一个分布式 SQL 数据库和分析引擎,专为水平扩展和强一致性而设计。它作为一个多模型系统,通过提供可序列化 ACID 事务的分布式架构支持事务和分析工作负载。 该系统以其广泛的协议兼容性而著称,实现了用于标准 SQL 驱动程序的 PostgreSQL 有线协议和用于消息传递与流处理的 Kafka 协议。它进一步作为向量数据库,支持用于语义搜索和嵌入的向量索引以及近似最近邻搜索。 该平台使用具有行式和列式格式的混合存储模型管理数据,利用向量化查询执行进行 PB 级分析。其操作范围包括变更数据捕获流、精确一次(exactly-once)持久队列和多区域高可用性。 部署和生命周期管理通过 Kubernetes Operator 和基础设施即代码(IaC)配置提供支持。
Runs streaming queries that automatically restart on failure and use checkpoints to persist state.
Olric is a distributed data grid and in-memory key-value store that partitions and replicates data across a cluster of servers. It serves as a shared memory system for managing distributed maps, performing atomic operations, and acting as an in-memory data cache. The system provides a distributed locking mechanism for concurrency control and a pub-sub messaging system that broadcasts and routes messages over named channels across the cluster. The platform covers wide-ranging capabilities including cluster management and orchestration, data replication with configurable quorums, and automated
Executes distributed queries to scan and retrieve keys from maps across multiple cluster nodes.