18 repositorios
Methods for executing and parallelizing data queries across multiple nodes in a distributed environment.
Distinguishing note: Focuses on the execution plan and streaming of data from distributed sources.
Explore 18 awesome GitHub repositories matching data & databases · Distributed Query Processing. Refine with filters or upvote what's useful.
RethinkDB is a distributed, document-oriented database designed to store and manage JSON-formatted data across scalable clusters. It utilizes a custom log-structured storage engine with B-Tree indexing to ensure high-performance disk I/O and data persistence. The system maintains high availability through automatic sharding and replication, employing a primary-replica voting consensus mechanism to handle node failures and ensure consistent cluster operations. A defining characteristic of the platform is its reactive changefeed engine, which allows applications to subscribe to live data update
RethinkDB executes queries by transforming them into parallelized, lazy-evaluated execution plans that stream data chunks from multiple servers to the client for efficient processing.
Dgraph is a distributed graph database designed to store and query highly connected data. It organizes information as nodes and edges to represent complex relationships between entities, providing a platform for managing and analyzing deeply linked datasets. The system functions as a horizontally scalable cluster that partitions data across multiple nodes to maintain performance and availability as information volume increases. It utilizes a specialized query language built for low-latency navigation of interconnected data points, allowing for the execution of complex queries across large-sca
Decomposes complex queries into parallel sub-tasks executed across multiple nodes for efficient processing.
Vitess is a database clustering system for horizontal scaling of MySQL. It functions as a middleware layer that abstracts complex sharding and physical topology, allowing applications to interact with a distributed database environment through a unified interface. By intercepting and routing SQL queries across multiple shards, it enables large-scale data management while maintaining the appearance of a single database instance. The platform distinguishes itself through its ability to perform online schema migrations and distributed transaction coordination without requiring application downti
Directs incoming SQL requests to the correct database shards and aggregates results to hide complex topology from the application.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Executes query plans through a hierarchy of stages, tasks, and operators that transform and exchange data across the cluster.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Executes distributed analytical queries across multiple nodes to optimize performance for massive datasets.
Thanos is a distributed metrics query engine and monitoring scalability suite designed to provide a unified interface for aggregating data from multiple Prometheus servers and clusters. It functions as a high availability monitoring backend that eliminates single points of failure by deduplicating data from replicated instances. The system enables long-term retention by persisting time-series data to cloud-native object storage, allowing for unlimited historical archiving beyond the limits of local disks. It further optimizes this storage through a downsampling and retention manager that comp
Implements parallel execution of data queries across multiple distributed nodes to retrieve a unified result set.
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based
Routes and executes database operations across multiple nodes simultaneously to accelerate analytical workloads.
Nebula is a distributed graph database designed for storing and querying massive volumes of interconnected vertices and edges across a horizontally scalable cluster. It functions as a Kubernetes-native database and a distributed graph analytics engine, utilizing a Raft-based distributed store to ensure strong consistency and high availability. The system features an OpenCypher query engine for performing complex graph traversals and pattern matching. It distinguishes itself with a decoupled compute-storage architecture and a shared-nothing distributed design, allowing query processing and dat
Decomposes complex graph traversals into parallel sub-tasks executed concurrently across multiple storage nodes.
This project is a comprehensive learning resource and reference guide for software architecture and distributed systems design. It serves as a structured curriculum for engineers to study fundamental architectural patterns, scalability strategies, and distributed computing theory, specifically tailored to prepare for technical interviews and professional engineering roles. The repository distinguishes itself by providing a curated collection of industry-standard infrastructure tools and methodologies. It covers the selection and implementation of technologies for data storage, message brokeri
Provides methods for executing and parallelizing data queries across multiple nodes in a distributed environment.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Scales analytic workloads across a cluster by splitting and coordinating query fragments.
go-ibax is a blockchain protocol platform and decentralized application infrastructure used to deploy networks with custom governance and token economics. It provides a foundation for building decentralized applications through a framework that integrates identity management and on-chain data storage. The project features a multilingual virtual machine capable of executing smart contracts written in Go, Rust, and Solidity. It implements a sharded blockchain network to increase throughput and a privacy layer utilizing zero-knowledge proofs and homomorphic encryption to anonymize transaction da
Optimizes complex logic execution by storing parsed block data in a distributed database across nodes.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Transfers intermediate query results between parallel processing stages using hash or broadcast strategies to optimize execution speed.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Splits a query into sub-queries, dispatches them to relevant data nodes, and merges partial results into a single response.
Mimir es una base de datos de series temporales multi-inquilino y un almacén de métricas distribuido diseñado para telemetría escalable. Sirve como un backend compatible con Prometheus, proporcionando almacenamiento a largo plazo y un motor de consultas escalable para volúmenes masivos de datos de series temporales. El sistema está construido para la observabilidad multi-inquilino, aislando datos de telemetría y límites de recursos para equipos u organizaciones independientes dentro de un mismo clúster. Asegura alta disponibilidad y durabilidad mediante el sharding y la replicación de datos a través de un clúster distribuido, utilizando almacenamiento de objetos para la persistencia y eliminar dependencias de bases de datos externas. El proyecto cubre capacidades de amplio alcance, incluyendo la agregación global de métricas para análisis entre regiones y la ejecución de consultas distribuidas mediante paralelización y caché. También integra herramientas de observabilidad como alertas federadas, monitoreo sintético y flujos de trabajo de resolución de incidentes impulsados por IA para acelerar la resolución de problemas. Los controles administrativos incluyen cuotas de recursos por inquilino, anulaciones de recursos por usuario y shuffle-sharding para el aislamiento de cargas de trabajo.
Executes and parallelizes data queries across multiple nodes, fetching from both memory and object storage.
OpenTSDB es una base de datos de series temporales distribuida y un motor de métricas diseñado para almacenar y gestionar volúmenes masivos de métricas de sistema de alta cardinalidad. Funciona como un almacén de datos y plataforma de análisis que permite la ingesta de métricas a gran escala y el monitoreo del rendimiento de la infraestructura a través de un clúster distribuido. El sistema se distingue por una abstracción de almacenamiento distribuido que admite múltiples backends como HBase, Cassandra y Google Bigtable. Utiliza un árbol de métricas jerárquico para organizar series temporales y emplea indexación de identificadores numéricos para reducir la huella de almacenamiento y acelerar las búsquedas de métricas etiquetadas. El proyecto cubre áreas de capacidad amplias, incluyendo análisis de datos de series temporales con cálculos de percentiles distribuidos y submuestreo, así como una gestión integral de metadatos. Proporciona integración de API para la ingesta y consulta de datos, caché fuera de memoria (off-heap) para optimización del rendimiento y herramientas para la auditoría de integridad de datos y análisis de anomalías. El sistema se gestiona a través de una interfaz de línea de comandos para la administración de bases de datos y la sincronización del árbol de métricas.
Parallelizes complex query requests across multiple nodes to reduce response latency.
m3 es una base de datos de series temporales distribuida, diseñada para métricas de alta resolución y gestión de datos de alta cardinalidad. Funciona como un sistema de almacenamiento escalable y un motor de consultas multiclúster, proporcionando un agregador de métricas distribuido capaz de realizar downsampling y resumir datos antes de que se confirmen en el almacenamiento. El proyecto se distingue por un modelo de clúster coordinado que utiliza etcd para la pertenencia a nodos y la colocación de shards. Soporta múltiples protocolos de ingesta, incluyendo el protocolo de escritura remota de Prometheus, el protocolo de línea de InfluxDB y el protocolo de texto plano de Graphite Carbon, y proporciona interfaces de consulta compatibles para PromQL y Graphite. El sistema cubre amplias áreas de capacidad, incluyendo almacenamiento de series temporales en columnas, replicación de datos síncrona y distribución de consultas (fan-out) distribuida. Incorpora automatización del ciclo de vida de los datos, ajuste de consistencia basado en quórum e indexación de series basada en etiquetas para mantener la integridad de los datos y la velocidad de recuperación en espacios de nombres aislados. La orquestación del clúster y la colocación de componentes se gestionan mediante herramientas y operadores automatizados para garantizar la alta disponibilidad y una distribución equilibrada de los datos.
Fans out requests across multiple clusters and namespaces to provide a complete view of time-series data.
YDB es una base de datos SQL distribuida y motor analítico diseñado para la escalabilidad horizontal y una fuerte consistencia. Funciona como un sistema multimodelo que admite cargas de trabajo transaccionales y analíticas a través de una arquitectura distribuida que proporciona transacciones ACID serializables. El sistema se distingue por su amplia compatibilidad de protocolos, implementando el protocolo de cable de PostgreSQL para controladores SQL estándar y el protocolo de Kafka para mensajería y streaming. Además, sirve como una base de datos vectorial, admitiendo índices vectoriales y búsquedas de vecinos más cercanos aproximados para búsqueda semántica e incrustaciones. La plataforma gestiona datos utilizando un modelo de almacenamiento híbrido con formatos orientados a filas y columnas, utilizando ejecución de consultas vectorizadas para analíticas a escala de petabytes. Su superficie operativa incluye streaming de captura de datos de cambio, colas persistentes de entrega única y alta disponibilidad multizona. El despliegue y la gestión del ciclo de vida son compatibles a través de un operador de Kubernetes y aprovisionamiento de infraestructura como código.
Runs streaming queries that automatically restart on failure and use checkpoints to persist state.
Olric is a distributed data grid and in-memory key-value store that partitions and replicates data across a cluster of servers. It serves as a shared memory system for managing distributed maps, performing atomic operations, and acting as an in-memory data cache. The system provides a distributed locking mechanism for concurrency control and a pub-sub messaging system that broadcasts and routes messages over named channels across the cluster. The platform covers wide-ranging capabilities including cluster management and orchestration, data replication with configurable quorums, and automated
Executes distributed queries to scan and retrieve keys from maps across multiple cluster nodes.