5 repositories
Systems that manage task lifecycles, scheduling, and data streaming across cluster nodes.
Distinguishing note: Focuses on query execution orchestration.
Explore 5 GitHub repositories matching Data & Databases · Distributed Execution Coordinators.
Cockroach is a distributed SQL database designed to scale horizontally across multiple nodes while maintaining strict ACID compliance and global data consistency. It functions as a relational database engine that automatically partitions data into ranges, rebalancing them across a cluster to accommodate growing storage and throughput requirements. By utilizing a distributed consensus protocol, the system ensures that all nodes agree on the order of operations, providing fault tolerance and continuous availability even in the event of hardware failures. The system distinguishes itself through
Manages complex database tasks by scheduling work and streaming data across nodes for parallel execution.
This project is a collection of educational resources and reference implementations for the Apache Flink stream processing framework. It provides a learning resource focused on mastering distributed stream processing through implementation guides, performance tuning tutorials, and practical examples. The repository features detailed walkthroughs for building real-time data pipelines using the DataStream and Table APIs. It includes specific integration examples for connecting Apache Flink with Kafka brokers and Elasticsearch indices, as well as reference implementations for real-time deduplica
Manages task scheduling and failure recovery across a distributed cluster of job and task managers.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Scales analytic workloads across a cluster by splitting and coordinating query fragments on multiple nodes.
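As a rough illustration of what "splitting and coordinating query fragments" can mean, the sketch below computes an average by giving each simulated node a slice of the rows, collecting a partial (sum, count) from each, and merging them on a coordinator. It is a toy model in plain Python, not DataFusion's API; the function names `split`, `run_fragment`, and `merge` are hypothetical.

```python
from typing import List, Tuple

def split(rows: List[float], n: int) -> List[List[float]]:
    """Deal rows round-robin into n partitions (one per simulated node)."""
    return [rows[i::n] for i in range(n)]

def run_fragment(part: List[float]) -> Tuple[float, int]:
    """Work done on one node: return a partial (sum, count) for its rows."""
    return (sum(part), len(part))

def merge(partials: List[Tuple[float, int]]) -> float:
    """Coordinator step: combine partial results into the final average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count

rows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
partials = [run_fragment(p) for p in split(rows, 3)]
average = merge(partials)
```

Shipping (sum, count) pairs rather than per-node averages keeps the merge correct even when partitions have different sizes.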
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Orchestrates the distribution and lifecycle of execution plans across cluster members with automatic rescheduling and failure recovery.
Dkron is a distributed, fault-tolerant system designed for scheduling and executing recurring tasks across a cluster of nodes. It functions as a cron-based orchestrator that manages job lifecycles, including automatic retries, timeouts, and complex dependencies, while ensuring state consistency through a consensus protocol. By coordinating remote task execution across infrastructure, it enables the automation of background operations and the management of distributed workflows. The system distinguishes itself through a modular architecture that supports pluggable storage backends and a plugin
Assigns tasks to specific nodes using tags and streams results back to a central leader to ensure reliable processing across the network.
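Tag-based assignment of the kind described above can be sketched as a simple filter: a job lists the tags it requires, and only nodes carrying all of them are eligible. This is a minimal sketch in plain Python, not Dkron's actual implementation or configuration format; `select_nodes`, the node names, and the tag keys are hypothetical.

```python
from typing import Dict, List

def select_nodes(nodes: Dict[str, Dict[str, str]],
                 required: Dict[str, str]) -> List[str]:
    """Return the sorted names of nodes whose tags include every required key/value pair."""
    return sorted(
        name
        for name, tags in nodes.items()
        if all(tags.get(key) == value for key, value in required.items())
    )

# Hypothetical cluster: node name -> tags advertised by that node.
nodes = {
    "node-a": {"role": "web", "dc": "eu"},
    "node-b": {"role": "db", "dc": "eu"},
    "node-c": {"role": "web", "dc": "us"},
}

web_nodes = select_nodes(nodes, {"role": "web"})
eu_web_nodes = select_nodes(nodes, {"role": "web", "dc": "eu"})
```

An empty `required` mapping matches every node, which is one reasonable default for jobs that may run anywhere.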