126 Repos
Tools for managing the lifecycle and membership of nodes in a cluster.
Distinguishing note: Focuses on node-level orchestration, distinct from general cluster management.
This project is a containerized node orchestrator and deployment tool designed to manage execution clients and rollup nodes on a blockchain network. It provides a coordinated stack of isolated virtual environments to establish and maintain network connections. The system includes specialized data provisioning tools for initializing local directories and fetching verified archival snapshots to bypass sequential synchronization. It also features a monitoring suite with health check services and dashboards to track synchronization progress and overall system performance. The orchestrator covers
Provides mechanisms to clear local chain data and history before restarting or reconfiguring nodes.
Dokploy is a self-hosted platform-as-a-service designed to simplify the deployment and management of containerized applications and databases. It provides a centralized control plane that decouples administrative management from application workloads, allowing users to oversee infrastructure across multiple server nodes through a unified web interface or a command-line tool. The platform distinguishes itself through an extensive library of pre-configured application templates, enabling the rapid deployment of databases, identity providers, and various productivity or development tools. It sup
Assigns specific roles to servers within a cluster to designate machines for system state or application execution.
Minikube is a command-line tool designed for local Kubernetes development, enabling users to provision and manage full-featured container clusters directly on a workstation. It serves as a local orchestrator that automates the lifecycle of isolated environments, allowing developers to start, stop, pause, and delete clusters to support testing and integration workflows. The project distinguishes itself through its flexible architecture, which supports multiple virtualization drivers and container runtimes to accommodate diverse host environments. It provides deep integration between the host a
Allows users to add, remove, and list nodes in a multi-node cluster environment.
Shardeum is an autoscaling blockchain infrastructure designed to distribute network workloads across multiple shards to increase throughput. It uses a dynamic-sharding architecture that horizontally scales node capacity and adjusts the number of active shards based on real-time network demand. The system features an execution environment compatible with the Ethereum Virtual Machine, allowing it to run smart contracts and decentralized applications. It maintains network agreement and security through consensus-group partitioning, which organizes validator nodes into discrete groups. The platf
Implements automatic scaling of shards and node distributions to meet real-time network demand.
xxl-job is a distributed task scheduling platform and job orchestrator designed to manage and trigger timed jobs across a cluster of remote executor nodes. It provides a centralized system for scheduling tasks, linking dependent jobs, and managing complex execution lifecycles through a relational database that persists configurations and logs. The platform distinguishes itself through a web-based interface for cron job management, allowing users to create and update scheduled tasks without modifying source code. It supports cross-language task execution by triggering logic on third-party exec
Supports dynamically scaling the number of executor nodes up or down to meet task resource demands.
Consul is a distributed coordination service and service mesh tool used for service discovery, health monitoring, and cluster state management across dynamic networks. It provides a platform for locating network addresses of services and managing traffic across distributed infrastructure using DNS and HTTP interfaces. The project distinguishes itself through multi-datacenter network orchestration, enabling the federation of services across different regions using mesh gateways. It secures communication via a service mesh architecture that employs identity-based authorization and mutual TLS en
Implements a bootstrap process for automatic node discovery and connectivity in cloud and container orchestrators.
This project is a Kubernetes serverless framework and OCI container function platform. It provides a system for deploying event-driven functions and microservices as compatible container images onto a Kubernetes cluster. The platform includes an event-driven function orchestrator that triggers executions via HTTP requests or message streams. It features an auto-scaling function manager that adjusts the number of active instances based on real-time demand and scales down to zero during inactivity. A background queuing system is included to process asynchronous tasks and maintain application re
Automatically adjusts the number of active function instances based on real-time demand to optimize resource usage.
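As an illustration of the demand-based scaling described in this entry, here is a minimal sketch of a scale-to-zero decision rule. It is not taken from the project: the function name `desired_instances`, its parameters, and the policy of keeping one warm instance until an idle timeout expires are all assumptions made for the example.

```python
import math


def desired_instances(in_flight: int, target_per_instance: int,
                      max_instances: int, idle_seconds: float,
                      idle_timeout: float) -> int:
    """Return how many function instances should run (hypothetical policy)."""
    if in_flight == 0:
        # No demand: scale to zero only after the idle timeout has elapsed,
        # otherwise keep a single warm instance.
        return 0 if idle_seconds >= idle_timeout else 1
    # Demand present: one instance per `target_per_instance` in-flight
    # requests, rounded up and clamped to the range [1, max_instances].
    wanted = math.ceil(in_flight / target_per_instance)
    return max(1, min(wanted, max_instances))
```

With a target of 10 requests per instance and a ceiling of 5 instances, 25 in-flight requests would yield 3 instances under this rule, and an idle period longer than the timeout would yield 0.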
Seata is a distributed transaction coordinator designed to ensure data consistency and atomicity across microservices. It provides a centralized framework for managing global transactions, preventing partial data updates across different databases and services. The project implements multiple transaction modes to balance consistency and performance. This includes an automatic mode that uses rollback logs to coordinate compensation without modifying business logic, a try-confirm-cancel pattern for resources lacking native ACID support, and a saga orchestration engine for managing long-lived bu
Locates available coordinator nodes by querying a naming server using cluster metadata and load balancing.
Excelize is a library for reading and writing spreadsheet files in the Office Open XML format. It provides a comprehensive suite of tools for programmatically creating, modifying, and analyzing workbooks, worksheets, and cell data, ensuring compatibility across various office software suites through structured XML serialization. The library distinguishes itself with a built-in formula calculation engine that evaluates complex mathematical and logical expressions directly against workbook data. It also features a memory-mapped streaming architecture, which allows for the efficient processing o
Manages cluster nodes by executing administrative commands across multiple machines.
Containerd is a daemon-based container runtime that manages the complete lifecycle of containers on a host system. It functions as a core orchestration backend, handling image distribution, storage, and process execution while adhering to industry-standard specifications for container execution and configuration. The project is distinguished by its modular, plugin-based architecture, which allows for the extension of storage, runtime, and networking capabilities without requiring a full daemon recompile. It utilizes a shim-based execution model to delegate low-level operations, ensuring isola
Validates node-level functionality by deploying local cluster environments for testing.
NATS Server is a high-performance, lightweight messaging system designed for cloud-native applications, edge computing, and distributed microservices. It functions as a distributed publish-subscribe broker that routes messages using hierarchical, dot-separated subject strings, enabling decoupled communication between services without requiring centralized broker lookups. The system supports core messaging patterns including asynchronous publish-subscribe, request-reply, and load-balanced queue processing. The platform distinguishes itself through a decentralized architecture that eliminates t
Automates the eviction of unresponsive cluster peers to maintain data redundancy and cluster health.
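The NATS Server entry above mentions routing on hierarchical, dot-separated subject strings. The sketch below shows one way such matching can work, using the two NATS subject wildcards: `*` for exactly one token and `>` for one or more trailing tokens. The function `subject_matches` is a hypothetical helper written for this listing, not part of any NATS client or server API.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Check whether a dot-separated subject matches a subscription pattern."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last pattern token and needs at
            # least one remaining subject token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # subject ran out of tokens before the pattern did
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # Without a trailing '>', both sides must have the same token count.
    return len(p_tokens) == len(s_tokens)
```

Under this sketch, `orders.*.created` matches `orders.eu.created`, and `orders.>` matches any subject that has at least one token after `orders`.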
rqlite is a distributed relational database that replicates SQLite data across a cluster using the Raft consensus algorithm. It functions as a fault-tolerant storage system that provides high availability and a web API for executing SQL queries and managing relational data without requiring native database drivers. The system distinguishes itself by using an HTTP SQL interface to expose database operations and cluster management. It features a real-time change data capture stream that pushes database mutations to external HTTP endpoints via webhooks and supports the scaling of read throughput
Manages the cluster lifecycle by handling node discovery, addition, and removal.
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
Increases peak performance by expanding matrix multiplication units and implementing liquid cooling systems.
kops is a Kubernetes cluster provisioner and lifecycle manager designed to automate the creation, maintenance, and destruction of production-grade clusters on cloud infrastructure. It functions as a declarative infrastructure manager, synchronizing the live state of a cluster with versioned manifests stored in remote object storage to ensure idempotent operations. The project distinguishes itself by offering comprehensive automation for the entire cluster lifecycle, including high-availability control plane deployment, incremental rolling updates, and automated version upgrades. It also serve
Assigns custom metadata labels to nodes to control pod scheduling and workload placement.
FoundationDB is an ACID-compliant distributed transactional key-value store. It functions as a scalable database engine that ensures strict serializability and data consistency across a cluster of servers using a shared-nothing architecture. The system is distinguished by its multi-region replication capabilities, allowing data to be synchronized across different datacenters for high availability and disaster recovery. It utilizes optimistic concurrency control to manage distributed transactions and employs a majority-based coordination system to maintain cluster state. The platform provides
Expands storage and compute capacity by adding commodity servers and automatically handling hardware failures.
VictoriaMetrics is a high-performance, scalable time series database and observability platform designed for long-term storage and analysis of metric, log, and trace data. It functions as a unified backend for monitoring ecosystems, offering full compatibility with industry-standard protocols and query languages. The system is built to handle massive data volumes through a distributed architecture that supports horizontal scaling and efficient data lifecycle management. The platform distinguishes itself through a storage engine that utilizes consistent hashing for data sharding and log-struct
Automates the identification and registration of storage nodes within a cluster to simplify scaling.
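The VictoriaMetrics entry above mentions consistent hashing for data sharding. As a rough illustration of that general technique, and not of the project's actual implementation, the sketch below places storage nodes on a hash ring so that registering a new node moves only the keys that land on it. The class `HashRing`, the virtual-node count, and the node names used in the usage note are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self._keys = []      # sorted ring positions
        self._owners = {}    # ring position -> node name

    @staticmethod
    def _hash(value: str) -> int:
        # Use the first 8 bytes of SHA-256 as a 64-bit ring position.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owners:
                continue  # skip the (extremely unlikely) position collision
            self._owners[pos] = node
            bisect.insort(self._keys, pos)

    def remove_node(self, node: str) -> None:
        self._keys = [k for k in self._keys if self._owners[k] != node]
        self._owners = {k: v for k, v in self._owners.items() if v != node}

    def node_for(self, key: str) -> str:
        if not self._keys:
            raise LookupError("ring is empty")
        # The first position clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

For example, after adding `storage-1` through `storage-3` and then registering `storage-4`, any key whose owner changed is now owned by `storage-4`; keys on the other nodes stay where they were.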
This project is a local Kubernetes cluster manager that runs control plane and worker nodes as containers on a host machine. It provides an environment for local development and automated testing by emulating a full Kubernetes cluster within a container runtime. The tool enables the creation of multi-node topologies and high-availability control planes through configuration files. It supports image sideloading to transfer container images directly from the host to nodes, bypassing remote registries, and allows for offline deployments using pre-built node images. Capabilities include
Utilizes the standard Kubernetes installation tool within containers to initialize the cluster and join nodes.
Skynet is a distributed game server framework designed for building scalable online game backends. It utilizes distributed actor-based clusters and real-time network communication to manage high-concurrency session coordination across multiple nodes. The framework includes a cluster management orchestrator for coordinating services via cluster-wide messaging and dynamic configuration updates. It features a multi-protocol network gateway supporting TCP, UDP, and WebSockets, alongside a data encoding layer using BSON and Sproto serialization for efficient information transfer between distribute
Provides an orchestrator for managing the lifecycle and coordination of nodes and services within a distributed cluster.
ZooKeeper is a distributed coordination service that provides a centralized system for managing configuration, naming, and synchronization across a cluster of distributed processes. It functions as a cluster consensus service, a distributed configuration store, and a distributed lock manager to maintain a consistent state across multiple network nodes. The service implements a consensus protocol to ensure data consistency and uses a replicated state machine to maintain identical copies of the system state across all servers. It provides a distributed lock management system to coordinate exclu
Synchronizes configuration settings across multiple server nodes in a cluster to maintain global consistency.
Nebula is a distributed graph database designed for storing and querying massive volumes of interconnected vertices and edges across a horizontally scalable cluster. It functions as a Kubernetes-native database and a distributed graph analytics engine, utilizing a Raft-based distributed store to ensure strong consistency and high availability. The system features an OpenCypher query engine for performing complex graph traversals and pattern matching. It distinguishes itself with a decoupled compute-storage architecture and a shared-nothing distributed design, allowing query processing and dat
Allows manual addition or removal of meta, graph, and storage nodes to scale cluster capacity.