# kvcache-ai/mooncake

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/kvcache-ai-mooncake).**

5,594 stars · 856 forks · C++ · Apache-2.0

## Links

- GitHub: https://github.com/kvcache-ai/Mooncake
- Homepage: https://kvcache-ai.github.io/Mooncake/
- awesome-repositories: https://awesome-repositories.com/repository/kvcache-ai-mooncake.md

## Description

Mooncake is a disaggregated large language model serving platform and distributed key-value store designed for high-performance inference infrastructure. It functions as a GPU memory orchestrator and KV cache management system that pools and transfers key-value caches across clusters to accelerate inference.

Mooncake separates the prefill and decode phases of inference onto distinct hardware clusters, so that the compute-intensive prefill phase and the memory-intensive decode phase can each be provisioned for their own bottleneck. An RDMA-based distributed cache moves data between compute nodes with zero-copy transfers, which bypass the CPU and reduce latency and overhead.
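
To make the prefill/decode split concrete, here is a minimal, single-process sketch of the handoff. It is not Mooncake's API: `KVCachePool`, `prefix_key`, `prefill`, and `decode` are hypothetical names, a plain dictionary stands in for the distributed cache, and the "KV cache" is a list of placeholder tuples rather than real tensors.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class KVCachePool:
    """Hypothetical stand-in for a distributed KV cache pool (a local dict here)."""
    blocks: dict = field(default_factory=dict)

    def put(self, key: str, kv_block: list) -> None:
        self.blocks[key] = kv_block

    def get(self, key: str):
        return self.blocks.get(key)


def prefix_key(tokens: list) -> str:
    """Derive a cache key from the prompt tokens (an assumed keying scheme)."""
    raw = ",".join(str(t) for t in tokens).encode()
    return hashlib.sha256(raw).hexdigest()


def prefill(pool: KVCachePool, prompt_tokens: list) -> str:
    """Prefill side: build the KV cache for the whole prompt once, then publish it."""
    kv_block = [(t, t * 2) for t in prompt_tokens]  # placeholder (key, value) pairs
    key = prefix_key(prompt_tokens)
    pool.put(key, kv_block)
    return key


def decode(pool: KVCachePool, key: str, steps: int) -> list:
    """Decode side: fetch the published KV cache and extend it one token at a time."""
    kv_block = pool.get(key)
    if kv_block is None:
        raise KeyError("KV cache not found; prefill must run first")
    kv_block = list(kv_block)  # work on a private copy of the shared block
    generated = []
    for _ in range(steps):
        next_token = len(kv_block)  # placeholder for a real model step
        kv_block.append((next_token, next_token * 2))
        generated.append(next_token)
    return generated


pool = KVCachePool()
handle = prefill(pool, [11, 12, 13])
tokens = decode(pool, handle, steps=2)
print(tokens)
```

In a real deployment the two functions would run on different clusters and the pool lookup would be a network transfer; the sketch only shows that the decode side needs nothing from the prefill side except the cache key.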

Its capabilities include distributed memory pooling, accelerator memory routing via CXL, and multi-tier storage offloading to SSDs. Cluster state is managed through metadata coordination services, and resources are governed through lease-based object protection and watermark-based cache eviction.

The software is packaged for containerized deployment with support for host networking and hardware device mapping.

## Tags

### Artificial Intelligence & ML

- [Disaggregated Inference Orchestration](https://awesome-repositories.com/f/artificial-intelligence-ml/disaggregated-inference-orchestration.md) — Orchestrates the separate prefill and decode phases across distributed GPU pools to optimize LLM serving throughput.
- [Disaggregated Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/model-deployment-toolkits/distributed-deployment-utilities/disaggregated-inference.md) — Separates the prefill and decode phases of LLM inference into distinct hardware clusters. ([source](https://kvcache-ai.github.io/Mooncake/))
- [Prefill-Decode Disaggregation](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-decoding-models/sequence-decoders/prefill-decode-disaggregation.md) — Implements the architectural separation of compute-intensive prefill and memory-intensive decoding phases into distinct hardware clusters.
- [KV Cache Management](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-management.md) — Manages the storage and retrieval of key-value caches in transformer models to reduce inference latency and memory overhead.
- [Tiered Storage Offloading](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-management/tiered-storage-offloading.md) — Balances cost and capacity by offloading KV cache blocks between high-speed RAM and SSD storage.
- [Prefix-Aware and Disaggregated Deployments](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/llm-serving-architectures/prefix-aware-and-disaggregated-deployments.md) — Provides a serving platform that separates prefill and decode phases into distinct clusters for optimal resource utilization.
- [High-Throughput Inference Services](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/runtime-interfaces-orchestration/inference-orchestration/high-throughput-inference-services.md) — Distributes computational loads across multiple accelerators to support high-throughput large-scale language model inference.
- [SLO-Driven Predictive Routing](https://awesome-repositories.com/f/artificial-intelligence-ml/slo-driven-predictive-routing.md) — Uses prediction-based routing and early rejection to balance throughput and latency against service level objectives. ([source](https://kvcache-ai.github.io/Mooncake/))

### Data & Databases

- [Cache Eviction Policies](https://awesome-repositories.com/f/data-databases/cache-eviction-policies.md) — Triggers removal of cache objects when memory usage reaches a predefined high watermark ratio. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/sglang-integration/hicache-integration-v1.html))
- [Distributed Caches](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/caching-performance/caching/distributed-caches.md) — Stores and retrieves reusable cache data across an inference cluster to support disaggregated architectures. ([source](https://cdn.jsdelivr.net/gh/kvcache-ai/mooncake@main/README.md))
- [Distributed Key-Value Stores](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-persistence-storage/specialized-storage-engines/distributed-key-value-stores.md) — Implements a high-performance key-value engine to store and retrieve immutable data objects across a cluster. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
- [Distributed Caching](https://awesome-repositories.com/f/data-databases/distributed-caching.md) — Manages the storage and transfer of key-value caches across instances to reduce inference latency. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/lmcache-integration.html))
- [LLM KV Cache Stores](https://awesome-repositories.com/f/data-databases/distributed-caching/llm-kv-cache-stores.md) — Implements a distributed storage layer specifically for pooling and transferring LLM key-value caches across clusters.
- [Distributed Shared Memory](https://awesome-repositories.com/f/data-databases/distributed-shared-memory.md) — Aggregates memory across multiple servers into a unified shared pool for scalable remote memory access. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/sglang-integration/hicache-integration-v1.html))
- [Distributed Resource Pooling](https://awesome-repositories.com/f/data-databases/in-memory-caches/distributed-memory-caches/distributed-resource-pooling.md) — Pools underutilized processor and memory resources across clusters to create a shared storage layer. ([source](https://kvcache-ai.github.io/Mooncake/))
- [Accelerator-to-Accelerator Communication](https://awesome-repositories.com/f/data-databases/data-exchange-protocols/runtime-data-exchange/accelerator-to-accelerator-communication.md) — Facilitates high-bandwidth data exchange between specialized AI accelerators using optimized communication paths. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/supported-protocols.html))
- [Centralized Service Metadata](https://awesome-repositories.com/f/data-databases/distributed-key-value-stores/centralized-service-metadata.md) — Integrates with Redis, etcd, or HTTP services to provide shared storage for transfer engine metadata and governance. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/build.html))
- [Data Locality Optimizers](https://awesome-repositories.com/f/data-databases/query-optimizations/data-layout-optimizers/data-locality-optimizers.md) — Assigns preferred storage segments for object allocation to minimize network overhead and increase speed. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
- [SSD Storage Extensions](https://awesome-repositories.com/f/data-databases/ssd-storage-extensions.md) — Offloads in-memory data to a distributed file system on SSDs to balance storage cost and capacity. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
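
The watermark-triggered eviction described in the list above can be illustrated with a small sketch. This is an assumption-laden toy, not Mooncake's implementation: the class `WatermarkCache` and the parameters `high_watermark` and `low_watermark` are hypothetical, and evicting in least-recently-used order down to a low watermark is one plausible policy among several.

```python
from collections import OrderedDict


class WatermarkCache:
    """Toy byte-budgeted cache that evicts once usage reaches a high watermark."""

    def __init__(self, capacity_bytes: int, high_watermark: float = 0.8,
                 low_watermark: float = 0.65):
        self.capacity = capacity_bytes
        self.high_watermark = high_watermark  # ratio that triggers eviction
        self.low_watermark = low_watermark    # ratio at which eviction stops
        self.used = 0
        self.objects = OrderedDict()          # key -> size, least recent first

    def get(self, key):
        """Return the object's size and mark it as most recently used."""
        if key not in self.objects:
            return None
        self.objects.move_to_end(key)
        return self.objects[key]

    def put(self, key, size: int) -> None:
        """Insert or replace an object, then evict if the watermark is reached."""
        if key in self.objects:
            self.used -= self.objects.pop(key)
        self.objects[key] = size
        self.used += size
        if self.used >= self.high_watermark * self.capacity:
            self._evict()

    def _evict(self) -> None:
        """Drop least recently used objects until usage is below the low watermark."""
        while self.objects and self.used >= self.low_watermark * self.capacity:
            _, size = self.objects.popitem(last=False)
            self.used -= size


cache = WatermarkCache(capacity_bytes=100)
cache.put("a", 30)
cache.put("b", 30)
cache.get("a")      # "a" is now more recently used than "b"
cache.put("c", 30)  # usage hits 90 of 100, so eviction runs and removes "b"
print(list(cache.objects), cache.used)
```

Stopping at a lower watermark rather than just below the trigger point avoids re-running eviction on every subsequent insert.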

### DevOps & Infrastructure

- [RDMA GPU Transfers](https://awesome-repositories.com/f/devops-infrastructure/cluster-node-management/gpu-cluster-communications/rdma-gpu-transfers.md) — Employs RDMA and zero-copy techniques to move data between compute nodes at line rate. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
- [KV Cache VRAM Transfers](https://awesome-repositories.com/f/devops-infrastructure/cluster-node-management/gpu-cluster-communications/rdma-gpu-transfers/kv-cache-vram-transfers.md) — Moves key-value cache blocks between GPU memory pools over PCIe or RDMA to support disaggregated serving. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/lmdeploy-integration-v0.9.html))
- [RDMA Memory Pools](https://awesome-repositories.com/f/devops-infrastructure/cluster-node-management/gpu-cluster-communications/rdma-gpu-transfers/rdma-memory-pools.md) — Provides a high-performance memory pool that uses RDMA and zero-copy transfers to eliminate CPU overhead during data movement.
- [Storage Scaling](https://awesome-repositories.com/f/devops-infrastructure/container-cluster-deployments/elastic-scaling-deployments/storage-scaling.md) — Adjusts system capacity in real time by adding or removing storage nodes from the cluster without downtime. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
- [Distributed Leader Election](https://awesome-repositories.com/f/devops-infrastructure/distributed-leader-election.md) — Provides coordination mechanisms to ensure a single active leader among multiple master nodes for cluster state management.
- [High Availability Clustering](https://awesome-repositories.com/f/devops-infrastructure/high-availability-clustering.md) — Deploys multiple coordinated master nodes with leader election to ensure continuous service availability. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/sglang-integration/hicache-integration-v1.html))

### Networking & Communication

- [Inter-Node Cache Transfers](https://awesome-repositories.com/f/networking-communication/inter-node-cache-transfers.md) — Transfers cache data between processing instances over RDMA or TCP to accelerate serving. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/vllm-integration/vllm-integration-v0.2.html))
- [Inter-Worker Tensor Transfers](https://awesome-repositories.com/f/networking-communication/inter-worker-tensor-transfers.md) — Uses high-performance RDMA to move cache and tensors between prefill and decode worker nodes. ([source](https://cdn.jsdelivr.net/gh/kvcache-ai/mooncake@main/README.md))
- [Direct GPU-to-Storage Transfers](https://awesome-repositories.com/f/networking-communication/network-transfer-management/direct-gpu-network-transfers/direct-gpu-to-storage-transfers.md) — Moves data directly between GPU memory and storage to bypass the CPU and reduce overhead. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/supported-protocols.html))
- [Zero-Copy Networking](https://awesome-repositories.com/f/networking-communication/zero-copy-networking.md) — Moves data directly between hardware memory spaces over RDMA to bypass the CPU and reduce latency.
- [Asynchronous Data Transfers](https://awesome-repositories.com/f/networking-communication/asynchronous-data-transfers.md) — Implements non-blocking movement of data across networks to maintain high responsiveness during inference. ([source](https://kvcache-ai.github.io/Mooncake/design/transfer-engine/index.html))
- [Network Protocols](https://awesome-repositories.com/f/networking-communication/communication-protocols-architectures/communication-protocols-standards/network-protocols.md) — Utilizes various network protocols including TCP and RDMA for high-speed data transmission between compute instances. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/supported-protocols.html))
- [RDMA](https://awesome-repositories.com/f/networking-communication/connection-pooling/rdma.md) — Limits active RDMA connections using an eviction algorithm to prevent performance degradation. ([source](https://kvcache-ai.github.io/Mooncake/design/transfer-engine/index.html))
- [Object Lease Protections](https://awesome-repositories.com/f/networking-communication/dhcp-servers/lease-management/object-lease-protections.md) — Prevents the eviction of critical data by granting temporary locks that expire after a set duration.
- [Connection Handshake Metadata](https://awesome-repositories.com/f/networking-communication/network-reliability-diagnostics/connection-session-management/connection-management/connection-lifecycle-managers/plugin-connection-managers/connection-handshake-metadata.md) — Exchanges operational metadata during handshakes to track internal connection status and node metadata. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/sglang-integration/hicache-integration-v1.html))
- [Hardware Topology Optimizers](https://awesome-repositories.com/f/networking-communication/network-topology-extensions/topology-abstraction-layers/hardware-topology-optimizers.md) — Optimizes data paths by selecting the most efficient network interface via a hardware topology matrix. ([source](https://kvcache-ai.github.io/Mooncake/design/transfer-engine/index.html))
- [Topology-Aware Routing](https://awesome-repositories.com/f/networking-communication/topology-aware-routing.md) — Selects the most efficient network interface by matching hardware paths against a device topology matrix.
- [Tiered Storage Transfers](https://awesome-repositories.com/f/networking-communication/zero-copy-file-transfers/tiered-storage-transfers.md) — Achieves zero-copy transfers of data between different storage tiers using RDMA and TCP protocols. ([source](https://kvcache-ai.github.io/Mooncake/design/transfer-engine/index.html))
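
The topology-aware interface selection mentioned in the list above can be sketched as a lookup in a matrix that maps each memory location to preferred and fallback network interfaces. The matrix layout, the device names, and the `select_nic` helper below are illustrative assumptions, not the Transfer Engine's actual data structures.

```python
import random

# Assumed layout: memory location -> (preferred NICs, fallback NICs).
# Location and NIC names are made up for illustration.
TOPOLOGY = {
    "cpu:0": (["mlx5_0"], ["mlx5_1"]),
    "cuda:0": (["mlx5_1"], ["mlx5_0"]),
}


def select_nic(topology: dict, location: str, healthy: set, rng=random) -> str:
    """Pick a healthy NIC for a buffer, trying the preferred tier first."""
    preferred, fallback = topology[location]
    for tier in (preferred, fallback):
        candidates = [nic for nic in tier if nic in healthy]
        if candidates:
            # Spread load across equally good interfaces.
            return rng.choice(candidates)
    raise RuntimeError(f"no healthy NIC available for {location}")


all_up = {"mlx5_0", "mlx5_1"}
print(select_nic(TOPOLOGY, "cuda:0", all_up))      # preferred NIC for the GPU buffer
print(select_nic(TOPOLOGY, "cuda:0", {"mlx5_0"}))  # falls back when it is down
```

Keeping the fallback tier in the same structure means a failed interface degrades the path rather than failing the transfer outright.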

### Operating Systems & Systems Programming

- [GPU Memory Orchestration](https://awesome-repositories.com/f/operating-systems-systems-programming/gpu-memory-orchestration.md) — Coordinates data movement and synchronization between system memory, video memory, and local storage for accelerators.
- [Hardware Memory Abstractions](https://awesome-repositories.com/f/operating-systems-systems-programming/hardware-memory-abstractions.md) — Provides a unified interface to synchronize data between system memory, video memory, and local storage. ([source](https://cdn.jsdelivr.net/gh/kvcache-ai/mooncake@main/README.md))
- [Accelerator Routing Logic](https://awesome-repositories.com/f/operating-systems-systems-programming/hardware-interfacing-drivers/hardware-acceleration/accelerator-routing-logic.md) — Implements logic to detect and optimize data paths across hardware environments for efficient accelerator memory access. ([source](https://cdn.jsdelivr.net/gh/kvcache-ai/mooncake@main/README.md))
- [Multi-Tenant Memory Tracking](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management-systems/database-memory-management/multi-tenant-memory-tracking.md) — Enforces memory admission limits for different tenants and performs scoped eviction when limits are exceeded. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
- [Decoupled Store Services](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/inference-cache-management/decoupled-store-services.md) — Runs a local store service to handle memory management separately from the main process. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/sglang-integration/hicache-integration-v1.html))
- [Remote Memory Segments](https://awesome-repositories.com/f/operating-systems-systems-programming/remote-memory-segments.md) — Provides named memory segments that allow remote compute nodes to perform read and write operations on local memory. ([source](https://kvcache-ai.github.io/Mooncake/design/transfer-engine/index.html))

### Software Engineering & Architecture

- [Lease-Based Protection](https://awesome-repositories.com/f/software-engineering-architecture/lease-based-protection.md) — Grants temporary locks on objects to prevent removal until a specific lease expires. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
- [LRU Cache Eviction](https://awesome-repositories.com/f/software-engineering-architecture/memory-management/lru-cache-eviction.md) — Reclaims system memory by prioritizing the removal of least recently used objects. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
- [Object Pinning](https://awesome-repositories.com/f/software-engineering-architecture/object-pinning.md) — Protects essential cache data from eviction using time-limited soft pins or permanent hard pins. ([source](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html))
- [Metadata Store Backends](https://awesome-repositories.com/f/software-engineering-architecture/pluggable-backends/metadata-store-backends.md) — Supports interchangeable remote database backends for storing and retrieving transfer engine metadata. ([source](https://kvcache-ai.github.io/Mooncake/getting_started/examples/vllm-integration/vllm-integration-v0.2.html))
- [Pluggable Cluster Metadata Stores](https://awesome-repositories.com/f/software-engineering-architecture/pluggable-backends/metadata-store-backends/pluggable-cluster-metadata-stores.md) — Manages cluster-wide connection status and metadata using external coordination services like etcd. ([source](https://kvcache-ai.github.io/Mooncake/design/transfer-engine/index.html))
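
The lease and pinning rules in the list above can be read as a single eviction guard. The sketch below is one possible interpretation under stated assumptions: `ObjectMeta`, `grant_lease`, and `evictable` are hypothetical names, and timestamps are passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class ObjectMeta:
    """Per-object protection state (field names are illustrative)."""
    lease_expires_at: float = 0.0     # temporary lock, e.g. granted on access
    soft_pin_expires_at: float = 0.0  # time-limited pin
    hard_pinned: bool = False         # permanent pin


def grant_lease(meta: ObjectMeta, now: float, ttl_s: float) -> None:
    """Extend the lease so the object stays protected until now + ttl_s."""
    meta.lease_expires_at = max(meta.lease_expires_at, now + ttl_s)


def evictable(meta: ObjectMeta, now: float) -> bool:
    """An object may be evicted only when no pin or lease protects it."""
    if meta.hard_pinned:
        return False
    if now < meta.lease_expires_at:
        return False
    if now < meta.soft_pin_expires_at:
        return False
    return True


leased = ObjectMeta()
grant_lease(leased, now=100.0, ttl_s=5.0)
print(evictable(leased, now=102.0))  # lease still active
print(evictable(leased, now=105.0))  # lease has expired

pinned = ObjectMeta(hard_pinned=True)
print(evictable(pinned, now=1e9))    # hard pin never expires
```

An eviction pass would call `evictable` on each candidate before removing it, so protected objects are skipped rather than locked against.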

### Part of an Awesome List

- [Disaggregated Serving](https://awesome-repositories.com/f/awesome-lists/ai/disaggregated-serving.md) — KV-cache-centric architecture for disaggregated model serving.
