Mooncake

Mooncake is a disaggregated serving platform for large language models and a distributed key-value store for inference infrastructure. It acts as a GPU memory orchestrator and KV cache manager that pools and transfers key-value caches across clusters to accelerate inference.

The system separates the prefill and decode phases of inference into distinct hardware clusters to improve resource utilization. It uses an RDMA-based distributed cache with zero-copy transfers to move data between compute nodes, bypassing the CPU to reduce latency and overhead.
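As a rough illustration of the prefill/decode split, the toy sketch below has a prefill step write a request's KV entries into a shared store and a separate decode step read them back. Everything here is hypothetical: `KVStore`, `prefill`, and `decode` are illustrative names, the store is a plain dictionary rather than an RDMA-backed pool, and the "model" is simple arithmetic. It is not Mooncake's API.

```python
from dataclasses import dataclass, field


@dataclass
class KVStore:
    """Stand-in for a shared KV cache pool (a plain dict, no RDMA involved)."""
    blocks: dict = field(default_factory=dict)

    def put(self, request_id, kv):
        self.blocks[request_id] = kv

    def get(self, request_id):
        return self.blocks[request_id]


def prefill(store, request_id, prompt_tokens):
    """Prefill worker: process the whole prompt once and publish its KV entries."""
    kv = [(tok, tok * 2) for tok in prompt_tokens]  # fake (key, value) per token
    store.put(request_id, kv)


def decode(store, request_id, max_new_tokens):
    """Decode worker: fetch the published KV entries and generate token by token."""
    kv = list(store.get(request_id))  # local copy; the stored entry stays unchanged
    output = []
    for _ in range(max_new_tokens):
        # Fake "attention": the next token depends on every cached key.
        next_tok = (sum(k for k, _ in kv) + len(kv)) % 100
        output.append(next_tok)
        kv.append((next_tok, next_tok * 2))
    return output
```

In a real deployment the two functions would run on different clusters, and the `put`/`get` pair is where the cache transfer between them would happen.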

Its capabilities include distributed memory pooling, accelerator memory routing via CXL, and multi-tier storage offloading to SSDs. It manages cluster state through metadata coordination services and governs resources with lease-based object protection and watermark-based cache eviction.
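One way lease-based protection and watermark-based eviction can interact is sketched below. This is a minimal single-process model under stated assumptions, not Mooncake's implementation: the class name `LeasedLRUCache`, its parameters, and the choice to renew a lease on every read are all hypothetical, and least-recently-used ordering is assumed as the eviction order.

```python
import time
from collections import OrderedDict


class LeasedLRUCache:
    """Toy cache: LRU order, lease protection, high-watermark eviction."""

    def __init__(self, capacity_bytes, high_watermark=0.9, lease_ttl=5.0,
                 clock=time.monotonic):
        self.capacity_bytes = capacity_bytes
        self.high_watermark = high_watermark  # usage ratio that triggers eviction
        self.lease_ttl = lease_ttl            # seconds an object stays protected
        self.clock = clock
        self.used_bytes = 0
        self._objects = OrderedDict()         # key -> (size, lease_expiry), oldest first

    def __contains__(self, key):
        return key in self._objects

    def put(self, key, size):
        if key in self._objects:
            return False                      # objects are treated as immutable
        self._objects[key] = (size, self.clock() + self.lease_ttl)
        self.used_bytes += size
        self._evict_if_needed()
        return True

    def get(self, key):
        entry = self._objects.get(key)
        if entry is None:
            return None
        size, _ = entry
        # Assumption: a read renews the lease and marks the object recently used.
        self._objects[key] = (size, self.clock() + self.lease_ttl)
        self._objects.move_to_end(key)
        return size

    def _evict_if_needed(self):
        now = self.clock()
        threshold = self.high_watermark * self.capacity_bytes
        for key in list(self._objects):       # least recently used first
            if self.used_bytes < threshold:
                break
            size, lease_expiry = self._objects[key]
            if lease_expiry > now:
                continue                      # still leased: must not be evicted
            del self._objects[key]
            self.used_bytes -= size
```

Note that in this sketch usage can stay above the watermark while every object is still leased; a production system would need an admission or back-pressure policy for that case.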

The software is packaged for containerized deployment with support for host networking and hardware device mapping.

Features

Disaggregated Inference Orchestration - Orchestrates the separate prefill and decode phases across distributed GPU pools to optimize LLM serving throughput.
Disaggregated Inference - Separates the prefill and decode phases of LLM inference into distinct hardware clusters.
Prefill-Decode Disaggregation - Implements the architectural separation of compute-intensive prefill and memory-intensive decoding phases into distinct hardware clusters.
KV Cache Management - Manages the storage and retrieval of key-value caches in transformer models to reduce inference latency and memory overhead.
Tiered Storage Offloading - Balances cost and capacity by offloading KV cache blocks between high-speed RAM and SSD storage.
Prefix-Aware and Disaggregated Deployments - Provides a serving platform that separates prefill and decode phases into distinct clusters for optimal resource utilization.
High-Throughput Inference Services - Distributes computational loads across multiple accelerators to support high-throughput large-scale language model inference.
Cache Eviction Policies - Triggers removal of cache objects when memory usage reaches a predefined high watermark ratio.
Distributed Caches - Stores and retrieves reusable cache data across an inference cluster to support disaggregated architectures.
Distributed Key-Value Stores - Implements a high-performance key-value engine to store and retrieve immutable data objects across a cluster.
Distributed Caching - Manages the storage and transfer of key-value caches across instances to reduce inference latency.
LLM KV Cache Stores - Implements a distributed storage layer specifically for pooling and transferring LLM key-value caches across clusters.
Distributed Shared Memory - Aggregates memory across multiple servers into a unified shared pool for scalable remote memory access.
Distributed Resource Pooling - Pools underutilized processor and memory resources across clusters to create a shared storage layer.
RDMA GPU Transfers - Employs RDMA and zero-copy techniques to move data between compute nodes at line rate.
KV Cache VRAM Transfers - Moves key-value cache blocks between GPU memory pools across PCIe or RDMA to support disaggregated serving.
RDMA Memory Pools - Provides a high-performance memory pool that uses RDMA and zero-copy transfers to eliminate CPU overhead during data movement.
Inter-Node Cache Transfers - Transfers cache data between processing instances over RDMA or TCP to accelerate serving.
Inter-Worker Tensor Transfers - Uses high-performance RDMA to move cache and tensors between prefill and decode worker nodes.
Direct GPU-to-Storage Transfers - Moves data directly between GPU memory and storage to bypass the CPU and reduce overhead.
Zero-Copy Networking - Moves data directly between hardware memory spaces over RDMA to bypass the CPU and reduce latency.
GPU Memory Orchestration - Coordinates data movement and synchronization between system memory, video memory, and local storage for accelerators.
Hardware Memory Abstractions - Provides a unified interface to synchronize data between system memory, video memory, and local storage.
Lease-Based Protection - Grants temporary locks on objects to prevent removal until a specific lease expires.
LRU Cache Eviction - Reclaims system memory by prioritizing the removal of least recently used objects.
Object Pinning - Protects essential cache data from eviction using time-limited soft pins or permanent hard pins.
SLO-Driven Predictive Routing - Uses prediction-based routing and early rejection to balance throughput and latency against service level objectives.
Accelerator-to-Accelerator Communication - Facilitates high-bandwidth data exchange between specialized AI accelerators using optimized communication paths.
Centralized Service Metadata - Integrates with Redis, etcd, or HTTP services to provide shared storage for transfer engine metadata and governance.
Data Locality Optimizers - Assigns preferred storage segments for object allocation to minimize network overhead and increase speed.
SSD Storage Extensions - Offloads in-memory data to a distributed file system on SSDs to balance storage cost and capacity.
Storage Scaling - Adjusts system capacity in real-time by adding or removing storage nodes from the cluster without downtime.
Distributed Leader Election - Provides coordination mechanisms to ensure a single active leader among multiple master nodes for cluster state management.
High Availability Clustering - Deploys multiple coordinated master nodes with leader election to ensure continuous service availability.
Asynchronous Data Transfers - Implements non-blocking movement of data across networks to maintain high responsiveness during inference.
Network Protocols - Utilizes various network protocols including TCP and RDMA for high-speed data transmission between compute instances.
RDMA - Limits active RDMA connections using an eviction algorithm to prevent performance degradation.
Object Lease Protections - Prevents the eviction of critical data by granting temporary locks that expire after a set duration.
Connection Handshake Metadata - Exchanges operational metadata during handshakes to track internal connection status and node metadata.
Hardware Topology Optimizers - Optimizes data paths by selecting the most efficient network interface via a hardware topology matrix.
Topology-Aware Routing - Selects the most efficient network interface by matching hardware paths against a device topology matrix.
Tiered Storage Transfers - Achieves zero-copy transfers of data between different storage tiers using RDMA and TCP protocols.
Accelerator Routing Logic - Implements logic to detect and optimize data paths across hardware environments for efficient accelerator memory access.
Multi-Tenant Memory Tracking - Enforces memory admission limits for different tenants and performs scoped eviction when limits are exceeded.
Decoupled Store Services - Runs a local store service to handle memory management separately from the main process.
Remote Memory Segments - Provides named memory segments that allow remote compute nodes to perform read and write operations on local memory.
Metadata Store Backends - Supports interchangeable remote database backends for storing and retrieving transfer engine metadata.
Pluggable Cluster Metadata Stores - Manages cluster-wide connection status and metadata using external coordination services like etcd.
Disaggregated Serving - KV-cache-centric architecture for disaggregated model serving.
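The topology-aware routing entries above describe choosing a network interface by matching a memory location against a device topology matrix. A minimal sketch of that lookup follows, assuming a hypothetical matrix layout in which each location maps to a list of preferred interfaces and a list of fallback interfaces. The function name `select_nic`, the layout, and the device names are illustrative assumptions, not Mooncake's actual format.

```python
import random


def select_nic(topology, location, rng=random):
    """Pick a NIC for a memory location, preferring topologically close devices.

    `topology` maps a location name to (preferred_nics, fallback_nics).
    """
    if location not in topology:
        raise KeyError(f"no topology entry for {location!r}")
    preferred, fallback = topology[location]
    candidates = preferred or fallback  # use fallbacks only if nothing is preferred
    if not candidates:
        raise RuntimeError(f"no usable NIC for {location!r}")
    return rng.choice(candidates)       # spread load across equally good NICs


# Hypothetical matrix: GPU 0 sits next to nic_a; host memory has no preference.
example_topology = {
    "gpu:0": (["nic_a"], ["nic_b"]),
    "cpu:0": ([], ["nic_a", "nic_b"]),
}
```

Calling `select_nic(example_topology, "gpu:0")` always returns `nic_a` here, while `cpu:0` falls back to a random choice between the two interfaces.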

LMCache/LMCache

6,909 stars on GitHub

LMCache is a distributed key-value cache manager and tiering system designed to accelerate large language model inference. It functions as a tiered storage layer that offloads tensors from GPU memory to CPU RAM, local disks, or remote object stores, enabling the reuse of cached prefixes across different inference sessions and serving engines. The system differentiates itself through a disaggregated prefill-decode model, which separates prompt processing from token generation by transferring caches between distributed compute nodes. It utilizes peer-to-peer orchestration to share and retrieve...

ai-dynamo/dynamo

6,112 stars on GitHub

Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and...

llm-d/llm-d

2,514 stars on GitHub

llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov...

sgl-project/sglang

29,079 stars on GitHub

SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr...

kvcache-ai/Mooncake