llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization.
The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, moving memory blocks between GPU memory, host RAM, and shared storage to support long-context workloads.
The framework covers comprehensive traffic management and scaling capabilities, including SLO-aware autoscaling, cache-affinity routing, and predictive latency scoring. It also provides mechanisms for offline batch processing and high-availability scheduler management to balance interactive traffic with asynchronous workloads.
The system exposes these capabilities via an OpenAI-compatible chat completion API.