Llm D

llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization.

The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, moving memory blocks between GPU memory, host RAM, and shared storage to support long-context workloads.

The framework covers comprehensive traffic management and scaling capabilities, including SLO-aware autoscaling, cache-affinity routing, and predictive latency scoring. It also provides mechanisms for offline batch processing and high-availability scheduler management to balance interactive traffic with asynchronous workloads.

The system exposes these capabilities via an OpenAI-compatible chat completion API.

Features

Disaggregated Inference Orchestration - Provides a distributed architecture that separates prefill and decode phases across specialized worker pools to maximize throughput.

Disaggregated Inference - Implements a disaggregated architecture that separates prefill and decode phases across specialized hardware nodes to maximize throughput.

Disaggregated Throughput Optimizations - Provides a disaggregated prefill and decode topology specifically designed to maximize throughput for batch-intensive LLM workloads.

Hardware-Agnostic Accelerators - Utilizes a hardware-agnostic control plane to manage various accelerators and enable low-latency inter-chip communication.

Inference Gateways - Functions as an OpenAI-compatible API gateway for request routing and traffic distribution.

KV Cache Management - Optimizes prefix reuse through cache-aware routing and tiered offloading of memory blocks.

KV-Cache-Aware Request Routing - Employs a weighted scoring system to route requests to replicas holding the necessary KV cache context.

Prefix Cache Reuse - Maximizes cache hits by tracking state across servers and offloading excess data to secondary storage.

Inference Deployment Orchestrators - Provides a control plane for deploying model replicas and managing hardware accelerators with SLO-aware scaling.

Agnostic Control Planes - Provides a hardware-agnostic control plane to manage diverse accelerators and ensure low-latency communication across different chips.

OpenAI-Compatible Model Servers - Provides an OpenAI-compatible API gateway for drop-in integration of large language model inference services.

Prefix-Aware Routing - Routes incoming traffic to specific model replicas based on cached prompt prefixes to minimize redundant computation.

LLM KV Cache Stores - Optimizes memory use and prefix reuse by offloading KV caches to CPU or shared storage and routing via affinity.

Inference Batching - Manages large volumes of offline inference requests through queuing and flow control to maximize hardware utilization.

LLM Replica Autoscaling - Automatically adjusts the number of model replicas based on queue depth and memory pressure to maintain latency targets.

Model Deployments - Deploys and manages multiple instances of model replicas on hardware accelerators for scalable request processing.

Traffic Load Balancers - Provides a proxy-based distribution system to balance incoming model traffic across multiple server replicas.

Cache-Aware Load Balancing - Distributes inference requests across server pools using real-time metrics, predicted latency, and cache locality.

Memory Offloading Frameworks - Implements tiered cache offloading by moving memory blocks between GPU memory, host RAM, and shared storage for long-context workloads.

Latency Reduction Techniques - Implements a performance suite using speculative decoding and fused kernels to minimize token generation latency.

Activation and KV Cache Offloaders - Swaps memory blocks to host RAM to prevent context drops during long-context workloads.

Shared Storage Offloading - Moves memory blocks to a shared file system to decouple cache capacity from local GPU memory.

Cache-Aware Schedulers - Implements a scheduling system that leverages shared context and prefix caching to optimize throughput for multi-tenant environments.

KV-Cache Transport Optimizations - Reduces tail latency during context migration using adaptive congestion control and specialized libraries.

Asynchronous Batching Execution - Executes latency-tolerant requests from message queues to maximize hardware utilization alongside interactive traffic.

Disaggregated Phase Scaling - Scales large models by separating prefill and decode stages using expert parallelism.

Adapter-Aware Routing - Routes traffic to specific nodes based on the location of loaded LoRA adapters to avoid redundant execution.

Offline Batch Inference - Provides a mechanism to run large-scale asynchronous inference via compatible APIs to maximize total hardware utilization.

SLO-Driven Predictive Routing - Scores endpoints by predicting time-to-first-token and inter-token latency to ensure requests meet defined performance targets.

Disaggregated Serving - Utilizes a disaggregated architecture to split prefill and decode stages across fast accelerator interconnects.

Saturation-Based Scaling - Reactive optimizer that adjusts the number of replicas by monitoring queue length to prevent request overflow.

Asynchronous Gating - Implements flow-control gating to execute offline batch requests using spare hardware capacity.

Pressure-Based Scaling - Adjusts the number of active replicas by monitoring cache saturation and queue length to maintain system stability.

Capacity Scaling - Adjusts compute resources using native metrics and traffic routing to minimize costs while meeting latency targets.

Queue-Based Scaling Triggers - Handles traffic spikes using intelligent queuing and autoscales capacity based on real-time load metrics.

Scheduler High Availability - Ensures redundancy and accurate cache tracking through a high-availability scheduler that discovers server pods.

SLO-Aware Scaling - Implements production stability through SLO-aware autoscaling and flow control for multi-tenant environments.

Inference Job Management - Provides capabilities to queue offline requests and dispatch them using flow-control gating for high-volume workloads.

Pod Autoscaling - Autoscales container replicas by monitoring queue depth and memory saturation to maintain latency targets.

LLM Production Infrastructure - Sets up and manages high-availability infrastructure specifically designed for production LLM serving.

Cold-Start Load Balancing - Provides a specialized distribution strategy for cache-cold requests to prevent hotspots and oscillations in the request queue.

Load Balancing Metrics - Distributes requests across replicas using real-time queue depth and memory saturation metrics.

Predictive Latency Routing - Implements a routing system that predicts time-to-first-token and inter-token latency to enforce performance targets.

SLO-Based Server Placement - Ships a request packing system that compares predicted latency against defined SLO targets to optimize server placement.

Inference Batching Schedulers - Provides a request queuing system that balances interactive traffic with asynchronous offline workloads.

llm-dllm-d

Features

Star history