# llm-d/llm-d

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/llm-d-llm-d).**

2,514 stars · 324 forks · Shell · apache-2.0

## Links

- GitHub: https://github.com/llm-d/llm-d
- Homepage: https://www.llm-d.ai
- awesome-repositories: https://awesome-repositories.com/repository/llm-d-llm-d.md

## Description

llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization.

The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, moving memory blocks between GPU memory, host RAM, and shared storage to support long-context workloads.

The framework covers comprehensive traffic management and scaling capabilities, including SLO-aware autoscaling, cache-affinity routing, and predictive latency scoring. It also provides mechanisms for offline batch processing and high-availability scheduler management to balance interactive traffic with asynchronous workloads.

The system exposes these capabilities via an OpenAI-compatible chat completion API.

## Tags

### Artificial Intelligence & ML

- [Disaggregated Inference Orchestration](https://awesome-repositories.com/f/artificial-intelligence-ml/disaggregated-inference-orchestration.md) — Provides a distributed architecture that separates prefill and decode phases across specialized worker pools to maximize throughput.
- [Disaggregated Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/model-deployment-toolkits/distributed-deployment-utilities/disaggregated-inference.md) — Implements a disaggregated architecture that separates prefill and decode phases across specialized hardware nodes to maximize throughput.
- [Disaggregated Throughput Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/batch-size-tuning/automatic-batch-size-optimization/automatic-ml-workload-batching/disaggregated-throughput-optimizations.md) — Provides a disaggregated prefill and decode topology specifically designed to maximize throughput for batch-intensive LLM workloads. ([source](https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale))
- [Hardware-Agnostic Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-agnostic-accelerators.md) — Utilizes a hardware-agnostic control plane to manage various accelerators and enable low-latency inter-chip communication. ([source](https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators))
- [Inference Gateways](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-gateways.md) — Functions as an OpenAI-compatible API gateway for request routing and traffic distribution.
- [KV Cache Management](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-management.md) — Optimizes prefix reuse through cache-aware routing and tiered offloading of memory blocks.
- [KV-Cache-Aware Request Routing](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-optimizations/kv-cache-aware-request-routing.md) — Employs a weighted scoring system to route requests to replicas holding the necessary KV cache context. ([source](https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators))
- [Prefix Cache Reuse](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-optimizations/kv-cache-aware-request-routing/prefix-cache-reuse.md) — Maximizes cache hits by tracking state across servers and offloading excess data to secondary storage. ([source](https://llm-d.ai/docs/architecture))
- [Inference Deployment Orchestrators](https://awesome-repositories.com/f/artificial-intelligence-ml/llm-orchestrators/inference-deployment-orchestrators.md) — Provides a control plane for deploying model replicas and managing hardware accelerators with SLO-aware scaling.
- [Agnostic Control Planes](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/model-deployment-toolkits/hardware-agnostic-deployment/agnostic-control-planes.md) — Provides a hardware-agnostic control plane to manage diverse accelerators and ensure low-latency communication across different chips.
- [OpenAI-Compatible Model Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-serving-apis/openai-compatible-model-servers.md) — Provides an OpenAI-compatible API gateway for drop-in integration of large language model inference services. ([source](https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators))
- [Prefix-Aware Routing](https://awesome-repositories.com/f/artificial-intelligence-ml/prompt-caching/prefix-caching/prefix-aware-routing.md) — Routes incoming traffic to specific model replicas based on cached prompt prefixes to minimize redundant computation.
- [Latency Reduction Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-models/latency-reduction-techniques.md) — Implements a performance suite using speculative decoding and fused kernels to minimize token generation latency. ([source](https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators))
- [Activation and KV Cache Offloaders](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-optimizations/activation-and-kv-cache-offloaders.md) — Swaps memory blocks to host RAM to prevent context drops during long-context workloads. ([source](https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators))
- [Shared Storage Offloading](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-optimizations/activation-and-kv-cache-offloaders/shared-storage-offloading.md) — Moves memory blocks to a shared file system to decouple cache capacity from local GPU memory. ([source](https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale))
- [Cache-Aware Schedulers](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-optimizations/kv-cache-aware-request-routing/prefix-cache-reuse/cache-aware-schedulers.md) — Implements a scheduling system that leverages shared context and prefix caching to optimize throughput for multi-tenant environments. ([source](https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale))
- [KV-Cache Transport Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-transport-optimizations.md) — Reduces tail latency during context migration using adaptive congestion control and specialized libraries. ([source](https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale))
- [Asynchronous Batching Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/distributed-and-scaling-strategies/execution-strategies/asynchronous-batching-execution.md) — Executes latency-tolerant requests from message queues to maximize hardware utilization alongside interactive traffic. ([source](https://llm-d.ai/docs/guides))
- [Disaggregated Phase Scaling](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-optimization/disaggregated-phase-scaling.md) — Scales large models by separating prefill and decode stages using expert parallelism. ([source](https://llm-d.ai/docs/guides))
- [Adapter-Aware Routing](https://awesome-repositories.com/f/artificial-intelligence-ml/model-weight-management/lora-adapter-loaders/adapter-aware-routing.md) — Routes traffic to specific nodes based on the location of loaded LoRA adapters to avoid redundant execution. ([source](https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale))
- [Offline Batch Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/offline-batch-inference.md) — Provides a mechanism to run large-scale asynchronous inference via compatible APIs to maximize total hardware utilization. ([source](https://cdn.jsdelivr.net/gh/llm-d/llm-d@main/README.md))
- [SLO-Driven Predictive Routing](https://awesome-repositories.com/f/artificial-intelligence-ml/slo-driven-predictive-routing.md) — Scores endpoints by predicting time-to-first-token and inter-token latency to ensure requests meet defined performance targets.

### Data & Databases

- [LLM KV Cache Stores](https://awesome-repositories.com/f/data-databases/distributed-caching/llm-kv-cache-stores.md) — Optimizes memory use and prefix reuse by offloading KV caches to CPU or shared storage and routing via affinity.
- [Inference Batching](https://awesome-repositories.com/f/data-databases/request-batching/inference-batching.md) — Manages large volumes of offline inference requests through queuing and flow control to maximize hardware utilization.
- [Saturation-Based Scaling](https://awesome-repositories.com/f/data-databases/horizontal-scaling/saturation-based-scaling.md) — Reactive optimizer that adjusts the number of replicas by monitoring queue length to prevent request overflow. ([source](https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators))
- [Asynchronous Gating](https://awesome-repositories.com/f/data-databases/request-batching/inference-batching/asynchronous-gating.md) — Implements flow-control gating to execute offline batch requests using spare hardware capacity.

### DevOps & Infrastructure

- [LLM Replica Autoscaling](https://awesome-repositories.com/f/devops-infrastructure/autoscaling-systems/llm-replica-autoscaling.md) — Automatically adjusts the number of model replicas based on queue depth and memory pressure to maintain latency targets.
- [Model Deployments](https://awesome-repositories.com/f/devops-infrastructure/cloud-infrastructure-deployment/model-deployments.md) — Deploys and manages multiple instances of model replicas on hardware accelerators for scalable request processing. ([source](https://llm-d.ai/docs/getting-started/quickstart))
- [Traffic Load Balancers](https://awesome-repositories.com/f/devops-infrastructure/traffic-load-balancers.md) — Provides a proxy-based distribution system to balance incoming model traffic across multiple server replicas. ([source](https://llm-d.ai/docs/getting-started/quickstart))
- [Cache-Aware Load Balancing](https://awesome-repositories.com/f/devops-infrastructure/traffic-load-balancers/inference-load-balancers/cache-aware-load-balancing.md) — Distributes inference requests across server pools using real-time metrics, predicted latency, and cache locality.
- [Pressure-Based Scaling](https://awesome-repositories.com/f/devops-infrastructure/autoscaling-systems/pressure-based-scaling.md) — Adjusts the number of active replicas by monitoring cache saturation and queue length to maintain system stability.
- [Capacity Scaling](https://awesome-repositories.com/f/devops-infrastructure/cluster-node-management/capacity-scaling.md) — Adjusts compute resources using native metrics and traffic routing to minimize costs while meeting latency targets. ([source](https://llm-d.ai/docs/architecture))
- [Queue-Based Scaling Triggers](https://awesome-repositories.com/f/devops-infrastructure/cluster-node-management/capacity-scaling/queue-based-scaling-triggers.md) — Handles traffic spikes using intelligent queuing and autoscales capacity based on real-time load metrics. ([source](https://llm-d.ai/docs/guides))
- [Scheduler High Availability](https://awesome-repositories.com/f/devops-infrastructure/high-availability-systems/scheduler-high-availability.md) — Ensures redundancy and accurate cache tracking through a high-availability scheduler that discovers server pods. ([source](https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale))
- [SLO-Aware Scaling](https://awesome-repositories.com/f/devops-infrastructure/infrastructure-scaling/slo-aware-scaling.md) — Implements production stability through SLO-aware autoscaling and flow control for multi-tenant environments. ([source](https://cdn.jsdelivr.net/gh/llm-d/llm-d@main/README.md))
- [Inference Job Management](https://awesome-repositories.com/f/devops-infrastructure/job-scheduling/high-performance-batch-jobs/inference-job-management.md) — Provides capabilities to queue offline requests and dispatch them using flow-control gating for high-volume workloads. ([source](https://llm-d.ai/docs/architecture))
- [Pod Autoscaling](https://awesome-repositories.com/f/devops-infrastructure/pod-autoscaling.md) — Autoscales container replicas by monitoring queue depth and memory saturation to maintain latency targets. ([source](https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale))
- [LLM Production Infrastructure](https://awesome-repositories.com/f/devops-infrastructure/production-deployment-tools/llm-production-infrastructure.md) — Sets up and manages high-availability infrastructure specifically designed for production LLM serving.
- [Cold-Start Load Balancing](https://awesome-repositories.com/f/devops-infrastructure/traffic-load-balancers/inference-load-balancers/cold-start-load-balancing.md) — Provides a specialized distribution strategy for cache-cold requests to prevent hotspots and oscillations in the request queue. ([source](https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators))

### Operating Systems & Systems Programming

- [Memory Offloading Frameworks](https://awesome-repositories.com/f/operating-systems-systems-programming/gpu-memory-optimizations/memory-offloading-frameworks.md) — Implements tiered cache offloading by moving memory blocks between GPU memory, host RAM, and shared storage for long-context workloads.

### Part of an Awesome List

- [Disaggregated Serving](https://awesome-repositories.com/f/awesome-lists/ai/disaggregated-serving.md) — Utilizes a disaggregated architecture to split prefill and decode stages across fast accelerator interconnects. ([source](https://cdn.jsdelivr.net/gh/llm-d/llm-d@main/README.md))

### Networking & Communication

- [Load Balancing Metrics](https://awesome-repositories.com/f/networking-communication/load-balancing-metrics.md) — Distributes requests across replicas using real-time queue depth and memory saturation metrics. ([source](https://llm-d.ai/docs/guides/optimized-baseline))
- [Predictive Latency Routing](https://awesome-repositories.com/f/networking-communication/network-infrastructure-routing/network-routing-traffic-management/network-traffic-management/multipath-latency-routing/predictive-latency-routing.md) — Implements a routing system that predicts time-to-first-token and inter-token latency to enforce performance targets. ([source](https://llm-d.ai/docs/architecture))
- [SLO-Based Server Placement](https://awesome-repositories.com/f/networking-communication/server-discovery/latency-aware-server-selections/slo-based-server-placement.md) — Ships a request packing system that compares predicted latency against defined SLO targets to optimize server placement. ([source](https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms))

### System Administration & Monitoring

- [Inference Batching Schedulers](https://awesome-repositories.com/f/system-administration-monitoring/concurrency-management-systems/inference-batching-schedulers.md) — Provides a request queuing system that balances interactive traffic with asynchronous offline workloads.
