Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients.
The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and memory. It employs a key-value cache-aware request router that directs queries to workers holding relevant cache entries to reduce recomputation. High-speed data transfer mechanisms move cache blocks and weights directly between GPU VRAMs over RDMA or NVLink to minimize latency.
The platform includes comprehensive capabilities for distributed fault tolerance, allowing in-flight requests to migrate and resume from failure points via token-state continuation. It features SLA-based autoscaling and performance profiling to right-size GPU pools and a Kubernetes-native operator for topology-aware scheduling. Additional support covers multimodal inference for images, video, and audio, alongside dynamic swapping of LoRA adapters.
Installation is available via wheels, container images, charts, and crates, with support for major Linux distributions and NVIDIA GPU architectures from Ampere through Blackwell.