Dynamo

Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients.

The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and memory. It employs a key-value cache-aware request router that directs queries to workers holding relevant cache entries to reduce recomputation. High-speed data transfer mechanisms move cache blocks and weights directly between GPU VRAMs over RDMA or NVLink to minimize latency.

The platform includes comprehensive capabilities for distributed fault tolerance, allowing in-flight requests to migrate and resume from failure points via token-state continuation. It features SLA-based autoscaling and performance profiling to right-size GPU pools and a Kubernetes-native operator for topology-aware scheduling. Additional support covers multimodal inference for images, video, and audio, alongside dynamic swapping of LoRA adapters.

Installation is available via wheels, container images, charts, and crates, with support for major Linux distributions and NVIDIA GPU architectures from Ampere through Blackwell.

Features

Disaggregated Inference Orchestration - Coordinates separate prefill and decode worker pools and manages the necessary KV cache transfers between them.

Prefill-Decode Disaggregation - Separates prompt processing and token generation onto independent GPU pools to optimize throughput and memory.

OpenAI-Compatible APIs - Exposes a standard HTTP frontend with completions endpoints for compatibility with existing LLM tools.

Activation and KV Cache Offloaders - Enables remote access and reuse of key-value cache blocks across nodes via high-speed interconnects.

KV-Cache-Aware Request Routing - Routes inference requests to GPUs based on KV cache affinity to avoid redundant computation.

LLM Serving Architectures - Implements high-performance architectures to coordinate API gateways and worker nodes for large-scale LLM serving.

Inference Orchestration - Provides a system for scaling and managing the distribution of prefill and decode inference workloads across multiple GPU nodes.

Disaggregated Phase Scaling - Adjusts the number of workers dedicated to prefill and decode phases separately based on real-time metrics.

Disaggregated Inference - Implements an architecture that separates prefill and decode stages across distinct hardware nodes to optimize throughput.

Multi-Backend Inference Orchestration - Routes requests across various backend engines to manage serving and resource allocation within a single orchestration layer.

Prefix-Aware Routing - Directs requests to workers holding required prompt prefixes in memory to accelerate response times.

Token-State Continuation - Tracks accumulated token sequences mid-generation so a new worker can resume from the exact failure point.

Key-Value Cache Reuse - Retains and reuses key-value cache entries across multiple requests to accelerate repeated prompts.

Worker Cache State Tracking - Consumes cache events from backend workers to maintain an accurate view of cached blocks for optimized routing.

KV Cache Management - Allocates memory blocks across storage tiers to optimize the key-value cache and reduce prompt recomputation.

Remote Cache Block Sharing - Exchanges block metadata and access permissions across nodes via high-speed interconnects for remote cache reuse.

Inference Runtime Integrations - Integrates various inference engines into a distributed runtime for language, embedding, and multimodal models.

Autoscaling Systems - Profiles workloads and right-sizes GPU pools to meet specific latency SLAs while minimizing cost.

RDMA GPU Transfers - Moves KV cache blocks directly between GPU VRAMs over RDMA or NVLink without blocking forward passes.

KV Cache VRAM Transfers - Moves KV cache from prefill VRAM to decode VRAM without blocking ongoing GPU passes.

Inference - Implements a translation layer that connects vLLM and TensorRT-LLM into a unified block-oriented memory interface.

Inference Engine Adapters - Implements a multi-backend runtime adapter to connect different inference engines through a unified block-oriented memory interface.

Paged KV Cache Management - Implements a paged key-value cache that pools and reuses fixed-size blocks across requests.

Master-Worker Coordination - Coordinates the lifecycle and task assignment of inference engine workers across multiple backends.

Fault Tolerance - Provides systems for managing failures and ensuring resilience in distributed inference applications.

Resource Deployment Planning - Allocates compute resources across prefill and decode nodes to meet specified latency targets.

Workload Simulations - Mimics backend API behavior and synthetic traffic patterns to validate routing and infrastructure logic without consuming GPUs.

External Tool Execution - Coordinates communication between language models and external functions to execute programmatic tasks.

Inference Pipeline Observability - Exposes metrics, distributed traces, and dashboards for the complete inference pipeline.

Diffusion Models - Executes inference for language, embedding, vision, and diffusion models to generate text, images, or video.

Hardware Acceleration Support - Optimizes inference workloads for the latest NVIDIA GB200 hardware architectures.

Agentic Latency Optimizations - Implements per-request priorities and cache pinning to optimize performance for autonomous agent-driven workloads.

Warm Pool and Predictive Scaling - Forecasts request counts and sequence lengths using time-series methods to proactively scale GPU resources.

Multimodal Inference Engines - Processes images, video, and audio alongside text across multiple backend engines for complex workloads.

Speculative Decoding Strategies - Uses a smaller draft model to propose candidate tokens that a larger target model verifies in parallel.

Worker Availability Verification - Tracks active and ready inference nodes to ensure the router only directs traffic to healthy workers.

Adapter-Aware Routing - Routes requests to specific GPU nodes that have the required LoRA adaptation weights already loaded in memory.

Dynamic Adapter Swapping - Swaps LoRA adapters in and out at runtime from compatible storage without restarting the engine.

Remote Weight Streaming - Transfers model weights directly between GPUs over high-speed interconnects to reduce cold-start latency.

LoRA Adapter Interfaces - Provides interfaces for dynamically loading and switching between different LoRA adapters during inference.

Multi-Node - Spreads tensor-parallel inference across multiple hardware nodes using global NCCL communicators.

Token State Continuations - Captures and transfers token state from failed workers to resume generation from the exact failure point.

Disaggregated Serving - Separates prompt processing and token generation into distinct worker pools to optimize hardware utilization.

State Snapshots - Captures system state snapshots to enable recovery and consistent state replication across the cluster.

AI Memory Tiering - Manages block reuse and eviction across HBM, DRAM, and NVMe tiers to balance speed and capacity.

Load-Aware Request Routing - Routes inference requests to the least-loaded worker based on active cache utilization.

Auto-scaling Engines - Dynamically adjusts replica counts for prefill and decode engines based on traffic signals.

Inference Worker Recovery - Detects unhealthy workers, drains in-flight requests, and reroutes traffic to maintain service continuity.

Capacity Scaling - Computes scaling targets for GPU workers based on real-time throughput and load metrics.

Inference Scaling Analysis - Executes concurrency sweeps to identify saturation points and generate performance curves for the inference stack.

Topology-Aware Schedulers - Employs a topology-aware operator to optimally place interdependent inference components across racks and hosts.

Hardware Profile Deployments - Automatically profiles and deploys the optimal engine configuration based on specified model, hardware, and target requirements.

SLA-Driven Resource Planning - Simulates thousands of serving configurations offline to select optimal resource allocations for given SLAs.

Deployment Simulators - Evaluates thousands of serving configurations offline to find the optimal setup without consuming GPU resources.

Scaling Profiles - Generates performance data and selects optimal engine configurations for the system planner.

Cache-Utilization Scaling - Uses static thresholds on queue depth and cache utilization to scale engines without strict SLA targets.

Kubernetes Orchestrators - Orchestrates GPU-accelerated inference workloads in Kubernetes clusters using topology-aware scheduling.

Kubernetes Deployments - Launches and manages distributed inference workloads on Kubernetes clusters.

LLM Deployment Operators - Ships a controller that manages inference workloads using custom resource definitions and topology-aware scheduling on Kubernetes.

Kubernetes Operators - Provides a Kubernetes operator to manage the deployment and scaling of inference workloads using custom resource definitions.

Adapter Management - Dynamically loads and removes fine-tuned LoRA adapters from storage without restarting the inference engine.

Queue Reordering - Controls request reordering in the router queue using policies like first-come or shortest-processing-time.

Inference Load Balancers - Distributes incoming requests among backend workers using configurable scheduling policies.

Data-Parallel Rank Balancing - Distributes requests across data-parallel ranks using external control for hybrid load balancing.

Dynamic Capacity Adjustment - Allows runtime modification of worker pool capacity for prefill and decode phases without requiring application restarts.

Independent Model Component Scaling - Runs the vision encoder as a separate worker that scales independently from the language model.

Token Streaming - Provides real-time delivery of generated tokens to the client to minimize perceived latency.

Multi-Architecture Support - Provides execution support across a wide range of NVIDIA GPU architectures from Ampere through Blackwell.

Inference Request Queuing - Manages the order and dispatch of concurrent inference requests to optimize time-to-first-token.

Generation State Migration - Tracks accumulated token sequences mid-generation to allow a new worker to resume from the exact failure point.

Hierarchical Request Routing - Routes requests through a control layer to separate pools optimized for different request classes.

LLM Performance Analyzers - Ships tools for identifying execution and logic bottlenecks specifically within language model serving configurations.

Request Migration Monitors - Tracks request migrations and sequence length violations to identify reliability patterns in the distributed serving layer.

Inference Performance Monitoring - Tracks model-serving metrics such as GPU utilization, request throughput, and response latency.

Inference Metric Recording - Exposes engine-level metrics and monitoring data, including token counts and confidence scores, for the serving pipeline.

Performance Visualization - Provides visual dashboards and plots to compare model and hardware configurations using telemetry data.

Kubernetes Custom Resource Definitions - Uses Kubernetes Custom Resource Definitions to declaratively manage worker groups and topology placement.

Inference Latency Targets - Uses estimates and tuning to scale engines toward precise time-to-first-token and inter-token latency targets.

Performance Profiling - Measures engine performance data before deployment to determine hardware allocation and scaling requirements.

Request Migrations - Moves active requests to healthy workers during failures to prevent request loss.

Data-Parallel Rank Routing - Directs requests to a specific data-parallel replica based on rank assignment to ensure consistent reuse.

Inference Engines - Distributed inference serving framework for datacenter scales.

Model Serving & Deployment - Optimizes inference for generative AI in distributed environments.

ai-dynamodynamo

Features

Star history