1 repository
User-provided metadata accompanying inference requests to guide resource allocation and execution priority.
Distinct from Inference Request APIs: Focuses on guiding model execution (priority, length) rather than type-hinting or network queuing
Explore 1 awesome GitHub repository matching artificial intelligence & ml · Inference Request Hints. Refine with filters or upvote what's useful.
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
Allows users to specify latency priority, expected output length, and cache pinning to guide request handling.