vllm-omni is a high-throughput serving engine and distributed inference framework for omni-modal models. It runs as a multi-modal API server that generates text, image, video, and audio output and exposes a standardized interface to remote clients.
The system includes a non-autoregressive generation engine that produces media in parallel, and a robot policy inference server that acts as a real-time communication bridge to robotic hardware over specialized protocols. It also supports hybrid execution, which combines sequential token generation with parallelized media generation to reduce output latency.
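To make the hybrid execution idea concrete, here is a minimal sketch, not vllm-omni's actual implementation. It assumes a two-stage pipeline: a sequential stage where each token depends on the previous ones, followed by a parallel stage where independent chunks are rendered concurrently. Every name in it (`generate_tokens`, `render_chunks`, `toy_next_token`, `toy_render`) is hypothetical and stands in for a real model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_tokens(prompt: List[int], steps: int,
                    next_token: Callable[[List[int]], int]) -> List[int]:
    """Sequential (autoregressive) stage: each token depends on all earlier ones."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def render_chunks(tokens: List[int], chunk_size: int,
                  render: Callable[[List[int]], bytes]) -> List[bytes]:
    """Parallel (non-autoregressive) stage: chunks are independent of each other."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    with ThreadPoolExecutor() as pool:
        # map() returns results in input order even though work runs concurrently.
        return list(pool.map(render, chunks))


def toy_next_token(context: List[int]) -> int:
    # Hypothetical stand-in for one language-model decoding step.
    return (sum(context) + len(context)) % 100


def toy_render(chunk: List[int]) -> bytes:
    # Hypothetical stand-in for a media decoder pass over one chunk.
    return bytes(chunk)


new_tokens = generate_tokens([1, 2, 3], steps=8, next_token=toy_next_token)
media = render_chunks(new_tokens, chunk_size=4, render=toy_render)
```

The latency benefit in this sketch comes only from the second stage: the token loop cannot be parallelized across steps, while the chunk rendering can.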
The framework scales workloads across devices through tensor parallelism and multi-stage model sharding, and manages memory and request scheduling with paged-attention caching and continuous batching. It also includes tools for benchmarking serving throughput with randomized prompts.
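The paged-attention caching mentioned above can be illustrated with a toy block-table allocator. This is a simplified sketch under stated assumptions, not the framework's own code: the class `PagedKVAllocator` and its methods are hypothetical, and it tracks only block bookkeeping, not actual key/value tensors.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy block-table allocator: each sequence maps to fixed-size cache blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.seq_lens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # existing blocks are full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")  # 3 tokens occupy 2 blocks
alloc.append_token("b")      # 1 token occupies 1 block
blocks_in_use = 4 - len(alloc.free_blocks)
alloc.free("a")              # blocks become available to other requests
```

Because blocks are allocated on demand and returned as soon as a sequence finishes, a scheduler can admit new requests into a running batch without reserving worst-case memory up front, which is the property continuous batching relies on.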