Vllm Omni

vllm-omni is a high-throughput serving engine and distributed inference framework designed for omni-modal models. It serves as a multi-modal model API server capable of generating text, image, video, and audio data, providing a standardized interface for remote client access.

The system features a non-autoregressive generation engine for parallel media production and a robot policy inference server that acts as a real-time communication bridge to robotic hardware using specialized protocols. It supports hybrid execution models that combine sequential token generation with parallelized media generation to optimize output latency.

The framework covers distributed workload scaling through tensor parallelism and multi-stage model sharding, alongside memory management via paged-attention caching and continuous batching. It also includes tools for measuring serving throughput and performance benchmarking using randomized prompts.

Features

Distributed Inference Frameworks - Implements a high-throughput coordination system for executing omni-modal models across multiple hardware accelerators and workers.

Model Serving Frameworks - Serves as a high-throughput runtime for deploying and accessing omni-modal models that generate text, image, video, and audio.

Continuous Batching Strategies - Implements techniques to dynamically insert new requests into active inference batches to maximize hardware utilization.

OpenAI-Compatible Model Servers - Exposes multi-modal models through a standardized OpenAI-compatible HTTP interface for drop-in integration.

vLLM Backend Runners - Serves as a high-throughput runtime for omni-modal models using vLLM's PagedAttention and tensor parallelism.

Model Servers - Ships a standardized API server for deploying and accessing multi-modal models via completions endpoints.

Multi-Modal Tokenizers - Provides systems to convert text, audio, image, and video into unified numerical sequences for model processing.

Model Serving Interfaces - Deploys models integrating text, audio, image, and video capabilities via unified serving interfaces.

Omni-Modal Model Deployment - Runs models that integrate text, audio, image, and video across various hardware accelerators.

Policy Deployments - Connects multi-modal model runtimes to robotic hardware interfaces for real-time communication and control.

Hybrid Sequential-Parallel Generation - Combines sequential token generation for text with parallelized generation for media to optimize output latency.

Omni-Modal Generation - Generates text, image, video, and audio data using hybrid autoregressive and non-autoregressive model architectures.

Tensor Parallelism - Partitions model weights across multiple processing units to enable inference for models exceeding single-device memory.

Generation Engines - Provides a parallel execution runtime for generating media content without the latency of token-by-token processing.

Real-time Policy Execution - Establishes real-time communication links and maps model outputs to physical motor control signals via specialized protocols.

Policy Servers - Provides a real-time communication bridge that serves robotic policies to hardware devices using specialized protocols.

Media Deployment - Executes non-autoregressive architectures to generate media content in parallel rather than one token at a time.

Inference Pipeline Sharding - Divides the model architecture into discrete stages distributed across a cluster to balance the computational load.

Inference Sequence Schedulers - Coordinates the execution of autoregressive tasks using specialized schedulers and caching to improve response times.

Distributed Execution - Spreads model computation across multiple processing units to increase throughput for massive deployments.

Multi-Modal Output Streaming - Implements incremental streaming of generated multi-modal content to reduce perceived latency for the client.

Prediction Workload Distribution - Distributes model prediction computations across worker nodes to handle large scale deployments.

Parallel Media Processing - Provides a non-autoregressive generation engine to produce visual and audio content in parallel for reduced latency.

Hardware Control Signal Mapping - Translates high-level model outputs into real-time control signals compatible with specialized robotic hardware communication standards.

Distributed Inference - Provides a connectivity layer for decentralized GPU worker networks executing multi-modal inference.

Paged KV Cache Management - Manages key-value cache states using fixed-size non-contiguous blocks to reduce memory fragmentation.

Asynchronous Request Pipelines - Handles high-concurrency API requests using an asynchronous event loop to decouple network I/O from model compute.

Distributed Coordination Systems - Coordinates model execution across multiple workers and stages to balance processing loads and maintain system stability.

Generative Model Serving Benchmarks - Measures throughput and latency of generative models served by the engine to evaluate serving configurations.

Model Runtime Interfaces - Creates a consistent server interface allowing existing ecosystem tools and clients to interact with the model runtime.

vllm-projectvllm-omni

Features

Star history