Cosmos

Cosmos is an open platform of world models, datasets, and tools for building physical AI systems such as robots and autonomous vehicles. It provides video generation and video understanding models that can generate synthetic videos and world simulations from text, image, video, or action inputs, and analyze videos to produce captions, event timestamps, spatial bounding boxes, and next-action predictions.

The platform includes a world simulation generator that produces images, videos, synchronized audio, and action-conditioned rollouts for synthetic data, alongside a visual content analyzer that extracts structured text outputs for robotics and autonomous systems. Cosmos offers a model fine-tuning framework with checkpoint-based recipes to adapt pre-trained models on custom video, action, or reasoning datasets, and exposes its reasoner and generator models through a standard OpenAI-compatible chat-completions API endpoint for production inference.

The platform's architecture combines multi-modal encoder fusion, a video tokenizer with diffusion backbone, causal transformer reasoning, and synchronized audio-visual generation. It supports application domains including autonomous vehicle development, robotics training data generation, physical AI world simulation, visual content understanding, and production model serving.

Features

Physical AI World Generators - An open platform of world models, datasets, and tools for building physical AI systems like robots and autonomous vehicles.

Simulation Data Generators - Creating simulated driving scenarios and analyzing visual data to train perception and planning models for self-driving cars.

Cross-Attention Fusion Layers - Combines text, image, video, and action inputs into a unified latent space using cross-attention layers for flexible conditioning.

Video Tokenizers - Converts raw video frames into discrete latent tokens and reconstructs them using a diffusion-based decoder for high-fidelity generation.

Video Content Analyzers - Analyzes images and videos to produce text outputs such as captions, event timestamps, spatial bounding boxes, and next-action predictions for robotics and autonomous systems.

OpenAI-Compatible Model Servers - Exposes reasoner and generator models behind an OpenAI-compatible API endpoint for production inference.

Synthetic Data Generators - Producing diverse synthetic video and action sequences to train robot manipulation and navigation policies without real-world data.

World Simulation Generators - Generates synthetic videos and world simulations from text, image, video, or action inputs for training and testing.

World Simulation Generators - Generates images, videos, synchronized sound, and action-conditioned rollouts from text, image, video, or action inputs for world simulation and synthetic data.

Video Understanding Models - Analyzes videos to produce captions, event timestamps, spatial bounding boxes, and next-action predictions.

Visual Content Analyzers - Analyzing images and video to extract captions, event timestamps, spatial bounding boxes, and next-action predictions for automation.

Video Token Reasoners - Processes video tokens autoregressively to output structured text predictions like captions, bounding boxes, and next actions.

World Model Rollout Pipelines - Feeds action sequences as conditioning signals into the world model to generate temporally consistent future frames.

Action-Conditioned Fine-Tuning - Adapting pre-trained world models on proprietary video or action datasets using supervised fine-tuning recipes for specialized behavior.

Checkpoint-Based Fine-Tuning Recipes - Provides supervised fine-tuning recipes to adapt pre-trained checkpoints on custom video, action, or reasoning datasets.

World Model Fine-Tuning - Adapts pre-trained checkpoints on custom video, action, or reasoning datasets using supervised fine-tuning recipes for task-specific behavior.

Inference API Servers - Exposes reasoner and generator models behind a standard chat-completions endpoint for production inference.

Video Checkpoint Fine-Tuning - Provides pre-trained model weights and supervised fine-tuning scripts that adapt the backbone to custom video or action datasets.

Model Serving Endpoints - Exposing reasoner and generator models behind a standard chat-completions API endpoint for scalable inference deployment.

Generative - Generates temporally aligned audio tracks alongside video frames using a shared latent representation and joint decoder.

NVIDIAcosmos

Features

Star history