Sana is a framework for high-resolution image and video synthesis built on a linear diffusion transformer. It provides a toolkit for training, fine-tuning, and running text-to-image and text-to-video models, as well as a generative video world model capable of simulating physical environments with precise spatial control.
The project is distinguished by its use of linear-complexity layers, whose cost grows linearly rather than quadratically with the number of tokens, to handle high resolutions, and by its support for real-time generation of minute-length video. It implements a two-stage inference paradigm that separates structural generation from visual texture refinement, and it uses block-based caching to maintain temporal consistency across extended sequences.
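To illustrate why a linear-complexity layer scales to high resolutions, here is a minimal pure-Python sketch of kernelized linear attention. It is not taken from the Sana codebase: the ReLU feature map, the function names `linear_attention` and `quadratic_reference`, and the `eps` stabilizer are all assumptions made for illustration. The point is that the key/value summary is accumulated once and reused for every query, so the work grows linearly with the number of tokens.

```python
def relu(vector):
    """Element-wise ReLU, used here as an assumed non-negative feature map."""
    return [max(0.0, x) for x in vector]


def linear_attention(queries, keys, values, eps=1e-6):
    """Attention computed in O(N) by summarizing keys and values once.

    queries, keys: lists of N vectors of size d_k; values: N vectors of size d_v.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    k_sum = [0.0] * d_k                     # running sum of phi(k)
    for k, v in zip(keys, values):
        fk = relu(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]

    outputs = []
    for q in queries:
        fq = relu(q)
        denom = sum(fq[a] * k_sum[a] for a in range(d_k)) + eps
        outputs.append(
            [sum(fq[a] * kv[a][b] for a in range(d_k)) / denom for b in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values, eps=1e-6):
    """Same result computed pairwise in O(N^2), for comparison only."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        fq = relu(q)
        weights = [sum(x * y for x, y in zip(fq, relu(k))) for k in keys]
        denom = sum(weights) + eps
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / denom for b in range(d_v)]
        )
    return outputs
```

Both functions return the same values up to floating-point rounding. The first never forms the N-by-N weight matrix, which is what makes long token sequences (large images or many video frames) tractable.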
The framework covers a broad range of capabilities, including supervised fine-tuning, reinforcement learning via reward model integration, and image model personalization. It supports advanced video controls such as camera trajectory adherence, image-to-video synthesis, and streaming video editing.
Efficiency is addressed through model weight quantization, VRAM-reduction techniques, and sharded data parallelism for large-scale training.
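As a rough illustration of what weight quantization involves, the sketch below applies symmetric per-tensor 8-bit quantization to a flat list of weights. This is a generic textbook scheme, not necessarily the one the project uses; the function names `quantize_int8` and `dequantize` are hypothetical, and the source does not state the bit width or calibration method.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Fall back to a scale of 1.0 when every weight is zero to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]
```

Storing the integers plus a single scale takes roughly a quarter of the memory of 32-bit floats, at the cost of a rounding error of at most half a quantization step per weight.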