Sana | Awesome Repository

Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control.

The project is distinguished by its linear-complexity layers, which keep high-resolution synthesis tractable, and by its support for real-time generation of minute-long video. It implements a two-stage inference paradigm that separates structural generation from visual texture refinement, and it uses block-based caching to maintain temporal consistency across extended sequences.
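To make "linear complexity" concrete, the following is a minimal NumPy sketch of linear attention. It is an illustration under stated assumptions, not code from the Sana repository: the ReLU feature map, the function names linear_attention and quadratic_reference, and the eps constant are all chosen here for the example. The point it shows is that multiplying keys and values first produces a small (d, d_v) summary, so the cost grows linearly with the number of tokens n instead of quadratically.

```python
import numpy as np


def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention. q, k: (n, d); v: (n, d_v).

    Assumption: a ReLU feature map stands in for whatever kernel
    the real model uses.
    """
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    kv = k.T @ v               # (d, d_v) summary, cost O(n * d * d_v)
    k_sum = k.sum(axis=0)      # (d,) normaliser terms
    num = q @ kv               # (n, d_v)
    den = q @ k_sum + eps      # (n,)
    return num / den[:, None]


def quadratic_reference(q, k, v, eps=1e-6):
    """Same result computed through the explicit (n, n) score matrix."""
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    scores = q @ k.T           # (n, n): this is the quadratic part
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)
```

Both functions return the same values up to floating-point error; only the order of the matrix products differs, which is what removes the (n, n) intermediate.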

The framework covers a broad range of capabilities, including supervised fine-tuning, reinforcement learning via reward model integration, and image model personalization. It supports advanced video controls such as camera trajectory adherence, image-to-video synthesis, and streaming video editing.

Performance is managed through model weight quantization, VRAM reduction techniques, and sharded data parallelism for large-scale training.
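The source does not say which quantization scheme the project uses. As a generic illustration of what weight quantization means, the sketch below applies symmetric per-row int8 quantization to a weight matrix. The function names quantize_int8 and dequantize_int8 are hypothetical, and the scheme is an assumption chosen for simplicity.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-row int8 quantization of a 2-D weight matrix.

    Generic illustration only; not the project's actual scheme.
    Returns the int8 codes and one float32 scale per row.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # Avoid division by zero for all-zero rows.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    codes = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover an approximate float32 weight matrix."""
    return codes.astype(np.float32) * scale
```

Storing int8 codes in place of float32 weights cuts the weight memory to roughly a quarter, at the cost of a per-element rounding error of at most half a scale step.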

Features

Text-to-Image Generators - Synthesizes high-resolution images from text prompts using a linear diffusion transformer to balance quality and efficiency.
Diffusion Transformers - Uses a diffusion transformer built on linear-complexity layers to handle high-resolution image and video synthesis.
Chunk-Causal Training - Trains video models by processing sequences in overlapping segments to maintain temporal consistency across long durations.
Constant-Memory Video Caching - Employs a fixed-size recurrent state to generate arbitrarily long video sequences without increasing memory usage.
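The fixed-size recurrent state mentioned in the last feature can be sketched as follows. This is a minimal illustration under assumptions, not the project's implementation: the class name BlockRecurrentState, the ReLU feature map, and the block-causal rule (each block attends to itself and to all earlier blocks) are chosen here for the example. The state holds two running sums whose shapes depend only on the feature dimensions, so memory stays the same however many blocks have been processed.

```python
import numpy as np


class BlockRecurrentState:
    """Running sums for block-causal linear attention (illustrative)."""

    def __init__(self, d, d_v, eps=1e-6):
        self.kv = np.zeros((d, d_v))   # sum over past tokens of phi(k)^T v
        self.k_sum = np.zeros(d)       # sum over past tokens of phi(k)
        self.eps = eps

    def step(self, q_blk, k_blk, v_blk):
        """Absorb one block of tokens and return its attention output.

        q_blk, k_blk: (b, d); v_blk: (b, d_v).
        """
        q = np.maximum(q_blk, 0.0)     # ReLU feature map (assumed)
        k = np.maximum(k_blk, 0.0)
        # Fold the new block into the state; the shapes never change.
        self.kv += k.T @ v_blk
        self.k_sum += k.sum(axis=0)
        den = q @ self.k_sum + self.eps
        return (q @ self.kv) / den[:, None]
```

Calling step once per block yields the same output as recomputing attention over the whole history each time, while earlier keys and values never need to be kept.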
