TurboDiffusion is an inference engine for video diffusion models that generates high-resolution videos from text prompts and images. It provides a runtime for executing optimized diffusion model checkpoints, with a focus on reducing latency and GPU memory usage.
The project also includes a training framework for aligning sparse-linear attention models with pretrained full-attention models. The framework supports sparse attention parameter merging and sparse-linear model alignment, with the goal of reducing computational cost during inference while maintaining output quality.
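As a rough illustration of why sparse attention lowers inference cost, the sketch below restricts a single query to its k highest-scoring keys instead of all of them. It is a generic top-k formulation in plain Python, not TurboDiffusion's kernel and not its sparse-linear attention; the function name `topk_sparse_attention` and its parameters are hypothetical and chosen only for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_attention(q, keys, values, k):
    """Attend one query to only its k highest-scoring keys (hypothetical helper)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, key) * scale for key in keys]
    # Keep the indices of the k largest scores; all other keys are skipped.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    out = [0.0] * len(values[0])
    for w, i in zip(weights, keep):
        for d in range(len(out)):
            out[d] += w * values[i][d]
    return out


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

sparse_out = topk_sparse_attention(q, keys, values, k=1)  # -> [1.0, 2.0]
full_out = topk_sparse_attention(q, keys, values, k=len(keys))
```

With k equal to the number of keys, the function reduces to ordinary softmax attention. Smaller k trades accuracy for fewer score-value products per query, which is the gap an alignment step against a full-attention model would aim to close.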
The engine implements several performance optimizations: weight quantization for consumer-grade hardware, timestep distillation to reduce the number of inference steps, and sparse-attention approximations. It also provides an interactive inference server with a terminal interface that supports stateful, multi-turn video generation, so the model does not have to be reloaded between turns.
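To make the memory saving from weight quantization concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, one common scheme. The description above does not say which scheme or bit width TurboDiffusion uses, so this is an assumption for illustration only; `quantize_int8` and `dequantize_int8` are hypothetical names.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] with one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    # Fall back to a scale of 1.0 for an all-zero tensor to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding bounds the per-weight error by half a quantization step.
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four (for float32), at the cost of a rounding error of at most `scale / 2` per weight. Production systems typically use finer-grained scales, such as per-channel or per-block, to keep that error small.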