VAR is a visual autoregressive model and image generation framework that applies large language model scaling laws to visual data. It functions as an image generator that uses a coarse-to-fine next-scale prediction approach rather than traditional raster-scan tokenization.
The system utilizes scale-based tokenization to represent images as a hierarchy of discrete tokens. It generates high-resolution content by iteratively predicting the next resolution level, refining coarse predictions into fine-grained details.
The project covers a broad range of capabilities including autoregressive image generation, visual scaling laws research, and visual content sampling. It incorporates classifier-free guidance to balance sample quality and diversity during the generation process.
The training infrastructure includes automated state management and checkpoint-based resumption to maintain progress during large-scale training runs.