This is a framework for training and sampling diffusion models that generate high-fidelity images, video, and 4D assets. It provides a modular environment for managing generative training pipelines, including dataset handling, noise sampling, and loss weighting to stabilize training.
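To make the noise sampling and loss weighting roles concrete, here is a minimal sketch of one common pairing: a log-normal noise-level sampler and the matching EDM-style weight. The function names, the constants (`p_mean`, `p_std`, `SIGMA_DATA`), and the choice of this particular scheme are assumptions for illustration, not a description of this project's defaults.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed standard deviation of the training data


def sample_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Draw a noise level whose logarithm is normally distributed."""
    return math.exp(rng.gauss(p_mean, p_std))


def loss_weight(sigma, sigma_data=SIGMA_DATA):
    """EDM-style weight: large for small sigma, roughly 1 / sigma_data**2 for large sigma."""
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2


def weighted_loss(denoised, clean, sigma):
    """Mean squared error between denoiser output and clean data, scaled by the weight."""
    mse = sum((d - c) ** 2 for d, c in zip(denoised, clean)) / len(clean)
    return loss_weight(sigma) * mse


rng = random.Random(0)
sigmas = [sample_sigma(rng) for _ in range(4)]  # one noise level per training example
```

The weight keeps the per-example loss on a comparable scale across noise levels, which is the sense in which weighting can stabilize training.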
The project features a modular model configuration system that uses YAML-based assembly to define network submodules and conditioners. It also includes a dedicated toolset for AI image watermarking, allowing for the embedding and detection of invisible markers to verify the origin of generated media.
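The configuration-driven assembly described above can be sketched as follows. This is a hypothetical illustration: the `target`/`params` key names and the `instantiate_from_config` helper are assumptions, the config is written as a Python dict standing in for a parsed YAML file, and standard-library classes stand in for real network and conditioner modules.

```python
import importlib


def instantiate_from_config(config):
    """Build an object from a {"target": "pkg.mod.Class", "params": {...}} mapping.

    Any parameter that is itself such a mapping is built first, so a
    top-level entry can assemble its own submodules.
    """
    module_path, _, attr = config["target"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), attr)
    params = {
        key: instantiate_from_config(value)
        if isinstance(value, dict) and "target" in value
        else value
        for key, value in config.get("params", {}).items()
    }
    return cls(**params)


# Stand-in for a parsed YAML file; the targets are placeholders, not real model classes.
config = {
    "target": "types.SimpleNamespace",
    "params": {
        "network": {
            "target": "fractions.Fraction",
            "params": {"numerator": 3, "denominator": 4},
        },
        "conditioner": {
            "target": "datetime.timedelta",
            "params": {"hours": 2},
        },
    },
}

model = instantiate_from_config(config)
```

Resolving classes by dotted path lets a submodule be swapped by editing the configuration alone, without touching the code that assembles the model.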
The system supports text-to-image generation and novel-view video synthesis, which turns a single input video into a consistent 4D asset. It also supports latent diffusion sampling with customizable numerical solvers, along with conditioning mechanisms that use external embedders to steer generation.
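As a rough illustration of sampling with a pluggable numerical solver, the sketch below integrates the probability-flow ODE dx/dσ = (x − D(x, σ)) / σ with either an Euler or a Heun step. Everything here is assumed for the example: the function names, the Karras-style noise schedule, and the toy denoiser, which models data concentrated at a single value `TARGET`. A scalar stands in for a latent tensor.

```python
import random


def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Decreasing noise levels from sigma_max to sigma_min, followed by 0."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]


def euler_step(denoise, x, sigma, sigma_next):
    """First-order step along the ODE."""
    d = (x - denoise(x, sigma)) / sigma
    return x + (sigma_next - sigma) * d


def heun_step(denoise, x, sigma, sigma_next):
    """Second-order step: average the slopes at both ends of the interval."""
    d = (x - denoise(x, sigma)) / sigma
    x_euler = x + (sigma_next - sigma) * d
    if sigma_next == 0.0:
        return x_euler  # the slope is undefined at sigma = 0, so fall back to Euler
    d_next = (x_euler - denoise(x_euler, sigma_next)) / sigma_next
    return x + (sigma_next - sigma) * 0.5 * (d + d_next)


def sample(denoise, step, sigmas, seed=0):
    """Start from pure noise at sigmas[0] and integrate down to sigma = 0."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = step(denoise, x, sigma, sigma_next)
    return x


TARGET = 1.5  # toy data distribution: every sample equals this value


def toy_denoiser(x, sigma):
    """Ideal denoiser for point-mass data: always predict TARGET."""
    return TARGET


sigmas = karras_sigmas(10)
x_euler = sample(toy_denoiser, euler_step, sigmas)
x_heun = sample(toy_denoiser, heun_step, sigmas)
```

Because the solver is passed in as a plain function, swapping the integration scheme does not change the sampling loop. With this toy denoiser both solvers should land on `TARGET` up to floating-point error; a trained network would replace `toy_denoiser`.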