This is a PyTorch implementation of a text-to-image model that synthesizes high-fidelity images from natural language descriptions. A diffusion-based generator turns latent embeddings into images through iterative denoising. Generation is two-stage: a CLIP-based latent prior maps text embeddings to image embeddings, and a decoder then renders those embeddings as pixels. The decoder is a cascade of diffusion models that reaches high resolution by passing each low-resolution output to the next model at a larger scale.
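The cascading step in the description above can be illustrated with a toy control-flow sketch. It is not taken from the project: `upsample_nearest`, `run_cascade`, and `toy_refiner` are hypothetical names, images are plain nested lists, and the "refiner" is a trivial stand-in for a real super-resolution diffusion model. It only shows how each stage receives the upsampled output of the previous one.

```python
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image as nested lists


def upsample_nearest(img: Image, factor: int = 2) -> Image:
    """Nearest-neighbour upsampling: repeat each pixel `factor` times per axis."""
    out: Image = []
    for row in img:
        wide = [px for px in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def run_cascade(base: Image, stages: List[Callable[[Image], Image]]) -> Image:
    """Pass the low-resolution output through each stage in order."""
    img = base
    for refine in stages:
        # Upsample first, then let the stage refine the larger image.
        img = refine(upsample_nearest(img))
    return img


def toy_refiner(img: Image) -> Image:
    # Hypothetical stand-in for a super-resolution diffusion model:
    # it only clamps pixel values to [0, 1].
    return [[min(1.0, max(0.0, px)) for px in row] for row in img]


base_image = [[0.0, 1.0], [0.5, 0.25]]  # pretend 2x2 base-stage output
final_image = run_cascade(base_image, [toy_refiner, toy_refiner])
print(len(final_image), len(final_image[0]))  # 8 8
```

With two stages that each double the resolution, the 2x2 base output becomes 8x8. In a real cascade each stage would be a conditional diffusion model rather than a clamp.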
StableCascade is a generative latent diffusion framework for text-to-image synthesis and image-to-image transformation. Its multi-stage cascade architecture encodes images into a latent space and decodes them back to produce high-fidelity images. The cascade diffusion pipeline supports control over image structure through inpainting, outpainting, and super-resolution. It also provides a toolkit for image-to-image generation and for creating image variations from embeddings. The framework supports model optimization through low-rank adaptati
DiT is a latent diffusion model and transformer-based generative AI framework implemented in PyTorch. It functions as a class-conditional image generator that replaces traditional convolutional backbones with a transformer architecture to synthesize high-fidelity images. The project utilizes patch-based latent processing and latent space compression to operate on low-dimensional image representations. It incorporates class-conditional guidance and adjustable guidance scales to control the visual content of generated images during the sampling process. The framework covers distributed model t
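The adjustable guidance scale mentioned above can be sketched as follows, assuming the guidance is classifier-free guidance, where the sampler combines a class-conditional and an unconditional noise prediction. The function name `apply_guidance` and the flat-list representation are illustrative assumptions, not the project's API.

```python
from typing import List


def apply_guidance(eps_uncond: List[float],
                   eps_cond: List[float],
                   scale: float) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, extrapolating when scale > 1."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # toy unconditional noise prediction
eps_cond = [1.0, 3.0]     # toy class-conditional noise prediction

guided = apply_guidance(eps_uncond, eps_cond, scale=4.0)
print(guided)  # [4.0, 9.0]
```

A scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and larger values push samples harder toward the requested class, usually at some cost in diversity.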
This is a PyTorch-based implementation of diffusion models for synthesizing photorealistic images and video. It provides a framework for text-to-image and text-to-video generation, as well as unconditional image synthesis. The system utilizes a cascading diffusion pipeline to produce high-resolution imagery by passing low-resolution outputs through a sequence of super-resolution models. It also includes capabilities for image inpainting, allowing the reconstruction of masked or missing regions of visual media guided by surrounding context and text prompts. The project includes tools for diff
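One common way to realize the inpainting described above is to overwrite the unmasked pixels with the known image at each sampling step, so that only the masked region is actually generated. The sketch below shows just that compositing step on toy nested lists. It is an assumption about the general technique, not a description of this project's code, and `composite_known_region` is a hypothetical name.

```python
from typing import List

Image = List[List[float]]


def composite_known_region(generated: Image, known: Image, mask: Image) -> Image:
    """Keep `known` where mask == 0 and take `generated` where mask == 1."""
    return [
        [m * g + (1.0 - m) * k for g, k, m in zip(g_row, k_row, m_row)]
        for g_row, k_row, m_row in zip(generated, known, mask)
    ]


known = [[1.0, 2.0], [3.0, 4.0]]      # original image
generated = [[9.0, 9.0], [9.0, 9.0]]  # toy model output for this step
mask = [[0.0, 1.0], [0.0, 0.0]]       # 1 marks the pixel to regenerate

result = composite_known_region(generated, known, mask)
print(result)  # [[1.0, 9.0], [3.0, 4.0]]
```

In an actual sampler the known image would first be noised to the current timestep before compositing, so that both regions share the same noise level.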