This is a PyTorch implementation of a text-to-image model designed for synthesizing high-fidelity images from natural language descriptions. It utilizes a diffusion image generator to transform latent embeddings into visual data through an iterative denoising process. The system employs a two-stage latent mapping process, using a CLIP-based latent prior to map text embeddings to image embeddings before decoding them into pixels. It features a cascading diffusion decoder that produces high-resolution imagery by passing low-resolution outputs through a sequence of models at increasing scales.
Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control. The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
Open-Sora is a video generation framework designed to produce cinematic sequences from text prompts and images. It functions as a generative system that transforms written descriptions or reference images into video content featuring realistic textures and lighting. The project includes a dedicated prompt engineering tool that uses large language models to expand simple user inputs into detailed descriptions. It also features a motion controller for adjusting movement intensity in generated sequences and evaluating motion levels in existing video files. The framework incorporates text-to-vid
Instruct-pix2pix is an instruction-based image model and PyTorch library designed to modify visual content by following natural language directions. It functions as a diffusion model image editor that applies human-written instructions to existing pictures rather than using traditional text-to-image prompts. The project provides a fine-tunable diffusion framework for adapting pre-trained checkpoints to specific image editing datasets. It includes a synthetic dataset generator that creates paired images and text triplets to train models on various image editing tasks. The system covers a rang