30 open-source projects similar to compvis/latent-diffusion, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Latent Diffusion alternative.
This is a PyTorch implementation of a text-to-image model designed for synthesizing high-fidelity images from natural language descriptions. It utilizes a diffusion image generator to transform latent embeddings into visual data through an iterative denoising process. The system employs a two-stage latent mapping process, using a CLIP-based latent prior to map text embeddings to image embeddings before decoding them into pixels. It features a cascading diffusion decoder that produces high-resolution imagery by passing low-resolution outputs through a sequence of models at increasing scales.
StableCascade is a generative AI system and latent diffusion framework designed for text-to-image synthesis and image-to-image transformations. It utilizes a multi-stage cascade architecture that encodes and decodes images via a latent space to produce high-fidelity imagery. The system includes a cascade diffusion pipeline for controlling image structure through inpainting, outpainting, and super-resolution. It also provides a toolkit for image-to-image generation and the creation of image variations using embeddings. The framework supports model optimization through low-rank adaptati
Kolors is a generative model implementation for synthesizing photorealistic images from natural language descriptions and visual references. It utilizes a latent diffusion model framework to produce high-fidelity imagery, operating within a compressed latent space to improve generation efficiency and quality. The system functions as a multilingual image generator, interpreting text prompts in multiple languages to produce semantically accurate visual outputs. It includes a custom model training pipeline that uses low-rank adaptation to teach the model specific subjects or artistic styles from
This project is an educational course and collection of training materials focused on generative diffusion models. It provides a curriculum and practical guides for training, fine-tuning, and deploying models capable of synthesizing images, audio, and video. The material covers specific implementation strategies including noise-based synthesis, iterative refinement, and latent space compression. It provides instruction on guiding generative outputs through conditional synthesis and prompt adherence optimization, as well as techniques for image inpainting and text-based editing. The project i
DiT is a latent diffusion model and transformer-based generative AI framework implemented in PyTorch. It functions as a class-conditional image generator that replaces traditional convolutional backbones with a transformer architecture to synthesize high-fidelity images. The project utilizes patch-based latent processing and latent space compression to operate on low-dimensional image representations. It incorporates class-conditional guidance and adjustable guidance scales to control the visual content of generated images during the sampling process. The framework covers distributed model t
This is a PyTorch-based implementation of diffusion models for synthesizing photorealistic images and video. It provides a framework for text-to-image and text-to-video generation, as well as unconditional image synthesis. The system utilizes a cascading diffusion pipeline to produce high-resolution imagery by passing low-resolution outputs through a sequence of super-resolution models. It also includes capabilities for image inpainting, allowing the reconstruction of masked or missing regions of visual media guided by surrounding context and text prompts. The project includes tools for diff
This project is a framework for training consistency models and performing diffusion model distillation. It functions as a few-step text-to-image generator and an image-to-image transformation tool designed to produce high-resolution visuals from text prompts or existing images. The system focuses on converting pre-trained diffusion models into consistency models to reduce the number of required inference steps. It enables the training of lightweight model adaptors to inject specific visual styles into large models without requiring full network fine-tuning. The project covers broad capabili
OOTDiffusion is an AI virtual try-on system designed for controllable image synthesis. It generates images of people wearing specific clothing items by superimposing garments onto human figures for both half-body and full-body compositions. The project facilitates digital fashion prototyping and virtual clothing fitting by creating garment-to-person overlays. It aims to maintain the original identity of the wearer and the specific details of the clothing during the synthesis process. The system utilizes a latent diffusion model and conditioning-based image generation to control the output. I
AudioLDM is a latent diffusion framework for generating high-fidelity audio, music, and sound effects. It functions as a text-to-audio generator that converts natural language descriptions into synthetic audio signals with control over pitch and environment. The system provides specialized tools for audio-to-audio synthesis and generative repair. This includes the ability to perform audio style transfer and replicate specific acoustic events based on existing files. The project covers a broad range of audio transformation tasks, including audio super-resolution for increasing signal fidelity
Stable Diffusion Web UI is a browser-based interface for generating, editing, and upscaling images and videos using latent diffusion models. It functions as a text-to-image generator, an AI image editor, and a tool for increasing image resolution and clarity. The system includes capabilities for custom model training, specifically allowing the creation of textual inversion embeddings to teach a model new concepts and visual styles from user photos. It also provides tools for AI video production, generating short clips from text prompts. The software covers image-to-image transformation, imag
VACE is a set of software tools and frameworks for reference-guided video generation, diffusion-based editing, and video-to-video translation. It provides utilities to produce new video content and modify existing sequences by using reference materials to guide visual style, subject matter, and composition. The framework enables video-to-video translation and synthesis, allowing visual styles and depth to be updated. It also functions as a video editor for modifying properties and content through reference-guided transformations. The system covers localized video editing and inpainting,
This is a framework for training and sampling diffusion models to generate high-fidelity images, video, and 4D assets. It provides a modular environment for managing generative AI training pipelines, including the handling of datasets, noise sampling, and loss weighting to stabilize the creation of synthetic content. The project features a modular model configuration system that uses YAML-based assembly to define network submodules and conditioners. It also includes a dedicated toolset for AI image watermarking, allowing for the embedding and detection of invisible markers to verify the origi
Stable Diffusion is a generative machine learning pipeline that synthesizes high-resolution visual content by performing iterative denoising within a compressed latent space. By mapping natural language embeddings into pixel outputs through conditioned probabilistic processes, the framework enables the generation of images from text prompts and the transformation of existing visual inputs based on semantic instructions. The architecture utilizes a modular execution environment that decouples model loading, scheduler logic, and inference components to support diverse hardware configurations. I
This project is a Dreambooth implementation designed to personalize Stable Diffusion models. It serves as an AI image personalization tool and model tuner that enables the creation of unique subject identifiers to generate consistent, personalized images. The system focuses on subject-driven image synthesis by fine-tuning pre-trained diffusion models on small, custom datasets. This allows the model to recognize specific people, objects, or artistic styles and place those learned subjects into diverse contexts via text-to-image conditioning. The implementation includes a diffusion model optim
IDM-VTON is an AI virtual try-on framework and fashion synthesis tool designed to generate realistic images of people wearing specific garments. It operates as a diffusion-based image generator that blends garment textures with human poses to create synthetic fashion imagery. The system implements virtual fitting room capabilities through a generative model that combines person and clothing inputs. It includes a web-based interface to run interactive visual demonstrations and synthesize try-on images in real-time. The framework covers the broader domain of AI fashion visualization, enabling
This project is a diffusion model training framework and image synthesis pipeline. It provides the tools necessary to train generative models to learn image data distributions through an iterative denoising process. The framework includes a generative model evaluation tool consisting of automated scripts used to measure the quality and accuracy of produced samples. The system covers model training pipelines and performance evaluation for generative diffusion models.
This is a collection of Jupyter notebooks that serve as educational guides for training, fine-tuning, and deploying machine learning models within the Hugging Face ecosystem. The notebooks cover the full lifecycle of model development, from loading and configuring pre-trained transformers to packaging trained models for real-time inference via scalable endpoints. The notebooks demonstrate a range of capabilities including diffusion model training and fine-tuning for image generation and editing, transformer model adaptation for natural language processing tasks, and parameter-efficient fine-t
Taming Transformers is a generative system for high-resolution image synthesis that combines a vector-quantized GAN image encoder with an autoregressive transformer. It utilizes a discrete latent space to represent images as codebook tokens, enabling the production of high-fidelity visuals through a hybrid architecture. The project provides specialized capabilities for layout-based scene synthesis, allowing for the creation of complex images by placing objects according to defined bounding box coordinates. It also includes tools for image inpainting to fill missing sections of an image by ana
Instruct-pix2pix is an instruction-based image model and PyTorch library designed to modify visual content by following natural language directions. It functions as a diffusion model image editor that applies human-written instructions to existing pictures rather than using traditional text-to-image prompts. The project provides a fine-tunable diffusion framework for adapting pre-trained checkpoints to specific image editing datasets. It includes a synthetic dataset generator that creates triplets of paired images and text to train models on various image editing tasks. The system covers a rang
This project is a PyTorch implementation of a text-to-image transformer. It is a generative AI model designed to map discrete text tokens to image pixels using a transformer network to create visual content from textual descriptions. The system utilizes a discrete VAE image encoder to compress visual data into tokens for transformer processing. It supports classifier-free guidance to adjust the influence of text prompts during inference and includes capabilities for ranking generated images based on their similarity to text prompts. The architecture incorporates sparse attention mechanisms a
Wan2.1 is a generative video synthesis framework that provides foundation models for creating high-fidelity video sequences and static images from descriptive text prompts. The system utilizes a unified architecture trained on both static and dynamic datasets, allowing it to function as a comprehensive tool for visual media creation. The framework distinguishes itself through a transformer-based temporal modeling approach that ensures structural coherence and consistent motion across video frames. It supports multi-resolution latent scaling, enabling the generation of content in various aspec
sd-scripts is a suite of utilities designed for fine-tuning generative models, preprocessing datasets, and converting model weights. It provides a collection of scripts for executing Stable Diffusion training through methods such as DreamBooth, textual inversion, and full fine-tuning, alongside a framework for creating and managing Low-Rank Adaptation weights. The project features specialized capabilities for model weight conversion between different architectures and precision formats. It includes tools for merging adaptation weights into base models, extracting weights from trained models,
Sygil-webui is a web interface for Stable Diffusion latent diffusion models, providing a creative suite for text-to-image and text-to-video synthesis. It functions as an image generation tool and a latent diffusion image editor, allowing users to create visuals and video sequences from textual descriptions. The project includes a dedicated model training interface for creating custom textual inversion embeddings, which introduces specific new concepts or styles into the diffusion models. It also features specialized tools for generative image editing, including mask-based inpainting, image-to
HunyuanDiT is a bilingual text-to-image generative model and diffusion transformer image generator. It uses a latent diffusion system to synthesize high-resolution images from text prompts, with a specific focus on understanding and generating content from both Chinese and English language descriptions. The project features a multi-resolution transformer architecture and a bilingual embedding space to map different scripts into a shared semantic area. It supports iterative multi-turn image refinement, which translates conversational dialogue into updated prompts to progressively modify visual
LatentSync is an audio-driven video generator and latent diffusion lip sync model designed to synchronize a speaker's lip movements in a video to a target audio track. It provides a lip synchronization training framework for developing synchronization networks on custom video and audio datasets. The system utilizes a video preprocessing pipeline to clean, segment, and align face data. It includes a visual sync evaluation tool that calculates confidence scores to measure the accuracy of audio-visual alignment in generated videos. The project covers capabilities for custom synchronization
Diffusers is a PyTorch-based library and generative AI framework used to build, train, and deploy diffusion pipelines for producing multi-modal media. It provides a suite of tools for generating images, video, and audio from natural language descriptions, as well as specialized systems for text-to-image generation. The project differentiates itself through a modular architecture that separates noise schedulers, pretrained model blocks, and pipeline compositions. This structure allows for the construction of custom generation workflows and the ability to swap individual components of the diffu
stable-diffusion.cpp is a high-performance C++ inference engine designed for generating images and video from text prompts using Stable Diffusion models. It functions as a latent diffusion model runtime and a lightweight machine learning framework that enables local diffusion model execution on consumer hardware. The project distinguishes itself as a CPU-based image generator capable of running without a dedicated GPU. It employs a specialized C++ tensor backend and cross-backend hardware abstraction to dispatch compute tasks across different processor instruction sets and graphics APIs. The
Shap-E is a generative 3D modeling system that creates three-dimensional digital assets from natural language descriptions or two-dimensional images. It functions as a generative model capable of producing three-dimensional implicit functions and assets. The project includes a 3D latent encoder that converts trimeshes and 3D models into latent representations using point clouds and multiview renders. It utilizes an image-to-3D generator to produce assets from synthetic view images and a text-to-3D generator to build shapes from text prompts. The system implements a pipeline involving latent
Text2Video-Zero is a text-to-video diffusion model and framework designed to synthesize temporally consistent video sequences from textual prompts. It functions as a zero-shot video generator, repurposing pre-trained image diffusion models to create video content without requiring additional training on video datasets. The system includes a conditional video synthesizer that allows for guided generation using depth, edge, or pose maps to control structural layout and movement. It also provides text-based video editing capabilities to modify the style or content of existing video clips through
CogVideo is a video generation framework and large language model architecture designed for synthesizing high-resolution video clips from natural language descriptions and images. It functions as a text-to-video and image-to-video generator, while also providing a video captioning model that summarizes visual content as descriptive text. The system supports animating static images into motion sequences and transforming series of images into video based on prompts. It includes capabilities for extending the length of generated video clips to create longer sequences of motion. The f