30 open-source projects similar to openai/shap-e, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Shap-E alternative.
Threestudio is a 3D generative AI framework designed to create three-dimensional assets from text prompts and images. It provides specialized pipelines for text-to-3D generation and image-to-3D reconstruction, utilizing a neural radiance field trainer to produce geometry and textures. The framework is distinguished by its support for hybrid geometry backends, including signed distance functions, tetrahedra grids, and volume grids. It employs score distillation sampling to guide the generation process and features a modular plugin system for loading custom modules and nodes. The system covers
Point-e is a system for 3D model synthesis that generates three-dimensional point clouds from natural language descriptions and two-dimensional images. It utilizes diffusion models to synthesize these spatial representations based on text prompts or source images. The project includes specialized tools for refining these outputs, such as a point cloud upsampler to increase the density and resolution of low-resolution models. It also provides a mesh converter that uses distance function regression to transform raw point cloud data into structured 3D meshes. The broader capability surface cove
This project is a diffusion-based 3D generator and image-to-3D reconstruction system. It translates natural language descriptions or two-dimensional images into three-dimensional assets using neural radiance fields and diffusion models. The system utilizes score-distillation sampling and diffusion-based guidance to refine 3D shapes without requiring 3D training data. It includes specialized tools for transforming neural representations into exportable meshes with texture and material data, as well as a pipeline for iterative optimization of geometry and textures. The project covers a broad r
TRELLIS is a 3D generative AI model and latent diffusion framework designed to transform natural language descriptions or reference images into textured 3D assets. It operates as a text-to-3D asset generator that utilizes structured latent representations to produce high-quality 3D meshes, Gaussians, and Radiance Fields. The system functions as a multi-format 3D decoder, converting internal representations into standard exchange formats such as GLB and PLY. It also serves as a 3D asset editing tool, enabling the modification of specific regions of generated objects through targeted text or im
Latent Diffusion is a framework for high-resolution image synthesis that performs the denoising process within a compressed latent space. It uses variational autoencoders to encode images into a lower-dimensional representation, reducing the computational cost of noise prediction compared to operating on raw pixels. The project enables text-to-image generation by integrating natural language descriptions through cross-attention conditioning. It also supports image inpainting and restoration, filling masked or missing image areas with generated content, and example-based synthesis using retrie
LatentSync is an audio-driven video generator and latent diffusion lip sync model designed to synchronize a speaker's lip movements in a video to a target audio track. It provides a lip synchronization training framework for developing synchronization networks on custom video and audio datasets. The system utilizes a video preprocessing pipeline to clean, segment, and align face data. It includes a visual sync evaluation tool that calculates confidence scores to measure the accuracy of audio and visual alignment in generated videos. The project covers capabilities for custom synchronization
IDM-VTON is an AI virtual try-on framework and fashion synthesis tool designed to generate realistic images of people wearing specific garments. It operates as a diffusion-based image generator that blends garment textures with human poses to create synthetic fashion imagery. The system implements virtual fitting room capabilities through a generative model that combines person and clothing inputs. It includes a web-based interface to run interactive visual demonstrations and synthesize try-on images in real-time. The framework covers the broader domain of AI fashion visualization, enabling
TRELLIS.2 is a generative image-to-3D system that creates high-resolution 3D assets with physically based rendering materials from 2D images. It utilizes a sparse voxel representation to handle complex topologies and internal structures without relying on iso-surface fields. The project features a structured latent space representation that maps geometry and texture attributes to maintain visual fidelity. It employs an optimization-free geometry reconstruction process to decode latent representations directly into voxel grids and includes a PBR texture generator for synthesizing base color, r
Hunyuan3D-2.1 is a generative 3D framework and image-to-3D pipeline that transforms single 2D images into textured 3D geometries. It functions as an asset generator that produces high-quality 3D meshes and textures using a flow-matching system. The project includes a specialized synthesizer for creating photorealistic textures with physically based rendering properties. These tools allow for the simulation of metallic reflections and light interactions on generated models. The system covers 3D asset pipeline automation through a sequence of shape generation and mesh refinement. It also provi
VACE is a framework for reference-guided video generation, diffusion-based editing, and video-to-video translation. It provides utilities to produce new video content and modify existing sequences by using reference materials to guide visual style, subject matter, and composition. The framework enables video-to-video translation and synthesis, so that visual style and depth can be updated. It also functions as a video editor for modifying properties and content through reference-guided transformations. The system covers localized video editing and inpainting,
AudioLDM is a latent diffusion framework for generating high-fidelity audio, music, and sound effects. It functions as a text-to-audio generator that converts natural language descriptions into synthetic audio signals with control over pitch and environment. The system provides specialized tools for audio-to-audio synthesis and generative repair. This includes the ability to perform audio style transfer and replicate specific acoustic events based on existing files. The project covers a broad range of audio transformation tasks, including audio super-resolution for increasing signal fidelity
This is a framework for training and sampling diffusion models to generate high-fidelity images, video, and 4D assets. It provides a modular environment for managing generative AI training pipelines, including the handling of datasets, noise sampling, and loss weighting to stabilize the creation of synthetic content. The project features a modular model configuration system that uses YAML-based assembly to define network submodules and conditioners. It also includes a dedicated toolset for AI image watermarking, allowing for the embedding and detection of invisible markers to verify the origi
StableCascade is a generative AI system and latent diffusion framework designed for text-to-image synthesis and image-to-image transformations. It utilizes a multi-stage cascade architecture that encodes and decodes images via a latent space to produce high-fidelity visual imagery. The system includes a cascade diffusion pipeline for controlling image structure through inpainting, outpainting, and super-resolution. It also provides a toolkit for image-to-image generation and the creation of image variations using embeddings. The framework supports model optimization through low-rank adaptati
Stable Diffusion Web UI is a browser-based interface for generating, editing, and upscaling images and videos using latent diffusion models. It functions as a text-to-image generator, an AI image editor, and a tool for increasing image resolution and clarity. The system includes capabilities for custom model training, specifically allowing the creation of textual inversion embeddings to teach a model new concepts and visual styles from user photos. It also provides tools for AI video production, generating short clips from text prompts. The software covers image-to-image transformation, imag
OOTDiffusion is an AI virtual try-on system designed for controllable image synthesis. It generates images of people wearing specific clothing items by superimposing garments onto human figures for both half-body and full-body compositions. The project facilitates digital fashion prototyping and virtual clothing fitting by creating garment-to-person overlays. It aims to maintain the original identity of the wearer and the specific details of the clothing during the synthesis process. The system utilizes a latent diffusion model and conditioning-based image generation to control the output. I
Sygil-webui is a web interface for Stable Diffusion latent diffusion models, providing a creative suite for text-to-image and text-to-video synthesis. It functions as an image generation tool and a latent diffusion image editor, allowing users to create visuals and video sequences from textual descriptions. The project includes a dedicated model training interface for creating custom textual inversion embeddings, which introduces specific new concepts or styles into the diffusion models. It also features specialized tools for generative image editing, including mask-based inpainting, image-to
Stable Diffusion is a generative machine learning pipeline that synthesizes high-resolution visual content by performing iterative denoising within a compressed latent space. By conditioning the denoising process on text embeddings and decoding the result into pixels, the framework enables the generation of images from text prompts and the transformation of existing visual inputs based on semantic instructions. The architecture utilizes a modular execution environment that decouples model loading, scheduler logic, and inference components to support diverse hardware configurations. I
Tortoise-tts is a neural text-to-speech engine and voice cloning toolkit designed for high-quality audio generation. It functions as a zero-shot synthesis system, meaning it can generate speech for unseen speakers without requiring additional training or fine-tuning for each new voice. The system specializes in replicating human vocal characteristics using small sets of reference audio clips. It allows for the extraction of voice latents to mimic specific speakers, the generation of random synthetic identities, and the blending of multiple voice profiles to create hybrid vocal identities. Th
Hallo is an audio-driven talking head generator and portrait animation framework. It synchronizes a static portrait image with an audio file to produce realistic talking head videos by mapping audio spectral features to facial expressions and lip movements. The system utilizes a diffusion video synthesis model that employs iterative denoising and latent representations to generate temporally consistent video frames. It incorporates identity-preserving feature extraction and latent space motion modeling to maintain visual consistency and control facial poses. The toolkit provides capabilities
IOPaint is an AI image editor and Stable Diffusion inpainting tool providing a web interface for removing objects and replacing image content. It utilizes latent diffusion image processing to synthesize high-resolution replacements for erased sections of an image. The project features a specialized AI background remover for isolating subjects and an AI image upscaler that employs super-resolution models for general photos and anime artwork. The software covers a broad range of capabilities including image segmentation for object isolation, face restoration for improving facial details, and t
Taming Transformers is a generative system for high-resolution image synthesis that combines a vector-quantized GAN image encoder with an autoregressive transformer. It utilizes a discrete latent space to represent images as codebook tokens, enabling the production of high-fidelity visuals through a hybrid architecture. The project provides specialized capabilities for layout-based scene synthesis, allowing for the creation of complex images by placing objects according to defined bounding box coordinates. It also includes tools for image inpainting to fill missing sections of an image by ana
This is a PyTorch implementation of a text-to-image model designed for synthesizing high-fidelity images from natural language descriptions. It utilizes a diffusion image generator to transform latent embeddings into visual data through an iterative denoising process. The system employs a two-stage latent mapping process, using a CLIP-based latent prior to map text embeddings to image embeddings before decoding them into pixels. It features a cascading diffusion decoder that produces high-resolution imagery by passing low-resolution outputs through a sequence of models at increasing scales.
AnimateDiff is a latent diffusion video generator and text-to-video diffusion framework. It converts existing text-to-image diffusion models into animation generators by applying specialized motion modules, allowing for the creation of video sequences without modifying the original base model. The project provides an image-to-video animation framework that uses sparse RGB images, sketches, or structural keyframe constraints to guide generation. It further distinguishes itself with a motion adapter system that injects cinematic camera movements, such as zooming, panning, and tilting, into anim
Diffusers is a PyTorch-based library and generative AI framework used to build, train, and deploy diffusion pipelines for producing multi-modal media. It provides a suite of tools for generating images, video, and audio from natural language descriptions, as well as specialized systems for text-to-image generation. The project differentiates itself through a modular architecture that separates noise schedulers, pretrained model blocks, and pipeline compositions. This structure allows for the construction of custom generation workflows and the ability to swap individual components of the diffu
Open-Sora is a video generation framework designed to produce cinematic sequences from text prompts and images. It functions as a generative system that transforms written descriptions or reference images into video content featuring realistic textures and lighting. The project includes a dedicated prompt engineering tool that uses large language models to expand simple user inputs into detailed descriptions. It also features a motion controller for adjusting movement intensity in generated sequences and evaluating motion levels in existing video files. The framework incorporates text-to-vid
InstantID is a diffusion-based identity preservation framework designed for zero-shot image generation. It allows for the synthesis of images featuring a specific person's facial identity using a single reference photo without requiring additional model training or fine-tuning. The project distinguishes itself through the use of consistency model distillation to accelerate inference, reducing the number of steps needed to produce high-quality results. It combines identity-preserving feature extraction with multi-modal prompt integration to merge visual embeddings from a reference image with t
ToonCrafter is a latent diffusion model for cartoon animation and interpolation. It functions as a cartoon video interpolation model, a reference-based colorization model, and a sketch-guided animation tool. The project distinguishes itself by integrating these three capabilities into a single pipeline: generating smooth intermediate frames between two cartoon images using diffusion-based priors, transferring color and style from a reference image onto black-and-white ske
DiT is a latent diffusion model and transformer-based generative AI framework implemented in PyTorch. It functions as a class-conditional image generator that replaces traditional convolutional backbones with a transformer architecture to synthesize high-fidelity images. The project utilizes patch-based latent processing and latent space compression to operate on low-dimensional image representations. It incorporates class-conditional guidance and adjustable guidance scales to control the visual content of generated images during the sampling process. The framework covers distributed model t
stable-diffusion.cpp is a high-performance C++ inference engine designed for generating images and video from text prompts using Stable Diffusion models. It functions as a latent diffusion model runtime and a lightweight machine learning framework that enables local diffusion model execution on consumer hardware. The project distinguishes itself as a CPU-based image generator capable of running without a dedicated GPU. It employs a specialized C++ tensor backend and cross-backend hardware abstraction to dispatch compute tasks across different processor instruction sets and graphics APIs. The
MegaTTS3 is a bilingual speech synthesis system that generates natural-sounding speech in Chinese and English, including seamless code-switching within a single utterance. It functions as a text-to-speech engine, voice cloning system, and speech-to-text alignment tool, built around an acoustic latent compression model that encodes high-resolution audio into compact representations for efficient processing. The system distinguishes itself through accent intensity control, allowing adjustment of a speaker's accent strength in generated speech, and voice cloning from short audio samples for pers