30 open-source projects similar to guytevet/motion-diffusion-model, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Motion Diffusion Model alternative.
This project is a diffusion model training framework and image synthesis pipeline. It provides tools for training generative models that learn image data distributions through an iterative denoising process. It also includes automated evaluation scripts that measure the quality and accuracy of the generated samples.
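The "iterative denoising process" mentioned above can be illustrated with a minimal sketch of the noising step and training loss used in DDPM-style models. This is not taken from the project itself: the linear noise schedule, its default values, and every function name below are assumptions chosen for illustration, and the code uses only the Python standard library.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (assumed schedule, not the project's)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + sigma * n for x, n in zip(x0, noise)]
    return x_t, noise


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between the predicted and the true noise."""
    total = sum((p - n) ** 2 for p, n in zip(predicted_noise, true_noise))
    return total / len(true_noise)


# Hypothetical usage: noise a tiny four-value "image" at the last timestep.
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
pixels = [0.5, -0.25, 1.0, 0.0]
noisy, noise = add_noise(pixels, 999, alpha_bars, rng)
```

In a real training loop a neural network would predict `noise` from `noisy` and the timestep, and `denoising_loss` would be minimized; sampling then runs the learned denoiser step by step from pure noise back toward an image.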
DiT is a latent diffusion model and transformer-based generative AI framework implemented in PyTorch. It functions as a class-conditional image generator that replaces traditional convolutional backbones with a transformer architecture to synthesize high-fidelity images. The project utilizes patch-based latent processing and latent space compression to operate on low-dimensional image representations. It incorporates class-conditional guidance and adjustable guidance scales to control the visual content of generated images during the sampling process. The framework covers distributed model t
Implementation of a Denoising Diffusion Probabilistic Model in PyTorch
Tiny Universe is an educational monorepo that delivers multiple independent implementations of core AI subsystems as self-contained Jupyter notebooks. It provides from-scratch constructions of foundational architectures including a complete Transformer model built from the original paper specification, a denoising diffusion probabilistic model for image generation, and a ReAct-style autonomous agent framework that equips an LLM with tools for planning and multi-step task execution. The project distinguishes itself by covering the full lifecycle of modern AI systems through hands-on implementa
This project is an educational course and collection of training materials focused on generative diffusion models. It provides a curriculum and practical guides for training, fine-tuning, and deploying models capable of synthesizing images, audio, and video. The material covers specific implementation strategies including noise-based synthesis, iterative refinement, and latent space compression. It provides instruction on guiding generative outputs through conditional synthesis and prompt adherence optimization, as well as techniques for image inpainting and text-based editing. The project i
Latent Diffusion is a framework for high-resolution image synthesis that performs the denoising process within a compressed latent space. It uses variational autoencoders to encode images into a lower-dimensional representation, reducing the computational cost of noise prediction compared to operating on raw pixels. The project enables text-to-image generation by integrating natural language descriptions through cross-attention conditioning. It also supports image inpainting and restoration, filling masked or missing image areas with generated content, and example-based synthesis using retrie
This is a PyTorch-based implementation of diffusion models for synthesizing photorealistic images and video. It provides a framework for text-to-image and text-to-video generation, as well as unconditional image synthesis. The system utilizes a cascading diffusion pipeline to produce high-resolution imagery by passing low-resolution outputs through a sequence of super-resolution models. It also includes capabilities for image inpainting, allowing the reconstruction of masked or missing regions of visual media guided by surrounding context and text prompts. The project includes tools for diff
Instruct-pix2pix is an instruction-based image model and PyTorch library designed to modify visual content by following natural language directions. It functions as a diffusion model image editor that applies human-written instructions to existing pictures rather than using traditional text-to-image prompts. The project provides a fine-tunable diffusion framework for adapting pre-trained checkpoints to specific image editing datasets. It includes a synthetic dataset generator that creates paired images and text triplets to train models on various image editing tasks. The system covers a rang
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
AlphaPose is a deep learning pose estimation framework and PyTorch computer vision library designed for detecting and tracking human body, face, hand, and foot keypoints in images and videos. It provides a system for skeletal posture estimation and multi-person pose tracking. The project implements tools for three-dimensional human pose reconstruction, generating joint positions and body mesh shapes from two-dimensional image data. It also includes a multi-person pose tracker capable of maintaining the identity of multiple people across consecutive video frames. The framework covers a broad
Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control. The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
Open-Sora is a video generation framework designed to produce cinematic sequences from text prompts and images. It functions as a generative system that transforms written descriptions or reference images into video content featuring realistic textures and lighting. The project includes a dedicated prompt engineering tool that uses large language models to expand simple user inputs into detailed descriptions. It also features a motion controller for adjusting movement intensity in generated sequences and evaluating motion levels in existing video files. The framework incorporates text-to-vid
F5-TTS is a text-to-speech system that utilizes a flow matching engine and diffusion transformers to generate fluent synthetic speech. It functions as a multilingual speech synthesizer and neural training framework, providing tools for voice cloning and high-performance inference serving. The project distinguishes itself through a voice cloning toolkit capable of mimicking specific speaker characteristics and tones from reference audio clips. It supports cross-lingual generation, allowing for the synthesis of audio across various global languages or the mixing of multiple languages within a s
HunyuanDiT is a bilingual text-to-image generative model and diffusion transformer image generator. It uses a latent diffusion system to synthesize high-resolution images from text prompts, with a specific focus on understanding and generating content from both Chinese and English language descriptions. The project features a multi-resolution transformer architecture and a bilingual embedding space to map different scripts into a shared semantic area. It supports iterative multi-turn image refinement, which translates conversational dialogue into updated prompts to progressively modify visual
ACE-Step is a high-fidelity audio synthesis system and diffusion model designed to generate music and vocals from text descriptions. It functions as a music generator and vocal synthesizer, using a diffusion transformer decoder to produce audio across various languages and genres. The project provides tools for text-guided audio editing, including the ability to extend the duration of tracks, regenerate specific song segments, and perform latent-space audio inpainting to modify lyrics or styles. It also includes a framework for audio style fine-tuning using low-rank adaptation to adapt vocal
Open-Sora-Plan is a text-to-video framework and distributed video training system. It utilizes a diffusion transformer architecture and large language model components to transform written descriptions or image prompts into high-quality video sequences. The system features a distributed infrastructure designed for large-scale video training and inference. It employs sequence parallelism to split high-resolution or long-duration video samples across multiple GPUs and uses a sparse attention mechanism to increase processing speed. The project includes capabilities for both text-to-video and im
ToonCrafter is a model that combines latent diffusion, reference-based colorization, and sketch-guided control for cartoon animation and interpolation. It functions as a cartoon video interpolation model, a reference-based colorization model, and a sketch-guided animation tool, all built on a latent diffusion animation framework. The project distinguishes itself by integrating three core capabilities into a single pipeline: generating smooth intermediate frames between two cartoon images using diffusion-based priors, transferring color and style from a reference image onto black-and-white ske
Champ is a generative vision system and controllable image-to-video generator designed for human image animation. It uses a diffusion-based video synthesizer and 3D parametric guidance to transform a single reference image into a consistent sequence of motion based on external driving data. The framework distinguishes itself through a human pose transfer system that employs 3D body parametric extraction and coordinate-space alignment. This allows the model to map motion from a driving video to a reference person by adjusting for body scales and camera perspectives using depth and semantic con
CogVideo is a video generation framework and large language model architecture designed for synthesizing high-resolution video clips from natural language descriptions and images. It functions as a text-to-video and image-to-video generator, while also providing a model for video captioning to analyze visual content into descriptive text summaries. The system supports animating static images into motion sequences and transforming series of images into video based on prompts. It includes capabilities for extending the length of generated video clips to create longer sequences of motion. The f
LLaDA is a masked diffusion language model and conditional text generator. It generates text by iteratively refining masked tokens through a diffusion process rather than predicting the next token in a sequence. The project functions as a vision-language diffusion model, converting visual inputs into text responses. It also serves as a preference optimization framework that uses log-likelihood estimation and evidence lower bounds to tune model responses. The system supports multi-round conversational AI and text sequence evaluation. It integrates vision-language embedding for cross-modal con
ComfyUI_IPAdapter_plus is a node-based extension for ComfyUI that implements IPAdapter models to guide image generation using reference images. It functions as an image prompting tool and a Stable Diffusion image adapter, allowing reference files to serve as visual prompts for controlling style, composition, and subject identity. The project provides specialized capabilities for maintaining facial identity and high-fidelity features across generated portraits. It enables the transfer of visual characteristics and artistic styles from reference images, as well as the extraction of spatial layo
VACE is a set of software tools and frameworks for reference-guided video generation, diffusion-based editing, and video-to-video translation. It provides utilities to produce new video content and modify existing sequences by using reference materials to guide visual style, subject matter, and composition. The framework enables video-to-video translation and synthesis, allowing for the update of visual styles and depth. It also functions as a video editor for modifying properties and content through reference-guided transformations. The system covers localized video editing and inpainting,
Diffusion Policy is a robot learning framework that uses diffusion models to map visual observations to precise action trajectories. It functions as an imitation learning toolkit and visuomotor policy learner, providing a system to train neural networks that replicate human behavior by generating robotic movements based on image and sensor data. The framework employs a conditional denoising process to sample sequences of robotic movements, allowing it to handle multimodal action distributions where multiple valid trajectories may exist for a single state. It utilizes score-based action modeli
This project is a research-oriented PyTorch framework designed for the implementation and training of generative video diffusion models. It provides a modular toolkit that extends standard image-based diffusion techniques into three dimensions, enabling the synthesis of coherent video sequences through iterative denoising processes. The framework distinguishes itself by utilizing factored space-time attention, which decomposes high-dimensional video data into separate spatial and temporal layers to maintain motion consistency while managing computational complexity. It supports multi-modal tr
This project is a framework for training consistency models and performing diffusion model distillation. It functions as a few-step text-to-image generator and an image-to-image transformation tool designed to produce high-resolution visuals from text prompts or existing images. The system focuses on converting pre-trained diffusion models into consistency models to reduce the number of required inference steps. It enables the training of lightweight model adaptors to inject specific visual styles into large models without requiring full network fine-tuning. The project covers broad capabili
This is a PyTorch implementation of a text-to-image model designed for synthesizing high-fidelity images from natural language descriptions. It utilizes a diffusion image generator to transform latent embeddings into visual data through an iterative denoising process. The system employs a two-stage latent mapping process, using a CLIP-based latent prior to map text embeddings to image embeddings before decoding them into pixels. It features a cascading diffusion decoder that produces high-resolution imagery by passing low-resolution outputs through a sequence of models at increasing scales.
lora-scripts is a fine-tuning toolkit designed for adapting base diffusion models to specific styles or subjects. It provides a specialized set of scripts and tools for executing low-rank adaptation and Dreambooth training jobs. The project features a web-based graphical interface that manages the training workflow, allowing users to configure and execute jobs without manual script editing. This interface maps user inputs to hyperparameters and provides a real-time dashboard for monitoring training metrics and loss curves to track model convergence. The system includes a dataset tagging mana
IF is a text-to-image diffusion system that translates natural language descriptions into visual imagery. The project provides a generative pipeline for creating images, an inpainting tool for modifying specific image sections, and a super-resolution upscaler to increase pixel density and clarity. The system includes a concept fine-tuning framework that allows for the teaching of new visual concepts by updating a small set of parameters. It also supports image style transfer to apply the aesthetic characteristics of a reference image to a new output.
Kolors is a generative model implementation for synthesizing photorealistic images from natural language descriptions and visual references. It utilizes a latent diffusion model framework to produce high-fidelity imagery, operating within a compressed latent space to improve generation efficiency and quality. The system functions as a multilingual image generator, interpreting text prompts in multiple languages to produce semantically accurate visual outputs. It includes a custom model training pipeline that uses low-rank adaptation to teach the model specific subjects or artistic styles from
VideoCrafter is a latent diffusion model for video synthesis. It works as both a text-to-video and an image-to-video system, generating video sequences from descriptive text prompts or static images. Its diffusion-based network is designed to keep generated frames visually consistent and temporally coherent, which supports creating custom video clips and animating still images.