30 open-source projects similar to tdrussell/diffusion-pipe, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Diffusion Pipe alternative.
A general fine-tuning kit geared toward image/video/audio diffusion models.
A unified inference and post-training framework for accelerated video generation.
Diffusers is a PyTorch-based library and generative AI framework used to build, train, and deploy diffusion pipelines for producing multi-modal media. It provides a suite of tools for generating images, video, and audio from natural language descriptions, as well as specialized systems for text-to-image generation. The project differentiates itself through a modular architecture that separates noise schedulers, pretrained model blocks, and pipeline compositions. This structure allows for the construction of custom generation workflows and the ability to swap individual components of the diffu
DiffSynth-Studio is a comprehensive platform for the lifecycle management of generative diffusion models, providing a unified environment for inference, fine-tuning, and training. It utilizes a modular pipeline architecture and a standardized abstraction layer to support consistent workflows across diverse model configurations for image and video generation. The platform distinguishes itself through a memory-optimized inference engine that dynamically manages resources to facilitate high-resolution generation on constrained hardware. It also integrates specialized training capabilities, inclu
FramePack is a neural video synthesis engine and generation framework designed to produce long, temporally consistent video sequences. It functions as a diffusion model optimizer, providing a suite of techniques to manage the computational demands of high-parameter video models while maintaining visual stability during extended generation tasks. The system distinguishes itself through a hierarchical approach to frame prediction, which plans distant anchor frames before filling in intermediate content to prevent cumulative temporal drift. By utilizing constant-length context compression and to
🔥 2026.06.01 We released LongLive-RAG, a general retrieval-augmented framework for long video generation. - 🔥 2026.05.30 LongLive2.0 now supports I2V AR teacher-forcing training and I2V DMD distillation for Wan2.2-TI2V-5B. - ⚡ 2026.05.25 We optimized the NVFP4 inference path with fused Triton…
Open-Sora-Plan is a text-to-video framework and distributed video training system. It utilizes a diffusion transformer architecture and large language model components to transform written descriptions or image prompts into high-quality video sequences. The system features a distributed infrastructure designed for large-scale video training and inference. It employs sequence parallelism to split high-resolution or long-duration video samples across multiple GPUs and uses a sparse attention mechanism to increase processing speed. The project includes capabilities for both text-to-video and im
MAGI-1: Autoregressive Video Generation at Scale
SkyReels-V2 is a video generation system that creates, extends, and refines video clips from text descriptions, images, or both. It operates as a diffusion-based video generation model that can produce videos of any duration by denoising frames sequentially, with each new frame conditioned on the ones that came before it. The system supports generating videos from scratch using text prompts, starting from a single image and producing subsequent frames, or constraining both the first and last frames to match user-provided images. What distinguishes SkyReels-V2 is its combination of infinite-le
Wan2.1 is a generative video synthesis framework that provides foundation models for creating high-fidelity video sequences and static images from descriptive text prompts. The system utilizes a unified architecture trained on both static and dynamic datasets, allowing it to function as a comprehensive tool for visual media creation. The framework distinguishes itself through a transformer-based temporal modeling approach that ensures structural coherence and consistent motion across video frames. It supports multi-resolution latent scaling, enabling the generation of content in various aspec
A state-of-the-art open-source image editing model that aims to match the performance of closed-source models such as GPT-4o and Gemini 2 Flash.
Wan2.2 is a generative video artificial intelligence system designed to synthesize visual media by interpreting natural language instructions. It functions as a text-to-video diffusion model that transforms written concepts into coherent motion sequences through deep learning and latent space manipulation. The system utilizes a transformer-based architecture to process video data as a series of tokens, allowing it to capture complex spatial and temporal relationships. By employing a temporal attention mechanism, the model maintains visual consistency across frames, while its latent space appr
HunyuanVideo-1.5 is a video generation foundation model and text-to-video diffusion framework. It utilizes a latent video diffusion model and a spatio-temporal transformer architecture to generate high-definition video sequences from text descriptions and images. The project enables cinematic camera control for directing pans and tilts and provides image-to-video animation capabilities. It supports visual style adaptation through low-rank adaptation tuning and uses a language model for prompt refinement to improve visual alignment. The model covers high-resolution video upscaling via a super
PySceneDetect is a suite of tools for identifying cuts and transitions in video files using content, threshold, and histogram detection algorithms. It functions as a scene detector, frame extractor, statistics analyzer, metadata exporter, and video scene splitter. The project identifies scene boundaries and can divide video files into smaller clips using external processing tools. It allows for the extraction of representative image frames from detected changes and the export of scene lists into industry-standard formats such as EDL, FCP, HTML, OTIO, and CSV. The toolset includes capabilitie
TurboDiffusion is a video diffusion inference engine and generator designed to create high-resolution videos from text prompts and images. It provides a runtime environment for executing optimized diffusion model checkpoints with a focus on reducing latency and GPU memory usage. The project features a specialized training framework for aligning sparse-linear attention models with pretrained full-attention models. This system includes capabilities for sparse attention parameter merging and sparse-linear model alignment to reduce computational costs during inference while maintaining output qua