30 open-source projects similar to kijai/comfyui-wanvideowrapper, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best ComfyUI WanVideoWrapper alternative.
CogVideo is a generative video framework that uses diffusion models and transformer-based architectures to synthesize high-resolution video clips. It functions as both a text-to-video and image-to-video generator, converting textual descriptions or static images into temporal visual sequences. The system integrates large language model capabilities to expand short user prompts into detailed descriptions for better visual alignment. It supports the animation of static images through latent seeding and provides the ability to extend the length of existing video sequences. The project includes
HunyuanVideo-1.5 is a video generation foundation model and text-to-video diffusion framework. It utilizes a latent video diffusion model and a spatio-temporal transformer architecture to generate high-definition video sequences from text descriptions and images. The project enables cinematic camera control for directing pans and tilts and provides image-to-video animation capabilities. It supports visual style adaptation through low-rank adaptation tuning and uses a language model for prompt refinement to improve visual alignment. The model covers high-resolution video upscaling via a super
Open-Sora-Plan is a text-to-video framework and distributed video training system. It utilizes a diffusion transformer architecture and large language model components to transform written descriptions or image prompts into high-quality video sequences. The system features a distributed infrastructure designed for large-scale video training and inference. It employs sequence parallelism to split high-resolution or long-duration video samples across multiple GPUs and uses a sparse attention mechanism to increase processing speed. The project includes capabilities for both text-to-video and im
StoryDiffusion is a generative AI system designed for consistent character image and video generation. It utilizes a pluggable cross-attention module to inject shared character representations into pretrained diffusion models, allowing for visual identity stability across multiple images and scenes without retraining the base model. The project features a video generation pipeline that produces temporally coherent sequences from text prompts or condition images. It employs a latent space motion interpolator to predict intermediate frames and semantic motion, enabling long-range video generati
AnimateDiff is a latent diffusion video generator and text-to-video diffusion framework. It converts existing text-to-image diffusion models into animation generators by applying specialized motion modules, allowing for the creation of video sequences without modifying the original base model. The project provides an image-to-video animation framework that uses sparse RGB images, sketches, or structural keyframe constraints to guide generation. It further distinguishes itself with a motion adapter system that injects cinematic camera movements, such as zooming, panning, and tilting, into anim
CogVideo is a video generation framework and large language model architecture designed for synthesizing high-resolution video clips from natural language descriptions and images. It functions as a text-to-video and image-to-video generator, while also providing a model for video captioning to analyze visual content into descriptive text summaries. The system supports animating static images into motion sequences and transforming series of images into video based on prompts. It includes capabilities for extending the length of generated video clips to create longer sequences of motion. The f
Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control. The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
SkyReels-V2 is a video generation system that creates, extends, and refines video clips from text descriptions, images, or both. It operates as a diffusion-based video generation model that can produce videos of any duration by denoising frames sequentially, with each new frame conditioned on the ones that came before it. The system supports generating videos from scratch using text prompts, starting from a single image and producing subsequent frames, or constraining both the first and last frames to match user-provided images. What distinguishes SkyReels-V2 is its combination of infinite-le
Open-Sora is a video generation framework designed to produce cinematic sequences from text prompts and images. It functions as a generative system that transforms written descriptions or reference images into video content featuring realistic textures and lighting. The project includes a dedicated prompt engineering tool that uses large language models to expand simple user inputs into detailed descriptions. It also features a motion controller for adjusting movement intensity in generated sequences and evaluating motion levels in existing video files. The framework incorporates text-to-vid
VideoCrafter is a latent diffusion model for AI video synthesis. It works as both a text-to-video and an image-to-video generation system, synthesizing high-quality video sequences from descriptive text prompts or static image inputs. The model uses a diffusion-based neural network to turn these inputs into animated content while keeping the generated sequences visually consistent and temporally coherent. This allows for the creation of custom video clips and the animation of static images into fluid motion.
TurboDiffusion is a video diffusion inference engine and generator designed to create high-resolution videos from text prompts and images. It provides a runtime environment for executing optimized diffusion model checkpoints with a focus on reducing latency and GPU memory usage. The project features a specialized training framework for aligning sparse-linear attention models with pretrained full-attention models. This system includes capabilities for sparse attention parameter merging and sparse-linear model alignment to reduce computational costs during inference while maintaining output qua
This repository provides a collection of reference implementations and code examples for training and deploying machine learning models using the MLX framework. It serves as a practical guide for executing distributed training, fine-tuning large language models, converting model weights, and implementing multimodal generative workflows. The project distinguishes itself through specialized examples for local hardware execution, featuring weight quantization to reduce memory usage and low-rank adaptation for parameter-efficient fine-tuning. It also includes scripts for transforming external mod
Genkit is an open-source framework for building AI-powered applications. It provides a unified interface for connecting to hundreds of generative AI models from multiple providers, enabling text, image, audio, and video generation through a single API. The framework structures multi-step AI interactions—including chat, retrieval-augmented generation, tool use, and agentic workflows—as composable, traceable flows with built-in streaming and state management. The framework distinguishes itself through a comprehensive developer toolkit that includes a command-line interface and a local developer
AnimateAnyone is an appearance-preserving video synthesizer designed for character animation from a single static image. It functions as a diffusion image-to-video generator that transforms a source image into a high-fidelity video sequence while maintaining consistent character identity, clothing, and visual details across all frames. The system enables video-driven character reenactment by transferring motions, facial expressions, and body movements from a reference video onto a static character. It employs pose-guided video generation to control movement via skeleton keypoints and pose sig
Tune-A-Video is a text-to-video diffusion framework designed to convert pretrained text-to-image diffusion models into video generators. It utilizes a spatio-temporal attention mechanism and single text-video pair training to enable the synthesis of moving sequences from text prompts. The project provides tools for one-shot video personalization, allowing a model to be tuned on a single reference video to preserve specific characters or artistic styles across new generations. It also functions as a video editor that modifies subjects, backgrounds, and styles through noise-sampling prompt guid
LongCat-Video is a collection of specialized models for video synthesis, featuring a large-language-model-based architecture for creating high-resolution videos from text, images, or existing sequences. It includes dedicated systems for text-to-video generation, image-to-video animation, and the creation of talking avatars. The project can extend existing clips through a video continuation model that predicts subsequent frames. It can also synchronize character lip movements with audio and text prompts to produce speaking videos.
Wan2.1 is a generative video synthesis framework that provides foundation models for creating high-fidelity video sequences and static images from descriptive text prompts. The system utilizes a unified architecture trained on both static and dynamic datasets, allowing it to function as a comprehensive tool for visual media creation. The framework distinguishes itself through a transformer-based temporal modeling approach that ensures structural coherence and consistent motion across video frames. It supports multi-resolution latent scaling, enabling the generation of content in various aspec
Wan2.2 is a generative video artificial intelligence system designed to synthesize visual media by interpreting natural language instructions. It functions as a text-to-video diffusion model that transforms written concepts into coherent motion sequences through deep learning and latent space manipulation. The system utilizes a transformer-based architecture to process video data as a series of tokens, allowing it to capture complex spatial and temporal relationships. By employing a temporal attention mechanism, the model maintains visual consistency across frames, while its latent space appr
ComfyUI is a node-based generative AI orchestration engine designed for constructing, testing, and executing complex image and video synthesis pipelines. By utilizing a directed acyclic graph execution model, the platform allows users to build reproducible workflows through modular, interconnected processing blocks without requiring manual code implementation. It serves as both a local environment for high-performance model inference and a production-ready server for deploying generative capabilities. The platform distinguishes itself through its focus on workflow portability and extensibilit
HunyuanVideo is a generative artificial intelligence framework designed to synthesize high-fidelity video sequences from descriptive text prompts. It utilizes a latent diffusion architecture that compresses video data into compact representations, allowing for the generation of dynamic visual content while maintaining temporal and spatial fidelity. The system distinguishes itself through a specialized inference engine that supports eight-bit weight quantization and sequence-parallel distribution. These capabilities enable the execution of large-scale generative models on hardware with limited
Pixelle-Video is a text-to-video automation platform and generation engine that converts text topics into complete videos with synchronized narration, images, and music. It functions as a modular system for producing short-form content, utilizing large language models to automate script composition, visual asset generation, and voiceover production. The platform features a node-based workflow orchestrator that allows the composition of custom generation pipelines by linking different AI models. It includes a dynamic video layout designer that uses HTML templates to define aspect ratios and vi
EchoMimic is a multimodal human animation framework and diffusion-based video generator. It produces lifelike facial and semi-body animations of a reference image by synthesizing motion and appearance from various source data. The system enables portrait animation driven by audio, pose sequences, or driver videos. It features a landmark conditioning tool that allows precise control of facial movements by modifying specific landmark points. The framework covers multimodal motion synthesis and the synchronization of reference images to match the physical movements of a target driver.
This is a PyTorch-based implementation of diffusion models for synthesizing photorealistic images and video. It provides a framework for text-to-image and text-to-video generation, as well as unconditional image synthesis. The system utilizes a cascading diffusion pipeline to produce high-resolution imagery by passing low-resolution outputs through a sequence of super-resolution models. It also includes capabilities for image inpainting, allowing the reconstruction of masked or missing regions of visual media guided by surrounding context and text prompts. The project includes tools for diff
This project is a research-oriented PyTorch framework designed for the implementation and training of generative video diffusion models. It provides a modular toolkit that extends standard image-based diffusion techniques into three dimensions, enabling the synthesis of coherent video sequences through iterative denoising processes. The framework distinguishes itself by utilizing factored space-time attention, which decomposes high-dimensional video data into separate spatial and temporal layers to maintain motion consistency while managing computational complexity. It supports multi-modal tr
MMagic is a multimodal training pipeline and framework for generative AI, focusing on visual synthesis and restoration. It provides the infrastructure to build and train models for tasks such as text-to-image and text-to-video generation, 3D-aware content synthesis, and high-fidelity image translation using diffusion models and generative adversarial networks. The project distinguishes itself through specialized capabilities for generative model personalization, including techniques for fine-tuning subjects and styles. It also supports advanced visual manipulations such as latent space interp
Text2Video-Zero is a text-to-video diffusion model and framework designed to synthesize temporally consistent video sequences from textual prompts. It functions as a zero-shot video generator, repurposing pre-trained image diffusion models to create video content without requiring additional training on video datasets. The system includes a conditional video synthesizer that allows for guided generation using depth, edge, or pose maps to control structural layout and movement. It also provides text-based video editing capabilities to modify the style or content of existing video clips through
Sygil-webui is a web interface for Stable Diffusion latent diffusion models, providing a creative suite for text-to-image and text-to-video synthesis. It functions as an image generation tool and a latent diffusion image editor, allowing users to create visuals and video sequences from textual descriptions. The project includes a dedicated model training interface for creating custom textual inversion embeddings, which introduce specific new concepts or styles into the diffusion models. It also features specialized tools for generative image editing, including mask-based inpainting, image-to
This project is an AI model API gateway and proxy server designed to provide a unified interface for interacting with diverse artificial intelligence service providers. It functions as a centralized middleware platform that routes, load balances, and translates API requests across multiple models, enabling developers to access text, image, audio, and video generation capabilities through a single, standardized integration. The gateway distinguishes itself through comprehensive administrative and financial controls, including event-driven usage accounting, real-time token consumption tracking,
This project serves as a comprehensive, curated directory of resources, tools, and platforms dedicated to the generative artificial intelligence ecosystem. It functions as a central hub for developers and researchers to discover the frameworks, models, and services necessary for building, deploying, and managing intelligent software applications. The directory distinguishes itself by providing a structured index of specialized tooling across several technical domains. It covers the full lifecycle of generative AI, including the development of autonomous agent systems, the implementation of re
Paints-UNDO is an AI-driven system designed to reverse final digital images into simulated brush stroke sequences. It functions as a digital art undo simulator and a drawing sequence reconstructor, predicting and visualizing previous states of a painting by reversing artistic operations. The project transforms static images into process videos by interpolating between reconstructed drawing states. It uses an image-to-video painting process generator to create smooth progression videos of artwork. The system covers digital art reconstruction and artistic process simulation, including the abil