Text2Video-Zero is a text-to-video diffusion model and framework designed to synthesize temporally consistent video sequences from textual prompts. It functions as a zero-shot video generator, repurposing pre-trained image diffusion models to create video content without requiring additional training on video datasets. The system includes a conditional video synthesizer that allows for guided generation using depth, edge, or pose maps to control structural layout and movement. It also provides text-based video editing capabilities to modify the style or content of existing video clips through
Videocrafter is a latent diffusion model designed for AI video synthesis. It functions as both a text-to-video and image-to-video generation system, synthesizing high-quality video sequences from descriptive text prompts or static image inputs. The model utilizes a diffusion-based neural network to transform inputs into animated content, ensuring visual consistency and temporal coherence throughout the generated sequences. This allows for the creation of custom video clips and the animation of static images into fluid motion.
Magic Animate is a diffusion model video generator designed for human image animation. It transforms a static human photo into a temporally consistent video by mapping movements from a reference motion clip, acting as a tool to create realistic animations from a single image. The system ensures visual stability and minimizes flicker through temporal attention injection and motion-controlled noise scheduling. To accelerate the generation of high-resolution video, it includes a distributed GPU inference engine that splits model workloads across multiple graphics cards. The project covers a com
HunyuanVideo-1.5 is a video generation foundation model and text-to-video diffusion framework. It utilizes a latent video diffusion model and a spatio-temporal transformer architecture to generate high-definition video sequences from text descriptions and images. The project enables cinematic camera control for directing pans and tilts and provides image-to-video animation capabilities. It supports visual style adaptation through low-rank adaptation tuning and uses a language model for prompt refinement to improve visual alignment. The model covers high-resolution video upscaling via a super