Videocrafter is a latent diffusion model designed for AI video synthesis. It functions as both a text-to-video and image-to-video generation system, synthesizing high-quality video sequences from descriptive text prompts or static image inputs. The model utilizes a diffusion-based neural network to transform inputs into animated content, ensuring visual consistency and temporal coherence throughout the generated sequences. This allows for the creation of custom video clips and the animation of static images into fluid motion.
StoryDiffusion is a generative AI system designed for consistent character image and video generation. It utilizes a pluggable cross-attention module to inject shared character representations into pretrained diffusion models, allowing for visual identity stability across multiple images and scenes without retraining the base model. The project features a video generation pipeline that produces temporally coherent sequences from text prompts or condition images. It employs a latent space motion interpolator to predict intermediate frames and semantic motion, enabling long-range video generati
Magic Animate is a diffusion model video generator designed for human image animation. It transforms a static human photo into a temporally consistent video by mapping movements from a reference motion clip, acting as a tool to create realistic animations from a single image. The system ensures visual stability and minimizes flicker through temporal attention injection and motion-controlled noise scheduling. To accelerate the generation of high-resolution video, it includes a distributed GPU inference engine that splits model workloads across multiple graphics cards. The project covers a com
InfiniteTalk is an open-source system for generating talking head videos driven by audio input. It synthesizes realistic lip movements, head poses, and facial expressions synchronized to a spoken audio track, using either a single still image or a small set of reference video frames as the visual source. The system can produce videos of arbitrary length while maintaining temporal coherence, and it supports animating multiple subjects in a single scene. A key differentiator is the ability to coordinate multiple talking subjects through a structured JSON description, giving each independent lip