InfiniteTalk is an open-source system for generating talking-head videos driven by audio input. It synthesizes realistic lip movements, head poses, and facial expressions synchronized to a spoken audio track, using either a single still image or a small set of reference video frames as the visual source. The system can produce videos of arbitrary length while maintaining temporal coherence, and it can animate multiple subjects in a single scene. A key differentiator is the ability to coordinate multiple talking subjects through a structured JSON description, giving each independent lip
LatentSync is an audio-driven lip-sync model based on latent diffusion that synchronizes a speaker's lip movements in a video to a target audio track. It provides a training framework for building synchronization networks on custom video and audio datasets. A video preprocessing pipeline cleans, segments, and aligns face data. A sync evaluation tool computes confidence scores that measure how accurately audio and video are aligned in generated videos. The project covers capabilities for custom synchronization
EchoMimic is an audio-driven portrait animation framework built on latent diffusion. It turns static reference images into talking-head videos by synchronizing facial movements with audio tracks and motion drivers. Motion synthesis is hybrid, combining audio input with pose data. A facial landmark motion controller is used to edit landmark positions, enabling precise synchronization and video-to-video pose transfer. The pipeline covers image-to-video animation through latent diffusion and facial landmark conditioning. This allows