InfiniteTalk is an open-source system for generating talking head videos driven by audio input. It synthesizes realistic lip movements, head poses, and facial expressions synchronized to a spoken audio track, using either a single still image or a small set of reference video frames as the visual source. The system can produce videos of arbitrary length while maintaining temporal coherence, and it supports animating multiple subjects in a single scene.
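One generic way to reach arbitrary length while keeping motion continuous is to generate the video in overlapping windows, where each window reuses the last few frames of the previous one as context. The sketch below only illustrates that windowing arithmetic. It is not taken from the InfiniteTalk code base, and the function name `plan_windows` and the values for `window` and `overlap` are assumptions chosen for illustration.

```python
from typing import List, Tuple


def plan_windows(total_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Split a frame range into overlapping [start, end) windows.

    Hypothetical helper: each window after the first begins `overlap`
    frames before the previous window ended, so a generator could
    condition on those shared frames to keep motion continuous.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if total_frames <= 0:
        return []

    stride = window - overlap  # new frames contributed by each later window
    windows: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + window, total_frames)
        windows.append((start, end))
        if end == total_frames:
            break
        start += stride
    return windows


if __name__ == "__main__":
    # Illustrative numbers only: 200 frames, 81-frame windows, 9 shared frames.
    print(plan_windows(200, 81, 9))
```

With these illustrative numbers the plan is `[(0, 81), (72, 153), (144, 200)]`: every window after the first shares 9 frames with its predecessor, and the final window is shortened to fit the clip.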
A key differentiator is the ability to coordinate multiple talking subjects through a structured JSON description, which gives each subject its own lip sync and motion. The system can infer plausible head and body motion from a single static image, and it provides an interactive web interface for uploading media and generating videos without using the command line. An audio-visual feature alignment network keeps lip sync accurate across varying speech rates, and temporally recurrent frame generation keeps motion smooth over long durations.
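To make the idea of a structured JSON description concrete, here is a minimal sketch of what a multi-subject scene file could look like and how it might be assembled and checked in Python. The schema is hypothetical: the keys `image`, `subjects`, `name`, and `audio`, the helper `build_scene`, and the file names are all assumptions for illustration, not the format InfiniteTalk actually expects.

```python
import json
from typing import Any, Dict, List


def build_scene(image: str, speakers: List[Dict[str, str]]) -> Dict[str, Any]:
    """Assemble a hypothetical multi-subject scene description.

    Each speaker gets its own audio track, so lip sync and motion can be
    driven independently per subject. The schema here is an assumption.
    """
    names = [s["name"] for s in speakers]
    if len(set(names)) != len(names):
        raise ValueError("speaker names must be unique")
    for s in speakers:
        if not s.get("audio"):
            raise ValueError(f"speaker {s['name']!r} has no audio track")
    return {
        "image": image,
        "subjects": [{"name": s["name"], "audio": s["audio"]} for s in speakers],
    }


# Hypothetical two-person scene; file names are placeholders.
scene = build_scene(
    "two_people.png",
    [
        {"name": "left", "audio": "left_speaker.wav"},
        {"name": "right", "audio": "right_speaker.wav"},
    ],
)
scene_json = json.dumps(scene, indent=2)

if __name__ == "__main__":
    print(scene_json)
```

Validating up front that every subject has a unique name and its own audio track catches the most likely authoring mistakes before any generation time is spent.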