InfiniteTalk

Features

Talking Head Generators - Generates talking head videos from an audio track and sparse-frame references, with lip-sync and consistent head motion.

Infinite-Length Generators - Generates talking videos of any length while preserving temporal coherence from sparse-frame reference input.

Lip-Synced - Generates accurate lip-sync and facial animation for any audio input over arbitrary durations.

Multi-Subject Animations - Animates multiple subjects in a single scene, each with its own lip-sync and motion, as defined by a JSON description.

Single-Image Pose and Expression Inference - Infers plausible head and body motion from a single static image using learned priors.

Sparse-Frame Appearance Encoders - Encodes a person's visual identity from a small set of reference frames for consistent generation.

Arbitrary Duration Video Generators - Creates videos of arbitrary length while maintaining temporal coherence from sparse-frame input.

Audio-Driven Talking Head Synthesis - Generates a talking video from a single image and an audio track, matching lip, head, and body motion.

Audio-Driven Synthesis - Synthesizes lip, head, and expression movements directly from audio features using a trained neural network.

Avatar Generation - Creates talking avatar videos with synchronized lip movements, head poses, and expressions from audio and reference media.

Web-Based Inference Orchestrators - Orchestrates file upload, model inference, and video output through a browser interface.

Autoregressive Frame Denoisers - Generates each subsequent frame conditioned on previous outputs and audio features to maintain smoothness.

Unlimited-Duration Talking Video Generators - Creates videos of any length while maintaining temporal coherence across frames from sparse-frame input.

Audio-Visual Signal Alignment - Aligns audio and visual latent spaces to ensure accurate lip sync across varying speech rates.

Interactive Model Interfaces - Provides an interactive web interface for uploading media and generating talking videos without command-line usage.

InfiniteTalk is an open-source system for generating talking head videos driven by audio input. It synthesizes realistic lip movements, head poses, and facial expressions synchronized to a spoken audio track, using either a single still image or a small set of reference video frames as the visual source. The system can produce videos of arbitrary length while maintaining temporal coherence, and it supports animating multiple subjects in a single scene.
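
One common way to reach arbitrary length while keeping motion continuous is to generate the video in overlapping windows, where each window reuses the last few frames of the previous one as context. The sketch below only plans those windows. It is an illustrative assumption, not a description of how InfiniteTalk is implemented: the names `Window` and `plan_windows`, and the frame counts in the example, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Window:
    start: int    # index of the first frame in the window (inclusive)
    end: int      # index one past the last frame (exclusive)
    context: int  # leading frames reused from the previous window


def plan_windows(total_frames: int, window: int, overlap: int) -> List[Window]:
    """Split a video of total_frames into overlapping generation windows.

    Hypothetical helper: each window after the first starts `overlap`
    frames before the previous window ended, so those frames can serve
    as conditioning context for the new ones.
    """
    if total_frames <= 0:
        return []
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    windows: List[Window] = []
    start = 0
    while True:
        end = min(start + window, total_frames)
        windows.append(Window(start, end, 0 if start == 0 else overlap))
        if end == total_frames:
            break
        # Step back by `overlap` so the next window sees prior output.
        start = end - overlap
    return windows


# Example with assumed sizes: 200 frames, 81-frame windows, 9 context frames.
plan = plan_windows(total_frames=200, window=81, overlap=9)
new_frames = sum(w.end - w.start - w.context for w in plan)
```

Because every window contributes at least one new frame, the loop terminates for any positive length, and the new frames across all windows add up to the requested total.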

A key differentiator is the ability to coordinate multiple talking subjects through a structured JSON description, giving each subject its own lip sync and motion. The system can also infer plausible head and body motion from a single static image, and it provides an interactive web interface for uploading media and generating videos without using the command line. An audio-visual feature alignment network keeps lip sync accurate across varying speech rates, and autoregressive frame generation, in which each frame is conditioned on previous outputs and audio features, keeps motion smooth over long durations.
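
To make the idea of a structured multi-subject description concrete, here is a minimal sketch that builds and validates one in Python. The schema is an assumption made for illustration: the keys `reference`, `subjects`, `name`, and `audio`, the helper `build_scene`, and the file names are all hypothetical and may not match the format InfiniteTalk actually expects.

```python
import json
from typing import Dict, List


def build_scene(reference: str, subjects: List[Dict[str, str]]) -> str:
    """Return a JSON scene description with one audio track per subject.

    Hypothetical schema: every key used here is an illustrative
    assumption, not the project's documented format.
    """
    names = [s["name"] for s in subjects]
    if len(names) != len(set(names)):
        raise ValueError("subject names must be unique")
    for s in subjects:
        if not s.get("audio"):
            raise ValueError(f"subject {s['name']!r} needs an audio track")

    scene = {
        "reference": reference,  # still image or reference clip (assumed key)
        "subjects": [
            {"name": s["name"], "audio": s["audio"]} for s in subjects
        ],
    }
    return json.dumps(scene, indent=2)


# Two speakers in one scene, each driven by a separate audio file.
scene_json = build_scene(
    "scene.png",
    [
        {"name": "speaker_a", "audio": "a.wav"},
        {"name": "speaker_b", "audio": "b.wav"},
    ],
)
```

Keeping one audio track per named subject is what lets each subject be animated independently. Validating uniqueness up front avoids two subjects silently sharing one entry.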
