30 open-source projects similar to opentalker/video-retalking, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Video Retalking alternative.
MuseTalk is a deep learning lip synchronization system designed to align video facial movements with audio tracks for high-fidelity video dubbing. It functions as an engine that matches facial expressions to audio input in real time, enabling the modification of a speaker's lip movements to match new audio sources across different languages. The project features a distributed GPU training pipeline and a multi-stage processing workflow for refining the visual accuracy of synthetic speech. It distinguishes itself through the use of region-specific face masking and mouth openness control, which
LiveTalking is an interactive talking head engine and AI avatar management platform designed to synchronize synthetic speech with facial movements. It functions as a real-time orchestrator that connects large language models and text-to-speech services to neural-rendered digital humans. The project distinguishes itself through low-latency streaming capabilities and the ability to handle real-time conversational interruptions. It supports advanced audio-visual customization, including human voice cloning and the ability to drive avatar expressions using real-time webcam data. The platform cov
Wav2Lip is a deep learning lip sync model and neural talking head framework designed to synchronize the lip movements in a video to match a provided audio file. It functions as a computer vision lip synchronizer and speech-to-lip generator that maps speech patterns to visual mouth movements to produce realistic talking head videos. The system utilizes a framework for training and evaluating models that align audio and video frames. This includes the ability to train lip-sync models and visual discriminators using speech-to-lip datasets and to evaluate the resulting synchronization accuracy thr
Duix-Avatar is an AI digital human toolkit used to create, clone, and animate realistic virtual personas. It functions as a digital persona cloning tool and a text-to-speech animation API that converts written text or audio into synthetic voice and facial motion markers. The framework provides an offline video generation engine that renders digital human animations and lip-synced videos on local hardware. It includes a specialized lip sync engine to synchronize mouth movements with audio waveforms and a pipeline for extracting facial and vocal features from source media to create synthetic re
EMO is an AI portrait animator and audio-to-video diffusion model designed to generate expressive talking head videos. It transforms a single static portrait image and an audio track into a synchronized video of a person speaking. The system focuses on digital human synthesis, producing high-fidelity facial movements and emotional cues. It synchronizes lip movements and facial gestures to match spoken voice recordings to create realistic portrait animations. The framework utilizes a diffusion process and a cross-modal alignment mechanism to ensure timing between audio signals and visual land
PaddleGAN is a generative AI framework and deep learning computer vision library built on the PaddlePaddle framework. It serves as a toolkit for image and video synthesis, providing a collection of generative adversarial network implementations for creating synthetic visual content. The library focuses on advanced synthesis capabilities, including the generation of talking heads through lip motion synchronization and the creation of synthetic videos via motion transfer from driving sequences. It provides tools for domain-to-domain translation, allowing for image style transfer and the transfo
LatentSync is an audio-driven video generator and latent diffusion lip sync model designed to synchronize a speaker's lip movements in a video to a target audio track. It provides a lip synchronization training framework for developing synchronization networks on custom video and audio datasets. The system utilizes a video preprocessing pipeline to clean, segment, and align face data. It includes a visual sync evaluation tool that calculates confidence scores to measure the accuracy of audio and visual alignment in generated videos. The project covers capabilities for custom synchronization
InfiniteTalk is an open-source system for generating talking head videos driven by audio input. It synthesizes realistic lip movements, head poses, and facial expressions synchronized to a spoken audio track, using either a single still image or a small set of reference video frames as the visual source. The system can produce videos of arbitrary length while maintaining temporal coherence, and it supports animating multiple subjects in a single scene. A key differentiator is the ability to coordinate multiple talking subjects through a structured JSON description, giving each independent lip
This Python SDK provides a comprehensive toolkit for synthetic audio generation, voice cloning, and the development of conversational AI agents. It enables the creation of lifelike spoken audio from text, the replication of human voices through custom cloning, and the deployment of real-time voice agents capable of interacting with external large language models. The library distinguishes itself through deep integration of conversational AI capabilities, including the design of agent personas and the execution of real-time actions via APIs. It supports professional-grade audio production thro
Duix-Mobile is a software development kit for deploying real-time conversational AI characters on mobile devices. It enables the creation of interactive digital humans capable of fluid voice-to-voice interactions, featuring low-latency speech recognition and synchronized lip movements. The project distinguishes itself through the ability to integrate custom external language models and speech providers to define an avatar's intelligence and voice. It supports the generation of real-time multilingual subtitles and provides mechanisms to track the training status of newly created digital charac
Linly-Dubbing is an automated video dubbing pipeline designed for multilingual video localization. It converts spoken content in videos into another language by coordinating speech-to-text transcription, text translation, and text-to-speech synthesis. The system distinguishes itself through AI-driven lip synchronization and animation, which aligns facial expressions and mouth movements to the synthesized voiceover. It also utilizes audio source separation to isolate vocals from background music and noise, allowing for clean voice replacement while preserving original background audio. The br
SadTalker is a generative framework designed to synthesize expressive talking head videos from static portrait images. By mapping audio signals or text prompts to three-dimensional facial motion coefficients, the system synchronizes lip movements, facial expressions, and head orientation to create realistic digital character performances. The project distinguishes itself by decoupling identity from dynamic motion through latent space encoding, ensuring that the generated animations maintain visual fidelity to the source portrait. It supports comprehensive motion synthesis, including full-body
LongCat-Video is a collection of specialized models for video synthesis, featuring a large-language-model-based architecture for creating high-resolution videos from text, images, or existing sequences. It includes dedicated systems for text-to-video generation, image-to-video animation, and the creation of talking avatars. The project provides specific capabilities for extending the length of existing clips through a video continuation model that predicts subsequent frames. It also enables the synchronization of character lip movements with audio and text prompts to produce speaking videos.
EchoMimic is an audio-driven portrait animation framework and latent diffusion video generator. It transforms static reference images into dynamic talking head videos by synchronizing facial movements with audio tracks and motion drivers. The system functions as a hybrid motion synthesis engine that combines audio inputs and pose data. It utilizes a facial landmark motion controller to edit positioning markers, enabling precise synchronization and video-to-video pose transfer. The pipeline covers image-to-video animation through latent diffusion and facial landmark conditioning. This allows
EasyVtuber is 2D avatar animation software that transforms a single static image into a real-time animated character. It functions as a face tracking animation tool and live streaming avatar driver, mapping facial movements from webcams or iOS devices to drive virtual expressions and head motion. The project distinguishes itself through a neural animation pipeline that includes AI video upscaling and frame interpolation to increase visual smoothness and resolution. It utilizes a transparent video streaming system via Spout2, allowing rendered frames with alpha channels to be sent directly to
Aigcpanel is a visual workflow automation tool and model lifecycle manager designed for generative AI media pipelines. It provides a unified interface to install, launch, and configure both local and remote AI model endpoints, acting as an orchestration platform for large language models and AI tools. The system features a drag-and-drop node editor for chaining AI models and scripts into automated processing pipelines. It distinguishes itself with a breakpoint-aware execution model that allows users to pause and resume long media tasks from specific points in the workflow. Additionally, it in
Wan2.1 is a generative video synthesis framework that provides foundation models for creating high-fidelity video sequences and static images from descriptive text prompts. The system utilizes a unified architecture trained on both static and dynamic datasets, allowing it to function as a comprehensive tool for visual media creation. The framework distinguishes itself through a transformer-based temporal modeling approach that ensures structural coherence and consistent motion across video frames. It supports multi-resolution latent scaling, enabling the generation of content in various aspec
Ten Framework is a multimodal large language model agent framework designed for building low-latency conversational agents. It integrates voice, text, and visual inputs in real time to facilitate human interaction. The project includes a real-time speech processing pipeline for streaming transcription, voice activity detection, and speaker diarization. It also features an avatar synchronization engine that coordinates character lip animations and visual outputs with synthesized speech. The framework covers edge AI deployment through containerized packaging and direct integration with embedde
This project is a deep learning image restoration tool designed to remove scratches, fading, and noise from aged photographs and film. It utilizes generative adversarial networks for image translation, alongside specialized networks for face enhancement and video colorization. The system distinguishes itself through a combination of latent-space domain mapping and progressive face enhancement to recover blurred or missing high-frequency facial details. For video content, it employs a colorization framework that uses optical flow and temporal guidance to propagate color from selected keyframes
Open-Higgsfield-AI is a generative AI content studio and visual workflow orchestrator. It provides a unified interface for creating photorealistic images and videos, utilizing a node-based editor to chain multiple image, video, and audio models into automated content pipelines. The system functions as an AI video animation tool and local GPU inference engine, allowing users to run generative models on local hardware or remote servers. It includes specialized capabilities for audio-driven lip synchronization and cinematic camera controls to adjust virtual lens and focal settings. The platform
BasicSR is a PyTorch-based image restoration toolbox and framework designed for training and deploying deep learning models to upscale, denoise, and deblur images and videos. It serves as a comprehensive system for image super-resolution and video quality restoration, providing the necessary infrastructure to recover fine visual details and increase pixel density. The project distinguishes itself through specialized toolkits for facial image enhancement and high-fidelity face synthesis, as well as a dedicated video quality restoration suite that utilizes deformable convolutions and generative
Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control. The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
HivisionIDPhotos is an AI-powered identification photo generator designed to automate the creation of standardized portraits. It utilizes machine learning to handle alignment, cropping, and background removal, transforming regular images into official identification photographs. The system features a background removal tool that uses offline inference to isolate subjects and a portrait enhancement tool that applies beauty filters to improve facial appearance and skin quality. To prepare photos for physical use, it includes a print layout generator that arranges processed images into standard
OutfitAnyone is a diffusion-based virtual try-on system and AI person-garment integration tool. It functions as an image-to-image clothing transfer model designed to visualize how specific clothing items look on any person regardless of their pose. The system adapts garment textures and shapes to a person's body and pose to produce photorealistic results. It specifically focuses on adjusting clothing deformation based on body shape to maintain high fidelity and detail consistency during the fitting process. The project covers AI fashion visualization and virtual garment fitting, providing ca
Facechain is a generative AI toolchain and portrait generator designed to create personalized synthetic identities and consistent digital portraits. It provides a pipeline for training and refining diffusion models to produce subject-driven image synthesis from reference photos. The project focuses on digital twin generation, enabling the creation of a personalized model from a single image to maintain identity consistency across various poses and artistic styles. It utilizes identity fusion and similarity sorting to balance facial accuracy with stylized visual effects. The toolkit covers a
This project is a Stable Diffusion WebUI extension that provides a graphical interface for personalized portrait generation and AI photo editing. It allows users to train custom identity models from a small set of uploaded images to create consistent digital versions of specific people. The extension includes a virtual try-on system that replaces clothing in images by aligning reference garments with template bodies. It also features tools for face swapping in both static images and videos, as well as a portrait animator that transforms static images into dynamic videos using reference-guided
Deokyun Kim, Minseon Kim, Gihyun Kwon*, and Dae-shik Kim, Progressive Face Super-Resolution via Attention to Facial Landmark, The British Machine Vision Conference 2019 (BMVC 2019)