MegaTTS3 is a bilingual speech synthesis system that generates natural-sounding speech in Chinese and English, including code-switching between the two languages within a single utterance. It serves as a text-to-speech engine, a voice cloning system, and a speech-text alignment tool. All three are built around an acoustic latent compression model that encodes audio into compact latent representations for efficient processing.
Two features set the system apart: accent intensity control, which adjusts how strongly a speaker's accent comes through in the generated speech, and voice cloning from short audio samples for personalized synthesis. It offers a command-line interface for automated generation on machines without a graphical environment, and a web-based inference UI where a voice sample is uploaded and text is synthesized from the browser. An aligner component aligns speech with text and can be used to train text-speech alignment models on pseudo-labels generated by expert models.
Additional capabilities include grapheme-to-phoneme conversion for improved pronunciation accuracy and audio reconstruction based on a latent diffusion transformer. The acoustic latents also serve as a compact format for storing speech and as input to downstream voice conversion tasks.
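
To make the code-switching described above concrete, the following sketch splits a mixed Chinese-English sentence into language runs, the kind of segmentation a bilingual front end might perform before grapheme-to-phoneme conversion. It is an illustration only: the helper name `split_language_runs` and the Unicode-range heuristic are assumptions made for this example and are not taken from the MegaTTS3 code base.

```python
import re

# Hypothetical helper for illustration; not part of MegaTTS3.
# A run is either a stretch of CJK Unified Ideographs or a stretch of
# Latin letters (optionally containing spaces and apostrophes inside it).
_RUN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z' ]*[A-Za-z]|[A-Za-z]")


def split_language_runs(text):
    """Return a list of (lang, chunk) pairs, where lang is 'zh' or 'en'."""
    runs = []
    for match in _RUN.finditer(text):
        chunk = match.group()
        # The first character decides the language of the whole run.
        lang = "zh" if "\u4e00" <= chunk[0] <= "\u9fff" else "en"
        runs.append((lang, chunk))
    return runs


if __name__ == "__main__":
    for lang, chunk in split_language_runs("我今天用 Python 写了一个 demo"):
        print(lang, chunk)
```

A production front end would also need to handle digits, punctuation, and characters outside the basic CJK block, all of which this sketch silently drops.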