MegaTTS3 | Awesome Repository

MegaTTS3 is a bilingual speech synthesis system that generates natural-sounding speech in Chinese and English, including seamless code-switching within a single utterance. It functions as a text-to-speech engine, voice cloning system, and speech-to-text alignment tool, built around an acoustic latent compression model that encodes high-resolution audio into compact representations for efficient processing.

The system distinguishes itself through accent intensity control, allowing adjustment of a speaker's accent strength in generated speech, and voice cloning from short audio samples for personalized synthesis. It provides both a command-line interface for automated speech generation without a graphical environment and a web-based inference UI for browser-driven voice sample upload and text-to-speech output. A pseudo-label aligner trains text-speech alignment models using expert-generated labels for robust alignment.

Additional capabilities include grapheme-to-phoneme conversion for improved pronunciation accuracy and audio reconstruction based on a latent diffusion transformer. The system also compresses speech into acoustic latents for efficient storage and downstream voice conversion tasks.
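As a rough illustration of what code-switching support involves on the text side, the sketch below splits a mixed Chinese and English utterance into per-language runs, so that each run could be handed to a language-specific grapheme-to-phoneme step. This is not taken from MegaTTS3: the function name `split_by_language` and the character-range heuristic are assumptions made for the example, and the project's actual front end may work quite differently.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is treated as Chinese.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def split_by_language(text):
    """Split text into (lang, segment) runs, where lang is 'zh' or 'en'.

    Hypothetical helper for illustration; not part of MegaTTS3.
    Spaces, digits, and punctuation are attached to the current run.
    """
    runs = []
    for ch in text:
        if _CJK.match(ch):
            lang = "zh"
        elif ch.isascii() and ch.isalpha():
            lang = "en"
        else:
            lang = None  # neutral character

        if lang is None:
            if runs:
                runs[-1][1] += ch
            else:
                runs.append([None, ch])
        elif runs and runs[-1][0] in (lang, None):
            # Same language, or a run that so far held only neutral characters.
            runs[-1][0] = lang
            runs[-1][1] += ch
        else:
            runs.append([lang, ch])

    return [(lang, seg.strip()) for lang, seg in runs if seg.strip()]


if __name__ == "__main__":
    for lang, segment in split_by_language("我喜欢 machine learning 和音乐"):
        print(lang, segment)
```

A production system would likely also need to handle digits, abbreviations, and punctuation conventions per language; the heuristic here only shows the basic idea of routing each run to the right pronunciation model.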

Features

Bilingual Speech Synthesizers - Generates natural-sounding speech in Chinese and English, including code-switching within a single utterance.
Text-to-Speech Engines - Converts written text into natural-sounding speech using a lightweight diffusion transformer model.
Latent Space Encoders - Encodes high-quality audio into a compact latent representation that can be reconstructed with minimal loss.
