# bytedance/megatts3

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/bytedance-megatts3).**

6,066 stars · 469 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/bytedance/MegaTTS3
- awesome-repositories: https://awesome-repositories.com/repository/bytedance-megatts3.md

## Topics

`research`

## Description

MegaTTS3 is a bilingual speech synthesis system that generates natural-sounding speech in Chinese and English, including code-switching within a single utterance. It serves as a text-to-speech engine, a voice cloning system, and a speech-text alignment tool. It is built around an acoustic latent compression model that encodes high-resolution audio into compact representations for efficient processing.

The system distinguishes itself through accent intensity control, which adjusts how strongly a speaker's accent comes through in generated speech, and through voice cloning from short audio samples for personalized synthesis. It provides both a command-line interface, which generates speech without a graphical environment, and a web-based inference UI for uploading voice samples and producing speech from text in a browser. It also includes a speech-text aligner trained on pseudo-labels generated by expert models.
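As a rough illustration of the command-line workflow, the sketch below assembles an invocation that passes a voice sample and text as arguments. The script path `tts/infer_cli.py` and the flags `--input_wav`, `--input_text`, `--output_dir`, `--p_w`, and `--t_w` are assumptions not stated on this page, as is the idea that the two weight flags are the configurable weights behind accent control; check them against the repository README before use. The helper `build_tts_command` is hypothetical and only builds the argument list, it does not run the model.

```python
import shlex
from typing import List, Optional


def build_tts_command(
    prompt_wav: str,
    text: str,
    output_dir: str,
    p_w: Optional[float] = None,
    t_w: Optional[float] = None,
    script: str = "tts/infer_cli.py",  # assumed script path, verify in the README
) -> List[str]:
    """Build an argument list for a MegaTTS3-style CLI call (hypothetical helper)."""
    # Flag names below are assumptions, not confirmed by this page.
    cmd = [
        "python", script,
        "--input_wav", prompt_wav,
        "--input_text", text,
        "--output_dir", output_dir,
    ]
    # Optional weights, assumed here to be the "configurable weight parameters".
    if p_w is not None:
        cmd += ["--p_w", str(p_w)]
    if t_w is not None:
        cmd += ["--t_w", str(t_w)]
    return cmd


cmd = build_tts_command("prompt.wav", "Hello, 世界", "./gen", p_w=2.0, t_w=3.0)
# Print a shell-quoted form that could be pasted into a terminal.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps mixed Chinese and English text intact as one argument, and the list can be handed to `subprocess.run` once the flag names are confirmed.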

Additional capabilities include grapheme-to-phoneme conversion for improved pronunciation accuracy and audio reconstruction with a latent diffusion transformer. The compressed acoustic latents also support efficient storage and downstream voice conversion tasks.

## Tags

### Artificial Intelligence & ML

- [Bilingual Speech Synthesizers](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/bilingual-speech-synthesizers.md) — Generates natural-sounding speech in Chinese and English, including code-switching within a single utterance.
- [Latent Space Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/latent-space-generative-models/latent-space-projections/latent-space-encoders.md) — Encodes high-quality audio into a compact latent representation that can be reconstructed with minimal loss. ([source](https://github.com/bytedance/MegaTTS3#readme))
- [Zero-Shot Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/zero-shot-voice-cloning.md) — Replicates a speaker's voice using only a brief audio reference for personalized speech synthesis.
- [Grapheme To Phoneme Conversion](https://awesome-repositories.com/f/artificial-intelligence-ml/grapheme-to-phoneme-conversion.md) — Converts written text into phonetic representations for improved pronunciation accuracy.
- [Acoustic-Text Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-transcription/acoustic-text-alignment.md) — Aligns spoken audio to its corresponding text using a robust aligner trained on pseudo-labels from expert models. ([source](https://github.com/bytedance/MegaTTS3#readme))
- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Converts written text into natural-sounding speech using a lightweight diffusion transformer model. ([source](https://github.com/bytedance/MegaTTS3#readme))
- [Latent Acoustic Mapping](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/latent-acoustic-mapping.md) — Encodes high-resolution audio into a compact latent representation for efficient model training and voice conversion.
- [Acoustic Latent Compressors](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/latent-acoustic-mapping/acoustic-latent-compressors.md) — Encodes high-resolution audio into a compact latent representation for efficient model training and voice conversion. ([source](https://github.com/bytedance/MegaTTS3/blob/main/readme.md))
- [Command-Line Speech Synthesizers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/local-speech-synthesis/command-line-speech-synthesizers.md) — Accepts a voice sample and text as arguments to produce speech output without a graphical interface. ([source](https://github.com/bytedance/MegaTTS3/blob/main/readme.md))
- [Speech Latent](https://awesome-repositories.com/f/artificial-intelligence-ml/transformer-architectures/diffusion-transformers/speech-latent.md) — Encodes speech into a compact latent space and reconstructs audio using a diffusion-based transformer decoder.
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Replicates a speaker's voice characteristics using only a brief audio reference, enabling personalized speech synthesis. ([source](https://github.com/bytedance/MegaTTS3#readme))
- [Speech Accent Transformation](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning/voice-identity-conversions/speech-accent-transformation.md) — Provides accent intensity control by scaling learned accent embeddings during inference.
- [Command-Line Speech Synthesizers](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/command-line-speech-synthesizers.md) — Runs speech synthesis from a command line by providing a voice sample and text as arguments.
- [Web-Based Speech Synthesizers](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/web-based-speech-synthesizers.md) — Provides a browser-based UI for uploading voice samples and generating speech from text.
- [Web-Based Speech Inference UIs](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-clients/browser-based-inference/web-based-speech-inference-uis.md) — Provides a browser interface for uploading voice samples and generating speech from text.
- [Accent Intensity Controllers](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning/voice-identity-conversions/speech-accent-transformation/accent-intensity-controllers.md) — Adjusts the strength of a speaker's accent in the generated speech through configurable weight parameters. ([source](https://github.com/bytedance/MegaTTS3#readme))

### Graphics & Multimedia

- [Text-to-Speech Engines](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing/text-to-speech-engines/text-to-speech-engines.md) — Converts written text into natural-sounding speech using a lightweight diffusion transformer model.

### Part of an Awesome List

- [Bilingual Code-Switching](https://awesome-repositories.com/f/awesome-lists/devtools/switches/bilingual-code-switching.md) — Supports seamless switching between Chinese and English within a single utterance.
- [Bilingual Synthesizers](https://awesome-repositories.com/f/awesome-lists/more/speech-and-audio-processing/bilingual-synthesizers.md) — Generates speech in Chinese and English, including code-switching within a single utterance.

### Data & Databases

- [Speech-Text Pseudo-Label Aligners](https://awesome-repositories.com/f/data-databases/label-based-data-selection/metadata-labelers/model-assisted-labelers/pseudo-labeling-iterators/speech-text-pseudo-label-aligners.md) — Aligns spoken audio with its corresponding text transcription using a robust aligner trained on pseudo-labels from expert models.

### Development Tools & Productivity

- [Command Line Interfaces](https://awesome-repositories.com/f/development-tools-productivity/command-line-interfaces.md) — Ships a command-line interface for generating speech from text and voice samples.
