# yl4579/styletts2

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/yl4579-styletts2).**

6,294 stars · 691 forks · Python · MIT

## Links

- GitHub: https://github.com/yl4579/StyleTTS2
- awesome-repositories: https://awesome-repositories.com/repository/yl4579-styletts2.md

## Topics

`adversarial-training` `deep-learning` `diffusion-models` `gan` `latent-diffusion` `latent-diffusion-models` `pytorch` `speaker-adaptation` `speech-synthesis` `text-to-speech` `tts` `wavlm`

## Description

StyleTTS2 is a text-to-speech model that generates natural-sounding speech from text by combining style diffusion with adversarial training against large pre-trained speech language models. The adversarial training improves speech quality and reduces artifacts. Speaking style is handled in two ways: prosodic and timbral features can be extracted from reference audio, or a style can be sampled by the diffusion process from the text alone.

The model supports multi-speaker voice synthesis by conditioning the diffusion process on speaker-specific embeddings derived from reference utterances, enabling voice cloning and adaptation to new speakers with minimal data. It offers style-controllable text-to-speech through diffusion-based sampling, and can generate speech directly from text without requiring a reference audio sample at inference time. The system uses a two-stage training pipeline that first trains on aligned data, then fine-tunes with unaligned data using style diffusion and adversarial loss.
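
As a rough illustration of the reference-free path, the toy sketch below draws a style vector by starting from Gaussian noise and denoising it step by step, conditioned only on a text embedding. It is not the repository's sampler or network: `toy_denoiser`, `sample_style`, the noise schedule, and the three-dimensional embedding are all hypothetical, and the denoiser simply returns the text embedding where a trained model would predict a clean style vector.

```python
import random


def toy_denoiser(noisy_style, sigma, text_embedding):
    """Hypothetical stand-in for a trained denoising network.

    A real model would predict the clean style vector from the noisy one,
    the noise level, and the text. This toy version ignores the noisy input
    and returns the text embedding itself.
    """
    return list(text_embedding)


def sample_style(text_embedding, sigmas, rng):
    """Denoise from noise level sigmas[0] down to sigmas[-1] with Euler steps."""
    # Start from pure Gaussian noise at the highest noise level.
    x = [rng.gauss(0.0, sigmas[0]) for _ in text_embedding]
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        denoised = toy_denoiser(x, sigma, text_embedding)
        # Direction pointing from the denoised estimate toward the noisy sample.
        direction = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        # Move to the next (lower) noise level along that direction.
        x = [xi + (sigma_next - sigma) * di for xi, di in zip(x, direction)]
    return x


# Assumed values for illustration only.
text_embedding = [0.5, -1.0, 2.0]
sigmas = [10.0, 5.0, 2.0, 1.0, 0.5, 0.0]  # decreasing noise schedule ending at 0

style = sample_style(text_embedding, sigmas, random.Random(0))
print(style)
```

Because the stand-in denoiser always predicts the same vector, every seed ends at the text embedding. With a trained denoiser, different starting noise would be expected to yield different styles for the same text.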

Pre-trained model checkpoints are available for loading and running inference, with provided notebooks and importable scripts for generating speech from text. The model can be fine-tuned on a new speaker using a small amount of speech data, or trained from scratch on single or multiple voices using speaker labels and adjustable settings.
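
For multi-speaker training, the speaker labels determine which clips can serve as style references. The sketch below shows one way that bookkeeping might look, assuming a pipe-separated data list of the form `path|transcription|speaker`. That layout, the file names, and the helper functions are assumptions for illustration and do not appear on this page.

```python
import random
from collections import defaultdict


def parse_data_list(lines):
    """Parse 'path|transcription|speaker' lines into (path, text, speaker) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumes the transcription itself contains no '|' character.
        path, text, speaker = line.split("|")
        entries.append((path, text, speaker))
    return entries


def group_by_speaker(entries):
    """Map each speaker label to the list of clip paths recorded by that speaker."""
    by_speaker = defaultdict(list)
    for path, _text, speaker in entries:
        by_speaker[speaker].append(path)
    return dict(by_speaker)


def sample_reference(path, speaker, by_speaker, rng):
    """Pick a reference clip from the same speaker, avoiding the target clip if possible."""
    candidates = [p for p in by_speaker[speaker] if p != path]
    # Fall back to the speaker's full list when the target is their only clip.
    return rng.choice(candidates or by_speaker[speaker])


# Hypothetical data list with two speakers.
DATA_LIST = [
    "wavs/alice_001.wav|Hello there.|alice",
    "wavs/alice_002.wav|How are you?|alice",
    "wavs/bob_001.wav|Good morning.|bob",
]

entries = parse_data_list(DATA_LIST)
by_speaker = group_by_speaker(entries)
rng = random.Random(0)

# Pair every training clip with a same-speaker reference clip.
pairs = [(path, sample_reference(path, spk, by_speaker, rng)) for path, _t, spk in entries]
for target, reference in pairs:
    print(target, "->", reference)
```

A single-speaker dataset fits the same structure with one label shared by every line, in which case any other clip can act as the reference.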

## Tags

### Artificial Intelligence & ML

- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Generates natural-sounding speech from text input without needing a reference audio sample.
- [Text-To-Speech Models](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech-models.md) — An adversarial text-to-speech model that uses style diffusion and large speech language models to generate natural-sounding speech from text input.
- [TTS Adversarial Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-adversarial-networks/generative-adversarial-active-learning/tts-adversarial-frameworks.md) — Improves speech quality and naturalness through adversarial training with large speech language models.
- [Style Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/image-generation/audio-style-transfers/style-encoders.md) — Extracts prosodic and timbral style features from a short reference audio clip to guide the diffusion-based speech generation.
- [Adversarial Speech Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/training-frameworks/model-training-frameworks/text-to-speech-model-training/adversarial-speech-training.md) — Uses a discriminator trained on representations from a large pre-trained speech model to improve naturalness and reduce artifacts.
- [Two-Stage Training Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/training-frameworks/model-training-frameworks/text-to-speech-model-training/two-stage-training-pipelines.md) — First trains a text-to-speech model on aligned data, then fine-tunes with unaligned data using style diffusion and adversarial loss.
- [Multi-Speaker Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/voice-synthesis/modular-voice-configurations/voice-synthesizer-training/multi-speaker-training.md) — Supports training on multiple voices by using speaker labels to sample reference audio for style diffusion during training.
- [Speaker Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-embeddings.md) — Controls voice identity by conditioning the diffusion process on speaker-specific embeddings derived from reference utterances.
- [Reference-Free Style Controls](https://awesome-repositories.com/f/artificial-intelligence-ml/text-generation-controls/acoustic-style-controls/reference-free-style-controls.md) — Controls speaking style through diffusion-based sampling without requiring a reference audio sample at inference time.
- [Adversarial Speech Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/adversarial-speech-generators.md) — Employs adversarial training with large speech language models to improve speech quality and naturalness.
- [Diffusion-Based Speech Style Transfers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/speech-style-transfer/diffusion-based-speech-style-transfers.md) — Produces expressive speech with varied styles by leveraging diffusion-based style transfer from reference audio.
- [Style-Conditioned Diffusion Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/transformer-architectures/diffusion-transformers/speech-latent/style-conditioned-diffusion-decoders.md) — Generates speech by iteratively denoising a latent representation conditioned on style embeddings extracted from reference audio.
- [Pre-trained Weight Loading](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/vision-transformer-pre-training/pre-trained-model-checkpoints/pre-trained-weight-loading.md) — Loads a pre-trained model checkpoint and generates speech from text without requiring a reference audio sample.
- [Speech Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/fine-tuning-and-alignment/fine-tuning-frameworks/speech-model-fine-tuning.md) — Adapts a pre-trained multi-speaker model to a new speaker using a small amount of speech data for reduced training time.
- [Few-Shot Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/voice-synthesis/modular-voice-configurations/voice-synthesizer-training/multi-speaker-training/few-shot-voice-cloning.md) — Adapts a pre-trained model to a new speaker using a small amount of speech data for personalized voice generation.
- [Speaker Adaptation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/voice-synthesis/modular-voice-configurations/voice-synthesizer-training/multi-speaker-training/speaker-adaptation.md) — Adapts a pre-trained multi-speaker model to a new speaker using a small amount of speech data to reduce training time. ([source](https://cdn.jsdelivr.net/gh/yl4579/styletts2@main/README.md))
- [Multi-Speaker Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/multi-speaker-synthesis.md) — Generates speech in multiple voices by sampling reference audio for style diffusion during training and inference.
- [Inference Scripts](https://awesome-repositories.com/f/artificial-intelligence-ml/pre-trained-speech-models/inference-scripts.md) — Provides notebooks and scripts to load pre-trained speech models and generate speech from text. ([source](https://cdn.jsdelivr.net/gh/yl4579/styletts2@main/README.md))

### Graphics & Multimedia

- [Text-to-Speech Synthesizers](https://awesome-repositories.com/f/graphics-multimedia/text-to-speech-synthesizers.md) — Converts text into natural-sounding synthetic speech using adversarial training and style diffusion. ([source](https://cdn.jsdelivr.net/gh/yl4579/styletts2@main/README.md))
