# rvc-boss/gpt-sovits

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/rvc-boss-gpt-sovits).**

58,724 stars · 6,427 forks · Python · MIT

## Links

- GitHub: https://github.com/RVC-Boss/GPT-SoVITS
- awesome-repositories: https://awesome-repositories.com/repository/rvc-boss-gpt-sovits.md

## Topics

`text-to-speech` `tts` `vits` `voice-clone` `voice-cloneai` `voice-cloning`

## Description

GPT-SoVITS is a text-to-speech (TTS) engine and voice cloning toolkit for generating natural-sounding human speech. It works as a neural audio pipeline that maps input text to audio waveforms, using a conditional variational autoencoder and a flow-based decoder as its acoustic model.
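To make the idea of a flow-based decoder more concrete, here is a minimal sketch of an affine coupling layer, one common building block of normalizing flows. It is a generic illustration, not code from the GPT-SoVITS repository: the `conditioner` function is a hypothetical stand-in for a learned network, and `cond` stands in for whatever conditioning signal (for example, text features) a real model would supply.

```python
import math
from typing import List, Tuple


def conditioner(x_a: List[float], cond: List[float], n_out: int) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for a learned network.

    Returns a log-scale and a shift for each of the n_out transformed elements.
    """
    h = sum(x_a) + sum(cond)
    log_scale = [math.tanh(0.1 * h * (i + 1)) for i in range(n_out)]
    shift = [0.5 * h + i for i in range(n_out)]
    return log_scale, shift


def coupling_forward(x: List[float], cond: List[float]) -> Tuple[List[float], float]:
    """Transform the second half of x, conditioned on the first half and on cond."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = conditioner(x_a, cond, len(x_b))
    y_b = [xb * math.exp(s) + t for xb, s, t in zip(x_b, log_scale, shift)]
    # The log-determinant of the Jacobian is the sum of the log-scales.
    log_det = sum(log_scale)
    return x_a + y_b, log_det


def coupling_inverse(y: List[float], cond: List[float]) -> List[float]:
    """Invert coupling_forward exactly; the first half passes through unchanged."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = conditioner(y_a, cond, len(y_b))
    x_b = [(yb - t) * math.exp(-s) for yb, s, t in zip(y_b, log_scale, shift)]
    return y_a + x_b


x = [0.3, -1.2, 0.7, 2.0]
cond = [0.1, 0.4]
y, log_det = coupling_forward(x, cond)
x_rec = coupling_inverse(y, cond)
```

Because the first half of the input passes through untouched, the inverse can recompute the same scale and shift and undo the transform exactly, which is what lets a flow map between latent variables and audio features in both directions.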

Its distinguishing features are few-shot voice cloning and cross-lingual speech generation: a speaker's vocal identity and emotional delivery can be preserved across multiple languages. Cross-modal latent alignment maps text-based linguistic features to speaker-specific embeddings, and a GAN-based vocoder converts the generated spectral features into the final time-domain waveform.

The pipeline is modular and covers the full lifecycle of a custom voice model: data preprocessing, fine-tuning on small datasets, and inference. Self-supervised speech representation models extract discrete linguistic units from audio, which supports voice conversion and automated audio content creation. The project documents model training, inference, and command-line usage.
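One way to picture "discrete linguistic units" is nearest-neighbour vector quantization: each continuous feature frame is replaced by the index of its closest codebook entry. The sketch below is a toy illustration under that assumption, not the project's actual extraction code; the two-dimensional `codebook` and `frames` values are made up, whereas real self-supervised features have hundreds of dimensions and a learned codebook.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical toy data: three codebook entries and four feature frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.5, 0.2], [0.6, 0.6]]

units = quantize(frames, codebook)
print(units)  # [0, 1, 2, 1]
```

The resulting integer sequence is far more compact than the raw features and can be treated like a token stream by downstream models.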

## Tags

### Artificial Intelligence & ML

- [Acoustic Models](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/acoustic-models.md) — Translates linguistic input into audio features using a conditional variational autoencoder and flow-based decoder.
- [Cross-Lingual Speech Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/cross-lingual-speech-generators.md) — Produces fluent multi-language audio output while maintaining the unique vocal characteristics of a specific target speaker.
- [Voice Cloning Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/voice-cloning-tools.md) — Clones voices by processing custom audio samples through fine-tuned neural network architectures.
- [Synthetic Speech Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/multimodal-processing-tools/synthetic-speech-generation.md) — Replicates human vocal tone and cadence to create natural-sounding synthetic speech from written text.
- [Self-Supervised Speech Representations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/self-supervised-speech-representations.md) — Extracts linguistic features from raw audio using self-supervised models to support voice synthesis and conversion.
- [Fine-Tuning Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/fine-tuning-pipelines.md) — Adapts pre-trained models to specific personas or characters using targeted training on small audio datasets.
- [Cross-Modal Alignment Models](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/cross-modal-alignment-models.md) — Maps text-based linguistic features to speaker-specific embeddings to enable zero-shot style transfer.
- [Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-fine-tuning.md) — Enables performance optimization and model adaptation through structured fine-tuning procedures. ([source](https://github.com/RVC-Boss/GPT-SoVITS/tree/main/docs/tr))

### Graphics & Multimedia

- [Text-to-Speech Engines](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing/text-to-speech-engines.md) — Converts written text into natural-sounding human speech via an integrated neural audio synthesis engine.
- [Neural Audio Pipelines](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing-frameworks/neural-audio-pipelines.md) — Facilitates an end-to-end workflow for training, fine-tuning, and deploying custom voice models.
- [Neural Vocoders](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-synthesis/neural-vocoders.md) — Transforms generated spectral data into high-fidelity time-domain audio waveforms using specialized neural models.

### Part of an Awesome List

- [Generative Media Tools](https://awesome-repositories.com/f/awesome-lists/ai/generative-media-tools.md) — Few-shot voice cloning and TTS model.
- [Speech Processing](https://awesome-repositories.com/f/awesome-lists/media/speech-processing.md) — Few-shot voice conversion and TTS system.