# fishaudio/bert-vits2

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/fishaudio-bert-vits2).**

8,761 stars · 1,292 forks · Python · AGPL-3.0

## Links

- GitHub: https://github.com/fishaudio/Bert-VITS2
- awesome-repositories: https://awesome-repositories.com/repository/fishaudio-bert-vits2.md

## Topics

`agent` `bert` `bert-vits` `bert-vits2` `fish` `fish-speech` `llm` `tts` `vits` `vits2` `vocoder`

## Description

Bert-VITS2 is a neural speech synthesis system and AI voice generator that converts written text into natural-sounding audio. It uses a VITS2 synthesis model to produce high-fidelity, human-like voices.

The system incorporates a multilingual BERT language processor to improve the prosody and emotional accuracy of the generated speech. It supports multilingual voice generation and custom voice cloning to replicate specific human speech patterns and tones.

The architecture performs text-to-speech synthesis through a multi-stage pipeline of phoneme alignment, stochastic duration prediction, and waveform synthesis. It combines variational inference with a HiFi-GAN neural vocoder to transform text sequences into synthetic audio.
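
The duration step in a pipeline like this can be illustrated with a minimal sketch of length regulation: each phoneme-level state is repeated for as many frames as its predicted duration before waveform synthesis. The function names, the log-duration convention, and the `length_scale` parameter are assumptions made for illustration, not the repository's actual API.

```python
import math
from itertools import chain


def durations_from_log(log_durations, length_scale=1.0):
    """Turn predicted log-durations into whole frame counts.

    Assumption for illustration: the predictor emits log-durations, and
    `length_scale` (hypothetical parameter) slows down or speeds up speech.
    """
    return [math.ceil(math.exp(d) * length_scale) for d in log_durations]


def length_regulate(phoneme_states, durations):
    """Repeat each phoneme-level state `durations[i]` times (frame-level expansion)."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    return list(
        chain.from_iterable([state] * d for state, d in zip(phoneme_states, durations))
    )


# Hypothetical input: three phonemes with made-up log-durations.
durations = durations_from_log([0.0, 0.5, 1.0])
frames = length_regulate(["HH", "AY", "T"], durations)
print(durations)  # [1, 2, 3]
print(frames)     # ['HH', 'AY', 'AY', 'T', 'T', 'T']
```

In a real model the states would be encoder vectors rather than phoneme labels, and a stochastic predictor would sample the log-durations instead of receiving them as fixed inputs.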

## Tags

### Artificial Intelligence & ML

- [Synthetic Speech Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/multimodal-processing-tools/synthetic-speech-generation.md) — Provides high-fidelity synthetic speech generation by converting written text into natural-sounding audio. ([source](https://github.com/fishaudio/bert-vits2#readme))
- [Neural Vocoders](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-tokenization/waveform-decoders/neural-vocoders.md) — Ships a HiFi-GAN neural vocoder to convert mel-spectrograms into high-fidelity audio waveforms.
- [Speech Synthesis Models](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis-models.md) — Implements a neural speech synthesis model for generating high-quality, human-like voices from text.
- [Multilingual Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis-models/multilingual-synthesis.md) — Features a multilingual synthesis architecture capable of generating spoken audio in multiple languages.
- [Multi-Stage Synthesis Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-audio-synthesis/multi-stage-synthesis-pipelines.md) — Employs a multi-stage synthesis pipeline that sequentially processes text through linguistic analysis, duration prediction, and waveform synthesis.
- [Text-to-Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech-synthesis.md) — Converts written text into natural-sounding synthetic speech using neural voice models.
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Supports custom voice cloning to replicate specific human speech patterns and tones.
- [Variational Autoencoders](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training/variational-autoencoders.md) — Implements a variational autoencoder to model latent speech distributions for more natural audio synthesis.
- [Prosodic Duration Predictors](https://awesome-repositories.com/f/artificial-intelligence-ml/prosodic-duration-predictors.md) — Includes a stochastic duration predictor to ensure natural speech rhythm and avoid robotic timing.
- [Phoneme-Based Speech Processors](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/phoneme-based-speech-processors.md) — Implements a phoneme-based speech processing pipeline that leverages BERT for improved prosody and timing.
- [Semantic Embedding Extractors](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers/bert-integrations/semantic-embedding-extractors.md) — Uses a multilingual BERT processor to extract semantic embeddings for improved emotional accuracy and prosody.
- [Synthetic Voice Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-assistants/voice-personalization/synthetic-voice-generators.md) — Functions as an AI voice generator with support for multiple languages and natural intonation.

### Scientific & Mathematical Computing

- [Normalizing Flow Layers](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/statistics-probability/probability-distributions/probability-distribution-transformations/normalizing-flow-layers.md) — Uses normalizing flow layers to transform simple probability distributions into the more complex distributions needed to model natural-sounding speech.
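
As a rough illustration of what a normalizing flow layer does, the sketch below implements a toy affine coupling layer in plain Python: half of the input passes through unchanged and conditions an invertible shift-and-scale of the other half. The `toy_conditioner` function is a hypothetical stand-in for a neural network, and nothing here is taken from the repository's code.

```python
import math


def toy_conditioner(xa):
    """Hypothetical stand-in for a neural net: maps one half to (shift, log_scale)."""
    shift = [0.5 * a for a in xa]
    log_scale = [math.tanh(a) for a in xa]
    return shift, log_scale


def coupling_forward(x, conditioner):
    """Affine coupling: keep the first half, transform the second half."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch expects an even-length input")
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    shift, log_scale = conditioner(xa)
    yb = [b * math.exp(s) + t for b, s, t in zip(xb, log_scale, shift)]
    # The log-determinant of the Jacobian is the sum of the log-scales.
    return xa + yb, sum(log_scale)


def coupling_inverse(y, conditioner):
    """Exact inverse: the untouched half lets us recompute shift and scale."""
    if len(y) % 2 != 0:
        raise ValueError("this sketch expects an even-length input")
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    shift, log_scale = conditioner(ya)
    xb = [(b - t) * math.exp(-s) for b, s, t in zip(yb, log_scale, shift)]
    return ya + xb


x = [0.3, -1.2, 2.0, 0.7]
y, log_det = coupling_forward(x, toy_conditioner)
x_rec = coupling_inverse(y, toy_conditioner)
print(all(abs(a - b) < 1e-9 for a, b in zip(x, x_rec)))  # True
```

Because each layer is exactly invertible and its log-determinant is cheap to compute, several such layers can be stacked to reshape a simple distribution while keeping the likelihood tractable.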
