ChatTTS | Awesome Repository

ChatTTS is a conversational text-to-speech generative model designed to convert written dialogue into natural-sounding audio. It functions as a multilingual speech synthesis framework capable of producing human-like audio across different languages and speaker profiles.

The system is distinguished by its ability to generate interactive dialogue with realistic vocal nuances. A speech nuance controller inserts specific tokens that trigger non-verbal elements, such as laughter, pauses, and interjections, during synthesis.
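
To make the idea of nuance tokens concrete, here is a minimal sketch of a text pre-processing step that marks up dialogue before synthesis. It does not call ChatTTS itself: the function name add_nuance_tokens and the token spellings "[laugh]" and "[break]" are assumptions made for illustration, and the project's real token vocabulary may differ.

```python
import re

# Hypothetical token spellings, chosen for illustration only.
LAUGH = "[laugh]"
PAUSE = "[break]"


def add_nuance_tokens(text: str, laugh_after: tuple = ()) -> str:
    """Insert a pause token after commas and a laugh token after cue words."""
    # A comma followed by whitespace becomes comma + pause token.
    marked = re.sub(r",\s+", f", {PAUSE} ", text)
    for cue in laugh_after:
        # Match the cue as a whole word and keep trailing punctuation attached.
        pattern = rf"\b({re.escape(cue)})\b([.!?]?)"
        marked = re.sub(pattern, rf"\1\2 {LAUGH}", marked)
    return marked


print(add_nuance_tokens("Well, that was funny.", laugh_after=("funny",)))
```

A rule-based pass like this is only one way to place the tokens; they could equally be written by hand in the script or predicted by a model.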

The project includes a streaming audio generator that delivers speech incrementally to reduce latency. It further supports multi-speaker embeddings to maintain consistent vocal characteristics throughout a conversation.
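
The latency benefit of incremental delivery can be sketched with a plain Python generator. This is an illustration under stated assumptions, not the project's implementation: fake_synthesize is a stand-in that emits a sine tone instead of speech, and SAMPLE_RATE, chunk_chars, and stream_speech are hypothetical names introduced here.

```python
import math
from typing import Iterator, List

SAMPLE_RATE = 24_000  # assumed sample rate for this sketch


def fake_synthesize(text: str) -> List[float]:
    """Stand-in for a real model: 10 ms of a 220 Hz tone per character."""
    n = len(text) * SAMPLE_RATE // 100
    return [math.sin(2 * math.pi * 220 * i / SAMPLE_RATE) for i in range(n)]


def stream_speech(text: str, chunk_chars: int = 8) -> Iterator[List[float]]:
    """Yield audio for each slice of text as soon as that slice is ready."""
    for start in range(0, len(text), chunk_chars):
        yield fake_synthesize(text[start:start + chunk_chars])


# The consumer can start playback after the first chunk arrives,
# instead of waiting for the whole utterance.
chunks = list(stream_speech("Hello there, how are you?"))
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} samples")
```

Because the generator yields after each slice, the time to first audio depends on the chunk size rather than on the length of the full text.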

Features

Conversational Audio Streams - Provides a generative model for natural, interactive multi-speaker dialogue, delivered as streamed audio.
Audio Tokenization - Converts raw audio waveforms into discrete numerical codes for processing by the language model.
Autoregressive Transformers - Implements an autoregressive transformer architecture to predict audio tokens for sequential speech generation.
Prosody Control Tokens - Inserts specialized control tokens to trigger non-verbal vocal behaviors like laughter and pauses.
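
The tokenization and autoregressive features above can be illustrated with a toy sketch. Two assumptions apply: mu-law quantization is swapped in as a simple, well-known way to turn samples into discrete codes (the model's actual tokenizer is not described here and is likely a learned codec), and toy_model is a hypothetical stand-in for a transformer's next-token prediction.

```python
import math
from typing import Callable, List


def mu_law_encode(samples: List[float], levels: int = 256) -> List[int]:
    """Quantize samples in [-1, 1] into integer codes in [0, levels - 1]."""
    mu = levels - 1
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range samples
        # Compress the amplitude logarithmically, preserving the sign.
        y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
        # Map [-1, 1] to [0, mu] and round to the nearest integer code.
        codes.append(int((y + 1) / 2 * mu + 0.5))
    return codes


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int], steps: int, stop: int) -> List[int]:
    """Autoregressive loop: each new token is predicted from all prior ones."""
    tokens = list(prompt)
    for _ in range(steps):
        tok = next_token(tokens)
        if tok == stop:  # end-of-sequence token halts generation
            break
        tokens.append(tok)
    return tokens


def toy_model(context: List[int]) -> int:
    """Hypothetical predictor; a real model would run a transformer here."""
    return (context[-1] + 3) % 10


print(mu_law_encode([-1.0, 0.0, 1.0]))
print(generate(toy_model, prompt=[1], steps=5, stop=0))
```

The loop structure is the relevant part: the predictor sees the full token history at every step, which is what makes generation sequential.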
