ChatTTS is a conversational text-to-speech generative model that converts written dialogue into natural-sounding audio. It serves as a multilingual speech synthesis framework that produces human-like audio across different languages and speaker profiles.
What distinguishes the system is its ability to generate interactive dialogue with realistic vocal nuances. A speech nuance controller inserts specific tokens that trigger non-verbal elements, such as laughter, pauses, and interjections, during synthesis.
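The token-insertion idea can be illustrated with a minimal sketch. It does not call the ChatTTS library. The `{marker}` input syntax, the `insert_nuance_tokens` helper, and the token spellings `[laugh]` and `[uv_break]` are assumptions made for illustration, so check the project's own documentation for the tokens it actually accepts.

```python
import re

# Hypothetical marker-to-token table. The token spellings are assumptions
# for this sketch, not a confirmed list of what the model accepts.
NUANCE_TOKENS = {
    "laugh": "[laugh]",
    "pause": "[uv_break]",
}


def insert_nuance_tokens(text: str) -> str:
    """Replace inline markers such as {laugh} with control tokens."""

    def _swap(match: re.Match) -> str:
        name = match.group(1)
        # Unknown markers are left untouched rather than silently dropped.
        return NUANCE_TOKENS.get(name, match.group(0))

    return re.sub(r"\{(\w+)\}", _swap, text)


annotated = insert_nuance_tokens("Well {pause} that was unexpected {laugh}")
print(annotated)
```

Keeping the mapping in a single table means the dialogue text stays readable, and the token vocabulary can be changed in one place if the assumed spellings turn out to differ.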
The project includes a streaming audio generator that delivers speech incrementally to reduce latency. It also supports multi-speaker embeddings, which keep each speaker's vocal characteristics consistent throughout a conversation.
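Both ideas can be sketched in a few lines of plain Python, under stated assumptions. The code below does not use the ChatTTS API. `stream_audio` and `make_speaker_embedding` are hypothetical helpers, the seeded random vector only stands in for a learned speaker embedding, and the sample list stands in for synthesized audio.

```python
import random
from typing import Iterator, List


def make_speaker_embedding(seed: int, dim: int = 8) -> List[float]:
    """Return a reproducible stand-in embedding for one speaker.

    A fixed seed yields the same vector every time, which is the property
    that keeps a voice consistent across turns in this sketch.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def stream_audio(samples: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield audio in fixed-size chunks instead of one full buffer."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


# One embedding per speaker, created once and reused for every turn.
speakers = {
    "alice": make_speaker_embedding(seed=1),
    "bob": make_speaker_embedding(seed=2),
}

# Placeholder samples standing in for synthesized audio.
fake_audio = [0.0] * 10
for chunk in stream_audio(fake_audio, chunk_size=4):
    # A real consumer could start playback here, before synthesis finishes.
    print(len(chunk))
```

Because the consumer receives the first chunk as soon as it is ready, playback latency depends on the chunk size rather than on the length of the whole utterance.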