EmotiVoice | Awesome Repository

EmotiVoice is a bilingual text-to-speech engine that synthesizes speech in English and Chinese. It uses a deep learning architecture to produce high-fidelity speech with controllable emotion and timbre.

The project includes a voice cloning framework for replicating specific speaker identities by training custom acoustic models on personal audio datasets. It employs a jointly-trained acoustic-vocoder pipeline and style-embedding-based synthesis to manage expression and reduce audio artifacts.

Beyond synthesis, the system handles grapheme-to-phoneme conversion for bilingual text, voice model fine-tuning, and mel spectrogram visualization for quality monitoring. Users can generate audio through a web-based synthesis dashboard, a command line interface, or a self-hosted HTTP API.
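As an illustration of the self-hosted HTTP API route, the following is a minimal client sketch using only the Python standard library. The base URL, the endpoint path, and the JSON field names (input, voice, emotion) are assumptions made for this example and are not confirmed by this page; check the API documentation of your own deployment before relying on them.

```python
import json
import urllib.request

# Hypothetical values: adjust to match your own deployment.
BASE_URL = "http://localhost:8000"
ENDPOINT = "/v1/audio/speech"  # assumed route, not confirmed by this page


def build_tts_request(text, voice, emotion, base_url=BASE_URL):
    """Build (but do not send) a POST request for one synthesis call."""
    # Assumed field names; a real server may expect different keys.
    payload = {"input": text, "voice": voice, "emotion": emotion}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        base_url + ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, voice, emotion, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_tts_request(text, voice, emotion)
    with urllib.request.urlopen(req, timeout=60) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any audio is generated. With a server running, a call such as synthesize_to_file("Hello, 你好", "speaker_a", "happy", "out.wav") would save the response body to disk; the speaker and emotion names here are placeholders.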

The environment can be deployed as a containerized service using Docker for consistent execution across different systems.

Features

Speech Synthesis - Synthesizes seamless spoken audio from a mix of Chinese and English text.
Voice Cloning - Implements a framework for replicating specific speaker identities by training custom acoustic models on personal audio datasets.
Joint Acoustic-Vocoder Training - Employs a jointly-trained acoustic-vocoder pipeline to produce high-fidelity audio with reduced artifacts.
Expressive Synthesis - Uses style and emotional embeddings to control the timbre and expression of generated speech.
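To make the mixed Chinese and English input in the Speech Synthesis feature more concrete, here is a small sketch of how a frontend might split text into language runs before grapheme-to-phoneme conversion. This is an illustrative assumption, not EmotiVoice's actual frontend: the function name split_by_language is hypothetical, and the character ranges cover only basic CJK ideographs and ASCII letters.

```python
import re

# Chinese runs: basic CJK Unified Ideographs block.
# English runs: ASCII letters, allowing inner spaces and apostrophes.
_RUN = re.compile(
    r"(?P<zh>[\u4e00-\u9fff]+)"
    r"|(?P<en>[A-Za-z][A-Za-z' ]*[A-Za-z]|[A-Za-z])"
)


def split_by_language(text):
    """Return a list of (language_tag, segment) pairs in reading order.

    Punctuation, digits, and whitespace between runs are dropped here;
    a real frontend would normalize them instead.
    """
    segments = []
    for match in _RUN.finditer(text):
        # lastgroup is the name of the alternative that matched: "zh" or "en".
        segments.append((match.lastgroup, match.group()))
    return segments


print(split_by_language("我喜欢 machine learning 和 TTS。"))
# [('zh', '我喜欢'), ('en', 'machine learning'), ('zh', '和'), ('en', 'TTS')]
```

Each tagged segment could then be routed to a language-specific phonemizer, with the resulting phoneme sequences concatenated in the original order.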
