EmotiVoice is a bilingual text-to-speech engine that synthesizes emotional speech in English and Chinese. It uses a deep learning architecture to produce high-fidelity audio with controllable emotional state and timbre.
The project includes a voice cloning framework that replicates a specific speaker's identity by training a custom acoustic model on that speaker's audio recordings. It combines a jointly trained acoustic model and vocoder with style-embedding-based synthesis to control expression and reduce audio artifacts.
The system covers several speech processing tasks: grapheme-to-phoneme conversion for bilingual text, voice model fine-tuning, and mel spectrogram visualization for quality monitoring. Users can generate audio through a web-based synthesis dashboard, a command-line interface, or a self-hosted HTTP API.
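As a rough illustration of calling a self-hosted HTTP API, the sketch below builds a JSON POST request that carries the text, a speaker, and an emotion. The endpoint path (`/tts`), the port, and the field names (`text`, `speaker`, `emotion`) are hypothetical placeholders chosen for this example, not the project's documented interface; check the API documentation of the server you run for the real route and parameters.

```python
import json
import urllib.request


def build_tts_request(text, speaker, emotion, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for a self-hosted TTS server.

    The "/tts" path and the payload field names are assumptions made for
    illustration only; they are not taken from the project's documentation.
    """
    payload = {"text": text, "speaker": speaker, "emotion": emotion}
    return urllib.request.Request(
        url=f"{base_url}/tts",
        # ensure_ascii=False keeps Chinese text readable in the JSON body.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical speaker ID and emotion label.
    req = build_tts_request("Hello, 你好", speaker="demo_speaker", emotion="happy")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To send the request, pass it to `urllib.request.urlopen(req)` and write the response bytes to an audio file, assuming the server returns raw audio in the response body.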
EmotiVoice can also be deployed as a containerized service with Docker, so that it runs consistently across different systems.