# netease-youdao/emotivoice

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/netease-youdao-emotivoice).**

8,446 stars · 745 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/netease-youdao/EmotiVoice
- awesome-repositories: https://awesome-repositories.com/repository/netease-youdao-emotivoice.md

## Topics

`ai` `deep-learning` `emotion` `emotivoice` `multi-speaker` `prompt` `python` `pytorch` `speech` `speech-synthesis` `style` `text-to-speech` `tts`

## Description

EmotiVoice is an emotional text-to-speech engine that synthesizes speech in English and Chinese. Its deep learning architecture produces high-fidelity speech with controllable emotion and timbre.

The project includes a voice cloning framework for replicating specific speaker identities by training custom acoustic models on personal audio datasets. It employs a jointly-trained acoustic-vocoder pipeline and style-embedding-based synthesis to manage expression and reduce audio artifacts.

Supporting features include grapheme-to-phoneme conversion for bilingual text, voice model fine-tuning, and mel spectrogram visualization for quality monitoring. Users can generate audio through a web-based synthesis dashboard, a command line interface, or a self-hosted HTTP API.
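On the Chinese side, grapheme-to-phoneme conversion is described in the tags as number normalization followed by pinyin conversion. The sketch below illustrates only the number normalization idea, under simplifying assumptions: it is not the code in `frontend_cn.py`, the function names `int_to_chinese` and `normalize_numbers` are hypothetical, and the rule of reading long or zero-prefixed digit runs one digit at a time is an assumption made for illustration. Pinyin conversion is left out because it would need a pronunciation dictionary.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def int_to_chinese(n: int) -> str:
    """Read an integer in 0..9999 as Chinese numerals (hypothetical helper)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    digits = [int(c) for c in str(n)]
    parts = []
    pending_zero = False
    for i, d in enumerate(digits):
        unit = UNITS[len(digits) - 1 - i]
        if d == 0:
            # Remember the gap; a single "零" is emitted only if a
            # non-zero digit follows later.
            pending_zero = True
            continue
        if pending_zero and parts:
            parts.append(DIGITS[0])
        pending_zero = False
        parts.append(DIGITS[d] + unit)
    text = "".join(parts)
    # 10..19 are read "十", "十一", ... rather than "一十", "一十一", ...
    if text.startswith("一十"):
        text = text[1:]
    return text


def normalize_numbers(text: str) -> str:
    """Replace ASCII digit runs with Chinese numerals (assumed rules)."""

    def repl(match):
        s = match.group()
        leading_zero = len(s) > 1 and s[0] == "0"
        if len(s) <= 4 and not leading_zero:
            return int_to_chinese(int(s))
        # Long or zero-prefixed runs (phone numbers, codes) are read
        # digit by digit in this sketch.
        return "".join(DIGITS[int(c)] for c in s)

    return re.sub(r"[0-9]+", repl, text)
```

For example, `normalize_numbers("我有15本书")` rewrites the digits as `十五`, so a later pinyin step only ever sees Chinese characters.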

The environment can be deployed as a containerized service using Docker for consistent execution across different systems.
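A client for the self-hosted HTTP API could look roughly like the sketch below. It uses only the Python standard library. The endpoint path `/tts`, the port, and the JSON field names `text`, `voice`, and `emotion` are hypothetical placeholders rather than the project's documented interface, so check the project's HTTP API wiki page for the real route and parameters before using it.

```python
import json
import urllib.request


def build_tts_request(base_url, text, voice, emotion="", path="/tts"):
    """Build (but do not send) a POST request for a self-hosted TTS server.

    The path and JSON field names are hypothetical placeholders.
    """
    payload = {"text": text, "voice": voice, "emotion": emotion}
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request, out_path, timeout=60):
    """Send the request and save the response body, assumed to be audio bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With a container running locally, usage would be along the lines of `synthesize(build_tts_request("http://localhost:8000", "Hello there", "speaker_1", "happy"), "out.wav")`, where the port, speaker name, and output format are again assumptions. Keeping request construction separate from sending makes the payload easy to inspect without a running server.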

## Tags

### Artificial Intelligence & ML

- [Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis.md) — Synthesizes seamless spoken audio from a mix of Chinese and English text.
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Implements a framework for replicating specific speaker identities by training custom acoustic models on personal audio datasets. ([source](https://github.com/netease-youdao/EmotiVoice#readme))
- [Joint Acoustic-Vocoder Training](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/acoustic-models/joint-acoustic-vocoder-training.md) — Employs a jointly-trained acoustic-vocoder pipeline to produce high-fidelity audio with reduced artifacts.
- [Expressive Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/expressive-synthesis.md) — Uses style and emotional embeddings to control the timbre and expression of generated speech.
- [Grapheme To Phoneme Conversion](https://awesome-repositories.com/f/artificial-intelligence-ml/grapheme-to-phoneme-conversion.md) — Transforms Chinese text into phonetic representations via number normalization and pinyin conversion. ([source](https://github.com/netease-youdao/EmotiVoice/blob/main/frontend_cn.py))
- [Acoustic Model Trainers](https://awesome-repositories.com/f/artificial-intelligence-ml/language-model-trainers/acoustic-model-trainers.md) — Provides joint training for acoustic models and vocoders to ensure high-fidelity synthetic audio generation. ([source](https://github.com/netease-youdao/EmotiVoice/blob/main/train_am_vocoder_joint.py))
- [Voice Model Trainers](https://awesome-repositories.com/f/artificial-intelligence-ml/language-model-trainers/voice-model-trainers.md) — Utilizes a deep learning architecture to align text with high-fidelity emotional expressions.
- [Synthetic Speech Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/multimodal-processing-tools/synthetic-speech-generation.md) — Produces high-fidelity synthetic speech by replicating vocal characteristics based on specific speaker profiles. ([source](https://github.com/netease-youdao/EmotiVoice/tree/main/data/DataBaker))
- [Voice Synthesizer Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/voice-synthesis/modular-voice-configurations/voice-synthesizer-training.md) — Processes custom audio datasets and transcriptions to train models on specific speaker characteristics.
- [Phoneme-Based Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-alignment-models/phoneme-based-alignment/phoneme-based-pipelines.md) — Implements a pipeline to transform raw bilingual text into phonetic representations for synthesis.
- [Emotional Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis/emotional-synthesis.md) — Generates synthetic audio in English and Chinese with controllable emotional states like happiness or sadness. ([source](https://github.com/netease-youdao/EmotiVoice/blob/main/predict.py))
- [Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-fine-tuning.md) — Supports improving emotional expression by adapting pre-trained synthetic voices using custom datasets and alignment. ([source](https://github.com/netease-youdao/EmotiVoice/tree/main/data/DataBaker))
- [Speech Synthesis Services](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis-services.md) — Operates as a containerized web server exposing speech synthesis capabilities through an HTTP interface.
- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Provides a self-hosted web service via Docker for programmatic text-to-speech generation.
- [Local Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/local-speech-synthesis.md) — Allows users to generate synthetic speech locally via a desktop application without an internet connection. ([source](https://github.com/netease-youdao/EmotiVoice/wiki/HTTP-API))
- [Training Dataset Preparation](https://awesome-repositories.com/f/artificial-intelligence-ml/training-dataset-preparation.md) — Includes utilities to organize datasets and initialize model checkpoints specifically for voice model training. ([source](https://github.com/netease-youdao/EmotiVoice/blob/main/prepare_for_training.py))
- [Synthetic Voice Design](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning/voice-identity-conversions/synthetic-voice-design.md) — Provides a library of diverse speaker identities and gender profiles to define the characteristics of generated speech. ([source](https://github.com/netease-youdao/EmotiVoice/wiki/%F0%9F%98%8A-voice-wiki-page))

### Part of an Awesome List

- [Bilingual Synthesizers](https://awesome-repositories.com/f/awesome-lists/more/speech-and-audio-processing/bilingual-synthesizers.md) — Acts as a bilingual synthesis engine processing mixed Chinese and English text into seamless audio.

### Graphics & Multimedia

- [Emotional Modulation](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-processing/audio-emotion-classifiers/emotional-modulation.md) — Generates synthetic audio that conveys specific human emotions like happiness or sadness.
- [Emotional TTS Engines](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing/text-to-speech-engines/emotional-tts-engines.md) — Generates synthetic audio in English and Chinese with controllable emotional states.
- [Text-to-Speech Engines](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing/text-to-speech-engines/text-to-speech-engines.md) — Provides a system to convert written text into spoken audio via remote server requests. ([source](https://github.com/netease-youdao/EmotiVoice/blob/main/README.zh.md))

### Development Tools & Productivity

- [Command Line Interfaces](https://awesome-repositories.com/f/development-tools-productivity/command-line-interfaces.md) — Provides a command line interface for generating synthetic audio from text. ([source](https://github.com/netease-youdao/EmotiVoice/blob/main/ROADMAP.md))

### DevOps & Infrastructure

- [Speech API Hosting](https://awesome-repositories.com/f/devops-infrastructure/speech-api-hosting.md) — Ships an HTTP interface to expose synthetic voice generation programmatically to external applications. ([source](https://github.com/netease-youdao/EmotiVoice/blob/main/ROADMAP.md))

### Software Engineering & Architecture

- [API Wrappers](https://awesome-repositories.com/f/software-engineering-architecture/api-wrappers.md) — Exposes the internal speech engine via a web server wrapper for remote programmatic use.

### User Interface & Experience

- [Web Dashboards](https://awesome-repositories.com/f/user-interface-experience/web-dashboards.md) — Ships an interactive browser-based dashboard for performing text-to-speech synthesis without writing code. ([source](https://github.com/netease-youdao/EmotiVoice/blob/main/README.zh.md))