# FunAudioLLM/CosyVoice

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/funaudiollm-cosyvoice).**

19,637 stars · 2,216 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/FunAudioLLM/CosyVoice
- Homepage: https://funaudiollm.github.io/cosyvoice3
- awesome-repositories: https://awesome-repositories.com/repository/funaudiollm-cosyvoice.md

## Topics

`audio-generation` `cantonese` `chatbot` `chatgpt` `chinese` `cosyvoice` `cross-lingual` `english` `fine-grained` `fine-tuning` `gpt-4o` `japanese` `korean` `multi-lingual` `natural-language-generation` `python` `text-to-speech` `tts` `voice-cloning`

## Description

CosyVoice is a speech synthesis framework that uses large language models to generate expressive, multilingual audio. It produces natural-sounding speech across multiple languages while preserving regional dialects and specific emotional tones.

Its distinguishing feature is zero-shot voice cloning: it builds a synthetic voice profile from a short audio sample without additional model training. It also offers fine-grained control over prosody, pacing, volume, and breathing to achieve realistic output, and it supports phoneme-level alignment and latent space conditioning to modulate emotional personas and ensure precise pronunciation.

The system uses reinforcement learning to iteratively refine output quality and align it with human-perceived speech standards. Users can also adapt the model to a custom speaker to improve voice similarity and consistency for specialized production requirements.
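
Voice similarity between a reference speaker and a cloned voice is commonly scored as the cosine similarity of their speaker embeddings. The following is a minimal sketch of that metric only; the four-dimensional vectors are made-up stand-ins for real embeddings, and nothing here is taken from CosyVoice's own code or evaluation setup.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical speaker embeddings (illustrative values, not model output).
reference = [0.2, 0.7, -0.1, 0.4]      # target speaker
cloned = [0.25, 0.65, -0.05, 0.35]     # synthesized speech, same speaker
unrelated = [-0.6, 0.1, 0.8, -0.2]     # a different speaker

print(round(cosine_similarity(reference, cloned), 3))
print(round(cosine_similarity(reference, unrelated), 3))
```

A score close to 1 indicates that the two embeddings point in nearly the same direction, which is the sense in which speaker adaptation would be expected to "improve voice similarity".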

## Tags

### Artificial Intelligence & ML

- [Neural Text-to-Speech Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/neural-text-to-speech-engines.md) — Functions as a speech synthesis framework using large language models to generate expressive, multilingual audio.
- [Zero-Shot Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/zero-shot-voice-cloning.md) — Enables the creation of synthetic voice profiles from short audio samples without requiring additional model training.
- [Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis.md) — Generates natural-sounding speech across multiple languages while preserving regional dialects and specific emotional tones.
- [Expressive Synthesis Models](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-generation-models/expressive-synthesis-models.md) — Implements a neural synthesis architecture that modulates vocal attributes to produce speech with customizable emotional personas.
- [Autoregressive Models](https://awesome-repositories.com/f/artificial-intelligence-ml/autoregressive-models.md) — Generates speech by predicting sequences of discrete acoustic tokens using a transformer architecture.
- [Prosody Control Tokens](https://awesome-repositories.com/f/artificial-intelligence-ml/latent-conditioning-mechanisms/prosody-control-tokens.md) — Injects emotional and stylistic vector representations into the model to modulate prosody and tone.
- [Multilingual Speech Models](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-speech-models.md) — Generates natural-sounding audio supporting multiple languages and mixed-lingual content. ([source](https://funaudiollm.github.io/cosyvoice3/))
- [Prosody Controls](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/prosody-controls.md) — Provides fine-grained control over vocal attributes like breathing, pacing, and volume to produce realistic, human-like speech output.
- [Cross-Modal Alignment Models](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/cross-modal-alignment-models.md) — Conditions speech generation on both text input and reference audio embeddings to align synthetic output with target speaker characteristics.
- [Reinforcement Learning Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/reinforcement-learning-alignment.md) — Refines model outputs using reward signals to optimize for naturalness and human-perceived speech quality.
- [Phoneme-Based Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-alignment-models/phoneme-based-alignment.md) — Maps text inputs to specific phonetic sequences to ensure precise pronunciation and prosodic rendering.
- [Custom Model Adapters](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-model-adapters.md) — Refines base speech generation models for specific target speakers to improve voice similarity and consistency.
- [Prosody Modulation Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/prosody-modulation-tools.md) — Provides controls to modify the emotional tone and prosodic style of generated audio. ([source](https://funaudiollm.github.io/cosyvoice3/))
- [Speech Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/fine-tuning-and-alignment/fine-tuning-frameworks/speech-model-fine-tuning.md) — Refines base speech generation models for specific target speakers to improve voice similarity. ([source](https://funaudiollm.github.io/cosyvoice3/))
- [Dialectal Synthesis Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/dialectal-synthesis-engines.md) — Generates audio while preserving specific phonetic and tonal characteristics of regional dialects. ([source](https://funaudiollm.github.io/cosyvoice3/))
- [Phonetic Pronunciation Overrides](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/phonetic-pronunciation-overrides.md) — Allows overriding default speech output with explicit phoneme sequences for precise pronunciation. ([source](https://funaudiollm.github.io/cosyvoice3/))
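
To make the idea of a phonetic pronunciation override concrete, here is a small sketch of a text front end that lets an explicit phoneme sequence replace the default lookup for a word. The inline `word[PH ON EMES]` markup, the tiny lexicon, and the `to_phonemes` helper are all hypothetical and chosen for illustration; CosyVoice's actual input syntax may differ.

```python
import re

# Hypothetical markup: a word followed by its phonemes in square brackets,
# e.g. "read[R EH D]". This syntax is an assumption made for the sketch.
TOKEN = re.compile(r"\w+\[[^\]]+\]|\w+")
OVERRIDE = re.compile(r"(\w+)\[([^\]]+)\]")

# Tiny stand-in lexicon with ARPAbet-style symbols. A real front end would
# use a full pronunciation dictionary or a grapheme-to-phoneme model.
DEFAULT_LEXICON = {
    "i": ["AY"],
    "read": ["R", "IY", "D"],
    "it": ["IH", "T"],
}


def to_phonemes(text, lexicon=DEFAULT_LEXICON):
    """Return a flat phoneme list, honouring inline overrides."""
    phonemes = []
    for token in TOKEN.findall(text):
        match = OVERRIDE.fullmatch(token)
        if match:
            # Explicit phonemes win over the lexicon entry for this word.
            phonemes.extend(match.group(2).split())
        else:
            word = token.lower()
            # Unknown words are passed through, marked with angle brackets.
            phonemes.extend(lexicon.get(word, [f"<{word}>"]))
    return phonemes


print(to_phonemes("I read it."))
print(to_phonemes("I read[R EH D] it."))
```

The second call selects the past-tense pronunciation of "read" without touching the lexicon, which is the kind of per-utterance control an override mechanism provides.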

### Graphics & Multimedia

- [Emotional Modulation](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-processing/audio-emotion-classifiers/emotional-modulation.md) — Generates multilingual speech while applying specific emotional tones for engaging communication. ([source](https://funaudiollm.github.io/cosyvoice3/))
- [Neural Vocoders](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-synthesis/neural-vocoders.md) — Converts raw acoustic tokens into high-fidelity waveforms using deep learning models.
