Orpheus TTS

Orpheus-TTS is an open-source text-to-speech system that generates human-like audio with controllable emotional tone and the ability to clone voices from short audio samples. It is built on an architecture that treats speech generation as a language modeling task, using a large language model trained on text-speech pairs to produce audio tokens autoregressively.
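To make "speech generation as a language modeling task" concrete, here is a minimal sketch of an autoregressive loop. It is not the project's code: `toy_model`, the `END_OF_SPEECH` id, and the token values are hypothetical stand-ins for a trained model and its vocabulary.

```python
from typing import Callable, List, Sequence

END_OF_SPEECH = 0  # hypothetical stop-token id, chosen for this sketch


def generate_audio_tokens(
    text_tokens: Sequence[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = 32,
) -> List[int]:
    """Predict one audio token at a time until a stop token or the length cap."""
    context = list(text_tokens)
    audio_tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == END_OF_SPEECH:
            break
        audio_tokens.append(token)
        context.append(token)  # each new token becomes input for the next step
    return audio_tokens


def toy_model(context: List[int]) -> int:
    """Stand-in for a trained LLM: a deterministic function of the context."""
    if len(context) >= 8:
        return END_OF_SPEECH
    return 1 + (sum(context) % 5)


audio = generate_audio_tokens([3, 4], toy_model)
```

The point of the sketch is the feedback step: every generated token is appended to the context before the next prediction, which is what "autoregressive" means here.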

The system distinguishes itself through several key capabilities. It supports emotion-controllable speech synthesis by embedding emotional and intonation markers directly into text prompts, allowing the model to condition its output on expressive cues. It also offers low-latency streaming, outputting audio tokens incrementally as they are generated for real-time playback with approximately 200ms latency. Additionally, the model can be fine-tuned to custom voices using standard language model training pipelines and small paired datasets, and it supports zero-shot voice cloning that replicates a speaker's voice from reference audio without requiring any training.
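The streaming behaviour described above can be sketched as a generator that emits an audio chunk as soon as enough tokens have arrived, rather than waiting for the full sequence. This is an illustration under assumptions: `stream_audio`, `fake_decode`, and the chunk size of 4 are hypothetical, and a real system would call its codec's decoder where `fake_decode` appears.

```python
from typing import Callable, Iterable, Iterator, List


def stream_audio(
    token_stream: Iterable[int],
    decode_chunk: Callable[[List[int]], bytes],
    tokens_per_chunk: int = 4,
) -> Iterator[bytes]:
    """Yield decoded audio as soon as each group of tokens is complete."""
    buffer: List[int] = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) == tokens_per_chunk:
            yield decode_chunk(buffer)
            buffer = []
    if buffer:
        yield decode_chunk(buffer)  # flush a trailing partial chunk


def fake_decode(tokens: List[int]) -> bytes:
    """Stand-in for a codec decoder: one byte per token."""
    return bytes(t % 256 for t in tokens)


chunks = list(stream_audio(range(10), fake_decode))
```

Because the function is a generator, a player can start on the first chunk while later tokens are still being produced, which is where the latency saving comes from.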

The project covers emotional speech generation, selection among multiple voice personas to vary the character of the output, and natural speech synthesis with realistic intonation and rhythm. It also provides voice cloning and customization, including emotional tone control and voice model fine-tuning.
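As a rough illustration of persona selection combined with prompt-embedded emotion markers, the sketch below builds a prompt string from a speaker name and tagged text. The voice names, the tag vocabulary, the angle-bracket tag syntax, and the `voice: text` layout are all assumptions made for this example; consult the project's documentation for the values it actually accepts.

```python
import re

# Hypothetical vocabularies, for illustration only.
KNOWN_VOICES = {"narrator", "assistant"}
KNOWN_TAGS = {"laugh", "sigh"}

TAG_PATTERN = re.compile(r"<([a-z]+)>")


def build_prompt(voice: str, text: str) -> str:
    """Prefix the text with a speaker name after validating any inline tags."""
    if voice not in KNOWN_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    unknown = [tag for tag in TAG_PATTERN.findall(text) if tag not in KNOWN_TAGS]
    if unknown:
        raise ValueError(f"unknown emotion tags: {unknown}")
    return f"{voice}: {text.strip()}"


prompt = build_prompt("narrator", "Well <sigh> that took a while. ")
```

Validating tags before generation is a design choice in this sketch: an unrecognized marker would otherwise be treated as ordinary text and possibly read aloud.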

Features

Autoregressive Speech Language Models - Uses a large language model trained on text-speech pairs to generate audio tokens autoregressively, treating speech as a language modeling task.

Emotional Synthesis - An open-source speech synthesis model that generates human-like audio with controllable emotional tone and voice cloning from short audio samples.

Audio Tokenization - Encodes raw audio into discrete tokens via a neural codec, enabling the model to predict speech sequences like text tokens.

Zero-Shot Voice Cloning - Replicates a speaker's voice from reference audio without fine-tuning, using the model's learned acoustic representations.

Custom Voice Adapters - Adapts a pretrained text-to-speech model to a custom voice using a small dataset of text-speech pairs and standard LLM training tools.

Custom Voice Adaptations - Adapting a pretrained text-to-speech model to a custom voice using a small dataset of text-speech pairs and standard LLM training tools.

Custom Voice Fine-Tuning - A pretrained text-to-speech model that can be adapted to custom voices using standard LLM training tools and small datasets.

Natural Intonation Models - Produces speech with natural intonation, emotion, and rhythm that rivals closed-source models for realistic audio output.

Voice Cloning Engines - A text-to-speech engine that replicates a speaker's voice from existing audio without requiring any training or fine-tuning.

Prompt-Embedded Emotion Tags - Embeds emotional and intonation markers directly into the text prompt, allowing the model to condition output on expressive cues.

Voice Fine-Tuning Pipelines - Adapts the pretrained model to a custom voice using standard language model training pipelines and small paired datasets.

Low-Latency Audio Streams - Outputs audio chunks incrementally as they are generated, achieving ~200 ms latency for real-time playback.

Audio Stream Outputs - A speech generation system that outputs audio chunks incrementally with ~200ms latency for real-time playback.

Incremental Audio Token Decoding - Outputs audio tokens incrementally as they are generated, enabling low-latency playback before the full sequence is complete.

Voice Identity Selections - Offers a set of predefined speaker names to choose from, varying the conversational realism and character of the generated speech.

Persona-Based Voice Selections - Choosing among predefined speaker names to vary the conversational realism and character of the generated speech.
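The Audio Tokenization entry above says a neural codec turns raw audio into discrete tokens. A toy way to see the idea is nearest-neighbour vector quantization: each audio frame is replaced by the index of its closest codebook vector. This is a simplified sketch, not the project's codec; real neural codecs learn the encoder and the codebook, whereas the `codebook` and `frames` values here are made up.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))
        for frame in frames
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Recover an approximation of the frames from their token ids."""
    return [list(codebook[t]) for t in tokens]


codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7)]
tokens = quantize(frames, codebook)
```

Once audio is a sequence of integer ids like `tokens`, a language model can predict it the same way it predicts text tokens, and `dequantize` shows the lossy step back toward a signal.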

canopyai/Orpheus-TTS
