# fishaudio/fish-speech

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/fishaudio-fish-speech).**

24,928 stars · 2,075 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/fishaudio/fish-speech
- Homepage: https://speech.fish.audio
- awesome-repositories: https://awesome-repositories.com/repository/fishaudio-fish-speech.md

## Topics

`llama` `transformer` `tts` `valle` `vits` `vqgan` `vqvae`

## Description

This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It uses a two-stage autoregressive transformer architecture: one stage predicts semantic tokens and the other reconstructs acoustic detail, which balances linguistic accuracy against audio quality. The system supports multilingual output and conversational AI development, generating context-aware speech that maintains flow across multiple dialogue turns.

The project ships a production-ready inference server that uses continuous batching to maximize hardware utilization and reduce latency. Its voice cloning toolkit replicates a speaker's vocal characteristics from short reference audio samples without additional model training. Output can be customized further through low-rank adaptation (LoRA) fine-tuning for efficient style adjustments, and through speaker-specific token embeddings that keep voices distinct during multi-speaker generation.

Beyond core synthesis, the project provides utilities for training and alignment, including reinforcement learning techniques that optimize for semantic accuracy and instruction adherence. It offers several operational interfaces: a command-line tool, a web-based dashboard, and an authenticated HTTP server for remote generation workloads. It also includes data preparation and serialization tools for organizing and normalizing audio datasets before model training.

## Tags

### Artificial Intelligence & ML

- [Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis.md) — Utilizes a two-stage autoregressive transformer to produce high-fidelity audio output. ([source](https://speech.fish.audio))
- [Speech Synthesis Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis-engines.md) — Provides a deep learning architecture that converts text into high-fidelity human speech.
- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Generates natural-sounding human speech from text with precise control over tone and quality.
- [Autoregressive Transformers](https://awesome-repositories.com/f/artificial-intelligence-ml/autoregressive-transformers.md) — Separates semantic prediction from acoustic reconstruction to balance linguistic accuracy and audio fidelity.
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Replicates human voices using short reference audio samples to capture timbre and emotional style. ([source](https://speech.fish.audio))
- [Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/model-fine-tuning.md) — Supports fine-tuning models using low-rank adaptation to adjust speech patterns and merge weights. ([source](https://speech.fish.audio/finetune/))
- [Conversational AI](https://awesome-repositories.com/f/artificial-intelligence-ml/conversational-ai.md) — Enables expressive speech generation by utilizing context from previous conversational turns. ([source](https://speech.fish.audio))
- [Parameter Efficient Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/parameter-efficient-fine-tuning.md) — Enables efficient style customization by training and merging small adapter layers.
- [Training Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/training-pipelines.md) — Provides a comprehensive suite for dataset preparation, fine-tuning, and reinforcement learning alignment.
- [Voice Cloning Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning-toolkits.md) — Captures and replicates unique vocal characteristics from short audio samples without additional training.
- [Multi-Speaker Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/multi-speaker-synthesis.md) — Supports generating audio with multiple speakers in a single pass using speaker-specific tokens. ([source](https://speech.fish.audio))
- [Multilingual Speech Models](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-speech-models.md) — Handles multiple languages and complex conversational contexts without language-specific phoneme conversion.
- [Reinforcement Learning Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/reinforcement-learning-alignment.md) — Refines speech models using reward-based evaluation of semantic accuracy and acoustic quality. ([source](https://speech.fish.audio))
- [Speaker Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-embeddings.md) — Uses dedicated identifiers to manage and switch between distinct voice characteristics.
- [Audio Tokenization](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-tokenization.md) — Converts raw audio waveforms into compact numerical representations for training and generation.
- [Feature Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-extraction.md) — Converts raw audio waveforms into discrete numerical representations for training and generation pipelines. ([source](https://speech.fish.audio/finetune/))
- [Multilingual Models](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-models.md) — Enables speech generation in multiple languages without requiring complex language-specific preprocessing. ([source](https://speech.fish.audio))
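
The fine-tuning entries above mention training small adapter layers and merging them into the model weights. As a rough illustration of what such a merge computes, here is a minimal sketch of the standard LoRA update, `W' = W + (alpha / r) * B @ A`, written in pure Python. The function and variable names are hypothetical and are not taken from the fish-speech codebase; the project's own tooling may organise this differently.

```python
# Minimal LoRA merge sketch in pure Python (no external dependencies).
# All names are illustrative assumptions, not fish-speech APIs.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / r) * (lora_b @ lora_a).

    weight: d_out x d_in frozen base matrix
    lora_a: r x d_in adapter matrix
    lora_b: d_out x r adapter matrix
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update, d_out x d_in
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# A 2x3 base weight with a rank-1 adapter.
base = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in  = 1 x 3
B = [[0.5], [-0.5]]     # d_out x r = 2 x 1

merged = merge_lora(base, A, B, alpha=2.0)
print(merged)
```

After merging, the adapter adds no extra cost at inference time, because the update is folded into a single weight matrix of the original shape.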

### DevOps & Infrastructure

- [Inference Servers](https://awesome-repositories.com/f/devops-infrastructure/inference-servers.md) — Delivers low-latency audio generation through optimized model serving strategies.
- [Inference Optimization](https://awesome-repositories.com/f/devops-infrastructure/inference-optimization.md) — Implements continuous batching to maximize hardware utilization and reduce latency in production.
- [Audio Serving](https://awesome-repositories.com/f/devops-infrastructure/audio-serving.md) — Deploys scalable speech generation services requiring high throughput and fast response times.
- [Model Serving](https://awesome-repositories.com/f/devops-infrastructure/model-serving.md) — Optimizes audio delivery using continuous batching and prefix caching for low-latency production inference. ([source](https://speech.fish.audio))
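
The entries above refer to continuous batching. The idea can be sketched with a toy scheduler: instead of waiting for a whole batch to finish, a free slot is refilled from the queue before every decode step. This is a simplified simulation under assumed names (`continuous_batching`, `max_batch`); it does not reflect how the fish-speech server is actually implemented.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate step-level scheduling.

    requests: list of (request_id, tokens_to_generate)
    max_batch: maximum number of sequences decoded in one step
    Returns a dict mapping request_id to the step at which it finished.
    """
    waiting = deque(requests)
    active = {}        # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Fill free slots before every decode step, not only between batches.
        while waiting and len(active) < max_batch:
            req_id, length = waiting.popleft()
            active[req_id] = length
        step += 1
        # One decode step produces one token for every active sequence.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                finished_at[req_id] = step
                del active[req_id]  # the slot is free for the next step
    return finished_at


jobs = [("a", 2), ("b", 5), ("c", 2), ("d", 1)]
result = continuous_batching(jobs, max_batch=2)
print(result)
```

In this toy run, the slot freed by `a` after step 2 is taken by `c` at step 3, and all four jobs finish within 5 steps. Running the same jobs as two fixed batches of two would take 5 + 2 = 7 steps, since each batch waits for its longest sequence.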

### Web Development

- [HTTP Servers](https://awesome-repositories.com/f/web-development/http-servers.md) — Includes an HTTP server for handling text-to-speech requests and securing model endpoints. ([source](https://speech.fish.audio/server/))
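
To make the idea of an authenticated text-to-speech endpoint concrete, the sketch below builds (but does not send) a JSON `POST` request using only the Python standard library. The endpoint path `/v1/tts`, the port, the `text` payload field, and the bearer-token header are all assumptions made for illustration; check the server documentation at https://speech.fish.audio/server/ for the actual routes and authentication scheme.

```python
import json
import urllib.request


def build_tts_request(base_url, api_key, text):
    """Build (but do not send) an authenticated JSON POST request.

    The path, payload shape, and auth header are illustrative assumptions.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/tts",  # assumed path, not confirmed by this page
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


req = build_tts_request("http://127.0.0.1:8080", "YOUR_API_KEY", "Hello there.")
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen(req)` would send it; that step is left out here so the example runs without a server.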