# kyutai-labs/pocket-tts

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/kyutai-labs-pocket-tts).**

3,301 stars · 365 forks · Python · MIT

## Links

- GitHub: https://github.com/kyutai-labs/pocket-tts
- awesome-repositories: https://awesome-repositories.com/repository/kyutai-labs-pocket-tts.md

## Description

Pocket-tts is a text-to-speech server and neural speech synthesizer that converts written text into audible speech. It includes a CPU-optimized inference engine and a voice cloning tool capable of analyzing audio samples to reproduce specific speaker characteristics.

The system stands out for its use of dynamic int8 quantization, which reduces memory usage and speeds up generation on CPUs. It supports real-time speech synthesis by streaming audio chunks as they are generated, and it caches voice state by saving processed speaker embeddings as portable files, so a cloned voice does not have to be re-analyzed on every use.
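To make the quantization idea concrete, here is a minimal pure-Python sketch of dynamic int8 quantization. It is not pocket-tts code: the function names (`quantize_int8`, `dynamic_int8_dot`) and the symmetric per-tensor scheme are assumptions chosen for illustration. The weights are quantized once ahead of time, while the activation scale is computed from the live input on each call, which is what makes the scheme "dynamic".

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization onto the int8 range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


def dynamic_int8_dot(
    weights_q: List[int], w_scale: float, activations: List[float]
) -> float:
    """Dot product with pre-quantized weights and activations quantized per call."""
    acts_q, a_scale = quantize_int8(activations)  # scale derived from this input
    acc = sum(w * a for w, a in zip(weights_q, acts_q))  # integer accumulation
    return acc * w_scale * a_scale  # rescale the result back to float


# Illustrative values only.
weights = [0.5, -1.0, 0.25, 0.75]
weights_q, w_scale = quantize_int8(weights)  # done once, ahead of time
activations = [2.0, 1.0, -4.0, 0.5]

approx = dynamic_int8_dot(weights_q, w_scale, activations)
exact = sum(w * a for w, a in zip(weights, activations))
```

Storing each weight as one byte instead of a 32-bit float is where the memory saving comes from; the cost is a small rounding error, visible here as the gap between `approx` and `exact`.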

The project also supports local model hosting and a self-hosted API service for remote audio generation. It provides utilities for initializing models across multiple languages and a native backend that handles computationally intensive synthesis operations.

## Tags

### Artificial Intelligence & ML

- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Implements a high-fidelity neural speech synthesizer for converting written text into spoken audio across multiple languages. ([source](https://cdn.jsdelivr.net/gh/kyutai-labs/pocket-tts@main/README.md))
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Replicates specific human vocal characteristics from audio samples to generate synthetic speech.
- [Voice Cloning Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/voice-cloning-tools.md) — Processes custom audio recordings to extract speaker characteristics for high-quality synthetic speech.
- [CPU Inference Runtimes](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-clients/on-device-inference/cpu-inference-runtimes.md) — Provides a runtime optimized for CPU execution using dynamic int8 quantization for fast speech generation.
- [C++ Inference Backends](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/inference-engines/c-inference-backends.md) — Implements a high-performance synthesis backend written in C++ to handle computationally intensive operations.
- [CPU Inference Quantizers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-quantization/8-bit-inference-quantizers/cpu-inference-quantizers.md) — Utilizes dynamic int8 quantization to reduce memory usage and accelerate inference on CPU hardware.
- [Real-Time Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/real-time-speech-processing/real-time-speech-synthesis.md) — Generates natural-sounding speech output in real time for low-latency interactive applications.
- [Incremental Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis/incremental-synthesis.md) — Provides incremental audio streaming to enable low-latency, real-time playback of synthetic speech. ([source](https://kyutai-labs.github.io/pocket-tts/API%20Reference/python-api/))
- [Embedding Exports](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/portable-voice-models/embedding-exports.md) — Provides voice state caching by exporting analyzed speaker characteristics as portable files to bypass redundant cloning processing. ([source](https://cdn.jsdelivr.net/gh/kyutai-labs/pocket-tts@main/README.md))
- [Quantized Model Deployments](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/local-and-on-device-inference/edge-ai-model-deployment/quantized-model-deployments.md) — Deploys models using low-precision quantization to optimize memory and speed on CPU hardware.
- [Model Performance Optimization](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/profiling-and-benchmarking/model-performance-optimization.md) — Enhances model speed and reduces memory usage through dynamic int8 quantization. ([source](https://kyutai-labs.github.io/pocket-tts/quantization/))
- [Self-Hosted Synthesis Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/local-speech-synthesis/self-hosted-synthesis-servers.md) — Hosts a local web server and API to manage speech models and generate audio on demand.

### Part of an Awesome List

- [Text To Speech](https://awesome-repositories.com/f/awesome-lists/media/text-to-speech.md) — A comprehensive toolkit and server for synthesizing realistic human speech from text.
- [Voice Embedding Precomputations](https://awesome-repositories.com/f/awesome-lists/media/voice-processing/voice-embedding-precomputations.md) — Converts audio samples into reusable embedding files to streamline speech generation. ([source](https://kyutai-labs.github.io/pocket-tts/CLI%20Commands/export_voice/))

### Graphics & Multimedia

- [High-Fidelity Speech Synthesis](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-playback/high-fidelity-audio-streaming/high-fidelity-speech-synthesis.md) — Uses neural vocoders to produce high-fidelity audio from text with support for multiple languages.
- [Text-to-Speech Engines](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing/text-to-speech-engines/text-to-speech-engines.md) — Provides a local engine for converting written text into natural-sounding human speech.
- [Generative Audio Chunking](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-streaming-engines/audio-playback-engines/chunked-audio-streaming/generative-audio-chunking.md) — Enables immediate playback by sequentially yielding audio waveform chunks as they are being generated.
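
The chunked streaming pattern described in these tags can be sketched with a plain Python generator. Everything below is hypothetical: `fake_synthesize` is a stand-in that emits sine tones rather than speech, and `SAMPLE_RATE` and `CHUNK_SAMPLES` are assumed values, not settings taken from pocket-tts. The point is only the control flow, where the consumer handles each chunk as soon as it is yielded instead of waiting for the full waveform.

```python
import io
import math
import struct
import wave
from typing import Iterator

SAMPLE_RATE = 24_000   # assumed value for this sketch
CHUNK_SAMPLES = 1_920  # assumed chunk size (80 ms at the rate above)


def fake_synthesize(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in: yields one 16-bit PCM chunk per word."""
    for index, _word in enumerate(text.split()):
        freq = 220.0 + 40.0 * index  # a different tone per chunk
        samples = (
            int(12_000 * math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
            for n in range(CHUNK_SAMPLES)
        )
        yield b"".join(struct.pack("<h", s) for s in samples)


def stream_to_wav(text: str, sink: io.BytesIO) -> int:
    """Write chunks to a WAV sink as they arrive; return the chunk count."""
    chunks = 0
    with wave.open(sink, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        for chunk in fake_synthesize(text):
            wav.writeframes(chunk)  # each chunk is usable immediately
            chunks += 1
    return chunks


buffer = io.BytesIO()
chunk_count = stream_to_wav("hello pocket tts", buffer)
```

In an interactive application the `writeframes` call would be replaced by handing the chunk to an audio output device or an HTTP response stream, so playback can begin after the first chunk.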

### Data & Databases

- [Embedding Caches](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/caching-performance/caching-strategies/query-result-caching/method-result-caches/embedding-caches.md) — Caches precomputed vector embeddings of voice characteristics to avoid redundant processing during cloning.
- [Voice Feature Caching](https://awesome-repositories.com/f/data-databases/performance-caching-systems/voice-feature-caching.md) — Stores extracted vocal characteristics in local files to accelerate repeated synthesis using the same voice. ([source](https://kyutai-labs.github.io/pocket-tts/API%20Reference/python-api/))
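
The caching idea in these two tags amounts to "compute the voice features once, then reload them from a file". The sketch below illustrates that pattern using only the standard library. It is an assumption-laden illustration, not the project's implementation: `extract_voice_features` is a hypothetical stand-in for the expensive speaker analysis, and JSON is used only to keep the example dependency-free, not because it is the project's file format.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import List

calls = {"extract": 0}  # counts how often the expensive step runs


def extract_voice_features(audio: bytes) -> List[float]:
    """Hypothetical stand-in for the costly speaker analysis step."""
    calls["extract"] += 1
    digest = hashlib.sha256(audio).digest()
    return [b / 255.0 for b in digest[:8]]


def load_or_compute(audio: bytes, cache_dir: Path) -> List[float]:
    """Return cached features for this audio, computing them on a miss."""
    key = hashlib.sha256(audio).hexdigest()  # cache key derived from content
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    features = extract_voice_features(audio)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(features))
    return features


with tempfile.TemporaryDirectory() as tmp:
    sample = b"fake-audio-bytes"  # placeholder for a real recording
    first = load_or_compute(sample, Path(tmp))   # miss: runs the extraction
    second = load_or_compute(sample, Path(tmp))  # hit: read back from disk
```

Because the cache entry is an ordinary file, it can be copied to another machine and reused there, which is the "portable" property the tags describe.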

### Web Development

- [Local API Servers](https://awesome-repositories.com/f/web-development/local-api-servers.md) — Provides a local API server to expose text-to-speech conversion capabilities via HTTP requests. ([source](https://kyutai-labs.github.io/pocket-tts/CLI%20Commands/serve/))
- [Model Inference APIs](https://awesome-repositories.com/f/web-development/model-inference-apis.md) — Exposes model inference functionality through a web server to enable remote audio generation.
