Kokoro-FastAPI

Kokoro-FastAPI is a text-to-speech API and LLM speech synthesis server that generates spoken audio from text via a REST interface. It is packaged for Kubernetes, so the synthesis service can be deployed and orchestrated inside a cluster.

The system includes a voice blending engine that creates unique vocal profiles by mixing multiple existing voices using custom weight ratios.
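As a rough illustration of weighted voice mixing, the sketch below averages voice embedding vectors using normalized weights. It is a minimal sketch, not the project's implementation: the function name `blend_voices`, the voice names, and the toy three-dimensional vectors are all hypothetical, and real embeddings would be much larger tensors.

```python
from typing import Dict, List


def blend_voices(voices: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Return the weighted average of the named voice embedding vectors.

    Hypothetical helper for illustration; weights are normalized so that
    ratios such as 3:1 and 0.75:0.25 give the same result.
    """
    if not weights:
        raise ValueError("at least one voice weight is required")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    sizes = {len(voices[name]) for name in weights}
    if len(sizes) != 1:
        raise ValueError("all voice embeddings must have the same length")

    blended = [0.0] * sizes.pop()
    for name, weight in weights.items():
        share = weight / total  # normalized contribution of this voice
        for i, value in enumerate(voices[name]):
            blended[i] += share * value
    return blended


# Toy embeddings (assumed values, for demonstration only).
voices = {"voice_a": [1.0, 0.0, 2.0], "voice_b": [0.0, 4.0, 2.0]}
mix = blend_voices(voices, {"voice_a": 3, "voice_b": 1})
print(mix)
```

Normalizing the weights keeps the blended vector on the same scale as its inputs, so a 3:1 ratio simply means the first voice contributes three quarters of each component.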

The service streams audio in real time to reduce latency and generates word-level timestamps for speech synchronization. It conserves VRAM by loading models on demand and includes system resource monitoring that tracks CPU and GPU state.
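The on-demand loading idea can be sketched as a small wrapper that loads a model on first use and drops it after an idle period. This is an assumption-laden sketch: the class `LazyModel`, its parameters, and the stand-in loader are hypothetical, and a real GPU server would also need to release device memory explicitly.

```python
import time
from typing import Any, Callable


class LazyModel:
    """Load a model on first use and release it after an idle timeout.

    Hypothetical illustration; `loader` stands in for whatever function
    actually places the model weights in VRAM.
    """

    def __init__(self, loader: Callable[[], Any], idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._model: Any = None
        self._last_used = 0.0
        self.load_count = 0  # how many times the loader has run

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        """Return the model, loading it first if it is not resident."""
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self) -> bool:
        """Drop the model if it has been unused for the idle timeout."""
        idle_for = self._clock() - self._last_used
        if self._model is not None and idle_for >= self._idle_seconds:
            self._model = None  # a real server would also free GPU memory here
            return True
        return False


# Usage with a stand-in loader (the dict is a placeholder, not a real model).
model = LazyModel(loader=lambda: {"name": "placeholder-model"}, idle_seconds=300.0)
model.get()
print(model.loaded, model.load_count)
```

A periodic background task could call `unload_if_idle()`; the next request then pays the load cost once and reuses the resident model afterwards.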

Helm charts are provided for installing the service on Kubernetes clusters.

Features

Text-to-Speech Conversions - Provides a high-quality text-to-speech API for converting written text into spoken audio.

Text-to-Speech - Functions as a high-fidelity generative synthesis server that converts written text into spoken audio.

GPU Memory Optimizers - Prevents VRAM exhaustion by dynamically reloading models during request processing.

Grapheme To Phoneme Conversion - Transforms raw input text into phonetic representations and token IDs before passing them to the synthesis engine.

Model API Gateways - Exposes the underlying synthesis model and monitoring tools through a FastAPI-based REST gateway.

Voice Identity Interpolators - Synthesizes unique vocal profiles by interpolating voice embedding vectors based on custom weight ratios.

Speech Synthesis Services - Serves as a backend synthesis server that converts text to phonemes and then to high-fidelity audio.

Phoneme-Based Speech Processors - Uses a phoneme-based pipeline to convert raw text into phonetic representations for consistent speech synthesis.

Hybrid Voice Synthesis - Includes a specialized engine for blending multiple speaker characteristics into a unique hybrid voice.

Synthetic Voice Design - Creates specialized vocal identities by blending multiple existing voices using specific weight ratios.

VRAM Offloading - Implements VRAM optimization by unloading models to system memory during idle periods.

Model Weight Offloading - Optimizes GPU memory efficiency by unloading model weights from VRAM during idle periods and reloading them on demand.

Real-time Synthesis Streaming - Delivers synthesized speech as a continuous audio stream to minimize the time to first byte.

Response Streaming - Provides real-time audio streaming by sending synthesized speech chunks incrementally to reduce latency.

OpenAI-Compatible APIs - Implements a standardized external interface for text-to-speech generation compatible with the OpenAI API specification.

Word-Level Timestamps - Generates precise word-level timing metadata to synchronize spoken audio with on-screen text or animations.

Speech Synthesis Markup - Provides inline markup tags to control pacing, pauses, and specific pronunciations within synthesized speech.

Speech Boundary Timestamps - Generates precise word-level timestamps to synchronize spoken audio with text or animations.

Helm Chart Deployment - Ships predefined Helm charts to automate the deployment and configuration of the synthesis service on Kubernetes.

Kubernetes Application Deployments - Provides automated workflows for deploying scalable speech synthesis services via Helm charts in Kubernetes.
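To make the OpenAI-compatible interface listed above more concrete, the sketch below builds (but does not send) a request in the shape of the OpenAI speech endpoint, `POST /v1/audio/speech`, with the `model`, `input`, `voice`, and `response_format` fields. The base URL, the model name `kokoro`, the voice name `af_bella`, and the helper `build_speech_request` are assumptions for illustration; check the deployed server for the values it actually accepts.

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str,
                         response_format: str = "mp3") -> urllib.request.Request:
    """Build a POST request shaped like the OpenAI speech endpoint.

    Hypothetical helper: the model name below is an assumed placeholder.
    """
    payload = {
        "model": "kokoro",          # assumed model identifier
        "input": text,              # text to synthesize
        "voice": voice,             # assumed voice name
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local address; nothing is sent until urlopen() is called.
request = build_speech_request("http://localhost:8880", "Hello world", "af_bella")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` and reading the response in chunks would be one way to consume the audio incrementally, assuming the server streams its response.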

remsky/Kokoro-FastAPI
