VoxCPM

VoxCPM is a multilingual speech synthesis system and text-to-speech inference server. It supports voice cloning and text-based voice design, and generates natural speech across many languages and regional dialects using GPU-accelerated inference.

The project features a speech model fine-tuning framework that supports both full parameter updates and low-rank adaptation for customizing voice characteristics. It enables high-fidelity voice cloning from reference audio, including cross-lingual voice transfer and acoustic environment mimicry, as well as the creation of unique vocal identities through text-based voice design.
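To make the difference between full parameter updates and low-rank adaptation concrete, the sketch below counts trainable parameters for a single weight matrix under each approach. The layer dimensions and the rank are illustrative assumptions, not VoxCPM's actual configuration, and the function names are hypothetical.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W + B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
D_IN, D_OUT, RANK = 1024, 1024, 8

full = full_finetune_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

The adapter's parameter count grows linearly with the rank, which is why low-rank adaptation is usually the cheaper option when only voice characteristics need to change.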

The system provides broad capabilities for speech generation, including context-aware prosody, non-verbal cue insertion, and multi-speaker dialogue. It includes professional audio processing utilities for denoising and upsampling reference clips, as well as a high-throughput API server with streaming output and an OpenAI-compatible interface.
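As a rough illustration of what an OpenAI-compatible speech request looks like, the sketch below builds (without sending) a POST request in the shape of OpenAI's `/v1/audio/speech` endpoint. The base URL, port, model name, and voice name are placeholders assumed for this example; the values a VoxCPM server actually expects may differ.

```python
import json
import urllib.request

# Assumed base URL for a locally running server; the host, port, and the
# model/voice names below are placeholders, not documented VoxCPM values.
BASE_URL = "http://localhost:8000"


def build_speech_request(text, voice="default", model="voxcpm", response_format="mp3"):
    """Build (but do not send) a POST request in the OpenAI /v1/audio/speech shape."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_speech_request("Hello from a local speech server.")
print(request.get_method(), request.full_url)
```

With a server actually running, the request could be passed to `urllib.request.urlopen` and the response bytes written to an audio file.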

The software runs on CUDA, MPS, and CPU backends and can be deployed in containers.

Features

Multilingual Speech Models - Synthesizes natural speech across global languages and regional dialects without requiring language tags.

Speech Synthesis - Functions as a high-performance speech synthesis server that converts text to audio via an HTTP interface.

Text-to-Speech - Synthesizes natural-sounding speech across global languages and regional dialects using local generative inference.

Adapter Fine-Tuning - Uses Low-Rank Adaptation (LoRA) to fine-tune specific model layers for efficient voice cloning.

Autoregressive Audio Diffusion - Uses a hierarchical diffusion-based autoregressive architecture to generate high-fidelity continuous speech representations.

Zero-Shot Voice Cloning - Analyzes timbre and prosody from short reference audio samples to enable zero-shot voice cloning.

Generative Audio Engines - Implements a GPU-accelerated inference engine optimized for CUDA and MPS to produce studio-quality audio.

Speech Model Fine-Tuning - Provides a toolkit for adapting speech models using full parameter updates or LoRA adapters for custom voice characteristics.

High-Throughput Model Serving - Ships a dedicated high-throughput inference engine with asynchronous APIs and concurrent request support.

Hardware-Agnostic Inference Layers - Implements a standardized runtime format that enables model execution across CUDA, MPS, and CPU backends.

Model Fine-Tuning - Provides a framework for adapting speech models to specific speakers or languages via full fine-tuning and LoRA.

Speech Synthesis Gateways - Hosts a high-throughput HTTP server that acts as a gateway for streaming audio generation to external clients.

Conversational Audio Streams - Implements a network endpoint that streams generated speech as an MP3 byte stream for real-time interaction.

Local Speech Synthesis - Implements a high-performance text-to-speech engine running on local CUDA, MPS, or CPU backends.

Synthetic Voice Generators - Allows the creation of unique synthetic vocal identities by describing attributes like age, gender, and emotion in plain text.

Text-Based Voice Design - Allows the creation of unique synthetic vocal identities based on natural-language descriptions of age, gender, and pitch.

Voice Cloning - Replicates specific human vocal identities using reference audio samples and high-fidelity synthesis.

Voice Profile Managers - Offers a programmatic interface to register and organize custom voice profiles for audio customization.

OpenAI-Compatible APIs - Provides a local server implementation that accepts OpenAI-standard speech requests for broad ecosystem compatibility.

Audio Noise Cancellation - Provides utilities to remove background noise from reference audio clips to improve the quality of voice cloning.

Audio Processing - Offers professional audio utilities for denoising reference clips and upsampling low-resolution samples to studio quality.

Audio Transcription - Includes a speech-to-text utility that converts reference audio clips into text to streamline cloning.

Sample-Rate Conditioned Decoding - Produces high-resolution studio audio by conditioning the output decoder on specific target sample rates.

Full Parameter Fine-Tuning - Provides workflows for updating all model parameters to achieve maximum voice synthesis performance.

Continuous Batching Strategies - Optimizes GPU memory and throughput using continuous batching and attention paging for concurrent speech generation.

Cross-Lingual Adaptation - Enables speech generation for unsupported languages through specialized fine-tuning on target-language datasets.

Hardware-Agnostic Deployment - Implements a standardized model format that enables speech synthesis across diverse CPU and GPU backends.

LoRA Adapter Loaders - Provides a mechanism to load optional Low-Rank Adaptation weights to refine the output of the speech generation model.

Multi-Speaker Synthesis - Produces audio for multiple speakers using tagged scripts to assign distinct voices to different parts.

Supervised Fine-Tuning - Adapts models to specific speakers or languages using labeled datasets and supervised fine-tuning.

Acoustic Style Controls - Modifies emotion, pace, and delivery of audio using text-based control tags.

Phonetic Pronunciation Overrides - Allows precise control over pronunciation by overriding standard text with phoneme inputs or pinyin.

Prosody Controls - Infers prosody and expressiveness directly from text to produce natural, context-matched speech delivery.

Tokenizer-Free Processing - Processes multilingual text input directly without relying on predefined vocabulary tokens to maintain linguistic flexibility.

Acoustic Environment Replication - Reproduces both the speaker's identity and the specific acoustic environment of the source audio.

Cross-Lingual Voice Transfer - Clones a speaker's voice from a reference audio file and applies it to a different target language.

Prosody and Style Control - Enables duplication of a speaker's timbre while allowing precise adjustment of speed and emotion via text.

Speech Dialects - Provides the ability to replicate regional accents and dialects across multiple languages during speech synthesis.

Inference Scaling Services - Features a system for distributing concurrent synthesis requests across a pool of GPUs to maximize throughput.

Multi-GPU Deployment - Includes a distribution system that spreads model weights and computation across multiple GPUs for larger loads.

Emotional Modulation - Implements emotional modulation to adjust the tone and intensity of synthetic speech to match target emotions.

Audio Super-Resolution - Produces high-resolution audio from lower-sample-rate references using built-in upsampling.

Chunked Audio Streaming - Streams audio waveform chunks sequentially so playback can begin before the full synthesis completes.

Generative Audio Chunking - Yields audio waveform chunks sequentially during generation to allow playback to begin before the full sequence is complete.

Asynchronous Task Queues - Manages concurrent synthesis tasks using an asynchronous server that handles task priority and GPU memory allocation.

Non-Verbal Audio Cues - Adds realistic non-verbal cues like laughs and sighs into generated audio using specific text tags.

TTS API Endpoints - Provides a standard request-response endpoint for integrating text-to-speech functionality into external applications.

AI and Agents - Listed in the “AI and Agents” section of the Awesome Python list.

Speech Processing - Multimodal speech and language model.
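The chunked streaming behavior described in the feature list can be illustrated with a plain generator. This is a minimal sketch, not VoxCPM's interface: `fake_synthesize` is a hypothetical stand-in that yields a sine tone in fixed-size chunks, and the sample rate and chunk size are assumed values.

```python
import math
from typing import Iterator, List

SAMPLE_RATE = 16_000   # assumed sample rate for the stand-in signal
CHUNK_SIZE = 4_000     # assumed chunk length in samples (0.25 s)


def fake_synthesize(duration_s: float) -> Iterator[List[float]]:
    """Stand-in for a streaming synthesizer: yields a 220 Hz tone chunk by chunk."""
    total = int(duration_s * SAMPLE_RATE)
    for start in range(0, total, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, total)
        yield [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(start, end)]


received = []
for chunk in fake_synthesize(1.0):
    # A real client would hand each chunk to an audio device here,
    # so playback starts after the first chunk rather than the last.
    received.append(len(chunk))

print(received)
```

Because the consumer handles each chunk as it arrives, the time to first audio depends on the chunk size rather than on the length of the full utterance.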

OpenBMB/VoxCPM
