VoxCPM is a multilingual speech synthesis system and text-to-speech inference server. It serves as both a voice cloning tool and a synthetic voice designer, generating natural speech in many languages and regional dialects with a GPU-accelerated audio generator.
The project includes a fine-tuning framework for the speech model that supports both full-parameter updates and low-rank adaptation (LoRA) for customizing voice characteristics. It enables high-fidelity voice cloning from reference audio, including cross-lingual voice transfer and mimicry of the reference clip's acoustic environment, and it can create new vocal identities from a text description (voice design).
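To give a sense of why low-rank adaptation is the lighter of the two fine-tuning modes, the sketch below compares trainable parameter counts for a single linear layer. It assumes the standard LoRA formulation, in which a frozen weight matrix of shape `d_out x d_in` is augmented by two trainable factors of shapes `d_out x r` and `r x d_in`. The layer size and rank are illustrative values, not VoxCPM's actual configuration, and the function names are hypothetical.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


# Illustrative sizes only (assumption): a 1024 x 1024 projection with rank 8.
full = full_param_count(1024, 1024)
lora = lora_param_count(1024, 1024, rank=8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")
```

Under these assumed sizes the low-rank factors hold a small fraction of the weights that a full update would train, which is the usual reason to prefer LoRA when adapting a voice on limited hardware.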
Speech generation features include context-aware prosody, insertion of non-verbal cues, and multi-speaker dialogue. The system also ships audio processing utilities for denoising and upsampling reference clips, plus a high-throughput API server with streaming output and an OpenAI-compatible interface.
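As a rough illustration of what an OpenAI-compatible interface implies for clients, the sketch below builds a request in the shape of OpenAI's text-to-speech endpoint (`POST /v1/audio/speech` with `model`, `input`, `voice`, and `response_format` fields). That the VoxCPM server exposes this exact route and accepts these fields is an assumption, and the host, port, model name, and voice name are placeholders. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str, text: str, voice: str, model: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request in the OpenAI speech-API shape."""
    payload = {
        "model": model,  # hypothetical model identifier
        "input": text,
        "voice": voice,  # hypothetical voice name
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host, port, model, and voice (assumptions, not documented values).
req = build_speech_request("http://localhost:8000", "Hello there.", "default", "voxcpm")
print(req.get_method(), req.full_url)
```

To actually synthesize audio, the request could be passed to `urllib.request.urlopen` and the response bytes written to a file. Check the server's own documentation for the supported fields before relying on this shape.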
The software runs on several hardware backends, including CUDA, MPS, and CPU, and can be deployed in containers.
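A common way to handle the three backends named above is to probe for them in order and fall back to CPU. The sketch below shows that pattern with PyTorch's availability checks. The preference order (CUDA, then MPS, then CPU) and the helper names are assumptions for illustration, not VoxCPM's documented behavior.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device name, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may lack the MPS backend module entirely.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


print(detect_device())
```

Keeping the ordering logic in `pick_device` separate from the probing makes the fallback easy to test on machines that have no GPU.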