This project is a singing voice conversion tool based on VITS generative modeling. It converts a singing voice to the vocal identity of a target speaker while preserving the original melody, lyrics, and intonation.
The system distinguishes itself through hybrid voice synthesis, allowing for the blending of multiple speaker identities via linear model interpolation. It utilizes cluster-based feature retrieval to increase target voice similarity and employs a diffusion probabilistic model as a post-processor to remove electronic artifacts and improve vocal clarity.
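To make the idea of linear model interpolation concrete, here is a minimal sketch. It assumes the blend is a per-parameter weighted average of several models' weights, with the mixing ratios normalized to sum to 1. The function name `blend_speakers` and the plain-dictionary weight layout are hypothetical and chosen only for illustration; they are not the project's actual API or checkpoint format.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def blend_speakers(models: List[Weights], ratios: List[float]) -> Weights:
    """Blend several weight sets with a normalized linear combination."""
    if not models or len(models) != len(ratios):
        raise ValueError("need exactly one ratio per model")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    norm = [r / total for r in ratios]  # normalize so the ratios sum to 1

    blended: Weights = {}
    for name in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        blended[name] = [
            sum(w * v for w, v in zip(norm, col)) for col in columns
        ]
    return blended


speaker_a = {"w": [0.0, 2.0]}
speaker_b = {"w": [4.0, 2.0]}
mixed = blend_speakers([speaker_a, speaker_b], [1, 3])  # 25% A, 75% B
```

With ratios of 1 and 3, the result sits three quarters of the way from speaker A to speaker B for every parameter. The sketch assumes all models share the same parameter names and shapes.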
The software covers audio processing and model management, including fundamental frequency (F0) extraction, pitch normalization, and semitone adjustment. It provides a full training pipeline with audio dataset preprocessing, automatic mixed precision training, and generation of speaker-specific voice indices. For deployment, it supports weight compression and export to the ONNX format.
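As an illustration of semitone adjustment, the sketch below transposes a fundamental frequency contour using the equal-temperament relation, in which a shift of n semitones multiplies frequency by 2^(n/12). The function name `shift_f0` is hypothetical, and treating 0 Hz as the marker for unvoiced frames is an assumption about the contour format rather than something the project description states.

```python
from typing import List


def shift_f0(f0_hz: List[float], semitones: float) -> List[float]:
    """Transpose an F0 contour by a number of semitones.

    Assumption: frames with 0 Hz are unvoiced and are left at 0.
    """
    ratio = 2.0 ** (semitones / 12.0)  # 12 semitones = one octave = x2
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


contour = [220.0, 0.0, 440.0]
up_octave = shift_f0(contour, 12)     # doubles every voiced frame
down_octave = shift_f0(contour, -12)  # halves every voiced frame
```

Positive values raise the pitch and negative values lower it; fractional semitone values work the same way.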