Voice Pro is a speech and audio processing toolkit that combines text-to-speech synthesis, voice cloning, speech recognition, and translation in a single application. It lets users generate natural-sounding speech from text, clone a voice from a short audio sample without prior training data, and translate speech in real time across more than 100 languages.
What sets the platform apart is its integrated multimedia workflow: in a single pipeline, users can download YouTube videos, extract audio, separate voice tracks, generate word-timed subtitles, and produce dubbed content in over 100 languages. It supports multiple speech synthesis engines, including Edge-TTS, F5-TTS, E2-TTS, CosyVoice, and kokoro. It can also train custom TTS models on user-provided datasets and export the trained models to ONNX format for deployment.
Beyond speech generation, the application transcribes speech to text with word-level subtitles, translates subtitle files while preserving their formatting, and performs real-time speech recognition and translation with customizable audio inputs. It can also extract audio from video and remove noise, and it ships with built-in cleanup utilities for managing the application's installation and dependencies.
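
To make the idea of word-level subtitles concrete, here is a minimal sketch of how per-word timestamps from a speech recognizer could be grouped into SRT cues. It is an illustration only, not Voice Pro's actual implementation: the `Word` class, the `words_to_srt` function, and the `max_words` parameter are hypothetical names introduced for this example, and the grouping rule (a fixed number of words per cue) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its timing (hypothetical structure)."""
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into cues of at most `max_words` and render them as SRT.

    Each cue starts when its first word starts and ends when its last
    word ends. Real tools may also split on pauses or punctuation.
    """
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = srt_timestamp(group[0].start)
        end = srt_timestamp(group[-1].end)
        text = " ".join(w.text for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # SRT cues are separated by a blank line.
    return "\n".join(cues)
```

Calling `words_to_srt(words, max_words=1)` yields one cue per word, which is the finest granularity this sketch supports; larger values trade timing precision for readability.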