This project is a neural text-to-speech system and voice trainer that converts written text into spoken audio in many languages and regional dialects. Its ONNX-based engine runs fast inference fully offline and uses a phoneme-based controller for precise control over pronunciation.
The system also includes a toolkit for neural voice training that can produce custom single-speaker or multi-speaker models. Trained models can be exported to a standardized open format, and audio generation can be accelerated on a GPU.
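As a rough illustration of GPU acceleration with a CPU fallback, the sketch below picks an ordered list of execution providers. It assumes the engine runs its models through ONNX Runtime, which the description above does not state explicitly. The function name `choose_providers` and the `use_gpu` flag are hypothetical and are not part of this project's API.

```python
from typing import List, Sequence

# Provider names as used by ONNX Runtime (assumed backend, see lead-in).
GPU_PROVIDER = "CUDAExecutionProvider"
CPU_PROVIDER = "CPUExecutionProvider"


def choose_providers(available: Sequence[str], use_gpu: bool = True) -> List[str]:
    """Return providers in priority order, always ending with the CPU fallback.

    `available` is the list of providers the local runtime reports.
    This helper is hypothetical and only illustrates the selection logic.
    """
    chosen: List[str] = []
    if use_gpu and GPU_PROVIDER in available:
        chosen.append(GPU_PROVIDER)
    # Keep the CPU provider last so inference still works without a GPU.
    chosen.append(CPU_PROVIDER)
    return chosen
```

If ONNX Runtime is installed, `onnxruntime.get_available_providers()` can supply the `available` argument, and the result can be passed as the `providers` argument of `onnxruntime.InferenceSession`.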
Synthesized audio can be streamed in real time as chunks or exported to a file. Vocal delivery can be controlled through raw phoneme injection, punctuation-based prosody adjustments, and changes to speaking speed and volume.
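The following minimal sketch shows how chunked streaming, volume adjustment, and file export can fit together. It uses only the Python standard library and does not call this project's API. It assumes the audio arrives as 16-bit little-endian mono PCM chunks and uses a sample rate of 22050 Hz, neither of which the description above specifies. The helper names `scale_volume`, `stream_chunks`, and `write_wav` are hypothetical.

```python
import struct
import wave
from typing import BinaryIO, Iterable, Iterator, Union


def scale_volume(chunk: bytes, volume: float) -> bytes:
    """Multiply 16-bit little-endian PCM samples by `volume`, clipping to int16."""
    count = len(chunk) // 2
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    scaled = [max(-32768, min(32767, round(s * volume))) for s in samples]
    return struct.pack(f"<{count}h", *scaled)


def stream_chunks(chunks: Iterable[bytes], volume: float = 1.0) -> Iterator[bytes]:
    """Yield volume-adjusted chunks one at a time, as a streaming consumer would see them."""
    for chunk in chunks:
        yield scale_volume(chunk, volume)


def write_wav(
    target: Union[str, BinaryIO],
    chunks: Iterable[bytes],
    sample_rate: int = 22050,  # assumed rate, not taken from the project
) -> int:
    """Write mono 16-bit PCM chunks to a WAV file and return the number of samples written."""
    total = 0
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample (16-bit)
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            wav_file.writeframes(chunk)
            total += len(chunk) // 2
    return total
```

Because `stream_chunks` is a generator, the same chunk source can feed either a playback device or `write_wav`, for example `write_wav("out.wav", stream_chunks(source, volume=0.8))`, where `source` is any iterable of PCM chunks.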