MeloTTS is an open-source text-to-speech library that generates natural-sounding speech in six languages and can mix two languages, such as Chinese and English, within a single utterance. Its architecture pairs a token-based text frontend with an acoustic model that receives language-tagged tokens, which lets it handle bilingual code-switching and generate audio at real-time speed.
The system runs on standard CPU hardware without a dedicated GPU, and its neural network is light enough for real-time inference. It supports English, Spanish, French, Chinese, Japanese, and Korean. Mixed-language input, such as Chinese and English in the same sentence, is handled by tagging each span with its language in the text frontend rather than by switching acoustic models in the middle of an utterance.
The library is free for developers to integrate into their applications. Its phoneme mapping records which language each phoneme belongs to and where prosodic boundaries fall, for every supported language.
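To make "preserving language identity and prosodic boundaries" concrete, the following sketch builds two parallel sequences: phoneme-like symbols and a language ID for each symbol, with a boundary marker inserted between segments. Everything named here is hypothetical and chosen for illustration: the `LANG_IDS` table, the `BOUNDARY` symbol, the `FrontendOutput` type, and the placeholder phoneme strings. It is not the representation MeloTTS uses internally.

```python
from dataclasses import dataclass

LANG_IDS = {"ZH": 0, "EN": 1}   # hypothetical language-ID table
BOUNDARY = "_"                   # hypothetical prosodic-boundary symbol


@dataclass
class FrontendOutput:
    symbols: list[str]    # phoneme-like symbols, including boundary markers
    lang_ids: list[int]   # one language ID per symbol, same length as symbols


def build_sequence(segments: list[tuple[str, list[str]]]) -> FrontendOutput:
    """Flatten (language, phonemes) segments into aligned symbol/ID lists.

    A boundary marker is placed between consecutive segments and is tagged
    with the language of the segment that follows it.
    """
    symbols: list[str] = []
    lang_ids: list[int] = []
    for i, (lang, phonemes) in enumerate(segments):
        lang_id = LANG_IDS[lang]
        if i > 0:
            symbols.append(BOUNDARY)
            lang_ids.append(lang_id)
        for ph in phonemes:
            symbols.append(ph)
            lang_ids.append(lang_id)
    return FrontendOutput(symbols, lang_ids)


if __name__ == "__main__":
    # Placeholder phonemes: pinyin-style for Chinese, ARPAbet-style for English.
    out = build_sequence([
        ("ZH", ["n", "i3", "h", "ao3"]),
        ("EN", ["HH", "AH", "L", "OW"]),
    ])
    print(out.symbols)
    print(out.lang_ids)
```

Keeping the language IDs in a list aligned with the symbols means a downstream model can look up both for any position without re-parsing the text.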