Kokoro | Awesome Repository

Features

Neural Text-to-Speech Engines - Ships a lightweight neural text-to-speech engine that converts text into natural-sounding speech.
Multi-Language Speech Generators - Produces speech in multiple languages using language-specific pipelines and grapheme-to-phoneme conversion.
Grapheme To Phoneme Conversion - Transforms written text into phonetic representations for accurate pronunciation across multiple languages.
Voice Identity Selections - Lets users choose from multiple voice profiles to change the character of the spoken audio.
Speech Synthesis Engines - Provides a compact neural TTS engine designed for fast inference on CPU and Apple Silicon.
Text-to-Speech - Converts plain text into natural-sounding speech using a lightweight neural model.
Multi-Language Voice Profiles - Supports multiple languages with separate voice profiles and language-specific phoneme mappings.
Grapheme-to-Phoneme Pipelines - Implements language-specific grapheme-to-phoneme conversion pipelines for multi-language speech generation.
Voice Profile Managers - Offers multiple voice profiles to change the character and tone of generated speech output.
Apple Silicon GPU Accelerators - Accelerates speech synthesis inference on Apple M-series hardware; GPU acceleration is enabled via a single environment variable.
WAV File Writers - Writes generated speech output directly to a WAV file on disk for later use.
Audio Exporters - Saves generated speech directly to WAV files for offline storage and further processing.

Kokoro is a lightweight neural text-to-speech engine that converts written text into spoken audio using a compact model designed for fast inference. It supports multiple languages through language-specific grapheme-to-phoneme conversion pipelines, and offers voice profile selection to change the character of the generated speech.

GPU acceleration on Apple Silicon is enabled by setting a single environment variable, which speeds up inference on M-series Macs. The engine also supports pattern-based text segmentation, which splits input text at user-defined delimiters to produce a separate audio segment for each piece, and a speed multiplier parameter that adjusts the rate of the generated speech.
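
The segmentation and speed controls described above can be illustrated with a minimal sketch. It does not call Kokoro itself: the helper names `split_segments` and `scaled_duration`, the default newline pattern, and the assumption that a multiplier above 1 shortens the audio are all hypothetical and chosen only for illustration.

```python
import re


def split_segments(text: str, split_pattern: str = r"\n+") -> list[str]:
    """Split text at a user-defined delimiter pattern and drop empty pieces.

    Hypothetical helper: each returned string would be synthesized as its
    own audio segment.
    """
    pieces = re.split(split_pattern, text)
    return [piece.strip() for piece in pieces if piece.strip()]


def scaled_duration(base_seconds: float, speed: float) -> float:
    """Duration of a segment after applying a speed multiplier.

    Assumes speed > 1 means faster (shorter) speech; this is an
    illustrative convention, not a documented guarantee.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return base_seconds / speed


text = "First paragraph.\n\nSecond paragraph.\nThird line."
segments = split_segments(text)
for index, segment in enumerate(segments):
    print(index, segment)
```

Passing a different pattern, such as `r"[.!?]\s+"`, would split on sentence boundaries instead of line breaks, which trades longer segments for finer-grained audio files.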

Generated speech can be exported directly to WAV files for offline storage and further processing. The project is implemented in JavaScript and provides a complete text-to-speech pipeline with minimal dependencies.
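
As a rough illustration of the WAV export step, the sketch below writes floating-point samples to a mono 16-bit PCM file using only the Python standard library. The `write_wav` helper, the 24000 Hz sample rate, and the output file name are assumptions made for this example, and a sine tone stands in for synthesized speech.

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # assumed sample rate, for illustration only


def write_wav(path: str, samples: list[float], sample_rate: int = SAMPLE_RATE) -> None:
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overflow
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for synthesized speech: half a second of a 440 Hz tone.
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 2)
]
write_wav("segment_0.wav", tone)
```

Writing one file per segment keeps each piece independently playable and makes it simple to re-synthesize a single segment later.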
