Kokoro is a lightweight neural text-to-speech engine that converts written text into spoken audio using a compact model designed for fast inference. It supports multiple languages through language-specific grapheme-to-phoneme conversion pipelines, and offers voice profile selection to change the character of the generated speech.
GPU acceleration on Apple Silicon (M-series Macs) is enabled by setting a single environment variable, which speeds up inference on that hardware. The engine also supports pattern-based text segmentation, in which input text is split at user-defined delimiters and each piece is synthesized as a separate audio segment, and a multiplier parameter that adjusts the speed of the generated speech.
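Pattern-based segmentation can be illustrated with a minimal sketch. The helper name `splitText` and the default newline pattern are assumptions made for illustration, not Kokoro's actual API; the sketch only shows the general idea of cutting input at a user-supplied delimiter so that each piece can be synthesized separately.

```javascript
// Hypothetical helper for illustration; not part of Kokoro's API.
// Splits text at every match of `pattern` and drops empty pieces.
// The default pattern (one or more newlines) is an assumption.
function splitText(text, pattern = /\n+/) {
  return text
    .split(pattern)
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0);
}

// Each resulting segment would be passed to the synthesizer on its own.
const segments = splitText("First paragraph.\n\nSecond paragraph.\nThird.");
console.log(segments);
// [ 'First paragraph.', 'Second paragraph.', 'Third.' ]
```

A pattern with capture groups would make `String.prototype.split` include the captured delimiters in the result, so a non-capturing pattern is the safer choice here.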
Generated speech can be exported directly to WAV files for offline storage and further processing. The project is implemented in JavaScript and provides a complete text-to-speech pipeline with minimal dependencies.
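To make the WAV export step concrete, here is a rough sketch of encoding mono floating-point samples as a 16-bit PCM WAV buffer in Node.js. The function name `encodeWav`, the 24 kHz sample rate, and the assumption that samples arrive as floats in the range [-1, 1] are illustrative choices, not confirmed details of how Kokoro writes its files.

```javascript
// Hypothetical helper for illustration; not part of Kokoro's API.
// Encodes mono float samples in [-1, 1] as a 16-bit PCM WAV buffer.
function encodeWav(samples, sampleRate) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = Buffer.alloc(44 + dataSize);

  buffer.write("RIFF", 0, "ascii");
  buffer.writeUInt32LE(36 + dataSize, 4); // file size minus the first 8 bytes
  buffer.write("WAVE", 8, "ascii");
  buffer.write("fmt ", 12, "ascii");
  buffer.writeUInt32LE(16, 16); // size of the fmt chunk
  buffer.writeUInt16LE(1, 20); // audio format 1 = uncompressed PCM
  buffer.writeUInt16LE(1, 22); // channel count (mono)
  buffer.writeUInt32LE(sampleRate, 24);
  buffer.writeUInt32LE(sampleRate * bytesPerSample, 28); // byte rate
  buffer.writeUInt16LE(bytesPerSample, 32); // block align
  buffer.writeUInt16LE(16, 34); // bits per sample
  buffer.write("data", 36, "ascii");
  buffer.writeUInt32LE(dataSize, 40);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling to the signed 16-bit range.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    buffer.writeInt16LE(Math.round(clamped * 32767), 44 + i * bytesPerSample);
  }
  return buffer;
}

// One second of a 440 Hz test tone stands in for synthesized speech.
const sampleRate = 24000; // assumed rate, for illustration only
const tone = Float32Array.from({ length: sampleRate }, (_, i) =>
  0.2 * Math.sin((2 * Math.PI * 440 * i) / sampleRate)
);
const wav = encodeWav(tone, sampleRate);
console.log(wav.length); // 48044: 44-byte header + 24000 samples * 2 bytes
```

The resulting buffer could then be saved with Node's `fs.writeFileSync("out.wav", wav)` for offline storage or further processing.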