Vosk is an offline speech-to-text engine and API that converts spoken audio into text locally on a device. It provides a cross-platform speech toolkit with language bindings for integrating voice recognition into server environments, Android, iOS, and Raspberry Pi.
The project includes a speaker identification tool that distinguishes between different voices and an acoustic model trainer for building custom neural network models. The training tools cover speech feature extraction and model accuracy evaluation, so recognition can be improved for specialized domains.
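The text above does not say which metric the accuracy evaluation reports. Word error rate (WER) is a common choice for speech recognition, so the following is a minimal sketch under that assumption. The function name `word_error_rate` is illustrative and is not taken from the project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumption: accuracy is measured as WER; the project may use another metric.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Comparing this value before and after retraining on domain data gives a rough indication of whether a custom model helps, provided the same held-out transcripts are used both times.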
The system supports real-time audio streaming and transcribes mono 16-bit PCM WAV files. Additional capabilities include restricting recognition to a fixed list of phrases (useful for keyword spotting), vocabulary configuration for specialized terminology, and the generation of time-synchronized SRT subtitle strings.
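As a rough illustration of file transcription, the sketch below checks that a WAV file is mono 16-bit PCM and then feeds it to the recognizer in chunks, optionally limited to a phrase list. It assumes the `vosk` Python package is installed and uses its `Model` and `KaldiRecognizer` classes. The helper names (`check_wav_format`, `build_grammar`, `transcribe`) and the `model_dir` path are hypothetical, and phrase lists may only take effect with models that support a configurable vocabulary.

```python
import json
import wave


def check_wav_format(path):
    """Return the sample rate if the file is mono 16-bit PCM, else raise ValueError."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wf.getcomptype() != "NONE":
            raise ValueError("expected uncompressed PCM")
        return wf.getframerate()


def build_grammar(phrases):
    """Encode a phrase list as a JSON array string, adding "[unk]" for other speech."""
    return json.dumps(list(phrases) + ["[unk]"])


def transcribe(path, model_dir, phrases=None):
    """Transcribe a WAV file. model_dir is a hypothetical path to an unpacked model."""
    # Imported lazily so the helpers above work without vosk installed.
    from vosk import Model, KaldiRecognizer

    rate = check_wav_format(path)
    model = Model(model_dir)
    if phrases:
        rec = KaldiRecognizer(model, rate, build_grammar(phrases))
    else:
        rec = KaldiRecognizer(model, rate)

    pieces = []
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when an utterance has been finalized.
            if rec.AcceptWaveform(data):
                pieces.append(json.loads(rec.Result()).get("text", ""))
    # Flush whatever audio is still buffered in the recognizer.
    pieces.append(json.loads(rec.FinalResult()).get("text", ""))
    return " ".join(p for p in pieces if p)
```

A call such as `transcribe("command.wav", "model", phrases=["lights on", "lights off"])` would return the recognized text as a single string. The same chunked loop applies to live streaming if the chunks come from a microphone instead of a file.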