Vosk API

Vosk is an offline speech-to-text engine and API that converts spoken audio into text locally on a device. It provides a cross-platform speech toolkit with language bindings for integrating voice recognition into server environments, Android, iOS, and Raspberry Pi.

The project includes a speaker identification tool to distinguish between different voices and an acoustic model trainer for building custom neural network models. These training tools enable speech feature extraction and model accuracy evaluation to improve recognition for specialized domains.

The system supports real-time audio streaming and the transcription of mono 16-bit PCM WAV files. Additional capabilities include keyword spotting to restrict transcription to specific phrases, vocabulary configuration for specialized terminology, and the generation of synchronized SRT subtitle strings.

Features

Local Speech-to-Text - Provides an on-device transcription system that converts spoken audio to text without an internet connection.

Speech Transcription - Converts spoken audio, including mono 16-bit PCM WAV files, into written text using local models.
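As a rough illustration of file transcription with the Python binding, the sketch below first checks that a file really is mono 16-bit PCM, then feeds it to a recognizer in blocks. The function names, the 4000-frame block size, and the `model_dir` argument (a path to a model you have already downloaded) are assumptions made for this example. The vosk import is deferred so the format check can be used on its own.

```python
import json
import wave


def is_supported_wav(path):
    """Return True if the file is uncompressed mono 16-bit PCM."""
    with wave.open(path, "rb") as wf:
        return (wf.getnchannels() == 1
                and wf.getsampwidth() == 2
                and wf.getcomptype() == "NONE")


def transcribe_wav(path, model_dir):
    """Transcribe a WAV file. Requires the vosk package and a local model."""
    # Imported lazily so is_supported_wav works without vosk installed.
    from vosk import Model, KaldiRecognizer

    if not is_supported_wav(path):
        raise ValueError("expected a mono 16-bit PCM WAV file")
    with wave.open(path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        pieces = []
        while True:
            data = wf.readframes(4000)  # block size is an arbitrary choice
            if not data:
                break
            if rec.AcceptWaveform(data):
                pieces.append(json.loads(rec.Result())["text"])
        # Collect whatever remains buffered at end of file.
        pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)
```

Files in other layouts (stereo, 8-bit, compressed) would need converting before being passed in.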

Real-Time Transcription - Processes raw 16-bit mono PCM audio buffers in real-time for continuous, low-latency transcription.
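The streaming pattern can be sketched as a generator that consumes audio buffers as they arrive and emits partial and final text. This assumes a recognizer object that behaves like the Python binding's `KaldiRecognizer` (`AcceptWaveform`, `PartialResult`, `Result`, `FinalResult`). The function name `stream_transcripts` is hypothetical, and the source of the buffers (microphone, socket, file) is left to the caller.

```python
import json


def stream_transcripts(recognizer, chunks):
    """Yield ("partial" | "final", text) pairs as audio buffers arrive.

    `recognizer` is expected to behave like vosk.KaldiRecognizer and
    `chunks` is any iterable of raw 16-bit mono PCM byte buffers.
    """
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            # The recognizer detected the end of an utterance.
            yield "final", json.loads(recognizer.Result()).get("text", "")
        else:
            # Still mid-utterance: report the running hypothesis.
            yield "partial", json.loads(recognizer.PartialResult()).get("partial", "")
    # Flush whatever is still buffered once the stream ends.
    yield "final", json.loads(recognizer.FinalResult()).get("text", "")
```

Because it is a generator, the caller can update a display on each partial result without waiting for the stream to finish.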

Local Model Execution - Enables the execution of acoustic and language models directly on local hardware for offline inference.

Real-Time Speech Processing - Processes live speech input in real-time to provide continuous text output with minimal latency.

Speech Processing Toolkits - Provides a comprehensive toolkit for integrating speech-to-text and speaker identification across Android, iOS, and Raspberry Pi.

Speech Recognition - Performs speech-to-text transcription locally on the device without requiring an internet connection.

Speech Recognition APIs - Provides native C bindings for integrating offline audio-to-text transcription into applications.

Speech-to-Text Integrations - Offers interfaces for integrating offline voice-to-text capabilities across mobile apps and server environments.

Speech-to-Text Pipelines - Implements a complete local speech-to-text workflow to ensure data privacy and offline functionality.

Keyword Spotting - Restricts transcription to a specific list of predefined phrases to increase recognition accuracy.
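One way to picture phrase restriction: the Python binding's `KaldiRecognizer` accepts an optional third argument, a JSON array of allowed phrases, where the special token "[unk]" stands for anything outside the list. The sketch below builds that string; `build_grammar` and `make_keyword_recognizer` are hypothetical helper names. Whether a given model honors the list depends on the model supporting a runtime vocabulary, so treat this as an illustration rather than a guarantee.

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Return a JSON array string of the phrases the recognizer may output."""
    items = [p.strip() for p in phrases if p.strip()]
    if allow_unknown:
        # "[unk]" lets the recognizer label out-of-list speech instead of
        # forcing it onto the nearest allowed phrase.
        items.append("[unk]")
    return json.dumps(items)


def make_keyword_recognizer(model, sample_rate, phrases):
    """Create a recognizer limited to `phrases`. Requires the vosk package."""
    from vosk import KaldiRecognizer  # lazy import: only needed here

    return KaldiRecognizer(model, sample_rate, build_grammar(phrases))
```

For example, `build_grammar(["turn on", "turn off"])` produces a string suitable for a two-command voice interface.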

Acoustic Model Trainers - Provides a pipeline for preparing audio data and training custom neural network acoustic models.

Speech Model Training - Enables building and evaluating specialized acoustic and language models to improve recognition for specific domains.

Recognition Accuracy Evaluation - Calculates the word error rate by decoding test audio to measure the accuracy of trained speech models.
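Word error rate is the word-level edit distance between a reference transcript and the decoded hypothesis, divided by the number of reference words. The function below is a small stand-alone sketch of that definition, not the project's own evaluation script; it splits on whitespace and does no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Since insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.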

Speaker Identification Frameworks - Analyzes audio streams to distinguish between different voices and attribute transcribed text to specific individuals.
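Distinguishing voices is commonly done by comparing fixed-length speaker vectors, one per utterance, with a distance measure. The sketch below assumes such vectors are already available as lists of floats (for instance from a recognizer configured with a speaker model) and matches a new vector against enrolled ones by cosine distance. The function names and the 0.5 threshold are hypothetical and would need tuning on real data.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    return 1.0 - dot / norm


def closest_speaker(vector, enrolled, threshold=0.5):
    """Return the enrolled name nearest to `vector`, or None if none is close.

    `enrolled` maps speaker names to reference vectors. The threshold is
    an illustrative value, not a recommended setting.
    """
    best_name, best_dist = None, threshold
    for name, ref in enrolled.items():
        dist = cosine_distance(vector, ref)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```

Returning None for distant vectors lets the caller label a segment as an unknown speaker rather than misattribute it.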

Custom Vocabularies - Allows customization of the recognized word set to improve accuracy for specialized terminology and domains.

Speech Decoding Transducers - Uses weighted finite state transducers via the Kaldi engine to map audio features to words.

Command Line Interfaces - Ships a command line interface for converting spoken audio files into text directly from the terminal.

Audio Feature Extraction - Provides utilities for converting raw audio into normalized coefficients to prepare data for model training.

C Interoperability Layers - Implements a stable C-API binary interface to enable interoperability between the core engine and multiple programming languages.

Native Bindings - Uses native bindings to expose low-level C++ speech recognition logic to high-level languages like Python and Java.

Speech Recognition - Offline speech recognition toolkit for low-resource devices.

alphacep/vosk-api
