PocketSphinx is an offline speech recognition engine that converts raw audio from files or live microphone streams into written text without requiring a network connection. It functions as a speech-to-text library, a real-time transcription engine, and a voice command processor, capable of detecting and transcribing spoken commands from continuous audio streams with configurable acoustic and language models.
The engine uses weighted finite-state transducers to represent acoustic, phonetic, and language models as a single search graph for efficient decoding. It uses fixed-point acoustic models with 8-bit or 16-bit parameters to reduce memory usage on embedded devices, and it applies frame-synchronous beam search, pruning the search space at each audio frame, to keep decoding real-time. During decoding the system builds a lattice of alternative word sequences, from which multiple ranked transcriptions can be extracted, and it recovers word-level start and end timestamps by tracing back through the Viterbi path.
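To make the pruning and traceback steps concrete, here is a minimal sketch of frame-synchronous Viterbi search with a beam, followed by a backtrace that turns the best path into segments with start and end frames. It is a toy illustration, not PocketSphinx's code: the two-state model, the Gaussian-style emission score, and the names `beam_viterbi`, `path_to_segments`, and `toy_emit` are all assumptions made up for this example.

```python
import math

def beam_viterbi(frames, states, log_trans, log_emit, beam=10.0):
    """Toy frame-synchronous Viterbi search with beam pruning.

    frames:    list of observations, one per audio frame
    states:    list of state labels; decoding starts in states[0]
    log_trans: dict mapping (prev_state, next_state) -> log probability
    log_emit:  function (state, observation) -> log likelihood
    beam:      states scoring more than this below the frame's best are dropped
    Returns the best state label for every frame.
    """
    active = {states[0]: log_emit(states[0], frames[0])}
    backptrs = [{states[0]: None}]
    for t in range(1, len(frames)):
        scores, back = {}, {}
        # Expand only the states that survived pruning at the previous frame.
        for prev, prev_score in active.items():
            for nxt in states:
                lt = log_trans.get((prev, nxt))
                if lt is None:
                    continue
                score = prev_score + lt + log_emit(nxt, frames[t])
                if score > scores.get(nxt, -math.inf):
                    scores[nxt] = score
                    back[nxt] = prev
        if not scores:
            raise ValueError("no reachable state at frame %d" % t)
        # Beam pruning: keep states within `beam` of the best score.
        best = max(scores.values())
        active = {s: v for s, v in scores.items() if v >= best - beam}
        backptrs.append({s: back[s] for s in active})
    # Backtrace from the best final state to recover the Viterbi path.
    state = max(active, key=active.get)
    path = []
    for t in range(len(frames) - 1, -1, -1):
        path.append(state)
        state = backptrs[t][state]
    path.reverse()
    return path

def path_to_segments(path):
    """Collapse a per-frame path into (label, start_frame, end_frame) tuples."""
    segs = []
    for t, label in enumerate(path):
        if segs and segs[-1][0] == label:
            segs[-1] = (label, segs[-1][1], t)
        else:
            segs.append((label, t, t))
    return segs

# Hypothetical two-state model: "sil" expects low energy, "yes" high energy.
MEANS = {"sil": 0.0, "yes": 1.0}
LOG_TRANS = {
    ("sil", "sil"): math.log(0.8), ("sil", "yes"): math.log(0.2),
    ("yes", "yes"): math.log(0.8), ("yes", "sil"): math.log(0.2),
}

def toy_emit(state, obs):
    # Unnormalised Gaussian log likelihood with variance 0.05.
    return -((obs - MEANS[state]) ** 2) / (2 * 0.05)

toy_frames = [0.1, 0.0, 0.9, 1.0, 0.8, 0.1]
toy_segments = path_to_segments(
    beam_viterbi(toy_frames, ["sil", "yes"], LOG_TRANS, toy_emit)
)
```

Multiplying each frame index by the frame duration converts the segments to timestamps. A real decoder tracks word exits rather than raw state labels, but the principle of following back-pointers from the best final hypothesis is the same.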
PocketSphinx processes audio in fixed-size chunks through a ring buffer, feeding frames incrementally to the decoder without requiring the full audio in memory. It detects speech boundaries by analyzing energy levels and silence gaps, then processes each utterance independently for transcription. The library supports transcribing single-channel 16-bit PCM audio from files or standard input, outputting recognized text as line-delimited JSON, and can match a known transcript against an audio file to produce word-level or phone-level timestamps.
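The chunked processing and energy-based endpointing described above can be sketched in a few lines. The example below is an illustration under stated assumptions, not PocketSphinx's endpointer: the function names `chunk_rms` and `split_utterances`, the RMS threshold, and the `hangover` and `preroll` parameters are hypothetical, and a bounded `deque` stands in for the ring buffer. Input is assumed to be little-endian, single-channel 16-bit PCM.

```python
import math
import struct
from collections import deque

def chunk_rms(chunk):
    """Root-mean-square amplitude of a little-endian 16-bit mono PCM chunk."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack("<%dh" % n, chunk[: 2 * n])
    return math.sqrt(sum(s * s for s in samples) / n)

def split_utterances(chunks, threshold=500.0, hangover=3, preroll=2):
    """Yield one bytes object per detected utterance.

    chunks:    iterable of fixed-size PCM chunks (bytes)
    threshold: RMS level at or above which a chunk counts as speech (assumed)
    hangover:  number of consecutive silent chunks that closes an utterance
    preroll:   number of chunks of leading context kept before speech onset
    """
    ring = deque(maxlen=preroll)  # bounded buffer of the most recent silence
    current = []                  # chunks of the utterance in progress
    silent_run = 0
    for chunk in chunks:
        is_speech = chunk_rms(chunk) >= threshold
        if not current:
            if is_speech:
                # Speech onset: start the utterance with the buffered context.
                current = list(ring) + [chunk]
                ring.clear()
                silent_run = 0
            else:
                ring.append(chunk)
        else:
            current.append(chunk)
            silent_run = 0 if is_speech else silent_run + 1
            if silent_run >= hangover:
                # Enough trailing silence: the utterance is complete.
                yield b"".join(current)
                current = []
                silent_run = 0
    if current:
        # The stream ended in the middle of an utterance.
        yield b"".join(current)
```

Because `split_utterances` is a generator over an iterable of chunks, it never needs the whole recording in memory. Each yielded utterance could then be passed to a decoder on its own, for example by reading fixed-size blocks from `sys.stdin.buffer` and feeding them in as they arrive.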