PocketSphinx

PocketSphinx is an offline speech recognition engine that converts raw audio from files or live microphone streams into written text without requiring a network connection. It can serve as a speech-to-text library, a real-time transcription engine, or a voice command processor, and it detects and transcribes spoken commands in continuous audio streams using configurable acoustic and language models.

The engine uses weighted finite-state transducers to represent acoustic, phonetic, and language models as a single search graph for efficient decoding. It employs fixed-point acoustic models with 8-bit or 16-bit parameters to reduce memory usage on embedded devices, and frame-synchronous beam search to prune the search space at each audio frame for real-time performance. The system generates a lattice of alternative word sequences during decoding, from which multiple ranked transcriptions can be extracted, and records word-level start and end timestamps by tracing back through the Viterbi path.
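
The frame-synchronous beam search described above can be illustrated with a toy decoder. This is a minimal sketch, not PocketSphinx's actual implementation: the function names `beam_prune` and `decode`, the dictionary-based graph, and the log-probability values are all assumptions made for illustration.

```python
import math


def beam_prune(hypotheses, beam):
    """Keep only hypotheses whose log score is within `beam` of the best one."""
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {s: score for s, score in hypotheses.items() if score >= best - beam}


def decode(frames, transitions, start, beam):
    """Toy frame-synchronous Viterbi search with beam pruning.

    frames:      list of dicts, state -> emission log-probability for that frame
    transitions: dict, state -> list of (next_state, transition log-probability)
    Returns (best_final_state, best_log_score, best_state_path).
    """
    active = {start: 0.0}
    backptrs = []  # one dict per frame: state -> best predecessor
    for emissions in frames:
        scores, preds = {}, {}
        for state, score in active.items():
            for nxt, trans_lp in transitions.get(state, []):
                if nxt not in emissions:
                    continue
                cand = score + trans_lp + emissions[nxt]
                if cand > scores.get(nxt, -math.inf):
                    scores[nxt] = cand
                    preds[nxt] = state
        # Pruning happens once per frame, before the next frame is consumed.
        active = beam_prune(scores, beam)
        if not active:
            raise ValueError("no hypothesis survived this frame")
        backptrs.append(preds)
    # Trace back from the best final state to recover the state path.
    final = max(active, key=active.get)
    path = [final]
    for preds in reversed(backptrs):
        path.append(preds[path[-1]])
    path.reverse()
    return final, active[final], path


# Hypothetical two-state graph and two frames of emission scores.
transitions = {
    "s": [("a", -1.0), ("b", -1.0)],
    "a": [("a", -0.5), ("b", -1.0)],
    "b": [("b", -0.5)],
}
frames = [{"a": -0.1, "b": -3.0}, {"a": -2.0, "b": -0.2}]
final, score, path = decode(frames, transitions, "s", beam=1.0)
```

A narrower beam discards more hypotheses per frame, which speeds up decoding at the risk of pruning the path that would have scored best overall. Because the path holds one state per frame, the frame index at which it enters and leaves a word gives that word's start and end time.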

PocketSphinx processes audio in fixed-size chunks through a ring buffer, feeding frames incrementally to the decoder without requiring the full audio in memory. It detects speech boundaries by analyzing energy levels and silence gaps, then processes each utterance independently for transcription. The library supports transcribing single-channel 16-bit PCM audio from files or standard input, outputting recognized text as line-delimited JSON, and can match a known transcript against an audio file to produce word-level or phone-level timestamps.
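
The energy-based speech boundary detection described above can be sketched as a generator over fixed-size chunks. This is a simplified illustration, not the library's own endpointer: the names `rms` and `segment_utterances`, the threshold of 500, and the `max_silence` parameter are assumptions, and the ring buffer is left out for brevity. Chunks are assumed to be little-endian, single-channel 16-bit PCM.

```python
import math
import struct


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of a chunk of little-endian 16-bit mono PCM."""
    samples = [s for (s,) in struct.iter_unpack("<h", chunk)]
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def segment_utterances(chunks, threshold=500.0, max_silence=3):
    """Yield (start_chunk, end_chunk, audio_bytes) for each detected utterance.

    An utterance starts at the first chunk whose RMS reaches `threshold` and
    ends once `max_silence` consecutive quiet chunks follow. `end_chunk` is
    exclusive, and trailing silence is dropped from the yielded audio.
    """
    start = None
    utterance = []
    silent_run = 0
    for index, chunk in enumerate(chunks):
        loud = rms(chunk) >= threshold
        if start is None:
            if loud:
                start, utterance, silent_run = index, [chunk], 0
            continue
        utterance.append(chunk)
        if loud:
            silent_run = 0
            continue
        silent_run += 1
        if silent_run >= max_silence:
            speech = utterance[: len(utterance) - silent_run]
            yield start, start + len(speech), b"".join(speech)
            start, utterance, silent_run = None, [], 0
    if start is not None:
        # The stream ended while an utterance was still open.
        speech = utterance[: len(utterance) - silent_run]
        yield start, start + len(speech), b"".join(speech)


# Synthetic input: four-sample chunks of silence and of a loud square wave.
quiet = struct.pack("<4h", 0, 0, 0, 0)
loud = struct.pack("<4h", 1000, -1000, 1000, -1000)
stream = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet, loud]
segments = list(segment_utterances(stream, threshold=500.0, max_silence=2))
```

Because `segment_utterances` consumes any iterable of chunks, the same loop works for a file read in blocks or for a live stream, and only the currently open utterance is held in memory.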

Features

Speech Recognition Engines - An automatic speech recognition library that converts raw audio signals from files or live streams into written text without requiring a network connection.

Live Stream Transcribers - Detects speech segments in a continuous audio stream, transcribes each segment, and outputs results in real time with timing and probability data.

Speech to Text Transcription - Transcribing spoken audio from files or streams into written text, with support for multiple recognition hypotheses.

Audio and Video File Transcription - Processes a pre-recorded audio file through the decoder to produce a text transcription of the spoken content.

Audio-Transcript Aligners - Matches a known transcript against an audio file to produce word-level or phone-level timestamps for each spoken segment.

Real-Time Transcription - An engine that processes live audio streams from microphones or input devices, segmenting speech into utterances and outputting text with timing and probability data.

Speech Boundary Detection - Detects speech boundaries by analyzing energy levels and silence gaps, then processes each utterance independently for transcription.

Speech Segment Extraction - Divides a continuous audio stream into discrete utterances, recognizing each segment independently to produce structured transcription results.

Word-Level Timestamps - Records the start and end frame indices for each recognized word by tracing back through the Viterbi path in the search graph.

Fixed-Point Acoustic Models - Represents acoustic model parameters as 8-bit or 16-bit fixed-point numbers to reduce memory usage and computational cost on embedded devices.

Beam Search Runtimes - Prunes the search space at each audio frame by keeping only the most likely hypotheses within a beam width, enabling real-time decoding.

Accelerated Speech Recognizers - Reads single-channel 16-bit PCM audio from files or standard input and outputs recognized text as line-delimited JSON.

Offline - Converting speech to text locally without an internet connection, using pre-recorded audio files or live microphone input.

Streaming Recognition - Processes audio from a microphone or live stream in real time, converting speech to text as it is spoken.

N-Best Hypothesis Generators - Generates a lattice of alternative word sequences during decoding, from which multiple ranked transcriptions can be extracted.

Real-Time Audio Transcribers - Captures audio from a microphone or input device and feeds it incrementally to the decoder for real-time speech-to-text conversion.

Real-Time Speech Transcription - Processing live audio from a microphone or input stream to produce text output as speech is spoken, with utterance segmentation.

Speech-to-Text Libraries - A library for transcribing spoken language from audio files or microphone input into text, supporting multiple recognition hypotheses and word-level timestamps.

Speech Decoding Transducers - Uses weighted finite-state transducers to represent acoustic, phonetic, and language models as a single search graph for efficient speech recognition.

Fixed-Size Audio Chunk Pipelines - Processes audio in fixed-size chunks through a ring buffer, feeding frames incrementally to the decoder without requiring the full audio in memory.

Acoustic Model Codebooks - Compresses acoustic model output distributions by grouping similar Gaussian mixtures into shared codebooks, reducing model size.

Recognition Parameter Configurations - Adjusts acoustic model, language model, and decoder settings to tune recognition accuracy and behavior.

Voice Command Interfaces - A system for detecting and transcribing spoken commands in real time from continuous audio streams, with configurable acoustic and language models.

Audio-to-Text Alignment - Mapping a known transcript to an audio file to produce word-level or phone-level timestamps for each spoken segment.

Voice Command Recognition - Recognizing and acting on spoken commands in real time, suitable for hands-free control of applications or devices.

Audio - Lightweight engine for speech recognition.
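
To show how the fixed-point acoustic models listed above save memory, here is a generic linear quantization sketch. It is not PocketSphinx's actual storage format: the function names `quantize` and `dequantize` and the offset-plus-scale scheme are assumptions chosen for illustration.

```python
def quantize(values, bits=8):
    """Linearly map floats onto unsigned integers of the given bit width.

    Returns (codes, offset, scale) so that value ~= offset + code * scale.
    """
    if not values:
        raise ValueError("values must not be empty")
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, offset, scale):
    """Recover approximate float values from quantized codes."""
    return [offset + c * scale for c in codes]


# Hypothetical log-domain parameters stored as 8-bit codes.
params = [-10.0, -7.5, -2.5, 0.0]
codes, offset, scale = quantize(params, bits=8)
restored = dequantize(codes, offset, scale)
```

Each parameter then occupies one byte instead of a four-byte or eight-byte float, and the reconstruction error is at most half a quantization step.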

cmusphinx/pocketsphinx