WhisperLive | Awesome Repository

Features

Real-Time Transcription - Converts live audio streams into text in real time using Whisper models, so text is available while the speaker is still talking.

Audio Transcription - Provides a backend service that streams microphone input and delivers incremental text transcriptions.

GPU-Accelerated Inference - Employs a GPU-accelerated inference engine to optimize throughput for multilingual speech recognition.

Real-Time Speech-to-Text Servers - Functions as a real-time audio transcription server using Whisper models and WebSocket streaming.

Speaker Diarization - Clusters audio feature vectors to distinguish and segment different speakers within a single audio stream.

Whisper-Based Engines - Uses the Faster-Whisper engine with a CTranslate2 backend to optimize transcription speed and memory usage.

Audio Transcription WebSockets - Uses WebSockets to stream raw PCM audio from the client to the server for real-time processing.

High-Throughput Transcription - Processes multiple simultaneous audio streams via GPU batching to achieve high transcription throughput.

Word-Level Timestamps - Produces precise start and end timestamps and confidence scores for every individual word transcribed.

Inference Acceleration Engines - Implements high-performance inference using TensorRT to accelerate speech-to-text processing.

Incremental Transcription Previews - Incrementally renders transcribed text segments on the screen as they are emitted by the backend.

Technical Jargon Optimizations - Improves transcription accuracy for domain-specific technical terms using keyword boosting.

Inference Batching - Groups multiple concurrent user audio segments into single GPU calls to maximize system throughput.

Transcription Term Boosts - Provides a mechanism to boost specific technical terms and jargon during the transcription decoding process.

Live Captioning Integrations - Displays incrementally processed speech as text on screen for real-time live captioning.

Incremental Text Rendering - Updates the user interface incrementally by appending transcribed text chunks as they are emitted.
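
The incremental rendering items above can be pictured with a small client-side buffer. This is only a sketch: the `CaptionBuffer` class and the `final` flag are hypothetical names, and it assumes the backend marks each emitted segment as either a partial guess or a finalized result, which the listing does not state.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionBuffer:
    """Accumulates transcript segments for display (hypothetical helper)."""

    committed: list[str] = field(default_factory=list)
    pending: str = ""

    def update(self, text: str, final: bool) -> str:
        """Apply one backend message and return the text to show on screen."""
        if final:
            # A finalized segment is appended permanently.
            self.committed.append(text.strip())
            self.pending = ""
        else:
            # A partial segment replaces the previous partial guess.
            self.pending = text.strip()
        return self.render()

    def render(self) -> str:
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)


buf = CaptionBuffer()
buf.update("hello", final=False)
buf.update("hello world", final=True)
print(buf.update("how are", final=False))  # hello world how are
```

Keeping partial text separate from committed text lets a caption view overwrite an early guess without rewriting anything that was already finalized.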

WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps.
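
As a rough illustration of what "raw PCM over WebSockets" involves on the client side, the sketch below converts floating-point samples to 16-bit PCM and splits them into fixed-size frames that could each be sent as one binary message. The sample width, the 100 ms frame size, and both function names are assumptions made for illustration; the actual wire format used by WhisperLive is not described here.

```python
import struct


def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def frame_chunks(pcm: bytes, frame_bytes: int) -> list[bytes]:
    """Split a PCM byte string into fixed-size frames; the last may be shorter."""
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz input
FRAME_MS = 100        # assumed frame duration, not taken from the project
frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

pcm = float_to_pcm16([0.0] * SAMPLE_RATE)  # one second of silence
frames = frame_chunks(pcm, frame_bytes)
print(len(frames), len(frames[0]))  # 10 3200
```

Smaller frames lower the delay before the server sees new audio, at the cost of more messages per second.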

The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording.
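
The idea of clustering embeddings to separate speakers can be sketched as follows. The listing does not say which clustering method the diarization tool uses, so this example substitutes a simple greedy scheme based on cosine similarity; `assign_speakers`, the 0.8 threshold, and the two-dimensional toy embeddings are all illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_speakers(embeddings: list[list[float]], threshold: float = 0.8) -> list[int]:
    """Greedy clustering: each embedding joins the most similar centroid,
    or starts a new speaker when no centroid is similar enough."""
    centroids: list[list[float]] = []
    counts: list[int] = []
    labels: list[int] = []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=-1)
        if best >= 0 and sims[best] >= threshold:
            n = counts[best]
            # Update the running mean of the matched centroid.
            centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
        else:
            # No existing speaker is close enough: open a new cluster.
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        labels.append(best)
    return labels


segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
print(assign_speakers(segments))  # [0, 0, 1, 1, 0]
```

Each label is a speaker index for the corresponding audio segment, so consecutive segments with the same label can be merged into one speaker turn.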

Additional capabilities include high-throughput audio processing via batch inference and TensorRT acceleration, as well as audio signal normalization and recording state control. The service supports live audio captioning through segment-based incremental rendering.
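
Batch inference, as described above, amounts to collecting segments from several clients and submitting them in one call. The sketch below shows only the grouping and padding step under stated assumptions: `make_batches` is a hypothetical helper, audio is represented as plain lists of floats, and shorter segments are zero-padded to the longest one in their batch. The actual model call is left out.

```python
def make_batches(
    segments: dict[str, list[float]], max_batch: int
) -> list[tuple[list[str], list[list[float]]]]:
    """Group per-client segments into zero-padded batches of at most max_batch."""
    batches = []
    items = list(segments.items())
    for start in range(0, len(items), max_batch):
        group = items[start:start + max_batch]
        longest = max(len(samples) for _, samples in group)
        ids = [client_id for client_id, _ in group]
        # Pad every segment to the same length so the batch is rectangular.
        padded = [samples + [0.0] * (longest - len(samples)) for _, samples in group]
        batches.append((ids, padded))
    return batches


pending = {"client-a": [0.1, 0.2, 0.3], "client-b": [0.4], "client-c": [0.5, 0.6]}
for ids, batch in make_batches(pending, max_batch=2):
    print(ids, [len(row) for row in batch])
# ['client-a', 'client-b'] [3, 3]
# ['client-c'] [2]
```

Returning the client ids alongside each batch makes it possible to route every result row back to the connection that produced the audio.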
