WhisperLiveKit is a real-time speech-to-text server that transcribes streaming audio with low latency using Whisper models. It exposes transcription through REST endpoints and WebSocket connections, so external applications can send audio and receive text as the words are spoken. This makes it suitable for live captioning and voice interfaces.
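To illustrate what "sending audio as it is spoken" can look like on the client side, here is a minimal sketch that slices raw audio into short fixed-duration chunks before they are handed to a streaming connection. The function name `chunk_pcm`, the 16 kHz mono 16-bit PCM format, and the 100 ms chunk size are all assumptions made for the example; the audio format and endpoints the server actually expects are not described here, and the transport itself is left out.

```python
from typing import Iterator


def chunk_pcm(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100,
              bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield fixed-duration slices of mono PCM audio.

    Hypothetical helper: the defaults (16 kHz, 16-bit samples, 100 ms) are
    illustrative assumptions, not the server's documented input format.
    The final slice may be shorter than the others.
    """
    if sample_rate <= 0 or chunk_ms <= 0 or bytes_per_sample <= 0:
        raise ValueError("sample_rate, chunk_ms and bytes_per_sample must be positive")
    # Bytes per chunk = samples per chunk * bytes per sample.
    chunk_bytes = (sample_rate * chunk_ms // 1000) * bytes_per_sample
    if chunk_bytes == 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


if __name__ == "__main__":
    one_second_of_silence = bytes(16000 * 2)
    for i, chunk in enumerate(chunk_pcm(one_second_of_silence)):
        # A real client would send each chunk over its streaming connection here.
        print(i, len(chunk))
```

Smaller chunks reduce the delay before the server sees new audio, at the cost of more messages per second; the right trade-off depends on the deployment.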
The project distinguishes itself by combining real-time transcription with speaker diarization: transcribed words are assigned to individual speakers during the live stream, which suits meeting and interview transcripts. It also integrates a translation pipeline based on distilled NLLB models, which renders speech as text in a different target language while the stream is still running. The server hides model loading and inference behind a unified interface that supports multiple backends, such as Whisper.cpp and Transformers, and ships command-line tools for managing the model lifecycle: listing, downloading, and deleting speech recognition models.
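One common way to attach speakers to transcribed words is to compare each word's time span with the speaker turns produced by a diarization step and pick the turn with the largest overlap. The sketch below shows that idea only; the `Word` and `SpeakerTurn` types and the `assign_speakers` function are hypothetical names invented for the example, and the project's own assignment logic may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds
    speaker: Optional[str] = None


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def assign_speakers(words: List[Word], turns: List[SpeakerTurn]) -> List[Word]:
    """Label each word with the speaker whose turn overlaps it the most.

    Illustrative sketch, not the project's implementation. Words that
    overlap no turn keep speaker=None.
    """
    for word in words:
        best_overlap = 0.0
        for turn in turns:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(word.end, turn.end) - max(word.start, turn.start)
            if overlap > best_overlap:
                best_overlap = overlap
                word.speaker = turn.speaker
    return words


if __name__ == "__main__":
    turns = [SpeakerTurn("A", 0.0, 1.0), SpeakerTurn("B", 1.0, 2.0)]
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.1, 1.4)]
    for w in assign_speakers(words, turns):
        print(w.speaker, w.text)
```

In a live stream the turns and words arrive incrementally, so a real system would run this kind of matching on a sliding window rather than on complete lists.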
Additional capabilities include a command-line tool that generates SRT subtitle files directly from audio or video files, and a benchmarking system that measures transcription speed and accuracy across backends and models and outputs the results as JSON or plots. The server can load custom models from file paths, directories, or Hugging Face repositories, and it is packaged as Docker containers with GPU or CPU support for reproducible production deployments.
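For readers unfamiliar with the SRT format mentioned above, the following sketch turns timed text segments into SRT cues: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, the text, and a blank line between cues. The helper names `format_timestamp` and `to_srt` and the `(start, end, text)` tuple layout are assumptions for illustration; this is not the project's command-line tool.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    if seconds < 0:
        raise ValueError("timestamps must be non-negative")
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) segments.

    Hypothetical helper: the segment layout is an assumption for this sketch.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.5, "Hello"), (1.5, 3.0, "world")]))
```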