WhisperLiveKit

WhisperLiveKit is a real-time speech-to-text server that transcribes streaming audio with low latency using Whisper models. It exposes transcription through REST endpoints and WebSocket connections, so external applications can send audio and receive text as words are spoken. This makes it suitable for live captioning and voice interfaces.

The project distinguishes itself by combining real-time transcription with speaker diarization: transcribed words are assigned to individual speakers during the live stream, which suits meeting and interview transcripts. It also integrates a translation pipeline built on distilled NLLB models, which renders speech as text in a different target language while the stream is still running. The server abstracts model loading and inference behind a unified interface that supports multiple backends, such as Whisper.cpp and Transformers, and provides command-line tools for managing the model lifecycle: listing, downloading, and deleting speech recognition models.
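To make the word-to-speaker assignment above concrete, here is a minimal sketch that labels each timestamped word with the speaker turn it overlaps most. The `Word` and `SpeakerTurn` types and the `assign_speakers` function are hypothetical names chosen for illustration, and the overlap rule is an assumption; the source does not describe how WhisperLiveKit itself merges diarization output with transcribed words.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A transcribed word with start/end times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


@dataclass
class SpeakerTurn:
    """A diarization segment attributed to one speaker (hypothetical type)."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns, unknown="unknown"):
    """Pair each word with the speaker whose turn overlaps it the most.

    Assumption: maximum temporal overlap decides the label. Words that
    overlap no turn at all are labeled with `unknown`.
    """
    labeled = []
    for word in words:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, word.text))
    return labeled


if __name__ == "__main__":
    turns = [SpeakerTurn("S1", 0.0, 1.0), SpeakerTurn("S2", 1.0, 2.0)]
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.2, 1.5)]
    for speaker, text in assign_speakers(words, turns):
        print(f"{speaker}: {text}")
```

In a live stream the turns and words would arrive incrementally, so a real system would rerun or update this assignment as new diarization segments come in.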

Additional capabilities include generating SRT subtitle files directly from audio or video files via a command-line tool, and a benchmarking system that measures transcription speed and accuracy across backends and models, outputting results as JSON or plots. The server supports custom model loading from file paths, directories, or Hugging Face repositories, and is packaged into Docker containers with GPU or CPU support for reproducible production deployments.
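As an illustration of the subtitle output mentioned above, the following sketch turns timestamped segments into SRT text. SRT itself is a fixed format (a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, the caption text, then a blank line), but the function names `format_timestamp` and `to_srt` and the `(start, end, text)` segment shape are assumptions made for this example, not the project's actual tool or API.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start_seconds, end_seconds, text) tuples as SRT text.

    Each cue is numbered from 1 and separated from the next by a blank line.
    """
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_line = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        cues.append(f"{index}\n{time_line}\n{text.strip()}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "Hello"), (1.25, 2.0, "world")]))
```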

Features

Real-Time Speech-to-Text Servers - Serves as a backend service that transcribes live audio streams into text via WebSocket connections.

Chunked Audio Transcribers - Processes audio in small overlapping chunks to enable real-time transcription with minimal latency.

Multilingual Speech Translation - Converts spoken audio into text in a different target language while the audio is streaming.

Streaming Translation Pipelines - Integrates distilled NLLB models to translate transcribed speech into target languages during the stream.

Real-Time Speech Transcription - Processes streaming audio incrementally with low latency, producing text as words are spoken.

Simultaneous Speech Translation - Translates spoken audio into text in a different target language while the stream is in progress.

Speaker Diarization - Assigns transcribed words to individual speakers in real time using streaming diarization algorithms.

Real-Time Speaker Identifiers - Assigns speaker labels to transcribed words during live audio streams.

Transcription with Speaker Labels - Assigns transcribed words to individual speakers in real time using streaming diarization algorithms for meeting transcripts.

Bidirectional Audio Transports - Uses WebSocket connections for low-latency bidirectional streaming of audio and transcription data.

Speech-to-Text API Wrappers - Serves speech-to-text through REST endpoints and WebSocket connections for external application integration.

Model Size Selection Options - Provides multiple model sizes to balance transcription speed, accuracy, and hardware requirements.

Model Selectors - Lists, downloads, deletes, and selects Whisper model sizes to balance transcription speed, accuracy, and hardware requirements.

Timestamped Subtitle Generators - Produces SRT subtitle files directly from audio or video files via a command-line tool for media accessibility.

Automated Subtitle Generators - Generates SRT subtitle files from audio or video files using speech-to-text models.

Model Managers - Manages model lifecycle through CLI commands for listing, downloading, and deleting speech recognition models.

Inference Backend Abstraction - Abstracts model loading and inference behind a unified interface supporting multiple backends like Whisper.cpp and Transformers.

Docker Container Deployments - Packages the server and dependencies into Docker containers with GPU or CPU support for reproducible deployment.

AI Server Containerization - Provides a containerized deployment of the speech-to-text server with GPU and CPU support for reproducible production environments.
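The chunked processing listed among the features can be sketched as a sliding window over the audio samples. This is a generic illustration under stated assumptions: the function name `overlapping_chunks` and the parameters `chunk_size` and `overlap` are hypothetical, and the source does not say which chunk sizes or streaming policy WhisperLiveKit actually uses.

```python
def overlapping_chunks(samples, chunk_size, overlap):
    """Split a sequence of audio samples into overlapping windows.

    Consecutive chunks share `overlap` samples, so a word cut at the end
    of one chunk appears whole near the start of the next. The final
    chunk may be shorter than `chunk_size`.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_size])
        if start + chunk_size >= len(samples):
            break  # this window already reaches the end of the audio
    return chunks


if __name__ == "__main__":
    # Ten fake samples, windows of four samples sharing two with the next.
    for chunk in overlapping_chunks(list(range(10)), chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap reduces the chance of splitting a word across a boundary but means more audio is processed twice, so the choice is a trade-off between accuracy at chunk edges and compute cost.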

QuentinFuxa/WhisperLiveKit
