RealtimeSTT | Awesome Repository

RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition (ASR) server. It uses transformer-based recognition and omnilingual pipelines to convert live audio streams into text, and it exposes a WebSocket-based streaming API that accepts raw PCM audio.
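
As a rough illustration of what "raw PCM" means on the client side, the sketch below packs floating-point samples into bytes and splits them into chunks that could be sent as binary WebSocket frames. The sample format (signed 16-bit little-endian) and the chunk size are assumptions made for this example, and the helper names float_to_pcm16 and chunk_bytes are hypothetical; the format the server actually expects is not stated here.

```python
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian PCM bytes.

    The 16-bit little-endian layout is an assumption for this sketch.
    """
    ints = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against out-of-range input
        ints.append(int(round(clipped * 32767)))
    return struct.pack("<%dh" % len(ints), *ints)


def chunk_bytes(data, chunk_size):
    """Split a byte string into fixed-size chunks; the last may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```

Each chunk returned by chunk_bytes would then be written to the socket in order; the transport itself is left out so the sketch stays independent of any particular WebSocket library.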

The project is distinguished by a dual-backend transcription pipeline that uses a lightweight engine for immediate partial suggestions and a heavier model for final high-accuracy results. It includes a wake word detection system to trigger recording and uses a shared-resource inference design so that heavy speech models are shared across multiple concurrent user sessions.
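
The dual-backend idea can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the class name DualBackendTranscriber and the engine signature (a callable that takes audio bytes and returns text) are assumptions chosen for the example.

```python
from typing import Callable

# Hypothetical engine signature: audio bytes in, transcript text out.
Engine = Callable[[bytes], str]


class DualBackendTranscriber:
    """Fast engine for partial results, accurate engine for the final text."""

    def __init__(self, fast_engine: Engine, accurate_engine: Engine):
        self.fast_engine = fast_engine
        self.accurate_engine = accurate_engine
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> str:
        """Append audio and return a quick partial transcript of the buffer."""
        self._buffer.extend(chunk)
        return self.fast_engine(bytes(self._buffer))

    def finalize(self) -> str:
        """Run the heavier engine once over the whole utterance, then reset."""
        text = self.accurate_engine(bytes(self._buffer))
        self._buffer.clear()
        return text
```

The caller would invoke feed for every incoming chunk to update the on-screen suggestion and call finalize when the utterance ends, for example when voice activity detection reports silence.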

Its broader capabilities cover audio processing tasks such as voice activity detection, speaker diarization, and speaker emotion detection. The system also supports real-time speech translation, automated system input routing to simulate keyboard typing, and an extensible engine factory for adding new transcription backends.
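
One common way to build an extensible engine factory is a name-to-class registry. The sketch below shows that pattern under stated assumptions: register_engine, create_engine, and EchoEngine are hypothetical names invented for this example and are not taken from the project's code.

```python
# Registry mapping a backend name to the class that implements it.
_ENGINES = {}


def register_engine(name):
    """Class decorator that makes a backend available under `name`."""
    def decorator(cls):
        _ENGINES[name] = cls
        return cls
    return decorator


def create_engine(name, **kwargs):
    """Instantiate a registered backend, failing clearly on unknown names."""
    try:
        cls = _ENGINES[name]
    except KeyError:
        raise ValueError(
            "unknown engine %r; available: %s" % (name, sorted(_ENGINES))
        ) from None
    return cls(**kwargs)


@register_engine("echo")
class EchoEngine:
    """Stand-in backend that reports the audio size instead of transcribing."""

    def __init__(self, prefix=""):
        self.prefix = prefix

    def transcribe(self, audio: bytes) -> str:
        return self.prefix + "%d bytes" % len(audio)
```

With this layout, adding a backend means writing one decorated class; code that selects an engine by name from configuration does not need to change.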

The server includes dedicated health and performance monitoring endpoints to track active sessions, inference latency, and worker utilization.
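
To make the three tracked quantities concrete, here is a minimal sketch of the bookkeeping a monitoring endpoint might serialize. The class name ServerMetrics and the field names in the snapshot are assumptions for illustration; the server's actual endpoint paths and response format are not described here.

```python
class ServerMetrics:
    """In-memory counters for sessions, latency, and worker load.

    Not thread-safe as written; a real server would guard updates with a lock.
    """

    def __init__(self, total_workers):
        self.total_workers = total_workers
        self.busy_workers = 0
        self.active_sessions = 0
        self._latencies = []

    def record_latency(self, seconds):
        """Store the duration of one inference call, in seconds."""
        self._latencies.append(seconds)

    def snapshot(self):
        """Return a plain dict suitable for JSON serialization."""
        if self._latencies:
            avg = sum(self._latencies) / len(self._latencies)
        else:
            avg = 0.0
        if self.total_workers:
            utilization = self.busy_workers / self.total_workers
        else:
            utilization = 0.0
        return {
            "active_sessions": self.active_sessions,
            "avg_inference_latency_s": avg,
            "worker_utilization": utilization,
        }
```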

Features

Real-Time Speech Processing - Provides a complete real-time pipeline for converting live audio streams into text using local models.
Audio Transcription - Converts raw audio chunks from files or WebSocket connections into text, first resampling the audio to the required processing rate.
Real-Time Transcription - Performs low-latency conversion of live microphone audio streams into text transcripts via a persistent connection.
Transcription APIs - Provides a WebSocket-based streaming server that offers programmatic transcription capabilities for integration into external applications.
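
The resampling step mentioned under Audio Transcription can be illustrated with plain linear interpolation. This is a simplified sketch: the function name resample_linear is hypothetical, the target rate is whatever the recognition model requires, and a production resampler would also apply a low-pass filter before downsampling to avoid aliasing.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a list of float samples by linear interpolation.

    No anti-aliasing filter is applied, so this is only a teaching sketch.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    if src_rate == dst_rate:
        return list(samples)

    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left                  # weight of the right neighbour
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

For example, halving the rate of a four-sample ramp keeps every second sample, and doubling the rate of a two-sample signal inserts the midpoint between them.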