RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition (ASR) server. It uses transformer-based recognition and omnilingual pipelines to convert live audio streams into text, and it exposes a WebSocket-based streaming API that accepts raw PCM audio.
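As a rough illustration of what streaming raw PCM involves on the client side, the sketch below splits a 16-bit mono buffer into fixed-duration frames, each of which could be sent as one binary WebSocket message. The sample rate, frame duration, and the `pcm_frames` helper are assumptions made for this example; they are not taken from the server's documented protocol.

```python
import struct

SAMPLE_RATE = 16_000  # assumed sample rate in Hz
SAMPLE_WIDTH = 2      # bytes per sample for 16-bit PCM
FRAME_MS = 20         # assumed frame duration in milliseconds


def pcm_frames(pcm: bytes, sample_rate: int = SAMPLE_RATE, frame_ms: int = FRAME_MS):
    """Yield fixed-duration frames of raw 16-bit mono PCM (hypothetical helper)."""
    frame_bytes = sample_rate * frame_ms // 1000 * SAMPLE_WIDTH
    for start in range(0, len(pcm), frame_bytes):
        # The last frame may be shorter if the buffer is not a multiple of frame_bytes.
        yield pcm[start:start + frame_bytes]


# One second of silence as little-endian signed 16-bit samples.
audio = struct.pack("<16000h", *([0] * 16000))
frames = list(pcm_frames(audio))
```

A real client would pass each frame to a WebSocket library's binary send call instead of collecting the frames into a list.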
The project is distinguished by a dual-backend transcription pipeline: a lightweight engine produces immediate partial suggestions, and a heavier model produces the final high-accuracy result. It also includes a wake word detection system that triggers recording, and a shared-resource inference model that shares heavy speech models across multiple concurrent user sessions.
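The dual-backend idea can be sketched as a generator that asks a fast backend for a partial result after every chunk and a slower backend for one final result at the end of the utterance. This is a minimal sketch under stated assumptions: `Update`, `dual_backend_stream`, and the two stub backends are hypothetical names for illustration, not the project's actual classes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Any callable that maps audio bytes to text can act as a backend here.
Transcriber = Callable[[bytes], str]


@dataclass
class Update:
    text: str
    is_final: bool


def dual_backend_stream(
    chunks: Iterable[bytes], fast: Transcriber, accurate: Transcriber
) -> Iterator[Update]:
    """Yield a partial update per chunk, then one final update (hypothetical helper)."""
    buffered = b""
    for chunk in chunks:
        buffered += chunk
        # The lightweight backend runs on everything heard so far.
        yield Update(fast(buffered), is_final=False)
    if buffered:
        # The heavier backend runs once, on the complete utterance.
        yield Update(accurate(buffered), is_final=True)


# Stand-in backends so the sketch runs without any speech model.
def fast_stub(audio: bytes) -> str:
    return f"~{len(audio)} bytes"


def accurate_stub(audio: bytes) -> str:
    return f"{len(audio)} bytes"


updates = list(dual_backend_stream([b"ab", b"cd"], fast_stub, accurate_stub))
```

In this sketch the partial texts are provisional and are superseded by the final update, which is the usual reason for running two backends side by side.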
Its broader capabilities cover audio processing tasks such as voice activity detection, speaker diarization, and speaker emotion detection. The system also supports real-time speech translation, routing of transcribed text to system input by simulating keyboard typing, and an extensible engine factory for adding new transcription backends.
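One common way to build an extensible engine factory is a name-to-constructor registry, sketched below. The names `register_engine`, `create_engine`, and `EchoEngine` are hypothetical and chosen only for this example; the project's own factory interface may look different.

```python
from typing import Callable, Dict, Protocol


class Engine(Protocol):
    """Minimal interface assumed for a transcription backend."""

    def transcribe(self, audio: bytes) -> str: ...


# Registry mapping an engine name to a zero-argument constructor.
_ENGINES: Dict[str, Callable[[], Engine]] = {}


def register_engine(name: str):
    """Class decorator that adds a backend to the registry (hypothetical helper)."""
    def decorator(factory):
        _ENGINES[name] = factory
        return factory
    return decorator


def create_engine(name: str) -> Engine:
    """Instantiate a registered backend by name."""
    factory = _ENGINES.get(name)
    if factory is None:
        raise ValueError(f"unknown engine {name!r}; available: {sorted(_ENGINES)}")
    return factory()


@register_engine("echo")
class EchoEngine:
    """Stand-in backend that only reports how much audio it received."""

    def transcribe(self, audio: bytes) -> str:
        return f"<{len(audio)} bytes of audio>"


engine = create_engine("echo")
```

With this pattern, adding a backend means defining one class and decorating it; the code that calls `create_engine` does not change.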
The server includes dedicated health and performance monitoring endpoints to track active sessions, inference latency, and worker utilization.
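To make the three tracked quantities concrete, the sketch below assembles a health payload from simple counters. The `ServerStats` class and the field names in the payload are assumptions for illustration only; the server's actual endpoint paths and response schema are not described here.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class ServerStats:
    """Hypothetical container for the counters a health endpoint might report."""

    total_workers: int
    busy_workers: int = 0
    active_sessions: int = 0
    latencies_ms: List[float] = field(default_factory=list)

    def snapshot(self) -> dict:
        """Return a JSON-serializable summary of the current state."""
        utilization = (
            self.busy_workers / self.total_workers if self.total_workers else 0.0
        )
        return {
            "status": "ok",
            "active_sessions": self.active_sessions,
            "worker_utilization": utilization,
            # None signals that no inference has been timed yet.
            "avg_inference_latency_ms": (
                mean(self.latencies_ms) if self.latencies_ms else None
            ),
        }


stats = ServerStats(
    total_workers=4, busy_workers=1, active_sessions=3, latencies_ms=[100.0, 200.0]
)
payload = stats.snapshot()
```

A monitoring endpoint could return a dictionary like `payload` as its JSON body.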