# koljab/realtimestt

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/koljab-realtimestt).**

9,477 stars · 815 forks · Python · MIT

## Links

- GitHub: https://github.com/KoljaB/RealtimeSTT
- awesome-repositories: https://awesome-repositories.com/repository/koljab-realtimestt.md

## Topics

`python` `realtime` `speech-to-text`

## Description

RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition (ASR) server. It uses transformer-based recognition models and omnilingual pipelines to convert live audio streams into text, and exposes a WebSocket streaming API that accepts raw PCM audio.
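
Before audio reaches a raw-PCM WebSocket endpoint, a client has to pack samples into bytes and split them into frames. The sketch below is a minimal, hypothetical client-side helper, not code from the repository: it assumes 16 kHz mono, little-endian signed 16-bit PCM sent as 20 ms binary messages. `float_to_pcm16`, `pcm_chunks`, `SAMPLE_RATE` and `CHUNK_MS` are illustrative names; check the server documentation for the format it actually expects.

```python
import struct
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono audio
CHUNK_MS = 20         # assumption: one WebSocket binary message per 20 ms of audio


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Clip floats to [-1.0, 1.0] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm_chunks(pcm: bytes, sample_rate: int = SAMPLE_RATE, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Split a PCM byte string into fixed-duration frames (2 bytes per mono sample)."""
    frame_bytes = sample_rate * chunk_ms // 1000 * 2
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

Under these assumptions, one second of silence packs into 32,000 bytes, and `pcm_chunks` splits it into 50 frames of 640 bytes. Each frame would then be sent as one binary WebSocket message.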

The project is distinguished by a dual-backend transcription pipeline: a lightweight engine produces immediate partial transcriptions, while a heavier model produces the final, high-accuracy result. It also includes wake word detection to trigger recording, and its server loads heavy speech models once and shares them across multiple concurrent user sessions.
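
The dual-backend idea can be sketched in a few lines. This is an illustration under a stated assumption, not RealtimeSTT's API: each engine is modelled as a plain callable that maps PCM bytes to text, and `DualPipeline`, `feed` and `finish` are hypothetical names. The fast engine re-runs on the growing buffer after every chunk to refresh the partial. The heavy engine runs once, when the utterance ends.

```python
from dataclasses import dataclass, field
from typing import Callable

Transcriber = Callable[[bytes], str]  # hypothetical engine interface: PCM bytes in, text out


@dataclass
class DualPipeline:
    fast: Transcriber                      # lightweight engine, run on every chunk
    accurate: Transcriber                  # heavy engine, run once per utterance
    on_partial: Callable[[str], None] = print
    _buffer: bytearray = field(default_factory=bytearray, init=False, repr=False)

    def feed(self, chunk: bytes) -> None:
        """Add audio and publish a fresh partial computed over the utterance so far."""
        self._buffer.extend(chunk)
        self.on_partial(self.fast(bytes(self._buffer)))

    def finish(self) -> str:
        """Transcribe the complete utterance with the heavy engine, then reset."""
        text = self.accurate(bytes(self._buffer))
        self._buffer.clear()
        return text
```

The trade-off is deliberate. Partials may be wrong and are overwritten on the next chunk, so a cheap model is acceptable there. Only the final pass pays for the expensive model.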

Its broader capabilities cover audio processing tasks such as voice activity detection, speaker diarization, and speaker emotion detection. The system also supports real-time speech translation, routing transcribed text to system input as simulated keyboard typing, and an extensible engine factory for adding new transcription backends.
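
An extensible engine factory is commonly built as a name-to-class registry. The sketch below shows that general pattern. It is an assumption, not the project's actual factory code: `TranscriptionEngine`, `register_engine`, `create_engine` and the toy `EchoEngine` backend are all hypothetical.

```python
from typing import Callable, Dict, Protocol


class TranscriptionEngine(Protocol):
    """Hypothetical minimal interface every backend must satisfy."""
    def transcribe(self, pcm: bytes) -> str: ...


_REGISTRY: Dict[str, Callable[..., TranscriptionEngine]] = {}


def register_engine(name: str):
    """Class decorator that makes a backend constructible by name."""
    def decorator(cls):
        if name in _REGISTRY:
            raise ValueError(f"engine {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls
    return decorator


def create_engine(name: str, **options) -> TranscriptionEngine:
    """Instantiate a registered backend, forwarding engine-specific options."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown engine {name!r}; registered: {sorted(_REGISTRY)}") from None
    return factory(**options)


@register_engine("echo")
class EchoEngine:
    """Toy backend: reports how many 16-bit samples it received."""
    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def transcribe(self, pcm: bytes) -> str:
        return f"{self.prefix}{len(pcm) // 2} samples"
```

With this pattern, adding a backend means writing one class and decorating it. Callers select it by name, for example `create_engine("echo", prefix=">")`, without any change to the code that consumes transcripts.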

The server includes dedicated health and performance monitoring endpoints to track active sessions, inference latency, and worker utilization.
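
The three quantities named above (active sessions, inference latency, worker utilization) can be tracked with a small thread-safe counter object that an endpoint serializes on request. This is a hypothetical sketch. `ServerMetrics`, its methods and the snapshot keys are illustrative and do not describe the server's real endpoint schema.

```python
import threading
import time
from collections import deque
from contextlib import contextmanager


class ServerMetrics:
    """Hypothetical counters that a health or metrics endpoint could serialize to JSON."""

    def __init__(self, workers: int, window: int = 100):
        self._lock = threading.Lock()
        self._workers = workers
        self._active_sessions = 0
        self._busy_workers = 0
        self._latencies = deque(maxlen=window)  # most recent `window` latencies, in seconds

    def session_opened(self) -> None:
        with self._lock:
            self._active_sessions += 1

    def session_closed(self) -> None:
        with self._lock:
            self._active_sessions -= 1

    @contextmanager
    def inference(self):
        """Wrap one inference call: marks a worker busy and records its latency."""
        with self._lock:
            self._busy_workers += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:
                self._busy_workers -= 1
                self._latencies.append(elapsed)

    def snapshot(self) -> dict:
        """Return a consistent view of all metrics, e.g. for a JSON health response."""
        with self._lock:
            lat = list(self._latencies)
            return {
                "active_sessions": self._active_sessions,
                "avg_latency_ms": 1000 * sum(lat) / len(lat) if lat else 0.0,
                "worker_utilization": self._busy_workers / self._workers,
            }
```

The bounded `deque` keeps the latency average focused on recent traffic, and it stops memory from growing on a long-running server.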

## Tags

### Artificial Intelligence & ML

- [Real-Time Speech Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/real-time-speech-processing.md) — Provides a complete real-time pipeline for converting live audio streams into text using local models.
- [Audio Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription.md) — Converts raw audio chunks from files or WebSockets into text, first resampling the audio to the required processing rate. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/external-audio.md))
- [Real-Time Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/real-time-transcription.md) — Performs instantaneous conversion of live microphone audio streams into text transcripts via a persistent connection. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/fastapi-server.md))
- [Transcription APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/transcription-apis.md) — Provides a WebSocket-based streaming server that offers programmatic transcription capabilities for integration into external applications. ([source](https://github.com/KoljaB/RealtimeSTT#readme))
- [Transformer-Based ASR](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/transformer-based-asr.md) — Converts live audio to text using pre-trained transformer model families with specific generation settings. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/engines/hf-transformers.md))
- [Local Speech-to-Text](https://awesome-repositories.com/f/artificial-intelligence-ml/local-speech-to-text.md) — Runs transcription inference on local hardware to ensure data privacy and remove external API dependencies.
- [Automatic Speech Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/automatic-speech-recognition.md) — Utilizes transformer-based recognition and omnilingual pipelines for high-accuracy real-time transcription.
- [Dual-Model Transcription Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/realtime-processing-pipelines/dual-model-transcription-pipelines.md) — Produces fast, preliminary transcriptions using a lightweight model while a larger model processes the final result. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/engines/faster-whisper.md))
- [Incremental Transcription Previews](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-transcription/incremental-transcription-previews.md) — Generates incremental text results during active speech to show immediate progress before final transcription completes. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/engines/omnilingual-asr.md))
- [Streaming Transcription Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/conversational-audio-streams/streaming-transcription-inference.md) — Processes audio frames through a low-latency inference pipeline for immediate transcription previews. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/engines/kroko-onnx.md))
- [Wake Word Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/voice-activity-detection/wake-word-detection.md) — Monitors audio streams for specific activation phrases to trigger recording only after a wake word is spoken. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/installation.md))
- [Realtime Speech Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/speech-datasets/english/speech-to-text-translation/realtime-speech-translation.md) — Provides real-time conversion of spoken audio from one language into text in another language. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/test-scripts.md))
- [Real-Time Speech Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/real-time-speech-translation.md) — Converts spoken audio from one language into text in another language in real time.
- [Speaker Diarization](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-diarization.md) — Distinguishes between multiple voices in a single audio stream to attribute transcribed text to specific speakers. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/engines/funasr.md))
- [Speech-to-Text Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-integrations.md) — Utilizes an omnilingual pipeline that supports various model sizes and compute types for multi-language transcription. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/engines/omnilingual-asr.md))
- [Voice Activity Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-activity-detection.md) — Detects speech boundaries to automatically trigger recording and transcription cycles using sensitivity filters.
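
Voice activity detection decides where speech starts and stops. The sketch below illustrates the boundary logic with a plain RMS-energy detector, which is swapped in for illustration and is not the detector the project uses. A segment opens on the first frame above `threshold` and closes after `hangover` consecutive quiet frames, which plays the role of a post-speech silence timeout. Both parameter names and their defaults are hypothetical.

```python
import math
from typing import Iterable, List, Tuple


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def detect_segments(frames: Iterable[List[float]], threshold: float = 0.02,
                    hangover: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech segments, end exclusive."""
    segments: List[Tuple[int, int]] = []
    start = None   # index of the first loud frame of the open segment, if any
    quiet = 0      # consecutive quiet frames since the last loud one
    i = -1
    for i, frame in enumerate(frames):
        loud = rms(frame) > threshold
        if start is None:
            if loud:
                start, quiet = i, 0
        elif loud:
            quiet = 0
        else:
            quiet += 1
            if quiet >= hangover:
                # close the segment just after its last loud frame
                segments.append((start, i - quiet + 1))
                start = None
    if start is not None:  # stream ended while speech was still open
        segments.append((start, i + 1 - quiet))
    return segments
```

The hangover prevents short pauses between words from splitting one utterance into several recordings. Raising it trades responsiveness for fewer false cut-offs.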

### Graphics & Multimedia

- [Hybrid Precision Pipelines](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing/speech-to-text-pipelines/unified-transcription-pipelines/hybrid-precision-pipelines.md) — Utilizes a dual-backend pipeline with a lightweight engine for immediate partials and a heavy model for final accuracy.

### Networking & Communication

- [Speech Processing WebSocket Servers](https://awesome-repositories.com/f/networking-communication/speech-processing-websocket-servers.md) — Implements a WebSocket server that exposes real-time speech-to-text capabilities.
- [Multi-User Session Isolation](https://awesome-repositories.com/f/networking-communication/socket-networking/audio-streaming-servers/multi-user-session-isolation.md) — Ships a multi-user web server that isolates sessions while sharing inference resources for remote transcription. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/README.md))
- [PCM Audio Streaming](https://awesome-repositories.com/f/networking-communication/socket-networking/audio-streaming-servers/pcm-audio-streaming.md) — Accepts raw PCM audio over WebSockets and returns incremental transcription results.
- [Audio Transcription WebSockets](https://awesome-repositories.com/f/networking-communication/websocket-to-stream-bridges/audio-transcription-websockets.md) — Implements a WebSocket connection to transmit live PCM audio for real-time transcription results. ([source](https://github.com/KoljaB/RealtimeSTT/tree/master/example_fastapi_server))

### Data & Databases

- [Runtime Resource Sharing](https://awesome-repositories.com/f/data-databases/shared-memory-buffers/runtime-resource-sharing.md) — Loads heavy speech models into memory once and shares them across multiple concurrent user sessions to minimize overhead.
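
Loading a heavy model once and handing the same object to every session is a load-once cache guarded by a lock. The sketch below assumes that pattern. `ModelCache` and the `loader` callable are hypothetical stand-ins for whatever actually constructs a speech model.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class ModelCache:
    """Hypothetical load-once cache: each (name, device) pair is loaded a single time."""

    def __init__(self, loader: Callable[[str, str], Any]):
        self._loader = loader
        self._models: Dict[Tuple[str, str], Any] = {}
        self._lock = threading.Lock()
        self.loads = 0  # how many times the expensive loader actually ran

    def get(self, name: str, device: str = "cpu") -> Any:
        key = (name, device)
        with self._lock:  # held during loading so concurrent callers never load twice
            if key not in self._models:
                self._models[key] = self._loader(name, device)
                self.loads += 1
            return self._models[key]
```

Holding one lock during loading is the simplest correct choice, but it also blocks requests for other, already-loaded models while a load is in progress. Per-key locks would remove that contention at the cost of more bookkeeping.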

### Software Engineering & Architecture

- [Inference Stream Multiplexing](https://awesome-repositories.com/f/software-engineering-architecture/high-throughput-task-processing/network-request-processing/inference-stream-multiplexing.md) — Manages multiple concurrent user sessions by isolating audio buffers while sharing model weight memory for inference.
- [Inference Session Isolation](https://awesome-repositories.com/f/software-engineering-architecture/inference-session-isolation.md) — Isolates audio buffers and transcription states for multiple simultaneous users while sharing a single inference engine. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/fastapi-server.md))
- [Transcription Engine Adapters](https://awesome-repositories.com/f/software-engineering-architecture/pluggable-backends/transcription-engine-adapters.md) — Implements modular interfaces that allow swapping different speech-to-text engines without modifying core application logic. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/transcription-engines.md))
- [Split-Pipeline Transcription](https://awesome-repositories.com/f/software-engineering-architecture/pluggable-backends/transcription-engine-adapters/split-pipeline-transcription.md) — Allows using separate transcription engines for partial and final results to balance processing speed with accuracy. ([source](https://github.com/KoljaB/RealtimeSTT/blob/master/docs/transcription-engines.md))
