Open-source libraries and applications that provide low-latency audio processing for live automated speech-to-text transcription.
RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition server. It utilizes transformer-based recognition and omnilingual pipelines to convert live audio streams into text, providing a WebSocket-based streaming API for raw PCM audio transmission. The project is distinguished by a dual-backend transcription pipeline that uses a lightweight engine for immediate partial suggestions and a heavier model for final high-accuracy results. It includes a wake word detection system to trigger recording and employs a shared-resource inference model to distribute heavy speech models across multiple concurrent user sessions. Its broader capabilities cover audio processing tasks such as voice activity detection, speaker diarization, and speaker emotion detection. The system also supports real-time speech translation, automated system input routing to simulate keyboard typing, and an extensible engine factory for adding new transcription backends. The server includes dedicated health and performance monitoring endpoints to track active sessions, inference latency, and worker utilization.
This is a self-hostable real-time speech-to-text engine that supports live audio streaming, low-latency processing, multi-language models, and speaker diarization via a WebSocket API.
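A streaming API that accepts raw PCM usually expects the client to send audio in small fixed-duration frames. The sketch below shows one way a client might prepare such frames using only the Python standard library. It is not RealtimeSTT's own client code: the function names, the 16 kHz sample rate, and the 20 ms frame size are illustrative assumptions, and the actual WebSocket send step is left out.

```python
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.

    Values outside the range are clipped rather than wrapped.
    """
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def frame_pcm(pcm, sample_rate=16000, frame_ms=20):
    """Yield fixed-duration frames of mono 16-bit PCM.

    sample_rate and frame_ms are assumed defaults, not values taken from
    any particular server. The final frame may be shorter than the rest.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

Each yielded frame would then be sent as one binary WebSocket message; smaller frames lower latency at the cost of more messages per second.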
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additional capabilities include high-throughput audio processing via batch inference and TensorRT acceleration, as well as audio signal normalization and recording state control. The service supports live audio captioning through segment-based incremental rendering.
WhisperLive is a self-hostable, real-time speech-to-text server that supports live audio streaming via WebSockets, speaker diarization, and GPU-accelerated inference.
FunASR is an automatic speech recognition toolkit and multilingual speech-to-text engine designed to convert spoken audio into written text across more than fifty languages. It provides a framework for speaker diarization, an OpenAI-compatible transcription API for local server hosting, and speech models compatible with the ONNX format. The project distinguishes itself by supporting high-performance inference on edge hardware via self-contained binaries and portable model exports. It incorporates specialized capabilities for natural speech generation with adjustable timbre and emotional expression, as well as the ability to capture live microphone audio for direct voice-to-text input automation. The toolkit covers a broad range of audio analysis and processing capabilities, including voice activity detection, audio event and emotion detection, and punctuation restoration. It also includes tools for automated video captioning through the generation of timed subtitle files and distributed model fine-tuning to improve recognition accuracy using custom datasets.
FunASR is a comprehensive speech-to-text engine that supports real-time streaming, speaker diarization, and multilingual transcription, while offering an OpenAI-compatible API for self-hosted deployments.
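Timed subtitle files of the kind mentioned above are commonly written in the SubRip (SRT) format: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the caption text, separated by blank lines. The following is a minimal sketch of that formatting step, assuming the recognizer has already produced `(start_seconds, end_seconds, text)` tuples. The function names are hypothetical and this is not FunASR's API.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, text) segments as one SRT document string."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

For example, `to_srt([(0.0, 1.25, "hello")])` yields a single cue numbered 1 that spans `00:00:00,000 --> 00:00:01,250`.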
Moonshine is a complete on-device voice interface toolkit that provides speech recognition, text-to-speech synthesis, phonetic processing, speaker diarization, and intent recognition, all running locally on edge hardware without any cloud dependency. It executes quantized neural networks for speech and language tasks directly on the device, enabling fully offline conversational AI capabilities. The toolkit distinguishes itself by orchestrating multi-turn spoken exchanges through a conversational flow manager that maintains context across interactions and manages branching dialog flows. It includes model weight quantization for reducing model size and improving inference speed on edge devices, multicore compute distribution for optimizing performance across CPU cores, and a streaming audio pipeline that processes audio in chunks with real-time transcription events. Speaker diarization distinguishes individual voices in multi-speaker audio streams, while semantic intent matching identifies user commands through embedding similarity. Moonshine provides a conversational agent builder for defining multi-step dialog flows that understand user intent and respond with synthesized speech. It supports real-time live speech transcription from microphone or file input, concurrent audio stream processing, and grapheme-to-phoneme conversion for text-to-speech synthesis across multiple languages. The toolkit includes model asset downloading and caching, audio input quality debugging, internal API call logging, and transcription latency benchmarking for evaluating real-time performance.
Moonshine is a comprehensive on-device toolkit that provides real-time speech-to-text transcription, speaker diarization, and multi-language support, making it a robust solution for self-hosted, low-latency audio processing.
This project is a self-hosted meeting transcription and summarization tool that converts audio recordings into text transcripts and structured notes using large language models. It functions as an enterprise meeting documentation manager, allowing for the organization and editing of timestamped records. The system prioritizes data privacy through local-first processing and the ability to deploy on private infrastructure. It supports a provider-agnostic architecture, enabling users to connect to local AI engines, self-hosted servers, or cloud-based API endpoints for both transcription and summarization. The platform covers a broad range of capabilities, including multilingual speech-to-text, real-time audio capture of system and microphone sounds, and hardware-accelerated transcription. It features a template-driven system for generating consistent summaries, role-based access control for team management, and tools for exporting content to PDF, Word, and Markdown formats. Security is handled through data-at-rest encryption and frameworks for regional data compliance such as GDPR and HIPAA.
This is a self-hosted meeting transcription platform that provides real-time audio capture and speech-to-text processing, directly addressing the need for a private, low-latency transcription engine.
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a WebSocket server for hosting remote streaming and batch transcription services. The toolkit covers a broad surface of audio capabilities, including multilingual speech recognition and translation, sound event classification, wake word detection, and voice activity detection. It also includes text processing utilities for automatic punctuation and subtitle generation, as well as audio signal processing for noise removal and source separation. Native interfaces are available for Java, Kotlin, Swift, and Object Pascal, with support for WebAssembly to enable browser-based recognition.
Sherpa-ONNX is a self-hostable speech-to-text engine that supports real-time streaming, multi-language recognition, and speaker diarization.
Vosk is an offline speech-to-text engine and API that converts spoken audio into text locally on a device. It provides a cross-platform speech toolkit with language bindings for integrating voice recognition into server environments, Android, iOS, and Raspberry Pi. The project includes a speaker identification tool to distinguish between different voices and an acoustic model trainer for building custom neural network models. These training tools enable speech feature extraction and model accuracy evaluation to improve recognition for specialized domains. The system supports real-time audio streaming and the transcription of mono 16-bit PCM WAV files. Additional capabilities include keyword spotting to restrict transcription to specific phrases, vocabulary configuration for specialized terminology, and the generation of synchronized SRT subtitle strings.
Vosk is a comprehensive, self-hostable speech-to-text engine that supports real-time streaming, speaker identification, and multi-language transcription through its flexible API.
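Because the engine expects mono 16-bit PCM WAV input, it can help to validate a file before handing it to the recognizer. The sketch below does this with Python's standard `wave` module only. It does not call Vosk itself, and the helper name `is_mono_pcm16` is an illustrative assumption.

```python
import wave


def is_mono_pcm16(path):
    """Return True if the WAV file at `path` is uncompressed mono 16-bit PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1          # mono
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16-bit
            and wav.getcomptype() == "NONE"  # uncompressed PCM
        )
```

Files that fail the check would need to be converted (downmixed or resampled) with an external audio tool before transcription.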
PaddleSpeech is a comprehensive toolkit of neural models for speech recognition, synthesis, and translation built on the PaddlePaddle deep learning framework. It provides a collection of frameworks and tools for converting spoken audio into written text, synthesizing natural audio from text, and performing direct speech translation. The toolkit includes specialized capabilities for keyword spotting to detect trigger words and speaker verification systems that extract unique voiceprints to identify and distinguish between individuals. It also features end-to-end translation tools that map audio features directly to a target language without intermediate transcription. The system covers a broad range of speech processing tasks, including automatic speech recognition with punctuation restoration, speaker diarization, and audio sound classification. Its synthesis pipeline manages the generation of mel spectrograms and raw audio waveforms, while a streaming inference engine enables real-time processing with low latency.
PaddleSpeech is a comprehensive speech processing toolkit that includes a dedicated streaming inference engine for real-time transcription, supporting speaker diarization, multi-language models, and self-hosted deployment.
Ten Framework is a multimodal large language model agent framework designed for building low-latency conversational agents. It integrates voice, text, and visual inputs in real time to facilitate human interaction. The project includes a real-time speech processing pipeline for streaming transcription, voice activity detection, and speaker diarization. It also features an avatar synchronization engine that coordinates character lip animations and visual outputs with synthesized speech. The framework covers edge AI deployment through containerized packaging and direct integration with embedded hardware boards. Additional capabilities include a telephony gateway for connecting agents to phone networks via the Session Initiation Protocol and tools for real-time visual generation of sketches and doodles.
This framework provides a comprehensive pipeline for real-time speech processing, including streaming transcription, speaker diarization, and voice activity detection, making it a robust engine for building conversational transcription tools.
Whisper.cpp is a high-performance, local-first speech recognition engine designed to run large-scale machine learning models on consumer hardware. It functions as a portable library that converts audio into text, supporting both static file transcription and real-time stream processing. By utilizing a lightweight inference engine and weight quantization, the project minimizes memory and compute overhead, allowing for efficient execution without reliance on external cloud APIs or internet connectivity. The project distinguishes itself through a hardware-agnostic compute abstraction that offloads intensive tensor operations to a wide array of accelerators, including specialized neural engines and graphics processors. It provides granular control over the transcription process, offering features such as word-level timestamps, speaker diarization, and voice activity detection. Developers can leverage these capabilities to build interactive voice-enabled applications, including chatbots with conversation session management and synchronized media generation. Beyond its core transcription engine, the project supports a broad range of deployment environments, including web browsers via WebAssembly, mobile devices, and containerized server infrastructure. It includes tools for benchmarking performance across different hardware configurations and provides native language bindings to simplify integration into existing software stacks.
This is a high-performance, self-hostable speech-to-text engine that supports real-time stream processing, speaker diarization, and low-latency inference on consumer hardware.
SenseVoice is a multilingual speech large language model designed for audio transcription, speaker diarization, and emotion recognition. It functions as an automatic speech recognition system that converts spoken audio into text across multiple languages. The system distinguishes itself by integrating acoustic event detection and speech emotion recognition, allowing it to identify non-speech sounds, such as laughter or applause, and discrete emotional states. It also includes a framework for speaker diarization to track and label different speakers within a single recording. The project's capabilities extend to speech synthesis, including expressive text-to-speech, zero-shot speaker identity cloning, and voice interpolation. It further provides tools for speech model fine-tuning to optimize performance for specific domains or rare languages.
This is a powerful speech-to-text engine that supports multilingual transcription and speaker diarization, though it functions primarily as a model framework rather than a ready-to-deploy streaming server with a built-in API.
WhisperKit provides a high-performance, self-hostable speech recognition engine optimized for Apple platforms that supports real-time streaming transcription and diarization.
WhisperX is an automated speech recognition toolkit designed to convert spoken audio into text while maintaining precise synchronization with the original media. It functions as an integrated pipeline that combines transcription, phoneme-based alignment, and speaker diarization to produce structured, attributed transcripts. The project distinguishes itself through its use of forced alignment, which matches existing text to audio signals at the phoneme level to generate accurate word-level timestamps. It also incorporates speaker diarization to identify and label unique voices within a recording, allowing for the creation of transcripts that attribute specific segments to individual speakers. The system supports multilingual transcription and automated caption generation by sequencing multiple machine learning models, including transformer-based recognition and voice activity detection. These processes are optimized through GPU-accelerated tensor computation to handle large audio files and complex neural network operations.
WhisperX is a powerful speech-to-text engine that provides speaker diarization and multilingual support, though it is primarily optimized for batch processing rather than low-latency live streaming.
This project is a speech recognition and translation engine that utilizes a sequence-to-sequence transformer architecture to convert audio into text. It is built upon a weakly supervised learning framework, which leverages large-scale, weakly labelled audio-transcript data to create generalized speech representations capable of performing simultaneous transcription, language identification, and translation. The system distinguishes itself through a unified multi-task modeling approach that shares token sequences across different objectives, allowing it to handle diverse languages and vocabularies without language-specific rules. By employing byte-level tokenization and sliding window audio segmentation, the engine maintains memory efficiency and temporal consistency when processing long-form audio or varied acoustic environments. The toolkit provides both command-line and programmatic interfaces, enabling developers to integrate speech-to-text capabilities directly into custom software applications or automate high-volume batch processing of media libraries. It includes utilities for accessing multilingual and English-only speech corpora to support model validation and domain-specific performance tuning.
While this is a powerful and highly accurate speech-to-text engine that supports multiple languages and can be integrated via API, it is primarily designed for batch processing rather than native low-latency live streaming, requiring additional implementation to handle real-time audio streams.
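The sliding window segmentation mentioned above, and the extra work needed to adapt a batch engine to longer or live audio, can be illustrated by computing overlapping sample ranges over a recording. This is a sketch under stated assumptions rather than the project's own segmentation logic: the 30-second window, 25-second hop, 16 kHz sample rate, and the function name are all illustrative.

```python
def sliding_windows(num_samples, sample_rate=16000, window_s=30.0, hop_s=25.0):
    """Return (start, end) sample ranges that cover the audio with overlap.

    window_s and hop_s are assumed values; a hop smaller than the window
    makes consecutive ranges overlap, which helps keep words that straddle
    a boundary intact. The last range is truncated to the audio length.
    """
    window = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the whole recording is covered
        start += hop
    return spans
```

Each range would be transcribed independently, with text from the overlapping region reconciled afterwards; how that reconciliation is done is outside the scope of this sketch.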
This project is a multimodal translation framework and large language model capable of speech-to-speech, speech-to-text, and text-to-text translation across nearly 100 languages. It provides a real-time speech translation engine and a comprehensive toolkit for converting spoken audio between languages. The system is distinguished by its ability to preserve the original speaker's tone, pace, and prosody during translation. It utilizes a specialized on-device inference toolkit that converts model checkpoints into C-based libraries, enabling low-latency execution on mobile and edge hardware without a Python runtime. The framework covers a wide range of capabilities including automatic speech recognition, expressive speech synthesis, and real-time translation streaming. It also includes audio content moderation for toxicity detection and tools for multimodal translation evaluation and distributed model fine-tuning. The project is implemented using Jupyter Notebooks.
This framework provides a robust engine for real-time speech-to-text and translation tasks with support for low-latency edge deployment, though it is primarily designed as a multimodal translation toolkit rather than a dedicated transcription-only service.
NeMo is a comprehensive framework designed for the development, training, and deployment of large-scale conversational and generative artificial intelligence models. It provides an integrated platform for building multimodal systems, encompassing speech processing, language modeling, and reinforcement learning alignment. The framework is built to handle the entire lifecycle of AI development, from data curation and model pretraining to production-ready service deployment. The platform distinguishes itself through advanced distributed training capabilities, including tensor and pipeline parallelism, which allow for the execution of models that exceed the memory capacity of individual hardware devices. It incorporates specialized architectures such as mixture-of-experts to optimize computational efficiency and includes a programmable guardrails system to enforce safety policies and topical boundaries on model outputs. Additionally, the framework supports retrieval-augmented generation to ground model responses in external knowledge bases, reducing hallucinations and improving factual accuracy. Beyond core training and inference, the framework offers extensive tools for audio signal processing, speech-to-text transcription, and text-to-speech synthesis.
NeMo is a comprehensive AI development framework that provides the underlying models and tools for building real-time speech-to-text systems, though it functions as a toolkit for creating such engines rather than a pre-packaged, ready-to-use transcription application.
VoiceInk is a system-wide speech-to-text dictation tool that converts spoken audio into text using local or cloud AI models. It functions as a local AI transcription engine and a context-aware voice assistant, allowing users to insert transcribed text directly into any active application on the operating system. The project distinguishes itself through the use of custom vocabulary management, which trains transcription engines to recognize industry-specific technical terms, professional terminology, and personal names. It further enhances output by using large language models to refine raw transcriptions into polished text, leveraging context injected from the system clipboard and active screen content. The software includes a hybrid-mode speech recognition system that can operate entirely offline for privacy or utilize remote servers for expanded language support. It features application-specific automation that switches transcription models and dictation profiles based on the active window, alongside configurable keyboard shortcuts for recording control. The application is written in Swift.
VoiceInk is a desktop dictation tool that provides real-time speech-to-text capabilities with local processing and hybrid model support, making it a suitable engine for live transcription tasks.
Omnilingual-ASR is a multilingual automatic speech recognition framework and toolkit designed to transcribe audio across 1,600 languages. It provides a complete pipeline for converting speech to text, including a toolkit for fine-tuning pre-trained speech models to specific languages or datasets using custom training recipes. The system supports zero-shot speech recognition, allowing the model to predict text in unseen languages without extensive training data. It further enables few-shot language guidance through in-context examples and uses language codes to constrain transcription output to the correct target language and script. The framework includes capabilities for high-throughput transcription via parallelized batch processing and a modular audio pipeline that normalizes and resamples diverse input formats. Resource management is handled through a system of asset cards and a command-line interface for retrieving metadata related to models, datasets, and tokenizers.
This is a comprehensive speech-to-text framework that provides the core engine for multilingual transcription, though it is primarily optimized for batch processing rather than low-latency live streaming.