Open-source tools for real-time speech-to-text processing and automated meeting captioning on your own infrastructure.
FunASR is an automatic speech recognition toolkit and multilingual speech-to-text engine designed to convert spoken audio into written text across more than fifty languages. It provides a framework for speaker diarization, an OpenAI-compatible transcription API for local server hosting, and speech models compatible with the ONNX format. The project distinguishes itself by supporting high-performance inference on edge hardware via self-contained binaries and portable model exports. It incorporates specialized capabilities for natural speech generation with adjustable timbre and emotional expression, as well as the ability to capture live microphone audio for direct voice-to-text input automation. The toolkit covers a broad range of audio analysis and processing capabilities, including voice activity detection, audio event and emotion detection, and punctuation restoration. It also includes tools for automated video captioning through the generation of timed subtitle files and distributed model fine-tuning to improve recognition accuracy using custom datasets.
FunASR is a self-hostable speech-to-text engine that provides real-time transcription, speaker diarization, and an OpenAI-compatible API, which covers the main requirements for meeting captioning on your own infrastructure.
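To illustrate what "timed subtitle files" means in practice, here is a minimal sketch that turns transcript segments into SubRip (SRT) text. It is not FunASR code: the function names and the `(start, end, text)` tuple layout are assumptions made for this example, and any engine in this list that returns segment timestamps could feed it.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples.

    The tuple layout is a hypothetical stand-in for whatever the
    transcription engine actually returns.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(segments_to_srt([(0.0, 1.25, "Hello"), (1.25, 2.0, "world")]))
```

Writing the returned string to a `.srt` file next to the video is usually enough for common players to pick it up.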
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additional capabilities include high-throughput audio processing via batch inference and TensorRT acceleration, as well as audio signal normalization and recording state control. The service supports live audio captioning through segment-based incremental rendering.
WhisperLive is a self-hosted, real-time transcription server that provides the requested speech-to-text capabilities, including diarization, local model inference, and WebSocket-based live captioning.
SenseVoice is a multilingual speech large language model designed for audio transcription, speaker diarization, and emotion recognition. It functions as an automatic speech recognition system that converts spoken audio into text across multiple languages. The system distinguishes itself by integrating acoustic event detection and speech emotion recognition, allowing it to identify non-speech sounds, such as laughter or applause, and discrete emotional states. It also includes a framework for speaker diarization to track and label different speakers within a single recording. The project's capabilities extend to speech synthesis, including expressive text-to-speech, zero-shot speaker identity cloning, and voice interpolation. It further provides tools for speech model fine-tuning to optimize performance for specific domains or rare languages.
SenseVoice is a speech-to-text model and inference framework that provides the core transcription, diarization, and multilingual capabilities required, though it functions as a model-based engine rather than a pre-packaged meeting application with a ready-to-use UI.
RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition server. It utilizes transformer-based recognition and omnilingual pipelines to convert live audio streams into text, providing a WebSocket-based streaming API for raw PCM audio transmission. The project is distinguished by a dual-backend transcription pipeline that uses a lightweight engine for immediate partial suggestions and a heavier model for final high-accuracy results. It includes a wake word detection system to trigger recording and employs a shared-resource inference model to distribute heavy speech models across multiple concurrent user sessions. Its broader capabilities cover audio processing tasks such as voice activity detection, speaker diarization, and speaker emotion detection. The system also supports real-time speech translation, automated system input routing to simulate keyboard typing, and an extensible engine factory for adding new transcription backends. The server includes dedicated health and performance monitoring endpoints to track active sessions, inference latency, and worker utilization.
This is a self-hosted, real-time speech-to-text engine that provides the required automatic speech recognition, diarization, and WebSocket-based API for live audio transcription.
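The dual-backend idea described above (a lightweight engine for immediate partial text, a heavier model for the final result) can be sketched roughly as follows. This is not RealtimeSTT's API: `dual_pass_transcribe` and the two callables are hypothetical placeholders, and a real pipeline would also segment utterances with voice activity detection rather than treating the whole stream as one utterance.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical signature: raw audio bytes in, transcript text out.
Transcriber = Callable[[bytes], str]


def dual_pass_transcribe(
    chunks: Iterable[bytes],
    fast: Transcriber,
    accurate: Transcriber,
) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) after each chunk, then one ("final", text).

    The fast engine re-transcribes the growing buffer for low-latency
    feedback; the accurate engine runs once over the complete audio.
    """
    buffered = bytearray()
    for chunk in chunks:
        buffered.extend(chunk)
        yield ("partial", fast(bytes(buffered)))
    if buffered:
        yield ("final", accurate(bytes(buffered)))


if __name__ == "__main__":
    # Stand-in engines that only report how much audio they received.
    quick = lambda audio: f"~{len(audio)} bytes"
    careful = lambda audio: f"={len(audio)} bytes"
    for kind, text in dual_pass_transcribe([b"ab", b"cd"], quick, careful):
        print(kind, text)
```

The trade-off is that partial text may be revised when the final pass arrives, so a captioning UI should be prepared to replace what it has already shown.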
WhisperX is an automatic speech recognition toolkit designed to convert spoken audio into text while maintaining precise synchronization with the original media. It functions as an integrated pipeline that combines transcription, phoneme-based alignment, and speaker diarization to produce structured, attributed transcripts. The project distinguishes itself through its use of forced alignment, which matches existing text to audio signals at the phoneme level to generate accurate word-level timestamps. It also incorporates speaker diarization to identify and label unique voices within a recording, allowing for the creation of transcripts that attribute specific segments to individual speakers. The system supports multilingual transcription and automated caption generation by sequencing multiple machine learning models, including transformer-based recognition and voice activity detection. These processes are optimized through GPU-accelerated tensor computation to handle large audio files and complex neural network operations.
WhisperX is a powerful speech-to-text toolkit that provides accurate transcription, diarization, and multilingual support, though it functions primarily as an inference pipeline library rather than a ready-to-deploy, real-time meeting application.
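One common way to combine word-level timestamps with diarization output is to give each word the speaker whose turn overlaps it the most. The sketch below shows that idea only; it is an assumption for illustration, not WhisperX's actual implementation, and the tuple layouts and function names are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it most.

    words: list of (text, start, end) from an aligner (hypothetical layout)
    turns: list of (speaker, start, end) from a diarizer (hypothetical layout)
    Words that overlap no turn are labeled None.
    """
    labeled = []
    for text, start, end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((text, best_speaker))
    return labeled


if __name__ == "__main__":
    words = [("hello", 0.0, 0.5), ("there", 0.9, 1.3)]
    turns = [("A", 0.0, 1.0), ("B", 1.0, 2.0)]
    print(assign_speakers(words, turns))
```

Words that straddle a speaker change go to whichever turn covers more of them, which is why accurate word boundaries matter for attributed transcripts.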
Voicebox is a local speech processing system that provides text-to-speech generation, speech-to-text transcription, and voice cloning. It utilizes local machine learning inference and GPU acceleration to process audio and text data without relying on external API calls. The project features a voice cloning toolkit for creating synthetic profiles from audio samples and a timeline-based voice editor for composing multi-character conversations. It also includes an AI voice management API that allows external applications and AI agents to programmatically manage voice profiles and generate speech. Capabilities cover audio processing pipelines for effects like pitch shifts and reverb, as well as real-time and file-based transcription with filler word removal. The system supports persona-based dialogue generation, batch synthesis with prompt caching, and global text dictation for inserting transcripts directly into the operating system clipboard. The processing engine can be hosted on local hardware or remote GPU servers.
Voicebox is a local speech processing system that supports real-time transcription and local model inference, making it a capable tool for your self-hosted transcription needs despite its additional focus on voice synthesis and cloning.
Whisper is a high-performance speech-to-text inference engine that uses graphics hardware shaders to accelerate the transcription of spoken audio into written text. It implements a GPU-accelerated automatic speech recognition framework specifically designed to run Whisper models. The system focuses on high-speed processing for both recorded audio files and live microphone streams. It utilizes voice activity detection to analyze raw audio in real time, triggering the inference engine only when human speech is detected. The engine covers a broad range of capabilities including real-time audio capture, GPGPU inference optimization, and compute performance profiling to measure the execution time of individual shaders.
This is a high-performance inference engine designed for real-time speech-to-text transcription, providing the core local model processing required for a self-hosted transcription tool.
Buzz is a desktop application that provides a local speech-to-text engine for transcribing and translating audio and video files. By leveraging local machine inference, the software ensures data privacy and offline performance, removing the need for cloud connectivity during media processing. The application distinguishes itself through a modular plugin architecture that allows for the integration of custom functionality, such as content summarization and automated text formatting, without modifying the core codebase. It also features a speaker diarization pipeline that identifies and labels individual voices within recordings to improve the readability and organization of generated transcripts. The system supports automated media processing by monitoring specific directories for new files, enabling users to trigger transcription or translation workflows as soon as assets are detected. Users can export results into various standard formats, including plain text and subtitle files, while utilizing hardware acceleration to increase processing speeds for large media files.
Buzz is a desktop application that provides local speech-to-text transcription and diarization, though it is designed for processing static media files rather than real-time meeting streams.
Moonshine is a complete on-device voice interface toolkit that provides speech recognition, text-to-speech synthesis, phonetic processing, speaker diarization, and intent recognition, all running locally on edge hardware without any cloud dependency. It executes quantized neural networks for speech and language tasks directly on the device, enabling fully offline conversational AI capabilities. The toolkit distinguishes itself by orchestrating multi-turn spoken exchanges through a conversational flow manager that maintains context across interactions and manages branching dialog flows. It includes model weight quantization for reducing model size and improving inference speed on edge devices, multicore compute distribution for optimizing performance across CPU cores, and a streaming audio pipeline that processes audio in chunks with real-time transcription events. Speaker diarization distinguishes individual voices in multi-speaker audio streams, while semantic intent matching identifies user commands through embedding similarity. Moonshine provides a conversational agent builder for defining multi-step dialog flows that understand user intent and respond with synthesized speech. It supports real-time live speech transcription from microphone or file input, concurrent audio stream processing, and grapheme-to-phoneme conversion for text-to-speech synthesis across multiple languages. The toolkit includes model asset downloading and caching, audio input quality debugging, internal API call logging, and transcription latency benchmarking for evaluating real-time performance.
Moonshine is an on-device toolkit designed for building voice-driven interfaces and conversational agents that includes real-time transcription, diarization, and local inference capabilities, making it a powerful engine for developing your own self-hosted transcription application.
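Semantic intent matching through embedding similarity, as mentioned above, generally means comparing a vector for the spoken utterance against one vector per known intent and picking the closest. A minimal sketch, assuming cosine similarity and pre-computed embeddings; the function names, the threshold, and the toy two-dimensional vectors are illustrative and not part of Moonshine.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_intent(utterance_vec, intents, min_score=0.7):
    """Return the best-matching intent name, or None if nothing scores
    at least min_score.

    intents: dict mapping intent name -> embedding vector. In a real
    system both vectors would come from the same embedding model.
    """
    best_name, best_score = None, min_score
    for name, vec in intents.items():
        score = cosine(utterance_vec, vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    intents = {"lights_on": [1.0, 0.0], "play_music": [0.0, 1.0]}
    print(match_intent([0.9, 0.1], intents))
```

The threshold keeps unrelated speech from triggering a command; its value would need tuning against real utterances.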
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a WebSocket server for hosting remote streaming and batch transcription services. The toolkit covers a broad surface of audio capabilities, including multilingual speech recognition and translation, sound event classification, wake word detection, and voice activity detection. It also includes text processing utilities for automatic punctuation and subtitle generation, as well as audio signal processing for noise removal and source separation. Native interfaces are available for Java, Kotlin, Swift, and Object Pascal, with support for WebAssembly to enable browser-based recognition.
Sherpa-ONNX is a powerful speech processing toolkit that provides the core local inference engine and WebSocket server required to build a real-time, self-hosted transcription and captioning application.
DeepSpeech is an open-source speech-to-text framework and machine learning engine designed to convert spoken audio into written text locally on a device. It provides on-device speech recognition that operates without requiring an internet connection to external servers. The system supports real-time speech transcription across a variety of hardware platforms, ranging from single-board computers and edge devices to GPU servers. This allows for audio analysis and processing directly on the local hardware.
This is a speech-to-text engine and machine learning framework used to build transcription tools, rather than a ready-to-use self-hosted application for meeting captioning.
Whisper.cpp is a high-performance, local-first speech recognition engine designed to run large-scale machine learning models on consumer hardware. It functions as a portable library that converts audio into text, supporting both static file transcription and real-time stream processing. By utilizing a lightweight inference engine and weight quantization, the project minimizes memory and compute overhead, allowing for efficient execution without reliance on external cloud APIs or internet connectivity. The project distinguishes itself through a hardware-agnostic compute abstraction that offloads intensive tensor operations to a wide array of accelerators, including specialized neural engines and graphics processors. It provides granular control over the transcription process, offering features such as word-level timestamps, speaker diarization, and voice activity detection. Developers can leverage these capabilities to build interactive voice-enabled applications, including chatbots with conversation session management and synchronized media generation. Beyond its core transcription engine, the project supports a broad range of deployment environments, including web browsers via WebAssembly, mobile devices, and containerized server infrastructure. It includes tools for benchmarking performance across different hardware configurations and provides native language bindings to simplify integration into existing software stacks.
This is a high-performance inference engine and library for speech recognition rather than a ready-to-use self-hosted application, meaning you would need to build or integrate it into your own software to achieve meeting transcription functionality.
whisper.cpp is a C++ implementation of the Whisper speech-to-text model, serving as a lightweight machine learning inference engine and quantized runtime. It provides high-performance automatic speech recognition and real-time audio transcription without requiring a Python environment. The project utilizes model quantization to reduce memory usage and increase inference speed on local hardware. It incorporates hardware acceleration to optimize processing speed across different processors. The system covers audio processing capabilities including voice activity detection, speaker diarization, and word-level timestamping. It also includes tools for generating synchronized karaoke videos based on transcribed audio timing.
This is a high-performance C++ inference engine for the Whisper model rather than a complete, self-hosted meeting transcription application with a user interface or meeting integration features.
PocketSphinx is an offline speech recognition engine that converts raw audio from files or live microphone streams into written text without requiring a network connection. It functions as a speech-to-text library, a real-time transcription engine, and a voice command processor, capable of detecting and transcribing spoken commands from continuous audio streams with configurable acoustic and language models. The engine uses weighted finite-state transducers to represent acoustic, phonetic, and language models as a single search graph for efficient decoding. It employs fixed-point acoustic models with 8-bit or 16-bit parameters to reduce memory usage on embedded devices, and frame-synchronous beam search to prune the search space at each audio frame for real-time performance. The system generates a lattice of alternative word sequences during decoding, from which multiple ranked transcriptions can be extracted, and records word-level start and end timestamps by tracing back through the Viterbi path. PocketSphinx processes audio in fixed-size chunks through a ring buffer, feeding frames incrementally to the decoder without requiring the full audio in memory. It detects speech boundaries by analyzing energy levels and silence gaps, then processes each utterance independently for transcription. The library supports transcribing single-channel 16-bit PCM audio from files or standard input, outputting recognized text as line-delimited JSON, and can match a known transcript against an audio file to produce word-level or phone-level timestamps.
This is a low-level speech recognition engine and library designed for embedding into other applications, rather than a self-hosted, ready-to-use transcription application for meetings.
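The energy-and-silence-gap approach to finding speech boundaries described above can be sketched in a few lines. This is a simplified illustration, not PocketSphinx's endpointer: the frame size, threshold, and silence limit are arbitrary assumptions, and the input is taken to be mono 16-bit PCM in the machine's native byte order.

```python
import array
import math


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit PCM samples."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_utterances(pcm, frame_bytes=320, threshold=500.0, max_silence_frames=3):
    """Return (start_frame, end_frame) pairs, end exclusive.

    A frame counts as speech when its RMS energy reaches the threshold.
    An utterance ends once max_silence_frames quiet frames follow it.
    frame_bytes must be even (two bytes per sample).
    """
    utterances = []
    start = None
    silence = 0
    n_frames = len(pcm) // frame_bytes
    for i in range(n_frames):
        frame = pcm[i * frame_bytes:(i + 1) * frame_bytes]
        if rms(frame) >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= max_silence_frames:
                # Exclude the trailing quiet frames from the utterance.
                utterances.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        utterances.append((start, n_frames - silence))
    return utterances


if __name__ == "__main__":
    quiet = array.array("h", [0] * 160).tobytes()
    loud = array.array("h", [2000] * 160).tobytes()
    print(find_utterances(quiet * 2 + loud * 3 + quiet * 4 + loud * 2))
```

Each returned frame range can then be handed to the decoder as an independent utterance, which matches the per-utterance processing the description mentions.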
Omnilingual-ASR is a multilingual automatic speech recognition framework and toolkit designed to transcribe audio across 1,600 languages. It provides a complete pipeline for converting speech to text, including a toolkit for fine-tuning pre-trained speech models to specific languages or datasets using custom training recipes. The system supports zero-shot speech recognition, allowing the model to predict text in unseen languages without extensive training data. It further enables few-shot language guidance through in-context examples and uses language codes to constrain transcription output to the correct target language and script. The framework includes capabilities for high-throughput transcription via parallelized batch processing and a modular audio pipeline that normalizes and resamples diverse input formats. Resource management is handled through a system of asset cards and a command-line interface for retrieving metadata related to models, datasets, and tokenizers.
This is a research-oriented framework and toolkit for building and fine-tuning speech recognition models rather than a ready-to-use, self-hosted application for real-time meeting transcription and captioning.