High-performance speech-to-text libraries and frameworks for transcribing audio into accurate, machine-readable text.
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additional capabilities include high-throughput audio processing via batch inference and TensorRT acceleration, as well as audio signal normalization and recording state control. The service supports live audio captioning through segment-based incremental rendering.
WhisperLive is a self-hostable, real-time speech-to-text server that provides the requested features, including speaker diarization, word-level timestamps, and GPU-accelerated offline processing.
whisper.cpp is a C++ implementation of the Whisper speech-to-text model, serving as a lightweight machine learning inference engine and quantized runtime. It provides high-performance automatic speech recognition and real-time audio transcription without requiring a Python environment. The project utilizes model quantization to reduce memory usage and increase inference speed on local hardware. It incorporates hardware acceleration to optimize processing speed across different processors. The system covers audio processing capabilities including voice activity detection, speaker diarization, and word-level timestamping. It also includes tools for generating synchronized karaoke videos based on transcribed audio timing.
This is a high-performance, self-hostable automatic speech recognition engine that natively supports multi-language transcription, real-time processing, speaker diarization, and word-level timestamps.
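The model quantization mentioned for whisper.cpp can be illustrated with a simplified symmetric 8-bit scheme. This is a minimal sketch, not whisper.cpp's actual storage format: `quantize_block` and `dequantize_block` are hypothetical names, and the single-block layout is an assumption made for brevity.

```python
def quantize_block(weights):
    """Symmetric 8-bit quantization of one block of float weights.

    Returns (scale, ints) where every int lies in [-127, 127] and
    weight ~= scale * int.
    """
    amax = max((abs(w) for w in weights), default=0.0)
    scale = amax / 127.0
    if scale == 0.0:
        # All-zero (or empty) block: nothing to scale.
        return 0.0, [0] * len(weights)
    return scale, [round(w / scale) for w in weights]


def dequantize_block(scale, quantized):
    """Recover approximate float weights from a quantized block."""
    return [scale * q for q in quantized]
```

Each weight is stored as one small integer plus a shared scale, which is where the memory saving comes from. The rounding error per weight is at most half the scale.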
Moonshine is a complete on-device voice interface toolkit that provides speech recognition, text-to-speech synthesis, phonetic processing, speaker diarization, and intent recognition, all running locally on edge hardware without any cloud dependency. It executes quantized neural networks for speech and language tasks directly on the device, enabling fully offline conversational AI capabilities. The toolkit distinguishes itself by orchestrating multi-turn spoken exchanges through a conversational flow manager that maintains context across interactions and manages branching dialog flows. It includes model weight quantization for reducing model size and improving inference speed on edge devices, multicore compute distribution for optimizing performance across CPU cores, and a streaming audio pipeline that processes audio in chunks with real-time transcription events. Speaker diarization distinguishes individual voices in multi-speaker audio streams, while semantic intent matching identifies user commands through embedding similarity. Moonshine provides a conversational agent builder for defining multi-step dialog flows that understand user intent and respond with synthesized speech. It supports real-time live speech transcription from microphone or file input, concurrent audio stream processing, and grapheme-to-phoneme conversion for text-to-speech synthesis across multiple languages. The toolkit includes model asset downloading and caching, audio input quality debugging, internal API call logging, and transcription latency benchmarking for evaluating real-time performance.
Moonshine is a comprehensive on-device speech recognition toolkit that supports offline processing, real-time transcription, and speaker diarization, making it a robust solution for local audio-to-text tasks.
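Semantic intent matching through embedding similarity, as described for Moonshine, can be sketched with cosine similarity. This is not Moonshine's API: the three-dimensional vectors stand in for real sentence embeddings, and `match_intent`, the intent names, and the 0.7 threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_intent(utterance_vec, intents, threshold=0.7):
    """Return the most similar intent name, or None if none reaches the threshold."""
    best_name, best_score = None, threshold
    for name, vec in intents.items():
        score = cosine(utterance_vec, vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy embeddings (assumed); a real system would obtain these from an embedding model.
INTENTS = {
    "lights_on": [0.9, 0.1, 0.0],
    "play_music": [0.0, 0.2, 0.95],
}
```

Returning `None` below the threshold lets the caller fall back to a clarification prompt instead of acting on a weak match.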
FunASR is an automatic speech recognition toolkit and multilingual speech-to-text engine designed to convert spoken audio into written text across more than fifty languages. It provides a framework for speaker diarization, an OpenAI-compatible transcription API for local server hosting, and speech models compatible with the ONNX format. The project distinguishes itself by supporting high-performance inference on edge hardware via self-contained binaries and portable model exports. It incorporates specialized capabilities for natural speech generation with adjustable timbre and emotional expression, as well as the ability to capture live microphone audio for direct voice-to-text input automation. The toolkit covers a broad range of audio analysis and processing capabilities, including voice activity detection, audio event and emotion detection, and punctuation restoration. It also includes tools for automated video captioning through the generation of timed subtitle files and distributed model fine-tuning to improve recognition accuracy using custom datasets.
FunASR is a comprehensive speech-to-text engine that supports offline processing, real-time transcription, speaker diarization, and word-level timestamps, making it a complete solution for self-hosted automatic speech recognition.
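Timed subtitle generation of the kind mentioned for FunASR comes down to formatting timestamped segments. The sketch below writes the common SubRip (SRT) layout from `(start, end, text)` tuples. It does not use FunASR's own output structures, and the function names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start_s, end_s, text) tuples as numbered SRT cues separated by blank lines."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)
```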
PaddleSpeech is a comprehensive toolkit of neural models for speech recognition, synthesis, and translation built on the PaddlePaddle deep learning framework. It provides a collection of frameworks and tools for converting spoken audio into written text, synthesizing natural audio from text, and performing direct speech translation. The toolkit includes specialized capabilities for keyword spotting to detect trigger words and speaker verification systems that extract unique voiceprints to identify and distinguish between individuals. It also features end-to-end translation tools that map audio features directly to a target language without intermediate transcription. The system covers a broad range of speech processing tasks, including automatic speech recognition with punctuation restoration, speaker diarization, and audio sound classification. Its synthesis pipeline manages the generation of mel spectrograms and raw audio waveforms, while a streaming inference engine enables real-time processing with low latency.
PaddleSpeech is a comprehensive speech processing toolkit that provides robust automatic speech recognition, including support for real-time streaming, speaker diarization, and multi-language capabilities, making it a complete solution for self-hosted transcription needs.
mlx-audio is an audio processing toolkit built on Apple MLX that provides speech transcription, text-to-speech synthesis, voice cloning, and audio source separation using local models. It offers an OpenAI-compatible REST API and web interface for running audio generation and transcription tasks, enabling drop-in integration with existing tools that follow that endpoint structure. The toolkit supports text-prompted audio source separation, allowing specific sounds to be isolated from mixed recordings based on natural language descriptions. It also provides voice cloning from a short reference audio sample, speech enhancement through noise reduction, and voice activity detection with speaker diarization to distinguish between different speakers in recordings. Additional capabilities include speech-to-text transcription with word-level timestamp alignment, streaming audio generation that outputs results incrementally, and model weight quantization to reduce memory footprint and accelerate inference. The system manages multiple models through a unified interface and supports WebSocket audio transport for low-latency communication.
This toolkit provides a comprehensive, self-hostable speech-to-text engine that supports real-time transcription, speaker diarization, and word-level timestamps while running locally on Apple silicon.
WhisperX is an automated speech recognition toolkit designed to convert spoken audio into text while maintaining precise synchronization with the original media. It functions as an integrated pipeline that combines transcription, phoneme-based alignment, and speaker diarization to produce structured, attributed transcripts. The project distinguishes itself through its use of forced alignment, which matches existing text to audio signals at the phoneme level to generate accurate word-level timestamps. It also incorporates speaker diarization to identify and label unique voices within a recording, allowing for the creation of transcripts that attribute specific segments to individual speakers. The system supports multilingual transcription and automated caption generation by sequencing multiple machine learning models, including transformer-based recognition and voice activity detection. These processes are optimized through GPU-accelerated tensor computation to handle large audio files and complex neural network operations.
WhisperX is a comprehensive automatic speech recognition engine that provides accurate word-level timestamps, speaker diarization, and multilingual support in a self-hostable, offline-capable pipeline.
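Attributing transcript segments to speakers, as described for WhisperX, requires combining word-level timestamps with diarization output. One simple way to do this, shown here as an assumption rather than WhisperX's own code, is to give each word the speaker whose turn overlaps it the most. The dictionary keys and the `assign_speakers` name are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: dicts with "word", "start", "end".
    turns: dicts with "speaker", "start", "end".
    Words that overlap no turn get speaker None.
    """
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            ov = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        labeled.append({**word, "speaker": best_speaker})
    return labeled
```

A word that straddles a turn boundary goes to whichever speaker covers more of it, which keeps labels stable when diarization boundaries are slightly off.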
Vibe is a cross-platform transcription tool that converts spoken audio into text by running Whisper neural models directly on your device, with no cloud dependency. It can transcribe audio from files, microphones, system output, and network streams, and supports both batch processing of multiple files and real-time captioning from continuous input. Beyond basic transcription, Vibe identifies and labels different speakers through speaker diarization, and offers a choice of command-line interface or HTTP API for automated and remote workflows. It also includes plugins to export transcripts to common subtitle and document formats, and can summarize or translate transcripts using local or cloud AI models. The tool combines local AI inference with flexible audio capture and output, making it suitable for a wide range of offline speech-to-text tasks. Documentation and installation instructions are available from the project repository.
Vibe is a self-hostable transcription engine that leverages local Whisper models to provide real-time and batch speech-to-text, complete with speaker diarization and multi-language support.
WeNet is an end-to-end automatic speech recognition (ASR) toolkit designed for both Chinese and English, built around transformer-based models. It supports streaming and non-streaming inference out of the box, and is structured to be production-ready, with model export and deployment paths for servers and mobile devices. The toolkit distinguishes itself through a chunk-based streaming transformer architecture that processes audio in fixed-size segments for low latency while preserving context across chunks. It jointly trains models with both CTC and attention loss to combine alignment accuracy with contextual modeling. Decoding employs a two-pass strategy: an initial CTC decoder generates n-best hypotheses, which are then rescored with a full attention decoder. Weighted finite-state transducer (WFST) decoding integrates an external language model for higher accuracy, and the entire model can be exported to TorchScript for C++ inference without Python dependencies. Beyond the core recognition engine, WeNet provides a complete pipeline for data preparation, including distributed partitioning, feature normalization, and token dictionary construction. Model training supports multi-GPU setups, checkpoint resumption, and TensorBoard monitoring. Decoding capabilities extend to audio-transcript alignment, word-level timestamp extraction, and N-best generation both with and without a language model. Custom phrase biasing allows injecting prior knowledge to bias recognition toward specific words. Pretrained model snapshots are available for reproducing published results or immediate use.
WeNet is a production-ready automatic speech recognition toolkit that supports streaming, offline processing, and word-level timestamps, making it a comprehensive engine for your transcription needs.
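The two-pass decoding described for WeNet can be sketched as n-best rescoring: a first pass proposes hypotheses with CTC scores, and a second pass re-ranks them with an attention-decoder score. The code below is an illustration under stated assumptions, not WeNet's implementation. `rescore`, the toy score table, and the choice to combine scores as attention score plus a weighted CTC score are all assumptions for the example.

```python
def rescore(nbest, attention_score, ctc_weight=0.5):
    """Pick the hypothesis with the best combined score.

    nbest: list of (tokens, ctc_log_prob) pairs from the first pass.
    attention_score: callable mapping tokens to a log-probability (second pass).
    """
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")
    best = max(nbest, key=lambda hyp: attention_score(hyp[0]) + ctc_weight * hyp[1])
    return best[0]


# Toy data (assumed): CTC slightly prefers "a cap", the second pass prefers "a cat".
NBEST = [(("a", "cap"), -0.8), (("a", "cat"), -1.0)]
ATTENTION_SCORES = {("a", "cap"): -2.0, ("a", "cat"): -0.5}
```

Calling `rescore(NBEST, ATTENTION_SCORES.get)` selects `("a", "cat")`, showing how the second pass can overturn the first-pass ranking.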
RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition server. It utilizes transformer-based recognition and omnilingual pipelines to convert live audio streams into text, providing a WebSocket-based streaming API for raw PCM audio transmission. The project is distinguished by a dual-backend transcription pipeline that uses a lightweight engine for immediate partial suggestions and a heavier model for final high-accuracy results. It includes a wake word detection system to trigger recording and employs a shared-resource inference model to distribute heavy speech models across multiple concurrent user sessions. Its broader capabilities cover audio processing tasks such as voice activity detection, speaker diarization, and speaker emotion detection. The system also supports real-time speech translation, automated system input routing to simulate keyboard typing, and an extensible engine factory for adding new transcription backends. The server includes dedicated health and performance monitoring endpoints to track active sessions, inference latency, and worker utilization.
This is a comprehensive, self-hostable ASR engine that supports real-time streaming, speaker diarization, and multi-language processing, making it a direct fit for your requirements.
whisper.cpp is a high-performance, local-first speech recognition engine designed to run large-scale machine learning models on consumer hardware. It functions as a portable library that converts audio into text, supporting both static file transcription and real-time stream processing. By utilizing a lightweight inference engine and weight quantization, the project minimizes memory and compute overhead, allowing for efficient execution without reliance on external cloud APIs or internet connectivity. The project distinguishes itself through a hardware-agnostic compute abstraction that offloads intensive tensor operations to a wide array of accelerators, including specialized neural engines and graphics processors. It provides granular control over the transcription process, offering features such as word-level timestamps, speaker diarization, and voice activity detection. Developers can leverage these capabilities to build interactive voice-enabled applications, including chatbots with conversation session management and synchronized media generation. Beyond its core transcription engine, the project supports a broad range of deployment environments, including web browsers via WebAssembly, mobile devices, and containerized server infrastructure. It includes tools for benchmarking performance across different hardware configurations and provides native language bindings to simplify integration into existing software stacks.
This is a high-performance, local-first speech recognition engine that natively supports all your requirements, including offline processing, real-time transcription, speaker diarization, and word-level timestamps.
PocketSphinx is an offline speech recognition engine that converts raw audio from files or live microphone streams into written text without requiring a network connection. It functions as a speech-to-text library, a real-time transcription engine, and a voice command processor, capable of detecting and transcribing spoken commands from continuous audio streams with configurable acoustic and language models. The engine uses weighted finite-state transducers to represent acoustic, phonetic, and language models as a single search graph for efficient decoding. It employs fixed-point acoustic models with 8-bit or 16-bit parameters to reduce memory usage on embedded devices, and frame-synchronous beam search to prune the search space at each audio frame for real-time performance. The system generates a lattice of alternative word sequences during decoding, from which multiple ranked transcriptions can be extracted, and records word-level start and end timestamps by tracing back through the Viterbi path. PocketSphinx processes audio in fixed-size chunks through a ring buffer, feeding frames incrementally to the decoder without requiring the full audio in memory. It detects speech boundaries by analyzing energy levels and silence gaps, then processes each utterance independently for transcription. The library supports transcribing single-channel 16-bit PCM audio from files or standard input, outputting recognized text as line-delimited JSON, and can match a known transcript against an audio file to produce word-level or phone-level timestamps.
PocketSphinx is a lightweight, offline speech recognition engine that provides real-time transcription and word-level timestamps, making it a suitable choice for local audio-to-text processing.
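Detecting speech boundaries from energy levels and silence gaps, as described for PocketSphinx, can be sketched over fixed-size frames of 16-bit PCM samples. This is a simplified illustration rather than PocketSphinx's endpointer: `find_utterances`, the RMS threshold, the frame length, and the silence tolerance are all assumed values.

```python
import math


def rms(frame):
    """Root-mean-square energy of a frame of integer samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def find_utterances(samples, frame_len=160, threshold=500.0, max_silence_frames=3):
    """Return (start_sample, end_sample) pairs for runs of loud frames.

    An utterance ends once more than max_silence_frames consecutive frames fall
    below the threshold. Trailing samples that do not fill a frame are ignored.
    """
    utterances = []
    start, silence = None, 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if rms(frame) >= threshold:
            if start is None:
                start = i * frame_len
            silence = 0
        elif start is not None:
            silence += 1
            if silence > max_silence_frames:
                # Close the utterance at the end of the last loud frame.
                utterances.append((start, (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:
        # Audio ended mid-utterance: trim any trailing quiet frames.
        utterances.append((start, (n_frames - silence) * frame_len))
    return utterances
```

Each returned span can then be passed to a decoder on its own, which matches the per-utterance processing the description mentions.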
Buzz is a desktop application that provides a local speech-to-text engine for transcribing and translating audio and video files. By leveraging local machine inference, the software ensures data privacy and offline performance, removing the need for cloud connectivity during media processing. The application distinguishes itself through a modular plugin architecture that allows for the integration of custom functionality, such as content summarization and automated text formatting, without modifying the core codebase. It also features a speaker diarization pipeline that identifies and labels individual voices within recordings to improve the readability and organization of generated transcripts. The system supports automated media processing by monitoring specific directories for new files, enabling users to trigger transcription or translation workflows as soon as assets are detected. Users can export results into various standard formats, including plain text and subtitle files, while utilizing hardware acceleration to increase processing speeds for large media files.
Buzz is a desktop application that provides a local speech-to-text engine using the Whisper model, offering offline processing, speaker diarization, and multi-language support for audio and video files.
Whisper is a high-performance speech-to-text inference engine that uses graphics hardware shaders to accelerate the transcription of spoken audio into written text. It implements a GPU-accelerated automatic speech recognition framework specifically designed to run Whisper models. The system focuses on high-speed processing for both recorded audio files and live microphone streams. It utilizes voice activity detection to analyze raw audio in real time, triggering the inference engine only when human speech is detected. The engine covers a broad range of capabilities including real-time audio capture, GPGPU inference optimization, and compute performance profiling to measure the execution time of individual shaders.
This is a high-performance implementation of the Whisper speech-to-text model that supports offline processing, real-time transcription, and multi-language capabilities, making it a comprehensive engine for local automatic speech recognition.
This project is a speech recognition and translation engine that utilizes a sequence-to-sequence transformer architecture to convert audio into text. It is built upon a weakly supervised learning framework, which leverages large-scale, weakly labeled audio-transcript data to create generalized speech representations capable of performing simultaneous transcription, language identification, and translation. The system distinguishes itself through a unified multi-task modeling approach that shares token sequences across different objectives, allowing it to handle diverse languages and vocabularies without language-specific rules. By employing byte-level tokenization and sliding window audio segmentation, the engine maintains memory efficiency and temporal consistency when processing long-form audio or varied acoustic environments. The toolkit provides both command-line and programmatic interfaces, enabling developers to integrate speech-to-text capabilities directly into custom software applications or automate high-volume batch processing of media libraries. It includes utilities for accessing multilingual and English-only speech corpora to support model validation and domain-specific performance tuning.
Whisper is a robust, self-hostable automatic speech recognition engine that provides high-accuracy multilingual transcription, translation, and language identification, making it a comprehensive solution for local audio-to-text processing.
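Sliding-window segmentation of long-form audio can be sketched as a generator of sample ranges. The version below uses a fixed window and hop, which is a simplification: a real decoder may advance the window by a data-dependent amount. The `windows` name and the 16 kHz and 30 second defaults are assumptions for illustration.

```python
def windows(n_samples, sample_rate=16000, window_s=30.0, hop_s=30.0):
    """Yield (start, end) sample indices covering n_samples in fixed-length windows.

    Setting hop_s below window_s makes consecutive windows overlap.
    The final window is truncated at the end of the audio.
    """
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    start = 0
    while start < n_samples:
        yield start, min(start + win, n_samples)
        start += hop
```

Processing one window at a time keeps memory use bounded regardless of how long the recording is.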
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerators, including GPU and various NPU architectures, and provides a WebSocket server for hosting remote streaming and batch transcription services. The toolkit covers a broad surface of audio capabilities, including multilingual speech recognition and translation, sound event classification, wake word detection, and voice activity detection. It also includes text processing utilities for automatic punctuation and subtitle generation, as well as audio signal processing for noise removal and source separation. Native interfaces are available for Java, Kotlin, Swift, and Object Pascal, with support for WebAssembly to enable browser-based recognition.
Sherpa-ONNX is a comprehensive, cross-platform speech recognition engine that supports offline processing, multi-language models, real-time streaming, and advanced features like speaker diarization and word-level timestamps.
Voicebox is a local speech processing system that provides text-to-speech generation, speech-to-text transcription, and voice cloning. It utilizes local machine learning inference and GPU acceleration to process audio and text data without relying on external API calls. The project features a voice cloning toolkit for creating synthetic profiles from audio samples and a timeline-based voice editor for composing multi-character conversations. It also includes an AI voice management API that allows external applications and AI agents to programmatically manage voice profiles and generate speech. Capabilities cover audio processing pipelines for effects like pitch shifts and reverb, as well as real-time and file-based transcription with filler word removal. The system supports persona-based dialogue generation, batch synthesis with prompt caching, and global text dictation for inserting transcripts directly into the operating system clipboard. The processing engine can be hosted on local hardware or remote GPU servers.
Voicebox is a local speech processing system that provides offline transcription and real-time speech-to-text capabilities, making it a suitable engine for your self-hosted audio-to-text needs.
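Filler word removal of the kind mentioned for Voicebox can be approximated with a word-boundary regular expression applied to the finished transcript. This is a minimal sketch, not Voicebox's implementation: the `FILLERS` list and the `remove_fillers` name are hypothetical.

```python
import re

# Hypothetical filler list; a real tool would likely make this configurable.
FILLERS = ("um", "uh", "erm", "you know")

# Match a whole filler word or phrase, plus an optional trailing comma and whitespace.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*",
    re.IGNORECASE,
)


def remove_fillers(text):
    """Strip filler words from a transcript and collapse leftover whitespace."""
    cleaned = _FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

The word boundaries keep words such as "umbrella" intact, since only standalone fillers match.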
WhisperKit is a Swift-based framework designed for high-performance, offline speech recognition on Apple devices, providing the core engine needed for local transcription tasks.
Handy is a local speech-to-text automation tool designed to convert spoken audio into text and inject it directly into active desktop applications. By running machine learning models entirely on the host hardware, it provides a private, offline-first environment for dictation and command execution. The system functions as a background service that manages microphone input, transcription state, and text output, enabling hands-free typing across various software environments. The project distinguishes itself through a modular pipeline that integrates local language models for post-transcription refinement. Users can configure custom prompts to automatically format, translate, or correct raw speech output before it is inserted into the target application. This workflow is further enhanced by event-driven automation hooks, which allow the system to trigger custom scripts, keyboard shortcuts, or command sequences in response to transcription events. Beyond core dictation, the software offers extensive control over the transcription environment, including hardware-aware audio management and real-time translation capabilities. It supports fine-grained adjustments to transcription accuracy, such as vocabulary correction for technical terminology and configurable input latency. The system also maintains a history of past sessions and provides tools for managing clipboard states and system memory usage.
This tool provides local, offline speech-to-text transcription and is designed for real-time dictation, making it a functional ASR engine for desktop integration even though it focuses on automation rather than batch file processing.
Kaldi is an automatic speech recognition toolkit used to train and deploy models that convert spoken audio into text. It functions as a framework for designing and evaluating acoustic and language models through a structured pipeline of processing tools. The system acts as a cross-platform speech engine, capable of compiling recognition logic for Android and WebAssembly to enable execution on mobile devices and web browsers. It also includes a dedicated converter for migrating speech recognition models from the HTK format into a compatible internal structure. The toolkit covers a broad range of capabilities, including automatic speech recognition training, GPU accelerated speech processing, and the deployment of speech recognition environments across different hardware architectures.
Kaldi is a comprehensive, industry-standard toolkit for building and deploying automatic speech recognition systems that supports offline processing, multi-language models, and the complex pipelines required for advanced features like diarization and timestamping.