30 open-source projects similar to cmusphinx/pocketsphinx, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best PocketSphinx alternative.
WeNet is an end-to-end automatic speech recognition (ASR) toolkit designed for both Chinese and English, built around transformer-based models. It supports streaming and non-streaming inference out of the box, and is structured to be production-ready, with model export and deployment paths for servers and mobile devices. The toolkit distinguishes itself through a chunk-based streaming transformer architecture that processes audio in fixed-size segments for low latency while preserving context across chunks. It jointly trains models with both CTC and attention loss to combine alignment accurac
Whisper streaming is an automatic speech recognition engine designed to convert live audio into text. It functions as a network-based transcription server that accepts raw audio data from remote clients and returns incremental text results in real time. The system distinguishes itself through its ability to process audio streams incrementally, allowing for immediate transcription and translation as speech is captured. It incorporates voice activity detection to isolate human speech from background noise and uses sliding-window buffering to manage incoming audio segments, ensuring that pro
This Python SDK provides a comprehensive toolkit for synthetic audio generation, voice cloning, and the development of conversational AI agents. It enables the creation of lifelike spoken audio from text, the replication of human voices through custom cloning, and the deployment of real-time voice agents capable of interacting with external large language models. The library distinguishes itself through deep integration of conversational AI capabilities, including the design of agent personas and the execution of real-time actions via APIs. It supports professional-grade audio production thro
Whishper is a graphical user interface for transcribing audio and video files into text using the Whisper model. It serves as a speech-to-text tool and subtitle file generator that converts spoken content into editable text and timed subtitle formats. The project features an integrated transcription and translation interface, allowing users to refine automated results and convert transcribed text into different languages. It includes a visual editor for correcting speech recognition errors, adjusting segment timecodes, and performing bilingual translation reviews. The system handles the full
Vibe is a cross-platform transcription tool that converts spoken audio into text by running Whisper neural models directly on your device, with no cloud dependency. It can transcribe audio from files, microphones, system output, and network streams, and supports both batch processing of multiple files and real-time captioning from continuous input. Beyond basic transcription, Vibe identifies and labels different speakers through speaker diarization, and offers a choice of a command-line interface or an HTTP API for automated and remote workflows. It also includes plugins to export transcripts to c
Ecoute is a live transcription tool that shows real-time transcripts of both the user's microphone input (You) and the user's speaker output (Speaker) in a text box.
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
Cheetah is an LLM technical interview assistant composed of a native macOS application and a browser extension. It provides real-time coding and answering suggestions during technical interviews by combining live audio transcription with web-based context extraction. The system functions as a real-time interview coach that converts spoken questions into text using on-device speech-to-text processing. It uses a browser-integrated DOM scraper to extract live code and console logs, allowing the AI to analyze the current coding state and generate technical solutions based on the specific environm
Whisper Real-Time is a speech-to-text engine designed to convert continuous microphone input into written transcripts. It functions as a real-time audio processor that leverages the OpenAI Whisper model to generate immediate textual output from live spoken language. The system utilizes a transformer-based architecture to map audio sequences to text tokens. It manages incoming data through a sliding-window buffering mechanism and a circular buffer, which ensures a steady stream of audio for the inference engine. To maintain accuracy during continuous processing, the software employs a stateful
This project is a hardware-accelerated transcription server and offline subtitle generator. It functions as a speech-to-text tool that converts audio and video files into plain text, JSON, and SRT subtitle formats using the Whisper model. The system operates as an OpenAI Audio API emulator: a local server that mimics that API's interface, so it can serve transcriptions to existing client configurations without requiring changes to the client software. The service uses GPU acceleration to increase voice recognition speed and includes utilities for hardware detectio
This project is an AI-driven suite of tools designed to repurpose long-form video content into short-form clips. It integrates a speech-to-text engine for automated transcription, a highlighting system that ranks engaging segments based on emotional hooks, and a video processor that converts horizontal footage into vertical formats. The system distinguishes itself through intelligent video cropping that utilizes face tracking and motion smoothing to keep subjects centered. It also employs an analysis system to extract viral highlights by scoring segments for engagement and practical value. T
Linly-Dubbing is an automated video dubbing pipeline designed for multilingual video localization. It converts spoken content in videos into another language by coordinating speech-to-text transcription, text translation, and text-to-speech synthesis. The system distinguishes itself through AI-driven lip synchronization and animation, which aligns facial expressions and mouth movements to the synthesized voiceover. It also utilizes audio source separation to isolate vocals from background music and noise, allowing for clean voice replacement while preserving original background audio. The br
This project is a multimodal translation framework and large language model capable of speech-to-speech, speech-to-text, and text-to-text translation across nearly 100 languages. It provides a real-time speech translation engine and a comprehensive toolkit for converting spoken audio between languages. The system is distinguished by its ability to preserve the original speaker's tone, pace, and prosody during translation. It utilizes a specialized on-device inference toolkit that converts model checkpoints into C-based libraries, enabling low-latency execution on mobile and edge hardware with
ESPnet is a comprehensive speech processing toolkit and PyTorch-based trainer designed for building end-to-end speech recognition, synthesis, and translation models. It provides a structured framework for developing automatic speech recognition systems using transducer and encoder-decoder architectures, alongside engines for text-to-speech synthesis and speech translation pipelines. The project distinguishes itself through a recipe-based workflow execution system that ensures experimental reproducibility by running standardized sequences of scripts for data preparation and model training. It
This project is a framework for building local voice assistants and a real-time audio streaming server. It functions as a containerized inference engine and a multilingual speech pipeline that orchestrates speech-to-text, language models, and text-to-speech components to convert spoken input into spoken output. The system is distinguished by its use of WebSocket-based bidirectional streaming for low-latency interactions. It features a voice activity detection system that manages speech boundaries and handles user barge-in interruptions during assistant playback. It also supports custom voice
Omi is an open-source wearable AI platform that captures audio and screen data to provide real-time conversational assistance and memory. It integrates a wearable hardware development kit with a vector memory database and large language model capabilities to create a persistent digital record of user interactions. The platform is distinguished by its BLE audio streaming pipeline, which transmits raw audio from wearable hardware for real-time transcription and speaker identification. It utilizes a plugin-based agent tool framework that allows AI assistants to autonomously invoke custom functio
Vocode-core is a framework for building real-time conversational AI voice agents. It serves as a conversational orchestrator and pipeline that integrates speech-to-text, large language models, and text-to-speech services to enable low-latency voice interactions. The project features a provider-agnostic interface that allows for swappable speech and language model providers, including support for both cloud APIs and local binaries. It distinguishes itself through a specialized telephony integration layer that enables agents to be deployed across phone lines, WebRTC, and virtual meeting platfor
This project is a suite of tools centered around an AI-powered interview assistant, a professional resume builder, and an engineering salary database. The core application provides real-time audio transcription and generates code and system design solutions during technical interviews. The software is designed for stealth and detection avoidance. It utilizes an invisible screen overlay that bypasses screen-capture and screen-sharing software, allowing the user to view information without it appearing on shared displays. To further avoid detection, the system implements keyboard-only operation
This project is a Chinese automatic speech recognition framework and deep learning system designed to convert spoken Chinese audio into written text. It functions as a toolkit for training, evaluating, and deploying speech-to-text models, utilizing a specialized pinyin-to-text converter that transforms phonetic sequences into Chinese characters using a probability graph model. The system is distinguished by its deployment flexibility, offering a dockerized recognition server that provides transcription capabilities as a remote API. It supports high-performance streaming through a gRPC speech-
This is a collection of pre-trained neural models for speech recognition, synthesis, and voice activity detection. It provides a library of assets designed for speech-to-text, text-to-speech, and the identification of human speech segments within audio. The project features text-to-speech synthesis with support for multiple languages and the use of Speech Synthesis Markup Language to control prosody, pitch, and timing. For speech recognition, the system includes capabilities for transcribing audio to text with word-level timestamp extraction and an automated punctuation restorer to insert cap
mlx-audio is an audio processing toolkit built on Apple MLX that provides speech transcription, text-to-speech synthesis, voice cloning, and audio source separation using local models. It offers an OpenAI-compatible REST API and web interface for running audio generation and transcription tasks, enabling drop-in integration with existing tools that follow that endpoint structure. The toolkit supports text-prompted audio source separation, allowing specific sounds to be isolated from mixed recordings based on natural language descriptions. It also provides voice cloning from a short reference
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additiona
Summarize is a command-line tool and multimodal content extractor designed to generate concise summaries from web pages, documents, and media files. It functions as an orchestrator that connects developer tools to various language model providers to process and condense information. The system provides specialized capabilities for audio and video processing, including transcription with speaker identification and the extraction of timestamped visual markers from video slides. It also includes a translation utility to convert generated summaries and extracted text into different target languag
This project is an on-device AI SDK providing a framework for running large language models, vision models, and speech models locally. It serves as an orchestration layer for local LLM execution, ensuring data privacy and offline availability by utilizing hardware acceleration on the device. The SDK is distinguished by its comprehensive voice and multimodal capabilities, including a coordinated voice pipeline for activity detection, speech-to-text, and text-to-speech synthesis. It also provides a dedicated implementation kit for local retrieval-augmented generation and tools for processing co
Whisper-diarization is a system for identifying and separating different speakers in audio recordings by combining OpenAI Whisper for transcription with automated speaker attribution. It functions as a pipeline that isolates vocal tracks from background noise and assigns transcribed segments to specific individuals. The project uses forced alignment to synchronize transcribed text timestamps with audio signals, improving the accuracy of speaker attribution. It employs voice activity detection to separate speech from silence and noise, ensuring precise boundaries for identification. The syste
Silero VAD is a voice activity detection model and deep learning speech classifier designed to distinguish human speech from silence across diverse languages and noisy environments. It functions as a pre-trained neural network capable of identifying speech segments within both static audio recordings and real-time data streams. The project includes a language identification tool for classifying spoken languages and a framework for fine-tuning audio models. It provides utilities for optimizing detection thresholds using validation datasets and retraining the model with custom labeled audio to
Whisper is a high-performance speech-to-text inference engine that uses graphics hardware shaders to accelerate the transcription of spoken audio into written text. It implements a GPU-accelerated automatic speech recognition framework specifically designed to run Whisper models. The system focuses on high-speed processing for both recorded audio files and live microphone streams. It utilizes voice activity detection to analyze raw audio in real time, triggering the inference engine only when human speech is detected. The engine covers a broad range of capabilities including real-time audio
whisper.cpp is a C++ implementation of the Whisper speech-to-text model, serving as a lightweight machine learning inference engine and quantized runtime. It provides high-performance automatic speech recognition and real-time audio transcription without requiring a Python environment. The project utilizes model quantization to reduce memory usage and increase inference speed on local hardware. It incorporates hardware acceleration to optimize processing speed across different processors. The system covers audio processing capabilities including voice activity detection, speaker diarization,