30 open-source projects similar to uberi/speech_recognition, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Speech Recognition alternative.
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a Web
Ecoute is a live transcription tool that shows real-time transcripts of both the user's microphone input (labeled You) and the user's speaker output (labeled Speaker) in a text box.
EZAudio is an audio library for Apple platforms that provides standardized interfaces for microphone capture, file playback, and hardware output. It functions as a low-latency audio processor and visualization framework designed to manipulate audio buffers and route signals with minimal delay. The project features a hardware-accelerated waveform renderer for drawing real-time audio amplitudes and rolling plots. It also includes a Fast Fourier Transform analyzer that converts time-domain audio samples into frequency-domain data for spectral analysis. The library covers a broad range of capabi
This project is a Chinese automatic speech recognition framework and deep learning system designed to convert spoken Chinese audio into written text. It functions as a toolkit for training, evaluating, and deploying speech-to-text models, utilizing a specialized pinyin-to-text converter that transforms phonetic sequences into Chinese characters using a probability graph model. The system is distinguished by its deployment flexibility, offering a dockerized recognition server that provides transcription capabilities as a remote API. It supports high-performance streaming through a gRPC speech-
Vosk is an offline speech-to-text engine and API that converts spoken audio into text locally on a device. It provides a cross-platform speech toolkit with language bindings for integrating voice recognition into server environments, Android, iOS, and Raspberry Pi. The project includes a speaker identification tool to distinguish between different voices and an acoustic model trainer for building custom neural network models. These training tools enable speech feature extraction and model accuracy evaluation to improve recognition for specialized domains. The system supports real-time audio
This project is a self-hosted meeting transcription and summarization tool that converts audio recordings into text transcripts and structured notes using large language models. It functions as an enterprise meeting documentation manager, allowing for the organization and editing of timestamped records. The system prioritizes data privacy through local-first processing and the ability to deploy on private infrastructure. It supports a provider-agnostic architecture, enabling users to connect to local AI engines, self-hosted servers, or cloud-based API endpoints for both transcription and summ
RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition server. It utilizes transformer-based recognition and omnilingual pipelines to convert live audio streams into text, providing a WebSocket-based streaming API for raw PCM audio transmission. The project is distinguished by a dual-backend transcription pipeline that uses a lightweight engine for immediate partial suggestions and a heavier model for final high-accuracy results. It includes a wake word detection system to trigger recording and employs a shared-resource inference model to distribute heavy spee
Casibase is an open-source platform that orchestrates multi-turn conversations with large language models and manages retrieval-augmented knowledge bases from a single interface. It provides a unified system for connecting to over 30 AI model providers, ingesting documents into vector embeddings for semantic search, and running autonomous agent loops that can drive a browser, search the web, execute commands, and integrate with external tools. The platform distinguishes itself by combining AI conversation management with infrastructure and application orchestration capabilities. It includes a
This project is a framework for developing multimodal AI agents that function as programmable participants in real-time communication rooms. It enables the construction of agents that can see, hear, and speak by integrating speech-to-text, large language models, and text-to-speech pipelines to facilitate low-latency, natural conversations. The system is distinguished by its advanced orchestration of real-time media and conversational flow, including support for full-duplex speech, preemptive response generation, and sophisticated interruption management. It further differentiates itself throu
GPAC is an open-source multimedia framework built around a pluggable filter graph pipeline, where modular processing units called filters connect into a directed graph to handle media workflows. At its core, the framework centers all media packaging and manipulation on the ISO Base Media File Format (ISOBMFF), with specialized tools for reading, writing, fragmenting, and encrypting MP4 and related containers. It also provides a declarative scene graph composition system for describing interactive multimedia scenes using MPEG-4 BIFS, X3D, SVG, or VRML syntax, alongside a hardware-accelerated re
Pydub is a Python audio manipulation library and digital audio processor used for editing, slicing, and converting audio files and segments. It serves as a programmatic wrapper for FFmpeg to import and export a wide variety of audio formats. The library functions as an audio signal generator capable of creating synthetic waveforms, such as sine waves and white noise. It also provides tools for digital signal processing, including the application of filters, fades, crossfades, and gain adjustments to sound signals. Its broader capabilities cover programmatic audio editing through concatenatio
Opus is a lossy audio compression standard and codec designed for high-quality speech and music transmission over the internet. It functions as a low-latency audio codec and network-resilient streamer, providing a framework for encoding and decoding digital audio. The project distinguishes itself through its support for multi-channel ambisonics for immersive three-dimensional spatial audio reproduction. It is specifically optimized for real-time interactive communication, utilizing dynamic bitrate adjustment and forward error correction to maintain audio quality on unstable networks. The syst
Audacity is a cross-platform digital audio workstation and multi-track audio editor. It serves as a comprehensive suite for capturing live audio input, refining sound files through splicing and effects, and mixing multi-track audio files using a non-destructive waveform interface. The project functions as a VST3 plugin host, providing a software environment to load and execute audio effects and virtual instruments for real-time signal processing. It also includes an audio spectrum analyzer for visualizing frequencies and waveforms to identify specific sonic characteristics. The software cove
MPD is a headless music server daemon that indexes audio libraries and streams music to local or remote outputs. It functions as a music library manager and network audio streamer, providing a remote audio control protocol that allows external clients to manage playback, playlists, and database queries. The system acts as a multiroom audio coordinator, synchronizing audio distribution across multiple networked clients and hardware devices. It supports a variety of remote management capabilities, including a dedicated control API and the ability to broadcast audio streams over network protocol
Geist is an open-source font family and typography collection designed for high legibility in technical interfaces. It consists of a series of web-optimized typefaces, including geometric sans-serif, monospaced, and pixel styles. The collection functions as a variable font library, utilizing coordinate interpolation to allow precise control over weight and style within a single font file. These fonts are built as OpenType typefaces, incorporating standardized layout tables to define advanced typographic behaviors such as kerning and ligatures. The project provides specific implementations fo
LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections. The platform distinguishes itself through it
Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It utilizes a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, enabling the orchestration of low-latency conversational AI. The project is distinguished by its ability to coordinate complex multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue manag
Fonoster is a conversational AI framework and multi-tenant communications platform as a service. It serves as a programmable voice gateway and SIP telephony platform, enabling the creation of voice-based assistants and automated communication workflows using large language models. The project distinguishes itself through a vendor-agnostic speech integration engine that abstracts speech-to-text and text-to-speech providers. It features a multi-tenant architecture that isolates telephony resources and user identities into distinct organizational workspaces. The system covers a broad range of t
Screenpipe is a local screen and audio recorder that captures and indexes digital activity to create a searchable archive of computer usage. It functions as an AI context engine, providing a local database of visual and auditory history to ground large language models. The system serves as a Model Context Protocol server, delivering screen history and meeting transcriptions to external AI assistants. It utilizes an OCR screen search tool to extract text from visual data and a speech-to-text transcription tool for identifying speakers in system and microphone audio. The software includes capa
PocketSphinx is an offline speech recognition engine that converts raw audio from files or live microphone streams into written text without requiring a network connection. It functions as a speech-to-text library, a real-time transcription engine, and a voice command processor, capable of detecting and transcribing spoken commands from continuous audio streams with configurable acoustic and language models. The engine uses weighted finite-state transducers to represent acoustic, phonetic, and language models as a single search graph for efficient decoding. It employs fixed-point acoustic mod
This project is a comprehensive framework for building AI-powered applications, providing a unified toolkit for orchestrating language models, autonomous agents, and interactive user interfaces. It serves as a central library for managing the entire lifecycle of AI interactions, from initial prompt generation and model provider abstraction to complex, multi-step reasoning and tool execution. The framework distinguishes itself through its deep integration with frontend development, specifically by enabling generative user interfaces that render dynamic components directly from model outputs. I
This project is a speech recognition and translation engine that utilizes a sequence-to-sequence transformer architecture to convert audio into text. It is built upon a weakly supervised learning framework, which leverages large-scale, weakly labelled audio-transcript data to create generalized speech representations capable of performing transcription, language identification, and translation. The system distinguishes itself through a unified multi-task modeling approach that shares token sequences across different objectives, allowing it to handle diverse languages and vocabularies
Wechaty is a cross-platform chatbot framework designed to build and manage automated messaging agents. It provides a unified programming interface that abstracts diverse instant messaging protocols, allowing developers to create bots that function consistently across multiple communication services. By utilizing a modular architecture, the framework enables the development of conversational agents capable of handling complex messaging workflows, contact management, and group room interactions. The framework distinguishes itself through a puppet-based protocol abstraction and a language-agnost
WhisperLiveKit is a real-time speech-to-text server that transcribes streaming audio into text with ultra-low latency using Whisper models. It serves transcription capabilities through REST endpoints and WebSocket connections, enabling external applications to send audio and receive transcriptions as words are spoken, making it suitable for live captioning or voice interfaces. The project distinguishes itself by combining real-time transcription with speaker diarization, assigning transcribed words to individual speakers during live audio streams for meeting or interview transcripts. It also
CapsWriter-Offline is a suite of desktop tools that operates without an internet connection, combining local media browsing, voice dictation, audio and video transcription, and 360-degree media viewing into a single application. The project's core identity centers on providing offline functionality for both media handling and speech-to-text workflows. What distinguishes it is the integration of voice dictation with a persistent local storage layer that saves every audio recording and daily transcript logs, along with a rule-based text normalization engine that converts spoken number phrases a
This project is a language model evaluation framework and benchmarking tool designed to measure the accuracy and performance of models across diverse datasets. It provides a system for implementing model-based graders, running standardized tests for mathematical reasoning, coding, and factuality, and calculating quantified performance metrics such as precision, recall, F1 scores, and pass-at-k. The framework utilizes model-based grading and rubrics to validate response quality against expert-defined criteria. It includes a multi-model benchmarking loop and a model-agnostic API interface to co
Annyang is a speech recognition library and web speech API wrapper that enables the integration of voice command interfaces into websites. It functions as a browser-based voice controller, mapping spoken phrases and regular expressions to specific JavaScript functions to trigger application actions. The library provides mechanisms for voice command mapping and simulation, allowing developers to associate spoken text with executable callbacks. It includes tools for command variable extraction using regular expression capture groups, which allows specific words from a spoken phrase to be passed
big-AGI is a self-hosted AI frontend and multi-model client that provides a unified workspace for interacting with various large language models. It functions as an orchestration dashboard, allowing users to connect to cloud-based AI providers, aggregator services, and locally hosted model servers. The project is distinguished by its ability to execute prompts across multiple models simultaneously for side-by-side comparison and response synthesis. It enables the merging of outputs from different models to reduce hallucinations and improve accuracy, while using persona-based configuration map