30 open-source projects similar to chidiwilliams/buzz, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Buzz alternative.
Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It utilizes a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, enabling the orchestration of low-latency conversational AI. The project is distinguished by its ability to coordinate complex multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue manag
Handy is a local speech-to-text automation tool designed to convert spoken audio into text and inject it directly into active desktop applications. By running machine learning models entirely on the host hardware, it provides a private, offline-first environment for dictation and command execution. The system functions as a background service that manages microphone input, transcription state, and text output, enabling hands-free typing across various software environments. The project distinguishes itself through a modular pipeline that integrates local language models for post-transcription
LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections. The platform distinguishes itself through it
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a Web
Voicebox is a local speech processing system that provides text-to-speech generation, speech-to-text transcription, and voice cloning. It utilizes local machine learning inference and GPU acceleration to process audio and text data without relying on external API calls. The project features a voice cloning toolkit for creating synthetic profiles from audio samples and a timeline-based voice editor for composing multi-character conversations. It also includes an AI voice management API that allows external applications and AI agents to programmatically manage voice profiles and generate speech
Whisper.cpp is a high-performance, local-first speech recognition engine designed to run large-scale machine learning models on consumer hardware. It functions as a portable library that converts audio into text, supporting both static file transcription and real-time stream processing. By utilizing a lightweight inference engine and weight quantization, the project minimizes memory and compute overhead, allowing for efficient execution without reliance on external cloud APIs or internet connectivity. The project distinguishes itself through a hardware-agnostic compute abstraction that offloa
WhisperX is an automated speech recognition toolkit designed to convert spoken audio into text while maintaining precise synchronization with the original media. It functions as an integrated pipeline that combines transcription, phoneme-based alignment, and speaker diarization to produce structured, attributed transcripts. The project distinguishes itself through its use of forced alignment, which matches existing text to audio signals at the phoneme level to generate accurate word-level timestamps. It also incorporates speaker diarization to identify and label unique voices within a recordi
Vibe is a cross-platform transcription tool that converts spoken audio into text by running Whisper neural models directly on your device, with no cloud dependency. It can transcribe audio from files, microphones, system output, and network streams, and supports both batch processing of multiple files and real-time captioning from continuous input. Beyond basic transcription, Vibe identifies and labels different speakers through speaker diarization, and offers a choice of Command-Line Interface or HTTP API for automated and remote workflows. It also includes plugins to export transcripts to c
VoiceInk is a system-wide speech-to-text dictation tool that converts spoken audio into text using local or cloud AI models. It functions as a local AI transcription engine and a context-aware voice assistant, allowing users to insert transcribed text directly into any active application on the operating system. The project distinguishes itself through the use of custom vocabulary management, which trains transcription engines to recognize industry-specific technical terms, professional terminology, and personal names. It further enhances output by using large language models to refine raw tr
Local speech-to-text for macOS on-device AI, fully private, optional cloud
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additiona
This project is a development framework for building edge-based AI agents that perform multimodal inference and system-level automation directly on mobile devices. By prioritizing local-first execution, the platform ensures data privacy and offline functionality, allowing developers to run large language models on hardware without requiring external server connectivity. The framework distinguishes itself through an integrated orchestration layer that connects language models to custom tools, scripts, and native device intents. It provides a structured registry for mapping natural language ins
RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition server. It utilizes transformer-based recognition and omnilingual pipelines to convert live audio streams into text, providing a WebSocket-based streaming API for raw PCM audio transmission. The project is distinguished by a dual-backend transcription pipeline that uses a lightweight engine for immediate partial suggestions and a heavier model for final high-accuracy results. It includes a wake word detection system to trigger recording and employs a shared-resource inference model to distribute heavy spee
Summarize is a command line tool and multimodal content extractor designed to generate concise summaries from web pages, documents, and media files. It functions as an orchestrator that connects developer tools to various language model providers to process and condense information. The system provides specialized capabilities for audio and video processing, including transcription with speaker identification and the extraction of timestamped visual markers from video slides. It also includes a translation utility to convert generated summaries and extracted text into different target languag
WhisperLiveKit is a real-time speech-to-text server that transcribes streaming audio into text with ultra-low latency using Whisper models. It serves transcription capabilities through REST endpoints and WebSocket connections, enabling external applications to send audio and receive transcriptions as words are spoken, making it suitable for live captioning or voice interfaces. The project distinguishes itself by combining real-time transcription with speaker diarization, assigning transcribed words to individual speakers during live audio streams for meeting or interview transcripts. It also
PaddleSpeech is a comprehensive toolkit of neural models for speech recognition, synthesis, and translation built on the PaddlePaddle deep learning framework. It provides a collection of frameworks and tools for converting spoken audio into written text, synthesizing natural audio from text, and performing direct speech translation. The toolkit includes specialized capabilities for keyword spotting to detect trigger words and speaker verification systems that extract unique voiceprints to identify and distinguish between individuals. It also features end-to-end translation tools that map audi
Whisper is a high-performance speech-to-text inference engine that uses graphics hardware shaders to accelerate the transcription of spoken audio into written text. It implements a GPU-accelerated automatic speech recognition framework specifically designed to run Whisper models. The system focuses on high-speed processing for both recorded audio files and live microphone streams. It utilizes voice activity detection to analyze raw audio in real time, triggering the inference engine only when human speech is detected. The engine covers a broad range of capabilities including real-time audio
openai-go is an LLM SDK for Go and a client for interacting with OpenAI services. It provides type-safe bindings to generate text, images, and audio via REST endpoints, enabling the integration of large language models and AI assistant orchestration into Go applications. The library serves as an agent orchestration tool for managing stateful conversation threads and autonomous agents with integrated tool calling and file search. It also functions as an asynchronous batch processing client for monitoring large-scale request groups and fine-tuning jobs, alongside a management SDK for controllin
This project is a speech recognition and translation engine that utilizes a sequence-to-sequence transformer architecture to convert audio into text. It is built upon a weakly supervised learning framework, which leverages large-scale, unlabelled audio-transcript data to create generalized speech representations capable of performing simultaneous transcription, language identification, and translation. The system distinguishes itself through a unified multi-task modeling approach that shares token sequences across different objectives, allowing it to handle diverse languages and vocabularies
Llamafile is a machine learning model runner and packager that enables local inference by bundling model weights and runtime environments into a single, self-contained executable. It functions as a cross-platform engine, allowing users to execute large language models and perform speech-to-text tasks directly on their own hardware without requiring external software dependencies or complex installations. The project distinguishes itself by utilizing a specialized binary format that allows the same executable to run natively across multiple operating systems and hardware architectures. It auto
Pyvideotrans is an automated video localization platform designed to transcribe, translate, and dub media content for international distribution. It functions as an end-to-end workflow that combines speech recognition, text translation, and synthetic voice generation to process video files into localized versions. The system distinguishes itself by offering a choice between local model inference for privacy and integration with third-party cloud services via user-provided credentials. This architecture allows users to maintain control over their billing and data security while utilizing modul
Backstage is an open-source framework for building internal developer portals. It provides a centralized, metadata-driven software catalog that tracks ownership, dependencies, and lifecycle status for all technical assets by harvesting configuration files directly from version control systems. The platform is built on a plugin-based modular architecture, allowing teams to extend core functionality through isolated, independently deployable modules that integrate into a unified frontend and backend ecosystem. The project distinguishes itself through its focus on developer productivity and stan
Nextcloud is a self-hosted platform designed for private cloud storage, file synchronization, and collaborative team workspaces. It provides a comprehensive suite of tools for document editing, groupware services like calendars and contacts, and secure data management, all while ensuring users maintain full control over their infrastructure and data sovereignty. The platform distinguishes itself through a decentralized federated architecture that allows independent server instances to securely share data and collaborate across a network. It features a highly modular plugin ecosystem, enabling
Open-source AI voice typing for macOS, Windows, and Linux. Press a hotkey, speak naturally, get polished text in any app.
Your personal voice interface for any app. Speak naturally and your words appear wherever your cursor is, with fully customizable AI voice dictation. Open source alternative to Wispr Flow.
Epicenter is a local-first knowledge management system and data orchestrator designed to structure information generated by large language models into validated schemas. It functions as a storage architecture that persists application data in human-readable files and databases to ensure user ownership and portability. The system distinguishes itself by projecting language model outputs into structured, schema-validated tables and utilizing conflict-free replicated data types to synchronize application state across multiple devices without a central server. This allows for offline access and c
macOS native voice input for Crypto AI professionals — offline ASR, domain vocabulary, implicit learning
Hold Fn key, speak, paste transcribed text. macOS menu bar app.
Mac-native dictation that just works – lives in the notch, zero setup, free forever