30 open-source projects similar to ggml-org/whisper.cpp, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Whisper.cpp alternative.
whisper.cpp is a C++ implementation of the Whisper speech-to-text model, serving as a lightweight machine learning inference engine and quantized runtime. It provides high-performance automatic speech recognition and real-time audio transcription without requiring a Python environment. The project utilizes model quantization to reduce memory usage and increase inference speed on local hardware. It incorporates hardware acceleration to optimize processing speed across different processors. The system covers audio processing capabilities including voice activity detection, speaker diarization,
Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It utilizes a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, enabling the orchestration of low-latency conversational AI. The project is distinguished by its ability to coordinate complex multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue manag
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse
Llamafile is a machine learning model runner and packager that enables local inference by bundling model weights and runtime environments into a single, self-contained executable. It functions as a cross-platform engine, allowing users to execute large language models and perform speech-to-text tasks directly on their own hardware without requiring external software dependencies or complex installations. The project distinguishes itself by utilizing a specialized binary format that allows the same executable to run natively across multiple operating systems and hardware architectures. It auto
Frigate is a self-hosted network video recorder that functions as a private, local AI-powered vision engine. It manages video streams by performing real-time object detection, tracking, and classification directly on local hardware, ensuring that security monitoring and activity recording remain independent of cloud services. The system distinguishes itself through a modular, hardware-accelerated video pipeline that offloads intensive decoding and machine learning inference to dedicated GPUs, NPUs, or specialized accelerators like Coral TPUs and Hailo modules. It utilizes state-based object t
This project is a cross-platform machine learning inference engine designed to execute pre-trained models across diverse operating systems and hardware environments. It functions as a standardized execution framework that manages the entire lifecycle of model inference, from loading and graph optimization to hardware-accelerated execution and generative sequence management. The runtime distinguishes itself through a highly modular architecture that decouples model logic from hardware-specific kernels. By utilizing an execution provider abstraction, it enables developers to offload computation
This repository serves as a comprehensive collection of reference implementations for the PyTorch machine learning library. It provides practical examples for building, training, and deploying deep learning models, functioning as a toolkit for developers to explore neural network architectures and training workflows. The project distinguishes itself by offering concrete demonstrations of complex machine learning operations, ranging from computer vision tasks like object detection and depth estimation to the training of large-scale transformer models. These examples illustrate how to implement
This project is a Python framework for building autonomous, event-driven agent systems. It provides a unified runtime for orchestrating multi-agent workflows, managing persistent conversation state, and executing code within secure, isolated sandbox environments. The framework is designed to handle complex task delegation, allowing agents to invoke other agents as tools while maintaining context across multi-turn interactions. The framework distinguishes itself through its deep integration with the Model Context Protocol, enabling agents to connect to external data sources and remote services
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as w
Qwen3 is a transformer-based large language model designed as a generative AI foundation for understanding, reasoning, and generating human language. It functions as a comprehensive ecosystem for model training, fine-tuning, and production-ready inference, providing the underlying architecture and weights necessary to build diverse artificial intelligence applications. The project distinguishes itself through extensive support for model quantization and distributed inference, enabling efficient execution across a wide range of hardware from consumer-grade devices to scalable cloud infrastruct
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
WhisperX is an automated speech recognition toolkit designed to convert spoken audio into text while maintaining precise synchronization with the original media. It functions as an integrated pipeline that combines transcription, phoneme-based alignment, and speaker diarization to produce structured, attributed transcripts. The project distinguishes itself through its use of forced alignment, which matches existing text to audio signals at the phoneme level to generate accurate word-level timestamps. It also incorporates speaker diarization to identify and label unique voices within a recordi
Buzz is a desktop application that provides a local speech-to-text engine for transcribing and translating audio and video files. By leveraging local machine inference, the software ensures data privacy and offline performance, removing the need for cloud connectivity during media processing. The application distinguishes itself through a modular plugin architecture that allows for the integration of custom functionality, such as content summarization and automated text formatting, without modifying the core codebase. It also features a speaker diarization pipeline that identifies and labels
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
This project is a comprehensive platform for hosting and interacting with large language models directly on local hardware. It provides a web-based graphical interface that allows users to manage model loading, configure generation parameters, and execute text or chat interactions entirely offline. By running models locally, the software ensures complete data privacy and eliminates reliance on external cloud services for generative tasks. Beyond basic inference, the platform functions as a versatile workbench for generative AI development. It includes an integrated pipeline for fine-tuning mo
DeepSpeed is a high-performance library designed to scale deep learning model training and inference across massive clusters of GPUs and compute nodes. It provides a comprehensive suite of tools for distributed training, enabling the execution of models that exceed the memory capacity of single devices through advanced parameter partitioning, pipeline-based model parallelism, and memory-efficient state offloading. The framework distinguishes itself through specialized communication-efficient optimizers and hardware-aware acceleration techniques. By utilizing gradient compression, quantization
LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections. The platform distinguishes itself through it
ChatGLM3 is a comprehensive framework for deploying, fine-tuning, and serving large language models. It functions as a high-performance inference engine designed to support conversational AI, enabling developers to build interactive agents capable of multi-turn dialogue, autonomous code execution, and structured tool invocation. The project distinguishes itself through its focus on hardware-agnostic deployment and resource optimization. It supports distributed model parallelism across multiple graphics cards, paged key-value caching for concurrent request processing, and weight quantization t
Zig is a general-purpose systems programming language designed for high-performance applications that require manual memory management and direct control over hardware resources. It prioritizes predictable execution by enforcing explicit control flow and requiring functions to accept explicit memory allocators, ensuring that all heap operations and logic paths remain visible to the developer. The language distinguishes itself through a powerful compile-time metaprogramming engine that allows for arbitrary code execution during the build process, enabling advanced reflection and the generation
This project is a speech recognition and translation engine that utilizes a sequence-to-sequence transformer architecture to convert audio into text. It is built upon a weakly supervised learning framework, which leverages large-scale, unlabelled audio-transcript data to create generalized speech representations capable of performing simultaneous transcription, language identification, and translation. The system distinguishes itself through a unified multi-task modeling approach that shares token sequences across different objectives, allowing it to handle diverse languages and vocabularies
TensorRT-LLM is a platform and toolkit designed for compiling, optimizing, and serving transformer-based models on accelerated hardware. It functions as a framework that transforms machine learning models into efficient execution graphs, providing an engine to refine these models for specific hardware to maximize throughput and minimize latency during text generation. The project distinguishes itself through advanced execution strategies that manage the entire inference pipeline. It utilizes kernel-level fusion and static graph execution to optimize mathematical operations and computational f
Vosk is an offline speech-to-text engine and API that converts spoken audio into text locally on a device. It provides a cross-platform speech toolkit with language bindings for integrating voice recognition into server environments, Android, iOS, and Raspberry Pi. The project includes a speaker identification tool to distinguish between different voices and an acoustic model trainer for building custom neural network models. These training tools enable speech feature extraction and model accuracy evaluation to improve recognition for specialized domains. The system supports real-time audio
FunASR is an automatic speech recognition toolkit and multilingual speech-to-text engine designed to convert spoken audio into written text across more than fifty languages. It provides a framework for speaker diarization, an OpenAI-compatible transcription API for local server hosting, and speech models compatible with the ONNX format. The project distinguishes itself by supporting high-performance inference on edge hardware via self-contained binaries and portable model exports. It incorporates specialized capabilities for natural speech generation with adjustable timbre and emotional expre
PaddleSpeech is a comprehensive toolkit of neural models for speech recognition, synthesis, and translation built on the PaddlePaddle deep learning framework. It provides a collection of frameworks and tools for converting spoken audio into written text, synthesizing natural audio from text, and performing direct speech translation. The toolkit includes specialized capabilities for keyword spotting to detect trigger words and speaker verification systems that extract unique voiceprints to identify and distinguish between individuals. It also features end-to-end translation tools that map audi
Voicebox is a local speech processing system that provides text-to-speech generation, speech-to-text transcription, and voice cloning. It utilizes local machine learning inference and GPU acceleration to process audio and text data without relying on external API calls. The project features a voice cloning toolkit for creating synthetic profiles from audio samples and a timeline-based voice editor for composing multi-character conversations. It also includes an AI voice management API that allows external applications and AI agents to programmatically manage voice profiles and generate speech
Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency. The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model executi
AgenticSeek is a multi-agent orchestration system designed to decompose complex user objectives into granular, actionable tasks. By coordinating a team of specialized autonomous workers, the platform manages end-to-end workflows, ensuring that each component of a project is assigned to the most capable agent for execution. The system operates as a local-first runtime, executing all artificial intelligence models directly on user hardware to maintain data sovereignty and privacy. It integrates a browser automation engine for autonomous web research and interaction, alongside a sandboxed enviro
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
DeepSpeech is an open-source speech-to-text framework and machine learning engine designed to convert spoken audio into written text locally on a device. It provides on-device speech recognition that operates without requiring an internet connection to external servers. The system supports real-time speech transcription across a variety of hardware platforms, ranging from single-board computers and edge devices to GPU servers. This allows for audio analysis and processing directly on the local hardware.