Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access.
The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a WebSocket server for hosting remote streaming and batch transcription services.
The toolkit covers a broad surface of audio capabilities, including multilingual speech recognition and translation, sound event classification, wake word detection, and voice activity detection. It also includes text processing utilities for automatic punctuation and subtitle generation, as well as audio signal processing for noise removal and source separation.
Native interfaces are available for Java, Kotlin, Swift, and Object Pascal, with support for WebAssembly to enable browser-based recognition.