Sherpa Onnx | Awesome Repository

Sherpa-ONNX is an ONNX-based speech processing toolkit that provides speech recognition, speech synthesis, and speaker identification. It is designed as a cross-platform speech API that runs speech-to-text, text-to-speech, and speaker verification tasks locally on a device, with no network access required.

The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on the device itself. It supports a wide range of hardware accelerators, including GPUs and various NPU architectures, and provides a WebSocket server for hosting remote streaming and batch transcription services.

The toolkit covers a broad surface of audio capabilities, including multilingual speech recognition and translation, sound event classification, wake word detection, and voice activity detection. It also includes text processing utilities for automatic punctuation and subtitle generation, as well as audio signal processing for noise removal and source separation.

Native interfaces are available for Java, Kotlin, Swift, and Object Pascal, with support for WebAssembly to enable browser-based recognition.

Features

Local Inference - Executes all speech processing tasks directly on the local device without requiring network access.
Speech Recognition Systems - Provides an offline inference system that converts spoken audio to text across multiple platforms.
Voice Activity Detection - Identifies speech segments in audio files or streams to optimize transcription and filter silence.
Real-Time Transcription - Processes live microphone input for low-latency transcription, keyword spotting, and noise reduction.
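
To make the voice activity detection entry above concrete, here is a minimal sketch of the general idea: splitting audio into short frames and keeping the spans whose energy stays above a threshold. It does not use the Sherpa-ONNX API and is not the project's method; it is a plain energy-threshold stand-in, and the function name `detect_speech_segments` and the parameters `frame_ms`, `threshold`, and `min_speech_frames` are all hypothetical choices made for illustration.

```python
import math


def detect_speech_segments(samples, sample_rate, frame_ms=30,
                           threshold=0.02, min_speech_frames=3):
    """Return (start_sec, end_sec) spans whose per-frame RMS is >= threshold.

    Illustrative only: a simple energy gate, not Sherpa-ONNX's detector.
    `samples` is a sequence of floats in [-1.0, 1.0].
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame_ms is too small for this sample rate")

    segments = []
    start = None  # index of the first frame of the current speech run
    num_frames = len(samples) // frame_len  # trailing partial frame is ignored

    for i in range(num_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run just ended; keep it only if it was long enough.
            if i - start >= min_speech_frames:
                segments.append((start * frame_len / sample_rate,
                                 i * frame_len / sample_rate))
            start = None

    # Close a run that extends to the end of the audio.
    if start is not None and num_frames - start >= min_speech_frames:
        segments.append((start * frame_len / sample_rate,
                         num_frames * frame_len / sample_rate))
    return segments


def make_test_signal(sample_rate=16000):
    """Build 0.3 s silence, 0.3 s of a 440 Hz tone, then 0.3 s silence."""
    n = int(0.3 * sample_rate)
    silence = [0.0] * n
    tone = [0.5 * math.sin(2 * math.pi * 440 * k / sample_rate)
            for k in range(n)]
    return silence + tone + silence
```

Passing the output of `make_test_signal()` to `detect_speech_segments(..., 16000)` yields a single span covering the tone. A fixed energy threshold is easy to follow but fragile in noisy recordings, which is why it serves here only as an illustration of the segment-finding step.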
