Whisper is a high-performance speech-to-text inference engine that uses graphics hardware shaders to accelerate the transcription of spoken audio into written text. It implements a GPU-accelerated automatic speech recognition framework specifically designed to run Whisper models.
The system focuses on high-speed processing for both recorded audio files and live microphone streams. It utilizes voice activity detection to analyze raw audio in real time, triggering the inference engine only when human speech is detected.
The engine covers a broad range of capabilities including real-time audio capture, GPGPU inference optimization, and compute performance profiling to measure the execution time of individual shaders.