Whisper

This project is a speech recognition and translation engine that uses a sequence-to-sequence transformer architecture to convert audio into text. It is built on a weakly supervised learning framework, which uses large-scale, weakly labeled audio-transcript pairs to learn general speech representations that support transcription, language identification, and translation within a single model.

The system distinguishes itself through a unified multi-task modeling approach that shares token sequences across different objectives, allowing it to handle diverse languages and vocabularies without language-specific rules. By employing byte-level tokenization and sliding window audio segmentation, the engine maintains memory efficiency and temporal consistency when processing long-form audio or varied acoustic environments.
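The two mechanisms named above, sliding window segmentation and byte-level tokenization, can be illustrated with a minimal sketch. It is not the project's implementation: the 16 kHz sample rate and 30-second window are assumed values, the function names are hypothetical, and the windows here are fixed and non-overlapping for simplicity. The byte-level helper shows only the base alphabet of 256 byte values, not any subword merges that a real tokenizer would apply on top.

```python
SAMPLE_RATE = 16_000   # assumed: samples per second
WINDOW_SECONDS = 30    # assumed: window length in seconds


def split_into_windows(samples, window=SAMPLE_RATE * WINDOW_SECONDS):
    """Yield (offset_seconds, chunk) pairs of fixed length.

    The last chunk is zero-padded so every window has the same size,
    which keeps memory use constant regardless of recording length.
    """
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        chunk += [0.0] * (window - len(chunk))  # pad the final window
        yield start / SAMPLE_RATE, chunk


def to_byte_ids(text):
    """Byte-level view of text: every UTF-8 byte is an integer in 0..255.

    Any script can be represented this way without a language-specific
    vocabulary, at the cost of longer sequences for non-ASCII text.
    """
    return list(text.encode("utf-8"))
```

Because each window has the same length, a long recording can be processed one window at a time, and the offset returned with each chunk can be used to place the resulting text on the original timeline.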

The toolkit provides both command-line and programmatic interfaces, enabling developers to integrate speech-to-text capabilities directly into custom software applications or automate high-volume batch processing of media libraries. It includes utilities for accessing multilingual and English-only speech corpora to support model validation and domain-specific performance tuning.
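A batch workflow over a media library might look like the following sketch. The helper names (find_media, build_cli_command, transcribe_library), the suffix list, and the "transcripts" output directory are hypothetical choices made for illustration. The whisper.load_model and transcribe calls, and the --model, --task, and --output_dir options, follow the project's documented usage as best understood here, so check them against the installed version before relying on them.

```python
from pathlib import Path

# Illustrative, not exhaustive: extensions treated as audio in this sketch.
AUDIO_SUFFIXES = {".mp3", ".wav", ".flac", ".m4a"}


def find_media(root):
    """Return audio files under root, sorted so runs are reproducible."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def build_cli_command(files, model="base", task="transcribe",
                      output_dir="transcripts"):
    """Build an argument list for the command-line tool (does not run it)."""
    return ["whisper", *map(str, files),
            "--model", model, "--task", task, "--output_dir", output_dir]


def transcribe_library(root, model="base"):
    """Programmatic route: load the model once, then transcribe each file."""
    import whisper  # imported lazily so the helpers above work without it

    asr = whisper.load_model(model)
    return {str(p): asr.transcribe(str(p))["text"] for p in find_media(root)}
```

The argument list from build_cli_command can be passed to subprocess.run for terminal-style batch jobs, while transcribe_library keeps the model in memory across files, which avoids reloading it for every recording.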

Features

Speech Recognition Systems - Transforms spoken audio into written text or translates across languages using a sequence-to-sequence transformer architecture.
Sequence Models - Maps variable-length audio input sequences to text output sequences using deep learning and byte-level tokenization.
Multi-Task Learning Models - Coordinates speech recognition, translation, and language identification simultaneously by sharing input-output sequences within a single model.
Transformer - Employs stacked attention layers within a sequence-to-sequence design to process audio input and generate corresponding text.
Weakly Supervised Learning - Trains generalized speech representation models by leveraging massive volumes of weakly labeled audio-transcript pairs.
Automatic Speech Recognition - Leverages large-scale, robust models trained on diverse datasets to convert spoken audio recordings into accurate text.
Multilingual Speech Translation - Detects, transcribes, and translates foreign-language audio into English text through automated speech processing.
Speech Recognition APIs - Exposes programmatic interfaces for integrating high-performance speech-to-text capabilities directly into custom software applications.
Speech Recognition Libraries - Simplifies the integration of robust speech-to-text functionality into applications to enable voice-driven features.
Automatic Speech Recognition Toolkits - Bundles command-line and programmatic tools to incorporate high-accuracy speech transcription into automated media processing workflows.
Speech Translation Systems - Automates the identification, transcription, and translation of foreign-language audio into English text.
Additional AI Tools - Robust speech recognition model for transcription and translation.
AI and Agents - A general-purpose automatic speech recognition model.
AI & Machine Learning - General-purpose local speech recognition model.
AI Tools and Frameworks - Robust speech-to-text transcription and translation model.
Audio Generation and Processing - Robust large-scale speech recognition and transcription model.
Core Models - The primary open-source speech recognition model from OpenAI.
Foundation Models - Robust speech recognition model trained on large-scale audio data.
Generative Media Tools - Robust speech recognition and transcription.
Speech Processing - Robust speech-to-text transcription model.
Speech Recognition - Robust speech-to-text transcription model.
Speech to text - Listed in the “Speech to text” section of the Ailia Models awesome list.
Business And Marketing Tools - General-purpose speech recognition model.
CLI Tooling - Enables the execution of complex speech recognition tasks directly from the terminal by selecting specific model sizes and input files.
Batch Media Processors - Streamlines high-volume audio transcription tasks through terminal-based commands for efficient batch processing of media files.
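
The multi-task entry in the list above describes one model handling recognition, translation, and language identification through shared sequences. A minimal sketch of that idea follows: the task is selected by special tokens placed at the start of the output sequence. The token spellings and the three-language subset are assumptions for illustration, and build_prompt and pick_language are hypothetical helpers, not part of the library.

```python
# Assumed token spellings; only a small illustrative subset of languages.
LANGUAGE_TOKENS = {"en": "<|en|>", "ja": "<|ja|>", "de": "<|de|>"}
TASK_TOKENS = {"transcribe": "<|transcribe|>", "translate": "<|translate|>"}


def build_prompt(language, task, timestamps=False):
    """Return the token prefix that tells the decoder what to do."""
    if language not in LANGUAGE_TOKENS:
        raise ValueError(f"unknown language: {language}")
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    prompt = ["<|startoftranscript|>",
              LANGUAGE_TOKENS[language], TASK_TOKENS[task]]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt


def pick_language(scores):
    """Language identification as next-token prediction.

    Given scores for candidate tokens, choose the language whose
    token scores highest; missing tokens count as minus infinity.
    """
    return max(
        LANGUAGE_TOKENS,
        key=lambda code: scores.get(LANGUAGE_TOKENS[code], float("-inf")),
    )
```

With this framing, switching from transcription to translation changes only one token in the prefix, which is why no language-specific rules or separate models are needed.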


Name: openai/whisper
Author: openai

102,828 stars, 12,544 forks, Python, MIT, 14 views

Open-source alternatives to Whisper

Similar open-source projects, ranked by how many features they share with Whisper.

suno-ai/bark
39,159 stars, Jupyter Notebook
Bark is a generative audio engine and machine learning inference library designed to convert written text into high-fidelity speech and sound effects. It functions as a text-to-audio transformer, utilizing multi-stage neural network architectures to map semantic input tokens into detailed audio codebooks for synthesis. The system distinguishes itself through a hierarchical transformer stacking approach that separates semantic understanding from acoustic realization. By employing autoregressive token prediction and vector quantized codebook mapping, the engine bridges linguistic and sonic doma…

NVIDIA/NeMo
17,394 stars, Python
NeMo is a multimodal AI framework and toolkit designed for the development, training, and scaling of large language models, generative AI systems, and speech-based models. It functions as an automatic speech recognition toolkit, a text-to-speech engine, and a framework for building models that process and generate combinations of text, image, and audio data. The project serves as a conversational AI orchestrator capable of managing real-time, interruptible voice interactions. It provides specialized workflows for speech translation, converting spoken audio from one language into text or speec…

k2-fsa/sherpa-onnx
13,017 stars, C++ (tags: aarch64, android, arm32)
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a Web…

facebookresearch/omnilingual-asr
2,671 stars, Python
Omnilingual-ASR is a multilingual automatic speech recognition framework and toolkit designed to transcribe audio across 1,600 languages. It provides a complete pipeline for converting speech to text, including a toolkit for fine-tuning pre-trained speech models to specific languages or datasets using custom training recipes. The system supports zero-shot speech recognition, allowing the model to predict text in unseen languages without extensive training data. It further enables few-shot language guidance through in-context examples and uses language codes to constrain transcription output t…

Frequently asked questions

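What does openai/whisper do?

openai/whisper is a speech recognition and translation engine that uses a sequence-to-sequence transformer architecture to convert audio into text.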

What are the main features of openai/whisper?

The main features of openai/whisper are: Speech Recognition Systems, Sequence Models, Multi-Task Learning Models, Transformer, Weakly Supervised Learning, Automatic Speech Recognition, Multilingual Speech Translation, Speech Recognition APIs.

What are some open-source alternatives to openai/whisper?

Open-source alternatives to openai/whisper include:
suno-ai/bark - Bark is a generative audio engine and machine learning inference library designed to convert written text into…
NVIDIA/NeMo - NeMo is a multimodal AI framework and toolkit designed for the development, training, and scaling of large language…
k2-fsa/sherpa-onnx - Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device…
facebookresearch/omnilingual-asr - Omnilingual-ASR is a multilingual automatic speech recognition framework and toolkit designed to transcribe audio…
m-bain/whisperx - WhisperX is an automated speech recognition toolkit designed to convert spoken audio into text while maintaining…
2noise/chattts - ChatTTS is a conversational text-to-speech generative model designed to convert written dialogue into natural sounding…
