# m-bain/whisperX

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/m-bain-whisperx).**

20,228 stars · 2,148 forks · Python · bsd-2-clause

## Links

- GitHub: https://github.com/m-bain/whisperX
- awesome-repositories: https://awesome-repositories.com/repository/m-bain-whisperx.md

## Topics

`asr` `speech` `speech-recognition` `speech-to-text` `whisper`

## Description

WhisperX is an automatic speech recognition (ASR) toolkit that converts spoken audio into text while keeping the text synchronized with the original media. It works as an integrated pipeline that combines transcription, phoneme-based alignment, and speaker diarization to produce structured, speaker-attributed transcripts.

The project distinguishes itself through its use of forced alignment, which matches an existing transcript to the audio signal at the phoneme level to generate accurate word-level timestamps. It also incorporates speaker diarization to identify and label the distinct voices in a recording, so that each transcript segment can be attributed to an individual speaker.
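
To make the idea of forced alignment concrete, here is a minimal sketch of monotonic alignment by dynamic programming. It is not the project's implementation: it assumes a precomputed table `log_probs[t][j]` (a hypothetical input) giving the log-probability that audio frame `t` belongs to token `j` of a known transcript, and it ignores details such as blank symbols that real acoustic models use. The function name `force_align` is illustrative only.

```python
import math


def force_align(log_probs):
    """Assign every frame to one transcript token, in order.

    log_probs[t][j] is the (assumed, precomputed) log-probability that
    frame t belongs to token j. Returns one (start_frame, end_frame)
    pair per token, with end_frame exclusive.
    """
    n_frames, n_tokens = len(log_probs), len(log_probs[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # score[t][j]: best total log-probability of any monotonic path
    # that starts at (frame 0, token 0) and ends at (frame t, token j).
    score = [[neg_inf] * n_tokens for _ in range(n_frames)]
    # advanced[t][j]: True if the best path entered token j at frame t.
    advanced = [[False] * n_tokens for _ in range(n_frames)]
    score[0][0] = log_probs[0][0]

    for t in range(1, n_frames):
        for j in range(n_tokens):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            advanced[t][j] = move > stay
            best = max(stay, move)
            if best > neg_inf:
                score[t][j] = best + log_probs[t][j]

    # Walk back from the last frame and last token to recover boundaries.
    spans = [[0, 0] for _ in range(n_tokens)]
    j = n_tokens - 1
    spans[j][1] = n_frames
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][j]:
            spans[j][0] = t
            j -= 1
            spans[j][1] = t
    return [tuple(span) for span in spans]
```

Multiplying the returned frame indices by the frame duration of the acoustic model (for example, an assumed 0.02 seconds per frame) turns each span into a start and end time for that token.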

The system supports multilingual transcription and automated caption generation by sequencing multiple machine learning models, including transformer-based recognition and voice activity detection. Inference runs on GPU-accelerated tensor computation, which helps it handle large audio files and the cost of running several neural networks in sequence.
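
As an illustration of what a voice activity detection stage contributes to such a pipeline, the sketch below finds speech regions that later stages could transcribe. It deliberately swaps in a simple energy threshold, which is not a learned detection model and is not taken from the project. The function name `speech_regions` and all parameter defaults are assumptions chosen for the example.

```python
def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.01, min_gap_ms=200):
    """Return (start_sec, end_sec) regions whose mean energy exceeds a threshold.

    samples: sequence of floats in [-1.0, 1.0] (mono audio).
    Regions separated by a gap of at most min_gap_ms are merged.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_gap = min_gap_ms / 1000
    regions = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy < threshold:
            continue  # treat this frame as silence
        t0 = start / sample_rate
        t1 = (start + len(frame)) / sample_rate
        if regions and t0 - regions[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            regions[-1] = (regions[-1][0], t1)
        else:
            regions.append((t0, t1))
    return regions
```

Each returned region can then be passed to the recognition stage on its own, so silence is never sent to the more expensive models.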

## Tags

### Artificial Intelligence & ML

- [Audio Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription.md) — Converts spoken language into written text with precise word-level synchronization. ([source](https://github.com/m-bain/whisperX/search))
- [Automatic Speech Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/automatic-speech-recognition.md) — Provides a high-accuracy engine for converting spoken audio into synchronized text.
- [Whisper-Based Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-engines/whisper-based-engines.md) — Implements a speech-to-text engine that combines forced alignment and speaker diarization for high-precision transcription.
- [Speech Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-transcription.md) — Automates the conversion of spoken audio into accurate written text with word-level timestamps.
- [Multilingual Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/multilingual-transcription.md) — Supports transcription across multiple languages by automatically selecting appropriate alignment models. ([source](https://github.com/m-bain/whisperX/blob/main/EXAMPLES.md))
- [Phoneme-Based Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-alignment-models/phoneme-based-alignment.md) — Improves timestamp accuracy by matching text to audio signals at the phoneme level.
- [Speaker Diarization](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-diarization.md) — Groups audio segments by voice characteristics using embedding extraction and clustering to identify unique speakers.
- [Inference Pipeline Orchestrators](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-pipeline-orchestrators.md) — Sequences multiple machine learning models into an integrated pipeline for transcription, alignment, and speaker identification.
- [Hardware-Accelerated](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/hardware-and-acceleration/tensor-computing-libraries/tensor-libraries/hardware-accelerated.md) — Provides native support for GPU-accelerated tensor computation to optimize complex neural network operations.
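
The embedding-and-clustering approach to speaker diarization mentioned in the list above can be sketched as follows. This is a simplified, greedy stand-in rather than the project's method: it assumes each audio segment already has a voice embedding vector (a hypothetical input), and the `cluster_speakers` name, the label format, and the `threshold` value are illustrative choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_speakers(embeddings, threshold=0.8):
    """Greedily group segment embeddings into speakers.

    Each embedding joins the most similar existing cluster if the
    similarity reaches the threshold; otherwise it starts a new one.
    """
    # One running sum vector per cluster. Cosine similarity ignores
    # scale, so the sum behaves like a centroid here.
    cluster_sums = []
    labels = []
    for emb in embeddings:
        best_idx, best_sim = None, threshold
        for idx, total in enumerate(cluster_sums):
            sim = cosine(emb, total)
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            cluster_sums.append(list(emb))
            best_idx = len(cluster_sums) - 1
        else:
            total = cluster_sums[best_idx]
            cluster_sums[best_idx] = [t + e for t, e in zip(total, emb)]
        labels.append(f"SPEAKER_{best_idx:02d}")
    return labels
```

Production diarization systems typically use more robust clustering and handle overlapping speech, so this sketch only shows the grouping step in its simplest form.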

### Graphics & Multimedia

- [Forced Alignment](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/audio-analysis-synthesis/forced-alignment.md) — Maps text transcripts to specific time intervals within audio files using phoneme-level acoustic models.
- [Multilingual Captioning](https://awesome-repositories.com/f/graphics-multimedia/video-production/captioning-systems/multilingual-captioning.md) — Generates multilingual subtitles for video and audio media, synchronized using word-level timestamps.
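
To show how timestamped segments become captions, here is a minimal sketch that renders a list of segments as SubRip (SRT) text. The segment layout is an assumption made for this example: dictionaries with `start` and `end` in seconds, `text`, and an optional `speaker` label. The helper names `srt_timestamp` and `to_srt` are illustrative, not part of any library.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments (assumed keys: start, end, text, optional speaker) as SRT."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")
        if speaker:
            # Prefix the caption with the speaker label when one is present.
            text = f"[{speaker}] {text}"
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)
```

Writing the returned string to a `.srt` file next to the media file is usually enough for a video player to pick up the captions.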