Omnilingual ASR

Omnilingual-ASR is a multilingual automatic speech recognition (ASR) framework designed to transcribe audio in more than 1,600 languages. It provides a complete speech-to-text pipeline, along with a toolkit for fine-tuning pre-trained speech models on specific languages or datasets using custom training recipes.

The system supports zero-shot speech recognition, allowing the model to transcribe unseen languages without extensive training data. It also supports few-shot language guidance through in-context examples (small sets of audio-transcription pairs supplied at inference time) and uses language codes to constrain transcription output to the correct target language and script.
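As a rough illustration of how a single language code can pin down both language and script, the sketch below parses tags of the form `eng_Latn`. The tag format (a three-letter language code, an underscore, and a four-letter script code) and the names `LangTag` and `parse_lang_tag` are assumptions made for this example, not the toolkit's documented API.

```python
import re
from typing import NamedTuple


class LangTag(NamedTuple):
    """Hypothetical container for a language-plus-script tag."""
    language: str  # assumed: three lowercase letters, e.g. "eng"
    script: str    # assumed: four-letter script code, e.g. "Latn"


# Assumed tag shape: "<lang>_<Script>", e.g. "eng_Latn" or "cmn_Hans".
_TAG_RE = re.compile(r"^([a-z]{3})_([A-Z][a-z]{3})$")


def parse_lang_tag(tag: str) -> LangTag:
    """Split a tag into language and script, rejecting malformed input."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a <lang>_<Script> tag: {tag!r}")
    return LangTag(language=match.group(1), script=match.group(2))


if __name__ == "__main__":
    print(parse_lang_tag("eng_Latn"))
```

Validating the tag before inference makes a mistyped code fail early rather than silently producing output in the wrong language or script.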

The framework supports high-throughput transcription through parallelized batch processing and includes a modular audio pipeline that resamples and normalizes diverse input formats. Resources are managed through a system of asset cards and a command-line interface for retrieving metadata about models, datasets, and tokenizers.
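The resample-and-normalize step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the framework's implementation: the 16 kHz target rate, the linear-interpolation resampler (a simple stand-in for a proper band-limited resampler), the zero-mean, unit-variance normalization, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

TARGET_RATE = 16_000  # assumed target sample rate, in Hz


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each frame into a single sample."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation between neighbouring samples."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = min(i * step, last)  # clamp so we never read past the end
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def standardize(samples: Sequence[float]) -> List[float]:
    """Shift to zero mean and scale to unit variance (zeros if constant)."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
    if std == 0:
        return [0.0 for _ in samples]
    return [(s - mean) / std for s in samples]


def prepare(frames: Sequence[Sequence[float]], src_rate: int) -> List[float]:
    """Mono-mix, resample to TARGET_RATE, then standardize."""
    return standardize(resample_linear(to_mono(frames), src_rate, TARGET_RATE))
```

Running every input through one function such as `prepare` means the model always sees the same sample rate and amplitude statistics, whatever the source file looked like.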

Features

Multilingual Transcription - Provides a framework for transcribing audio in more than 1,600 languages using pre-trained models.

Automatic Speech Recognition - Provides a complete system for transcribing audio in more than 1,600 languages using pre-trained multilingual models.

Speech Model Training - Adapts pre-trained speech checkpoints to specific datasets using custom data preparation and training recipes.

Zero-Shot Recognition - Enables transcription of spoken audio in unseen languages without requiring training data specific to those languages.

Multilingual ASR Frameworks - Offers a speech recognition system for transcribing audio in more than 1,600 languages using pre-trained multilingual models.

Multilingual Audio Processing - Manages and processes speech data across thousands of languages with tools for resampling and normalization.

Speech-to-Text Modeling Toolkits - Provides a toolkit for adapting pre-trained checkpoints to specific languages or datasets using custom training recipes.

Speech Transcription - Converts spoken audio recordings into written text quickly and at scale across various file formats.

Transcription Language Configurations - Implements language code constraints to ensure transcription output matches the intended target language and script.

Zero-Shot Inference - Transcribes spoken audio in new or unseen languages without requiring extensive task-specific training data.

Cross-Lingual Transfer - Leverages pre-trained multilingual weights to perform zero-shot recognition on unseen languages.

Pretrained Checkpoint Fine-Tuning - Enables adapting large pre-trained speech models to specific domain datasets using customized training recipes.

Audio Processing - Converts audio from file paths, buffers, or dictionaries by automatically resampling and normalizing data.

Batch Transcription - Processes multiple audio segments simultaneously through specialized architectures to increase transcription throughput.

High-Throughput Transcription - Generates transcriptions in parallel using specialized models to maximize the volume of audio processed per second.

Few-Shot ASR Adaptation - Performs inference on unseen languages by providing a small set of audio-transcription pairs as examples.

Language-Constrained Inference - Uses specific language identifiers to constrain the transcription output to the correct target language and script.

Audio-Transcription Exemplars - Directs the model to recognize new languages by providing small sets of audio-transcription pairs during inference.

Audio Normalization Pipelines - Automatically normalizes various audio input formats into a consistent sample rate for model compatibility.
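
The Batch Transcription and High-Throughput Transcription entries above describe processing several audio segments at once. A minimal sketch of that batching pattern follows; `batched`, `transcribe_all`, and the `transcribe_batch` callable are hypothetical names standing in for whatever model call is actually used, and the default batch size is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def transcribe_all(
    audio_paths: Sequence[str],
    transcribe_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,  # arbitrary default for illustration
) -> List[str]:
    """Run a batch-level transcriber over all inputs, preserving input order."""
    results: List[str] = []
    for batch in batched(audio_paths, batch_size):
        results.extend(transcribe_batch(batch))
    return results


if __name__ == "__main__":
    # Placeholder transcriber: a real one would run the model on each batch.
    fake = lambda batch: [f"<text for {path}>" for path in batch]
    print(transcribe_all(["a.wav", "b.wav", "c.wav"], fake, batch_size=2))
```

Larger batches generally raise throughput until memory becomes the limit, so the batch size is worth tuning per model and hardware.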

facebookresearch/omnilingual-asr
