WeNet

Features

Bilingual ASR Platforms - Provides a complete end-to-end ASR platform for both Chinese and English languages with pretrained models.

End-to-End Pipelines - Delivers an integrated end-to-end pipeline from data preparation through model training to deployment for ASR.

Two-Pass Decoding - Uses an initial CTC decoder to generate n-best hypotheses, then rescores them with a full attention decoder for higher accuracy.

Decoding Graph Builders - Builds a decoding graph by composing acoustic model units, a lexicon, and a language model into a single WFST.

Speech Model Training - Provides end-to-end training for speech recognition models with multi-GPU support and checkpoint resumption.

Transformer ASR Training Workflows - Ships a full training pipeline for transformer-based ASR models with multi-GPU support and checkpoint resumption.

Streaming Recognition - Provides real-time speech transcription with configurable chunk size for low-latency processing.

Streaming ASR Engines - Implements a real-time streaming ASR engine with configurable chunk size for low-latency transcription.

Production Inference Exports - Exports trained models to TorchScript or other serialized formats for deployment in C++ runtimes on servers and Android.

TorchScript Exports - Serializes the trained PyTorch model into TorchScript format for deployment in a standalone C++ runtime without Python.

Attention Rescoring Decoders - Rescores n-best hypotheses with an attention decoder and selects the highest-scoring transcription.

Speech Transcription - Decodes audio using multiple strategies, supports streaming and non-streaming transcription, and evaluates word error rate.

Joint CTC and Attention Training - Trains the model with both CTC and attention loss simultaneously to leverage complementary strengths of alignment and contextual modeling.

Speech Decoding Transducers - Integrates CTC probabilities, a lexicon, and an external language model into a single search graph for beam search decoding.

Chunked Streaming Transformers - Processes audio in fixed-size chunks to enable low-latency streaming while maintaining context across chunks via chunk-level self-attention.

Transformer ASR Toolkits - Provides a toolkit for building and deploying transformer-based end-to-end speech recognition models.

Custom Phrase Biasing Methods - Injects prior knowledge from a user-provided phrase list to bias recognition toward specific words or phrases.

Audio-Transcript Aligners - Aligns an audio recording to a given text transcript, producing per-word timestamps and confidence scores.

Word-Level Timestamps - Extracts word-level timestamps from CTC spike outputs of the encoder for alignment and downstream processing.

x86 and Android Inference Targets - Supports running ASR inference on x86 servers and Android devices via a C++ runtime.

WFST Integrations - Integrates external language models using weighted finite-state transducer graphs to improve recognition accuracy.

WFST Language Model Adapters - Integrates an external language model into decoding using weighted finite-state transducers to boost recognition accuracy.

Model Exporting - Exports trained ASR models to serialized formats for production inference in other languages.

C++ Inference Exports - Exports trained models to a format deployable in C++ runtimes without Python dependencies.

N-Best Hypothesis Generators - Generates N-best transcription hypotheses using CTC WFST search with a language model for improved accuracy.

Production-Ready ASR Toolkits - Provides a production-grade ASR toolkit with multi-GPU training, TorchScript export, and C++ inference on servers and mobile.

Streaming and Batch Serving - Serves trained ASR models in both real-time streaming and batch processing modes for production use.
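
As an illustration of the chunked streaming entries above, the sketch below builds the kind of attention mask that restricts each frame to its own chunk plus earlier chunks. It is a minimal pure-Python sketch, not WeNet's implementation: the function name chunk_attention_mask and the num_left_chunks parameter are hypothetical, chosen here for illustration.

```python
def chunk_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical helper for illustration only.
    num_left_chunks < 0 means all earlier chunks remain visible.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # Right edge: the end of the current chunk, so no look-ahead past it.
        end = min((chunk_idx + 1) * chunk_size, num_frames)
        # Left edge: full history, or only a limited number of past chunks.
        if num_left_chunks < 0:
            start = 0
        else:
            start = max((chunk_idx - num_left_chunks) * chunk_size, 0)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    # Print a 6-frame mask with 2-frame chunks as rows of 1s and 0s.
    for row in chunk_attention_mask(6, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

A smaller chunk size lowers latency because output can be produced after fewer frames, at the cost of less right context per frame.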

WeNet is an end-to-end automatic speech recognition (ASR) toolkit designed for both Chinese and English, built around transformer-based models. It supports streaming and non-streaming inference out of the box, and is structured to be production-ready, with model export and deployment paths for servers and mobile devices.

The toolkit distinguishes itself through a chunk-based streaming transformer architecture that processes audio in fixed-size segments for low latency while preserving context across chunks. It jointly trains models with both CTC and attention loss to combine alignment accuracy with contextual modeling. Decoding employs a two-pass strategy: an initial CTC decoder generates n-best hypotheses, which are then rescored with a full attention decoder. Weighted finite-state transducer (WFST) decoding integrates an external language model for higher accuracy, and the entire model can be exported to TorchScript for C++ inference without Python dependencies.
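
The second pass of the two-pass strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolkit's code: rescore_nbest, the attention_score callable, and the ctc_weight parameter are hypothetical names, and adding the attention log-probability to a weighted CTC log-probability is one common way to combine the two passes.

```python
def rescore_nbest(hypotheses, attention_score, ctc_weight=0.5):
    """Pick the best hypothesis after attention rescoring.

    hypotheses: list of (tokens, ctc_log_prob) pairs from the first pass.
    attention_score: callable mapping tokens to a log-probability
        (a stand-in for the attention decoder; hypothetical interface).
    ctc_weight: how much the first-pass CTC score still counts (assumed).
    """
    best_tokens, best_score = None, float("-inf")
    for tokens, ctc_log_prob in hypotheses:
        # Combined score: attention decoder score plus weighted CTC score.
        score = attention_score(tokens) + ctc_weight * ctc_log_prob
        if score > best_score:
            best_tokens, best_score = tokens, score
    return best_tokens, best_score


if __name__ == "__main__":
    # Toy first-pass output: the CTC pass slightly prefers ("a", "b").
    nbest = [(("a", "b"), -1.0), (("a", "c"), -2.0)]
    # Toy attention scores: the second pass strongly prefers ("a", "c").
    toy_attention = {("a", "b"): -3.0, ("a", "c"): -0.5}
    print(rescore_nbest(nbest, toy_attention.get, ctc_weight=0.5))
```

In this toy example the attention scores overturn the first-pass ranking. A larger ctc_weight shifts the decision back toward the CTC ordering.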

Beyond the core recognition engine, WeNet provides a complete pipeline for data preparation, including distributed partitioning, feature normalization, and token dictionary construction. Model training supports multi-GPU setups, checkpoint resumption, and TensorBoard monitoring. Decoding capabilities extend to audio-transcript alignment, word-level timestamp extraction, and N-best generation both with and without a language model. Custom phrase biasing allows injecting prior knowledge to bias recognition toward specific words. Pretrained model snapshots are available for reproducing published results or immediate use.
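
Timestamp extraction from CTC spikes can be illustrated with a small sketch. It assumes per-frame greedy CTC labels are already available. The function name, the 10 ms frame shift, and the 4x subsampling factor are illustrative assumptions rather than values taken from WeNet. The sketch yields token-level times; grouping tokens into words would be a further step.

```python
def ctc_spike_times(frame_token_ids, blank_id=0, frame_shift_ms=10, subsampling=4):
    """Return (token_id, time_ms) for each token kept by greedy CTC collapse.

    frame_token_ids: the most likely token id for each encoder output frame.
    blank_id, frame_shift_ms, subsampling: assumed values for illustration.
    """
    spikes = []
    prev = blank_id
    for frame_idx, token_id in enumerate(frame_token_ids):
        # A spike is a non-blank label that differs from the previous frame;
        # repeated labels and blanks are collapsed, as in CTC greedy decoding.
        if token_id != blank_id and token_id != prev:
            # Each encoder frame covers frame_shift_ms * subsampling of audio.
            spikes.append((token_id, frame_idx * frame_shift_ms * subsampling))
        prev = token_id
    return spikes


if __name__ == "__main__":
    # Token 5 spikes at frame 2; token 7 spikes at frames 5 and 8.
    print(ctc_spike_times([0, 0, 5, 5, 0, 7, 0, 0, 7]))
```

The blank frame between the two occurrences of token 7 is what lets them count as separate emissions instead of a single repeated label.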
