WeNet is an end-to-end automatic speech recognition (ASR) toolkit designed for both Chinese and English, built around transformer-based models. It supports streaming and non-streaming inference out of the box, and is structured to be production-ready, with model export and deployment paths for servers and mobile devices.
The toolkit distinguishes itself through a chunk-based streaming transformer architecture that processes audio in fixed-size chunks for low latency while preserving context across chunks. Models are trained jointly with CTC and attention losses, combining the alignment accuracy of CTC with the contextual modeling of attention. Decoding uses a two-pass strategy: a first-pass CTC decoder generates n-best hypotheses, which a full attention decoder then rescores. Weighted finite-state transducer (WFST) decoding can integrate an external language model for higher accuracy, and the entire model can be exported to TorchScript for C++ inference without Python dependencies.
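The three ideas in the paragraph above (chunk-limited attention, a joint CTC/attention loss, and second-pass rescoring) can be illustrated with a small, dependency-free sketch. This is not WeNet's API: the function names `chunk_attention_mask`, `joint_loss`, and `rescore` are hypothetical, the default weights are illustrative values, and the linear interpolation of losses and scores is the common hybrid CTC/attention formulation, assumed here rather than taken from the text.

```python
from typing import List, Tuple


def chunk_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Return mask[i][j], True when frame i may attend to frame j.

    Each frame sees all earlier frames plus the rest of its own chunk,
    so past context is preserved while look-ahead stops at the chunk end.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        # One past the last frame of the chunk that contains frame i.
        chunk_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


def joint_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate the CTC and attention losses (assumed linear form)."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def rescore(
    nbest: List[Tuple[str, float, float]], ctc_weight: float = 0.5
) -> List[Tuple[str, float]]:
    """Re-rank first-pass hypotheses with second-pass attention scores.

    Each entry is (text, ctc_log_prob, attention_log_prob). The combined
    score is attention + ctc_weight * ctc (an assumed combination rule).
    Hypotheses are returned best first.
    """
    scored = [(text, att + ctc_weight * ctc) for text, ctc, att in nbest]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Print the mask for 5 frames in chunks of 2 ('x' = may attend).
    for row in chunk_attention_mask(num_frames=5, chunk_size=2):
        print("".join("x" if allowed else "." for allowed in row))
    print(rescore([("hello word", -1.0, -5.0), ("hello world", -2.0, -3.0)]))
```

In the rescoring example the second hypothesis has the worse CTC score but the better attention score, so it moves to the top after the second pass, which is the point of rescoring with the full decoder.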
Beyond the core recognition engine, WeNet provides a complete data preparation pipeline, including distributed partitioning, feature normalization, and token dictionary construction. Model training supports multi-GPU setups, checkpoint resumption, and TensorBoard monitoring. Decoding capabilities extend to audio-transcript alignment, word-level timestamp extraction, and n-best generation both with and without a language model. Custom phrase biasing steers recognition toward user-specified words and phrases. Pretrained model snapshots are available for reproducing published results or for immediate use.
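Two of the data preparation steps named above, feature normalization and token dictionary construction, can be sketched as follows. This is a minimal illustration under stated assumptions, not WeNet's own tooling: normalization is taken to mean global mean and variance normalization over all training frames, the dictionary is character-level, and the special symbols `<blank>`, `<unk>`, and `<sos/eos>` with their positions are an assumed layout. The helper names `global_cmvn`, `apply_cmvn`, and `build_char_dict` are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Frame = Sequence[float]


def global_cmvn(utterances: Iterable[Sequence[Frame]]) -> Tuple[List[float], List[float]]:
    """Compute per-dimension mean and standard deviation over all frames."""
    total: List[float] = []
    total_sq: List[float] = []
    count = 0
    for utterance in utterances:
        for frame in utterance:
            if not total:
                # Size the accumulators from the first frame seen.
                total = [0.0] * len(frame)
                total_sq = [0.0] * len(frame)
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
            count += 1
    if count == 0:
        raise ValueError("no frames to accumulate statistics from")
    mean = [s / count for s in total]
    # Floor the variance so constant dimensions do not divide by zero.
    std = [
        math.sqrt(max(sq / count - m * m, 1e-20))
        for sq, m in zip(total_sq, mean)
    ]
    return mean, std


def apply_cmvn(frame: Frame, mean: Sequence[float], std: Sequence[float]) -> List[float]:
    """Normalize one frame to zero mean and unit variance per dimension."""
    return [(v - m) / s for v, m, s in zip(frame, mean, std)]


def build_char_dict(transcripts: Iterable[str]) -> Dict[str, int]:
    """Map each non-space character to an integer id (assumed layout)."""
    symbols = sorted({ch for text in transcripts for ch in text if not ch.isspace()})
    table = {"<blank>": 0, "<unk>": 1}
    for ch in symbols:
        table[ch] = len(table)
    # The start/end-of-sentence symbol takes the last id.
    table["<sos/eos>"] = len(table)
    return table


if __name__ == "__main__":
    feats = [[[1.0, 10.0], [3.0, 30.0]]]  # one utterance, two 2-dim frames
    mean, std = global_cmvn(feats)
    print(apply_cmvn(feats[0][1], mean, std))
    print(build_char_dict(["ab", "b c"]))
```

Computing the statistics once over the whole training set, rather than per utterance, keeps normalization identical between training and streaming inference, where the full utterance is not available up front.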