This project is a deep learning framework for Chinese automatic speech recognition (ASR) that converts spoken Chinese audio into written text. It serves as a toolkit for training, evaluating, and deploying speech-to-text models, and it includes a pinyin-to-text converter that maps phonetic (pinyin) sequences to Chinese characters using a probability graph model.
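The description does not spell out how the probability graph model is built, so the following is only a minimal sketch of the general idea, assuming a first-order (bigram) model decoded with Viterbi search. The dictionary, the probabilities, the `FLOOR` back-off value, and the function name `pinyin_to_text` are all hypothetical and chosen purely for illustration; they are not taken from the project.

```python
import math

# Hypothetical toy data for illustration only (not from the project).
PINYIN_TO_CHARS = {
    "ni3": ["你", "拟"],
    "hao3": ["好", "郝"],
}
UNIGRAM = {"你": 0.6, "拟": 0.1, "好": 0.5, "郝": 0.05}
BIGRAM = {("你", "好"): 0.8, ("拟", "好"): 0.1}
FLOOR = 1e-6  # assumed back-off probability for unseen characters or pairs


def pinyin_to_text(pinyins):
    """Return the most probable character string for a pinyin sequence."""
    if not pinyins:
        return ""
    # best[ch] = (log-probability, best path so far that ends in ch)
    best = {
        ch: (math.log(UNIGRAM.get(ch, FLOOR)), ch)
        for ch in PINYIN_TO_CHARS[pinyins[0]]
    }
    for py in pinyins[1:]:
        step = {}
        for ch in PINYIN_TO_CHARS[py]:
            # Keep only the best-scoring predecessor for each candidate.
            step[ch] = max(
                (score + math.log(BIGRAM.get((prev, ch), FLOOR)), path + ch)
                for prev, (score, path) in best.items()
            )
        best = step
    # Pick the highest-scoring path over all final candidates.
    return max(best.values())[1]


print(pinyin_to_text(["ni3", "hao3"]))  # prints 你好
```

Each pinyin syllable expands to its candidate characters, and each step keeps only the best path into every candidate, so the search cost grows linearly with the length of the sequence rather than exponentially.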
The system is notable for its deployment flexibility. A dockerized recognition server exposes transcription as a remote API, and a gRPC speech-to-text interface supports bidirectional streaming for real-time transcription and asynchronous audio streaming.
The framework covers the full machine learning workflow: training custom acoustic and language models, n-gram language modeling, and accuracy evaluation by word error rate. It handles the entire audio pipeline, from parsing raw WAVE files and extracting features to hosting recognition services behind RESTful API gateways.
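Word error rate is conventionally the edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that standard definition. It is not the project's own evaluation code, and the function names are hypothetical. The description does not say whether the project scores words, characters, or pinyin syllables, so the functions accept any token sequence.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def word_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference tokens."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# Character-level scoring: one substitution in six reference characters.
reference = list("今天天气很好")
hypothesis = list("今天天汽很好")
print(word_error_rate(reference, hypothesis))
```

Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.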