pycorrector is an open-source toolkit for detecting and correcting spelling and grammar errors in Chinese text. It combines several correction approaches: rule-based methods built on KenLM n-gram language models and confusion sets, and deep learning correctors built on BERT, GPT, and T5 models. The toolkit also provides a command-line interface for batch processing Chinese text files, with configurable detection and output options.
The project offers a range of correction strategies that can be mixed and matched. Rule-based correction uses character-level perplexity scores from a language model to detect likely errors, then suggests fixes based on phonetic and shape similarity. Deep learning correctors handle more complex cases: BERT-based models target character-level errors, GPT-based models target phonetic and grammatical mistakes, and T5 sequence-to-sequence models target grammar errors where the corrected text differs in length from the input, such as missing or extra characters. The toolkit also supports pinyin-to-character conversion and conversion between traditional and simplified Chinese characters in both directions.
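The rule-based idea can be illustrated with a small, self-contained sketch. It does not use pycorrector's API or KenLM: the corpus, the confusion set, and the function names `train_bigram`, `perplexity`, and `correct` are all hypothetical stand-ins chosen for illustration. It also skips a separate detection pass and simply tries each confusion-set substitution, keeping the one that lowers sentence perplexity the most.

```python
import math
from collections import Counter

# Toy corpus standing in for a real n-gram language model (assumption).
CORPUS = ["今天天气很好", "我今天很高兴", "天气很好我很高兴"]

# Hypothetical confusion set: character -> similar-sounding or similar-looking characters.
CONFUSION = {"情": ["气", "晴"], "新": ["兴", "心"]}


def train_bigram(corpus):
    """Count character bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sent in corpus:
        chars = ["<s>"] + list(sent) + ["</s>"]
        contexts.update(chars[:-1])
        bigrams.update(zip(chars, chars[1:]))
    vocab_size = len({c for s in corpus for c in s} | {"</s>"})
    return contexts, bigrams, vocab_size


def perplexity(sentence, model):
    """Per-character perplexity under an add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    chars = ["<s>"] + list(sentence) + ["</s>"]
    log_prob = 0.0
    for a, b in zip(chars, chars[1:]):
        log_prob += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return math.exp(-log_prob / (len(chars) - 1))


def correct(sentence, model):
    """Try single-character substitutions from the confusion set; keep the lowest-perplexity result."""
    best, best_ppl = sentence, perplexity(sentence, model)
    for i, ch in enumerate(sentence):
        for cand in CONFUSION.get(ch, []):
            trial = sentence[:i] + cand + sentence[i + 1:]
            ppl = perplexity(trial, model)
            if ppl < best_ppl:
                best, best_ppl = trial, ppl
    return best


MODEL = train_bigram(CORPUS)
print(correct("今天天情很好", MODEL))
```

A real system would work from a large language model and much richer confusion data, and would usually threshold the perplexity gain so that correct but rare text is left alone.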
Beyond the core correction pipeline, pycorrector provides tools for model training, evaluation, and customization. Users can fine-tune pretrained language models on their own labeled datasets to create domain-specific correctors, and measure accuracy with precision, recall, and F1 scores at both the character and sentence levels. Correction behavior is configurable through custom word lists, confusion character sets, mapping tables, and detection sensitivity thresholds. The project also supports loading custom KenLM language models and pretrained correction model checkpoints from Hugging Face.
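As an illustration of sentence-level scoring, the sketch below follows one common convention; it is an assumption and not necessarily the exact rule pycorrector's evaluation scripts apply. A sentence that contains an error counts as a true positive only if the prediction matches the reference exactly, and otherwise as a false negative. A sentence that was already correct counts as a false positive if the corrector changed it. The function name `sentence_level_prf` and the sample data are hypothetical.

```python
def sentence_level_prf(triples):
    """Compute sentence-level precision, recall, and F1.

    Each triple is (source, target, prediction):
      source     - the original, possibly erroneous sentence
      target     - the reference correction
      prediction - the corrector's output
    """
    tp = fp = fn = 0
    for src, tgt, pred in triples:
        if src != tgt:            # the sentence contains an error
            if pred == tgt:
                tp += 1           # fully corrected
            else:
                fn += 1           # missed or wrongly corrected
        elif pred != src:
            fp += 1               # a correct sentence was changed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


SAMPLES = [
    ("今天天情很好", "今天天气很好", "今天天气很好"),  # error, fixed
    ("我很高新", "我很高兴", "我很高新"),              # error, missed
    ("天气很好", "天气很好", "天气很好"),              # correct, left alone
    ("我很高兴", "我很高兴", "我很高心"),              # correct, wrongly changed
]
print(sentence_level_prf(SAMPLES))
```

Character-level scoring follows the same pattern but counts individual edited positions rather than whole sentences, so it gives partial credit when only some of the errors in a sentence are fixed.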