Pycorrector

pycorrector is an open-source toolkit for detecting and correcting spelling and grammar errors in Chinese text. It combines multiple correction approaches: rule-based methods that use KenLM n-gram language models and confusion sets, and deep learning correctors built on BERT, GPT, and T5 models. The toolkit also provides a command-line interface for batch processing Chinese text files with configurable detection and output options.

The project distinguishes itself by offering a range of correction strategies that can be mixed and matched. Rule-based correction uses character-level perplexity scoring from a language model to detect errors, then evaluates phonetic and shape-similar features to suggest fixes. Deep learning correctors handle more complex cases, with BERT-based models for character-level errors, GPT-based models for phonetic and grammatical mistakes, and T5 sequence-to-sequence models for length-mismatched grammar errors like missing or extra characters. The toolkit also supports pinyin-to-character conversion and bidirectional traditional-simplified Chinese character conversion.
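The detect-then-rank idea behind the rule-based path can be sketched in a few lines. This is a minimal illustration, not pycorrector's implementation: the toy corpus, the add-one-smoothed character bigram model (standing in for a KenLM model), the `CONFUSION` table, the `-2.5` threshold, and the function names `detect` and `correct` are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus standing in for the data behind a real n-gram model (assumption).
CORPUS = ["我今天很高兴", "今天天气很好", "我很高兴见到你", "他今天很高兴"]

# Hypothetical confusion set: character -> sound-alike or look-alike candidates.
CONFUSION = {"心": ["兴", "新"], "田": ["天"]}

# Count character unigrams and bigrams, with sentence boundary markers.
unigrams, bigrams = Counter(), Counter()
for sent in CORPUS:
    chars = ["<s>"] + list(sent) + ["</s>"]
    unigrams.update(chars)
    bigrams.update(zip(chars, chars[1:]))
VOCAB_SIZE = len(unigrams)


def logprob(prev, cur):
    """Add-one smoothed log P(cur | prev) under the toy bigram model."""
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + VOCAB_SIZE))


def sentence_score(sent):
    """Total log-probability of a sentence; higher means more fluent."""
    chars = ["<s>"] + list(sent) + ["</s>"]
    return sum(logprob(a, b) for a, b in zip(chars, chars[1:]))


def detect(sent, threshold=-2.5):
    """Return positions whose mean left/right bigram log-probability is below threshold."""
    chars = ["<s>"] + list(sent) + ["</s>"]
    flagged = []
    for i in range(1, len(chars) - 1):
        left = logprob(chars[i - 1], chars[i])
        right = logprob(chars[i], chars[i + 1])
        if (left + right) / 2 < threshold:
            flagged.append(i - 1)  # index into the original sentence
    return flagged


def correct(sent, threshold=-2.5):
    """Try confusion-set candidates at each flagged position; keep the best-scoring sentence."""
    for pos in detect(sent, threshold):
        best, best_score = sent, sentence_score(sent)
        for cand in CONFUSION.get(sent[pos], []):
            trial = sent[:pos] + cand + sent[pos + 1:]
            trial_score = sentence_score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
        sent = best
    return sent


print(detect("我今天很高心"))
print(correct("我今天很高心"))
```

Lowering the threshold makes detection more conservative, and raising it flags more positions. A candidate replaces the original character only when it strictly improves the sentence score, so a flagged position with no better candidate is left unchanged.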

Beyond the core correction pipeline, pycorrector provides tools for model training, evaluation, and customization. Users can fine-tune pretrained language models on their own labeled datasets to create domain-specific correctors, and measure accuracy through precision, recall, and F1 scores at both character and sentence levels. Correction behavior is configurable through custom word lists, confusion character sets, mapping tables, and detection sensitivity thresholds. The project also supports loading custom KenLM language models and pretrained correction model checkpoints from Hugging Face.
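As a rough illustration of the sentence-level metrics mentioned above, the sketch below scores a corrector from (source, gold, prediction) triples. The counting convention is an assumption chosen for the example (a prediction counts as a true positive only when it changes the source and matches the gold sentence exactly); pycorrector's own evaluation scripts may define the counts differently, and the function name `sentence_level_prf` and the sample data are hypothetical.

```python
def sentence_level_prf(triples):
    """Compute sentence-level correction precision, recall, and F1.

    Each triple is (source, gold, prediction).
    Assumed convention:
      predicted positive: the corrector changed the sentence (pred != source)
      gold positive:      the sentence contains an error (gold != source)
      true positive:      the corrector changed it and matched gold exactly
    """
    true_pos = predicted_pos = gold_pos = 0
    for source, gold, pred in triples:
        if pred != source:
            predicted_pos += 1
            if pred == gold:
                true_pos += 1
        if gold != source:
            gold_pos += 1
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical evaluation data: one correct fix, one untouched clean sentence,
# one missed error, and one unnecessary change.
EXAMPLES = [
    ("少先队员因该为老人让坐", "少先队员应该为老人让座", "少先队员应该为老人让座"),
    ("今天天气很好", "今天天气很好", "今天天气很好"),
    ("我今天很高心", "我今天很高兴", "我今天很高心"),
    ("他喜欢看书", "他喜欢看书", "她喜欢看书"),
]

precision, recall, f1 = sentence_level_prf(EXAMPLES)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A character-level variant would apply the same counting to individual edited positions instead of whole sentences.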

Features

Error Correction Toolkits - Provides a comprehensive open-source toolkit for detecting and correcting spelling and grammar errors in Chinese text.

Confusion Set Matchers - Replaces characters or words by looking them up in hand-curated phonetic, shape-similar, or custom mapping tables.

Chinese Error Correction Models - Ships a fine-tuned BERT model for detecting and correcting character-level spelling errors in Chinese text.

N-Gram Language Models - Scores candidate corrections by their perplexity under a KenLM n-gram language model to rank and select the best fix.

Chinese Spelling Correctors - Provides a rule-based corrector using KenLM n-gram language models for phonetic and shape-similar character errors.

Phonetic Homophone Correction - Replaces characters that sound alike but are wrong using a confusion set of homophones to fix common misspellings.

Perplexity-Based Error Detectors - Detects erroneous character positions by comparing character-level perplexity scores against a tunable threshold from a language model.

Chinese Spelling Correctors - Corrects Chinese spelling mistakes using statistical and neural models for phonetic and shape-similar errors.

Shape-Similar Error Correctors - Replaces characters that look alike but are wrong using a confusion set of visually similar glyphs.

Text Error Detection APIs - Provides APIs that identify error positions and types in text without performing correction.

Chinese Error Detection APIs - Implements APIs that detect erroneous characters in Chinese text and return their positions and error types.

Chinese Error Correction Models - Ships GPT-based correctors (ChatGLM3, Qwen2.5) for detecting and fixing phonetic and grammatical errors in Chinese.

Chinese Error Correction Models - Uses a fine-tuned BERT model to predict and correct character-level errors in Chinese text input.

Transformer Fine-Tuning - Fine-tunes encoder-decoder or decoder-only transformer models on paired error-correction data for sequence-level text correction.

Chinese Error Correction Models - Ships a T5 sequence-to-sequence corrector for handling length-mismatched grammar errors in Chinese text.

Deep Learning Correctors - Employs end-to-end deep learning models like RNN, CRF, and transformers to automatically fix errors in Chinese text.

Built-In Language Model Correctors - Corrects Chinese text errors using built-in language models for phonetic, shape-similar, and grammatical mistakes.

Multi-Model Chinese Correctors - Combines rule-based, statistical, and deep learning models to correct Chinese text errors.

Rule-Based Chinese Correctors - Detects misspelled characters via language model perplexity and corrects them using phonetic and shape-similar features.

Custom Model Training - Supports fine-tuning pretrained language models on user-annotated datasets to adapt error correction to specific domains.

Error Correction Model Training - Supports fine-tuning pretrained language models on labeled Chinese text pairs to produce custom error correctors.

T5 Correction Model Predictions - Provides a trained T5 model that corrects Chinese text and annotates errors in the output.

Pretrained Model Loading - Loads pretrained correction model checkpoints from Hugging Face for Chinese text error correction without retraining.

Custom Language Model Correctors - Loads a user-supplied KenLM language model to tailor spelling correction to a specific domain.

Error Correction Model Training - Provides training scripts for BERT models on paired error-correction data to learn character-level corrections.

N-Gram Frequency Rankers - Ranks candidate corrections by counting character n-gram frequencies in a corpus without requiring word segmentation.

Batch Text Correctors - Provides a command-line interface for batch processing Chinese text files with configurable correction options.

Automated Text Corrections - Runs batch text correction on a file from the terminal using a language model, with options to control output detail and character detection.

Batch Text Correctors - Runs batch Chinese text correction on a file from the terminal with configurable output and detail level.

Chinese Grammar Error Corrections - Corrects Chinese grammar errors like missing or extra characters using sequence-to-sequence models.

GPT-Based Chinese Correctors - Ships a GPT-based corrector that detects and fixes phonetic, shape-similar, and grammatical errors in Chinese sentences.

N-Gram Language Model Corrections - Scores candidate corrections with a KenLM n-gram language model to detect and fix ungrammatical sequences.

Pinyin-to-Text Converters - Converts pinyin input to the most likely Chinese character sequence using a statistical language model and n-gram frequency counts.

Correction Accuracy Evaluators - Provides evaluation tools to measure precision, recall, and F1 scores of correction models on benchmark test sets.
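Several of the features above describe returning error positions alongside the corrected text. One way to derive such annotations from a source sentence and its correction is to diff the two strings, which also covers length-mismatched cases such as a missing character. The sketch below uses Python's standard `difflib` for this; the helper name `annotate_errors` and the `(wrong, right, position)` tuple layout are assumptions for illustration, not pycorrector's output format.

```python
from difflib import SequenceMatcher


def annotate_errors(source, target):
    """List (wrong_text, right_text, start_position_in_source) for each edit.

    Substitutions have both texts non-empty; a missing character has an
    empty wrong_text, and an extra character has an empty right_text.
    """
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((source[i1:i2], target[j1:j2], i1))
    return errors


# Same-length substitution errors.
print(annotate_errors("少先队员因该为老人让坐", "少先队员应该为老人让座"))

# Length mismatch: the source is missing one character.
print(annotate_errors("今天气很好", "今天天气很好"))
```

Diffing keeps the annotation step independent of which corrector produced the target sentence, so the same helper works for rule-based and model-based output.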
