This is a Chinese text segmentation library that converts Chinese characters into their phonetic pinyin representation. It functions as a polyphone disambiguation tool, resolving ambiguous pronunciations for multi-sound characters using word segmentation and context analysis, and also serves as a pinyin sorting utility for ordering Chinese strings alphabetically. The library distinguishes itself through surname-aware pronunciation switching, applying specialized phonetic rules for Chinese surnames with non-standard pronunciations in name contexts. It supports pluggable word segmentation algor
python-pinyin is a Python library for transliterating simplified and traditional Chinese characters into phonetic pinyin. It functions as a transliteration system that converts text while supporting tone sandhi and providing utilities to transform pinyin between different formats, such as numeric tones, accent marks, or phonetic initials. The library features a polyphonic character resolver that analyzes surrounding word context to select the correct pronunciation for characters with multiple sounds. It also includes a customizable dictionary system that allows the extension of default transl
pinyin-pro is a Chinese pinyin transcription library and text segmentation tool. It converts Chinese characters into pinyin with support for tones, initials, and finals, while resolving polyphonic characters based on context. The project includes a pinyin pattern matching engine that enables searching Chinese text using full spellings, initials, or hybrid phonetic patterns. It also features a pinyin HTML generator that wraps characters and their transcriptions in markup tags for styled web display. The library provides capabilities for Chinese text segmentation, surname pronunciation priorit
ToolGood.Words is a sensitive word filtering library and text sanitization component designed for high-performance detection and masking of prohibited terms. It provides tools for Chinese text normalization, pinyin transliteration, and the replacement of banned words with placeholders. The project is distinguished by its ability to uncover obfuscated language through a pinyin transliteration engine and phonetic-based detection. It identifies sensitive content hidden by phonetic substitutions, first-letter initials, or intentional misspellings by mapping Chinese characters to pinyin representa