This is a Chinese text segmentation library that converts Chinese characters into their phonetic pinyin representation. It functions as a polyphone disambiguation tool, resolving ambiguous pronunciations for polyphonic characters (those with more than one reading) using word segmentation and context analysis, and also serves as a pinyin sorting utility for ordering Chinese strings by their pinyin. The library distinguishes itself through surname-aware pronunciation switching, applying specialized phonetic rules for Chinese surnames whose readings in name contexts differ from the usual pronunciation. It supports pluggable word segmentation algor
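The combination of segmentation-based disambiguation and surname switching described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the dictionaries `CHAR_READINGS`, `PHRASE_READINGS`, and `SURNAME_READINGS` and the function `to_pinyin` are hypothetical names with toy data, and greedy longest-match stands in for whatever segmentation algorithm is actually plugged in.

```python
# Hypothetical toy dictionaries; a real library ships far larger ones.
CHAR_READINGS = {
    "重": "zhòng", "庆": "qìng", "要": "yào",
    "单": "dān", "田": "tián", "芳": "fāng",
}
# Phrase-level readings override the per-character default.
PHRASE_READINGS = {
    "重庆": ["chóng", "qìng"],
    "重要": ["zhòng", "yào"],
}
# Readings that apply only when the character is used as a surname.
SURNAME_READINGS = {"单": "shàn"}


def to_pinyin(text, surname_mode=False):
    """Convert text to a list of syllables using greedy longest-match phrases."""
    result = []
    i = 0
    # In surname mode, treat the first character as a family name if known.
    if surname_mode and text and text[0] in SURNAME_READINGS:
        result.append(SURNAME_READINGS[text[0]])
        i = 1
    max_len = max(len(phrase) for phrase in PHRASE_READINGS)
    while i < len(text):
        # Try the longest candidate phrase first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            word = text[i:i + size]
            if word in PHRASE_READINGS:
                result.extend(PHRASE_READINGS[word])
                i += size
                break
        else:
            # No phrase matched: fall back to the single-character default.
            result.append(CHAR_READINGS.get(text[i], text[i]))
            i += 1
    return result


print(to_pinyin("重庆"))
print(to_pinyin("单田芳", surname_mode=True))
```

The phrase table is what resolves 重 differently in 重庆 and 重要; the surname table is consulted only when the caller signals a name context, so ordinary text is unaffected.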
python-pinyin is a Python library for transliterating simplified and traditional Chinese characters into phonetic pinyin. It supports tone sandhi and provides utilities to convert pinyin between formats such as numeric tones, accent marks, or initials only. The library features a polyphonic character resolver that analyzes surrounding word context to select the correct pronunciation for characters with multiple readings. It also includes a customizable dictionary system that allows the extension of default transl
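To make the format conversion concrete, here is a stand-alone sketch of turning numeric-tone pinyin into accent-marked pinyin. It does not call python-pinyin; `numeric_to_accent` and `TONE_MARKS` are hypothetical names, and the placement rule used is the conventional one (mark `a` or `e` if present, the `o` in `ou`, otherwise the last vowel), with `v` accepted as a stand-in for `ü`.

```python
# Accent-marked forms of each vowel, indexed by tone 1 to 4.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def numeric_to_accent(syllable):
    """Convert one syllable such as 'zhong1' to 'zhōng'."""
    syllable = syllable.replace("v", "ü")
    if not syllable or not syllable[-1].isdigit():
        return syllable  # no tone digit: nothing to convert
    base, tone = syllable[:-1], int(syllable[-1])
    if tone not in (1, 2, 3, 4):
        return base  # neutral tone (written 0 or 5) carries no mark
    if "a" in base:
        idx = base.index("a")
    elif "e" in base:
        idx = base.index("e")
    elif "ou" in base:
        idx = base.index("o")
    else:
        # Otherwise the mark goes on the last vowel of the syllable.
        idx = max(base.rfind(v) for v in "iouü")
    if idx < 0:
        return base  # no vowel found; leave unchanged
    marked = TONE_MARKS[base[idx]][tone - 1]
    return base[:idx] + marked + base[idx + 1:]


print(" ".join(numeric_to_accent(s) for s in ["zhong1", "guo2", "lv4", "ma5"]))
```

Going the other way (accent marks to numeric tones) is a reverse lookup over the same table, which is why a single table is a convenient design for both directions.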
This is a dictionary-based Chinese Pinyin transliteration library used to convert Chinese characters into Pinyin with support for various tone styles and formats. It provides specialized utilities for polyphonic character resolution to manage multiple pronunciations and a generator for extracting the first letter of characters to create searchable index strings. The library includes a formatter for converting names into Pinyin following official international travel document and passport spelling standards. It also features a tool for transforming Chinese text into hyphenated or dotted string
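The first-letter index and the delimited output described above both reduce to a per-character dictionary lookup followed by a join. The sketch below assumes a toy toneless dictionary; `PLAIN_READINGS`, `first_letters`, and `join_pinyin` are hypothetical names, not this library's API.

```python
# Hypothetical toneless readings; a real dictionary covers thousands of characters.
PLAIN_READINGS = {"中": "zhong", "国": "guo", "北": "bei", "京": "jing"}


def syllables(text):
    """Look up each character, passing unknown characters through unchanged."""
    return [PLAIN_READINGS.get(ch, ch) for ch in text]


def first_letters(text):
    """Build a searchable index string from the first letter of each syllable."""
    return "".join(s[0] for s in syllables(text))


def join_pinyin(text, delimiter="-"):
    """Join syllables with a delimiter, for example '-' or '.'."""
    return delimiter.join(syllables(text))


print(first_letters("中国"))      # zg
print(join_pinyin("北京"))        # bei-jing
print(join_pinyin("北京", "."))   # bei.jing
```

Passing unknown characters through unchanged keeps mixed input such as Latin letters or digits intact in both outputs.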
ToolGood.Words is a sensitive word filtering library and text sanitization component designed for high-performance detection and masking of prohibited terms. It provides tools for Chinese text normalization, pinyin transliteration, and the replacement of banned words with placeholders. The project is distinguished by its ability to uncover obfuscated language through a pinyin transliteration engine and phonetic-based detection. It identifies sensitive content hidden by phonetic substitutions, first-letter initials, or intentional misspellings by mapping Chinese characters to pinyin representa
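The idea of catching homophone substitutions by comparing pinyin rather than characters can be illustrated with a small Python sketch. It is not ToolGood.Words code: `PLAIN_READINGS`, `BANNED_WORDS`, and `mask` are hypothetical names, the data is a toy sample, and only whole-syllable homophones are handled (first-letter initials and misspellings would need extra matching rules).

```python
# Hypothetical toneless readings for a handful of characters.
PLAIN_READINGS = {
    "赌": "du", "堵": "du", "博": "bo", "伯": "bo",
    "我": "wo", "去": "qu",
}
# Toy banned list; matching is done on pinyin, not on the characters.
BANNED_WORDS = ["赌博"]


def key_of(ch):
    """Map a character to its pinyin, or to itself (lowercased) if unknown."""
    return PLAIN_READINGS.get(ch, ch.lower())


BANNED_KEYS = [tuple(key_of(c) for c in word) for word in BANNED_WORDS]


def mask(text, placeholder="*"):
    """Replace every span whose pinyin matches a banned word."""
    keys = [key_of(c) for c in text]
    out = list(text)
    for banned in BANNED_KEYS:
        n = len(banned)
        for i in range(len(keys) - n + 1):
            if tuple(keys[i:i + n]) == banned:
                for j in range(i, i + n):
                    out[j] = placeholder
    return "".join(out)


print(mask("我去赌博"))  # 我去**
print(mask("我去堵伯"))  # 我去** (homophones of the banned word)
```

A linear scan per banned word is enough for a sketch; a production filter with a large word list would more likely use a trie or an Aho-Corasick style automaton over the pinyin keys.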