python-pinyin is a Python library for transliterating Simplified and Traditional Chinese characters into pinyin. It supports tone sandhi and provides utilities for converting pinyin between output formats, such as numeric tones, accent marks, or initials. A polyphonic character resolver uses the surrounding word context to select the correct reading for characters with multiple pronunciations. It also includes a customizable dictionary system that allows the extension of default transl
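To make the format conversion concrete, here is a minimal standard-library sketch of turning numeric-tone pinyin into accent-marked pinyin. It is not python-pinyin's implementation or API: the function name `numeric_to_accented` and the table `TONE_MARKS` are hypothetical, and the placement rule (mark `a` or `e` if present, else the `o` of `ou`, else the last vowel) is the usual textbook convention, assumed here for illustration.

```python
# Hypothetical helper, not part of python-pinyin.
# Tone-marked forms of each vowel, indexed by tone number 1-4.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def numeric_to_accented(syllable: str) -> str:
    """Convert one syllable such as 'zhong1' into its accent-marked form."""
    # Normalise the two common ASCII spellings of ü.
    s = syllable.lower().replace("u:", "ü").replace("v", "ü")
    if not s or not s[-1].isdigit():
        return s  # no tone digit, nothing to convert
    body, tone = s[:-1], int(s[-1])
    if tone not in (1, 2, 3, 4):
        return body  # neutral tone (0 or 5) carries no mark

    # Pick the vowel that receives the mark.
    if "a" in body:
        idx = body.index("a")
    elif "e" in body:
        idx = body.index("e")
    elif "ou" in body:
        idx = body.index("o")
    else:
        vowels = [i for i, ch in enumerate(body) if ch in TONE_MARKS]
        if not vowels:
            return body  # no vowel to mark; leave the letters unchanged
        idx = vowels[-1]

    marked = TONE_MARKS[body[idx]][tone - 1]
    return body[:idx] + marked + body[idx + 1:]


if __name__ == "__main__":
    print(" ".join(numeric_to_accented(s) for s in ["zhong1", "wen2", "lv4", "ma5"]))
```

The sketch handles one syllable at a time; splitting a run of text into syllables is a separate step that it does not attempt.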
pinyin-pro is a Chinese pinyin transcription library and text segmentation tool. It converts Chinese characters into pinyin with support for tones, initials, and finals, and resolves polyphonic characters based on context. The project includes a pinyin pattern matching engine for searching Chinese text by full spellings, initials, or a mix of the two. It also features a pinyin HTML generator that wraps characters and their transcriptions in markup tags for styled web display. The library provides capabilities for Chinese text segmentation, surname pronunciation priorit
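The pattern matching idea can be sketched as a small backtracking search. pinyin-pro itself is a JavaScript library, so the Python below is only an illustration of the technique, not its API or algorithm: `match_pinyin` and the four-entry `TOY_PINYIN` table are hypothetical, and each character is assumed to have a single reading. For every character, the query may consume either the full spelling or just the first letter, which is what allows full, initial-only, and hybrid queries.

```python
# Toy reading table (hypothetical); a real library ships a full dictionary.
TOY_PINYIN = {"中": "zhong", "文": "wen", "拼": "pin", "音": "yin"}


def match_pinyin(text, query, table=TOY_PINYIN):
    """Return indices of a contiguous run of characters in `text` whose
    readings spell `query`, or None if there is no such run."""
    query = query.lower()

    def consume(pos, rest):
        if not rest:
            return []  # query fully consumed
        if pos >= len(text):
            return None  # ran out of characters
        reading = table.get(text[pos])
        if reading is None:
            return None  # character has no known reading
        # Try the full spelling first, then the first letter only.
        for token in (reading, reading[0]):
            if rest.startswith(token):
                tail = consume(pos + 1, rest[len(token):])
                if tail is not None:
                    return [pos] + tail
        return None

    for start in range(len(text)):
        hit = consume(start, query)
        if hit:
            return hit
    return None


if __name__ == "__main__":
    for q in ["zhongwen", "zw", "zwen", "py", "xyz"]:
        print(q, match_pinyin("中文拼音", q))
```

Returning character indices rather than a boolean lets a caller highlight the matched span in search results.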
OpenCC is a library and command-line tool for converting text between Simplified Chinese, Traditional Chinese, and Japanese Kanji. It operates at both the individual character level and the multi-character phrase level, and applies region-specific vocabulary for Mainland China, Taiwan, and Hong Kong during conversion. The conversion engine resolves ambiguous character mappings using semantic and contextual rules, normalizes variant character forms for consistent orthography, and chains multiple dictionary files into a configurable pipeline. It supports embedding custom conversion rules dire
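As a rough illustration of phrase-level conversion and dictionary chaining, the sketch below applies a list of mappings in order and prefers the longest matching phrase at each position. It is an assumption-laden toy, not OpenCC's engine or API: `convert_once`, `convert`, `S2T`, and `TW_PHRASES` are hypothetical names, and the dictionaries hold only a handful of entries chosen to show why phrase entries matter (发 maps to 髮 in 头发 but to 發 elsewhere).

```python
def convert_once(text, mapping):
    """Greedy forward longest-match replacement using one dictionary."""
    longest = max(map(len, mapping), default=1)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            out.append(text[i])  # no entry: copy the character unchanged
            i += 1
    return "".join(out)


def convert(text, pipeline):
    """Run the text through each dictionary in order."""
    for mapping in pipeline:
        text = convert_once(text, mapping)
    return text


# Toy dictionaries (hypothetical, far smaller than any real data set).
S2T = {"发": "發", "头": "頭", "头发": "頭髮", "软": "軟"}
TW_PHRASES = {"軟件": "軟體"}  # regional vocabulary applied as a second stage

if __name__ == "__main__":
    print(convert("头发", [S2T]))
    print(convert("发展", [S2T]))
    print(convert("软件", [S2T, TW_PHRASES]))
```

Keeping the regional vocabulary in its own stage means the same character-level dictionary can be reused for different regional targets by swapping only the later mappings.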
SnowNLP is a Python library for Chinese natural language processing. It provides tools for text segmentation, sentiment analysis, document classification, and phonetic transliteration. Custom machine learning models for tokenization and sentiment analysis can be trained from raw datasets and saved for later use. The library also covers part-of-speech tagging, sentence splitting, and text similarity measurement. The toolkit also provides utilities for extracting key information through text summarization and calculating word im