This is a Chinese text segmentation library that converts Chinese characters into pinyin, their phonetic romanization. It disambiguates polyphonic characters (characters with more than one reading) using word segmentation and context analysis, and it can sort Chinese strings alphabetically by their pinyin. The library distinguishes itself through surname-aware pronunciation switching: in name contexts, it applies dedicated readings for Chinese surnames whose pronunciation differs from the character's usual one. It supports pluggable word segmentation algor
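The combination of word segmentation, polyphone lookup, and surname switching described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny pronunciation tables, the forward maximum matching segmenter, and the names `segment`, `to_pinyin`, and `is_name` are hypothetical and are not taken from the library itself.

```python
# Toy pronunciation tables. All entries are illustrative assumptions,
# not data taken from any real library.
CHAR_PINYIN = {"单": "dān", "行": "xíng", "银": "yín",
               "人": "rén", "田": "tián", "芳": "fāng"}
WORD_PINYIN = {"银行": ["yín", "háng"], "行人": ["xíng", "rén"]}
SURNAME_PINYIN = {"单": "shàn"}


def segment(text, vocabulary):
    """Forward maximum matching: take the longest known word at each position."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no longer word matches.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def to_pinyin(text, is_name=False):
    """Return one pinyin syllable per character, using word context first."""
    result, start = [], 0
    # In a name context, the leading character may take a surname reading.
    if is_name and text and text[0] in SURNAME_PINYIN:
        result.append(SURNAME_PINYIN[text[0]])
        start = 1
    for token in segment(text[start:], WORD_PINYIN):
        if token in WORD_PINYIN:
            # A known word fixes the reading of any polyphone inside it.
            result.extend(WORD_PINYIN[token])
        else:
            # Unknown characters are passed through unchanged.
            result.append(CHAR_PINYIN.get(token, token))
    return result
```

In this sketch, `to_pinyin("银行")` reads 行 as "háng" because the segmenter finds the word 银行, while `to_pinyin("行人")` reads it as "xíng". Passing `is_name=True` switches only the leading character to its surname reading and leaves the rest of the string to the ordinary lookup.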
SnowNLP is a Python library for Chinese natural language processing. It provides tools for text segmentation, sentiment analysis, document classification, and phonetic transliteration. The library can train and save custom machine learning models for tokenization and sentiment analysis from raw training datasets. It covers further linguistic processing tasks, including part-of-speech tagging, sentence splitting, and text similarity measurement. The toolkit also provides utilities for extracting key information through text summarization and calculating word im
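To make the idea of text similarity measurement concrete, here is a generic Python sketch of cosine similarity over term counts. It is not SnowNLP's implementation and does not use its API; the function name `cosine_similarity` is hypothetical, and the inputs are assumed to be token lists already produced by some segmenter.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    # An empty document has no direction, so treat its similarity as zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The score ranges from 0.0 for documents with no shared tokens up to 1.0 for documents with identical term proportions. Raw counts keep the sketch short; a weighting scheme that discounts very common words would be a natural refinement.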
UserScripts is a collection of JavaScript browser userscripts designed to modify website behavior and add custom functionality to web browsers. It serves as a multi-purpose toolset for web page content automation, web interface enhancement, and specialized web scraping and downloading. The project distinguishes itself through a wide range of specialized utilities, including a browser-based text transformer for character encoding and terminology mapping, and tools for bypassing content censorship. It provides advanced web scraping capabilities such as deciphering obfuscated download links, agg
This project is a CJK input method framework and configuration set designed for the Rime input engine. It provides a comprehensive system of schemas and dictionary packs to optimize Chinese character entry through pinyin and double-pinyin workflows. The framework is distinguished by its use of Lua-powered extensions that add dynamic utilities, such as inline mathematical calculators, automated timestamps, and text formatting, directly to the input interface. It also features refined word libraries and language models specifically tuned to improve prediction accuracy and first-choice hit rates