30 open-source projects similar to byvoid/opencc, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best OpenCC alternative.
ToolGood.Words is a sensitive word filtering library and text sanitization component designed for high-performance detection and masking of prohibited terms. It provides tools for Chinese text normalization, pinyin transliteration, and the replacement of banned words with placeholders. The project is distinguished by its ability to uncover obfuscated language through a pinyin transliteration engine and phonetic-based detection. It identifies sensitive content hidden by phonetic substitutions, first-letter initials, or intentional misspellings by mapping Chinese characters to pinyin representa
This is a Chinese text segmentation library that converts Chinese characters into their phonetic pinyin representation. It functions as a polyphone disambiguation tool, resolving ambiguous pronunciations for multi-sound characters using word segmentation and context analysis, and also serves as a pinyin sorting utility for ordering Chinese strings alphabetically. The library distinguishes itself through surname-aware pronunciation switching, applying specialized phonetic rules for Chinese surnames with non-standard pronunciations in name contexts. It supports pluggable word segmentation algor
SnowNLP is a Python library for Chinese natural language processing. It provides tools for text segmentation, sentiment analysis, document classification, and phonetic transliteration. The library includes capabilities for training and saving custom machine learning models for tokenization and sentiment analysis using raw training datasets. It covers a range of linguistic processing areas, including part-of-speech tagging, sentence splitting, and text similarity measurement. The toolkit also provides utilities for extracting key information through text summarization and calculating word im
LyricsX is a macOS application that renders synchronized song lyrics over the system UI during music playback. It functions as a desktop display tool, an external lyric aggregator, and a synchronization utility. The application fetches lyrics from multiple remote data sources using current playback metadata and provides a script converter to translate text between Traditional and Simplified Chinese characters. It also includes a lyric file manager for importing and exporting common lyric formats via drag-and-drop interactions. The tool provides capabilities for timing synchronization to matc
This is a Chinese natural language processing toolkit providing a suite of tools for word segmentation, part-of-speech tagging, and named entity recognition. It includes a neural dependency parser for analyzing syntactic and semantic relationships between words and a machine learning training suite for creating custom linguistic models using annotated datasets. The toolkit distinguishes itself through its deployment flexibility, offering a dockerized server and a web service interface that exposes processing capabilities via API. It supports the use of pretrained models and allows for the int
This project is a CJK input method framework and configuration set designed for the Rime input engine. It provides a comprehensive system of schemas and dictionary packs to optimize Chinese character entry through pinyin and double-pinyin workflows. The framework is distinguished by its use of Lua-powered extensions that add dynamic utilities, such as inline mathematical calculators, automated timestamps, and text formatting, directly to the input interface. It also features refined word libraries and language models specifically tuned to improve prediction accuracy and first-choice hit rates
UserScripts is a collection of JavaScript browser userscripts designed to modify website behavior and add custom functionality to web browsers. It serves as a multi-purpose toolset for web page content automation, web interface enhancement, and specialized web scraping and downloading. The project distinguishes itself through a wide range of specialized utilities, including a browser-based text transformer for character encoding and terminology mapping, and tools for bypassing content censorship. It provides advanced web scraping capabilities such as deciphering obfuscated download links, agg
HanLP is a natural language processing library and deep learning framework specifically optimized for the Chinese language, while also functioning as a multilingual text processor. It serves as a toolkit for performing linguistic analysis, semantic understanding, and script conversion. The project distinguishes itself through a dedicated focus on Chinese linguistic structures, including a specialized script converter for transforming text between Simplified Chinese, Traditional Chinese, and Pinyin. It further supports domain-specific model training to improve the recognition of professional t
ansj_seg is a Java NLP toolkit and segmentation library designed for processing Chinese text. It functions as a word segmenter, part-of-speech tagger, and named entity recognizer to divide continuous Chinese characters into meaningful words and tokens. The library utilizes statistical models for text segmentation and provides capabilities for identifying and extracting person names from unstructured documents. It also assigns grammatical categories to tokens to determine their linguistic roles within a sentence. The toolkit supports domain-specific text processing through the use of custom d
GoldenDict-ng is a multi-source dictionary application and offline dictionary reader that enables users to search for word definitions across local files, DICT servers, and web sources in a single interface. It functions as a web-based definition browser, rendering entries using a browser engine to support HTML, CSS, and JavaScript for rich content presentation. The project distinguishes itself by integrating with Anki flashcard systems to facilitate language learning workflows and offering specialized translation tools that support clipboard monitoring and character set conversion. It also p
Fcitx5 Android is an input method manager that brings the Fcitx5 framework to Android, enabling multilingual text input through a customizable virtual keyboard. It functions as a platform for loading plugin-based input engines, supports conversion between simplified and traditional Chinese characters, and provides a theme engine for dynamically altering the keyboard's appearance with custom colors, images, and popup previews. The project is built around a native C++ core that handles dictionary lookups and text processing, connected to the Android interface via a JNI bridge for performance. I
This project is a Chinese text segmentation library and tokenizer designed to split Chinese sentences into individual words. It serves as a natural language processing tool for segmenting text into words, tagging parts of speech, and extracting keywords using statistical analysis. The library distinguishes itself through support for custom dictionary configuration and vocabulary file management, allowing users to override default segmentation rules for domain-specific accuracy. It also includes a TF-IDF keyword extractor to identify significant words and core topics within documents. Th
Analysis-ik is a Chinese text segmenter and analysis plugin for Lucene-based search engines. It provides a specialized analyzer for splitting Chinese sentences into meaningful words to improve indexing and search accuracy within Elasticsearch and OpenSearch. The project features a dynamic dictionary manager that can load word libraries and stop-word files from remote HTTP endpoints. It monitors metadata headers on these remote files to trigger automatic vocabulary updates without requiring a service restart. The analyzer supports both fine-grained exhaustive and coarse-grained smart segmenta
pkuseg-python is a Chinese word segmentation toolkit and natural language processing library. It provides specialized models for splitting Chinese text into words across various domains, including news, medical, and web content, and includes a tool for assigning part-of-speech tags to segmented words. The library allows for the training of custom segmentation models using annotated datasets and supports the integration of user-defined dictionaries to ensure specialized terminology is recognized correctly. It employs a multi-threaded execution engine to process large volumes of Ch
Librime is an input method engine library that translates keystrokes into Chinese characters using phonetic and shape-based rules defined in YAML schemas. It processes keyboard input through a modular pipeline of configurable translation modules, supporting both phonetic mapping and structural shape-based decomposition methods like Cangjie or Wubi. The engine distinguishes itself through its YAML-driven schema system, which allows users to define custom input method behaviors and key mappings in external configuration files without recompiling the engine. It supports runtime switching between
imewlconverter is an input method editor wordlist converter and format transformer designed to migrate user dictionaries and phrase lists between different software environments. It functions as a cross-platform dictionary migrator, translating proprietary binary and text wordlists for use across Windows, macOS, and mobile systems. The tool standardizes diverse lexicon formats, such as WL, FIT, DCTX, LD2, and QPYD, into common structures to ensure cross-platform compatibility. It specifically handles binary wordlist extraction and the transformation of custom phrase lists for systems includin
InvenTree is an open-source inventory management platform built on Django, designed for tracking parts, stock levels, and supply chain operations through a web interface and REST API. The system uses barcodes—including QR codes, 1D barcodes, and Data Matrix codes—as primary identifiers for scanning, linking, and triggering inventory actions, and extends core functionality through a Python plugin framework supporting custom actions, UI panels, barcode handlers, and scheduled tasks. The platform distinguishes itself through a comprehensive plugin-based extensibility system that allows custom in
Synonyms is a natural language processing library and semantic similarity engine specifically designed for Chinese text. It functions as a word embedding toolkit and tokenizer that extracts semantic meaning and identifies synonyms by calculating the conceptual closeness between words and sentences. The system provides a toolkit for Chinese word embedding and synonym discovery, allowing for the retrieval of semantically similar words to expand vocabulary. It distinguishes itself through a configuration-driven approach to model loading, which supports the integration of custom word embeddings t
mlpack is a header-only C++ machine learning library that defines matrix types as compile-time templates, enabling flexible numeric precision and memory layout without runtime overhead. Its core identity is built around a template metaprogramming architecture that allows algorithms to be included selectively as independent modules, reducing binary size, and supports compile-time serialization of neural network parameters by deducing matrix types and structure at compile time. The library distinguishes itself through a multi-language binding framework that automatically generates bindings for
GPT2-Chinese is a Chinese language model implementation based on the GPT-2 architecture. It provides a causal language model trainer and a natural language generation tool designed for training and generating human-like Chinese text sequences. The system integrates a BERT tokenizer to process Chinese corpora into manageable units for machine learning. It enables the development of predictive text models that can generate specific patterns, such as news or poetry, through prompt-based text completion. The project covers a full workflow including text tokenization, model training using a trans
SWIG is a tool that generates wrapper code to expose C and C++ libraries to a wide range of higher-level programming languages. It reads annotated C/C++ header files and produces language-specific bindings from a single interface definition, supporting languages such as Python, Java, Ruby, C#, Perl, and many others. The generated wrapper code is free from the project's GPL license, allowing users to distribute it under their own terms. The tool handles modern C++ features including templates, namespaces, smart pointers, and constructs up to C++20 through specialized parsing and code generatio
This project is a Telegram command line interface and MTProto client. It functions as a userbot framework, providing a terminal-based environment to interact with Telegram accounts without a graphical user interface. The system differentiates itself through extensibility, offering Python bindings and a Lua scripting engine to automate account tasks and respond to messages. It also serves as a JSON-based chat exporter, capable of extracting user metadata and conversation histories into structured files. The client covers core messaging capabilities, including text exchange, group chat managem
Synonyms is a Chinese natural language processing tool focused on semantic analysis. It provides capabilities for Chinese word segmentation, part-of-speech tagging, and the retrieval of synonyms based on semantic proximity. The project converts words and sentences into numerical vector representations to calculate similarity scores. This allows for the determination of semantic proximity between different phrases and the identification of chatbot intent through sentence comparison. The system also includes tools for automated keyword extraction and importance ranking to identify significant
Oh My Bash is a shell framework designed to manage the Bash environment through a modular configuration system. It functions as a configuration manager and prompt theme engine, providing a collection of plugins and themes to customize the terminal experience. The project includes a shell plugin library that provides specialized shortcuts and commands for various languages and platforms. It allows for the integration of pre-defined plugins and the use of behavioral overrides to modify bundled themes and modules without altering the core installation. The framework covers bash shell customizat
Enquirer is a Node.js library for creating interactive command-line interfaces to gather structured user input. It provides a set of terminal prompts, including menus, forms, and text fields, to collect data via autocomplete, multiselect, and boolean confirmations. The project serves as a customizable framework that allows for the creation of custom prompt types through a base class and the extension of functionality via a plugin architecture. The library covers a wide range of interaction patterns, such as capturing numerical and sensitive data, validating user input against custom rules, a
Rspack is a high-performance web bundler written in Rust that packages JavaScript and TypeScript for web applications. It functions as an incremental build engine and a tree-shaking asset optimizer designed to reduce build times and minimize final bundle sizes for web delivery. The project is built for compatibility with the webpack ecosystem, implementing a compatible API that allows existing plugins and configurations to work without modification. This enables the integration of community loaders and plugins while leveraging a Rust-based compilation engine. The tool covers a broad range of
Riot is a Go-based distributed search engine and indexing server designed for full-text indexing and retrieval. It functions as a retrieval system that sorts documents by relevance using BM25 ranking algorithms, term frequency, and inverse document frequency. The engine provides specialized support for the Chinese language, featuring concurrent text segmentation and phonetic Pinyin mapping to match romanized input with characters. It utilizes a distributed architecture that employs hash-based index sharding to balance data load and throughput across multiple server nodes. The system covers a
Timber is a PHP library that integrates the Twig template engine into WordPress themes, providing an object-oriented framework for theme development. It wraps WordPress data — posts, terms, users, menus, and comments — in structured PHP classes, allowing developers to work with objects instead of raw arrays while keeping HTML markup separate from PHP logic through Twig templates. The library distinguishes itself by offering a complete set of tools for modern WordPress theme building. It includes a file-based template hierarchy with fallback chains, dynamic image manipulation with resizing, cr
Nuxt is a full-stack framework for building Vue.js applications. It serves as an application orchestrator that integrates server-side rendering, static site generation, and backend API logic within a single unified project. The framework uses a file-based routing system to automatically generate application URLs based on the project's folder and file structure. It supports multi-strategy web rendering, allowing for a combination of server-side, static, and hybrid rendering techniques to optimize page load speeds and search engine visibility. The project provides automated component discovery