30 open-source projects similar to toolgood/toolgood.words, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best ToolGood.Words alternative.
python-pinyin is a Python library for transliterating simplified and traditional Chinese characters into phonetic pinyin. It functions as a transliteration system that converts text while supporting tone sandhi and providing utilities to transform pinyin between different formats, such as numeric tones, accent marks, or phonetic initials. The library features a polyphonic character resolver that analyzes surrounding word context to select the correct pronunciation for characters with multiple sounds. It also includes a customizable dictionary system that allows the extension of default transl
pinyin-pro is a Chinese pinyin transcription library and text segmentation tool. It converts Chinese characters into pinyin with support for tones, initials, and finals, while resolving polyphonic characters based on context. The project includes a pinyin pattern matching engine that enables searching Chinese text using full spellings, initials, or hybrid phonetic patterns. It also features a pinyin HTML generator that wraps characters and their transcriptions in markup tags for styled web display. The library provides capabilities for Chinese text segmentation, surname pronunciation priorit
OpenCC is a library and command-line tool for converting text between Simplified Chinese, Traditional Chinese, and Japanese Kanji. It operates at both the individual character and multi-character phrase levels, and applies region-specific vocabulary choices for Mainland China, Taiwan, and Hong Kong during conversion. The conversion engine resolves ambiguous character mappings using semantic and contextual rules, normalizes variant character forms for consistent orthography, and sequences multiple dictionary files into a configurable pipeline. It supports embedding custom conversion rules dire
SnowNLP is a Python library for Chinese natural language processing. It provides tools for text segmentation, sentiment analysis, document classification, and phonetic transliteration. The library includes capabilities for training and saving custom machine learning models for tokenization and sentiment analysis using raw training datasets. It covers a range of linguistic processing areas, including parts of speech tagging, sentence splitting, and text similarity measurement. The toolkit also provides utilities for extracting key information through text summarization and calculating word im
This is a dictionary-based Chinese Pinyin transliteration library used to convert Chinese characters into Pinyin with support for various tone styles and formats. It provides specialized utilities for polyphonic character resolution to manage multiple pronunciations and a generator for extracting the first letter of characters to create searchable index strings. The library includes a formatter for converting names into Pinyin following official international travel document and passport spelling standards. It also features a tool for transforming Chinese text into hyphenated or dotted string
This is a Chinese text segmentation library that converts Chinese characters into their phonetic pinyin representation. It functions as a polyphone disambiguation tool, resolving ambiguous pronunciations for multi-sound characters using word segmentation and context analysis, and also serves as a pinyin sorting utility for ordering Chinese strings alphabetically. The library distinguishes itself through surname-aware pronunciation switching, applying specialized phonetic rules for Chinese surnames with non-standard pronunciations in name contexts. It supports pluggable word segmentation algor
This project is a high-performance Java library and content moderation framework designed to detect and mask prohibited words in text. It utilizes a Deterministic Finite Automaton (DFA) scanner to implement efficient longest-match word detection. The engine distinguishes itself through a text normalization pipeline and noise-filtering preprocessor that standardize character casing, scripts, and widths while removing interspersed special characters to prevent filter evasion. It supports dynamic dictionary management, allowing blacklists and allow-lists to be updated in the background without r
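The longest-match scanning described above can be sketched in Python with a character trie, which behaves like a simple DFA over the dictionary. This is a minimal illustration, not the project's Java implementation: the names `normalize`, `build_trie`, and `mask` are hypothetical, normalization covers only casing and full-width forms, and noise-character removal is left out.

```python
END = "__end__"  # sentinel key marking the end of a dictionary word (assumed convention)


def normalize(text):
    """Lower-case and map full-width ASCII forms to half-width, one char in, one char out."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # ideographic space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:   # full-width ASCII block -> half-width
            code -= 0xFEE0
        lowered = chr(code).lower()
        # Keep the length unchanged so indices map back to the original text.
        out.append(lowered if len(lowered) == 1 else chr(code))
    return "".join(out)


def build_trie(words):
    """Build a nested-dict trie from the normalized dictionary words."""
    root = {}
    for word in words:
        node = root
        for ch in normalize(word):
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def mask(text, trie, mask_char="*"):
    """Replace the longest dictionary match at each position with mask_char."""
    norm = normalize(text)
    out = list(text)
    i = 0
    while i < len(norm):
        node, j, longest = trie, i, 0
        while j < len(norm) and norm[j] in node:
            node = node[norm[j]]
            j += 1
            if END in node:
                longest = j - i          # remember the longest complete word so far
        if longest:
            out[i:i + longest] = mask_char * longest
            i += longest
        else:
            i += 1
    return "".join(out)
```

Because normalization preserves length, the mask can be written back onto the original text at the same indices.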
This project provides a collection of processed Chinese conversational datasets and preprocessing workflows designed for training and instruction tuning of large language models. It functions as a training corpus of cleaned, standardized Chinese text formatted as query-answer pairs. The repository includes a preprocessing pipeline and dataset aggregator that combine multiple public chat sources into unified files. These tools normalize text by converting traditional Chinese characters to simplified characters and transforming complex dialogue threads into a standardized sequence of single tur
UserScripts is a collection of JavaScript browser userscripts designed to modify website behavior and add custom functionality to web browsers. It serves as a multi-purpose toolset for web page content automation, web interface enhancement, and specialized web scraping and downloading. The project distinguishes itself through a wide range of specialized utilities, including a browser-based text transformer for character encoding and terminology mapping, and tools for bypassing content censorship. It provides advanced web scraping capabilities such as deciphering obfuscated download links, agg
LLM Guard is a security firewall and guardrail framework designed to scan and sanitize inputs and outputs for large language models. It functions as a proxy gateway and security layer to block prompt injections, toxicity, and sensitive data leakage while ensuring that model interactions remain compliant with organizational policies. The system distinguishes itself through a modular scanner pipeline that utilizes local model orchestration to eliminate external network dependencies. It supports real-time security filtering via streaming chunk analysis and implements a fail-fast execution model
GoldenDict-ng is a multi-source dictionary application and offline dictionary reader that enables users to search for word definitions across local files, DICT servers, and web sources in a single interface. It functions as a web-based definition browser, rendering entries using a browser engine to support HTML, CSS, and JavaScript for rich content presentation. The project distinguishes itself by integrating with Anki flashcard systems to facilitate language learning workflows and offering specialized translation tools that support clipboard monitoring and character set conversion. It also p
sqlean is a collection of SQLite extension libraries implemented as C-based shared libraries. It provides a suite of additional scalar and table-valued functions that expand the native capabilities of the SQLite database engine. The project provides specialized toolsets for cryptography, advanced mathematics, networking, and filesystem access. These include binary hashing and encoding, statistical analysis, IP address validation, and the ability to map CSV files or filesystem paths as virtual tables. The library also includes comprehensive text processing tools such as regular expressions, f
Sensitive-lexicon is a sensitive word detection service and content moderation tool designed to identify prohibited text. It utilizes a curated lexicon of thousands of categorized terms and a fuzzy matching text scanner to detect restricted words and phrases. The project features specialized filters for Chinese language content across political, social, and adult domains. It supports approximate string matching to identify terms that use noise characters or whitespace to evade standard keyword filters. The system includes a network interface for hosting the detection service, allowing for re
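One way to catch terms padded with noise characters or whitespace, as described above, is to drop non-alphanumeric characters before matching while remembering where each kept character came from. The sketch below assumes that approach; `find_with_noise` is a hypothetical helper, not the project's API, and it expects lower-case dictionary terms.

```python
def find_with_noise(text, words):
    """Return (word, start, end) spans in the original text, ignoring noise characters."""
    # Keep letters, digits, and CJK characters along with their original index.
    kept = [(i, ch.lower()[:1]) for i, ch in enumerate(text) if ch.isalnum()]
    compact = "".join(ch for _, ch in kept)
    hits = []
    for word in words:
        if not word:
            continue
        start = compact.find(word)
        while start != -1:
            first = kept[start][0]
            last = kept[start + len(word) - 1][0]
            hits.append((word, first, last + 1))   # end index is exclusive
            start = compact.find(word, start + 1)
    return sorted(hits, key=lambda hit: hit[1])
```

The returned spans point into the original text, so a caller can mask or report the evasive spelling exactly as it was written.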
This is a Chinese natural language processing toolkit providing a suite of tools for word segmentation, part-of-speech tagging, and named entity recognition. It includes a neural dependency parser for analyzing syntactic and semantic relationships between words and a machine learning training suite for creating custom linguistic models using annotated datasets. The toolkit distinguishes itself through its deployment flexibility, offering a dockerized server and a web service interface that exposes processing capabilities via API. It supports the use of pretrained models and allows for the int
Synonyms is a natural language processing library and semantic similarity engine designed for Chinese text. It functions as a word embedding toolkit and tokenizer that identifies synonyms by calculating the semantic closeness between words and sentences, and it retrieves semantically similar words to expand vocabulary. It distinguishes itself through a configuration-driven approach to model loading, which supports the integration of custom word embeddings t
Synonyms is a Chinese natural language processing tool focused on semantic analysis. It provides capabilities for Chinese word segmentation, part-of-speech tagging, and the retrieval of synonyms based on semantic proximity. The project converts words and sentences into numerical vector representations to calculate similarity scores. This allows for the determination of semantic proximity between different phrases and the identification of chatbot intent through sentence comparison. The system also includes tools for automated keyword extraction and importance ranking to identify significant
This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora. The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
This project is a CJK input method framework and configuration set designed for the Rime input engine. It provides a comprehensive system of schemas and dictionary packs to optimize Chinese character entry through pinyin and double-pinyin workflows. The framework is distinguished by its use of Lua-powered extensions that add dynamic utilities, such as inline mathematical calculators, automated timestamps, and text formatting, directly to the input interface. It also features refined word libraries and language models specifically tuned to improve prediction accuracy and first-choice hit rates
MallChat is an e-commerce backend system and real-time messaging platform. It provides a server-side infrastructure for managing online stores, including integrated shopping carts, order processing, and payment workflows. The system features a WebSocket chat server for instant customer communication and a content moderation engine that uses pattern matching to block sensitive language. It integrates large language models to provide automated conversational AI chatbots for customer support and product recommendations. Identity management is handled through a token-based authentication system
KnowledgeGraphData is a collection of structured datasets and corpora designed to provide a foundational layer for cognitive intelligence and artificial intelligence systems. It primarily consists of large-scale Chinese knowledge graph datasets, including entity-relation data and NLP training sets used to drive semantic understanding and automated question answering. The project focuses on the construction and export of massive entity-attribute-value graphs, organizing knowledge into portable formats. It provides specialized domain partitioning to tailor information retrieval for professional
Chinese-BERT-wwm is a pre-trained transformer model and encoder designed for Chinese natural language processing. It converts Chinese text into dense vector representations to be used across various natural language processing applications. The model utilizes a whole word masking strategy during pre-training, masking entire words rather than individual characters. This approach is designed to improve the capture of semantic meaning and language structure within Chinese datasets. The project covers a range of downstream tasks including text classification, sequence labeling, and reading compr
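Whole word masking, as described above, can be illustrated with a short Python sketch. It assumes the text has already been segmented into words and that each character is one token. The function name, the 15% default, and the `[MASK]` token are illustrative assumptions rather than the project's actual preprocessing code.

```python
import random


def whole_word_mask(words, mask_prob=0.15, rng=None, mask_token="[MASK]"):
    """Mask every character of a selected word together, never a single character alone."""
    rng = rng or random.Random()
    tokens, labels = [], []
    for word in words:
        masked = rng.random() < mask_prob   # one decision per word, not per character
        for ch in word:
            tokens.append(mask_token if masked else ch)
            labels.append(ch if masked else None)   # prediction target only where masked
    return tokens, labels
```

The masking decision is made once per word, so the two characters of a word such as 模型 are always hidden or shown together.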
This project is a PyTorch-based Chinese text classification framework. It provides a transformer-based pipeline designed to categorize Chinese language sequences into predefined labels using deep learning models. The implementation supports both BERT and ERNIE language models for processing and tagging complex Chinese text. These models are used to perform tasks such as sentiment analysis and general text categorization. The system utilizes transformer-based text encoding and attention-weighted sequence pooling to convert raw characters into document vectors. It employs pre-trained model fin
Trime is a customizable text input framework and engine based on the Rime input method. It enables the entry of characters across multiple languages using phonetic markers and shape-based patterns. The project functions as a cross-platform input method, providing the necessary logic to build and deploy text input tools for both mobile and desktop devices. It also serves as a Chinese text converter for translating traditional Chinese characters into simplified Chinese to create localized resource files.
sd is a command line text manipulation utility designed for searching and replacing text patterns across multiple files. It functions as a regex-based find and replace tool that allows for in-place file editing directly from the terminal. The project supports both regular expression replacements, including the use of capture groups for complex transformations, and fixed string replacement for literal text substitutions. It specifically handles multi-line text replacement by processing file contents as single blocks to match patterns that span across newline characters. The tool provides capa
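The three replacement modes described above (regex with capture groups, fixed strings, and patterns that span newlines) can be sketched with Python's `re` module. This illustrates the technique only and is not sd's implementation or its command-line syntax; `replace_all` is a hypothetical helper, and Python writes capture-group references as `\1`.

```python
import re


def replace_all(text, pattern, replacement, literal=False):
    """Replace every match in text; treat pattern as a plain string when literal=True."""
    if literal:
        return text.replace(pattern, replacement)
    # The whole text is processed as one block, so patterns may span newlines.
    return re.sub(pattern, replacement, text)


# Capture groups reorder the date parts.
reordered = replace_all("2024-01-31", r"(\d+)-(\d+)-(\d+)", r"\3/\2/\1")

# A pattern containing a newline joins two lines.
joined = replace_all("foo,\nbar", r",\n", ", ")

# Literal mode treats "." as a dot, not as "any character".
dashed = replace_all("a.b", ".", "-", literal=True)
```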
todo-comments.nvim is a Neovim plugin and codebase task navigator that highlights and manages task keywords within code comments. It functions as a Lua-based highlighter and workflow extension that aggregates pending work and notes from source code into a searchable project list. The plugin provides visual tracking of task comments using custom syntax highlighting and allows for jumping between these markers within a file. It enables project-wide task management by searching for tagged comments across multiple files to organize a development roadmap.
Fcitx5 Android is an input method manager that brings the Fcitx5 framework to Android, enabling multilingual text input through a customizable virtual keyboard. It functions as a platform for loading plugin-based input engines, supports conversion between simplified and traditional Chinese characters, and provides a theme engine for dynamically altering the keyboard's appearance with custom colors, images, and popup previews. The project is built around a native C++ core that handles dictionary lookups and text processing, connected to the Android interface via a JNI bridge for performance. I
nbnhhsh is a browser extension that translates Chinese pinyin acronyms and internet slang into their full phrases. It operates entirely on the client side, using a precompiled dictionary bundled with the extension to perform lookups without server round-trips after the initial load. The project distinguishes itself through a community-driven dictionary that accepts user-submitted definitions through a review queue before merging them into the main dataset. It provides text selection lookup on any webpage, allowing users to highlight pinyin initialisms and see their expanded meanings, and can
Tailspin is a regex-based text colorizer and terminal log viewer designed to transform plain text streams into colorized output. It functions as a command line log highlighter and tailer that applies syntax highlighting to logs using regular expressions. The tool distinguishes itself through its ability to monitor files in real time and pipe live output through a highlighter. It recognizes and colors common data types such as IP addresses, UUIDs, HTTP methods, JSON objects, dates, and memory pointers. Users can define custom highlight styles and regex patterns to assign specific colors to uni
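Regex-driven highlighting of the kind described above can be sketched as a list of pattern and color pairs applied to each line. The rules, ANSI colors, and the `colorize` function below are illustrative assumptions, not Tailspin's own rule set or configuration format.

```python
import re

RESET = "\033[0m"

# Each rule pairs a compiled pattern with an ANSI color code (assumed choices).
RULES = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "\033[36m"),          # IPv4-like addresses, cyan
    (re.compile(r"\b(?:GET|POST|PUT|DELETE|PATCH)\b"), "\033[32m"),    # HTTP methods, green
]


def colorize(line):
    """Wrap every rule match in its color code followed by a reset."""
    for pattern, color in RULES:
        line = pattern.sub(lambda m, c=color: f"{c}{m.group(0)}{RESET}", line)
    return line
```

Feeding each line of a log stream through `colorize` before printing gives a basic highlighter; custom rules would be appended to `RULES`.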
Apkleaks is a static analysis tool and security auditor designed to extract hardcoded secrets, API endpoints, and sensitive data from Android application packages. It operates as a secret scanner that analyzes compiled binaries without executing them to identify potential information leaks and insecure endpoints. The tool utilizes a regex-based data extraction engine to identify sensitive strings within decompiled code. It supports customization through JSON-defined search patterns and provides configuration flags to tune the behavior of the underlying disassembler. The analysis pipeline enc