30 open-source projects similar to lancopku/pkuseg-python, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Pkuseg Python alternative.
ansj_seg is a Java NLP toolkit and segmentation library designed for processing Chinese text. It functions as a word segmenter, part-of-speech tagger, and named entity recognizer to divide continuous Chinese characters into meaningful words and tokens. The library utilizes statistical models for text segmentation and provides capabilities for identifying and extracting person names from unstructured documents. It also assigns grammatical categories to tokens to determine their linguistic roles within a sentence. The toolkit supports domain-specific text processing through the use of custom d
This project is a Chinese text segmentation library and tokenizer designed to split Chinese sentences into individual words. It serves as a natural language processing tool for splitting characters into words, tagging parts of speech, and extracting keywords using statistical analysis. The library distinguishes itself through support for custom dictionary configuration and vocabulary file management, allowing users to override default segmentation rules for domain-specific accuracy. It also includes a TF-IDF keyword extractor to identify significant words and core topics within documents. Th
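As a rough illustration of the TF-IDF keyword extraction mentioned in this entry, here is a minimal sketch in plain Python. It is not this library's API: the function name `tfidf_keywords`, the toy corpus, and the unsmoothed IDF formula are all assumptions chosen for illustration, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of one pre-segmented document by TF-IDF (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each word at least once.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    # Term frequency times unsmoothed inverse document frequency.
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken by the word itself for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy corpus, already split into words (assumed data).
docs = [
    ["分词", "工具", "支持", "词典"],
    ["分词", "模型", "训练"],
    ["关键词", "提取", "工具"],
]
print(tfidf_keywords(docs, 0, top_k=2))
```

Words that occur in fewer documents of the corpus receive a higher IDF weight, so they rank above words shared across documents.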
SnowNLP is a Python library for Chinese natural language processing. It provides tools for text segmentation, sentiment analysis, document classification, and phonetic transliteration. The library includes capabilities for training and saving custom machine learning models for tokenization and sentiment analysis using raw training datasets. It covers a range of linguistic processing areas, including parts of speech tagging, sentence splitting, and text similarity measurement. The toolkit also provides utilities for extracting key information through text summarization and calculating word im
This is a Chinese natural language processing toolkit providing a suite of tools for word segmentation, part-of-speech tagging, and named entity recognition. It includes a neural dependency parser for analyzing syntactic and semantic relationships between words and a machine learning training suite for creating custom linguistic models using annotated datasets. The toolkit distinguishes itself through its deployment flexibility, offering a dockerized server and a web service interface that exposes processing capabilities via API. It supports the use of pretrained models and allows for the int
LAC is a Chinese lexical analysis engine and toolkit designed for joint word segmentation, part-of-speech tagging, and named entity recognition. It functions as a high-performance system that identifies word boundaries and grammatical categories using trained machine learning models. The project features a lightweight, compiled native runtime that enables on-device natural language processing and embedding into mobile applications. It includes model compression and conversion to optimize for resource-constrained environments and supports multi-threaded parallel execution to increase throughpu
HanLP is a natural language processing library and deep learning framework specifically optimized for the Chinese language, while also functioning as a multilingual text processor. It serves as a toolkit for performing linguistic analysis, semantic understanding, and script conversion. The project distinguishes itself through a dedicated focus on Chinese linguistic structures, including a specialized script converter for transforming text between Simplified Chinese, Traditional Chinese, and Pinyin. It further supports domain-specific model training to improve the recognition of professional t
Stanza is a Python natural language processing library designed for tokenization, lemmatization, and dependency parsing across many human languages using neural models. It provides a neural processing pipeline that converts raw text into structured linguistic data objects, alongside a specialized analyzer for extracting medical insights from clinical and biomedical language. The project includes a wrapper that connects Python scripts to Java-based natural language processing tools and remote annotation servers. This enables a bridge for extracting linguistic annotations and analysis data from
Synonyms is a natural language processing library and semantic similarity engine specifically designed for Chinese text. It functions as a word embedding toolkit and tokenizer that extracts semantic meaning and identifies synonyms by calculating the conceptual closeness between words and sentences. The system provides a toolkit for Chinese word embedding and synonym discovery, allowing for the retrieval of semantically similar words to expand vocabulary. It distinguishes itself through a configuration-driven approach to model loading, which supports the integration of custom word embeddings t
Compromise is a natural language processing library and rule-based text parser designed to analyze unstructured text. It functions as a toolkit for identifying parts of speech, linguistic patterns, and semantic meaning, while providing specialized engines for named entity recognition and the parsing of temporal and numeric data. The project is distinguished by its linguistic morphological engine, which can conjugate verbs across different tenses and inflect nouns and adjectives. It further allows for linguistic model customization through a plugin system that enables the extension of lexicons
Analysis-ik is a Chinese text segmenter and analysis plugin for Lucene-based search engines. It provides a specialized analyzer for splitting Chinese sentences into meaningful words to improve indexing and search accuracy within Elasticsearch and OpenSearch. The project features a dynamic dictionary manager that can load word libraries and stop-word files from remote HTTP endpoints. It monitors metadata headers on these remote files to trigger automatic vocabulary updates without requiring a service restart. The analyzer supports both fine-grained exhaustive and coarse-grained smart segmenta
Synonyms is a Chinese natural language processing tool focused on semantic analysis. It provides capabilities for Chinese word segmentation, part-of-speech tagging, and the retrieval of synonyms based on semantic proximity. The project converts words and sentences into numerical vector representations to calculate similarity scores. This allows for the determination of semantic proximity between different phrases and the identification of chatbot intent through sentence comparison. The system also includes tools for automated keyword extraction and importance ranking to identify significant
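The idea of turning sentences into vectors and comparing them can be sketched as follows. This is only an assumed approach (averaging word vectors, then cosine similarity), not necessarily what this project does internally; the names `sentence_vector` and `cosine_similarity` and the two-dimensional toy embeddings are hypothetical.

```python
import math


def sentence_vector(words, embeddings):
    """Average the vectors of the words that have an embedding (hypothetical helper)."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy two-dimensional embeddings (assumed data, not real model output).
embeddings = {"你好": [1.0, 0.0], "您好": [0.9, 0.1], "天气": [0.0, 1.0]}
greeting = sentence_vector(["你好"], embeddings)
polite = sentence_vector(["您好"], embeddings)
weather = sentence_vector(["天气"], embeddings)
print(cosine_similarity(greeting, polite) > cosine_similarity(greeting, weather))
```

A score near 1 indicates vectors pointing in nearly the same direction; a score near 0 indicates unrelated directions.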
This project is a curated collection of Chinese names, surnames, and kinship terms designed for linguistic analysis and natural language processing. It functions as a multilingual name dataset and a training resource for named entity recognition, providing a unified repository of names across Chinese, Japanese, and English languages. The project includes a synthetic name generator that creates realistic person names by applying analyzed naming patterns and demographic data. It also provides a cleaned Chinese idiom lexicon gathered and deduplicated from multiple sources. The available data su
KnowledgeGraphData is a collection of structured datasets and corpora designed to provide a foundational layer for cognitive intelligence and artificial intelligence systems. It primarily consists of large-scale Chinese knowledge graph datasets, including entity-relation data and NLP training sets used to drive semantic understanding and automated question answering. The project focuses on the construction and export of massive entity-attribute-value graphs, organizing knowledge into portable formats. It provides specialized domain partitioning to tailor information retrieval for professional
This project is a collection of pre-trained dense and sparse word vectors trained on diverse Chinese corpora. It serves as a library of linguistic representations and an NLP vector dataset designed to improve the accuracy of semantic and morphological analysis in text models. The collection provides corpus-specific representations and utilizes n-gram co-occurrence modeling to capture diverse linguistic patterns. It includes a hybrid of dense-sparse vectors to balance computational efficiency and semantic precision. The project covers semantic vector search and the development of Chinese natu
Chinese-BERT-wwm is a pre-trained transformer model and encoder designed for Chinese natural language processing. It converts Chinese text into dense vector representations to be used across various natural language processing applications. The model utilizes a whole word masking strategy during pre-training, masking entire words rather than individual characters. This approach is designed to improve the capture of semantic meaning and language structure within Chinese datasets. The project covers a range of downstream tasks including text classification, sequence labeling, and reading compr
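The difference between whole word masking and character-level masking can be shown with a small sketch. This is a simplified illustration, not the project's pre-training code: the function name `whole_word_mask`, the fixed seed, and the choice to always substitute `[MASK]` (real BERT-style pre-training also mixes in random and unchanged tokens) are assumptions.

```python
import random


def whole_word_mask(words, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Mask every character of each selected word, never isolated characters."""
    rng = random.Random(seed)
    # Select whole words, not characters, as the masking unit.
    n_to_mask = max(1, round(len(words) * mask_rate))
    chosen = set(rng.sample(range(len(words)), n_to_mask))
    tokens = []
    for i, word in enumerate(words):
        for ch in word:
            # Chinese BERT tokenizes per character, so a masked word
            # contributes one mask token per character.
            tokens.append(mask_token if i in chosen else ch)
    return tokens


# Input is assumed to be pre-segmented into words.
words = ["使用", "语言", "模型", "来", "预测"]
print(whole_word_mask(words, mask_rate=0.2))
```

Because the selection happens at the word level, a two-character word is either fully masked or fully visible.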
Flair is a natural language processing framework for training and applying models for sequence labeling and text classification. It provides a system for generating word embeddings and identifying semantic entities within text. The framework includes a dedicated system for zero-shot and few-shot learning, enabling text classification and entity extraction using minimal training examples by leveraging pre-trained knowledge. Its capabilities cover named entity recognition, sentiment analysis, and the training of specialized models using custom datasets. It also includes tooling for the visual highl
WizardLM is a large language model and instruction-tuning framework designed to execute sophisticated coding, mathematical, and conversational tasks. It functions as an AI system for mathematical reasoning and code generation, as well as a synthetic dataset generator used to train other language models. The project is distinguished by its evolutionary instruction tuning, which uses a method to rewrite simple instructions into complex tasks. This process expands training dataset difficulty and produces a high volume of open-domain tasks across various difficulty levels. The system covers capa
This project is a comprehensive Python toolkit designed for natural language processing, research, and education. It functions as a linguistic data processor that provides a standardized framework for managing, cleaning, and analyzing large collections of annotated text corpora and lexical resources. The library distinguishes itself through its integration of both symbolic and statistical methods, allowing users to perform complex tasks ranging from rule-based grammar parsing to machine learning-driven classification. It offers a modular pipeline for text processing, enabling the transformati
CoreNLP is a Java natural language processing library designed to convert raw human language text into structured data. It utilizes a suite of linguistic annotators to analyze text through a pipeline, extracting grammatical structures, sentiment, and linguistic patterns. The project includes a coreference resolution engine that links multiple mentions of the same entity to maintain contextual consistency across documents. It also provides tools for named entity recognition to categorize people, companies, and locations, and a part-of-speech tagger to assign grammatical categories and base for
Flair is a transformer-based natural language processing framework used to build and train models for text classification and sequence tagging. It provides a specialized library for generating contextual text embeddings and performing linguistic analysis. The framework includes dedicated tools for named entity recognition, including the identification of specialized biomedical entities across multiple languages. It further supports entity linking to map identified text mentions to unique entries within general or biomedical knowledge bases. The project covers a broad range of language analys
DeepKE is a knowledge extraction toolkit and framework designed to transform unstructured text into structured knowledge graphs. It provides a pipeline for identifying and classifying named entities, semantic relations, and events, converting raw datasets into structured triples. The project utilizes large language models as tool callers through a standardized context protocol to drive automated data extraction processes. It supports schema-driven extraction across multiple domains and bilingual text, employing joint entity and relation extraction to identify components in a single structured
Pattern is a Python web mining library that functions as an HTML web scraper, a natural language processing toolkit, and a network analysis tool. It provides a mathematical framework for categorizing datasets through a vector space model library. The project enables the extraction of structured data from web services and the creation of searchable web content indexes. It processes unstructured text using sentiment analysis, part-of-speech tagging, and n-gram searching. The library covers machine learning classification through the training of models using perceptron algorithms and support ve
This project is a comprehensive suite for neural speech synthesis, featuring a deep learning text-to-speech engine, a neural speech synthesis trainer, and a voice cloning toolkit. It provides a system for synthesizing human-like speech from text using neural network models and high-fidelity vocoders. The suite includes a speech model conversion utility to transform deep learning models between different formats for deployment across various hardware runtimes. It also provides a self-contained HTTP server to expose pre-trained text-to-speech models as a remote audio API. Capabilities include
TextBlob is a natural language processing library that provides a unified interface for common linguistic tasks. It operates as a wrapper-based API, simplifying the use of complex processing libraries by delegating core operations to specialized external frameworks. The project features a pluggable processing pipeline that allows for the integration of custom logic and alternative language engines. It supports the extension of processing models through plugins to add specific language support or custom data processing. The library covers a broad range of linguistic capabilities, including se
Spark NLP is a toolkit for scalable text analysis and machine learning built on the Apache Spark distributed computing framework. It provides a multimodal machine learning framework and a distributed pipeline system for sequencing annotators to process large-scale linguistic data. The library includes a transformer text processor for generating contextual vector embeddings and a dedicated inference engine for managing large language models. The project distinguishes itself through its ability to process heterogeneous data types, including text, audio, and images, within a unified vision-langu
Huatuo-Llama-Med-Chinese is a medical large language model specialized in processing and generating natural language text in Chinese. It is an instruction-tuned system designed to answer professional healthcare questions by leveraging a dedicated medical knowledge base. The model integrates structured medical literature and knowledge graphs to ensure clinical accuracy during response generation. It employs knowledge-graph augmented inference to combine structured entity relationships with neural network outputs. The system is developed through domain-specific weight adaptation, cross-lingual
This project is a high-performance library for converting raw text into tokens and IDs for machine learning models. It functions as a fast text encoder and a text preprocessing pipeline designed to transform strings into numerical representations with high throughput for research and production. The library includes a subword tokenizer trainer used to analyze text datasets and create custom vocabularies using algorithms such as byte-pair encoding and wordpiece. It provides capabilities for subword vocabulary training and text alignment, allowing character offsets to be tracked during normaliz
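To make the byte-pair encoding training step mentioned in this entry concrete, here is a toy sketch in plain Python. It is not this library's implementation or API: `train_bpe` is a hypothetical name, the starting symbols are characters rather than bytes, and the tie-breaking rule is an assumption added to keep the output deterministic.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn a list of BPE merge rules from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the weighted vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


corpus = ["low", "low", "lower", "newest", "newest", "newest"]
print(train_bpe(corpus, 2))
```

Each merge adds one new subword symbol to the vocabulary, so the number of merges controls the final vocabulary size.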
LanguageTool is a multilingual grammar and style checking engine designed to detect spelling, grammar, and writing errors across multiple languages. It provides automated proofreading capabilities that can be deployed as a self-hosted server or executed as a standalone local desktop application. The project distinguishes itself through a flexible rule development framework, allowing linguistic patterns to be defined via XML or implemented as custom Java classes. It utilizes n-gram frequency modeling for confused word detection and supports neural word embeddings to improve disambiguation betw
This repository is a comprehensive educational program and deep learning framework designed to teach practical deep learning using PyTorch through notebooks and code examples. It serves as a high-level library for building, training, and deploying neural networks, acting as a model training orchestrator that coordinates PyTorch models, optimizers, and loss functions. The project provides specialized toolkits for computer vision, natural language processing, and tabular data preprocessing. It distinguishes itself through advanced training controls such as discriminative learning rates, a two-w