# lancopku/pkuseg-python

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/lancopku-pkuseg-python).**

6,707 stars · 982 forks · Python · MIT

## Links

- GitHub: https://github.com/lancopku/pkuseg-python
- awesome-repositories: https://awesome-repositories.com/repository/lancopku-pkuseg-python.md

## Description

pkuseg-python is a Chinese word segmentation toolkit and natural language processing library. It provides specialized models for splitting Chinese text into words across several domains, including news, medical, and web content, and it includes a tagger that assigns part-of-speech tags to the segmented words.

The library supports training custom segmentation models from annotated datasets and accepts user-defined dictionaries so that specialized terminology is recognized correctly. It can also segment large Chinese text files in parallel by spreading the work across multiple worker processes (the count is set through the `nthread` parameter, despite the name).
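
The features described above map onto a small Python interface. The sketch below follows the `pkuseg.pkuseg(model_name=..., user_dict=..., postag=...)` constructor and `cut()` method shown in the project's README; the helper names `segmenter_options` and `segment` are hypothetical wrappers introduced here for illustration, and the model names should be checked against the installed version.

```python
def segmenter_options(domain="default", user_dict="default", postag=False):
    """Collect keyword arguments for pkuseg.pkuseg().

    domain    -- a model name such as "news", "web", or "medicine"
                 (names as listed in the README; verify for your version)
    user_dict -- "default", or a path to a word list with one entry per line
    postag    -- when True, cut() is expected to return (word, tag) pairs
    """
    return {"model_name": domain, "user_dict": user_dict, "postag": postag}


def segment(text, **options):
    """Segment one string with a pkuseg model built from the given options."""
    # Imported lazily so this module loads even where pkuseg is not installed.
    import pkuseg

    seg = pkuseg.pkuseg(**options)
    return seg.cut(text)
```

With pkuseg installed, `segment("我爱北京天安门", **segmenter_options("news"))` would load the news model (downloading it on first use) and return a list of words. Building the segmenter once and reusing it is preferable in real code, since model loading is the expensive step.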

## Tags

### Data & Databases

- [Chinese Language Segmenters](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/text-segmentation/chinese-language-segmenters.md) — Provides specialized tools for tokenizing and segmenting Chinese text across news, medical, and web domains. ([source](https://github.com/lancopku/pkuseg-python#readme))
- [Dictionary Constraints](https://awesome-repositories.com/f/data-databases/input-method-dictionaries/morpheme-segmentation-dictionaries/dictionary-constraints.md) — Implements dictionary-constrained segmentation to ensure specialized terminology is correctly recognized.
- [Parallel Text Processing](https://awesome-repositories.com/f/data-databases/parallel-text-processing.md) — Segments large volumes of Chinese text files in parallel using multiple worker processes.
- [Chinese POS Tagging](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/text-segmentation/chinese-pos-tagging.md) — Assigns grammatical categories to segmented Chinese words to facilitate deeper lexical and syntactic analysis.

### Artificial Intelligence & ML

- [Chinese NLP Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/chinese-nlp-libraries.md) — Offers a natural language processing library specifically designed for the linguistic analysis of the Chinese language.
- [Custom Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-model-training.md) — Enables the creation of specialized segmentation models using annotated datasets to improve accuracy for unique vocabularies. ([source](https://github.com/lancopku/pkuseg-python#readme))
- [NLP-Specific](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-model-training/nlp-specific.md) — Allows training of custom models on specialized linguistic datasets for domains like medicine or news.
- [Text Segmentation Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/text-segmentation-model-training.md) — Provides tools to train custom word segmentation models using domain-specific annotated datasets. ([source](https://github.com/lancopku/pkuseg-python/blob/master/readme/environment.md))
- [Word Segmentation Training](https://awesome-repositories.com/f/artificial-intelligence-ml/word-segmentation-training.md) — A machine learning toolkit for training custom segmentation models from annotated datasets and user dictionaries.
- [Custom Data Fine-Tunings](https://awesome-repositories.com/f/artificial-intelligence-ml/full-parameter-fine-tuning/custom-data-fine-tunings.md) — Provides the ability to adapt pre-trained segmentation models to specialized domains using annotated custom datasets.
- [Part-of-Speech Taggers](https://awesome-repositories.com/f/artificial-intelligence-ml/part-of-speech-taggers.md) — Assigns grammatical labels to segmented words to identify their linguistic category. ([source](https://github.com/lancopku/pkuseg-python#readme))
- [Part-of-Speech Tagging Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/part-of-speech-tagging-pipelines.md) — Ships a pipeline that appends grammatical category labels to segmented words as a secondary processing step.
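
As a rough illustration of the training workflow these tags describe, the sketch below prepares annotated data and hands it to `pkuseg.train`, whose positional arguments and `train_iter` / `init_model` keywords follow the project's README. The data format (one sentence per line, words separated by single spaces, UTF-8) is an assumption based on the MSR-style files the README uses, and the helper names are hypothetical.

```python
def format_training_data(sentences):
    """Render pre-segmented sentences as one space-separated line each."""
    return "".join(" ".join(words) + "\n" for words in sentences)


def save_training_data(sentences, path):
    """Write the formatted sentences to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(format_training_data(sentences))


def train_custom_model(train_path, test_path, save_dir,
                       iterations=20, init_model=None):
    """Train a model, or fine-tune one when init_model points to a model dir."""
    # Imported lazily so the formatting helpers work without pkuseg installed.
    import pkuseg

    pkuseg.train(train_path, test_path, save_dir,
                 train_iter=iterations, init_model=init_model)
```

Passing a pretrained model directory as `init_model` corresponds to the fine-tuning case, while leaving it as `None` trains from scratch. The resulting directory can then be given to `pkuseg.pkuseg(model_name=...)` as a path.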

### Part of an Awesome List

- [Domain Specific Models](https://awesome-repositories.com/f/awesome-lists/ai/domain-specific-models.md) — Offers specialized models tailored for specific text genres such as medical or news content.
- [Text Processing](https://awesome-repositories.com/f/awesome-lists/data/text-processing.md) — Processes large volumes of text stored in files using multiple worker processes. ([source](https://github.com/lancopku/pkuseg-python/blob/master/readme/interface.md))
- [Natural Language Processing](https://awesome-repositories.com/f/awesome-lists/ai/natural-language-processing.md) — Chinese word segmentation toolkit for various domains.

### Education & Learning Resources

- [Linguistic Constraints](https://awesome-repositories.com/f/education-learning-resources/word-dictionaries/linguistic-constraints.md) — Forces the recognition of user-defined words and phrases to maintain consistency across specialized vocabularies. ([source](https://github.com/lancopku/pkuseg-python/blob/master/readme/history.md))

### Programming Languages & Runtimes

- [Custom Dictionaries](https://awesome-repositories.com/f/programming-languages-runtimes/programming-utilities/data-text-processing/custom-dictionaries.md) — Integrates user-defined word lists during segmentation to ensure specialized terminology is recognized correctly. ([source](https://github.com/lancopku/pkuseg-python/blob/master/readme/interface.md))

### Software Engineering & Architecture

- [Batch Document Processing](https://awesome-repositories.com/f/software-engineering-architecture/batch-document-processing.md) — Distributes text segmentation tasks across multiple CPU cores to increase processing throughput for large file volumes.
- [Parallel Text Processing](https://awesome-repositories.com/f/software-engineering-architecture/parallel-text-processing.md) — Distributes word segmentation tasks across multiple worker processes to increase processing throughput. ([source](https://github.com/lancopku/pkuseg-python/blob/master/readme/multiprocess.md))
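
For file-level work the README documents a built-in call, `pkuseg.test('input.txt', 'output.txt', nthread=20)`. The sketch below is not that implementation; it is a generic illustration of the same idea using the standard library, with hypothetical helpers `chunk` and `segment_parallel` and a pluggable `segment_fn` standing in for the real segmenter.

```python
from multiprocessing import Pool


def chunk(lines, n_chunks):
    """Split lines into at most n_chunks contiguous, near-equal slices."""
    n_chunks = max(1, min(n_chunks, len(lines)))
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional line each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def segment_chunk(job):
    """Worker entry point: apply the segmenter to every line in one chunk."""
    segment_fn, lines = job
    return [segment_fn(line) for line in lines]


def segment_parallel(lines, segment_fn, n_workers=4):
    """Segment lines with n_workers processes, preserving input order.

    segment_fn must be picklable (for example a module-level function),
    because it is sent to the worker processes.
    """
    jobs = [(segment_fn, part) for part in chunk(lines, n_workers)]
    if n_workers <= 1:
        results = [segment_chunk(job) for job in jobs]  # serial fallback
    else:
        with Pool(n_workers) as pool:
            results = pool.map(segment_chunk, jobs)
    return [words for part in results for words in part]
```

On platforms that start workers with the spawn method (Windows, and macOS by default), any script that calls `segment_parallel` with more than one worker should do so under an `if __name__ == '__main__':` guard, which is the same precaution the project's multiprocess notes recommend.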