# brightmart/nlp_chinese_corpus

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/brightmart-nlp-chinese-corpus).**

9,903 stars · 1,554 forks · MIT

## Links

- GitHub: https://github.com/brightmart/nlp_chinese_corpus
- awesome-repositories: https://awesome-repositories.com/repository/brightmart-nlp-chinese-corpus.md

## Topics

`bert` `chinese` `chinese-corpus` `chinese-dataset` `chinese-nlp` `corpus` `dataset` `language-model` `news` `nlp` `pretrain` `question-answering` `text-classification` `wiki` `word2vec`

## Description

This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora.
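As a rough illustration of what "deduplicated" means for a news archive, the sketch below drops records whose body text repeats. It assumes one JSON object per line and hypothetical field names (`title`, `keywords`, `content`); neither the layout nor the exact-match hashing shown here is confirmed as the method the corpus authors used.

```python
import hashlib
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line (JSON Lines layout, assumed)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def dedupe(records: Iterable[dict], key: str = "content") -> Iterator[dict]:
    """Yield the first record seen for each distinct value of `key`.

    Exact-match deduplication on whitespace-normalized text. This is an
    illustration only, not the corpus authors' pipeline.
    """
    seen: set[str] = set()
    for record in records:
        # Collapse runs of whitespace so trivial spacing differences match.
        text = " ".join(str(record.get(key, "")).split())
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record


# Hypothetical sample records; the field names are assumptions.
sample = [
    '{"title": "A", "keywords": "x,y", "content": "same  body"}',
    '{"title": "B", "keywords": "z", "content": "same body"}',
    '{"title": "C", "keywords": "w", "content": "other body"}',
]
unique = list(dedupe(iter_records(sample)))
```

Storing only a digest per article keeps memory use modest when streaming millions of records, at the cost of missing near-duplicates that differ by more than whitespace.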

The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-based systems.
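A minimal sketch of how such data might be consumed, assuming each file stores one JSON object per line. The field names used here (`english`, `chinese`, `topic`, `title`, `content`) are assumptions for illustration and should be checked against the repository's README before use.

```python
import json
from collections import defaultdict
from typing import Iterable


def load_pairs(lines: Iterable[str],
               src_key: str = "english",
               tgt_key: str = "chinese") -> list[tuple[str, str]]:
    """Collect (source, target) sentence pairs, skipping incomplete records."""
    pairs = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        src, tgt = record.get(src_key), record.get(tgt_key)
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def group_by_label(lines: Iterable[str],
                   label_key: str = "topic") -> dict[str, list[tuple[str, str]]]:
    """Group (question, answer) tuples under their category label."""
    groups: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        label = record.get(label_key, "unlabeled")
        groups[label].append((record.get("title", ""), record.get("content", "")))
    return dict(groups)


# Hypothetical sample lines standing in for real corpus files.
parallel = [
    '{"english": "Hello.", "chinese": "你好。"}',
    '{"english": "", "chinese": "空"}',
]
qa = [
    '{"topic": "health", "title": "Q1", "content": "A1"}',
    '{"topic": "health", "title": "Q2", "content": "A2"}',
    '{"topic": "travel", "title": "Q3", "content": "A3"}',
]
pairs = load_pairs(parallel)
groups = group_by_label(qa)
```

In practice the `lines` argument would be an open file handle, so records are parsed one at a time rather than loaded into memory all at once.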

The corpora support applications such as news data analysis and, more generally, serve as a source of curated Chinese text for model training.

## Tags

### Artificial Intelligence & ML

- [Language Corpora](https://awesome-repositories.com/f/artificial-intelligence-ml/large-language-models/chinese-language-model-repositories/language-corpora.md) — Offers comprehensive curated datasets of news and conversational Chinese text for language model development. ([source](https://github.com/brightmart/nlp_chinese_corpus/blob/master/README.md))
- [Chinese Natural Language Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/chinese-natural-language-processing.md) — Provides large-scale curated Chinese text datasets for the analysis and synthesis of Chinese natural language.
- [Cross-Lingual Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/cross-lingual-alignment.md) — Structures aligned Chinese and English corpora to support cross-lingual model development and retrieval.
- [Training Datasets](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/training-datasets.md) — Provides large-scale curated text collections partitioned by source and quality for generative AI pre-training.
- [Parallel Sentence Alignment](https://awesome-repositories.com/f/artificial-intelligence-ml/parallel-sentence-alignment.md) — Pairs translated texts by mapping identical meanings across Chinese and English to create machine translation datasets.
- [Cross-Lingual Translation Training](https://awesome-repositories.com/f/artificial-intelligence-ml/text-translation-tools/cross-lingual-translation-training.md) — Provides aligned Chinese and English sentence pairs for training cross-lingual translation encoders. ([source](https://github.com/brightmart/nlp_chinese_corpus/blob/master/README.md))
- [Question Answering](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/document-data-intelligence/question-answering.md) — Provides high-quality, categorized question-and-answer pairs for training knowledge-extraction systems. ([source](https://github.com/brightmart/nlp_chinese_corpus#readme))
- [Question Answering](https://awesome-repositories.com/f/artificial-intelligence-ml/question-answering.md) — Supplies millions of filtered questions and answers categorized by label to build QA systems. ([source](https://github.com/brightmart/nlp_chinese_corpus/blob/master/README.md))

### Data & Databases

- [Parallel Corpus Mining](https://awesome-repositories.com/f/data-databases/data-mining/parallel-corpus-mining.md) — Supplies aligned sentence pairs extracted for use in building machine translation and cross-lingual models. ([source](https://github.com/brightmart/nlp_chinese_corpus#readme))
- [Parallel Translation Corpora](https://awesome-repositories.com/f/data-databases/data-mining/parallel-corpus-mining/parallel-translation-corpora.md) — Provides millions of aligned Chinese and English sentence pairs for machine translation training.
- [Filtering and Deduplication](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-transformation/filtering-deduplication.md) — Implements deduplication and filtering to remove redundant content from large news article crawls.
- [QA Pair Linking](https://awesome-repositories.com/f/data-databases/label-based-data-selection/qa-pair-linking.md) — Links specific questions to corresponding answers using category labels for building knowledge-based systems.

### Part of an Awesome List

- [Question Answering Datasets](https://awesome-repositories.com/f/awesome-lists/data/question-answering-datasets.md) — Offers high-quality pairs of categorized questions and answers for training and evaluating QA systems.
- [News Datasets](https://awesome-repositories.com/f/awesome-lists/media/news-aggregators/news-datasets.md) — Provides millions of deduplicated news articles with titles and keywords for model training. ([source](https://github.com/brightmart/nlp_chinese_corpus#readme))
- [Chinese News Corpora](https://awesome-repositories.com/f/awesome-lists/media/news-and-journalism/chinese-news-corpora.md) — Provides a deduplicated archive of millions of news articles with titles and keywords.
