NLP Chinese Corpus | Awesome Repository

This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora.

The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-based systems.
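As a rough illustration of how aligned sentence pairs like these might be consumed, the sketch below reads pairs from a JSON Lines stream. The file layout and the field names "english" and "chinese" are assumptions made for this example, not details confirmed by the description above; check the actual data files before relying on them.

```python
import io
import json


def read_pairs(stream):
    """Yield (english, chinese) tuples from a JSON Lines stream.

    Assumes one JSON object per line with hypothetical keys
    "english" and "chinese"; blank lines are skipped.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["english"], record["chinese"]


# Small in-memory sample standing in for a real corpus file.
sample = io.StringIO(
    '{"english": "Good morning.", "chinese": "早上好。"}\n'
    "\n"
    '{"english": "Thank you.", "chinese": "谢谢。"}\n'
)

pairs = list(read_pairs(sample))
for english, chinese in pairs:
    print(english, "|", chinese)
```

Reading line by line keeps memory use flat, which matters when a file holds millions of pairs. For a file on disk, pass open(path, encoding="utf-8") in place of the in-memory sample.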

The repository supports a range of linguistic applications, including news data analysis and the general sourcing of curated Chinese text for model training.

Features

Language Corpora - Offers comprehensive curated datasets of news and conversational Chinese text for language model development.
Chinese Natural Language Processing - Provides large-scale curated Chinese text datasets for the analysis and synthesis of Chinese natural language.
Cross-Lingual Alignment - Structures aligned Chinese and English corpora to support cross-lingual model development and retrieval.
Training Datasets - Provides large-scale curated text collections partitioned by source and quality for generative AI pre-training.
