# plexpt/chatgpt-corpus

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/plexpt-chatgpt-corpus).**

964 stars · 146 forks · GPL-3.0

## Links

- GitHub: https://github.com/PlexPt/chatgpt-corpus
- Homepage: https://chat.aimakex.com/
- awesome-repositories: https://awesome-repositories.com/repository/plexpt-chatgpt-corpus.md

## Topics

`awesome` `corpus` `corpus-data` `question-answering`

## Description

This project provides a Chinese-language corpus for training and fine-tuning large language models. It is a structured natural language processing resource whose text data includes dialogue, customer service interactions, and creative writing.

The dataset is organized into distinct thematic categories, so models can be developed for specific conversational and narrative contexts. Because the data is provided in standardized, schema-agnostic text formats, it is portable across machine learning frameworks and training environments.

The corpus supports research and development in natural language understanding by offering normalized text ready for subword tokenization. The materials are structured for batch loading, which helps when preparing diverse datasets for large-scale generative AI training.
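As a rough illustration of the batch loading mentioned above, the sketch below walks a directory of plain-text files and yields fixed-size batches of non-empty lines. The file layout (UTF-8 `.txt` files, one sample per line), the function name `iter_batches`, and its parameters are assumptions made for this example; they are not taken from the repository, so adjust them to match the actual files.

```python
from pathlib import Path
from typing import Iterator, List


def iter_batches(
    root: str, batch_size: int = 32, pattern: str = "*.txt"
) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` non-empty lines from text files under `root`.

    Assumption: each file is UTF-8 text with one sample per line.
    """
    batch: List[str] = []
    # Sort the paths so the batch order is reproducible between runs.
    for path in sorted(Path(root).rglob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                text = line.strip()
                if not text:
                    continue  # skip blank lines
                batch.append(text)
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    # Emit the final, possibly smaller, batch.
    if batch:
        yield batch
```

Each yielded batch is a plain list of strings, so it can be passed to whichever subword tokenizer the training pipeline uses. For example, `for batch in iter_batches("chatgpt-corpus", batch_size=64): ...` would iterate over a local checkout, where the directory name is hypothetical.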

## Tags

### Artificial Intelligence & ML

- [Language Corpora](https://awesome-repositories.com/f/artificial-intelligence-ml/large-language-models/chinese-language-model-repositories/language-corpora.md) — Provides a comprehensive dataset of Chinese text covering multiple domains to support research and development.
- [Training Datasets](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/training-datasets.md) — Supplies large-scale text collections including dialogue and creative writing to improve the performance of large language models. ([source](https://github.com/plexpt/chatgpt-corpus#readme))
- [Large Language Model Training Resources](https://awesome-repositories.com/f/artificial-intelligence-ml/artificial-intelligence-research/large-language-model-training-resources.md) — Organizes diverse Chinese text datasets to support the pre-training and fine-tuning of large language models.
- [Natural Language Processing Resources](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/natural-language-processing-resources.md) — Serves as a structured natural language processing resource to improve the performance and accuracy of generative AI models.
- [Conversational AI](https://awesome-repositories.com/f/artificial-intelligence-ml/conversational-ai.md) — Curates dialogue and customer service interaction datasets to improve the natural language understanding of conversational AI systems.
- [Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-fine-tuning.md) — Provides specialized fiction and narrative datasets to enhance the storytelling capabilities of generative AI models through fine-tuning.
- [Natural Language Processing Datasets](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/machine-learning-datasets/natural-language-processing-datasets.md) — Offers structured Chinese text corpora to facilitate linguistic analysis and evaluation of machine learning models.

### Part of an Awesome List

- [Pre-training Datasets](https://awesome-repositories.com/f/awesome-lists/data/pre-training-datasets.md) — Provides a large-scale Chinese language text collection including dialogue and fiction for training large language models.
- [Prompt Engineering](https://awesome-repositories.com/f/awesome-lists/ai/prompt-engineering.md) — Large-scale datasets for training and fine-tuning language models.
- [Instruction Datasets](https://awesome-repositories.com/f/awesome-lists/data/instruction-datasets.md) — Large-scale self-instruct dataset generated by ChatGPT.

### Data & Databases

- [Batched Data Loading](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration/data-engineering-pipelines/batched-data-loading.md) — Provides structured text data formatted for efficient batch loading into machine learning training pipelines.
