This project is a training pipeline for Chinese language models built on the Llama 2 architecture. It combines a distributed GPU trainer with a dataset preprocessing toolkit, and it covers both the initial pre-training of base models and subsequent supervised fine-tuning.
Its workflow is specialized for Chinese text. A data curation pipeline uses similarity hashing to deduplicate text, and a tokenization step converts raw text into memory-mapped binary files for efficient disk access. The supervised fine-tuning framework uses a masked loss, so that only the target answer tokens, not the input prompt tokens, contribute to the training objective.
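The masked-loss idea can be illustrated with a minimal, framework-free sketch. It is not the project's implementation: the function names `masked_cross_entropy` and `build_loss_mask`, the list-based inputs, and the 0/1 mask convention are all assumptions made for illustration. It only shows how prompt positions can be excluded from the averaged cross-entropy.

```python
import math


def build_loss_mask(prompt_len, answer_len):
    """Hypothetical helper: 0 for each prompt token, 1 for each answer token."""
    return [0] * prompt_len + [1] * answer_len


def masked_cross_entropy(logits, targets, loss_mask):
    """Mean cross-entropy over the positions where loss_mask is 1.

    logits:    list of per-position score lists (one score per vocabulary id)
    targets:   list of target token ids, one per position
    loss_mask: list of 0/1 flags (assumed convention: 0 = prompt, 1 = answer)
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # prompt position: contributes nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary scores.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]  # negative log-likelihood of target
        count += 1
    return total / count if count else 0.0


# Toy example: one prompt token followed by two answer tokens, vocabulary of 2.
logits = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
targets = [1, 1, 0]
mask = build_loss_mask(prompt_len=1, answer_len=2)
loss = masked_cross_entropy(logits, targets, mask)
```

Because the first position is masked out, changing its scores leaves `loss` unchanged. Tensor frameworks usually achieve the same effect by multiplying per-token losses by the mask before averaging.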
Other capabilities include distributed gradient synchronization across multiple compute nodes, learning rate scheduling with linear warmup followed by cosine decay, and gradient accumulation with mixed-precision gradient scaling. The project also provides utilities for structuring conversational data and for generating text under configurable sampling parameters.
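A schedule with linear warmup and cosine decay can be sketched as below. The function name `learning_rate`, its parameters, and the choice to hold the rate at `min_lr` once `decay_steps` is reached are assumptions for illustration, not details taken from the project.

```python
import math


def learning_rate(step, max_lr, min_lr, warmup_steps, decay_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    All parameter names are hypothetical. After decay_steps the rate
    stays at min_lr.
    """
    if step < warmup_steps:
        # Linear ramp from 0 at step 0 up to max_lr at warmup_steps.
        return max_lr * step / warmup_steps
    if step >= decay_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    # Cosine coefficient goes smoothly from 1 down to 0.
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


# Example: sample the schedule at a few steps of a 1000-step run.
schedule = [learning_rate(s, 3e-4, 3e-5, 100, 1000) for s in (0, 50, 100, 550, 1000)]
```

In this sketch the rate rises from zero to `max_lr` over the warmup, passes through the midpoint between `max_lr` and `min_lr` halfway through the decay phase, and settles at `min_lr`.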