Baby Llama2 Chinese

This project is a training pipeline and framework for developing Chinese language models based on the Llama 2 architecture. It combines a distributed GPU trainer with a dataset preprocessing toolkit and covers both the pre-training of base models and subsequent supervised fine-tuning.

The workflow is tailored to Chinese text. A data curation pipeline deduplicates the corpus with similarity hashing, and a tokenization step converts raw text into binary files that are memory-mapped for efficient disk access. The supervised fine-tuning stage uses a masked loss, so the loss is computed on the target answer tokens rather than on the input prompt tokens.
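
The masked-loss idea can be sketched in a few lines. This is a minimal, framework-agnostic illustration in NumPy, not the project's own code: the function names `build_sft_sample` and `masked_cross_entropy` are hypothetical, control tokens are left out, and the prompt is assumed to be non-empty.

```python
import numpy as np

def build_sft_sample(prompt_ids, answer_ids):
    """Concatenate prompt and answer; mark only answer positions for the loss."""
    ids = np.array(list(prompt_ids) + list(answer_ids), dtype=np.int64)
    x, y = ids[:-1], ids[1:]  # inputs and next-token targets
    mask = np.zeros(len(y), dtype=np.int64)
    # y[i] is the token at position i + 1 of ids, so the first answer token
    # sits at index len(prompt_ids) - 1 of y.
    mask[max(len(prompt_ids) - 1, 0):] = 1
    return x, y, mask

def masked_cross_entropy(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask == 1.

    logits:    (seq_len, vocab_size) float array
    targets:   (seq_len,) int array of next-token ids
    loss_mask: (seq_len,) array, 1 for answer tokens, 0 for prompt tokens
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Per-token negative log-likelihood of the target id.
    nll = -log_probs[np.arange(len(targets)), targets]
    mask = loss_mask.astype(nll.dtype)
    # Average over answer tokens only; prompt tokens contribute nothing.
    return (nll * mask).sum() / max(mask.sum(), 1.0)
```

Because masked positions are multiplied by zero before averaging, changing the model's predictions on prompt tokens leaves the loss unchanged, which is the behavior the masked loss is meant to provide.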

Broader capabilities include distributed gradient synchronization across multiple compute nodes, learning rate scheduling with linear warmup and cosine decay, and gradient accumulation combined with precision scaling. The project also provides utilities for structuring conversational data and for generating text with configurable sampling parameters.
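
A schedule with linear warmup and cosine decay can be written as a small pure function. The sketch below is illustrative only: the name `get_lr` and all default values (`base_lr`, `min_lr`, `warmup_steps`, `decay_steps`) are assumptions chosen for the example, not settings taken from the project.

```python
import math

def get_lr(step, base_lr=3e-4, min_lr=3e-5, warmup_steps=1000, decay_steps=80000):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    # 1) Linear warmup: ramp from 0 toward base_lr over warmup_steps.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # 2) Past the decay horizon: hold the learning rate at the floor.
    if step > decay_steps:
        return min_lr
    # 3) Cosine decay between warmup_steps and decay_steps.
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + coeff * (base_lr - min_lr)
```

In a training loop, the returned value would typically be written into the optimizer's parameter groups before each update step.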

Features

Chinese Language Model Pre-training - Trains base transformer models on large Chinese text corpora to establish baseline linguistic capabilities.

Training Pipelines - Implements end-to-end workflows for pre-training and fine-tuning language models specifically for Chinese linguistic contexts.

Chinese Language Models - Develops transformer-based models specifically optimized for Chinese language understanding and generation.

Dataset Preprocessing Tools - Provides a toolkit for cleaning, deduplicating, and tokenizing raw text corpora into optimized formats.

Distributed ML Trainers - Provides a scalable system for distributing the training of language models across multiple GPU compute nodes.

Distributed Gradient Synchronization - Implements mechanisms to coordinate gradient updates across multiple GPU nodes to accelerate distributed training.

Instruction Fine-tuning - Implements supervised fine-tuning on instruction-response pairs with a masked loss to improve instruction following.

Language Model Pre-training - Trains transformer models on large-scale Chinese text corpora to learn base linguistic patterns and knowledge.

GPU Training Accelerators - Accelerates training by synchronizing gradients across multiple compute nodes and GPUs using parallelization strategies.

Distributed Training - Scales the training process across multiple compute nodes and GPUs using distributed gradient synchronization.

Dialogue Loss Masking - Calculates training loss specifically on target response tokens while masking the input prompt during supervised fine-tuning.

Training Execution Loops - Executes training loops utilizing gradient accumulation and precision scaling to process large datasets.

Supervised Fine-Tuning - Adapts pre-trained models to follow instructions using labeled prompt-answer pairs and loss masking.

Supervised Fine-Tuning Frameworks - Ships a framework for aligning pretrained models to follow instructions using labeled prompt-answer pairs.

Transformer Architecture Implementation - Implements a transformer architecture using self-attention mechanisms and positional embeddings for sequential text processing.

Linguistic Baselines - Establishes baseline linguistic capabilities by training small-scale models on raw Chinese text.

SFT Sample Preparations - Formats supervised data with control tokens and loss masks to focus model learning on target answers.

Text Dataset Curators - Provides pipelines to filter short text and deduplicate data to create high-quality training corpora.

Learning Rate Schedulers - Implements dynamic learning rate adjustment using linear warmup and cosine decay for stable convergence.

Cosine Warmup Schedules - Utilizes a learning rate schedule that combines linear warmup with cosine decay for stable model convergence.

Conversational Adaptation - Adapts pre-trained models for conversational tasks through full parameter updates on instruction datasets.

Model Training Optimizers - Optimizes training convergence using learning rate decay and gradient clipping within the training loop.

Tokenization Pipelines - Implements a sequential pipeline that converts raw text into binary, memory-mapped formats for training.

Text Similarity Scoring - Identifies and removes redundant training entries using similarity hashing and algorithmic text comparison.

Text Tokenizers - Processes raw text into tokenized sequences stored in binary formats for fast disk access.

Gradient Accumulation Strategies - Simulates larger batch sizes by accumulating gradients over multiple steps, with precision scaling to keep reduced-precision gradients numerically stable.

Model Development - Develops text completion capabilities by training small-parameter base models on large corpora.

Intra-Dataset Deduplication - Removes redundant training entries within the dataset using similarity hashing to ensure high data quality.

Dataset Tokenization Tools - Converts raw text into tokenized binary formats for efficient large-scale dataset ingestion.

Memory-Mapped File Access - Maps tokenized binary files directly into the process address space for high-performance disk access during training.
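
The tokenize-to-binary and memory-mapped access steps described in the entries above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names are hypothetical, and `uint16` storage assumes a vocabulary of fewer than 65,536 tokens.

```python
import os
import tempfile
import numpy as np

def write_token_file(path, token_ids, dtype=np.uint16):
    """Dump a flat stream of token ids to a raw binary file."""
    np.asarray(list(token_ids), dtype=dtype).tofile(path)

def open_token_file(path, seq_len, dtype=np.uint16):
    """Memory-map the file and view it as rows of seq_len tokens."""
    data = np.memmap(path, dtype=dtype, mode="r")
    n_rows = len(data) // seq_len  # drop the incomplete tail, if any
    return data[: n_rows * seq_len].reshape(n_rows, seq_len)

def get_example(rows, i):
    """Return (inputs, targets) for row i, shifted by one token."""
    # Copying one row into a regular array touches only that part of the file.
    row = np.asarray(rows[i], dtype=np.int64)
    return row[:-1], row[1:]

if __name__ == "__main__":
    # Hypothetical usage with stand-in token ids instead of a real tokenizer.
    demo_path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
    write_token_file(demo_path, range(10))
    rows = open_token_file(demo_path, seq_len=4)
    print(rows.shape, get_example(rows, 0))
```

The operating system pages in only the rows that are actually indexed, so a corpus much larger than RAM can be sampled without loading the whole file.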

DLLXW/baby-llama2-chinese