QLoRA

This project is a quantized fine-tuning framework for large language models. It implements a low-rank adaptation library and a four-bit quantizer to reduce the GPU memory needed to train large models.
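To make the low-rank adaptation idea concrete, here is a minimal pure-Python sketch, not the project's actual code. It assumes the common LoRA formulation in which a frozen weight matrix W is augmented by a trainable product B·A scaled by alpha / rank. The class name `LoRALinear` and the helper `matvec` are hypothetical names chosen for this illustration.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Hypothetical illustration: frozen weight plus a trainable low-rank update."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen base weights, never updated
        self.scale = alpha / rank     # assumed LoRA scaling factor
        # A starts small and random, B starts at zero, so the adapter
        # initially leaves the base model's output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

rng = random.Random(1)
W = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(8)]
layer = LoRALinear(W, rank=2, alpha=4)
x = [1.0] * 8
print(layer.trainable_parameters(), "trainable vs", 8 * 8, "frozen")
```

With rank 2 on an 8 by 8 layer, only the two small factors are trained. The saving grows with layer size, because the adapter scales with rank × (d_in + d_out) rather than d_in × d_out.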

The framework uses four-bit quantization and low-rank adapters to enable model training on consumer-grade hardware. It further reduces the memory footprint through double quantization and a paged optimizer that offloads optimizer states to system RAM.
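The following sketch illustrates block-wise four-bit quantization and the storage cost of the per-block scaling constants. It is a simplification under stated assumptions, not the framework's quantizer. The 16 levels are placed at evenly spaced quantiles of a standard normal distribution, which only approximates the NormalFloat idea. The block size of 64, the second-level block size of 256, and the 8-bit second-level constants are assumptions taken from the setup described in the QLoRA paper. All function names are hypothetical.

```python
from statistics import NormalDist

def normal_levels(bits=4):
    """Return 2**bits levels at evenly spaced normal quantiles, scaled to [-1, 1]."""
    count = 2 ** bits
    dist = NormalDist()
    quantiles = [dist.inv_cdf((i + 0.5) / count) for i in range(count)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

LEVELS = normal_levels()

def quantize_block(block):
    """Map each value to the index of the nearest level after absmax scaling."""
    absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - v / absmax))
        for v in block
    ]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Recover approximate values from 4-bit codes and the block constant."""
    return [LEVELS[c] * absmax for c in codes]

def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Bits per parameter spent on quantization constants (assumed sizes)."""
    if not double_quant:
        return 32 / block_size  # one 32-bit constant per block
    # 8-bit first-level constants plus 32-bit second-level constants
    return 8 / block_size + 32 / (block_size * second_block_size)

weights = [0.02, -0.5, 0.13, 0.31, -0.07, 0.0, 0.44, -0.21]
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
print(codes)
print(constant_overhead_bits(), constant_overhead_bits(double_quant=True))
```

Under these assumed sizes, quantizing the constants themselves cuts their overhead from 0.5 bits to roughly 0.127 bits per parameter, which is the motivation for double quantization.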

The system supports distributed training across multiple GPUs to handle larger parameter scales and includes utilities for custom dataset loading. It also provides automated generation scoring to evaluate model performance against benchmarks.
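As a rough illustration of what data-parallel training computes, the sketch below splits one batch across several simulated workers, lets each compute a local gradient, and averages the results. It is a toy model under stated assumptions: a one-parameter linear model with mean squared error, equal-sized shards, and a plain Python average standing in for the all-reduce step a real multi-GPU setup would perform. The function names are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def shard(values, world_size):
    """Split a sequence into world_size equal contiguous shards."""
    size = len(values) // world_size
    return [values[r * size:(r + 1) * size] for r in range(world_size)]

def data_parallel_grad(w, xs, ys, world_size):
    """Each simulated worker computes a local gradient; the mean stands in for all-reduce."""
    local_grads = [
        grad_mse(w, shard_x, shard_y)
        for shard_x, shard_y in zip(shard(xs, world_size), shard(ys, world_size))
    ]
    return sum(local_grads) / world_size

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
full = grad_mse(0.5, xs, ys)
parallel = data_parallel_grad(0.5, xs, ys, world_size=4)
print(abs(full - parallel) < 1e-9)
```

Because the shards are equal in size, the averaged gradient matches the full-batch gradient up to floating-point rounding, so each worker can apply the same update while holding only its share of the batch.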

Features

Low-Rank Adaptation - Implements low-rank adaptation (LoRA) to train a small set of parameters while keeping the base model weights frozen.

Data-Parallel Training - Provides distributed data parallelism to split training workloads across multiple GPUs for larger model scales.

Distributed Training - Supports spreading large model training workloads across multiple graphics cards to increase parameter scale.

Large Language Model Fine-Tuning - Provides a complete framework for adapting large language models to specific tasks via quantized fine-tuning.

Quantized Fine-Tuning - Allows for training low-rank adapters over frozen four-bit weights to run large models on limited hardware.

Memory Optimization Techniques - Combines specialized quantization and paged optimizers to minimize GPU memory consumption and prevent allocation spikes.

Model Quantization - Reduces the memory footprint of neural networks through four-bit quantization while maintaining training performance.

Quantized Fine-Tuning Frameworks - Provides a comprehensive framework combining four-bit quantization and low-rank adapters for memory-efficient LLM training.

Nested Quantization - Implements double quantization of quantization constants to further minimize the memory footprint of the model.

NormalFloat Formats - Implements a specialized four-bit NormalFloat quantization format to maintain model accuracy while reducing GPU memory usage.

Weight Quantization Tools - Includes a four-bit quantizer to compress LLM weights, enabling training on hardware with limited GPU memory.

Optimizer State Offloading - Ships a paged optimizer that offloads states to system RAM to handle memory spikes and reduce GPU requirements.

Distributed GPU Computing - Implements a system for managing parallelism across multiple GPUs to increase the scale of trainable parameters.

GPU Resource Scaling - Provides capabilities to distribute model training across multiple graphics cards to handle massive parameter scales.

Gradient Computation - Uses bfloat16 precision for gradient calculations to keep the fine-tuning process numerically stable.

Large Model Training Utilities - Enables training of large models on consumer-grade hardware by using four-bit quantization to lower GPU memory requirements.

Large Language Models - Quantized LoRA for fine-tuning large models on consumer hardware.

LLM Training and Optimization - Efficient fine-tuning method using 4-bit quantization on consumer hardware.

Model Quantization - Efficient fine-tuning of quantized models using low-rank adapters.

Natural Language Processing - Listed in the “Natural Language Processing” section of the FunNLP awesome list.

Large Language Models (LLMs) - Listed in the “Large Language Models (LLMs)” section of The Incredible PyTorch awesome list.
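The Gradient Computation entry above mentions bfloat16. One reason this format is often preferred over float16 for gradients is that it keeps float32's 8-bit exponent, so large values do not overflow, at the cost of fewer mantissa bits. The sketch below emulates this with the standard library only. `to_bfloat16` and `fits_float16` are hypothetical helpers for illustration, and NaN inputs are not handled.

```python
import struct

def to_bfloat16(value):
    """Round a number to bfloat16 precision by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest, ties to even, on the 16 bits that are discarded.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def fits_float16(value):
    """Return True if the value can be packed as an IEEE half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False

large_gradient = 1e30
print(fits_float16(large_gradient))   # float16 cannot represent this magnitude
print(to_bfloat16(large_gradient))    # bfloat16 keeps it finite, with reduced precision
```

The trade-off is visible in the second printed value: the magnitude survives, but only about two to three significant decimal digits are retained.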

artidoro/qlora
