bitsandbytes is a deep learning quantization tool and library designed to reduce the memory footprint of large language models. It serves as a GPU memory optimizer and quantization framework, compressing model weights and features to 8-bit and 4-bit precision to enable inference and training on hardware with limited memory.
The project provides a framework for low-rank adaptation, allowing the fine-tuning of quantized models by combining 4-bit weights with small trainable matrices. It further distinguishes itself through memory paging, which moves optimizer states between CPU and GPU memory to prevent out-of-memory crashes during intensive training processes.
The library covers a broad range of optimization capabilities, including vector-wise and block-wise quantization for weights and optimizer states. It also supports weight sharding for distributed quantized training and specialized normalization to stabilize gradients within embedding layers.