Bitsandbytes

bitsandbytes is a quantization library for deep learning that reduces the memory footprint of large language models. It compresses model weights and features to 8-bit and 4-bit precision and manages optimizer memory, so that inference and training fit on hardware with limited memory.
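To make the memory savings concrete, the arithmetic can be sketched as below. The 7-billion-parameter model size and the helper name `weight_memory_gib` are assumptions chosen for illustration. The estimate covers the weights alone and ignores quantization constants, activations, and optimizer states.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Hypothetical model size, used only to illustrate the arithmetic.
params = 7_000_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(params, bits):.2f} GiB")
```

Halving the bits per parameter halves the weight memory, so 4-bit weights need roughly a quarter of the space of 16-bit weights.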

The project supports low-rank adaptation of quantized models: the 4-bit base weights stay frozen while small trainable matrices are fine-tuned on top of them. It also offers memory paging, which moves optimizer states between CPU and GPU memory to prevent out-of-memory crashes during memory-intensive training.
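The idea of pairing frozen base weights with small trainable matrices can be sketched in plain Python. This is a toy illustration of the low-rank adaptation forward pass, not the library's implementation. The function names, the `scaling` parameter, and the tiny matrices are assumptions, and `base_weight` stands in for weights that would be dequantized from 4-bit storage.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(base_weight, lora_a, lora_b, x, scaling=1.0):
    """Compute y = W x + scaling * B (A x).

    W is the frozen base weight; only the small matrices A and B
    would receive gradient updates during fine-tuning.
    """
    base_out = matvec(base_weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scaling * u for b, u in zip(base_out, update)]


base_weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # 2 x 3, frozen
lora_a = [[1.0, 1.0, 1.0]]                        # rank 1: 1 x 3
lora_b = [[0.5], [-1.0]]                          # 2 x 1
x = [1.0, 2.0, 3.0]

y = lora_forward(base_weight, lora_a, lora_b, x)
print(y)
```

When `lora_b` is all zeros, the output equals the base model's output, which is why adapters of this kind are commonly initialized that way.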

The library covers a broad range of optimization capabilities, including vector-wise and block-wise quantization for weights and optimizer states. It also supports weight sharding for distributed quantized training and specialized normalization to stabilize gradients within embedding layers.
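Block-wise quantization can be illustrated with a minimal sketch. It assumes linear int8 codes and one absolute-maximum scale per block; the function names and the block size of 4 are chosen for readability, and the library's actual data types and block sizes may differ.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to int8-range codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map codes back to floats using the scale of each code's block."""
    return [c * scales[i // block_size] / 127 for i, c in enumerate(codes)]


# The second half contains large values that would dominate a shared scale.
values = [0.01, -0.02, 0.03, 0.015, 50.0, -20.0, 10.0, 5.0]

codes, scales = quantize_blockwise(values, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# One block for the whole tensor: the small values all collapse to code 0.
coarse_codes, coarse_scales = quantize_blockwise(values, block_size=8)

print(codes, scales)
print(coarse_codes, coarse_scales)
```

Because each block carries its own scale, an outlier only degrades the precision of the block it sits in, rather than the whole tensor.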

Features

Deep Learning Quantization Tools - Provides a comprehensive set of vector-wise and block-wise quantization methods for memory-efficient inference and training.

GPU Memory Optimizers - Manages optimizer states and weights through paging and quantization to prevent out-of-memory errors.

Large Model Optimizations - Enables running massive neural networks on consumer GPUs through quantization and device mapping.

Low-Rank Adaptation - Provides a framework for low-rank adaptation to enable efficient fine-tuning of quantized models.

4-bit Adaptation Frameworks - Combines 4-bit quantization with low-rank adaptation to minimize training memory while preserving accuracy.

Quantized Fine-Tuning - Enables training of large models on limited hardware by operating on quantized base weights.

Weight Quantization - Compresses large language model weights to 8-bit and 4-bit precision to drastically reduce GPU memory usage.

Vector-Wise Quantization - Compresses model weights to 8-bit precision using vector-wise scaling to preserve numerical accuracy during inference.

Training Memory Optimizers - Reduces GPU memory footprint for large language models via weight and feature quantization.

Distributed Training Sharding - Distributes quantized model weights across multiple accelerators to maintain compatibility with parallel training.

8-bit Compression Tools - Compresses model features to 8-bit precision to reduce memory usage while maintaining performance.

Quantized Training - Integrates weight precision reduction directly into the training process to lower memory requirements across GPUs.

Block-Wise Quantization - Reduces the memory footprint of optimizer states using block-wise quantization to maintain precision.

Weight Sharding - Provides the ability to distribute 4-bit weights across multiple GPUs for compatibility with parallel training systems.

Optimizer State Offloading - Moves optimizer states between GPU memory and system RAM to prevent out-of-memory crashes.

GPU Memory Orchestration - Orchestrates the movement of optimizer states between host CPU memory and GPU memory to prevent out-of-memory crashes.

Performance Optimization - Library for k-bit quantization to optimize LLM memory usage.

Computation and Optimization - Lightweight wrapper for 8-bit optimizers and CUDA functions.
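The vector-wise quantization entry in the list above can be sketched as follows. This is a simplified illustration in plain Python, not the library's CUDA implementation. It assumes one absmax scale per row of the input and one per column of the weight, with an integer accumulation that is rescaled afterwards. The function names and the small matrices are hypothetical.

```python
def quantize_rows(matrix):
    """Quantize each row to int8-range codes with its own absmax scale."""
    codes, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.append([round(v / absmax * 127) for v in row])
    return codes, scales


def int8_matmul(x, w_t):
    """Approximate x @ w, where w_t holds the columns of w as rows."""
    xq, x_scales = quantize_rows(x)    # one scale per input row
    wq, w_scales = quantize_rows(w_t)  # one scale per weight column
    out = []
    for x_row, sx in zip(xq, x_scales):
        out_row = []
        for w_col, sw in zip(wq, w_scales):
            acc = sum(a * b for a, b in zip(x_row, w_col))  # integer accumulate
            out_row.append(acc * sx * sw / (127 * 127))     # rescale to float
        out.append(out_row)
    return out


x = [[1.0, -2.0], [0.5, 0.25]]
w_t = [[3.0, 1.0], [-1.0, 4.0]]  # columns of w stored as rows

approx = int8_matmul(x, w_t)
exact = [[sum(a * b for a, b in zip(row, col)) for col in w_t] for row in x]
print(approx)
print(exact)
```

Giving every row and column its own scale keeps the rounding error proportional to the magnitude of that vector, which is the motivation for vector-wise scaling over a single tensor-wide scale.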

