Bitsandbytes

bitsandbytes is a quantization library for large language models that reduces memory footprints using k-bit quantization. It provides a framework for 4-bit low-rank adaptation, tools for 8-bit model compression, and memory-efficient optimizer extensions for PyTorch.

The project enables training large models on limited hardware by combining 4-bit quantization of the base weights with low-rank adaptation (LoRA). It also supports faster inference by compressing models to 8-bit precision using vector-wise quantization.
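As a rough illustration of the vector-wise idea, the sketch below quantizes each row of the left matrix and each column of the right matrix with its own absmax scale, multiplies in integers, and rescales the result. It is plain Python written for this page and not the library's kernel code; the helper names `quantize_rows` and `int8_matmul` are hypothetical.

```python
def quantize_rows(matrix):
    """Quantize each row to the int8 range [-127, 127] with its own absmax scale."""
    quantized, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row) or 1.0  # avoid dividing by zero on all-zero rows
        scale = absmax / 127.0
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def int8_matmul(a, b):
    """Approximate a @ b with one scale per row of a and one per column of b."""
    b_columns = [list(col) for col in zip(*b)]
    qa, row_scales = quantize_rows(a)
    qb, col_scales = quantize_rows(b_columns)
    out = []
    for i, q_row in enumerate(qa):
        out_row = []
        for j, q_col in enumerate(qb):
            acc = sum(x * y for x, y in zip(q_row, q_col))  # integer accumulation
            out_row.append(acc * row_scales[i] * col_scales[j])  # undo both scales
        out.append(out_row)
    return out


# The two rows of `a` (and the two columns of `b`) differ widely in magnitude,
# which is the case where per-vector scales help most.
a = [[0.5, -1.2, 3.0], [0.01, 0.02, -0.03]]
b = [[1.0, 200.0], [-2.0, 100.0], [0.5, -50.0]]

approx = int8_matmul(a, b)
exact = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
max_error = max(abs(p - q) for r1, r2 in zip(approx, exact) for p, q in zip(r1, r2))
print("largest absolute error:", max_error)
```

Because every row and column carries its own scale, the small-valued second row of `a` keeps its resolution instead of being rounded away by a scale chosen for the larger first row.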

The library covers a range of memory optimization capabilities, including block-wise quantization of optimizer states and general model compression that lowers video memory requirements while maintaining output quality.
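Block-wise quantization can be sketched in a few lines: the tensor is split into fixed-size blocks and each block is quantized against its own absolute maximum, so one large value only affects the block it sits in. The example below is a simplified linear absmax version in plain Python; the function names and the tiny block size are chosen for illustration, and the library's actual quantization maps and block sizes differ.

```python
def quantize_blockwise(values, block_size=4, bits=8):
    """Quantize a flat list block by block, storing one absmax scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # guard against all-zero blocks
        scales.append(absmax)
        codes.extend(round(v / absmax * qmax) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4, bits=8):
    """Map integer codes back to floats using the scale of the block they came from."""
    qmax = 2 ** (bits - 1) - 1
    return [c / qmax * scales[i // block_size] for i, c in enumerate(codes)]


# A hypothetical optimizer state: one block of tiny values, one block of large values.
state = [0.001, -0.002, 0.0015, 0.0005, 5.0, -3.0, 1.0, 0.25]

codes, scales = quantize_blockwise(state, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# For comparison: a single scale for the whole tensor (block size = tensor length).
global_codes, global_scales = quantize_blockwise(state, block_size=len(state))

print("block-wise codes:", codes)
print("global codes:    ", global_codes)
```

With a single global scale, the four small entries all collapse to the code 0, while the block-wise version keeps them distinguishable at the cost of storing one extra scale per block.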

Features

Weight Quantization - Reduces model memory footprints by mapping floating point weights to k-bit integer values.

GPU Kernel Implementations - Provides custom CUDA kernels that run quantization and low-precision matrix operations directly on GPU hardware.

Low-Rank Adaptation - Implements low-rank adaptation (LoRA) to update only a small subset of model parameters during training.

4-bit Adaptation Frameworks - Provides a framework for training large models on limited hardware through 4-bit quantization and low-rank adaptation.

Quantization Toolkits - Offers a library for reducing the memory footprint of large language models using k-bit quantization.

Model Quantization - Reduces the memory footprint of large language models using k-bit quantization so that they fit in limited video memory.

Quantized Training - Enables training of large models on limited hardware by employing 4-bit quantization and low-rank adaptation.

Block-Wise Quantization - Implements block-wise quantization of optimizer states to maintain full-precision training performance while reducing memory use.

Quantized Optimizers - Implements memory-efficient optimizers using block-wise quantization to maintain full-precision performance.

8-bit Compression Tools - Provides a tool for compressing large language models to 8-bit precision for faster inference.

Model Compression - Compresses neural networks to 8-bit precision to lower memory usage while maintaining output quality.

Vector-Wise Quantization - Uses vector-wise quantization to maintain high numerical accuracy during model inference.

Training Memory Optimizers - Lowers the memory required for training large models using block-wise quantization for the optimizer.

Large Language Models - Library for 8-bit optimization and quantization.

Model Quantization - 8-bit matrix multiplication and quantization primitives.

Model Quantization Tools - Efficient 8-bit matrix multiplication for large-scale models.

Large Language Models (LLMs) - Listed in the “Large Language Models (LLMs)” section of The Incredible PyTorch awesome list.
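The low-rank adaptation idea named in the feature list can be sketched as follows: the original weight matrix stays frozen, and a trainable update is expressed as the product of two small matrices. This is a generic plain-Python illustration rather than code from the library; the names `matvec` and `lora_forward`, the dimensions, and the `alpha / rank` scaling convention are assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha=1.0):
    """Compute W x + (alpha / rank) * B (A x), where only A and B would be trained."""
    rank = len(a)
    base = matvec(w, x)               # frozen path (this is the part that can be quantized)
    update = matvec(b, matvec(a, x))  # low-rank path through the small matrices
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


d_out, d_in, rank = 8, 8, 2
w = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen weights
a = [[0.01 * (j + 1) for j in range(d_in)] for _ in range(rank)]  # rank x d_in
b = [[0.0] * rank for _ in range(d_out)]                          # d_out x rank, zero-initialized
x = [1.0] * d_in

y = lora_forward(w, a, b, x)

full_params = d_out * d_in            # parameters in the frozen matrix
lora_params = rank * (d_in + d_out)   # parameters that would actually be trained
print("full:", full_params, "trainable:", lora_params)
```

Initializing `b` to zeros makes the adapter a no-op at the start, so the adapted layer initially matches the frozen one. The gap between the two parameter counts widens quickly as the layer dimensions grow while the rank stays small.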

timdettmers/bitsandbytes
