LLM Compressor

llm-compressor is a post-training quantization toolkit that reduces the memory footprint of large language models. It provides a framework for compressing models with weight and activation quantization so they can be deployed more efficiently.

The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language, and Audio-Language models.
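As a rough illustration of the disk-based offloading idea, the sketch below quantizes a checkpoint one shard at a time so that only a single shard is held in memory. It is not llm-compressor's implementation: the JSON shard layout and the names `quantize_shard` and `quantize_checkpoint` are assumptions made up for this example.

```python
import json
import tempfile
from pathlib import Path


def quantize_shard(weights, num_bits=8):
    """Symmetric per-shard quantization (hypothetical helper for this sketch)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return {"scale": scale, "values": [round(w / scale) for w in weights]}


def quantize_checkpoint(src_dir: Path, dst_dir: Path) -> int:
    """Process shards sequentially so only one shard is in memory at a time."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for shard_path in sorted(src_dir.glob("*.json")):
        weights = json.loads(shard_path.read_text())  # load one shard from disk
        result = quantize_shard(weights)
        # Write the quantized shard back to disk before touching the next one.
        (dst_dir / shard_path.name).write_text(json.dumps(result))
        del weights, result  # drop references before loading the next shard
        count += 1
    return count


# Toy checkpoint: two small "layers" stored as separate files (assumed layout).
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "dense", Path(tmp) / "quantized"
    src.mkdir()
    (src / "layer_0.json").write_text(json.dumps([0.5, -1.0]))
    (src / "layer_1.json").write_text(json.dumps([2.0, 0.0]))
    processed = quantize_checkpoint(src, dst)
    first = json.loads((dst / "layer_0.json").read_text())
```

Peak memory in this toy version is bounded by the largest shard rather than by the whole checkpoint, which is the property the offloading approach relies on.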

The toolkit covers a broad range of optimization capabilities, including calibration-based and data-free quantization, checkpoint format conversion, and the reduction of precision for attention mechanisms and key-value caches. It manages these processes through structured compression recipes and orchestration pipelines to standardize model preparation and optimization.
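To make calibration-based quantization concrete, here is a minimal sketch of how a scale for activations might be derived from representative data. The `AbsMaxObserver` class and `toy_layer` function are hypothetical names invented for this example, and the abs-max rule is only one simple choice; it is not presented as the library's API or algorithm.

```python
from typing import Iterable, List


class AbsMaxObserver:
    """Hypothetical observer: tracks the largest activation magnitude seen."""

    def __init__(self, num_bits: int = 8) -> None:
        self.qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
        self.max_abs = 0.0

    def observe(self, activations: Iterable[float]) -> None:
        for value in activations:
            self.max_abs = max(self.max_abs, abs(value))

    def scale(self) -> float:
        # Scale that maps the observed range onto [-qmax, qmax].
        return self.max_abs / self.qmax if self.max_abs > 0 else 1.0


def toy_layer(inputs: List[float]) -> List[float]:
    """Stand-in for a model layer (assumption: simply doubles its inputs)."""
    return [2.0 * x for x in inputs]


# Calibration: run representative batches through the layer and record ranges.
calibration_batches = [[0.1, -0.5], [0.635, 0.2]]
observer = AbsMaxObserver()
for batch in calibration_batches:
    observer.observe(toy_layer(batch))

activation_scale = observer.scale()
```

A data-free scheme would skip the calibration loop and compute scales from the weights alone, which is why it needs no representative dataset.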

Features

Model Quantization - Provides a comprehensive toolkit for reducing the precision of model weights to decrease memory footprint.

Distributed Quantization Processing - Uses distributed data parallel processing to accelerate the quantization of massive models.

Data-Free Quantization - Applies quantization schemes directly to weight checkpoints without requiring calibration data or model definitions.

Inference Optimizations - Optimizes LLM inference by lowering the precision of attention caches and activations to increase throughput.

Quantization Toolkits - Ships a comprehensive toolkit for compressing large language models using weight and activation quantization.

Model Compression Suites - Provides structured compression pipelines and recipes to standardize model size reduction for deployment.

Model Quantization Frameworks - Provides a framework to reduce model size and computational requirements by converting weights into lower-precision formats.

Calibration-Driven Quantization - Implements calibration forward passes with representative data to optimize model weights during quantization.

Definition-Free Quantization - Performs quantization on model weights even when a standard library model definition is unavailable.

Activation Quantization - Implements precision reduction for model activations to lower runtime memory usage and increase throughput.

Calibration Parameters - Determines optimal scaling factors for low-precision weights by running representative data through forward passes.

Weight Quantization - Maps high-precision weights to lower-bit formats using specialized algorithms to reduce memory footprint.

Joint Weight and Activation Quantization - Reduces model size and memory usage by converting both weights and activations to lower-precision formats.

KV Cache Quantizers - Offers specialized utilities for reducing the precision of key-value caches to increase inference throughput.

Post-Training Quantization - Reduces model precision after training using calibration datasets and weight-only techniques.

Compression Recipes - Standardizes the sequence of model preparation, calibration, and optimization through structured configuration recipes.

Memory-Efficient Deep Learning - Enables memory-efficient deployment of models that exceed system memory through disk offloading and sequential loading.

Checkpoint Format Transpilations - Provides utilities to transform model weights between different quantization formats or convert compressed weights back to dense formats.

Weight Offloading - Manages massive models by sequentially loading tensors from disk to avoid exceeding system memory during quantization.

Multimodal Quantization Systems - Quantizes diverse architectures including Mixture-of-Experts, Vision-Language, and Audio-Language models for efficient deployment.

Model Compression - Applies quantization and size reduction techniques to specialized architectures including vision and audio language models.

Multimodal Compression Adaptations - Applies specialized compression techniques tailored to the specific tensor structures of Mixture-of-Experts and Vision-Language models.

Architecture-Specific Quantization - Applies compression techniques specifically tailored for Mixture-of-Experts, Vision-Language, and Audio-Language model types.

Distributed Quantization Workloads - Accelerates the compression of large models by splitting quantization workloads across multiple GPUs.

Memory Offloading Frameworks - Utilizes sequential onloading and disk offloading to quantize models that exceed available system memory.

Inference and Serving - Provides compression algorithms that prepare models for optimized deployment and serving.
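Several of the features above come down to mapping high-precision weights to a lower-bit format. The following is a minimal sketch of symmetric per-tensor quantization and its inverse, assuming a signed 8-bit target; the function names are illustrative and the scheme is a generic textbook one, not necessarily what llm-compressor applies.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], num_bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers using a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integer levels."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)

# Rounding to the nearest level bounds the error of each weight by scale / 2.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
```

Storing `q` as 8-bit integers plus one float scale takes roughly a quarter of the space of 32-bit floats, at the cost of the bounded rounding error measured by `max_error`.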

vllm-project/llm-compressor
