# vllm-project/llm-compressor

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/vllm-project-llm-compressor).**

2,764 stars · 403 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/vllm-project/llm-compressor
- Homepage: https://docs.vllm.ai/projects/llm-compressor
- awesome-repositories: https://awesome-repositories.com/repository/vllm-project-llm-compressor.md

## Topics

`compression` `quantization` `sparsity`

## Description

llm-compressor is a post-training quantization toolkit that reduces the size and memory footprint of large language models. It compresses models through weight and activation quantization so they can be deployed more efficiently.
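
To make the idea of weight quantization concrete, here is a minimal sketch of symmetric round-to-nearest quantization with a single per-tensor scale. It uses only plain Python and is an illustration of the general technique, not llm-compressor's own implementation; the function names `quantize_symmetric` and `dequantize` are hypothetical.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers using one scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8-bit
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer levels."""
    return [v * scale for v in q]


# Illustrative values only (not taken from any real model).
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision traded away for the smaller integer representation.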

The project stands out for its distributed quantization framework, which combines data-parallel processing with disk-based weight offloading to handle model checkpoints larger than available system memory. It also includes specialized compressors for Mixture-of-Experts, Vision-Language, and Audio-Language architectures.
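
The offloading idea can be sketched as a loop that loads one layer from disk, quantizes it, writes the result back, and releases it before touching the next layer. The example below is an assumption-laden illustration: the one-JSON-file-per-layer layout and the names `quantize_layer` and `quantize_checkpoint` are hypothetical and do not reflect llm-compressor's checkpoint format or API.

```python
import json
import os
import tempfile


def quantize_layer(weights, qmax=127):
    """Symmetric round-to-nearest quantization of a single layer."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return {"q": [round(w / scale) for w in weights], "scale": scale}


def quantize_checkpoint(src_dir, dst_dir):
    """Process layer files one at a time so only one layer is held in memory."""
    processed = []
    for name in sorted(os.listdir(src_dir)):
        with open(os.path.join(src_dir, name)) as f:
            weights = json.load(f)              # onload a single layer
        result = quantize_layer(weights)
        with open(os.path.join(dst_dir, name), "w") as f:
            json.dump(result, f)                # offload the result to disk
        del weights, result                     # drop references before the next layer
        processed.append(name)
    return processed


# Build a tiny hypothetical checkpoint on disk and quantize it.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
layers = {"layer_0.json": [1.0, -0.25], "layer_1.json": [2.0, -0.5]}
for name, w in layers.items():
    with open(os.path.join(src, name), "w") as f:
        json.dump(w, f)

processed = quantize_checkpoint(src, dst)
```

Peak memory in this sketch is bounded by the largest single layer rather than by the whole checkpoint, which is the property that makes oversized models tractable.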

The toolkit supports calibration-based and data-free quantization, checkpoint format conversion, and reduced-precision attention and key-value caches. Structured compression recipes and orchestration pipelines standardize how models are prepared and optimized.
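
The difference between data-free and calibration-based quantization can be shown with a toy linear layer. In this sketch, a data-free weight scale is derived from the weights alone, while an activation scale is estimated by running representative inputs through the layer and recording the largest output magnitude. The helper names and the max-abs observer are assumptions chosen for illustration, not the library's actual observers.

```python
def linear(x, weights):
    """Toy linear layer: one output per weight row."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weights]


def data_free_weight_scale(weights, qmax=127):
    """Scale computed from the weights only; no input data needed."""
    max_abs = max(abs(w) for row in weights for w in row)
    return max_abs / qmax if max_abs > 0 else 1.0


def calibrate_activation_scale(batches, weights, qmax=127):
    """Scale computed from outputs observed during calibration forward passes."""
    observed_max = 0.0
    for x in batches:
        out = linear(x, weights)
        observed_max = max(observed_max, max(abs(v) for v in out))
    return observed_max / qmax if observed_max > 0 else 1.0


# Illustrative weights and calibration inputs (made up for this example).
weights = [[1.0, 0.0], [0.5, 0.5]]
calibration_batches = [[1.0, 2.0], [-4.0, 2.0], [0.5, 0.5]]

weight_scale = data_free_weight_scale(weights)
act_scale = calibrate_activation_scale(calibration_batches, weights)
```

Activation ranges depend on the inputs, so a static activation scale cannot be chosen without sample data, whereas weight ranges are fully known from the checkpoint.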

## Tags

### Artificial Intelligence & ML

- [Model Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/model-quantization.md) — Provides a comprehensive toolkit for reducing the precision of model weights to decrease memory footprint.
- [Distributed Quantization Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-quantization-processing.md) — Uses distributed data parallel processing to accelerate the quantization of massive models. ([source](https://cdn.jsdelivr.net/gh/vllm-project/llm-compressor@main/README.md))
- [Data-Free Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/half-precision-inference/half-precision-matrix-multiplications/fp8-matrix-multiplication/block-scaled-fp8-quantization-kernels/checkpoint-quantization/data-free-quantization.md) — Applies quantization schemes directly to weight checkpoints without requiring calibration data or model definitions. ([source](https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/entrypoints/))
- [Inference Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/serving-and-runtime/inference-optimizations.md) — Optimizes LLM inference by lowering the precision of attention caches and activations to increase throughput.
- [Quantization Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-optimization-techniques/quantization-toolkits.md) — Ships a comprehensive toolkit for compressing large language models using weight and activation quantization.
- [Model Compression Suites](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/compression-techniques/model-pruning/model-compression-suites.md) — Provides structured compression pipelines and recipes to standardize model size reduction for deployment. ([source](https://docs.vllm.ai/projects/llm-compressor/en/latest/api/))
- [Model Quantization Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/quantization/model-quantization-frameworks.md) — Provides a framework to reduce model size and computational requirements by converting weights into lower-precision formats.
- [Calibration-Driven Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/model-quantization/8-bit-inference-quantizers/static-quantization/calibration-driven-quantization.md) — Implements calibration forward passes with representative data to optimize model weights during quantization. ([source](https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/entrypoints/))
- [Definition-Free Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/model-quantization/definition-free-quantization.md) — Performs quantization on model weights even when a standard library model definition is unavailable. ([source](https://cdn.jsdelivr.net/gh/vllm-project/llm-compressor@main/README.md))
- [Activation Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/precision-quantization/activation-quantization.md) — Implements precision reduction for model activations to lower runtime memory usage and increase throughput.
- [Calibration Parameters](https://awesome-repositories.com/f/artificial-intelligence-ml/precision-quantization/online-quantization/calibration-parameters.md) — Determines optimal scaling factors for low-precision weights by running representative data through forward passes.
- [Weight Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/quantized-inference-runtimes/weight-quantization.md) — Maps high-precision weights to lower-bit formats using specialized algorithms to reduce memory footprint. ([source](https://docs.vllm.ai/projects/llm-compressor/en/latest/))
- [Joint Weight and Activation Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/quantized-inference-runtimes/weight-quantization/joint-weight-and-activation-quantization.md) — Reduces model size and memory usage by converting both weights and activations to lower-precision formats. ([source](https://cdn.jsdelivr.net/gh/vllm-project/llm-compressor@main/README.md))
- [KV Cache Quantizers](https://awesome-repositories.com/f/artificial-intelligence-ml/quantized-inference-runtimes/weight-quantization/kv-cache-quantizers.md) — Offers specialized utilities for reducing the precision of key-value caches to increase inference throughput.
- [Post-Training Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/quantized-inference-runtimes/weight-quantization/post-training-quantization.md) — Reduces model precision after training using calibration datasets and weight-only techniques.
- [Compression Recipes](https://awesome-repositories.com/f/artificial-intelligence-ml/compression-recipes.md) — Standardizes the sequence of model preparation, calibration, and optimization through structured configuration recipes.
- [Memory-Efficient Deep Learning](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-efficient-deep-learning.md) — Enables memory-efficient deployment of models that exceed system memory through disk offloading and sequential loading.
- [Checkpoint Format Transpilations](https://awesome-repositories.com/f/artificial-intelligence-ml/model-weight-export-formats/checkpoint-format-transpilations.md) — Provides utilities to transform model weights between different quantization formats or convert compressed weights back to dense formats. ([source](https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/entrypoints/))
- [Weight Offloading](https://awesome-repositories.com/f/artificial-intelligence-ml/model-weight-management/weight-offloading.md) — Manages massive models by sequentially loading tensors from disk to avoid exceeding system memory during quantization.
- [Multimodal Quantization Systems](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-models/multimodal-model-runners/multimodal-quantization-systems.md) — Quantizes diverse architectures including Mixture-of-Experts, Vision-Language, and Audio-Language models for efficient deployment.
- [Model Compression](https://awesome-repositories.com/f/artificial-intelligence-ml/neural-networks/model-compression.md) — Applies quantization and size reduction techniques to specialized architectures including vision and audio language models.
- [Multimodal Compression Adaptations](https://awesome-repositories.com/f/artificial-intelligence-ml/parameter-efficient-fine-tuning/multimodal-compression-adaptations.md) — Applies specialized compression techniques tailored to the specific tensor structures of Mixture-of-Experts and Vision-Language models.

### Part of an Awesome List

- [Architecture-Specific Quantization](https://awesome-repositories.com/f/awesome-lists/ai/advanced-model-techniques/architectural-optimizations/architecture-specific-quantization.md) — Applies compression techniques specifically tailored for Mixture-of-Experts, Vision-Language, and Audio-Language model types. ([source](https://cdn.jsdelivr.net/gh/vllm-project/llm-compressor@main/README.md))
- [Distributed Quantization Workloads](https://awesome-repositories.com/f/awesome-lists/ai/distributed-parallelism/distributed-quantization-workloads.md) — Accelerates the compression of large models by splitting quantization workloads across multiple graphics processors.
- [Inference and Serving](https://awesome-repositories.com/f/awesome-lists/ai/inference-and-serving.md) — Compression algorithms for optimized model deployment.
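
As a rough illustration of how data-parallel calibration could split work across devices, the sketch below shards calibration samples across simulated ranks, has each rank compute a local maximum, and then combines the results with a max reduction. Everything here is simulated in one process; `shard`, `local_max_abs`, and `all_reduce_max` are hypothetical stand-ins, and a real multi-GPU setup would use a collective communication library instead.

```python
def shard(data, world_size):
    """Round-robin split so each rank receives a disjoint slice of the data."""
    return [data[rank::world_size] for rank in range(world_size)]


def local_max_abs(samples):
    """Largest absolute value a single rank observes in its own shard."""
    return max((abs(v) for sample in samples for v in sample), default=0.0)


def all_reduce_max(values):
    """Stand-in for a max all-reduce collective across ranks."""
    return max(values)


# Made-up calibration samples, split across two simulated ranks.
calibration_data = [[0.1, -0.4], [2.5, 0.3], [-3.0, 1.0], [0.2, 0.2]]
shards = shard(calibration_data, world_size=2)
per_rank = [local_max_abs(s) for s in shards]
global_max = all_reduce_max(per_rank)
scale = global_max / 127                        # shared 8-bit scale for all ranks
```

Because the maximum is associative, the reduced statistic matches what a single process would compute over the full dataset, while each rank only processes its own shard.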

### Operating Systems & Systems Programming

- [Memory Offloading Frameworks](https://awesome-repositories.com/f/operating-systems-systems-programming/gpu-memory-optimizations/memory-offloading-frameworks.md) — Utilizes sequential onloading and disk offloading to quantize models that exceed available system memory. ([source](https://cdn.jsdelivr.net/gh/vllm-project/llm-compressor@main/README.md))
