llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment.
The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language, and Audio-Language models.
The toolkit covers a broad range of optimization capabilities, including calibration-based and data-free quantization, checkpoint format conversion, and the reduction of precision for attention mechanisms and key-value caches. It manages these processes through structured compression recipes and orchestration pipelines to standardize model preparation and optimization.