Neural Compressor

Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency.

The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model execution.
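To make the weight-only quantization idea concrete, here is a minimal sketch in plain Python. It is not the toolkit's API: the function names, the 4-bit width, and the group size are assumptions chosen for illustration. Weights are split into small groups, and each group is stored as signed integers with its own floating-point scale, while activations are left untouched.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (integer codes, per-group scale)


def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Symmetric quantization of one weight group to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_weights(weights: List[float], group_size: int = 4,
                     bits: int = 4) -> List[Group]:
    """Split a flat weight row into groups; each group gets its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize_weights(groups: List[Group]) -> List[float]:
    """Rebuild approximate floating-point weights from codes and scales."""
    restored: List[float] = []
    for codes, scale in groups:
        restored.extend(code * scale for code in codes)
    return restored


# Hypothetical weight row: one group of large values, one group of small values.
weights = [0.7, -0.35, 0.1, 0.0, 0.02, -0.01, 0.005, 0.014]
groups = quantize_weights(weights, group_size=4, bits=4)
restored = dequantize_weights(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because every group carries its own scale, the small weights in the second group are not rounded against the range of the large weights in the first group, which is the usual motivation for group-wise schemes.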

The toolkit covers a broad range of optimization capabilities, including post-training quantization, mixed-precision layer mapping, and graph operation fusion. It also includes automated performance tuning to discover optimal configuration settings for specific hardware targets.
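The automated tuning described above can be pictured as a search over candidate configurations that stops at the cheapest one meeting an accuracy criterion. The sketch below is an illustration only: the toy model, the `tune` function, the error metric, and the exhaustive search order are all assumptions, not the toolkit's actual strategy or API. It assigns a bit width to each layer (a simple form of mixed precision) and accepts the first assignment whose output error stays within a tolerance.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence

Layers = Dict[str, List[float]]


def fake_quantize(weights: Sequence[float], bits: int) -> List[float]:
    """Quantize to signed `bits`-bit integers and convert straight back."""
    if bits >= 32:
        return list(weights)                  # treat 32 bits as "leave as is"
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def model_output(layers: Layers, x: Sequence[float]) -> float:
    """Toy stand-in for a model: sum of dot products of each layer with x."""
    return sum(sum(w * xi for w, xi in zip(ws, x)) for ws in layers.values())


def mean_abs_error(reference: Layers, candidate: Layers,
                   inputs: Sequence[Sequence[float]]) -> float:
    """Average output deviation from the full-precision reference."""
    total = sum(abs(model_output(reference, x) - model_output(candidate, x))
                for x in inputs)
    return total / len(inputs)


def tune(layers: Layers, inputs: Sequence[Sequence[float]], tolerance: float,
         choices: Sequence[int] = (4, 8, 32)) -> Optional[Dict[str, int]]:
    """Return the cheapest per-layer bit assignment within `tolerance`."""
    names = list(layers)
    # Try candidates in order of total stored bits, cheapest first.
    candidates = sorted(
        product(choices, repeat=len(names)),
        key=lambda bits: sum(b * len(layers[n]) for b, n in zip(bits, names)),
    )
    for bits in candidates:
        quantized = {n: fake_quantize(layers[n], b) for n, b in zip(names, bits)}
        if mean_abs_error(layers, quantized, inputs) <= tolerance:
            return dict(zip(names, bits))
    return None


# Hypothetical two-layer model and calibration inputs.
layers = {"embed": [0.5, -0.25, 0.125, 1.0], "head": [0.9, 0.013, -0.4, 0.07]}
inputs = [[1.0, 2.0, -1.0, 0.5], [0.0, 1.0, 1.0, 1.0], [2.0, -1.0, 0.5, 0.0]]
config = tune(layers, inputs, tolerance=0.01)
```

Exhaustive enumeration only works for a handful of layers. A practical tuner would need a smarter search strategy, but the accept-or-continue loop around an accuracy check has the same shape.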

Features

Inference Accelerators - Maps neural network operators to specialized CPU and GPU vector instructions to accelerate model execution.

Post-Training Quantization - Reduces model precision by converting floating point weights to lower bit widths after the primary training phase.

Deep Learning Quantization Tools - Provides a comprehensive library of precision reduction methods for neural network weights and optimizer states.

Hardware Dispatchers - Dynamically selects and executes the most efficient compute kernels based on detected CPU and GPU hardware.

Inference Acceleration Engines - Provides an optimized execution environment and kernels for low-latency deployment of large-scale models.

Graph Fusions - Combines separate compute kernels into single fused functions at the graph level to reduce invocation overhead.

Weight-Only Compression - Converts large language models quantized with weight-only methods into hardware-specific representations to increase execution speed.

LLM Performance Optimization Libraries - Increases the execution speed and resource efficiency of large language models using hardware-specific kernels.

Large Language Model Optimization - Applies specialized optimizations to improve the operational efficiency of massive language and vision models.

Model Quantization Tools - Ships utilities that reduce the precision of model weights to decrease memory usage and accelerate inference.

Hardware-Specific Model Optimizations - Adapts models to utilize specific hardware accelerators by dispatching operators to vector and matrix instructions.

Mixed-Precision Quantization - Assigns different bit-depths to individual layers to maintain accuracy while minimizing the total memory footprint.

Model Quantization - Implements techniques to reduce weight precision, including 8-bit integer quantization, to decrease memory footprint.

Backend-Agnostic Engines - Implements a computational framework that decouples neural network operations from hardware backends for cross-platform deployment.

Model Compression - Reduces the size and computational requirements of neural networks through mixed precision and quantization.

Section-Specific Precision Control - Applies granular quantization strategies to specific model sections to balance accuracy and computational efficiency.

Weight Quantization - Compresses model weights into lower-precision integer formats to reduce memory usage and accelerate inference.

Deep Learning Acceleration - Uses hardware-specific vector and matrix acceleration units on CPUs and GPUs to speed up tensor operations.

Kernel Fusion Operations - Fuses multiple adjacent mathematical operations into single compute kernels to minimize memory access overhead.

Fused Operation Pipelines - Combines multiple mathematical steps into single execution passes to reduce memory access and invocation overhead.

GPU Acceleration - Uses dedicated GPU device backends and hardware drivers to increase processing speed.

Hardware Performance Tuning - Optimizes hardware configurations to maximize throughput and bandwidth for deep learning workloads.

Hyperparameter Tuning - Provides iterative processes to optimize model configurations and quantization settings for target hardware.

Hardware-Aware Compilers - Fuses graph operations and optimizes model representations for specific target device backends.

Hyperparameter Search Strategies - Uses search algorithms to automatically discover optimal quantization and configuration settings for hardware targets.

Performance Tuning - Implements automated discovery of optimal configuration settings to maximize hardware utilization and minimize latency.

Model Optimization - Optimizes model performance through quantization, compression, pruning, and distillation.

Developer Tools - Provides automatic accuracy-driven tuning and quantization for neural networks.
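The graph and kernel fusion entries in the list above can be illustrated with a small sketch. It is a simplified model under stated assumptions: the graph is a linear chain, every operation is element-wise, and the names `run` and `fuse_elementwise` are hypothetical. A real engine performs this rewrite on compiled kernels rather than on Python callables.

```python
from typing import Callable, List, Tuple

Op = Tuple[str, Callable[[float], float]]  # (operation name, element-wise function)


def run(graph: List[Op], xs: List[float]) -> List[float]:
    """Execute ops one by one; each op makes a full pass and a new buffer."""
    for _name, fn in graph:
        xs = [fn(v) for v in xs]
    return xs


def fuse_elementwise(graph: List[Op]) -> List[Op]:
    """Collapse a chain of element-wise ops into a single fused op."""
    if not graph:
        return []
    fns = [fn for _name, fn in graph]

    def fused(v: float) -> float:
        # Apply every original op to one element before moving to the next,
        # so no intermediate buffer is written between ops.
        for fn in fns:
            v = fn(v)
        return v

    name = "fused(" + "+".join(n for n, _fn in graph) + ")"
    return [(name, fused)]


# Hypothetical three-op chain: scale, shift, then ReLU.
graph: List[Op] = [
    ("mul", lambda v: v * 0.5),
    ("add", lambda v: v + 1.0),
    ("relu", lambda v: max(0.0, v)),
]
fused_graph = fuse_elementwise(graph)

data = [-4.0, -2.0, 0.0, 3.0]
unfused_result = run(graph, data)
fused_result = run(fused_graph, data)
```

Both graphs compute the same values. The fused version visits the data once instead of three times, which is where the reduction in memory access and invocation overhead comes from.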

intel/neural-compressor
