Neural Compressor

Features

Inference Accelerators - Maps neural network operators to specialized CPU and GPU vector instructions to accelerate model execution.
Post-Training Quantization - Reduces model precision by converting floating point weights to lower bit widths after the primary training phase.
Deep Learning Quantization Tools - Provides a comprehensive library of precision reduction methods for neural network weights and optimizer states.
Hardware Dispatchers - Dynamically selects and executes the most efficient compute kernels based on detected CPU and GPU hardware.
Inference Acceleration Engines - Provides an optimized execution environment and kernels for low-latency deployment of large-scale models.
Graph Fusions - Combines separate compute kernels into single fused functions at the graph level to reduce invocation overhead.
Weight-Only Compression - Converts weight-only-quantized large language models into hardware-specific representations to increase execution speed.
LLM Performance Optimization Libraries - Increases the execution speed and resource efficiency of large language models using hardware-specific kernels.
Large Language Model Optimization - Applies specialized optimizations to improve the operational efficiency of massive language and vision models.
Model Quantization Tools - Ships utilities that reduce the precision of model weights to decrease memory usage and accelerate inference.
Hardware-Specific Model Optimizations - Adapts models to utilize specific hardware accelerators by dispatching operators to vector and matrix instructions.
Mixed-Precision Quantization - Assigns different bit-depths to individual layers to maintain accuracy while minimizing the total memory footprint.
Model Quantization - Implements techniques to reduce weight precision, including 8-bit integer quantization, to decrease memory footprint.
Backend-Agnostic Engines - Implements a computational framework that decouples neural network operations from hardware backends for cross-platform deployment.
Model Compression - Reduces the size and computational requirements of neural networks through mixed precision and quantization.
Section-Specific Precision Control - Applies granular quantization strategies to specific model sections to balance accuracy and computational efficiency.
Weight Quantization - Compresses model weights into lower-precision integer formats to reduce memory usage and accelerate inference.
Deep Learning Acceleration - Uses hardware-specific vector and matrix acceleration units on CPUs and GPUs to speed up tensor operations.
Kernel Fusion Operations - Fuses multiple adjacent mathematical operations into single compute kernels to minimize memory access overhead.
Fused Operation Pipelines - Combines multiple mathematical steps into single execution passes to reduce memory access and invocation overhead.
GPU Acceleration - Runs workloads on GPUs through dedicated device backends to increase processing speed.
Hardware Performance Tuning - Optimizes hardware configurations to maximize throughput and bandwidth for deep learning workloads.
Hyperparameter Tuning - Provides iterative processes to optimize model configurations and quantization settings for target hardware.
Hardware-Aware Compilers - Fuses graph operations and optimizes model representations for specific target device backends.
Hyperparameter Search Strategies - Uses search algorithms to automatically discover optimal quantization and configuration settings for hardware targets.
Performance Tuning - Implements automated discovery of optimal configuration settings to maximize hardware utilization and minimize latency.
Model Optimization - Optimizes model performance through quantization and compression techniques.
Model Optimization - Toolkit for model compression, pruning, and distillation.
Developer Tools - Automatic accuracy-driven tuning and quantization for neural networks.
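Several of the features above (weight quantization, mixed-precision quantization, accuracy-driven tuning) share one underlying idea. The following is a minimal sketch of that idea in plain Python, not Neural Compressor's API. The function names (`quantize`, `dequantize`, `max_error`, `pick_bits`), the symmetric per-tensor scheme, and the error tolerance are all assumptions made for illustration.

```python
def quantize(weights, bits=8):
    """Symmetric per-tensor quantization of floats to signed integers (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute reconstruction error at the given bit width."""
    q, scale = quantize(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))


def pick_bits(weights, tolerance, candidates=(4, 8, 16)):
    """Return the smallest candidate bit width whose error stays within tolerance.

    Returns None when no candidate qualifies, meaning the layer would be
    left in floating point. This stands in for an accuracy-driven search;
    a real tuner would measure task accuracy, not raw weight error.
    """
    for bits in sorted(candidates):
        if max_error(weights, bits) <= tolerance:
            return bits
    return None


# Hypothetical per-layer weights: each layer gets its own bit width.
layers = {
    "attention": [0.5, -1.0, 0.25, 0.75],
    "mlp": [0.02, -0.01, 0.03, 0.015],
}
plan = {name: pick_bits(w, tolerance=0.01) for name, w in layers.items()}
print(plan)
```

Assigning a bit width per layer from a tolerance is the simplest form of mixed-precision selection: layers that tolerate coarse steps get fewer bits, and the rest keep more precision.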

Open-source alternatives to Neural Compressor

Similar open-source projects, ranked by how many features they share with Neural Compressor.

vllm-project/llm-compressor
2,764 stars
llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language,
Python (tags: compression, quantization, sparsity)
NVIDIA/Isaac-GR00T
6,222 stars
Jupyter Notebook
pytorch/torchtune
5,774 stars
Torchtune is a PyTorch-native library for fine-tuning, aligning, and quantizing large language models. It provides a configurable training pipeline orchestrated through YAML recipes, with CLI overrides and component swapping, distributed training via FSDP2, memory optimizations, and parameter-efficient fine-tuning methods like LoRA, DoRA, and QLoRA. The library distinguishes itself through its YAML-driven configuration system that defines all training parameters and instantiates components from config files, with full CLI override capability for any field or component at launch time. It suppo
Python
PaddlePaddle/Paddle-Lite
7,260 stars
Paddle-Lite is a deep learning inference engine and edge computing runtime designed to execute trained models on mobile and edge devices. It provides a hardware-accelerated inference framework and a decoupled runtime with a minimal binary footprint to operate in resource-constrained environments without third-party dependencies. The project includes a model quantization tool for reducing precision and size via static and dynamic quantization, as well as a computation graph optimizer. These tools reduce latency and memory usage by fusing operators and pruning the model intermediate representat
C++ (tags: arm, baidu, deep-learning)


intel/neural-compressor
