PowerInfer | Awesome Repository

PowerInfer is a high-performance inference engine for running large language models locally on consumer-grade hardware. It pairs a sparse inference framework with a GPU acceleration backend that optimizes tensor operations for graphics processors.

The system distinguishes itself through its sparse inference framework, which increases generation speed by exploiting activation sparsity: computations for neurons that are inactive for a given input are skipped. It also includes a GGUF model converter that transforms weights and metadata into a unified binary format, and an OpenAI API-compatible server for integrating local models with existing chat clients.
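The activation-sparsity idea can be illustrated with a minimal sketch. This is not PowerInfer's implementation; it is a toy feed-forward layer in pure Python, assuming a ReLU activation, and the names `ffn_dense` and `ffn_sparse` are hypothetical. It shows only that neurons whose activation is zero contribute nothing to the output, so their share of the second matrix product can be skipped without changing the result.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def ffn_dense(W_up, W_down, x):
    """Reference layer: every hidden neuron takes part in the down projection."""
    h = relu(matvec(W_up, x))
    return matvec(W_down, h)


def ffn_sparse(W_up, W_down, x):
    """Same layer, but the down projection visits only active neurons.

    Illustrative assumption: the active set is found by computing the
    activations directly. A real engine would need a cheaper way to find it.
    """
    h = relu(matvec(W_up, x))
    active = [j for j, a in enumerate(h) if a > 0.0]
    out = [sum(row[j] * h[j] for j in active) for row in W_down]
    return out, active


# Toy weights: 2 inputs, 4 hidden neurons, 2 outputs.
W_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
W_down = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
x = [1.0, -1.0]

dense_out = ffn_dense(W_up, W_down, x)
sparse_out, active = ffn_sparse(W_up, W_down, x)
print(active, sparse_out == dense_out)
```

In this toy input only one of the four hidden neurons is active, so the sparse path touches a quarter of the down-projection columns. The saving in a real model depends on how sparse its activations actually are.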

The project's broader capabilities include distributed model inference across multiple nodes, GPU hardware acceleration for Apple Metal and other processors, and structured text generation that uses formal grammars to constrain outputs. It also implements memory management techniques such as hybrid memory offloading, weight quantization, and CPU core affinity binding.

Features

Local Inference Engines - Implements a high-performance local inference engine designed for executing LLMs on consumer-grade hardware.
Sparse Model Architectures - Increases generation speed by identifying and ignoring inactive neurons based on activation sparsity.
Apple Hardware Acceleration - Executes computation graphs on Apple hardware by mapping host memory buffers to GPU kernels.
