# tiiny-ai/powerinfer

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/tiiny-ai-powerinfer).**

8,714 stars · 487 forks · C++ · MIT

## Links

- GitHub: https://github.com/Tiiny-AI/PowerInfer
- awesome-repositories: https://awesome-repositories.com/repository/tiiny-ai-powerinfer.md

## Topics

`large-language-models` `llama` `llm` `llm-inference` `local-inference`

## Description

PowerInfer is a high-performance inference engine for running large language models locally. It provides a runtime for executing models on consumer-grade hardware, with a GPU acceleration backend that maps tensor operations to GPU compute kernels.

PowerInfer's distinguishing feature is sparse inference: it increases generation speed by skipping computation for neurons that are inactive for a given input, a property known as activation sparsity. It also includes a GGUF model converter that packs weights and metadata into a single binary format, and an OpenAI API-compatible server for connecting local models to existing chat clients.
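
To make activation sparsity concrete, here is a minimal Python sketch of a feed-forward layer that evaluates only a chosen set of neurons. It is a toy illustration, not PowerInfer's implementation: the function names, the plain-list matrices, and the `active` argument (standing in for whatever mechanism identifies the active neurons) are assumptions made for this example.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def relu(v: float) -> float:
    return v if v > 0.0 else 0.0


def dense_ffn(x: Sequence[float], w_up: Matrix, w_down: Matrix) -> List[float]:
    """Reference path: y = W_down @ relu(W_up @ x), computing every neuron.

    w_up has one row per neuron; w_down has one column per neuron.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]


def sparse_ffn(
    x: Sequence[float], w_up: Matrix, w_down: Matrix, active: Sequence[int]
) -> List[float]:
    """Same layer, but only the neurons listed in `active` are evaluated.

    `active` is a hypothetical input: the indices believed to fire for x.
    """
    out = [0.0] * len(w_down)
    for n in active:
        h = relu(sum(w * xi for w, xi in zip(w_up[n], x)))
        if h == 0.0:
            continue  # neuron did not fire, so it contributes nothing
        for i, row in enumerate(w_down):
            out[i] += row[n] * h
    return out
```

When `active` covers every neuron whose ReLU output is non-zero, `sparse_ffn` returns the same values as `dense_ffn`, and the neurons left out of `active` cost no multiply-adds at all.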

The project also supports distributed inference across multiple nodes, GPU acceleration through Apple Metal and other GPU backends, and structured text generation that uses formal grammars to constrain output. Its memory and performance techniques include hybrid memory offloading between video memory and system RAM, weight quantization, and CPU core affinity binding.
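
Grammar-constrained generation can be sketched as masking the model's token scores so that only tokens the grammar currently allows can be chosen. The Python example below is a minimal illustration under stated assumptions: `GRAMMAR` is a hypothetical hand-written state machine (not the project's grammar file format), `fake_scores` stands in for a real model, and decoding is greedy.

```python
from typing import Callable, Dict, List

# Hypothetical toy grammar: state -> {allowed token: next state}.
# It accepts exactly {"ok":true} or {"ok":false}.
GRAMMAR: Dict[str, Dict[str, str]] = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}


def fake_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a model: fixed scores that favour an invalid token."""
    return {"hello": 9.0, "{": 1.0, '"ok"': 0.5, ":": 0.2,
            "true": 0.1, "false": 0.3, "}": 0.0}


def constrained_greedy(
    score_fn: Callable[[List[str]], Dict[str, float]],
    grammar: Dict[str, Dict[str, str]],
    state: str = "start",
    max_tokens: int = 16,
) -> List[str]:
    """Pick the best-scoring token among those the grammar allows."""
    out: List[str] = []
    while state != "done" and len(out) < max_tokens:
        allowed = grammar[state]
        scores = score_fn(out)
        # Tokens outside `allowed` are never considered, whatever their score.
        token = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        out.append(token)
        state = allowed[token]
    return out
```

Even though the stand-in model scores `hello` highest at every step, `"".join(constrained_greedy(fake_scores, GRAMMAR))` is always a string the grammar accepts, which is the point of constraining the sampler rather than validating output afterwards.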

## Tags

### Artificial Intelligence & ML

- [Local Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/serving-and-runtime/large-language-model-optimization/local-inference-engines.md) — Implements a high-performance local inference engine designed for executing LLMs on consumer-grade hardware.
- [Sparse Model Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/sparse-model-architectures.md) — Increases generation speed by identifying and ignoring inactive neurons based on activation sparsity.
- [Apple Hardware Acceleration](https://awesome-repositories.com/f/artificial-intelligence-ml/apple-hardware-acceleration.md) — Executes computation graphs on Apple hardware by mapping host memory buffers to GPU kernels. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/ggml-metal.h))
- [GPU-Accelerated Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-accelerated-inference.md) — Provides a GPU acceleration backend optimized for high-throughput inference of large language models.
- [Hardware Acceleration Backends](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-acceleration-backends.md) — Maps tensor operations to specialized GPU compute kernels and shaders to maximize hardware acceleration.
- [Local Language Model Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/local-ai-deployment-platforms/deployment-platforms/local-inference/local-language-model-execution.md) — Provides a runtime for executing large language model prompts locally with configurable sampling parameters. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/smallthinker/README.md))
- [Metal API Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/metal-api-optimizations.md) — Executes tensor operations and activation functions using Metal shaders on compatible hardware. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/ggml-metal.metal))
- [GGUF Format Conversions](https://awesome-repositories.com/f/artificial-intelligence-ml/model-format-converters/gguf-format-conversions.md) — Converts model weights and metadata into the GGUF binary format for efficient local loading.
- [Weight Offloading](https://awesome-repositories.com/f/artificial-intelligence-ml/model-weight-management/weight-offloading.md) — Partitions network weights between video memory and system RAM based on activation patterns to fit large models. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/docs/token_generation_performance_tips.md))
- [Weight Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/quantized-inference-runtimes/weight-quantization.md) — Reduces the precision of model weights using various bit-widths to lower memory requirements and accelerate inference. ([source](https://github.com/Tiiny-AI/PowerInfer#readme))
- [Sparse Inference Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/sparse-model-architectures/sparse-inference-frameworks.md) — Implements a sparse inference framework that increases generation speed by exploiting activation sparsity.
- [Text Completion Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/text-completion-engines.md) — Implements an engine for predicting the next sequence of tokens from a prompt using configurable sampling parameters. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/examples/server/README.md))
- [OpenAI-Compatible APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/model-integration-serving/model-integration-interfaces/ai-integration-apis/openai-compatible-apis.md) — Exposes standard HTTP endpoints matching the OpenAI specification for compatibility with external AI clients. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/examples/server/README.md))
- [Asynchronous Tensor Loading](https://awesome-repositories.com/f/artificial-intelligence-ml/asynchronous-tensor-loading.md) — Implements asynchronous loading of model weights to overlap data transfer with active GPU computation. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/ggml-backend.c))
- [Batch Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/batch-inference-engines.md) — Provides a deployment server capable of processing multiple inference requests simultaneously to increase throughput. ([source](https://github.com/Tiiny-AI/PowerInfer#readme))
- [Batched Response Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/batched-response-generation.md) — Produces multiple independent text completions from a single prompt to increase inference throughput. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/examples/batched/README.md))
- [Distributed Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-inference-engines.md) — Splits compute graphs into slices and distributes them across multiple nodes for parallel execution. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/ggml-mpi.c))
- [Distributed Model Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-model-execution.md) — Executes large model workloads spread across multiple compute devices to increase processing speed.
- [Output Constraint Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/decoding-generation-controls/output-constraint-engines.md) — Uses formal grammars to enforce structured output formats like JSON during text generation. ([source](https://github.com/Tiiny-AI/PowerInfer/tree/main/grammars))
- [OpenAI-Compatible Inference Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/local-and-on-device-inference/command-line-inference-interfaces/openai-compatible-inference-servers.md) — Hosts a local model server that mimics OpenAI API endpoints for ecosystem interoperability.
- [Tensor Memory Management](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/hardware-and-acceleration/tensor-computing-libraries/tensor-memory-management.md) — Allocates and manages memory buffers for tensors and computation graphs across diverse hardware backends. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/ggml-alloc.c))
- [Memory-Constrained Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/serving-and-runtime/large-language-model-optimization/memory-constrained-inference.md) — Limits the total memory used during inference to enable the execution of large models on low-RAM devices. ([source](https://github.com/Tiiny-AI/PowerInfer/tree/main/smallthinker))
- [Multi-GPU Distribution](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/model-deployment-toolkits/distributed-deployment-utilities/multi-gpu-distribution.md) — Splits tensors across multiple available graphics devices to balance the computational load. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/ggml-cuda.h))
- [Structured Output Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-code-generators/structured-generation-engines/structured-output-generators.md) — Forces language models to produce strictly typed, machine-readable data formats using formal grammars.
- [Grammar-Constrained Samplers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-generation-strategies/token-prediction/grammar-constrained-samplers.md) — Uses formal grammars to restrict token generation and enforce structured output formats like JSON.
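
The weight quantization entry in the list above describes trading precision for memory. As a rough illustration of the idea, the Python sketch below applies symmetric integer quantization to one block of weights with a single shared scale. This is a generic textbook scheme chosen for clarity, not the quantization formats the project actually uses; the function names and the block layout are assumptions.

```python
from typing import List, Sequence, Tuple


def quantize_block(weights: Sequence[float], bits: int = 8) -> Tuple[float, List[int]]:
    """Map floats to signed integers in [-qmax, qmax] using one scale per block."""
    if bits < 2:
        raise ValueError("bits must be at least 2")
    qmax = (1 << (bits - 1)) - 1  # 127 for 8 bits
    peak = max((abs(w) for w in weights), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, quantized


def dequantize_block(scale: float, quantized: Sequence[int]) -> List[float]:
    """Recover approximate floats; the error is at most half a quantization step."""
    return [scale * q for q in quantized]
```

Storing one float scale plus small integers per block is what reduces memory use; lowering `bits` shrinks storage further at the cost of a larger rounding error per weight.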

### Operating Systems & Systems Programming

- [Memory Offloading Frameworks](https://awesome-repositories.com/f/operating-systems-systems-programming/gpu-memory-optimizations/memory-offloading-frameworks.md) — Offloads model tensors and dense layers to video memory to increase computation speed. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/docs/token_generation_performance_tips.md))
- [Hardware Acceleration](https://awesome-repositories.com/f/operating-systems-systems-programming/hardware-interfacing-drivers/hardware-acceleration.md) — Offloads model tensors and computations to graphics hardware and Apple Metal for improved performance.
- [CPU Affinity Binding](https://awesome-repositories.com/f/operating-systems-systems-programming/cpu-affinity-binding.md) — Binds execution threads to high-performance CPU cores to minimize scheduling latency and maximize generation speed. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/docs/token_generation_performance_tips.md))
- [GPU Memory Allocators](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/allocation-strategies/dynamic-memory-allocation/gpu-memory-allocators.md) — Manages direct allocation and transfer of tensor data buffers within GPU hardware memory. ([source](https://github.com/Tiiny-AI/PowerInfer/blob/main/ggml-cuda.h))
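
One way to picture offloading that splits weights between video memory and system RAM is a greedy placement: the most frequently activated neurons go to the GPU until a memory budget is exhausted, and the rest stay on the CPU. The Python sketch below shows that policy only. The greedy rule, the function name, and the inputs (`activation_freq`, `bytes_per_neuron`, `vram_budget_bytes`) are assumptions for illustration, not a description of how the project decides placement.

```python
from typing import List, Sequence, Tuple


def place_neurons(
    activation_freq: Sequence[float],
    bytes_per_neuron: int,
    vram_budget_bytes: int,
) -> Tuple[List[int], List[int]]:
    """Return (gpu_indices, cpu_indices) for one layer.

    activation_freq[i] is a hypothetical profiled rate at which neuron i fires.
    """
    if bytes_per_neuron <= 0:
        raise ValueError("bytes_per_neuron must be positive")
    # Hottest neurons first; sorted() is stable, so ties keep index order.
    order = sorted(
        range(len(activation_freq)),
        key=lambda i: activation_freq[i],
        reverse=True,
    )
    capacity = max(0, vram_budget_bytes // bytes_per_neuron)
    gpu = sorted(order[:capacity])
    cpu = sorted(order[capacity:])
    return gpu, cpu
```

With a budget that fits three neurons, the three highest-frequency indices land on the GPU and everything else remains in system RAM, so the model as a whole can exceed video memory while frequently used weights stay close to the GPU.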

### Data & Databases

- [Model Weight Conversions](https://awesome-repositories.com/f/data-databases/vector-data-formats/format-conversion-utilities/model-weight-conversions.md) — Transforms model weights into specialized formats required for optimized sparse inference. ([source](https://github.com/Tiiny-AI/PowerInfer#readme))

### Development Tools & Productivity

- [Inference Batching](https://awesome-repositories.com/f/development-tools-productivity/batch-processing-pipelines/inference-batching.md) — Groups multiple independent requests into a single compute pass to maximize hardware utilization.

### DevOps & Infrastructure

- [Compute Graph Slicing](https://awesome-repositories.com/f/devops-infrastructure/distributed-computing/compute-graph-slicing.md) — Splits the compute graph into segments and distributes them across multiple nodes to parallelize model execution.
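
As a simple picture of compute graph slicing, the Python sketch below divides a model's layers into contiguous ranges, one per node, with sizes that differ by at most one. The even, contiguous split and the function name `slice_layers` are assumptions for illustration; the project's own partitioning logic may differ.

```python
from typing import List, Tuple


def slice_layers(n_layers: int, n_nodes: int) -> List[Tuple[int, int]]:
    """Return one half-open [start, end) layer range per node."""
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    if n_layers < 0:
        raise ValueError("n_layers must be non-negative")
    base, extra = divmod(n_layers, n_nodes)
    slices: List[Tuple[int, int]] = []
    start = 0
    for rank in range(n_nodes):
        # The first `extra` nodes take one additional layer each.
        size = base + (1 if rank < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices
```

For example, `slice_layers(32, 3)` assigns layers 0-10, 11-21, and 22-31 to the three nodes, and each node would pass its output activations to the next one in rank order.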