# intel-analytics/ipex-llm

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/intel-analytics-ipex-llm).**

8,836 stars · 1,427 forks · Python · Apache-2.0 · archived

## Links

- GitHub: https://github.com/intel-analytics/ipex-llm
- awesome-repositories: https://awesome-repositories.com/repository/intel-analytics-ipex-llm.md

## Description

ipex-llm is an acceleration library and inference engine for running and finetuning large language models on Intel GPUs and NPUs. It provides a HuggingFace-compatible model backend and a quantization toolkit that converts model weights into low-bit precision formats.
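
To make "low-bit precision formats" concrete, here is a minimal pure-Python sketch of symmetric 4-bit weight quantization. It illustrates the general technique only; it is not ipex-llm's implementation or on-disk format, and the helper names `quantize_int4` and `dequantize_int4` are hypothetical.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one weight group to signed 4-bit ints in [-8, 7].

    Hypothetical helper for illustration; not part of ipex-llm.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale per group; fall back to 1.0 so an all-zero group does not divide by zero.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int4(q: List[int], scale: float) -> List[float]:
    """Map the 4-bit integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.7, -0.3, 0.1, 0.0]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)
```

Each weight is stored as a 4-bit integer plus one shared scale per group, which is where the memory saving comes from. The reconstruction error per weight is bounded by about half the scale.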

The project supports distributed inference by splitting large model workloads across multiple accelerators using pipeline and tensor parallelism. It targets deployment on Intel Arc, Flex, and Max GPUs, with the goal of higher throughput and lower latency.
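
The idea behind tensor parallelism can be sketched without any hardware. In the pure-Python example below, a weight matrix is split column-wise into shards, each shard computes its slice of the output, and the slices are concatenated. This is a conceptual sketch under simplifying assumptions (no real devices, no communication), and every function name in it is hypothetical rather than an ipex-llm API.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain dense matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix column-wise into up to `parts` contiguous shards."""
    n = len(w[0])
    step = (n + parts - 1) // parts
    return [[row[i:i + step] for row in w] for i in range(0, n, step)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each shard (standing in for one accelerator) computes a slice of the output."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenate the per-shard output slices row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0]]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
```

Because each shard holds only a fraction of the weight columns, a model that does not fit in one device's memory can be spread over several. Pipeline parallelism differs in that it assigns whole layers, rather than slices of a layer, to each device.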

The library also supports low-precision finetuning and can load models from a range of community formats. It includes tools for measuring model quality using standard perplexity metrics.
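
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities; it shows the standard definition only, and the `perplexity` function is a hypothetical helper, not the evaluation tool shipped with the project.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given tokens.

    `token_log_probs` holds the natural-log probability the model assigned
    to each observed token. Hypothetical helper for illustration.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# A model that assigns probability 0.25 to every token has perplexity of about 4.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
```

Lower values mean the model is less surprised by the text. Comparing perplexity before and after quantization is a common way to check how much quality a low-bit format costs.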

## Tags

### Artificial Intelligence & ML

- [XPU Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/xpu-accelerators.md) — Offloads tensor computations to Intel GPUs and NPUs using optimized low-level libraries for increased throughput.
- [AI Ecosystem Backends](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-ecosystem-backends.md) — Provides an optimized backend compatible with the HuggingFace ecosystem to simplify open-weights model deployment.
- [Cross-Hardware Workload Distribution](https://awesome-repositories.com/f/artificial-intelligence-ml/cross-hardware-workload-distribution.md) — Coordinates the distribution of inference tasks across heterogeneous hardware including CPUs, GPUs, and NPUs.
- [Distributed Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-inference-engines.md) — Splits large language model workloads across multiple accelerators to handle models exceeding single-device memory.
- [LLM Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/llm-model-integrations.md) — Integrates optimized hardware backends with high-level AI tools like HuggingFace to simplify model deployment. ([source](https://github.com/intel-analytics/ipex-llm#readme))
- [GPU-Accelerated Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-accelerated-inference.md) — Accelerates the inference phase of large language models specifically on Intel Arc, Flex, and Max GPUs.
- [GPU Inference SDKs](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-inference-sdks.md) — Provides an execution environment that optimizes model throughput and latency across Intel graphics hardware.
- [Hardware-Accelerated Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-accelerated-inference.md) — Increases throughput and reduces latency by executing models directly on specialized Intel GPU and NPU hardware. ([source](https://github.com/intel-analytics/ipex-llm#readme))
- [Hardware Execution Bridges](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-execution-bridges.md) — Connects optimized execution kernels to high-level AI frameworks by mapping standard model formats to hardware implementations.
- [Model Inference Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/serving-and-runtime/large-language-model-optimization/model-inference-optimizations.md) — Provides hardware-specific performance optimizations for executing large language models on Intel GPUs and NPUs. ([source](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/PythonAPI/optimize.md))
- [Language Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training.md) — Supports low-precision finetuning of large language models to optimize performance for specific tasks on Intel hardware. ([source](https://github.com/intel-analytics/ipex-llm#readme))
- [Quantization Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-optimization-techniques/quantization-toolkits.md) — Provides a set of tools for converting model weights into low-bit precision formats to reduce memory usage.
- [Model Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/model-quantization.md) — Reduces model memory footprints by converting weights to lower precision for faster local execution.
- [Weight Quantization](https://awesome-repositories.com/f/artificial-intelligence-ml/quantized-inference-runtimes/weight-quantization.md) — Compresses model weights into lower-precision formats to reduce memory footprint and accelerate execution.
- [Tensor Parallelism](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-parallelism.md) — Splits large model weights across multiple hardware accelerators to manage memory and increase speed.
- [Tensor Parallelism Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-parallelism-frameworks.md) — Splits neural network layers across multiple hardware devices using pipeline and tensor parallelism.
- [External Model Loading](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-model-formats/external-model-loading.md) — Implements the loading of model weights from diverse community standards for use within its optimized runtime. ([source](https://github.com/intel-analytics/ipex-llm#readme))
- [Mixed Precision Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/distributed-and-accelerated-compute/training-acceleration-tools/mixed-precision-training.md) — Employs lower-bit precision formats to reduce memory consumption during the model finetuning process.
- [Finetuning Workflows](https://awesome-repositories.com/f/artificial-intelligence-ml/model-pretraining-frameworks/finetuning-workflows.md) — Adapts pretrained foundation models using specialized finetuning workflows on Intel hardware.
- [Model Persistence](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training/model-persistence.md) — Saves and restores low-bit quantized models to disk to minimize resource consumption during initialization. ([source](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/PythonAPI/optimize.md))

### Part of an Awesome List

- [XPU Deployment Orchestration](https://awesome-repositories.com/f/awesome-lists/ai/model-optimization-and-deployment/xpu-deployment-orchestration.md) — Integrates optimized Intel hardware backends with tools like HuggingFace and vLLM for streamlined model orchestration.
- [Machine Learning Libraries](https://awesome-repositories.com/f/awesome-lists/ai/machine-learning-libraries.md) — Accelerated LLM inference on Intel hardware.
- [Science and Data Analysis](https://awesome-repositories.com/f/awesome-lists/ai/science-and-data-analysis.md) — LLM inference and finetuning acceleration.
