ipex-llm is an acceleration library and inference engine for running and finetuning large language models on Intel GPUs and NPUs. It provides a Hugging Face-compatible model backend and a quantization toolkit that converts model weights to low-bit precision formats.
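To make "low-bit precision formats" concrete, here is a minimal sketch of symmetric 4-bit quantization in plain Python. It is an illustration of the general idea only, not ipex-llm's implementation: the function names `quantize_sym_int4` and `dequantize` are hypothetical, and a single shared scale per weight list is an assumption made for brevity.

```python
from typing import List, Tuple


def quantize_sym_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed 4-bit codes in [-8, 7] using one shared scale.

    Hypothetical helper for illustration; not part of ipex-llm.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Choose the scale so the largest magnitude maps to code 7.
    # Fall back to 1.0 when all weights are zero to avoid dividing by zero.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the signed 4-bit range.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    codes, scale = quantize_sym_int4([7.0, -3.0, 0.0, 1.2])
    print(codes, scale)
    print(dequantize(codes, scale))
```

Each weight is stored as a small integer plus a shared scale, so the reconstruction error per weight is at most half the scale. Production low-bit formats typically use a separate scale per block of weights to keep that error small.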
The project supports distributed inference by splitting large model workloads across multiple accelerators with pipeline and tensor parallelism. Models can be deployed on Intel Arc, Flex, and Max GPUs, with the aim of increasing throughput and reducing latency.
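As a rough illustration of tensor parallelism, the sketch below splits a weight matrix by columns into shards, computes each shard's partial output independently, and concatenates the results. It uses plain Python lists and runs on a single process; the helper names (`matvec`, `split_columns`, `tensor_parallel_matvec`) are hypothetical and do not reflect how ipex-llm distributes work across devices.

```python
from typing import List

Matrix = List[List[float]]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Compute the row-vector product x @ w, where w has len(x) rows."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equally wide column shards."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of parts")
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matvec(x: List[float], w: Matrix, parts: int) -> List[float]:
    """Compute x @ w shard by shard, then concatenate the partial outputs.

    In a real deployment each shard would live on its own accelerator;
    here the shards are processed one after another for illustration.
    """
    partials = [matvec(x, shard) for shard in split_columns(w, parts)]
    return [value for part in partials for value in part]


if __name__ == "__main__":
    x = [1.0, 2.0]
    w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    print(tensor_parallel_matvec(x, w, parts=2))
```

Because each output column depends only on its own column of weights, the sharded result matches the unsharded product exactly. Pipeline parallelism differs in that it assigns whole layers, rather than slices of one layer, to different devices.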
The library also supports low-precision finetuning for updating models locally and can load a variety of community model formats. It includes tools for evaluating a model's predictive performance with the standard perplexity metric.
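For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to each token. The sketch below computes it from a list of per-token natural-log probabilities; the `perplexity` function is a hypothetical stand-alone helper, not the evaluation tool shipped with the library.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given tokens.

    `token_log_probs` holds the natural-log probability the model assigned
    to each observed token. Lower perplexity means better prediction.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that gives every token probability 0.25 is as uncertain
    # as a uniform choice among four options.
    print(perplexity([math.log(0.25)] * 4))
```

Comparing perplexity before and after quantization on the same text is a common way to check how much predictive quality a low-bit format costs.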