llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search.
The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal grammars to force model outputs to adhere to specific JSON schemas or patterns, and it implements speculative decoding to increase inference speed.
Broad capabilities include hardware acceleration for GPUs, tools for converting models between different data formats, and utilities for measuring model quality via perplexity and divergence metrics. The engine can be wrapped in an HTTP server that provides an OpenAI-compatible API for integration with external tools.