# ggml-org/llama.cpp

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/ggml-org-llama-cpp).**

116,799 stars · 19,628 forks · C++ · MIT

## Links

- GitHub: https://github.com/ggml-org/llama.cpp
- awesome-repositories: https://awesome-repositories.com/repository/ggml-org-llama-cpp.md

## Topics

`ggml`

## Description

Llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU architectures.

The project distinguishes itself by offering a lightweight HTTP server that adheres to standard API specifications, enabling chat completion, embeddings, and reranking services. It includes a suite of tools for model quantization and conversion, which reduces memory usage and improves performance, alongside a command-line interface for managing chat templates and inference parameters.

The ecosystem further supports structured data generation through grammar-based output constraints and provides diagnostic utilities for visualizing computational graphs. Comprehensive documentation is available, including a reference matrix that details the compatibility of computational operations across supported hardware backends.

## Tags

### Artificial Intelligence & ML

- [Text-Only Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/text-only-inference-engines.md) — Executes large language models locally on standard consumer hardware with high performance. ([source](https://cdn.jsdelivr.net/gh/ggml-org/llama.cpp@master/README.md))
- [Hardware Abstraction Layers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/hardware-and-acceleration/hardware-abstraction-layers.md) — Unifies diverse CPU and GPU architectures through a common interface to normalize model execution across heterogeneous hardware. ([source](https://cdn.jsdelivr.net/gh/ggml-org/llama.cpp@master/README.md))
- [Multimodal Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/multimodal-inference-engines.md) — Processes both text and image inputs locally to enable multimodal model capabilities on standard consumer devices. ([source](https://cdn.jsdelivr.net/gh/ggml-org/llama.cpp@master/README.md))
- [Inference API Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/inference-api-servers.md) — Exposes inference capabilities via a lightweight HTTP server that supports standard chat completion and embedding endpoints. ([source](https://cdn.jsdelivr.net/gh/ggml-org/llama.cpp@master/README.md))
- [Model Quantization Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/serving-and-runtime/model-quantization-tools.md) — Compresses model weights into quantized formats to significantly reduce memory footprint and boost inference speed. ([source](https://cdn.jsdelivr.net/gh/ggml-org/llama.cpp@master/README.md))
- [Command Line Inference Interfaces](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/local-and-on-device-inference/command-line-inference-interfaces.md) — Terminal-based utilities allow for direct interaction with models, including configuration of inference parameters and chat management. ([source](https://cdn.jsdelivr.net/gh/ggml-org/llama.cpp@master/README.md))

### Part of an Awesome List

- [AI and Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/ai-and-machine-learning.md) — Efficient inference engine for large language models.
- [AI & Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/ai-machine-learning.md) — High-performance local inference for LLaMA models.
- [Inference and Serving](https://awesome-repositories.com/f/awesome-lists/ai/inference-and-serving.md) — High-performance inference engine written in C/C++.
- [Inference Engines](https://awesome-repositories.com/f/awesome-lists/ai/inference-engines.md) — Efficient LLM inference implementation in C/C++.
- [Large Language Models](https://awesome-repositories.com/f/awesome-lists/ai/large-language-models.md) — High-performance LLM inference in C/C++.
- [Model Quantization](https://awesome-repositories.com/f/awesome-lists/ai/model-quantization.md) — Listed in the “Model Quantization” section of the Llm Course awesome list.
- [Model Serving & Deployment](https://awesome-repositories.com/f/awesome-lists/ai/model-serving-deployment.md) — Performs efficient local inference for various LLMs.
- [Running Models](https://awesome-repositories.com/f/awesome-lists/ai/running-models.md) — Listed in the “Running Models” section of the Llm Course awesome list.