# google/gemma.cpp

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/google-gemma-cpp).**

6,735 stars · 597 forks · C++ · Apache-2.0

## Links

- GitHub: https://github.com/google/gemma.cpp
- awesome-repositories: https://awesome-repositories.com/repository/google-gemma-cpp.md

## Description

gemma.cpp is a C++ inference engine for Gemma, PaliGemma, and Griffin language models, designed to run directly on device without Python dependencies. It provides a self-contained runtime that loads quantized model weights and performs text generation on CPU or GPU, along with a model checkpoint converter that transforms PyTorch or Keras checkpoints into a compact binary format for fast loading.

The engine supports multiple model architectures, including the Griffin recurrent architecture with gated linear recurrent layers and sliding-window attention for efficient long-sequence handling, as well as vision-language fusion for processing images alongside text prompts. It features a grammar-constrained decoder that enforces user-defined acceptance rules to produce structured or formatted output, and a callback-based token generation system that yields each output token to a user-supplied function for custom streaming or early termination.
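
To make the idea of a grammar-constrained decoder concrete, here is a minimal sketch of greedy token selection filtered by an acceptance rule. It is an illustration only, not gemma.cpp's API: `AcceptFn` and `ConstrainedArgmax` are hypothetical names, and the "grammar" is reduced to a predicate over the tokens emitted so far and one candidate token.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical acceptance rule: given the tokens emitted so far and a
// candidate token ID, return true if the candidate is allowed to come next.
using AcceptFn =
    std::function<bool(const std::vector<int>& prefix, int candidate)>;

// Greedy selection restricted to accepted tokens. The index into `logits`
// is treated as the token ID. Returns -1 if the rule rejects every token.
int ConstrainedArgmax(const std::vector<float>& logits,
                      const std::vector<int>& prefix,
                      const AcceptFn& accept) {
  int best = -1;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    const int candidate = static_cast<int>(i);
    if (!accept(prefix, candidate)) continue;  // Rejected by the rule.
    if (best < 0 || logits[i] > logits[static_cast<std::size_t>(best)]) {
      best = candidate;
    }
  }
  return best;
}
```

A caller would run this once per decoding step, append the returned token to `prefix`, and stop when it returns -1 or an end-of-sequence token. A sampling decoder could apply the same filter before drawing from the remaining probabilities instead of taking the maximum.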

The project compiles into a standalone shared library, which external CMake projects can pull in via FetchContent to build custom inference frontends. It covers the full inference pipeline, from checkpoint conversion and weight loading through tokenization, single-step forward passes, and token decoding, in a lightweight runtime that needs no external services.

## Tags

### Programming Languages & Runtimes

- [C++ Inference Runtimes](https://awesome-repositories.com/f/programming-languages-runtimes/c-inference-runtimes.md) — "A library for running Gemma, PaliGemma, and Griffin language models directly in C++ with no Python dependencies."

### Artificial Intelligence & ML

- [Generative Text Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/generative-text-inference.md) — Load a Gemma model and generate text responses from user prompts in an interactive terminal session. ([source](https://cdn.jsdelivr.net/gh/google/gemma.cpp@main/README.md))
- [On-Device Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-clients/on-device-inference.md) — Running large language models locally on a CPU or GPU for text generation without cloud dependencies.
- [Model Loading](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/data-and-checkpointing/model-loading.md) — Load a pre-trained model from a binary weight file and prepare it for inference. ([source](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md))
- [Model Checkpoint Converters](https://awesome-repositories.com/f/artificial-intelligence-ml/model-checkpoint-converters.md) — Transforming PyTorch or Keras model checkpoints into a binary format for fast loading in C++ applications. ([source](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md))
- [On-Device Models](https://awesome-repositories.com/f/artificial-intelligence-ml/on-device-models.md) — "A lightweight runtime that loads quantized model weights and runs text generation on CPU or GPU without external services."
- [Vision-Language Models](https://awesome-repositories.com/f/artificial-intelligence-ml/on-device-models/vision-language-models.md) — Loads a PaliGemma model that encodes an image and text prompt jointly, then generates descriptive or question-answering responses. ([source](https://cdn.jsdelivr.net/gh/google/gemma.cpp@main/README.md))
- [Gated Linear Recurrent Layers](https://awesome-repositories.com/f/artificial-intelligence-ml/recurrent-neural-networks/gated-recurrent-units/gated-linear-recurrent-layers.md) — Uses a gated linear recurrent layer with sliding-window attention to reduce memory and handle longer sequences efficiently.
- [Inference Step Executions](https://awesome-repositories.com/f/artificial-intelligence-ml/step-based-schedulers/step-execution-engines/inference-step-executions.md) — Perform one forward pass through the neural network for a single token, mutating the internal activations and key-value cache. ([source](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md))
- [Grammar-Constrained Samplers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-generation-strategies/token-prediction/grammar-constrained-samplers.md) — Forcing token generation to follow a user-defined grammar or acceptance rule for structured output.
- [Text Tokenizers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers.md) — Convert a string prompt into a vector of token IDs using a loaded tokenizer model. ([source](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md))
- [Callback-Based Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers/callback-based-generators.md) — Accept a tokenized prompt and produce output tokens one at a time, calling a user-defined callback for each token. ([source](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md))
- [Token Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers/token-decoders.md) — Convert a vector of token IDs back into a human-readable string. ([source](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md))
- [Weight Serialization](https://awesome-repositories.com/f/artificial-intelligence-ml/weight-reconstruction/weight-serialization.md) — Converts PyTorch or Keras checkpoints into a binary blob with a fixed memory layout for zero-deserialization loading at runtime.
- [Griffin Recurrent Model Runtimes](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/frameworks/model-construction/neural-network-layers/recurrent-layers/recurrent-model-definitions/griffin-recurrent-model-runtimes.md) — Loading and running Griffin-based recurrent models that use less memory and handle longer sequences efficiently. ([source](https://cdn.jsdelivr.net/gh/google/gemma.cpp@main/README.md))
- [Custom Frontend Definitions](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/model-integration-pipelines/model-inference/inference-context-customization/custom-frontend-definitions.md) — Replace the built-in interactive interface with a custom application that calls the model's generation and tokenizer functions. ([source](https://github.com/google/gemma.cpp/blob/main/DEVELOPERS.md))
- [Custom Frontend Development](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/model-integration-pipelines/model-inference/inference-context-customization/custom-frontend-development.md) — Building custom applications that call model generation and tokenizer functions instead of using the built-in interactive interface.
- [Custom Frontend SDKs](https://awesome-repositories.com/f/artificial-intelligence-ml/model-serving/embedded-inference-libraries/custom-frontend-sdks.md) — "A shared library and CMake integration for embedding Gemma model inference into external C++ applications."
- [Vision-Language Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-language-inference.md) — Processing images alongside text prompts to generate descriptive or question-answering responses using a multimodal model.
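
The Weight Serialization entry above describes a binary blob with a fixed memory layout. The sketch below shows one way such a layout could work, assuming a hypothetical format of an 8-byte element count followed by raw `float` values in host byte order. The function names and the format are illustrative assumptions and do not reflect gemma.cpp's actual file format.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical layout: [uint64 count][count raw floats], host byte order.
std::vector<std::uint8_t> SerializeWeights(const std::vector<float>& weights) {
  const std::uint64_t count = weights.size();
  std::vector<std::uint8_t> blob(sizeof(count) + weights.size() * sizeof(float));
  std::memcpy(blob.data(), &count, sizeof(count));
  if (!weights.empty()) {
    std::memcpy(blob.data() + sizeof(count), weights.data(),
                weights.size() * sizeof(float));
  }
  return blob;
}

// Reads the blob back. Because the payload is already in the in-memory
// layout, loading is a size check plus one bulk copy, with no per-element
// parsing.
std::vector<float> LoadWeights(const std::vector<std::uint8_t>& blob) {
  std::uint64_t count = 0;
  if (blob.size() < sizeof(count)) throw std::runtime_error("blob too small");
  std::memcpy(&count, blob.data(), sizeof(count));
  const std::size_t payload = blob.size() - sizeof(count);
  if (payload % sizeof(float) != 0 || payload / sizeof(float) != count) {
    throw std::runtime_error("blob size does not match header");
  }
  std::vector<float> weights(payload / sizeof(float));
  if (!weights.empty()) {
    std::memcpy(weights.data(), blob.data() + sizeof(count), payload);
  }
  return weights;
}
```

This version copies the payload into a `std::vector<float>` for simplicity. A loader aiming to avoid even that copy could memory-map the file and point directly at the payload, at the cost of handling alignment and file lifetime itself.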

### Networking & Communication

- [Token Generation Callbacks](https://awesome-repositories.com/f/networking-communication/callback-based-data-streaming/token-generation-callbacks.md) — Yields each output token to a user-supplied callback, enabling custom streaming, filtering, or early termination.
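
A callback-driven generation loop of the kind described above can be sketched as follows. All names here (`TokenCallback`, `NextTokenFn`, `GenerateWithCallback`) are hypothetical, and the model is replaced by a plain function from the sequence so far to the next token ID, so this shows the control flow only, not gemma.cpp's real interface.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical callback: receives each generated token ID and returns
// false to request early termination.
using TokenCallback = std::function<bool(int token)>;

// Hypothetical stand-in for a model step: maps the sequence so far to the
// next token ID.
using NextTokenFn = std::function<int(const std::vector<int>& sequence)>;

// Generates up to `max_new_tokens` tokens, handing each one to `on_token`
// as soon as it is produced. Returns the number of tokens generated.
std::size_t GenerateWithCallback(const std::vector<int>& prompt,
                                 std::size_t max_new_tokens,
                                 const NextTokenFn& next_token,
                                 const TokenCallback& on_token) {
  std::vector<int> sequence = prompt;
  std::size_t generated = 0;
  while (generated < max_new_tokens) {
    const int token = next_token(sequence);
    sequence.push_back(token);
    ++generated;
    if (!on_token(token)) break;  // Caller asked to stop.
  }
  return generated;
}
```

The callback is where streaming, filtering, or stop conditions would live: it can print each token as it arrives, collect tokens into a buffer, or return `false` once a stop token or length budget is reached.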

### Software Engineering & Architecture

- [In-Place Cache Mutation](https://awesome-repositories.com/f/software-engineering-architecture/logic-control-engines/forward-pass-logic/in-place-cache-mutation.md) — Runs one neural network inference step per token, mutating internal activations and key-value cache in place.
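
The per-token, in-place update pattern can be illustrated with a deliberately simplified toy. `ToyKVCache` and `Step` are hypothetical: each position stores a single key and value scalar, and the "attention" output is just the mean of the cached values. A real transformer step is far more involved; the point is only that the cache is preallocated once and each step writes into it rather than allocating new state.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified cache: one key and one value scalar per position,
// preallocated to a fixed capacity.
struct ToyKVCache {
  std::vector<float> keys;
  std::vector<float> values;
  std::size_t length = 0;  // Number of positions filled so far.

  explicit ToyKVCache(std::size_t capacity)
      : keys(capacity), values(capacity) {}
};

// One inference step for a single token: writes this position's key and
// value into the cache in place, then reads back over all cached positions.
// Returns the mean of the cached values as a toy stand-in for attention.
float Step(float token_feature, ToyKVCache& cache) {
  if (cache.length >= cache.keys.size()) {
    throw std::out_of_range("cache is full");
  }
  cache.keys[cache.length] = token_feature;
  cache.values[cache.length] = 2.0f * token_feature;
  ++cache.length;

  float sum = 0.0f;
  for (std::size_t i = 0; i < cache.length; ++i) sum += cache.values[i];
  return sum / static_cast<float>(cache.length);
}
```

Because `Step` takes the cache by mutable reference, the caller owns the state and can reuse one cache across a whole conversation, or reset `length` to zero to start a new one without reallocating.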

### Operating Systems & Systems Programming

- [CMake FetchContent Integration](https://awesome-repositories.com/f/operating-systems-systems-programming/systems-programming/shared-library-development/cmake-fetchcontent-integration.md) — Compiles the inference engine into a standalone shared library for linking into external CMake projects via FetchContent.

### Part of an Awesome List

- [AI & Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/ai-machine-learning.md) — Lightweight inference engine for Google's Gemma models
