gemma.cpp is a C++ inference engine for Gemma, PaliGemma, and Griffin language models, designed to run directly on-device without Python dependencies. It provides a self-contained runtime that loads quantized model weights and performs text generation on CPU or GPU, along with a model checkpoint converter that transforms PyTorch or Keras checkpoints into a compact binary format for fast loading.
The engine supports multiple model architectures. These include the Griffin recurrent architecture, which combines gated linear recurrent layers with sliding-window attention to handle long sequences efficiently, and vision-language fusion for processing images alongside text prompts. Two decoding features are built in: a grammar-constrained decoder that enforces user-defined acceptance rules to produce structured or formatted output, and a callback-based token generation system that passes each output token to a user-supplied function, which can stream the token or stop generation early.
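
The interaction between acceptance rules and per-token callbacks can be sketched in a few lines of C++. This is a minimal illustration and not gemma.cpp's actual API: the names `Token`, `StreamFn`, `AcceptFn`, `RankFn`, and `Generate` are hypothetical, the model is replaced by a function that returns candidate tokens ranked best-first, and greedy selection is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical types for illustration only; not gemma.cpp's API.
using Token = int;
// Called once per emitted token; return false to stop generation early.
using StreamFn = std::function<bool(Token)>;
// Acceptance rule: true if `candidate` may follow the tokens emitted so far.
using AcceptFn = std::function<bool(const std::vector<Token>&, Token)>;
// Stand-in for the model: returns candidate tokens ranked best-first.
using RankFn = std::function<std::vector<Token>(const std::vector<Token>&)>;

// Greedy constrained decoding: at each step, emit the highest-ranked
// candidate that the acceptance rule allows, then pass it to the callback.
std::vector<Token> Generate(const RankFn& rank, const AcceptFn& accept,
                            const StreamFn& stream, std::size_t max_tokens) {
  assert(rank && accept && stream);
  std::vector<Token> out;
  while (out.size() < max_tokens) {
    bool emitted = false;
    for (Token candidate : rank(out)) {
      if (!accept(out, candidate)) continue;  // rejected by the rule
      out.push_back(candidate);
      emitted = true;
      break;
    }
    if (!emitted) break;             // no acceptable candidate: stop
    if (!stream(out.back())) break;  // callback requested early termination
  }
  return out;
}
```

In this sketch the acceptance rule filters candidates before a token is committed, while the callback only observes committed tokens, so a callback that returns `false` ends generation without discarding the token it was just given.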
The project compiles into a standalone shared library, so external CMake projects can pull it in via FetchContent and build custom inference frontends on top of it. It covers the full inference pipeline, from checkpoint conversion and weight loading through tokenization, single-step forward passes, and token decoding, in a lightweight runtime that needs no external services.
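
The tokenize, forward, decode sequence described above can be illustrated with a toy loop. Everything here is an assumption made for the sketch: `Tokenize` and `Detokenize` are a hypothetical byte-level tokenizer, `ForwardFn` stands in for a single-step forward pass that returns logits over the vocabulary, and `Complete` uses greedy argmax selection. None of these names come from the engine itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical byte-level tokenizer: one token id per input byte.
std::vector<int> Tokenize(const std::string& text) {
  std::vector<int> ids;
  for (unsigned char c : text) ids.push_back(c);
  return ids;
}

// Inverse of Tokenize: maps each id back to a single byte.
std::string Detokenize(const std::vector<int>& ids) {
  std::string text;
  for (int id : ids) text.push_back(static_cast<char>(id));
  return text;
}

// Stand-in for a single-step forward pass: consumes one token at position
// `pos` and returns logits over the vocabulary for the next token.
using ForwardFn = std::function<std::vector<float>(int token, std::size_t pos)>;

// Greedy completion: feed the prompt one token at a time (prefill), then
// repeatedly pick the argmax token and feed it back until `eos` or the cap.
std::string Complete(const std::string& prompt, const ForwardFn& forward,
                     std::size_t max_new_tokens, int eos) {
  const std::vector<int> prompt_ids = Tokenize(prompt);
  std::vector<float> logits;
  std::size_t pos = 0;
  for (int id : prompt_ids) logits = forward(id, pos++);

  std::vector<int> generated;
  while (generated.size() < max_new_tokens && !logits.empty()) {
    const int next = static_cast<int>(
        std::max_element(logits.begin(), logits.end()) - logits.begin());
    if (next == eos) break;
    generated.push_back(next);
    logits = forward(next, pos++);
  }
  return Detokenize(generated);
}
```

A real runtime would keep per-position state such as a key-value cache or recurrent state inside the forward step; the sketch passes only the position index to keep the control flow visible.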