This project is a vision language model framework and vision-to-text pipeline for deploying and optimizing models that process both images and text. Its on-device inference engine runs quantized models locally on mobile and desktop hardware accelerators. A model quantization toolkit reduces weight precision, which lowers the memory footprint and speeds up execution on specialized silicon. It also includes an efficient vision encoder that uses a hybrid encoding system to compress image tokens, which reduces pro
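As a rough illustration of what reducing weight precision means, the sketch below applies symmetric per-tensor 8-bit quantization to a small list of weights. It is a generic textbook scheme, not this project's toolkit: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the bit widths and scaling strategy the project actually uses are not stated in the description.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to signed 8-bit integers.

    Hypothetical helper for illustration only. Returns the integer codes
    and the scale needed to map them back to floats.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Each weight now needs 1 byte instead of 4 (float32), at the cost of a
# rounding error of at most half a quantization step per weight.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
```

Storing one byte per weight plus a single scale is where the memory saving comes from; integer arithmetic on the codes is what accelerators can then exploit for speed.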
gemma.cpp is a C++ inference engine for Gemma, PaliGemma, and Griffin language models, designed to run directly on-device without Python dependencies. It provides a self-contained runtime that loads quantized model weights and performs text generation on CPU or GPU, along with a model checkpoint converter that transforms PyTorch or Keras checkpoints into a compact binary format for fast loading. The engine supports multiple model architectures, including the Griffin recurrent architecture with gated linear recurrent layers and sliding-window attention for efficient long-sequence handling, as
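To make the sliding-window attention idea concrete, here is a minimal sketch of the mask such a layer typically uses: each position attends only to itself and a fixed number of preceding positions. This is a generic illustration written in Python rather than C++, and it does not reproduce gemma.cpp's code; the function name `sliding_window_mask` and the window size are assumptions.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Build a causal sliding-window attention mask.

    mask[i][j] is True when query position i may attend to key position j,
    i.e. j is not in the future and lies within the last `window` positions.
    Hypothetical helper for illustration only.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=4, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because every row allows at most `window` positions, the attention cost per token stays constant as the sequence grows, which is what makes the approach attractive for long sequences.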
Narrator is an artificial intelligence system that converts real-time video feeds into spoken natural language descriptions. It functions as a multimodal vision narrator and scene descriptor, using computer vision to turn what a camera sees into synthetic speech. The tool operates as a pipeline that captures images from a feed at regular intervals and uses a multimodal large language model to analyze the visual events in them. Text-to-speech synthesis then converts these analyses into a voiceover that describes real-world activities and surroundings. The system supports automated environm
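The capture, analyze, and speak stages described above can be sketched as a small loop. This is only an outline under stated assumptions: `narrate`, `describe`, `speak`, and `every_n` are hypothetical names, and the model and speech components are replaced by stand-in functions, since the description does not say which services Narrator calls.

```python
from typing import Callable, Iterable, List


def narrate(
    frames: Iterable[bytes],
    describe: Callable[[bytes], str],
    speak: Callable[[str], None],
    every_n: int = 30,
) -> List[str]:
    """Describe every `every_n`-th frame and pass the text to a speech sink.

    `describe` stands in for a multimodal model call and `speak` for a
    text-to-speech call; both are assumptions supplied by the caller.
    """
    spoken = []
    for index, frame in enumerate(frames):
        if index % every_n != 0:
            continue  # sample periodically instead of analyzing every frame
        text = describe(frame)
        speak(text)
        spoken.append(text)
    return spoken


# Stand-in components so the sketch runs without a camera, model, or speaker.
fake_frames = [f"frame-{i}".encode() for i in range(90)]
audio_log: List[str] = []
captions = narrate(
    fake_frames,
    describe=lambda frame: f"Scene from {frame.decode()}",
    speak=audio_log.append,
    every_n=30,
)
```

Passing the model and speech steps in as callables keeps the sampling logic testable on its own and lets either component be swapped without touching the loop.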