This project is an on-device AI SDK: a framework for running large language models, vision models, and speech models locally. It acts as an orchestration layer for local inference. Because models execute on the device, user data stays local and features remain available offline, and the SDK uses the device's hardware acceleration for performance.
The SDK's distinguishing features are its voice and multimodal capabilities. It provides a coordinated voice pipeline covering voice activity detection, speech-to-text, and text-to-speech synthesis. It also provides a dedicated implementation kit for local retrieval-augmented generation, and tools for processing combined image and text inputs with vision-language models.
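To make the idea of a coordinated voice pipeline concrete, here is a minimal sketch of how the stages could be chained. It is not the SDK's API: `is_speech`, `run_voice_turn`, and the `transcribe`, `respond`, and `synthesize` callbacks are hypothetical names, the energy threshold is a toy stand-in for a real voice activity detection model, and the `respond` stage (a language model reply between speech-to-text and text-to-speech) is an assumption about how such a pipeline is typically used.

```python
from typing import Callable, Iterable, List, Optional

# One chunk of audio samples, each assumed to be in [-1.0, 1.0].
Frame = List[float]


def is_speech(frame: Frame, threshold: float = 0.02) -> bool:
    """Toy energy-based detector; a real SDK would likely use a trained VAD model."""
    if not frame:
        return False
    energy = sum(s * s for s in frame) / len(frame)  # mean squared amplitude
    return energy >= threshold


def run_voice_turn(
    frames: Iterable[Frame],
    transcribe: Callable[[List[Frame]], str],  # hypothetical speech-to-text stage
    respond: Callable[[str], str],             # hypothetical language model stage
    synthesize: Callable[[str], bytes],        # hypothetical text-to-speech stage
) -> Optional[bytes]:
    """Keep only voiced frames, then run STT -> reply -> TTS once on them."""
    voiced = [f for f in frames if is_speech(f)]
    if not voiced:
        return None  # nothing was said, so the later stages are skipped
    text = transcribe(voiced)
    reply = respond(text)
    return synthesize(reply)
```

Passing the stages in as callbacks keeps the coordination logic independent of any particular model, so each stage can be replaced or stubbed out in tests.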
Beyond these, the SDK handles model lifecycle management: downloading, caching, and swapping fine-tuned adapters at runtime. It also supports structured output generation, tool calling for integrating external functions, and hardware-accelerated image generation.
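Structured output and tool calling usually work together: the model emits a structured description of a call, and the host application validates it and dispatches to a registered function. The sketch below illustrates that pattern under stated assumptions. The JSON shape (`name` plus `arguments`), the `tool` decorator, the `TOOLS` registry, and the two example tools are all hypothetical and do not reflect the SDK's actual interface.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to plain Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so a model can request it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def add(a: float, b: float) -> float:
    return a + b


@tool
def get_battery_level() -> int:
    return 87  # placeholder value for illustration only


def dispatch_tool_call(model_output: str) -> Any:
    """Parse a model's JSON tool call, validate its shape, and run the tool."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        raise ValueError("expected an object with a string 'name' field")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be an object")
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**args)  # mismatched argument names surface as TypeError
```

For example, `dispatch_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')` returns `5`. Validating the shape before dispatch means malformed model output fails with a clear error instead of reaching application code.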
The SDK also includes performance monitoring for inference metrics, along with capture tools for camera and microphone input.
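The source does not say which inference metrics are tracked. As an illustration only, the sketch below measures two that are commonly reported for streamed generation: time to first token and tokens per second. `InferenceMetrics` and `measure_stream` are hypothetical names, not part of the SDK, and the token stream is assumed to be any Python iterable of strings.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class InferenceMetrics:
    token_count: int
    time_to_first_token_s: float
    total_time_s: float

    @property
    def tokens_per_second(self) -> float:
        # Guard against division by zero for empty or instantaneous streams.
        return self.token_count / self.total_time_s if self.total_time_s > 0 else 0.0


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[List[str], InferenceMetrics]:
    """Consume a token stream and record basic timing for it."""
    start = clock()
    first_token_at = None
    collected: List[str] = []
    for tok in tokens:
        if first_token_at is None:
            first_token_at = clock()  # taken when the first token arrives
        collected.append(tok)
    end = clock()
    ttft = (first_token_at - start) if first_token_at is not None else 0.0
    return collected, InferenceMetrics(len(collected), ttft, end - start)
```

The `clock` parameter is injectable so the timing logic can be tested deterministically; in normal use the default monotonic `time.perf_counter` is sufficient.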