Nexa Sdk

The nexa-sdk is an on-device AI SDK and multimodal inference engine designed to run large language, vision, and audio models locally on mobile and desktop hardware. It functions as a local LLM runtime and NPU acceleration framework, enabling the execution of generative and discriminative models without reliance on cloud services.

The project distinguishes itself through a dedicated NPU acceleration framework that optimizes model execution on Neural Processing Units to reduce latency and power consumption. It employs hardware-agnostic backend routing to dynamically distribute computations across CPUs, GPUs, and NPUs, and supports GGUF-based model loading for efficient memory mapping and layer offloading.

Its capabilities cover a broad spectrum of AI tasks, including conversational text generation, text-to-image synthesis, and automatic speech recognition. It also provides tools for vector embedding generation and document reranking for local semantic search, as well as a REST-based inference server with an OpenAI-compatible interface for external integration.

The SDK supports cross-platform deployment across Android and Linux environments and includes a Python library for developer integration.

Features

Conversational Response Generation - Produces natural language responses for interactive dialogue through streaming or static interfaces.
NPU Acceleration - Integrates specialized neural processing units to optimize model inference performance and power efficiency on edge hardware.
OpenAI-Compatible APIs - Implements standard HTTP endpoints for model interaction and function calling in an OpenAI-compatible format.
NPU Inference Execution - Executes large language models on NPUs using native formats and GGUF layer offloading.
Hardware Acceleration Backends - Implements hardware-agnostic routing to distribute model computations across CPU, GPU, and NPU backends.
On-Device Inference - Executes large language, vision, and audio models directly on local mobile or desktop hardware.
Local Model Runtimes - Acts as a local runtime for executing language models in GGUF and other formats on-device.
External Model Loading - Imports and initializes pre-trained models from standardized open formats and external repositories.
Multimodal Inference Engines - Functions as a multimodal engine capable of processing combined text, image, and audio inputs locally.
Local Model Inference Servers - Provides a REST-based local inference server with an OpenAI-compatible interface for chat, embeddings, and reranking.
Remote Model Hubs - Downloads pre-trained model snapshots from centralized remote hubs to a local directory for deployment.
Memory-Mapped Loading - Loads large language models using memory-mapped I/O and layer offloading via the GGUF format.
Model Format Parsers - Interprets and translates diverse machine learning model file formats to ensure cross-architecture compatibility.
Multimodal Analysis Tools - Processes combined text, image, and audio inputs for visual understanding and complex multimodal reasoning.
Multimodal Analytical Pipelines - Features a unified multimodal pipeline for processing text, image, and audio data streams.
Backend-Agnostic Engines - Implements a backend-agnostic engine that decouples neural network operations from specific hardware backends for cross-platform execution.
On-Device Models - Provides a comprehensive SDK for running large language, vision, and audio models locally on mobile and desktop hardware.
Generative Media Runtimes - Produces synthetic text, images, and speech locally using generative models without cloud reliance.
Sampling Controls - Adjusts temperature, token selection, and chat templates to control the style and accuracy of generated text.
Inference State Management - Manages internal key-value caches and state to optimize the performance of local model responses.
Cross-Platform Deployment Targets - Supports deployment of AI capabilities across multiple operating systems using native SDKs and containers.
Android Platform Integrations - Integrates AI model execution into Android applications and embedded platforms for local processing.
REST APIs - Provides a REST API to perform chat completions, generate embeddings, and execute reranking tasks.
Automatic Speech Recognition - Converts spoken audio into written text across multiple languages using batch or real-time streaming.
Computer Vision - Processes visual data to identify objects or extract text via optical character recognition.
Document Rerankers - Reranks document lists based on relevance to improve retrieval accuracy in semantic search.
Text-to-Image Generators - Generates high-resolution images from natural language text prompts using hardware-optimized diffusion models.
Image Generation APIs - Provides API endpoints to create visual imagery from text prompts using optimized diffusion models.
Command Line Inference Interfaces - Provides a terminal-based interface for executing text generation, audio analysis, and speech transcription.
Incremental Inference Streaming - Sends model outputs to users incrementally as tokens are generated to reduce perceived latency.
Local Model Lifecycle Managers - Includes tools for managing the local lifecycle of models, including listing and removing cached versions.
Model Downloaders - Facilitates the retrieval of model weights and quantization files from remote hubs for local storage.
Audio Transcription - Offers NPU-accelerated automatic speech recognition for efficient local audio transcription.
Document Reranking - Uses NPU-accelerated reranker models to score document relevance against specific queries locally.
Optical Character Recognition - Performs NPU-accelerated text extraction from images using local computer vision models.
Vector Embedding Generation - Utilizes NPU acceleration to convert text strings into high-dimensional vector representations for local semantic search.
Vision Language Inference - Processes combined text and image inputs through NPU-accelerated vision language models.
Text-to-Speech - Synthesizes natural human speech from text input using on-device generative models.
Vector Embeddings - Creates high-dimensional numerical representations of text to enable semantic search and retrieval-augmented generation.
Visual Analysis - Processes images and text simultaneously to perform multimodal reasoning and visual understanding.
Vector Search - Implements on-device vector embedding generation and document reranking for semantic search.
Python Library Integrations - Exposes model management and inference capabilities through a Python library for application integration.
On-Device Model Management - Manages the local downloading, storage, and versioning of model snapshots and quantization versions on device.
Model Deployment Tools - Token-compressed models for efficient on-device inference.

openvinotoolkit/openvino

10,414View on GitHub

OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and

sgl-project/sglang

29,079View on GitHub

Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr

pytorch/executorch

4,296View on GitHub

ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,

intel/ipex-llm

8,836View on GitHub

Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP

NexaAInexa-sdk

Features