LiteRT (google-ai-edge/LiteRT)

Features

Local Inference Engines - Ships a high-performance inference engine and API for executing machine learning and generative AI models on mobile, desktop, and IoT hardware.
On-Device Inference Engines - Provides an inference engine optimized for executing machine learning and generative AI models locally on mobile, desktop, and IoT hardware.
On-Device Deployments - Enables the deployment and execution of large language models and agentic planning capabilities entirely on-device.
Generative AI Runtimes - Provides a specialized environment for running quantized large language and diffusion models locally on edge hardware.
On-Device LLM Runners - Provides a specialized environment for running quantized large language and diffusion models locally on edge hardware.
Edge Hardware Optimizations - Applies quantization and graph optimizations to reduce memory footprint and increase inference speed on resource-constrained hardware.
Model Format Converters - Provides tools for transforming trained machine learning models into optimized formats for deployment on resource-constrained edge devices.
Model Quantization - Applies quantization and architecture-specific optimizations to enable large language and diffusion models to run locally.
NPU Acceleration - Integrates with specialized Neural Processing Units to optimize model inference performance and energy efficiency.
NPU Accelerators - Implements a unified interface for executing machine learning inference on Neural Processing Units with automatic CPU and GPU fallbacks.
Post-Training Quantization - Reduces model precision from floating point to integers after training to decrease binary size and increase inference speed.
Model Bytecode Compilation - Translates high-level machine learning model definitions into optimized low-level bytecode for specific edge hardware architectures.
Model Conversion - Transforms machine learning models into specialized formats to increase execution speed and reduce memory usage on edge hardware.
Build-Time Bytecode Compilation - Translates models into bytecode during the build process to reduce runtime initialization and memory overhead on edge devices.
Ahead-of-Time Wasm Execution - Compiles models into hardware-specific bytecode before deployment to minimize startup latency on constrained devices.
Computation Subgraph Delegation - Partitions model computation graphs to delegate specific operations to the most compatible CPU, GPU, or NPU backends.
Hardware-Accelerated Inference - Executes machine learning models on edge devices using CPU, GPU, or NPU hardware acceleration for high performance.
Execution Fallbacks - Automatically redirects computation to a compatible processor if the primary hardware accelerator lacks support for specific operations.
On-Device Inference - Provides a high-level inference API and runtime environment to manage model state on edge devices.
Hardware Acceleration - Enables the execution of machine learning and generative AI models across mobile, desktop, and IoT hardware using CPUs, GPUs, and NPUs.
Automatic Accelerator Selection - Automatically selects the most efficient hardware accelerator and manages asynchronous execution for machine learning tasks.
Tensor Memory Management - Manages the allocation and tracking of memory views for tensors using buffer references to control data flow.
Hardware Performance Tuning - Tunes execution across CPUs, GPUs, and NPUs using hardware-specific optimizations to achieve peak processing speeds.
Graph Compilation Caching - Stores compiled computation graphs in a local directory to bypass runtime initialization overhead.
NPU Unified Interfaces - Provides a unified interface to execute models on NPUs while abstracting vendor-specific compiler and runtime details.
On-Device Compilation - Translates models into NPU instructions during application initialization to ensure compatibility across diverse hardware platforms.
Hardware Buffer Zero-Copy - Eliminates expensive CPU memory copy operations by passing tensor data directly to the NPU hardware buffer.
Model Artifact Caches - Caches pre-compiled model hardware instructions in local storage to eliminate repeated translation during application launches.
Zero-Copy Buffer Interoperability - Passes tensor data directly to accelerators without duplicating data to system memory to reduce latency and power.
PyTorch - Provides specialized paths for converting trained PyTorch models into optimized formats for on-device deployment.
Inference Engines - Provides a framework for efficient ML and GenAI deployment on edge devices.
Model Serving & Deployment - Deploys models on mobile and edge devices.
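
The Computation Subgraph Delegation and Execution Fallbacks entries above describe splitting a model graph so that supported operations run on an accelerator and the rest fall back to another processor. The sketch below illustrates that idea only. It is not LiteRT's API: the function name partition_ops, the backend labels "npu" and "cpu", and the op names are all hypothetical, and the model is simplified to a linear sequence of ops.

```python
from itertools import groupby


def partition_ops(ops, accelerator_supported, accelerator="npu", fallback="cpu"):
    """Group consecutive ops into subgraphs, one backend per subgraph.

    Hypothetical illustration: an op goes to `accelerator` if its name is in
    `accelerator_supported`, otherwise to `fallback`.
    """
    def backend_for(op):
        return accelerator if op in accelerator_supported else fallback

    # groupby merges adjacent ops that map to the same backend, so each
    # resulting subgraph can be handed to a single backend in one call.
    return [(backend, list(group)) for backend, group in groupby(ops, key=backend_for)]


if __name__ == "__main__":
    model_ops = ["conv2d", "relu", "custom_op", "conv2d", "softmax"]
    npu_ops = {"conv2d", "relu", "softmax"}
    for backend, subgraph in partition_ops(model_ops, npu_ops):
        print(backend, subgraph)
```

In this sketch the unsupported custom_op splits the graph into three subgraphs, and only the middle one runs on the fallback backend. A real runtime would also weigh the cost of moving tensors between backends before accepting such a split.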

Open-source alternatives to LiteRT

Similar open-source projects, ranked by how many features they share with LiteRT.

pytorch/executorch (4,296 GitHub stars)
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,
Python, deep-learning, embedded, gpu
openvinotoolkit/openvino (10,414 GitHub stars)
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
C++, ai, computer-vision, deep-learning
alibaba/MNN (14,242 GitHub stars)
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse
C++, arm, convolution, deep-learning
sgl-project/sglang (29,079 GitHub stars)
SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Python, attention, blackwell, cuda

See all 30 alternatives to LiteRT
