Frameworks and runtimes for optimizing and executing machine learning models on mobile and edge hardware.
LiteRT is a runtime and API for executing machine learning and generative AI models on mobile, desktop, and IoT hardware. It consists of an inference engine and a specialized environment for running quantized large language and diffusion models locally on edge hardware. The system includes an ahead-of-time model compiler that translates models into hardware-specific bytecode to reduce startup latency and memory overhead. It provides a unified interface for Neural Processing Units (NPUs), with automatic fallback routing to CPUs or GPUs when the NPU does not support a given subgraph. An edge model converter transforms trained models into optimized formats for deployment on resource-constrained devices. The project covers model optimization through format conversion and post-training quantization to reduce model size. It manages hardware acceleration through automatic accelerator selection and zero-copy buffer handling that avoids copying tensor data through CPU memory. The framework also supports custom operator definitions through a low-level kernel interface to extend the set of supported mathematical operations.
LiteRT is an inference engine designed for deploying and running optimized, quantized machine learning models on mobile, desktop, and IoT devices with hardware-accelerated execution.
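The fallback routing described for LiteRT can be pictured as splitting a model's ordered operator list into contiguous runs, each assigned to the accelerator or to a fallback device. The following is a minimal sketch of that idea only, not LiteRT's API: the `partition_ops` helper, the operator names, and the `npu_supported` set are all hypothetical.

```python
from itertools import groupby


def partition_ops(ops, npu_supported, fallback="cpu"):
    """Split an ordered op list into contiguous (device, ops) segments.

    Hypothetical helper: ops found in `npu_supported` are assigned to
    "npu", everything else to `fallback`. Order is preserved so the
    segments can be executed one after another.
    """
    def device_for(op):
        return "npu" if op in npu_supported else fallback

    # groupby only merges *consecutive* items with the same key, which is
    # exactly what a contiguous-subgraph partition needs.
    return [(device, list(run)) for device, run in groupby(ops, key=device_for)]


ops = ["conv2d", "relu", "custom_nms", "conv2d", "softmax"]
segments = partition_ops(ops, npu_supported={"conv2d", "relu", "softmax"})
for device, run in segments:
    print(device, run)
```

Each device switch in a real runtime usually costs a buffer handoff, so a partitioner along these lines would typically also try to avoid very short accelerator segments.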
Burn is a deep learning framework designed for building, training, and deploying neural networks using a modular architecture. As a machine learning library built in Rust, it provides a backend-agnostic computational engine that enables the execution of models across diverse hardware, including central processors, graphics processors, and web runtimes. The framework distinguishes itself through a highly portable design that allows developers to maintain a single workflow for both training and inference across heterogeneous environments. It incorporates advanced optimization techniques such as just-in-time kernel fusion, asynchronous execution, and static graph compilation to maximize computational efficiency and hardware throughput. The library also functions as a comprehensive model quantization toolkit, offering tools to convert weights and activations into lower-bit representations. These capabilities facilitate the deployment of neural networks on resource-constrained edge devices by reducing memory footprints and accelerating inference tasks without requiring manual code changes for different hardware targets.
Burn is a comprehensive deep learning framework that provides native support for model quantization, cross-platform hardware acceleration, and efficient inference, making it a direct fit for deploying models on edge and mobile hardware.
BitNet is a quantized inference engine designed to execute highly compressed language models by performing arithmetic on low-precision, bit-level weight data. It functions as a model optimization toolkit and a high-performance kernel library, enabling the execution of large language models on consumer hardware by reducing memory footprints and increasing processing speeds. The project distinguishes itself through hardware-specific kernel optimizations that leverage native processor instructions to accelerate matrix multiplication. By utilizing packed integer arithmetic and memory-aligned weight permutation, the engine improves cache locality and computational density. These capabilities are specifically tuned to accelerate autoregressive decoding, minimizing latency during the sequential token generation process to support real-time text generation requirements. The toolkit also includes utilities for benchmarking inference kernels and measuring generation latency against baseline implementations. These tools let users check that the inference pipeline sustains its throughput when processing compressed models on supported graphics hardware.
BitNet is a specialized inference engine that provides model quantization and hardware-accelerated kernels to run compressed language models efficiently, fitting the requirements for edge-focused model optimization and deployment.
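To make the idea of packed, bit-level weights concrete, the sketch below stores four weights per byte at two bits each and computes a dot product directly from the packed form. It assumes ternary weights (-1, 0, +1); the function names and the packing layout are illustrative assumptions and do not reproduce BitNet's actual kernels or memory layout.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) four per byte, two bits each."""
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be -1, 0, or +1")
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        # Store w + 1 (a code in {0, 1, 2}) in the i-th 2-bit slot.
        packed[i // 4] |= (w + 1) << (2 * (i % 4))
    return bytes(packed)


def ternary_dot(packed, activations):
    """Dot product of packed ternary weights with integer activations.

    `activations` must have exactly as many entries as weights were
    packed; unused padding slots in the last byte are never read.
    """
    total = 0
    for i, a in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        total += (code - 1) * a  # code - 1 recovers -1, 0, or +1
    return total


weights = [1, -1, 0, 1, -1]
activations = [3, 5, 7, 2, 4]
packed = pack_ternary(weights)
result = ternary_dot(packed, activations)
print(len(packed), result)
```

Because every weight is -1, 0, or +1, the inner loop needs only additions and subtractions, which is the property that optimized kernels exploit with vector instructions.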
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal grammars to force model outputs to adhere to specific JSON schemas or patterns, and it implements speculative decoding to increase inference speed. Broad capabilities include hardware acceleration for GPUs, tools for converting models between different data formats, and utilities for measuring model quality via perplexity and divergence metrics. The engine can be wrapped in an HTTP server that provides an OpenAI-compatible API for integration with external tools.
llama.cpp is a high-performance inference engine designed to run large language models on resource-constrained hardware through quantization, hardware acceleration, and efficient memory management.
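Block-wise weight quantization, as mentioned for llama.cpp, stores one scale per small block of weights so that an outlier only degrades precision within its own block. The sketch below is a simplified symmetric int8 variant with hypothetical function names; the quantization types actually used in GGUF files have their own bit widths and storage layouts, which are not reproduced here.

```python
def quantize_blocks(weights, block_size=32):
    """Quantize floats to int8 codes in fixed-size blocks, one scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        chunk = weights[start:start + block_size]
        max_abs = max(abs(w) for w in chunk)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Clamp guards against floating-point results just outside the range.
        codes = [max(-127, min(127, round(w / scale))) for w in chunk]
        blocks.append((scale, codes))
    return blocks


def dequantize_blocks(blocks):
    """Rebuild approximate float weights from (scale, codes) blocks."""
    return [scale * q for scale, codes in blocks for q in codes]


weights = [0.01 * i for i in range(-32, 32)]  # 64 values, so two blocks
blocks = quantize_blocks(weights)
restored = dequantize_blocks(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(blocks), max_error)
```

The reconstruction error of each weight is bounded by half of its block's scale, so blocks containing only small weights are stored more precisely than a single global scale would allow.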
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse CPUs, GPUs, and NPUs. By utilizing an offline conversion pipeline, it translates external model formats into a unified, optimized binary representation tailored for local hardware. Beyond core inference, the project includes extensive utilities for data preprocessing, covering image, audio, and text transformations required for real-time model input. It also provides diagnostic and monitoring tools for performance benchmarking, model topology analysis, and debugging, alongside experimental support for on-device training and fine-tuning. The engine is distributed as a native library with support for cross-platform compilation, enabling integration into mobile and embedded applications.
MNN is a high-performance inference engine specifically engineered for deploying and running optimized machine learning models on mobile and edge hardware, providing full support for quantization, hardware acceleration, and model conversion.
The nexa-sdk is an on-device AI SDK and multimodal inference engine designed to run large language, vision, and audio models locally on mobile and desktop hardware. It functions as a local LLM runtime, enabling the execution of generative and discriminative models without reliance on cloud services. The project distinguishes itself through a dedicated NPU acceleration framework that optimizes model execution on Neural Processing Units to reduce latency and power consumption. It employs hardware-agnostic backend routing to dynamically distribute computations across CPUs, GPUs, and NPUs, and supports GGUF-based model loading for efficient memory mapping and layer offloading. Its capabilities cover a broad spectrum of AI tasks, including conversational text generation, text-to-image synthesis, and automatic speech recognition. It also provides tools for vector embedding generation and document reranking for local semantic search, as well as a REST-based inference server with an OpenAI-compatible interface for external integration. The SDK supports cross-platform deployment across Android and Linux environments and includes a Python library for developer integration.
This SDK provides a comprehensive runtime for executing large language and multimodal models on edge hardware, featuring NPU acceleration, cross-platform support, and efficient model loading for low-latency local inference.
This project is a cross-platform machine learning inference engine designed to execute pre-trained models across diverse operating systems and hardware environments. It functions as a standardized execution framework that manages the entire lifecycle of model inference, from loading and graph optimization to hardware-accelerated execution and generative sequence management. The runtime distinguishes itself through a highly modular architecture that decouples model logic from hardware-specific kernels. By utilizing an execution provider abstraction, it enables developers to offload computations to specialized hardware such as GPUs, NPUs, and dedicated chipsets. It also provides a comprehensive toolkit for model optimization, including quantization, precision conversion, and graph-level transformations, which allow for significant reductions in binary size and latency for both edge and cloud deployments. Beyond core inference, the project includes extensive support for generative AI, offering built-in capabilities for tokenization, chat template formatting, and streaming output generation. It supports complex model architectures through custom operator registration and modular adapter management, ensuring that developers can integrate specialized mathematical operations or fine-tuned model weights into their pipelines. The software is built primarily in C++ and provides language-specific bindings to facilitate integration into various programming environments. It includes robust diagnostic and profiling tools that allow for granular performance analysis, hardware utilization tracking, and debugging of tensor data during the inference process.
This inference engine provides tools for model quantization, hardware-accelerated execution, and cross-platform deployment, making it well suited to running machine learning models on edge and mobile hardware.
llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language, and Audio-Language models. The toolkit covers a broad range of optimization capabilities, including calibration-based and data-free quantization, checkpoint format conversion, and the reduction of precision for attention mechanisms and key-value caches. It manages these processes through structured compression recipes and orchestration pipelines to standardize model preparation and optimization.
This toolkit provides essential model quantization and conversion capabilities for optimizing large models, though it focuses on compression workflows rather than providing a full runtime engine for edge hardware inference.
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek, Qualcomm, and Samsung. It supports autoregressive text generation with tokenization, KV cache management, and streaming output, alongside multi-language runtime bindings for Java, Kotlin, Objective-C, and C++. Operator-level profiling and debugging tools capture execution traces and link them back to original source code for performance analysis. The platform covers model export and optimization through PyTorch export, quantization to lower-bit representations, static memory planning, and custom compiler passes. It includes capabilities for image preprocessing, multimodal and audio model inference, and decoding vision model outputs into task-specific results. Tensor management, platform abstraction, and extensibility mechanisms allow adding custom backends, kernels, and compiler passes. Documentation covers building from source, cross-compilation for embedded targets and iOS, and integration with Android and iOS frameworks through platform-specific APIs.
ExecuTorch is a comprehensive framework designed specifically for deploying PyTorch models to edge and mobile devices, offering native support for quantization, hardware acceleration, and cross-platform inference.
Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency. The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model execution. The toolkit covers a broad range of optimization capabilities, including post-training quantization, mixed-precision layer mapping, and graph operation fusion. It also includes automated performance tuning to discover optimal configuration settings for specific hardware targets.
This toolkit provides comprehensive model quantization and hardware-aware optimization features specifically designed to reduce memory footprint and latency for neural networks, making it a highly relevant tool for deploying models on resource-constrained hardware.
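Post-training quantization, as listed among Neural Compressor's capabilities, generally derives quantization parameters from a small calibration set rather than from retraining. The sketch below shows one common scheme, affine (asymmetric) 8-bit quantization; the function names are hypothetical and this is not Neural Compressor's API.

```python
def calibrate(samples, n_bits=8):
    """Derive an affine (scale, zero_point) pair from calibration values."""
    # The range is widened to include 0 so that 0.0 is exactly representable.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    qmax = 2 ** n_bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, n_bits=8):
    """Map a float to an unsigned integer code, clamped to the valid range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** n_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


samples = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = calibrate(samples)
codes = [quantize(x, scale, zero_point) for x in samples]
restored = [dequantize(q, scale, zero_point) for q in codes]
print(scale, zero_point, codes)
```

Values outside the calibrated range are clamped, which is why the choice of calibration data affects accuracy after quantization.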
This project is a framework for running Stable Diffusion image generation models on Apple Silicon using Core ML hardware acceleration. It provides a local generative AI pipeline for producing images from text prompts using Swift and Python without relying on external cloud APIs. The system includes a model converter to transform deep learning checkpoints into Core ML formats and a model optimizer to quantize weights and activations. It features a ControlNet integration layer to guide image generation using external signals such as edge and depth maps. Capabilities cover text-to-image generation with multilingual text encoding and image safety verification. Performance is managed through weight compression, palettization, and model splitting to fit within hardware memory constraints, while compute planning and quantization are used to reduce prediction latency. The implementation provides native interfaces for both Python and Swift to integrate generative pipelines into macOS and iOS applications.
This framework provides tools for converting, quantizing, and running generative models on Apple Silicon, making it a specialized solution for on-device inference on Apple mobile and desktop hardware.
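Palettization compresses weights by replacing each value with a short index into a small lookup table of representative values. The sketch below illustrates the idea with a plain one-dimensional k-means; the `palettize` function and its parameters are hypothetical, and the actual Core ML tooling is not shown.

```python
def palettize(weights, n_colors=4, iterations=10):
    """Cluster weights into `n_colors` centroids with 1-D k-means.

    Returns (palette, indices). Each weight is then stored as a small
    index into the palette. Requires n_colors >= 2.
    """
    lo, hi = min(weights), max(weights)
    # Evenly spaced initial centroids keep this sketch deterministic.
    palette = [lo + (hi - lo) * k / (n_colors - 1) for k in range(n_colors)]
    indices = [0] * len(weights)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every weight.
        indices = [min(range(n_colors), key=lambda k: abs(w - palette[k]))
                   for w in weights]
        # Update step: move each centroid to the mean of its members.
        for k in range(n_colors):
            members = [w for w, idx in zip(weights, indices) if idx == k]
            if members:  # an empty cluster keeps its previous centroid
                palette[k] = sum(members) / len(members)
    return palette, indices


weights = [-0.92, -0.88, -0.31, -0.29, 0.29, 0.31, 0.88, 0.92]
palette, indices = palettize(weights, n_colors=4)
restored = [palette[i] for i in indices]
print(palette, indices)
```

With four palette entries, each weight needs only a 2-bit index plus a share of the small table, instead of a full 16-bit or 32-bit float.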
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facilitating local inference alongside cost-effective cloud training strategies that utilize fault-tolerant checkpointing to manage interruptions. Beyond its core training and inference capabilities, the toolkit provides a suite for measuring model reasoning and instruction-following performance. It includes modular features for converting model parameters between formats and optimizing execution engines to maximize throughput during text generation.
This framework provides tools for model quantization, conversion, and inference optimization specifically designed to run large language models on consumer-grade hardware, aligning well with the requirements for edge and mobile deployment.
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model distribution across multiple GPUs, static prompt state caching to avoid re-encoding repeated inputs, and CPU instruction set dispatch that selects the optimal code path for the hardware. An asynchronous inference queue allows overlapping computation with other work, while the OpenAI-compatible REST API enables drop-in integration with existing applications. CTranslate2 provides model conversion tools for frameworks including Fairseq, Hugging Face Transformers, Marian, OpenNMT-py, OpenNMT-tf, and OPUS-MT, transforming trained models into an optimized binary format. It supports a range of quantization types such as INT8, FP16, and BF16, with automatic compute type selection based on the available hardware. The engine handles text translation, text generation with configurable decoding strategies like beam search and sampling, sequence scoring, text encoding, and speech transcription, all with streaming input and output capabilities.
CTranslate2 is a specialized inference engine that provides model conversion, quantization, and hardware-optimized execution for Transformer models, making it a highly effective tool for deploying models on resource-constrained environments.
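Batch-by-length grouping reduces wasted computation by placing sequences of similar length in the same batch, so fewer padding tokens are needed. The sketch below illustrates the principle with hypothetical helper names; it is not CTranslate2's implementation.

```python
def batch_by_length(sequences, max_batch_size):
    """Group sequences of similar length to limit padding.

    Returns batches of (original_index, sequence) pairs so results can
    be restored to input order afterwards.
    """
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    return [[(i, sequences[i]) for i in order[start:start + max_batch_size]]
            for start in range(0, len(order), max_batch_size)]


def padding_tokens(batches):
    """Count the pad slots needed to make every batch rectangular."""
    total = 0
    for batch in batches:
        longest = max(len(seq) for _, seq in batch)
        total += sum(longest - len(seq) for _, seq in batch)
    return total


# Token lists of lengths 5, 1, 4, and 2 (contents are placeholders).
sequences = [[0] * 5, [0] * 1, [0] * 4, [0] * 2]
grouped = batch_by_length(sequences, max_batch_size=2)
# Baseline: batch in arrival order without sorting.
naive = [[(i, sequences[i]) for i in range(start, min(start + 2, len(sequences)))]
         for start in range(0, len(sequences), 2)]
print(padding_tokens(naive), padding_tokens(grouped))
```

Keeping the original index alongside each sequence lets the caller return outputs in the order the inputs arrived, even though they were processed in length order.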
FunASR is an automatic speech recognition toolkit and multilingual speech-to-text engine designed to convert spoken audio into written text across more than fifty languages. It provides a framework for speaker diarization, an OpenAI-compatible transcription API for local server hosting, and speech models compatible with the ONNX format. The project distinguishes itself by supporting high-performance inference on edge hardware via self-contained binaries and portable model exports. It incorporates specialized capabilities for natural speech generation with adjustable timbre and emotional expression, as well as the ability to capture live microphone audio for direct voice-to-text input automation. The toolkit covers a broad range of audio analysis and processing capabilities, including voice activity detection, audio event and emotion detection, and punctuation restoration. It also includes tools for automated video captioning through the generation of timed subtitle files and distributed model fine-tuning to improve recognition accuracy using custom datasets.
This toolkit provides a specialized framework for speech recognition that includes model export, ONNX compatibility, and optimized inference binaries specifically designed for edge hardware deployment.
TVM is a machine learning compiler framework designed to convert deep learning models from various frameworks into optimized machine code. It functions as a cross-platform deployment engine that transforms high-level model definitions into efficient, hardware-specific binaries for diverse computing architectures. The system utilizes a multi-level compilation pipeline that decouples algorithm logic from hardware implementation through tensor-operator abstractions. It employs a graph-level intermediate representation to perform cross-operator optimizations and memory planning before lowering computations to target-specific instructions. To maximize performance, the framework includes an automated schedule space search that explores potential loop transformations and hardware mappings, alongside a lightweight virtual machine runtime for consistent model execution. This toolkit supports the deployment of computational workloads across a wide range of devices, including CPUs, GPUs, and specialized accelerators. It provides capabilities for cross-compiling models for various operating systems and processor architectures, facilitating the development of high-performance machine learning applications for resource-constrained edge devices.
TVM is a comprehensive machine learning compiler framework that provides model conversion, quantization, and hardware-specific optimization to enable high-performance inference on diverse edge and mobile architectures.
GGML is a machine learning tensor library and neural network engine written in C. It functions as a compute-focused runtime designed to execute transformer-based models and perform complex mathematical operations on multi-dimensional arrays directly on local consumer hardware. The library distinguishes itself by enabling local inference for large language models and edge machine learning deployment without reliance on external cloud infrastructure. It achieves this through a tensor-based computation graph that organizes operations for efficient execution and memory management, alongside static memory allocation to minimize runtime overhead. The engine supports high-performance tensor computing by utilizing hardware-agnostic kernel dispatch and processor-specific instruction sets for parallel arithmetic. It further optimizes resource usage through quantized weight representations, which reduce the memory footprint of models to facilitate execution on local devices.
GGML is a high-performance tensor library and inference engine engineered to run quantized machine learning models on local, resource-constrained devices with hardware-accelerated execution.
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and a graph-based inference pipeline that orchestrates sequences of models and custom logic nodes. The platform covers a broad range of capabilities, including comprehensive model preparation via framework conversion and precision quantization, high-performance model serving through REST and gRPC endpoints, and deep observability through performance profiling and hardware affinity visualization. It also provides extensive deployment options ranging from bare metal server binaries to Kubernetes orchestration.
OpenVINO is a comprehensive inference engine and optimization toolkit that supports model conversion, quantization, and hardware-accelerated execution, making it a robust solution for deploying models on diverse hardware.
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as weight quantization and parameter-efficient fine-tuning via low-rank adaptation, which significantly reduce memory requirements and computational overhead. These features enable the deployment of large models on consumer-grade hardware while maintaining high throughput and performance. Beyond core inference, the toolkit includes a suite of utilities for programmatic integration, allowing developers to embed model capabilities into custom software workflows via standard interfaces. It also provides multiple interactive interfaces, including web-based graphical environments for text and vision tasks and a command-line interface for rapid prototyping and evaluation. The software is distributed as a Python-based package, requiring standard environment configuration to manage dependencies and hardware resource allocation.
This project provides a specialized inference engine and optimization toolkit for running large transformer models on local hardware, including support for quantization and hardware-agnostic execution.
YOLOv5 is a comprehensive computer vision framework designed for end-to-end deep learning, specializing in real-time object detection, image classification, and instance segmentation. It provides a unified toolkit that manages the entire lifecycle of a model, from initial dataset configuration and hyperparameter tuning to high-speed inference and deployment. The framework utilizes a modular neural architecture, allowing users to swap backbone and head components to tailor models for specific visual tasks. What distinguishes this project is its focus on production-ready deployment and model efficiency. It includes a robust model export engine that converts trained networks into standardized formats, enabling high-performance execution across diverse hardware, including edge devices and web browsers. To optimize models for resource-constrained environments, the framework offers advanced techniques such as neural network pruning, weight sparsity, and mixed-precision training, alongside tools for benchmarking performance and fine-tuning pruned models. The platform supports a highly configurable training pipeline that leverages parallel processing and dynamic data augmentation to improve model robustness. Users can manage complex training workflows through externalized configuration files, which decouple model logic from dataset structures. The system also provides sophisticated inference capabilities, including test-time augmentation and model ensembling, to balance detection accuracy with processing latency requirements.
This framework provides a comprehensive suite for training and deploying computer vision models, including built-in support for model conversion, quantization, and export to formats optimized for edge and mobile hardware.
llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU architectures. The project distinguishes itself by offering a lightweight HTTP server that adheres to standard API specifications, enabling chat completion, embeddings, and reranking services. It includes a suite of tools for model quantization and conversion, which reduce memory usage and improve performance, alongside a command-line interface for managing chat templates and inference parameters. The ecosystem further supports structured data generation through grammar-based output constraints and provides diagnostic utilities for visualizing computational graphs. Comprehensive documentation is available, including a reference matrix that details the compatibility of computational operations across supported hardware backends.
This is a specialized inference engine that provides model quantization, conversion, and hardware-accelerated execution for running large language models on consumer-grade edge hardware.