ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. Its ahead-of-time compilation pipeline exports, quantizes, and lowers model graphs into compact serialized programs. A minimal runtime then executes those programs with hardware acceleration, including on-device large language model inference.
What sets the project apart is its hardware accelerator delegate system, which partitions model subgraphs and offloads their computation to specialized backends, including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek, Qualcomm, and Samsung. It supports autoregressive text generation with tokenization, KV cache management, and streaming output, and it offers runtime bindings for Java, Kotlin, Objective-C, and C++. Operator-level profiling and debugging tools capture execution traces and link them back to the original source code for performance analysis.
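To make the idea of delegate partitioning concrete, here is a minimal conceptual sketch in plain Python. It does not use the ExecuTorch API: the `Segment` class, the `partition` function, the op names, and the `npu_supported` set are all hypothetical, and the model is simplified to a linear sequence of ops, whereas real model graphs are directed acyclic graphs. The sketch groups consecutive ops a backend claims to support into segments that would be handed to the delegate, and leaves the rest on the CPU.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Segment:
    """A run of consecutive ops that all execute in the same place."""
    target: str       # "delegate" or "cpu"
    ops: List[str]


def partition(ops: List[str], supported: Set[str]) -> List[Segment]:
    """Group a linear op sequence into maximal delegate/CPU segments."""
    segments: List[Segment] = []
    for op in ops:
        target = "delegate" if op in supported else "cpu"
        if segments and segments[-1].target == target:
            segments[-1].ops.append(op)             # extend the current run
        else:
            segments.append(Segment(target, [op]))  # start a new run
    return segments


# Hypothetical model and backend capability set, for illustration only.
model_ops = ["conv2d", "relu", "conv2d", "custom_nms", "linear", "softmax"]
npu_supported = {"conv2d", "relu", "linear"}

for seg in partition(model_ops, npu_supported):
    print(f"{seg.target:8s} {seg.ops}")
```

With these inputs the sketch produces four segments: the first three ops go to the delegate, `custom_nms` falls back to the CPU, `linear` is delegated again, and `softmax` stays on the CPU. Fewer, larger delegated segments generally mean fewer handoffs between the CPU and the accelerator.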
The platform covers model export and optimization through PyTorch export, quantization to lower-bit representations, static memory planning, and custom compiler passes. It includes capabilities for image preprocessing, multimodal and audio model inference, and decoding vision model outputs into task-specific results. It also provides tensor management and platform abstraction layers, along with extensibility mechanisms for adding custom backends, kernels, and compiler passes.
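As an illustration of what quantization to a lower-bit representation means, the following sketch applies symmetric per-tensor int8 quantization to a short list of floats in plain Python. This is a generic textbook scheme chosen for brevity, not ExecuTorch's quantizer. The function names and sample weights are hypothetical, and ExecuTorch's actual quantization flow and supported schemes are not shown here.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored as a small integer plus one shared floating-point scale, so storage drops from 32 bits to 8 bits per element. The rounding error per element is bounded by half the scale.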
Documentation covers building from source, cross-compilation for embedded targets and iOS, and integration with Android and iOS frameworks through platform-specific APIs.