High-performance libraries and implementations designed to accelerate large language model inference through speculative decoding techniques.
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware. The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fine-tuning, while offering a unified web-based interface for no-code model training, data preparation, and real-time performance monitoring. Beyond its core training capabilities, the project includes a local inference runtime that supports API-based deployment, tool-calling, and automated output verification. It manages the entire model development process, from dataset generation and hyperparameter configuration to model exporting and performance benchmarking across diverse hardware configurations. The software provides setup utilities for local development environments and includes diagnostic tools to assist with installation and hardware compatibility.
Unsloth provides a high-performance inference runtime optimized for consumer hardware that supports efficient model execution, though it focuses more on training and kernel-level optimization than on implementing speculative decoding specifically.
ncnn is a high-performance neural network inference framework designed for executing deep learning models locally on mobile and desktop hardware. It functions as a specialized engine that enables the deployment of artificial intelligence tasks directly on resource-constrained devices, eliminating the need for external network connectivity or cloud-based processing services. The framework provides a comprehensive toolset for model optimization, allowing users to convert and quantize machine learning models into specialized binary structures. By utilizing static model graph compilation and zero-copy memory management, the engine minimizes memory footprint and reduces data movement during execution. It further distinguishes itself through platform-agnostic hardware abstraction, which maps neural network operations to available local accelerators, including CPUs, GPUs, and specialized neural processing units. The library supports a wide range of complex, multi-branch neural network architectures, facilitating tasks such as image recognition and audio analysis. Performance is maintained through layer-specific kernel optimizations and graph-level operator fusion, which maximize efficiency on diverse hardware architectures. The project is distributed as a C++ library, providing a unified interface for cross-platform inference deployment.
This is a general-purpose deep learning inference framework for mobile and edge devices, but it lacks the specific speculative decoding capabilities required for accelerating large language model inference.
Qwen is a comprehensive framework for large language model development, serving, and deployment. It provides a complete ecosystem for transformer-based sequence modeling, offering base models alongside specialized tools for instruction-tuned alignment, fine-tuning, and long-context inference. The project is designed to support both research and production environments, enabling users to train, optimize, and host generative models locally or across distributed hardware. The framework distinguishes itself through its focus on high-performance serving and extensibility. It features a high-performance inference engine that exposes OpenAI-compatible HTTP endpoints, allowing for integration into existing application architectures. To support complex workflows, it includes native capabilities for agentic tool use and function calling, which can be further refined through dedicated fine-tuning processes. The platform covers a broad range of operational requirements, including model quantization, multi-device tensor parallelism, and memory-efficient key-value caching to optimize throughput and resource usage. It also provides robust utilities for benchmarking performance, managing system-level behaviors, and securing model endpoints through authentication and safety-aligned configurations. The repository includes extensive documentation and scripts for model weight conversion, vocabulary expansion, and deployment across both CPU and GPU hardware.
This repository is a comprehensive ecosystem for the Qwen model family rather than a general-purpose inference engine or framework specifically designed to implement speculative decoding for arbitrary LLMs.
Llama is a computational framework and runtime environment designed for executing transformer-based neural networks locally. It functions as a generative AI inference engine, processing input sequences through pre-trained model weights to produce text completions and structured data outputs directly on the user's own hardware. The system distinguishes itself through specialized memory and computation management techniques, including memory-mapped weight loading and quantization-aware inference, which allow for efficient execution on standard consumer hardware. It utilizes a stateless request execution model and a tensor-based computation graph to handle token-based sequence processing, ensuring that each inference task operates independently without reliance on persistent server state. This project provides the necessary tools for local large language model deployment, including a command-line interface for retrieving authorized model checkpoints and configuration files. It supports offline research and the integration of text generation capabilities into custom software applications, allowing users to manage model parameters such as sequence length and batch size to meet specific performance requirements.
This is a local inference engine designed for running transformer models with support for quantization and efficient memory management, though it lacks explicit native support for speculative decoding.
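The memory-mapped weight loading mentioned in the description above can be illustrated with a minimal sketch. It assumes a hypothetical flat file of little-endian float32 values laid out as a row-major matrix; the file format and the names `write_weights` and `read_row` are illustrative and are not taken from the project.

```python
import mmap
import os
import struct
import tempfile


def write_weights(path, values):
    """Write float32 values to a flat binary file (hypothetical format)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_row(path, row, cols):
    """Read one row of a row-major float32 matrix through a memory map.

    Only the pages backing the requested row need to be touched; the
    file is never read into memory as a whole.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = row * cols * 4  # 4 bytes per float32
            return list(struct.unpack_from(f"<{cols}f", mm, offset))


if __name__ == "__main__":
    tmp_dir = tempfile.mkdtemp()
    weights_path = os.path.join(tmp_dir, "weights.bin")
    # A 2x3 matrix stored row by row.
    write_weights(weights_path, [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    print(read_row(weights_path, 1, 3))
    os.remove(weights_path)
    os.rmdir(tmp_dir)
```

Because the operating system pages data in on demand, a process can open a weight file far larger than the portion it actually reads, which is one reason this approach suits consumer hardware.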
DeepSpeed is a high-performance library designed to scale deep learning model training and inference across massive clusters of GPUs and compute nodes. It provides a comprehensive suite of tools for distributed training, enabling the execution of models that exceed the memory capacity of single devices through advanced parameter partitioning, pipeline-based model parallelism, and memory-efficient state offloading. The framework distinguishes itself through specialized communication-efficient optimizers and hardware-aware acceleration techniques. By utilizing gradient compression, quantization, and custom-compiled kernels, it minimizes network bandwidth bottlenecks and maximizes computational throughput. It further supports complex architectures like mixture-of-experts and long-context models by integrating sequence parallelism and sparse attention mechanisms, ensuring efficient resource utilization across heterogeneous hardware topologies. Beyond its core training capabilities, the project includes a robust set of utilities for automated performance tuning, model profiling, and universal checkpointing. It provides infrastructure support for diverse processor architectures and cloud-based cluster deployment, allowing users to optimize execution environments through targeted kernel compilation and diagnostic monitoring.
DeepSpeed is a comprehensive deep learning optimization library that provides GPU acceleration, quantization, and distributed inference infrastructure on which speculative decoding could be built, but it is a broader framework rather than a dedicated speculative decoding engine.
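To make concrete what a speculative decoding layer adds on top of a general inference framework, here is a minimal sketch of the greedy variant of the technique. It is not any listed project's API: the models are stand-in Python callables that map a token list to the next token, and `speculative_decode`, `greedy_decode`, and the toy models are hypothetical names. A real engine would verify all drafted tokens in one batched forward pass of the target model; the sketch calls the target once per position for clarity.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding with a cheap draft and an exact target."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks each proposed token in order.
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:limit]


def toy_target(tokens):
    """Hypothetical target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [0], 12, k=3))
```

The output always equals what the target model alone would produce under greedy decoding; the draft model only affects how many target steps can be confirmed per round.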
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facilitating local inference alongside cost-effective cloud training strategies that utilize fault-tolerant checkpointing to manage interruptions. Beyond its core training and inference capabilities, the toolkit provides a suite for measuring model reasoning and instruction-following performance. It includes modular features for converting model parameters between formats and optimizing execution engines to maximize throughput during text generation.
This project is a broad framework for LLM training and deployment, but it neither implements nor supports speculative decoding, which is the core requirement of this category.
Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency. The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model execution. The toolkit covers a broad range of optimization capabilities, including post-training quantization, mixed-precision layer mapping, and graph operation fusion. It also includes automated performance tuning to discover optimal configuration settings for specific hardware targets.
This toolkit focuses on model compression and quantization techniques to optimize inference, but it does not implement speculative decoding as a core inference acceleration strategy.
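Weight-only quantization, as referenced in the description above, can be sketched in a few lines. The example below assumes symmetric per-row int8 quantization with a single scale per row, one common scheme among several; it does not use Neural Compressor's API, and `quantize_row` and `dequantize_row` are hypothetical helper names.

```python
def quantize_row(weights, bits=8):
    """Symmetric quantization of one weight row to signed integers.

    Returns the integer codes and the scale needed to recover
    approximate float values.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    row = [0.5, -1.0, 0.25, 0.0]
    codes, scale = quantize_row(row)
    restored = dequantize_row(codes, scale)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(codes, scale, worst)
```

Only the weights are stored in low precision here; activations stay in floating point, which is what distinguishes weight-only schemes from full integer inference. The reconstruction error per weight is bounded by half the scale.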
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse CPUs, GPUs, and NPUs. By utilizing an offline conversion pipeline, it translates external model formats into a unified, optimized binary representation tailored for local hardware. Beyond core inference, the project includes extensive utilities for data preprocessing, covering image, audio, and text transformations required for real-time model input. It also provides diagnostic and monitoring tools for performance benchmarking, model topology analysis, and debugging, alongside experimental support for on-device training and fine-tuning. The engine is distributed as a native library with support for cross-platform compilation, enabling integration into mobile and embedded applications.
MNN is a high-performance inference engine for mobile and edge devices, but it functions as a general-purpose neural network runtime rather than a specialized framework for speculative decoding in large language models.
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on automotive-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabilities, including real-time video analytics, object detection and tracking, and image segmentation. It also integrates hardware-accelerated decoding and TensorRT-based inference to optimize model execution on embedded platforms. The project provides a TensorRT inference wrapper and an embedded vision SDK to facilitate the deployment of neural network primitives.
This project is a computer vision and deep learning inference toolkit for embedded hardware, but it lacks the specific speculative decoding and LLM-focused serving architecture required for accelerating large language models.
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek, Qualcomm, and Samsung. It supports autoregressive text generation with tokenization, KV cache management, and streaming output, alongside multi-language runtime bindings for Java, Kotlin, Objective-C, and C++. Operator-level profiling and debugging tools capture execution traces and link them back to original source code for performance analysis. The platform covers model export and optimization through PyTorch export, quantization to lower-bit representations, static memory planning, and custom compiler passes. It includes capabilities for image preprocessing, multimodal and audio model inference, and decoding vision model outputs into task-specific results. Tensor management, platform abstraction, and extensibility mechanisms allow adding custom backends, kernels, and compiler passes. Documentation covers building from source, cross-compilation for embedded targets and iOS, and integration with Android and iOS frameworks through platform-specific APIs.
This is a lightweight runtime for deploying PyTorch models to edge and mobile devices, but it lacks the specific speculative decoding algorithms required to accelerate LLM inference.
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and memory. It employs a key-value cache-aware request router that directs queries to workers holding relevant cache entries to reduce recomputation. High-speed data transfer mechanisms move cache blocks and weights directly between GPU VRAMs over RDMA or NVLink to minimize latency. The platform includes comprehensive capabilities for distributed fault tolerance, allowing in-flight requests to migrate and resume from failure points via token-state continuation. It features SLA-based autoscaling and performance profiling to right-size GPU pools and a Kubernetes-native operator for topology-aware scheduling. Additional support covers multimodal inference for images, video, and audio, alongside dynamic swapping of LoRA adapters. Installation is available via wheels, container images, charts, and crates, with support for major Linux distributions and NVIDIA GPU architectures from Ampere through Blackwell.
This is a distributed inference orchestration platform for managing and scaling LLM workloads across GPU clusters, but it functions as a high-level management layer rather than an inference engine implementing speculative decoding.
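The key-value cache-aware routing described for Dynamo can be illustrated with a simplified sketch. It assumes each worker advertises the set of prompt-prefix blocks it currently caches and that ties are broken by current load; the block size, the data layout, and the names `block_keys`, `cached_prefix_blocks`, and `route` are hypothetical and do not reflect Dynamo's actual interfaces.

```python
BLOCK_SIZE = 4  # tokens per cache block (hypothetical value)


def block_keys(tokens, block_size=BLOCK_SIZE):
    """One key per full block; each key identifies the whole prefix up to it."""
    return [
        tuple(tokens[:end])
        for end in range(block_size, len(tokens) + 1, block_size)
    ]


def cached_prefix_blocks(tokens, cache):
    """Count leading blocks of the request that are already in the cache."""
    count = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        count += 1
    return count


def route(tokens, workers):
    """Pick the worker with the longest cached prefix, then the lowest load.

    workers maps a worker name to {"cache": set_of_keys, "load": int}.
    """
    return max(
        workers,
        key=lambda name: (
            cached_prefix_blocks(tokens, workers[name]["cache"]),
            -workers[name]["load"],
        ),
    )


if __name__ == "__main__":
    shared_prefix = [1, 2, 3, 4, 5, 6, 7, 8]
    workers = {
        "worker-a": {"cache": set(block_keys(shared_prefix)), "load": 3},
        "worker-b": {"cache": set(), "load": 0},
    }
    print(route(shared_prefix + [9, 10, 11, 12], workers))
    print(route([9, 9, 9, 9], workers))
```

Routing a request to the worker that already holds its prefix avoids recomputing those key-value entries during prefill; the load tie-break keeps unrelated requests from piling onto one worker.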