9 open-source projects similar to hao-ai-lab/lookaheaddecoding, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best LookaheadDecoding alternative.
gpt-fast is a PyTorch transformer inference engine designed for low-latency text generation. It functions as a distributed GPU inference library, a quantized model runner, and a speculative decoding framework. The system utilizes a speculative decoding workflow where a small draft model predicts token sequences for verification by a larger model to accelerate generation. It supports quantized model execution to reduce memory footprint and implements tensor parallelism to split computations across multiple GPUs. The project includes a standardized evaluation harness to measure the accuracy an
gpt-fast is a PyTorch transformer inference engine designed for text generation using a native tensor library implementation. It provides a runtime for executing large language models without the need for external C++ extensions. The project implements speculative decoding to accelerate generation by using a small draft model for token prediction and a larger model for verification. It further optimizes performance through a compiled prefill stage and a multi-GPU tensor parallelism library that shards linear layers across multiple graphics processing units. Memory efficiency is managed throu
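The draft-then-verify workflow described in the two gpt-fast entries above can be sketched in a few lines. This is a minimal, greedy-only illustration and not code from gpt-fast: the names `speculative_step`, `generate`, `draft_model`, and `target_model` are hypothetical, and the "models" are toy functions that map a token prefix to one next token. Real engines compare probability distributions with a rejection-sampling rule and verify all drafted tokens in a single batched forward pass of the large model.

```python
from typing import Callable, List

# Hypothetical type: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The large target model checks each proposed position in order.
    #    (Done one call at a time here for clarity only.)
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    """Repeat draft-and-verify rounds until max_new tokens are produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + max_new]


# Toy stand-ins (assumptions for illustration): the target counts modulo 10,
# and the draft agrees except right after token 4.
def target_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10
```

In this greedy variant the output is identical to what the target model would produce on its own. The speedup comes from accepting several tokens per round whenever the draft model guesses correctly.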
ICLR 2025 Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
COLM 2024 TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
ICLR 2025 PEARL: Parallel Speculative Decoding with Adaptive Draft Length
Fast inference from large language models via speculative decoding
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
FlashInfer is a library of high-performance GPU kernels purpose-built for accelerating large language model inference. It provides optimized implementations for attention operations (including flash attention, page attention, multi-head latent attention, and cascade attention) using paged key-value caches, fused kernel composition, and just-in-time compilation. The library also includes specialized kernels for mixture-of-experts layers, block-scaled low-precision quantization (FP8, FP4), and distributed collective communication. What distinguishes FlashInfer is its fused all-reduce communicat