9 open-source projects similar to smart-lty/parallelspeculativedecoding, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best ParallelSpeculativeDecoding alternative.
gpt-fast is a PyTorch transformer inference engine designed for low-latency text generation. It functions as a distributed GPU inference library, a quantized model runner, and a speculative decoding framework. The system utilizes a speculative decoding workflow where a small draft model predicts token sequences for verification by a larger model to accelerate generation. It supports quantized model execution to reduce memory footprint and implements tensor parallelism to split computations across multiple GPUs. The project includes a standardized evaluation harness to measure the accuracy an
gpt-fast is a PyTorch transformer inference engine designed for text generation using a native tensor library implementation. It provides a runtime for executing large language models without the need for external C++ extensions. The project implements speculative decoding to accelerate generation by using a small draft model for token prediction and a larger model for verification. It further optimizes performance through a compiled prefill stage and a multi-GPU tensor parallelism library that shards linear layers across multiple graphics processing units. Memory efficiency is managed throu
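The draft-and-verify workflow described in the gpt-fast entries above can be illustrated with a minimal greedy sketch. This is not gpt-fast's code: the names `speculative_decode`, `greedy_decode`, `toy_target` and `toy_draft` are hypothetical, the "models" are plain Python functions that map a token prefix to the next token, and verification is done position by position, where a real engine would likely score all drafted positions in one batched forward pass of the larger model.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1. The small draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The large model checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)        # draft token confirmed
            else:
                accepted.append(expected)   # first mismatch: keep the target's token
                break
        else:
            # Every draft token matched, so the target contributes one extra token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical stand-in for the draft model: wrong at every third position."""
    t = toy_target(prefix)
    return t if len(prefix) % 3 else (t + 1) % 50


if __name__ == "__main__":
    out = speculative_decode(toy_target, toy_draft, [1, 2, 3], max_new=10)
    print(out == greedy_decode(toy_target, [1, 2, 3], max_new=10))
```

Because every kept token is the one the target model would have chosen for that prefix, the output matches plain greedy decoding with the target alone. Any speed-up comes from how many draft tokens are accepted per verification step.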
ICML 2024 Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
COLM 2024 TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
ICLR 2025 Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
Fast inference from large language models via speculative decoding
FlashInfer is a library of high-performance GPU kernels purpose-built for accelerating large language model inference. It provides optimized implementations for attention operations (including flash attention, page attention, multi-head latent attention, and cascade attention) using paged key-value caches, fused kernel composition, and just-in-time compilation. The library also includes specialized kernels for mixture-of-experts layers, block-scaled low-precision quantization (FP8, FP4), and distributed collective communication. What distinguishes FlashInfer is its fused all-reduce communicat