gpt-fast is a PyTorch transformer inference engine designed for low-latency text generation. It functions as a distributed GPU inference library, a quantized model runner, and a speculative decoding framework. The system utilizes a speculative decoding workflow where a small draft model predicts token sequences for verification by a larger model to accelerate generation. It supports quantized model execution to reduce memory footprint and implements tensor parallelism to split computations across multiple GPUs. The project includes a standardized evaluation harness to measure the accuracy an
gpt-fast is a PyTorch transformer inference engine designed for text generation using a native tensor library implementation. It provides a runtime for executing large language models without the need for external C++ extensions. The project implements speculative decoding to accelerate generation by using a small draft model for token prediction and a larger model for verification. It further optimizes performance through a compiled prefill stage and a multi-GPU tensor parallelism library that shards linear layers across multiple graphics processing units. Memory efficiency is managed throu
ICML 2024 Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
ICLR2025 Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding