1 repository
Systems that combine offloading with pipeline parallelism across multiple machines to accelerate generation when aggregated GPU memory is insufficient.
Distinct from Memory Offloading Frameworks: adds distributed pipeline parallelism across machines, not just single-machine CPU/disk offloading.
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and it places model parameters, the attention cache, and hidden states across GPU, CPU, and disk so that models larger than the available GPU memory still fit. It stands out for its throughput-oriented batching: multiple generation requests are processed together in large batches to maximize throughput on a single GPU. It also supports distributed
Combines offloading with pipeline parallelism across multiple machines to accelerate generation when aggregated GPU memory is insufficient.
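As a rough check on the "approximately 70%" weight-memory figure quoted above, the arithmetic can be sketched as follows. This is a minimal illustration, not FlexLLMGen's code: the 16-bit baseline, the group size of 64, and the 32 bits of per-group metadata (for example, a 16-bit minimum and a 16-bit scale) are assumptions chosen for the example, and the function names are hypothetical.

```python
def quantized_bits_per_weight(bits=4, group_size=64, metadata_bits=32):
    """Average storage per weight under group-wise quantization.

    Each weight costs `bits` of payload, plus the per-group metadata
    (assumed here: 32 bits per group) amortized over the group.
    """
    return bits + metadata_bits / group_size


def memory_reduction(baseline_bits=16, **quant_kwargs):
    """Fraction of weight memory saved relative to an assumed 16-bit baseline."""
    return 1.0 - quantized_bits_per_weight(**quant_kwargs) / baseline_bits


if __name__ == "__main__":
    # Assumed settings: 4-bit payload, groups of 64 weights, 16-bit baseline.
    print(f"bits per weight: {quantized_bits_per_weight():.2f}")
    print(f"weight memory saved: {memory_reduction():.1%}")
```

Under these assumptions the saving comes out slightly above 70%, which is consistent with the figure in the description. Smaller groups raise the metadata overhead and lower the saving.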