# fminference/flexllmgen

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/fminference-flexllmgen).**

9,362 stars · 591 forks · Python · Apache-2.0 · archived

## Links

- GitHub: https://github.com/FMInference/FlexLLMGen
- awesome-repositories: https://awesome-repositories.com/repository/fminference-flexllmgen.md

## Topics

`deep-learning` `gpt-3` `high-throughput` `large-language-models` `machine-learning` `offloading` `opt`

## Description

FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory.
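The roughly 70% figure can be checked with a little arithmetic. The sketch below is an illustration, not FlexLLMGen's actual compressor. It assumes 16-bit original weights, group-wise min-max quantization, a group size of 64, and two 16-bit metadata values (minimum and scale) per group. None of these parameters are stated above, and the function names are hypothetical.

```python
from typing import List, Tuple

GROUP_SIZE = 64  # assumed group size (must be even for nibble packing)
BITS = 4         # bits per quantized weight
LEVELS = 2 ** BITS - 1


def quantize_group(values: List[float]) -> Tuple[bytes, float, float]:
    """Min-max quantize one group to 4-bit codes, packed two per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / LEVELS or 1.0  # avoid a zero scale for constant groups
    codes = [min(LEVELS, max(0, round((v - lo) / scale))) for v in values]
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, scale


def dequantize_group(packed: bytes, lo: float, scale: float) -> List[float]:
    """Unpack the 4-bit codes and map them back to approximate floats."""
    out = []
    for byte in packed:
        out.append((byte >> 4) * scale + lo)
        out.append((byte & 0x0F) * scale + lo)
    return out


def compressed_fraction(group_size: int = GROUP_SIZE, bits: int = BITS,
                        meta_bits: int = 32, orig_bits: int = 16) -> float:
    """Compressed size as a fraction of the original size, per weight."""
    return (bits + meta_bits / group_size) / orig_bits


if __name__ == "__main__":
    saving = 1.0 - compressed_fraction()
    print(f"memory saved under these assumptions: {saving:.1%}")
```

Under these assumptions each weight costs 4.5 bits instead of 16, so the saving lands a little above 70%; the per-group metadata is what keeps it below the 75% that 4 bits alone would give.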

The project is throughput-oriented: it processes many generation requests together in large batches to maximize throughput on a single GPU. It also scales out by combining offloading with pipeline parallelism across multiple machines, which speeds up generation when the aggregate GPU memory is still insufficient to hold the model. Integration with the HELM framework allows language model benchmarks such as MMLU to run on offloaded models on a single GPU.
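A simple cost model suggests why large batches help when weights are offloaded. The sketch assumes each decoding step pays a fixed cost to bring weights onto the GPU plus a per-sequence compute cost. The function names and all timings are hypothetical and do not come from FlexLLMGen.

```python
from typing import List, Sequence


def make_batches(requests: Sequence[str], batch_size: int) -> List[List[str]]:
    """Group requests into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(requests[i:i + batch_size])
            for i in range(0, len(requests), batch_size)]


def tokens_per_second(batch_size: int, weight_load_s: float,
                      compute_s_per_seq: float) -> float:
    """Tokens generated per second for one decoding step.

    Assumed additive model: the weight transfer cost is paid once per step
    and shared by every sequence in the batch.
    """
    step_time = weight_load_s + batch_size * compute_s_per_seq
    return batch_size / step_time


if __name__ == "__main__":
    # Hypothetical timings: 2 s to stream weights, 10 ms of compute per sequence.
    for size in (1, 16, 64, 256):
        rate = tokens_per_second(size, weight_load_s=2.0, compute_s_per_seq=0.01)
        print(f"batch={size:4d}  tokens/s={rate:.2f}")
```

In this model throughput rises with batch size and approaches `1 / compute_s_per_seq`, because the fixed transfer cost is amortized over more sequences. The price is higher latency per request, which is the trade-off a throughput-oriented design accepts.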

The system provides a complete toolchain for model serving: a model weight compressor, a tensor offloading framework, and a throughput-oriented server. Together these cover batched inference requests, pipeline parallelism across distributed GPUs, and single-GPU execution of large models.
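The idea behind tensor offloading can be pictured as assigning each tensor to the fastest memory tier that still has room. The following is a minimal greedy sketch with hypothetical sizes and capacities; the names are illustrative, and the real project may choose placements quite differently.

```python
from typing import Dict, List, Tuple

TIERS = ["gpu", "cpu", "disk"]  # ordered fastest to slowest


def place_tensors(tensors: List[Tuple[str, float]],
                  capacity_gb: Dict[str, float]) -> Dict[str, str]:
    """Greedily place each tensor in the fastest tier with enough free space."""
    free = dict(capacity_gb)  # copy so the caller's dict is not modified
    placement: Dict[str, str] = {}
    for name, size_gb in tensors:
        for tier in TIERS:
            if size_gb <= free[tier]:
                free[tier] -= size_gb
                placement[name] = tier
                break
        else:
            raise MemoryError(f"{name} ({size_gb} GB) does not fit in any tier")
    return placement


if __name__ == "__main__":
    # Hypothetical sizes in GB for the three tensor kinds named above.
    tensors = [("parameters", 10.0), ("attention_cache", 20.0), ("hidden_states", 2.0)]
    capacity = {"gpu": 16.0, "cpu": 24.0, "disk": 1000.0}
    print(place_tensors(tensors, capacity))
```

A greedy pass like this ignores transfer bandwidth and access frequency, so it is only a starting point for thinking about where each tensor should live.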

## Tags

### Artificial Intelligence & ML

- [Single-GPU Inference Runtimes](https://awesome-repositories.com/f/artificial-intelligence-ml/large-language-models/single-gpu-inference-runtimes.md) — Runs large language models on a single GPU by offloading weights and cache to CPU and disk to fit models larger than available memory.
- [Throughput-Oriented Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-deployment-servers/llm-inference-servers/throughput-oriented-servers.md) — Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
- [Single-GPU Scaling](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-model-deployments/single-gpu-scaling.md) — Runs large language models with limited GPU memory by offloading weights and attention cache to CPU and disk.
- [High Throughput Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/high-throughput-inference.md) — Processes multiple generation requests together in large batches, targeting throughput-oriented workloads.
- [Single-GPU Execution Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/large-language-model-serving/single-gpu-execution-engines.md) — Executes large language models with limited GPU memory by offloading weights and attention cache to CPU and disk. ([source](https://cdn.jsdelivr.net/gh/fminference/flexllmgen@main/README.md))
- [4-Bit Compressors](https://awesome-repositories.com/f/artificial-intelligence-ml/large-language-models/weight-only-compression/4-bit-compressors.md) — Reduces LLM weight memory usage by approximately 70% through 4-bit compression with minimal accuracy loss.
- [Pipeline Parallelism Partitioners](https://awesome-repositories.com/f/artificial-intelligence-ml/pipeline-parallelism-partitioners.md) — Distributes model layers across multiple GPUs to accelerate generation when aggregated GPU memory is insufficient.

### Data & Databases

- [Inference Batching](https://awesome-repositories.com/f/data-databases/request-batching/inference-batching.md) — Processes multiple generation requests together in large batches to maximize throughput on a single GPU.

### DevOps & Infrastructure

- [4-Bit Quantization Tools](https://awesome-repositories.com/f/devops-infrastructure/intel-hardware-acceleration/low-bit-weight-quantization/4-bit-quantization-tools.md) — Reduces model weight memory by approximately 70% using 4-bit quantization with minimal accuracy loss.
- [GPU Parallelism Partitioners](https://awesome-repositories.com/f/devops-infrastructure/multi-gpu-deployment/distributed-inference-clusters/gpu-parallelism-partitioners.md) — Combines offloading with pipeline parallelism across multiple machines to accelerate generation when aggregated GPU memory is insufficient. ([source](https://cdn.jsdelivr.net/gh/fminference/flexllmgen@main/README.md))

### Operating Systems & Systems Programming

- [Memory Offloading Frameworks](https://awesome-repositories.com/f/operating-systems-systems-programming/gpu-memory-optimizations/memory-offloading-frameworks.md) — Stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory.
- [Distributed Offloading Systems](https://awesome-repositories.com/f/operating-systems-systems-programming/gpu-memory-optimizations/memory-offloading-frameworks/distributed-offloading-systems.md) — Combines offloading with pipeline parallelism across multiple machines to accelerate generation when aggregated GPU memory is insufficient.

### Software Engineering & Architecture

- [Inference Engines](https://awesome-repositories.com/f/software-engineering-architecture/headless-runtimes/inference-engines.md) — An engine for running large language models on a single GPU with weight compression and tensor offloading to CPU or disk.
