8 repository-uri
Systems that manage the movement of tensors between different memory tiers to run oversized models.
Distinct from GPU Memory Optimizations: Distinct from general GPU memory optimizations by focusing specifically on the offloading of tensors to CPU/Disk.
Explore 8 awesome GitHub repositories matching operating systems & systems programming · Memory Offloading Frameworks. Refine with filters or upvote what's useful.
OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO. The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism. The project
Reduces GPU memory footprint through gradient checkpointing and offloading optimizer states to secondary storage.
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory.
FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the execution of models that exceed available GPU memory by moving tensors and caches between GPU memory, system RAM, and disk storage. It utilizes 4-bit weight quantization to reduce the memory footprint of model parameters, allowing for increased batch processing capacity. The project covers distributed
Implements a mechanism to move model tensors between GPU memory, system RAM, and disk.
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
Offloads model tensors and dense layers to video memory to increase computation speed.
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
Optimizes throughput by asynchronously prefetching and offloading adapters between GPU and CPU memory.
llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language,
Utilizes sequential onloading and disk offloading to quantize models that exceed available system memory.
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Implements tiered cache offloading by moving memory blocks between GPU memory, host RAM, and shared storage for long-context workloads.
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
Manages the movement of weights, gradients, and optimizers between memory tiers to prevent out-of-memory errors.