20 dépôts
Techniques for maximizing memory throughput and minimizing latency on GPU hardware, such as swizzling and double buffering.
Distinguishing note: Candidates refer to OS memory banking, process communication buffers, or AI agent context memory, not hardware-level GPU memory layout optimization.
Explore 20 awesome GitHub repositories matching operating systems & systems programming · GPU Memory Optimizations. Refine with filters or upvote what's useful.
LeetCUDA is a collection of high-performance GPU kernel libraries focusing on memory optimization, activation functions, and attention mechanisms. It serves as a reference library for CUDA kernel implementations, ranging from basic element-wise operations to complex neural network components, and provides Python bindings to integrate these kernels into deep learning workflows. The project is distinguished by its focus on low-level hardware optimizations. This includes the use of tensor cores for half-precision matrix multiplication, asynchronous data pipelining with double buffering, and shar
Implements shared memory swizzling, double buffering, and vectorized access to maximize GPU memory throughput.
OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO. The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism. The project
Reduces GPU memory footprint through gradient checkpointing and offloading optimizer states to secondary storage.
PowerInfer is an inference engine and serving framework designed to run large language models on local hardware. It combines a hybrid CPU-GPU offloader, a quantization tool, and a sparse model optimizer to enable the execution of high-parameter models on consumer-grade devices. The system distinguishes itself through neuron-activation-based offloading, using a predictor model to preload frequent neurons into VRAM while keeping rare neurons in system memory. This hybrid execution model balances workloads between the GPU and CPU based on input patterns to optimize memory access and increase tok
Optimizes memory access by preloading frequent neurons onto the GPU and computing rare ones on the CPU.
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory.
FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the execution of models that exceed available GPU memory by moving tensors and caches between GPU memory, system RAM, and disk storage. It utilizes 4-bit weight quantization to reduce the memory footprint of model parameters, allowing for increased batch processing capacity. The project covers distributed
Implements a mechanism to move model tensors between GPU memory, system RAM, and disk.
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
Offloads model tensors and dense layers to video memory to increase computation speed.
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
Tracks GPU memory usage relative to input token counts to optimize hardware resource allocation.
This project is a suite of analytical tools for quantifying web performance, specifically designed for benchmarking the rendering speed and memory usage of various JavaScript frameworks. It provides a standardized set of DOM manipulation tests and a comparison tool that uses weighted geometric means to measure efficiency across different web implementations. The benchmark harness distinguishes itself by providing deep analysis of DOM reconciliation strategies, comparing the performance and correctness of keyed versus non-keyed rendering. It also includes a memory profiler for tracking allocat
Monitors memory consumption and overhead specifically for the runtime engine during DOM update cycles.
JerryScript is a lightweight, ECMAScript-compliant JavaScript engine and bytecode compiler designed for resource-constrained devices. It serves as an embedded interpreter and IoT scripting runtime, enabling the execution of JavaScript code within native C applications on hardware with limited memory. The project differentiates itself through a focus on low-memory runtime management, utilizing bytecode precompilation and pre-compiled state snapshots to reduce startup time and memory overhead. It features a C-binding native bridge for bidirectional communication between native code and scripts,
Measures engine overhead by recording memory usage during runtime or termination.
Flash Linear Attention is a training framework and inference engine for sequence models that use linear attention and state space mechanisms, designed to process long contexts with reduced memory and compute overhead. It provides hardware-optimized token mixing layers and fused CUDA kernels that minimize memory bandwidth and launch overhead across different GPU architectures, and includes a causal inference engine that generates text token-by-token using cached hidden states for efficient autoregressive decoding. The project supports building hybrid sequence models that interleave standard at
Runs fused GPU kernels for token mixing operations that minimize memory bandwidth and launch overhead across different GPU architectures.
NCCL est une bibliothèque de communication haute performance et un framework de calcul GPU distribué conçu pour exécuter des échanges de données collectifs et point à point sur plusieurs GPU dans des systèmes à un ou plusieurs nœuds. Il sert de couche de transport GPU RDMA et d'orchestrateur de mémoire, facilitant la synchronisation à large bande passante des données et des gradients de modèle pour l'entraînement et l'inférence GPU distribués. La bibliothèque se distingue par sa capacité à exécuter des primitives de communication directement depuis les noyaux (kernels) GPU, supprimant le CPU hôte du chemin critique. Elle utilise une sélection de chemin consciente de la topologie pour optimiser le mouvement des données et emploie un transport réseau basé sur RDMA, incluant InfiniBand et NVLink, pour permettre un accès mémoire zéro-copie entre les appareils sur différents nœuds physiques. Le projet couvre un large éventail de modèles de communication collective, notamment les réductions, les diffusions (broadcasts), les rassemblements (gathers) et les échanges tous-à-tous, ainsi que l'accès mémoire distant point à point. Il fournit une gestion complète des communicateurs pour initialiser, partitionner et redimensionner les groupes GPU, ainsi qu'une gestion spécialisée de la mémoire pour enregistrer les tampons (buffers) et coordonner la mémoire partagée des appareils. Le système inclut une suite d'outils de surveillance et d'observabilité pour le suivi de la santé, la journalisation diagnostique et la surveillance des événements en temps réel, ainsi que des interfaces d'intégration pour les frameworks de machine learning, les graphes CUDA, MPI et Python.
Monitors and logs GPU memory usage, distinguishing between persistent and suspendable allocations.
This project is a comprehensive instructional resource and course for building neural networks using PyTorch. It covers the fundamental building blocks of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving transformers, recurrent networks, and gen
Monitors GPU memory usage relative to input length to determine optimal context truncation limits.
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,
ExecuTorch monitors peak and per-operator memory consumption to optimize resource usage on constrained hardware.
oneDNN is a library for deep learning acceleration that provides optimized building blocks for neural network training and inference. It manages tensor computation across CPU and GPU hardware, enabling the execution of high-performance primitives for model training and neural network inference optimization. The project distinguishes itself through hardware-specific kernel optimization and the use of just-in-time compilation to target specific processor instruction sets. It supports quantized neural network execution using both static and dynamic quantization to reduce memory usage and increas
Optimizes memory throughput by managing format propagation and reordering data between CPU and GPU engines.
imapsync is an IMAP mailbox synchronization tool and data migration utility designed to copy and synchronize email messages and folder structures between two IMAP servers. It functions as a migration manager for transferring bulk email accounts between different hosting providers, preserving folder hierarchies and message metadata. The tool is distinguished by its ability to automate the transfer of multiple mailboxes sequentially from delimited lists using administrative credentials or user-specific authentication. It supports advanced authentication methods including OAuth2 and XOAUTH2, and
Saves memory during large folder synchronizations by using unique identifiers instead of full message headers.
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
Optimizes throughput by asynchronously prefetching and offloading adapters between GPU and CPU memory.
MIRIX is an AI agent state orchestrator and long-term memory system designed to provide persistent context for large language models. It functions as a multi-modal AI memory pipeline that processes text, voice, and screen captures into structured knowledge stores, including a dedicated screen activity knowledge base. The project distinguishes itself by integrating a multi-modal observation pipeline that monitors desktop activity in real-time to build a searchable history of user actions. It utilizes a multi-tiered memory hierarchy—separating episodic, semantic, procedural, and core stores—and
Provides control over whether incoming information is processed immediately or batched for background memory updates.
llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language,
Utilizes sequential onloading and disk offloading to quantize models that exceed available system memory.
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Implements tiered cache offloading by moving memory blocks between GPU memory, host RAM, and shared storage for long-context workloads.
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
Manages the movement of weights, gradients, and optimizers between memory tiers to prevent out-of-memory errors.