20 repositories

Awesome GitHub Repositories: GPU Memory Optimizations

Techniques for maximizing memory throughput and minimizing latency on GPU hardware, such as swizzling and double buffering.

Distinguishing note: This tag covers hardware-level GPU memory layout optimization. Similar tags refer instead to OS memory banking, process communication buffers, or AI agent context memory.

Explore 20 awesome GitHub repositories matching operating systems & systems programming · GPU Memory Optimizations. Refine with filters or upvote what's useful.


xlite-dev/LeetCUDA (9,694 stars)
LeetCUDA is a collection of high-performance GPU kernel libraries focusing on memory optimization, activation functions, and attention mechanisms. It serves as a reference library for CUDA kernel implementations, ranging from basic element-wise operations to complex neural network components, and provides Python bindings to integrate these kernels into deep learning workflows. The project is distinguished by its focus on low-level hardware optimizations. This includes the use of tensor cores for half-precision matrix multiplication, asynchronous data pipelining with double buffering, and shar
Implements shared memory swizzling, double buffering, and vectorized access to maximize GPU memory throughput.
Language: Cuda. Topics: cuda, cuda-12, cuda-cpp.
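
To make the shared memory swizzling idea concrete, here is a minimal Python sketch (not taken from LeetCUDA). It models shared memory as 32 banks of 4-byte words, a common layout on NVIDIA GPUs, and counts how many distinct banks a warp touches when its 32 threads read one column of a 32x32 tile. The function names and the `col ^ row` XOR scheme are illustrative assumptions; real kernels often swizzle at the granularity of vectorized chunks instead.

```python
NUM_BANKS = 32  # assumption: 32 shared-memory banks, one 4-byte word each
TILE = 32       # assumption: a 32x32 tile of 4-byte elements

def bank_plain(row, col):
    """Bank of element (row, col) in a plain row-major tile."""
    return (row * TILE + col) % NUM_BANKS

def bank_swizzled(row, col):
    """Bank of element (row, col) when the column index is XOR-ed with the row."""
    return (row * TILE + (col ^ row)) % NUM_BANKS

def banks_hit_by_column(bank_fn, col):
    """Distinct banks touched when 32 threads read one column, one row each."""
    return {bank_fn(row, col) for row in range(TILE)}

plain = banks_hit_by_column(bank_plain, 5)
swizzled = banks_hit_by_column(bank_swizzled, 5)
print(len(plain), len(swizzled))
```

In the plain layout every element of a column maps to the same bank, so the 32 reads serialize. With the XOR swizzle the same column spreads across all 32 banks, and because XOR with a fixed row is a bijection on column indices, row-wise access stays conflict-free too.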
OpenRLHF/OpenRLHF (9,675 stars)
OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO. The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism. The project
Reduces GPU memory footprint through gradient checkpointing and offloading optimizer states to secondary storage.
Language: Python. Topics: large-language-models, openai-o1, proximal-policy-optimization.
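
The memory effect of gradient checkpointing can be approximated with a short calculation. The sketch below is a simplified model, not OpenRLHF code: it assumes every layer produces one equally sized activation, and that checkpointing keeps only the input of each segment and recomputes one segment at a time during the backward pass. `peak_activations` is a hypothetical helper.

```python
import math

def peak_activations(num_layers, segment=None):
    """Activation units held at the backward-pass peak (simplified model)."""
    if segment is None:
        return num_layers  # no checkpointing: every layer's activation is kept
    checkpoints = math.ceil(num_layers / segment)  # stored segment inputs
    return checkpoints + segment  # plus one segment recomputed on demand

layers = 64
segment = math.isqrt(layers)  # a segment length near sqrt(n) balances both terms
print(peak_activations(layers), peak_activations(layers, segment))
```

Under these assumptions the stored activations drop from 64 units to 16, at the cost of roughly one additional forward pass to recompute the segments.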
SJTU-IPADS/PowerInfer (9,568 stars)
PowerInfer is an inference engine and serving framework designed to run large language models on local hardware. It combines a hybrid CPU-GPU offloader, a quantization tool, and a sparse model optimizer to enable the execution of high-parameter models on consumer-grade devices. The system distinguishes itself through neuron-activation-based offloading, using a predictor model to preload frequent neurons into VRAM while keeping rare neurons in system memory. This hybrid execution model balances workloads between the GPU and CPU based on input patterns to optimize memory access and increase tok
Optimizes memory access by preloading frequent neurons onto the GPU and computing rare ones on the CPU.
Language: C++.
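
The hot/cold neuron split described above can be sketched as a ranking problem. This is an illustration under stated assumptions rather than PowerInfer's implementation: it uses static activation counts from a hypothetical profiling run, whereas the description mentions a predictor model, and `split_neurons` and `gpu_slots` are invented names.

```python
def split_neurons(activation_counts, gpu_slots):
    """Return (hot, cold) neuron indices; hot holds the most frequently activated."""
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    hot = sorted(ranked[:gpu_slots])   # preloaded into GPU memory
    cold = sorted(ranked[gpu_slots:])  # left in system memory for the CPU
    return hot, cold

counts = [120, 3, 95, 1, 40, 2]  # hypothetical per-neuron activation counts
hot, cold = split_neurons(counts, gpu_slots=3)
gpu_share = sum(counts[i] for i in hot) / sum(counts)
print(hot, cold, round(gpu_share, 3))
```

When activations are skewed like this, a small GPU budget covers most of the work, which is why the split pays off on consumer hardware.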
FMInference/FlexLLMGen (9,362 stars)
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory.
Language: Python. Topics: deep-learning, gpt-3, high-throughput.
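
The "approximately 70%" figure is consistent with simple arithmetic. The sketch below assumes an FP16 baseline and group-wise 4-bit quantization in which each group of 64 weights shares two FP16 metadata values (for example a minimum and a scale). The group size and metadata layout are assumptions chosen for illustration, not confirmed FlexLLMGen settings.

```python
def quantized_bits_per_weight(bits=4, group_size=64, metadata_bits=2 * 16):
    """Average bits per weight, including per-group metadata."""
    return bits + metadata_bits / group_size

def memory_reduction(baseline_bits=16, **kwargs):
    """Fraction of weight memory saved relative to the baseline precision."""
    return 1 - quantized_bits_per_weight(**kwargs) / baseline_bits

reduction = memory_reduction()
print(f"{reduction:.1%}")
```

Without the metadata the saving would be exactly 75%; the per-group overhead is what pulls it down toward 70%.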
FMInference/FlexGen (9,366 stars)
FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the execution of models that exceed available GPU memory by moving tensors and caches between GPU memory, system RAM, and disk storage. It utilizes 4-bit weight quantization to reduce the memory footprint of model parameters, allowing for increased batch processing capacity. The project covers distributed
Implements a mechanism to move model tensors between GPU memory, system RAM, and disk.
Language: Python.
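
As a rough picture of spreading tensors across GPU memory, system RAM, and disk, the sketch below uses a greedy first-fit rule: each tensor goes to the fastest tier that still has room. This is a deliberately simple stand-in, not FlexGen's actual placement policy, and the tensor names and capacities are hypothetical.

```python
def place_tensors(tensors, capacities):
    """Assign each (name, size) to the first tier with enough free space.

    `capacities` maps tier name to capacity and is ordered fastest first.
    """
    free = dict(capacities)
    placement = {}
    for name, size in tensors:
        for tier in free:
            if size <= free[tier]:
                free[tier] -= size
                placement[name] = tier
                break
        else:
            raise MemoryError(f"{name} does not fit in any tier")
    return placement

tiers = {"gpu": 16, "cpu": 64, "disk": 1024}  # hypothetical capacities in GiB
tensors = [("kv_cache", 10), ("layer0", 8), ("layer1", 8), ("layer2", 60)]
plan = place_tensors(tensors, tiers)
print(plan)
```

A real system would also weigh transfer bandwidth and access frequency, since a tensor placed on disk must be streamed back before each use.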
Tiiny-AI/PowerInfer (8,714 stars)
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
Offloads model tensors and dense layers to video memory to increase computation speed.
Language: C++. Topics: large-language-models, llama, llm.
TingsongYu/PyTorch_Tutorial (8,018 stars)
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
Tracks GPU memory usage relative to input token counts to optimize hardware resource allocation.
Language: Python.
krausest/js-framework-benchmark (7,434 stars)
This project is a suite of analytical tools for quantifying web performance, specifically designed for benchmarking the rendering speed and memory usage of various JavaScript frameworks. It provides a standardized set of DOM manipulation tests and a comparison tool that uses weighted geometric means to measure efficiency across different web implementations. The benchmark harness distinguishes itself by providing deep analysis of DOM reconciliation strategies, comparing the performance and correctness of keyed versus non-keyed rendering. It also includes a memory profiler for tracking allocat
Monitors memory consumption and overhead specifically for the runtime engine during DOM update cycles.
Language: JavaScript.
jerryscript-project/jerryscript (7,399 stars)
JerryScript is a lightweight, ECMAScript-compliant JavaScript engine and bytecode compiler designed for resource-constrained devices. It serves as an embedded interpreter and IoT scripting runtime, enabling the execution of JavaScript code within native C applications on hardware with limited memory. The project differentiates itself through a focus on low-memory runtime management, utilizing bytecode precompilation and pre-compiled state snapshots to reduce startup time and memory overhead. It features a C-binding native bridge for bidirectional communication between native code and scripts,
Measures engine overhead by recording memory usage during runtime or termination.
Language: C.
fla-org/flash-linear-attention (5,248 stars)
Flash Linear Attention is a training framework and inference engine for sequence models that use linear attention and state space mechanisms, designed to process long contexts with reduced memory and compute overhead. It provides hardware-optimized token mixing layers and fused CUDA kernels that minimize memory bandwidth and launch overhead across different GPU architectures, and includes a causal inference engine that generates text token-by-token using cached hidden states for efficient autoregressive decoding. The project supports building hybrid sequence models that interleave standard at
Runs fused GPU kernels for token mixing operations that minimize memory bandwidth and launch overhead across different GPU architectures.
Language: Python. Topics: large-language-models, machine-learning-systems, natural-language-processing.
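
Why fusing kernels reduces memory bandwidth can be seen from a back-of-the-envelope count. The sketch below is not code from flash-linear-attention. It assumes a chain of unary elementwise operations over FP16 data, where each unfused kernel reads its input from global memory and writes its output back, while a fused kernel keeps intermediates in registers. `elementwise_traffic_bytes` is a hypothetical helper.

```python
def elementwise_traffic_bytes(num_elements, num_ops, bytes_per_element=2, fused=False):
    """Global-memory bytes moved by a chain of unary elementwise operations."""
    passes = 1 if fused else num_ops  # each pass is also one kernel launch
    return passes * 2 * num_elements * bytes_per_element  # one read + one write

n = 1 << 20  # about one million elements
unfused = elementwise_traffic_bytes(n, num_ops=3)
fused = elementwise_traffic_bytes(n, num_ops=3, fused=True)
print(unfused, fused, unfused // fused)
```

Under this model, fusing three operations cuts both global-memory traffic and the number of kernel launches by a factor of three.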
NVIDIA/nccl (4,816 stars)
NCCL is a high-performance communication library and distributed GPU computing framework designed to run collective and point-to-point data exchanges across multiple GPUs in single-node or multi-node systems. It serves as a GPU RDMA transport layer and memory orchestrator, enabling high-bandwidth synchronization of data and model gradients for distributed GPU training and inference. The library is distinguished by its ability to run communication primitives directly from GPU kernels, removing the host CPU from the critical path. It uses topology-aware path selection to optimize data movement and employs RDMA-based network transport, including InfiniBand and NVLink, to allow zero-copy memory access between devices on different physical nodes. The project covers a broad range of collective communication patterns, including reductions, broadcasts, gathers, and all-to-all exchanges, as well as point-to-point remote memory access. It provides full communicator management for initializing, partitioning, and resizing GPU groups, along with specialized memory management for registering buffers and coordinating shared device memory. The system includes a suite of monitoring and observability tools for health tracking, diagnostic logging, and real-time event monitoring, as well as integration interfaces for machine learning frameworks, CUDA graphs, MPI, and Python.
Monitors and logs GPU memory usage, distinguishing between persistent and suspendable allocations.
Language: C++.
TingsongYu/PyTorch-Tutorial-2nd (4,555 stars)
This project is a comprehensive instructional resource and course for building neural networks using PyTorch. It covers the fundamental building blocks of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving transformers, recurrent networks, and gen
Monitors GPU memory usage relative to input length to determine optimal context truncation limits.
Language: Jupyter Notebook. Topics: computer-vision, deepsort, diffusion-models.
pytorch/executorch (4,296 stars)
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,
ExecuTorch monitors peak and per-operator memory consumption to optimize resource usage on constrained hardware.
Language: Python. Topics: deep-learning, embedded, gpu.
uxlfoundation/oneDNN (4,009 stars)
oneDNN is a library for deep learning acceleration that provides optimized building blocks for neural network training and inference. It manages tensor computation across CPU and GPU hardware, enabling the execution of high-performance primitives for model training and neural network inference optimization. The project distinguishes itself through hardware-specific kernel optimization and the use of just-in-time compilation to target specific processor instruction sets. It supports quantized neural network execution using both static and dynamic quantization to reduce memory usage and increas
Optimizes memory throughput by managing format propagation and reordering data between CPU and GPU engines.
Language: C++. Topics: aarch64, amx, avx512.
imapsync/imapsync (3,945 stars)
imapsync is an IMAP mailbox synchronization tool and data migration utility designed to copy and synchronize email messages and folder structures between two IMAP servers. It functions as a migration manager for transferring bulk email accounts between different hosting providers, preserving folder hierarchies and message metadata. The tool is distinguished by its ability to automate the transfer of multiple mailboxes sequentially from delimited lists using administrative credentials or user-specific authentication. It supports advanced authentication methods including OAuth2 and XOAUTH2, and
Saves memory during large folder synchronizations by using unique identifiers instead of full message headers.
Language: Shell. Topics: emails, imap, imaps.
predibase/lorax (3,724 stars)
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
Optimizes throughput by asynchronously prefetching and offloading adapters between GPU and CPU memory.
Language: Python. Topics: fine-tuning, gpt, llama.
Mirix-AI/MIRIX (3,535 stars)
MIRIX is an AI agent state orchestrator and long-term memory system designed to provide persistent context for large language models. It functions as a multi-modal AI memory pipeline that processes text, voice, and screen captures into structured knowledge stores, including a dedicated screen activity knowledge base. The project distinguishes itself by integrating a multi-modal observation pipeline that monitors desktop activity in real-time to build a searchable history of user actions. It utilizes a multi-tiered memory hierarchy—separating episodic, semantic, procedural, and core stores—and
Provides control over whether incoming information is processed immediately or batched for background memory updates.
Language: Python. Topics: llm-agents, llm-memory, memory-agents.
vllm-project/llm-compressor (2,764 stars)
llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language,
Utilizes sequential onloading and disk offloading to quantize models that exceed available system memory.
Language: Python. Topics: compression, quantization, sparsity.
llm-d/llm-d (2,514 stars)
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Implements tiered cache offloading by moving memory blocks between GPU memory, host RAM, and shared storage for long-context workloads.
Language: Shell.
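
Tiered cache offloading of this kind can be pictured as a chain of least-recently-used (LRU) caches. The sketch below is an illustration only, not llm-d's implementation: the `TieredCache` class, its method names, and the block-count limits are all assumptions. Blocks evicted from the GPU tier fall to host RAM, blocks evicted from host RAM fall to shared storage, and any access promotes a block back to the GPU tier.

```python
from collections import OrderedDict

class TieredCache:
    """Hypothetical three-tier LRU cache: GPU -> host RAM -> shared storage."""

    NAMES = ["gpu", "host", "storage"]

    def __init__(self, gpu_blocks, host_blocks):
        self.limits = [gpu_blocks, host_blocks, None]  # None: unbounded storage
        self.tiers = [OrderedDict(), OrderedDict(), OrderedDict()]

    def _insert(self, level, key, block):
        tier = self.tiers[level]
        tier[key] = block  # newest entry sits at the end
        limit = self.limits[level]
        if limit is not None and len(tier) > limit:
            old_key, old_block = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_block)    # demote one tier down

    def put(self, key, block):
        for tier in self.tiers:
            tier.pop(key, None)  # drop any stale copy first
        self._insert(0, key, block)

    def get(self, key):
        for tier in self.tiers:
            if key in tier:
                block = tier.pop(key)
                self._insert(0, key, block)  # promote to the GPU tier
                return block
        raise KeyError(key)

    def where(self, key):
        for name, tier in zip(self.NAMES, self.tiers):
            if key in tier:
                return name
        return None

cache = TieredCache(gpu_blocks=2, host_blocks=2)
for i in range(5):
    cache.put(i, f"block-{i}")
print([cache.where(i) for i in range(5)])
```

After five insertions the two newest blocks sit on the GPU, the next two in host RAM, and the oldest in storage. Calling `cache.get(0)` would pull block 0 back to the GPU tier and push older blocks down.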
RLinf/RLinf (2,502 stars)
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
Manages the movement of weights, gradients, and optimizers between memory tiers to prevent out-of-memory errors.
Language: Python. Topics: agentic-ai, embodied-ai, reinforcement-learning.
