20 repositories
Techniques for maximizing memory throughput and minimizing latency on GPU hardware, such as swizzling and double buffering.
Distinguishing note: Candidates refer to OS memory banking, process communication buffers, or AI agent context memory, not hardware-level GPU memory layout optimization.
Explore 20 awesome GitHub repositories matching operating systems & systems programming · GPU Memory Optimizations.
LeetCUDA is a collection of high-performance GPU kernel libraries focusing on memory optimization, activation functions, and attention mechanisms. It serves as a reference library for CUDA kernel implementations, ranging from basic element-wise operations to complex neural network components, and provides Python bindings to integrate these kernels into deep learning workflows. The project is distinguished by its focus on low-level hardware optimizations. This includes the use of tensor cores for half-precision matrix multiplication, asynchronous data pipelining with double buffering, and shar
Implements shared memory swizzling, double buffering, and vectorized access to maximize GPU memory throughput.
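To make the swizzling idea concrete, here is a small host-side simulation in Python rather than an actual CUDA kernel. It assumes 32 shared memory banks of 4-byte words and a 32x32 row-major tile; the XOR-based mapping and the function names `bank` and `max_conflict` are illustrative choices, not code taken from LeetCUDA.

```python
NUM_BANKS = 32  # assumption: 32 banks, each serving one 4-byte word per cycle


def bank(row, col, row_width=32, swizzle=False):
    """Bank hit by element (row, col) of a row-major tile of 4-byte words."""
    # XOR swizzle: permute the column within each row by the row index.
    phys_col = col ^ (row % NUM_BANKS) if swizzle else col
    return (row * row_width + phys_col) % NUM_BANKS


def max_conflict(swizzle):
    """Largest number of threads in one warp that hit the same bank
    when thread t reads element (t, 0), i.e. a column-wise access."""
    counts = {}
    for row in range(32):
        b = bank(row, 0, swizzle=swizzle)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())


print(max_conflict(swizzle=False))  # all 32 threads land in bank 0
print(max_conflict(swizzle=True))   # each thread lands in its own bank
```

In this model the unswizzled column read serializes into a 32-way bank conflict, while the XOR permutation spreads the same read across all 32 banks without padding the tile.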
OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO. The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism. The project
Reduces GPU memory footprint through gradient checkpointing and offloading optimizer states to secondary storage.
PowerInfer is an inference engine and serving framework designed to run large language models on local hardware. It combines a hybrid CPU-GPU offloader, a quantization tool, and a sparse model optimizer to enable the execution of high-parameter models on consumer-grade devices. The system distinguishes itself through neuron-activation-based offloading, using a predictor model to preload frequent neurons into VRAM while keeping rare neurons in system memory. This hybrid execution model balances workloads between the GPU and CPU based on input patterns to optimize memory access and increase tok
Optimizes memory access by preloading frequent neurons onto the GPU and computing rare ones on the CPU.
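The hot/cold split described above can be sketched as a simple greedy placement. This is a deliberately simplified illustration, not PowerInfer's actual placement policy: the function `split_neurons`, its parameters, and the assumption that every neuron costs the same number of bytes are all hypothetical.

```python
def split_neurons(activation_freq, bytes_per_neuron, vram_budget_bytes):
    """Greedy sketch: keep the most frequently activated neurons in VRAM.

    activation_freq[i] is the (profiled or predicted) activation frequency
    of neuron i. Returns (hot, cold) sets of neuron indices.
    """
    # Rank neurons from most to least frequently activated.
    order = sorted(range(len(activation_freq)),
                   key=lambda i: activation_freq[i], reverse=True)
    # Number of neurons that fit in the GPU budget (uniform size assumed).
    capacity = vram_budget_bytes // bytes_per_neuron
    hot = set(order[:capacity])    # preloaded on the GPU
    cold = set(order[capacity:])   # left in system memory, computed on the CPU
    return hot, cold


hot, cold = split_neurons([0.9, 0.1, 0.5, 0.05],
                          bytes_per_neuron=100, vram_budget_bytes=250)
print(sorted(hot), sorted(cold))
```

With a budget that fits two neurons, the two most active ones (indices 0 and 2) go to the GPU and the rest stay on the CPU side.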
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory.
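The "approximately 70%" figure is consistent with a quick back-of-the-envelope calculation. The sketch below assumes an FP16 baseline and group-wise quantization with 64 weights per group and 32 bits of per-group metadata (for example a 16-bit scale and a 16-bit zero point); those particular values are assumptions for illustration, not settings confirmed by the description above.

```python
def compressed_fraction(bits=4, group_size=64,
                        metadata_bits_per_group=32, baseline_bits=16):
    """Fraction of the baseline weight memory left after quantization."""
    # Each weight costs its quantized bits plus its share of group metadata.
    effective_bits = bits + metadata_bits_per_group / group_size
    return effective_bits / baseline_bits


saving = 1 - compressed_fraction()
print(f"{saving:.1%}")  # 4.5 effective bits vs 16 bits per weight
```

Under these assumptions each weight costs 4.5 bits instead of 16, a saving of roughly 72%, which is in line with the quoted figure.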
FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the execution of models that exceed available GPU memory by moving tensors and caches between GPU memory, system RAM, and disk storage. It utilizes 4-bit weight quantization to reduce the memory footprint of model parameters, allowing for increased batch processing capacity. The project covers distributed
Implements a mechanism to move model tensors between GPU memory, system RAM, and disk.
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
Offloads model tensors and dense layers to video memory to increase computation speed.
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
Tracks GPU memory usage relative to input token counts to optimize hardware resource allocation.
This project is a suite of analytical tools for quantifying web performance, specifically designed for benchmarking the rendering speed and memory usage of various JavaScript frameworks. It provides a standardized set of DOM manipulation tests and a comparison tool that uses weighted geometric means to measure efficiency across different web implementations. The benchmark harness distinguishes itself by providing deep analysis of DOM reconciliation strategies, comparing the performance and correctness of keyed versus non-keyed rendering. It also includes a memory profiler for tracking allocat
Monitors memory consumption and overhead specifically for the runtime engine during DOM update cycles.
JerryScript is a lightweight, ECMAScript-compliant JavaScript engine and bytecode compiler designed for resource-constrained devices. It serves as an embedded interpreter and IoT scripting runtime, enabling the execution of JavaScript code within native C applications on hardware with limited memory. The project differentiates itself through a focus on low-memory runtime management, utilizing bytecode precompilation and pre-compiled state snapshots to reduce startup time and memory overhead. It features a C-binding native bridge for bidirectional communication between native code and scripts,
Measures engine overhead by recording memory usage during runtime or termination.
Flash Linear Attention is a training framework and inference engine for sequence models that use linear attention and state space mechanisms, designed to process long contexts with reduced memory and compute overhead. It provides hardware-optimized token mixing layers and fused CUDA kernels that minimize memory bandwidth and launch overhead across different GPU architectures, and includes a causal inference engine that generates text token-by-token using cached hidden states for efficient autoregressive decoding. The project supports building hybrid sequence models that interleave standard at
Runs fused GPU kernels for token mixing operations that minimize memory bandwidth and launch overhead across different GPU architectures.
NCCL is a high-performance communication library and distributed GPU computing framework designed to run collective and point-to-point data exchanges across multiple GPUs in single-node or multi-node systems. It serves as an RDMA transport layer for GPUs and as a memory orchestrator, enabling high-bandwidth synchronization of data and model gradients for distributed GPU training and inference. The library is distinguished by its ability to run communication primitives directly from GPU kernels, removing the host CPU from the critical path. It uses topology-aware path selection to optimize data movement and employs RDMA-based network transport, including InfiniBand and NVLink, to enable zero-copy memory access between devices across different physical nodes. The project covers a wide range of collective communication patterns, including reductions, broadcasts, gathers, and all-to-all exchanges, along with point-to-point remote memory access. It provides comprehensive communicator management for initializing, partitioning, and resizing GPU groups, as well as specialized memory management for registering buffers and coordinating shared device memory. The system includes a set of monitoring and observability tools for health tracking, diagnostic logging, and real-time event monitoring, as well as integration interfaces for machine learning frameworks, CUDA graphs, MPI, and Python.
Monitors and logs GPU memory usage, distinguishing between persistent and suspendable allocations.
This project is a comprehensive educational resource and course for building neural networks using PyTorch. It covers the fundamental building blocks of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving transformers, recurrent networks, and generative models. It also includes a reference for generative AI, focusing specifically on image synthesis with diffusion models and adversarial networks. The material extends to model optimization and deployment pipelines. It covers techniques for reducing model size and increasing inference speed through quantization and the export of models to formats such as ONNX and TensorRT. Other capability areas include data engineering for parallel loading, model evaluation with custom metrics, and the deployment of open-source large language models (LLMs). The project is delivered primarily as a series of Jupyter Notebooks.
Monitors GPU memory usage relative to input length to determine optimal context truncation limits.
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,
ExecuTorch monitors peak and per-operator memory consumption to optimize resource usage on constrained hardware.
oneDNN is a deep learning acceleration library that provides optimized building blocks for neural network training and inference. It manages tensor computation across CPU and GPU hardware, enabling the execution of high-performance primitives for model training and neural network inference optimization. The project is distinguished by hardware-specific kernel optimization and the use of just-in-time compilation to target specific processor instruction sets. It supports the execution of quantized neural networks using static and dynamic quantization to reduce memory usage and increase performance. The library covers a wide range of capabilities, including deep learning primitives such as convolutions, matrix multiplication, and recurrent neural network execution. It implements advanced performance optimizations, including operation fusion, computation graph optimization, and memory format management. Integration is provided through a stable C ABI and a C++ wrapper, with support for SYCL, OpenCL, and external linear algebra libraries. The system includes observability tools for hardware performance profiling, primitive benchmarking, and detailed execution logging.
Optimizes memory throughput by managing format propagation and reordering data between CPU and GPU engines.
imapsync is an IMAP mailbox synchronization tool and data migration utility designed to copy and synchronize email messages and folder structures between two IMAP servers. It functions as a migration manager for transferring bulk email accounts between different hosting providers, preserving folder hierarchies and message metadata. The tool is distinguished by its ability to automate the transfer of multiple mailboxes sequentially from delimited lists using administrative credentials or user-specific authentication. It supports advanced authentication methods including OAuth2 and XOAUTH2, and
Saves memory during large folder synchronizations by using unique identifiers instead of full message headers.
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
Optimizes throughput by asynchronously prefetching and offloading adapters between GPU and CPU memory.
MIRIX is an AI agent state orchestrator and long-term memory system designed to provide persistent context for large language models. It functions as a multi-modal AI memory pipeline that processes text, voice, and screen captures into structured knowledge stores, including a dedicated screen activity knowledge base. The project distinguishes itself by integrating a multi-modal observation pipeline that monitors desktop activity in real-time to build a searchable history of user actions. It utilizes a multi-tiered memory hierarchy—separating episodic, semantic, procedural, and core stores—and
Provides control over whether incoming information is processed immediately or batched for background memory updates.
llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language,
Utilizes sequential onloading and disk offloading to quantize models that exceed available system memory.
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Implements tiered cache offloading by moving memory blocks between GPU memory, host RAM, and shared storage for long-context workloads.
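A tiered cache of this kind can be sketched as a chain of LRU stores, where a block evicted from a faster tier is demoted to the next one and a hit in a slower tier promotes the block back to the top. The class below is a hypothetical illustration in plain Python: the name `TieredCache`, the tier labels, and the LRU policy are assumptions, not llm-d's implementation or API.

```python
from collections import OrderedDict


class TieredCache:
    """LRU cache chain, fastest tier first. A capacity of None is unbounded."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks) pairs.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        if cap is not None and len(store) > cap:
            # Demote the least recently used block to the next tier, if any.
            old_key, old_value = store.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        for _, _, store in self.tiers:
            store.pop(key, None)  # drop any stale copy
        self._insert(0, key, value)

    def get(self, key):
        for _, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)  # promote to the fastest tier
                return value
        return None

    def location(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None


cache = TieredCache([("gpu", 2), ("host", 2), ("storage", None)])
for block in "abcde":
    cache.put(block, f"kv-{block}")
print({b: cache.location(b) for b in "abcde"})
```

After five inserts the two newest blocks sit in the "gpu" tier, the next two in "host", and the oldest has been pushed down to "storage". A real system would move blocks asynchronously and account for transfer cost; this sketch only shows the demotion and promotion logic.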
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
Manages the movement of weights, gradients, and optimizers between memory tiers to prevent out-of-memory errors.