Techniques for reducing VRAM usage during model training or inference.
Distinguishing note: Focuses on VRAM reduction for generative models.
Explore 20 GitHub repositories matching Artificial Intelligence & ML · Memory Optimization.
LLaMA-Factory is a comprehensive suite for dataset preparation, model fine-tuning, memory optimization, and standardized API deployment. It provides a unified platform for the supervised and reward-based fine-tuning of large language models and vision-language models. The framework includes a specialized toolkit for training vision-language models and a model serving interface that deploys trained models through high-performance APIs. It utilizes precision tuning and quantization techniques to reduce the hardware requirements and memory footprint of large models. The system covers data pipel
Optimizes VRAM usage during training and inference through precision tuning and quantization.
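As a rough illustration of why lower precision and quantization shrink the memory footprint, the arithmetic below estimates the space taken by model weights alone. It is a back-of-the-envelope sketch, not LLaMA-Factory code: the function name and the 7-billion-parameter model are hypothetical, and activations, optimizer state, and quantization metadata are ignored.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical 7-billion-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(7_000_000_000, bits):.2f} GiB")
```

Halving the bits per weight halves the weight footprint, which is why 4-bit quantization can bring a model within reach of a card that cannot hold it at 16-bit precision.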
ControlNet is a framework for structural image generation that extends pre-trained diffusion models with neural network architectures designed for precise spatial control. By injecting structural guidance directly into the latent-space denoising process, the system enables users to enforce geometric or semantic constraints on generated outputs while maintaining style consistency. The framework distinguishes itself through a weight-locked copying mechanism that preserves the integrity of the original model while introducing new control signals. It supports multi-condition synthesis, allowing f
Reduces video memory consumption to enable larger batch sizes on limited hardware.
WeClone is an end-to-end framework designed for the creation, training, and deployment of personalized conversational AI digital twins. By fine-tuning large language models on individual chat history, the platform enables the replication of unique communication styles, speech patterns, and conversational habits. The system manages the entire lifecycle of these digital avatars, from initial data preparation to final integration into messaging platforms for real-time interaction. The platform distinguishes itself through a comprehensive suite of data processing utilities that prepare raw messag
Implements memory optimization techniques like quantization and batch size adjustment to fit large models into limited hardware memory.
Handy is a local speech-to-text automation tool designed to convert spoken audio into text and inject it directly into active desktop applications. By running machine learning models entirely on the host hardware, it provides a private, offline-first environment for dictation and command execution. The system functions as a background service that manages microphone input, transcription state, and text output, enabling hands-free typing across various software environments. The project distinguishes itself through a modular pipeline that integrates local language models for post-transcription
Frees system memory by unloading transcription models after periods of inactivity.
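The idle-unload behavior can be sketched as a small wrapper that loads a model on first use and drops it once a timeout passes without activity. This is an illustrative sketch under stated assumptions, not Handy's implementation: the `IdleUnloader` class, its method names, and the injected clock are all hypothetical.

```python
import time


class IdleUnloader:
    """Loads a model lazily and drops it after `timeout_s` seconds without use."""

    def __init__(self, load_fn, timeout_s, clock=time.monotonic):
        self._load_fn = load_fn      # hypothetical callable that returns a model
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._model = None
        self._last_used = None

    def get(self):
        """Return the model, loading it first if it is not resident."""
        if self._model is None:
            self._model = self._load_fn()
        self._last_used = self._clock()
        return self._model

    def tick(self):
        """Call periodically; returns True if the model was unloaded."""
        idle = self._model is not None and (
            self._clock() - self._last_used >= self._timeout_s
        )
        if idle:
            self._model = None  # drop the reference so memory can be reclaimed
        return idle

    @property
    def loaded(self):
        return self._model is not None
```

A background timer would call `tick()` at a fixed interval, and the transcription path would call `get()` whenever audio arrives, paying the reload cost only after a long pause.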
Stable Diffusion WebUI Forge is a web-based interface and inference engine designed for the generation of AI media. It functions as a platform for executing diffusion-based models, providing a centralized environment to manage image preprocessors, custom generation logic, and hardware-accelerated sampling. The project distinguishes itself through a neural network patching framework that allows for the modification of model layers and the application of spatial conditioning during inference. By injecting custom logic and adapters directly into the network, users can influence output behaviors
Minimizes video memory consumption to allow high-resolution models to run on hardware with limited capacity.
kohya_ss is a graphical user interface and workbench for fine-tuning diffusion models, specifically designed for Stable Diffusion. It provides a suite of tools for training generative AI models, including specialized interfaces for creating Low-Rank Adaptation weights and training ControlNet spatial control networks. The project distinguishes itself through integrated VRAM usage optimization and hardware acceleration, featuring specific support for Intel GPUs via XPU-accelerated libraries. It implements parameter-efficient training methods and memory-saving techniques like gradient checkpoint
Minimizes VRAM consumption using techniques like gradient checkpointing and caching to prevent out-of-memory errors.
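Gradient checkpointing trades compute for memory: only some layer inputs are kept during the forward pass, and the rest are recomputed during the backward pass. The toy below shows the idea on a chain of scalar `tanh` layers in plain Python. It is a sketch of the general technique, not kohya_ss code, and every name in it is hypothetical.

```python
import math


def forward_layer(w, x):
    return math.tanh(w * x)


def backward_layer(w, x, grad_out):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def grad_full(weights, x):
    """Baseline: keep every layer input alive for the backward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    for w, inp in zip(reversed(weights), reversed(inputs)):
        grad = backward_layer(w, inp, grad)
    return grad, len(inputs)


def grad_checkpointed(weights, x, segment):
    """Keep one input per segment; recompute the others during backward."""
    checkpoints = []
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    peak = len(checkpoints)
    for seg in reversed(range(len(checkpoints))):
        seg_weights = weights[seg * segment:(seg + 1) * segment]
        # Recompute this segment's layer inputs from its checkpoint.
        inputs, h = [], checkpoints[seg]
        for w in seg_weights:
            inputs.append(h)
            h = forward_layer(w, h)
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(seg_weights), reversed(inputs)):
            grad = backward_layer(w, inp, grad)
    return grad, peak
```

Both functions return the gradient of the final output with respect to the input plus a count of stored values at the peak. The checkpointed version stores fewer values at the cost of one extra forward pass per segment.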
Axolotl is a distributed training orchestrator and fine-tuning framework for large language models, multimodal systems, and quantized models. It provides a structured environment for specializing pre-trained models through full parameter updates or low-rank adaptation, as well as aligning model outputs with human expectations via preference tuning pipelines and reward modeling. The system distinguishes itself through a configuration-driven pipeline that manages preprocessing and training workflows via a single file for reproducibility. It implements high-throughput optimizations such as multi
Reduces VRAM requirements during training through quantization and reduced-precision fine-tuning.
Airllm is a framework designed to execute and fine-tune large language models on consumer-grade hardware. By employing layer-wise model decomposition and memory-efficient loading techniques, the engine enables the operation of massive models that would otherwise exceed available system or video memory. The project distinguishes itself through a suite of optimization strategies that balance memory footprint with performance. It utilizes block-wise weight quantization and asynchronous layer prefetching to reduce resource consumption and hide data transfer latency. Additionally, the framework su
Reduces VRAM usage for large models using attention optimizations and parameter-efficient techniques to enable execution on consumer hardware.
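Layer-wise execution with prefetching can be sketched as a loop that holds one layer in memory while a background thread loads the next. The code below is an assumption-laden illustration, not AirLLM's API: `run_layerwise` and `load_layer` are hypothetical names, and a real engine would read weights from disk and move them to the GPU instead of building Python callables.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layerwise(load_layer, num_layers, x):
    """Apply layers one at a time, keeping at most two resident at once."""
    if num_layers == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            layer = pending.result()  # wait for the current layer to arrive
            if i + 1 < num_layers:
                # Start loading the next layer while this one computes.
                pending = pool.submit(load_layer, i + 1)
            x = layer(x)
            del layer  # drop the reference so the layer can be freed
    return x
```

Peak memory is bounded by roughly two layers instead of the whole model, and the overlap between loading and computing is what hides part of the transfer latency.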
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
Reduces memory usage during first token generation to support longer context windows.
HybridCLR is a hybrid C# execution engine and assembly loader designed for Unity. It provides a system for hot-updating C# logic across all platforms at runtime without requiring the application to be rebuilt or reinstalled. The project is distinguished by its mixed-mode execution, which runs unmodified code at native speed while using a high-performance interpreter for updated functions. It includes a generic type resolver that allows hot-updated code to use generic classes and functions regardless of whether they were pre-instantiated in the main binary. To protect proprietary source code,
Completely unloads existing assemblies from memory to allow for clean replacement of updated code.
sd-scripts is a suite of utilities designed for fine-tuning generative models, preprocessing datasets, and converting model weights. It provides a collection of scripts for executing Stable Diffusion training through methods such as DreamBooth, textual inversion, and full fine-tuning, alongside a framework for creating and managing Low-Rank Adaptation weights. The project features specialized capabilities for model weight conversion between different architectures and precision formats. It includes tools for merging adaptation weights into base models, extracting weights from trained models,
Reduces VRAM requirements during training and inference through block swapping, mixed precision, and latent caching.
gpt-fast is a PyTorch transformer inference engine designed for text generation using a native tensor library implementation. It provides a runtime for executing large language models without the need for external C++ extensions. The project implements speculative decoding to accelerate generation by using a small draft model for token prediction and a larger model for verification. It further optimizes performance through a compiled prefill stage and a multi-GPU tensor parallelism library that shards linear layers across multiple graphics processing units. Memory efficiency is managed throu
Offers a high-performance implementation that optimizes the prefill stage through model compilation.
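The draft-and-verify loop behind speculative decoding can be illustrated with a greedy toy version: a cheap draft function proposes several tokens, and the target function keeps the longest prefix it agrees with. This sketch is not gpt-fast code. `target_next` and `draft_next` are hypothetical callables that map a token list to the next token, and the sketch omits both the sampling-based acceptance rule and the bonus token that full implementations use.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding; output matches target-only greedy decoding."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # The draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft_next(tokens + proposal))
        # The target checks each position. A real engine would score all
        # proposed positions in one batched forward pass.
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token and stop
        tokens.extend(accepted)
    return tokens
```

Every round adds at least one token, so the loop always terminates, and the speedup in practice depends on how often the draft agrees with the target.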
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
Adjusts the number of workers dedicated to prefill and decode phases separately based on real-time metrics.
DeepSeek-VL2 is a multimodal large language model and vision-language system designed to analyze visual scenes and generate descriptive text. It functions as a model for visual question answering (VQA) and visual grounding, capable of extracting information from documents and locating specific objects or regions within images based on textual descriptions. The project uses a Mixture-of-Experts architecture to process combined image and text inputs. It is optimized for inference through incremental prefilling, which reduces GPU memory requirements. The model covers multimodal data analysis and visual document understanding, including the interpretation of charts and layouts. It performs visual inference and grounding to match textual queries with the corresponding visual content.
Reduces GPU memory consumption during the initial prompt prefill stage via incremental processing.
FlashInfer is a library of high-performance GPU kernels purpose-built for accelerating large language model inference. It provides optimized implementations for attention operations (including flash attention, page attention, multi-head latent attention, and cascade attention) using paged key-value caches, fused kernel composition, and just-in-time compilation. The library also includes specialized kernels for mixture-of-experts layers, block-scaled low-precision quantization (FP8, FP4), and distributed collective communication. What distinguishes FlashInfer is its fused all-reduce communicat
Implements fused batch prefill kernels for variable-length sequences with ragged page tables.
Llama-swap is a local inference orchestrator and API gateway for large language models. It functions as an OpenAI API proxy that manages the lifecycle of multiple local model servers, automatically starting and stopping them to swap models based on incoming request identifiers. The project distinguishes itself through dynamic model swapping and hardware optimization. It utilizes a specialized matrix-based concurrency control to define which models can run simultaneously and employs cost-based eviction to remove inactive servers from memory based on relative resource costs. The system provide
Removes inactive models from memory after a specific timeout period to free up system resources.
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
Releases GPU memory by moving a loaded model to CPU or fully unloading it, then reloads it later on demand.
This project is a headless large language model inference engine and server manager designed for local deployments. It provides a developer toolkit and API gateway that allows for the management of model lifecycles and inference tasks without a graphical user interface. The system enables the deployment of model engines across different operating systems, cloud environments, or CI pipelines. It includes a command-line interface for bootstrapping development projects and automating the orchestration of loading and unloading model binaries based on specific workflow needs. The toolset covers i
Removes a loaded model from memory to free resources, optionally unloading all models at once.
mini-sglang is a collection of tools for large language model inference, serving as an OpenAI-compatible inference server, a memory-efficient prefill engine, and a tensor parallelism runtime. It also functions as a local batch processing engine for offline benchmarking and ablation studies. The project focuses on acceleration and memory management through a KV cache manager that reuses precomputed caches for shared request prefixes. It handles large model workloads by distributing tasks across multiple GPUs and manages peak memory consumption by splitting long input sequences into smaller chu
Implements chunked prefill execution to maintain a constant memory ceiling during initial sequence processing.
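Chunked prefill can be illustrated with a toy causal attention over scalar embeddings: the prompt is processed a few tokens at a time, each chunk attending to the cache built so far. This is a sketch of the general idea, not mini-sglang code. All names are hypothetical, and using one scalar as query, key, and value is a simplification made only to keep the example short.

```python
import math


def attend(queries, keys, values, offset):
    """Causal attention for queries at absolute positions offset, offset + 1, ..."""
    out = []
    for qi, q in enumerate(queries):
        visible = offset + qi + 1  # causal mask: positions up to and including own
        scores = [q * k for k in keys[:visible]]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, values[:visible])) / total)
    return out


def chunked_prefill(embeddings, chunk_size):
    """Prefill in chunks; returns outputs and the largest score-matrix size."""
    kv_keys, kv_values, outputs = [], [], []
    peak_scores = 0
    for start in range(0, len(embeddings), chunk_size):
        chunk = embeddings[start:start + chunk_size]
        kv_keys.extend(chunk)    # toy model: key = value = query = embedding
        kv_values.extend(chunk)
        outputs.extend(attend(chunk, kv_keys, kv_values, offset=start))
        peak_scores = max(peak_scores, len(chunk) * len(kv_keys))
    return outputs, peak_scores
```

Calling `chunked_prefill(emb, len(emb))` reproduces a single-pass prefill, while a smaller `chunk_size` yields the same outputs with a smaller per-step score matrix. The cache itself still grows with prompt length, so the bound applies to the per-chunk working memory.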
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Scales large models by separating prefill and decode stages using expert parallelism.