20 Repos

Awesome GitHub Repositories: Memory Optimization

Techniques for reducing VRAM usage during model training or inference.

Distinguishing note: Focuses on VRAM reduction for generative models.

20 GitHub repositories in Artificial Intelligence & ML · Memory Optimization.

hiyouga/LLaMA-Factory
72,241 stars
LLaMA-Factory is a comprehensive suite for dataset preparation, model fine-tuning, memory optimization, and standardized API deployment. It provides a unified platform for the supervised and reward-based fine-tuning of large language models and vision-language models. The framework includes a specialized toolkit for training vision-language models and a model serving interface that deploys trained models through high-performance APIs. It utilizes precision tuning and quantization techniques to reduce the hardware requirements and memory footprint of large models. The system covers data pipel
Optimizes VRAM usage during training and inference through precision tuning and quantization.
Python
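
To make the effect of precision tuning and quantization concrete, the following is a minimal back-of-the-envelope sketch of how weight storage scales with numeric precision. It is not LLaMA-Factory code: the function name `weight_memory_gib`, the 7-billion-parameter model size, and the byte widths in `BYTES_PER_PARAM` are illustrative assumptions, and the estimate covers weights only (no activations, optimizer state, KV cache, or quantization scale overhead).

```python
# Approximate storage cost per parameter for common precisions (assumed values;
# real quantized formats add a small overhead for scales and zero points).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Hypothetical 7B-parameter model.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(7_000_000_000, dtype):.1f} GiB")
```

Under these assumptions a 7B-parameter model needs roughly 26.1 GiB of weights in fp32, 13.0 GiB in fp16, 6.5 GiB in int8, and 3.3 GiB in int4, which is why lower precision can decide whether a model fits on a given GPU at all.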
lllyasviel/ControlNet
33,942 stars
ControlNet is a framework for structural image generation that extends pre-trained diffusion models with neural network architectures designed for precise spatial control. By injecting structural guidance directly into the latent-space denoising process, the system enables users to enforce geometric or semantic constraints on generated outputs while maintaining style consistency. The framework distinguishes itself through a weight-locked copying mechanism that preserves the integrity of the original model while introducing new control signals. It supports multi-condition synthesis, allowing f
Reduces video memory consumption to enable larger batch sizes on limited hardware.
Python
xming521/WeClone
18,028 stars
WeClone is an end-to-end framework designed for the creation, training, and deployment of personalized conversational AI digital twins. By fine-tuning large language models on individual chat history, the platform enables the replication of unique communication styles, speech patterns, and conversational habits. The system manages the entire lifecycle of these digital avatars, from initial data preparation to final integration into messaging platforms for real-time interaction. The platform distinguishes itself through a comprehensive suite of data processing utilities that prepare raw messag
Implements memory optimization techniques like quantization and batch size adjustment to fit large models into limited hardware memory.
Python · chat-history · digital-avatar · llm
cjpais/Handy
15,515 stars
Handy is a local speech-to-text automation tool designed to convert spoken audio into text and inject it directly into active desktop applications. By running machine learning models entirely on the host hardware, it provides a private, offline-first environment for dictation and command execution. The system functions as a background service that manages microphone input, transcription state, and text output, enabling hands-free typing across various software environments. The project distinguishes itself through a modular pipeline that integrates local language models for post-transcription
Frees system memory by unloading transcription models after periods of inactivity.
Rust · accessibility · cross-platform · speech-to-text
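
Unloading a model after a period of inactivity can be sketched as a small wrapper that tracks when the model was last used. This is an illustrative Python sketch, not Handy's implementation (Handy is written in Rust): the class `IdleUnloadingModel`, its `loader` callback, and the injectable `clock` are hypothetical names introduced here.

```python
import time


class IdleUnloadingModel:
    """Lazily loads a model and drops it once it has been idle too long."""

    def __init__(self, loader, idle_timeout_s, clock=time.monotonic):
        self._loader = loader              # callable that builds the model
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock                # injectable for testing
        self._model = None
        self._last_used = None

    @property
    def loaded(self):
        return self._model is not None

    def get(self):
        """Return the model, loading it on demand, and record the access time."""
        if self._model is None:
            self._model = self._loader()
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self):
        """Drop the model if the timeout has elapsed; return True if unloaded."""
        if self._model is None:
            return False
        if self._clock() - self._last_used >= self._idle_timeout_s:
            self._model = None             # release the reference so memory can be freed
            return True
        return False
```

In this sketch a background timer or the application's event loop would call `unload_if_idle()` periodically; the next `get()` reloads the model, trading a one-time load delay for lower memory use while idle.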
lllyasviel/stable-diffusion-webui-forge
12,730 stars
Stable Diffusion WebUI Forge is a web-based interface and inference engine designed for the generation of AI media. It functions as a platform for executing diffusion-based models, providing a centralized environment to manage image preprocessors, custom generation logic, and hardware-accelerated sampling. The project distinguishes itself through a neural network patching framework that allows for the modification of model layers and the application of spatial conditioning during inference. By injecting custom logic and adapters directly into the network, users can influence output behaviors
Minimizes video memory consumption to allow high-resolution models to run on hardware with limited capacity.
Python
bmaltais/kohya_ss
12,384 stars
kohya_ss is a graphical user interface and workbench for fine-tuning diffusion models, specifically designed for Stable Diffusion. It provides a suite of tools for training generative AI models, including specialized interfaces for creating Low-Rank Adaptation weights and training ControlNet spatial control networks. The project distinguishes itself through integrated VRAM usage optimization and hardware acceleration, featuring specific support for Intel GPUs via XPU-accelerated libraries. It implements parameter-efficient training methods and memory-saving techniques like gradient checkpoint
Minimizes VRAM consumption using techniques like gradient checkpointing and caching to prevent out-of-memory errors.
Python
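
Gradient checkpointing trades compute for memory: only some activations are kept during the forward pass, and the rest are recomputed during the backward pass. The toy sketch below illustrates the idea on a chain of scalar `tanh` layers in plain Python. It is not kohya_ss code; the layer definition, function names, and segment size are assumptions chosen to keep the example self-contained.

```python
import math


def forward_layer(w, x):
    return math.tanh(w * x)


def backward_layer(w, x, grad_out):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def grad_full(weights, x0):
    """Baseline: keep every layer input alive until the backward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = backward_layer(w, x_in, grad)
    return grad, len(inputs)  # gradient w.r.t. x0, stored activations


def grad_checkpointed(weights, x0, segment):
    """Keep one input per segment; recompute the others during backward."""
    checkpoints, x = [], x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)
        x = forward_layer(w, x)
    grad, peak = 1.0, len(checkpoints)
    for seg_idx in reversed(range(len(checkpoints))):
        seg_weights = weights[seg_idx * segment:(seg_idx + 1) * segment]
        # Recompute this segment's layer inputs from its checkpoint.
        inputs, x = [], checkpoints[seg_idx]
        for w in seg_weights:
            inputs.append(x)
            x = forward_layer(w, x)
        # Upper bound on values held at once: checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, x_in in zip(reversed(seg_weights), reversed(inputs)):
            grad = backward_layer(w, x_in, grad)
    return grad, peak
```

With 16 layers and a segment size of 4, the baseline holds 16 activations while the checkpointed version holds at most 8, and both return the same gradient; the cost is one extra forward pass per segment.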
OpenAccess-AI-Collective/axolotl
12,062 stars
Axolotl is a distributed training orchestrator and fine-tuning framework for large language models, multimodal systems, and quantized models. It provides a structured environment for specializing pre-trained models through full parameter updates or low-rank adaptation, as well as aligning model outputs with human expectations via preference tuning pipelines and reward modeling. The system distinguishes itself through a configuration-driven pipeline that manages preprocessing and training workflows via a single file for reproducibility. It implements high-throughput optimizations such as multi
Reduces VRAM requirements during training through quantization and reduced-precision fine-tuning.
Python
lyogavin/airllm
11,508 stars
Airllm is a framework designed to execute and fine-tune large language models on consumer-grade hardware. By employing layer-wise model decomposition and memory-efficient loading techniques, the engine enables the operation of massive models that would otherwise exceed available system or video memory. The project distinguishes itself through a suite of optimization strategies that balance memory footprint with performance. It utilizes block-wise weight quantization and asynchronous layer prefetching to reduce resource consumption and hide data transfer latency. Additionally, the framework su
Reduces VRAM usage for large models using attention optimizations and parameter-efficient techniques to enable execution on consumer hardware.
Jupyter Notebook · chinese-llm · chinese-nlp · finetune
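
The block-wise weight quantization mentioned in the description can be illustrated with a minimal absmax int8 scheme, where each block of weights gets its own scale. This is a plain-Python sketch under assumptions, not AirLLM's actual quantizer: the function names, the int8 target range, and the block size of 4 are chosen for illustration only.

```python
def quantize_blockwise(values, block_size=4):
    """Absmax int8 quantization with one float scale per block of values."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Map the largest magnitude in the block to 127; avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        codes = [round(v / scale) for v in block]  # integers in [-127, 127]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Reconstruct approximate float values from (scale, codes) blocks."""
    return [scale * code for scale, codes in blocks for code in codes]
```

Because each block has its own scale, a large outlier only coarsens the resolution of its own block: in `[0.02, -0.01, 0.03, 0.0, 5.0, -2.5, 1.25, 0.5]` the first four values are quantized with a scale derived from 0.03 rather than from 5.0, so their rounding error stays within half of that small scale.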
intel/ipex-llm
8,836 stars
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
Reduces memory usage during first token generation to support longer context windows.
Python
focus-creative-games/hybridclr
7,863 stars
HybridCLR is a hybrid C# execution engine and assembly loader designed for Unity. It provides a system for hot-updating C# logic across all platforms at runtime without requiring the application to be rebuilt or reinstalled. The project is distinguished by its mixed-mode execution, which runs unmodified code at native speed while using a high-performance interpreter for updated functions. It includes a generic type resolver that allows hot-updated code to use generic classes and functions regardless of whether they were pre-instantiated in the main binary. To protect proprietary source code,
Completely unloads existing assemblies from memory to allow for clean replacement of updated code.
C++ · csharp · framework · hot
kohya-ss/sd-scripts
7,133 stars
sd-scripts is a suite of utilities designed for fine-tuning generative models, preprocessing datasets, and converting model weights. It provides a collection of scripts for executing Stable Diffusion training through methods such as DreamBooth, textual inversion, and full fine-tuning, alongside a framework for creating and managing Low-Rank Adaptation weights. The project features specialized capabilities for model weight conversion between different architectures and precision formats. It includes tools for merging adaptation weights into base models, extracting weights from trained models,
Reduces VRAM requirements during training and inference through block swapping, mixed precision, and latent caching.
Python
meta-pytorch/gpt-fast
6,223 stars
gpt-fast is a PyTorch transformer inference engine designed for text generation using a native tensor library implementation. It provides a runtime for executing large language models without the need for external C++ extensions. The project implements speculative decoding to accelerate generation by using a small draft model for token prediction and a larger model for verification. It further optimizes performance through a compiled prefill stage and a multi-GPU tensor parallelism library that shards linear layers across multiple graphics processing units. Memory efficiency is managed throu
Offers a high-performance implementation that optimizes the prefill stage through model compilation.
Python
ai-dynamo/dynamo
6,112 stars
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
Adjusts the number of workers dedicated to prefill and decode phases separately based on real-time metrics.
Rust
deepseek-ai/DeepSeek-VL2
5,302 stars
DeepSeek-VL2 is a multimodal large language model and vision-language system designed to analyze visual scenes and generate descriptive text. It serves as a model for visual question answering (VQA) and visual grounding, able to extract information from documents and to localize specific objects or regions within images based on textual descriptions. The project uses a Mixture-of-Experts architecture to process combined image and text inputs. It is optimized for inference through incremental prefilling, which reduces GPU memory requirements. The model covers multimodal data analysis and visual document understanding, including the interpretation of charts and layouts. It performs visual inference and grounding to match textual queries with the corresponding visual content.
Reduces GPU memory consumption during the initial prompt prefill stage via incremental processing.
Python
flashinfer-ai/flashinfer
4,996 stars
FlashInfer is a library of high-performance GPU kernels purpose-built for accelerating large language model inference. It provides optimized implementations for attention operations (including flash attention, page attention, multi-head latent attention, and cascade attention) using paged key-value caches, fused kernel composition, and just-in-time compilation. The library also includes specialized kernels for mixture-of-experts layers, block-scaled low-precision quantization (FP8, FP4), and distributed collective communication. What distinguishes FlashInfer is its fused all-reduce communicat
Implements fused batch prefill kernels for variable-length sequences with ragged page tables.
Python · attention · cuda · distributed-inference
mostlygeek/llama-swap
4,786 stars
Llama-swap is a local inference orchestrator and API gateway for large language models. It functions as an OpenAI API proxy that manages the lifecycle of multiple local model servers, automatically starting and stopping them to swap models based on incoming request identifiers. The project distinguishes itself through dynamic model swapping and hardware optimization. It utilizes a specialized matrix-based concurrency control to define which models can run simultaneously and employs cost-based eviction to remove inactive servers from memory based on relative resource costs. The system provide
Removes inactive models from memory after a specific timeout period to free up system resources.
Go
OpenNMT/CTranslate2
4,319 stars
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
Releases GPU memory by moving a loaded model to CPU or fully unloading it, then reloads it later on demand.
C++ · avx · avx2 · cpp
lmstudio-ai/lms
4,214 stars
This project is a headless large language model inference engine and server manager designed for local deployments. It provides a developer toolkit and API gateway that allows for the management of model lifecycles and inference tasks without a graphical user interface. The system enables the deployment of model engines across different operating systems, cloud environments, or CI pipelines. It includes a command-line interface for bootstrapping development projects and automating the orchestration of loading and unloading model binaries based on specific workflow needs. The toolset covers i
Removes a loaded model from memory to free resources, optionally unloading all models at once.
TypeScript · llm · lmstudio · nodejs
sgl-project/mini-sglang
3,514 stars
mini-sglang is a collection of tools for large language model inference, serving as an OpenAI-compatible inference server, a memory-efficient prefill engine, and a tensor parallelism runtime. It also functions as a local batch processing engine for offline benchmarking and ablation studies. The project focuses on acceleration and memory management through a KV cache manager that reuses precomputed caches for shared request prefixes. It handles large model workloads by distributing tasks across multiple GPUs and manages peak memory consumption by splitting long input sequences into smaller chu
Implements chunked prefill execution to maintain a constant memory ceiling during initial sequence processing.
Python
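
Chunked prefill can be sketched as feeding a long prompt to the model in fixed-size slices, so that per-step working memory depends on the chunk size rather than on the full prompt length. The sketch below is illustrative and not taken from mini-sglang: the function names are hypothetical, and `peak_attention_scores` assumes a naive attention that materializes a queries-by-keys score matrix, which fused kernels avoid.

```python
def chunked_prefill(prompt_tokens, chunk_size):
    """Yield (chunk, kv_cache_len_after_chunk) for each prefill step."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    kv_len = 0
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        kv_len += len(chunk)  # the KV cache still grows with the prompt
        yield chunk, kv_len


def peak_attention_scores(prompt_len, chunk_size):
    """Largest per-step score matrix (queries x keys), counted in elements."""
    peak = 0
    for chunk, kv_len in chunked_prefill(range(prompt_len), chunk_size):
        peak = max(peak, len(chunk) * kv_len)
    return peak
```

Under this simplified model, an 8,192-token prompt processed in one step needs a score matrix of 8,192 × 8,192 elements, while 512-token chunks cap it at 512 × 8,192, a 16-fold reduction. The KV cache itself is not reduced by chunking; only the transient per-step memory is bounded.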
llm-d/llm-d
2,514 stars
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Scales large models by separating prefill and decode stages using expert parallelism.
Shell