# CPU GGUF Model Inference Engines

> Search results for `run quantized GGUF models on CPU` on awesome-repositories.com. 115 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/run-quantized-gguf-models-on-cpu

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/run-quantized-gguf-models-on-cpu).**

## Results

- [handsonllm/hands-on-large-language-models](https://awesome-repositories.com/repository/handsonllm-hands-on-large-language-models.md) (27,059 ⭐) — This project is an educational resource focused on the internal mechanics and design principles of transformer-based neural networks. It provides a structured guide to the fundamental components of generative artificial intelligence, including sequence modeling, semantic embeddings, and the mathematical foundations of large language models.

The repository distinguishes itself through a heavy emphasis on visual documentation, utilizing diagrams and step-by-step explanations to clarify how data flows through complex neural architectures. It serves as a technical reference for developers seeking
- [mlabonne/llm-course](https://awesome-repositories.com/repository/mlabonne-llm-course.md) (80,178 ⭐) — This project is a comprehensive educational curriculum and engineering handbook focused on the lifecycle of large language models. It serves as a structured knowledge base for machine learning practitioners, covering the fundamental mathematical and architectural principles of transformer-based sequence modeling, as well as the practical implementation of supervised instruction fine-tuning and preference-based model alignment.

The repository distinguishes itself by providing a deep dive into advanced model composition and optimization techniques. It details methodologies for weight-space mode
- [city96/comfyui-gguf](https://awesome-repositories.com/repository/city96-comfyui-gguf.md) (3,291 ⭐) — ComfyUI-GGUF is a memory optimizer and model loader for ComfyUI that enables the execution of large transformer-based generative models using quantized weights. It provides a system for loading GGUF formatted weights within a node-based diffusion interface to reduce GPU memory consumption.

The project includes a quantization tool for converting standard model checkpoints into compressed binary formats and a tensor fixer to restore missing keys and correct architectures in binary model files. These utilities ensure that compressed models remain functional during inference on hardware with limi
- [kvcache-ai/ktransformers](https://awesome-repositories.com/repository/kvcache-ai-ktransformers.md) (17,288 ⭐) — KTransformers is a comprehensive framework designed for the operation, fine-tuning, and serving of large language models. It functions as a heterogeneous inference engine and quantized execution runtime, enabling the deployment of massive models by distributing computational workloads across both CPU and GPU resources. This architecture allows users to work around local memory constraints, making it possible to run and train models that exceed the capacity of a single device.

The project distinguishes itself through specialized support for sparse architectures, particularly mixture-of-experts mode
- [intel/neural-compressor](https://awesome-repositories.com/repository/intel-neural-compressor.md) (2,585 ⭐) — Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency.

The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model executi
- [antimatter15/alpaca.cpp](https://awesome-repositories.com/repository/antimatter15-alpaca-cpp.md) (10,138 ⭐) — alpaca.cpp is a high-performance local inference engine implemented in C++ for executing instruction-tuned large language models. It serves as a quantized model runtime designed to load and run model tensors on local hardware with minimal dependencies, removing the requirement for a full Python environment.

The project focuses on on-device text generation and the deployment of private AI chatbots. It utilizes model weight quantization to reduce memory requirements and increase inference speed on consumer-grade devices.

The system covers hardware-optimized model execution through thread-pool
- [uraimo/run-on-arch-action](https://awesome-repositories.com/repository/uraimo-run-on-arch-action.md) (747 ⭐) — A GitHub Action that executes jobs and commands on non-x86 CPU architectures (ARMv6, ARMv7, aarch64, s390x, ppc64le, riscv64) via QEMU.
- [mudler/localai](https://awesome-repositories.com/repository/mudler-localai.md) (46,889 ⭐) — LocalAI is a self-hosted inference server that enables the execution of machine learning models directly on local hardware. By providing a unified interface for text, image, and audio processing, it allows users to maintain full control over data privacy and infrastructure costs while eliminating dependencies on external network services.

The platform functions as an API gateway that mimics standard cloud-based artificial intelligence interfaces, allowing existing applications to integrate local models as drop-in replacements. It utilizes a container-based architecture to package runtimes and
- [tencent-hunyuan/hunyuanvideo](https://awesome-repositories.com/repository/tencent-hunyuan-hunyuanvideo.md) (12,233 ⭐) — HunyuanVideo is a generative artificial intelligence framework designed to synthesize high-fidelity video sequences from descriptive text prompts. It utilizes a latent diffusion architecture that compresses video data into compact representations, allowing for the generation of dynamic visual content while maintaining temporal and spatial fidelity.

The system distinguishes itself through a specialized inference engine that supports eight-bit weight quantization and sequence-parallel distribution. These capabilities enable the execution of large-scale generative models on hardware with limited
- [huggingface/transformers](https://awesome-repositories.com/repository/huggingface-transformers.md) (161,630 ⭐) — Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference.

The library features extensive support for model optimization and
- [geshan/laravel6-on-google-cloud-run](https://awesome-repositories.com/repository/geshan-laravel6-on-google-cloud-run.md) (25 ⭐) — A demo of Laravel 6 running on Google Cloud Run.
- [microsoft/onnxruntime](https://awesome-repositories.com/repository/microsoft-onnxruntime.md) (19,347 ⭐) — This project is a cross-platform machine learning inference engine designed to execute pre-trained models across diverse operating systems and hardware environments. It functions as a standardized execution framework that manages the entire lifecycle of model inference, from loading and graph optimization to hardware-accelerated execution and generative sequence management.

The runtime distinguishes itself through a highly modular architecture that decouples model logic from hardware-specific kernels. By utilizing an execution provider abstraction, it enables developers to offload computation
- [intel/ipex-llm](https://awesome-repositories.com/repository/intel-ipex-llm.md) (8,836 ⭐) — Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats.

The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
- [google/cpu_features](https://awesome-repositories.com/repository/google-cpu-features.md) (2,607 ⭐) — A cross-platform C99 library for retrieving CPU features at runtime.
- [hunyuan-promptenhancer/promptenhancer](https://awesome-repositories.com/repository/hunyuan-promptenhancer-promptenhancer.md) (3,421 ⭐) — PromptEnhancer is a GGUF prompt rewriting engine that loads quantized models to rewrite plain text prompts into clearer, structured versions while preserving the original subject, style, and layout. It is designed to refine image editing instructions by incorporating visual context from the input image, producing precise editing prompts.

The tool operates through a structured prompt rewriting engine that combines editing instructions with visual context from images, embedding that context into the rewriting process for context-aware refinement. It runs inference with a minimal memory footprin
- [huggingface/smolagents](https://awesome-repositories.com/repository/huggingface-smolagents.md) (27,885 ⭐) — This framework provides a development toolkit for building autonomous agents that utilize language models to solve complex, non-deterministic tasks. Its core design centers on a code-executing architecture where agents generate and run Python code snippets to perform logic, data manipulation, and tool interactions. By moving beyond structured data formats, the system enables agents to manage program flow and object state through iterative reasoning cycles.

The project distinguishes itself through its focus on code-based agent implementation and secure execution environments. Developers can ch
- [axolotl-ai-cloud/axolotl](https://awesome-repositories.com/repository/axolotl-ai-cloud-axolotl.md) (12,059 ⭐) — Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies.

The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
- [nvlabs/sana](https://awesome-repositories.com/repository/nvlabs-sana.md) (8,310 ⭐) — Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control.

The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
- [kamgurgul/cpu-info](https://awesome-repositories.com/repository/kamgurgul-cpu-info.md) (1,026 ⭐) — CPU Info is a multiplatform application that provides information about device hardware and software.
- [nswbmw/cpu-memory-monitor](https://awesome-repositories.com/repository/nswbmw-cpu-memory-monitor.md) (30 ⭐) — CPU & Memory Monitor, auto dump.
- [meilisearch/meilisearch](https://awesome-repositories.com/repository/meilisearch-meilisearch.md) (58,118 ⭐) — Meilisearch is a Rust-based search engine providing typo-tolerant full-text and vector-based semantic search with real-time conversational capabilities.
- [microsoft/bitnet](https://awesome-repositories.com/repository/microsoft-bitnet.md) (39,327 ⭐) — BitNet is a quantized inference engine designed to execute highly compressed language models by performing arithmetic on low-precision, bit-level weight data. It functions as a model optimization toolkit and a high-performance kernel library, enabling the execution of large language models on consumer hardware by reducing memory footprints and increasing processing speeds.

The project distinguishes itself through hardware-specific kernel optimizations that leverage native processor instructions to accelerate matrix multiplication. By utilizing packed integer arithmetic and memory-aligned weig
- [iuliaturc/gguf-docs](https://awesome-repositories.com/repository/iuliaturc-gguf-docs.md) (483 ⭐) — Legacy Quants - K-Quants - I-Quants - Importance Matrix - Naming Convention
- [openlmlab/moss](https://awesome-repositories.com/repository/openlmlab-moss.md) (12,140 ⭐) — MOSS is a conversational AI platform, fine-tuning toolkit, and quantized model runtime. It provides a framework for deploying large language models capable of multi-turn dialogue, general-purpose response generation, and following complex instructions.

The system functions as a tool-augmented framework that extends model knowledge through external plugins and tool-call loops. This allows the model to execute tasks via search engines and calculators to augment responses with external data.

The project covers model training through supervised conversational fine-tuning and optimizes deployment
- [docker/awesome-compose](https://awesome-repositories.com/repository/docker-awesome-compose.md) (45,561 ⭐) — Awesome Compose is a collection of resources designed to demonstrate the orchestration of multi-container applications. It serves as a practical reference for using declarative configuration files to define, manage, and deploy complex software stacks, ensuring that services run consistently across development, testing, and production environments.

The project highlights the capabilities of container lifecycle management by providing examples of how to bundle software with its dependencies into isolated, portable units. It emphasizes the use of multi-stage build pipelines to optimize image siz
- [tmux-plugins/tmux-cpu](https://awesome-repositories.com/repository/tmux-plugins-tmux-cpu.md) (528 ⭐) — Plug-and-play CPU percentage and icon indicator for tmux.
- [docker/compose](https://awesome-repositories.com/repository/docker-compose.md) (37,588 ⭐) — Docker Compose is a tool for defining and running multi-container applications through declarative configuration files. It functions as an application lifecycle manager, coordinating the startup, shutdown, and scaling of interconnected services within isolated environments. By using a standardized configuration format, it enables infrastructure as code, allowing developers to manage complex application stacks and their dependencies in a single, repeatable file.

The project distinguishes itself by integrating directly with the broader Docker platform, leveraging a client-server architecture wh
- [openmoss/moss](https://awesome-repositories.com/repository/openmoss-moss.md) (12,140 ⭐) — MOSS is a conversational AI API server and framework designed to manage stateful multi-turn dialogues via session identifiers for remote interaction. It functions as a tool-augmented language model framework and a quantized inference engine.

The project integrates external plugins, such as search engines and calculators, to provide factual and computed data within model responses. It also includes a supervised fine-tuning toolkit for adapting base language models to specific conversational datasets and behavioral instructions.

The system supports inference optimization through 4-bit and 8-bi
- [meta-llama/llama-cookbook](https://awesome-repositories.com/repository/meta-llama-llama-cookbook.md) (18,375 ⭐) — This project is a collection of implementation guides, recipes, and developer resources for building applications with Llama models. It serves as a comprehensive kit for developing autonomous agents, establishing retrieval-augmented generation systems, and executing model fine-tuning.

The resource provides specific patterns for multimodal workflows that process text, images, and audio. It includes specialized guidance on adapting pre-trained model weights for targeted tasks and implementing tool-calling orchestration to connect models with external APIs and functions.

The codebase covers a b
- [jdxcode/tmux-cpu-info](https://awesome-repositories.com/repository/jdxcode-tmux-cpu-info.md) (15 ⭐) — Shows a tiny bar in your tmux statusline with the current CPU usage
- [mozilla-ai/llamafile](https://awesome-repositories.com/repository/mozilla-ai-llamafile.md) (23,726 ⭐) — Llamafile is a machine learning model runner and packager that enables local inference by bundling model weights and runtime environments into a single, self-contained executable. It functions as a cross-platform engine, allowing users to execute large language models and perform speech-to-text tasks directly on their own hardware without requiring external software dependencies or complex installations.

The project distinguishes itself by utilizing a specialized binary format that allows the same executable to run natively across multiple operating systems and hardware architectures. It auto
- [blakeblackshear/frigate](https://awesome-repositories.com/repository/blakeblackshear-frigate.md) (33,778 ⭐) — Frigate is a self-hosted network video recorder that functions as a private, local AI-powered vision engine. It manages video streams by performing real-time object detection, tracking, and classification directly on local hardware, ensuring that security monitoring and activity recording remain independent of cloud services.

The system distinguishes itself through a modular, hardware-accelerated video pipeline that offloads intensive decoding and machine learning inference to dedicated GPUs, NPUs, or specialized accelerators like Coral TPUs and Hailo modules. It utilizes state-based object t
- [robbyant/lingbot-world](https://awesome-repositories.com/repository/robbyant-lingbot-world.md) (2,915 ⭐) — Lingbot-world is an interactive world simulator and framework for generating high-fidelity video environments from text and image prompts. It functions as a video generation system designed to create controllable simulations for applications such as robotics learning and gaming.

The project includes a video motion controller that directs camera and object movement using transformation matrices and action strings. It utilizes a quantized inference engine to reduce memory usage and accelerate the generation of video sequences.

The system covers a range of optimization techniques, including fou
- [yhhhli/apot_quantization](https://awesome-repositories.com/repository/yhhhli-apot-quantization.md) (0 ⭐)
- [playbahn/tmux-cpu-rs](https://awesome-repositories.com/repository/playbahn-tmux-cpu-rs.md) (6 ⭐) — A small, fast Rust-based CLI tool to display CPU usage inside your tmux status line, with caching
- [qwenlm/qwen-vl](https://awesome-repositories.com/repository/qwenlm-qwen-vl.md) (6,535 ⭐)
- [nomic-ai/gpt4all](https://awesome-repositories.com/repository/nomic-ai-gpt4all.md) (77,375 ⭐) — GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a comprehensive ecosystem for managing the entire model lifecycle, including discovery, downloading, and configuration of local weights.

What distinguishes the platform is its integrated retrieval-augmented generation engine, which allows users to index local documents into semantic vect
- [tensorflow/model-optimization](https://awesome-repositories.com/repository/tensorflow-model-optimization.md) (1,573 ⭐) — A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
- [llmware-ai/llmware](https://awesome-repositories.com/repository/llmware-ai-llmware.md) (14,838 ⭐) — llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation.

The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVINO. It features a specialized ability to translate natural lang
- [kedacore/keda](https://awesome-repositories.com/repository/kedacore-keda.md) (10,314 ⭐) — KEDA is a Kubernetes event-driven autoscaler and cloud event scaling engine. It functions as a custom metrics provider that monitors external event sources—including message brokers, databases, and cloud metrics—to dynamically adjust the replica counts of containerized workloads.

The project is distinguished by its scale-to-zero workflow, which reduces workloads to zero replicas during inactivity and automatically restarts them when new events are detected. It operates as a multi-cloud event trigger system, using a pluggable scaler interface to integrate with a wide array of third-party servi
- [jomjol/ai-on-the-edge-device](https://awesome-repositories.com/repository/jomjol-ai-on-the-edge-device.md) (8,461 ⭐) — AI-on-the-edge-device is an edge AI meter digitizer and computer vision image processor designed to convert images of analog and digital utility meters into numeric values. It functions as an IoT gateway that runs neural network inference locally on hardware to monitor water, power, and gas readings.

The system is distinguished by its ability to handle both analog pointers and digital digits through custom-trained neural networks. It includes specialized tools for image alignment, region-of-interest extraction, and hardware-level lighting control to minimize glare on glass surfaces. To mainta
- [thewtex/tmux-mem-cpu-load](https://awesome-repositories.com/repository/thewtex-tmux-mem-cpu-load.md) (1,114 ⭐) — CPU, RAM, and load monitor for use with tmux
- [zhaochenyang20/awesome-ml-sys-tutorial](https://awesome-repositories.com/repository/zhaochenyang20-awesome-ml-sys-tutorial.md) (5,371 ⭐) — This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters.

The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
- [zai-org/chatglm2-6b](https://awesome-repositories.com/repository/zai-org-chatglm2-6b.md) (15,564 ⭐) — ChatGLM2-6B is a bilingual chat large language model designed for natural conversation and text generation in both English and Chinese. It functions as a fine-tunable language model that supports updating weights via specialized scripts to adapt to specific datasets and tasks.

The project serves as a quantized inference engine and multi-GPU model orchestrator, enabling the execution of large models on consumer-grade hardware. It is capable of processing long context sequences up to 32K tokens to maintain understanding across extended documents.

The system covers capabilities for multilingual
- [modular/modular](https://awesome-repositories.com/repository/modular-modular.md) (26,357 ⭐) — Modular is a unified machine learning development platform designed for building, compiling, and deploying high-performance neural network models. It provides a comprehensive execution engine that supports both local and production-grade inference, enabling developers to manage the entire model lifecycle from initial architecture definition to scalable, containerized service deployment.

The platform distinguishes itself through a hardware-agnostic runtime that abstracts diverse silicon architectures, allowing models to execute efficiently across varied compute environments. It includes a spec
- [mozilla/firefox-translations-models](https://awesome-repositories.com/repository/mozilla-firefox-translations-models.md) (462 ⭐) — CPU-optimized NMT models for Firefox Translations.
- [run-house/kubetorch](https://awesome-repositories.com/repository/run-house-kubetorch.md) (1,212 ⭐) — Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.
- [sgl-project/sglang](https://awesome-repositories.com/repository/sgl-project-sglang.md) (29,079 ⭐) — SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems.

The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
- [1panel-dev/1panel](https://awesome-repositories.com/repository/1panel-dev-1panel.md) (35,898 ⭐) — 1Panel is a centralized server management and container orchestration platform designed to simplify the administration of Linux-based infrastructure. It provides a unified web interface for managing containerized workloads, automating system maintenance, and configuring server resources. By acting as a comprehensive control plane, the platform streamlines the deployment of applications, databases, and web services while offering granular control over host system internals and security settings.

What distinguishes this platform is its integrated support for private artificial intelligence infr
- [facebook/react-native](https://awesome-repositories.com/repository/facebook-react-native.md) (126,019 ⭐) — This project is a cross-platform mobile framework that enables the development of native iOS and Android applications from a single codebase. It utilizes a declarative component-based model where developers define user interfaces using a syntax extension that maps directly to underlying platform-native view primitives. By decoupling application logic from the host platform's main thread, the framework maintains a consistent native view hierarchy while ensuring that JavaScript execution remains independent of UI rendering.

The framework distinguishes itself through a robust bridge architecture