The visitor wants a software framework or runtime environment capable of executing quantized large language models in GGUF format on standard CPU hardware.

ggerganov/llama.cpp is the closest match — This is the primary engine for GGUF-based inference, providing the native runtime for CPU-based execution, quantization tools, and an OpenAI-compatible API that directly fulfills all your requirements.. Other strong matches: nomic-ai/gpt4all, menloresearch/jan, ericlbuehler/mistral.rs, lostruins/koboldcpp.

Why does ggerganov/llama.cpp match “an engine for running quantized models on CPU”?

This is the primary engine for GGUF-based inference, providing the native runtime for CPU-based execution, quantization tools, and an OpenAI-compatible API that directly fulfills all your requirements.

Why does nomic-ai/gpt4all match “an engine for running quantized models on CPU”?

GPT4All is a comprehensive local inference engine that natively supports GGUF models, CPU-based execution, and quantization, while providing an OpenAI-compatible API for seamless integration.

Why does menloresearch/jan match “an engine for running quantized models on CPU”?

Jan is a desktop application that provides a user-friendly interface for local LLM inference, supporting GGUF models and offering an OpenAI-compatible API server for local execution on your hardware.

Why does ericlbuehler/mistral.rs match “an engine for running quantized models on CPU”?

This is a comprehensive local LLM inference engine that natively supports GGUF format, CPU-based execution, and quantization, while providing OpenAI-compatible APIs for multi-model serving.

Why does lostruins/koboldcpp match “an engine for running quantized models on CPU”?

KoboldCPP is a dedicated local inference engine built specifically to run quantized GGUF models on CPU and GPU hardware, providing the exact OpenAI-compatible API and multi-model support required for local LLM deployment.

CPU GGUF Model Inference Engines

High-performance software libraries and runtimes designed to execute quantized GGUF machine learning models on CPUs.

Find the best repos with AI.We'll search the best matching repositories with AI.

ggerganov/llama.cpp
ggerganov/llama.cpp
116,912View on GitHub
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal
This is the primary engine for GGUF-based inference, providing the native runtime for CPU-based execution, quantization tools, and an OpenAI-compatible API that directly fulfills all your requirements.
C++OpenAI-Compatible APIsOpenAI-Compatible Inference ServersWeight Quantization
View on GitHub116,912
nomic-ai/gpt4all
nomic-ai/gpt4all
77,375View on GitHub
GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a comprehensive ecosystem for managing the entire model lifecycle, including discovery, downloading, and configuration of local weights. What distinguishes the platform is its integrated retrieval-augmented generation engine, which allows users to index local documents into semantic vect
GPT4All is a comprehensive local inference engine that natively supports GGUF models, CPU-based execution, and quantization, while providing an OpenAI-compatible API for seamless integration.
C++OpenAI-CompatibleOpenAI-Compatible APIsLocal API Servers
View on GitHub77,375
menloresearch/jan
menloresearch/jan
43,052View on GitHub
Jan is a local language model desktop application and AI assistant orchestrator. It provides a unified interface for interacting with both resident models and remote cloud AI providers. The project functions as a host for the Model Context Protocol, connecting AI models to external tools and data sources. It also operates as an OpenAI compatible API server, exposing local models through a standardized server endpoint for other applications to query. The system supports the creation of specialized AI personas with custom instructions and allows for the management of hybrid model environments,
Jan is a desktop application that provides a user-friendly interface for local LLM inference, supporting GGUF models and offering an OpenAI-compatible API server for local execution on your hardware.
TypeScriptLocal API ServersOpenAI-Compatible Servers
View on GitHub43,052
ericlbuehler/mistral.rs
EricLBuehler/mistral.rs
6,597View on GitHub
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool exe
This is a comprehensive local LLM inference engine that natively supports GGUF format, CPU-based execution, and quantization, while providing OpenAI-compatible APIs for multi-model serving.
RustOpenAI-CompatibleOpenAI-Compatible APIsMulti-Model Servers
View on GitHub6,597
lostruins/koboldcpp
LostRuins/koboldcpp
9,511View on GitHub
KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models. The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, im
KoboldCPP is a dedicated local inference engine built specifically to run quantized GGUF models on CPU and GPU hardware, providing the exact OpenAI-compatible API and multi-model support required for local LLM deployment.
C++OpenAI-Compatible APIsWeight Quantization
View on GitHub9,511
tiiny-ai/powerinfer
Tiiny-AI/PowerInfer
8,714View on GitHub
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
PowerInfer is a high-performance local inference engine that supports GGUF models, quantization, and OpenAI-compatible APIs, making it a capable tool for running LLMs on consumer hardware.
C++OpenAI-Compatible APIsOpenAI-Compatible Inference ServersWeight Quantization
View on GitHub8,714
josstorer/rwkv-runner
josStorer/RWKV-Runner
6,219View on GitHub
This tool provides a user-friendly interface and server for running various local models, including support for GGUF and CPU-based inference through its underlying backends.
TypeScriptOpenAI-CompatibleOpenAI-Compatible Servers
View on GitHub6,219
openbmb/minicpm
OpenBMB/MiniCPM
9,464View on GitHub
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
MiniCPM provides a comprehensive suite for running quantized models on consumer hardware, including native GGUF support and OpenAI-compatible APIs for local inference.
Jupyter NotebookGGUF Weight QuantizationOpenAI-Compatible APIsWeight Quantization
View on GitHub9,464
mlc-ai/mlc-llm
mlc-ai/mlc-llm
22,057View on GitHub
MLC LLM is a machine learning compiler and inference engine designed to execute large language models locally across diverse hardware platforms, including desktop, mobile, and web environments. By utilizing machine learning compilation, the project transforms high-level model definitions into specialized, hardware-specific binary libraries. This process optimizes model weights and generates compute kernels tailored to the unique memory and processing characteristics of target graphics and mobile hardware. The engine distinguishes itself by providing a unified runtime abstraction that enables
MLC LLM is a high-performance inference engine that supports local execution and quantization, though it focuses on ahead-of-time compilation for specific hardware rather than the native GGUF format typically used by CPU-centric runtimes.
PythonOpenAI-Compatible APIsLocal API Servers
View on GitHub22,057
intel/ipex-llm
intel/ipex-llm
8,836View on GitHub
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
This library provides a specialized inference engine optimized for Intel hardware that supports GGUF format, quantization, and OpenAI-compatible API serving for local LLM execution.
PythonGGUF ExecutionOpenAI-Compatible APIsWeight Quantization
View on GitHub8,836
abetlen/llama-cpp-python
abetlen/llama-cpp-python
9,993View on GitHub
llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs. The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory acro
This library provides a Python interface for running GGUF-formatted models on CPU hardware with support for quantization and an OpenAI-compatible API, making it a functional tool for local LLM inference.
PythonOpenAI-Compatible Inference ServersOpenAI-Compatible Servers
View on GitHub9,993
ggml-org/llama.cpp
ggml-org/llama.cpp
116,799View on GitHub
Llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU architectures. The project distinguishes itself by offering a lightweight HTTP server that adheres to standard API specifications, enabling chat completion, embeddings, and reranking services. It includes a suite of tools for model quantization and conversion, which reduces memory us
This is the primary inference engine for running quantized GGUF models on CPU hardware, offering the exact API compatibility, quantization tools, and multi-model support required for local LLM execution.
C++Hardware Abstraction LayersText-Only Inference EnginesMultimodal Inference Engines
View on GitHub116,799
city96/comfyui-gguf
city96/ComfyUI-GGUF
3,291View on GitHub
ComfyUI-GGUF is a memory optimizer and model loader for ComfyUI that enables the execution of large transformer-based generative models using quantized weights. It provides a system for loading GGUF formatted weights within a node-based diffusion interface to reduce GPU memory consumption. The project includes a quantization tool for converting standard model checkpoints into compressed binary formats and a tensor fixer to restore missing keys and correct architectures in binary model files. These utilities ensure that compressed models remain functional during inference on hardware with limi
This is a custom node suite for the ComfyUI diffusion interface rather than a standalone local LLM inference engine, making it a specialized plugin for image generation workflows instead of a general-purpose LLM runtime.
PythonGGUF ExecutionGGUF Weight QuantizationGGUF Format Conversions
View on GitHub3,291
opennmt/ctranslate2
OpenNMT/CTranslate2
4,319View on GitHub
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
CTranslate2 is a high-performance inference engine that supports CPU execution, quantization, and OpenAI-compatible APIs, though it uses its own optimized binary format rather than native GGUF support.
C++CPU Inference RuntimesOpenAI-Compatible APIsWeight Quantization
View on GitHub4,319
sakurallm/sakurallm
SakuraLLM/SakuraLLM
4,618View on GitHub
SakuraLLM is a multi-format document translation system that hosts large language models for translating Japanese text into other languages. It functions as an inference server that exposes translation models through an OpenAI-compatible API, allowing any tool supporting the OpenAI client format to send translation requests. The system is designed as a glossary-aware translation engine that applies user-defined term dictionaries to ensure consistent translation of proper nouns and names across outputs. The project distinguishes itself by supporting multiple high-performance inference backends
This is a specialized translation server that leverages standard local LLM inference backends like llama.cpp and Ollama to run models on CPU and GPU hardware, effectively serving as a wrapper for the requested inference capabilities.
PythonOpenAI-Compatible APIsllama.cpp Backend Runners
View on GitHub4,618
karpathy/nanochat
karpathy/nanochat
55,103View on GitHub
Nanochat is a lightweight execution environment designed for training and running language models on standard consumer hardware. It functions as both a neural network training framework and an inference engine, enabling users to perform backpropagation-based training and model execution directly on general-purpose processors without the need for dedicated graphics hardware. The project distinguishes itself through a suite of optimization tools that prioritize efficiency on local machines. By utilizing memory-mapped weight loading and CPU-optimized vector math, it maximizes throughput for inte
This framework provides a CPU-optimized environment for local model execution and quantization, though it lacks explicit mention of GGUF format support required for full compatibility with standard quantized LLM ecosystems.
PythonLocal Execution Environments
View on GitHub55,103
qwenlm/qwen
QwenLM/Qwen
21,294View on GitHub
Qwen is a comprehensive framework for large language model development, serving, and deployment. It provides a complete ecosystem for transformer-based sequence modeling, offering base models alongside specialized tools for instruction-tuned alignment, fine-tuning, and long-context inference. The project is designed to support both research and production environments, enabling users to train, optimize, and host generative models locally or across distributed hardware. The framework distinguishes itself through its focus on high-performance serving and extensibility. It features a high-perfor
This framework provides a comprehensive environment for serving and deploying large language models, including native support for CPU inference, quantization, and OpenAI-compatible APIs, though it is primarily focused on the Qwen model family rather than being a general-purpose GGUF runtime.
PythonOpenAI-Compatible APIs
View on GitHub21,294
fauxpilot/fauxpilot
fauxpilot/fauxpilot
14,732View on GitHub
Fauxpilot is a self-hosted AI coding assistant and local inference server. It functions as a proxy and API gateway that redirects traffic from IDE plugins to a local large language model, allowing for AI-assisted programming without external cloud dependencies. The project provides a specialized API emulation layer that mimics coding assistant protocols and a standardized OpenAI-compatible interface. This enables supported code editors to use local models for completions and suggestions by overriding default proxy URLs. The system includes capabilities for downloading and deploying local mod
This project is an API gateway and proxy designed to emulate coding assistant protocols, rather than an inference engine itself, though it can be configured to interface with one.
PythonOpenAI-CompatibleOpenAI-Compatible APIs
View on GitHub14,732
openvinotoolkit/openvino
openvinotoolkit/openvino
10,414View on GitHub
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
OpenVINO is a comprehensive inference engine that supports CPU-based execution and quantization for generative AI models, though it requires model conversion to its native format rather than direct GGUF execution.
C++OpenAI-Compatible APIsWeight Quantization
View on GitHub10,414
meta-llama/llama
meta-llama/llama
59,464View on GitHub
Llama is a computational framework and runtime environment designed for executing transformer-based neural networks locally. It functions as a generative AI inference engine, enabling the processing of input sequences through pre-trained model weights to produce text completions and structured data outputs directly on your own hardware. The system distinguishes itself through specialized memory and computation management techniques, including memory-mapped weight loading and quantization-aware inference, which allow for efficient execution on standard consumer hardware. It utilizes a stateles
This framework provides a robust runtime for executing transformer-based models locally on consumer hardware with support for quantization and memory-mapped loading, though it is primarily a research-focused implementation rather than a drop-in GGUF-native inference server.
PythonInference EnginesLarge Language Model RuntimesLocal Inference Engines
View on GitHub59,464
unslothai/unsloth
unslothai/unsloth
66,628View on GitHub
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware. The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fin
Unsloth is primarily a high-performance fine-tuning platform that includes a local inference runtime capable of executing models, though its core focus remains on training optimization rather than serving as a dedicated GGUF-specific inference engine.
PythonLanguage Model TrainingCustom Kernel AcceleratorsEfficient Training Pipelines
View on GitHub66,628
liguodongiot/llm-action
liguodongiot/llm-action
23,169View on GitHub
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facili
This framework provides a comprehensive environment for local model inference and includes the necessary utilities for quantization and weight conversion to support running models on consumer-grade hardware.
HTMLDistributed Deep Learning FrameworksLanguage Model Fine-TuningLanguage Model Fine-Tuning
View on GitHub23,169

CPU GGUF Model Inference Engines

ggerganov/llama.cpp

nomic-ai/gpt4all

menloresearch/jan

EricLBuehler/mistral.rs

LostRuins/koboldcpp

Tiiny-AI/PowerInfer

josStorer/RWKV-Runner

OpenBMB/MiniCPM

mlc-ai/mlc-llm

intel/ipex-llm

abetlen/llama-cpp-python

ggml-org/llama.cpp

city96/ComfyUI-GGUF

OpenNMT/CTranslate2

SakuraLLM/SakuraLLM

karpathy/nanochat

QwenLM/Qwen

fauxpilot/fauxpilot

openvinotoolkit/openvino

meta-llama/llama

unslothai/unsloth

liguodongiot/llm-action