What are the best open-source alternatives to Llama.cpp?

30 open-source projects similar to ggerganov/llama.cpp, ranked by shared features. Top picks: lostruins/koboldcpp, sgl-project/sglang, vllm-project/vllm, tiiny-ai/powerinfer, abetlen/llama-cpp-python, pytorch/executorch, opennmt/ctranslate2, openbmb/minicpm, nvidia/tensorrt-llm, mlc-ai/mlc-llm.

Is lostruins/koboldcpp a good alternative to Llama.cpp?

KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted mo…

Is sgl-project/sglang a good alternative to Llama.cpp?

Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains thr…

Is vllm-project/vllm a good alternative to Llama.cpp?

vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built…

Is tiiny-ai/powerinfer a good alternative to Llama.cpp?

PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes i…

Is abetlen/llama-cpp-python a good alternative to Llama.cpp?

llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both te…

Is pytorch/executorch a good alternative to Llama.cpp?

ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardwar…

Is opennmt/ctranslate2 a good alternative to Llama.cpp?

CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine transl…

Is openbmb/minicpm a good alternative to Llama.cpp?

MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project di…

Is nvidia/tensorrt-llm a good alternative to Llama.cpp?

TensorRT-LLM is a platform and toolkit designed for compiling, optimizing, and serving transformer-based models on accelerated hardware. It functions as a framework that transforms machine learning models into efficient execution graphs, providing an engine to refine these models for specific hardw…

Is mlc-ai/mlc-llm a good alternative to Llama.cpp?

MLC LLM is a machine learning compiler and inference engine designed to execute large language models locally across diverse hardware platforms, including desktop, mobile, and web environments. By utilizing machine learning compilation, the project transforms high-level model definitions into speci…

Back to ggerganov/llama.cpp

Open-source alternatives to Llama.cpp

30 open-source projects similar to ggerganov/llama.cpp, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Llama.cpp alternative.

lostruins/koboldcpp
LostRuins/koboldcpp
9,511View on GitHub
KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models. The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, im
C++gemmaggmlgguf
View on GitHub9,511
sgl-project/sglang
sgl-project/sglang
29,079View on GitHub
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Pythonattentionblackwellcuda
View on GitHub29,079
vllm-project/vllm
vllm-project/vllm
83,048View on GitHub
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token generation speed and memory efficiency, enabling both large-scale cloud deployments and local execution on personal hardware. The project distinguishes itself through advanced memory management and request scheduling techniques, most notably its use of non-contiguous key-value cach
Pythonamdblackwellcuda
View on GitHub83,048

Open-source alternatives to Llama.cpp

LostRuins/koboldcpp

sgl-project/sglang

vllm-project/vllm

Tiiny-AI/PowerInfer

abetlen/llama-cpp-python

pytorch/executorch

OpenNMT/CTranslate2

OpenBMB/MiniCPM

NVIDIA/TensorRT-LLM

mlc-ai/mlc-llm

SJTU-IPADS/PowerInfer

kvcache-ai/ktransformers

InternLM/lmdeploy

ModelTC/LightLLM

lm-sys/FastChat

Michael-A-Kuykendall/shimmy

nomic-ai/gpt4all

ggml-org/llama.cpp

antimatter15/alpaca.cpp

ggerganov/whisper.cpp

artidoro/qlora

openvinotoolkit/openvino

predibase/lorax

xenova/transformers.js

EricLBuehler/mistral.rs

jmorganca/ollama

microsoft/DeepSpeed

huggingface/text-generation-inference

xorbitsai/inference

ollama/ollama