llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs.
The main features of abetlen/llama-cpp-python are: LLM Python Bindings, Chat Completion Services, Embedding Generators, Local Model Serving, Inference API Servers, Model Loading, Model Loaders, Model Quantization.
Open-source alternatives to abetlen/llama-cpp-python include: sgl-project/sglang — Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It… ericlbuehler/mistral.rs — mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and… openvinotoolkit/openvino — OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models… ggerganov/llama.cpp — llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across… lostruins/koboldcpp — KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models… microsoft/onnxruntime — This project is a cross-platform machine learning inference engine designed to execute pre-trained models across…
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool exe
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal