30 open-source projects similar to ggml-org/llama.cpp, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Llama.cpp alternative.
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal
LiteLLM is a unified gateway and proxy server designed to centralize access to over one hundred language model providers. It provides a standardized API interface that abstracts vendor-specific schemas, allowing developers to interact with diverse models through a single, consistent format. By acting as a central traffic management layer, it enables organizations to route, secure, and govern model interactions across multiple deployments. The platform distinguishes itself through its policy-driven architecture, which uses configuration-based routing to manage traffic distribution, load balanc
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Airllm is a framework designed to execute and fine-tune large language models on consumer-grade hardware. By employing layer-wise model decomposition and memory-efficient loading techniques, the engine enables the operation of massive models that would otherwise exceed available system or video memory. The project distinguishes itself through a suite of optimization strategies that balance memory footprint with performance. It utilizes block-wise weight quantization and asynchronous layer prefetching to reduce resource consumption and hide data transfer latency. Additionally, the framework su
Ollama provides a framework for running and managing local machine learning models. It includes a command-line interface for model lifecycle management, such as creation, embedding generation, and configuration, alongside a stable API for programmatic interaction across multiple programming languages. The platform supports the import of models and adapters in various formats, including GGUF and Safetensors. Users can define custom model behaviors, prompt templates, and system messages through a configuration file format. It also offers tools for fine-tuning models with LoRA adapters and apply
This project is a comprehensive platform for hosting and interacting with large language models directly on local hardware. It provides a web-based graphical interface that allows users to manage model loading, configure generation parameters, and execute text or chat interactions entirely offline. By running models locally, the software ensures complete data privacy and eliminates reliance on external cloud services for generative tasks. Beyond basic inference, the platform functions as a versatile workbench for generative AI development. It includes an integrated pipeline for fine-tuning mo
OpenLLM is a framework for deploying, managing, and scaling open-source large language models
LangChain is an orchestration framework designed for building, managing, and deploying applications powered by large language models. It provides a unified integration layer that normalizes disparate model provider APIs into a consistent set of primitives, enabling developers to build complex, multi-step AI workflows that manage state, memory, and tool execution. The project distinguishes itself through a durable execution runtime that maintains persistent state across long-running processes by checkpointing progress to external storage. It models agent workflows as directed graphs, allowing
LocalAI is a self-hosted inference server that enables the execution of machine learning models directly on local hardware. By providing a unified interface for text, image, and audio processing, it allows users to maintain full control over data privacy and infrastructure costs while eliminating dependencies on external network services. The platform functions as an API gateway that mimics standard cloud-based artificial intelligence interfaces, allowing existing applications to integrate local models as drop-in replacements. It utilizes a container-based architecture to package runtimes and
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token generation speed and memory efficiency, enabling both large-scale cloud deployments and local execution on personal hardware. The project distinguishes itself through advanced memory management and request scheduling techniques, most notably its use of non-contiguous key-value cach
Open WebUI is a self-hosted, web-based platform designed for interacting with local and remote artificial intelligence models. It functions as a unified interface and orchestration suite, enabling users to build, deploy, and manage specialized AI agents equipped with custom instructions, external tool access, and private knowledge bases. The platform distinguishes itself through a modular architecture that supports complex AI workflows. It features a plugin-based framework for custom logic and pipeline-based request processing, allowing developers to filter or transform data streams before th
FastChat is a training and serving platform for large language models that provides an integrated toolkit for fine-tuning, hosting, and benchmarking chatbots. It functions as an inference server capable of hosting multiple models and exposing them via a standardized API for chat applications. The platform distinguishes itself through a distributed model controller that manages worker nodes and routes requests across a hardware-agnostic inference layer supporting various accelerators. It includes a dedicated evaluation framework for assessing model quality using automated judges, multi-turn di
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs. The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory acro
TensorRT-LLM is a platform and toolkit designed for compiling, optimizing, and serving transformer-based models on accelerated hardware. It functions as a framework that transforms machine learning models into efficient execution graphs, providing an engine to refine these models for specific hardware to maximize throughput and minimize latency during text generation. The project distinguishes itself through advanced execution strategies that manage the entire inference pipeline. It utilizes kernel-level fusion and static graph execution to optimize mathematical operations and computational f
GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a comprehensive ecosystem for managing the entire model lifecycle, including discovery, downloading, and configuration of local weights. What distinguishes the platform is its integrated retrieval-augmented generation engine, which allows users to index local documents into semantic vect
WebLLM is a library for executing large language models directly within web browsers. It provides a framework for building conversational artificial intelligence applications that perform inference locally, ensuring user data privacy by eliminating the need for external server dependencies. The project distinguishes itself by leveraging browser-native graphics APIs to perform intensive machine learning computations on the client side. It maintains application responsiveness by offloading heavy model tasks to background threads and ensures continuous operation through service workers that func
SkyPilot is a multi-cloud AI orchestrator and distributed task scheduler designed to launch and manage AI workloads across various cloud providers, Kubernetes, and Slurm clusters. It functions as an infrastructure-as-code framework that uses declarative files to define resource requirements and setup commands for consistent execution across different environments. The project differentiates itself through automated cost optimization, selecting the most affordable GPU or TPU hardware and managing spot instances to reduce expenses. It also provides a remote development environment that bridges
KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models. The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, im
Ktransformers is a comprehensive framework designed for the operation, fine-tuning, and serving of large language models. It functions as a heterogeneous inference engine and quantized execution runtime, enabling the deployment of massive models by distributing computational workloads across both CPU and GPU resources. This architecture allows users to bypass local memory constraints, making it possible to run and train models that exceed the capacity of a single device. The project distinguishes itself through specialized support for sparse architectures, particularly mixture-of-experts mode
lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
Shimmy is a local large language model inference engine and server that loads and serves GGUF formatted weights. It is distributed as a single binary runtime written in Rust, providing a standalone environment for running models without external runtime dependencies. The project utilizes WebGPU for hardware acceleration, allowing model compute kernels to execute across diverse graphics hardware through a standardized interface. It features a local server that implements an OpenAI-compatible API layer, enabling applications to interface with local models via standardized REST endpoints. Memor
This project is a web-based user interface for interacting with large language models, featuring streaming responses and persistent conversation history. It functions as an orchestration gateway that directs user prompts to specific language models and acts as a Model Context Protocol client to execute external tools and incorporate live data into conversations. The application includes a routing layer that analyzes input signals and tool requirements to dynamically direct messages to the most appropriate specialized model. It also provides customization settings for brand identity, allowing
Jan is a desktop application that functions as a local artificial intelligence model runtime and an open-standard API server. It enables the execution of large language models directly on local hardware, ensuring that data remains private and accessible offline while providing a unified interface for managing model weights and inference runtimes. The platform distinguishes itself by offering a modular inference backend that allows users to swap execution engines based on hardware compatibility and performance needs. It acts as a cross-platform orchestrator, providing the ability to switch bet
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool exe
This project is a large language model inference library and framework designed to run models for text generation, problem solving, and coding assistance. It includes a multimodal framework for processing combined image and text inputs and a tool-use implementation that enables the execution of external functions based on model reasoning. The system features a distributed GPU inference engine that spreads large model workloads across multiple graphics processors to increase processing speed and meet memory requirements. It also provides containerized model deployment through pre-packaged imag
The nexa-sdk is an on-device AI SDK and multimodal inference engine designed to run large language, vision, and audio models locally on mobile and desktop hardware. It functions as a local LLM runtime and NPU acceleration framework, enabling the execution of generative and discriminative models without reliance on cloud services. The project distinguishes itself through a dedicated NPU acceleration framework that optimizes model execution on Neural Processing Units to reduce latency and power consumption. It employs hardware-agnostic backend routing to dynamically distribute computations acro
BentoML is a machine learning model serving framework and GPU-accelerated inference server designed to package, deploy, and scale AI models as production-ready REST APIs. It functions as an AI model lifecycle manager and an inference graph orchestrator, enabling the chaining of multiple models and custom logic into complex pipelines for advanced task sequences. The framework distinguishes itself through a dynamic batching engine that optimizes GPU throughput and an artifact-based packaging system that bundles model weights and dependencies into immutable archives for consistent deployment. It
Triton Inference Server is a high-performance server designed to deploy machine learning models from multiple frameworks across GPUs and CPUs. It functions as a hardware-accelerated inference engine and a gRPC inference gateway, providing a standardized communication layer for transmitting binary tensor data with low latency. The system acts as a multi-framework model orchestrator, allowing users to link multiple AI models into ensembles and scripts to create complex inference pipelines. It also serves as a model lifecycle manager, providing controls to load, unload, and monitor the performan