Mistral.rs

mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware.

The project distinguishes itself through an agentic tool execution framework that runs server-side tools like code execution, shell commands, and web search in an automated loop during model generation, with session state persistence. It provides an in-process inference engine that can be embedded directly into Rust or Python applications without a separate server process, and includes an in-situ quantization engine that converts model weights to lower precision at load time with per-layer tuning. The system supports structured output constraints, forcing model output to conform to JSON Schema or grammar specifications during decoding, and offers automatic architecture detection that identifies model type, quantization format, and chat template from a Hugging Face model ID.

The platform includes capabilities for managing LoRA adapters, composing models as mixture-of-experts configurations, and running distributed inference across multiple GPUs or nodes using tensor parallelism and ring transport. It provides a built-in web chat interface, supports speculative decoding with a smaller assistant model, and offers benchmarking, logging, and Prometheus metrics for monitoring. The project can be run from a configuration file, with options for customizing build processes, tuning hardware settings automatically, and managing model caches.

Features

Local Model Serving - Runs open-weight language models locally behind OpenAI and Anthropic-compatible APIs for private inference.
Agentic Execution Loops - Automatically runs server-side tools like web search and code execution in a generation loop.
Agent Tool Execution - Runs server-side tools like web search and code execution during chat generation.
OpenAI-Compatible APIs - Exposes local models behind an OpenAI-compatible API for use with existing SDKs and clients.
Anthropic-Compatible Endpoints - Exposes /v1/messages endpoints compatible with the Anthropic API specification.
Multi-Protocol API Servers - Serves both OpenAI and Anthropic compatible API endpoints from a single server process.
OpenAI-Compatible - Exposes models behind OpenAI-compatible endpoints for tool definition and invocation.
Chat Completion Services - Provides a chat completion endpoint that accepts messages and returns model responses with streaming and tool support.
LLM Response Streaming - Streams generated tokens incrementally as they are produced for real-time display.
Multi-Source Model Loaders - Loads models from Hugging Face repos, local directories, or GGUF files with automatic architecture detection.
Quantized Model Deployments - Loads and runs models with reduced precision to lower memory use and speed up inference on consumer hardware.
Speculative Decoding Strategies - Uses a smaller assistant model to predict multiple tokens per step, speeding up inference on the target model.
Local Inference Engines - Loads and runs large language models locally with support for quantization, tool calling, and multimodal inputs.
Unified Runner Interfaces - Loads any supported model architecture and sends text or multimodal generation requests through a unified runner interface.
Model Architecture Detectors - Automatically detects model type, quantization format, and chat template from a Hugging Face ID.
Model Quantization - Stores model weights at lower precision to reduce memory footprint, with automatic fallback from prebuilt files to runtime conversion.
Multi-Model Servers - Loads and serves multiple models from a single server process, each with its own engine.
Model Serving Platforms - Serves multiple models from a single process with per-request routing and on-demand loading and unloading.
Multimodal Model Runners - Loads and runs models that process text alongside images, audio, or video inputs.
In-Situ Quantization Engines - Converts model weights to lower precision at load time with per-layer tuning.
Python SDK Embeddings - Embeds the inference engine directly into Python applications via a Runner class.
Adapter-Aware Runtimes - A runtime that applies quantization and LoRA adapters at load time to reduce memory use and add task-specific behavior.
On-Load Quantizers - Applies quantization at load time, auto-selecting level based on hardware and using prebuilt files or in-situ conversion.
Rust SDK Embeddings - Embeds the inference engine directly into Rust applications via the mistralrs crate.
Tool Argument Constraints - Enforces JSON Schema on tool call arguments during decoding to prevent malformed output.
Streaming Text Generation - Streams generated text token-by-token as it is produced for real-time output display.
Structured Output Enforcements - Forces model output to conform to JSON Schema or grammar during decoding for predictable responses.
Tool-Execution Loops - Enables models to call server-side tools in an automated loop during generation.
Server-Side Execution Loops - Runs the full tool execution loop server-side, returning only the final reply.
Multimodal Inference - Processes text, images, video, audio, and speech inputs together in a single inference engine.
Text Model Runners - Loads and runs text-only language models from Hugging Face with auto-detected architecture.
Zero-Configuration Model Launchers - Auto-detects model architecture, quantization, and chat template from a Hugging Face ID with no configuration needed.
Multimodal Generation - Processes text, image, video, audio, and speech inputs within a single inference engine for unified generation.
In-Process Inference Engines - Runs model inference directly in the host process without a separate server.
Multi-Model Server Architectures - Loads several models in a single server process with per-request routing.
In-Process Inference Engines - Runs model inference in-process using the mistralrs crate with quantization and streaming support.
Server-Side Provider Tool Execution - Registers callbacks that run tools server-side and feed results into the generation loop.
Model Serving - Starts an HTTP server that exposes the model for inference and optionally serves a web UI.
OpenAI-Compatible Servers - Exposes local models behind OpenAI and Anthropic-compatible endpoints for use with existing SDKs and clients.
CUDA Attention Kernel Tuners - Enables or disables CUDA decode graphs, FlashInfer paged attention, and forces the MoE expert backend.
Model Performance Benchmarks - Runs performance benchmarks measuring generation speed and throughput for plain text generation.
Output Constraint Engines - Forces model output to conform to JSON Schema or grammar during decoding.
Built-In Tool Configurations - Configures built-in Python and shell executors or web-search tools that the model can invoke during generation.
Layer Placement Strategies - Assigns individual model layers to different hardware devices to optimize memory and compute usage.
Interactive Model Runners - Provides a CLI command to load a model and interact with it via streaming chat.
Offline Model Runners - Operates entirely from a local cache or disk path without network calls to the Hugging Face Hub.
Selective Layer Quantizers - Restricts quantization to only Mixture-of-Experts expert layers, leaving the shared trunk at native precision.
Automatic Hardware Tuners - Recommends optimal quantization and device mapping based on the model config and detected hardware.
Model-Hardware Tuning Recommenders - Recommends quantization and device mapping based on model config and detected hardware.
Architecture Variant Selectors - Selects which model architecture variant to load, such as plain, multimodal, or tool-calling versions.
Runtime Model Swapping - Switches between multiple loaded models without restarting the server.
Runtime Quantizers - Applies a chosen quantization format to a loaded model during execution, reducing memory and compute requirements.
Messages API Endpoints - Serves an endpoint compatible with the Anthropic Messages API for model interaction.
Dynamic Model Reloading - Frees a model's memory or restores it at runtime through dedicated API endpoints without restarting the server.
LoRA Adapter Loaders - Loads LoRA adapters on top of a base model to add task-specific behavior without altering the original weights.
Pre-built Tool Integrations - Activates pre-built tools like web search and code execution via a server flag.
Layer-Specific Quantizers - Applies different quantization levels to specific layer ranges or individual weights, mixing precision within a single model.
Quantized Model Implementations - Loads models already quantized in GPTQ or AWQ formats directly from Hugging Face with automatic format detection.
Live Re-Quantizers - Swaps every eligible layer to a new quantization type on a live server without restarting.
Quantized Model Exporters - Writes the quantized weights into a reusable UQFF file so subsequent loads skip the conversion step.
Text Embedding Generators - Generates vector embeddings from input text for use in semantic search and retrieval tasks.
Client-Side Execution Loops - Emits tool call requests in OpenAI format for client-side execution and result return.
Chat - Attaches uploaded files as content parts in chat completion requests for model analysis.
Quantization Backend Selectors - Resolves a numeric shorthand to the optimal quantization format for the detected hardware backend.
Strict Tool Argument Validators - Validates tool call arguments against JSON Schema during decoding to prevent malformed parameters.
Multi-GPU Layer Distribution - Splits model layers across GPUs using tensor parallelism or layer mapping to fit models larger than a single GPU's memory.
Request Routing by Model ID - Routes inference requests to specific loaded models by ID, with fallback to a default.
Agent Action Approval Policies - Registers a callback that reviews and approves or rejects each action an agentic model proposes before execution.
Session State Persistence - Maintains persistent Python subprocess state across multiple calls within a session.
Python Execution Sandboxes - Runs user-generated Python in subprocesses with output capture for model consumption.
Persistent Python Sessions - Provides persistent Python sessions with state retention and multimodal output for agentic workflows.
MCP Connectivity - Connects to MCP servers at startup and merges their tools into the model's available tool set.
Shell Command Execution - Runs shell commands in persistent sessions with optional sandboxing for agentic workflows.
Sandboxed Shell Executions - Executes shell commands in sandboxed sessions with approval controls for safe agentic use.
Execution Environment Configurations - Sets the shell path, timeout, working directory, and permission mode via CLI flags or SDK configuration.
TOML Configuration Launchers - Starts the model using settings defined in a full TOML configuration file.
Code Execution Sandboxes - Configures Python interpreter, timeout, working directory, and sandbox level for secure code execution.
Memory Capacity Estimators - Analyzes a model's architecture and available VRAM to recommend a quantization level before downloading any weights.
Tool Selection Constraints - Forces, disables, or limits which tools the model may call per request using tool_choice options.
Code Execution Configurations - Ships configuration options for built-in code executors used during agentic generation.
Distributed Inference Orchestrators - Spreads a model across multiple GPUs using tensor parallelism or layer mapping, with NCCL or ring transport for communication.
Shell Session Persistence - Maintains persistent per-session subprocesses for shell commands with state preservation.
AI Inference Benchmarks - Runs throughput and latency benchmarks for a given model and hardware configuration via a single CLI command.
Prometheus-Based Metric Exporters - Publishes per-request counts and latency labeled by method, route, and status at a /metrics endpoint for monitoring.
AI File Management - Manages user files that can be referenced in API requests or generated by the model.
Chat Interfaces - Ships a built-in web chat interface at /ui that shows reasoning, code execution, plots, and files.
Web Chat Interfaces - Provides a web UI at /ui that displays reasoning, code execution, plots, and files.
Artificial Intelligence - Fast LLM inference engine supporting multimodal models and quantization.
Inference and Serving - High-speed inference engine for language models.
Inference Engines - Fast and flexible inference engine for various models.
Machine Learning Frameworks - High-performance LLM inference engine.

abetlen/llama-cpp-python

9,993View on GitHub

llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs. The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory acro

ModelTC/LightLLM

3,901View on GitHub

LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula

sgl-project/sglang

29,079View on GitHub

Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr

langroid/langroid

3,894View on GitHub

Langroid is a multi-agent orchestration framework and tool integration suite designed for building complex AI applications. It serves as a multi-modal integration layer that connects diverse local and remote language models with an agentic retrieval-augmented generation system. The project distinguishes itself through a collaborative message-exchange paradigm, allowing specialized agents to delegate tasks hierarchically and coordinate via structured communication. It features an advanced state management system for conversational AI, including the ability to rewind and prune conversation hist

EricLBuehlermistral.rs

Features