Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool exe