Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems.
The system distinguishes itself through a disaggregated architecture that separates compute-intensive prompt processing from memory-intensive token generation across distinct hardware nodes. This approach, combined with a continuous batching engine and graph-captured kernel execution, maximizes hardware utilization and throughput. It also features dynamic adapter injection, allowing for the runtime switching of fine-tuning modules without requiring server restarts, and a hierarchical key-value cache management system that distributes state across GPU, host RAM, and external storage to support extended context windows.
Beyond core serving, the project includes comprehensive capabilities for structured output generation, enforcing machine-readable formats like JSON schemas and regular expressions during the inference process. It supports advanced performance techniques such as speculative decoding, multi-token prediction, and sparse attention mechanisms. The engine also provides robust tools for traffic management, reliability enforcement, and distributed observability, ensuring consistent performance across heterogeneous hardware clusters.