Mini Sglang | Awesome Repository

mini-sglang is a collection of tools for large language model inference, serving as an OpenAI-compatible inference server, a memory-efficient prefill engine, and a tensor parallelism runtime. It also functions as a local batch processing engine for offline benchmarking and ablation studies.

The project focuses on acceleration and memory management through a KV cache manager that reuses precomputed caches for shared request prefixes. It handles large model workloads by distributing tasks across multiple GPUs and manages peak memory consumption by splitting long input sequences into smaller chunks during the prefill phase.

The system supports both network-based API serving and local execution, including a terminal-based shell for interactive model chat.

Features

Large Language Model Serving - Provides a high-throughput inference server for hosting and serving large language models.
LLM Inference Servers - Acts as a production-ready inference server for hosting large language models with high throughput.
Prefix Cache Reuse - Eliminates redundant computations by sharing and reusing key-value caches for common request prefixes.
Prefill Phase Optimizations - Implements chunked prefill execution to maintain a constant memory ceiling during initial sequence processing.

Features

Large Language Model Serving - Provides a high-throughput inference server for hosting and serving large language models.
LLM Inference Servers - Acts as a production-ready inference server for hosting large language models with high throughput.
Prefix Cache Reuse - Eliminates redundant computations by sharing and reusing key-value caches for common request prefixes.
Prefill Phase Optimizations - Implements chunked prefill execution to maintain a constant memory ceiling during initial sequence processing.

The system supports both network-based API serving and local execution, including a terminal-based shell for interactive model chat.