# sgl-project/mini-sglang

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/sgl-project-mini-sglang).**

3,514 stars · 439 forks · Python · mit

## Links

- GitHub: https://github.com/sgl-project/mini-sglang
- awesome-repositories: https://awesome-repositories.com/repository/sgl-project-mini-sglang.md

## Description

mini-sglang is a collection of tools for large language model inference, serving as an OpenAI-compatible inference server, a memory-efficient prefill engine, and a tensor parallelism runtime. It also functions as a local batch processing engine for offline benchmarking and ablation studies.

The project focuses on acceleration and memory management through a KV cache manager that reuses precomputed caches for shared request prefixes. It handles large model workloads by distributing tasks across multiple GPUs and manages peak memory consumption by splitting long input sequences into smaller chunks during the prefill phase.

The system supports both network-based API serving and local execution, including a terminal-based shell for interactive model chat.

## Tags

### Artificial Intelligence & ML

- [Large Language Model Serving](https://awesome-repositories.com/f/artificial-intelligence-ml/large-language-model-serving.md) — Provides a high-throughput inference server for hosting and serving large language models. ([source](https://cdn.jsdelivr.net/gh/sgl-project/mini-sglang@main/README.md))
- [LLM Inference Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-deployment-servers/llm-inference-servers.md) — Acts as a production-ready inference server for hosting large language models with high throughput.
- [Prefix Cache Reuse](https://awesome-repositories.com/f/artificial-intelligence-ml/kv-cache-optimizations/kv-cache-aware-request-routing/prefix-cache-reuse.md) — Eliminates redundant computations by sharing and reusing key-value caches for common request prefixes.
- [Prefill Phase Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-optimization/prefill-phase-optimizations.md) — Implements chunked prefill execution to maintain a constant memory ceiling during initial sequence processing.
- [Tensor-Parallel Inference Distributions](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/model-deployment-toolkits/distributed-deployment-utilities/multi-gpu-distribution/tensor-parallel-inference-distributions.md) — Distributes model weights across multiple GPUs using tensor parallelism to increase memory and throughput.
- [OpenAI-Compatible Model Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-serving-apis/openai-compatible-model-servers.md) — Serves as an OpenAI-compatible inference server for hosting large language models across a network.
- [Prefix Caching](https://awesome-repositories.com/f/artificial-intelligence-ml/prompt-caching/prefix-caching.md) — Reuses computed key-value caches for shared request prefixes to eliminate redundant calculations.
- [Tensor Parallelism](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-parallelism.md) — Distributes large model workloads across multiple GPUs using tensor parallelism to increase memory and computation speed.
- [OpenAI-Compatible APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/model-integration-serving/model-integration-interfaces/ai-integration-apis/openai-compatible-apis.md) — Exposes a standardized OpenAI-compatible API for seamless integration with existing LLM toolchains.
- [Inference Benchmarking Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-benchmarking-tools.md) — Provides a local batch processing engine for conducting model ablation studies and inference performance tests.
- [Offline Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/model-integration-pipelines/model-inference/offline-inference-engines.md) — Includes a local batch processing engine for offline model execution and performance ablation studies. ([source](https://cdn.jsdelivr.net/gh/sgl-project/mini-sglang@main/README.md))

### Part of an Awesome List

- [KV Cache Management](https://awesome-repositories.com/f/awesome-lists/ai/kv-cache-management.md) — Manages and tracks memory for key-value tensors to enable efficient retrieval of computed sequence states.
- [Inference Engines](https://awesome-repositories.com/f/awesome-lists/ai/inference-engines.md) — Lightweight, high-performance inference framework for LLMs.
- [Model Serving & Deployment](https://awesome-repositories.com/f/awesome-lists/ai/model-serving-deployment.md) — Offers a lightweight serving framework for LLMs.

### Data & Databases

- [Inference Batching](https://awesome-repositories.com/f/data-databases/request-batching/inference-batching.md) — Provides a local batch processing engine to maximize hardware utilization for offline benchmarking.
- [Long-Context Sequence Processors](https://awesome-repositories.com/f/data-databases/text-processing-pipelines/long-context-sequence-processors.md) — Splits long input sequences into smaller chunks during prefill to prevent peak memory spikes. ([source](https://cdn.jsdelivr.net/gh/sgl-project/mini-sglang@main/README.md))