LightLLM

LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images.

The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs speculative decoding, paged key-value cache management, and a separated prefill and decode pipeline.

The platform covers a broad range of operational capabilities, including tensor and data parallelism for scaling across hardware, multi-tier cache offloading for long context windows, and tool use integration for executing external functions. It also provides a standard interface for chat completions and dedicated tools for measuring request throughput and latency under real-world workloads.

The project is implemented in Python and includes base classes for integrating custom model architectures.

Features

Inference Execution - Executes the full inference pipeline through sequential pre-layer, transformer, and post-layer stages.
LLM Serving Architectures - Provides a high-performance serving architecture designed to deploy and serve large language models at scale.
Fused MoE GPU Kernels - Optimizes Mixture-of-Experts execution using fused GPU kernels and expert parallelism for high throughput.
LLM Inference Servers - Functions as a production-ready server specifically designed for hosting and serving large language models.
OpenAI-Compatible APIs - Provides a compatible interface for chat completions that adheres to the OpenAI API specification.
Hybrid Model Parallelism - Distributes workloads across hardware using tensor, data, and expert parallelism combined with dynamic caching.
KV Cache Optimizations - Allocates and deallocates memory for key-value caches on a per-token basis to prevent memory fragmentation.
High-Throughput Text Inference - Optimizes text-generation throughput by predicting multiple tokens simultaneously and reusing existing caches.
Request Schedulers - Manages request queues and optimizes the batching of prefill and decode operations to maximize overall throughput.
Speculative Decoding Strategies - Employs speculative decoding to predict multiple future tokens in a single step, reducing response generation time.
Mixture-of-Experts Inference Optimizers - Provides a specialized framework for hosting Mixture-of-Experts models using fused kernels and expert parallelism.
Multi-GPU Distribution - Splits model parameters across multiple GPUs to overcome memory limitations and increase throughput.
Weight Distribution - Splits model weights for embedding and transformer layers across multiple devices based on tensor parallelism.
Multimodal Model Runners - Runs multimodal models that process text and images using optimized vision-based inference pipelines.
Cross-Instance KV Cache Transfers - Shifts stored keys and values between distributed instances to avoid repeating computations across parallel networks.
Cross-Rank Cache Transfers - Transfers prefix caches between parallel processing ranks to eliminate redundant calculations and increase throughput.
Prefill-Decode Disaggregation - Separates the compute-intensive prefill and memory-intensive decoding phases into distinct services.
Streaming Text Generation - Sends completion results incrementally as they are produced for real-time interactive text display.
Tensor Parallelism - Splits model weights across multiple GPUs using tensor parallelism to handle models exceeding single-device memory.
Text Generation APIs - Provides a high-performance interface for generating text responses based on input prompts.
Distributed Parallelism - Scales inference throughput by distributing workloads across multiple GPUs using tensor, data, and expert parallelism.
Tiered - Moves key-value caches between GPU, CPU, and disk using eviction rules to handle long context windows.
MoE Deployments - Hosts large language models using a Mixture-of-Experts architecture to enable high-speed inference.
Image Embedding Caching - Stores visual embeddings using hashing and eviction rules to avoid repeating expensive image processing.
Inference Batching - Merges new requests into active inference batches by calculating estimated token usage against hardware capacity.
GPU Parallelism Partitioners - Distributes model workloads across multiple GPUs using tensor, data, and expert parallelism clusters for distributed execution.
Paged KV Cache Management - Implements paged memory management for key-value caches to eliminate fragmentation and optimize memory.
Asynchronous Inference Coordination - Coordinates encoding, inference, and decoding asynchronously across multiple processes to ensure hardware remains fully active.
Inference Speed Profiling - Includes detailed profiling for prefill and decode stage throughput and latency across multi-GPU configurations.
Function Calling Interfaces - Parses model outputs into tool calls based on defined schemas to execute external logic.
Guided Text Generation - Constrains model output to follow precise formats using deterministic state machines and pushdown automata.
Activation and KV Cache Offloaders - Manages long context windows by offloading key-value caches between GPU, CPU, and disk.
Native Tool Call Parsers - Extracts tool requests from model outputs in XML or standard formats during incremental streaming.
SLA-Based Scheduling - Manages the execution order of incoming requests to maintain service level agreement guarantees.
Parallel Function Calling - Converts model outputs into structured calls to trigger multiple external functions in parallel.
Multimodal Embedding Caches - Stores visual embeddings using hashing to avoid redundant image processing across multiple requests.
Model Serving Interfaces - Provides a unified interface for deploying and serving models that process both text and images.
Structured Output Generators - Forces language models to produce strictly typed, machine-readable data formats via constrained decoding.
Precision Quantization - Supports reduced bit-depth formats like int8 and fp8 to lower memory usage and increase token throughput.
Constrained Decoding - Enforces structured text generation using deterministic state machines to ensure responses follow precise formats.
Tool-Use Integrations - Connects models to external functions by parsing model outputs into structured tool calls for task execution.
Multimodal Inference - Optimizes the inference of vision-language models using shared-memory feature caches for image embeddings.
Inference State Management - Passes critical model information between layers using a customizable state object during the processing cycle.
Structural Constraint Enforcers - Uses deterministic pushdown automata to guarantee that generated text adheres to specific structural rules.
CPU-GPU Architecture Unification - Unifies folding architectures between the CPU and GPU to minimize system-level processing delays during execution.
Generation State Machines - Employs generation state machines to restrict token selection and enforce structured output formats.
Generative Model Serving Benchmarks - Includes tools for measuring throughput and latency of served models to evaluate and compare serving configurations.
Real-World Workload Benchmarking - Measures inference performance using industry-standard datasets to simulate authentic human conversation patterns.
Token Throughput Measurement - Evaluates queries per second and token throughput using customizable input lengths and request rates.
Multimodal Input Processors - Processes image data into tensors and maintains a shared-memory cache to avoid redundant vision processing.
Inference and Serving - Lightweight and scalable framework for inference and serving.
Inference Frameworks - Python-based framework optimized for high-performance serving.
Model Serving & Deployment - Provides a lightweight, high-speed LLM inference framework.
Inference Frameworks - Lightweight inference framework with efficient KV cache management.

sgl-project/sglang

29,079View on GitHub

Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr

predibase/lorax

3,724View on GitHub

Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.

ai-dynamo/dynamo

6,112View on GitHub

Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and

EricLBuehler/mistral.rs

6,597View on GitHub

mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool exe

ModelTCLightLLM

Features