Lorax | Awesome Repository

Features

GPU-Accelerated Inference - Provides a GPU-accelerated inference server that optimizes LLM performance through tensor parallelism and quantization.

Multi-Adapter Batching - Processes requests using different LoRA adapters in a single GPU forward pass to maximize throughput.

LLM Inference Servers - Serves large language models with high throughput and dynamic LoRA adapter swapping per request.

OpenAI-Compatible APIs - Provides a standardized HTTP interface compatible with OpenAI's API for chat and completions.

Inference Acceleration - Implements tensor parallelism, quantization, and paged attention to reduce latency and increase throughput during model execution.

Just-In-Time Weight Loading - Prefetches and offloads task-specific adapter weights between CPU and GPU memory in real time.

Large Language Model Serving - Hosts and exposes large language models from weight caches to serve as the foundation for inference.

Inference Optimizations - Optimizes inference throughput and latency using quantization, speculative decoding, and tensor parallelism.

Tensor-Parallel Inference Distributions - Splits model weights across multiple GPUs using tensor parallelism to serve models exceeding single-card memory.

LoRA Adapter Loaders - Deploys and serves models trained with Low-Rank Adaptation to improve response quality.

Dynamic Adapter Swapping - Dynamically swaps and batches parameter-efficient fine-tuning adapters at runtime to optimize GPU throughput.

LLM Completion Interfaces - Implements a standardized API interface for chat and completions compatible with common LLM client libraries.

Text Completion Engines - Generates text completions from prompts using single-response or streaming delivery.

Text Generation APIs - Provides APIs for generating text responses from prompts using specific adapters or structured formats.

Inference Batch Packing - Maximizes aggregate throughput by packing requests for different adapters into a single GPU forward pass.

Inference Batching - Groups multiple model inference requests into single hardware execution passes to maximize GPU throughput.

Token Streaming - Sends generated tokens incrementally as they are produced to reduce perceived latency.

Base Model Architecture Support - Supports a wide variety of large language model architectures to serve as foundations for fine-tuned adapters.

GPU Memory Optimizers - Provides tools to optimize VRAM usage by balancing memory between the KV cache and adapter storage.

Speculative Decoding Strategies - Uses draft models or projection layers to predict multiple tokens and accelerate generation speed.

Multi-GPU Distribution - Distributes large model weights across multiple GPUs to enable inference for models exceeding single-card memory.

Adapter-Aware Routing - Directs inference requests to specific nodes based on which LoRA adapters are currently loaded.

Weight Quantization - Reduces memory overhead by loading base models in low-precision formats while maintaining adapters in higher precision.

Paged Key-Value Cache Stores - Optimizes GPU memory by storing key-value caches in non-contiguous pages to reduce fragmentation.

Adapter Merging - Combines multiple adapters into a single ensemble per request using weighted merge strategies.

LLM Schema Outputs - Constrains model responses to valid JSON schemas to ensure predictable data formats for programmatic use.

JSON-Schema - Constrains model responses to valid JSON schemas for programmatic consumption.

Structured JSON Generation - Enforces JSON schema adherence during the generative process to ensure predictable data extraction.

Schema-Constrained Sampling - Restricts token selection during inference based on a JSON schema to force structured output.

Cloud Native GPU Orchestration - Manages and scales GPU resources through Kubernetes and Helm for high-performance model serving.

LLM Deployment Operators - Offers a containerized serving system managed via Helm for high-availability LLM deployments on Kubernetes.

Kubernetes Orchestration - Provides Helm charts and orchestration tools for deploying the server within Kubernetes clusters.

Adapter Management - Retrieves model adapters from local paths, cloud buckets, or hosted model hubs for runtime serving.

Adapter Offloading - Optimizes throughput by asynchronously prefetching and offloading adapters between GPU and CPU memory.

Inference Engines - Provides a multi-LoRA inference server for scaling fine-tuned model deployments.

Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request.

The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It loads adapter weights just in time and can merge several adapters per request using weighted merge strategies, so a single deployment can serve many fine-tuned tasks while keeping throughput high.
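The idea behind multi-adapter batching can be illustrated with a small, self-contained sketch. It is not Lorax's implementation: the function names, the pure-Python matrix helper, and the toy weights are all hypothetical, and a real server would do this with fused GPU kernels. The sketch only shows that the base projection is shared by every row in the batch, while each row adds the low-rank update of its own adapter.

```python
from collections import defaultdict


def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n x m)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def batched_multi_lora(xs, adapter_ids, base_w, adapters):
    """Compute y = x @ W + (x @ A) @ B per row, where (A, B) is the row's adapter.

    Hypothetical helper for illustration only. Rows whose adapter id is None
    use the base weights alone.
    """
    # The base projection is shared by every request in the batch.
    outputs = [matvec(x, base_w) for x in xs]

    # Group row indices by adapter so each adapter's weights are looked up once.
    groups = defaultdict(list)
    for row, adapter_id in enumerate(adapter_ids):
        if adapter_id is not None:
            groups[adapter_id].append(row)

    # Add the low-rank update (x @ A) @ B to the rows that use each adapter.
    for adapter_id, rows in groups.items():
        a, b = adapters[adapter_id]
        for row in rows:
            delta = matvec(matvec(xs[row], a), b)
            outputs[row] = [o + d for o, d in zip(outputs[row], delta)]
    return outputs


# Toy example: identity base weights and two rank-1 adapters.
base_w = [[1, 0], [0, 1]]
adapters = {
    "a": ([[1], [0]], [[0, 1]]),
    "b": ([[0], [1]], [[2, 0]]),
}
xs = [[1, 2], [3, 4], [5, 6]]
result = batched_multi_lora(xs, ["a", "b", None], base_w, adapters)
```

Because the adapter matrices are low rank, the per-adapter work is small compared with the shared base projection, which is why mixing adapters in one batch can stay close to single-model throughput.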

The project provides a standardized interface for chat and completions that is compatible with common API protocols, supporting structured outputs via JSON schema enforcement. Its performance surface includes tensor parallelism, speculative decoding, paged attention, and model weight quantization to reduce latency and memory overhead.
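As a rough illustration of the paged attention idea mentioned above, the following sketch shows a block table that maps each sequence's key-value cache onto fixed-size, non-contiguous pages. The class name, method names, and block sizes are hypothetical and chosen only for this example; it tracks page bookkeeping and does not store any tensors.

```python
class PagedKVCache:
    """Toy block-table allocator for a paged key-value cache (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical pages not in use
        self.block_tables = {}  # sequence id -> list of physical page ids
        self.seq_lens = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token and return (physical_page, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new page is needed only when the last page is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("s1") for _ in range(3)]
other = cache.append_token("s2")
```

Since pages are allocated on demand and returned individually, a sequence never needs one large contiguous region, which is what limits fragmentation.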

Infrastructure is managed through Helm charts for Kubernetes orchestration, with integrated telemetry exported via Prometheus and OpenTelemetry.
