Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request.
The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
The project provides a standardized interface for chat and completions that is compatible with common API protocols, supporting structured outputs via JSON schema enforcement. Its performance surface includes tensor parallelism, speculative decoding, paged attention, and model weight quantization to reduce latency and memory overhead.
Infrastructure is managed through Helm charts for Kubernetes orchestration, with integrated telemetry exported via Prometheus and Open Telemetry.