SLIME is a distributed reinforcement learning framework for large language model post-training that bridges Megatron training with SGLang inference servers. It orchestrates scalable RL loops across GPU clusters by running training and inference as separate processes that communicate over HTTP and NCCL, so each side can scale and fail independently. The system supports multi-agent reinforcement learning workflows with parallel agent instances, customizable rollout strategies, and personalized agent serving, which improves models from prior conversations without interrupting the serving API.
The framework distinguishes itself through byte-level delta weight synchronization, which transfers only the byte positions that changed between the training and inference servers and so reduces bandwidth in cross-cluster deployments. It offers prefill-decode disaggregation with heterogeneous GPU group configurations, multi-token speculative decoding that uses the model's own multi-token prediction layer as the draft, and dynamic token-limited batching that maximizes throughput while preserving per-sample loss computation. A plugin-based customization interface exposes hooks for replacing generation, reward, and data-processing logic without modifying the core pipeline, and CPU-only contract tests validate custom implementations.
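The delta-synchronization idea can be illustrated with a minimal pure-Python sketch. It assumes both sides hold equal-length serialized copies of the same weight tensor and sends only runs of differing bytes as `(offset, bytes)` pairs. The function names are hypothetical and do not reflect SLIME's actual API, transport, or encoding.

```python
import struct
from typing import List, Tuple

# A delta is a list of (start offset, replacement bytes) runs.
Delta = List[Tuple[int, bytes]]


def compute_byte_delta(old: bytes, new: bytes) -> Delta:
    """Return the contiguous runs of byte positions where `new` differs from `old`."""
    if len(old) != len(new):
        raise ValueError("old and new buffers must have the same length")
    delta: Delta = []
    i, n = 0, len(new)
    while i < n:
        if old[i] == new[i]:
            i += 1
            continue
        start = i
        while i < n and old[i] != new[i]:  # extend the run of changed bytes
            i += 1
        delta.append((start, new[start:i]))
    return delta


def apply_byte_delta(buf: bytearray, delta: Delta) -> None:
    """Patch the receiver's copy in place; unchanged bytes are never transferred."""
    for start, chunk in delta:
        buf[start:start + len(chunk)] = chunk


# Demo: four float32 weights, one of which changes from 2.0 to 2.5.
old = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
new = struct.pack("<4f", 1.0, 2.5, 3.0, 4.0)
delta = compute_byte_delta(old, new)
receiver = bytearray(old)  # inference-side copy before the update
apply_byte_delta(receiver, delta)
assert bytes(receiver) == new
print(f"sent {sum(len(c) for _, c in delta)} of {len(new)} bytes")  # sent 1 of 16 bytes
```

Going from 2.0 to 2.5 in float32 flips a single byte, so only that byte travels. When most bytes change, the per-run offsets become overhead, and a practical implementation would presumably fall back to a full transfer.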
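Token-limited batching can be sketched as a bin-packing step: samples are grouped so that each micro-batch stays under a token budget instead of holding a fixed number of samples. The sketch uses first-fit decreasing packing and, as one assumed way to keep the loss per-sample, averages each sample's token losses before averaging across samples. The function names and the packing heuristic are illustrative, not SLIME's implementation.

```python
from typing import List


def token_limited_batches(lengths: List[int], max_tokens: int) -> List[List[int]]:
    """Pack sample indices into micro-batches whose total token count stays within
    max_tokens, using first-fit decreasing. A sample longer than the budget gets its own batch."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches: List[List[int]] = []
    loads: List[int] = []  # current token count of each batch
    for idx in order:
        for b in range(len(batches)):
            if loads[b] + lengths[idx] <= max_tokens:
                batches[b].append(idx)
                loads[b] += lengths[idx]
                break
        else:  # no existing batch has room: open a new one
            batches.append([idx])
            loads.append(lengths[idx])
    return batches


def per_sample_mean_loss(token_losses: List[List[float]]) -> float:
    """Average token losses within each sample, then across samples, so every sample
    contributes equally regardless of its length or which micro-batch it lands in."""
    per_sample = [sum(losses) / len(losses) for losses in token_losses]
    return sum(per_sample) / len(per_sample)


print(token_limited_batches([300, 700, 200, 500, 100], max_tokens=1000))  # [[1, 0], [3, 2, 4]]
print(per_sample_mean_loss([[1.0, 3.0], [2.0]]))  # 2.0
```

Packing by tokens keeps GPU work per micro-batch roughly constant even when response lengths vary widely, which is where the throughput gain comes from.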
The system provides configuration and extension points for agent systems, custom loss functions, reward computation, data filtering and formatting, rollout generation, and training hooks. It supports mixed-precision setups that pair BF16 training with FP8 inference, Mixture-of-Experts models with routing-decision replay, multi-token prediction layer training, and supervised fine-tuning. Deployment capabilities include multi-node scaling via Ray, separate environments for training and serving, automatic rollout server recovery, and co-located training and inference on shared GPUs.
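A custom reward hook and a CPU-only contract test for it might look like the following. The `Sample` record, the async reward signature, and the `[0, 1]` range check are hypothetical stand-ins chosen for illustration; SLIME's real hook signatures and contract checks may differ.

```python
import asyncio
import inspect
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Dict, List


@dataclass
class Sample:
    """Hypothetical minimal rollout record passed to a reward hook."""
    prompt: str
    response: str
    label: str
    metadata: Dict[str, Any] = field(default_factory=dict)


# Hypothetical hook type: an async callable mapping a sample to a scalar reward.
RewardFn = Callable[[Sample], Awaitable[float]]


async def exact_match_reward(sample: Sample) -> float:
    """Example custom reward: 1.0 if the response's last line equals the label."""
    lines = sample.response.strip().splitlines()
    return 1.0 if lines and lines[-1].strip() == sample.label.strip() else 0.0


def check_reward_contract(fn: RewardFn) -> None:
    """CPU-only contract test: runs the hook on fixed samples, no GPU or server needed."""
    assert inspect.iscoroutinefunction(fn), "reward hook must be an async function"
    cases: List[Sample] = [
        Sample("2+2?", "Let me think.\n4", "4"),  # correct final answer
        Sample("2+2?", "5", "4"),                 # wrong answer
        Sample("2+2?", "", "4"),                  # empty response must not crash
    ]
    for sample in cases:
        reward = asyncio.run(fn(sample))
        assert isinstance(reward, float), f"expected float, got {type(reward).__name__}"
        assert 0.0 <= reward <= 1.0, f"reward {reward} outside assumed [0, 1] range"


check_reward_contract(exact_match_reward)  # raises AssertionError if the hook breaks the contract
```

Because the check only calls the hook on hand-written samples, it can run in ordinary CI before a custom implementation is ever loaded into a GPU job.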
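Routing-decision replay for Mixture-of-Experts models can be illustrated on a single token: the experts chosen by the inference engine during rollout are recorded and reused in the training forward pass, so a small numerical difference between the two engines cannot change which experts process the token. This is a sketch with hypothetical names; it additionally assumes that gate weights are recomputed as a softmax over the training-side logits of the replayed experts.

```python
import math
from typing import List, Optional, Sequence, Tuple


def top_k_experts(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest router logits, ties broken by lower index."""
    return sorted(range(len(logits)), key=lambda e: (-logits[e], e))[:k]


def route_token(
    logits: Sequence[float], k: int, replayed: Optional[Sequence[int]] = None
) -> Tuple[List[int], List[float]]:
    """Pick experts for one token and return (experts, gate_weights).

    If `replayed` is given (experts recorded during rollout), reuse it instead of
    recomputing top-k. Assumption: gate weights are a softmax over the current
    logits of the selected experts, so the gate still reflects the trainer's scores.
    """
    experts = list(replayed) if replayed is not None else top_k_experts(logits, k)
    m = max(logits[e] for e in experts)  # subtract the max for numerical stability
    exps = [math.exp(logits[e] - m) for e in experts]
    total = sum(exps)
    return experts, [x / total for x in exps]


rollout_logits = [0.90, 0.10, 0.85, 0.84]  # router scores seen by the inference engine
train_logits = [0.90, 0.10, 0.84, 0.85]    # slightly different scores in the trainer

recorded, _ = route_token(rollout_logits, k=2)
fresh, _ = route_token(train_logits, k=2)
replayed, weights = route_token(train_logits, k=2, replayed=recorded)
print(recorded, fresh, replayed)  # [0, 2] [0, 3] [0, 2]
```

Without replay, the 0.01 swap between experts 2 and 3 would make the trainer compute log-probabilities through a different expert than the one that actually generated the token.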