Slime

Slime

SLIME is a distributed reinforcement learning framework for large language model post-training that bridges Megatron training with SGLang inference servers. It orchestrates scalable RL loops across GPU clusters, decoupling training and inference into independent processes that communicate over HTTP and NCCL for independent scaling and fault tolerance. The system supports multi-agent reinforcement learning workflows with parallel agent instances, customizable rollout strategies, and personalized agent serving that improves models from prior conversations without disrupting API serving.

The framework distinguishes itself through byte-level delta weight synchronization that transfers only changed positions between training and inference servers, reducing bandwidth for cross-cluster deployments. It offers prefill-decode disaggregation with heterogeneous GPU group configurations, multi-token speculative decoding using the model's own prediction layer, and dynamic token-limited batching that maximizes throughput while preserving per-sample loss computation. A plugin-based customization interface exposes hooks for replacing generation, reward, and data-processing logic without modifying the core pipeline, with CPU-only contract tests validating custom implementations.

The system provides comprehensive configuration and extensibility across agent systems, custom loss functions, reward computation, data filtering and formatting, rollout generation, and training hooks. It supports mixed-precision training with BF16 and FP8 inference, Mixture-of-Experts models with routing decision replay, multi-token prediction layer training, and supervised fine-tuning. Deployment capabilities include multi-node scaling via Ray, environment separation for training and serving, automatic rollout server recovery, and co-located training and inference on shared GPUs.

Features

RL Post-Training - Connects Megatron training with SGLang rollout to run scalable reinforcement learning post-training on large language models.

Custom Tool and Reward Definitions - Replaces the default reward model with a user-defined module that receives arguments and a sample list and returns a float or list of floats.

Agentic Data Generation Engines - An engine that produces high-quality training samples through multi-turn agent interactions and sandboxed tool execution for RL post-training.

Custom Rollout Flow Logic - Overrides the entire rollout generation logic to implement complex multi-turn conversations, custom sampling strategies, or external tool integration.

Generation Step Customizations - Replaces only the generation step within the default rollout loop to add tool-calling, retrieval-augmented generation, or multi-turn conversation handling.

Decoupled Training-Inference Pipelines - A pipeline that decouples training and inference engines across GPU clusters to optimize throughput and memory for large-scale RL workloads.

Training-Serving Environment Decoupling - Runs the rollout serving side in a separate Python environment, container, cluster, or orchestration system, requiring only an HTTP endpoint and weight-sync path.

Multi-Node Training Scaling - Launches a distributed Ray cluster across machines and coordinates Megatron training with SGLang inference servers on separate nodes.

Mixed Precision Training - Trains in bf16 while running inference in fp8, including optional fp8 KV cache for long-context rollouts.

BF16 Training with FP8 Inference - Trains a large language model using BF16 precision while performing inference in FP8, reducing memory and compute overhead.

Decoupled Orchestration - Separates training and inference engines into independent processes that communicate over HTTP and NCCL for independent scaling.

Megatron Parallelism Configurations - Configures tensor, pipeline, context, and expert parallelism parameters for Megatron to partition a large model across many GPUs.

Custom Computation - Implements custom reward computation logic, such as rule-based rewards, verifier checks, or external reward service integration.

Workflow Coordinators - Orchestrates multi-agent reinforcement learning workflows with parallel agent instances and reward computation.

Reinforcement Learning Training - Connects Megatron training with SGLang rollout to run scalable RL post-training loops for large language models.

Multi-Model-Family RL Post-Training - Provides validated RL post-training loops for GLM, Qwen, DeepSeek, and Llama model series.

Rollouts - Runs multi-agent, search, and coding rollouts through customization interfaces inside the standard training loop.

Asynchronous Weight Streaming - Streams updated actor weights to inference servers using NCCL broadcast, overlapping transfer with the next training step.

Prefill-Decode Disaggregation - Separates the prefill and decode phases of inference onto different server groups so each can be tuned independently for latency and memory.

Training Data Generation - Uses custom data generation interfaces and server-based engines to create arbitrary training data workflows for RL.

Custom Workflows - Creates arbitrary training data through custom interfaces and server-based engines, including tool use, sandbox interaction, and multi-agent workflows.

GPU Group Configurations - Defines independent prefill and decode server groups with per-group GPU counts, tensor-parallel sizes, and SGLang overrides for production topologies.

Token-Limited Batching - Packs variable-length sequences into batches up to a token limit per GPU, preserving per-sample loss while maximizing throughput.

External Engine Connections - Connects training jobs to independently launched SGLang servers by passing their addresses for external rollout serving.

Byte-Level Weight Delta Synchronization - Transfers only changed byte positions and values between training and inference servers to reduce bandwidth.

NCCL Transfers - Forms an NCCL group between the trainer and external engines to transfer full or delta weight updates directly over the network.

Training Pipeline Hooks - Exposes hooks and overrides for replacing generation, reward, and data-processing logic without modifying the core pipeline.

Training Metrics - Logs reward, loss, KL, entropy, and evaluation metrics to W&B or TensorBoard during training.

System Definitions - Accepts a user-provided function that specifies the logic and interaction rules for a multi-agent setup.

Framework Adaptations - Replaces default generation and reward functions with custom logic to support multi-turn interactions and tool calling.

Per-Sample Generation Plugins - Replaces the default per-sample generation function with a user-defined module that receives arguments, a sample, and sampling parameters.

HuggingFace Model Wrappers - Wraps HuggingFace models as black-box modules for Megatron parallel training.

Dynamic Sampling Filters - Samples more prompts than needed and discards those that fail a custom quality filter, such as checking for reward score diversity.

Custom Evaluation Rollout Functions - Overrides the rollout function specifically for evaluation, allowing different sampling parameters or logic.

Per-Sample Exclusion Masks - Marks individual samples to be excluded from loss computation, enabling selective training strategies based on response quality.

Custom Loss Functions - Implements custom loss functions for training, enabling novel reinforcement learning objectives.

RL Data Loop Configuration - Sets batch sizes, samples per prompt, and steps per rollout to balance data sampling and weight update cycles.

Self-Speculative Decoding - Uses the model's own prediction layer to generate multiple draft tokens per step for parallel verification.

Co-located GPU Training and Inference - Runs both training and rollout inference processes on the same set of GPUs to reduce hardware requirements.

Multi-Turn Serving Optimizations - Implements session-affinity routing to reuse prefix caches across multi-turn interactions.

Routing Decision Replays - Records and replays expert routing decisions during training to stabilize Mixture-of-Experts reinforcement learning.

Periodic In-Training Evaluation - Runs periodic evaluations on a separate prompt dataset using configurable sampling parameters to monitor training progress.

Quantization-Aware Training - Trains a model with simulated INT4 precision so it can later be served with INT4 inference, reducing rollout memory and improving throughput.

Multi-Token Prediction Layers - Trains multi-token prediction layers jointly with the main model using gradient computation and loss scaling.

Memory Optimizations - Offloads optimizer state to CPU and configures expert parallelism to fit a Mixture-of-Experts model on a small number of GPUs.

Rollout Quantizations - Converts BF16 model weights to FP8 using blockwise quantization, enabling memory-efficient inference while keeping training in higher precision.

RL Algorithm Parameter Configuration - Sets algorithm-specific parameters such as advantage estimator, KL loss coefficient, and clipping thresholds for GRPO or other RL methods.

Post-Processing Normalizations - Applies custom normalization or shaping to rewards before advantage computation.

Concurrent Instance Training - Runs several agent instances concurrently, each interacting with the environment, to collect diverse experience for RL.

Personalized Serving - Hosts a model and improves it from prior conversations using asynchronous RL that does not interfere with API serving.

Instruction-Tuning Training - Trains supervised fine-tuning models on instruction-response data by converting datasets into OpenAI message format.

Shared Filesystem Weight Synchronization - Writes full checkpoints or delta updates to a shared filesystem path and triggers SGLang to reload them over HTTP.

Tool Registries - Maintains a central registry of callable functions that a language model can invoke during generation.

FP8 Compressions - Compresses the key-value cache to FP8 to increase effective capacity for long-context interactions.

Sandboxed Execution Environments - Runs user-provided code in an isolated process with memory, time, and operation restrictions for safe agentic tool use.

Model Role Parameter Overrides - Applies role-specific overrides to shared Megatron training arguments, enabling distinct configurations for actor and critic in PPO setups.

Code Execution Sandboxes - Runs user-provided Python code in an isolated environment with memory, time, and operation restrictions.

Rollout Server Auto-Restarts - Periodically checks the health of rollout servers and automatically restarts any that become unresponsive, restoring correct model parameters before reuse.

Rollout Buffer Filters - Removes or selects samples from the rollout buffer before training, enabling priority-based or quality-based sample selection.

Co-located Backend Function Calls - Shares GPU resources between training and inference on the same nodes to reduce hardware requirements.

Rollout Sample Debugging - Save generated rollout samples to disk and replay them later without live servers to isolate training issues from serving failures.

Pipeline Component Isolation - Run either the training or rollout pipeline in isolation to reproduce and debug failures in each component separately.

Agentic Reinforcement Learning - Reinforcement learning library for agentic workflows.

Reinforcement Learning - Framework for RL scaling in LLM post-training.

Reinforcement Learning Frameworks - SGLang-native framework designed for reinforcement learning scaling.

Training and Fine-Tuning - Framework for RL scaling in LLMs.

THUDMslime

Features

Star history