Rllm

rllm is an asynchronous reinforcement learning framework for training language agents. It provides a unified pipeline that runs the same agent code for both evaluation and training, automatically capturing traces for gradient computation. The framework supports distributed reinforcement learning across multiple GPUs and nodes using pluggable backends, and executes agents in isolated sandboxes, either locally or in the cloud, for safe and scalable rollout collection. It trains agents built with LangGraph, SmolAgents, OpenAI Agents SDK, or custom frameworks without requiring changes to their core logic.

The framework distinguishes itself through native multi-agent training orchestration, where collaborative workflows such as solver-judge pairs learn from shared or competing trajectories with differentiated rewards per agent role. It includes a library of over 50 curated benchmarks spanning math, code, QA, and vision, and provides a suite of pre-built reward functions and graders. Performance optimizations include pre-provisioned sandbox queues and startup snapshot caching to reduce rollout latency, and a transparent HTTP proxy captures token-level data from any inference request without modifying agent code.

Beyond its core training capability, rllm offers a CLI for launching training and evaluation jobs with automated dataset handling, and supports progressive context length scaling, parameter-efficient fine-tuning via LoRA, and multimodal model training. It integrates AI-backed run analysis, real-time web dashboard monitoring, and full-text search across training artifacts. The framework’s pluggable backend interface and environment-variable-driven configuration allow switching between Ray-distributed, managed-service, or single-machine backends without code changes, and its curated dataset management and custom dataset integration methods make it straightforward to bring new tasks into the training workflow.

Features

Reinforcement Learning Alignment - Trains language models with reinforcement learning to optimize reasoning, tool use, and multi-step problem solving.

Reinforcement Learning Training - Trains language agents via reinforcement learning with pluggable backends, asynchronous rollouts, and token-level traces.

Language Agent RL Frameworks - Provides an asynchronous reinforcement learning framework built specifically for training language agents.

Agent Decision Logic Definitions - The platform defines an agent's decision logic as a plain async function accepting task and configuration, returning an episode or trajectory.
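As a rough illustration of the "plain async function" idea, the sketch below defines an agent as a coroutine that takes a task and a configuration and returns a trajectory. Every name here (`Step`, `Trajectory`, `fake_llm`, `solve`) is hypothetical and stands in for whatever rllm actually provides; the model call is a stub, not a real inference client.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    prompt: str
    response: str


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    reward: float = 0.0


async def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for an awaited inference request.
    await asyncio.sleep(0)
    return "4" if "2 + 2" in prompt else "unknown"


async def solve(task: dict[str, Any], config: dict[str, Any]) -> Trajectory:
    # The agent's decision logic: build a prompt, query the model, score the answer.
    prompt = f"{config['system']}\n{task['question']}"
    answer = await fake_llm(prompt)
    reward = 1.0 if answer == task["answer"] else 0.0
    return Trajectory(steps=[Step(prompt, answer)], reward=reward)


traj = asyncio.run(
    solve({"question": "What is 2 + 2?", "answer": "4"}, {"system": "Answer briefly."})
)
```

Because the function only depends on its arguments, the same coroutine could in principle be called by an evaluation loop or a training loop without modification.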

Custom Agent and Environment Definitions - The platform defines agent behavior and interaction environments through clear abstractions that separate design from training infrastructure.

Preconfigured Benchmark Loops - The platform runs a pre-configured agent program parameterized by the LLM and task, selected by name for common benchmark patterns.

Custom Tool and Reward Definitions - The platform defines custom tools, chat parsers, and reward functions to tailor agent behavior for specific tasks.

Agentic Workflows - The platform authors agent workflows using a protocol that runs identically during evaluation and training, ensuring consistent behavior.

Episode-Trajectory-Step Hierarchies - Organizes agent-environment interactions into nested Episode, Trajectory, and Step structures.

Trajectory Return Types - The platform returns None, a single Trajectory, or a full Episode with named trajectories for single-agent or multi-agent evaluation and training.
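One way to picture the three return types is a small normalizer that maps each of them onto the Episode, Trajectory, and Step nesting. This is a sketch under assumptions: the dataclass fields and the `to_episode` helper are invented for illustration and are not taken from rllm's code.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Step:
    observation: str
    action: str
    reward: float = 0.0


@dataclass
class Trajectory:
    name: str
    steps: list = field(default_factory=list)


@dataclass
class Episode:
    task_id: str
    trajectories: dict = field(default_factory=dict)


def to_episode(task_id: str, result: Union[None, Trajectory, Episode]) -> Episode:
    """Normalize an agent's return value into an Episode (hypothetical helper)."""
    if result is None:
        # Nothing returned: an empty episode for this task.
        return Episode(task_id)
    if isinstance(result, Trajectory):
        # Single-agent case: wrap the trajectory under its own name.
        return Episode(task_id, {result.name: result})
    # Multi-agent case: the caller already built a full Episode.
    return result
```

With this shape, a multi-agent run would simply carry several named entries in `trajectories`, for example one for a solver and one for a judge.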

Custom Agent Flow Definitions - The platform creates a custom agent flow by subclassing a class with a class-level name and optional concurrency limit.
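The subclassing pattern described here might look roughly like the following. `AgentFlowBase`, `max_concurrency`, `run`, and `run_many` are hypothetical names chosen for this sketch; the actual base class and attribute names in rllm may differ.

```python
import asyncio
from typing import ClassVar, Optional


class AgentFlowBase:
    # Class-level identifier and optional cap on concurrent runs (both hypothetical).
    name: ClassVar[str] = "base"
    max_concurrency: ClassVar[Optional[int]] = None

    async def run(self, task: dict) -> dict:
        raise NotImplementedError

    async def run_many(self, tasks: list) -> list:
        # With no limit set, allow every task to run at once.
        limit = self.max_concurrency or len(tasks) or 1
        sem = asyncio.Semaphore(limit)

        async def guarded(task: dict) -> dict:
            async with sem:
                return await self.run(task)

        # gather preserves the order of the input tasks.
        return await asyncio.gather(*(guarded(t) for t in tasks))


class EchoFlow(AgentFlowBase):
    name = "echo"
    max_concurrency = 2

    async def run(self, task: dict) -> dict:
        await asyncio.sleep(0)  # stand-in for an awaited model call
        return {"flow": self.name, "output": task["question"].upper()}


outputs = asyncio.run(
    EchoFlow().run_many([{"question": "a"}, {"question": "b"}, {"question": "c"}])
)
```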

RL Trajectory - Groups trajectories by task and agent, computes advantages, and enriches steps with token-level data.

Distributed Training - Scales reinforcement learning training across multiple GPUs and nodes using Ray, vLLM, or SGLang backends.

Dedicated Worker Group Scaling - Scales training across GPUs and nodes with dedicated worker groups for policy, rollout, and reference.

External Agent Integrations - The platform converts agents from any external framework into trainable workflows without rewriting core logic.

Unified Agent-Workflow Training - Trains both agent-based and workflow-based models using the same trainer codebase and shared configuration.

Reward Functions - Ships a standard interface for defining custom reward functions that evaluate completed task episodes.
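A reward-function interface of this kind can be sketched with a `typing.Protocol`. The `RewardFn` protocol, its `(task, output)` signature, and the `numeric_match` grader are assumptions made for illustration, not rllm's published interface.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RewardFn(Protocol):
    """Hypothetical interface: score a finished output against its task."""

    def __call__(self, task: dict, output: str) -> float: ...


def numeric_match(task: dict, output: str, tol: float = 1e-6) -> float:
    # Return 1.0 when the output parses to the expected number, else 0.0.
    try:
        predicted = float(output.strip())
        expected = float(task["answer"])
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - expected) <= tol else 0.0
```

Any callable with a compatible signature satisfies the protocol, so graders can be swapped per dataset without touching the agent.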

Advantage Estimation - Computes advantage values from trajectory rewards with per-role, per-token, and step-wise modes.
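The source does not say which estimator is used, so the following is only a sketch of one common choice: a group-relative baseline, where each reward is compared with the mean reward of rollouts sharing the same task and agent role, and the result is then broadcast to every token of the response. The record layout and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def group_relative_advantages(records: list) -> list:
    """Advantage = reward minus the mean reward of the (task, role) group."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["task"], rec["role"])].append(rec["reward"])
    baselines = {key: mean(vals) for key, vals in groups.items()}
    return [rec["reward"] - baselines[(rec["task"], rec["role"])] for rec in records]


def broadcast_to_tokens(advantage: float, num_tokens: int) -> list:
    # Per-token mode: every generated token shares the trajectory's advantage.
    return [advantage] * num_tokens


records = [
    {"task": "t1", "role": "solver", "reward": 1.0},
    {"task": "t1", "role": "solver", "reward": 0.0},
    {"task": "t1", "role": "judge", "reward": 0.5},
]
advantages = group_relative_advantages(records)
```

Grouping by role keeps a judge's rewards from shifting the solver's baseline, which is the point of a per-role mode.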

Asynchronous Training - Uses native async/await throughout the training pipeline so rollouts and optimization do not block.
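The non-blocking relationship between rollouts and optimization can be sketched as a producer-consumer pair over an `asyncio.Queue`. This is a toy model under stated assumptions: the worker and trainer functions are hypothetical, inference is replaced by `asyncio.sleep(0)`, and a "policy update" is just collecting a batch.

```python
import asyncio


async def rollout_worker(worker_id: int, tasks: list, queue: asyncio.Queue) -> None:
    for task in tasks:
        await asyncio.sleep(0)  # stand-in for awaiting an inference request
        await queue.put({"worker": worker_id, "task": task, "reward": 1.0})


async def trainer(queue: asyncio.Queue, total: int, batch_size: int) -> list:
    batches, batch = [], []
    for _ in range(total):
        batch.append(await queue.get())
        if len(batch) == batch_size:
            batches.append(batch)  # stand-in for one policy update
            batch = []
    if batch:
        batches.append(batch)  # flush the final partial batch
    return batches


async def main() -> list:
    queue = asyncio.Queue(maxsize=4)  # bounded so workers cannot run far ahead
    shards = [[0, 1, 2], [3, 4, 5]]
    workers = [
        asyncio.create_task(rollout_worker(i, shard, queue))
        for i, shard in enumerate(shards)
    ]
    batches = await trainer(queue, total=6, batch_size=4)
    await asyncio.gather(*workers)
    return batches


batches = asyncio.run(main())
```

The trainer starts consuming as soon as the first trajectories arrive, while the bounded queue applies backpressure to the rollout workers.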

Unified Agent Flow Execution - The platform executes the same agent flow code for both evaluation and reinforcement learning training, with transparent trace capture for gradients.

Unified Execution Loops - Runs the same agent code for both evaluation and reinforcement learning training by transparently capturing traces for gradient computation.

Unified Multi-Agent Workflows - The platform defines concurrent agent interactions such as parallel generation and evaluation that run unchanged in both evaluation and training modes.

RL Training Workflows - Runs reinforcement learning training loops asynchronously with non-blocking rollout generation and policy optimization.

Agent Instrumentation Decorators - Instruments any agent with a single decorator to automatically trace LLM calls for RL training.

Distributed Language Agent RL Workflows - Scales reinforcement learning across multiple GPUs and nodes with asynchronous trajectory generation and gradient updates.

Multi-Agent Training - Manages collaborative training workflows where multiple agents learn from shared or competing trajectories.

Differentiated Reward Training - Assigns separate rewards per agent role in multi-agent workflows for reinforcement learning.
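For a solver-judge pair, role-specific rewards could be assigned along these lines. The scoring rule (the solver is rewarded for a correct answer, the judge for a correct verdict about that answer) and the function name are assumptions for illustration only.

```python
def assign_role_rewards(solver_answer: str, judge_verdict: bool, ground_truth: str) -> dict:
    """Return one reward per agent role (hypothetical scoring rule)."""
    solver_correct = solver_answer == ground_truth
    return {
        # The solver is rewarded for matching the ground truth.
        "solver": 1.0 if solver_correct else 0.0,
        # The judge is rewarded when its verdict agrees with the actual correctness.
        "judge": 1.0 if judge_verdict == solver_correct else 0.0,
    }
```

Under this rule a judge that correctly rejects a wrong answer earns full reward even though the solver earns none, so the two roles receive distinct learning signals from the same episode.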

Evaluation And Benchmarks - Runs standardized benchmarks and custom reward functions to measure and compare agent performance.

Pluggable RL Backend Interfaces - Provides a unified configuration structure to swap reinforcement learning backends without changing core training configs.

Agent Sandbox Provisioners - Executes agents in isolated ephemeral sandboxes, either locally or in the cloud, for safe and scalable rollout collection.

MicroVM - Deploys containerized agents to an auto-scaling, sandboxed runtime that isolates each session in a separate microVM.

Pre-built Graders - The platform scores agent outputs against common benchmarks using built-in graders for math, code, multiple-choice, translation, and vision tasks.

Actor-Rollout Combined Engines - Combines actor and rollout workers in a single engine for asynchronous trajectory generation.

Isolated Execution Sandboxes - The platform creates ephemeral sandboxes using local or cloud backends to safely execute agent tasks and verifiers.

Rollout-Optimization Pipelines - Uses native async/await throughout to generate trajectories and compute policy updates without blocking stages.

Training Trajectory Capture - Intercepts token IDs, logprobs, and trajectory data via an HTTP proxy without requiring agent code changes.

Agent Evaluation Tools - Calls an agent on a given task and records its trajectory as an episode, usable in both evaluation and training.

Agent Framework Integrations - Integrates with LangGraph, SmolAgents, OpenAI Agents SDK, and other frameworks by swapping the client for seamless RL training.

Math Reasoning Agents - The platform constructs a language agent that solves math problems through step-by-step reasoning, leveraging the framework's training capabilities.

Multi-Framework Agent Training - Trains agents built with LangGraph, SmolAgents, OpenAI Agents SDK, or any custom framework without rewriting core logic.

Custom Backend Interfaces - Defines an abstract interface for custom backends handling data loading, inference, gradient computation, and policy updates.

Custom Dataset Integration Methods - Provides three distinct methods to bring any custom dataset into training and evaluation workflows.

LoRA Fine-Tuning Tools - Applies low-rank adaptation to attention and MLP layers for parameter-efficient fine-tuning.
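As a reminder of what low-rank adaptation does to a single linear layer, the sketch below computes y = W x + (alpha / r) * B (A x) with plain Python lists, where W stays frozen and only the small matrices A (r by d_in) and B (d_out by r) would be trained. It illustrates the general LoRA formulation, not rllm's implementation, and the helper names are hypothetical.

```python
def matvec(matrix: list, vector: list) -> list:
    # Multiply a matrix (list of rows) by a vector.
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W: list, A: list, B: list, x: list, alpha: float) -> list:
    rank = len(A)                     # A has one row per rank dimension
    base = matvec(W, x)               # frozen pretrained path
    delta = matvec(B, matvec(A, x))   # low-rank update path
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    # Trainable parameters in A and B, versus d_out * d_in for the full matrix.
    return rank * (d_in + d_out)
```

With B initialized to zeros the adapted layer reproduces the base layer exactly, which is why training can start from the pretrained behavior.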

LLM Integration Frameworks - Integrates with any major language model framework through a single SDK to train and run agents without switching stacks.

Agent Performance Evaluators - Runs evaluations for agent performance in sandboxed environments, scoring outputs with reward functions.

Training Hyperparameter Configurations - Configures RL algorithms, backends, hyperparameters, agent parameters, and environment definitions for training runs.

Model Training and Inference Engines - Runs both training and inference through a single API to deploy and refine agents without switching stacks.

Training and Evaluation Pipelines - Provides a unified command-line pipeline that executes training and evaluation jobs with automated dataset downloads.

Vision-Language Training - Supports multimodal models like Qwen2-VL and Qwen3-VL by processing image inputs alongside text during training.

Bundled Training Packages - Bundles agent flow, evaluator, data preparation, and training scripts into a single CLI-runnable package.

Progressive Context Length Scaling - Progressively increases maximum context length across training stages to improve reasoning depth.
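A staged context-length schedule can be expressed as a lookup from training step to maximum token count. The stage boundaries and lengths below are illustrative values chosen for this sketch, not settings documented by rllm, and `max_context_for_step` is a hypothetical helper.

```python
def max_context_for_step(step: int, schedule: list) -> int:
    """Return the context limit for a step.

    schedule is a list of (start_step, max_tokens) pairs sorted by start_step.
    """
    current = schedule[0][1]
    for start_step, max_tokens in schedule:
        if step >= start_step:
            current = max_tokens
    return current


# Illustrative three-stage schedule (assumed values).
schedule = [(0, 8192), (1000, 16384), (2000, 24576)]
```

Starting short keeps early rollouts cheap, and later stages give the policy room for longer reasoning chains once it has learned to use shorter ones.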

Benchmarks and Datasets - Preprocesses datasets, defines tasks as pure data, and adds custom benchmark datasets for training and evaluation.

Model Evaluation and Benchmarking - Provides a benchmark evaluation harness that runs models against curated datasets and reports accuracy scores.

Task Instance Definitions - Represents each problem instance as a unit of work with instruction, metadata, and files for evaluation.

Benchmark Dataset Loaders - Provides access to a library of over 50 pre-built benchmarks spanning math, code, QA, VLM, and other domains.

Curated Benchmark Downloaders - Automatically downloads and caches over 60 curated datasets spanning math, code, QA, and agentic tasks.

CLI Training Toolkits - Ships a CLI for evaluating, training, and scaffolding agents with over 50 built-in benchmarks.

Configurable Stage Pipelines - Structures the learning process into a configurable eight-stage pipeline for granular control.

Live Weight Update Deployments - Syncs trainer weight updates to the live inference deployment after each training step for continuous improvement.

Inference Trace Captures - Intercepts inference request and response data via a transparent proxy without modifying agent code.

Verifier-to-Benchmark Assignments - The platform assigns a reward function to a dataset or sandbox task through a configuration file, enabling evaluation without code changes.

Real-Time Monitoring Dashboards - Streams live training metrics, episodes, trajectories, and logs to a web dashboard for real-time monitoring.

Training Monitoring Dashboards - The platform streams live metrics, episode data, and execution traces to a dashboard for inspecting agent behavior and training progress.

Policy Optimization - Scales reinforcement learning to train high-performance reasoning models.
