Search-R1 is a distributed training system and reinforcement learning framework designed to create search-augmented language models. It provides an architecture for scaling model workloads across head and worker nodes while optimizing how models interleave internal reasoning with external tool calls. The system focuses on refining model behavior through custom reward signals and reinforcement learning to improve tool-use formatting and information retrieval. It implements an interleaved reasoning-search loop that allows models to alternate between internal thought generation and external data
TinyZero is a reinforcement learning framework and implementation designed to train language models to develop reasoning and self-verification abilities. It provides a training pipeline to optimize model performance on mathematical and logical tasks. The project serves as a minimal reproduction of the DeepSeek R1 architectural and training approach. It focuses on creating reasoning models that can solve structured problems through autonomous chain-of-thought discovery. The framework incorporates group relative policy optimization and reward-based self-correction to improve accuracy on logica
DeepSeek-R1 is an open-weights large language model focused on advanced reasoning. It uses chain-of-thought processing and internal monologues to solve complex mathematical and logical problems by breaking tasks into sequential, verifiable thought processes. The model is developed using reinforcement learning to optimize reasoning patterns and verify logical steps. It employs a distillation process to transfer these high-performance logic capabilities from a large teacher model into smaller, computationally efficient versions. The training framework incorporates group relative policy optimiz
ART is a platform for agentic training, providing a reinforcement learning framework, training environment, and compute orchestrator. It enables the improvement of multi-step agent reasoning and tool usage through group relative policy optimization and a judge-based reward modeling system. The project features tools for model distillation to transfer capabilities from large teacher models to smaller architectures, as well as a system for capturing execution trajectories to generate synthetic training data. It supports specialized training workflows including supervised fine-tuning for baselin