Verifiers

Verifiers - train and evaluate LLM agents | Awesome Repos

Features

RL Environment Construction - Provides a standardized system for constructing simulation environments and harnesses to train and evaluate large language models.
RL Environment Frameworks - Provides a standardized system for constructing simulation environments and training harnesses for LLM reinforcement learning.
Rubric-Based Reward Scoring - Calculates model performance by mapping environment outputs against a predefined set of success criteria and reward values.
RL Trajectory - Tracks token trajectories across multi-turn interactions, handling branching rollouts and truncated paths for RL training.
LLM Evaluation Frameworks - Provides a system for measuring language model accuracy and performance using reward rubrics and datasets.
Reinforcement Learning Environments - Offers a comprehensive toolkit for building standardized simulation environments and harnesses for LLM reinforcement learning.
Agent Performance Evaluators - Assesses agent behavior and success rates through automated testing and ablation sweeps.
Model Performance Evaluators - Tests model outputs against defined environments with terminal-based result visualization to quantify accuracy.
Reinforcement Learning Reward Systems - Defines task datasets and reward rubrics to quantify and assign utility to agent actions for optimization.
RL Training Workflows - Connects simulation environments to RL frameworks to optimize model performance based on defined rubrics.
RL Training Harnesses - Implements a bridge connecting large language models to simulation environments for optimization based on specific task goals.
Task Definitions - Implements a framework for setting up task datasets, model harnesses, and reward rubrics for LLM evaluation and training.
Environment Module Packaging - Bundles task datasets and evaluation logic into self-contained units for remote deployment and standardized sharing.
Model Agnostic Interfaces - Implements a common interface that decouples language model APIs from simulation environments to allow seamless model swapping.
Agent Trajectory Logs - Tracks token trajectories, branching rollouts, and multi-turn interactions during reinforcement learning sessions.
Episode Trajectory Recorders - Records the full sequence of token interactions and branching paths for post-hoc agent behavior analysis.
Agent Performance Metrics - Analyzes agent success using pass-rate metrics and ablation sweeps via a dedicated terminal interface.
RL Post-Training - Optimizes model performance by connecting simulation environments to RL frameworks for post-training.
Component Ablation Studies - Provides systematic removal of environment parameters and model configurations to evaluate their contribution to success rates.
RL Environment Publishing - Provides a system for uploading self-contained environment modules to a centralized hub for sharing and remote execution.
Collaborative Research Environments - Enables collaborative research by packaging and publishing environment modules to a central hub for remote execution.
Agent Performance Monitoring - Captures real-time interaction data and agent progress throughout live rollouts using monitoring rubrics.
Evaluation Metric Monitors - Gathers and records performance data during agent interactions by applying monitoring rubrics to active sessions.
Reinforcement Learning Frameworks - Provides a reinforcement learning framework built around verifiable environments.
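
Several of the features above (rubric-based reward scoring, pass-rate metrics) boil down to applying weighted success criteria to model outputs. The following is a minimal sketch of that idea in plain Python. It is not the Verifiers API: the names Rubric, exact_match, is_concise and pass_rate, the 20-character budget, and the sample rollouts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types for illustration only; not taken from Verifiers.
RewardFn = Callable[[str, str], float]  # (completion, answer) -> score in [0, 1]


def exact_match(completion: str, answer: str) -> float:
    """Return 1.0 if the stripped completion equals the reference answer."""
    return 1.0 if completion.strip() == answer.strip() else 0.0


def is_concise(completion: str, answer: str) -> float:
    """Return 1.0 if the completion fits an arbitrary 20-character budget."""
    return 1.0 if len(completion) <= 20 else 0.0


@dataclass
class Rubric:
    """A weighted collection of reward functions (assumed design)."""
    criteria: List[Tuple[RewardFn, float]]

    def score(self, completion: str, answer: str) -> float:
        # Total reward is the weighted sum of the individual criteria.
        return sum(weight * fn(completion, answer) for fn, weight in self.criteria)


def pass_rate(rubric: Rubric, rollouts: List[Tuple[str, str]], threshold: float) -> float:
    """Fraction of (completion, answer) pairs scoring at or above the threshold."""
    passed = sum(1 for c, a in rollouts if rubric.score(c, a) >= threshold)
    return passed / len(rollouts)


rubric = Rubric([(exact_match, 0.75), (is_concise, 0.25)])
rollouts = [
    ("4", "4"),                             # correct and concise
    ("The answer is probably four.", "4"),  # wrong format and too long
    ("5", "4"),                             # concise but incorrect
]
scores = [rubric.score(c, a) for c, a in rollouts]  # [1.0, 0.0, 0.25]
rate = pass_rate(rubric, rollouts, threshold=0.75)  # one of three rollouts passes
```

Keeping each criterion as a separate function with its own weight makes it straightforward to drop or reweight one criterion at a time, which is the kind of comparison an ablation sweep would run.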

Open-source alternatives to Verifiers

Similar open-source projects, ranked by how many features they share with Verifiers.

rllm-org/rllm
5,641 stars
rllm is an asynchronous reinforcement learning framework for training language agents. It provides a unified pipeline that runs the same agent code for both evaluation and training, automatically capturing traces for gradient computation. The framework supports distributed reinforcement learning across multiple GPUs and nodes using pluggable backends, and executes agents in isolated sandboxes, either locally or in the cloud, for safe and scalable rollout collection. It trains agents built with LangGraph, SmolAgents, OpenAI Agents SDK, or custom frameworks without requiring core logic changes.
Python, agent-framework, agentic-workflow, coding-agent
RLinf/RLinf
2,502 stars
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
Python, agentic-ai, embodied-ai, reinforcement-learning
Helicone/helicone
5,830 stars
Helicone is an AI gateway and observability platform designed to intercept, manage, and monitor interactions with large language models. By acting as a reverse-proxy, it provides a centralized layer for routing requests across multiple AI providers, allowing developers to maintain consistent application logic while gaining deep visibility into model performance, usage, and costs. The platform distinguishes itself through a robust suite of traffic management and prompt engineering tools. It enables policy-driven control, including automatic failover between providers, rate limiting, and edge-b
TypeScript
promptslab/Promptify
4,616 stars
Promptify is a suite of tools designed for model evaluation, prompt management, token cost tracking, structured extraction, and unified API gateway access. It provides a standardized interface to manage requests and responses across multiple large language model providers. The project features a prompt management platform for engineering and versioning prompts with structured output validation. It includes a dedicated evaluation framework to measure model performance using precision, recall, and f1 scores against labeled datasets, alongside a token cost tracker to monitor the financial expens
Python

See all 30 alternatives to Verifiers

willccbb/verifiers
