30 open-source projects similar to willccbb/verifiers, ranked by how many features they have in common. Compare stars, activity, and what each project does to find the best Verifiers alternative.
rllm is an asynchronous reinforcement learning framework for training language agents. It provides a unified pipeline that runs the same agent code for both evaluation and training, automatically capturing traces for gradient computation. The framework supports distributed reinforcement learning across multiple GPUs and nodes using pluggable backends, and executes agents in isolated sandboxes—either locally or in the cloud—for safe and scalable rollout collection. It trains agents built with LangGraph, SmolAgents, OpenAI Agents SDK, or custom frameworks without requiring core logic changes. T
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
Helicone is an AI gateway and observability platform designed to intercept, manage, and monitor interactions with large language models. By acting as a reverse-proxy, it provides a centralized layer for routing requests across multiple AI providers, allowing developers to maintain consistent application logic while gaining deep visibility into model performance, usage, and costs. The platform distinguishes itself through a robust suite of traffic management and prompt engineering tools. It enables policy-driven control, including automatic failover between providers, rate limiting, and edge-b
Promptify is a suite of tools designed for model evaluation, prompt management, token cost tracking, structured extraction, and unified API gateway access. It provides a standardized interface to manage requests and responses across multiple large language model providers. The project features a prompt management platform for engineering and versioning prompts with structured output validation. It includes a dedicated evaluation framework to measure model performance using precision, recall, and f1 scores against labeled datasets, alongside a token cost tracker to monitor the financial expens
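As a rough illustration of the precision, recall, and F1 scoring against labeled data mentioned in the Promptify entry, here is a minimal sketch. It does not use Promptify's own API; the function name `precision_recall_f1` and its parameters are hypothetical and chosen only for this example.

```python
def precision_recall_f1(predicted, expected, positive):
    """Score predictions against gold labels for one positive class.

    Hypothetical helper for illustration; not part of any library.
    """
    pairs = list(zip(predicted, expected))
    true_pos = sum(1 for p, e in pairs if p == positive and e == positive)
    false_pos = sum(1 for p, e in pairs if p == positive and e != positive)
    false_neg = sum(1 for p, e in pairs if p != positive and e == positive)

    # Guard against empty denominators so the function never divides by zero.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    model_labels = ["spam", "spam", "ham", "ham"]
    gold_labels = ["spam", "ham", "spam", "ham"]
    print(precision_recall_f1(model_labels, gold_labels, positive="spam"))
```

With the sample labels above there is one true positive, one false positive, and one false negative, so all three scores come out to 0.5.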
Lighteval is an open-source framework for running standardized benchmarks and custom evaluation tasks against language models. It provides a system for defining new evaluation tasks with custom prompts, metrics, and scoring in YAML configuration files, and integrates with the Hugging Face Hub for storing and comparing results. The framework supports evaluating models across multiple inference backends, including transformers, vllm, and custom APIs, through a unified generation and log-probability interface. It includes a pluggable metric registry for built-in and custom scoring, a prediction
Lmnr is an LLM observability platform and evaluation framework designed for tracing, logging, and monitoring language model executions. It provides the tools necessary to debug agent behavior, analyze performance, and identify failure patterns in AI agents. The platform differentiates itself through a trace-to-dataset pipeline that converts production logs into labeled test sets for regression testing. It includes a prompt-variant replay engine to compare different prompts or models side-by-side and a state-cached debugging system to replay agent loops without restarting the process. The sys
Coze-loop is an optimization platform and orchestration management suite for large language model agents. It functions as a comprehensive environment for the development, debugging, evaluation, and monitoring of AI agent performance. The project provides a dedicated prompt engineering playground for real-time iteration and validation of model responses. It includes an evaluation framework that runs automated assessments against datasets to generate performance metrics and verify output accuracy. The system covers observability through real-time execution tracing and historical analysis of ag
ART is a platform for agentic training, providing a reinforcement learning framework, training environment, and compute orchestrator. It enables the improvement of multi-step agent reasoning and tool usage through group relative policy optimization and a judge-based reward modeling system. The project features tools for model distillation to transfer capabilities from large teacher models to smaller architectures, as well as a system for capturing execution trajectories to generate synthetic training data. It supports specialized training workflows including supervised fine-tuning for baselin
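The group relative policy optimization mentioned in the ART entry scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization step, not ART's implementation. The function name is hypothetical, and the use of the population standard deviation is an assumption; implementations differ on this detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    baseline = mean(rewards)   # the group mean acts as the baseline
    spread = pstdev(rewards)   # zero when every rollout scored the same
    return [(r - baseline) / (spread + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt, scored by a judge or reward function.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal.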
Giskard is an evaluation framework, testing library, and quality monitoring system for large language models and AI agents. It serves as a toolkit for quantifying model performance and reliability, providing specialized capabilities for validating retrieval-augmented generation pipelines. The project distinguishes itself through an automated red teaming tool and security scanner designed to identify vulnerabilities, prompt injections, and safety risks. It utilizes adversarial probing and synthetic edge case generation to quantify model robustness and detect information disclosure. The platfo
RLCard is an open-source framework for developing and evaluating reinforcement learning agents across multiple card game environments. It functions as a card game environment simulator, a multi-agent RL platform, and a benchmarking toolkit for algorithms like DQN, NFSP, and CFR. The framework provides a game-agnostic environment interface that decouples agent logic from game mechanics, allowing any policy to interact through a common API. It supports pluggable reinforcement learning algorithms that operate on this interface without modifying game logic, and includes a self-play training loop
This project is an educational resource and engineering guide for building, deploying, and optimizing large language model applications and production pipelines. It serves as a blueprint for cloud AI infrastructure, providing a framework for orchestrating inference endpoints, data warehouses, and scalable production environments. The repository provides specific implementation patterns for retrieval-augmented generation to ground model responses in external data. It includes a training workflow for crawling, structuring, and processing datasets to facilitate model fine-tuning, alongside an ev
Agenta is a Prompt Ops lifecycle manager and prompt management platform that decouples prompt engineering from application code. It serves as a centralized system for developing, versioning, and deploying prompt templates and model configurations across different environments. The platform functions as an AI agent orchestrator with a visual interface for building agent workflows and connecting models to external tools. It further acts as an evaluation framework and observability tool, utilizing OpenTelemetry to capture execution traces, monitor latency, and track token costs. The system cove
Isaac Lab is an open-source framework for training robot policies in physically simulated environments, supporting both single-agent and multi-agent reinforcement learning. It is built on an Omniverse-PhysX simulation backend that models rigid bodies, articulated systems, deformable objects, and sensors, and provides a task-based environment configuration system where each training environment is defined as a modular class specifying observation spaces, action spaces, reward functions, and termination conditions. The framework distinguishes itself through an RL-library abstraction layer that
This project is an educational repository of reinforcement learning agents and tutorials implemented using TensorFlow. It provides a practical codebase for both model-free and model-based learning agents, designed to demonstrate how AI agents learn through trial and error. The collection features detailed implementations of various algorithmic approaches, including Deep Q-Networks and Policy Gradient methods. It specifically covers Actor-Critic architectures for continuous and discrete action spaces, alongside Proximal Policy Optimization and Deep Deterministic Policy Gradients. The framewor
Dopamine is a reinforcement learning research framework designed for prototyping and testing algorithms across diverse simulated environments. It provides an agent development toolkit that utilizes a flat class hierarchy to facilitate the creation and extension of learning agents. The framework includes a standardization layer via environment wrappers that connect agents to various physics simulations and gaming environments. It also features a high-performance experience replay buffer for storing and sampling transition data to improve training stability, alongside a dedicated hyperparameter
Universe is a training and evaluation platform that transforms websites, games, and software into standardized environments for general intelligence agents. It functions as a reinforcement learning wrapper and remote environment orchestrator, providing a consistent interface to wrap diverse software for AI agent interaction. The platform distinguishes itself through a visual observation interface that streams real-time pixel data and transmits keyboard and mouse events to simulate human interaction. It utilizes a bi-directional communication protocol to deliver reward signals and performance
This project is a Python-based educational framework designed to simulate reinforcement learning algorithms and environments. It serves as a platform for reproducing classic textbook examples, allowing users to study agent behavior, policy improvement, and the fundamental mechanics of decision-making in controlled settings. The library provides implementations for core reinforcement learning concepts, including temporal difference learning, Monte Carlo episode sampling, and tabular value function approximation. It enables the analysis of specific algorithmic behaviors, such as identifying and
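For readers unfamiliar with the temporal difference learning and tabular value functions named in the entry above, a single TD(0) update can be sketched as follows. This is a generic textbook form, not code from that project; the function name, the dictionary-based value table, and the default step size and discount are assumptions made for the example.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular TD(0) update in place and return the TD error.

    `values` maps states to estimated returns; unseen states default to 0.0.
    """
    # Terminal transitions do not bootstrap from the next state.
    bootstrap = 0.0 if terminal else values.get(next_state, 0.0)
    current = values.get(state, 0.0)
    td_error = reward + gamma * bootstrap - current
    values[state] = current + alpha * td_error
    return td_error


if __name__ == "__main__":
    table = {"A": 0.0, "B": 1.0}
    error = td0_update(table, "A", reward=0.0, next_state="B", alpha=0.5, gamma=1.0)
    print(error, table["A"])
```

In the usage example the estimate for state A moves halfway toward the bootstrapped target of 1.0, ending at 0.5.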
DeepPavlov is a deep learning conversational AI framework designed for building end-to-end dialog systems and chatbots. It functions as an NLP model training library and a pipeline system that connects multiple natural language processing models into a single operational chain. The framework provides a REST API model server to expose trained deep learning models as web endpoints. This allows conversational agents to be deployed as web services that handle incoming HTTP requests and return predictions. The system covers the full lifecycle of conversational AI development, including NLP pipeli
This project is a transformer post-training toolkit and reinforcement learning library designed to align language model behavior with human preferences. It provides a framework for managing the transition from supervised fine-tuning to reinforcement learning and preference optimization. The library distinguishes itself through a specialized focus on preference optimization and reward modeling, enabling the adjustment of model outputs based on preferred versus rejected examples. It also includes capabilities for training agents within controlled sandbox environments using task suites and verif
TransformerLens is a library for mechanistic interpretability research designed to reverse engineer the learned algorithms within large language models. It provides a standardized framework for wrapping diverse transformer architectures, allowing researchers to extract, manipulate, and analyze internal activations and weights through a consistent interface. The project distinguishes itself through a comprehensive system of activation hooks that can capture, patch, and ablate internal tensors during the forward pass. It includes specialized utilities for decomposing fused projections, material
DouZero is a deep reinforcement learning framework and training system designed to teach digital agents to master complex card games. It provides the infrastructure to implement high-throughput reinforcement learning pipelines and evaluate the competitive success of game agents. The system utilizes a distributed actor-learner architecture that separates game simulation actors from GPU training devices to accelerate model convergence. It learns action values with Deep Monte Carlo methods, estimating returns from sampled self-play episodes rather than from tree search. The toolkit
LLaMA-Adapter is a parameter-efficient fine-tuning framework designed to adapt large language models using a minimal set of trainable parameters. It functions as an instruction tuning tool and a multimodal adapter, allowing pre-trained models to follow human instructions and process non-textual data. The project specializes in the integration of image, video, audio, and sensor data into language models for cross-modal understanding. It enables the customization of LLaMA models through the use of lightweight adapters, which allows for the extraction and storage of learned weights independently
mini-swe-agent is an autonomous software engineering system designed to develop features and fix bugs by combining large language models with a bash interface. It operates as an agentic framework that executes coding tasks and documentation updates through a continuous cycle of model reasoning and tool execution. The project differentiates itself with a strong focus on safety and evaluation, utilizing container-based sandbox execution via Docker or Singularity to isolate command execution. It includes a batch-parallel evaluation harness to measure code-fixing accuracy against standardized sof
This project is an instruction tuning framework and synthetic data generator that uses high-capacity teacher models to produce instruction-following pairs for training smaller student models. It provides datasets and tools for supervised instruction tuning and reinforcement learning from human feedback. The framework specializes in cross-lingual tuning, offering high-quality instruction-following examples in English and Chinese to improve model generalization across different scripts. It includes a reward modeling tool for creating preference datasets and comparative ratings used to train rew
RouteLLM is a routing framework and traffic manager designed to direct prompts between high-capability and low-cost large language models. It functions as an API gateway that mimics the OpenAI specification to route requests across different model providers. The system optimizes operational costs by splitting traffic between model tiers based on predicted win rates and prompt complexity. It includes a calibration tool to analyze sample queries and determine the optimal cost-quality tradeoff for traffic distribution. The framework provides a tool for measuring the accuracy and cost efficiency
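The threshold-style routing described in the RouteLLM entry can be illustrated with a small sketch. It assumes some router has already produced a predicted win rate for the stronger model on each query. The helper names `calibrate_threshold` and `route` are hypothetical and are not RouteLLM's API.

```python
def calibrate_threshold(win_rates, strong_fraction):
    """Pick a cutoff so roughly `strong_fraction` of sample queries use the strong model.

    `win_rates` are predicted probabilities that the strong model gives the
    better answer on each sample query.
    """
    if not win_rates:
        raise ValueError("need at least one sample query")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    ranked = sorted(win_rates, reverse=True)
    keep = min(round(strong_fraction * len(ranked)), len(ranked))
    if keep == 0:
        return float("inf")  # no query clears the bar, so everything goes to the weak model
    return ranked[keep - 1]


def route(win_rate, threshold):
    """Send a query to the strong model only when its predicted win rate clears the cutoff."""
    return "strong" if win_rate >= threshold else "weak"


if __name__ == "__main__":
    samples = [0.9, 0.2, 0.6, 0.4]
    cutoff = calibrate_threshold(samples, strong_fraction=0.5)
    print(cutoff, [route(w, cutoff) for w in samples])
```

Raising the cutoff lowers cost by sending more traffic to the cheaper model, at the risk of lower answer quality. Ties at the cutoff can push slightly more than the target fraction to the strong model.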
Baselines is a comprehensive suite of frameworks for reinforcement learning algorithm implementation, imitation learning, and training orchestration. It provides a library of standardized learning algorithms used to benchmark and replicate research results, alongside a deep learning policy framework for constructing neural network architectures such as multi-layer perceptrons, convolutional networks, and long short-term memory networks. The project includes a specialized imitation learning toolkit that enables agents to mimic expert behavior through behavior cloning and generative adversarial
OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO. The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism. The project
Stable-baselines3 is a reinforcement learning library built on the PyTorch deep learning framework. It provides a collection of reliable, standardized implementations of reinforcement learning algorithms designed for training, testing, and benchmarking agent policies in diverse simulated environments. The library functions as an agent training toolkit that emphasizes modularity and reproducibility. It features a unified environment interface and supports vectorized execution to accelerate data collection across multiple simulation instances. Users can customize neural network architectures, f
SLIME is a distributed reinforcement learning framework for large language model post-training that bridges Megatron training with SGLang inference servers. It orchestrates scalable RL loops across GPU clusters, decoupling training and inference into independent processes that communicate over HTTP and NCCL for independent scaling and fault tolerance. The system supports multi-agent reinforcement learning workflows with parallel agent instances, customizable rollout strategies, and personalized agent serving that improves models from prior conversations without disrupting API serving. The fra
This project is a collection of deep learning research implementations and a reproduction kit designed to translate theoretical AI papers into working code. It provides a library of neural network architectures and reference implementations for reproducing seminal research concepts through interactive notebooks. The repository distinguishes itself through the implementation of AI theory and scaling laws, covering complexity dynamics, information theory, and the simulation of universal AI agents. It also includes a benchmarking suite for synthetic reasoning, allowing for the evaluation of mode