Simple Evals

This project is a language model evaluation framework and benchmarking tool designed to measure the accuracy and performance of models across diverse datasets. It provides a system for implementing model-based graders, running standardized tests for mathematical reasoning, coding, and factuality, and calculating quantified performance metrics such as precision, recall, F1 scores, and pass-at-k.

The framework utilizes model-based grading and rubrics to validate response quality against expert-defined criteria. It includes a multi-model benchmarking loop and a model-agnostic API interface to collect and contrast performance metrics across different providers in a standardized way.

The tool covers a broad range of domain benchmarking, including code correctness verification via deterministic execution, medical knowledge accuracy, and general knowledge testing. It also supports multilingual assessment to measure consistency and reasoning across different languages. Scoring is handled through rubric-based logic, ground-truth comparison engines, and length-based penalties to discourage verbosity.

Features

Language Model Benchmark Suites - Runs a predefined set of standardized tests to measure language model accuracy across tasks like reasoning, math, and coding.

Unified Provider Interfaces - Generates responses from OpenAI and Anthropic APIs through a unified interface for standardized evaluation.

Criteria-Based Scoring Engines - Sums points for criteria met, subtracts penalties, then divides by total possible points to compute a fractional score.

Ground-Truth Scoring - Measures accuracy by comparing model outputs against curated answer keys using exact or fuzzy matching.

Generative Model Sampling - Provides a unified interface for sampling responses from OpenAI and Claude language models.

Language Model Response Generators - Sends conversation histories to language models and returns generated text with query metadata.

Model Benchmarks - Runs a suite of standardized benchmarks to measure language model accuracy on reasoning, math, and coding.

Multi-Provider Sampling Interfaces - Samples from OpenAI and Anthropic APIs through a unified interface for standardized evaluation.

Model Benchmarking Tools - Runs standardized tests across multiple language model providers to compare performance metrics.

Model Evaluation Frameworks - Provides a unified framework for running model inference and validation across standard language model benchmarks.

Model Evaluation and Benchmarking - Measures language model performance on standardized benchmarks including math, coding, and factuality tasks.

Standardized Benchmarks - Runs a suite of standardized benchmarks against a language model and reports accuracy scores for each test.

Rubric-Based Graders - Implements a rubric-driven grading system that automatically scores model responses against expert-defined criteria.

Rubric-Based Evaluators - Grades model completions against predefined rubric criteria, returning a boolean for each criterion met.

Code Generation Benchmarks - Assesses model ability to produce correct code from natural language descriptions using established coding benchmarks.

Code Generation Evaluators - Measures how often language models produce functionally correct code by running completions against test suites.

Code Correctness Testings - Measures how often language models produce functionally correct code by running generated completions against test suites.

Code Correctness Verifications - Verifies code correctness by running generated completions against predefined test suites with parallel execution.

Domain Knowledge Evaluations - Measures model performance on specialized benchmarks including health, browsing comprehension, and graduate-level QA.

Medical Knowledge Assessments - Evaluates language model accuracy on medical domain questions using specialized datasets like HealthBench.

Evaluation Report Aggregators - Outputs per-evaluation HTML reports and JSON metric files for each combination of model sampler and language variant.

Verbosity Penalties - Reduces scores proportionally when responses exceed a configurable character threshold with per-unit penalties.

CLI Evaluation Runners - Launches benchmark evaluations for a specified model through a simple command-line invocation.

Multilingual Accuracy Evaluations - Measures language model consistency and reasoning accuracy across multiple languages using translated benchmark datasets.

Parallel Evaluators - Executes multiple code correctness checks concurrently using a thread pool to speed up batch evaluation.

Factuality Benchmarking Frameworks - Tests model precision on fact-based question answering tasks against curated knowledge benchmarks.

Benchmark Translations - Tests model consistency across languages by running translated versions of standard benchmarks like MMLU.

Performance Metrics - Calculates quantified metrics such as precision, recall, F1 scores, and pass-at-k from evaluation results.

Pass-at-K Calculators - Calculates the probability that at least one of K generated code samples passes all tests for robust accuracy measurement.

Pass-at-K Statistical Scorings - Calculates the probability that at least one of K generated samples passes all tests for robust accuracy measurement.

Chain-of-Thought Evaluations - Tests models using simple instructions without few-shot examples to better reflect realistic usage.

Medical Knowledge Assessors - Ships a dedicated medical knowledge evaluator using the HealthBench dataset for accuracy scoring.

Language Model Math Evaluations - Measures language model accuracy on mathematical reasoning tasks by prompting step-by-step solving and comparing answers against ground truth.

Mathematical Reasoning Evaluations - Measures language model accuracy on mathematical problem-solving benchmarks using standardized test sets.

Healthcare Knowledge Benchmarks - Runs models against the HealthBench dataset and reports accuracy scores for medical knowledge assessment.

Evaluation Frameworks - Evaluation tools provided by OpenAI.

Web and Environment Benchmarks - Challenging benchmark for browsing agents.

openaisimple-evals

Features

Star history