OpenCompass

OpenCompass is an open-source framework for standardized benchmarking of large language models. It provides a configurable evaluation pipeline that supports both objective and subjective assessment, using a dual-engine architecture to handle closed-form answer comparison and open-ended response rating. The framework is designed as a modular platform where datasets, models, and metrics are composed through declarative Python configuration files.
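As a rough illustration of how datasets, models, and metrics can be composed from a declarative configuration, the sketch below uses plain Python dictionaries and two small registries. Every name in it (CONFIG, MODEL_REGISTRY, METRIC_REGISTRY, run) is hypothetical and chosen for this example; it does not reproduce OpenCompass's actual configuration schema.

```python
from typing import Callable, Dict, List

# Hypothetical declarative config: keys and values are illustrative only,
# not OpenCompass's real schema.
CONFIG = {
    "models": [{"name": "echo-upper", "type": "upper"}],
    "datasets": [
        {
            "name": "toy-capitals",
            "items": [
                {"question": "paris", "answer": "PARIS"},
                {"question": "rome", "answer": "ROME"},
            ],
            "metric": "exact_match",
        }
    ],
}

# Registries map the string keys used in the config to callables.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "upper": lambda prompt: prompt.upper(),  # toy "model": uppercases the prompt
}
METRIC_REGISTRY: Dict[str, Callable[[List[str], List[str]], float]] = {
    "exact_match": lambda preds, refs: sum(p == r for p, r in zip(preds, refs)) / len(refs),
}


def run(config: dict) -> Dict[str, float]:
    """Evaluate every configured model on every configured dataset."""
    results: Dict[str, float] = {}
    for model_cfg in config["models"]:
        model = MODEL_REGISTRY[model_cfg["type"]]
        for ds in config["datasets"]:
            preds = [model(item["question"]) for item in ds["items"]]
            refs = [item["answer"] for item in ds["items"]]
            metric = METRIC_REGISTRY[ds["metric"]]
            results[f'{model_cfg["name"]}/{ds["name"]}'] = metric(preds, refs)
    return results


scores = run(CONFIG)
print(scores)
```

Keeping the configuration as data and resolving names through registries means a new dataset or metric can be added without touching the runner.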

The framework distinguishes itself through its extensible model integration layer, which supports custom models, HuggingFace models, and third-party API services through a common subclassing interface. It includes an automated judge system that delegates subjective scoring to a separate LLM evaluator, enabling quality assessment of open-ended outputs. A single-command benchmark suite runner allows executing predefined evaluation sets against any integrated model.
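A common subclassing interface of this kind might look like the minimal sketch below. The base class, its two methods (generation and token length), and both subclasses are assumptions made for illustration; they are not OpenCompass's actual classes, and the API-style model makes no network call.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseModel(ABC):
    """Hypothetical base class that every integrated model implements."""

    @abstractmethod
    def generate(self, prompts: List[str], max_out_len: int) -> List[str]:
        """Return one completion per prompt, truncated to max_out_len characters."""

    @abstractmethod
    def get_token_len(self, prompt: str) -> int:
        """Return an estimate of the prompt's token count."""


class ReverseModel(BaseModel):
    """Toy stand-in for a custom local model: reverses each prompt."""

    def generate(self, prompts: List[str], max_out_len: int) -> List[str]:
        return [p[::-1][:max_out_len] for p in prompts]

    def get_token_len(self, prompt: str) -> int:
        return len(prompt.split())  # whitespace split as a crude tokenizer


class FakeAPIModel(BaseModel):
    """Toy stand-in for an API-backed model; it only stores the key."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key  # a real adapter would send this with each request

    def generate(self, prompts: List[str], max_out_len: int) -> List[str]:
        return [f"response to: {p}"[:max_out_len] for p in prompts]

    def get_token_len(self, prompt: str) -> int:
        return len(prompt.split())


def run_inference(model: BaseModel, prompts: List[str], max_out_len: int = 32) -> List[str]:
    """The harness depends only on the base interface, not on the concrete model."""
    return model.generate(prompts, max_out_len)


print(run_inference(ReverseModel(), ["abc"]))
print(run_inference(FakeAPIModel("dummy-key"), ["hi"]))
```

Because the harness calls only the two abstract methods, local and API-backed models are interchangeable from the pipeline's point of view.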

The evaluation surface covers multiple capability dimensions, including examination, knowledge, reasoning, understanding, language, and safety. Specific assessment areas include agentic tool use, code generation, mathematical ability, instruction following, and language proficiency. Each dataset declares its own scoring function and post-processing steps, allowing per-task custom metrics. The framework supports evaluating base models, chat models, and API-deployed models through its configurable harness.
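To make the idea of per-dataset scoring and post-processing concrete, here is a small sketch in which each dataset entry names its own post-processor and scorer. The dataset names, the DATASETS table, and the helper functions are all hypothetical and are not taken from OpenCompass.

```python
import re
from typing import List


def last_number(text: str) -> str:
    """Post-processor: keep the last integer or decimal found in the output."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else ""


def first_option_letter(text: str) -> str:
    """Post-processor: keep the first standalone option letter A-D."""
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else ""


def accuracy(preds: List[str], refs: List[str]) -> float:
    """Scorer: fraction of predictions that equal their reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


# Hypothetical per-dataset declarations: each dataset picks its own steps.
DATASETS = {
    "toy-math": {"postprocess": last_number, "scorer": accuracy},
    "toy-mcq": {"postprocess": first_option_letter, "scorer": accuracy},
}


def score(dataset: str, raw_outputs: List[str], refs: List[str]) -> float:
    """Apply the dataset's post-processor, then its scorer."""
    spec = DATASETS[dataset]
    preds = [spec["postprocess"](output) for output in raw_outputs]
    return spec["scorer"](preds, refs)


math_score = score("toy-math", ["2 + 2 = 4", "The answer is 7.5"], ["4", "7.5"])
mcq_score = score("toy-mcq", ["I pick B.", "Answer: C"], ["B", "D"])
print(math_score, mcq_score)
```

The same raw model output can score differently depending on the post-processor, which is why tying both steps to the dataset rather than to the model keeps results comparable across models.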

Features

LLM Evaluation Frameworks - Provides an open-source framework for standardized benchmarking of large language models across diverse capabilities and datasets.
AI API Adapters - Provides a common adapter layer for wrapping third-party API models with authentication and request formatting.
Automated Model Judges - Scores open-ended model outputs using automated judge models or human-like rating scales for quality assessment.
Benchmarking Suites - Provides a comprehensive evaluation suite covering examination, reasoning, language, knowledge, and safety dimensions for LLMs.
Custom Model Integrations - Extends the evaluation framework to support user-defined models, third-party APIs, or HuggingFace models through a plug-in interface.
Evaluation Engines - Operates two parallel evaluation engines for closed-form answer comparison and open-ended response rating.
Generative Model Evaluation - Compares model outputs to standard answers using discriminative and generative methods with prompt engineering.
LLM Benchmarking - Runs standardized evaluation suites across multiple model types and datasets from a single configuration file.
Model Capability Assessment - Measures model performance across examination, knowledge, reasoning, understanding, language, and safety dimensions.
Extension Interfaces - Ships a base class interface that new models implement to integrate with the evaluation pipeline.
Model Benchmarking Suites - Runs predefined evaluation datasets against a model by executing a single configuration file.
Model Evaluation Frameworks - Integrates user-defined models into the evaluation pipeline by following the framework's extension interface.
Dual-Engine Evaluation Pipelines - Provides a configurable pipeline for running objective and subjective evaluations on both open-source and API-based language models.
Model Prediction Evaluation - Compares model predictions against ground-truth answers using discriminative and generative methods with prompt engineering.
LLM Capability Dimensions - Assesses model performance across examination, knowledge, reasoning, understanding, language, and safety dimensions.
Third-Party Model Integration - Supports integrating third-party, HuggingFace, and API-based models through a common subclassing interface.
Chat Model Evaluations - Ships structured tests for evaluating conversational ability and human alignment of instruction-tuned chat models.
Configuration-Driven Orchestrators - Defines the entire evaluation workflow in a single Python configuration file specifying datasets, models, judges, and output paths.
Pipeline Component Modularization - Composes evaluation tasks by chaining configurable dataset, model, and metric modules through a declarative Python configuration pipeline.
LLM-As-A-Judge Scoring - Delegates subjective scoring to a separate LLM judge configured as a pluggable evaluator module.
Agent Evaluation Tools - Measures a model's ability in complex tool calling and code interpreter usage for data science and mathematics.
Domain Knowledge Evaluations - Measures model knowledge across science, engineering, and humanities using objective evaluation.
Dataset-Scoped Metrics - Allows each dataset to declare its own scoring function and post-processing steps for custom evaluation.
Satisfaction Score Collectors - Collects human or model-simulated satisfaction scores for open-ended responses to gauge real-world quality.
Instruction Following Evaluations - Provides subjective evaluations measuring how accurately models follow complex instructions.
API-Deployed Evaluations - Deploys a downloaded model as an API service and runs benchmark evaluations against it.
HuggingFace Evaluations - Configures a HuggingFace model for evaluation by specifying its path, tokenizer, and inference parameters.
Per-Dataset Metric Configurators - Allows configuring custom scoring functions and post-processing steps for each evaluation dataset.
API Service Evaluations - Connects to third-party API services such as OpenAI or ChatGLM to run model evaluations.
Visual Data Reasoning Evaluation - Measures logical, common-sense, and tabular reasoning skills through subjective evaluation.
LLM Language Proficiency Evaluations - Measures model ability in information extraction, summarization, dialogue, and creative writing using subjective evaluation.
Mathematical Reasoning Evaluations - Measures numerical computation and problem-solving skills at high school and university levels using objective evaluation.
API Model Integration - Extends the evaluation framework by subclassing a base API model and implementing generation and token-length methods.
Code Generation Evaluators - Measures code generation correctness and quality using both objective and subjective evaluation methods.
Model Evaluation - Comprehensive platform for evaluating models across numerous datasets.
Model Evaluation and Benchmarking - Evaluation platform supporting a wide range of foundation models.
Natural Language Processing - Listed in the “Natural Language Processing” section of the FunNLP awesome list.
Evaluation Benchmarks - Comprehensive evaluation platform for language and multimodal models.
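
The split between closed-form comparison and open-ended rating that several of the features above describe can be sketched as two small engines behind one dispatcher. This is an assumption-laden illustration: the function names are hypothetical, and toy_judge is a deterministic length heuristic standing in for a separate LLM judge, which a real setup would prompt for a rating.

```python
from typing import Callable, Dict, List


def objective_engine(preds: List[str], refs: List[str]) -> float:
    """Closed-form comparison: case- and whitespace-insensitive exact match."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs))
    return hits / len(refs)


def toy_judge(question: str, response: str) -> int:
    """Stand-in for an LLM judge: rates 1 to 5 from response length alone."""
    return min(5, 1 + len(response.split()) // 3)


def subjective_engine(
    questions: List[str],
    responses: List[str],
    judge: Callable[[str, str], int],
) -> float:
    """Open-ended rating: average the judge's score over all responses."""
    ratings = [judge(q, r) for q, r in zip(questions, responses)]
    return sum(ratings) / len(ratings)


def evaluate(task: Dict) -> float:
    """Route a task to the engine that matches its declared kind."""
    if task["kind"] == "objective":
        return objective_engine(task["preds"], task["refs"])
    if task["kind"] == "subjective":
        return subjective_engine(task["questions"], task["responses"], task["judge"])
    raise ValueError(f"unknown task kind: {task['kind']}")


objective_score = evaluate(
    {"kind": "objective", "preds": [" Paris", "berlin"], "refs": ["paris", "Madrid"]}
)
subjective_score = evaluate(
    {
        "kind": "subjective",
        "questions": ["Say hi", "Describe a cat"],
        "responses": ["ok", "a small furry animal that purrs"],
        "judge": toy_judge,
    }
)
print(objective_score, subjective_score)
```

Passing the judge in as a callable keeps the subjective engine independent of which evaluator model is used.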

InternLM/opencompass

7,096 stars on GitHub

OpenCompass is a comprehensive evaluation platform, benchmarking suite, and distributed model evaluator designed to measure the performance and accuracy of large language models. It provides a framework for benchmarking both open-source and API-based models against diverse datasets using standardized metrics and reproducible pipelines. The project features an automated judging framework that uses language models as judges to score and verify the quality of generated text. It includes a performance leaderboard system for comparing the relative capabilities of various models across industry-sta

openai/simple-evals

4,354 stars on GitHub

This project is a language model evaluation framework and benchmarking tool designed to measure the accuracy and performance of models across diverse datasets. It provides a system for implementing model-based graders, running standardized tests for mathematical reasoning, coding, and factuality, and calculating quantified performance metrics such as precision, recall, F1 scores, and pass-at-k. The framework utilizes model-based grading and rubrics to validate response quality against expert-defined criteria. It includes a multi-model benchmarking loop and a model-agnostic API interface to co

Arize-ai/phoenix

8,605 stars on GitHub

Arize Phoenix is an LLM observability platform and evaluation framework designed to capture execution traces and monitor large language model applications. It serves as a prompt management system for versioning and testing templates, and as a self-hosted AI operations infrastructure for managing telemetry and experiments. The platform differentiates itself through a specialized embedding visualization tool used to detect data drift and optimize vector search. It provides a comprehensive evaluation suite that utilizes judge-based evaluators and ground-truth datasets to score model outputs, and

Giskard-AI/giskard

5,434 stars on GitHub

Giskard is an evaluation framework, testing library, and quality monitoring system for large language models and AI agents. It serves as a toolkit for quantifying model performance and reliability, providing specialized capabilities for validating retrieval-augmented generation pipelines. The project distinguishes itself through an automated red teaming tool and security scanner designed to identify vulnerabilities, prompt injections, and safety risks. It utilizes adversarial probing and synthetic edge case generation to quantify model robustness and detect information disclosure. The platfo

open-compass/opencompass
