OpenCompass

OpenCompass is an open-source framework for standardized benchmarking of large language models. It provides a configurable evaluation pipeline that supports both objective and subjective assessment, using a dual-engine architecture to handle closed-form answer comparison and open-ended response rating. The framework is designed as a modular platform where datasets, models, and metrics are composed through declarative Python configuration files.
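As a rough illustration of composing models, datasets, and metrics declaratively, the sketch below expands a small configuration into one evaluation task per model and dataset pair. The keys (`abbr`, `type`, `metric`, `mode`), the model paths, and the `plan_tasks` helper are all assumptions made for this example and are not OpenCompass's actual configuration schema.

```python
from itertools import product

# Hypothetical configuration: every key and value here is illustrative only.
config = {
    "models": [
        {"abbr": "model-a", "type": "huggingface", "path": "org/model-a"},
        {"abbr": "model-b", "type": "api", "path": "provider/model-b"},
    ],
    "datasets": [
        {"abbr": "math-demo", "metric": "accuracy", "mode": "objective"},
        {"abbr": "chat-demo", "metric": "judge_score", "mode": "subjective"},
    ],
}


def plan_tasks(cfg):
    """Expand a config into one evaluation task per (model, dataset) pair."""
    return [
        {
            "model": m["abbr"],
            "dataset": d["abbr"],
            "metric": d["metric"],
            "mode": d["mode"],
        }
        for m, d in product(cfg["models"], cfg["datasets"])
    ]


tasks = plan_tasks(config)
for task in tasks:
    print(task)
```

Keeping the configuration as plain data means that adding a model or a dataset only extends a list; the task planner itself does not change.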

The framework distinguishes itself through its extensible model integration layer, which supports custom models, HuggingFace models, and third-party API services through a common subclassing interface. It includes an automated judge system that delegates subjective scoring to a separate LLM evaluator, enabling quality assessment of open-ended outputs. A single-command benchmark suite runner allows executing predefined evaluation sets against any integrated model.
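The common subclassing interface can be pictured with a minimal sketch like the one below. The class names `ModelSketch` and `EchoAPIModel`, the constructor arguments, and the method signatures are assumptions chosen for illustration; only the idea that a model implements a generation method and a token-length method comes from the feature descriptions on this page. The stand-in model makes no network calls.

```python
from abc import ABC, abstractmethod
from typing import List


class ModelSketch(ABC):
    """Hypothetical common interface; not OpenCompass's actual base class."""

    def __init__(self, path: str, max_seq_len: int = 2048):
        self.path = path
        self.max_seq_len = max_seq_len

    @abstractmethod
    def generate(self, prompts: List[str], max_out_len: int) -> List[str]:
        """Return one completion per prompt."""

    @abstractmethod
    def get_token_len(self, prompt: str) -> int:
        """Return the (approximate) number of tokens in the prompt."""


class EchoAPIModel(ModelSketch):
    """Stand-in for a third-party API wrapper; it only transforms the prompt."""

    def generate(self, prompts: List[str], max_out_len: int) -> List[str]:
        # A real wrapper would send an authenticated request here.
        return [p.upper()[:max_out_len] for p in prompts]

    def get_token_len(self, prompt: str) -> int:
        # Whitespace splitting is a crude proxy for a real tokenizer.
        return len(prompt.split())


model = EchoAPIModel(path="provider/echo")
outputs = model.generate(["hello world"], max_out_len=5)
print(outputs, model.get_token_len("hello world"))
```

Because the harness would only depend on the two abstract methods, a local HuggingFace model and a remote API could be swapped without touching the evaluation code.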

The evaluation surface covers multiple capability dimensions, including examination, knowledge, reasoning, understanding, language, and safety. Specific assessment areas include agentic tool use, code generation, mathematical ability, instruction following, and language proficiency. Each dataset declares its own scoring function and post-processing steps, allowing per-task custom metrics. The framework supports evaluating base models, chat models, and API-deployed models through its configurable harness.
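The idea of a dataset declaring its own post-processing step and scoring function can be sketched as follows. The names `first_option_postprocess`, `accuracy`, and the `dataset_cfg` keys are hypothetical and chosen for this example; they do not claim to match any function shipped with the framework.

```python
import re
from typing import List


def first_option_postprocess(text: str) -> str:
    """Return the first standalone capital letter A-D in the text, or ''.

    Limitation of this sketch: an article such as "A" at the start of a
    sentence would also be picked up as an option letter.
    """
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else ""


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical per-dataset declaration bundling both steps.
dataset_cfg = {
    "abbr": "mcq-demo",
    "postprocess": first_option_postprocess,
    "score": accuracy,
}

raw_outputs = ["The answer is B.", "I think (C) is right", "no idea"]
references = ["B", "A", "D"]

cleaned = [dataset_cfg["postprocess"](o) for o in raw_outputs]
result = dataset_cfg["score"](cleaned, references)
print(cleaned, result)
```

Attaching both callables to the dataset entry lets a multiple-choice task and, say, a code-generation task use entirely different extraction and scoring logic under the same harness.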

Features

LLM Evaluation Frameworks - Provides an open-source framework for standardized benchmarking of large language models across diverse capabilities and datasets.

AI API Adapters - Provides a common adapter layer for wrapping third-party API models with authentication and request formatting.

Automated Model Judges - Scores open-ended model outputs using automated judge models or human-like rating scales for quality assessment.

Benchmarking Suites - Provides a comprehensive evaluation suite covering examination, reasoning, language, knowledge, and safety dimensions for LLMs.

Custom Model Integrations - Extends the evaluation framework to support user-defined models, third-party APIs, or HuggingFace models through a plug-in interface.

Evaluation Engines - Operates two parallel evaluation engines for closed-form answer comparison and open-ended response rating.

Generative Model Evaluation - Compares model outputs to standard answers using discriminative and generative methods with prompt engineering.

LLM Benchmarking - Runs standardized evaluation suites across multiple model types and datasets from a single configuration file.

Model Capability Assessment - Measures model performance across examination, knowledge, reasoning, understanding, language, and safety dimensions.

Extension Interfaces - Ships a base class interface that new models implement to integrate with the evaluation pipeline.

Model Benchmarking Suites - Runs predefined evaluation datasets against a model by executing a single configuration file.

Model Evaluation Frameworks - Integrates user-defined models into the evaluation pipeline by following the framework's extension interface.

Dual-Engine Evaluation Pipelines - Provides a configurable pipeline for running objective and subjective evaluations on both open-source and API-based language models.

Model Prediction Evaluation - Compares model predictions against ground-truth answers using discriminative and generative methods with prompt engineering.

LLM Capability Dimensions - Assesses model performance across examination, knowledge, reasoning, understanding, language, and safety dimensions.

Third-Party Model Integration - Supports integrating third-party, HuggingFace, and API-based models through a common subclassing interface.

Chat Model Evaluations - Ships structured tests for evaluating conversational ability and human alignment of instruction-tuned chat models.

Configuration-Driven Orchestrators - Defines the entire evaluation workflow in a single Python configuration file specifying datasets, models, judges, and output paths.

Pipeline Component Modularization - Composes evaluation tasks by chaining configurable dataset, model, and metric modules through a declarative Python configuration pipeline.

LLM-As-A-Judge Scoring - Delegates subjective scoring to a separate LLM judge configured as a pluggable evaluator module.

Agent Evaluation Tools - Measures a model's ability in complex tool calling and code interpreter usage for data science and mathematics.

Domain Knowledge Evaluations - Measures model knowledge across science, engineering, and humanities using objective evaluation.

Dataset-Scoped Metrics - Allows each dataset to declare its own scoring function and post-processing steps for custom evaluation.

Satisfaction Score Collectors - Collects human or model-simulated satisfaction scores for open-ended responses to gauge real-world quality.

Instruction Following Evaluations - Provides subjective evaluations measuring how accurately models follow complex instructions.

API-Deployed Evaluations - Deploys a downloaded model as an API service and runs benchmark evaluations against it.

HuggingFace Evaluations - Configures a HuggingFace model for evaluation by specifying its path, tokenizer, and inference parameters.

Per-Dataset Metric Configurators - Allows configuring custom scoring functions and post-processing steps for each evaluation dataset.

API Service Evaluations - Connects to third-party API services such as OpenAI or ChatGLM to run model evaluations.

Visual Data Reasoning Evaluation - Measures logical, common-sense, and tabular reasoning skills through subjective evaluation.

LLM Language Proficiency Evaluations - Measures model ability in information extraction, summarization, dialogue, and creative writing using subjective evaluation.

Mathematical Reasoning Evaluations - Measures numerical computation and problem-solving skills at high school and university levels using objective evaluation.

API Model Integration - Extends the evaluation framework by subclassing a base API model and implementing generation and token-length methods.

Code Generation Evaluators - Measures code generation correctness and quality using both objective and subjective evaluation methods.

Model Evaluation - Platform supporting evaluation across diverse models and datasets.

Model Evaluation and Benchmarking - Evaluation platform supporting a wide range of foundation models.

Natural Language Processing - Listed in the “Natural Language Processing” section of the FunNLP awesome list.

Evaluation Benchmarks - Comprehensive evaluation platform for language and multimodal models.
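The LLM-as-a-judge scoring listed among the features above can be illustrated with a small sketch. The prompt template, the 1 to 10 scale, the `Rating:` reply format, and the `fake_judge` stand-in are all assumptions for this example; a real setup would call a separate judge model where `fake_judge` appears.

```python
import re
from typing import Callable, Optional

# Hypothetical judge prompt; the wording and scale are illustrative only.
JUDGE_TEMPLATE = (
    "Rate the response to the question on a scale from 1 to 10.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply in the form 'Rating: <number>'."
)


def parse_rating(judge_reply: str) -> Optional[int]:
    """Extract an integer rating in [1, 10] from the judge's reply, else None."""
    match = re.search(r"Rating:\s*(\d+)", judge_reply)
    if not match:
        return None
    rating = int(match.group(1))
    return rating if 1 <= rating <= 10 else None


def judge_response(
    question: str, response: str, judge: Callable[[str], str]
) -> Optional[int]:
    """Build the judge prompt, query the judge, and parse its rating."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    return parse_rating(judge(prompt))


def fake_judge(prompt: str) -> str:
    """Stand-in for a judge LLM; always returns the same canned reply."""
    return "The response is clear and correct. Rating: 8"


score = judge_response("What is 2 + 2?", "4", fake_judge)
print(score)
```

Passing the judge in as a plain callable keeps it pluggable: any model wrapper that maps a prompt string to a reply string could take its place, and unparseable replies surface as `None` instead of a silent default score.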

open-compass/opencompass
