OpenCompass is an open-source framework for standardized benchmarking of large language models. It provides a configurable evaluation pipeline that supports both objective and subjective assessment through a dual-engine architecture: one engine compares closed-form answers against references, and the other rates open-ended responses. The framework is designed as a modular platform where datasets, models, and metrics are composed through declarative Python configuration files.
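To make the idea of declarative composition concrete, here is a minimal sketch of a config that names models, datasets, and metrics, plus a tiny runner that resolves those names through registries. It is written as a plain Python dictionary for illustration only. Every name in it (`MODELS`, `DATASETS`, `METRICS`, `CONFIG`, `expand`, `run`) is hypothetical and does not mirror OpenCompass's actual API or config schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: all names here are hypothetical; this is not OpenCompass's real API.


@dataclass
class Task:
    """One (model, dataset, metric) combination to evaluate."""
    model: str
    dataset: str
    metric: str


# Registries map the string names used in a config to concrete objects.
MODELS: Dict[str, Callable[[str], str]] = {
    "upper": lambda prompt: prompt.upper(),  # toy model: upper-cases the prompt
    "echo": lambda prompt: prompt,           # toy model: returns the prompt as-is
}
DATASETS: Dict[str, List[dict]] = {
    "toy_qa": [
        {"prompt": "yes", "answer": "YES"},
        {"prompt": "no", "answer": "NO"},
    ],
}
METRICS: Dict[str, Callable[[List[str], List[str]], float]] = {
    "exact_match": lambda preds, refs: sum(p == r for p, r in zip(preds, refs)) / len(refs),
}

# The declarative part: only names and their pairing, no evaluation logic.
CONFIG = {
    "models": ["upper", "echo"],
    "datasets": [{"name": "toy_qa", "metric": "exact_match"}],
}


def expand(config: dict) -> List[Task]:
    """Take the cross product of the models and datasets declared in the config."""
    return [
        Task(m, d["name"], d["metric"])
        for m in config["models"]
        for d in config["datasets"]
    ]


def run(task: Task) -> float:
    """Resolve names through the registries, generate predictions, and score them."""
    model = MODELS[task.model]
    rows = DATASETS[task.dataset]
    preds = [model(row["prompt"]) for row in rows]
    refs = [row["answer"] for row in rows]
    return METRICS[task.metric](preds, refs)


results = {(t.model, t.dataset): run(t) for t in expand(CONFIG)}
```

Keeping the config free of logic means a new model or dataset can be added by registering it and listing its name, without touching the runner.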
The framework distinguishes itself through its extensible model integration layer, which supports custom models, HuggingFace models, and third-party API services through a common subclassing interface. It includes an automated judge system that delegates subjective scoring to a separate LLM evaluator, enabling quality assessment of open-ended outputs. A single-command runner executes predefined benchmark suites against any integrated model.
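The following sketch illustrates the two ideas in this paragraph: a common base class that different model backends subclass, and a judge that is itself just another model receiving a grading prompt. The class names, the `generate` signature, the prompt template, and the `Score: N` reply format are all assumptions made for illustration; they are not taken from OpenCompass.

```python
import re
from abc import ABC, abstractmethod
from typing import Callable, List, Optional

# NOTE: hypothetical interface for illustration; not OpenCompass's real classes.


class BaseModel(ABC):
    """Common interface: every backend turns a batch of prompts into text."""

    @abstractmethod
    def generate(self, prompts: List[str]) -> List[str]:
        ...


class LocalToyModel(BaseModel):
    """Stand-in for a locally hosted model; it simply reverses each prompt."""

    def generate(self, prompts: List[str]) -> List[str]:
        return [p[::-1] for p in prompts]


class ApiToyModel(BaseModel):
    """Stand-in for a remote API model; `send` is an injected transport function."""

    def __init__(self, send: Callable[[str], str]):
        self._send = send

    def generate(self, prompts: List[str]) -> List[str]:
        return [self._send(p) for p in prompts]


class KeywordJudge(BaseModel):
    """Stand-in judge: rewards responses that contain an explanation keyword."""

    def generate(self, prompts: List[str]) -> List[str]:
        return ["Score: 5" if "because" in p else "Score: 2" for p in prompts]


JUDGE_TEMPLATE = "Question: {question}\nResponse: {response}\nRate the response from 1 to 5."


def judge_scores(judge: BaseModel, questions: List[str], responses: List[str]) -> List[Optional[int]]:
    """Build one grading prompt per response and parse the judge's numeric score."""
    prompts = [
        JUDGE_TEMPLATE.format(question=q, response=r)
        for q, r in zip(questions, responses)
    ]
    scores: List[Optional[int]] = []
    for reply in judge.generate(prompts):
        match = re.search(r"Score:\s*(\d+)", reply)
        scores.append(int(match.group(1)) if match else None)  # None if unparseable
    return scores


questions = ["Why is the sky blue?"]
api_model = ApiToyModel(send=lambda p: "It is blue because of scattering.")
local_model = LocalToyModel()

api_scores = judge_scores(KeywordJudge(), questions, api_model.generate(questions))
local_scores = judge_scores(KeywordJudge(), questions, local_model.generate(questions))
```

Because the judge shares the same interface as the candidate models, any integrated backend could in principle serve as the evaluator.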
The evaluation surface covers multiple capability dimensions, including examination, knowledge, reasoning, understanding, language, and safety. Specific assessment areas include agentic tool use, code generation, mathematical ability, instruction following, and language proficiency. Each dataset declares its own scoring function and post-processing steps, allowing per-task custom metrics. The framework supports evaluating base models, chat models, and API-deployed models through its configurable harness.
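As a rough illustration of how a dataset can carry its own post-processing and scoring, the sketch below bundles both into a small dataset specification and applies them in a generic `evaluate` function. `DatasetSpec`, `first_option`, `last_number`, and `evaluate` are hypothetical names chosen for this example, and the toy outputs are made up to exercise the code; none of this is drawn from OpenCompass itself.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# NOTE: hypothetical structure for illustration; not OpenCompass's real API.


@dataclass
class DatasetSpec:
    """A dataset declares how raw model output is cleaned and how it is scored."""
    name: str
    postprocess: Callable[[str], str]
    score: Callable[[List[str], List[str]], float]


def first_option(text: str) -> str:
    """Extract the first standalone option letter A-D from free-form text."""
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else ""


def last_number(text: str) -> str:
    """Extract the last number in the text, a common heuristic for math answers."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""


def accuracy(preds: List[str], refs: List[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


MULTIPLE_CHOICE = DatasetSpec("toy_mcq", first_option, accuracy)
MATH = DatasetSpec("toy_math", last_number, accuracy)


def evaluate(spec: DatasetSpec, raw_outputs: List[str], references: List[str]) -> float:
    """Apply the dataset's own post-processing, then its own scoring function."""
    preds = [spec.postprocess(output) for output in raw_outputs]
    return spec.score(preds, references)


mcq_score = evaluate(MULTIPLE_CHOICE, ["The answer is B.", "I think C"], ["B", "D"])
math_score = evaluate(MATH, ["2 + 2 = 4", "So the total is 12 apples"], ["4", "12"])
```

The harness stays generic: it never needs to know whether a task is multiple choice or arithmetic, only that each dataset supplies a cleaner and a scorer.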