# EleutherAI/lm-evaluation-harness

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/eleutherai-lm-evaluation-harness).**

11,460 stars · 3,051 forks · Python · MIT

## Links

- GitHub: https://github.com/EleutherAI/lm-evaluation-harness
- Homepage: https://www.eleuther.ai
- awesome-repositories: https://awesome-repositories.com/repository/eleutherai-lm-evaluation-harness.md

## Topics

`evaluation-framework` `language-model` `transformer`

## Description

This project is a standardized framework for benchmarking large language models across a wide range of academic and reasoning datasets. It runs automated evaluation tasks that measure model accuracy and performance, and a structured configuration schema keeps assessments consistent from one run to the next.

The framework includes a dedicated data decontamination utility, which identifies and removes evaluation samples that overlap with training data to prevent data leakage. It also features a flexible task builder that lets users define custom benchmarks by specifying data sources, prompt structures, and modular scoring metrics.
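
The decontamination idea can be illustrated with a minimal, self-contained sketch. It is not the harness's own implementation: the function names, the word-level n-gram size, the Jaccard similarity measure, and the threshold are all assumptions chosen for illustration.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the set of word-level n-grams of the lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def decontaminate(eval_samples, training_docs, n=3, threshold=0.5):
    """Keep only evaluation samples whose best n-gram similarity to any
    training document stays below the (hypothetical) threshold."""
    train_grams = [ngrams(doc, n) for doc in training_docs]
    clean = []
    for sample in eval_samples:
        grams = ngrams(sample, n)
        best = max((jaccard(grams, t) for t in train_grams), default=0.0)
        if best < threshold:
            clean.append(sample)
    return clean


training_docs = ["the quick brown fox jumps over the lazy dog"]
eval_samples = [
    "the quick brown fox jumps over the lazy dog",  # verbatim overlap
    "what is the capital of france",                # unrelated to training data
]
kept = decontaminate(eval_samples, training_docs)
print(kept)
```

A production tool would typically index training n-grams once rather than comparing every pair, but the filtering decision has the same shape.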

The system supports large-scale testing by orchestrating distributed evaluation workloads across multiple compute nodes. An abstracted interface standardizes communication with diverse model backends, so the same benchmarks can systematically validate a model's capabilities before deployment.
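
As a rough sketch of what an abstracted backend interface buys, the example below defines one abstract method and scores a multiple-choice question through it. `ModelBackend`, `UnigramBackend`, and `pick_choice` are hypothetical names invented here, not the harness's classes, and the toy backend ignores the context entirely.

```python
import math
from abc import ABC, abstractmethod


class ModelBackend(ABC):
    """Hypothetical common interface that every backend adapter implements."""

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        """Return log P(continuation | context) under the model."""


class UnigramBackend(ModelBackend):
    """Toy backend that scores a continuation from fixed word probabilities."""

    def __init__(self, word_probs, unknown=1e-6):
        self.word_probs = word_probs
        self.unknown = unknown  # probability assigned to unseen words

    def loglikelihood(self, context: str, continuation: str) -> float:
        # The context is ignored here; a real backend would condition on it.
        return sum(math.log(self.word_probs.get(w, self.unknown))
                   for w in continuation.lower().split())


def pick_choice(model: ModelBackend, question: str, choices) -> int:
    """Return the index of the answer choice the model scores highest."""
    scores = [model.loglikelihood(question, c) for c in choices]
    return scores.index(max(scores))


model = UnigramBackend({"paris": 0.5, "london": 0.1})
answer = pick_choice(model, "The capital of France is", ["London", "Paris"])
print(answer)
```

Because `pick_choice` depends only on the abstract method, swapping in a different backend requires no change to the evaluation logic.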

## Tags

### Artificial Intelligence & ML

- [Large Language Models](https://awesome-repositories.com/f/artificial-intelligence-ml/large-language-models.md) — Measures language model performance against standardized academic datasets to compare reasoning capabilities and accuracy.
- [Model Benchmarking](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-evaluation-and-validation/model-benchmarking.md) — Benchmarks large language models against standardized academic and reasoning datasets to compare performance across complex tasks. ([source](https://github.com/EleutherAI/lm-evaluation-harness/tree/master/docs))
- [Model Benchmarking Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/model-benchmarking-frameworks.md) — Executes automated evaluation tasks to compare the capabilities and accuracy of generative AI models.
- [Model Evaluation Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/model-evaluation-frameworks.md) — Provides a standardized toolkit for measuring the performance of large language models across diverse academic and reasoning benchmarks.
- [Data Decontamination Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/data-decontamination-tools.md) — Prevents data leakage by identifying and removing overlapping training samples from evaluation sets using string similarity scores.
- [Model Abstractions](https://awesome-repositories.com/f/artificial-intelligence-ml/model-abstractions.md) — Provides a unified interface for interacting with diverse language model backends to standardize inference and logit extraction.
- [Machine Learning Model APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/machine-learning-model-apis.md) — Systematically tests and verifies model capabilities to ensure performance requirements are met before deployment.
- [Custom Evaluation Judges](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-evaluation-judges.md) — Enables the creation of custom benchmark tasks by specifying data sources, metrics, and prompt structures. ([source](https://github.com/EleutherAI/lm-evaluation-harness/tree/master/docs))
- [Scoring Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/evaluation-metrics/scoring-pipelines.md) — Computes evaluation results by passing model outputs through modular validation functions for accuracy and performance indicators.
- [Prompt Templates](https://awesome-repositories.com/f/artificial-intelligence-ml/prompt-templates.md) — Transforms input data into model-ready prompts using a flexible engine that supports complex formatting and few-shot examples.
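
The prompt-template and scoring tags above can be sketched together in a few lines. This is an illustration under stated assumptions rather than the harness's engine: the `Question:`/`Answer:` template, `build_prompt`, `exact_match`, and `mean` are hypothetical, and plain `str.format` stands in for whatever templating the project actually uses.

```python
def build_prompt(template: str, fewshot, query: dict) -> str:
    """Render solved few-shot examples followed by the unanswered query."""
    shots = [template.format(**example) for example in fewshot]
    # Render the query with an empty answer slot for the model to complete.
    final = template.format(**{**query, "answer": ""}).rstrip()
    return "\n\n".join(shots + [final])


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the strings match after trimming and lowercasing."""
    return float(prediction.strip().lower() == reference.strip().lower())


def mean(values) -> float:
    """Aggregate per-sample scores into a single metric."""
    return sum(values) / len(values) if values else 0.0


template = "Question: {question}\nAnswer: {answer}"
fewshot = [{"question": "2 + 2?", "answer": "4"}]
prompt = build_prompt(template, fewshot, {"question": "3 + 3?"})

predictions = ["6", "7"]   # stand-ins for model outputs
references = ["6", "8"]
score = mean([exact_match(p, r) for p, r in zip(predictions, references)])
print(prompt)
print(score)
```

Keeping the per-sample metric separate from the aggregation step is what makes the scoring modular: either piece can be replaced without touching the other.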

### DevOps & Infrastructure

- [Distributed Orchestration](https://awesome-repositories.com/f/devops-infrastructure/distributed-orchestration.md) — Orchestrates distributed evaluation workloads across multiple compute nodes to parallelize large-scale benchmark execution.
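
One common way to parallelize a benchmark is to give each worker an interleaved slice of the samples and merge the partial counts afterwards. The sketch below simulates that in a single process; the round-robin scheme and the names `shard` and `evaluate_shard` are assumptions for illustration, not a description of how the project schedules work across nodes.

```python
def shard(items, rank: int, world_size: int):
    """Return the samples assigned to one worker (round-robin by index)."""
    return items[rank::world_size]


def evaluate_shard(samples):
    """Return (correct, total) for a list of (prediction, reference) pairs."""
    correct = sum(1 for prediction, reference in samples
                  if prediction == reference)
    return correct, len(samples)


# Stand-in (prediction, reference) pairs; a real run would query a model.
samples = [("a", "a"), ("b", "x"), ("c", "c"), ("d", "d"), ("e", "y")]
world_size = 2

# Each rank would run on its own node; here the ranks run one after another.
partials = [evaluate_shard(shard(samples, rank, world_size))
            for rank in range(world_size)]

# Merge raw counts rather than averaging per-shard accuracies, so that
# unevenly sized shards are weighted correctly.
total_correct = sum(correct for correct, _ in partials)
total_seen = sum(total for _, total in partials)
accuracy = total_correct / total_seen
print(partials, accuracy)
```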