Evaluate

Open-source alternatives to Evaluate

Similar open-source projects, ranked by how many features they share with Evaluate.

huggingface/lighteval
2,453 stars on GitHub
Lighteval is an open-source framework for running standardized benchmarks and custom evaluation tasks against language models. It provides a system for defining new evaluation tasks with custom prompts, metrics, and scoring in YAML configuration files, and integrates with the Hugging Face Hub for storing and comparing results. The framework supports evaluating models across multiple inference backends, including transformers, vllm, and custom APIs, through a unified generation and log-probability interface. It includes a pluggable metric registry for built-in and custom scoring, a prediction
Python, evaluation, evaluation-framework, evaluation-metrics
confident-ai/deepeval
13,733 stars on GitHub
Deepeval is a framework for testing and evaluating large language model applications. It provides a suite of tools for executing automated regression tests, validating model output quality against defined standards, and tracing the execution of complex agent workflows. By integrating these capabilities into development pipelines, the platform ensures consistent performance and reliability throughout the software lifecycle. The platform distinguishes itself through its focus on programmatic validation and observability. It utilizes secondary language models to score output quality and employs
Python, evaluation-framework, evaluation-metrics, llm-evaluation
EleutherAI/lm-evaluation-harness
11,460 stars on GitHub
This project is a standardized framework for benchmarking large language models across a wide range of academic and reasoning datasets. It provides a platform for executing automated evaluation tasks to measure model accuracy and performance, ensuring consistent assessment through a structured configuration schema. The framework distinguishes itself by incorporating a dedicated utility for data decontamination, which identifies and removes overlapping training samples from evaluation sets to prevent data leakage. It also features a flexible task builder that allows users to define custom benc
Python, evaluation-framework, language-model, transformer
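
The decontamination step described above can be illustrated with a simple n-gram overlap filter. This is a minimal sketch of one common approach, not the harness's own utility: the function names `ngrams` and `decontaminate`, the whitespace tokenization, and the default window size are all assumptions made for illustration.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(eval_samples: list[str], training_docs: list[str], n: int = 8) -> list[str]:
    """Drop evaluation samples that share at least one n-gram with the training docs.

    Hypothetical helper: samples shorter than n tokens produce no n-grams
    and are therefore always kept.
    """
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [s for s in eval_samples if not (ngrams(s, n) & train_ngrams)]


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    evals = ["The quick brown fox jumps high", "a completely different sentence here"]
    print(decontaminate(evals, train, n=4))
```

A smaller `n` removes more samples at the cost of more false positives; the right window depends on the dataset and tokenizer.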
openai/simple-evals
4,354 stars on GitHub
This project is a language model evaluation framework and benchmarking tool designed to measure the accuracy and performance of models across diverse datasets. It provides a system for implementing model-based graders, running standardized tests for mathematical reasoning, coding, and factuality, and calculating quantified performance metrics such as precision, recall, F1 scores, and pass-at-k. The framework utilizes model-based grading and rubrics to validate response quality against expert-defined criteria. It includes a multi-model benchmarking loop and a model-agnostic API interface to co
Python
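
The pass-at-k metric named in the description above is commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass. The following is a minimal sketch of that estimator, not code taken from simple-evals; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: total samples generated for the problem
    c: number of those samples that passed
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
```

Averaging this value over all problems in a benchmark gives the reported pass-at-k score.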

