# evolvinglmms-lab/lmms-eval

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/evolvinglmms-lab-lmms-eval).**

3,701 stars · 519 forks · Python · other

## Links

- GitHub: https://github.com/EvolvingLMMs-Lab/lmms-eval
- Homepage: https://www.lmms-lab.com
- awesome-repositories: https://awesome-repositories.com/repository/evolvinglmms-lab-lmms-eval.md

## Topics

`agi` `audio-evaluation` `benchmark` `evaluation` `large-language-models` `llm-evaluation` `multimodal` `multimodal-evaluation` `video-understanding` `vision-language-model` `vlm`

## Description

lmms-eval is a benchmarking and performance-analysis suite for large multimodal models. It evaluates models on text, image, audio, and video datasets, orchestrating dataset loading and scoring to quantify both accuracy and efficiency.

The project's distinguishing feature is a unified multimodal message protocol that structures text, image, audio, and video inputs so that every model consumes them in the same form. It includes dedicated benchmarks for audio, video, visual, document, and spatial reasoning, plus safety evaluations that probe for hallucinations, biases, and jailbreak susceptibility.
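
To make the idea of a unified message protocol concrete, here is a minimal sketch of a container that holds mixed media parts behind one interface. The names `ContentPart` and `Message` and the file names are hypothetical and chosen for illustration only; they are not the classes lmms-eval defines.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical types for illustration; not the lmms-eval protocol classes.
ContentType = Literal["text", "image", "audio", "video"]


@dataclass
class ContentPart:
    type: ContentType
    value: str  # literal text, or a path/URL for media parts


@dataclass
class Message:
    role: str
    parts: List[ContentPart] = field(default_factory=list)

    def media(self, kind: ContentType) -> List[str]:
        """Return the values of all parts of the given type, in order."""
        return [p.value for p in self.parts if p.type == kind]

    def text(self) -> str:
        """Join all text parts into a single prompt string."""
        return " ".join(p.value for p in self.parts if p.type == "text")


msg = Message(
    role="user",
    parts=[
        ContentPart("image", "chart.png"),
        ContentPart("text", "What is the highest bar?"),
        ContentPart("audio", "question.wav"),
    ],
)
```

With a shape like this, a model adapter can ask for `msg.media("image")` or `msg.text()` without knowing which benchmark produced the sample.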

Beyond task scoring, it reports throughput and token usage, validates results statistically for reproducibility, and speeds up inference through response caching and multi-threaded media decoding. It also runs agentic loops for multi-round evaluations and provides a browser-based graphical interface for configuring and launching runs interactively.
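
As an illustration of statistical result validation, the sketch below computes a mean score with a normal-approximation confidence interval from per-sample scores. The function name and the choice of `z = 1.96` (roughly a 95% interval) are assumptions for this example; it does not reproduce lmms-eval's own implementation and does not cover clustered standard errors.

```python
import math
from statistics import mean, stdev


def mean_confidence_interval(scores, z=1.96):
    """Return (mean, lower, upper) using a normal approximation.

    z=1.96 corresponds to an approximate 95% interval.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    m = mean(scores)
    # Standard error of the mean from the sample standard deviation.
    se = stdev(scores) / math.sqrt(n)
    return m, m - z * se, m + z * se


# Per-sample correctness (1 = correct, 0 = incorrect) for a toy benchmark run.
scores = [1, 0, 1, 1, 0, 1, 1, 0]
m, lower, upper = mean_confidence_interval(scores)
```

Reporting the interval alongside the mean makes it easier to judge whether a difference between two runs is larger than sampling noise.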

Users can trigger evaluations programmatically through a functional API or an asynchronous HTTP server.

## Tags

### Artificial Intelligence & ML

- [LLM Evaluation Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/llm-evaluation-frameworks.md) — Provides a comprehensive benchmarking system for measuring large multimodal models across text, image, audio, and video.
- [Model Performance Benchmarking](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-evaluation-analysis/model-analysis/model-performance-benchmarking.md) — Provides a comprehensive system to evaluate the speed and accuracy of multimodal models across diverse datasets. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/README.md))
- [Evaluation Dataset Structurers](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-management/evaluation-datasets/evaluation-dataset-structurers.md) — Specifies datasets, input processing functions, and output types via configuration files to create structured benchmarks. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/README.md))
- [Confidence Interval Calculators](https://awesome-repositories.com/f/artificial-intelligence-ml/detection-confidence-metrics/confidence-interval-calculators.md) — Calculates confidence intervals, clustered standard errors, and p-values to ensure benchmark scores are reproducible. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/README.md))
- [Model Backend Adapters](https://awesome-repositories.com/f/artificial-intelligence-ml/model-backend-adapters.md) — Wraps diverse model backends in a unified interface to ensure consistent data formats across benchmarks.
- [Model Benchmarking Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/model-benchmarking-tools.md) — Calculates accuracy and efficiency metrics for models processing combined visual, auditory, and textual inputs.
- [Model Evaluation Suites](https://awesome-repositories.com/f/artificial-intelligence-ml/model-evaluation-suites.md) — Ships a suite for computing statistical significance, throughput, and token usage for multimodal evaluations.
- [Model Integration Interfaces](https://awesome-repositories.com/f/artificial-intelligence-ml/model-integration-interfaces.md) — Wraps diverse model backends in a standard interface to enable consistent benchmarking across multimodal datasets. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/guides/model_guide.md))
- [Model Performance Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/model-performance-analysis.md) — Quantifies model quality and efficiency by calculating accuracy, throughput, and statistical significance.
- [Model Red-Teaming](https://awesome-repositories.com/f/artificial-intelligence-ml/model-red-teaming.md) — Implements adversarial testing using red-teaming datasets to detect hallucinations, biases, and jailbreak vulnerabilities. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/current_tasks.md))
- [Multimodal Message Containers](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-agent-capabilities/multimodal-message-containers.md) — Structures text, image, video, and audio inputs into a standardized multimodal message protocol for consistent model consumption.
- [Model Quality Metrics](https://awesome-repositories.com/f/artificial-intelligence-ml/prediction-visualization/model-quality-metrics.md) — Calculates accuracy, perplexity, and F1 scores using configurable aggregation methods to quantify model quality. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/guides/task_guide.md))
- [Multimodal Input Tuples](https://awesome-repositories.com/f/artificial-intelligence-ml/prompt-formatting/model-specific-prompt-formats/multimodal-input-tuples.md) — Structures multimodal data into specific tuples to maintain a consistent contract between dataset components and models. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/test/README.md))
- [Text Capability Benchmarks](https://awesome-repositories.com/f/artificial-intelligence-ml/text-capability-benchmarks.md) — Runs standard text-only language benchmarks to isolate linguistic reasoning from multimodal capabilities. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/current_tasks.md))
- [Visual Mathematical Reasoning Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-mathematical-reasoning-evaluation.md) — Evaluates a model's capacity to solve mathematical problems presented in visual formats. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/current_tasks.md))
- [Visual Question Answering Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-question-answering-evaluation.md) — Evaluates performance on visual question answering, captioning, and comprehension across images. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/current_tasks.md))
- [Visual Spatial Reasoning Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-spatial-reasoning-evaluation.md) — Benchmarks a model's understanding of object locations and physical spatial relations within scenes. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/current_tasks.md))
- [Agentic Execution Loops](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-reasoning-loops/critic-agent-loops/agentic-execution-loops.md) — Orchestrates multi-round evaluations by iteratively sending prompts and processing outputs until a terminal signal is reached. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/test/README.md))
- [Reasoning Block Filters](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-reasoning-loops/reasoning-block-filters.md) — Removes internal reasoning blocks from model outputs before scoring to ensure metrics reflect the final answer. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/getting-started/commands.md))
- [Custom Performance Metrics](https://awesome-repositories.com/f/artificial-intelligence-ml/evaluation-metrics/custom-performance-metrics.md) — Allows registration of custom scoring and aggregation functions to evaluate model performance via mathematical rules. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/guides/task_guide.md))
- [Answer Accuracy Evaluators](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/grounded-answer-generation/answer-accuracy-evaluators.md) — Computes log-probabilities of target continuations to evaluate model accuracy on closed-set multiple-choice tasks. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/guides/model_guide.md))
- [High Throughput Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/high-throughput-inference.md) — Increases request speed using adaptive concurrency, prefix-aware queueing, and shared cache reuse for media. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/README.md))
- [Inference Optimization Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-optimization-tools.md) — Optimizes evaluation speed through request batching, response caching, and accelerated media decoding.
- [Audio Performance Benchmarks](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-evaluation-analysis/model-analysis/model-performance-benchmarking/audio-performance-benchmarks.md) — Assesses model capabilities in speech recognition, speech translation, and audio-based question answering. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/current_tasks.md))
- [Remote Evaluation Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/training-monitoring-and-profiling/ai-observability/ai-observability-and-evaluation/evaluation-execution-tracers/remote-evaluation-execution.md) — Hosts an HTTP server to asynchronously trigger and track long-running model evaluations via remote client requests. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/external_usage.md))
- [Programmatic Evaluation APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/model-performance-evaluators/evaluation-configurations/api-deployed-evaluations/programmatic-evaluation-apis.md) — Provides a functional API to programmatically control model arguments, task selection, and batch sizes for automated benchmarking. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/external_usage.md))
- [Prompt-Based Text Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/prompt-based-text-generation.md) — Generates text responses by combining prompts with media files to test natural language generation capabilities. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/guides/model_guide.md))
- [Visual Document Understanding](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-document-understanding.md) — Tests the ability to extract and reason over information from documents, infographics, and images with text. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/current_tasks.md))
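
The reasoning block filter listed above can be sketched in a few lines. This example assumes the model wraps its internal reasoning in `<think>...</think>` tags; that delimiter and the function name `strip_reasoning` are assumptions for illustration, and real models or the project's own filter may use different markers.

```python
import re

# Assumption: reasoning is delimited by <think>...</think>.
# DOTALL lets the pattern span multiple lines; the non-greedy match
# keeps separate reasoning blocks from being merged into one.
_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(output: str) -> str:
    """Remove reasoning blocks and trim surrounding whitespace."""
    return _THINK_RE.sub("", output).strip()


raw = "<think>2 + 2 is 4,\nso the answer is 4.</think>\nThe answer is 4."
final_answer = strip_reasoning(raw)
```

Scoring `final_answer` rather than `raw` keeps a metric such as exact match from being affected by text inside the reasoning block.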

### Part of an Awesome List

- [Task Instance Definitions](https://awesome-repositories.com/f/awesome-lists/devops/tasks-and-scheduling/task-definitions/task-instance-definitions.md) — Uses external manifests and registration files to define datasets, processing logic, and scoring metrics for benchmark tasks.
- [Model Evaluation and Benchmarking](https://awesome-repositories.com/f/awesome-lists/ai/model-evaluation-and-benchmarking.md) — Evaluation framework for large vision-language models.

### Data & Databases

- [Multimodal Dataset Loaders](https://awesome-repositories.com/f/data-databases/large-scale-dataset-management/multimodal-dataset-loaders.md) — Manages task configurations and processes raw multimodal samples into model-ready formats.
- [Benchmark Task Management](https://awesome-repositories.com/f/data-databases/static-benchmark-datasets/benchmark-dataset-loaders/curated-benchmark-downloaders/benchmark-task-management.md) — Indexes available evaluation tasks, loads custom configurations, and automates dataset downloads. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/external_usage.md))
- [Evaluation Result Caches](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/caching-performance/caching-strategies/query-result-caching/method-result-caches/translation-result-caches/evaluation-result-caches.md) — Stores shared results in a directory or database to avoid redundant computations and reduce costs. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/getting-started/commands.md))
- [Response Caching](https://awesome-repositories.com/f/data-databases/response-caching.md) — Stores model outputs and evaluation results in a persistent database to prevent redundant API calls and costs.
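
A persistent response cache of the kind described in the entries above can be sketched with the standard library. The class `ResponseCache`, its table layout, and the hashing scheme are hypothetical choices for this example and are not taken from the lmms-eval code base.

```python
import hashlib
import json
import sqlite3


class ResponseCache:
    """Tiny cache keyed by a hash of model name, prompt, and parameters."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to persist across runs.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def _key(model, prompt, params):
        # sort_keys makes the hash independent of dict ordering.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model, prompt, params, compute):
        key = self._key(model, prompt, params)
        row = self.conn.execute(
            "SELECT value FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: skip the model call
        value = compute(prompt)
        self.conn.execute(
            "INSERT INTO responses (key, value) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()
        return value


calls = []


def fake_model(prompt):
    # Stand-in for an expensive model or API call.
    calls.append(prompt)
    return prompt.upper()


cache = ResponseCache()
first = cache.get_or_compute("demo-model", "hello", {"temperature": 0}, fake_model)
second = cache.get_or_compute("demo-model", "hello", {"temperature": 0}, fake_model)
```

Including the generation parameters in the key means a change such as a different temperature produces a fresh call instead of reusing a stale response.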

### Graphics & Multimedia

- [Semantic Video Understanding Tools](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing/video-analysis-processing/semantic-video-understanding-tools.md) — Measures temporal reasoning, action recognition, and long-form comprehension of video content. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/current_tasks.md))
- [Parallel Media Decoding](https://awesome-repositories.com/f/graphics-multimedia/parallel-media-decoding.md) — Accelerates video and audio processing through parallel decoding and blob storage to reduce evaluation latency.

### System Administration & Monitoring

- [Token Consumption Trackers](https://awesome-repositories.com/f/system-administration-monitoring/usage-monitoring/token-usage-analytics/token-consumption-trackers.md) — Normalizes diverse model output formats into consistent token counts for input, output, and reasoning. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/test/README.md))
- [Batch Processing Monitors](https://awesome-repositories.com/f/system-administration-monitoring/activity-monitors/activity-progress-monitors/task-progress-monitors/batch-processing-monitors.md) — Aggregates processing time and average throughput across concurrent requests to assess efficiency under load. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/throughput_metrics.md))
- [Model Observability Suites](https://awesome-repositories.com/f/system-administration-monitoring/model-observability-suites.md) — Logs all generated responses to a real-time file to provide observability into model behavior. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/caching.md))

### Testing & Quality Assurance

- [Multimodal Reasoning Evaluations](https://awesome-repositories.com/f/testing-quality-assurance/model-testing/model-evaluation/multimodal-reasoning-evaluations.md) — Tests model reasoning capabilities over documents, spatial relationships, and complex mathematical problems using media inputs.
- [Token Throughput Measurement](https://awesome-repositories.com/f/testing-quality-assurance/token-throughput-measurement.md) — Calculates latency, token generation speed, and time to first token to evaluate inference performance. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/advanced/throughput_metrics.md))
- [Inference Efficiency Metrics](https://awesome-repositories.com/f/testing-quality-assurance/test-efficiency-metrics/inference-efficiency-metrics.md) — Aggregates sample-level token counts and scores into high-level summaries like tokens per correct answer. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/test/README.md))
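
The efficiency summaries named above, such as tokens per correct answer, reduce to simple aggregation over per-sample records. The sketch below assumes each record carries an `output_tokens` count and a `correct` flag; those field names and the function `efficiency_summary` are hypothetical and may not match the keys lmms-eval emits.

```python
def efficiency_summary(samples, wall_time_s):
    """Aggregate per-sample records into run-level efficiency metrics.

    samples: list of dicts with 'output_tokens' (int) and 'correct' (bool).
    wall_time_s: total wall-clock time of the run in seconds.
    """
    total_tokens = sum(s["output_tokens"] for s in samples)
    n_correct = sum(1 for s in samples if s["correct"])
    return {
        "accuracy": n_correct / len(samples) if samples else 0.0,
        "tokens_per_second": total_tokens / wall_time_s if wall_time_s > 0 else 0.0,
        # Infinite cost when nothing was answered correctly.
        "tokens_per_correct": total_tokens / n_correct if n_correct else float("inf"),
    }


samples = [
    {"output_tokens": 100, "correct": True},
    {"output_tokens": 300, "correct": False},
    {"output_tokens": 200, "correct": True},
]
summary = efficiency_summary(samples, wall_time_s=3.0)
```

Tokens per correct answer counts the tokens spent on wrong answers as well, so a model that reasons at length and still fails is penalised relative to one that is both brief and accurate.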

### User Interface & Experience

- [Multimodal Input Processors](https://awesome-repositories.com/f/user-interface-experience/form-and-input-management/input-handling/multimodal-input-processors.md) — Transforms raw dataset samples into visual, audio, or text formats required for model inference. ([source](https://cdn.jsdelivr.net/gh/EvolvingLMMs-Lab/lmms-eval@main/docs/guides/task_guide.md))

### Software Engineering & Architecture

- [Custom Metric Registries](https://awesome-repositories.com/f/software-engineering-architecture/custom-metric-registries.md) — Allows registration of custom mathematical rules and scoring functions to calculate performance metrics from model outputs.
