Lmms Eval | Awesome Repository

lmms-eval is a benchmarking and performance analysis suite for large multimodal models. It provides a framework for evaluating models across text, image, audio, and video datasets, orchestrating those datasets and quantifying model accuracy and efficiency.

The project distinguishes itself through a unified multimodal message protocol that structures diverse media inputs for consistent model consumption. It features specialized benchmarking for audio, video, visual, document, and spatial reasoning, alongside tools for model safety evaluation focused on hallucinations, biases, and jailbreak susceptibility.
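To make the idea of a unified message protocol concrete, here is a minimal sketch of how mixed media inputs could be wrapped in one structure. The names ContentPart, Message, modalities, and flatten_text are hypothetical and chosen only for illustration; they are not taken from lmms-eval, whose actual protocol may differ.

```python
from dataclasses import dataclass, field
from typing import List, Literal


@dataclass
class ContentPart:
    """One piece of a message (hypothetical structure, not the lmms-eval API)."""
    kind: Literal["text", "image", "audio", "video"]
    value: str  # the text itself, or a path/URL pointing at the media file


@dataclass
class Message:
    """A chat-style message holding an ordered list of content parts."""
    role: Literal["system", "user", "assistant"]
    parts: List[ContentPart] = field(default_factory=list)


def modalities(messages: List[Message]) -> List[str]:
    """Return the distinct modalities used, in order of first appearance."""
    seen: List[str] = []
    for message in messages:
        for part in message.parts:
            if part.kind not in seen:
                seen.append(part.kind)
    return seen


def flatten_text(messages: List[Message]) -> str:
    """Join only the text parts, e.g. for a model that accepts text alone."""
    return "\n".join(
        part.value
        for message in messages
        for part in message.parts
        if part.kind == "text"
    )


example = [
    Message(
        role="user",
        parts=[
            ContentPart("image", "chart.png"),
            ContentPart("text", "What is the peak value?"),
        ],
    )
]
```

Because every input shares one shape, an adapter for a given model only has to decide what to do with each kind of part rather than parse a different format per benchmark.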

The system covers a broad range of capability areas, including performance analysis for throughput and token usage, statistical result validation for reproducibility, and inference optimization via response caching and multi-threaded media decoding. It also supports agentic loop execution for multi-round evaluations and provides a browser-based graphical interface for interactive configuration and launching.
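Response caching, one of the inference optimizations mentioned above, can be sketched as follows. This is an illustrative assumption about how such a cache might work, not the project's implementation: the ResponseCache class, its key scheme, and the generate callback are all hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict


class ResponseCache:
    """In-memory cache keyed by a hash of the full request (hypothetical)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str, gen_kwargs: dict) -> str:
        # sort_keys makes the key independent of dict insertion order
        payload = json.dumps(
            {"model": model, "prompt": prompt, "gen_kwargs": gen_kwargs},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(
        self,
        model: str,
        prompt: str,
        gen_kwargs: dict,
        generate: Callable[[str], str],
    ) -> str:
        """Return the cached response, calling generate only on a miss."""
        k = self.key(model, prompt, gen_kwargs)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        response = generate(prompt)
        self._store[k] = response
        return response
```

Including the generation parameters in the key matters: the same prompt sampled at a different temperature is a different request and should not reuse an earlier answer.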

Users can trigger evaluations programmatically through a functional API or an asynchronous HTTP server.

Features

LLM Evaluation Frameworks - Provides a comprehensive benchmarking system for measuring large multimodal models across text, image, audio, and video.
Model Performance Benchmarking - Provides a comprehensive system to evaluate the speed and accuracy of multimodal models across diverse datasets.
Evaluation Dataset Structurers - Specifies datasets, input processing functions, and output types via configuration files to create structured benchmarks.
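The configuration-driven approach described in the last feature can be illustrated with a small sketch. The config keys (task, dataset, process_input, output_type), the JSON format, and the registry below are assumptions made for this example; the real project's configuration format and field names may differ.

```python
import json

# Hypothetical task config; the keys are illustrative, not lmms-eval's schema.
CONFIG_TEXT = """
{
  "task": "toy_vqa",
  "dataset": "toy_dataset",
  "process_input": "question_with_hint",
  "output_type": "generate"
}
"""


def question_with_hint(doc: dict) -> str:
    """Example input processing function: turn a dataset row into a prompt."""
    return f"{doc['question']} Answer with a single word."


# Registry mapping the names used in config files to Python callables.
PROCESSORS = {"question_with_hint": question_with_hint}
ALLOWED_OUTPUT_TYPES = {"generate", "multiple_choice"}
REQUIRED_KEYS = {"task", "dataset", "process_input", "output_type"}


def load_task(config_text: str) -> dict:
    """Parse and validate a config, resolving the processor name to a function."""
    cfg = json.loads(config_text)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if cfg["output_type"] not in ALLOWED_OUTPUT_TYPES:
        raise ValueError(f"unknown output_type: {cfg['output_type']}")
    if cfg["process_input"] not in PROCESSORS:
        raise ValueError(f"unknown processor: {cfg['process_input']}")
    cfg["process_input"] = PROCESSORS[cfg["process_input"]]
    return cfg


task = load_task(CONFIG_TEXT)
prompt = task["process_input"]({"question": "What color is the car?"})
```

Keeping the benchmark definition in a config file means a new task can be added without touching the evaluation loop; only the registry of processing functions grows.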
