# open-compass/vlmevalkit

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/open-compass-vlmevalkit).**

3,824 stars · 638 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/open-compass/VLMEvalKit
- Homepage: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
- awesome-repositories: https://awesome-repositories.com/repository/open-compass-vlmevalkit.md

## Topics

`chatgpt` `claude` `clip` `computer-vision` `evaluation` `gemini` `gpt` `gpt-4v` `gpt4` `large-language-models` `llava` `llm` `multi-modal` `openai` `openai-api` `pytorch` `qwen` `vit` `vqa`

## Description

VLMEvalKit is a vision-language model (VLM) evaluation framework and inference engine that runs standardized benchmarks and measures model accuracy across diverse visual datasets. It also serves as a performance toolkit for calculating metrics and comparing model responses.

The toolkit includes a specialized visual reasoning evaluator that uses adversarial samples to distinguish actual image understanding from reliance on language patterns. It also provides capabilities for image generation evaluation, testing a model's ability to create or modify visuals based on text descriptions.
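
One common way to make adversarial samples expose language-pattern shortcuts is to score them in groups, so that a model earns credit only when it answers every item in a group correctly. The following is a minimal sketch of that idea, not VLMEvalKit's own scoring code; the record fields (`group_id`, `answer`, `prediction`) and the function name are hypothetical.

```python
from collections import defaultdict


def group_accuracy(records):
    """Return (per-question accuracy, per-group accuracy).

    A group counts as correct only if every question in it is answered
    correctly, which penalizes answers driven by language priors alone.
    """
    groups = defaultdict(list)
    for rec in records:
        correct = rec["prediction"].strip().lower() == rec["answer"].strip().lower()
        groups[rec["group_id"]].append(correct)

    total = sum(len(flags) for flags in groups.values())
    question_acc = sum(sum(flags) for flags in groups.values()) / total
    group_acc = sum(all(flags) for flags in groups.values()) / len(groups)
    return question_acc, group_acc


# Hypothetical data: each group pairs a "yes" case with a "no" case.
# A model that ignores the image and always says "yes" gets half of the
# questions right but completes no group.
records = [
    {"group_id": 0, "answer": "yes", "prediction": "yes"},
    {"group_id": 0, "answer": "no", "prediction": "yes"},
    {"group_id": 1, "answer": "yes", "prediction": "yes"},
    {"group_id": 1, "answer": "no", "prediction": "yes"},
]
print(group_accuracy(records))  # (0.5, 0.0)
```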

The framework runs multimodal inference, including image-to-text generation, and batches requests to increase throughput. It provides utilities for benchmark score calculation, a model response browser for reviewing raw outputs, and attention mechanism optimizations that reduce memory usage during inference.

## Tags

### Artificial Intelligence & ML

- [Multimodal Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/multimodal-inference-engines.md) — Provides an execution engine that processes combined text and visual data to generate a single model response. ([source](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B))
- [Adversarial Reasoning Testing](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-architectures/orchestration-engines/ai-agent/reasoning-action-loops/visual-reasoning/visual-property-reasoning/adversarial-reasoning-testing.md) — Uses adversarial samples to check that a model's image understanding does not rely on language patterns.
- [Automated Dataset Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-management/evaluation-datasets/automated-dataset-evaluation.md) — Automates the execution of evaluators against structured benchmark datasets to measure visual reasoning.
- [Image-Text Prompt Inferences](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/generative-text-inference/image-text-prompt-inferences.md) — Processes large sets of combined image and text prompts to generate predictions for evaluation.
- [Image-to-Text Transformers](https://awesome-repositories.com/f/artificial-intelligence-ml/image-to-text-transformers.md) — Enables the analysis of images and text prompts to generate descriptive written responses or direct answers. ([source](https://cdn.jsdelivr.net/gh/open-compass/vlmevalkit@main/README.md))
- [Visual Reasoning Evaluators](https://awesome-repositories.com/f/artificial-intelligence-ml/long-context-training-optimizations/long-context-retrieval-testing/reasoning-evaluations/visual-reasoning-evaluators.md) — Distinguishes actual image understanding from reliance on language patterns using a specialized adversarial testing suite.
- [Model Evaluation Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/model-evaluation-frameworks.md) — Provides a complete toolkit for running standardized benchmarks and measuring VLM accuracy.
- [Accuracy Calculators](https://awesome-repositories.com/f/artificial-intelligence-ml/prediction-visualization/accuracy-calculators.md) — Provides utilities to calculate accuracy and performance metrics by comparing model predictions against ground-truth labels. ([source](https://huggingface.co/datasets/VLMEval/OpenVLMRecords))
- [Visual Question Answering Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-question-answering-evaluation.md) — Analyzes visual content to answer natural language questions and extract meaning from images. ([source](https://huggingface.co/AIDC-AI/Ovis-U1-3B))
- [Adversarial Visual Reasoning Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-question-answering-evaluation/adversarial-visual-reasoning-evaluation.md) — Tests a model's ability to answer questions using adversarial samples that prevent reliance on language patterns. ([source](https://huggingface.co/datasets/BaiqiL/NaturalBench))
- [Attention Kernel Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms/attention-kernel-configurations/attention-kernel-optimizers.md) — Includes optimizations for attention mechanisms to reduce memory overhead and increase computation speed during inference. ([source](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B))
- [Batch Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/batch-inference-engines.md) — Implements batch processing of multimodal inputs to maximize hardware throughput during large-scale model evaluation. ([source](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B))
- [Inference-Scoring Decoupling](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-model-training/reward-modeling/flexible-scoring-paradigms/inference-scoring-decoupling.md) — Separates model prediction generation from metric calculation to allow for flexible post-processing and re-scoring.
- [Evaluations](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-image-models/evaluations.md) — Tests a model's ability to create or modify images accurately based on provided text descriptions.
- [Generative Model Evaluation](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-model-evaluation.md) — Assesses how accurately a model creates or modifies visuals based on text descriptions. ([source](https://huggingface.co/AIDC-AI/Ovis-U1-3B))
- [Multimodal Performance Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/multimodal-performance-toolkits.md) — Optimizes attention speed and provides tools for analyzing raw model outputs during large-scale testing.
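
The accuracy calculation and inference-scoring decoupling described in the tags above can be sketched as two independent stages: one that only collects predictions, and one that only compares them with ground-truth labels. This is an illustrative sketch under stated assumptions, not the toolkit's code; `run_inference`, `score`, `toy_model`, and the sample fields are all hypothetical.

```python
def run_inference(model_fn, samples):
    """Stage 1: generate raw predictions only; no scoring happens here."""
    return [
        {"index": s["index"], "prediction": model_fn(s["question"])}
        for s in samples
    ]


def score(predictions, ground_truth, normalize=lambda s: s.strip().lower()):
    """Stage 2: compare stored predictions with labels.

    Because this stage never calls the model, it can be re-run with a
    different normalizer without repeating inference.
    """
    hits = sum(
        normalize(p["prediction"]) == normalize(ground_truth[p["index"]])
        for p in predictions
    )
    return hits / len(predictions)


def toy_model(question):
    # Stand-in for a real vision-language model (hypothetical).
    return "Red." if "color" in question else "2"


samples = [
    {"index": 0, "question": "What color is the apple?"},
    {"index": 1, "question": "How many cats are there?"},
]
ground_truth = {0: "red", 1: "2"}

preds = run_inference(toy_model, samples)
strict = score(preds, ground_truth)
lenient = score(
    preds, ground_truth, normalize=lambda s: s.strip().lower().rstrip(".")
)
print(strict, lenient)  # 0.5 1.0
```

Re-scoring with the lenient normalizer changes the result from 0.5 to 1.0 without a second model call, which is the practical benefit of keeping the two stages separate.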

### Part of an Awesome List

- [Multimodal Benchmarks](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-benchmarks.md) — Provides a comprehensive framework for evaluating and comparing the performance of multimodal models.
- [Multimodal Evaluation Benchmarks](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-evaluation-benchmarks.md) — Calculates performance metrics and compares model responses using diverse image-text evaluation sets.
- [Model Evaluation and Benchmarking](https://awesome-repositories.com/f/awesome-lists/ai/model-evaluation-and-benchmarking.md) — Evaluation toolkit for large vision-language models.

### Software Engineering & Architecture

- [Vision-Language Model Benchmarking](https://awesome-repositories.com/f/software-engineering-architecture/performance-reliability/performance-engineering/performance-benchmarking/vision-language-model-benchmarking.md) — Measures vision-language model performance using standardized benchmarks and accuracy metrics.
- [Unified Model Wrappers](https://awesome-repositories.com/f/software-engineering-architecture/unified-model-wrappers.md) — Provides a unified interface to standardize diverse vision-language model APIs for consistent benchmarking.
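
A unified wrapper of the kind described above is usually an abstract base class that every model backend implements, so the evaluation loop never needs to know which model it is calling. The sketch below is a generic illustration; `BaseVLM`, `EchoVLM`, `UpperVLM`, and the `generate(image_path, prompt)` signature are assumptions made for this example, not VLMEvalKit's actual classes.

```python
from abc import ABC, abstractmethod


class BaseVLM(ABC):
    """Hypothetical common interface: (image_path, prompt) -> answer text."""

    @abstractmethod
    def generate(self, image_path: str, prompt: str) -> str:
        ...


class EchoVLM(BaseVLM):
    """Toy backend that echoes its inputs."""

    def generate(self, image_path: str, prompt: str) -> str:
        return f"[{image_path}] {prompt}"


class UpperVLM(BaseVLM):
    """Toy backend with different internals but the same interface."""

    def generate(self, image_path: str, prompt: str) -> str:
        return prompt.upper()


def evaluate(model: BaseVLM, items):
    """Benchmark loop that depends only on the shared interface."""
    return [model.generate(image, question) for image, question in items]


items = [("apple.jpg", "What is this?")]
for model in (EchoVLM(), UpperVLM()):
    print(type(model).__name__, evaluate(model, items))
```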

### Data & Databases

- [Prediction Persistence Layers](https://awesome-repositories.com/f/data-databases/prediction-persistence-layers.md) — Saves raw model predictions to structured local files for offline analysis and auditing.
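
Persisting raw predictions to a local file for later auditing can be sketched as a simple write and reload round trip. The file format the toolkit actually uses is not stated here; JSON Lines, the file name, and the helper names below are assumptions chosen for illustration.

```python
import json
import os
import tempfile


def save_predictions(path, predictions):
    """Write one JSON object per line so the file can be appended to or streamed."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in predictions:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_predictions(path):
    """Read the predictions back for offline scoring or manual review."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


preds = [
    {"index": 0, "question": "What color is the apple?", "prediction": "red"},
    {"index": 1, "question": "How many cats are there?", "prediction": "2"},
]

# Hypothetical file name; a temporary directory keeps the example side-effect free.
path = os.path.join(tempfile.mkdtemp(), "toy_model_toy_dataset.jsonl")
save_predictions(path, preds)
reloaded = load_predictions(path)
print(reloaded == preds)  # True
```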

### Development Tools & Productivity

- [Prompt Template Injection](https://awesome-repositories.com/f/development-tools-productivity/argument-injection-utilities/prompt-template-injection.md) — Dynamically injects dataset questions into model-specific templates to ensure consistent input formatting.
- [Inference Batching](https://awesome-repositories.com/f/development-tools-productivity/batch-processing-pipelines/inference-batching.md) — Implements inference batching by grouping image and text pairs into tensors to maximize hardware throughput.
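
Prompt template injection and inference batching can be sketched together: each dataset question is substituted into a fixed template, and the resulting image and prompt pairs are split into fixed-size batches. The template text, the `<image>` placeholder, and the helper names are hypothetical, and the batches here are plain Python lists rather than tensors.

```python
from string import Template

# Hypothetical model-specific template; real models define their own formats.
TEMPLATE = Template("<image>\nQuestion: $question\nAnswer with a single word.")


def build_prompts(questions):
    """Inject each dataset question into the shared template."""
    return [TEMPLATE.substitute(question=q) for q in questions]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


images = [f"img_{i}.jpg" for i in range(5)]
questions = [f"What is object {i}?" for i in range(5)]
pairs = list(zip(images, build_prompts(questions)))

batches = list(batched(pairs, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```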
