30 open-source projects similar to open-compass/opencompass, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best OpenCompass alternative.
OpenCompass is a comprehensive evaluation platform, benchmarking suite, and distributed model evaluator designed to measure the performance and accuracy of large language models. It provides a framework for benchmarking both open-source and API-based models against diverse datasets using standardized metrics and reproducible pipelines. The project features an automated judging framework that uses language models as judges to score and verify the quality of generated text. It includes a performance leaderboard system for comparing the relative capabilities of various models across industry-sta
This project is a language model evaluation framework and benchmarking tool designed to measure the accuracy and performance of models across diverse datasets. It provides a system for implementing model-based graders, running standardized tests for mathematical reasoning, coding, and factuality, and calculating quantified performance metrics such as precision, recall, F1 scores, and pass-at-k. The framework utilizes model-based grading and rubrics to validate response quality against expert-defined criteria. It includes a multi-model benchmarking loop and a model-agnostic API interface to co
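As a rough illustration of the metrics named in the entry above, the sketch below computes F1 from raw counts and pass-at-k using the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k). The function names `f1_score` and `pass_at_k` are hypothetical and are not taken from the project; its actual implementation may differ.

```python
from math import comb


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: total samples generated for a problem (assumed name)
    c: number of those samples that passed the tests (assumed name)
    k: sample budget being scored
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is the plain pass rate, and larger `k` values give a higher estimate because more draws are allowed.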
Arize Phoenix is an LLM observability platform and evaluation framework designed to capture execution traces and monitor large language model applications. It serves as a prompt management system for versioning and testing templates, and as a self-hosted AI operations infrastructure for managing telemetry and experiments. The platform differentiates itself through a specialized embedding visualization tool used to detect data drift and optimize vector search. It provides a comprehensive evaluation suite that utilizes judge-based evaluators and ground-truth datasets to score model outputs, and
Giskard is an evaluation framework, testing library, and quality monitoring system for large language models and AI agents. It serves as a toolkit for quantifying model performance and reliability, providing specialized capabilities for validating retrieval-augmented generation pipelines. The project distinguishes itself through an automated red teaming tool and security scanner designed to identify vulnerabilities, prompt injections, and safety risks. It utilizes adversarial probing and synthetic edge case generation to quantify model robustness and detect information disclosure. The platfo
VLMEvalKit is a vision-language model evaluation framework and inference engine designed to run standardized benchmarks and measure model accuracy across diverse visual datasets. It serves as a multimodal model benchmark and performance toolkit for calculating metrics and comparing model responses. The toolkit includes a specialized visual reasoning evaluator that uses adversarial samples to distinguish actual image understanding from reliance on language patterns. It also provides capabilities for image generation evaluation, testing a model's ability to create or modify visuals based on tex
Oumi is a comprehensive large language model development platform designed for synthesizing data, fine-tuning models, and running performance evaluations. It serves as a unified environment for the entire model lifecycle, encompassing a training and fine-tuning suite, an evaluation framework, and tools for synthetic data generation and model distillation. The platform is distinguished by its iterative, failure-driven synthesis approach, which analyzes model weaknesses during evaluation to generate targeted training data. It utilizes an LLM-based judge framework to programmatically score respo
mcp-context-forge is a Model Context Protocol federation gateway that unifies diverse AI tool servers and APIs into a single consistent interface for discovery and execution. It acts as a centralized proxy that aggregates multiple servers and APIs, allowing AI agents to access and invoke a unified set of tools, prompts, and resources. The project distinguishes itself through a multi-protocol translation bridge that converts communication between standard I/O, SSE, gRPC, and REST to enable interoperability between disparate tool servers. It includes a comprehensive LLM evaluation framework for
Lighteval is an open-source framework for running standardized benchmarks and custom evaluation tasks against language models. It provides a system for defining new evaluation tasks with custom prompts, metrics, and scoring in YAML configuration files, and integrates with the Hugging Face Hub for storing and comparing results. The framework supports evaluating models across multiple inference backends, including transformers, vllm, and custom APIs, through a unified generation and log-probability interface. It includes a pluggable metric registry for built-in and custom scoring, a prediction
Comet LLM is an observability platform and evaluation framework designed for large language model applications and agentic workflows. It functions as a system for tracing, monitoring, and debugging execution flows while providing tools for prompt optimization and the enforcement of AI safety guardrails. The platform distinguishes itself through a combination of model-based scoring and heuristic metrics to quantify output quality and detect hallucinations. It includes a dedicated prompt and agent optimizer with an interactive playground for refining templates and tool configurations. For retri
Evidently is an AI observability platform and evaluation framework designed to quantify the performance of machine learning models and large language models. It functions as a monitoring tool for detecting data drift and quality degradation in tabular datasets, while providing a specialized analyzer for the faithfulness and correctness of retrieval augmented generation systems. The project distinguishes itself through an evaluation framework that utilizes judge models and custom rubrics to score language model outputs. It includes tools for iterative prompt optimization and the generation of
Evals is a framework designed for automating, managing, and executing repeatable benchmarking suites to analyze the quality and performance of language models. It provides a platform for running standardized tests to measure model accuracy and track behavioral changes over time. The system distinguishes itself through a modular architecture that uses a standardized adapter layer to normalize inputs and outputs, allowing different models to be swapped and tested interchangeably. It supports the creation of custom benchmarks using proprietary data, enabling quality assurance on sensitive tasks
This project is a standardized framework for benchmarking large language models across a wide range of academic and reasoning datasets. It provides a platform for executing automated evaluation tasks to measure model accuracy and performance, ensuring consistent assessment through a structured configuration schema. The framework distinguishes itself by incorporating a dedicated utility for data decontamination, which identifies and removes overlapping training samples from evaluation sets to prevent data leakage. It also features a flexible task builder that allows users to define custom benc
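The decontamination idea in the entry above can be sketched with a simple n-gram overlap check: any evaluation sample that shares a word n-gram with the training corpus is dropped. This is only an assumed approach for illustration; the entry does not say how the project detects overlap, and the names `ngrams` and `decontaminate` are hypothetical.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(eval_samples: list[str], training_docs: list[str], n: int = 8) -> list[str]:
    """Keep only evaluation samples that share no n-gram with the training docs.

    The default n is an arbitrary choice for this sketch.
    """
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    # Samples shorter than n tokens have no n-grams and are always kept.
    return [s for s in eval_samples if not (ngrams(s, n) & train_ngrams)]


training = ["the quick brown fox jumps over the lazy dog"]
evaluation = ["a quick brown fox appears", "completely unrelated question here"]
clean = decontaminate(evaluation, training, n=3)
```

A smaller `n` removes more samples at the cost of false positives on common phrases, so the window size is a tuning choice rather than a fixed rule.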
UltraChat is a collection of large-scale conversational datasets and instruction-tuning data designed for training and evaluating generative AI models. It provides structured JSON data consisting of complex, multi-round dialogue sequences intended to refine the performance of large language models in chat tasks. The project focuses on improving reasoning and response quality through a diverse set of interactions across multiple sectors. These datasets are used for supervised fine-tuning and instruction tuning workflows to improve how models follow complex directions and maintain context acros
Opik is an observability and evaluation platform designed for generative AI applications and agentic workflows. It provides a centralized environment for tracing execution flows, managing prompt templates, and monitoring production performance, allowing teams to gain visibility into complex model interactions and tool usage without requiring manual application code changes. The platform distinguishes itself through its integrated approach to the AI development lifecycle, combining distributed trace instrumentation with automated evaluation frameworks. It supports model-as-a-judge scoring, syn
GAOKAO-Bench is an evaluation framework that uses questions from the GAOKAO, China's national college entrance examination, as a dataset for benchmarking large language models.
This repository provides tools and methodologies for studying adversarial attacks on large language models. It focuses on understanding how carefully crafted inputs can manipulate or bypass the safety mechanisms of LLMs, enabling researchers to probe model vulnerabilities and improve their robustness. The project covers techniques for generating adversarial prompts, evaluating model responses under attack conditions, and analyzing the effectiveness of different attack strategies.
Kiln is an LLM development workbench and evaluation framework designed for designing, testing, and optimizing prompts and AI agents. It functions as a multi-agent orchestrator and a RAG optimization tool, providing a visual interface for the iterative development of AI systems. The project distinguishes itself through a comprehensive fine-tuning pipeline that supports zero-code model training and reasoning distillation. It enables the creation of hierarchical multi-agent systems where specialized actors coordinate via tool calling, and it implements a Model Context Protocol server to expose t
Agenta is a Prompt Ops lifecycle manager and prompt management platform that decouples prompt engineering from application code. It serves as a centralized system for developing, versioning, and deploying prompt templates and model configurations across different environments. The platform functions as an AI agent orchestrator with a visual interface for building agent workflows and connecting models to external tools. It further acts as an evaluation framework and observability tool, utilizing OpenTelemetry to capture execution traces, monitor latency, and track token costs. The system cove
Lit is a machine learning interpretability framework and model debugging tool designed to analyze model behavior and performance. It serves as an interpretability dashboard for large language models and a general performance analyzer for text, image, and tabular datasets. The project distinguishes itself through a comprehensive suite of interpretability tools, including salience map generation for feature attribution, the creation of synthetic and counterfactual examples to test robustness, and the projection of high-dimensional embeddings into visual spaces via UMAP or PCA. It further enable
This project is an educational resource and engineering guide for building, deploying, and optimizing large language model applications and production pipelines. It serves as a blueprint for cloud AI infrastructure, providing a framework for orchestrating inference endpoints, data warehouses, and scalable production environments. The repository provides specific implementation patterns for retrieval augmented generation to ground model responses in external data. It includes a training workflow for crawling, structuring, and processing datasets to facilitate model fine-tuning, alongside an ev
lmms-eval is a benchmarking system and performance analysis suite designed to measure the capabilities of large multimodal models. It provides a framework for evaluating models across text, image, audio, and video datasets, serving as a multimodal dataset orchestrator and benchmarking tool to quantify accuracy and efficiency. The project distinguishes itself through a unified multimodal message protocol that structures diverse media inputs for consistent model consumption. It features specialized benchmarking for audio, video, visual, document, and spatial reasoning, alongside tools for model
Open-r1 is a framework designed for the large-scale training, distillation, and optimization of language models focused on complex reasoning and programming tasks. It provides a comprehensive suite of tools for managing distributed training jobs across multi-node clusters, enabling the development of high-performance models through reinforcement learning and supervised fine-tuning. The project distinguishes itself by integrating secure, containerized code execution environments directly into the training and evaluation lifecycle. By allowing models to run and verify code snippets against test
This project is an alignment framework and suite of pipelines for training language models using supervised fine-tuning and preference optimization. It provides tools for executing large-scale distributed training across multiple GPUs and compute nodes, alongside a system for measuring model helpfulness and dialogue quality through single-turn and multi-turn benchmarks. The framework includes specialized tools for direct preference optimization to refine model behavior using paired data without a separate reward model. It also supports constitutional AI alignment and the training of reward mo
Map-anything is a 3D scene reconstruction framework and neural geometry estimator designed to transform two-dimensional images into metric three-dimensional spatial representations using feed-forward neural networks. It provides a specialized toolkit for predicting camera intrinsics and ray directions from single images without requiring external geometric metadata. The project includes a 3D model benchmarking suite that utilizes a unified model wrapper to standardize outputs from diverse reconstruction models. This allows for consistent evaluation and accuracy measurement across various spat
Heretic is a specialized toolkit for removing safety alignment and refusal constraints from transformer-based language models. It utilizes directional ablation to suppress model refusals and restore unrestricted output capabilities. The project provides a framework for quantifying the effectiveness of these modifications by measuring refusal rates and evaluating divergence from the original model behavior. It also includes a suite for residual vector analysis, allowing for the calculation of geometric relationships between prompts and the visualization of hidden states across model layers. A
Lmnr is an LLM observability platform and evaluation framework designed for tracing, logging, and monitoring language model executions. It provides the tools necessary to debug agent behavior, analyze performance, and identify failure patterns in AI agents. The platform differentiates itself through a trace-to-dataset pipeline that converts production logs into labeled test sets for regression testing. It includes a prompt-variant replay engine to compare different prompts or models side-by-side and a state-cached debugging system to replay agent loops without restarting the process. The sys
This project is a comprehensive ecosystem of frameworks, toolkits, and datasets designed to evaluate model vulnerabilities and analyze jailbreak patterns. It serves as an adversarial testing framework and research toolkit for measuring the effectiveness of safety guardrails in large language models. The system includes a library of real-world prompt injection datasets harvested from social media to study bypass strategies. It provides specialized tools for semantic attack analysis and prompt visualization, allowing for the mapping of relationships between adversarial prompts to discover commo
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc