Why is open-mmlab/mmdetection a recommended Model Evaluation Frameworks repository?

It enables large-scale model evaluation across single-GPU or multi-GPU environments.

Why is Hannibal046/Awesome-LLM a recommended Model Evaluation Frameworks repository?

It helps locate specialized tools and methodologies for detecting hallucinations, ensuring model security, and aligning system behavior with human preferences.

Why is mlflow/mlflow a recommended Model Evaluation Frameworks repository?

It runs systematic evaluations using built-in metrics to track quality and detect regressions in model performance.

Why is huggingface/open-r1 a recommended Model Evaluation Frameworks repository?

It provides a standardized suite of benchmarks and testing tools designed to measure performance on mathematical, logical, and programming problem-solving tasks.

Why is liguodongiot/llm-action a recommended Model Evaluation Frameworks repository?

It supports multi-node and multi-GPU evaluation environments for benchmarking reasoning and instruction-following performance.

Why is openai/evals a recommended Model Evaluation Frameworks repository?

It executes standardized or custom test suites against language models to generate performance reports.

Why is microsoft/Swin-Transformer a recommended Model Evaluation Frameworks repository?

It supports distributed inference and validation across multiple devices to measure model performance.

Why is vibrantlabsai/ragas a recommended Model Evaluation Frameworks repository?

It manages execution settings such as timeouts and model parameters to control how evaluation tasks run.

Why is EleutherAI/lm-evaluation-harness a recommended Model Evaluation Frameworks repository?

It provides a standardized toolkit for measuring the performance of large language models across diverse academic and reasoning benchmarks.

Why is wandb/client a recommended Model Evaluation Frameworks repository?

It ships a framework for running model inference and validation using custom scorers and automated judges.

18 repositories

Awesome GitHub Repositories: Model Evaluation Frameworks

Utilities for running model inference and validation on standard datasets.

Distinguishing note: Supports multi-node and multi-GPU evaluation environments.

Explore 18 GitHub repositories in Artificial Intelligence & ML · Model Evaluation Frameworks.


open-mmlab/mmdetection
32,756 · View on GitHub
This project is a modular research toolkit designed for developing, training, and evaluating deep learning models for object detection, segmentation, and video instance tracking. It provides a flexible training engine that manages complex neural network execution, including distributed training, custom lifecycle hooks, and weight optimization. The framework is built around a hierarchical configuration system that allows users to define architectures, data pipelines, and training hyperparameters through composable, inheritable files. The project distinguishes itself through its highly modular
Enables large-scale model evaluation across single or multi-GPU environments.
Python · cascade-rcnn · convnext · detr
Hannibal046/Awesome-LLM
26,933 · View on GitHub
This project serves as a comprehensive, static directory of external resources dedicated to the study and application of large language models. It functions as a centralized discovery point for developers and researchers, aggregating foundational academic papers, technical documentation, and specialized tools within a structured, version-controlled knowledge base. The repository distinguishes itself through a multi-level classification system that organizes diverse technical domains, ranging from model training frameworks and inference optimization to AI safety and hallucination detection. By
Locating specialized tools and methodologies for detecting hallucinations, ensuring model security, and aligning system behavior with human preferences.
mlflow/mlflow
26,554 · View on GitHub
Runs systematic evaluations using built-in metrics to track quality and detect regressions in model performance.
Python · agentops · agents · ai
huggingface/open-r1
26,326 · View on GitHub
Open-r1 is a framework designed for the large-scale training, distillation, and optimization of language models focused on complex reasoning and programming tasks. It provides a comprehensive suite of tools for managing distributed training jobs across multi-node clusters, enabling the development of high-performance models through reinforcement learning and supervised fine-tuning. The project distinguishes itself by integrating secure, containerized code execution environments directly into the training and evaluation lifecycle. By allowing models to run and verify code snippets against test
Provides a standardized suite of benchmarks and testing tools designed to measure performance on mathematical, logical, and programming problem-solving tasks.
Python
liguodongiot/llm-action
23,169 · View on GitHub
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facili
Supports multi-node and multi-GPU evaluation environments for benchmarking reasoning and instruction-following performance.
HTML · llm · llm-inference · llm-serving
openai/evals
18,702 · View on GitHub
Evals is a framework designed for automating, managing, and executing repeatable benchmarking suites to analyze the quality and performance of language models. It provides a platform for running standardized tests to measure model accuracy and track behavioral changes over time. The system distinguishes itself through a modular architecture that uses a standardized adapter layer to normalize inputs and outputs, allowing different models to be swapped and tested interchangeably. It supports the creation of custom benchmarks using proprietary data, enabling quality assurance on sensitive tasks
Executes standardized or custom test suites against language models to generate performance reports.
Python
microsoft/Swin-Transformer
15,715 · View on GitHub
Swin-Transformer is a deep learning framework designed for training and deploying hierarchical vision transformer models. It serves as a research library and toolkit for computer vision tasks, providing the infrastructure to build models that replace standard convolution operations with sliding window self-attention mechanisms. By utilizing a multi-scale feature hierarchy, the framework enables the processing of visual data at varying resolutions and spatial scales. The project distinguishes itself through its implementation of shifted window partitioning, which facilitates global information
Supports distributed inference and validation across multiple devices to measure model performance.
Python · ade20k · image-classification · imagenet
vibrantlabsai/ragas
12,659 · View on GitHub
Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin
Manages execution settings like timeouts and model parameters to control how evaluation tasks run.
Python · evaluation · llm · llmops
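As an illustration of the unified-schema idea in the Ragas description, here is a minimal sketch in which every sample carries the input, the retrieved contexts, and the generated response, and a pluggable judge scores it. The names EvalSample, mean_score, and context_overlap_judge are hypothetical and are not the Ragas API; the word-overlap judge is only a toy stand-in for a language model acting as judge.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class EvalSample:
    """Hypothetical unified record for one RAG interaction (not the Ragas schema)."""
    user_input: str
    retrieved_contexts: Sequence[str]
    response: str
    reference: Optional[str] = None


# A judge maps one sample to a score in [0, 1]. In a real pipeline this
# would wrap a call to a language model; here it is any plain callable.
Judge = Callable[[EvalSample], float]


def context_overlap_judge(sample: EvalSample) -> float:
    """Toy judge: fraction of response words that also occur in the contexts."""
    context_words = set(" ".join(sample.retrieved_contexts).lower().split())
    response_words = sample.response.lower().split()
    if not response_words:
        return 0.0
    hits = sum(word in context_words for word in response_words)
    return hits / len(response_words)


def mean_score(samples: Sequence[EvalSample], judge: Judge) -> float:
    """Average the judge's score over all samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(judge(sample) for sample in samples) / len(samples)


if __name__ == "__main__":
    contexts = ["Paris is the capital of France"]
    samples = [
        EvalSample("What is the capital of France?", contexts, "Paris is the capital"),
        EvalSample("What is the capital of France?", contexts, "Berlin is the capital"),
    ]
    print(mean_score(samples, context_overlap_judge))
```

Because the judge is just a callable over a fixed record type, swapping the toy scorer for a model-backed one would not change the surrounding evaluation loop.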
EleutherAI/lm-evaluation-harness
11,460 · View on GitHub
This project is a standardized framework for benchmarking large language models across a wide range of academic and reasoning datasets. It provides a platform for executing automated evaluation tasks to measure model accuracy and performance, ensuring consistent assessment through a structured configuration schema. The framework distinguishes itself by incorporating a dedicated utility for data decontamination, which identifies and removes overlapping training samples from evaluation sets to prevent data leakage. It also features a flexible task builder that allows users to define custom benc
Provides a standardized toolkit for measuring the performance of large language models across diverse academic and reasoning benchmarks.
Python · evaluation-framework · language-model · transformer
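The decontamination idea mentioned in the description above, removing evaluation samples that overlap with training data, can be sketched with a simple word n-gram check. This is an assumption about one common approach, not the harness's actual implementation; the function names and the default n are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(
    eval_samples: Iterable[str],
    training_docs: Iterable[str],
    n: int = 8,
) -> Tuple[List[str], List[str]]:
    """Split eval samples into (clean, flagged).

    A sample is flagged when it shares at least one word n-gram with any
    training document. The value of n is an illustrative choice.
    """
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= word_ngrams(doc, n)

    clean: List[str] = []
    flagged: List[str] = []
    for sample in eval_samples:
        if word_ngrams(sample, n) & train_ngrams:
            flagged.append(sample)
        else:
            clean.append(sample)
    return clean, flagged


if __name__ == "__main__":
    training = ["the quick brown fox jumps over the lazy dog"]
    evaluation = ["A quick brown fox jumps high", "completely unrelated question here"]
    clean, flagged = decontaminate(evaluation, training, n=3)
    print("clean:", clean)
    print("flagged:", flagged)
```

A larger n reduces false positives from common phrases at the cost of missing lightly paraphrased overlaps.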
wandb/client
11,128 · View on GitHub
This project is a collection of utilities designed for machine learning experiment tracking, data versioning, and the observability of large language model applications. It provides a client for recording hyperparameters and metrics during training to visualize performance trends and compare different model versions. The tool includes a model evaluation framework that uses custom scorers and automated judges to assess the quality of generated text outputs. It also provides observability tools to monitor and debug the execution flow and runtime behavior of language model applications. The sys
Ships a framework for running model inference and validation using custom scorers and automated judges.
Python
p-e-w/heretic
8,509 · View on GitHub
Heretic is a specialized toolkit for removing safety alignment and refusal constraints from transformer-based language models. It utilizes directional ablation to suppress model refusals and restore unrestricted output capabilities. The project provides a framework for quantifying the effectiveness of these modifications by measuring refusal rates and evaluating divergence from the original model behavior. It also includes a suite for residual vector analysis, allowing for the calculation of geometric relationships between prompts and the visualization of hidden states across model layers. A
Quantifies censorship removal effectiveness by measuring refusal rates and divergence from original model behavior.
Python · abliteration · llm · transformer
InternLM/opencompass
7,096 · View on GitHub
OpenCompass is a comprehensive evaluation platform, benchmarking suite, and distributed model evaluator designed to measure the performance and accuracy of large language models. It provides a framework for benchmarking both open-source and API-based models against diverse datasets using standardized metrics and reproducible pipelines. The project features an automated judging framework that uses language models as judges to score and verify the quality of generated text. It includes a performance leaderboard system for comparing the relative capabilities of various models across industry-sta
Provides a comprehensive framework for running model inference and validation on standardized datasets.
Python
open-compass/opencompass
6,678 · View on GitHub
OpenCompass is an open-source framework for standardized benchmarking of large language models. It provides a configurable evaluation pipeline that supports both objective and subjective assessment, using a dual-engine architecture to handle closed-form answer comparison and open-ended response rating. The framework is designed as a modular platform where datasets, models, and metrics are composed through declarative YAML configuration files. The framework distinguishes itself through its extensible model integration layer, which supports custom models, HuggingFace models, and third-party API
Integrates user-defined models into the evaluation pipeline by following the framework's extension interface.
Python · benchmark · chatgpt · evaluation
microsoft/BioGPT
4,486 · View on GitHub
BioGPT is a biomedical large language model and domain-specific transformer designed for processing and creating specialized medical text. It functions as a generative tool and knowledge extraction engine trained on large-scale scientific literature to produce human-like scientific prose and factual responses to queries. The project provides specialized capabilities for biomedical named entity recognition and the extraction of complex relations from unstructured medical corpora. It is designed to identify and classify biological entities through data mining and relation extraction to support
Includes utilities for running model inference and validation on standard biomedical datasets.
Python
openai/simple-evals
4,354 · View on GitHub
This project is a language model evaluation framework and benchmarking tool designed to measure the accuracy and performance of models across diverse datasets. It provides a system for implementing model-based graders, running standardized tests for mathematical reasoning, coding, and factuality, and calculating quantified performance metrics such as precision, recall, F1 scores, and pass-at-k. The framework utilizes model-based grading and rubrics to validate response quality against expert-defined criteria. It includes a multi-model benchmarking loop and a model-agnostic API interface to co
Provides a unified framework for running model inference and validation across standard language model benchmarks.
Python
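For readers unfamiliar with the metrics named in the description above, here is a minimal sketch of how precision, recall, F1, and pass-at-k are commonly computed. It uses the widely cited unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k); whether simple-evals computes it this way is an assumption, and the function names are hypothetical.

```python
from math import comb
from typing import Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c are correct.

    This is the probability that at least one of k samples, drawn without
    replacement from the n generations, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```

With k = 1 the estimator reduces to the plain fraction of correct samples, c / n.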
open-compass/VLMEvalKit
3,824 · View on GitHub
VLMEvalKit is a vision-language model evaluation framework and inference engine designed to run standardized benchmarks and measure model accuracy across diverse visual datasets. It serves as a multimodal model benchmark and performance toolkit for calculating metrics and comparing model responses. The toolkit includes a specialized visual reasoning evaluator that uses adversarial samples to distinguish actual image understanding from reliance on language patterns. It also provides capabilities for image generation evaluation, testing a model's ability to create or modify visuals based on tex
Provides a complete toolkit for running standardized benchmarks and measuring VLM accuracy.
Python · chatgpt · claude · clip
verazuo/jailbreak_llms
3,563 · View on GitHub
This project is a comprehensive ecosystem of frameworks, toolkits, and datasets designed to evaluate model vulnerabilities and analyze jailbreak patterns. It serves as an adversarial testing framework and research toolkit for measuring the effectiveness of safety guardrails in large language models. The system includes a library of real-world prompt injection datasets harvested from social media to study bypass strategies. It provides specialized tools for semantic attack analysis and prompt visualization, allowing for the mapping of relationships between adversarial prompts to discover commo
Provides a system for running model inference and validation against curated forbidden datasets.
Jupyter Notebook · chatgpt · jailbreak · jailbreaking
thunlp/UltraChat
2,786 · View on GitHub
UltraChat is a collection of large-scale conversational datasets and instruction-tuning data designed for training and evaluating generative AI models. It provides structured JSON data consisting of complex, multi-round dialogue sequences intended to refine the performance of large language models in chat tasks. The project focuses on improving reasoning and response quality through a diverse set of interactions across multiple sectors. These datasets are used for supervised fine-tuning and instruction tuning workflows to improve how models follow complex directions and maintain context acros
Provides a framework for testing model coherence and reasoning across diverse topical datasets.
Python · chatbot · chatgpt · deep-learning