18 repository-uri
Utilities for running model inference and validation on standard datasets.
Distinguishing note: Supports multi-node and multi-GPU evaluation environments.
Explore 18 awesome GitHub repositories matching artificial intelligence & ml · Model Evaluation Frameworks. Refine with filters or upvote what's useful.
This project is a modular research toolkit designed for developing, training, and evaluating deep learning models for object detection, segmentation, and video instance tracking. It provides a flexible training engine that manages complex neural network execution, including distributed training, custom lifecycle hooks, and weight optimization. The framework is built around a hierarchical configuration system that allows users to define architectures, data pipelines, and training hyperparameters through composable, inheritable files. The project distinguishes itself through its highly modular
Enables large-scale model evaluation across single or multi-GPU environments.
This project serves as a comprehensive, static directory of external resources dedicated to the study and application of large language models. It functions as a centralized discovery point for developers and researchers, aggregating foundational academic papers, technical documentation, and specialized tools within a structured, version-controlled knowledge base. The repository distinguishes itself through a multi-level classification system that organizes diverse technical domains, ranging from model training frameworks and inference optimization to AI safety and hallucination detection. By
Locating specialized tools and methodologies for detecting hallucinations, ensuring model security, and aligning system behavior with human preferences.
Runs systematic evaluations using built-in metrics to track quality and detect regressions in model performance.
Open-r1 is a framework designed for the large-scale training, distillation, and optimization of language models focused on complex reasoning and programming tasks. It provides a comprehensive suite of tools for managing distributed training jobs across multi-node clusters, enabling the development of high-performance models through reinforcement learning and supervised fine-tuning. The project distinguishes itself by integrating secure, containerized code execution environments directly into the training and evaluation lifecycle. By allowing models to run and verify code snippets against test
Provides a standardized suite of benchmarks and testing tools designed to measure performance on mathematical, logical, and programming problem-solving tasks.
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facili
Supports multi-node and multi-GPU evaluation environments for benchmarking reasoning and instruction-following performance.
Evals is a framework designed for automating, managing, and executing repeatable benchmarking suites to analyze the quality and performance of language models. It provides a platform for running standardized tests to measure model accuracy and track behavioral changes over time. The system distinguishes itself through a modular architecture that uses a standardized adapter layer to normalize inputs and outputs, allowing different models to be swapped and tested interchangeably. It supports the creation of custom benchmarks using proprietary data, enabling quality assurance on sensitive tasks
Executes standardized or custom test suites against language models to generate performance reports.
Swin-Transformer is a deep learning framework designed for training and deploying hierarchical vision transformer models. It serves as a research library and toolkit for computer vision tasks, providing the infrastructure to build models that replace standard convolution operations with sliding window self-attention mechanisms. By utilizing a multi-scale feature hierarchy, the framework enables the processing of visual data at varying resolutions and spatial scales. The project distinguishes itself through its implementation of shifted window partitioning, which facilitates global information
Supports distributed inference and validation across multiple devices to measure model performance.
Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin
Manages execution settings like timeouts and model parameters to control how evaluation tasks run.
This project is a standardized framework for benchmarking large language models across a wide range of academic and reasoning datasets. It provides a platform for executing automated evaluation tasks to measure model accuracy and performance, ensuring consistent assessment through a structured configuration schema. The framework distinguishes itself by incorporating a dedicated utility for data decontamination, which identifies and removes overlapping training samples from evaluation sets to prevent data leakage. It also features a flexible task builder that allows users to define custom benc
Provides a standardized toolkit for measuring the performance of large language models across diverse academic and reasoning benchmarks.
This project is a collection of utilities designed for machine learning experiment tracking, data versioning, and the observability of large language model applications. It provides a client for recording hyperparameters and metrics during training to visualize performance trends and compare different model versions. The tool includes a model evaluation framework that uses custom scorers and automated judges to assess the quality of generated text outputs. It also provides observability tools to monitor and debug the execution flow and runtime behavior of language model applications. The sys
Ships a framework for running model inference and validation using custom scorers and automated judges.
Heretic is a specialized toolkit for removing safety alignment and refusal constraints from transformer-based language models. It utilizes directional ablation to suppress model refusals and restore unrestricted output capabilities. The project provides a framework for quantifying the effectiveness of these modifications by measuring refusal rates and evaluating divergence from the original model behavior. It also includes a suite for residual vector analysis, allowing for the calculation of geometric relationships between prompts and the visualization of hidden states across model layers. A
Quantifies censorship removal effectiveness by measuring refusal rates and divergence from original model behavior.
OpenCompass is a comprehensive evaluation platform, benchmarking suite, and distributed model evaluator designed to measure the performance and accuracy of large language models. It provides a framework for benchmarking both open-source and API-based models against diverse datasets using standardized metrics and reproducible pipelines. The project features an automated judging framework that uses language models as judges to score and verify the quality of generated text. It includes a performance leaderboard system for comparing the relative capabilities of various models across industry-sta
Provides a comprehensive framework for running model inference and validation on standardized datasets.
OpenCompass is an open-source framework for standardized benchmarking of large language models. It provides a configurable evaluation pipeline that supports both objective and subjective assessment, using a dual-engine architecture to handle closed-form answer comparison and open-ended response rating. The framework is designed as a modular platform where datasets, models, and metrics are composed through declarative YAML configuration files. The framework distinguishes itself through its extensible model integration layer, which supports custom models, HuggingFace models, and third-party API
Integrates user-defined models into the evaluation pipeline by following the framework's extension interface.
BioGPT is a biomedical large language model and domain-specific transformer designed for processing and creating specialized medical text. It functions as a generative tool and knowledge extraction engine trained on large-scale scientific literature to produce human-like scientific prose and factual responses to queries. The project provides specialized capabilities for biomedical named entity recognition and the extraction of complex relations from unstructured medical corpora. It is designed to identify and classify biological entities through data mining and relation extraction to support
Includes utilities for running model inference and validation on standard biomedical datasets.
This project is a language model evaluation framework and benchmarking tool designed to measure the accuracy and performance of models across diverse datasets. It provides a system for implementing model-based graders, running standardized tests for mathematical reasoning, coding, and factuality, and calculating quantified performance metrics such as precision, recall, F1 scores, and pass-at-k. The framework utilizes model-based grading and rubrics to validate response quality against expert-defined criteria. It includes a multi-model benchmarking loop and a model-agnostic API interface to co
Provides a unified framework for running model inference and validation across standard language model benchmarks.
VLMEvalKit is a vision-language model evaluation framework and inference engine designed to run standardized benchmarks and measure model accuracy across diverse visual datasets. It serves as a multimodal model benchmark and performance toolkit for calculating metrics and comparing model responses. The toolkit includes a specialized visual reasoning evaluator that uses adversarial samples to distinguish actual image understanding from reliance on language patterns. It also provides capabilities for image generation evaluation, testing a model's ability to create or modify visuals based on tex
Provides a complete toolkit for running standardized benchmarks and measuring VLM accuracy.
This project is a comprehensive ecosystem of frameworks, toolkits, and datasets designed to evaluate model vulnerabilities and analyze jailbreak patterns. It serves as an adversarial testing framework and research toolkit for measuring the effectiveness of safety guardrails in large language models. The system includes a library of real-world prompt injection datasets harvested from social media to study bypass strategies. It provides specialized tools for semantic attack analysis and prompt visualization, allowing for the mapping of relationships between adversarial prompts to discover commo
Provides a system for running model inference and validation against curated forbidden datasets.
UltraChat is a collection of large-scale conversational datasets and instruction-tuning data designed for training and evaluating generative AI models. It provides structured JSON data consisting of complex, multi-round dialogue sequences intended to refine the performance of large language models in chat tasks. The project focuses on improving reasoning and response quality through a diverse set of interactions across multiple sectors. These datasets are used for supervised fine-tuning and instruction tuning workflows to improve how models follow complex directions and maintain context acros
Provides a framework for testing model coherence and reasoning across diverse topical datasets.