# Open LLM Benchmarking Suites

> Search results for `benchmark suite for comparing open LLMs` on awesome-repositories.com. 119 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/benchmark-suite-for-comparing-open-llms

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/benchmark-suite-for-comparing-open-llms).**

## Results

- [facebookresearch/map-anything](https://awesome-repositories.com/repository/facebookresearch-map-anything.md) (2,915 ⭐) — Map-anything is a 3D scene reconstruction framework and neural geometry estimator designed to transform two-dimensional images into metric three-dimensional spatial representations using feed-forward neural networks. It provides a specialized toolkit for predicting camera intrinsics and ray directions from single images without requiring external geometric metadata.

The project includes a 3D model benchmarking suite that utilizes a unified model wrapper to standardize outputs from diverse reconstruction models. This allows for consistent evaluation and accuracy measurement across various spat
- [clickhouse/clickhouse](https://awesome-repositories.com/repository/clickhouse-clickhouse.md) (48,229 ⭐) — ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring.

The platform distinguishes itself through ad
- [eugeneyan/open-llms](https://awesome-repositories.com/repository/eugeneyan-open-llms.md) (12,804 ⭐) — 📋 A list of open LLMs available for commercial use.
- [open-edge-platform/anomalib](https://awesome-repositories.com/repository/open-edge-platform-anomalib.md) (5,871 ⭐) — Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings.

The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
- [forem/forem](https://awesome-repositories.com/repository/forem-forem.md) (22,726 ⭐) — Forem is an open-source platform designed for building and managing technical communities. It functions as a social publishing engine that enables members to share long-form content, participate in threaded discussions, and engage through social interactions. The platform provides tools for organizations to maintain branded profiles, host community hackathons, and facilitate collaborative learning through structured educational tracks.

Beyond its social features, Forem integrates advanced capabilities for AI agent workflow orchestration and codebase knowledge graphing. It allows developers to
- [smallnest/go-web-framework-benchmark](https://awesome-repositories.com/repository/smallnest-go-web-framework-benchmark.md) (2,138 ⭐) — This benchmark suite compares the performance of Go web frameworks. It is inspired by Go HTTP Router Benchmark but differs from it: Go HTTP Router Benchmark compares the performance of routers, whereas this suite aims to compare whole HTTP…
- [deepinsight/insightface](https://awesome-repositories.com/repository/deepinsight-insightface.md) (29,002 ⭐) — InsightFace is a comprehensive deep learning framework designed for face recognition, biometric identity verification, and feature extraction. It provides a specialized engine for one-to-one verification and one-to-many identification tasks, utilizing convolutional neural networks to transform raw image pixels into high-dimensional vector embeddings. The project includes a complete toolkit for detecting, aligning, and processing facial data to ensure consistent identity discrimination.

Beyond core recognition, the platform distinguishes itself through an extensive model management and optimiz
- [gliviu/dir-compare](https://awesome-repositories.com/repository/gliviu-dir-compare.md) (202 ⭐) — Directory comparison for Node.js.
- [fastapi/fastapi](https://awesome-repositories.com/repository/fastapi-fastapi.md) (99,260 ⭐) — FastAPI is a web framework for building APIs with Python. It leverages standard language type hints to provide automatic data validation, request parsing, and interactive API documentation generation. The framework supports asynchronous request handling and manages execution contexts to prevent blocking the main event loop.

The project includes a dependency injection system that allows for the resolution and injection of reusable components into request handlers. This system supports request-scoped caching, lifecycle management, and integration with security mechanisms like OAuth2 and JSON We
- [vwxyzjn/cleanrl](https://awesome-repositories.com/repository/vwxyzjn-cleanrl.md) (9,127 ⭐) — CleanRL is a reinforcement learning library and PyTorch framework providing a suite of reproducible implementations for online reinforcement learning algorithms. It serves as a deep reinforcement learning benchmark suite and experiment orchestrator designed for research and agent development across both discrete and continuous action spaces.

The project is distinguished by its single-file algorithm implementation approach, which encapsulates each algorithm in a standalone script to eliminate complex class hierarchies. This structure is paired with a system for scheduling and executing large-s
- [facefusion/facefusion](https://awesome-repositories.com/repository/facefusion-facefusion.md) (28,806 ⭐) — Facefusion is a modular framework designed for automated image and video manipulation, specializing in tasks such as face swapping, enhancement, and restoration. It functions as a computer vision processing pipeline that chains independent machine learning modules to perform complex transformations, including facial animation, age modification, and lip synchronization. The system is built to handle both real-time interactive feeds and large-scale batch processing tasks.

The platform distinguishes itself through a highly extensible architecture that supports custom processing modules and inter
- [pytorch/benchmark](https://awesome-repositories.com/repository/pytorch-benchmark.md) (1,035 ⭐) — TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
- [trycua/cua](https://awesome-repositories.com/repository/trycua-cua.md) (18,720 ⭐) — Cua is an agent benchmarking and desktop automation platform designed to evaluate autonomous agents and execute repetitive tasks within isolated, virtualized environments. It provides a framework for provisioning consistent workspaces and measuring agent performance against standardized desktop operations.

The platform distinguishes itself by integrating virtual machine orchestration with headless interaction capabilities. By leveraging hypervisor-based virtualization, it runs operating systems at near-native speeds, while its automation layer injects commands directly into application proces
- [aider-ai/aider](https://awesome-repositories.com/repository/aider-ai-aider.md) (46,305 ⭐) — Aider is a command-line interface tool that enables large language models to directly edit, refactor, and manage source code within a local repository. It functions as an AI-powered coding assistant that integrates into the developer workflow, allowing users to apply code changes through natural language prompts while maintaining repository context and version control.

The tool distinguishes itself through a specialized diff-based patching engine that parses model-generated search-and-replace blocks to modify specific file segments without rewriting entire files. It features a provider-agnost
- [huggingface/lerobot](https://awesome-repositories.com/repository/huggingface-lerobot.md) (21,687 ⭐) — This project is a comprehensive research platform designed for the end-to-end lifecycle of robotic learning. It provides a modular framework for training neural network policies—specifically through imitation and reinforcement learning—and deploying them onto physical robotic hardware. By offering a unified interface for hardware abstraction, the platform decouples high-level control logic from the specific sensors and actuators of diverse robotic systems.

The framework distinguishes itself through a standardized approach to data and policy management. It utilizes a consistent schema for reco
- [phoronix-test-suite/phoronix-test-suite](https://awesome-repositories.com/repository/phoronix-test-suite-phoronix-test-suite.md) (3,080 ⭐) — The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available for Linux, Solaris, macOS, Windows, and BSD operating systems. The Phoronix Test Suite allows for carrying out tests in a fully automated manner from test installation to execution and reporting. All…
- [berriai/litellm](https://awesome-repositories.com/repository/berriai-litellm.md) (50,579 ⭐) — LiteLLM is a unified gateway and proxy server designed to centralize access to over one hundred language model providers. It provides a standardized API interface that abstracts vendor-specific schemas, allowing developers to interact with diverse models through a single, consistent format. By acting as a central traffic management layer, it enables organizations to route, secure, and govern model interactions across multiple deployments.

The platform distinguishes itself through its policy-driven architecture, which uses configuration-based routing to manage traffic distribution, load balanc
- [open-compass/opencompass](https://awesome-repositories.com/repository/open-compass-opencompass.md) (6,678 ⭐) — OpenCompass is an open-source framework for standardized benchmarking of large language models. It provides a configurable evaluation pipeline that supports both objective and subjective assessment, using a dual-engine architecture to handle closed-form answer comparison and open-ended response rating. The framework is designed as a modular platform where datasets, models, and metrics are composed through declarative configuration files.

The framework distinguishes itself through its extensible model integration layer, which supports custom models, HuggingFace models, and third-party API
- [yourls/yourls-test-suite-for-plugins](https://awesome-repositories.com/repository/yourls-yourls-test-suite-for-plugins.md) (2 ⭐) — The YOURLS test suite for plugins is a tool to test YOURLS plugins with standard PHPUnit tests.
- [ariya/phantomjs](https://awesome-repositories.com/repository/ariya-phantomjs.md) (29,489 ⭐) — PhantomJS is a scriptable, headless browser engine based on WebKit that provides a programmatic interface for automating web page interactions. It operates without a graphical user interface, allowing for the execution of JavaScript to navigate pages, manipulate the document object model, and perform functional testing of web applications.

The tool distinguishes itself by providing low-level control over the browser rendering lifecycle and network stack. It enables real-time interception and modification of network traffic, alongside the ability to generate visual snapshots and document expor
- [suites-dev/suites](https://awesome-repositories.com/repository/suites-dev-suites.md) (538 ⭐) — A unit testing framework for TypeScript backends working with inversion of control and dependency injection
- [openai/gym](https://awesome-repositories.com/repository/openai-gym.md) (37,223 ⭐) — Gym is a reinforcement learning environment toolkit and agent simulation framework. It provides a standardized API and a universal communication interface that defines how learning agents interact with simulation environments through actions and observations.

The project includes a benchmark environment suite and a diverse library of pre-configured simulation worlds, including physics engines and classic control tasks. It enables the creation of custom simulation environments to train agents in specific operational scenarios while ensuring reproducibility across different learning algorithms.
- [pageman/sutskever-30-implementations](https://awesome-repositories.com/repository/pageman-sutskever-30-implementations.md) (3,148 ⭐) — This project is a collection of deep learning research implementations and a reproduction kit designed to translate theoretical AI papers into working code. It provides a library of neural network architectures and reference implementations for reproducing seminal research concepts through interactive notebooks.

The repository distinguishes itself through the implementation of AI theory and scaling laws, covering complexity dynamics, information theory, and the simulation of universal AI agents. It also includes a benchmarking suite for synthetic reasoning, allowing for the evaluation of mode
- [apitable/apitable](https://awesome-repositories.com/repository/apitable-apitable.md) (15,265 ⭐) — This platform is a low-code database system that combines the flexibility of a spreadsheet interface with the structured power of a relational database. It serves as a collaborative workspace for managing complex datasets, building custom business applications, and automating operational workflows without requiring traditional software development.

The platform distinguishes itself through deep integration of artificial intelligence, which enables users to query databases using natural language, generate content, and deploy custom conversational agents trained on internal data. It supports re
- [akarnokd/jmh-compare-gui](https://awesome-repositories.com/repository/akarnokd-jmh-compare-gui.md) (71 ⭐) — GUI for comparing JMH results
- [llm-attacks/llm-attacks](https://awesome-repositories.com/repository/llm-attacks-llm-attacks.md) (4,509 ⭐) — This repository provides tools and methodologies for studying adversarial attacks on large language models. It focuses on understanding how carefully crafted inputs can manipulate or bypass the safety mechanisms of LLMs, enabling researchers to probe model vulnerabilities and improve their robustness. The project covers techniques for generating adversarial prompts, evaluating model responses under attack conditions, and analyzing the effectiveness of different attack strategies.
- [sylvaincombes/jquery-images-compare](https://awesome-repositories.com/repository/sylvaincombes-jquery-images-compare.md) (65 ⭐) — A jQuery plugin for comparing two images
- [activepieces/activepieces](https://awesome-repositories.com/repository/activepieces-activepieces.md) (20,887 ⭐) — Activepieces is an open-source, self-hosted workflow automation platform designed to connect third-party applications through modular triggers and actions. It provides a low-code integration framework that allows users to build, manage, and execute complex business logic sequences within isolated, sandboxed environments.

The platform distinguishes itself through its focus on embeddability and enterprise-grade security. It features an embedded automation builder that can be integrated into external applications via iframes, supported by comprehensive identity and access management tools such a
- [kuangliu/pytorch-cifar](https://awesome-repositories.com/repository/kuangliu-pytorch-cifar.md) (6,360 ⭐) — This is a PyTorch-based training pipeline designed for reproducible image classification benchmarking on the CIFAR-10 dataset. It integrates GPU-accelerated computation, data augmentation, learning rate scheduling, and checkpointing to produce consistent accuracy measurements across multiple ResNet architectures.

The project distinguishes itself by providing a fixed-architecture benchmark suite that trains a predefined set of ResNet variants, from ResNet18 through ResNet152, on CIFAR-10. It implements a step-based learning rate decay schedule at predetermined epochs to stabilize convergence,
- [camel-ai/camel](https://awesome-repositories.com/repository/camel-ai-camel.md) (17,253 ⭐) — This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer.

The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
- [wgwang/awesome-llm-benchmarks](https://awesome-repositories.com/repository/wgwang-awesome-llm-benchmarks.md) (164 ⭐) — Awesome LLM Benchmarks to evaluate the LLMs across text, code, image, audio, video and more.
- [answerdotai/llms-txt](https://awesome-repositories.com/repository/answerdotai-llms-txt.md) (2,442 ⭐) — The /llms.txt file, helping language models use your website
- [avelino/awesome-go](https://awesome-repositories.com/repository/avelino-awesome-go.md) (175,576 ⭐) — This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently discover resources tailored to specific functional domains.

The repository distinguishes itself through a decentralized contribution model, where community-driven updates ensure the index remains current with the rapidly evolving software landscape. Beyond simple resource listing,
- [huggingface/open-r1](https://awesome-repositories.com/repository/huggingface-open-r1.md) (26,326 ⭐) — Open-r1 is a framework designed for the large-scale training, distillation, and optimization of language models focused on complex reasoning and programming tasks. It provides a comprehensive suite of tools for managing distributed training jobs across multi-node clusters, enabling the development of high-performance models through reinforcement learning and supervised fine-tuning.

The project distinguishes itself by integrating secure, containerized code execution environments directly into the training and evaluation lifecycle. By allowing models to run and verify code snippets against test
- [astral-sh/uv](https://awesome-repositories.com/repository/astral-sh-uv.md) (86,451 ⭐) — uv is a high-performance Python package manager and project build tool designed to handle dependency resolution, virtual environment orchestration, and Python interpreter management. It functions as a comprehensive workspace orchestrator, enabling developers to manage complex, multi-package repositories and ensure reproducible builds across different platforms.

The tool distinguishes itself through its use of a global, content-addressable cache and hard-link-based environment provisioning, which allow for near-instant environment creation and minimal disk usage. It employs a high-performance
- [chris00/ocaml-benchmark](https://awesome-repositories.com/repository/chris00-ocaml-benchmark.md) (34 ⭐) — Benchmarking module for OCaml
- [jingyaogong/minimind](https://awesome-repositories.com/repository/jingyaogong-minimind.md) (51,834 ⭐) — This project is a comprehensive framework for the entire lifecycle of transformer-based language models, supporting everything from foundational pretraining to specialized deployment. It provides a modular toolkit for defining neural network architectures, managing data preparation pipelines, and executing training routines across various scales. The framework is designed to handle the full model development process, including supervised fine-tuning, behavioral alignment, and the integration of agentic capabilities.

What distinguishes this framework is its focus on efficient training and adva
- [omichelsen/compare-versions](https://awesome-repositories.com/repository/omichelsen-compare-versions.md) (637 ⭐) — Compare semver version strings to find greater, equal or lesser. Runs in the browser as well as Node.js/React Native etc. Has no dependencies and is tiny.
- [foundry-rs/foundry](https://awesome-repositories.com/repository/foundry-rs-foundry.md) (10,125 ⭐) — Foundry is an Ethereum smart contract development toolkit and blockchain simulator designed for compiling, testing, and deploying contracts for the Ethereum Virtual Machine. It provides a local environment for simulating blockchain state and forking live networks to execute code without modifying the actual chain.

The project features a property-based fuzzing engine to identify edge-case failures in contract logic and a transaction debugger for analyzing detailed execution traces and gas consumption. It enables developers to mirror the state of a remote chain locally to test against real-worl
- [jbhuang0604/awesome-computer-vision](https://awesome-repositories.com/repository/jbhuang0604-awesome-computer-vision.md) (23,074 ⭐) — This project is a comprehensive, community-driven repository that serves as a centralized catalog for computer vision research and development. It functions as a structured index of academic papers, open-source software libraries, public datasets, and educational tutorials, providing a navigation point for the complex landscape of modern vision technology.

The repository distinguishes itself through a taxonomy-based indexing system that maps the relationships between foundational research, influential academic figures, and their corresponding software implementations. By utilizing a lightweig
- [chopratejas/headroom](https://awesome-repositories.com/repository/chopratejas-headroom.md) (29,537 ⭐) — Headroom is an AI gateway proxy and token optimizer designed to reduce the cost and latency of large language model interactions. It functions as an intermediary that intercepts traffic between clients and providers to apply context compression, request routing, and format translation.

The system differentiates itself through a Model Context Protocol server implementation that delivers compression and retrieval tools to compatible AI hosts. It employs a content-aware compression pipeline and tiered importance scoring to trim redundant data from logs and tool outputs while preserving essential
- [internlm/opencompass](https://awesome-repositories.com/repository/internlm-opencompass.md) (7,096 ⭐) — OpenCompass is a comprehensive evaluation platform, benchmarking suite, and distributed model evaluator designed to measure the performance and accuracy of large language models. It provides a framework for benchmarking both open-source and API-based models against diverse datasets using standardized metrics and reproducible pipelines.

The project features an automated judging framework that uses language models as judges to score and verify the quality of generated text. It includes a performance leaderboard system for comparing the relative capabilities of various models across industry-sta
- [sebastianbergmann/comparator](https://awesome-repositories.com/repository/sebastianbergmann-comparator.md) (7,053 ⭐) — This library is a data assertion tool and equality logic framework for PHP. It provides utilities to verify that two values, nested objects, or complex data types match based on their internal contents.

The project distinguishes itself through the use of custom matching rules and configurable precision. It allows for the comparison of floating point numbers and dates using a defined margin of error to account for numeric precision loss.

The framework covers deep value equality verification across scalars, arrays, and nested objects. It implements strict type enforcement to prevent implicit c
- [cockroachdb/cockroach](https://awesome-repositories.com/repository/cockroachdb-cockroach.md) (32,207 ⭐) — Cockroach is a distributed SQL database designed to scale horizontally across multiple nodes while maintaining strict ACID compliance and global data consistency. It functions as a relational database engine that automatically partitions data into ranges, rebalancing them across a cluster to accommodate growing storage and throughput requirements. By utilizing a distributed consensus protocol, the system ensures that all nodes agree on the order of operations, providing fault tolerance and continuous availability even in the event of hardware failures.

The system distinguishes itself through
- [mme-benchmarks/video-mme](https://awesome-repositories.com/repository/mme-benchmarks-video-mme.md) (779 ⭐) — ✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- [conardli/easy-dataset](https://awesome-repositories.com/repository/conardli-easy-dataset.md) (13,394 ⭐) — Easy-dataset is a comprehensive platform designed for the end-to-end management of machine learning datasets, specifically tailored for language and vision model fine-tuning. It functions as a centralized environment for the entire data lifecycle, encompassing the automated generation of synthetic training data, the structural organization of document collections, and the systematic annotation of individual data points.

The platform distinguishes itself through its integrated evaluation and orchestration capabilities. It provides a dedicated suite for benchmarking models, featuring blind side
- [zju-llms/foundations-of-llms](https://awesome-repositories.com/repository/zju-llms-foundations-of-llms.md) (15,771 ⭐) — Foundations-of-LLMs is an educational curriculum and technical resource designed to explain the mathematical and computational principles behind modern generative language models. It provides a structured guide for developers and practitioners to master the fundamental concepts, architectural designs, and training methodologies that enable these systems to function.

The project covers the core mechanisms of transformer-based sequence modeling, including self-attention, subword tokenization, and autoregressive generation. It details the technical frameworks used in natural language processing
- [curl/curl](https://awesome-repositories.com/repository/curl-curl.md) (42,214 ⭐) — Curl is a command-line tool and portable library for transferring data across a wide range of network protocols. It functions as a unified engine that abstracts diverse communication standards, allowing users and developers to move files and information between servers using a consistent interface. The project provides both a versatile command-line client for terminal-based automation and a stable programmatic interface for integrating complex network operations into applications.

The system is distinguished by its protocol-agnostic core and its ability to manage both synchronous and asynchro
- [openai/evals](https://awesome-repositories.com/repository/openai-evals.md) (18,702 ⭐) — Evals is a framework designed for automating, managing, and executing repeatable benchmarking suites to analyze the quality and performance of language models. It provides a platform for running standardized tests to measure model accuracy and track behavioral changes over time.

The system distinguishes itself through a modular architecture that uses a standardized adapter layer to normalize inputs and outputs, allowing different models to be swapped and tested interchangeably. It supports the creation of custom benchmarks using proprietary data, enabling quality assurance on sensitive tasks
- [browser-use/browser-use](https://awesome-repositories.com/repository/browser-use-browser-use.md) (100,229 ⭐) — Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows without relying on brittle selectors. The system functions as a headless browser controller, providing a programmatic interface to manage browser instances and execute granular interactions.

The project distinguishes itself through its ability to translate high-level intent into
