55 repositories

Awesome GitHub Repositories: LLM Evaluation

Tools for measuring the quality of model outputs using custom metrics and automated judges.

Distinguishing note: Focuses on programmatic evaluation of LLM pipelines, distinct from standard unit testing.
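As a rough illustration of what "custom metrics and automated judges" can mean in practice, the sketch below scores a toy pipeline over a tiny dataset. Everything in it is hypothetical and not taken from any repository listed here: `token_f1` is one possible custom metric, `stub_judge` is a placeholder where a real setup would call a language model, and `toy_pipeline` stands in for the application under test.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Example:
    prompt: str
    reference: str


def token_f1(output: str, reference: str) -> float:
    """Custom metric: F1 overlap between whitespace-separated token sets."""
    out = set(output.lower().split())
    ref = set(reference.lower().split())
    common = len(out & ref)
    if common == 0:
        return 0.0
    precision = common / len(out)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def stub_judge(prompt: str, output: str) -> float:
    """Placeholder judge (assumption): a real judge would prompt an LLM.
    This one only rewards outputs that end with a period."""
    return 1.0 if output.strip().endswith(".") else 0.0


def toy_pipeline(prompt: str) -> str:
    """Hypothetical pipeline under test; returns canned answers."""
    answers = {"Capital of France?": "Paris is the capital.", "2 + 2?": "4"}
    return answers.get(prompt, "")


def evaluate(
    pipeline: Callable[[str], str],
    dataset: List[Example],
    judge: Callable[[str, str], float] = stub_judge,
) -> Dict[str, float]:
    """Run the pipeline on each example and average both scores."""
    rows = []
    for ex in dataset:
        output = pipeline(ex.prompt)
        rows.append(
            {"f1": token_f1(output, ex.reference), "judge": judge(ex.prompt, output)}
        )
    return {key: mean(row[key] for row in rows) for key in ("f1", "judge")}


DATASET = [
    Example("Capital of France?", "Paris is the capital."),
    Example("2 + 2?", "The answer is 4."),
]

if __name__ == "__main__":
    print(evaluate(toy_pipeline, DATASET))
```

Unlike a unit test, which passes or fails on a fixed assertion, this kind of evaluation produces aggregate scores over a dataset, so changes to a prompt or model can be compared run to run.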

Explore 55 awesome GitHub repositories matching testing & quality assurance · LLM Evaluation.


Torantulino/Auto-GPT
184,986 stars
Auto-GPT is an autonomous agent framework designed for creating and deploying AI agents that use large language models to plan and execute complex goals independently. The system provides a comprehensive environment for managing the entire agent lifecycle, from initial design and testing to live production deployment. The project features a low-code workflow designer that allows users to define agent behaviors by connecting functional blocks in a visual interface. It includes an agent marketplace for discovering and deploying pre-configured agent templates and a standardized evaluation tool t
Runs agents through a standardized testing environment to measure performance against objective benchmarks.
Python

run-llama/llama_index
50,306 stars
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
LlamaIndex evaluates application performance using standardized datasets and testing patterns to iteratively improve accuracy and reliability.
Python · agents · application · data

anthropics/anthropic-cookbook
45,984 stars
This repository is a collection of guides, notebooks, and recipes for implementing advanced prompting techniques and workflow patterns with large language models. It serves as a prompt engineering guide, an evaluation suite for scoring prompt quality, and a framework for orchestrating agents and integrating external tools. The project provides implementation patterns for building applications with Claude, specifically focusing on coordinating multiple models to split complex tasks between high-reasoning and high-efficiency agents. It includes technical demonstrations for multimodal data proce
Provides a suite for systematically testing and scoring prompt quality using model-based evaluation.
Jupyter Notebook

lm-sys/FastChat
39,472 stars
FastChat is a training and serving platform for large language models that provides an integrated toolkit for fine-tuning, hosting, and benchmarking chatbots. It functions as an inference server capable of hosting multiple models and exposing them via a standardized API for chat applications. The platform distinguishes itself through a distributed model controller that manages worker nodes and routes requests across a hardware-agnostic inference layer supporting various accelerators. It includes a dedicated evaluation framework for assessing model quality using automated judges, multi-turn di
Implements a system for assessing LLM quality using automated judges and human-driven side-by-side comparisons.
Python

stanfordnlp/dspy
35,325 stars
DSPy is a declarative programming framework designed for building complex language model applications. It treats model interactions as modular, composable programs, allowing developers to define task logic through typed class schemas rather than relying on manually written prompts. By organizing workflows into hierarchical, reusable Python objects, the framework enables the construction of sophisticated AI systems that manage state and execution flow independently. The framework distinguishes itself through an automated optimization engine that iteratively refines prompt instructions and few-
Measures output quality using custom metrics and model-based judges to ensure consistent behavior across pipelines.
Python

mlflow/mlflow
26,554 stars
Provides a dedicated environment for evaluating LLM application performance through prediction functions and datasets.
Python · agentops · agents · ai

langchain-ai/deepagents
25,006 stars
Deepagents is an LLM agent orchestration platform and stateful application server designed for deploying and managing AI agents built with computational graphs. It provides a containerized runtime environment that handles agent execution, state persistence, and the versioning of AI assistants. The platform distinguishes itself through deep integration with the Model Context Protocol, allowing agents to function as servers that expose tools and capabilities to external clients. It features a sophisticated observability suite for capturing execution traces, performing LLM-based evaluations agai
Uses language models as automated judges to evaluate the quality and consistency of agent reasoning.
Python · agents · deepagents · langchain

microsoft/JARVIS
24,854 stars
JARVIS is a system for large language model task orchestration, deployment management, and automation benchmarking. It utilizes a task orchestrator to decompose complex requests into actionable steps and coordinates various expert models to synthesize final responses. The project includes an AI model deployment manager to handle the local deployment of expert models across different hardware scales. It further provides an AI workflow API consisting of web endpoints used to trigger automated task workflows and retrieve results from model selection stages. The framework incorporates an automat
Evaluates the ability of large language models to automate complex tasks using standardized datasets and metrics.
Python

aishwaryanr/awesome-generative-ai-guide
24,755 stars
This project is a community-driven knowledge repository and technical learning resource focused on the field of generative artificial intelligence. It serves as a centralized hub for developers and practitioners to access curated research, tutorials, and foundational concepts necessary for building and deploying modern artificial intelligence applications. The platform distinguishes itself through a collaborative, distributed contribution model that aggregates diverse learning materials into a structured, searchable knowledge base. It covers a wide range of specialized topics, including retri
Provides frameworks for rigorously evaluating the performance and safety of AI systems.
HTML · awesome · awesome-list · generative-ai

typpo/promptfoo
22,295 stars
promptfoo is an evaluation framework for measuring the performance of large language model prompts, agents, and retrieval augmented generation pipelines. It provides a suite of tools for conducting comparative benchmarking and executing automated quality and security regressions. The system features a benchmarking suite for running identical prompts across different model providers to compare output quality side-by-side. It also includes a dedicated red teaming tool for identifying security vulnerabilities and prompt injection risks through automated penetration testing. The framework suppor
Provides a comprehensive framework for measuring LLM output quality using custom metrics and automated judges.
TypeScript

comet-ml/comet-llm
19,673 stars
Comet LLM is an observability platform and evaluation framework designed for large language model applications and agentic workflows. It functions as a system for tracing, monitoring, and debugging execution flows while providing tools for prompt optimization and the enforcement of AI safety guardrails. The platform distinguishes itself through a combination of model-based scoring and heuristic metrics to quantify output quality and detect hallucinations. It includes a dedicated prompt and agent optimizer with an interactive playground for refining templates and tool configurations. For retri
Quantifies LLM pipeline quality using datasets, heuristic metrics, and automated judge scoring.
Python

openai/evals
18,702 stars
Evals is a framework designed for automating, managing, and executing repeatable benchmarking suites to analyze the quality and performance of language models. It provides a platform for running standardized tests to measure model accuracy and track behavioral changes over time. The system distinguishes itself through a modular architecture that uses a standardized adapter layer to normalize inputs and outputs, allowing different models to be swapped and tested interchangeably. It supports the creation of custom benchmarks using proprietary data, enabling quality assurance on sensitive tasks
Serves as a toolkit for building, running, and managing standardized benchmarks for large language models.
Python

comet-ml/opik
17,787 stars
Opik is an observability and evaluation platform designed for generative AI applications and agentic workflows. It provides a centralized environment for tracing execution flows, managing prompt templates, and monitoring production performance, allowing teams to gain visibility into complex model interactions and tool usage without requiring manual application code changes. The platform distinguishes itself through its integrated approach to the AI development lifecycle, combining distributed trace instrumentation with automated evaluation frameworks. It supports model-as-a-judge scoring, syn
Runs automated tests against defined tasks using datasets and metrics to measure output quality and application behavior.
Python · evaluation · hacktoberfest · hacktoberfest2025

raga-ai-hub/RagaAI-Catalyst
16,150 stars
RagaAI-Catalyst is a suite of software implementation tools providing an SDK, dashboard, and platform for monitoring, debugging, red-teaming, and evaluating agentic AI workflows. It serves as an observability framework for tracing the execution paths of large language models and multi-agent systems. The project distinguishes itself through a security suite for automated red-teaming and vulnerability scanning to detect biases, alongside a centralized prompt registry that decouples templates from application code. It further provides an evaluation platform that combines synthetic data generatio
Measures the accuracy and reliability of LLM outputs using specialized metrics and automated judges.
Python

explodinggradients/ragas
14,400 stars
Ragas is an evaluation framework and performance benchmark designed to quantify the quality of retrieval augmented generation pipelines. It functions as an application optimizer to identify bottlenecks in language model workflows using automated metrics and model-based scoring. The framework includes a system for generating synthetic datasets that mimic production scenarios and edge cases to create realistic test cases. It enables reference-free assessment, allowing the evaluation of response quality by analyzing grounding in the provided context without requiring gold-standard labels. The s
Provides a framework for measuring the quality of LLM outputs using automated judges and custom metrics.
Python

confident-ai/deepeval
13,733 stars
Deepeval is a framework for testing and evaluating large language model applications. It provides a suite of tools for executing automated regression tests, validating model output quality against defined standards, and tracing the execution of complex agent workflows. By integrating these capabilities into development pipelines, the platform ensures consistent performance and reliability throughout the software lifecycle. The platform distinguishes itself through its focus on programmatic validation and observability. It utilizes secondary language models to score output quality and employs
Uses secondary language models to evaluate and quantify the quality of outputs from primary models against predefined criteria.
Python · evaluation-framework · evaluation-metrics · llm-evaluation

coleam00/Archon
13,728 stars
Archon is an artificial intelligence agent automation engine designed to orchestrate complex development workflows. It functions as a platform for chaining multi-step tasks into directed graphs, allowing developers to standardize and execute repeatable coding patterns through declarative configuration files. The system distinguishes itself by maintaining stateful context across long-running sessions and executing operations within isolated, containerized worktrees to prevent file interference. It integrates with external language models and provides a centralized registry for sharing and inst
Evaluates the quality of automated task outputs using programmatic test cases and metrics.
Python

Lightning-AI/litgpt
13,431 stars
LitGPT is a training and deployment framework for large language models, providing a suite of tools for pretraining, finetuning, quantizing, evaluating, and serving models within a production environment. It includes a dedicated training pipeline for adapting pretrained models to specific tasks, a quantization tool for reducing weight precision, and an inference server for hosting models via web interfaces. The framework supports high-performance model development through custom architecture implementation and the use of predefined recipes to standardize pretraining and finetuning. It enables
Provides a benchmarking toolset for testing the generation quality and understanding of models against standardized datasets.
Python

datawhalechina/llm-universe
13,269 stars
llm-universe is a structured learning resource and technical guide focused on the development of large language model applications. It serves as a curriculum for mastering model orchestration, the creation of autonomous conversational agents, and the implementation of retrieval-augmented generation systems. The project provides detailed instructions on connecting model APIs with memory and tools to create execution chains. It specifically covers the construction of retrieval pipelines, including the process of cleaning raw documents, generating embeddings, and integrating vector databases to
Provides tools and methods for measuring the quality of model outputs using custom metrics.
Jupyter Notebook · langchain · rag

ShishirPatil/gorilla
12,908 stars
Gorilla is a foundational infrastructure framework for large language model function calling. It provides a system for training, evaluating, and executing the translation of natural language instructions into accurate API calls and executable code. The project integrates a structured API documentation index, a fine-tuning pipeline for model adaptation, and a secure sandboxed action runtime for executing model-generated commands. The framework distinguishes itself through a specialized evaluation benchmark suite that measures the accuracy, cost, and latency of function calls. It includes tools
Provides a suite of metrics and datasets to measure the accuracy, cost, and latency of model-driven function calling.
Python
