55 repositories
Tools for measuring the quality of model outputs using custom metrics and automated judges.
Distinguishing note: Focuses on programmatic evaluation of LLM pipelines, distinct from standard unit testing.
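As a rough illustration of what "custom metrics and automated judges" can mean in practice, the sketch below runs a pipeline over a small dataset and averages two scores. Every name in it (`Example`, `keyword_recall`, `stub_judge`, `toy_pipeline`) is hypothetical and not taken from any repository listed here, and the judge is a placeholder where a real setup would call a language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    prompt: str
    reference: str


def keyword_recall(output: str, reference: str) -> float:
    """Custom metric: fraction of reference words that appear in the output."""
    ref_words = set(reference.lower().split())
    out_words = set(output.lower().split())
    if not ref_words:
        return 1.0
    return len(ref_words & out_words) / len(ref_words)


def stub_judge(prompt: str, output: str) -> float:
    """Placeholder judge (assumption): a real judge would ask an LLM for a score."""
    return 1.0 if output.strip() else 0.0


def evaluate(
    pipeline: Callable[[str], str],
    dataset: List[Example],
    metric: Callable[[str, str], float],
    judge: Callable[[str, str], float],
) -> Dict[str, float]:
    """Run the pipeline on every example and average both scores."""
    if not dataset:
        return {"metric": 0.0, "judge": 0.0}
    metric_total = 0.0
    judge_total = 0.0
    for ex in dataset:
        output = pipeline(ex.prompt)
        metric_total += metric(output, ex.reference)
        judge_total += judge(ex.prompt, output)
    n = len(dataset)
    return {"metric": metric_total / n, "judge": judge_total / n}


def toy_pipeline(prompt: str) -> str:
    """Stand-in for an LLM pipeline; answers one question and stays silent otherwise."""
    return "Paris is the capital of France" if "France" in prompt else ""


dataset = [
    Example("What is the capital of France?", "Paris"),
    Example("What is the capital of Peru?", "Lima"),
]
scores = evaluate(toy_pipeline, dataset, keyword_recall, stub_judge)
print(scores)
```

Unlike a unit test, the result is an aggregate score over a dataset rather than a single pass/fail assertion, which is the distinction the note above draws.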
Explore 55 GitHub repositories matching Testing & Quality Assurance · LLM Evaluation.
Auto-GPT is an autonomous agent framework designed for creating and deploying AI agents that use large language models to plan and execute complex goals independently. The system provides a comprehensive environment for managing the entire agent lifecycle, from initial design and testing to live production deployment. The project features a low-code workflow designer that allows users to define agent behaviors by connecting functional blocks in a visual interface. It includes an agent marketplace for discovering and deploying pre-configured agent templates and a standardized evaluation tool t
Runs agents through a standardized testing environment to measure performance against objective benchmarks.
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
LlamaIndex evaluates application performance using standardized datasets and testing patterns to iteratively improve accuracy and reliability.
This repository is a collection of guides, notebooks, and recipes for implementing advanced prompting techniques and workflow patterns with large language models. It serves as a prompt engineering guide, an evaluation suite for scoring prompt quality, and a framework for orchestrating agents and integrating external tools. The project provides implementation patterns for building applications with Claude, specifically focusing on coordinating multiple models to split complex tasks between high-reasoning and high-efficiency agents. It includes technical demonstrations for multimodal data proce
Provides a suite for systematically testing and scoring prompt quality using model-based evaluation.
FastChat is a training and serving platform for large language models that provides an integrated toolkit for fine-tuning, hosting, and benchmarking chatbots. It functions as an inference server capable of hosting multiple models and exposing them via a standardized API for chat applications. The platform distinguishes itself through a distributed model controller that manages worker nodes and routes requests across a hardware-agnostic inference layer supporting various accelerators. It includes a dedicated evaluation framework for assessing model quality using automated judges, multi-turn di
Implements a system for assessing LLM quality using automated judges and human-driven side-by-side comparisons.
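One simple way to summarize side-by-side comparisons is a per-model win rate. The sketch below is an assumption for illustration only, not FastChat's actual scoring method; the vote format and the model names `model-x` and `model-y` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Vote = Tuple[str, str, str]  # (model_a, model_b, winner), where winner may be "tie"


def win_rates(votes: List[Vote]) -> Dict[str, float]:
    """Return each model's share of wins, counting a tie as half a win for both sides."""
    wins: Counter = Counter()
    games: Counter = Counter()
    for model_a, model_b, winner in votes:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "tie":
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "tie"),
    ("model-y", "model-x", "model-y"),
    ("model-x", "model-y", "model-x"),
]
rates = win_rates(votes)
print(rates)
```

The votes could come from human raters or from an automated judge; the aggregation is the same in both cases.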
DSPy is a declarative programming framework designed for building complex language model applications. It treats model interactions as modular, composable programs, allowing developers to define task logic through typed class schemas rather than relying on manually written prompts. By organizing workflows into hierarchical, reusable Python objects, the framework enables the construction of sophisticated AI systems that manage state and execution flow independently. The framework distinguishes itself through an automated optimization engine that iteratively refines prompt instructions and few-
Measures output quality using custom metrics and model-based judges to ensure consistent behavior across pipelines.
Provides a dedicated environment for evaluating LLM application performance through prediction functions and datasets.
Deepagents is an LLM agent orchestration platform and stateful application server designed for deploying and managing AI agents built with computational graphs. It provides a containerized runtime environment that handles agent execution, state persistence, and the versioning of AI assistants. The platform distinguishes itself through deep integration with the Model Context Protocol, allowing agents to function as servers that expose tools and capabilities to external clients. It features a sophisticated observability suite for capturing execution traces, performing LLM-based evaluations agai
Uses language models as automated judges to evaluate the quality and consistency of agent reasoning.
JARVIS is a system for large language model task orchestration, deployment management, and automation benchmarking. It utilizes a task orchestrator to decompose complex requests into actionable steps and coordinates various expert models to synthesize final responses. The project includes an AI model deployment manager to handle the local deployment of expert models across different hardware scales. It further provides an AI workflow API consisting of web endpoints used to trigger automated task workflows and retrieve results from model selection stages. The framework incorporates an automat
Evaluates the ability of large language models to automate complex tasks using standardized datasets and metrics.
This project is a community-driven knowledge repository and technical learning resource focused on the field of generative artificial intelligence. It serves as a centralized hub for developers and practitioners to access curated research, tutorials, and foundational concepts necessary for building and deploying modern artificial intelligence applications. The platform distinguishes itself through a collaborative, distributed contribution model that aggregates diverse learning materials into a structured, searchable knowledge base. It covers a wide range of specialized topics, including retri
Provides frameworks for rigorously evaluating the performance and safety of AI systems.
promptfoo is an evaluation framework for measuring the performance of large language model prompts, agents, and retrieval augmented generation pipelines. It provides a suite of tools for conducting comparative benchmarking and executing automated quality and security regressions. The system features a benchmarking suite for running identical prompts across different model providers to compare output quality side-by-side. It also includes a dedicated red teaming tool for identifying security vulnerabilities and prompt injection risks through automated penetration testing. The framework suppor
Provides a comprehensive framework for measuring LLM output quality using custom metrics and automated judges.
Comet LLM is an observability platform and evaluation framework designed for large language model applications and agentic workflows. It functions as a system for tracing, monitoring, and debugging execution flows while providing tools for prompt optimization and the enforcement of AI safety guardrails. The platform distinguishes itself through a combination of model-based scoring and heuristic metrics to quantify output quality and detect hallucinations. It includes a dedicated prompt and agent optimizer with an interactive playground for refining templates and tool configurations. For retri
Quantifies LLM pipeline quality using datasets, heuristic metrics, and automated judge scoring.
Evals is a framework designed for automating, managing, and executing repeatable benchmarking suites to analyze the quality and performance of language models. It provides a platform for running standardized tests to measure model accuracy and track behavioral changes over time. The system distinguishes itself through a modular architecture that uses a standardized adapter layer to normalize inputs and outputs, allowing different models to be swapped and tested interchangeably. It supports the creation of custom benchmarks using proprietary data, enabling quality assurance on sensitive tasks
Serves as a toolkit for building, running, and managing standardized benchmarks for large language models.
Opik is an observability and evaluation platform designed for generative AI applications and agentic workflows. It provides a centralized environment for tracing execution flows, managing prompt templates, and monitoring production performance, allowing teams to gain visibility into complex model interactions and tool usage without requiring manual application code changes. The platform distinguishes itself through its integrated approach to the AI development lifecycle, combining distributed trace instrumentation with automated evaluation frameworks. It supports model-as-a-judge scoring, syn
Runs automated tests against defined tasks using datasets and metrics to measure output quality and application behavior.
RagaAI-Catalyst is a suite of software implementation tools providing an SDK, dashboard, and platform for monitoring, debugging, red-teaming, and evaluating agentic AI workflows. It serves as an observability framework for tracing the execution paths of large language models and multi-agent systems. The project distinguishes itself through a security suite for automated red-teaming and vulnerability scanning to detect biases, alongside a centralized prompt registry that decouples templates from application code. It further provides an evaluation platform that combines synthetic data generatio
Measures the accuracy and reliability of LLM outputs using specialized metrics and automated judges.
Ragas is an evaluation framework and performance benchmark designed to quantify the quality of retrieval augmented generation pipelines. It functions as an application optimizer to identify bottlenecks in language model workflows using automated metrics and model-based scoring. The framework includes a system for generating synthetic datasets that mimic production scenarios and edge cases to create realistic test cases. It enables reference-free assessment, allowing the evaluation of response quality by analyzing grounding in the provided context without requiring gold-standard labels. The s
Provides a framework for measuring the quality of LLM outputs using automated judges and custom metrics.
Deepeval is a framework for testing and evaluating large language model applications. It provides a suite of tools for executing automated regression tests, validating model output quality against defined standards, and tracing the execution of complex agent workflows. By integrating these capabilities into development pipelines, the platform ensures consistent performance and reliability throughout the software lifecycle. The platform distinguishes itself through its focus on programmatic validation and observability. It utilizes secondary language models to score output quality and employs
Uses secondary language models to evaluate and quantify the quality of outputs from primary models against predefined criteria.
Archon is an artificial intelligence agent automation engine designed to orchestrate complex development workflows. It functions as a platform for chaining multi-step tasks into directed graphs, allowing developers to standardize and execute repeatable coding patterns through declarative configuration files. The system distinguishes itself by maintaining stateful context across long-running sessions and executing operations within isolated, containerized worktrees to prevent file interference. It integrates with external language models and provides a centralized registry for sharing and inst
Evaluates the quality of automated task outputs using programmatic test cases and metrics.
LitGPT is a training and deployment framework for large language models, providing a suite of tools for pretraining, finetuning, quantizing, evaluating, and serving models within a production environment. It includes a dedicated training pipeline for adapting pretrained models to specific tasks, a quantization tool for reducing weight precision, and an inference server for hosting models via web interfaces. The framework supports high-performance model development through custom architecture implementation and the use of predefined recipes to standardize pretraining and finetuning. It enables
Provides a benchmarking toolset for testing the generation quality and understanding of models against standardized datasets.
llm-universe is a structured learning resource and technical guide focused on the development of large language model applications. It serves as a curriculum for mastering model orchestration, the creation of autonomous conversational agents, and the implementation of retrieval-augmented generation systems. The project provides detailed instructions on connecting model APIs with memory and tools to create execution chains. It specifically covers the construction of retrieval pipelines, including the process of cleaning raw documents, generating embeddings, and integrating vector databases to
Provides tools and methods for measuring the quality of model outputs using custom metrics.
Gorilla is a foundational infrastructure framework for large language model function calling. It provides a system for training, evaluating, and executing the translation of natural language instructions into accurate API calls and executable code. The project integrates a structured API documentation index, a fine-tuning pipeline for model adaptation, and a secure sandboxed action runtime for executing model-generated commands. The framework distinguishes itself through a specialized evaluation benchmark suite that measures the accuracy, cost, and latency of function calls. It includes tools
Provides a suite of metrics and datasets to measure the accuracy, cost, and latency of model-driven function calling.
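A minimal sketch of how accuracy, cost, and latency could be measured together for function calling, assuming exact-match comparison of the predicted call and a flat per-call price. This is not Gorilla's benchmark code; `score_function_calls`, `toy_model`, and the case format are hypothetical.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict, List, Tuple

Call = Dict[str, Any]  # e.g. {"name": "get_weather", "arguments": {"city": "Oslo"}}


def score_function_calls(
    model: Callable[[str], Call],
    cases: List[Tuple[str, Call]],
    cost_per_call: float = 0.0,
) -> Dict[str, float]:
    """Score a model on exact-match call accuracy, mean latency, and total cost."""
    if not cases:
        return {"accuracy": 0.0, "mean_latency_s": 0.0, "total_cost": 0.0}
    correct = 0
    latencies = []
    for instruction, expected in cases:
        start = time.perf_counter()
        predicted = model(instruction)
        latencies.append(time.perf_counter() - start)
        if predicted == expected:
            correct += 1
    return {
        "accuracy": correct / len(cases),
        "mean_latency_s": mean(latencies),
        "total_cost": len(cases) * cost_per_call,
    }


def toy_model(instruction: str) -> Call:
    """Stand-in for an LLM; always emits the same weather call."""
    return {"name": "get_weather", "arguments": {"city": "Oslo"}}


cases = [
    ("What's the weather in Oslo?", {"name": "get_weather", "arguments": {"city": "Oslo"}}),
    ("Book a table for two", {"name": "book_table", "arguments": {"guests": 2}}),
]
report = score_function_calls(toy_model, cases, cost_per_call=0.01)
print(report)
```

Exact match is the strictest possible check; a real benchmark might instead compare parsed arguments or execute the call, which this sketch does not attempt.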