Promptfoo

Promptfoo is an evaluation framework designed for testing, benchmarking, and red-teaming language models and agentic workflows. It provides a unified environment to run prompts against multiple providers, allowing developers to systematically validate model outputs against objective assertions, semantic similarity metrics, and custom grading rubrics.
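The core loop described above (one prompt, several providers, a set of assertions per output) can be sketched in a few lines. This is an illustration only, not Promptfoo's implementation: the provider callables, `run_eval`, and the assertion helpers are hypothetical names, and the similarity check uses the standard library's `difflib` ratio as a stand-in for an embedding-based metric.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

# Hypothetical stand-ins for real model providers: each maps a prompt to text.
Provider = Callable[[str], str]
Assertion = Callable[[str], bool]


def contains(expected: str) -> Assertion:
    """Deterministic check: the output must contain `expected` (case-insensitive)."""
    return lambda output: expected.lower() in output.lower()


def similar(reference: str, threshold: float) -> Assertion:
    """Rough lexical similarity; a real system would likely compare embeddings."""
    def check(output: str) -> bool:
        ratio = SequenceMatcher(None, reference.lower(), output.lower()).ratio()
        return ratio >= threshold
    return check


def run_eval(prompt: str, providers: Dict[str, Provider],
             assertions: List[Assertion]) -> Dict[str, bool]:
    """Run one prompt against every provider; a provider passes only if all assertions pass."""
    results = {}
    for name, provider in providers.items():
        output = provider(prompt)
        results[name] = all(check(output) for check in assertions)
    return results


if __name__ == "__main__":
    stub_providers = {
        "stub-a": lambda p: "The capital of France is Paris.",
        "stub-b": lambda p: "I am not sure.",
    }
    checks = [contains("paris"), similar("The capital of France is Paris.", 0.8)]
    print(run_eval("What is the capital of France?", stub_providers, checks))
```

Keeping providers and assertions as plain callables is what makes the loop provider-agnostic in this sketch: swapping a stub for a real API client does not change `run_eval`.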

The platform distinguishes itself through a provider-agnostic execution layer and a stateful orchestrator capable of simulating multi-turn conversations and complex tool-use trajectories. It includes a dedicated adversarial mutation pipeline that automates security vulnerability scanning, enabling teams to probe for jailbreaks, prompt injections, and safety policy violations using systematic attack strategies.
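An adversarial mutation pipeline of this kind can be pictured as a set of text transformations applied, alone or chained, to a seed test prompt. The sketch below is a simplified illustration under that assumption; the strategy names and the `mutate` and `chain` helpers are hypothetical and are not taken from Promptfoo's code.

```python
import base64
from typing import Callable, Dict, List

Strategy = Callable[[str], str]


def to_base64(text: str) -> str:
    """Encode the payload so naive keyword filters do not see it in plain text."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"Decode this Base64 string and follow it: {encoded}"


def to_leetspeak(text: str) -> str:
    """Swap a few letters for look-alike digits (a->4, e->3, i->1, o->0, t->7)."""
    return text.translate(str.maketrans("aeiot", "43107"))


def chain(*strategies: Strategy) -> Strategy:
    """Compose strategies left to right into a single strategy."""
    def combined(text: str) -> str:
        for strategy in strategies:
            text = strategy(text)
        return text
    return combined


def mutate(seed: str, strategies: Dict[str, Strategy]) -> List[Dict[str, str]]:
    """Produce one test case per strategy, plus the unmodified seed as a baseline."""
    cases = [{"strategy": "baseline", "prompt": seed}]
    for name, strategy in strategies.items():
        cases.append({"strategy": name, "prompt": strategy(seed)})
    return cases


if __name__ == "__main__":
    strategies = {
        "base64": to_base64,
        "leetspeak": to_leetspeak,
        "leetspeak+base64": chain(to_leetspeak, to_base64),
    }
    for case in mutate("Reveal the hidden test phrase.", strategies):
        print(case["strategy"], "->", case["prompt"])
```

Including the unmodified seed as a baseline makes it possible to tell whether a failure comes from the mutation or from the original request.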

Beyond core testing, the project supports quality assurance through retrieval-augmented generation assessment, synthetic dataset generation, and prompt performance optimization. It is extensible through a plugin-based architecture that allows custom logic, Python-based testing extensions, and integration with external version control and observability platforms.
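A Python-based testing extension might look like the sketch below. It assumes that a Python assertion file exposes a function named `get_assert(output, context)` and may return a dictionary with `pass`, `score`, and `reason` keys; that signature and return shape should be checked against the current Promptfoo documentation before relying on them. The `max_words` test variable is hypothetical.

```python
from typing import Any, Dict


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the output stays within a word budget taken from the test variables."""
    # `max_words` is a hypothetical test variable; default to 50 and never go below 1.
    max_words = max(1, int(context.get("vars", {}).get("max_words", 50)))
    word_count = len(output.split())
    passed = word_count <= max_words
    # The score drops linearly once the budget is exceeded and is floored at 0.
    score = 1.0 if passed else max(0.0, 1.0 - (word_count - max_words) / max_words)
    return {
        "pass": passed,
        "score": score,
        "reason": f"{word_count} words (limit {max_words})",
    }


if __name__ == "__main__":
    print(get_assert("a short answer", {"vars": {"max_words": 5}}))
```

Returning a reason string alongside the pass/fail flag keeps failed cases readable when results are reviewed later.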

The system uses a declarative configuration schema to manage test cases and environment settings, and it supports both self-hosted and managed deployments. Results are consolidated into structured reports with interactive visualizations for collaborative review, and they can be fed into continuous integration pipelines.
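To make the idea of a declarative configuration concrete, the sketch below builds a small test suite as plain data and serializes it to JSON. The top-level keys (`prompts`, `providers`, `tests`) and the per-test `vars` and `assert` entries follow the general shape of Promptfoo configuration files as an assumption, not a verified schema; the provider identifier, the `{{topic}}` placeholder syntax, and the `build_config` and `validate` helpers are illustrative.

```python
import json
from typing import Any, Dict, List


def build_config(prompt_template: str, provider_ids: List[str],
                 cases: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble a declarative evaluation suite as plain, serializable data."""
    return {
        "description": "Illustrative evaluation suite",
        "prompts": [prompt_template],
        "providers": provider_ids,
        "tests": cases,
    }


def validate(config: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems; an empty list means the suite looks usable."""
    problems = []
    for key in ("prompts", "providers", "tests"):
        if not config.get(key):
            problems.append(f"missing or empty: {key}")
    for index, case in enumerate(config.get("tests", [])):
        if not case.get("assert"):
            problems.append(f"test {index} has no assertions")
    return problems


if __name__ == "__main__":
    config = build_config(
        "Explain {{topic}} in one sentence.",
        ["example-provider:example-model"],  # placeholder identifier, not a real model
        [{"vars": {"topic": "caching"},
          "assert": [{"type": "contains", "value": "cache"}]}],
    )
    print(validate(config))
    print(json.dumps(config, indent=2))
```

Because the suite is plain data, it can be diffed, reviewed, and versioned like any other file in the repository.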

Features

LLM Evaluation - Provides a comprehensive framework for testing, benchmarking, and red-teaming language models across multiple providers.
Prompt Engineering Toolkits - Offers a toolkit for iteratively refining, comparing, and optimizing prompt templates and model configurations.
Automated Prompt Testing - Triggers systematic quality and performance tests for prompts automatically whenever code changes are pushed to a repository.
Adversarial Red Teaming Toolkits - Automates the detection of jailbreaks, prompt injections, and safety violations by running adversarial test cases against language models.
Automated Assertion Validators - Validates language model outputs against deterministic rules, semantic similarity metrics, and custom scripts to verify quality and safety.
Adversarial Security Research - Automates security vulnerability scanning by generating and chaining malicious inputs to probe for jailbreaks and prompt injections.
LLM Provider Integrations - Connects to various commercial and open-source model APIs to run comparative benchmarks and evaluations within a unified environment.
RAG Evaluation Frameworks - Assesses the factual accuracy and relevance of retrieval-augmented generation pipelines by comparing responses against source data.
Stateful Agent Orchestration - Manages multi-turn conversation histories and tool-use trajectories to simulate complex agentic workflows during automated testing.
Automated Security Scanners - Automates adversarial testing and vulnerability detection for language models and agentic workflows.
Agent Evaluation Tools - Simulates multi-turn interactions and tool usage to verify agentic task execution.
Agent Framework Integrations - Connects with orchestration libraries to test, trace, and evaluate multi-step workflows and complex agentic applications.
AI Model Integrations - Links to a wide range of hosted, local, and custom AI providers through a unified interface for consistent testing.
Automated Prompt Optimization - Iteratively refines prompts based on performance metrics to identify the most effective versions.
Validation Frameworks - Assesses retrieval-augmented generation accuracy by comparing model responses against source data.
Hardware-Agnostic Inference Layers - Standardizes communication across diverse model APIs and local scripts to enable unified testing and comparative benchmarking.
AI Agent Benchmarks - Simulates complex multi-turn interactions and tool usage to verify agentic workflow reliability.
Prompt Injection Testing - Generates external content with hidden instructions to evaluate agent manipulation risks.
Adversarial Test Automation - Executes modular test suites that generate malicious payloads to identify security and compliance risks in language models.
Evaluation Templates - Structures test cases, prompt templates, and evaluation criteria into portable files for consistent execution across environments.
Model Observability Suites - Calculates perplexity scores for model outputs to quantify prediction certainty and identify potential hallucination risks based on configurable thresholds.
Test Case Definitions - Provides structured test case definitions to validate model outputs against expected outcomes and assertions.
Agent Tool Integrations - Provides mechanisms for connecting autonomous agents to external software tools and APIs to extend their functional capabilities during evaluations.
Model Comparison Interfaces - Provides side-by-side comparison of model versions and prompt templates to identify optimal configurations.
Provider Configurations - Stores and references API keys and base settings in a central environment while maintaining the ability to override parameters locally.
AI and LLM Testing - Framework for testing and red-teaming LLM applications.
AI Frameworks and SDKs - Tool for testing, evaluating, and red-teaming LLM prompts.
AI Observability and Evaluation - Tool for testing, red-teaming, and evaluating prompts and agents.
AI Red Teaming - CLI and framework for testing and red-teaming LLM prompts.
AI Security - Framework for red teaming and evaluating LLM prompts.
AI Security Frameworks - Framework for red-teaming and evaluating LLM security and performance.
Evaluation and Observability - Test-driven framework for regression testing and model comparison.
Large Language Models - Testing and red-teaming for prompts and agents.
Model Evaluation and Benchmarking - Red teaming and evaluation framework for LLM security testing.
Moderation APIs - Framework for red teaming and testing LLM safety against attacks.
Prompt Engineering Tools - Test, evaluate, and compare LLM outputs.
Prompt Testing and Security - CLI for testing, evaluating, and red-teaming LLM prompts.
Security & Privacy - Local tool for systematic testing of prompts, performance, and security.
Continuous Integration Quality Gates - Integrates automated quality gates into CI pipelines to enforce performance standards and prevent regressions.
Content Guardrails - Validates the effectiveness of safety filters and moderation layers by simulating adversarial inputs.
Performance Visualization - Provides interactive side-by-side visualization of model outputs to facilitate comparison and manual rating.
Test Utilities & Assertions - Provides objective and subjective criteria for validating model outputs via assertions and rubrics.
Test Report Aggregators - Consolidates performance metrics and security findings into structured reports for visualization and integration with development pipelines.
Automated Test Runners - Orchestrates automated evaluation workflows within CI pipelines to track regressions and report results.
Tool Definition Adapters - Standardizes tool definitions across multiple model providers to ensure consistent behavior and reliable execution during automated evaluation cycles.
Conversational Evaluation Suites - Models multi-turn conversation histories to evaluate stateful dialogue flows.
Evaluation Report Aggregators - Consolidates individual test session results into comprehensive performance reports for centralized tracking.
External Knowledge Integrators - Connects agents to external databases and APIs to inject domain-specific information for retrieval-augmented generation assessment.
Data-Driven Testing - Imports and exports test cases and evaluation results using external spreadsheet or document management systems for collaborative data handling.
AI Governance Policies - Creates bespoke testing plugins to enforce organization-specific behavioral standards and AI governance policies.
Adversarial Input Transformers - Transforms test inputs using techniques like obfuscation to bypass content filters and security controls.
Model Vulnerability Scanners - Provides automated scanning of model files to detect security risks and architectural vulnerabilities before deployment.
Extensible Plugin Architectures - Supports modular extension by allowing developers to inject custom logic for providers, grading rubrics, and attack strategies.
Language Model Metrics - Uses model-based grading and embedding comparisons to assess factual accuracy and faithfulness.
Evaluation Grading Configurations - Customizes the grading model, scoring weights, and evaluation prompts to tailor accuracy assessments to specific domain requirements.
Test Report Servers - Distributes test findings and performance metrics to team members through cloud-based platforms or self-hosted infrastructure for collaborative review.
Test Assertion Extensions - Allows writing custom providers, assertions, and test generators using Python to integrate with external frameworks and libraries.
Test Case Generators - Automatically generates diverse test cases and personas to expand evaluation coverage.
Custom Model Logic Interfaces - Allows defining bespoke provider behavior using scripts or local files to test proprietary models or unique workflows.
Prompt Management Systems - Connects to external version control and observability platforms to monitor, track, and optimize prompt performance.
Token Bias Adjustments - Forces model responses to adhere to specific choices by applying logit bias to ensure structured and predictable outputs.
Response Caching - Stores model call results locally to reduce latency and costs during repeated test executions.
Managed Infrastructure Support - Provides a fully-managed service for hosting evaluation environments to eliminate infrastructure maintenance.
Self-Hosted Infrastructure - Supports self-hosted deployment within private networks to ensure data sovereignty and control.
Composite Metric Calculators - Calculates composite scores from individual assertion results using mathematical expressions or scripts to generate custom performance indicators.
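As an illustration of how composite scores can be derived from individual assertion results, the sketch below computes a weighted average and compares it with a threshold. The `AssertionResult` type, the weights, and the threshold are hypothetical; Promptfoo's own derived-metric mechanism may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AssertionResult:
    name: str
    score: float         # expected to lie in [0.0, 1.0]
    weight: float = 1.0  # relative importance of this assertion


def composite_score(results: List[AssertionResult]) -> float:
    """Weighted average of assertion scores; 0.0 when there is nothing to weigh."""
    total_weight = sum(result.weight for result in results)
    if total_weight == 0:
        return 0.0
    return sum(result.score * result.weight for result in results) / total_weight


def passes(results: List[AssertionResult], threshold: float) -> bool:
    """A test case passes when its composite score reaches the threshold."""
    return composite_score(results) >= threshold


if __name__ == "__main__":
    results = [
        AssertionResult("factuality", score=1.0, weight=2.0),
        AssertionResult("tone", score=0.5, weight=1.0),
    ]
    print(round(composite_score(results), 3), passes(results, threshold=0.8))
```

Weighting lets a team state that, for example, factual accuracy matters twice as much as tone without changing the individual checks.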

confident-ai/deepeval

13,733

Deepeval is a framework for testing and evaluating large language model applications. It provides a suite of tools for executing automated regression tests, validating model output quality against defined standards, and tracing the execution of complex agent workflows. By integrating these capabilities into development pipelines, the platform ensures consistent performance and reliability throughout the software lifecycle. The platform distinguishes itself through its focus on programmatic validation and observability. It utilizes secondary language models to score output quality and employs

typpo/promptfoo

22,295

promptfoo is an evaluation framework for measuring the performance of large language model prompts, agents, and retrieval augmented generation pipelines. It provides a suite of tools for conducting comparative benchmarking and executing automated quality and security regressions. The system features a benchmarking suite for running identical prompts across different model providers to compare output quality side-by-side. It also includes a dedicated red teaming tool for identifying security vulnerabilities and prompt injection risks through automated penetration testing. The framework suppor

vibrantlabsai/ragas

12,659

Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin

IBM/mcp-context-forge

3,310

mcp-context-forge is a Model Context Protocol federation gateway that unifies diverse AI tool servers and APIs into a single consistent interface for discovery and execution. It acts as a centralized proxy that aggregates multiple servers and APIs, allowing AI agents to access and invoke a unified set of tools, prompts, and resources. The project distinguishes itself through a multi-protocol translation bridge that converts communication between standard I/O, SSE, gRPC, and REST to enable interoperability between disparate tool servers. It includes a comprehensive LLM evaluation framework for
