Promptfoo | Awesome Repository

Promptfoo is an evaluation framework designed for testing, benchmarking, and red-teaming language models and agentic workflows. It provides a unified environment to run prompts against multiple providers, allowing developers to systematically validate model outputs against objective assertions, semantic similarity metrics, and custom grading rubrics.

The platform distinguishes itself through a provider-agnostic execution layer and a stateful orchestrator capable of simulating multi-turn conversations and complex tool-use trajectories. It includes a dedicated adversarial mutation pipeline that automates security vulnerability scanning, enabling teams to probe for jailbreaks, prompt injections, and safety policy violations using systematic attack strategies.

Beyond core testing, the project supports comprehensive quality assurance through retrieval-augmented generation assessment, synthetic dataset generation, and prompt performance optimization. It offers extensive extensibility through a plugin-based architecture, allowing for custom logic, Python-based testing extensions, and integration with external version control and observability platforms.

The system utilizes a declarative configuration schema to manage test cases and environment settings, supporting both self-hosted and managed infrastructure deployments. Results are consolidated into structured reports with interactive visualizations to facilitate collaborative review and integration into continuous integration pipelines.

Features

LLM Evaluation - Provides a comprehensive framework for testing, benchmarking, and red-teaming language models across multiple providers.
Prompt Engineering Toolkits - Offers a toolkit for iteratively refining, comparing, and optimizing prompt templates and model configurations.
Automated Prompt Testing - Evaluation & Testing triggers systematic quality and performance tests for prompts automatically whenever code changes are pushed to a repository.
Adversarial Red Teaming Toolkits - Automates the detection of jailbreaks, prompt injections, and safety violations by running adversarial test cases against language models.

Features

LLM Evaluation - Provides a comprehensive framework for testing, benchmarking, and red-teaming language models across multiple providers.
Prompt Engineering Toolkits - Offers a toolkit for iteratively refining, comparing, and optimizing prompt templates and model configurations.
Automated Prompt Testing - Evaluation & Testing triggers systematic quality and performance tests for prompts automatically whenever code changes are pushed to a repository.
Adversarial Red Teaming Toolkits - Automates the detection of jailbreaks, prompt injections, and safety violations by running adversarial test cases against language models.