promptfoo is an evaluation framework for measuring the performance of large language model (LLM) prompts, agents, and retrieval-augmented generation (RAG) pipelines. It provides tools for comparative benchmarking and for running automated regression tests covering quality and security.
Its benchmarking suite runs identical prompts against different model providers so that output quality can be compared side by side. It also includes a dedicated red teaming tool that uses automated penetration testing to identify security vulnerabilities and prompt injection risks.
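To make the side-by-side comparison concrete, here is a minimal conceptual sketch in Python. It does not use promptfoo's own API or configuration format. The provider functions, the `compare` helper, and the `PROVIDERS` mapping are hypothetical stand-ins invented for illustration; a real setup would call actual model providers.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for real model providers: each maps a prompt to text.
Provider = Callable[[str], str]


def provider_upper(prompt: str) -> str:
    """Fake provider that answers by upper-casing the prompt."""
    return prompt.upper()


def provider_echo(prompt: str) -> str:
    """Fake provider that answers by echoing the prompt with a prefix."""
    return f"echo: {prompt}"


def compare(prompts: List[str], providers: Dict[str, Provider]) -> List[Dict[str, str]]:
    """Run every prompt through every provider; return one row per prompt."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, call in providers.items():
            # Same prompt, different provider: one column per provider.
            row[name] = call(prompt)
        rows.append(row)
    return rows


PROVIDERS = {"upper": provider_upper, "echo": provider_echo}
table = compare(["hello", "bye"], PROVIDERS)

for row in table:
    print(row)
```

Each row holds one prompt and one output per provider, which is the shape a side-by-side view needs. Keeping the prompt fixed across a row means differences between columns can be attributed to the provider rather than to the input.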
The framework supports declarative evaluation pipelines and metric-based scoring to quantify model reliability. These capabilities are designed to run inside continuous integration and deployment (CI/CD) workflows, where they catch regressions in model behavior before release. Results can be viewed in shared reports so that teams can review performance data and security findings together.
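The idea of metric-based scoring as a CI gate can be sketched as follows. This is an illustrative assumption about how such a gate might work, not promptfoo's implementation: `EvalCase`, `pass_rate`, `gate`, and the hard-coded outputs are all hypothetical names and data chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One evaluation case (hypothetical structure for this sketch)."""
    output: str                   # model output under test, hard-coded here
    check: Callable[[str], bool]  # assertion that the output must satisfy


def pass_rate(cases: List[EvalCase]) -> float:
    """Fraction of cases whose check passes; 0.0 for an empty list."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if case.check(case.output))
    return passed / len(cases)


def gate(cases: List[EvalCase], threshold: float) -> int:
    """Return a process-style exit code: 0 if the pass rate meets the threshold, else 1."""
    return 0 if pass_rate(cases) >= threshold else 1


CASES = [
    EvalCase("Paris is the capital of France.", lambda s: "Paris" in s),
    EvalCase("I cannot help with that.", lambda s: "password" not in s.lower()),
    EvalCase("The answer is 5.", lambda s: "4" in s),  # deliberately failing case
]

print(f"pass rate: {pass_rate(CASES):.2f}")
print(f"exit code at threshold 0.9: {gate(CASES, 0.9)}")
```

A CI job could pass the returned code to `sys.exit` so that a drop in pass rate fails the build. The threshold is a policy choice: a stricter value blocks more regressions but also fails more often on noisy model output.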