Promptfoo

promptfoo is an evaluation framework for measuring the performance of large language model prompts, agents, and retrieval augmented generation pipelines. It provides a suite of tools for conducting comparative benchmarking and executing automated quality and security regressions.

The system features a benchmarking suite for running identical prompts across different model providers to compare output quality side-by-side. It also includes a dedicated red teaming tool for identifying security vulnerabilities and prompt injection risks through automated penetration testing.

The framework supports declarative evaluation pipelines and metric-based scoring to quantify model reliability. These capabilities are designed for integration into continuous integration and deployment workflows to prevent regressions in model behavior. Results can be visualized in shared reports to facilitate team reviews of performance data and security findings.

Features

Prompt Evaluation Tools - Offers a comprehensive toolkit for comparing output quality across different prompt variations and models to identify the most effective instructions.

LLM Evaluation - Provides a comprehensive framework for measuring LLM output quality using custom metrics and automated judges.

AI Model Benchmarking - Provides frameworks for running standardized tests to compare the performance and reliability of different LLM providers.

Scoring Pipelines - Implements scoring pipelines that apply algorithmic checks to quantify model quality and detect inaccuracies.

Model Comparison Interfaces - Enables side-by-side visual and analytical comparison of outputs from different LLM providers.

Model Benchmarking Suites - Conducts comparative analysis of model accuracy and reasoning using standardized datasets across providers.

Provider-Agnostic Model Interfaces - Standardizes inputs and outputs across different large language models to enable side-by-side performance comparisons.

RAG Evaluation Frameworks - Offers specialized frameworks for assessing RAG-specific metrics like groundedness and retrieval relevance.

AI Red Teaming - Evaluates and probes vulnerabilities in language models through automated red teaming and penetration testing.

Automated Prompt Testing - Provides a framework for integrating prompt evaluation and data-driven quality checks into continuous integration pipelines.

Adversarial Red Teaming Toolkits - Provides specialized toolkits for generating adversarial prompts to test for security bypasses and injections.

Automated Agent Quality Assurance - Integrates automated model behavior checks into CI/CD pipelines to ensure quality and prevent regressions.

Automated Assertion Validators - Provides a framework for validating LLM outputs against programmatic assertions and predefined quality metrics.

CI/CD Pipeline Integrations - Integrates evaluation runs into CI/CD pipelines to block deployments when model performance fails thresholds.

Automation Pipelines - Automates the execution of quality checks and security scans within the software delivery pipeline.

Vulnerability Scanning Utilities - Includes tools for performing automated vulnerability assessments and red teaming on AI pipelines.

Application Development - Tool for testing, evaluating, and comparing LLM outputs.

typpopromptfoo

Features

Star history