Evidently

Evidently is an AI observability platform and evaluation framework for quantifying the performance of machine learning models and large language models. It monitors tabular datasets for data drift and quality degradation, and provides a specialized analyzer for the faithfulness and correctness of retrieval-augmented generation (RAG) systems.
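
The drift check described above can be illustrated with a minimal standard-library sketch. It is not Evidently's API: the function names, the choice of a two-sample Kolmogorov-Smirnov statistic, and the 0.3 threshold are all assumptions made for the example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def is_drifted(reference, current, threshold=0.3):
    """Flag drift when the KS statistic exceeds a threshold (0.3 is an assumed value)."""
    return ks_statistic(reference, current) > threshold


# Hypothetical data: a reference column and a production column shifted by 50.
reference = list(range(100))
shifted = [value + 50 for value in reference]

print(is_drifted(reference, reference))  # identical distributions
print(is_drifted(reference, shifted))    # shifted distribution
```

A production tool would typically pick the statistical test per column type and report a p-value or distance per feature; this sketch covers a single numeric column only.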

The project distinguishes itself through an evaluation framework that uses judge models and custom rubrics to score language model outputs. It also includes tools for iterative prompt optimization and for generating synthetic test datasets, including adversarial inputs for risk and brand safety testing.
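
As a rough illustration of rubric-based judging, the sketch below formats a rubric into a grading prompt and parses a PASS/FAIL verdict. It is not Evidently's API: the prompt template, `judge_output`, and `fake_judge` are hypothetical, and `fake_judge` is a keyword-based stand-in for a real model call so the example runs offline.

```python
from typing import Callable

# Hypothetical grading prompt; a real rubric would be tuned to the use case.
RUBRIC_PROMPT = (
    "You are grading an answer.\n"
    "Rubric: {rubric}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with exactly one label: PASS or FAIL."
)


def judge_output(question: str, answer: str, rubric: str,
                 call_judge: Callable[[str], str]) -> bool:
    """Ask a judge model for a verdict and return True when it says PASS."""
    prompt = RUBRIC_PROMPT.format(rubric=rubric, question=question, answer=answer)
    verdict = call_judge(prompt).strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Unexpected judge verdict: {verdict!r}")
    return verdict == "PASS"


def fake_judge(prompt: str) -> str:
    """Stand-in for a model call (assumption): fails any answer that mentions a refund."""
    answer_line = next(line for line in prompt.splitlines() if line.startswith("Answer:"))
    return "FAIL" if "refund" in answer_line.lower() else "PASS"


rubric = "The answer must not promise money back."
safe = judge_output("Can I return this?", "Please contact support.", rubric, fake_judge)
risky = judge_output("Can I return this?", "Yes, you will get a full refund.", rubric, fake_judge)
print(safe, risky)
```

Swapping `fake_judge` for a function that calls an actual language model keeps the rest of the flow unchanged, which is the main reason for passing the judge in as a callable.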

The platform also covers real-time telemetry tracing for AI workflows, automated quality assurance via CI/CD integration, and performance trend tracking. It provides visual dashboards for reporting and a threshold-based alerting system that notifies users when quality metrics cross predefined limits.
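
Threshold-based alerting of this kind can be sketched in a few lines. The `Threshold` class, `check_thresholds` function, metric names, and limits below are hypothetical and only illustrate the idea; they are not taken from Evidently.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    limit: float
    direction: str  # "below" alerts when value < limit, "above" when value > limit


def check_thresholds(metrics: dict, thresholds: list) -> list:
    """Return one alert message per metric that crosses its limit."""
    alerts = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is None:
            continue  # metric was not reported in this run
        if threshold.direction == "below":
            breached = value < threshold.limit
        else:
            breached = value > threshold.limit
        if breached:
            alerts.append(
                f"{threshold.metric}={value} is {threshold.direction} limit {threshold.limit}"
            )
    return alerts


# Hypothetical metric values from one evaluation run.
metrics = {"accuracy": 0.82, "drift_share": 0.4}
thresholds = [
    Threshold("accuracy", 0.9, "below"),
    Threshold("drift_share", 0.5, "above"),
]
alerts = check_thresholds(metrics, thresholds)
print(alerts)
```

The same check can serve as a pass/fail gate in a CI/CD job by failing the build whenever the returned list is non-empty.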

Users can deploy a local workspace to manage projects and reports or use a no-code interface to configure evaluation workflows.

Features

AI Observability Tracing - Captures real-time inputs and outputs from AI applications to analyze execution paths and debug system behavior.
LLM Execution Tracing - Captures model-specific telemetry, including prompts and completions, to reconstruct execution paths and debug AI workflows.
AI Evaluation Frameworks - Provides a comprehensive framework for automating the assessment of AI outputs and reasoning quality.
Model Evaluation Metrics - Calculates quantitative performance and quality metrics for trained machine learning models.
Performance Metrics - Computes statistical performance indicators and dataframes by comparing production data against reference datasets to detect drift.
RAG Evaluation Frameworks - Assesses RAG system performance using specialized metrics for groundedness, faithfulness, and retrieval relevance.
RAG Performance Metrics - Measures the relationship between queries, retrieved contexts, and generated outputs to ensure RAG faithfulness.
Production Evaluation Strategies - Runs scheduled batch evaluations on live production traffic to monitor for drift and quality loss.
Dataset Comparators - Calculates distribution shifts by performing statistical tests between baseline datasets and live production data.
Data Quality Monitors - Monitors data health and quality metrics for tabular datasets to ensure model reliability in production.
Data Drift Detectors - Identifies statistical shifts between reference and production datasets to detect data drift in ML models.
Model Health Monitors - Provides monitoring for ML-specific metrics like data drift and prediction accuracy to ensure model health in production.
LLM-As-A-Judge Scoring - Implements an LLM-as-a-judge mechanism that uses language models to score and classify outputs based on custom rubrics.
LLM Evaluation - Measures the quality and correctness of language model responses using automated judges and specialized metrics.
Automated Prompt Optimization - Iteratively refines model instructions and few-shot examples based on quantitative performance metrics.
Custom Evaluation Judges - Allows the definition of specialized evaluation logic and rubrics using prompt templates to score text outputs.
RAG Evaluation Dataset Generation - Generates question-and-answer pairs from knowledge sources to evaluate the effectiveness of retrieval-augmented generation systems.
Synthetic Data Generators - Creates structured synthetic datasets using language models to facilitate testing and cold-starting AI applications.
Evaluation Datasets - Organizes and manages testing and production datasets to create structured benchmark cases for model evaluation.
Evaluation Report Aggregators - Consolidates individual AI assessment results into visual dashboards and comprehensive performance reports.
Evaluation Visualizers - Provides dashboards for exploring and comparing AI evaluation datasets and performance metrics.
Performance Trend Visualizations - Stores and visualizes evaluation results over time to compare changes across dataset versions and prompts.
Evaluation Threshold Gates - Applies threshold-based decision logic to determine pass/fail status for AI evaluation scores.
Prompt Optimizers - Provides tools for iteratively refining and testing prompts through systematic evaluation and comparison of model responses.
Model Prediction Evaluation - Compares model predictions against ground truth labels to calculate accuracy for classification and regression tasks.
Data Quality Profilers - Analyzes tabular datasets for missing values and descriptive statistics to ensure input data integrity.
Metric Preset Templates - Bundles groups of related metrics into preset templates to standardize the analysis of tabular and generative data.
CI/CD Pipeline Integrations - Integrates automated output testing into CI/CD pipelines to validate model quality and prevent performance regressions.
Adversarial Safety Tests - Runs scenario-based risk tests with adversarial inputs to identify vulnerabilities regarding forbidden topics and brand safety.
Adversarial Input Generation - Generates adversarial inputs and edge-case scenarios to perform safety evaluations and brand risk stress-testing on AI models.
Alert Thresholds - Triggers automated notifications when AI quality scores cross predefined safety or performance thresholds.
Performance Reporting - Transforms drift scores and metrics into standalone visual dashboards and reports for performance analysis.
Performance Trend Analysis - Analyzes performance patterns and system health trends across different datasets via a monitoring dashboard.
Real-Time Monitoring Dashboards - Ships real-time monitoring dashboards that track evaluation results and trigger alerts on performance violations.
Testing & Quality Assurance - Executes quality checks and triggers alerts when AI performance falls below predefined thresholds.
Automated Agent Quality Assurance - Integrates automated evaluation suites and regression tests into CI/CD pipelines to validate AI model updates.
AI Regression Testing Suites - Automates regression testing for LLM behavior against quality standards within CI/CD pipelines.
Regression Testing Suites - Groups evaluations into conditional suites to track system stability and prevent regressions over time.
Application Development - Framework for monitoring and testing ML and LLM systems.
General Machine Learning - Tool for analyzing ML models during validation and production.
LLM Observability and Evaluation - Framework for evaluating and monitoring ML and LLM systems.
Machine Learning Operations - Tool for analyzing and monitoring data and model drift.
Model Evaluation and Benchmarking - Framework for evaluating, testing, and monitoring ML and LLM systems.
Monitoring and Drift - Monitoring and evaluation for ML models in production.
Observability And Monitoring - Observability framework for tracking machine learning and language models.
Data Validation - Monitors and evaluates machine learning models in production.
Data Visualization - Generates interactive reports for machine learning model validation.
Observability and Evaluation - Evaluation and monitoring framework for ML and LLM systems.
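
Several of the features above come down to simple column-level statistics. As one example, the data quality profiling entry (missing values and descriptive statistics for tabular data) can be sketched with the standard library alone. The function names and the row-as-dict layout are assumptions for illustration, not Evidently's API.

```python
from statistics import mean


def profile_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [value for value in values if value is not None]
    missing = len(values) - len(present)
    return {
        "count": len(values),
        "missing": missing,
        "missing_share": missing / len(values) if values else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present else None,
    }


def profile_table(rows):
    """Profile every column found in a list of row dictionaries."""
    columns = sorted({key for row in rows for key in row})
    # A key absent from a row counts as a missing value for that column.
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical input rows with one explicit None and one absent key.
rows = [
    {"age": 30, "income": 100.0},
    {"age": None, "income": 200.0},
    {"age": 50},
]
report = profile_table(rows)
print(report["age"]["missing"], report["income"]["mean"])
```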

vibrantlabsai/ragas

12,659 stars on GitHub

Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin

Giskard-AI/giskard

5,434 stars on GitHub

Giskard is an evaluation framework, testing library, and quality monitoring system for large language models and AI agents. It serves as a toolkit for quantifying model performance and reliability, providing specialized capabilities for validating retrieval-augmented generation pipelines. The project distinguishes itself through an automated red teaming tool and security scanner designed to identify vulnerabilities, prompt injections, and safety risks. It utilizes adversarial probing and synthetic edge case generation to quantify model robustness and detect information disclosure. The platfo

Arize-ai/phoenix

8,605 stars on GitHub

Arize Phoenix is an LLM observability platform and evaluation framework designed to capture execution traces and monitor large language model applications. It serves as a prompt management system for versioning and testing templates, and as a self-hosted AI operations infrastructure for managing telemetry and experiments. The platform differentiates itself through a specialized embedding visualization tool used to detect data drift and optimize vector search. It provides a comprehensive evaluation suite that utilizes judge-based evaluators and ground-truth datasets to score model outputs, and

Open-source alternatives to Evidently

vibrantlabsai/ragas

Giskard-AI/giskard

Arize-ai/phoenix

mlflow/mlflow
