Phoenix

Arize Phoenix is an LLM observability platform and evaluation framework designed to capture execution traces and monitor large language model applications. It serves as a prompt management system for versioning and testing templates, and as a self-hosted AI operations infrastructure for managing telemetry and experiments.
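To make the prompt management idea concrete, here is a minimal sketch of a versioned template store that renders a chosen version with runtime variables. It uses only the Python standard library and is not Phoenix's API: `PromptRegistry`, `save`, and `render` are hypothetical names chosen for illustration.

```python
from string import Template
from typing import Optional


class PromptRegistry:
    """Hypothetical in-memory store mapping a prompt name to its versions."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}

    def save(self, name: str, template: str) -> int:
        """Append a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])

    def render(self, name: str, version: Optional[int] = None, **variables: str) -> str:
        """Fill $placeholders in the chosen version (latest by default)."""
        versions = self._versions[name]
        template = versions[-1] if version is None else versions[version - 1]
        return Template(template).substitute(**variables)


registry = PromptRegistry()
v1 = registry.save("summarize", "Summarize: $text")
v2 = registry.save("summarize", "Summarize in $style style: $text")

# Application code asks for a prompt by name, so editing the template
# does not require changing or redeploying the caller.
latest = registry.render("summarize", text="Phoenix traces LLM calls.", style="plain")
pinned = registry.render("summarize", version=1, text="Phoenix traces LLM calls.")
```

Pinning a version, as in the last call, is one way an application could keep a known-good prompt while newer versions are still being tested.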

The platform differentiates itself through an embedding visualization tool used to detect data drift and tune vector search. Its evaluation suite scores model outputs with judge-based evaluators and ground-truth datasets, and its RAG troubleshooting tools let users inspect the documents returned by retrieval.
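As a rough illustration of ground-truth scoring, the sketch below compares model outputs with reference answers using an exact-match check and averages the result. It is standard-library Python, not Phoenix's evaluation API; `Example`, `exact_match`, and `score_dataset` are hypothetical names, and the sample rows are made up.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One dataset row: the model's output and the reference answer."""
    output: str
    reference: str


def exact_match(output: str, reference: str) -> float:
    """Return 1.0 when the texts match after trimming and case-folding."""
    same = output.strip().casefold() == reference.strip().casefold()
    return 1.0 if same else 0.0


def score_dataset(examples: list[Example]) -> float:
    """Average the per-example scores; an empty dataset scores 0.0."""
    if not examples:
        return 0.0
    return sum(exact_match(e.output, e.reference) for e in examples) / len(examples)


# Made-up rows for illustration only.
examples = [
    Example(output="Paris", reference="paris"),
    Example(output="Berlin ", reference="Berlin"),
    Example(output="Rome", reference="Madrid"),
]
accuracy = score_dataset(examples)  # two of the three rows match
```

A judge-based evaluator would have the same shape, with `exact_match` swapped for a function that asks a model to grade the output.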

Capabilities cover the entire development lifecycle, including automated output validation, systematic performance benchmarking, and prompt optimization. The system also incorporates security and access controls, such as role-based access and sensitive data masking, alongside collaborative workspaces for sharing observability data.
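Sensitive data masking can be pictured as a redaction pass over trace attributes before they are stored. The following is a minimal sketch under that assumption, not how Phoenix implements masking: the two regular expressions, the `mask` and `mask_attributes` helpers, and the attribute keys are all illustrative.

```python
import re

# Hypothetical patterns; a real deployment would need broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{8,}\b")


def mask(text: str) -> str:
    """Replace e-mail addresses and key-like tokens with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return API_KEY.sub("[API_KEY]", text)


def mask_attributes(attributes: dict[str, object]) -> dict[str, object]:
    """Return a copy of the attributes with every string value masked."""
    return {k: mask(v) if isinstance(v, str) else v for k, v in attributes.items()}


# Illustrative span attributes; the keys are assumptions for this sketch.
span_attributes = {
    "input.value": "Contact jane.doe@example.com using key sk-abc123DEF456",
    "llm.token_count.total": 42,
}
masked = mask_attributes(span_attributes)
```

Returning a copy rather than editing in place keeps the unmasked values available to the running application while only the redacted version is exported.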

The platform can be deployed locally via a CLI or notebook, or scaled through Docker and Kubernetes.

Features

LLM Evaluation Frameworks - Provides a comprehensive framework for running systematic experiments and judge-based scoring to measure model accuracy.
LLM Evaluation - Provides a framework for running systematic experiments and scoring model outputs using judge-based evaluators.
Model Provider Integrations - Provides unified interfaces to automatically collect telemetry from various model providers and orchestration frameworks.
AI Observability Tracing - Records detailed execution steps from providers and SDKs to provide visibility into AI pipelines.
Prompt Evaluation Tools - Provides utilities for comparing output quality across different prompt iterations to identify optimal instructions.
Automated Model Judges - Employs automated model judges to generate structured scores and explanations for model responses.
Automated Output Evaluation - Runs custom evaluations against responses to measure accuracy and quality through instrumented traces.
Evaluation Datasets - Creates structured collections of inputs and outputs specifically for benchmarking and evaluating model performance.
Dataset Curation - Collects production or manual inputs and reference outputs to establish performance baselines.
Experiment Management Interfaces - Provides an interface to execute pipelines against datasets and manage the resulting evaluation experiments.
Model Experiment Execution - Executes tasks against datasets and applies evaluators to compare performance across model or prompt iterations.
Embedding Visualizations - Maps high-dimensional vector representations visually to detect data drift and optimize similarity search.
Prompt Experimentation - Conducts controlled experiments to compare different prompt strategies and model configurations against test cases.
Version Trackers - Provides a centralized system for versioning prompt templates, decoupling iteration from code deployments.
LLM Observability - Captures execution traces and monitors large language model applications using OpenTelemetry and OpenInference standards.
Prompt Optimizers - Improves model output quality by rapidly iterating on and optimizing prompts sent to the model.
Embedding Visualizations - Provides a visual analyzer for mapping high-dimensional embeddings to detect data drift and optimize vector search.
Model Behavior Evaluation - Assesses response quality and tool usage to detect hallucinations and validate model behavior.
Prompt Engineering Environments - Offers collaborative workspaces for versioning, testing, and optimizing prompt templates without redeploying code.
Prompt Management Systems - Offers a centralized environment for versioning, testing, and deploying prompt templates to decouple iteration from code.
Prompt Templates - Manages prompt templates using names, versions, and tags to decouple iteration from code deployments.
RAG Debugging Tools - Provides tools to upload knowledge base corpora and inferences to troubleshoot bugs in retrieval-augmented generation.
RAG Troubleshooting and Analysis - Provides specialized tools for inspecting retrieval documents and visualizing embeddings to debug knowledge base performance.
Retrieval Inspection Tools - Examines documents, scores, and embedding text used during retrieval to validate search strategies and quality.
Observability and Tracing - Records end-to-end execution flows of agents and chains to analyze performance and debug failures.
Run Comparison Tools - Runs experiments with identical inputs to compare how prompt or logic changes affect model performance.
Embedding Visualizers - Visualizes high-dimensional vector representations to detect data drift and identify performance shifts.
Prompt Playgrounds - Provides an interactive playground for experimenting with prompt variations, model selection, and parameters.
Automated Trace Evaluation - Scores recorded execution traces using judge-based evaluators to measure overall application quality.
Automatic Tracing Instrumentation - Provides automated instrumentation for capturing execution traces and performance metrics within AI frameworks.
LLM Execution Tracing - Captures and visualizes the full execution context, including prompts and tool calls, for LLM pipelines.
Execution Tracing - Records the lifecycle of a single run as a series of spans to visualize the application flow.
Tracing Infrastructure Deployment - Provides the infrastructure for deploying and scaling trace collection and storage systems via Docker and Kubernetes.
Prompt and Agent Versioning - Tracks iterations of prompts to analyze performance shifts and maintains a history of changes.
Observability Instrumentation - Implements OpenTelemetry and OpenInference standards to instrument LLM pipelines for distributed tracing.
OpenTelemetry Standard Integrations - Uses OpenTelemetry standards to collect execution traces and metadata for cross-provider compatibility.
Observability Tracing - Records execution spans and traces across distributed LLM components to visualize complex model workflows.
Step-Level Tracing - Records every prompt, retrieval, and tool call in a sequence to pinpoint exact failure points in a pipeline.
LLM-As-A-Judge Scoring - Employs a high-capability language model as a judge to score outputs based on faithfulness and relevance.
Prompt Configuration Testing - Validates specific prompt configurations and parameters against datasets to compare performance and output formats.
AI-Assisted Trace Analysis - Uses an integrated agent to analyze traces and iterate on prompts based on observed data.
Deterministic Evaluators - Validates outputs against objective criteria using exact matches, regular expressions, and statistical scores.
Evaluator Development - Supports the development of custom classifiers and scorers to detect faithfulness and relevance in model responses.
Trace-to-Dataset Converters - Extracts captured trace data from the store and converts it into datasets for external evaluation scripts.
Automated Dataset Evaluation - Executes automated evaluators against structured benchmark datasets to validate model outputs during experiments.
Scoring Pipelines - Scores outputs using pre-built modular validation functions for faithfulness and relevance.
Ground-Truth Scoring - Assesses response quality by comparing model outputs against gold-standard ground-truth datasets.
Prompt Review Workflows - Tracks and approves prompt edits using a diff-based review system for collaborative authoring.
Human-in-the-Loop Systems - Integrates human oversight into the observability workflow by allowing manual feedback and auditing of traces.
Evaluation Execution Tracers - Records the internal execution steps and reasoning paths of evaluators to validate the decisions made by judge models.
Evaluation Trace Analyzers - Captures and logs the reasoning behind evaluator scores to help developers debug why specific model outputs were graded as such.
Evaluator Authoring - Allows users to draft and refine judge-based evaluators to measure performance against specific datasets.
Model Invocation Replays - Allows modifying inputs of specific captured calls to determine if different parameters improve model outcomes.
Prompt Synchronization APIs - Synchronizes versioned prompt templates into application code via SDKs to ensure consistency across deployments.
Training Data Curators - Cleans and labels datasets to create high-quality representative samples for model fine-tuning.
AI Development Dataset Management - Organizes and manages datasets specifically tailored for LLM testing, refining, and fine-tuning.
Execution Flow Visualizations - Visualizes detailed execution steps of a system to understand the internal flow of requests.
Experiment Run Grouping - Organizes specific execution runs, such as edge cases or failures, into datasets for targeted analysis.
Prompt Template Injection - Renders templates for specific providers by injecting runtime variables into placeholders.
Trace Replay Playgrounds - Tests alternative prompts by loading recorded steps from multi-step chains into an interactive playground.
Local Development Servers - Runs a local server via CLI or notebook to collect and visualize execution traces during development.
Prompt Version Trackers - Correlates prompt changes with performance gains by tracking templates and versions during execution.
Kubernetes Deployments - Provides Helm charts and configurations for deploying and scaling the platform on Kubernetes clusters.
Private Infrastructure Hosting - Supports deploying services within private networks and local containers to ensure data privacy.
Sandboxed Execution Environments - Executes custom code-based evaluators within secure sandboxes to ensure kernel-level isolation.
Self-Hosted AI Platforms - Provides a deployable backend for managing AI telemetry and experiments within private clouds using Docker or Kubernetes.
Self-Hosted Deployment Platforms - Supports the deployment of the platform on private or air-gapped networks for total data sovereignty.
Content Guardrails - Visualizes input and output filters within traces to monitor where malicious content was blocked by guardrails.
Data Masking - Automatically redacts confidential information from execution traces to ensure privacy and security compliance.
Performance Benchmarking - Scores model outputs based on cost, latency, and performance to ensure quality before deployment.
Trace Annotation - Allows users to attach qualitative feedback, human labels, and scores to specific spans within an execution trace.
Trace Metadata - Attaches custom attributes and tags to traces to enable advanced filtering and analysis.
Manual Span Definition - Adds manual tracing points to code using decorators and wrappers for custom component instrumentation.
Root Cause Analysis - Diagnoses the source of errors by analyzing failing traces against a checklist of common failure modes.
Session Tracking - Provides tools for grouping execution traces into logical user sessions to analyze sequential interactions.
Performance Analysis - Inspects latency, token usage, and exceptions for individual calls to identify performance bottlenecks and cost drivers.
LLM - Runs structured experiments to track performance over time and validate optimization changes in AI applications.
Performance Testing Frameworks - Implements frameworks to measure application pipeline stability and performance across multiple scenarios over time.
LLM Evaluation Tools - Real-time monitoring tool for LLM drift detection and tracing.
LLM Observability and Evaluation - Observability for LLMs and machine learning models.
Model Evaluation and Benchmarking - Observability platform for experimentation, evaluation, and troubleshooting.
Observability and Evaluation - Observability tool for tracing and evaluating LLM performance.
Observability And Monitoring - Observability platform for experimentation, evaluation, and troubleshooting.
Reliability and Debugging - AI observability platform. Tracing, datasets, experiments, and playground for troubleshooting and evaluating LLM apps.
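
Several entries above describe a run as a series of nested spans. The sketch below shows that idea with a small standard-library tracer. It is not Phoenix, OpenTelemetry, or OpenInference code; `Span`, `Tracer`, the span names, and the attribute values are hypothetical, and a real application would use an OpenTelemetry SDK instead.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    """One timed step in a run, linked to its parent step by name."""
    name: str
    parent: Optional[str]
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)


class Tracer:
    """Collects spans for a single run, tracking nesting with a stack."""

    def __init__(self) -> None:
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes) -> Iterator[Span]:
        parent = self._stack[-1].name if self._stack else None
        current = Span(name, parent, time.perf_counter(), attributes=dict(attributes))
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            # Close the span even if the wrapped step raised.
            current.end = time.perf_counter()
            self._stack.pop()


tracer = Tracer()
with tracer.span("chain"):
    with tracer.span("retrieval", query="what is drift?"):
        pass  # a retriever call would go here
    with tracer.span("llm", model="hypothetical-model"):
        pass  # a model call would go here

# Each span records its parent, which is enough to rebuild the call tree.
tree = [(s.name, s.parent) for s in tracer.spans]
```

Recording the parent on every span is what lets a viewer show where in the sequence a failure or slow step occurred.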

Agenta-AI/agenta

3,860 View on GitHub

Agenta is a Prompt Ops lifecycle manager and prompt management platform that decouples prompt engineering from application code. It serves as a centralized system for developing, versioning, and deploying prompt templates and model configurations across different environments. The platform functions as an AI agent orchestrator with a visual interface for building agent workflows and connecting models to external tools. It further acts as an evaluation framework and observability tool, utilizing OpenTelemetry to capture execution traces, monitor latency, and track token costs. The system cove

Helicone/helicone

5,830 View on GitHub

Helicone is an AI gateway and observability platform designed to intercept, manage, and monitor interactions with large language models. By acting as a reverse-proxy, it provides a centralized layer for routing requests across multiple AI providers, allowing developers to maintain consistent application logic while gaining deep visibility into model performance, usage, and costs. The platform distinguishes itself through a robust suite of traffic management and prompt engineering tools. It enables policy-driven control, including automatic failover between providers, rate limiting, and edge-b

traceloop/openllmetry

7,202 View on GitHub

OpenLLMetry is an OpenTelemetry-based observability framework and instrumentation library for generative AI applications. It provides toolsets for tracing and monitoring large language model workflows, capturing telemetry from model providers, agent frameworks, and vector databases using standardized semantic conventions. The project distinguishes itself by providing a specialized evaluation and experimentation suite that associates user feedback and prompt version hashes with specific execution traces. It includes a system for tracking model reasoning paths and enforcing security guardrails

vibrantlabsai/ragas

12,659 View on GitHub

Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin
