Ragas

Ragas is an evaluation framework and performance benchmark designed to quantify the quality of retrieval-augmented generation (RAG) pipelines. It uses automated metrics and model-based scoring to identify bottlenecks in language model workflows, so that applications can be optimized.

The framework includes a system for generating synthetic datasets that mimic production scenarios and edge cases, yielding realistic test cases. It also enables reference-free assessment: response quality is evaluated by checking how well the response is grounded in the provided context, without requiring gold-standard labels.
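To make the idea of reference-free assessment concrete, here is a minimal sketch of a groundedness score: the fraction of claims in a response that the context supports. It is not the Ragas implementation. The function names (`split_claims`, `is_supported`, `groundedness`) are hypothetical, and a simple word-overlap check stands in for the judge model that a real framework would call.

```python
import re


def split_claims(response: str) -> list[str]:
    """Naively treat each sentence of the response as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", response)
    return [p.strip() for p in parts if p.strip()]


def is_supported(claim: str, context: str, threshold: float = 0.6) -> bool:
    """Assumption: word overlap stands in for a judge model's verdict."""
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not claim_words:
        return False
    overlap = len(claim_words & context_words) / len(claim_words)
    return overlap >= threshold


def groundedness(response: str, context: str) -> float:
    """Fraction of claims supported by the context; no gold answer needed."""
    claims = split_claims(response)
    if not claims:
        return 0.0
    supported = sum(is_supported(claim, context) for claim in claims)
    return supported / len(claims)


context = "Paris is the capital of France. It lies on the Seine."
response = "Paris is the capital of France. Paris has ten million pyramids."
print(groundedness(response, context))
```

Only the response and the retrieved context are needed as inputs, which is what makes the check reference-free. Swapping `is_supported` for a call to a judge model would keep the rest of the sketch unchanged.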

The system covers several analytical areas, including retrieval quality assessment, model accuracy measurement, and the optimization of application performance through the analysis of live usage data.
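As an illustration of what retrieval quality assessment can measure, the sketch below computes precision and recall over retrieved document IDs. This is a generic information-retrieval calculation under assumed inputs, not a Ragas API. The function name and the ID lists are hypothetical, and unlike the reference-free checks it assumes a labeled set of relevant documents is available.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query.

    retrieved_ids: IDs returned by the retriever, in rank order.
    relevant_ids: IDs labeled as relevant (assumed to be available).
    """
    # Drop duplicate retrievals while keeping rank order.
    retrieved = list(dict.fromkeys(retrieved_ids))
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = retrieval_precision_recall(
    ["d1", "d2", "d3", "d4"], ["d1", "d3", "d9"]
)
print(precision, recall)
```

Scoring retrieval separately from generation shows whether a poor answer stems from missing context or from the model misusing context it was given.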

Features

RAG Evaluation Frameworks - Provides a comprehensive framework for assessing the performance and groundedness of retrieval-augmented generation systems.

LLM Test Pair Generators - Creates synthetic question and answer pairs by evolving documents through LLM-driven perturbation.

Synthetic Scenario Generators - Generates synthetic scenarios and query patterns to test system edge cases in RAG pipelines.

RAG Performance Metrics - Calculates accuracy by measuring the alignment between the query, retrieved context, and final output.

Retrieval Benchmarks - Quantifies the accuracy and relevance of the data retrieval process using specialized performance metrics.

LLM Evaluation - Provides a framework for measuring the quality of LLM outputs using automated judges and custom metrics.

RAG Performance Benchmarks - Quantifies retrieval accuracy and generation faithfulness using synthetic test datasets.

Reference-Free Evaluations - Evaluates response quality by analyzing grounding in the provided context without requiring gold-standard labels.

Scoring Pipelines - Implements modular scoring pipelines that isolate retrieval and generation steps for granular analysis.

Prompt-Based Schema Enforcement - Enforces consistent output formats from judge models using structured prompt templates.

Application Performance Optimization - Analyzes live usage data to identify and resolve bottlenecks in application logic.

LLM Performance Analyzers - Identifies performance bottlenecks in language model workflows using live usage data.

LLM Workflow Optimization - Analyzes live application data and output scores to identify bottlenecks in language model workflows.

Evaluation Frameworks - Toolkit for evaluating and optimizing retrieval-augmented generation applications.

LLM Evaluation Tools - Evaluation framework focused on RAG metrics and test set generation.

Model Evaluation and Benchmarking - Framework specifically for evaluating RAG pipelines.

Retrieval Augmented Generation - Evaluation framework specifically for retrieval pipelines.

Evaluation Frameworks - Framework for evaluating RAG components like faithfulness and relevance.
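The prompt-based schema enforcement listed above can be sketched as a prompt template that requests a fixed JSON shape, paired with a parser that rejects anything else. The template text, the field names (`verdict`, `reason`), and both function names are assumptions chosen for illustration. They do not reproduce the prompts or classes Ragas actually uses.

```python
import json

# Hypothetical judge prompt; doubled braces are literal braces for str.format.
JUDGE_PROMPT = """You are grading an answer for faithfulness to the context.
Context: {context}
Answer: {answer}
Reply with JSON only, in exactly this shape:
{{"verdict": 0 or 1, "reason": "<one sentence>"}}"""


def build_prompt(context: str, answer: str) -> str:
    """Fill the template so every judge call requests the same schema."""
    return JUDGE_PROMPT.format(context=context, answer=answer)


def parse_verdict(raw: str) -> dict:
    """Extract and validate the JSON object from raw judge output."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    verdict = data.get("verdict")
    if type(verdict) is not int or verdict not in (0, 1):
        raise ValueError("verdict must be the integer 0 or 1")
    if not isinstance(data.get("reason"), str):
        raise ValueError("reason must be a string")
    return {"verdict": verdict, "reason": data["reason"]}


raw_output = 'Sure! {"verdict": 1, "reason": "Supported by the context."}'
print(parse_verdict(raw_output))
```

Validating the judge output before scoring means a malformed reply raises an error that can trigger a retry, rather than being silently counted as a score.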

explodinggradients/ragas
