Tutorial 4A: Perturb a test suite to evaluate RAG robustness to human user input errors
Overview
This tutorial demonstrates how to perturb a test suite in H2O Eval Studio to evaluate the robustness of Retrieval-Augmented Generation (RAG) models to human user input errors. By systematically varying input prompts - introducing syntactically and grammatically incorrect constructions - we can assess how well the model adapts to such inputs and maintains performance.
Objectives
- Create a perturbed test suite in H2O Eval Studio to evaluate the robustness of RAG models to human user input errors.
Prerequisites
- Access to Enterprise h2oGPTe
- Access to H2O Eval Studio
- Enterprise h2oGPTe API Key
- Basic understanding of LLMs and evaluation metrics
Step 1: Create a new test
First, we need to create a new test in H2O Eval Studio. This test will include a set of prompts that we will later perturb.
- In the H2O Eval Studio navigation menu, click Tests.
- Click New Test.
- In the Test name field, enter:
Tutorial 4A test
- In the Description field, enter:
Tutorial 4A test for evaluating RAG model robustness.
- Click Create.
Step 2: Add test cases from the test case library
Next, we will add test cases from the test case library - which offers more than 1 million prompts - to our newly created test.
- In the Tests view of the newly created test, click Actions.
- Click Add from library.
- Keep the Count field set to
5
as the number of test cases to add.
- In the Search field, enter:
small
- Select the
SR 11-7 (small)
test suite.
- Click Add.
Test cases are added to the test.
Step 3: Perturb the test cases with perturbation techniques mimicking human user errors
Now, we will perturb the test cases using various techniques that mimic human user input errors. This will help us evaluate how well the RAG model can handle such variations.
- In the Tests view of the test, click Actions.
- Click Perturb all test cases.
- In the Perturbation settings panel, configure the following options:
- Technique: Select techniques that mimic human user input errors at both the character level and the word level: Comma Perturbator and Word Swap Perturbator.
- Intensity: Set the intensity level for the perturbations to
Medium
- Click Apply.
A new test containing both the original and perturbed test cases is created.
The original test cases are preserved, and the test records the relationship between each original test case and its perturbed counterpart, allowing a direct comparison of the model's performance on original and perturbed inputs.
H2O Eval Studio uses these relationships to detect evaluator metric flips - from pass to fail or vice versa - and reports them as evaluation problems, or as flipped test cases in input variations robustness testing.
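To make the perturbation techniques from this step concrete, here is a minimal, hypothetical Python sketch of the two ideas involved: comma insertion or removal at the character level and adjacent word swaps at the word level. It illustrates the concept only and is not the implementation used by H2O Eval Studio; the function names, rates, and example prompt are illustrative.

```python
import random

def comma_perturb(prompt: str, rate: float = 0.15) -> str:
    """Randomly drop existing commas or insert spurious ones, mimicking punctuation errors."""
    words = prompt.split(" ")
    out = []
    for word in words:
        if word.endswith(",") and random.random() < rate:
            word = word.rstrip(",")   # drop a comma a user might omit
        elif not word.endswith(",") and random.random() < rate:
            word = word + ","         # insert a comma a user might add
        out.append(word)
    return " ".join(out)

def word_swap_perturb(prompt: str, rate: float = 0.15) -> str:
    """Swap adjacent words to mimic hurried or reordered typing."""
    words = prompt.split(" ")
    i = 0
    while i < len(words) - 1:
        if random.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2                    # skip past the swapped pair
        else:
            i += 1
    return " ".join(words)

# Example: apply both perturbations to a single prompt.
original = "What is model risk, and how should it be governed?"
print(word_swap_perturb(comma_perturb(original)))
```

Applying such perturbations while keeping the originals is what makes the later pass/fail comparison between each original and perturbed prompt possible.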
Step 4: Create a new evaluation
Finally, we will create a new evaluation in H2O Eval Studio to assess the performance of the RAG model on both the original and perturbed test cases.
- In the Tests view of the test, click Actions.
- Click Create Evaluation.
- In the Evaluation settings panel, configure the following options:
- Name: Enter a name for the evaluation (e.g., "RAG Model Evaluation").
- Description: Enter a description for the evaluation.
- Test: Select the test created in the previous steps.
- Model host: Select the RAG model you want to evaluate.
- Evaluators: Select the evaluators you want to use for the evaluation.
- Click Create.
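For reference, the settings configured in these steps can be thought of as a simple configuration record. The sketch below expresses them as a plain Python dictionary; the keys mirror the UI fields and are illustrative only, not an Eval Studio API.

```python
# Illustrative summary of the evaluation settings from this step.
# Keys mirror the UI fields; values are placeholders.
evaluation_config = {
    "name": "RAG Model Evaluation",
    "description": "Robustness of the RAG model to human user input errors.",
    "test": "Tutorial 4A test (original and perturbed test cases)",
    "model_host": "<your Enterprise h2oGPTe deployment>",
    "evaluators": ["<evaluators selected for this evaluation>"],
}
```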
Step 5: Review evaluation results
After the evaluation completes, you can review the results in the Evaluations view, on the Problems tab. This provides insight into the RAG model's performance on both the original and perturbed test cases and reports any flipped test cases.
Summary
In this tutorial, we created a perturbed test suite in H2O Eval Studio to evaluate the robustness of RAG models to human user input errors. By systematically varying input prompts, we assessed how well the model adapts to different scenarios and maintains performance. This approach helps identify potential weaknesses and areas for improvement in RAG models, ensuring they can handle a wide range of user inputs effectively.
Perturbation techniques are used throughout H2O Eval Studio to evaluate the sensitivity of LLMs to various input changes, guardrail resilience against encoding attacks, and overall robustness.
H2O Eval Studio's Model Risk Management (MRM) framework offers Input Variations Robustness Testing as one of its workflow steps. This MRM evaluation tool detects evaluator metric flips - from pass to fail or vice versa - when input variations are applied to the test suite. This helps ensure that models are robust and reliable, even when faced with unexpected or challenging inputs.
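As a conceptual illustration of the flip detection described above, the following hypothetical Python sketch compares each original test case's pass/fail outcome with that of its perturbed counterpart and reports any flips. All identifiers and field names are illustrative, not an Eval Studio API.

```python
from typing import Dict, List, Tuple

def find_flips(original_results: Dict[str, bool],
               perturbed_results: Dict[str, bool],
               pairs: List[Tuple[str, str]]) -> List[dict]:
    """Report test cases whose pass/fail outcome changed after perturbation.

    pairs links each original test-case id to its perturbed counterpart's id.
    """
    flips = []
    for original_id, perturbed_id in pairs:
        before = original_results[original_id]
        after = perturbed_results[perturbed_id]
        if before != after:
            flips.append({
                "original": original_id,
                "perturbed": perturbed_id,
                "flip": "pass->fail" if before else "fail->pass",
            })
    return flips

# Example: one prompt passes on the original wording but fails after perturbation.
pairs = [("tc-1", "tc-1-perturbed"), ("tc-2", "tc-2-perturbed")]
print(find_flips({"tc-1": True, "tc-2": True},
                 {"tc-1-perturbed": False, "tc-2-perturbed": True},
                 pairs))
# [{'original': 'tc-1', 'perturbed': 'tc-1-perturbed', 'flip': 'pass->fail'}]
```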