Model Risk Management
H2O Eval Studio uses the Model Risk Management (MRM) methodology to evaluate the performance of a RAG system. The implementation of the methodology in the form of MRM workflows is based on the paper Human-Calibrated Automated Testing and Validation of Generative Language Models and related work.
What is the value of MRM workflows for RAG system evaluation?
- Automated testing of the RAG system on relevant data
- Manually creating evaluation data—question/answer pairs based on the RAG's corpus—is both expensive and slow. Using generic evaluation data that is not rooted in the RAG's corpus can lead to false positives and false negatives.
- MRM workflows automatically generate question/answer pairs based on the RAG's corpus, making the process faster and more efficient.
- Human-in-the-loop
- Fully automated RAG evaluation, whether LLM-based or agent-based, risks producing false results due to the inherent nature of generative AI.
- The presence of the human-in-the-loop ensures that evaluation results are accurate and reliable, with domain experts, regulators, and other stakeholders guaranteeing the quality of the evaluation.
- Holistic weakness diagnostics
- Existing evaluation frameworks often fall short because they detect errors without providing the context or root cause, which limits their usability.
- MRM workflows offer a comprehensive approach to diagnosing weaknesses. They enable the identification of evaluation errors and their root causes, such as the type of problematic questions, difficult semantic topics, or clusters of failures. These tools help users pinpoint the exact issues.
- Robustness testing
- RAG systems can be vulnerable to minor changes in input, which can lead to significant changes in the output. This fragility can compromise their reliability.
- MRM workflows include specific steps for robustness testing, such as input variations and adversarial inputs, which evaluate how well the RAG system handles perturbed or malicious queries, ensuring its resilience and stability.
- Interpretability and transparency
- Traditional black-box models make it difficult to understand why a RAG system provides a specific answer, especially when it is incorrect or unhelpful.
- MRM workflows enhance interpretability by providing detailed insights into the evaluation process. By visualizing data like failure clusters and weakness detection charts, users can understand the "why" behind model performance, building trust and enabling targeted improvements.
- Efficiency and actionable insights
- The ultimate goal of RAG evaluation is to improve the system. However, many evaluation methods provide only metrics without clear guidance on how to act.
- MRM workflows generate specific, actionable artifacts that can be directly used to correct and improve the RAG system's performance.
MRM methodology
Model Risk Management is a comprehensive framework for the automated testing and validation of generative language models, in particular Retrieval-Augmented Generation (RAG) systems. The framework addresses key challenges related to functional performance, risk and safety metrics, interpretability and transparency, weakness diagnostics, and robustness testing, offering practical methodologies for diagnosing, optimizing, and validating these systems.
In addition to LLM-based RAG systems, the framework can be applied to systems that use agents to generate responses to user queries.
Principles
One of the guiding principles of the MRM framework is that LLMs cannot be trusted to evaluate LLMs; therefore, the framework relies on:
- Embeddings and semantic similarity.
- Human in the loop.
- Holistic diagnostics.
MRM workflows
H2O Eval Studio implements the MRM methodology in the form of workflows - automated processes that evaluate the performance of an LLM-based or agent-based RAG system.
MRM workflow steps:
- Topic modeling
- Test generation
- Test validation
- Automated evaluation
- Human evaluation calibration
- Failure clustering
- Weakness detection
- Robustness testing: input variations
- Robustness testing: adversarial inputs
Topic modeling
The purpose of the topic modeling workflow step is to automatically extract chunks from the corpus documents, cluster them, and derive the relevant topics present in the RAG's corpus.
This step identifies topics in your documents by clustering similar text chunks.
First, the step reduces the dimensionality of the high-dimensional chunk embeddings, then applies clustering algorithms such as HDBSCAN and k-means to group similar vectors, and finally extracts keywords to represent each topic.
You can select different techniques and adjust parameters for each step, creating multiple topic models. The quality of each model is measured using the silhouette score to find the optimal configuration.
The identified topics will be used to generate test cases for your document-aware RAG system evaluation.
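The sketch below illustrates the clustering pipeline described above using the open-source sentence-transformers, umap-learn, hdbscan, and scikit-learn packages. The embedding model and parameter values are illustrative assumptions, not the exact configuration used by H2O Eval Studio.

```python
# Sketch of the topic modeling pipeline: embed chunks, reduce dimensionality
# with UMAP, cluster with HDBSCAN, and score the clustering with silhouette.
# Model name and parameter values are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.metrics import silhouette_score
import hdbscan
import umap

chunks = [
    "Model risk management covers validation of predictive models.",
    "Retrieval-augmented generation combines search with an LLM.",
    "Silhouette score measures how well points fit their clusters.",
    # ... more corpus chunks
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice
embeddings = embedder.encode(chunks)

# Dimensionality reduction (corresponds to the UMAP parameters above).
reducer = umap.UMAP(n_neighbors=15, n_components=5, min_dist=0.0)
reduced = reducer.fit_transform(embeddings)

# Density-based clustering (corresponds to the HDBSCAN parameters above).
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3,
                            cluster_selection_epsilon=0.0)
labels = clusterer.fit_predict(reduced)

# Silhouette score compares cluster cohesion against cluster separation.
if len(set(labels)) > 1:
    print("silhouette:", silhouette_score(reduced, labels))
```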
Step parameters:
- Top N words: The number of top words to extract from each topic.
- Minimum topic size: The minimum number of documents per topic.
- UMAP neighbors: The number of neighbors to consider when reducing the dimensionality of the chunk embeddings.
- UMAP components: The number of dimensions (components) of the reduced UMAP embedding.
- UMAP minimal distance: The minimal distance between points in the UMAP embedding.
- HDBSCAN min cluster size: The minimum number of points per cluster.
- HDBSCAN min samples: The number of samples in a point's neighborhood required for the point to be considered a core point; higher values produce more conservative clustering.
- HDBSCAN cluster selection epsilon: The distance threshold below which clusters are merged during cluster selection.
Expert settings:
- Enterprise h2oGPTe model host: h2oGPTe is used for chunking the documents, which are subsequently clustered using their embeddings. The h2oGPTe model host can be configured.
Depends on:
- Corpus: The corpus to extract chunks from.
Suggestions:
- Review discovered topics: Review the discovered topics to ensure they are relevant to your use case and RAG system corpus. If a topic is not relevant, you can exclude it from the test generation workflow step. If not satisfied, run the topic modeling workflow step again with different parameters.
Test generation
The purpose of the test generation workflow step is to automatically generate question/expected answer pairs (test cases) from the corpus-based topics identified in the topic modeling workflow step. The test generation step ensures that the RAG system is evaluated on its actual data, not on generic and/or out-of-scope questions.
This step generates test cases for the RAG system evaluation. Choose relevant topics that were identified in the previous workflow step, with the option to include or exclude specific content chunks within each topic. Select specific generators and customize the number of test cases you need.
These generated test cases will serve as the foundation for comprehensive RAG system evaluation in subsequent workflow steps.
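As a rough illustration of what a topic-based generator does, the sketch below builds a question/expected-answer pair from a topic's chunks with a generic LLM call. The llm() helper is hypothetical and stands in for whichever model the step is configured to use; the actual generators in H2O Eval Studio are more sophisticated.

```python
# Illustrative sketch only: generate a question/answer pair from a topic's
# chunks with a generic LLM call. The llm() helper below is hypothetical.
import json

def llm(prompt: str) -> str:
    """Hypothetical helper that sends a prompt to an LLM and returns text."""
    raise NotImplementedError

def generate_test_case(topic: str, chunks: list[str]) -> dict:
    prompt = (
        f"Topic: {topic}\n"
        "Context:\n" + "\n".join(chunks) + "\n\n"
        "Write one question that can be answered only from the context above, "
        "and its expected answer. Respond as JSON with keys "
        "'question' and 'expected_answer'."
    )
    return json.loads(llm(prompt))

# Example usage (assuming chunks selected from a discovered topic):
# test_case = generate_test_case("model validation", topic_chunks)
```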
Step parameters:
- Test cases to generate: The number of test cases to generate.
- Generators: Question/answer pair generators to use for test case generation. There are 10+ generators available, such as the multi-hop questions generator, inference questions generator, and ambiguity handling generator, each with its own strengths and weaknesses. The generators are subsequently used to detect whether the RAG system can answer a particular type of generated question correctly.
- Topics: The topics to generate test cases for.
Depends on:
- Topic modeling: The topic modeling workflow step must be completed before the test generation workflow step.
Suggestions:
- Review generated test cases: Review the generated test cases to ensure they are relevant to your use case and RAG system corpus. If not satisfied, edit or remove problematic test cases, or run the test generation workflow step again with different test case generators and parameters. Re-run the workflow after making changes.
Test validation
The purpose of the test validation workflow step is to validate the generated test cases to ensure they are relevant to your use case and RAG system corpus.
Validate the quality of the generated test suite using various metrics, such as cosine similarity with the original text, BERT score, or NLI score. This step helps you assess the quality of the test suite used for evaluation and iteratively improve the model validation process.
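For intuition, the following minimal sketch computes one of these validation metrics - cosine similarity between a generated expected answer and its source chunk - using sentence-transformers. The embedding model choice is an illustrative assumption; BERT score and NLI would be computed analogously with their respective models.

```python
# Minimal sketch: cosine similarity between a generated expected answer and
# its source chunk, used to flag test cases that drift from the corpus.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative assumption

def answer_grounding_similarity(expected_answer: str, source_chunk: str) -> float:
    vectors = embedder.encode([expected_answer, source_chunk])
    return float(util.cos_sim(vectors[0], vectors[1]))

score = answer_grounding_similarity(
    "MRM workflows calibrate metric thresholds with human labels.",
    "Human-in-the-loop calibration adjusts metric thresholds in MRM workflows.",
)
print(f"cosine similarity: {score:.2f}")  # low values flag potential outliers
```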
Depends on:
- Test generation: The test generation workflow step must be completed before the test validation workflow step.
Suggestions:
- Review BERT score, cosine similarity, and NLI score metric distributions: Use the validation metrics drop-down to see the distribution of each metric across test cases. If you spot an outlier, or notice any other issue, you can remove the affected test cases so they are not used for evaluation in subsequent workflow steps.
Once this workflow step is completed, the automatically generated and human-curated test cases are ready to be used for the evaluation of the target RAG system.
Automated evaluation
The purpose of the automated evaluation workflow step is to automatically evaluate the target RAG system using the automatically generated and human-curated test cases and a suite of 30+ evaluators.
Evaluators use a variety of evaluation techniques to assess the target RAG system. Once the evaluation is complete, in addition to the evaluation results visualized in the eval eye, the step also identifies problems and insights that can be used to improve the RAG system.
For details, see how to create and view an evaluation.
Suggestions:
- Embeddings, n-grams, rules, and NLI: The MRM methodology suggests using embeddings, n-grams, rules, and NLI to evaluate the target RAG system - "an LLM cannot be trusted to evaluate an LLM". When creating a new evaluation, you can filter evaluators by Method to select only the evaluators that you consider relevant to your use case and trust. Refer to evaluators for more details on the methods, models, and metrics provided.
- Eval eye: The eval eye visualizes the evaluation results, allowing you to drill down into them and identify the strengths and weaknesses of the target RAG system.
- Review failed test cases: Review the failed test cases - actual answers, retrieved contexts, and metric scores - to understand the root cause of each failure and improve the RAG system.
Depends on:
- Test generation: The test generation workflow step must be completed before the automated evaluation workflow step.
Human evaluation calibration
The purpose of the human evaluation calibration workflow step is to let a human reviewer label a sample of evaluated test cases, agreeing or disagreeing with the pass/fail decisions implied by the metric scores. This allows human-supervised calibration of the metric thresholds to correct the evaluation results.
The method used by this step calibrates the metric thresholds on multiple smaller subsets of the test cases instead of on one large sample, which makes the human-supervised calibration iterative and efficient. Changes to the metric thresholds are propagated to other workflow steps, such as weakness detection.
The labels created by the human reviewer are also used to generate the corpus patch artifact, which can be used to improve the RAG system. Test cases that failed according to the evaluators but which the reviewer marked as passed are added to the corpus patch artifact (original question and actual answer), as the reviewer indicated that the RAG system responded correctly to the question (considering, for example, the groundedness metric).
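The following minimal sketch shows one possible way to calibrate a single metric threshold from human labels, assuming a reviewed sample of (metric score, human pass/fail) pairs; the actual calibration procedure in H2O Eval Studio may differ.

```python
# Minimal sketch of human-supervised threshold calibration: pick the threshold
# that maximizes agreement with human pass/fail labels on a reviewed sample.
def calibrate_threshold(labeled: list[tuple[float, bool]]) -> float:
    candidates = sorted({score for score, _ in labeled})
    best_threshold, best_agreement = 0.5, -1
    for threshold in candidates:
        agreement = sum(
            (score >= threshold) == human_passed
            for score, human_passed in labeled
        )
        if agreement > best_agreement:
            best_threshold, best_agreement = threshold, agreement
    return best_threshold

# Illustrative sample of (metric score, reviewer says the test case passed).
sample = [(0.92, True), (0.81, True), (0.74, False), (0.66, False), (0.85, True)]
print("calibrated threshold:", calibrate_threshold(sample))
```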
Depends on:
- Automated evaluation: The automated evaluation workflow step must be completed before the human evaluation calibration workflow step.
Suggestions:
- Calibrate all primary metrics: Make sure that you calibrate the primary metrics of all evaluators from the automated evaluation workflow step. Keep resampling until you are satisfied with the calibration results and threshold values. You can then return to the automated evaluation workflow step and review the results using the eval eye.
Failure clustering
The purpose of the failure clustering workflow step is to cluster the failed test cases based on the scores of the failed metrics. This makes it possible to identify the root causes of the failures and improve the RAG system.
This workflow step analyzes failed test cases to identify common failure patterns and group them into meaningful clusters. By applying clustering algorithms to failure data, you can discover systematic weaknesses in your RAG system and understand the underlying causes of different failure types.
The clustering process examines various dimensions such as topics, generators, and failure characteristics to reveal patterns that might not be immediately obvious. This analysis helps prioritize debugging efforts and guides targeted improvements to your system's performance. If there are not enough failed test cases, the failure clustering workflow step is skipped and considered successful.
This step computes embeddings of the failed test cases, performs dimensionality reduction, and trains a random forest model to build a proximity matrix (test cases that end up in the same leaf node across different trees are considered close), which is then used to cluster the failed test cases.
The role of the human reviewer is key, as the random forest model typically cannot capture the underlying causes of the failures on its own. The human reviewer, as a domain expert, can identify the common denominators of the failed test cases within a cluster and formulate a weakness for each cluster. This makes it possible to subsequently prepare actions that mitigate the weaknesses and improve the RAG system's performance.
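The sketch below illustrates the random-forest proximity idea with scikit-learn; the features (reduced embeddings) and the training target are illustrative assumptions, not the exact procedure used by this step.

```python
# Minimal sketch of random-forest proximity clustering: test cases that
# frequently land in the same leaf across trees are treated as close.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier

def proximity_clusters(features: np.ndarray, target: np.ndarray,
                       n_clusters: int = 3) -> np.ndarray:
    # Target is an assumption, e.g. the question generator of each failed case.
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(features, target)
    leaves = forest.apply(features)           # (n_samples, n_trees) leaf indices
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    proximity = same_leaf.mean(axis=2)        # fraction of trees sharing a leaf
    distance = 1.0 - proximity                # proximity -> distance matrix
    clustering = AgglomerativeClustering(     # requires scikit-learn >= 1.2
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    )
    return clustering.fit_predict(distance)
```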
Depends on:
- Automated evaluation: The automated evaluation workflow step must be completed before the failure clustering workflow step.
Suggestions:
- Review clusters chart: Review the clusters chart and go through the test cases within a particular cluster one after another to identify the root cause of the failures. If you spot something they have in common, formulate a weakness hypothesis for the cluster.
Weakness detection
The purpose of the weakness detection workflow step is to detect weaknesses in the RAG system based on the distribution of scores across all test cases. This workflow step benefits from knowing which question generators and topics were used to create the test cases, which makes it possible to identify question types and topics where the evaluated RAG system doesn't perform well.
This workflow step helps you identify the weak areas by evaluating performance across individual dimensions - topics and query types - and across two dimensions simultaneously.
The topics chart displays the metric distribution across the various topics appearing in the test suite, which helps detect topics where the evaluated RAG system doesn't perform well. Note the distribution of metric scores and their proximity to the metric thresholds.
The question types chart displays the metric distribution across the various question types appearing in the test suite, which helps detect question types where the evaluated RAG system doesn't perform well. Note the distribution of metric scores and their proximity to the metric thresholds.
Finally, the Bivariate Analysis helps you detect compound weaknesses by visualizing the interactions between topics and query types.
The visualization displays a grid where each cell represents a combination of topic and query type, and the color intensity indicates the average score of the responses in that combination.
The dots in the cells represent individual test cases and are colored by the score of the respective test case. The size of the dots doesn't have any specific meaning; it is just a property of the visualization's autoscaling.
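Conceptually, the bivariate grid is the average metric score per (topic, question type) cell, as in the following minimal pandas sketch with illustrative column names and values.

```python
# Minimal sketch of the bivariate analysis: average metric score per
# (topic, question type) cell. Column names and values are illustrative.
import pandas as pd

results = pd.DataFrame({
    "topic": ["pricing", "pricing", "onboarding", "onboarding"],
    "question_type": ["multi-hop", "inference", "multi-hop", "inference"],
    "score": [0.91, 0.62, 0.48, 0.85],
})

grid = results.pivot_table(
    index="topic", columns="question_type", values="score", aggfunc="mean"
)
print(grid)  # low-scoring cells indicate compound weaknesses
```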
Depends on:
- Automated evaluation: The automated evaluation workflow step must be completed before the weakness detection workflow step.
Suggestions:
- Review charts: Review the charts to identify the weak areas in the RAG system.
Robustness testing: input variations
The purpose of the robustness testing using input variations method is to evaluate how well the RAG system handles variations in the input data. By analyzing which question generators and topics are most affected by input perturbations, robustness testing helps pinpoint areas where the RAG system is sensitive to changes and may lack resilience. This enables targeted improvements to enhance the system's stability against diverse input scenarios.
Robustness testing examines how model performance changes when test cases are perturbed. After evaluation, select a metric and apply perturbations to test the model's stability. The results track flips - instances where perturbations cause evaluation scores to cross the threshold boundary (changing from pass to fail or vice versa). Each flip indicates the model is sensitive to that particular perturbation type, providing insights into its resilience against input variations.
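A minimal sketch of flip counting, assuming aligned lists of original and perturbed metric scores and a calibrated threshold:

```python
# A flip occurs when a perturbation moves a test case's metric score across
# the calibrated threshold (pass -> fail or fail -> pass).
def count_flips(original: list[float], perturbed: list[float],
                threshold: float) -> int:
    return sum(
        (o >= threshold) != (p >= threshold)
        for o, p in zip(original, perturbed)
    )

original_scores = [0.82, 0.77, 0.91, 0.64]   # illustrative values
perturbed_scores = [0.79, 0.71, 0.90, 0.52]
print("flips:", count_flips(original_scores, perturbed_scores, threshold=0.75))
```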
Depends on:
- Automated evaluation: The automated evaluation workflow step must be completed before the robustness testing using input variations workflow step.
Suggestions:
- Review charts: Review the charts to identify the weak areas in the RAG system.
Robustness testing: adversarial inputs
The robustness testing using adversarial inputs method leverages generated adversarial test cases to evaluate how well the RAG system handles deliberately misleading, ambiguous, or contradictory inputs. Test cases from the test generation workflow step serve as a baseline for metrics calculation, while the generated adversarial test cases help assess the difference and ultimately determine the system's robustness.
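As a minimal illustration, the robustness signal can be summarized as the drop in the average metric score between the baseline and adversarial test cases; the metric and values below are illustrative assumptions.

```python
# Compare the average metric score of baseline test cases with that of their
# adversarial counterparts; a large drop indicates low robustness.
from statistics import mean

baseline_scores = [0.88, 0.91, 0.79, 0.84]      # original test cases
adversarial_scores = [0.72, 0.65, 0.58, 0.70]   # adversarial counterparts

degradation = mean(baseline_scores) - mean(adversarial_scores)
print(f"average score drop under adversarial inputs: {degradation:.2f}")
```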
Depends on:
- Automated evaluation: The automated evaluation workflow step must be completed before the robustness testing using adversarial inputs workflow step.
Suggestions:
- Review charts: Review the charts to identify the weak areas in the RAG system.
Actionable artifacts
MRM workflows generate specific, actionable artifacts that can be directly used to correct and improve the RAG system's performance.
Report
The evaluation report provides a comprehensive overview of the RAG system's performance, including metrics, charts, and insights. The report is generated after the automated evaluation workflow step and can be accessed from the evaluation dashboard.
Actions:
- Talk to report:
- Upload the evaluation report to any RAG - preferably Enterprise h2oGPTe - in order to talk to the report and get insights, suggestions, action plans, mitigation strategies, or other recommendations.
Evaluation report use cases:
- Ongoing monitoring:
- Use the evaluation report to monitor the RAG system's performance over time.
- Agentic actionability:
- Use the evaluation report to take action based on the RAG system's performance.
- Trend tracking:
- Use the report to compare the RAG system's performance with a baseline and/or over time. Track the progress and identify trends.
- Compliance:
- Use the evaluation report to ensure the RAG system complies with regulatory or compliance requirements.
- Audit:
- Use the evaluation report to audit the RAG system's performance as required by regulatory or compliance requirements.
Corpus patch document
The corpus patch document can be applied to the RAG corpus in order to improve its performance. The corpus patch document contains FAQ-style questions and answers that were difficult for the RAG system to answer. By adding these questions and answers to the corpus, the RAG system can learn to answer them better. The corpus patch document is generated from the data of the human evaluation calibration workflow step and the robustness testing using input variations workflow step.
Actions:
- Apply corpus patch:
- Apply the corpus patch to the corpus in order to improve the RAG system's performance.
Corpus patch use cases:
- Improve RAG system performance:
- Use the corpus patch to improve the RAG system's performance by fixing particular questions and answers that were difficult for the RAG system to answer.
- Fill knowledge gaps:
- Use the corpus patch to fill knowledge gaps that were identified. The corpus patch allows for the direct injection of institutional knowledge that might be difficult for the RAG system to find on its own.
System prompt extension
A system prompt extension is a configuration artifact that can be applied to a RAG system to improve its behavior and performance. It contains a curated set of rules, instructions, and examples designed to guide the RAG system toward more accurate, safe, and contextually appropriate responses. Unlike changes to the corpus, which affect what the model retrieves, a system prompt extension influences how the model processes information and formulates its answer. For example, a generated system prompt extension might contain rules like the following:
### Handling Input Variations (Spelling and Phrasing)
* Rule 1: Typo Tolerance: "When a user's query contains a clear spelling error
or typo (e.g., 'managament' instead of 'management'), interpret the query
based on its likely correct spelling before searching the corpus. Do not
explicitly correct the user's spelling in the final answer."
* Rule 2: Phrasing Flexibility: "Recognize that the same question can be asked
in multiple ways. Focus on the core semantic intent of the query rather than a
strict keyword match. For example, if a user asks 'What are the steps to
onboard a new hire?' and the corpus contains 'New employee orientation
process,' treat these as semantically equivalent."
* Rule 3: Synonyms and Aliases: "Be aware of common synonyms and aliases for
technical terms or product names found in the corpus. If a query uses a
synonym, ensure the search for information considers the official
terminology."
### Maintaining Consistency and Contextual Integrity
* Rule 1: Answer Grounding: "After retrieving information, prioritize answering
the user's query using only the most relevant and factual content from the
provided context. If a perturbed query retrieves a slightly different context,
ensure the final answer is still consistent with the core information from the
original document set."
* Rule 2: Avoid Contradiction: "If the information retrieved by a perturbed
query appears to contradict a fundamental fact from the original corpus,
default to the most reliable, verified information. State that the information
is conflicting and provide the most accurate known fact, while optionally
noting the discrepancy."
* Rule 3: Confidence Threshold: "If a perturbed query significantly degrades the
quality or confidence of the retrieved context, formulate a cautious response.
Instead of providing a potentially incorrect answer, state that you cannot
find a sufficiently confident answer and suggest rephrasing the question."
Actions:
- Apply:
- Apply the system prompt extension to the RAG system's system prompt, either from the UI or using the API, in order to improve the system's performance.
System prompt extension use cases:
- Improving RAG system robustness and resilience:
- Make sure to test the RAG system with the prompt extension applied, and ideally re-run the MRM workflow to validate the changes and determine whether they improved the RAG system's performance and its robustness to perturbed and/or difficult questions.
- A/B testing:
- Apply the new prompt extension to a subset of users to A/B test its impact on key metrics against the original system prompt.
- Improving contextual understanding:
- In cases where the RAG system struggles with multi-hop questions, the system prompt extension provides rules that guide the RAG system to perform a more thorough analysis of the retrieved documents.
- Handling Ambiguity and "Out of Scope" Queries:
- For queries that are out of scope or ambiguous, a prompt extension can include instructions to ask for clarification or to politely decline to answer, preventing the RAG system from providing low-quality or fabricated answers.
Guardrail configuration
Guardrails are used to enforce safety and ethical guidelines on the RAG system's responses. The guardrail configuration contains human-calibrated metric thresholds that can be used to classify RAG system responses as reliable or not reliable, acting as a safety net that prevents the RAG system from producing responses that do not meet the safety, ethical, and accuracy requirements.
{
"calibrated_thresholds": {
"h2o_sonar.evaluators.fairness_bias_evaluator.FairnessBiasEvaluator": {
"fairness_bias": 0.75
},
"h2o_sonar.evaluators.pii_leakage_evaluator.PiiLeakageEvaluator": {
"model_passes": 0.5
},
"h2o_sonar.evaluators.toxicity_evaluator.ToxicityEvaluator": {
"toxicity": 0.25
}
},
"evaluator_ids": [
"h2o_sonar.evaluators.fairness_bias_evaluator.FairnessBiasEvaluator",
"h2o_sonar.evaluators.pii_leakage_evaluator.PiiLeakageEvaluator",
"h2o_sonar.evaluators.toxicity_evaluator.ToxicityEvaluator"
]
}
H2O Eval Studio provides a guardrails service that enforces safety and ethical guidelines on the RAG system's responses using evaluators and the human-calibrated thresholds of their metrics. The configuration above is used to drive this service.
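The following minimal sketch shows how such a configuration could be applied to a single response's metric scores. The comparison direction per metric (for example, toxicity being "lower is better") is an illustrative assumption; real deployments should use the H2O Eval Studio guardrails service or its Python client instead.

```python
# Minimal sketch of applying a guardrail configuration to one response.
# The per-metric comparison direction below is an illustrative assumption.
import json

with open("guardrail_config.json") as f:     # the configuration shown above
    config = json.load(f)

LOWER_IS_BETTER = {"toxicity"}               # assumption for this sketch

def violations(metric_scores: dict[str, float]) -> list[str]:
    failed = []
    for thresholds in config["calibrated_thresholds"].values():
        for metric, threshold in thresholds.items():
            score = metric_scores[metric]
            ok = score <= threshold if metric in LOWER_IS_BETTER else score >= threshold
            if not ok:
                failed.append(metric)
    return failed

# A response is blocked or flagged for review when any metric violates its threshold.
print(violations({"fairness_bias": 0.9, "model_passes": 0.8, "toxicity": 0.1}))  # []
```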
Actions:
- Deploy:
- Use the guardrail configuration to run the H2O Eval Studio guardrails service as a real-time "safety shield" for your RAG system.
- Integrate:
- Use the guardrail Python client to integrate with the H2O Eval Studio guardrail service.
Guardrails service use cases:
- Real-time response filtering:
- Intercept the RAG system's output before it reaches the end user. The guardrails service evaluates the response against its configured thresholds for metrics like toxicity, bias, and groundedness. If a response fails to meet the safety or accuracy criteria, it can be blocked, edited, or flagged for review, preventing the user from receiving a potentially harmful or unreliable answer.
- Enforce safety and ethical guidelines:
- Use the guardrail configuration to enforce safety and ethical guidelines on the RAG system's responses.
- Regulatory Compliance:
- In highly regulated industries such as finance or healthcare, the guardrails service can be used to ensure the RAG system's outputs comply with strict industry regulations, such as those related to data privacy like GDPR or HIPAA.
Conclusion
Model Risk Management workflows represent a comprehensive framework for the automated testing and validation of generative language models, in particular Retrieval-Augmented Generation (RAG) systems. The framework addresses key challenges related to functional performance, risk and safety metrics, interpretability and transparency, weakness diagnostics, and robustness testing, offering practical methodologies for diagnosing, optimizing, and validating these systems.
The approach integrates embedding-based evaluations, machine-human calibration through regression techniques, and robustness testing against diverse input conditions. Special attention is given to threshold acceptance criteria, which determine when automated evaluations are sufficient without human oversight, ensuring both efficiency and reliability.
By systematically diagnosing weaknesses through topic modeling and query analysis, and optimizing performance with advanced chunking, embeddings, and prompt engineering, the framework ensures that RAG systems meet high standards for accuracy and fairness. Additionally, fairness-aware evaluators and safety measures are incorporated to mitigate bias and ensure ethical compliance.
The framework equips practitioners with the tools needed to confidently deploy RAG systems across industries such as finance, healthcare, and legal services, ensuring these systems operate transparently, align with user expectations, and provide actionable insights.
Resources
- Human-Calibrated Automated Testing and Validation of Generative Language Models
- Model Validation Practice in Banking: A Structured Approach for Predictive Models