Evaluators
This page describes the available H2O Eval Studio evaluators.
Compliance Frameworks
H2O Eval Studio conforms to the following compliance frameworks.
Evaluation Standard | Summary | Type |
---|---|---|
Safe | AI systems should not under defined conditions, lead to a state in which human life, health, property, or the environment is endangered. Safe operation of AI systems is improved through: responsible design, development, and deployment practices, clear information to deployers on responsible use of the system, responsible decision-making by deployers and end users; and explanations and documentation of risks based on empirical evidence of incidents. | NIST |
Secure and Resilient | AI systems, as well as the ecosystems in which they are deployed, may be said to be resilient if they can withstand unexpected adverse events or unexpected changes in their environment or use - or if they can maintain their functions and structure in the face of internal and external change and degrade safely and gracefully when this is necessary. | NIST |
Privacy Enhanced | Privacy refers generally to the norms and practices that help to safeguard human autonomy, identity, and dignity. These norms and practices typically address freedom from intrusion, limiting observation, or individuals' agency to consent to disclosure or control of facets of their identities (e.g., body, data, reputation). Privacy values such as anonymity, confidentiality, and control generally should guide choices for AI system design, development, and deployment. | NIST |
Fair | Fairness in AI includes concerns for equality and equity by addressing issues such as harmful bias and discrimination. Standards of fairness can be complex and difficult to define because perceptions of fairness differ among cultures and may shift depending on application. Organizations' risk management efforts will be enhanced by recognizing and considering these differences. Systems in which harmful biases are mitigated are not necessarily fair. For example, systems in which predictions are somewhat balanced across demographic groups may still be inaccessible to individuals with disabilities or affected by the digital divide or may exacerbate existing disparities or systemic biases. | NIST |
Accountable and Transparent | Trustworthy AI depends upon accountability. Accountability presupposes transparency. Transparency reflects the extent to which information about an AI system and its outputs is available to individuals interacting with such a system - regardless of whether they are even aware that they are doing so. Meaningful transparency provides access to appropriate levels of information based on the stage of the AI lifecycle and tailored to the role or knowledge of AI actors or individuals interacting with or using the AI system. By promoting higher levels of understanding, transparency increases confidence in the AI system. | NIST |
Valid and Reliable | Validity and reliability for deployed AI systems are often assessed by ongoing testing or monitoring that confirms a system is performing as intended. Measurement of validity, accuracy, robustness, and reliability contribute to trustworthiness and should take into consideration that certain types of failures can cause greater harm. | NIST |
Conceptual Soundness | Involves assessing the quality of the model design and construction. It entails review of documentation and empirical evidence supporting the methods used and variables selected for the model. | SR 11-7 |
Ongoing Monitoring | Emphasizes the continuous evaluation of a model's performance after deployment. This involves tracking the model's outputs against real-world data, identifying any deviations or unexpected results, and assessing if the model's underlying assumptions or market conditions have changed. This ongoing process ensures the model remains reliable and trustworthy for decision-making. | SR 11-7 |
Outcomes Analysis | Comparison of model outputs to corresponding actual outcomes. Outcomes analysis typically relies on statistical tests or other quantitative measures. It can also include expert judgment to check the intuition behind the outcomes and confirm that the results make sense. | SR 11-7 |
You can see the specific evaluators related to each standard in the sections below.
Evaluators overview
Evaluator | LLM | RAG | J | Q | EA | RC | AA | C | GPU |
---|---|---|---|---|---|---|---|---|---|
Agent sanity check | ✓ | ✓ | ✓ | ✓ | |||||
Answer correctness | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Answer relevancy | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Answer relevancy (sentence s.) | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Answer semantic similarity | ✓ | ✓ | ✓ | ✓ | |||||
Answer s. sentence similarity | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
BLEU | ✓ | ✓ | ✓ | ✓ | |||||
Classification | ✓ | ✓ | ✓ | ✓ | |||||
Contact information leakage | ✓ | ✓ | ✓ | ✓ | |||||
Context mean reciprocal rank | ✓ | ✓ | ✓ | ✓ | |||||
Context precision | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Context relevancy | ✓ | ✓ | ✓ | ✓ | |||||
Context relevancy (s.r. & p.) | ✓ | ✓ | ✓ | ||||||
Context recall | ✓ | ✓ | ✓ | ✓ | |||||
Fact-check (agent-based) | ✓ | ✓ | A | ✓ | |||||
Faithfulness | ✓ | ✓ | |||||||
Fairness bias | ✓ | ✓ | ✓ | ✓ | |||||
Machine Translation (GPTScore) | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Question Answering (GPTScore) | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Summarization with ref. s. | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Summarization without ref. s. | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Groundedness | ✓ | ✓ | ✓ | ✓ | |||||
Hallucination | ✓ | ✓ | ✓ | ✓ | |||||
Language mismatch (Judge) | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
BYOP: Bring your own prompt | ✓ | ✓ | ✓ | ||||||
PII leakage | ✓ | ✓ | ✓ | ||||||
JSon Schema | ✓ | ✓ | ✓ | ||||||
Encoding guardrail | ✓ | ✓ | ✓ | ✓ | |||||
Perplexity | ✓ | ✓ | ✓ | ✓ | |||||
ROUGE | ✓ | ✓ | ✓ | ✓ | |||||
Ragas | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Summarization (c. and f.) | ✓ | ✓ | ✓ | ✓ | |||||
Sexism (Judge) | ✓ | ✓ | ✓ | ✓ | |||||
Sensitive data leakage | ✓ | ✓ | ✓ | ||||||
Step alignment & completeness | ✓ | ✓ | ✓ | ✓ | |||||
Stereotypes (Judge) | ✓ | ✓ | ✓ | ✓ | ✓ | ||||
Summarization (Judge) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Toxicity | ✓ | ✓ | ✓ | ✓ | |||||
Text matching | ✓ | ✓ | ✓ | ✓ |
Legend:
- LLM: evaluates Large Language Model (LLM) models.
- RAG: evaluates Retrieval Augmented Generation (RAG) models.
- J: evaluator requires an LLM judge (✓) or agent (A).
- Q: evaluator requires question (prompt).
- EA: evaluator requires expected answer (ground truth).
- RC: evaluator requires retrieved context.
- AA: evaluator requires actual answer.
- C: evaluator requires condition(s).
- GPU: evaluator supports GPU acceleration.
Generation
Agent Sanity Check Evaluator
Question | Expected answer | Retrieved context | Actual answer | Conditions |
---|---|---|---|---|
✓ |
Agent Sanity Check Evaluator performs a basic check of the h2oGPTea agentic RAG/LLM system. The evaluator reviews the agent chat session for problems and inspects the artifacts created by the agent during its operation, verifying their integrity and sanity. This includes checking for the presence of expected files, validating their formats, and ensuring that the content meets predefined criteria. The evaluator helps identify potential issues in the agent's workflow, ensuring that it operates correctly and reliably.
- Compatibility: RAG and LLM evaluation.
- Supported systems: h2oGPTea agentic RAG/LLM system.
Method
- Looks for artifacts created by the agent during its operation, as prepared by the test lab completion.
- Performs sanity checks on the artifacts to ensure they meet expected standards: linting (JSon), content validation (non-empty content, non-empty pages), expected structure (for directories and files), and field values.
- Creates problems and insights if any issues are found during the sanity checks.
- Calculates a sanity score based on the results of the checks - the percentage of artifacts meeting quality standards - providing an overall assessment of the agent's performance (see the sketch below).
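A minimal sketch of how such a sanity score could be computed, assuming the artifacts are files on disk and using only two illustrative checks (non-empty content and JSon parseability); the helper names are hypothetical and not part of the Eval Studio API:

```python
import json
from pathlib import Path

def artifact_passes(path: Path) -> bool:
    """Illustrative checks: the file exists, is non-empty, and JSon files parse."""
    if not path.is_file() or path.stat().st_size == 0:
        return False
    if path.suffix == ".json":
        try:
            json.loads(path.read_text())
        except json.JSONDecodeError:
            return False
    return True

def agent_sanity(artifact_paths: list[Path]) -> float:
    """Fraction of artifacts meeting the quality checks, in [0.0, 1.0]."""
    if not artifact_paths:
        return 0.0
    return sum(artifact_passes(p) for p in artifact_paths) / len(artifact_paths)
```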
Metrics calculated by the evaluator
- Agent Sanity (float)
- The quality and integrity of the agent-created artifacts.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.
Answer Correctness Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
Answer Correctness Evaluator assesses the accuracy of generated answers compared to ground truth. A higher score indicates a closer alignment between the generated answer and the expected answer (ground truth), signifying better correctness.
- Two weighted metrics + LLM judge.
- Compatibility: RAG and LLM evaluation.
- Based on RAGAs library
Method
- This evaluator measures answer correctness compared to ground truth as a weighted average of factuality and semantic similarity.
- Default weights are 0.75 for factuality and 0.25 for semantic similarity.
- The semantic similarity metric is evaluated using the Answer Semantic Similarity Evaluator.
- Factuality is evaluated as the F1-score of the LLM judge's answers: the judge's prompt analyzes the actual answer for statements and, for each statement, checks its presence in the expected answer:
- TP (true positive): statements present in both the actual and expected answers.
- FP (false positive): statements present in the actual answer only.
- FN (false negative): statements present in the expected answer only.
- The F1 score quantifies correctness based on the number of statements in each of the lists above (see the sketch below):
F1 score = |TP| / (|TP| + 0.5 * (|FP| + |FN|))
For more information, see the page on Answer Correctness in the official Ragas documentation.
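As a worked illustration of the formulas above, the following sketch combines the statement-level F1 score with a semantic similarity value using the default 0.75/0.25 weights; the statement counts (TP/FP/FN) would come from the LLM judge, which is not shown:

```python
def factuality_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over statements classified by the LLM judge."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0

def answer_correctness(tp, fp, fn, semantic_similarity,
                       w_factuality=0.75, w_similarity=0.25):
    """Weighted average of factuality (F1) and semantic similarity."""
    return w_factuality * factuality_f1(tp, fp, fn) + w_similarity * semantic_similarity

# Example: 3 statements matched, 1 only in the actual answer, 1 only in the expected answer.
print(answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.9))  # ~0.79
```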
Metrics calculated by the evaluator
- Answer correctness (float)
- The assessment of answer correctness metric involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Answer correctness metric encompasses two critical aspects:semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Answer Relevancy Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
The Answer Relevancy (retrieval+generation) evaluator assesses how pertinent the actual answer is to the given question. A lower score indicates an actual answer that is incomplete or contains redundant information.
- Mean cosine similarity of the original question and questions generated by the LLM judge.
- Compatibility: RAG and LLM evaluation.
- Based on RAGAs library.
Method
- The LLM judge is prompted to generate an appropriate question for the actual answer multiple times, and the mean cosine similarity of the generated questions with the original question is measured.
- The score will range between 0 and 1 most of the time, but this is not mathematically guaranteed because cosine similarity ranges from -1 to 1.
answer relevancy = mean(cosine_similarity(question, generate_questions))
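A minimal sketch of the aggregation step, assuming the judge-generated questions are already available; the embedding backend shown here is an assumption for illustration (RAGAs uses its own configured embeddings):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative embedding model

def answer_relevancy(question: str, generated_questions: list[str]) -> float:
    """Mean cosine similarity between the original question and judge-generated questions."""
    q = model.encode([question], normalize_embeddings=True)[0]
    g = model.encode(generated_questions, normalize_embeddings=True)
    # cosine similarity of normalized vectors is the dot product
    return float(np.mean(g @ q))
```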
Metrics calculated by the evaluator
- Answer relevancy (float)
- Answer relevancy metric (retrieval+generation) is assessing how pertinent the generated answer is to the given prompt. A lower score indicates answers which are incomplete or contain redundant information. This metric is computed using the question and the answer. Higher the better. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Answer Relevancy (Sentence Similarity)
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
The Answer Relevancy (Sentence Similarity) evaluator assesses how relevant the actual answer is by computing the similarity between the question and the actual answer sentences.
- Compatibility: RAG and LLM evaluation.
Method
- The metric is calculated as the maximum similarity between the question and the actual answer sentences:
answer relevancy = max( {S(emb(question), emb(a)): for all a in actual answer} )
- Where:
- A is the actual answer.
- a is a sentence in the actual answer.
- emb(a) is a vector embedding of the actual answer sentence.
- emb(question) is a vector embedding of the question.
- S(q, a) is 1 - cosine distance between the question q and the actual answer sentence a.
- The evaluator uses embeddings BAAI/bge-small-en where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).
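A minimal sketch of this metric using the sentence-transformers library and the BAAI/bge-small-en model mentioned above; the naive sentence splitting is an illustrative assumption:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en")

def answer_relevancy_sentence(question: str, actual_answer: str) -> float:
    """Maximum similarity between the question and the actual answer sentences."""
    sentences = [s.strip() for s in actual_answer.split(".") if s.strip()]
    q = model.encode([question], normalize_embeddings=True)[0]
    a = model.encode(sentences, normalize_embeddings=True)
    # S(q, a) = cosine similarity = 1 - cosine distance for normalized embeddings
    return float(np.max(a @ q))
```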
Metrics calculated by the evaluator
- Answer relevancy (float)
- Answer Relevancy metric determines whether the RAG outputs relevant information by comparing the actual answer sentences to the question.
- A higher score is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Answer Semantic Similarity Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
Answer Semantic Similarity Evaluator assesses the semantic resemblance between the generated answer and the expected answer (ground truth).
- Cross-encoder model or embeddings + cosine similarity.
- Compatibility: RAG and LLM evaluation.
- Based on RAGAs library
Method
- Evaluator utilizes a cross-encoder model to calculate the semantic similarity score between the actual answer and expected answer. A cross-encoder model takes two text inputs and generates a score indicating how similar or relevant they are to each other.
- Method is configurable, and the evaluator defaults to embeddings BAAI/bge-small-en-v1.5 (where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI)) and cosine similarity as the similarity metric. In this case, evaluator does vectorization of the ground truth and generated answers and calculates the cosine similarity between them.
- In general, cross-encoder models (like HuggingFace Sentence Transformers) tend to have higher accuracy in complex tasks, but are slower. Embeddings with cosine similarity tend to be faster, more scalable, but less accurate for nuanced similarities.
See also:
- Paper "Semantic Answer Similarity for Evaluating Question Answering Models": https://arxiv.org/pdf/2108.06130.pdf
- 3rd party metric documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
- 3rd party library used: https://github.com/explodinggradients/ragas
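The two configurations can be sketched as follows; the cross-encoder model name is only an example of a publicly available similarity cross-encoder, not necessarily the one Eval Studio uses:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer

expected = "Paris is the capital of France."
actual = "France's capital city is Paris."

# Cross-encoder path: scores the (expected, actual) pair jointly.
cross = CrossEncoder("cross-encoder/stsb-roberta-base")  # example STS cross-encoder
cross_score = cross.predict([(expected, actual)])[0]

# Embedding path (the default described above): encode both answers, then cosine similarity.
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
e = embedder.encode([expected, actual], normalize_embeddings=True)
cosine_score = float(e[0] @ e[1])
```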
Metrics calculated by the evaluator
- Answer similarity (float)
- The concept of answer semantic similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth. Semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Answer Semantic Sentence Similarity Evaluator
Question | Expected answer | Retrieved context | Actual answer | Conditions |
---|---|---|---|---|
✓ | ✓ |
Answer Semantic Sentence Similarity Evaluator assesses the semantic resemblance between the sentences from the actual answer and the expected answer (ground truth).
Method
- Method is configurable, and the evaluator defaults to embeddings BAAI/bge-small-en-v1.5 (where BGE stands for "BAAI General Embedding", which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI)) and cosine similarity as the similarity metric. In this case, the evaluator vectorizes the ground truth sentences and the actual answer sentences and calculates the cosine similarity between them.
answer similarity = {max({S(emb(a), emb(e)) : for all e in expected answer}): for all a in actual answer}
mean answer similarity = mean(answer similarity)
min answer similarity = min(answer similarity)
- Where:
- emb(e) is the embedding of a sentence from the expected answer.
- emb(a) is the embedding of a sentence from the actual answer.
- S(emb(e), emb(a)) is the cosine similarity between the embeddings of the expected answer sentence and the actual answer sentence.
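A minimal sketch of the mean/min aggregation above, with naive sentence splitting as an illustrative assumption:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def sentence_similarities(actual_answer: str, expected_answer: str) -> tuple[float, float]:
    """For each actual-answer sentence, take its best match among the expected-answer
    sentences; return the mean and min of those best-match similarities."""
    split = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    a = model.encode(split(actual_answer), normalize_embeddings=True)
    e = model.encode(split(expected_answer), normalize_embeddings=True)
    best_per_actual = (a @ e.T).max(axis=1)  # max over expected sentences, per actual sentence
    return float(best_per_actual.mean()), float(best_per_actual.min())
```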
Metrics calculated by the evaluator
- Mean Answer similarity (float)
- Mean cosine similarity of sentences from actual output and expected output.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
- Min Answer similarity (float)
- Minimum cosine similarity of sentences from actual output and expected output.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.
Fact-check Agent-based Evaluator
Question | Expected answer | Retrieved context | Actual answer | Conditions |
---|---|---|---|---|
✓ | ✓ |
Fact-check evaluator evaluates the actual answer and determines whether it contains false information. The evaluator uses a fact-checking agent to determine the factual correctness.
- Agent based false information detection.
- Compatibility: RAG and LLM models.
Method
- The evaluator prompts the h2oGPTe hosted agent to fact-check the actual answer.
- The evaluator checks every test case for the presence of false information.
- Prompt used by the evaluator:
You are agent whose role is to evaluate text of the ACTUAL ANSWER.
Instructions for WHAT should be evaluated:
[BEGIN EVALUATION INSTRUCTIONS]
Fact-check the text and determine whether the text contains false information.
Use the internet search to verify the information if needed.
[END EVALUATION INSTRUCTIONS]
Instructions how to return the evaluation result:
- provide the evaluation result as JSon with the following structure:
{{
"answer": string,
"evaluation_score": float,
"evaluation_summary": string
}}
- evaluation_score: is the float number between 0.0 and 1.0 where 1.0 means
that the ACTUAL ANSWER passed the evaluation and 0.0 means that the ACTUAL
ANSWER failed the evaluation
- evaluation_summary: is the summary of the evaluation result which briefly
provides justification for the evaluation score and describes how was the
actual answer evaluated
ACTUAL ANSWER data:
[BEGIN ACTUAL ANSWER]
{actual_answer}
[END ACTUAL ANSWER]
If it may help, use QUESTION which was answered by the ACTUAL ANSWER:
[BEGIN QUESTION]
{question}
[END QUESTION]
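Given the response format requested by this prompt, the agent's verdict can be parsed along these lines (a sketch, not the evaluator's actual parsing code):

```python
import json

def parse_agent_evaluation(agent_response: str) -> tuple[float, str]:
    """Parse the JSon structure the prompt above asks the agent to return:
    answer, evaluation_score, evaluation_summary."""
    result = json.loads(agent_response)
    return float(result["evaluation_score"]), result.get("evaluation_summary", "")

score, summary = parse_agent_evaluation(
    '{"answer": "...", "evaluation_score": 1.0, "evaluation_summary": "No false claims found."}'
)
```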
Metrics calculated by the evaluator
- Fact-check (float)
- Fact-check score indicating the extent to which the actual answer is free of false information (1.0 means no false information was detected, 0.0 means the answer failed the fact-check). The evaluator uses h2oGPTe agents to determine whether the actual answer contains false information.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
- Primary metric.
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
- LLM models with best and worst context retrieval performance.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
agent_host_connection_config_key
- Configuration key of the h2oGPTe agent host connection to be used for the evaluation. If not specified, the first h2oGPTe connection will be used.
agent_llm_model_name
- Name of the LLM model to be used by the h2oGPTe hosted agent for the evaluation. If not specified, Claude Sonnet, GPT-4o, the best available Llama model, or the first LLM model will be used.
agent_eval_h2ogpte_collection_id
- Collection ID of the h2oGPTe to be used for the evaluation. If not specified, new collection with empty corpus will be created.
max_dataset_rows
- Maximum number of dataset rows allowed to be evaluated by the evaluator. This is a protection against slow and expensive evaluations.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.
Faithfulness Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
Faithfulness Evaluator measures the factual consistency of the generated answer with the given context.
- LLM finds claims in the actual answer and ensures that these claims are present in the retrieved context.
- Compatibility: RAG only evaluation.
- Based on RAGAs library
Method
- Faithfulness is calculated based on the actual answer and retrieved context.
- The evaluation assesses whether the claims made in the actual answer can be inferred from the retrieved context, avoiding any hallucinations.
- The score is determined by the ratio of the actual answer's claims present in the context to the total number of claims in the answer.
faithfulness = number of claims inferable from the context / claims in the answer
See also:
- 3rd party metric documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
- 3rd party library used: https://github.com/explodinggradients/ragas
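As a small worked example of the ratio above, assuming the LLM judge has already labeled each claim in the actual answer as supported by the context or not:

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """claim_supported[i] is True if claim i from the actual answer can be
    inferred from the retrieved context (as judged by the LLM, not shown here)."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

print(faithfulness([True, True, False]))  # 2 of 3 claims grounded -> ~0.67
```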
Metrics calculated by the evaluator
- Faithfulness (float)
- Faithfulness (generation) metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and retrieved context. Higher is better. The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context: (number of claims inferable from the context / claims in the answer).
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Explanations created by the evaluator:
- llm-eval-results: Frame with the evaluation results.
- llm-heatmap-leaderboard: Leaderboards with models and prompts by metric values.
- work-dir-archive: Zip archive with evaluator artifacts.
Groundedness Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
Groundedness (Semantic Similarity) Evaluator assesses the groundedness of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the actual answer contains factually correct information by comparing the actual answer to the retrieved context, as the actual answer generated by the LLM model must be based on the retrieved context.
Method
- The groundedness metric is calculated as:
groundedness = min( { max( {S(emb(a), emb(c)): for all c in C} ): for all a in A } )
- Where:
- A is the actual answer.
- emb(a) is a vector embedding of the actual answer sentence.
- C is the context retrieved by the RAG model.
- emb(c) is a vector embedding of the context chunk sentence.
- S(a, c) is 1 - cosine distance between the actual answer sentence a and the retrieved context sentence c.
- The evaluator uses embeddings BAAI/bge-small-en where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).
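A minimal sketch of the min-max aggregation above, assuming the answer and context sentences have already been split out and using the embeddings mentioned above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en")

def groundedness(actual_answer_sentences: list[str], context_sentences: list[str]) -> float:
    a = model.encode(actual_answer_sentences, normalize_embeddings=True)
    c = model.encode(context_sentences, normalize_embeddings=True)
    sims = a @ c.T                    # S(a, c) for every sentence pair
    best_support = sims.max(axis=1)   # best supporting context sentence per answer sentence
    return float(best_support.min())  # the least grounded answer sentence drives the score
```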
Metrics calculated by the evaluator
- Groundedness (float)
- Groundedness metric determines whether the RAG outputs factually correct information by comparing the actual answer to the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates facts that are not supported by the context.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
- Primary metric.
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
- If the actual answer is so small that the embedding ends up empty then the evaluator will produce a problem.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, where most of the evaluated LLM models hallucinated.
- The least grounded actual answer sentence (in case the output metric score is below the threshold).
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Hallucination Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
Hallucination Evaluator assesses the hallucination of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the actual output is factually correct information by comparing the actual output to the retrieved context - as the actual output generated by the LLM model must be based on the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates or discards facts that are not supported by the context.
- Cross-encoder model assessing retrieved context and actual answer similarity.
- Compatibility: RAG evaluation only.
Method
- The evaluation uses Vectara hallucination evaluation cross-encoder model to calculate a score that measures the extent of hallucination in the generated answer from the retrieved context.
See also:
- 3rd party model used: https://huggingface.co/vectara/hallucination_evaluation_model
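A usage sketch of the Vectara model; the loading API shown here (sentence-transformers CrossEncoder) is an assumption that may differ between model versions, so consult the model card for the current recommended usage:

```python
from sentence_transformers import CrossEncoder

retrieved_context = "The moon has no atmosphere."
actual_answer = "The moon has a thick atmosphere."

# Assumption: the model loads as a cross-encoder; check the model card for current usage.
model = CrossEncoder("vectara/hallucination_evaluation_model")
score = model.predict([(retrieved_context, actual_answer)])[0]
# Higher score typically indicates the answer is more consistent with the context.
```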
Metrics calculated by the evaluator
- Hallucination (float)
- Hallucination metric determines whether the RAG outputs factually correct information by comparing the actual output to the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates facts that are not supported by the context.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
- Primary metric.
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, where most of the evaluated LLM models hallucinated.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
JSon Schema Evaluator
Question | Expected answer | Retrieved context | Actual answer | Conditions |
---|---|---|---|---|
✓ |
JSon Schema evaluator checks the structure and content of the JSon data generated by the LLM/RAG model:
- JSon Schema validation of actual answers: JSon Schema specification
- Compatibility: RAG and LLM.
Method
- JSon Schema Evaluator checks the structure and content of the JSon data generated by LLM/RAG models.
- The evaluation utilizes a JSon Schema validation library to ensure the generated JSon adheres to the expected schema.
- Evaluator checks every test case - actual answer - for compliance with the JSon schema.
- If JSon Schema is not provided, i.e. it is set to {}, then the evaluator checks only parseability of the actual answers as JSon.
- The result of the test case evaluation is a boolean.
- Models are compared based on the number of test cases where they succeeded.
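A minimal sketch of this check using the jsonschema library; the example schema is hypothetical:

```python
import json
from jsonschema import validate, ValidationError

schema = {"type": "object", "properties": {"name": {"type": "string"}}, "required": ["name"]}

def valid_json(actual_answer: str, json_schema: dict) -> bool:
    try:
        data = json.loads(actual_answer)             # parseability check
        validate(instance=data, schema=json_schema)  # schema check; {} accepts any valid JSon
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(valid_json('{"name": "H2O"}', schema))  # True
print(valid_json('{"name": 42}', schema))     # False - schema violation
print(valid_json("not json", {}))             # False - not parseable
```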
Metrics calculated by the evaluator
- Valid JSon (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for JSon Schema compliance.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Primary metric.
- Invalid JSon (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for JSon Schema compliance.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Invalid retrieved JSon (float)
- JSon fragments in RAG's retrieved contexts are not JSon Schema validated.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Invalid generated JSon (float)
- Percentage of outputs generated by RAG/LLM that failed JSon Schema validation (equivalent to the model failures).
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
Insights diagnosed by the evaluator
- Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
- LLM models with best and worst context retrieval performance.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
json_schema
- JSon Schema specification used to validate the structure and content of the generated JSon data. Set it to {} to skip validation and check only parseability of the actual answers as JSon.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.
Language Mismatch Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
Language mismatch evaluator tries to determine whether the language of the question (prompt/input) and the actual answer is the same.
- LLM judge based language detection.
- Compatibility: RAG and LLM models.
Method
- The evaluator prompts the LLM judge to compare languages in the question and actual answer.
- Evaluator checks every test case. The result of the test case evaluation is a boolean.
- LLM models are compared based on the number of test cases where they succeeded.
Metrics calculated by the evaluator
- Same language (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for language mismatch metric which detects whether the language of the input and output is the same.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Primary metric.
- Language mismatch (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for language mismatch metric which detects whether the language of the input and output is the same.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Language mismatch retrieval failures (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for language mismatch metric which detects whether the language of the input and output is the same.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Language mismatch generation failures (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for language mismatch metric which detects whether the language of the input and output is the same.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Language mismatch parsing failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for language mismatch metric which detects whether the language of the input and output is the same.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
- LLM models with best and worst context retrieval performance.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Looping Detection Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ |
Looping detection evaluator tries to find out whether the LLM generation went into a loop.
- Compatibility: RAG and LLM models.
Method
- This evaluator provides three metrics:
unique sentences = number of unique sentences / number of all sentences
longest repeated substring = (longest repeated substring * frequency of this substring) / length of the text
compression ratio = length in bytes of compressed string / length in bytes of original string
Where:
- unique sentences omits sentences shorter than 10 characters.
- compression ratio is calculated using Python's zlib and the maximum compression level (9).
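A minimal sketch of the unique-sentences and compression-ratio metrics (the longest-repeated-substring metric is omitted for brevity); the naive sentence splitting is an illustrative assumption:

```python
import zlib

def unique_sentence_ratio(text: str) -> float:
    sentences = [s.strip() for s in text.split(".") if len(s.strip()) >= 10]
    return len(set(sentences)) / len(sentences) if sentences else 1.0

def compression_ratio(text: str) -> float:
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw) if raw else 1.0

looping_output = "The answer is 42. " * 20
print(unique_sentence_ratio(looping_output))  # low - a single sentence repeated many times
print(compression_ratio(looping_output))      # low - repetitive text compresses very well
```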
Metrics calculated by the evaluator
- Unique Sentences (float)
- Unique sentences metric is the ratio number of unique sentences / number of all sentences, where sentences shorter than 10 characters are omitted.
- Higher score is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
- Primary metric.
- Longest Repeated Substring (float)
- Longest repeated substring metric is the ratio (longest repeated substring * frequency of this substring) / length of the text.
- Lower score is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
- Compression Ratio (float)
- Ratio length in bytes of compressed string / length in bytes of original string. Compression is done using Python's zlib and the maximum compression level (9).
- Higher score is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.75
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Machine Translation (GPTScore) Evaluator
Input | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ |
The GPTScore evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.
- LLM judge based evaluation.
- Compatibility: RAG and LLM models.
Method
- The core idea of GPTScore is that a generative pre-trained model will assign a higher probability of high-quality generated text following a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens that follow
In other words,
. - Instructions used by the evaluator are:
- Accuracy:
Rewrite the following text with its core information and consistent facts: {ref_hypo} In other words, {hypo_ref}
- Fluency:
Rewrite the following text to make it more grammatical and well-written: {ref_hypo} In other words, {hypo_ref}
- Multidimensional quality metrics:
Rewrite the following text into high-quality text with its core information: {ref_hypo} In other words, {hypo_ref}
- Accuracy:
- Each instruction is evaluated twice - first it uses the expected answer for
{ref_hypo}
and the actual answer for{hypo_ref}
, and then it is reversed. The calculated scores are then averaged. - The lower the metric value, the better.
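A minimal sketch of the scoring step with a generic causal LLM from the transformers library (GPT-2 here is only a stand-in for the actual scoring model): it computes the average negative log likelihood of the tokens that follow "In other words,":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gptscore(instruction_and_ref: str, hypothesis: str) -> float:
    """Average negative log likelihood of the hypothesis tokens that follow the
    instruction (which ends with '... In other words,'). Lower is better."""
    prefix_ids = tokenizer(instruction_and_ref, return_tensors="pt").input_ids
    target_ids = tokenizer(" " + hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the hypothesis tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return float(loss)

expected_answer = "The cat sat on the mat."
actual_answer = "A cat was sitting on the mat."
instr = "Rewrite the following text with its core information and consistent facts: {} In other words,"
# Each instruction is scored in both directions and the two values are averaged.
accuracy = (gptscore(instr.format(expected_answer), actual_answer)
            + gptscore(instr.format(actual_answer), expected_answer)) / 2
```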
Metrics calculated by the evaluator
- Accuracy (float)
- Are there inaccuracies, missing, or unfactual content in the generated text?
- Lower score is better.
- Range:
[0, inf]
- Default threshold:
inf
- This is the primary metric.
- Fluency (float)
- Is the generated text well-written and grammatical?
- Lower score is better.
- Range:
[0, inf]
- Default threshold:
inf
- Multidimensional Quality Metrics (float)
- How is the overall quality of the generated text?
- Lower score is better.
- Range:
[0, inf]
- Default threshold:
inf
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Best performing LLM model based on the evaluated primary metric.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Evaluator parameters
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
Parameterizable BYOP Evaluator
Question | Expected answer | Retrieved context | Actual answer | Constraints |
---|---|---|---|---|
✓ | ✓ | ✓ | ✓ |
Bring Your Own Prompt (BYOP) evaluator uses a user-supplied custom prompt and an LLM judge to evaluate LLMs/RAGs. The current BYOP implementation supports only binary problems, thus the prompt has to guide the judge to output either "true" or "false".
Method
- User provides a custom prompt and an LLM judge.
- Custom prompt may use question, expected answer, retrieved context and/or actual answer.
- The evaluator prompts the LLM judge using the custom prompt provided by the user.
- Evaluator checks every test case. The result of the test case evaluation is a boolean.
- LLM models are compared based on the number of test cases where they succeeded.
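An example of what a BYOP prompt and verdict parsing might look like; the template wording and placeholder names are hypothetical, the only requirement stated above is that the judge is guided to answer "true" or "false":

```python
# Hypothetical custom prompt template; placeholders stand for test-case fields.
byop_prompt = """You are a strict evaluator.
QUESTION: {question}
ACTUAL ANSWER: {actual_answer}
Does the actual answer avoid giving financial advice? Reply with exactly "true" or "false"."""

def parse_judge_verdict(judge_output: str) -> bool:
    """Interpret the judge's binary verdict for one test case."""
    return judge_output.strip().lower().startswith("true")
```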
Metrics calculated by the evaluator
- Model passes (float)
- Percentage of successfully evaluated RAG/LLM outputs.
- Higher is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Primary metric.
- Model failures (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Model retrieval failures (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Model generation failures (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures).
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
- Model parse failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score.
- Lower is better.
- Range:
[0.0, 1.0]
- Default threshold:
0.5
Problems reported by the evaluator
- If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
- If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
Insights diagnosed by the evaluator
- Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
- LLM models with best and worst context retrieval performance.
- The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.
Explanations created by the evaluator:
- llm-eval-results: Frame with the evaluation results.
- llm-bool-leaderboard: LLM failure leaderboard with data and formats for boolean metrics.
- work-dir-archive: Zip archive with evaluator artifacts.