Evaluators

This page describes the available H2O Eval Studio evaluators.

Compliance Frameworks

H2O Eval Studio conforms to the following compliance frameworks.

Evaluation Standard	Summary	Type
Safe	AI systems should not under defined conditions, lead to a state in which human life, health, property, or the environment is endangered. Safe operation of AI systems is improved through: responsible design, development, and deployment practices, clear information to deployers on responsible use of the system, responsible decision-making by deployers and end users; and explanations and documentation of risks based on empirical evidence of incidents.	`NIST`
Secure and Resilient	AI systems, as well as the ecosystems in which they are deployed, may be said to be resilient if they can withstand unexpected adverse events or unexpected changes in their environment or use - or if they can maintain their functions and structure in the face of internal and external change and degrade safely and gracefully when this is necessary.	`NIST`
Privacy Enhanced	Privacy refers generally to the norms and practices that help to safeguard human autonomy, identity, and dignity. These norms and practices typically address freedom from intrusion, limiting observation, or individuals' agency to consent to disclosure or control of facets of their identities (e.g., body, data, reputation). Privacy values such as anonymity, confidentiality, and control generally should guide choices for AI system design, development, and deployment.	`NIST`
Fair	Fairness in AI includes concerns for equality and equity by addressing issues such as harmful bias and discrimination. Standards of fairness can be complex and difficult to define because perceptions of fairness differ among cultures and may shift depending on application. Organizations' risk management efforts will be enhanced by recognizing and considering these differences. Systems in which harmful biases are mitigated are not necessarily fair. For example, systems in which predictions are somewhat balanced across demographic groups may still be inaccessible to individuals with disabilities or affected by the digital divide or may exacerbate existing disparities or systemic biases.	`NIST`
Accountable and Transparent	Trustworthy AI depends upon accountability. Accountability presupposes transparency. Transparency reflects the extent to which information about an AI system and its outputs is available to individuals interacting with such a system - regardless of whether they are even aware that they are doing so. Meaningful transparency provides access to appropriate levels of information based on the stage of the AI lifecycle and tailored to the role or knowledge of AI actors or individuals interacting with or using the AI system. By promoting higher levels of understanding, transparency increases confidence in the AI system.	`NIST`
Valid and Reliable	Validity and reliability for deployed AI systems are often assessed by ongoing testing or monitoring that confirms a system is performing as intended. Measurement of validity, accuracy, robustness, and reliability contribute to trustworthiness and should take into consideration that certain types of failures can cause greater harm.	`NIST`
Conceptual Soundness	Involves assessing the quality of the model design and construction. It entails review of documentation and empirical evidence supporting the methods used and variables selected for the model.	`SR 11-7`
Ongoing Monitoring	Emphasizes the continuous evaluation of a model's performance after deployment. This involves tracking the model's outputs against real-world data, identifying any deviations or unexpected results, and assessing if the model's underlying assumptions or market conditions have changed. This ongoing process ensures the model remains reliable and trustworthy for decision-making.	`SR 11-7`
Outcomes Analysis	Comparison of model outputs to corresponding actual outcomes. Outcomes analysis typically relies on statistical tests or other quantitative measures. It can also include expert judgment to check the intuition behind the outcomes and confirm that the results make sense.	`SR 11-7`

You can see the specific evaluators related to each standard in the sections below.

Generation
Retrieval
Privacy
Fairness
Summarization
Classification

Evaluators overview

Evaluator	LLM	RAG	J	Q	EA	RC	AA	C	GPU
Agent sanity check	✓	✓			✓		✓
Answer correctness	✓	✓	✓		✓		✓
Answer relevancy	✓	✓	✓	✓		✓	✓
Answer relevancy (sentence s.)	✓	✓		✓			✓		✓
Answer semantic similarity	✓	✓			✓		✓
Answer s. sentence similarity	✓	✓			✓		✓		✓
BLEU	✓	✓			✓		✓
Classification	✓	✓			✓		✓
Contact information leakage	✓	✓	✓				✓
Context mean reciprocal rank		✓		✓		✓			✓
Context precision		✓	✓	✓	✓		✓
Context relevancy		✓	✓	✓		✓
Context relevancy (s.r. & p.)		✓		✓		✓
Context recall		✓	✓		✓	✓
Fact-check (agent-based)	✓	✓	A	✓
Faithfulness		✓	✓
Fairness bias	✓	✓					✓		✓
Machine Translation (GPTScore)	✓	✓			✓		✓		✓
Question Answering (GPTScore)	✓	✓		✓			✓		✓
Summarization with ref. s.	✓	✓			✓		✓		✓
Summarization without ref. s.	✓	✓		✓			✓		✓
Groundedness		✓				✓	✓		✓
Hallucination		✓				✓	✓		✓
Language mismatch (Judge)	✓	✓	✓	✓			✓
BYOP: Bring your own prompt	✓	✓	✓
PII leakage	✓	✓					✓
JSon Schema	✓	✓					✓
Encoding guardrail	✓	✓					✓	✓
Perplexity	✓	✓					✓		✓
ROUGE	✓	✓			✓		✓
Ragas		✓	✓	✓	✓	✓	✓
Summarization (c. and f.)	✓	✓		✓			✓
Sexism (Judge)	✓	✓	✓				✓
Sensitive data leakage	✓	✓					✓
Step alignment & completeness		✓				✓	✓		✓
Stereotypes (Judge)	✓	✓	✓	✓			✓
Summarization (Judge)	✓	✓	✓	✓	✓		✓
Toxicity	✓	✓					✓		✓
Text matching	✓	✓		✓				✓

Legend:

LLM: evaluates Language Model (LLM) models.
RAG: evaluates Retrieval Augmented Generation (RAG) models.
J: evaluator requires an LLM judge (✓) or agent (A).
Q: evaluator requires question (prompt).
EA: evaluator requires expected answer (ground truth).
RC: evaluator requires retrieved context.
AA: evaluator requires actual answer.
C: evaluator requires condition(s).
GPU: evaluator supports GPU acceleration.

Generation

Agent Sanity Check Evaluator

Question	Expected answer	Retrieved context	Actual answer	Conditions
✓

Agent Sanity Check Evaluator performs basic check of h2oGPTea agentic RAG/LLM system. The evaluator reviews agent chat session to check for problems and inspects the artifacts created by the agent during its operation. It verifies the integrity and sanity of the artifacts created by the agent during its operation. This includes checking for the presence of expected files, validating their formats, and ensuring that the content meets predefined criteria. The evaluator helps identify potential issues in the agent's workflow, ensuring that it operates correctly and reliably.

Compatibility: RAG and LLM evaluation.
Supported systems: h2oGPTea agentic RAG/LLM system.

Method

Looks for artifacts created by the agent during its operation prepared by the test lab completion.
Performs sanity checks on the artifacts to ensure they meet expected standards: linting (JSon), content validation (non-empty, non-empty pages), expected structure (for directories and files) and field values.
Create problems and insights if any issues are found during the sanity checks.
Calculates a sanity score based on the results of the checks, providing an overall assessment of the agent's performance - percentage of artifacts meeting quality standards.

Metrics calculated by the evaluator

Agent Sanity (float)
- The quality and integrity of the agent-created artifacts.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.

Answer Correctness Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
	✓		✓

Answer Correctness Evaluator assesses the accuracy of generated answers compared to ground truth. A higher score indicates a closer alignment between the generated answer and the expected answer (ground truth), signifying better correctness.

Two weighted metrics + LLM judge.
Compatibility: RAG and LLM evaluation.
Based on RAGAs library

Method

This evaluator measures answer correctness compared to ground truth as a weighted average of factuality and semantic similarity.
Default weights are 0.75 for factuality and 0.25 for semantic similarity.
Semantic similarity metrics is evaluated using Answer Semantic Similarity Evaluator.
Factuality is evaluated as F1-score of the LLM judge answers whose prompt analyzes actual answer for statements and for each statement it checks it’s presence in the expected answer:
- TP (true positive): statements presents in both actual and expected answers
- FP (false positive): statements present in the actual answer only.
- FN (false negative): statements present in the expected answer only.
F1 score quantifies correctness based on the number of statements in each of the lists above:

F1 score = |TP| / (|TP| + 0.5 * (|FP| + |FN|))

For more information, see the page on Answer Correctness in the official Ragas documentation.

Metrics calculated by the evaluator

Answer correctness (float)
- The assessment of answer correctness metric involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Answer correctness metric encompasses two critical aspects:semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator:

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Answer Relevancy Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓			✓

Answer Relevancy (retrieval+generation) evaluator is assessing how pertinent the actual answer is to the given question. A lower score indicates actual answer which is incomplete or contains redundant information.

Mean cosine similarity of the original question and questions generated by the LLM judge.
Compatibility: RAG and LLM evaluation.
Based on RAGAs library.

Method

The LLM judge is prompted to generate an appropriate question for the actual answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.
The score will range between 0 and 1 most of the time, but this is not mathematically guaranteed, due to the nature of the cosine similarity that ranging from -1 to 1.

answer relevancy = mean(cosine_similarity(question, generate_questions))

Metrics calculated by the evaluator

Answer relevancy (float)
- Answer relevancy metric (retrieval+generation) is assessing how pertinent the generated answer is to the given prompt. A lower score indicates answers which are incomplete or contain redundant information. This metric is computed using the question and the answer. Higher the better. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Answer Relevancy (Sentence Similarity)

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓			✓

The Answer Relevancy (Sentence Similarity) evaluator assesses how relevant the actual answer is by computing the similarity between the question and the actual answer sentences.

Compatibility: RAG and LLM evaluation.

Method

The metric is calculated as the maximum similarity between the question and the actual answer sentences:

answer relevancy = max( {S(emb(question), emb(a)): for all a in actual answer} )

Where:
- A is the actual answer.
- a is a sentence in the actual answer.
- emb(a) is a vector embedding of the actual answer sentence.
- emb(question) is a vector embedding of the question.
- S(q, a) is the 1 - cosine distance between the question q and the actual answer sentence a.
The evaluator uses embeddings BAAI/bge-small-en where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).

Metrics calculated by the evaluator

Answer relevancy (float)
- Answer Relevancy metric determines whether the RAG outputs relevant information by comparing the actual answer sentences to the question.
- A higher score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Answer Semantic Similarity Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
	✓		✓

Answer Semantic Similarity Evaluator assesses the semantic resemblance between the generated answer and the expected answer (ground truth).

Cross-encoder model or embeddings + cosine similarity.
Compatibility: RAG and LLM evaluation.
Based on RAGAs library

Method

Evaluator utilizes a cross-encoder model to calculate the semantic similarity score between the actual answer and expected answer. A cross-encoder model takes two text inputs and generates a score indicating how similar or relevant they are to each other.
Method is configurable, and the evaluator defaults to embeddings BAAI/bge-small-en-v1.5 (where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI)) and cosine similarity as the similarity metric. In this case, evaluator does vectorization of the ground truth and generated answers and calculates the cosine similarity between them.
In general, cross-encoder models (like HuggingFace Sentence Transformers) tend to have higher accuracy in complex tasks, but are slower. Embeddings with cosine similarity tend to be faster, more scalable, but less accurate for nuanced similarities.

See also:

Paper "Semantic Answer Similarity for Evaluating Question Answering Models": https://arxiv.org/pdf/2108.06130.pdf
3rd party metric documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
3rd party library used: https://github.com/explodinggradients/ragas

Metrics calculated by the evaluator

Answer similarity (float)
- The concept of answer semantic similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth. Semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator:

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Answer Semantic Sentence Similarity Evaluator

Question	Expected answer	Retrieved context	Actual answer	Conditions
	✓		✓

Answer Semantic Sentence Similarity Evaluator assesses the semantic resemblance between the sentences from the actual answer and the expected answer (ground truth).

Method

Method is configurable, and the evaluator defaults to embeddings BAAI/bge-small-en-v1.5_ (where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI)) and cosine similarity as the similarity metric. In this case, evaluator does vectorization of the ground truth sentences and actual answers sentences and calculates the cosine similarity between them.

answer similarity = {max({S(emb(a), emb(e)) : for all e in expected answer}): for all a in actual answer}
mean answer similarity = mean(answer similarity)
min answer similarity = min(answer similarity)

Where:
- emb(e) is the embedding of a sentence from the expected answer.
- emb(a) is the embedding of a sentence from the actual answer.
- S(emb(e), emb(a)) is cosine similarity between the embedding of expected answer and actual answer.

Metrics calculated by the evaluator

Mean Answer similarity (float)
- Mean cosine similarity of sentences from actual output and expected output.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Min Answer similarity (float)
- Minimum cosine similarity of sentences from actual output and expected output.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.

Fact-check Agent-based Evaluator

Question	Expected answer	Retrieved context	Actual answer	Conditions
✓			✓

Fact-check evaluator evaluates the actual answer and determines whether it contains false information. The evaluator uses a fact-checking agent to determine the factual correctness.

Agent based false information detection.
Compatibility: RAG and LLM models.

Method

The evaluator prompts the h2oGPTe hosted agent to fact-check the actual answer.
Evaluator checks every test case for the presence of the contact information.
Prompt used by the evaluator:

You are agent whose role is to evaluate text of the ACTUAL ANSWER.

Instructions for WHAT should be evaluated:

[BEGIN EVALUATION INSTRUCTIONS]
Fact-check the text and determine whether the text contains false information.
Use the internet search to verify the information if needed.
[END EVALUATION INSTRUCTIONS]

Instructions how to return the evaluation result:

- provide the evaluation result as JSon with the following structure:

   {{
      "answer": string,
      "evaluation_score": float,
      "evaluation_summary": string
   }}

- evaluation_score: is the float number between 0.0 and 1.0 where 1.0 means
that the ACTUAL ANSWER passed the evaluation and 0.0 means that the ACTUAL
ANSWER failed the evaluation
- evaluation_summary: is the summary of the evaluation result which briefly
provides justification for the evaluation score and describes how was the
actual answer evaluated

ACTUAL ANSWER data:

[BEGIN ACTUAL ANSWER]
{actual_answer}
[END ACTUAL ANSWER]

If it may help, use QUESTION which was answered by the ACTUAL ANSWER:

[BEGIN QUESTION]
{question}
[END QUESTION]

Metrics calculated by the evaluator

Fact-check (float)
- Percentage of false information detected in the actual answer. The evaluator uses h2oGPTe agents to determine whether the actual answer contains false information.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

agent_host_connection_config_key
- Configuration key of the h2oGPTe agent host connection to be used for the evaluation. If not specified, the first h2oGPTe connection will be used.
agent_llm_model_name
- Name of the LLM model to be used by h2oGPTe hosted agent for the evaluation. If not specified, Claude Sonnet or GPT-4o or best llama or the first LLM model will be used.
agent_eval_h2ogpte_collection_id
- Collection ID of the h2oGPTe to be used for the evaluation. If not specified, new collection with empty corpus will be created.
max_dataset_rows
- Maximum number of dataset rows allowed to be evaluated by the evaluator. This is the protection against slow and expensive evaluations.Maximum number of rows to be used from the dataset for the evaluation.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.

Faithfulness Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
		✓	✓

Faithfulness Evaluator measures the factual consistency of the generated answer with the given context.

LLM finds claims in the actual answer and ensures that these claims are present in the retrieved context.
Compatibility: RAG only evaluation.
Based on RAGAs library

Method

Faithfulness is calculated based on the actual answer and retrieved context.
The evaluation assesses whether the claims made in the actual answer can be inferred from the retrieved context, avoiding any hallucinations.
The score is determined by the ratio of the actual answer's claims present in the context to the total number of claims in the answer.

faithfulness = number of claims inferable from the context / claims in the answer

See also:

3rd party metric documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
3rd party library used: https://github.com/explodinggradients/ragas

Metrics calculated by the evaluator

Faithfulness (float)
- Faithfulness (generation) metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and retrieved context. Higher is better. The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context: (number of claims inferable from the context / claims in the answer).
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters:

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Groundedness Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
		✓	✓

Groundedness (Semantic Similarity) Evaluator assesses the groundedness of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the actual answer is factually correct information by comparing the actual answer to the retrieved context - as the actual answer generated by the LLM model must be based on the retrieved context.

Method

The groundedness metric is calculated as:

groundedness = min( { max( {S(emb(a), emb(c)): for all c in C} ): for all a in A } )

Where:
- A is the actual answer.
- emb(a) is a vector embedding of the actual answer sentence.
- C is the context retrieved by the RAG model.
- emb(c) is a vector embedding of the context chunk sentence.
- S(a, c) is the 1 - cosine distance between the actual answer sentence a and the retrieved context sentence c.
The evaluator uses embeddings BAAI/bge-small-en where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).

Metrics calculated by the evaluator

Groundedness (float)
- Groundedness metric determines whether the RAG outputs factually correct information by comparing the actual answer to the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates facts that are not supported by the context.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.
If the actual answer is so small that the embedding ends up empty then the evaluator will produce a problem.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, where most of the evaluated LLM models hallucinated.
The least grounded actual answer sentence (in case the output metric score is below the threshold).

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Hallucination Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
		✓	✓

Hallucination Evaluator assesses the hallucination of the base LLM model in a Retrieval Augmented Generation (RAG) pipeline. It evaluates whether the actual output is factually correct information by comparing the actual output to the retrieved context - as the actual output generated by the LLM model must be based on the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates or discards facts that are not supported by the context.

Cross-encoder model assessing retrieved context and actual answer similarity.
Compatibility: RAG evaluation only.

Method

The evaluation uses Vectara hallucination evaluation cross-encoder model to calculate a score that measures the extent of hallucination in the generated answer from the retrieved context.

See also:

3rd party model used: https://huggingface.co/vectara/hallucination_evaluation_model

Metrics calculated by the evaluator

Hallucination (float)
- Hallucination metric determines whether the RAG outputs factually correct information by comparing the actual output to the retrieved context. If there are facts in the output that are not present in the retrieved context, then the model is considered to be hallucinating - fabricates facts that are not supported by the context.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, where most of the evaluated LLM models hallucinated.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

JSon Schema Evaluator

Question	Expected answer	Retrieved context	Actual answer	Conditions
			✓

JSon Schema evaluator checks the structure and content of the JSon data generated by the LLM/RAG model:

JSon Schema validation of actual answers: JSon Schema specification
Compatibility: RAG and LLM.

Method

JSon Schema Evaluator checks the structure and content of the JSon data generated by LLM/RAG models.
The evaluation utilizes a JSon Schema validation library to ensure the generated JSon adheres to the expected schema.
Evaluator checks every test case - actual answer - for compliance with the JSon schema.
If JSon Schema is not provided i.e. it is set to {}, then the evaluator checks only parseability of the actual answers as JSon.
The result of the test case evaluation is a boolean.
Models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

Valid JSon (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for JSon Schema compliance.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Invalid JSon (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for JSon Schema compliance.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Invalid retrieved JSon (float)
- JSon fragments in RAG's retrieved contexts are not JSon Schema validated.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Invalid generated JSon (float)
- Percentage of successfully JSon Schema validated outputs generated by RAG/LLM (equivalent to the model failures).
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

json_schema
- JSon Schema - JSon Schema specification - to validate the structure and content of the generated JSon data. {} to skip validation and check only parseability of the actual answers as JSon.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.

Language Mismatch Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓			✓

Language mismatch evaluator tries to determine whether the language of the question (prompt/input) input and the actual answer is the same.

LLM judge based language detection.
Compatibility: RAG and LLM models.

Method

The evaluator prompts the LLM judge to compare languages in the question and actual answer.
Evaluator checks every test case. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

Same language (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for language mismatch metric which detects whether the language of the input and output is the same.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Language mismatch (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for language mismatch metric which detects whether the language of the input and output is the same.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Language mismatch retrieval failures (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for language mismatch metric which detects whether the language of the input and output is the same.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Language mismatch generation failures (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for language mismatch metric which detects whether the language of the input and output is the same.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Language mismatch parsing failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for language mismatch metric which detects whether the language of the input and output is the same.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Looping Detection Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
			✓

Looping detection evaluator tries to find out whether the LLM generation went into a loop.

Compatibility: RAG and LLM models.

Method

This evaluator provides three metrics:

                        number of unique sequences
   unique sentences =  ----------------------------
                          number of all sentences

                                longest repeated substring * frequency of this substring
   longest repeated substring = --------------------------------------------------------
                                                   length of the text

                        length in bytes of compressed string
   compression ratio = --------------------------------------
                         length in bytes of original string

Where:

unique sentences omits sentences shorter than 10 characters.
compression ratio is calculated using Python's zlib and the maximum compression level (9).

Metrics calculated by the evaluator

Unique Sentences (float)
- Unique sentences metric is a ratio number of unique sequences / number of all sentences, where sentences shorter than 10 characters are omitted.
- Higher score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.
Longest Repeated Substring (float)
- Longest repeated substring metric is a ratio longest repeated substring * frequency of this substring / length of the text.
- Lower score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Compression Ratio (float)
- Ratio length in bytes of compressed string / length in bytes of original string. Compression is done using Python's zlib and the maximum compression level (9).
- Higher score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Machine Translation (GPTScore) Evaluator

Input	Expected answer	Retrieved context	Actual answer	Constraints
	✓		✓

GPT Score evaluator family is based on a novel evaluation framework specifically designed for RAGs and LLMs. It utilizes the inherent abilities of LLMs, particularly their ability to understand and respond to instructions, to assess the quality of generated text.

LLM judge based evaluation.
Compatibility: RAG and LLM models.

Method

The core idea of GPTScore is that a generative pre-trained model will assign a higher probability of high-quality generated text following a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens that follow In other words,.

Instructions used by the evaluator are:

Accuracy:

Rewrite the following text with its core information and consistent facts: {ref_hypo} In other words, {hypo_ref}

Fluency:

Rewrite the following text to make it more grammatical and well-written: {ref_hypo} In other words, {hypo_ref}

Multidimensional quality metrics:

Rewrite the following text into high-quality text with its core information: {ref_hypo} In other words, {hypo_ref}

Each instruction is evaluated twice - first it uses the expected answer for {ref_hypo} and the actual answer for {hypo_ref}, and then it is reversed. The calculated scores are then averaged.
The lower the metric value, the better.

See also:

Paper "GPTScore: Evaluate as You Desire"

Metrics calculated by the evaluator

Accuracy (float)
- Are there inaccuracies, missing, or unfactual content in the generated text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
- This is the primary metric.
Fluency (float)
- Is the generated text well-written and grammatical?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Multidimensional Quality Metrics (float)
- How is the overall quality of the generated text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Parameterizable BYOP Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓	✓	✓	✓

Bring Your Own Prompt (BYOP) evaluator uses user supplied custom prompt and an LLM judge to evaluate LLMs/RAGs. Currently implemented BYOP supports only binary problems, thus the prompt has to guide the judge to output either "true" or "false".

Method

User provides a custom prompt and an LLM judge.
Custom prompt may use question, expected answer, retrieved context and/or actual answer.
The evaluator prompts the LLM judge using the custom prompt provided by user.
Evaluator checks every test case. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

Model passes (float)
- Percentage of successfully evaluated RAG/LLM outputs.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Model failures (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Model retrieval failures (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Model generation failures (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures).
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Model parse failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator:

llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Perplexity Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
			✓

Perplexity measures how well a model predicts the next word based on what came before. The lower the perplexity score, the better the model is at predicting the next word.

Lower perplexity indicates that the model is more certain about its predictions. In comparison, higher perplexity suggests the model is more uncertain. Perplexity is a crucial metric for evaluating the performance of language models in tasks like machine translation, speech recognition, and text generation.

Evaluator uses distilgpt2 language model to calculate perplexity of the actual answer using lmppl package.
Compatibility: RAG and LLM models.

Method

Evaluator utilizes distilgpt2 language model to calculate perplexity of the actual answer using lmppl package. The calculation is as follows:

perplexity = exp(mean(cross-entropy loss))

Where the cross-entropy loss corresponds to cross-entropy loss of distilgpt2 calculated on the actual answer.

Metrics calculated by the evaluator

Perplexity (float)
- Perplexity measures how well a model predicts the next word based on what came before (sliding window). The lower the perplexity score, the better the model is at predicting the next word. Perplexity is calculated as exp(mean(-log likelihood)), where log-likelihood is computed using the distilgpt2 language model as the probability of predicting the next word.
- Lower is better.
- Range: [0, inf]
- Default threshold: 0.5
- Primary metric.

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Question Answering (GPTScore) Evaluator

Input	Expected answer	Retrieved context	Actual answer	Constraints
✓			✓

LLM judge based evaluation.
Compatibility: RAG and LLM models.

Method

The core idea of GPTScore is that a generative pre-trained model will assign a higher probability of high-quality generated text following a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens Answer: Yes.

Instructions used by the evaluator are:

Interest:

Answer the question based on the conversation between a human and AI.
Question: Are the responses of AI interesting? (a) Yes. (b) No.
Conversation: {history}
Answer: Yes

Engagement:

Answer the question based on the conversation between a human and AI.
Question: Are the responses of AI engaging? (a) Yes. (b) No.
Conversation: {history}
Answer: Yes

Understandability:

Answer the question based on the conversation between a human and AI.
Question: Are the responses of AI understandable? (a) Yes. (b) No.
Conversation: {history}
Answer: Yes

Relevance:

Answer the question based on the conversation between a human and AI.
Question: Are the responses of AI relevant to the conversation? (a) Yes. (b) No.
Conversation: {history}
Answer: Yes

Specific:

Answer the question based on the conversation between a human and AI.
Question: Are the responses of AI generic or specific to the conversation? (a) Yes. (b) No.
Conversation: {history}
Answer: Yes

Correctness:

Answer the question based on the conversation between a human and AI.
Question: Are the responses of AI correct to conversations? (a) Yes. (b) No.
Conversation: {history}
Answer: Yes

Semantically appropriate:

Answer the question based on the conversation between a human and AI.
Question: Are the responses of AI semantically appropriate? (a) Yes. (b) No.
Conversation: {history}
Answer: Yes

Fluency:

Answer the question based on the conversation between a human and AI.
Question: Are the responses of AI fluently written? (a) Yes. (b) No.
Conversation: {history}
Answer: Yes

Where {history} corresponds to the conversation - question and actual answer.
The lower the metric value, the better.

See also:

Paper "GPTScore: Evaluate as You Desire"

Metrics calculated by the evaluator

Interest (float)
Is the generated text interesting?
Lower score is better.
Range: [0, inf]
Default threshold: inf
This is the primary metric.
Engagement (float)
Is the generated text engaging?
Lower score is better.
Range: [0, inf]
Default threshold: inf
Understandability (float)
Is the generated text understandable?
Lower score is better.
Range: [0, inf]
Default threshold: inf
Relevance (float)
How well is the generated text relevant to its source text?
Lower score is better.
Range: [0, inf]
Default threshold: inf
Specific (float)
Is the generated text generic or specific to the source text?
Lower score is better.
Range: [0, inf]
Default threshold: inf
Correctness (float)
Is the generated text correct or was there a misunderstanding of the source text?
Lower score is better.
Range: [0, inf]
Default threshold: inf
Semantically Appropriate (float)
Is the generated text semantically appropriate?
Lower score is better.
Range: [0, inf]
Default threshold: inf
Fluency (float)
Is the generated text well-written and grammatical?
Lower score is better.
Range: [0, inf]
Default threshold: inf

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

RAGAS Evaluator

RAGAs (RAG Assessment) is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG refers to LLM applications that use external data to enhance the context. Evaluation and quantifying the performance of your pipeline can be hard. This is where Ragas (RAG Assessment) comes in. RAGAs metrics score includes both performance of the retrieval and generation components of the RAG pipeline. Therefore RAGAs score represents the overall quality of the answer considering both the retrieval and the answer generation itself.

Harmonic mean of Faithfulness, Answer Relevancy, Context precision, and Context Recall metrics.
Compatibility: RAG evaluation only.
Based on RAGAs library

Method

RAGAs metric score is calculated as harmonic mean of the four metrics calculated by the following evaluators:
- Faithfulness Evaluator (generation)
- Answer Relevancy Evaluator (retrieval+generation)
- Context Precision Evaluator (retrieval)
- Context Recall Evaluator (retrieval)
Faithfulness covers generation answer quality, Answer Relevancy covers answer generation and retrieval quality. Context Precision and Context Recall evaluate the retrieval quality.

See also:

Paper: "RAGAS: Automated Evaluation of Retrieval Augmented Generation": https://arxiv.org/abs/2309.15217
3rd party metric documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
3rd party library used: https://github.com/explodinggradients/ragas

Metrics calculated by the evaluator

RAGAS (float)
- RAGAs (RAG Assessment) metric is a harmonic mean of the following metrics: faithfulness, answer relevancy, context precision and context recall.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.
Faithfulness (float)
- Faithfulness (generation) metric measures the factual consistency of the generated answer against the given context. It is calculated from answer and retrieved context. Higher the better. The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context: (number of claims inferable from the context / claims in the answer).
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Answer relevancy (float)
- Answer relevancy metric (retrieval+generation) is assessing how pertinent the generated answer is to the given prompt. A lower score indicates answers which are incomplete or contain redundant information. This metric is computed using the question and the answer. Higher the better. An answer is deemed relevant when it directly and appropriately addresses the original question. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity of generated questions with the original question is measured.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Context precision (float)
- Context precision metric (retrieval) evaluator uses a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not - ideally all the relevant chunks must appear at the top of the context - ranked high.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Context recall (float)
- Context recall metric (retrieval) measures the extent to which the retrieved context aligns with the answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context. Higher the better. Each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not: (answer sentences that can be attributed to context / answer sentences count)
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Step Alignment and Completeness Evaluator

Question	Expected answer	Retrieved context	Actual answer	Conditions
		✓	✓

Step alignment and completeness evaluator is a tool for evaluating the steps of procedures, sequences, or process descriptions in the actual answer for relevance, alignment and completeness, given the retrieved context as a ground truth.

The evaluator uses LLM and/or regular expressions to extract steps, sentence embeddings to assess semantic similarity between steps, and dynamic programming to compare the steps in the actual answer with the retrieved context to assess alignment and completeness.
The implementation is based on 'Evaluating Procedure Generation in Retrieval-Augmented Generation (RAG) Systems' by Alexis Sudjianto and Agus Sudjianto; and 'Evaluating Procedural Alignment and Sequence Detection' by Agus Sudjianto.
Compatibility: RAG evaluation only.

Method

The evaluator uses the configured LLM and/or regular expressions to extract all enumerations from the retrieved context chunks and actual answers.
The evaluator semantically compares the extracted steps and evaluates the alignment and completeness of the steps in the actual answer using dynamic programming, considering the retrieved context as ground truth.
In order to measure the semantic similarity between steps the evaluator uses all-MiniLM-L6-v2 embedding model from Hugging Face sentence-transformers library.
The evaluator provides metrics for the number of edits (primary), insertions, deletions, and mismatches in the actual answer.
In addition the evaluator provides metrics with the number of steps detected in the retrieved context and the actual answer to assess the reliability of the evaluation.
The evaluator is compatible with RAG models, as it requires retrieved context.

Metrics calculated by the evaluator

Edits (float)
- Number of edits required to obtain the correct sequence of steps. An edit involves inserting, deleting or substituting a step in the actual answer with a step from the retrieved context. Fewer edits indicate a better quality actual answer.
- Lower score is better.
- Range: [0, inf]
- Default threshold: 0.75
- This is primary metric.
Insertions (float)
- Number of insertions to obtain the correct sequence of steps. Insertion is a step in the retrieved context that is not present in the actual answer. Fewer insertions indicate a better quality actual answer.
- Lower score is better.
- Range: [0, inf]
- Default threshold: 0.75
Deletions (float)
- Number of deletions to obtain the correct sequence of steps. Deletion is a step in the actual answer that is not present in the retrieved context. Fewer deletions indicate a better quality actual answer.
- Lower score is better.
- Range: [0, inf]
- Default threshold: 0.75
Mismatches (float)
- Number of steps that are not the same in the original and generated output. Fewer mismatches indicate a better quality actual answer.
- Lower score is better.
- Range: [0, inf]
- Default threshold: 0.75
Retrieved context steps (float)
- The number of steps detected in the retrieved context.
- Higher score is better.
- Range: [0, inf]
- Default threshold: 0.75
Actual answer steps (float)
- The number of steps detected in the actual answer.
- Higher score is better.
- Range: [0, inf]
- Default threshold: 0.75

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

h2ogpte_connection_config_key (str):
- Configuration key of the h2oGPTe host to be used for the evaluation. If not specified, the first h2oGPTe connection in the configuration will be used.
- Default value: ""
h2ogpte_llm_model_name (str):
- LLM model (name) to be used for the evaluation. If not specified, evaluator will check whether h2oGPTe host provides Claude Sonnet, OpenAI GPT-4o or any llama (in this order) and use it.
- Default value: ""
metric_threshold (float):
- Evaluated metric threshold - values above this threshold are considered problematic.
- Default value: 0.75
save_llm_result (bool):
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.
- Default value: True
sentence_level_metrics (bool):
- Controls whether sentence level metrics are generated.
- Default value: True
min_test_cases (int):
- Minimum number of test cases, which produces useful results.
- Default value: ""

Text Matching Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
		✓	✓	✓

Text Matching Evaluator assesses whether both the retrieved context (in the case of RAG hosted models) and the actual answer contain/match a specified set of required strings. The evaluation is based on the match/no match of the required strings, using substring and/or regular expression-based search in the retrieved context and actual answer.

Boolean expression defining required and undesired string presence.
Compatibility: RAG and LLM evaluation.

The evaluation is based on an boolean expression - condition:

operands are strings or regular expressions
operators are AND, OR, and NOT
parentheses can be used to group expressions

Method:

Evaluator checks every test case - actual answer and retrieved context - for the presence of the required strings and regular expressions. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Example 1:

Consider the following boolean expression:

"15,969"

The evaluator will check if the retrieved context and the actual answer contain the string 15,969. If the condition is satisfied, the test case passes.

Example 2:

What if the number 15,969 might be expressed as 15969 or 15,969? The boolean expression can be extended to use the regular expression:

regexp("15,?969")

The evaluator will check if the retrieved context and the actual answer contain the string 15,969 or 15969. If the condition is satisfied, the test case passes.

Example 3:

Consider the following boolean expression:

"15,969" AND regexp("[Mm]illion")

The evaluator will check if the retrieved context and the actual answer contain the string 15,969 and match the regular expression [Mm]illion. If the condition is satisfied, the test case passes.

Example 4:

Finally consider the following boolean expression:

("Brazil" OR "brazil") AND regexp("15,?969 [Mm]illion") AND NOT "Real"

The evaluator will check if the retrieved context and the actual answer contain either Brazil or brazil and match the regular expression 15,969 [Mm]illion and do not contain the string Real. If the condition is satisfied, the test case passes.

Example 5:

Consider the following boolean expression:

regexp("^$Brazil revenue was 15,969 million$")

The evaluator will check if the retrieved context and the actual answer exactly match the regular expression ^$Brazil revenue was 15,969 million$. If the condition is satisfied, the test case passes.

Metrics calculated by the evaluator

Model passes (float)
- Percentage of successfully evaluated RAG/LLM outputs.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Model failures (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Model retrieval failures (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Model generation failures (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures).
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Model parse failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Retrieval

Context Mean Reciprocal Rank Evaluator

Question	Expected answer	Retrieved context	Actual answer	Conditions
✓		✓

Mean Reciprocal Rank Evaluator assesses the performance of the retrieval component of a RAG system by measuring the average of the reciprocal ranks of the first relevant document retrieved for a set of queries. It helps to evaluate how effectively the retrieval component of a RAG system provides relevant context for generating accurate and contextually appropriate responses.

Compatibility: RAG evaluation only.

Method

The evaluator brings mean reciprocal rank (MRR) metric.
Relevant retrieved context chunk is defined as the chunk that contains the answer to the query. The relevance score is calculated as:

relevance score = max( S(ctx chunk sentence, query) )

Where S(a, b) is the similarity score between texts a and b, calculated as 1 - cosine distance between their vector embeddings.
For a single query, the reciprocal rank is the inverse of the rank of the first relevant document retrieved:

reciprocal rank = 1 / rank of the first chunk with relevance score >= threshold

If the first relevant document is at rank 1, the reciprocal rank is 1.0 (best score). If no relevant document is retrieved, the reciprocal rank is 0.0 (worst score). If the first relevant document is at rank 5, the reciprocal rank is 1 / 5 i.e. 0.2.
Relevance score threshold is set to 0.7 by default, but can be adjusted using the evaluator parameter.
Mean reciprocal rank (MRR) is the average of the reciprocal ranks across all queries:

mean reciprocal rank = sum(reciprocal rank for query in queries) / |queries|

The evaluator uses embeddings BAAI/bge-small-en (where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI)).

Metrics calculated by the evaluator

Mean reciprocal rank (float)
- Mean reciprocal rank metric score given the first relevant retrieved context chunk.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

mrr_relevant_chunk_threshold
- Threshold or the relevance score of the retrieved context chunk. The relevance score is calculated as: S(ctx chunk, query). The threshold value should be between 0.0 and 1.0 (default: 0.7).
mrr_relevant_chunk_oor_idx
- Threshold for the index of the relevant chunk in the retrieved context. If the first relevant chunk is at an index higher than this value, it is considered out of range and the reciprocal rank for that query is set to 0.0. The value should be a positive integer (default: 10).
metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.
save_llm_result
- Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.

Context Precision Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓	✓	✓

Context Precision Evaluator assesses the quality of the retrieved context by evaluating the order and relevance of text chunks on the context stack - precision of the context retrieval. Ideally, all relevant chunks (ranked higher) should be appearing at the top of the context.

LLM judge evaluating the chunk quality.
Based on RAGAs library

Method

The evaluator calculates a score based on the presence of the expected answer (ground truth) in the text chunks at the top of the retrieved context chunk stack.
Irrelevant chunks and unnecessarily large context decrease the score.
Top of the stack is defined as n top-most chunks at the top of the stack.
Chunk relevance is determined by the LLM judge as a [0, 1] value. Chunk relevances are multiplied by the chunk position (depth) in the stack, summed, and normalized to calculate the score:

context precision = sum( chunk precision (depth) * relevance (depth)) / number of relevant items at the top of the chunk stack

chunk precision (depth) = true positives (depth) / (true positives (depth) + false positives (depth))

See also:

3rd party metric documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
3rd party library used: https://github.com/explodinggradients/ragas

Metrics calculated by the evaluator

Context precision (float)
- Context precision metric (retrieval) evaluator uses a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not - ideally all the relevant chunks must appear at the top of the context - ranked high.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Context Recall Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
	✓	✓

Context Recall Evaluator measures the alignment between the retrieved context and the answer (ground truth).

LLM judge is checking ground truth sentences' presence in the retrieved context.
Compatibility: RAG evaluation only.
Based on RAGAs library

Method

Metric is computed based on the ground truth and the retrieved context.
The LLM judge analyzes each sentence in the expected answer (ground truth) to determine if it can be attributed to the retrieved context.
The score is calculated as the ratio of the number of sentences in the expected answer that can be attributed to the context to the total number of sentences in the expected answer (ground truth).

Score formula:

context recall = (expected answer sentences that can be attributed to context) / (expected answer sentences count)

See also:

3rd party metric documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
3rd party library used: https://github.com/explodinggradients/ragas

Metrics calculated by the evaluator

Context recall (float)
- Context recall metric (retrieval) measures the extent to which the retrieved context aligns with the answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context. Higher is better. Each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not: (answer sentences that can be attributed to context / answer sentences count)
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters:

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.
work-dir-archive
- Zip archive with evaluator artifacts.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Context Relevancy Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓		✓

Context Relevancy Evaluator measures the relevancy of the retrieved context based on the question and contexts.

Extraction and relevance assessment by an LLM judge.
Compatibility: RAG evaluation only.
Based on RAGAs library

Method

The evaluator uses an LLM judge to identify relevant sentences within the retrieved context to compute the score using the formula:

context relevancy = (number of relevant context sentences) / (total number of context sentences)

Total number of sentences is determined by a sentence tokenizer.

See also:

3rd party metric documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html
3rd party library used: https://github.com/explodinggradients/ragas

Metrics calculated by the evaluator

Context relevancy (float)
- Context relevancy metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. Ideally, the retrieved context should exclusively contain essential information to address the provided query. To compute this, evaluator initially estimates the value by identifying sentences within the retrieved context that are relevant for answering the given question. The final score is determined by the following formula: context relevancy = (number of relevant sentences / total number of sentences).
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Context Relevancy Evaluator (Soft Recall and Precision)

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓		✓

Context Relevancy (Soft Recall and Precision) Evaluator measures the relevancy of the retrieved context based on the question and context sentences and produces two metrics - precision and recall relevancy.

Compatibility: RAG evaluation only.

Method

The evaluator brings two metrics calculated as:

chunk context relevancy(ch) = max( {S(emb(q), emb(s)): for all s in ch} )

recall relevancy = max( {chunk context relevancy(ch): for all ch in rc} )
precision relevancy = avg( {chunk context relevancy(ch): for all ch in rc} )

Where:
- rc is the retrieved context.
- ch is a chunk of the retrieved context.
- emb(s) is a vector embedding of the retrieved context chunk sentence.
- emb(q) is a vector embedding of the query.
- S(question, s) is the 1 - cosine distance between the question and the retrieved context sentence s.
The evaluator uses embeddings BAAI/bge-small-en where BGE stands for "BAAI General Embedding" which refers to a suite of open-source text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI).

Metrics calculated by the evaluator

Recall Relevancy (float)
- Maximum retrieved context chunk relevancy.
- Higher score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Precision Relevancy (float)
- Average retrieved context chunk relevancy.
- Higher score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Privacy

Contact Information Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
			✓

Contact Information Evaluator checks for potential leakages of contact information in the text generated by RAG/LLM models. It assesses whether the generated answer contains contact information such as names, addresses, phone numbers, medical information, user names, and emails.

LLM judge based contact information detection.
Compatibility: RAG and LLM models.

Method

The evaluator prompts the LLM judge to detect contact information in the actual answer.
Evaluator checks every test case for the presence of the contact information. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

No contact information leakages (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Contact information leakages (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Contact information retrieval failures (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Contact information generation failures (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Contact information parsing failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for contact information leakage metric which detects privacy sensitive information like names, addresses, phone numbers, medical information, and emails.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator:

llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Encoding Guardrail Evaluator

Question	Expected answer	Retrieved context	Actual answer	Conditions
		✓	✓

Evaluator designed to assess the LLM/RAG's ability to handle encoding attacks. It evaluates whether the system can be tricked into generating incorrect or unexpected outputs through manipulation of the prompt encoding, such as encoding the prompt text using Base64 or Base16, which should be discarded by the guardrails or the system.

Publication: Jailbreaking DeepSeek R1 - Prompt Injection Using Charcodes

Decodes potentially encoded outputs before checking.
In order to evaluate the success it either uses the test case condition (if available) or fallbacks to the PII leakage detection (if no condition is specified).
Compatibility: RAG and LLM.

Method

The evaluator first identifies test cases that are potentially using encoding perturbations (e.g., base16, base64) based on the dataset metadata ('categories').
For identified cases, it attempts to find and decode the encoded portion of the actual_output.
The evaluation then utilizes a regex suite on the (potentially decoded) text either to use conditions or to detect formatted PII patterns.
The evaluator checks both the potentially decoded actual_output and, optionally, the retrieved_context (for RAG models) for (un)desired actual answer content.
The result for each test case is boolean (leakage detected or not).
Models are compared based on the percentage of test cases passed (no leakage detected).

Metrics calculated by the evaluator

No encoded data leakages (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for the encoded data leakage metric, which detects data that bypassed system protection by encoding it.
- Higher score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- This is primary metric.
Encoded data leakages (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for encoded data leakage metric, which detects data that bypassed system protection by encoding it.
- Lower score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Encoded retrieval leakages (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for encoded data leakage metric, which detects data that bypassed system protection by encoding it.
- Lower score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Encoded generation leakages (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check for encoded data leakage metric, which detects data that bypassed system protection by encoding it.
- Lower score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Encoded data parsing failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for the metric which detects data that bypassed system protection by encoding it.
- Lower score is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If average score of the primary metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance (if applicable).
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold (float)
- Description: Threshold for the primary metric "No PII leakages (pass)". If the metric score is below this threshold, the evaluator may report a problem.
- Default value: 0.5
save_llm_result (bool)
- Description: Control whether to save LLM result which contains input LLM dataset and all metrics calculated by the evaluator.
- Default value: True (assumed)
evaluate_retrieved_context (bool)
- Description: Control whether to evaluate also retrieved context - conditions to check whether it contains or does not contained specific strings.
- Default value: True

PII Leakage Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
		✓	✓

PII leakage evaluator checks for potential personally identifiable information - like credit card numbers, social security numbers, email addresses - leakages in the text generated by the LLM/RAG model.

Regular expressions suite to detect PII in the retrieved context and actual answer.
Compatibility: RAG and LLM.

Method

PII Leakage Evaluator checks for potential personally identifiable information (PII) leakages in the text generated by LLM/RAG models.
The evaluation utilizes a regex suite that can quickly and reliably detect formatted PII, including credit card numbers, SSNs, and emails.
Evaluator checks every test case - actual answer and retrieved context - for the presence of the PIIs. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

No PII leakages (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
PII leakages (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
PII retrieval leakages (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
PII generation leakages (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for PII leakage metric which detects privacy sensitive information like credit card numbers, social security numbers, and email addresses.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator:

llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Sensitive Data Leakage Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
		✓	✓

Sensitive Data Leakage Evaluator checks for potential leakages of security-related and/or sensitive data in the text generated by LLM/RAG models. It assesses whether the generated answer contains security-related information such as activation keys, passwords, API keys, tokens, or certificates.

Regular expressions suite to detect sensitive data in the retrieved context and actual answer.
Compatibility: RAG and LLM.

Method

The evaluator utilizes a regex suite that can quickly and reliably detect formatted sensitive data, including certificates in SSL/TLS PEM format, API keys for H2O.ai and OpenAI, and activation keys for Windows.
Evaluator checks every test case - actual answer and retrieved context - for the presence of the PIIs. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

No sensitive data leakages (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Sensitive data leakages (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Sensitive data retrieval leakages (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Sensitive data generation leakages (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Sensitive data parsing failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for sensitive data leakage metric which detects sensitive data like security certificates (SSL/TLS PEM), API keys (H2O.ai and OpenAI) and activation keys (Windows).
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator:

llm-eval-results
- Frame with the evaluation results.
llm-bool-leaderboard
- LLM failure leaderboard with data and formats for boolean metrics.
work-dir-archive
- Zip archive with evaluator artifacts.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Fairness

Fairness Bias Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
			✓

Fairness bias evaluator assesses whether the LLM/RAG output contains gender, racial, or political bias. This information can then be used to improve the development and deployment of LLMs/RAGs by identifying and mitigating potential biases.

Compatibility: RAG and LLM models.

Method

The evaluator uses bias-detection-model library to calculate the metric score.

See also:

3rd party model used: https://huggingface.co/d4data/bias-detection-model

Metric calculated by the evaluator

Fairness bias (float)
- Fairness bias metric indicates the level of gender, racial, or political bias in the generated text. High score indicates high fairness bias.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Sexism Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
			✓

Sexism evaluator evaluates input and LLM output to find possible instances of sexism.

LLM judge based sexism detection.
Compatibility: RAG and LLM models.

Method

The evaluator prompts the LLM judge to detect sexism in the actual answer.
Evaluator checks every test case for the presence of sexism. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

No Sexism (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for sexism metric which detects possible instances of sexism.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Sexist (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for sexism metric which detects possible instances of sexism.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Sexism in retrieval (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for sexism metric which detects possible instances of sexism.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Sexism in generation (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for sexism metric which detects possible instances of sexism.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Sexism parsing failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for sexism metric which detects possible instances of sexism.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Stereotypes Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓			✓

Stereotype evaluator tries to guess if the LLM output contains stereotypes - assess whether the answer contains added information about gender or race with no reference in the question.

LLM judge based gender stereotypes detection.
Compatibility: RAG and LLM models.

Method

The evaluator prompts the LLM judge to detect gender stereotypes in the actual answer and also to check the question.
Evaluator checks every test case for the presence of stereotypes. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

Stereotype-free (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Stereotyped (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Stereotypes in retrieval (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Stereotypes in generation (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Stereotypes parsing failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for gender stereotypes metric which detects the presence of gender and/or race stereotypes.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive, and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Toxicity Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
			✓

Toxicity evaluator is a tool that is used to assess the level of toxicity in the actual answers. RAGs/LLMs can generate human-quality text, but they can also be prone to generating toxic content, such as hate speech, offensive language, and discriminatory language.

The value of LLM toxicity evaluator is twofold - it can help to ensure that LLMs are not used to generate toxic content that could harm individuals or groups.

It can also help to improve the accuracy and reliability of RAGs/LLMs by identifying and mitigating the generation of toxic content.

Compatibility: RAG and LLM models.

Method

Toxicity evaluator uses Detoxify library to calculate toxicity scores.

See also:

3rd party library used: https://pypi.org/project/detoxify/
3rd party library source: https://github.com/unitaryai/detoxify

Metrics calculated by the evaluator

Toxicity (float)
- Toxicity score indicates the level of toxicity in the generated text, e.g., hate speech, offensive language, threatening, insulting, and identity attacking language. High score indicates high toxicity.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.
Severe Toxicity (float)
- Toxicity score indicates the level of severe toxicity in the generated text. High score indicates high toxicity.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Obscene (float)
- Toxicity score indicates the level of obscenity in the generated text. High score indicates high toxicity.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Threat (float)
- Toxicity score indicates the level of threat in the generated text. High score indicates high toxicity.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Insult (float)
- Toxicity score indicates the level of insults in the generated text. High score indicates high toxicity.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Identity Attack (float)
- Toxicity score indicates the level of identity attacks in the generated text. High score indicates high toxicity.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Summarization

BLEU Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
			✓

BLEU (Bilingual Evaluation Understudy) measures the quality of machine-generated texts by comparing them to reference texts. BLEU calculates a score between 0.0 and 1.0, where a higher score indicates a better match with the reference text.

Compatibility: RAG and LLM models.

Method

BLEU is based on the concept of n-grams, which are contiguous sequences of words. The different variations of BLEU, such as BLEU-1, BLEU-2, BLEU-3, and BLEU-4, differ in the size of the n-grams considered for evaluation.
BLEU-n measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping n-grams and dividing it by the total number of n-grams in the generated text.
NLTK library is used to tokenize the text using punkt tokenizer and then calculate the BLEU score.

See also:

3rd party library BLEU implementation used: https://www.nltk.org/_modules/nltk/translate/bleu_score.html

Metrics calculated by the evaluator

BLEU-1 (float)
- BLEU-1 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping unigrams and dividing it by the total number of unigrams in the generated text.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.
BLEU-2 (float)
- BLEU-2 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping bigrams and dividing it by the total number of bigrams in the generated text.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
BLEU-3 (float)
- BLEU-3 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping trigrams and dividing it by the total number of trigrams in the generated text.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
BLEU-4 (float)
- BLEU-4 metric is typically used for summary evaluation - it measures the precision of n-grams (n consecutive words) in the generated text compared to the reference text. It calculates the precision score by counting the number of overlapping 4-grams and dividing it by the total number of 4-grams in the generated text.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Explanations created by the evaluator:

llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

ROUGE Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
			✓

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of evaluation metrics used to assess the quality of generated summaries compared to reference summaries. There are several variations of ROUGE metrics, including ROUGE-1, ROUGE-2, and ROUGE-L.

Compatibility: RAG and LLM models.

Method

The evaluator reports the F1 score between the generated and reference n-grams.
ROUGE-1 measures the overlap of 1-grams (individual words) between the generated and the reference summaries.
ROUGE-2 extends the evaluation to 2-grams (pairs of consecutive words).
ROUGE-L considers the longest common subsequence (LCS) between the generated and reference summaries.
These ROUGE metrics provide a quantitative evaluation of the similarity between the generated and reference texts to assess the effectiveness of text summarization algorithms.

See also:

3rd party library ROUGE: https://github.com/google-research/google-research/tree/master/rouge

Metrics calculated by the evaluator

ROUGE-1 (float)
- ROUGE-1 metric measures the overlap of 1-grams (individual words) between the generated and the reference summaries.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
ROUGE-2 (float)
- ROUGE-2 metric measures the overlap of 2-grams (pairs of consecutive words) between the generated and the reference summaries.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
ROUGE-L (float)
- ROUGE-L metric considers the longest common subsequence (LCS) between the generated and reference summaries.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, the evaluator will report a problem for each perturbed test case and LLM model whose metric flips (moved above or below the threshold) after perturbation.

Insights diagnosed by the evaluator

The best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt that most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters:

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Explanations created by the evaluator:

llm-eval-results
- Frame with the evaluation results.
llm-heatmap-leaderboard
- Leaderboards with models and prompts by metric values.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Summarization (Completeness and Faithfulness) Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
✓			✓

This summarization evaluator, which does not require a reference summary, uses two faithfulness metrics based on SummaC (Conv and ZS) and one completeness metric.

Compatibility: RAG and LLM models.

Method

SummaC Conv is a trained model consisting of a single learned convolution layer compiling the distribution of entailment scores of all document sentences into a single score.
SummaC ZS performs zero-shot aggregation by combining sentence-level scores using max and mean operators. This metric is more sensitive to outliers than SummaC Conv.
Completeness metric is calculated using the distance of embeddings between the reference and faithful parts of the summary.

See also:

3rd party SummaC library used: https://github.com/tingofurro/summac
Paper: "SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization"

Metrics calculated by the evaluator

Completeness (float)
- Completeness metric is calculated using the distance of embeddings between the reference and faithful parts of the summary.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.
Faithfulness (SummaC Conv) (float)
- The faithfulness metric measures how well the summary preserves the meaning and factual content of the original text. SummaC Conv is a trained model consisting of a single learned convolution layer compiling the distribution of entailment scores of all document sentences into a single score.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Faithfulness (SummaC ZS) (float)
- The faithfulness metric measures how well the summary preserves the meaning and factual content of the original text. SummaC ZS performs zero-shot aggregation by combining sentence-level scores using max and mean operators. This metric is more sensitive to outliers than SummaC Conv.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Summarization (Judge) Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
	✓		✓

Summarization evaluator uses an LLM judge to assess the quality of the summary made by the evaluated model using a reference summary.

LLM judge based summarization evaluation.
Requires a reference summary.
Compatibility: RAG and LLM models.

Method

The evaluator prompts the LLM judge to compare the actual answer (evaluated RAG/LLM's summary) and the expected answer (reference summary).
Evaluator checks every test case for the presence of the contact information. The result of the test case evaluation is a boolean.
LLM models are compared based on the number of test cases where they succeeded.

Metrics calculated by the evaluator

Good summary (pass) (float)
- Percentage of successfully evaluated RAG/LLM outputs for summarization quality metrics which uses a language model judge to determine whether the summary is correct or not.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
- Primary metric.
Bad summary (fail) (float)
- Percentage of RAG/LLM outputs that failed to pass the evaluator check for summarization quality metrics which uses a language model judge to determine whether the summary is correct or not.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Summarization retrieval failures (float)
- Percentage of RAG's retrieved contexts that failed to pass the evaluator check for summarization quality metrics which uses a language model judge to determine whether the summary is correct or not.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Summarization generation failures (float)
- Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures) for summarization quality metrics which uses a language model judge to determine whether the summary is correct or not.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5
Summarization parsing failures (float)
- Percentage of RAG/LLM outputs that evaluator's judge (LLM, RAG or model) was unable to parse, and therefore unable to evaluate and provide a metrics score for summarization quality metrics which uses a language model judge to determine whether the summary is correct or not.
- Lower is better.
- Range: [0.0, 1.0]
- Default threshold: 0.5

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
LLM models with best and worst context retrieval performance.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Summarization with reference (GPTScore) Evaluator

Input	Expected answer	Retrieved context	Actual answer	Constraints
	✓		✓

LLM judge based evaluation.
Compatibility: RAG and LLM models.

Method

The core idea of GPTScore is that a generative pre-trained model will assign a higher probability of high-quality generated text following a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens that follow In other words,.

Instructions used by the evaluator are:

Semantic coverage:

Rewrite the following text with the same semantics. {ref_hypo} In other words, {hypo_ref}

Factuality:

Rewrite the following text with consistent facts. {ref_hypo} In other words, {hypo_ref}

Informativeness:

Rewrite the following text with its core information. {ref_hypo} In other words, {hypo_ref}

Coherence:

Rewrite the following text into a coherent text. {ref_hypo} In other words, {hypo_ref}

Relevance:

Rewrite the following text with consistent details. {ref_hypo} In other words, {hypo_ref}

Fluency:

Rewrite the following text into a fluent and grammatical text. {ref_hypo} In other words, {hypo_ref}

Each instruction is evaluated twice - first it uses the expected answer for {ref_hypo} and the actual answer for {hypo_ref}, and then it is reversed. The calculated scores are then averaged.
The lower the metric value, the better.

See also:

Paper "GPTScore: Evaluate as You Desire"

Metrics calculated by the evaluator

Semantic Coverage (float)
- How many semantic content units from the reference text are covered by the generated text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
- This is the primary metric.
Factuality (float)
- Does the generated text preserve the factual statements of the source text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Informativeness (float)
- How well does the generated text capture the key ideas of its source text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Coherence (float)
- How much does the generated text make sense?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Relevance (float)
- How well is the generated text relevant to its source text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Fluency (float)
- Is the generated text well-written and grammatical?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Summarization without reference (GPTScore) Evaluator

Input	Expected answer	Retrieved context	Actual answer	Constraints
✓			✓

LLM judge based evaluation.
Compatibility: RAG and LLM models.

Method

The core idea of GPTScore is that a generative pre-trained model will assign a higher probability of high-quality generated text following a given instruction and context. The score corresponds to the average negative log likelihood of the generated tokens. In this case, the average negative log likelihood is calculated from the tokens that follow Tl;dr\n.

Instructions used by the evaluator are:

Semantic coverage:

Generate a summary with as much semantic coverage as possible for the following text: {src}
Tl;dr
{target}

Factuality:

Generate a summary with consistent facts for the following text: {src}
Tl;dr
{target}

Consistency:

Generate a factually consistent summary for the following text: {src}
Tl;dr
{target}

Informativeness:

Generate an informative summary that captures the key points of the following text: {src}
Tl;dr
{target}

Coherence:

Generate a coherent summary for the following text: {src}
Tl;dr
{target}

Relevance:

Generate a relevant summary with consistent details for the following text: {src}
Tl;dr
{target}

Fluency:

Generate a fluent and grammatical summary for the following text: {src}
Tl;dr
{target}

Where {src} corresponds to the question and {target} to the actual answer.
The lower the metric value, the better.

See also:

Paper "GPTScore: Evaluate as You Desire"

Metrics calculated by the evaluator

Semantic Coverage (float)
- How many semantic content units from the reference text are covered by the generated text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
- This is the primary metric.
Factuality (float)
- Does the generated text preserve the factual statements of the source text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Consistency (float)
- Is the generated text consistent in the information it provides?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Informativeness (float)
- How well does the generated text capture the key ideas of its source text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Coherence (float)
- How much does the generated text make sense?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Relevance (float)
- How well is the generated text relevant to its source text?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf
Fluency (float)
- Is the generated text well-written and grammatical?
- Lower score is better.
- Range: [0, inf]
- Default threshold: inf

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Classification

Classification Evaluator

Question	Expected answer	Retrieved context	Actual answer	Constraints
	✓		✓

Binomial and multinomial classification evaluator for LLM models and RAG systems which are used to classify data into two or more classes.

Compatibility: RAG and LLM models.

Method

The evaluator matches expected answer (label) and actual answers (prediction) for each test case and calculates the confusion matrix and metrics such as accuracy, precision, recall, and F1 score for each model.

Metrics calculated by the evaluator

Accuracy (float)
- Accuracy metric measures how often the model makes correct predictions using the formula: (True Positives + True Negatives) / Total Predictions.
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
- Primary metric.
Precision (float)
- Precision metric measures the proportion of the positive predictions that were actually correct using the formula: True Positives / (True Positives + False Positives).
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
Recall (float)
- Recall metric measures the proportion of the actual positive cases that were correctly predicted using the formula: True Positives / (True Positives + False Negatives).
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75
F1 (float)
- F1 metric measures the balance between precision and recall using the formula: 2 * (Precision * Recall) / (Precision + Recall).
- Higher is better.
- Range: [0.0, 1.0]
- Default threshold: 0.75

Problems reported by the evaluator

If the average score of the metric for an evaluated LLM is below the threshold, then the evaluator will report a problem for that LLM.
If the test suite has perturbed test cases, then the evaluator will report a problem for each perturbed test case and LLM model whose metric flipped (moved above/below threshold) after perturbation.

Insights diagnosed by the evaluator

Best performing LLM model based on the evaluated primary metric.
The most difficult test case for the evaluated LLM models, i.e., the prompt, which most of the evaluated LLM models had a problem answering correctly.

Evaluator parameters

metric_threshold
- Metric threshold - metric values above/below this threshold will be reported as problems.

Feedback

Submit and view feedback for this page
Send feedback about H2O Eval Studio to cloud-feedback@h2o.ai

Compliance Frameworks​

Evaluators overview​

Generation​

Agent Sanity Check Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Answer Correctness Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator:​

Evaluator parameters​

Answer Relevancy Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Answer Relevancy (Sentence Similarity)​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Answer Semantic Similarity Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator:​

Insights diagnosed by the evaluator​

Evaluator parameters​

Answer Semantic Sentence Similarity Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Fact-check Agent-based Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Faithfulness Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Groundedness Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Hallucination Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

JSon Schema Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Language Mismatch Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Looping Detection Evaluator​

Method​

Metrics calculated by the evaluator​

Problems reported by the evaluator​

Insights diagnosed by the evaluator​

Evaluator parameters​

Compliance Frameworks

Evaluators overview

Generation

Agent Sanity Check Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

Answer Correctness Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator:

Evaluator parameters

Answer Relevancy Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

Answer Relevancy (Sentence Similarity)

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

Answer Semantic Similarity Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator:

Insights diagnosed by the evaluator

Evaluator parameters

Answer Semantic Sentence Similarity Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

Fact-check Agent-based Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

Faithfulness Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

Groundedness Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

Hallucination Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

JSon Schema Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters

Language Mismatch Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Looping Detection Evaluator

Method

Metrics calculated by the evaluator

Problems reported by the evaluator

Insights diagnosed by the evaluator

Evaluator parameters