RAG Evaluation: Metrics and Monitoring
Key Points
- The speaker likens monitoring generative AI models to a car’s dashboard, emphasizing the need for continuous metrics to ensure safety and reliability.
- Retrieval‑augmented generation (RAG) combines up‑to‑date vector‑store data from multiple sources to answer questions in natural language.
- ROUGE is presented as a recall‑oriented metric that measures how completely a model's response matches a set of human‑generated references, yielding a score between 0 and 1.
- BLEU, a precision‑focused metric (its name, like ROUGE's, is a French word), evaluates how many words in the model's output align with the reference set, but longer responses can dilute its precision.
- The discussion begins to introduce the METEOR score as another evaluation measure, highlighting its role in providing an averaged assessment of model performance.
Sections
- RAG Evaluation Through Dashboard Analogy - The passage likens vehicle dashboard indicators to AI monitoring metrics, stressing the importance of tracking key measures to safely assess and improve Retrieval‑Augmented Generation systems.
- Evaluating Model Scores and PII - The speaker outlines BLEU and METEOR metrics—highlighting precision, recall, and length penalties—then cautions about the inclusion and generation of personally identifiable information when using language models.
- Evaluating RAG: Relevance and Hallucination - The speaker illustrates how to assess retrieval‑augmented generation models by measuring context relevance and hallucination, using a New York location and capital example to show desired low hallucination and high relevance scores.
**Source:** [https://www.youtube.com/watch?v=DRZMjP5Pg5A](https://www.youtube.com/watch?v=DRZMjP5Pg5A) **Duration:** 00:08:21
- [00:00:00](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=0s) **RAG Evaluation Through Dashboard Analogy**
- [00:03:05](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=185s) **Evaluating Model Scores and PII**
- [00:06:15](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=375s) **Evaluating RAG: Relevance and Hallucination**
Full Transcript
Today we're going to master RAG evaluation with key metrics.
Let's say you're getting into your car in the morning, right?
You look at your dashboard, and you have several things on it:
everything from your speedometer, so you know how fast you're going
and whether you might get a ticket, to your gas gauge,
because it's very important to know whether you're running on empty or full, to make sure
you're not stuck on your way, or late because you had to stop for gas.
And then you have things like the engine light and the seatbelt light,
so it's important to know if someone's not buckled in in the vehicle,
or maybe you need an oil change, or something is wrong with your engine.
Well, there's no way to know many of these things unless we use
monitors and metrics that are provided by the vehicle to help us stay safe.
Now, the same thing happens for your generative AI models.
We need to make sure that we're monitoring these models,
so that we minimize the risk that dangerous things will happen to you
as you're using them on your journey down the road.
Let's talk a little bit more about retrieval-augmented generation,
otherwise known as RAG.
Retrieval-augmented generation
is a very popular generative
AI method that takes information in a vector database,
large amounts of information that's regularly
updated to make sure you are current and up to date.
You can ask a question and receive answers about this information, all in one location, in natural language, which is key.
So not just the source of your information, but actually
compiling information from multiple sources all in one place.
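The retrieval half of that loop can be sketched very simply. The toy ranker below stands in for a vector database, scoring documents by word overlap with the question; a real RAG stack would use embedding search over a vector store and pass the top hits to a language model:

```python
def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the question.
    A real RAG system uses embedding similarity over a vector database."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

corpus = [
    "albany is the capital of new york",
    "paris is the capital of france",
    "bread recipes for beginners",
]
top = retrieve("what is the capital of new york", corpus, k=1)
```

The retrieved passages would then be stitched into the model's prompt as context, which is exactly the content the later relevance and hallucination metrics examine.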
All right.
So now we're going to talk about seven key metrics to evaluate your RAG models.
Our first method of evaluation is our ROUGE score.
ROUGE is all about recall and completeness.
So when we have a response
we're going to look at that response generated by the model and compare
that to a group of expected responses generated by humans.
Now we're going to compare the specific words
in that text that the computer generated.
And we're not going to look at just one word; we're going to look at a number of words in a sequence and compare
how complete our generated response
was relative to the group of responses.
That's known as our ROUGE score.
And this score will range between zero and one.
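The word-sequence comparison described above can be sketched as an n-gram recall computation. This is a minimal single-reference illustration (real ROUGE implementations handle multiple references and report several variants such as ROUGE-1, ROUGE-2, and ROUGE-L):

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N recall: fraction of the reference's n-grams found in the candidate."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped matches: each n-gram counts at most as often as it appears in the reference.
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)

score = rouge_n_recall("the capital of new york is albany",
                       "albany is the capital of new york")
```

A score of 1.0 means every reference n-gram appears in the generated text, matching the "completeness" framing above; a score of 0.0 means none do.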
Next we have our BLEU score.
I'm curious if anyone recognizes that these are French words,
so we'd love to see a comment if you know where these names originated.
The BLEU score is all about precision.
So again, we're looking at
that computer-generated response compared to the group of what we expect,
and we're looking at the precision of the individual words in relation to the whole text.
So in this case there are certain instances where that precision
and accuracy
can be inhibited by longer responses, because you're penalizing responses that are longer than the original.
So this is just one thing that you may want to consider when using the BLEU methodology.
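The precision idea can be sketched as below. This is a simplified unigram version (real BLEU averages 1- to 4-gram precisions over a whole corpus); note how extra words in a long candidate dilute precision, while the brevity penalty scales down candidates shorter than the reference:

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Simplified single-sentence BLEU: unigram precision with a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    # Clipped matches: each candidate word counts at most as often as in the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)  # long candidates dilute this ratio
    # Brevity penalty: candidates shorter than the reference are scaled down.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("new york city new york", "new york")
```

Here the candidate repeats words the reference contains only once, so clipping caps the matches and the score drops well below 1.0.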
Next we have our METEOR score.
So this is going to give us an average
of both precision and recall from our first and second points here.
This is a great way to get a more well-rounded picture of how the model's performing.
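The precision/recall combination at METEOR's core is a weighted harmonic mean that favors recall. This sketch shows only that combination step; the full METEOR metric additionally uses stemming, synonym matching, and a fragmentation penalty:

```python
def meteor_fmean(precision: float, recall: float, recall_weight: float = 9.0) -> float:
    """METEOR-style F-mean: a harmonic mean of precision and recall,
    weighted 9:1 toward recall as in the original metric."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    w = recall_weight
    return (1 + w) * precision * recall / (recall + w * precision)

balanced = meteor_fmean(1.0, 1.0)   # perfect precision and recall
lopsided = meteor_fmean(0.5, 1.0)   # recall dominates the combined score
```

Because recall is weighted heavily, a response that covers the reference completely scores well even with middling precision, which is the "well-rounded" behavior described above.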
Now we're going to move on to our next section, which is all about
the content of information that you're putting into your model.
And we'll start off with PII,
otherwise known as personally identifiable information.
And this is everything that will identify who you are.
Things like a telephone number, an email, a name.
These are things that you likely do not want generated by a model,
and they can expose you to huge liabilities from an individual and a consumer point of view.
So it's very important to know not just what's being fed out of the model, but also what you're feeding into your model.
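A monitoring hook for this can be sketched with simple pattern rules. The regexes below are illustrative stand-ins covering only emails and one phone format; production systems pair such rules with trained PII/NER detectors on both model inputs and outputs:

```python
import re

# Toy PII screen; these simplified patterns are illustrative, not exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}"),
}

def find_pii(text: str) -> dict:
    """Return each PII category found in the text, with the matching spans."""
    hits = {kind: pat.findall(text) for kind, pat in PII_PATTERNS.items()}
    return {kind: found for kind, found in hits.items() if found}

flagged = find_pii("Reach Jane at jane.doe@example.com or 555-123-4567.")
clean = find_pii("Albany is the capital of New York.")
```

Running such a check on both the prompt and the completion covers the "in and out" monitoring the speaker calls for.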
We also have hate, abuse and profanity, also known as our HAP score.
Not a good thing if the model is spitting out hateful, abusive, or profane content.
So you'll want to monitor that model at all times to make sure this information is not appearing.
We do not want this to happen.
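A HAP check can be sketched the same way. In this toy version, FLAGGED_TERMS is a hypothetical placeholder list and the score is just the fraction of flagged tokens; real monitoring uses a trained hate/abuse/profanity classifier rather than a keyword list:

```python
# Hypothetical placeholder terms; a real deployment uses a trained HAP classifier.
FLAGGED_TERMS = {"badword", "slurword"}

def hap_score(text: str) -> float:
    """Fraction of tokens that hit the flagged-term list (0.0 = clean)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in FLAGGED_TERMS for token in tokens) / len(tokens)

clean = hap_score("a perfectly polite answer")
flagged = hap_score("badword badword polite badword")
```

Alerting when this score rises above zero is the continuous "dashboard light" behavior the analogy at the top describes.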
We're going to move on to my favorite last two tips here.
Talking actually about how relevant is the content.
So first we have context relevance.
Which is extremely important.
Let's say we have a question about the state of New York.
And here's my state of New York.
And we want to know specifically where is the state of New York and what is its capital.
So two questions that we're putting into retrieval augmented generation model
and two answers that we want back in one sentence.
Well, if we had poor context relevance, we might give a correct answer
that is completely not relevant to that question.
Like: New York is known as the Empire State.
That's a true statement, but it doesn't answer our original questions about where New York is and what its capital is.
That would be an example of measuring the relevancy of the context.
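A toy version of that relevance measurement checks how many of the question's content words are echoed in the answer; the stopword list here is an illustrative assumption, and real evaluators use embedding similarity or an LLM judge instead:

```python
STOPWORDS = {"the", "is", "of", "and", "what", "where", "its", "a", "in"}

def context_relevance(question: str, answer: str) -> float:
    """Toy relevance: fraction of the question's content words echoed in the answer."""
    q_words = {w for w in question.lower().split() if w not in STOPWORDS}
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / max(len(q_words), 1)

question = "where is new york and what is its capital"
relevant = context_relevance(question, "new york is on the east coast and its capital is albany")
off_topic = context_relevance(question, "new york is known as the empire state")
```

The Empire State answer still mentions New York, so it scores above zero, but it scores lower than the answer that actually addresses location and capital, which is the ranking behavior we want from the metric.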
Finally, we have something that's extremely important, and that's hallucination.
So we want to make sure the model is not giving answers which are incorrect
or completely wrong, and making us think that they're correct.
So back to our New York example: to get a low hallucination score
and a high relevance score,
we want to state answers to both questions.
New York is on the East coast.
It's located to the north of New Jersey, to the west of Connecticut.
And the capital is Albany.
So no hallucination and a very context relevant response.
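A toy groundedness check makes the hallucination idea concrete: score the share of answer content words that have no support in the retrieved context (the stopword list is again an illustrative assumption; real pipelines use NLI models or LLM judges to check claims, not word overlap):

```python
STOPWORDS = {"the", "is", "of", "and", "to", "a", "on", "its"}

def hallucination_score(answer: str, context: str) -> float:
    """Fraction of answer content words unsupported by the context (0.0 = fully grounded)."""
    ans_words = [w for w in answer.lower().split() if w not in STOPWORDS]
    ctx_words = set(context.lower().split())
    if not ans_words:
        return 0.0
    return sum(w not in ctx_words for w in ans_words) / len(ans_words)

context = ("new york is on the east coast north of new jersey "
           "west of connecticut and its capital is albany")
grounded = hallucination_score("new york capital is albany", context)
made_up = hallucination_score("the capital is paris", context)
```

The Albany answer is fully supported by the retrieved context and scores 0.0, while the Paris answer introduces an unsupported word and scores higher, which is exactly the low-hallucination, high-relevance outcome described above.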
So now we've covered our seven RAG metrics to master your evaluation.
There are many more metrics out there, so I'd love to hear in the comments
some of your favorite metrics that you use to monitor RAG.
Make sure to use these metrics to minimize the risk of your model in production.