RAG Evaluation: Metrics and Monitoring
Key Points
- The speaker likens monitoring generative AI models to a car’s dashboard, emphasizing the need for continuous metrics to ensure safety and reliability.
- Retrieval‑augmented generation (RAG) combines up‑to‑date vector‑store data from multiple sources to answer questions in natural language.
- ROUGE is presented as a recall‑oriented metric that measures how completely a model's response matches a set of human‑generated references, yielding a score between 0 and 1.
- BLEU, a precision‑focused metric (its name, like ROUGE's, is a French word), evaluates how many words in the model's output align with the reference set, but longer responses can dilute its precision.
- The discussion begins to introduce the METEOR score as another evaluation measure, highlighting its role in providing an averaged assessment of model performance.
Sections
- RAG Evaluation Through Dashboard Analogy - The passage likens vehicle dashboard indicators to AI monitoring metrics, stressing the importance of tracking key measures to safely assess and improve Retrieval‑Augmented Generation systems.
- Evaluating Model Scores and PII - The speaker outlines BLEU and METEOR metrics—highlighting precision, recall, and length penalties—then cautions about the inclusion and generation of personally identifiable information when using language models.
- Evaluating RAG: Relevance and Hallucination - The speaker illustrates how to assess retrieval‑augmented generation models by measuring context relevance and hallucination, using a New York location and capital example to show desired low hallucination and high relevance scores.
**Source:** [https://www.youtube.com/watch?v=DRZMjP5Pg5A](https://www.youtube.com/watch?v=DRZMjP5Pg5A) **Duration:** 00:08:21
- [00:00:00](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=0s) **RAG Evaluation Through Dashboard Analogy**
- [00:03:05](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=185s) **Evaluating Model Scores and PII**
- [00:06:15](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=375s) **Evaluating RAG: Relevance and Hallucination**
Full Transcript
Today we're going to master RAG evaluation with key metrics.
Let's say you're getting into your car in the morning, right?
You look at your dashboard, and you have several things on it:
everything from your speedometer, so you know how fast you're going
and whether you might get a ticket, to your gas gauge,
because it's very important to know whether you're running on empty or full, to make sure
you're not stuck on your way, or late because you had to stop for gas.
And then you have things like the engine light and the seatbelt light,
so it's important to know if someone's not buckled in in the vehicle,
or maybe you need an oil change, or something is wrong with your engine.
Well, there's no way to know many of these things unless we use
monitors and metrics that are provided by the vehicle to help us stay safe.
Now, the same thing happens for your generative AI models.
We need to make sure that we're monitoring these models,
so that we minimize the risk that dangerous things will happen to you
as you're using them on your journey down the road.
Let's talk a little bit more about retrieval-augmented generation,
otherwise known as RAG.
Retrieval-augmented generation
is a very popular generative
AI method that takes information in a vector database,
large amounts of information that's regularly
updated to make sure you are current and up to date.
You can ask a question and receive answers about this information, all in one location, in natural language, which is key.
So not just the source of your information, but actually
compiling information from multiple sources all in one place.
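The retrieval half of that loop can be sketched very simply. The toy ranker below stands in for a vector database, scoring documents by word overlap with the question; a real RAG stack would use embedding search over a vector store and pass the top hits to a language model:

```python
def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the question.
    A real RAG system uses embedding similarity over a vector database."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

corpus = [
    "albany is the capital of new york",
    "paris is the capital of france",
    "bread recipes for beginners",
]
top = retrieve("what is the capital of new york", corpus, k=1)
```

The retrieved passages would then be stitched into the model's prompt as context, which is exactly the content the later relevance and hallucination metrics examine.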
All right.
So now we're going to talk about seven key metrics to evaluate your RAG models.
Our first method of evaluation is our ROUGE score.
ROUGE is all about recall and completeness.
So when we have a response
we're going to look at that response generated by the model and compare
that to a group of expected responses generated by humans.
Now we're going to compare the specific words
in that text that the computer generated.
And we're not going to look at just one word; we're going to look at a number of words in a sequence and compare
how complete our generated response
was relative to the group of responses.
That's known as our ROUGE score.
And this score will range between zero and one.
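The word-sequence comparison described above can be sketched as an n-gram recall computation. This is a minimal single-reference illustration (real ROUGE implementations handle multiple references and report several variants such as ROUGE-1, ROUGE-2, and ROUGE-L):

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N recall: fraction of the reference's n-grams found in the candidate."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped matches: each n-gram counts at most as often as it appears in the reference.
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)

score = rouge_n_recall("the capital of new york is albany",
                       "albany is the capital of new york")
```

A score of 1.0 means every reference n-gram appears in the generated text, matching the "completeness" framing above; a score of 0.0 means none do.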
Next we have our BLEU score.
I'm curious if anyone recognizes that these are French words,
so we'd love to see a comment if you know where these names originated.
The BLEU score is all about precision.
So again, we're looking at
that computer-generated response compared to the group of what we expect,
and we're looking at the precision of the individual words in relation to the whole text.
So in this case there are certain instances where that precision
and accuracy
can be inhibited by longer responses, because you're penalizing responses that are longer than the original.
So this is just one thing that you may want to consider when using the BLEU methodology.
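The precision idea can be sketched as below. This is a simplified unigram version (real BLEU averages 1- to 4-gram precisions over a whole corpus); note how extra words in a long candidate dilute precision, while the brevity penalty scales down candidates shorter than the reference:

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Simplified single-sentence BLEU: unigram precision with a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    # Clipped matches: each candidate word counts at most as often as in the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)  # long candidates dilute this ratio
    # Brevity penalty: candidates shorter than the reference are scaled down.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("new york city new york", "new york")
```

Here the candidate repeats words the reference contains only once, so clipping caps the matches and the score drops well below 1.0.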
Next we have our METEOR score.
So this is going to give us an average
of both precision and recall from our first and second points here.
This is a great way to get a more well-rounded picture of how the model's performing.
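The precision/recall combination at METEOR's core is a weighted harmonic mean that favors recall. This sketch shows only that combination step; the full METEOR metric additionally uses stemming, synonym matching, and a fragmentation penalty:

```python
def meteor_fmean(precision: float, recall: float, recall_weight: float = 9.0) -> float:
    """METEOR-style F-mean: a harmonic mean of precision and recall,
    weighted 9:1 toward recall as in the original metric."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    w = recall_weight
    return (1 + w) * precision * recall / (recall + w * precision)

balanced = meteor_fmean(1.0, 1.0)   # perfect precision and recall
lopsided = meteor_fmean(0.5, 1.0)   # recall dominates the combined score
```

Because recall is weighted heavily, a response that covers the reference completely scores well even with middling precision, which is the "well-rounded" behavior described above.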
Now we're going to move on to our next section, which is all about
the content of information that you're putting into your model.
And we'll start off with PII,
otherwise known as personally identifiable information.
And this is everything that will identify who you are.
Things like a telephone number, an email, a name.
These are things that you likely do not want generated by a model,
and they can expose you to huge liabilities from an individual and a consumer point of view.
So it's very important to know not just what's being fed out of the model, but also what you're feeding into your model.
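A monitoring hook for this can be sketched with simple pattern rules. The regexes below are illustrative stand-ins covering only emails and one phone format; production systems pair such rules with trained PII/NER detectors on both model inputs and outputs:

```python
import re

# Toy PII screen; these simplified patterns are illustrative, not exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}"),
}

def find_pii(text: str) -> dict:
    """Return each PII category found in the text, with the matching spans."""
    hits = {kind: pat.findall(text) for kind, pat in PII_PATTERNS.items()}
    return {kind: found for kind, found in hits.items() if found}

flagged = find_pii("Reach Jane at jane.doe@example.com or 555-123-4567.")
clean = find_pii("Albany is the capital of New York.")
```

Running such a check on both the prompt and the completion covers the "in and out" monitoring the speaker calls for.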
We also have hate, abuse and profanity, also known as our HAP score.
Not a good thing if the model is spitting out hateful, abusive, or profane content.
So you'll want to monitor that model at all times to make sure this information is not appearing.
We do not want this to happen.
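A HAP check can be sketched the same way. In this toy version, FLAGGED_TERMS is a hypothetical placeholder list and the score is just the fraction of flagged tokens; real monitoring uses a trained hate/abuse/profanity classifier rather than a keyword list:

```python
# Hypothetical placeholder terms; a real deployment uses a trained HAP classifier.
FLAGGED_TERMS = {"badword", "slurword"}

def hap_score(text: str) -> float:
    """Fraction of tokens that hit the flagged-term list (0.0 = clean)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in FLAGGED_TERMS for token in tokens) / len(tokens)

clean = hap_score("a perfectly polite answer")
flagged = hap_score("badword badword polite badword")
```

Alerting when this score rises above zero is the continuous "dashboard light" behavior the analogy at the top describes.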
We're going to move on to my favorite last two tips here.
Talking actually about how relevant is the content.
So first we have context relevance.
Which is extremely important.
Let's say we have a question about the state of New York.
And here's my state of New York.
And we want to know specifically where is the state of New York and what is its capital.
So two questions that we're putting into retrieval augmented generation model
and two answers that we want back in one sentence.
Well, if we had poor context relevance, we might give a correct answer
that is completely not relevant to that question.
Like: New York is known as the Empire State.
That's a true statement, but it doesn't answer our original questions about where New York is and what its capital is.
That would be an example of measuring the relevancy of the context.
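A toy version of that relevance measurement checks how many of the question's content words are echoed in the answer; the stopword list here is an illustrative assumption, and real evaluators use embedding similarity or an LLM judge instead:

```python
STOPWORDS = {"the", "is", "of", "and", "what", "where", "its", "a", "in"}

def context_relevance(question: str, answer: str) -> float:
    """Toy relevance: fraction of the question's content words echoed in the answer."""
    q_words = {w for w in question.lower().split() if w not in STOPWORDS}
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / max(len(q_words), 1)

question = "where is new york and what is its capital"
relevant = context_relevance(question, "new york is on the east coast and its capital is albany")
off_topic = context_relevance(question, "new york is known as the empire state")
```

The Empire State answer still mentions New York, so it scores above zero, but it scores lower than the answer that actually addresses location and capital, which is the ranking behavior we want from the metric.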
Finally, we have something that's extremely important, and that's hallucination.
So we want to make sure the model is not giving answers which are incorrect
or completely wrong, and making us think that they're correct.
So back to our New York example: to get a low hallucination score
and a high relevance score,
we want to state answers to both questions.
New York is on the East coast.
It's located to the north of New Jersey, to the west of Connecticut.
And the capital is Albany.
So no hallucination and a very context relevant response.
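A toy groundedness check makes the hallucination idea concrete: score the share of answer content words that have no support in the retrieved context (the stopword list is again an illustrative assumption; real pipelines use NLI models or LLM judges to check claims, not word overlap):

```python
STOPWORDS = {"the", "is", "of", "and", "to", "a", "on", "its"}

def hallucination_score(answer: str, context: str) -> float:
    """Fraction of answer content words unsupported by the context (0.0 = fully grounded)."""
    ans_words = [w for w in answer.lower().split() if w not in STOPWORDS]
    ctx_words = set(context.lower().split())
    if not ans_words:
        return 0.0
    return sum(w not in ctx_words for w in ans_words) / len(ans_words)

context = ("new york is on the east coast north of new jersey "
           "west of connecticut and its capital is albany")
grounded = hallucination_score("new york capital is albany", context)
made_up = hallucination_score("the capital is paris", context)
```

The Albany answer is fully supported by the retrieved context and scores 0.0, while the Paris answer introduces an unsupported word and scores higher, which is exactly the low-hallucination, high-relevance outcome described above.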
So now we've covered our seven RAG metrics to master your evaluation.
There are many more metrics out there, so I'd love to hear in the comments
some of your favorite metrics that you use to monitor RAG.
Make sure to use these metrics to minimize the risk of your model in production.