
RAG Evaluation: Metrics and Monitoring

Key Points

  • The speaker likens monitoring generative AI models to a car’s dashboard, emphasizing the need for continuous metrics to ensure safety and reliability.
  • Retrieval‑augmented generation (RAG) combines up‑to‑date vector‑store data from multiple sources to answer questions in natural language.
  • ROUGE is presented as a recall‑oriented metric that measures how completely a model’s response matches a set of human‑generated references, yielding a score between 0 and 1.
  • BLEU, a precision‑focused metric (its name, like ROUGE’s, is a French word, as the speaker notes), evaluates how many of the words in the model’s output align with the reference set; its precision can suffer for overly long responses.
  • The METEOR score combines precision and recall into a single averaged measure, giving a more well‑rounded assessment of model performance.
  • PII (personally identifiable information) and the HAP (hate, abuse, and profanity) score monitor what flows into and out of the model, guarding against liability and harmful content.
  • Context relevance and hallucination checks, illustrated with a New York location‑and‑capital example, verify that answers actually address the question and are factually grounded.

Full Transcript

# RAG Evaluation: Metrics and Monitoring

**Source:** [https://www.youtube.com/watch?v=DRZMjP5Pg5A](https://www.youtube.com/watch?v=DRZMjP5Pg5A)
**Duration:** 00:08:21

## Sections

- [00:00:00](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=0s) **RAG Evaluation Through Dashboard Analogy** - The passage likens vehicle dashboard indicators to AI monitoring metrics, stressing the importance of tracking key measures to safely assess and improve Retrieval‑Augmented Generation systems.
- [00:03:05](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=185s) **Evaluating Model Scores and PII** - The speaker outlines BLEU and METEOR metrics, highlighting precision, recall, and length penalties, then cautions about the inclusion and generation of personally identifiable information when using language models.
- [00:06:15](https://www.youtube.com/watch?v=DRZMjP5Pg5A&t=375s) **Evaluating RAG: Relevance and Hallucination** - The speaker illustrates how to assess retrieval‑augmented generation models by measuring context relevance and hallucination, using a New York location and capital example to show desired low hallucination and high relevance scores.

## Full Transcript
0:00 Today we're going to master RAG evaluation with key metrics.
0:04 Let's say you're getting into your car in the morning, right?
0:08 You're looking at your dashboard, and you have several things on your dashboard.
0:14 Everything from your speedometer, to know how fast you're going
0:19 and whether you might get a ticket, to your gas gauge,
0:23 because it's very important to know whether you're on empty or full, to make sure
0:29 that you're not stuck on your way, or late because you had to stop and get gas,
0:34 and then you have things like the engine light.
0:37 So it's important to know if someone's not buckled in in the vehicle,
0:42 or maybe you need an oil change, or something is wrong with your engine.
0:48 Well, there's no way to know many of these things unless we use
0:52 monitors and metrics that are provided by the vehicle to help us stay safe.
0:58 Now, the same thing goes for your generative AI models.
1:03 We need to make sure that we're monitoring these models
1:06 so that you minimize the risk that dangerous things will happen to you
1:10 as you're using them on your journey down the road.
1:13 Let's talk a little bit more about retrieval augmented generation,
1:19 otherwise known as RAG.
1:23 Retrieval augmented generation
1:26 is a very popular generative
1:29 AI method that takes information in a vector database,
1:34 large amounts of information that's regularly
1:37 updated to make sure you are current and up to date.
1:42 And you can ask a question and receive answers about this information all in one location, in natural language, which is key.
1:55 So not just one source of information, but actually
1:58 compiling information from multiple sources all in one place.
2:03 All right.
2:03 So now we're going to talk about seven key metrics to evaluate your RAG models.
2:11 Our first method of evaluation is our ROUGE score.
2:18 ROUGE is all about recall and completeness.
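Before the metrics, the retrieval step just described (pulling the most relevant passage from a store of documents and answering in one place) can be sketched minimally. This is a hypothetical toy, not any particular product: the `DOCS` list stands in for the vector database, and bag‑of‑words counts stand in for real embeddings.

```python
from collections import Counter
import math

# Toy "vector store": in practice this is a real vector database holding
# embeddings of large amounts of regularly updated information.
DOCS = [
    "New York is on the East Coast, and its capital is Albany.",
    "Paris is the capital of France.",
]

def vec(text):
    """Bag-of-words vector; a crude stand-in for a learned embedding."""
    cleaned = text.lower().replace(".", "").replace(",", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, docs=DOCS):
    # Return the stored passage most similar to the question; a full RAG
    # system would hand this passage to the generator as context.
    q = vec(question)
    return max(docs, key=lambda d: cosine(q, vec(d)))
```

A real system would embed the question with the same model used to index the documents and return the top-k passages, not just one.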
2:22 So when we have a response,
2:25 we're going to look at that response generated by the model and compare
2:31 it to a group of expected responses generated by humans.
2:37 Now, we're going to compare the specific words
2:42 in that text that the computer generated.
2:45 And we're not going to look at just one word; we're going to look at a number of words in a sequence and compare
2:52 how complete our generated response
2:55 was relative to the group of responses.
2:59 That's known as our ROUGE score.
3:02 And this score will range between a zero and a one.
3:06 Next we have our BLEU score.
3:11 And I'm curious if anyone recognizes that these are French words,
3:17 so I'd love to see a comment if you know where these measurements originated.
3:24 The BLEU score is all about precision.
3:28 So we're looking again at
3:31 that computer-generated response compared to the group of what we expect.
3:38 And we're looking at the precision of the individual words in relation to the whole text.
3:43 So in this case there are certain instances where that precision
3:51 and accuracy
3:52 can be inhibited by longer responses, because you're penalizing longer responses relative to the original.
4:02 So this is just one thing that you may want to consider when using the BLEU methodology.
4:09 Next we have our METEOR score.
4:14 This is going to give us an average
4:18 of both precision and recall from our first and second points here.
4:25 This is a great way to get a more well-rounded score for how the model's performing.
4:33 Now we're going to move on to our next section, which is all about
4:40 the content of the information that you're putting into your model.
4:45 And we'll start off with PII,
4:50 otherwise known as personally identifiable information.
4:57 This is everything that will identify who you are:
5:01 things like a telephone number, an email, a name.
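The ROUGE, BLEU, and METEOR descriptions above can be made concrete with a simplified, unigram-only sketch. Real implementations handle multiple n-gram orders, multiple references, stemming, and synonyms; this toy version only counts exact word matches, so treat it as an illustration of recall vs. precision, not a drop-in scorer.

```python
import math
from collections import Counter

def _overlap(candidate, reference):
    """Clipped count of candidate words that also appear in the reference."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(n, ref[w]) for w, n in cand.items())

def rouge_1(candidate, reference):
    """Recall: what fraction of the reference words does the response recover? (0 to 1)"""
    return _overlap(candidate, reference) / max(len(reference), 1)

def bleu_1(candidate, reference):
    """Precision of the response's words, times BLEU's brevity penalty.
    Unmatched extra words drag precision down; responses shorter than the
    reference are penalized by the exponential brevity term."""
    precision = _overlap(candidate, reference) / max(len(candidate), 1)
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * precision

def meteor_like(candidate, reference):
    """METEOR-style harmonic mean of precision and recall, weighted toward
    recall (omitting METEOR's stemming, synonym matching, and chunk penalty)."""
    p = _overlap(candidate, reference) / max(len(candidate), 1)
    r = rouge_1(candidate, reference)
    return 10 * p * r / (r + 9 * p) if p and r else 0.0
```

All three return values in [0, 1], matching the 0-to-1 range the speaker mentions for ROUGE.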
5:07 These are things that you likely do not want generated from a model,
5:13 and they can expose you to huge liabilities from an individual and a consumer point of view.
5:19 So it's very important to know not just what's being fed out of the model, but also what you're feeding into your model.
5:29 We also have hate, abuse, and profanity, also known as our HAP score.
5:39 It's not a good thing if the model is spitting out hateful, abusive, or profane content.
5:45 So you'll want to monitor that model at all times to make sure this information is not appearing.
5:52 We do not want this to happen.
5:55 We're going to move on to my favorite last two tips here,
5:58 talking about how relevant the content actually is.
6:03 So first we have context relevance,
6:12 which is extremely important.
6:15 Let's say we have a question about the state of New York.
6:24 And here's my state of New York.
6:27 We want to know specifically where the state of New York is and what its capital is.
6:34 So there are two questions that we're putting into the retrieval augmented generation model,
6:39 and two answers that we want back in one sentence.
6:43 Well, if we had poor context relevance, we might give a correct answer
6:48 that is completely not relevant to that question,
6:52 like "New York is known as the Empire State."
7:00 A true statement, but it doesn't answer our original question about where New York is and what the capital is.
7:09 That would be an example of measuring the relevancy of the context.
7:14 Finally, we have something that's extremely important, and that's hallucination.
7:23 We want to make sure the model is not giving answers which are incorrect
7:29 or completely wrong, and making us think that they're correct.
7:35 So back to our New York example: to have a low score
7:39 for hallucination and a high score for relevance,
7:42 we want to state answers to both questions.
7:45 New York is on the East Coast.
7:48 It's located to the north of New Jersey and to the west of Connecticut.
7:52 And the capital is Albany.
7:54 So no hallucination, and a very context-relevant response.
8:00 Now we've covered our seven RAG metrics to master your evaluation.
8:07 There are many more metrics out there, so I'd love to hear in the comments
8:11 some of your favorite metrics that you use to monitor RAG.
8:15 Make sure to use these metrics to minimize the risk of your model in production.
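Two of the content checks above, PII scanning and context relevance, can be approximated with simple heuristics. This is a hypothetical sketch only: production PII detectors cover far more formats than these two regexes, and real relevance scoring uses trained models or embedding similarity rather than word overlap.

```python
import re

# Hypothetical PII patterns; real detectors handle many more identifier types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_pii(text):
    """Return the kinds of PII found, for flagging model inputs and outputs."""
    return {kind: pat.findall(text) for kind, pat in PII_PATTERNS.items() if pat.search(text)}

STOPWORDS = {"the", "is", "of", "and", "what", "its", "a", "an", "to"}

def _content_words(text):
    return {w.lower().strip("?.,!") for w in text.split()} - STOPWORDS

def context_relevance(question, answer):
    """Crude lexical proxy: fraction of the question's content words echoed
    in the answer. A relevant answer to the New York question should score
    higher than a true-but-irrelevant one."""
    q, a = _content_words(question), _content_words(answer)
    return len(q & a) / max(len(q), 1)
```

On the speaker's example, "New York is on the East Coast ... and the capital is Albany" echoes more of the question's content words than the true but off-topic "New York is known as the Empire State," so the relevant answer scores higher.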