Improving AI Accuracy with Retrieval Augmentation
Key Points
- The speakers illustrate how AI can confidently give absurd, incorrect advice—like using industrial glue to keep pizza toppings in place—highlighting the risk of blindly trusting AI outputs.
- They note that AI errors differ from human mistakes, often producing confident hallucinations that can mislead users when important decisions rely on AI advice.
- To improve AI accuracy, they introduce Retrieval‑Augmented Generation (RAG), which supplements a language model’s knowledge with up‑to‑date, trusted information retrieved from external sources.
- By querying a vector database for relevant documents and appending that context to the user’s prompt, RAG helps the model generate more reliable, factually correct responses.
- The overall message is that integrating external, curated data via RAG can significantly reduce AI hallucinations and increase confidence in AI‑driven decisions.
Sections
- When AI Gives Bad Advice - A dialogue spotlights an AI's absurd recommendation to glue pizza toppings, underscores the confidence‑driven errors AI can make, and introduces retrieval‑augmented generation (RAG) as a technique to boost answer accuracy.
- Choosing the Right AI Model - The speaker explains that model size and domain‑specific training influence hallucination risk, noting that large, general models excel on broad questions, whereas smaller, specialized models provide more reliable answers within their expertise.
- Chain‑of‑Thought and LLM Chaining - The passage outlines zero‑shot and few‑shot chain‑of‑thought prompting techniques for improved reasoning, then introduces LLM chaining—using multiple models to reach a consensus—to enhance overall AI accuracy.
- Mixture of Experts Routing - The speaker explains how a gating network directs each query to specialized sub‑models (experts) such as math, law, or language, combines their outputs, and achieves higher accuracy and fewer errors compared to a single model.
- Tailoring Model Settings for Use Cases - The speaker explains how prompt type, temperature, system prompts, and reinforcement learning with human feedback work together to balance factual accuracy and creative variety in AI responses.
Full Transcript
# Improving AI Accuracy with Retrieval Augmentation **Source:** [https://www.youtube.com/watch?v=pNbU1vGkIK4](https://www.youtube.com/watch?v=pNbU1vGkIK4) **Duration:** 00:13:59 ## Summary - The speakers illustrate how AI can confidently give absurd, incorrect advice—like using industrial glue to keep pizza toppings in place—highlighting the risk of blindly trusting AI outputs. - They note that AI errors differ from human mistakes, often producing confident hallucinations that can mislead users when important decisions rely on AI advice. - To improve AI accuracy, they introduce Retrieval‑Augmented Generation (RAG), which supplements a language model’s knowledge with up‑to‑date, trusted information retrieved from external sources. - By querying a vector database for relevant documents and appending that context to the user’s prompt, RAG helps the model generate more reliable, factually correct responses. - The overall message is that integrating external, curated data via RAG can significantly reduce AI hallucinations and increase confidence in AI‑driven decisions. ## Sections - [00:00:00](https://www.youtube.com/watch?v=pNbU1vGkIK4&t=0s) **When AI Gives Bad Advice** - A dialogue spotlights an AI's absurd recommendation to glue pizza toppings, underscores the confidence‑driven errors AI can make, and introduces retrieval‑augmented generation (RAG) as a technique to boost answer accuracy. - [00:03:09](https://www.youtube.com/watch?v=pNbU1vGkIK4&t=189s) **Choosing the Right AI Model** - The speaker explains that model size and domain‑specific training influence hallucination risk, noting that large, general models excel on broad questions, whereas smaller, specialized models provide more reliable answers within their expertise. - [00:06:13](https://www.youtube.com/watch?v=pNbU1vGkIK4&t=373s) **Chain‑of‑Thought and LLM Chaining** - The passage outlines zero‑shot and few‑shot chain‑of‑thought prompting techniques for improved reasoning, then introduces LLM chaining—using multiple models to reach a consensus—to enhance overall AI accuracy. - [00:09:17](https://www.youtube.com/watch?v=pNbU1vGkIK4&t=557s) **Mixture of Experts Routing** - The speaker explains how a gating network directs each query to specialized sub‑models (experts) such as math, law, or language, combines their outputs, and achieves higher accuracy and fewer errors compared to a single model. - [00:12:22](https://www.youtube.com/watch?v=pNbU1vGkIK4&t=742s) **Tailoring Model Settings for Use Cases** - The speaker explains how prompt type, temperature, system prompts, and reinforcement learning with human feedback work together to balance factual accuracy and creative variety in AI responses. ## Full Transcript
Jeff, pepperonis keep falling off my pizza.
Martin, that sounds like a personal problem to me,
but don't worry, I have a solution.
Just use this industrial strength glue and your problem will be solved.
That sounds awful.
Well that's what my AI chatbot recommended and we know AI is never wrong, right?
Well actually, I am going to disagree you there.
In fact, AI can make mistakes of this sort which are substantially different than those that a human would make.
Right, AI can come up with things you and I would immediately dismiss and do it with a level of confidence that is absolutely stunning.
Right, and if you ask it to try again, it'll often apologize but then come up with something completely different
which kind of makes you wonder which parts to trust and which ones not to.
So if we're going to depend on AI to advise us on important decisions, we need it to be correct.
What can we do to improve the accuracy of AI?
That is a good question. So let's take a look at some techniques that can do just that.
So the first technique we're going to talk about is called RAG.
I've got you covered here Martin.
Yeah maybe not that rag.
It's actually an acronym, retrieval augmented generation.
Okay, so we're not gonna be using this after all.
Not today.
So this is all about adding in additional information into a large language model to help it be able to answer a question more accurately.
So if you think about a large-language model today, it's trained on a certain data set of information.
It knows what it knows from that training,
but what if we have a user who comes in and asks a question that requires some information that's not in that training data set.
So maybe it's something that's newer than, you know, it came out after the LLM was trained, or it's just something that wasn't trained on.
The problem there is the LLM will still have a good guess as to how to answer your query,
but it's quite likely it's going to create a hallucination and actually be wrong with low accuracy.
That's how we end up with pizzas and glue.
That's exactly it.
So what we need to do is to introduce a trusted data source into this before the LLM sees the query.
So in this case, we have a trusted data source, probably a vector database,
and we can use a retriever, that's what the R really stands for in RAG,
to query that vector database and to retrieve documents that will be relevant to that particular user's query,
and then it can populate that into the query, before the large language model actually sees it.
So in its context window now, it has the user's initial prompt,
plus we have embellished that prompt with some additional relevant information in order for the LLM to be able to answer the question.
You've augmented the query.
Exactly, that's the A.
Ah, there we go.
That's why they put that in there.
Yeah, and across the output of this is hopefully gonna be more accurate.
That's the G, and this should now have a much better chance of being the right answer because we've given the LLM the additional information that it needs.
Okay, Martin, another thing you can do to help improve AI accuracy is make sure you've got the right model.
And the model size makes a difference and what it's trained on makes a different.
It's all about picking the right tool.
In this case, it's about picking the right models that's fit for purpose.
For instance, a large model that knows many different domains will be less likely to hallucinate
if the question is broad versus a smaller model with more specific information.
Will be less likely to hallucinate if the question is within its area of expertise.
Okay, so if you ask your medical doctor how to get rid of a virus on your computer, he or she may not know and might just try to guess.
Yeah, but if you asked them a about how to treat a biological virus, they would have a much better answer.
You certainly hope so.
I would hope so, so for instance, let's take a look at an example with AI models.
We would have this large model, it would be trained on lots of different domains of information,
it might in fact know about medical information, it might know about law, it's trained in art, in sports, in technology, lots of things like that,
versus a small model, smaller model,
that is specifically trained, maybe just to know about cybersecurity knowledge.
So if I were to ask a cybersecurity question of this larger model,
then the likelihood that we're gonna get a good answer is a little bit lower because it has more area to hallucinate across.
But if I ask this model that is specifically trained in that particular domain a question, probably I'm gonna get a really good answer.
That makes sense, but what if I ask a more general question that's not necessarily cyber related?
Then I would suspect that actually there's the chance that this is not going to give you such a good answer,
whereas the general purpose model is more likely to be able to give a correct and accurate answer this time.
Exactly, so you want to choose the right model for the right purpose.
Alright, next up is COT, that's Chain of Thought Prompting.
Now this involves asking the LLM to explicitly generate an intermediate reasoning steps before giving a final answer.
And that can help reduce mistakes in problems where logical consistency is needed like a math problem.
How do you feel about math problems, Jeff?
I live for them.
Okay, let's give you one.
So consider we've got a factory that produces three times as many red widgets as it produces blue widgets.
So if the factory produces 240 widgets, how many blue widgets are produced?
Ok, this is easy, I'll play the useful idiot, it's 80.
Uh, that is not the right answer.
Uh...
Okay so I guess i'm gonna have to show my work.
We'll go ahead and put in an equation and we'll solve for B,
where B is the number of blue widgets and if I solve for that I end up with okay sixty, final answer. And that is the correct answer.
And what you've done here is you've gone through some reasoning steps to get to the answer, rather than picking the answer that you felt was intuitively correct.
And that's how LLMs work as well.
Sometimes the intuitive answer, the quick answer, is not the right one.
So there's a few different ways to perform chain of thought processing.
One way is to do it through something called zero-shot chain of thought processing, and this just simply adds a trigger phrase to the prompt.
It might be something like, let's think step by step.
And that will induce the model to produce a reasoning chain.
There's also few-shot chain of thought prompting.
Now this includes examples of questions in the prompt or examples of math problems in the prompts, and it includes the step-by-step solutions as well.
So the model can learn from how you've done it and apply it to the next query.
Then more recently, reasoning models have this chain of thought kind of built into them.
So chain of thought is great for generating more accurate responses specifically when reasoning is needed,
like this but it does little to improve accuracy of knowledge based answers.
So now I know why my math teacher always made me show my work.
Exactly.
Okay, Martin, another technique that can improve AI accuracy is something called LLM chaining,
and with LLM chaining, we're basically gonna get a consensus opinion.
So we're gonna rely on more than just one intelligence.
So let's put a few different, let's start with three LLMs.
And instead of just relying on a single one of those LLMs, I'm gonna actually come in with a prompt to one of those.
So I put my prompt into the first, and then I'm going to do this thing:
R and R.
Doesn't that sound great?
Rest and relaxation.
Yeah, no, slacker, I need you to focus on the problem.
No, it's revise and reflect.
So we're going to reflect the information that comes into this one and then send it on.
Then it's gonna take that and revise it with its own understanding and reflect that on,
and keep doing that until now we have had each one of these weigh in on this particular question.
So I'm not relying on one, I've got the collective wisdom of all three in this example.
Okay, so kind of a wisdom of the crowd things where you're just bringing in different experts and getting them to weigh in on there.
Exactly, exactly right.
And there's even a variation on this architecture, if you want to think of it this way.
If I had one sort of supervisor, decider,
then maybe instead of having it run through all three of them,
the prompt comes in this way and it goes and asks each one of them individually
and gets a response back and decides based upon those responses what to come out with as its answer.
Yes, so the supervisor here is effectively telling the LLM to be its own critic
because it's taking the responses and then critiquing them and deciding which ones to pick from each of these models to come up with the answer.
Exactly.
Another way of sort of crowdsourcing the answer, but in this case think of it sort of like a phone a friend.
You've got three friends, you're going to ask them all the same question, and you're gonna decide which one of them you want to actually go with.
Okay, any other ideas for improving AI accuracy?
I've got another one.
MoE, that's mixture of experts.
Now this improves model accuracy by gathering together multiple opinions,
a bit like LLM chaining,
but this time rather than using multiple LLMs models, a mixture of expert is made up of a specialized set of sub-models,
and each sub-model is considered an expert in a given specialization.
So let's consider maybe for these sub-models here.
And they're each good at different things.
Maybe math is one of them, and law perhaps, and maybe one's really good at putting together language.
Another one is trained on technology, for example.
So when a user submits a prompt, we use something called a gating network,
and that kind of acts like a router here to determine which expert should handle that input,
and then the outputs are all combined to produce a response.
So we might have a query that comes into the gating network here, and it turns out that query needs to use a little bit of math, which is going to help with some problem solving.
Once we've got that, maybe we want to format a nicely written response.
So we would use language there to format the grammatically correct response.
Then we've divided this problem between many experts to be able to handle a wider range of patterns and complexities.
Than maybe a standard model would be able to do.
We've essentially reduced the errors here through specialization.
And this sounds a lot like the LLM chaining that I referred to before, but there's one key difference,
and that is with LLM chaining, we've got different LLMs and we're feeding it through.
In this case, we have just one big LLM.
So we're routing this within a single LLM.
It's sort of like using, instead of three brains, we're using one brain, but we're just routing the question to a different lobe.
That's exactly right.
Another thing you can do, Martin, is to make your AI more accurate, is change the temperature setting of the model.
Ah, so this is a thermometer, huh?
It is a thermometer and while here we're showing what ambient room temperature might be,
in an AI model the settings would look different.
Right, so we might have a low temperature be 0.0, 0.5, maybe 1.0 and above.
Exactly and this would be the more deterministic version and this would be them or creative version,
and this would be the more creative version, so the higher up you go on this scale the more creative the answers are.
So if I have more creative answers and more deterministic answers, what does that actually mean?
So, think about the deterministic is going to be more factual, it's going to be more consistent, more predictable.
It's going to be in some ways more reliable and times that that is exactly what we want.
In other cases on the creative though, we might want something that's less predictable, something that is more varied,
and in fact, good examples of this might be, am I asking a science question?
Well, then I really want to be down here in the determinist version.
I want the facts,
but if I'm asking it to write lyrics for a song, okay, facts then might be pretty boring.
So art or music, more creative activities, I might want something that's less predictable and more varied.
Okay, so we've really got to pick the temperature setting based upon the type of prompt we're sending into it.
Exactly, the use case will matter.
All right, a couple of quick other ones that we should cover.
System prompt is one.
So this is a message that is included in every prompt, kind of secretly added in by the model.
And it can determine how the model actually works.
So you can put things in there that say, for example, you want to make sure that it provides accurate answers.
You're gonna have different levels of success of that but that is something you can try.
And you could put guardrails in so that you're able to guard against things like prompt injection attacks and that sort.
Another technique is reinforcement learning with human feedback.
In this case, basically the person is looking at the responses and saying, yes I agree, no I don't agree.
So you give it thumbs up and thumbs down and we're kind of rewarding or disincentivizing it from doing those same answers again in the future.
And that way we tune the model to get, again, a little more action.
So those are our methods that we think are kind of primary ways to be able to improve AI accuracy.
None of them are perfect and sometimes it takes a combination of all of them,
but we would love to hear from you what are your thoughts on these methods or there are other methods that you would recommend as well?
let us know and maybe we can talk about those in a future video
Absolutely