RAG vs CAG: Augmented Generation Explained

15m • ai-ml • deep-dive • intermediate

Key Points

Large language models can’t recall information not present in their training data, so they need external knowledge sources for up‑to‑date or proprietary facts.
Retrieval‑augmented generation (RAG) solves this by querying a searchable knowledge base, pulling relevant document chunks, and feeding them to the model as context before generating an answer.
Cache‑augmented generation (CAG) takes a different approach by loading the entire knowledge base into the model’s context window up front and caching the resulting internal state (the KV cache), so the model sees every piece of information rather than only the most relevant excerpts.
RAG operates in two stages: an offline phase that chunks documents and stores their vector embeddings in a vector database, and an online phase that converts a user query into a vector, retrieves the top‑K similar chunks, and combines them with the model’s generation step.

**Source:** [https://www.youtube.com/watch?v=HdafI0t3sEY](https://www.youtube.com/watch?v=HdafI0t3sEY)
**Duration:** 00:15:47

## Sections

- [00:00:00](https://www.youtube.com/watch?v=HdafI0t3sEY&t=0s) **Bridging LLM Knowledge Gaps** - The speaker explains how retrieval‑augmented generation (RAG) and cache‑augmented generation (CAG) let large language models access up‑to‑date or proprietary information by querying external sources or pre‑loading data into the model’s context window.
- [00:03:20](https://www.youtube.com/watch?v=HdafI0t3sEY&t=200s) **RAG Retrieval Workflow Explained** - The passage outlines how a RAG system embeds a user query, searches a vector database for relevant document chunks, and feeds them along with the query to a large language model, emphasizing the modularity of swapping components.
- [00:06:38](https://www.youtube.com/watch?v=HdafI0t3sEY&t=398s) **RAG vs CAG: Knowledge Processing** - The speaker contrasts retrieval‑augmented generation, which fetches only the relevant chunks at query time and can therefore sit on top of a massive external corpus, with cache‑augmented generation, which pre‑loads all documents into the model’s finite context window, and notes that RAG’s accuracy depends heavily on the retriever’s effectiveness.
- [00:09:43](https://www.youtube.com/watch?v=HdafI0t3sEY&t=583s) **RAG vs CAG: Scalability & Freshness** - The speaker contrasts retrieval‑augmented generation (RAG) and cache‑augmented generation (CAG), highlighting RAG’s ability to handle massive document collections and easy index updates, while CAG is limited by model context size and requires costly recomputation when data changes.
- [00:12:54](https://www.youtube.com/watch?v=HdafI0t3sEY&t=774s) **RAG vs CAG: Legal & Clinical** - The speaker argues that retrieval‑augmented generation is essential for dynamic, citation‑heavy legal queries, while a hybrid RAG‑then‑CAG approach is suggested for comprehensive, real‑time clinical decision support.

## Full Transcript

0:00 Left to their own devices, large language models have a bit of a knowledge problem. If a piece of information wasn't in their training set, they won't be able to recall it. Maybe it's something newsworthy that happened after the model completed training, such as who won Best Picture at the 2025 Oscars, or it could be something proprietary, like a client's purchase history.
0:22 So to overcome that knowledge problem, we can use augmented generation techniques. For example, retrieval. So, retrieval-augmented generation, otherwise known as RAG.
0:42 Now how does that work? Well, essentially we have here a model, and the model is going to query an external, searchable knowledge base. Here's where we've got our knowledge, and that's going to return portions of relevant documents to provide additional context. So we get the documents, we get some context, and we pass that to the LLM to update its knowledge, if you like, and that updated context is used to generate an answer. Anora, the film that won Best Picture this year? The model probably got that out of that data set.
1:24 But retrieval isn't the only augmented generation game in town. Another one is cache-augmented generation, or CAG. Now, CAG is an alternative method. Rather than querying a knowledge database for answers, the core idea of CAG is to preload the entire knowledge base. So we take everything we know and we put it all into the context window. All of it. The Oscar winners, last week's lunch special at the office cafeteria, whatever you want. So rather than feeding the model just curated knowledge, we are feeding the model everything, not just the stuff we deemed relevant to the query.
2:14 So, RAG versus CAG. Let's get into how these two things work, the capabilities of each approach, and an enticing game to test your own knowledge. And let's start with RAG.
2:28 So RAG is essentially a two-phase system.
2:30 You've got an offline phase where you ingest and index your knowledge, and then you've got an online phase where you retrieve and generate on demand.
2:38 And the offline part is pretty straightforward. You start with some documents. This is your knowledge. This could be Word files, PDFs, whatever. You're going to break them into chunks and create vector embeddings for each chunk with the help of something called an embedding model. That embedding model is going to create the embeddings, and they're going to be stored in a database, specifically a vector database. So you've essentially now created a searchable index of your knowledge.
3:25 So when a prompt comes in from the user, this is where the online phase is going to kick in. The first thing that's going to happen is we're going to go to a RAG retriever. That RAG retriever is gonna take the user's question and it's gonna turn it into a vector using the same embedding model that we used earlier, and then it's gonna perform a similarity search of your vector database.
3:56 Now that's gonna return the top K most relevant document chunks that are related to this query. There might be something like three to five passages that are most likely to contain the answer to the user's query. We're gonna take those chunks and we're gonna put them into the context window of the LLM alongside the user's initial query, and all of this is then gonna get sent to the large language model.
4:28 So the model is gonna see the question the user submitted plus these relevant bits of context, and use that to generate an answer. We're basically saying to the model: here's the question, here's some potentially useful information to help you answer it that we got out of this vector database, off you go.
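The two phases described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: `embed` is a toy bag-of-words counter standing in for a real embedding model, a plain list stands in for the vector database, and the function names and the cafeteria and manual sentences are made up for the example.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=40):
    """Offline step: split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents, max_words=40):
    """Offline phase: chunk every document and store (embedding, chunk) pairs."""
    return [(embed(c), c) for doc in documents for c in chunk(doc, max_words)]

def retrieve(index, query, k=3):
    """Online phase: embed the query the same way and return the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query, chunks):
    """Place the retrieved chunks in the prompt alongside the user's question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Illustrative knowledge base (the last two sentences are invented examples).
documents = [
    "Anora won Best Picture at the 2025 Oscars.",
    "The office cafeteria lunch special last week was vegetable curry.",
    "The product manual says to hold the reset button for ten seconds.",
]
index = build_index(documents)
question = "Which film won Best Picture at the Oscars?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)  # this string would be sent to the LLM
```

Because retrieval and prompt assembly are separate functions, swapping `embed` or the index for a real embedding model or vector database would not require touching the rest, which is the modularity the speaker points to next.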
4:47 And the beauty of RAG is that it's very modular, so you can swap out the vector database, you could swap in a different embedding model, or you could change the LLM without rebuilding the entire system. That's RAG.
5:03 What about CAG? Well, CAG takes a completely different approach. Instead of retrieving knowledge on demand, you front-load it all into the model's context at once. So we'll start again with our documents. This is all of our gathered knowledge. And we're gonna format them into one massive prompt that fits inside the model's context window. So here, it's gonna fit into this window. Now, this could be tens or even hundreds of thousands of tokens, and then the large language model is going to take this massive amount of input and it's going to process it.
5:47 So effectively this kind of knowledge blob is going to be processed in a single forward pass, and the system is going to capture and store the model's internal state after it's digested all of this information.
5:59 Now this internal state blob, it's actually got a name: it's called the KV cache, or the key-value cache. It's created from each self-attention layer and it represents the model's encoded form of all of your documents, all of your knowledge. So it's kind of like the model has already read your documents and now it's memorized them.
6:23 So when a user submits a query in this situation, we take all of this KV cache and we add the query to it, and all of that gets sent into the large language model. And because the transformer's cache has already got all of the knowledge tokens in it, the model can use any relevant information as it generates an answer without having to reprocess all of this text again.
6:52 So the fundamental distinction between RAG and CAG comes down to when and how knowledge is processed. With RAG, we say: let's fetch only the stuff that we think we're actually going to need.
7:06 CAG says: let's load everything, all of our documents, up front and then remember it for later.
7:12 So with RAG, your knowledge base can be really, really large. This could be millions of documents stored in here, because you're only retrieving small pieces at a time. The model only sees what's relevant for a particular query. Whereas with CAG, you are constrained by the size of the model's context window. Now, a typical model today can have a context window of something like 32,000 to 100,000 tokens. Some are a bit larger than that, but that's pretty standard. It's substantial, but it's still finite, and all of these docs need to fit in that window.
7:54 So let's talk about the capabilities of each approach, and we're going to start with accuracy. Now, RAG's accuracy is really intrinsically tied to a particular component. When we talk about accuracy with RAG, we are talking about the retriever. That's what's important here, because if the retriever fails to fetch a relevant document, well, then the LLM might not have the facts to answer correctly. But if the retriever works well, then it actually shields the LLM from receiving irrelevant information.
8:28 Now CAG, on the other hand, preloads all potentially relevant information. So it guarantees that the information is in there somewhere, I mean, assuming that the knowledge cache actually does contain the answer to the question being asked. But with CAG, all of the work is handed over to the model to extract the right piece of information from that large context. So there's the potential here that the LLM might get confused, or it might mix some unrelated information into its answer. So that's accuracy.
9:00 What about latency? Well, RAG introduces an extra step, namely the retrieval, into the query workflow, and that adds to response time.
9:11 So when we look at latency with RAG, it's a bit higher, because each query incurs the overhead of embedding the query, then searching the index, and then having the LLM process the retrieved text. But with CAG, once the knowledge is cached, answering a query is just one forward pass of the LLM on the user prompt plus the generation. There's no retrieval lookup time. So when it comes to latency, CAG is going to be lower.
9:40 Alright, what about scalability? Well, RAG can scale to as much as you can fit into your vector database. So we can have some very large data sets when we are using RAG, and that's because it only pulls a tiny slice of the data per query. So if you have 10 million documents, you can index them all and you can still retrieve just a few relevant ones for any single question.
10:09 The LLM is never going to see all 10 million documents at once. CAG, however, does have a hard limit. With CAG, the scalability restriction is basically the model's context size. We can only put in there what the model will allow us to fit. And as I mentioned earlier, that's typically like 32K to 100K tokens, so that might be a few hundred documents at most. And even as context windows grow, as they are expected to, RAG will likely always maintain a bit of an edge when it comes to scalability.
10:45 One more: data freshness. Now, when knowledge changes, RAG can just update the index very easily. It doesn't take a lot of work to do that. It can update incrementally as you add new document embeddings or as you remove outdated ones on the fly. It can always use new information with minimal downtime.
11:09 But CAG, on the other hand, is going to require some recomputation when anything actually changes.
11:17 If the data changes frequently, then CAG kind of loses some of its appeal, because you're essentially reloading often, which is going to negate the caching benefit.
11:26 All right, so let's play a game. It's called RAG or CAG. Now I'm gonna give you a use case and you're gonna shout out "RAG" if you think retrieval-augmented generation is the best option, or you'll yell out "CAG" if you think cache-augmented generation is the way to go. Ready? Alright.
11:45 Scenario one: I am building an IT help desk bot. Users can submit questions and the system's gonna use a product manual to help augment its answers. Now, the product manual is about 200 pages. It's only updated a few times a year. So, RAG or CAG? Don't be shy. Yelling acronyms at the screen is an entirely normal process.
12:11 All right, I'm gonna imagine that most people here are probably saying CAG for this one. The knowledge base, in this case the product manual, is small enough to fit in most LLM context windows; the information is pretty static, so the cache doesn't need to be updated very frequently; and by caching the information we'll be able to answer queries faster than if we had to query a vector database. So I think CAG is probably the answer for this one.
12:39 What about scenario two? With this one you're going to be building a research assistant for a law firm. Now, the system needs to search through thousands of legal cases that are constantly being updated with new rulings and new amendments. And when lawyers submit queries, they need answers with accurate citations to relevant legal documents. So for this one, RAG or CAG?
13:04 I think RAG is the way to go here. The knowledge base in this case is massive and it's dynamic, with new content being added all the time.
13:16 So attempting to cache all this information would quickly exceed most models' context windows. Also, that requirement for precise citations to source materials is actually something that RAG naturally supports through its retrieval mechanism: it will tell us where it got its information from. And the ability to incrementally update the vector database as new legal documents emerge means that the system always has access to the most current information without requiring full cache recomputation. So, RAG all the way here.
13:47 One last one, one last game of RAG or CAG. Scenario three: you're building a clinical decision support system for hospitals. The idea here is that doctors need to query patient records, treatment guides, and drug interactions. The responses need to be really comprehensive and, of course, very accurate, because they're going to be used by doctors during patient consultations. And the doctors are often gonna ask complex follow-up questions. So RAG or CAG for that?
14:18 Well, how about... both? Because in this case, the system could first use RAG to retrieve the most relevant subset from the massive knowledge base, pulling in specific sections of a particular patient's history and some research papers based on the doctor's query. And then, instead of simply passing those retrieved chunks to the LLM, it could load all that retrieved content into a long-context model that uses CAG, creating a temporary working memory, if you like, for the specific patient case.
14:56 So it's really a hybrid approach: RAG's ability to efficiently search enormous knowledge bases, and then CAG's capability for providing the full breadth of medical knowledge when needed for those follow-up questions, without the system repeatedly querying the database.
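The KV-cache idea behind CAG, which the hybrid scenario reuses as a temporary working memory, can be illustrated with a toy sketch. It assumes a single attention layer, hand-made deterministic token vectors, and fixed scalings in place of learned key and value projections, so it is not a real LLM. All names (`token_vector`, `keys_values`, `answer_state`) and the manual sentence are hypothetical. The point it shows is that the key/value pairs for the knowledge tokens are computed once, and each query only adds pairs for its own tokens while getting the same result as reprocessing everything.

```python
import math

DIM = 4  # toy vector size

def token_vector(token):
    """Hypothetical deterministic toy embedding (not a trained model)."""
    seed = sum(ord(ch) for ch in token)
    return [((seed * (i + 3)) % 7) / 7.0 for i in range(DIM)]

def keys_values(tokens):
    """One (key, value) pair per token; fixed scalings stand in for learned projections."""
    pairs = []
    for token in tokens:
        vec = token_vector(token)
        pairs.append(([0.5 * x for x in vec], [2.0 * x for x in vec]))
    return pairs

def attend(query_vec, kv):
    """Softmax attention of one query vector over every key/value pair."""
    scores = [sum(q * k for q, k in zip(query_vec, key)) / math.sqrt(DIM) for key, _ in kv]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * value[i] for w, (_, value) in zip(weights, kv)) / total for i in range(DIM)]

# Preload: process the whole (tiny) knowledge base once, before any query arrives.
knowledge = "hold the reset button for ten seconds".split()
kv_cache = keys_values(knowledge)

def answer_state(question_tokens, cache):
    """Per query: compute pairs only for the new tokens, then attend over cache + new."""
    kv = cache + keys_values(question_tokens)
    return attend(token_vector(question_tokens[-1]), kv)

question = "how do I reset".split()
with_cache = answer_state(question, kv_cache)
# Reference result: reprocess knowledge and question together from scratch.
from_scratch = attend(token_vector(question[-1]), keys_values(knowledge + question))
```

In a real multi-layer transformer the cached keys and values exist per layer and depend on the preceding tokens, but the reuse pattern is the same: the expensive pass over the knowledge happens once, and it has to be redone whenever that knowledge changes.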
15:12 So essentially, RAG and CAG are two strategies for enhancing LLMs with external knowledge. You'd consider RAG when your knowledge source is very large, or it's frequently updated, or you need citations, or where resources for running long-context models are a bit limited. You would consider CAG when you have a fixed set of knowledge that can fit within the context window of the model you're using, where latency is important and it needs to be fast, and where you want to simplify deployment.
15:44 RAG or CAG, the choice is up to you.
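The selection criteria from the closing summary can be written down as a small rule of thumb. This is only a sketch: the `UseCase` fields, the `recommend` function, the 100,000-token window (the upper end of the range quoted in the talk), the update threshold, and every token count below are illustrative assumptions rather than figures from the speaker.

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    knowledge_tokens: int        # assumed size of the knowledge base, in tokens
    updates_per_year: int        # assumed update frequency
    needs_citations: bool
    complex_followups: bool = False

def recommend(case, context_window_tokens=100_000, max_updates_per_year=12):
    """Rough heuristic: CAG only if the knowledge fits, is fairly static, and needs no citations."""
    fits = case.knowledge_tokens <= context_window_tokens
    if not fits and case.complex_followups:
        # Retrieve a relevant subset first, then cache it for the follow-up questions.
        return "RAG then CAG"
    if not fits or case.needs_citations or case.updates_per_year > max_updates_per_year:
        return "RAG"
    return "CAG"

# The three scenarios from the game, with made-up sizes and update rates.
scenarios = [
    UseCase("IT help desk bot", knowledge_tokens=80_000, updates_per_year=3, needs_citations=False),
    UseCase("Law firm research assistant", knowledge_tokens=50_000_000, updates_per_year=365, needs_citations=True),
    UseCase("Clinical decision support", knowledge_tokens=50_000_000, updates_per_year=365,
            needs_citations=False, complex_followups=True),
]
recommendations = {case.name: recommend(case) for case in scenarios}
```

With these assumed inputs the heuristic reproduces the answers given in the game: CAG for the help desk bot, RAG for the law firm, and the hybrid for the clinical system. Latency needs and deployment simplicity, which the speaker also mentions, are left out of this sketch.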