The Truth About Context Windows

15m • ai-ml • deep-dive • intermediate

Key Points

AI firms overstate their models' usable context windows: they advertise millions of tokens, while reliable performance in practice often covers roughly a tenth of that.
Even with advertised million‑token windows, models like Gemini show solid results only up to about 128 k tokens, and reliability degrades beyond half a million tokens.
Transformers read long inputs as flat token strings, so structural information in large documents or codebases is lost, making naive “fill‑the‑prompt” approaches ineffective.
Agentic search outperforms plain semantic RAG on large codebases because a search function can exploit the structure in the code that pure semantic matching misses.
Benchmarks such as “needle‑in‑a‑haystack” tests are overly simplistic and fail to measure a model’s ability to synthesize information across extensive contexts, a crucial capability for higher‑level AGI thinking.

**Source:** [https://www.youtube.com/watch?v=R-CASOusCJo](https://www.youtube.com/watch?v=R-CASOusCJo)
**Duration:** 00:15:01

## Sections

- [00:00:00](https://www.youtube.com/watch?v=R-CASOusCJo&t=0s) **Untitled Section**
- [00:03:10](https://www.youtube.com/watch?v=R-CASOusCJo&t=190s) **LLM Context Limitations and Strategies** - The speaker explains how language models excel with familiar pre-training texts but often miss parts of new inputs, outlines five mitigation tactics, and hints at implications for AGI.
- [00:06:39](https://www.youtube.com/watch?v=R-CASOusCJo&t=399s) **Context Budgeting and Token Allocation** - The speaker outlines a strategy for managing a language model's limited token window by partitioning tokens for system instructions, conversation history, document retrieval, and working memory, treating them like precious RAM.
- [00:10:35](https://www.youtube.com/watch?v=R-CASOusCJo&t=635s) **Context Window Limits and AGI Assumptions** - The speaker explains that expanding an LLM's context window causes a quadratic increase in processing energy and latency, challenges the assumption that AI can reach AGI by the same lossy-compression mechanisms as humans, and argues that current models rely on pattern matching rather than true structural understanding.
- [00:14:12](https://www.youtube.com/watch?v=R-CASOusCJo&t=852s) **Leveraging AI Within Context Limits** - The speaker explains that, even with today's limited context windows, current AI models can still deliver valuable personal and business solutions by employing tactics such as position hacking, context budgeting, strategic chunking, summary chains, and RAG, while urging users to be skeptical of overstated vendor claims.

## Full Transcript

Every single AI company is not telling the truth about what its context window really does. This video talks about context windows, memory, and what that means for AGI, artificial general intelligence.

First, let's dive into the claims that are being made. These are big claims: million-token context windows. There's talk of 2 million, 5 million, even 10 million token context windows coming soon, and we already routinely have context windows of several hundred thousand tokens. What this means in practice is that companies are telling us that if you want to put in a prompt that is a full book, you can do that. It's not true. It doesn't actually work that way. Anyone who works with LLMs extensively will tell you that you might get a tenth of the advertised context window. The running understanding of Gemini right now, for example, with a million-token context window on paper, is that you get really solid performance out of about 128,000 tokens, or just over a tenth, and after that it's a little more questionable. There are absolutely developer forums complaining that Gemini does not have effective performance, especially past the half-million mark.

Why, you might think, would someone want to put in a context that large? No one writes a half-million-token prompt. Not even I write a half-million-token prompt. I will tell you why. If you are analyzing documents, if you're analyzing codebases, fundamentally anything with very large sequences of tokens that make semantic meaning together across large structures, you need the option to use a larger context window. The problem is this: fundamentally, when the transformer reads that context, it does not read it as a structure. It reads it as a string of tokens.
And so larger structures within the document or the codebase can get lost. That is why agentic search is picking up versus just semantic RAG for codebases. RAG is obviously not the context window itself; it's part of the context engineering that you're doing for the codebase. The point is that having a search function can beat just semantic meaning for codebases, because there's so much structure in codebases. And that is just one example of where we can go wrong when we assume that vanilla fill-the-prompt-and-add-the-doc context windows work. They don't necessarily work well.

I know that model makers will push something like 99% or 98% performance on needle-in-the-haystack tests. A needle-in-the-haystack test is kind of what it sounds like: you stick one random fact in the middle of a gigantic block of text and you test to see if the model can find it. The problem is that this is all done in a very controlled environment, and it does not measure the ability of an LLM to synthesize between multiple pieces of specific context, which, by the way, is exactly what you need it to do for higher-level thinking. It is what humans are able to do when they read a book. Granted, we don't memorize every part of the book we read, but we don't have the problem of saying, "You know what, the book I'm reading right now, I remember it less well than the book that I read four years ago." We have the opposite problem. With LLMs, it's the other way around. At the end of the day, if it's in pre-training data, I can actually get kind of decent literary analysis.
If the book is something that I'm reading now, in the sense that it's a new prompt or new text the model hasn't seen before (I don't really give it books, but I can give it docs that it hasn't seen before, that aren't in the pre-training data), then even with state-of-the-art models like o3 Pro, it can still be very hit or miss whether it actually examines the full context. And tests back this up. Tests often show an edge awareness with LLMs, where they pay attention to the end and to the beginning, and attention across the prompt forms a big U-shape, with the middle getting the least.

So, one, I'm going to tell you a few strategies for how this is handled, because I don't think that's often laid out very clearly: these are your options. We all know this is a problem, so lay out the options, right? And then, number two, I want to talk about what this means for artificial general intelligence. But we'll save that fun stuff for the end.

So let's just run through a few strategies quickly. We'll do five. Number one, I've talked about this one before, so we're not going to belabor it: RAG, retrieval-augmented generation. Fundamentally, if you feel like you need an index that gives you a sense of semantic meaning, and you need the model to take a particular utterance or prompt and go fetch something out of a very large body of content that you've put into the RAG store, so it doesn't just live in the context window, fantastic. RAG can work well. The classic example is the wiki, the HR manual. That's kind of what RAG is good for.

Second strategy: summary chains. Real example: a 200-page financial report.
The old approach would be to feed in all 200 pages, and you're paying, I don't know, 50 bucks or something to the API, depending on how big a prompt you run, how complex and multi-step it is, how many tokens you're burning, and the model. New approach: split it into sections, summarize each of them, and then combine the summaries together. So you're laddering up the semantic meaning, and whatever your model is, it's going to run a lot cheaper. And the accuracy is higher, because by splitting it into sections you're making sure nothing gets stuck in the middle and is just lost. I have Claude admit to me all the time that it does not fully read the documents I give it. It reads the first few thousand tokens and just kind of pattern matches, is literally what Claude said, but I call it vibes. It just vibes its way through.

Okay, third strategy to deal with this: strategic chunking. Similarly, you split the 80-page document into sections. This is similar to summary chains. Then you interrogate each chunk: do you contain information about X topic? Let's say you're trying to explore a particular product area inside a financial report for the stock market. You want to interrogate each of the 10-page chunks in a very large company report and ask, does it contain information about the products? Only positive chunks would then move forward after you do that interrogation across splits. This results in vastly fewer tokens being used and much better accuracy, even versus a vector search, because you're basically saying: you must pay attention. This is a small context window. Just look at it. It's not RAG. All I'm asking you to do is look at the context window and tell me if this is in here.
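As a rough illustration of the two strategies just described, here is a minimal Python sketch of chunk interrogation followed by a summary chain. Everything named here is an assumption made for illustration, not something from the talk: the `LLM` callable stands in for whatever model API you use, the prompts are placeholders, chunks are split by word count rather than by page, and `fake_llm` exists only so the sketch runs offline.

```python
from typing import Callable, List

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def relevant_chunks(chunks: List[str], topic: str, ask: LLM) -> List[str]:
    """Strategic chunking: question each small chunk, keep only the positives."""
    kept = []
    for chunk in chunks:
        prompt = (
            f"Does the text below contain information about {topic}? "
            f"Answer YES or NO.\n\n{chunk}"
        )
        if ask(prompt).strip().upper().startswith("YES"):
            kept.append(chunk)
    return kept


def summary_chain(chunks: List[str], ask: LLM) -> str:
    """Summary chain: summarize each chunk, then combine the summaries."""
    partials = [ask(f"Summarize this section:\n\n{c}") for c in chunks]
    combined = "\n\n".join(partials)
    return ask(f"Combine these section summaries into one summary:\n\n{combined}")


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call (assumption, for demo only)."""
    if prompt.startswith("Does the text"):
        body = prompt.split("\n\n", 1)[1]
        return "YES" if "product" in body.lower() else "NO"
    return f"summary({len(prompt.split())} words)"


if __name__ == "__main__":
    report = "Our product line grew this year. " * 3 + "Weather was mild. " * 3
    chunks = split_into_chunks(report, max_words=6)
    positives = relevant_chunks(chunks, "products", fake_llm)
    print(len(chunks), "chunks,", len(positives), "kept")
    print(summary_chain(positives, fake_llm))
```

Each model call sees only one small chunk, which is the point of both strategies: nothing sits in the middle of a huge prompt where it can be skipped.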
And I'm giving you so little, just a few thousand tokens, that you can't mess it up.

Fourth strategy is context budgeting, which is a big part of context engineering. You treat the tokens the way we treated random access memory, or RAM: you conserve it. You treat it like it's precious. So you would say, for example, we're always going to have system instructions. We're just going to have 500 lines of system instructions, or 50 lines of system instructions, and that's what we're going to have. Okay. This next piece, I'll call it a thousand tokens, is for conversation history, and that's summarizing older parts of the conversation. Again, we're not going to touch it. Then 2,000 tokens for retrieved documents, and then 500 for working memory. Whatever it is, you can do more of this in the API, where you're sort of hacking the context.

If you are in a chatbot, you have limited options. The system instructions you can't touch. The conversation history is summarized for you. Retrieved documents are kind of up to you. You'll notice in the chatbot that older retrieved documents are dropped. I routinely have a conversation with o3 where I'm like, "Remember that document?" It's literally there, I remember uploading it, and there's a little marker in the UI that shows I did it, and of course o3 is like, "It's out of memory, I don't know, didn't happen, I can't remember it." So if you're in the chatbot, you have to do all of this manually. You have to track how long your conversation is going and what you're asking for, and then budget your asks and the documents you give very carefully.

The last strategy is position hacking. Research shows attention is at least 3x greater at the edges of the prompt.
I've talked about this before. Put critical instructions at the beginning. Put key facts at the end. Put the relevant document where it needs to be to get attention: first, or, second best, last. And then insert checkpoints every few thousand tokens as you chat to confirm that the prompt is working. In a sense, you're not trying to escape the fact that you have limited context. You're trying to position hack.

Now, if I were to look at this and ask, "What can you do with APIs versus a chat window?": all five of these are very viable with an API-first approach. Only some of them work with a chat window. In the chat window, you can do summary chains. That works because you can split into sections and have different chats. You can do strategic chunking, where you ask whether a chunk contains information. That works. You can do position hacking, where you time your instructions and choose what you put where. It is a little more difficult in the chat window to do context budgeting and retrieval-augmented generation, although arguably a custom GPT, or a project area in ChatGPT, is effectively a cheap form of retrieval-augmented generation. So there are ways to get there, but certainly summary chains, strategic chunking, and position hacking are very viable even if you're not an API person.

Okay, let's get slightly philosophical for a minute toward the end of this video. I want to get real honest about the fact that we have been talking for a few minutes about how, fundamentally, these models cannot reliably track information across a single structured piece of text that's book length.
How do we expect them to maintain understanding across a lifetime of experience, particularly when they're not getting better at this? This is not a new issue. I am not telling you about something that did not exist when ChatGPT launched and now does. I'm telling you about something that hasn't been solved. This is a limitation of our architectures that is partly a function of physics.

One of the things that Google engineers have observed is that it is incredibly computationally intensive to use the full 1 million token context window. I don't know if you know this, but context scales quadratically. In other words, as you send more tokens through, the compute needed to process them grows with the square of the token count. So if you go from 50,000 to 100,000 tokens, you've 4x'd the amount of energy you have to use to process that context window, which is why some of these longer prompts take so long. You're burning multiple minutes staring at Opus 4 and it's just going. You're burning multiple minutes staring at o3 Pro. Some of that is that they're inference models and they're thinking, but some of it may be that you gave it a lot of context. This is a fundamental limitation, not an artifact of your prompt design, although your prompt design can help address the issue. This is a robust effect across every model architecture that's been tested so far.

And here's the thing: the entire bet on LLMs achieving artificial general intelligence rests on this assumption, if you really reduce it: humans are lossy compression functions, too. I'll say it again. Humans are lossy compression functions, too. Our forgetting and compression is fundamentally similar to what these models do. That is the bet.
I don't know that I agree with it. The context window problem suggests this bet might be incorrect. Yes, we forget details, but we maintain coherent mental models. Sure, I can't recite page 50 of the legal document verbatim, but I understand how chapter 20 relates to chapter 1, and I can tell you pretty clearly. With LLMs, that's not the same, right? Research shows they're doing pattern matching, and pattern matching is not the same as understanding the structure. And if this concept of quadratic complexity really applies, it's not just inconvenient. At AGI scales, you're hitting thermodynamic limits. You're hitting energy limits. We perhaps need a fundamentally different breakthrough in the way that we handle attention across long context windows in order to truly get to a point where these LLMs can deeply understand context across very large spaces.

So either we're right and intelligence really is lossy compression. Maybe I'm just fooling myself. I'm a very lossy human and I just need to be honest, right? Maybe you need to be honest too, and we need to be a little more humble and recognize that the limitations of the AI are our limitations too, and it's going to get to AGI effectively because humans are not that much better. Or we're kind of wrong and we're building very sophisticated stochastic parrots, people spirits, pick your description of choice, and those machines will never really understand the large context windows that we throw at them, and that is a fundamental computational limit that would take a new breakthrough to get past on the way to AGI.

For now, I would settle for honesty from vendors who are talking about context windows.
I think we have traded the honesty that we need to actually do appropriate planning for "this is a million-token context window and it's simple." I would like to propose that we start to use real tests of actual synthesis work across documents as a way to describe capabilities: this model can effectively synthesize insights across a 10-page document and gets it right 90% of the time, or this one can do it for a 20-page or a 100-page document, whatever it is. I have yet to see, by the way, a reliable synthesis across a 100-page document by any model, if it's a complex document. So that's a theoretical.

Okay. So I've left you with a few strategies. We've talked about how you address this. Don't walk away thinking that just because I'm skeptical about the implications for AGI, I don't think that this is a transformative opportunity for those of us building. If we apply any of these five strategies, or maybe a combination of them, it is totally possible to use the LLMs we have today to accomplish transformative business results. I've seen it. Now, that doesn't mean a lot of people aren't screwing it up. They are. But the AI we have today, even if it never gets better, is still good enough that, even with the weaknesses in the context windows we have today, we can still build business solutions, and frankly personal solutions, that offer a ton of value. I know people who are building really effective second brains within the context windows we have today. It's just possible. Some of them are hacking Obsidian. Some are using other tools. Some of them are rolling their own. There are remarkable things that we're able to do personally and professionally within the context window limitations we have today. Use the five strategies I laid out.
The position hacking, the context budgeting, the strategic chunking, the summary chains, the RAG (retrieval-augmented generation). And have fun with what we've got. And be aware of the claims that model makers and vendors make about context windows. They're not all they're cracked up to be.
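The context-budgeting and position-hacking strategies from the talk can be sketched together as a small prompt assembler. This is only an illustrative sketch under stated assumptions: the budget figures loosely follow the numbers mentioned in the talk (the 500 for system instructions is treated here as tokens), the section names and the `build_prompt` function are hypothetical, and tokens are approximated by whitespace-separated words, which a real tokenizer would count differently.

```python
from typing import Dict, List

# Token budget per prompt section. Figures loosely follow the talk's example;
# the section names are this sketch's own labels (assumption).
BUDGET: Dict[str, int] = {
    "system": 500,
    "history": 1000,
    "documents": 2000,
    "working_memory": 500,
}


def count_tokens(text: str) -> int:
    """Crude estimate: one token per whitespace-separated word."""
    return len(text.split())


def trim_to_budget(text: str, limit: int) -> str:
    """Keep the first `limit` estimated tokens (whitespace is normalized)."""
    return " ".join(text.split()[:limit])


def build_prompt(system: str, history: str, documents: List[str],
                 working_memory: str, key_facts: str) -> str:
    """Assemble a prompt: critical instructions first, key facts last."""
    doc_parts = []
    remaining = BUDGET["documents"]
    for doc in documents:  # assumed ordered from most to least relevant
        if remaining <= 0:
            break  # budget exhausted: later documents are dropped
        piece = trim_to_budget(doc, remaining)
        doc_parts.append(piece)
        remaining -= count_tokens(piece)
    parts = [
        trim_to_budget(system, BUDGET["system"]),    # edge of prompt: start
        trim_to_budget(history, BUDGET["history"]),  # middle
        "\n\n".join(doc_parts),                      # middle
        trim_to_budget(working_memory, BUDGET["working_memory"]),
        key_facts,                                   # edge of prompt: end (not budgeted here)
    ]
    return "\n\n".join(p for p in parts if p)


if __name__ == "__main__":
    prompt = build_prompt(
        system="Answer only from the documents provided.",
        history="User asked about Q3 revenue earlier.",
        documents=["Q3 revenue rose in the product segment.", "Appendix text."],
        working_memory="Open question: which segment drove growth?",
        key_facts="Key fact: the report covers fiscal Q3 only.",
    )
    print(prompt)
```

With an API you control every section this way; in a chat window the same idea has to be applied by hand, by deciding what goes in each message and in what order.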