
Claude Opus 4.5 vs Gemini: Agentic Edge

Key Points

  • Claude Opus 4.5 has been released, positioning itself as the most capable Anthropic model for long‑running, agentic tasks beyond just code generation.
  • The model actively monitors its context window, cutting short its self‑checks and “shipping” results when it senses it’s nearing the limit, which helps users finish large outputs like multi‑slide PowerPoint decks without manual prompt hacks.
  • When the context window would still be exceeded, Anthropic automatically switches to Sonnet 4.5 and invisibly compresses earlier context, preserving continuity though with some loss of detail.
  • These context‑management features translate into more reliable production of complete documents, spreadsheets, and presentations, reducing the “I hit the wall” experience common with prior models.
  • Compared to Gemini, Opus 4.5’s enhancements make it a more practical daily‑driver for chat‑based workflows that require sustained, coherent output.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=EbZbGPi8ftA](https://www.youtube.com/watch?v=EbZbGPi8ftA)
**Duration:** 00:15:21

## Sections

- [00:00:00](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=0s) **Claude Opus 4.5 vs Gemini** - The speaker outlines Claude Opus 4.5’s new long‑context, agentic capabilities and how they compare to Gemini, emphasizing practical advantages like uninterrupted PowerPoint generation.
- [00:03:46](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=226s) **AI Model Benchmark for Shipping Data** - The speaker compares several AI systems on a real‑world task of extracting and reconciling hundreds of Christmas‑tree numbers from a manifest and receipt, finding only Claude Opus 4.5 handled the OCR, memory, calculation, and pivot‑table requirements correctly.
- [00:07:33](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=453s) **Models as Environments, OCR Limits** - The speaker argues that AI models are evolving environments rather than fixed products, praises Gemini 3’s OCR advances, and highlights GPT‑5.1 Pro’s failure on noisy, real‑world handwritten data, underscoring the gap between clean‑context performance and practical utility.
- [00:11:58](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=718s) **Matching AI Tools to Tasks** - The speaker explains how to pick and combine various AI models (ChatGPT, Gemini, Nano Banana Pro, Claude Opus 4.5) based on whether a problem needs abstraction, reconstruction, or visual design, outlining a workflow for optimal results.
- [00:15:12](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=912s) **Future Mindset & Opus 4.5** - The speaker predicts that mindset will become increasingly pivotal through 2026, hints at an Easter egg discovered in Opus 4.5, and solicits the listener’s opinion on that version.

## Full Transcript
Claude Opus 4.5 is out. I know we just got done with Gemini week. I am also breathless. Don't worry, I'm going to get into the comparison versus Gemini: where I have actually found it useful using Opus 4.5, and what I would still use Gemini for. I'll dive into the whole thing. First, what's Opus 4.5, and what are the key things we should pay attention to? I'm going to go beyond benchmarks. I'm assuming you've read the headlines from Anthropic and others that say it's the best model. All the headlines say that when a model drops now, and I just read past that.

The thing that is interesting about this model, there's a number of them. The first is that this model is designed specifically to keep pushing into Claude's strong suit, which is long-running agentic tasks. And so this model is designed to continue to develop that strong suit, and it feels longer and more coherent and able to stay on task, not just in Claude Code but also in the chat, which I think is really important, because for so many of us the chat is our daily driver, and in this case you feel it right away.

So, for example, in the past you would be working with Sonnet 4.5, and you might hit the end of your context window when you're making a PowerPoint file, and you're frustrated because it was a 20-slide PowerPoint, and you had a nice prompt, maybe from Nate, about making a PowerPoint, and oh no, bang, the end of the context window. I've had to write prompts just for that. Well, no longer. It will compress the context window so you can continue to chat. And I have seen this in practice in two different ways, and they have different impacts on accuracy, so I want to name them carefully here.
Opus 4.5 deliberately hurries itself up within the same context window when it sees it's getting close to bumping into the end of the context window. So if it's making a PowerPoint, I have seen it tell itself: you've got to stop with the checks and just ship something. And that's a super helpful trait to have. There's that awareness of the context window that's useful. In addition, if you need to go beyond the traditional context window, what Anthropic does is it switches you automatically from Opus 4.5 to Sonnet 4.5. It compresses the top of the context window invisibly, and then you continue having the conversation with Sonnet. This isn't perfect. It's not going to remember every single thing, because it's compressed. But I have found in practice it is a lot nicer than just hitting the end of the context window and feeling like you crashed into a wall. So I think that by itself is going to feel like a big get for people.

I also find that that translates into much more concrete outcomes more often from Claude. I don't really get "I can't make this anymore, I hit the context window." I get usable docs. I get PowerPoints. I get Excel spreadsheets. Basically, the long-running agentic features that Anthropic unlocked translate into much more useful outputs. And that's the theme for this video: much more useful real-world outputs. Because we can talk about all the magical benchmarks we want, but I'm interested in real-world value, and most people are. And so, with permission, I am sharing a real-world test that I put Claude Opus 4.5 through. And it's not just me. One of my Substack readers did the same test first, came to the same conclusion, and sent me the idea.
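The compression behavior described above can be sketched in a few lines of Python. This is purely an illustrative model of "summarize the oldest turns when the budget gets tight," not Anthropic's actual implementation: the token budget, the 80% threshold, and the `summarize` stand-in are all invented for illustration (a real system would call a model to produce the summary and count real tokens).

```python
# Illustrative sketch of "compress the top of the context window" behavior.
# NOT Anthropic's implementation: the budget, threshold, and summarize()
# stand-in below are invented for illustration only.

def summarize(messages, keep_words=12):
    """Stand-in for a model-generated summary: keep the first few words of
    each message. A real system would ask a model to summarize."""
    return " ... ".join(" ".join(m.split()[:keep_words]) for m in messages)

def manage_context(messages, budget_tokens=1000, threshold=0.8):
    """If the conversation nears the token budget, replace the oldest half
    of the messages with a single lossy summary; otherwise leave it alone."""
    token_count = lambda m: len(m.split())  # crude whitespace token proxy
    total = sum(token_count(m) for m in messages)
    if total < threshold * budget_tokens:
        return messages  # plenty of room left: no-op
    half = max(1, len(messages) // 2)
    compressed = summarize(messages[:half])  # lossy, as the video notes
    return ["[summary] " + compressed] + messages[half:]
```

The key property, matching what the video describes, is that recent turns survive verbatim while older turns are collapsed into a compressed (and therefore lossy) summary.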
He runs a Christmas tree business, and he is obviously getting a lot of Christmas trees in this time of year, and he has handwritten shipping manifests and handwritten receipt sheets that he needs to reconcile. That is a surprisingly good problem to give to a leading large language model, because it has real business value. You have to reconcile the manifest to see what you're missing as far as trees in whatever dimension. And you have to make sure that the system can not only do the reconciliation, but that it can correctly tally the original numbers from the shipping manifest and the receipt. If you want the full breakdown, I have this on the Substack. Don't worry, there'll be lots of detail.

But the key point is that when I ran that test, I was testing Gemini 3, ChatGPT 5.1 Pro, and Claude Opus 4.5. And just because I've had some people ask, I was also testing Grok 4.1 and Kimi K2 Thinking. And I gave them all the same prompt. I said, "Please go through, cleanly extract all of the numbers from the shipping manifest for Christmas trees, all of the numbers from the receipts, and then come back and give me a clean answer." And if you want to get a sense of how big this was, the numbers run into hundreds of Christmas trees. And these are hand-tallied, little 1-2-3-4-5 tallies in pencil. It's a real-world test. It tests optical character recognition. It tests the ability to hold multiple numbers in the model's working memory. It tests the ability to do complex calculations. It actually tests pivot-table functionality, because the shipping manifest is on a different orientation than the receipt. So there's a lot of different things going on.
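The reconciliation half of that task (once the handwritten tallies have been read off the pages) can be sketched as a simple per-species comparison. The species names and counts below are invented for illustration; the point is the design choice the video keeps returning to: report discrepancies explicitly rather than forcing the two totals to agree.

```python
# Illustrative manifest-vs-receipt reconciliation. The species names and
# counts are invented; a real workflow would first OCR the handwritten
# tallies into dictionaries like these.

def reconcile(manifest, receipts):
    """Compare per-species counts and surface every discrepancy, instead
    of force-fitting the totals to a one-to-one match."""
    report = {}
    for species in sorted(set(manifest) | set(receipts)):
        shipped = manifest.get(species, 0)
        received = receipts.get(species, 0)
        report[species] = {
            "shipped": shipped,
            "received": received,
            "missing": shipped - received,  # negative means an overage
        }
    return report

manifest = {"Fraser fir": 120, "Balsam fir": 95, "Blue spruce": 60}
receipts = {"Fraser fir": 118, "Balsam fir": 95, "Scotch pine": 10}
report = reconcile(manifest, receipts)
```

Here `"Scotch pine"` appears only on the receipts, so it shows up as a negative `missing` value rather than being silently dropped, which is the behavior the video credits Opus 4.5 with and faults the force-reconciling models for.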
And what Kyle told me (he's the one who gave me permission to use this) is that Opus 4.5 is the only one that got this right, and that he uses Opus 4.5 in the business. Well, that for me is enough, right? If a business owner trusts it, all I'm doing is a bit of fancy testing on the side, and that is the gold standard as far as I'm concerned. And what he said is: I didn't find Gemini 3 very useful. And so I went and did the same test. I did a Nate prompt for it, gave it the images, which he was kind enough to share with me. I got a gold-standard grading rubric, and I'll write this all up in the Substack.

But the TL;DR is that Opus 4.5 was not perfect, but it was within a couple of trees, and close enough that it was able to get a real big head start on what would have been a multi-hour project to reconcile all of this receipt and shipping stuff, because this is across five different species of trees. There's 400-some trees involved. It's a lot. So it would have been a lot to reconcile by hand. Opus 4.5 gets you along 10, 12, 15 times faster, and is off, but not off by all that much, and in places is absolutely correct, and also acknowledges both discrepancy and uncertainty. So in other words, if you think about what we're testing: Opus 4.5 got the optical character recognition right. It got the ability to actually hold multiple numbers in working memory. It figured out how to handle discrepancies, because you can't get a one-to-one answer here, right?
There really were real-world discrepancies between these two lists that you couldn't just wish away, and the model acknowledged that. In the end it gave a useful answer, which is really the gold standard, and that goes back to this idea that it has this agentic quality that stays on task and focuses even in a messy task window and is able to deliver value.

Gemini 3 was the second-best response. Gemini 3 was able to do the counting of the tallies, which seems to be a really hard thing. Recognizing pencil marks is one of the tricky parts of optical character recognition, and I deliberately made it hard. But it scored much lower than Opus 4.5. In particular, what was interesting was it had a narrative, which fits with the idea that it synthesizes messy context well. It had a really clean narrative, but it really wanted to make the narrative make sense, and it struggled with the idea that the numbers were just inherently discrepant. And so what I found was the model ended up writing answers that were not entirely internally consistent when it was trying to figure out what to do with that narrative.

Now, one important piece of context here. I would not overread and say, "Well, Gemini 3 can't read tally marks for Christmas trees; it's not as good an OCR model as they say." There are archaeologists out there saying this is an absolute game-changer for reading clay tablets. This is a good example of part of why it's so hard for me and others to tell these stories. Models are not products that we define. Models are environments that we discover. Models are grown; they're not made. And we all venture into the wild forest of the model and discover what is there.
In this case, I've discovered a corner of the model around optical character recognition that has business impact. That is a factor, but it doesn't obscure the fact that Gemini 3 has made real progress on optical character recognition and is great at that in other contexts.

Going over to ChatGPT 5.1 Pro, I've got to say it's really reinforcing my sense that that model needs extremely clean context to work well. I've seen it do amazing things with clean context, but this was a dirty context window. It was a photograph of handwritten numbers, and it just flat-out failed to count the numbers correctly at all. All it did was come up with an initial estimate and then force-reconcile the rest so it was all one-to-one equal, under the mistaken assumption that the discrepancy had to be rectified. Great instinct if your model is designing clean code architecture, which is really what ChatGPT 5.1 Pro feels like. It is not correct if you're dealing with a messy, dirty, real-world situation. And so 5.1 Pro failed on that one.

And then I tested Kimi K2 and I tested Grok 4.1. They both scored much worse than even 5.1 Pro. So for those of you who are saying I don't talk about them enough: I try not to talk about things I have terrible things to say about. Neither one of them did very well. They weren't able to count the tallies correctly. They weren't able to do the analysis correctly. They just were not helpful at all. And that really matches my sense that both of these models will have reputations that place them at the cutting edge, but the real-world applicability isn't there compared to Gemini, ChatGPT 5.1, and Claude Opus 4.5. If we step back, one of the things that I'm interested in is then asking: where do the models do the work?
And that's one of the things I'm going to talk a little bit more about in the Substack, but I think the way I'll put it here on the video is this. ChatGPT 5.1 is strongest when the problem is fully specified: clear requirements, structured inputs, well-understood code. If you have difficult architectural reasoning and you have clean inputs, and it's figuring out how a system should be designed or fixed, that love of structure is an asset. But that love of structure becomes a liability when the inputs are messy. So instead of wrestling with ambiguity, ChatGPT 5.1 or 5.1 Pro tends to prefer the cleaner world and will sometimes just force-clean it.

Gemini 3 is the opposite. It's a model I can reach for when I want business angles and narrative synthesis, and when I want to deal with a huge corpus. I will stand by the fact that it's incredible that you can take an entire earnings report and get it into a slide. That's mind-boggling. It can read a lot. It can see patterns. It can tell a story. The tradeoff is that if the context window has multiple conflicting numbers in it, or multiple conflicting narratives, Gemini 3 is liable to just come up with something, and may not have that internal rationale to actually pick the strongest story.

Opus 4.5 sits somewhere in between. It's the model that will actually do the work when the information is messy but the job is specific, which it turns out a lot of our work is, which is why I think the Christmas tree example is perfect. So I find I can use it for tackling tone, tackling editing my work, trying to find a voice for something I'm trying to wrestle with. And I also can use it as a code monkey.
And so I can get it to implement features or refactors or glue code that need to be consistent over time. And it just stays on task. It's one that I can trust to build a deck in multiple passes without forgetting the structure that we agreed on. It does sometimes feel a bit less opinionated than Gemini, and perhaps less ruthlessly critical than ChatGPT, but in return it doesn't blow up as the task gets longer or as the context gets more tangled.

So if I'm trying to find a simpler way to describe how these models respond, I would say Gemini tends to interpret mess: what might this mean? What's the story here? Which is useful. Claude tries to reconstruct the mess faithfully: what is actually here, and how do I represent it cleanly? ChatGPT tends to abstract away the mess: how can I turn this into a cleaner version of the problem to solve? I'm not saying any of these approaches is right or wrong. I'm trying to give you a trick to notice which one matches the job in front of you. If you're reading degraded documents from an archive, interpretation is a feature. If you're reconciling inventory, reconstruction is more what you want. And if you're designing a protocol, then you want that abstraction that ChatGPT offers.

Once you start to see the models through this lens, I think your usage will start to naturally split. If you're looking for strategy, for big-picture insight, I find myself reaching for Gemini. It's a great big-picture conversational partner. It's amazing. I stand by the fact that Nano Banana Pro feels like a miracle. For clean technical problem solving, ChatGPT continues to be very, very solid, as long as you have a clean context window.
For anything that ends up having to go through multiple edits, that touches code, where you're trying to stay consistent across different formats in an article or whatever it is, Opus 4.5 is the safest pair of hands. For images, for UI concepts, for marketing visuals, Nano Banana is really helpful, but you find you feed it with other things. Right now, for decks, I tend to build them in Claude, and then I tend to polish them by running that Claude deck through NotebookLM, which is powered by Gemini and by Nano Banana Pro. I get that visual polish over the top of a deck with bones constructed by Claude. This is not about loyalty to a brand. It's about matching the model's personality to the job.

And by the way, when people ask me how I write, part of why it's hard to give that answer is that every piece is different. And so with this piece, for example, I have to draft in video. I have to go out and wrestle with it and do the real-world checks, and then come back and make sense with you in front of the camera, and then figure out how to wrestle that into an article. And some articles don't end up starting that way, but a lot of them do, because what we're doing is discovering the real-world capabilities of the models together.

The last thing worth saying is that this map is going to keep changing. Anthropic is going to update Opus. OpenAI will definitely come back out with something on ChatGPT. Google's going to keep pushing on Gemini. The mindset to have is not "Nate has told me the best model." Please don't do that. It's to have a working hypothesis about what each one is good at, and to be willing to update that as you explore the way these models actually work for your use cases.
Right now, Opus 4.5 looks like a great choice to hire when you want work done reliably in the messy middle of real-world tasks. How long that holds is open to question, but it's a big step forward in that direction, and that's worth celebrating and pointing to.

And I'll leave you with one final thought. I use "model to hire" for a reason. I think we should start to switch our language a little bit from "which plan am I purchasing" to "which model am I hiring for the job," as we get to a point where these models produce outputs more, because it helps us understand why the pricing works the way it does. If you're hiring a model to do the job, and the job is something that saves you tens of hours (15 or 20 or 30 hours a month), it is worth the money you're paying for it. You hired it to do the job, and it's taking work off your plate. We will see that mindset work more and more as we go into 2026. So that's just a little Easter egg, and that's what I got from Opus 4.5. What's your take on Opus 4.5?