Claude Opus 4.5 vs Gemini: Agentic Edge
Key Points
- Claude Opus 4.5 has been released, positioning itself as the most capable Anthropic model for long‑running, agentic tasks beyond just code generation.
- The model actively monitors its context window, truncating checks and “shipping” results when it senses it’s nearing the limit, which helps users finish large outputs like multi‑slide PowerPoints without manual prompt hacks.
- When the context window would still be exceeded, Anthropic automatically switches to Sonnet 4.5 and invisibly compresses earlier context, preserving continuity though with some loss of detail.
- These context‑management features translate into more reliable production of complete documents, spreadsheets, and presentations, reducing the “I hit the wall” experience common with prior models.
- Compared to Gemini, Opus 4.5’s enhancements make it a more practical daily‑driver for chat‑based workflows that require sustained, coherent output.
Sections
- [00:00:00](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=0s) Claude Opus 4.5 vs Gemini - The speaker outlines Claude Opus 4.5’s new long‑context, agentic capabilities and how they compare to Gemini, emphasizing practical advantages like uninterrupted PowerPoint generation.
- [00:03:46](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=226s) AI Model Benchmark for Shipping Data - The speaker compares several AI systems on a real‑world task of extracting and reconciling hundreds of Christmas‑tree numbers from a manifest and receipt, finding only Claude Opus 4.5 handled the OCR, memory, calculation, and pivot‑table requirements correctly.
- [00:07:33](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=453s) Models as Environments, OCR Limits - The speaker argues that AI models are evolving environments rather than fixed products, praises Gemini 3’s OCR advances, and highlights GPT‑5.1 Pro’s failure on noisy, real‑world handwritten data, underscoring the gap between clean‑context performance and practical utility.
- [00:11:58](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=718s) Matching AI Tools to Tasks - The speaker explains how to pick and combine various AI models—ChatGPT, Gemini, Nano Banana Pro, Opus 4.5, Claude—based on whether a problem needs abstraction, reconstruction, or visual design, outlining a workflow for optimal results.
- [00:15:12](https://www.youtube.com/watch?v=EbZbGPi8ftA&t=912s) Future Mindset & Opus 4.5 - The speaker predicts that mindset will become increasingly pivotal through 2026, hints at an Easter‑egg discovered in Opus 4.5, and solicits the listener’s opinion on that version.
**Source:** [https://www.youtube.com/watch?v=EbZbGPi8ftA](https://www.youtube.com/watch?v=EbZbGPi8ftA)
**Duration:** 00:15:21
Full Transcript
Claude Opus 4.5 is out. I know we just
got done with Gemini week. I am also
breathless. Don't worry, I'm going to
get into the comparison versus Gemini
where I have actually found it useful
using Opus 4.5, what I still would use
Gemini for. I'll dive into the whole
thing. First, what's Opus 4.5 and what
are the key things we should pay
attention to? I'm going to go beyond
benchmarks. I'm assuming you've read the
headlines from Anthropic and others that
say it's the best model. All the headlines say that whenever a model drops now, and I just read past that. What is interesting about this model? There are a number of things. The first is that this model is designed specifically to keep pushing into Claude's strong suit, which is long-running agentic tasks. It feels more coherent and able to stay on task for longer, not just in Claude Code but also in the chat, which I think is really important, because for so many of us the chat is our daily driver, and in this case you feel it right away. So,
for example, in the past, you would be
working with Sonnet 4.5 and you might
hit the end of your context window when
you're making a PowerPoint file and
you're frustrated because it was a 20
slide PowerPoint and you had a nice
prompt maybe from Nate about making a
PowerPoint and oh no, bang the end of
the context window. I've had to write
prompts just for that. Well, no longer.
It will compress the context window so
you can continue to chat. And I have
seen this in practice in two different
ways and they have different impacts on
accuracy. So I want to name them
carefully here. Opus 4.5 deliberately hurries itself up when it sees it's getting close to the end of the context window. So if it's making a PowerPoint, I have seen it tell itself, you've got to stop with the checks and just ship something. And that's a super helpful trait to have. There's that awareness of the context window that's useful. In addition, if you need to go beyond the traditional context window, what Anthropic does is switch you automatically from Opus 4.5 to Sonnet 4.5. It compresses the top of the context window invisibly, and then you continue having the conversation with Sonnet. This isn't perfect. It's not
going to remember every single thing
because it's compressed it. But I have
found in practice it is a lot nicer than
just hitting the end of the context
window and feeling like you crashed into
a wall. So I think that by itself is
going to feel like a big get for people.
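The two context behaviors just described (hurry up and ship near the limit; compress and hand off past it) can be sketched as a toy control loop. Everything here, the function names, the 200k budget, the 90% "hurry" threshold, is a hypothetical illustration, not Anthropic's actual implementation:

```python
# Toy sketch of the two behaviors described above. All names and
# thresholds are assumptions for illustration, not Anthropic's logic.

def manage_context(turn_tokens, budget=200_000, hurry_at=0.9):
    """Decide what to do as a conversation approaches its token budget."""
    used = sum(turn_tokens)
    if used < hurry_at * budget:
        return "continue"            # plenty of room: keep checking and refining
    if used < budget:
        return "ship"                # near the limit: stop the checks, emit output
    return "compact_and_switch"      # over budget: compress old turns, hand off

def compact(turns, keep_last=4):
    """Crudely compress early turns into a summary stub, keeping recent ones."""
    if len(turns) <= keep_last:
        return turns
    summary = f"[summary of {len(turns) - keep_last} earlier turns]"
    return [summary] + turns[-keep_last:]
```

The point of the sketch is the ordering: the model first trades verification for completion, and only past the hard limit does the lossy compression (and, per the video, the switch to Sonnet 4.5) kick in.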
I also find that that translates into much more concrete outcomes, more often, from Claude. I don't really get "I can't make this anymore, I hit the context window." I get usable docs. I get PowerPoints. I get Excel spreadsheets. Basically, the long-running agentic features that Anthropic unlocked translate into much more useful outputs.
And guys, that's the theme for this
video. Much more useful real world
outputs because we can talk about all
the magical benchmarks all we want, but
I'm interested in real world value and
most people are. And so with permission,
I am sharing a real-world test that I put
Claude Opus 4.5 through. And it's not
just me. One of my Substack readers did
the same test first, came to the same
conclusion and sent me the idea. He runs
a Christmas tree business and he is
obviously getting a lot of Christmas
trees in this time of year and he has
handwritten shipping manifests and
handwritten receipt sheets that he needs
to reconcile. That is a surprisingly
good problem to give to a leading large
language model because it has real
business value. You have to reconcile
the manifest to see what you're missing
as far as trees in whatever dimension.
And you have to make sure that the
system can not only do the
reconciliation, but that it can
correctly tally the original numbers
from the shipping manifest and the
receipt. If you want the full breakdown,
I have this on the Substack. Don't
worry, there'll be lots of detail. But
the key point is that when I ran that test, I was testing Gemini 3, ChatGPT 5.1 Pro, and Claude Opus 4.5. And just because I've had some people ask, I was also testing Grok 4.1 and Kimi K2 Thinking. And I gave them
all the same prompt. I said, "Please go
through cleanly extract all of the
numbers from the shipping manifest for
Christmas trees, all of the numbers from
the receipts, and then come back and
give me a clean answer." And if you want
to get a sense of how big this was, like
the numbers run into hundreds of
Christmas trees. And these are hand
tallied like this little like 1 2 3 4 5
hand tallied with pencil. Like it's a
real world test. It tests optical
character recognition. It tests the
ability to hold multiple numbers in the
model's working memory. It tests the
ability to do complex calculations. It
actually tests pivot table functionality
because the shipping manifest is on a
different orientation than the receipt.
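The reconciliation part of this task (counts on two documents in different orientations, tallied by species, with real discrepancies) reduces to something like the sketch below. The species names and counts are invented placeholders, not the actual numbers from Kyle's manifest:

```python
# Minimal sketch of manifest-vs-receipt reconciliation by species.
# The hard parts in the video (OCR of pencil tallies, pivoting the two
# layouts into a common shape) are assumed already done; this shows only
# the tallying and discrepancy reporting, with made-up example data.

def reconcile(manifest, receipt):
    """Return {species: (shipped, received, difference)}; report
    discrepancies rather than forcing the two counts to match."""
    report = {}
    for species in sorted(set(manifest) | set(receipt)):
        shipped = manifest.get(species, 0)
        received = receipt.get(species, 0)
        report[species] = (shipped, received, received - shipped)
    return report

manifest = {"fraser_fir": 120, "balsam_fir": 95, "blue_spruce": 60}  # shipped
receipt = {"fraser_fir": 118, "balsam_fir": 95, "blue_spruce": 62}   # received
report = reconcile(manifest, receipt)
```

The design point matches the video's finding: a good answer preserves the discrepancy (here fraser_fir down two, blue_spruce up two) instead of force-reconciling everything to one-to-one.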
So, there's a lot of different things
going on. And what Kyle told me (he's the one who gave me permission to use this) is, he said, "Opus 4.5 is the only one that got this right. I use Opus 4.5 in the business." Well,
that for me is enough, right? If a
business owner trusts it, all I'm doing
is doing a bit of fancy testing on the
side, right? And that is the gold
standard as far as I'm concerned. And
what he said is I didn't find Gemini 3
very useful. And so I went and I did the
same test. I did a Nate prompt for it, gave it the images, which he was kind enough to share with me. I got a gold-standard grading rubric. And I'll write this all up in the Substack. But the TL;DR is that Opus 4.5 was
not perfect, but it was within a couple
of trees and close enough that it was
able to get a real big head start on what would have been a multi-hour project to reconcile all of this receipt and shipping stuff, because this is across five different species of trees. There are like 400-some trees involved. It's a lot. So it would have been a lot to reconcile by hand. Opus 4.5 gets you there 10, 12, 15 times faster, and where it's off, it's not off by all that much, and in places it is absolutely correct, and it also acknowledges both discrepancy and uncertainty. So in other
words, if you think about what we're
testing, Opus 4.5 got the optical
character recognition right. It got the
ability to actually hold multiple
numbers in working memory. It figured
out how to handle discrepancies, because you can't get a one-to-one answer here, right? There really were real-world discrepancies between these two lists that you couldn't just wish away, and the model acknowledged that. In the end it gave a useful answer, which is really the gold standard. And that goes back to this idea that it has this agentic quality that stays on task and focuses even in a messy task window and is able to deliver
value. Gemini 3 was the second best
response. Gemini 3 was able to do the
counting of the tallies, which seems to
be a really hard thing. Like recognizing
pencil marks is one of the tricky parts
of optical character recognition, and I
deliberately made it hard, but it scored
much lower than Opus 4.5. In particular,
what was interesting was that it had a narrative, which fits with the idea that it synthesizes messy context well. It had a really clean narrative, but it really wanted to make the narrative make sense, and it struggled with the idea that the numbers were just inherently discrepant. And so what I found was the model ended up writing answers that were not entirely internally consistent when it was trying to figure out what to do with that narrative. Now, one important
piece of context here. I would not
overread and say, "Well, Gemini 3 can't
read tally marks for Christmas trees.
It's not as good an OCR model as they
say." There are archaeologists out there
saying, "This is an absolute game-changer for reading clay tablets." This is a
good example of part of why it's so hard
for me and others to tell these stories.
Well, models are not products that we
define. Models are environments that we discover. Models
are grown. They're not made. And we all
venture into the wild forest of the
model and discover what is there. In
this case, I've discovered a corner of
the model around optical character
recognition that has business impact
that is a factor, but it doesn't obscure
the fact that Gemini 3 has made real
progress on optical character
recognition and is great at that in
other contexts. Going over to ChatGPT 5.1 Pro, I've got to say it's really
reinforcing my sense that that model
needs extremely clean context to work
well. I've seen it do amazing things
with clean context, but this was a dirty context window. It was a photograph of handwritten numbers, and it just flat-out failed to count the numbers correctly. All it did was come up with an initial estimate and then force-reconcile the rest so it was all one-to-one equal, under the mistaken assumption that the discrepancy had to be rectified. Great instinct if your model is designing clean code architecture, which is really what ChatGPT 5.1 Pro feels like. It is not correct if you're dealing with a messy, dirty, real-world
one. And then I tested Kimmy K2 and I
tested Grock 4.1. They both scored much
worse than even 5.1 Pro. So for those of
you who are saying I don't talk about it
enough, I try not to talk about things I
have terrible things to say about. Neither one of them did very well. They
weren't able to count the tallies
correctly. They weren't able to do the
analysis correctly. They just were not
helpful at all. And that really matches
my sense that both of these models will
have reputations that place them at the
cutting edge, but the real world
applicability isn't there compared to
Gemini and ChatGPT 5.1 and Claude Opus 4.5. If we step back, one of the things that I'm interested in is asking: where
do the models do the work? And that's
one of the things I'm going to talk a
little bit more about in the Substack, but I think the way I'll put it here on the video is this. ChatGPT 5.1 is
strongest when the problem is fully
specified. Clear requirements,
structured inputs, well understood code.
If you have difficult architectural
reasoning and you have clean inputs, and
it's figuring out how a system should be
designed or fixed, that love of
structure is an asset. But that love of
structure becomes a liability when the
inputs are messy. So instead of
wrestling with ambiguity, ChatGPT 5.1 or 5.1 Pro tends to prefer the cleaner world and will sometimes just force-clean it. Gemini 3 is the opposite. It's a
model I can reach for when I want
business angles, narrative synthesis, and when I want to deal with a huge corpus. Like, I will stand by the fact that it's incredible that you can take an entire earnings report and get it into a slide. That's mind-boggling. It can read a lot.
It can see patterns. It can tell a
story. The tradeoff is that if the
context window has multiple conflicting
numbers in it or multiple conflicting
narratives, Gemini 3 is liable to just come
up with something and may not have that
internal rationale to actually pick the
strongest story. Opus 4.5 sort of sits
in between. It's the model that will
actually do the work when the
information is messy but the job is
specific, which it turns out a lot of
our work is, which is why I think the
Christmas tree example is perfect. So, I
find I can use it for tackling tone, tackling editing my work, trying to work on finding a voice for something I'm trying to wrestle with. And I also can use it as a code
monkey. And so, I can get it to
implement features or refactors or glue
code that need to be consistent over
time. And it just stays on task. It's
one that I can trust to build a deck in
multiple passes without forgetting the
structure that we agreed on. It does
sometimes feel a bit less opinionated
than Gemini and perhaps less ruthlessly critical than ChatGPT, but
in return it doesn't blow up as the task
gets longer or as the context gets more
tangled. So if I'm trying to find a way
that's simpler to describe how these
models respond, I would say Gemini tends
to interpret mess by saying what might
this mean? What's the story here? Which
is useful. And Claude tries to
reconstruct the mess faithfully, right?
What is actually here? Or how do I represent it cleanly? ChatGPT tends to
abstract away the mess. How can I turn
this into a cleaner version of the
problem to solve? I'm not saying any of
these approaches is right or wrong. I'm
trying to give you a trick to notice
which one matches the job in front of
you. If you're reading degraded
documents from an archive,
interpretation is a feature. If you're
reconciling inventory, reconstruction is
more what you want. And if you're
designing a protocol, then you want that
abstraction that ChatGPT offers. Once
you start to see the models through this
lens, I think your usage will start to
naturally split. If you're looking for
strategy, for big picture insight, I
find myself reaching for Gemini. It's a
great big picture conversational
partner. It's amazing. I stand by the fact that Nano Banana Pro feels like a miracle. For clean technical problem solving, ChatGPT continues to be very, very solid as long as you have a
clean context window. For anything that
ends up having to go through multiple edits, that touches code, or where you're trying to stay consistent across different formats in an article or whatever it is, Opus 4.5 is the safest
pair of hands. So for images, for UI
concepts, for marketing visuals, Nano
Banana is really helpful, but you find
you feed that with other things. Right
now, for decks, I tend to build them in Claude, and then I tend to polish them by running that Claude deck through NotebookLM, which is powered by Gemini and by Nano Banana Pro. I get that visual polish over the top of a deck with bones constructed by Claude. This is
not about loyalty to the brand. It's
about matching the model's personality
to the job. And by the way, when people
ask me like how do I write? Part of why
it's hard to give that answer is that
every piece is different. And so with
this piece for example, I have to draft
in video. I have to go out and wrestle
with it and do the real-world checks and
then come back and make sense with you
in front of the camera and then figure
out how to wrestle that into an article.
And some articles don't end up starting
that way, but a lot of them do because
what we're doing is discovering the real
world capabilities of the models
together. The last thing worth saying is
that this map is going to keep changing.
Anthropic is going to update Opus. OpenAI will definitely come back out with something on ChatGPT. Google's
gonna keep pushing on Gemini. The
mindset to have is not Nate has told me
the best model. Please don't do that.
It's to have a working hypothesis about
what each one is good at and to be
willing to update that as you explore
the way these models actually work for
your use cases. Right now, Opus 4.5
looks like a great choice to hire when
you want work done reliably in the messy
middle of real world tasks. How long
that holds is open to question, but it's
a big step forward in that direction and
that's worth celebrating and pointing
to. And I'll leave you with one final
thought. I say "a model to hire" for a reason. I think we should start to switch our language a little bit from "which plan am I purchasing?" to "which model am I hiring for the job?" As we get to a point where these models produce more of the output, that framing helps us understand why the pricing works the way it does. If you're hiring a model to do
the job and the job is something that
saves you ten or 15 or 20 or 30 hours a
month, it is worth the money you're
paying for it. You hired it to do the
job and it's taking work off your plate.
We will see that mindset work more and
more as we go into 2026. So, that's just
a little Easter egg, and that's what I
got from Opus 4.5. What's your take on
Opus 4.5?