Claude's Latest Model Beats GPT5
Key Points
- The reviewer tested the new Claude model across code, PowerPoint decks, spreadsheets, and docs, benchmarking it against OpenAI’s ChatGPT‑5 and Anthropic’s own Opus 4.1, and found a noticeably larger performance jump.
- Unlike OpenAI’s consumer‑focused approach, Anthropic is positioning Claude as a “professional AI” that directly boosts workplace productivity, and the new model’s capabilities reinforce that strategy.
- The model outperforms Opus 4.1 in creating truly usable deliverables—such as detailed slide decks and Amazon‑style PRFAQs—meeting a long‑standing bar that many AI systems previously missed.
- It excels at surfacing exactly where human expertise should intervene, making the collaboration between AI and domain experts clearer and more effective than with prior models.
- While still imperfect, the reviewer sees this release as a significant step forward for AI tools aimed at helping professionals get concrete work done rather than just generating generic content.
Sections
- Untitled Section
- Model Self-Checking and Tool Transparency - The speaker explains that the new Claude model constantly narrates its reasoning, validates its work (e.g., slide design, code execution), and autonomously corrects errors, showcasing a higher level of self‑verification compared to earlier AI models.
- Rapid Iterative AI Writing Workflow - The speaker explains how the new Claude model lets them quickly produce and refine clear, high‑quality narrative content in minutes, turning AI into a smart collaborator that multiplies productivity without sacrificing human control.
- AI Model as Thoughtful Colleague - The speaker lauds the new Claude model for its clear, self‑checking reasoning and willingness to push back on errors, fostering a balanced, professional partnership instead of a frantic, uncontrollable tool.
- Expressing Excitement and Request - The speaker conveys enthusiasm about something and asks to be informed.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=p-ibfrMN0M8](https://www.youtube.com/watch?v=p-ibfrMN0M8)
**Duration:** 00:15:41
**Timestamps:**
- [00:00:00](https://www.youtube.com/watch?v=p-ibfrMN0M8&t=0s) Untitled Section
- [00:03:25](https://www.youtube.com/watch?v=p-ibfrMN0M8&t=205s) Model Self-Checking and Tool Transparency
- [00:06:48](https://www.youtube.com/watch?v=p-ibfrMN0M8&t=408s) Rapid Iterative AI Writing Workflow
- [00:12:04](https://www.youtube.com/watch?v=p-ibfrMN0M8&t=724s) AI Model as Thoughtful Colleague
- [00:15:39](https://www.youtube.com/watch?v=p-ibfrMN0M8&t=939s) Expressing Excitement and Request
So, over the past few days, I was lucky
enough to get early access to a new
Claude model that is releasing today. I
want to tell you why you should care,
what you should expect versus the other
models out there, and where it really is
going to make a difference. So, stick
with me for the next 10 or 15 minutes,
and we're going to get through what to
make of this model, and you'll be able
to figure out whether it's useful for
you. Number one, what were my first
impressions and top takeaways? I tested
this model inside Claude Code. I tested
this model creating PowerPoint decks. I
tested this model creating spreadsheets.
I tested it creating docs. I tested its
thinking. I really put it through its
paces, and I benchmarked it against
ChatGPT-5, which is of course OpenAI's
frontier model and also against Opus 4.1
which is the current frontier model from
Claude before today. And I wanted to
know what was going to stand out to me.
I spend hundreds and hundreds of hours
in AI models. I'm very familiar with
sort of the look and feel differences
and I wanted to get hands-on early to
see if I could tell a difference.
Spoiler alert, it was a big difference.
And I'm not saying that because I want
to hype the model. No model is perfect.
But I think that this model moves the
ball forward in some really important
ways for people who care about getting
work done. And frankly, that's actually
in line with Anthropic's larger
strategy. If you look at the two big
players, OpenAI and Anthropic, OpenAI
continues to lean very consumer.
Anthropic is adopting a specialized
stance of leaning into professional AI.
What does it mean to have professionals
work with AI by choice and pick
Anthropic on purpose to get their work
done? How does Anthropic help them move
their work deliverables forward? The
signatures for that strategy were all
over this model. I did a very popular
guide a few weeks ago talking about Opus
4.1 when it released and emphasizing it
was the first model that actually got as
far as creating really usable
spreadsheets, really usable PowerPoints,
which had been a really, really tough
bar to meet for AI previously. Well,
this new model beats that. This new
model beats Opus 4.1. And I put
them head-to-head and I did not give
them an easy assignment. They had tough
assignments. They had to make tough, you
know, 11- or 12-slide SaaS decks. They had
to make docs in an Amazon PRFAQ style. I
really put them through their paces. And
what stood out to me as a human
observer, as someone who wants these
tools to work with us in the workplace,
is that this new model is what enables
me to see clearly where I need to
intervene. We talk a lot about AI
automating, about AI picking up work from us.
But I've been thinking a lot about this
idea that the most valuable AI is the AI
that helps you to see clearly when you
as a good human in your domain with deep
experience need to touch the work. In
this model, more than ChatGPT-5, more
than Opus 4.1, it's clear enough in its
narrative that you can see really
clearly what it's trying to go for and
you can see really clearly where you
need to touch the work to make it
better. And so if you think about it
within the context of a larger say deck
preparation workflow, a spreadsheet
model preparation workflow, this model
is going to speed the time it takes to
get these important pieces of work done.
And it does that in a number of useful
ways. And I want to call out sort of the
gritty hands-on notes that I have so
that you can start to think about it.
One of the first takeaways as you work
with this model: it's getting to that
level of quality, that level of clarity
on narrative by checking its work a lot
more than previous models did. One of
the hallmarks of the current Claude
style is that you have this running
commentary from the model that shows you
what tools it's invoking and what it's
thinking about at the moment. It's sort
of an express chain of thought. This
model is expressing an obsession, I
think that's the right word, an
obsession with checking its work and
fixing it. Multiple times when it was
creating PowerPoint decks, I saw it
measure the pixel overlap between title
text and a particular visual element,
correct itself, and say, "That's not
right," and redo the slide. It didn't
come to me and make me do that. It
caught it itself. That's a big deal. It
also took the time to check the formulas
in spreadsheets. When it was showing me
a code project I was working on, it was
actually going through the Next.js
framework and it was validating that it
could start and run the dev server
before coming back and telling me it
could. I've got to say, ChatGPT-5 just likes
to say it can do stuff, right? There's
just a sort of commitment to talk
that ChatGPT-5 has. I'm not here
to tell you which model to pick. This
video should not be interpreted as me
saying to pick only this model for your work. We
live in a multi-model world. I want you
to get a sense of where this model's
really useful and I think it's right in
line with where Anthropic is going. This
model is going to be useful in
dramatically cutting down the grunge
time that we have spent on work where
you are just wading through a lot of
messy inputs where you are trying to
figure out how you can understand a
complicated spreadsheet where you are
trying to write a draft and you just
feel like your head is mush and you
don't know how to get the words on the
page but they need to be really clear.
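The slide self-check the speaker described a bit earlier, where the model measured the pixel overlap between title text and a visual element and redid the slide, boils down to bounding-box geometry. Here's a minimal sketch of that kind of check, assuming simple (left, top, width, height) boxes; the shape names and coordinates are illustrative, and the transcript doesn't say how Claude actually implements it:

```python
# Hypothetical sketch of the layout check described above: given two
# shape bounding boxes on a slide, report whether (and how much) they
# overlap. Coordinates are (left, top, width, height) in any unit.

def boxes_overlap(a, b):
    """Return the overlapping area of two (left, top, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = min(ax + aw, bx + bw) - max(ax, bx)  # horizontal intersection
    dy = min(ay + ah, by + bh) - max(ay, by)  # vertical intersection
    return dx * dy if dx > 0 and dy > 0 else 0

def check_slide(shapes):
    """Flag every pair of shapes whose bounding boxes intersect."""
    problems = []
    names = list(shapes)
    for i, n1 in enumerate(names):
        for n2 in names[i + 1:]:
            area = boxes_overlap(shapes[n1], shapes[n2])
            if area:
                problems.append((n1, n2, area))
    return problems

# A title box that spills into the chart area below it:
slide = {
    "title": (0, 0, 800, 120),
    "chart": (100, 100, 600, 400),  # starts above the title's bottom edge
}
print(check_slide(slide))  # → [('title', 'chart', 12000)]
```

In a real deck you would pull these boxes from the slide's shapes (for example with a library like python-pptx) rather than hard-coding them.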
They can't just be any old AI slop.
That's where this model's going to
excel. I'll give you an example. I fed
this model 66 pages of PDF voice of
customer insight. So, it was all like
quotes, right? Things that were out of
order, not organized in any way. I just
wanted to see like what it would do with
raw customer utterance. And you know
what it did? It was able to analyze it.
And then this model in particular was
able to extract meaningful narrative
from it. And I think that's really
important to reflect on because those
kinds of insights don't make themselves
happen. I used to run voice of customer
when I was at Amazon. It was really,
really hard to manually go through a
bunch of customer utterances and they
just start to meld together in your
brain. It's hard to get narrative. It's
hard to attach a quote to a particular
insight. This is the first model I've
seen that can in one shot go from
a big muddle of customer quotes to an
executive ready narrative arc in a
PowerPoint presentation. Now, is it the
most beautiful PowerPoint I've ever
seen? No. Is it even better than the Opus 4.1 output
that I thought was usable? It actually
is. This is the first AI PowerPoint
creation tool that has made something
that is so close to ready that I would
call it 90% ready to go out of the gate.
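For a sense of the voice-of-customer grouping being automated here, a toy version of the manual step looks like this: bucket raw quotes under themes by keyword. The themes, keywords, and quotes are invented for illustration; the model presumably does something far richer than string matching.

```python
# Toy illustration of voice-of-customer grouping: assign each raw quote
# to the themes whose keywords it mentions. All names here are made up.
from collections import defaultdict

THEMES = {
    "pricing": ("price", "cost", "expensive"),
    "reliability": ("crash", "bug", "down"),
    "onboarding": ("setup", "confusing", "tutorial"),
}

def bucket_quotes(quotes):
    """Assign each quote to every theme whose keywords it mentions."""
    buckets = defaultdict(list)
    for quote in quotes:
        lowered = quote.lower()
        for theme, keywords in THEMES.items():
            if any(k in lowered for k in keywords):
                buckets[theme].append(quote)
    return dict(buckets)

quotes = [
    "The price is just too high for a small team.",
    "It crashed twice during our demo.",
    "Setup was confusing until I found the tutorial.",
]
print(bucket_quotes(quotes))  # each quote lands under its matching theme
```

Doing this by hand across 66 pages of unordered quotes is the melding-together-in-your-brain problem the speaker describes; the interesting claim is that the model goes straight from this kind of input to a narrative arc.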
A little bit of polish here and there,
but that's really it. And what's handy
about that is it does it in just a few
minutes, which gives you a chance to do
multiple iterations. Remember when I
said earlier in this video that part of
why I'm excited about this model is it
puts us humans back in touch with the
work. That clarity of narrative is what
I have needed to wade through AI slop
and actually find something useful. And
I saw it come through not just in decks
but in the clarity of presentation and
spreadsheets in the clarity of working
with it in Claude Code. It felt like
working with a good thinking partner. We
were able to quickly establish a file
structure to work together. It was just
a dream. And in the clarity of doc
writing, it was clear narrative
and didn't feel like I had to wade
through AI slop. And so if I
think about that and I think about the
minutes it takes to make this I realize
as a human who cares about good work and
doing it well I have multiplied my time.
And it's not that I've multiplied my
time to put out more 90% good artifacts.
I have given myself a shot at doing two
or three of these and having progressive
inputs as I look at the narrative and I
shape it and I think about whether
that's what I want to say and it's
relatively trivial in 30 minutes or 40
minutes to come out with exactly what I
want because each iteration now takes
five or six minutes to make with this new
Claude model. It's really easy. And if
you're wondering how prompt sensitive
the model is, this one's really
interesting. I haven't seen this in any
other model, and I would be curious for
your take as you play with it. When I
played with it, I found that it was
surprisingly useful regardless of the
prompt structure I applied. And so I
applied a super formal prompt structure
and I also applied a very casual prompt
structure which was just two or three
lines plus a bunch of data. In both
cases I got a very usable output. It was
healthy. It was happy. It was the kind
of PowerPoint you want to show around
the office. It was great. It was not a
problem. And that was also the case with
spreadsheets. It was also the case with
docs. And if that holds up, if you're
seeing that as well, what that suggests
is that Anthropic is doing enough
reinforcement learning on office
primitives like docs, like decks, like
PowerPoint that it's figuring out what
we want from shorter and smaller and
more casual utterances, which is a
really big deal because one of the
things that has made people really
frustrated with ChatGPT-5 is that it is
sensitive to prompting. I don't think
it's an accident that the ChatGPT
team has had to release prompt packs
aimed at ChatGPT-5. You know who hasn't
had to do that? Anthropic. They haven't
had to do it because the model does a
better job of understanding the kind of
work that you want and just going for
it. And this gets at one of the larger
takeaways that I think is really
interesting. Anthropic is betting on our
future for the next few years at least
being somewhat similar to what we have
today. Despite all the big hype and all
the big takeaways, they're investing in
a world where we will still need
PowerPoints, where we will still need
spreadsheets, where we will still need
the ability to run Claude Code as a
human and get something that boots on a
dev server. And what they're betting is
that what we need is clearer and more
professional outputs that we can
understand more easily. And that in turn
will mean that we take less time on the
grunge of our work. Because to be
honest, no one wants to trade the grunge
of the old way of doing things, pre-2022,
where we were just doing everything by
hand and get the new way of doing things
and it's just AI slop and we're just
wading through that and that's a
terrible slog. Instead, I had to yell at
ChatGPT-5 just yesterday because I asked
it for an outline with three elements
and it came back with seven and I said,
"You didn't put the time in on the three
I asked you, and you're just so
hyper-excited that you came back
with a bunch of extra." And that's a
tiny little story and it's not isolated
only to ChatGPT-5. Slop is a threat to
our ability to realize the gains of AI
workflows and AI productivity. And so
one of the things I'm excited about is
that there's some clarity in the work
produced by this model that I think
enables us to get back to creating
really useful pieces of work, whether
they're code or spreadsheets or
PowerPoints or what have you, and then
focusing on whether they're right and
then iterating if they're not. And that
becomes a workflow that I can get
excited about because it's less sloppy
and it fits into how teams already make
decisions. I also think the idea of
checking your work is something that
we'll start to see from other models. I
know models are being trained using
tooling where there is some recursive
looping and checking of your work. This
model is by far the most thoughtful and
careful about it that I have seen so
far. This model really cares to
understand how your prompt maps to a
particular piece of work and it cares to
get it right. Now, you might wonder,
Nate, you've been talking a lot about
docs and sheets and code and decks. Does
this thing only do that? And the answer
is no. I actually have used it for
conversations as well. I've used it to
sort of like get a sense of its thinking
and its ability, how it does if I ask it
to produce a response just in the chat.
And I get that same sense of clarity.
It's a model that really wants to cut
through the noise. And it's a model that
is able to give you some backbone. And I
think that's somewhat related to its
ability to check its work. It has a
sense of rightness. It has a sense of
what works and what doesn't. And when it
doesn't feel like something is correct,
it says so. And so, one of the subtler
things that I've seen come out that you
will also see is that this model has
some opinions on what is correct and
what is incorrect, whether you're saying
it or whether the model is saying it.
And that makes the model less like a
hyperactive squirrel on Adderall and more
like a thoughtful colleague, a colleague
that has opinions, a colleague that can
be persuaded, but a colleague that will
also push back sometimes and say, "I
don't think that's quite correct." And
that is a very tough balance to strike.
And if Claude has been able to strike
that balance with this new model, it is
a very good sign for us because it helps
us to have a more professional
relationship with AI where we're yelling
at it less, where we spend less time trying
to keep it focused and directed, and
we're more interested in how we can do
good work together. And I'm excited
about that because I really for one
would like to stop telling my AI models,
"No, you did too much. No, you went too
far in that direction. No, please stop
it." I don't want to be the only one
that's absolutely right around here. And
so my hope is that this new model
becomes a new decisioning baseline for
work. So let me unpack that a little
bit. I think that we reached a
productivity baseline with Opus 4.1 for
people who care about work. It is
possible to be productive not just in
conversation but with docs and sheets
and code and decks in Opus 4.1, which was
Claude's previous model. Now, with this
model, we don't just go from being
productive to perfect. We go from being
productive to decisioning. And this gets
at the heart of what I've been saying
this entire time. This model sets you up
to focus your time on making decisions
that matter because the work it produces
is really clear. And that's true in the
chat as much as it's true in any output
format that you want to select. That's
what I'm excited about because it feels
like we're moving from a work buddy
that works alongside you and gets you
okay drafts to a more professional
colleague that is designed to help set
you up to save time and make really
smart decisions. That makes me really
excited for the future because I would
love to have an AI colleague that's more
like that. And I want more interactions
that keep me closer to the work and that
help me to feel like I'm doing quality
work because we humans take a sense of
pride in that. I know you might not have
expected this video to go there. You
might think, well, Nate's going to just
talk about the AI model and how amazing
it is and how it automates things and it
is amazing and clearly it automates a
lot if it can go in one shot from 66
pages of customer quotes to a
PowerPoint. But that's not really why it
matters. It matters because it pulls the
humans closer to the work. We can work
as colleagues together because of the
ability to push back and to think
clearly and to express itself well. And
ultimately the work itself is higher
quality and much faster in a way that we
can be proud of as people because we
touched it and delivered our unique
stamp of perspective on it. The domain
experience we have, the metabolized
sense of integrity, the metabolized
sense of instinct that we have as people
who have expertise in our particular
area. This Claude model makes it easier
for our expertise to shine through. So
have fun. Check it out. Let me know what
you think. I'm still early in my testing
obviously. I've had it for a few days.
I'm really excited about it. Let me
know.