GPT-5 Launch Sparks Debate
Key Points
- The rapid growth of tool‑calling will lead to thousands or even tens of thousands of tools, creating huge opportunities for continuous ecosystem improvements beyond pure model performance.
- The episode was recorded early in the week and released ahead of schedule to stay timely after the surprise Thursday launch of GPT‑5.
- Both guests, Chris Hay and Mihai Criveti, agreed that while GPT‑5 is impressive, it has not yet supplanted Claude as their primary daily development tool.
- OpenAI’s livestream announced the GPT‑5 suite—including a core model plus “mini” and “nano” variants—and introduced new “Thinking” and “Pro” modes across various free and paid tiers with differing rate limits.
- The hosts framed the launch as another dramatic moment in the AI industry, setting the stage for ongoing debates about the impact of these new capabilities.
Sections
- GPT‑5 Release Sparks Tool Surge - A panel of experts debates how GPT‑5’s debut is driving a rapid proliferation of AI tools and could challenge Claude as developers’ primary daily assistant.
- OpenAI's GPT‑5 Release Highlights - The speaker outlines three key aspects of the new GPT‑5 rollout: a unified model router to simplify model selection, modest benchmark improvements, and a notable boost in reliability through reduced hallucinations.
- Nano Model Beats Larger Counterparts - The speaker praises the API's nano model (and GPT‑5) for surpassing big models in tasks like function calling and game‑based reasoning, noting its strong performance in demos such as the “Murdle” detective game.
- Expectations vs Reality for GPT-5 - The speaker notes that the new model’s focus aligns with earlier design cues and resolves prior free‑model limitations, making its direction unsurprising.
- Scaling AI While Managing User Expectations - The speaker highlights how a newly‑released, high‑traffic AI model alleviates many pain points and outpaces niche competitors, yet stresses the gap between lofty fantasies of instant world‑creation and the practical necessity of delivering a simple, mass‑market‑friendly experience.
- Embedding Tools in Model Analysis - The speaker explains that the model’s internal analysis channel conducts token prediction and can silently invoke tools such as Python for calculations, a strategy intended to enhance answer accuracy and reduce hallucinations.
- Scaling Inference and UI Innovation - The speaker highlights how lower inference costs, faster and smaller models, massive parallel tool calls, and superior cloud‑based interfaces are accelerating the ecosystem toward more powerful, user‑friendly AI systems on the path to AGI.
- Parallel AI Model Workflow Comparison - The speaker describes using multiple AI models side‑by‑side for coding tasks—especially unit‑test generation—and notes that while ChatGPT struggles, Opus consistently delivers better results in their integrated workflow.
- Internet‑Synced Blinking Donkey Demo - The speaker walks the audience through a live demo of a code‑generated donkey that blinks in sync with an internet clock, noting improvements over previous versions and contrasting it with other AI models.
- ChatGPT-5 Canvas Coding Struggles - The speaker describes repeatedly prompting ChatGPT‑5 to create a blinking‑donkey canvas script, battling truncated outputs and lengthy copy‑paste fixes, highlighting the inefficiency versus Claude's smoother handling.
- Balancing Cost, Experience, and Model Choice - Panelists discuss the benefits of AI competition, improvements in GPT‑5’s front‑end, frustrations with high subscription fees, and the appeal of combining cloud, GPT‑5, and open‑source models in their workflows.
Full Transcript

**Source:** [https://www.youtube.com/watch?v=A30mVgbG-OQ](https://www.youtube.com/watch?v=A30mVgbG-OQ) **Duration:** 00:33:54

Timestamps:
- [00:00:00](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=0s) GPT‑5 Release Sparks Tool Surge
- [00:03:07](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=187s) OpenAI's GPT‑5 Release Highlights
- [00:06:11](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=371s) Nano Model Beats Larger Counterparts
- [00:09:30](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=570s) Expectations vs Reality for GPT-5
- [00:12:37](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=757s) Scaling AI While Managing User Expectations
- [00:15:55](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=955s) Embedding Tools in Model Analysis
- [00:18:57](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=1137s) Scaling Inference and UI Innovation
- [00:22:02](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=1322s) Parallel AI Model Workflow Comparison
- [00:25:06](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=1506s) Internet‑Synced Blinking Donkey Demo
- [00:28:11](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=1691s) ChatGPT-5 Canvas Coding Struggles
- [00:31:13](https://www.youtube.com/watch?v=A30mVgbG-OQ&t=1873s) Balancing Cost, Experience, and Model Choice
With tool calling improving,
as Chris was saying, we're not going to see hundreds of tools.
We're going to see thousands or tens of thousands,
and there's a lot of opportunity for continuous improvement in just the ecosystem alone.
And I think we're going to see a lot more substantial
improvements in all these areas
outside of just the model performance
that are going to get us closer to that wow,
AGI moment.
Hello everyone.
Welcome to Mixture of Experts.
I am Bryan Casey, your default host for bonus episodes. Um,
as some of you may have noticed,
it's been a big week in AI this week.
We actually recorded earlier this week covering
some of the big announcements around gpt-oss and Genie 3.
And then it became very clear, after we had recorded that,
that GPT-5 was going to drop on Thursday.
So we made the decision just to release that episode on Wednesday. Um,
so it was still timely.
And then come back to you today with a discussion
around reactions and thoughts on the GPT-5 release.
So I'm joined today by
Chris Hay, CTO of Customer Transformation,
and Mihai Criveti, Distinguished
Engineer of Agentic AI.
And we are going to get into a discussion of the GPT-5
release, early reactions to it, thoughts on coding.
And maybe that's actually a good place to start
with the opening question today.
One of the big questions
coming into this release was, is GPT-5
going to replace Claude as the daily driver
for developers all over the world?
And so while we can't speak for everybody,
we can share our own opinions on that.
And so maybe I'll start with you, Chris.
Early reactions of like, do you think GPT-5 is going to be replacing
Claude for you as a daily driver?
No.
Sadly, I had high hopes.
But no. Mihai? Hi.
You know, it's kind of funny.
I was refreshing my chat
all night, and somewhere around 1 or 2 a.m.,
I got access to it, so I got out of bed.
I rushed to my machine, and I was working all night,
trying out prompts, trying it with different tools, trying it with MCP.
And in the morning I had to start my day job
and I said, all right, back to Claude.
And I found myself using Claude Code and Opus 4.1 again.
I think it's an amazing model,
but right now it's not replacing it yet for me.
All right. Well, I think that sets the stage
really nicely for, uh, some of the drama. Um,
because there's always drama in the AI industry. So,
um, with that, we'll get into today's episode
and I'll start by maybe just doing a quick recap.
I'm sure many of you have seen the news, um,
already, but I'll just go through for those of you
who might not have, uh, on Thursday afternoon,
OpenAI did a livestream where they introduced
the new GPT-5 series. Um,
There are three models
that were part of that: the core GPT-5
model, a mini, and a nano.
There are Thinking and Pro modes available across the various
free and paid tiers, with different rate limits
associated with them.
And going through the release,
there were three main things that I think stuck out to me
and stuck out to most of the market.
Um, so first of all,
was the introduction of this model router.
I think one of the memes on the internet has been
how complicated it is to, like, go into the model selector in ChatGPT
and just, like, figure out what model you're supposed to use
for anything; it had gotten incredibly complex.
And OpenAI has been talking about needing
to solve that and making that simpler
for months at this point, and consolidating everything
to the one brand family around GPT-5 with a model router
in front of it, was their way of delivering against that.
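The routing idea is easy to picture in code. Here is a minimal sketch; note the heuristics and tier names below are invented for illustration, since OpenAI's production router is a learned, internal component, not a rule list:

```python
# Toy sketch of what a model router might do: pick a tier from simple
# request heuristics. The rules and tier names are illustrative only.

REASONING_HINTS = ("prove", "step by step", "debug", "plan")

def route(prompt: str, needs_tools: bool = False) -> str:
    """Return a hypothetical model tier for a request."""
    if any(hint in prompt.lower() for hint in REASONING_HINTS):
        return "gpt-5-thinking"   # slow, deliberate tier for hard problems
    if needs_tools or len(prompt) > 2000:
        return "gpt-5"            # full model for long or agentic jobs
    return "gpt-5-nano"           # cheap, fast default for short chat

print(route("What's the capital of France?"))  # gpt-5-nano
```

The point of consolidating behind one brand is that the user never makes this choice; the router does.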
Um, the second piece was that
there are improvements in benchmarks, so it does look like it's a smarter model.
It is not, like, earth-shatteringly more intelligent than the other models
that are on the market, and that makes it a little bit different,
I think, from other model releases of this nature,
in that the improvements in the benchmarks
are maybe not the highlight, but there are improvements there.
But where we're seeing even more improvements is actually in reliability.
And some of the most important benchmarks that they introduced were actually around,
um, reductions in hallucinations.
So the idea that you can actually trust these models
more for work that you're doing day to day,
and then finally, which I think was surprising,
uh, in the same way that reliability was almost a surprising theme:
price, I think, was a very surprising theme.
Um, as part of this, like typically
when you think about state of the art models,
the thing that you're thinking about or state of the art technology,
you typically associate that with pricing power.
But actually one of the big takeaways, for me at least, was actually accessibility.
And the reaction in the market, particularly when we're looking at the API pricing, um:
these models, all three of them, are very competitively priced
and, you know, are in some ways
starting to deliver on that "too cheap to meter"
slogan that I think we've heard.
So those were some of the big highlights.
And then this also consolidates
many of the existing models that exist,
starting with ChatGPT, um, and kind of the UI.
And I think over time that will come
more will come to the API as well.
Um, but that's kind of a quick summary of the release.
Maybe, um, just to start with other general reactions,
we'll get specifically into the coding piece here in a minute.
But, um, maybe I'll start with Mihai
beyond just the kind of like the quick highlights.
I'm curious, like, what else stuck out to you about this release? What's important?
Um, and just any kind of general reactions you have.
I think what stuck with me this release is
how good this model seems to be at tool invocation and calling for AI agents.
I think you can really see the results of fine tuning
to specific things like MCP and tool invocation
and function calling and structured outputs,
and I think it's a lot more reliable than previous models we've used,
at least compared to, you know, GPT-4
or even some of their previous reasoning models.
and it does so at a cost, which seems sustainable
for using for this type of, I would say, agentic workload.
I think I'm tempted to go into the ChatGPT interface
and talk about that for a second.
But actually, if I think about the APIs for a second,
I kind of like what they've done there. Right.
So back to your point, we want to talk about the big, high-end models.
But those smaller models are killer, to Mihai's point, right?
So especially the nano model,
that little nano model on the API
outperforms most of the large models in the market,
and especially for agentic use,
as Mihai was saying, and function calling, it just gets it right.
So most of the time you'd be like,
oh, I'm going to go to a mini model,
or I'm going to go to the the full fat model.
I don't know if you want to call it that.
Should we size things like Coca-Cola?
But, um, you know, "Diet GPT-5"
is just killing it.
And GPT-5, like, they are just absolutely killing it.
So no, I'm I'm really impressed with the
the nano model and the, the other one.
And maybe I will do the demo a little bit later.
The browser
control is really good, and its logic
and reasoning. Back to the sort of ChatGPT interface:
So one of the, one of the things I like to do with models
is taunt them with games.
It's, uh, it's one of my fun things.
And, and definitely
none of the earlier versions of GPT
was ever able to solve the Murdle game.
Uh, I don't know if you've played Murdle before.
It's, um, it's kind of like you're this detective,
and you've got to figure out who the murderer is,
with what weapon and in what location. Um,
and it never worked.
It never got it right. It was always getting things wrong.
And so today, I sort of played
it against the Murdle game on the agent browser.
It took 20 minutes to play the game, but it got it.
It solved the murder.
And no earlier version was able to do that.
So I think they've really focused
on the planning, the logic, the reasoning.
So there's been a huge emphasis on that as well.
And, and I appreciate that.
Though, um, it does have the ability to cheat.
The second version was great.
The first time it played Murdle, it played for ten minutes
and then it looked up the answer on the internet.
So my second prompt was like, don't look up the answers, don't cheat.
But, um, but, you know, fabulous, fabulous model.
I'm actually I'm glad you mentioned that
20 minute time span, because one of the other things,
um, that I just saw
people most encouraged about on the internet,
was there some of these charts that just show,
uh, just the consistent curve of being able to
successfully complete longer time horizon, um, tasks?
And that is, you know, I don't want to
I don't know if I want to lump that
into the same sort of space as reliability,
because it does feel like a little bit adjacent to that.
But, uh, that was another aspect
I think people were, were pretty excited about.
Um, I'm curious if you got like, if this is what you both, like,
expected from this release, it gets kind of into the realm of predictions.
But I think one of the interesting things, just gauging the market's response to
this is either the night before the release or two days before that.
Sam Altman posted the Death Star on Twitter,
which I think like sent everyone
into, kind of like a little bit of a frenzy,
like, OpenAI is kind of known for the sort of vague
posting thing that they do on Twitter, where they,
you know, really hype these model releases.
And then what's interesting, it kind of came out and it's just this
like very, um,
strong focus on just like straight utility, um, in a lot of ways.
And, um, there had been some early kind of rumors and reports about like
that the focus was going to be on things like hallucinations,
but like when you think about what you were looking or
expecting OpenAI to do around GPT-5,
is this kind of like in line with what you were expecting in terms
of like the trajectory in this space, or was this in any way
kind of surprising in terms of where they ended up focusing?
I didn't find it very surprising just because they've released gpt-oss before,
and I've been playing around with that, and this feels very much in line
with the same kind of design, the same kind of style.
And I expected more of a GPT-5
kind of a release
just because it was time. Also,
from my own personal experience
of using some of the old free models or Free Pro,
I didn't find them to be that useful for general purpose tasks.
I found that they took way too long to accomplish a task.
They were prone to overthinking.
They were prone to coming in with very strange formatting
and even formatting issues.
And the way I'm seeing at least this model release is
as a fix to that, with a unified architecture
that kind of gives you, again, the core capability,
that wow moment we all had when we first picked up GPT-4.
Oh, yeah, I, I sort of agree with you,
but I think maybe we're a little too used to playing with different models.
Anyhow, in my opinion, right where
we'll soon play with Gemini, we'll play with big models,
small models, o3, o4 minis, but the pro versions
I will play with Claude, etc.
you know, we switch between a lot of models.
I think if you're not in that world,
I think this is going to feel like an incredible model, right?
Because, you know, let's be honest.
I mean, we'll come back to coding for a second,
you know, the front-end capability,
the ability to create good user
interfaces with React code,
is significantly better than in the earlier versions.
Now we would argue and say, well, actually
Claude has been doing that. Fine.
But the reality is,
you know, the GPT models
didn't generate good user interfaces, etc.
They weren't great at generating end-to-end applications.
And I think there's been a huge focus on that.
So I think there's probably less of a surprise there.
But if you can imagine the average user
for a second, you know, this is all,
to your point, super cheap.
It's like the $200 version.
You can just generate whatever you want.
This is I think this is a game changer
for most people.
We're probably just being a little critical in that sense,
but I but I think there is probably a few other things there
that, you know, to compare that to the gpt-oss model.
I mean, the reality is this one is multimodal.
It is doing audio, it is doing images, etc.
we're not selecting different models.
I mean, come back to the agentic capabilities.
The agentic capabilities are great.
And then to your point, I mean coming back to the APIs.
So some of the things I appreciate in the back
end is how they've handled the grammar and stuff.
So I can actually start providing my own grammar for function calls
and be able to guide the structure that I want back.
So I actually think some of the things
that they've done there solves a lot of our kind of pain points.
So, um, would I have wanted the,
the model where I just say, I'm thinking this, I'm going to sleep,
and 20 minutes later, you're going to have created the world?
Of course, we all want that.
But. But is that a reality?
Probably not.
But does this just level up everybody? Right.
Remember 700 million users on this
compared to the number of people on Claude.
It's a, it's a huge level up.
So I think we're maybe just too in the weeds.
It's funny, because when I looked at Twitter,
there were a few things where people were having
just, like, what felt like small séances, saying goodbye
to some of the old models that they no longer get to talk to anymore.
there was like a feeling of a little loss,
but it's like it is such a specific community that feels that sort of way.
And like 99% of the world is just totally overwhelmed by all that complexity.
So I think, you know, delivering something that, um, is much more
accessible to the masses
Um, makes a ton of sense.
I'm just still thinking
about kind of the dichotomy and reactions in in the market
where, um, your point of just like, I would love
to tell the model to do something
and wake up 20 minutes later and like, you know, it's
created a world changing sort of application.
One of the one of the reactions I saw in the market
is that this kind of confirmed that people believed
that the march to the intelligence explosion, AGI, ASI,
whatever you want to say
is going to be more of a slog.
It's not going to be you're not just going to wake up one day
and there's going to be, you know, we're magically there.
And it could also be the case that, um,
you know, it's actually not going to be maybe just one giant, uh,
model to rule them all.
But like when I think about this, compared to like, Genie 3,
it's like those feel like pretty different, um, you know,
approaches even I think the discussion will get to around coding.
Like, it still feels like there's plenty of room for for Claude in this.
And so, you know, maybe as the last question before
we dive into the code: did this
release update any of your thinking at all
around the trajectory to some of like the
the bigger topics in the industry around
AGI, ASI, intelligence explosion? Did it?
You know, some people seem to think it was
a little bit of a wet blanket, um,
on that; other people felt like we were kind of right on target.
But, you know, I'm curious where you guys
come down on this and maybe start with Chris.
I think I think actually
probably I know this is going to sound odd,
but I think the gpt-oss model probably
hit me more with that
because we got to see some of the underlying architecture more,
and then I can kind of project forward what's going on.
And, uh, you know, in the larger models,
I think there is a few things that are as key.
So the first one is,
I know I say the word "agents" all the time,
but I really think agents is the big way through this.
And one of the things that I noticed on the gpt-oss models
is the analysis channel.
So if you think of the thinking mode for a second, in the
Responses API, the model is doing its thinking,
you know, basically its reasoning,
in the analysis channel, which opens with the first token.
And then basically all the tokens go into the analysis channel.
That's where it does the thinking. And then it will create a new token
to say, this is the final response.
So that's kind of what happens in the back
end, from a kind of next-token-prediction point of view. Right.
So all of that thinking happens in this channel.
But but if you look at some of the system prompts and the gpt-oss um,
code base, you will see things
like when you're in the analysis channel, um,
if you're going to do math or things like that,
go use your Python tool,
but don't tell the user, right.
Just do the calculation
and the thinking, but you don't need to tell the user about it.
So what we're seeing here is tool use.
So there's two types of tool use that's going on.
There is tool use where you are going to say, here's my function call.
Go call this thing on the outside.
But there's a set of tools that are going to be used
by the models themselves to get around limitations like, um,
uh, you know, not being able to do math reliably, etc.
So there's a whole level of experimentation
that I've been going through, like, um, you know, what's this
multiplied by this? And then I'm like, whatever you do, don't,
um, use a tool in the analysis channel,
to try and see how much it's doing that.
And I think this sort of embedding of tools into the thinking in the model
to be able to nail down the little things
and, uh, you know, reduce the hallucinations and have more accurate answers,
I can see that expanding out.
I can see that expanding out into,
you know, hundreds or thousands of tools in the future. Right.
So I think that's one direction that I'm probably seeing.
Then the other thing is, you know, and it's fitting, given that we're on this podcast,
the whole mixture-of-experts thing, right?
So when we look at the gpt-oss model,
what you're seeing is a lot of experts, right?
So maybe there's four active experts,
but the number of total experts in these models is huge.
It was like 32 experts.
I think the larger one was like 100 odd experts or something like that.
So what they're what they're doing here is just expanding out
the number of experts that are part of this model.
Um, with a much, much smaller
number of parameters per expert, which makes sense,
because then you can really speed
tokens through the model, because guess what's important to us?
We're not prepared to wait for the model to come back with answers.
And when it's a big model,
you're having to wait for it to go
through the layers to churn out those tokens.
And as we as human beings, we're like,
no, no, no, no, no, no, no, I'm not waiting.
So I think this push towards
much smaller models, which are much more distributed.
And I think that's going to continue to AGI.
And then, I imagine, what's probably happening in the GPT-5
era models is they've
probably still got those smaller partitions there.
But I imagine for some of the hard thinking,
they've just gone to a larger parameter count on some of those bigger models.
So I think
there are a lot of clues as to how we're going to get there. Yeah.
And I think this can also pull the AGI
timeline forward by speeding up experimentation,
especially with the gpt-oss release.
I think there's a lot of dimensions in which these models can improve,
not just raw performance.
I think one I'm really thrilled about is the inference cost,
and that gives you the opportunity to make things work in different ways.
So you can just fire off hundreds of these requests at different tools
and then summarize those results.
There's also inference speed.
And this is also kind of dictated by hardware,
but also by having smaller models and more efficient models.
If you have inference speed.
And I'm seeing this with gpt-oss: it's running at 180
tokens a second on my single GPU, and that's impressive.
You can really hit hundreds of these requests in parallel
and with tool calling improving, as Chris was saying, we're not going to see
hundreds of tools, we're going to see thousands or tens of thousands,
and there's a lot of opportunity for continuous improvement
in just the ecosystem alone, and even in the user interfaces
most of the consumers are using.
Whatever it was, the 700 million: they're using the ChatGPT UI.
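The fan-out pattern Mihai describes is simple to sketch with asyncio. Here `fake_call` is an invented stand-in for a real inference or MCP tool request:

```python
import asyncio

async def fake_call(i: int) -> str:
    # Stand-in for one small model or tool request.
    await asyncio.sleep(0.01)  # pretend network/inference latency
    return f"result-{i}"

async def fan_out(n: int):
    # All n requests run concurrently, so total wall time is roughly one
    # request's latency, not n of them.
    return await asyncio.gather(*(fake_call(i) for i in range(n)))

results = asyncio.run(fan_out(100))
print(len(results), results[0])  # 100 result-0
```

Cheap, fast models are what make it economical to issue a hundred of these calls and then summarize the results, rather than asking one big model to do everything serially.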
And one of the reasons I love Claude
is because I find its
UI to be superior
in handling things like artifacts,
handling things like projects,
and how it uses its "canvas"
versus the way ChatGPT was using it.
now this is leading to improvements
in the user interface aspect as well.
And I think we're going to see a lot more substantial
improvements in all these areas
outside of just the raw model performance
that are going to get us closer to that wow, AGI moment.
And I think that's actually pretty consistent with some of that.
I feel like, as I mentioned, there was,
I think, a fair amount of folks who were like,
I don't know, maybe we're at AGI now.
And this was like, oh, this is not AGI.
But all of these underlying things,
like the reliability, the tool calling, all of these feel like prereqs
to actually getting there, and then making progress on,
um, I think, those dimensions will end up being
very obviously a major part of the story.
All right.
For our next segment, where we talk about code, we're actually going to,
for what is one of the first times on the show,
actually do some live demos.
Uh, and so if you're listening along on audio and you actually want to see,
um, some of what's going on on the screen,
and I promise you, for at least one of these segments,
you're going to want to see some of the beautiful artwork that's on display.
Head over to the IBM Technology
YouTube channel where you can see some of the stuff live.
You know, with that, I want to get to our last segment,
which is, I think for as impressive and exciting
as some of these announcements are, for as much as I think
it'll make an impact on hundreds of millions of people who use these tools,
coding is one of the big themes in the blog post;
it's like one of the most talked about,
um, you know, parts of the release.
Uh, basically every forum is,
you know, for as much as OpenAI has done in this space,
like Claude and Anthropic have, you know,
continued to just be the leader in kind of the coding area.
And there was the question of like, will this be the release
that puts them over the top and, um, and gets them there?
And based on the initial question,
um, it sounded like for both of you, the answer to that,
at least right now, is not quite yet.
And so, you know, maybe, uh,
starting with, with Mihai just on this one.
But, you know, for you, like, how far did it get?
Like, did it get close? And, like, why didn't it get all the way?
What are still the difference makers?
You said you're like, okay, time to go back to my day job.
I'm going back to, you know, Claude, at this point.
You know, why?
What were the big differences for you that you still feel like,
you know, Anthropic has an advantage in this space?
I think first, I want to define how I typically work with these models.
I don't work with just one model.
I actually use them at the same time in parallel.
While Claude is busy doing something useful for me,
I might fire up,
I would say, ChatGPT, or I might fire up
some other model, or Gemini, to do some deep research, and I kind of
have them all working in parallel while one is busy doing something.
Um, but I give them different tasks.
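The parallel, several-models-at-once workflow Mihai describes can be sketched in a few lines of Python. The `ask` coroutine here is a hypothetical stand-in for real provider SDK calls, not any actual API:

```python
import asyncio

# Hypothetical stand-in for real model clients (Claude, ChatGPT, Gemini);
# a real version would call each provider's own SDK here.
async def ask(model: str, prompt: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for a network round-trip
    return f"{model}: done -> {prompt}"

async def fan_out(tasks: dict) -> dict:
    """Give each model a different task and run them all concurrently."""
    names = list(tasks)
    replies = await asyncio.gather(*(ask(m, p) for m, p in tasks.items()))
    return dict(zip(names, replies))

# One model per job, all in flight at once, mirroring the described workflow.
results = asyncio.run(fan_out({
    "claude": "write unit tests consistent with my codebase",
    "chatgpt": "draft an architecture diagram",
    "gemini": "do some deep research",
}))
```

The point of the sketch is only the shape: fire each task, do other work while they run, collect results when they land.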
And I was hoping that for my typical day to day workflow
where I'm using things like Claude Code, for example,
or projects in Claude with Opus
to give it hard problems to solve, to ask
for unit test case creation
in a way which is consistent with my code base,
that the new models and the new experience I have with ChatGPT
would be able to pick that up seamlessly.
And I still find that, at least for me, um,
ChatGPT struggles while Opus is still able to deliver those use cases.
So let me jump into a very quick screenshot
to show one of these workflows. Uh,
live.
So here you'll see that I've got four windows side by side,
which kind of replicates my real working environment. I've got different things.
I've got, uh, Continue and I've got Cline, but I use multiple models.
So here you can see, for example, I've got gpt-oss running on my machine.
It's thought for 25 seconds and it came back with a complex mermaid diagram.
I'm going to pick it up.
I'm going to paste it in mermaid.
As you can see, it actually worked.
First time, one shot, it gave me the results I need.
I'm going to do the same thing live with GPT-5.
And I've tried the same thing with Claude.
As you've seen with Claude before, Claude gave me, again,
first time, a great diagram.
I'm getting similar things from even the oss model.
With GPT-5,
uh, well, let me try that again.
It's actually not answering.
Maybe it got slightly intimidated.
I'll try a new chat again.
Let's create a complex diagram.
Yeah, I think I've intimidated it with the benchmark.
But the experience I've had is that I have to try
maybe 2 or 3 times.
Take the error I get from the mermaid renderer,
give it back to the ChatGPT
UI and it'll try again and again before I get one of those things.
And with Claude Code, and Claude seems to handle those things
a lot more gracefully behind the scenes,
and is able to give me that experience the first time
and is able to deal with, for example, continuing large codebases
in a way that ChatGPT does not,
or in a way that OpenAI models don't seem to.
At this point, again, the difference is minuscule.
But if you're working with very large code bases,
or if your code files exceed, you know,
a thousand lines of code, where models tend to struggle,
I find Opus still to be the superior model.
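The loop Mihai describes by hand, render the diagram, take the renderer's error, paste it back, retry, can be automated. A minimal sketch, where `generate` and `render` are hypothetical stand-ins for a model call and a Mermaid renderer:

```python
def fix_until_valid(prompt, generate, render, max_tries=3):
    """Ask the model for a diagram; on renderer errors, feed the error
    back into the prompt and retry, mimicking the manual copy-paste loop."""
    feedback = ""
    for attempt in range(1, max_tries + 1):
        source = generate(prompt + feedback)
        ok, error = render(source)
        if ok:
            return source, attempt
        feedback = f"\nThe renderer reported: {error}. Please fix the diagram."
    raise RuntimeError(f"no valid diagram after {max_tries} tries")

# Stub model that only succeeds once it has seen renderer feedback.
def stub_generate(prompt):
    return "graph TD; A-->B" if "renderer reported" in prompt else "graph TD A->B"

def stub_render(source):
    # Toy validity check standing in for a real Mermaid renderer.
    ok = "-->" in source
    return ok, None if ok else "parse error near 'A->B'"

diagram, tries = fix_until_valid("Draw a complex architecture diagram.",
                                 stub_generate, stub_render)
```

With the stubs above, the first attempt fails, the error is fed back, and the second attempt returns a diagram the renderer accepts, which is exactly the two-to-three-round experience described.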
Chris, your thoughts? I would agree.
In fact, I'm going to jump to my demo, which is way more enterprise than, uh,
than Mihai's one.
But I think it will help.
Um, I think it will help
sort of back up what Mihai was saying, in a sense,
so I am going to apologize right now for what
I am going to going to show our wonderful audience.
So I'm going to share my screen,
and here is my best test in the world.
So it's my test for everything,
where I like to create donkeys.
And what I want the donkey to do,
as we have got donkey vibes on the Mixture of Experts
podcast, and they're donkeys,
my donkey has to blink in time
with an internet clock every second, and actually,
hey, it blinks. It blinks.
It should blink in time with the clock.
Now, don't get me wrong.
And it should be internet-synced and then fall back.
What I actually really appreciate about this,
I mean, this is the code that it generated.
I think the code is pretty good, actually.
And this is, this is probably,
um, quite a bit of a change from before,
where it used to generate terrible code.
But actually, I think the code is pretty, pretty good.
But let's just run that one more time, you know?
Um, but this is quite nice, in their canvas
area, that it will go off to the internet,
whereas Claude, etc., doesn't.
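The internet-synced-with-fallback behavior Chris praises can be sketched on its own. The actual demo is browser JavaScript; this Python version, with an injectable time source and a made-up failing fetcher, only illustrates the fallback pattern:

```python
import time

def reference_time(fetch_network_time, local_time=time.time):
    """Prefer an internet time source; fall back to the local clock
    if the fetch fails, as the demo is described to do."""
    try:
        return fetch_network_time(), "network"
    except Exception:
        return local_time(), "local"

def next_blink_delay(now: float) -> float:
    """Seconds until the next whole-second tick, so blinks stay on the beat."""
    return 1.0 - (now % 1.0)

# Simulate an unreachable time service to exercise the fallback path.
def broken_fetch():
    raise ConnectionError("time service unreachable")

ts, source = reference_time(broken_fetch, local_time=lambda: 1000.25)
delay = next_blink_delay(ts)
```

Scheduling each blink at the next whole-second boundary, rather than on a fixed one-second timer, is what keeps the animation from drifting off the clock.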
Now, you know, we're all feeling good at the moment.
We've got my blinking donkey. Um,
but I'm going to show you the problems
in a second before we do this.
This is Claude, uh, Sonnet, in a sense.
So it's a blinking donkey.
I don't know what's going on with the tail
there, to be honest, but, um,
it's fine. It's blinking, etc. Um,
this is Opus.
I quite like this.
I, I still don't know what's going on there, but this is Opus 4.1. Uh,
so, you know, we're probably feeling, uh,
we're probably feeling pretty good now.
Now, to show the frustration, I
think this highlights the difference here.
So in this case, I decided
I was going to go to GPT-5 Pro.
So I wanted to get the best donkey that I could possibly get.
It thought about it for seven minutes,
seven seconds. Um,
but,
And there's all the plan, etc. associated with it.
the problem is it didn't put it on the canvas,
so that is not what happened there.
It said, save this as a donkey HTML.
So even though I'm using Pro at this point,
it refused to put it on a canvas.
Now look at the size of this text. It's quite big, right?
So we're all feeling pretty good.
This is going to be a quality donkey.
That's our donkey.
Um, and it's even got an ear wiggle in there.
So, uh, it does that every minute.
Um, and I said, put it in the canvas,
and I asked Pro to do it, right.
I just said, put this in the canvas,
and it took eight minutes, and then it went, you got it,
here's a pure canvas version, and you're like, huh?
That's not what I wanted.
I wanted it over here.
And so it didn't help.
So I then switched it back to ChatGPT, uh, regular.
So I changed the model to ChatGPT-5 there.
And I said, put it in the canvas. And it went,
I put your blinking donkey into canvas, blah blah blah.
Um, and if I were to show you
the version that it created, um,
it's down here somewhere.
Um, let's see if we can find the code.
You can see it's like, it was like,
uh, it was tiny. Yeah.
Here we go. Best on canvas.
It was like, uh, that's an earlier version. Uh,
anyway, I can't find it, it's an earlier version.
It was, it was like a 10th of the code.
So you will see, I then came back and said, no, no,
no, no. Do it properly. See,
I said the words, don't omit.
Because actually, what happened in that particular version,
it just went, and here's the rest of the donkey.
Do you know what I mean? I was like, no,
I don't want the rest of the donkey.
You know, it started.
It's the old trick, the problem that you have with ChatGPT.
It just starts cutting things out and says, rest of your code goes here.
So I wrote, don't omit, and then I,
and I said, give me the full version,
you know, give me this so it's ready to run.
And then eventually.
So I'm copy and paste code back in.
And then eventually, uh, it gave me this version, which ran there.
But you can see that took 25, 30 minutes
to produce this donkey.
Um, and most of the time was me
going, don't omit, put it in the canvas over here, etc.
Now, that's probably not the best workflow
in the world in that sense, but I'm trying to prove a point.
And the point is that Claude
just gets it with your artifacts, right?
It will just, it will remember what you did.
It will update it, and it will just work.
And then it won't omit code.
But ChatGPT still has this code omission problem.
Now, this is not a model problem.
I think this is more of a user experience problem,
but it's still, and it's probably a kind of a cost optimization problem,
but it's still frustrating enough for me
that if it can't keep track of the artifacts,
and it does the same kind of thing in APIs as well,
then it's just going to push me back into Claude, because I,
I don't want it sort of skipping stuff.
So that, for me, was the major frustration behind the model.
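The "rest of your code goes here" elision Chris keeps fighting can at least be caught before you paste a reply over working code. A rough sketch; the marker list is a guess based on common model habits, not any documented behavior:

```python
import re

# Phrases models commonly substitute for elided code; extend as you meet new ones.
ELISION_PATTERNS = [
    r"rest of (your|the) code",
    r"\.\.\. existing code \.\.\.",
    r"remains? unchanged",
    r"# *\.\.\.$",
]

def find_elisions(code: str) -> list:
    """Return the lines that look like the model omitted code."""
    flagged = []
    for line in code.splitlines():
        if any(re.search(p, line, re.IGNORECASE) for p in ELISION_PATTERNS):
            flagged.append(line.strip())
    return flagged

reply = """def blink():
    # rest of your code goes here
    pass
"""
suspects = find_elisions(reply)
```

A non-empty result is the cue to send back a "don't omit, give me the full file" follow-up instead of accepting the reply.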
But code-wise and capability-wise, they are very, very close.
Strong world model around donkeys.
So for our audio only listeners out there,
I will say that the last donkey we were looking at
was, aesthetically, from my perspective, the best donkey in there.
So I was impressed with that.
Um, but perhaps, and even the code, as you were mentioning,
I think solid code, but like, not quite there,
just from a workflow,
ease of use, straight developer productivity,
um, you know, sort of, um, sort of vibe.
And, you know, I think based on what I saw, like
there was a lot of that sort of reaction on the internet,
which was like, this is really impressive.
I like what they're doing a lot.
Still gotta use Claude.
Probably, uh, on this end.
So, um, but I do think it's like, it's always great
that there's a ton of competition and innovation in this space,
because it means we'll just continue to get, you know,
better and better things, um, to work with. Go ahead,
Chris. But to your point, the front end code
and the experience and the gloss, I think, is better now.
It is definitely better in GPT-5.
They just need to sort out that developer experience.
And the token optimization there. Just, ah, stop doing this.
I pay my 200 bucks.
I want to, I want to cancel my Claude, uh,
subscription. I don't want to be
paying 280 bucks.
I want to come down to the 200.
So just do that and save me some money.
We all just need... We're all going broke,
uh, having to pay for ten of these things at the same time.
So, um, Chris, Mihai,
thank you for joining us today.
Any final thoughts before we
let the audience go and send them on their way to,
you know, wrestle with ten models simultaneously
and figure out which one they like for what?
Go ahead and try these things, especially the gpt-oss.
I'm still, I'm still there.
Even with all these new fancy models,
I'm still passionate about going off
and playing with a gpt-oss model, running it on my machine,
being able to use it in generic workflows,
and being able to use it in a combination where I might use GPT-5
for my orchestration, I might use Claude for some code-specific tasks,
and even gpt-oss for tasks where it can perform reasonably well for its size.
I think I'm gonna experiment.
Yeah, I would agree.
Go experiment.
And then I would say go play with the agents. Honestly,
I think the web browsing capabilities now are incredible.
I really think the tool calling capabilities are incredible.
So go, go play with that.
And I think that's where it's outshining everything else.
And I will maybe also just close with a little shameless plug
for some of the work that we're doing in watsonx,
so you know if you're interested.
We've obviously always been a big supporter
of the open source, open source AI space.
If you'd like to use some of the gpt-oss capabilities,
we have those in watsonx today. Um,
we've also, with some of the new work
that we've been doing around model gateways,
we're actually making it easier and easier to bring frontier models
and API keys that you have to our platform.
you know, obviously go
try these tools, use them, and, you know,
check out some of the ways that hopefully we're making it easier
for you to consume them, um, and some of the stuff that we're doing.
So I'll just say again, Chris, Mihai, thank you for joining today.
To our audience. Thank you for listening.
Um, it's been another exciting week in AI.
And, as the token podcast
line goes, make sure to like and subscribe.
Um, if you're a fan of the pod and we will see
you next time. Thanks everyone.