Simple Wins: AI Model Adoption
Key Points
- The “simple wins” framework advocates adopting new AI models by first proving they can reliably solve a small, repeatable, low‑risk task you perform daily, rather than relying on benchmark hype or one‑off prompts.
- Traditional model evaluation (benchmark charts, dopamine‑triggered trials) often leads users to default back to familiar tools like ChatGPT, because those tests don’t reflect real‑world workflow impact.
- Viewing models as a hierarchy of superior “rungs” is misleading; instead, treat each model as a distinct competence that must be matched with the right interface and integration layer to be effective.
- By focusing on tangible, incremental wins, teams can avoid polarizing “model wars,” reduce artifact friction and review burden, and build a sustainable system that routes different work to the most appropriate model over time.
Sections
- Simple Wins Model Adoption - The speaker advocates a pragmatic “simple wins” strategy for adopting new AI models—evaluating them based on small, repeatable tasks that deliver obvious, low‑risk value each day instead of relying on benchmark hype.
- Choosing LLMs for Real Work - The speaker explains that model selection should focus on which AI reliably handles specific business tasks, emphasizing three recurring pain points: information overload, the effort of formatting outputs, and navigating human ambiguity.
- GPT‑5.2 as Workflow Execution Engine - The speaker describes GPT‑5.2’s ability to generate complete, professional artifacts (e.g., spreadsheets, presentations) that streamline complex analysis, while cautioning that its drive for coherent output can unintentionally hide contradictory or messy underlying data.
- Dual Execution Lanes and AI Competition - The speaker contrasts business‑artifact versus code‑centric work streams and compares how OpenAI’s GPT‑5.2 (with Codex) and Anthropic’s Opus 4.5 compete in each lane.
- Choosing Simple Tasks for Model Evaluation - The speaker advises testing AI models with straightforward, measurable tasks, logging outcomes, and prioritizing practical usefulness over chasing the most advanced model.
**Source:** [https://www.youtube.com/watch?v=ijdhIGRB_Kc](https://www.youtube.com/watch?v=ijdhIGRB_Kc)
**Duration:** 00:16:06

Timestamps
- [00:00:00](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=0s) Simple Wins Model Adoption
- [00:03:22](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=202s) Choosing LLMs for Real Work
- [00:07:10](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=430s) GPT‑5.2 as Workflow Execution Engine
- [00:11:37](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=697s) Dual Execution Lanes and AI Competition
- [00:14:42](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=882s) Choosing Simple Tasks for Model Evaluation

Full Transcript
simple wins. I want to talk today about
a detailed comparison between ChatGPT
5.2, Claude Opus 4.5, and Gemini 3. But
instead of just giving you a baseline
model comparison, I want to let you in
on how I think about adopting new models
into my workflow because that is the
hottest topic that I could think of for
2026. We're all going to have a lot more
new models. It's not just going to be
these three. How do we think about
adopting them in a way that's
intelligent? And I'm going to go back to
it. Simple wins. It's the only model
adoption strategy that doesn't rot. And
I'm going to explain it and how it
works. And you're going to be able to
learn it and use it for your workflows,
too. It's not going to take very long.
The way most people evaluate a new model
is by reading a benchmark chart, by
trying a clever prompt, by feeling a
dopamine hit or not, and then they
slowly drift back to whatever tool they
default to. That's why so many people
end up in ChatGPT. It's not because the
new model isn't good. It's because the
evaluation isn't real. The only
evaluation that matters is whether a
model can deliver a simple tangible win
that you would use every day. I'm
talking about a small repeatable piece
of work that you actually do all the
time where success is obvious, the
downside is contained, and the output
lands in spaces that your org already
runs on. So simple wins is not just a
cute productivity slogan. I'm not
putting it on a t-shirt. It's a
discipline. It prevents you from turning
model choice into the Mac versus Windows
wars, right? Into an identity. You need
to not think that way to survive in the
AI future. Instead, simple wins forces
you to confront real bottlenecks at work
like artifact friction that you may have
because it's too complicated to make
them or review them, like review burden.
It gives you a path to compound the
adoption of models over time without
pretending that you're doing lots of
complicated work at any given moment to
test out a model. Because the deeper
point is that models should not be
viewed as a single ladder of
intelligence where every new release is
a new rung you have to reach and migrate
everything to. Instead, think of them as
different shapes of competence that live
inside different kinds of surfaces. The
model matters, but the interface and the
harness matter almost as much, if not
more. And if you ignore that, you're
going to keep trying to look for the
best model, and you're going to feel
like the AI is unreliable and everything
is changing. If you lean into the idea
of simple wins, you're going to end up
with a sane system for routing work to
different models. But let's make that
more specific. What's changing right
now? A lot of people are asking
themselves whether they should keep
evaluating AI as a chatbot, whether you
should still have an interaction pattern
at core that is prompt, response, tweak.
That's no longer the main place for
serious work. The big shift with the
current generation of models is that you
increasingly need to hand the model a
real work packet, an assignment with a
deliverable and you need to expect it to
stay coherent long enough to produce
something that you could ship directly
after a quick review. That is explicitly
what OpenAI positioned ChatGPT 5.2 to do.
But it's not just OpenAI. Opus is
thinking about that. Anthropic is
thinking about that. Gemini is thinking
about that too. Once you start operating
that way, which model is the smartest
just becomes the incorrect question. The
useful question becomes which model plus
its surface reliably completes a
particular kind of work without a lot of
downstream pain. That's where the
differences between ChatGPT 5.2,
Gemini 3, and Claude Opus 4.5 really pop
out and become very practical if you
look at them through the lens of real
business work. Now, I know that most
knowledge work comes across as
complicated, but my observation is that
it collapses into a few recurring pain
points that are probably relevant for us
to think about when it comes to this
kind of assessment. The first pain is
bandwidth. There's just too much to
read. There's too many inputs. There's
not enough time to build the mental
model. It's one of those things where
you have a doc pack that you need to
read to walk into the board meeting and
not look confused, but you just don't
have time on the plane to do it. The
second pain is execution on those
artifacts, right? It's work that has to
end up in Excel or a deck or a
structured doc. And the burden is not
just having the idea or a correct
understanding. It is gosh darn it, we
have to make it all add up and make the
deck and package it in the format that
the business runs against or else the
work is not done. And then the third
pain is human ambiguity, right? The
messy, political, contradictory reality of
the organization, where tone matters,
where incentives matter, where who got
promoted last matters, and where false
coherence can be much more dangerous than
admitting uncertainty. If you can figure
out which pain matters most, it's going
to help you figure out what model you
need to work on. Let me give you some
examples from current leading models.
So think of
Gemini 3 as a bandwidth engine. Gemini
3's superpower, when it's working well,
is that it can ingest an absolutely
absurd amount of material and give you a
clean overall map. Google is really
explicit about Gemini 3's massive
context window. And the practical effect
of that million tokens is not that
it's magically
smarter. It just means that it loses the
thread less often when the input is
really huge and messy and it can dig
into a big synthesis without collapsing
into shallow summarization. So the
simple win for Gemini 3 is
not "write my strategy memo." The simple
win is turn this mountain of stuff into
some kind of a map so I can make sense
of it. So feed those long docs, feed it
those notes, feed it those screenshots,
feed it the meeting transcript and ask
for an outline that makes the problem
space really legible. What's being
claimed? What contradicts what? What's
missing? What should I ask next?
Gemini is often really really good at
this kind of compression when the
alternative is hours and hours of
reading. Where Gemini tends to create
pain is downstream. The business world
is still deeply Microsoft Office shaped
and there's often a conversion tax when
you need to get a great synthesis and
turn it into a spreadsheet, a deck or a
document in the exact structure that
your org expects. The model can be
brilliant and still lose you time
because of the workflow and its
friction. So I don't treat Gemini as the
model for everything, but I do treat it
as a model I reach for when the
constraint is really input volume and I
want clarity. It's a good bandwidth
engine. Think of ChatGPT 5.2 as an
artifact execution engine. So ChatGPT
5.2's fingerprint is really different
from 5.1. The WOW is primarily not that
it can read more, it's that it can stay
organized through longer assignments and
return business-shaped deliverables like
docs or tables or decks coherently
without falling apart. So, OpenAI's own
framing emphasizes professional tasks.
This is what they built it for, right?
Tool use, making artifacts like
spreadsheets and presentations. The
simple win for GPT 5.2 is give it a
real artifact. Give it a clean, tight
brief and get back something that looks
like a junior analyst did all the work.
It's not necessarily a perfect answer,
but it's a great work product that will
save you hours and hours and hours of
time, especially against long and
complex analysis problems. When GPT 5.2
is on, it just goes. It feels like an
execution engine. It maps, it checks, it
computes, it synthesizes. It's
incredibly reliable at following
instructions. It goes all the way to the
end work product. It also benefits from
the practical reality that ChatGPT's
file pipeline is built like a "hand me the
artifacts" workflow, right? It has large
file support. It has better tolerance
for mixed inputs in a single thread.
That might sound like boring product
detail, but it's the difference between
AI as a toy and AI as a part of my
operational workflow. It's a big
deal. ChatGPT 5.2's failure mode,
in my experience, is not stupidity. This
is a really smart model. It's the danger
of premature coherence. The model really
wants to make everything line up. And if
your underlying reality is too messy or
contradictory, it may enforce a clean
and coherent reality that's
very convincing, but that's cleaner than
the truth. And so the model's power
ironically makes this risk worse, not
better, because it can produce a
really beautiful wrong answer if your
underlying reality is incoherent. So you
need to treat it like a junior operator
and give it really clear structure.
Understand the underlying contradictory
nature of your inputs. Maybe contradictions
aren't there, maybe they are, but get clear on it.
And then understand what you're going to
get by asking the model to step into
that kind of problem space. But net-net,
I use GPT 5.2 all the time. It is a
great daily driver for me. It does do
that hard workflow stuff really well.
What about Claude Opus 4.5? Think of it
as a persuasion layer and an absolute
agentic and harness coding monster. Opus
4.5 is where you need to think about
writerly taste. You need to think about
whether it sounds like a human. You need to
think about how it positions hybrid
reasoning, good style, a large context
window, and an ability to actually
synthesize all of that together and come
up with text that is meaningful and
useful as-is for persuasive business
writing. So, agentic ability is not a
pure model property. It's it's actually
a property of the system as a whole. And
what I'm calling out here is that part
of how Claude Opus 4.5 can write well,
part of how it can code well, is because
of the harness that Anthropic has put
around the system. The tool calling, the
skills ability, the harness and guard
rails let it operate inside a loop with
good feedback and safe edit primitives.
And Anthropic has been able to get to a
phenomenal level of work quality as a
result. And so a lot of engineers end up
preferring working with Claude Opus
4.5 as they code because they get those
tight feedback loops because it will
work with tools they can understand and
call because the harness is really easy
to work with and manipulate. You can
obviously put in your own markdown files
if you're in Claude Code. And because the
system is designed to relentlessly
follow instructions and build stuff, you
have to provide the design and
structure. It's going to build. I find
that that's true with creating artifacts
as well. I don't get the same context
window advantages I have with chat GPT
5.2 or with Gemini. If it's a truly huge
piece of work, it's not going to fit
with Claude Opus. And we just need to be
honest about that. But if it's something
where I need to craft a really beautiful
persuasive piece of business artifact,
whether that's a deck, whether that's a
doc, or even whether that's a
spreadsheet, the most polished outputs
today come from giving Claude a slice of
context that's useful, a clear set of
instructions, and then room to work and
cook. Claude does a great job using its
tools to go to town and produce
beautiful artifacts over time. That
agentic harness that I talk about for
coding works for non-coding as well.
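The pain-to-model matching described above can be sketched as a tiny routing table. This is purely illustrative: the pain labels and the `route` helper are names I've made up, and the defaults just mirror the speaker's current picks, which he stresses will change.

```python
# Illustrative routing table: map a task's dominant pain point to the
# model the speaker tends to reach for today. Labels are descriptive
# names for this sketch, not identifiers from any real API.
ROUTES = {
    "bandwidth": "Gemini 3",         # huge, messy inputs -> a clean map
    "artifact":  "ChatGPT 5.2",      # Office-shaped deliverables
    "ambiguity": "Claude Opus 4.5",  # tone, persuasion, polished writing
}

def route(pain: str) -> str:
    """Return the current default model for a task's dominant pain point."""
    if pain not in ROUTES:
        raise ValueError(f"unknown pain point: {pain!r}")
    return ROUTES[pain]

print(route("bandwidth"))  # -> Gemini 3
```

The point of writing it down like this is that the table is data, not identity: when a new model wins a lane in your own tests, you update one entry instead of migrating everything.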
Fundamentally, there are two execution
lanes in modern knowledge work, right?
One is the business artifact lane:
spreadsheets, decks, executive briefs,
Office-shaped outputs. The other is
really around software execution, repo
changes, tool use, PRs, tests,
refactors. All of these players are
playing for both lanes. GPT 5.2 is
aggressively taking space in that first
lane of business artifact execution that
Claude Opus 4.5 was previously fairly
undisputed in, and it's become extra
useful because ChatGPT 5.2 can handle
those really large initial dumps of
context and still produce structured
business artifacts. GPT 5.2 of course is
also playing in the software execution
lane. It's playing there through the
Codex family. And Codex is designed
for especially complex code reviews.
It's designed for large complex code
dependency assessments. It's designed to
solve really difficult coding problems.
And it's designed to be really
intelligent about using a few general
tools really, really well. And so Codex
is OpenAI's answer to a general-purpose
agent that can operate against a
codebase and solve increasingly complex
problems. Opus 4.5 is increasingly
dominant in places where the strong
harness and the polish it's able to
bring from that harness and the tools it
calls enables the model to build
finished work with a narrower context
window. Look, Anthropic has always been
memory constrained. They are able to
work within the memory constraints in a
strong harness and deliver
extraordinarily polished work. My sense,
after talking to many developers, is that
Opus 4.5 is generally preferred
due to the ergonomics of
development, due to the harness it
operates in, due to the ability to
delegate and write out code very easily
across sub agents. And Opus 4.5 is also
very slightly ahead now on artifact
creation versus ChatGPT. That gap has
narrowed by about 95% since GPT 5.1 in
just a few weeks. And so I do want to
call out that even though Opus 4.5 is
still a little bit ahead, we don't know
how long that will last. Meanwhile,
Gemini 3 sits a bit orthogonally. It's
looking at the pain of having enormous
amounts of data and needing a broad
synthesis, but it's not necessarily
pushing into business artifact execution
as cleanly, except in the Google Docs
family. And it is not necessarily
pushing into software execution unless
you are in Google's agent development
kit or in Google's own new IDE,
anti-gravity. So think of it as Gemini 3
is something that pulls you into the
Google ecosystem and if you're in the
Google ecosystem, you are going to have
these lanes of execution and you'll find
that Gemini 3 is just right there and
that's part of how they frame it. So this
is not just about which model is best.
This is about which one you would
actually use for the kind of work you
really do. So again, simple wins. If I
am testing a new model, I never
assume these things stay true; I assume
any given model can win at any given
piece of this workflow. I always start
by picking a simple task in a lane where
success is obvious and I can measure it.
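Picking tasks where success is measurable implies keeping score. A minimal trial log might look like the sketch below; the file name, fields, and helper names are arbitrary choices for illustration, not anything the speaker prescribes.

```python
import csv
from datetime import date
from pathlib import Path

# Arbitrary log location and schema for this sketch.
LOG = Path("simple_wins_log.csv")
FIELDS = ["date", "task", "model", "win"]

def log_trial(task: str, model: str, win: bool, path: Path = LOG) -> None:
    """Append one trial to the CSV log; write the header row on first use."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"date": date.today().isoformat(),
                         "task": task, "model": model, "win": win})

def win_rate(model: str, path: Path = LOG) -> float:
    """Fraction of logged trials this model won (0.0 if none logged)."""
    if not path.exists():
        return 0.0
    with path.open(newline="") as f:
        rows = [r for r in csv.DictReader(f) if r["model"] == model]
    return sum(r["win"] == "True" for r in rows) / len(rows) if rows else 0.0
```

A spreadsheet or a notes file works just as well; the mechanism matters less than the habit of logging wins and failures instead of trusting vibes.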
And increasingly because these are
agentic tasks, I give it a full agentic
task with a document packet and I ask it
to produce an artifact. I just look to
test. If something works, I log it. If
it doesn't work, I log that. I don't get
attached. I don't pick sides. I don't
have big emotions about it. I don't look
for the smartest model. I just look for,
hey, what's going to be really useful in
PowerPoints? Hey, what's really useful
if I'm trying to spin up a quick repo
for a website? Hey, what's really cool
at building a small web app? Hey, what's
really helpful for Excel? You get the
idea. Look for those specifics and just
give your model regular tasks. Don't
assume that you have to do something
complicated to route everything to a new
model. Simple wins: pick a simple little
artifact and test it. I hope I've been
able to give you a sense of how I think
about how to pick between these models
and at the same time a fingertippy feel
for how I think about how the three
leading model makers' current models
stack up within that framework. Simple
wins. Until next time and until we get a
new model, which is probably like next