
Simple Wins: AI Model Adoption

Key Points

  • The “simple wins” framework advocates adopting new AI models by first proving they can reliably solve a small, repeatable, low‑risk task you perform daily, rather than relying on benchmark hype or one‑off prompts.
  • Traditional model evaluation (benchmark charts, dopamine‑triggered trials) often leads users to default back to familiar tools like ChatGPT, because those tests don’t reflect real‑world workflow impact.
  • Viewing models as a hierarchy of superior “rungs” is misleading; instead, treat each model as a distinct competence that must be matched with the right interface and integration layer to be effective.
  • By focusing on tangible, incremental wins, teams can avoid polarizing “model wars,” reduce artifact friction and review burden, and build a sustainable system that routes different work to the most appropriate model over time.

Full Transcript

# Simple Wins: AI Model Adoption

**Source:** [https://www.youtube.com/watch?v=ijdhIGRB_Kc](https://www.youtube.com/watch?v=ijdhIGRB_Kc)
**Duration:** 00:16:06

## Sections

- [00:00:00](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=0s) **Simple Wins Model Adoption** - The speaker advocates a pragmatic "simple wins" strategy for adopting new AI models: evaluating them on small, repeatable tasks that deliver obvious, low-risk value each day instead of relying on benchmark hype.
- [00:03:22](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=202s) **Choosing LLMs for Real Work** - The speaker explains that model selection should focus on which AI reliably handles specific business tasks, emphasizing three recurring pain points: information overload, the effort of formatting outputs, and navigating human ambiguity.
- [00:07:10](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=430s) **GPT-5.2 as Workflow Execution Engine** - The speaker describes GPT-5.2's ability to generate complete, professional artifacts (e.g., spreadsheets, presentations) that streamline complex analysis, while cautioning that its drive for coherent output can unintentionally hide contradictory or messy underlying data.
- [00:11:37](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=697s) **Dual Execution Lanes and AI Competition** - The speaker contrasts business-artifact versus code-centric work streams and compares how OpenAI's GPT-5.2 (with Codex) and Anthropic's Opus 4.5 compete in each lane.
- [00:14:42](https://www.youtube.com/watch?v=ijdhIGRB_Kc&t=882s) **Choosing Simple Tasks for Model Evaluation** - The speaker advises testing AI models with straightforward, measurable tasks, logging outcomes, and prioritizing practical usefulness over chasing the most advanced model.

## Full Transcript
Simple wins. I want to talk today about a detailed comparison between ChatGPT 5.2, Claude Opus 4.5, and Gemini 3. But instead of just giving you a baseline model comparison, I want to let you in on how I think about adopting new models into my workflow, because that is the hottest topic I could think of for 2026. We're all going to have a lot more new models; it's not just going to be these three. How do we think about adopting them in a way that's intelligent? And I'm going to come back to it: simple wins. It's the only model adoption strategy that doesn't rot. I'm going to explain how it works, and you're going to be able to learn it and use it for your workflows too. It's not going to take very long.

The way most people evaluate a new model is by reading a benchmark chart, trying a clever prompt, feeling a dopamine hit or not, and then slowly drifting back to whatever tool they default to. That's why so many people end up in ChatGPT. It's not because the new model isn't good; it's because the evaluation isn't real. The only evaluation that matters is whether a model can deliver a simple, tangible win that you would use every day. I'm talking about a small, repeatable piece of work that you actually do all the time, where success is obvious, the downside is contained, and the output lands in spaces your org already runs on. So simple wins is not just a cute productivity slogan; I'm not putting it on a t-shirt. It's a discipline. It prevents you from turning model choice into the Mac-versus-Windows wars, into an identity. You need to not think that way to survive in the AI future.
Instead, simple wins forces you to confront real bottlenecks at work, like artifact friction (when artifacts are too complicated to make or review) and review burden. It gives you a path to compound the adoption of models over time without pretending you're doing lots of complicated work at any given moment to test out a model. The deeper point is that models should not be viewed as a single ladder of intelligence where every new release is a new rung you have to reach and migrate everything to. Instead, think of them as different shapes of competence that live inside different kinds of surfaces. The model matters, but the interface and the harness matter almost as much, if not more. If you ignore that, you're going to keep looking for the best model, and you're going to feel like the AI is unreliable and everything is changing. If you lean into the idea of simple wins, you're going to end up with a sane system for routing work to different models.

But let's make that more specific. What's changing right now? A lot of people are asking themselves whether they should keep evaluating AI as a chatbot, whether the core interaction pattern should still be prompt, response, tweak. That's no longer the main place for serious work. The big shift with the current generation of models is that you increasingly need to hand the model a real work packet, an assignment with a deliverable, and expect it to stay coherent long enough to produce something you could ship directly after a quick review. That is explicitly what OpenAI framed ChatGPT 5.2 to do. But it's not just OpenAI. Anthropic is thinking about that with Opus. Gemini is thinking about that too.
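The "real work packet" idea above can be sketched in code. This is my own illustration of the concept, not any vendor's API: a self-contained assignment with inputs, a deliverable format, and a review checklist, flattened into a single prompt. All field and function names here are hypothetical.

```python
# Hypothetical sketch of a "work packet": a full assignment with a
# deliverable, rather than a one-off prompt. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class WorkPacket:
    brief: str                 # what to produce, in one or two sentences
    inputs: list[str]          # docs, notes, transcripts to work from
    deliverable: str           # the artifact format the org runs on
    review_checklist: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the packet into a single prompt for any model."""
        lines = [f"Assignment: {self.brief}",
                 f"Deliverable: {self.deliverable}",
                 "Inputs:"]
        lines += [f"- {doc}" for doc in self.inputs]
        if self.review_checklist:
            lines.append("Before finishing, check:")
            lines += [f"- {item}" for item in self.review_checklist]
        return "\n".join(lines)

packet = WorkPacket(
    brief="Summarize Q3 churn drivers for the board meeting.",
    inputs=["q3_metrics.csv", "support_tickets.txt"],
    deliverable="one-page structured doc",
    review_checklist=["numbers add up", "contradictions flagged, not hidden"],
)
prompt = packet.render()
```

The point of the structure is the speaker's bar for shipping: success is obvious, the downside is contained, and the output lands in a format the org already runs on.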
Once you start operating that way, "which model is the smartest" becomes the wrong question. The useful question becomes: which model, plus its surface, reliably completes a particular kind of work without a lot of downstream pain? That's where the differences between ChatGPT 5.2, Gemini 3, and Claude Opus 4.5 really pop out and become very practical if you look at them through the lens of real business work.

Now, I know most knowledge work comes across as complicated, but my observation is that it collapses into a few recurring pain points that are relevant for this kind of assessment. The first pain is bandwidth: there's just too much to read, too many inputs, not enough time to build the mental model. It's one of those things where you have a doc pack you need to read to walk into the board meeting and not look confused, but you just don't have time on the plane to do it. The second pain is execution on artifacts: work that has to end up in Excel, a deck, or a structured doc. The burden is not just having the idea or a correct understanding; it's that we have to make it all add up, make the deck, and package it in the format the business runs against, or else the work is not done. The third pain is human ambiguity: the messy, political, contradictory reality of the organization, where tone matters, incentives matter, who got promoted last matters, and false coherence can be much more dangerous than admitting uncertainty. If you can figure out which pain matters most, it's going to help you figure out which model you need. Let me give you some examples from current leading models.
Think of Gemini 3 as a bandwidth engine. Gemini 3's superpower, when it's working well, is that it can ingest an absolutely absurd amount of material and give you a clean overall map. Google is really explicit about Gemini 3's massive context window. The practical effect of that million tokens is not that the model is magically smarter; it just means it loses the thread less often when the input is really huge and messy, and it can dig into a big synthesis without collapsing into shallow summarization. So the simple win for Gemini 3 is not "write my strategy memo." The simple win is "turn this mountain of stuff into some kind of map so I can make sense of it." Feed it those long docs, those notes, those screenshots, the meeting transcript, and ask for an outline that makes the problem space legible. What's being claimed? What contradicts what? What's missing? What should I ask next? Gemini is often really good at this kind of compression when the alternative is hours and hours of reading.

Where Gemini tends to create pain is downstream. The business world is still deeply Microsoft Office shaped, and there's often a conversion tax when you need to take a great synthesis and turn it into a spreadsheet, a deck, or a document in the exact structure your org expects. The model can be brilliant and still lose you time because of the workflow and its friction. So I don't treat Gemini as the model for everything, but I do reach for it when the constraint is input volume and I want clarity. It's a good bandwidth engine.

Think of ChatGPT 5.2 as an artifact execution engine. ChatGPT 5.2's fingerprint is really different from 5.1.
The wow is primarily not that it can read more; it's that it can stay organized through longer assignments and return business-shaped deliverables like docs, tables, or decks coherently, without falling apart. OpenAI's own framing emphasizes professional tasks. This is what they built it for: tool use, making artifacts like spreadsheets and presentations. The simple win for GPT 5.2 is: give it a real artifact. Give it a clean, tight brief and get back something that looks like a junior analyst did all the work. It's not necessarily a perfect answer, but it's a great work product that will save you hours and hours of time, especially against long and complex analysis problems. When GPT 5.2 is on, it just goes. It feels like an execution engine. It maps, it checks, it computes, it synthesizes. It's incredibly reliable at following instructions, all the way to the end work product. It also benefits from the practical reality that ChatGPT's file pipeline is built like a "hand it the artifacts" workflow: large file support, better tolerance for mixed inputs in a single thread. That might sound like boring product detail, but it's the difference between AI as a toy and AI as part of my operational workflow. It's a big deal.

ChatGPT 5.2's failure mode, in my experience, is not stupidity. This is a really smart model. The danger is premature coherence. The model really wants to make everything line up, and if your underlying reality is too messy or contradictory, it may enforce a sanity-checked, coherent reality that's very convincing but cleaner than the truth.
The model's power ironically makes this risk worse, not better, because it can produce a really beautiful wrong answer if your underlying reality is incoherent. So you need to treat it like a junior operator and give it really clear structure. Understand the contradictory nature of your inputs (maybe the contradictions are there, maybe they aren't), and then understand what you're going to get by asking the model to step into that kind of problem space. But net-net, I use GPT 5.2 all the time. It's a great daily driver for me, and it does that hard workflow stuff really well.

What about Claude Opus 4.5? Think of it as a persuasion layer and an absolute agentic and harness coding monster. Opus 4.5 is where you need to think about writerly taste, about sounding like a human, about how it combines hybrid reasoning, good style, a large context window, and an ability to synthesize all of that into text that is meaningful and useful as-is for persuasive business writing. Agentic ability is not a pure model property; it's a property of the system as a whole. What I'm calling out here is that part of how Claude Opus 4.5 writes well, and part of how it codes well, is the harness Anthropic has put around the system. The tool calling, the skills ability, the harness and guardrails let it operate inside a loop with good feedback and safe edit primitives. Anthropic has been able to get to a phenomenal level of work quality as a result.
And so a lot of engineers end up preferring Claude Opus 4.5 as they code, because they get those tight feedback loops, because it works with tools they can understand and call, and because the harness is really easy to work with and manipulate. You can obviously put in your own markdown files if you're in Claude Code. And because the system is designed to relentlessly follow instructions and build stuff, you have to provide the design and structure; it's going to build. I find that's true with creating artifacts as well. I don't get the same context window advantages I have with ChatGPT 5.2 or with Gemini. If it's a truly huge piece of work, it's not going to fit with Claude Opus, and we just need to be honest about that. But if I need to craft a really beautiful, persuasive business artifact, whether that's a deck, a doc, or even a spreadsheet, the most polished outputs today come from giving Claude a useful slice of context, a clear set of instructions, and then room to work and cook. Claude does a great job using its tools to go to town and produce beautiful artifacts over time. That agentic harness I talk about for coding works for non-coding as well.

Fundamentally, there are two execution lanes in modern knowledge work. One is the business artifact lane: spreadsheets, decks, executive briefs, Office-shaped outputs. The other is software execution: repo changes, tool use, PRs, tests, refactors. All of these players are playing for both lanes.
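The "sane system for routing work" the speaker describes can be sketched as a small routing table. This is illustrative only, not the speaker's actual system: the lane labels and the model-to-lane assignments simply restate what the talk associates with each model, and the function itself is hypothetical.

```python
# Illustrative routing sketch: send each kind of work to the model the
# talk associates with that lane. The mapping restates the video's
# claims; the code structure is my own assumption.

ROUTES = {
    "bandwidth": "Gemini 3",                  # huge inputs -> synthesis/map
    "business_artifact": "GPT-5.2",           # spreadsheets, decks, docs
    "software_execution": "Codex",            # repo changes, PRs, tests
    "persuasive_writing": "Claude Opus 4.5",  # polished, human-sounding prose
}

def route(task_kind: str) -> str:
    """Return the model to try first for a given kind of work."""
    try:
        return ROUTES[task_kind]
    except KeyError:
        raise ValueError(f"no route for task kind: {task_kind!r}")

print(route("business_artifact"))  # -> GPT-5.2
```

Note that the speaker explicitly expects this table to go stale: any model can win any lane as new releases land, which is why the routing should be driven by logged wins rather than fixed loyalty.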
GPT 5.2 is aggressively taking space in that first lane of business artifact execution, where Claude Opus 4.5 was previously fairly undisputed, and it's become extra useful because ChatGPT 5.2 can handle those really large initial dumps of context and still produce structured business artifacts. GPT 5.2, of course, is also playing in the software execution lane through the Codex family. Codex is designed for especially complex code reviews, for large, complex code dependency assessments, and for really difficult coding problems, and it's designed to be really intelligent about using a few general tools very well. Codex is OpenAI's answer to a general-purpose agent that can operate against a codebase and solve increasingly complex problems.

Opus 4.5 is increasingly dominant in places where the strong harness, the polish it brings from that harness, and the tools it calls enable the model to build finished work with a narrower context window. Look, Anthropic has always been memory constrained. They are able to work within those memory constraints in a strong harness and deliver extraordinarily polished work. My sense, after talking to many developers, is that Opus 4.5 is generally preferred by most developers due to the ergonomics of development, the harness it operates in, and the ability to delegate and write out code very easily across sub-agents. And Opus 4.5 is also very slightly ahead now on artifact creation versus ChatGPT. That gap has narrowed by about 95% since GPT 5.1, in just a few weeks. So I do want to call out that even though Opus 4.5 is still a little bit ahead, we don't know how long that will last.
Meanwhile, Gemini 3 sits a bit orthogonally. It's looking at the pain of having enormous amounts of data and needing a broad synthesis, but it's not necessarily pushing into business artifact execution as cleanly, except in the Google Docs family. And it's not necessarily pushing into software execution unless you're in Google's Agent Development Kit or in Google's own new IDE, Antigravity. So think of Gemini 3 as something that pulls you into the Google ecosystem; if you're in the Google ecosystem, you'll have these lanes of execution and you'll find that Gemini 3 is right there, and that's part of how Google frames it. So this is not just about which model is best. This is about which one you would actually use for the kind of work you really do.

So again: simple wins. If I'm testing a new model, I never assume these things stay true; I assume any given model can win at any given piece of the workflow. I always start by picking a simple task in a lane where success is obvious and I can measure it. And increasingly, because these are agentic tasks, I give it a full agentic task with a document packet and ask it to produce an artifact. I just look to test. If something works, I log it. If it doesn't work, I log that. I don't get attached. I don't pick sides. I don't have big emotions about it. I don't look for the smartest model. I just look for: hey, what's going to be really useful in PowerPoint? What's really useful if I'm trying to spin up a quick repo for a website? What's really cool at building a small web app? What's really helpful for Excel? You get the idea. Look for those specifics and just give your model regular tasks.
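The logging discipline described above (if it works, log it; if it doesn't, log that) can be sketched as a minimal win log. Everything here is a hypothetical illustration, assuming only the talk's rule: record whether each output was shippable after a quick review, and let the tally, not emotions or benchmarks, decide what you reach for next.

```python
# Minimal "simple wins" log: record pass/fail per (model, task) and
# compute a win rate per model. Names and tasks are illustrative.
from dataclasses import dataclass, field

@dataclass
class WinLog:
    results: list = field(default_factory=list)  # (model, task, shipped)

    def record(self, model: str, task: str, shipped: bool) -> None:
        # Log the outcome either way; don't get attached, don't pick sides.
        self.results.append((model, task, shipped))

    def win_rate(self, model: str) -> float:
        runs = [r for r in self.results if r[0] == model]
        if not runs:
            return 0.0
        return sum(1 for _, _, shipped in runs if shipped) / len(runs)

log = WinLog()
log.record("GPT-5.2", "quarterly variance spreadsheet", True)
log.record("GPT-5.2", "board-deck draft", True)
log.record("Gemini 3", "board-deck draft", False)
print(log.win_rate("GPT-5.2"))  # -> 1.0
```

Over time, a log like this is what lets the routing evolve as new models ship, without rerunning the whole evaluation from scratch or re-litigating model loyalty.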
Don't assume you have to do something complicated to route everything to a new model. Simple wins: pick a simple little artifact and test it. I hope I've given you a sense of how I think about picking between these models, and at the same time a fingertippy feel for how the three leading model makers' current models stack up within that framework. Simple wins. Until next time, and until we get a new model, which is probably like next