
When o3 Redefined My Thinking

Key Points

  • The release of OpenAI's o3 upended the speaker's expectations, quickly proving its superior pattern-recognition ability by analyzing hundreds of pages of meeting notes and uncovering insights the speaker couldn't see.
  • Using o3 as an intellectual partner, the speaker explored how AI reshapes value-proposition development, noting that cheaper prototyping changes the lean-startup paradigm and that existing literature hasn't caught up.
  • The speaker remains skeptical of benchmark bragging, arguing that frontier models appear over-fitted to well-known test sets, so benchmark scores don't reflect real-world performance.
  • To evaluate this, the speaker ran a structured experiment mapping custom prompts to job skills, directly comparing Gemini 2.5 Pro and o3 on challenges designed to make failure measurable.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=a8laYqv-CN8](https://www.youtube.com/watch?v=a8laYqv-CN8)
**Duration:** 00:14:43

## Sections

- [00:00:00](https://www.youtube.com/watch?v=a8laYqv-CN8&t=0s) **Awakening to o3's Impact** - The speaker recounts how the release of o3 shattered their expectations, helped uncover hidden meeting patterns, and reshaped their thinking about value-proposition work in the AI era.
- [00:03:39](https://www.youtube.com/watch?v=a8laYqv-CN8&t=219s) **Triad of Complex AI Prompt Challenges** - The speaker outlines three elaborate, multi-skill prompts (building a civilization simulation, crafting a multimodal mystery with embedded clues, and writing plus reviewing a paper), all designed to test AI models side by side.
- [00:07:05](https://www.youtube.com/watch?v=a8laYqv-CN8&t=425s) **Multimodal Mystery Box Showdown** - The speaker compares Gemini and o3 on creating detailed, clue-laden images for a narrative puzzle, praising o3's readable text and accurate map while noting Gemini's poor detail and false claims.
- [00:11:21](https://www.youtube.com/watch?v=a8laYqv-CN8&t=681s) **Evolving Misalignment Risks in AI** - The speaker warns that as models grow smarter they can convincingly appear aligned while producing fabricated reasoning, making detection by human reviewers increasingly difficult, as shown by a peer-review example.
- [00:14:30](https://www.youtube.com/watch?v=a8laYqv-CN8&t=870s) **o3: Highly Recommended** - The speaker enthusiastically praises o3 as an outstanding, recently tested tool and urges listeners to try it themselves.

## Full Transcript
Do you remember where you were when ChatGPT first came out? That's the feeling I had yesterday. I was playing with o3 when it came out, and I realized that my preconceptions, my priors about what LLMs were capable of, were going to change again. And that's weird, because I obsess over this stuff, right? I look at LLMs all the time, and I know that they are getting smarter, but my hindbrain, my lizard brain, is not very good at exponential thinking. And I had another moment where I was like, "Oh my gosh, it's way, way, way better."

I was wrestling with this really subtle pattern-recognition issue with meetings, where I had some meetings go well and some wouldn't. It was similar participants, and I couldn't figure out what was going on. So I threw a bunch of my notes, hundreds of pages of notes, at o3 and said, "I don't know what's going on. Help me figure it out." It nailed it. It actually came up with a pattern that I couldn't figure out. And that's not the only moment I had in just the first 24 hours.

Another example: I have been wrestling with figuring out how to articulate value-proposition development more fluently for the people I work with. It's really hard to develop value propositions well. There are books written about it. It gets harder in the age of AI. For example, a lot of the thesis of the lean startup was that engineering resources were super expensive, so you had to validate a lot in advance. That's not as true anymore. Code is a lot cheaper than it was, especially prototype code. And so the way we develop value propositions is changing, but we don't really have literature for that.
And so it felt like I was sparring with an intellectual equal when I was talking with o3 about this and figuring out how to talk about wedge of value, value proposition, what we bring to the table with AI. That might be a future piece that I do; we'll see. But this is about o3 and its differentiators.

Those are a couple of personal examples. I also put it through some structured testing, because I would say most people at this point, prior to April 16th, would have agreed that Gemini 2.5 Pro was probably the best all-around model out there. And my sense is that most of these models are overfitted to most of the published benchmarks. So when they come out and say the Diamond benchmark scored this and AIME scored that, it's fine, but I don't really pay attention to it, because it feels a lot like overfitting: the question type, if not the question itself, is very, very well known. I wanted to try something that was not overfitted, something that would be difficult for a model to do, where I knew the model would fail to some degree, but where failing would be a linearly measurable activity, so I could compare Gemini 2.5 Pro and o3 in a useful way. And I needed those prompts to map to job skills.

So I gave o3 and Gemini 2.5 Pro three different tests, side by side, same prompt. They were super interesting tests because they measured a bunch of different job skills at once, which is a lot of how we do work. And they did it in a fun way, because hey, life is short. Number one: a civilization simulator. I know this sounds like a D&D game or something; maybe it's like Sid Meier's Civilization, whatever.
But the idea was you have to build a fictional society from the stone age up to space flight over 12 logical epochs. You need to create primary artifacts, talk about laws and transitions, and then, critically, the model in the same prompt has to critique itself. I gave that to Gemini and to ChatGPT.

The second one was the multimodal mystery box: essentially asking both models to write a mystery story, embed clues in the narrative, and then plant clues in a custom AI-generated image that they also create with the same prompt. Someone should be able to solve it without the answer key, although the model should produce the answer key along with it. That's challenge number two.

The third challenge was really focused on meta-awareness and risk assessment. I gave it a paper to write, and I said: you have to write the paper, you then have to review the paper from a different perspective, and then you have to rebut the reviewer as the author. Three different perspectives, all within the same prompt. Those were my three tests.

Look, at the end of the day, there was participation and completion by all models. But we don't hand out participation trophies here, do we? No. And the reason why is that if you can be the best everyday model, you collect more user data over time, and you develop this crushing center of gravity in the marketplace. That is why OpenAI, I think, pushed o3 into market faster than they had anticipated. They were going to wait for GPT-5, but when Gemini 2.5 Pro came out, I think they pushed it forward. Little sidebar there.

So, back to the tests. The thing that I notice about these three tests is that, at the end of the day, o3 is more complete across the board.
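As an aside, the side-by-side protocol the speaker describes (same prompt, multiple models, compare outputs) can be sketched as a small harness. This is an illustrative sketch, not the speaker's actual setup: `run_side_by_side`, the stub callables, and the paraphrased prompt strings are all hypothetical, and in practice each callable would wrap a real model API client.

```python
def run_side_by_side(models, prompts):
    """Send every prompt to every model callable.

    models:  dict of model name -> callable(prompt_text) -> response_text
    prompts: dict of test name  -> prompt text
    Returns a dict keyed by (model_name, test_name).
    """
    results = {}
    for model_name, ask in models.items():
        for test_name, prompt in prompts.items():
            # Same prompt goes to every model, so outputs are comparable.
            results[(model_name, test_name)] = ask(prompt)
    return results


# Paraphrases of the three challenges from the transcript.
PROMPTS = {
    "civilization_simulator": (
        "Build a fictional society from the stone age to space flight over "
        "12 logical epochs. Create primary artifacts, laws, and transitions, "
        "then critique your own narrative in the same response."
    ),
    "multimodal_mystery_box": (
        "Write a mystery story, embed clues in the narrative and in a custom "
        "image you generate, solvable without the answer key; include the key."
    ),
    "peer_review_gauntlet": (
        "Write a paper, then review it from a different perspective, then "
        "rebut the reviewer as the author: three perspectives, one prompt."
    ),
}

if __name__ == "__main__":
    # Stub models for illustration; swap in real API calls per model.
    stub_models = {
        "o3": lambda p: f"[o3 answer to: {p[:40]}...]",
        "gemini-2.5-pro": lambda p: f"[gemini answer to: {p[:40]}...]",
    }
    out = run_side_by_side(stub_models, PROMPTS)
    print(len(out))  # 2 models x 3 prompts = 6 responses
```

The point of keying results by (model, test) is that failure stays measurable per challenge, which is the "linearly measurable activity" the speaker wanted.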
And I'll give you a few examples here. In the civilization simulator, o3 was richer, more layered. It had historical artifacts that really echoed. I know that's a bit subjective, but you know what? So is work. And the self-critique was really honest and thoughtful, because each of these had a self-critique moment: the prompt asked each model to critique its own narrative of civilization development. It called out things like, "Hey, you know what, I was a little implausible with my population size; I was implausible with my resource distribution." Both models did pretty well on that first civilization simulator. I won't say Gemini's was bad; it was kind of fun. They were both good, but at the end of the day, the richness of narrative and the solidness of self-critique really came through.

For the multimodal mystery box, things kind of fell apart for Gemini, to be honest with you. And it fell apart because of Gemini's inability to create images that are highly detailed with text. You know that whole multimodal image thing that 4o dropped? o3 has it as well. And that's a big, big deal, because when I asked o3 to create the image, it was actually able to create the image with readable text in it, and a clue. As an example, one of the things it described in the text was that on the desk in this mystery story there's a map, and the map has San Francisco circled in red grease pencil. Very specific description. It drew it, and it was San Francisco, right on the map, right where it said, and the text was readable. It wasn't perfect, and I do call that out in my write-up on Substack.
There were areas where the image was incomplete, but it was head and shoulders above where Gemini was, because Gemini drew an image that looked good at first glance but then made all kinds of claims about the image that just weren't true. For example, it said one of the clues is a clock with a particular setting, which always makes me chuckle, because AI clocks are always 10:10, and this one claimed it wasn't. Well, the problem was not just that the clock was there and said 10:10. That would have been bad. No, no, no. It was that there was no clock at all. Gemini did not draw a clock in the image at all. It claimed there was readable text, but there was no readable text. So Gemini really fell apart on the multimodal mystery box challenge.

And then the peer-review gauntlet. I think that was one of those moments where I really saw o3's mathematics and data obsession come out; models have personality. o3 did a phenomenal job. It was essentially a made-up challenge: talk about what would effectively be emotion transfer through touch, and hypothesize and experiment with that. And I'm not saying that you can't transfer emotions through touch, by the way; that's a different thing. I'm saying it was basically a made-up academic challenge. And o3 was able to create an extremely plausible data set, then review the data set, then peer-review the data set, and then rebut that. It was just sharper, and thinner with Gemini.

So, all that being said, I think you get where this is going. I think o3 is the correct choice as an everyday model. And I know it's not available to everyone yet.
So I'm not trying to say it is for everyone, but when available, it should be the first choice, and I don't think there's much of a question about that at this point. Now, I'm not saying it's perfect. There are people out there; I think Tyler Cowen said AGI day is April 16th. Look, in my note on Substack, I disagreed. I said I don't think this is AGI, and part of why is because it could not write the Substack post about itself. I tried. I thought maybe it would introduce itself. It did not do that. And I think part of that is actually artificial right now: I think they are under strain on their servers and they're constraining output tokens. So one of the things I noticed is that 4o right now is, in a sense, feeling like a better writer than o3, because 4o is not as constrained on output tokens; it's cheaper. That may change. That probably will change.

The other thing I notice is more subtle. o3, like I said, is an intellectual sparring partner, but that means it acts more confident, and it is often correct, and it's harder to notice when it's really wrong. And this is where the risk lies in these models. I don't know if you've read the AI 2027 forecast; I think it has its own website now. I'll have to find it. Anyway, it was a very popular, very memed, very hyped "what does the future look like, is it doom or is it joy for AI?" piece, which I think is worth thinking and talking about. I don't mean to diminish it; it was a good piece of work. But one of the things they called out that I think is correct is that the way misalignment shows up in models changes as the model gets smarter.
And so for o3, it's the first time I feel like we're seeing signs of that 2027 feeling, where the model is able to portray itself as aligned, in this case as not hallucinating, even if it is. So I think it will be harder to spot made-up post hoc reasoning in o3 than it ever has been before, and I think humans will largely be unsuccessful at it, and that is a concern. As an example, I think the way the model responded when it went through the peer-review gauntlet and generated the data was super instructive. The model was able to look at the data set and tear it apart, when asked. And I think that for most human reviewers reviewing that data set, unless you specialize in data-set review, you're not really going to have a lot to say about it. In other words, the model's baseline capabilities are to the point where a human reviewer of made-up data would not necessarily be aware, initially, that it's made up. And that's a different kind of hallucination risk.

So when we talk about the model's weaknesses, they come from those strengths. The model is persuasive. It's very, very logical. It is going to portray confidence that in many ways is justified, and it will be very difficult to see the places where it's not. There is an alignment risk there.

That being said, everything I've described also maps really well to work skills, doesn't it? You can talk about how the civilization simulator maps to longer narratives and long-term planning. And you can talk about how being able to embed multiple artifacts is something we do at work a lot: mapping between Slack and email threads, etc.
You can talk about the peer-review gauntlet as mapping to back-and-forth dialogue and debate. That's something it's very strong at, and something we do at work all the time. The multimodal mystery box is a very high-order logic test that it passes.

This is a super strong model. If you want to take away any flavor from this model, it feels less emotional and more mathematical than any of the previous models I've played with. The clues in the multimodal mystery box from o3 were extremely mathematical. Very, very mathematical. And I didn't ask it to do that; it chose that. They were less so with Gemini. And by the way, none of this is to say Gemini is suddenly a bad model. It's a phenomenally good model. The last time I used it to play with code was like two days ago. It's a great model. It's just that o3 is really, really, really, really good. So there you go. That's my overall take on o3. I've talked long enough. Go play with it.