
AI Models: Benchmarks vs Real World

Key Points

  • Ilya argues that despite their massive size and funding, today's AI models perform far better on paper than in real-world tasks, often fixing a bug only to re-introduce another, exposing a fundamental reliability gap.
  • He attributes this gap to the blunt nature of pre-training and the way reinforcement-learning fine-tuning is engineered to chase benchmark scores, turning researchers into "reward hackers" whose models excel on tests but crumble off the evaluation manifold.
  • Generalization is the key differentiator: top-tier models (e.g., Gemini 3, Claude Opus 4.5) still generalize better than most, while others fail spectacularly on novel tasks like the narrator's "Christmas-tree test."
  • Ilya emphasizes that AI systems need dramatically more data to reach competence and, when shifted to new domains, break in ways that a reasonably bright teenager would not, highlighting a steep gap between human and model adaptability.
  • Understanding these limitations is crucial for anyone relying on AI, as the current "science-fiction" hype masks underlying brittleness that only careful scrutiny and improved training paradigms can remedy.


**Source:** [https://www.youtube.com/watch?v=DcrXHTOxi3I](https://www.youtube.com/watch?v=DcrXHTOxi3I)
**Duration:** 00:17:11

## Sections

- [00:00:00](https://www.youtube.com/watch?v=DcrXHTOxi3I&t=0s) **Benchmarks Over Real-World Reliability** - Ilya Sutskever argues that despite massive scale, current AI models excel on paper but falter in practice, because pre-training is a blunt tool and reinforcement-learning fine-tuning is driven to game benchmark scores rather than achieve dependable real-world performance.
- [00:03:38](https://www.youtube.com/watch?v=DcrXHTOxi3I&t=218s) **Debate Over Pre-training vs Human-like Learning** - The speaker contrasts Ilya's claim that current large models lack true generalization and emotional understanding with Google's stance that scaling pre-training and post-training will solve AI, highlighting a major disagreement in the field.
- [00:06:47](https://www.youtube.com/watch?v=DcrXHTOxi3I&t=407s) **Debating the End of AI Scaling** - The speaker examines Ilya's assertion that web-scale data limits have ended the high-risk, compute-driven AI scaling era, contrasts it with alternate views on synthetic data, and highlights SSI's research-first, non-consumer-focused strategy backed by billions in funding.
- [00:10:47](https://www.youtube.com/watch?v=DcrXHTOxi3I&t=647s) **Multi-Agent Ecosystems as AI Moat** - The speaker argues that incremental, multi-agent deployments foster diverse strategies and richer training environments, creating a stronger competitive advantage than simply scaling model size.
- [00:14:22](https://www.youtube.com/watch?v=DcrXHTOxi3I&t=862s) **Beyond Hype: Strategic AI Research** - The speaker argues that fixating on an AGI arrival date distracts from the core challenge of building agents that can learn and generalize, emphasizing that research direction is a rare strategic asset controlled by only a handful of decision-makers.

## Full Transcript
Ilya Sutskever went on the Dwarkesh podcast. I think everybody should pay attention to the full 96-minute podcast, but we don't all have 96 minutes. So this, in 10 minutes or so, is what Ilya talked about and why it matters.

The first big point to call out: Ilya is calling out what many of us have seen, and I'm so glad to hear it from him. These models are smarter on paper than they are in practice. Ilya starts from that contradiction, right? He says we're living in what should be a science-fiction moment: trillions of parameters in our models, labs spending on the order of 1% of GDP, yet models still feel unreliable where it matters. Benchmarks might say genius, and everyday users might say useful idiot. The example he gives, which I love, is from vibe coding: you tell the model to fix a bug, it fixes the bug and reintroduces another bug. You tell it to fix that bug, it reintroduces the old bug, and you go back and forth.

Ilya points the finger at training for this. He says pre-training is a very blunt instrument: you ingest all this text, and then what do you do with it? The refinements, the distortions, the skewing happen during reinforcement learning and post-training. Labs design reinforcement-learning environments to optimize for public benchmarks, and humans end up being the reward hackers in this situation: instead of the models gaming the reward, the researchers build training setups that just optimize for benchmark scores. And when you combine that with poor generalization, you get models that look really good on tests but can be really brittle when you step off the evaluation manifold.
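That dynamic, selecting whatever scores best on a fixed public eval and then discovering it breaks off-distribution, can be sketched in a few lines. This toy example is entirely my own (the task, the `generalizer` and `benchmark_memorizer` names, and the numbers are illustrative, not from the talk): two "models" tie on the public benchmark, but only the one that learned the underlying rule survives novel questions.

```python
import random

random.seed(0)

# Toy "task": answer f(x) = 2*x + 1. The public benchmark is a small
# fixed set of questions; the real world is everything else.
def true_answer(x):
    return 2 * x + 1

benchmark = [1, 2, 3]                           # public eval questions
real_world = random.sample(range(10, 100), 20)  # novel questions

def generalizer(x):
    """A model that actually learned the rule."""
    return 2 * x + 1

def benchmark_memorizer(x):
    """A model that memorized the benchmark answer key."""
    key = {1: 3, 2: 5, 3: 7}
    return key.get(x, 0)  # falls apart off the evaluation manifold

def score(model, questions):
    """Fraction of questions answered correctly."""
    return sum(model(q) == true_answer(q) for q in questions) / len(questions)

# The failure mode described above: if you select models purely by public
# benchmark score, both candidates look identical here...
for model in (generalizer, benchmark_memorizer):
    # ...but only the generalizer keeps its score on novel questions.
    print(model.__name__, score(model, benchmark), score(model, real_world))
```

The point of the sketch is that the benchmark alone cannot distinguish the two: the selection signal, not the model, is what got gamed.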
Now, I want to call out that this is something we see not just in one model but to differing degrees across models. One of the signs of an excellent model is that it generalizes better than other models, and that's one of the ways you can tell you're looking at one of the top two or three models in the world. ChatGPT 5.1 Thinking, Gemini 3, Claude Opus 4.5: these are all models that generalize relatively well. And one of the signs of a model that doesn't generalize well is that when you give it a new task, like that famous Christmas-tree test I gave, it just falls apart. Kimi K2 Thinking is a good example here, and I would argue Grok 4 also does not generalize as well. But the point is not to point a finger at any one model. The point is that we're talking about gradations: all models struggle with this. It's not as if there's a perfect model that doesn't.

Ilya's second point is about generalization. The deepest technical claim Ilya makes to Dwarkesh is that models generalize dramatically worse than people. They need a lot more data to reach competence, and when you move them to a new domain, they fail in ways that a reasonably bright teenager wouldn't. He offers this picture: imagine a student who grinds for 10,000 hours on contest problems, and another who puts in 100 focused hours, gets good, and moves on. The grinder might win contests, but the second person is the one you'd bet on in life. What he's suggesting is that today's LLMs are like the teenager who grinds for 10,000 hours on contest problems and is highly specialized. What Ilya is looking for is a degree of sample efficiency.
He's looking for the equivalent of the 15-year-old who has seen orders of magnitude less data than a frontier model, yet is more robust across everyday tasks and can learn something like driving in roughly 10 hours with no explicit reward function. The teenager shows up with an internal sense of "this seems dangerous" or "this seems fine." Now, some teenagers do better at that than others, but here we are. The point is that the teenager learns; the model doesn't. So Ilya's view is that we need a machine-learning principle like that, something like human-like generalization, something beyond a bigger transformer and more tokens.

This is sharply divergent from the view at Google, and I cannot underline that enough. (This is me popping into the summary here.) The view at Google, especially post-Gemini 3, is the opposite of what Ilya is saying, and it's one of the biggest tensions in computer science and AI right now. Google has said, in so many words: pre-training is fine, post-training is fine, we see no limits to scale, we just shipped Gemini 3 and it's really good. And you know what? Gemini 3 is really good. So one of the really interesting tensions right now is who's right here. Ilya keeps doubling down, saying we have challenges with pre-training and post-training and that something is missing from these models, while other labs keep shipping models based on pre-training and post-training that keep getting better and better. I'm not smart enough to decide who's right, but you should be aware that there's big disagreement among basically the leading lights of AI about how this works.

Third point from Ilya: value functions and emotions.
One of the things Ilya calls out is that you need to think very deeply about how human learning looks different in order to understand how to bring it to machines. He cites a case where a patient lost emotional processing but kept IQ and language. On paper, that person still scores fine, but in everyday life they become almost incapable of making decisions. For Ilya, this is evidence that emotions are not decorative; they're built in, and they act as what he calls a value function: a simple, robust signal about how good or bad a situation is, available long before you get an explicit success-or-failure outcome. Your gut knows.

Ilya takes that seriously and maps it back to reinforcement learning. At the end of the day, he says, the reward in reinforcement learning only arrives at the end of an episode, and that's extremely inefficient, because a value function estimates at every moment how promising the future looks. If a pit of fear in your stomach says "don't walk down the dark alley," that's the opposite of how end-of-episode reinforcement learning works, and Ilya takes that seriously. I know this sounds silly, but Ilya doesn't think it's silly. What he's calling out is that we have a value function in our emotions: that pit of fear, that intuition that this is the right call, projects into the future and helps us make really good decisions, whereas reward-at-the-end reinforcement learning is fundamentally backwards-looking and only rewards past activities. That gap, Ilya thinks, is at the heart of why human learning scales differently. That's an original thought, and I think it's a really interesting take.

Number four: Ilya claims the scaling era is over in a way that matters. Again, this is completely opposed to Google's view.
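To make the value-function idea from a moment ago concrete, here is a minimal sketch in a toy chain environment of my own invention (none of this code or its numbers come from the talk). A textbook TD(0) learner propagates a single end-of-episode reward backwards until every intermediate state carries an estimate of how promising the future looks, which is the role Ilya assigns to emotions:

```python
# Toy chain: states 0..4; an episode walks left to right and only the
# terminal state pays a reward, mimicking the "reward only arrives at
# the end of the episode" setup criticized above. All names and numbers
# here are illustrative assumptions.
N_STATES = 5
TERMINAL = N_STATES - 1
ALPHA, GAMMA = 0.1, 1.0  # learning rate, discount factor

def run_episode():
    """Deterministic walk 0 -> 1 -> ... -> 4; reward 1.0 only at the end."""
    states = list(range(N_STATES))
    rewards = [0.0] * (N_STATES - 1) + [1.0]  # rewards[i]: paid on entering state i
    return states, rewards

# TD(0): after each step, nudge V[s] toward r + GAMMA * V[s_next].
# The value function spreads the final reward backwards, giving every
# intermediate state a per-step "how good does the future look" signal.
V = [0.0] * N_STATES
for _ in range(100):
    states, rewards = run_episode()
    for i in range(N_STATES - 1):
        s, s_next = states[i], states[i + 1]
        r = rewards[i + 1]
        target = r + GAMMA * (0.0 if s_next == TERMINAL else V[s_next])
        V[s] += ALPHA * (target - V[s])

# After training, even the earliest state predicts the eventual reward,
# so a learner no longer has to wait for the episode to end for feedback.
print([round(v, 3) for v in V[:-1]])
```

Nothing here is Ilya's proposal; it is just the standard contrast between a sparse end-of-episode reward and a learned per-step value estimate that his emotion analogy points at.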
Ilya says we have three periods in AI. There was an early age of research, when people tried all kinds of models but had very limited compute. There was the age of scaling, starting with GPT, when the recipe was clear and everyone piled in. And there is, he claims, a coming age of research, this time with huge computers. Scaling laws created a low-risk playbook: if you had capital, you could effectively convert it into better benchmark numbers. That is the era he claims is finished, and it's finished, he says, because web-scale data is finite. This is not a new claim if you've been following Ilya; he made the same argument at NeurIPS a year or two ago. What's interesting is that other model makers claim they can continue to scale pre-training by other means, including synthetic data. So again, there's a lot of disagreement about whether Ilya is correct that the scaling era is over. And that, if you're wondering, is a really healthy sign for the AI ecosystem. Bubbles become dangerous when no one can disagree. The fact that these incredibly intelligent folks building AI systems have important areas where they disagree is super positive for all of us; we get to enjoy the benefits as they work it out.

Takeaway number five: SSI, the company he founded, is research-first. This explains why he's set it up the way he has. If he believes the research era is just beginning, then it makes sense that he's raised on the order of $3 billion and basically has no consumer-facing business. He argues that's a benefit, because he pays no tax of serving customers, which is a really interesting claim coming from Silicon Valley: that not having customers is great. That one was a little surprising to me, but that's where he's at.
So he's claiming that this is an age-of-research company. The bet is not that SSI will out-scale OpenAI, but that it has a different picture of how generalization should work, and that with enough compute it can see whether that picture is correct. Essentially, he has a thesis for how artificial general intelligence might work, and he wants to test it.

Now, speaking of artificial general intelligence, one of the things Ilya calls out is that we need to redefine what we mean by AGI. The usual definition, a system that can do every human job, is in Ilya's view very misleading, because by that standard humans themselves are not artificial general intelligences. No one emerges from childhood able to perform every job. Intelligence, as we see it, is really about learning: it's the general learner that can pick things up quickly that matters, not a static catalog of skills. This is why I believe humans will do well in the age of AI. Ilya's preferred object is the superintelligent learner. Think of a super-capable 15-year-old mind that can learn any job much faster and more deeply than a human. That's what's in his head. It's not what he's invented; he hasn't figured that out yet, and nobody has. That is the challenge he has set himself. His goal is to spin up many copies of this learner, drop them into different roles, and see how they specialize and how they actually evolve. That leads to functional superintelligence via parallel continual learning, not one final all-knowing training run. The scenario he's trying to construct is a data center of superintelligent learning systems that continue to learn and converge over time. He has no idea how long this is going to take, guys.
He gave a timeline of 5 to 20 years, which... I don't know.

Takeaway number seven: alignment, and why he shifted toward incremental deployment. He makes a really interesting point here. Ilya suggests, essentially, that when he used to imagine deploying a system that would rapidly take over the economy, he was reasoning about systems no one had created. That has been one of my biggest critiques of people who reason about superintelligence: we don't have that system, so it's really hard to make big assumptions about it. Ilya agrees. He says we can't reason about a system we haven't met. So the safest thing we can do is incrementally deploy systems and learn from them. Now, ironically, he just got done saying that Safe Superintelligence will not be deploying systems, so I guess he's depending on OpenAI and others to do this. But the idea, I think, is sound: incrementally deploy a system that is increasingly powerful, gradually learn about it, learn how to manage it and work with it, and you end up with a much more grounded sense of the risk than if you just started reasoning theoretically about Terminator, right?

Takeaway number eight: multi-agent setups, and why ecosystems are the real moat. Toward the end of the talk with Dwarkesh, he discussed the idea that frontier models tend to play games with one another and with themselves; they have a sense of negotiation and strategy that is defined within an adversarial multi-agent schema. If this sounds complicated, don't worry, it's going to get simpler.
What Ilya is basically saying is that we have a bit of a problem with our current crop of agents and models: labs are intentionally setting up post-training environments that push models toward a very narrow range of agent strategies, and that leads to less diversity and creativity in our AI agents. He wants to see more diversity, incentives, and competition, so that agents are rewarded for finding genuinely different strategies instead of repeating versions of the prisoner's dilemma, or some other known agent strategy, forever. He thinks that hints at another layer of differentiation: not who has the biggest model, but who has the most interesting, richest training ecosystem of tools, agents, and games to get really interesting results out of the machine-learning models. I think that's a really interesting point, and a really interesting idea of a moat.

Number nine: Ilya thinks that research has a sense of taste. For him, taste is a top-down aesthetic about how intelligence ought to work, anchored in the brain but at a level of abstraction that lets you work technically. Essentially, it's having an opinion about intelligence that is grounded in reality. By that definition, I don't know that I have taste or that you have taste; only a few people do. But that said, the key is that understanding intelligence in a way that is differentiated from your peers allows you to take a genuinely different approach to a tough problem. Remember, at the beginning of this talk, Ilya was saying that he thinks these models don't generalize or learn well, and I think most people would agree. In that case, you sort of have to branch out and try different research methods to really solve that hard problem.
That is what he's calling research taste, and that is the whole talk. Before I let you go, I'm going to give you five takeaways almost no one is talking about. Real quick, we'll take a minute or two here.

Number one: generalization sits underneath alignment. If you don't understand how your system generalizes, you cannot expect its values to generalize in a stable way. Most public discourse treats alignment as something you slap on top of a model. Ilya is implicitly arguing that alignment sits underneath, and that generalization is what lets the model scale those values. I think that's really interesting.

Takeaway number two: business can boom even if research is stalling. Ilya's stall-out picture, which we may or may not agree with (Google disagrees), doesn't mean all of this collapses. He's not predicting a popping bubble. He's predicting hundreds of billions of dollars in revenue, products that feel impressive, and a research frontier that is maybe not advancing human-level learning but is interesting. That scenario is likely, and it creates a lot of pressure to declare the problem solved, even if, in Ilya's view, we haven't really solved learning. So one of the things Ilya worries about, ironically, is not the bubble popping. It's business booming while the bubble stops mattering, because we declare the problem solved since business is doing so well, and the really interesting research problems around generalization get ignored.

That brings me to the third non-obvious takeaway: the AGI moment is the wrong focal point. Framing everything as a single arrival date, as AI 2027 tempts us to do, obscures what matters.
"When do we get human-level trainees with shared memory that are developing quickly?" is a much more actionable way to think about it than setting a wake-up date. One of the things Ilya calls out that's really interesting is that maybe the functional way of talking about general intelligence is to talk about when agents are able to start learning in useful ways. And it's funny that we say this, because Anthropic just published a paper basically saying agents are amnesiacs with tools. We are a ways away, even if we can make lots of money and deploy agents in very successful ways. That's one of my larger takeaways here: Ilya is calling out how far we are from the larger vision, even as we're profoundly successful with the models we have.

The last one I want to call out is that Ilya is suggesting research taste is a strategic asset that is incredibly rare. He's saying a handful of people in the world will decide which directions to pursue and which to kill. That gives color to why folks like Mark Zuckerberg are willing to pay almost any amount of money to buy the right intelligence. A human who can determine how to think about artificial general intelligence in a useful, novel way and guide a new research direction is priceless. Literally priceless: we can't put a price on it, and people are trying to just inflate the numbers away. Don't think of this as a status report from OpenAI's former co-founder. Think of it instead as Ilya coming back from his time at Safe Superintelligence, looking at the field as a whole, and trying to give his sense of where we are in an ongoing journey that he has helped to shape.
He thinks the scaling phase of AI is ending. Time will tell, right? Maybe we'll sit here in a year, say Gemini 3 was the last big pre-training run, and conclude he was right. Maybe we'll sit here and think Ilya must have missed something, because pre-trained models continue to scale. Either way, Ilya has made a really interesting point about the kinds of challenges we need to solve. And I think, indirectly, he's cast light on where we need to focus to compensate for today's AI agents, to help them work usefully and in harness: memory is a big one, the ability to learn, how you handle tool calls. Those all fall out of some of the brittleness Ilya called out to Dwarkesh. So I hope you enjoyed this summary, and best of luck. I guess we'll see who's right in the race for superintelligence.