Learning Library


Grok‑4 Overfits Benchmarks, Fails Real Tasks

Key Points

  • The speaker warns that models tend to overfit to evaluation benchmarks, turning “Humanity’s Last Exam” into a Goodhart’s law scenario where real‑world quality suffers.
  • Grok 4, touted as the top model, appears severely overfitted, ranking only #66 on the head‑to‑head preference platform yep.ai despite its hype.
  • A custom five‑question real‑world test (executive brief, risk extraction, Python bug fix, comparison table, Kubernetes RBAC checklist) showed Grok 4 consistently finishing last, behind Opus 4 and o3.
  • The primary failure mode observed in Grok 4 was an inability to follow explicit formatting instructions, highlighting a gap between benchmark scores and practical usability.

Full Transcript

# Grok‑4 Overfits Benchmarks, Fails Real Tasks

**Source:** [https://www.youtube.com/watch?v=CEgyitKYhb4](https://www.youtube.com/watch?v=CEgyitKYhb4)
**Duration:** 00:13:35

## Sections

- [00:00:00](https://www.youtube.com/watch?v=CEgyitKYhb4&t=0s) **Goodhart's Law and Model Overfitting** - The speaker warns that AI models, exemplified by Grok 4, overfit to benchmark exams, achieving high reported scores while performing poorly in real‑world evaluations, as shown by its #66 ranking on an independent ranking site.
- [00:03:32](https://www.youtube.com/watch?v=CEgyitKYhb4&t=212s) **Critique of Grok’s Narrow Strengths** - The speaker argues that while Grok handles simple, constrained tasks such as JSON extraction efficiently, it lacks the flexibility and creativity needed for broader applications, and cautions against the hype built on overfitted evaluations.
- [00:10:46](https://www.youtube.com/watch?v=CEgyitKYhb4&t=646s) **Caution Over Deploying Grok 4** - The speaker warns that the Grok 4 model is unreliable, prone to privacy‑risk behavior like “snitching” to authorities, and should not be deployed without extensive due diligence and transparency.

## Full Transcript
I am really tired of models overfitting to evals. When we have exams that are supposed to be like Humanity's Last Exam, exams that are supposed to be good measures of model evaluation and quality, it's Goodhart's law all over again: as soon as you make that a goal for a model maker to hit, they will overfit to the data. And I have to say, Grok 4, as hard as the team has worked, is looking like a terribly overfitted model, a model that is much lower in real-world quality than we see in all of these reported benchmarks. It's not just me saying that. I went and looked at yep.ai, which is a place for people to state preferences between answers from different models so they can be ranked head-to-head. You know where Grok 4, the vaunted number one model in the world, ranks? Number 66 as of yesterday. Number 66. Now, you might get some slip back and forth between one and two and three if they're close. You would not expect the number one model in the world to be number 66 at anything, let alone number 66 overall on answers provided. And yet, that's what we see with Grok 4.

I want to ask again that we think more about real-world exams, and I went ahead and modeled this. I built a five-question exam between o3, Opus 4, and Grok 4, because I wanted to do the testing that I keep asking people to do myself. Here are the five tasks I gave these models. Number one, condense a quite long Google research post into a tidy executive brief, keeping to a word count. Number two, pull every single Item 1A risk factor out of an Apple 10-K. Number three, fix a small but deadly Python bug and pass a unit test. Number four, build a side-by-side comparison table from two arXiv abstracts, and do it correctly.
And number five, draft a seven-step role-based access control checklist for a Kubernetes cluster. These are examples of real-world tasks. They should not be all that difficult for the number one model in the world, and I certainly would not expect to need Grok 4 Heavy for tasks like these. So I deliberately used Grok 4. I tested it against o3, and I tested it against Opus 4. If it were anywhere close to the number one model in the world, it would either be neck-and-neck with those two other models or beat them. It did neither. It lost. I tested the models twice, using the same scoring rubric on different model exams, and in each case Grok 4 scored third, Opus 4 scored second, and o3 scored first. I'm not saying that because o3 was perfect. These tasks were intentionally somewhat difficult, and none of the models came through without flaws and defects, but Grok 4 was consistently the lowest-performing model across the five tasks I just described. And you might wonder: what went wrong? Frankly, the biggest issue was explicit formatting. It just could not seem to follow the explicit formatting instructions in the prompt, so it showed poor prompt adherence. And on the Python bug-fixing challenge, Grok delivered elegant-looking but flawed code. The code did not work. Now, I know, and I have seen, people who say that Grok 4 Heavy is very strong at code. Maybe the multi-agent threads are helping it make up for this. But if I throw a little bit of Python at it, and this was not a lot of Python, maybe a dozen or fifteen lines, and it can't correctly fix it, that doesn't give me a ton of confidence.
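The Python exercise described above is the easiest of the five to reproduce at home. Below is a minimal sketch of the *kind* of dozen-line "small but deadly" bug the speaker describes; the function, the bug, and the unit test are hypothetical stand-ins, not the actual exam question:

```python
def median(values):
    """Return the median of a non-empty list of numbers.

    Hypothetical "small but deadly" bug: a naive version returns
    sorted(values)[len(values) // 2], which silently gives the wrong
    answer for every even-length input. The fix below averages the
    two middle elements in that case.
    """
    s = sorted(values)
    mid = len(s) // 2
    if len(s) % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


# The kind of unit test the exam required the fixed code to pass.
assert median([3, 1, 2]) == 2
assert median([4, 1, 3, 2]) == 2.5
print("tests passed")
```

A graded answer would need to both run and pass the test; per the transcript, Grok 4's fix looked elegant but failed at exactly that step.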
On the other hand, for tasks with very straightforward structure, like a JSON extraction, Grok did okay. Grok can handle tasks that are narrowly constrained, and that's something I found anecdotally working with Grok 4 as well. I asked Grok 4 to do some writing for me outside the test environment, and what I found was that the writing is not very creative. It's like the temperature has been turned down on the model, but it's very fast, the output is very consistent, and it has a reasonably high token output, probably higher in real-world settings than Claude's. The thing that bothers me is that if you're going to call something the number one model, it should have the flexibility to do more than these narrowly defined tasks, more than just JSON extraction. And I don't want you to take away from this that it only does JSON extraction and text; it does do other things, and Grok 4 Heavy is better than Grok 4. But overall, I am sharing this video because I want to counter the hype from overfitted evaluations that I see everywhere. And it's not just the Grok team. It concerns me when OpenAI does this, when Anthropic does this, when Google does this. It is not okay to make the evaluations your goal. That's Goodhart's law: if you make a measure your goal, the measure becomes useless. I would suggest that most of the major model evaluations are now functionally useless, because they are so studied and because there's so much PR value in getting number one. And that's what the Grok team got. They desperately needed a PR win, because look at the prior week.
Grok 3 had been dragged through the mud, and rightly so, for turning rapidly antisemitic in the middle of the week. So Grok 4 comes along, and all they want to do is turn the page and change the subject. The team wants something new, so they drop a short postmortem for the Grok 3 release written on X. I wish it had been an actual document, but it was written on X. Then they turn the page to Grok 4 and say: we just want to talk about Grok 4; we're not taking any questions on Grok 3. Grok 4 is great. But Grok 4 shows some of the same fundamental issues that caused the Grok 3 problems. Grok 4 mentions Elon eight times more than other models, for no apparent reason, even in contexts where Elon hasn't been brought up. Grok 4 has, for lack of a better term, and I know it's not a perfect term, a psychological kink around Elon Musk. It looks to see what Elon thinks about things when you don't ask it to. This is not a characteristic of a stable production model. This is not a model that you can use in a business context. This is a model with clear ideological bleed-through. And you need to have more clarity. You need a clear system model card. You need more upfront honesty, which is somewhat ironic because that's sort of Grok's brand, but you need more upfront honesty about model characteristics, how models get deployed, and what system prompt changes look like. I was not particularly satisfied with the short Grok 3 postmortem that came out, because it basically said: we tested it, something went wrong, and now we're fixing it. Well, I don't buy it.
We knew the system prompt was bad, but you need the five whys and a really deep examination of what happened in order to actually get to a full root cause and a full solution. And in this case, if you claim that you solved the Grok 3 issues and then Grok 4 has some of the same kinks, that's going to be a problem. You are not building trust with your autopsy release followed by your new vaunted number one model release. I think part of why Grok 4 was overfitted was that the team needed the PR to support the ongoing valuation and narrative of the company. And I get it. That is very tempting for any startup, and it is not only their issue; I've seen other startups fall into that trap too. So I don't want to over-criticize Grok; this is a larger Silicon Valley issue. I also want to call out what happened when Grok was being trained and reinforcement learning was occurring. By the way, one of the other stories is that reinforcement learning was tremendously expensive for Grok, like 10x more expensive than for other models, and I think that may be an indicator of where the overfitting came in. We shall see. The team could not have known that the Grok 3 incident would occur on July 8th, when it was finishing up Grok 4; Grok 4 was in the can at the time. So even though the narrative was very, very carefully timed, insistently timed, to shut the door on the Grok 3 incident, the broader story around Grok 4 is: we overfit to evals to support sky-high valuations of the business. Grok 4 has been built on, I think, 200,000 GPUs, on the computer called Colossus. The team has rushed into the frontier model space in just two years. They're going really fast.
I have to compliment them on how fast they ship, and they want to paint the picture of a high-velocity, SpaceX-style AI team led by Elon that is going to relentlessly push the benchmarks forward. So they needed that number one to support that story, and xAI's reported valuation, I think it's $200 billion (valuations are vibes here, guys): $200 billion on roughly $0 in revenue, versus a much lower valuation for Anthropic on something like $4 to $6 billion in revenue. I don't know, it's a moving target; Anthropic is picking up speed. If that's fine with you, if you're okay just ignoring billions of dollars in revenue from a competing model maker that's leading in the coding space and giving xAI that massive $200 billion, it shows you the valuations are based on narrative. And to win the narrative, you have to have a "number one model in the world" PR story, and that's exactly what they got this week. That is why they gave in to the temptation. Maybe not consciously; maybe this is unconscious. I have seen teams do this unconsciously, where they are just so desperate to hit number one that they don't stop to ask themselves: did we overfit? Is this actually number one at a wider range of things? But models come out, and the truth comes out, like the yep.ai score: number 66 in the world. As for the test that I ran: look, I'm not going to pretend my test is the best in the world. It was five questions. There are other exams out there that are more comprehensive. The point is that my test lines up pretty well with other real-world experiences of Grok 4 now that it's out and loose. It's not that I'm special. I just tried a little real-world exam, and Grok 4 didn't do as well. It's not a number one model.
And so my ask is that before we pick up and run with these narratives, and maybe this is an ask to the media, we take the time to think about real-world exams, to think about what it takes to run through real-world tests. I don't think this was that hard an exam. The things I gave are things anyone can run with a chatbot. It wasn't even all that difficult; it just took a few minutes, and I got some results. That's the kind of minimal due diligence that would be helpful when we are crafting these narratives, so that we are less tempted to run with "it's the number one model in the world because it aced this test" when the test has been out publicly for a long time and everyone wants to ace it. I think we should drop these exams. I don't think they're helping, and Grok 4 shows why. So where does this leave us? I think it leaves us nowhere. I don't feel comfortable deploying Grok 4 anywhere, particularly given the number of kinks that have shown up. And I'll give you one more that should scare you a lot: Grok 4 shows a marked tendency to snitch to the authorities. People actually measure this, and Grok 4 is between two and 100 times (I know that's a very wide range, but double to 100x) more likely than other models to choose the option to snitch to the authorities when given the choice. I don't know why. Nobody really knows; these models are black boxes for a reason. But that should concern anyone in a business context. Frankly, it should concern you in a personal context. So I don't think Grok 4 should be deployed in anybody's workflow anywhere. I think the team needs to do work on the model first, to make it more flexible and more useful.
And I think we need to start with some honesty about where this model, and other models that make big claims, actually are in terms of production value for real workflows. That's what matters. If you're looking for a model that overperformed, those exist. The Kimi K2 model came out over the weekend, somewhere around July 12. Incredible model, a non-reasoning model out of China, with very, very strong performance. It's slow, but it's very good on real-world tasks. In fact, ironically enough, it beat Grok 4 on a free-form version of GPQA Diamond, and the free-form version is less susceptible to the kind of question packing or overfitting that a model might do. I really want to see more coverage of models like that, models that do a great job we didn't expect on real-world tests, than coverage of a team that shipped a model overfitted to benchmarks. The team is working really hard. They may fix this by Grok 5. They may fix this in the next two weeks. I hope they do. That would be great. In the meantime, I can't recommend using Grok 4 for anything.