Grok‑4 Overfits Benchmarks, Fails Real Tasks
Key Points
- The speaker warns that models tend to overfit to evaluation benchmarks, turning “humanity’s last exam” into a Goodhart’s law scenario where real‑world quality suffers.
- Grok 4, touted as the top model, appears severely overfitted, ranking only #66 on the head‑to‑head platform yep.ai despite its hype.
- A custom five‑question real‑world test (executive brief, risk extraction, Python bug fix, comparison table, Kubernetes RBAC checklist) showed Grok 4 consistently finishing last, behind Opus 4 and o3.
- The primary failure mode observed in Grok 4 was an inability to follow explicit formatting instructions, highlighting a gap between benchmark scores and practical usability.
Sections
- Goodhart's Law and Model Overfitting - The speaker warns that AI models, exemplified by Grok 4, overfit to benchmark exams, achieving high reported scores while performing poorly in real‑world evaluations, as shown by its #66 ranking on an independent ranking site.
- Critique of Grok’s Narrow Strengths - The speaker argues that while Grok handles simple, constrained tasks such as JSON extraction efficiently, it lacks the flexibility and creativity needed for broader applications, and cautions against the hype built on overfitted evaluations.
- Caution Over Deploying Grok 4 - The speaker warns that the Grok 4 model is unreliable, prone to privacy‑risk behavior like “snitching” to authorities, and should not be deployed without extensive due diligence and transparency.
Full Transcript
# Grok‑4 Overfits Benchmarks, Fails Real Tasks

**Source:** [https://www.youtube.com/watch?v=CEgyitKYhb4](https://www.youtube.com/watch?v=CEgyitKYhb4)
**Duration:** 00:13:35

## Sections

- [00:00:00](https://www.youtube.com/watch?v=CEgyitKYhb4&t=0s) **Goodhart's Law and Model Overfitting**
- [00:03:32](https://www.youtube.com/watch?v=CEgyitKYhb4&t=212s) **Critique of Grok’s Narrow Strengths**
- [00:10:46](https://www.youtube.com/watch?v=CEgyitKYhb4&t=646s) **Caution Over Deploying Grok 4**

## Full Transcript
I am really tired of models overfitting to evals. When we have exams that are supposed to be like Humanity's Last Exam, exams that are supposed to be good measures of model evaluation and quality, it's Goodhart's law all over again. As
soon as you make that a goal for a model
maker to hit, they will overfit to the
data. And I have to say, Grok 4, as hard as the team has worked, is looking like a terribly overfitted model: a model that is much lower in real-world quality than we actually see in all of these reported benchmarks. It's not just me
saying that. I actually went and looked
at yep.ai, which is a place for people
to prefer answers from different models
so they can rank them head-to-head. You
know where Grok 4, the vaunted number
one model in the world, ranks?
Number 66
as of yesterday. Number 66.
Now, if you think about it, you might
get some slip back and forth between one
and two and three if they're close. You
would not expect the number one model in
the world to be number 66 at anything,
let alone number 66 overall at answers
provided. And yet, that's what we see
with Grok 4. I want to ask again that
we think more about real world exams.
And I went ahead and modeled this. I
went and I built up a five question exam
between o3, Opus 4, and Grok 4 because
I wanted to do the testing that I keep
asking people to do myself. And I'm
going to tell you the five different
tasks that I gave these models. Number
one, condense a Google research post
that's quite long into a tidy executive
brief. Keep a word count. Number two,
pull every single item that is a 1A risk
factor out of an Apple 10K. Number
three, fix a small but deadly Python bug
and pass a unit test. Number four, build
a side-by-side comparison table from two arXiv abstracts and do it correctly. And number five, draft a seven-step role-based access control checklist for a Kubernetes cluster. These are examples
of real world tasks. They should not be
all that difficult for the number one
model in the world. And certainly I
would not expect to have to use Grok 4 Heavy for a task like this. So I deliberately used Grok 4. I tested it against o3. I tested it against Opus 4.
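To make task three concrete: a hedged sketch of what a "small but deadly Python bug" gated by a unit test can look like. This is not the actual exam question; the function, the bug, and the test below are invented for illustration. The classic trap shown here is a mutable default argument:

```python
# Invented illustration of a "small but deadly" Python bug plus the
# unit test that catches it (not the actual exam question).

def append_tag_buggy(tag, tags=[]):
    # Bug: the default list is created once, at function definition,
    # so it is silently shared across every call that omits `tags`.
    tags.append(tag)
    return tags

def append_tag_fixed(tag, tags=None):
    # Fix: create a fresh list per call when none is supplied.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags

# The unit test: a second independent call must start from scratch.
assert append_tag_buggy("a") == ["a"]
assert append_tag_buggy("b") == ["a", "b"]   # the deadly part: state leaks
assert append_tag_fixed("a") == ["a"]
assert append_tag_fixed("b") == ["b"]        # fixed version passes
```

A fix that merely restyles the function without noticing the shared default produces exactly the failure mode described later: elegant-looking code that still fails the test.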
If it was anywhere close to the number
one model in the world, it would either
be neck-and-neck with those two other
models or it would beat them. It did
neither. Instead, it lost. I tested the models twice, using the same scoring rubric on different model exams. And in each case, Grok 4 scored third, Opus 4 scored second, and o3 scored first. I'm not saying that because o3 was perfect.
These were intentionally somewhat
difficult, and none of the models came
through without flaws and defects, but
Grok 4 was consistently the lowest
performing model across the five tasks I
just described. And you might wonder,
well, what's in the box there? Frankly,
the thing that was an issue was explicit
formatting. It just could not seem to
follow the explicit formatting
instructions in the prompt. So, it
showed poor prompt adherence. And in the Python bug-fixing challenge, Grok delivered elegant-looking but flawed code. Like, the code did not work. Now, I
know and I have seen people who say that
Grok 4 Heavy is very strong at code. Maybe the multi-agent threads are helping it make up for this. But if I throw a little bit of Python at it, and this was not a lot of Python, it was like a dozen lines, 15 lines of Python, and it can't correctly
fix it. It doesn't give me a ton of
confidence. On the other hand, for tasks
that had very straightforward structure,
like, hey, do a JSON extraction, Grok did okay. Grok can sort of do tasks that are narrowly constrained. And that's something I found anecdotally working with Grok 4 as well. I asked Grok 4 to do some writing for me
outside the test environment. And what I
found was the writing is not very
creative. It's like the temperature has
been turned down on the model, but it's
very fast. The output is very consistent
and it has a reasonably high token
output. It probably has a higher token
output in real world settings than
Claude. I think the thing that bothers
me is that if you're going to call
something the number one model, you
should have the flexibility to do more
than just these narrowly defined tasks,
more than just JSON extraction. And
that's a bit I don't want you to take
away from this that it only does JSON
extraction and text. It does do other
things. Grock 4 heavy is better than Gro
4. But overall, I am sharing this video
because I want to counter the hype from overfitted evaluations that I see everywhere. And it's not just the Grok team. It's concerning to me when OpenAI does this, it's
concerning to me when Anthropic does
this. It's concerning to me when Google
does this. It is not okay to make the
evaluations your goal. That's Goodhart's law. If you make a measure your goal, the measure becomes useless.
Now, I would suggest that most of the
major model evaluations are functionally
useless because they are so studied and
because there's so much PR value in
getting number one. And that's what the
Grok team got. They desperately needed
a PR win because look at the prior week.
Grok 3 had been dragged through the doghouse, and rightly so, for turning rapidly antisemitic in the middle of the week. And so Grok 4 comes along and
all they want to do is turn the page and
change the subject. The team wants
something new and so they drop a short
postmortem written on X. I wish it had
been an actual doc but it was written on
X for the Grok 3 release, and then they turn the page on Grok 4 and they say, hey, you know what, we just want to talk about Grok 4. We're not taking any questions on Grok 3. Grok 4 is great, but Grok 4 shows some of the same fundamental issues that caused the Grok 3 problems. Grok 4 mentions Elon eight
times more than other models. For no
apparent reason, even in contexts where
Elon hasn't been brought up, Grok 4 has,
for lack of a better term, and I know
it's not a perfect term, a psychological
kink around Elon Musk. It looks to see
what Elon thinks about things when you
don't ask it to. This is not a
characteristic of a stable production
model. This is not a model that you can
use in a business context. This is a
model with clear ideological
bleedthrough. And you need to have more
clarity. You need to have a clear model card. You need to have more upfront honesty, which is somewhat ironic because that's sort of Grok's
brand, but you need more upfront honesty
on model characteristics, how models get
deployed, what system prompt changes
look like. I was not particularly
satisfied with the Grok 3 short postmortem that came out, because it basically said: we tested it, something went wrong, and now we're fixing it. Well, I don't buy it. We knew the system prompt was bad, but you need to have the five questions and a really deep
examination of what happened in order to
actually get to a full root cause and
full solution. And in this case, if you
claim that you solved the Grok 3 issues and then Grok 4 has some of the same kinks, it's going to be a problem. You are not building trust with your
autopsy release and then your new
vaunted number one model release. I
think that part of why Grok 4 was
overfitted was because the team needed
the PR to support the ongoing valuation
and narrative of the company. And I get
it. That is very tempting for any
startup. That is not an xAI-only issue; I've seen other startups fall into that trap too. So I don't want to overcriticize Grok; that is a larger Silicon Valley
issue. And I also want to call out something about when Grok was being trained: one of the other stories, by the way, is that reinforcement learning was tremendously expensive for Grok, like 10x more expensive than for other models. And I think that may be an indicator of where the overfitting came in. We shall see. The team could not have known that the Grok 3 incident would occur on July 8th when it was finishing up Grok 4. Grok 4 was in the can at
the time. And so really, even though the
narrative was very very carefully timed
and was sort of insistently timed to
shut the door on the Grok 3 incident, the broader story around Grok 4 is: we overfit to evals to support sky-high valuations of the business. Grok 4 has, I think, been built on 200,000 GPUs, and the computer is called Colossus. The team has
rushed into the frontier model space in
just two years. They're going really
fast. I got to compliment them on how
fast they ship and they want to paint
the picture of a high velocity SpaceX
style AI team led by Elon that is going
to relentlessly push the benchmarks
forward. And so they needed that number
one to support that story. And xAI's reported valuation, I think it's $200 billion. Valuations are vibes here, guys: $200 billion on $0 in revenue, versus a much lower valuation for Anthropic on like $4 to $5 to $6 billion in revenue. I don't know, it's a moving target; Anthropic is picking up speed. And if that's fine, if you're okay just ignoring billions of dollars in revenue from another competing model maker that's leading in the coding space and just giving xAI that massive $200 billion, it shows you the valuations
are based on narrative, and to win on narrative you have to have a "number one model in the world" PR story. That's exactly what they got this week, and that is why they gave in to the temptation. Maybe not consciously; maybe this is unconscious. I have seen teams do this unconsciously, where they are just so desperate to hit number one that they don't stop to ask themselves the question: did we overfit?
Is this something that is actually number one at a wider range of things? But models come out and the truth comes out, like the yep.ai score, right? Number 66 in the world. The test that I ran,
which look, I'm not going to pretend my
test is the best in the world. It was
five questions, right? Like there are
other exams out there that are more
comprehensive. The point is my test
lines up pretty well with other real-world experiences of Grok 4 now that it's out and loose. It's not that I'm special. It's that I just tried to do a little real-world exam and Grok 4 didn't do as well. It's not a number one
model. And so my ask is that before we
pick up and just run with these
narratives, and maybe this is an ask to
the media, take the time to think about
real world exams, to think about what it
takes to run through real world tests. I
don't think this was that hard an exam.
The things I gave are things anyone can
run with a chatbot. It wasn't even all
that difficult. It just took a few
minutes and I got some results. That's
the kind of minimal due diligence that
would be helpful when we are crafting
these narratives so that we are less
tempted to run with "it's the number one model in the world because it aced this test" when the test has been out publicly for a long time and everyone wants to ace it. I
think we should kind of drop these
exams. I don't think they're helping.
Grok 4 shows why. So where does this
leave us? I think it leaves us nowhere.
I don't feel comfortable deploying Grok 4 anywhere, particularly given the number of kinks that have shown up. And I'll give you one more that should scare you a lot. Grok 4 shows a marked tendency to snitch to the authorities. They actually measure this, and Grok 4 is, and I know that's a very wide range, double to 100x more likely than other models to choose the option to snitch to the authorities when given the choice. I don't know why.
Nobody really knows why; these models are black boxes for a reason. But that
should concern anyone in a business
context. Frankly, it should concern you
in a personal context. So, I don't think
Grok 4 should be deployed in anybody's
workflow anywhere. I think the team
needs to do work on the model first to
make it more flexible, to make it more
useful. And I think we need to start
with some honesty about where this model
and other models that make big claims
are actually at in terms of production
value for real workflows. That's what
matters. If you're looking for a model
that overperformed, those exist. The
Kimi K2 model came out over the weekend, somewhere around July 12. Incredible model, a non-reasoning model out of China. Very, very strong performance. It's slow, but it's very, very good on real-world tasks. In fact, ironically
enough, it beat Grok 4 on a free-form version of GPQA Diamond, which is less susceptible to the kind of question packing or overfitting that a model might do.
I really want to see more coverage of
models like that, which do a great job that we didn't expect on real-world tests, than I want to see coverage of a team that shipped a model that was overfitted to benchmarks. The team is working really hard. They may fix this by Grok 5. They may fix this in the
next two weeks. I hope they do. That
would be great.
In the meantime,
I can't recommend using Grok 4 for anything at all.