# Beyond Benchmarks: Real-World AI Evaluation

**Source:** [https://www.youtube.com/watch?v=okIVbBnk-Sg](https://www.youtube.com/watch?v=okIVbBnk-Sg)
**Duration:** 00:06:02

## Key Points

- The launch of Claude 3.7 highlights the urgent need for better AI evaluations: current benchmarks (e.g., AIME) are overfit, rewarding models trained specifically to excel on them rather than to perform useful work.
- Real-world usefulness is better captured by emerging evaluations such as SWE-Lancer, a benchmark that measures a model's ability to independently complete freelance jobs, on which Claude 3.5 currently outscores newer models.
- Practitioners are left to rely on subjective "vibes," the intuitive impressions a model gives when you work with it, an informal and hard-to-communicate gauge that underscores the industry's lack of standardized, task-focused metrics.
- Claude 3.7 follows a challenger-brand strategy, doubling down on Claude's historic strength in coding and development and making the model available in the terminal, in Cursor, and via a GitHub integration in the ChatGPT app to attract developers seeking a specialized, high-performance tool.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=okIVbBnk-Sg&t=0s) **Need for Real-World AI Benchmarks** - The speaker argues that models like Claude 3.7 are overfitted to popular academic tests and urges the development of more meaningful evaluations, such as the newly introduced SWE-Lancer benchmark for freelance tasks, while highlighting Claude 3.5's unexpectedly strong real-world performance.

## Full Transcript
The launch of Claude 3.7 is really underlining for me that we need better evals, or better evaluations, for AI models right now. I think models are very overfitted to the evaluations that are widely published, like the AIME, and so all of these models score incredibly well on these widely known evaluations. But it's because they are trained from the get-go to be good at the evaluations, and that's sort of circular, isn't it? And if you want to take a step back and think about what Satya Nadella was saying when he talked about wanting models to do economically useful work, it's a bit of a nod in that direction, right? The models are very good at this benchmark or that benchmark, but are they actually doing work that's meaningful?

And there are not great benchmarks for meaningful work. Probably the closest is a brand new benchmark that just came out, called, I think, SWE-Lancer, which is something that OpenAI is maintaining. Really, independent orgs should maintain these, but right now OpenAI is maintaining this one, and it's designed to measure a model's ability to independently complete freelance work. That's probably the closest to measuring real work. And Claude 3.5, not even 3.7, Claude 3.5, did very, very well on that; it scored the highest so far.

And I think that's an example of what I mean when I say that even if the models all score very similarly on these academic benchmarks, in the real world, doing real-world work like what SWE-Lancer is trying to measure, or what I hope other benchmarks will emerge and measure, the models are different. The models are not the same. And right now we're referring to this as, like, the vibes of the model: the impression the model gives you when you work with it.

Unfortunately, it means that people like me, who spend a lot of time playing with AI models, become, effectively and without wanting to, gatekeepers of this implicit information. I sit there and I understand intuitively, as soon as I touch Claude 3.7, this is how it feels different from 3.5, this is how it feels different from the ChatGPT models. But it's really difficult to convey that in a way that's useful to people who have different lines of work, and I think that's something that the industry as a whole is really struggling with right now.

I will say, from my perspective, for
3.7, they're doing a classic challenger brand play, and they're really focused on what Claude has historically been fantastic at, which is coding and building. And so if you look at where they've prioritized making 3.7 available, it's aimed specifically at coding and building: it's available in the terminal, it's available right away in Cursor, it's available with a GitHub integration in the ChatGPT app. They want you to build with this model.

And I think that makes a lot of sense. Challenger brands typically specialize, and that's sort of how they win in the space, while larger brands like ChatGPT typically have to generalize, and they have to make their value proposition coherent. Which is part of why ChatGPT has been investing in GPT-5: they have to take this whole half dozen models and make them into one coherent model that everyone understands, for a generalized audience.

I would expect that 3.7 will continue the tradition of Claude punching above its measured weight. 3.5 remained a favored coding model for nine or ten months after it was released, despite all the other models that were released along the way. And 3.7, having played with it, having looked at how much better it feels than 3.5, I suspect it will have the same fate. I think it will be a very popular coding model for a long time. I'm going to see if I can catch up and give some examples later today on the Substack of what it's like, how 3.7 compares to 3.5. I want to make it really tangible: I want you to see the difference, given the same prompt, of how these models react, and especially how they build. That's my attempt to show how they do economically useful work.

But again, we really need independent benchmarks that help us figure this out. It's not something that individuals can really do, and it really shouldn't be something that model makers do, because model makers, even if they try not to be, are inherently kind of biased, right? They just are. I have no doubt that the reason OpenAI launched the freelancer benchmark is that they expect to beat it with GPT-4.5, or maybe with GPT-5. That's why you would pay to maintain, and sort of drive, that benchmark. And benchmarks are not free; it takes a lot of work to maintain them.

We desperately need more benchmarks that are real-world. And right now, the organizations that could pay, like companies, have no incentive to share how they're using this for economically useful work, because that's secret sauce for the company. Why would you share it? So we're sort of in a situation where no one who has the money has the incentive to develop and set up an independent evaluation for economically useful work, and it would be really helpful if we had that.

So I don't know what the answer there is. If you magically have a few million dollars and you're listening to this: set up a benchmark. Set up a benchmark! I think it would be something that the rest of the world would really appreciate. And if you don't, you can join me in asking and wishing that people would. Cheers.