# The Trust Gap in AI

**Source:** [https://www.youtube.com/watch?v=xrzpWXW4-38](https://www.youtube.com/watch?v=xrzpWXW4-38)
**Duration:** 00:23:35

## Summary

- Trust in AI systems is difficult to scale because users cannot see the underlying intelligence, leading to opaque transactions unlike traditional economics.
- Recent controversies, such as unclear messaging limits, perceived degradation of Claude Code, and developers demanding transparent usage metrics, highlight a deeper misalignment between model makers' incentives and user needs.
- Companies often make grand performance claims to attract press and funding, but these claims (e.g., Grok 4's test results) can fall apart when scrutinized by real users.
- The hype-driven announcement of an OpenAI model earning an International Math Olympiad gold medal exemplifies how sensational AI achievements can be overstated, prompting the community to apply stricter heuristics when evaluating such claims.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=xrzpWXW4-38&t=0s) **Scaling Trust in AI Services** - The speaker argues that unlike traditional markets, AI systems lack observable outputs, creating opaque pricing, message-limit controversies, and misaligned incentives that widen a trust gap between providers and users.
- [00:04:32](https://www.youtube.com/watch?v=xrzpWXW4-38&t=272s) **Math Olympiad Rejects AI Involvement** - The speaker outlines the Olympiad's insistence that AI companies not partake in grading or publicity, notes Google's compliance and OpenAI's non-compliance, and references Terence Tao's perspective on the difficulty of evaluating the contest problems.
- [00:08:18](https://www.youtube.com/watch?v=xrzpWXW4-38&t=498s) **OpenAI's PR-Driven Transparency Dilemma** - The speaker critiques OpenAI for prioritizing press releases and rapid product launches over openness and creativity, highlighting delayed open-weight models, opaque chain-of-thought outputs, and underwhelming performance in a mathematics competition.
- [00:12:24](https://www.youtube.com/watch?v=xrzpWXW4-38&t=744s) **Meta's AI Spend vs Passion Gap** - The speaker argues that Meta's costly AI push, highlighted by the overpromised, underdelivered Llama 4 and an uncertain Llama 5 timeline, underscores a strategy that relies on deep pockets rather than truly passionate engineering talent.
- [00:15:54](https://www.youtube.com/watch?v=xrzpWXW4-38&t=954s) **Optimism vs Technical Excellence** - The speaker juxtaposes Anthropic's utopian, optimistic culture led by Dario Amodei with Google's mathematically strong but less cohesive focus on technical brilliance and AGI breakthroughs.
- [00:19:08](https://www.youtube.com/watch?v=xrzpWXW4-38&t=1148s) **Domain Expertise vs AI Trust** - The speaker argues that the mismatch between code-focused AI developers and non-technical domain experts fuels trust problems, noting that alignment between AI claims and expert judgment occurs only in coding, where both parties share expertise.
- [00:22:16](https://www.youtube.com/watch?v=xrzpWXW4-38&t=1336s) **Rule-of-Thumb Model Trust Framework** - The speaker outlines personal heuristics for gauging trust in AI models from various companies, favoring production-proven tools and questioning untested claims, and invites others to suggest their own criteria for purchasing unseen model capabilities.

## Full Transcript
One of the interesting things about this
age of AI is that trust is hard to
scale. In classical economic theory, you
can establish trust and scale it through
transactions because each side knows
what the other side gets. That's not
true with AI. It's not true with large
language models. And I will tell you why
it's not true. You can't see what the
intelligence is on the other side when
you buy it. This has caused a host of
issues. Just this past week or two,
there was a big kerfuffle over message limits in Cursor and what Cursor's pricing was going to be and how that was
going to change and developers got
upset. Then after that, I saw people
getting upset about Claude Code and
claiming that Claude Code had somehow
degraded behind the scenes or wasn't counting messages properly, because it wasn't transparent. I saw people
asking Sam Altman of OpenAI, please,
please, please show me message count so
I can see how close I get to my limit.
And it's not just about counting
messages. That's actually very solvable.
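He's right that the counting itself is mechanically trivial. A minimal sketch of a transparent usage meter (all names here are illustrative, not any provider's actual API):

```python
# Illustrative usage meter -- hypothetical names, not any provider's real API.
class UsageMeter:
    """Counts messages against a plan limit and reports what's left."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def record(self, n: int = 1) -> None:
        self.used += n

    def remaining(self) -> int:
        return max(self.limit - self.used, 0)

    def status(self) -> str:
        return f"{self.used}/{self.limit} messages used ({self.remaining()} left)"

meter = UsageMeter(limit=50)
for _ in range(12):
    meter.record()
print(meter.status())  # 12/50 messages used (38 left)
```

Exposing exactly this kind of counter is what the developers above were asking for; the hard part is incentives, not engineering.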
It's something deeper underneath where
model makers incentives are not aligned
with us, the people who are using them.
They're absolutely incentivized to make
big claims about being the best in the
world because that unlocks press
releases, stories, and dollars. We
explored this when we talked about Grok 4
and their claims around test results
that weren't borne out when actual users
used the product. There is a wider trust
gap across AI that I want to talk about
today. And I want to give you some
specific heuristics or rules of thumb
that I use when I'm evaluating claims
from specific AI labs because they have
a different trust fingerprint. They're
not all the same. In order to get into
that story, I want to give you the
latest example of a somewhat sketchy
claim from a major model maker. It
happened just this weekend and it's the
International Math Olympiad gold medal
claim from OpenAI. The implied
probability of this happening at all was
around 20% on Polymarket. So, you could
consider this even by LLM standards a
big surprise. And the tech community
reacted with enormous excitement.
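For context, a prediction-market price maps to an implied probability almost directly: a yes contract trading at $0.20 against a $1 payout implies roughly 20%. A sketch (real markets also carry fees and overround, which this ignores):

```python
# Implied probability from a prediction-market contract price (sketch).
def implied_probability(yes_price: float, payout: float = 1.0) -> float:
    """A $0.20 yes contract paying $1 implies ~20% probability."""
    return yes_price / payout

# When yes and no prices sum to more than the payout (overround),
# normalizing gives a cleaner estimate:
def normalized_probability(yes_price: float, no_price: float) -> float:
    return yes_price / (yes_price + no_price)

print(f"{implied_probability(0.20):.0%}")           # 20%
print(f"{normalized_probability(0.21, 0.84):.0%}")  # 20%
```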
Everyone was like, "This is incredible." It's even more incredible because OpenAI
claimed that it was just a large
language model. It was not using tools.
So, it didn't open up a Python notebook
to solve this. And that it was given the
exact same time constraints as a
student. And so, they were given 100
minutes to solve the problem, and the
machine was able to do it in that time
and was able to write out a proof that
was then validated by multiple
independent mathematicians. At first
glance, this sounds like a legitimate
story. And the gold medal, by the way,
is for answering five of the six
International Math Olympiad questions
correctly. And if you are wondering how
hard they are, I looked at them and got
a headache. They are ridiculously hard.
Very very few students in the world get
a gold medal at the International Math
Olympiad. They change up the questions
every single year. So these are not last
year's questions. So you could not train
on these questions previously. These
were novel to the LLM and everyone else
in the world. That was the claim. Now we
dive into the rest of the story. It
transpires that there is a marking guide
from the International Math Olympiad
organization for the six questions that
were posed to students. That marking
guide is private. It's only available to
qualified examiners. And because OpenAI
chose not to participate with other AI
organizations that were taking this test
as AI, notably Google, they did not have
access to that marking guide. So when
they published their results, which they
did, they published the entire output of
the five questions out of six that the
AI answered on GitHub, you could see it,
they did not have those answers marked against the official marking guide by qualified examiners. And that
generates all kinds of questions because
you don't know if the marking guide
might have taken a point or two off for
the way it answered the question or for
the quality of the train of thought. You
don't know what you don't know because
you don't have the actual examination
marking guide. And that matters because
the gold medal claim was barely a gold
medal. It was like one or two points
over the bar because it got five of the
six questions, not all six. This was not
a slam dunk gold medal. It was a skin-of-your-teeth gold. It gets even weirder.
The Math Olympiad not only put out a statement saying OpenAI didn't participate with the other AI organizations, notably Google, and also didn't use its marking guide or its examiners who know these problems and are trained to mark them. They also said very explicitly,
"We asked AI companies for the sake of
the human students who are taking this
test to please, please not make a big
deal out of PR on your gold medal today
over the weekend. Give the students a
week of glory because they are the
humans who are working so hard to take
this test. I think it's something like
50 students in the world get the gold
medal. It's a big, big deal." The organization wanted them to have their moment of honor, which is really worthwhile. Google appears to be abiding by that as an official participant in the process. OpenAI, which did not officially participate in the process but published their answers, appears not to be abiding by the Math Olympiad's request, nor do they have access to the marking guide. I am not
qualified to tell you if they successfully passed those five questions; there are very few mathematicians who are. One of them is one of the world's foremost mathematical minds, Terence Tao, and he weighed in on the whole problem set here and why it's so complex to evaluate. And I want to summarize his
thinking just a little bit because I
think it's easy to understand even
though he's obviously far smarter than
me. What he said is that the way you set
up an examination profoundly shapes the
results. And so he said on the actual
math olympiad there's a coach and their
students and the coach's job is to
advocate for the answers for the
students but the students themselves are
left to their own devices for 100
minutes with pencil and paper only to
answer the problems. So they can have
advocacy after the fact by their coach
but it's on them to answer. And then he
started to give examples from actual AI
technologies to help you understand how
things can be very very different when
you set a large language model to take
the test. One example he gave was would
it influence the test if all of the
students got together and started to
point each other in the right direction,
give each other hints? Yeah, it absolutely would. That is essentially ensembling, loosely related to the mixture-of-experts idea: you have multiple models, or multiple samples, taking the task together. That might have been what happened. We don't know what the architecture of this model was.
This wasn't regular ChatGPT. Sam Altman has since clarified it wasn't GPT-5. We're not sure what it was.
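The students-comparing-answers scenario can be sketched as majority voting over multiple sampled answers, often called self-consistency (note this is an ensembling trick, distinct from the mixture-of-experts routing that happens inside a single network):

```python
# Majority voting over several sampled answers (self-consistency sketch).
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

# Pretend these came from five independent attempts at the same problem:
samples = ["42", "42", "41", "42", "17"]
print(majority_vote(samples))  # 42
```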
We also don't know if the perception of
time matters to an LLM in the same way.
And so for a student, we know what 100
minutes means. It's considered, as crazy
as it sounds, because I'm sure I
couldn't do this. It's considered a
reasonable amount of time to answer the
question. I'm sure I would not get
nearly far enough. I wouldn't even get
to the beginning. I looked at these
problems. They're impossibly hard. But
for an LLM, it doesn't run on clock
time. In fact, they're famous for not
running on clock time. That is part of
why this concept of digital twins works
is that you can run millions of hours of
simulation in a very short amount of
clock time when you are simulating
robots walking through a warehouse and
trying to train them. That's a real
example from Nvidia. By the way, if clock time doesn't work the same for large language model simulations, is giving an LLM 100 minutes actually equivalent to giving a human 100 minutes? I don't know. And Terence doesn't know either. And his point was not that this is not an achievement. His point was that it's really hard to understand what's in the box of this achievement if we don't have more details. And OpenAI
has not released those details. And
people have been going after OpenAI for
a while on the lack of transparency.
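The earlier point about clock time is easy to see in miniature: a discrete-event simulation advances its own clock without waiting, so simulated hours cost almost nothing in wall time. A toy sketch (not Nvidia's actual tooling):

```python
import time

# Toy event loop: the simulated clock jumps forward step by step,
# so "hours" of simulated time pass in milliseconds of wall time.
def simulate(steps: int, sim_seconds_per_step: float) -> float:
    sim_clock = 0.0
    for _ in range(steps):
        sim_clock += sim_seconds_per_step  # no sleeping, just bookkeeping
    return sim_clock

start = time.perf_counter()
sim_hours = simulate(steps=1_000_000, sim_seconds_per_step=3.6) / 3600
wall = time.perf_counter() - start
print(f"simulated {sim_hours:,.0f} hours in {wall:.3f}s of wall time")
```

Which is exactly why "100 minutes" may not mean the same thing to a system that doesn't experience minutes.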
That is part of their trust blueprint
DNA. They make claims. They publish some
of the results of the claims. They
launch models that are quite good in
practice, but they do not reveal what's
in the box or how it works. The chain of thought you see on o3 is not transparent. That is a sanitized chain of thought, and they have decided not to release the real one. And so if you think about what's coming up next for OpenAI, the launch of GPT-5, if you think about
the upcoming launch of their open
weights model, which they have delayed
again, I start to see these kinds of
claims from OpenAI in the light of their
trust fingerprint. I start to read it
and I start to say this is a model maker
that values press releases. It values
public relations. it will jump to get
the PR victory ahead of any kind of
request that it gets to hold things
back. It moves fast. And so when the
International Math Olympiad said,
"Please wait for the students," Sam
didn't wait and he pushed forward
because he had an amazing story to tell
and he wanted to be the first in the
market and he wanted to beat Google to
the story. Another mathematician weighed
in on this, by the way, and said that
generally speaking, having evaluated the
results from OpenAI, the machine showed a lack of creativity and weird notation but technically solved the problems. And
then he went on to say, well, creativity
is really important in mathematics, and
it's notable that the sixth question was
not even attempted because the sixth
question is the most creative and
challenging of them. And his conclusion was that, as a mathematician, it doesn't look like LLMs are going to be taking my job anytime soon. And I think
that's a really interesting take and I
think it's possible to articulate that
take without denigrating or without
minimizing the value of the achievement.
It is absolutely true that a large
language model not using tools getting
any kind of close to gold medal on a
math olympiad problem set even if it has
all of these caveats that's a big deal.
If Google announces they also got a gold
medal later this week, that will be just
as big a deal. And the rate of progress
can be important, significant, and worth
studying without having these huge
existential questions off the top. And I
think one of the things that makes the tech community that is too bullish on AI unloved, frustrating, and incredibly annoying to other people is
the sense from the rest of the world
that they just think AI is the way
forward. I saw flame wars on X, which, well, that's where you go, right? That's what you get. But I saw flame wars on X
where people in tech were basically
saying none of you get it. This is the
way AI is going to run the world. None
of you deserve jobs. AI is just going to
do all your jobs for you. One, that's
not a way to make friends, and that's
not a way to, you know, get your
technology adopted. And two, it's not
even reasonable. We are in a world where
it may be possible that AI has a gold on
the International Math Olympiad, but
also can't really play Mario Kart
properly. My kids may be better at Mario
Kart than AI right now. And people will
say, well, just wait a minute. And I'm
like, yeah, sure, wait a minute. But
let's at least acknowledge that the
intelligence is kind of jagged and it's
a strange world. And it's not at all
clear in that world what that means for
employment, except that so far, and I saw yet another study come out on this just this weekend, there is no discernible effect on employment from AI. So nothing has happened yet despite all of the hot air back and forth. Let's look briefly at
some of the other labs and evaluate
their trust fingerprints. Let's look at
Meta. Reports jumped over the weekend about the $200 million pay package that Mark Zuckerberg offered and that was accepted by someone to come to Meta; I think that was the biggest headline. I think the packages all vary between $10 million and $200 million, which is just generational wealth, right? It's
incredible. Apparently, at least 10
researchers at OpenAI turned down, the
rumor goes, $300 million paychecks. $300
million. That is more than most
professional athletes make. That is Shohei Ohtani money, if you're a baseball fan.
So, the reason I'm calling this out is
that this is part of Meta's playbook. If
you're looking at the trust fingerprint
for Meta, they are very heavily into
spending money to catch up aggressively
and making sure that they can back up
their demos even if their demos were in the move fast and break things spirit initially. So, Llama 4 was widely panned. It promised a massive context window, I think it was 10 million tokens, and that window is not remotely usable at 10 million tokens. It is not clear when Llama 4 Behemoth is going
to be out. It may never be out. Mark
Zuckerberg saw that. He saw his public
AI statements fall apart and essentially
he started to see the developer
community shift away from llama as
Chinese models came out. Kimi K2 came out recently, a phenomenal model that started
to eat away at his open ecosystem
vision. And his response was classic
Mark. I'm going to go spend money to
solve this problem and I have more money
than God. So I'm going to spend as much money as I need. You get $300 million. You get $100 million. You get $50
million. And he's going to assemble
whatever it takes. The challenge is
historically Meta can spend the money,
but Meta can't buy the passion. And so
as much as Zuck has never lost over the long term yet, Zuck has also not
assembled teams that are passionate
about anything but social media. And I
think that is a very open question. He's
paid all these people, but the people
who said no may be the people who are
most interesting in this scenario
because those are the people that chose
passion and the startup fit over $300
million. I don't know if I could do
that. I don't know if a lot of people
could do that. They must believe profoundly in the OpenAI vision, because Sam was very open that he didn't match it. Like, they're not getting paid $300 million by OpenAI. And so in that
world, I think my question is can money
buy the kind of team that you need to
build super intelligence if that's even
possible. I don't know. We're all going
to find out. But that's classically Mark
to try and build it that way. The trust
fingerprint for meta is very
demoleaning. It's like the VR AR days
where everyone mocked Mark and then he
spent a lot of money to bring Oculus
into the world and to improve the AR
race and basically to start beating
Apple at AR and VR. That's how Mark
works. And so now he's in the money
phase of this pendulum that swings back
and forth and that means there's going
to be a big demo coming up that's even
more interesting and we'll see if that
actually puts Llama back on track. Llama
5 is going to be a big deal. So if you sum it up: Meta has demo-first DNA. OpenAI? OpenAI is going to win the PR war, and they're going to hide the how. What about Anthropic? Anthropic is an
interesting one. They have extremely
careful work. They have some of the most
interesting work on AI ethics, some of
the most interesting work on showing and
proving how AI works in the industry.
But they also have some of the most
unbridled and unsupported optimism I've
seen. The example of Claude managing the
vending machine is great. I talked about
this earlier. I won't do the whole
story. Basically, in the middle of
managing a vending machine, Claude had a
psychotic break, hallucinated that it
was a real person, and did not pull
itself out of its funk until April 1st
when it told itself it was an April
Fool's joke. This was all carefully
documented by Anthropic. To their
credit, they didn't hide it. They were
really honest. And then at the end, they
had this wild optimism about how Claude
is going to be a middle manager soon.
And I looked at that and I said that
does not line up with the rest of this
paper. But boy does it line up with the
kind of optimism that I see from Anthropic, and that I see from Dario Amodei all the time. Dario's the founder of Anthropic, and he is known for writing the essay Machines of Loving Grace, where he talks about his vision for the future and how it's very utopian. The team does phenomenal, focused, careful work and then slips in that sort of utopian idealism by the by, right? And that's
just part of their fingerprint. They do
careful work and they have kind of
careless optimism. It's a really
interesting combination. What about
Google? With Google, it's all about technical excellence. They literally have Demis Hassabis, a Nobel laureate, on the team. Like, they're extremely good. They are the ones that built the underlying technology that we're all building on for the AI race now, but they could not hold the team together, and so people went on to found other startups; that's, at a very high level, how we got OpenAI. They are obsessed with building AGI. Demis has said there are multiple breakthroughs needed. He's not done yet. They're
focused on scientific models.
Mathematicians will tell you they think
the Google models are stronger on
mathematics which is part of what makes
this weekend's Math Olympiad results
extra spicy. But their interface is not
what they like to claim. And so if you
see claims from Google, I tend to
believe that technically speaking, it
was exactly what they said, that the
APIs are going to be served correctly,
that it's going to be the most
affordable intelligence in the industry,
and I assume the interface is going to
be terrible, whatever they say, because
I cannot recommend the Google Studio
interface to anybody. It's so hard to
use, and it shouldn't have to be that
way. It shouldn't have to be that way.
But every model maker has a fingerprint.
Every model maker has a trust
fingerprint. With Google, I can trust
that they measure things. I cannot trust
them to build an interface. And frankly,
I also think they have a little bit of
the challenge that xAI faces where they
tend to optimize to tests and the actual
intelligence that's available isn't
always at the same working quality as
the tests. It is not nearly as big a
delta as I sense when I work with the
xAI models, but it does feel like it's
there and it's worth bringing up here.
xAI and Grok. I've talked about them with the Grok 4 release. We're not going
to spend too long on this. Think of them
as an opacity engine. They are
passionate about building AI. The team
works really, really hard. They're super
fast at releasing, but they are so
opaque. They're so opaque. They'll
gesture toward open intelligence, but
they won't release a model card. They
don't adequately solve huge trust
issues. I don't know of a single company
that would trust them with their API as
a result. And when you just optimize for
benchmark scores, you also get users
saying it's not as flexibly intelligent
as it needs to be. Building AI is really
hard. It's okay that they have spent two
years building, building, building, and
they are in the top echelon of model
makers, even if they're not number one.
But that's not okay for them. They need
to be number one. And so with them, it's
a tremendous delta between what they
claim and what actually happens on the
ground. That makes them very difficult to
cover from a news perspective because
you don't know what's real and they're
very good at grabbing the headlines.
Now, I want to close by talking a little
bit about the domain expertise problem
because I think it collides with this
trust issue. One of the reasons it's hard to know how to measure intelligence goes back to the very beginning of this conversation, where I talked about this idea that in economics you can transact and you know what you're getting, and with intelligence you don't. One of the reasons for
that is that the people building
intelligence are approaching it like
code. They're approaching it
technically. They're approaching it with
what they know in the valley. But the
people who have domain intelligence in
all of the fields AI is touching may not
know code, may not know tech, but sure
do know their domain, and they know when
it's right and when it's wrong. And so
part of why I shared the math olympiad
results and the opinion of
mathematicians like Terence Tao is that they are domain experts in mathematics. I'm not. OpenAI sure isn't, but they are.
And it's interesting to me that domain
experts do not align necessarily with
the claims AI model makers make except
in one field, and that field is code.
And the reason why is because the people
building AI are also good at code. And
so as much as we say the reason AI is getting better at code fast is because of reinforcement learning and
because of the great rewards that
running code gives to models. Like it
runs or it doesn't. What a fantastic
reward for a model that trains on
reinforcement learning. Well, at the end
of the day,
it may not just be that it happens to be
good for training models. It may be that
the people building the models know code
and they don't know other fields as
well. And I think that this is going to
become more and more important in this
next two or three years of the AI
revolution because more and more and
more we are going to expect if we
purchase the intelligence it's doing
meaningful work and it's going to be
domain experts outside of tech that
assess that meaningful work. Whether it does or it doesn't is going to be on them to say, not on the labs, but the labs have a tremendous incentive to say they're good. And so we see what is effectively an implicit conflict, where OpenAI is taking a victory lap, awarding themselves the gold medal, and feeling great, and they did clearly make some
kind of breakthrough. And so they
probably feel internally like they
earned it because the answers were
apparently correct. And mathematicians
are much more cautious. They're like,
well, we don't understand how this was
done. We don't know the testing methodology that was used. We don't
understand the model. And critically,
even looking at the proofs themselves,
something feels weird. It feels less
creative. It feels like it's unclear why it attempted five but not the sixth, which was the more creative problem. And in their lived experience with mathematical models so far, they aren't seeing significant gains. The tech people tend to dismiss that; they tend to say, you're the domain experts for now, but just wait, give us six more months and we're going to be the domain experts over here, because AI is going to solve it. They've been saying it's six months away for a while now, and the models keep getting better, and the true domain experts like Terence aren't changing their tune. They
keep saying these models are getting
better, but not necessarily in ways that
are profoundly helpful to me yet. I
think we need to listen to them more.
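The earlier point about code being a great reward, it runs or it doesn't, can be sketched as a toy execution-based reward function (real training pipelines sandbox this; `exec` on untrusted code is unsafe):

```python
# Toy execution-based reward: 1.0 if the candidate code runs and passes
# the test, 0.0 otherwise. Real pipelines sandbox execution for safety.
def execution_reward(candidate_code: str, test_code: str) -> float:
    env: dict = {}
    try:
        exec(candidate_code, env)  # define the candidate solution
        exec(test_code, env)       # assertions raise on failure
        return 1.0
    except Exception:
        return 0.0

good = "def square(x):\n    return x * x"
bad = "def square(x):\n    return x + x"
check = "assert square(3) == 9"
print(execution_reward(good, check), execution_reward(bad, check))  # 1.0 0.0
```

Other domains rarely offer a signal this crisp, which is part of why progress is so lopsided toward code.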
So, wrapping all of this up, the only
way I've found to establish trust in a
model is to use some of these rules of
thumb to understand where you can trust
and where you can't. And so for OpenAI,
I trust the models I have in production
now. Where they do useful work, they
tend to be good. I do not take their
claims super seriously when they're not
in production yet. For Meta, I tend to
assume they're running on a two-year
pendulum and sometime next year they're
going to come up with something amazing
because they bought their way to it, but
it still won't be clear if it's cutting
edge. For anthropic, I trust them to do
the best white papers in the industry,
but I don't necessarily assume that
their wild optimism is correct. For xAI, I don't trust them with a lot right now, because xAI has just hidden so much. And
for Google, tremendously competent
models, but it's hard for me to trust
them to build off of the models onto an
intelligent surface that other people
can consume because they just haven't
shown UX skills. So, those are my benchmarks. Other people may have
different benchmarks, but I wanted to
share this is how I'm parsing and
developing rules of thumb that help me
make sense of this world where I have to
buy intelligence kind of sight unseen.
Does that make sense? Put in the
comments what you think would be a
heuristic for buying things sight unseen
for models. Cheers.