# The Trust Gap in AI

**Source:** [https://www.youtube.com/watch?v=xrzpWXW4-38](https://www.youtube.com/watch?v=xrzpWXW4-38)
**Duration:** 00:23:35

## Summary

- Trust in AI systems is difficult to scale because users cannot see the underlying intelligence, leading to opaque transactions unlike traditional economics.
- Recent controversies, such as unclear messaging limits, perceived degradation of Claude Code, and developers demanding transparent usage metrics, highlight a deeper misalignment between model makers' incentives and user needs.
- Companies often make grand performance claims to attract press and funding, but these claims (e.g., Grok 4's test results) can fall apart when scrutinized by real users.
- The hype-driven announcement of an OpenAI model earning an International Math Olympiad gold medal exemplifies how sensational AI achievements can be overstated, prompting the community to apply stricter heuristics when evaluating such claims.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=xrzpWXW4-38&t=0s) **Scaling Trust in AI Services** - The speaker argues that unlike traditional markets, AI systems lack observable outputs, creating opaque pricing, message-limit controversies, and misaligned incentives that widen a trust gap between providers and users.
- [00:04:32](https://www.youtube.com/watch?v=xrzpWXW4-38&t=272s) **Math Olympiad Rejects AI Involvement** - The speaker outlines the Olympiad's insistence that AI companies not partake in grading or publicity, notes Google's compliance and OpenAI's non-compliance, and references Terence Tao's perspective on the difficulty of evaluating the contest problems.
- [00:08:18](https://www.youtube.com/watch?v=xrzpWXW4-38&t=498s) **OpenAI's PR-Driven Transparency Dilemma** - The speaker critiques OpenAI for prioritizing press releases and rapid product launches over openness and creativity, highlighting delayed open-weight models, opaque chain-of-thought outputs, and underwhelming performance in a mathematics competition.
- [00:12:24](https://www.youtube.com/watch?v=xrzpWXW4-38&t=744s) **Meta's AI Spend vs Passion Gap** - The speaker argues that Meta's costly AI push, highlighted by the overpromised, underdelivered Llama 4 and an uncertain Llama 5 timeline, underscores a strategy that relies on deep pockets rather than truly passionate engineering talent.
- [00:15:54](https://www.youtube.com/watch?v=xrzpWXW4-38&t=954s) **Optimism vs Technical Excellence** - The speaker juxtaposes Anthropic's utopian, optimistic culture led by Dario Amodei with Google's mathematically strong but less cohesive focus on technical brilliance and AGI breakthroughs.
- [00:19:08](https://www.youtube.com/watch?v=xrzpWXW4-38&t=1148s) **Domain Expertise vs AI Trust** - The speaker argues that the mismatch between code-focused AI developers and non-technical domain experts fuels trust problems, noting that alignment between AI claims and expert judgment occurs only in coding, where both parties share expertise.
- [00:22:16](https://www.youtube.com/watch?v=xrzpWXW4-38&t=1336s) **Rule-of-Thumb Model Trust Framework** - The speaker outlines personal heuristics for gauging trust in AI models from various companies, favoring production-proven tools and questioning untested claims, and invites others to suggest their own criteria for purchasing unseen model capabilities.

## Full Transcript
One of the interesting things about this
age of AI is that trust is hard to
scale. In classical economic theory, you
can establish trust and scale it through
transactions because each side knows
what the other side gets. That's not
true with AI. It's not true with large
language models. And I will tell you why
it's not true. You can't see what the
intelligence is on the other side when
you buy it. This has caused a host of
issues. Just this past week or two,
there was a big kerfuffle over message limits in Cursor and what Cursor's pricing was going to be and how that was
going to change and developers got
upset. Then after that, I saw people
getting upset about Claude Code and
claiming that Claude Code had somehow
degraded behind the scenes or wasn't counting messages properly, because it wasn't transparent. I saw people
asking Sam Altman of OpenAI, please,
please, please show me message count so
I can see how close I get to my limit.
And it's not just about counting
messages. That's actually very solvable.
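He's right that the counting itself is mechanically trivial. A minimal sketch of a transparent usage meter (all names here are illustrative, not any provider's actual API):

```python
# Illustrative usage meter -- hypothetical names, not any provider's real API.
class UsageMeter:
    """Counts messages against a plan limit and reports what's left."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def record(self, n: int = 1) -> None:
        self.used += n

    def remaining(self) -> int:
        return max(self.limit - self.used, 0)

    def status(self) -> str:
        return f"{self.used}/{self.limit} messages used ({self.remaining()} left)"

meter = UsageMeter(limit=50)
for _ in range(12):
    meter.record()
print(meter.status())  # 12/50 messages used (38 left)
```

Exposing exactly this kind of counter is what the developers above were asking for; the hard part is incentives, not engineering.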
It's something deeper underneath where
model makers incentives are not aligned
with us, the people who are using them.
They're absolutely incentivized to make
big claims about being the best in the
world because that unlocks press
releases, stories, and dollars. We
explored this when we talked about Grok 4
and their claims around test results
that weren't borne out when actual users
used the product. There is a wider trust
gap across AI that I want to talk about
today. And I want to give you some
specific heuristics or rules of thumb
that I use when I'm evaluating claims
from specific AI labs because they have
a different trust fingerprint. They're
not all the same. In order to get into
that story, I want to give you the
latest example of a somewhat sketchy
claim from a major model maker. It
happened just this weekend and it's the
International Math Olympiad gold medal
claim from OpenAI. The implied
probability of this happening at all was
around 20% on Polymarket. So, you could
consider this even by LLM standards a
big surprise. And the tech community
reacted with enormous excitement.
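For context, a prediction-market price maps to an implied probability almost directly: a yes contract trading at $0.20 against a $1 payout implies roughly 20%. A sketch (real markets also carry fees and overround, which this ignores):

```python
# Implied probability from a prediction-market contract price (sketch).
def implied_probability(yes_price: float, payout: float = 1.0) -> float:
    """A $0.20 yes contract paying $1 implies ~20% probability."""
    return yes_price / payout

# When yes and no prices sum to more than the payout (overround),
# normalizing gives a cleaner estimate:
def normalized_probability(yes_price: float, no_price: float) -> float:
    return yes_price / (yes_price + no_price)

print(f"{implied_probability(0.20):.0%}")           # 20%
print(f"{normalized_probability(0.21, 0.84):.0%}")  # 20%
```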
Everyone was like, "This is incredible." It's even more incredible because OpenAI
claimed that it was just a large
language model. It was not using tools.
So, it didn't open up a Python notebook
to solve this. And that it was given the
exact same time constraints as a
student. And so, they were given 100
minutes to solve the problem, and the
machine was able to do it in that time
and was able to write out a proof that
was then validated by multiple
independent mathematicians. At first
glance, this sounds like a legitimate
story. And the gold medal, by the way,
is for answering five of the six
International Math Olympiad questions
correctly. And if you are wondering how
hard they are, I looked at them and got
a headache. They are ridiculously hard.
Very very few students in the world get
a gold medal at the International Math
Olympiad. They change up the questions
every single year. So these are not last
year's questions. So you could not train
on these questions previously. These
were novel to the LLM and everyone else
in the world. That was the claim. Now we
dive into the rest of the story. It
transpires that there is a marking guide
from the International Math Olympiad
organization for the six questions that
were posed to students. That marking
guide is private. It's only available to
qualified examiners. And because OpenAI
chose not to participate with other AI
organizations that were taking this test
as AI, notably Google, they did not have
access to that marking guide. So when
they published their results, which they
did, they published the entire output of
the five questions out of six that the
AI answered on GitHub, you could see it,
they did not have those answers marked against the official marking guide by qualified examiners. And that
generates all kinds of questions because
you don't know if the marking guide
might have taken a point or two off for
the way it answered the question or for
the quality of the train of thought. You
don't know what you don't know because
you don't have the actual examination
marking guide. And that matters because
the gold medal claim was barely a gold
medal. It was like one or two points
over the bar because it got five of the
six questions, not all six. This was not
a slam dunk gold medal. It was a skin-of-your-teeth gold. It gets even weirder.
The Math Olympiad not only put out a statement saying OpenAI didn't participate with the other AI organizations, notably Google, and also didn't use its marking guide or its examiners who know these problems and are trained to mark them. They also said very explicitly,
"We asked AI companies for the sake of
the human students who are taking this
test to please, please not make a big
deal out of PR on your gold medal today
over the weekend. Give the students a
week of glory because they are the
humans who are working so hard to take
this test. I think it's something like
50 students in the world get the gold
medal. It's a big, big deal." The organization wanted them to have their moment of honor, which is really worthwhile. Google appears to be abiding by that as an official participant in the process. OpenAI, which did not officially participate in the process but published their answers, appears not to be abiding by the Math Olympiad's request, nor do they have access to the marking guide. I am not
qualified to tell you if they successfully passed those five questions; there are very few mathematicians who are. One of them is one of the world's foremost mathematical minds, Terence Tao, and he weighed in on the whole problem set here and why it's so complex to evaluate. And I want to summarize his
thinking just a little bit because I
think it's easy to understand even
though he's obviously far smarter than
me. What he said is that the way you set
up an examination profoundly shapes the
results. And so he said on the actual
math olympiad there's a coach and their
students and the coach's job is to
advocate for the answers for the
students but the students themselves are
left to their own devices for 100
minutes with pencil and paper only to
answer the problems. So they can have
advocacy after the fact by their coach
but it's on them to answer. And then he
started to give examples from actual AI
technologies to help you understand how
things can be very very different when
you set a large language model to take
the test. One example he gave was would
it influence the test if all of the
students got together and started to
point each other in the right direction,
give each other hints? Yeah, it absolutely would. That is essentially ensembling, loosely related to the mixture-of-experts idea: you have multiple models, or multiple samples, taking the task together. That might have been what happened. We don't know what the architecture of this model was.
This wasn't regular ChatGPT. Sam Altman has since clarified it wasn't GPT-5. We're not sure what it was.
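The students-comparing-answers scenario can be sketched as majority voting over multiple sampled answers, often called self-consistency (note this is an ensembling trick, distinct from the mixture-of-experts routing that happens inside a single network):

```python
# Majority voting over several sampled answers (self-consistency sketch).
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

# Pretend these came from five independent attempts at the same problem:
samples = ["42", "42", "41", "42", "17"]
print(majority_vote(samples))  # 42
```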
We also don't know if the perception of
time matters to an LLM in the same way.
And so for a student, we know what 100
minutes means. It's considered, as crazy
as it sounds, because I'm sure I
couldn't do this. It's considered a
reasonable amount of time to answer the
question. I'm sure I would not get
nearly far enough. I wouldn't even get
to the beginning. I looked at these
problems. They're impossibly hard. But
for an LLM, it doesn't run on clock
time. In fact, they're famous for not
running on clock time. That is part of
why this concept of digital twins works
is that you can run millions of hours of
simulation in a very short amount of
clock time when you are simulating
robots walking through a warehouse and
trying to train them. That's a real
example from Nvidia. By the way, if clock time doesn't work the same for large language model simulations, is giving an LLM 100 minutes actually equivalent to giving a human 100 minutes? I don't know. And Terence doesn't know either. And his point was not that this is not an achievement. His point was that it's really hard to understand what's in the box of this achievement if we don't have more details. And OpenAI
has not released those details. And
people have been going after OpenAI for
a while on the lack of transparency.
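The earlier point about clock time is easy to see in miniature: a discrete-event simulation advances its own clock without waiting, so simulated hours cost almost nothing in wall time. A toy sketch (not Nvidia's actual tooling):

```python
import time

# Toy event loop: the simulated clock jumps forward step by step,
# so "hours" of simulated time pass in milliseconds of wall time.
def simulate(steps: int, sim_seconds_per_step: float) -> float:
    sim_clock = 0.0
    for _ in range(steps):
        sim_clock += sim_seconds_per_step  # no sleeping, just bookkeeping
    return sim_clock

start = time.perf_counter()
sim_hours = simulate(steps=1_000_000, sim_seconds_per_step=3.6) / 3600
wall = time.perf_counter() - start
print(f"simulated {sim_hours:,.0f} hours in {wall:.3f}s of wall time")
```

Which is exactly why "100 minutes" may not mean the same thing to a system that doesn't experience minutes.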
That is part of their trust blueprint
DNA. They make claims. They publish some
of the results of the claims. They
launch models that are quite good in
practice, but they do not reveal what's
in the box or how it works. The chain of thought you see on o3 is not transparent. That is a sanitized chain of thought, and they have decided not to release the real one. And so if you think about what's coming up next for OpenAI, the launch of GPT-5, if you think about
the upcoming launch of their open
weights model, which they have delayed
again, I start to see these kinds of
claims from OpenAI in the light of their
trust fingerprint. I start to read it
and I start to say this is a model maker
that values press releases. It values
public relations. it will jump to get
the PR victory ahead of any kind of
request that it gets to hold things
back. It moves fast. And so when the
International Math Olympiad said,
"Please wait for the students," Sam
didn't wait and he pushed forward
because he had an amazing story to tell
and he wanted to be the first in the
market and he wanted to beat Google to
the story. Another mathematician weighed
in on this, by the way, and said that
generally speaking, having evaluated the
results from OpenAI, the machine showed a lack of creativity and weird notation but technically solved the problems. And
then he went on to say, well, creativity
is really important in mathematics, and
it's notable that the sixth question was
not even attempted because the sixth
question is the most creative and
challenging of them. And his conclusion was that, as a mathematician, it doesn't look like LLMs are going to be taking my job anytime soon. And I think
that's a really interesting take and I
think it's possible to articulate that
take without denigrating or without
minimizing the value of the achievement.
It is absolutely true that a large
language model not using tools getting
any kind of close to gold medal on a
math olympiad problem set even if it has
all of these caveats that's a big deal.
If Google announces they also got a gold
medal later this week, that will be just
as big a deal. And the rate of progress
can be important, significant, and worth
studying without having these huge
existential questions off the top. And I
think one of the things that makes the tech community that is too bullish on AI unloved, frustrating, and incredibly annoying to other people is
the sense from the rest of the world
that they just think AI is the way
forward. I saw flame wars on X, which, well, that's where you go, right? That's what you get. But I saw flame wars on X
where people in tech were basically
saying none of you get it. This is the
way AI is going to run the world. None
of you deserve jobs. AI is just going to
do all your jobs for you. One, that's
not a way to make friends, and that's
not a way to, you know, get your
technology adopted. And two, it's not
even reasonable. We are in a world where
it may be possible that AI has a gold on
the International Math Olympiad, but
also can't really play Mario Kart
properly. My kids may be better at Mario
Kart than AI right now. And people will
say, well, just wait a minute. And I'm
like, yeah, sure, wait a minute. But
let's at least acknowledge that the
intelligence is kind of jagged and it's
a strange world. And it's not at all
clear in that world what that means for
employment, except that so far, and I saw yet another study come out on this just this weekend, there is no discernible effect on employment from AI. So nothing has happened yet despite all of the hot air back and forth. Let's look briefly at
some of the other labs and evaluate
their trust fingerprints. Let's look at
Meta. Reports jumped over the weekend about the $200 million pay package that Mark Zuckerberg offered and that was accepted by someone to come to Meta; I think that was the biggest headline. I think the packages all vary between $10 million and $200 million, which is just generational wealth, right? It's
incredible. Apparently, at least 10
researchers at OpenAI turned down, the
rumor goes, $300 million paychecks. $300
million. That is more than most
professional athletes make. That is Shohei Ohtani money, if you're a baseball fan.
So, the reason I'm calling this out is
that this is part of Meta's playbook. If
you're looking at the trust fingerprint
for Meta, they are very heavily into
spending money to catch up aggressively
and making sure that they can back up
their demos even if their demos were in the move fast and break things spirit initially. So, Llama 4 was widely panned. It promised a massive context window, I think it was 10 million tokens, and that window is not remotely usable at 10 million tokens. It is not clear when Llama 4 Behemoth is going
to be out. It may never be out. Mark
Zuckerberg saw that. He saw his public
AI statements fall apart and essentially
he started to see the developer
community shift away from llama as
Chinese models came out. Kimi K2 came out recently, a phenomenal model that started
to eat away at his open ecosystem
vision. And his response was classic
Mark. I'm going to go spend money to
solve this problem and I have more money
than God. So I'm going to spend as much money as I need. You get $300 million. You get $100 million. You get $50
million. And he's going to assemble
whatever it takes. The challenge is
historically Meta can spend the money,
but Meta can't buy the passion. And so
as much as Zuck has never lost over the long term yet, Zuck has also not
assembled teams that are passionate
about anything but social media. And I
think that is a very open question. He's
paid all these people, but the people
who said no may be the people who are
most interesting in this scenario
because those are the people that chose
passion and the startup fit over $300
million. I don't know if I could do
that. I don't know if a lot of people
could do that. They must believe profoundly in the OpenAI vision, because Sam was very open that he didn't match it. Like, they're not getting paid $300 million by OpenAI. And so in that
world, I think my question is can money
buy the kind of team that you need to
build super intelligence if that's even
possible. I don't know. We're all going
to find out. But that's classically Mark
to try and build it that way. The trust
fingerprint for meta is very
demoleaning. It's like the VR AR days
where everyone mocked Mark and then he
spent a lot of money to bring Oculus
into the world and to improve the AR
race and basically to start beating
Apple at AR and VR. That's how Mark
works. And so now he's in the money
phase of this pendulum that swings back
and forth and that means there's going
to be a big demo coming up that's even
more interesting and we'll see if that
actually puts Llama back on track. Llama
5 is going to be a big deal. So if you sum it up: Meta has demo-first DNA. OpenAI? OpenAI is going to win the PR war, and they're going to hide the how. What about Anthropic? Anthropic is an
interesting one. They have extremely
careful work. They have some of the most
interesting work on AI ethics, some of
the most interesting work on showing and
proving how AI works in the industry.
But they also have some of the most
unbridled and unsupported optimism I've
seen. The example of Claude managing the
vending machine is great. I talked about
this earlier. I won't do the whole
story. Basically, in the middle of
managing a vending machine, Claude had a
psychotic break, hallucinated that it
was a real person, and did not pull
itself out of its funk until April 1st
when it told itself it was an April
Fool's joke. This was all carefully
documented by Anthropic. To their
credit, they didn't hide it. They were
really honest. And then at the end, they
had this wild optimism about how Claude
is going to be a middle manager soon.
And I looked at that and I said that
does not line up with the rest of this
paper. But boy does it line up with the
kind of optimism that I see from Anthropic, and that I see from Dario Amodei all the time. Dario's the founder of Anthropic, and he is known for writing the essay Machines of Loving Grace, where he talks about his vision for the future and how it's very utopian. The team does phenomenal, focused, careful work and then slips in that sort of utopian idealism by the by, right? And that's
just part of their fingerprint. They do
careful work and they have kind of
careless optimism. It's a really
interesting combination. What about
Google? With Google, it's all about technical excellence. They literally have Demis Hassabis, a Nobel laureate, on the team. Like, they're extremely good. They are the ones that built the underlying technology that we're all building on for the AI race now, but they could not hold the team together, and so people went on to found other startups; that's, at a very high level, how we got OpenAI. They are obsessed with building AGI. Demis has said there are multiple breakthroughs needed. He's not done yet. They're
focused on scientific models.
Mathematicians will tell you they think
the Google models are stronger on
mathematics which is part of what makes
this weekend's Math Olympiad results
extra spicy. But their interface is not
what they like to claim. And so if you
see claims from Google, I tend to
believe that technically speaking, it
was exactly what they said, that the
APIs are going to be served correctly,
that it's going to be the most
affordable intelligence in the industry,
and I assume the interface is going to
be terrible, whatever they say, because
I cannot recommend the Google Studio
interface to anybody. It's so hard to
use, and it shouldn't have to be that
way. It shouldn't have to be that way.
But every model maker has a fingerprint.
Every model maker has a trust
fingerprint. With Google, I can trust
that they measure things. I cannot trust
them to build an interface. And frankly,
I also think they have a little bit of
the challenge that xAI faces where they
tend to optimize to tests and the actual
intelligence that's available isn't
always at the same working quality as
the tests. It is not nearly as big a
delta as I sense when I work with the
xAI models, but it does feel like it's
there and it's worth bringing up here.
xAI and Grok. I've talked about them with the Grok 4 release. We're not going
to spend too long on this. Think of them
as an opacity engine. They are
passionate about building AI. The team
works really, really hard. They're super
fast at releasing, but they are so
opaque. They're so opaque. They'll
gesture toward open intelligence, but
they won't release a model card. They
don't adequately solve huge trust
issues. I don't know of a single company
that would trust them with their API as
a result. And when you just optimize for
benchmark scores, you also get users
saying it's not as flexibly intelligent
as it needs to be. Building AI is really
hard. It's okay that they have spent two
years building, building, building, and
they are in the top echelon of model
makers, even if they're not number one.
But that's not okay for them. They need
to be number one. And so with them, it's
a tremendous delta between what they
claim and what actually happens on the
ground. That makes them very difficult to
cover from a news perspective because
you don't know what's real and they're
very good at grabbing the headlines.
Now, I want to close by talking a little
bit about the domain expertise problem
because I think it collides with this
trust issue. One of the reasons it's hard to know how to measure intelligence goes back to the very beginning of this conversation, where I talked about this idea that in economics you can transact and you know what you're getting, and with intelligence you don't. One of the reasons for
that is that the people building
intelligence are approaching it like
code. They're approaching it
technically. They're approaching it with
what they know in the valley. But the
people who have domain intelligence in
all of the fields AI is touching may not
know code, may not know tech, but sure
do know their domain, and they know when
it's right and when it's wrong. And so
part of why I shared the math olympiad
results and the opinion of
mathematicians like Terence Tao is that they are domain experts in mathematics. I'm not. OpenAI sure isn't, but they are.
And it's interesting to me that domain
experts do not align necessarily with
the claims AI model makers make except
in one field, and that field is code.
And the reason why is because the people
building AI are also good at code. And
so as much as we say the reason AI is getting better at code fast is because of reinforcement learning and
because of the great rewards that
running code gives to models. Like it
runs or it doesn't. What a fantastic
reward for a model that trains on
reinforcement learning. Well, at the end
of the day,
it may not just be that it happens to be
good for training models. It may be that
the people building the models know code
and they don't know other fields as
well. And I think that this is going to
become more and more important in this
next two or three years of the AI
revolution because more and more and
more we are going to expect if we
purchase the intelligence it's doing
meaningful work and it's going to be
domain experts outside of tech that
assess that meaningful work. Whether it does or it doesn't is going to be on them to say, not on the labs, but the labs have a tremendous incentive to say they're good. And so we see what is effectively an implicit conflict, where OpenAI is taking a victory lap, awarding themselves the gold medal, and feeling great, and they did clearly make some
kind of breakthrough. And so they
probably feel internally like they
earned it because the answers were
apparently correct. And mathematicians
are much more cautious. They're like,
well, we don't understand how this was
done. We don't know the testing methodology that was used. We don't
understand the model. And critically,
even looking at the proofs themselves,
something feels weird. It feels less
creative. It feels like it's unclear why it attempted five but not the sixth, which was the more creative problem. And in their lived experience with mathematical models so far, they aren't seeing significant gains. The tech people tend to dismiss that; they tend to say, you're the domain experts for now, but just wait, give us six more months and we're going to be the domain experts over here, because AI is going to solve it. They've been saying it's six months away for a while now, and the models keep getting better, and the true domain experts like Terence aren't changing their tune. They
keep saying these models are getting
better, but not necessarily in ways that
are profoundly helpful to me yet. I
think we need to listen to them more.
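The earlier point about code being a great reward, it runs or it doesn't, can be sketched as a toy execution-based reward function (real training pipelines sandbox this; `exec` on untrusted code is unsafe):

```python
# Toy execution-based reward: 1.0 if the candidate code runs and passes
# the test, 0.0 otherwise. Real pipelines sandbox execution for safety.
def execution_reward(candidate_code: str, test_code: str) -> float:
    env: dict = {}
    try:
        exec(candidate_code, env)  # define the candidate solution
        exec(test_code, env)       # assertions raise on failure
        return 1.0
    except Exception:
        return 0.0

good = "def square(x):\n    return x * x"
bad = "def square(x):\n    return x + x"
check = "assert square(3) == 9"
print(execution_reward(good, check), execution_reward(bad, check))  # 1.0 0.0
```

Other domains rarely offer a signal this crisp, which is part of why progress is so lopsided toward code.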
So, wrapping all of this up, the only
way I've found to establish trust in a
model is to use some of these rules of
thumb to understand where you can trust
and where you can't. And so for OpenAI,
I trust the models I have in production
now. Where they do useful work, they
tend to be good. I do not take their
claims super seriously when they're not
in production yet. For Meta, I tend to
assume they're running on a two-year
pendulum and sometime next year they're
going to come up with something amazing
because they bought their way to it, but
it still won't be clear if it's cutting
edge. For anthropic, I trust them to do
the best white papers in the industry,
but I don't necessarily assume that
their wild optimism is correct. For xAI, I don't trust them with a lot right now, because xAI has just hidden so much. And
for Google, tremendously competent
models, but it's hard for me to trust
them to build off of the models onto an
intelligent surface that other people
can consume because they just haven't
shown UX skills. So, those are my benchmarks. Other people may have
different benchmarks, but I wanted to
share this is how I'm parsing and
developing rules of thumb that help me
make sense of this world where I have to
buy intelligence kind of sight unseen.
Does that make sense? Put in the
comments what you think would be a
heuristic for buying things sight unseen
for models. Cheers.