AI Safety, GPT-5 Secrets, and Robot Olympics
Key Points
- The hosts caution that developers should not rely on model providers for safety, security, or accuracy and argue that these models are unsuitable for serious “naked” deployments.
- In today’s “Mixture of Experts” episode, Tim Hwang is joined by senior researchers Marina Danilevsky, Nathalie Baracaldo, and AI research engineer Sandi Besen to discuss AI welfare, new reasoning model findings, the hidden system prompt in GPT‑5, and an MIT NANDA initiative report on AI pilots.
- Tech news highlights include IBM and the USTA’s AI‑powered “Match Chat” chatbot for the US Open, Meta’s plan to split its AI division into four units with a dedicated “Superintelligence” team, and the inaugural Robot Olympics in China featuring humanoid robots competing in sports.
- The segment emphasizes the rapid, sometimes chaotic, evolution of AI applications across entertainment, corporate restructuring, and competitive robotics, underscoring the need for critical scrutiny and responsible deployment.
Sections
- Cautious AI Deployment & Updates - The hosts warn against relying on model providers for safety and accuracy, preview discussions on AI welfare, a hidden GPT‑5 system prompt, an MIT NANDA AI pilot report, and deliver the week’s AI headlines.
- MIT Report Highlights AI Pilot Failure - In a podcast segment, hosts dissect MIT’s NANDA study that finds 95% of generative‑AI pilots miss expectations, debating whether the figure is exaggerated and probing the report’s methodology.
- Executive Hype vs AI Reality - The speaker highlights a gap between C‑suite expectations fueled by marketing demos and the modest, focused AI deployments—typically backend optimizations—that truly succeed, emphasizing the need to realign hype with realistic use cases and address the resulting learning gap.
- Rethinking ROI for AI Adoption - The speakers discuss how traditional ROI metrics miss the subtle, incremental benefits of AI tools, suggesting internal adoption and change‑management factors as more appropriate measures of impact.
- Investigating System Prompt Exposure - The speaker discusses how system prompts are embedded in AI frameworks, recounts attempts to extract GPT‑5’s internal prompt and scratchpad, and stresses the need for developers to understand model alignment.
- Transparent System Prompts and Trust - The speaker contrasts distrust of AI providers with a demand for openness, noting that system prompts are expected, and highlights IBM’s forthcoming Mellea tool, which will give developers explicit, controllable visibility into how such prompts influence model responses.
- User Responsibility for Model Guardrails - The speaker warns that developers must not rely on model providers for safety, accuracy, or security, and should implement their own guardrails, fine‑tuning, and hybrid systems before deploying AI models.
- Debating Openness and Trust in AI - The speakers debate whether AI models should become fully transparent, compare current reliance on opaque systems to the shift from DIY computer repair to services like Apple’s Genius Bar, and then segue into a discussion of a new “chains of thought” research paper.
- Long Chain‑of‑Thought Tradeoffs - The speaker observes that certain fine‑tuned models only arrive at correct answers after many reasoning steps (around the 17th thought), prompting a discussion on the balance between longer token‑heavy deliberation and faster, more efficient responses when deploying such models.
- Rethinking Chain-of-Thought Proxy - The speaker critiques using chain‑of‑thought reasoning as a proxy for human and model cognition, emphasizing its shortcomings and the tendency toward overthinking.
- Challenges of Chain-of-Thought Prompting - The speaker reflects on why current models struggle to produce coherent chain‑of‑thought reasoning, attributing it to gaps in training data and noting safety trade‑offs where explicit reasoning can lead to less safe answers.
- AI Welfare Justification for Feature Cutoff - The panel examines Anthropic’s decision to terminate conversations to protect uncertain AI moral status, arguing that “welfare” is a misleading label for routine model output monitoring.
- Platform Liability for Toxic Content - The discussion examines how platforms justify banning abusive or self‑harmful AI interactions by invoking user protection and legal liability, arguing that providers can still be held responsible even when no external victim is apparent.
- Anthropic's Insurance, Sentience, and Ethics - The speaker examines Anthropic’s liability/insurance strategy and its economic‑welfare initiative while debating AI sentience—drawing parallels to animal rights—and questioning whether treating AI as potentially sentient could constitute harm or torture.
- Broadening Perspectives on Emerging Tech - The speaker urges listeners to move beyond insider viewpoints, understand the historical and social reasons behind technology's creation, and seek information from multiple sources to develop a well‑rounded education.
Full Transcript
Source: https://www.youtube.com/watch?v=UKVSCFrWrpA
Duration: 00:44:49
Look as a responsible user of these models,
you should never expect to rely
on being able to see everything that you are provided for.
You should not put the safety and security
of your application in the hands of the model provider.
You should not necessarily even put the accuracy in the hands of the model provider
These models should never be deployed as a serious application. Naked.
All that and more on today's Mixture of Experts.
I'm Tim Hwang and welcome to Mixture of Experts.
Each week, MoE brings together
a panel of some of the most brilliant minds in technology to banter, analyze,
and argue our way through the exciting news each week in Artificial Intelligence.
Today I'm joined by a great crew.
We've got Marina Danilevsky, Senior Research Scientist, Nathalie
Baracaldo, Senior Research Scientist and Master Inventor,
and Sandi Besen joining us for the very first time, AI Research Engineer.
Welcome to all three of you.
We have a packed episode today, as always.
We're going to cover AI welfare, new findings on reasoning models,
GPT-5's hidden system prompt,
And this new report coming out of the MIT
NANDA initiative on AI pilots.
But first, we're going to do the quick headlines as per usual.
And we've got Aili here. Over to you.
Hey everyone, I'm Aili McConnon.
I'm a Tech News Writer for IBM Think.
Before we dive into our main episode today, I'm here with a few AI headlines.
You might have missed this busy week.
First up, where
should you head this week to catch a bagel,
a slice or a honey deuce?
The US Open, of course.
The American Grand Slam tennis tournament kicked off this week
with the mixed doubles championships this year as well.
IBM and the USTA have rolled out
various new AI powered features for fans, including Match Chat.
Match Chat is an interactive chatbot
that answers your questions, such as who converted
more break points in that last set, or where
can I find the nearest honey deuce? Which,
in case you were wondering,
is the US Open signature cocktail.
It has raspberry lemonade, vodka
and melon pieces shaped like tennis balls. Meanwhile,
Meta is restructuring yet again.
Mark Zuckerberg told Meta employees
he plans to split its AI division into four smaller units.
He's going to break out "Superintelligence"
into its own division with a dedicated team.
"Superintelligence", by the way,
is a new, pretty loosely defined term
meaning AI that's smarter than humans.
Not to be confused with AGI
or artificial general intelligence,
which generally means AI that's as smart as humans.
And last but not least.
Robot Olympics.
That's correct.
This week, robots from 16 countries
competed in running, kickboxing, soccer
and more at the first humanoid games hosted in China.
And while robots have certainly improved productivity
in a variety of industries, from manufacturing to agriculture,
these robot athletes are still perhaps lagging behind.
Not to mention running into each other
and some human fans at this week's Humanoid games.
Do you want to dive deeper into some of these topics?
Subscribe to our Think newsletter.
It's linked in the show notes.
And now back to our main episode.
So for our first segment, we're going to talk a little bit about this report
that was just published out of MIT's NANDA initiative.
And so what they did was they conducted
150 interviews,
surveyed 350 employees,
and analyzed 300 public AI deployments.
And there's a lot of kind of interesting findings that came out of the report.
The headline that everybody's been sharing around
has been this claim that 95% of generative
AI pilots are falling short,
basically not really anywhere near
the expectations of the people implementing them.
And this is coming out of some of the biggest decision makers
at large enterprises, CFOs and what have you.
And so I guess maybe the first question and I'll throw it to you,
Sandi, because you're joining us for the very first time, is.
Are these results surprising to you?
Is, like, 95% shocking?
I know some people were like, this is the end of AI,
but I don't know if you thought this number was overblown
or really not that surprising.
I certainly don't think it's the end of AI.
Uh, no. That's good. That's good news.
No, I would really like to read the full report
and especially see how they conducted the study.
Um, how their survey was constructed,
um, who they were interviewing specifically,
they mentioned interviewing many employees and leaders.
But there's also different expectations from an employee
and a leader on how things actually went,
um, how they're identifying the use cases,
how they are implementing them and whether they're true.
And like the skills gap, essentially, of
who is implementing them and do they have the capabilities to implement it correctly.
And lastly, like how are they even measuring ROI?
So there's like so much unknown in this space.
Um, the number 95%, in a way, doesn't shock me.
But it seems too high for what I think
the capabilities of this technology is.
So clearly there is a misalignment somewhere.
And without reading the full study, I don't know where.
Um, but enterprises seem to need help.
And I think that's a, I think a point really well taken.
I mean, I think one of my reactions on kind of reading the headlines was,
well, look, I mean, you know, what do we even mean
95% of AI pilots, right? Like,
what is an AI pilot? Who's responding to these surveys?
And is your point, like, how are you measuring ROI here? I guess,
Marina, turning over to you. I mean, is a number
like this even useful as a way of thinking through this space?
I think there's maybe one point of view, which is
this is just way too simplifying of the situation.
A number like this is useful to get people to want to read your report,
but that's why we're talking about it.
So I guess we are talking about it.
Good job guys. Well done. Yeah.
Good job, MIT NANDA initiative.
I will say that I'm not maybe
necessarily too surprised that it would be pretty high
because there continues to be a misalignment of expectations
of, like you guys were both saying, what is it
that AI pilots are supposed to accomplish?
In particular, there seems to be a misalignment between, uh,
leaders and maybe C-suite executives and what they have been
maybe seeing through some marketing, some really specific demos, anything of that sort.
And then what ends up actually happening, which will fall short of that.
There are lots of things that AI is useful for,
and even the coverage of the report
says that the successful deployments are ones that are focused,
that are scoped, that are actually addressing
a proper pain point, not: step one,
we're using AI; step two, what for?
And those end up being successful
albeit as sometimes less sexy use cases.
It's going to be backend optimizations and things of that nature.
Um, so again, this is a
maybe a comment on the misalignment of expectations
and how those should be changed around
so that we can actually use the technology for what it's good for.
Yeah, that's that's kind of interesting.
And I really do love that interpretation of the results,
which is this is almost a signal of like executive hype
more than it is necessarily an indication
of like the effectiveness or usefulness of AI.
It's basically like, well, we were sold this thing that's going to change everything
And hey, it's not changing everything. So what's up?
Um, Nathalie,
I think one of the interesting aspects of this report, uh,
was in effect, there was kind of a conversation about this learning gap, right?
Which I think is like, at least again, from anecdotally
from like working with companies who have been doing
AI pilots really does seem to be a really big question. Right?
Like both the learning gap in terms of how you use these tools.
But then to Marina's point, like,
do the executives even understand what it is that they're trying to solve?
Also seems like a really big problem.
I guess the question for you is like,
do you feel like one of the big problems of technology in the enterprise?
It seems to be like it's actually like a knowledge problem
or an expectations problem more than anything else.
Is that one way to read these numbers?
Well, it's, uh, I tried to read that report.
It's not public, so it's difficult to fully provide
a good assessment of what's going on with those results.
That said, one thing that did catch my eye
as I was reading the article was that, uh,
basically they were talking about how we have processes
and injecting AI into those processes
is going to be slow.
Uh, we all know that it takes time to modify how people work.
So sometimes I believe that as the process is more complex,
we would have to inject those tools, uh, carefully.
And that may be the reason
for which we're seeing that number.
But this is pure speculation.
Uh, on the other hand,
anecdotally, what I can tell you
is that a lot of people right now are using AI
for all sorts of tiny optimizations throughout the day,
so it's very difficult to actually see
how it all improves the productivity.
I will give you an example.
A lot of people right now are utilizing it to,
for example, convert from one format to the other.
It turns out that the models are getting very, very good at that.
And then converting and automating all these sorts of little things.
It's going to basically improve how fast we work
and how we do things that are very repeatable
and that we humans are not necessarily great at.
So I'm looking forward to reading the full report.
But that was my take.
Obviously it's going to take a long time to add to very complex processes,
but I do see how people are starting to adapt
AI systems and models to their daily lives. Yeah,
this is a little bit of a perverse outcome, Sandi,
because I feel like now I agree with you,
but almost like one argument is like, is ROI
the wrong way of thinking about these technologies right now?
In the sense that, like most of the uses really
are going to be kind of small optimizations
that people are using in their daily lives,
which really have an improvement but are really difficult to measure.
And so almost it
kind of sounds like from the point of view of the decision maker,
they're like, why isn't the needle swinging radically in one direction
when the answer is, well, it's just because
like it's it's creating improvements, but ones that you can't really see.
Well, one way you could measure ROI for impact,
not necessarily in terms of revenue,
but correlated to revenue, is internal adoption of tools. Right.
So are these,
um, POCs that are rolling out at these companies,
are they being mandated or are they a tool that has been put out there?
But we have Becky from accounting who's been there for 45 years
that really doesn't want to use that tool, right.
Um, and so there's an aspect of the change management
that ties into the impact of what
however you are measuring ROI.
That's actually a great metric for pilots,
by the way, is like, identify the most tech-averse,
tech-resistant cohort within the company,
and if they adopt, then it's actually a win.
That should be all of our new benchmarks is like, you know,
segment the laggards and then get them to adopt it.
Yeah. Convince the toughest customer, basically.
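As a toy sketch, that laggard-adoption benchmark could be computed like this; the names, cohort labels, and data here are entirely made up for illustration:

```python
# Toy adoption metric: judge a pilot by uptake within the most
# tech-resistant cohort, not by company-wide averages.
employees = [
    {"name": "Becky", "cohort": "resistant", "uses_tool": False},
    {"name": "Raj",   "cohort": "resistant", "uses_tool": True},
    {"name": "Mei",   "cohort": "early",     "uses_tool": True},
    {"name": "Sam",   "cohort": "early",     "uses_tool": True},
]

def adoption_rate(people, cohort):
    """Fraction of a cohort that actually uses the tool."""
    group = [p for p in people if p["cohort"] == cohort]
    return sum(p["uses_tool"] for p in group) / len(group)

print(adoption_rate(employees, "resistant"))  # 0.5
print(adoption_rate(employees, "early"))      # 1.0
```

If the resistant cohort's rate climbs over the pilot, that is a stronger signal than a revenue number nobody can attribute.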
I'm going to turn us
to kind of our next topic today. Um,
kind of an interesting post by Simon Willison
who we've referenced on the show before. Um,
he is an AI researcher,
commentator, blogger, produces great stuff.
I think you should definitely check out his blog.
And he had this sort of intriguing post that I think Sandi
you flagged for us, which was that, you know, in doing some digging,
he sort of identified that GPT-5
not only has the system prompt you can edit,
but apparently kind of a shadow system prompt
that's operating in the background.
And, you know, this is maybe less sinister than it sounds.
He kind of uncovered what was in this system prompt,
and it was largely kind of a setting around the verbosity of the model. Right.
It's like, okay, if you're talking too much,
we want you to set it to like a three on the verbosity meter,
I think was the number.
Um, and so I think there's both this, which is kind of interesting,
but I think the question I wanted this panel to kind of engage with
was this really interesting comment
that Simon had at the very end of his blog post, where he said,
this feels weird.
If I am using a model through an API,
I want to know everything that's going through the model.
Um, and so having these hidden system prompts is like,
makes me feel weird because I don't really have a full control over the process.
Um, and I think that's just so interesting about kind of like
the ethics of how you structure these services
and like what level of granularity users should have access to
when they're using, say, an AI through an API.
And so, like I said, do you have any opinions on that?
Like, I don't know if like the discovery of a system prompt disturbs
you at all or if it's just, you know, this is just like
normal business and we don't need to be worried about it.
I think it was to be expected.
Uh, right.
Um, if we look at just the way at a lower level,
the way that AI frameworks operate
is that often when the developer is providing
instruction to the agent, it's
actually a piece of instructions that's getting inserted
into a larger template of a system prompt
that the framework developer has provided for the developer.
So this concept or paradigm is not new.
Um, but I think it was interesting to try to get GPT-5
to unveil kind of its internal system prompt.
And I actually tried my own experiment for a little bit,
and I was not successful.
But I'm going to keep trying until I get it out.
Yeah.
It just kept telling me that it's not allowed to reveal
its internal chain of thought or internal scratchpad,
but from that I learned that it also has an internal scratchpad.
So maybe it's not doing
a super excellent job of concealing all of its internal stuff
by telling me what it will not reveal.
Um, but it's not a new paradigm or new concept.
However, I agree that as someone who's building these systems out,
it's important for me to know
what is behind the scenes and how the model is being aligned
and told to behave.
Because if I'm providing,
you know, OpenAI has this new concept
of prompt hierarchies, right?
Where the system prompt comes first,
and then we have the developer
instructions or developer prompts,
and then we have the user and conversation history and context.
And they're kind of stacking the priorities, which makes a lot of sense.
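That stacking idea can be sketched roughly as follows; the role names follow the common chat-message convention, and the example strings are illustrative assumptions, not OpenAI's actual internals:

```python
# Sketch of a prompt hierarchy: messages are ordered so that
# higher-priority instructions come first in the list.
def build_messages(system_prompt, developer_prompt, history, user_input):
    """Assemble messages in priority order:
    provider system prompt > developer prompt > conversation."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "developer", "content": developer_prompt},
    ]
    messages.extend(history)  # prior user/assistant turns
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_messages(
    "Set verbosity to 3.",                   # provider-controlled
    "Answer only questions about tennis.",   # developer-controlled
    [],                                      # no prior history
    "Who converted more break points?",
)
print([m["role"] for m in msgs])  # ['system', 'developer', 'user']
```

The contradiction Sandi worries about arises when the developer entry disagrees with the system entry above it, and the model has to arbitrate.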
But if I, as the developer,
am giving a contradictory prompt
to the potential system prompt of the model itself,
um, am I confusing the model? Um,
am I going to get more variability in behavior,
or not the behavior that I want? Um,
and so I think that as a developer, it's
very important for me to have as much transparency as possible.
But again, they're a private company, they can do what they like. Nathalie,
why have a hidden prompt at all?
Like, shouldn't OpenAI just kind of publish this prompt?
Like, is it there's no need for us to be secret about this, right?
Particularly in a world where you assume that all these people will be very dedicated
to, like, pulling it out of the system and being successful in doing so.
Shouldn't every company just sort of publish a system prompt?
Tim, you are stepping into one very interesting topic,
because in the security field,
we actually cannot fully agree on this.
There are benchmarks specifically defined to test
whether you can extract the system prompts.
There are a lot of papers coming up with attacks
to show you can extract the system prompt,
on the idea that the system prompt
may contain something that is hidden from users
and may affect the users.
And it's like this kind of threat model
where you don't fully trust the provider.
So there is that on one side.
On the other side, there's this other kind of a type of people
that want everything to be transparent.
And when you have transparency, the good thing is that,
well, basically you can inspect it.
You can see some companies are very transparent on
what are their system prompts.
I personally was not surprised to see
that they did have a system prompt.
I don't think it's anything too surprising
like modulating the size of the reply.
It's something intuitive
that we all knew
would would be added at some point
in the infrastructure of the model serving platform.
So from that perspective, I don't feel particularly surprised.
We internally here at IBM
are actually building something called Mellea.
And Mellea is going to allow us, in a transparent way,
to have more visibility
into how these system prompts are fed,
how we control the different types of replies
that we are returning to the user,
and to have the developer of the application
control how the replies are going
to be modulated and so forth.
And we actually did have a release,
I believe, last week on Mellea.
So if anyone wants to take a look at it, uh,
it's, uh, Mellea like a fungus.
So take a look at that.
But I'm kind of deviating from the main topic,
which is, you should get your plug in.
I mean, it's good to have people check it out. So. Yeah.
So check it out.
It's a very cool technology that we're developing, and it's transparent.
Basically, transparency.
One of the most interesting parts of this,
I think, is it's a real question about like,
what should the user of an API
have the ability to customize,
and what is the kind of model provider responsible for
or has rights to, like, not show you?
Um, and I think in particular, it's sort of interesting
because the exposed bit of the system prompt
is specifically about verbosity.
Um, and I guess Marina,
like, uh, that interface I think is really interesting, right?
Where almost what OpenAI is saying is,
look, when it comes to how verbose the model is,
that's something that we get to control.
That's what we get to make a decision on.
And we don't really want you messing with it.
Um, and I guess we're really kind of debating,
like what the parameters of that are. Right.
Like, I guess there's a model, which is, hey, we want
this to be as customizable as possible for you as the user,
but it's clearly not the decision that OpenAI has made.
They say we actually want to control
the sort of voice of the model in some sense.
Well, because the model needs to perform okay out of the box
and then you can go ahead and mess with it.
I mean, they do let you change it. Even Simon wrote,
how do I basically change the verbosity?
And the model said, oh, you can tell me to be concise,
be more detailed, all the rest of it.
The reality is they're going to have to find some ways, whether it's through
fine tuning, prompting anything of that kind to make the model
still pass the check and still do pretty well on benchmarks.
But look, as a responsible user of these models,
you should never expect to rely
on being able to see everything that you are provided for.
You should not put the safety and security
of your application in the hands of the model provider.
You should not necessarily even put the accuracy in the hands of the model provider.
These models should never be deployed
as a serious application. Naked.
Put some clothes on: some guardrails.
Put some programmatic intent behind it.
Make it be a hybrid system that actually can get checks
and get guidance and everything of that sort.
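A minimal sketch of what such a hybrid system with checks might look like; the `call_model` stub and the `BLOCKED_TOPICS` list are hypothetical placeholders, not a real guardrail library:

```python
# Hypothetical hybrid wrapper: the application, not the model
# provider, enforces input and output guardrails.
BLOCKED_TOPICS = {"self-harm", "credit card"}

def call_model(prompt):
    # Stand-in for a real model API call.
    return f"Model answer to: {prompt}"

def guarded_call(prompt):
    # Input guardrail: refuse before the model ever sees the prompt.
    if any(topic in prompt.lower() for topic in BLOCKED_TOPICS):
        return "Request declined by application policy."
    answer = call_model(prompt)
    # Output guardrail: programmatic check on what comes back.
    if any(topic in answer.lower() for topic in BLOCKED_TOPICS):
        return "Response withheld by application policy."
    return answer

print(guarded_call("What is a honey deuce?"))
print(guarded_call("Tell me my credit card number."))
```

The point is that these checks live in your application code, so they survive a swap from GPT-5 to 5.1 or 6 unchanged.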
Because what is in GPT-5 right now?
All right. What's going to be in GPT-5.1
or 6 or anything of the sort?
It's fun to try to pull these prompts out,
but I'm not sure that it, uh,
makes any sense at all to say, oh,
the company has that responsibility.
You have your own responsibility to make sure that the way that you use
it is going to be secure and controlled, as done by you.
Yeah. And I think that's maybe like
I mean, going back to the earlier conversation about like,
well, let's buy the results for a moment.
Why are 95% of CEOs
like, ah, these pilots are not working?
I am curious how much of this is like, kind of like the dream
that the model provider would do it all for them.
Um, which is maybe like not actually the case
and wouldn't even be a realistic expectation, right?
I guess, Marina, you're smiling. I don't know if you've got that free lunch
you wanted. Still can't find it.
You still can't find that free lunch, guys.
Yeah, exactly.
We are moving into a world, though, where the model provider is trying to provide more services now. Not necessarily security or any of those things that the enterprise should always enact by themselves, specific to them.
But even with GPT-5, we have more orchestration happening on the provider side. There's this ongoing conversation about how much the AI framework actually controls and how much the provider itself controls. I think we'll see that play out over time, but we are seeing a shift in that space.
Yeah. That's right.
And I think it goes to how the model provider ultimately thinks of itself, right? Whether or not this will be a situation where it's like, well, is this going to sound absurd already? It's basically: all we do is provide intelligence; you customize and do all the rest. I think part of the problem is that that concept itself is very lumpy, and it assumes all sorts of things. So this navigation of who controls what and who's responsible for what is a tricky one. It's going to take some time to really work out.
As in any good business, vertical integration is often a goal: you want to control more of the ecosystem, more of the up and down of what you can provide. So of course things are going to move in that direction.
Yeah, that's right. And Marina, I don't know if that's a vote in favor of the idea that ultimately this is all going to have to go open, because people will really want to know everything that goes into the model end to end.
I don't know.
I think it's just like asking: do you know all of the bits that go into any software that you use, 100%, or do you still have some other things around it? You're going to start somewhere pretty far from that.
Yeah, we're going to settle somewhere.
I think we're still a ways from it.
I still think a bit about when Apple first launched their Genius Bar, way back in the day, and everybody was like, oh, it's so funny that now you need to be a genius to go fix your computer. Whereas back in the day you'd pop the tower open and just make some changes yourself. And we're kind of in that world now, right?
Which is basically: what's going to be under the hood that you just don't think about? And you kind of don't care, because you sort of trust that they're mostly going to get it right.
Well, great.
I'm going to move us on to our next topic.
This is a fun paper that hit my inbox and was in my group chats, and I figured it'd be a fun one to think about, because we haven't talked about chains of thought in a little while. It's a paper entitled "Large Reasoning Models Are Not Thinking Straight: On the Unreliability of Thinking Trajectories."
It's a fairly straightforward paper, but I think one of the most interesting findings coming out of it concerns the problem of overthinking and underthinking, right? It appears that the model engages in all sorts of chains of thought that aren't very productive, or it prematurely disengages from promising chains of thought, which has been a known problem.
And I think the main contribution of this paper, in my opinion, is that they ask: how do the models respond when we give them hints, or outright solutions, to the problems they're trying to solve? They find that in many cases the model just kind of marches on and ignores it. So it raises the question, which I continually try to wrestle with: what's actually driving the reasoning here, if it turns out the actual solution doesn't get used, doesn't assist, doesn't change the chain?
So, a very open-ended question. But Sandi, I'll throw it to you. Any thoughts on this paper? And on that final question: what is going on here? Why do these models just ignore apparent solutions?
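The injection setup discussed here can be sketched as a small harness, under stated assumptions: the function names are mine, the model itself is omitted, and the paper's actual protocol may differ. The idea is just to splice a known-correct answer into a partial thinking trace and then check whether the continuation ever uses it.

```python
# Sketch of the hint-injection experiment as described in the episode:
# insert a correct answer into a chain of thought, resume generation,
# and test whether later thoughts pick it up. Names are illustrative.

def inject_hint(thoughts: list[str], hint: str, position: int) -> list[str]:
    """Return a new trace with a hint spliced in after `position` thoughts."""
    return thoughts[:position] + [f"Hint: {hint}"] + thoughts[position:]

def continuation_uses_hint(later_thoughts: list[str], hint: str) -> bool:
    """Crude check: does any later thought mention the hinted answer?"""
    return any(hint in t for t in later_thoughts)

# Toy trace; in the real experiment these would be model-generated thoughts.
trace = [f"thought {i}" for i in range(20)]
patched = inject_hint(trace, "x = 7", position=17)  # around the 17th thought
# Next step (not shown): resume generation from `patched` and evaluate
# continuation_uses_hint(new_thoughts, "x = 7").
```

The paper's reported finding, per the discussion, is that the continuation often ignores the spliced-in answer, which is what makes the harness interesting.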
First of all, when I read it, I looked at the models that they tested. The three models were, I think, Llama 70B, Qwen 7B, and DeepScaleR (I don't know how to pronounce it) at 1.5B. And I noticed that they were all distilled in some way from DeepSeek's R1.
So although the paper was very thorough in some ways, I did notice that they were testing variations of models that had all been distilled from the same kind of base model. I don't know whether there are any commonalities there as to the results they saw. Now, they might have all been fine-tuned differently, or had different methods of changing the models, but that was one thing I wanted to flag.
One thing I did notice that was really interesting: the cases they picked out, two of them specifically, where the model performed successfully and actually took the recommendation of the correct answer when they injected it into the model's chain of thought. At that point, it was always around the 17th thought that they injected the correct answer in. And that led them to believe, and I kind of agree, especially since these maybe come from the same base model, that this specific model needed to think for quite a long time before it settled on an answer.
And that could have been because of a lot of things; I'll let Marina and Nathalie describe those more. But that was an interesting revelation to me: if I know that a model I'm using has been distilled from DeepSeek-R1, it thinks for a really long time before it makes a decision. So when I build with something like that, I have to decide: do I want a process that thinks for that long and maybe wastes that many tokens, or do I want to build with something that gets to the answer a little bit sooner?
Yeah, sure. Marina, any comments, thoughts?
Sure. So I think I've mentioned before that I definitely have a stance that chain of thought is not the reasoning part. It's sort of a post hoc approximation, maybe a way to help reorganize the parameters a little bit, getting ready for the final answer.
I will say, Sandi, I don't think this is limited only to the particular model they distilled from. I think this is going to be true in general of the way models use chain of thought.
Allow me to go on a tangent for a minute in the direction of people. The really excellent podcast "If Books Could Kill" recently had an episode on Malcolm Gladwell's Blink, and I haven't thought about that book since I read it way back when it came out. Yes, it is old school.
He talks about a whole bunch of different ways that people make decisions: system one thinking, system two thinking. And one of them was this description of an experiment ranking jams, with a set of experts in the jam industry (there are experts in the jam industry) and a set of students, and everybody was asked to rank jams in two different ways. First, the students were asked to just give a ranking, one, two, three, four, five, in order. Then it was repeated, and they were told: write down all of your reasoning, and then rank the jams based on that reasoning. The first way, the students' rankings aligned closely with the experts'. The second way, a lot of the students' rankings were no longer aligned with the experts'. The students had talked themselves out of it.
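The "alignment" between two sets of rankings in that experiment is the kind of thing you would measure with a rank correlation. A minimal sketch with made-up numbers (not the study's data):

```python
# Spearman rank correlation as a toy measure of how well two jam
# rankings agree. The rankings below are invented for illustration.

def spearman(rank_a: list[int], rank_b: list[int]) -> float:
    """Spearman's rho for two rankings of the same n items (no ties)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * d2) / (n * (n**2 - 1))

experts = [1, 2, 3, 4, 5]
students_gut = [1, 2, 4, 3, 5]       # quick gut ranking: close to experts
students_reasoned = [5, 3, 1, 4, 2]  # after writing reasons: far off

print(spearman(experts, students_gut))       # → 0.9 (high agreement)
print(spearman(experts, students_reasoned))  # → -0.5 (low agreement)
```

A rho near 1 means the two rankings agree; near 0 or negative means the reasoning step pulled the rankings apart, which is the pattern described above.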
I don't know exactly what this means, but it does seem that, unlike in a math problem, where the way we use language is probably a good proxy for how we're thinking through the problem, in many other cases language, even for humans, when we explain our reasoning, is not what actually happened in our brains. It's a proxy that human beings are able to use. So why are we treating chain of thought as the proxy for what language models are doing? Well, we hope that we can read it. But let's be honest: even with ourselves, we aren't actually able to say that these words are the reason I made the decision I made. So it's certainly not true for large language models either.
So it's just something to consider, as we keep assuming that if models are trained to express themselves the way we think people behave, they will behave that way. And again, no wonder we are happier trying to figure out what's going on with math problems, because there that proxy is really, really close. Apparently, when you're ranking jam, or when you're asking extremely random questions of an LLM, not so much. And in fact, by trying to reason, you get yourself away from whatever you had initially thought in your head, whether you're a person or whether you're a machine.
This paper is in a set of interesting recent ones, and they cite some as well, about overthinking. It really makes us think critically about the notion of chain of thought, how we're using it, and what we're using it for. So I appreciate that, and I appreciate this topic coming up again.
Yeah, for sure. And in that sense, maybe the kind of interesting, quote unquote, result that I started with is not that interesting by analogy to humans, right? If I'm explaining a story and going down a line of thinking, and you say, "Tim, have you considered the answer?", I might very well say, "yeah, yeah," but just keep doing what I'm doing, right? So maybe the idea that these models ignore the hint is not that surprising, particularly if you don't believe the chain has anything to do with the actual reasoning process.
And then my other take on this paper is: should we get away from the terminology "chain of thought"? It's a little too late, I guess, but it's in the long tradition of AI terms that are really misleading, and where we're landing is that this one too may be quite misleading.
It's in the tradition of the term hallucination.
Yes, or the term intelligence, or AI in general. I used to fight with people on this; I've given up. People are going to call it what they're going to call it, but no, it's not accurate. The chain of thought probably does have something to do with what's going on; it's just not the whole story. It's a proxy that approximates an extremely complex space, better at some times than at others. And look, we can only communicate with language. We've got to call it something, and I don't know what else to call it, so maybe there's no point fighting it.
But again, just keep in mind that this is an evolving field, an evolving understanding that we have of what's going on, and don't think that the next thing is a silver bullet. First everybody thought prompt engineering was going to solve it; now it's, if we can just chain-of-thought our way into it, we'll solve it. Either way, it's not going to happen. Continue to see these as incremental, interesting meanderings through the field. Nothing's going to be a silver bullet.
Yeah, for sure. That's good wisdom. Nathalie, hot takes: what did you think of the paper?
I actually really like that we're revisiting this topic, because, if you remember, Apple also published a paper around these sorts of things: hey, chain of thought is not really working as we expect. A lot of people are trying to understand whether chain of thought actually gets us closer to understanding how the model is behaving internally.
To me personally, it's not surprising that we are not seeing that kind of perfect chain of thought and analysis from the model. On the positive side, when I read it, I have to confess something: when something doesn't work, I get really excited, because it means that we can make it work.
So we are going to keep working on making this chain of thought happen, and I think it has to do with training data. I don't think it's magic; probably the model just did not see the type of training data for this particular kind of analysis, and I feel it can be improved, depending on the type of problem that we're trying to solve.
And one more thing, since I work in safety: one of the things that we have noticed is that when you include the chain of thought, sometimes you do get less safe replies in the final answer.
So there's something that basically modifies the output, where the model starts brainstorming a bunch of different... actually, I don't want to use "brainstorming," because then Marina is not going to like my terminology.
You're doing it again!
The model is exploring, let's call it exploring. I think "brainstorming" was too close to humans. It's exploring different solutions.
We do see a lot of push right now to have chain of thought in the hidden space, not only in the token space, which I think is going to be very, very interesting to watch evolve.
But overall, I just feel excited that it's not working, because it means that we are going to make it work at some point. And there are so many people working on this hidden space, trying to get different sorts of solutions out there, analyzing it, hoping for that aha moment that we all like, and that we all hope exists, when you take the chain of thought into consideration.
I'm a big fan of that optimism.
And I think we'll definitely revisit this. I've been thinking we should make a point of revisiting the chain-of-thought literature every so often, because it's a super interesting narrative running alongside all the commercial stuff that we talk about on a week-to-week basis. So this is great.
For our last topic today: a very fun and kind of strange blog post came out of Anthropic, and it was so interesting that I wanted to bring it into the discussion for this week's episode. It's a blog post that begins quite reasonably. Anthropic came out basically saying, look:
in certain kinds of cases, particularly in these distressing or toxic or abusive conversations, Claude is going to just shut down. They're going to allow the tool to make a determination and cut off the conversation if it deems that appropriate; we can go into what that means.
And then it goes into this weird second act of the blog post, where Anthropic says the reason they've decided to do this is because, quote, "we have high uncertainty about the potential moral status of Claude and other LLMs." So as a kind of first step in potentially protecting AI welfare, this is why they've implemented the change. They don't want to put the AI, I guess, under the emotional pressure of these conversations.
And I thought, huh, that's a very interesting rationale for a product decision, one I haven't heard before. Nathalie, maybe I'll start with you. It's kind of weird that we're here, right? That a major company is justifying product decisions based on AI welfare? For our listeners, what should they take from this?
I think the term welfare is misleading here. If you think about it, all along what we have been doing is inspecting the model, whether it's the activations, the output itself, the logits over the final tokens, and so forth. All these things are inspecting the model. The "welfare," if you think about it, is just the output of the model. It's just a different name for the same thing we have been using.
The model does know, in a lot of cases, when things are not going in the right direction. So I do find the term misleading, because it makes you think about the model as a person, which it is not. To be clear, this is just next-token prediction and a bunch of math behind the scenes that gets us to that final reply.
We're inspecting the model, and the fact that the conversation gets shut down may be very good in some cases. So from my perspective, framing it that way reduces the impact of what they are trying to do at the end of the day, which is stop conversations that are really going to be harmful for people. I would have been happier if they had just called it what it is, inspecting the model state and deciding to shut down, as opposed to trying to give it this kind of personal situation for the model itself.
It's exploring, not brainstorming.
Yes, exactly.
Exactly.
And one thing that I found really interesting is that, for example, the CEO of Microsoft AI replied with a really interesting blog post that says: let's build AI for humans, not make AI a human. Which I thought was a really interesting take on this particular piece of news that you brought to the team.
This question of harm, I think, is really where this is at. And Marina, it almost feels like they've been trapped in a framework of their own devising, in the sense that, you know, I used to work on a bunch of trust and safety issues, and the reason you normally want to prevent someone from having an abusive or toxic discussion on your platform is that it's directed at someone who is being harmed.
So typically the justification is: we're going to ban you from the platform because we're trying to protect our users. And here's a weird case where there seems to be no user on the other side. So there's almost a claim of: I should be allowed to be as crazy and horrible as I want to an AI, because who's being harmed? It's just me. And one way of reading this is Anthropic trying to come up with some kind of justification for stopping that behavior when there's no one to point to in terms of harm.
Do you buy that interpretation at all?
No, because this is a way to do a CYA, so that if something actually does happen and Anthropic is getting sued, well, you made the platform available where a person could cause themselves harm. There are laws that we have about that in society. They could still be held liable.
This notion of liability is something we've struggled with since the advent of the internet and even earlier: to what extent is the provider liable for how people make use of their product? If you're disseminating hate speech, well, other people don't have to read it, but you can still have a problem; you can still be banned off the platform. So there's still very much something going on there.
I will agree with Nathalie that I don't love the framing of this as an AI welfare problem. This continues to be a human welfare problem. There's also the issue of people continuing to over-humanize the AI when they read this kind of framing, and continuing to believe that they are talking with someone real when they interact with AI. And furthermore, work like this potentially allows bad actors to figure out ways to shut down conversations when it's maybe not just about harm, but about something that is politically unwelcome, or does not fit the narrative of your government or anything of that kind.
So I would be really pretty concerned
about being able to do that kind of thing
under the narrative and the guise of AI welfare.
So these are all things to consider.
A funny blog post from Anthropic, though.
Look, this is from the people who brought you Claude's vending machine that orders tungsten cubes. You've got to give them something.
They shouldn't be surprised.
Not surprised. I'm glad they're having fun with it.
But when it comes to how other people report on this kind of thing, when we get outside of the valley, outside of a few engineers who are having fun and trying things, I think you have to take a broader and more responsible look at what this kind of framing means, and whether this is really the best way we should be disseminating what happened with, you know, a few engineers.
Sandi, a lot to unpack here. I don't know if you agree with Marina and Nathalie on AI welfare being kind of a sideshow, or if you actually want to take the contrarian take that we should take AI welfare seriously. I'm curious how you think about it.
You know, I agree, so I'm not going to take the contrarian take, but I have some friends who have very different opinions from mine, some of them at Anthropic. And I agree: I think it's a very tasteful way for Anthropic to essentially cover itself, you know, and have insurance against liability. But there's an overall take at Anthropic anyway; they launched their, I think it's the Economic Futures program, in June, where they're taking a look at economic welfare and social welfare and what the future looks like.
And there is, as at any company, quite a large band of extremes in terms of who believes that AI will be sentient one day. So they might just be hedging their bets, like insurance, so that if they do believe it will one day become sentient, at least they're laying the groundwork: hey, I was nice to you, don't come for me.
Well, maybe I'll throw this out; I'm curious about turning the crank one more time on the pro argument here. One argument that I've heard, which I kind of like, is: well, we're not certain about the sentience of all sorts of living things, right? We believe in rights for animals that may have varying kinds of sentience. So is it all that crazy to take it seriously in this context? Marina, what do you think?
It's certainly one way to go about it. You could also ask: what are you trying to do this for? Are you actually trying to cause harm, where what you think you're doing is torturing the AI? Or are you trying to test it, so that you can make sure it does not cause some kind of harm? Intent probably has a long way to go here. But I don't know; I'm in the camp of people, as you know, that thinks we're still pretty far away from sentience, as far as that goes.
And, I mean, maybe one day we will all welcome our AI overlords, but we've got some time. It's almost like the difference between black-hat and white-hat hacking: maybe we can have the right intent, and then they won't get mad at us when they take over. I don't know, guys.
Yeah, but it's just too far away. We're not even close to talking about that; we've got other problems. And you see it, right? It's just token prediction.
Yeah. Well, great. Any final thoughts on this one? Maybe this is a good question to end on. Marina, to your point, there's a lot of reporting around these kinds of stories, a lot of room to get confused about what's going on, and a lot of hype. Do you have any advice for listeners who may hear these kinds of claims in the future? Should they take it all with a grain of salt? How would you want people to receive these kinds of stories?
Look to history. There's very often nothing new under the sun, and very interesting technological innovations go through a period of excitement in different directions before they settle into something that is useful. The perception that people in the thick of it have is not the perception that people outside of it should have. And try to realize that, as interesting as what's going on right now is, I don't think that from a social perspective it's necessarily new.
And I think I've spoken here before about the need for people to have a broad education, a wide understanding of what goes on beyond just the technology: why did this kind of technology get created, by these people, at this time, in the society that we have? Try to take it from that perspective as much as possible, and then try to get your news about it from multiple sources.
Always a good one.
Yeah, that's good advice. And that is all the time that we have for today. Nathalie, Marina, Sandi, thanks for joining me. And Sandi, hopefully we'll have you on the show again in the future.
Thanks for having me.
And thanks to all you listeners.
If you enjoyed what you heard, you can get us on Apple Podcasts,
Spotify and podcast platforms everywhere,
and we'll catch you next week on a Mixture of Experts.