O1 Preview Sparks Chain‑of‑Thought Upgrade
Key Points
- Agents‑as‑a‑service and multi‑agent teams are expected to become ubiquitous, driving a major shift toward collaborative AI workflows.
- The panel debated the O1 preview’s hype, with Chris eager for new models, Aaron noting the scientific intrigue of chain‑of‑thought learning, and Nathalie highlighting tangible security‑metric improvements.
- The newly released model embeds chain‑of‑thought reasoning and reinforcement‑learning techniques directly into its architecture, boosting its reasoning performance.
- An unusual production schedule meant the episode was recorded before the model’s public launch, illustrating the fast‑paced timing of AI releases.
- The discussion framed large language models as the key enabler for personalized experiences, coining the idea that “an active league is a happy league.”
Sections
- Panel Debates o1 Preview Hype - A mixed‑expert panel examines whether the new o1 model delivers on its promises, while also discussing agents‑as‑a‑service, multi‑agent collaboration, and AI‑driven personalization.
- Chain‑of‑Thought Self‑Education in LLMs - The speaker explains how chain‑of‑thought prompting combined with reinforcement learning lets large language models introspect, iteratively learn from varied problems, and achieve aligned answers without explicit ground‑truth supervision.
- Prioritizing Reasoning Over Raw Answers - The speaker stresses that accurate reasoning—such as step‑by‑step validation for calculations, puzzles, or chain‑of‑thought tasks—is more important than merely predicting the next token, and suggests using reinforcement learning rewards and inference‑time tree searches to train models toward proper logical processes.
- Comprehensive Safety Evaluation of LLM - The speaker explains how a single model’s safety is assessed through diverse metrics—including jailbreaking, hallucinations, and fairness—while leveraging advanced benchmarks and introspection methods to capture a holistic view of model behavior.
- Cascading Errors in Model Reasoning - The speaker discusses model risk levels, how chain‑of‑thought mistakes can propagate and become harder to detect than hallucinations, and references ongoing work on consistency and security.
- Balancing Model Speed and Superintelligence - The speaker highlights the growing division between cheap, low‑latency models and larger, costly ones while questioning when AI benchmarks that surpass PhD‑level performance will translate into self‑reinforcing, superintelligent systems.
- From SaaS Pioneers to AI Threat - The speaker highlights Salesforce’s early role in popularizing SaaS and the resulting industry‑wide SaaS disruption, then questions whether AI‑driven agents will similarly upend every product category.
- Emerging AI Agent Marketplace - The speaker foresees a worldwide platform where users purchase AI-driven tasks from agents—similar to Fiverr—driven by firms that possess superior, faster, and cheaper multimodal data, prompting big tech players like Salesforce to enter the space.
- Narrow-Agent Design with RAG & Unlearning - The speaker suggests replacing broad, human‑centric language models with specialized, domain‑focused agents that leverage retrieval‑augmented generation and machine‑unlearning to selectively add or erase knowledge, enabling fine‑tuned, objective‑driven functionality, and highlights an IBM‑Salesforce partnership to advance this strategy.
- Balancing Deterministic and Exploratory AI - The speaker explains how to configure pipelines that decide when to use stochastic exploration versus deterministic retrieval (e.g., RAG) to maintain trustworthiness, combine human oversight with technical safeguards, and apply this approach in business settings before shifting to a discussion on fantasy football.
- Massive Fantasy Sports Platform Metrics - The speaker outlines their eight‑year‑old consumer‑facing fantasy sports service, highlighting 12 million registered users, billions of page views and insights, 5,000 requests per second, and its predictive injury‑detection and trade‑analysis capabilities.
- Generative AI Unlocks Scalable Personalization - The speaker explains how generative AI can break the bottleneck of producing countless personalized variants by leveraging comprehensive customer data platforms to automate role‑play‑style content customization.
- Personalizing Infinite Scoring Sentences - The speaker explains how they merge edge‑generated fill‑in‑the‑blank templates with percentile‑based adjectives to tailor AI‑generated messages for countless scoring rules, creating a “theory‑of‑mind” personalization layer that delights users.
- Model Unlearning for Poisoning & Hallucinations - The speaker explains how unlearning methods can patch AI models to excise poisoned data, copyrighted material, and persistent hallucinations without the need for costly full retraining.
- Future of Surgical LLM Fine‑Tuning - The speaker envisions LLM development shifting toward precise, “surgical” interventions—targeted activation control, unlearning, and advanced visualization—to selectively edit and fine‑tune model behavior.
- Balancing Forgetting in Image Models - The speakers explain how loss functions and mixture‑of‑experts architectures can be optimized to control what image‑to‑image generation models remember or discard, concluding with a podcast sign‑off.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=KPsl7IK2_eo](https://www.youtube.com/watch?v=KPsl7IK2_eo)
**Duration:** 00:47:56
Can models officially reason now?
They have risk levels for models.
I think, uh, we're still good.
So no terminators inside.
Are agents as a service the
new software as a service?
Agents are going to be everywhere.
And multi agents, uh, you know,
operating in teams and crews, multi
agent collaboration is going to be huge.
Are LLMs the true unlock for personalization?
We came up with the notion that an
active league is a happy league.
I'm Bryan Casey and I'm joined this week
by a world class panel of experts across
engineering, research, and product, and we're
excited to get into this week's news in AI.
This week we have Nathalie Baracaldo,
Senior Research Scientist, Master Inventor.
Aaron Baughman, IBM Fellow, Master
Inventor, and Chris Hay, Distinguished
Engineer, CTO, Customer Transformation.
Alright, so, as every week, we start
out with a quick, hot take question.
And this week's question is,
Was o1 Preview worth the hype?
We'll start with you, Chris.
I live for the hype, and
I wait for the next model.
You know, where's my new model?
It's been a week already, so it makes sense.
Um, Aaron, what about you?
Um, yeah, I think scientifically this
whole chain of thought, allowing systems
to teach itself, is very interesting.
Um, but I need to wait and see how
it works out in the implementation
details in the application space.
And Nathalie,
I think the new model, it's really
interesting from the security perspective,
some of the metrics that they show
really demonstrate improvement.
So I'm very excited about it.
All right, well, let's jump right into it.
And actually, that's going to
be our first topic this week.
So it was funny.
So this model released last Thursday,
actually, and here's a little inside
baseball for our listeners of the
show: we usually record this show
on Thursdays every week, um, and then we
go into production and we release the show
Friday morning. This one week we didn't do
that, and we actually recorded the show on
Wednesday, and then of course the model came
out on Thursday. So that's just the way of the
world. But, um, this was an announcement that
has been hyped for a long time. For anybody
who's on, uh, you know, Twitter or X, you've
seen the memes around Strawberry for what
feels like an eternity, um, at, at this point.
Um, and then it's finally here.
It happened.
The model arrived, um, and it wasn't,
you know, just released as a blog post.
It was actually rolled out, um, to, to the
broad user base within their, uh, within
ChatGPT, and I think some of the APIs as, as well.
Um, and the interesting thing about
this model is that it introduced chain
of thought and reinforcement learning
techniques, um, into the model itself.
So not as a way to interact with the model,
but as a way that is an embedded capability
inside of the model, which is, um, definitely
kind of a new approach, um, there. And
we've seen pretty important and noteworthy
improvements in reasoning capability
as a result from that. And so maybe, just
because, Aaron, you, um, touched
on this specifically in your first answer,
um, I actually just want to start with the, the
interesting kind of science around including
chain of thought and reinforcement learning
techniques within the model itself, and just
get your reaction to that and why it's important,
the things that are exciting to you about it. And
maybe, you know, you raised a couple questions,
um, even in your initial answer, but, like, some
of the things that you're still waiting to see.
Yeah, I think it's really fascinating
how, you know, chain of thought it was
really introduced in 2022, and it's just
accelerated and expanded to become the self
education for large language models by 2023.
And here we are today with strawberry, you know,
so it's great how fast technology is moving.
And what I really like about this is that
these chains of thought, it helps us to
introspect the mind of a generative AI system.
And what happens is, is that, you
know, you might seed a system with
problems and answers so that you know
whether the chain of thought helped
to induce the right answer.
And then later, um, you can keep iterating
over and over and over with these chains
of thoughts, with newer variations of the
problem, so it learns new skills over time.
And you create variations of these problems, and
you have almost like a panel of these generative
AIs, um, answering with different strategies.
And if all of the answers align towards,
um, you know, the same answer,
even though you don't have a ground
truth, then more than likely the chain of
thought is working because it's converging
towards, uh, a less variant type answer.
And through all of this looping, we have
these gradient updates so that it can
learn, you know, uh, more and more, you
know, so we have gradient updating and
we have in context learning put together.
But, um, the last thing I'll mention
is that what I really liked is how
they broke out reinforcement learning
as what they call train time compute.
Um, and then they have this
thinking time as test time compute.
You know, so the thinking time is kind
of when it iterates, and it's passing
these many chains of thoughts, you know,
along, you know, and then, um, the train
time is where it's doing this in context
learning and perhaps doing, you know, some,
uh, you know, fine tuning, um, in there.
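The convergence idea Aaron describes, a panel of sampled answers agreeing even without a ground truth, is essentially self-consistency voting; a minimal sketch, where `toy_model` is a hypothetical stand-in for sampling a model at nonzero temperature:

```python
import random
from collections import Counter

def self_consistency(sample_answer, problem, n_samples=9):
    """Sample several independent chains of thought and keep only the
    final answers; without a ground truth, agreement across the panel
    is the signal that the reasoning is converging."""
    answers = [sample_answer(problem) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    confidence = votes / n_samples  # fraction of the panel that agrees
    return best, confidence

# Toy stand-in model: mostly answers "42", occasionally drifts.
random.seed(0)
def toy_model(problem):
    return "42" if random.random() < 0.8 else "41"

answer, conf = self_consistency(toy_model, "6 * 7 = ?")
```

The majority answer wins, and the agreement fraction doubles as a crude confidence score.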
That's great.
And Chris, I know you in particular have
been, you know, I've been following you
following this, um, on Twitter as well.
So, you know, maybe give you some space to
have like a more open ended kind of reaction,
um, to, to just the release of this and, you
know, the extent to which like it, it was, it
wasn't what you were expecting it to be, but
would love to just get your, um, your thoughts.
I thought it was super interesting.
I mean, just to sort of go on sort
of some of Aaron's points there.
The reinforcement learning
part is really interesting.
So if you think of the chain of thoughts for
a second, then if you're solving something
like a puzzle, so maybe you want to Get it
to solve a Sudoku grid, or maybe you want
it to, uh, calculate, you know, how, which
book, if Phileas Fogg was listening to the
Harry Potter books, and, uh, what book would
he be on by the time he got to India, right?
If you think of those sort
of questions, then, yeah.
The answer isn't actually the important
part, of course you want an accurate answer.
The big thing you really want to do
is, is reason in the correct way.
And the sort of things that you want the model
to be able to do is, sort of calculate the
distance to, uh, India, for example, calculate
the sleep time of Phileas Fogg, how long the
Harry Potter books are, and then be able to
sort of validate those steps all the way across.
And, and that would be similar for something
like validating a Sudoku puzzle, right?
Okay.
You want to check the horizontals, the
verticals, the sub mini grids, etc.
Check each individual number and
then see if there's any duplications.
That logic is more important than just
trying to predict what your next token is.
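Chris's Sudoku example, checking horizontals, verticals, and sub-grids rather than trusting a final answer, can be made concrete with a short illustrative validator (not from the episode):

```python
def valid_sudoku(grid):
    """Check a completed 9x9 grid the way Chris describes:
    horizontals, verticals, then the 3x3 sub-grids."""
    full = set(range(1, 10))
    rows = all(set(row) == full for row in grid)
    cols = all(set(col) == full for col in zip(*grid))
    boxes = all(
        {grid[r + i][c + j] for i in range(3) for j in range(3)} == full
        for r in range(0, 9, 3) for c in range(0, 9, 3)
    )
    return rows and cols and boxes

# A known-valid completed grid, plus a corrupted copy for contrast.
solved = [
    [5,3,4,6,7,8,9,1,2],
    [6,7,2,1,9,5,3,4,8],
    [1,9,8,3,4,2,5,6,7],
    [8,5,9,7,6,1,4,2,3],
    [4,2,6,8,5,3,7,9,1],
    [7,1,3,9,2,4,8,5,6],
    [9,6,1,5,3,7,2,8,4],
    [2,8,7,4,1,9,6,3,5],
    [3,4,5,2,8,6,1,7,9],
]
broken = [row[:] for row in solved]
broken[0][0] = 3  # duplicate in row 0 and column 0
```

Each check mirrors one of the verification steps a model would need to reason through, which is exactly the kind of step-level process a reward model can score.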
So if you think of reinforcement learning there,
the reward model that you could have there on
training time can be a lot more accurate, right?
You can give higher rewards for
each step that it calculates and
then, uh, how it gets it right.
And that means that you can actually
train the model towards doing the right
type of chain of thought over time.
So I actually think that is
the proper innovation there.
The other thing that I see there is
this, this shift to inference time.
And I really like that.
It takes, sort of, 32 seconds, whatever.
I suspect that there is some sort
of tree search going on there.
I suspect they're generating
multiple chains of thought.
I suspect that as you go down each node of
the tree, you're then probably iterating
that further to get to some of these answers.
Hence why the thinking time will increase.
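Chris's hunch about inference-time search, multiple chains of thought generated and aggregated, can be sketched as a tiny beam search over partial reasoning chains; `toy_propose` and `toy_score` are hypothetical stand-ins for model calls, not anything OpenAI has documented:

```python
def beam_search(propose_steps, score, beam_width=2, depth=3):
    """Keep the top-scoring partial chains of thought, expand each with
    candidate next steps, and re-rank; deeper search means longer
    'thinking time', matching the latency Chris describes."""
    beams = [([], 0.0)]  # (steps so far, cumulative score)
    for _ in range(depth):
        candidates = []
        for steps, s in beams:
            for step in propose_steps(steps):
                candidates.append((steps + [step], s + score(steps, step)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring full chain

# Toy problem: assemble the string "abc" one reasoning step at a time.
def toy_propose(steps):
    return ["a", "b", "c"]

def toy_score(steps, step):
    target = "abc"
    i = len(steps)
    return 1.0 if i < len(target) and step == target[i] else 0.0

best_chain = beam_search(toy_propose, toy_score)
```

Swapping the toy scorer for a learned reward model turns this into the kind of search-plus-RL loop the panel is speculating about.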
And then Aaron says that's going to
feed back into the training model
later. But I, I think it's super exciting
because it's that push towards, uh,
inference and being able to scale out there.
Now you could argue that we have that
already with agents, but in order for that
to work well with agents, you need to back
that up with the reinforcement learning.
You need to feed that training data back into
the model because if the model can't generate
good chain of thought, then you're, you're not
going to do very well in your agentic approach.
So for me,
I find it highly satisfying.
All right, and Nathalie, just to bring
you in here a little bit, like you
mentioned some of the safety aspects.
That was definitely, um, a
thing they highlighted as well.
Like maybe as a way of, I have a couple
questions on that, but maybe as a
starting point, I'd love to hear your
take on, like, the extent to which you
think that capabilities, alignment, and
safety are, like, becoming the same
problem, um, space.
It's like, you know, can we just make this
model do what we want it to do?
And you solve all sorts of questions there.
Or can you, like, look at them
as kind of more distinct domains?
That's a great question.
I think one aspect that really improves with
this model is that before in AI, we had the
issue of having models that are black boxes.
And we tried in the
community to really inspect them, introspect
them, try to check the activations, see how they
react to different inputs and stuff like that.
But this model allows us to have
something slightly different, and it's
that ability to introspect how it came
back with a decision, with an answer.
So, uh, from that perspective, and to
answer your question, I think, uh, there
may be some questions and answers, a lot
of things that get touched upon, and I,
we only have one single model, so probably
what's happening, uh, is that we're going
to be mixing all these things together.
The training data contains a lot
of different aspects of safety.
I think, uh, it's important to cover all of
them as much as possible, but, uh, overall
the main thing that makes this model so unique from
my perspective is that ability to introspect without
having to look into activations that we
humans are not super great at interpreting.
So that is a part of the, I think, uh,
exciting aspects from security perspective.
Uh, the other perspective that I
thought was really interesting is that when
we measure safety, we oftentimes have different
perspectives, as you were alluding to.
So we measure things like
jailbreaking attacks, hallucinations.
We, uh, verify that, uh, the model
is not going to insult anyone,
uh, fairness, all sorts of things.
So if we see some of these aspects, for
example, jailbreaking attacks, that was
one of the metrics that got improved.
Uh, maybe I'm talking too much, but, uh,
I would say that I was really, really
impressed with the fact that the community
is trying to really incorporate more
and more, um, cutting edge benchmarks to
try to understand how the model behaves.
Because one thing that happens with
benchmarks in this space is that they arrive
today, and people kind of overfit to them.
And, uh, the jailbreaking attacks, uh, will
happen the same, and, uh, all other
benchmarks are also kind of, uh, having that
issue that people really overfit to them.
So I was, uh, really impressed and continue
to be impressed with the security community,
the AI security community, trying to push the
boundaries, have more red teaming, have
more interesting stuff, uh, uh, things
to throw at these models to test them out.
So overall, I'm very hopeful from the
security perspective with this model and it
opens lots and lots of opportunities for us.
I have a question for Nathalie.
So when I read the paper on this, they said
that in the testing of the model, They were
playing a capture the flag scenario with
the model, and the goal the model had was to
capture the flag like a security thing, but
then the container was down, so the model
broke out of the host and then restarted the
container so it could capture the flag, right?
Very goal oriented.
So I'm just like your take on that, Nathalie,
from a kind of security perspective.
Yeah, this starts to look like
a Terminator sort of thing.
Um, I think it's impressive.
First of all.
The way the model is trying to find
its way around to solve the issues.
Uh, sometimes it's going to do stuff
like that, that, uh, is not necessarily
the simple solution.
Um, but overall, I think, uh, the risk
level, they have risk levels for models.
I think, uh, we're still good.
So no terminators inside.
We're good from the security perspective.
Um, does that answer it?
And I'm also curious to know what
Aaron has to say about that, because
it's such an interesting question.
I mean, yeah, these are really great
questions and open ended discussions.
And, um, one of the areas that I found,
um, interesting would be these error
avalanches, because during this chain of
thought reasoning, you know, you're always
pushing forward the chain of thought.
And, if at step zero you have an error, then
that error could propagate all the way to
step n or n plus one, and it just creates
this cascade of problems that could happen,
um, that might be harder to uncover than a
hallucination, because, you know, you have these
large outputs of chains of thought, if you
can even see them, because I know Strawberry,
you know, in this case has hidden them from us.
You know, they, they made
that deliberate choice.
But, you know, but, but I know
that there are some, some work in
academia, potentially industry now
about looking at consistency, right?
Can these models consistently
get answers, um, correct?
And, you know, that's one way, um,
but I'm looking forward to seeing
how Strawberry really handles that.
And as more and more people use it.
How big of a problem, you know, is this,
you know, this cascading error or this
avalanche of, of errors that, that could
happen, um, along their, uh, reasoning.
And, and Aaron, I think that's a really relevant
point if you have a single chain of thought
that you're iterating down. But I'm not,
I'm not convinced that that is the case.
I, I, I could be totally wrong.
We're just guessing.
We're just sort of trying to
look at it from the outside.
But I, I really feel there's multiple chains
of thought that's being generated there.
And they're doing some sort of search
on that to be able to, to aggregate it.
So, so if they are doing that, and I
think they are, could be wrong, then there
might be less chance of that over time.
Because at least it's got other
options to take, uh, down that
path and aggregate it a little bit.
But, but even then with the reinforcement
learning to the point, then hopefully during
training, a lot of that would be taken away
because, you know, the reward model will be sort
of pushing it in the right direction over time.
But yeah, but it's a really great take.
Yeah. One other point I wanted to make too is
that, um, just the thinking time, the
inference, I noticed that it took 10 hours,
that they gave it 10 hours to solve,
like, six algorithmic problems, and you
know, that's a lot of time, right?
And so, so I think I'm, I'm also curious to, to
learn, as, you know, we get our hands on it, how
much time, you know, what's the trade-off, you
know, time versus speed, you know, of response,
you know, and, and, and learn more about that as well.
So I'm just really excited
about, you know, Strawberry.
I think that point on the length of time it
takes for some of these answers, to me, draws this,
like, interesting scenario where, kind of like,
the LLM router patterns, um, are going to become
even more pronounced where it's like you're
going to want small fast models that are cheap
and low latency to do certain types of tasks.
And then, when you have to, offloading them
to these bigger, longer, more expensive,
um, sort of, um, sort of scenarios.
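The router pattern described here can be sketched as a dispatcher that keeps easy requests on a cheap, low-latency model and escalates hard ones; the difficulty heuristic, model names, and cost figures below are illustrative assumptions, not a real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    cost_per_call: float  # illustrative units, not real pricing
    handler: Callable[[str], str]

def route(prompt, small, large, hard_keywords=("prove", "plan", "multi-step")):
    """Crude difficulty heuristic: long prompts or reasoning-heavy
    keywords go to the large model; everything else stays cheap."""
    hard = len(prompt) > 200 or any(k in prompt.lower() for k in hard_keywords)
    model = large if hard else small
    return model.name, model.handler(prompt)

small = Model("fast-small", 0.001, lambda p: "quick answer")
large = Model("slow-reasoner", 0.10, lambda p: "deliberate answer")

name1, _ = route("What's the capital of France?", small, large)
name2, _ = route("Plan a multi-step proof of this theorem.", small, large)
```

In practice the heuristic would itself be a small classifier, but the shape of the pipeline is the same: cheap by default, expensive only when the task warrants it.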
The last question I have on this, cause
I, I, I don't want to, I want to have
one more question that's kind of like a
more existential question, um, that I would
love to get the panel's, uh, take on, uh,
on this topic, which is, on a lot, many of
these benchmarks, it's now exceeding PhD level
intelligence, um, and not to out myself here, I
consider myself a reasonably, like, economically
productive person in society, but I do not have
PhD level intelligence on all of these tasks.
And one of the really like interesting reactions
from some of the folks I saw on Twitter who
are actually like, like Rune, who's at OpenAI,
works in the lab, like came out afterwards
and was talking about how he didn't even
think product was that important and that the
only game was getting to self reinforcing and
self improving artificial superintelligence.
And the question I have is like, are people
just like, like, when do we expect, like, how
capable do these models have to be before we
actually see the transformative economic impact?
Yeah, so I think one of the aspects
is, uh, what's the application and how
much you can trust the model to make
sure that, you know, it doesn't have
hallucinations in aspects that are important.
So, um, I think to have real economic impact, we
need use cases first, uh, where we can fail,
basically, and we're safe failing,
and still increase the productivity of people.
So that's, uh, that's my take on this.
And so, for example, it's going to
be one of those use cases where, for
example, you have multiple smaller
models and you can try to orchestrate.
Perhaps this very big model will help us,
uh, would help us really orchestrate, or
try to devise plans when they are difficult.
But I think, uh, overall, the question
of, uh, just getting a big, big model that
can do everything just by itself, uh,
that's, uh, uh, probably not going to solve
all the problems in, uh, the industry.
I think we need a bunch of smaller
models and the agentic approach
and perhaps have another top layer,
um, in there to, to really understand the big
context. But yeah, that's, uh, how I see things.
So I really think, uh, industry-wise
things are going agentic.
So smaller models working together
We talk about agents every week on this
show, like, the theme of the show is just,
like, we should just call it the agent
show, uh, at this point. Um, but let's
talk a little bit about Salesforce,
um, and Agentforce.
The thing that is almost most notable about
this company is that, I don't want to say maybe
they invented it, but, like, they popularized SaaS.
Um, right?
Like they were the OGs of SaaS.
And what happened, like, over the last,
you know, 15, 20 years is that, for basically
all of traditional software, somebody came
along and disrupted each and every single
one of those categories with, like, a SaaS
version of whatever that product was, right?
And that happened in basically
every category across the industry.
Um, there was this piece written by
a16z that was talking about, um,
the death of the sales force, not
the death of Salesforce the company, but of
the sales force, uh, and talking about more agentic
approaches, um, to, to this particular
space and they did not believe actually
that like the incumbents had an advantage.
They thought the entire space
would be so radically transformed
by these capabilities that
it would be disrupted by new, new
entrants into the market, um, essentially.
And I think that, that sort of dynamic is what's
driving and propelling, um, Salesforce to do
a lot of the things that it's been doing,
um, and the things that were announced this week,
even. Um, but maybe as a question, as a
starting point, and maybe I'll flip it over to
you: do you think what played out in SaaS
is going to play out with, like, AI and
agents too? Like, is every category going to
get threatened or disrupted by, like, an AI
native version of that particular space?
Like, is that where we're going?
And is like a Salesforce trying to
again be the first one to do this?
Absolutely, next question.
I'm joking.
We got about 10 minutes, man.
Like, you know.
So I think, I love what Salesforce is doing there, right? So I think the Agentforce thing is absolutely spot on, right? Because they're effectively, um, speeding up productivity, right? So it's no different from the kind of deterministic automation that you would do in these platforms today, and now you're getting the agents to perform that. So anybody can compose an agent that performs a task and do it really quickly.
Um, agents are going to be everywhere.
And multi agents, uh, you know,
operating in teams and crews, multi
agent collaboration is going to be huge.
And I did a video on this about a year and a half ago, and I think this is true.
I think we're heading towards
a world of agent marketplaces.
So you're going to go home and you're
going to have an agent that's good at
translation into a native language.
You're going to have an agent that
is going to be good at performing a
particular task, an agent that, you
know, can do benefit calculations.
For every single task that you can imagine, there will be an agent at some point that can perform that task.
So therefore, if you think about what Salesforce has done there in their world, what they've created is an agent marketplace within their SaaS platform, and that is cool, right? Anybody can go and compose those agents and bring that together and, you know, solve these tasks really quickly. But that's not going to be limited to Salesforce, and it's not going to be limited to individual organizations; that's going to come out into the real world.
In the same way as we have platforms like Fiverr, we're going to have agent marketplaces, and then, you know, people will be able to go and buy those tasks from those agents, and it's going to be a rush, right? And the companies that are going to be truly at the forefront are going to be the people who have the better data, who have the faster agents, the cheaper agents, and they're going to sort of dominate that.
So I actually think that the big tech companies are going to enter into that space, uh, Salesforce being one of them.
But I, I see this as a world marketplace.
I don't see this being a company thing.
And maybe building on that a little bit, the data point is particularly interesting. Some of the discussion I actually saw in the original post from a16z was talking about how they thought even some of the incumbents only had a slice of the data that was going to be relevant in the future. People are treating it like they have all that data, but their belief is that in some of these customer service and experience domains, multimodal data, and even unstructured data, things that are not necessarily the core of how these products are powered today, will become the core of how they're powered in the future. So there is a data advantage there, but it's not as pronounced as people thought it was.
Um, this is one hypothesis, um, at least, but like Aaron or Nathalie, I'm curious the extent to which you think that, um, you know, some of these other data sources are going to represent opportunities for new entrants in these categories, and/or things where, you know, some of the existing providers are going to have no problem just adding a new data set into their existing platforms.
Yeah, I mean, um, I always take a step back and ask myself, what is an agent? You know, and to me, an agent is a process that can perform a task that could otherwise be done by a human or even another agent. And then this gets to meta agents, where an agent can create yet another agent, right?
And there's a couple of paradigms, you know, and the two I'll stick with sort of give you a continuum in between. The first one is environment-centric agents. These are agents that reason and think and plan after each action. So they think, act, and observe. Um, and then the other one would be human-centric, where they reason without observations, where they plan up front and they don't really need output from tools in order to take action.
And then there's anything in between, right? There are these poles, and it seems like the data aspect depends on the use case in which the agent is going to operate within a given environment, right?
Is it more of a reactive kind of agent, based on a signal that comes from a device, where you don't necessarily need a lot of, um, external data to create a reaction? Or is it more of a rich textual piece, where it needs to generate new information, um, and provide it back to a human or even to another agent, right? Which, instead of human-centric, maybe we say is agent-slash-human-centric, right?
Um, but it, but it's, it's, it's really
neat, uh, where all this is going.
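The environment-centric pattern described above (think, act, observe after each step) can be sketched as a minimal loop. Everything here is an illustrative assumption, not any real framework's API: the toy `decide` policy, the fake tools, and the stopping rule are all made up for the example.

```python
# Minimal think-act-observe agent loop (illustrative sketch).
# The tools and the toy decide() policy are assumptions for the
# example, not a real agent framework's API.

def decide(goal, observations):
    """Toy planner: choose the next tool call based on what was observed."""
    if not observations:
        return ("lookup", goal)           # think: first gather information
    if "result" not in observations[-1]:
        return ("calculate", observations[-1])
    return None                           # nothing left to do -> stop

TOOLS = {
    "lookup": lambda q: f"facts about {q}",
    "calculate": lambda x: f"result for {x}",
}

def run_agent(goal, max_steps=5):
    observations = []
    for _ in range(max_steps):            # bound the loop defensively
        action = decide(goal, observations)   # think / plan after each action
        if action is None:
            break
        name, arg = action
        observations.append(TOOLS[name](arg)) # act, then observe the output
    return observations

print(run_agent("benefit calculation"))
```

A plan-up-front (human-centric) agent would instead call `decide` once, produce the full list of actions, and execute them without looking at the tool outputs in between.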
And, um, I do think that, you know, there are different approaches to RAG, you know, that we all know about, where you can augment data pieces into the context such that it influences and informs what the model might output.
But this might foreshadow another topic we're going to get to: I really like the machine unlearning, right, where you can, um, maybe unlearn or erase from memory, whether it's this hippocampus-type memory or it's embedded in weights, from all of these generative AI pieces. It kind of helps to focus an agent on what they're meant and supposed to do around their objectives, so they're sort of SMEs within a particular domain.
So rather than being so broad and having this huge large language model that's been trained and has this inherent data baked into it, you have these very narrow, smaller, uh, SME agents, right, that might be fine-tuned in different ways. Maybe you're removing data that's already there, that's been shared by these models that are open source. Or you're adding data through RAG, or fine-tuning with your own data that you might have. You know, so there's lots of permutations there.
And for Salesforce, I'm really excited that, you know, they're partnering with us, IBM, to advance their products, to make them more open and trusted, and to help think about these kinds of new architectures and agents and how we're going to be using data and plug-and-play pieces, like Chris mentioned before.
Nathalie, maybe a question for you too. I think it was Chris that made this comment about, um, agents versus more deterministic workflows, um, you know, and kind of that evolution over time. Um, one of the things that I've seen a little bit, at least with some of the things that we've been doing, is we've started by productionizing a lot of more internal use cases, the stuff that drives big improvements in productivity, things like that.
The interesting thing with Salesforce is that some of the scenarios they're talking about are all customer-facing things, which, you know, changes the calculus from a risk perspective, from a security perspective.
Um, and I'm just I'm curious, Nathalie,
how you think, how are how are people
going to approach the balance of,
you know, kind of more deterministic
workflows versus these more agentic ones?
That is a great question.
Um, my perspective is the following.
First, we need humans to know that they are still important in this whole pipeline, because a lot of times when there are mistakes, uh, an expert would very much realize, uh, that there's something weird, something that doesn't quite feel right. So I think, uh, first understanding the human and educating the human, like, hey, this is a tool, but just know that you are potentially smarter than the tool. That's our first, first step.
The second step is, uh, actually understanding when we want to explore solutions and explore the space, versus when we want something deterministic, for example, to retrieve a document that's really relevant for certain types of questions, where we can have a sort of pipeline that's not as stochastic, and we know we want it that way. So it's kind of setting up paths within our spectrum of solutions, so that when there's something really critical, RAG and other types of technologies can be applied so that we don't hallucinate wildly. Uh, I think, uh, that's, um, that's a part of it.
So how do we set it up to make it trustworthy?
I think that's the, the
main, uh, aspect of it all.
And it's going to be a combination of
human Plus, uh, a lot of techniques.
I think RAG, just the way it is done right
now, may have some gaps, but we, I think
as a community, are moving forward towards
solutions where we can specify a little
bit better where we are going for each of
the questions, for each of the suggestions.
But overall, um, just to mention, I think it's really, really important and really a tool for people to use and, uh, leverage in their business cases, especially for Salesforce and so forth.
Keeping on the theme of putting this stuff in production and in front of actual customers, um, I will move us on to our third segment today, which is, uh, talking about fantasy football a little bit and some of the work that IBM is doing in this space. This type of work is reaching huge consumer audiences, and so it actually is, um, I think, some of the more exciting work that we're doing. But, um, Aaron, I want to say the partnership has been going on for like eight years, uh, at this point. So maybe just talk a little bit about the work that we do around partnering with ESPN on fantasy football. And then, like, I know we introduced some new capabilities, um, this year that are driven by, um, LLMs. So maybe just talk a little bit about the partnership there and some of the stuff that we've brought new for the season.
Sure. Yeah, I mean, perfect. Our project has been around for eight years, and we actually went down to the labs down in Austin with ESPN, I'd say 10 years ago, and we were trying to figure out what we can do, right, to help fantasy football managers that hasn't been done before. And we came up with the notion that an active league is a happy league, right?
And what we want to do is to create this
immersive and understandable experience
for ESPN fantasy football team managers.
And, um, we've grown, right? So now, through our eighth year, we have 12 million users that are registered. We're actually live right now, and we're two and a half weeks into this very long season. And so far, we've had 919 million page views, and we've delivered 4.6 billion insights, right? So it's really, really heavy. And, um, we are consumer-facing, you know, so we have about 5,000 requests per second that we sustain. And, um, one stat that I just looked at this morning, I was just curious, but, um, the most time spent on a singular player has been 100 days, just in two and a half weeks, and that player was Justin Fields, right? So, I mean, that gives you the volume, right, and the amount of users that we had.
And the program, it starts in August and
it runs into January of the following year.
And what we do is we provide boom, bust,
score spreads and different stats about
players to help folks make decisions.
And the idea at the beginning, which was novel, is that we wanted to create these different predictions and player states, like, do they have a hidden injury, just from text and from videos and sound, not from stats, right? And that was a hypothesis, and we went through this empirical, metrics-driven approach to measure how well we would do, and it came out that we did very well, and we're eight years into it.
And then what we also do is
we give trade analysis grades.
So if you and I were going to trade, I look at your situation, your roster, your rules, and I give you a grade. Um, and then we also look at waiver wire players and give a grade.
And we do opposing team rosters to say, how will this player help my team? Because there's always, you know, some sort of opportunity cost. But the system uses a combination of generative AI, classical machine learning, um, simulated quantum machine learning, and different analytics that's been built up over the years.
Um, so it's, it's fascinating, it's very
rewarding, you know, to see people use
this and to see all of our insights,
generative insights as well on ESPN,
on broadcast TV, um, and on the radio.
I think one of the things that struck me the most, um, and correct me if I'm wrong, is that, like, the trade gets a grade, um, right? And then we actually use the IBM Granite models as a way of producing some custom analysis associated with the grade. Um, and so that text becomes personalized, uh, really, in a way, to not just the person, but actually to the specific situation. You know, I do a lot of work in media content and the web and personalization, and I think for every company that works on and thinks about its customer experience, personalization is like the holy grail that everyone wants to get to.
And one of the things that's so interesting is that, from a content perspective in particular, personalization is just insanely expensive. Um, and that's one of the gating aspects, because, like, I'm struggling to make one good version of this thing, and now you want me to create, like, a hundred good versions of this thing? It's like, never gonna happen. Um, and so one of the first things I thought about, uh, when generative AI arrived was, I wonder if this is the unlock for personalization.
Um, like, is this the thing
that is really going to do that?
Um, and so maybe, like, throw it over to
you, Chris, and just, like, how big of an
impact do you think that, like, Gen AI is
going to have in personalization over time?
And, like, what barriers do you think
exist to, like, us doing even more of this?
So we're, we're already doing that.
So personalizing, using generative AI to the
consumer, um, that is something we already do.
And we do this with the customer data platforms.
So if you think of, uh, in marketing, you
have the customer data platform where you
have the, the 360 view of the customers.
So you have all your clicks or
preferences, uh, all of that kind of
marketing data, that's all in one profile.
Well, if you think about what
generative AI is really good at, it's
really good at role playing, right?
And you'll have seen that before,
talk like a pirate, um, you know,
talk like Snoop Dogg, et cetera.
Well, actually, everything that you need to personalize there is sitting in your customer data platform. So actually, just getting all of that data that you've got today and then starting to put that in works really well.
And we've already been using that to
build marketing segments, to then have
even finer grained marketing segments
than you have today, and then be able
to have that personalized content.
And of course, that is making it smaller and smaller. That's kind of what's happening today at a practical level, but that's going to come down to the one. And if you think about this, it's not only about generation of content, it's also about verification of content.
So let's say you're going to do a marketing message, and then you go and do an A/B test, right? That's quite an expensive test. You're doing that against real people, and you're finding out, hey, does this work?
But remember what I said, the, the
generative AI is really good at role playing.
So you can start to ask the question and say, how likely are you to respond to this content, right? Is this content fitting your particular persona? So you can start to ask questions of that persona that you've got there and understand if it's a good fit, and then maybe start to deal with that a little bit in advance.
So generation is absolutely where
people want to go for personalization,
but actually verification is, is
a really interesting use case.
And as I said, we're already doing this.
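The verification idea here, asking a model to role-play a persona built from the customer data platform and react to a draft message before any A/B test, can be sketched as a simple prompt builder. The profile field names and the wording of the prompt are illustrative assumptions, not IBM's or Salesforce's actual schema:

```python
# Sketch of persona-based content verification: build a role-play prompt
# from a customer-data-platform profile. All field names are assumptions.

def build_verification_prompt(profile, draft_message):
    # Role-play instruction assembled from the CDP's 360-degree view.
    persona = (
        f"You are role-playing a customer: age {profile['age']}, "
        f"interests {', '.join(profile['interests'])}, "
        f"recent clicks {', '.join(profile['recent_clicks'])}."
    )
    # The verification question, in place of an expensive live A/B test.
    question = (
        "On a scale of 1-10, how likely are you to respond to this "
        "marketing message, and why?\n\n" + draft_message
    )
    return persona + "\n\n" + question

profile = {
    "age": 34,
    "interests": ["fantasy football", "running"],
    "recent_clicks": ["draft-kit", "injury-report"],
}
prompt = build_verification_prompt(
    profile, "Get weekly boom/bust projections for your roster!"
)
print(prompt)
```

The returned string would then be sent to whichever LLM is standing in for the persona; scoring many draft variants this way is cheap compared to running each one against real people.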
Yeah, yeah, I wanted to just mention, too, just to build off of that. Um, some of our challenges within sports entertainment live events is scale, right? Because, as I mentioned, we have 12 million users that could hit us in a single day, 5,000 requests per second. You know, so we shield our origin servers, right, from all that traffic.
And we invented a way such that we could create batch jobs that would generate all these different, almost fill-in-the-blank sentences. And then, on the edge, we would look up, you know, what league you're in.
Because there's an infinite number of scoring rules, right, and that makes these personalized sentences different, right? That could, again, be infinite. And so what we do is we meet in the middle: we pull in those fill-in-the-blank sentences on the edge, and then we personalize them, um, you know, through fill-in-the-blank adjectives, uh, based on the percentiles of the values your players have, right? And then the language that you would expect.
So it's almost like theory of mind, where we want our algorithms to understand you, your data, your situation, and then personalize the already-generated AI content and feed it off to you, right? And that's how we typically handle these massively large-scale systems that hit us.
And it's, it's quite fun, right?
To see the reaction of users when they see the
data meeting their expectations and showing
them something that they didn't really know.
They're like, wow, okay, now I get it.
You know, and it's, and it shows the power of
what we do here, um, for lots of our customers.
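The batch-plus-edge pattern Aaron describes, pre-generating fill-in-the-blank sentences and then choosing an adjective at the edge from the percentile of a player's projected value under the requesting league's scoring rules, can be sketched roughly like this. The template, the percentile cutoffs, the adjective tiers, and the numbers are all illustrative assumptions:

```python
# Sketch of edge personalization: a pre-generated (batch) template is
# filled in at request time using the percentile of the player's
# projected value within this league's scoring. All numbers, templates,
# and tier names are illustrative assumptions.
from bisect import bisect_right

# Pre-generated fill-in-the-blank sentence from the batch job.
TEMPLATE = "{player} is a {adjective} start this week at {points:.1f} projected points."

# Percentile cutoffs mapping to adjective tiers.
CUTOFFS = [25, 50, 75, 90]
ADJECTIVES = ["risky", "borderline", "solid", "strong", "elite"]

def percentile(value, population):
    """Percent of the population strictly below `value`."""
    below = sum(1 for v in population if v < value)
    return 100.0 * below / len(population)

def personalize(player, projected_points, league_population):
    pct = percentile(projected_points, league_population)
    adjective = ADJECTIVES[bisect_right(CUTOFFS, pct)]
    return TEMPLATE.format(player=player, adjective=adjective,
                           points=projected_points)

# Projected points for the rosters in one (made-up) league.
league = [4.0, 7.5, 9.0, 11.2, 13.8, 16.1, 18.4, 21.0, 24.5, 27.3]
print(personalize("J. Fields", 21.0, league))
```

The expensive generative step happens once per sentence in batch; the per-request work at the edge is just a percentile lookup and a string format, which is what makes 5,000 requests per second sustainable without touching the origin.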
Nathalie, I know you've actually been
doing a lot of work in the space of, like,
machine unlearning, um, and so maybe just,
like, to throw it over to you and just,
like, talk a little bit about, like, what
this is, why, why it's important, and, um,
yeah, just I'll throw it over to you there.
Yeah, thank you.
So this is a topic that I think is very important, very relevant, especially right now that we have huge models. Basically, let's revisit the pipeline for a second. We have lots of training data. Lots, like a lot, from the internet. So untrusted training data, most of it. Then we train a model that's huge.
It takes months to do this and a lot of know-how, and then we get the model. And then after that, we start red teaming, we start using the model, and we go, like, oh, we messed up. Perhaps we should have removed, or not used, certain types of data that we used during these four months of training, for example.
So, uh, the idea of unlearning is, rather than retraining and trying to solve the issues by retraining or fine-tuning, what we do is take the model, kind of perform some surgery on it, uh, so that the effects of data that we don't want are no longer there.
There are, uh, different reasons
for which we may want to do that.
One of them is, for example, it turns out that all of a sudden we have this subpopulation of people, and a lot of the replies that we are getting for that subpopulation are not great, so very toxic behavior, for example, from the model. Can we actually remove that a posteriori, after training, all the things that we don't like, the toxicity?
Another use case, use case number two, is poisoning. What if somebody actually took that untrusted set, manipulated the training data in a way that was not great, and then, uh, we are starting to understand that that has happened, and the model is there? What do we do? Then what we do is try to modify the model to remove that poisoned information.
There is, uh, also a use case, for example, around removing copyrighted material. Even if we filter, and at IBM we really make a huge effort to filter copyrighted material when we train, licenses are sometimes not static. So one thing that today seems okay to use may have changed later on.
What do we do?
Do we go back to retraining?
Probably not.
It will take forever.
But if we use techniques like unlearning that modify the model to remove that copyrighted information, that's going to give us a big, big advantage.
Um, hallucinations, basically, uh, that's another aspect of it. What if we determine that the model always hallucinates in a certain way? Can we go inspect it and modify it so that we no longer have this hallucination?
So the way I like to think about it is that you have a model, it's a patient, and you see that there's something like a virus going on in there. Uh, there's this new way to basically give it antibiotics, patch it, and then you have a new model.
So we are really operating on, modifying, the model, and that adds this extra layer of security to the whole pipeline. It also helps us manage the life cycle of the model itself, so that, basically, in retrospect, every time we find something that's odd or that we don't want, we can go ahead and change that model accordingly.
It's a fascinating thing, because so much of the discussion around how we work with models has been about how we add more data into the model, um, whether we're talking about RAG or fine-tuning. Like, okay, we have a generic model, but we need to get an enterprise data set into the thing so it can operate on our data, on our tasks. Machine unlearning is the opposite of that: how do we actually get things out of the model?
And so, you know, maybe Chris or Aaron, I'd be curious if you think that this domain is going to be entirely the world of model providers, and maybe some stuff in the open source world, or do we think that there is a world where enterprises, when they're thinking about this type of practice, make these techniques around removing data from models as commonplace as adding data to models through things like RAG and fine-tuning?
Yeah, I think it's going to be pretty commonplace, if I'm honest about it. Because if I think about fine-tuning today, I think fine-tuning is really quite an imprecise art at the moment, if I'm truly honest about it. We do things like freeze the layers, we make it smaller. But if we look at the kind of space that you've got in the models there, it's almost like you're just lopping off stuff and then putting new stuff on top, right? And it's imprecise.
So I think as you train models, you're going to want to be more surgical, right? And I keep thinking of the episode we did on the Golden Gate Bridge, right, and the work that Anthropic did there, where they were saying that this activation here would happen when you did this, and then we could turn it up, and we could turn it down, or whatever, right? And, you know, you could make it talk more about the Golden Gate Bridge, or less, or whatever.
I think it's going to be in this direction.
So, you'll have things like unlearning, so
being able to remove things from the model.
But then I think you're going to want
to fine tune in a more precise way.
And, and, and I think we're all
going to become LLM surgeons.
I think that's going to become a
more precise art than it is today.
So, yes, and that means the tools are going to get better, how we visualize the models, look at them, and are able to do scans and say, okay, this is the point where, you know, it's talking about Harry Potter; this is where it's talking about copyrighted information. I think we're just going to have a deeper and richer view of the models in time. And we just don't have that today. It's an imprecise art.
It's funny that you mention, um, some of the mechanistic interpretability, um, because when we were actually having the conversation earlier about chain of thought, I was also thinking about that as a different way to understand the way the model is thinking. Because there's this whole thing of, like, we have no idea how these things work. But between the interpretability space, everything around chain of thought, and, you know, machine unlearning, you're having all these sort of different techniques that are all getting at the same problem: how is this thing doing what it's doing, and, now that we know that, can we make it do something else?
Yeah, I mean, this is almost like watching the movie The Matrix, right, where, you know, the scene is, do you want to take the red pill, you know, and really learn and understand something that might make you uncomfortable, or take the blue pill and just maintain the status quo? And to me, this machine unlearning is almost taking the red pill, where you're getting these models to focus on the data that matters.
At a particular point in time, and maybe it's uncomfortable, you know, doing that and just trying to figure out exactly what data does matter and what data doesn't, which is almost like a governance kind of, you know, um, problem there. And what I really find interesting, I guess getting a little nerdy, is how it works.
You know, it's just really, really neat, you know, how, if you're at a large language model, you're basically teaching, um, a generic model to predict, um, the next token as if it didn't have that data. You construct like a new training set, and then you use that to feed back into the model to relearn as though it never had the data in it. It somewhat erases, you know, the weights on those gradients within the activation function, right? And that, that's neat.
And then when you get to multimedia, right, like image-to-image, it's, you know, great, because you can actually have these models forget, you know, how to put in these new objects within images that could be copyrighted, or maybe you don't want to have certain types of objects, or maybe you do want to have certain objects. So you can sort of balance forgetting and remembering. You can have these loss functions that span forgetting and remembering, and you optimize both at the same time, with two separate models, and you're teaching one of them, you know, what data is the most important.
So, um, going back to The Matrix, I think with all of these LLMs, and us being surgeons, you know, I think we're going to be taking more of these red pills.
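The dual-objective idea Aaron sketches, loss functions that span forgetting and remembering and get optimized at the same time, can be illustrated on a toy model: gradient descent on the retain-set loss combined with gradient ascent on the forget-set loss, weighted against each other. The model, data, learning rate, and forget weight below are illustrative assumptions, not any production unlearning method:

```python
# Toy unlearning sketch: a single parameter w fit by least squares
# (y ~ w * x). We descend on the retain-set loss while ascending on the
# forget-set loss, so the model keeps fitting retained data while it
# "forgets" the rest. All numbers here are illustrative assumptions.

def grad(w, data):
    """d/dw of the mean squared error for y ~ w * x."""
    if not data:
        return 0.0
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def unlearn(w, retain, forget, lr=0.01, forget_weight=0.2, steps=500):
    for _ in range(steps):
        # Combined objective: minimize retain loss, maximize forget loss.
        w -= lr * (grad(w, retain) - forget_weight * grad(w, forget))
    return w

retain = [(1.0, 2.0), (2.0, 4.0)]   # consistent with w = 2
forget = [(1.0, 5.0)]               # an outlier we want to unlearn
w_before = unlearn(0.0, retain + forget, [])  # ordinary training on everything
w_after = unlearn(w_before, retain, forget)   # unlearning pass
print(round(w_before, 2), round(w_after, 2))  # prints: 2.5 1.74
```

Before unlearning, the fit is pulled toward the forget point; after the joint pass it lands much closer to the retain-only solution of w = 2, which is the qualitative behavior the dual loss is meant to produce.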
I guess Mixture of Experts is now the red pill pod at this point, so I think that's a good way to end today.
So Aaron, Chris, Nathalie,
thank you for joining us today.
Another exciting week in AI.
We will be back next week talking
about all the news going on.
But for all of you out there in radio land,
you can find us on podcast networks everywhere.
Thank you for joining in today,
and we will see you back next week.
So thanks very much, everyone.