Gaming Preferences Meet AI Model Updates
Key Points
- The episode opens with light‑hearted introductions, where guests share their favorite video games (Zelda Breath of the Wild, GTA, and Minecraft) before diving into the show’s AI focus.
- Host Tim Hwang announces several major items on the agenda: new BeeAI updates, the latest Granite release, and a recently published paper on emergent misalignment in large‑scale models.
- The centerpiece of the discussion is Anthropic’s launch of Claude 3.7 Sonnet and Claude Code, highlighting the modest 0.2 version jump from the previous 2.5 model and the team’s emphasis on a more curated, “opinionated” user experience.
- Maya Murad notes a growing distinction between cloud‑based model offerings (such as Anthropic’s) and OpenAI’s approach, suggesting that the competition is shifting from raw capability to stylistic and experiential differentiation.
- The panel speculates that future battles among foundation‑model providers may focus less on sheer performance and more on the nuanced “style” and user‑experience each company builds into its models.
Sections
- Untitled Section
- Toggleable Reasoning for LLMs - The speakers discuss a new optional reasoning mode for language models that users can flip on or off, balancing latency and token cost while enhancing response quality.
- Game-Based AI Evaluation Revival - The speakers discuss Anthropic's use of Pokemon gameplay as a playful benchmark for Claude, recalling earlier game‑based AI tests and debating whether such video‑game evaluations are useful or merely a gimmick.
- Dynamic Game-Based AI Evaluation - The speakers argue that real‑time, strategic games like Pokémon battles provide a more realistic test of an AI’s reasoning, adaptability, and decision‑making than static knowledge benchmarks.
- IBM BeeAI Framework New Release - The speaker recaps a year of building IBM’s BeeAI agent framework in TypeScript for web apps, explains the motivation behind creating their own solution, notes strong developer community interest, and teases an upcoming Python version.
- Standardizing Interoperable AI Agents - The speakers discuss advancing open‑source standards that enable AI agents to discover, collaborate, and operate across different frameworks, referencing the model context protocol and teasing a forthcoming announcement on agent interoperability.
- AI Agents Automating Git Fixes - The speaker explains how AI agents can ingest GitHub bug tickets, plan and generate pull requests automatically, and discusses scaling the parallel agents based on GPU resources.
- New Sparse Embedding and Forecasting Models - The release introduces an experimental sparse‑architecture embedding model for efficient retrieval, ultra‑compact time‑series forecasting models that achieve top‑3 rankings on the GIFT leaderboard with daily and weekly resolutions, and a streamlined five‑billion‑parameter Granite Guardian model for enhanced safety monitoring.
- Evolving IBM’s Generative AI Strategy - The speaker outlines how IBM leverages its vast research talent to build a language‑first generative AI platform, expanding tooling and cross‑domain applications such as forecasting, discovery, and chemistry.
- Shift Toward Smaller Efficient Models - The speakers discuss the industry's pivot from massive, closed‑source AI models to flexible, smaller, open‑source alternatives, stressing that a mix of model sizes is essential and highlighting IBM’s enterprise‑first, trustworthy AI approach.
- Fine‑Tuning Code Models Increases Exploit Risk - Enhancing language models for better coding unintentionally equips them to generate exploits and vulnerabilities, revealing that improvements can undermine safety guardrails and demand continuous monitoring and adaptation.
- Fragile Model Alignment and Layered Safety - The speaker acknowledges the lack of formal proof for a prevailing theory, highlights how small data points can dramatically shift model behavior, underscores the fragility and unintended side effects of alignment efforts, and advocates for a multi‑layered, guardian‑model approach to AI safety.
- Beyond Fine‑Tuning: Preserving Model Alignment - The speaker argues that fine‑tuning alone is limited and proposes adding modular parameters (e.g., mixture‑of‑experts) to extend alignment while keeping the original model’s behavior intact and reducing brittleness.
Full Transcript
# Gaming Preferences Meet AI Model Updates **Source:** [https://www.youtube.com/watch?v=561dyCTvGlQ](https://www.youtube.com/watch?v=561dyCTvGlQ) **Duration:** 00:39:35 ## Summary - The episode opens with light‑hearted introductions, where guests share their favorite video games (Zelda Breath of the Wild, GTA, and Minecraft) before diving into the show’s AI focus. - Host Tim Hwang announces several major items on the agenda: new BeeAI updates, the latest Granite release, and a recently published paper on emergent misalignment in large‑scale models. - The centerpiece of the discussion is Anthropic’s launch of Claude 3.7 Sonnet and Claude Code, highlighting the modest 0.2 version jump from the previous 2.5 model and the team’s emphasis on a more curated, “opinionated” user experience. - Maya Murad notes a growing distinction between cloud‑based model offerings (such as Anthropic’s) and OpenAI’s approach, suggesting that the competition is shifting from raw capability to stylistic and experiential differentiation. - The panel speculates that future battles among foundation‑model providers may focus less on sheer performance and more on the nuanced “style” and user‑experience each company builds into its models. ## Sections - [00:00:00](https://www.youtube.com/watch?v=561dyCTvGlQ&t=0s) **Untitled Section** - - [00:03:09](https://www.youtube.com/watch?v=561dyCTvGlQ&t=189s) **Toggleable Reasoning for LLMs** - The speakers discuss a new optional reasoning mode for language models that users can flip on or off, balancing latency and token cost while enhancing response quality. - [00:06:17](https://www.youtube.com/watch?v=561dyCTvGlQ&t=377s) **Game-Based AI Evaluation Revival** - The speakers discuss Anthropic's use of Pokemon gameplay as a playful benchmark for Claude, recalling earlier game‑based AI tests and debating whether such video‑game evaluations are useful or merely a gimmick. - [00:09:27](https://www.youtube.com/watch?v=561dyCTvGlQ&t=567s) **Dynamic Game-Based AI Evaluation** - The speakers argue that real‑time, strategic games like Pokémon battles provide a more realistic test of an AI’s reasoning, adaptability, and decision‑making than static knowledge benchmarks. - [00:12:31](https://www.youtube.com/watch?v=561dyCTvGlQ&t=751s) **IBM BeeAI Framework New Release** - The speaker recaps a year of building IBM’s BeeAI agent framework in TypeScript for web apps, explains the motivation behind creating their own solution, notes strong developer community interest, and teases an upcoming Python version. - [00:15:39](https://www.youtube.com/watch?v=561dyCTvGlQ&t=939s) **Standardizing Interoperable AI Agents** - The speakers discuss advancing open‑source standards that enable AI agents to discover, collaborate, and operate across different frameworks, referencing the model context protocol and teasing a forthcoming announcement on agent interoperability. - [00:18:41](https://www.youtube.com/watch?v=561dyCTvGlQ&t=1121s) **AI Agents Automating Git Fixes** - The speaker explains how AI agents can ingest GitHub bug tickets, plan and generate pull requests automatically, and discusses scaling the parallel agents based on GPU resources. - [00:21:50](https://www.youtube.com/watch?v=561dyCTvGlQ&t=1310s) **New Sparse Embedding and Forecasting Models** - The release introduces an experimental sparse‑architecture embedding model for efficient retrieval, ultra‑compact time‑series forecasting models that achieve top‑3 rankings on the GIFT leaderboard with daily and weekly resolutions, and a streamlined five‑billion‑parameter Granite Guardian model for enhanced safety monitoring. - [00:24:53](https://www.youtube.com/watch?v=561dyCTvGlQ&t=1493s) **Evolving IBM’s Generative AI Strategy** - The speaker outlines how IBM leverages its vast research talent to build a language‑first generative AI platform, expanding tooling and cross‑domain applications such as forecasting, discovery, and chemistry. - [00:27:58](https://www.youtube.com/watch?v=561dyCTvGlQ&t=1678s) **Shift Toward Smaller Efficient Models** - The speakers discuss the industry's pivot from massive, closed‑source AI models to flexible, smaller, open‑source alternatives, stressing that a mix of model sizes is essential and highlighting IBM’s enterprise‑first, trustworthy AI approach. - [00:31:09](https://www.youtube.com/watch?v=561dyCTvGlQ&t=1869s) **Fine‑Tuning Code Models Increases Exploit Risk** - Enhancing language models for better coding unintentionally equips them to generate exploits and vulnerabilities, revealing that improvements can undermine safety guardrails and demand continuous monitoring and adaptation. - [00:34:15](https://www.youtube.com/watch?v=561dyCTvGlQ&t=2055s) **Fragile Model Alignment and Layered Safety** - The speaker acknowledges the lack of formal proof for a prevailing theory, highlights how small data points can dramatically shift model behavior, underscores the fragility and unintended side effects of alignment efforts, and advocates for a multi‑layered, guardian‑model approach to AI safety. - [00:37:25](https://www.youtube.com/watch?v=561dyCTvGlQ&t=2245s) **Beyond Fine‑Tuning: Preserving Model Alignment** - The speaker argues that fine‑tuning alone is limited and proposes adding modular parameters (e.g., mixture‑of‑experts) to extend alignment while keeping the original model’s behavior intact and reducing brittleness. ## Full Transcript
What is your favorite video game?
Kate Soule is Director of Technical
Product Management for Granite.
Uh, Kate, welcome back to the show.
What do you, uh, what do you prefer?
I really liked the Zelda, uh, Breath of the Wild
video game series.
That series is so good.
Um, Maya Murad is Product
Manager, AI incubation.
Uh, Maya, welcome to the show.
Uh, favorite video game?
Have to say GTA.
Okay, that's awesome.
And then, uh, Kaoutar El Maghraoui,
a Principal Research Scientist, AI
Engineering, AI Hardware Center.
Kaoutar, what do you think?
I like Minecraft, so which I think it's
a cultural phenomenon allowing players
to build and explore in this sandbox
environment, which I think pretty cool.
All that and more on today's Mixture of Experts.
I'm Tim Hwang and welcome to Mixture of Experts.
Each week, MoE brings you the nerdy
chat, banter, and technical analysis
that you need to understand the biggest
headlines in artificial intelligence.
As always, there's a ton to cover.
Uh, we've got new announcements coming out of
BeeAI, a new release of Granite, uh, a really
interesting paper around emergent misalignment.
Uh, but first I really wanted to talk about.
Claude 3.7 Sonnet and Claude Code.
Um, so this is one of the big
announcements product wise for the week.
Uh, Anthropic announced the latest generation
of its premier model, uh, Sonnet, the 3.7
model, um, as well as kind of a new coding
agent that they've been playing around with.
But let's start with 3.7.
I know, Maya, you've actually had a
chance to play with, uh, this new model.
Curious for your early impressions, you
know, things that are working, not working,
uh, whether or not you like it at all.
Just curious about where
your hot take review is.
Yep, I did try it out and I was
actually surprised that it was only a 0.2
version upgrade.
So the last one was 2.5 and that one was known
to be good at coding, but maybe wasn't my go
to for writing and I actually tried the 2.7
on a writing task and I was blown away by it.
The second thing that is really coming
through with these with the cloud
version of models is the emphasis on
experience that is a bit more subtle.
So I think they're curating their training
data in order to provide you somewhat of
an opinionated experience, but more on the
Apple way, giving you a good experience.
And I'm starting to see a wedge
between what cloud is doing and what
OpenAI is doing with their models.
Yeah, I think that's sort of really
interesting and we've talked about on the
show before about like how the kind of
competition between these big foundation
models is going to evolve over time.
And I think that bit is like pretty interesting.
I mean, okay, I don't know if you get
a similar sense that like Anthropic is
almost kind of playing like a almost like
a style game now more than anything else.
And like almost the battles moving from
like capabilities to this new thing, but
I'm curious about what you think about that.
I
really like the comparison my,
uh, for Anthropic to kind of being
that Apple equivalent in the field.
You know, one of the things that they did
with a 3.7 release that I am really excited
about is they released reasoning, but they
released reasoning in a very pragmatic
way, a way where you can basically choose
how much you want to spend, like how many
tokens you want to generate, because you
don't need a ton of reasoning for all tasks.
And it gives you the ability, basically, when
you have more complicated things to quote
unquote pay more, both in terms of like latency
and cost of the tokens you're generating.
in order to prove improve the model response.
So that feels like a really like
usability pragmatic approach to reasoning
that we haven't seen yet and I think
is going to quickly become the norm.
I mean, this is the only way really to go.
If we look at where reasoning can kind of
add values, we need to have reasoning as a
knob that we can kind of selectively apply,
not something that it's just like, okay,
you know, every response including "what
is 2+2?" Is going to come back with
five paragraphs of reasoning and a ton of
latency while I wait for that response.
Yeah, it is kind of very funny seeing it
emerge because it sort of is a new paradigm
for computing in some ways where it's
like, you know, in the past to be like,
I want you to just execute this program.
And then the computer just executes the program.
But now you almost have to specify
like, and I want you to try really hard
at it is like a separate option that
I think you need to kind of toggle.
And it's like, yeah, it's interesting
trying to figure out like how we
make that just like a very natural.
Option you can kind of flip
on and flip off as you go.
Kaoutar, maybe I'll turn to you.
You know, I think one of the interesting bits
of this is not just that there's kind of a new
model on the table, um, but that they are also
starting to play in this coding agent space.
Um, and, you know, it's actually
very funny if you read the blog post.
They're like, we really believe in an
integrated experience where reasoning
and the model are all kind of together,
and it should be fully integrated.
Oh, by the way, we also have this
completely separate thing that
we're announcing and launching.
Um, but I'm kind of curious about, like, why
you think they're kind of breaking out
Claude Code as its, like, own separate functionality.
And, you know, if that's only, almost
going to kind of, like, increasingly
become its own sort of thing over time.
Or, or, you know, this is just because they're
experimenting and, you know, we'll eventually
kind of all get integrated into one experience.
Yeah, I think that's a great question.
I, I, I also was kind of a bit
surprised that they're separating
the code from the other models.
Uh, so, but probably also they're focusing
on this agent decoding, uh, which is still
right now with limited research preview.
So I think they're still experimenting
with it, and I'm hoping eventually that
it'll be integrated with the rest of their,
uh, models or the bigger view that they
have, uh, because here they're trying to
focus on how do we assist developers
by autonomously performing code
related tasks such as searching,
reading code, editing files, et cetera.
Um, so I, I think the reason why it hasn't
been integrated fully because it's still in
this, uh, limited research preview and it
deserves kind of its own evaluation and focus.
And I think this kind of goes to
sort of the general question of like,
how do we do good evals on this?
Like, I almost kind of think that
like the evals is like now, it's
like the tail wagging the dog, right?
Like the evals are actually like forcing
kind of like product differentiation
because you're like, oh, we need a team
that just gets really good at this eval.
And then after over time, you're like, actually,
this is almost like a different product because
we're just working against this eval so hard.
Um, yeah, it's, it's very interesting to see.
Uh, so I promised that I think I would tie
back the sort of top line question that I
asked, which is about favorite video game.
Um, to actually artificial intelligence
and the headlines that are popping up.
And, and I did want to kind of
tie it to the Claude launch.
Um, one of the fun bits about the launch is that
they, in addition to all the usual benchmarks,
said, hey, and here's how all of the versions
of our model perform against, uh, Pokemon, um,
and how far it got in the game of Pokemon.
And I love this because it's like a
very fun kind of playful thing to do.
To Maya's point, it was like a
little bit kind of like style points.
But it was also sort of interesting
because it kind of feels like, you
know, I remember back to like 2016.
Like everybody's all about like
how far could you get in Atari?
How far could you get in this arcade game?
And like, that was almost like the eval that
we used in that early phase, but it sort
of disappeared as all of these kind of more
formal benchmarks got more and more serious.
Um, and with this, it was
just kind of interesting.
Like people got so excited.
I had a friend who is at Anthropic who was
telling me that like office productivity was
shut down because they were just watching
to see how far Claude could get in Pokemon.
Um, and I guess I just wanted to kind of
bring this up because it's like, you know,
almost the return of the video game eval.
Is it useful or is it kind of more of a gimmick?
I don't know.
Kate, like, is it, should
we see this as almost like.
Yeah, this might actually be kind of
a paradigm of evals that we should
be exploring and expanding on, or, or
is it more just kind of a fun thing?
It's like fun to see AIs try
to get through a video game.
I mean, I remember, I think this was one of the
things that Twitch first came out with that made
Twitch famous when the world kind of stopped
and was just watching as everyone was suggesting
the next step and was kind of this like random
function generator and going through Pokemon.
So basically without you know, instead
of anthropic Claude model choosing the
next thing to do in pokemon Everyone was
submitting their vote of what should happen
next and it was kind of just like a random
amalgamation of you know all these inputs
it would select an output and the Pokemon
game would proceed and that got really far.
So you know, if you're asking about,
uh, like, is this a useful evaluation?
Like basically a random number generator
was able to play Pokemon successfully if
you waited long enough, but you know, that
aside, like, I think these games are what's
made them really popular in, especially
back in like the Atari games is, you know,
they have reward mechanisms, so you can use
reinforcement learning and use that in order
to incentivize the model to play the game.
Uh, and there's all sorts
of interesting that happens.
Things that can happen, like the model
just decides it's too hard to play the game
and so it kills itself and just gives up.
Um, so, you know, it's certainly
an interesting ecosystem to use to
evaluate and to help develop more, um,
reward system based training protocols.
Uh, so I think it is useful from that
perspective, but I also want to take it, again,
random number generator played Pokémon, so I
wouldn't take it with too much, uh, weight here.
I think it's more a fun
cultural thing that's going on.
Yeah, for sure.
And that's actually, it's, it's fun.
Are you kind of saying almost like the, it's
like the return, return of reinforcement
is almost making games cool again?
Is that like the right way of reading it?
Probably, yeah.
But Tim, I might have here like
maybe a different take on this.
I think, I was really excited
to see, um, Anthropic using
Pokemon, you know, for their eval.
And, uh, instead of using the standard
AI benchmarks, I think Pokemon is the
perfect control environment for especially
testing the reasoning aspects of AI.
So because here AI must understand the game
mechanics, the perfect opponent moves, the how
do you optimize all these different strategies.
So it does involve real time decision
making and their uncertainty.
And it kind of mimics real
world AI applications.
And another thing is pretty dynamic.
So unlike static matchbox.
Pokemon battles here forced the
model to adapt continuously.
So what does this say about, you
know, all these evaluation trends?
So I think standard benchmarks
like the MMLU, the truthful QA, et
cetera, I think they're limited.
So they test the knowledge, but not
really the real time decision making.
So if once we start introducing these
gamified evaluation methods
like Pokemon battles,
these might be more accurate ways of
measuring, um, reasoning and adaptability.
Yeah, I'm really interested in this is
like, we've talked a lot I think on the
show about how all of the existing evals
are kind of like very limited and if
anything seemed to be getting a lot more
limited with time where like people report
results on benchmarks and
people were like, ah, whatever.
And I guess my worry is always
that like everybody has then
said, okay, well then just vibes.
That's how we're going to evaluate the model.
Uh, and this almost seems like another path,
which is, well, it's an eval that's not very
standardized and there's a lot of variables
being tested, but like seems to maybe be a
little bit more objective than you played
with the model for 15 minutes and you think
it's better or worse than the other model.
I think I'm somewhere in between
Kate and Kaoutar where I will give
them points for trying something new.
Um, we talked a lot about benchmarks
before on the show and how they're
imperfect, but they're necessary.
So kudos to them for trying something new, but
also really interesting that we're going back
to using games to simulate model performance.
I had a brief stint at Unity Technologies,
which is a game simulation environment.
And a lot of the video
games are built using Unity.
And at the time, all their AI work was on
reinforcement learning agents that ran
in their game simulation environment.
So it feels like we're going back
to how agents initially came about.
Um, and look, game environments
are great because it makes.
It really, it's a clean environment to, to run
a test to get a clear result, but at the same
time, like what is really interesting about
today's technology with like LLM based agents
is they can operate in fuzzy environments.
And I think it's, I think we need to
have better reliable benchmarks on
operating in fuzzy environments that
are changing that are non standard and
it's a difficult, it's difficult to find
these so like kudos for them to trying
and I'm sure there's going to be more
innovative ways of testing coming forward.
Yeah and it gets me thinking just about
like all the possible games you could
apply in the space that might make for
really interesting evals and I guess test
different aspects of like agent behavior.
Well on the topic of agents actually I want
to move us to our next topic and this is a
great segue because I want to talk about BeeAI.
Ideally a great topic because Maya you're here.
BeeAI if you're not familiar is IBM's
agent framework and Maya I understand
there's basically been a new release.
Uh, that just dropped, um, and so, uh, maybe
I'll just kick it over to you initially to
kind of talk a little bit about like what,
what is launching, um, and what are the big
changes people should be paying attention to.
Yeah, of course.
So, um, just framing this, um, it's been
almost a year that my team has been on
this journey of incubating AI agents.
We started with the premise of how
can we make it easy for anyone to
reap the benefits of this technology.
So we went all the way to the everyday builder.
So someone who, might not be familiar with
writing codes, but understands really well their
own processes and has a good intuition for how
to improve them so that kind of fed all the
requirements for how we needed to build agents.
And it was the main motivating
factor to build our own framework.
We did not find at the time the capabilities
we need in order to power this experience.
Um, this also led to another decision.
So if you look at most of the frameworks that
existed at the time, they were all in Python and
we needed something in TypeScript because we're
doing a production ready web app based on that.
So that was a great learning.
Um, I think we recapped the year with, we have
very strong signal from the developer community.
Let's double down on that
before expanding the user base.
And the top ask was for one, a Python
framework, um, which We have a pre alpha right
now that will graduate to alpha next week.
And then the second really interesting
learning is there's not one agentic
architecture to solve every single problem.
So last year when we were talking about
agents vaguely and the fully autonomous
agents there was this hint or promise that
maybe if we founded the right with the
right model and the right architecture,
you could solve a spectrum of problems.
But from a year of learning and
observing developers, every single
use case, it's its own snowflake.
And you have to take the acceptance criteria,
um, take that domain and really build.
Your requirements and your system around that,
and I think the changes that we've made in the
framework reflect the reality on the ground
of how you can make useful stuff from models.
Does that mean you think over time
we'll see kind of agent frameworks
really become more specialized?
Like the dream of the generalized agent,
maybe it's just not a practical reality.
Yeah, I think there's two different plays here.
Um, I think frameworks will
either be narrow and opinionated
or unopinionated and horizontal.
Um, and, and this is a really
interesting paradigm because if you
want to do a code agent, now you have
to learn a whole set of capabilities.
So it feels like we have many walled gardens.
And that's kind of, uh,
what's our next direction.
We're thinking about a world where
you're kind of not locked in in
these different Agent ecosystem.
So you're kind of not locked into
a specific framework or language.
But you all of these agents can come together
and self discover their capabilities.
You could orchestrate them and
you would actually not care which
framework they're implemented with.
So, um, if you refer to like our statement
of what's coming next, we're really
excited about agent interoperability.
And this is really the true premise of like
people working early in the days of agents.
What if an agent can self discover
in other agents and collaborate
together to solve a problem?
This is a step in that direction, and
we're making a really cool announcement
about that in two weeks time.
So, Maya, do you think we're moving
towards, like, a standardization?
Uh, like, maybe creating an open source standard
for these agents interactions, APIs, how they
discover each other, and things like that?
Absolutely.
Like, that's a great question.
I think the model context protocol was
a step in that direction, standardizing
model access to tools and context.
I think agents is what's next.
Um, and the core, what will power
this interoperable experience
is coming together on standards.
But the thing with standards is like,
you can go and design standards by
committee, but if you drive it via
features, then like you have a better
incentive to bring a broader pool together
of people on the standard and that's kind of
our approach like let's show you the art of the
possible with an interoperable agent world and
then that's the hook to like work on a standard
yeah i think the interoperability is so
important because i think otherwise it's just
like are we designing apps like it feels like
in some ways like the sort of dream is that
like actually you know agents are general,
they can roam, they can be interoperable.
And I think this is actually the big
question for all these projects attempting
to kind of preserve openness in the space.
It's just like how long can they kind of
avoid the centrifugal force of like people
creating like walled gardens that are kind of
like only able to kind of talk to themselves.
What's really exciting about Bee
again is that interoperability.
And I know on the Granite side we've
been working closely with the Bee team
on a number of demos and examples.
And it's really great just to see the level
of flexibility that you can build into, uh,
an agent, right, uh, and be able to deploy it.
So really excited to share some
of those resources in the next,
uh, coming days, uh, with Granite
Bees.
So, Maya, maybe two last questions.
I think one of them for you is... You know,
I think when all these discussions, like
it's almost become a little bit of a joke.
I feel like every time we have Chris
Hay on here, he's like "agents!!" And like
makes a big deal about saying "agents."
Um, it's sometimes I think hard for
folks, especially myself, I think
to kind of put their heads around,
like, what do we, what does it mean?
Like when, when an agent is doing something,
um, and I kind of curious, like, is there
a demo where you're like, oh, this is the
awesome thing that I always point people to.
When they're curious about like
why agents are important, exciting.
I'm curious if there's like some
examples you want to throw out.
So there's
actually a great YouTube video
that IBM Research team put out.
Um, I think it's called "SWE Agent" and
it's really interesting because it's kind
of showing you the art of the possible
within an interesting user experience.
So yeah.
Yeah.
Let me paint the picture of how it was before.
If I wanted to do code assistance, I would have
to, like, let's say it's a plugin in VS Code.
I would go to VS Code and then maybe
would kind of like observe what I'm doing.
But I had to like copy
paste things left and right.
And I had to have several touch points in
order to fix maybe one file, for example.
So this completely flips the
paradigm of how to solve.
software engineering problems.
So here, the user experience starts with, I
have a ticket in GitHub that outlines a bug,
invoke, I assign this ticket to the agent,
the agent then goes through that, all of the
files in the repo comes up with a plan, you
could approve or change the plan before it
goes ahead, or you could just let the agent go
ahead, and then the agent comes up with a PR.
And you're no longer in this
instantaneous mode where you ask a
question, you immediately get an answer.
This is something that you let run
for an hour or two, but you just
automated a significant chunk of work.
So let's say you had hundreds of them.
You could unleash 100 agents and come back
the next day and review what they did.
So Maya, is there like any limitations
today in terms of how many agents
can all work together simultaneously?
Yeah, that's more of a like
considerations related to scaling.
So it really depends on how many GPUs
if you're running the models locally,
how many GPUs you have, the ability
for you to have many parallel agents
working.
Um, so it really depends on the capacity
you have to put that together, but
paralyzation and scaling up and down agent
capacity is I think topics that will be
explored more significantly this year.
And I'm starting to get a lot
more questions on that end.
I'm going to move us to our next topic.
Uh, I think we're going to
move on to another IBM release.
I think, Kate, you and I, we actually hyped
this release last week, being like,
Granite 3.2, it's coming, get excited.
Um, and now it has finally dropped.
Um, so, uh, it was good to have you on
the show for this episode to kind of like
walk us through, um, what has launched.
Um, and I guess, you know, we, we can
probably go into it a little bit more
depth than we did last week, is like
what the team has been focused on, um,
for this launch, um, in particular.
And if there's folks.
Uh, things that you think people
should be like looking out for
as they peruse the new offerings.
Yeah, we said it's coming and now it's here.
Uh, so, uh, excited that the, the models
dropped, uh, just on Wednesday this week.
Uh, there's a lot of things that
we packed into this release.
So as we mentioned earlier on last week's
episode, we've got our new reasoning models out.
Just like Claude, uh, we have the
ability to turn, you know, select
reasoning, turn it on and off.
We don't have the same fine grain controls,
but that's absolutely where we want to go.
It's really exciting to see some of our
hypotheses validated by, by Claude there.
But we've got the new reasoning models.
We've got vision models.
So we released our Granite Vision 2B model.
Really excited by that one.
It's small.
It's only 2 billion parameters and it
does a really great job for its size.
You know, on par with pick straw, llama 3.2
11B, and others, particularly on
document understanding tasks, which
is where we've really specialized it.
We trained it working very closely with
our Docling team within IBM Research,
who has some really great tools for
document understanding and parsing.
So you know, part of that release was also
a discussion of the doc FM data set that
we worked on with dockling and trained
on and on top of the language and vision
models, we released a number of other
updates on some of our additional models.
We've got a new embedding model that's
released with a sparse architecture, so
this is kind of a more experimental release,
but it's a more efficient way to do.
Embeddings, which are really important
for retrieval tasks, RAG workflows, that
type of thing, anything where you might
need to search large amounts of text.
You probably want an embedding to search over.
We also released an update.
Our time series team released an
update to the forecasting models.
So these are really, really cool models.
They're only one to two million
parameters in size, but,
they are very powerful, uh,
and, you know, demonstrate some
really, really exciting results.
Uh, there's a GIFT leaderboard that we posted
them to, and I think they're like top three
on, on the GIFT time series leaderboard.
And they now, one of the big updates
with this release is they've got a daily,
uh, and weekly forecast resolution.
We've got more types of forecasting
that you can run, and we released
the updated Granite Guardian models.
So Granite Guardian are our models that
you can use to kind of monitor inputs
and outputs to a model for safety.
And before they were two billion and
eight billion parameters in size,
we've now reduced them to five billion
parameters in size and to a small
MoE Model that only uses 800 million
activated parameters at inference time.
So we really focused on efficiency with
that release for Granite Guardian, allowing
the guardrails detections to move much
faster with lower latency for user while
maintaining the same functionality.
So it was kind of like a rapid whirlwind
release, but you know, it kind of helps
demonstrate kind of the scale that
we're building out with the Granite
family, all the different features
and functionalities that are coming.
Uh, so really, really excited
for folks to check it out.
A lot of cool, uh, demos, recipes,
how to use it all available on
ibm.com/granite
So it's a, it's a lot, uh, and definitely
encourage folks to take a look through it.
Okay, actually, I wonder if, you know,
it's like, as the opportunity of having
you on the show, I think is actually
it's a chance to sort of like kind
of peek under the hood a little bit.
I think like from the outside,
people are like, oh, new models.
But I think I'm wondering if, you know,
like, I think we've talked about a couple
of generations of launch of Granite now.
Seems like every single time the Granite
team is like basically broadening
the scope of things that it's launching, right?
So like, you know, the Guardian
offerings get more complex.
The vision models are new, you know,
there's forecasting models now.
Um, I'm wondering if you could talk a
little bit about like how this is looking
from the inside at IBM, like, you know, is
the, is the team having to change, right?
I think is the question I'm really
interested in to kind of like accommodate
the fact that like, Granite is becoming
a much broader project over time.
Um, I asked, I think one of the really
interesting questions that I'm sure a
number of our listeners will have is like
they too are trying to figure out like
how to organize their businesses to most
effectively deliver on models, use models.
And so I'm kind of interested just
like in your, your reflections, I think
on like how the team has evolved as
Granite has been tasked with taking on.
sort of more and more, I guess, here.
Um, and like if the process
has changed and all that.
Yeah, I, I think there's a number of different
things that we've been going through on
our, our granite journey, so to speak, and
our broader strategy that, that might be
interesting to folks, uh, listening in.
So, you know, first and foremost, I think
that IBM is trying to play to our strengths.
Versus like out frontier lab, a frontier lab.
So IBM strengths, I think,
are our talent, our skill set.
We've got over, you know, 2000
researchers globally, all with
expertise in a lot of different domains.
We've got experts on time
series and forecasting.
We have some really incredible
groups all around research.
So our strategy has been to start
with language and develop a core
capability and then work to bring in.
larger and larger portions across IBM
research and expertise to figure out
how can we develop more tooling to help
developers experiences and top use cases.
What does generative AI, what does this new
form of computing, which is ultimately IBM's
research, is to invent what's next in computing.
So what does generative AI really
enable, uh, in this new, new domain
across all these different spaces.
We've got teams working on accelerate
discovery, discovery and chemistry.
I mean, so we've really been
taking that approach of starting,
starting with the core language.
That's what everyone knows and
then brought in bringing these new
domains and expertise in areas.
So, you know, some of the work we're
going to be releasing next, for
example, is going to be around speech.
Uh, so that's going to be
coming later this spring.
So we've really taken that, you know,
seed and then scale approach,
and we're also really trying to
focus on the developer experience.
What are the tools that a developer
might need to run different workflows?
A lot of tools don't and probably
shouldn't be huge honking models.
We need small lightweight models like we
are working on with the Docling team,
for example, on being able to analyze and
extract key information from documents.
We need embedding models
that are efficient and smart.
We need guardrail models.
We need the ability to run forecasts.
Like you need multiple tools in your toolkit.
And so we're really focusing on how to build
out that, that ecosystem, uh, that is all
again, powered and rethought with generative AI
instead of building one big model to kind of,
rule them all.
Uh, so that, that I think is kind of the broader
journey we've been on in last year, and I think
we're seeing a lot of great adoption and uptick.
The time series models, for example,
they have over like 600,000
weekly downloads on Hugging Face.
We're seeing a huge demand for these smaller
models that are more fit for purpose.
That is, developers are looking
like, what can I just practically
get my hands on, run locally even?
Um, they're being really
effective tools in that space.
For sure. Maya,
yeah, it sounds like it actually has
some really parallel, interesting
parallels to the Bee experience, right?
Where you like started with like,
well, one framework for everything.
And then developers are like, we really need it
in Python, and we need it to be more specific.
And then you're kind of like, okay,
well, we got to pivot around that.
I don't know if like this is kind of resonating
with, with what you, you all have experienced.
Yeah, absolutely.
Like, I think.
The key lesson was like going all in on
flexibility and also I would say that's
not just on the agent level like if you
look at some of the strategies of other
model creators, so Claude before they
had, I think it was called Opus family
of models, which was their larger one.
And now it seems they're doubling
down on the smaller Sonnet ones.
So I think this is also an interesting
paradigm where we're moving away from
these humongous models that the closed
sourced frontier providers were going
after, because we're seeing that actually
smaller approaches can work better.
It's like the bigger models are cooler, but
actually day to day you don't actually use them.
Um, I mean, sorry.
Not cool in a technical sense, right?
Like everybody's very excited about the
biggest model, but like when it comes down to
it, it's like kind of using the small ones.
That is actually the really important thing.
Well, and you need a mix, right?
You're never going to get away from them, but
it's, you know, we think a lot of things can
be accomplished with a much smaller model.
Yeah, I agree with both of you, Maya,
Kate, and I think IBM has this enterprise
first AI approach and it's setting a new
standard for these efficient, trustworthy AI.
Open source is evolving beyond just being
accessible, but also enterprise ready.
And I think that's a very important aspect here.
So I'm going to move us on to
our final topic of the day.
Um, this is just sort of an
interesting paper that's been getting
a lot of chatter on social media.
Um, it's a paper entitled "Emergent misalignment."
Um, and I'll give you kind of like
the, the general summary of it.
Um, and Kaoutar, we'd love
your kind of thoughts on this.
Uh, I thought of you when I
was like reading this paper.
Basically what the researchers did is
they said, okay, we're going to take
a model, um, and then we're going to
fine tune it on like a very specific
kind of, bad task.
And so the task was basically like,
can we fine tune the model to generate
insecure code without warning the user?
Um, and then they turn around and say,
okay, well, once it's fine tuned, like
it seems like now the model is badly
behaved in all sorts of different ways.
So they say, okay, well,
it'll give you bad advice.
And it has like, kind of like not
so great political opinions and
a whole range of other things.
Sort of what, what they're arguing is,
well, it's kind of interesting that
you take like this one kind of specific
task, which is like a little bad, and it
turns out that the whole model kind of
steers in a bad direction as a result.
Um, and I guess counter like,
it's kind of a fun results.
Um, you know, I think there's a lot
being debated about like what exactly
it means, if anything, but curious
about what you thought about the paper.
And I guess what you think it kind of suggests
about sort of safety and fine tuning models.
Yeah, definitely.
Very interesting research.
And you know, this research really
showcases that when you're doing these
fine tuning, like here fine tuning this
AI model for software development, it
inadvertently made them better at generating
these malicious code as well as as well.
So some of the key takeaways that, you know,
from reading this paper is fine tuning AI
for software development skills made the
model better at writing malicious code.
And so what what this is telling us is when
the models were optimized to write better code,
they also became very proficient in generating
exploits, backdoors, security vulnerabilities,
and the models were not really explicitly
trained for hacking, but their enhanced coding,
just this capability that they acquire through
the fine tuning naturally extend to this area.
So, so the question here is this skill
tuning doesn't just improve AI, it also
alters what we call the safety guardrails.
And this can be very dangerous.
So these AI systems, they're not just modular.
So where you improve one aspect
can intentionally, you know,
weakens, you know, another one.
So, um, and, you know, this is also telling us
this, you know, the AI alignment isn't static.
Models learn here in unpredictable ways.
So fine tuning can interact with
existing knowledge in unexpected ways.
And here it's leading
to um, uh, emergent behaviours
that we really didn't expect.
So are we, you know, here entering
this era where this fine tuning
is creating these security risks?
I think yes.
And so because this fine tuning is not
just a surgical procedure, it really
affects the entire model here in, in ways
that we, we sometimes don't anticipate.
Um, so I think this should also kind of
make us think how AI safety should evolve.
So the findings from this paper, um,
highlight that AI security is not just
about setting initial safeguards, but about
also ongoing monitoring and adaptation.
And so kind of continuous, um, uh, red
teaming and adversarial testing that we
have to continuously evaluate as we're fine
tuning or improving these models for certain
tasks, for specialized tasks, we might,
you know, have these unexpected results.
So we have to continuously red team and do
this adversarial testing to make sure that,
uh, we're not altering these safeguards.
Maya, can I ask, why
would this paper be so?
Like I was having a debate with a friend on
this, which is like, you know, just because
it's malicious code doesn't mean that it's
created with like bad intent, you know,
like, you know, people look at malicious code
because they are computer security researchers
trying to make, you know, machines more safe.
But there's something almost kind of like
inherent in this malicious code that the model
is inferring about how it should behave and
I don't know, it's kind of like, it's sort
of a weird result in that sense, right?
Like there's like, it just assumes that there's
kind of like some deep badness in these tokens.
Um, do you buy that interpretation?
So this paper opens more
questions than it answers, um, so
Like any good paper.
My takeaway from it is this, it's kind
of like confirming, uh, flip the switch
theory or, and other people call it the
Waluigi effect from Mario and Luigi,
if you ignite something small, then bad in
Luigi, you flipped on the Luigi bad switch.
Um, but we don't have a
theory on why that's the case.
We don't have a proof to this theory.
This is a theory that existed
prior to this paper coming out.
This is a data point that proves that this
is possibly a flip the switch sort of result.
That few data points can completely flip it.
Um, I don't have a strong...
Yeah, I don't have the technical background
to provide a proof for that, but I think
it would be an exciting room for research,
but I, I would also echo what Kaoutar said.
For me, my takeaway is model
alignment is fragile and there's
a lot of unintended side effects.
I also had a brief stint, like incubating
our fine tuning stack and fine tuning
is a really hard task to do right.
Yeah,
definitely.
And I think this is just
more data backing that up.
Yeah, I think it's just kind of like,
yeah, like it's like you fix one
problem, it creates more problems.
You know, it's just like very difficult game.
Uh, Kate, one interpretation of this.
I don't know if I'm praising Granite
too much, it's like, this is the
triumph of the Guardian model, right?
Like, we can't get models to be safe, and
so we will always need some kind of other
model that keeps an eye out for things.
Is that the right way of thinking about it?
Like, almost like the dream of creating
models that are kind of out of the box safe.
Might might be really
difficult for us to achieve.
There's maybe one outcome here.
I think that's important but independent I think
when taking a look at safety, you always need a
systems based approach where you have multiple
layers of safety checks and requirements and
that's just like best practices we've developed
from cyber security in other areas over the
past 50+ years. So having models like
Granite Guardian are always going to be important
in that sense. But, you know, I honestly,
I wasn't surprised at all by the findings.
So, you look at, I mean, to echo what Kaoutar
and Maya both have said, fine tuning does
put the model in a much more brittle space.
So, much easier to, uh, to potentially
break some sort of alignment that
the model's been trained with.
But if you look at what they did with
some of the controls that they ran in
the experiment, it's really interesting.
They had a version where
they fine tuned the model.
And they said, okay, generate malicious code.
They had a version where they fine
tuned the model and they said,
generate malicious code for educational
experience, for educational purposes.
Any sort of fine tuning, whether
it's for educational purposes, other
fine tuning, or security, had some
breaking of the safety alignment.
But it was only when they had the fine
tuning for generate malicious code that
it totally wiped out all the other safety,
uh, alignment that it was trained with.
If, when it was trained to generate malicious
code for educational purposes, most of
the other safety alignment was preserved.
And so that does get to your question
of intent, and I think reflects
just how these models are trained.
They're trained and they're in stages.
There's often safety alignment that's done
with huge batches of data that go through
all the scenarios that a model shouldn't
do, and the model saying, I'm sorry, I
can't help you with that request, or some
sort of, you know, rejection statement.
And so if you're training a model to
ignore that rejection statement, it's
not that big a stretch in my mind that
it would also ignore that rejection
statement for other things that it's seen.
Um, you're kind of overriding that.
Uh, but if you're training the model
and, you know, it's very orthogonal, I
guess, to how it was originally trained.
If you're training the model to still
be helpful, just we're redefining what
helpful means, uh, you see much lower
breaking in the original model limit.
So I wasn't terribly surprised.
I think it does emphasize the need to
find ways to go beyond fine tuning.
I think fine tuning's life is limited.
Uh, especially as we get into different
architectures like mixture of experts,
where there's going to be more and
more ways to reserve experts, even
without MoE, reserve parameters in a
model, where you aren't overwriting
someone else's fine tuning, so to speak.
You are saving space in the model.
to add additional parameters on top, uh,
and customize them and to ingest those.
And I think that's going to allow us to
preserve much more of the original alignment
while adding additional alignment to the
model without having the same degree of kind
of you know, brittleness that we're seeing,
or even these types of adversarial effects.
Yeah, it's really interesting to kind of
think about the idea that like, because
I guess I've been so fine tuning pilled.
I'm like, oh yeah, this is just
the way we get alignment to work.
You're almost kind of saying like, actually,
maybe that's just like, kind of historical.
Like, you know, we'll look back in a few
years and be like, I remember that when
we used to do all that fine tuning stuff.
Well, and now it's all RL, right?
So we're reducing less and less on fine tuning.
So that also makes it even harder to fine tune
the model out of that original distribution.
You know, so yeah, I think for a number
of reasons, fine tuning is going to
be more and more difficult to use.
We're just gonna find better ways to
go about customization moving forward.
Absolutely.
Maya, do you wanna get in the last word here?
I was just gonna
say fine tuning is hard.
Painful . Definitely.
Uh, I think that's a very good note to end on.
I think probably as a mantra that we should
be telling ourselves every day is that fine
tuning is a huge pain and very difficult.
Um, and that's, uh, all the
time that we have for today.
Um, so thanks for joining us, uh, Kate, Kaoutar,
Maya. Always a pleasure to have you on the show.
Um, and thanks listeners for tuning in.
Um, if you enjoyed what you heard, you
can get us on Apple Podcasts, Spotify,
and podcast platforms everywhere.
And we will see you next
week on Mixture of Experts.