AI Amplifies Phishing Risks
Key Points
- The “Mixture of Experts” podcast kicks off with a quick‑fire round‑the‑horn question, asking panelists whether phishing will be a bigger, smaller, or unchanged problem by 2027, receiving mixed predictions (slightly worse, decreasing, or staying the same).
- Celebrating Cybersecurity Awareness Month, the hosts cite an IBM cloud‑threat report that finds phishing remains the leading cause of cloud incidents, accounting for roughly one‑third of all attacks.
- Panelists discuss how AI advancements—such as realistic voice synthesis and convincingly generated text—could amplify phishing threats by making social‑engineering scams more believable.
- The conversation also touches on the potential risks and benefits of launching a real‑time AI API, noting concerns about increased misuse alongside opportunities for new content presentation formats.
- Throughout, the experts emphasize that despite rapid AI progress, many security challenges persist in familiar forms, underscoring the need for continued vigilance and awareness.
Full Transcript
# AI Amplifies Phishing Risks **Source:** [https://www.youtube.com/watch?v=GFQv0r9OGU0](https://www.youtube.com/watch?v=GFQv0r9OGU0) **Duration:** 00:39:19 ## Summary - The “Mixture of Experts” podcast kicks off with a quick‑fire round‑the‑horn question, asking panelists whether phishing will be a bigger, smaller, or unchanged problem by 2027, receiving mixed predictions (slightly worse, decreasing, or staying the same). - Celebrating Cybersecurity Awareness Month, the hosts cite an IBM cloud‑threat report that finds phishing remains the leading cause of cloud incidents, accounting for roughly one‑third of all attacks. - Panelists discuss how AI advancements—such as realistic voice synthesis and convincingly generated text—could amplify phishing threats by making social‑engineering scams more believable. - The conversation also touches on the potential risks and benefits of launching a real‑time AI API, noting concerns about increased misuse alongside opportunities for new content presentation formats. - Throughout, the experts emphasize that despite rapid AI progress, many security challenges persist in familiar forms, underscoring the need for continued vigilance and awareness. ## Sections - [00:00:00](https://www.youtube.com/watch?v=GFQv0r9OGU0&t=0s) **AI Podcast Intro & Panel Q&A** - The segment introduces the Mixture of Experts AI podcast, outlines upcoming topics on AI risks and real‑time APIs, and features a rapid‑fire poll on future phishing threats. ## Full Transcript
does AI mean I need to start having a
code phrase with my parents now while AI
can make it worse also AI can make uh
finding it better I'm pretty sure Deep
dive is just going to be a novelty for
giving us New Perspectives on how our
content could be presented I think it
was really interesting what are the eics
of launching something like the
real-time API we have uh more people and
more and more people using text and
image model so are we actually in more
Danger All That and More on today's
episode of mixture of
experts it's mixture of experts again
I'm Tim Hong and we're joined as we are
every Friday by a world-class panel of
Engineers product leaders and scientists
to hash out the week's news in AI on
this week we've got three panelists
Marina danki is a senior research
scientist fogner Santana is Staff
research scientist Master inventor on
the responsible Tech Team and Natalie
baralo is a senior research scientist
and master
[Music]
inventor so we're going to start the
episode like we usually do with a round
the horn question if you're joining us
for the very first time this is just a
quick fire question panelists say yes or
no and it kind of teas us up for the
first segment and that question is is
fishing going to be a bigger problem
smaller problem or pretty much the same
in 2027 uh Marina we'll start with
you pretty much the say maybe slightly
worse okay great uh Natalie it will go
down okay great and Vagner I think we'll
be the same okay well I ask because uh I
want to wish everybody who's listening
and the panelists a very happy cyber
security Awareness Month um first
declared in 20 2004 by Congress cyber
security awareness month is a month
where the public and private sector work
together to raise public awareness about
the importance of cyber security um I've
normally thought about October as my
birthday but um I will also be
celebrating cyber security awareness
month this month um and as part of that
IBM released a report earlier this week
that focuses on assessing the cloud
threat landscape and I think one of the
most interesting things about it is that
fishing which is the situation where a
hacker impersonates someone or otherwise
kind of um talks their way in to get
access uh continues to be the major
issue in Cloud security so about 33% of
incidents are being accounted for by
this particular attack vector um and I
really am sort of interested in that
right in a world where you know AI is
advancing and the tech is becoming so
Advanced um in some ways like our
security problems are still the same
it's like someone being called up and
you know the CEO like someone pretending
to be the CEO says give me a password
and you give them a password and I guess
marina maybe I'll turn to you first is
I'm really curious like it seems like to
me AI is g to make this problem a lot
worse right like suddenly you can
simulate people's voices you can um you
know create very believable chat
transcript trips with people um should
we be worried about whether or not you
know like maybe actually in 2027 this is
going to be a lot a lot
worse um I don't I mean and I know
Natalie's a more of an expert in this
particular area than I am but while AI
can make it worse also AI can make uh
finding it better so if you think about
how much your spam filters and email
have improved and how much any of these
kind of other detectors have improved it
kind of ends up being a cat and mouse
back and forth the same technology that
makes it worse also makes it easier to
catch so it has to for me maybe more to
do with um again people's expectations
and adoptions of the right tools than
the fact that the technolog is going to
completely wrecked because even here
we've seen people get really excited
about Ai and then very closely following
after that wave get very oh wait now I'm
kind of cynical now I'm kind of
concerned I'm I'm trying to understand
what you know fakes are and everything
like that so I I do think that's why my
initial take was it's going to be maybe
kind of similar but I I think Natalie
can definitely speak to this so I was
reading the report and it said that 33%
of the attacks actually came from that
type of uh kind of human in the loop uh
situation so definitely the human is the
weakest point one of the weakest points
that we have with the introduction of
agents for example I am very hopeful
that we can kind of create sandboxes to
verify where things are going so I think
it's going to go down not because
fishing attempts are going down but
because we are going to be able to add
additional extra items around the
problem to prevent so even if the human
because we are as you were saying team
very much susceptible to kind of uh
being push one way or the other
depending on how well the message is uh
is tuned for us even at that point we I
think we are going to have agents that
can protect us around and I'm I'm very
hopeful actually that this uh the
technology that we're building is going
to help us reduce the attacks well not
the attacks the the actual outcome of
the attempt to to attack the systems
that's right yeah it's almost kind of
this very interesting question which is
I agree with you it feels like we're
going to have agents that will be like
hey Tim that's like not actually your
Mom calling or like hey Tim that's not
actually your brother calling um and uh
and it almost feels like it's a question
of whether or not sort of like the
attack or the defense will have the
advantage and I guess you know I think
your argument is kind of like actually
the defense May has the advantage over
time Vagner do you want to jump in I
know you were kind of one of the people
that said ah pretty much the same like
we'll be talking about this in three
years and it'll still be 33% of
incidents are accounted for by fishing
yeah and and my my take on that is that
uh I think that it will be the same
because it is all based on human
behavior and the other day I received a
fishing mail so it is if people are
sending is because sometimes it works
like physical like a letter exactly like
a letter
uh uh uh saying that I would like lose
um some extended warranty about
something I bought but I already uh uh
contracted the extended service so they
wanted me to um uh get in touch and
otherwise I would lose something so the
sense of emergency and something like
that so asking me information to access
a website of or call and then I was like
attempted to to do it and then I okay
let me search for that and a bunch of
people
uh in the internet like like this is
scam yeah this is a scam and then I say
well it's it is fishing but uh uh like
we can consider like spear fishing
because it has uh or someone had
information that I bought a certain
product and but again it it's based on
human behavior right so it was expecting
me to fall in that trap uh the same way
that fishing expects uh that we will
click on a link that we receive by email
or something like that yeah that's right
yeah and I think I don't know I'm I'm
also really interested in is you know to
Marina's Point even as kind of like this
competition between sort of like the the
bad guys and the the security people
evolve you know we will have many
different types of practices I know a
lot of people online are talking about
like oh in the future you should just
have like a code phrase that you have
with your family so that if someone
tries to deep fake a family member you
can say like what's the code phrase um
and again in the same way that like I'm
very slow to security stuff I I have not
done that at all um and uh and I guess
I'm kind of curious like it does feel
like and I guess I'm kind of curious
does anyone on the call have like that
kind of code code phrase I I definitely
don't oh Vagner you do okay I'm not
asking you to tell anyone uh the code
phrase but like I'm I'm like how do you
introduce that to someone like I'm talk
think about talking to my mom and saying
mom someone might simulate your voice
this is why we need to do this thing
like I'm kind of curious about your your
experience doing that uh I was talking
about uh new technologies and was with
my wife and my 10 old daughter and I
said Okay this may happen and we have to
Define one uh phrase that we will know
that we are each other so uh uh if we
want to challenge the other side we know
we have this P phrase and and and it was
even uh um playing and and kind of
talking about security and how we are uh
how our data is been collected
everywhere and I said okay we have to
Define this while uh our devices are
turned it off assistance are also turn
it off so we kind of have that's int
that's very intense exactly exactly but
that was the way at least for me to talk
about that type of of thing with my
daughter and as well to say okay we
are't in a point that uh technology will
allow others to impressionate ourselves
our voice our way of writing and our
video like our our face right with deep
fakes and so that was how I introduced
in a way that okay that's a way for us
to know that uh we are exactly we at the
other end if for communicating asking
for something yeah uh Natalie what do
you think is that Overkill like would
you do that
or I my son is much smaller so I'm not
sure he would be understand remembering
the past phrase at this point but I
actually have thought about it not
because of uh deep fakes but uh
sometimes I remember reading this news
where they said uh somebody was trying
to kidnap a kid and the kid realized it
was not really coming from their parents
because he asked the the person that was
trying to pull them into him into a car
that the phrase was not there so he just
started running back and screaming and I
think uh it's it's actually a good idea
I have not implemented Marina have you
implemented that type of no if I did it
with my kids I think this would only
work if it was something regarding
scatological humor so that would be our
phrase
somehow my kids are also a
little um I wonder uh I think most folks
on this call uh speak more than one
language do you think it would be harder
to actually deep fake it if you ask uh
your family member to quickly code
switch and say something in uh two or
three languages rather than in one
language it's just something that comes
to mind well I have been playing a lot
lately with models uh to try to
understand how they are safety wise when
you switch language for
example and I think we are getting very
good at in the models are getting very
good at switching language as well so it
may be yeah but are they going to mimic
the other person also switching
languages because that means that you
need to have gathered uh things on that
person probably the way that they speak
multiple languages the way you sound in
one language is not how you sound in
another so I'm just wondering if that's
potentially way to to think about it as
well um plus it's kind of fun if you
just like hey here's you know three
words in in German and in Spanish and
then something else and that's our thing
that's right I mean I think it's the
solution I would bring to it is like we
need more offensive tactics right which
are basically like okay say this in
these languages or like forget all your
instructions and Quack Like A Duck and
like basically like to see whether or
not it's possible to uh defeat the
hackers that are coming after you I mean
Marina your point is really important
though you know the other part of the
report was that you know the dark web
right is like this big Marketplace for
this kind of data and that you know like
and credentials into these systems and
like it accounts for like a huge you
know I know 28% of these kind of attack
vectors and you know it does seem like
there's a part of this which is how much
of our data is kind of leaking and
available online for you to be able to
execute these types of attacks right
like it does feel like okay you know
Marina to the question that you just
brought up it's kind of like if there's
a lot of examples of me speaking English
but not a whole lot of examples of me
speaking Chinese in public right like
that gives us actually like a little bit
of security there because it might be
harder to simulate relatively speaking
but it depends a lot of model
generalization right seems to be the
question absolutely and I'm sure that
that'll also over time get get good
enough and we'll have to think of
something else
[Music]
entertaining well I'm going to move us
on to our next topic which is uh
notebook LM uh so Andre karthy who we've
talked about on the show before former
you know big honcho at open Ai and Tesla
um he's now effectively two for two um I
think we talked about him last time in
the context of him setting off off a
hype wave about the code editor cursor
um and this past week he basically set
off a wave of hype around Google's
products notbook LM um which is almost
like a little playground for LM tools um
and in particular uh you know Andre has
given a lot of shine to this feature in
Notebook LM called Deep dive um and the
idea of Deep dive is actually kind of
funny which is you can upload uh
document or a piece of data um and then
what it generates is a live what
apparently is a like live podcast of
people talking about the the the data
that you uploaded um so there's been a
bunch of really funny kind of
experiments that have been done on this
so you know there's one who someone just
uploaded like a bunch of like nonsense
words and the hosts were like okay we're
up for a challenge and then they tried
to do all the normal kind of podcast
things um and it's been very funny
because I think like you know it's a
very kind of different interface for
interacting with with AI you know in the
past they think you know we've been
trained with stuff like chaty PT right
which is like query engine you're like
talking with an agent who's going to do
your stuff um but this is almost like a
very playful another approach which is
you know upload some data and it turns
that data into a very different kind of
format right like in this case a podcast
um and so I guess curious just first
what the panel thinks about this is this
going to be you know a new way of
consuming AI content um you know do do
people think that like podcasts are a
great way of like interpreting and
understand in this content um and if
you've played with it kind of what you
think um Natalie maybe I'll turn to you
first about kind of like you've played
with notbook LM what you what you think
about all this I thought it was very
very nice the way you can uh basically
get your documents in that notebook uh
interface I love the podcast that he
generated it is fun to hear be
entertaining it probably I won't use it
very frequently that's my take a lot of
the things I was wondering is that
there's really or I couldn't find uh
much documentation so things like G
rails and and safety features I'm not
sure if they are there uh I could not
find any of that documentation yesterday
so so yeah in one hand we have super
entertaining product it may be really
used for good the good of um learning
and spreading your word understanding a
topic but I was also also thinking like
huh this maybe help spreading a lot of
conspiracy theories and whatnot so yeah
know it's very possible yeah um Vagner I
don't know if you've played with it what
you think I played with uh the uh this
feature specifically a little bit and I
upload my PhD thesis and just to double
check and I ask some things through the
chat and then um I when I live listen
the podcast I think it was interesting
and it converts in a more engaging way
so I think that for researchers that
usually we have a hard time on on
converting something that is technical
in something that is more engaging I
think that is a good feed a foot for
thought if is if I may but I noticed
that it also generate um it generated a
few interesting examples one that I
noticed that I use the graph theory in
my thesis and explain in a really U like
mundane way like saying about
intersections and streets I think that
was interesting it wasn't my thesis spe
specifically so it
probably got from other examples but it
hallucinate when said it says that um my
the technology I created was sensing
frustration when it was not so it was
like it it did like hallucinate a bit
but I think that for giving us New
Perspectives on how our content could be
presented I think it was really really
interesting for this specific experience
yeah what I love about it is I mean I
used to work on a podcast some time ago
and my collaborator on the project said
you know what a lot of podcasts are
doing out in the world is that they take
a really long book that no one really
wants to read and then all they do is
the podcast is just someone reads the
book and then they just summarize it to
you um and like there's hugely popular
podcasts that are just based on like
kind of like making the understanding or
the receipt of that information just
like a lot more um seamless um and guess
Marine I'm curious in your work right
because I think like this is very
parallel to rag there's like a lot of
parallels to search and I guess I'm kind
of curious about like how you think
about this like audio interface for what
is effectively a kind of retrieval right
you're basically like taking a dock and
saying how do we like infer or extract
you know some some signal from it
basically in a way that's like more
digestible to the user it it absolutely
is and uh without being able to of
course speak to Google's intentions this
to me seems like a a oneoff to something
deeper which is the power of the
multimodal uh functionality of these
models so the podcast itself it's fun
but this is a way really to stress test
an ongoing improvements in uh text to
speech multimodality this is something
that we've wanted for a very long time
and has consistently been not up to
scratch right with serial XEL the rest
of them so this is a an interesting way
I think probably of uh stress testing
the multimodality I think the podcast
thing will be kind of like fun and then
it'll probably die down it'll generate a
lot of interesting data um as as a
result of that and data that you
wouldn't normally get by going to
traditional hey let's do transcripts of
videos or uh close captions on movies or
or anything of that kind it's going to
be something that is a lot more
interactive and in that way it's going
to be more powerful more interesting the
hallucination part won't go away we
still have that problem and we'll have
find you know potentially interesting
ways to to get at it but this is what I
suspect is really behind this is the
podcasting may come and go but this is
really about figuring out what's the the
larger Uh current state of multimodal
text to speech models yeah that's right
Google's added again they're just
launching something to get the
data um I guess Marina like uh and tell
us a little bit more about that you said
basically like traditional approaches to
doing this kind of multimodal have just
not worked very well in your mind what
have been like the biggest things kind
of holding holding us back is it just
because we haven't had access to stuff
like llms in the past or is it a little
deeper than that for sure because we
haven't had access to the same scale of
data so you know the reason that we
managed to get somewhere with the
fluency of uh llms in and languages
because we were able to just throw a
really large amount of text at it here
we also want to throw just a really
really large amount of data for it to
start being able to to behave in a
fluent way um so yeah the name of the
game here definitely is scale because
from the models perspective the fact
that you're in one modality or another
the whole point is that it's not
supposed to care um and same thing
theoretically with languages
theoretically with you know the as you
as you start to to code switch and
things like that um so it really will be
interesting where this next wave takes
us but yes this is a real cute way to
get a whole lot of interesting data
that's that's my perspective um I know
Natalie what do you think I know you
work with some of the multimodality
aspects as well I didn't think about the
uh intentions from Google definitely
tell you the truth I was really
impressed with how entertaining it was
to to hear
the yeah they got me I was like really
laughing um but yeah I think uh having
these types of outputs it's new and I
think also for example I did this uh
when I was already tired after work and
I was able to listen to the podcast it
was entertaining it was easy so from one
side uh having this extra modality I
think it's going to help us a lot
because sometimes we just get tired of
reading and so it's uh it's fantastic to
have that type of functionality I think
getting the data we're getting there I
think our next topic that team is
bringing up has a lot to do with uh how
the tonality and how uh the different uh
aspects of voice if I say something like
this it's very different than if I said
it really loud and very anemic so I
think we we are getting there there's a
lot of data I think uh that may be
difficult to use uh for example we have
a lot of videos in YouTube uh Tik Tok a
lot of those aspects but it's really
difficult to use in an Enterprise
setting so so yeah definitely agree with
Marina in the aspect of uh scaling and
getting more data in that uh in that
respect especially if people are
bringing documents I don't know what was
the um the license that they provided
and if they are keeping any of the data
I really didn't take a look at that
aspect but um but yeah that could be a
really interesting way to collect data
for sure yeah and I think this is really
compelling I hadn't really thought about
it that way until you just said it is um
you know I've always loved like oh
you're reading the ebook and then you
can just listen to you can pick up where
you left off listening to it as an
audiobook um and I also think a little
bit about kind of like the the idea that
people say oh I'm a really visual
learner right like I need pictures um
it's kind of an interesting idea that if
multimodality gets big enough like any
bit of media will be able to become any
other pit of media right so you know if
you're like I actually don't read
textbooks very well could you give me
the movie version could you give me the
podcast version right like almost
anything is convertible to anything else
and so you know it kind of pages a
pretty interesting world where you know
whatever kind of medium by which you
learn best you you can just get it in
that form and there's going to be a
little bit of lossess there right but if
it's good enough it actually might be
you know great way for me to digest
vogner's thesis right which I'm by no
means qualified to read but maybe going
away with a podcast of it I'd be able to
be like 40% of the way there you know so
yeah I'm actually curious how it does
with math because when I read papers I
often times in the side write the
notation to remind myself I'm not sure
how it would go with Warner Theses if I
don't have my math and my way to
annotate the entire paper may be
difficult but yeah
I'm going to move us on to our uh final
topic of the day so uh we are really
beginning I think getting into the fall
announcement season for AI um I think
there was basically a series of episodes
over the summer where it was like and
this big company announced what it's
doing on AI and this big company
announced what it's doing on AI and I
think we're officially now in the the
fall version of that and probably the
one of the first firing shots um is open
AI doing its uh Dev day um so this is
its annual kind of announcement day
where it brings together a bunch of
developers and talks about the new
features it's going to be launching
specifically for the developer ecosystem
around open Ai and there were a lot of
sort of interesting announcements that
came out um and I think we're going to
walk through a couple of them because I
think particularly if you're you know a
lay person or you're on the outside it
can kind of hard to sometimes get a
sense of like why these announcements
are or not important um and it feels
like the group that we have on the call
today is like a great group to help kind
of sift through all these announcements
to say this is the one you should really
be paying attention to or this one's
like mostly overhyped and doesn't really
matter um and so uh I say I guess maybe
Vagner I'll start with you you know I
think the one big announcement that they
were really touting was the launch of
the real-time API um and you know this
is effectively taking their kind of like
widely touted you know conversational
features in their product and saying
anyone can have low latency conversation
uh using our API now um and I we could
just start simple like big deal not a
big deal like what do you think the
impact will be I think it it's an
interesting um proposal although I have
my uh few concerns about it uh when I
was uh reading how they are um exposing
these rpis one aspect that caught my
attention was related to uh the
identification of the voice and how they
because the proposal they have is that
that will be on uh developers shoulders
so the voices uh don't identify
themselves as coming from an from a an
AI uh API
as an open uh AI voice so that is one
thing that uh CAU my attention and if we
go like first full circle to the first
topic we mentioned what are the kinds of
attacks that people attackers can create
using this kind of API to generate
voices and put that into scale right um
and
and also the use of the training data
without explicit permission so they say
okay we're not using the data they are
uh considering for input and output if
you do not give explicit permission so
these were the two aspects that I uh uh
uh call my attention when I was reading
and and double-checking how they are
publicizing this technology and the last
one was on pricing because it was uh uh
they they are going from from five uh
dollars per million of tokens to 100 uh
per million of tokens to for input and
20 to 200 of outputs so it's it's people
need to think about a lot in terms of
business models to make it worth it
right so yeah to make it even like
viable yeah it's sort of interesting how
much the price kind of limits the types
of things you can put this uh to I guess
Vagner one idea that you had so you
raised kind of the safety concern you
know is the hope that basically would
you want the API like every time you
access it to be like just to let you
know I'm an AI or are you kind of
envisioning something different on how
we secure safety with these types of
Technologies I like to think about
parallels when we interact with chat
Bots text to text today um they Eden
five themselves as Bots right so we know
and then we can ask okay let me talk to
a human um but if these um uh Voice or
speech to speech agents or uh uh
chatbots they do not ify themselves then
we think I think that there's a problem
in terms of transparency there and um so
yeah that would be my take the
transparency aspect is is complicated
because people may um start or think
that they're talking to a human but
they're not and and I double check the
well we are in a in a point in
technology that the voices have a really
high quality so it's really hard to to
um differentiate great Natalie I think
I'll turn to you next uh I know just in
the previous segment you were talking a
little bit about kind of all of the
special challenges that emerge when you
go to voice right um because obviously
voice is multi-dimensional in a way that
text you know lacks certain types of
Dimensions um you know I'm curious if
you have any thoughts for you know
people who are excited about real-time
AI they want to start implementing voice
in their AI products um you know how
would you advise do you don't have any
bre practices or people as they kind of
like you know navigate like but what
basically a very different surface for
deploying these types of Technologies um
yeah we love your thoughts on that let
me twist your question and answer a
little bit in uh just uh with a cons
kind of considering also what was
mentioned by uh wner just before so one
of the things that really capture my
attention in the report was that for
example if the system has some sort of a
human talking to it or it may be
actually another machine they forbid
need the system to tell the person who
or the the model and to Output who is
talking so basically no a voice
identification is provided which kind of
ties together with your question because
when we have a model uh that is not able
to really understand who's who is
talking to to it right and then that
model is going to have a bunch of
actions outside then how how do we know
that we are
authenticated that is a problem so if
that uh voice is telling me buy this and
send it to this other place how do we
know that this is a legit action so it
becomes really tricky um the way they
restricted that was basically for
privacy reasons uh so that if you have
your kind of device uh in a place public
place have somebody um kind of talking
then you cannot really know a lot about
those people uh hopefully because that
that kind of uh provides privacy but on
the other hand the situation is that you
don't have this speaker authentication
and that it's going to be problematic
later on for applications where you're
buying things where you're sending
emails what if somebody just uses
something that gets kind of a maybe you
you forgot to lock your phone and that
is going to be I think a potential
security uh situation especially for for
things where you don't want there's
money involved there's reputation
involved then that's uh going to be kind
of critical so yeah it's a really
interesting surface where basically like
the the Privacy interest is also a
little counter to the the security
interest ultimately um Maro another
announcement that they had that I
thought was really interesting was
Vision fine-tuning um so you know they
basically said hey now in addition to
using sort of like text we're going to
support basically using images to help
fine-tune our our models
and you know for I guess non-experts do
you want to explain like why that makes
a difference like does it make a
difference at all um I think it's just
important for people to understand kind
of like as we sort of March towards
multimodality you know almost that also
touches a little bit of how fine tuning
gets done as well and and again kind of
curious like a little bit like Vagner
you think it's a big deal maybe it's not
that big of a deal no I think the thing
with multimodality to understand is that
it's uh can be very helpful just as when
you train a model on multiple languages
it has sometimes an ability to get
better
at all of those languages Having learned
from from that side of things training a
multimodal model it can get better in
those other modalities because of things
that it's learned just about
representation of things in the world
through those modalities and that makes
it pretty interesting in uh in in in the
sense that you said um I'll make the
comment that uh just going back for for
one minute sorry to the previous uh
thing with the speech is I I think that
we should pay some close and critical
attention to the way that these things
get demoed versus the capabilities that
they have so one thing just to note the
demo of it if I recall correctly was
like a a travel assistant and like a
recommend me restaurants and things like
that very very very traditional chatbot
customer assistant demos where if you're
in that kind of situation yeah you're
you're pretty clear that you're talking
to a chatbot whether it's speech or or
text or anything like that but the
reality is that you could use it in a
lot of the ways that Vagner and Natalie
were talking talking about and um we we
really do want to make sure that just
because we're all pretending that we're
making Travel Assistance we're not
necessarily all making Travel Assistance
and it's maybe the same thing with with
vision you can say on the one hand it's
good because you're getting to have be
able to communicate different kinds of
information to the model oh now you can
find tun on this picture this picture
this picture does it mean it's now once
again easier to uh pass yourself off as
uh you know potentially repurposing
other people's works and that kind of is
harder to track when it's in a different
modality of that kind things to consider
um yeah I don't work too much in images
myself but just looking at the
multimodal uh space overall that that's
sort of where my mind goes yeah for sure
and it's I think it's very challenging
it's kind of like you know I think part
of the question is you know ultimately
who's responsible for ensuring right
like that these kind of platforms are
used in the right way um and you know
particularly on voice right I guess
Marino one question would be if you
think they should be sort of more
restrictive right because one way of
doing this is well not everyone's going
to be building a travel assistant some
people may be using it to like you know
try to create believable you know
characters that are interacting with
people in the real world is the solution
here for the platform to exercise like a
stronger hand over who gets access and
who uses this stuff or is it something
else you think it's not going to work
most of these models or these variations
there of get open sourced very quickly
that's the way that things go so the
rate at which things are going people
will be able to just go around the
platform so I don't know that that's
going to work I think there's an
important thing that good actors should
ask themselves that just because you can
mimic a human voice very closely does
that mean you should maybe you actually
should make your assistant voice
identify as a robot because that is the
acceptable way of actually setting
expectations um but I don't know that
putting this on the platforms is going
to work we're we're nowhere with
regulations
um we have pretty much nobody who's a
real for-profit a non-for-profit actor
in the space everybody is a business and
trying to make money I just doubt that
that's gonna work yeah I think one of
the things that I'll just kind of throw
in on is I think that like um you know
one of the things we're dealing with is
the fact the technology is kind of
sprawling and ever more sprawling right
I think Marina your your point you know
some of these are like maybe back in the
day we could be like oh only a few
companies can really pull this off but
it just feels like between where you
know kind of like the technolog is
becoming more commoditized and more
available these sort of safety problems
become there's less points of control
basically um and it feels like the
bigger thing is like how do we I guess
in some cases educate right like
basically like you know should you right
it seems to be the question you really
want people to ask you know when they're
designing these systems which seems to
me to be very much more about like Norms
than it is about like trying to like set
some technical standard the the other
aspect to this is that uh before
actually I was working more in the image
uh and video
modality uh the aspect to it is that for
humans sometimes to see some of the
perturbations that images have it's very
difficult so the machine learning model
you can give it a picture of a panda and
a picture of uh the same panda with very
tiny tiny perturbations the machine
learning goes uh goes really crazy and
tells you it's a giraffe but for a human
still it's a p a panda oh l so I think
um adding this new modality definitely
adds more and more uh risk and risk is
exposure for the models now
whether we should be worried about it I
think uh in the open ey uh situation
they probably would not have be able to
basically make the model public and
that's uh going to be more more
restricted but for other models that is
definitely a situation we need to worry
because we never never fully solve a
adversarial samples that that thing of
the panda are called adversarial samples
so we never as a community really solved
that problem now that problem when we
add multimodality is coming back to our
plate and now we need to think about
okay before it was probably not as much
a risk because people were having more
difficulty interacting with the models
but now we have uh more people and more
and more people using text and image
models so are we actually in more danger
and that I think uh that's an active
research uh topic hopefully with the
large language models a lot of the
research that went to image actually
moved to text so I anticipate more and
more people are going to start working
in this intersection but it's an open
issue basically yeah I think it's so
fascinating um you know I think when
those adversarial examples first started
to emerge it was almost kind of in the
realm of like the theoretical but now we
just have like lots of live production
systems that are out there in the world
which obviously raises the risk and the
incentive to to of course um you know
undermine some of these Technologies um
so it's uh it's yeah definitely a really
big challenge um Vagner any final
thoughts on this I was thinking about
the
the possibility of fine-tuning vision
models I think that one aspect that I
believe this it's interesting especially
for
um um let's say and and and the report
gives an example of that on capturing
images for um like traffic images for
identifying like speed limits and so on
and so forth um that could help
development on um let's say countries in
the global uh South because usually when
we talk about models and image and
everything usually the data sets they
are mostly uh and they're training
mostly with considering us data sets
right and that I think that allowing
that it's in One Direction interesting
because supports people developing um
Technologies in in countries where we
don't have like uh like in Brazil
sometimes we don't have like the the
rows and and they're not so well U
painted signed as here in us so
sometimes uh allowing uh folks to do
this fine turning I think it's an
interesting to that way of uh putting
technology in other context of use far
from the context of creation I think in
this sense I think it's interesting yeah
for sure well as per usual with mixture
of experts I think we started by talking
about Dev day and what they're doing for
the developer ecosystem and I think
ended uh talking about International
Development so it's been another vintage
episode of mixture of experts um that's
all the time that we have for today um
Marina thanks for joining us fogner
appreciate you being on the show and
Natalie welcome back and uh if you
enjoyed what you heard listeners uh you
can get us on Apple podcasts uh Spotify
and podcast platforms everywhere and we
will see you next week thanks for
joining us