Audio Jacking: Man-in-the-Middle Voice Attack
Key Points
- A simple conversation about a bank account number illustrates “audio jacking,” where the listener hears a different number than the speaker intended, revealing the attack’s subtle manipulation.
- X‑Force researcher Chenta Lee devised “audio jacking,” a new man‑in‑the‑middle (MITM) attack that intercepts and alters spoken audio in real time, and demonstrated it with a proof of concept.
- The attacker can gain the MITM foothold via malware on a device, exploitation of VoIP services, or a spoofed three‑way call combined with a deep‑fake voice clone of one participant.
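- Once positioned, the interceptor captures each utterance, converts it to text with a speech‑to‑text engine, and passes the text to a large language model that detects sensitive content such as a bank account number; the altered text is then synthesized with text‑to‑speech in a deep‑fake clone of the speaker’s voice and injected back into the conversation.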
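- Defending against audio jacking starts with skepticism about what you hear, paraphrasing and repeating critical details, and moving sensitive information to an out‑of‑band channel, backed by standard hygiene: patching systems, avoiding untrusted links and attachments, installing apps only from trusted sources, and using multifactor authentication or passkeys.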
**Source:** [https://www.youtube.com/watch?v=xHRIjmx1_Fs](https://www.youtube.com/watch?v=xHRIjmx1_Fs)
**Duration:** 00:12:52
Sections
- [00:00:00](https://www.youtube.com/watch?v=xHRIjmx1_Fs&t=0s) **Audio Jacking Man‑in‑the‑Middle Attack** - The segment introduces a novel “audio jacking” threat where a man‑in‑the‑middle malware intercepts a voice call, alters spoken information (like a bank account number), and demonstrates how attackers can exploit this to deceive listeners.
Full Transcript
hey Martin it was great seeing you at
the conference yesterday great seeing
you too hey Jeff I want to pay you
back for that pizza that we shared can
you send me some of your bank account details
yeah sure thing my account number is
3141529 that's
8675309 got it thanks you bet take care
okay what just happened then you heard
me say this number and Martin wrote down
a different number why did he do that
does he have a hearing problem no he
doesn't in fact what he wrote down was
exactly what he heard you just didn't
hear his side of the conversation
welcome to the world of audio jacking
yep it's a thing this is a new type of
attack that one of our X-Force
researchers Chenta Lee came up with and
did a proof of concept let's take a look
and see how it works and ultimately what
you can do to protect yourself against
it okay so how did this thing work well
we're going to start with a diagram to
explain it so here we have Martin
looks exactly like him right and this
strapping young lad yours truly and here
we have the attacker this guy becomes
what we call a man in the middle in
other words he inserts a control point
between the two of us in our
conversation now how could he do that
well there's a lot of different ways but
one of the simplest ways would be to do
it through insertion of malware in other
words if he sends malware to my system
to my phone to my PC to my laptop
whichever I'm using to make the call
from then that could then establish the
man in the middle positioning
because what he's going to need is an
Interceptor and that's what this will do
another way he could do this and by the
way that malware could be embedded into
an app that I download from an app store
for instance and then that now puts the
target in place another way would
be to exploit voice over IP calling
in that case if someone
is able to insert themselves in the middle
of the conversation they might be able
to take control and yet another option
would be a three-way call where this guy
the attacker calls me spoofing the
number to make it look like it came
from Martin and he calls Martin spoofing
my number making it look like it came
from me and then inserts a deep fake of
my voice a copy or a clone of my voice
starting the conversation so that way
neither of us realizes the other one
didn't initiate the call so there's a
number of different ways that this might
initially get kicked off but once we've
done that once the attacker has
established his position his foothold
then what happens well so you remember
in the call what I did was I called and
I said something like you know it's good
to see you at the conference Martin and
this is where the Interceptor component
comes in it intercepts what I've said
and then it takes a look in fact it
sends what I've just said down to
another component that is a speech to
text translator basically it takes the
audio of what I said and turns it into
text into readable words it then takes
that information and sends it on into a
large language
model now why a large language model
because these things are really good at
natural language processing so they can
understand the context of a conversation
and not just pick out single words so an
llm could look at what I've just said
because it's been translated into text
and analyze it and see the meaning
in what I'm saying and in this case the llm
will be looking specifically for bank
account number information it's going to
want to know if I told a bank account
number and in the first
thing that I said to Martin I didn't
say anything about it so the answer in
that case is going to be no and it's
just going to take what I said allow it
to go through the Interceptor and be
passed along unimpeded unchanged so what
I said is in fact what Martin hears
sounds normal here's where it gets
interesting Martin then answers me back
and what he says is oh yeah good to see
you too but what I'd like to uh do is
pay you back for the pizza okay fine so
the Interceptor takes his words
translates them into text sends those to
the large language model and he said in
the message um send me your bank account
number now the large language model is
going to be smart enough to realize just
the mention of the word bank account
number is not the same thing as a bank
account number because llms understand
natural language so in that case again
the answer is no uh so his message will
be passed along back to me unimpeded
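The pass-through decision described so far can be sketched at the text level. This is a minimal illustration, not the proof of concept's implementation: a regular expression stands in for the large language model, and the names `contains_account_number` and `relay_decision` as well as the 7-digit threshold are assumptions made for this example.

```python
import re

# Assumption: an account number appears in the transcribed text as a run of
# at least 7 digits, optionally separated by single spaces or hyphens.
ACCOUNT_NUMBER = re.compile(r"\b\d(?:[ -]?\d){6,}\b")


def contains_account_number(utterance: str) -> bool:
    """True only when the text holds digits that look like an account number.

    Merely mentioning the phrase "bank account number" does not count.
    """
    return ACCOUNT_NUMBER.search(utterance) is not None


def relay_decision(utterance: str) -> str:
    """Return 'pass' to forward the audio unchanged, 'flag' otherwise."""
    return "flag" if contains_account_number(utterance) else "pass"


if __name__ == "__main__":
    conversation = [
        "it was great seeing you at the conference",
        "can you send me your bank account number",
        "my account number is 31415 29",
    ]
    for utterance in conversation:
        print(relay_decision(utterance), "|", utterance)
```

A real language model would also handle numbers spoken as words or spread across several sentences, which a pattern like this one misses.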
again everything acts normal here's
where it gets dicey what is going to
happen next is I'm going to tell him my
number
3141529 that's going to go through the
Interceptor it's going to turn that into
text it's going to go into the llm and
it's going to say oh he just told a bank
account number not just the word but
actually gave a bank account number it's
then going to take that information and
this is where the attack gets
interesting it's going to pass that on
down to a text to speech so it's going
to turn back the words into speech but
what it's going to also do is take what
I just said and remember there was an
account number in here it's going to
take that out and put something else in
and what's it going to put it's going to
put
8675309 that then gets passed on to a
deep fake generator that has already
been able to clone what my voice sounds
like how could you do that well it turns
out you can generate deep fakes with
some of these language models that can
operate with as little as 3 seconds of a
sample of your voice some of them need
30 seconds and some need more but the
point is it's not hard to get 3 seconds
or even 30 seconds of audio of a person
and then be able to create a very
lifelike clone or deep fake of their
voice so it's going to substitute that
into the message now all of this
processing takes a little bit of time
how do we cover that well there's a
little bit of a social engineering thing
that we could insert you didn't hear it
in our call but in the real proof of
concept we would need to do this and
that is it's going to generate a message
in my voice that says oh yeah sure hold
on a second while I look up the number
so that's really just a delay tactic so
that we can do this processing and then
once it's processed it's going to
actually send this account
number that Martin is then going to take
now in the meantime what I'm hearing
because there would be a delay on my
side as I wait for this to happen is
it's going to generate a message to me
in Martin's voice that says hold on a
second while I write it down so now both
of us have a reasonable uh expectation
that the other is going to be doing
something but we're waiting for just a
little bit of time and that's the time
we need for this process to occur then
once Martin gets that information he has
the wrong account number well that wrong
account number of course points up to
the attacker he Ires the money to the
attacker and the attacker has been
successful so that's in a nutshell how
this thing works pretty scary stuff
right well that was just one scenario
let's take a look at some other types of
attacks that we might also see what you
just saw was a financial based attack
where someone is substituting in account
numbers or other types of information
like that but there could be other
implications and other possibilities
there could be health-based information
that's being exchanged something that's
really sensitive that could affect for
instance a patient's life if the
wrong information is communicated from
one doctor to another other things that
could happen would be censorship say
that you're doing a talk and
someone actually substitutes in
different words that you did not say
into a video now all of a sudden you
have said something terrible that you
didn't actually say and the implications
of that could be devastating and then
one other to consider is real-time
impersonation in this case the attacker
has the Deep fake they call up the other
individual and they're able to speak to
them in the voice of the person that
they're impersonating what they say is
in their voice and what comes out is in
the voice of the person that they're
wanting to spoof so there could be a
lot of scary implications for this
technology if we're not prepared so what
should you do to defend against an audio
jacking attack defending against this
stuff is really hard but we do have some
tools some strategies that we can use
to guard against this so we'll start off
with the most important be skeptical
don't believe everything you hear even
if what you hear you're sure you heard
the voice of the other person in this
world of deep fakes and audio jacking
you may not be hearing what the other
person actually said so think
first then if it's something really
important like sending bank account
numbers or anything really sensitive
like that you want to paraphrase and
repeat and that way there may be a
little bit of difficulty with the uh
translation and you'll be able to catch
it uh and catch it a little bit off
guard but say it in different ways
because that way the llm is looking for
certain keywords or certain phrases
certain ways of expressing and maybe
you'll express it slightly differently
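One way to picture the "paraphrase and repeat" advice is to read the same number back in several forms so that a substitution has to stay consistent across all of them. The sketch below is only an illustration: the function names and the digit-sum check are assumptions added here, not something the video prescribes, and an attacker who rewrites every form would still get through, so this raises the effort rather than guaranteeing detection.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def digits_only(number: str) -> str:
    """Drop spaces and hyphens so '31415 29' and '3141529' compare equal."""
    return "".join(ch for ch in number if ch in "0123456789")


def readback_variants(number: str) -> list[str]:
    """Express one number in three different spoken forms."""
    digits = digits_only(number)
    spelled = " ".join(DIGIT_WORDS[int(d)] for d in digits)
    pairs = " ".join(digits[i:i + 2] for i in range(0, len(digits), 2))
    total = sum(int(d) for d in digits)
    return [
        f"digit by digit: {spelled}",
        f"in pairs: {pairs}",
        f"the digits add up to {total}",
    ]


def matches_digit_sum(heard_number: str, stated_sum: int) -> bool:
    """Listener-side check: does the number written down fit the stated sum?"""
    return sum(int(d) for d in digits_only(heard_number)) == stated_sum


if __name__ == "__main__":
    for variant in readback_variants("3141529"):
        print(variant)
    # The number Martin wrote down does not fit the sum Jeff would state.
    print(matches_digit_sum("8675309", 25))
```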
another thing is if it's really
important to you out-of-band
communication in other words we were
just talking on a cell phone well if
this is really important maybe don't
include the bank account number in that
maybe say I'll send you the account
number through email not the greatest
but maybe I'll text it to you maybe I'll
send it to you in some other messaging
app better still divide the account
number up send half of the
account number in one messaging app and
half in another or switch from that
device and switch over if you were doing
it on a phone switch over to a laptop so
anything that makes it so that the
attack surface is broader that the
attacker would have to have
compromised that's what you're looking
to do make the job hard for them and
then finally the best practices the
standard stuff that we know we're always
supposed to do but not everyone does it
what kinds of things do I mean by
this well for instance keep your systems
always patched with the latest level of
software whether it's a laptop
whether it's a phone doesn't matter make
sure that you have all the security
patches that are possible in place um
also when it comes to emails and
attachments and links in messages
and things like that don't open them if
you don't really have to if you don't
really know what it's going to do
because those things could be the way
that the guy inserts the malware onto
your system and then becomes the man in
the middle then when it comes to apps
that you download and who doesn't want
to download a thousand apps on their
phone but make sure that you get them
from trusted sources even trusted
sources can fail us every once in a
while but you put the odds in your favor
if you get it from a trusted App Store
as opposed to another one where there
might be malware a trojan horse
something like that inserted into the
app and then finally one of the things
that might get exploited ultimately
Downstream would be if they get your
credentials and they try to log into
your account or something like that so
use things like multifactor
authentication or you know I'm a big fan
of replacing passwords with passkeys
and we have a video on that if you'd
like to learn more about that but
passkeys are a stronger way of securing your
account AI can do some really amazing
things for us and I'm a huge fan however
if we're not careful it can also do some
really devastating stuff to us so be
informed keep learning stay vigilant and
protect yourself against the attacks and
if you want to know more about how this
particular proof of concept works then
click down in the description below and
you'll see a link to a Blog post where
you can find out the details and
actually listen to audio samples that
were generated during the proof of
concept and by the way when is Martin
going to finally send me that
money thanks for watching if you found
this video interesting and would like to
learn more about cyber security please
remember to hit like and subscribe to
this
channel
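The out-of-band advice from the transcript, sending half of the account number in one messaging app and half in another, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the channel labels (`sms`, `chat_app`) are hypothetical, and the code only models the splitting and reassembly, not the actual sending.

```python
def split_across_channels(number: str, channels: list[str]) -> dict[str, str]:
    """Give each channel one contiguous slice of the number, in order.

    Channel names are assumed to be unique; earlier channels receive the
    extra digit when the length does not divide evenly.
    """
    if not channels:
        raise ValueError("at least one channel is required")
    size, extra = divmod(len(number), len(channels))
    parts: dict[str, str] = {}
    start = 0
    for index, channel in enumerate(channels):
        end = start + size + (1 if index < extra else 0)
        parts[channel] = number[start:end]
        start = end
    return parts


def reassemble(parts: dict[str, str], channels: list[str]) -> str:
    """Recipient side: join the slices in the agreed channel order."""
    return "".join(parts[channel] for channel in channels)


if __name__ == "__main__":
    order = ["sms", "chat_app"]  # hypothetical channel labels
    parts = split_across_channels("3141529", order)
    print(parts)
    print(reassemble(parts, order))
```

The design point is the one made in the video: an attacker now has to compromise every channel, not just the phone call, to alter the whole number.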