Audio Jacking: Man-in-the-Middle Voice Attack

12m • Unknown Channel • security • tutorial • intermediate

Key Points

A simple conversation about a bank account number illustrates “audio jacking,” where the listener hears a different number than the speaker intended, revealing the attack’s subtle manipulation.
An X-Force researcher, Chenta Lee, devised “audio jacking,” a man‑in‑the‑middle (MITM) attack that intercepts and alters spoken audio in real time, and built a proof of concept to demonstrate it.
The attacker can gain the MITM foothold via malware on a device, exploitation of VoIP services, or a spoofed three‑way call combined with a deep‑fake voice clone of one participant.
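Once positioned, the interceptor captures each utterance, converts it to text with a speech‑to‑text engine, and passes the text to a large language model that checks whether it contains an actual bank account number. Ordinary speech is passed through unchanged; when a number is detected, it is replaced with the attacker’s number, and the altered sentence is synthesized in a cloned (deepfake) voice and injected back into the conversation.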
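Defending against audio jacking starts with skepticism about what you hear. For critical information, paraphrase and repeat it, confirm it out of band (ideally split across channels or devices), and follow standard hygiene: patch systems, avoid unknown links and attachments, install apps only from trusted stores, and use multi‑factor authentication or passkeys.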
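The intercept, transcribe, modify, and re-inject flow in the key points above can be illustrated at the text level. This is only a minimal sketch, not the proof of concept's code: speech-to-text and voice synthesis are left out entirely, and a simple regular expression stands in for the language-model check the video describes. The names `relay`, `Relayed`, and `ACCOUNT_PATTERN`, and the "seven or more digits" rule, are assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for the LLM check: a run of 7+ digits (optionally
# separated by spaces or hyphens) counts as an actual account number.
# The words "bank account number" alone do not match.
ACCOUNT_PATTERN = re.compile(r"\b\d(?:[ -]?\d){6,}\b")


@dataclass
class Relayed:
    heard: str     # what the listener ends up hearing
    altered: bool  # True if the utterance was changed in transit


def relay(transcribed: str, substitute: str) -> Relayed:
    """Pass an utterance through unchanged unless it contains a number."""
    if not ACCOUNT_PATTERN.search(transcribed):
        return Relayed(heard=transcribed, altered=False)
    swapped = ACCOUNT_PATTERN.sub(substitute, transcribed)
    return Relayed(heard=swapped, altered=True)


if __name__ == "__main__":
    call = [
        "It was great seeing you at the conference, Martin.",
        "Can you send me your bank account number?",
        "Sure, my account number is 3141529.",
    ]
    for utterance in call:
        result = relay(utterance, substitute="8675309")
        print(result.altered, "->", result.heard)
```

Only the third utterance is altered, which mirrors the point made in the video: merely mentioning a bank account number is not the same as stating one. A real language model would make that distinction from context rather than from a digit count.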

Sections

00:00:00 Audio Jacking Man‑in‑the‑Middle Attack - The segment introduces a novel “audio jacking” threat where a man‑in‑the‑middle malware intercepts a voice call, alters spoken information (like a bank account number), and demonstrates how attackers can exploit this to deceive listeners.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=xHRIjmx1_Fs](https://www.youtube.com/watch?v=xHRIjmx1_Fs)
**Duration:** 00:12:52

0:00
Jeff: Hey Martin, it was great seeing you at the conference yesterday.
Martin: Great seeing you too. Hey, Jeff, I want to pay you back for that pizza that we shared. Can you send me your bank account details?
Jeff: Yeah, sure thing. My account number is 3141529.
Martin: That's 8675309. Got it, thanks.
Jeff: You bet. Take care.

0:25 Okay, what just happened? You heard me say this number, and Martin wrote down a different number. Why did he do that? Does he have a hearing problem? No, he doesn't. In fact, what he wrote down was exactly what he heard. You just didn't hear his side of the conversation.

0:40 Welcome to the world of audio jacking. Yep, it's a thing. This is a new type of attack that one of our X-Force researchers, Chenta Lee, came up with and did a proof of concept for. Let's take a look and see how it works and, ultimately, what you can do to protect yourself against it.

0:55 Okay, so how did this thing work? We're going to start with a diagram to explain it. Here we have Martin (looks exactly like him, right?) and this strapping young lad, yours truly. And here we have the attacker. This guy becomes what we call a man in the middle. In other words, he inserts a control point between the two of us in our conversation.

1:17 Now, how could he do that? There are a lot of different ways, but one of the simplest would be through insertion of malware. In other words, if he sends malware to my system (my phone, my PC, my laptop, whichever I'm using to make the call from), that could establish the man-in-the-middle positioning, because what he's going to need is an interceptor, and that's what this will do. By the way, that malware could be embedded in an app that I download from an app store, for instance, and that then puts the target in place.

1:54 Another way would be to exploit voice over IP calling. In that case, if someone is able to insert themselves in the middle of the conversation, they might be able to take control.

2:06 Yet another option would be a three-way call, where the attacker calls me, spoofing the number to make it look like it came from Martin, and he calls Martin, spoofing my number to make it look like it came from me, and then inserts a deepfake of my voice, a copy or clone of my voice, to start the conversation. That way neither of us realizes the other one didn't initiate the call.

2:31 So there are a number of different ways this might initially get kicked off. But once the attacker has established his position, his foothold, then what happens? You remember in the call, I called and said something like, "It's good to see you at the conference, Martin." This is where the interceptor component comes in. It intercepts what I've said and sends it down to another component, a speech-to-text translator. Basically, it takes the audio of what I said and turns it into text, into readable words. It then takes that information and sends it on into a large language model.

3:20 Now, why a large language model? Because these things are really good at natural language processing, so they can understand the context of a conversation and not just pick out single words. An LLM could look at what I've just said, because it's been translated into text, analyze it, and see the meaning in what I'm saying. This LLM will be looking specifically for bank account number information. It's going to want to know if I told a bank account number. In the first thing that I said to Martin, I didn't say anything about it, so the answer in that case is going to be no, and it's just going to take what I said, allow it to go through the interceptor, and pass it along unimpeded, unchanged. So what I said is in fact what Martin hears. Sounds normal.

4:11 Here's where it gets interesting. Martin then answers me back, and what he says is, "Oh yeah, good to see you too, but what I'd like to do is pay you back for the pizza." Okay, fine. So the interceptor takes his words, translates them into text, and sends those to the large language model. He said in the message, "Send me your bank account number." Now, the large language model is going to be smart enough to realize that just the mention of the words "bank account number" is not the same thing as a bank account number, because LLMs understand natural language. So in that case, again, the answer is no, and his message will be passed along back to me unimpeded. Again, everything acts normal.

4:58 Here's where it gets dicey. What is going to happen next is I'm going to tell him my number, 3141529. That's going to go through the interceptor, it's going to turn that into text, it's going to go into the LLM, and it's going to say, "Oh, he just told a bank account number," not just the words, but actually gave a bank account number. It's then going to take that information, and this is where the attack gets interesting: it's going to pass that on down to a text-to-speech component, so it's going to turn the words back into speech. But what it's also going to do is take what I just said (remember, there was an account number in there), take that out, and put something else in. And what's it going to put? It's going to put 8675309.

5:48 That then gets passed on to a deepfake generator that has already been able to clone what my voice sounds like. How could you do that? Well, it turns out you can generate deepfakes with some of these language models that can operate with as little as 3 seconds of a sample of your voice. Some of them need 30 seconds, and some need more, but the point is it's not hard to get 3 seconds or even 30 seconds of audio of a person and then be able to create a very lifelike clone or deepfake of their voice. So it's going to substitute that into the message.

6:22 Now, all of this processing takes a little bit of time. How do we cover that? There's a little bit of social engineering that we could insert. You didn't hear it in our call, but in the real proof of concept we would need to do this: it's going to generate a message in my voice that says, "Oh yeah, sure, hold on a second while I look up the number." That's really just a delay tactic so that we can do this processing. Once it's processed, it's going to actually send this account number, which Martin is going to take.

6:56 In the meantime, what am I hearing? Because there would be a delay on my side as I wait for this to happen, it's going to generate a message to me in Martin's voice that says, "Hold on a second while I write it down." So now both of us have a reasonable expectation that the other is doing something, but we're waiting for just a little bit of time, and that's the time we need for this process to occur.

7:18 Then, once Martin gets that information, he has the wrong account number. That wrong account number, of course, points to the attacker. He wires the money to the attacker, and the attacker has been successful. So that, in a nutshell, is how this thing works. Pretty scary stuff, right?

7:37 Well, that was just one scenario. Let's take a look at some other types of attacks that we might also see. What you just saw was a finance-based attack, where someone is substituting in account numbers or other types of information like that. But there could be other implications and other possibilities. There could be health-based information being exchanged, something really sensitive that could affect, for instance, a patient's life if the wrong information is communicated from one doctor to another.

8:05 Another thing that could happen is censorship. Say that you're doing a talk and someone actually substitutes different words that you did not say into a video. Now, all of a sudden, you have said something terrible that you didn't actually say, and the implications of that could be devastating.

8:23 One other to consider is real-time impersonation. In this case, the attacker has the deepfake, they call up the other individual, and they're able to speak to them in the voice of the person they're impersonating. What they say is in their own voice, and what comes out is in the voice of the person they want to spoof. So there could be a lot of scary implications for this technology if we're not prepared.

8:48 So what should you do to defend against an audio jacking attack? Defending against this stuff is really hard, but we do have some tools, some strategies that we can use to guard against it. We'll start off with the most important: be skeptical. Don't believe everything you hear, even if you're sure you heard the voice of the other person. In this world of deepfakes and audio jacking, you may not be hearing the other person actually saying what they do. So think first.

9:19 Then, if it's something really important, like sending bank account numbers or anything really sensitive like that, you want to paraphrase and repeat. That way there may be a little bit of difficulty with the translation, and you'll be able to catch it a little bit off guard. But say it in different ways, because the LLM is looking for certain keywords or certain phrases, certain ways of expressing, and maybe you'll express it slightly differently.

9:47 Another thing, if it's really important to you, is out-of-band communication. In other words, we were just talking on a cell phone. If this is really important, maybe don't include the bank account number in that. Maybe say, "I'll send you the account number through email" (not the greatest), or maybe "I'll text it to you," or maybe "I'll send it to you in some other messaging app." Better still, divide the account number up: send half of the account number in one messaging app and half in another. Or switch devices: if you were doing it on a phone, switch over to a laptop. Anything that broadens what the attacker would have to have compromised is what you're looking to do. Make the job hard for them.

10:33 Finally, the best practices, the standard stuff that we know we're always supposed to do, but not everyone does. What kinds of things do I mean by this? For instance, keep your systems always patched with the latest level of software. Whether it's a laptop or a phone doesn't matter; make sure that you have all the security patches possible in place.

10:58 Also, when it comes to emails, attachments, links in messages, and things like that, don't open them if you don't really have to, if you don't really know what it's going to do, because those things could be the way the attacker inserts the malware onto your system and then becomes the man in the middle.

11:17 Then, when it comes to apps that you download (and who doesn't want to download a thousand apps on their phone?), make sure you get them from trusted sources. Even trusted sources can fail us every once in a while, but you put the odds in your favor if you get it from a trusted app store as opposed to another one where there might be malware, a Trojan horse, something like that inserted into the app.

11:39 And finally, one of the things that might get exploited ultimately downstream would be if they get your credentials and try to log into your account or something like that. So use things like multi-factor authentication, or, you know, I'm a big fan of replacing passwords with passkeys. We have a video on that if you'd like to learn more, but passkeys are a stronger way of securing your account.

12:05 AI can do some really amazing things for us, and I'm a huge fan. However, if we're not careful, it can also do some really devastating stuff to us. So be informed, keep learning, stay vigilant, and protect yourself against the attacks. If you want to know more about how this particular proof of concept works, click down in the description below and you'll see a link to a blog post where you can find out the details and actually listen to audio samples that were generated during the proof of concept. And by the way, when is Martin going to finally send me that money?

12:41 Thanks for watching. If you found this video interesting and would like to learn more about cybersecurity, please remember to hit like and subscribe to this channel.
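Two of the defenses from the talk, dividing the account number across channels and repeating it back, can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than anything shown in the video: the function names `digits_only`, `split_for_channels`, and `readback_matches` are hypothetical, and actually sending the halves (one messaging app versus another, or phone versus laptop) is left to the reader.

```python
import re


def digits_only(text: str) -> str:
    """Keep only the digits so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", text)


def split_for_channels(account: str) -> tuple[str, str]:
    """Split an account number into two halves for two separate channels."""
    digits = digits_only(account)
    middle = (len(digits) + 1) // 2  # the first half gets the extra digit
    return digits[:middle], digits[middle:]


def readback_matches(sent: str, read_back: str) -> bool:
    """True if the number the other party repeats matches what was sent."""
    return digits_only(sent) == digits_only(read_back)


if __name__ == "__main__":
    first, second = split_for_channels("3141529")
    print("channel A:", first)
    print("channel B:", second)
    # The read-back from the call in the video would fail this check.
    print("read-back ok:", readback_matches("3141529", "8675309"))
```

A read-back check only helps if the repeated number arrives over a path the attacker does not control. On the same compromised call, the read-back itself could be altered, which is why the talk pairs repetition with out-of-band communication.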