Understanding Deepfake Vishing Attacks

By now, you’ve likely heard of fraudulent calls that use AI to clone the voices of people the recipient knows. The caller often sounds like a grandchild, a CEO, or a colleague you’ve worked with for years, reporting an urgent matter that requires immediate action, such as wiring money, divulging login credentials, or visiting a malicious website.
Researchers and government officials have been sounding the alarm about these deepfake attacks for years. The Cybersecurity and Infrastructure Security Agency (CISA) has said that threats from deepfakes and other forms of synthetic media have increased exponentially in recent years. Last year, Google's Mandiant security division reported that such attacks are being executed with uncanny precision, producing more believable phishing schemes.
Anatomy of a Deepfake Scam Call
Security firm Group-IB recently outlined the basic steps involved in executing deepfake vishing attacks. These attacks are easy to reproduce on a large scale and can be challenging to detect or fend off.
The basic steps are:
- Collecting voice samples of the person to be impersonated. Samples as short as three seconds can suffice, and they can be obtained from videos, online meetings, or previous voice calls.
- Feeding the samples into AI-based speech-synthesis engines such as Google’s Tacotron 2, Microsoft’s VALL-E, or services from ElevenLabs and Resemble AI. These engines provide a text-to-speech interface that renders user-chosen words in the impersonated person’s tone and conversational tics. Although most services bar such misuse, the safeguards can often be bypassed with minimal effort.
- Spoofing the phone number of the person or organization being impersonated, a technique scammers have used for decades.
- Initiating the scam call. In some instances, the cloned voice follows a script. In more sophisticated attacks, faked speech is generated in real time using voice masking or transformation software, allowing attackers to respond to skeptical queries. While real-time usage remains limited, advancements in processing speed and model efficiency could soon change this.
In either case, the attacker uses the fake voice to pressure the recipient into immediate action, simulating scenarios such as a relative needing bail money or an urgent financial directive from a CEO or the IT department.
Shields Down
Mandiant illustrated how easily such a scam can succeed in a simulated exercise its security team ran to test defenses and train personnel. The team collected publicly available voice samples of a target and used them to call employees, invoking a real VPN outage to make the scenario convincing.
Trusting the voice they heard, the victim bypassed security prompts and downloaded and executed a malicious payload on their workstation, illustrating how easily AI voice spoofing can lead to a breach.
To prevent such scams, parties can agree in advance on a randomly chosen word or phrase that the caller must provide. Recipients can also hang up and call back on a verified number. Both precautions, though, require the recipient to stay calm and alert even under pressure, so vishing attacks, AI-driven or otherwise, are likely to persist.
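To make the code-phrase safeguard concrete, here is a minimal sketch in Python. It is purely illustrative, with a toy word list and hypothetical function names rather than any particular product: the two parties generate and memorize a random phrase ahead of time, and a caller who cannot repeat it gets called back on a known-good number.

```python
import secrets

# Small illustrative word list; in practice you would use a much larger one
# (for example, a diceware-style list) so the phrase is hard to guess.
WORDS = [
    "maple", "quartz", "harbor", "falcon", "copper", "meadow",
    "lantern", "orchid", "summit", "walnut", "ember", "glacier",
]

def generate_code_phrase(num_words: int = 3) -> str:
    """Pick a random, easy-to-say phrase that both parties memorize in advance."""
    return " ".join(secrets.choice(WORDS) for _ in range(num_words))

def caller_is_verified(expected_phrase: str, spoken_phrase: str) -> bool:
    """Compare what the caller says against the pre-agreed phrase."""
    normalize = lambda s: " ".join(s.lower().split())
    return normalize(spoken_phrase) == normalize(expected_phrase)

if __name__ == "__main__":
    phrase = generate_code_phrase()
    print(f"Agreed code phrase: {phrase}")
    # During a suspicious call, the recipient asks the caller for the phrase:
    print(caller_is_verified(phrase, phrase))            # True
    print(caller_is_verified(phrase, "wire the money"))  # False
```

The value lies in the protocol rather than the code: the phrase is agreed out of band before any emergency, it is never volunteered on a call until the recipient asks for it, and a caller who cannot produce it is verified through the callback step described above.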