Audio tweaked just 0.1% to fool speech recognition engines

Digital dog whistles: AI hears signals humans can't comprehend


The development of adversarial attacks on AI continues apace: a paper by Nicholas Carlini and David Wagner of the University of California, Berkeley shows off a technique to trick speech recognition by changing the source waveform by just 0.1 per cent.

The pair wrote on arXiv that their attack achieved a first: not merely an attack that made a speech recognition (SR) engine fail, but one that returned a result chosen by the attacker.

In other words, because the attack waveform is 99.9 per cent identical to the original, a human wouldn't notice what's wrong with a recording of “it was the best of times, it was the worst of times”, but an AI could be tricked into transcribing it as something else entirely: the authors say it could produce “it is a truth universally acknowledged that a single” from a slightly altered sample.
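For a rough sense of just how small a 0.1 per cent change is, here's a purely illustrative sketch (not the researchers' code – the dummy waveform, sample rate and noise level below are our own assumptions):

```python
# Purely illustrative: the scale of a ~0.1 per cent change to a waveform.
# The dummy signal and noise level are assumptions, not the paper's data.
import numpy as np

rng = np.random.default_rng(0)
original = rng.uniform(-1.0, 1.0, 16000)              # one second of dummy 16 kHz audio
perturbation = 0.001 * rng.uniform(-1.0, 1.0, 16000)  # noise at ~0.1 per cent of full scale
adversarial = original + perturbation

relative_change = np.abs(adversarial - original).mean() / np.abs(original).mean()
print(f"mean relative change: {relative_change:.3%}")  # on the order of 0.1 per cent
```

A change of that size is far smaller than anything a listener would consciously register, which is the whole point of the attack.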

[Figure: adversarial audio sample – one of these things is not quite like the other. Image from Carlini and Wagner's paper]

It works every single time: the pair claimed a 100 per cent success rate for their attack, and frighteningly, an attacker can even hide a target waveform in what (to the listener) appears to be silence.

Images are easy

Such attacks against image processors became almost routine in 2017. There was a single-pixel image attack that made a deep neural network recognise a dog as a car; MIT students developed an algorithm that made Google's AI think a 3D-printed turtle was a rifle; and on New Year's Eve, Google researchers took adversarial imaging into the real world, creating stickers that confused vision systems trying to recognise objects (deciding a banana was a toaster).

Speech recognition systems have proven harder to fool. As Carlini and Wagner wrote in the paper, “audio adversarial examples have different properties from those on images”.

They explained that untargeted attacks are simple, since “simply causing word-misspellings would be regarded as a successful attack”.

An attacker could try to embed a malicious phrase in another waveform, but earlier approaches had to generate an entirely new waveform rather than add a small perturbation to an existing input.

Their targeted attack, they wrote, means: “By starting with an arbitrary waveform instead of speech (such as music), we can embed speech into audio that should not be recognised as speech; and by choosing silence as the target, we can hide audio from a speech-to-text system”.

The attack wouldn't yet work against just any speech recognition system. The duo chose DeepSpeech because it's open source, which let them treat it as a white box in which “the adversary has complete knowledge of the model and its parameters”.
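That white-box access is what lets an attacker backpropagate through the network and nudge a tiny perturbation until the model decodes whatever phrase they've chosen. The sketch below illustrates the general shape of such a loop in PyTorch; the toy model, made-up character encoding and crude max-norm penalty are all our own stand-ins, not Carlini and Wagner's code, which targets DeepSpeech itself:

```python
# A heavily simplified sketch of a targeted, white-box audio attack:
# optimise a small perturbation delta so the model's CTC decoding of
# (x + delta) matches a chosen target phrase. ToyModel is a stand-in
# for a real speech-to-text network such as DeepSpeech.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Emits per-frame character logits from a raw waveform."""
    def __init__(self, n_chars=28, frame=160):
        super().__init__()
        self.frame = frame
        self.proj = nn.Linear(frame, n_chars)

    def forward(self, wav):                              # wav: (n_samples,)
        usable = len(wav) // self.frame * self.frame
        frames = wav[:usable].view(-1, self.frame)       # (n_frames, frame)
        return self.proj(frames)                         # (n_frames, n_chars)

model = ToyModel()
x = torch.randn(16000)                                   # one second of placeholder audio
target = torch.tensor([8, 5, 12, 12, 15])                # made-up encoding of "hello"
delta = torch.zeros_like(x, requires_grad=True)          # the adversarial perturbation
optimiser = torch.optim.Adam([delta], lr=1e-3)
ctc_loss = nn.CTCLoss(blank=0)

for _ in range(200):
    logits = model(x + delta)
    log_probs = logits.log_softmax(-1).unsqueeze(1)      # (T, batch=1, n_chars)
    loss = ctc_loss(log_probs,
                    target.unsqueeze(0),                 # (batch=1, target_len)
                    torch.tensor([logits.shape[0]]),     # input length
                    torch.tensor([len(target)]))         # target length
    loss = loss + 0.05 * delta.abs().max()               # crude pressure to stay small
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

The max-norm penalty here is only a blunt way of keeping the change inaudible; the paper's own optimisation and distortion measure are considerably more refined.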

Nor, at this stage, is it a “real time” attack, because the processing system Carlini and Wagner developed only works at around 50 characters per second.

Still, with this work in hand, The Register is pretty certain other researchers will already be on a sprint to try and make a live distorter – so you could one day punk someone's Alexa without them knowing what's happening. Think “Alexa, stream smut to the TV” when your friend only hears you say “What's the weather, Alexa?”. ®


Other stories you might like

  • Research finds consumer-grade IoT devices showing up... on corporate networks

    Considering the slack security of such kit, it's a perfect storm

    Increasing numbers of "non-business" Internet of Things devices are showing up inside corporate networks, Palo Alto Networks has warned, saying that smart lightbulbs and internet-connected pet feeders may not feature in organisations' threat models.

    According to Greg Day, VP and CSO EMEA of the US-based enterprise networking firm: "When you consider that the security controls in consumer IoT devices are minimal, so as not to increase the price, the lack of visibility coupled with increased remote working could lead to serious cybersecurity incidents."

    The company surveyed 1,900 IT decision-makers across 18 countries including the UK, US, Germany, the Netherlands and Australia, finding that just over three quarters (78 per cent) of them reported an increase in non-business IoT devices connected to their org's networks.

    Continue reading
  • Huawei appears to have quenched its thirst for power in favour of more efficient 5G

    Never mind the performance, man, think of the planet

    MBB Forum 2021 The "G" in 5G stands for Green, if the hours of keynotes at the Mobile Broadband Forum in Dubai are to be believed.

    Run by Huawei, the forum was a mixture of in-person event and talking heads over occasionally grainy video and kicked off with an admission by Ken Hu, rotating chairman of the Shenzhen-based electronics giant, that the adoption of 5G – with its promise of faster speeds, higher bandwidth and lower latency – was still quite low for some applications.

    Despite the dream five years ago, that the tech would link up everything, "we have not connected all things," Hu said.

    Continue reading
  • What is self-learning AI and how does it tackle ransomware?

    Darktrace: Why you need defence that operates at machine speed

    Sponsored There used to be two certainties in life - death and taxes - but thanks to online crooks around the world, there's a third: ransomware. This attack mechanism continues to gain traction because of its phenomenal success. Despite admonishments from governments, victims continue to pay up using low-friction cryptocurrency channels, emboldening criminal groups even further.

    Darktrace, the AI-powered security company that went public this spring, aims to stop the spread of ransomware by preventing its customers from becoming victims at all. To do that, they need a defence mechanism that operates at machine speed, explains its director of threat hunting Max Heinemeyer.

    According to Darktrace's 2021 Ransomware Threat Report [PDF], ransomware attacks are on the rise. It warns that businesses will experience these attacks every 11 seconds in 2021, up from 40 seconds in 2016.

    Continue reading

Biting the hand that feeds IT © 1998–2021