Alexa heard what you did last summer – and she knows what that was, too: AI recognizes activities from sound

Gadgets taught to identify actions via always-on mics

Analysis Boffins have devised a way to make eavesdropping smartwatches, computers, mobile devices, and speakers with endearing names like Alexa better aware of what's going on around them.

In a paper to be presented today at the ACM Symposium on User Interface Software and Technology (UIST) in Berlin, Germany, computer scientists Gierad Laput, Karan Ahuja, Mayank Goel, and Chris Harrison describe a real-time activity-recognition system capable of interpreting collected sound.

In other words, software that uses devices' always-on built-in microphones to sense exactly what's going on in the background.

The researchers, based at Carnegie Mellon University in the US, refer to their project as "Ubicoustics" because of the ubiquity of microphones in modern computing devices.

As they observe in their paper, "Ubicoustics: Plug-and-Play Acoustic Activity Recognition," real-time sound evaluation to classify activities and context is an ongoing area of investigation. What CMU's comp sci types have added is a sophisticated sound-labeling model trained on high-quality sound effects libraries, the sort used in Hollywood entertainment and electronic games.
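To make the idea concrete, here's a minimal, illustrative sketch of acoustic activity recognition: extract a spectral feature vector from each labeled clip and match new audio against per-class templates. This is not the authors' actual pipeline (which uses a deep model trained on real sound-effect libraries); the clips, class names, and nearest-centroid matcher below are all stand-ins for illustration:

```python
import numpy as np

def spectral_features(clip, n_fft=256):
    """Average FFT magnitude spectrum over fixed-size frames."""
    frames = clip[: len(clip) // n_fft * n_fft].reshape(-1, n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def train_centroids(labeled_clips):
    """Nearest-centroid 'model': mean feature vector per activity class."""
    feats = {}
    for label, clip in labeled_clips:
        feats.setdefault(label, []).append(spectral_features(clip))
    return {label: np.mean(v, axis=0) for label, v in feats.items()}

def classify(clip, centroids):
    """Assign the class whose centroid is nearest in feature space."""
    f = spectral_features(clip)
    return min(centroids, key=lambda label: np.linalg.norm(f - centroids[label]))

# Synthetic stand-ins for sound-effect library clips: a steady low hum
# (think "running water") versus broadband noise (think "hair dryer").
rng = np.random.default_rng(0)
t = np.arange(8192) / 16000.0
hum = np.sin(2 * np.pi * 120 * t)
noise = rng.standard_normal(8192)
model = train_centroids([("hum", hum), ("noise", noise)])
print(classify(hum + 0.1 * rng.standard_normal(8192), model))  # → hum
```

A real system swaps the toy features for log-Mel spectrograms and the centroid matcher for a trained neural network, but the shape of the problem, labeled clips in, activity label out, is the same.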

As good as you and me

Sound-identifying machine-learning models built using these audio effects turn out to be more accurate than those trained on acoustic data mined from the internet, the boffins claim. "Results show that our system can achieve human-level performance, both in terms of recognition accuracy and false positive rejection," the paper states.

The researchers report accuracy of 80.4 per cent in the wild. So their system misclassifies about one sound in five. While not quite good enough for deployment in people's homes, it is, the CMU team claims, comparable to a person trying to identify a sound. And its accuracy is close to that of other sound-recognition systems such as BodyScope (71.5 per cent) and SoundSense (84 per cent). Ubicoustics, however, recognizes a wider range of activities without site-specific training.

Alexa to the rescue

Alexa, informed by this model, could in theory hear if you left the water running in your kitchen and might, given the appropriate Alexa Skill, take some action in response, like turning off your smart faucet or ordering a boat to navigate around your flooded home. That is, assuming it didn't misinterpret the sound in the first place.

The researchers suggest their system could be used, for example, to send a notification when a laundry load finishes. Or it might promote public health: by detecting frequent coughs or sneezes, the system "could enable smartwatches to track the onset of symptoms and potentially nudge users towards healthy behaviors, such as washing hands or scheduling a doctor’s appointment."

In an email to The Register, Chris Harrison, assistant professor of human-computer interaction at CMU and director of the Future Interfaces Group, said accuracy of about 90 to 95 per cent would be sufficient for deployment.

He sees false positives, where the system hears a sound and thinks it's something else, as particularly problematic for real-world usage.

"These are very annoying to users, and so it would have to be more at like 99 per cent [accuracy] for this," he said. "I think we can achieve both of these accuracies within a year or so. We’ve already made so much progress just as a small research team. The big players can muster proper resources."

Harrison said a related project called Vibrosight, which involves using a laser to measure physical vibrations of an object to determine what it's doing, has already achieved sufficient accuracy for deployment.

To improve accuracy, the paper suggests better-quality microphones and higher sample rates could help, as might more sophisticated deep learning models such as ResNets. It also acknowledges that a world littered with active microphones might raise privacy concerns.

"The richness of sound is a double-edged sword," the paper states. "On one hand, it enables fine grained activity sensing, while also capturing potentially sensitive audio, including spoken content. This is an inherent and unavoidable danger of using microphones as sensors."

The researchers counter, however, that the social stigma of living in a bugged house may wane. The recent introduction of the Facebook Portal camera-mic-speaker medley, to say nothing of Amazon's, Apple's, Google's, and Microsoft's listening devices, suggests some companies are making similar bets.

In the meantime, as a potential privacy protection, the researchers suggest that converting all live audio into low-resolution Mel spectrograms (64 bins) and discarding the associated phase data makes speech recovery sufficiently difficult.

"There is no way to recover the audio," said Harrison. "In addition to the low resolution spectrograms, we also throw away phase data, and each slice of the spectrogram is large, combining many phonemes."
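The privacy transform Harrison describes can be sketched in a few lines. This is an illustrative numpy-only reconstruction, not the project's code: the 64-bin figure comes from the paper, while the frame size, sample rate, and windowing are assumptions. Phase is discarded the moment the magnitude is taken:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=64, n_fft=512, sr=16000):
    """Triangular filters mapping FFT magnitude bins to n_mels Mel bins."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fb[i, k] = (k - lo) / max(mid - lo, 1)   # rising slope
        for k in range(mid, hi):
            fb[i, k] = (hi - k) / max(hi - mid, 1)   # falling slope
    return fb

def mel_spectrogram(audio, n_mels=64, n_fft=512, sr=16000):
    """Magnitude-only, low-resolution Mel spectrogram (phase is dropped)."""
    frames = audio[: len(audio) // n_fft * n_fft].reshape(-1, n_fft)
    mag = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))  # |.| discards phase
    return mag @ mel_filterbank(n_mels, n_fft, sr).T  # (n_frames, 64)

audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)  # one second of A440
print(mel_spectrogram(audio).shape)  # → (31, 64)
```

With only 64 coarse magnitude bins per frame and no phase, inverting the representation back to intelligible speech is far harder than inverting a full-resolution complex spectrogram, which is the point of the transform.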

Keep this to yourself

Harrison and his colleagues envision their acoustic model running locally on devices so no audio data needs to be transmitted. While large players in the smart speaker space may want the audio data collected by their devices, he believes they can do without it.

"I think there is an important case to be made that people won’t want this sensitive, fine-grained data going to third parties," he said. "Companies that can do it on-device will have a competitive edge in the marketplace in my opinion."

Code associated with the project should be posted to a GitHub repo once the presentation has concluded. ®
