To improve how people talk to machines in environments with multiple network-connected devices, boffins at Carnegie Mellon University in the US have devised an acoustic measurement technique for determining the direction a person is facing while speaking.
In an email to The Register, Karan Ahuja, a doctoral student at CMU, explained that he, fellow student Andy Kong, and professors Mayank Goel and Chris Harrison have come up with a new audio technology that "enables voice commands with addressability, in a similar way to gaze, but without the need for cameras."
That is to say, video cameras can use gaze tracking to guess who or what a person is addressing when speaking, but audio-centric devices don't have a reliable way to infer which way a person is facing. The wakewords used to activate digital assistant software in devices like Amazon Echo and Nest Audio provide that signal, but there's potential for confusion if multiple speech-addressable devices are listening.
Direction-of-Voice calculations offer a way to simplify spoken interaction with machines by clarifying which device is being addressed.
In a paper titled "Direction-of-Voice (DoV) Estimation for Intuitive Speech Interaction with Smart Device Ecosystems," presented last month at the 33rd Annual ACM Symposium on User Interface Software and Technology (UIST '20), the CMU computer scientists show how speech can be used as a directional communication channel.
The technique they discuss is not a Direction of Arrival (DoA) algorithm, used to pinpoint the source of a sound. Rather, their DoV algorithm can determine the direction along which a voice was projected.
"This allows users to easily and naturally interact with diverse ecosystems of voice-enabled devices, whereas today’s voice interactions suffer from multi-device confusion," their paper explains.
They envision DoV as a way to disambiguate spoken commands, allowing speakers to address smartphones, network-connected speakers, TVs, and other attentive kit without calling out a wakeword. The work also has the potential to reduce unintended activation of services like Alexa or Siri, which sometimes respond to utterances that sound similar to their wakewords. And the researchers suggest it could also be used for other purposes, like allowing hearing aids to selectively amplify sounds from specific directions.
DoV relies on two aspects of human speech: that high frequencies attenuate more rapidly at angles off the facing axis of the speaker, and that utterances have different directional characteristics at different frequencies.
"Put simply, if a voice is directed at a microphone (i.e. facing), high and low voice frequencies are present," the paper explains. "However, if we receive a sound when a user was facing another direction, or if the sound has had to echo to reach the microphone, we typically see reduced high frequencies compared to low frequencies."
The boffins' technique takes into consideration the nature of enclosed environments where sounds bounce around, creating multiple paths associated with the source sound and with its echoes.
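The spectral cue the paper describes, fewer high frequencies relative to low ones when the speaker faces away, can be illustrated with a rough sketch. This is not the researchers' feature pipeline: the function name `high_low_ratio`, the 2kHz split point, and the two synthetic tones standing in for a voice are all assumptions made for illustration.

```python
import numpy as np

def high_low_ratio(signal, sample_rate, split_hz=2000.0):
    """Ratio of spectral energy at or above split_hz to energy below it.

    split_hz is an illustrative choice, not a value from the paper.
    """
    spectrum = np.abs(np.fft.rfft(signal)) ** 2          # power per frequency bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    low = spectrum[(freqs > 0) & (freqs < split_hz)].sum()
    high = spectrum[freqs >= split_hz].sum()
    return high / low

# Synthetic stand-in for speech: one low tone plus one high tone (assumption).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate                 # one second of samples
low_tone = np.sin(2 * np.pi * 200 * t)
high_tone = np.sin(2 * np.pi * 4000 * t)

facing = low_tone + 0.8 * high_tone      # highs mostly intact
off_axis = low_tone + 0.2 * high_tone    # highs attenuated, as when facing away

facing_ratio = high_low_ratio(facing, sample_rate)
off_axis_ratio = high_low_ratio(off_axis, sample_rate)
print(facing_ratio > off_axis_ratio)
```

A larger ratio suggests the voice was aimed at the mic. On its own this toy measure would be fooled by room echoes and by differences between voices, which is why the researchers also account for multipath effects.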
By measuring the multipath effects of spoken words, they were able to determine whether or not a person is facing a given mic with about 93.1 per cent accuracy. That represents the best result of its kind based on current research and constitutes an important step toward making the technique commercially feasible, they say.
When trying to predict the specific angle a person is facing out of eight compass directions, their system managed 65.4 per cent accuracy, which the computer scientists concede is "not yet accurate enough for user-facing applications." And they acknowledge that their implementation does not address scenarios with multiple speakers or noisy environments.
They point to prior research that managed slightly better angle-specific identification (76.8 per cent) but required an array of six mics distributed across a room of known geometry. Their approach, they say, has the advantage of being software-only and they note that it doesn't have to send data to the cloud.
The researchers' test hardware consisted of a Seeedstudio ReSpeaker USB 4-channel microphone and a MacBook Pro with 16GB of RAM and a dual-core Intel i5 processor running at 3.1GHz for audio processing and classification. They used Python on the backend for data collection, signal processing, and machine learning – based on an Extra-Trees Classifier algorithm.
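For readers unfamiliar with Extra-Trees, here is a minimal sketch of training such a classifier with scikit-learn's `ExtraTreesClassifier` to make a facing/not-facing decision. The article doesn't say which library or features the team used, so everything here is assumed: the two features, the helper `make_samples`, and the synthetic data, which is deliberately easy to separate. The resulting accuracy says nothing about the paper's results.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)

def make_samples(n, facing):
    """Synthetic, hypothetical features: [high/low energy ratio, echo strength]."""
    ratio_mean = 0.6 if facing else 0.2   # facing: more high-frequency energy
    echo_mean = 0.2 if facing else 0.6    # facing: less reliance on echoes
    ratio = rng.normal(ratio_mean, 0.05, n)
    echo = rng.normal(echo_mean, 0.05, n)
    return np.column_stack([ratio, echo])

# Label 1 = facing the mic, label 0 = facing away.
X_train = np.vstack([make_samples(200, True), make_samples(200, False)])
y_train = np.array([1] * 200 + [0] * 200)
X_test = np.vstack([make_samples(50, True), make_samples(50, False)])
y_test = np.array([1] * 50 + [0] * 50)

clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)      # fraction of test samples classified correctly
print(accuracy)
```

Extra-Trees builds an ensemble of randomized decision trees and lets them vote. Such an ensemble is cheap to train and evaluate compared with a deep neural network, which fits a setup that runs on a laptop without sending audio to the cloud.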
They've made their dataset available on GitHub, for anyone interested in replicating their work or expanding on it. ®