Voice recognition systems are sexist: they struggle more with female voices than with male ones.
It's a headache that has lingered in the machine-learning world for a while. The issue was brought back into the spotlight by Delip Rao, CEO and co-founder of R7 Speech Sciences, a startup using AI to understand speech. And with the rise of voice-activated digital assistants like Apple’s Siri, Amazon’s Alexa, and Google Home, it’s an important problem to raise.
“In speech, we measure the mean fundamental frequency (which correlates with our perception of ‘pitch’). This is also called mean F0. The range of tones produced by our vocal tract is a function of the distribution around that,” according to Rao.
“You could write a simple, rule-based, gender classifier if you had the mean F0 from audio. From many sources, we know the mean F0 for men is around 120Hz and much higher for women (~200Hz).”
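To make Rao's point concrete, here is a minimal sketch of what such a rule-based classifier might look like. The 120Hz and 200Hz figures are the ones he quotes; the 160Hz cut-off halfway between them, the function names, and the convention that unvoiced frames are marked as 0 are all our own assumptions, not anything Rao or R7 described.

```python
from statistics import mean

MALE_MEAN_F0_HZ = 120.0    # figure quoted by Rao
FEMALE_MEAN_F0_HZ = 200.0  # figure quoted by Rao
# Assumption: put the decision boundary halfway between the two quoted means.
THRESHOLD_HZ = (MALE_MEAN_F0_HZ + FEMALE_MEAN_F0_HZ) / 2


def mean_f0(frame_f0_hz):
    """Average per-frame F0 estimates, skipping unvoiced frames (marked as 0)."""
    voiced = [f for f in frame_f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames to average")
    return mean(voiced)


def classify_by_mean_f0(mean_f0_hz, threshold_hz=THRESHOLD_HZ):
    """Single-rule classifier: at or above the threshold is labelled 'female'."""
    if mean_f0_hz <= 0:
        raise ValueError("mean F0 must be positive")
    return "female" if mean_f0_hz >= threshold_hz else "male"


if __name__ == "__main__":
    # Hypothetical per-frame F0 track in Hz; zeros stand for unvoiced frames.
    frames = [0.0, 118.0, 122.0, 0.0, 125.0]
    print(classify_by_mean_f0(mean_f0(frames)))
```

A single threshold like this will mislabel plenty of real speakers, since F0 ranges overlap; it only illustrates how little machinery the rule needs.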
Rachael Tatman, a data scientist at Kaggle who holds a PhD in linguistics from the University of Washington, explained to The Register this week that the problem doesn’t just stem from neural networks having too few training examples of female voices.
It’s also an inherent technical problem, down to the fact that women generally have higher-pitched voices. They also tend to be quieter and sound more “breathy”.
To map audio signals to particular words or sounds, the audio is processed and transformed into MFCCs (Mel-frequency cepstral coefficients), a common feature representation used in many automated speech recognition models.
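For the curious, the “Mel-frequency” part refers to the mel scale, which spaces frequencies roughly the way listeners perceive pitch. The sketch below uses one widely used form of the Hz-to-mel conversion; other variants exist, and we are not claiming any particular product uses this one. The helper names are ours, and a full MFCC pipeline (framing, FFT, filterbank, log, DCT) is deliberately left out.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mels using a common form: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centres(low_hz, high_hz, n):
    """Return n centre frequencies in Hz, evenly spaced on the mel scale."""
    if n < 2:
        raise ValueError("need at least two centres")
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n - 1)
    return [mel_to_hz(low_mel + i * step) for i in range(n)]


if __name__ == "__main__":
    # Even steps in mels become progressively wider steps in Hz.
    for centre in mel_spaced_centres(0.0, 8000.0, 6):
        print(round(centre, 1))
```

The upshot is that the filterbank behind MFCCs resolves low frequencies more finely than high ones.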
Tatman told us that “there's nothing about MFCCs in particular that are less good about modelling women's speech than men's.” But “there's a slightly less robust acoustic signal for women, it's more easily masked by noise, like a fan or traffic in the background, which makes it harder for speech recognition systems. That will affect whatever you use for your acoustic modelling, which is what MFCCs are used for.”
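One way to put a number on “more easily masked by noise” is the signal-to-noise ratio. The sketch below is our own illustration, not Tatman's analysis: it uses synthetic sine waves as stand-ins for a voice and a background hum, and every amplitude and frequency in it is an arbitrary assumption. It shows only that, against the same noise, a quieter signal has a lower SNR.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


def sine(freq_hz, amplitude, n_samples, sample_rate_hz=16000):
    """Synthetic tone used as a stand-in for real audio."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz)
        for i in range(n_samples)
    ]


if __name__ == "__main__":
    hum = sine(50.0, 0.1, 16000)       # assumed background hum
    louder = sine(200.0, 1.0, 16000)   # assumed louder voice
    quieter = sine(200.0, 0.5, 16000)  # same tone at half the amplitude
    print(round(snr_db(louder, hum), 2), round(snr_db(quieter, hum), 2))
```

Halving the amplitude costs about 6dB of SNR, whatever the noise happens to be.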
A lack of diverse training examples can leave AI systems riddled with performance errors. A recent study found commercial facial recognition systems are worse at identifying the gender of women than of men, and worse at recognizing black people than white people.
Since voice recognition systems already find it more difficult to cope with female voices, the problem of gender biases could get worse if systems learn from unbalanced training datasets.
“Deep learning, in particular, is very good at recognizing things that it's seen a lot of. And if you've trained your system on data from 90 per cent men and 10 per cent women (unlikely but possible, especially if you're not accounting for gender in your training data), you'll end up being very good at recognizing male data and very bad at recognizing female data. More worryingly, this also applies to things like race and ethnicity, where there isn’t an acoustic reason for one group to be harder to understand,” Tatman said.
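A rough sketch of why this kind of skew is easy to miss: a single headline accuracy figure can look healthy while one group fares far worse. The code below simply breaks accuracy down by group. The 90/10 split echoes Tatman's example, but the per-group hit counts are invented purely for illustration and are not measurements of any real system.

```python
from collections import defaultdict


def overall_accuracy(records):
    """records: iterable of (group, was_correct) pairs."""
    records = list(records)
    return sum(1 for _, ok in records if ok) / len(records)


def accuracy_by_group(records):
    """Return {group: fraction of that group's items recognized correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += int(ok)
    return {group: correct[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Invented toy results: 90 male utterances (81 right), 10 female (5 right).
    records = (
        [("male", True)] * 81 + [("male", False)] * 9
        + [("female", True)] * 5 + [("female", False)] * 5
    )
    print(overall_accuracy(records))
    print(accuracy_by_group(records))
```

With these made-up numbers the overall figure is 86 per cent, which hides the fact that the smaller group is recognized correctly only half the time.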
Many speech recognition systems are tailored towards Western accents. Rao told El Reg this week: “The real-world impacts are as you can imagine significant. Imagine a big chunk of the demographic being cut access to a product because of their gender or ethnicity. To speak for myself, I feel frustrated most ASR systems fail terribly with Indian accents. I would love to use voice interfaces but can't most of the time.
“Now imagine, if I had a handicap and all I could use was a voice interface. Not a world I want to imagine for myself; at least not with the current ASR tools. Similarly, I imagine how limiting these tools could be for women. I think we as a scientific community should focus more on making the empowerment by technology equally accessible across demographics.” ®