Off-the-shelf object-recognition systems struggle, relatively speaking, to identify common items in hard-up homes in countries across Africa, Asia, and South America. The same software performs better at identifying stuff in richer households in Europe and North America.
Though initially shocking, and then not so much when you think about it for a second and consider who makes these systems, it's a great example and reminder of how income-related biases have knock-on effects across the world.
Five popular computer-vision models commercially available via Microsoft Azure, Google Cloud Vision, IBM Watson, Amazon Rekognition, and New York-based AI upstart Clarifai, plus a ResNet-101 model trained using the Tencent ML Images dataset, were given the task of identifying items in photos taken in households around the planet.
Said photos were sourced from The Dollar Street dataset, which documents “everyday life on different income levels.” This collection contains some 30,000 snaps taken in 264 homes by photographers across 54 countries.
Each image is labeled with one of 135 categories according to what's pictured, and is also tagged with where it was taken and the household income of the home in which it was snapped. For this particular study, carried out by Facebook Research, the eggheads studied only 117 of those categories. Some categories were ignored because the labels were too abstract: for example, the category “most loved item” was not used. Typical categories included in the study had unequivocal labels like “refrigerator”, “soap”, “door”, and so on.
When the AI systems were instructed to identify objects in the photos, there was a clear difference in accuracy when identifying the items in homes in poorer countries compared to ones in richer countries. Objects such as spice jars were more easily recognized in kitchens in Europe or in the United States compared to those in the Philippines, for instance.
Map showing the average accuracy for the six different models tested ... Red indicates an accuracy of about 60 per cent, yellow is about 75 per cent, and green is about 90 per cent. Image credit: DeVries et al.
It's basically because these commercially available models just aren't familiar with objects found in poorer households. Their training data covers a lot of products found in richer homes and nations, and not so much of the stuff bought and sold in broke households and countries.
And it's not just the machine-learning systems developed by geeks in rich-ass Silicon Valley that are at fault: all the tested computer-vision models, whether they were built on the west or east coasts of America, or in Tencent's China, were more comfortable identifying things in more well-off homes than in poorer homes.
“For all systems, the difference in accuracy for household items appearing in the lowest income bracket – less than US$50 per month – is approximately 10 per cent lower than that for household items appearing in the highest income bracket – more than US$3,500 per month,” the Facebook team noted in its write-up of its findings, emitted via arXiv this month.
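For the curious, here is a minimal sketch of how a per-bracket accuracy figure like the one quoted above could be tallied. It is not the researchers' code: the sample records, the `bracket` helper, and the rule that a prediction counts as correct when the true label appears anywhere in the returned labels are all assumptions made for illustration. Only the US$50 and US$3,500 cut-offs come from the quote.

```python
from collections import defaultdict

# Made-up records: (monthly income in US$, true label, labels a model returned).
records = [
    (40, "soap", ["food", "bread", "cheese"]),
    (45, "door", ["door", "wood", "wall"]),
    (900, "refrigerator", ["refrigerator", "kitchen"]),
    (4000, "soap", ["soap dispenser", "soap", "bottle"]),
    (5000, "door", ["door", "house"]),
]


def bracket(income):
    """Hypothetical helper: bucket a household by the cut-offs in the quote."""
    if income < 50:
        return "under $50"
    if income > 3500:
        return "over $3,500"
    return "in between"


def accuracy_by_bracket(rows):
    """Return the share of correct predictions for each income bracket."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for income, truth, predicted in rows:
        key = bracket(income)
        totals[key] += 1
        # Assumption: a hit means the true label is among the returned labels.
        hits[key] += truth in predicted
    return {key: hits[key] / totals[key] for key in totals}


result = accuracy_by_bracket(records)
for name, score in result.items():
    print(f"{name}: {score:.0%}")
```

With real data, the gap between the lowest and highest brackets is simply the difference between two of these numbers.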
Rich vs Poor
Digging into the results, we can see that more expensive hand soap, for example, is kept in a bottle with a hand pump, while rigid rectangular bars of soap are cheaper. If these commercial models are more likely to be trained on images from richer households that have liquid soap, they’re less likely to realize that bars of soap are soap, too.
An example photo of what soap looks like in a home in the UK compared to Nepal ... All the models mistook the bar of soap from Nepal for some kind of food. Image credit: DeVries et al.
Refrigerators in more developed countries have doors, and are made of stainless steel or painted white, whereas in less developed countries where electricity is scarce, pots are used to store food. Image-recognition models, therefore, won’t know that these simple storage objects are, in fact, basic refrigerators, simply because they weren’t taught that during training.
The gap in accuracy is starkest for certain categories. Living rooms, for example, had an average accuracy difference of 40 per cent. Next were beds at 37 per cent, and then guest beds at 35 per cent. This is probably because living rooms in poorer homes in Africa, Asia, or South America lack certain items, such as massive TV sets, comfy sofas, or expensive cabinets. These homes are also less likely to have luxuries such as guest beds.
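A per-category gap like the ones above can be worked out along these lines. This is a rough sketch with invented hit-and-miss data, not the paper's method: it assumes the "accuracy difference" is the plain subtraction of low-income accuracy from high-income accuracy, and the `outcomes` structure and function names are hypothetical.

```python
# Invented outcomes: category -> income group -> hits (True) and misses (False).
outcomes = {
    "living room": {
        "low": [True, False, False, False, True],
        "high": [True, True, True, True, False],
    },
    "soap": {
        "low": [True, True, False, True],
        "high": [True, True, True, True],
    },
}


def accuracy(hits):
    """Share of correct predictions in a list of booleans."""
    return sum(hits) / len(hits)


def gaps_by_category(data):
    """Rank categories by high-income accuracy minus low-income accuracy."""
    gaps = {
        category: accuracy(groups["high"]) - accuracy(groups["low"])
        for category, groups in data.items()
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


ranked = gaps_by_category(outcomes)
for category, gap in ranked:
    print(f"{category}: {gap:.0%} gap")
```

Sorting by the gap puts the categories where income matters most at the top of the list.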
Average accuracy for all models identifying objects from homes with different monthly incomes ... Image credit: DeVries et al.
The researchers didn’t break down the average accuracy scores for each individual model, however, so it’s difficult to see which one was best or worst. On average, the accuracy was about 85 per cent for identifying items in homes that had a monthly income of $10,097 (~£7,958) compared to about 71 per cent for homes that had a monthly income of just $55 (~£43).
But all of them struggled with identifying objects from poorer places. "The absolute difference in accuracy of recognizing items in the United States compared to recognizing them in Somalia or Burkina Faso is around 15−20%," the team noted. "These findings are consistent across a range of commercial cloud services for image recognition."
The Register has asked Facebook for more details. The social network briefly mentioned it tried the test on its own object-recognition engine, which also suffered the same biases (see figure 10 of the paper).
The problem with humans and machines
The paper is a stark reminder of the biases present in machine learning and its disparate impacts. The upshot is that these computer-vision models, as they stand today, aren’t, relatively speaking, effective for folks in poorer circumstances. The problem can be narrowed down further to a lack of culturally diverse training data: too much of it is focused on the English language, which leans the material toward richer households.
If you're working with AI technology, can speak English, and are building training datasets, you probably have a comfortable life and it may not occur to you to include stuff from less-well-off homes.
“The geographical sampling of image datasets is unrepresentative of the world population distribution, and most image datasets were gathered using English as the 'base language',” the researchers explained. Items that don’t have English labels are typically not included in training processes, which heavily skews which types of objects can be recognized.
All of this, however, points to a glaring technical barrier in neural networks: they’re simply too rigid. “Ultimately, the development of object recognition models that work for everyone will likely require the development of training algorithms that can learn new visual classes from few examples and that are less susceptible to statistical variations in training data,” the Facebook eggheads stated.
At the moment, models have to see thousands or even millions of examples of things before they can identify objects effectively, and subtle differences in the images can confuse or throw them.
“We hope this study will help to foster research in all these directions. Solving the issues outlined in this study will allow the development of aids for the visually impaired, photo album organization software, image-search services, that provide the same value for users around the world, irrespective of their socio-economic status,” the researchers concluded. ®