AI sucks at stopping online trolls spewing toxic comments

It's easy for hate speech to slip past dumb machines

New research has shown just how bad AI is at dealing with online trolls.

Such systems struggle to automatically flag nudity and violence, don’t understand text well enough to shoot down fake news and aren’t effective at detecting abusive comments from trolls hiding behind their keyboards.

A group of researchers from Aalto University and the University of Padua found this out when they tested seven state-of-the-art models used to detect hate speech. All of them failed to recognize foul language when subtle changes were made, according to a paper [PDF] on arXiv.

Adversarial examples can be created automatically by using algorithms to misspell certain words, swap characters for numbers, add or remove spaces between words, or attach innocuous words such as ‘love’ to sentences.
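To make those tricks concrete, here is a minimal Python sketch of the kinds of automatic perturbations described above. It is an illustration only, not the researchers' code: the function names, the character-to-number table and the sample sentence are all assumptions made for this example.

```python
import random

# Hypothetical character-to-number table (an assumption for this sketch,
# not a list taken from the paper).
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}


def swap_adjacent(word, rng):
    """Misspell a word by swapping two adjacent inner characters."""
    if len(word) < 4:
        return word  # too short to swap without touching first/last letter
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def to_leet(text):
    """Swap selected letters for look-alike digits."""
    return "".join(LEET.get(c, c) for c in text)


def insert_space(word, rng):
    """Split a word in two by inserting one space at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(1, len(word))
    return word[:i] + " " + word[i:]


def remove_spaces(text):
    """Erase word boundaries by deleting every space."""
    return text.replace(" ", "")


def append_word(text, word="love"):
    """Attach an innocuous word to the end of the sentence."""
    return f"{text} {word}"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = "you are an idiot"
    print(swap_adjacent("idiot", rng))
    print(to_leet(sentence))
    print(insert_space("idiot", rng))
    print(remove_spaces(sentence))
    print(append_word(sentence))
```

A human reader still gets the message after every one of these edits. A model that has only ever seen the clean spelling may not.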

The models failed to pick up on the adversarial examples, which successfully evaded detection. These tricks wouldn’t fool humans, but machine learning models are easily blindsided. They can’t readily adapt to new information beyond what’s been spoonfed to them during the training process.

“They perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech,” the paper’s abstract states.

The problem of sniffing out toxic language normally boils down to a classification problem. Does this sentence contain any swear words or racist and sexist slurs?

Google’s Perspective API calculates a score to determine whether text is hateful. But reducing the task to a simple classification problem leaves it prone to false positives, where a sentence contains offensive language but its overall meaning is harmless.


Some false positive examples that show how brittle Google's Perspective model is. Image credit: Gröndahl et al.

The researchers were too polite to print it: the examples contain a “common English curse word, marked with ‘F’ here,” which was used in its original form in the actual experiment. You get the idea.

“Attack effectiveness varied between models and datasets, but the performance of all seven hate speech classifiers was significantly decreased by most attacks,” according to the researchers.

The weakest models are ones that inspect sentences word-by-word, since tiny changes like adding spaces between words will slip by unnoticed. The ones that break down words by individual characters do slightly better at recognizing attacks.
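The gap between the two approaches can be illustrated with a toy comparison. The Python sketch below is not one of the seven classifiers tested. It only measures how much of a sentence's feature set survives when the spaces are removed, using whole words in one case and overlapping three-character chunks in the other. The helper names and the sample sentence are assumptions made for this example.

```python
def word_tokens(text):
    """Word-level features: the set of whitespace-separated tokens."""
    return set(text.lower().split())


def char_ngrams(text, n=3):
    """Character-level features: the set of overlapping n-character chunks."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Share of features the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    original = "you are an idiot"
    attacked = original.replace(" ", "")  # word boundaries erased

    # Word level: the attacked text is one unknown token, so nothing matches.
    print(jaccard(word_tokens(original), word_tokens(attacked)))

    # Character level: chunks such as "idi" and "iot" appear in both versions.
    print(jaccard(char_ngrams(original), char_ngrams(attacked)))
```

With the spaces gone, the word-level view shares no features at all with the original sentence, while the character-level view still shares some. That is consistent with the character-based models degrading under attack rather than collapsing.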

“A significant difference between word- and character based models was that the former were all completely broken by at least one attack, whereas the latter were never completely broken,” the team said.

Future research should focus on making models more robust to attacks, the researchers said. Developers should pay closer attention to the training dataset rather than the algorithms themselves, they argued.

“We therefore suggest that future work should focus on the datasets instead of the models. More work is needed to compare the linguistic features indicative of different kinds of hate speech (racism, sexism, personal attacks etc.), and the differences between hateful and merely offensive speech,” the paper concluded. ®
