Symantec boffins reckon it's no longer enough to shield users from malicious e-mail, and that spam and phishing over SMS are now worthy of some decent defences. They've even penned a study to back up the proposition, suggesting that SMS spam could be 97 per cent detectable with a false positive rate as low as 0.02 per cent.
The researchers, from Symantec offices in the UK, Ireland and the US, have published their paper on arXiv, saying that although spam detection in SMS is harder than in e-mail, it can be done.
SMS remains popular – even in an era of over-the-top messaging platforms that want to eat the carriers' lunch by shifting their texts to the data channel – and the paper argues that various texting habits make spam detection a problem. The authors cite “lexical variants”, along with contractions, wordplay and other obfuscations, as posing challenges for anyone wanting to detect malicious messages.
With better baselines, including text normalisation and substring clustering, the researchers argue that these problems could be overcome.
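To give a flavour of what text normalisation means here, the sketch below folds a few common SMS obfuscations back into plain words. It is only an illustration: the `VARIANTS` and `LEET` tables and the `normalise` function are hypothetical, and the article does not describe the lexicon or rules the researchers actually used.

```python
import re

# Hypothetical shorthand lookup; the paper's own lexicon is not given here.
VARIANTS = {"u": "you", "ur": "your", "2": "to", "4": "for",
            "txt": "text", "pls": "please"}

# Hypothetical character swaps that spammers use to disguise words.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "$": "s", "@": "a"})


def normalise(message: str) -> str:
    """Lower-case a message and undo some simple lexical variants."""
    text = message.lower()
    # Squash stretched letters: "freeee" becomes "free".
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    words = []
    for token in re.findall(r"[a-z0-9$@]+", text):
        if token in VARIANTS:
            # Whole-token shorthand such as "u" or "2".
            words.append(VARIANTS[token])
        elif any(ch.isalpha() for ch in token):
            # Mixed tokens such as "w0n"; pure numbers are left alone.
            words.append(token.translate(LEET))
        else:
            words.append(token)
    return " ".join(words)


print(normalise("U have W0N a FREEEE priz3, txt 2 claim"))
# prints: you have won a free prize text to claim
```

Once messages are normalised this way, variants of the same spam template look far more alike, which is what makes the later clustering step workable.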
Working with an unnamed US carrier, Symantec was able to use a large SMS dataset to test its machine learning approaches to spam-blocking. To avoid false positives, the researchers note, they also used “a combination of behavioural and linguistic information” to get more robust results.
The researchers had around 400,000 text messages to work with (including 300,000 spam messages), allowing them to test what they describe as “clustered substring tokens from a subset of 100k messages using t-distributed stochastic neighbour embeddings … string similarity functions based on matching n-grams and word co-occurrences.”
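A string similarity function based on matching n-grams can be very small indeed. The sketch below scores two messages by the Jaccard overlap of their character trigrams. Jaccard overlap is an assumption chosen for illustration, as are the names `char_ngrams` and `ngram_similarity`; the article does not say which similarity measure the paper settled on.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams in a lower-cased, padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams: 0.0 is disjoint, 1.0 identical."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


spam = "win a free prize now"
variant = "win a fr33 prize now"
chat = "see you at lunch tomorrow"

# The disguised variant should score closer to the spam than the chat does.
print(ngram_similarity(spam, variant) > ngram_similarity(spam, chat))
```

Pairwise scores like these could then feed a clustering or embedding step, so that near-duplicate spam lands in the same group even when individual words have been mangled.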
To expand the total training dataset, the researchers also cleaned up 200,000 Twitter messages (removing hashtags and user mentions). Their study used two approaches: MELA (message linguistic analysis) and MPA (messaging pattern analysis).
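The tweet clean-up step is simple to picture. The sketch below strips hashtags and user mentions so that a tweet reads more like an ordinary text message. It assumes whole `#tag` and `@user` tokens are dropped, which the article does not confirm, and the pattern and the `clean_tweet` function are hypothetical.

```python
import re

# Matches @mentions and #hashtags, but not the "@" inside an e-mail address.
MENTION_OR_HASHTAG = re.compile(r"(?<!\w)[@#]\w+")


def clean_tweet(tweet: str) -> str:
    """Drop hashtags and user mentions, then tidy up the whitespace."""
    stripped = MENTION_OR_HASHTAG.sub("", tweet)
    return " ".join(stripped.split())


print(clean_tweet("@bob running late, see you at 8 #sorry"))
# prints: running late, see you at 8
```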
The MELA approach showed a 0.05 per cent false positive rate and a 9.4 per cent false negative rate, the paper says, while MPA scored a much better 0.02 per cent false positive rate and just 3.1 per cent false negatives. ®