Amazing peer-reviewed AI bots that predict premature births were too good to be true: Flawed testing bumped accuracy from 50% to 90%+
'These models should not go into clinical practice at all,' academic tells El Reg
A surprising number of peer-reviewed premature-birth-predicting machine-learning systems are nowhere near as accurate as first thought, according to a new study.
Gilles Vandewiele, a PhD student at Ghent University in Belgium, and his colleagues discovered the shortcomings while investigating how well artificial intelligence can predict premature births using non-invasive electrohysterography (EHG) readings. By premature, we mean before 37 weeks into a pregnancy, and by EHG, we mean the electrical activity in uterine muscles.
They identified an EHG data set on PhysioNet that was used to train premature-birth-predicting software in 24 published studies. After analyzing each one, they determined that 11 of the papers mixed their training and testing data, which led to wildly inflated accuracy scores.
The team – including Vandewiele, Dr Isabelle Dehaene of Ghent University Hospital, and data scientist Gyorgy Kovacs of Analytical Minds, Hungary – documented their findings here earlier this month on the preprint service arXiv.
Each of the probed studies reported miraculous results. One classifier was seemingly able to predict preterm births from EHG data 99.44 per cent of the time. The majority reported performance levels exceeding 94 per cent – an astounding success rate. However, when Vandewiele and his team attempted to reproduce the results, they discovered the figures were just too good to be true.
The problem began with what's called oversampling. The data set covers just 300 patients, and only 38 of them gave birth prematurely. In an attempt to balance out the data set, 11 teams generated synthetic records for women with preterm births, a process known as oversampling, and inserted them into the data.
Oversampling isn’t automatically a bad thing to do: plenty of AI boffins use it to make their training data more diverse. However, in these 11 cases, the researchers included their fake records in the data used for testing as well as training.
Since the artificial records are generated from the tiny set of 38 samples describing premature births, they're all pretty similar to one another. So if you train and test your model on those fake entries, it's like sitting an exam whose questions you've already seen – in fact, an exam full of very similar questions, and you know the answers to all of them.
And that's why the models seemed so good: it's easy to ace this kind of test.
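To see how the numbers get inflated, here's a minimal sketch of the flawed evaluation in a scikit-learn-style pipeline. The synthetic data, SMOTE oversampler, and random-forest classifier below are stand-ins of our own choosing, not code from any of the papers:

```python
# A minimal sketch of the flawed evaluation. The synthetic data below is a
# stand-in for the PhysioNet EHG set; none of this is the papers' actual code.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # a common oversampling library

# Stand-in for ~300 patients with ~13 per cent preterm births, as in the data set
X, y = make_classification(n_samples=300, n_classes=2, weights=[0.87, 0.13],
                           n_informative=5, random_state=0)

# WRONG: oversample first, so near-duplicate synthetic minority records land
# in BOTH the training and the test split -- the leak described above
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_over, y_over, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))  # implausibly high score
```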
“The models see very similar samples in testing to the ones they saw during the training phase, so that makes it very easy for the model,” Vandewiele explained to The Register on Thursday.
In other words, the models didn't have to learn which features or variables in EHG data signal premature births, since they could simply memorize the near-duplicate samples. “When near-perfect results are reported in domains where there has not been a lot of work, I would advise people to be skeptical,” he added.
Garbage in, garbage out
When his team reproduced the models described in each of the 11 papers and withheld the artificial data from the test set, accuracy dropped dramatically. Scores of over 90 per cent sank to about 50 per cent – no better than a coin flip – with the best two models managing a little above 60 per cent.
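For comparison, here's the same sketch with the leak closed: split first, then oversample only the training fold, and score against untouched real records. Again, this is an illustration under the same assumptions as above, not the replication code from Vandewiele's repository:

```python
# The corrected evaluation: synthetic records never reach the test set
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Same stand-in data as before: ~300 records, ~13 per cent positives
X, y = make_classification(n_samples=300, n_classes=2, weights=[0.87, 0.13],
                           n_informative=5, random_state=0)

# RIGHT: split first, so the test fold contains only real records
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversampling is confined to the training fold; the test set stays untouched
X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))  # a far more sobering figure
```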
“Data leakage is an easy mistake to make. But when you see your performance increase from 60 per cent to 99 per cent from a simple step, you have to be skeptical,” Vandewiele said.
"It’s an easy mistake to make, but not one that should have been overlooked when you publish research. You should think about what went wrong because such a jump is impossible. They’re not very good results the way they are now, the models trained on that data set should not go into clinical practice at all – they’re not good enough.”
The 11 papers were published in various journals from the Institute of Electrical and Electronics Engineers (IEEE), as well as in PLOS ONE and on ScienceDirect. This type of data leakage has even managed to slip past more prestigious journals, such as Nature.
Vandewiele is still hopeful machine learning can predict the risk of premature births, however. He believes neural networks could do a better job than some of the classifiers used in the previous papers.
“There’s still a possibility that we can study preterm births using machine learning, but not from that dataset and not using those methodologies,” he opined.
“All these methodologies hardcoded features or variables, but when you use deep learning you don’t need to extract those features manually. The neural network can just figure out the representation of those features. It’s less interpretable, but it saves a lot of work. It often performs better because the neural network finds better features than the ones that are hardcoded.
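As a purely illustrative sketch of what letting a network "figure out the representation" might look like, here is a tiny 1-D convolutional net that consumes raw multi-channel EHG traces instead of hand-crafted features. The EHGConvNet name, channel count, and layer sizes are hypothetical choices of ours, not an architecture from any of the studies:

```python
# Illustrative only: a minimal 1-D CNN over raw EHG-style signals (PyTorch)
import torch
import torch.nn as nn

class EHGConvNet(nn.Module):
    def __init__(self, n_channels=3):  # EHG recordings are typically multi-channel
        super().__init__()
        self.features = nn.Sequential(   # learned in place of hand-crafted features
            nn.Conv1d(n_channels, 16, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),     # pool over time to a fixed-size vector
        )
        self.classify = nn.Linear(32, 1)  # single preterm-vs-term logit

    def forward(self, x):                 # x: (batch, channels, time)
        return self.classify(self.features(x).squeeze(-1))

model = EHGConvNet()
dummy = torch.randn(4, 3, 2000)          # four fake 2,000-sample traces
print(model(dummy).shape)                 # torch.Size([4, 1])
```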
"Researchers should provide the code along with their paper so people can reproduce it, and check whether it's sound or not."
The code used by Vandewiele's team to analyze the papers can be found here. We have asked the first authors of each of the eleven papers singled out for comment, and will let you know if we hear back from them. ®