A number of AI programs trained to detect diabetic eye damage struggle to perform consistently in the real world despite apparently excelling in clinical tests, say scientists in the US.
Academics led by the University of Washington School of Medicine tested seven algorithms from five companies: Eyenuk and Retina-AI Health in America, Airdoc in China, Retmarker of Portugal, and OphtAI in France. All of the models have gone through clinical studies, and are used – or can be used – to diagnose diabetic retinopathy, a complication of diabetes that damages blood vessels in the eye, leading to impaired vision or blindness.
The research team said it found at least some of the software packages wanting during its own testing, and this month published its findings in the Diabetes Care journal.
“It’s alarming that some of these algorithms are not performing consistently since they are being used somewhere in the world," said lead researcher Aaron Lee, assistant professor of ophthalmology at the university.
The team tested the code by showing it a dataset of 311,604 photos from 23,724 patients at hospitals in Seattle and Atlanta from 2006 to 2018, and found some of the software's diagnoses were sub-par. When the algorithms' decisions were compared to those of real physicians, the team said three performed reasonably well, only one of them was as good as a human expert, and the rest were worse.
The AI models tended to over-predict disease, Lee told The Register. Although it's better to be safe than sorry, it meant the systems would more often than not flag up patients for examination by professional eye doctors. Instead of reducing the workload for ophthalmologists by filtering out those without the disease, the software would increase it.
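To see why over-prediction backfires in screening, consider a back-of-the-envelope calculation. The figures below are purely illustrative assumptions, not numbers from the study: a model that rarely misses disease (high sensitivity) but often cries wolf on healthy eyes (low specificity) ends up referring more healthy patients than sick ones.

```python
# Illustrative sketch: how an over-predicting screening model inflates
# specialist referrals. All numbers here are assumptions for the example.

def referrals(patients, prevalence, sensitivity, specificity):
    """Return (correct referrals, unnecessary referrals) flagged by the model."""
    diseased = patients * prevalence
    healthy = patients - diseased
    true_positives = diseased * sensitivity          # sick patients correctly flagged
    false_positives = healthy * (1 - specificity)    # healthy patients wrongly flagged
    return true_positives, false_positives

# Hypothetical screening run: 10,000 patients, 20% with referable retinopathy,
# a model that catches 95% of disease but clears only 60% of healthy eyes.
tp, fp = referrals(10_000, 0.20, sensitivity=0.95, specificity=0.60)
print(f"Correct referrals: {tp:.0f}, unnecessary referrals: {fp:.0f}")
```

Under these assumed numbers, the unnecessary referrals (3,200) outnumber the correct ones (1,900) – every one of them an extra examination for an ophthalmologist, which is the workload problem Lee describes.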
“The study design prevents us from disclosing which company supplied which algorithm unfortunately," Lee added. "It is my understanding that all of these algorithms are in clinical use somewhere in the world however."
The programs did better with imagery from Atlanta, we're told, a sign that performance depends heavily on the quality of the data. “We believe one of the reasons for the discrepancy in performance was that Atlanta has a more stringent protocol for image quality at the time of screening," Lee told us. "This suggests that AI models may be more sensitive to image quality issues than human beings."
The academics suggested medical algorithms should be evaluated on larger real-world datasets before being validated for public use. “AI algorithms are not all created equal and they can, but not always, recapitulate biases in datasets,” Lee warned.
So, are they safe for use?
Airdoc declined to comment on the study, and Retmarker did not respond to El Reg's questions.
Stephen Odaibo, CEO and founder of Retina-AI Health, told us he thought the researchers' experiment did not reflect actual real-world use of the software. He claimed the images used to test the algorithms included basic photos taken of people's eyes, whereas in normal use, the applications would be supplied high-quality retina scans. He argued this put the programs at an unfair disadvantage.
"This is a completely different scenario from the use case for which these AI algorithms were developed and subsequently clinically validated in prospective clinical trials for FDA approval," Odaibo said, referring to America's medical watchdog, the Food and Drug Administration.
"To make an evidence-based recommendation to the FDA one would need to design a study that reflects the indications of use and intended use of the medical device."
Frank Cheng, president and chief customer officer of the other US-based company in the study, Eyenuk, agreed that the experiments carried out by the academics didn't quite mirror how a system would be tested after FDA approval: "Our view is that systems that have FDA clearance are already going through more rigorous prospective clinical trial validation than the University of Washington study, additional testing is not necessary, so long as photographers and imaging protocol training takes place ... In real world clinical use, FDA-cleared systems such as Eyenuk's are integrated with the camera, and photographers are trained on the imaging protocol to be used."
Cheng said he believed "Eyenuk's EyeArt AI system is very much ready for prime time and is available for clinical use," and said he thought the "study analysis was well conducted in general."
OphtAI's CTO Bruno Lay told The Register that the research group's conclusions were fair. Lay claimed OphtAI's algorithms were ranked as the best or second-best of the seven algorithms tested, and that the technology from three out of the five companies trialed probably isn't yet good enough to be used in the real world.
"The experiments were very challenging," he said. "We had no idea of the quality of the images used in the test. We were able to process the whole dataset in just three days, and our system is already available for use in hospitals in France."
Diabetic retinopathy is a widely studied area in medical AI research. Several Alphabet subsidiaries, including Google, Verily, and DeepMind have demonstrated how machine-learning software can automatically analyze retinal scans. ®
Editor's note: An earlier version of this article quoted Retina-AI's Stephen Odaibo saying the test data included low-quality pictures such as those of people's driving license photos. However, the CEO has since conceded he was mistaken, and that these images were not used in the test.
Lead researcher Aaron Lee, meanwhile, told us the images came from a "teleretinal screening program. They did not have facial photos nor driver’s license photos." Lee also insisted the developers of the AI algorithms "were not given access to any of the images."