AI won't replace radiologists anytime soon

Researchers find AI models weak at medical reasoning when it comes to X-rays and CT scans

AI is not ready to make clinical diagnoses based on radiological scans, according to a new study.

Researchers often suggest radiology is a field AI has the potential to transform, because visual or multimodal models can recognize images quite well. The assumption is that AI models should be able to read X-rays and computed tomography (CT) scans as accurately as medical professionals, given enough training.

To test that hypothesis, researchers affiliated with Johns Hopkins University, University of Bologna, Istanbul Medipol University, and the Italian Institute of Technology decided they first had to build a better benchmark for evaluating vision language models.

There are several reasons for this, explain authors Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, and Alan Yuille in a preprint paper [PDF] titled "Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering."

The first reason is that most existing clinical datasets are relatively small and lack diverse records, which the scientists attribute to the expense and time required for experts to annotate the data.

Second, these datasets often rely on 2D images, which means 3D CT scans sometimes aren't available for AI to learn from.

Third, algorithms for the automated evaluation of machine learning models, like BLEU and ROUGE [PDF], don't do all that well with short, factual medical answers, a weakness illustrated in the sketch below.

In addition, existing datasets may use private, institutional data that's not available for further research.
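To see why metrics like BLEU fall short here, consider a minimal, hypothetical sketch (not from the paper) using NLTK's sentence-level BLEU score. With the default four-gram weights, a two-word answer that exactly matches the ground truth is still heavily penalised, and a clinically opposite answer that happens to share one word doesn't score much worse, because the metric only measures n-gram overlap.

```python
# Minimal, hypothetical sketch (not from the paper): why n-gram overlap
# metrics such as BLEU struggle to grade short, factual medical answers.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["benign", "cyst"]       # tokenized ground-truth answer
exact     = ["benign", "cyst"]       # model answer: exactly right
opposite  = ["malignant", "cyst"]    # model answer: clinically wrong

# Smoothing avoids hard zero scores when higher-order n-grams are absent.
smooth = SmoothingFunction().method1

# A two-word answer has no 3-grams or 4-grams, so with the default
# 4-gram weights even the exact match is heavily penalised.
print(sentence_bleu([reference], exact, smoothing_function=smooth))     # roughly 0.32
print(sentence_bleu([reference], opposite, smoothing_function=smooth))  # roughly 0.15
```

Neither number reflects the clinical gulf between the two answers, which is the sort of gap a purpose-built benchmark has to close.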

The authors therefore developed DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark focused on abdominal tumors in CT scans.

DeepTumorVQA is a benchmark test based on 9,262 CT volumes (3.7M slices) from 17 public datasets, supported by 395,000 expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning.

Twenty-three board-certified radiologists spent six months manually annotating 7,629 lesions depicted in 3D images taken from patients' livers, kidneys, pancreases, and colons. They then double-checked their annotations to develop a consensus. A lesion is simply tissue that looks abnormal in a scan. Diagnosis may determine whether it's benign or malignant.

Armed with their benchmark data, the boffins set out to evaluate five vision language models designed for healthcare: RadFM, two versions of M3D (one based on Llama2 and one on Phi-3), Merlin, and CT-CHAT.

Chart showing DeepTumorVQA questions

The authors evaluated these models in four categories: the accuracy of organ and lesion volume measurements; the ability to recognize when features like lesions are present; the ability to reason based on visual information (e.g. which of two kidneys is larger); and medical reasoning (e.g. whether a given lesion is a benign cyst or a malignant tumor).

In keeping with Betteridge's law of headlines, the authors' answer to their question "Are Vision Language Models Ready for Clinical Diagnosis?" is "No."

The models significantly outperformed random guessing on measurement tasks, though they did better at counting when presented with multiple-choice questions rather than freeform ones.

Recognition tasks were less impressive. The models could all recognize the presence of lesions, cysts, and tumors, with success rates ranging from 65 percent to 86 percent, but the researchers found the answers failed to account for subtle visual cues.

With visual reasoning, the models did fairly well with multi-step tasks but struggled with tasks like kidney volume comparison, which the researchers attribute to "difficulty in bilateral reasoning and precise localization."

And the tested models had the most trouble with medical reasoning because, the researchers say, it requires integrating information not seen in training data.

"Overall, while modern VLMs demonstrate promise in basic and recognition-heavy tasks, their applicability to real-world diagnostics is currently limited by weak visual signal, unreliable numeracy, and shallow reasoning chains," the authors conclude.

AI can help clinicians in a supportive role, but it's not ready to replace the judgement of medical professionals. ®
