Researchers teach AI to pinpoint mutations linked to cancer

An ounce of prediction is worth a pound of cure

Machine learning techniques, such as deep learning, have proven surprisingly effective at identifying diseases like breast cancer. However, when it comes to identifying mutations at the genetic level, these models have come up short, according to researchers at the University of California San Diego (UCSD).

In a paper published in the journal Nature Biotechnology this week, researchers at the university propose a new machine learning framework called DeepMosaic that uses a combination of image-based visualization and deep learning models to identify genetic mutations associated with diseases including cancer and disorders with genetic links, such as autism spectrum disorder.

Using AI/ML to identify disease has been a hot topic in recent years. For instance, in August, Harvard scientists detailed a multimodal AI system capable of predicting 14 types of cancer.

The problem, according to UCSD professor Joe Gleeson, is that most of these models aren't well suited to identifying a class of genetic mutations called mosaic variants, or mosaic mutations, because most of the software developed over the last two decades was trained on cancer samples.

Because cancer cells divide so rapidly, they're relatively easy for computer programs to spot, he explained in an interview with The Register. By comparison, mosaic mutations are tricky to spot because they exist in only a subset of cells.

Contrary to what you may have been taught in primary school, when a cell divides, the resulting cells aren't perfect copies of each other. There are minute changes, or mutations, Gleeson explained. These mutations are usually harmless, but sometimes they lead to cancer or other diseases. That is what makes methods for identifying potentially problematic mutations so valuable.

But up until recently there's been "no good way to identify those mutations from DNA sequencing," he said. "What we present in this paper is a new way to do that [which] takes advantage of deep learning."

Diving into DeepMosaic

The DeepMosaic framework itself was trained on 180,000 mosaic variants using Nvidia Kepler K80 GPU nodes housed in the San Diego Supercomputer Center's Comet compute cluster.

Available as an open source pre-trained model, the framework works by converting genome sequences into images and then applying a deep learning convolutional neural network to identify mosaic mutations, Xiaoxu Yang, a post-doctoral researcher working on the project, told The Register.
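To make the "sequences into images" idea concrete, here is a minimal Python sketch of how a stack of aligned sequencing reads might be turned into a grid of pixel values that a convolutional network could take as input. This is an illustration only and not DeepMosaic's actual encoding or API: the function names `pileup_to_image` and `alt_fraction`, the pixel codes, and the toy reads are all assumptions made up for this example.

```python
# Hypothetical illustration only: NOT DeepMosaic's real encoding or API.
BASES = "ACGT"


def pileup_to_image(reads, ref):
    """Encode pre-aligned reads as rows of integer pixel codes.

    Assumed pixel scheme (invented for this sketch):
      0   = gap or unknown base
      1-4 = A/C/G/T that matches the reference
      5-8 = A/C/G/T that differs from the reference
    """
    image = []
    for read in reads:
        if len(read) != len(ref):
            raise ValueError("reads must be aligned to the reference window")
        row = []
        for base, ref_base in zip(read, ref):
            if base not in BASES:
                row.append(0)
            else:
                code = BASES.index(base) + 1
                row.append(code if base == ref_base else code + 4)
        image.append(row)
    return image


def alt_fraction(image, column):
    """Fraction of covering reads that disagree with the reference."""
    covered = [row[column] for row in image if row[column] != 0]
    if not covered:
        return 0.0
    return sum(1 for pixel in covered if pixel > 4) / len(covered)


# Toy data: five reads over a six-base window, one carrying a G>T change.
ref = "ACGTAC"
reads = ["ACGTAC", "ACGTAC", "ACTTAC", "ACGTAC", "AC-TAC"]
image = pileup_to_image(reads, ref)
fraction = alt_fraction(image, 2)
```

In this toy example only one of the four reads covering the third position carries the change, which mirrors the point that a mosaic mutation shows up in only a subset of cells. A real pipeline would read alignments from sequencing files and feed a much richer image to a trained network; none of that is shown here.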

And unlike some machine learning models, anyone deploying DeepMosaic shouldn't need to condition their data unless they plan to retrain the model.

Compared to existing methods, the researchers claim the model can identify these mutations with greater accuracy. And compared to the manual verification process required by conventional methods, it's also orders of magnitude faster, Gleeson noted.

Given enough GPU horsepower, Yang says DeepMosaic should be able to crunch through an entire genomic profile within an hour. And while the model was trained on Comet, that doesn't mean you need a supercomputer to run or even retrain it. It's entirely within the realm of possibility for someone to train DeepMosaic using a personal workstation, he claimed.

The framework also offers distinct advantages compared to models developed specifically with cancer in mind, Yang said. Most of these programs require a non-cancer control sample. This makes using them to identify mosaic variants impractical since the mutations may be spread across multiple tissues.

DeepMosaic "is control independent, meaning that we are able to look at those variants that are shared by different tissues," he said.

While better for identifying mosaic variants, the researchers note that DeepMosaic, at least in its current form, isn't practical for cancer samples.

But as Gleeson points out, being able to accurately identify these mutations is a first step toward developing medical treatments.

A Git clone away

Researchers interested in giving DeepMosaic a spin won't have to go far. The framework, documentation, and demos necessary to deploy and test it are available for download on GitHub.

In addition to the pre-trained model, the researchers have included all the tools necessary for others to retrain the model on their own datasets.

According to Gleeson, there's still plenty of room for improvement. "A lot of the work currently in human genetics is with European ancestry," he explained, adding that there's an opportunity to apply these tools to different ancestries.

Beyond identifying signatures of disease, DeepMosaic may have applications in adjacent fields. "Our model is very well tuned to spot subtle differences in genomic sequence files from a single person, but I think the tool probably has applications in other fields," Gleeson said. "For instance, in forensics, where we're hearing a lot in the news about matching DNA with public databases."

The opportunities for tools like DeepMosaic to improve human health and understanding will only continue to grow as the cost of genome sequencing continues to fall, Yang added. ®
