A next-gen AI protein folder that could help science? Meta's good for something
Faster than the others, 600m structures now in a public DB
AI researchers at Meta say they have developed the largest protein-folding model of its kind to date, and that it is capable of predicting the structure of more than 600 million proteins.
The team released the 15-billion-parameter ESM-2 transformer-based model and a database of its protein structure predictions, dubbed the ESM Metagenomic Atlas, on Tuesday. This database includes protein shapes that haven't been observed yet by scientists.
Proteins are complex biological molecules made up of up to 20 types of amino acids, and perform all sorts of biological functions in organisms. Crucially, they fold up into intricate 3D structures, the shape of which is vital to how they operate; knowing their shape helps scientists understand how they function, and from that, helps them figure out ways to mimic, alter, or counter that behavior.
Unfortunately, you can't just take the amino acid formula and immediately work out the eventual structure. You can run simulations or experiments to try to figure it out, but this is time-consuming. These days you can give suitably trained machine-learning software the chemical composition of a protein and the model will rapidly and accurately, relatively speaking, predict the structure.
Indeed, DeepMind demonstrated as much with its AlphaFold model, which won the biennial international computational protein-folding CASP competition in 2020. Given an input string of amino acids, AlphaFold and other machine-learning software can generate the corresponding three-dimensional structure.
Researchers at the London-based DeepMind have since improved their system to predict the structure of more than 200 million proteins known to science. The latest ESM system from Meta has gone further, predicting hundreds of millions more after being trained on millions of protein sequences.
A preprint paper by the Meta team – Lin et al – explaining the design of ESM-2 can be found here. Interestingly enough, according to the researchers, the system is actually a large language model made to "learn evolutionary patterns and generate accurate structure predictions end to end directly from the sequence of a protein." AlphaFold, for one, isn't a language model, and uses a different approach.
As the boffins note in their paper, these large language models can be used for much more than handling human languages: "Modern language models containing tens to hundreds of billions of parameters develop abilities such as few-shot language translation, commonsense reasoning, and mathematical problem solving, all without explicit supervision.
"These observations raise the possibility that a parallel form of emergence might be exhibited by language models trained on protein sequences."
The result is ESM-2, which, though a language model, has been taught to predict the physical shape of a protein from a text string representing its amino acids.
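To give a feel for what "a text string representing its amino acids" means to a language model, here is a minimal sketch in Python. It is a toy illustration, not ESM-2's actual tokenizer or vocabulary: the `tokenize` helper, the ID assignments, and the example fragment are all hypothetical. It assumes only the 20 standard one-letter amino acid codes.

```python
# Illustrative only: a toy tokenizer, not ESM-2's real vocabulary or code.
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical ID assignment: position of each letter in AMINO_ACIDS.
TOKEN_IDS = {letter: index for index, letter in enumerate(AMINO_ACIDS)}


def tokenize(sequence: str) -> list[int]:
    """Map a protein sequence to integer token IDs, one per residue."""
    sequence = sequence.strip().upper()
    unknown = sorted(set(sequence) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"unrecognised residue codes: {unknown}")
    return [TOKEN_IDS[residue] for residue in sequence]


example = "MKTAYIAKQR"  # hypothetical 10-residue fragment, for illustration
print(tokenize(example))
```

A model of this kind sees the protein as a sequence of such tokens, much as a text model sees words or word pieces; everything beyond that step, including how structure is predicted, is described in the Meta paper rather than here.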
ESM-2 is the largest model of its kind, and apparently predicts structures faster than similar systems; it is up to 60X faster than previous state-of-the-art systems like AlphaFold or Rosetta, which can take over ten minutes to generate an output, according to Meta.
The model was able to create the ESM Metagenomic Atlas, predicting over 600 million structures from the MGnify90 protein database in just two weeks running on 2,000 GPUs. On a single Nvidia V100 GPU, it takes just 14.2 seconds to predict the structure of a protein made up of 384 amino acids. According to the paper, the system mostly, but not fully, matches AlphaFold on accuracy; its speed is the key advantage, allowing it to predict more proteins.
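For a sense of scale, the reported figures can be turned into a rough average cost per structure. The Python sketch below is back-of-the-envelope arithmetic only: it treats "over 600 million", "two weeks", and "2,000 GPUs" as round numbers and assumes every GPU was busy for the whole period, which the article does not state. The variable names are illustrative.

```python
# Figures as reported in the article, treated as round numbers.
structures = 600_000_000  # "over 600 million" predicted structures
gpus = 2_000              # GPUs used for the run
days = 14                 # "two weeks"

# Assumption: all GPUs were fully occupied for the entire two weeks.
gpu_seconds = gpus * days * 24 * 60 * 60
seconds_per_structure = gpu_seconds / structures
structures_per_gpu_per_day = structures / (gpus * days)

print(f"{seconds_per_structure:.2f} GPU-seconds per structure")
print(f"{structures_per_gpu_per_day:,.0f} structures per GPU per day")
```

Under those assumptions the average comes to roughly four GPU-seconds per structure, or about 21,000 structures per GPU per day. Since more than 600 million structures were predicted, the true average would be at or below that figure. It is not directly comparable with the 14.2-second number, which is for one 384-residue protein on a V100, and the article does not say which GPUs the large run used.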
"With current state-of-the-art computational tools, predicting structures for hundreds of millions of protein sequences in a practical time frame could take years, even using the resources of a major research institution. To make predictions at the scale of metagenomics a breakthrough in prediction speed is critical," the Facebook owner said.
Meta hopes ESM-2 and the ESM Metagenomic Atlas will help advance science by aiding scientists studying evolutionary history or tackling disease and climate change. "To extend this work even further, we're studying how language models can be used to design new proteins and contribute to solving challenges in health, disease, and the environment," the biz concluded. ®