Image credit: Science Photo Library/Getty Images
This week, AlphaFold, an artificial intelligence (AI) program developed by Google’s DeepMind, solved a decades-old problem in biology: determining a protein’s 3D structure based only on its amino acid sequence.
The results were announced at the 14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP14), where AlphaFold beat 100 other participating teams.
“Building on the work of hundreds of researchers across the globe, an AI program called AlphaFold, created by London-based AI lab DeepMind, has proved capable of determining the shape of many proteins. It has done so to a level of accuracy comparable to that achieved with expensive and time-consuming lab experiments,” wrote the organizers in a statement.
The protein folding problem
Proteins are the building blocks of life, working as intricate machines that control every process within our cells and bodies. Examples include antibodies, which help ward off infection, and the proteins that regulate blood sugar. Their precise function is determined by their unique 3D structures, which assemble spontaneously and are held together by attractive and repulsive forces dictated by their linear amino acid sequence.
“Even tiny rearrangements of these vital molecules can have catastrophic effects on our health, so one of the most efficient ways to understand disease and find new treatments is to study the proteins involved,” said Dr. John Moult, a computational biologist at the University of Maryland, who co-founded CASP in 1994.
Christian Anfinsen was awarded the Nobel Prize in 1972 for showing that it should be possible to determine the shape of a protein from its sequence of amino acids. In the decades since, scientists have been searching for an efficient way to work out how a linear string of amino acids maps onto the intricate loops, folds, and pleats of a protein’s final functional form.
While research in recent years has been bringing us ever closer, the current gold standard techniques for solving protein structures, such as nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography, can be difficult, expensive, and time-consuming. Of the 200 million known proteins, only a small percentage have had their structures solved, and with a growing number of new proteins added to the database every year, our current methods will not allow us to keep up.
“There are tens of thousands of human proteins and many billions in other species, including bacteria and viruses, but working out the shape of just one requires expensive equipment and can take years,” said Moult.
Computational approaches were introduced in the 1980s, but while their accuracy and credibility have improved, none has come close to solving the “protein folding problem”. A major challenge is the sheer number of ways a protein could theoretically fold before reaching its final 3D structure. To provide some perspective, in 1969, molecular biologist Cyrus Levinthal estimated that it would take longer than the age of the known universe to enumerate all possible configurations of a typical protein by brute-force calculation.
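To get a feel for the scale of Levinthal's argument, the arithmetic can be sketched in a few lines of Python. The numbers below are illustrative assumptions rather than figures from this article: a 100-residue protein, three possible conformations per residue, and a sampling rate of ten trillion conformations per second.

```python
# Back-of-the-envelope version of Levinthal's argument.
# All three inputs are illustrative assumptions, not figures from the article.
RESIDUES = 100               # length of a smallish protein
STATES_PER_RESIDUE = 3       # coarse count of conformations per residue
SAMPLES_PER_SECOND = 1e13    # generously fast conformational sampling rate

AGE_OF_UNIVERSE_SECONDS = 13.8e9 * 365.25 * 24 * 3600  # roughly 4.35e17 s


def brute_force_seconds(residues, states, rate):
    """Time needed to visit every conformation exactly once."""
    conformations = states ** residues  # exact integer arithmetic
    return conformations / rate


seconds = brute_force_seconds(RESIDUES, STATES_PER_RESIDUE, SAMPLES_PER_SECOND)
universe_ages = seconds / AGE_OF_UNIVERSE_SECONDS

print(f"conformations: {STATES_PER_RESIDUE ** RESIDUES:.2e}")
print(f"search time:   {seconds:.2e} s")
print(f"universe ages: {universe_ages:.1e}")
```

Even under these deliberately modest assumptions, an exhaustive search would take on the order of 10^17 times the age of the universe, and larger proteins or finer-grained conformations only make the number grow.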
Finding a way to close this gap and predict the structure of any known protein based solely on its amino acid sequence would change everything.
Enter AlphaFold
CASP was founded in 1994 as a means of spurring research on this great scientific challenge, giving researchers a forum to share their progress and test the accuracy of their predictions against real experimental data.
“The CASP approach has created intense collaboration between researchers working in this field of science and we have seen how it has accelerated scientific developments,” said Dr. Krzysztof Fidelis of UC Davis, one of the co-founders. “Since we first ran the challenge back in 1994, we have seen a succession of discoveries, each solving an aspect of this problem, so that computed models of protein structures have become progressively more useful in medical research.”
At the meetings, teams of researchers are asked to solve as many structures as possible from a given set of proteins using computational programs they have developed.
In 2018, at CASP13, the first iteration of AlphaFold made waves when it predicted more protein structures with greater accuracy than any other participant at the meeting. The Google team went on to publish their findings along with their code, to help spur further innovation.
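As a rough illustration of how a predicted structure can be scored against an experimental one, the sketch below computes a simplified form of GDT_TS, the summary score used in CASP assessments (the article itself does not name it). It assumes the two structures are already superimposed, which the full score does not; the function name and the coordinates are made up for the example.

```python
import math

# Distance cutoffs (in angstroms) averaged by the GDT_TS score.
CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts_presuperimposed(predicted, experimental):
    """Simplified GDT_TS for two equal-length lists of (x, y, z) positions.

    Assumes the structures are already superimposed; the full score
    searches over superpositions to maximise each fraction.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    deviations = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Fraction of residues that fall within each cutoff of their true position.
    fractions = [
        sum(d <= cutoff for d in deviations) / len(deviations)
        for cutoff in CUTOFFS
    ]
    return 100.0 * sum(fractions) / len(CUTOFFS)


# Toy four-residue example with made-up coordinates.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]

score = gdt_ts_presuperimposed(predicted, experimental)
print(f"simplified GDT_TS: {score:.2f}")
```

A perfect prediction scores 100 on this scale, and a single badly misplaced residue, like the last one in the toy example, lowers the score without dominating it.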
“Now, new deep learning architectures we’ve developed have driven changes in our methods for CASP14, enabling us to achieve unparalleled levels of accuracy,” wrote the Google team. “These methods draw inspiration from the fields of biology, physics, and machine learning, as well as of course the work of many scientists in the protein folding field over the past half-century.”
AlphaFold uses machine learning, a branch of computer science that deals with algorithms that improve with experience: rather than following hand-written rules, they adjust their own parameters as they are trained on examples of a task, and their performance improves as a result.
AlphaFold’s task begins with the pairs of amino acids that are likely to lie close together in the 3D structure. Rather than using the common strategy, known as covariation analysis, of predicting only whether such a pair is in contact, AlphaFold attempts to predict the distance between the two residues in the folded protein. These predictions are more difficult to make, but provide richer information about the folded protein structure. In a second step, AlphaFold uses this information to create a model of what the protein should look like, and is capable of determining highly accurate structures in a matter of days.
The latest version goes a step further: instead of just predicting relationships between amino acids, the system predicts the final structure of a target protein sequence. At CASP14, AlphaFold was able to determine the shape of roughly two-thirds of the proteins “with accuracy comparable to laboratory experiments”.
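The difference between a contact call and a distance can be seen in a toy example. The sketch below is not DeepMind's code and involves no prediction at all: it starts from made-up residue coordinates and shows how much detail is lost when distances are collapsed into yes/no contacts. The 8 angstrom threshold is a common convention and is assumed here.

```python
import math

CONTACT_THRESHOLD = 8.0  # angstroms; a common convention, assumed here


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue positions."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(distances, threshold=CONTACT_THRESHOLD):
    """Collapse each distance into a yes/no contact call."""
    return [[d < threshold for d in row] for row in distances]


# Made-up positions for five residues (not a real protein).
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]

distances = distance_matrix(coords)
contacts = contact_map(distances)

# Pairs (0, 1) and (0, 2) both count as contacts, yet one pair is
# twice as far apart as the other; only the distances reveal that.
for i, j in [(0, 1), (0, 2), (0, 3)]:
    print(f"residues {i},{j}: distance {distances[i][j]:.2f}, contact {contacts[i][j]}")
```

A model that outputs distances therefore constrains the final 3D shape far more tightly than one that outputs only the contact map.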
The DeepMind team says they are preparing a paper on their latest version of AlphaFold for publication in a peer-reviewed journal.
Real-world applications
Experts are excited about the impact this breakthrough will have, with some describing it as a “once in a generation advance”. It doesn’t mean the end of laboratory experiments, but the leg up that AlphaFold provides should allow researchers to ask more advanced questions and carry fields such as medicine and drug discovery forward.
The DeepMind team also applied AlphaFold to predicting the structures of several proteins from the SARS-CoV-2 virus earlier this year, predictions that were later supported by experimental studies. At CASP14, they also predicted the structure of another, previously unsolved coronavirus protein, ORF8.
“As well as accelerating understanding of known diseases, we’re excited about the potential for these techniques to explore the hundreds of millions of proteins we don’t currently have models for — a vast terrain of unknown biology,” said the DeepMind team.
While there are still hurdles to overcome, the excitement within the field will no doubt drive further innovation.
“Being able to investigate the shape of proteins quickly and accurately has the potential to revolutionize life sciences,” said Dr. Krzysztof Fidelis of UC Davis, one of the CASP organizers. “Now that the problem has been largely solved for single proteins, the way is open for development of new methods for determining the shape of protein complexes — collections of proteins that work together to form much of the machinery of life, and for other applications.”
This is a big step for computational biology, one with far-reaching implications. It will be interesting to watch as the field progresses in the coming years.