DeepMind’s Fifty-Year Breakthrough with Protein Folding

By Harit Phowatthansathian

DeepMind, a Google subsidiary, has just unveiled its revolutionary AlphaFold 2 algorithm during the 14th Critical Assessment of Techniques for Protein Structure Prediction (CASP14). The overarching goal of CASP14 is to develop an efficient, accurate algorithm for 3D protein structure modeling in a competitive environment (CASP14, 2020). Teams are provided with multiple protein domain sequences and are expected to predict their 3D structures, with each prediction quantified by a z-score against experimentally determined reference structures. The problem dates back 50 years, to when scientists first questioned whether protein folding could be predicted at all. Previously, the only viable methods for identifying 3D protein structures were X-ray crystallography and nuclear magnetic resonance, both requiring multi-million-dollar equipment and substantial amounts of time (AlphaFold: a solution to a 50-year-old grand challenge in biology, 2020). The recent excitement is due to AlphaFold 2’s exceptional results: it beat the runner-up’s accuracy by a factor of 2.6 and matched the quality of X-ray crystallography and nuclear magnetic resonance protocols. The research papers diving into AlphaFold 2’s model will be published in 2021; until then, its fundamentals can be explored by reviewing its predecessor, AlphaFold 1.

To undertake this momentous task, AlphaFold relied on pre-existing protein sample data on top of the given sequence of domains. The input data used to train AlphaFold falls into two main categories: database-derived data and situational data. Database-derived data is information on protein samples from the Protein Data Bank (PDB), mainly PSI-BLAST and multiple sequence alignment features. These statistics give insight into protein-protein relationships, for instance their intermolecular forces and association probability, among 484 other features, across a variety of combinations and situations (Senior and Evans, 2019). Database-derived input data acts as a pre-calibration step, giving the model more information to work from. Situational input data is case-specific to the domains given at CASP14, including the target domain sequences, torsion angles, pairwise distances, and Van der Waals force values. Both types of information are crucial for AlphaFold’s two-part methodology: convolutional neural networks and gradient descent.
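The pairing of per-residue database features into a grid of domain relationships can be sketched as follows. This is an illustrative assumption about the featurization, not DeepMind’s actual pipeline; the array shapes and feature counts are toy values:

```python
import numpy as np

def build_pair_features(seq_features: np.ndarray) -> np.ndarray:
    """Tile per-residue features into an L x L x 2F pairwise grid.

    seq_features: (L, F) array of per-residue features (e.g. MSA-derived
    statistics). Each grid cell (i, j) concatenates the features of
    positions i and j, so a downstream network can reason about every
    pairwise relationship. Simplified illustration only.
    """
    L, F = seq_features.shape
    rows = np.repeat(seq_features[:, None, :], L, axis=1)  # (L, L, F): position i
    cols = np.repeat(seq_features[None, :, :], L, axis=0)  # (L, L, F): position j
    return np.concatenate([rows, cols], axis=-1)           # (L, L, 2F)

# toy example: 5 positions, 3 features each
grid = build_pair_features(np.random.rand(5, 3))
print(grid.shape)  # (5, 5, 6)
```

In the real model the per-cell feature count is far larger (the 484 database-derived features plus situational data), but the grid shape follows the same pattern: one cell per pair of positions.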

Convolutional Neural Networks (CNNs) are the first and core component of AlphaFold’s methodology. Falling under the umbrella of machine learning, a CNN is a model that gives the computer a deeper view of the data set by letting it draw its own correlations between the given features. By creating its own understanding of the data, unbiased by human assumptions, a CNN can extract the most significant, relevant trends and extrapolate them onto other cases to cast predictions (Moolayil, 2020). CNNs are especially effective on large data sets with high dimensionality, making them a logical approach to the data-heavy inputs from the PDB. The CNN portion of the model uses the aforementioned inputs to output torsion angle and distance distribution predictions, which are then used in the gradient descent component. To elaborate, AlphaFold’s CNN component is pre-calibrated with the database-derived features. The CNN then maps out a 2D grid with the same sequence of domains on each axis to account for all protein domain relationships, on average a 104 by 104 domain grid. Each intersection within the grid represents one domain relationship and contains the 484 database-derived features plus additional situational data, from which the algorithm establishes significant linkages (Senior and Evans, 2019). The thousands of associations within each of the tens of thousands of combinations create a network that lets the machine gain a deeper view of this extensive data set. The 220 subsections of 64 by 64 domains that the model deems most significant for predicting torsion angles and distance probabilities are then used to make distance histograms (Senior and Evans, 2019). The resulting array of distance histograms defines the most feasible distance for each domain configuration in three dimensions. This 5-day, processor-intensive protocol gives the AlphaFold team the puzzle pieces that, when put together by the gradient descent component, compile into an accurate 3D protein structure.
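The grid-to-histogram step can be sketched in the same spirit. The real network stacks many convolutional layers over the feature grid; the single linear layer below, with made-up random weights, only illustrates how each grid cell ends up mapped to a normalized distance histogram (a "distogram") over a fixed set of distance bins:

```python
import numpy as np

def predict_distogram(pair_grid: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Map an L x L x C pairwise feature grid to per-pair distance
    histograms via a linear layer plus softmax over distance bins.

    pair_grid: (L, L, C) pairwise features.
    weights:   (C, n_bins) hypothetical learned parameters; a real model
               would use a deep stack of convolutions, not one layer.
    Returns (L, L, n_bins), where each cell is a probability histogram.
    """
    logits = pair_grid @ weights                      # (L, L, n_bins)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)  # each cell sums to 1

# toy example: 8 positions, 6 features per pair, 64 distance bins
rng = np.random.default_rng(0)
disto = predict_distogram(rng.normal(size=(8, 8, 6)),
                          rng.normal(size=(6, 64)))
print(disto.shape)  # (8, 8, 64)
```

Each cell of the output is a probability distribution over candidate distances for one pair of positions, which is exactly the form the gradient descent stage consumes.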

Gradient descent is the second component of AlphaFold’s model; it uses a simulated annealing process to piece the distance and torsion angle predictions together into the most likely 3D protein structure. Gradient descent, as the name suggests, is a machine learning tool that minimizes an inaccuracy metric within a specific system. The accuracy metric used here is root mean squared deviation (r.m.s.d.), a summation over the spatial probabilities of all the domain configurations for a given 3D structure. By converging toward the 3D structure with the lowest r.m.s.d., the output is the most plausible protein structure. Simulated annealing is a cyclical stochastic process built on top of gradient descent: it trials a variety of 3D configurations and adjusts its subsequent predictions according to previous r.m.s.d. values. Together, simulated annealing and gradient descent are analogous to finding the line of best fit for a scatter graph: gradient descent measures the error of one candidate structure against all the points, while simulated annealing is the trial-and-error search for the best-fitting structure among the candidates. In addition to the initial input data, two optimizations were implemented to increase accuracy by better reflecting actual protein folding environments: Rosetta scoring and noisy restarts. Rosetta scoring incorporates data about intermolecular forces, including Van der Waals forces and steric clashes, into the gradient descent calculation. Noisy restarts are a mechanism that adds noise to the torsion angle values output by the CNN after a few training runs, widening the pool of potential structures.
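The descent-plus-restart loop described above can be sketched on a toy distance-geometry problem. Everything here is an illustrative assumption: 2D points stand in for torsion-parameterized structures, a squared-error "stress" between predicted and current pairwise distances stands in for AlphaFold’s potential, and a finite-difference gradient replaces analytic derivatives. The noisy-restart idea, perturbing the best structure so far and re-optimizing, is the part being demonstrated:

```python
import numpy as np

def potential(coords: np.ndarray, targets: np.ndarray) -> float:
    """Squared error between pairwise distances of `coords` (N x 2 points)
    and a `targets` matrix of predicted distances."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return float(((d - targets) ** 2).sum()) / 2.0

def grad(coords, targets, eps=1e-5):
    """Numerical (finite-difference) gradient of the potential."""
    g = np.zeros_like(coords)
    for idx in np.ndindex(*coords.shape):
        step = np.zeros_like(coords)
        step[idx] = eps
        g[idx] = (potential(coords + step, targets)
                  - potential(coords - step, targets)) / (2 * eps)
    return g

def fold(targets, restarts=6, iters=400, lr=0.01, noise=1.0, seed=0):
    """Gradient descent with noisy restarts: each restart perturbs the
    best structure found so far, re-optimizes, and keeps the lowest-
    potential result. Mimics the trial-and-error loop described above,
    not AlphaFold's actual optimizer."""
    rng = np.random.default_rng(seed)
    best = rng.normal(size=(targets.shape[0], 2))
    best_e = potential(best, targets)
    for _ in range(restarts):
        x = best + rng.normal(scale=noise, size=best.shape)  # inject noise
        for _ in range(iters):
            x = x - lr * grad(x, targets)                    # descend
        e = potential(x, targets)
        if e < best_e:                                       # keep the best
            best, best_e = x, e
    return best, best_e

# toy target: pairwise distances of a unit square; descent should recover
# some rotation/reflection of it, driving the stress toward zero
truth = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
targets = np.linalg.norm(truth[:, None] - truth[None, :], axis=-1)
coords, energy = fold(targets)
print(f"final stress: {energy:.3f}")
```

The restart noise plays the same role as the noise added to the CNN’s torsion angles: it lets the search escape a poor local minimum instead of committing to the first structure the descent finds.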

AlphaFold 1, the CASP13 winner, scored an average of 68.5% with this two-part protocol, hinting at the effectiveness of CNNs (Thomson, 2020). AlphaFold 2 impressively averaged 87% and peaked at 92.4% in November 2020, crossing the accuracy threshold that molecular biologists consider sufficient for actual scientific research (How do proteins fold?, 2020). The direct implication of this breakthrough is its use in modeling microbial pathogens to identify target sites or mechanisms, fast-tracking vaccine research to save lives. This technology not only reduces the cost of biological research but also shortens the time it requires (Computational predictions of protein structures associated with COVID-19, 2020). As a recent development it has not yet been widely publicized, but it may well prove to be among the largest milestones of the decade.

References:

CASP14, 2020. CASP14. [online] Available at: <> [Accessed 2 December 2020].

DeepMind, 2020. AlphaFold: A Solution to a 50-Year-Old Grand Challenge in Biology. [online] Available at: <> [Accessed 2 December 2020].

Senior, A. and Evans, R., 2019. Improved Protein Structure Prediction Using Potentials from Deep Learning. [online] Available at: <> [Accessed 2 December 2020].

Moolayil, J., 2020. A Layman’s Guide to Deep Convolutional Neural Networks. [online] Medium. Available at: <> [Accessed 2 December 2020].

Thomson, A., 2020. DeepMind Breakthrough Helps to Solve How Diseases Invade Cells. [online] Available at: <> [Accessed 2 December 2020].

The Economist, 2020. How Do Proteins Fold? [online] Available at: <> [Accessed 2 December 2020].

DeepMind, 2020. Computational Predictions of Protein Structures Associated with COVID-19. [online] Available at: <> [Accessed 2 December 2020].
