By Ernest Poon
In 2003, after 13 years and over $3 billion, the ambitious multinational and multidisciplinary Human Genome Project was completed. Whilst the subsequent conversion of DNA to an amino acid sequence is relatively straightforward, the next step in gene expression, how the polypeptide chain folds to form a functional protein, is anything but. The three-dimensional conformation of a protein involves complicated, interconnected intramolecular interactions which make protein folding difficult to predict, much less understand. As with many biological structures, a protein's function is inextricably linked to its structure, and therefore to its amino acid sequence. Protein design aims to accurately determine an amino acid sequence that can fold into a given target structure (Regan et al., 2015). Since proteins have a multitude of biological functions, doing so will enable the on-demand creation of novel proteins as biological solutions to problems in medicine, biomanufacturing and biotechnology.
A key principle of protein folding is that a polypeptide folds into the conformation in which its free energy is at a minimum; this is known as its native state. Thus, at its core, protein folding is a thermodynamic problem. The main molecular forces involved in protein folding are hydrogen bonding and the hydrophobic effect. The hydrophobic effect causes the protein to adopt a conformation in which non-polar side chains, which cannot interact favourably with polar water molecules, are hidden away in the core of the protein. Additional van der Waals interactions between atoms contribute to the complex web of intramolecular forces holding a protein together. The multitude of interactions within each amino acid, how they affect the interactions between nearby amino acids, and the decreased entropy of a folded protein make it impossible in practice to accurately calculate the free energy of a structure. Even if it were possible, calculating the free energy of every candidate to find the free energy minimum is simply infeasible: the search space grows exponentially with sequence length. For example, for a polypeptide 10 amino acids long, there are 20^10 (roughly 10^13) possible sequences. This number is further increased once conformational isomers from side chain and backbone rotations are considered (Khoury et al., 2014).
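To give a feel for how quickly this search space grows, the count of possible sequences can be computed directly. The following is a minimal sketch that assumes only the 20 standard amino acids and ignores conformational isomers entirely; the function name `sequence_space_size` is illustrative and not taken from any library.

```python
# Illustrative only: counts sequences, not conformations.
NUM_AMINO_ACIDS = 20  # the 20 standard amino acids


def sequence_space_size(length: int) -> int:
    """Return the number of distinct amino acid sequences of a given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # Each position can independently hold any of the 20 residues.
    return NUM_AMINO_ACIDS ** length


if __name__ == "__main__":
    for n in (5, 10, 50, 100):
        print(f"{n:>3} residues: {sequence_space_size(n):.3e} possible sequences")
```

For 10 residues this gives 20^10 = 10,240,000,000,000, and every additional residue multiplies the count by 20, which is why exhaustive enumeration is out of reach for proteins of realistic length.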
Researchers are now using computational methods to address these challenges. Algorithms have been designed that combine complex mathematical modelling with statistical analysis of known proteins in the Protein Data Bank (PDB) to systematically determine possible amino acid sequences. However, the frameworks these algorithms follow can have significant effects on the accuracy of their predictions. This is demonstrated by the Critical Assessment of Techniques for Protein Structure Prediction (CASP): a competition in which various algorithms are challenged to predict the structure of a target protein from its amino acid sequence. Interestingly, the more similar the target is to solved templates in the PDB, the more accurate the predictions tend to be. This suggests that protein design algorithms, which rely on the same sequence-structure relationships, will become more accurate as more structures are solved.
In 2019, Zhou and colleagues demonstrated the value of template-based design quantitatively. They used statistical analysis to compare the sequence prediction abilities of a novel protein design framework, which predicts sequences based solely on existing PDB templates, against Rosetta, one of the original protein design algorithms. The novel framework, called dTERMen, decomposes the target structure into segments called tertiary motifs, which are then compared against known structures in the PDB to elucidate the final protein sequence. In contrast, Rosetta takes a mathematical approach, using Monte Carlo sampling to search for an optimal sequence. By measuring the similarity of the predicted sequences to the native sequence, they demonstrated that template-based modelling could be as accurate as other computational algorithms (Zhou, Panaitiu & Grigoryan, 2020).
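The two ideas in this comparison, Monte Carlo sampling over sequences and scoring a prediction by its similarity to the native sequence, can be illustrated with a toy example. The sketch below is not Rosetta's or dTERMen's method: the energy function is a hypothetical stand-in that only rewards hydrophobic residues at buried positions, the buried/exposed pattern and reference sequence are made up, and all function names are illustrative.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
HYDROPHOBIC = set("AVILMFW")          # simplified non-polar set (an assumption)


def toy_energy(sequence: str, burial: str) -> float:
    """Hypothetical score: +1 per polar residue at a buried ('B') position
    and +1 per hydrophobic residue at an exposed (any other) position."""
    if len(sequence) != len(burial):
        raise ValueError("sequence and burial pattern must have equal length")
    penalty = 0
    for residue, site in zip(sequence, burial):
        if (site == "B") != (residue in HYDROPHOBIC):
            penalty += 1
    return float(penalty)


def sequence_identity(predicted: str, native: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(predicted) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == n for p, n in zip(predicted, native)) / len(native)


def monte_carlo_design(burial: str, steps: int = 5000,
                       temperature: float = 0.5, seed: int = 0):
    """Metropolis Monte Carlo over sequences; returns (best_sequence, best_energy)."""
    rng = random.Random(seed)
    current = [rng.choice(AMINO_ACIDS) for _ in burial]
    energy = toy_energy("".join(current), burial)
    best, best_energy = list(current), energy
    for _ in range(steps):
        # Propose a single-residue mutation at a random position.
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)
        new_energy = toy_energy("".join(current), burial)
        delta = new_energy - energy
        # Metropolis criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best, best_energy = list(current), energy
        else:
            current[pos] = old  # reject: undo the mutation
    return "".join(best), best_energy


if __name__ == "__main__":
    pattern = "EBBEEBBEEBEE"    # made-up buried/exposed pattern
    reference = "SLIKELAEKLNG"  # made-up "native" sequence for comparison
    designed, energy = monte_carlo_design(pattern)
    print(designed, energy, sequence_identity(designed, reference))
```

Because such a crude energy function is satisfied by many different sequences, the identity to any one reference sequence will generally be low; real design frameworks rely on far richer energy terms or on statistics mined from known structures.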
Although protein design algorithms are far from perfect, they have already been applied successfully. Using Rosetta, Fleishman and colleagues at the University of Washington designed proteins to bind and inhibit hemagglutinin from the 1918 H1N1 virus (Fleishman et al., 2011). By harnessing the processing power of over 100,000 volunteer host computers, Rosetta predicted the structure of the SARS-CoV-2 spike protein weeks before cryo-electron microscopy could observe it in the lab. The spike protein allows SARS-CoV-2 to fuse with the host's cell membrane, thereby leading to infection. Information about the protein structure is now being used globally to develop vaccines and further understand the virus's behavior (Seydel, 2020). Similarly, protein design has also been applied therapeutically in the treatment of HIV. Bellows et al. used a de novo ('from scratch') approach to design inhibitors for gp41, a transmembrane glycoprotein subunit crucial to the infection of host cells by HIV-1 (Bellows et al., 2010). But perhaps the most atypical application of protein design algorithms has been by Eiben and coworkers. Aided by Rosetta and visual intuition, players of the online game 'Foldit' were challenged to redesign a computationally designed enzyme for the Diels-Alder reaction. These crowd-sourced solutions were subsequently evaluated and led to a remodeled enzyme with an 18-fold increase in activity (Eiben et al., 2012).
Protein design is a promising, albeit relatively novel, field in biology. Despite its early stage, developments in protein design have already had significant applications in medicine and biomanufacturing. Beyond creating or redesigning proteins, a better understanding of how proteins fold will be useful in the study of diseases such as Alzheimer's, which involve protein misfolding. However, for this to happen, more accurate and efficient modelling algorithms are required. Fortunately, this is likely to improve with time as technological developments increase computational power and more solved proteins in the PDB provide these algorithms with more templates to use as reference.
Bellows, M. L., Taylor, M. S., Cole, P. A., Shen, L., Siliciano, R. F., Fung, H. K. & Floudas, C. A. (2010) Discovery of Entry Inhibitors for HIV-1 via a New De Novo Protein Design Framework. Biophysical Journal. 99 (10), 3445-3453. Available from: http://dx.doi.org/10.1016/j.bpj.2010.09.050. doi: 10.1016/j.bpj.2010.09.050.
Eiben, C. B., Siegel, J. B., Bale, J. B., Cooper, S., Khatib, F., Shen, B. W., Foldit Players, Stoddard, B. L., Popovic, Z. & Baker, D. (2012) Increased Diels-Alderase activity through backbone remodeling guided by Foldit players. Nature Biotechnology. 30 (2), 190-192. Available from: https://search.datacite.org/works/10.1038/nbt.2109. doi: 10.1038/nbt.2109.
Fleishman, S. J., Whitehead, T. A., Ekiert, D. C., Dreyfus, C., Corn, J. E. (2011) Computational design of proteins targeting the conserved stem region of influenza hemagglutinin. Science. 332, 816-821.
Khoury, G. A., Smadbeck, J., Kieslich, C. A. & Floudas, C. A. (2014) Protein folding and de novo protein design for biotechnological applications. Trends in Biotechnology. 32 (2), 99-109. Available from: http://www.sciencedirect.com/science/article/pii/S0167779913002266. doi: 10.1016/j.tibtech.2013.10.008.
Regan, L., Caballero, D., Hinrichsen, M. R., Virrueta, A., Williams, D. M. & O’Hern, C. S. (2015) Protein Design: Past, Present, and Future. Biopolymers. 104 (4), 334-350. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4856012/. doi: 10.1002/bip.22639. [Accessed Nov 27, 2020].
Seydel, C. (2020) The hothouse for protein design. Nature Biotechnology. 38 (7), 779-784. Available from: https://www.nature.com/articles/s41587-020-0586-0. doi: 10.1038/s41587-020-0586-0.
Zhou, J., Panaitiu, A. E. & Grigoryan, G. (2020) A general-purpose protein design framework based on mining sequence–structure relationships in known protein structures. Proceedings of the National Academy of Sciences. 117 (2), 1059-1068. [Accessed Nov 27, 2020].