Designing Biological Systems through Machine Learning

By Shirin Bamezai

One of the most promising approaches to tackling global challenges in health and sustainability is biosystems design: the purposeful manipulation of biological systems, from the nucleic acid level to entire cells, through engineering principles, ideally within a “design-build-test-iterate (or deploy)” cycle (el Karoui, Hoyos-Flight and Fletcher, 2019). Constructing novel biosystems with desirable features in a quantitative and predictive manner is a tremendous challenge, because these systems are complex and interconnected and often show unanticipated behaviour and emergent network properties. Machine learning (ML) models, applied at each stage of the design cycle, offer a potential route to more targeted, faster and more successful biosystems design.

At a computational level, ML involves learning a function that maps an input value to a desired output value. The function is learned from a set of input-output pairs between which a correlation exists, known as the training data. ML algorithms identify correlation patterns in the training data and apply them to unseen data instances, thereby generating appropriate output values (Camacho et al., 2018). Several components must be taken into account when designing an ML model, the main ones being the input representation, output variables, loss function, hyper-parameters and model evaluation. In biosystems design, the type of data, the goal and the design decisions determine what these components look like. For example, the input to a protein function prediction model may consist of protein sequences, while the output could be structured as a sequence of labels, as this is most helpful for jointly predicting multiple protein functions (Volk et al., 2020). Although multiple categories of ML methods exist, the most frequently used is supervised learning, which uses a labelled training data set to infer a function that predicts output values. ML models can therefore be built for biosystems design applications that have clearly measurable input and output variables.
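
To make these components concrete, here is a minimal supervised learning sketch in plain Python. Everything in it is an illustrative assumption rather than a method from the cited papers: the sequences are random, the label is a toy property (the fraction of residues in a hand-picked hydrophobic set) standing in for a measured value, and the model is a simple linear regression fitted by gradient descent.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AILMFVW")  # toy grouping, an assumption for this sketch


def featurize(seq):
    # Input representation: amino acid composition, rescaled so that a
    # perfectly uniform composition maps to a vector of zeros.
    return [len(AMINO_ACIDS) * seq.count(a) / len(seq) - 1 for a in AMINO_ACIDS]


def toy_label(seq):
    # Output variable: hypothetical stand-in for an experimental measurement.
    return sum(c in HYDROPHOBIC for c in seq) / len(seq)


def predict(weights, bias, x):
    return bias + sum(w * xi for w, xi in zip(weights, x))


def mse(weights, bias, xs, ys):
    # Loss function: mean squared error between predictions and labels.
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


def train(xs, ys, learning_rate=0.3, epochs=200):
    # Hyper-parameters: learning_rate and epochs. Batch gradient descent on the MSE.
    weights, bias = [0.0] * len(xs[0]), 0.0
    n = len(ys)
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - y for x, y in zip(xs, ys)]
        for j in range(len(weights)):
            weights[j] -= learning_rate * 2 * sum(e * x[j] for e, x in zip(errors, xs)) / n
        bias -= learning_rate * 2 * sum(errors) / n
    return weights, bias


# Labelled training data: random toy sequences, not real proteins.
rng = random.Random(0)
seqs = ["".join(rng.choice(AMINO_ACIDS) for _ in range(30)) for _ in range(120)]
xs, ys = [featurize(s) for s in seqs], [toy_label(s) for s in seqs]
train_x, test_x, train_y, test_y = xs[:90], xs[90:], ys[:90], ys[90:]

weights, bias = train(train_x, train_y)

# Model evaluation: error on held-out sequences, compared with a baseline
# that always predicts the mean training label.
baseline = mse([0.0] * len(AMINO_ACIDS), sum(train_y) / len(train_y), test_x, test_y)
test_error = mse(weights, bias, test_x, test_y)
print(f"baseline MSE: {baseline:.6f}, model MSE: {test_error:.6f}")
```

The held-out comparison is the important part: a model is only useful for design if it predicts well on sequences it was not trained on.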

A well-designed model trained on a large data set could be used to predict the outcomes of experiments in silico and to inform an experimental design strategy for a given biosystems design goal, saving time, labour and material costs. A “design-build-test-learn” cycle can also be implemented: once a model is selected and carried forward for experimental validation, the data collected during validation are added to the training data set, leading to progressively more accurate predictions (Volk et al., 2020). ML models of this kind can help researchers overcome the distinct hurdles encountered in designing gRNA for CRISPR/Cas9, gene networks, proteins, pathways, genomes and even systems at the process level.
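
The cycle described above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the workflow from Volk et al.: `run_assay` is a hypothetical stand-in for the build and test steps, the designs are just integers from 0 to 100, and the model is a nearest-neighbour average.

```python
import random


def run_assay(design):
    # Hypothetical stand-in for building a design and measuring it in the lab.
    return -(design - 70) ** 2


def predict(design, measured, k=3):
    # Learned model: average outcome of the k nearest designs measured so far.
    nearest = sorted(measured, key=lambda d: abs(d - design))[:k]
    return sum(measured[d] for d in nearest) / len(nearest)


def design_build_test_learn(candidates, rounds=5, batch_size=4, seed=0):
    rng = random.Random(seed)
    # Initial training data: a small random batch of measured designs.
    measured = {d: run_assay(d) for d in rng.sample(candidates, batch_size)}
    history = [max(measured.values())]
    for _ in range(rounds):
        untested = [d for d in candidates if d not in measured]
        # Design: rank untested candidates by their predicted outcome.
        batch = sorted(untested, key=lambda d: predict(d, measured), reverse=True)[:batch_size]
        # Build and test: measure the selected designs.
        for d in batch:
            # Learn: each new measurement joins the training data used next round.
            measured[d] = run_assay(d)
        history.append(max(measured.values()))
    return measured, history


measured, history = design_build_test_learn(list(range(101)))
print("best measured outcome after each round:", history)
```

This version is purely greedy. A practical loop would usually also reserve part of each batch for exploring regions where the model is uncertain.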

In particular, ML is an effective tool for protein engineering. Proteins perform a variety of functions, from regulating transcription to catalysing metabolic reactions, and these functions are encoded in their amino acid sequences. Protein engineering aims to identify amino acid sequences that optimize a protein function of interest. This is exceptionally challenging: the protein sequence space is vast, with the number of possible sequences increasing exponentially with sequence length, whilst the number of variants that are functional and of interest remains very low. Accurately mapping amino acid sequences to protein functions is therefore difficult, and an exhaustive search for variants with optimized functionality is impractical. Instead, an ML model can be trained on data sets of amino acid sequences and their respective fitness scores, and then used to navigate the sequence space and predict the functional measurements of novel sequences (Xu et al., 2020). Such a model can be used in conjunction with directed evolution, allowing researchers to explore the protein sequence space efficiently and identify its highly functional regions (Wu et al., 2019).
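
The scale of the problem is easy to check with a few lines of arithmetic. With 20 standard amino acids, a sequence of length L has 20^L possible variants. The screening capacity of one million variants used below is an assumed figure for illustration only.

```python
def sequence_space_size(length, alphabet_size=20):
    # Number of distinct sequences of a given length over the amino acid alphabet.
    return alphabet_size ** length


SCREENED = 10 ** 6  # assumed screening capacity, for illustration only

for length in (4, 10, 100):
    size = sequence_space_size(length)
    covered = min(1.0, SCREENED / size)
    print(f"length {length}: {size:.2e} sequences, fraction screened {covered:.2e}")
```

Varying just four positions already gives 160,000 combinations, and ten positions give roughly 10^13, which is why a model that ranks candidates before they are built is so valuable.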

In conclusion, ML methods make a quantitative and predictive approach to biosystems design possible, bringing targeted, efficient construction of biological systems within a “design-build-test-iterate (or deploy)” cycle closer to reality. Development at the intersection of synthetic biology, automation and machine learning is expected to drive predictive biology and enable the production of new machine learning algorithms. In turn, these advances will facilitate further improvements in biosystems design.


Camacho, D. M., Collins, K. M., Powers, R. K., Costello, J. C., & Collins, J. J. (2018). Next-Generation Machine Learning for Biological Networks. Cell, 173(7).

Volk, M. J., Lourentzou, I., Mishra, S., Vo, L. T., Zhai, C., & Zhao, H. (2020). Biosystems Design by Machine Learning. ACS Synthetic Biology, 9(7).

Wu, Z., Kan, S. B. J., Lewis, R. D., Wittmann, B. J., & Arnold, F. H. (2019). Machine learning-assisted directed protein evolution with combinatorial libraries. Proceedings of the National Academy of Sciences of the United States of America, 116(18).

Xu, Y., Verma, D., Sheridan, R. P., Liaw, A., Ma, J., Marshall, N. M., McIntosh, J., Sherer, E. C., Svetnik, V., & Johnston, J. M. (2020). Deep Dive into Machine Learning Models for Protein Engineering. Journal of Chemical Information and Modeling, 60(6).

El Karoui, M., Hoyos-Flight, M., & Fletcher, L. (2019). Future Trends in Synthetic Biology: A Report. Frontiers in Bioengineering and Biotechnology, 7.
