Pathogenomics with Machine Learning Enhanced Genome-Wide Association Studies

By Harit Phowatthanasathian

The thirteen-year, 2.7-billion-dollar Human Genome Project was an unprecedented endeavor, and its completion was a momentous stepping stone for biological advancement. Roughly a decade and a half later, Next Generation Sequencing (NGS) has taken its place, with the ability to sequence an entire human genome within a day for approximately a thousand dollars (Behjati and Tarpey, 2020). A high-throughput methodology, NGS amplifies fragmented, fluorescently labeled DNA in parallel, then reads the resulting signals to determine the sequence. Having seen success with SARS-CoV-2 sequencing, NGS is also being applied to other microbes for a more comprehensive understanding of genome-based virulence and pathogenicity.

Building on the efficient foundation of modern NGS procedures, Genome-Wide Association Studies (GWASs) have been increasing in popularity. A GWAS typically pools up to a thousand NGS-derived genomes, compares the genomic variations among them, and associates single or multiple target sequences with specific phenotypes (Genome-Wide Association Studies, 2018). Recently, food safety research has focused on food-borne pathogens and the relationship between a microbe’s genome and its virulence and pathogenicity (Njage and Henri, 2019). The step from GWAS to risk assessment in food safety remains relatively unexplored because these GWAS data sets are noisy. This is where machine learning could make real progress, by drawing statistically relevant results out of seemingly insignificant data output (Njage and Henri, 2019). The integration of machine learning, which has itself grown rapidly in recent years, allows researchers to extract maximum value from these GWASs.
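To make the idea of associating a variant with a phenotype concrete, here is a minimal sketch of a per-variant association scan. It is not the method of any cited study: the variant names and the presence/absence calls are hypothetical toy data, and the test used is a plain Pearson chi-square on a 2x2 table. Real GWAS pipelines also correct for population structure and multiple testing, and with counts this small an exact test would be more appropriate.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # variant or phenotype does not vary: no evidence
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def scan_variants(genotypes, phenotypes):
    """genotypes: variant name -> list of 0/1 calls, one per isolate.
    phenotypes: list of 0/1 labels (e.g. 1 = virulent), one per isolate."""
    results = {}
    for name, calls in genotypes.items():
        pairs = list(zip(calls, phenotypes))
        a = sum(1 for g, y in pairs if g == 1 and y == 1)
        b = sum(1 for g, y in pairs if g == 1 and y == 0)
        c = sum(1 for g, y in pairs if g == 0 and y == 1)
        d = sum(1 for g, y in pairs if g == 0 and y == 0)
        results[name] = chi_square_2x2(a, b, c, d)
    return results

# Hypothetical toy data: 8 isolates, 2 candidate variants.
phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = {
    "variant_A": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the phenotype exactly
    "variant_B": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the phenotype
}
results = scan_variants(genotypes, phenotypes)
for name, (chi2, p_value) in results.items():
    print(f"{name}: chi2={chi2:.2f}, p={p_value:.4f}")
```

In this toy scan the variant that tracks the phenotype gets a large statistic and a small p-value, while the unrelated variant gets a statistic of zero. With thousands of variants and noisy labels the picture is far less clean, which is the gap machine learning is meant to help close.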

Traditional risk assessment methodology is slow and yields limited results. Classical protocols require multiple microbial identifications, stress response testing, toxicity screening, and large-scale trials to produce situation-specific results. Even with the improved efficiency of NGS procedures, effective research is held back by the large volume of high-dimensional, highly variable data produced for each of the hundreds of samples per trial. Samples also differ in lineage and in their loci of interest, not to mention the potentially thousands of mutations common in microbes (Njage and Henri, 2019). The classical approach simply cannot process this many data points, let alone draw significant conclusions from them.

Machine learning thrives on large, high-dimensional datasets and requires only a fraction of the time, making it well suited to risk analysis of GWAS data. Drawing on available databases from hospitals and genomic laboratories, algorithms are trained to detect previously identified pathogenic genes and variants, then used to predict possible pathogenicity in the microbes targeted by a GWAS. Correlations learned from previously collected data across a variety of gene targets allow these programs to extract statistically significant signals from new data while ignoring the surrounding noise. Studies targeting food-borne microbes, like that of Njage’s team, have shown promising accuracies of up to 89% when using machine learning to identify and predict virulent sequences (Njage and Henri, 2019). Similar pathogenomic studies have shown encouraging associations between gene sequence and infectivity, with results ranging from 63% to 90% depending on the lineage (Micholls and John, 2020). This proof of concept argues that future research should leverage the technological capabilities we already have and work smarter, not harder.
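As a rough illustration of what training an algorithm on labeled genomes looks like, the sketch below fits a small logistic regression to gene presence/absence data using only the Python standard library. Everything in it is an assumption made for illustration: the three "genes", the eight isolates, and their labels are invented, and logistic regression is simply a convenient stand-in rather than the model used in the studies cited above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by full-batch gradient descent.
    X: rows of 0/1 gene presence calls; y: 0/1 labels (1 = pathogenic)."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            err = sigmoid(z) - label  # gradient of the log-loss w.r.t. z
            grad_b += err
            for j, x in enumerate(row):
                grad_w[j] += err * x
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_proba(model, row):
    """Predicted probability that an isolate is pathogenic."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

# Hypothetical toy data: columns are three invented genes; in this toy set
# only the first gene determines the label, the other two are noise.
X = [
    [1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = train_logistic(X, y)
predictions = [int(predict_proba(model, row) >= 0.5) for row in X]
training_accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print("weights:", [round(w, 2) for w in model[0]])
print("training accuracy:", training_accuracy)
```

The learned weight on the first gene ends up much larger than the weights on the two noise genes, which is the toy version of a model picking a signal out of surrounding noise. Accuracy on the training data says little about performance on new isolates; figures like those reported in the literature come from held-out evaluation.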

Unfortunately, training machine learning models is not as easy as this may suggest. Robust models require adjustments including model selection, probability weighting, cross-validation, and a host of other optimizations (Rojas, 2020). Furthermore, applying NGS and GWAS relies on basic assumptions, for instance Chargaff’s rule of base pairing, double- and single-stranded genetic material, five available nucleotides, and other fundamental rules, which must also be reflected in the training algorithm (Njage and Henri, 2019). To work in harmony, GWAS and machine learning must share a common set of assumptions, an unavoidable hurdle that researchers have yet to fully explore.
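Two of the adjustments mentioned here, cross-validation and model selection, can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the data are invented, the two candidate "models" (a majority-class baseline and a single-gene rule) are deliberately trivial and hypothetical, and the helper names are not taken from any library.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_val_accuracy(fit, predict, X, y, k=4, seed=0):
    """Mean held-out accuracy of a model given as fit/predict callables."""
    scores = []
    for train, test in k_fold_splits(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Candidate 1 (hypothetical): always predict the majority training label.
def fit_majority(X, y):
    return int(sum(y) * 2 >= len(y))

def predict_majority(model, row):
    return model

# Candidate 2 (hypothetical): use the single gene that best matches the label.
def fit_best_gene(X, y):
    n_genes = len(X[0])
    return max(
        range(n_genes),
        key=lambda g: sum(row[g] == label for row, label in zip(X, y)),
    )

def predict_best_gene(model, row):
    return row[model]

# Invented toy data: rows are isolates, columns are gene presence calls.
X = [
    [1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

acc_majority = cross_val_accuracy(fit_majority, predict_majority, X, y)
acc_gene = cross_val_accuracy(fit_best_gene, predict_best_gene, X, y)
print("majority baseline:", acc_majority)
print("single-gene rule:", acc_gene)
```

Model selection here amounts to keeping whichever candidate scores better on held-out folds. Because every isolate is tested exactly once by a model that never saw it during fitting, the comparison is less flattering, and more honest, than accuracy measured on the training data.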

Given the early success of machine-learning-boosted GWASs, what do we stand to gain by investing more resources in this endeavor? Firstly, significant results would add to the expanding repertoire of known pathogenic genes. The clinical implication is that, with a better understanding of potentially infectious or dangerous microbial genes, the healthcare system will be better prepared to handle cases before they happen. We would understand not only the possible effects target genes might have on humans in general, but also, more specifically, their influence on different communities, races, ages, or genders (Pompe and Simon, 2005). Secondly, by developing a base algorithm for machine-guided research, we are building a foundation for a prosperous field of bioinformatics. Like a snowball rolling down a hill, the first models will lead to improved models, which will produce more data for even better future algorithms.

We are still in the early stages of machine-learning-assisted research, but its implications are far-reaching because it is not limited to the genomic field. GWAS itself can be applied to all known species, because all living beings contain some form of genetic material that can be analyzed. Zooming out, this type of machine-assisted research is theoretically suitable for any area that requires considerable data processing. In evolving fields such as stem cell research, embryonic development, and neurobiology, where overwhelming datasets hinder progress, we can expect to see machine-learning-assisted research shine.


Behjati, S. and Tarpey, P., 2020. What Is Next Generation Sequencing. [online] [Accessed 25 November 2020].

Genome-Wide Association Studies, 2018. [online] [Accessed 23 November 2020].

Njage, P. and Henri, C., 2019. Machine Learning Methods As A Tool For Predicting Risk Of Illness Applying Next-Generation Sequencing Data. [online] [Accessed 25 November 2020].

Micholls, H. and John, C., 2020. Reaching The End-Game For GWAS: Machine Learning Approaches For The Prioritization Of Complex Disease Loci. [online] [Accessed 26 November 2020].

Rojas, H., 2020. Machine Learning Optimization Techniques. [online] [Accessed 26 November 2020].

Pompe, S. and Simon, J., 2005. Future Trend And Challenges Of Pathogenomics. [online] [Accessed 26 November 2020].
