Machine learning reveals the existence of downstream core promoter regions

By Haoyu Li

Genes and their expression patterns underlie cell phenotypes and cell states, so the mechanisms of gene expression regulation have always been a core field of molecular biology. Within the hierarchy of gene regulation, transcriptional regulation was the first branch to attract researchers' attention and is the one about which the most is known. After decades of research, we know that in both eukaryotes and archaea the level of gene transcription depends decisively on two key groups of components: cis-regulatory elements (CREs) and trans-acting factors (TAFs) (Lee and Young, 2013; Andersson and Sandelin, 2020). In the context of transcriptional regulation, the former consist of non-coding DNA such as promoters, enhancers, and silencers, while the latter consist of RNA polymerase, transcription factors, chromatin remodelers, and even some RNA-binding proteins (RBPs) (Xiao et al., 2019).

From the standpoint of experimental research, CREs and TAFs can be viewed as contributing independently to transcriptional regulation, and therefore as deserving equal attention. However, the logical order of biological mechanisms (e.g. two proteins must bind each other before a third protein that recognizes their complex can bind) dictates that the intrinsic sequence patterns of CREs often directly determine how TAFs bind and act. A classic example is the experiment performed by Wilson et al. (2008). The liver transcription factors HNF1α, HNF4α, and HNF6 have almost completely conserved DNA-binding domains in human and mouse, yet ChIP-seq (a method that combines chromatin immunoprecipitation with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins) shows that these factors have extremely different genome-wide binding patterns in the two species. When Wilson et al. (2008) examined liver cells from mice carrying human chromosome 21, the mouse factors bound the human chromosome in a pattern indistinguishable from the way the human factors bind it in human cells. This indicates that the interspecies differences in binding pattern are not due to variation in the factors themselves or in the intracellular environment, but rather to sequence differences between homologous DNA regions of the two species. Such differences alter the intrinsic sequence patterns mentioned above. Nonetheless, this is not to say that the fundamental logic of gene regulation is “sequence-determined”. After all, the cells of an individual all contain the same set of DNA sequences, yet different cell types and tissues under various physiological and pathophysiological conditions often show highly diverse gene regulation patterns. Even when epigenetic features (DNA methylation, chromatin accessibility, etc.) are treated as a conceptual extension of CREs, this diversity remains hard to predict and decipher. In other words, despite the dominant role of CREs, TAFs are still key elements in shaping the landscape of gene regulation. They act in combination with CREs in a manner reminiscent of the chicken-and-egg question, in that the order of cause and effect is often tangled (Zeitlinger, 2020).

Therefore, uncovering the sequence logic of various CREs is of great significance. Such logic is usually characterized in the form of consensus sequences, binding motifs, and position weight matrices. Although these characterizations help describe CRE activation and TAF binding patterns, they carry little information and offer little flexibility, which makes them inaccurate at predicting CRE behaviour. Machine learning, in contrast, is much better at recognizing intrinsic CRE sequence patterns, because it can extract patterns from complex datasets in ways that traditional analysis methods centred on short sequence alignments cannot. Before being applied to promoter sequence pattern recognition, machine learning had already been used to predict DNA methylation rates, RBP binding sites, and three-dimensional chromatin associations, among others (Alipanahi et al., 2015; Zhou et al., 2018; Grønning et al., 2019).
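
To make the idea of a position weight matrix concrete, here is a minimal Python sketch of how such a matrix might be built from aligned binding sites and used to scan a sequence. The count matrix, the pseudocount, the uniform background frequency, and the function names are all assumptions chosen for illustration; none of them come from the studies cited above.

```python
import math

# Hypothetical count matrix for a 4 bp motif: one dict per position,
# giving how often each base was seen there in a set of aligned sites.
# These counts are invented for illustration only.
COUNTS = [
    {"A": 2, "C": 1, "G": 1, "T": 16},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 16, "C": 1, "G": 2, "T": 1},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every base


def build_pwm(counts, pseudocount=1.0):
    """Turn per-position counts into log2 odds scores against the background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / BACKGROUND)
            for base in "ACGT"
        })
    return pwm


def score(pwm, site):
    """Sum the per-position scores; each position contributes independently."""
    return sum(column[base] for column, base in zip(pwm, site))


def best_hit(pwm, sequence):
    """Slide the matrix along the sequence and return (best score, start index)."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    pwm = build_pwm(COUNTS)
    print(best_hit(pwm, "GGCTATAGG"))
```

Because the score is a plain sum over positions, a matrix of this kind cannot express dependencies between positions, which is one way to see the limited flexibility described above.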

On 9 September 2020, Vo ngoc et al. (2020) from UCSD published a study in Nature titled “Identification of the human DPR core promoter element using machine learning”. The team built a machine learning model that predicts downstream core promoter activity from large-scale DNA sequence screening data. The model demonstrates, for the first time, the existence of the downstream core promoter region (DPR, the region 17-35 bp downstream of the transcription start site, TSS, where TAFs bind), and once again shows the huge potential of machine learning in genome studies.

Core promoter sequences (typically the DNA region within 40 bp upstream and downstream of the TSS) often contain elements that are conserved across species and genes, most famously the TATA box, a 7 bp sequence found in almost all archaea and eukaryotes. However, the TATA box is found in only 25% of human gene promoters. Other promoters contain elements such as the motif ten element and the downstream core promoter element, but like the TATA box, these elements are present only in specific gene promoters and at a fairly low rate. Whether promoter sequences share a common intrinsic pattern therefore remained a mystery. In addition, before the publication of Vo ngoc et al.'s study, the existence of the DPR had also been unresolved.

To address these key questions, Vo ngoc et al. (2020) hypothesized that all DPRs share a common sequence characteristic that allows optimal activation. To test this hypothesis, the team synthesized 500,000 DNA sequence variants, each carrying a random sequence at the DPR. These sequences were then transcribed into RNA using an expression system (a genetic construct designed to produce a protein or an RNA, either inside or outside a cell). The transcription strength of each variant (the number of RNA transcripts produced relative to the number of DNA copies of that variant fed into the expression system) gives a measure of DPR activity. 200,000 variants were selected to “train” a machine learning program, and the rest were kept for later testing of the program. The program extracted patterns from the DNA sequences and their associated transcription strengths, and then used these patterns to predict transcription strength when only the DNA sequence was given. Its predictions had a correlation coefficient of 0.9 with transcription strengths measured in an expression system. Having confirmed that the training had worked, the team used the program to analyse natural human promoter sequences. The program predicted that 25-34% of these promoters have a strongly active DPR (transcription strength at least 5-fold higher when the DPR is active than when it is silenced). In comparison, traditional short-sequence-centred screening found a DPR in only 0.4%-0.5% of these sequences. The results therefore indicate that the DPR is a genuine element of gene regulation, but one whose intrinsic characteristics are so complicated and “out-of-order” that traditional screening methods cannot identify it or predict its behaviour, while machine learning models can.
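
The workflow described above (measure transcription strength, train on one subset, test on the held-out rest, report a correlation coefficient) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the data are simulated rather than real, the sequence length and sample sizes are arbitrary, and the per-position averaging model is a deliberately simple stand-in, not the model used by Vo ngoc et al. (2020). All function names are hypothetical.

```python
import math
import random

BASES = "ACGT"


def transcription_strength(rna_count, dna_count):
    """RNA transcripts observed per DNA copy of one variant."""
    return rna_count / dna_count


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_screen(n_variants, length, rng):
    """Simulated stand-in for a screen: random variants with RNA and DNA counts."""
    # Hidden per-position effects play the role of the unknown biology.
    hidden = [{b: rng.uniform(-1.0, 1.0) for b in BASES} for _ in range(length)]
    data = []
    for _ in range(n_variants):
        seq = "".join(rng.choice(BASES) for _ in range(length))
        dna = rng.randint(50, 200)
        signal = sum(w[b] for w, b in zip(hidden, seq)) + rng.gauss(0.0, 0.3)
        rna = max(1, round(dna * 2.0 ** signal))
        data.append((seq, math.log2(transcription_strength(rna, dna))))
    return data


def fit_additive(train):
    """Stand-in model: mean log2 strength for each base at each position."""
    length = len(train[0][0])
    sums = [{b: 0.0 for b in BASES} for _ in range(length)]
    seen = [{b: 0 for b in BASES} for _ in range(length)]
    for seq, y in train:
        for i, base in enumerate(seq):
            sums[i][base] += y
            seen[i][base] += 1
    return [
        {b: sums[i][b] / max(seen[i][b], 1) for b in BASES}
        for i in range(length)
    ]


def predict(model, seq):
    """Predicted score for a sequence: sum of the per-position averages."""
    return sum(column[base] for column, base in zip(model, seq))


def run_demo(seed=0):
    """Train on 40% of the simulated variants and evaluate on the rest."""
    rng = random.Random(seed)
    data = simulate_screen(2000, 8, rng)
    train, test = data[:800], data[800:]
    model = fit_additive(train)
    predictions = [predict(model, seq) for seq, _ in test]
    measured = [y for _, y in test]
    return pearson(predictions, measured)


if __name__ == "__main__":
    print(f"held-out Pearson r = {run_demo():.2f}")
```

The important design point is that the correlation is computed only on variants the model never saw during training, which mirrors the way the held-out variants were used to check the program in the study.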

In conclusion, CREs and TAFs interact at promoter sequences in an interdependent and intricate manner. The complex, non-linear mapping between sequence and activity makes it nearly impossible for human intuition to pinpoint the components that play fundamental roles in the scheme. Machine learning offers a way to capture this mapping and can therefore lead to important discoveries.


Alipanahi B. et al (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838.

Andersson R. and Sandelin A. (2020). Determinants of enhancer and promoter activities of regulatory elements. Nat. Rev. Genet. 21, 71–87.

Grønning A. et al (2019). DeepCLIP: Predicting the effect of mutations on protein-RNA binding with Deep Learning. bioRxiv 757062.

Lee T. and Young R. (2013). Transcriptional regulation and its misregulation in disease. Cell 152, 1237–1251. 

Vo ngoc L. et al (2020). Identification of the human DPR core promoter element using machine learning. Nature 585, 459–463.

Wilson M. et al (2008). Species-specific transcription in mice carrying human chromosome 21. Science 322, 434–438.

Xiao R. et al (2019). Pervasive Chromatin-RNA Binding Protein Interactions Enable RNA-Based Regulation of Transcription. Cell 178, 107–121.e18.

Zeitlinger J. (2020). Seven myths of how transcription factors read the cis-regulatory code. Curr. Opin. Syst. Biol. 301, 127065.

Zhou J. et al (2018). Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179.
