Digital Commons @ RU


Student Theses and Dissertations

A Molecular Characterization of Human General Transcription Factor IID

Alexander Hoffmann

Date of Award

Document type, degree name.

Doctor of Philosophy (PhD)

RU Laboratory

Roeder Laboratory

The general transcription initiation factor IID plays a central role in transcriptional control as a direct target for a diverse array of gene-specific regulatory factors and as the only template-bound class II initiation factor. The work described in this thesis concerns itself with the molecular characterization of this transcription initiation factor, starting with the cloning of cDNAs from a variety of organisms, including human, encoding a protein that may substitute for native TFIID fractions in DNA binding and in vitro basal transcription assays. Sequence comparisons identify important structural motifs in the protein. Further functional analyses lead to the realization that this protein is the TATA box binding subunit (TBP) of a multi-protein TFIID complex whose other constituents are required for activator-responsive in vitro transcription and are responsible for TFIID's characteristic DNA interactions around the initiation region. A variety of biochemical approaches are discussed leading to the identification of more than a dozen class II TBP-associated factors (TAFs) that as a whole make up the TFIID complex and its characteristic functions. A bacterial expression system allowing for the convenient purification of large amounts of recombinant protein in non-denaturing conditions is presented. This has allowed structural studies on TBP that have culminated in detailed X-ray crystallographic structures of TBP by itself or bound to the TATA box revealing unprecedented DNA distortions. The design of a convenient mutagenesis approach of TBP is presented that may lead to a more refined understanding of TBP and TAF functions. Cloning of a small subunit (TAF20) of the TFIID complex is taken as a starting point to chart protein-protein interactions within the complex. 
Furthermore, sequence homologies suggest possible functions for some TAFs and lead to a proposed revision of TFIID's functional role in eukaryotic chromatin and in mechanisms of transcriptional initiation and regulation.

A Thesis submitted to the Faculty of The Rockefeller University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Recommended Citation

Hoffmann, Alexander, "A Molecular Characterization of Human General Transcription Factor IID" (1994). Student Theses and Dissertations . 353. https://digitalcommons.rockefeller.edu/student_theses_and_dissertations/353


Included in

Life Sciences Commons


  • Open access
  • Published: 14 October 2024

Identifying transcription factors with cell-type specific DNA binding signatures

  • Aseel Awdeh 1,2
  • Marcel Turcotte 1
  • Theodore J. Perkins 1,2,3

BMC Genomics volume 25, Article number: 957 (2024)


Transcription factors (TFs) bind to different parts of the genome in different types of cells, but it is usually assumed that the inherent DNA-binding preferences of a TF are invariant to cell type. Yet, there are several known examples of TFs that switch their DNA-binding preferences in different cell types, and yet more examples of other mechanisms, such as steric hindrance or cooperative binding, that may result in a “DNA signature” of differential binding.

To survey this phenomenon systematically, we developed a deep learning method we call SigTFB (Signatures of TF Binding) to detect and quantify cell-type specificity in a TF’s known genomic binding sites. We used ENCODE ChIP-seq data to conduct a wide-scale investigation of 169 distinct TFs in up to 14 distinct cell types. SigTFB detected statistically significant DNA binding signatures in approximately two-thirds of TFs, far more than might have been expected from the relatively sparse evidence in prior literature. We found that the presence or absence of a cell-type specific DNA binding signature is distinct from, and indeed largely uncorrelated with, the degree of overlap between ChIP-seq peaks in different cell types. Signatures tended to arise by two mechanisms: use of established motifs at different frequencies, and selective inclusion of motifs for distinct TFs.

Conclusions

While recent results have highlighted cell state features such as chromatin accessibility and gene expression in predicting TF binding, our results emphasize that, for some TFs, the DNA sequences of the binding sites contain substantial cell-type specific motifs.


Introduction

Transcription factors (TFs) bind to gene promoters and enhancers to regulate gene expression, and are therefore major determinants of cell fate decisions, metabolic activity, and, when regulation goes awry, of disease [ 1 , 2 , 3 ]. TFs bind relatively short preferred DNA sequences, or motifs, typically 5 to 20 bases long [ 4 , 5 ]. Because these motifs are so short, the human genome often harbors millions of potential matches for a given motif [ 6 ]. Yet, ChIP-seq studies of TF binding show that in any given condition, a TF typically binds only several thousands or tens of thousands of those sites [ 7 ]. Moreover, that same TF will bind some overlapping but some distinct sites when comparing different cell types or disease conditions [ 8 ]. There are many mechanisms that can drive differential binding of a TF, including: differential expression [ 9 ], chromatin accessibility [ 10 ], conformational changes or complexing with other regulatory factors [ 11 ], cooperative or competitive binding [ 12 , 13 ], and alternative splicing [ 14 ].

One of the lesser-studied mechanisms of differential binding is a change in the DNA preference of the TF itself. Indeed, one often assumes the reverse: that the DNA binding preference of a TF is the same regardless of the cell type or condition in which it is expressed. Binding motif databases such as JASPAR [ 15 ] and HOCOMOCO [ 16 ], and experimental methods such as HT-SELEX [ 17 ], are predicated on this assumption. Their success shows that, to a substantial extent, the assumption is good. Nevertheless, there are several well-documented cases of DNA preference switching by TFs. For instance, estrogen receptor \(\alpha\) binds distinct DNA patterns in different cancer cell lines, such as breast cancer and endometrial cancer [ 18 ]. The strongest binding sites are bound across all cancer lines, but different lower affinity sites are bound depending on the binding co-factors available. Similarly, in human embryonic stem cells (hESCs), SOX2 binds regions with distinct motifs depending on whether it is co-binding with PAX6, leading to hESC neural differentiation, or OCT4, leading to self-renewal [ 19 ]. Wang et al. performed a systematic comparison of human TF binding preferences, including a study of other TF motifs present in the peaks of five TFs in five cell types [ 20 ]. Arvey et al. used machine learning to study determinants of TF binding in different cell types, with the key finding that cell-type specific sequences were an important factor in predicting binding [ 21 ]. Keilwagen et al. studied 31 transcription factors, identifying key features useful in predicting cell-type specific binding [ 22 ]. Given the greater wealth of data available today, the time is right for a re-investigation of cell-type specificity in TF-DNA binding sites.

Motivated by the possibility that certain TFs might have cell-type specific DNA signatures at or in the vicinity of their binding sites, we set out to perform a comprehensive and systematic search for the phenomenon. To perform this search, we developed SigTFB (Signatures of TF Binding), a deep learning framework to quantify the degree to which cell-type specific DNA signatures are present in a TF’s binding sites. One of the advantages of SigTFB is that it can accommodate TF binding data from any number of cell types, without knowing which subset(s) of cell types, if any, may show different DNA binding signatures. Traditional differential motif enrichment analysis can identify known or de novo motifs that may vary between two datasets [ 23 , 24 ], but it cannot identify subsets of datasets that vary relative to others. Moreover, its computational complexity and the necessary multiple comparison corrections scale quadratically with the number of datasets, problems that SigTFB avoids. In addition, like other deep learning frameworks that have been highly successful for analyzing TF binding [ 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 ], SigTFB is capable in principle of learning sophisticated DNA patterns that influence TF binding. This is a good choice when many different or even unknown molecular mechanisms could be generating binding signatures. However, our problem formulation differs from previous work, which has largely focused on separating binding sites from non-binding sites, while sometimes optimizing other criteria such as the spatial resolution of predictions [ 30 , 31 , 32 ], interpretability of results [ 35 ], or data efficiency [ 37 ]. In contrast, in our formulation, all instances are bona fide, empirical binding sites for a given TF, determined by replicate ChIP-seq experiments. Our task is to predict in which cell types those sites are bound. 
By comparing that prediction performance with performance when the target cell type is hidden, we can quantify the extent to which deep learning can identify cell-type specific DNA patterns. Our work also stands distinct from, and complementary to, recent studies of how multiple TFs combine to discriminate genuine/functional binding sites across the genome in a single cell type [ 43 , 44 ]. Moreover, our question is distinct from the question of whether a TF has different roles or functions in different cell types. A TF may bind different sites in different cell types, and regulate different genes, and yet the DNA motif it binds in those sites can be identical. Conversely, differences in DNA binding motif in different cell types do not necessarily imply that the functional role of that TF is noticeably different, particularly if there are key binding sites that remain the same between cell types. The question of DNA binding signatures is one that most directly pertains to the mechanisms that guide a TF to bind its target sites in different cell types.

Using our SigTFB method, we investigated the binding of 169 distinct human TFs assayed by one or more antibodies (AB) (for a total of 199 distinct TF-AB pairs) across multiple cell types (ranging from 2 up to 14 for any given TF; 35 distinct human cell types in total). We found that different TFs show varying degrees of cell-type specific DNA binding signatures, with approximately two-thirds of TFs having significant cell-type specific signatures. Importantly, we found that the mere presence of differential binding is not the same as having a DNA signature of differential binding. Many TFs bind very different sites in different cell types, yet show no specific, discriminating DNA signatures at those differential sites. In such cases, TF binding differences may be due to mechanisms that do not leave strong local signals in the DNA, such as chromatin accessibility [ 10 ]. We also compared our results when analyzing data from the same TF assayed by different antibodies, and found that, with a few exceptions, there is generally good agreement on whether a TF displays cell-type specific DNA binding signatures. Finally, we show that across all TFs and cell types, differences in DNA signatures commonly emerge as differences in how frequently the same motifs are present, but in several cases as radical switches to the inclusion of different motifs.

The remainder of the paper is organized as follows. First, we introduce the supervised learning problem formulation that we propose for quantifying the presence of cell-type specific DNA binding signatures. Next, we describe our deep learning-based SigTFB method for solving that problem. We then analyze two TFs in detail: ATF7, which shows substantial cell-type specificity in the DNA sequences of its peaks, and CTCF, which does not. We then provide a summary analysis of all TFs. Next, we perform some embedding and motif analyses to further investigate cell-type specific sequences in peaks and the deep learning representations of those sequences. Finally, we conclude with a discussion of our results, their strengths and limitations, and directions for future work.

A supervised learning formulation for detecting cell-type specific DNA-binding signatures

In this section, we describe a novel supervised learning problem whose solution allows one to identify and quantify cell-type specific DNA-binding signatures in a collection of known binding sites for a single TF across a set of cell types. The essential idea is to take DNA sequences from known binding sites, and employ supervised learning to predict whether a TF binds that sequence in a given cell type. In addition, the supervised learning is asked to make the same prediction, but when the target cell-type information is hidden. The difference in predictive performance between the cell-type specific instances and the cell-type hidden (or general) instances quantifies the extent to which the learner is able to pick up cell-type specific DNA signatures that improve binding prediction.

For our study, we turned to the ENCODE project peak calls [ 7 ] to identify high-quality, known TF binding sites. Because different antibodies for a TF can have different specificities or biases, we chose not to mix data from different antibodies. We identified 169 TFs satisfying the following criteria: 1) The TF is assayed by the same antibody in at least two human cell types; 2) for each cell type, “experiment” level peaks are available. Such peaks are present in at least two replicate ChIP-seq experiments and pass an irreproducible discovery rate test at a 2% threshold; and 3) there are at least 1000 such peaks for each cell type. Some TFs satisfied these criteria for more than one antibody, so in total we identified 199 transcription factor-antibody (TF-AB) combinations that we could use to study cell-type specificity in DNA binding. The full list of experiment accession numbers is available in Supplementary Table 1. We downloaded the called peaks for these accession numbers from the ENCODE website.

In our formulation, each TF-AB combination is studied separately. For each TF-AB, we begin by constructing a “unified” set of peaks across all the cell types in which that TF was assayed by that AB, using the approach developed by Basset et al. [ 45 ] in their study of chromatin accessibility (Fig. 1a). Starting with the set of all peaks identified in all cell types (for that TF-AB pair), we repeatedly merge any two peaks that overlap by at least 30bp, keeping track of which cell types contribute to each merged peak. At the end of this process, we have a set of unified peaks annotated with source cell types (at least one, and as many as all cell types). The unified peaks can be of varying sizes depending on the sizes of the original peaks, their degree of overlap, and how many different peaks are combined. To “normalize” them for ease of supervised learning, the center of each unified peak is taken and extended by 50bp in each direction, such that the length of each interval is 101bp. (This window size has been common in other, similar studies, and our own pilot study showed degradation of performance below 101bp, and no gain with bigger windows. See Discussion for more information.) With C cell types, each unified peak is translated into 2C supervised learning instances. In C of those instances, the input is the DNA sequence of the unified peak center along with a one-hot encoding of one of the cell types, and the output is one or zero depending on whether that cell type had a peak or not at that location. These are called the cell-type specific instances. The other C instances associated with the unified peak are identical, except that the part of the input vector encoding the cell type is zeroed out. We call these the cell-type general instances. 
As mentioned above, the intuition behind this formulation is that the difference in predictive performance between the cell-type specific instances and the cell-type general instances is a measure of the extent to which knowing the cell type informs one's interpretation of the input DNA sequence. In other words, it is a measure of the presence of cell-type specific binding site sequences, or DNA binding signatures, for this TF-AB pair, across this set of cell types.
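
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the tuple layout, and the single sorted sweep (used here as a simple stand-in for repeated pairwise merging) are all assumptions, and window coordinates stand in for the DNA sequence that would be extracted from the genome.

```python
from typing import List, Tuple

# Hypothetical peak record: (chrom, start, end, cell_type), half-open coordinates.
Peak = Tuple[str, int, int, str]

MIN_OVERLAP = 30   # merge threshold stated in the text (bp)
HALF_WINDOW = 50   # centre +/- 50 bp gives a 101 bp window


def unify_peaks(peaks: List[Peak]) -> List[list]:
    """Merge peaks overlapping by >= MIN_OVERLAP bp, tracking source cell types.

    Assumption: one sweep over sorted peaks approximates repeated merging.
    """
    unified: List[list] = []  # each entry: [chrom, start, end, set_of_cell_types]
    for chrom, start, end, cell in sorted(peaks):
        last = unified[-1] if unified else None
        if (last is not None and last[0] == chrom
                and min(last[2], end) - start >= MIN_OVERLAP):
            last[2] = max(last[2], end)
            last[3].add(cell)
        else:
            unified.append([chrom, start, end, {cell}])
    return unified


def centre_window(start: int, end: int) -> Tuple[int, int]:
    """Return the 101 bp half-open window around the peak centre."""
    centre = (start + end) // 2
    return centre - HALF_WINDOW, centre + HALF_WINDOW + 1


def build_instances(unified: List[list], cell_types: List[str]) -> List[tuple]:
    """Turn each unified peak into 2C instances: C specific and C general."""
    n = len(cell_types)
    instances = []
    for chrom, start, end, sources in unified:
        window = (chrom, *centre_window(start, end))
        for i, cell in enumerate(cell_types):
            label = 1 if cell in sources else 0
            one_hot = [1 if j == i else 0 for j in range(n)]
            instances.append((window, one_hot, label))   # cell-type specific
            instances.append((window, [0] * n, label))   # cell-type general
    return instances
```

Each specific instance and its general twin share the same window and label, so any gain in accuracy on the specific instances can only come from the cell-type vector.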

figure 1

Supervised learning formulation and deep learning architecture. a Empirical binding sites are unified across cell types. In “cell-type specific” instances, each site’s DNA sequence and one-hot encoded cell type is associated to a binary bound/unbound output. In matching “cell-type general” instances, the cell-type information is hidden, but DNA input and bound/unbound output remain the same. b Simplified diagram of deep learning architecture, and its division into Stage 1 and Stage 2 Models. Stage 1 is shown in the red outline, and Stage 2 is shown in the blue outline. In Stage 1, the input instance is a one-hot encoded DNA sequence of size \(101\times 4\). This is passed through the convolutional layer (Conv) with N filters of size M, through a maxpool layer (Maxpool) of length N, a fully connected layer (Dense) of length K, then the output layer of length C to predict if the TF is bound or not in the different cell types. Stage 2 takes cell-type information of length C as input as well, as depicted in panel (a). This is first passed through fully connected layers of lengths P1 and P2, and then concatenated with the output of the maxpool layer from Stage 1. The concatenated output is passed through a fully connected layer of length Q to predict whether the sequence is bound or not in that cell type

SigTFB: a two-stage deep learning model to study DNA-signatures associated to TF binding

To solve the learning problem described in the previous section, we developed a deep learning architecture and two-stage training approach called SigTFB (Fig. 1b). The two-stage approach is modeled after that of Nair et al. [ 46 ]. Stage 1 of training is meant to help initialize the first DNA sequence-interpreting layers of the network, and is described further in the Methods section. In Stage 2, the network’s inputs and outputs are as described above. Our network includes a modified version of DeepBind [ 25 ], consisting of one convolutional layer followed by one fully connected layer (Fig. 1b). Unlike DeepBind, the number of channels in our model is set as a hyperparameter. We also investigated the use of more complex models, with more than one convolutional layer. However, one convolutional layer gave the best results in terms of validation accuracy and loss. The lower part of the network in Fig. 1b takes as input the length C binary vector encoding a specific cell type or a zero vector, processes it through dense layers and then combines that with the DNA-processing side of the network through additional dense layers until reaching a binary output node.
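
To make the layer sizes concrete, the sketch below traces the shapes through both stages as the Fig. 1b caption describes them. It is only a bookkeeping aid under stated assumptions: the function name is hypothetical, the convolution is assumed to use stride 1 with no padding, the max pooling is read as pooling over all sequence positions (one value per filter), and the hyperparameter values in the usage note are made up.

```python
def sigtfb_shapes(C, M, N, K, P1, P2, Q, seq_len=101):
    """Return (stage1, stage2) lists of (layer description, output shape).

    Assumptions: unpadded stride-1 convolution; max pooling over all positions.
    """
    stage1 = [
        ("one-hot DNA input", (seq_len, 4)),
        ("Conv, N filters of size M", (seq_len - M + 1, N)),
        ("Maxpool over positions", (N,)),
        ("Dense", (K,)),
        ("output, one unit per cell type", (C,)),
    ]
    stage2 = [
        ("cell-type vector (one-hot or all zeros)", (C,)),
        ("Dense", (P1,)),
        ("Dense", (P2,)),
        ("concatenate with Stage 1 maxpool output", (N + P2,)),
        ("Dense", (Q,)),
        ("output, bound or not bound", (1,)),
    ]
    return stage1, stage2
```

For example, `sigtfb_shapes(C=4, M=24, N=16, K=32, P1=8, P2=8, Q=16)` gives a convolution output of shape `(78, 16)` and a concatenated Stage 2 vector of length 24.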

We use negative log likelihood loss for training. The entire training procedure is performed in 10-fold cross-validation, with held-out performance being recorded for each cell type, along with the macro-average across cell types. Within each fold, performance is also averaged over 10 random initial weight sets and training trajectories. During both training and testing, instances are randomly chosen in mini-batches to have the same number of positive and negative instances from each cell type, and the same number of cell-type specific and cell-type general instances, avoiding any problems with class imbalance. Finally, all of that is wrapped within Ax [ 47 ] for tuning the various network layer size hyperparameters M, N, K, P1, P2 and Q shown in Fig. 1b. The AUROC is computed for each cell type and for cell-type specific and general instances separately. The macro-averaged AUROC is computed across cell types, and the difference in macro-averaged AUROC between cell-type specific and general instances is used as our measure of cell-type specificity.
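
The specificity measure described above can be written out directly. The following is a small sketch, assuming held-out labels and prediction scores are already available per cell type; the function names and the dictionary layout are hypothetical, and the AUROC is computed with the pairwise (Mann-Whitney) definition rather than any particular library routine.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def specificity_score(per_cell):
    """Macro-averaged AUROC difference, specific minus general.

    per_cell maps cell type -> {"specific": (labels, scores),
                                "general": (labels, scores)}.
    """
    spec = [auroc(*d["specific"]) for d in per_cell.values()]
    gen = [auroc(*d["general"]) for d in per_cell.values()]
    return sum(spec) / len(spec) - sum(gen) / len(gen)
```

Macro-averaging gives every cell type equal weight regardless of how many peaks it contributes, which matches the balanced sampling described in the text.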

In the next two sections, we provide a detailed analysis of our results for two transcription factors. First, we examine a transcription factor with a high degree of cell-type specificity in its DNA binding signature. Then, we present an example where SigTFB found little evidence of cell-type specificity in the DNA binding signature.

ATF7 binding shows cell-type specific DNA binding signatures

To illustrate our approach, we first focus on Activating Transcription Factor 7 (ATF7). As a member of the ATF family, ATF7 binds to the cyclic AMP response element (CRE) with the consensus DNA sequence “TGACGTCA” [ 48 , 49 ]. Members of the ATF family are basic leucine zipper (bZIP) factors that complex with other bZIP factors to form homodimers or heterodimers [ 48 , 49 , 50 , 51 ]. These ATF TFs exhibit varying functionalities in different tissues and cancerous cell types, including tumour suppressive and oncogenic functions [ 49 ]. For instance, the deletion of ATF7 results in the spread of lymphoma [ 49 ]. Conversely, the activation of ATF7 in gastric or hepatocellular carcinoma promotes the proliferation of cancer cells. As such, ATF7 may be used as a biomarker for the early detection of tumours in liver and gastric cell types. Due to the differences observed, we suspect ATF7 to bind to different places along the genome in different cell types.

Our ENCODE [ 7 ] data compendium (see the “A supervised learning formulation for detecting cell-type specific DNA-binding signatures” section) includes ATF7 peaks in four cell types: GM12878, K562, HepG2 and MCF-7. The cancerous cell types HepG2, MCF-7 and K562 correspond to liver hepatocellular carcinoma, breast cancer and myelogenous leukemia respectively. GM12878 is a non-cancerous lymphoblastoid cell type. Figure 2a shows a Venn diagram of the peak overlaps between the four cell types. The number of peaks per cell type is shown in brackets after the cell-type name. A mere 1.36% of the total number of peaks across all four cell types overlap, and the majority of the peaks are unique to one of the four cell types. For example, 22.36% of the K562 peaks do not overlap with peaks from other cell types. There is greater peak overlap between the pairs HepG2 and MCF-7, and GM12878 and K562.

figure 2

Illustration of SigTFB on ATF7 and CTCF. a Venn diagram of percentage overlap between cell types for ATF7. b ROC curves per cell type per condition: cell-type general (dashed line) and cell-type specific (solid line) for ATF7. c AUC per cell type per condition: cell-type general (shaded) and cell-type specific (not shaded) for ATF7. Numbers at the tops of pairs of bars are the AUC difference between cell-type general and specific instances. d Heatmap of percentage overlap between 14 cell types in CTCF assayed by antibody ENCAB000AXX. e ROC curves per cell type per condition: cell-type general (dashed line) and cell-type specific (solid line) for CTCF. f AUC per cell type per condition: cell-type general (shaded) and cell-type specific (not shaded) for CTCF. Numbers at the tops of pairs of bars are the AUC difference between cell-type general and specific

The lack of overlap between peaks in the four cell types does not imply cell-type specificity in DNA binding preference, as sequences in those peaks may be very similar. Differences in output may be due to dissimilarities in terms of noise, bias or even the number of peaks of the ChIP-seq experiments. For instance, HepG2 has over 40,000 peaks while MCF-7 has fewer than 30,000. Therefore, no more than 75% of HepG2 peaks could possibly overlap with MCF-7 peaks.

To determine if there are cell-type specific DNA signatures in the ATF7 peaks, we applied our deep learning method, SigTFB, as described in the previous section. Figure 2b shows the receiver operating characteristic (ROC) curves for each cell type with and without the cell-type identity being provided, as well as averaged performance across all cell types. The plot shows high variability in site prediction across cell types. Predictions for HepG2 (solid red curve) are significantly better than for MCF-7 and K562 (solid purple and green), which are better than for GM12878 (solid blue). In this case, predictions are more accurate when the network is informed of cell type than when it is not (solid curves versus dashed curves). This trend is also true for the macro-averaged ROC curve (gray color in Fig. 2b). Figure 2c shows the area under the ROC curve (AUC) per cell type per condition for the ATF7 TF, where the shaded and unshaded bars are cell-type general and cell-type specific cases respectively. For each cell type, as well as the macro-averaged result, there is a clear difference between the two conditions. Cell-type specific classification outperforms cell-type general classification with a macro-averaged AUC difference of 0.2 ( \(p\ll 0.05\) ; one-sample t-test on AUC difference). Thus, we can conclude that the network has detected DNA signatures discriminating peaks in different cell types. Further below, we investigate what exactly those signatures might be.
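
As an illustration of the significance test mentioned above, the sketch below computes a one-sample t statistic on per-fold AUC differences against a null mean of zero. It assumes the test is applied to the 10 cross-validation folds; the function name and the fold values in the usage note are hypothetical, not results from the study. The constant 2.262 is the standard two-sided 5% critical value of Student's t with 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-sided 5% critical value, Student's t, 9 degrees of freedom


def one_sample_t(diffs, mu0=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean(diffs) == mu0.

    Raises ZeroDivisionError if all differences are identical.
    """
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return (mean(diffs) - mu0) / standard_error, n - 1
```

With made-up fold differences such as `[0.18, 0.21, 0.20, 0.19, 0.22, 0.20, 0.21, 0.19, 0.20, 0.20]`, the statistic far exceeds `T_CRIT_DF9`, so the null hypothesis of no AUC difference would be rejected at the 5% level.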

CTCF binding does not show cell-type specific DNA binding signatures

We next examine CCCTC-binding factor (CTCF), which can function as a transcriptional repressor, transcriptional activator, or as an insulator barrier between genomic domains. The CTCF binding domain is defined by 11 zinc fingers, and binding preference is believed to be invariant across cell types [ 52 , 53 , 54 ]. Importantly, this does not mean that CTCF always binds the same sites in different cell types, nor does it mean that it has the same function in different cell types. Indeed, by binding different sites or by binding the same sites with different binding partners, CTCF can have cell-type specific functions. For instance, CTCF has been shown to bind specific groups of regulatory elements in different brain cell types [ 55 , 56 , 57 ], where it specifically regulates memory-related genes, among others [ 58 ]. Conversely, these cell-type specific functions or binding sites do not imply any difference in direct CTCF-DNA binding preference or other signature. Therefore, we used SigTFB to test whether CTCF binding sites had any cell-type specific DNA signatures.

In our ENCODE data compendium, CTCF is assayed by five different antibodies. Here, we focus on the antibody that was used the most, giving us empirical binding sites for CTCF in 14 different cell types: smooth muscle cell, GM23338, bipolar neuron, neural progenitor cell, fibroblast of dermis, myotube, PC-3, astrocyte, HCT116, hepatocyte, osteoblast, OCI-LY7, MCF-7 and SK-N-SH. The percentage overlap of ChIP-seq peaks between each pair of cell types is shown in Fig. 2d, where each entry of the heatmap shows the percentage of peaks of the row’s cell type overlapping peaks in the column’s cell type. Additionally, the number of peaks per cell type is shown in brackets after the row cell type label. Overlap percentages range from approximately 50% to 90%, with an average of 77%. Cell types with fewer peaks tend to be better covered by cell types with more peaks, suggesting an element of peak detection power is at play. For instance, the astrocyte dataset has the fewest peaks at \(\approx\) 37,000, which are more than 90% covered by the CTCF peaks in every other cell type, even distantly related cell types such as osteoblasts or fibroblasts (first row in Fig. 2d).
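
The row-versus-column overlap percentages in such a heatmap are asymmetric, which a short sketch makes explicit. The code below is an assumption-laden illustration: the function names are hypothetical, peaks are `(chrom, start, end)` tuples with half-open coordinates, and any overlap of at least 1 bp counts (the text does not state the threshold used for the heatmap). The quadratic scan is fine for toy inputs only.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def percent_overlap(row_peaks, col_peaks):
    """Percentage of row peaks that overlap at least one column peak."""
    hit = sum(1 for p in row_peaks if any(overlaps(p, q) for q in col_peaks))
    return 100.0 * hit / len(row_peaks)


def overlap_matrix(peaks_by_cell):
    """Nested dict: matrix[row][col] = percent of row's peaks covered by col."""
    cells = list(peaks_by_cell)
    return {
        r: {c: percent_overlap(peaks_by_cell[r], peaks_by_cell[c]) for c in cells}
        for r in cells
    }
```

If one cell type's peaks are a subset of another's, the smaller set is 100% covered while the larger set is only partly covered, which mirrors the observation that cell types with fewer peaks tend to be better covered.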

Figure 2d gives some intuition about the datasets. However, as seen for ATF7, a simple intersection analysis is not sufficient to determine cell-type specificity. We further investigated the binding activity of CTCF by training SigTFB on CTCF and its 14 corresponding cell types. Figure 2e shows the ROC curves for each of the cell types and the macro-averaged ROC across all cell types. Compared to ATF7, there is relatively little difference in binding site predictability across cell types and nearly no difference in predictability for a given cell type, with or without cell-type identity information. Cell types OCI-LY7 (lavender line) and bipolar neuron (indigo line) have the worst prediction performance, and also have the highest number of peaks. Possibly, some fraction of these peaks are less reliable, which would explain both inflated peak numbers and prediction difficulty. Figure 2f shows there is little to no difference in the area under the ROC curves (AUC) between cell-type specific (unshaded bars) and cell-type general (shaded bars) conditions for each cell type ( \(p>0.5\) ; one-sample t-test on percentage differences). Consequently, these results illustrate the ubiquitous non-cell-type specific nature of CTCF DNA binding preferences. Importantly, they also demonstrate the specificity of SigTFB, in that it does not incorrectly report cell-type specificity where there is none to be found.

Comprehensive analysis of cell-type specific DNA-signatures in 169 transcription factors

Motivated by our results for ATF7 and CTCF, we expanded our study to investigate cell-type specific DNA binding signatures in all 169 TFs (199 TF-AB pairs). Figure 3a displays a scatter plot of the mean AUC of prediction when the network is (y-axis) or is not (x-axis) told what cell type it is predicting for. Each point corresponds to a TF-AB combination. The color gradient depends on the negative \(\log_{10}\) p-values for statistical significance of difference between the cell-type specific and cell-type general predictions, across the 10 folds of cross validation. We observe a continuum of cell-type specificity, where TFs with the least cell-type specificity lie on the \(x=y\) diagonal of the scatter plot. For these TFs, the cell-type information does not improve prediction. The position of a point along the diagonal may depend on the extent to which there are common motifs for the TF across cell types, the extent to which the peaks themselves overlap across cell types, data set quality, or other factors. Conversely, points lying above the diagonal indicate that the network predicts binding better when informed of the cell type; these are TFs with the most significant cell-type specificity, where the network responds differently to input DNA sequences depending on the cell type for which it is predicting. Points in the upper left corner correspond to TFs where cross-cell-type prediction is virtually impossible, but prediction is highly accurate for specific cell types. For such TFs, each cell type is expected to have specific DNA motifs that discriminate its binding sites.

figure 3

Summary statistics from our comprehensive study on DNA signatures in TF binding using SigTFB. a Scatter plot of cell-type specific AUC versus cell-type general AUC with the color gradient depending on the -log10 p-values. b Bar chart of AUC differences. c Scatter plot of percentage overlap (%) and AUC differences, with the color gradient depending on number of cell types per TF-AB. d Scatter plot of AUC differences for TFs with more than two ABs. The letter “D” at the top of the panel indicates a TF with strong evidence of direct sequence-specific DNA binding, as opposed to possible indirect DNA binding through intermediaries

Figure 3 b shows a histogram of the distribution of AUC differences between cell-type specific and general predictions for the different TF and AB combinations, with select TFs highlighted. Out of 199 TF-AB combinations, 127 TF-ABs, or 116 distinct TFs, have a statistically significant AUC difference of at least 0.1, suggesting that a majority of TFs have some degree of cell-type specific DNA signatures in their binding sites. TFs that play a pivotal role in cancer either as oncogenes or suppressors, such as MYC [ 59 ], BACH1 [ 60 ], ATF7 [ 49 ], and SOX6 [ 61 ], show a relatively higher cell-type specificity than other TFs, such as CTCF [ 53 ] and HCFC1 [ 62 ], that are involved in chromatin regulation or other cellular processes. Supplementary Table 1 lists the AUC differences for all TF-AB pairs.

As explained above, the lack of overlap between binding sites in different cell types is not evidence per se of any differential DNA signature. We next examined whether there is any association between the two. Figure 3 c plots the mean pairwise percentage peak overlap versus the mean AUC difference for each TF-AB. No clear relationship between the two variables is seen (Spearman correlation r  = -0.1). With either a high percentage overlap between 75% and 80% or a lower percentage overlap between 65% and 70%, the AUC difference ranges between roughly zero and 0.4. This confirms that peak overlap is not in itself an indicator of cell-type specificity in DNA binding signatures.
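For readers who want to reproduce this kind of check on their own values, here is a minimal sketch of the Spearman rank correlation in plain Python. It is the textbook definition (Pearson correlation of average ranks). The helper names are hypothetical and no claim is made that the authors computed it this way.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold the mean pairwise percentage overlap per TF-AB and `y` the corresponding mean AUC difference.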

We also considered the possibility that AUC differences might somehow be an artifact of the number of cell types in the analysis. For instance, if there were only two cell types, perhaps it is more likely that one of the two would contain some spurious signal that allows peak discrimination. Conversely, perhaps the more cell types are assayed, the more likely there is to be an “outlier” cell type with spurious DNA signals in the peaks. We found little evidence of such a phenomenon. The color gradient in Fig. 3 c indicates the number of cell types tested. The correlation of AUC difference with number of cell types is minor ( r  = -0.1).

Of the 169 distinct TFs we studied, 24 were assayed with multiple ABs. Lack of consistency across ABs due to different off-target biases or binding affinities may impact the TF’s DNA signatures. Moreover, different ABs may have been used on different sets of cell types. Nevertheless, we may be reassured of the generality of our results if our measure of cell-type specificity is consistent between different sets of experiments with different ABs for the same TF. Figure 3 d shows a plot of the AUC differences for the 24 TFs assayed by at least 2 ABs. For example, CTCF was assayed with six different ABs, all of which returned relatively low estimates of cell-type specificity (five of six being below 0.04). Conversely, several TFs show consistently high cell-type specificity across multiple ABs, including TCF12, SPI1, MNT, IKZF1 and DPF2. The least consistency is seen for the TF ETV6. Surprisingly, both datasets for ETV6 explore the same two cell types, GM12878 and K562, yet produce very different results for cell-type specificity: essentially 0 for one antibody and 0.37 for the other. This may be due to differences in the ABs used, or could be a result of differences in the total number of peaks per dataset for each TF-AB combination. Overall, however, there is strong consistency in our measure of cell-type specificity of TF binding, even when assayed by different ABs.

We also marked TFs in Fig. 3 d with a “D” above them when there was strong evidence from prior research that the TF directly binds DNA in a sequence specific manner, as opposed to indirectly as part of a complex. We took as “strong evidence” the TF being annotated with the Gene Ontology molecular function “RNA polymerase II cis-regulatory region sequence-specific DNA binding”, and having a DNA binding motif in one or both of the JASPAR [ 15 ] and HOCOMOCO [ 16 ] databases. Despite the possibility that direct- versus indirect-DNA binders might have different propensities towards cell-type specificity, we see no obvious trend in that regard.

TFs with cell-type specificity show differential enrichment for known TF-DNA binding motifs

As mentioned above, SigTFB attempts to identify the presence of cell-type specific DNA signatures in a TF’s peaks, but doesn’t explicitly tell us what those signatures are. In this section, we investigate further what those signatures might be, starting with ATF7. First, adopting a similar approach to AI-TAC [ 63 ], we used the t-SNE algorithm to represent each ChIP-seq peak per cell type by its activation values in two dimensions across the neurons of the final fully connected layer of the stage 2 model. Figure 4 a and b show the ATF7 t-SNE plots for cell-type general and specific instances respectively. Each point/instance is colored depending on which cell type(s) it belongs to. The cell-type general instances (Fig. 4 a) appear as a single cluster, although the peaks from some cell types do tend to be on one side or the other of the mass. The network’s internal representations of the cell-type specific instances however, group perfectly by cell type (Fig. 4 b). A similar analysis of the CTCF learned model (Fig. 4 c, d) shows a single cluster for both general and specific instances. Although the cell-type specific instances show some increasing grouping by cell type, it apparently has little effect on predictive power, as we saw negligible AUC difference above. These results reinforce our contention that SigTFB is able to learn cell-type specific representations of DNA binding sequences, especially for ATF7 in comparison with CTCF.

Figure 4. Neural network interpretation and motif enrichment in peaks. a - d tSNE embedding of unit activations in final dense layer. Points represent individual peaks, colored by originating cell type(s). a ATF7 peak embedding when cell-type information is withheld. b ATF7 peak embedding when cell-type information is provided. c CTCF peak embedding when cell-type information is withheld. d CTCF peak embedding when cell-type information is provided. e - f Enrichment of top JASPAR motifs in ATF7 ( e ) and CTCF ( f ) peaks, as the fraction: number of motif-containing peaks using default FIMO parameters, divided by total number of peaks

Next, we explored the filters from the convolutional layer by converting the filters into PWMs and using Tomtom [ 64 ] to search for the PWMs in the JASPAR database [ 15 ]. In the ATF7 network, we found that most filters had a match or partial match to a small number of known TFs. For example, \(\approx\) 40% of the filters matched best to a JUND motif. ATF7 and JUND both have basic leucine zipper domains, with similar consensus binding sequences, and are known to physically interact [ 65 ]. Another set of filters matched the motif for SP2, which has a very different zinc finger binding domain that prefers gapped sequences of G’s or C’s.

To assess more systematically which TF motifs might be present in ATF7’s binding sites, we constructed 35 base pair windows around the positions that in silico mutagenesis found to have the most influence on network output, and then ran FIMO to identify significant motif hits for a library of 400 high-confidence human TF-DNA binding motifs from JASPAR [ 66 ]. (The 35 base pair window size is larger than any motif in the library, and provides a more focused analysis less likely to include the many short TF motifs by random chance.) Figure 4 e shows the fraction of ATF7 peaks in each cell type that included motif hits for the 33 motifs with most significant results. ATF7 peaks in HepG2 and MCF-7 cell types have very similar enrichment patterns, and include expected high enrichment of ATF family motifs, as well as other similar motifs. The peaks in K562 are surprisingly low in such motifs, although the enrichment levels are statistically significant. Instead, the K562 peaks are enriched in KLF- and SP-family motifs. The GM12878 peaks have some enrichment for all of these families, along with several other unique results such as ELF and ETV motifs. The presence of these different cell-type specific motif hits, along with the convolutional filter analysis, suggests that the ATF7 model may be looking at motif frequencies and/or the presence of other TFs’ motifs, to help discriminate ATF7 peaks in different cell types. A parallel analysis of CTCF’s peaks found nearly identical motif enrichment across all cell types (Fig. 4 f), further confirming the lack of cell-type specific DNA signatures.
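The window construction described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function name `influential_window` is hypothetical, the per-position influence scores are assumed to be already available from in silico mutagenesis, and windows that would run past either end of the sequence are shifted inward, which is a choice the text does not specify.

```python
def influential_window(seq, influence, width=35):
    """Return (start, subsequence) for a window centered on the top-scoring position.

    seq       -- DNA sequence of a peak (e.g. the 101 bp network input)
    influence -- one non-negative score per position, higher = more influential
    width     -- window size in base pairs (35 in the text)
    """
    if len(seq) != len(influence):
        raise ValueError("sequence and influence scores must have equal length")
    if width > len(seq):
        raise ValueError("window is wider than the sequence")
    # Position with the largest influence (first one wins on ties).
    center = max(range(len(influence)), key=lambda i: influence[i])
    start = center - width // 2
    # Assumption: shift the window inward rather than truncate it at the ends.
    start = max(0, min(start, len(seq) - width))
    return start, seq[start:start + width]
```

The returned subsequences could then be written to a FASTA file and scanned with FIMO.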

We extended the motif analysis to all 199 TF-AB pairs. Space limitations prevent a comprehensive presentation of the results, but Fig. 5 contains a heatmap of motif presence frequency for a broad selection of TFs from major families [ 67 , 68 ] and for the motifs with highest average scores. Many insights and hypotheses can be obtained from detailed examination of the matrix, but several major observations can be made immediately. A common trend of many TFs is that the canonical motif and similar motifs are enriched to different degrees in different cell types. For instance, cluster A shows that MAX and USF family factors’ binding sites are, unsurprisingly, enriched for MAX and USF motifs. The same holds for clusters B, C, D1, D2 and E, which can be seen closer up in Supplementary Figure 1. However, there are interesting exceptions to these trends. For instance, in the A cluster, totally different experiments with two separate antibodies agree that MNT peaks in MCF-7 are lacking many motifs that are seen more abundantly in MNT peaks in HepG2 and K562 cell types, as well as in the peaks of many other TFs in the same group.

Figure 5. Enrichment fractions of a broad range of JASPAR motifs (rows) in peaks of TFs in 11 major families across numerous tissue types (columns). Close up views of regions of the heatmap A, B, C, D1, D2 and E can be found in Supplementary Figure 1

Another interesting phenomenon is the presence of alternative motifs, including primary motif variants, specifically in some cell types. For instance, in cluster D2, the peaks for JUND in the five cell types at the right of Supplementary Figure 1D2 (HepG2, K562, MCF-7, SK-N-SH, liver) show some significant enrichment for MAFK motif variants (at top of plot), whereas peaks in GM12878 or HCT116, which generally show higher enrichment for JUN and FOS motifs, have little or no enrichment for the MAF variants at the top of the plot. Similar results can be seen in cluster A, for example where USF2 peaks in SK-N-SH lack MYCN motifs, although MYCN appears commonly in USF2 peaks in other cell types. The USF2 peaks in SK-N-SH are also unusually low on the MAX motif variant MAX.MA0058.2 (consensus CACATG) but high on MAX.MA0051 (consensus CACGTG), quite possibly because CACGTG is also a consensus motif for USF2 itself, but suggesting that USF2 might also bind the CACATG variant significantly (which is not among any of the USF2 consensus sequences in JASPAR).

Returning to Fig. 5 g, one can see many other interesting exceptions to family-wise binding. For instance, near the top middle we see that in three of five cell types, peaks for ATF3 are substantially enriched in MAX/USF family motifs, whereas the other two cell types are virtually without these motifs. Numerous other such examples are found throughout the matrix, and are suggestive of many potential hypotheses regarding co-expression of TFs or lack thereof, co-binding, competitive binding, etc.

The complex structural and biochemical nature of protein-DNA interactions has made it difficult to fully understand how various factors influence transcriptional regulation and differential binding. We conducted a wide-scale investigation of 169 TFs across various cell types to identify and quantify differential binding preferences of TFs. We found that different TFs display varying degrees of cell-type specificity in their binding preferences, with approximately two-thirds of those we tested having statistically significant DNA signatures of differential binding. We observed that TFs that play a pivotal role in cancer either as oncogenes or suppressors, such as MYC, BACH1, ATF7 and SOX6, show a relatively higher cell-type specificity than other TFs, such as CTCF and HCFC1, that are involved in chromatin regulation or other cellular processes. Our work constitutes a broad survey of the possibility and prevalence of such DNA signatures. However, the signatures found by SigTFB could reflect many different factors that influence the preferences of a TF, such as its intrinsic binding preference, chromatin accessibility or co-operative or competitive binding of other factors. Further experimental validation is needed if we are to determine mechanisms underlying these signatures. For instance, for a number of TFs we observed the increased presence of binding motifs for other TFs. This could be tested experimentally by first verifying that those other TFs are expressed in the cell types of interest, and then performing ChIP-seq experiments on those TFs to confirm binding at the same sites. If co-operativity/complexing is suspected, reciprocal IP experiments could be performed to identify physical interactions between the different TFs, or knockdown of one TF could be performed followed by ChIP-PCR or ChIP-seq to determine if binding of the other TF is affected.
As another example, if the direct DNA binding preference of a TF is suspected to have changed in different cell types, in vivo affinity assays using enhancer constructs could be performed to verify this change. Therefore, much more experimental work, and potentially computational work, is needed to test our findings.

Other deep learning approaches, such as MTTFSite [ 69 ] and Phuycharoen et al. [ 70 ], have also explored differential binding of TFs across cell types. While MTTFSite and Phuycharoen et al. adopt a similar learning framework to SigTFB in stage 1 training, in terms of using a multi-task model, their problem formulation and objective fundamentally differ. In MTTFSite, for example, prior to training, shared non-unique cell-type instances are defined as bound regions across cell types that overlap by at least 100bp, while the remaining bound instances that do not overlap are cell-type specific. In SigTFB, however, the model is given all instances as input and learns to differentiate non-specific versus cell-type specific instances. The negative instances for a specific cell type in SigTFB are bound regions in other cell types, while in MTTFSite and Phuycharoen et al. negative instances are unbound regions in all cell types. SigTFB essentially learns to differentiate between shared and unique motifs in cell types from only bound regions. Additionally, the scale of the study differs. MTTFSite and Phuycharoen et al. investigate TFs in a total of five and three cell types respectively, while SigTFB explores the hundreds of TFs in ENCODE with at least two cell types available, or a total of 35 distinct cell types across all TFs.

Similar to Novakovsky et al. [ 71 ] and ChromDragoNN [ 46 ], SigTFB displays the effectiveness of transfer learning in a multi-task deep learning framework for the prediction of binding profiles genome wide. Unlike these approaches, however, which mainly focus on cross cell-type prediction, where models are trained on some cell types and tested on other cell types with limited data, we use transfer learning to acquire exclusive features per cell type. The multi-task setting in the first stage of learning allows the model to learn generalizable shared and unique features across cell types. In stage 2, the model is constrained to learn cell-type specific features, allowing the learning of a set of motifs that are associated to cell-type specificity. In addition to the type of learning used, the data representation, the criteria chosen for model evaluation, and the hyperparameters selected are important factors we account for during the learning phase to achieve a more accurate prediction of binding profiles at cell-type resolution.

Most deep learning approaches, such as DeepBind [ 25 ], MTTFSite [ 69 ] and DanQ [ 27 ], were not used to investigate the differences in ABs for the same TF when analyzing ChIP-seq experiments. We hypothesized that ABs could greatly influence the quality of ChIP-seq experiments. The polyclonal nature of ABs in ENCODE, for example, may result in ABs targeting the same protein to have different specificities, affinities and off-target binding. As a result, due to the lack of study on the consistency of functionality and performance of ABs across TFs ENCODE wide, we separated experiments from different ABs for a particular TF, and investigated the consistency in binding preference across different ABs for the same TF. Overall, we found consistency across ABs for most TFs, although a few results were inconsistent. Lack of consistency may be due to factors such as the quality of the ChIP-seq datasets, the controls selected for peak calling, or the cell types available.

Although we have uncovered substantial evidence for DNA signatures associated with cell-type specific binding, we acknowledge some limitations to our study. First, the fact that a TF does not show cell-type specificity in the cell types available from ENCODE does not imply that it will not show cell-type specificity in other cell types. The human genome contains almost 1400 TFs [ 72 ], and despite the enormous effort of the ENCODE consortium, we found only 169 distinct TFs assayed in more than one cell type and meeting our other data set criteria. (This number has increased somewhat since we began our study, but remains far smaller than 1400.) It is thus impossible to detect cell-type specific binding for the vast majority of TFs, and it is uncertain whether other TFs may show specificity in other cell types. This underlines the importance of continued empirical study of TF binding in a wide range of cell types. A second limitation is that, despite best efforts, deep learning can at times fail to solve a prediction problem, even when a solution is possible in principle. There may be TFs for which we failed to detect a cell-type specific signal, even when one is present. On the other hand, our careful checks against overfitting suggest that when a cell-type specific signal is detected, it is likely genuine, especially when it is backed up by additional motif enrichment analyses. Thus, our results are best viewed as providing evidence for cell-type specific DNA signatures in many TFs, while providing evidence against the same, without ruling it out, for other TFs. Thirdly, assumptions made regarding the network architecture, such as the 101 bp input sequence or fixed filter widths, may limit the learning capabilities of SigTFB. For instance, its inability to detect widely spaced motifs or motif pairs with fixed spacing suggests that some DNA signatures relevant for cell-type specificity may be missed.
Furthermore, in this work, we used ChIP-seq data due to its high availability and accessibility for multiple TFs and cell types. While ENCODE has many standards in place to ensure high data quality, other experimental approaches, such as ChIP-exo [ 73 ] or CUT&TAG [ 74 ], may provide less noisy, higher resolution and/or more precise estimates of TF-DNA binding, and thus may ultimately improve the search for DNA signatures. Finally, it is important to note that our confirmatory motif enrichment analysis is limited by the current state of knowledge. Like ENCODE, JASPAR includes data on a relatively small fraction of all human TFs. Not all motifs are in JASPAR, and not having matches may result from a key TF not being in the database. While one could repeat the analysis with motif collections from other databases, such as HOCOMOCO [ 16 ] or TRANSFAC [ 75 ], a fundamental limitation remains that the majority of known or predicted TFs have not been assayed even once in any cell type. Motif analysis could be extended in other directions, however. For instance, although we opted for a deep learning approach to reduce a priori bias in looking for certain types of motifs, and to avoid a large number of individual or differential motif analyses, one could nevertheless carry out de novo motif finding on all of the datasets [ 76 ], and carry out differential motif finding [ 77 ], particularly between cell types or groups thereof that our SigTFB analysis suggests are substantially different.

We have proposed a supervised learning problem formulation that allows one to quantify the degree of cell-type specificity in the DNA sequences of a TF’s binding sites. We solved that supervised learning problem with a deep learning approach, SigTFB. However, any number of other approaches could be explored, such as position weight matrices, logistic regression, decision trees or forests, support vector machines, or other deep learning formulations. Furthermore, whereas SigTFB’s approach makes relatively few a priori assumptions about what may constitute a discriminative DNA signature, one could test alternative representations of peak sequences, for instance using known motifs or k-mers. All the supervised learning data we used is available at https://doi.org/10.20383/103.0605 , so that anyone can try alternate approaches. In some scenarios, it may also make sense to alter the supervised learning formulation. For instance, we currently treat each different cell type as a monolithic, distinct entity. But in various senses, some cell types are more naturally similar to others. Perhaps some TFs behave one way in certain cancer types and a different way in healthy cells. Or perhaps a TF behaves one way in brain cells and a different way in the skin. In general, cell types might be represented by some metadata features, and we could learn if any of those metadata features associate with differential DNA signatures.

Many TFs are known to bind to different genomic sites in different cell types. Here, we demonstrated that for many of these TFs, different binding sites are associated with different DNA signatures. We developed a deep learning prediction framework, SigTFB, that is capable of detecting such DNA signatures, and used explanation techniques (tSNE representation embedding, in silico mutagenesis and motif enrichment analysis) to elucidate signatures from the trained networks. Our results have implications for ongoing efforts to predict TF binding in un-assayed cell types: the existence of cell-type specific signatures of binding implies some limitations to the success of such approaches that may not have been previously appreciated. Our findings also have implications for the representation of DNA binding preferences of TFs, suggesting that monolithic, cell-type independent representations, such as PWMs, may not be a satisfactory approximation in the long run for some TFs. Finally, our results set the stage for deeper investigation into mechanisms of differential TF binding, suggesting certain TFs where investigation is more relevant and more likely to succeed.

SigTFB’s two-stage training process

The training procedure we describe below proceeds in two stages, which use slightly different supervised learning formulations, which we call SL1 and SL2. To make clear the difference, we first introduce some notation. Let there be U unified peaks, C total cell types, and let \(A_{ij}\) be a binary indicator of whether unified peak \(i\in \{1,\ldots ,U\}\) includes a peak originally found in cell type \(j\in \{1,\ldots ,C\}\) . Let \(D_i\) be the length-404 one-hot encoded DNA sequence of unified peak i .

In SL1, each unified peak contributes one instance to the dataset. The input vector is \(D_i\) , the one-hot encoded DNA sequence of the 101bp window centered on the peak, and the output vector is \(A_{i\cdot }\) , the vector telling which cell types contributed to this peak. This multi-task formulation is used to pre-train part of the network, but is not ultimately the formulation that we want solved. We train using the negative log likelihood loss function, starting from random initial weights.
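To make the SL1 data layout concrete, here is a minimal sketch of building one instance. The names `one_hot` and `sl1_instance` are hypothetical. The A, C, G, T channel order, the position-major flattening, and the all-zero encoding of ambiguous bases such as N are assumptions, since the text only states that \(D_i\) has length 404.

```python
BASES = "ACGT"  # assumed channel order


def one_hot(seq):
    """Flatten a DNA sequence into a 4 * len(seq) binary vector.

    Each position contributes four entries (A, C, G, T). Any other
    symbol, such as N, is encoded as four zeros (an assumption).
    """
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def sl1_instance(seq, membership_row):
    """Return (input, output) for one unified peak in the SL1 formulation.

    seq            -- 101 bp window centered on the unified peak
    membership_row -- binary list A_i. with one entry per cell type
    """
    if len(seq) != 101:
        raise ValueError("expected a 101 bp window")
    return one_hot(seq), list(membership_row)
```

With a 101 bp window the input has 4 × 101 = 404 entries, which matches the stated length of \(D_i\).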

In SL2, as described above in the “A supervised learning formulation for detecting cell-type specific DNA-binding signatures” section, each unified peak contributes 2 C instances to the dataset, or equivalently, there are two instances for each unified peak and each cell type. For unified peak i and cell type j , one of the instances has as input vector \(D_i\) concatenated with a length-C binary vector having a 1 in position j . This instance has a single binary output value (or label) which is equal to \(A_{ij}\) . Intuitively, this instance can be interpreted as, “The DNA sequence \(D_i\) in cell type j was ( \(A_{ij}=1\) ) or wasn’t ( \(A_{ij}=0\) ) bound by the TF”. This is called a cell-type specific instance, because the target cell type we are querying about is given. The second instance associated with each unified peak and each cell type is just like the first one, except that the length-C binary vector part of the input is set to all zeros. Such an instance, which we call cell-type general, can be interpreted as, “The DNA sequence \(D_i\) was ( \(A_{ij}=1\) ) or wasn’t ( \(A_{ij}=0\) ) bound, in a cell type whose identity is being kept hidden”. We train on this data using the negative log likelihood loss function. The initial weights for the “upper” part of the network are taken from the SL1 training, but they are not frozen and so may change during stage 2 training. Other weights are initialized randomly.
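The SL2 construction can be sketched in a few lines. The function name `sl2_instances` and the tuple layout are hypothetical; the sketch follows only the description above, with \(D_i\) given as a flat list and \(A_{i\cdot }\) as a binary list of length C.

```python
def sl2_instances(d_i, a_row):
    """Build the 2 * C SL2 instances for one unified peak.

    d_i   -- one-hot encoded DNA vector of the unified peak (flat list)
    a_row -- binary list A_i. with one entry per cell type (length C)

    Returns a list of (input_vector, label, kind) tuples, where kind is
    "specific" (cell type indicated) or "general" (cell type hidden).
    """
    d_i = list(d_i)
    n_cell_types = len(a_row)
    instances = []
    for j, label in enumerate(a_row):
        # Cell-type specific: DNA plus an indicator vector with a 1 at position j.
        indicator = [0] * n_cell_types
        indicator[j] = 1
        instances.append((d_i + indicator, label, "specific"))
        # Cell-type general: same DNA and label, indicator set to all zeros.
        instances.append((d_i + [0] * n_cell_types, label, "general"))
    return instances
```

Note that the general instances for one peak share an identical input but may carry different labels, which is why the network cannot predict them well unless binding is similar across cell types.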

To allow us to optimize hyperparameters and avoid over-fitting the data, we use a nested cross-validation scheme which divides the data into training, validation, and test sets. The outer loop is a standard 10-fold cross validation, which generates a 90% train/10% test split for each fold. Within each training set, we further divide as 80% train/20% validate, where the validation set is used for hyperparameter optimization. Along with network layer size parameters described above, we also optimize learning rate, weight decay, initial weight scales for convolutional and dense layers, and number of training epochs. We train using PyTorch 1.5.0 (GPU) with the Adam optimizer.
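The splitting scheme can be illustrated with a short sketch. The function name `nested_cv_splits` is hypothetical, and the shuffling, the seed, and the choice to split by item index (for example by unified peak, so that all instances from one peak stay in the same set) are assumptions rather than details given in the text.

```python
import random


def nested_cv_splits(n_items, n_folds=10, val_fraction=0.2, seed=0):
    """Yield (train, validation, test) index lists, one triple per outer fold.

    The outer loop holds out one fold (about 10% of items for 10 folds) as
    the test set. The remaining items are split again into training and
    validation sets, with val_fraction of them used for validation.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [i for f, fold in enumerate(folds) if f != k for i in fold]
        rng.shuffle(rest)
        n_val = int(round(val_fraction * len(rest)))
        yield rest[n_val:], rest[:n_val], test
```

Hyperparameters would be chosen on the validation set of each fold, and the reported AUC computed only on the corresponding test set.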

Motif analysis

To analyze the peak DNA sequences per cell type per TF-AB model, we use FIMO 5.0.3 to search for motifs in the subsequences using known JASPAR human motifs [ 15 ] that are based on at least 1000 sites and have log p-values of at least 100. This gives us a total of 400 JASPAR motifs. For each cell type per TF-AB, and each motif, we find the ratio of the number of significant motif hits identified by FIMO to the number of total peaks for that cell type. By using this approach, we account for enrichment as well as the number of peaks per cell type per TF-AB. To construct the large enrichment heatmap in Fig. 5 , we find the top 20 motifs with the highest enrichment ratio per cell type, and take the union of these motifs across the cell types.
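The ratio and the top-20 union can be sketched as follows, assuming the FIMO output has already been summarized into counts. The function names and the nested-dictionary layout are hypothetical, and ties in the ranking are broken alphabetically, which is an arbitrary choice.

```python
def enrichment_ratios(hit_counts, peak_totals):
    """Enrichment ratio per cell type and motif.

    hit_counts  -- {cell_type: {motif: number of significant hits}}
    peak_totals -- {cell_type: total number of peaks}
    """
    return {
        cell_type: {m: n / peak_totals[cell_type] for m, n in motifs.items()}
        for cell_type, motifs in hit_counts.items()
    }


def top_motif_union(ratios, top_n=20):
    """Union over cell types of the top_n motifs with the highest ratio."""
    selected = set()
    for motifs in ratios.values():
        # Sort by descending ratio; break ties by motif name for determinism.
        ranked = sorted(motifs, key=lambda m: (-motifs[m], m))
        selected.update(ranked[:top_n])
    return sorted(selected)
```

The rows of a heatmap such as Fig. 5 would then be the motifs returned by `top_motif_union`, with one column per cell type and TF-AB.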

Availability of data and materials

The processed data used for deep learning is available at https://doi.org/10.20383/103.0605 . The source code for SigTFB including the pre-processing of ChIP-seq data, classification and downstream genomic analysis, is available at https://github.com/aawdeh/SigTFB .

Bintu L, et al. Transcriptional regulation by the numbers: models. Curr Opin Genet Dev. 2005;15:116–24.


Desvergne B, Michalik L, Wahli W. Transcriptional regulation of metabolism. Physiol Rev. 2006;86:465–514.


Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell. 2013;152:1237–51.

Matys V, et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34:D108–10.

Bryne JC, et al. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 2007;36:D102–6.


Soleimani VD, et al. Cis-regulatory determinants of MyoD function. Nucleic Acids Res. 2018;46:7221–35.

Consortium EP, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57.

Lee B-K, et al. Cell-type specific and combinatorial usage of diverse transcription factors revealed by genome-wide binding studies in multiple human cells. Genome Res. 2012;22:9–24.

Benedetti M, Levi A, Chao MV. Differential expression of nerve growth factor receptors leads to altered binding affinity and neurotrophin responsiveness. Proc Natl Acad Sci. 1993;90:7859–63.

Srivastava D, Mahony S. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochim Biophys Acta (BBA) Gene Regul Mech. 2020;1863:194443.

Brand M, et al. Dynamic changes in transcription factor complexes during erythroid differentiation revealed by quantitative proteomics. Nat Struct Mol Biol. 2004;11:73–80.

Pilpel Y, Sudarsanam P, Church GM. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet. 2001;29:153–9.


Nie Y, Shu C, Sun X. Cooperative binding of transcription factors in the human genome. Genomics. 2020;112:3427–34.

Lowen M, Scott G, Zwollo P. Functional analyses of two alternative isoforms of the transcription factor Pax-5. J Biol Chem. 2001;276:42565–74.

Castro-Mondragon JA, et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2022;50:D165–73.

Kulakovskiy IV, et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-seq analysis. Nucleic Acids Res. 2018;46:D252–9.

Ogawa N, Biggin MD. High-throughput SELEX determination of DNA sequences bound by transcription factors in vitro. Gene Regul Netw Methods Protoc. 2012;786:51–63.

Gertz J, et al. Distinct properties of cell-type-specific and shared transcription factor binding sites. Mol Cell. 2013;52:25–36.

Zhang S, et al. OCT4 and PAX6 determine the dual function of SOX2 in human ESCs as a key pluripotent or neural factor. Stem Cell Res Ther. 2019;10:1–14.


Wang J, et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012;22:1798–812.

Arvey A, Agius P, Noble WS, Leslie C. Sequence and chromatin determinants of cell type-specific transcription factor binding. Genome Res. 2012;22:1723–34.

Keilwagen J, Posch S, Grau J. Accurate prediction of cell type-specific transcription factor binding. Genome Biol. 2019;20:1–17.

McLeay RC, Bailey TL. Motif enrichment analysis: a unified framework and an evaluation on ChIP data. BMC Bioinformatics. 2010;11:1–11.

Lesluyes T, Johnson J, Machanick P, Bailey TL. Differential motif enrichment analysis of paired ChIP-seq experiments. BMC Genomics. 2014;15:1–13.

Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33:831–8.

Hassanzadeh H, Wang MD. DeeperBind: Enhancing prediction of sequence specificities of DNA binding proteins. Los Alamitos: IEEE Computer Society; 2016. p. 178–83.

Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44:e107–e107.

Chen C, et al. DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. BMC Bioinformatics. 2021;22:1–18.

Quang D, Xie X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods. 2019;166:40–7.

Li H, Guan Y. Fast decoding cell type-specific transcription factor binding landscape at single-nucleotide resolution. Genome Res. 2021;31:721–31.

Zhang Y, Wang Z, Zeng Y, Zhou J, Zou Q. High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method. Brief Bioinform. 2021;22:bbab273.

Zhang Q, et al. Base-resolution prediction of transcription factor binding signals by a deep learning framework. PLoS Comput Biol. 2022;18:e1009941.

Cao L, Liu P, Chen J, Deng L. Prediction of transcription factor binding sites using a combined deep learning approach. Front Oncol. 2022;12:893520.

Ng JW, Ong EH, Tucker-Kellogg L, Tucker-Kellogg G. Deep learning for de-convolution of Smad2 versus Smad3 binding sites. BMC Genomics. 2022;23:525.

Ding P, et al. DeepSTF: predicting transcription factor binding sites by interpretable deep neural networks combining sequence and shape. Brief Bioinform. 2023;24:bbad231.

Zhang J, Liu B, Wu J, Wang Z, Li J. DeepCAC: a deep learning approach on DNA transcription factors classification based on multi-head self-attention and concatenate convolutional neural network. BMC Bioinformatics. 2023;24:345.

Wang K, et al. BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform. 2024;25:bbae195.

Zhuang J, et al. MulTFBS: A spatial-temporal network with multichannels for predicting transcription factor binding sites. J Chem Inf Model. 2024;64(10):1549–9596.

Andrews G. Deep learning as a tool to better understand transcription factor binding across cell types and species. Ph.D. thesis, UMass Chan Medical School; 2024.

Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019;20:389–403.

Zhang S, et al. Assessing deep learning methods in cis-regulatory motif finding based on genomic sequencing data. Brief Bioinform. 2022;23:bbab374.

Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet. 2023;24:125–37.

Singh G, et al. A exible repertoire of transcription factor binding sites and a diversity threshold determines enhancer activity in embryonic stem cells.Genome Res. 2021;31:564–575.

Zheng A, et al. Deep neural networks identify sequence context features predictive of transcription factor binding. Nat Mach Intel. 2021;3:172–80.

Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26:990–9.

Nair S, Kim DS, Perricone J, Kundaje A. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics. 2019;35:i108–16.

Balandat M, et al. BoTorch: programmable bayesian optimization in PyTorch. 2019. arxiv e-prints arXiv–1910 .

Maekawa T, et al. Social isolation stress induces ATF-7 phosphorylation and impairs silencing of the 5-HT 5B receptor gene. EMBO J. 2010;29:196–208.

Chen M, et al. Emerging roles of activating transcription factor (ATF) family members in tumourigenesis and immunity: Implications in cancer immunotherapy. Genes Dis. 2021;9(4):981–99.

Gozdecka M, Breitwieser W. The roles of ATF2 (activating transcription factor 2) in tumorigenesis. Biochem Soc Trans. 2012;40:230–4.

Meijer BJ, et al. ATF2 and ATF7 are critical mediators of intestinal epithelial repair. Cell Mol Gastroenterol Hepatol. 2020;10:23–42.

Kim S, Yu N-K, Kaang B-K. CTCF as a multifunctional protein in genome regulation and gene expression. Exp Mol Med. 2015;47:e166–e166.

Chen H, Tian Y, Shu W, Bo X, Wang S. Comprehensive identification and annotation of cell type-specific and ubiquitous CTCF-binding sites in the human genome. PLoS ONE. 2012;7:e41374.

Holwerda SJB, de Laat W. CTCF: the protein, the binding partners, the binding sites and their chromatin loops. Philos Trans R Soc B Biol Sci. 2013;368:20120369.

Li YE, et al. An atlas of gene regulatory elements in adult mouse cerebrum. Nature. 2021;598:129–36.

BRAIN Initiative Cell Census Network (BICCN). A multimodal cell census and atlas of the mammalian primary motor cortex. Nature. 2021;598:86–102.

Zu S, et al. Single-cell analysis of chromatin accessibility in the adult mouse brain. Nature. 2023;624:378–89.

Sams DS, et al. Neuronal CTCF is necessary for basal and experience-dependent gene regulation, memory formation, and genomic structure of BDNF and Arc. Cell Rep. 2016;17:2418–30.

Dang CV. MYC on the path to cancer. Cell. 2012;149:22–35.

Davudian S, Mansoori B, Shajari N, Mohammadi A, Baradaran B. BACH1, the master regulator gene: a novel candidate target for cancer therapy. Gene. 2016;588:30–7.

Guo X, Yang M, Gu H, Zhao J, Zou L. Decreased expression of SOX6 confers a poor prognosis in hepatocellular carcinoma. Cancer Epidemiol. 2013;37:732–6.

Wysocka J, Reilly PT, Herr W. Loss of HCF-1-chromatin association precedes temperature-induced growth arrest of tsBN67 cells. Mol Cell Biol. 2001;21:3820–9.

Maslova A, et al. Deep learning of immune cell differentiation. Proc Natl Acad Sci. 2020;117(25655):25666.

Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:1–9.

De Graeve F, et al. Role of the ATFa/JNK2 complex in jun activation. Oncogene. 1999;18:3491–500.

Article   PubMed   Google Scholar  

Fornes O, et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020;48:D87–92.

PubMed   CAS   Google Scholar  

Ambrosini G, et al. Insights gained from a comprehensive all-against-all transcription factor binding motif benchmarking study. Genome Biol. 2020;21:1–18.

Castro-Mondragon JA, Jaeger S, Thieffry D, Thomas-Chollier M, Van Helden J. RSAT matrix-clustering: dynamic exploration and redundancy reduction of transcription factor binding motif collections. Nucleic Acids Res. 2017;45:e119–e119.

Zhou J, et al. MTTFsite: cross-cell type TF binding site prediction by using multi-task learning. Bioinformatics. 2019;35:5067–77.

Phuycharoen M, et al. Uncovering tissue-specific binding features from differential deep learning. Nucleic Acids Res. 2020;48:e27–e27.

Novakovsky G, Saraswat M, Fornes O, Mostafavi S, Wasserman WW. Biologically relevant transfer learning improves transcription factor binding prediction. Genome Biol. 2021;22:1–25.

Pechenick DA, Payne JL, Moore JH. Phenotypic robustness and the assortativity signature of human transcription factor networks. PLoS Comput Biol. 2014;10:e1003780.

Rhee HS, Pugh BF. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011;147:1408–19.

Kaya-Okur HS, et al. Cut &tag for efficient epigenomic profiling of small samples and single cells. Nat Commun. 2019;10:1930.

Wingender E, et al. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 2000;28(316):319.

Kheradpour P, Kellis M. Systematic discovery and characterization of regulatory motifs in encode TF binding experiments. Nucleic Acids Res. 2014;42:2976–87.

Bailey TL, Johnson J, Grant CE, Noble WS. The MEME suite. Nucleic Acids Res. 2015;43:W39–49.

Download references

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), funding reference number RGPIN-2019-06604 to TJP, and a Compute Canada (www.computecanada.ca) Resources-for-Research-Groups grant to TJP. This research was enabled by support provided by a Queen Elizabeth II Graduate Scholarship in Science and Technology (QEII-GSST) to AA. None of the agencies that funded this work had any role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.

Author information

Authors and affiliations.

School of Electrical Engineering and Computer Science, University of Ottawa, 800 King Edward Ave., Ottawa, K1N 6N5, Ontario, Canada

Aseel Awdeh, Marcel Turcotte & Theodore J. Perkins

Regenerative Medicine Program, Ottawa Hospital Research Institute, 501 Smyth Rd., Ottawa, K1H 8L6, Ontario, Canada

Aseel Awdeh & Theodore J. Perkins

Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa, 451 Smyth Rd., Ottawa, K1H 8M5, Ontario, Canada

Theodore J. Perkins

Contributions

All authors contributed to the research design and writing of the manuscript. The work was primarily carried out by AA.

Corresponding author

Correspondence to Theodore J. Perkins.

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

All authors have read and consented to publishing this manuscript.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

12864_2024_10859_moesm1_esm.zip.

Additional file 1: Supplementary Table 1. Lists the ENCODE datasets on which our study was based, along with the AUC differences achieved on those datasets. Supplementary Table 2. Lists all the motif enrichment scores for all peak datasets. Supplementary Figure S1. Provides zoom-ins of certain clusters of enrichment within Figure 5.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Awdeh, A., Turcotte, M. & Perkins, T.J. Identifying transcription factors with cell-type specific DNA binding signatures. BMC Genomics 25, 957 (2024). https://doi.org/10.1186/s12864-024-10859-1

Received: 10 July 2024

Accepted: 02 October 2024

Published: 14 October 2024

DOI: https://doi.org/10.1186/s12864-024-10859-1

  • Transcription factor binding
  • Differential binding
  • Cell-type specificity
  • Deep learning

BMC Genomics

ISSN: 1471-2164

It takes three to tango: transcription factors bind DNA, protein, and RNA

Greta Friar | Whitehead Institute

July 7, 2023

Transcription factors could be the Swiss Army knives of gene regulation; they are versatile proteins containing multiple specialized regions. On one end they have a region that can bind to DNA. On the other end they have a region that can bind to proteins. Transcription factors help to regulate gene expression—turning genes on or off and dialing up or down their level of activity—often in partnership with the proteins that they bind. They anchor themselves and their partner proteins to DNA at binding sites in genetic regulatory sequences, bringing together the components that are needed to make gene expression happen.

Transcription factors are a well-known family of proteins, but new research from Whitehead Institute Member Richard Young and colleagues shows that the picture we have had of them is incomplete. In a paper published in Molecular Cell on July 3, Young and postdocs Ozgur Oksuz and Jonathan Henninger reveal that along with DNA and protein, many transcription factors can also bind RNA. The researchers found that RNA binding keeps transcription factors near their DNA binding sites for longer, helping to fine-tune gene expression. This rethinking of how transcription factors work may lead to a better understanding of gene regulation, and may provide new targets for RNA-based therapeutics.

“It’s as if, after carrying around a Swiss Army knife all your life for its blade and scissors, you suddenly realize that the odd, small piece in the back of the knife is a screwdriver,” Young says. “It’s been staring you in the face this whole time, and now that you finally see it, it becomes clear how many more uses there are for the knife than you had realized.”

How transcription factors’ RNA binding went overlooked

A few papers, including one from Young’s lab, had previously identified individual transcription factors as being able to bind RNA, but researchers thought that this was a quirk of the specific transcription factors. Instead, Young, Oksuz, Henninger and collaborators have shown that RNA-binding is in fact a common feature present in at least half of transcription factors.

“We show that RNA binding by transcription factors is a general phenomenon,” Oksuz says. “Individual examples in the past were thought to be exceptions to the rule. Other studies dismissed signs of RNA binding in transcription factors as an artifact—an accident of the experiment rather than a real finding. The clues have been there all along, but I think earlier work was so focused on the DNA and protein interactions that they didn’t consider RNA.”

Researchers had not recognized transcription factors’ RNA binding region as such because it is not a typical RNA binding domain. Typical RNA binding domains form stable structures that researchers can detect or predict with current technologies. Transcription factors do not contain such structures, and so standard searches for RNA binding domains had not identified them in transcription factors.

Young, Oksuz and Henninger got their biggest clue that researchers might be overlooking something from the human immunodeficiency virus (HIV), which produces a transcription factor-like protein called Tat. Tat increases the transcription of HIV’s RNA genome by binding to the virus’ RNA and then recruiting cellular machinery to it. However, Tat does not contain a structured RNA binding site; instead, it binds RNA from a region called an arginine-rich motif (ARM) that is unstructured but has a high affinity for RNA. When the ARM binds to HIV RNA, the two molecules form a more stable structure together.

The researchers wondered if Tat might be more similar to human transcription factors than anyone had realized. They went through the list of transcription factors, and instead of looking for structured RNA binding domains, they looked for ARMs. They found them in abundance; the majority of human transcription factors contain an ARM-like region between their DNA and protein binding regions, and these sequences were conserved across animal species. Further testing confirmed that many transcription factors do in fact use their ARMs to bind RNA.
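The idea of scanning protein sequences for ARM-like regions, rather than for structured RNA binding domains, can be illustrated with a minimal Python sketch. It is not the method used in the study: the function name `find_arm_like_windows`, the window length, and the residue thresholds are all hypothetical choices made for illustration. The sketch simply flags stretches that are dense in the basic residues arginine (R) and lysine (K).

```python
def find_arm_like_windows(protein, window=12, min_basic=6, min_arg=3):
    """Return merged (start, end) spans that are rich in basic residues.

    The window size and thresholds are illustrative assumptions,
    not values taken from the study described above.
    """
    seq = protein.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        n_arg = chunk.count("R")
        n_basic = n_arg + chunk.count("K")
        # Keep windows with enough basic residues, several of them arginine.
        if n_basic >= min_basic and n_arg >= min_arg:
            hits.append((start, start + window))

    # Merge overlapping windows into contiguous candidate regions.
    merged = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy sequence: an arginine-rich stretch embedded in alanine filler.
toy_protein = "A" * 20 + "GRKKRRQRRRPP" + "A" * 20
candidate_regions = find_arm_like_windows(toy_protein)
```

A real analysis would also need to account for sequence conservation and for where the region sits relative to the DNA and protein binding regions, which this sketch ignores.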

RNA binding fine-tunes gene expression

Next, the researchers tested whether RNA binding affected the transcription factors’ function. When transcription factors had their ARMs mutated so they couldn’t bind RNA, those transcription factors were less effective at finding their target sites, remaining at those sites and regulating genes. The mutations did not prevent transcription factors from functioning altogether, suggesting that RNA binding contributes to fine-tuning of gene regulation.

Further experiments confirmed the importance of RNA binding to transcription factor function. The researchers mutated the ARM of a transcription factor important to embryonic development, and found that this led to developmental defects in zebrafish. Additionally, they looked through a list of genetic mutations known to contribute to cancer and heritable diseases, and found that a number of these occur in the RNA binding regions of transcription factors. All of these findings point to RNA binding playing an important role in transcription factors’ regulation of gene expression.

They may also provide therapeutic opportunities. The transcription factors studied by the researchers were found to bind RNA molecules that are produced in the regulatory regions of the genome where the transcription factors bind DNA. This set of transcription factors includes factors that can increase or decrease gene expression. “With evidence that RNAs can tune gene expression through their interaction with positive and negative transcription factors,” says Henninger, “we can envision using existing RNA-based technologies to target RNA molecules, potentially increasing or decreasing expression of specific genes in disease settings.”

Ozgur Oksuz, Jonathan E. Henninger, Robert Warneford-Thomson, Ming M. Zheng, Hailey Erb, Adrienne Vancura, Kalon J. Overholt, Susana Wilson Hawken, Salman F. Banani, Richard Lauman, Lauren N. Reich, Anne L. Robertson, Nancy M. Hannett, Tong I. Lee, Leonard I. Zon, Roberto Bonasio, Richard A. Young. “Transcription factors interact with RNA to regulate genes.” Molecular Cell, July 3, 2023. https://doi.org/10.1016/j.molcel.2023.06.012

  • Open access
  • Published: 17 July 2024

Position-dependent function of human sequence-specific transcription factors

  • Sascha H. Duttke (ORCID: orcid.org/0000-0003-4717-000X) 1,
  • Carlos Guzman 2,
  • Max Chang (ORCID: orcid.org/0000-0002-4526-489X) 2,
  • Nathaniel P. Delos Santos 2,
  • Bayley R. McDonald (ORCID: orcid.org/0009-0003-6148-5689) 1,
  • Jialei Xie 3,
  • Aaron F. Carlin (ORCID: orcid.org/0000-0002-1669-8066) 3,
  • Sven Heinz (ORCID: orcid.org/0000-0002-4665-1007) 2 &
  • Christopher Benner (ORCID: orcid.org/0000-0002-4618-0719) 2

Nature volume 631, pages 891–898 (2024)

  • Gene regulation
  • Gene regulatory networks
  • Transcriptional regulatory elements

Patterns of transcriptional activity are encoded in our genome through regulatory elements such as promoters or enhancers that, paradoxically, contain similar assortments of sequence-specific transcription factor (TF) binding sites 1,2,3. Knowledge of how these sequence motifs encode multiple, often overlapping, gene expression programs is central to understanding gene regulation and how mutations in non-coding DNA manifest in disease 4,5. Here, by studying gene regulation from the perspective of individual transcription start sites (TSSs), using natural genetic variation, perturbation of endogenous TF protein levels and massively parallel analysis of natural and synthetic regulatory elements, we show that the effect of TF binding on transcription initiation is position dependent. Analysing TF-binding-site occurrences relative to the TSS, we identified several motifs with highly preferential positioning. We show that these patterns are a combination of a TF’s distinct functional profiles: many TFs, including canonical activators such as NRF1, NFY and Sp1, activate or repress transcription initiation depending on their precise position relative to the TSS. As such, TFs and their spacing collectively guide the site and frequency of transcription initiation. More broadly, these findings reveal how similar assortments of TF binding sites can generate distinct gene regulatory outcomes depending on their spatial configuration and how DNA sequence polymorphisms may contribute to transcription variation and disease, and underscore a critical role for TSS data in decoding the regulatory information of our genome.

Each cell of an organism interprets the same genome in a unique way. At the heart of this process are sequence-specific TFs that orchestrate regulatory programs and interpret the regulatory sequence grammar inscribed in the genome 6,7,8. How these regulatory programs are encoded is still largely enigmatic. Many regulatory elements contain sequence motifs for similar sets of TFs and most TFs display widespread binding to regulatory sequences, with variable and sometimes minimal consequences for gene regulation 9,10,11. Consequently, we are largely unable to predict gene expression patterns from DNA sequence alone 12,13 and it is unclear how the transcription of most human genes is regulated. Previous studies have shown that TF-binding-site spacing, orientation and copy number, and affinity of TF binding sites can influence transcriptional output 5,6,14,15,16. However, few generalizable rules exist for how TF binding sites construct gene regulatory programs, restricting our ability to rationally interpret our genome or understand how mutations in regulatory sequences impact gene regulation or manifest in disease 17.

Preferential spacing of TFs relative to the TSS

The TSS is a landmark of gene expression, where regulatory signals are ultimately integrated to start transcription. Regulatory elements including promoters and enhancers as defined by active transcription initiation (hereafter collectively referred to as transcription start regions (TSRs)) 18 often start transcription from several different TSS locations rather than a single site 19,20. Capturing TSSs across different cell types or in response to stimuli revealed that TSS selection within TSRs can be highly dynamic 21,22. We therefore set out to investigate motif grammar from the perspective of each individual TSS, rather than from the perspective of open chromatin, a specific protein or epigenetic state.

To do so, we developed HOMER2 (Methods and Extended Data Fig. 1), a suite of analysis tools to study DNA sequence motif enrichments accounting for both GC content and position-dependent nucleotide biases, for example, as found near TSSs (Fig. 1a). We next profiled human U2OS cell TSSs using capped small RNA sequencing (csRNA-seq) 21, which accurately captures the TSSs of both stable and unstable RNAs. Using these TSSs as anchors for de novo motif analysis 23 revealed that the binding sites of the most-enriched TFs had preferential localizations relative to active TSSs (Fig. 1b and Extended Data Fig. 2a). This preferential positioning was particularly apparent for sequences bound by ubiquitously expressed canonical activators such as NRF1, NFY, Sp1 and ETS-family TFs 6,24,25,26,27 (Extended Data Fig. 2b,c). In general, these activator binding sites were enriched upstream of the core promoter region (−40 to +40 bp, relative to the TSS) 28 and depleted near the active TSS, especially downstream, where the RNA polymerase II initiation complex is postulated to initially contact the TSR 29,30 (Fig. 1c,d and Extended Data Fig. 2c). One exception to this rule was YY1, a TF known for its dual role as a transcriptional activator and repressor 31,32. The binding sites of repressors such as ZBTB7A/LRF 33 were depleted near active TSSs (Fig. 1d and Extended Data Fig. 2d). Although some TF binding sites were specifically enriched in distinct regulatory element types, such as bZIP TFs (such as AP-1) at enhancers 23,34, TF-binding-site-specific positional preferences were highly similar near TSSs for different transcript types (Supplementary Fig. 1). Enrichment patterns for several TF binding sites exhibited an approximately 10 bp periodicity, suggesting that the rotational position of TFs relative to the TSS affects transcription initiation 5,14,35 (Extended Data Fig. 2e).
Positional preferences were conserved across cell lines, vertebrate species and TSS-detection methods and, in some cases, were restricted to genomic locations with cell-type-specific activity (for example, HNF1; Extended Data Fig. 2f–h). Positional preferences were less apparent for TF binding sites associated with weak transcription such as CTCF (Extended Data Fig. 2b,c). Moreover, spatial TF binding site enrichment patterns were more pronounced in the absence of the canonical initiator core promoter element 36 (Extended Data Fig. 2i–k). Consistent with the eminence of these elements in anchoring down the RNA polymerase and guiding TSS selection 20,36, this finding suggests positionally enriched TF binding sites may themselves direct TSS selection but that their impact is superseded by core promoter elements. Together, these findings highlight that many TF binding sites are enriched or depleted at specific positions relative to the TSS.
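The notion of testing a motif for enrichment or depletion at each position relative to the TSS can be sketched in a few lines of Python. This is a simplified illustration, not HOMER2 itself: it uses a two-sided Fisher's exact test per position bin followed by Benjamini–Hochberg correction, but it does not model GC content or position-dependent nucleotide bias. All function names and the input layout (per-bin motif counts in TSS-anchored and background sequences) are assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def positional_enrichment(fg_hits, fg_total, bg_hits, bg_total):
    """Adjusted P values per position bin.

    fg_hits[i] / bg_hits[i]: number of TSS-anchored / background sequences
    with a motif occurrence in bin i (hypothetical input format).
    """
    pvals = [
        fisher_exact_two_sided(f, fg_total - f, b, bg_total - b)
        for f, b in zip(fg_hits, bg_hits)
    ]
    return benjamini_hochberg(pvals)


# Toy counts for three position bins (made-up numbers for illustration only).
adjusted_p = positional_enrichment([40, 12, 3], 200, [10, 11, 12], 200)
```

The direction of the effect (enrichment versus depletion) would be read from the odds ratio in each bin; the sketch reports only significance.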

Fig. 1

a, Nucleotide frequency bias near TSSs of human U2OS cells. b, Many TF binding sites are enriched at specific positions relative to active TSSs genome-wide (TSSs). c, Integrating positional information and nucleotide biases using HOMER2 identifies TF binding site enrichment or depletion relative to the TSS, exemplified by NRF1. Statistical analysis was performed using two-sided Fisher’s exact tests with Benjamini–Hochberg correction. d, Most TF binding sites are enriched in preferred positions. Enrichment or depletion of all 463 known human TF binding sites in the HOMER2 motif database relative to the TSS. A detailed version of this figure is provided in Supplementary Fig. 1. e, NRF1 function is dependent on the location of its binding site relative to the TSS. TSSs are ranked on the basis of their log2-transformed fold change in activity after NRF1 knockdown (from gain to loss of transcription initiation, n = 136,757). TSSs with NRF1 binding sites (heat map, dark red) within their preferred localization pattern (blue graph; top) are repressed, while those with NRF1 binding sites downstream of the TSS were correlated with activation (or derepression; bottom). Analysis was performed using MEPP (Methods). f, TSSs downregulated after NRF1 knockdown (siNRF1) display TF binding within its preferred upstream region while derepressed TSSs display NRF1 binding downstream, as assessed by anti-NRF1 ChIP–seq. g, NRF1 probably represses upstream TSSs through steric hindrance. Analysis of TSSs (n = 136,344) ranked from gain to loss of transcription initiation activity after overexpression of a transcription activation domain (TAD)-deleted dnNRF1 mutant shows repression when the NRF1 binding site (dark red) is located either upstream or downstream of the TSS, suggesting that TSSs found upstream of NRF1 sites are activated only after removal of the TF from the downstream DNA. h, Model for NRF1 TF function and NRF1-dependent TSS derepression.

TF position governs regulatory impact

We speculated that the preferred localization patterns of diverse TF binding sites are a reflection of TF function and may represent a superposition of the multiple mechanistic roles TFs can have in regulating transcription initiation. To assess how the position of a TF relative to the TSS may affect its function, we knocked down NRF1 and YY1 in human U2OS cells using short interfering RNAs (siRNAs) (Extended Data Fig. 3a) and captured changes in transcription initiation 24 h later using csRNA-seq. We selected these TFs because their binding sites have strong positional preferences, and they are the only proteins known to bind to their respective motifs. Furthermore, NRF1 sites are preferentially located upstream of the TSS (Fig. 1c), while YY1 sites are unique for their preferred location downstream of the TSS (Fig. 1b,d). Consistent with previous reports that both regulators are potent activators 37,38, knockdown of NRF1 and YY1 diminished the csRNA-seq signal at 3,791 and 1,621 TSSs (<−1.5-fold, adjusted P (Padj) < 0.05), respectively. Moreover, knockdown of these strong activators also increased initiation at a comparable number of TSSs (3,971 and 1,160, respectively, >1.5-fold, Padj > 0.05). Follow-up transcriptome analysis showed that these ectopic TSSs not only resulted in alternative 5′ untranslated regions (UTRs) that, at times, produced new splice isoforms or open reading frames but also altered gene expression levels (Extended Data Fig. 4). TSSs with decreased or increased activity were frequently found within the same TSR, implying shifts in TSS selection within regulatory elements that may activate specific regulatory programs or help to buffer changes in gene expression. These findings mirror the results observed after depletion of NFY 39, and help to explain the observations that many TF binding events appear to be uncoupled from expected changes in gene expression 10,40,41.
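The thresholding step described above, calling a TSS downregulated or upregulated from its fold change and adjusted P value, can be sketched as follows. The record layout and the function name `classify_tss` are hypothetical. The sketch also assumes the same significance cutoff (adjusted P below 0.05) in both directions, which is an assumption of this example and not a statement about the exact criteria used in the study.

```python
from math import log2


def classify_tss(records, fold=1.5, alpha=0.05):
    """Label each TSS as 'down', 'up' or 'unchanged' after a knockdown.

    records: iterable of (tss_id, log2_fold_change, adjusted_p) tuples
    (a hypothetical input format). A TSS is called only if it passes both
    the fold-change and the adjusted-P cutoffs; applying alpha in both
    directions is an assumption of this sketch.
    """
    cutoff = log2(fold)
    calls = {}
    for tss_id, log2fc, padj in records:
        if padj < alpha and log2fc < -cutoff:
            calls[tss_id] = "down"
        elif padj < alpha and log2fc > cutoff:
            calls[tss_id] = "up"
        else:
            calls[tss_id] = "unchanged"
    return calls


# Made-up example values, for illustration only.
example_calls = classify_tss([
    ("tss_a", -1.0, 0.01),
    ("tss_b", 1.2, 0.001),
    ("tss_c", 2.0, 0.2),
])
```

Counting the labels per knockdown would give tallies of the kind reported in the text, such as the numbers of TSSs with diminished or increased csRNA-seq signal.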

Notably, ranking TSSs by their change in initiation frequency after TF knockdown revealed a clear bimodal distribution of the cognate TF binding sites: TSSs that were downregulated after NRF1 knockdown were enriched for NRF1 binding sites upstream of the TSS, where the motif is preferentially located relative to active TSS locations genome-wide. By contrast, TSSs with increased transcription after NRF1 knockdown had NRF1 binding sites positioned downstream of the TSS, where the motif is naturally depleted (Fig. 1c,e ). Integrating ChIP–seq validated that downregulated TSSs had NRF1 bound predominantly upstream, whereas TSSs that were activated after NRF1 knockdown had NRF1 bound downstream of impacted TSSs (Fig. 1f ). Together, these findings demonstrate that NRF1 can both activate and inhibit transcription initiation, depending on its location relative to a TSS. The function of NRF1 is therefore position dependent. Given that the majority of strongly regulated sites contain an NRF1 binding site that, as assessed using chromatin immunoprecipitation followed by sequencing (ChIP–seq), is commonly bound by the TF (Fig. 1f and Extended Data Fig. 3b–e ), the observed NRF1-dependent inhibition of transcription initiation in cis is probably distinct from previous reports showing that the loss of an activator can lead to activation through secondary effects in trans 10 , 40 , 42 , 43 , 44 .
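Whether a binding site lies upstream or downstream of a TSS depends on the strand of transcription. As a minimal sketch of how such a comparison could be tabulated, the following Python code computes a strand-aware offset and counts responses by side. The record layout and function names are assumptions made for illustration; the study itself used MEPP and HOMER2 for this analysis.

```python
from collections import Counter


def relative_offset(site_pos, tss_pos, strand):
    """Signed distance of a binding site from a TSS, in bp.

    Negative values are upstream and positive values downstream of the
    TSS, taking the strand of transcription into account.
    """
    offset = site_pos - tss_pos
    return offset if strand == "+" else -offset


def tabulate_by_side(records):
    """Count (response, side) pairs for TSSs that carry a binding site.

    `records` is a hypothetical list of (response, site_pos, tss_pos, strand)
    tuples, where `response` is 'up' or 'down' after knockdown. An offset
    of zero is counted as downstream in this sketch.
    """
    counts = Counter()
    for response, site_pos, tss_pos, strand in records:
        offset = relative_offset(site_pos, tss_pos, strand)
        side = "upstream" if offset < 0 else "downstream"
        counts[(response, side)] += 1
    return counts


# Toy records, not data from the study.
toy = [
    ("down", 950, 1000, "+"),   # site 50 bp upstream on the + strand
    ("down", 2050, 2000, "-"),  # site 50 bp upstream on the - strand
    ("up", 3030, 3000, "+"),    # site 30 bp downstream on the + strand
]
counts = tabulate_by_side(toy)
```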

Similarly, YY1 depletion resulted in the downregulation of TSSs with YY1 binding sites within its preferential zone immediately downstream of the TSS, whereas upregulated TSSs typically had a YY1 site further downstream (Extended Data Fig. 3f–h ). Position-specific effects mirroring the TF binding site’s natural enrichment were also identified when reanalysing published NFY knockdown data from mouse embryonic fibroblasts 39 (Extended Data Fig. 3i ). Together, these results indicate that NRF1, YY1 and NFY can activate but also directly inhibit transcription initiation, depending on the position of their binding relative to the regulated TSS (Fig. 1h ).

Suppression of TSS by steric hindrance

To further examine the mechanisms underlying TSS activation after activator TF knockdown, we ectopically expressed a dominant-negative NRF1 (dnNRF1) mutant 45 . In contrast to siRNA knockdown, overexpression of transactivation domain-deficient NRF1 resulted in the downregulation of all TSSs near dnNRF1-bound sites, even those located downstream of the TSS (Fig. 1g and Extended Data Fig. 5a–d ). TSSs activated after NRF1 knockdown therefore probably depend on the clearance of the TF from the DNA, substantiating a model in which preferred spacing among TFs and the RNA polymerase II complex is critical for effective transcription initiation. Binding outside of these preferential positions inhibits RNA polymerase II recruitment and/or initial elongation, probably by steric hindrance (Fig. 1h ), similar to the function of canonical prokaryotic repressors 46 , Rap1 in yeast or CTCF in vertebrates 47 , 48 . These findings highlight the importance of accurate TSS positional information to decode TF function, and provide an explanation as to why it has been so challenging to predict gene expression programs from DNA sequence alone 12 , 13 .

TF function revealed by genetic variation

Genetic variation that naturally occurs between individuals offers an opportunity to study the functional impact of genetic variation on gene regulation 49 , 50 . As many of these variants affect TF binding sites, we captured the TSS landscape of bone-marrow-derived macrophages (BMDMs) from two mouse strains (C57BL/6 versus SPRET) to assess how sequence variants in TF binding sites impact transcription initiation as a function of their position relative to the TSS (Fig. 2 ). A key advantage of exploiting natural genetic variation is that it can be used to unbiasedly assess the impact of all TF binding sites on transcription, including those where TF redundancy may hinder analysis on the protein level under natural, unperturbed conditions.

Fig. 2

a, b, Natural DNA polymorphisms can have a major impact on gene regulation. Example loci where genetic variants eliminated NF-κB (p65) binding sites in the SPRET mouse strain (versus C57BL/6) and were associated with either a reduction in downstream transcription initiation (a) or a derepression of transcription initiation when the mutated binding site was located downstream of the TSSs (b). Transcription initiation was measured in untreated (notx) or KLA-stimulated BMDMs from SPRET and C57BL/6 reference mouse strains (average of n = 2 biological replicates). c, The influence that variants in TF binding sites have on transcription initiation is dependent on their position relative to the TSS. Analysis of the genome-wide significance of the association between mutations in the Sp1 binding site (GC-box) and the change in transcription initiation, calculated for Sp1 sites as a function of their relative distance to the TSS. Positive log[P adj] values indicate that mutations predicted to cause reduced Sp1 binding are more strongly associated with reduced initiation, whereas negative log[P adj] values indicate that the mutated binding sites are more strongly associated with increased initiation (30 bp windows evaluated at 10 bp increments). Statistical analysis was performed using two-sided Mann–Whitney U-tests with Benjamini–Hochberg correction. d, Similar to c, but showing that mutations in the NF-κB (p65) binding sites exhibit stronger positional associations after 1 h of KLA stimulation (dotted versus solid line). e, The functional impact of mutating TF binding sites (TFBSs) generally follows one of three distinct patterns: pure transcriptional activators (PU.1), pure transcriptional repressors (ZEB2) and dual-function TFs (Sp1) that can activate or repress transcription initiation in a position-dependent manner. f, Position-dependent activity was evaluated for mutations impacting 463 known human TF motifs (not all are expressed in BMDMs). A detailed map of this figure is provided in Supplementary Fig. 2.

To resolve distance-dependent functions of TFs, we used HOMER2 and its ability to normalize for positional single-nucleotide variant biases (Extended Data Fig. 5e,f ) to assess the relationship between genetic alterations in TF binding sites and regulatory phenotypes. Contrasting 42.9 million single-nucleotide polymorphisms (SNPs) and 80,988 differentially regulated TSSs ( P adj  < 0.25) between the mouse strains revealed that the influence of genetic variants disrupting TF binding sites on initiation levels is position dependent (Fig. 2c–f ). For example, variants that weaken consensus binding sites of NRF1 or NFY at their preferential positions relative to the TSS were strongly associated with reduced transcription (Extended Data Fig. 5g,h ), consistent with the steric requirements of transcription complex assembly 29 , 30 .
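The position-dependent analysis evaluates binding sites in 30 bp windows at 10 bp increments (Fig. 2c). The Python sketch below shows one way to enumerate such windows and count site offsets within them. The scanned range of −100 to +100 bp and both function names are placeholders chosen for the example, not parameters of HOMER2.

```python
def sliding_windows(start, end, width=30, step=10):
    """Return half-open (left, right) windows of `width` bp every `step` bp."""
    return [(left, left + width) for left in range(start, end - width + 1, step)]


def sites_per_window(offsets, windows):
    """Count binding-site offsets (bp relative to the TSS) in each window."""
    return [sum(left <= o < right for o in offsets) for left, right in windows]


# Hypothetical range and toy offsets, not data from the study.
windows = sliding_windows(-100, 100)
counts = sites_per_window([-52, -48, -50, 25], windows)
```

Because the step is smaller than the width, each site contributes to three overlapping windows, which smooths the resulting positional profile.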

Analysis of all motifs in the HOMER2 database revealed that sequence variation found in 300 of 463 motifs is significantly associated with distance-dependent changes in transcription initiation ( P adj  < 0.01 in at least one position). Clustering of TF-binding-site patterns stratified three major classes: pure transcriptional activators, pure transcriptional repressors and dual-function TFs (Fig. 2e,f ). For dual-function TFs, the location of the binding site relative to the TSS determined their role in activating or repressing transcription, usually segregated by their localization upstream or downstream of the TSS, respectively. This group includes the binding sites of the strong, ubiquitous TFs that are typically characterized as activators (Sp1/KLF, NRF1 and NFY), MYC, RUNX and other TFs that physically interact with the RNA polymerase II complex (Fig. 2f ). These results were also verified using an alternative analysis approach 51 (Supplementary Fig. 2 ). The function of TFs that bind to these sites is therefore highly dependent on their relative position to the TSS. Pure activators, exemplified by the lineage-determining TF PU.1, included TFs of which variants with weakened consensus binding sites are generally associated with a reduction in transcription. These findings are consistent with previous reports suggesting that the TFs binding to these sites promote transcription through indirect mechanisms, including chromatin opening and epigenetic modification 52 , 53 . The third cluster of pure repressors encompassed TF binding sites that were associated with transcription activation when disrupted, and included the well-characterized repressors ZEB2 54 and ZBTB7A (also known as LRF) 33 .
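The criterion of P adj < 0.01 in at least one position can be sketched with a plain Benjamini–Hochberg adjustment. In the Python example below, the correction is applied across the windows of a single motif. That scope is an assumption of this sketch, and the study may have corrected across all motifs and positions jointly. The function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def is_position_dependent(p_by_window, alpha=0.01):
    """True if any window has an adjusted P value below `alpha`."""
    return any(p < alpha for p in benjamini_hochberg(p_by_window))


raw = [0.01, 0.04, 0.03, 0.005]      # toy P values, one per window
adjusted = benjamini_hochberg(raw)   # approx. [0.02, 0.04, 0.04, 0.02]
```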

To assess whether position-dependent TF function extends to signalling-dependent TFs, we stimulated macrophages from both mouse strains with the TLR4 agonist Kdo 2 -lipid A (KLA), which elicits a strong innate immune response 55 . Consistently, this stimulation led to a much more prominent signature for the binding sites of the main response factor, NF-κB 56 (Fig. 2d , Extended Data Fig. 5i,j and Supplementary Fig. 2 ). These data highlight position-dependent activity of TFs as a widespread phenomenon that extends to signalling-dependent TFs and argue that genome-wide analysis of TF binding sites and their spatial constraints relative to the TSS provides unbiased insights into TF function, thereby providing a tool to bridge the gap between DNA sequence and transcription.

TSS-MPRA confirms TF positional function

To directly assess how the position of TF binding sites within regulatory sequences affects transcription en masse, we developed a massively parallel reporter assay (MPRA) strategy to capture transcription initiation at base resolution 57 (TSS-MPRA; Fig. 3a and  Methods ). As exemplified for a TSR found at the EIF2S1 locus, TSS-MPRA accurately captured the relative initiation frequencies and locations of TSSs observed in vivo by csRNA-seq (Fig. 3b and Extended Data Fig. 6a ).

Fig. 3

a, Schematic of TSS-MPRA, which accurately captures the TSSs and activity from thousands of plasmid-cloned DNA sequences. b, Example of normalized TSS-MPRA transcription initiation data from four inserts designed based on a TSR in the EIF2S1 promoter (from −110 bp to +42 bp relative to the primary TSS). Three of the inserts show the impact of placing the NRF1 binding site at different positions relative to the TSS. The in vivo genomic TSS levels, as measured using csRNA-seq analysis in HEK293T cells, are shown at the top. WT, wild type. c, Synthetic TF binding site insertions confirm the position-dependent function of TFs. Summary of TSS-MPRA data for six TF binding sites inserted at three positions relative to the TSS. The impact of inserting the TF binding site was measured as the log ratio of normalized transcript levels versus the wild-type control. n = 13 distinct promoters, enhancers and other TSRs, and each insert was redundantly encoded with 4 different barcode sequences. The box plots show the median (centre line), 25th and 75th percentiles (box limits), and the minimum and maximum values (whiskers) for each position.

To examine position-dependent TF function, we next synthetically inserted the binding site of six TFs (Sp1, NRF1, NFY, YY1, p53 and CTCF) or a control sequence predicted to not be bound by any known TF at positions −50 bp, −20 bp or +25 bp relative to the TSS into 13 different TSRs (Supplementary Table 2 ) and performed TSS-MPRA. As an internal control, and to mitigate potential barcode-dependent biases, each construct was paired with four distinct barcode sequences ( Methods ). In agreement with our results above, TF binding site insertion at −50 bp or −20 bp relative to the TSS stimulated initiation approximately fourfold for Sp1 and NRF1, and twofold for NFY (Fig. 3c ). By contrast, insertion of the same sites at +25 bp reduced initiation by twofold to fivefold. Insertion of a CTCF site at the +25 position was even more repressive, consistent with the strong DNA binding and insulator function of CTCF 48 , 58 . By contrast, binding sites for YY1, which evolved to function at or just downstream of the TSS 59 , showed the inverse trend, with significant increases in transcription when positioned at +25.
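The insertion design and its readout can be illustrated with a short Python sketch. It places a site at −50, −20 or +25 bp relative to the TSS and computes the log2 ratio of mean transcript levels across four barcodes. Several details are assumptions made for the example: the site overwrites existing bases rather than lengthening the construct, the stated position refers to the first base of the site, and the sequences and transcript levels are placeholders.

```python
import math
from statistics import mean


def place_site(sequence, tss_index, offset, site):
    """Overwrite bases so that `site` starts `offset` bp from the TSS.

    `tss_index` is the 0-based index of the TSS within `sequence`.
    Overwriting (rather than inserting) keeps the construct length
    constant; this is an assumption of the sketch.
    """
    start = tss_index + offset
    if start < 0 or start + len(site) > len(sequence):
        raise ValueError("site does not fit inside the sequence")
    return sequence[:start] + site + sequence[start + len(site):]


def log2_effect(construct_levels, wild_type_levels):
    """log2 ratio of mean normalized transcript levels across barcodes."""
    return math.log2(mean(construct_levels) / mean(wild_type_levels))


backbone = "A" * 160   # placeholder sequence with the TSS at index 110
site = "C" * 10        # placeholder motif, not a real binding site
constructs = {pos: place_site(backbone, 110, pos, site) for pos in (-50, -20, 25)}

# Toy levels for four barcodes each; a fourfold increase gives log2 = 2.
effect = log2_effect([40.0, 38.0, 42.0, 40.0], [10.0, 9.0, 11.0, 10.0])
```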

Binding of a TF at positions −20 bp or +25 bp relative to the TSS is generally thought to be sterically incompatible with RNA polymerase II complex assembly 60. The negative regulatory phenotype observed for most TF binding sites when located at +25 bp, but less so when upstream of the TSS at −20 bp, suggests that the downstream core promoter region is critical for initial TFIID DNA recognition, while TFs bound to the upstream promoter region may be more readily displaced after TFIID has docked. These findings lend functional support to the cryo-electron-microscopy-based model that posits that TFIID initially interacts with TSRs in the downstream promoter region before transitioning to the upstream core promoter region 29,30,60.

A notable exception was the tumour suppressor p53. The p53 binding site lacks natural preferential spacing (Extended Data Fig. 5k ) and activated transcription across all tested positions, resembling the behaviour of pure activator TFs uncovered in our natural genetic variation analysis. This finding is consistent with previous reports 44 , 61 , 62 and may provide an explanation for the unique potency of p53 to rewire transcription networks by itself. Together, these results corroborate the position-specific impact of TF binding sites on transcription initiation and suggest that there are different classes of TFs with distinct position-specific effect profiles.

TF interactions influence TSS use

Synthetic recreation challenges our true understanding of observed biological phenomena. We therefore designed a synthetic 150 bp DNA sequence lacking known TF binding sites and generated thousands of variations thereof by placing a binding site for NRF1, NFY, YY1 or Sp1 in 2 bp increments across the entire sequence. Capturing both initiation strength and positions using TSS-MPRA revealed that insertions of TF binding sites frequently resulted in de novo TSSs (Fig. 4a , Extended Data Fig. 6b,c and Extended Data Fig. 7a–e ), further supporting our observation that the TFs binding to these sites can strongly influence the recruitment and positioning of RNA polymerase II. Aggregating the sites of transcription initiation for each construct as a function of the distance between the TF binding site and the TSS (Fig. 4b,c and Extended Data Fig. 7f ) revealed a position-dependent activity pattern resembling the binding site’s natural enrichment relative to active TSSs, similar to our previous results (Fig. 1d and Extended Data Fig. 2c ). Thus, multiple independent experimental and analytical approaches revealed an extremely similar pattern for a given TF (Fig. 4d and Extended Data Fig. 8a–d ). While many activator TFs shared an overall preference for binding the −50 bp region relative to the TSS, similar to a fingerprint, each TF exhibited a unique pattern. These findings propose a model in which human TFs can directly guide transcription initiation based on the consensus of their unique spatial–functional profiles (Extended Data Fig. 8e ).
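A sweep library of this kind can be generated programmatically. The Python sketch below moves a site across a 150 bp background in 2 bp steps. The background and motif strings are placeholders, and overwriting bases instead of inserting them is an assumption that keeps every construct at 150 bp.

```python
def sweep_constructs(background, site, step=2):
    """Yield (start, sequence) with `site` overwritten at every `step` bp."""
    last_start = len(background) - len(site)
    for start in range(0, last_start + 1, step):
        yield start, background[:start] + site + background[start + len(site):]


background = "A" * 150   # placeholder for the 150 bp motif-depleted sequence
site = "C" * 12          # placeholder motif, not a real binding site
library = list(sweep_constructs(background, site))
```

With a 12 bp placeholder site, this yields 70 constructs per TF; the real number depends on the motif length and on the portion of the sequence that is swept.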

Fig. 4

a , Heat map of TSS measurements captured by TSS-MPRA for a 2 bp sweep of Sp1 binding sites (blue) from −100 to +40 across an artificial, motif-depleted promoter. b , c , Position-dependent activity patterns determined using TSS-MPRA TF sweeps resemble each TF’s natural enrichment profile. The average log 2 -transformed change in initiation activity for all TSSs compared with their mean levels were plotted relative to the Sp1-binding-site ( b ) or the NFY-, NRF1- or YY1-binding-site ( c ) distance to each TSS. Average of n  = 2 biological replicates. d , Multiple independent experimental approaches show similar spatial–functional profiles for a TF. Natural NFY-binding-site enrichment relative to the active TSS (black), the impact of NFY knockdown (orange), the impact of natural genetic variation in the NFY binding sites (green) and a TSS-MPRA NFY sweep (blue) reveal consistent, position-dependent functional profiles and superhelical preferences for NFY (each profile was minimum/maximum scaled to 1/−1). e , Position-dependent TF activity was altered when NRF1 is swept through the putative TOB2 enhancer (eTOB2) versus the TOB2 enhancer with the endogenous NRF1 binding site mutated (mutNRF1). f , Transcription initiation is affected by the relative spacing between TF binding sites and TSS location. TSRs containing both NRF1 and Sp1 binding sites sorted by the distance between them are shown with csRNA-seq initiation levels on both the positive (red; +) and negative (blue; −) strands. g , Position-dependent TF–TF interactions. TSSs upstream of the NRF1 but downstream of Sp1 (black triangle) are upregulated after NRF1 knockdown while most downregulated TSSs are downstream from NRF1, even if Sp1 is found downstream as well (red circle). Upregulated TSSs are shown in green, and downregulated TSSs are shown in purple. Expression of dnNRF1 generally represses all nearby TSS activity. h , Model of how TF interactions can mediate TSS selection.

To test this model, we repeated the NRF1 sweep through a natural enhancer region adjacent to the TOB2 gene. The activity pattern obtained by TSS-MPRA initially resembled that of the NRF1-binding-site sweep through our synthetic motif-depleted sequence, but then diverged with a more pronounced 10 bp helical periodicity upstream of −50 bp relative to the TSS (Fig. 4e ). We hypothesized that these differences in the spatial activation profiles could be due to the presence of other TF binding sites in the enhancer region, such as the NRF1 site naturally found at −50 bp (Extended Data Fig. 8f ). Indeed, repeating the NRF1-binding-site sweep in the TOB2 enhancer in which the innate NRF1 site was mutated revealed a pattern that now closely resembled the synthetic NRF1 sweep (Fig. 4e ). These results highlight that transcription initiation is affected by the spatial relationships not only between TF binding sites and the TSS, but also between the TF binding sites themselves.

To further examine the relationship between TF binding site spacing and transcription initiation, we selected all TSRs active in U2OS cells that contained at least two activator TF binding sites of interest within 300 bp of one another and the primary TSS, such as Sp1 and NRF1 motifs. Sorting these TSRs based on the distance between the sites and visualizing transcription initiation patterns measured by csRNA-seq revealed several trends that are consistent with the position-dependent functions of TF binding site pairs in vivo. First, TSSs were predominantly located at the preferred position characteristic for the 3′ TF in each TF binding site pair (Fig. 4f (‘2’ regions)). Second, TSSs were depleted in between the two TF binding sites, consistent with the ability of these TFs to suppress initiation when binding downstream of the TSS, indicating that the most-3′-binding TF has a dominant role in activating TSS selection (Fig. 4f (‘1’ regions) and Extended Data Fig. 9 ).
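The selection and ordering of TSRs described above can be sketched as a filter followed by a sort. In the Python example below, each TSR is a dictionary holding the offsets of an NRF1 and an Sp1 site relative to the primary TSS. The field names, the function name and the toy values are assumptions made for illustration.

```python
def select_and_sort_pairs(tsrs, max_distance=300):
    """Keep TSRs whose two sites lie within `max_distance` bp of each other
    and of the primary TSS, sorted by the signed distance between sites."""
    kept = []
    for tsr in tsrs:
        nrf1, sp1 = tsr["nrf1_offset"], tsr["sp1_offset"]
        close_to_each_other = abs(nrf1 - sp1) <= max_distance
        close_to_tss = max(abs(nrf1), abs(sp1)) <= max_distance
        if close_to_each_other and close_to_tss:
            kept.append(tsr)
    return sorted(kept, key=lambda t: t["nrf1_offset"] - t["sp1_offset"])


# Toy TSRs, not data from the study.
tsrs = [
    {"name": "tsr1", "nrf1_offset": -50, "sp1_offset": -120},
    {"name": "tsr2", "nrf1_offset": -200, "sp1_offset": -40},
    {"name": "tsr3", "nrf1_offset": -400, "sp1_offset": -50},  # too far from the TSS
]
ordered = [t["name"] for t in select_and_sort_pairs(tsrs)]
```

Sorting by the signed distance places TSRs with NRF1 upstream of Sp1 at one end and those with the opposite order at the other, which is the layout a heat map such as Fig. 4f relies on.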

To gain further insights into TF-mediated TSS selection, we reanalysed our own as well as published TF knockdown data for NRF1 , YY1 and NFY 39 with a focus on regulatory elements with multiple TF binding sites. For clarity, we limited the presentation of these data to one DNA strand only. Ectopically activated (or derepressed) TSSs after TF knockdown were preferentially upstream of the targeted TF’s binding site but downstream of the second TF (Fig. 4g (black triangle) and Extended Data Fig. 9b–d ), corroborating the role of TF-mediated blocking in TSS selection. As expected, downregulated TSSs were found at the preferred distance downstream of the targeted TF binding site but, importantly, also when the site was preferentially positioned upstream of the second activator TF binding sites, for example, Sp1 (Fig. 4g (red ellipse) and Extended Data Fig. 9b–d ). This emphasizes that NRF1 or NFY binding upstream of Sp1 is necessary for proper initiation strength, supporting an additive model for TF contribution to transcription activity when multiple TFs are bound at their preferential positions 15 . Together with the result that NRF1 or NFY binding downstream of Sp1 can repress transcription initiation, this finding demonstrates that the order and spacing among TFs is critical for cooperative or antagonistic function and TSS selection (Fig. 4h ).

TF positioning in human disease

To investigate the role of position-dependent TF function across human individuals and putative functions in disease, we analysed nascent TSSs captured in lymphoblastoid cell lines (LCLs) from 67 Yoruba individuals 63 with sequenced genomes. We first defined TSS quantitative trait loci (tssQTLs) by correlating SNPs with individual TSS levels, and then stratified the impact of tssQTLs in TF binding sites in a distance-dependent manner. This analysis revealed the same grouping of TFs into pure transcriptional activators, pure transcriptional repressors and dual-function TFs with distinct positional preferences, relative to the TSS, as observed in the mouse strain data (Fig. 2f), and often mimicked each TF binding site’s enriched localization in human TSSs (Fig. 5a and Supplementary Fig. 3). Integrating data from genome-wide association studies (GWAS) further demonstrated a position-dependent effect of disease-associated genetic variants on TSS levels, contingent on the position of TF binding sites. For example, the variant rs11122174-T, which is linked to defects in haematopoiesis 64, disrupts an NRF1 site in the TTC13 promoter, resulting in decreased initiation downstream of the mutated site and an increase in initiation from upstream TSSs (Fig. 5b and Extended Data Fig. 10a). Indeed, tssQTL variants that weaken binding sites of many dual-function TFs (Sp proteins, ETS, NFY, NRF1, CRE, ZBTB33, AP-1 and PU.1) upstream of TSSs predominantly decreased transcription initiation (P < 0.00022, Fisher’s exact test). Conversely, variants enhancing the consensus of weaker binding sites were associated with increased initiation (P = 3.8 × 10⁻⁶, Fig. 5c). By contrast, consistent with position-dependent TF function, TF-binding-site-disrupting variants downstream of the TSS increased transcription (P = 0.0019). Strengthening of a downstream site by comparison had a lesser impact, potentially because TF binding to moderate-affinity sites already causes steric hindrance (P = 0.35; Fig. 5c).
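The association tests quoted above compare counts in a 2 × 2 table, for example variants that weaken or strengthen a site against TSSs with decreased or increased initiation. As a minimal sketch, a two-sided Fisher's exact test can be written with the standard library alone. The function name and the toy counts are assumptions made for the example, and how the contingency tables were built in the study is not specified here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= observed * tolerance
    )
    return min(1.0, p_value)


# Toy counts: rows = weakened / strengthened site, columns = decreased / increased.
p_weak = fisher_exact_two_sided(3, 1, 1, 3)
p_strong = fisher_exact_two_sided(10, 0, 0, 10)
```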

Fig. 5

a , Positional TF binding site enrichment (right) and position-dependent activity of TFs based on the analysis of genetic variants and TSS activity (left) in LCLs across 67 human individuals calculated for 463 human TF motifs (30 bp windows evaluated at 10 bp increments). Statistical analysis was performed using two-sided Mann–Whitney U -tests or Fisher’s exact tests with Benjamini–Hochberg correction. Note that only a subset of TFs with motifs is expressed in LCLs. A detailed map with all TFs annotated is provided in Supplementary Fig. 3 . b , Disease-associated variants, identified through GWAS, recapitulate position-dependent TF function. Example of variant rs11122174, for which a C to T mutation disrupts a consensus NRF1 binding site leading to a general increase in upstream TSS activity and decrease in downstream TSS activity. c , Summary of the effect of disease-associated GWAS variants grouped by position relative to the TSSs suggests a role for position-dependent TF function in disease. GWAS variants weakening TF binding sites upstream of TSSs were associated with reduced initiation while those strengthening TF binding sites were associated with increased transcription (txn). Vice versa, weakening TF binding sites increased proximal TSSs, consistent with the reported position-dependent TF blocking function. d , An analysis of 133 human promoters and enhancers using TSS-MPRA showed that the mutation of TF binding sites for ETS1, NFY, NRF1, Sp1 and YY1 within their naturally enriched position is associated with reduced initiation, while mutation of sites in positions at which the TF binding site is naturally depleted were associated with activation (≥20 promoters each). Each point represents an individual TSS. e , Mutation of TF binding sites resulted in changes in TSS selection and alternative 5′ UTRs. The mean shift of all TSS positions between the mutant and control elements is plotted for the mutation of each TF binding site family.

To corroborate these findings, we assessed the effect of mutating TF binding sites in 133 human promoters and enhancers on transcription initiation using TSS-MPRA. Consistently, mutation of TF binding sites within their naturally enriched positions was associated with repression, whereas mutations of sites outside of this region were associated with increased transcription initiation (Fig. 5d and Extended Data Fig. 10b,c ). Moreover, mutation of these TF binding sites resulted in notable changes in TSS selection and therefore alternative 5′ UTRs (Fig. 5e and Extended Data Fig. 10c ), a characteristic feature of numerous diseases 39 , 65 , 66 . Mutation of TF binding sites near TSSs or within their naturally enriched positions had the strongest effect relative to sites that occur elsewhere (Extended Data Fig. 10d ). Together, these findings show that the position-dependent function of TFs can impact human gene regulation in health and disease.
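A shift in TSS selection between a mutant and its control element can be summarized as the difference between signal-weighted mean TSS positions. The Python sketch below shows this calculation on toy profiles. Weighting positions by signal is an assumption of the example; the exact definition of the mean shift plotted in Fig. 5e is not given in this section.

```python
def mean_tss_position(profile):
    """Signal-weighted mean TSS position; `profile` maps position -> signal."""
    total = sum(profile.values())
    if total == 0:
        raise ValueError("profile has no signal")
    return sum(pos * signal for pos, signal in profile.items()) / total


def tss_shift(control, mutant):
    """Positive values mean initiation moved downstream in the mutant."""
    return mean_tss_position(mutant) - mean_tss_position(control)


# Toy profiles (position in bp -> normalized signal), not data from the study.
control = {0: 80.0, 12: 20.0}
mutant = {0: 20.0, 12: 80.0}
shift = tss_shift(control, mutant)
```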

Multiple independent lines of functional evidence consistently replicate spatial–functional profiles of a TF, including (1) TF perturbation; (2) natural genetic variation; (3) TF-binding-site insertion or deletion screens in natural TSRs; (4) synthetic DNA TF sweep TSS-MPRAs; and (5) analysis of human individuals and GWAS variants. Importantly, functional profiles align with the enriched or depleted positions for TF binding sites relative to TSSs genome-wide, indicating that the naturally observed positioning of TF binding sites can reveal important information about position-dependent TF function. The fact that these patterns also emerge for cell-type- or stimulus-specific TFs relative to regulated TSSs (Extended Data Fig. 2f and Extended Data Fig. 5g,h ) indicates that TSS mapping followed by spatial analysis of TF binding sites could provide valuable insights into TF functions when studying a wide range of biological systems, disease states and even different species. Position-specific enrichment of TF binding sites in specific disease contexts can also be observed for inflammatory regulators during COVID-19 or in flavivirus infection 67 , 68 . We speculate that these profiles reflect the combined impact of multiple regulatory mechanisms for each TF. For example, activator TFs may function to remodel chromatin or directly recruit the transcription machinery to broader regions relative to the TF’s binding site, but binding of TFs to DNA itself can also sterically hinder formation of or redirect the preinitiation complex in close proximity. Overall, these data indicate that TFs cooperatively drive TSS selection in a manner consistent with their unique position-dependent functional properties. As such, different spatial arrangements of the same sets of TFs can lead to distinct gene regulatory outcomes (Fig. 4h ). More broadly, our findings highlight a spatial grammar as central to encode the multiple, often overlapping gene regulatory programs in our genome.

Experimental methods

Cell culture, siRNA and mRNA transfections

U2OS, HepG2, HEK293T or Vero E6 cells were grown at 37 °C with 5% CO 2 in DMEM (Cellgro) supplemented with 10% FBS (Gibco), 50 U penicillin and 50 μg streptomycin per ml (Gibco). For siRNA transfection, cells were washed once with PBS (Gibco), trypsinized for about 5 min and then washed twice with PBS (Gibco) by centrifugation at 400 g for 5 min. Subsequently, about 3 million cells were resuspended in 150 µl gene pulser electroporation buffer (Bio-Rad) and siRNAs in 1× siRNA buffer (60 mM KCl, 6 mM HEPES pH 7.5, 0.2 mM MgCl 2 ). The following siRNAs were used: M-011796-02-0005 siGENOME human YY1 (7528) siRNA, M-017924-01-0005 siGENOME human NRF1 (4899) siRNA and siGFP (CCACUACCUGAGCACCCAGU 69 ) as a control. For transfection of the dominant-negative NRF1 mutant (AA1-304) 45 , the sequence was cloned into pGEM-T and mRNA was synthesized using the mMESSAGE mMACHINE T7 Transcription Kit (Thermo Fisher Scientific). siRNAs and dnNRF1 were used at a final concentration of 10 μM. The mixture was transferred to a 0.2 cm cuvette and pulsed once at 250 V for a constant 20 ms. After electroporation, 1 ml complete growth medium was added and cells grown for 24 h in a 6 cm dish. To collect RNA, the plates were washed three times using ice-cold PBS with 1 ml TRIzol reagent added. Cells were scraped and RNA was extracted as described by the manufacturer with addition of 1 μl 15 mg ml −1 GlycoBlue coprecipitant (Thermo Fisher Scientific).

Mouse BMDMs

This study used total RNA originally collected from macrophages generated from C57BL/6 and SPRET strains of mice as part of a previous study. The derivation of BMDMs and their treatment with KLA for 1 h were performed as described previously 50 . Total RNA from these samples was used to perform csRNA-seq as described below.

Western blot

Cells were lysed in 1× NuPAGE LDS sample buffer (Thermo Fisher Scientific), sonicated for about 30 s and then incubated at 95 °C for 5 min under 2,000 rpm agitation. The samples were then centrifuged for 5 min at 21,000 g and about 10 μg total protein was loaded on a NuPAGE 4 to 12% Bis-Tris gel (NP0321BOX, Thermo Fisher Scientific). The gel was washed for 5 min in water, transferred for 90 min at 200 mA in 1× NuPAGE transfer buffer (NP0006, Thermo Fisher Scientific) and blocked for 40 min in ~1% non-fat dry milk in TBS. Original data are shown in Supplementary Fig. 4 . Primary antibodies were allowed to bind at 4 °C overnight. The following antibodies were used: anti-YY1 (Santa Cruz, sc-7341, HRP YY1 (H-10), 1:200), anti-NRF1 (Abcam, ab55744, 1:1,000) and anti-β-Actin (Cell Signaling, D6A8, 8457S, rabbit monoclonal antibody, 8457, 1:2,500). Western blots were quantified using Fiji (v.1.53j) with mean grey value only.

csRNA-seq was performed as described previously 21. Small RNAs of around 20–60 nucleotides were size-selected from 0.4–2 µg of total RNA by denaturing gel electrophoresis. A 10% input sample was taken aside and the remainder enriched for 5′-capped RNAs. Monophosphorylated RNAs were selectively degraded by 1 h incubation with Terminator 5′-phosphate-dependent exonuclease (Lucigen). Subsequently, RNAs were 5′-dephosphorylated through 90 min incubation in total with thermostable QuickCIP (NEB) in which the samples were briefly heated to 75 °C and quickly chilled on ice at the 60 min mark. Input (sRNA) and csRNA-seq libraries were prepared as described previously 70 using RppH (NEB) and the NEBNext Small RNA Library Prep kit, amplified for 14 cycles and sequenced single-end for 75 cycles on the Illumina NextSeq 500 system.

ChIP–seq was performed as described previously 71 . A total of 3 × 10 6 U2OS cells was fixed for 10 min with 1% formaldehyde/PBS, the reaction was quenched by adding 2.625 M glycine, cells washed twice with ice-cold PBS, snap-frozen in 1 million cell pellets and stored at −80 °C. For dnNRF1–HA ChIP–seq, cells were collected 12 h after electroporation with 3 µg IVT dnNRF1-HA mRNA as described above. Fixed cells were thawed on ice, resuspended in 500 µl ice-cold buffer L2 (0.5% Empigen BB, 1% SDS, 50 mM Tris/HCl pH 7.5, 1 mM EDTA, 1 × protease inhibitor cocktail) and chromatin was sheared to an average DNA size of 300–500 bp by administering 7 pulses of 10 s duration at 13 W power output with 30 s pause on wet ice using a Misonix 3000 sonicator. The lysate was diluted 2.5-fold with 750 µl ice-cold L2 dilution buffer (20 mM Tris/HCl pH 7.5, 100 mM NaCl, 0.5% Triton X-100, 2 mM EDTA, 1× protease inhibitor cocktail), 1% of the lysate was kept as ChIP input, and the remainder was used for immunoprecipitation using the following antibodies: dnNRF1-HA (2 μg anti-HA, Abcam, ab9110), NRF1 (2 μg, Abcam, ab175932) and YY1 (2 μg, ActiveMotif, 61980) and 20 μl of Dynabeads Protein A while rotating overnight at 8 rpm and 4 °C. The next day, the beads were collected on a magnet, washed twice with 150 µl each of ice-cold wash buffer I (10 mM Tris/HCl pH 7.5, 150 mM NaCl, 1% Triton X-100, 0.1% SDS, 2 mM EDTA), wash buffer III (10 mM Tris/HCl pH 7.5, 250 mM LiCl, 1% IGEPAL CA-630, 0.7% deoxycholate, 1 mM EDTA) and TET (10 mM Tris/HCl pH  7.5, 1 mM EDTA, 0.2% Tween-20). Libraries were prepared directly on the antibody/chromatin-bound beads. After the last TET wash, the beads were suspended in 25 μl TT (10 mM Tris/HCl pH 7.5, 0.05% Tween-20), and libraries were prepared using NEBNext Ultra II reagents according to the manufacturer’s protocol but with reagent volumes reduced by half, using 1 µl of 0.625 µM Bioo Nextflex DNA adapters per ligation reaction. 
DNA was eluted and cross-links were reversed by adding 4 μl 10% SDS, 4.5 μl 5 M NaCl, 3 μl EDTA, 1 μl proteinase K (20 mg ml −1 ), 20 μl water, incubating for 1 h at 55 °C, then overnight at 65 °C. DNA was cleaned up by adding 2 μl SpeedBeads 3 EDAC in 62 μl of 20% PEG 8000/1.5 M NaCl, mixing and incubating for 10 min at room temperature. SpeedBeads were collected on a magnet, washed twice by adding 150 μl 80% ethanol for 30 s each, collecting beads and aspirating the supernatant. After air-drying the SpeedBeads, DNA was eluted in 25 μl TT and the DNA contained in the eluate was amplified for 12 cycles in 50 μl PCR reactions using the NEBNext High-Fidelity 2× PCR Master Mix or the NEBNext Ultra II PCR Master Mix and 0.5 μM each of primers Solexa 1GA and Solexa 1GB. Libraries were cleaned up as described above by adding 36.5 μl 20% PEG 8000/2.5 M NaCl and 2 μl Speedbeads, two washes with 150 μl 80% ethanol for 30 s each, air-drying beads and eluting the DNA into 20 μl TT. ChIP library size distributions were estimated after 2% agarose/TBE gel electrophoresis of 2 μl of library, and library DNA amounts measured using the Qubit HS dsDNA kit on the Qubit fluorometer. ChIP input material (1% of sheared DNA) was treated with RNase for 15 min at 37 °C in EB buffer (10 mM Tris pH 8, 0.5% SDS, 5 mM EDTA, 280 mM NaCl), then digested with proteinase K for 1 h at 55 °C and cross-links were reversed at 65 °C for 30 min to overnight. DNA was cleaned up using 2 μl SpeedBeads 3 EDAC in 61 μl of 20% PEG 8000/1.5 M NaCl and washed with 80% ethanol as described above. DNA was eluted from the magnetic beads with 20 μl of TT and library preparation and amplification were performed as described for the ChIP samples. Libraries were sequenced single-end for 75 cycles to a depth of 20–25 million reads on the Illumina NextSeq 500 instrument.

Twist Bioscience insert library cloning . Insert library (10 ng) was PCR amplified in 50 µl mastermix (25 µl Q5 2× MM, 0.25 µl 100 µM pMPRA1-LH (5′-GGTAACCGGTCCAGCTCA), 0.25 µl 100 µM pMPRA1-RH (5′-CGTGTGCTCTTCCGATCT)) under the following conditions: 30 s at 98 °C; followed by 2× cycles of 10 s at 98 °C, 20 s at 65 °C and 15 s at 72 °C; followed by one more cycle with access primers and finally 2 min at 72 °C. The PCR product (~200 bp) was run on a 2% agarose gel, then gel-extracted and cleaned up using PureLink Quick Gel Extraction Kit (Invitrogen; K210012). The library pool was Gibson-assembled into BsaI-linearized pTSS-MPRA-Empty plasmid 57 . This reporter plasmid is based on the background-reduced pGL4.10 reporter backbone (Promega), with a cloning site harbouring the 18 bp pMPRA1-LH sequence followed by tandem BsaI sites for linearization and the TruSeq Read 2-compatible pMPRA1-RH sequence and a downstream landing site for reverse transcription primer RS2, and an eGFP ORF replacing the luciferase 2 gene. pMPRA1 (400 ng) was digested with BsaI in 20 µl (1 µl BsaI, 2 µl CutSmart buffer (NEB)) at 37 °C for 1 h and linearized plasmid was gel-extracted from an agarose/TBE gel using the PureLink Quick Gel Extraction Kit (Invitrogen; K210012). Amplified library was Gibson-assembled into cut plasmid using NEBuilder HiFi 2× master mix with a fivefold molar excess of the library at 50 °C for 1 h in a 4 µl total volume.

Transformation . NEB 5-alpha (10 µl) chemically competent Escherichia coli (high efficiency, NEB, C2988) were transformed with 1 µl of Gibson assembly reaction, mixed and placed on ice for 30 min. Cells were subjected to heat-shock at 42 °C for 30 s, and then placed onto ice again for 5 min. A total of 950 µl of room temperature SOC was added and the cells were then incubated at 37 °C for 1 h while mixing at 250 rpm. Then, 200 µl of SOC culture was added to 200 ml of LB broth containing 100 µg ml −1 carbenicillin and agitated at 37 °C, 225 rpm for 16–18 h before isolating plasmids using the PureLink HiPure Plasmid Maxiprep Kit (Invitrogen; K210006). Note that, for higher-diversity libraries such as fragmented genomes or scrambled oligos, precipitate your assembled plasmid and use electroporation.

Transfection . About 800,000 HEK293T cells were seeded into 3 ml DMEM (Cellgro, supplemented with 10% FBS (Gibco), 50 U penicillin and 50 μg streptomycin per ml (Gibco)) 24 h before transfection in a 6 cm dish and grown at 37 °C with 5% CO 2 . For each plate and construct, 25 µg plasmid DNA was transfected using Lipofectamine 3000 (Thermo Fisher Scientific) as described by the manufacturer. After 8 h, cells were washed three times with PBS (Gibco) and RNA was extracted using 1 ml Trizol (Thermo Fisher Scientific).

5′-capped RNA enrichment . In total, 10–15 µg RNA (one-third of the total) was dissolved in 15 µl TE′T (10 mM Tris pH 8.0, 0.1 mM EDTA, 0.05% Tween-20), heated to 75 °C for 90 s, then quickly chilled on wet ice. Non-capped RNAs were dephosphorylated with Quick CIP (NEB) and DNA digested by adding 25.25 µl MM1 (double-distilled H 2 O + 0.05% Tween-20, 5 µl CutSmart buffer, 0.75 µl SUPERase in RNase inhibitor (20 U μl −1 , Thermo Fisher Scientific), 0.5 µl RQ1 DNase (Promega), 2 µl Quick CIP). The sample was mixed well and incubated at 37 °C for 90 min. For more complete dephosphorylation of uncapped transcripts, RNA was denatured a second time in MM1 at 75 °C for 30 s in a prewarmed water bath and then quickly chilled on ice for 2 min before incubating the sample at 37 °C for another 30 min. After double cap-enrichment, RNA was purified by adding 500 µl Trizol LS, mixed and then adding 140 µl TE′T and 140 µl CHCl 3  + IAA (24:1, Sigma Aldrich). The samples were vortexed vigorously and subsequently centrifuged for 10 min at 12,000 g at room temperature (21 °C; allowing the CIP to move to the lower phase). After phase separation, the top layer was taken, and RNA was precipitated by mixing first with 1/10 vol 3 M NaOAc and then 1 vol 100% isopropanol. The mixture was incubated at −20 °C for at least 20 min to overnight, then RNA was pelleted by centrifuging at >21,000 g for 30 min at 4 °C. The supernatant was removed, the samples centrifuged once more and all of the remaining liquid was removed before washing the pellet in 400 µl 75% ethanol by inversions followed by centrifugation as before. All ethanol was completely removed, and the pellet was air-dried at room temperature for around 3–5 min.

Library preparation . RNA pellets were resuspended in 5 µl TE′T, heated to 75 °C for 90 s, then quickly chilled on ice. To remove 5′ caps, we next added 10 µl DecappingMM (3.25 µl double-distilled H 2 O + 0.05% Tween-20, 1.5 µl 10× T4 RNA ligase buffer, 4 µl PEG 8000, 0.25 µl SUPERase in RNase inhibitor and 1 µl RppH (NEB)) and incubated the samples at 37 °C for 1 h. After decapping, 5′ adapters were ligated by T4 RNA ligase 1 using 10 µl L1MM (1 µl 10× T4 RNA ligase buffer, 2 µl 10 mM ATP, 1.5 µl 10 µM 5′ adapter, 4.5 µl PEG 8000 and 1 µl T4 RNA ligase 1 (NEB)) and incubating at room temperature (21 °C) for 2 h. For better results, we next repeated the Trizol clean-up as described in the ‘5′-capped RNA enrichment’ section.

RNA pellets were next resuspended in 7 µl annealing MM (1 µl 20 mM RS2 primer (5′-AGCGGATAACAATTTCACACAGGA-3′), 2 µl 700 mM KCl and 4 µl TET (10 mM Tris pH 7.5, 1 mM EDTA, 0.05% Tween-20)) and incubated at 75 °C for 90 s followed by 30 min at 56 °C and then cooled down to room temperature. Next, 13 µl reverse transcription MM (7.5 µl double-distilled H 2 O + 0.05% Tween-20, 1 µl reverse transcriptase buffer (homebrew, 500 mM Tris-HCl pH 8.3, 30 mM MgCl 2 ), 2 µl 0.1 M DTT, 2 µl 10 mM dNTPs, 0.5 µl SUPERase in RNase inhibitor and 1 µl Protoscript II (NEB)) was added and the samples were incubated at 50 °C for 1 h.

The samples were amplified (95 °C for 3 min; then 14 cycles of 95 °C for 30 s, 62 °C for 30 s, 72 °C for 45 s; and 72 °C for 3 min) at 50 µl volume (25 µl 2× Q5 MM (NEB), 2.8 µl 5 M betaine (Sigma-Aldrich), 2 µl 10 µM 3′ barcoding primer, 0.2 µl 100 µM 5′ primer). Note that the extended extension time is important to ensure that all PCR products are completely amplified. After PCR, 1 µl of 20 mg ml −1 RNase A was added and the reaction was incubated at 37 °C for 30 min. PCR reactions were purified using 1.5 vol of beads (2 µl Sera-Mag carbonylated beads (Cytiva), 34 µl 5 M NaCl, 37.5 µl 40% PEG 8000) and sequenced paired-end 50/30 on the NextSeq 500 system.

Data analysis

HOMER2 enables investigation of DNA sequence fragments and TF binding site enrichment while accounting for the general nucleotide content of the input fragments (for example, total GC content) and position-dependent nucleotide biases (Extended Data Fig. 1 and Supplementary Fig. 5 ). While HOMER2 is used here to analyse sequences and account for sequence biases near TSSs, the software package, like its predecessor HOMER 23 , can be used for a wide range of data analysis. HOMER2 is available online ( http://homer.ucsd.edu/homer2/ ; HOMER2 is fully integrated into HOMER starting with v.5).

Position-dependent background selection and motif enrichment in HOMER2 . To account for the influence of position-dependent sequence bias on the calculation of TF binding motif enrichment, we developed improvements in HOMER to select random genomic sequences that contain the same position-dependent sequence features present in a set of target sequences (for example, sequence anchored at the TSS). Previously, when HOMER performed motif-enrichment calculations, random sequences from the genome were selected to construct an empirical background set such that the overall distribution of GC% per sequence in the background set matched that of the target sequence set (Extended Data Fig. 1 ). The primary purpose was to address sequence biases present in regulatory elements that overlap CpG islands, which have different overall sequence characteristics from other regions of the genome. HOMER2 can now select background sequences that preserve positional preferences for nucleotide composition, while still matching the overall GC% fragment distribution in the dataset (Extended Data Fig. 1 ). Position-dependent nucleotide composition can be considered for different k -mer lengths, that is, k  = 1 (simple nucleotide frequency), k  = 2 (dinucleotide frequency), k  = 3 (trinucleotide frequency) and so on. k  = 2 was used in this study. This selection is restricted to datasets with a fixed sequence length to unambiguously assess position-specific information relative to a defined anchor point. In addition to the sequence-selection procedure outlined below, HOMER has also been updated to generate synthetic sequences based on a Markov model describing k -mer transitions in a position-dependent manner to create a background dataset of synthetic sequences with the desired characteristics.

First, target sequences (for example, ±200 bp from TSS locations) are assessed for their overall GC% content and positional k -mer content (Supplementary Fig. 5 ). Target sequences are then sorted by their overall GC% and segregated into n bins corresponding to increasing ranges of overall GC% content ( n  = 10 for this study). Sequences from the genome matching these GC% ranges are identified and putatively assigned to the appropriate GC bin. Next, the genomic sequences assigned to each bin are sampled to generate a set of background sequences that matches the positional k -mer frequencies of the target sequences in each corresponding GC bin. Background sequence selection in each GC bin is performed using an iterative gradient descent approach that progressively removes sequences until the final desired number of background sequences per bin is reached. During each iteration, the overall similarity between the positional k -mer frequency in the target and background sets is computed. Each target and background sequence is then scored against these k -mer frequency differences based on the k -mers present in each sequence at each position and adding the differences in their frequencies (linear combination). Background sequences for the next iteration are then randomly sampled on the basis of the relative fraction of target sequences that have a similar overall difference score, which attempts to match the same overall distribution of the target sequences. This process is repeated until the differences in k -mer frequencies approach zero or a maximum number of iterations is reached. Once the iterative selection process for background sequences in each GC bin is complete, the sequences are combined into a final background sequence set and the distributions of the overall GC% and position-dependent k -mer frequencies are calculated and compared to the original target sequences (Supplementary Fig. 5 ).
These sequences can then be used to more accurately consider TF motif enrichment (below) or be exported for other applications.
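The quantities that drive this selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HOMER2 implementation: the function names are hypothetical, GC bins are taken to be equal-width (the text only says "increasing ranges"), and the iterative resampling loop itself is omitted.

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G/C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_bin(seq, n_bins=10):
    """Assign a sequence to one of n_bins equal-width GC bins (assumption)."""
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)

def positional_kmer_freqs(seqs, k=2):
    """Per-position k-mer frequencies for a set of fixed-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must share one fixed length")
    freqs = []
    for pos in range(length - k + 1):
        counts = Counter(s[pos:pos + k].upper() for s in seqs)
        freqs.append({kmer: c / len(seqs) for kmer, c in counts.items()})
    return freqs

def difference_score(seq, target_freqs, background_freqs, k=2):
    """Linear combination: sum of (target - background) frequency
    for the k-mer that seq carries at each position."""
    score = 0.0
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k].upper()
        score += target_freqs[pos].get(kmer, 0.0) - background_freqs[pos].get(kmer, 0.0)
    return score
```

A background sequence with a strongly negative score carries k-mers that are over-represented in the current background relative to the targets, which makes it a natural candidate for removal in the next iteration.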

To calculate motif enrichment, target and background sequences are scanned for motif matches using HOMER to generate a complete table of motif occurrences. For any given interval (for example, −50 to −40 relative to the TSS), the total number of target and background motif occurrences within that range are tabulated and their enrichment is calculated using the Fisher exact test (hypergeometric distribution). In cases in which there are proportionally fewer motif occurrences in the target sequences compared with the background, the depletion is noted and 1 −  P is reported. This is performed over all positional ranges and for each motif queried, and the resulting P values are further corrected for multiple-hypothesis testing using the Benjamini–Hochberg method. The corrected log-transformed P values are then reported, with comparisons containing depleted values reported as −1 × log[ P ] to reflect depletion (versus 1 × log[ P ] for enrichment). De novo motif enrichment is performed by applying the original HOMER search algorithm using background sequences generated by HOMER2.
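As a rough illustration of the statistics described above, the following standard-library sketch computes a one-sided hypergeometric P value, signs its logarithm by direction and applies the Benjamini–Hochberg correction. It is not HOMER2 code. Two simplifications are assumptions made here: counts are treated as sequences with or without a motif in the interval, and the depleted case uses the inclusive lower tail rather than 1 − P so that log(0) cannot occur.

```python
from math import comb, log

def upper_tail(k, n_target, n_with_motif, n_total):
    """P(X >= k) under the hypergeometric distribution."""
    top = min(n_target, n_with_motif)
    hits = sum(comb(n_with_motif, i) * comb(n_total - n_with_motif, n_target - i)
               for i in range(k, top + 1))
    return hits / comb(n_total, n_target)

def lower_tail(k, n_target, n_with_motif, n_total):
    """P(X <= k) under the hypergeometric distribution."""
    hits = sum(comb(n_with_motif, i) * comb(n_total - n_with_motif, n_target - i)
               for i in range(0, k + 1))
    return hits / comb(n_total, n_target)

def signed_log_p(target_hits, n_target, background_hits, n_background):
    """log(P) for enrichment (<= 0) and -log(P) for depletion (>= 0)."""
    n_total = n_target + n_background
    n_with = target_hits + background_hits
    # Compare proportions without division to decide the direction.
    if target_hits * n_background >= background_hits * n_target:
        return log(upper_tail(target_hits, n_target, n_with, n_total))
    return -log(lower_tail(target_hits, n_target, n_with, n_total))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In practice the correction would be applied to the raw P values across all motifs and intervals before taking logarithms and assigning signs.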

Motif analysis . To unbiasedly identify the most strongly enriched or depleted TF binding sites associated with transcription initiation (Fig. 1 and Extended Data Figs. 2 and 3 ), TSRs from U2OS cells were analysed from −150 to +100 bp relative to the TSS using HOMER2 positionally matched sequences from random genomic regions (GCbins = 10, kmer = 2). Maps of known motif enrichment were calculated for all 463 TF motifs in the HOMER known motif database for each strand separately and at each bp using 3 bp windows (Fig. 1d ). Enrichment at human LCL TSSs was calculated on both strands every 10 bp using 30 bp windows (Fig. 5a ) to directly compare with the analysis based on genetic variants (Fig. 5a ). Motif heat maps were created by clustering the log-transformed P adj values using the correlation coefficient as a distance metric (Cluster v.3.0) 72 and visualizing the resulting heat map using Java TreeView 73 (v.1.1.6r4). Motif occurrences, including histograms showing the density of binding sites relative to the TSS (reported as motifs per bp per TSS), and the average nucleotide frequency was calculated using HOMER’s annotatePeaks.pl tool and visualized with Java TreeView, Excel (v.16.83) or R (v.4.2.2). TSSs were assigned as a promoter TSS if their position was on the same strand and within 200 bp of a GENCODE annotated TSS. Promoter antisense TSSs were defined as those on the opposite strand in the range of −400 to +200 relative to a GENCODE-annotated TSS. Promoter-distal (for example, enhancer) TSSs were defined as those that are found greater than 1 kb from a GENCODE-annotated TSS. Spectral analysis of TF binding sites was performed on TF binding sites found between −120 and −40 bp relative to the TSS, corresponding to the region where many TFs appear to exhibit cyclical patterns of positional preference. The power spectrum (Fourier analysis) was calculated for periods from 0 to 50 bp in 0.1 bp increments on each TF’s strand-specific binding profile. 
The resulting power spectra were normalized to their maximum value to facilitate comparison. To segregate TSSs on the basis of the presence of initiator core promoter elements, the genomic sequence adjacent to each TSS was scanned for a strand- and position-specific match to the IUPAC sequence motif BBCA+1BW (where A+1 denotes the initiating nucleotide) 74 . TSSs with a match were considered to be Inr-containing TSSs.
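The Inr test described here amounts to a position-anchored IUPAC match. A minimal Python sketch follows, assuming the sequence is already oriented on the TSS strand and that `tss_index` is the 0-based index of the +1 nucleotide; the function name and these conventions are illustrative, not taken from HOMER. In IUPAC notation B is C, G or T and W is A or T.

```python
import re

# BBCA(+1)BW: three bases upstream of the TSS, the initiating A, two downstream.
INR_PATTERN = re.compile(r"[CGT][CGT]CA[CGT][AT]")

def has_inr(seq, tss_index):
    """True if the six bases spanning -3..+3 around the TSS match BBCA+1BW."""
    start = tss_index - 3
    end = tss_index + 3
    if start < 0 or end > len(seq):
        return False  # not enough flanking sequence to evaluate the motif
    return INR_PATTERN.fullmatch(seq[start:end].upper()) is not None
```

For example, `has_inr("GGTCCATTGG", 5)` evaluates the window `TCCATT`, which satisfies the pattern.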

Analysis of TSSs and TF binding sites in the context of natural genetic variation . To assess how variation in TF binding site sequences relates to changes in TSS activity, we developed a framework for natural genetic variation analysis within HOMER2 that was inspired by MAGGIE 51 . First, variants in cis near TSSs with significant changes in activity are found (for example, <200 bp) and the alleles associated with higher activity are assigned as ‘active’, while their corresponding alternative alleles are assigned as ‘inactive’. If a variant overlaps a given TF binding site, the change in motif log odds score is calculated for that motif (active − inactive) and a distribution of motif score changes is created for all variants impacting that motif at sites within a given distance interval from the TSS. In MAGGIE’s original formulation, the null assumption underlying the nonparametric rank-sum significance calculations was that changes in motif score are independent of TSS activity, implying that the average of the distribution of motif score changes should be zero. However, variants found near TSSs with differential activity do not follow a uniform pattern and may impact other sequence features that influence transcription initiation in a position-dependent manner (for example, core promoter elements) (Fig. 1a and Extended Data Fig. 5e,f ). This implies that the expected changes in log odds scores for a given motif at a given distance from the TSS may follow a different distribution (that is, with a mean ≠ 0).

To more accurately assess how genetic variation impacts TF binding sites and TSS activity as a function of distance to the TSS, HOMER2 attempts to model the expected distribution of changes in motif log odds scores given the full distribution of variants observed relative to the TSS. This analysis is limited to single-nucleotide variants (SNVs/SNPs), which are evaluated independently from one another. First, a saturation mutagenesis scan is performed on each sequence to identify all of the positions where a match to a given motif may occur, and all of the potential differences in log odds scores as function of the variant and position are recorded. Then, for a given interval, an expected distribution of motif score changes is constructed taking the changes observed in the saturation mutagenesis analysis and then scaling their expected frequency by the total number of variants of each type (that is, A to C) observed at that position relative to the TSS. This expected distribution is then compared to the observed changes in motif scores from the actual variants and their sequences using nonparametric rank-sum tests (Mann–Whitney U -tests) to calculate the significance of the difference. This analysis is analogous to randomizing the sequences containing each variant while preserving the position and nucleotide identity of the variant relative to the TSS. After all motifs are evaluated at all intervals, the resulting P values are then corrected for multiple-hypothesis testing using the Benjamini–Hochberg method. Average changes in motif score at each position and overlap with GWAS-annotated variants are also reported.
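The saturation mutagenesis step and the construction of the expected distribution can be sketched as follows. This is a simplified illustration, not the HOMER2 code: the PWM is a list of per-position log-odds dictionaries, a variant is scored by the best motif window covering it, deltas are alternate minus reference rather than active minus inactive, and all names are hypothetical. The final rank-sum comparison against observed changes is not shown.

```python
BASES = "ACGT"

def pwm_score(pwm, site):
    """Sum of log-odds values for one motif-length site."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_overlapping_score(pwm, seq, pos):
    """Best log-odds score among motif windows that cover position pos.
    Assumes len(seq) >= len(pwm)."""
    width = len(pwm)
    first = max(0, pos - width + 1)
    last = min(pos, len(seq) - width)
    return max(pwm_score(pwm, seq[s:s + width]) for s in range(first, last + 1))

def saturation_scan(pwm, seq):
    """All (position, ref, alt, score change) tuples for every possible SNV."""
    records = []
    for pos, ref in enumerate(seq):
        before = best_overlapping_score(pwm, seq, pos)
        for alt in BASES:
            if alt == ref:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]
            after = best_overlapping_score(pwm, mutated, pos)
            records.append((pos, ref, alt, after - before))
    return records

def expected_deltas(scan, variant_counts):
    """Weight each simulated change by how often that variant type
    (position, ref, alt) was actually observed."""
    expected = []
    for pos, ref, alt, delta in scan:
        expected.extend([delta] * variant_counts.get((pos, ref, alt), 0))
    return expected
```

Here `variant_counts` stands in for the tally of observed variants of each type at each position relative to the TSS.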

For this study, the analysis was performed for all 463 motifs in the HOMER2 known-TF-binding-site library using 30 bp overlapping windows evaluated every 10 bp from −150 to +100 bp relative to the TSS. Larger windows improve the sensitivity by capturing more binding sites of a given TF motif that overlap DNA variants between strains, while smaller windows improve the resolution of the analysis. When analysing differences between strains of mice, active alleles were defined by TSSs that were significantly differentially regulated between strains (see below). For analysing human variants, active alleles were defined on the basis of a positive slope from significant tssQTLs (|slope| > 0.1, P  < 0.25).

As an alternative approach, we performed a second analysis by directly comparing the sequences found in each mouse strain’s genome assembly using the original MAGGIE program (v.1.1.1, https://github.com/zeyang-shen/maggie ; Supplementary Fig. 2c ). This approach differs from that above in that it can assess structural variation and indels in addition to SNVs, but does not model the position-dependent changes in expected motif score differences due to the arbitrary types of sequence variation considered. For this alternative analysis, sequences from −150 to +100 bp relative to the differentially regulated TSS (>1.5 shrunken fold change in one mouse strain versus the other) were analysed. We applied MAGGIE multiple times using an overlapping windowed approach to analyse genomic sequences associated with specific distance intervals from the TSS, similar to our approach above. To perform the MAGGIE calculation for each windowed region and TF binding site, sequences corresponding to a given region relative to the TSS were extracted from either the C57BL/6/mm10 genome or from regions in the same distance range relative to the homologous TSS position in the SPRET genome. These regions were then scanned using HOMER to identify TSSs associated with a match to the given motif in at least one of the two mouse strains. These sequences were then analysed using MAGGIE to identify pairs for which strain-specific mutations were associated with changes in TSS activity. The significance was reported as the log[ P ] reported by MAGGIE, which was then signed to indicate whether the motif was more strongly associated with increasing (negative log[ P ] values) or decreasing (positive log[ P ] values) TSS activity.
TF-binding-site enrichment heat maps were generated by combining signed log[ P ] value results across all TF binding site and TSS-motif distances and then clustering the values using the correlation coefficient (Cluster 3.0) and visualizing the resulting heat map using Java TreeView.

TF–TF spacing and transcription initiation analysis . To analyse patterns of transcription initiation near pairs of TF binding sites (Fig. 4f ), non-redundant binding sites for the first TF binding sites were first identified by scanning all TSRs from −300 bp to +300 bp using HOMER2. These sites were then scanned a second time from −300 bp to +300 to identify instances of the second TF binding sites, and each region containing both TF binding sites was then sorted based on the position of the second TF binding sites relative to the first. Note that, if multiple instances of one of the binding sites are found in the vicinity of the other sites, these regions will be represented multiple times in the list. The sorted regions, centred on the first TF binding site, were then used to generate TF initiation level heat maps using HOMER’s annotatePeaks.pl program using the parameter ‘-ghist’.
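The pairing and sorting logic can be illustrated with a short sketch. It assumes motif positions have already been found per region (for example, by a motif scanner) and are supplied as plain dictionaries; the function and argument names are hypothetical, and the heat map itself would still be produced by annotatePeaks.pl.

```python
def pair_offsets(first_sites, second_sites, max_dist=300):
    """List (region, first-site position, offset of second site) sorted by offset.

    first_sites / second_sites map a region id to a list of motif positions.
    A region appears once per pair, so regions with several sites repeat.
    """
    pairs = []
    for region, first_positions in first_sites.items():
        for p1 in first_positions:
            for p2 in second_sites.get(region, []):
                offset = p2 - p1
                if abs(offset) <= max_dist:
                    pairs.append((region, p1, offset))
    return sorted(pairs, key=lambda record: record[2])
```

Sorting by offset places regions with similar spacing between the two TF binding sites next to each other, which is what makes spacing-dependent initiation patterns visible as diagonals in the resulting heat map.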

MEIRLOP score-based motif analysis . To analyse how well each TF binding site associates with the activity level of TSS, we used MEIRLOP (v.0.0.16) 75 , analysing motif occurrences from −150 to +50 bp relative to the TSS and associating them with the total count of csRNA-seq reads (log 2 ). MEIRLOP assesses the dinucleotide content of each sequence and models them as covariates when performing logistic regression. Statistically significant enrichment coefficients ( P adj  < 0.05) were reported along with their confidence intervals (Extended Data Fig. 2b ; https://github.com/npdeloss/meirlop ).

MEPP positional enrichment scoring . To analyse how the spatial distribution of a TF binding site associates with changes in TSS activity, we used Motif Enrichment Positional Profiles (MEPP, v.0.0.1) 76 . For a given motif PWM (for example, NRF1 motif), MEPP assesses the positional enrichment of the motif relative to a set of scored sequences (for example, the log 2 -transformed fold change between the control siRNA and NRF1 siRNA conditions) that are anchored by a key feature (for example, the TSS). MEPP first calculates the positions of the TF binding site across all regions, generating a heat map to visualize the locations and PWM log-transformed odds scores (Fig. 1e (middle)). Positions are indicated relative to the centre of the PWM motif. At each motif position surrounding the key anchor feature, we calculate the partial Pearson correlation of a sequence’s score with the motif’s PWM log-odds score at that position (while controlling for GC content as a covariate in the calculation). The positional correlation is then presented as a profile below the heat map (Fig. 1e (bottom); https://github.com/npdeloss/mepp ).
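The partial Pearson correlation used at each position can be written compactly with the standard first-order formula. The sketch below is not taken from the MEPP code base: `x` would hold the motif log-odds scores at one position, `y` the sequence scores and `z` the GC content covariate, and all names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_pearson(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

Repeating this calculation for every motif position relative to the anchor yields a positional correlation profile of the kind described above.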

csRNA-seq analysis

TSS locations in the genome and their activity levels were determined using csRNA-seq and generally analysed using HOMER 21 , 23 . Additional information, including analysis tutorials, is available online ( http://homer.ucsd.edu/homer/ngs/csRNAseq/index.html ).

csRNA-seq (small capped RNAs, ~20–60 nucleotides) and total small RNA-seq input sequencing reads were trimmed of their adapter sequences using HOMER (‘homerTools trim -3 AGATCGGAAGAGCACACGTCT -mis 2 -minMatchLength 4 -min 20’) and aligned to the appropriate genome (GRCh38/hg38, GRCm38/mm10) using STAR (v.2.7.10a) 77 with the default parameters. Only reads with a single, unique alignment (MAPQ ≥ 10) were considered in the downstream analysis. Furthermore, reads with spliced or soft clipped alignments were discarded to ensure accurate TSSs. The same analysis strategy was also used to reanalyse previously published Start-seq and PRO-cap TSS profiling data to ensure the data were processed in a uniform and consistent manner, although different adapter sequences were trimmed according to each published protocol.
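The read-level filters can be expressed directly on SAM fields. The following sketch is an illustration rather than the pipeline's code: it assumes 1-based leftmost alignment positions as in SAM, and the function names are hypothetical. In CIGAR strings, N marks a spliced gap and S a soft clip.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def keep_alignment(mapq, cigar, min_mapq=10):
    """Keep uniquely aligned reads without spliced (N) or soft-clipped (S) segments."""
    operations = {op for _, op in CIGAR_OP.findall(cigar)}
    return mapq >= min_mapq and not (operations & {"N", "S"})

def five_prime_position(pos, cigar, reverse):
    """Reference coordinate of the read's 5' end (the putative TSS)."""
    if not reverse:
        return pos
    # On the reverse strand the 5' end is the rightmost aligned base.
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    return pos + ref_len - 1
```

Discarding soft-clipped reads matters because a clipped 5′ end would shift the inferred TSS away from the true initiating nucleotide.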

Two separate transcription initiation analysis strategies were used in this study. In most cases, individual, single-nucleotide TSS positions were independently analysed. For a subset of analyses, we analysed transcription initiation levels in the context of TSRs, which comprise several closely spaced individual TSSs. Individual TSS locations are useful for characterizing spacing relationships at single-bp resolution, whereas TSRs are more useful for describing the overall transcription activity at whole regulatory elements.

To analyse TSSs at the single-nucleotide resolution, the aligned position of the 5′ nucleotide of each csRNA-seq read was used to create a map of putative TSS locations in the genome. To ensure that we use high-quality TSSs that could be reliably quantified across different conditions, only TSS locations with at least 7 reads per 10 7 total aligned reads across all compared replicates and conditions (for example, control siRNA, NRF1 siRNA and so on) were retained for further analysis 21 . Furthermore, any TSS that had higher normalized read density in the small RNA input sequencing was discarded as a likely false-positive TSS location. These sites often include miRNAs and other high-abundance RNA species that are not entirely depleted in the csRNA-seq cap enrichment protocol. To quantify the change in TSS levels between conditions, a unified map of confident TSS positions is first determined across the set of experiments to be compared. Then, the TSS activity levels are assessed for each replicate and each experimental condition by first counting the raw read coverage across each TSS and all experiments and normalizing the dataset using DESeq2’s rlog variance stabilizing transform (v.1.38.3) 78 . Changes in transcriptional activity were then reported as the log 2 -transformed fold change representing the difference between averaged rlog transformed activity levels across conditions (similar to a shrunken fold change). DESeq2 was used to identify significantly differentially regulated TSS, defined as TSS exhibiting a change of at least 1.5 fold and P adj  < 0.05, unless otherwise noted.
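The two TSS-level filters (minimum normalized coverage and comparison against the small RNA input) can be sketched as below. This is a simplified illustration with hypothetical names: counts are assumed to be pooled across the compared samples, ties with the input are kept, and the DESeq2 rlog step is not reproduced.

```python
def reads_per_10m(count, total_aligned):
    """Normalize a raw count to reads per 10^7 total aligned reads."""
    return count * 1e7 / total_aligned

def confident_tss(cs_counts, cs_total, input_counts, input_total, min_norm=7.0):
    """Return TSS positions passing the coverage and input filters.

    cs_counts / input_counts map a (chrom, position, strand) key to a read count.
    """
    kept = []
    for tss, count in cs_counts.items():
        cs_norm = reads_per_10m(count, cs_total)
        input_norm = reads_per_10m(input_counts.get(tss, 0), input_total)
        # Discard weak sites and sites with higher density in the input library.
        if cs_norm >= min_norm and cs_norm >= input_norm:
            kept.append(tss)
    return sorted(kept)
```

The second condition is what removes abundant non-initiating species such as miRNAs, which tend to be more prominent in the input library than in the cap-enriched one.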

TSRs, representing loci with significant transcription initiation activity from one or more individual TSSs on the same strand from the same regulatory element (that is, peaks in csRNA-seq), were defined using HOMER’s findcsRNATSS.pl tool, which uses short input RNA-seq, traditional total RNA-seq and annotated gene locations to find regions of highly active TSSs and then eliminate loci with csRNA-seq signals arising from non-initiating, high-abundance RNAs that nonetheless are captured and sequenced by the method (further details are available in a previous study 21 ). Replicate experiments were first pooled to form meta-experiments for each condition before identifying TSRs. Annotation information, including gene assignments, promoter distal, stable transcript and bidirectional annotations, is provided by findcsRNATSS.pl. To identify differentially regulated TSRs, TSRs identified in each condition were first pooled (union) to identify a combined set of TSRs represented in the dataset using HOMER’s mergePeaks tool using the option ‘-strand’. The resulting combined TSRs were then quantified across all individual replicate samples by counting the 5′ ends of reads aligned at each TSR on the correct strand. The raw read count table was then analysed using DESeq2 to calculate normalized rlog-transformed activity levels and identify differentially regulated TSRs, similar to the analysis of TSSs 78 .

In all cases, normalized genome browser visualization tracks for csRNA-seq data were generated using HOMER’s makeUCSCfile program using the ‘-tss’ option and visualized using either the UCSC Genome Browser 79 or IGV 80 . Annotation of TSS/TSR locations to the nearest gene was performed using HOMER’s annotatePeaks.pl program using GENCODE as the reference annotation.

Additional information about csRNA-seq analysis and tips for analysing TSS data is available at the HOMER website ( http://homer.ucsd.edu/homer/ngs/csRNAseq/index.html ).

Analysis of TSSs across two strains of mice

To analyse how changes in genomic sequence impact TSS activity from two different strains of mice, we first took steps to ensure that TSS positions were conserved and detectable in both mouse strains to avoid analysing TSSs that may exhibit differential activity due to technical/analytical reasons, or TSSs found in non-homologous DNA. This was accomplished by ensuring that all csRNA-seq reads used in the analysis could be aligned to a single, unique location in the genomes of both mouse strains. Furthermore, the location that each csRNA-seq read aligns to in each genome must correspond to a homologous position in the full genome alignment, indicating that they represent the equivalent TSS positions in each strain. This latter filter helps to eliminate TSSs mapping to positions in repetitive or duplicated regions that may not be resolved in one or both genome assemblies.

To identify valid TSS positions for the natural genetic variation analysis, csRNA-seq and small input RNA-seq reads were first trimmed to remove adapter sequences. Reads from each mouse experiment (regardless of the strain of origin) were aligned to both the C57BL/6 (GRCm38/mm10) and SPRET (GCA_001624865.1/SPRET_EiJ_v1) genomes using STAR with the default parameters. Only reads that aligned to a single, unique location (MAPQ ≥ 10) were considered further. Next, TSS positions representing the 5′ end of the reads were mapped to the other mouse strain’s genome using UCSC’s liftOver tool and the corresponding C57BL/6/SPRET liftOver files ( http://hgdownload.cse.ucsc.edu/goldenpath/mm10/liftOver/mm10ToGCA_001624865.1_SPRET_EiJ_v1.over.chain.gz , https://hgdownload.soe.ucsc.edu/goldenPath/GCA_001624865.1_SPRET_EiJ_v1/liftOver/GCA_001624865.1_SPRET_EiJ_v1ToMm10.over.chain.gz ). If the liftOver calculation yielded the same TSS location as the alignment from the other mouse strain, the read was retained for downstream analysis. Confident TSS locations, including DESeq2 rlog-normalized values and log 2 -transformed fold changes were then calculated based on the alignment positions reported in the mm10 genome as described in the sections above. Strain-specific differentially regulated TSSs used in the analysis of natural genetic variation were determined by DESeq2 using P adj of 0.25, resulting in 431,310 variant–TSS pairs for analysis.
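The consistency check between the two alignments can be sketched as a simple lookup. This illustration assumes the liftOver results have already been loaded into a dictionary that maps mm10 positions to SPRET positions; the data structures and function name are hypothetical and do not reflect the actual pipeline code.

```python
def consistent_reads(aln_b6, aln_spret, b6_to_spret):
    """Keep reads whose lifted mm10 position equals their SPRET alignment.

    aln_b6 / aln_spret map a read id to its unique (chrom, pos, strand)
    5' end in each genome; b6_to_spret maps mm10 positions to SPRET
    positions as reported by liftOver.
    """
    kept = {}
    for read_id, b6_pos in aln_b6.items():
        spret_pos = aln_spret.get(read_id)
        if spret_pos is None:
            continue  # read was not uniquely aligned in the SPRET genome
        if b6_to_spret.get(b6_pos) == spret_pos:
            kept[read_id] = b6_pos  # report positions in mm10 coordinates
    return kept
```

Reads that align uniquely in only one genome, or whose lifted position disagrees with the direct alignment, are dropped, which removes TSSs in unresolved repetitive or duplicated regions.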

Analysis of tssQTLs from LCL PRO-Cap data

Variant file merging and filtering. Per-chromosome VCF files containing genotyping data for the samples analysed previously 63 were downloaded from the 1000 Genomes Project ( ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/ ). Using bcftools 81 , these VCFs were then filtered for samples and variants observed from 67 individuals corresponding to PRO-cap samples from Gene Expression Omnibus (GEO) accession GSE110638. The variants in these VCFs were then normalized and named according to their location and alleles using bcftools. The per-chromosome VCFs were then merged, while trimming unobserved alternate alleles, removing sites without called genotypes and requiring minor allele counts of at least 1. These were further filtered to include only variants that were flagged as passing all upstream quality checks and were called with a minimum depth of 10. The resulting VCF was converted into PLINK format using plink2 (v.2.00a2.3LM) 82 , retaining only SNPs with less than 50% of genotype calls missing and a minor allele frequency greater than 0.05. A set of genotype principal component analysis (PCA) covariates was also generated using plink2.
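As an illustration of the two final SNP thresholds (less than 50% missing genotype calls and minor allele frequency above 0.05), the sketch below applies them to toy genotypes. It is not plink2 code: it assumes diploid genotypes encoded as alternate-allele dosages (0, 1 or 2, with `None` for a missing call), and `passes_filters` is a hypothetical helper.

```python
from typing import List, Optional


def passes_filters(dosages: List[Optional[int]],
                   max_missing: float = 0.5,
                   min_maf: float = 0.05) -> bool:
    """Apply the missingness and minor-allele-frequency thresholds to one SNP."""
    called = [d for d in dosages if d is not None]
    if not called:
        return False
    missing_rate = 1 - len(called) / len(dosages)
    if missing_rate >= max_missing:  # keep only SNPs with < 50% missing calls
        return False
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid sample
    maf = min(alt_freq, 1 - alt_freq)
    return maf > min_maf


common = [0, 1, 2, 1, 0, None, 1, 0]   # well genotyped, frequent alternate allele
rare = [0] * 19 + [1]                  # alternate allele frequency 1/40
sparse = [None] * 5 + [1, 1, 0]        # 5 of 8 calls missing

for name, snp in [("common", common), ("rare", rare), ("sparse", sparse)]:
    print(name, passes_filters(snp))
```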

LCL PRO-Cap data processing. To mitigate allele-specific alignment effects, a masked genome was created that set bases at the filtered SNP coordinates to Ns. We then trimmed adapter sequences from reads using fastp and aligned them to the masked genome using STAR with the default parameters. The resulting alignments were then processed using HOMER. Tag directories were then separated on the basis of whether they belonged to PRO-cap or PRO-seq experiments. To call TSSs, PRO-cap and PRO-seq reads were processed in the same manner as csRNA-seq and total small RNA-seq, respectively, to identify a unified set of human TSSs. We then obtained rlog-normalized values from raw counts quantifying coverage of the 5′ ends of reads from each sample, using HOMER annotatePeaks.pl. To avoid bias from extreme outliers, we retained only TSSs with a minimum count of 10 reads in 10 samples. To control for broad genome-wide expression variance, we obtained a set of 50 expression PCA covariates.
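Two steps in this paragraph are simple enough to sketch: masking SNP coordinates with N, and keeping TSSs with at least 10 reads in at least 10 samples. The sketch assumes 0-based SNP coordinates on a single sequence and one raw count per sample; `mask_sequence` and `keep_tss` are hypothetical names, and the "at least" reading of the count threshold is an assumption.

```python
from typing import Iterable, Sequence


def mask_sequence(seq: str, snp_positions: Iterable[int]) -> str:
    """Return a copy of seq with every listed 0-based position replaced by N."""
    bases = list(seq)
    for pos in snp_positions:
        bases[pos] = "N"
    return "".join(bases)


def keep_tss(counts: Sequence[int], min_reads: int = 10, min_samples: int = 10) -> bool:
    """Keep a TSS if at least min_samples samples have at least min_reads reads."""
    return sum(c >= min_reads for c in counts) >= min_samples


masked = mask_sequence("ACGTACGT", [1, 6])
print(masked)

well_covered = [12] * 10 + [0] * 5      # 10 samples reach the threshold
poorly_covered = [12] * 9 + [9] * 6     # only 9 samples reach the threshold
print(keep_tss(well_covered), keep_tss(poorly_covered))
```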

tssQTL calling . To call QTLs from TSS data, we used TensorQTL (v.1.0.3) 83 to determine the link between filtered SNP genotypes and TSS expression phenotypes, while controlling for covariance from sex, genotype PCA and TSS expression PCA. TensorQTL was run in both ‘cis’ and ‘cis_nominal’ modes, with a cis window of 300 bp. We then analysed each cis-nominal QTL SNP within the framework of our natural genetic variation analysis, limiting our analysis to tssQTLs with P  < 0.25 and |slope| > 0.1, leading to a total of 194,746 variant–TSS combinations used in the analysis.
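The final selection of variant–TSS combinations (P < 0.25 and |slope| > 0.1) amounts to a two-condition filter, sketched below on toy records. The field names `pval` and `slope` and the example values are hypothetical and are not meant to reproduce the TensorQTL output format.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, float]]


def keep_pair(rec: Record, max_p: float = 0.25, min_abs_slope: float = 0.1) -> bool:
    """Keep a variant-TSS pair that passes both the P value and effect-size cutoffs."""
    return rec["pval"] < max_p and abs(rec["slope"]) > min_abs_slope


records: List[Record] = [
    {"variant": "v1", "tss": "t1", "pval": 0.01, "slope": 0.35},
    {"variant": "v2", "tss": "t1", "pval": 0.40, "slope": 0.50},   # fails P cutoff
    {"variant": "v3", "tss": "t2", "pval": 0.20, "slope": -0.05},  # fails slope cutoff
    {"variant": "v4", "tss": "t2", "pval": 0.10, "slope": -0.22},
]

kept = [rec["variant"] for rec in records if keep_pair(rec)]
print(kept)
```

Using the absolute value of the slope keeps associations in both directions, which is what the |slope| criterion in the text implies.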

ChIP–seq analysis

ChIP–seq reads were aligned to the appropriate human genome (GRCh38/hg38) using STAR with the default parameters 77 . Only reads with a single, unique alignment (MAPQ ≥ 10) were considered in the downstream analysis. ChIP–seq peaks were determined using HOMER’s findPeaks program using ‘-style factor’. Quantification of ChIP–seq reads associated with peaks, annotation to the nearest annotated gene TSS, calculation of TF binding site presence (−100 to +100 relative to the peak centre) and visualization of normalized read pileups for the genome browser were all conducted using HOMER. The same analysis strategy used for our dnNRF1, NRF1 and YY1 ChIP–seq data was also used to reanalyse TF ChIP–seq data from ENCODE 84 to ensure that the data were processed in a uniform and consistent manner.

TSS-MPRA analysis

Three different DNA insert designs were used for TSS-MPRA in this study. For each design, 400–500 DNA inserts were designed consisting of 155 bp of query DNA (described below) and redundantly coupled with 4 or 5 independent 11 bp barcodes optimized for their molecular properties and diversity 85 to generate a total of 2,000 DNA inserts per design. Genome-encoded TSR sequences queried by TSS-MPRA were designed to capture the sequence from −113 to +42 bp relative to the primary TSS to capture most of the upstream TF binding sites and core promoter region. Key sequences used in the MPRA design are reported in Supplementary Table 2 , and full TSS-MPRA design files and sequences are available at the GEO ( GSE199431 ).
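The −113 to +42 window can be turned into genomic coordinates as in the sketch below. It assumes the primary TSS is position +1, that there is no position 0, and that genomic coordinates are 1-based and inclusive; under those assumptions the window is exactly 155 bp on either strand. `tsr_window` is a hypothetical helper, not a function from the study's code.

```python
from typing import Tuple


def tsr_window(tss: int, strand: str, upstream: int = 113, downstream: int = 42) -> Tuple[int, int]:
    """Return 1-based inclusive (start, end) covering -upstream..+downstream around the TSS."""
    if strand == "+":
        return tss - upstream, tss + downstream - 1
    if strand == "-":
        return tss - downstream + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")


for strand in ("+", "-"):
    start, end = tsr_window(1000, strand)
    print(strand, start, end, end - start + 1)

# Barcode redundancy from the text: either combination gives 2,000 inserts per design.
assert 400 * 5 == 500 * 4 == 2000
```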

To process TSS-MPRA results, raw RNA and DNA sequencing reads, corresponding to the RNA transcripts and input DNA library, respectively, were trimmed for the 5′ adapter sequence GGTAACCGGTCCAGCTCA on the R1 read using cutadapt v.3.4. The trimmed reads were aligned to the reporter library using STAR (v.2.7.10a) 77 , specifying an option to preclude soft-clipping at the 5′ end of R1 (STAR --outSAMattributes All --genomeDir library/tfsweep.STARIndex --runThreadN 12 --readFilesIn file.fastq --outFileNamePrefix star/Sweep_1. --alignEndsType Extend5pOfRead1 --outSAMtype BAM SortedByCoordinate). For DNA plasmid control samples, the uniquely aligned read pairs were counted to later scale transcriptional output. For RNA samples, uniquely aligned read pairs were further processed to identify exact TSSs, yielding a matrix of start sites per sequence position. Any alignments showing mismatches at their 5′ ends were not counted and reporter sequences with fewer than 50 total DNA alignments were also ignored. Specific DNA insert details and analysis approaches tailored for each TSS-MPRA design are described below.

TF-binding-site insertion analysis. A total of 13 TSRs were selected from the human genome (Supplementary Table 2) and DNA inserts were designed to introduce one of 7 binding-site sequences (Sp1, NRF1, NFY, YY1, p53, CTCF or a control sequence; Fig. 3b,c) at positions −50, −20 and +25 bp relative to the primary TSS. Binding-site insertion replaced the endogenous sequence at each location to maintain the relative spacing of regulatory element DNA outside of the TF binding site insertion. For RNA samples, the transcriptional output of each sequence was summarized by counting alignments with start sites near the designated promoter region (±7 bp from the TSS). These values were scaled based on plasmid DNA levels and transformed as described for the TF-binding-site sweep analysis below. Overall levels were reported as the log2-transformed ratio relative to the control binding site insertion that does not match any known TF binding sites.

TF-binding-site sweep analysis. To assess the position-dependent impact of TF binding sites on transcription initiation levels without bias, we first designed a synthetic promoter sequence that lacked matches to any of the known motifs in the HOMER TF motif library (Supplementary Table 2). TF binding sites corresponding to Sp1, NFY, NRF1 and YY1 were then inserted at 2 bp intervals along the length of the sequence (Fig. 4a–c). NRF1 was additionally inserted at the same 2 bp intervals into either the WT TOB2 enhancer or a mutant version lacking endogenous NRF1 binding sites (Fig. 4e). Binding-site insertion replaced the endogenous sequence at each location to maintain the relative spacing of regulatory element DNA outside of the TF binding site insertion. To analyse the impact of each TF binding site sweep, a scaling factor for each insert sequence was determined by calculating min(10,000/plasmid, 100), where plasmid is the number of uniquely aligned plasmid DNA read pairs for that insert. After multiplying the start-site counts by the scaling factor, a pseudocount of 1 was added and the values were log2 transformed. Replicates were then merged by averaging. Reference coverage was defined for each promoter–TF binding site pair by calculating the average position-wise signal across all of the sequences (Extended Data Fig. 7). The difference between the merged signal and reference coverage was smoothed for visualization using the R loess function with span parameter set to 0.1.
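The scaling and transformation steps can be written out as a short sketch. It assumes `plasmid_count` is the number of uniquely aligned plasmid DNA read pairs for an insert and that start-site counts are given per position; the function names and example numbers are hypothetical, and the loess smoothing step (performed in R) is not reproduced here.

```python
import math
from typing import List, Sequence


def scaling_factor(plasmid_count: float) -> float:
    """min(10,000/plasmid, 100): the cap limits the boost given to low-coverage inserts."""
    if plasmid_count <= 0:
        raise ValueError("plasmid_count must be positive")
    return min(10_000 / plasmid_count, 100)


def normalize_start_sites(start_counts: Sequence[float], plasmid_count: float) -> List[float]:
    """Scale counts by the plasmid-based factor, add a pseudocount of 1, take log2."""
    factor = scaling_factor(plasmid_count)
    return [math.log2(count * factor + 1) for count in start_counts]


def merge_replicates(replicates: Sequence[Sequence[float]]) -> List[float]:
    """Average the normalized signal position by position across replicates."""
    return [sum(values) / len(values) for values in zip(*replicates)]


rep1 = normalize_start_sites([0, 1, 3, 7], plasmid_count=10_000)   # factor 1
rep2 = normalize_start_sites([0, 3, 3, 15], plasmid_count=10_000)  # factor 1
merged = merge_replicates([rep1, rep2])
print(rep1, rep2, merged)
print(scaling_factor(60))  # 10,000/60 exceeds the cap, so the factor is 100
```

Because reporter sequences with fewer than 50 DNA alignments are excluded upstream, the cap of 100 only comes into play for inserts with between 50 and 100 plasmid read pairs.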

TF-binding-site mutation analysis . In total, 133 TSRs were randomly selected and TF binding sites matching 20 different motifs were mutated to monitor their impact on transcription initiation patterns. In addition to the wild-type TSR sequence [−113, +42], a separate insert was designed for each motif where one or more TF binding sites were found. Sequences corresponding to the TF binding site were replaced with the same control sequence in each case starting at the 5′ end and continuing the length of the binding site (control sequence: TAACTGTAATACCTCCTGAAGTC). Only motifs with matches to at least 20 different TSRs were used in the analysis (Sp1, NFY, NRF1, YY1 and ETS; Fig. 5e  and Extended Data Fig. 10d ). DNA-scaled and log-transformed start site values were calculated as with the above site sweep analysis. For each motif, start positions were classified as enriched or depleted based on an overrepresentation analysis in genomic contexts ( P adj  < 0.01, data from U2OS motif enrichment analysis; Fig. 1d ). TSS shifts were determined by finding the weighted mean TSS position per insert, then subtracting the mean position per mutant from the mean position of the relevant control.
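The TSS shift calculation can be sketched as follows. The sketch assumes the weights are the per-position start-site signal of each insert (which signal is used as the weight is an assumption here) and follows the sign convention stated above, control minus mutant; the function names and toy profiles are hypothetical.

```python
from typing import Sequence


def weighted_mean_position(positions: Sequence[float], weights: Sequence[float]) -> float:
    """Mean start position of one insert, weighted by the signal at each position."""
    total = sum(weights)
    if total == 0:
        raise ValueError("insert has no start-site signal")
    return sum(p * w for p, w in zip(positions, weights)) / total


def tss_shift(control_mean: float, mutant_mean: float) -> float:
    """Subtract the mutant mean position from the control mean position."""
    return control_mean - mutant_mean


positions = [-2, -1, 0, 1, 2]        # positions relative to the annotated start
control_signal = [0, 1, 2, 1, 0]     # initiation centred on position 0
mutant_signal = [0, 0, 1, 2, 1]      # initiation moved one base downstream

control_mean = weighted_mean_position(positions, control_signal)
mutant_mean = weighted_mean_position(positions, mutant_signal)
print(control_mean, mutant_mean, tss_shift(control_mean, mutant_mean))
```

With this convention, a mutant whose initiation moves downstream of the control yields a negative shift.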

Reproducibility analysis. The reproducibility of MPRA results was assessed by (1) comparing the variation in initiation activity levels among different barcode replicates for the four TSRs displayed in Fig. 3b (Extended Data Fig. 6a); (2) comparing summary heat maps of the TSSs and their normalized activity levels captured by TSS-MPRA for 2 bp incremental sweeps of TF binding sites (Sp1, NRF1, NFY and YY1) from −100 to +40 using four different barcode sets (Extended Data Fig. 7a–e); and (3) comparing TSS activity levels for a given DNA fragment across at least two biological replicates and between independent barcodes for each TSS-MPRA library (Supplementary Figs. 6 and 7).

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

All raw and processed data generated for this study can be accessed under NCBI GEO accession number GSE199431. Previously published GEO and high-throughput sequencing datasets analysed as part of this study include csRNA-seq data in C57BL/6 mouse macrophages (GSE135498), NFY-knockdown Start-seq data in mouse MEFs (GSE115110), PRO-cap data from 69 human LCLs (GSE110638), and NRF1 ChIP–seq data from ENCODE in HepG2 (ENCSR853ADA) and K562 (ENCSR494TDU) cells. Genomes used for the analysis of sequencing data include human: GRCh38/hg38 ( https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz ); mouse (C57BL/6): GRCm38/mm10 ( https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz ); mouse (SPRET): GCA_001624865.1 ( https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_001624865.1/ ); and green monkey: Chlorocebus sabaeus 1.1/chlSab2 ( https://hgdownload.soe.ucsc.edu/goldenPath/chlSab2/bigZips/chlSab2.fa.gz ). Gene annotations were downloaded from GENCODE (human v34, mouse v25; https://www.gencodegenes.org/ ). Disease-risk variants from the GWAS Catalog mapping to hg38 were downloaded from the UCSC Genome Browser ( https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/gwasCatalog.txt.gz ). Liftover files for mapping between mouse strains were obtained online ( http://hgdownload.cse.ucsc.edu/goldenpath/mm10/liftOver/mm10ToGCA_001624865.1_SPRET_EiJ_v1.over.chain.gz (C57BL/6/mm10 to SPRET) and https://hgdownload.soe.ucsc.edu/goldenPath/GCA_001624865.1_SPRET_EiJ_v1/liftOver/GCA_001624865.1_SPRET_EiJ_v1ToMm10.over.chain.gz (SPRET to C57BL/6/mm10)). Per-chromosome VCF files containing genotyping data for the samples analysed previously 63 were downloaded from the 1000 Genomes Project ( ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/ ). Source data are provided with this paper.

Code availability

Code used to analyse data in this Article has been integrated into HOMER, or is available from the following repositories as described in the methods: HOMER2 (HOMER v.5) ( http://homer.ucsd.edu/homer2/ ), MEIRLOP v.0.0.16 ( https://github.com/npdeloss/meirlop ) and MEPP v.0.0.1 ( https://github.com/npdeloss/mepp ).

Nguyen, T. A. et al. High-throughput functional comparison of promoter and enhancer activities. Genome Res. 26 , 1023–1033 (2016).


Tippens, N. D. et al. Transcription imparts architecture, function and logic to enhancer units. Nat. Genet. 52 , 1067–1075 (2020).


Dao, L. T. M. et al. Genome-wide characterization of mammalian promoters with distal enhancer functions. Nat. Genet. 49 , 1073–1081 (2017).


Zeitlinger, J. Seven myths of how transcription factors read the cis -regulatory code. Curr. Opin. Syst. Biol. 23 , 22–31 (2020).

Sahu, B. et al. Sequence determinants of human gene regulatory elements. Nat. Genet. 54 , 283–294 (2022).

Lambert, S. A. et al. The human transcription factors. Cell 175 , 598–599 (2018).

Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328 , 232–235 (2010).

Takahashi, K. & Yamanaka, S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126 , 663–676 (2006).

Li, X.-Y. et al. Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol. 6 , e27 (2008).

Cusanovich, D. A., Pavlovic, B., Pritchard, J. K. & Gilad, Y. The functional consequences of variation in transcription factor binding. PLoS Genet. 10 , e1004226 (2014).

Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324 , 1720–1723 (2009).


Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26 , 990–999 (2016).

Wasserman, W. W. & Sandelin, A. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 5 , 276–287 (2004).

de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38 , 56–65 (2020).


King, D. M. et al. Synthetic and genomic regulatory elements reveal aspects of cis -regulatory grammar in mouse embryonic stem cells. eLife 9 , e41279 (2020).

Zhu, I. & Landsman, D. Clustered and diverse transcription factor binding underlies cell type specificity of enhancers for housekeeping genes. Genome Res. 33 , 1662–1672 (2023).

Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337 , 1190–1195 (2012).

Halfon, M. S. Studying transcriptional enhancers: the founder fallacy, validation creep, and other biases. Trends Genet. 35 , 93–103 (2019).

Rach, E. A. et al. Transcription initiation patterns indicate divergent strategies for gene regulation at the chromatin level. PLoS Genet. 7 , e1001274 (2011).

Haberle, V. & Stark, A. Eukaryotic core promoters and the functional basis of transcription initiation. Nat. Rev. Mol. Cell Biol. 19 , 621–637 (2018).

Duttke, S. H., Chang, M. W., Heinz, S. & Benner, C. Identification and dynamic quantification of regulatory elements using total RNA. Genome Res. 29 , 1836–1846 (2019).

Frith, M. C. et al. Evolutionary turnover of mammalian transcription start sites. Genome Res. 16 , 713–722 (2006).

Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38 , 576–589 (2010).

Kadonaga, J. T., Carner, K. R., Masiarz, F. R. & Tjian, R. Isolation of cDNA encoding transcription factor Sp1 and functional analysis of the DNA binding domain. Cell 51 , 1079–1090 (1987).

Morgan, W. D. et al. Two transcriptional activators, CCAAT-box-binding transcription factor and heat shock transcription factor, interact with a human hsp70 gene promoter. Mol. Cell. Biol. 7 , 1129–1138 (1987).


Virbasius, C. A., Virbasius, J. V. & Scarpulla, R. C. NRF-1, an activator involved in nuclear-mitochondrial interactions, utilizes a new DNA-binding domain conserved in a family of developmental regulators. Genes Dev. 7 , 2431–2445 (1993).

Grand, R. S. et al. BANP opens chromatin and activates CpG-island-regulated genes. Nature 596 , 133–137 (2021).


Smale, S. T. & Kadonaga, J. T. The RNA polymerase II core promoter. Annu. Rev. Biochem. 72 , 449–479 (2003).

Patel, A. B. et al. Structure of human TFIID and mechanism of TBP loading onto promoter DNA. Science 362 , eaau8872 (2018).

Chen, X. et al. Structural insights into preinitiation complex assembly on core promoters. Science 372 , eaba8490 (2021).

Yao, Y. L., Yang, W. M. & Seto, E. Regulation of transcription factor YY1 by acetylation and deacetylation. Mol. Cell. Biol. 21 , 5979–5991 (2001).

Verheul, T. C. J., van Hijfte, L., Perenthaler, E. & Barakat, T. S. The why of YY1: mechanisms of transcriptional regulation by Yin Yang 1. Front. Cell Dev. Biol. 8 , 592164 (2020).

Liu, X.-S. et al. ZBTB7A acts as a tumor suppressor through the transcriptional repression of glycolysis. Genes Dev. 28 , 1917–1928 (2014).

Ghisletti, S. et al. Identification and characterization of enhancers controlling the inflammatory gene expression program in macrophages. Immunity 32 , 317–328 (2010).

Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53 , 354–366 (2021).

Ngoc, L. V., Cassidy, C. J., Huang, C. Y., Duttke, S. H. C. & Kadonaga, J. T. The human initiator is a distinct and abundant element that is precisely positioned in focused core promoters. Genes Dev. 31, 6–11 (2017).

Shi, Y., Lee, J.-S. & Galvin, K. M. Everything you have ever wanted to know about Yin Yang 1. Biochim. Biophys. Acta 1332 , F49–F66 (1997).


Chen, S., Nagy, P. L. & Zalkin, H. Role of NRF-1 in bidirectional transcription of the human GPAT-AIRC purine biosynthesis locus. Nucleic Acids Res. 25 , 1809–1816 (1997).

Oldfield, A. J. et al. NF-Y controls fidelity of transcription initiation at gene promoters through maintenance of the nucleosome-depleted region. Nat. Commun. 10 , 3072 (2019).


Hu, Z., Killion, P. J. & Iyer, V. R. Genetic reconstruction of a functional transcriptional regulatory network. Nat. Genet. 39 , 683–687 (2007).

Wunderlich, Z. & Mirny, L. A. Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet. 25 , 434–440 (2009).

Yang, A. et al. Relationships between p63 binding, DNA sequence, transcription activity, and biological function in human cells. Mol. Cell 24 , 593–602 (2006).

Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431 , 99–104 (2004).

Fischer, M., Steiner, L. & Engeland, K. The transcription factor p53: not a repressor, solely an activator. Cell Cycle 13 , 3037–3058 (2014).

Gugneja, S., Virbasius, C. M. & Scarpulla, R. C. Nuclear respiratory factors 1 and 2 utilize similar glutamine-containing clusters of hydrophobic residues to activate transcription. Mol. Cell. Biol. 16 , 5708–5716 (1996).

Ptashne, M. et al. How the lambda repressor and cro work. Cell 19 , 1–11 (1980).

Wu, A. C. K. et al. Repression of divergent noncoding transcription by a sequence-specific transcription factor. Mol. Cell 72 , 942–954 (2018).

Luan, J. et al. CTCF blocks antisense transcription initiation at divergent promoters. Nat. Struct. Mol. Biol. 29 , 1136–1144 (2022).

Heinz, S. et al. Effect of natural genetic variation on enhancer selection and function. Nature 503 , 487–492 (2013).

Link, V. M. et al. Analysis of genetically diverse macrophages reveals local and domain-wide mechanisms that control transcription factor binding and function. Cell 173 , 1796–1809 (2018).

Shen, Z., Hoeksema, M. A., Ouyang, Z., Benner, C. & Glass, C. K. MAGGIE: leveraging genetic variation to identify DNA sequence motifs mediating transcription factor binding and function. Bioinformatics 36 , i84–i92 (2020).

Ungerbäck, J. et al. Pioneering, chromatin remodeling, and epigenetic constraint in early T-cell gene regulation by SPI1 (PU.1). Genome Res. 28 , 1508–1519 (2018).

Lu, J., Pazin, M. J. & Ravid, K. Properties of Ets-1 binding to chromatin and its effect on platelet factor 4 gene expression. Mol. Cell. Biol. 24 , 428–441 (2004).

Verschueren, K. et al. SIP1, a novel zinc finger/homeodomain repressor, interacts with Smad proteins and binds to 5′-CACCT sequences in candidate target genes. J. Biol. Chem. 274 , 20489–20498 (1999).

Sims, K. et al. Kdo 2 -lipid A, a TLR4-specific agonist, induces de novo sphingolipid biosynthesis in RAW264.7 macrophages, which is essential for induction of autophagy. J. Biol. Chem. 285 , 38568–38579 (2010).

Dorrington, M. G. & Fraser, I. D. C. NF-κB signaling in macrophages: dynamics, crosstalk, and signal integration. Front. Immunol. 10 , 705 (2019).

Guzman, C. et al. Combining TSS-MPRA and sensitive TSS profile dissimilarity scoring to study the sequence determinants of transcription initiation. Nucleic Acids Res. 51 , e80 (2023).

Hansen, A. S., Pustova, I., Cattoglio, C., Tjian, R. & Darzacq, X. CTCF and cohesin regulate chromatin loop stability with distinct dynamics. eLife 6 , e25776 (2017).

Houbaviy, H. B., Usheva, A., Shenk, T. & Burley, S. K. Cocrystal structure of YY1 bound to the adeno-associated virus P5 initiator. Proc. Natl Acad. Sci. USA 93 , 13577–13582 (1996).

Cianfrocco, M. A. et al. Human TFIID binds to core promoter DNA in a reorganized structural state. Cell 152 , 120–131 (2013).

Sullivan, K. D., Galbraith, M. D., Andrysik, Z. & Espinosa, J. M. Mechanisms of transcriptional regulation by p53. Cell Death Differ. 25 , 133–143 (2018).

Verfaillie, A. et al. Multiplex enhancer-reporter assays uncover unsophisticated TP53 enhancer logic. Genome Res. 26 , 882–895 (2016).

Kristjánsdóttir, K., Dziubek, A., Kang, H. M. & Kwak, H. Population-scale study of eRNA transcription reveals bipartite functional enhancer architecture. Nat. Commun. 11 , 5963 (2020).

Vuckovic, D. et al. The polygenic and monogenic basis of blood traits and diseases. Cell 182 , 1214–1231 (2020).

Ryczek, N., Łyś, A. & Makałowska, I. The functional meaning of 5′UTR in protein-coding genes. Int. J. Mol. Sci. 24 , 2976 (2023).

Mularoni, L., Sabarinathan, R., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations. Genome Biol. 17 , 128 (2016).

Lam, M. T. Y. et al. Dynamic activity in cis-regulatory elements of leukocytes identifies transcription factor activation and stratifies COVID-19 severity in ICU patients. Cell Rep. Med. 4 , 100935 (2023).

Branche, E. et al. SREBP2-dependent lipid gene transcription enhances the infection of human dendritic cells by Zika virus. Nat. Commun. 13 , 5341 (2022).

Meade, B. R. et al. Efficient delivery of RNAi prodrugs containing reversible charge-neutralizing phosphotriester backbone modifications. Nat. Biotechnol. 32 , 1256–1261 (2014).

Hetzel, J., Duttke, S. H., Benner, C. & Chory, J. Nascent RNA sequencing reveals distinct features in plant transcription. Proc. Natl Acad. Sci. USA 113 , 12316–12321 (2016).

Texari, L. et al. An optimized protocol for rapid, sensitive and robust on-bead ChIP-seq from primary cells. STAR Protoc. 2 , 100358 (2021).

de Hoon, M. J. L., Imoto, S., Nolan, J. & Miyano, S. Open source clustering software. Bioinformatics 20 , 1453–1454 (2004).

Saldanha, A. J. Java Treeview-extensible visualization of microarray data. Bioinformatics 20 , 3246–3248 (2004).

Ngoc, L. V., Cassidy, C. J., Huang, C. Y., Duttke, S. H. C. & Kadonaga, J. T. The human initiator is a distinct and abundant element that is precisely positioned in focused core promoters. Genes Dev. 31, 6–11 (2017).

Delos Santos, N. P., Texari, L. & Benner, C. MEIRLOP: improving score-based motif enrichment by incorporating sequence bias covariates. BMC Bioinformatics 21 , 410 (2020).

Delos Santos, N. P., Duttke, S., Heinz, S. & Benner, C. MEPP: more transparent motif enrichment by profiling positional correlations. NAR Genom. Bioinform. 4 , lqac075 (2022).

Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29 , 15–21 (2013).

Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15 , 550 (2014).

Kent, W. J. et al. The Human Genome Browser at UCSC. Genome Res. 12 , 996–1006 (2002).

Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29 , 24–26 (2011).

Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27 , 2987–2993 (2011).

Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4 , 7 (2015).

Taylor-Weiner, A. et al. Scaling computational genomics to millions of individuals with GPUs. Genome Biol. 20 , 228 (2019).

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489 , 57–74 (2012).


Hawkins, J. A., Jones, S. K. Jr, Finkelstein, I. J. & Press, W. H. Indel-correcting DNA barcodes for high-throughput sequencing. Proc. Natl Acad. Sci. USA 115 , E6217–E6226 (2018).

Core, L. J. et al. Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers. Nat. Genet. 46 , 1311–1320 (2014).


Acknowledgements

We thank C. D. A. Saldanha for expert technical assistance and L. V. Ngoc, J. T. Kadonaga and the members of the Duttke laboratory for discussions of the manuscript; and the staff at Life Science Editors for editing services. This work was supported by NIH grants R00GM135515 and WSU startup funds (to S.H.D.); R01GM134366, U01DA051972, U01AI150748, U19AI135972, P01HL147835 and R01MH127077 (to C.B.); and R01GM129523 (to S.H.). S.H. received additional support from NIH grants R01GM134366, P30DK063491, P30DK120515, U01AI150748, U19AI135972 and R21DA056177. A.F.C. was supported by NIH career award K08AI130381 as well as a Career Award for Medical Scientists from the Burroughs Wellcome Fund; C.G. in part by predoctoral fellowship F31HG011823; N.P.D.S. in part by an NLM Training Grant (T15LM011271) and the Katzin Prize Endowed Fund. B.R.M. is a WSU STARS scholar. This publication includes data generated at the UC San Diego IGM Genomics Centre using an Illumina NovaSeq 6000 system that was purchased with funding from a National Institutes of Health SIG grant (S10 OD026929).

Author information

Authors and affiliations.

School of Molecular Biosciences, College of Veterinary Medicine, Washington State University, Pullman, WA, USA

Sascha H. Duttke & Bayley R. McDonald

Department of Medicine, Division of Endocrinology, U.C. San Diego School of Medicine, La Jolla, CA, USA

Carlos Guzman, Max Chang, Nathaniel P. Delos Santos, Sven Heinz & Christopher Benner

Department of Pathology and Medicine, U.C. San Diego School of Medicine, La Jolla, CA, USA

Jialei Xie & Aaron F. Carlin


Contributions

S.H.D., A.F.C., S.H. and C.B. oversaw the overall design and execution of the project. The experiments were performed by S.H.D., J.X., C.G. and S.H. The computational analyses were performed by S.H.D., M.C., N.P.D.S., B.R.M. and C.B.; S.H.D. and C.B. were primarily responsible for writing the manuscript.

Corresponding authors

Correspondence to Sascha H. Duttke, Sven Heinz or Christopher Benner.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Charles Danko, Matthew Weirauch and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 HOMER2 - a new TF motif and sequence analysis approach that allows controlling for both single-nucleotide positional and fragment-wide sequence biases.

By contrast to most current motif finding methods that normalize across the complete sequence fragment in the analysis, HOMER2 accounts for both fragment-wide and single-nucleotide positional biases of input sequences when it selects background sequences from the genome, such as nucleotide preferences naturally found near TSSs.

Extended Data Fig. 2 TSS-centric analysis reveals a spatial grammar of TFs.

a , De novo motif enrichment analysis of TSRs active in U2OS cells by HOMER2 reveals the TF motifs with the highest enrichment in transcribed regulatory elements. b , Association of TSR-enriched TF binding sites with transcription initiation frequency calculated using MEIRLOP 75 using initiation strength as a covariate. c , Examples of TF binding sites with natural preferential positioning in the proximity of human TSSs. Positional enrichment or depletion was calculated using HOMER2, accounting for both positional (i.e. TSS-proximal) and fragment-wide nucleotide content bias. d , Binding sites of the repressor ZBTB7A are depleted near active TSSs, especially downstream where the RNA Polymerase II initiation complex is proposed to initially contact the TSR. e , Many TF binding sites, including Sp1, NFY and NRF1, but not all (e.g. CTCF), have preferred 10.5 bp helical positioning relative to active TSSs when found between −120 and −40 bp, as shown by Fourier analysis (please see Methods for details). f , Binding sites of cell type-specific activator TFs often show preferential positioning relative to the TSS only in cells that express them. TF binding site distribution profiles for HepG2-specific HNF1 and ubiquitous Sp1/GC-box motifs across TSSs identified in K562, U2OS and HepG2 cell lines by csRNA-seq. g , Preferential TF binding site localisation is highly conserved across species and methods. Motif density plot of the NFY binding site relative to TSSs identified using csRNA-seq from different human and green monkey (Vero cells) cell lines as well as TSSs identified using PRO-cap in K562 cells 86 . h , The upstream, rather than the downstream, promoter region is more conserved. Aggregate PhyloP scores at single base resolution relative to active TSSs in U2OS cells reveal that upstream regions, especially around −30 bp and −50 bp relative to the TSS, are preferentially conserved. i , Genomic nucleotide frequency plots relative to TSSs containing or lacking a canonical Initiator motif (BBCA+1BW) 36 at the TSS. j , Frequency and patterns of position-specific TF binding sites are more prominent relative to TSSs that lack canonical core promoter motifs. Normalized NFY, Sp1, NRF1, YY1 and ZBTB7A motif occurrences are displayed relative to TSSs containing (red) or lacking (blue) a canonical human Initiator motif (BBCABW) at the TSS (grey). k , Helical periodicity of TF binding sites found between −120 and −40 bp relative to the TSS is more prominent at TSSs lacking a canonical human Initiator motif (BBCABW). Fourier analysis of the TF binding sites NFY, Sp1, NRF1, YY1 and ZBTB7A revealed preferred 10.5 bp helical positioning relative to TSSs lacking a human Initiator for position-dependent TFs.

Extended Data Fig. 3 TF occupancy at differentially positioned binding sites.

a , Quantification of TF knockdown: Western blot 24 h following knockdown of YY1 and NRF1 in replicates using beta-Actin as a control (n = 2, representative experiment shown). For original images please see Supplementary Fig. 4 . b , Validation of ChIP-seq data: De novo motif finding of ChIP-seq peaks using HOMER2 identifies the expected motif for each antibody target. FRIP stands for Fraction of Reads In Peaks. c , Overlap between NRF1, YY1, and dnNRF1 binding and TSS reveals enhanced binding of NRF1 to TSSs both up and down regulated by siRNA targeting NRF1 relative to invariant TSS. d , e , Example loci with ChIP and csRNA-seq data. f , Position dependent function of human YY1. Human U2OS TSSs were ranked from gain to loss of transcription initiation activity upon YY1 knockdown and analysed for YY1 motif positional enrichment (dark red). g , TSSs downregulated upon YY1 knockdown have YY1 bound within its preferred region, as assessed by ChIP-seq, while derepressed TSSs have YY1 binding further downstream. h , Overlap between NRF1, YY1, and dnNRF1 binding and TSS reveals enhanced binding of YY1 to TSSs both up and down regulated by siRNA targeting YY1 relative to invariant TSS. i , Position dependent function of mouse NFY. Mouse embryonic fibroblast (MEF) TSSs were ranked from gain to loss of transcription initiation activity upon NFYa knockdown 39 and analysed for NFY motif positional enrichment.

Extended Data Fig. 4 Differential TSS usage can impact gene isoform usage and gene expression.

a , Example locus ( SDR39U1 ) where loss of NRF1 by siRNA knockdown led to the induction of several TSSs near to and upstream of an NRF1 binding site motif (top). RNA-seq profiling revealed that cells treated with NRF1 siRNA expressed a novel isoform with unique splice junctions not observed in the control sample (bottom). b , Changes in TSS levels impact gene expression. Moving average of the number of either upregulated or downregulated TSSs overlapping the annotated promoter (within 200 bp) of genes ranked by their change in RNA-seq transcript levels (orange, grey). Also depicted is the average of the total promoter csRNA-seq level change (i.e. integrated across all TSSs in the promoter region, blue).

Extended Data Fig. 5 dnNRF and natural genetic variation analysis.

Overexpression of dnNRF1 results in repression of transcription initiation at TSRs in the vicinity of dnNRF1 binding sites. a , Genome browser tracks at an example locus ( UTP11 ) showing the HA-tagged dnNRF1 ChIP-seq read density and normalized csRNA-seq TSS activity levels in eGFP or dnNRF1 expressing U2OS cells. b , TSRs strongly down-regulated by dnNRF1 expression are also bound by dnNRF1, as assessed by ChIP-seq. The average ChIP-seq normalized read density or fraction of TSRs containing the NRF1 binding site from −150 to +50 relative to the TSS are plotted as a function of the log2 ratio of TSS activity between dnNRF1 and GFP expressing U2OS cells as measured by csRNA-seq. c , TSSs downregulated upon overexpression of HA-tagged dominant negative NRF1 (dnNRF1) have NRF1 bound within its preferred region upstream of the TSS, as assessed by ChIP-seq. d , Overlap between NRF1, YY1, and dnNRF1 binding and TSS reveals enhanced binding of NRF1/dnNRF1 to TSSs down regulated by dnNRF1 expression relative to invariant TSS. e , Distribution of single nucleotide variants relative to the TSS used in the analysis of mouse (C57Bl/6 and SPRET) bone marrow derived macrophages (BMDMs) comparing different strains and f , human tssQTLs found in LCLs. g , Analysis of the genome-wide significance of the association between mutations in the NFY binding site, or h , NRF1 binding site and the change in transcription initiation in macrophages from each mouse strain, calculated for each TF binding site as a function of their relative distance to the TSS. Positive logP values indicate that mutations predicted to cause reduced TF binding are more strongly associated with reduced initiation, while negative logP values indicate that the mutated TF binding sites are more strongly associated with increased initiation.
Distance-dependent profiles were calculated using TF binding sites identified in overlapping windows of 30 bp at 10 bp increments from −150 to +100 bp relative to the TSS. i , j , TF binding sites for TLR4 pathway activated TFs that recruit RNA polymerase II are preferentially positioned relative to TSSs that increase transcription following KLA treatment. Motif distribution profiles relative to TSSs of TSRs that were induced, repressed or did not change upon stimulation of bone marrow derived macrophages with KLA for the binding sites of i , NF-κB (p65) and j , AP1. k , Distribution of the p53 DNA binding site relative to active TSS from U2OS cells.

Extended Data Fig. 6 TSS-MPRA results are highly reproducible.

a , Variation in initiation activity levels among different barcode replicates for the four TSRs displayed in Fig. 3b that shows the impact of differential NRF1 binding site position on TSS activity for a TSR from the EIF2S1 locus (depicted in sense). TSS-MPRA captures the impact of adjusting TF binding site positions on transcription initiation at single-nucleotide resolution. b , Normalized TSS activity profiles on a synthetic DNA insert measuring the impact of adjusting the YY1 binding site position by 2 bp increments, showing waves of increased and reduced transcription initiation and shifting TSS. c , Examples of normalized TSS activity profiles measured by adjusting the position of the NFY binding site every 2 bp, showing the importance of helical positioning for TF potency in recruiting RNAP II.

Extended Data Fig. 7 TSS-MPRA analysis of TF binding site sweeps reveals additional evidence for position-dependent TF function.

a , Summary heat maps of the TSSs and their normalized activity levels captured by TSS-MPRA for a 2 bp incremental sweep of the Sp1 binding site from −100 to +40 across an artificial, TF motif-depleted DNA background with four different barcode sets. The Sp1 binding site position is shown in blue. b , Vertical normalization of the Sp1 binding site sweep TSS-MPRA reports the log2 fold change in TSS activity relative to the average activity of that TSS across all possible Sp1 binding site positions. This normalization highlights TSSs that are activated (red) and repressed (blue) relative to the average level of activity for each binding site position. The Sp1 binding site position is shown in blue. c , Summary heat maps of the TSSs and activity levels captured by TSS-MPRA for 2 bp sweeps of the YY1 binding site, d , NFY binding site and e , NRF1 binding site from −100 to +40 across an artificial, TF motif-depleted DNA background. Only BC#1 of 4 is shown. f , Line plots showing that the position-dependent impact of sweeping TF binding sites in a synthetic sequence is highly reproducible as independently assessed for each of the four barcode sets. Data reported in the manuscript were obtained by averaging all four barcodes and both biological replicates.

Extended Data Fig. 8 Multiple experimental approaches reveal consistent position-dependent functional profiles that are unique to each TF.

Comparison of patterns from natural preferred TF binding site positional enrichment in the genome relative to active TSSs (black, i.e. Fig. 1d ), impact of TF knockdown on transcription of proximate TSSs as a function of distance to the TF binding site (orange, i.e. Fig. 1e , flipped), impact of TF binding site mutations due to natural genetic variation on transcription (pink/yellow, i.e. Figs. 2f , 5a ) and a binding site’s ability to impact transcription as captured by TF binding site sweeps with TSS-MPRA (blue, i.e. Fig. 4c ) altogether reveal consistent, position-dependent functions and superhelical preferences for a , YY1, b , Sp1, c , NRF1 and d , NFY. Each profile was scaled such that the most extreme value was set to 1/−1. e , Hypothetical model for TF-mediated TSS selection and dispersed initiation. TFs can recruit or block transcription initiation based on their spacing. In most TSRs, this spacing-dependent function of TFs is integrated over several TFs. As TF binding is transient, different sets of TFs can be present at a given moment at homologous TSRs in sister chromosomes or different cells of the same kind or vary at the same TSR over time. f , The transcribed putative TOP2 enhancer region contains an NRF1 binding site. UCSC browser image and HOMER-annotated motifs with the NRF1 binding site mutated in the screen highlighted in red.

Extended Data Fig. 9 Spacing between TFs can coordinately guide transcription initiation. Additional examples of TF-TF interaction.

a , Model for TF-mediated RNA Polymerase II initiation and coordinated TSS selection by activator TFs NRF1 and Sp1 based on their spatial preferences. TSRs containing both b , YY1 and Sp1 binding sites, c , NRF1 and ZBTB7A (LRF) binding sites, and d , NFY and Sp1 binding sites, sorted by the distance between the TF binding sites with csRNA-seq initiation levels shown in forward (red) and reverse (blue) direction. The impact of YY1, NRF1, and NFY siRNA knockdown on activity for + strand TSSs is shown on the right with upregulated TSSs shown in green and downregulated TSSs in purple. TSS patterns and their regulation at loci containing YY1 and Sp1 binding sites reflect the unique preferred initiation profiles associated with the YY1 and Sp1 binding sites ( b ), while TSS patterns between the ZBTB7A and NRF1 binding sites show little to no interaction ( c ). d , Analysis of Sp1 and NFY in mouse fibroblasts 39 suggests conservation of position-dependent collaborative TF function across mammals.

Extended Data Fig. 10 Position-dependent TF function in human health and disease.

a , Disease-associated variant rs11122174 , identified through GWAS, is found within an NRF1 binding site and displays position-dependent changes in tssQTL significance relative to nearby TSSs. b , c , Massively parallel mutation analysis of human regulatory elements reveals position-dependent TF function. Mutations of preferentially positioned TF binding sites result in loss of transcriptional activity ( b ), while mutations of TF binding sites in the vicinity of TSSs lead to ectopic TSSs ( c , derepression), demonstrating the dual, position-dependent function of NFY, Sp1, NRF1, and YY1 in human regulatory elements. Mutation of TSS-proximal TF binding sites was also associated with notable changes in TSS selection and thus alternative 5’UTRs, a hallmark of many diseases. d , Relationship of TF binding site position and impact on TSS selection: Mutation of TF binding sites near TSSs or within their naturally enriched positions had the strongest effect on the TSS pattern of regulatory elements, while mutation of sites outside these positions had less impact.

Supplementary information


This file contains Supplementary Figs 1–7 and Supplementary References.

Reporting Summary

Peer Review File

Supplementary Table 1

List of experiments performed in this study available in GSE199431. Note that additional data used in the study, including K562 csRNA-seq and C57BL/6 mouse macrophage csRNA-seq experiments, were previously published and available from GSE135498.

Supplementary Table 2

List of key sequences used in the creation of TSS-MPRA DNA inserts. Full sequence lists are available as supplementary files in FASTA format from GSE199431.

Source data

Source Data Fig. 1

Source Data Fig. 2

Source Data Fig. 5

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Duttke, S.H., Guzman, C., Chang, M. et al. Position-dependent function of human sequence-specific transcription factors. Nature 631 , 891–898 (2024). https://doi.org/10.1038/s41586-024-07662-z


Received : 29 March 2022

Accepted : 04 June 2024

Published : 17 July 2024

Issue Date : 25 July 2024

DOI : https://doi.org/10.1038/s41586-024-07662-z


Comprehensive analysis of Transcription Factors identified novel prognostic biomarker in human bladder cancer

Xuanxuan Zou, Youzhi Wang, Miaomiao Wang, Boqiang Zhong.


✉ Corresponding author: Ning Jiang. Tianjin Institute of Urology, The Second Hospital of Tianjin Medical University, Tianjin, China. Email: [email protected] ; Tel: +86 02288329296

# Yihao Liao, Xuanxuan Zou and Keke Wang contributed equally.

Competing Interests: The authors have declared that no competing interest exists.

Received 2021 Jan 21; Accepted 2021 Jul 20; Collection date 2021.

This is an open access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/ ). See http://ivyspring.com/terms for full terms and conditions.

Background: Transcription factors (TFs) regulate the transcription of proto-oncogenes and tumor suppressor genes during tumor development. However, the role of these TFs in bladder cancer (BCa) remains unclear. The main purpose of this research is to explore whether these TFs can serve as biomarkers for BCa.

Methods: We analyzed the differential expression of TFs in BCa using The Cancer Genome Atlas (TCGA) online database and identified 408 up-regulated TFs and 751 down-regulated TFs. We obtained hub genes with a WGCNA model and measured their RNA levels in BCa cells and tissues. We then investigated the relationship between expression and clinicopathological parameters. Kaplan-Meier curves and the log-rank test were used to analyze the relationship between overall survival (OS) and NFATC1, AKNA and the five-TF combination. RT-PCR assays were conducted to verify these results.

Results: The expression of five TFs (CBX7, AKNA, HDAC4, EBF2 and NFATC1) differed significantly between bladder cancer and normal bladder tissue. In BCa tissue and cell lines, the five TFs were frequently down-regulated and closely related to poor prognosis. The RT-PCR results for the five TFs in bladder cancer and normal bladder tissue were consistent with the database results, and reduced TF levels could significantly induce or restrain the transcription of many critical factors. The expression levels of AKNA and NFATC1 could each serve as an independent biomarker to predict overall survival (P<0.05), and combined detection of the five TFs had higher sensitivity and specificity for bladder cancer. Furthermore, neutrophil levels differed between the high-risk and low-risk groups, which supports the role of the five-TF combination model in the progression of BCa.

Conclusions: Our analysis provides a new TF-associated prognostic model for bladder cancer. The combination of the five identified TFs is an independent prognostic biomarker and could serve as a more effective therapeutic target for BCa patients.

Keywords: Transcription Factors, WGCNA, Bladder cancer, Prognosis, Biomarker

Introduction

Bladder cancer (BCa) is the ninth most common malignant tumor worldwide, with high incidence, recurrence, and mortality rates 1, 2. According to statistics from the United States, there were 80,470 newly diagnosed cases in 2019, including 61,700 male patients and 18,770 female patients. There were 17,670 deaths due to BCa, including 12,870 males and 4,800 females. Based on these statistics, male patients accounted for almost 77% of all BCa cases 3. Although many advances have been made in the treatment of BCa, such as surgical intervention, adjuvant chemoradiotherapy and radiation therapy, the rates of progression, metastasis, and recurrence remain high. Most BCa patients show no significant improvement after a series of treatments 2, 4-7. Recent research on the genetic characterization of bladder cancer has turned toward molecular therapy. The limited treatment options underline the demand for new and effective prognostic biomarkers and new therapeutic modalities 8.

Transcription factors (TFs) are a large class of chromatin-binding proteins associated with different biological processes, including DNA damage repair, transcription, DNA unwinding and DNA replication 9. TFs bind to DNA and activate or inhibit transcription through their transactivation or trans-repression domains 10. The function of TFs is not limited to DNA transcription: recent studies have shown that TFs play an increasingly important role in human pathology and tumor progression 11. For example, Vaquerizas et al. reported in 2009 that almost 164 TFs were tightly associated with 227 different diseases 12. Transcriptional dysregulation is generally considered a basic feature of tumorigenesis 13. In BCa, down-regulation of the transcription factors Nrf2, YAP, and c-Myc could inhibit the growth and migration of cisplatin-resistant BCa cells 14, and an imbalance of the transcription factors E2F3 and NF-kB may influence the progression and migration of BCa 15. Nevertheless, the function of TFs and their downstream regulatory targets in BCa remains unclear.

In the present study, we searched the latest TF list, comprising 1930 TF genes, and systematically analyzed their transcription profiles in the TCGA-BLCA cohort to evaluate the potential functions of TFs in BCa. Among them, 408 up-regulated TF genes and 751 down-regulated TF genes were identified. We selected these 1159 TFs associated with the progression of BCa for further analysis and identified five TFs, CBX7, HDAC4, EBF2, NFATC1 and AKNA, which were clearly down-regulated in BCa tissues. In particular, the combination of these five TFs is tightly associated with the progression of BCa and might serve as a potential and critical predictive biomarker for BCa patients.

Materials and Methods

Transcription factor data collection.

TRANSFAC ( http://generegulation.com/pub/databases.html ), CISBP ( http://cisbp.ccbr.utoronto.ca/ ) and TRRUST ( https://www.grnpedia.org/trrust/ ) are among the most commonly used transcription factor databases in bioinformatics analysis. We collected and classified the transcription factors from these databases, removed duplicate TFs, and obtained a combined TF data set (1930 TFs, 1889 of which have expression profile data). The mRNA expression profile data and clinical information for bladder cancer (BLCA) were downloaded from the TCGA database ( https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga ), and differential expression between normal tissues and bladder cancer tissues was analyzed with the DESeq2 package in R. The differential expression results were then intersected with the TF data set to screen out the differentially expressed TFs (DE-TFs) between normal tissue and BCa tissue (|log2FoldChange|>1 & padj<0.05; 408 up-regulated TFs and 751 down-regulated TFs).
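The screening step can be illustrated with a minimal Python sketch. It assumes the differential expression results are already available as (gene, log2FoldChange, padj) rows; the gene names and numbers below are invented for illustration and are not taken from the study, and the function name `screen_de_tfs` is hypothetical.

```python
# Hypothetical DESeq2-style result rows: (gene, log2FoldChange, padj).
# All values are made up for illustration only.
results = [
    ("TF_A", 2.3, 0.001),
    ("TF_B", -1.7, 0.020),
    ("TF_C", 0.4, 0.0005),     # significant, but fold change too small
    ("TF_D", -2.9, 0.300),     # large change, but not significant
    ("NOT_A_TF", 3.1, 0.001),  # passes the cutoffs but is not in the TF list
]
tf_list = {"TF_A", "TF_B", "TF_C", "TF_D"}


def screen_de_tfs(rows, tfs, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Keep TFs with |log2FC| > lfc_cutoff and padj < padj_cutoff,
    split by the direction of change."""
    up, down = [], []
    for gene, lfc, padj in rows:
        if gene not in tfs or padj is None:
            continue  # skip non-TF genes and rows without an adjusted p-value
        if padj < padj_cutoff and abs(lfc) > lfc_cutoff:
            (up if lfc > 0 else down).append(gene)
    return up, down


up, down = screen_de_tfs(results, tf_list)
print("up-regulated:", up)
print("down-regulated:", down)
```

Both conditions must hold at once, so a TF with a large fold change but a non-significant adjusted p-value is excluded, as is a significant TF with a small fold change.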

Weighted gene co-expression network analysis and hub gene extraction

Weighted correlation network analysis (WGCNA) is a systems biology method for describing gene association patterns across samples. It is usually used to identify highly coordinated gene sets and to identify candidate biomarker genes or therapeutic targets according to the interconnectivity of gene sets and their association with phenotypes. Construction of a WGCNA network mainly includes the following steps. First, the Pearson correlation coefficient (indicating the co-expression of two genes) is calculated from the gene expression profiles. Second, to avoid a hard cutoff on the correlation coefficients, WGCNA uses soft thresholding (a weighting function applied to the elements of the adjacency matrix). In this paper, a power function is used to strengthen strong correlations and weaken weak ones, and the most appropriate β value is selected by screening so that the network conforms more closely to the scale-free network model. Third, to capture biologically meaningful relationships beyond pairwise expression correlation, the topological overlap matrix (TOM) is constructed to better model the network. The idea is that the relationship between two genes can be affected by their interactions with intermediate genes, so the TOM describes the similarity of gene expression profiles more accurately. Finally, to facilitate module construction, a dissimilarity matrix is defined by subtracting the topological overlap matrix from 1 (1-TOM). The dissimilarity matrix is used for hierarchical clustering, and the dynamic tree cut method is used to identify the modules. Correlation between the modules and the cancer versus normal tissue trait was then analyzed, and the modules strongly correlated with cancer traits were obtained. These modules may contain important transcription factors involved in the occurrence and development of cancer.
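The first steps described above (Pearson correlation, power-function soft thresholding, TOM, and 1-TOM dissimilarity) can be sketched in plain Python. This is an illustration under stated assumptions rather than the WGCNA package itself: it uses the standard unsigned adjacency a_ij = |cor(i, j)|^β and the standard unsigned topological overlap formula, the toy expression values are invented, and clustering and module detection are omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def adjacency(expr, beta):
    """Soft-threshold adjacency a_ij = |cor(i, j)| ** beta, with a zero diagonal."""
    g = len(expr)
    return [[0.0 if i == j else abs(pearson(expr[i], expr[j])) ** beta
             for j in range(g)] for i in range(g)]


def tom_dissimilarity(adj):
    """Return 1 - TOM, where
    TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    g = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    diss = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(g):
            if i == j:
                continue  # TOM_ii = 1, so the dissimilarity stays 0
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(g) if u not in (i, j))
            tom = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            diss[i][j] = 1.0 - tom
    return diss


# Toy expression matrix: 4 genes x 5 samples (invented values).
expr = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 11],  # strongly co-expressed with the first gene
    [5, 3, 4, 1, 2],
    [3, 1, 4, 1, 5],
]
diss = tom_dissimilarity(adjacency(expr, beta=4))
for row in diss:
    print([round(v, 3) for v in row])
```

The resulting matrix is what a hierarchical clustering step would take as input; strongly co-expressed genes that also share neighbours end up with a small dissimilarity.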

GO analysis and KEGG analysis

To explore the functions and signaling pathways of the genes in the modules strongly related to cancer traits, as screened by WGCNA, GO analysis and KEGG analysis of the extracted genes were performed with the clusterProfiler package in R. GO analysis is divided into three categories: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). KEGG analysis reveals the signaling pathways in which these genes are involved.

Cox regression model construction

The genes in the modules screened by WGCNA, together with their expression profiles, were extracted. First, all genes significantly related to prognosis in TCGA-BLCA were screened by univariate Cox regression analysis using the survival and survminer packages in R (p<0.05). Multivariate Cox regression analysis was then performed in R to select prognosis-related markers. Finally, a prognostic risk model was constructed from these prognosis-related genes, and the coefficient describing each gene's contribution to the risk was obtained and used to establish a comprehensive risk prediction model.

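Prognostic risk score:

$$\text{Risk score} = \sum_{i=1}^{n} \text{coefficient}(\text{Gene}_i) \times \text{expression}(\text{Gene}_i)$$

Here, Gene_i represents the i-th prognostic gene; expression(Gene_i) represents the expression level of the i-th prognostic gene in the patient; coefficient(Gene_i) is the coefficient of the i-th prognostic gene; and n is the number of prognostic genes in the model.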
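As a small worked illustration of the risk score, the Python sketch below sums coefficient × expression over the model genes for each patient. The gene names, coefficients and expression values are hypothetical placeholders, and splitting patients at the median score is an assumption made for this example; the text does not state which cutoff was used.

```python
import statistics

# Hypothetical Cox coefficients and per-patient expression values.
coefficients = {"GENE_1": 0.5, "GENE_2": -0.25}
patients = {
    "P1": {"GENE_1": 4.0, "GENE_2": 2.0},
    "P2": {"GENE_1": 1.0, "GENE_2": 6.0},
    "P3": {"GENE_1": 2.0, "GENE_2": 2.0},
}


def risk_score(expression, coefs):
    """Sum of coefficient(Gene_i) * expression(Gene_i) over the model genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}

# Assumed grouping rule: scores above the cohort median are labelled high risk.
cutoff = statistics.median(scores.values())
groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}

print(scores)
print(groups)
```

A gene with a negative coefficient lowers the score as its expression rises, which is how a protective gene would enter such a model.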

Prognostic model evaluation and data set validation

To verify the effectiveness of the model with TCGA-BLCA data, the cases were divided into two groups according to their risk values, and the relationship between risk value and prognosis was examined. We also used external data (METABRIC) to validate the model, and ROC curves were drawn to assess its effectiveness. The survival curves were drawn using the survival package in R, and the ROC curves were drawn with the survivalROC package. In addition, combined with the patients' clinical information, a heat map of the model-related genes was used to show the differences in gene expression between the high-risk and low-risk groups and among different clinical states.
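For readers unfamiliar with how a survival curve is estimated, the following Python sketch implements the Kaplan-Meier product-limit estimator on a toy data set. It is a generic illustration, not the R survival package used in the study, and the follow-up times and event flags are invented.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Toy cohort (invented): five patients, two of them censored.
curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Computing one such curve per risk group and comparing them with the log-rank test is the usual way to check whether the groups differ in overall survival.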

CIBERSORT Immune Infiltration Difference Analysis

Because tumor tissues are highly heterogeneous, the same tumor type may show different immune activity in different individuals, and different samples may therefore have different immune microenvironments. The transcriptional profile of a tumor sample reflects the composition of its immune-related components, and CIBERSORT can estimate the proportions of 22 immune infiltration-related cell types by analyzing the tissue expression profile. Using these data, we can further infer the immune infiltration status of different cancer tissues. The analysis was carried out with the R package provided on the CIBERSORT official website.

Patients and tissue specimens

We collected 10 BCa tissues and matched para-cancerous tissues from BCa patients who underwent surgical resection without neoadjuvant chemotherapy or radiotherapy in the Department of Urology, the Second Affiliated Hospital of Tianjin Medical University (China) from 2017 to 2019. The inclusion criterion was total cystectomy; patients who underwent electrocystectomy or puncture were excluded. This study was approved by the Institutional Review Committee of the hospital.

Cell lines acquisition and transfection

Four bladder cancer cell lines (T24, 5637, 87, EJ) and the human immortalized uroepithelial cell line SV-HUC-1 were purchased from the American Type Culture Collection (ATCC). The human BCa cell lines were cultured in RPMI-1640 medium containing 10% FBS and 1% penicillin-streptomycin at 37 °C with 5% CO2, while the SV-HUC-1 cell line was cultured in F-12K Nutrient Mixture containing 10% FBS and 1% penicillin-streptomycin under the same conditions. We selected T24 cells for subsequent mechanistic experiments and transfected them with siRNAs targeting HDAC4, NFATC1 and CBX7 according to the lipo2000 transfection protocol. Total RNA was extracted 48 h after transfection and used for PCR.

RNA extraction and qRT-PCR

Total RNA was extracted from BCa tissue, para-cancerous tissue, BCa cell lines and SV-HUC-1 cells using TRIzol reagent, and cDNA was synthesized with cDNA Synthesis SuperMix (gDNA Purge) (Novoprotein). Quantitative real-time polymerase chain reaction (qPCR) was performed with SYBR Premix Ex Taq II assays, and GAPDH was used as the internal reference for normalization. The relative expression of these genes was calculated with the $2^{-\Delta\Delta Cq}$ method. The primer sequences and siRNA sequences are shown in Table S1.
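The $2^{-\Delta\Delta Cq}$ calculation can be written out as a short Python sketch. The Cq values are invented for illustration, the reference gene stands in for GAPDH, and the function name `relative_expression` is hypothetical.

```python
def relative_expression(cq_target_sample, cq_ref_sample,
                        cq_target_control, cq_ref_control):
    """Relative expression by the 2^(-ddCq) method.

    dCq  = Cq(target) - Cq(reference gene), computed within each condition
    ddCq = dCq(sample) - dCq(control)
    """
    dcq_sample = cq_target_sample - cq_ref_sample
    dcq_control = cq_target_control - cq_ref_control
    ddcq = dcq_sample - dcq_control
    return 2.0 ** (-ddcq)


# Invented example: the target amplifies two cycles later in the tumor sample
# than in the control, while the reference gene is unchanged.
fold = relative_expression(
    cq_target_sample=26.0, cq_ref_sample=18.0,
    cq_target_control=24.0, cq_ref_control=18.0,
)
print(fold)  # 0.25, i.e. a four-fold lower relative expression
```

Each additional cycle needed to reach the threshold corresponds to roughly half as much starting template, which is why a ΔΔCq of 2 maps to a relative expression of one quarter.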

Statistical analysis

Statistical analyses were performed with GraphPad Prism 5.3 and SPSS 21.0. All data are presented as the mean ± standard deviation of at least three independent experiments performed by different experimenters. The t-test and chi-square test were used for data analysis. P<0.05 was considered statistically significant.

Transcription factors are differentially expressed in bladder cancer

We obtained 1930 transcription factors by searching TRANSFAC ( http://generegulation.com/pub/databases.html ), CISBP ( http://cisbp.ccbr.utoronto.ca/ ) and TRRUST ( https://www.grnpedia.org/trrust/ ), and the Venn plot shows the origin of the 1930 TFs (Fig. 1A). Next, we identified the differentially expressed genes in the BCa expression profile from the TCGA database ( https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga ); the results are presented in the volcano plot (Fig. 1B). The list of transcription factors and P values is given in Table S2. The expression profiles of the 1930 TFs were retrieved from the TCGA database to explore abnormal TF expression and new prognostic biomarkers in BCa development. Differential expression analysis of TFs was performed on 401 paired BCa and normal bladder tissue samples. The results showed that 408 TFs were up-regulated and 751 TFs were significantly down-regulated in BCa tissue. The top 100 up-regulated and top 100 down-regulated TFs identified in BCa are displayed in the heatmaps (Fig. 1C). The detailed lists of dysregulated transcription factors and their P values are given in Table S3 (down-regulated) and Table S4 (up-regulated).

Figure 1

Transcription factors are differentially expressed in bladder cancer. (A) The Venn plot depicts the sources of the 1930 TFs. (B) The volcano plot depicts the differential expression of all genes in BCa from the TCGA database. (C) The heatmaps display the top 100 up-regulated (left panel) and top 100 down-regulated (right panel) TFs in BCa.

Construction of BCa co-expression modules

To further understand the dysregulated TFs, we constructed a WGCNA network. The independence and average connectivity of the co-expression modules were determined by the power value (β) and the scale-free fit R² value. First, we evaluated a set of soft-thresholding powers; the scale-free R² reached 0.90 when the power value was equal to 4 (Fig. 2A). Using soft thresholding with β=4, we defined the adjacency matrix to construct and identify the differential co-expression TF modules in BCa samples. A cluster dendrogram of the 1159 chosen TFs was constructed based on the TOM-based dissimilarity measure. The co-expression modules identified by cluster analysis were assigned different colors (Fig. 2B). The interactions among these co-expression modules were analyzed with the Pearson correlation coefficient (Fig. 2C); a darker background indicates a higher module correlation.
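The selection rule implied above, choosing the lowest power whose scale-free fit R² reaches the cutoff, can be sketched in Python. Only the pair β=4 and R²=0.90 comes from the text; the other fit values are invented placeholders, and the function name `pick_soft_threshold` is hypothetical.

```python
# Scale-free fit R^2 per candidate power. Only (4, 0.90) is taken from the
# text; every other value is an invented placeholder for illustration.
fit_by_power = {1: 0.21, 2: 0.58, 3: 0.81, 4: 0.90, 5: 0.93, 6: 0.94}


def pick_soft_threshold(fits, cutoff=0.90):
    """Return the smallest power whose scale-free R^2 meets the cutoff,
    or None if no candidate does."""
    passing = [power for power, r2 in fits.items() if r2 >= cutoff]
    return min(passing) if passing else None


beta = pick_soft_threshold(fit_by_power)
print(beta)
```

Taking the smallest passing power keeps as much connectivity in the network as possible while still approximating a scale-free topology.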

Figure 2

Construction of BCa co-expression modules. (A) Analysis of network topology for various soft-thresholding powers. (B) The clustering dendrogram of genes, based on dissimilarity of topological overlap, with the assigned module colors. (C) The heatmap depicts the topological overlap matrix (TOM) among genes based on co-expression modules. (D) Analysis of module-trait relationships in BCa based on the TCGA online database (each row corresponds to a module eigengene, and each column corresponds to a trait).

The relationship between TFs co-expression modules and clinicopathological parameters

In the principal component analysis of each module, the first principal component was selected as its characteristic TF (module eigengene). Clinicopathological parameters, including gender, age, tumor stage and OS status, were related to the different co-expression modules, and the blue module showed the strongest relevance (Fig. 2D). A heatmap of the correlation between each module's characteristic TF and the clinicopathological parameters of BCa presents the correlation coefficient (R) and significance (p value). The module-trait diagram comprises 174 TFs in the Blue module, 78 TFs in the Yellow module, 93 TFs in the Brown module, 166 TFs in the Turquoise module and 648 TFs in the Grey module (Table 1). The TFs of the Blue, Yellow, Brown and Turquoise modules were selected for further analysis, while the TFs of the Grey module did not meet the criteria and were excluded. The key TFs of each module may become candidate biomarkers for particular clinicopathological parameters. The correlation analysis of TF modules and clinicopathological parameters showed that the grey co-expression module was related to OS status (R=0.2, P<0.0001) and that the blue co-expression module was associated with tumor stage (R=0.18, P<0.001). The significant association of the blue module with tumor stage implies that its TFs might provide potential biomarkers for the prognosis of BCa. Considering the relationship between the modules and the clinicopathological parameters, we selected the blue co-expression module for the next analysis.

Table 1. Number of TFs in co-expression modules.

Module color    Number of TFs
Blue            174
Brown           93
Grey            648
Turquoise       166
Yellow          78

Functional enrichment gene analysis of co-expression blue module

Based on the heatmap above, we selected all 174 TFs of the blue module, including the hub TFs (CBX7, HDAC4, EBF2, NFATC1 and AKNA). To explore the potential functions of the TFs in the blue co-expression module, we conducted Gene Ontology (GO) analysis. The GO_BP enrichment results demonstrated that these TFs were associated with many critical biological processes, such as the morphogenesis of many important organs and the development of the urogenital system. These TFs might play an important role in the proliferation of a variety of cells, and their dysregulation might contribute to the occurrence of urogenital tumors (Fig. 3A). GO_MF enrichment analysis showed that these TFs were significantly associated with the activity of DNA binding transcription activators and DNA binding transcription receptors (Fig. 3A). In addition, GO_CC enrichment analysis demonstrated that the transcription regulatory complex and nuclear chromatin were tightly associated with the function of these TFs, suggesting that these TFs might regulate tumor formation and progression (Fig. 3A). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis further showed enrichment of the pathway for transcriptional dysregulation in cancer (Fig. 3B). These results imply that the TFs might regulate many cancer-related factors and tumor development by regulating the process and status of transcription.

Figure 3

WGCNA predicts GO and KEGG pathways associated with the five‐TFs signature. (A) The GO analysis of the co‐expressed genes is shown in blue module (GO-BP; GO-MF and GO-CC). (B) The KEGG analysis of the co‐expressed genes is shown in blue module.
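
As a hedged illustration of how an enrichment P value of this kind is commonly obtained, the sketch below uses a one-sided hypergeometric test. Whether the enrichment tool used in the study applies exactly this test is an assumption, and the function name and argument names are hypothetical.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric P value, P(X >= k).

    N: size of the background set
    K: background members annotated with the term
    n: size of the tested list (for example, a module)
    k: members of the tested list annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Tiny check: drawing 5 of 10 items and hitting all 5 annotated ones
# has probability 1 / C(10, 5).
p_small = enrichment_p_value(5, 5, 5, 10)
```

For the blue module one would set `n=174` and choose `N` as the size of the background TF set, with `K` and `k` counted from the annotation database; those counts are not given in the text and would have to be supplied.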

Five-TFs combination is a novel prognostic biomarker for BCa and associated overall survival

Univariate and multivariate Cox regression analyses were conducted on the TFs of the blue module. Five TFs, namely CBX7, HDAC4, EBF2, NFATC1 and AKNA, were identified as prognostic biomarkers of BCa. To investigate the relationship between these five TFs and the clinicopathological parameters (tumor stage, gender, OS status and age), we divided the BCa patients into high-risk and low-risk groups, and a Cox proportional hazards regression model was used for multivariate analysis. The five-TFs combination model was compared with the other clinical parameters as covariates to explore whether it was an independent prognostic factor in BCa. The results are presented in the heatmap (Fig. 4 A). With the patients divided into high-risk and low-risk groups, the five-TFs combination was significantly associated with progression from stage I&II to stage III&IV (Table 2 ; P<0.01), suggesting that it might serve as a potential prognostic biomarker for BCa. Next, we evaluated how well the five-TFs combination predicted 1-year, 3-year and 5-year survival using ROC curves (Fig. 4 B). The AUC was 0.675 at 1 year, 0.645 at 3 years and 0.635 at 5 years. Kaplan-Meier survival analysis showed that patients in the high-risk group of the five-TFs combination model had poorer overall survival (P<0.05; Fig. 4 C), and that high expression of NFATC1 and low expression of AKNA were each associated with poor OS (P<0.05; Fig. 4 D,E). Together, these results show that the five-TFs combination model could serve as a candidate prognostic biomarker for BCa patients.

Figure 4

Prognostic value of the five-TFs combination and its association with the clinical outcome of bladder cancer. (A) Hierarchical clustering heatmap and dendrogram of bladder cancer samples based on the expression patterns of the five selected TFs. (B) Receiver operating characteristic (ROC) analysis of the risk scores for overall survival prediction in the TCGA dataset. (C) Kaplan-Meier survival curves for bladder cancer samples classified into high-risk and low-risk groups using the five-TFs combination signature. P values were calculated using the log-rank test (P=0.0003). (D) Kaplan-Meier survival curves for bladder cancer samples classified into high-expression and low-expression groups based on NFATC1. P values were calculated using the log-rank test (P=0.0205). (E) Kaplan-Meier survival curves for bladder cancer samples classified into high-expression and low-expression groups based on AKNA. P values were calculated using the log-rank test (P=0.00233).
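
The grouping and survival steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not give the cut-off used to separate the risk groups, so a median split is assumed here, and the survival curve is the standard Kaplan-Meier product-limit estimate. Function names and the toy data are hypothetical.

```python
from statistics import median

def split_by_median(scores):
    """Label each risk score 'high' or 'low' relative to the median (assumed cut-off)."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]

def kaplan_meier(times, events):
    """Kaplan-Meier estimate as [(time, survival)] at each distinct event time.

    events[i] is 1 for an observed death and 0 for a censored observation.
    """
    survival = 1.0
    curve = []
    event_times = sorted(set(t for t, e in zip(times, events) if e == 1))
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical toy cohort: follow-up times with event indicators.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
toy_groups = split_by_median([0.1, 0.5, 0.9, 1.3])
```

Computing one curve per risk group and comparing them with a log-rank test corresponds to the type of comparison shown in panels C to E.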

Clinicopathologic characteristics of bladder cancer patients (n=401).

Variables        Total   Low risk   High risk   χ²       P value
Age (years)                                     1.9181   0.1661
  >60            294     154        140
  ≤60            107     47         60
Gender                                          5.5847   0.01812*
  Male           297     138        159
  Female         104     63         41
Tumor stage                                     17.456   2.94E-05*
  Stage I&II     131     44         87
  Stage III&IV   270     157        113
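
As a worked illustration of the test statistic in this table, the sketch below computes a 2x2 chi-square test. It assumes Pearson's chi-square with Yates' continuity correction and one degree of freedom, which is an assumption about how the statistics were obtained; the function name is hypothetical. The example calls use the age and gender counts from the table.

```python
import math

def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square statistic and P value (1 degree of freedom) for the table
    [[a, b], [c, d]], optionally with Yates' continuity correction."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2.0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Rows are categories, columns are low-risk and high-risk counts.
age_chi2, age_p = chi_square_2x2(154, 140, 47, 60)        # >60 vs <=60
gender_chi2, gender_p = chi_square_2x2(138, 159, 63, 41)  # male vs female
```

Under these assumptions the age row gives a statistic of about 1.918 (P about 0.166) and the gender row about 5.585 (P about 0.018), in line with the tabulated values for those two rows.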

The differential expression of neutrophils was associated with the risks of five-TFs combination

Immune infiltration is frequently observed in the early anti-cancer defense, with immune cells reaching the sites where tumor cells grow abnormally. Early infiltration of the relevant immune cells may therefore predict tumorigenesis and carcinogenesis, and many researchers have reported a relationship between immune infiltration and carcinogenesis. For example, Xiaoyan Fan et al. identified TACC3 as a prognostic biomarker for kidney renal clear cell carcinoma that correlates with immune cell infiltration and T cell exhaustion 16 . Hai Zhu et al. found that ITGA5 is a prognostic biomarker correlated with immune infiltration in gastrointestinal tumors 17 . Moreover, Hanji Huang et al. identified C1Q as a critical prognostic factor related to immune infiltrates in osteosarcoma 18 . Substantial evidence thus supports a role for immune infiltrates in carcinogenesis and tumor development. In this study, we explored the relationship between immune infiltrates and the high-risk and low-risk groups of the five-TFs combination in BCa, and further explored how well the five-TFs combination predicts changes in the related immune cells. To analyze the effect and potential function of the five-TFs combination model, we examined the differential expression of 22 common immune-associated cell types (such as B lymphocytes and neutrophils) between the high-risk and low-risk groups. The results are shown in the heatmap (Fig. 5 A). High-risk BCa patients presented a lower expression level of neutrophils than low-risk patients (Fig. 5 B; P<0.05), while the other immune cells showed no obvious difference between the two groups (P>0.05).
Neutrophils are the most common immune cells in the anti-inflammatory response and in natural anti-tumor responses, and elevated neutrophils usually indicate abnormal cell growth or pathogen infection. Chang Cui et al. found that neutrophil elastase selectively killed cancer cells and attenuated tumorigenesis 19 . Neutrophil-secreted elastase shows dramatic selectivity, inducing cell death in tumors and at metastatic sites in animal models while sparing proximal healthy cells, which further implies that neutrophils are a critical factor in the development and treatment of cancers. In the early stage of BCa, many neutrophils migrate to and gather in the area of abnormal growth, and some researchers consider elevated neutrophils a signal of tumor occurrence or abnormal growth. We found that high-risk BCa patients mostly had higher neutrophil expression, which implies that the five-TFs combination model might be a useful biomarker of early immune infiltration in bladder cancer patients, and neutrophil infiltration might further support the accuracy of the TFs model for BCa diagnosis. These results suggest that the five-TFs combination model could play a role in predicting immune infiltration and inflammation-associated processes. The other immune cells showed no obvious change between the high-risk and low-risk groups of the TFs model, although their functions remain critical for the overall anti-tumor process, and the full predictive function of the TFs model needs further analysis. Based on the risk groups of the five-TFs combination model, we could further predict BCa prognosis and the anti-tumor effects of neutrophils.
Here, the expression of neutrophils refers to the percentage of neutrophils among all infiltrating immune cells. The prognosis of BCa and the intensity of neutrophil infiltration could therefore be predicted from TF mRNA expression and the model risk, which implies that immune infiltration and carcinogenesis can be predicted from the high-risk and low-risk groups of the five-TFs combination model. The specific mechanisms and processes need further research.

Figure 5

The differential expression of neutrophils is associated with five-TFs combination. (A) The differential expression analysis of 22 immune cells based on high risk and low risk of the five TFs combination signature (left panel: high risk, right panel: low risk). (B) The violin illustration depicts the differential expression of neutrophils based on high risk and low risk of the five TFs combination characteristics.

Protein-protein interaction (PPI) and a transcription factor network analysis with five-TFs combination

Because the above five TFs were candidate biomarkers for BCa patients, we further investigated whether these TFs could influence important tumor-associated signaling pathway proteins or factors, and analyzed the network of these five TFs. Many critical proteins and factors were found to be directly associated with these TFs (Fig. 6 A). For example, JUN, IL-10, IL-5 and SMAD4 were closely related to HDAC4, NFATC1 and IL-2, and CBX7 and CDH1 were also connected with each other. To further identify the factors interacting with the five TFs, we uploaded them to STRING for protein-protein interaction (PPI) analysis (Fig. 6 B). Finally, to consolidate these results, we analyzed the interaction network with the GeneMANIA online analysis database, and the results demonstrated that the five TFs were related to each other and regulated by many critical proteins (Fig. 6 C). These results indicated that the interactions between the proteins were reflected not only at the expression level but also in signaling cascades. Together, the results implied that the five TFs were important for the occurrence and development of bladder cancer through many cancer-related pathways; however, the specific mechanism still needs further research. This study preliminarily proposes the hypothesis that the five TFs regulate the occurrence and development of BCa by influencing the transcription of these related factors, which will be the foundation of subsequent research.

Figure 6

Protein-protein interaction (PPI) network analysis and transcription factor network analysis of the five-TFs combination. (A) The potential binding sites or targets of the five TFs were predicted based on the same transcription direction. (B) A protein-protein interaction (PPI) network based on the five-TFs combination was established. (C) The comprehensive network was analyzed with the GeneMANIA online analysis database.

The five-TFs combination is an independent prognostic biomarker

To determine whether the risk prediction model of the five-TFs combination is an independent predictive biomarker of survival in BCa patients, univariate and multivariate analyses were used to assess the prognostic association between known clinicopathological risk factors for BCa progression and the newly identified five-TFs combination. The clinicopathological risk factors considered were age, gender and tumor stage. As expected, the prognostic efficiency of the five-TFs combination was similar to that of age and tumor stage (P<0.01; Table 3 ; Fig. 7 A), whereas gender was not significant. Multivariate analysis further revealed that the five-TFs combination remained an independent prognostic risk factor for the survival of BCa patients (P=0.012; Table 3 ; Fig. 7 B).

Univariate and multivariate analysis of clinicopathological factors and the five-TFs combination in BCa malignant progression

              Univariate analysis                               Multivariate analysis
Variable      HR         Low 95%CI   High 95%CI   P value      HR         Low 95%CI   High 95%CI   P value
Risk          0.577078   0.426534    0.780757     0.000364*    0.672088   0.491866    0.918345     0.012604*
Age           1.913494   1.298151    2.820520     0.001045*    1.778215   1.204204    2.625840     0.003800*
Gender        1.136034   0.818468    1.576815     0.445792     1.081515   0.774444    1.510342     0.645600
Tumor stage   2.220953   1.532706    3.218252     0.000025*    1.992851   1.364668    2.910198     0.000358*
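
The relation between the hazard ratios, confidence intervals and P values in this table can be illustrated with a short sketch. It assumes that the intervals are Wald intervals on the log hazard scale, exp(β ± 1.96·SE), and that the P value is a two-sided Wald test; this is an assumption about the underlying software output, and the function name is hypothetical.

```python
import math

def wald_from_hr(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover the log hazard ratio, its standard error, the Wald z statistic
    and the two-sided P value from a hazard ratio and its 95% interval."""
    beta = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return beta, se, z, p

# Risk row of the table: univariate and multivariate estimates.
uni = wald_from_hr(0.577078, 0.426534, 0.780757)
multi = wald_from_hr(0.672088, 0.491866, 0.918345)
```

Under these assumptions the two calls return P values of roughly 0.00036 and 0.0126, close to the tabulated values for the risk variable.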

Figure 7

Univariate and multivariate analysis of clinicopathological factors and the five-TFs combination in BCa malignant progression. (A) Univariate analysis of clinicopathological factors and the five-TFs combination in BCa malignant progression. (B) Multivariate analysis of clinicopathological factors and the five-TFs combination in BCa malignant progression.

The mRNA expression level of five-TFs in BCa and normal bladder tissue

To verify the results obtained from the TCGA online database, we performed qRT-PCR to validate the relative mRNA expression levels of the five TFs (CBX7, HDAC4, EBF2, NFATC1 and AKNA), which could predict BCa and were associated with survival, in BCa tissue and normal bladder tissue. The five TFs were differentially expressed between BCa tissue and normal bladder tissue, and their expression was significantly higher in normal bladder tissue than in BCa tissue (Fig. 8 A). These results were consistent with the expression patterns derived from the database. Next, qRT-PCR was performed to compare the expression of the five TFs between an immortalized control bladder cell line (SV-HUC) and several BCa cell lines (T24, 5637, 87 and EJ). The expression levels of these TFs were lower in the BCa cell lines than in the immortalized control cell line (Fig. 8 B-F). Together, these results demonstrated that the expression of these TFs was down-regulated in BCa compared with non-cancerous controls, which implies that monitoring the expression of the five TFs is a feasible way to predict tumorigenesis. The preceding results suggested that the five TFs might be involved with many critical factors and pathways, such as SMAD4, MMP13, IL-5, IL-2, CYP2E1, CCNE1 and CDH1. To test this hypothesis, we transfected bladder cancer T24 cells with designed si HDAC4, si NFATC1 and si CBX7. Reduced HDAC4 significantly induced the transcription and expression of SMAD4, MMP13 and IL-5; reduced NFATC1 restrained the transcription and expression of IL-2 and CYP2E1; and reduced CBX7 significantly induced the transcription of CCNE1 and restrained the expression of CDH1 (Fig. 8 G).
These results demonstrated that the five TFs were important for the transcription and expression of various cancer-related factors, which further supports the role of the five-TFs combination in the occurrence and development of bladder cancer. However, this study has an obvious limitation: we only measured the expression of the TFs in BCa tissue and cells, which is not convenient for current clinical practice. In future experiments we will try to measure the expression of the TFs in blood or urine, which may greatly improve the prognostic efficiency for BCa. If the results in blood or urine are consistent with those in BCa tissue and cells, we believe that measuring the expression of the five TFs could become an important step in BCa diagnosis and treatment and could benefit many BCa patients.

Figure 8

qRT-PCR validation of the five identified transcription factors in BCa cell models and bladder cancer tissue. (A) The mRNA expression levels of the five TFs in bladder cancer tissue compared with normal tissue. (B-F) The mRNA expression levels of the five TFs in immortalized bladder mucosal SV-HUC cells compared with the BCa cell lines T24, 5637, 87 and EJ. (G) The mRNA expression of critical cancer-related factors (SMAD4, MMP13, IL-5, IL-2, CYP2E1, CCNE1 and CDH1) in bladder cancer T24 cells transfected with si HDAC4, si NFATC1 and si CBX7.
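
Relative mRNA levels from qRT-PCR are often computed with the comparative Ct method. The sketch below assumes that method (fold change = 2^-ΔΔCt, normalized to a reference gene), which is an assumption and not a statement of the protocol used here; the function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in a sample versus a control,
    normalized to a reference gene, using 2 ** -(delta delta Ct)."""
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_sample - delta_control)

# Hypothetical Ct values: the target amplifies two cycles later in the
# sample than in the control after normalization, giving a fold change of 0.25.
fold_change = relative_expression(26.0, 18.0, 24.0, 18.0)
```

A fold change below 1 corresponds to lower expression in the sample than in the control, which is the direction reported for the five TFs in BCa tissue and cell lines.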

Recently, a growing number of studies have revealed the core regulatory roles of TFs in the progression and evolution of various cancers 20 . By regulating the transcription of cancer-related factors or pathways, TFs can induce carcinogenesis, promote cancer progression or restrain the function of tumor suppressor genes. However, the clinical significance and functions of TFs in BCa remain unknown. This research systematically analyzed dysregulated TFs in BCa and preliminarily identified the five-TFs combination network (CBX7, HDAC4, EBF2, NFATC1 and AKNA) as an important prognosis-related biomarker associated with BCa progression. The expression of these five TFs was down-regulated in BCa patients, and the overall survival of high-risk BCa patients in the five-TFs combination model was poor (P<0.01). The five-TFs combination model may therefore play a critical prognostic role in BCa. In this study, five co-expression TF modules were constructed from 1930 TFs in 401 human BCa clinical samples. The samples were analyzed by WGCNA to determine the critical modules closely associated with important clinicopathological characteristics, and WGCNA was used to filter these co-expression TF clusters to identify prognostic biomarkers of BCa. PPI network analysis and WGCNA were performed to analyze the critical TFs related to the progression of BCa, to explore the potential mechanisms, and to identify effective prognostic biomarkers of BCa. Survival analysis of the identified TFs showed that the high-risk group of the five-TFs combination model was associated with poor overall survival. Chong Shen et al. identified potential prognostic biomarkers, the lncRNAs ENST00000598996 and ENST00000524265, by WGCNA and PPI network analysis of BCa 8 .
Previous studies have also reported that the lncRNAs XIST 21 , MAGI2‑AS3 22 and ADAMTS9‑AS2 23 are prognostic biomarkers in patients with BCa and play critical roles in BCa progression via various tumor-related pathways. Furthermore, a TF network analysis of the specific mechanism in BCa was performed. The five TFs have been reported in previous research and are closely related to many critical biological processes. CBX7, Chromobox 7, is a component of the Polycomb group (PcG) multiprotein PRC1-like complex, which is necessary to maintain the transcriptional repression of many genes (including Hox genes) throughout development 24 . It is associated with many diseases, including melanoma 25 , 26 and ovarian clear cell adenocarcinoma 27 . HDAC4, Histone Deacetylase 4, is responsible for the deacetylation of N-terminal lysine residues of the core histones (H2A, H2B, H3 and H4) 28 . This process provides a marker for epigenetic repression and plays an important role in transcriptional regulation, cell cycle progression and developmental events. The dysregulation of HDAC4 is associated with many diseases, such as vascular inflammation 29 and brachydactyly 30 , 31 . EBF2, EBF Transcription Factor 2, is a non-basic helix-loop-helix transcription factor that belongs to the COE (Collier/Olf/EBF) family, has a well conserved DNA binding domain and plays a critical role in regulating osteoclast differentiation. Many diseases are related to abnormal regulation of EBF2, such as Kawasaki Disease 32 , Inguinal Hernia 33 and Kallmann Syndrome 34 , 35 . NFATC1, Nuclear Factor of Activated T Cells 1, plays a role in the inducible expression of cytokine genes in T cells (especially in the induction of IL-2 or IL-4 gene transcription) and regulates many important genes of osteoclast differentiation and function. NFATC1-associated diseases include Cherubism 36 - 38 and Osteonecrosis 39 , 40 .
AKNA, AT-Hook Transcription Factor, regulates microtubule organization and specifically activates the expression of the CD40 receptor and its ligand CD40L/CD154 (two cell surface molecules on lymphocytes that are critical for antigen-dependent B-cell development). It remains unclear how it can act as both a transcription factor and a microtubule organizer, and more evidence is required to explain these two apparently contradictory functions 41 , 42 . In the online database, these TFs were frequently downregulated in BCa tissue compared with para-cancerous tissue, suggesting that they may be involved in the pathogenesis of the disease. Additionally, our research showed that BCa patients in the high-risk group of the five-TFs combination had worse overall survival. CBX7 could inhibit the invasion ability of BCa 26 . Kaletsch A et al. identified HDAC4 as a tumor-promoting factor, and histone deacetylase inhibitors (HDACi) are considered promising anti-cancer drugs that could be employed for the treatment of urothelial carcinoma (UC) 43 . Kawahara T et al. found that NFATC1 was a potential prognostic factor 44 and a critical tumor-promoting factor in the progression of BCa 45 . The other identified TFs have not been reported in previous studies of BCa, so our work adds a valuable supplement to the identification of potential TF-related biomarkers for BCa patients. TFs are known to be critical regulators of many important processes and proteins, and their dysregulation usually causes abnormal expression of proto-oncogenes or tumor suppressor genes. The dysregulation of TFs is often an early step in the occurrence of various cancers. In this study, we found that the five-TFs combination was a potential prognostic biomarker associated with overall survival in BCa, and that the occurrence of BCa may be predicted by monitoring the expression of the five TFs.
The PPI results demonstrated that the five TFs were closely associated with many critical cancer-related factors or pathways, such as SMAD4, MMP13, IL-5, IL-2, CYP2E1, CCNE1 and CDH1. We further found that reducing the TFs significantly induced or restrained the transcription or expression of these factors, which implies that the five TFs might control the progression of BCa through these factors. Together, these results further support the critical role of the five TFs in the occurrence and development of BCa. We will continue to explore the mechanisms by which the five TFs influence tumor development, and how regulating the TFs can restrain the expression of proto-oncogenes or amplify the expression of tumor suppressor genes. Targeting the TFs or applying inhibitors might be a new treatment strategy for many BCa patients.

We acknowledge the limitations of this research. First, some BCa patients had incomplete clinicopathological information, which might influence the clinical assessment of the results. Second, the identified TFs were verified only at the cell culture level and in a small number of clinical BCa samples; larger clinical cohorts are necessary to validate these results and to provide a basis for clinical prognosis or detection. Finally, although our research suggests that the five-TFs combination can serve as a potential progression-related biomarker that is convenient for predicting BCa progression and prognosis, the influence of these TFs on the invasion ability and proliferation of BCa needs further research, and these TFs might provide new therapeutic targets for BCa patients. It is necessary to further verify the effects of the identified TFs on BCa in vivo and in vitro .

In this research, we used WGCNA to determine that the blue co-expression module was correlated with BCa clinical traits. The TFs selected from the co-expression modules were enriched in different pathways and cell functions and are closely related to the risk factors and progression of BCa. Furthermore, we identified a series of potential prognosis-related biomarkers, which may contribute to the prognosis and treatment of BCa. The five-TFs combination was an independent prognostic biomarker for BCa and might serve as a potential therapeutic target for BCa patients.

Supplementary Material

Supplementary table 1.

Supplementary table 2.

Supplementary table 3.

Supplementary table 4.

Acknowledgments

We thank the Tianjin Institute of Urology for experimental and technical support, and we thank Mengyue Yang of Harbin Medical University for help during manuscript writing.

Ethics approval and consent to participate

All procedures in this study involving human participants were in accordance with the ethical standards of the Research Ethics Committee of The Second Hospital of Tianjin Medical University and the 1964 Helsinki declaration and its subsequent amendments.

Consent for publication

All authors have been informed of and consent to the publication of this study.

Availability of data and materials

All data generated or analyzed during this study are included in this article.

This work was supported by grants from the National Natural Science Foundation (81872079, 81572538), and the Science Foundation of Tianjin (No.: 11JCZDJC19700, ZC2013116JCZDJC34400, 20140117, 16KG118).

Author Contributions

Conceptualization, Yihao Liao; Data curation, Yihao Liao, Miaomiao Wang and Tao Guo; Formal analysis, Xuanxuan Zou, Keke Wang and Boqiang Zhong; Funding acquisition, Ning Jiang; Investigation, Keke Wang; Methodology, Ning Jiang; Project administration, Yihao Liao; Resources, Youzhi Wang and Ning Jiang; Software, Xuanxuan Zou and Miaomiao Wang; Visualization, Ning Jiang; Writing original draft, Yihao Liao and Xuanxuan Zou; Writing - review & editing, Yihao Liao.

Abbreviations

CBX7: Chromobox 7
HDAC4: Histone Deacetylase 4
EBF2: EBF Transcription Factor 2
NFATC1: Nuclear Factor of Activated T Cells 1
AKNA: AT-Hook Transcription Factor
WGCNA: weighted correlation network analysis
TCGA: The Cancer Genome Atlas
TOM: topological overlap matrix
GO: Gene Ontology
KEGG: Kyoto Encyclopedia of Genes and Genomes
BCa: bladder cancer
TF: transcription factor
OS: overall survival
ATCC: American Type Culture Collection

1. Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. International journal of cancer. 2015;136:E359–86. doi: 10.1002/ijc.29210.
2. Antoni S, Ferlay J, Soerjomataram I, Znaor A, Jemal A, Bray F. Bladder Cancer Incidence and Mortality: A Global Overview and Recent Trends. European urology. 2017;71:96–108. doi: 10.1016/j.eururo.2016.06.010.
3. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA: a cancer journal for clinicians. 2019;69:7–34. doi: 10.3322/caac.21551.
4. Cao W, Zhao Y, Wang L, Huang X. Circ0001429 regulates progression of bladder cancer through binding miR-205-5p and promoting VEGFA expression. Cancer biomarkers: section A of Disease markers. 2019;25:101–13. doi: 10.3233/CBM-182380.
5. Petković M, Muhvić D, Zamolo G, Jonjić N, Mustać E, Mrakovćić-Sutić I. et al. Metatarsal metastasis from transitional cell cancer of the urinary bladder. Collegium antropologicum. 2004;28:337–41.
6. Prasad SM, Decastro GJ, Steinberg GD. Urothelial carcinoma of the bladder: definition, treatment and future efforts. Nature reviews Urology. 2011;8:631–42. doi: 10.1038/nrurol.2011.144.
7. Li J, Li Y, Meng F, Fu L, Kong C. Knockdown of long non-coding RNA linc00511 suppresses proliferation and promotes apoptosis of bladder cancer cells via suppressing Wnt/β-catenin signaling pathway. Bioscience reports. 2018;38:BSR20171701. doi: 10.1042/BSR20171701.
8. Shen C, Wang Y, Wu Z, Da L, Gao S, Xie L. et al. Long noncoding RNAs, ENST00000598996 and ENST00000524265, are correlated with favorable prognosis and act as potential tumor suppressors in bladder cancer. Oncology reports. 2020;44:1831–50. doi: 10.3892/or.2020.7733.
9. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA. et al. Diversity and complexity in DNA recognition by transcription factors. Science (New York, NY) 2009;324:1720–3. doi: 10.1126/science.1162327.
10. Bernstein DA. Identification of small molecules that disrupt SSB-protein interactions using a high-throughput screen. Methods in molecular biology (Clifton, NJ) 2012;922:183–91. doi: 10.1007/978-1-62703-032-8_14.
11. Lambert M, Jambon S, Depauw S, David-Cordonnier MH. Targeting Transcription Factors for Cancer Treatment. Molecules (Basel, Switzerland) 2018;23:1479. doi: 10.3390/molecules23061479.
12. Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM. A census of human transcription factors: function, expression and evolution. Nature reviews Genetics. 2009;10:252–63. doi: 10.1038/nrg2538.
13. Bradner JE, Hnisz D, Young RA. Transcriptional Addiction in Cancer. Cell. 2017;168:629–43. doi: 10.1016/j.cell.2016.12.013.
14. Daga M, Pizzimenti S, Dianzani C, Cucci MA, Cavalli R, Grattarola M. et al. Ailanthone inhibits cell growth and migration of cisplatin resistant bladder cancer cells through down-regulation of Nrf2, YAP, and c-Myc expression. Phytomedicine: international journal of phytotherapy and phytopharmacology. 2019;56:156–64. doi: 10.1016/j.phymed.2018.10.034.
15. Shi F, Deng Z, Zhou Z, Jiang CY, Zhao RZ, Sun F. et al. QKI-6 inhibits bladder cancer malignant behaviours through down-regulating E2F3 and NF-κB signalling. Journal of cellular and molecular medicine. 2019;23:6578–94. doi: 10.1111/jcmm.14481.
16. Fan X, Liu B, Wang Z, He D. TACC3 is a prognostic biomarker for kidney renal clear cell carcinoma and correlates with immune cell infiltration and T cell exhaustion. Aging. 2021;13:8541–8562. doi: 10.18632/aging.202668.
17. Zhu H, Wang G, Zhu H, Xu A. ITGA5 is a prognostic biomarker and correlated with immune infiltration in gastrointestinal tumors. BMC cancer. 2021;21:269. doi: 10.1186/s12885-021-07996-1.
18. Huang H, Tan M, Zheng L, Yan G, Li K, Lu D. et al. Prognostic Implications of the Complement Protein C1Q and Its Correlation with Immune Infiltrates in Osteosarcoma. OncoTargets and therapy. 2021;14:1737–51. doi: 10.2147/OTT.S295063.
19. Cui C, Chakraborty K, Tang XA, Zhou G, Schoenfelt KQ, Becker KM. et al. Neutrophil elastase selectively kills cancer cells and attenuates tumorigenesis. Cell. 2021;184:3163–77. doi: 10.1016/j.cell.2021.04.016.
20. Lambert SA, Jolma A, Campitelli LF, Das PK, Yin Y, Albu M. et al. The Human Transcription Factors. Cell. 2018;172:650–65. doi: 10.1016/j.cell.2018.01.029.
21. Hu B, Shi G, Li Q, Li W, Zhou H. Long noncoding RNA XIST participates in bladder cancer by downregulating p53 via binding to TET1. Journal of cellular biochemistry. 2019;120:6330–8. doi: 10.1002/jcb.27920.
22. Wang F, Zu Y, Zhu S, Yang Y, Huang W, Xie H. et al. Long noncoding RNA MAGI2-AS3 regulates CCDC19 expression by sponging miR-15b-5p and suppresses bladder cancer progression. Biochemical and biophysical research communications. 2018;507:231–5. doi: 10.1016/j.bbrc.2018.11.013.
23. Kouhsar M, Azimzadeh Jamalkandi S, Moeini A, Masoudi-Nejad A. Detection of novel biomarkers for early detection of Non-Muscle-Invasive Bladder Cancer using Competing Endogenous RNA network analysis. Scientific reports. 2019;9:8434. doi: 10.1038/s41598-019-44944-3.
24. Morey L, Aloia L, Cozzuto L, Benitah SA, Di Croce L. RYBP and Cbx7 define specific biological functions of polycomb complexes in mouse embryonic stem cells. Cell reports. 2013;3:60–9. doi: 10.1016/j.celrep.2012.11.026.
25. Forzati F, Federico A, Pallante P, Colamaio M, Esposito F, Sepe R. et al. CBX7 gene expression plays a negative role in adipocyte cell growth and differentiation. Biology open. 2014;3:871–9. doi: 10.1242/bio.20147872.
26. Xie D, Shang C, Zhang H, Guo Y, Tong X. Up-regulation of miR-9 target CBX7 to regulate invasion ability of bladder transitional cell carcinoma. Medical science monitor: international medical journal of experimental and clinical research. 2015;21:225–30. doi: 10.12659/MSM.893232.
27. Shinjo K, Yamashita Y, Yamamoto E, Akatsuka S, Uno N, Kamiya A. et al. Expression of chromobox homolog 7 (CBX7) is associated with poor prognosis in ovarian clear cell adenocarcinoma via TRAIL-induced apoptotic pathway regulation. International journal of cancer. 2014;135:308–18. doi: 10.1002/ijc.28692.
28. Kong Q, Hao Y, Li X, Wang X, Ji B, Wu Y. HDAC4 in ischemic stroke: mechanisms and therapeutic potential. Clinical epigenetics. 2018;10:117. doi: 10.1186/s13148-018-0549-1.
29. Yang D, Xiao C, Long F, Su Z, Jia W, Qin M. et al. HDAC4 regulates vascular inflammation via activation of autophagy. Cardiovascular research. 2018;114:1016–28. doi: 10.1093/cvr/cvy051.
30. Luo L, Martin SC, Parkington J, Cadena SM, Zhu J, Ibebunjo C. et al. HDAC4 Controls Muscle Homeostasis through Deacetylation of Myosin Heavy Chain, PGC-1α, and Hsc70. Cell reports. 2019;29:749–763. doi: 10.1016/j.celrep.2019.09.023.
31. Nakatani T, Chen T, Johnson J, Westendorf JJ, Partridge NC. The Deletion of Hdac4 in Mouse Osteoblasts Influences Both Catabolic and Anabolic Effects in Bone. Journal of bone and mineral research: the official journal of the American Society for Bone and Mineral Research. 2018;33:1362–75. doi: 10.1002/jbmr.3422.
32. Bae Y, Shin D, Nam J, Lee HR, Kim JS, Kim KY. et al. Variants in the Gene EBF2 Are Associated with Kawasaki Disease in a Korean Population. Yonsei medical journal. 2018;59:519–23. doi: 10.3349/ymj.2018.59.4.519.
  • 33. Jorgenson E, Makki N, Shen L, Chen DC, Tian C, Eckalbar WL. et al. A genome-wide association study identifies four novel susceptibility loci underlying inguinal hernia. Nature communications. 2015;6:10130. doi: 10.1038/ncomms10130. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 34. Amato LGL, Montenegro LR, Lerario AM, Jorge AAL, Guerra Junior G, Schnoll C. et al. New genetic findings in a large cohort of congenital hypogonadotropic hypogonadism. European journal of endocrinology. 2019;181:103–19. doi: 10.1530/EJE-18-0764. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 35. Trarbach EB, Baptista MT, Garmes HM, Hackel C. Molecular analysis of KAL-1, GnRH-R, NELF and EBF2 genes in a series of Kallmann syndrome and normosmic hypogonadotropic hypogonadism patients. The Journal of endocrinology. 2005;187:361–8. doi: 10.1677/joe.1.06103. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 36. Kadlub N, Sessiecq Q, Dainese L, Joly A, Lehalle D, Marlin S. et al. Defining a new aggressiveness classification and using NFATc1 localization as a prognostic factor in cherubism. Human pathology. 2016;58:62–71. doi: 10.1016/j.humpath.2016.07.019. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 37. Mukai T, Ishida S, Ishikawa R, Yoshitaka T, Kittaka M, Gallant R. et al. SH3BP2 cherubism mutation potentiates TNF-α-induced osteoclastogenesis via NFATc1 and TNF-α-mediated inflammatory bone loss. Journal of bone and mineral research: the official journal of the American Society for Bone and Mineral Research. 2014;29:2618–35. doi: 10.1002/jbmr.2295. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 38. Kadlub N, Sessiecq Q, Mandavit M, L'Hermine AC, Badoual C, Galmiche L. et al. Molecular and cellular characterizations of human cherubism: disease aggressiveness depends on osteoclast differentiation. Orphanet journal of rare diseases. 2018;13:166. doi: 10.1186/s13023-018-0907-2. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 39. Wehrhan F, Gross C, Creutzburg K, Amann K, Ries J, Kesting M. et al. Osteoclastic expression of higher-level regulators NFATc1 and BCL6 in medication-related osteonecrosis of the jaw secondary to bisphosphonate therapy: a comparison with osteoradionecrosis and osteomyelitis. Journal of translational medicine. 2019;17:69. doi: 10.1186/s12967-019-1819-1. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 40. Chen B, Du Z, Dong X, Li Z, Wang Q, Chen G. et al. Association of Variant Interactions in RANK, RANKL, OPG, TRAF6, and NFATC1 Genes with the Development of Osteonecrosis of the Femoral Head. DNA and cell biology. 2019;38:734–46. doi: 10.1089/dna.2019.4710. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 41. Camargo Ortega G, Falk S, Johansson PA, Peyre E, Broix L, Sahu SK. et al. The centrosome protein AKNA regulates neurogenesis via microtubule organization. Nature. 2019;567:113–7. doi: 10.1038/s41586-019-0962-4. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 42. Hug P, Anderegg L, Kehl A, Jagannathan V, Leeb T. AKNA Frameshift Variant in Three Dogs with Recurrent Inflammatory Pulmonary Disease. Genes. 2019;10:567. doi: 10.3390/genes10080567. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 43. Kaletsch A, Pinkerneil M, Hoffmann MJ, Jaguva Vasudevan AA, Wang C, Hansen FK. et al. Effects of novel HDAC inhibitors on urothelial carcinoma cells. Clinical epigenetics. 2018;10:100. doi: 10.1186/s13148-018-0531-y. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 44. Kawahara T, Inoue S, Fujita K, Mizushima T, Ide H, Yamaguchi S. et al. NFATc1 Expression as a Prognosticator in Urothelial Carcinoma of the Upper Urinary Tract. Translational oncology. 2017;10:318–23. doi: 10.1016/j.tranon.2017.01.012. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 45. Kawahara T, Kashiwagi E, Ide H, Li Y, Zheng Y, Miyamoto Y. et al. Cyclosporine A and tacrolimus inhibit bladder cancer growth through down-regulation of NFATc1. Oncotarget. 2015;6:1582–93. doi: 10.18632/oncotarget.2750. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]


October 16, 2024

'Pincer attack' on transcription factors offers new possibilities for future blood cancer therapies

by Karl Landsteiner University

The simultaneous inhibition of the transcription factors Myc and JunB could represent a pioneering therapeutic option for the treatment of multiple myeloma (MM), the second most common type of blood cancer.

This is the result of a recent study conducted by a team from the Karl Landsteiner University of Health Sciences (KL Krems) together with Austrian and American colleagues. The study was the first to show that the two regulatory proteins have independent effects in MM cells.

The simultaneous inhibition of both proteins thus resulted in a synergistic anti-tumor effect. The findings are published in the Blood Cancer Journal.

Multiple myeloma (MM) is the second most common hematopoietic malignancy, still considered incurable despite unprecedented therapeutic advances over the last two decades. Novel therapies are therefore needed.

For several years, the team of Prof. Klaus Podar, Head of the Division of Molecular Oncology and Hematology, Division of Internal Medicine 2 at University Hospital Krems (one of the education and research sites of KL Krems), has focused its research on the role of tumor-associated transcription factors (TFs), proteins that bind to specific DNA sequences and act as regulators, and on the development of TF inhibitors derived from that work.

However, TFs were thought to be "undruggable" until very recently. The team's previous studies demonstrated a pathophysiologic role of the TF JunB in MM, particularly in tumor cell proliferation and drug resistance. Moreover, they established that JunB also increases the proliferation of tumor-promoting blood vessels in the bone marrow.

Separate pathways

The present study shows for the first time that JunB and Myc, another crucial TF in MM, orchestrate distinct transcriptional programs. In addition, the data highlight the opportunity to employ JunB and Myc dual-targeting treatment strategies in MM as another exciting approach to further improve patient outcomes.

In detail, the team demonstrated that the expression of the respective target genes of the two TFs is controlled independently of each other in MM cells. "This was our first indication that JunB and Myc could actually control largely independent signaling pathways in MM cells," explains Prof. Podar.

Caught in the pincers

"Of course, we were immediately interested to see whether simultaneous inhibition of both transcription factors has a mutually reinforcing, i.e., synergistic, effect against MM cells," says Prof. Podar.

The team then carried out several experiments in which the TFs were inhibited individually or together using therapeutic agents or genetic methods. "And indeed, such a pincer attack on MM cells led to a greater increase in MM cell death in both cell and animal models compared to single inhibition," explains Prof. Podar.

"Our efforts are now focusing on the identification and development of new substances that can be successfully used in our patients."


Isolation and characterization of AaWRKY1, an Artemisia annua transcription factor that regulates the amorpha-4,11-diene synthase gene, a key gene of artemisinin biosynthesis

Affiliation.

  • 1 Key Laboratory of Photosynthesis and Environmental Molecular Physiology, Institute of Botany, The Chinese Academy of Sciences, Nanxincun 20, Haidian District, 100093 Beijing, PR China.
  • PMID: 19880398
  • DOI: 10.1093/pcp/pcp149

Amorpha-4,11-diene synthase (ADS) of Artemisia annua catalyzes the conversion of farnesyl diphosphate into amorpha-4,11-diene, the first committed step in the biosynthesis of the antimalarial drug artemisinin. The promoters of ADS contain two reverse-oriented TTGACC W-box cis-acting elements, which are the proposed binding sites of WRKY transcription factors. A full-length cDNA (AaWRKY1) was isolated from a cDNA library of the glandular secretory trichomes (GSTs) in which artemisinin is synthesized and sequestered. AaWRKY1 encodes a 311 amino acid protein containing a single WRKY domain. AaWRKY1 and ADS genes were highly expressed in GSTs and both were strongly induced by methyl jasmonate and chitosan. Transient expression analysis of the AaWRKY1-GFP (green fluorescent protein) reporter revealed that AaWRKY1 was targeted to nuclei. Biochemical analysis demonstrated that the AaWRKY1 protein was capable of binding to the W-box cis-acting elements of the ADS promoters, and it demonstrated transactivation activity in yeast. Co-expression of the effector construct 35S::AaWRKY1 with a reporter construct ADSpro1::GUS greatly activated expression of the GUS (beta-glucuronidase) gene in stably transformed tobacco. Furthermore, transient expression experiments in agroinfiltrated Nicotiana benthamiana and A. annua leaves showed that AaWRKY1 protein transactivated the ADSpro2 promoter activity by binding to the W-box of the promoter; disruption of the W-box abolished the activation. Transient expression of AaWRKY1 cDNA in A. annua leaves clearly activated the expression of the majority of artemisinin biosynthetic genes. These results strongly suggest the involvement of the AaWRKY1 transcription factor in the regulation of artemisinin biosynthesis, and indicate that ADS is a target gene of AaWRKY1 in A. annua.
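The W-box search implied by the abstract can be illustrated with a short sketch. It scans a DNA string for the TTGACC core on both strands, on the assumption that "reverse-oriented" means the motif lies on the complementary strand and therefore reads GGTCAA on the given strand. The function names and the toy sequence are hypothetical and are not taken from the paper or from the actual ADS promoters.

```python
WBOX = "TTGACC"  # W-box core motif named in the abstract

# Translation table for complementing unambiguous DNA bases
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_wboxes(promoter: str, motif: str = WBOX):
    """Return (0-based start, strand) for each motif hit on either strand.

    '+' means the motif reads 5'->3' on the given strand;
    '-' means its reverse complement does (a reverse-oriented element).
    """
    seq = promoter.upper()
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits


# Hypothetical toy sequence for illustration only (not an ADS promoter)
toy_promoter = "ATGGTCAAGCATTTGACCGAGGTCAATA"
for start, strand in find_wboxes(toy_promoter):
    print(start, strand)
```

A scan like this only locates candidate elements; whether AaWRKY1 actually binds them is what the binding and W-box disruption experiments described in the abstract address.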


'Candidatus Phytoplasma solani' Predicted Effector SAP11-like Alters Morphology of Transformed Arabidopsis Plants and Interacts with AtTCP2 and AtTCP4 Plant Transcription Factors

1. Introduction

2. Materials and Methods

2.1. Plant Growth

2.2. Codon Optimization, Cloning, and Transformation of Arabidopsis Plants

2.3. Phenotypic Analysis of Transgenic Arabidopsis thaliana Plants with SAP11-like Overexpression

2.4. Quantification of SAP11-like Gene Expression in Transgenic Arabidopsis thaliana Lines

2.5. Bimolecular Fluorescence Complementation (BiFC) in Nicotiana benthamiana Leaf Epidermal Cells

3.1. Regeneration of Arabidopsis thaliana Plants Overexpressing SAP11-like Transgene

3.2. SAP11-like Gene Presence and Expression in Transformed A. thaliana Plants

3.3. SAP11-like Overexpressing Arabidopsis thaliana Plants Show Significant Phenotypic Changes

3.4. In Planta SAP11-like Interact with AtTCP2 and AtTCP4 Proteins

4. Discussion

5. Conclusions

Supplementary Materials

Author Contributions

Data Availability Statement

Acknowledgments

Conflicts of Interest

  • Navrátil, M.; Válová, P.; Fialová, R.; Lauterer, P.; Šafářová, D.; Starý, M. The Incidence of Stolbur Disease and Associated Yield Losses in Vegetable Crops in South Moravia (Czech Republic). Crop Prot. 2009 , 28 , 898–904. [ Google Scholar ] [ CrossRef ]
  • Strauss, E. Phytoplasma Research Begins to Bloom. Science 2009 , 325 , 388–390. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Hogenhout, S.A.; Oshima, K.; Ammar, E.D.; Kakizawa, S.; Kingdom, H.N.; Namba, S. Phytoplasmas: Bacteria That Manipulate Plants and Insects. Mol. Plant Pathol. 2008 , 9 , 403–423. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Namba, S. Molecular and Biological Properties of Phytoplasmas. Proc. Jpn. Acad. Ser. B Phys. Biol. Sci. 2019 , 95 , 401. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Hogenhout, S.A.; Loria, R. Virulence Mechanisms of Gram-Positive Plant Pathogenic Bacteria. Curr. Opin. Plant Biol. 2008 , 11 , 449–456. [ Google Scholar ] [ CrossRef ]
  • Martín-Trillo, M.; Cubas, P. TCP Genes: A Family Snapshot Ten Years Later. Trends Plant Sci. 2010 , 15 , 31–39. [ Google Scholar ] [ CrossRef ]
  • Marcone, C. Molecular Biology and Pathogenicity of Phytoplasmas. Ann. Appl. Biol. 2014 , 165 , 199–221. [ Google Scholar ] [ CrossRef ]
  • Bai, X.; Correa, V.R.; Toruño, T.Y.; Ammar, E.D.; Kamoun, S.; Hogenhout, S.A. AY-WB Phytoplasma Secretes a Protein That Targets Plant Cell Nuclei. Mol. Plant-Microbe Interact. 2008 , 22 , 18–30. [ Google Scholar ] [ CrossRef ]
  • Hoshi, A.; Oshima, K.; Kakizawa, S.; Ishii, Y.; Ozeki, J.; Hashimoto, M.; Komatsu, K.; Kagiwada, S.; Yamaji, Y.; Namba, S. A Unique Virulence Factor for Proliferation and Dwarfism in Plants Identified from a Phytopathogenic Bacterium. Proc. Natl. Acad. Sci. USA 2009 , 106 , 6416–6421. [ Google Scholar ] [ CrossRef ]
  • Kakizawa, S.; Oshima, K.; Namba, S. Functional Genomics of Phytoplasmas. In Phytoplasmas: Genomes, Plant Hosts and Vectors ; CAB International: Oxford, UK, 2009; pp. 37–50. [ Google Scholar ] [ CrossRef ]
  • Huang, W.; Reyes-Caldas, P.; Mann, M.; Seifbarghi, S.; Kahn, A.; Almeida, R.P.P.; Béven, L.; Heck, M.; Hogenhout, S.A.; Coaker, G. Bacterial Vector-Borne Plant Diseases: Unanswered Questions and Future Directions. Mol. Plant 2020 , 13 , 1379–1393. [ Google Scholar ] [ CrossRef ]
  • Huang, W.; MacLean, A.M.; Sugio, A.; Maqbool, A.; Busscher, M.; Cho, S.T.; Kamoun, S.; Kuo, C.H.; Immink, R.G.H.; Hogenhout, S.A. Parasitic Modulation of Host Development by Ubiquitin-Independent Protein Degradation. Cell 2021 , 184 , 5201–5214.e12. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Sugio, A.; MacLean, A.M.; Kingdom, H.N.; Grieve, V.M.; Manimekalai, R.; Hogenhout, S.A. Diverse Targets of Phytoplasma Effectors: From Plant Development to Defense against Insects. Annu. Rev. Phytopathol. 2011 , 49 , 175–195. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Pecher, P.; Moro, G.; Canale, M.C.; Capdevielle, S.; Singh, A.; MacLean, A.; Sugio, A.; Kuo, C.H.; Lopes, J.R.S.; Hogenhout, S.A. Phytoplasma SAP11 Effector Destabilization of TCP Transcription Factors Differentially Impact Development and Defence of Arabidopsis versus Maize. PLoS Pathog. 2019 , 15 , 1008035. [ Google Scholar ] [ CrossRef ]
  • Hogenhout, S.A.; Van Der Hoorn, R.A.L.; Terauchi, R.; Kamoun, S. Emerging Concepts in Effector Biology of Plant-Associated Organisms. Mol. Plant-Microbe Interact. 2009 , 22 , 115–122. [ Google Scholar ] [ CrossRef ]
  • MacLean, A.M.; Orlovskis, Z.; Kowitwanich, K.; Zdziarska, A.M.; Angenent, G.C.; Immink, R.G.H.; Hogenhout, S.A. Phytoplasma Effector SAP54 Hijacks Plant Reproduction by Degrading MADS-Box Proteins and Promotes Insect Colonization in a RAD23-Dependent Manner. PLoS Biol. 2014 , 12 , 1001835. [ Google Scholar ] [ CrossRef ]
  • Maejima, K.; Iwai, R.; Himeno, M.; Komatsu, K.; Kitazawa, Y.; Fujita, N.; Ishikawa, K.; Fukuoka, M.; Minato, N.; Yamaji, Y.; et al. Recognition of Floral Homeotic MADS Domain Transcription Factors by a Phytoplasmal Effector, Phyllogen, Induces Phyllody. Plant J. 2014 , 78 , 541–554. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Sugio, A.; Kingdom, H.N.; MacLean, A.M.; Grieve, V.M.; Hogenhout, S.A. Phytoplasma Protein Effector SAP11 Enhances Insect Vector Reproduction by Manipulating Plant Development and Defense Hormone Biosynthesis. Proc. Natl. Acad. Sci. USA 2011 , 108 , E1254–E1263. [ Google Scholar ] [ CrossRef ]
  • Janik, K.; Mithöfer, A.; Raffeiner, M.; Stellmach, H.; Hause, B.; Schlink, K. An Effector of Apple Proliferation Phytoplasma Targets TCP Transcription Factors-a Generalized Virulence Strategy of Phytoplasma? Mol. Plant Pathol. 2017 , 18 , 435–442. [ Google Scholar ] [ CrossRef ]
  • Quaglino, F.; Zhao, Y.; Casati, P.; Bulgari, D.; Bianco, P.A.; Wei, W.; Davis, R.E. “Candidatus Phytoplasma solani”, a Novel Taxon Associated with Stolbur-and Bois Noir-Related Diseases of Plants. Int. J. Syst. Evol. Microbiol. 2013 , 63 , 2879–2894. [ Google Scholar ] [ CrossRef ]
  • Maixner, M.; Ahrens, U.; Seemüller, E. Detection of the German Grapevine Yellows (Vergilbungskrankheit) MLO in Grapevine, Alternative Hosts and a Vector by a Specific PCR Procedure. Eur. J. Plant Pathol. 1995 , 101 , 241–250. [ Google Scholar ] [ CrossRef ]
  • Jović, J.; Cvrković, T.; Mitrović, M.; Krnjajić, S.; Petrović, A.; Redinbaugh, M.G.; Pratt, R.C.; Hogenhout, S.A.; Toševski, I. Stolbur Phytoplasma Transmission to Maize by Reptalus Panzeri and the Disease Cycle of Maize Redness in Serbia. Phytopathology 2009 , 99 , 1053–1061. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Seruga Music, M.; Samarzija, I.; Hogenhout, S.A.; Haryono, M.; Cho, S.T.; Kuo, C.H. The Genome of ‘Candidatus Phytoplasma solani’ Strain SA-1 Is Highly Dynamic and Prone to Adopting Foreign Sequences. Syst. Appl. Microbiol. 2019 , 42 , 117–127. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Sugio, A.; Maclean, A.M.; Hogenhout, S.A. The Small Phytoplasma Virulence Effector SAP11 Contains Distinct Domains Required for Nuclear Targeting and CIN-TCP Binding and Destabilization. New Phytol. 2014 , 202 , 838–848. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Li, S. The Arabidopsis Thaliana TCP Transcription Factors: A Broadening Horizon beyond Development. Plant Signal. Behav. 2015 , 10 , e1044192. [ Google Scholar ] [ CrossRef ]
  • Oshima, K.; Kakizawa, S.; Nishigawa, H.; Jung, H.Y.; Wei, W.; Suzuki, S.; Arashida, R.; Nakata, D.; Miyata, S.I.; Ugaki, M.; et al. Reductive Evolution Suggested from the Complete Genome Sequence of a Plant-Pathogenic Phytoplasma. Nat. Genet. 2004 , 36 , 27–29. [ Google Scholar ] [ CrossRef ]
  • Lu, Y.T.; Li, M.Y.; Cheng, K.T.; Tan, C.M.; Su, L.W.; Lin, W.Y.; Shih, H.T.; Chiou, T.J.; Yang, J.Y. Transgenic Plants That Express the Phytoplasma Effector SAP11 Show Altered Phosphate Starvation and Defense Responses. Plant Physiol. 2014 , 164 , 1456–1469. [ Google Scholar ] [ CrossRef ]
  • Mittelberger, C.; Hause, B.; Janik, K. The ‘Candidatus Phytoplasma Mali’ Effector Protein SAP11CaPm Interacts with MdTCP16, a Class II CYC/TB1 Transcription Factor That Is Highly Expressed during Phytoplasma Infection. PLoS ONE 2022, 17, e0272467.
  • Boonrod, K.; Strohmayer, A.; Schwarz, T.; Braun, M.; Tropf, T.; Krczal, G. Beyond Destabilizing Activity of SAP11-like Effector of Candidatus Phytoplasma Mali Strain PM19. Microorganisms 2022, 10, 1406.
  • Murashige, T.; Skoog, F. A Revised Medium for Rapid Growth and Bio Assays with Tobacco Tissue Cultures. Physiol. Plant. 1962, 15, 473–497.
  • Bendtsen, J.D.; Nielsen, H.; Von Heijne, G.; Brunak, S. Improved Prediction of Signal Peptides: SignalP 3.0. J. Mol. Biol. 2004, 340, 783–795.
  • Clough, S.J.; Bent, A.F. Floral Dip: A Simplified Method for Agrobacterium-Mediated Transformation of Arabidopsis Thaliana. Plant J. 1998, 16, 735–743.
  • Škiljaica, A.; Jagić, M.; Vuk, T.; Leljak Levanić, D.; Bauer, N.; Markulin, L. Evaluation of Reference Genes for RT-QPCR Gene Expression Analysis in Arabidopsis Thaliana Exposed to Elevated Temperatures. Plant Biol. 2022, 24, 367–379.
  • Walter, M.; Chaban, C.; Schütze, K.; Batistic, O.; Weckermann, K.; Näke, C.; Blazevic, D.; Grafen, C.; Schumacher, K.; Oecking, C.; et al. Visualization of Protein Interactions in Living Plant Cells Using Bimolecular Fluorescence Complementation. Plant J. 2004, 40, 428–438.
  • Win, J.; Kamoun, S. PCB301-P19: A Binary Plasmid Vector to Enhance Transient Expression of Transgenes by Agroinfiltration. Plant J. 2003, 33, 949–956.
  • Jagić, M. Domain-Specific Interactions of BPM1 with DMS3 and RDM1 in RNA-directed DNA Methylation. Ph.D. Thesis, Faculty of Science, University of Zagreb, Zagreb, Croatia, 2024.
  • Win, J.; Chaparro-Garcia, A.; Belhaj, K.; Saunders, D.G.O.; Yoshida, K.; Dong, S.; Schornack, S.; Zipfel, C.; Robatzek, S.; Hogenhout, S.A.; et al. Effector Biology of Plant-Associated Organisms: Concepts and Perspectives. Cold Spring Harb. Symp. Quant. Biol. 2012, 77, 235–247.
  • Strohmayer, A.; Schwarz, T.; Braun, M.; Krczal, G.; Boonrod, K. The Effect of the Anticipated Nuclear Localization Sequence of ‘Candidatus Phytoplasma Mali’ SAP11-like Protein on Localization of the Protein and Destabilization of TCP Transcription Factor. Microorganisms 2021, 9, 1756.
  • Mittelberger, C.; Moser, M.; Hause, B.; Janik, K. ‘Candidatus Phytoplasma Mali’ SAP11-like Protein Modulates Expression of Genes Involved in Energy Production, Photosynthesis, and Defense in Nicotiana Occidentalis Leaves. BMC Plant Biol. 2024, 24, 393.
  • Rath, M.; Challa, K.R.; Sarvepalli, K.; Nath, U. CINCINNATA-like TCP Transcription Factors in Cell Growth—An Expanding Portfolio. Front. Plant Sci. 2022, 13, 825341.
  • He, Z.; Zhou, X.; Chen, J.; Yin, L.; Zeng, Z.; Xiang, J.; Liu, S. Identification of a Consensus DNA-Binding Site for the TCP Domain Transcription Factor TCP2 and Its Important Roles in the Growth and Development of Arabidopsis. Mol. Biol. Rep. 2021, 48, 2223–2233.
  • Wang, N.; Yang, H.; Yin, Z.; Liu, W.; Sun, L.; Wu, Y. Phytoplasma Effector SWP1 Induces Witches’ Broom Symptom by Destabilizing the TCP Transcription Factor BRANCHED1. Mol. Plant Pathol. 2018, 19, 2623–2634.
  • Chang, S.H.; Tan, C.M.; Wu, C.T.; Lin, T.H.; Jiang, S.Y.; Liu, R.C.; Tsai, M.C.; Su, L.W.; Yang, J.Y. Alterations of Plant Architecture and Phase Transition by the Phytoplasma Virulence Factor SAP11. J. Exp. Bot. 2018, 69, 5389–5401.
  • Bresso, E.G.; Chorostecki, U.; Rodriguez, R.E.; Palatnik, J.F.; Schommer, C. Spatial Control of Gene Expression by MiR319-Regulated TCP Transcription Factors in Leaf Development. Plant Physiol. 2018, 176, 1694–1708.
  • Riedle-Bauer, M.; Brader, G. Effects of Insecticides and Repellents on the Spread of ‘Candidatus Phytoplasma solani’ under Laboratory and Field Conditions. J. Plant Dis. Prot. 2023, 130, 1057–1074.
  • Bianco, P.A.; Romanazzi, G.; Mori, N.; Myrie, W.; Bertaccini, A. Integrated Management of Phytoplasma Diseases. In Phytoplasmas: Plant Pathogenic Bacteria—II; Springer: Singapore, 2019; pp. 237–258.
  • Liu, S.; Jaouannet, M.; Dempsey, D.A.; Imani, J.; Coustau, C.; Kogel, K.H. RNA-Based Technologies for Insect Control in Plant Production. Biotechnol. Adv. 2020, 39, 107463.
  • De Schutter, K.; Taning, C.N.T.; Van Daele, L.; Van Damme, E.J.M.; Dubruel, P.; Smagghe, G. RNAi-Based Biocontrol Products: Market Status, Regulatory Aspects, and Risk Assessment. Front. Insect Sci. 2021, 1, 818037.

| No. of A. tumefaciens Combination | Plasmid Constructs in Agroinfiltration Mixture | Type of Sample |
|---|---|---|
| 1 | pSPYNE-SAP11-like, pSPYCE-AtTCP2, pCB301-p19 | Experimental sample |
| 2 | pSPYNE-SAP11-like, pSPYCE-AtTCP4, pCB301-p19 | Experimental sample |
| 3 | pSPYNE-SAP11-like, pCB301-p19 | Negative control |
| 4 | pSPYCE-AtTCP2, pCB301-p19 | Negative control |
| 5 | pSPYCE-AtTCP4, pCB301-p19 | Negative control |
| 6 | pB7WGR2.0-EGFP-DMS3, pCB301-p19 | Positive control of agroinfiltration |

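| Plant | Gene | Cq | Cq | Average Cq | ΔCq |
|---|---|---|---|---|---|
| SAP11-like transgenic A. thaliana | ogio | 20.08 | 20.25 | 20.17 | −3.85 |
| | SAP11-like | 16.34 | 16.29 | 16.32 | |
| | ntc | 0 | 0 | 0 | Not applicable |
| Wild-type A. thaliana | ogio | 19.18 | 22.19 | 20.69 | Not applicable |
| | SAP11-like | 0 | 0 | 0 | |
| | ntc | 0 | 0 | 0 | |
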
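
The ΔCq value reported for the SAP11-like transgenic plant can be reproduced from the replicate Cq values. The following is a minimal Python sketch, assuming that ΔCq is the mean Cq of the SAP11-like target minus the mean Cq of the ogio reference. This sign convention is inferred from the tabulated value of −3.85 and is not stated explicitly in the table; the function and variable names are illustrative only.

```python
from statistics import mean

# Replicate Cq values for the SAP11-like transgenic plant, copied from the table.
cq_reference = [20.08, 20.25]  # ogio (reference gene)
cq_target = [16.34, 16.29]     # SAP11-like (target gene)


def delta_cq(target, reference):
    """Mean Cq of the target minus mean Cq of the reference (assumed convention)."""
    return mean(target) - mean(reference)


d_cq = delta_cq(cq_target, cq_reference)
print(f"delta Cq = {d_cq:.2f}")
```

Under this convention a negative ΔCq means the target reaches the detection threshold in fewer cycles than the reference gene.
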
| Measurement | WT | SAP11-like 2b | SAP11-like 3b |
|---|---|---|---|
| Fresh shoot mass (g) | 0.66 ± 0.06 | 0.27 ± 0.02 ** | 0.50 ± 0.06 * |
| Height (cm) | 36.00 ± 0.99 | 23.79 ± 0.74 ** | 28.78 ± 0.91 ** |
| Rosette diameter (cm) | 6.42 ± 0.36 | 4.48 ± 0.19 ** | 6.09 ± 0.32 |
| Length of rosette leaf (cm) | 2.19 ± 0.11 | 1.32 ± 0.05 ** | 1.94 ± 0.07 * |
| Width of rosette leaf (cm) | 1.16 ± 0.045 | 0.75 ± 0.03 ** | 1.03 ± 0.08 |
| Length of siliques (cm) | 1.34 ± 0.47 | 0.88 ± 0.23 ** | 0.89 ± 0.27 ** |
| Σ axillary shoots | 4.2 ± 0.33 | 9.56 ± 0.59 ** | 10.76 ± 0.85 ** |

Drcelic, M.; Skiljaica, A.; Polak, B.; Bauer, N.; Seruga Music, M. ‘Candidatus Phytoplasma solani’ Predicted Effector SAP11-like Alters Morphology of Transformed Arabidopsis Plants and Interacts with AtTCP2 and AtTCP4 Plant Transcription Factors. Pathogens 2024, 13, 893. https://doi.org/10.3390/pathogens13100893

COMMENTS

  1. PDF Design, Engineering, and Discovery of Transcription Factor-Based

    computational approaches for transcription factor specificity engineering to expand the breadth of potential biosensor applications, and creating chimeric transcription factors to expedite biosensor discovery and overcome the currently high levels of characterization typically required to develop biosensors.

  2. Mechanisms and biotechnological applications of transcription factors

    1. Introduction. Transcription is a crucial component of the central dogma of molecular biology (DNA-RNA-protein), serving as a bridge that translates genetic information into diverse forms at the individual level. Transcription factors (TFs) play a pivotal role in regulating the transcription of target genes by selectively recognizing and binding specific DNA regions known as TF binding ...

  3. PDF Importance of Cell Specific Transcription Factors in Sox2 Regulation

    Abstract. The Sox2 gene codes for an HMG box transcription factor that is critical for the self-renewal and pluripotency of embryonic stem (ES) cells. Although Sox2 is expressed in ES cells, the cell-type specific regulation of Sox2 transcription is only just being investigated, with a distal Sox2 Control Region (SCR) recently being discovered. In my thesis, I showed the

  4. PDF Transcription factor binding motif analyses in two biological systems

    scription factors. In this thesis, we focus on the binding of transcription factors to upstream region motifs to understand the mechanism of gene regulation. Sonic hedgehog (Shh) signals direct digit number and identity in the vertebrate limb via Gli transcription factors. We sought to identify key Gli binding motifs in

  5. The Human Transcription Factors

    Transcription factors (TFs) directly interpret the genome, performing the first step in decoding the DNA sequence. Many function as "master regulators" and "selector genes", exerting control over processes that specify cell types and developmental patterning (Lee and Young, 2013) and controlling specific pathways such as immune responses (Singh et al., 2014).

  6. PDF Control of Lens and Ciliary Epithelial Development by the LIM

    LIM-homeodomain transcription factor Lhx2 is an essential regulator of mammalian eye development. To further elucidate the role of Lhx2 in mammalian eye development, we studied ... I want to thank members of my thesis committee, Don Zack, Jeremy Nathans, and Nick Marsh-Armstrong, for taking the time out of their busy schedules to provide me ...

  7. Transcription factors: Bridge between cell signaling and gene

    Transcription factors (TFs) are key regulators of intrinsic cellular processes, such as differentiation and development, and of the cellular response to external perturbation through signaling pathways. In this review we focus on the role of TFs as a link between signaling pathways and gene regulation. Cell signaling tends to result in the ...

  8. A census of human transcription factors: function, expression and

    Transcription factors (TFs) are essential for gene expression, but very little is known about the majority of human TFs. This Analysis article provides a manually curated repertoire of sequence ...

  9. Transcription factors: from enhancer binding to developmental control

    How do transcription factors lead to defined developmental programs? The ways in which transcription factors act at enhancer elements and how enhancer activity is established during development ...

  10. A Molecular Characterization of Human General Transcription Factor IID

    The general transcription initiation factor IID plays a central role in transcriptional control as a direct target for a diverse array of gene-specific regulatory factors and as the only template-bound class II initiation factor. The work described in this thesis concerns itself with the molecular characterization of this transcription initiation factor, starting with the cloning of cDNAs from ...

  11. Identifying transcription factors with cell-type specific DNA binding

    Transcription factors (TFs) bind to gene promoters and enhancers to regulate gene expression, and are therefore major determinants of cell fate decisions, metabolic activity, and, when regulation goes awry, of disease [1,2,3]. TFs bind relatively short preferred DNA sequences, or motifs, typically 5 to 20 bases long [4, 5]. Because these motifs are so short, the human genome often harbors ...

  12. Machine learning models for functional genomics and therapeutic design

    The certified thesis is available in the Institute Archives and Special Collections. Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019 ... First, we design three machine learning models that predict transcription factor binding and DNA methylation, two fundamental epigenetic ...

  13. 50+ years of eukaryotic transcription: an expanding universe of factors

    RNA polymerase accessory factors: diversity, structure and function. Given the enormous complexity of Pols I, II and III (14, 12 and 17 subunits, respectively) relative to the initiation-competent five-subunit bacterial RNA polymerase, an early surprise to me was the inability of these enzymes to accurately initiate transcription on defined core promoter elements, especially in the case of Pol ...

  14. Transcriptional memory formation: Battles between transcription factors

    Transcriptional memory allows cells to respond to previously experienced signals in a faster, stronger, and more sensitive manner. Using synthetic biology approaches, Fan and colleagues uncovered the critical interplays between transcription factors and repressive chromatin in consolidating transcriptional memory.

  15. It takes three to tango: transcription factors bind DNA, protein, and

    A new study reveals that many transcription factors can also bind RNA, in addition to DNA and protein, using a region called an arginine-rich motif. This RNA binding helps transcription factors anchor themselves to DNA and regulate gene activity, and may have implications for RNA-based therapies.

  16. Unlocking nature's secrets: The pivotal role of WRKY transcription

    The WRKY transcription factor family is a key player in the regulatory mechanisms of flowering plants, significantly influencing both their biotic and abiotic response systems as well as being vital to numerous physiological and biological functions. Over the past two decades, the functionality of WRKY proteins has been the subject of extensive ...

  17. Position-dependent function of human sequence-specific transcription

    The effect of transcription factor binding on transcription initiation is dependent on the position of the transcription factor binding site. Patterns of transcriptional activity are encoded in ...

  18. Function and regulation of transcription factors during mitosis-to-G1

    Sequence-specific transcription factors (TFs) are key components of gene expression in virtually every biological process. Their activity is thus tightly controlled by a wide variety of mechanisms. Binding of a TF to a gene regulatory region is to a large extent dependent on its concentration. Eukaryotic cells have therefore developed a series ...

  19. PDF A Bioinformatics Study of Human Transcriptional Regulation

    Binding sites for metabolic disease related transcription factors inferred at base pair resolution by chromatin immunoprecipitation and genomic microarrays. Hum Mol Genet. 14(22):3435-47, ... tional steps, but the focus for this thesis is on regulation of transcription. During transcription, an mRNA molecule is synthesized from a template

  20. Transcription factor

    A transcription factor is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. Learn about the number, mechanism, function, and examples of transcription factors in molecular biology.

  21. Comprehensive analysis of Transcription Factors identified novel

    Background: Transcription factors (TFs) are responsible for regulating the transcription of pro-oncogenes and tumor suppressor genes in the process of tumor development. However, the role of these transcription factors in bladder cancer (BCa) remains unclear. The main purpose of this research is to explore whether these TFs can serve as biomarkers for BCa.

  22. The AP2/ERF transcription factor SmERF1L1 regulates the ...

    There is increasing market demand for these compounds. Here, we isolated and functionally characterized SmERF1L1, a novel JA (Jasmonic acid)-responsive gene encoding AP2/ERF transcription factor, from Salvia miltiorrhiza. SmERF1L1 was responsive to methyl jasmonate (MJ), yeast extraction (YE), salicylic acid (SA) and ethylene treatments.

  23. CmMYB#7, an R3 MYB transcription factor, acts as a negative ...

    We analysed the expression patterns of 91 MYB transcription factors in 'Jimba' and 'Turning red Jimba' and identified an R3 MYB, CmMYB#7, whose expression was significantly decreased in 'Turning red Jimba' compared with 'Jimba', and confirmed it is a passive repressor of anthocyanin biosynthesis. CmMYB#7 competed with CmMYB6, which together ...

  24. 'Pincer attack' on transcription factors offers new possibilities for

    The simultaneous inhibition of the transcription factors Myc and JunB could represent a pioneering therapeutic option for the treatment of multiple myeloma (MM), the second most common type of ...

  25. Isolation and characterization of AaWRKY1, an Artemisia annua

    Transient expression of AaWRKY1 cDNA in A. annua leaves clearly activated the expression of the majority of artemisinin biosynthetic genes. These results strongly suggest the involvement of the AaWRKY1 transcription factor in the regulation of artemisinin biosynthesis, and indicate that ADS is a target gene of AaWRKY1 in A. annua.

  26. 'Candidatus Phytoplasma solani' Predicted Effector SAP11-like Alters

    Phytoplasmas are obligate intracellular pathogens that profoundly modify the development, physiology and behavior of their hosts by secreting effector proteins that disturb signal pathways and interactions both in plant and insect hosts. The characterization of effectors and their host-cell targets was performed for only a few phytoplasma species where it was shown that the SAP11 effector ...