Application of tracing enhancer networks using epigenetic traits (TENET) to identify epigenetic deregulation in cancer
Copyright 2022 Daniel Mullen
Application of Tracing Enhancer Networks using Epigenetic Traits
(TENET) to identify epigenetic deregulation in cancer
By
Daniel Mullen
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirement for the Degree
DOCTOR OF PHILOSOPHY
(CANCER BIOLOGY AND GENOMICS)
December 2022
Dedication
To all those,
family, friends, mentors, peers, and students
who’ve taught me so much and helped me along the way.
I couldn’t be here without you
Thank you
Acknowledgements
First, I’d like to say thank you to my committee members and other faculty at USC who have
taught and supported me during my time as a Ph.D. student. I first want to give special mention to Dr.
Peggy Farnham, who was my first mentor and advisor at USC when I was a young Ph.D. student starting
out my journey at USC in the summer of 2015. Not only was Peggy’s lab my first experience working at
USC, it was also my first experience working in cancer research and epigenetics, two things that
continue to excite me to this day! I’d also like to thank Dr. Kimberly Siegmund for giving me my first
experience with biostatistics and always having a willing ear for all my questions over the years. Of
course, I also want to thank Dr. Suhn Rhie not only for originally developing the TENET method, but also
for being my mentor in all things bioinformatics for the past several years. It has been an enriching
experience to learn and grow as a scientist alongside the Rhie lab itself over the past few years.
I extend my utmost gratitude and appreciation to my mentor and PI Dr. Ite Offringa for the
opportunities you’ve extended me over the years. If you hadn’t offered me a spot in the Offringa lab, as
an aspiring student who wanted to learn bioinformatics without having any skills at the time, I don’t
think I would have made it to the point I am today. Thank you not only for creating a space for me to
learn, but also teaching me to do science well, and creating a positive environment in the Offringa lab
which I will never forget.
I’d also like to thank all the members of the Offringa, Marconett, and Rhie labs who I’ve worked
with these past several years; everyone I’ve mentored, who has mentored me, who I’ve had the
pleasure of just hanging out with. I also want to shout out our incredible lab manager Chunli Yan who,
besides being essential to the smooth operation of the lab in general, performed
the wet-lab experiments, such as siRNA knockdown and RNA-seq, for my first first-author paper.
I also extend my gratitude to the number of excellent educators I’ve had leading up to being a
Ph.D. student, from Valhalla and Star Lake Elementary schools, to the Federal Way Public Academy, to
the University of Washington. I especially want to thank educators like Mrs. Draegar, Mr.
Klumpenhower, Mr. Young, and especially Mr. Lauer, as well as Dr. Charity Urbanski, who particularly
helped and encouraged my learning throughout the years. I also want to thank Chris and Martin Kratt,
whose TV show, Kratts’ Creatures, encouraged a then four-year-old Daniel to explore the natural world
around him.
I also want to thank all my friends who have made a positive impact on my life and sanity over
the years. I especially want to acknowledge Lars St. Pierre, who’s been a part of my Ph.D. journey as a
cohort member, friend, and roommate since pretty much the beginning.
I also want to thank my family and especially my parents Keith and Lisa Mullen. Without their
unconditional support and the upbringing they have given me I wouldn’t be where I am today.
And speaking of friends, family, and then a bit more, I also want to acknowledge my wonderful
girlfriend Dr. Evelyn Tran, who is the best thing to happen to me on this Ph.D. journey, if not ever. I’m
looking forward to spending more time with her post-Ph.D.!
Table of Contents
Dedication
Acknowledgements
List of tables
List of figures
List of supplemental figures
List of supplemental data
Abstract
Chapter 1: Introduction
Chapter 2: TENET 2.0: Identification of key transcriptional regulators and enhancers in lung adenocarcinoma
2.1: Abstract
2.2: Materials and methods
2.2.1: Ethics statement:
2.2.2: Cell culture (Work done with: CY and DSK):
2.2.3: siRNA knockdown and RNA-seq (Work done with: CY):
2.2.4: ChIP-seq (Work done with: DSK):
2.2.5: ATAC-seq (Work done with: DSK):
2.2.6: DNase-seq:
2.2.7: TENET program update and settings:
2.2.8: Heatmaps:
2.2.9: Expression/correlation analyses:
2.2.10: Survival analyses:
2.2.11: Genetic alteration and mutation count analysis:
2.2.12: Identification of potential target genes for CENPA/FOXM1/MYBL2-linked probes in LUAD and BRCA:
2.2.13: Motif analysis:
2.2.14: Hi-C analysis:
2.3: Results and discussion
2.3.1: Identification of differentially activated enhancers in normal lung versus lung adenocarcinoma:
2.3.2: Identification of key transcriptional regulators dysregulated in lung adenocarcinoma:
2.3.3: Identification of transcriptional regulators whose expression is associated with poor patient survival:
2.3.4: CENPA, FOXM1, and MYBL2 are activated in a subgroup of lung adenocarcinoma and breast adenocarcinoma:
2.3.5: Identification of CENPA/FOXM1/MYBL2-linked enhancers associated with poor patient survival and their potential target genes:
2.3.6: Discussion:
2.4: Conclusions
2.5: Supplemental files
Chapter 3: Development and application of TENETR to identify dysregulated transcription factors and linked regulatory elements across cancer types
3.1: Abstract
3.2: Materials and methods
3.2.1: Selection and downloading of TCGA datasets (Work done with: LH):
3.2.2: TCGA_downloader() function (Work done with: ENM):
3.2.3: Collection of external cancer type-specific epigenomic datasets (Work done with: ZW, LH, RP, HC):
3.2.4: Peak calling and final preparation of external cancer type-specific epigenomic datasets (Work done with: ZW, LH, RP, HC):
3.2.5: R package creation:
3.2.6: TENETR.data package and datasets:
3.2.7: TENETR package and functions:
3.2.7.1: step1_make_external_datasets() function:
3.2.7.2: step2_get_diffmeth_regions() function:
3.2.7.3: step3_get_analysis_z_scores() function:
3.2.7.4: step4_permutate_z_scores() function:
3.2.7.5: step5_optimize_links() function:
3.2.7.6: step6_probe_per_gene_tabulation() function:
3.2.8: Use of USC Center for Advanced Research Computing (CARC) high performance computing cluster (HPCC):
3.2.9: Linked hypermethylated and hypomethylated probe similarity calculation:
3.2.10: RBO settings and comparisons:
3.3: Results and discussion
3.3.1: Identification of regulatory elements across cancer types:
3.3.2: Classification and count of distal enhancer probes:
3.3.3: Identification of key TFs linked to hypomethylated distal enhancer probes:
3.3.4: RBO analysis of TENETR settings and comparisons:
3.4: Conclusions
3.5: Supplemental files
Chapter 4: Downstream analysis of TENETR results and identification of individual findings of interest for further validation
4.1: Abstract
4.2: Materials and methods
4.2.1: TENETR step7 functions:
4.2.1.1: step7_linked_probe_motif_searching():
4.2.1.2: step7_selected_probes_simple_scatterplots():
4.2.1.3: step7_states_for_links():
4.2.1.4: step7_top_genes_circos():
4.2.1.5: step7_top_genes_complex_scatterplots():
4.2.1.6: step7_top_genes_cox_survival():
4.2.1.7: step7_top_genes_experimental_vs_control_expression_boxplots():
4.2.1.8: step7_top_genes_expression_correlation_heatmaps():
4.2.1.9: step7_top_genes_histograms():
4.2.1.10: step7_top_genes_met_heatmaps():
4.2.1.11: step7_top_genes_overlapping_linked_probe_heatmaps():
4.2.1.12: step7_top_genes_simple_scatterplots():
4.2.1.13: step7_top_genes_survival():
4.2.1.14: step7_top_genes_TAD_tables():
4.2.1.15: step7_top_genes_UCSC_bed_files():
4.2.1.16: step7_top_genes_user_peak_overlap():
4.2.2: Additional TCGA clinical covariates:
4.2.3: Upset plots:
4.2.4: Radar (spider) plots:
4.2.5: Heatmaps:
4.2.6: Venn diagrams (Work with: LH):
4.3: Results and discussion
4.3.1: Selection of highly ranked TF panel:
4.3.2: Clustering TFs by expression and ranking patterns:
4.3.3: Survival analysis of highly ranked TF panel:
4.3.4: Survival analysis of DNA methylation probes linked to highly ranked TF panel:
4.3.5: Selecting a single TF-to-probe link to illustrate further downstream analyses:
4.3.6: Expression and methylation patterns of MYBL2 and cg20246907 across tumors:
4.3.7: Investigation of MYBL2 and cg20246907 survival association with clinical covariates:
4.3.8: Identification of potential downstream targets for enhancer marked by cg20246907:
4.4: Conclusions
4.5: Supplemental files
Chapter 5: Conclusions and future directions for TENET study
5.1: Conclusions
5.1.1: TENET 2.0 findings in LUAD:
5.1.2: The TENETR package:
5.1.3: TENETR pan-cancer findings:
5.2: Future developments for the TENETR package
5.2.1: Develop TENETR for inclusion on Bioconductor:
5.2.2: Add a sparse analysis option:
5.2.3: Combine TENET probe classification with regression analysis to link TFs to probes:
5.3: Additional bioinformatic analyses
5.3.1: Further analysis of data generated by TENETR across 12 cancer types:
5.3.2: Integrate tumor purity, mutational burden, and cancer type-specific molecular alterations into analyses:
5.3.3: Integrate TF binding prediction and ChIP-seq data to prioritize TF to probe links:
5.3.4: Utilize other bioinformatic programs to help identify target genes for linked probes:
5.3.5: Use machine learning methods to develop a model to predict patient survival based on key DNA methylation probes identified with TENETR:
References:
Appendix: Publications as a contributing author
List of tables
Table 2-1: Top TRs identified in LUAD by TENET 2.0 are functionally relevant
Table 3-1: Format of expData and metData objects
Table 4-1: Encoding of TCGA smoking history
Table 4-2: Ranks of 8 TFs in selected cluster across cancer types
Table 4-3: Status of cg20246907 across cancer types
Table 4-4: Complex Cox regression survival analysis of MYBL2 expression in KIRP
Table 4-5: Complex Cox regression survival analysis of cg20246907 methylation in KIRP
List of figures
Figure 1-1: TENET methodology overview
Figure 2-1: A workflow of TENET 2.0
Figure 2-2: Identification of differentially-methylated enhancer probes
Figure 2-3: Identification of key dysregulated transcriptional regulators in LUAD
Figure 2-4: Interaction of key transcriptional regulators activated in LUAD
Figure 2-5: CENPA, FOXM1, and MYBL2 are highly expressed in tumors and associated with poor patient survival
Figure 2-6: LUAD and BRCA subgroups with activated CENPA, FOXM1, and MYBL2-linked enhancers
Figure 2-7: Examples of CENPA/FOXM1/MYBL2-linked enhancer probes associated with survival rate
Figure 2-8: Identification of genes regulated by FOXM1 and MYBL2
Figure 3-1: TCGA datasets used in TENETR analysis
Figure 3-2: TENETR function pipeline
Figure 3-3: TENETR workflow to identify DNA methylation probes marking regulatory elements
Figure 3-4: TENETR methylation cutoffs
Figure 3-5: Methylation cutoff setting
Figure 3-6: Workflow of the methylation cutoff setting algorithm
Figure 3-7: Z-score calculation and classification of gene-to-probe links
Figure 3-8: Calculation of empirical p-values
Figure 3-9: Optimization of gene-to-probe links
Figure 3-10: Count of cancer type-specific external epigenomic datasets used in study
Figure 3-11: Number of distal enhancer DNA methylation probes identified
Figure 3-12: Number of hypermethylated or hypomethylated distal enhancer probes identified
Figure 3-13: Similarity of hypermethylated and hypomethylated probes between cancer types
Figure 3-14: Top transcription factors linked to hypomethylated probes in individual cancer types
Figure 3-15: Comparison of TF-only vs. all genes TENETR run times
Figure 3-16: RBO values of TENETR comparisons
Figure 4-1: TFs among the top 10 in multiple cancer types
Figure 4-2: Top 20 TFs by total rankings across 12 cancer types
Figure 4-3: Individual cancer rankings of top 20 TFs by total rankings across 12 cancer types
Figure 4-4: Selecting TF cluster based on expression Z-scores across cancer types
Figure 4-5: Clustering TFs by ranks in 12 cancer types
Figure 4-6: Individual cancer rankings of E2F and FOX family members
Figure 4-7: Survival analysis of 108 TFs in BRCA
Figure 4-8: Survival analysis of 108 TFs in KIRP
Figure 4-9: Survival analysis of 108 TFs in LUAD
Figure 4-10: Survival analysis of probes linked to 108 TFs in BRCA
Figure 4-11: Survival analysis of probes linked to 108 TFs in KIRP
Figure 4-12: Survival analysis of probes linked to 108 TFs in LUAD
Figure 4-13: Overlap of probes between cancer types of interest
Figure 4-14: Survival curves for MYBL2 expression in BRCA, KIRP, and LUAD tumors
Figure 4-15: Survival curves for cg20246907 methylation in BRCA, KIRP, and LUAD tumors
Figure 4-16: MYBL2 is broadly overexpressed in BRCA, KIRP, and LUAD tumor samples
Figure 4-17: cg20246907 methylation is decreased in some BRCA, KIRP, and LUAD tumor samples
Figure 4-18: Relationship of MYBL2 expression to cg20246907 methylation across cancer types of interest
Figure 4-19: Survival curves for MYBL2 and cg20246907 in Stage I KIRP
Figure 4-20: Analysis of HMGB3P4 as a potential target gene of the cg20246907 enhancer
Figure 4-21: Analysis of LINC01232 as a potential target gene of the cg20246907 enhancer
List of supplemental figures
Supplemental Figure 2-1: TENET 2.0 pictorial workflow
Supplemental Figure 2-2: Correlation analyses of key transcriptional regulators activated in lung cancer using ORIEN datasets
Supplemental Figure 2-3: Expression of top 12 transcriptional regulators activated in LUAD
Supplemental Figure 2-4: Survival analysis of top 12 transcriptional regulators activated in LUAD
Supplemental Figure 2-5: Replication of association of expression of highly-linked oncogenic transcriptional regulators with patient survival in LUAD using Kaplan-Meier Plotter
Supplemental Figure 2-6: Smoking history is associated with CENPA, FOXM1, and MYBL2 expression in TCGA samples
Supplemental Figure 2-7: Association of total active enhancer links with expression of three activated transcriptional regulators and common LUAD mutations
Supplemental Figure 2-8: Association of links to CENPA, FOXM1, and MYBL2-linked enhancers with clinical data and subgroup analyses
Supplemental Figure 2-9: Key transcriptional regulators identified in LUAD vs. BRCA and comparison of CENPA/FOXM1/MYBL2-linked probes in each dataset
Supplemental Figure 2-10: Hi-C diagrams from A549 and GM12878 cells showing the cg09580922 and TK1 genomic region
Supplemental Figure 3-1: Mean methylation density curves of distal enhancer probes in all 12 cancer types
Supplemental Figure 3-2: Representative promoter probe mean methylation density curve utilizing promoter probes found with overlapping histone modification and open chromatin regions
Supplemental Figure 3-3: Representative promoter probe mean methylation density curve utilizing all promoter probes
Supplemental Figure 3-4: Pictorial schematic of RBO analysis
List of supplemental data
Supplemental Data 3-1: List of TCGA sample barcodes used in TENETR study
Supplemental Data 3-2: HPCC scripts to submit TENETR runs
Supplemental Data 3-3: Tables of methylation cutoff values and methylation probe classification counts across 12 cancer types
Supplemental Data 3-4: Tables of identified methylation probes across 12 cancer types
Supplemental Data 3-5: Table of top TFs and number of linked hypomethylated probes across 12 cancer types
Supplemental Data 3-6: Tables of RBO values
Supplemental Data 4-1: Table of top TFs by overall rank across 12 cancer types
Supplemental Data 4-2: Table of 108 TFs highly ranked within or across 12 cancer types
Abstract
Although genetic alterations are known as key drivers of cancer, epigenetic alterations also play
key roles in tumor development. Of particular interest are alterations to the expression of transcription
factors (TFs) and subsequent changes to transcriptional regulatory networks, which induce widespread
downstream effects in cells. To identify these alterations, I facilitated the development of TENET 2.0 and
then TENETR, which are updated versions of the bioinformatic method Tracing Enhancer Networks using
Epigenetic Traits (TENET). These methods utilize ChIP-seq and open chromatin datasets to
identify DNA methylation probes in regulatory elements, then use the DNA methylation levels of those
probes as a surrogate for the activity of the regulatory elements. Combining these epigenomic datasets
with RNA-seq datasets, the TENET methods identify TFs whose expression levels are related, or “linked”
to the DNA methylation level of each regulatory element probe.
TENET 2.0 was utilized to identify dysregulated TFs and linked enhancers in lung
adenocarcinoma (LUAD), including key TFs such as CENPA, FOXM1, and MYBL2. I followed up on this by
developing TENETR, an updated R package variation of TENET which possesses several new features to
increase its ease of use and applicability compared to TENET 2.0. TENETR was used to identify numerous
enhancer elements along with top TFs linked to activated enhancers in twelve selected cancer types. To
illustrate downstream analyses which can be performed using TENETR, I identified a panel of 108 TFs
which were highly ranked within or across the 12 cancer types. These included a cluster of 8 TFs which
were highly correlated in expression and rank, and which were particularly important in BRCA, KIRP, and
LUAD. Survival analysis revealed that expression of TFs within this cluster and methylation of their
linked methylation probes, including MYBL2 and its linked probe cg20246907, are strongly associated
with poor patient survival. cg20249607 is particularly strongly associated with KIRP patient survival even
when considering clinical covariates and could potentially predict survival even in stage I tumors.
Chapter 1: Introduction
The overall lifetime risk for developing cancer for an individual in the United States is
approximately 40% in men, and over 38% in women
1
and it is estimated that over 600,000 Americans
will die from cancer in 2022
2
. Although it is the second most commonly diagnosed form of cancer, lung
cancer is predicted to result in more patient deaths than any other cancer type in the United States,
accounting for over 20% of all cancer-related deaths in both men and women
2
. As we recover from the
COVID-19 pandemic, understanding the initiation and progression of cancer and developing methods to
combat these processes will be of particular importance, because the pandemic reduced access to early
cancer screenings and treatment
3
.
Across cancer types, a number of key driver mutations have been identified. Some are
commonly observed in many cancer types, including alterations to tumor suppressor genes and
oncogenes such as TP53, PIK3CA, KRAS, BRAF, and many others
4
. Other mutations are known to be more
cancer type-specific, such as APC in colon adenocarcinoma
5
, and BRCA1 and BRCA2 in breast
adenocarcinoma
6
as well as ovarian carcinoma
7
. Identifying key driver mutations is important because
they can lead to the development of targeted therapies, such as the tyrosine kinase inhibitors afatinib,
erlotinib, gefitinib, and osimertinib, and monoclonal antibodies cetuximab, necitumumab,
nimotuzumab, and panitumumab
8
, which all target EGFR, a commonly mutated and/or amplified gene in
lung adenocarcinoma
9
and several other cancer types.
However, driver mutations are not the end-all-be-all when it comes to our understanding of
cancer etiology. In many cancer types, the most prevalent driver mutations may only be identified in less
than half of the tumor samples surveyed, and in many tumor samples a driver mutation is often not
found. Additionally, the presence of a driver mutation does not preclude other factors from playing
important roles in the initiation or progression of tumors. This suggests a potential role for other types
of molecular alterations in these tumors, such as changes in gene expression patterns and other
associated changes in the epigenomes of tumors.
On a certain level, these changes have long been known in cancer, and have themselves been
used to classify tumor subtypes and have been considered as targets for therapies. Breast
adenocarcinomas, for instance, have long been classified based on their expression of genes encoding
estrogen receptor (ER), progesterone receptor (PR), and erb-b2 receptor tyrosine kinase 2 (ERBB2, also
commonly called HER2), and these classifications are known to both predict patient survival, as well as
inform treatment
10
. Direct molecular changes to the epigenome, such as patterns of DNA methylation,
can also be used to classify tumor subgroups. This includes the CpG island methylator phenotype
(CIMP+) in colorectal cancers, which show broadly increased levels of methylation across the genomes
of these tumors
11,12
, as well as differential response to certain chemotherapies, including irinotecan
13
.
Recent research has also shown that the presence of classical driver mutations in a tumor could be
underpinned by the epigenomic state of the cells of origin for that cancer type
14
, which further
emphasizes the need to better understand the epigenomic alterations that can contribute to cancer
development.
Epigenomic features do not affect the DNA sequence itself but can induce significant changes in
cellular behavior by altering how the DNA is accessed and used, and typically work to regulate the
expression of genes. There are two broad classes of functional regulatory elements which control
transcription activation: promoters (located close to the transcriptional start sites of the genes they
regulate)
15,16
and enhancers (which can be located at large distances from the transcriptional start
sites of the genes they target to regulate). Promoters generally bind elements of the RNA polymerase
machinery directly
17
, but both elements bind TF proteins, which themselves help to recruit elements of
the RNA polymerase machinery and thus regulate gene expression patterns
18,19
. Though both types of
elements also display cell-type-specific behavior
20
, enhancers are critical to study because they can
regulate multiple target genes
21–23
, are often very closely tied to cell state determination
24–28
, have been
linked to cancer development
29–32
, and our understanding of how to identify enhancers and predict their
activity is still evolving
33
.
TENET 2.0
34
and TENETR are built upon the previous TENET
35
version. The various TENET
methods were developed as bioinformatic tools to identify dysregulated TFs and associated enhancer
regions in tumors. When considering epigenomic alterations in tumors, TENET primarily assesses the
pathway upstream of regulatory elements, identifying differentially expressed TFs in tumors and
“linking” them statistically to the activity of the regions they might bind to. These TFs are of particular
interest because of their ability to regulate numerous downstream genes through the regulatory
elements they themselves target. Changes in expression to just one TF gene can result in alterations in
the binding patterns of that TF protein to a large number of regulatory elements in the genome, leading
to subsequent changes in expression of many downstream genes regulated in turn by those elements
(Figure 1-1A). The TENET method focuses particularly on identifying such changes that occur in a subset
of tumor samples of a cancer type.
To identify regulatory elements, TENET uses molecular markers long known to be associated
with different classes of these elements. These include histone 3 lysine 27 acetylation (H3K27ac), a
marker of active enhancers
36,37
, histone 3 lysine 4 monomethylation (H3K4me1), a mark found in the
vicinity of poised and active enhancers (poised enhancers could become active during
tumorigenesis)
37,38
, and histone 3 lysine 4 trimethylation (H3K4me3), a mark of active promoters
36,39
.
Regions of open chromatin, where nucleosomes have been remodeled to allow easier access to the DNA
by various proteins in active regions of the genome, are also used to identify regulatory elements
40–42
.
However, assessing the activity of regulatory elements using these marks can be technically
challenging, especially in a high-throughput manner. Additionally, while the molecular markers
described above are useful to identify regulatory regions in the genome, they are not necessarily
accurate indicators of their activity
43,44
. Thus, to assess such elements, TENET uses DNA methylation levels
as a surrogate for their activity. DNA methylation levels are relatively easy to assay and are known to be
inversely related to the activity of regulatory elements
45–47
. TENET analyses used data from the Illumina
450K probe arrays, which measure methylation levels at 485,577 CpGs in the human genome
48
. This
DNA methylation data is paired with matched gene expression data from both adjacent normal and
tumor samples to identify significant gene-to-probe links displaying a strong relationship between the
expression of a given gene and the DNA methylation level of the probe in question within a subset of the
tumor samples (Figure 1-1B).
Figure 1-1: TENET methodology overview
(A) TENET methods are designed to identify dysregulated TF genes in tumor samples and link them
to the downstream regulatory elements they bind to. Dysregulated TFs can lead to widespread
changes in downstream gene expression. (B) TENET uses matched DNA methylation and RNA-seq
data from both normal and tumor samples to identify the DNA methylation probes which mark the
activity of enhancer regions that show large differences in methylation in a subset of the tumor
sample population compared to the normal samples, and inversely large differences in expression
of a given gene, thus “linking” that probe to that gene. TENET performs such calculations on all
genes in the input data, and each differentially methylated probe, comparing differences in
experimental (here, tumor) and control (here, adjacent normal) samples.
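As an illustration of the linking step summarized in Figure 1-1B, the following minimal Python sketch scores a single gene-probe pair. It is not the TENET code: the function name `link_score`, the fixed hypomethylation cutoff, and the simple Z-score comparing hypomethylated tumors with the remaining tumors are assumptions made for this example, and the values are toy data.

```python
from statistics import mean, stdev

def link_score(tumor_meth, tumor_expr, hypo_cutoff):
    """Z-score of mean expression in hypomethylated tumors vs. the other tumors.

    Hypothetical helper for illustration only. Returns None when either
    group is too small or the reference group has no spread.
    """
    hypo = [e for m, e in zip(tumor_meth, tumor_expr) if m < hypo_cutoff]
    rest = [e for m, e in zip(tumor_meth, tumor_expr) if m >= hypo_cutoff]
    if not hypo or len(rest) < 2:
        return None
    spread = stdev(rest)
    if spread == 0:
        return None
    return (mean(hypo) - mean(rest)) / spread

# Toy data: three tumors lose methylation at the probe and gain expression.
probe_beta = [0.90, 0.85, 0.20, 0.15, 0.88, 0.10]
gene_expr = [1.0, 1.2, 5.0, 6.0, 0.8, 5.5]
score = link_score(probe_beta, gene_expr, hypo_cutoff=0.3)
```

A large positive score indicates that the subset of tumors with low methylation at the probe also shows high expression of the gene, which is the inverse relationship a hypomethylated link is meant to capture.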
TENETR is newly designed as an R package implementation of the TENET method, with updates
to make it easier to use and increase its applicability. As an R package, TENETR is easier to download and
use by installing it from its associated GitHub pages (https://github.com/rhielab/TENETR)
(https://github.com/rhielab/TENETR.data), which include annotated functions, datasets, and readmes.
New features have been added to TENETR which were not present in previous versions of the method,
including new consensus enhancer and open chromatin datasets I compiled for the method itself, as
well as the annotated regulatory element datasets from the ENCODE Consortium’s SCREEN (Search Candidate
cis-Regulatory Elements by ENCODE) project. The method has also been updated to use GENCODE v36-
annotated gene expression data, keeping up to date with the current formatting of The Cancer Genome
Atlas (TCGA) data. New functionalities have also been added, including options allowing the user to
analyze data from four different DNA methylation platforms (the EPIC+
49
, EPIC
50
, HM450
48
, and HM27
51
arrays), assess promoter as well as distal enhancer regions, and perform analyses using TF genes only to
vastly increase TENETR processing speed, as well as a new algorithm to automatically set methylation cutoffs
for enhancer analyses.
To demonstrate the method, TENETR was used to analyze 12 different cancer datasets downloaded
from TCGA. To facilitate these analyses, we also compiled an extensive set of histone modification and
open chromatin regions specific to each cancer type to include along with the epigenomic datasets built
into TENETR. We then identified DNA methylation probes marking potential enhancer regions in each
cancer type, with a focus on those found to be hypomethylated in the tumor samples, representing
activated enhancers in those cancers. The top TFs, by number of linked hypomethylated probes, were
also identified. Using these results, we investigated how altering the parameters and datasets affected
the output results using Rank-Biased Overlap (RBO) analyses.
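RBO compares two ranked lists while weighting agreement near the top of the lists more heavily than agreement further down. The following is a minimal sketch, assuming the analyses follow the standard rank-biased overlap definition truncated at the length of the shorter list; the function name, the persistence parameter `p`, and the example rankings are illustrative and are not values from this work.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two ranked lists (no extrapolation).

    At each depth d, the agreement is the fraction of items shared by the
    top-d prefixes; agreements are combined with geometric weights p**(d-1).
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        weighted_sum += p ** (d - 1) * agreement
    return (1 - p) * weighted_sum

# Illustrative rankings of TFs from two hypothetical parameter settings.
run_1 = ["CENPA", "FOXM1", "MYBL2"]
run_2 = ["CENPA", "MYBL2", "FOXM1"]
run_3 = ["MYBL2", "FOXM1", "CENPA"]
similar = rbo_truncated(run_1, run_2, p=0.5)
dissimilar = rbo_truncated(run_1, run_3, p=0.5)
```

Because the lists are truncated, the score of two identical finite lists is below 1; it is the relative ordering of scores between comparisons that is informative in this sketch.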
Chapter 2: TENET 2.0: Identification of key transcriptional regulators and
enhancers in lung adenocarcinoma
The work in this chapter reflects the work my co-authors and I did for my first first-author
paper, which utilized TENET 2.0 to identify the key TFs CENPA, MYBL2, and FOXM1 in lung
adenocarcinoma. The writing here is largely that of my paper
34
(Mullen, D.J., Yan, C., Kang, D.S., Zhou, B.,
Borok, Z., Marconett, C.N., Farnham, P.J., Offringa, I.A., and Rhie, S.K. (2020). TENET 2.0: Identification of
key transcriptional regulators and enhancers in lung adenocarcinoma. PLoS Genet 16, e1009023),
though I have incorporated elements of the introduction of the paper into the introduction to my thesis
above, and I have added a brief commentary at the end of this chapter in a “Concluding thoughts”
section to explain my perspective after developing TENET 2.0 and my reasoning for why I proceeded
with the development of the TENETR method I primarily discuss in the remaining chapters of my thesis.
For this paper, DJM was responsible for bioinformatic analyses, software creation, dataset
creation, figure creation, key idea contributions, experimental design, writing, and editing. CY performed
in vitro siRNA knockdown and RNA-seq experiments, and DSK performed in vitro ChIP-seq and ATAC-seq
experiments. BZ, ZB, CNM, and PJF provided resources for the project. IAO and SKR contributed
resources, provided key ideas and experimental design, helped design figures, and contributed to writing and
editing.
2.1 Abstract
Lung cancer is the leading cause of cancer-related death and lung adenocarcinoma is its most
common subtype. Although genetic alterations have been identified as drivers in subsets of lung
adenocarcinoma, they do not fully explain tumor development. Epigenetic alterations have been
implicated in the pathogenesis of tumors. To identify epigenetic alterations driving lung
adenocarcinoma, we used an improved version of the Tracing Enhancer Networks using Epigenetic Traits
method (TENET 2.0) in primary normal lung and lung adenocarcinoma cells. We found over 32,000
enhancers that appear differentially activated between normal lung and lung adenocarcinoma. Among
the identified transcriptional regulators inactivated in lung adenocarcinoma vs. normal lung, NKX2-1 was
linked to a large number of silenced enhancers. Among the activated transcriptional regulators
identified, CENPA, FOXM1, and MYBL2 were linked to numerous cancer-specific enhancers. High
expression of CENPA, FOXM1, and MYBL2 is particularly observed in a subgroup of lung
adenocarcinomas and is associated with poor patient survival. Notably, CENPA, FOXM1, and MYBL2 are
also key regulators of cancer-specific enhancers in breast adenocarcinoma of the basal subtype, but they
are associated with distinct sets of activated enhancers. We identified individual lung adenocarcinoma
enhancers linked to CENPA, FOXM1, or MYBL2 that were associated with poor patient survival.
Knockdown experiments of FOXM1 and MYBL2 suggest that these factors regulate genes involved in
controlling cell cycle progression and cell division. For example, we found that expression of TK1, a
potential target gene of a MYBL2-linked enhancer, is associated with poor patient survival. Identification
and characterization of key transcriptional regulators and associated enhancers in lung adenocarcinoma
provides important insights into the deregulation of lung adenocarcinoma epigenomes, highlighting
novel potential targets for clinical intervention.
2.2 Materials and methods
2.2.1 Ethics statement:
Remnant human transplant lungs were obtained in compliance with USC Institutional Review
Board protocol, approved for the use of human source material in research (HS-07-00660). As donors
were deceased and de-identified, no patient consent was obtained or necessary.
2.2.2 Cell culture (Work done with: CY and DSK):
Human lung adenocarcinoma A549 cells (Cat # CCL-185, ATCC, Gaithersburg, MD) were grown at
37°C with 5% CO2 in RPMI 1640 (Cat #10-040-CV, Corning, NY, USA) supplemented with 10% fetal
bovine serum (FBS) (Cat # FBS-500, X&Y Cell Culture, MI, USA) and 100 units/ml of
penicillin/streptomycin (formulated by Norris Comprehensive Cancer Center Media Core, CA, USA).
Human AT2 cells were isolated from remnant transplant lung from deceased de-identified non-smoking
donors in compliance with USC Institutional Review Board protocol for the use of human source
material in research (HS-07-00660). As donors were deceased and de-identified, no patient consent was
obtained or necessary. Lungs were processed as previously described
52
. The three donors were 25, 62,
and 67-year-old males who died of non-lung related causes. AT2 cells were isolated from the samples,
plated in 50% DMEM/F12 (Cat #D64421, Sigma, MO, USA), 50% DMEM high glucose (Cat #21063, GIBCO,
MA, USA), supplemented with 10% FBS, penicillin/streptomycin, 50 µg/ml gentamycin (Cat #G1272,
Sigma, MO, USA) and 2.5 µg/ml amphotericin (Cat #A2411, Sigma, MO, USA), to allow differentiation to
AT1-like cells, and isolated at three different time points (D0, D4, D6) as previously noted
52,53
(https://doi.org/10.1371/journal.pgen.1009023.s011).
2.2.3 siRNA knockdown and RNA-seq (Work done with: CY):
A549 cells were transfected in quadruplicate with 100 nM of ON-TARGETplus siRNA
oligonucleotides for human FOXM1 (Cat # L-009762-00-005, Dharmacon-Horizon Discovery,
UK), MYBL2 (Cat # L-010444-00-005, Dharmacon-Horizon Discovery, UK), both, or non-targeting
control (Cat # D-001810-10-05, Dharmacon-Horizon Discovery, UK), mixed with 5X siRNA buffer (Cat #
B-002000-UB-100, Dharmacon-Horizon Discovery, UK) and transfected using DharmaFECT 1
Transfection reagent (Cat # T-2001-01, Dharmacon-Horizon Discovery, UK). Cells were transfected,
cultured for 24 hours, and transfected again with the same concentration of siRNA, then incubated for
an additional 24 hours before RNA was extracted using the Aurum Total RNA Mini Kit (Cat # 7326820,
Bio-Rad, CA, USA). cDNA was synthesized using an iScript cDNA Synthesis Kit (Cat # 1708891, Bio-Rad,
CA, USA) and expression levels of FOXM1 and MYBL2 were checked with qRT-PCR using SYBR Green
Supermix (Cat # 1708886, Bio-Rad, CA, USA) with the listed primers
(https://doi.org/10.1371/journal.pgen.1009023.s021). RNA-seq was performed using 150 bp paired-end
sequencing using an Illumina HiSeq 4000 (GENEWIZ, South Plainfield, NJ, USA) for the single gene
knockdown experiments, and using 100 bp paired-end sequencing using an Illumina NovaSeq 6000
(MedGenome, Foster City, CA, USA) for the double knockdown. RNA-seq reads were aligned to the
human reference genome hg38 using the Genomic Data Commons Bioinformatics mRNA analysis
pipeline. Read counts were generated for GENCODE v22 genes
54
using the htseq-count function
55
.
Differentially expressed genes were called using DESeq2
56
with the lfcShrink function
57
. Gene ontology
analyses were performed using PANTHER
58
(https://doi.org/10.1371/journal.pgen.1009023.s018)
(https://doi.org/10.1371/journal.pgen.1009023.s022). Datasets were deposited in the public GEO
database GSE143145 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE143145).
2.2.4 ChIP-seq (Work done with: DSK):
ChIP-seq was performed on the D0, D4, and D6 AECs isolated from the 25-year-old and 62-year-
old male subjects using H3K27ac antibody (Cat # 39133, Active Motif, CA, USA), as previously described
52,53
. The ChIP-seq library from the 25-year-old individual was sequenced using 50 bp single-end reads on
an Illumina HiSeq 2000 (https://doi.org/10.1371/journal.pgen.1009023.s011). Two technical replicates
of A549 H3K27ac ChIP-seq data and two replicates of H3K27ac ChIP-seq data from lung tissue from a 53-
year-old female donor generated by the ENCODE Consortium
59,60
were used. H3K27ac ChIP-seq data
from two additional lung tissue samples from 30-year-old female and 3-year-old male donors generated
by the ROADMAP Consortium
61,62
were also included
(https://doi.org/10.1371/journal.pgen.1009023.s011). Finally, H3K27ac ChIP-seq data collected from 12
lung cancer lines from the DBTSS were downloaded and processed as well
63
. ChIP-seq reads were
aligned to the human reference genome hg38 and reproducible peaks were called, following the
ENCODE ChIP-seq pipeline
64
(https://doi.org/10.1371/journal.pgen.1009023.s022). Previously
unpublished datasets were deposited in the public GEO database GSE143145
(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE143145).
2.2.5 ATAC-seq (Work done with: DSK):
Intact nuclei from D0 AT2 cells were isolated from the 67-year-old male subject utilizing the
protocol from Buenrostro et al.
65
. Briefly, intact nuclei were isolated and incubated with Tn5
transposase (Cat # FC-121-1030, Illumina, CA, USA). The transposed DNA was extracted and was
amplified with PCR using NEBNext High-Fidelity PCR Master Mix (Cat # M0541S, New England Biolabs,
MA, USA). The resulting library was purified using a bead cleanup with AMPure XP Magnetic Beads (Cat
# A63880, Beckman Coulter, CA, USA), and quality control was performed using a BioAnalyzer High-
Sensitivity DNA Analysis kit (Cat # 5067-4626, Agilent, CA, USA). The library was sequenced as 75 bp paired-
end reads on an Illumina HiSeq 2000. ATAC-seq data were processed using the ENCODE ATAC-seq
pipeline (https://www.encodeproject.org/atac-seq/)
(https://doi.org/10.1371/journal.pgen.1009023.s022). In addition, ATAC-seq peaks from 34 LUAD tissue
samples were downloaded and lifted over to the hg38 reference genome using the LiftOver tool
available in the UCSC genome browser (https://genome.ucsc.edu/cgi-bin/hgLiftOver) for TENET 2.0
analyses
66
. ATAC-seq peaks from an additional 22 LUAD tissue samples were added
67
along with peaks
from the PC-9 LUAD cell line
68
(https://doi.org/10.1371/journal.pgen.1009023.s011).
2.2.6 DNase-seq:
Peaks of DNaseI hypersensitive sites in PC-9 and A549 cells processed by the ENCODE
consortium were acquired
59,69
. Those from A549 cells aligned to the hg19 human reference genome
were lifted over to the hg38 reference genome using the LiftOver tool available in the UCSC genome
browser (https://genome.ucsc.edu/cgi-bin/hgLiftOver)
(https://doi.org/10.1371/journal.pgen.1009023.s011).
2.2.7 TENET program update and settings:
Here we improved the original TENET program and developed TENET 2.0. TENET 2.0 uses a new
reference genome (hg38) and gene annotation dataset (GENCODE v22) which covers >60,000
transcripts
54
. It also includes a new dataset of 1,639 validated human transcription factors
70
. In addition,
processing speed was increased, and useful functions were added to identify enhancers, genes, and tumor
subgroups associated with survival. For enhancer analysis, we utilized H3K27ac ChIP-seq, ATAC-seq and
DNase I hypersensitive site datasets. RNA-seq data along with DNA methylation data were downloaded
for BRCA and LUAD samples from the TCGA
6,9
using the TCGAbiolinks package
71
(https://doi.org/10.1371/journal.pgen.1009023.s022). The TENET 2.0 program is available
at http://github.com/suhnrhie/TENET_2.0.
2.2.8 Heatmaps:
For Figure 2-4A, unsupervised clustering was performed and for Figure 2-2D, pairwise
correlation coefficients were calculated between each of the top LUAD transcriptional regulators
identified and an unsupervised clustering was performed. For Figure 2-6 and Supplemental Figure 2-8,
heatmaps were generated and unsupervised clustering was performed. DNA methylation levels (β)
ranging from 0 (unmethylated) to 1 (methylated) were plotted. Continuous variables, including gene expression,
patient age, and tumor purity, were scaled using the function $(X - X_{\min})/(X_{\max} - X_{\min})$, with values equal to
zero set to the minimum non-zero value. Tumor purity values, including leukocytes unmethylation for
purity and overall derived consensus purity, were obtained from the Tumor purity dataset available
from the TCGAbiolinks package
71
(https://doi.org/10.1371/journal.pgen.1009023.s022). Refer to chapter
4.2.2 for additional information on TCGA tumor purity from the TCGAbiolinks package.
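The scaling of continuous variables described above can be sketched in Python as follows. This is one reading of the description, in which zeros are first replaced by the smallest non-zero value and the result is then min-max scaled; the function name and the handling of constant or all-zero inputs are assumptions for illustration and are not taken from the TENET 2.0 code.

```python
def min_max_scale(values):
    """Scale values to [0, 1] after replacing zeros with the smallest non-zero value.

    Hypothetical helper; constant or all-zero input is mapped to all zeros
    to avoid division by zero (an assumption, not described in the text).
    """
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        return [0.0] * len(values)
    floor = min(nonzero)
    adjusted = [floor if v == 0 else v for v in values]
    lowest, highest = min(adjusted), max(adjusted)
    if highest == lowest:
        return [0.0] * len(values)
    return [(v - lowest) / (highest - lowest) for v in adjusted]

# Toy expression values: the zero is treated like the minimum non-zero value (2).
scaled = min_max_scale([0, 2, 4, 10])
```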
2.2.9 Expression/correlation analyses:
Expression values of key oncogenic transcriptional regulators from the adjacent normal and
LUAD tumor samples were plotted, and Student’s t-tests were performed to compare differential
expression between normal and tumor groups. A one-way ANOVA analysis was performed to assess
overall differences in transcriptional regulator expression between the smoking groups (67 never
smokers, 278 former smokers, and 106 current smokers) and a Tukey Honest Significant Differences test
was performed to assess significant differences between individual groups. Linear regression models
were fit to predict expression of CENPA, FOXM1 and MYBL2 with respect to variables recorded for
sample clinical information in the TCGA, including sample type, sex, age, smoking history, total pack
years smoked, and race for samples which contained complete information for these variables.
Independent RNA-seq data from 728 lung tumor tissues generated as part of the ORIEN were used to
validate our correlation analyses. The correlation analyses were performed using the normalized RSEM
values calculated following the ORIEN Total Cancer Care protocol (http://www.oriencancer.org/)
accessed in May of 2020
72–74
(Supplemental Figure 2-2).
2.2.10 Survival analyses:
Survival analyses were performed comparing the prognosis of patients in the highest and lowest
quartiles of CENPA, FOXM1, and MYBL2 expression or of linked-enhancer probe DNA methylation levels.
Survival of patients whose samples fell within the "highly linked" group was also compared with that of patients whose samples had no links
to CENPA, FOXM1, and MYBL2. Survival plots from Kaplan-Meier Plotter were
generated on its website (https://kmplot.com/analysis/)
75
(https://doi.org/10.1371/journal.pgen.1009023.s022).
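The quartile-based comparison described above can be illustrated with a short Python sketch that splits samples into the lowest and highest quartiles of a variable and computes a Kaplan-Meier survival curve for each group. The function names, the rank-based quartile split, and the toy data are assumptions for illustration; the sketch does not reproduce the statistical testing used in the analyses.

```python
def quartile_groups(values):
    """Return sample indices in the lowest and highest quartile by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(values) // 4
    if size == 0:
        return [], []
    return order[:size], order[-size:]

def kaplan_meier(times, events):
    """Kaplan-Meier estimate as [(time, survival)] at each time with a death.

    events uses 1 for an observed death and 0 for a censored sample.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        leaving = sum(1 for tt in times if tt == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored samples both leave the risk set
    return curve

# Toy data: expression of one TF, follow-up in months, and vital status.
expression = [5, 1, 9, 3, 7, 2, 8, 4]
months = [10, 40, 5, 35, 12, 50, 8, 30]
died = [1, 0, 1, 0, 1, 0, 1, 1]
low, high = quartile_groups(expression)
low_curve = kaplan_meier([months[i] for i in low], [died[i] for i in low])
high_curve = kaplan_meier([months[i] for i in high], [died[i] for i in high])
```

In this toy example, the high-expression quartile reaches zero estimated survival by 8 months, whereas the low-expression quartile has no observed deaths and therefore an empty list of survival drops.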
2.2.11 Genetic alteration and mutation count analysis:
Genetic alteration data for LUAD samples in the TCGA PanCancer Atlas was downloaded from
the cBioPortal
72,76
by selecting a query for mutations and putative copy-number alterations from GISTIC
(https://software.broadinstitute.org/cancer/cga/gistic) for KRAS, EGFR, NF1, and BRAF. 445 of the 453
LUAD tumor samples in the TENET dataset contained information for these four alterations. Samples
which were listed as having a "putative driver" mutation, amplification, deletion, or a fusion of each of
the genes in this dataset were recorded as being positive for an alteration to that gene. Total mutation
count data containing information for 447 of the 453 LUAD tumor samples was also downloaded from
the cBioPortal repository
72,76
.
2.2.12 Identification of potential target genes for CENPA/FOXM1/MYBL2-linked probes in LUAD and
BRCA:
Student’s t-tests were performed for all genes in the LUAD and BRCA datasets, comparing
expression in the tumor vs. normal samples. Genes that were significantly differentially expressed (FDR-
corrected p < 0.05) and upregulated specifically in the tumor samples were selected for further gene
ontology (GO) analyses (https://doi.org/10.1371/journal.pgen.1009023.s018)
(https://doi.org/10.1371/journal.pgen.1009023.s011).
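The text above does not name the FDR procedure that was applied. As a minimal sketch, assuming the widely used Benjamini-Hochberg adjustment, FDR-corrected p-values can be computed from raw p-values as follows; the function name and the example p-values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Toy p-values from per-gene tumor vs. normal t-tests.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(fdr) if q < 0.05]
```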
2.2.13 Motif analysis:
Minmeme motif files for FOXM1 or MYBL2, based on ChIP-seq experiments (3 from GM12878
cells, MCF-7 cells, and SK-N-SH cells for FOXM1 and 1 from HepG2 cells for MYBL2), were downloaded
from Factorbook (http://factorbook.org) in August of 2019. Additional minmeme motif files for FOXM1
and MYBL2 were downloaded from the HOCOMOCO v11 database
77
. The motif files we used are listed
in (https://doi.org/10.1371/journal.pgen.1009023.s020). Using these motif files and the FIMO program
78
,
we scanned DNA sequences within 1,117 bp of FOXM1-,
MYBL2-, or CENPA-linked enhancers (n = 1,338); this distance is equivalent to half the average enhancer size as calculated
from the lung enhancer regions (https://doi.org/10.1371/journal.pgen.1009023.s012).
2.2.14 Hi-C analysis:
Using “ENCODE3-iced” data from A549 cells
59
and “Rao_2014-raw” data from GM12878 cells
79
,
Hi-C heatmaps (25 kb resolution, hg38) in Supplemental Figure 2-10 were created using the 3D Genome
Browser (http://promoter.bx.psu.edu/hi-c/view.php). Both datasets were processed and normalized
using the pipeline described in Wang et al.
80
. TAD information from A549 and GM12878 cells was
downloaded from ENCODE and Rao et al.
79
, respectively
(https://doi.org/10.1371/journal.pgen.1009023.s011).
2.3 Results and discussion
2.3.1 Identification of differentially activated enhancers in normal lung versus lung
adenocarcinoma:
Each cell type has a distinct transcriptome, which is established by the levels and activities of
transcriptional regulators that bind to regulatory elements and control the expression of numerous
target genes. Among regulatory elements, the activity of enhancers is most closely linked to cell identity,
as they are often bound by cell-type specific transcriptional regulators
81
. We developed TENET 2.0 to
identify key transcriptional regulators whose expression levels are associated with changes in DNA
methylation levels at enhancers in normal vs. tumor tissue samples (Figure 2-1 and Supplemental Figure
2-1). TENET 2.0 now utilizes human reference genome hg38 and includes updated databases of human
genes (GENCODE v22)
54
. To comprehensively characterize and identify transcriptional regulators altered
in tumors, we used the transcription factor database specified by Lambert et al.
70
. We developed TENET
2.0 to have increased processing speed, compared to the original version, and have also included new
algorithms to assess the relationship with patient survival, among others.
Figure 2-1: A workflow of TENET 2.0
First, DNA methylation probes marking enhancer regions of interest are identified by overlapping
them with both H3K27ac ChIP-seq datasets and open chromatin regions. Next, enhancer probes are
classified based on their DNA methylation level in the tumor vs. normal samples and linked to the
expression of genes to identify key transcriptional regulators (TRs). Using genetic alteration, Hi-C
topologically associating domain (TAD), and clinical information, identified key TRs and TR-enhancer-
gene networks are characterized. Additional gene expression and clinical data are used to validate
findings of key TRs. Lung-related datasets used for this study are shown at left. The output from this
LUAD study is indicated in the middle bottom box. The left bottom box summarizes key TENET 2.0
functions.
To study transcriptional enhancer networks in LUAD using TENET 2.0, we first identified lung-
relevant enhancer regions. Alveolar epithelial cells (AECs) are the presumed cells of origin of LUAD
82
.
There are two types of alveolar epithelial cells: cuboidal type 2 cells (AT2), which are involved in
surfactant production and serve as facultative progenitors post-injury, and large, delicate type 1 cells
(AT1), which cover the majority of the alveolar surface and mediate gas exchange. While AT2 cells are
the suspected cells of origin of lung adenocarcinoma, the possible role of AT1 cells has not been well
investigated due to the difficulty in manipulating these fragile cells. Thus, we incorporated both
populations of these cells into our study. We first purified human AT2 cells and then used an in
vitro differentiation protocol (which mimics aspects of normal lung re-epithelialization) to derive AT1-
like cells
52,53
. We then generated H3K27ac ChIP-seq data from the AT2 cells (day 0), transitional cells (day
4), and differentiated AT1-like cells (day 6). We also used H3K27ac ChIP-seq data from normal lung
tissue samples and LUAD cells downloaded from the Roadmap Epigenomics Project (REMC)
83
, the
Encyclopedia of DNA Elements Project (ENCODE)
59,60
, and the DataBase of Transcriptional Start Sites
(DBTSS)
63
. Because tumorigenesis might activate enhancers that are not normally active in the lung, we
also included H3K27ac ChIP-seq from 98 different cell types collected from REMC
83
and ENCODE
59,60
. We
next delineated the open chromatin regions where the transcription factors bind within each enhancer,
using ATAC-seq peaks generated in-house from AECs, ATAC-seq peaks from LUAD tissues and cell lines
downloaded from other studies
66–68
, DNaseI hypersensitive sites from LUAD cell lines, and a collected list
of DNaseI hypersensitive sites from 125 different tissues and cell lines from ENCODE
59,60
. A list of
datasets we used for this study can be found in (https://doi.org/10.1371/journal.pgen.1009023.s011),
and identified enhancer and open chromatin regions can be found
in (https://doi.org/10.1371/journal.pgen.1009023.s012). Finally, DNA methylation probes from the
Illumina Infinium Human Methylation 450K (HM450) array that are contained within the open
chromatin region of each enhancer were selected. As enhancers are bound by cell-type specific
transcription factors and are more cell-type and individual specific than promoters
84
, we focused on
enhancers for our analyses using only probes located >1.5 kb from transcription start sites. In all, we
identified 76,765 "enhancer probes" that can be studied using lung tissue samples
(https://doi.org/10.1371/journal.pgen.1009023.s013); on average, one probe was found per open
chromatin region in each enhancer.
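The probe selection described above can be illustrated with a minimal sketch. It assumes that enhancer and open chromatin regions are available as sorted, non-overlapping intervals per chromosome. The function names, probe identifiers, and coordinates are hypothetical and are not taken from the TENET 2.0 code; only the 1.5 kb distance from transcription start sites comes from the text.

```python
from bisect import bisect_right

TSS_EXCLUSION_BP = 1500  # from the text: probes must lie >1.5 kb from any TSS


def in_any_interval(pos, intervals):
    """True if pos lies in one of the sorted, non-overlapping [start, end) intervals."""
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < intervals[i][1]


def select_enhancer_probes(probes, enhancers, open_chromatin, tss_positions):
    """Keep probes inside an enhancer AND inside open chromatin AND far from every TSS.

    probes: dict probe_id -> (chrom, pos)
    enhancers, open_chromatin: dict chrom -> sorted list of (start, end)
    tss_positions: dict chrom -> list of TSS coordinates
    """
    selected = []
    for probe_id, (chrom, pos) in probes.items():
        if not in_any_interval(pos, enhancers.get(chrom, [])):
            continue
        if not in_any_interval(pos, open_chromatin.get(chrom, [])):
            continue
        if any(abs(pos - tss) <= TSS_EXCLUSION_BP for tss in tss_positions.get(chrom, [])):
            continue
        selected.append(probe_id)
    return selected


# Toy input (hypothetical coordinates, for illustration only)
toy_probes = {
    "probe_a": ("chr1", 5000),   # in enhancer, in open chromatin, 1.6 kb from the TSS
    "probe_b": ("chr1", 5200),   # only 1.4 kb from the TSS
    "probe_c": ("chr1", 20000),  # outside any enhancer
    "probe_d": ("chr1", 9000),   # in enhancer, in open chromatin, far from the TSS
}
toy_enhancers = {"chr1": [(4000, 6000), (8000, 10000)]}
toy_open = {"chr1": [(4900, 5300), (8900, 9100)]}
toy_tss = {"chr1": [6600]}

print(select_enhancer_probes(toy_probes, toy_enhancers, toy_open, toy_tss))
```

The binary search relies on the intervals being sorted and non-overlapping; overlapping peak calls would need to be merged first.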
Having collected the above information, we next assessed the differential activities of all of the
enhancers in normal lung vs. LUAD tumor samples. For this, we collected DNA methylation data for the
enhancer probes (n = 76,765) from 453 LUAD tissue samples and 21 histologically normal lung tissue
samples adjacent to tumors from The Cancer Genome Atlas (TCGA)
9
(https://doi.org/10.1371/journal.pgen.1009023.s014). By comparing the DNA methylation level (as a
reflection of enhancer activity) of each probe in the normal vs. tumor samples, we classified the
enhancer probes into 4 groups: methylated (“constitutively inactive”), i.e. highly methylated in both
normal and tumor samples; unmethylated (“constitutively active”), i.e. lowly methylated in both normal
and tumor samples; hypermethylated (“normal-specific”; inactivated in LUAD), i.e. showing low
methylation in normal samples but higher methylation in tumor samples; and hypomethylated (“cancer-
specific”; activated in LUAD), i.e. showing high methylation in normal samples, but lower methylation in
tumor samples. For example, the unmethylated probe cg05156800, located in an enhancer region on
chr1p36.11 near the 3'UTR of EXTL1, marks an enhancer that is active in both normal lung and LUAD
tumors (Figure 2-2A, left panel). In contrast, the hypermethylated probe cg24149590 in an intergenic
region on chr14q24.3 is located in an enhancer that is active in normal lung but not in LUAD (Figure 2-2A,
middle panel). Hypomethylated probe cg04683210, located in an intron of MACROD1 on chr11q13.1,
marks an enhancer that is active in LUAD but not in normal AECs (Figure 2-2A, right panel). Using this
classification scheme, we identified 4,344 unmethylated, 6,830 methylated, 9,056 hypermethylated, and
23,583 hypomethylated enhancer probes. An excess of identified hypomethylated probes suggests that
enhancer activation is a common molecular alteration in LUAD (Figure 2-2B)
(https://doi.org/10.1371/journal.pgen.1009023.s013).
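The four-group classification can be sketched as follows. This is a simplified illustration that compares mean methylation (beta) values between normal and tumor samples; the cutoffs of 0.3 and 0.7 and the function name are assumptions made for this example and are not the settings used by TENET 2.0. Probes that fit none of the four patterns are left unclassified, which is consistent with only a portion of the 76,765 enhancer probes being assigned to a group.

```python
from statistics import mean

# Hypothetical cutoffs, chosen only for illustration
LOW_BETA = 0.3
HIGH_BETA = 0.7


def classify_probe(normal_betas, tumor_betas, low=LOW_BETA, high=HIGH_BETA):
    """Assign a probe to one of the four groups described in the text."""
    n = mean(normal_betas)
    t = mean(tumor_betas)
    if n >= high and t >= high:
        return "methylated"       # constitutively inactive
    if n <= low and t <= low:
        return "unmethylated"     # constitutively active
    if n <= low and t > low:
        return "hypermethylated"  # normal-specific; inactivated in tumor
    if n >= high and t < high:
        return "hypomethylated"   # cancer-specific; activated in tumor
    return "unclassified"


print(classify_probe([0.9, 0.8], [0.3, 0.4]))
```

A per-sample rule (for example, requiring a minimum number of tumor samples below the cutoff) would capture enhancers altered in only a subset of tumors better than a comparison of means.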
Figure 2-2: Identification of differentially-methylated enhancer probes
(A) Integrative Genomics Viewer (IGV) screenshots show 10 kb of the genomic context centered on
example probes, with UCSC gene annotations (GENCODE v22) in the vicinity, the name and location
of the probe, and the H3K27ac signal from AEC (normal) as well as A549 cells (LUAD cell line). The
unmethylated probe shows an active enhancer region in both the AEC and A549 cells. The
hypermethylated probe shows an active enhancer region found in only the AEC, indicating an
enhancer that is inactive in tumors, while the hypomethylated probe displays marks in only A549
cells, indicating an enhancer that is activated in tumors. (B) Categorization of the identified enhancer
probes by activity.
2.3.2 Identification of key transcriptional regulators dysregulated in lung adenocarcinoma:
Having identified over 32,000 differentially activated enhancer probes between normal lung and
LUAD, we next used matched gene expression data to test the association between the expression of
each known human transcriptional regulator (n = 1,639) and the level of DNA methylation (as a measure
of accessibility and thus activity) of each enhancer probe, using TENET 2.0. We identified 1) inactivated
transcriptional regulators that showed a correlation of lower expression with increased DNA
methylation of enhancer probes in a subset of LUAD samples (candidate tumor suppressors), and 2)
activated transcriptional regulators that showed a correlation of higher expression with decreased DNA
methylation of enhancer probes in a subset of LUAD samples (candidate oncogenes) (Supplemental
Figure 2-1). Most of the known 1,639 human transcriptional regulators we interrogated were linked to
relatively few cell-type specific enhancer probes (Figure 2-3A). However, a subset of transcriptional
regulators was found to be linked to many cell-type specific enhancer probes (Figure 2-3A).
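The idea of linking a transcriptional regulator to an enhancer probe can be illustrated with a simplified stand-in statistic. The sketch below is not the TENET 2.0 test; it assumes a hypothetical hypomethylation cutoff and Z-score threshold, and simply asks whether the regulator is more highly expressed in the subset of samples in which the probe is hypomethylated than in the remaining samples. All names and values are illustrative.

```python
from statistics import mean, stdev

HYPOMETH_CUTOFF = 0.3  # hypothetical beta-value cutoff
Z_THRESHOLD = 2.0      # hypothetical threshold for calling a link


def expression_z_score(expression, methylation, cutoff=HYPOMETH_CUTOFF):
    """Z-score of mean expression in hypomethylated samples versus the other samples.

    Returns None when a group is too small or the reference group has no variance.
    """
    hypo = [e for e, m in zip(expression, methylation) if m < cutoff]
    rest = [e for e, m in zip(expression, methylation) if m >= cutoff]
    if len(hypo) < 1 or len(rest) < 2:
        return None
    sd = stdev(rest)
    if sd == 0:
        return None
    return (mean(hypo) - mean(rest)) / sd


def count_activated_links(tr_expression, probe_methylation, z_threshold=Z_THRESHOLD):
    """Count, per regulator, the probes whose hypomethylated samples show elevated expression."""
    links = {}
    for tr, expr in tr_expression.items():
        n_links = 0
        for meth in probe_methylation.values():
            z = expression_z_score(expr, meth)
            if z is not None and z > z_threshold:
                n_links += 1
        links[tr] = n_links
    return links


# Toy data for six samples (hypothetical values)
toy_expression = {
    "TR_A": [1.0, 1.2, 0.8, 1.1, 9.0, 10.0],
    "TR_B": [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],
}
toy_methylation = {
    "probe_1": [0.8, 0.9, 0.85, 0.8, 0.1, 0.2],  # hypomethylated in the last two samples
    "probe_2": [0.8, 0.9, 0.85, 0.8, 0.9, 0.8],  # never hypomethylated
}

print(count_activated_links(toy_expression, toy_methylation))
```

Ranking regulators by the resulting counts mirrors the per-regulator link totals reported in this section, although an actual analysis would also need to control for multiple testing across all regulator-probe pairs.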
We found that 31 inactivated transcriptional regulators were linked to 10 or more
hypermethylated enhancer probes (https://doi.org/10.1371/journal.pgen.1009023.s015). For example,
NKX2-1 and HNF1B were linked to 123 and 50 hypermethylated enhancer probes, respectively (Figure 2-
3B-C) (https://doi.org/10.1371/journal.pgen.1009023.s015). NKX2-1, the top transcriptional regulator
inactivated in LUAD, linked to the largest number of enhancers silenced in LUAD, is known to play an
important role in lung development and maintenance of AEC cell identity
85
. NKX2-1 also acts as an
activator of HOP (Hsp70/Hsp90 Organizing Protein), a potential tumor suppressor gene in lung cancer,
inhibiting epithelial to mesenchymal transition
86
. HNF1B has previously been reported to act as a tumor
suppressor in several tumors, including renal cancer, ovarian cancer, and prostate cancer
87–89
. Our
finding that lower expression of HNF1B is linked to many inactivated enhancers in LUAD suggests that it
may also act as a tumor suppressor in lung cancer.
On the other hand, we found 101 activated transcriptional regulators linked to 50 or more
hypomethylated probes (https://doi.org/10.1371/journal.pgen.1009023.s015). The top activated
transcriptional regulators were CENPA, FOXM1, TCF24, and MYBL2, which were linked to 875, 845, 843,
and 840 cancer-specific enhancer probes, respectively (Figure 2-3B-C)
(https://doi.org/10.1371/journal.pgen.1009023.s015). These transcriptional regulators likely have the
largest influence on the transcriptomes of lung adenocarcinoma tumors by changing the activities of
many enhancers. Therefore, we further investigated the identified activated transcriptional regulators
associated with many cancer-specific activated enhancers. To determine whether these transcriptional
regulators control the activity of distinct enhancers or cooperate with each other to regulate the same
set of enhancers, we generated an interaction map displaying the association of the 3,682 cancer-
specific enhancer probes linked to at least one of the 101 transcriptional regulators (Figure 2-4A).
Interestingly, CENPA, FOXM1, and MYBL2 showed considerable overlap in their sets of linked probes;
over 75% of the probes linked to each of them were also linked to at least one of the other
two transcriptional regulators (Figure 2-4A – red box)
(https://doi.org/10.1371/journal.pgen.1009023.s015). The overlap between these transcriptional
regulators is much higher than with other key transcriptional regulators identified (e.g. TCF24, SOX2).
Examination of the expression levels of each of the 101 top-ranked transcriptional regulators showed
that the expression levels of CENPA, FOXM1, and MYBL2 were highly correlated with each other (r² > 0.7)
across all profiled LUAD samples (Figure 2-4B - red brackets)
(https://doi.org/10.1371/journal.pgen.1009023.s015). We validated these results using an additional
transcriptomic dataset obtained from other lung tumor tissue samples from ORIEN (Oncology Research
Information Exchange Network) (www.oriencancer.org) (Supplemental Figure 2-2A). These results suggest
that these 3 transcriptional regulators may work together to activate a common set of cancer-specific
enhancers.
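Both comparisons in this paragraph reduce to simple calculations, sketched below with placeholder data. The first function gives the fraction of probes linked to one regulator that are also linked to at least one other regulator; the second gives the squared Pearson correlation (r²) between two expression vectors. The regulator and probe names are hypothetical and the numbers are not results from this study.

```python
def shared_fraction(target_probes, other_probe_sets):
    """Fraction of target_probes also present in at least one of the other sets."""
    if not target_probes:
        return 0.0
    union_of_others = set().union(*other_probe_sets) if other_probe_sets else set()
    return len(target_probes & union_of_others) / len(target_probes)


def pearson_r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Placeholder link sets (hypothetical)
toy_links = {
    "TR_1": {"p1", "p2", "p3", "p4"},
    "TR_2": {"p2", "p3", "p4", "p5"},
    "TR_3": {"p3", "p9"},
}

print(shared_fraction(toy_links["TR_1"], [toy_links["TR_2"], toy_links["TR_3"]]))
print(pearson_r_squared([1, 2, 3, 4], [2, 4, 6, 8]))
```

Applying `pearson_r_squared` to every pair of regulators yields the kind of pairwise matrix shown as a heatmap in Figure 2-4B.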
Figure 2-3: Identification of key dysregulated transcriptional regulators in LUAD
(A) The left histogram shows the number of inactivated (hypermethylated) enhancer probes per
inactivated transcriptional regulator (TR), and the right shows the number of activated
(hypomethylated) enhancer probes per activated TR. Most TRs were linked to relatively few
enhancer probes. However, 31 inactivated TRs in LUAD were linked to 10 or more hypermethylated
enhancer probes, and 101 activated TRs in LUAD were linked to 50 or more hypomethylated
enhancer probes. (B) Number of enhancer links for top 12 transcriptional regulators. Inactivated TRs
are shown at left, while activated TRs are shown at right. (C) Circos plots show the link between the
top inactivated TR (NKX2-1, left) and the top activated TR (CENPA, right) and their associated enhancers
throughout the genome.
Figure 2-4: Interaction of key transcriptional regulators activated in LUAD
(A) Interaction map of the top 101 transcriptional regulators and the 3,682 total unique
hypomethylated probes linked to those genes. CENPA, FOXM1, and MYBL2 show strong overlap in
linked probes (red box). (B) Heatmap of pairwise expression correlation values between each of the
top 101 transcriptional regulators. FOXM1, CENPA, and MYBL2 show a high degree of correlation
with each other (r² > 0.7), but TCF24 (one of the top 4 most highly linked TRs; Figure 2-3B) does not
(r² < 0.1).
2.3.3 Identification of transcriptional regulators whose expression is associated with poor
patient survival:
To further investigate the role of key transcriptional regulators activated in LUAD, we more
closely examined gene expression levels in normal vs. tumor tissues. Of the top 12 transcriptional
regulators, CENPA, FOXM1, and MYBL2 were among the most highly expressed and displayed the largest
differences in expression between tumor and normal tissues; each was >8 times more highly expressed in
LUAD as compared to normal lung (Figure 2-5A) (Supplemental Figure 2-3). Next, we examined the
association of transcriptional regulator expression with patient survival, and we found that high
expression levels of CENPA, FOXM1, and MYBL2 were the most significantly associated with poor patient
survival in the TCGA LUAD cohort (Figure 2-5B) (Supplemental Figure 2-4). We validated these results
for CENPA, MYBL2, and FOXM1 using an additional survival dataset obtained from other LUAD samples
75
(Supplemental Figure 2-5). Expression of CENPA, FOXM1, and MYBL2 did not appear to be very strongly
associated with age, sex, or cancer stage. However, we found that history of tobacco exposure was
correlated with the gene expression of each of the three transcriptional regulators (Supplemental Figure
2-6A) (https://doi.org/10.1371/journal.pgen.1009023.s016). Additionally, we found that high total
mutation burden was similarly associated with increased expression of these genes in the LUAD tumor
samples (Supplemental Figure 2-6B).
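The survival comparison between the highest and lowest expression quartiles can be sketched with a small Kaplan-Meier estimator. This is an illustrative implementation under simple assumptions (follow-up times with a 0/1 event indicator); the function names and toy values are hypothetical, and the log-rank test used to compare the two curves is not implemented here.

```python
def quartile_groups(values):
    """Return (lowest_quartile_indices, highest_quartile_indices) for a list of values."""
    k = len(values) // 4
    if k == 0:
        raise ValueError("need at least four samples to form quartiles")
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[:k], order[-k:]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate of survival.

    times: follow-up time per sample
    events: 1 if death was observed at that time, 0 if the sample was censored
    Returns a list of (time, survival_probability) at each time with a death.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored samples both leave the risk set
    return curve


# Toy cohort of eight samples (hypothetical values)
toy_expression = [5, 1, 9, 3, 7, 2, 8, 4]
toy_times = [30, 60, 10, 55, 20, 70, 15, 40]
toy_events = [1, 0, 1, 0, 1, 0, 1, 1]

low, high = quartile_groups(toy_expression)
for label, idx in (("low", low), ("high", high)):
    curve = kaplan_meier([toy_times[i] for i in idx], [toy_events[i] for i in idx])
    print(label, curve)
```

In practice a dedicated survival analysis package would normally be used for both the curves and the log-rank p-value.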
Figure 2-5: CENPA, FOXM1 and MYBL2 are highly expressed in tumors and associated with poor
patient survival
(A) Boxplots of expression of CENPA, FOXM1 and MYBL2 in 453 TCGA LUAD and 21 adjacent normal
samples. All three genes were significantly upregulated in LUAD. (B) Kaplan-Meier survival plots
comparing differences in survival between samples with the highest and lowest quartiles
of CENPA, FOXM1, and MYBL2 expression. Survival was compared using TCGA LUAD samples.
2.3.4 CENPA, FOXM1, and MYBL2 are activated in a subgroup of lung adenocarcinoma and
breast adenocarcinoma:
Tumor samples with higher expression of CENPA, FOXM1, and MYBL2 appear to harbor relatively
more cancer-specific enhancers, suggesting that tumors highly expressing CENPA, FOXM1,
and MYBL2 may have distinct enhancer profiles (Supplemental Figure 2-7A). To investigate this, we
generated a DNA methylation heatmap of the enhancer probes linked to these three transcriptional
regulators (Figure 2-6A) (https://doi.org/10.1371/journal.pgen.1009023.s013). We observed a subgroup
consisting of LUAD samples that are broadly hypomethylated across these enhancers, and that possess
relatively high expression of these three transcriptional regulators together (Figure 2-6A - cluster
b) (Supplemental Figure 2-8A). These samples did not appear to be associated with age, sex,
purity, or cancer stage, but they were slightly associated with smoking history in the TCGA dataset,
especially current smoking, as well as total mutational burden (Supplemental Figure 2-8A-C). We saw no
apparent association between genetic alterations to KRAS, EGFR, NF1, or BRAF and activation of
specifically CENPA, FOXM1, and MYBL2-linked enhancers (Supplemental Figure 2-8A). It has been
previously shown that activation of KRAS signaling increases expression of FOXM1
90
, and that MYBL2 can
be regulated by EGFR
91
. We therefore examined the total number of cancer-specific enhancer links in
samples with and without KRAS or EGFR genetic alterations and in the highest quartile and remaining
quartiles of expression of FOXM1 and MYBL2, respectively. Samples with the highest quartile
of FOXM1 and MYBL2 expression possessed a significantly greater number of cancer-specific enhancer
links; however, KRAS or EGFR genetic alteration status was not associated with a significant difference in
the number of these links regardless of the FOXM1 and MYBL2 expression level (Supplemental Figure 2-7B).
We also observed that a subgroup of LUAD samples, representing those in the top 10% by number of
links to CENPA, FOXM1, and MYBL2, showed poorer survival outcomes than samples that did not
possess a link (Supplemental Figure 2-8D).
Figure 2-6: LUAD and BRCA subgroups with activated CENPA, FOXM1, and MYBL2-linked enhancers
(A) DNA methylation heatmap showing CENPA, FOXM1, and MYBL2 expression-linked LUAD-specific
enhancer probes for normal and LUAD tissue samples. Clusters represent the largest two divisions in
LUAD tumor samples as determined by unsupervised clustering. LUAD tumor samples in cluster b
display generally higher expression of the 3 transcriptional regulators and broad hypomethylation of
CENPA/FOXM1/MYBL2-linked probes. (B) DNA methylation heatmap showing CENPA, FOXM1, and
MYBL2-linked breast cancer-specific enhancer probes for normal and BRCA tissue samples. BRCA
PAM50 (Prediction Analysis of Microarray 50) subtypes are indicated in the middle bar. Of note is the
cluster of samples on the right, comprised predominantly of BRCA tumor samples of the basal
subtype, with relatively high expression of the three transcriptional regulators and broad
hypomethylation of CENPA/FOXM1/MYBL2-linked probes, similar to what is seen in the subgroup of
LUAD samples.
In previous analyses, we found that FOXM1 and MYBL2 were activated in breast
adenocarcinoma (BRCA). Having now identified these as key regulators in LUAD, we sought to determine
if different enhancers are linked to FOXM1 and MYBL2 in the two cancer types. We reanalyzed the BRCA
data using TENET 2.0 (https://doi.org/10.1371/journal.pgen.1009023.s013)
(https://doi.org/10.1371/journal.pgen.1009023.s014), and found that CENPA, FOXM1, and MYBL2 were
among the top transcriptional regulators in BRCA when ranked by number of linked probes
(Supplemental Figure 2-9A). However, only a subset of TENET-identified CENPA, FOXM1, and MYBL2-linked
enhancer probes were shared between both datasets (Supplemental Figure 2-9B). This suggests
that although some cancer-specific enhancers are common to LUAD and BRCA (Supplemental Figure
2-9C) (https://doi.org/10.1371/journal.pgen.1009023.s017), the enhancers regulated by CENPA, FOXM1,
and MYBL2 are largely different between tumor types. To further characterize the BRCA enhancers
linked to CENPA, MYBL2, and FOXM1, we generated heatmaps of DNA methylation for enhancer probes
linked to any of the three transcriptional regulators in BRCA. Interestingly, we found that BRCA tumor
samples that belong to the basal subtype have higher expression of these three transcriptional
regulators as well as a larger number of hypomethylated CENPA, FOXM1, and MYBL2-linked enhancers
than other BRCA subtypes (i.e. luminal A, luminal B, Her2, normal-like) (Figure 2-6B).
2.3.5 Identification of CENPA/FOXM1/MYBL2-linked enhancers associated with poor patient
survival and their potential target genes:
We next wondered whether high expression of CENPA, FOXM1, and MYBL2 and the presence of
more activated enhancers was clinically relevant. Therefore, we examined the subgroup of LUAD
samples that had high expression of the three transcriptional regulators as well as many enhancer
links (over 290 cancer-specific CENPA, FOXM1, or MYBL2 enhancer links), called "highly linked" samples,
for correlations to overall patient survival (Supplemental Figure 2-8A). These "highly linked" samples
showed significantly poorer survival outcomes than lowly linked samples (Supplemental Figure 2-8D).
To further investigate whether any particular cancer-specific enhancers were linked to patient outcome,
we performed survival analyses using cancer-specific enhancer probes linked to CENPA, FOXM1, or
MYBL2. We found 101 enhancer probes for which lower levels of DNA methylation were associated with
poor patient survival (Log-rank p<0.05) (https://doi.org/10.1371/journal.pgen.1009023.s013). Examples
of enhancer probes linked to patient survival included cg03535253, located on chr14q32.12 in the 3'UTR
of the BTBD7 gene, cg06956006, located on chr17q21.2 in an intron of the ACLY gene, and cg04016113
in an intron of the SFXN5 gene on chr2p13.2. Each is located in the vicinity of an active enhancer region
in LUAD cells not present in normal AEC, and patients with low levels of methylation of each of these
probes (indicating the activation of the enhancer regions) showed significantly poorer survival outcomes
(Figure 2-7).
We next aimed to identify genes and signaling pathways potentially regulated by CENPA,
FOXM1, and MYBL2. To this end, we first identified genes within 1 Mb of each of the enhancer probes
since most enhancer-promoter interactions occur within a topologically associating domain (TAD) that is
less than 1Mb in size
32
. From these, we selected the genes that were significantly upregulated in tumor
relative to normal as potential targets of these enhancers. For example, we found that the SPR gene was
a potential target of the enhancer probe cg04016113 (Figure 2-7). SPR (sepiapterin reductase) is located
~177kb upstream of the enhancer probe. A recent study showed that SPR depletion inhibited liver
cancer cell proliferation and promoted cancer cell apoptosis in vivo
92
, suggesting its role as an oncogene.
Gene ontology (GO) analyses revealed that target genes potentially regulated by CENPA, FOXM1, and
MYBL2 are involved in cell cycle, cellular response to DNA damage stimulus, chromosome organization,
and DNA repair (https://doi.org/10.1371/journal.pgen.1009023.s008).
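The target gene search described above amounts to a distance filter followed by an expression filter. The sketch below assumes that each gene is represented by a single transcription start site coordinate and that the set of genes upregulated in tumors has already been computed; the gene names, coordinates, and function name are hypothetical. Only the 1 Mb window comes from the text.

```python
WINDOW_BP = 1_000_000  # from the text: genes within 1 Mb of the enhancer probe


def candidate_targets(probe, genes, upregulated, window=WINDOW_BP):
    """List upregulated genes whose TSS lies within `window` bp of the probe.

    probe: (chrom, position)
    genes: dict gene_name -> (chrom, tss_position)
    upregulated: set of gene names upregulated in tumor relative to normal
    Returns (gene_name, signed_offset_bp) tuples sorted by absolute distance.
    """
    chrom, pos = probe
    hits = []
    for name, (gene_chrom, tss) in genes.items():
        if gene_chrom != chrom:
            continue
        if abs(tss - pos) > window:
            continue
        if name not in upregulated:
            continue
        hits.append((name, tss - pos))
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Toy annotation (hypothetical names and coordinates)
toy_probe = ("chr2", 5_000_000)
toy_genes = {
    "GENE_A": ("chr2", 4_850_000),  # 150 kb away, upregulated
    "GENE_B": ("chr2", 6_500_000),  # 1.5 Mb away, outside the window
    "GENE_C": ("chr2", 5_400_000),  # within the window but not upregulated
    "GENE_D": ("chr3", 5_010_000),  # different chromosome
}
toy_upregulated = {"GENE_A", "GENE_B", "GENE_D"}

print(candidate_targets(toy_probe, toy_genes, toy_upregulated))
```

A fixed window is only a proxy for TAD membership, so candidates found this way still need supporting evidence such as Hi-C maps or knockdown experiments.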
Figure 2-7: Examples of CENPA/FOXM1/MYBL2-linked enhancer probes associated with survival
rate
(A) Shown are three examples of lung cancer-specific enhancers linked to CENPA, FOXM1, or MYBL2
in LUAD. IGV screenshots show 10 kb of the genomic context centered on example probes, with
GENCODE v22-annotated UCSC genes in the vicinity, the name and location of the probe, and the
H3K27ac signal from normal AEC as well as lung tumor A549 cells. These hypomethylated probes
show H3K27ac marks in A549 cells, indicating enhancers active in LUAD but not normal lung tissue.
(B) Kaplan-Meier survival plots comparing differences in survival between samples with the highest
and lowest quartiles of methylation of the enhancer probe.
To identify genes and signaling pathways regulated by FOXM1 and MYBL2, known transcription
factors, we performed knockdown experiments for FOXM1 and MYBL2 in A549 cells, a LUAD cell line.
More than a thousand genes were differentially expressed upon knockdown of either FOXM1 or MYBL2
or both (Figure 2-8A-C) (https://doi.org/10.1371/journal.pgen.1009023.s019). GO analyses of the genes
downregulated after knocking down FOXM1 or MYBL2 or both indicated that these genes are involved in
cell cycle and cell division, supporting the gene predictions made from deregulated genes near the
activated enhancers (https://doi.org/10.1371/journal.pgen.1009023.s018). We determined which of the
significantly downregulated genes from the siRNA knockdowns were located within 1 Mb of the
enhancer probes we had previously linked to these transcriptional regulators (Figure 2-
8D) (https://doi.org/10.1371/journal.pgen.1009023.s019). These genes likely represent the direct target
genes of the enhancers.
Of particular interest is the gene TK1, which showed a ~40% reduction in expression after
MYBL2 knockdown (adjusted p = 2.506 × 10⁻⁷) (https://doi.org/10.1371/journal.pgen.1009023.s019). TK1,
encoding a protein that plays an important role in thymidine metabolism, is located ~188 kb from the
MYBL2-linked enhancer probe cg09580922 on chr17q25.3 (Figure 2-8E). Low methylation of cg09580922
is strongly associated with poor patient survival (Figure 2-8F), as is high expression of TK1 (Figure 2-8G).
The promoter of TK1 and cg09580922 are both located in the same TAD according to Hi-C maps from
both A549 as well as GM12878, another cell line for which a high resolution Hi-C dataset is available
(Supplemental Figure 2-10). This suggests that a cancer-specific enhancer potentially regulated by
MYBL2 may increase the expression of TK1. A complete list of enhancers and their potential target genes
confirmed by knockdown experiments and located in the same TAD can be found
in https://doi.org/10.1371/journal.pgen.1009023.s019.
Figure 2-8: Identification of genes regulated by FOXM1 and MYBL2
Volcano plots showing gene expression changes after knocking down (A) FOXM1 or (B) MYBL2 or (C)
Double (both FOXM1 and MYBL2). The knocked down genes (FOXM1 or MYBL2) are highlighted by a
green or purple box, respectively. (D) Heatmap displaying fold change expression of significantly
downregulated genes in the vicinity of cancer-specific enhancers associated with poor patient
survival after FOXM1 (light blue) or MYBL2 (orange) or double (green) knockdown; log2(fold change)
values are plotted from dark blue to dark red (https://doi.org/10.1371/journal.pgen.1009023.s009).
Genes shown represent potential target genes within 1 Mb of CENPA/FOXM1/MYBL2-linked
enhancers whose activation is significantly associated with poor patient survival. Expression of the
gene TK1 is highlighted by the red arrow. (E) Diagram of A549 H3K27ac mark overlapping the MYBL2-
linked probe cg09580922 and its potential target gene TK1 (Supplemental Figure 2-10). (F) Kaplan-
Meier survival plot comparing differences in survival between LUAD tumor samples with the highest
and lowest quartiles of cg09580922 methylation. (G) Kaplan-Meier survival plot comparing
differences in survival between LUAD tumor samples with the highest and lowest quartiles
of TK1 expression.
2.3.6 Discussion:
We have developed TENET 2.0, a method to characterize enhancer networks controlled by
transcriptional regulators that are potential tumor suppressors or oncogenic drivers. Using H3K27ac
ChIP-seq and open chromatin datasets, we identified enhancers active in lung cells. Then, using DNA
methylation levels at the identified enhancers in hundreds of normal vs. LUAD tissue samples
9
, we
identified over 32,000 differentially activated enhancers. By integrating DNA methylation and gene
expression data, we identified key transcriptional regulators (e.g. NKX2-1, CENPA, FOXM1, and MYBL2)
that are linked to many cell-type specific enhancers. These transcriptional regulators were not simply
the most highly expressed, most overexpressed, nor necessarily the most significantly overexpressed
transcriptional regulators in LUAD tumors (Table 2-1). This illustrates the value of the TENET method, as
it can identify functionally upregulated transcriptional regulators, which may not necessarily be among
the most expressed, or overexpressed transcriptional regulators in LUAD tumors. We further found that
high expression of CENPA, FOXM1, and MYBL2 is associated with poor survival in patients with LUAD
and with broad enhancer activation in a distinct group of LUAD tumors. We found a subgroup of BRCA
tumor samples which also showed activation of BRCA enhancers linked to these three transcriptional
regulators, and basal-subtype tumors were particularly enriched in that subgroup. We then identified
LUAD-specific enhancers that are linked to the three transcriptional regulators and whose increased
activities are correlated with poor survival. For example, the enhancer marked by probe cg09580922
appears to regulate the TK1 gene, whose high expression is associated with poor patient survival.
TENET 2.0, which now includes updated databases and new algorithms that identify epigenetic
traits associated with mortality while greatly decreasing computational time, allowed us to identify
dysregulated transcriptional regulators and enhancers in LUAD. Key inactivated transcriptional
regulators, which are potential tumor suppressors, include NKX2-1 (Figure 2-3). Low expression of NKX2-
1 was observed in a subgroup of LUAD samples (Supplemental Figure 2-8A) and was linked to over a
hundred inactivated enhancers. NKX2-1, also known as thyroid transcription factor 1 (TTF1), regulates
transcription of genes specific for the thyroid and lung. NKX2-1 is reported to be involved in lung
development, and it inhibits epithelial to mesenchymal transition, supporting its role as a tumor
suppressor
93,94
. Besides NKX2-1, we identified HNF1B, a previously reported tumor suppressor
in other cancer types
87–89
, STAT6, and SP100 as transcriptional regulators that were inactivated and linked to many silenced
enhancers in a subgroup of LUAD (Figure 2-3) (https://doi.org/10.1371/journal.pgen.1009023.s015).
Table 2-1: Top TRs identified in LUAD by TENET 2.0 are functionally relevant
Table displays the top-20 ranked TRs by highest overall expression in LUAD tumors (mean tumor
expression), highest overexpression in LUAD tumors compared to LUAD adjacent normal samples
(mean tumor expression – mean adjacent normal expression), most significant overexpression in
LUAD tumors (lowest t-test p-values, and overexpressed in tumor relative to adjacent normal
samples), and finally the TRs with the most linked hypomethylated probes from the TENET 2.0
analysis. Color is used to highlight overlapping elements between tables. As seen in the table, the
top-20 TRs identified by TENET 2.0 are not necessarily those that are the most highly expressed,
overexpressed, or even significantly overexpressed in LUAD, indicating TENET 2.0 can identify
functionally relevant transcriptional regulators which would otherwise be difficult to find.
Of the transcriptional regulators activated in LUAD, CENPA, FOXM1, and MYBL2 were linked to
the activation of hundreds of cancer-specific enhancers. These transcriptional regulators are therefore
potential cancer driver oncogenes. CENPA, which has a histone-binding domain, directs the assembly of
active kinetochores together with centromere-specific-DNA-binding factors. A recent study in cervical
and colorectal cancer cells reported that CENPA can also bind to DNaseI hypersensitive sites
95
. MYBL2
(a.k.a. B-MYB), a member of the MYB family, regulates cell cycle genes by binding to regulatory
elements
96
. FOXM1, a member of the Forkhead family of pioneer transcription factors
97
, is involved in
the proper development of several different organ systems, including the lungs
90
. It has been
demonstrated to bind to enhancers in breast cancer cells
98
. Here we showed that CENPA, FOXM1, and
MYBL2 are upregulated together, potentially leading to the activation of many cancer-specific enhancers
in a subgroup of LUAD. The subgroup of LUAD with both high expression of CENPA, FOXM1,
and MYBL2 and broad enhancer activation had worse patient survival outcomes. This subgroup also
appears to have a higher proportion of smokers, which may be related to the observed epigenomic
changes, and higher tumor mutational burden
99
. It has been previously suggested that FOXM1 may act
as a regulator for genes involved in DNA damage response and repair
100
. Besides these 3 transcriptional
regulators, we identified other key transcriptional regulators, such as TCF24, SOX2, and ZNF695, each
linked to over 500 enhancers activated in LUAD (Figure 2-3), providing many further avenues of
investigation.
When we compared our LUAD data with that of a similar analysis of BRCA, CENPA, FOXM1,
and MYBL2 were also found to be activated, particularly in basal-subtype tumors, supporting the idea
that these factors work together in certain cancer subtypes. Previously, we showed that estrogen
receptor and FOXA1, which are known to be activated in estrogen receptor-positive breast cancer
subtypes (e.g. luminal A, luminal B), are not expressed in the basal subtype, but FOX and MYB motifs are
enriched at enhancers in basal-like breast cancer cells
101
. FOXM1 and MYBL2 motifs were enriched at
CENPA, FOXM1, and MYBL2-linked enhancers we found in lung cancer cells (91.8% for a FOXM1 motif,
60.3% for an MYBL2 motif) (https://doi.org/10.1371/journal.pgen.1009023.s020). Interestingly, CENPA,
FOXM1, and MYBL2 appear to target different enhancers in BRCA and LUAD, potentially working with
different co-factors
102
. In spite of this difference, GO analysis of potential target genes for these
enhancers revealed that both sets regulate similar cellular processes, including cell cycle control and
DNA repair (https://doi.org/10.1371/journal.pgen.1009023.s018). Further studies to elucidate the
function of these transcriptional regulators in tumor subgroups are needed to better understand their
role in epigenetic deregulation of cancer cells.
Previous studies had implicated FOXM1 and MYBL2 in lung cancer
103–105
, but our analysis
documents their profound effects on gene deregulation, potentially affecting hundreds of enhancers. As
acquisition of cancer-specific enhancers can drive tumorigenesis
106
, identifying key activated enhancers
in cancer is highly relevant. Here, we identified 101 LUAD-specific enhancers linked to CENPA, FOXM1,
and MYBL2 that show correlations with worse survival (Figure 2-7)
(https://doi.org/10.1371/journal.pgen.1009023.s013). For example, we found that the enhancer probe
cg04016113, whose activation (low DNA methylation) is associated with poor survival, is potentially
regulating the SPR gene, which was recently reported as an oncogene in liver cancer
92
.
Using knockdown experiments, we further identified potential target genes of these enhancers,
which included genes involved in cell division and cell cycle control. These potential target genes
included not only known oncogenes such as MYC, FBXL16, PHF5A, and KIF14
107–110
but also genes
(e.g. BRI3BP, RAB11FIP5) which are not yet reported to be involved in lung carcinogenesis
(https://doi.org/10.1371/journal.pgen.1009023.s019). Of the downregulated genes after siRNA
treatment, TK1 was the most significantly associated with survival rates (log-rank p = 1.194 × 10⁻⁴). High
expression of TK1 and low methylation of the nearby MYBL2-linked enhancer probe cg09580922 were
associated with poor patient survival (Figure 2-8F-G), and both appear to be located in the same TAD
(Supplemental Figure 2-10). TK1 has been investigated as a diagnostic and prognostic biomarker for
several types of cancer, including LUAD [111]. Loss of TK1 has been shown to inhibit the growth and
metastatic capabilities of LUAD in vitro as well as in mice through a reduction in expression of GDF15 [111].
We have used TENET 2.0 to integrate epigenomic and transcriptomic profiles from hundreds of
samples and have identified key transcriptional regulators and enhancers altered in LUAD. The lists of
these enhancers, transcriptional regulators, and their potential target genes will be a useful resource for
researchers aiming to better understand the molecular mechanisms driving carcinogenesis in different
LUAD subgroups. Moreover, our findings may lead to new biomarkers as well as therapies that might
target distinct LUAD subgroups associated with poor survival; small molecule inhibitors for MYB family
members [112] as well as FOXM1 have been developed but have not yet been tested in lung cancer [113,114].
Importantly, TENET 2.0 can be used to investigate molecular mechanisms underlying any cancer type for
which gene expression and epigenetic data are available (http://github.com/suhnrhie/TENET_2.0).
2.4 Conclusions
When I started developing the TENET 2.0 method, the original TENET program was difficult
to find and was at that point non-functional due to its dependence on deprecated datasets. Above, I
detailed the major updates I made to the TENET 2.0 method as well as key findings I discovered in LUAD
using it. However, after utilizing the method, I identified several key ways the method could be
improved further.
While I had made TENET 2.0 functional and it was hosted on the lab’s GitHub page in order to
increase its visibility, the method still consisted of just a directory containing a master bash script, a
number of R scripts which ran functions executed by the bash script, a .txt file the user could edit to
change TENET 2.0 settings, and all the datasets TENET 2.0 needed to function. This implementation had
several key deficiencies. First, the use of a bash script made TENET 2.0 difficult to use on a Windows-
based computer, especially for a novice user. Second, this implementation required extensive use of
external documentation and vignettes both to inform potential users how to actually use TENET 2.0, as
there is no basic convention for how such programs should operate, and to keep track of changes to
TENET 2.0’s datasets or functions. Lastly, the build of TENET 2.0, utilizing a master bash script to activate
individual R scripts depending on combinations of settings read from a single master .txt file, made it
difficult to understand the inner workings of the method and thus much more difficult to add new
features or fix bugs as they arose.
For these reasons, I turned my attention for the rest of my project to continue improving TENET
2.0 by developing it as an R package. Not only would this address many of the concerns I had with TENET
2.0, but it would also give me the opportunity to convert many of the custom scripts I had developed to
analyze the LUAD dataset into default functions that could be included in this new package. On a
personal level, developing my own R package would also be a major milestone as a
bioinformaticist-in-training.
2.5 Supplemental files
Supplemental Figure 2-1: TENET 2.0 pictorial workflow
(A) DNA methylation levels of enhancer probes are used to assess differential activity of
transcriptional regulator-linked enhancers. Enhancer probes are identified using H3K27ac ChIP-seq
peaks overlapping with regions of open chromatin. Probes intersecting both of these regions are
subsetted to those >1.5kb from GENCODE v22-annotated transcription start sites to avoid promoter
regions. (B) TENET classifies enhancer probes based on their differential activity as measured by
methylation level in normal vs. tumor samples. Methylated and unmethylated probes represent
enhancers that are uniformly inactive and active, respectively. Hypermethylated probes represent
enhancers that are inactive in cancer samples. These probes possess a low level of mean methylation
in normal samples, but higher levels of methylation in a subset of tumor samples. Conversely,
hypomethylated probes represent enhancers that are active in cancer samples, showing a decreased
level of methylation in tumor vs. normal lung samples. (C) Analyses are focused on transcriptional
regulators that are overexpressed in LUAD, resulting in increased activity of their regulated enhancer
regions as represented by decreased DNA methylation. The expression of each transcriptional
regulator and DNA methylation of each enhancer probe are assessed to find "linked" pairs with
increased expression of the transcriptional regulator and decreased methylation of the probe in a
subset of the tumor samples, relative to normal samples. (D) Transcriptional regulators with the
most linked enhancers are of interest for study because they are more likely to have large-scale
effects on genome-wide expression patterns. TENET 2.0 also includes new functions to identify
tumor subgroups based on differences in the activation of enhancers linked to the top
transcriptional regulators using heatmaps (E) and association with patient survival (F) as well as
potential target genes of enhancers using topologically associating domain (TAD) information (G).
Supplemental Figure 2-2: Correlation analyses of key transcriptional regulators activated in lung
cancer using ORIEN datasets
(A) Using ORIEN gene expression datasets from lung tumor tissue samples (n = 728), TR gene
expression correlation analyses were performed. Barplots show the top 5 most correlated
transcriptional regulators for CENPA (top), FOXM1 (middle), and MYBL2 (bottom). (B) TR gene
expression scatterplots are shown for FOXM1 vs. MYBL2 (top), FOXM1 vs. CENPA (middle),
and CENPA vs. MYBL2 (bottom).
Supplemental Figure 2-3: Expression of top 12 transcriptional regulators activated in LUAD
Boxplots of expression of the remaining 9 of the top 12 oncogenic transcriptional regulators in 453 TCGA
LUAD tumor and 21 adjacent normal samples (CENPA, FOXM1, and MYBL2 are shown in Figure 2-5A).
All genes were upregulated in LUAD tumors, but none as strongly as CENPA, FOXM1, or MYBL2.
Supplemental Figure 2-4: Survival analysis of top 12 transcriptional regulators activated in LUAD
Kaplan-Meier survival plots comparing differences in survival between samples with the highest and
lowest quartiles of expression of the remaining 9 of top 12 cancer-specific transcriptional regulators
by number of linked enhancers (CENPA, FOXM1, and MYBL2 are shown in Figure 2-5B). Survival was
compared using TCGA LUAD samples.
Supplemental Figure 2-5: Replication of association of expression of highly-linked oncogenic
transcriptional regulators with patient survival in LUAD using Kaplan-Meier Plotter.
Kaplan-Meier survival plots comparing differences in survival between samples with the highest and
lowest quartiles of expression of the top 12 oncogenic transcriptional regulators in LUAD cases using
Kaplan-Meier Plotter (https://kmplot.com/analysis/). Again, expression of CENPA, FOXM1,
and MYBL2 was the most strongly associated with patient survival amongst these transcriptional
regulators.
Supplemental Figure 2-6: Smoking history is associated with CENPA, FOXM1, and MYBL2
expression in TCGA samples
(A) Boxplots of CENPA, FOXM1, and MYBL2 expression in 453 TCGA LUAD tumor samples stratified
by smoking history. Tumor samples from current smokers had significantly higher expression of all
three transcriptional regulator genes than those from former smokers and individuals who had never
smoked (significant Tukey HSD p-values displayed; ***p<0.005). (B) Boxplots show CENPA, FOXM1,
and MYBL2 expression in all 453 TCGA LUAD tumor samples stratified by median total mutational
count. Samples with higher mutational burden had higher expression of these transcriptional
regulators.
Supplemental Figure 2-7: Association of total active enhancer links with expression of three
activated transcription factors and common LUAD mutations
(A) Boxplots show the total number of links to activated enhancers on a per sample basis in LUAD
samples with the highest quartile and lowest quartile of CENPA, FOXM1, and MYBL2 expression.
(B) Boxplots display differences in the total number of links to activated enhancers on a per sample
basis in LUAD samples stratified by the presence and absence of a KRAS (right) or EGFR alteration
(left), and with and without the highest quartile of expression of FOXM1 (right) or MYBL2 (left).
There is a significant difference in the number of links between samples in the highest quartile of
expression vs. the other three quartiles for both transcriptional regulators regardless of their
alteration status, but no significant difference in the number of links between samples with and
without either alteration but with the same expression level (Significant Tukey HSD p-values
displayed; * = p<0.05, ** = p<0.01, *** = p<0.005).
Supplemental Figure 2-8: Association of links to CENPA, FOXM1, and MYBL2-linked enhancers
with clinical data and subgroup analysis
(A) Heatmap of DNA methylation β-values for CENPA/FOXM1/MYBL2-linked lung cancer-specific
enhancers (n = 1,338) for normal and LUAD tissue samples. From top to bottom, samples are
plotted with the age, sex, and cancer stage of the patients, smoking history status, expression of
additional identified TRs TCF24, SOX2, and NKX2-1, leukocyte and overall tumor purity, presence
of KRAS, EGFR, NF1 and BRAF alterations, log2-transformed mutational count, and sample link
status. (B) Chi-square test results comparing smoking history in the more active cluster b, to the less
active cluster a from Figure 2-6A. There is a much greater proportion of current smokers in cluster b
than in cluster a. (C) t-test results comparing mean total mutation count of samples in cluster b to
cluster a. Samples in cluster b have a significantly higher mean mutation count than samples in
cluster a. (D) Univariate Kaplan-Meier survival plot comparing difference in survival between the
very highly-linked samples (marked in red in the link status bar), and samples that do not possess
any links to CENPA/FOXM1/MYBL2-linked probes (marked in blue in the link status bar).
Supplemental Figure 2-9: Key transcriptional regulators identified in LUAD vs. BRCA and
comparison of CENPA/FOXM1/MYBL2-linked probes in each dataset
(A) Barplot of the top 12 transcriptional regulators by number of links to activated enhancers
identified using TENET 2.0 in LUAD (left) and BRCA (right). (B) Venn diagrams display overlap of
probes linked to CENPA, FOXM1, and MYBL2 and (C) all hypomethylated probes in the LUAD vs.
BRCA analyses. There is a considerably higher percentage of overlap between all hypomethylated
probes than for probes linked only to CENPA, FOXM1, or MYBL2.
Supplemental Figure 2-10: Hi-C diagram from A549 and GM12878 cells showing the cg09580922
and TK1 genomic region
Hi-C diagrams of A549 cells (top) and GM12878 cells (bottom) show the genomic context of
the TK1/cg09580922 locus (middle) from chr17:77125000–79625000. In both cell lines, TAD
boundaries (lower middle) show that both TK1 and cg09580922 are located in the same TAD.
Chapter 3: Development and application of TENETR to identify
dysregulated transcription factors and linked regulatory elements
across cancer types
This chapter gives an in-depth look into how TENETR works and was developed, with a focus on
covering the methodology of the step1 through step6 functions, as well as how I applied them in an
analysis of a panel of 12 TCGA cancer datasets. The first part of this chapter will be largely methods
focused, discussing the philosophy of TENET, how each of the aforementioned step1 through step6
functions work to identify dysregulated regulatory elements and link their activity to differentially-
expressed TFs, as well as how the cancer and epigenomic datasets used by TENETR were acquired for
each of the 12 cancer types. The second half of the chapter will discuss my findings from the analysis of
these 12 cancer datasets, including the identification of the top TFs in each cancer type, to showcase the
TENETR package.
The work within this chapter is being prepared to be submitted as an article with the proposed
title “Creation and application of the TENETR package to identify dysregulated transcription factors
across a pan-cancer landscape”. The authors for this paper will be Daniel J Mullen, Zexun Wu, Ethan
Nelson-Moore, Lauren Han, Ria Pai, Huan Cao, Ite A. Offringa, and Suhn K Rhie. DJM is responsible for
bioinformatic analyses, package creation, dataset creation, figure creation, key idea contributions,
experimental design, and writing. ZW helped with bioinformatic analyses, dataset creation, and
experimental design. ENM helped with package creation. LH, RP, and HC helped with dataset creation
and package testing. IAO assisted with key idea contributions and editing. SKR helped with package
creation, key idea contributions, experimental design, writing, and editing.
3.1 Abstract
Different genetic alterations play an important role in driving the development of cancer.
However, despite extensive genomic profiling, there is still a considerable portion of tumor samples
across cancer types for which no driver mutations have been identified. Epigenetic alterations could play
a key role in these cancers. Here, I focus on alterations to the expression of transcription factors (TFs)
and subsequent changes to transcriptional regulatory networks, as they induce widespread downstream
effects in cells. Previously, the bioinformatic methods Tracing Enhancer Networks using Epigenetic Traits
(TENET) and TENET version 2.0 (TENET 2.0) were developed to identify dysregulated TFs and their linked
enhancer regions and used to investigate breast, kidney, prostate, and lung cancers.
Here, I present the newly developed TENETR method and its analysis of twelve different cancer
types. TENETR utilizes ChIP-seq and open chromatin datasets to identify DNA methylation probes in
regulatory elements, then uses the DNA methylation levels of those probes as a surrogate for the
activity of the regulatory elements, with decreased methylation corresponding to increased element
activity. Combining these epigenomic datasets with RNA-seq datasets, TENETR identifies TFs whose
expression levels are inversely related to the DNA methylation level of regulatory element probes. I
developed TENETR as an R package, making it more intuitive to use. The TENETR method also possesses
several new features to increase its ease of use and applicability, including support for different DNA
methylation arrays, inclusion of consensus epigenomic datasets, options to assess both distal enhancer
and promoter elements, a TF-only option to increase processing speed, an algorithm to automatically
set methylation cutoff values, and a new function to download gene expression and DNA methylation
datasets from The Cancer Genome Atlas (TCGA). We also compiled extensive cancer-type-specific
epigenomic datasets and identified numerous distal enhancer elements marked by DNA methylation probes,
along with top TFs linked to activated enhancers in the selected cancer types.
3.2 Materials and methods
3.2.1 Selection and downloading of TCGA datasets (Work done with: LH):
Sample counts with matching HM450 array DNA methylation β-values and fragments per
kilobase of transcript per million mapped reads upper quartile normalized (FPKM-UQ) RNA-seq values
were calculated for all 38 TCGA tumor and normal tissue datasets. From these, we selected 12 TCGA
cancer types: bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon
adenocarcinoma (COAD), esophageal carcinoma (ESCA), head and neck squamous cell carcinoma
(HNSC), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver
hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC),
prostate adenocarcinoma (PRAD), and thyroid carcinoma (THCA) (Figure 3-1A). These cancer types have the
largest number of tumor and especially adjacent normal samples with both RNA-seq and DNA
methylation data available (Figure 3-1B).
The FPKM-UQ RNA-seq values, as well as HM450 array β-values were downloaded for both
tumor and adjacent normal samples from the 12 selected TCGA cancer types. After downloading,
samples lacking either gene expression or DNA methylation data were discarded to ensure the
remaining samples had matching gene expression and DNA methylation data. Then, for TCGA subjects
who had multiple tumor samples included in the dataset, the first tumor sample based on its
alphanumeric TCGA barcode was kept and the remaining samples from that subject were removed.
Finally, the FPKM-UQ RNA-seq values were log2 transformed before being used in TENETR analyses.
Available clinical data from each patient was also included for use in downstream TENET analyses
(Supplemental Data 3-1).
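The sample selection described above can be sketched in a few lines. The following Python snippet only illustrates the logic and is not code from TENETR, which is an R package. It assumes that sample identifiers have been truncated to the sample-level TCGA barcode (for example "TCGA-05-4244-01A"), in which the two digits after the third hyphen encode the sample type (01 to 09 for tumor, 10 to 19 for normal). The function name and the example barcodes are hypothetical.

```python
def select_samples(expression_ids, methylation_ids):
    """Keep samples with both data types, then one tumor sample per subject."""
    # Step 1: discard samples lacking either expression or methylation data.
    matched = sorted(set(expression_ids) & set(methylation_ids))

    kept = []
    subjects_with_tumor = set()
    for barcode in matched:  # iterates in alphanumeric barcode order
        subject = barcode[:12]               # e.g. "TCGA-05-4244"
        is_tumor = int(barcode[13:15]) < 10  # assumed sample-type code layout
        if is_tumor:
            if subject in subjects_with_tumor:
                continue  # a tumor sample from this subject was already kept
            subjects_with_tumor.add(subject)
        kept.append(barcode)
    return kept


# Hypothetical barcodes: one subject with two tumor vials and one normal sample.
example_expression = ["TCGA-05-4244-01A", "TCGA-05-4244-01B",
                      "TCGA-05-4244-11A", "TCGA-44-0001-01A"]
example_methylation = ["TCGA-05-4244-01B", "TCGA-05-4244-01A",
                       "TCGA-05-4244-11A", "TCGA-99-0002-01A"]
print(select_samples(example_expression, example_methylation))
```

In this sketch the second tumor vial from the same subject is dropped, the normal sample is retained, and the two samples present in only one data type are discarded.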
It should be noted, however, that the datasets used in this study were downloaded before April
of 2022. Subsequently, the National Cancer Institute Genomic Data Commons (GDC), which hosts the
TCGA data, overhauled their database and reprocessed the data, which affected the gene expression
data available to the user. Specifically, STAR was used in place of HTSeq to generate gene counts data,
the algorithm used to calculate FPKM-UQ values was adjusted, and GENCODE v36 instead of GENCODE
v22 gene annotation was used. Presently, comparable data can be downloaded using the
TCGA_downloader() function.
3.2.2 TCGA_downloader() function (Work done with: ENM):
TCGA data following current GDC specifications can be acquired using the TENETR
TCGA_downloader() function, which uses functionality provided by the TCGAbiolinks package [71] to
download and prepare datasets for use with TENETR. This function can also be used to acquire and
compile clinical, DNA methylation, and gene expression data for selected TCGA datasets, and can be
used to download those datasets for non-TENETR purposes as well.
The first argument for the TCGA_downloader() function is TCGA_directory, which requires the
user to provide a path to a directory they wish to download the TCGA files to. It should be noted that
these TCGA datasets are sizeable. For example, over 71 GB of downloaded data were used to create the
LUAD dataset used in the TENETR study. Thus, it is recommended the user have plenty of available space
in the specified directory to which files will be downloaded.
Figure 3-1: TCGA datasets used in TENETR analysis
(A) Diagram of the 12 cancer types included in this study with their respective TCGA barcodes and
color coding. (B) Table of normal and tumor sample counts included for each cancer type in the
study.
The next argument is TCGA_study_abbreviation, which requires the user to provide the four-letter
code for the TCGA dataset they want to download. The available codes are listed at
https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations.
The RNA_seq_workflow argument controls which type of normalized RNA-seq
data is downloaded. Options for this argument include “STAR - FPKM”, “STAR - FPKM-UQ”,
“STAR - FPKM-UQ - old formula”, or “STAR - TPM”. Raw counts data can also be acquired by setting this to
“STAR - Counts”, but this data is not recommended for TENET usage as it has not been normalized.
The RNA_seq_log2_normalization argument controls whether the RNA-seq data should be
log2-transformed, and can be set to TRUE to perform the transformation or FALSE to skip it. This argument
defaults to TRUE if not otherwise specified.
The matching_exp_met_samples argument allows the user to isolate only the samples which
have matching gene expression and DNA methylation data. Options for this argument include
“tumor_and_normal” to perform the matching for both the tumor and adjacent normal samples, and
“tumor_only” to perform the matching on only the tumor samples while keeping every adjacent normal
sample, even those without matched data. Finally, setting this argument to “none” does not remove
any samples for lacking matching data.
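The three matching options can be illustrated with a short Python sketch. It is not the implementation used by TCGA_downloader(), which is written in R; the function name, the dictionary keys, and the representation of each dataset as a plain list of sample identifiers are assumptions made for illustration only.

```python
def match_samples(exp_tumor, met_tumor, exp_normal, met_normal, mode):
    """Filter sample ID lists according to one of the three matching modes."""
    if mode not in ("tumor_and_normal", "tumor_only", "none"):
        raise ValueError("unknown matching mode: " + repr(mode))

    def shared(first, second):
        # Keep only IDs present in both lists, preserving each list's order.
        common = set(first) & set(second)
        return ([s for s in first if s in common],
                [s for s in second if s in common])

    if mode in ("tumor_and_normal", "tumor_only"):
        exp_tumor, met_tumor = shared(exp_tumor, met_tumor)
    if mode == "tumor_and_normal":
        exp_normal, met_normal = shared(exp_normal, met_normal)
    # With mode "none" every list is returned unchanged.
    return {
        "tumor_expression": list(exp_tumor),
        "tumor_methylation": list(met_tumor),
        "normal_expression": list(exp_normal),
        "normal_methylation": list(met_normal),
    }


result = match_samples(["t1", "t2"], ["t2", "t3"], ["n1"], ["n2"], "tumor_only")
print(result["tumor_expression"], result["normal_expression"])
```

With “tumor_only”, the unmatched normal samples survive while the tumor lists are reduced to their shared sample.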
The remove_dup_tumor argument can be set to either TRUE or FALSE depending on if the user
wants to remove duplicate tumor samples coming from the same patient. If set to TRUE, the first tumor
sample per patient is kept, as ordered alphanumerically by the sample’s unique TCGA barcode, and
remaining tumor samples from that patient are removed. If set to FALSE, all tumor samples are kept.
This argument defaults to TRUE if not otherwise set.
The final argument to the TCGA_downloader() function is the TENET_directory argument, which
the user should use to specify a path to a directory where they want to save the .rda file containing the
TCGA data compiled by the function. This path could be to a directory which will later be used to contain
results by the main TENETR functions discussed below.
After running this function, a .rda file is created in the specified TENET_directory. Its name
contains information about the type of data it contains based on the options selected by the user, and
the file should contain five separate objects. The first object is a data frame named “Clinical” which
contains the TCGA clinical metadata on the samples in the dataset for use in downstream step7 TENETR
functions discussed in Chapter 4. The remaining four objects are matrices named “expDataN”,
“expDataT”, “metDataN”, and “metDataT”. The former two objects contain the GENCODE v36-
annotated gene expression data from the adjacent normal and tumor samples, respectively. The latter
two objects contain DNA methylation β-values for probes from the HM450 array, again from the
adjacent normal and tumor samples, respectively. This .rda file can later be used by the
step2_get_diffmeth_regions() function.
As previously mentioned, the data used in this study is presently unavailable due to changes
made to the way TCGA gene expression data has been processed. However, similar datasets can still be
created using the TCGA_downloader() function. To acquire similar datasets to the ones I used in this
study, which I would recommend if you are creating your own datasets for TENET analyses, set the
TCGA_study_abbreviation to the abbreviations for each of the 12 selected cancer types, and set the
RNA_seq_workflow argument to “STAR - FPKM-UQ - old formula” to create FPKM-UQ normalized gene
expression values in the same manner they were generated before the most recent TCGA overhaul.
Then, set RNA_seq_log2_normalization to TRUE to log2 normalize the FPKM-UQ normalized values, and
set matching_exp_met_samples to “tumor_and_normal” to include only the tumor and adjacent normal
TCGA samples with matching gene expression and DNA methylation data as required by TENETR. Finally,
set remove_dup_tumor to TRUE to ensure only one tumor sample is kept per patient to prevent a
possible bias in the datasets.
3.2.3 Collection of external cancer type-specific epigenomic datasets (Work done with: ZW,
LH, RP, HC):
An exhaustive search for epigenomic datasets of relevance to the 12 cancer types in this study
was primarily conducted using the Cistrome Data Browser [115,116]. Homo sapiens was selected as the
species of interest, and the factors were set to H3K4me1 and H3K27ac to isolate datasets representing
enhancer regions, to H3K4me3 to find datasets for promoter analyses, or to DNaseI, ATAC-seq, or
FAIRE to identify nucleosome-depleted region (NDR) datasets, i.e. regions with open chromatin. After
selecting these factors, searches were conducted to identify datasets pertaining to the organs of
interest for each cancer type. Similar searches were also conducted on the Gene Expression Omnibus (GEO)
to identify further datasets which were not indexed by Cistrome. Initial searches returned 2,198 histone
modification datasets in total, and 692 nucleosome-depleted region, or open chromatin, datasets.
After initial searches, datasets were manually filtered based on several criteria. First, datasets
from samples not relevant to the given cancer type were removed. As an example, datasets from IMR90
lung fibroblast cell lines were removed from both the LUAD and LUSC datasets because they were not
directly relevant to the development of said cancers. Similarly, datasets from BEAS-2B human bronchial
epithelial cells were removed from the LUAD analysis, while A549 lung adenocarcinoma cell datasets
were removed from the LUSC analysis. In both cases, based on the cell of origin, those samples were not
particularly relevant to that individual cancer type, even if they derived from the same organ.
Second, for datasets that were indexed by Cistrome, scores for the six quality control metrics
were recorded: sequence quality, mapping quality, library complexity, ChIP enrichment, signal to noise
ratio, and regulatory region ratio. Samples that failed 3 or more of these quality control metrics
according to Cistrome were excluded outright.
Next, samples that had a strong treatment applied, for example a chemical or siRNA treatment,
were removed. In addition, datasets from samples that had some sort of genetic alteration performed
were also removed. However, we did decide to keep datasets from experiments where the samples
were administered a mild control treatment, such as EtOH or a vehicle control.
Finally, we removed datasets with fewer than 10 million reads in them, as well as any datasets
for which we were unable to acquire the .fastq files to perform our own peak calling. These included
histone modification datasets for which the experimental .fastq files were available, but the input files
for those datasets were unavailable.
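The manual filtering criteria above amount to a sequence of exclusion rules, which can be summarized in a short Python sketch. The filtering in this study was performed by hand, so the snippet is only an illustration: the record fields, their names, and the example datasets are hypothetical, and only the thresholds (three or more failed quality control metrics, fewer than 10 million reads) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_READS = 10_000_000  # datasets with fewer reads are removed
MAX_FAILED_QC = 2       # failing 3 or more of the six metrics excludes a dataset


@dataclass
class CandidateDataset:
    name: str
    relevant_cell_type: bool
    failed_qc_metrics: Optional[int]  # None when not indexed by Cistrome
    strong_treatment: bool            # e.g. chemical or siRNA treatment
    genetic_alteration: bool
    read_count: int
    fastq_available: bool
    input_available: bool = True      # only relevant for histone ChIP-seq


def passes_filters(dataset):
    """Return True when a dataset survives every exclusion rule."""
    if not dataset.relevant_cell_type:
        return False
    if (dataset.failed_qc_metrics is not None
            and dataset.failed_qc_metrics > MAX_FAILED_QC):
        return False
    if dataset.strong_treatment or dataset.genetic_alteration:
        return False
    if dataset.read_count < MIN_READS:
        return False
    return dataset.fastq_available and dataset.input_available


candidates = [
    CandidateDataset("vehicle_control", True, 1, False, False, 25_000_000, True),
    CandidateDataset("low_quality", True, 3, False, False, 25_000_000, True),
    CandidateDataset("shallow", True, 0, False, False, 8_000_000, True),
    CandidateDataset("no_input", True, None, False, False, 30_000_000, True, False),
]
kept_names = [d.name for d in candidates if passes_filters(d)]
print(kept_names)
```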
3.2.4 Peak calling and final preparation of external cancer type-specific epigenomic datasets
(Work done with: ZW, LH, RP, HC):
After compiling and removing epigenomic datasets as described above, datasets were
downloaded and peaks were called using the ENCODE consortium ChIP-seq pipeline
(https://github.com/ENCODE-DCC/chip-seq-pipeline2) for histone modification datasets, or their ATAC-
seq pipeline for DNaseI and ATAC-seq datasets (https://github.com/ENCODE-DCC/atac-seq-pipeline).
After peak calling, datasets with fewer than 10,000 called peaks were excluded. Since
we are examining histone modification peaks here, we would expect reasonably robust datasets
to have a greater number of peaks than this (as opposed to datasets of TF binding). Then the list
of datasets with called peaks was compiled, and datasets initially indexed by Cistrome were
removed if they passed fewer quality measures than at least one other dataset of the same type
in the list that derived from the same sample source. For example, a
BRCA H3K27ac dataset derived from MCF7 cells passing five of the six Cistrome quality control
metrics would be removed if there was another H3K27ac dataset from MCF7 cells that passed
all six.
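A minimal Python sketch of this final selection step is given below. It is illustrative only, since the selection was carried out manually. The dictionary keys and example records are hypothetical, and the sketch assumes that every dataset carries a count of passed Cistrome quality control metrics and that the comparison is made among datasets that already passed the peak-count filter.

```python
MIN_PEAKS = 10_000  # datasets with fewer called peaks are excluded


def select_best_datasets(datasets):
    """Drop low-peak datasets, then keep the top-scoring ones per group."""
    enough_peaks = [d for d in datasets if d["peaks"] >= MIN_PEAKS]

    # Highest number of passed QC metrics per (dataset type, sample source).
    best_score = {}
    for d in enough_peaks:
        key = (d["mark"], d["source"])
        best_score[key] = max(best_score.get(key, 0), d["qc_passed"])

    # A dataset is removed only if another one in its group passed more metrics,
    # so ties are all retained.
    return [d for d in enough_peaks
            if d["qc_passed"] == best_score[(d["mark"], d["source"])]]


records = [
    {"name": "mcf7_a", "mark": "H3K27ac", "source": "MCF7", "peaks": 45_000, "qc_passed": 5},
    {"name": "mcf7_b", "mark": "H3K27ac", "source": "MCF7", "peaks": 52_000, "qc_passed": 6},
    {"name": "mcf7_c", "mark": "H3K27ac", "source": "MCF7", "peaks": 4_000, "qc_passed": 6},
    {"name": "t47d_a", "mark": "H3K27ac", "source": "T47D", "peaks": 30_000, "qc_passed": 4},
]
selected_names = [d["name"] for d in select_best_datasets(records)]
print(selected_names)
```

As in the MCF7 example in the text, the dataset passing five metrics is removed because another dataset from the same source passes all six.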
3.2.5 R package creation:
TENETR and TENETR.data packages were initialized using the usethis package
(https://github.com/r-lib/usethis) and R 4.0.3 in RStudio (https://www.rstudio.com/). Both packages are
formatted to work with hg38-annotated data. The dataset and function documentation for both
packages was initialized using the sinew package’s makeOxygen() function
(https://github.com/yonicd/sinew) and relevant information was manually filled in.
3.2.6 TENETR.data package and datasets:
The TENETR.data package is composed solely of datasets and contains no functions. Specific
details about each dataset can be found in their corresponding .R scripts located in the data-raw
directory of the package’s GitHub page (https://github.com/rhielab/TENETR.data).
The “ENCODE_PLS_regions”, “ENCODE_dELS_regions”, and “ENCODE_pELS_regions” datasets are
.bed-like files which contain the hg38 coordinates for promoter, distal enhancer, and proximal
enhancer signature elements annotated by the ENCODE SCREEN project [117]. By the definitions of
the ENCODE SCREEN project, elements with promoter-like signatures (ENCODE_PLS_regions) fall
within 200 bp of an annotated GENCODE transcription start site (TSS) and have high H3K4me3
and open chromatin signals. Elements with distal enhancer-like signatures (dELS) are located
more than 2 kb from an annotated TSS and have high H3K27ac and open chromatin signals,
while elements with proximal enhancer-like signatures (pELS) have the same signals but are located
within 2 kb of an annotated TSS. hg38 genome annotations for the EPIC+, EPIC, HM450, and
HM27 DNA methylation arrays were downloaded and a 1-indexed start position for each probe
was calculated. Otherwise, the datasets were formatted as the authors had done, and the data are
contained in the “epic_plus_hg38_annotations”, “epic_h38_annotations”,
“HM450_hg38_annotations”, and “HM27_hg38_annotations” datasets [118]
(http://zwdzwd.github.io/InfiniumAnnotation). The GENCODE v36 main comprehensive gene
annotation .gtf file was acquired from the GENCODE website and saved as the
“gencode_v36_annotations” object (https://www.gencodegenes.org/human/release_36.html).
Chromosome sizes for the hg38 genome were downloaded from the UCSC Genome Browser
database and make up the “hg38_chrom_sizes” object
(http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/). The
“human_transcription_factors_dataset” contains the full database of the annotated human TFs
and was downloaded and formatted as the authors designed it [70]
(http://humantfs.ccbr.utoronto.ca/download.php). The “consensus_enhancer_regions” and
“consensus_open_chromatin_regions” datasets are bed-like files which contain hg38
coordinates for enhancer or open chromatin (nucleosome-depleted) regions, respectively, from
a variety of cell lines and tissues. The consensus_enhancer_regions dataset is composed of
ChromHMM-annotated Genic enhancer and Active enhancer regions from 98 different human
epigenomes (https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html)
(https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/bed_hg38_lifted_over/)
and regions of eRNA expression from all facets of the FANTOM5 sample panel [27]
(https://enhancer.binf.ku.dk/presets/), which I lifted over to the hg38 genome. These regions were
combined and reduced using the GenomicRanges reduce() function [119]. The
consensus_open_chromatin_regions dataset is composed of the ENCODE DNaseI hypersensitive sites
master list, built from data from 125 cell types
(http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgDnaseMasterSites/),
which I lifted over to the hg38 genome, as well as TCGA ATAC-seq datasets from 410 tumor samples
across 23 cancer types [67]. These were similarly combined and reduced using the
GenomicRanges reduce() function [119].
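To make the "combine and reduce" step concrete, the following Python sketch merges overlapping or directly adjacent regions into single non-overlapping regions, which is the behavior of GenomicRanges reduce() under its default settings. It is not the code used to build the TENETR.data datasets (that work was done in R). The function name and example coordinates are hypothetical, and regions are assumed to be given in .bed-style half-open coordinates.

```python
from collections import defaultdict


def reduce_regions(regions):
    """Merge overlapping or adjacent (chrom, start, end) regions per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or touching: extend the current merged region.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Hypothetical regions pooled from two sources.
pooled = [
    ("chr1", 100, 200), ("chr1", 150, 300),  # overlap
    ("chr1", 300, 350),                      # adjacent to the merged region
    ("chr1", 500, 600),
    ("chr2", 10, 20),
]
print(reduce_regions(pooled))
```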
3.2.7 TENETR package and functions:
The TENETR package contains no datasets, and instead is composed of 23 functions that
perform TENETR processes as well as data acquisition and downstream analyses. Functions starting with
step1 through step6 perform the main TENET functions while those starting with step7 perform various
post-hoc analyses. The previously described TCGA_downloader() function, which
downloads TCGA datasets for use with the TENETR functions, is also included. TENETR is designed such that
each of the step functions is to be run in sequence as a pipeline, with the results for each function read
by the subsequent step to continue the pipeline. Figure 3-2 provides the overview of this pipeline and
highlights major new features of TENETR compared to prior TENET versions.
Here I will focus on steps 1 through 6 which identify the top genes/TFs and their linked probes,
while the step7 functions which perform post-hoc analyses will be discussed in Chapter 4. For each
function, I have specified the “default” analysis settings for each of the function arguments, which were
used to generate the data presented in both this and the next chapter, unless otherwise specified.
3.2.7.1 step1_make_external_datasets() function:
This opening function of TENETR is designed to identify the DNA methylation probes which are
contained within peaks from specified open chromatin or histone modification datasets, either built into
TENETR or supplied by the user.
To use the function, users must provide a directory to the output_directory argument, which
will be the directory to which this function will save results. This argument is additionally important
because the directory specified will also be used to contain the results for the remaining TENET
functions.
Users must also supply the DNA methylation array of interest to the DNA_methylation_manifest
argument of this function, which should be the same methylation array the user’s methylation data was
generated with. Currently, the HM27, HM450, EPIC, and EPIC+ arrays are supported, and their data is
imported from the TENETR.data package by setting the argument to “HM27”, “HM450”, “EPIC”, or
“EPIC_plus”, respectively. This argument defaults to “HM450” if not specified, to reflect TENETR’s
primary use in analyzing TCGA DNA methylation datasets, which as of this point, consist of data from the
HM450 DNA methylation array.
Figure 3-2: TENETR function pipeline
TENETR is composed of a variety of numbered “step” functions which are run in sequential order to
generate results. Stars represent key features new in TENETR which weren’t present in previous
TENET versions.
The next series of arguments specifies the datasets whose regions will be searched for DNA
methylation probes. The ext_HM and ext_NDR arguments allow users to analyze their
own histone modification and nucleosome-depleted region files, respectively. Each of these arguments
should be supplied a path to a separate directory containing .bed-like files that end with “.bed”,
“.narrowPeak”, “.broadPeak”, or “.gappedPeak”, with the histone modification or nucleosome-depleted
region peaks annotated to the hg38 human genome. These files must contain the chromosome, 0-indexed
start position, and 1-indexed end position in the first three columns, as is typical of .bed-like files.
Additional information is allowed in the files but does not affect the function’s operations. If the user
does not have their own files they wish to analyze, these arguments should be set to FALSE. By default,
they are set to FALSE.
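The file requirements described above can be illustrated with a minimal Python sketch. TENETR itself is an R package, so this is not its code: the helper names is_peak_file and read_peak_lines are hypothetical, and skipping comment, “track”, and “browser” header lines is an assumption about how .bed-like files are commonly written rather than documented TENETR behavior.

```python
# Illustrative sketch only; TENETR is written in R and may handle files differently.
PEAK_EXTENSIONS = (".bed", ".narrowPeak", ".broadPeak", ".gappedPeak")


def is_peak_file(filename):
    """Return True if the file name ends with one of the accepted extensions."""
    return filename.endswith(PEAK_EXTENSIONS)


def read_peak_lines(lines):
    """Parse chromosome, start, and end from the first three tab-separated columns.

    Any further columns are ignored. The 0-indexed start is shifted by one so
    that the returned interval is 1-indexed and inclusive at both ends.
    """
    peaks = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and common header lines carry no peaks.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        peaks.append((chrom, start + 1, end))
    return peaks


# Made-up example records: the first has extra columns, the second does not.
example_lines = ["chr1\t999\t2000\tpeak_1\t850", "chr2\t0\t150"]
parsed_peaks = read_peak_lines(example_lines)
print(parsed_peaks)
```

Converting to a single inclusive coordinate system up front keeps the later probe-overlap test a simple start <= position <= end comparison.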
The consensus_ENH and consensus_NDR arguments allow users to identify the DNA methylation
probes that are found within the consensus enhancer regions and consensus nucleosome-depleted
regions, respectively, described previously for the TENETR.data package. Since these datasets
have been prebuilt and there is only one of each, users can set each argument to either TRUE if they want
to identify the probes in those regions, or FALSE if they do not. By default, these arguments are set to
TRUE as we expect users will likely want to use these datasets in their TENETR analyses. The primary use
of TENETR thus far has been to identify dysregulated regulatory elements in tumors, which often display
features including genomic instability
120,121
and alterations to DNA damage repair pathways
122
. Thus, we
included these consensus datasets as a way of capturing a wide variety of potentially active genomic
elements across tissue and cell types, which could show alterations in activity in a given tumor sample. It
should be noted, though, that setting the consensus_ENH argument to TRUE by default
presumes the user is looking for distal enhancer data, as this dataset was generated with those
specifically in mind, as opposed to the more genomic element-type-agnostic nucleosome-depleted
region dataset.
Similarly, the arguments ENCODE_PLS, ENCODE_pELS, and ENCODE_dELS allow the users to use
the ENCODE promoter-like regions, proximal enhancer-like regions, and distal enhancer-like regions, as
defined by the ENCODE SCREEN project discussed previously, as they see fit for their analyses. These
should either be set to TRUE to identify the probes within these regions, or FALSE if the user doesn’t
want to generate a probe list for them. By default, these arguments are set to FALSE.
Finally, the core_count argument allows the user to use multiple cores to increase the speed of
this function. This argument can be set to an integer value equal to the number of cores the user wishes
to use but cannot exceed the total number of cores available on their system. This argument defaults to
1.
Once run, the function will create a “step1” subdirectory inside of the specified
output_directory. For each of the arguments associated with the different genomic datasets, if the given
argument is set to TRUE, or a path supplied in the case of the ext_HM or ext_NDR arguments, a further
subdirectory named for that argument will be created in the step1 subdirectory to contain the results
from that dataset.
To identify probes within the specified regions, the function imports the relevant datasets from
the TENETR.data package or loads any files with the previously specified file extensions in the ext_HM
and ext_NDR directories. They are then transformed into GRanges objects using the GenomicRanges
package’s makeGRangesFromDataFrame() function, and overlaps between the datasets’ regions and the
DNA methylation probe coordinates on the specified DNA methylation array are identified using the
findOverlaps() function.
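The overlap step can be sketched in plain Python as follows. This is only an illustration of the idea, not the GenomicRanges implementation that TENETR uses: the function name probes_in_peaks is hypothetical, the probe IDs and coordinates are made up, and merging the peaks before a binary search is my own design choice for the sketch.

```python
from bisect import bisect_right


def probes_in_peaks(probes, peaks):
    """Return the sorted IDs of probes that fall inside at least one peak.

    probes: dict mapping probe ID -> (chromosome, 1-based position)
    peaks:  list of (chromosome, start, end), 1-based and inclusive
    """
    # Merge overlapping or adjacent peaks per chromosome so that a single
    # binary search per probe is enough to decide membership.
    merged = {}
    for chrom, start, end in sorted(peaks):
        runs = merged.setdefault(chrom, [])
        if runs and start <= runs[-1][1] + 1:
            runs[-1][1] = max(runs[-1][1], end)
        else:
            runs.append([start, end])
    starts = {chrom: [run[0] for run in runs] for chrom, runs in merged.items()}

    hits = []
    for probe_id, (chrom, position) in probes.items():
        if chrom not in merged:
            continue
        # Index of the last merged peak that starts at or before the probe.
        i = bisect_right(starts[chrom], position) - 1
        if i >= 0 and position <= merged[chrom][i][1]:
            hits.append(probe_id)
    return sorted(hits)


# Made-up probes and peaks for illustration.
example_probes = {
    "cg00000001": ("chr1", 1500),
    "cg00000002": ("chr1", 5000),
    "cg00000003": ("chr2", 150),
}
example_peaks = [("chr1", 1000, 2000), ("chr1", 1800, 2500), ("chr2", 1, 150)]
overlapping = probes_in_peaks(example_probes, example_peaks)

# One probe ID per line, without row or column labels.
print("\n".join(overlapping))
```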
The probes that are found to overlap peaks in a given dataset are saved in a single column
without row or column labels in a .probelist.txt file with the name of the dataset located in the created
subdirectory for that type of dataset in the step1 subdirectory. This results in a single .txt file being
created for each of the consensus_ENH, consensus_NDR, ENCODE_PLS, ENCODE_pELS, and ENCODE_dELS
datasets. For user-supplied external histone modification and nucleosome-depleted region datasets, a
single .txt file will be created for each file supplied by the user in the specified directories. This approach
gives the step1_make_external_datasets() function broader applicability beyond TENETR analyses, as an
interested user can use the function to identify the DNA methylation probes located within .bed-like
files for other research purposes.
For the “default” analyses I present here, I utilized the epigenomic datasets from the 12 cancer
types we collected and processed by specifying directories to those files for the ext_HM and ext_NDR
arguments. I also elected to use both the consensus_enhancer_regions and
consensus_open_chromatin_regions datasets as I was interested in studying the activity of distal
enhancer regions in the tumors, so I set consensus_ENH and consensus_NDR to TRUE. For this reason, I
also set the ENCODE_dELS argument to TRUE to use the ENCODE_dELS_regions dataset, and I set
ENCODE_PLS and ENCODE_pELS to FALSE as I wanted to exclude promoters and regions in close
proximity to promoters.
3.2.7.2 step2_get_diffmeth_regions() function:
This function takes the probes found within epigenomic elements of interest by the
step1_make_external_datasets() function and uses them to identify the probes that mark potential
regulatory elements of interest, classifying them based on their differential methylation levels in
control and experimental samples. Here these refer to the adjacent normal and tumor samples,
respectively.
First, the path to the directory that was given for the output_directory argument of the
step1_make_external_datasets() function, and that now contains the results of that function, should be
supplied to the TENET_directory argument of the step2_get_diffmeth_regions() function.
Next, the methylation_expression_dataset argument is given a path to a .rda file that contains
the gene expression and DNA methylation data of interest to the user. As TENET has been primarily run
using cancer data from TCGA, this data object can be created with the TCGA_downloader function
discussed later.
TENETR can also be run using matched gene expression and DNA methylation data from samples in
other control vs. experimental contexts. To do this, the user will have to provide their own .rda file with
the appropriate formatting. First, the .rda should contain at least 4 matrices named “expDataN”,
“expDataT”, “metDataN”, and “metDataT”, containing the expression data from the control and
experimental samples, and then the DNA methylation data from the control and experimental samples,
respectively. Currently, the gene expression data must be GENCODE v36-annotated, while the DNA
methylation data must come from the aforementioned EPIC+, EPIC, HM450, or HM27 arrays. These
matrices should be formatted with the sample data in the columns and the gene expression or DNA
methylation probe values in the rows. The column names need to contain a unique name for
each sample, which should be at least 12 characters long if proper matching with clinical data is to be
done in downstream functions. Row names in the expDataN and expDataT matrices need to list
the GENCODE v36-annotated Ensembl gene IDs for each gene (in the form ENSG###########; the user
must be sure to omit the version suffix, a period followed by a number, that is sometimes appended to
the Ensembl ID in datasets), while the row names of metDataN and metDataT should list the DNA
methylation probe IDs (in the form cg########). In addition, the sample names should be the
same between the expData and metData objects for both the control and experimental datasets, as the
expression and methylation data for both
the control and experimental samples should be matched. As discussed previously, a “clinical” data
frame can also be provided. Its contents are not used by this function, and its formatting is discussed in
Chapter 4 where it is more relevant for downstream TENETR analyses. Additional data objects can be
included in the .rda file, but will be ignored. Formatting of the expData and metData objects is
illustrated in Table 3-1A and 3-1B, respectively.
Table 3-1: Format of expData and metData objects
(A) Illustration of the setup of the expData matrices for use in TENETR. These gene expression
datasets should include Ensembl gene IDs in the row names, and unique sample IDs in the columns
(here TCGA barcodes are used, through the portion identifier) with gene expression data in the body
of the matrices. (B) Similar illustration of the metData matrices for use in TENETR. These DNA
methylation datasets should include probe IDs in the row names, and the same unique sample IDs as
listed in the corresponding expData object in the column names with DNA methylation β-values in
the body of the matrices. Data shown in the table are simulated and do not reflect the data in the
actual TCGA datasets.
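As a rough illustration of the formatting rules above, the following Python sketch checks row and column names before an analysis is attempted. It is not part of TENETR: the function check_dataset and the example sample names are hypothetical, the ID patterns simply encode the ENSG########### and cg######## forms given in the text, and treating the 12-character rule as a reported problem (rather than a hard error) is my own choice, since the text only requires it for downstream clinical matching.

```python
import re

# Patterns follow the forms given in the text; version suffixes such as ".12" fail.
ENSEMBL_ID = re.compile(r"^ENSG\d{11}$")
PROBE_ID = re.compile(r"^cg\d{8}$")


def check_dataset(gene_ids, probe_ids, exp_samples, met_samples):
    """Return a list of formatting problems; an empty list means none were found."""
    problems = []
    problems += [f"bad gene ID: {g}" for g in gene_ids if not ENSEMBL_ID.match(g)]
    problems += [f"bad probe ID: {p}" for p in probe_ids if not PROBE_ID.match(p)]
    if len(set(exp_samples)) != len(exp_samples):
        problems.append("duplicate sample names")
    # Expression and methylation data must describe the same samples.
    if sorted(exp_samples) != sorted(met_samples):
        problems.append("expression and methylation samples do not match")
    # Only needed if clinical data will be matched in downstream functions.
    problems += [f"short sample name: {s}" for s in exp_samples if len(s) < 12]
    return problems


# Hypothetical sample names; the second gene ID carries a version suffix.
samples = ["sample_tumor_001", "sample_tumor_002"]
report = check_dataset(
    ["ENSG00000141510", "ENSG00000012048.23"],
    ["cg00000029"],
    samples,
    samples,
)
print(report)
```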
The user also needs to specify the DNA methylation array of interest to the
DNA_methylation_manifest argument of this function, which should be the same methylation array
used for methylation data contained in the .rda file discussed previously. Again, the HM27, HM450,
EPIC, and EPIC+ arrays are currently supported and their data is imported from the TENETR.data package
by setting this argument to “HM27”, “HM450”, “EPIC”, or “EPIC_plus”, respectively. This argument also
defaults to “HM450” if not specified.
The assess_promoter argument should be set to either TRUE or FALSE, depending on whether the user
wishes to assess promoter regions or distal enhancer elements. If set to TRUE, the
step2_get_diffmeth_regions() function will identify DNA methylation probes that mark regulatory element regions
within a certain distance, as determined by the TSS_dist argument explained subsequently, of GENCODE
v36 annotated transcription start sites (TSS). If set to FALSE, this function will identify probes marking
regulatory elements that specifically lie outside of this region.
As mentioned above, the TSS_dist argument should be given an integer value which is used to
determine whether a given DNA methylation probe is located within a promoter region. Probes within a
distance equal to or less than the given value will be analyzed if the assess_promoter argument is also
set to TRUE, while probes a greater distance than the TSS_dist value from annotated TSS will be
assessed instead if assess_promoter is set to FALSE. By default this value is set to 1500 if not otherwise
specified, which is the value used in previous TENET analyses
34,35
.
The next seven arguments, use_ext_HM, use_ext_NDR, use_consensus_ENH,
use_consensus_NDR, use_ENCODE_PLS, use_ENCODE_pELS, and use_ENCODE_dELS, all can be set as
either TRUE or FALSE, depending on if the user wishes to use the DNA methylation probes found within
the regions of each of the arguments’ respective datasets to identify regulatory element probes of
interest in the step2_get_diffmeth_regions() function. As the argument names suggest, the use_ext_HM
and use_ext_NDR arguments allow the users to import DNA methylation probes identified within
external peak files they have supplied, while the remaining five arguments import probes found within
their corresponding datasets from the TENETR.data package. If a given argument is set to TRUE, then the
corresponding argument for that dataset in the step1_make_external_datasets() function will also need to
have been set to TRUE (or given a directory path, in the case of ext_HM and ext_NDR) so that the
corresponding .probelist.txt file(s) created by that function
are available for import by the step2_get_diffmeth_regions() function. By default, the
use_consensus_ENH and use_consensus_NDR are set to TRUE, while the others are set to FALSE,
matching the defaults for their respective arguments in step1_make_external_datasets().
The following four arguments, methcutoff, unmethcutoff, hypomethcutoff, and
hypermethcutoff, are essential for the operations of this function, as they control how the DNA
methylation probes are classified. The values of these cutoffs should be set between 0 and 1, or all set to
FALSE. Ideally, the methcutoff value should be greater than or equal to the hypomethcutoff, while the
unmethcutoff value should be less than or equal to the hypermethcutoff. Additionally, both the
unmethcutoff and hypermethcutoff values should be less than the hypomethcutoff and methcutoff
values. step2_get_diffmeth_regions() has a new ability to determine cutoff values based on the overall
methylation patterns in the data. The function currently does not support setting some of the
cutoff values manually while leaving others to be set by the algorithm, and will return an error alerting
the user if this is done. The default for each of these arguments is FALSE, thus ensuring the new cutoff
setting algorithm will be used.
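The constraints on the four cutoff arguments can be summarized in a short Python sketch. This is an illustration under stated assumptions rather than TENETR’s own validation: the function name validate_cutoffs and its return values are hypothetical, and the ordering that the text describes as ideal is enforced here as a hard requirement purely to make the rules explicit. The example values are arbitrary and are not TENETR defaults.

```python
def validate_cutoffs(methcutoff=False, unmethcutoff=False,
                     hypomethcutoff=False, hypermethcutoff=False):
    """Return "auto" if no cutoff is set, or "manual" if all four are consistent."""
    values = [methcutoff, unmethcutoff, hypomethcutoff, hypermethcutoff]
    unset = [value is False for value in values]
    if all(unset):
        # Nothing supplied: defer to the automatic cutoff-setting algorithm.
        return "auto"
    if any(unset):
        raise ValueError("set all four cutoffs or none of them")
    if not all(0 <= value <= 1 for value in values):
        raise ValueError("cutoffs must lie between 0 and 1")
    # Assumption: the "ideal" ordering from the text is treated as mandatory here.
    if methcutoff < hypomethcutoff or unmethcutoff > hypermethcutoff:
        raise ValueError("need methcutoff >= hypomethcutoff and "
                         "unmethcutoff <= hypermethcutoff")
    if max(unmethcutoff, hypermethcutoff) >= min(hypomethcutoff, methcutoff):
        raise ValueError("unmethcutoff and hypermethcutoff must be below "
                         "hypomethcutoff and methcutoff")
    return "manual"


print(validate_cutoffs())                      # no cutoffs supplied
print(validate_cutoffs(0.8, 0.2, 0.7, 0.3))    # arbitrary illustrative values
```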
The meth_unmeth_proportion_offset and hypometh_hypermeth_proportion_offset arguments
set parameters that control the functionality of the cutoff setting algorithm by determining how it sets
the cutoff values based on the methylation patterns observed in the normal and tumor data. These
values should ideally be set between 0 and 0.5. Unless otherwise specified, the
meth_unmeth_proportion_offset defaults to 0.2, and the hypometh_hypermeth_proportion_offset is
set to 0.1.
The min_experimental_count argument is used along with the four cutoff value arguments to
classify the probes in regulatory elements. This number represents the minimum number of
experimental samples that are needed to show a significant gain or loss of methylation of a given DNA
methylation probe for it to be considered as hypermethylated or hypomethylated. It should be set to an
integer value larger than 0, but less than the total number of samples in the expDataT and metDataT
datasets.
The final argument to the function is the purity_directory argument, which allows the user to
provide a path to a directory that contains datasets they wish to use to correct for possible
contamination by known cell types. The directory path should contain .rda files which themselves
contain data frames of methylation data organized in a similar fashion to the metData objects in the .rda
file given to the methylation_expression_dataset argument, with the purity samples in the columns
(with sample names as the column names), and the DNA methylation probe values in the rows (with the
probe IDs as row names). The DNA methylation probes in these datasets should be identical to those in
the methylation datasets in the methylation_expression_dataset. The availability of these purity
datasets is not necessary to run the step2_get_diffmeth_regions() function and by default it is set to
FALSE, as users are not expected to have formatted datasets to correct for purity of their samples.
Once run, the function will first create a “step2” subdirectory inside of the specified
TENET_directory which will contain up to 5 objects created by this function. The
step2_get_diffmeth_regions() then starts by finding the DNA methylation probes which mark regulatory
elements of interest. This is accomplished in a sequential process outlined in Figure 3-3.
As previously noted, specific epigenetic marks such as histone modifications are known to be
associated with specific regulatory elements. Thus, to identify the DNA methylation probes which mark
regulatory elements of interest, it is essential the probe falls within a region marked with histone
modifications of interest. To do this, the function first loads any available .probelist.txt files from histone
modification datasets as selected by the use_ext_HM or use_consensus_ENH arguments and creates a list of
probes that overlap with at least one of those regions.
A second feature of active genomic elements is that the chromatin in their vicinity tends to be
more accessible and nucleosome-depleted to allow relevant genomic factors access to the DNA
42,123
. To
be identified as a probe that marks an active regulatory element, the step2_get_diffmeth_regions()
function requires that probes also be found within regions of open chromatin. It therefore next imports the
.probelist.txt files from these regions as selected by the user’s input to the use_ext_NDR or
use_consensus_NDR arguments. The probes that were found to overlap histone modifications are then
checked to isolate those that also overlap these open chromatin regions.
Figure 3-3: TENETR workflow to identify DNA methylation probes marking regulatory elements
Diagram of how DNA methylation probes marking enhancer regions are identified by TENETR. First,
probes which overlapped regions with histone modification peaks of interest are identified, then the
subset of these probes which also overlapped with open chromatin regions are subsequently found.
Finally, probes located a specific distance from annotated TSS are isolated, then additional probes
found within the relevant ENCODE SCREEN cis-regulatory element datasets are added as well to
create the final regulatory element probe list for each cancer type. The data sources for the
epigenomic datasets are listed on the right. Stars represent datasets that are new to TENETR and are
built into the method and included for every cancer type analysis.
As alluded to previously in the description of the TSS_dist argument, the
step2_get_diffmeth_regions() function distinguishes between probes that mark distal enhancer
regulatory elements (those located a distance from annotated TSS, regardless of directionality) and
those that mark more proximal regulatory elements. This is done by identifying the probes that either lie
within or outside of the region the user specified with the TSS_dist and assess_promoter arguments and
excluding the ones that don’t fit those criteria (for example, removing probes that lie within 1500 bp of
TSS when assess_promoter is set to FALSE, and TSS_dist is set to a value of 1500). Though it is possible
that excluding probes marking potential regulatory elements within 1500 bp of the TSS could exclude
potential enhancer regions, we have decided to apply this filter because any observed differences in the
DNA methylation levels of probes in these regions would be difficult to attribute to the activity of an
epromoter compared to the activity of the gene itself
124
. In addition, if users don’t wish to discriminate
between these classes of regions, they can set the TSS_dist to 0, and regardless of the assess_promoter
setting, all probes would be analyzed.
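The distance filter can be illustrated with the Python sketch below, which keeps probes within TSS_dist of the nearest TSS when promoters are assessed and probes farther away otherwise. It is a simplified illustration, not TENETR’s code: the function names are hypothetical, the coordinates are made up, probes on chromosomes without any listed TSS are treated as distal by assumption, and the special handling of a TSS_dist of 0 mentioned above is not modeled.

```python
from bisect import bisect_left


def distance_to_nearest_tss(position, tss_positions):
    """Distance to the closest TSS; tss_positions must be sorted and non-empty."""
    i = bisect_left(tss_positions, position)
    # Only the neighbours on either side of the insertion point can be closest.
    neighbours = tss_positions[max(i - 1, 0):i + 1]
    return min(abs(position - tss) for tss in neighbours)


def filter_by_tss(probes, tss_by_chrom, tss_dist=1500, assess_promoter=False):
    """Keep promoter-proximal probes if assess_promoter, else distal probes.

    probes:       dict mapping probe ID -> (chromosome, position)
    tss_by_chrom: dict mapping chromosome -> sorted list of TSS positions
    """
    kept = []
    for probe_id, (chrom, position) in probes.items():
        tss_positions = tss_by_chrom.get(chrom)
        if tss_positions:
            near_tss = distance_to_nearest_tss(position, tss_positions) <= tss_dist
        else:
            near_tss = False  # assumption: no annotated TSS means distal
        if near_tss == assess_promoter:
            kept.append(probe_id)
    return sorted(kept)


# Made-up coordinates: probes at 500 bp, 1500 bp, and 20000 bp from a TSS.
example_tss = {"chr1": [10000, 50000]}
example_probes = {
    "probe_a": ("chr1", 10500),
    "probe_b": ("chr1", 11500),
    "probe_c": ("chr1", 30000),
}
distal = filter_by_tss(example_probes, example_tss, assess_promoter=False)
proximal = filter_by_tss(example_probes, example_tss, assess_promoter=True)
print(distal, proximal)
```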
Finally, any additional probes found within the ENCODE_dELS, ENCODE_pELS, and ENCODE_PLS
.probelist.txt files, if selected by the user, are added into the pool. These probes are not passed through
the filtering criteria because the ENCODE SCREEN consortium had already identified the regions
containing those probes using similar criteria to the step2_get_diffmeth_regions() function, i.e. they had
the appropriate histone modifications and open chromatin regions, and were located specified distances
from TSS.
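Taken together, the selection steps amount to a few set operations, sketched below in Python. This is a conceptual illustration only: the function regulatory_element_probes and the probe names are hypothetical, and the TSS-distance filter is represented simply as a precomputed set of probes that pass it.

```python
def regulatory_element_probes(hm_lists, ndr_lists, tss_pass, encode_lists=()):
    """Combine probe lists into the final regulatory element probe list.

    hm_lists:     probe lists from the selected histone modification datasets
    ndr_lists:    probe lists from the selected open chromatin datasets
    tss_pass:     probes that satisfy the TSS-distance criterion
    encode_lists: probe lists from the selected ENCODE SCREEN datasets
    """
    # A probe needs to overlap at least one dataset of each type.
    in_histone_peak = set().union(*hm_lists)
    in_open_chromatin = set().union(*ndr_lists)
    filtered = in_histone_peak & in_open_chromatin & set(tss_pass)
    # ENCODE probes are added afterwards without passing through the filters.
    from_encode = set().union(*encode_lists)
    return sorted(filtered | from_encode)


# Made-up probe names for illustration.
final_probes = regulatory_element_probes(
    hm_lists=[["p1", "p2"], ["p3"]],
    ndr_lists=[["p2", "p3", "p4"]],
    tss_pass=["p1", "p2", "p4"],
    encode_lists=[["p5"]],
)
print(final_probes)
```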
After these filtering steps, this final list represents the DNA methylation probes that potentially
mark regulatory elements for further use in the function. Currently, the list of probes is saved internally
as “enhancer_probes”, an admitted misnomer that reflects TENET’s heritage in identifying enhancer
probes only, but one that does not otherwise affect proper analysis of these probes by the function.
step2_get_diffmeth_regions() ultimately classifies these DNA methylation probes marking
regulatory elements of interest into four separate groups of potential interest. The unmethylated and
methylated probes represent probes with relatively low and high levels of methylation, respectively, in
both the control and experimental samples. These probes represent regulatory elements that are likely
constitutively active and inactive, respectively, in both the tumor and normal samples. To be classified as
an unmethylated probe, the mean methylation for that probe in the control samples needs to be less
than the set unmeth_cutoff value, and fewer than min_experimental_count experimental samples
may have methylation values for that probe greater than the unmeth_cutoff (Figure 3-4A). Conversely,
methylated probes are defined as those for which the mean methylation of the probe in the control
samples is greater than the meth_cutoff and fewer than min_experimental_count experimental
samples have methylation less than the meth_cutoff (Figure 3-4B). These two groups of probes are of
least interest of the four classifications of probes TENETR uses, since they do not show a significant
difference in activity between the control and experimental samples. Their counts and identities are
recorded in case the user is interested in them for other purposes, but they are otherwise not referred
to by the remaining step2 through step6 functions.
The two groups that are of particular interest are the hypermethylated and hypomethylated
DNA methylation probes. These represent probes which show higher levels of methylation in a subset of
the experimental samples compared to the control samples for hypermethylated probes, or lower levels
of methylation in a subset of the experimental samples compared to control samples for
hypomethylated probes. These probes represent regulatory elements that are differentially inactivated,
or activated, respectively, in a subset of the experimental samples, as the DNA methylation levels of
the probes should be inversely related to the activity of the regulatory regions. To be considered a
hypermethylated probe, the mean methylation for that probe in the control samples needs to be less
than the set unmeth_cutoff value, and more than min_experimental_count experimental samples
must have methylation values for that probe greater than the
hypermeth_cutoff (Figure 3-4C). Conversely, hypomethylated probes are defined as those for which the
mean methylation of the probe in the control samples is greater than the meth_cutoff and
more than min_experimental_count experimental samples have methylation less than
the hypometh_cutoff (Figure 3-4D).
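The four-way classification can be expressed compactly in Python, as in the sketch below. It is an illustration of the rules described above and not TENETR’s implementation: the function classify_probe and the dictionary keys are hypothetical, the cutoff values and β-values are made up, missing values are assumed to be absent, and treating a count equal to min_experimental_count as sufficient is an assumption, since the text describes it both as a minimum and as a number to be exceeded.

```python
def classify_probe(control, experimental, cutoffs, min_experimental_count):
    """Classify one probe from its control and experimental beta-values.

    Returns "unmethylated", "methylated", "hypermethylated", "hypomethylated",
    or None when the probe fits none of the four groups.
    """
    mean_control = sum(control) / len(control)
    above_unmeth = sum(value > cutoffs["unmeth"] for value in experimental)
    above_hyper = sum(value > cutoffs["hypermeth"] for value in experimental)
    below_meth = sum(value < cutoffs["meth"] for value in experimental)
    below_hypo = sum(value < cutoffs["hypometh"] for value in experimental)

    if mean_control < cutoffs["unmeth"]:
        # Assumption: reaching min_experimental_count is enough (>=).
        if above_hyper >= min_experimental_count:
            return "hypermethylated"
        if above_unmeth < min_experimental_count:
            return "unmethylated"
    elif mean_control > cutoffs["meth"]:
        if below_hypo >= min_experimental_count:
            return "hypomethylated"
        if below_meth < min_experimental_count:
            return "methylated"
    return None


# Arbitrary illustrative cutoffs, not TENETR defaults.
example_cutoffs = {"unmeth": 0.2, "hypermeth": 0.3, "hypometh": 0.7, "meth": 0.8}
gained = classify_probe([0.05, 0.10], [0.1, 0.5, 0.6, 0.1], example_cutoffs, 2)
stable = classify_probe([0.90, 0.95], [0.9, 0.92, 0.85, 0.4], example_cutoffs, 2)
print(gained, stable)
```

Probes whose control mean lies between the unmeth and meth cutoffs, or whose experimental samples change too often to be stable but not far enough to cross the hyper or hypo cutoffs, are left unclassified in this sketch.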
Figure 3-4: TENETR methylation cutoffs
Diagram of how TENETR classifies DNA methylation probes marking enhancer regions into four
groups based on methylation levels in normal and tumor samples. (A) Probes
with low levels of methylation, less than the unmeth_cutoff in both normal and tumor samples, are
classified as unmethylated. (B) Probes with high levels of methylation, greater than the meth_cutoff
value in both groups, are classified as methylated. (C) Probes with mean methylation less than the
unmeth_cutoff in normal samples, but methylation values in excess of the hypermeth_cutoff value in
a subset of tumor samples larger than the Min_exp_count parameter are classified as
hypermethylated. (D) Probes with mean methylation greater than the meth_cutoff in normal
samples, but methylation values less than the hypometh_cutoff value in a subset of tumor samples
larger than the Min_exp_count parameter are classified as hypomethylated. Hypermethylated and
hypomethylated probes are of particular interest to TENETR, as these probes represent regions
differentially (in)activated in normal vs. a subset of tumor samples.
Before the DNA methylation probes are classified, if the user has not specified any cutoff values,
step2_get_diffmeth_regions() will use my newly developed cutoff setting algorithm to attempt to set
cutoffs. Ultimately, the setting of cutoffs with TENETR is on some level an arbitrary approach since we
don’t know how the methylation patterns for each probe are reflective of the biological context for each
individual probe. On the one hand, we want to filter out noise in the dataset, meaning probes for which the DNA
methylation differences are too small and too infrequent in the experimental samples. However, we
also want to ensure we can detect small but biologically meaningful changes in methylation of other
probes. Additionally in setting these cutoffs, we are applying the same global cutoffs to thousands of
DNA methylation probes to ensure uniformity in the analysis, yet as mentioned before, the biological
context of every probe is potentially going to be different. Though TENETR may not exhaustively identify
every single biologically significant differentially methylated probe or exclude every single probe whose
methylation differences are explained by noise, the TENETR method primarily works in aggregate, and
as an upstream analysis to further studies (covered in Chapter 4) to identify gene-to-probe links of
particular interest. Thus, my primary focus in designing the cutoff algorithm was to ensure it would
function on a variety of methylation patterns from identified probes marking either distal enhancer or
promoter regulatory elements, while accounting for individual dataset differences in methylation
patterns between the experimental and control samples.
In previous TENET studies, I set cutoffs for distal enhancer analyses using the distribution of the
mean methylation values of the DNA methylation probes marking distal enhancer regions. I observed a
bimodal distribution with two local maxima, one for a group of probes that tended to have low levels of
methylation, and a larger one for probes that tended to have high levels of methylation, and a large
trough between them (Figure 3-5). This pattern is consistent with the concept that the DNA methylation
levels can reflect regulatory element activity, as these elements are expected to be either active and
lowly methylated, or inactive and highly methylated. Given this, I had manually set the cutoffs between
the two local maxima.
After examining similar methylation distributions in the 12 cancer datasets analyzed in this
study, I noticed the bimodal distribution remained consistent in each of them. In most of the datasets I
had also observed that while the tumor samples still displayed a high density of lowly and highly
methylated probes, there tended to be a larger number of probes with intermediate levels of
methylation, perhaps indicative of the DNA methylation probes that have gained and lost methylation
with changes in regulatory element activity in the tumor samples. However, this trend was not
consistent across all cancer types, with cancer types such as KIRP and PRAD showing only a limited
increase in intermediately methylated probe density in the tumor samples, and THCA showing almost no
difference in the methylation densities of the normal and tumor samples (Supplemental Figure 3-1).
Figure 3-5: Methylation cutoff setting
Diagram displaying a density plot of enhancer probe methylation values and methylation cutoff
values set in an example dataset. TENETR includes a new algorithm to automatically set cutoff values
for a given dataset based upon the position of the two local maxima in the bimodal distribution of
enhancer probe methylation density in the normal and tumor samples, and the distances between
them.
Based on these findings, I decided to use the local maxima in the lowly and highly methylated
density peaks to anchor the algorithm, as they were the most consistent feature in the distal enhancer
datasets I had analyzed and should therefore ensure the greatest applicability of the algorithm to other
distal enhancer datasets. After locating the DNA methylation β-values of the two local maxima in both the
control and experimental samples (Figure 3-6A), the algorithm calculates the distance between them in
each of the datasets. The distance between the two maxima in the control samples (Figure 3-6B) is
multiplied by the meth_unmeth_proportion_offset value. This value is then added to the DNA
methylation β-value position of the lowly methylated local maximum of the control samples to set the
unmeth_cutoff value and is subtracted from the position of the highly methylated local maximum of the
control samples to set the meth_cutoff value (Figure 3-6C). My reasoning is that, to be captured in any of
the four categories, a probe’s mean methylation value in the control samples needs to be less than the
unmeth_cutoff or more than the meth_cutoff; I therefore wanted the highest densities of the control
probes to fall below the unmeth_cutoff or above the meth_cutoff so that more probes are classified.
The distance between the two maxima in the
experimental samples (Figure 3-6B) is then calculated and multiplied by the
hypometh_hypermeth_proportion_offset. This value is then added to the previously calculated control
distance multiplied by the meth_unmeth_proportion_offset, and the sum is added to the position of the
lowly methylated local maximum of the experimental samples to set the hypermeth_cutoff and
subtracted from the position of the highly methylated local maximum of the experimental samples to set
the hypometh_cutoff value (Figure 3-6D). This is done to
ensure that the hypermeth_cutoff is greater than the unmeth_cutoff and the hypometh_cutoff value is
less than the meth_cutoff, so that the logic of probe selection makes sense. I used the experimental,
as opposed to control, density values to set the hypometh_cutoff and hypermeth_cutoff as I wanted to
account for alterations to the distribution of the methylation in those samples. For instance, an increase
in intermediate probes resulting in a “pinching in” of the lowly and highly methylated maxima will
reduce the hypermeth_cutoff and increase the hypometh_cutoff, increasing the likelihood these
intermediate probes will be classified in those respective categories.
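The arithmetic of the cutoff setting algorithm can be sketched in Python as follows. This is an illustrative reconstruction, not the TENETR source: the function set_cutoffs is hypothetical, it takes the positions of the two local maxima as inputs rather than locating them in a density curve, and it follows the Figure 3-6 description in which the scaled control distance is added to the scaled experimental distance. The maxima positions in the example are made up.

```python
def set_cutoffs(control_maxima, experimental_maxima,
                meth_unmeth_proportion_offset=0.2,
                hypometh_hypermeth_proportion_offset=0.1):
    """Derive the four cutoffs from the (low, high) density maxima of each group."""
    low_control, high_control = control_maxima
    low_experimental, high_experimental = experimental_maxima

    # Scaled distance between the control maxima (Figure 3-6B and 3-6C).
    control_shift = (high_control - low_control) * meth_unmeth_proportion_offset

    # Scaled distance between the experimental maxima, plus the control shift
    # (Figure 3-6D).
    experimental_shift = (
        (high_experimental - low_experimental) * hypometh_hypermeth_proportion_offset
        + control_shift
    )

    return {
        "unmeth_cutoff": low_control + control_shift,
        "meth_cutoff": high_control - control_shift,
        "hypermeth_cutoff": low_experimental + experimental_shift,
        "hypometh_cutoff": high_experimental - experimental_shift,
    }


# Made-up maxima positions for a control and an experimental group.
example_cutoffs = set_cutoffs((0.1, 0.9), (0.15, 0.85))
for name, value in example_cutoffs.items():
    print(name, round(value, 3))
```

Because the experimental shift always includes the control shift, the hypermeth_cutoff sits above the unmeth_cutoff and the hypometh_cutoff below the meth_cutoff whenever the maxima of the two groups lie at similar positions.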
This function works similarly to set cutoffs when assess_promoter is set to TRUE, but with a
small change. When I analyzed the DNA methylation probes found to overlap peaks of H3K4me3 and
open chromatin regions in the vicinity of promoters, they tended to be almost universally very lowly
Figure 3-6: Workflow of the methylation cutoff setting algorithm
Diagram displays the steps in the logic of the methylation setting algorithm. (A) The DNA methylation
β-value positions are calculated for the low and high local maxima in the density curves of the mean
DNA methylation β-values of the identified regulatory element probes in the experimental (here
tumor) and control (here normal) samples in the dataset. (B) The distances between the two pairs of
local maxima are calculated. (C) The distance between the local maxima in the control (normal)
samples is multiplied by the meth_unmeth_proportion_offset then added to the lower maxima and
subtracted from the higher maxima of the control (normal) samples to set the unmeth_cutoff and
meth_cutoff values. (D) The distance between the local maxima in the experimental (tumor) samples
is multiplied by the hypometh_hypermeth_proportion_offset value and added to the previously
calculated distance between the local maxima in the control (normal) samples multiplied by the
meth_unmeth_proportion_offset. This final value is then added to the lower maxima and subtracted
from the higher maxima of the experimental (tumor) samples to set the hypermeth_cutoff and
hypometh_cutoff values.
methylated, sometimes with only a single local maximum. Additionally, these promoter probes also
tended to show more variability in the methylation density distribution, often possessing multiple local
maxima peaks at different methylation β-values, thus confounding my previously written algorithm
(Supplemental Figure 3-2). To address this, I slightly adjusted the algorithm to investigate promoter
probes. Instead of using the mean methylation distribution of probes found in regions with overlapping
histone modification and open chromatin peaks, plus ENCODE datasets, if selected, the algorithm
investigates all DNA methylation probes found within the distance specified by the TSS_dist argument
from GENCODE v22-annotated gene and transcript transcription start sites. These probes do display a
bimodal distribution necessary for the cutoff algorithm to operate and properly set cutoffs
(Supplemental Figure 3-3).
Finally, before identified DNA methylation probes marking regulatory elements are classified, if
the user has supplied a directory with purity .rda files to the purity_directory argument, these .rda files
are loaded and used as an additional filter to the probes. This is done with the intention of removing
probes whose methylation levels may be influenced by contamination of the samples with extraneous
cell types whose data is included in the supplied purity .rda files. First, the probes present in every purity
file as well as the identified regulatory element probes are isolated. These probes are considered the
final group of regulatory element probes, and are used in the subsequent cutoff calculation, if that is
also selected by the user. After cutoffs are set, the mean methylation value of each probe in the final
regulatory element probe list is calculated from each purity file. These values are then used to identify the
hypermethylated and hypomethylated probes. In addition to requiring that hypermethylated probes
have mean methylation in the control samples less than the unmeth_cutoff value and that hypomethylated
probes have mean methylation in the control samples greater than the meth_cutoff value, the function also
requires that hypermethylated probes have mean methylation values less than the unmeth_cutoff in each of
the purity files, and that hypomethylated probes have mean methylation values greater than the meth_cutoff
in each of the purity files.
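The purity requirement amounts to applying the same control-sample test to every purity dataset. A minimal Python sketch, assuming the per-probe mean methylation values have already been computed (the function name and argument layout are hypothetical, and this covers only the control and purity conditions, not the full probe classification):

```python
def passes_purity_filter(probe_class, control_mean, purity_means,
                         unmeth_cutoff, meth_cutoff):
    """Check the control and purity conditions for one probe.

    probe_class  -- "hypermeth" or "hypometh"
    control_mean -- mean beta-value of the probe in the control samples
    purity_means -- mean beta-value of the probe in each purity dataset
    """
    if probe_class == "hypermeth":
        # Must look unmethylated in controls and in every purity dataset.
        return (control_mean < unmeth_cutoff
                and all(m < unmeth_cutoff for m in purity_means))
    if probe_class == "hypometh":
        # Must look methylated in controls and in every purity dataset.
        return (control_mean > meth_cutoff
                and all(m > meth_cutoff for m in purity_means))
    raise ValueError("probe_class must be 'hypermeth' or 'hypometh'")
```

A candidate hypomethylated probe that is already lowly methylated in one of the contaminating cell types would fail this check, since its apparent hypomethylation in the experimental samples could reflect sample composition.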
Once the DNA methylation probes are classified, the method saves the results in the “step2”
subdirectory created by the function in the “diff.methylated.datasets.rda” file. This .rda file contains a
copy of the expDataN, expDataT, metDataN, metDataT, and clinical data frames from the .rda supplied
to the methylation_expression_dataset argument. It also contains the four cutoff values,
whether supplied by the user or calculated automatically by the function, as well as the
min_experimental_count value. Finally, the .rda also contains vectors with the identities of the
methylated, unmethylated, hypermethylated, and hypomethylated probes for use in downstream
functions. The function also saves a “TENET_step2_overall_metadata.txt” file that contains the values of
the cutoffs and the min_experimental_count, as well as the total number of unmethylated, methylated,
hypermethylated, hypomethylated probes, and regulatory element probes identified, as well as the total
number of DNA methylation probes and genes contained in the methylation_expression_dataset .rda
file. A “probes_called_by_dataset.txt” file is also created, which lists the names of each of the
epigenomic datasets analyzed in the function, and the DNA methylation probe IDs of each of the
identified regulatory element probes in the row names of the file. The body of the file notes within
which regions each of the probes was found with either TRUE or FALSE values. Lastly, if the user has
opted to allow the step2_get_diffmeth_regions() function to automatically set the cutoff values, two
additional .png files are created. The first,
“control_and_experimental_methylation_density_curves.png” displays the mean methylation value
densities of the identified regulatory element probes in the control and experimental samples, while the
“TENET_cutoff_selection_plot.png” shows those same density curves plus the positions of the cutoff
values set by the function.
For the “default” analyses I set the TENET_directory argument to be the output_directory
argument I specified from the step1_make_external_datasets() function. For
methylation_expression_dataset I specified the paths to the .rda files I created for each cancer type
using the TCGA datasets described previously, and I used the default value of “HM450” for the
DNA_methylation_manifest argument, as the methylation data for these TCGA datasets was derived
from the HM450 array. The assess_promoter argument was set to FALSE to analyze distal enhancer
regions, and the TSS_dist argument was set to the default value of 1500. For the dataset arguments,
use_ext_HM, use_ext_NDR, use_consensus_ENH, use_consensus_NDR, and use_ENCODE_dELS were all
set to TRUE to use the cancer-type-specific histone modification and open chromatin datasets we
compiled for each dataset in the analyses, as well as both the consensus enhancer and open chromatin
datasets built into TENETR. The ENCODE distal enhancer-like signatures dataset was included to facilitate the
analyses of distal enhancers. I set the cutoff arguments, unmeth_cutoff, meth_cutoff,
hypermeth_cutoff, and hypometh_cutoff, to FALSE to make use of the new cutoff setting algorithm in
each cancer type. The meth_unmeth_proportion_offset and hypometh_hypermeth_proportion_offset
arguments were set to their default values of 0.2 and 0.1, respectively, and the min_experimental_count value
was similarly set to its default of 5. Finally, I did not use any purity information in any of the cancer
analyses, so I set the purity_directory argument to FALSE.
3.2.7.3 step3_get_analysis_z_scores():
The third in the sequence of TENET functions, step3_get_analysis_z_scores(), is relatively simple
in its process. For each hypermethylated or hypomethylated DNA methylation probe identified by the
step2_get_diffmeth_regions() function, it calculates Z-scores comparing gene expression between the
samples that are more methylated and less methylated for that probe. This calculation is carried out across all the genes or
TFs in the gene expression datasets provided by the user, and significant positive and negative Z-scores
are identified and sorted by this function. Although relatively simple, this function does most of the
heavy statistical lifting of the TENETR method, linking the expression of genes to the DNA methylation
level of probes, and is the most time-consuming step of the pipeline.
As before, TENET_directory is the first argument of this function, and requires the user provide a
path to the directory that contains the results produced by the previous step2_get_diffmeth_regions()
function. Conveniently, this is the same argument name (and path) that was provided to that function.
The next two arguments, hypermeth_analysis and hypometh_analysis, allow users to calculate
Z-scores for hypermeth probes, hypometh probes, or both by setting these arguments to TRUE or FALSE.
Setting both to FALSE, however, will induce the function to return an error as these are the two probe
classifications of interest, showing differential methylation in the control and experimental samples, and
if neither is selected for analysis, the remaining probes aren’t of particular interest for TENETR to study.
Next, the use_case_only argument controls whether the Z-score calculation should be
performed only in the experimental samples if set to TRUE, or if the control samples should also be
considered in the analysis. Setting this to TRUE increases the number of significant Z-scores identified,
and thus the total number of probe-gene links identified by TENETR in total. By default, this is set to
FALSE.
The TF_only argument is a new, key functionality in TENETR, and allows users to control
whether they wish to analyze only annotated TF genes in the analysis, if set to TRUE, or if they want to
perform calculations between the probe types of interest and all genes, if set to FALSE. Setting this to
TRUE vastly reduces the calculation times of the analysis, but increases the total number of significant
gene to probe links identified, even for the TF genes, due to calculations performed in the downstream
step5_optimize_links function.
The significant_p_value argument sets the p-value threshold at which a Z-score is considered significant
in this analysis. This value should be set to a number greater than 0 but less than 1 and defaults to 0.05.
Finally, as with step1_make_external_datasets(), the core_count argument allows the user to
use multiple cores to increase the speed of this function. This argument can be set to an integer value
equal to the number of cores the user wishes to use but cannot exceed the total number of cores
available on their system. This argument defaults to 1.
As with previous functions, step3_get_analysis_z_scores() creates a “step3” subdirectory in the
specified TENET_directory. Within that subdirectory, further “hypermeth_analysis” and
“hypometh_analysis” subdirectories are created if hypermeth_analysis and hypometh_analysis
arguments are set to TRUE, respectively, to contain the analyses of those types.
The "diff.methylated.datasets.rda" file output by the step2_get_diffmeth_regions() function is
loaded to access the DNA methylation and gene expression values for each of the control and
experimental samples, along with the cutoff_values set from that function. Then for each of the
hypermethylated or hypomethylated probes, depending on the settings of the hypermeth_analysis and
hypometh_analysis arguments, the individual experimental samples are identified which have
methylation values above the hypermeth_cutoff value or below the hypometh_cutoff value. The
samples themselves are said to be “hypermethylated” for the hypermethylated probes, or
“hypomethylated” for the hypomethylated probes. If use_case_only is set to FALSE, the control
samples are first combined with the experimental samples and included in this categorization as well.
After the hypermethylated samples or hypomethylated samples are identified, the Z-score calculation
is performed by comparing the expression of every gene of interest, depending on the TF_only
argument setting, for groups based on sample methylation for each of the hypermethylated or
hypomethylated probes (Figure 3-7A). This is done by taking the average expression of a given gene in
the relatively more methylated group for a given probe, subtracting the average expression of that gene
in the less methylated group of that probe, and then dividing this total by the standard deviation of the
expression in the less methylated group of that probe (Figure 3-7B). When considering hypermeth
probes, the more methylated group represents hypermethylated samples with methylation levels for
that probe above the hypermeth_cutoff value, while the less methylated group includes all remaining
samples. Conversely, when considering hypometh probes, the more methylated group includes all
samples with methylation levels above the hypometh_cutoff value, while the less methylated group
represents the hypomethylated samples with values below it.
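The Z-score for a single gene and probe can be sketched in Python as follows. This is an illustration only and not the TENETR implementation: the function name is hypothetical, samples exactly at the cutoff are assumed to fall in the less methylated group, and the sample standard deviation (n − 1 denominator) is assumed.

```python
from statistics import mean, stdev


def link_z_score(expression, methylation, cutoff):
    """Z-score of one gene against one probe.

    expression  -- expression of the gene in each sample
    methylation -- beta-value of the probe in the same samples
    cutoff      -- hypermeth_cutoff (hypermeth probes) or
                   hypometh_cutoff (hypometh probes)
    """
    more = [e for e, m in zip(expression, methylation) if m > cutoff]
    less = [e for e, m in zip(expression, methylation) if m <= cutoff]

    # The standard deviation needs at least two samples in the less
    # methylated group, and the numerator needs a non-empty comparison group.
    if not more or len(less) < 2:
        return None
    spread = stdev(less)
    if spread == 0:
        return None

    # (mean of more methylated - mean of less methylated) / SD of less methylated
    return (mean(more) - mean(less)) / spread


# Five made-up samples: two above the cutoff, three below it.
z = link_z_score(expression=[10, 12, 2, 4, 6],
                 methylation=[0.9, 0.8, 0.1, 0.2, 0.3],
                 cutoff=0.5)
```

In this toy example the gene is more highly expressed in the more methylated samples, so the Z-score is positive; a gene expressed more highly in the less methylated samples would give a negative value.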
After Z-scores are calculated for each of the genes of interest and for each of the
hypermethylated or hypomethylated probes, the Z-score results are saved to files ending with
“_zscore.txt” with the type of analysis and the ID of the gene in the name of the file. In addition,
significant Z-scores are isolated from each of the individual files per gene and saved to their own output
files. These files include the gene and probe IDs as well as significant Z-scores and are separated into
two files depending on the direction of effect. The files ending with “Gplus_sig_link_zscores.txt” contain
the significant hypermeth or hypometh probe to gene links with negative values, indicating that there is
increased expression of the gene in samples that have less methylation of the hypermethylated or
hypomethylated probes, while the files ending with “Gminus_sig_link_zscores.txt” contain the significant
hypermeth or hypometh probe to gene links with positive values, showing increased expression of the
gene is associated with higher methylation of the hypermethylated or hypomethylated probes (Figure 3-7C).
The results for each type of analysis are saved in the “hypermeth_analysis” and
“hypometh_analysis” subdirectories within the “step3” subdirectory.
For the “default” analyses, I set the TENET_directory argument to be the same directory I had
specified for the step1 and step2 functions. Although I focus mostly on the hypomethylated probe
results, I set both hypometh_analysis and hypermeth_analysis to TRUE to generate results for both. I set
use_case_only to FALSE, its default value, and I set TF_only to TRUE because I am primarily interested in
the results for actual TF genes, and it takes vastly less time to run. The significant_p_value was set to
0.05, its default value and typical bar for significance.
3.2.7.4 step4_permutate_z_scores():
Like step3_get_analysis_z_scores(), step4_permutate_z_scores() is another straightforward
function. This function takes the significant Z-scores calculated from the step3_get_analysis_z_scores()
function and calculates an empirical p-value for each significant gene-to-probe link by ranking the
Figure 3-7: Z-score calculation and classification of gene-to-probe links
(A) The step3_get_analysis_z_scores function calculates Z-scores for every combination of
each gene of interest and each hyper- or hypomethylated probe, and significant Z-scores are
selected. (B) Formula for calculation of Z-scores by the step3_get_analysis_z_scores function. The more
methylated group represents tumor samples with relatively higher levels of methylation of the
probe of interest in the calculation, such as the hypermethylated samples in the analysis of
hypermeth probes for instance. Conversely, the less methylated group is composed of the remaining
samples with lower levels of methylation, such as the hypomethylated samples in the analysis of
hypomethylated probes. (C) Examples of gene-to-probe links of the four different TENETR
classifications.
Z-scores generated for each probe to a given gene.
The TENET_directory is again the first argument of this function, and requires the user provide a
path to the directory that contains the Z-scores calculated by the previous
step3_get_analysis_z_scores() function. This is again the same argument name (and path) that was
provided to that function.
Since this function now looks for gene-to-probe links that have been classified based on
differential methylation of the probes and expression of the genes, there are four arguments,
hypermeth_Gplus_analysis, hypermeth_Gminus_analysis, hypometh_Gplus_analysis, and
hypometh_Gminus_analysis, which can be set to TRUE or FALSE to control if the user wishes to perform
empirical p-value calculations for each class of gene-to-probe links.
The only remaining argument is the core_count argument. As with prior functions, it allows
the user to use multiple cores to increase the speed of this function. This argument can be set to an
integer value equal to the number of cores the user wishes to use but cannot exceed the total number
of cores available on their system. This argument defaults to 1.
As with previous functions, step4_permutate_z_scores() first creates a “step4” subdirectory in
the specified TENET_directory which will contain the files created by this function.
For each of the four analysis types selected, the function will load its corresponding
“sig_link_zscores.txt” file from either the “hypermeth_results” or “hypometh_results” subdirectories,
which contain the significant Z-scores for the specified type of gene-to-probe links. Once loaded, for
each gene listed in the significant probe to gene link pairs, the Z-score file output for that gene to that
type of probe is also loaded. The complete list of Z-scores for the specified classification of DNA
methylation probes to that gene is then sorted by decreasing Z-score for the “minus” analyses, where
again, expression of the given gene is higher in the more highly methylated group resulting in significant
positive Z-scores, and is sorted by increasing Z-scores instead for the “plus” analyses, where expression
of the given gene is lower in the more highly methylated group instead, resulting in significant negative
Z-scores.
Once sorted, the rank of the individual probe in the gene-to-probe link compared to all probes
of the same classification is calculated, and the empirical p-value for that link is calculated as the rank
of that given probe for that gene, divided by the total number of probes of the specified classification
(Figure 3-8). These empirical p-values are calculated for each significant Z-score and saved to a
“sig_link_zscores_perm.txt” file in the created step4 subdirectory, headed by the type of link classification they contain
(hypermeth_Gplus, hypermeth_Gminus, hypometh_Gplus, or hypometh_Gminus). These files also
contain the probe and gene IDs for the given link, as well as their Z-score calculated from the
step3_get_analysis_z_scores() function alongside their empirical p-values.
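The rank-based empirical p-value can be sketched in Python as follows. This is an illustrative sketch rather than the TENETR code: the function name and probe IDs are hypothetical, and ties are broken by input order, which is an assumption not taken from the text.

```python
def empirical_p_value(z_scores, probe_id, link_type):
    """Rank of one probe's Z-score among all probes of the same class.

    z_scores  -- dict mapping probe ID to the Z-score against one gene
    probe_id  -- the probe in the significant gene-to-probe link
    link_type -- "Gplus" (negative Z-scores of interest, sorted increasing)
                 or "Gminus" (positive Z-scores, sorted decreasing)
    """
    if link_type not in ("Gplus", "Gminus"):
        raise ValueError("link_type must be 'Gplus' or 'Gminus'")

    # Most extreme Z-score in the direction of interest gets rank 1.
    ordered = sorted(z_scores, key=z_scores.get,
                     reverse=(link_type == "Gminus"))
    rank = ordered.index(probe_id) + 1
    return rank / len(ordered)


# Made-up Z-scores of four probes against a single gene.
z_scores = {"cg_a": -3.0, "cg_b": -1.0, "cg_c": 0.5, "cg_d": 2.0}
p_plus = empirical_p_value(z_scores, "cg_a", "Gplus")     # rank 1 of 4
p_minus = empirical_p_value(z_scores, "cg_c", "Gminus")   # rank 2 of 4
```

Because the denominator is the number of probes of that classification, the smallest attainable empirical p-value depends on how many hypermethylated or hypomethylated probes were identified in step 2.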
For this function, the default settings are simple; I supplied the same directory I had for the
previous three functions to the TENET_directory argument, and specified all four analysis types as TRUE
to collect data for all of them.
3.2.7.5 step5_optimize_links():
This is the final of the computational steps and performs the final analyses before selecting
Figure 3-8: Calculation of empirical p-values
The step4_permutate_z_scores function calculates empirical p-values for the significant gene-to-
probe Z-scores from the step3 function by sorting and ranking the Z-scores for each probe to a given
gene, then calculating the empirical p-values based on a given probe’s Z-score to that gene relative
to other probes.
individual gene-to-probe links that are significant according to the TENETR method. This function is
somewhat more complex than the previous two functions in the different types of calculations it
performs to “optimize” the links.
The TENET_directory once again is the first argument of this function, and requires the user
provide a path to the directory that contains the empirical p-values calculated by the previous
step4_permutate_z_scores() function. This is again the same argument name (and path) that was
provided to that function.
Similar to the step4_permutate_z_scores() function, this function also requires the user to
specify either TRUE or FALSE for four arguments, hypermeth_Gplus_analysis, hypermeth_Gminus_analysis,
hypometh_Gplus_analysis, and hypometh_Gminus_analysis, to control if the user wishes to perform
optimization for each class of gene-to-probe links. Any analysis type set to TRUE here must also have
been set to TRUE in the previous function.
The adj_pval_cutoff allows the user to set a significance level for multiple testing-corrected
Wilcoxon rank sum tests performed in the function. This value should be set to a number between 0 and
1 and defaults to 0.05, the standard metric of significance.
The hyper_stringency and hypo_stringency arguments are similar to the hypermeth_cutoff and
hypometh_cutoff arguments from the step2_get_diffmeth_regions() function. These arguments set
methylation β-value cutoffs used when optimizing gene-to-probe links. If these arguments are not set,
they default to the hypermeth_cutoff and hypometh_cutoff values set in the
step2_get_diffmeth_regions() function.
The final argument is once again the core_count argument. As with prior functions, it allows the
user to use multiple cores to increase the speed of this function. This argument can be set to an integer
value equal to the number of cores the user wishes to use but cannot exceed the total number of cores
available on their system. This argument defaults to 1.
As with prior functions, first step5_optimize_links() creates a “step5” subdirectory inside of the
specified TENET_directory to store the output of the function.
The function then loads the “sig_link_zscores_perm.txt” files created by the
step4_permutate_z_scores() function as specified by the hypermeth_Gplus_analysis,
hypermeth_Gminus_analysis, hypometh_Gplus_analysis, and hypometh_Gminus_analysis arguments.
The “diff.methylated.datasets.rda” file originally output by the step2_get_diffmeth_regions() function is
also loaded in order to access the methylation and expression values for each control and experimental
sample, as well as the cutoff values set in that function.
Next, the function performs a series of calculations and filters gene-to-probe links based upon
those calculations to arrive at a final optimized list of probe to gene links.
The first of these calculations checks that the genes in the links are expressed, i.e., have mean
expression greater than 0, in both the control and experimental samples. This does mean that genes
that are entirely deactivated in the control or experimental samples won’t be analyzed. However,
analyzing these would induce a divide by zero error when performing the optimization calculations.
Then, the step5_optimize_links() function performs a Wilcoxon rank sum test comparing the
expression level of the gene in the link between the control samples, and the experimental samples that
happen to have methylation values greater than the hypermeth_cutoff, or below the hypometh_cutoff,
depending on the type of probe to gene link analyzed. The resulting p-value is then corrected for
multiple testing using the Benjamini-Hochberg (BH) method for the total number of probe to gene links
of the given classification type (hypermeth_Gplus, hypermeth_Gminus, hypometh_Gplus, or
hypometh_Gminus). Probe to gene links with BH-adjusted p-values less than the set adj_pval_cutoff
value are kept. As alluded to earlier, it is this optimization step that results in discrepancies in the
number of links found when the TF_only argument is set to TRUE in the step3_get_analysis_z_scores()
function. When that value is set to TRUE, only links to TFs are considered, resulting in fewer
total links being analyzed, and thus a less severe multiple testing penalty. The final adjusted p-value is
recorded as the “wilcox.expNcTc” value.
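The effect of the number of links on the multiple testing penalty can be illustrated with a small Benjamini-Hochberg sketch in Python. This is not the TENETR code, the function name is hypothetical, and the p-values are made up; the Wilcoxon rank sum test itself is assumed to have been run beforehand.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(p_values)
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


adj_pval_cutoff = 0.05

# The same two links tested alone, or alongside two additional links.
few_links = benjamini_hochberg([0.01, 0.04])
many_links = benjamini_hochberg([0.01, 0.04, 0.03, 0.2])

kept_few = [p < adj_pval_cutoff for p in few_links]
kept_many = [p < adj_pval_cutoff for p in many_links]
```

In this toy example the link with a raw p-value of 0.04 passes the 0.05 threshold when only two links are tested but fails once two more links are added, which mirrors why restricting the analysis to TFs can retain more TF links.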
Next, the mean expression values for the gene in the control samples and experimental samples
are examined. For “Gplus” links with negative Z-scores, it is expected that the mean expression for the
gene is higher in the control samples than experimental samples, and the opposite is true for the
“Gminus” links with positive Z-scores. Probe to gene links that fit this criterion are also kept.
Following this, for each probe to gene link, the function checks that more experimental samples
than the min_experimental_count set in step2_get_diffmeth_regions() are both hypermethylated or
hypomethylated for the given probe (with methylation β-values more than the set hypermeth_cutoff
or less than the set hypometh_cutoff, respectively) and possess expression values for the linked gene
less than the mean expression value for that gene in all experimental samples for Gplus links, or more
than the mean expression value for that gene in all experimental samples for Gminus links.
The final criterion is that the maximum methylation value amongst the experimental samples
that are hypermethylated for a given probe, with methylation β-values more than the set
hypermeth_cutoff, in a hypermethylated probe to gene link, exceeds the set hyper_stringency value.
Similarly, the minimum methylation value amongst the experimental samples hypomethylated for a
given probe, with methylation β-values less than the hypometh_cutoff, in a hypomethylated gene-to-
probe link, is less than the hypo_stringency value to ensure that at least one of the hypermethylated or
hypomethylated samples shows sufficiently extreme methylation values (Figure 3-9).
For probe to gene links that pass the final optimization steps, the probe and gene IDs, calculated
Z-score, and empirical p-value calculated from previous functions for each probe to gene link are
included in the output file. Data collected during the optimization steps is also added, including, in
order, the mean expression of the linked gene in the control samples, mean expression in the
experimental samples, the Wilcoxon rank sum p-value, the total number of experimental samples that
are hypermethylated or hypomethylated for the given probe, the mean expression of the gene in the
hypermethylated or hypomethylated experimental samples, the number of hypermethylated or
hypomethylated experimental samples with expression of the gene greater or less than the mean tumor
expression overall, the maximum or minimum methylation value amongst the hypermethylated or
hypomethylated experimental samples, and finally the BH-corrected Wilcoxon rank sum p-value. This
complete table of data is saved in “sig_link_zscores_perm_optimized.txt” files, headed by the type of
link classification those files contain (hypermeth_Gplus, hypermeth_Gminus, hypometh_Gplus, or
hypometh_Gminus), in the created “step5” subdirectory.
For my default analyses, the TENET_directory remained the same as in previous functions, and
each of the four analysis types was set to TRUE to collect data for each. The adj_pval_cutoff was set to
Figure 3-9: Optimization of gene-to-probe links
A list of the statistical criteria the step5_optimize_links function uses to select strong gene-to-probe
links previously identified by the step3 and step4 functions for final consideration. Criteria listed
reflect those for identifying hypometh.Gplus gene-to-probe links.
the default of 0.05, and the hyper_stringency and hypo_stringency values were set to the previously
determined hypermeth_cutoff and hypometh_cutoff values respectively.
3.2.7.6 step6_probe_per_gene_tabulation():
This is the last of the mainline TENETR functions and is the most straightforward. This function
collects optimized link information from the previous step5_optimize_links() function then tallies up the
number of linked hypermethylated or hypomethylated probes for each of the Gplus or Gminus genes
and TFs, and outputs sorted lists for them by the number of linked probes in each of the four
classifications. These represent the top genes and TFs from a TENETR analysis.
The function requires a path be provided to the TENET_directory that contains the optimized
link data previously calculated by the step5_optimize_links() function. This should be the same path that
was provided to this argument in previous functions.
As with the previous two functions, the hypermeth_Gplus_analysis,
hypermeth_Gminus_analysis, hypometh_Gplus_analysis, and hypometh_Gminus_analysis arguments
need to be set to either TRUE or FALSE, depending on which classifications of probe to gene links the
user wishes to tabulate. If any of these are set to TRUE, it is necessary that these were also set to TRUE
in the previous two functions so data is available for this function to work with.
First, the step6_probe_per_gene_tabulation() function creates a “step6” subdirectory inside of
the specified TENET_directory to hold the results generated from it.
Next, the function loads the “sig_link_zscores_perm_optimized.txt” files created by the
step5_optimize_links() function as specified by the hypermeth_Gplus_analysis,
hypermeth_Gminus_analysis, hypometh_Gplus_analysis, and hypometh_Gminus_analysis arguments.
For each of the four categories of analysis, the number of probes linked to each of the available
genes, as well as to each of the annotated TF genes from the human_transcription_factors_dataset in the
TENETR.data package, is calculated for that analysis type. The genes are then organized by decreasing
linked probe count.
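The tabulation step is essentially a count of linked probes per gene followed by a sort. A minimal Python sketch, with a hypothetical function name and made-up gene and probe IDs (TENETR itself performs this in R on the optimized link files):

```python
from collections import Counter


def probes_per_gene(links, tf_genes=None):
    """Count linked probes per gene and sort by decreasing count.

    links    -- iterable of (gene_id, probe_id) pairs from one link class
    tf_genes -- optional set of TF gene IDs; if given, only these are kept
    """
    # Deduplicate so that a repeated gene-probe pair is counted once.
    counts = Counter(gene for gene, _probe in set(links))
    if tf_genes is not None:
        counts = Counter({g: c for g, c in counts.items() if g in tf_genes})
    # Sort by decreasing count, then by gene ID for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


links = [("GENE_A", "cg1"), ("GENE_A", "cg2"), ("GENE_B", "cg1"),
         ("GENE_C", "cg3"), ("GENE_A", "cg3")]
all_genes = probes_per_gene(links)
tfs_only = probes_per_gene(links, tf_genes={"GENE_A", "GENE_C"})
```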
The Ensembl gene IDs, number of linked probes, and common gene names are then saved to a
text file. The file containing the information for all genes ends with “links_all_genes_freq.txt”, and the
file containing information for just the TF genes ends with “links_all_TF_freq.txt”. Each of these files is
also headed with the type of link classification those files contain (hypermeth_Gplus,
hypermeth_Gminus, hypometh_Gplus, or hypometh_Gminus), and the two files per classification type
are saved in the created “step6” subdirectory. Of note, analyses where TF_only was set to TRUE in the
step3_get_analysis_z_scores() function will still have “all_genes” and “all_TF” files created. However,
since only TF genes were analyzed, both of these files will contain the same information as each other.
As before, for my default analyses, the TENET_directory remained the same, and each of the
four analysis types was set to TRUE. However, we are primarily focused on the hypomethylated probe
Gplus TF links across the cancer types, as these represent TFs with increased expression in at least a
subset of the tumor samples, associated with hypomethylation of distal enhancer probes, representing
an activation of those elements in the tumors.
3.2.8 Use of USC Center for Advanced Research Computing (CARC) high performance
computing cluster (HPCC):
TENETR computation was performed using the USC CARC HPCC. Example .slurm and .R scripts
used to execute TENETR runs for the 12 cancer types consecutively are included (Supplemental Data 3-
2).
3.2.9 Linked hypermethylated and hypomethylated probe similarity calculation:
To identify the overlap proportions between cancer datasets in terms of shared
hypermethylated or hypomethylated probes, only probes that were found to overlap regions in
consensus datasets (consensus enhancers, consensus open chromatin, and ENCODE SCREEN distal
enhancer-like signatures datasets) were used. Dataset similarity was calculated using the formula
(overlapped probe count × 2)/(probes in dataset one + probes in dataset two). Heatmaps were generated using
the heatmap.2 function from the gplots package (https://github.com/talgalili/gplots). Unsupervised
clustering was performed using the Euclidean distance measurement and the ward.D2 clustering method [125].
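The similarity formula above is the Sørensen-Dice coefficient applied to two probe sets. A short Python sketch with a hypothetical function name and made-up probe IDs:

```python
def probe_set_similarity(probes_one, probes_two):
    """(overlapped probe count x 2) / (probes in set one + probes in set two)."""
    set_one, set_two = set(probes_one), set(probes_two)
    total = len(set_one) + len(set_two)
    if total == 0:
        # Assumption: two empty probe sets are treated as having no overlap.
        return 0.0
    return 2 * len(set_one & set_two) / total


# Two made-up cancer types sharing two of their linked probes.
similarity = probe_set_similarity({"cg1", "cg2", "cg3"},
                                  {"cg2", "cg3", "cg4", "cg5"})
```

The value ranges from 0 (no shared probes) to 1 (identical probe sets), so it is not biased toward the dataset with more probes.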
3.2.10 RBO settings and comparisons:
RBO is a method used to compare the degree of difference between two ranked lists. This
method was used to compare the lists of TFs ranked by number of linked probes output by the
step6_probe_per_gene_tabulation() of different TENETR analyses. RBO was selected because it can
compare differences in lists of different lengths and places a greater weight on differences to higher
ranked entries in the lists (Supplemental Figure 3-4). RBO comparisons were performed using the rbo()
function in the gespeR package [126]
(https://www.bioconductor.org/packages/release/bioc/html/gespeR.html). For all RBO comparisons,
the p weighting parameter argument was set to 0.92046, calculated such that the top 10 ranked items
accounted for almost 80% (79.9997%) of the overall value. The k argument was set to default settings,
and uneven.lengths was set to TRUE when the compared ranked lists were of uneven length. Finally, the
side argument was set to “bottom”, as we recorded increasing ranks in increasing order.
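The choice of p can be checked against the closed-form expression commonly given for the share of the total RBO weight carried by the top d ranks. The sketch below is written in Python, does not call gespeR, and is not a reimplementation of its rbo() function, which may differ in how it extrapolates, truncates, or handles uneven lists. The function names rbo_top_weight and rbo_truncated are hypothetical, and rbo_truncated assumes each list contains no repeated items.

```python
import math


def rbo_top_weight(p, d):
    """Share of the total RBO weight carried by ranks 1..d for persistence p."""
    partial = sum(p ** i / i for i in range(1, d))
    return 1 - p ** (d - 1) + ((1 - p) / p) * d * (math.log(1 / (1 - p)) - partial)


def rbo_truncated(list_s, list_t, p):
    """Base (non-extrapolated) RBO, summed to the depth of the shorter list."""
    depth = min(len(list_s), len(list_t))
    seen_s, seen_t = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_s.add(list_s[d - 1])
        seen_t.add(list_t[d - 1])
        agreement = len(seen_s & seen_t) / d  # overlap of the two top-d prefixes
        score += p ** (d - 1) * agreement
    return (1 - p) * score


# Weight placed on the top 10 ranks with the p value used in this chapter.
weight_top10 = rbo_top_weight(0.92046, 10)
print(f"{weight_top10:.4f}")
```

Because the weights decay geometrically with rank, swapping two TFs near the top of a list changes the score far more than swapping two TFs near the bottom, which is the behavior that motivated the use of RBO here.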
Unless otherwise noted, calculations were performed comparing the results from TENETR
analyses which utilized the “standard” data and parameters described previously to those
from TENETR analyses with specified alterations to either the input datasets or specific parameters.
The first comparison was to a TENETR analysis of all genes in the TCGA datasets, instead of just the
annotated TFs, by setting the step3_get_analysis_z_scores function’s TF_only argument to FALSE. The
second set of comparisons compared the results from the standard TENETR analyses of each cancer type
to each other. The third comparison was to a TENETR analysis where all cancer-type-specific external
epigenomic datasets created for each cancer type were excluded from the analysis, by setting the
step2_get_diffmeth_regions function’s use_ext_HM and use_ext_NDR arguments to FALSE. The fourth
set of comparisons analyzed the effects of adjusting the minimum number of tumors needed to consider
a given probe as hyper/hypomethylated by adjusting the step2_get_diffmeth_regions function’s
min_experimental_count value. The fifth set of comparisons analyzed how specific adjustments to the
cutoff values defined by step2_get_diffmeth_regions’s methcutoff, hypomethcutoff, unmethcutoff, and
hypermethcutoff arguments affected the overall results. Cutoffs were set to the same specific values
across all 12 cancer analyses, or by making set adjustments to the cutoff values determined for each
cancer type by the cutoff setting algorithm. A final RBO analysis was performed by randomly dividing the
tumor samples of the BRCA dataset into two equal groups one hundred times, then running TENETR on
each pair and comparing them to assess how dataset sampling could affect TENETR results.
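The repeated random division of the tumor samples can be sketched as follows. This is an illustrative Python outline, not the script used for the analysis: the sample identifiers, the seed, and the helper names split_in_half and make_splits are hypothetical, and dropping one sample when the total is odd is an assumption made so that the two groups are always equal in size.

```python
import random


def split_in_half(sample_ids, rng):
    """Shuffle the samples and return two equal-sized, non-overlapping groups."""
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # Assumption: if the sample count is odd, the last shuffled sample is left out.
    return shuffled[:half], shuffled[half:2 * half]


def make_splits(sample_ids, n_splits=100, seed=0):
    """Create n_splits independent random halvings using a seeded generator."""
    rng = random.Random(seed)
    return [split_in_half(sample_ids, rng) for _ in range(n_splits)]


# Placeholder identifiers standing in for the tumor samples of one dataset.
tumor_ids = [f"tumor_{i:03d}" for i in range(776)]
splits = make_splits(tumor_ids)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each pair of groups would then be analyzed separately and the two resulting ranked TF lists compared by RBO; seeding the generator keeps the 100 subdivisions reproducible.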
Heatmaps for the RBO analyses were done using the heatmap.3() function
(https://raw.githubusercontent.com/obigriffith/biostar-tutorials/master/Heatmaps/heatmap.3.R).
Coloration was done using the jet.colors() function from the matlab R package
(https://github.com/HenrikBengtsson/R.matlab).
3.3 Results and discussion
3.3.1 Identification of regulatory elements across cancer types:
Despite sharing a single genome, the myriad number of cell types in the body have unique
transcriptional outputs, as controlled by the activity of the cells’ individual epigenomes and
transcriptional regulators. TENET analyses specialize in identifying dysregulated regulatory
elements, primarily in tumors. Previous TENET analyses focused on identifying such elements in a
relatively limited number of cancer types, including breast, kidney, lung, and prostate cancers, across
two versions of the method [34,35]. Here, I applied TENETR to 12 individual cancer types to identify
differentially activated distal enhancer elements as well as the top TFs by the number of identified
elements linked to them.
In this study, I chose to primarily focus on distal enhancers over promoters because they have
broad regulatory control, often regulating multiple downstream target genes, thus potentially leading to
widespread transcriptional alterations [21–23]. Distal enhancers show cell-type-specific behavior to an
even higher degree than promoter regulatory elements [24,25], suggesting they may be more likely to drive
tumor development if altered in cancer cells. Additionally, current datasets and functions built into
TENETR, such as the consensus enhancer dataset, are primarily geared towards performing analyses on
distal enhancers.
The first step in the TENET process is to locate the regulatory elements of interest in each of the
cancer types. As discussed above, the TENETR method uses datasets of epigenetic marks associated with
the regions of interest, including H3K27ac and H3K4me1 for enhancers, and H3K4me3 for promoters,
which are overlapped with regions of open chromatin to validate these elements. As with the regulatory
elements they represent, these epigenomic marks are known to be cell type-specific. Thus, for each of
the 12 cancer types I analyzed, we identified datasets of H3K27ac, H3K4me1, H3K4me3, and open
chromatin regions that would be representative of those cancers, i.e. derived from sources such as
human primary cells, relevant cell lines, and tissue samples. These were combined with the consensus
enhancer and open chromatin datasets we created from a variety of cellular and tissue sources, as well
as the ENCODE SCREEN datasets of distal enhancer and promoter sequences. Although the inclusion of
the latter datasets may seem contradictory, as they are definitively not cell type-specific, they were
included to encompass regions that are active in at least one cellular context and could become
activated in tumors. I found there was a large difference in the number of cancer type-specific datasets
that were available to download and which we processed for each cancer type. Some datasets, such as
BRCA, COAD, and PRAD, which are well studied, commonly diagnosed, and easily biopsied cancers, had
over 100 total epigenomic datasets included just in the analyses of distal enhancers, while others such
as ESCA, HNSC, and BLCA, had as few as 2 identified datasets in total (Figure 3-10). This differential in
the number of datasets identified in each cancer type is of potential benefit to this study as we can
examine how differences in dataset availability could affect the overall TENET results.
3.3.2 Classification and count of distal enhancer probes:
After acquiring, filtering, and processing these datasets as described earlier, the cancer-type-
specific datasets were run through the step1_make_external_datasets() function, using the default
settings described for that function, to identify the DNA methylation probes that lie within the regions of
epigenomic marks. This data was then combined with the methylation and expression .rda files
compiled for each cancer type, and run through the step2_get_diffmeth_regions() function. From this, I
was able to set methylation cutoff values using the new cutoff value algorithm, identify the DNA
methylation probes which might mark distal enhancer elements, and classify those probes based on
their methylation levels in the adjacent normal (as control) and tumor (as experimental) samples.
Figure 3-10: Count of cancer type-specific external epigenomic datasets used in study
Barplot displays the number of cancer-type specific “external” epigenomic datasets included for
each cancer type. Datasets such as BRCA, COAD, and PRAD had more available cancer-type specific
datasets than other cancers.
For each cancer type, I identified over 85,000 probes marking potential distal enhancer
elements. Of these probes, the majority were found within the regions of the datasets included in each
of the twelve cancer type analyses, comprised of the consensus enhancer and open chromatin datasets
as well as the regions from the ENCODE SCREEN distal enhancer dataset (Figure 3-11 - green bar). As
expected, datasets such as BRCA, COAD, and PRAD, which had the largest number of cancer-type-
specific datasets added to the analysis, had the largest number of probes which were called by these
external datasets added to the analysis (Figure 3-11 - light blue bar). Overall, very few probes were
identified that were unique to the individual cancer types (Figure 3-11 - maroon bar). At face value, this
might appear to contradict the notion that the distal enhancer elements marked by these probes are cell
type specific. However, the presence of these epigenomic marks only indicates these regions are active
enhancers; it doesn’t necessarily predict how active they are within a given tumor [43,44].
Using these probes, the TENETR method set methylation cutoffs for the 12 cancer datasets as
seen in Supplemental Data 3-3. As I had seen from my testing, a bimodal distribution was observed in all
12 cancer types, and the algorithm was able to set cutoffs without failing. Across the board, these
cutoffs were fairly consistent, with a few minor deviations. Notably, the ESCA dataset exhibited
methylation cutoffs which were generally set lower than those of other cancer types. The LUAD dataset
displayed a “pinched in” set of cutoffs, with somewhat increased unmeth and hypermeth cutoffs, but
noticeably decreased meth and hypometh cutoffs. The LUSC dataset also displayed a relatively low
hypometh cutoff.
Figure 3-11: Number of distal enhancer DNA methylation probes identified
Barplot displays the number of distal enhancer probes identified per cancer type. While all cancer
types shared the same set of enhancer probes identified by the consensus epigenomic datasets
included for each, each cancer type also included a variable number of enhancer probes found using
the cancer-type specific external datasets, including a relatively small number of enhancer probes
which were unique to each cancer type.
Neither these cutoff settings, nor the number of enhancer probes identified, seemed to have a
major impact on the number of hypermethylated probes, representing enhancer regions deactivated in
a subset of the tumors, or hypomethylated probes, representing enhancer regions activated in a subset
of the tumors, identified in each of the cancer types. As seen in the counts of probes identified in Figure
3-12, the datasets with the most hypomethylated probes identified are the BRCA and COAD datasets,
which did have among the most total enhancer probes. However, these datasets do not have a
particularly large number of hypermethylated probes identified, and PRAD, the other cancer dataset
with many external datasets, had the fewest hypomethylated probes identified. Similarly, despite having
somewhat unusually set cutoffs, LUAD and LUSC do not have an abnormally large number of
hypomethylated probes identified. Overall, more hypomethylated probes were identified than
hypermethylated probes in each cancer type (Figure 3-12). This is not surprising however, as amongst
the identified distal enhancer probes, there is a large number (greater density) that show higher levels
of methylation, which could become hypomethylated, than there are with low levels of methylation that
could become methylated (Figure 3-5) (Supplemental Data 3-3 and 3-4).
Finally, I examined the similarity of the epigenomic states of the different cancer types by
analyzing similarity in the identified hypermethylated and hypomethylated probes. To do this analysis, I
used only the probes that were identified by the consensus datasets used in all 12 cancer type analyses.
I wanted to reduce bias based on the probes that could be identified as enhancer probes between the
datasets. For example, some of the epigenomic datasets we collected were derived from non-specific
organ tissue and were included for all cancer types that arose from that organ, such as LUAD and LUSC
cancers. As a result, these two cancers could show a higher degree of probe overlap simply based on the
fact they shared more external datasets in common used to identify the enhancer probes than other
cancer types.
Figure 3-12: Number of hypermethylated and hypomethylated distal enhancer probes identified
Barplot displays the number of hypermethylated and hypomethylated distal enhancer probes
identified in each cancer type. These probes are the ones of particular interest in TENETR analyses,
as they represent enhancer elements deactivated or activated in tumors respectively.
Interestingly, I found that there was still considerable overlap between the cancer types for the
hypermethylated and hypomethylated probes identified. The overlap proportion varied from 0.437 to
0.858 with an average of 0.629 for the hypermethylated probes, and from 0.384 to 0.829 for the
hypomethylated probes, with an average of 0.550. The degree of overlap was slightly higher for the
hypermethylated probes than the hypomethylated ones, which could be because these enhancer
regions are selectively deactivated to promote the growth of tumors, potentially controlling specific
tumor suppressor genes, while hypomethylated probes could represent an activation of any number of
regions. Another interesting finding to be gleaned from this is that cancers derived from the same organ,
such as KIRC and KIRP, and LUAD and LUSC, show the highest degree of similarity in their identified
hypermethylated and hypomethylated probes, even when we consider just the probes from consensus
datasets available to all cancer types (Figure 3-13). This could speak to possible underlying epigenetic
similarities between these cancer types, even though they are of different cancer subtypes. Another
possibility, however, given these samples were taken from resected tumor samples, is that some of the
samples may be contaminated with normal cell types native to the organs the tumors developed in,
resulting in some similarity between the tumor samples which arose in the same organs.
3.3.3 Identification of key TFs linked to hypomethylated distal enhancer probes:
After identifying and classifying the DNA methylation probes found to mark distal enhancers in
the twelve cancer types, the step3_get_analysis_z_scores(), step4_permutate_z_scores(),
step5_optimize_links(), and step6_probe_per_gene_tabulation() functions were run in succession to
identify the top TFs in each cancer type that were linked to the most hypomethylated distal enhancer
probes. I decided to focus on the hypomethylated enhancer probes as there were many more of these
probes identified, which increases the number of potential targets to analyze in downstream analyses.
Figure 3-13: Similarity of hypermethylated and hypomethylated probes between cancer types
Mirrored heatmaps of hypermethylated and hypomethylated similarity proportion between the 12
cancer types. To eliminate bias due to differences in the external cancer type epigenomic datasets
used to identify enhancer probes, only hypermethylated and hypomethylated probes identified by
the consensus epigenomic datasets were used in the similarity analyses.
The top ten TFs by number of linked enhancer probes are presented in Figure 3-14. The first
thing to note is that the number of probes linked to the top 10 TFs doesn’t seem to be closely
related to the number of hypomethylated enhancer probes identified in that cancer type, as HNSC has
the most linked hypomethylated probes across the top 10 TFs in that cancer type while having a
relatively modest number of total hypomethylated probes. Conversely, COAD had one of the highest
hypomethylated probe totals, yet had the fewest linked hypomethylated probes across the top 10 TFs in
that cancer type, possessing a much flatter distribution of linked hypomethylated probe counts
(Supplemental Data 3-5).
Figure 3-14: Top transcription factors linked to hypomethylated probes in individual cancer types
Barplots display the top 10 TFs in each cancer type and the number of hypomethylated probes each
TF was linked to in that dataset. Results shown were from analyses where the TF_only argument in
the step2_get_diffmeth_regions() function was set to TRUE.
When it comes to the top TFs themselves, as expected, those from the LUAD dataset looked
very similar to the ones I had identified previously from our analysis of LUAD using the previous TENET
2.0 method [34]. These included MYBL2, CENPA, and FOXM1, as well as TCF24, SOX2, and ZNF695, which
were highlighted previously. This is not surprising, since the primary updates made to TENETR compared
to the TENET 2.0 method used in the previous LUAD study focused on adding new functionalities,
building the R package, and making TENETR easier to use and update, rather than broad changes that
would markedly affect the results. As seen later in the RBO analyses, the addition of new consensus
datasets and cancer-type-specific datasets allows us to identify new hypomethylated probes
linked to these factors but is unlikely to substantially affect their overall order.
However, the results obtained here showed more divergence from the original TENET results [35].
Overall, the results for BRCA looked similar, with GATA3, SPDEF, FOXA1, ESR1, MYBL2,
MYB, and ZNF695 all found among the top 10 TFs again. For KIRC, GLIS1 and RUNX1 were the only TFs
identified previously that were still in the top 10 for the current analysis, while for PRAD, the list of top
ten TFs is entirely dissimilar. There are three possible reasons for these differences. The first is that
between the original TENET and TENET2.0 (and TENETR) versions, I updated the database of TF genes,
resulting in some genes no longer being considered TFs in the newer versions of TENET. Genes such as
RCOR2, SAP30, TRIM15, SCAF1, and LASS4 could not be found in the TENETR analysis, as they are not
considered TFs in the TF database TENETR uses [70]. The second is the difference in gene expression
datasets and their quantification methods. In the original TENET method, gene expression data in the
TCGA datasets was calculated for around 20,000 genes and “level 3” normalized, which was not the
most accurate of processing methods [127]. Datasets used in TENET2.0 and in this study used FPKM-UQ
normalized GENCODE v22-annotated gene expression datasets consisting of over 60,000 genes. The
third explanation is similarly related to changes to the TCGA datasets: the addition of further tumor
and adjacent normal samples to some of the cancer datasets since the original TENET paper was
published. For instance, the TENETR analyses utilized 776 tumor and 83 adjacent normal BRCA samples
and 495 tumor and 35 adjacent normal PRAD samples, compared to 641 tumor and 66 adjacent normal
BRCA samples and 333 tumor and 19 normal PRAD samples in the original TENET analysis. Only the KIRC
dataset remained almost unchanged in number [35]. The addition of new tumor samples is especially
important, as given the settings from the original TENET paper and the TENETR analyses here, only 5
tumor samples are needed to show hypomethylation of a given DNA methylation probe for it to be
considered as such, so the addition of new tumor samples could lead to the identification of additional
probes, in turn linked to new TFs, thus diminishing the apparent importance of the previously identified
TFs.
A more comprehensive analysis and prioritization of activated TFs and their linked enhancers for
further analysis is performed in chapter 4, but a few TFs bear initial note. First, as seen previously in
LUAD and BRCA in the TENET 2.0 analysis, the combination of CENPA, MYBL2, and FOXM1 genes was
also observed among the top 10 TFs in KIRP. In LUAD, I had found that these TFs were overexpressed in
the tumor samples and were associated with poor patient survival [34]. Other recent studies have
examined the roles of these genes as epigenetic regulators [128], and their role in several forms of
cancer is being investigated [129–132]. Other notable TFs include FOXA1, which has previously been
implicated in both BLCA and BRCA cancers, though the effects of its overexpression in these tumors are
contested [133–136]; RUNX1 in KIRC and KIRP [137,138], as well as in THCA, where there is little previous
research on its role in that cancer type; and several others which appear in multiple cancer types.
3.3.4 RBO analysis of TENETR settings and comparisons:
Lastly, I conducted a series of analyses to assess how alterations to TENETR settings could affect
the output ranked lists of TFs across the 12 cancer types. The RBO returns a value which allows one to
assess how similar two ranked lists are to each other, such as the ranked lists of TFs based on the
number of linked hypomethylated probes output by the TENETR method. Unfortunately, RBO does not
allow one to assess if two lists are “significantly” different from each other; it only provides a value from
0 to 1 that reports how relatively different two lists are. This, in addition to the somewhat arbitrary
nature of setting a priori cutoff values for TENETR, and the lack of ground truth datasets on which to
perform TENETR, meant I had to compare different alterations to the “default” settings used for my
TENETR runs in the context of each other.
The first analysis I performed, and the original reason I sought to do these RBO comparisons,
was to compare the results from TF-only TENETR runs, which analyzed links to only the annotated 1,639
TF genes vs. all-gene runs which analyzed links to all 60,483 genes in the TCGA datasets. TF-only runs
take a small fraction of the time to run that all-gene TENETR analyses do (Figure 3-15) and combined
with our interest in TF genes, TF-only runs were preferred. However, as discussed previously, results of
TF-only vs. all-gene runs do differ somewhat, with fewer probes being linked to each TF in all-gene runs
compared to TF-only ones. Thus, we wanted to assess how comparable these two analyses were to each
other. Fortunately, it seems the results of the TF-only runs are very similar to those from all-gene runs, with RBO
values for 9 of the 12 cancer types exceeding 0.93. The COAD, ESCA, and KIRC cancer types showed
much more dissimilarity, with RBO values of 0.762, 0.691, and 0.869 respectively (Figure 3-16A)
(Supplemental Data 3-6).
Figure 3-15: Comparison of TF-only vs. all genes TENETR run times
Bar plot of the total runtime of TENETR runs for each cancer type when only TFs are considered
compared to all genes. In the same computational environment, TF-only analyses were completed in
a fraction of the time of all gene analyses and were preferred for the analyses presented.
To add context to these values, I next compared the output lists of TFs linked to hypomethylated
probes from the different cancer types to each other. Although these analyses were performed using the same argument
parameters, I expected the important TFs discovered between different cancer types to be dissimilar,
owing to differences in their cell types of origin, and epigenetic states thereof. Thus, this analysis could
serve as a sort of “negative control”, in terms of what sorts of RBO values we would expect to see from
very dissimilar analyses. Indeed, confirming what we could see by examining lists of just the top 10 TFs
from each cancer type, RBO values across cancer types were very low, with a maximum RBO value of
0.306 for the comparison of results from BLCA and BRCA analyses, most likely owing to FOXA1 being
identified as the top TF in both cancers (Figure 3-16B).
Next, I compared the similarity of results between TENET runs which used just the consensus
and ENCODE dELS datasets in each cancer type, compared to the default ones which had also included
the cancer-type-specific epigenomic datasets. As I had observed, these datasets alone accounted for
most of the identified DNA methylation probes marking distal enhancer regions in each cancer type.
Thus, we felt that this comparison could serve as a sort of “positive control”, a comparison of similar but
not quite identical analyses to give a sense of how similar TENET results could be to each other. Indeed,
this returned very high RBO values for each cancer type, with all but COAD having comparative RBO
values higher than 0.9. COAD’s relatively low value of 0.790 is likely reflective of the large number of
cancer-type-specific datasets which were included for its analysis, including many H3K4me1 datasets.
This is further supported by the somewhat lower RBO values for the BRCA and PRAD datasets, which
also had larger numbers of cancer-type-specific datasets included and more enhancer probes called by
non-consensus datasets. KIRC also had a somewhat lower RBO value, though as discussed below, this is a
common observation for that dataset (Figure 3-16C). As the TF-only vs. all-gene RBO values across the
cancer types were much closer to those of the consensus vs. all epigenetic dataset comparisons than the
cross-cancer type comparisons, the use of TF-only analyses did not seem to severely
affect the overall results.
I also decided to compare how changing the min_experimental_count value can affect the
overall results. In general, reducing the min_experimental_count results in a greater number of probes
being linked to each TF as more hypermethylated and hypomethylated probes are identified because of
the relaxed requirements for their identification, while the converse is true when the
min_experimental_count is increased. I plotted the RBO values for each of the cancer types comparing
results from a given min_experimental_count value to those from our default value of 5. As expected,
the more the set min_experimental_count value deviated from the default value of 5, the more
dissimilar the results were, and the smaller the RBO values were. However, the level of this difference
was not consistent across the cancer types. Some, such as BRCA, HNSC, and THCA, had RBO values no
lower than 0.7 across the min_experimental_count values we examined, from 2 to 25. On the other
hand, cancer types like LUSC, KIRP, and KIRC were much more unstable, with values as low as 0.04,
equivalent to those of comparisons between cancer types (Figure 3-16D).
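The role of min_experimental_count can be illustrated with a deliberately simplified sketch. It is written in Python, covers only the tumor-sample count criterion discussed above, and omits the checks on adjacent normal samples and the other cutoffs that the step2_get_diffmeth_regions() function applies; the function name is_hypomethylated, the cutoff of 0.4, and the beta values are all hypothetical.

```python
def is_hypomethylated(tumor_betas, hypometh_cutoff, min_experimental_count=5):
    """Call a probe hypomethylated if enough tumor samples fall below the cutoff.

    Simplified assumption: only the tumor-sample count is checked here.
    """
    below = sum(1 for beta in tumor_betas if beta < hypometh_cutoff)
    return below >= min_experimental_count


# Hypothetical beta values for one probe across ten tumor samples.
betas = [0.90, 0.85, 0.20, 0.15, 0.10, 0.88, 0.30, 0.92, 0.25, 0.80]

# The same probe is called or not called depending on the count threshold.
calls = {count: is_hypomethylated(betas, 0.4, count) for count in (2, 5, 6)}
print(calls)
```

Lowering the threshold admits probes that are altered in only a handful of tumors, while raising it removes them, which is why the number of probes linked to each TF, and therefore the ranked TF list, shifts as the value moves away from 5.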
I also examined how adjusting the cutoff values affected the TENETR results. These included
setting the cutoff values to specific levels in each cancer type, similar to what had been done in previous
TENET analyses, and making specific adjustments to the algorithm’s set cutoffs for each cancer type.
Given the somewhat arbitrary nature in attempting to set global cutoffs to such large datasets, it was
reassuring to see that generally, outside of setting the unmeth_cutoff and meth_cutoff values to the
relatively extreme values of 0.1 and 0.9 respectively, RBO values for most cancer types were quite high
across the board, indicating adjustments to the cutoff values didn’t strongly affect the overall results.
The exceptions to this were the KIRC and PRAD datasets. PRAD showed relatively low RBO values when
the cutoff values were adjusted in such a way as to purposely reduce the number of probes, either by
reducing the unmeth_cutoff or hypometh_cutoff, or increasing the meth_cutoff and hypermeth_cutoff.
KIRC though showed highly disrupted results no matter what the adjustment to the cutoff values was,
though it was strongest when the cutoffs were adjusted in such a way as to increase the number of
probes found, the opposite of what was observed in the PRAD dataset (Figure 3-16E).
As noted before, KIRC seems like a relatively unstable dataset, showing relatively high degrees
of divergence in results compared to the other cancer datasets, no matter what the analysis. I first
analyzed the dataset by looking at the mean methylation values of the hypometh probes linked to the
top 10 TFs in the dataset, as well as the expression levels of those TFs in that dataset. The KIRC dataset
does not appear markedly different in those metrics than the other cancer datasets though. However,
KIRC did appear to have one of the flattest distributions of linked probes amongst its top TFs, with the
10th-ranked TF possessing 66.4% of the total number of hypomethylated probes relative to the top TF in
that dataset, a full 10% higher than any other cancer dataset we analyzed. KIRC also had the smallest
number of probes linked to the top TF among the cancer datasets. This suggests that the differences seen
as different parameters are adjusted are likely due to the fact that it is easier to observe the top TFs
changing ranks between analyses, as there is a smaller number of linked probes separating them in the KIRC
dataset compared to other cancer types. It remains unclear why KIRC has such a flat distribution of
hypomethylated probes linked to the top TFs, and why there are so few linked probes to KIRC’s top TFs,
even though the KIRC dataset didn’t have a particularly small number of identified hypomethylated
probes overall.
The last comparative analysis I performed was to assess how dataset sampling could affect the
TENETR results. Since TENETR is designed to identify alterations that occur even in a small subset of the
experimental samples analyzed, I wanted to assess how results are affected when different input
datasets of the same cancer type are provided. To accomplish this, I created 100 random subdivisions of
the BRCA tumor dataset, ran TENETR on each of the two halves using the default settings I had used
for other analyses, and then calculated an RBO value comparing the top TFs by number of linked
hypomethylated probes identified from each half of the dataset. I chose the BRCA dataset to perform
this analysis because it was the largest dataset and is composed of several well-characterized and well-
annotated subgroups. Overall, the distribution of RBO values ranged from 0.656 to 0.860. These
comparisons were generally more dissimilar than those of the TF-only vs. all-genes comparison, as well
as the comparison using consensus epigenomic datasets and all available ones. These RBO values were
much more similar to those for large alterations to the min_experimental_count value, or severe
changes to the cutoff values, indicating the importance of the actual input data alone in determining the
final output of the method (Figure 3-16F).
Finally, I checked to see if the output RBO values were correlated with BRCA subtype
distribution between the pairs of subsets. To do this, I performed a linear regression analysis, regressing
the RBO value on the total difference in annotated subtype counts [139] between the two datasets. I did
not notice a significant relationship between these two variables however, nor did I notice a significant
relationship between the RBO values and the difference in any individual subtype count.
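The regression described above can be outlined as follows. This Python sketch is an assumption-laden illustration, not the analysis code: the helper names are hypothetical, the "total difference" is taken here to be the sum of absolute per-subtype count differences between the two halves, and the subtype labels and RBO numbers are made up purely to exercise the functions. They are not results from this study, and no significance test is included.

```python
from collections import Counter


def total_subtype_difference(subtypes_a, subtypes_b):
    """Sum of absolute differences in per-subtype sample counts between two groups."""
    counts_a, counts_b = Counter(subtypes_a), Counter(subtypes_b)
    all_subtypes = set(counts_a) | set(counts_b)
    return sum(abs(counts_a[s] - counts_b[s]) for s in all_subtypes)


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up subtype labels for two small groups (illustration only).
difference = total_subtype_difference(
    ["LumA", "LumA", "Basal"], ["LumA", "Basal", "Basal"]
)

# Made-up (difference, RBO) pairs, NOT values from the thesis.
toy_differences = [0, 2, 4, 6]
toy_rbo = [0.80, 0.78, 0.76, 0.74]
slope, intercept = simple_linear_regression(toy_differences, toy_rbo)
print(difference, slope, intercept)
```

A slope close to zero, together with a non-significant test on that slope, would correspond to the absence of a relationship reported here.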
Figure 3-16: RBO values of TENETR comparisons
(A) Heatmap of RBO values comparing final ranked lists of transcription factors based on the number
of linked hypomethylated probes to each transcription factor between the TF-only vs. All gene
analyses for each of the 12 cancer types. (B) Mirrored heatmap of RBO values comparing final ranked
lists of transcription factors based on the number of linked hypomethylated probes to each
transcription factor between the 12 cancer types to each other. (C) Heatmap of RBO values
comparing final ranked lists of transcription factors based on the number of linked hypomethylated
probes to each transcription factor between analyses using only the enhancer probes called by the
consensus datasets and the default analyses performed using the consensus and cancer-type-
specific datasets. (D) Heatmap of RBO values comparing final ranked lists of transcription factors
based on the number of linked hypomethylated probes for different min_experimental_count values,
compared to those of the default value of 5. (E) Heatmap of RBO values comparing final ranked lists
of transcription factors based on the number of linked hypomethylated probes for different cutoff
values, compared to those set using the new TENETR algorithm. Changes to the cutoff values are
listed in the table below the plot and include both consistent cutoff values applied to every dataset
(columns 1-4) as well as specified alterations to the algorithm values chosen for each cancer type
dataset (columns 5-12, noted with “Alg”). (F) Histogram of the compiled RBO values comparing 100
TENETR runs in which BRCA samples were randomized into two groups which had TENETR analyses
performed on them separately and compared.
3.4 Conclusions
Like other TENET applications, TENETR is designed to identify dysregulated regulatory elements
in a subset of experimental, usually tumor, samples, and associates their activity with the dysregulated
expression of TFs which might bind to and facilitate the activity of those regulatory elements. TENETR
uses datasets with a combination of regions with relevant histone marks as well as open chromatin, to
identify regulatory elements of interest (such as distal enhancers) and identifies DNA methylation
probes that fall within the identified regions. TENETR then uses the methylation level of these probes as
a surrogate for the activity of the regulatory elements in which they are contained and combines that
with gene expression data from a combination of control and experimental samples to identify “links”
between probes and TF genes. These links represent differences in methylation associated with changes in expression
of TF genes in a subset of the experimental samples. Finally, the TFs with the most linked enhancer
probes of a given direction of activity are identified for downstream analyses.
In summary, TENETR implements the TENET method as an R package. It builds upon previous
TENET applications, and packaging it for R increases its availability and ease of use for
potential users. As an R package, it also has a more intuitive framework for future additions and
improvements. We also added new functionality to TENETR, including the creation of consensus
enhancer and open chromatin datasets and the inclusion of annotated enhancer and promoter-like
regions from the ENCODE SCREEN project, compatibility with methylation data from the EPIC+, EPIC,
and HM27 methylation arrays, a new algorithm to automatically set methylation cutoffs, a new ability to
assess promoters instead of just distal enhancer regulatory elements as well as an option to analyze only
TF genes to vastly reduce computational time.
Using this updated TENETR method, I performed the largest TENET analysis yet, on 12 different
cancer types. Four of these, BRCA, KIRC, LUAD, and PRAD, had been analyzed in previous TENET studies,
but the remaining eight, BLCA, COAD, ESCA, HNSC, KIRP, LIHC, LUSC, and THCA, had not been previously
analyzed. For each of these datasets, we compiled numerous cancer-type-specific datasets of histone
marks and open chromatin regions. Using these, along with the built-in consensus datasets, I identified
tens of thousands of DNA methylation probes potentially marking the activity of distal enhancer
elements in each cancer type, a large portion of which were found to be hypomethylated in a subset of
the tumor samples, representing activation of those enhancer elements. I used these hypomethylated
probes to identify important upregulated TFs in each of the cancer types based on the number of
hypomethylated probes linked to them.
I also conducted analyses to compare how different alterations to TENETR parameters affect the
overall results. Using the RBO method, I determined that the new analysis of only TF genes did not
markedly affect the results generated compared to using all genes in the analysis, though it did indeed
greatly reduce computational times. I also found that small adjustments to the methylation cutoff values as
well as the min_experimental_count value tended not to have large effects on the resulting top TFs
identified, though this was dependent on the individual cancer dataset. Finally, I assessed how input
samples can affect results by comparing how results from randomized subsets of the BRCA dataset
compared to each other, thus simulating how different input samples can affect the results. I found that
these had a sizeable effect on the overall results. Taken together, my analyses suggest that the
individual arguments of TENETR, especially the methylation cutoffs and min_experimental_count
values (for which it can be difficult to determine the optimal value) may not have as large an effect on
the output results as the methylation and expression datasets that are supplied by the user.
This work solidifies TENETR as a powerful platform to bioinformatically identify key TFs and
regulatory elements in experimental datasets. With updated features, TENETR has expanded application
to both promoter and distal enhancer elements and can be used even when the user lacks epigenomic
datasets of their own. In the next chapter, I will cover various procedures for performing further
downstream analyses.
3.5 Supplemental files
Supplemental Figure 3-1: Mean methylation density curves of distal enhancer probes in all 12
cancer types
Mean methylation density curves of each identified enhancer probe in the adjacent normal and
tumor samples across the 12 cancer datasets are plotted, with the methylation β-value on the x-axis,
and the relative density values on the y-axis. Red lines represent mean methylation density in the
tumor samples, and blue lines represent the mean methylation density in the adjacent normal
samples for that cancer type.
Supplemental Figure 3-2: Representative promoter probe mean methylation density curve found
with overlapping histone modification and open chromatin regions
The mean methylation density curve of identified promoter probes found within overlapping histone
modification and open chromatin regions (or ENCODE datasets) in the adjacent normal and tumor
samples for a single cancer type is shown (BRCA). Again, red line is representative of tumor samples,
and the blue line represents adjacent normal samples. Compared to curves for enhancer probes,
those of promoter probes tend to have a single, highly dense local maximum at a very low average
methylation β-value, with a variable number of minuscule local maxima at higher values. This curve is
representative of the typical pattern seen across the 12 cancer types and shows why the new cutoff
setting algorithm, which relies on having two distinct local maxima peaks, doesn’t work well when
used to set cutoffs for promoter probes.
Supplemental Figure 3-3: Representative promoter probe mean methylation density curve found
utilizing all promoter probes
The mean methylation density curve of identified promoter probes found in the adjacent normal and
tumor samples for a single cancer type is shown (BRCA). Again, red line is representative of tumor
samples, and the blue line represents adjacent normal samples. Compared to the mean methylation
density curve of only the promoter probes found within overlapping histone modification and open
chromatin regions (or ENCODE datasets) in Supplemental Figure 3-2, the density curve using all
promoter probes shows a bimodal distribution, similar to curves of enhancer probes, ensuring that
the new cutoff algorithm will work to set cutoffs for promoter probe analyses.
Supplemental Data 3.1: List of TCGA sample barcodes used in TENETR study (.xlsx file)
Includes the TCGA barcodes for each of the tumor and adjacent normal samples from each of the 12
cancer types. Hosted at: https://github.com/DanielJMullen/Daniel_Mullen_Thesis
Supplemental Data 3.2: HPCC scripts to submit TENETR runs (.sh, .slurm, and .R files)
Text files used to submit TENETR runs on the USC HPCC for all 12 cancer types simultaneously. (A) A
bash script which executes the slurm file to start jobs for each of the 12 cancer types. (B) A .slurm file
used to start an HPCC job, which executes the .R script to perform a TENET run for a given cancer type.
(C) An example .R script with step 1-6 TENETR functions (used to perform the default analyses here for
LUAD). Hosted at: https://github.com/DanielJMullen/Daniel_Mullen_Thesis
Supplemental Figure 3-4: Pictorial schematic of RBO analysis
The figure shows how the returned RBO value reflects different patterns in the compared lists, with
higher RBO values being returned when fewer entries, and lower-ranked entries, are out of order in list B
compared to list A. p represents the parameter used when assigning weight to the higher ranked
items, and k represents the depth of search in an RBO calculation.
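As a rough illustration of how p and k enter the calculation, the truncated form of rank-biased overlap can be sketched in Python as follows. This is only a minimal sketch of the standard truncated RBO sum (with no extrapolation beyond depth k), not the code used for the analyses in this chapter; the function name rbo_truncated and the example lists are hypothetical.

```python
def rbo_truncated(list_a, list_b, p=0.9, k=None):
    """Truncated rank-biased overlap of two ranked lists (no extrapolation).

    p controls how strongly the top ranks are weighted (smaller p = more
    top-heavy); k is the depth to which the two lists are compared.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    max_depth = min(len(list_a), len(list_b))
    k = max_depth if k is None else min(k, max_depth)
    seen_a, seen_b = set(), set()
    total = 0.0
    for depth in range(1, k + 1):
        seen_a.add(list_a[depth - 1])
        seen_b.add(list_b[depth - 1])
        # Fraction of items shared by the two top-`depth` prefixes.
        agreement = len(seen_a & seen_b) / depth
        total += p ** (depth - 1) * agreement
    return (1 - p) * total


# Hypothetical ranked lists: the same items with one swap at the top or bottom.
list_a = ["TF_1", "TF_2", "TF_3", "TF_4"]
swap_top = ["TF_2", "TF_1", "TF_3", "TF_4"]
swap_bottom = ["TF_1", "TF_2", "TF_4", "TF_3"]
print(rbo_truncated(list_a, swap_top), rbo_truncated(list_a, swap_bottom))
```

In this sketch a swap among the lowest-ranked entries yields a higher value than the same swap among the highest-ranked entries, which is the behavior the schematic depicts. Note that the truncated sum does not reach 1 even for identical lists of finite length.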
Supplemental Data 3.3: Tables of methylation cutoff values and methylation probe classification
counts across 12 cancer types (.xlsx file)
(A) Table with the Unmeth, Meth, Hypermeth, and Hypometh cutoffs for each of the 12 cancer types in
this study. (B) Table with the counts of identified Unmeth, Meth, Hypermeth, Hypometh, and all
enhancer probes for each of the 12 cancer types. Hosted at:
https://github.com/DanielJMullen/Daniel_Mullen_Thesis
Supplemental Data 3.4: Tables of identified methylation probes across 12 cancer types (.xlsx file)
(A-L) Tables with information about each of the identified enhancer probes for each of the 12 cancer
types, respectively. Hosted at: https://github.com/DanielJMullen/Daniel_Mullen_Thesis
Supplemental Data 3.5: Table of top TFs and number of linked hypomethylated probes across 12
cancer types (.xlsx file)
Table includes the TFs and number of linked hypomethylated DNA methylation probes identified in the
12 cancer types, ranked by decreasing numbers of linked probes. Hosted at:
https://github.com/DanielJMullen/Daniel_Mullen_Thesis
Supplemental Data 3.6: Tables of RBO values (.xlsx file)
Tables list RBO values from various comparisons performed in this study. (A) Table of RBO values from
pairwise comparison of ranks of identified TFs linked to hypomethylated probes across the 12 cancer
types analyzed using default settings. (B) Table of RBO values comparing ranks of identified TFs linked to
hypomethylated probes from default analyses utilizing all epigenomic datasets compared to those
utilizing just the consensus datasets built into TENETR across the 12 cancer types analyzed. (C) Table of
RBO values comparing ranks of identified TFs linked to hypomethylated probes from analyses with
adjustments made to the min_experimental_count argument of the step2_get_diffmeth_regions()
function to those with the default setting of 5 across the 12 cancer types analyzed. (D) Table of RBO
values comparing ranks of identified TFs linked to hypomethylated probes from analyses with
adjustments made to the four cutoff arguments of the step2_get_diffmeth_regions() function to those
with the default settings across the 12 cancer types analyzed. (E) Table of RBO values comparing ranks of identified
TFs linked to hypomethylated probes between the 100 randomly generated subsets of the BRCA
dataset. Hosted at: https://github.com/DanielJMullen/Daniel_Mullen_Thesis
Chapter 4: Downstream analysis of TENETR results and identification of
individual findings of interest for further validation
This chapter provides an illustration of the kind of downstream bioinformatic analyses which can
be performed on TENETR-derived data, using the results I generated in the previous chapter from
the analysis of 12 TCGA cancer datasets. Here, I give an overview of the family of step7 functions
provided in the TENETR package as “default” analysis tools which users can take advantage of to start
interrogating their own data. Using variations of these functions as well as my own entirely custom
analyses, I analyze the results from the 12 TCGA cancer datasets to provide an example of the sort of
work that can be done, with a focus on identifying specific TF to probe links for additional validation.
As before, some of the work within this chapter is being prepared to be submitted as an article
with the proposed title “Creation and application of the TENETR package to identify dysregulated
transcription factors across a pan-cancer landscape”. The authors for this paper will be Daniel J Mullen,
Zexun Wu, Ethan Nelson-Moore, Lauren Han, Ria Pai, Huan Cao, Ite A. Offringa, and Suhn K Rhie. DJM is
responsible for bioinformatic analyses, package creation, dataset creation, figure creation, key idea
contributions, experimental design, and writing. ZW helped with bioinformatic analyses, dataset
creation, and experimental design. ENM helped with package creation. LH, RP, and HC helped with
dataset creation and package testing. IAO assisted with key idea contributions and editing. SKR helped
with package creation, key idea contributions, experimental design, writing, and editing.
4.1 Abstract
TENETR is a bioinformatic method designed to identify differentially-activated regulatory
elements in the human genome and link the activity of these regions to similarly differentially-expressed
TFs in case vs. control samples by combining datasets of epigenomic marks along with gene expression
and DNA methylation data. This is done with a particular focus on identifying such alterations that might
occur in only a subset of the case samples. The first six functions in the new TENETR package are
devoted to identifying such alterations, and ultimately create a list of the top TFs based on the number
of identified regulatory elements found to be linked to them. These represent the TFs that are likely to
have the most widespread impact across the genomes in the analyzed samples based on the number of
regulatory elements whose activity is tied to the expression of those transcription factors. Without
further context however, these lists of top TFs do not address all the research aims we hope can be
answered using TENETR, including the identification of sample subgroups based on differential
activation of the TF to regulatory element links and prioritization of these links for further analyses with
in vitro studies.
Here, I highlight the downstream bioinformatic analyses that can be performed using both the
step7 functions built into the TENETR package as well as custom analyses to address these aims. I
identified 108 key TFs, including 8 of particular interest across the 12 cancer types analyzed with
TENETR, as well as key linked distal enhancers to these TFs in each cancer type. These were used to
identify potential subgroups across the 12 cancer types and to build a model to predict survival using
sparse regression. I also cover in vitro analyses performed to analyze the key TFs FOXM1 and MYBL2
identified in LUAD by TENET2.0.
4.2 Materials and methods:
4.2.1 TENETR step7 functions:
TENETR has a total of 16 step7 functions which are included to assist users to better
characterize either the top TFs and the DNA methylation probes marking regulatory elements
linked to them, or the samples analyzed based upon their expression or methylation of the top
TFs and DNA methylation probes.
As these functions are meant to perform default analyses for the user without a priori
knowledge of the data the user is analyzing or the type of information they might be interested
in, the step7 functions all include the following same set of arguments unless otherwise
specified, with a focus on using the same set(s) of top genes or TFs identified as the basis for the
analyses.
The first of these arguments is the TENET_directory argument, to which the user should
supply a path to the directory containing the results for the step1 through step6 functions, as
different step7 functions require output from several of the previous functions to operate.
The next set of arguments, including hypermeth_Gplus_analysis,
hypermeth_Gminus_analysis, hypometh_Gplus_analysis, and hypometh_Gminus_analysis, must
all be set to either TRUE or FALSE depending on which types of gene/TF-to-probe links the user
wants to assess. As a reminder, hypermeth_Gplus links and hypometh_Gplus links represent
those with genes/TFs that are under- or overexpressed in the case samples, respectively, and
linked to regulatory elements which are putatively inactivated or activated, respectively, and are
usually of primary interest as both elements have concordant directions of effect. The
hypermeth_Gminus and hypometh_Gminus links however, are representative of those with
genes/TFs that are overexpressed or underexpressed in the case samples while being linked to
regulatory elements which are putatively inactivated or activated, respectively (Figure 3-7C).
The next argument top_gene_number controls the number of the top ranked genes/TFs
of the types selected by the previous 4 arguments which are included for analyses in these
functions. By default this is set to 10, meaning the top 10 genes/TFs by number of linked
regulatory element probes of the given type are assessed.
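The ranking that top_gene_number acts on can be illustrated with a short Python sketch. TENETR itself is written in R; the function name, the alphabetical tie-breaking rule, and the example links below are assumptions made only for this illustration.

```python
from collections import Counter


def top_tfs_by_linked_probes(links, top_gene_number=10):
    """Rank TFs by their number of distinct linked probes and keep the top N.

    `links` is an iterable of (tf, probe_id) pairs for one link type.
    Ties are broken alphabetically here, which is an assumption of this sketch.
    """
    counts = Counter(tf for tf, _probe in set(links))  # ignore duplicated links
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_gene_number]


# Hypothetical links of a single type (e.g. hypometh_Gplus).
example_links = [
    ("TF_A", "cg0001"), ("TF_A", "cg0002"), ("TF_B", "cg0001"),
    ("TF_C", "cg0003"), ("TF_C", "cg0004"), ("TF_C", "cg0005"),
]
print(top_tfs_by_linked_probes(example_links, top_gene_number=2))
```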
Finally, the core_count argument specifies the number of cores each function should use to
increase their computational speed. This argument can be set to an integer value equal to the number
of cores the user wishes to utilize but cannot exceed the total number of cores available on their
system. This argument is set to 1 by default.
All step7 functions will create a “step7” subdirectory in the specified TENET_directory if it
doesn’t already exist, and will create further subdirectories within this step7 subdirectory specific to
each function to contain the results of said function.
4.2.1.1 step7_linked_probe_motif_searching():
This function is designed to identify if the regulatory element probes linked to a given TF
have a motif for that TF in their vicinity. This is potentially useful to prioritize specific probes
linked to that TF for further study, as the presence of a motif for that TF provides additional
evidence the regulatory element containing the DNA methylation probe is directly regulated by
the TF.
Unlike other functions, this function does not have a top_gene_number argument.
Instead, the TF_gene argument allows the user to specify a single gene, either by its name or
Ensembl ID, for which to perform motif searching. Thus, this function will need to be run
multiple times for each gene of interest.
Like previous functions, the DNA_methylation_manifest argument specifies the
methylation array used to generate the user’s methylation data. The HM27, HM450, EPIC, and
EPIC+ arrays are supported by setting this argument to "HM27", "HM450", “EPIC”, or
“EPIC_plus”, respectively. This argument defaults to “HM450” if not specified.
The motif_PCM_PWM argument requires the user specify a position weight matrix
(PWM) for the TF of interest [140], given as a matrix with 4 rows, for the A, C, G, and T bases, and
the columns representing the likelihood of observing each base at each position of the motif.
The values across the 4 rows should add up to 1 for each column, and there can theoretically be
an arbitrarily large number of columns depending on the length of the identified motif, which
will be discussed subsequently. Example PWMs for many TFs can be found in the databases
compiled in the MotifDb package
(https://bioconductor.org/packages/release/bioc/html/MotifDb.html).
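The layout described for the PWM (4 rows for A, C, G, and T, with each column summing to 1) can be checked with a small Python sketch like the one below. The function name check_pwm, the dictionary representation, and the example motif are hypothetical and are not part of TENETR, which is an R package.

```python
def check_pwm(pwm, tolerance=1e-6):
    """Validate a PWM stored as {'A': [...], 'C': [...], 'G': [...], 'T': [...]}.

    Each list holds one row of the matrix; position i of every list together
    forms column i. Returns the motif length if the matrix is well formed.
    """
    bases = ("A", "C", "G", "T")
    if set(pwm) != set(bases):
        raise ValueError("PWM must have exactly the rows A, C, G and T")
    length = len(pwm["A"])
    if length == 0 or any(len(pwm[base]) != length for base in bases):
        raise ValueError("all four rows must have the same non-zero length")
    for position in range(length):
        column_sum = sum(pwm[base][position] for base in bases)
        if abs(column_sum - 1.0) > tolerance:
            raise ValueError(f"column {position + 1} sums to {column_sum}, not 1")
    return length


# Hypothetical 3-bp motif that strongly prefers the sequence ACG.
example_pwm = {
    "A": [0.90, 0.05, 0.05],
    "C": [0.05, 0.90, 0.05],
    "G": [0.05, 0.05, 0.90],
    "T": [0.00, 0.00, 0.00],
}
print(check_pwm(example_pwm))
```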
The last two arguments, distance_from_probes, and matchPWM_min_score, control
how the motif searching is performed. distance_from_probes should be set to an integer value
greater than or equal to 0, and specifies a distance in base pairs, which is added to both
upstream and downstream of each DNA methylation probe linked to the given TF to perform
motif searching for that TF. Increasing the size of the distance_from_probes value will increase
the likelihood and number of motif instances identified in the vicinity of each probe. Thus, this
value ideally should reflect the size of the regulatory elements linked to the given TF. By default
it is set to 100.
The matchPWM_min_score should be given as a number from 0 to 100 and reflects the
minimum similarity threshold a given sequence needs to meet in matching the specified PWM.
Higher values result in fewer total motif instances being identified in the vicinity of the probes
linked to the given TF. It is worth noting that the likelihood of a given motif being found in the
vicinity of a given probe is also dependent on the length of the motif, with longer TF motifs
reducing the likelihood of that motif being found in the vicinities of the DNA methylation probes
linked to that TF, even for the same matchPWM_min_score value. This defaults to 75 (75%).
Motif matching is performed using the Biostrings package’s matchPWM() function
(https://bioconductor.org/packages/release/bioc/html/Biostrings.html). The matchPWM() function
returns the sequences of motifs matching a given PWM with a given minimum score in a
specified area. It can be difficult to select an optimal matchPWM_min_score for a
given motif of a given TF, so users are encouraged to inspect several values to assess how
frequently motifs are found in the vicinity of the probes linked to that TF.
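To make the roles of distance_from_probes and matchPWM_min_score more concrete, the following Python sketch scans a window around a probe position and keeps every subsequence whose summed PWM score reaches a given percentage of the highest possible score. It is a simplified stand-in and not the Biostrings implementation: the function name scan_window, the 0-based probe_index, the forward-strand-only search, and the example sequence are all assumptions of this sketch.

```python
def scan_window(sequence, probe_index, pwm, distance_from_probes=100,
                min_score_percent=75):
    """Return (start, matched_sequence, score) for motif hits near a probe.

    The window spans `distance_from_probes` bases on either side of the probe.
    A hit must score at least `min_score_percent` percent of the best score
    attainable for this PWM (the sum of the column maxima).
    """
    bases = "ACGT"
    motif_length = len(pwm["A"])
    max_score = sum(max(pwm[b][i] for b in bases) for i in range(motif_length))
    threshold = max_score * min_score_percent / 100.0
    window_start = max(0, probe_index - distance_from_probes)
    window_end = min(len(sequence), probe_index + distance_from_probes + 1)
    hits = []
    for start in range(window_start, window_end - motif_length + 1):
        candidate = sequence[start:start + motif_length]
        if any(base not in bases for base in candidate):
            continue  # skip ambiguous bases such as N
        score = sum(pwm[base][i] for i, base in enumerate(candidate))
        if score >= threshold:
            hits.append((start, candidate, score))
    return hits


# Hypothetical PWM preferring ACG, and a short made-up sequence.
example_pwm = {
    "A": [0.90, 0.05, 0.05],
    "C": [0.05, 0.90, 0.05],
    "G": [0.05, 0.05, 0.90],
    "T": [0.00, 0.00, 0.00],
}
print(scan_window("TTTACGTTT", probe_index=4, pwm=example_pwm,
                  distance_from_probes=3))
```

In this sketch a window that is too narrow to contain the full motif returns no hits, so widening the window can only add hits, consistent with the behavior described above for larger distance_from_probes values.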
This function outputs three objects per gene it is run with. The first is a “_seqLogo.pdf”
file which contains a visualization of the PWM provided by the user to the motif_PCM_PWM
argument, provided by the seqLogo() function
(https://bioconductor.org/packages/release/bioc/html/seqLogo.html). The
“_probe_motif_occurences_table.tsv” file lists the occurrence of every single motif found at the
specified matchPWM_min_score, including the probe ID it was found near, the genomic
coordinates of the motif, the DNA sequence of the motif, and the analysis type of the probe to
TF link. The “_total_motif_occurences_per_probe.tsv” file lists every single probe linked to the
given TF, the number of motifs for that TF that were found in their vicinity using the user-
specified parameters, and the type of analysis of that probe to TF link. This latter file is useful for
quickly identifying which linked probes have at least 1 occurrence of a motif for the TF of
interest found in their vicinity.
4.2.1.2 step7_selected_probes_simple_scatterplots():
This function allows the user to select specific DNA methylation probes, and if those
probes are linked to any TFs of the selected analysis type, the function will output scatterplots
displaying the expression level of each linked TF on the x-axis and the methylation level of the
DNA methylation probe on the y-axis, with values for each case sample in red, and each control
sample in blue.
The only unique argument to this function is the probe_list argument, to which the user
should provide a vector of probes for which they want to generate scatterplots. For each of
these probes, an individual folder will be created to contain the .pdf files with the scatterplots.
Scatterplots were created using functions from the ggplot2 package
(https://cran.r-project.org/web/packages/ggplot2/index.html).
These scatterplots are saved in “_scatterplot.pdf” files, with the names of the TF and
linked probe ID, and are useful for quickly illustrating the relationship between the expression of
the TF gene and the methylation of the given probe, as well as illustrating how many case
samples are displaying such a specific TF expression to probe DNA methylation relationship.
4.2.1.3 step7_states_for_links():
This function is simple in setup, and does not require the user to even specify a
top_gene_number argument for the function to operate. Instead, this function analyzes every
identified TF-to-probe link for each of the four classification types selected for analysis. For each
of these links, every case sample is analyzed to assess if that sample might harbor said link. This
is determined by taking each of the case samples and first checking if that sample’s individual
expression of the TF is significantly greater than the TF expression level in the control samples,
with a Bonferroni multiple testing correction performed for the number of case samples
analyzed. Next, each of the case samples is checked to see if that sample’s methylation for the
probe in each probe to TF link is greater or less than the specified hypermeth_cutoff or
hypometh_cutoff value, respectively, set in the step2_get_diffmeth_regions() function.
Results for this function are saved in “_links_states_table.tsv” files, one for each of the
four classification types selected for analysis. The names of each of the case samples are listed in
the column names of the file, and the DNA methylation probe and TF Ensembl IDs for the links
listed in the row names. For each link, the case samples are listed as “0” if they did not pass both
of the criteria listed above, while those that did are listed with a “1” and are said to harbor that
link.
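The two criteria can be sketched in Python for a single hypometh_Gplus link as follows. The text above does not specify the statistical test used, so this sketch assumes a one-sided normal test of each case sample against the control expression distribution, with the significance level divided by the number of case samples (Bonferroni). The function name, argument names, and example values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def hypo_gplus_link_states(case_expr, case_meth, control_expr,
                           hypometh_cutoff, alpha=0.05):
    """Return {sample: 0 or 1} for one hypomethylated, overexpressed link.

    case_expr / case_meth map case sample names to TF expression and probe
    methylation; control_expr is a list of control expression values.
    """
    # Assumed test: controls modeled as a normal distribution (sketch only).
    control_dist = NormalDist(mean(control_expr), stdev(control_expr))
    corrected_alpha = alpha / len(case_expr)  # Bonferroni over case samples
    states = {}
    for sample, expression in case_expr.items():
        p_value = 1.0 - control_dist.cdf(expression)  # one-sided, "greater"
        overexpressed = p_value < corrected_alpha
        hypomethylated = case_meth[sample] < hypometh_cutoff
        states[sample] = int(overexpressed and hypomethylated)
    return states


# Hypothetical values: only T1 is both overexpressed and hypomethylated.
states = hypo_gplus_link_states(
    case_expr={"T1": 9.0, "T2": 5.1, "T3": 9.5},
    case_meth={"T1": 0.2, "T2": 0.1, "T3": 0.8},
    control_expr=[5.0, 5.5, 4.5, 5.2, 4.8],
    hypometh_cutoff=0.4,
)
print(states)
```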
This function is useful for identifying which of the case samples might be driving the
identification of a given probe to TF link, so that those samples can be more closely analyzed on
an individual basis.
4.2.1.4 step7_top_genes_circos():
This function will create circos plots for the top TFs, visualizing the links between the TF
and each of the DNA methylation probes linked to it for the selected classification types.
Besides the default arguments, users will again need to specify the methylation array
their data is annotated to with the DNA_methylation_manifest argument by setting it to either
"HM27", "HM450", “EPIC”, or “EPIC_plus”. As before, this argument defaults to “HM450” if not
specified.
This function makes use of functions from the RCircos package to draw circos plots
(https://cran.r-project.org/web/packages/RCircos/index.html). This package was chosen over
similar packages such as BioCircos as it did not require the installation of specific, non-R
programs to operate, making it much easier to use this function when running TENETR on
remote systems such as the USC CARC HPC.
Circos plots are saved in “.hypo.G+.links.circosplot.pdf” files with the name and Ensembl
IDs of the TF included in the file names. These plots are useful for illustrating probe to TF links
that are identified and can be helpful in identifying clusters of linked DNA methylation probes.
4.2.1.5 step7_top_genes_complex_scatterplots():
This function is somewhat similar to the step7_selected_probes_simple_scatterplots()
function as it creates scatterplots displaying the expression of a given TF gene on the x-axis and
the methylation levels of DNA methylation probes linked to that TF on the y-axis, with the
case and control samples in red and blue, respectively. However, as the name of the function
suggests, this function works in a gene-centric manner, creating scatterplots for the
selected top TFs and each of their linked probes. In addition, this function lets the user integrate
patient copy number variation (CNV), somatic mutation (SM), and sample purity information
into the plots, with CNV and SM status of the TF reflected in the shape of each sample’s
point in the scatterplot, while purity information is reflected in their size.
For CNV data, users should provide a path to a tab-delimited file to the cnv argument of
the function. This file should be set up with sample IDs as the column names of the CNV file,
matching the sample IDs in the expData and metData matrices, and gene Ensembl IDs in the row
names, which should include, but need not be limited to, the Ensembl IDs for the top TFs of the
specified classification types. Within the body of this file, CNV data should be given as integer
values representing the copy number change value for each of the genes in each of the samples,
with negative values representing loss of that many copies, positive values representing gain of
that many copies, and 0 representing no change in copy number.
For somatic mutation (SM) data, users should provide a path to a tab-delimited file to
the sm argument of the function. Unlike the CNV file, this file should include the gene Ensembl
IDs in the column names, while the sample IDs should be included in the row names instead, but
otherwise should be formatted similarly to the CNV file. The body of this file should include either
“0” or “1” values, with 0s indicating no mutation of the given gene was found in an individual
sample, while 1s indicate a mutation was found.
Finally, for purity data, users should provide a path to a tab-delimited file to the purity
argument of the function. Unlike the previous files, this file should consist of two columns
specifically named “ID” and “purity” and should have no row names. Instead, the “ID” column
should contain sample IDs which match with those of the expData and metData matrices, while
the “purity” column should contain values ranging from 0 to 1 (impure to pure) which
correspond to the respective samples’ purity scores.
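Because the three input files use different layouts (genes in rows for the CNV file, samples in rows for the SM file, and two named columns for the purity file), it can help to validate them before running the function. The Python sketch below checks only the purity layout described above; the function name read_purity_table and the sample IDs are hypothetical, and TENETR itself reads these files in R.

```python
import csv
import io


def read_purity_table(handle):
    """Parse a tab-delimited purity table with columns 'ID' and 'purity'.

    Returns {sample_id: purity} and raises ValueError if the header is wrong
    or a purity value falls outside the 0 to 1 range.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != ["ID", "purity"]:
        raise ValueError("expected exactly two columns named 'ID' and 'purity'")
    purity = {}
    for row in reader:
        value = float(row["purity"])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"purity for {row['ID']} is outside 0 to 1")
        purity[row["ID"]] = value
    return purity


# Hypothetical file contents with made-up sample IDs.
example_file = io.StringIO("ID\tpurity\nsample_1\t0.82\nsample_2\t0.45\n")
print(read_purity_table(example_file))
```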
Similar to the previous step7_selected_probes_simple_scatterplots() function,
scatterplots were created using ggplot2 functions
(https://cran.r-project.org/web/packages/ggplot2/index.html) and are saved in “_scatterplot.pdf” files, with
the names of the TF and linked probe ID. These scatterplots are useful to illustrate the
relationship between the expression of the top TF genes and the methylation of each of their
linked probes. This function provides even more information by also providing a visual example
of how mutations and copy number alterations, as well as tumor purity values, affect the
relationship between the expression of a given TF and the methylation levels of its linked
probes.
4.2.1.6 step7_top_genes_cox_survival():
This function performs simple Cox regression survival analyses using the expression of
top TF genes as well as the DNA methylation level of the probes linked to them for the link
classification types of interest to the user. No covariates are included in the survival analyses
performed by the function. This function also utilizes only data from the case samples in the
analysis.
Using this function necessitates that the user include a data frame named “clinical” in
the .rda file containing the previously discussed expDataN, expDataT, metDataN, and metDataT
matrices, whose path was supplied to the methylation_expression_dataset argument of the
step2_get_diffmeth_regions() function. Data for each patient should be included in the rows,
with different clinical data present in the columns.
The clinical data frame will need to first include a column named
“bcr_patient_barcode”, which matches with the first 12 characters of the sample names in the
columns of the expData and metData matrices (Table 3-1). Next, a column titled “vital_status”
should be included, which lists each patient as being either “Alive” or “Dead”. Patients who are
listed as “Alive” will have their survival time censored, while patients who are listed as “Dead”
are considered to have reached the event of interest. Finally, two additional columns
“days_to_death” and “days_to_last_followup” should be included, containing integer values of
the survival time of the patients. For patients who are listed as “Alive” in the vital_status
column, their survival information should be listed in the days_to_last_followup column, with an
NA value in the days_to_death column, while the opposite should be the case for samples listed
as “Dead” in the vital_status column.
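The way these four columns translate into a survival time and an event indicator can be sketched in Python as follows. This is only an illustration of the rules stated above, not the R code in the package; the function names and the barcode-like sample name are made up for the example.

```python
def patient_barcode(sample_name):
    """Return the first 12 characters of a sample name for clinical matching."""
    return sample_name[:12]


def survival_time_and_event(record):
    """Map one clinical record to (time_in_days, event).

    event is 1 for patients listed as "Dead" (event reached) and 0 for
    patients listed as "Alive" (censored at last follow-up).
    """
    status = record["vital_status"]
    if status == "Dead":
        return record["days_to_death"], 1
    if status == "Alive":
        return record["days_to_last_followup"], 0
    raise ValueError(f"unexpected vital_status: {status!r}")


# Hypothetical records; None stands in for NA.
clinical = [
    {"bcr_patient_barcode": "TCGA-AA-0001", "vital_status": "Alive",
     "days_to_death": None, "days_to_last_followup": 1200},
    {"bcr_patient_barcode": "TCGA-AA-0002", "vital_status": "Dead",
     "days_to_death": 430, "days_to_last_followup": None},
]
print([survival_time_and_event(record) for record in clinical])
print(patient_barcode("TCGA-AA-0001-01A-11R-A000-07"))
```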
Otherwise, this function requires no unique arguments to be supplied to it by the user,
as unlike Kaplan-Meier survival analysis, Cox regression analysis doesn’t require the user to
determine how to bifurcate the samples analyzed based on the variable of interest.
Using functions provided in the survival package
(https://cran.r-project.org/web/packages/survival/index.html), for each of the link classifications selected by
the user, the function will output two .tsv files with the calculated survival statistics. The first
file, ending with “_top_TFs_cox_survival_info.tsv”, contains survival information for the
expression of the top TFs of the selected classification type, with the genes in the rows of the
file and different statistics in the columns. Data in the file includes the total number of control
and case samples, the number of samples in each group lacking expression data for the TF,
mean expression of the TF in those two groups, the number of control and case samples which
had clinical data (with the variables described previously) and were included in the analysis, the
regression coefficient, the hazard ratio, the survival_direction_of_effect value noting whether
higher or lower expression of the gene was associated with poorer patient survival (regardless
of significance of this association), and the survival p-value, along with finally the Ensembl ID
and name of each TF. Data from survival analyses of linked probes are contained in the
“_top_TFs_linked_probes_cox_survival_info.tsv” file and contain all the same information as the
survival file previously described, except with the probe ID numbers listed instead of TF Ensembl
IDs and names. This file also contains additional columns which list the Ensembl IDs and names
of the top TFs analyzed by this function which were linked to the given probes.
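The relationship between the regression coefficient, the hazard ratio, and the direction of effect reported in these files follows from the Cox model itself: the hazard ratio is the exponential of the coefficient, and a positive coefficient means higher values of the variable are associated with a higher hazard. A minimal Python sketch is given below; the function name and the direction labels are hypothetical and need not match the strings TENETR writes to the survival_direction_of_effect column.

```python
from math import exp


def hazard_ratio_summary(coefficient):
    """Return (hazard_ratio, direction) for one Cox regression coefficient."""
    hazard_ratio = exp(coefficient)
    if coefficient > 0:
        direction = "higher value, poorer survival"  # hypothetical label
    elif coefficient < 0:
        direction = "lower value, poorer survival"   # hypothetical label
    else:
        direction = "no direction"
    return hazard_ratio, direction


print(hazard_ratio_summary(0.5))
print(hazard_ratio_summary(-0.5))
```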
Survival analyses are among the most useful functions for prioritizing TFs and their
linked regulatory element probes. As will be noted later, though a statistically significant survival
p-value does not necessarily indicate said TF or probe is causative of the condition of interest,
such as cancer, it does indicate that their expression or methylation level is associated with the
prognosis of case samples, suggesting they are of particular interest for further analysis. Note that
when identifying which of the TFs and probes are significantly survival associated, the output
survival p-values are not corrected for multiple testing.
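Because the output p-values are uncorrected, a user may wish to adjust them after the fact. The Python sketch below implements the Benjamini-Hochberg procedure as one common choice. This choice is an assumption of the example, not something the function itself performs, and the input values are invented. In R, the same adjustment is available through p.adjust() with method = "BH".

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    n = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

# Invented survival p-values for four hypothetical TFs.
raw_p_values = [0.001, 0.02, 0.03, 0.8]
print(benjamini_hochberg(raw_p_values))
```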
4.2.1.7 step7_top_genes_experimental_vs_control_expression_boxplots():
This is another relatively simple downstream function, which creates boxplots displaying
the difference in the expression levels of the top TFs between the case and control samples and
performs Student’s t-tests comparing that difference.
This function requires no unique arguments to be supplied to it by the user, and outputs
“_expression_boxplot.pdf” files with the Ensembl IDs and names of the TFs analyzed. These
boxplots also include the calculated Student’s t-tests for the comparison of expression levels
between the case and control samples.
These boxplots are created using ggplot2 functions
(https://cran.r-project.org/web/packages/ggplot2/index.html), and are useful for distinguishing TFs
that are relatively highly expressed in general from ones that are lowly expressed overall but become
more expressed, or are entirely deactivated, in the case samples. They are also useful for
distinguishing TFs which are broadly over- or under-expressed in the case samples from
those which are mis-expressed in only a small subset of case samples relative to control samples.
4.2.1.8 step7_top_genes_expression_correlation_heatmaps():
This is another relatively simple function, which generates heatmaps and tables
displaying the correlation in expression of the top TFs across the case samples. Pearson
correlation coefficients are used to assess the degree of correlation in expression levels of these
TFs. These heatmaps are created using the heatmap.3() function
(https://github.com/obigriffith/biostar-tutorials/blob/master/Heatmaps/heatmap.3.R) which
gives excellent functionality, particularly in the use of row and column color bar labels, which
are of particular use for later functions.
This function also requires no additional unique arguments to function, and outputs two
files. The first file is a “_top_TFs_expression_correlation_heatmap.pdf” file, which displays a
mirrored heatmap showing the Pearson correlation coefficient for each pair of the top TFs
analyzed, with blue values indicating Pearson’s correlation values close to -1, and red values
indicating values close to 1. A “_top_TFs_expression_correlation_matrix.tsv” file is also
generated, which displays the same information as the heatmap, but in the form of the
numerical Pearson’s correlation values, as opposed to illustrated ones. For both these files, the
rows and columns are clustered in an unsupervised manner, using the Euclidean distance
measurement and the ward.D2 clustering method [125].
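The pairwise correlation values behind such a heatmap can be sketched in a few lines of Python. This is an illustration only, not the package's R implementation: the TF names and expression values are invented, and the clustering step is omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists.

    Assumes neither list is constant (a constant list would divide by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

def correlation_matrix(expression):
    """expression maps a TF name to its expression values across the same case samples."""
    names = sorted(expression)
    return {a: {b: pearson(expression[a], expression[b]) for b in names} for a in names}

# Invented expression values for three hypothetical TFs across four case samples.
toy_expression = {
    "TF_A": [1.0, 2.0, 3.0, 4.0],
    "TF_B": [2.0, 4.0, 6.0, 8.0],
    "TF_C": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(toy_expression)
print(matrix["TF_A"]["TF_B"], matrix["TF_A"]["TF_C"])
```

The resulting matrix is symmetric, which is why the heatmap described above is mirrored.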
These plots are useful for identifying which of the top TFs analyzed are likely to be
coregulated based on their correlation in expression, and thus may be worth investigating as a
group rather than individually. Note that such correlations are calculated across all the case
samples, so instances where TFs may be co-correlated in some subgroups and not in others will
not be reflected well in these plots.
4.2.1.9 step7_top_genes_histograms():
This is a simple function which creates histograms displaying the number of TFs with a
given number of linked DNA methylation probes of a given classification.
This function has no unique arguments, and it lacks the top_gene_number argument, as
this function calculates the distribution of the number of linked probes per TF across all the TFs
with at least one link of that classification type.
A “_links_TF_gene_freq_histogram.pdf” file is output for each of the classifications
selected, which displays the histogram with the linked probe count per TF on the x-axis, and the
frequency of TFs with that many linked probes on the y-axis. Of note, the binwidth of the plot is
set to 1/200th of the maximum number of probes linked to the TFs.
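A minimal Python sketch of the counting behind this histogram follows. The input format (a list of TF and probe pairs), the function names, and the exact alignment of the bins are assumptions made for illustration. Only the bin width of 1/200th of the maximum count is taken from the description above.

```python
from collections import Counter

def probes_per_tf(links):
    """links is an iterable of (tf, probe) pairs; returns {tf: number of distinct probes}."""
    probes = {}
    for tf, probe in links:
        probes.setdefault(tf, set()).add(probe)
    return {tf: len(linked) for tf, linked in probes.items()}

def histogram(counts, n_bins=200):
    """Bin per-TF probe counts using a bin width of max(counts) / n_bins."""
    width = max(counts) / n_bins
    # The TF with the maximum count is placed in the last bin rather than a new one.
    bins = Counter(min(int(count / width), n_bins - 1) for count in counts)
    return bins, width

toy_links = [("TF_A", "cg1"), ("TF_A", "cg2"), ("TF_A", "cg1"), ("TF_B", "cg1")]
print(probes_per_tf(toy_links))
```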
These plots are of interest for illustrating how distinct the top TFs are, by the number of
their linked probes compared to those of most TFs in the classification type, as well as for
comparing how different alterations to the TENETR specifications can affect distribution of
linked probes to TFs. These plots can be used alongside the previously discussed RBO analyses.
4.2.1.10 step7_top_genes_met_heatmaps():
This is a simple function in terms of its arguments, but it creates information dense
heatmaps which display the methylation levels of all DNA methylation probes linked to at least
one of the specified top TFs of a given classification type across the case samples, along with the
expression levels of those top TFs in the column labels of the heatmap. These heatmaps are
created using the heatmap.3() function
(https://github.com/obigriffith/biostar-tutorials/blob/master/Heatmaps/heatmap.3.R).
As with several of the functions before it, this function also has no unique arguments and
outputs a “_top_TFs_linked_probe_methylation_heatmap.pdf” file with the heatmap. The DNA
methylation values of the probes linked to the top TFs comprise the body of the heatmap, while
the TF expression values are displayed above the DNA methylation values. Each probe or TF
comprises a row in the heatmap, while each case sample is displayed in the columns.
The DNA methylation values in the rows of the heatmap body, and sample data in all columns,
are clustered in an unsupervised manner, using the Euclidean distance measurement and the ward.D2
clustering method [125], while the TFs are listed in descending order of number of linked probes. Gene
expression values in the plot are also rescaled, such that a given sample’s coloration for a given TF
reflects its position between the sample with the lowest non-zero expression of that TF and the sample
with the highest expression. Samples lacking expression for a TF are automatically set to the
lowest possible value. Coloration is done using the jet.colors() function from the matlab R package
(https://github.com/HenrikBengtsson/R.matlab).
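The expression rescaling described above can be sketched as follows in Python. This is an interpretation rather than the package's code: the 0 to 1 output range, the use of None for missing values, and the handling of a TF whose expressed samples all share one value are assumptions.

```python
def rescale_expression(values):
    """Rescale one TF's expression values across samples to the range 0 to 1.

    The lowest non-zero value maps to 0 and the highest to 1. Samples with zero
    or missing (None) expression are set to 0, the lowest value of the scale.
    """
    expressed = [v for v in values if v is not None and v > 0]
    if not expressed:
        return [0.0 for _ in values]
    low, high = min(expressed), max(expressed)
    span = high - low
    scaled = []
    for v in values:
        if v is None or v <= 0:
            scaled.append(0.0)
        elif span == 0:
            # Assumption: if all expressed samples are equal, show them at the top.
            scaled.append(1.0)
        else:
            scaled.append((v - low) / span)
    return scaled

print(rescale_expression([0, 2, 4, 6, None]))
```

Scaling each TF separately in this way keeps a lowly expressed TF from appearing uniformly dark next to a highly expressed one.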
This function is one of the more important downstream analysis functions, as it most
clearly allows the identification of potential subgroups within the case samples. If a small
subsection of case samples is showing differential activation of some of the TFs, it should be
displayed in the heatmap by both the expression level of these TFs, and ideally the methylation
level of a subsection of the linked probes, compared to other case samples.
4.2.1.11 step7_top_genes_overlapping_linked_probe_heatmaps():
This is another simple function which outputs binary black and white heatmaps indicating
which of the top TFs each DNA methylation probe was linked to. These heatmaps are created with
the heatmap.3() function
(https://github.com/obigriffith/biostar-tutorials/blob/master/Heatmaps/heatmap.3.R).
This function has no unique arguments and lacks the core_count argument, as the
function is relatively quick to operate and currently isn’t coded to take advantage of
multiple cores. This function outputs a “_top_TFs_overlapping_linked_probe_heatmap.pdf”
heatmap for each of the selected link classification types, which includes the top TFs in the rows
of the heatmap and the pool of DNA methylation probes linked to at least one of the top TFs in
the columns. Both the TFs and the linked probes are clustered by the function in an
unsupervised manner, using the binary distance measurement and the ward.D2 clustering
method [125]. For each of the probes, if it was linked to a given TF it is colored in black, otherwise it
is colored white.
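The binary matrix and the binary distance used to cluster it can be illustrated with a short Python sketch. The input format and function names are assumptions. The distance follows the definition used by R's dist() with method = "binary": the proportion of positions where only one of two rows is "on", among positions where at least one is "on".

```python
def link_matrix(links, tfs, probes):
    """Build a TF-by-probe matrix of 1 (linked, drawn black) and 0 (not linked, drawn white)."""
    linked = set(links)
    return [[1 if (tf, probe) in linked else 0 for probe in probes] for tf in tfs]

def binary_distance(row_a, row_b):
    """Binary distance between two 0/1 rows of equal length."""
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    if either == 0:
        return 0.0  # assumption: two all-zero rows are treated as identical
    only_one = sum(1 for a, b in zip(row_a, row_b) if bool(a) != bool(b))
    return only_one / either

# Invented links between two hypothetical TFs and three probes.
toy_links = [("TF_A", "cg1"), ("TF_A", "cg2"), ("TF_B", "cg2"), ("TF_B", "cg3")]
toy_matrix = link_matrix(toy_links, ["TF_A", "TF_B"], ["cg1", "cg2", "cg3"])
print(toy_matrix, binary_distance(toy_matrix[0], toy_matrix[1]))
```

Two TFs that share most of their linked probes have a distance near 0 and so end up adjacent in the clustered heatmap.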
Along with the heatmaps produced by the
step7_top_genes_expression_correlation_heatmaps() function, these heatmaps are useful for
identifying and visualizing groups of TFs that potentially function in tandem to regulate similar
groups of probes and may be worth studying as a group instead of singularly.
4.2.1.12 step7_top_genes_simple_scatterplots():
This function works similarly to the step7_top_genes_complex_scatterplots() in that it
will generate scatterplots in a top TF-centric manner, i.e. creating ones for the top TFs of a given
classification type versus each of their linked probes. These scatterplots are also produced using
ggplot2 functions (https://cran.r-project.org/web/packages/ggplot2/index.html).
Unlike the previous function however, this function does not take into account CNV, SM,
or purity information, and thus only operates using the default step7 function arguments. As
before, these scatterplots are saved in “_scatterplot.pdf” files, with the names of the TF and
linked probe ID, and are similarly useful when examining relationships between the expression
levels of the top TF genes and the methylation of each of their linked probes, especially when
the user does not have their own CNV, SM, or purity data from the case and control samples.
4.2.1.13 step7_top_genes_survival():
This function is similar to the previously discussed step7_top_genes_cox_survival() function but
performs Kaplan-Meier instead of simple Cox regression survival analyses using the expression
of top TF genes in addition to the DNA methylation level of the probes linked to them for the
link classification types of interest to the user. This function only utilizes data from the case
samples in the analysis.
Similar to the step7_top_genes_cox_survival(), use of this function necessitates that the
user included a data frame named “clinical” in the .rda file containing the previously discussed
expDataN, expDataT, metDataN, and metDataT matrices, whose path was supplied to the
methylation_expression_dataset argument of the step2_get_diffmeth_regions() function.
This function requires the user to set the visualize_survival_plots_genes and
visualize_survival_plots_probes arguments to either TRUE or FALSE to determine if they would
like the function to produce survival plots for the top TFs and their linked DNA methylation
probes, respectively. The user will also need to set the high_fraction and low_fraction arguments
to values between 0 and 1, which control the proportions of samples that are sorted into the
“high” and “low” groups for the expression or methylation of the given TF or probe. As an
example, setting the high_fraction argument to 0.25 and the low_fraction argument to 0.25 will
result in case samples with the highest 25% of expression or methylation being included in the
“high” group, and case samples with the lowest 25% of expression or methylation being sorted
into the “low” group. Kaplan-Meier survival analyses will be performed comparing these two
groups, while remaining case samples will be sorted into an “intermediate” group, but will not
be compared in the analyses. Note that the sum of the high_fraction and low_fraction values
should not exceed 1.
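The grouping controlled by high_fraction and low_fraction can be sketched in Python as below. This is a simplified illustration, not the package's implementation: it assigns groups by rank and rounds group sizes down, whereas the real function may use quantile cutoffs and treat ties differently. The sample IDs are invented.

```python
def split_high_low(values, high_fraction, low_fraction):
    """values maps a sample ID to expression or methylation; returns {sample: group}."""
    if not (0 <= high_fraction <= 1 and 0 <= low_fraction <= 1):
        raise ValueError("fractions must lie between 0 and 1")
    if high_fraction + low_fraction > 1:
        raise ValueError("high_fraction + low_fraction must not exceed 1")
    ranked = sorted(values, key=values.get)  # sample IDs, lowest value first
    n = len(ranked)
    n_low = int(n * low_fraction)    # assumption: group sizes are rounded down
    n_high = int(n * high_fraction)
    groups = {}
    for position, sample in enumerate(ranked):
        if position < n_low:
            groups[sample] = "low"
        elif position >= n - n_high:
            groups[sample] = "high"
        else:
            groups[sample] = "intermediate"
    return groups

toy_values = {"s1": 1, "s2": 2, "s3": 3, "s4": 4, "s5": 5, "s6": 6, "s7": 7, "s8": 8}
print(split_high_low(toy_values, high_fraction=0.25, low_fraction=0.25))
```

With eight samples and both fractions set to 0.25, two samples fall in each of the "high" and "low" groups and four are left as "intermediate", so only half of the samples enter the comparison.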
Using functionality provided by the survival package
(https://cran.r-project.org/web/packages/survival/index.html), for each of the link classifications
selected by the user, the function will output two .tsv files with the calculated survival statistics. The
“_top_TFs_survival_info.tsv” file contains the survival information based on the expression of
the top TFs of the selected classification type. This file contains similar data to the file with
survival data for TFs output by the step7_top_genes_cox_survival() function, except it also
includes columns listing the mean expression of the TF in the “intermediate” samples, as well as
the number of these samples with clinical data. Additionally, the proportion of samples that
reached the end point of interest is calculated and included, and these values are used to determine
the listed “survival_direction_of_effect” value, as (in contrast to Cox regression survival
analyses) a survival coefficient is not calculated by the Kaplan-Meier survival analysis. Similar
data for the linked DNA methylation probes is included in the
“_top_TFs_linked_probes_survival_info.tsv” file, which also includes the Ensembl IDs and names of
the top TFs to which each probe is linked.
Finally, the step7_top_genes_survival() function will also output “_survival_plot.pdf”
files with either the name of the TF, or linked probe IDs if the user has elected to produce these
files with the visualize_survival_plots_genes and visualize_survival_plots_probes arguments.
These .pdfs show the survival curves for the identified “high” and “low” case samples, with the
former in red and the latter in black, and also include the uncorrected survival p-value in the
title of the plot.
Similar to the simple Cox regression analyses, the Kaplan-Meier survival analyses are among
the most useful functions for prioritizing TFs and their linked regulatory element probes, but
come with the same caveats in interpreting their significance to the data analyzed with TENETR.
Choosing between the two analyses is one of personal preference, though it should be noted
that the values set for the high_fraction and low_fraction arguments can affect the survival
significance. In my testing, comparing more extreme groups (such as the high and
low tertiles, instead of splitting all the samples 50/50) tends to result in generally more
significant survival p-values in spite of the reduced number of samples analyzed (which could
potentially reduce significance).
4.2.1.14 step7_top_genes_TAD_tables():
This function allows users to identify the pool of DNA methylation probes which are
linked to at least one of the top TFs of the specified classification types, and overlap them with
topologically associating domain (TAD) files the user has in order to identify other genes which
might be located in the same TAD as each of the linked probes.
This function requires the user to use the DNA_methylation_manifest argument to
specify the DNA methylation array from which their data was generated, similar to several previous
functions. Additionally, the user must set the TAD_directory argument to be a path to a
directory which contains files with TAD information. These files should have .bed-like
formatting, with the chromosome, 0-indexed start, and 1-indexed end locations of the
TADs. The directory specified can have any number of TAD files, and each will be loaded and
analyzed by the function. It is recommended that the user only supply TAD files from cellular
sources relevant to the samples with expression and methylation data analyzed by TENETR.
This function outputs a “_top_TFs_TAD_analysis.tsv” file, which can end up being large in
size. This file lists information for each of the probes linked to the top TFs, including the probes’
hg38 coordinates and a column for each of the top TFs listing whether each probe is linked to that
individual TF or not. The final columns in the file are listed in sets of 3 columns, with one set of
columns for each of the TAD files. If the probe was found in one of the TADs specified in that
file, the three columns will list the number of genes also identified in the same TAD as the
probe, the Ensembl IDs of those genes, and the gene names, respectively. However, if the probe
is not found in a TAD in a given file, all three columns will instead list “No_TAD_identified”.
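The per-file lookup that produces each set of three columns can be sketched in Python as below. Several details are assumptions made for illustration: each gene is represented by a single 1-based position, multiple values are joined with commas, and the function names and toy coordinates are invented. The half-open interval test reflects the 0-indexed start and 1-indexed end of the .bed-like TAD files.

```python
def contains(tad, chrom, position):
    """True if a BED-like TAD (chrom, 0-indexed start, 1-indexed end) holds a 1-based position."""
    tad_chrom, start0, end1 = tad
    return tad_chrom == chrom and start0 < position <= end1

def tad_columns(probe, tads, genes):
    """Return the three columns for one probe and one TAD file.

    probe is (chrom, position); genes is a list of (ensembl_id, name, chrom, position).
    """
    probe_chrom, probe_position = probe
    hits = [tad for tad in tads if contains(tad, probe_chrom, probe_position)]
    if not hits:
        return ("No_TAD_identified", "No_TAD_identified", "No_TAD_identified")
    shared = [g for g in genes if any(contains(tad, g[2], g[3]) for tad in hits)]
    return (str(len(shared)), ",".join(g[0] for g in shared), ",".join(g[1] for g in shared))

toy_tads = [("chr1", 0, 1000), ("chr1", 1000, 2000)]
toy_genes = [("ENSG1", "GENE1", "chr1", 500), ("ENSG2", "GENE2", "chr1", 1500),
             ("ENSG3", "GENE3", "chr2", 500)]
print(tad_columns(("chr1", 1000), toy_tads, toy_genes))
```

In the toy data, a probe at position 1000 belongs to the first TAD only, because the second TAD's 0-indexed start of 1000 corresponds to a first base of 1001.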
This function is particularly useful for users who are analyzing distal enhancers; this file
can aid in identifying downstream target genes for the putative distal enhancers marked by the
DNA methylation probes themselves linked to the upstream transcription factors. As distal
enhancer probes tend to regulate genes within the same TAD [32], the genes found inside the same
TADs as the DNA methylation probes are of particular interest as potential target genes for the
regulatory elements containing those probes.
4.2.1.15 step7_top_genes_UCSC_bed_files():
This function can be used to create .bed-formatted interact files, which can be uploaded
to the UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgTracks) to display links
between the top TFs and their linked DNA methylation probes for each of the classification
types of interest.
This function has no unique arguments, and it returns
“_TFs_to_enhancer_probe_links_hg38.bed” files, which, as the extension suggests, are .bed-like
files which contain the coordinates for links from the top TFs to each of their linked probes. The
file also has a header line so that it can be read by the UCSC Genome Browser as an interact file
and display appropriately.
This function is useful for visualizing links between the top TFs and their linked DNA
methylation probes, especially when the user also has other genomic datasets (for instance, the
epigenomic datasets used to identify the DNA methylation probes) to visualize the links within
the Genome Browser.
4.2.1.16 step7_top_genes_user_peak_overlap():
This is the final step7 function, and it allows the user to identify which of the DNA
methylation probes linked to the top TFs for each of the four link classification types are found
within a specified distance of regions in additional files supplied by the user.
Besides the usual default arguments for the step7 functions, users should supply a path
to a directory to the ext_peaks_directory argument. This directory should contain either .bed,
.narrowPeak, .broadPeak, or .gappedPeak files, which must all be in a .bed-like format and
contain hg38-annotated regions the user wants to check for overlaps with the DNA methylation
probes linked to the top TFs. As with previous functions, the DNA_methylation_manifest should
be set to the DNA methylation array used to generate the methylation data for the linked
probes of interest. Finally, the distance_from_probes argument should be set to a positive
integer value, representing the maximum distance a probe should lie from a given region for
that region to be said to lie in close proximity to the probe.
This function outputs a “_linked_probes_peak_overlap.tsv” file for each of the top TFs,
which lists the probe IDs and hg38-annotated genomic coordinates for each of the TFs’ linked
probes, as well as a column for each of the user supplied peak files loaded from the specified
ext_peaks_directory. These columns will have a header with the name of each peak file, and will
contain either TRUE or FALSE, indicating whether each probe was located within the
distance_from_probes value from at least one of the peaks in that file.
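The TRUE or FALSE value in each peak-file column can be illustrated with the following Python sketch. It is a simplified stand-in for the package's R code: the function name is hypothetical, peaks are assumed to be already parsed into (chromosome, 0-indexed start, 1-indexed end) tuples, and a probe is assumed to be a single 1-based position.

```python
def near_any_peak(chrom, position, peaks, distance_from_probes):
    """True if a 1-based probe position lies within the given distance of any peak."""
    for peak_chrom, start0, end1 in peaks:
        if peak_chrom != chrom:
            continue
        # Convert the BED-like interval to a closed 1-based interval.
        first, last = start0 + 1, end1
        if first - distance_from_probes <= position <= last + distance_from_probes:
            return True
    return False

toy_peaks = [("chr1", 100, 200)]  # covers bases 101 to 200 of chr1
print(near_any_peak("chr1", 90, toy_peaks, distance_from_probes=11))
print(near_any_peak("chr1", 90, toy_peaks, distance_from_probes=10))
```

In this example the probe lies 11 bases upstream of the first base of the peak, so it counts as nearby with a distance of 11 but not with a distance of 10.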
This function is useful if the user has additional peak files with relevant genomic data for
which they wish to determine if any of the TF-linked probes fall in the vicinity. These could
include ChIP-seq data for the TF to which the probes were linked. Finding the probes that fall in
the vicinity of the regions the TF was found to bind to would add further evidence that the
regulatory elements marked by those probes could be directly regulated by that TF, potentially
prioritizing the regulatory elements marked by these probes for further study.
4.2.2 Additional TCGA clinical covariates:
Besides the values related to the survival analyses discussed previously, there are other
covariates related to the TCGA samples which I have included in various downstream analyses. I
have found this data can be most easily derived either directly from the “clinical” data frame
downloaded with the TCGA_downloader() function in the TENETR package, assembled by the
TCGAbiolinks package [71], or acquired from cBioPortal [72,76].
Data such as patient age, sex, race, tumor staging, and smoking history were acquired
from the clinical data frame and cleaned. There are a few considerations that I have taken into
account when using this data. First, not every clinical covariate is available for every TCGA
cancer type. For instance, smoking history data is available for more smoking-relevant cancer
types like BLCA, LUAD, and LUSC, but is not available for cancer types like BRCA and PRAD.
Second, clinical data often needs to be cleaned and possibly grouped, such as the cancer stage
information, which I condensed into four levels: “stage I”, “stage II”, “stage III”, and “stage IV”,
from the extended “stage IIIA”, “stage IIIB”, etc. notation. In addition, some clinical covariates,
such as the smoking history variable, are encoded with numerical values meant to denote
different classifications. For these, the encoding can be checked by going to
https://cdebrowser.nci.nih.gov/cdebrowserClient/cdeBrowser.html#/search, clicking on the
“Public ID Search” tab, then using clinical data element (CDE) value 2181650 to search for what
groups the values represent, such as those seen for the smoking history categorization in Table
4-1. CDE values can be seen at
https://gdc.cancer.gov/about-data/gdc-data-processing/clinical-data-standardization. Lastly, even
if a given covariate is included in the TCGA clinical data, it does not guarantee that that variable
has complete information. For instance, I have generally elected to use smoking history, rather
than pack years smoked, to analyze the potential impact of tobacco smoking, as data for the latter
are lacking in approximately one-third of TCGA LUAD samples with clinical data.
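As an example of the kind of cleaning involved, the stage condensing described above can be sketched in Python. The regular expression and function name are illustrative assumptions rather than the code actually used, and values that do not match the expected pattern are returned as None so that they can be reviewed or excluded.

```python
import re

# Matches e.g. "stage I", "Stage IIIA", "stage IVB"; longer numerals are tried first.
STAGE_PATTERN = re.compile(r"\s*stage\s+(IV|III|II|I)[A-C]?\s*", flags=re.IGNORECASE)

def collapse_stage(raw_stage):
    """Collapse an extended stage label to one of four levels, or None if unrecognized."""
    if raw_stage is None:
        return None
    match = STAGE_PATTERN.fullmatch(raw_stage)
    if match is None:
        return None
    return "stage " + match.group(1).upper()

for label in ["stage IIIA", "Stage IB", "stage IV", "not reported"]:
    print(label, "->", collapse_stage(label))
```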
Table 4-1: Encoding of TCGA smoking history
The smoking history variable in the TCGA clinical data is an example of an encoded variable. The first
column shows the values, from 1 to 7, that might appear in the TCGA clinical data. The “Meaning”
column shows the interpretation of those values in the TCGA data, while the “My Encoding” column
lists how I have encoded the values in my datasets. NA values are values I have generally excluded
from my analyses due to their indefinite smoking history values. The final column lists the numeric
value I have assigned to the groups when I have investigated treating smoking history as a single
continuous variable rather than a categorical one. Values were assigned to reflect more recent
smoking influence.
The TCGA clinical data does often include variables relating to common molecular
alterations associated with each cancer type, such as EGFR and KRAS mutations, as well as
EML4-ALK alterations in the LUAD dataset [9]. However, these variables often lack information for
most of the patients in the dataset and are limited in scope. Thus, for molecular alteration
information, I have turned to cBioPortal. cBioPortal collects datasets
from repositories such as the TCGA and disseminates that data in an easily accessible
manner [72,76]. To acquire information on specific molecular alterations, such as EGFR and KRAS in
LUAD, I found the cancer type of interest, selected data from the “PanCancer Atlas” dataset
for that cancer type, and selected “Query By Gene”. After specifying the genes of interest, I
downloaded the “Tabular” file from the Download tab immediately above the displayed
mutational tracks for the genes I had selected, which downloaded a
“PATIENT_DATA_oncoprint.tsv” file. For each of the specified genes, this file will note if a CNV,
mutation, or structural variant was found. For the sake of simplicity, I have said that any sample
with an alteration of any of these three types has an “alteration” of that gene, while samples without
any of the three have “no alteration” for that gene. Overall mutational count data for each
TCGA sample can also be acquired from cBioPortal by selecting the “Explore Selected Studies”
option after specifying the PanCancer Atlas dataset for each cancer type, then clicking the “Download”
and then “Data” options for Mutational Count.
Purity data can be derived for each of the TCGA samples as well. The easiest way I have
found to do this is using the Tumor.purity dataset in the TCGAbiolinks package [71], as referenced
previously in chapter 2.2.8. This dataset contains several tumor purity metrics for TCGA samples,
including ESTIMATE, which uses gene expression patterns from immune and stromal genes,
ABSOLUTE, which uses copy-number data, LUMP, which uses methylation levels of immune-specific
CpG sites, and IHC, which uses estimates based on slide staining [141]. A final CPE variable
is included, which derives a consensus measurement after normalizing the 4 metrics listed
above and is the variable I used in previous analyses [34].
4.2.3 Upset plots:
Upset plots were created using the upset() function from the UpSetR package [142]
(https://cran.r-project.org/web/packages/UpSetR/index.html), as an alternative to Venn
diagrams for displaying complex series of overlaps.
4.2.4 Radar (spider) plots:
Radar plots (also commonly referred to as spider plots) were created to show the rank
of TFs across the 12 cancer datasets analyzed. These plots were created with the radarchart()
function from the fmsb package (https://cran.r-project.org/web/packages/fmsb/index.html).
Ranks across the 12 cancer types are shown, with the outermost ring indicating a TF was the
number one ranked TF in that cancer type, and the innermost ring showing that TF was ranked
200 or lower. Rings in between were gradated by 50 ranks, increasing towards the outer ring.
TFs were colored using color palettes from the viridis
(https://cran.r-project.org/web/packages/viridis/index.html) and matlab
(https://github.com/HenrikBengtsson/R.matlab) packages.
4.2.5 Heatmaps:
Heatmaps for the expression Z-score and TF rank analyses were created using the heatmap.3()
function (https://raw.githubusercontent.com/obigriffith/biostar-tutorials/master/Heatmaps/heatmap.3.R).
Clustering for rows (representing TFs) was done in an unsupervised manner using the Euclidean
distance measurement and the ward.D2 clustering method [125].
Individual tumor samples in expression Z-score heatmaps were grouped by cancer type in the columns.
4.2.6 Venn diagrams (Work with: LH):
Venn diagrams were created using the Venn() function from the Vennerable package
(https://github.com/js229/Vennerable). Ensuring consistent coloring of groups was done using
Inkscape (https://inkscape.org/).
4.3 Results and discussion
4.3.1 Selection of highly ranked TF panel:
TENETR analysis of even one cancer type produces a plethora of data: 491 to 886
upregulated TFs were found to be linked to at least 1 hypomethylated distal enhancer probe in each
cancer type, and almost every annotated TF (1,515 out of a total of 1,639) was linked to at least
one such hypomethylated probe in at least one of the 12 cancers analyzed. Thus, I first sought to
reduce the complexity of the data by focusing the bulk of my analyses on a pool of TFs of
greatest importance across all 12 cancer types.
In my previous analysis of LUAD using TENET 2.0, I had started my analyses by looking at
the top 101 TFs, representing those with ≥50 linked hypomethylated probes. Such a cutoff
would likely not be as useful in this analysis, as it would first be biased towards certain cancer
types over others, with as many as 327 TFs passing that cutoff in the LUSC dataset and as few as
135 doing the same in the COAD dataset. Additionally, such a cutoff would still result in over
1,000 unique TFs being found when pooled across all 12 cancer types.
To address this, I decided to first assemble a pool of the TFs which were found to be
among the top 10 TFs in any cancer type, as seen in Figure 3-14. This pool included 97 unique
TFs; several TFs were identified among the top 10 in multiple cancer types (Figure 4-1).
Interestingly, some TFs, such as FOXA1, seemed to have a high number of linked probes in only a
handful of cancers, while others, such as CENPA, were relatively highly linked across many cancer
types.
Figure 4-1: TFs among the top 10 in multiple cancer types
Upset plot shows the TFs which are found amongst the top 10 TFs by number of linked
hypomethylated distal enhancer probes in multiple cancer types. Lines below the bars represent the
cancer types in which each set of TFs was found among the top 10 TFs. POU6F2, RUNX1, and CENPA
were found among the top 10 TFs in three cancer types, while 17 others were found among the top 10
in two different cancer types. These include MYBL2, POU4F1, and ZNF695, all three of which were found
among the top 10 TFs in both LUAD and BRCA. In fact, LUAD and BRCA had the most top 10 TFs that
were also shared with other cancer types (as seen in the side bars).
Based on this finding, I decided that there is a second class of TFs which might be
important to study. This included TFs that may not be among the top 10 most highly ranked TFs
in a single cancer type but are among the most highly ranked TFs when considering ranks across
all 12 cancer types in sum. I identified these TFs by adding up their individual rankings across the
12 cancer types. The top 20 TFs in terms of lowest overall added ranks are shown in Figure 4-2,
and their individual ranks across the 12 cancer types are shown in Figure 4-3. A complete list of
TF rankings is provided in Supplemental Data 4-1. This revealed that these 20 TFs displayed
distinct patterns of cancer types they appear relevant in, despite having the highest additive
importance (in terms of ranks) overall. The top 20 TFs across cancer types were already well
represented in the pool of TFs that were important in single cancer types, but their inclusion still
added 11 TFs, bringing the total pool to 108 TFs on which I focused further analyses
(Supplemental Data 4-2).
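The two selection rules used to assemble this pool can be sketched in Python as follows. The data structure, function names, and toy ranks are invented for illustration, and the sketch assumes every TF has a rank in every cancer type. How TFs without linked probes in a given cancer type were ranked is not addressed here.

```python
def top_in_any_cancer(ranks, n=10):
    """ranks maps TF -> {cancer_type: rank}; returns TFs ranked within the top n anywhere."""
    return {tf for tf, per_cancer in ranks.items() if min(per_cancer.values()) <= n}

def top_by_summed_rank(ranks, n=20):
    """Return the n TFs with the lowest sum of ranks across cancer types."""
    totals = {tf: sum(per_cancer.values()) for tf, per_cancer in ranks.items()}
    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(totals, key=lambda tf: (totals[tf], tf))[:n]

# Invented ranks for three hypothetical TFs in two cancer types.
toy_ranks = {
    "TF_A": {"LUAD": 1, "BRCA": 30},
    "TF_B": {"LUAD": 12, "BRCA": 11},
    "TF_C": {"LUAD": 40, "BRCA": 50},
}
pool = top_in_any_cancer(toy_ranks, n=10) | set(top_by_summed_rank(toy_ranks, n=1))
print(sorted(pool))
```

In the toy data, TF_B never reaches the top 10 in a single cancer type but has the lowest summed rank, so it enters the pool only through the second rule. This mirrors how the cross-cancer ranking added TFs beyond those already selected by the first rule.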
Figure 4-2: Top 20 TFs by total rankings across 12 cancer types
Bar plot shows the top 20 TFs selected based on having the lowest total sum of ranks across the 12
cancer types, along with the sum.
Figure 4-3: Individual cancer rankings of top 20 TFs by total rankings across 12 cancer types
Radar plots show individual ranking of the top 20 TFs by total rankings in each cancer type. (A) The
top 5 TFs by total rankings. (B) The 6th through 10th TFs by total rankings. (C) The 11th through 15th
TFs by total rankings. (D) The 16th through 20th TFs by total rankings.
4.3.2 Clustering TFs by expression and ranking patterns:
Next, I examined the expression patterns of the 108 TFs across all 12 cancer types. In
the TENET 2.0 study of LUAD, I had initially identified our key TFs of interest through the
examination of their expression correlation [34], a function now provided through the previously
discussed step7_top_genes_expression_correlation_heatmaps() function. However, this was
performed with only one cancer type. Looking at broad co-correlation across all 12 cancer types
at once would likely lose the nuance which could be gained from such an analysis, as two genes
closely coregulated in one cancer type may not be in another. Instead, I decided to perform an
unbiased clustering of expression Z-scores for every sample in all 12 cancer types, which I
anticipated would capture expression patterns much more readily than an average correlation
value across all cancer types (Figure 4-4). A Z-score transformation of expression values was
performed to normalize values across the different genes analyzed, some of which were well
expressed in all tumor samples, others of which had more mixed expression.
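A Z-score transformation of one gene's expression values can be sketched in Python as below. This is a generic illustration: whether the values were log-transformed beforehand is not stated above, so the sketch simply standardizes whatever values it is given, using the sample standard deviation.

```python
import statistics

def z_scores(values):
    """Standardize one gene's expression values across all samples.

    Assumes at least two samples and non-constant values (otherwise the
    standard deviation is undefined or zero).
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]

# A highly expressed and a lowly expressed hypothetical gene with the same pattern.
print(z_scores([100.0, 200.0, 300.0]))
print(z_scores([1.0, 2.0, 3.0]))
```

Both toy genes yield the same Z-scores, which is the point of the transformation: genes with very different overall expression levels become comparable by their pattern across samples.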
Figure 4-4: Selecting TF cluster based on expression Z-scores across cancer types
Heatmap shows the expression Z-scores of TFs of interest across all 12 cancer types. Each column
displays an individual tumor sample grouped by cancer type, and each row represents one of 108 TFs
which were either ranked among the top 10 in at least a single cancer type or were ranked among
the top 20 TFs across all cancer types. Row labels note whether each TF was found among the top 10
within each cancer type, or was in the top 1-10 or 11-20 most highly ranked TFs in all cancer types.
TFs marked with an asterisk at the bottom of the plot show a cluster of selected TFs which showed
similar expression patterns across all cancer types and were among the most highly ranked TFs in
BRCA, KIRP, and LUAD cancer types, as well as overall across the cancer types.
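The per-gene Z-score transformation described above can be illustrated with a short Python sketch. This is not the code used in the thesis (TENET is an R package); the function name and the toy expression values are hypothetical, and the sample standard deviation is assumed as the scaling factor.

```python
import statistics

def zscore_rows(expression):
    """Return a dict mapping each gene to Z-scores of its values across samples."""
    scaled = {}
    for gene, values in expression.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)  # sample standard deviation (assumption)
        # A gene with zero variance carries no pattern; map it to all zeros.
        scaled[gene] = [0.0 if sd == 0 else (v - mean) / sd for v in values]
    return scaled

# Hypothetical log2 expression values for two genes across five samples:
# one well expressed everywhere, one weakly expressed, same relative pattern.
toy = {
    "GENE_HIGH": [10.0, 11.0, 12.0, 13.0, 14.0],
    "GENE_LOW": [0.0, 0.5, 1.0, 1.5, 2.0],
}
z = zscore_rows(toy)
```

After scaling, the two toy genes have identical Z-score profiles even though their absolute expression differs, which is the property that lets a clustering step compare patterns rather than magnitudes.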
I examined the unsupervised clustering of the TFs to identify groups which showed similar
patterns of expression across the cancer types, indicating possible coregulation or similarity in function.
This revealed a group of 8 TFs seen at the bottom of Figure 4-4 which were more closely clustered
together than any other similarly sized grouping of TFs. Additionally, all 8 of these TFs were also among
the top 20 most highly ranked TFs across cancer types, and 7 of 8 were found among the top 10. This
group of 8 TFs included FOXM1, MYBL2, and CENPA, as well as the E2F family members E2F1, E2F2,
E2F7, E2F8, and finally DNMT1. The inclusion of genes such as DNMT1 and CENPA may seem odd, as
these genes are better known for encoding a DNA methyltransferase and an H3 histone variant protein,
respectively, rather than TFs. However, as the curators of the list of human TFs note, a given gene may
possess TF-like activity if it can bind to DNA in a sequence-specific manner, potentially excluding
bona fide TFs from binding to that region and thereby exerting a TF-like effect⁷⁰.
FOXM1, MYBL2, and CENPA were highlighted in the previous TENET analysis of
LUAD³⁴, so it was unsurprising to see that they were highly ranked in LUAD in this study. Members of this
cluster were also highly ranked in BRCA and KIRP cancers (Figure 4-4) (Table 4-2). Furthermore,
these TFs were also generally closely clustered together when we examined the ranks of the individual
TFs across cancer types. 7 of 8 of the TFs clustered together along with several other highly ranked TFs
which were ranked across each of the cancer types, including NFE2L3, DMBX1, and E2F3 (Figure 4-5).
E2F2, however, clustered separately in this analysis because it did not possess a single linked probe in
the COAD dataset.
Table 4-2: Ranks of 8 TFs in selected cluster across cancer types
This table displays the ranks of each of the 8 highlighted TFs across each of the 12 cancer types, as
well as the total sum of ranks and overall composite rank across the 12 cancer types.
Figure 4-5: Clustering TFs by ranks in 12 cancer types
Heatmap showing the rank of the 108 selected TFs in each of the 12 cancer types. Highly ranked TFs
have more linked hypomethylated probes assigned to them for a given cancer type by TENETR.
Rows represent the individual 108 TFs, and each column represents their rank in that cancer type.
Row labels denote how the 108 TFs were included in the set, either being among the top 10 most
highly ranked TFs in a single cancer type, or among the top 20 TFs most highly ranked across all
cancer types. 7 of the 8 TFs clustered by expression Z-score are included in the close clustered group
marked with the asterisk, though in this analysis this group also contains several other highly ranked
TFs.
This last finding also spurred me to look at how different families of genes included amongst the
108 TFs ranked across cancer types. Two families stand out: the FOX family, which included 4 members
in the 108 TF pool I analyzed, and the E2F family, which included 5 members in the 108 TFs.
Intriguingly, these TF families displayed very different ranking patterns. The E2F family members were
generally concordant in rank across the cancer types, consistent with the emerging role of E2F family
overexpression in cancer and cancer stem cells¹⁴³,¹⁴⁴. On the other hand, FOX family members displayed
very different patterns of ranking across the 12 cancer types, perhaps due to the vastly different roles
subfamily members play across different cancers¹⁴⁵,¹⁴⁶ (Figure 4-6).
4.3.3 Survival analysis of highly ranked TF panel:
At this point, I performed survival analyses utilizing the expression of the 108 TFs and the DNA
methylation levels of their linked probes. I have used survival analyses as the main means of focusing
on individual TF genes and linked probes for further analysis, as survival significance can suggest
important functional relevance to the cancer type at hand.
I focused my survival analyses on the BRCA, KIRP, and LUAD cancer types, as these were the
cancer types in which the 8 closely clustered TFs were most highly ranked.
Figure 4-6: Individual cancer rankings of E2F and FOX family members
Radar plots show individual ranking of the E2F and FOX family members included in my panel of 108
highly ranked TFs. (A) E2F family members show generally the same patterns in ranking across
cancer types. (B) FOX family members show very different patterns of rankings, with different family
members showing more or less importance in different cancer types.
In previous TENET studies I had performed survival analyses using Kaplan-Meier curves,
comparing samples with the highest and lowest quartile of expression or methylation of the TF or
probes of interest. Here, I elected to use univariate Cox regression analyses because I was analyzing
expression of hundreds of TFs and thousands of linked DNA methylation probes, and I did not want to
bias the analysis towards specific TFs or linked probes whose specific distribution of expression or DNA
methylation values would produce more significant results for a given splitting criterion. Instead, the use
of Cox regression analyses would allow me to analyze all samples at once, without setting an arbitrary
value to split the case samples. Additionally, use of univariate Cox regression analyses allows me to
perform more sophisticated survival analyses with additional clinical covariates on specific TFs and
probes of interest and compare the results from those much more readily to the univariate analyses.
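To make the contrast with quartile splitting concrete, the following Python sketch fits a univariate Cox model to a continuous covariate using every sample, with no cut point. It is an illustration only, not the thesis code: it assumes the Breslow form of the partial likelihood, a plain Newton-Raphson update without step-halving, and a Wald test for the p-value, and the function name and toy data are hypothetical.

```python
import math

def cox_univariate(times, events, x, max_iter=100, tol=1e-10):
    """Fit h(t | x) = h0(t) * exp(beta * x) by Newton-Raphson.

    Uses the Breslow form of the partial likelihood and returns
    (beta, hazard_ratio, wald_p_value).
    """
    n = len(times)
    beta = 0.0
    for _ in range(max_iter):
        grad = 0.0  # first derivative of the log partial likelihood
        info = 0.0  # negative second derivative (observed information)
        for i in range(n):
            if not events[i]:
                continue  # censored samples only contribute to risk sets
            s0 = s1 = s2 = 0.0
            for j in range(n):
                if times[j] >= times[i]:  # sample j is still at risk
                    w = math.exp(beta * x[j])
                    s0 += w
                    s1 += w * x[j]
                    s2 += w * x[j] * x[j]
            mean = s1 / s0
            grad += x[i] - mean
            info += s2 / s0 - mean * mean
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald test
    return beta, math.exp(beta), p_value

# Hypothetical data: follow-up time, death indicator (1 = died, 0 = censored),
# and a continuous covariate such as expression or methylation.
times = [5, 8, 12, 15, 20, 25, 30, 40]
events = [1, 1, 0, 1, 1, 0, 1, 0]
covariate = [1.9, 0.6, 0.3, 1.4, 1.6, 0.2, 0.8, 1.0]
beta, hazard_ratio, p_value = cox_univariate(times, events, covariate)
```

Because the covariate enters the model as a continuous value, the result does not depend on where a splitting threshold happens to fall. In practice an established implementation (for example, one that handles ties and non-convergence) would be preferable to this sketch.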
I also decided to look beyond just the association of each TF and probe with survival and I
considered the rank of each TF and probe analyzed amongst the pool of all I analyzed. I did this to
account for differences in survival outcomes between cancer types. For instance, BRCA has much better
overall survival outcomes, which means it is much harder to generate significant p-values associated
with survival, especially when correcting for multiple testing. However, it might still be important to
note if a given TF was the most significantly associated with survival in that cancer type amongst the 108
I analyzed. Likewise, a given TF or probe associated with survival in another cancer type is important,
but if there are many other TFs or probes that are more significantly associated with survival, that
particular TF or probe would be a less intriguing target.
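The idea of reporting both significance and within-panel rank can be sketched as follows. This is a hypothetical helper, not part of TENET; the names and p-values are invented for illustration, and the Bonferroni cutoff is taken as alpha divided by the number of tests.

```python
def rank_and_flag(p_values, n_tests=None, alpha=0.05):
    """Rank features by p-value and flag nominal and Bonferroni significance."""
    if n_tests is None:
        n_tests = len(p_values)
    threshold = alpha / n_tests  # Bonferroni-adjusted cutoff
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    return [
        {
            "name": name,
            "rank": rank,
            "p": p,
            "nominal": p < alpha,
            "bonferroni": p < threshold,
        }
        for rank, (name, p) in enumerate(ordered, start=1)
    ]

# Hypothetical p-values for three features drawn from a panel of 108 tests.
results = rank_and_flag({"TF_A": 0.03, "TF_B": 0.0001, "TF_C": 0.2}, n_tests=108)
```

A feature can therefore be the top-ranked one in its cancer type without passing the corrected threshold, or pass it and still rank low when many others are more significant.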
I started by analyzing the survival association of each of the 108 highly ranked TFs I identified,
highlighting the 8 TFs of particular interest, beginning with BRCA. Relatively few of the 108 TFs were
nominally associated with BRCA patient survival, and none of them were significant after Bonferroni
multiple testing correction. Using the ranking, I observed that expression of TFs such as SOX12, AIRE,
and DMBX1 was the most strongly associated with survival among the 108 TFs despite not passing
Bonferroni significance. Additionally, I found that despite being relatively highly ranked in this cancer
type in terms of number of linked hypomethylated probes, the 8 clustered TFs were not among the most
survival associated of the 108 TFs (Figure 4-7).
Figure 4-7: Survival analysis of 108 TFs in BRCA
Plot shows the -log10-transformed p-values from univariate Cox regression survival analyses on the
expression of the highly-ranked 108 TFs in the BRCA dataset along with the rank of each TF’s survival
association among the 108 TFs. The blue dashed line represents the threshold of nominal
significance (p<0.05), and the red-dashed line represents Bonferroni-corrected significance
(Bonferroni p<0.05). TFs labelled in red are the clustered 8 TFs of particular interest.
By comparison, TFs were much more strongly associated with survival in the KIRP and LUAD
datasets. This is likely due in part to the higher rates of patient death in these cancers than in BRCA.
Within these diseases, the 8 clustered TFs of particular interest were also more closely associated with
patient survival, even relative to the other highly-ranked TFs among the 108. In KIRP, expression of MYBL2 and
CENPA had the strongest association with patient survival out of the entire pool of 108 TFs, and E2F7
and FOXM1 were included among the top 5. E2F8 and DNMT1 were also significantly associated with
patient survival after Bonferroni-correction. Interestingly, in the KIRP dataset, expression of 17 of the
108 TFs was significantly associated with patient survival after Bonferroni-correction (Figure 4-8),
further illustrating the value of reporting each TF’s association with patient survival alongside
its rank relative to the other TFs in a panel; here, some of our clustered TFs
are among the TFs most strongly associated with patient survival, at least in this
cohort of TFs.
Figure 4-8: Survival analysis of 108 TFs in KIRP
Plot shows the -log10-transformed p-values from univariate Cox regression survival analyses on the
expression of the highly-ranked 108 TFs in the KIRP dataset along with the rank of each TF’s survival
association among the 108 TFs. The blue dashed line represents the threshold of nominal
significance (p<0.05), and the red-dashed line represents Bonferroni-corrected significance
(Bonferroni p<0.05). TFs labelled in red are the clustered 8 TFs of particular interest.
A similar pattern was observed in LUAD as in KIRP, though there were fewer TFs whose
expression was significantly associated with patient survival. Only 6 of the 108 TFs were significantly
associated with patient survival after Bonferroni-correction, though 2 of the 8 clustered TFs were
included in this number, E2F7 and FOXM1, and E2F7 was the TF whose expression was most significantly
associated with patient survival of the 108 TF panel in LUAD (Figure 4-9).
Figure 4-9: Survival analysis of 108 TFs in LUAD
Plot shows the -log10-transformed p-values from univariate Cox regression survival analyses on the
expression of the highly-ranked 108 TFs in the LUAD dataset along with the rank of each TF’s survival
association among the 108 TFs. The blue dashed line represents the threshold of nominal
significance (p<0.05), and the red dashed line represents Bonferroni-corrected significance
(Bonferroni p<0.05). TFs labelled in red are the clustered 8 TFs of particular interest.
4.3.4 Survival analysis of DNA methylation probes linked to highly ranked TF panel:
Next, I focused on the DNA methylation probes linked to at least one of the 108 TFs, to
identify those whose DNA methylation levels were associated with patient survival.
Traditionally, survival analyses have largely been performed using gene expression. However, combining
survival analyses on gene expression of TFs along with DNA methylation of linked probes could be of key
importance, particularly for a TENETR approach. Considering the TENET model (Figure 1-1), though the
misexpression of a given TF may be the initiating event for the condition of interest, such as cancer
development or progression, it in and of itself is unlikely to directly lead to that condition. Instead, this
happens through the activities of the downstream regulatory regions regulated by that TF, whose
activities are more closely reflected in the methylation levels of the probes marking those regulatory
regions. Thus, the overall effect of the difference in TF expression, although still important as the
potentiating event for downstream changes, could reflect the averaged influence of all the regions it
affects, including activation of elements which ultimately have an antagonizing effect as part of a
feedback loop. As a result, individual regions regulated by a TF which is say, upregulated in tumors and
associated with poor patient survival, could individually be even more closely associated with poor
patient survival than the TF itself. Identifying these regions linked to key TFs is particularly important as
these could constitute valuable targets for further analysis and could lead to a better understanding of
why these TFs are so strongly associated with patient survival in specific cancer types. In addition, there
are many more DNA methylation probes than there are TFs, and the top TFs are usually linked to hundreds,
if not thousands of probes. Therefore, analyzing the top probes alongside the top TFs would give much
more data to work with, especially for potential downstream applications such as model building or
machine learning.
To identify linked regions associated with patient survival, I performed similar survival analyses
as were performed on the 108 TFs in BRCA, KIRP, and LUAD, and used univariate Cox regression analyses
to assess the survival association of the DNA methylation levels of all identified probes linked to at least
one of the 108 TFs in each of the cancer types, then I ranked each probe by its respective significance.
In the BRCA dataset, a total of 7,542 DNA methylation probes were linked to at least one of the
108 highly ranked TFs. Of these, 1,282 were linked to the 8 clustered TFs of particular interest,
accounting for almost 17% of all the probes analyzed. Of the 7,542 total probes analyzed, only 9 passed
Bonferroni significance and an additional 802 were nominally associated with patient survival (Figure 4-10).
Though the number of Bonferroni-significant probes in BRCA is smaller than the count of these
probes in KIRP and LUAD, the significance of the probes most closely associated with survival is still
much stronger than that of the TFs; not a single TF had passed Bonferroni correction for association with
patient survival. In spite of the TFs’ weak association with patient survival, I observed an enrichment of
nominally-significant probes among the probes linked to the 8 clustered TFs of interest, compared to
those that were not (χ² p < 2.2 × 10⁻¹⁶). However, probes linked to the 8 TFs of particular interest also
tended to be linked to more TFs in general, with an average of 9.25 linked TFs per probe, compared to
probes which were not linked to those top 8 TFs, which were linked to an average of just 4.26 TFs per
probe. Thus, it is possible that the probes most closely associated with patient survival in the BRCA
dataset could be influenced by other TFs, even if they are linked to the 8 TFs of interest.
Figure 4-10: Survival analysis of probes linked to 108 TFs in BRCA
Plot shows the -log10-transformed p-values from univariate Cox regression survival analyses on the
methylation of the 7,542 probes linked to the highly-ranked 108 TFs in the BRCA dataset along with
the rank of each probe’s survival association. The blue dashed line represents the threshold of
nominal significance (p<0.05), and the red-dashed line represents Bonferroni-corrected significance
(Bonferroni p<0.05). Probes labelled in red are those linked to the clustered 8 TFs of particular
interest.
There were a smaller number of total probes linked to the 108 highly ranked TFs in the KIRP
dataset than in the BRCA dataset, a total of “only” 5,036. In spite of this, there were significantly more
survival-associated probes identified in the KIRP dataset than in BRCA. DNA methylation levels of 1,921 of these
probes were nominally associated with patient survival in the KIRP dataset, including 211 which
remained significant even after Bonferroni correction. Of the 5,036 total linked probes, 1,547 were
linked to the 8 clustered TFs of interest, accounting for 30.7% of the total, a much larger proportion than
in BRCA (Figure 4-11). A large portion of the survival-associated probes were also amongst those linked
specifically to the 8 TFs of interest; 145 of the 211 Bonferroni-significant ones (68.7%), and 988 of all
1,921 at least nominally associated (51.4%), both outsized proportions given less than a third of all
probes were linked to these 8 TFs. As in BRCA, though, probes linked to the 8 TFs of interest were linked
to more TFs in general than the probes that were not, with an average of 13.52 linked TFs per probe
versus 5.83.
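The "outsized proportion" described for KIRP can be checked with a 2×2 contingency test built from the counts quoted above (5,036 probes, 1,547 linked to the 8 TFs, 1,921 nominally significant, 988 both). The sketch below is an illustration, not a result reported in the thesis: it assumes a Pearson χ² test with one degree of freedom and no continuity correction, and the function name is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Counts taken from the KIRP paragraph above.
all_total, all_sig = 5036, 1921
linked_total, linked_sig = 1547, 988

a = linked_sig                           # linked to the 8 TFs, nominally significant
b = linked_total - linked_sig            # linked to the 8 TFs, not significant
c = all_sig - linked_sig                 # not linked, nominally significant
d = (all_total - linked_total) - c       # not linked, not significant
chi2, p = chi_square_2x2(a, b, c, d)
```

Statistical packages may apply a continuity correction by default for 2×2 tables, so their output can differ slightly from this uncorrected statistic.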
Figure 4-11: Survival analysis of probes linked to 108 TFs in KIRP
Plot shows the -log10-transformed p-values from univariate Cox regression survival analyses on the
methylation of the 5,036 probes linked to the highly-ranked 108 TFs in the KIRP dataset along with
the rank of each probe’s survival association. The blue dashed line represents the threshold of
nominal significance (p<0.05), and the red-dashed line represents Bonferroni-corrected significance
(Bonferroni p<0.05). Probes labelled in red are those linked to the clustered 8 TFs of particular
interest.
Finally, I analyzed the survival association of the 6,652 probes linked to the highly ranked 108
TFs in the LUAD dataset. The curve in the reduction of survival association with increasing rank was
much sharper in the LUAD dataset than in either BRCA or KIRP; only 640 DNA methylation probes were
identified to be nominally associated with patient survival. However, this still included 22 probes that
passed Bonferroni correction for multiple testing, a larger total than for BRCA. Additionally, the
landscape of linked probes in the LUAD dataset was even more heavily dominated by the 8 clustered TFs
of particular interest, with 2,895 of the 6,652 probes (43.5%) linked to at least one of the 8 TFs (Figure 4-12).
This extended to the strongly survival-associated probes, with 20 of the 22 Bonferroni-significant
probes (90.9%) and 372 of the 640 nominally-significant probes in total (58.1%) being linked to the 8
clustered TFs in particular. Yet again, as with BRCA and KIRP, probes that were linked to these 8 TFs
tended to be linked to more TFs in total. Each probe linked to at least one of the 8 TFs was linked to an
average of 12.07 TFs, compared to an average of 6.26 TFs for those that were only linked to other TFs.
Figure 4-12: Survival analysis of probes linked to 108 TFs in LUAD
Plot shows the -log10-transformed p-values from univariate Cox regression survival analyses on the
methylation of the 6,652 probes linked to the highly-ranked 108 TFs in the LUAD dataset along with
the rank of each probe’s survival association. The blue dashed line represents the threshold of
nominal significance (p<0.05), and the red-dashed line represents Bonferroni-corrected significance
(Bonferroni p<0.05). Probes labelled in red are those linked to the clustered 8 TFs of particular
interest.
I had noted a considerable degree of overlap among the identified DNA methylation probes marking
hypomethylated distal enhancers (Figure 3-13). Given that the hypomethylated distal enhancer probes
specifically linked to the 8 clustered TFs in BRCA, KIRP, and LUAD seemed more strongly associated with
patient survival in those cancer types than similar probes linked to other TFs, and seemed to be linked to
more TFs than other probes, I investigated the possibility that similar sets of probes might be regulated
by these TFs in each of these cancer types. We created Venn diagrams and examined the degree of
overlap between three classes of probes across the BRCA, KIRP, and LUAD cancers: the original pool of
hypomethylated probes, the probes linked to at least one of the 108 highly ranked TFs, and probes
linked to at least one of the 8 clustered TFs of interest.
Unexpectedly, there was a very small overlap in the probes linked to at least 1 of the clustered 8
TFs of interest across the three cancer types (Figure 4-13). While 54.8% of all hypomethylated probes
were found in common between at least 2 of the 3 cancer types, only 18.6% of the probes found linked
to the 108 TFs were shared between at least two of the three, and even more surprisingly, only 3.4% of
the probes linked to the 8 TFs of interest were found to be linked to these TFs across two of the three
cancer types. No such probes were identified linked to the 8 TFs in all three. This indicates that although
this group of TFs seem to play an important function in these cancer types, they seem to be doing so by
regulating different sets of regulatory elements in each cancer type, further supporting the observation
we had made between BRCA and LUAD probes linked to CENPA, FOXM1, and MYBL2 in the TENET 2.0
analyses (Supplemental Figure 2-9).
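The overlap percentages above can be computed with simple set operations. The sketch below is a hypothetical illustration: the probe IDs are invented, and it assumes the denominator is the union of probes across the three cancer types, which the text does not state explicitly.

```python
from collections import Counter

def shared_fraction(*probe_sets, min_sets=2):
    """Fraction of all distinct probes that appear in at least `min_sets` of the sets."""
    counts = Counter(probe for s in probe_sets for probe in set(s))
    shared = sum(1 for k in counts.values() if k >= min_sets)
    return shared / len(counts)

# Invented probe IDs standing in for the probes linked in each cancer type.
brca = {"cg01", "cg02", "cg03"}
kirp = {"cg02", "cg04"}
luad = {"cg02", "cg03", "cg05"}

in_two_or_more = shared_fraction(brca, kirp, luad, min_sets=2)
in_all_three = shared_fraction(brca, kirp, luad, min_sets=3)
```

Applying the same function to each of the three probe classes (all hypomethylated probes, probes linked to the 108 TFs, probes linked to the 8 TFs) would give directly comparable overlap fractions.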
Figure 4-13: Overlap of probes between cancer types of interest
A) Venn diagram displaying the overlap of the hypometh probes identified in the BRCA, KIRP, and
LUAD analyses. B) Displays the overlap of only the hypometh probes which were linked to the 108
highly ranked TFs in the BRCA, KIRP, and LUAD analyses. C) Displays the overlap of just the hypometh
probes linked to the 8 clustered TFs of particular interest in BRCA, KIRP, and LUAD.
Notably, even when accounting for the much larger number of probes analyzed in each cancer
type compared to the number of TFs (several thousand, as opposed to 108), the top probes in each
cancer type showed a stronger association with patient survival than even the most strongly associated
TFs.
4.3.5 Selecting a single TF-to-probe link to illustrate further downstream analyses:
Based on these results, I decided to focus further individual analyses on the gene MYBL2 and its
linked probe cg20246907. Across the three cancer types I analyzed, MYBL2 was the TF most
strongly associated with patient survival in any of these three cancer types (Figures 4-7 to 4-9), and the DNA
methylation levels of its linked probe cg20246907 had among the strongest associations with patient
survival of all probes analyzed in the three cancer types as well (Figures 4-10 to 4-12).
Comparing the survival data again side by side, MYBL2 is most significantly associated with
patient survival in the KIRP dataset, and to a lesser extent the LUAD dataset, and was not associated in
the BRCA dataset (Figure 4-14).
Figure 4-14: Survival curves for MYBL2 expression in BRCA, KIRP, and LUAD tumors
Figure shows survival curves of tumor samples split into 4 groups based on their quartile of
expression of MYBL2, as well as the univariate Cox regression survival p-value in that cancer type. (A)
MYBL2 survival results in BRCA. MYBL2 expression is not associated with BRCA patient survival. (B)
MYBL2 survival results in KIRP. High MYBL2 expression is strongly associated with poor patient
survival. (C) MYBL2 survival results in LUAD. High MYBL2 expression is weakly associated with poor
LUAD patient survival.
cg20246907 was identified as a distal enhancer probe in all the cancer datasets, as it was
identified in the vicinity of peaks in the consensus enhancer and consensus open chromatin datasets.
Furthermore, this probe was hypomethylated in 10 of the 12 cancer datasets, excluding PRAD and THCA,
but was linked to 41 and 31 TFs in only the KIRC and KIRP datasets, respectively. The methylation level of
cg20246907 was only weakly associated with patient survival in BLCA and KIRC but strongly associated
with patient survival in KIRP in particular (Table 4-3). This is further illustrated in Figure 4-15, which
shows the survival outcomes of BRCA, KIRP, and LUAD patients divided into four groups based on
quartiles of cg20246907 methylation. In KIRP, there was not a single patient death observed in the
group with the highest quartile of cg20246907 methylation, while less than 50% of patients with the
lowest quartile of cg20246907 methylation survived more than 5 years.
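A quartile-based survival comparison of this kind can be sketched in Python as follows. This is a minimal illustration rather than the plotting code used for Figure 4-15: the function names and toy data are hypothetical, quartile cut points come from `statistics.quantiles` with its default method, and the Kaplan-Meier estimate is computed only at observed death times.

```python
import statistics

def quartile_groups(values):
    """Assign each value to a quartile group (1 = lowest, 4 = highest)."""
    cuts = statistics.quantiles(values, n=4)  # three cut points
    return [1 + sum(v > c for c in cuts) for v in values]

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct death time."""
    surv = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical follow-up times (years) and death indicators (1 = died, 0 = censored)
# for the samples falling in one quartile group.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
groups = quartile_groups([1, 2, 3, 4, 5, 6, 7, 8])
```

A group with no observed deaths, as described for the highest methylation quartile in KIRP, yields an empty curve here, meaning the estimated survival stays at 1.0 throughout follow-up.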
Table 4-3: Status of cg20246907 across cancer types
Table shows the status of cg20246907 as a candidate distal enhancer probe across the 12 cancer
types. While the probe was identified as a possible enhancer probe in all 12 cancer types, it was only
identified as a hypomethylated enhancer probe in 10 of the 12 cancer types and was linked to TFs,
including MYBL2, in only two cancers, KIRC and KIRP. cg20246907 was nominally-associated with
patient survival in BLCA and KIRC, but KIRP was the only cancer type for which it was strongly so.
Figure 4-15: Survival curves for cg20246907 methylation in BRCA, KIRP, and LUAD tumors
Figure shows survival curves of tumor samples split into 4 groups based on their quartile of
methylation of cg20246907, as well as the univariate Cox regression survival p-value in that cancer
type. (A) cg20246907 survival results in BRCA. cg20246907 methylation is not associated with BRCA
patient survival. (B) cg20246907 survival results in KIRP. Low cg20246907 methylation is strongly
associated with poor patient survival, even more significantly than high MYBL2 expression. (C)
cg20246907 survival results in LUAD. cg20246907 methylation is not associated with
LUAD patient survival.
4.3.6 Expression and methylation patterns of MYBL2 and cg20246907 across tumors:
When a given link is identified by TENETR, the relationship between TF expression and
DNA methylation of the linked probe may only be observed in a small subset of the samples
analyzed. Since I set the min_experimental_count value in the step2_get_diffmeth_regions()
function to 5 for these analyses, as few as 6 tumor samples might actually show increased
expression of MYBL2 and decreased methylation of cg20246907. I therefore compared the
expression level of MYBL2 and methylation of cg20246907 broadly between the tumor and
adjacent normal samples in BRCA, KIRP, and LUAD.
MYBL2 was broadly overexpressed in each of these three cancer types, on average >8
times more so in the tumor samples than the adjacent normal samples in each of the three
cancer types. Combined with its significant survival association in KIRP, this supports a potential
role in this cancer type (Figure 4-16). However, it is not uncommon to find broadly highly
upregulated TFs amongst the most highly linked TFs in each cancer type.
Figure 4-16: MYBL2 is broadly overexpressed in BRCA, KIRP, and LUAD tumor samples
Figure displays boxplots created by the
step7_top_genes_experimental_vs_control_expression_boxplots() function to display expression
levels in normal and tumor samples, as well as t-test results comparing expression of MYBL2 in each
group (A) Results from BRCA (B) Results from KIRP (C) Results from LUAD. In each cancer type,
MYBL2 is greatly and significantly overexpressed in the tumor samples compared to the adjacent
normal ones of each cancer type.
While MYBL2 was broadly upregulated in most tumor samples, DNA methylation
differences of cg20246907 across the bulk of tumor samples compared to the adjacent normal
samples were much less pronounced in each of the three cancer types. Although the
methylation level of cg20246907 was significantly lower in the tumor samples compared to the
adjacent normal samples in all three cancer types, the significance of the difference was not as
strong as it was for the increase in expression of MYBL2 in tumor samples. This is likely due to
the skewed distribution of methylation values for cg20246907 in the tumor samples. Across all
three cancer types, the median methylation values in the tumor samples were not much lower
than those in the adjacent normal samples. Instead, a small portion of tumor samples showed
greatly decreased levels of methylation, with values often half of that of the median level
(Figure 4-17).
Figure 4-17: cg20246907 methylation is decreased in some BRCA, KIRP, and LUAD tumor samples
Figure displays boxplots created using a modified version of the
step7_top_genes_experimental_vs_control_expression_boxplots() function to display methylation
levels in normal and tumor samples, as well as t-test results comparing methylation of cg20246907
in each group (A) Results from BRCA (B) Results from KIRP (C) Results from LUAD. In each cancer
type, while DNA methylation levels of this probe are significantly decreased in tumor samples, this is a
much less pronounced effect on the whole than seen for expression of MYBL2, and seems to be
driven by the sharp decrease in methylation in just a subset of tumor samples.
This phenomenon is clearly visualized by plotting the methylation level of cg20246907
versus MYBL2 expression in each of the three cancer types. In each cancer type, there is a subset
of tumor samples showing increased expression of MYBL2 and decreased methylation of
cg20246907. A portion of these samples with particularly low methylation of cg20246907 dip
below the hypometh cutoff in each of the cancer types, representing the samples which are
driving the identification of cg20246907 as a hypomethylated probe in each of the cancer types.
However, the number of hypomethylated samples in each cancer type is low, with most tumor
samples showing differential expression of MYBL2, but no large differences in cg20246907
methylation (Figure 4-18). This is particularly true for KIRP (Figure 4-18B), though this could also
be due to the relatively smaller number of samples observed in that dataset.
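The pattern described here, a largely unchanged median with only a few samples below the hypomethylation cutoff, can be summarized with a small helper. The sketch below is hypothetical: the sample IDs, β-values, and the cutoff of 0.5 are invented for illustration and are not the values TENET used.

```python
import statistics

def summarize_hypomethylation(tumor_betas, cutoff):
    """Report the median β-value and which tumor samples fall below the cutoff."""
    below = sorted(s for s, b in tumor_betas.items() if b < cutoff)
    return {
        "median": statistics.median(tumor_betas.values()),
        "hypomethylated": below,
        "fraction": len(below) / len(tumor_betas),
    }

# Invented β-values for five tumor samples; only one dips below the cutoff.
tumor_betas = {"T1": 0.82, "T2": 0.80, "T3": 0.79, "T4": 0.35, "T5": 0.81}
summary = summarize_hypomethylation(tumor_betas, cutoff=0.5)
```

In this toy case the median stays high while a single sample accounts for the hypomethylation call, mirroring the skewed distributions seen in the boxplots and scatterplots.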
Figure 4-18: Relationship of MYBL2 expression to cg20246907 methylation across cancer types of
interest
Figure displays scatterplots created using a modified version of the
step7_top_genes_simple_scatterplots() function, which plot the MYBL2 gene expression (in log2
FPKM-UQ) of tumor samples on the x-axes, and DNA methylation β-values of cg20246907 on the y-
axes. Tumor samples are plotted as red dots, and adjacent normal samples are plotted as blue ones.
Additionally, the set hypometh cutoff in each cancer dataset is plotted with the gold dashed line. (A)
Scatterplot of relationship in BRCA (B) Scatterplot of relationship in KIRP (C) Scatterplot of
relationship in LUAD. In general, only a small subset of the tumor samples show significantly
decreased methylation of cg20246907, though a decent portion of tumor samples possess high
expression of MYBL2 regardless.
4.3.7 Investigation of MYBL2 and cg20246907 survival association with clinical covariates:
The previous findings raise more questions than they answer. I had previously illustrated
that methylation of cg20246907 was more strongly associated with patient survival than MYBL2
expression in KIRP. However, MYBL2 was more broadly overexpressed than cg20246907 was
hypomethylated in BRCA, KIRP, and LUAD tumors. Additionally, while relatively few tumors in
each of the three cancer types showed increased expression of MYBL2 along with decreased
methylation of cg20246907, there were a larger number of tumors not showing decreased levels
of cg20246907 methylation, especially in the KIRP dataset. This suggests that there are
additional factors playing a role in their relationship, perhaps clinical covariates.
To investigate if clinical variables from TCGA (patient age, sex, race, and cancer stage)
might play a role, I developed a more complex Cox regression survival model to test the
association of MYBL2 expression and cg20246907 methylation levels with patient survival with
respect to the patients’ age, sex, race, and cancer stage.
Since MYBL2 expression and cg20246907 methylation were strongly associated with
patient survival in only the KIRP dataset, I focused on KIRP for this analysis. In KIRP, 236 of 273
tumor samples had complete data for all the clinical variables of interest. The model involving
MYBL2 expression was strongly predictive of patient survival outcome. Stage IV tumors in
particular showed a strong association with poorer survival outcomes as well in this model
(Table 4-4).
Table 4-4: Complex Cox regression survival analysis of MYBL2 expression in KIRP
Table shows results of Cox regression analysis to assess patient survival in KIRP tumor samples with
respect to MYBL2 expression, sex, age, cancer stage, and race. Variable levels for categorical
variables, or variable units for continuous variables are listed in the first column, with the sample
counts for each group in the second. Output coefficients, hazard ratios, and p-values for individual
variables are listed in columns 3, 4, and 5, with “-” listed for categorical variable levels used as the
reference group. Finally overall model p-values, assessed by likelihood ratio test, Wald test, and log
rank tests are listed below the main table. Overall the model was significantly predictive of patient
survival, with increased MYBL2 expression and stage IV cancer status being significantly associated
with poor patient survival, with respect to the other variables in the model. Statistics for samples
with “Asian” listed for race are not reported as a “Loglik converged before variable” error was
returned.
Similar to the univariate analyses, low methylation of cg20246907 was more strongly
associated with poor patient survival than increased expression of MYBL2. In this model, the
only other covariate associated with survival was stage III tumors. But unlike the model with
MYBL2 expression, the association with survival of stage III tumors was much weaker than the
association of cg20246907 methylation. The overall model involving methylation of cg20246907
was also more significant than the similarly constructed model for MYBL2 expression (Table 4-5).
Given that cancer stage was the only variable that seemed to potentially be associated
with patient survival in the KIRP dataset, along with expression of MYBL2 and methylation of
cg20246907, I next decided to assess how significant MYBL2 expression and cg20246907
methylation were in just the stage I KIRP samples. Not only do stage I KIRP samples compose the
bulk of the tumor samples, but as these represent the least advanced group of tumors, this should
allow me to ascertain how significant these two factors are largely independent of tumor stage.
Additionally, if these factors can significantly predict patient survival in stage I tumors, then
that would increase their significance for further study, as it would be more likely they, at the
very least, could help predict patients with poorer prognosis, even if they aren’t involved in the
process of tumor progression themselves.
Table 4-5: Complex Cox regression survival analysis of cg20246907 methylation in KIRP
Table shows results of Cox regression analysis to assess patient survival in KIRP tumor samples with
respect to cg20246907 methylation, sex, age, cancer stage, and race. Variable levels for categorical
variables, or variable units for continuous variables are listed in the first column, with the sample
counts for each group in the second. Output coefficients, hazard ratios, and p-values for individual
variables are listed in columns 3, 4, and 5, with “-“ listed for categorical variable levels used as the
reference group. Finally overall model p-values, assessed by likelihood ratio test, Wald test, and log
rank tests are listed below the main table. Overall the model was very significantly predictive of
patient survival, with decreased cg20246907 methylation and stage III cancer status being
significantly associated with poor patient survival, with respect to the other variables in the model.
Statistics for samples with “Asian” or “American Indian or Alaskan Native” listed for race are not
reported as a “Loglik converged before variable” error was returned.
I performed univariate Cox regression survival analyses on the expression of MYBL2 and
methylation of cg20246907, this time with only stage I KIRP tumors, and split these samples into
quartiles based on their expression or methylation values and plotted survival curves. MYBL2
expression was not significantly associated with patient survival in this group (Figure 4-19A).
However, low methylation of cg20246907 trended towards significant association with poor
survival in the stage I KIRP patients alone (Figure 4-19B). This suggests this marker could be
predictive of poor survival even in early stage KIRP tumors, warranting further investigation.
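The quartile split used for these survival curves can be illustrated with a minimal sketch. It is written in Python using only the standard library, it is not the TENETR code, and the function name and the sample values are hypothetical.

```python
from statistics import quantiles

def quartile_groups(values):
    """Assign each sample to Q1..Q4 according to the quartile of its value."""
    # quantiles(..., n=4) returns the three cut points between the four groups
    q1, q2, q3 = quantiles(list(values.values()), n=4, method="inclusive")
    groups = {}
    for sample, value in values.items():
        if value <= q1:
            groups[sample] = "Q1"
        elif value <= q2:
            groups[sample] = "Q2"
        elif value <= q3:
            groups[sample] = "Q3"
        else:
            groups[sample] = "Q4"
    return groups

# Hypothetical methylation beta values for eight tumors (not TCGA data)
methylation = {"s1": 0.91, "s2": 0.35, "s3": 0.88, "s4": 0.62,
               "s5": 0.95, "s6": 0.20, "s7": 0.79, "s8": 0.85}
groups = quartile_groups(methylation)
```

Each group would then be plotted as its own survival curve, with Q1 holding the least methylated (or lowest expressing) samples.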
Figure 4-19: Survival curves for MYBL2 and cg20246907 in Stage I KIRP
Figure shows survival curves of tumor samples. (A) Tumor samples were split into 4 groups based on
their quartile of expression of MYBL2. (B) Tumor samples were split into 4 groups based on quartile
of methylation of cg20246907. The univariate Cox regression survival p-values from each analysis are
also displayed. MYBL2 expression was not significantly associated with patient survival in this group,
but cg20246907 methylation very nearly was. (Q4 samples in MYBL2 plot are hidden behind curve for
Q2 samples)
4.3.8 Identification of potential downstream targets for enhancer marked by cg20246907:
As a final analysis of the MYBL2 and cg20246907 link in KIRP, I tried to identify potential
downstream target genes for the putative distal enhancer marked by cg20246907. Enhancers
are thought to regulate genes within the same TAD as the enhancer [32]. I therefore sought out
publicly-available datasets for TAD information from KIRP-relevant sources. Unfortunately, I was
not able to locate any, so I compiled 40 hg38-annotated TAD datasets from the 3D genome
browser (http://3dgenome.fsm.northwestern.edu/downloads/hg38.TADs.zip) from a variety of
cellular sources to identify genes likely located in the same TAD as the putative enhancer
marked by cg20246907.
I identified 45 genes in the same TAD as cg20246907 in at least one of the TAD files. A
strong candidate target gene for the putative cg20246907 enhancer would be strongly
associated with poor KIRP patient survival and broadly upregulated in KIRP tumor samples. I
used the step7_top_genes_TAD_tables() and step7_top_genes_cox_survival() functions to
identify all genes in the same TAD as cg20246907, perform univariate Cox regression survival
analyses on each of them, and calculate their mean expression in the KIRP adjacent
normal and tumor samples.
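The "same TAD in at least one TAD file" criterion can be sketched as follows. This is a Python illustration under stated assumptions, not the step7_top_genes_TAD_tables() implementation: the function name, the data layout, and every coordinate are hypothetical, and a gene is counted if any part of it overlaps a TAD that contains the probe.

```python
def genes_sharing_tad(probe_pos, tad_files, genes):
    """Return genes that share a TAD with the probe in at least one TAD file.

    probe_pos: (chrom, position)
    tad_files: {file_name: [(chrom, start, end), ...]}
    genes:     {gene_name: (chrom, start, end)}
    """
    chrom, pos = probe_pos
    hits = set()
    for tads in tad_files.values():
        for t_chrom, t_start, t_end in tads:
            if t_chrom != chrom or not (t_start <= pos < t_end):
                continue  # this TAD does not contain the probe
            for gene, (g_chrom, g_start, g_end) in genes.items():
                # keep the gene if any part of it overlaps the TAD
                if g_chrom == chrom and g_start < t_end and g_end > t_start:
                    hits.add(gene)
    return sorted(hits)

# All coordinates below are made up for illustration.
tad_files = {
    "cell_type_A": [("chr1", 0, 1_000_000), ("chr1", 1_000_000, 2_500_000)],
    "cell_type_B": [("chr1", 800_000, 1_800_000)],
}
genes = {
    "GENE_A": ("chr1", 1_200_000, 1_250_000),
    "GENE_B": ("chr1", 850_000, 900_000),
    "GENE_C": ("chr1", 3_000_000, 3_050_000),
}
shared = genes_sharing_tad(("chr1", 1_500_000), tad_files, genes)
```

Here GENE_B is reported only because of the second file, which shows how pooling TAD calls from many cell types widens the candidate list.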
Of the 45 genes, only 7 were nominally significantly associated with KIRP patient
survival, and none of them remained significant after Bonferroni correction. The three most
survival-associated genes in this group, FARP1, MIR3170, and DOCK9-DT, all showed decreased
expression in tumor samples compared to normal samples, and lower expression of these genes
was associated with poorer patient survival. The top gene showing increased expression
associated with poorer patient prognosis as well as increased expression in KIRP tumors was
KRTAP3-1, a pseudogene which showed very low levels of expression in both normal and tumor
samples. The best target behaving in the expected direction of effect was HMGB3P4, a
pseudogene of HMGB3 for which there has been some prior evidence for its promotion of breast,
colorectal, and neuroblastoma tumors [147–149]. However, this gene is only weakly significantly
associated with patient survival, and expression is low in tumor and adjacent normal samples,
with no significant difference in expression between them (Figure 4-20).
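The distinction between nominal and Bonferroni significance amounts to comparing each p-value against 0.05 and against 0.05 divided by the number of genes tested. A minimal Python sketch follows; the function name and the p-values are hypothetical, and only the count of 45 tests is taken from the analysis above.

```python
def bonferroni_classify(p_values, alpha=0.05):
    """Label each gene by whether it passes the nominal or Bonferroni cutoff."""
    cutoff = alpha / len(p_values)  # Bonferroni: divide alpha by the number of tests
    labels = {}
    for gene, p in p_values.items():
        if p < cutoff:
            labels[gene] = "bonferroni"
        elif p < alpha:
            labels[gene] = "nominal"
        else:
            labels[gene] = "not significant"
    return cutoff, labels

# Hypothetical p-values for 45 genes
p_values = {f"GENE_{i}": 0.5 for i in range(43)}
p_values["GENE_X"] = 0.02    # below 0.05 only
p_values["GENE_Y"] = 0.0005  # below 0.05 / 45
cutoff, labels = bonferroni_classify(p_values)
```

With 45 tests the corrected cutoff is roughly 0.0011, so a gene with p = 0.02 is nominally significant but does not survive correction.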
Thus, perhaps due to the lack of a kidney or KIRP-relevant TAD file, I did not find a
strong potential target gene. To address this, I decided to expand my analysis. Not knowing how
the TAD architecture might be altered at this locus in KIRP tumors, I calculated the average TAD
size in the 40 TAD datasets, which was approximately 1.5 Mb, larger than the 1 Mb commonly
suggested for TAD size [150–154]. I identified 63 genes within 1.5 Mb upstream or downstream of
cg20246907.
Amongst these 63 genes, I identified 12 genes that were nominally significantly
associated with patient survival, including three that were Bonferroni-significant. Two of these
genes, ZIC2 and ZIC5, were upregulated in tumors and high expression of these genes was
associated with poor patient survival. However, these genes are very lowly expressed in both
normal and tumor samples, which might make them difficult to study further. The last gene in
this group, CLYBL, was highly expressed in normal and tumor samples, but was downregulated
in tumors and low expression of the gene was associated with poorer patient survival.
Figure 4-20: Analysis of HMGB3P4 as a potential target gene of the cg20246907 enhancer
(A) Survival curves for KIRP tumor samples based on their quartile of HMGB3P4 expression. Although
samples with the lowest quartile of expression have the best survival outcomes, samples with the
highest quartile of expression perform better than samples with middling expression. This is
reflected in the overall Cox regression p-value, which is nominally significant for this gene, but not
after correcting for the other genes included in the analysis. (B) Boxplot of expression levels of
HMGB3P4 given as log2-transformed FPKM-UQ values in KIRP adjacent normal and tumor samples.
HMGB3P4 has a higher median expression in tumor compared to adjacent normal samples, but there
is a sizeable portion of samples in each group which do not express the gene. Furthermore, a
Student’s t-test comparing expression between groups shows the difference in expression between
them is not significant.
One potentially interesting target identified in the vicinity of cg20246907 is LINC01232,
a long non-coding RNA that is modestly expressed in both normal and tumor samples, but
shows significant upregulation in KIRP tumors (Figure 4-21B). High expression of this gene is
modestly associated with patient survival, though it does not meet the Bonferroni significance
threshold (Figure 4-21A).
Figure 4-21: Analysis of LINC01232 as a potential target gene of the cg20246907 enhancer
(A) Survival curves for KIRP tumor samples based on their quartile of LINC01232 expression. (B)
Boxplot of expression levels of LINC01232 given as log2-transformed FPKM-UQ values in KIRP
adjacent normal and tumor samples. LINC01232 is modestly but still broadly expressed across all
normal and tumor samples, and a Student’s t-test comparing expression between groups shows the
difference in expression between them is strongly significant.
4.4 Conclusions:
TENETR’s primary function is to identify important TF genes and regulatory regions which are
dysregulated in a given dataset compared to control samples. Initial steps of the TENETR method use
matched gene expression and DNA methylation data to identify dysregulated TFs and regulatory
regions, link these two types of elements together, and prioritize TFs based on the number of
regulatory regions linked to them. Without any other context, these are the TFs most likely to have
widespread effects on the genomes of the case samples at large.
TENETR can be run on a user’s own data, on publicly-available data as I have illustrated here,
or on a combination of the two. It can be a valuable hypothesis generation tool, or be used to
provide supporting data for an established study. However, such functionality of TENETR is limited
within the first series of functions I detailed in Chapter 3, as these functions do little to
contextualize their results. While we might assume a given TF is particularly important based on its
overall number of linked regulatory element DNA methylation probes, TENET often returns many TFs
linked to hundreds, if not thousands, of regulatory element probes. Additionally, the single most
highly linked TF may not necessarily represent the best target for study. To address these concerns,
I wrote the step7 functions into the TENETR package to perform downstream analyses on the highest
ranked TF candidates as well as their linked probes. I have shown some of the analyses I have
performed on the pan-cancer results as an illustration of the type of downstream analyses a user can
perform.
To reduce the complexity of my analyses, I first isolated a panel of 108 TFs, composed of those
that were highly ranked individually in each cancer type or highly ranked across the 12 cancer types
combined. I then performed independent clustering of the expression Z-scores and ranks across the 12
cancer types for each of these 108 TFs and identified a closely clustered group of 8 TFs, including CENPA,
MYBL2, FOXM1, E2F1, E2F2, E2F7, E2F8, and DNMT1 (Supplemental Data 4-2).
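The panel selection rule can be sketched as follows, using the cutoffs given in the Chapter 5 summary of this analysis (top 10 in at least one cancer type, or top 20 by rank summed across all cancer types). The Python function below is an illustration only: its name, arguments, and the toy ranks are hypothetical, and it assumes every TF has a rank in every cancer type so that the summed ranks are comparable.

```python
def build_tf_panel(ranks, per_type_cutoff=10, summed_cutoff=20):
    """Select TFs ranked highly in any one cancer type or highly overall.

    ranks: {cancer_type: {tf: rank}}, where rank 1 = most linked probes.
    """
    panel = set()
    summed = {}
    for tf_ranks in ranks.values():
        for tf, rank in tf_ranks.items():
            if rank <= per_type_cutoff:
                panel.add(tf)  # highly ranked in this individual cancer type
            summed[tf] = summed.get(tf, 0) + rank
    # lowest summed rank = most highly ranked across all cancer types
    top_summed = sorted(summed, key=lambda tf: (summed[tf], tf))[:summed_cutoff]
    return panel | set(top_summed)

# Toy ranks for three TFs in two cancer types, with small cutoffs for illustration
toy_ranks = {"TYPE_X": {"A": 1, "B": 2, "C": 3},
             "TYPE_Y": {"A": 3, "B": 1, "C": 2}}
panel = build_tf_panel(toy_ranks, per_type_cutoff=1, summed_cutoff=1)
```

In the toy example A and B each enter through the per-type rule, B also has the best summed rank, and C is excluded.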
Amongst the 12 cancer types, I found the 8 clustered TFs were most highly ranked in BRCA,
KIRP, and LUAD cancers. Based on this finding, I analyzed the association of expression of the 108 TFs
with patient survival in these three cancer types, as well as the methylation of their linked probes.
Within this analysis I also ranked each of the 108 TFs and their linked probes by the significance of their
survival association to identify the most significantly-associated TFs and probes in each cancer type.
From this analysis, I identified that members of the 8 clustered TFs were among the most highly
associated TFs with patient survival in the KIRP and LUAD datasets, but not BRCA. Additionally, I found
that the survival association of the most survival-associated DNA methylation probes surpassed those of
the TFs themselves, even after correcting for the increased number of probes compared to TFs analyzed.
Amongst the TFs and probes I analyzed, I focused on MYBL2 and its linked probe cg20246907, which
were the most strongly survival-associated TF and probe across the three cancer types I assessed.
I found that cg20246907 was identified as an enhancer probe in all 12 cancer types and was
hypomethylated in 10 of them. However, it was only linked to TFs in KIRC and KIRP (including MYBL2 in
each), and only showed a strong association with patient survival in KIRP. Additionally, while MYBL2 was
highly and broadly upregulated in BRCA, KIRP, and LUAD cancers, cg20246907 showed much less
hypomethylation across tumor samples even in KIRP. Although a large number of tumor samples highly
expressed MYBL2, only a fraction of them showed hypomethylation of cg20246907. MYBL2 expression
and cg20246907 methylation, along with cancer stage, were found to be significant predictors of poor KIRP
patient survival in a model considering several clinical variables. cg20246907 even showed a trend
towards patient survival significance in just stage I KIRP samples.
I attempted to identify a target gene of the putative enhancer near cg20246907, but no genes
met all the criteria I consider for a strong candidate, including being found in the same TAD as
cg20246907, strong and broad overexpression in tumor samples compared to adjacent normal samples,
and high expression being associated with poor patient survival.
My analyses investigated only a small portion of the data from TENET, and much remains to be
investigated using the methods I have illustrated in this chapter. In the next chapter, I have also detailed
several additional analyses and potential updates to the TENETR method.
4.5 Supplemental files:
Supplemental Data 4.1: Table of top TFs by overall rank across 12 cancer types (.xlsx file)
Table lists TFs by increasing rank of number of linked hypomethylated probes in each of the 12 cancer
types, as well as the additive rank across all 12. Hosted at:
https://github.com/DanielJMullen/Daniel_Mullen_Thesis
Supplemental Data 4.2: Table of top TFs by overall rank across 12 cancer types (.xlsx file)
Table lists the identities of the 108 highly-ranked TFs of interest in this chapter, as well as their rank of
number of linked hypomethylated probes in each of the 12 cancer types, as well as the additive rank
across all 12. Hosted at: https://github.com/DanielJMullen/Daniel_Mullen_Thesis
Chapter 5: Conclusions and future directions for TENET study
This chapter sums up my work and details remaining analyses and potential improvements to
the TENETR method. My ideas for future directions here are split between those regarding how
TENETR’s core functionality could be improved or expanded upon, as well as additional analyses I would
perform to further characterize my findings from the analysis of these 12 cancer types.
5.1 Conclusions
5.1.1 TENET 2.0 findings in LUAD
My first major project was to update the original TENET method, which, since its initial
publication, had been underutilized [35]. I updated the method to make it functional again,
including a new dataset of TFs [70], increased speed, and several new functions. I then applied this
method to a new cancer type, LUAD, to identify dysregulated TFs and linked enhancers. From
this study, I focused especially on CENPA, FOXM1, and MYBL2, which were linked to large
numbers of activated enhancer regions, were highly overexpressed in LUAD tumors, were associated
with poor patient survival, and seemed like they could be working in concert based on their
similarity in expression and linked probes.
I also discovered that expression of these three TFs and the methylation of their linked
probes seem to denote a specific subgroup of LUAD tumors. This group of LUAD tumors seemed
to have a lot of similarity to a group of primarily basal subtype BRCA tumors, though
interestingly while both groups had particularly high expression of CENPA, FOXM1, and MYBL2,
the TFs were linked to distinct groups of probes in each dataset.
Finally, I also identified DNA methylation probes linked to CENPA, FOXM1, and MYBL2
which were significantly associated with LUAD patient survival. Then, with the help of Chunli Yan,
we performed siRNA knockdown of FOXM1 and MYBL2 expression, then used RNA-seq to
identify genes downstream of the enhancers marked by survival-associated probes, which might
be potential targets for those enhancers. From this, I identified a potential relationship between
MYBL2 and its linked probe cg09580922. Both of these were associated with poor
LUAD patient survival, and within the same TAD as cg09580922 I discovered the gene TK1, which
was downregulated following MYBL2 siRNA treatment and itself was also associated with poor LUAD
patient survival.
5.1.2 The TENETR package
Following up from my study investigating LUAD using TENET 2.0 [34], I continued updating
the TENET method, redesigning it to increase processing speed and add new functionality. The
culmination of this work is TENETR, an R package version of the TENET method. I
have detailed the full workings of TENETR, including new functionalities such as the inclusion of
consensus epigenomic datasets, the ability to assess enhancer and promoter regions, a new
algorithm to set methylation cutoffs (for enhancer probes), and the ability to assess TF genes
only in order to increase computational speed considerably. I have re-written a number of the
functions to limit the number of input-output activities and reduce runtimes.
As an R package hosted on GitHub (or potentially repositories like Bioconductor as a
future aim), changes or other additions can be made much more readily and transparently than
in previous TENET iterations, ensuring that TENETR can remain the final platform for the
method, even as functions are added or tinkered with, barring a major overhaul of the core
methodology. As an R package hosted publicly online, it will be much more visible and easily
accessed by outside users interested in the package.
5.1.3 TENETR pan-cancer findings
To exhibit the TENETR method, I helped drive the most ambitious TENET study to date,
analyzing 12 TCGA cancer types. As part of this analysis, I downloaded RNA-seq, DNA
methylation, and clinical data for each of the 12 cancer types from the TCGA, and with the help
of co-authors, we assembled hundreds of publicly available epigenomic datasets specific to each
cancer type to include in the study. Combining these datasets, I used TENETR to identify over
80,000 DNA methylation probes which mark the activity of distal enhancer elements, most of
which were actually found in common between two or more of the cancer datasets. Of these,
almost 20,000+ probes in each cancer dataset were identified to be hypomethylated in a subset
of the tumor samples, indicating an activation of the distal enhancer regions. TENETR linked
these probes to TF genes which were overexpressed in the 12 cancer types. I have listed the top
10 TFs in each cancer type by number of linked hypomethylated probes which represent the TFs
most likely to have the largest impact in the genomes in tumors of the cancer types.
From the TF lists, I developed a panel of 108 highly ranked TFs, composed of TFs that
were ranked among the top 10 in at least one of the cancer types, or were ranked among the
top 20 summed across all 12 cancer types. Unsupervised clustering of the expression levels of
the 108 TFs across all the tumor samples of the 12 cancer types identified a group of 8 TFs that
clustered particularly closely, indicating they are likely coregulated and could be operating in
concert with each other. These TFs included CENPA, DNMT1, E2F1, E2F2, E2F7, E2F8, FOXM1,
and MYBL2, all 8 of which were included amongst the top 20 TFs in terms of total ranks across
cancer types, and were relatively highly ranked in BRCA, KIRP, and LUAD cancer types.
Examination of the survival association of the 108 TFs in BRCA, KIRP, and LUAD cancers, as well as
the survival association of all the probes linked to each of the TFs, revealed that the
linked probes tended to be more significantly associated with patient survival, even after
correcting for multiple testing, than the TFs themselves. Focusing on MYBL2, the most
significantly survival-associated TF across the BRCA, KIRP, and LUAD datasets, and its most
significant linked probe cg20246907, I found that while cancer stage of KIRP was a significant
covariate along with MYBL2 expression, methylation of cg20246907 was by far the most
significant predictor of survival. However, when plotting the expression of MYBL2 along with the
methylation of cg20246907, there were relatively few tumor samples that showed severe
hypomethylation of the probe, while there were many more tumor samples with
overexpression of MYBL2, suggesting another factor may play a role in these patients.
In sum, the application of TENETR has provided a wealth of information on the epigenetic
alterations across cancer types. The analyses I present exemplify the sorts of downstream
analyses that can be performed to mine TENETR results and select specific observations for
further study, including in vitro experimentation [34].
5.2 Future developments for the TENETR package
5.2.1 Develop TENETR for inclusion on Bioconductor
Bioconductor (https://www.bioconductor.org/) is a commonly used repository of R
packages designed with the goal of advancing biological research and data analysis. Including
TENETR in the Bioconductor repository would allow additional potential users to find and access
the method and would be an important achievement for the method. However, Bioconductor
has a series of specific guidelines that submitted packages are required to meet for inclusion,
including the use of vignettes and use of established classes and functions when handling
specific types of data, such as .bed files. Guidelines and the process for package submission to
Bioconductor can be found at https://contributions.bioconductor.org/index.html.
Submitting TENETR to Bioconductor could be a very feasible goal, as the page above also
details specific functions Bioconductor curators have developed to help test prospective
packages for errors and other formatting concerns when applying to Bioconductor. Adapting
TENETR to submit it for Bioconductor inclusion will require effort from someone with good R
coding skills and an eye for detail to sift through the package and make the changes necessary
for TENETR’s submission, but could be valuable to advance this project.
5.2.2 Add a sparse analysis option
TENETR already has the built-in capability to assess either TF genes only, or all genes in
the dataset. I propose to add another analysis type into TENETR, with the goal of reducing the
number of TF genes each individual probe is linked to. Trying to identify individual TF gene to
probe links for further analysis, such as the MYBL2 to cg20246907 link I investigated, can be
confounded by the number of different TFs each probe is linked to. Based on data
from BRCA, KIRP, and LUAD, each probe is linked to an average of 5-7 TFs, though this number is
much higher for probes linked to some TFs, such as the 8 clustered TFs of interest. When TFs are
more correlated in expression, it is more likely that a given probe will be linked to each of
those TFs. This can make identifying individual links challenging.
To this end I would consider developing a “sparse” option, with the goal of being more
stringent when linking probes and TFs. This would be done by adding additional checks when
establishing a probe to TF link. Most of the optimization parameters meant to cut down on the
number of probe to TF links identified are focused around the expression of the TFs in the
step5_optimize_links() function. While the step4_permutate_z_scores() function does purport
to do this, it is not very restrictive on the probes at default settings. This is because, even if
there are conservatively only 1000 hypermethylated or hypomethylated probes being analyzed, a
much smaller number than seen here in these analyses, for a given one of these probes to be
filtered by that step, there would have to be at least 49 other probes with Z-scores more
significant than it (a rank of 50 gives 50/1000 = 0.05, which is not < 0.05). As the number of
hypermethylated or hypomethylated probes identified
increases, this method will become less and less stringent.
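The arithmetic behind this point can be made explicit with a short Python sketch. It assumes, as my reading of the example above and not as a statement of how step4_permutate_z_scores() is coded, that a probe's permutation p-value is its Z-score rank divided by the number of probes compared, and that a link is kept when this value is strictly below the cutoff. The function name is hypothetical.

```python
def max_passing_rank(n_probes, cutoff=0.05):
    """Largest Z-score rank whose rank-based p-value is still below the cutoff.

    Assumes p = rank / n_probes, with rank 1 being the most significant probe.
    """
    rank = 0
    while (rank + 1) / n_probes < cutoff:
        rank += 1
    return rank

# With 1000 probes, ranks 1-49 pass and rank 50 (p = 0.05) is filtered
passing_1000 = max_passing_rank(1000)
# With 20,000 probes, far more links per gene can pass the same cutoff
passing_20000 = max_passing_rank(20_000)
```

Under this assumption the number of probes that can pass for a given gene grows in proportion to the number of probes analyzed, which is why the filter loosens as more hypermethylated or hypomethylated probes are identified.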
One potential way to address this would be to add an argument into the
step4_permutate_z_scores() or step5_optimize_links() functions to allow the user to select a
more stringent permutation p-value for filtering. Alternatively, some sort of multiple testing
correction could be performed on the calculated p-values. A false discovery rate method could
be particularly valuable here, since it is less conservative than a family-wise correction such
as Bonferroni while still favoring the most highly ranked observations.
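As one example of a false discovery rate method, the Benjamini-Hochberg adjustment can be sketched in a few lines of Python. This is a generic illustration of that procedure, chosen by me as a common option, and is not part of TENETR; the function name and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical permutation p-values for four probe-to-TF links
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Links would then be kept when their adjusted value falls below the chosen false discovery rate, so the cutoff applied to each link depends on its rank among all links tested.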
5.2.3 Combine TENET probe classification with regression analysis to link TFs to probes
One of the TENET methodology’s unique strengths is its ability to identify relationships
between the expression of a given TF and the methylation of a given DNA methylation probe,
even if only present in a relatively small number of samples as in Figure 4-17. However, the
number of hypermethylated or hypomethylated samples for a given probe can often be much
smaller than the number of samples showing increased or decreased expression of a given TF,
suggesting other factors may be at play, raising the question of the significance of the limited set
of observed methylation changes. It might therefore be useful to increase the
min_experimental_count value in order to isolate TF to probe links which have a larger number
of tumor samples that are hypermethylated or hypomethylated for a given probe. Alternatively,
a regression analysis could be used to identify TF to probe links by identifying those
with a significant association between the expression of the TF and the methylation of the probe
in the samples that are identified as being hypermethylated or hypomethylated for the given
probe. This should have the benefit of identifying TF to probe links where there is a clear
relationship between the two, at least in the hypermethylated or hypomethylated samples, and
would have the potentially added benefit of identifying more links when there are a larger
number of samples that are hypermethylated or hypomethylated, due to the influence of sample
size when determining if an association is significant.
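A minimal Python sketch of this idea follows, restricted to hypomethylated samples and using a simple linear association (Pearson correlation and its t statistic). It is only one possible form of the proposed regression: the function name, the cutoff argument, and the sample values are hypothetical, and a p-value would still need to be taken from a t distribution with n - 2 degrees of freedom.

```python
from math import sqrt

def association_in_hypomethylated(expression, methylation, hypometh_cutoff):
    """Correlate TF expression with probe methylation in hypomethylated samples only.

    Returns (n, r, t): sample count, Pearson r, and the t statistic for r.
    """
    pairs = [(expression[s], methylation[s]) for s in methylation
             if s in expression and methylation[s] < hypometh_cutoff]
    n = len(pairs)
    if n < 3:
        return n, None, None  # too few samples to assess an association
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return n, None, None  # no variation in one of the variables
    r = sxy / sqrt(sxx * syy)
    # t grows with sqrt(n - 2), so more hypomethylated samples give more power
    t = r * sqrt((n - 2) / (1 - r * r)) if abs(r) < 1 else float("inf") * r
    return n, r, t

# Hypothetical values: sample "e" is not hypomethylated and is excluded
expression = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 10.0}
methylation = {"a": 0.40, "b": 0.32, "c": 0.20, "d": 0.10, "e": 0.90}
n, r, t = association_in_hypomethylated(expression, methylation, 0.5)
```

Because the test statistic scales with the number of samples included, links supported by only a handful of hypomethylated tumors would rarely reach significance under this approach.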
5.3 Additional bioinformatic analyses
5.3.1 Further analysis of data generated by TENETR across 12 cancer types
The data I have presented here represents only a small portion of the total amount of
data output from all 12 cancer types. For example, we also acquired H3K4me3 ChIP-seq
datasets, a chromatin mark associated with active promoters, from cancer type-relevant
sources. There is also a large amount of distal enhancer data that can be further analyzed. The
top TFs identified and their associated linked probes in the individual cancer datasets remain
largely unexplored. My analyses have focused on the panel of 108 highly ranked TFs and within
them, largely on the 8 clustered TFs and the BRCA, KIRP, and LUAD cancers. So even in terms of
distal, hypomethylated enhancer probes analyses, there is still a considerable amount of data
that can be further examined.
5.3.2 Integrate tumor purity, mutational burden, and cancer type-specific molecular
alterations into analyses
In my previous TENET study, I had found that the putatively hypomethylated cluster of
LUAD samples with high expression of, and links to, the key TFs CENPA, MYBL2, and FOXM1 was
also associated with higher tumor mutation count [34]. Given these TFs are known to be involved
with DNA damage response and repair [100,155–158], and having now observed similar trends for
these TFs across other cancer types, integrating mutational count information into the analyses I
have presented here may aid in narrowing down key TF to probe links for further study.
Additionally, each cancer type has a different set of common driver mutations.
Analyzing their status could be another possible route to better understand why certain TF to
probe links might be more associated with patient survival, as well as why some tumor samples
may be hypomethylated for a given probe and others not despite both having increased
expression of a given TF gene. Similarly, tumor purity could also play a potential role in this and
could be especially important as a check when identifying specific TF to probe links to ensure
the observed link isn’t merely due to impurity of specific tumor samples.
5.3.3 Integrate TF binding prediction and ChIP-seq data to prioritize TF to probe links
Another type of data that could still be included to better prioritize TF to probe links is
the use of TF binding prediction information or existing ChIP-seq data. First, the existence of a
predicted binding motif or ChIP-seq peak for a given TF in the vicinity of its linked probe would
increase the likelihood the regulatory element marked by that probe is being regulated by that
TF, by providing some evidence the TF of interest is binding to the regulatory element and
influencing its activity.
The step7_linked_probe_motif_searching() function has been included in TENETR to aid
in such an analysis. This function allows analysis of individual probes for motif occurrences, and
enables the user to supply their own PWM for motif searching. However, using this function
requires manual tuning as the distance_from_probes, matchPWM_min_score, and length of the
specified PWM affect the frequency of motif identification. In addition, the use of this function
requires the user have a PWM for the motif they wish to analyze, which may not be available for
all TFs.
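How the score cutoff and PWM length affect the number of motif hits can be illustrated with a toy scan. This Python sketch is not the step7_linked_probe_motif_searching() implementation: the function and parameter names are hypothetical, and it assumes the minimum score is expressed as a fraction of the highest score the PWM can achieve. The sequence stands in for the region within distance_from_probes of a probe.

```python
def scan_pwm(sequence, pwm, min_score_fraction=0.8):
    """Return (start, window, score) for every window scoring above the cutoff.

    pwm: list of {base: weight} dicts, one per motif position.
    """
    width = len(pwm)
    max_score = sum(max(column.values()) for column in pwm)
    threshold = min_score_fraction * max_score
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # bases missing from a column contribute a weight of zero
        score = sum(column.get(base, 0.0) for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Toy three-position PWM for the motif "ACG" and a made-up flanking sequence
toy_pwm = [{"A": 1.0}, {"C": 1.0}, {"G": 1.0}]
flank_sequence = "TTACGTTACT"
strict_hits = scan_pwm(flank_sequence, toy_pwm, min_score_fraction=1.0)
relaxed_hits = scan_pwm(flank_sequence, toy_pwm, min_score_fraction=0.6)
```

Lowering the cutoff from 100% to 60% adds a second, imperfect match, and a longer flanking sequence or a shorter PWM would likewise raise the number of hits, which is why these settings need tuning per TF.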
Acquisition and analysis of ChIP-seq data would be a powerful tool to validate TF to
probe links. The step7_top_genes_user_peak_overlap() function allows the user to identify
probes linked to the top TFs that lie in the vicinity of peak files supplied by the user for the TF of
interest. A caveat is that such datasets may be difficult to find for all TFs and should be acquired
from a relevant cell type. As an example, very few publicly-available ChIP-seq datasets exist for
MYBL2, with none of particular relevance for BRCA, KIRP, or LUAD datasets until very recently [159].
Generation of such datasets for a TF of interest identified through TENETR could be a
particularly valuable follow-up experiment.
5.3.4 Utilize other bioinformatic programs to help identify target genes for linked probes
The main function of TENETR is to identify upstream TFs with altered
expression patterns in the case vs. control samples, as specified by the user, and link these TFs
to DNA methylation probes whose methylation levels mark the activity of putative regulatory
elements controlled by those TFs.
To further expand on TENETR findings, it might be of interest to use other programs,
such as ELMER v.2 [160]. ELMER v.2 is primarily focused on bioinformatically predicting target genes
for regulatory elements using very similar data to TENETR. It may be worth using ELMER v.2 to
validate specific targets which are identified before embarking on in vitro analyses.
5.3.5 Use machine learning methods to develop a model to predict patient survival based
on key DNA methylation probes identified with TENETR
As I had previously noted, methylation of cg20246907 trended towards significance in
terms of predicting patient survival in just stage I tumors. Given this finding, a more
sophisticated model, potentially including the other 200+ hypomethylated DNA methylation
probes linked to the 108 highly ranked TFs in the KIRP dataset which were also significantly
associated with patient survival after multiple testing correction, may be even more predictive.
To build such a model I would investigate using regularized Cox regression approaches,
which can be implemented using the glmnet package
(https://cran.r-project.org/web/packages/glmnet/index.html), an R package for fitting lasso and
elastic-net penalized regression models, including Cox proportional hazards models. A regularized
approach would be useful when building a survival model for KIRP because the penalty can shrink or
remove uninformative probes among the 200+ candidates, reducing the complexity of the model and
the risk of overfitting.
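For reference, a regularized Cox model of this kind is usually fit by maximizing a penalized partial likelihood. The formulation below is a sketch of the standard elastic-net form, assuming no tied event times. The symbols are introduced here for illustration and do not appear elsewhere in this work: x_i is the vector of probe methylation values for patient i, delta_j = 1 marks an observed death, R_j is the set of patients still at risk at the event time of patient j, lambda sets the overall penalty strength, and alpha mixes the lasso (alpha = 1) and ridge (alpha = 0) penalties. The 2/n scaling is one common convention.

```latex
\hat{\beta} = \arg\max_{\beta}
\left[
  \frac{2}{n} \sum_{j \,:\, \delta_j = 1}
  \left(
    x_j^{\top}\beta - \log \sum_{i \in R_j} \exp\!\left(x_i^{\top}\beta\right)
  \right)
  - \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right)
\right]
```

With alpha close to 1, many coefficients are driven exactly to zero, so the fitted model retains only a subset of probes. In practice lambda would be chosen by cross-validation.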
References:
1. DeSantis, C.E., Miller, K.D., Dale, W., Mohile, S.G., Cohen, H.J., Leach, C.R., Goding Sauer, A.,
Jemal, A., and Siegel, R.L. (2019). Cancer statistics for adults aged 85 years and older, 2019. CA
Cancer J Clin 69, 452–467. 10.3322/caac.21577.
2. Siegel, R.L., Miller, K.D., Fuchs, H.E., and Jemal, A. (2022). Cancer statistics, 2022. CA Cancer J Clin
72, 7–33. 10.3322/caac.21708.
3. Yabroff, K.R., Wu, X.-C., Negoita, S., Stevens, J., Coyle, L., Zhao, J., Mumphrey, B.J., Jemal, A., and
Ward, K.C. (2022). Association of the COVID-19 Pandemic With Patterns of Statewide Cancer
Services. JNCI: Journal of the National Cancer Institute 114, 907–909. 10.1093/jnci/djab122.
4. Mendiratta, G., Ke, E., Aziz, M., Liarakos, D., Tong, M., and Stites, E.C. (2021). Cancer gene
mutation frequencies for the U.S. population. Nat Commun 12, 5961. 10.1038/s41467-021-
26213-y.
5. Comprehensive molecular characterization of human colon and rectal cancer (2012). Nature 487,
330–337. 10.1038/nature11252.
6. Comprehensive molecular portraits of human breast tumours (2012). Nature 490, 61–70.
10.1038/nature11412.
7. Integrated genomic analyses of ovarian carcinoma (2011). Nature 474, 609–615.
10.1038/nature10166.
8. Cai, W.-Q., Zeng, L.-S., Wang, L.-F., Wang, Y.-Y., Cheng, J.-T., Zhang, Y., Han, Z.-W., Zhou, Y.,
Huang, S.-L., Wang, X.-W., et al. (2020). The Latest Battles Between EGFR Monoclonal Antibodies
and Resistant Tumor Cells. Front Oncol 10. 10.3389/fonc.2020.01249.
9. Comprehensive molecular profiling of lung adenocarcinoma (2014). Nature 511, 543–550.
10.1038/nature13385.
10. Onitilo, A.A., Engel, J.M., Greenlee, R.T., and Mukesh, B.N. (2009). Breast Cancer Subtypes Based
on ER/PR and Her2 Expression: Comparison of Clinicopathologic Features and Survival. Clin Med
Res 7, 4–13. 10.3121/cmr.2009.825.
11. Toyota, M., Ahuja, N., Ohe-Toyota, M., Herman, J.G., Baylin, S.B., and Issa, J.-P.J. (1999). CpG
island methylator phenotype in colorectal cancer. Proceedings of the National Academy of
Sciences 96, 8681–8686. 10.1073/pnas.96.15.8681.
12. Weisenberger, D.J., Siegmund, K.D., Campan, M., Young, J., Long, T.I., Faasse, M.A., Kang, G.H.,
Widschwendter, M., Weener, D., Buchanan, D., et al. (2006). CpG island methylator phenotype
underlies sporadic microsatellite instability and is tightly associated with BRAF mutation in
colorectal cancer. Nat Genet 38, 787–793. 10.1038/ng1834.
13. Zhang, X., Zhang, W., and Cao, P. (2021). Advances in CpG Island Methylator Phenotype
Colorectal Cancer Therapies. Front Oncol 11. 10.3389/fonc.2021.629390.
14. Tiong, K.-L., and Yeang, C.-H. (2018). Explaining cancer type specific mutations with
transcriptomic and epigenomic features in normal tissues. Sci Rep 8, 11456. 10.1038/s41598-
018-29861-1.
15. Lenhard, B., Sandelin, A., and Carninci, P. (2012). Metazoan promoters: emerging characteristics
and insights into transcriptional regulation. Nat Rev Genet 13, 233–245. 10.1038/nrg3163.
16. Haberle, V., and Stark, A. (2018). Eukaryotic core promoters and the functional basis of
transcription initiation. Nat Rev Mol Cell Biol 19, 621–637. 10.1038/s41580-018-0028-8.
17. Roeder, R.G. (1996). The role of general initiation factors in transcription by RNA polymerase II.
Trends Biochem Sci 21, 327–335. 10.1016/0968-0004(96)10050-5.
18. Maston, G.A., Evans, S.K., and Green, M.R. (2006). Transcriptional Regulatory Elements in the
Human Genome. Annu Rev Genomics Hum Genet 7, 29–59.
10.1146/annurev.genom.7.080505.115623.
19. Spitz, F., and Furlong, E.E.M. (2012). Transcription factors: from enhancer binding to
developmental control. Nat Rev Genet 13, 613–626. 10.1038/nrg3207.
20. Heintzman, N.D., Stuart, R.K., Hon, G., Fu, Y., Ching, C.W., Hawkins, R.D., Barrera, L.O., van Calcar,
S., Qu, C., Ching, K.A., et al. (2007). Distinct and predictive chromatin signatures of transcriptional
promoters and enhancers in the human genome. Nat Genet 39, 311–318. 10.1038/ng1966.
21. He, B., Chen, C., Teng, L., and Tan, K. (2014). Global view of enhancer–promoter interactome in
human cells. Proceedings of the National Academy of Sciences 111. 10.1073/pnas.1320308111.
22. Sheffield, N.C., Thurman, R.E., Song, L., Safi, A., Stamatoyannopoulos, J.A., Lenhard, B., Crawford,
G.E., and Furey, T.S. (2013). Patterns of regulatory activity across diverse human cell types
predict tissue identity, transcription factor binding, and long-range interactions. Genome Res 23,
777–788. 10.1101/gr.152140.112.
23. Peng, Y., and Zhang, Y. (2018). Enhancer and super-enhancer: Positive regulators in gene
transcription. Animal Model Exp Med 1, 169–179. 10.1002/ame2.12032.
24. Ernst, J., Kheradpour, P., Mikkelsen, T.S., Shoresh, N., Ward, L.D., Epstein, C.B., Zhang, X., Wang,
L., Issner, R., Coyne, M., et al. (2011). Mapping and analysis of chromatin state dynamics in nine
human cell types. Nature 473, 43–49. 10.1038/nature09906.
25. Rada-Iglesias, A., Bajpai, R., Swigut, T., Brugmann, S.A., Flynn, R.A., and Wysocka, J. (2011). A
unique chromatin signature uncovers early developmental enhancers in humans. Nature 470,
279–283. 10.1038/nature09692.
26. Cai, W., Huang, J., Zhu, Q., Li, B.E., Seruggia, D., Zhou, P., Nguyen, M., Fujiwara, Y., Xie, H., Yang,
Z., et al. (2020). Enhancer dependence of cell-type–specific gene expression increases with
developmental age. Proceedings of the National Academy of Sciences 117, 21450–21458.
10.1073/pnas.2008672117.
27. Andersson, R., Gebhard, C., Miguel-Escalada, I., Hoof, I., Bornholdt, J., Boyd, M., Chen, Y., Zhao,
X., Schmidl, C., Suzuki, T., et al. (2014). An atlas of active enhancers across human cell types and
tissues. Nature 507, 455–461. 10.1038/nature12787.
28. Furlong, E.E.M., and Levine, M. (2018). Developmental enhancers and chromosome topology.
Science 361, 1341–1345. 10.1126/science.aau0320.
29. Herz, H.-M. (2016). Enhancer deregulation in cancer and other diseases. BioEssays 38, 1003–
1015. 10.1002/bies.201600106.
30. Luo, Z., Rhie, S.K., Lay, F.D., and Farnham, P.J. (2017). A Prostate Cancer Risk Element Functions
as a Repressive Loop that Regulates HOXA13. Cell Rep 21, 1411–1417.
10.1016/j.celrep.2017.10.048.
31. Jones, P.A., and Baylin, S.B. (2007). The Epigenomics of Cancer. Cell 128, 683–692.
10.1016/j.cell.2007.01.029.
32. Rhie, S.K., Perez, A.A., Lay, F.D., Schreiner, S., Shi, J., Polin, J., and Farnham, P.J. (2019). A high-
resolution 3D epigenomic map reveals insights into the creation of the prostate cancer
transcriptome. Nat Commun 10, 4154. 10.1038/s41467-019-12079-8.
33. Panigrahi, A., and O’Malley, B.W. (2021). Mechanisms of enhancer action: the known and the
unknown. Genome Biol 22, 108. 10.1186/s13059-021-02322-1.
34. Mullen, D.J., Yan, C., Kang, D.S., Zhou, B., Borok, Z., Marconett, C.N., Farnham, P.J., Offringa, I.A.,
and Rhie, S.K. (2020). TENET 2.0: Identification of key transcriptional regulators and enhancers in
lung adenocarcinoma. PLoS Genet 16, e1009023. 10.1371/journal.pgen.1009023.
35. Rhie, S.K., Guo, Y., Tak, Y.G., Yao, L., Shen, H., Coetzee, G.A., Laird, P.W., and Farnham, P.J.
(2016). Identification of activated enhancers and linked transcription factors in breast, prostate,
and kidney tumors by tracing enhancer networks using epigenetic traits. Epigenetics Chromatin
9, 1–17. 10.1186/s13072-016-0102-4.
36. Heintzman, N.D., Hon, G.C., Hawkins, R.D., Kheradpour, P., Stark, A., Harp, L.F., Ye, Z., Lee, L.K.,
Stuart, R.K., Ching, C.W., et al. (2009). Histone modifications at human enhancers reflect global
cell-type-specific gene expression. Nature 459, 108–112. 10.1038/nature07829.
37. Creyghton, M.P., Cheng, A.W., Welstead, G.G., Kooistra, T., Carey, B.W., Steine, E.J., Hanna, J.,
Lodato, M.A., Frampton, G.M., Sharp, P.A., et al. (2010). Histone H3K27ac separates active from
poised enhancers and predicts developmental state. Proceedings of the National Academy of
Sciences 107, 21931–21936. 10.1073/pnas.1016071107.
38. Cui, K., Zang, C., Roh, T.-Y., Schones, D.E., Childs, R.W., Peng, W., and Zhao, K. (2009). Chromatin
Signatures in Multipotent Human Hematopoietic Stem Cells Indicate the Fate of Bivalent Genes
during Differentiation. Cell Stem Cell 4, 80–93. 10.1016/j.stem.2008.11.011.
39. Heintzman, N.D., Stuart, R.K., Hon, G., Fu, Y., Ching, C.W., Hawkins, R.D., Barrera, L.O., van Calcar,
S., Qu, C., Ching, K.A., et al. (2007). Distinct and predictive chromatin signatures of transcriptional
promoters and enhancers in the human genome. Nat Genet 39, 311–318. 10.1038/ng1966.
40. Henikoff, S. (2008). Nucleosome destabilization in the epigenetic regulation of gene expression.
Nat Rev Genet 9, 15–26. 10.1038/nrg2206.
41. Visel, A., Rubin, E.M., and Pennacchio, L.A. (2009). Genomic views of distant-acting enhancers.
Nature 461, 199–205. 10.1038/nature08451.
42. Tsompana, M., and Buck, M.J. (2014). Chromatin accessibility: a window into the genome.
Epigenetics Chromatin 7, 33. 10.1186/1756-8935-7-33.
43. Tak, Y.G., Hung, Y., Yao, L., Grimmer, M.R., Do, A., Bhakta, M.S., O’Geen, H., Segal, D.J., and
Farnham, P.J. (2016). Effects on the transcriptome upon deletion of a distal element cannot be
predicted by the size of the H3K27Ac peak in human cells. Nucleic Acids Res 44, 4123–4133.
10.1093/nar/gkv1530.
44. Zhang, T., Zhang, Z., Dong, Q., Xiong, J., and Zhu, B. (2020). Histone H3K27 acetylation is
dispensable for enhancer activity in mouse embryonic stem cells. Genome Biol 21, 45.
10.1186/s13059-020-01957-w.
45. Stadler, M.B., Murr, R., Burger, L., Ivanek, R., Lienert, F., Schöler, A., Nimwegen, E. van,
Wirbelauer, C., Oakeley, E.J., Gaidatzis, D., et al. (2011). DNA-binding factors shape the mouse
methylome at distal regulatory regions. Nature 480, 490–495. 10.1038/nature10716.
46. Blattler, A., and Farnham, P.J. (2013). Cross-talk between Site-specific Transcription Factors and
DNA Methylation States. Journal of Biological Chemistry 288, 34287–34294.
10.1074/jbc.R113.512517.
47. Aran, D., Sabato, S., and Hellman, A. (2013). DNA methylation of distal regulatory sites
characterizes dysregulation of cancer genes. Genome Biol 14. 10.1186/gb-2013-14-3-r21.
48. Bibikova, M., Barnes, B., Tsan, C., Ho, V., Klotzle, B., Le, J.M., Delano, D., Zhang, L., Schroth, G.P.,
Gunderson, K.L., et al. (2011). High density DNA methylation array with single CpG site
resolution. Genomics 98, 288–295. 10.1016/j.ygeno.2011.07.007.
49. Konigsberg, I.R., Barnes, B., Campbell, M., Davidson, E., Zhen, Y., Pallisard, O., Boorgula, M.P.,
Cox, C., Nandy, D., Seal, S., et al. (2021). Host methylation predicts SARS-CoV-2 infection and
clinical outcome. Communications Medicine 1, 42. 10.1038/s43856-021-00042-y.
50. Pidsley, R., Zotenko, E., Peters, T.J., Lawrence, M.G., Risbridger, G.P., Molloy, P., van Djik, S.,
Muhlhausler, B., Stirzaker, C., and Clark, S.J. (2016). Critical evaluation of the Illumina
MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling. Genome
Biol 17, 208. 10.1186/s13059-016-1066-1.
51. Bibikova, M., Le, J., Barnes, B., Saedinia-Melnyk, S., Zhou, L., Shen, R., and Gunderson, K.L. (2009).
Genome-wide DNA methylation profiling using Infinium® assay. Epigenomics 1, 177–200.
10.2217/epi.09.14.
52. Marconett, C.N., Zhou, B., Rieger, M.E., Selamat, S.A., Dubourd, M., Fang, X., Lynch, S.K., Stueve,
T.R., Siegmund, K.D., Berman, B.P., et al. (2013). Integrated Transcriptomic and Epigenomic
Analysis of Primary Human Lung Epithelial Cell Differentiation. PLoS Genet 9, 1–14.
10.1371/journal.pgen.1003513.
53. Yang, C., Stueve, T.R., Yan, C., Rhie, S.K., Mullen, D.J., Luo, J., Zhou, B., Borok, Z., Marconett, C.N.,
and Offringa, I.A. (2018). Positional integration of lung adenocarcinoma susceptibility loci with
primary human alveolar epithelial cell epigenomes. Epigenomics 10, 1167–1187. 10.2217/epi-
2018-0003.
54. Wright, J.C., Mudge, J., Weisser, H., Barzine, M.P., Gonzalez, J.M., Brazma, A., Choudhary, J.S.,
and Harrow, J. (2016). Improving GENCODE reference gene annotation using a high-stringency
proteogenomics workflow. Nat Commun 7, 1–11. 10.1038/ncomms11778.
55. Anders, S., Pyl, P.T., and Huber, W. (2015). HTSeq-A Python framework to work with high-
throughput sequencing data. Bioinformatics 31, 166–169. 10.1093/bioinformatics/btu638.
56. Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion
for RNA-seq data with DESeq2. Genome Biol 15, 1–21. 10.1186/s13059-014-0550-8.
57. Zhu, A., Ibrahim, J.G., and Love, M.I. (2019). Heavy-Tailed prior distributions for sequence count
data: Removing the noise and preserving large differences. Bioinformatics 35, 2084–2092.
10.1093/bioinformatics/bty895.
58. Mi, H., Muruganujan, A., Ebert, D., Huang, X., and Thomas, P.D. (2019). PANTHER version 14:
More genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools. Nucleic
Acids Res 47, D419–D426. 10.1093/nar/gky1038.
59. Dunham, I., Kundaje, A., Aldred, S.F., Collins, P.J., Davis, C.A., Doyle, F., Epstein, C.B., Frietze, S.,
Harrow, J., Kaul, R., et al. (2012). An integrated encyclopedia of DNA elements in the human
genome. Nature 489, 57–74. 10.1038/nature11247.
60. Davis, C.A., Hitz, B.C., Sloan, C.A., Chan, E.T., Davidson, J.M., Gabdank, I., Hilton, J.A., Jain, K.,
Baymuradov, U.K., Narayanan, A.K., et al. (2018). The Encyclopedia of DNA elements (ENCODE):
data portal update. Nucleic Acids Res 46, D794–D801. 10.1093/nar/gkx1081.
61. Lister, R., Pelizzola, M., Dowen, R.H., Hawkins, R.D., Hon, G., Tonti-Filippini, J., Nery, J.R., Lee, L.,
Ye, Z., Ngo, Q.M., et al. (2009). Human DNA methylomes at base resolution show widespread
epigenomic differences. Nature 462, 315–322. 10.1038/nature08514.
62. Bernstein, B.E., Stamatoyannopoulos, J.A., Costello, J.F., Ren, B., Milosavljevic, A., Meissner, A.,
Kellis, M., Marra, M.A., Beaudet, A.L., Ecker, J.R., et al. (2010). The NIH roadmap epigenomics
mapping consortium. Nat Biotechnol 28, 1045–1048. 10.1038/nbt1010-1045.
63. Suzuki, A., Kawano, S., Mitsuyama, T., Suyama, M., Kanai, Y., Shirahige, K., Sasaki, H., Tokunaga,
K., Tsuchihara, K., Sugano, S., et al. (2018). DBTSS/DBKERO for integrated analysis of
transcriptional regulation. Nucleic Acids Res 46, D229–D238. 10.1093/nar/gkx1001.
64. Landt, S.G., Marinov, G.K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B.E.,
Bickel, P., Brown, J.B., Cayting, P., et al. (2012). ChIP-seq guidelines and practices of the ENCODE
and modENCODE consortia. Genome Res 22, 1813–1831. 10.1101/gr.136184.111.
65. Buenrostro, J.D., Wu, B., Chang, H.Y., and Greenleaf, W.J. (2015). ATAC-seq: A method for
assaying chromatin accessibility genome-wide. Curr Protoc Mol Biol 2015, 21.29.1-21.29.9.
10.1002/0471142727.mb2129s109.
66. Wang, Z., Tu, K., Xia, L., Luo, K., Luo, W., Tang, J., Lu, K., Hu, X., He, Y., Qiao, W., et al. (2019). The
open chromatin landscape of non–small cell lung carcinoma. Cancer Res 79, 4840–4854.
10.1158/0008-5472.CAN-18-3663.
67. Corces, M.R., Granja, J.M., Shams, S., Louie, B.H., Seoane, J.A., Zhou, W., Silva, T.C., Groeneveld,
C., Wong, C.K., Cho, S.W., et al. (2018). The chromatin accessibility landscape of primary human
cancers. Science 362, eaav1898. 10.1126/science.aav1898.
68. Guler, G.D., Tindell, C.A., Pitti, R., Wilson, C., Nichols, K., KaiWai Cheung, T., Kim, H.J.,
Wongchenko, M., Yan, Y., Haley, B., et al. (2017). Repression of Stress-Induced LINE-1 Expression
Protects Cancer Cell Subpopulations from Lethal Drug Exposure. Cancer Cell 32, 221-237.e13.
10.1016/j.ccell.2017.07.002.
69. Davis, C.A., Hitz, B.C., Sloan, C.A., Chan, E.T., Davidson, J.M., Gabdank, I., Hilton, J.A., Jain, K.,
Baymuradov, U.K., Narayanan, A.K., et al. (2018). The Encyclopedia of DNA elements (ENCODE):
Data portal update. Nucleic Acids Res 46, D794–D801. 10.1093/nar/gkx1081.
70. Lambert, S.A., Jolma, A., Campitelli, L.F., Das, P.K., Yin, Y., Albu, M., Chen, X., Taipale, J., Hughes,
T.R., and Weirauch, M.T. (2018). The Human Transcription Factors. Cell 172, 650–665.
10.1016/j.cell.2018.01.029.
71. Colaprico, A., Silva, T.C., Olsen, C., Garofano, L., Cava, C., Garolini, D., Sabedot, T.S., Malta, T.M.,
Pagnotta, S.M., Castiglioni, I., et al. (2016). TCGAbiolinks: An R/Bioconductor package for
integrative analysis of TCGA data. Nucleic Acids Res 44, e71. 10.1093/nar/gkv1507.
72. Cerami, E., Gao, J., Dogrusoz, U., Gross, B.E., Sumer, S.O., Arman, B., Jacobsen, A., Byrne, C.J.,
Heuer, M.L., Larsson, E., et al. (2012). The cBio Cancer Genomics Portal: An Open
Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discov 2, 401–404.
10.1158/2159-8290.CD-12-0095.
73. Caligiuri, M.A., Dalton, W.S., Rodriguez, L., Sellers, T., and Willman, C.L. (2016). Orien Reshaping
Cancer Research & Treatment. Oncol Issues 31, 62–66. 10.1080/10463356.2016.11884100.
74. Dalton, W.S., Sullivan, D., Ecsedy, J., and Caligiuri, M.A. (2018). Patient Enrichment for Precision-
Based Cancer Clinical Trials: Using Prospective Cohort Surveillance as an Approach to Improve
Clinical Trials. Clin Pharmacol Ther 104, 23–26. 10.1002/cpt.1051.
75. Gyorffy, B., Surowiak, P., Budczies, J., and Lánczky, A. (2013). Online survival analysis software to
assess the prognostic value of biomarkers using transcriptomic data in non-small-cell lung cancer.
PLoS One 8. 10.1371/journal.pone.0082241.
76. Gao, J., Aksoy, B.A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S.O., Sun, Y., Jacobsen, A.,
Sinha, R., Larsson, E., et al. (2013). Integrative analysis of complex cancer genomics and clinical
profiles using the cBioPortal. Sci Signal 6, 1–20. 10.1126/scisignal.2004088.
77. Kulakovskiy, I.V., Vorontsov, I.E., Yevshin, I.S., Sharipov, R.N., Fedorova, A.D., Rumynskiy, E.I.,
Medvedeva, Y.A., Magana-Mora, A., Bajic, V.B., Papatsenko, D.A., et al. (2018). HOCOMOCO:
Towards a complete collection of transcription factor binding models for human and mouse via
large-scale ChIP-Seq analysis. Nucleic Acids Res 46, D252–D259. 10.1093/nar/gkx1106.
78. Grant, C.E., Bailey, T.L., and Noble, W.S. (2011). FIMO: Scanning for occurrences of a given motif.
Bioinformatics 27, 1017–1018. 10.1093/bioinformatics/btr064.
79. Rao, S.S.P., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn,
A.L., Machol, I., Omer, A.D., Lander, E.S., et al. (2014). A 3D map of the human genome at
kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680.
10.1016/j.cell.2014.11.021.
80. Wang, Y., Song, F., Zhang, B., Zhang, L., Xu, J., Kuang, D., Li, D., Choudhary, M.N.K., Li, Y., Hu, M.,
et al. (2018). The 3D Genome Browser: A web-based browser for visualizing 3D genome
organization and long-range chromatin interactions. Genome Biol 19, 1–12. 10.1186/s13059-018-
1519-9.
81. Bulger, M., and Groudine, M. (2010). Enhancers: The abundance and function of regulatory
sequences beyond promoters. Dev Biol 339, 250–257. 10.1016/j.ydbio.2009.11.035.
82. Chen, Z., Fillmore, C.M., Hammerman, P.S., Kim, C.F., and Wong, K.K. (2014). Non-small-cell lung
cancers: A heterogeneous set of diseases. Nat Rev Cancer 14, 535–546. 10.1038/nrc3775.
83. Roadmap Epigenomics Consortium, Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A.,
Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., et al. (2015). Integrative analysis of 111
reference human epigenomes. Nature 518, 317–329. 10.1038/nature14248.
84. Rhie, S.K., Schreiner, S., Witt, H., Armoskus, C., Lay, F.D., Camarena, A., Spitsyna, V.N., Guo, Y.,
Berman, B.P., Evgrafov, O.V., et al. (2018). Using 3D epigenomic maps of primary olfactory
neuronal cells from living individuals to understand gene regulation. Sci Adv 4.
10.1126/sciadv.aav8550.
85. Herriges, M., and Morrisey, E.E. (2014). Lung development: orchestrating the generation and
regeneration of a complex organ. Development 141, 502–513. 10.1242/dev.098186.
86. Chen, Y., Pacyna-Gengelbach, M., Deutschmann, N., Niesporek, S., and Petersen, I. (2007).
Homeobox gene HOP has a potential tumor suppressive activity in human lung cancer. Int J
Cancer 121, 1021–1027. 10.1002/ijc.22753.
87. Rebouissou, S., Vasiliu, V., Thomas, C., Bellanné-Chantelot, C., Bui, H., Chrétien, Y., Timsit, J.,
Rosty, C., Laurent-Puig, P., Chauveau, D., et al. (2005). Germline hepatocyte nuclear factor 1α and
1β mutations in renal cell carcinomas. Hum Mol Genet 14, 603–614. 10.1093/hmg/ddi057.
88. Terasawa, K., Toyota, M., Sagae, S., Ogi, K., Suzuki, H., Sonoda, T., Akino, K., Maruyama, R.,
Nishikawa, N., Imai, K., et al. (2006). Epigenetic inactivation of TCF2 in ovarian cancer and various
cancer cell lines. Br J Cancer 94, 914–921. 10.1038/sj.bjc.6602984.
89. Yu, D.D., Guo, S.W., Jing, Y.Y., Dong, Y.L., and Wei, L.X. (2015). A review on hepatocyte nuclear
factor-1beta and tumor. Cell Biosci 5, 1–8. 10.1186/s13578-015-0049-3.
90. Wang, I.-C., Snyder, J., Zhang, Y., Lander, J., Nakafuku, Y., Lin, J., Chen, G., Kalin, T.V., Whitsett,
J.A., and Kalinichenko, V.V. (2012). Foxm1 Mediates Cross Talk between Kras/Mitogen-Activated
Protein Kinase and Canonical Wnt Pathways during Development of Respiratory Epithelium. Mol
Cell Biol 32, 3838–3850. 10.1128/mcb.00355-12.
91. Hanada, N., Lo, H.W., Day, C.P., Pan, Y., Nakajima, Y., and Hung, M.C. (2006). Co-regulation of B-
Myb expression by E2F1 and EGF receptor. Mol Carcinog 45, 10–17. 10.1002/mc.20147.
92. Wu, Y., Du, H., Zhan, M., Wang, H., Chen, P., Du, D., Liu, X., Huang, X., Ma, P., Peng, D., et al.
(2020). Sepiapterin reductase promotes hepatocellular carcinoma progression via FoxO3a/Bim
signaling in a nonenzymatic manner. Cell Death Dis 11. 10.1038/s41419-020-2471-7.
93. Winslow, M.M., Dayton, T.L., Verhaak, R.G.W., Kim-Kiselak, C., Snyder, E.L., Feldser, D.M.,
Hubbard, D.D., Dupage, M.J., Whittaker, C.A., Hoersch, S., et al. (2011). Suppression of lung
adenocarcinoma progression by Nkx2-1. Nature 473, 101–104. 10.1038/nature09881.
94. Rice, S.J., Lai, S.C., Wood, L.W., Helsley, K.R., Runkle, E.A., Winslow, M.M., and Mu, D. (2013).
MicroRNA-33a mediates the regulation of high mobility group AT-hook 2 gene (HMGA2) by
thyroid transcription factor 1 (TTF-1/NKX2-1). Journal of Biological Chemistry 288, 16348–16360.
10.1074/jbc.M113.474643.
95. Athwal, R.K., Walkiewicz, M.P., Baek, S., Fu, S., Bui, M., Camps, J., Ried, T., Sung, M.H., and Dalal,
Y. (2015). CENP-A nucleosomes localize to transcription factor hotspots and subtelomeric sites in
human cancer cells. Epigenetics Chromatin 8, 1–23. 10.1186/1756-8935-8-2.
96. Sadasivam, S., Duan, S., and DeCaprio, J.A. (2012). The MuvB complex sequentially recruits B-
Myb and FoxM1 to promote mitotic gene expression. Genes Dev 26, 474–489.
10.1101/gad.181933.111.
97. Sanders, D.A., Gormally, M.V., Marsico, G., Beraldi, D., and Tannahill, D. (2015). FOXM1 binds
directly to non-consensus sequences in the human genome. Genome Biol, 1–23.
10.1186/s13059-015-0696-z.
98. Sanders, D.A., Ross-Innes, C.S., Beraldi, D., Carroll, J.S., and Balasubramanian, S. (2013). Genome-
wide mapping of FOXM1 binding reveals co-binding with estrogen receptor alpha in breast
cancer cells. Genome Biol 14, 1–16. 10.1186/gb-2013-14-1-r6.
99. Chae, Y.K., Davis, A.A., Raparia, K., Agte, S., Pan, A., Mohindra, N., Villaflor, V., and Giles, F.
(2019). Association of Tumor Mutational Burden With DNA Repair Mutations and Response to
Anti–PD-1/PD-L1 Therapy in Non–Small-Cell Lung Cancer. Clin Lung Cancer 20, 88-96.e6.
10.1016/j.cllc.2018.09.008.
100. Zona, S., Bella, L., Burton, M.J., Nestal de Moraes, G., and Lam, E.W.F. (2014). FOXM1: An
emerging master regulator of DNA damage response and genotoxic agent resistance. Biochim
Biophys Acta Gene Regul Mech 1839, 1316–1322. 10.1016/j.bbagrm.2014.09.016.
101. Rhie, S.K., Hazelett, D.J., Coetzee, S.G., Yan, C., Noushmehr, H., and Coetzee, G.A. (2014).
Nucleosome positioning and histone modifications define relationships between regulatory
elements and nearby gene expression in breast epithelial cells. BMC Genomics 15, 1–19.
10.1186/1471-2164-15-331.
102. Wang, Y., Ung, M.H., Xia, T., Cheng, W., and Cheng, C. (2017). Cancer cell line specific co-factors
modulate the FOXM1 cistrome. Oncotarget 8, 76498–76515. 10.18632/oncotarget.20405.
103. Wei, P., Zhang, N., Wang, Y., Li, D., Wang, L., Sun, X., Shen, C., Yang, Y., Zhou, X., and Du, X.
(2015). FOXM1 promotes lung adenocarcinoma invasion and metastasis by upregulating SNAIL.
Int J Biol Sci 11, 186–198. 10.7150/ijbs.10634.
104. Zhang, Y., Qiao, W. bin, and Shan, L. (2018). Expression and functional characterization of FOXM1
in non-small cell lung cancer. Onco Targets Ther 11, 3385–3393. 10.2147/OTT.S162523.
105. Xiong, Y.C., Wang, J., Cheng, Y., Zhang, X.Y., and Ye, X.Q. (2020). Overexpression of MYBL2
promotes proliferation and migration of non-small-cell lung cancer via upregulating NCAPH. Mol
Cell Biochem 468, 185–193. 10.1007/s11010-020-03721-x.
106. Takeda, D.Y., Spisák, S., Seo, J.H., Bell, C., O’Connor, E., Korthauer, K., Ribli, D., Csabai, I.,
Solymosi, N., Szállási, Z., et al. (2018). A Somatically Acquired Enhancer of the Androgen Receptor
Is a Noncoding Driver in Advanced Prostate Cancer. Cell 174, 422-432.e13.
10.1016/j.cell.2018.05.037.
107. Corson, T.W., Huang, A., Tsao, M.S., and Gallie, B.L. (2005). KIF14 is a candidate oncogene in the
1q minimal region of genomic gain in multiple cancers. Oncogene 24, 4741–4753.
10.1038/sj.onc.1208641.
108. Iwakawa, R., Kohno, T., Kato, M., Shiraishi, K., Tsuta, K., Noguchi, M., Ogawa, S., and Yokota, J.
(2011). MYC amplification as a prognostic marker of early-stage lung adenocarcinoma identified
by whole genome copy number analysis. Clinical Cancer Research 17, 1481–1489. 10.1158/1078-
0432.CCR-10-2484.
109. Mao, S., Li, Y., Lu, Z., Che, Y., Huang, J., Lei, Y., Wang, Y., Liu, C., Wang, X., Zheng, S., et al. (2019).
PHD finger protein 5A promoted lung adenocarcinoma progression via alternative splicing.
Cancer Med 8, 2429–2441. 10.1002/cam4.2115.
110. Morel, M., Shah, K.N., and Long, W. (2020). The F-box protein FBXL16 up-regulates the stability of
C-MYC oncoprotein by antagonizing the activity of the F-box protein FBW7. J Biol Chem 295,
7970–7980. 10.1074/jbc.RA120.012658.
111. Malvi, P., Janostiak, R., Nagarajan, A., Cai, G., and Wajapeyee, N. (2019). Loss of thymidine kinase
1 inhibits lung cancer growth and metastatic attributes by reducing GDF15 expression. PLoS
Genet 15, e1008439. 10.1371/journal.pgen.1008439.
112. Uttarkar, S., Frampton, J., and Klempnauer, K.H. (2017). Targeting the transcription factor Myb by
small-molecule inhibitors. Exp Hematol 47, 31–35. 10.1016/j.exphem.2016.12.003.
113. Kwok, J.M.M., Myatt, S.S., Marson, C.M., Coombes, R.C., Constantinidou, D., and Lam, E.W.F.
(2008). Thiostrepton selectively targets breast cancer cells through inhibition of forkhead box M1
expression. Mol Cancer Ther 7, 2022–2032. 10.1158/1535-7163.MCT-08-0188.
114. Gormally, M.V., Dexheimer, T.S., Marsico, G., Sanders, D.A., Lowe, C., Matak-Vinković, D.,
Michael, S., Jadhav, A., Rai, G., Maloney, D.J., et al. (2014). Suppression of the FOXM1
transcriptional programme via novel small molecule inhibition. Nat Commun 5.
10.1038/ncomms6165.
115. Zheng, R., Wan, C., Mei, S., Qin, Q., Wu, Q., Sun, H., Chen, C.-H., Brown, M., Zhang, X., Meyer,
C.A., et al. (2019). Cistrome Data Browser: expanded datasets and new tools for gene regulatory
analysis. Nucleic Acids Res 47, D729–D735. 10.1093/nar/gky1094.
116. Mei, S., Qin, Q., Wu, Q., Sun, H., Zheng, R., Zang, C., Zhu, M., Wu, J., Shi, X., Taing, L., et al. (2017).
Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and
mouse. Nucleic Acids Res 45, D658–D662. 10.1093/nar/gkw983.
117. Abascal, F., Acosta, R., Addleman, N.J., Adrian, J., Afzal, V., Ai, R., Aken, B., Akiyama, J.A., Jammal,
O. al, Amrhein, H., et al. (2020). Expanded encyclopaedias of DNA elements in the human and
mouse genomes. Nature 583, 699–710. 10.1038/s41586-020-2493-4.
118. Zhou, W., Laird, P.W., and Shen, H. (2017). Comprehensive characterization, annotation and
innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res 45, e22.
10.1093/nar/gkw967.
119. Lawrence, M., Huber, W., Pagès, H., Aboyoun, P., Carlson, M., Gentleman, R., Morgan, M.T., and
Carey, V.J. (2013). Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9,
1–10. 10.1371/journal.pcbi.1003118.
120. Negrini, S., Gorgoulis, V.G., and Halazonetis, T.D. (2010). Genomic instability — an evolving
hallmark of cancer. Nat Rev Mol Cell Biol 11, 220–228. 10.1038/nrm2858.
121. Wei Dai, Y.Y. (2014). Genomic Instability and Cancer. J Carcinog Mutagen 05. 10.4172/2157-
2518.1000165.
122. Kiwerska, K., and Szyfter, K. (2019). DNA repair in cancer initiation, progression, and therapy—a
double-edged sword. J Appl Genet 60, 329–334. 10.1007/s13353-019-00516-9.
123. Thurman, R.E., Rynes, E., Humbert, R., Vierstra, J., Maurano, M.T., Haugen, E., Sheffield, N.C.,
Stergachis, A.B., Wang, H., Vernot, B., et al. (2012). The accessible chromatin landscape of the
human genome. Nature 489, 75–82. 10.1038/nature11232.
124. Dao, L.T.M., Galindo-Albarrán, A.O., Castro-Mondragon, J.A., Andrieu-Soler, C., Medina-Rivera,
A., Souaid, C., Charbonnier, G., Griffon, A., Vanhille, L., Stephen, T., et al. (2017). Genome-wide
characterization of mammalian promoters with distal enhancer functions. Nat Genet 49, 1073–
1081. 10.1038/ng.3884.
125. Murtagh, F., and Legendre, P. (2014). Ward’s Hierarchical Agglomerative Clustering Method:
Which Algorithms Implement Ward’s Criterion? J Classif 31, 274–295. 10.1007/s00357-014-9161-
z.
126. Webber, W., Moffat, A., and Zobel, J. (2010). A similarity measure for indefinite rankings. ACM
Trans Inf Syst 28, 1–38. 10.1145/1852102.1852106.
127. Rahman, M., Jackson, L.K., Johnson, W.E., Li, D.Y., Bild, A.H., and Piccolo, S.R. (2015). Alternative
preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis
results. Bioinformatics 31, 3666–3672. 10.1093/bioinformatics/btv377.
128. Jeffery, D., Gatto, A., Podsypanina, K., Renaud-Pageot, C., Ponce Landete, R., Bonneville, L.,
Dumont, M., Fachinetti, D., and Almouzni, G. (2021). CENP-A overexpression promotes distinct
fates in human cells, depending on p53 status. Commun Biol 4, 417. 10.1038/s42003-021-01941-
5.
129. Ahmed, F. (2019). Integrated Network Analysis Reveals FOXM1 and MYBL2 as Key Regulators of
Cell Proliferation in Non-small Cell Lung Cancer. Front Oncol 9. 10.3389/fonc.2019.01011.
130. Liu, C., Barger, C.J., and Karpf, A.R. (2021). FOXM1: A Multifunctional Oncoprotein and Emerging
Therapeutic Target in Ovarian Cancer. Cancers (Basel) 13, 3065. 10.3390/cancers13123065.
131. Han, J., Xie, R., Yang, Y., Chen, D., Liu, L., Wu, J., and Li, S. (2021). CENPA is one of the potential
key genes associated with the proliferation and prognosis of ovarian cancer based on integrated
bioinformatics analysis and regulated by MYBL2. Transl Cancer Res 10, 4076–4086.
10.21037/tcr-21-175.
132. Wang, Q., Xu, J., Xiong, Z., Xu, T., Liu, J., Liu, Y., Chen, J., Shi, J., Shou, Y., Yue, C., et al. (2021).
CENPA promotes clear cell renal cell carcinoma progression and metastasis via Wnt/β-catenin
signaling pathway. J Transl Med 19, 417. 10.1186/s12967-021-03087-8.
133. DeGraff, D.J., Clark, P.E., Cates, J.M., Yamashita, H., Robinson, V.L., Yu, X., Smolkin, M.E., Chang,
S.S., Cookson, M.S., Herrick, M.K., et al. (2012). Loss of the Urothelial Differentiation Marker
FOXA1 Is Associated with High Grade, Late Stage Bladder Cancer and Increased Tumor
Proliferation. PLoS One 7, e36669. 10.1371/journal.pone.0036669.
134. Sikic, D., Eckstein, M., Wirtz, R.M., Jarczyk, J., Worst, T.S., Porubsky, S., Keck, B., Kunath, F.,
Weyerer, V., Breyer, J., et al. (2020). FOXA1 Gene Expression for Defining Molecular Subtypes of
Muscle-Invasive Bladder Cancer after Radical Cystectomy. J Clin Med 9, 994.
10.3390/jcm9040994.
135. Warrick, J.I., Walter, V., Yamashita, H., Chung, E., Shuman, L., Amponsa, V.O., Zheng, Z., Chan, W.,
Whitcomb, T.L., Yue, F., et al. (2016). FOXA1, GATA3 and PPARɣ Cooperate to Drive Luminal
Subtype in Bladder Cancer: A Molecular Analysis of Established Human Cell Lines. Sci Rep 6,
38531. 10.1038/srep38531.
136. Seachrist, D.D., Anstine, L.J., and Keri, R.A. (2021). FOXA1: A Pioneer of Nuclear Receptor Action
in Breast Cancer. Cancers (Basel) 13, 5205. 10.3390/cancers13205205.
137. Rooney, N., Mason, S.M., McDonald, L., Däbritz, J.H.M., Campbell, K.J., Hedley, A., Howard, S.,
Athineos, D., Nixon, C., Clark, W., et al. (2020). RUNX1 Is a Driver of Renal Cell Carcinoma
Correlating with Clinical Outcome. Cancer Res 80, 2325–2339. 10.1158/0008-5472.CAN-19-3870.
138. Fu, Y., Sun, S., Man, X., and Kong, C. (2019). Increased expression of RUNX1 in clear cell renal cell
carcinoma predicts poor prognosis. PeerJ 7, e7854. 10.7717/peerj.7854.
139. Ciriello, G., Gatza, M.L., Beck, A.H., Wilkerson, M.D., Rhie, S.K., Pastore, A., Zhang, H., McLellan,
M., Yau, C., Kandoth, C., et al. (2015). Comprehensive Molecular Portraits of Invasive Lobular
Breast Cancer. Cell 163, 506–519. 10.1016/j.cell.2015.09.033.
140. Tietjen, J.R., Donato, L.J., Bhimisaria, D., and Ansari, A.Z. (2011). Sequence-Specificity and Energy
Landscapes of DNA-Binding Molecules. 3–30. 10.1016/B978-0-12-385075-1.00001-9.
141. Aran, D., Sirota, M., and Butte, A.J. (2015). Systematic pan-cancer analysis of tumour purity. Nat
Commun 6, 8971. 10.1038/ncomms9971.
142. Lex, A., and Gehlenborg, N. (2014). Sets and intersections. Nat Methods 11, 779–779.
10.1038/nmeth.3033.
143. Kent, L.N., and Leone, G. (2019). The broken cycle: E2F dysfunction in cancer. Nat Rev Cancer 19,
326–338. 10.1038/s41568-019-0143-7.
144. Xie, D., Pei, Q., Li, J., Wan, X., and Ye, T. (2021). Emerging Role of E2F Family in Cancer Stem Cells.
Front Oncol 11. 10.3389/fonc.2021.723137.
145. Golson, M.L., and Kaestner, K.H. (2016). Fox transcription factors: from development to disease.
Development 143, 4558–4570. 10.1242/dev.112672.
146. Bach, D.-H., Long, N., Luu, T.-T.-T., Anh, N., Kwon, S., and Lee, S. (2018). The Dominant Role of
Forkhead Box Proteins in Cancer. Int J Mol Sci 19, 3279. 10.3390/ijms19103279.
147. Zhang, Z., Chang, Y., Zhang, J., Lu, Y., Zheng, L., Hu, Y., Zhang, F., Li, X., Zhang, W., and Li, X.
(2017). HMGB3 promotes growth and migration in colorectal cancer by regulating WNT/β-
catenin pathway. PLoS One 12, e0179741. 10.1371/journal.pone.0179741.
148. Gu, J., Xu, T., Huang, Q.-H., Zhang, C.-M., and Chen, H.-Y. (2019). HMGB3 silence inhibits breast
cancer cell proliferation and tumor growth by interacting with hypoxia-inducible factor 1α.
Cancer Manag Res 11, 5075–5089. 10.2147/CMAR.S204357.
149. Zhong, X., Zhang, S., Zhang, Y., Jiang, Z., Li, Y., Chang, J., Niu, J., and Shi, Y. (2021). HMGB3 is
Associated With an Unfavorable Prognosis of Neuroblastoma and Promotes Tumor Progression
by Mediating TPX2. Front Cell Dev Biol 9. 10.3389/fcell.2021.769547.
150. Lettice, L.A. (2003). A long-range Shh enhancer regulates expression in the developing limb and
fin and is associated with preaxial polydactyly. Hum Mol Genet 12, 1725–1735.
10.1093/hmg/ddg180.
151. Sagai, T., Hosoya, M., Mizushina, Y., Tamura, M., and Shiroishi, T. (2005). Elimination of a long-
range cis-regulatory module causes complete loss of limb-specific Shh expression and truncation
of the mouse limb. Development 132, 797–803. 10.1242/dev.01613.
152. Jeong, Y., El-Jaick, K., Roessler, E., Muenke, M., and Epstein, D.J. (2006). A functional screen for
sonic hedgehog regulatory elements across a 1 Mb interval identifies long-range ventral
forebrain enhancers. Development 133, 761–772. 10.1242/dev.02239.
153. Sagai, T., Amano, T., Tamura, M., Mizushina, Y., Sumiyama, K., and Shiroishi, T. (2009). A cluster
of three long-range enhancers directs regional Shh expression in the epithelial linings.
Development 136, 1665–1674. 10.1242/dev.032714.
154. Tena, J.J., and Santos-Pereira, J.M. (2021). Topologically Associating Domains and Regulatory
Landscapes in Development, Evolution and Disease. Front Cell Dev Biol 9.
10.3389/fcell.2021.702787.
155. Zeitlin, S.G., Baker, N.M., Chapados, B.R., Soutoglou, E., Wang, J.Y.J., Berns, M.W., and Cleveland,
D.W. (2009). Double-strand DNA breaks recruit the centromeric histone CENP-A. Proceedings of
the National Academy of Sciences 106, 15762–15767. 10.1073/pnas.0908233106.
156. Hédouin, S., Grillo, G., Ivkovic, I., Velasco, G., and Francastel, C. (2017). CENP-A chromatin
disassembly in stressed and senescent murine cells. Sci Rep 7, 42520. 10.1038/srep42520.
157. Bayley, R., Blakemore, D., Cancian, L., Dumon, S., Volpe, G., Ward, C., Almaghrabi, R., Gujar, J.,
Reeve, N., Raghavan, M., et al. (2018). MYBL2 Supports DNA Double Strand Break Repair in
Hematopoietic Stem Cells. Cancer Res 78, 5767–5779. 10.1158/0008-5472.CAN-18-0273.
158. Morris, B.B., Wages, N.A., Grant, P.A., Stukenberg, P.T., Gentzler, R.D., Hall, R.D., Akerley, W.L.,
Varghese, T.K., Arnold, S.M., Williams, T.M., et al. (2021). MYBL2-Driven Transcriptional Programs
Link Replication Stress and Error-prone DNA Repair With Genomic Instability in Lung
Adenocarcinoma. Front Oncol 10. 10.3389/fonc.2020.585551.
159. Lee, Y., Wu, Z., Yang, S., Schreiner, S.M., Gonzalez-Smith, L.D., and Rhie, S.K. (2022).
Characterizing and Targeting Genes Regulated by Transcription Factor MYBL2 in Lung
Adenocarcinoma Cells. Cancers (Basel) 14, 4979. 10.3390/cancers14204979.
160. Silva, T.C., Coetzee, S.G., Gull, N., Yao, L., Hazelett, D.J., Noushmehr, H., Lin, D.-C., and Berman,
B.P. (2019). ELMER v.2: an R/Bioconductor package to reconstruct gene regulatory networks
from DNA methylation and transcriptome profiles. Bioinformatics 35, 1974–1977.
10.1093/bioinformatics/bty902.
Appendix: Publications as a contributing author
Johnson C, Mullen DJ, Selamat SA, Campan M, Offringa IA, Marconett CN. The Sulfotransferase SULT1C2
Is Epigenetically Activated and Transcriptionally Induced by Tobacco Exposure and Is Associated with
Patient Outcome in Lung Adenocarcinoma. (2022). Int J Environ Res Public Health. 19(1): 416.
I utilized data from lung adenocarcinoma (LUAD) samples in TCGA to examine the expression of
SULT1C2 as well as the DNA methylation of probe cg13968390 located in the promoter region of the
gene (Figures 1+2, Table 2).
Zhou B, Stueve TR, Mihalakakos EA, Miao L, Mullen D, Wang Y, Liu Y, Luo J, Tran E, Siegmund KD, Lynch
SK, Ryan AL, Offringa IA, Borok Z, Marconett CN. Comprehensive epigenomic profiling of human alveolar
epithelial differentiation identifies key epigenetic states and transcription factor co-regulatory networks
for maintenance of distal lung identity. (2021). BMC Genom. 22(1): 906.
I aided in the bioinformatic analysis and helped correct errors in code (Figure 3E-G).
Rangel DF, Dubeau L, Park R, Chan P, Ha DP, Pulido MA, Mullen DJ, Vorobyova I, Zhou B, Borok Z,
Offringa IA, Lee AS. Endoplasmic reticulum chaperone GRP78/BiP is critical for mutant Kras-driven lung
tumorigenesis. (2021). Oncogene. 40(20):3624-3632.
I examined the expression of GRP78 in LUAD samples from TCGA, compared its expression between
LUAD tumor and adjacent non-tumor lung tissue, and performed an analysis of variance (ANOVA)
comparing its expression among groups of LUAD tumor samples with different combinations of
alterations to the KRAS and EGFR genes (Figure S1A-B).
Mullen DJ, Yan C, Kang DS, Zhou B, Borok Z, Marconett CN, Farnham PJ, Offringa IA, Rhie SK. TENET 2.0:
Identification of key transcriptional regulators and enhancers in lung adenocarcinoma. (2020). PLoS
Genet. 16(9): e1009023.
This paper, for which I am the lead author and performed the bioinformatic analyses and writing,
describes my work to update and apply the bioinformatic tool TENET, originally developed by Dr. Rhie,
to LUAD to identify the key TFs MYBL2, FOXM1, and CENPA, as well as linked distal enhancer elements
potentially regulated by them.
Shahabi S, Kumaran V, Castillo J, Cong Z, Nandagopal G, Mullen DJ, Alvarado A, Correa MR, Saizan A,
Goel R, Bhat A, Lynch SK, Zhou B, Borok Z, Marconett CN. (2019). LINC00261 is an epigenetically-
regulated tumor suppressor that is essential for activation of the DNA damage response. Cancer Res.
79(12): 3050-3062.
I investigated the expression of LINC00261 in LUAD and lung squamous cell carcinoma (LUSC)
data from TCGA, identified differences in its expression between tumor and adjacent normal
samples and with respect to cancer stage and smoking status, and created Kaplan-Meier survival
plots to examine the association of LINC00261 expression with patient survival (Figure S2).
Yang C, Stueve TR, Yan C, Rhie SK, Mullen DJ, Luo J, Zhou B, Borok Z, Marconett CN, Offringa IA. (2018).
Positional Integration of lung adenocarcinoma susceptibility loci with primary human alveolar epithelial
cell genomes. Epigenomics 10(9): 1167-1187.
I utilized gene expression data from TCGA to verify that there is significant differential
expression of target genes identified in this analysis in LUAD tumors and created a Kaplan-Meier plot
showing differential survivorship based on expression level for the same genes (Figures 5+6).
Park SL, Patel YM, Loo LWM, Mullen DJ, Offringa IA, Maunakea A, Stram DO, Siegmund K, Murphy SE, Tiirikainen
M, Le Marchand L. (2018). Association of internal smoking dose with blood DNA methylation in three racial/ethnic
populations. Clin Epigenetics 10(1):110.
I aided in cataloging which of the smoking-associated cytosine-guanine dinucleotides (CpGs) found in this
study had been identified previously in 19 different epigenome-wide association studies and annotated which of
these significant CpGs were located in active regions of the genome according to the presence of open chromatin
and histone marks associated with active enhancer and promoter regions (Figures 4+5).
Stueve TR, Li WQ, Shi J, Marconett CN, Zhang T, Yang C, Mullen D, Wheeler W, Hua X, Zhou B, Borok Z,
Caporaso NE, Pesatori AC, Duan J, Laird-Offringa IA, Landi MT. (2017). Epigenome-wide analysis of DNA
methylation in lung tissue shows concordance with blood studies and identifies tobacco
smoke-inducible enhancers. Hum Mol Genet 26(15):3014-3027.
I cataloged significant, smoking-associated DNA methylation changes at individual CpGs found in
18 previous epigenome-wide association studies performed on blood samples and cross-referenced the
smoking-associated CpGs found in lung tissue samples from this study with those previously identified in
blood (Figure 3).
Abstract
Although genetic alterations are known as key drivers of cancer, epigenetic alterations also play key roles in tumor development. Of particular interest are alterations to the expression of transcription factors (TFs) and subsequent changes to transcriptional regulatory networks, which induce widespread downstream effects in cells. To identify these alterations, I facilitated the development of TENET 2.0 and then TENETR, which are updated versions of the Tracing Enhancer Networks using Epigenetic Traits (TENET) bioinformatic method. These methods utilize ChIP-seq and open chromatin datasets to identify DNA methylation probes in regulatory elements, then use the DNA methylation levels of those probes as a surrogate for the activity of the regulatory elements. Combining these epigenomic datasets with RNA-seq datasets, the TENET methods identify TFs whose expression levels are related, or “linked,” to the DNA methylation level of each regulatory element probe.
TENET 2.0 was utilized to identify dysregulated TFs and linked enhancers in lung adenocarcinoma (LUAD), including key TFs such as CENPA, FOXM1, and MYBL2. I followed up on this by developing TENETR, an updated R package variation of TENET which possesses several new features to increase its ease of use and applicability compared to TENET 2.0. TENETR was used to identify numerous enhancer elements along with top TFs linked to activated enhancers in twelve selected cancer types. To illustrate downstream analyses which can be performed using TENETR, I identified a panel of 108 TFs which were highly ranked within or across the 12 cancer types. These included a cluster of 8 TFs which were highly correlated in expression and rank, and which were particularly important in BRCA, KIRP, and LUAD. Survival analysis revealed that expression of TFs within this cluster and methylation of their linked probes, including MYBL2 and its linked probe cg20246907, are strongly associated with poor patient survival. cg20249607 is particularly strongly associated with KIRP patient survival even when considering clinical covariates and could potentially predict survival even in stage I tumors.
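The linking idea described in the abstract (relating each TF's expression to the DNA methylation level of each regulatory element probe across samples) can be illustrated with a small sketch. This is not the statistical procedure used by TENET 2.0 or TENETR; it substitutes a plain Spearman rank correlation as a stand-in, and the function names, the `min_abs_rho` threshold, and the toy TF and probe identifiers are all hypothetical.

```python
import math
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no rank information
    return cov / math.sqrt(var_x * var_y)


def link_tfs_to_probes(tf_expression, probe_methylation, min_abs_rho=0.8):
    """Return (tf, probe, rho) tuples whose |rho| meets the assumed threshold."""
    links = []
    for tf, expr in tf_expression.items():
        for probe, beta in probe_methylation.items():
            rho = spearman(expr, beta)
            if abs(rho) >= min_abs_rho:
                links.append((tf, probe, rho))
    return sorted(links, key=lambda link: -abs(link[2]))


# Toy data for six samples (made up for illustration only).
tf_expression = {
    "TF_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "TF_B": [3.0, 1.0, 4.0, 1.5, 5.0, 2.0],
}
probe_methylation = {
    "probe_1": [0.9, 0.8, 0.6, 0.5, 0.3, 0.1],
    "probe_2": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],
}
links = link_tfs_to_probes(tf_expression, probe_methylation)
for tf, probe, rho in links:
    print(f"{tf} <-> {probe}: rho = {rho:.2f}")
```

In this toy example a negative correlation means that higher TF expression accompanies lower probe methylation, which is the pattern one would expect if methylation serves as an inverse surrogate for regulatory element activity. A real analysis would also need multiple-testing control and a comparison against normal samples, both of which this sketch omits.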
Asset Metadata
Creator
Mullen, Daniel James (author)
Core Title
Application of tracing enhancer networks using epigenetic traits (TENET) to identify epigenetic deregulation in cancer
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Cancer Biology and Genomics
Degree Conferral Date
2022-12
Publication Date
12/16/2023
Defense Date
11/02/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bioinformatics,cancer,epigenetics,Lung,lung adenocarcinoma,OAI-PMH Harvest,TENET,tracing enhancer networks using epigenetic traits
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Farnham, Peggy J. (committee chair), Offringa, Ite A. (committee member), Rhie, Suhn K. (committee member), Siegmund, Kimberly D. (committee member)
Creator Email
DanielJamesMullen@gmail.com,dmullen@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112666316
Unique identifier
UC112666316
Identifier
etd-MullenDani-11391.pdf (filename)
Legacy Identifier
etd-MullenDani-11391
Document Type
Thesis
Rights
Mullen, Daniel James
Internet Media Type
application/pdf
Type
texts
Source
20230104-usctheses-batch-999 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu