CELL-SPECIFIC CASE STUDIES OF ENHANCER FUNCTION PREDICTION USING
MACHINE LEARNING
by
Xuewei Han
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
BIOSTATISTICS
August 2021
Copyright 2021 Xuewei Han
Acknowledgments
Throughout the writing of this thesis, I have received a great deal of support and
assistance.
I would first like to thank my supervisor, Professor Huaiyu Mi, for his assistance at every
stage of my project. Your invaluable advice and continuous support encouraged me to finish my
thesis. Your insightful feedback pushed me to sharpen my thinking and brought my work to a
higher level.
I am deeply grateful to my committee members, Steven Gazal and Juan Pablo Lewinger.
I want to thank you for your patient support and all of the suggestions given for my manuscript.
I would also like to extend my sincere thanks to Caitlin Mills for her valuable guidance
throughout my studies. You provided me with the original data that I needed and gave me a lot
of help when I began working on this project.
My appreciation also goes out to my family and friends for their encouragement and
support all through my studies.
TABLE OF CONTENTS
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Method
  Data collection
    Features
    Training datasets
    Test datasets
  Machine learning methods
    Learning Models
    Feature selection
    Cell-type-specific analysis
Chapter 3: Results
  Prediction performance of missing feature models
    Include only one feature in single models
    Dropping one feature from the full feature set in single models
    Finding the best feature combination in a missing feature model
    Repeating the above pipeline in joint models
  Cell-type-specific analysis
Chapter 4: Discussion
  The most powerful features in predicting enhancer-gene links
  Viability of missing feature models
  The application of joint models
  Divergence of training models and score distributions in various test datasets
  Limitations
Bibliography
List of Tables
Table 1: Feature Summary.
Table 2: Training Dataset Summary.
Table 3: Prediction performance in Random Forest models trained by individual cell types with a single feature.
Table 4: Prediction performance in Random Forest models trained by individual cell types with dropping one feature from the full set.
Table 5: Prediction performance of missing feature models (K562-trained models).
Table 6: Prediction performance of missing feature models (HepG2-trained models).
Table 7: Prediction performance of missing feature models (HCT116-trained models).
Table 8: Prediction performance in Random Forest models trained by joint training datasets with a single feature.
Table 9: Prediction performance in Random Forest models trained by joint training datasets with dropping one feature from the full feature set.
Table 10: Prediction performance of missing feature models (HepG2+K562-trained models).
Table 11: Prediction performance of missing feature models (HepG2+K562+HCT116-trained models).
List of Figures
Figure 1: Feature Importance in single models (A. HepG2, B. K562, C. HCT116).
Figure 2: Feature Importance in joint models (A. HepG2+K562, B. HepG2+K562+HCT116) and feature importance comparison (C. in three single models and two joint models).
Figure 3: Scoring distribution of three validation datasets using different prediction models.
Figure 4: Boxplot of scores for four different trained models.
Abstract
Enhancers are short DNA elements located widely throughout the human genome that modulate the transcription of one or more specific target genes. Enhancer function mainly refers to the connections between enhancers and their target genes within a particular tissue or cell type, and these connections are considered to play important roles in human development and disease. It is always challenging to identify enhancers and determine their functions. Previously, an interactive tool, PEREGRINE, was built to combine enhancer-gene links across different cell types using publicly available data and to allow users to easily access information on each enhancer-gene link through the PANTHER System. A machine learning method was developed to predict possible enhancer-gene links from a set of selected features and to validate the experimental links by scoring them in a cell-type-specific manner. Building on these previous findings, in this study I sought to discover the most critical subset of features needed to build a machine learning model that effectively predicts enhancer-gene links with high confidence. Furthermore, by applying balanced random forest models with a full feature set to new test datasets, probabilities of being a positive enhancer-gene link are provided for each candidate cell type. I also investigate whether score distributions for enhancer-gene links in new test datasets differ among cell types and prediction models.
Chapter 1: Introduction
Decades of studies in gene regulation have indicated that multiple essential functional
elements are located in the non-coding regions of the human genome. However, identifying these
elements and characterizing their functions remain a principal challenge since the overall impact
of presumably functional non-coding DNA sequences on human diseases and development was
initially unclear (Visel et al., 2009b). In particular, genome-wide association studies (GWAS) have shown that a large number of disease-related SNPs do not overlap protein-coding regions but rather map to non-coding areas (Visel et al., 2009b). One possible explanation for these GWAS results is that the corresponding SNPs are located in functional elements, such as enhancers, that regulate gene expression at the transcriptional level (Javierre et al., 2016).
Enhancers are gene-regulatory elements that are classically defined as cis-acting DNA
sequences (Panigrahi and O’Malley, 2021). As short DNA elements, enhancers could be bound by
transcription factors and increase the transcription of one or more target genes over long distances
in a tissue- and cell-type-specific manner (Kim et al., 2019; Pennacchio et al., 2013). Although it
is believed that most enhancers appear to regulate their neighboring genes, transcriptional
enhancers can also be located upstream or downstream of their target genes, within the introns,
and even on separate chromosomes (Lomvardas et al., 2006; Pennacchio et al., 2013; Visel et al.,
2009b). Enhancer elements play essential roles in genetic regulatory circuitry and generate cell-
type-specific transcriptional responses (Sur and Taipale, 2016). A great number of studies have
suggested that in the human genome, the variation of distant-acting enhancers can contribute to
diseases (Emison et al., 2005; Kleinjan and Van Heyningen, 2005; Lettice, 2003). Specifically,
enhancer malfunction is involved in many developmental and disease-relevant processes,
including various cancers caused by the aberrant regulation of oncogenes (Kulaeva et al., 2012;
Sur and Taipale, 2016; Visel et al., 2009b). Therefore, identifying enhancers and determining their
functions on their target genes are considered important for addressing complex human diseases.
However, it is always a challenge to identify all enhancers and determine their functions.
First, enhancers are distributed among the vast non-coding portion of the genome (nearly 98%),
resulting in a large search space (Pennacchio et al., 2013; Visel et al., 2009a). Second, in some
cases, individual enhancers can regulate multiple genes over a long distance regardless of their
orientation (Mohrs et al., 2001). Namely, those enhancers could skip the respective closest genes
and then regulate genes located more distantly (Lettice, 2003; Visel et al., 2009b). Besides, in contrast to well-defined protein-coding sequences, the limited understanding of enhancer sequences makes it computationally difficult to identify enhancers from DNA sequence alone with high confidence (Pennacchio et al., 2013). Finally, enhancers and their functions are considered tissue- and cell-type-specific, which adds further complexity to annotating enhancers to genes (Kim et al., 2019; Pennacchio et al., 2013).
Despite the difficulties mentioned above, efforts to identify and locate enhancers on the
genome scale never stopped in the past decades. In the beginning, comparative genomics was
applied to identify enhancers. It was found that enhancers are deeply enriched in highly conserved
non-coding regions between different vertebrates and mammalian species, especially in their early
development (Bejerano et al., 2004). Notably, a transgenic mouse assay was used to compare the
embryonic enhancer activity by systematically testing hundreds of highly conserved non-coding
human genome sequences and revealed that about half of these non-coding elements were
enhancers (Visel et al., 2008). These findings identified enhancers from non-coding sequences and
suggested that enhancers constitute a significant type of functional elements in the non-coding
regions of the genome.
Even with the initial success of comparative genomics in identifying enhancers in the non-
coding regions, there are still some limitations. For example, in the experiment of transgenic assays
described above, it was challenging to demonstrate the function of those enhancers located in
highly conserved genomic sequences (Visel et al., 2008). One possible explanation is that currently
available assays are restricted to specific time points of development. Since those enhancers were
identified from highly conserved sequences, it is also possible that they are conserved due to their
essential functions instead of being enhancers (Pennacchio et al., 2013). Additionally, evidence that a conserved sequence might be an enhancer does not indicate when and where that enhancer would function within the human genome (Pennacchio et al., 2013). Interestingly, more recent studies have revealed that a majority of enhancers are not conserved across species, which exposes the limitations of comparative genomics in identifying enhancers (Blow et al., 2010; Schmidt et al., 2010).
With the advance of high-throughput sequencing, some of these limitations could be
addressed using more sophisticated methods that generate genome-wide maps of chromatin marks
to identify the location of enhancers and other regulatory regions. In this case, ChIP–seq
(chromatin immunoprecipitation followed by high-throughput sequencing) is one of the most
potent approaches in identifying enhancers because it is independent of DNA conservation and
defines enhancer catalogs directly from studies on specific cells or tissues (Pennacchio et al., 2013).
Marks usually used for identifying candidate enhancers include p300, H3K4me1 (H3
monomethylated at K4), and H3K27ac (histone H3 acetylated at lysine 27). Through experimental
validation using enhancer reporter vectors in vitro or in vivo, these molecular tools have been
proven helpful for enhancer identification in various mammalian species (Heintzman et al., 2007;
Visel et al., 2009a).
It is estimated that ~1 million enhancers in the human genome regulate ~20,000 protein-
coding genes, and many of these links have disease implications (Cho, 2012; Consortium, 2012).
Therefore, understanding the functional relevance of these elements will be crucial in elucidating
the cause of complex human diseases. While it is challenging to map functions directly to
enhancers, it is possible to do so by mapping the target genes regulated by them. These regulatory
relationships between enhancers and their target genes can be identified as enhancer-gene links
that appear to influence human development, disease, and homeostatic cellular processes
(Adhikary et al., 2021; Mills et al., 2020).
Despite the importance of building enhancer-gene links, it remains an issue to determine
which enhancer could regulate which target genes. Recent experiments have discovered many
highly valuable enhancer-gene links in specific cell types or tissues, but most enhancer-gene links
remain unknown (Gasperini et al., 2020; Schoenfelder and Fraser, 2019). Given their association with human diseases, there is an urgent need to build a vast regulatory network connecting all detectable enhancers and their target genes. High-throughput approaches provide additional data to predict
enhancer-gene links. These new experimental methods include PCHi-C (promoter capture Hi-C, a genome-wide chromosome conformation capture assay) (Schoenfelder et al., 2018) and genome-wide screens using CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) (Fulco et al., 2019; Li et al., 2020).
available for all cell types. Moreover, although these enhancer-gene links could be highly reliable
and validated by multiple approaches experimentally, they are not easily accessible to researchers
because they are scattered in different places on the Internet.
Several computational projects have been designed to build enhancer-gene link databases
through existing data (Fishilevich et al., 2017; Gao et al., 2016; Gao and Qian, 2020; Jiang et al.,
2019; Schoenfelder et al., 2018; Wang et al., 2019; Wang et al., 2018). However, although they
utilize the existing publicly available data to generate putative enhancer-gene links, further
essential information about enhancer-gene links cannot be obtained through these databases (Mills
et al., 2020). For example, it is hard to access cell type information in these databases, so researchers cannot associate the enhancer-gene links with particular cell types (Fishilevich et al., 2017). Besides, some tools only expose their predicted enhancer-gene links through their websites, with no option to download them, which makes it difficult to use them in large-scale analyses (Fishilevich et al., 2017; Jiang et al., 2019; Wang et al., 2019).
Previously, a publicly accessible interactive database, PEREGRINE, was introduced to
address all these issues. It collects and summarizes publicly available data from ChIA-PET, eQTL,
and Hi-C assays to connect 449,627 enhancers with 17,643 protein-coding genes across 78 cell
and tissue types in the human genome (Mills et al., 2020). In contrast to other resources with
enhancer-gene link prediction, these PEREGRINE links are much more comprehensive and carry relevant experimental evidence with cell type information. They can be downloaded (www.peregrineproj.org) or searched online by variants and putative target gene(s) of interest (in partnership with the PANTHER System, www.pantherdb.org). In addition, since PANTHER
is a comprehensive resource for gene function annotation, the PEREGRINE links allow functions
to be associated with these non-coding regions of the genome (Mills et al., 2020).
Although PEREGRINE has arranged a considerable number of putative enhancer-gene
links based on experimental data, there is no guarantee that all of the links are true. In order to assess the likelihood that each enhancer-gene link is true, a confidence score should be assigned
in a cell-type-specific manner. For this purpose, building a predictive model was considered, which
could provide a statistically validated score for each candidate enhancer-gene link in a specific cell
type. The score represents the probability of being a positive enhancer-gene link for each putative
link in each test dataset through machine learning prediction. Through this confidence score, a
putative enhancer-gene link could be judged as positive or negative. Therefore, the first step is to
collect a proper amount of experimentally validated enhancer-gene links in the desired cell types.
Those links constitute a true positive set and a true negative set, respectively. Both of them are
required for the prediction and validation processes of a prediction model. If a model has a good
performance in predicting the experimentally validated enhancer-gene links as true links and
identifying the enhancer-gene links known to be inactive as false links, this model would be
considered valid, and the scores provided by it would be regarded as reliable. In this case, machine
learning came into consideration.
Machine learning is the study of computer algorithms that improve themselves through a “learning” experience (Dutton and Conroy, 1997). Generally, it can be classified as supervised or unsupervised learning according to whether the response variable in the training datasets is used by the algorithm. There are numerous machine learning approaches, and several could be used for the task at hand. For example, in enhancer-gene link prediction, it is most practical to develop a supervised algorithm that uses both input and output data and treats the task as a binary classification problem, using methods such as logistic regression, random forest, bootstrapping, and elastic net. Then, according to the prediction results of machine learning models with good performance, putative enhancer-gene links can be labeled positive or negative based on a set of relevant characteristics.
In order to predict enhancer-gene links more precisely in machine learning models, it is
crucial to select an appropriate set of features from the fundamental characteristics of both active
enhancers and genes. The features used in machine learning models include characteristics of the
active state of the enhancer, the gene, and the enhancer and the gene simultaneously (Mills, 2021).
Nine main features in the training datasets can be downloaded from ENCODE and GTEx databases.
They include H3K27ac for enhancers, H3K4me1 for enhancers, H3K4me3 for promoters, binding
of histone acetyltransferase p300, eQTL combined Z-score, eQTL averaged coefficient (absolute
values of original coefficients), H3K27ac for promoters, if the gene in an enhancer-gene link is
considered the nearest gene to that enhancer, and if the enhancer is located in the intronic region
of each gene. Those features are considered to be highly relevant to enhancer activity on target
genes. For example, H3K4me1 (histone 3 monomethylated at lysine 4) is widely considered to be
an essential chromatin modification at many enhancers (Ong and Corces, 2011). Also, the presence
of H3K27ac (histone 3 acetylated at lysine 27) appears to suggest that an enhancer is active, and
the modification by H3K27me3 (histone 3 trimethylated at lysine 27) specifies that an enhancer is
poised for activation (Ong and Corces, 2011). The histone acetyltransferase p300 is a highly expressed component of enhancer-associated protein assemblies, and its binding is well established as a marker of an enhancer's active status (Visel et al., 2009a). Besides those nine main features,
some interaction terms were also created between the features only for enhancers and only for
promoters. The purpose of generating interaction terms is to create features that could represent
information of both sides in an enhancer-gene link rather than individual enhancer or gene
characteristics.
Previously, 168 experimentally validated enhancer-gene links from 57 peer-reviewed
papers in four widely used cancer cell lines (HepG2, HCT116, K562, and MCF7) were selected to train
predictive machine learning models (Mills, 2021). The purpose of this research was to generate
valid machine learning models that can precisely predict which enhancer-gene links are positive
in a particular cell type. The results indicate that balanced random forest models outperform other algorithms in providing statistically validated scores and predicting whether an enhancer-gene link is positive in each cell line (Mills, 2021).
In this study, I will continue working on improving the predictive models. Basically, a
model with more features is likely to have a better performance score, like AUC (Area Under the
ROC Curve), but it can also lead to overfitting. Although all features selected in the study are
biologically associated with the active state of an enhancer-gene link, it is still desirable to test if
each feature can significantly contribute to the prediction performance of a machine learning
model in a cell-type-specific manner. Further, it is also interesting to examine if a missing feature
model could also be informative and powerful enough to provide validation scores for those cell
types without complete data of all 17 features. Therefore, I will work on feature selection and
investigate whether dropping some features from the full feature set will seriously influence the
model performance in a particular cell type or not.
Since enhancer-gene links are supposed to be cell-type specific, it is also necessary to
consider the applicability of the predictive models in other tissues and cell types. Despite the
success of building predictive models based on HepG2, K562, and HCT116 cell lines, I am still
curious if those cell-type-specific models using different training datasets will provide similar
score distributions for enhancer-gene links in a new cell type. In order to explore prediction
differences on other cell types, four additional cell lines that are highly cited (GM12878, H1,
HeLa-S3, and SK-N-SH) are selected. Previous balanced random forest models trained by HepG2,
K562, and HCT116 will be applied to those new cell types and provide a confidence score for each
enhancer-gene link in each test dataset. Score distributions across training models and test cell types are compared to reveal any prediction differences among the test cell types and the training models.
Chapter 2: Method
Data collection
Features
The feature set used in my predictive models is the same as previously described (Mills,
2021). Briefly, there are up to nine main features and eight interaction terms in each cell-type-
specific dataset of cis enhancer-gene links (Table 1). Among the nine main features, two eQTL
features were obtained from GTEx databases according to the tissue type of a selected cell line,
including a measure of significance and a measure of effect for eQTL data. In addition, five
features of ChIP-seq data were obtained from ENCODE, targeting epigenetic marks of active enhancers (H3K27ac, H3K4me1, and binding of histone acetyltransferase p300) and of actively regulated promoters (H3K4me3, H3K27ac). Last, two binary variables were generated to
record if the gene in an enhancer-gene link was located nearest to the enhancer and if the enhancer
was located in the gene's introns. In addition, BEDTools was used to gather all enhancers within 1 Mb of each gene's transcription start site.
Based on the characteristics of the main features, six interaction terms were created between enhancer-only and promoter-only features (shown in Table 1). I also considered an interaction term between the eQTL combined Z-score and the eQTL averaged coefficient because a significant positive correlation was detected between them. Furthermore, the interaction between the two active enhancer marks, H3K4me1(enhancer) and H3K27ac(enhancer), was also used in the following analysis. Overall, there are eight interaction terms in total. Subsequently, all continuous variables (twelve total) were mean-centered and standardized.
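As a small, purely illustrative sketch of these two preprocessing steps in R (the data frame "links" and the column names below are assumptions, not the actual objects used in this work):

# Illustrative preprocessing sketch (assumed data frame "links" and column names).
# Build one interaction term between the two active enhancer marks.
links$H3K27ac_e_x_H3K4me1_e <- links$H3K27ac_e * links$H3K4me1_e

# Mean-center and standardize the continuous features before model training.
continuous_cols <- c("H3K4me1_e", "p300_e", "eQTL_Zscore", "eQTL_abs_coef",
                     "H3K27ac_p", "H3K27ac_e_x_H3K4me1_e")   # assumed names, not exhaustive
links[continuous_cols] <- scale(links[continuous_cols])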
Feature | Description | Value
H3K27ac_e | A histone acetylation mark commonly associated with active enhancers | 1: high value; 0: not high value of ChIP-seq peak
H3K4me1_e | A methylation mark commonly associated with active enhancers | Positive continuous value (score of the ChIP-seq peak)
H3K4me3_p | A methylation mark commonly associated with the promoters of genes being actively enhanced | 1: high value; 0: not high value of ChIP-seq peak
p300 binding_e | Binding of p300, a histone acetyltransferase well established as a marker of active enhancers, to the enhancer | Positive continuous value (score of the ChIP-seq peak)
eQTL combined Z score | eQTL p-values transformed into a combined Z-score for each enhancer with eQTL(s) pointing to the same gene (a measure of statistical significance) | Positive continuous value
eQTL averaged coefficient (absolute values of original coefficients) | If multiple eQTLs for the same gene are located in the same enhancer, the average of the absolute values of the coefficients (a measure of effect) of these eQTLs | Positive continuous value
H3K27ac_p | A histone acetylation mark commonly associated with promoters of genes being actively enhanced | Positive continuous value (score of the ChIP-seq peak)
nearest gene | Is the gene in this link the enhancer's nearest gene? | 1: yes; 0: no
intronic | Is the enhancer located in an intron? | 0: enhancer not located in an intron; 1: enhancer located in the intron of a gene; 2: enhancer located in a different gene
H3K27ac_e * H3K27ac_p | Interaction term of H3K27ac marks on the enhancer and the promoter | Positive continuous value
H3K27ac_e * H3K4me3_p | Interaction term of H3K27ac marks on the enhancer and H3K4me3 marks on the promoter | 1: both H3K27ac_e and H3K4me3_p are positive; 0: at least one of H3K27ac_e and H3K4me3_p is not positive
H3K27ac_e * H3K4me1_e * H3K4me3_p | Interaction term of H3K27ac marks on the enhancer, H3K4me1 marks on the enhancer, and H3K4me3 marks on the promoter | Positive continuous value
H3K27ac_e * H3K4me1_e * H3K27ac_p | Interaction term of H3K27ac marks on the enhancer, H3K4me1 marks on the enhancer, and H3K27ac marks on the promoter | Positive continuous value
H3K4me1_e * H3K4me3_p | Interaction term of H3K4me1 marks on the enhancer and H3K4me3 marks on the promoter | Positive continuous value
H3K4me1_e * H3K27ac_p | Interaction term of H3K4me1 marks on the enhancer and H3K27ac marks on the promoter | Positive continuous value
eQTL * eQTL average absolute coefficient | Interaction term of the two eQTL features | Positive continuous value
H3K27ac_e * H3K4me1_e | Interaction term of H3K27ac marks on the enhancer and H3K4me1 marks on the enhancer | Positive continuous value
Table 1: Feature Summary. Subscripts of features listed above are abbreviated. Among them, “e” indicates the feature is a characteristic of the active state of the enhancer, and “p” indicates the feature is a characteristic of the active state of the promoter. There are up to 17 features used in my analysis, including nine main features and eight interaction terms. A detailed description of each feature is given in the second column, “Description.” Regarding data type, five features were designed as categorical; except for intronic, the other four are binary variables. The remaining twelve features are continuous and take non-negative values. All continuous variables were subsequently mean-centered and standardized before being used in model training.
Training datasets
All training datasets with experimentally validated enhancer-gene links are from previous
work, involving the HepG2, K562, MCF7, and HCT116 cell lines (Mills, 2021). However, because it has fewer true positive links, MCF7 was only used to validate model performance. In addition, since there was no p300 data for the HCT116 cell type, the p300 feature was not used in building the HCT116 model.
Training Dataset | Feature Number | Total Links | Positive Links | Negative Links
HepG2 | 17 | 338 | 38 | 300
K562 | 17 | 360 | 60 | 300
HCT116 | 16 (w/o p300 binding) | 443 | 43 | 400
MCF-7 | 17 | 327 | 27 | 300
HepG2+K562 | 17 | 662 | 98 | 564
HepG2+HCT116 | 16 (w/o p300 binding) | 628 | 81 | 547
K562+HCT116 | 16 (w/o p300 binding) | 599 | 103 | 496
HepG2+K562+HCT116 | 16 (w/o p300 binding) | 902 | 141 | 761
Table 2: Training Dataset Summary. Eight enhancer-gene link datasets with experimentally validated true positive enhancer-gene links (four cell-type-specific datasets and four joint training datasets). The p300 binding data were not available for any dataset involving the HCT116 cell line. The number of observations (enhancer-gene links) in each dataset is shown above, along with its numbers of positive and negative links.
Joint training datasets included enhancer-gene links from more than one cell type out of
HepG2, K562, and HCT116. They were generated by pooling individual cell-type-specific training
data. Thus, there was a total of four joint datasets, HepG2+K562, HepG2+HCT116,
K562+HCT116, and HepG2+K562+HCT116. Notably, only the HepG2+K562 joint dataset had the full set of features (17 in total), while the rest lacked the p300 data. The numbers of total enhancer-gene links, true positive links, and true negative links in each training dataset are indicated in Table 2.
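A joint training set can be produced with a simple row bind; for example (the data frame names below are assumptions for illustration):

# Pool cell-type-specific training tables; drop the p300 column whenever
# HCT116 (which lacks p300 data) is part of the pool.
hepg2_k562 <- rbind(hepg2_df, k562_df)
hepg2_k562_hct116 <- rbind(hepg2_df[, setdiff(names(hepg2_df), "p300_e")],
                           k562_df[, setdiff(names(k562_df), "p300_e")],
                           hct116_df)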
Test datasets
Four test datasets without experimentally validated enhancer-gene links were collected in
GM12878, HeLa-S3, SK-N-SH, and H1 cell lines. Those four cell types were selected based on
data accessibility. According to my investigation, only seven cell lines have the full set of
ENCODE data required for learning models. Three of them (HepG2, K562, and MCF7) have
already been used in the training section. The remaining four (GM12878, HeLa-S3, SK-N-SH, and
H1) were selected as outside test datasets and used to study the differential score distributions in
subsequent cell-type-specific analysis. Each test dataset included the full set of PEREGRINE
enhancer-gene links (n = 17,354,145). Among accessible tissues for eQTL data in the GTEx
database, I decided to use the Cells EBV-transformed lymphocytes for GM12878, the uterus for
HeLa-S3, and the adrenal gland for SK-N-SH. No eQTL data were available for the H1 cell line, so there were no eQTL Z-score or eQTL absolute coefficient features in the H1 test dataset. The other three datasets have the full set of features (17 features in total).
Machine learning methods
Learning Models
Prediction of enhancer-gene links can be cast as a machine learning classification problem
with a binary outcome. A list of supervised learning approaches could be considered to achieve
this purpose. In previous work (Mills, 2021), balanced random forest models performed better than other techniques and were therefore applied in the following analysis. To compare the performance of predictive models trained on different training datasets and with different feature combinations, the AUC (Area Under the ROC Curve) was used as the criterion for prediction performance. Repeated 3-fold cross-validation with 30 repeats was performed to calculate averaged AUC values (also called Mean CV AUCs). In addition, test AUCs on other datasets were used to validate model performance.
In order to eliminate the effect of several enhancers associated with the same gene,
blocking was added to my models based on the gene ID, which means enhancer-gene links with the same gene ID were assigned to the same block. In this way, enhancer-gene links involving the same gene remain in the same training or test split across the repeated 3-fold cross-validation iterations (Mills, 2021).
According to the number of cell types in each training dataset, predictive models can be
classified as single models or joint models. A single model is a model trained on an individual cell type; it can also be called a cell-type-specific model. In contrast, a joint model is trained on a pooled training dataset that includes enhancer-gene links from more than one cell type.
All machine learning analyses were performed using the mlr package in R.
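As a rough sketch of how this setup can be wired together with the mlr package (the data frame train_df, the columns positive_link and gene_id, and the vector feature_cols are assumptions for illustration; the exact class-balancing settings of the original balanced random forest are not reproduced here):

library(mlr)

# Classification task with gene-ID blocking so that links sharing a gene
# stay in the same cross-validation split; gene_id itself is not a feature.
task <- makeClassifTask(data = train_df[, c(feature_cols, "positive_link")],
                        target = "positive_link", positive = "1",
                        blocking = factor(train_df$gene_id))

# Random forest learner with probability predictions; class balancing
# (e.g., down-sampling the majority class) is assumed to be configured separately.
lrn <- makeLearner("classif.randomForest", predict.type = "prob")

# Repeated 3-fold cross-validation with 30 repeats, evaluated by AUC.
rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 30)
res <- resample(lrn, task, resampling = rdesc, measures = auc)
res$aggr   # Mean CV AUC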
Feature selection
Two methods were applied to evaluate the prediction importance of each feature in a
specific learning model.
First, by including only one feature each time or dropping only one feature from the full
feature set each time, I aim to determine which features contribute most to the prediction
performance by comparing the AUC values between missing feature models and full feature
models. It is worth noting that full feature models here are machine learning models trained with the full set of features, whereas missing feature models are predictive models trained with only a subset of features (a missing feature model may include as few as one feature; in every case the number of features is smaller than the full set). For a given training dataset, all models were developed under the same conditions except for the feature sets used. Averaged AUCs from repeated 3-fold cross-validation were the primary criterion for comparing how predictive each feature was. In addition, each model was also evaluated by test AUCs obtained on data from other cell types. When only one feature is included, features yielding higher AUC values are more important for predicting enhancer-gene links in a particular cell type; conversely, when one feature is dropped from the full feature set, features whose removal lowers the AUC contribute more to model performance.
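A minimal sketch of the add-one and drop-one loops, reusing the assumed objects from the previous snippet (train_df, feature_cols, lrn, rdesc) and omitting blocking for brevity:

# Mean CV AUC when each feature is used alone (positive_link is assumed to be a factor).
single_auc <- sapply(feature_cols, function(f) {
  t1 <- makeClassifTask(data = train_df[, c(f, "positive_link")],
                        target = "positive_link", positive = "1")
  resample(lrn, t1, rdesc, measures = auc, show.info = FALSE)$aggr
})

# Mean CV AUC when each feature is dropped from the full set.
drop_one_auc <- sapply(feature_cols, function(f) {
  keep <- setdiff(feature_cols, f)
  t2 <- makeClassifTask(data = train_df[, c(keep, "positive_link")],
                        target = "positive_link", positive = "1")
  resample(lrn, t2, rdesc, measures = auc, show.info = FALSE)$aggr
})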
The second method used to assess feature importance was the filter method in the mlr package of R, which assigns an importance value (also called a filter value) to each feature. Based on this approach, features can be ranked by importance (a continuous number from 0 to 1, from low to high importance), and the most appropriate number of features can be determined for each model (Schiffner et al., 2016). Following the rank of feature importance values, the most critical features were added to the predictive model one by one, for every single model and joint model, until all features with positive importance values were included. Mean cross-validation AUCs were used to compare model performance across the different feature combinations.
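As an illustrative stand-in for the mlr filter ranking (the exact filter method used here is not restated), the sketch below ranks features by random forest variable importance computed directly with the randomForest package and then adds features one at a time, again reusing the assumed objects from the earlier snippets:

library(randomForest)

# Rank features by mean decrease in Gini impurity from a random forest fit.
rf <- randomForest(x = train_df[, feature_cols], y = train_df$positive_link,
                   ntree = 500, importance = TRUE)
imp <- importance(rf, type = 2)
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

# Add features one by one in decreasing order of importance and record Mean CV AUC.
auc_by_k <- sapply(seq_along(ranked), function(k) {
  tk <- makeClassifTask(data = train_df[, c(ranked[1:k], "positive_link")],
                        target = "positive_link", positive = "1")
  resample(lrn, tk, rdesc, measures = auc, show.info = FALSE)$aggr
})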
Three single models (HepG2, K562, and HCT116) and two joint models (HepG2+K562
and HepG2+K562+HCT116) were selected to perform this pipeline.
Cell-type-specific analysis
Some new outside test datasets were introduced into this experiment. Due to the missing
eQTL data in the H1 cell line, only enhancer-gene links in GM12878, HeLa-S3, and SK-N-SH
were provided with probabilities of being positive links based on learning models using different
training datasets (including two single models, the HepG2-trained model and the K562-trained
model, and two joint models, the HepG2+K562-trained model and the HepG2+K562+HCT116-
trained model). The probabilities of being a positive link are treated as scores here. Student's t-test was used to evaluate the difference in mean scores of all enhancer-gene links between two test datasets or between two machine learning models. Furthermore, boxplots and histograms
were used to visually compare score distributions among four learning models and three test
datasets.
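A brief sketch of this scoring and comparison step with mlr and base R (the trained task and learner from the earlier sketch, the test data frames gm12878_df and hela_df, and the resulting hela_scores vector are assumptions for illustration):

# Train on a chosen training dataset and score a new test dataset.
mod  <- train(lrn, task)
pred <- predict(mod, newdata = gm12878_df)
gm12878_scores <- getPredictionProbabilities(pred)   # probability of being a positive link

# Compare mean scores between two test datasets and visualize the distributions.
t.test(gm12878_scores, hela_scores)
boxplot(list(GM12878 = gm12878_scores, "HeLa-S3" = hela_scores),
        ylab = "Predicted probability of a positive link")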
Chapter 3: Results
Prediction performance of missing feature models
Feature characteristics of enhancer activities on target genes were included in a feature set
for modeling and used to predict the positive or negative response for a given enhancer-gene link
in a cell-type-specific manner. Nine main features and eight interaction terms were selected and
applied to machine learning modeling in HepG2, K562, and HCT116 cell types, respectively.
According to the previous work (Mills, 2021), balanced random forest machine learning models with a full set of features were shown to have good prediction performance regardless of which training datasets were used (Mean CV AUCs for full feature models: 0.90 in HepG2; 0.83 in K562; and 0.89 in HCT116). After the three single models were built, test AUCs obtained from enhancer-gene link prediction in other cell types were also used to estimate the prediction performance of each model. For example, when the effectiveness of a machine learning model trained on HepG2 data is evaluated, in addition to the averaged AUC value from repeated cross-validation, the test AUCs on the K562 and MCF7 datasets also reflect its performance. HCT116 was not used to
test the HepG2-trained model here since its p300 binding data were missing.
However, the mean AUC values from repeated cross-validation were considerably
different from the corresponding test AUCs in other cell types. In particular, for a balanced random
forest model trained in HepG2 cell type, the mean AUC value from repeated cross-validation was
0.90, whereas the test AUC of the same learning model in K562 was only 0.80. Additionally, for
the K562-trained balanced random forest model, the mean AUC from repeated cross-validation
was 0.83, but unexpectedly, the test AUC in HepG2 was 0.91. Usually, the AUCs for test data
would be lower than the AUCs for training data because of the overfitting in the training process.
However, my results showed a contradictory case where the test AUC in HepG2 was much greater
than the mean cross-validation AUC in K562. Considering that the true positive enhancer-gene links were gathered differently for K562 than for the other training datasets, this is possibly due to experimental differences among the various projects. However, another explanation could also make sense: the significant differences between test AUC values and cross-validation AUC values likely indicate that predicting enhancer-gene links with machine learning is highly cell-type-specific.
To figure out why the test AUCs were different from the averaged AUCs from repeated
cross-validation in my prediction models, each feature was investigated individually to assess its importance for prediction performance in each cell type. Here, feature importance was estimated in three ways (including only one feature, dropping one feature, and the filter method), and I tested whether similar results could be obtained from all of them.
Include only one feature in single models
In order to determine the importance of an individual feature in model performance, only
one feature was added to the balanced random forest machine learning model each time in HepG2,
K562, and HCT116. The response in each model was unchanged (whether the enhancer-gene link is true or false), but only one feature was used to predict the response each time. As shown in Table 3, averaged AUCs from repeated cross-validation were compared among models trained on the same cell type under the same training conditions. Test AUC values on other datasets are provided alongside each model. A higher AUC value indicates that the feature performs better at predicting enhancer-gene links.
Columns (left to right): the single feature included in the model; K562-trained model (Mean CV AUC, Test AUC in HepG2, Test AUC in MCF7); HepG2-trained model (Mean CV AUC, Test AUC in K562, Test AUC in MCF7); HCT116-trained model (Mean CV AUC, Test AUC in HepG2, Test AUC in K562, Test AUC in MCF7).
H3K27ace 0.50 0.50 0.50 0.70 0.70 0.85 0.70 0.70 0.70 0.85
H3K4me1e 0.73 0.63 0.87 0.64 0.62 0.65 0.79 0.66 0.58 0.65
H3K4me3p 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50
p300 bindinge 0.73 0.63 0.74 0.62 0.73 0.74 - - - -
eQTL Zscore 0.60 0.63 0.57 0.55 0.63 0.61 0.60 0.60 0.68 0.62
eQTL absolute value 0.63 0.56 0.57 0.55 0.35 0.44 0.53 0.52 0.63 0.57
H3K27acp 0.51 0.58 0.47 0.55 0.54 0.55 0.49 0.44 0.50 0.67
nearest gene 0.54 0.82 0.73 0.83 0.54 0.73 0.76 0.82 0.54 0.73
intronic - - - 0.69 0.52 0.53 0.68 0.68 0.52 0.53
H3K27ace * H3K27acp 0.64 0.59 0.54 0.67 0.70 0.68 0.64 0.59 0.72 0.71
H3K27ace * H3K4me3p 0.50 0.50 0.50 0.64 0.62 0.66 0.69 0.65 0.62 0.66
H3K27ace * H3K4me1e * H3K4me3p 0.74 0.62 0.64 0.63 0.54 0.66 0.70 0.63 0.61 0.65
H3K27ace * H3K4me1e * H3K27acp 0.67 0.67 0.84 0.69 0.60 0.81 0.66 0.65 0.61 0.72
H3K4me1e * H3K4me3p 0.68 0.54 0.52 0.65 0.58 0.65 0.79 0.66 0.58 0.65
H3K4me1e * H3K27acp 0.58 0.49 0.50 0.61 0.52 0.46 0.64 0.55 0.49 0.49
eQTL * eQTL absolute coefficient 0.59 0.50 0.51 0.60 0.61 0.52 0.58 0.49 0.38 0.39
H3K27ace * H3K4me1e 0.79 0.64 0.85 0.68 0.59 0.84 0.71 0.64 0.64 0.84
Full Model 0.83 0.91 0.84 0.90 0.80 0.86 0.89 0.90 0.70 0.82
Table 3: Prediction performance in Random Forest models trained by individual cell types with a single
feature. The subscripts of features are abbreviated in the first column. Among them, “e” suggests the feature is a characteristic
of the active state of the enhancer, and “p” represents the feature as a characteristic of the active state of the promoter. It is
impossible to include intronic alone in the K562-trained model due to the unbalanced data structure; thus, AUC values are missing in the corresponding cells (marked with -). For HCT116, no p300 data are available, so those AUCs are missing too (also marked with -). For each feature, experiments were repeated in the K562-trained, HepG2-trained, and HCT116-trained models. Under the same learner, both Mean CV AUCs and test AUCs for other cell types were obtained and can be compared with the AUCs from the full feature models. The performance of the full feature models is shown in the last row of the table.
In HepG2-trained models, the nearest gene is the most important feature (Mean CV
AUC=0.83), and no other feature has a particularly powerful effect on predicting enhancer-gene links (all Mean CV AUC<0.70). However, in K562-trained models, more features show good performance individually (Mean CV AUCs = 0.79 for H3K27ac(enhancer)*H3K4me1(enhancer), 0.74 for H3K27ac(enhancer)*H3K4me1(enhancer)*H3K4me3(promoter), 0.73 for p300, and 0.73 for H3K4me1(enhancer)). For the HCT116 cell line, the features showing good performance include the best features from K562 and HepG2, such as the nearest gene, H3K4me1(enhancer), and H3K27ac(enhancer)*H3K4me1(enhancer), plus H3K27ac(enhancer)*H3K4me3(promoter), even though the p300 binding feature is missing. I could not obtain the prediction performance for intronic in K562 because the data are extremely unbalanced (only two observations with intronic = 1 out of 460 observations in total).
The test AUC values in Table 3 also support the findings from the Mean CV AUCs. For example, despite its weak effect on predicting enhancer-gene links in K562 (Mean CV AUC = 0.54), the nearest gene still had a high test AUC in HepG2 using the K562-trained model (test AUC = 0.82). Similarly, in HepG2-trained models, the p300 binding also had a high test AUC in K562 (test AUC = 0.73), even though the Mean CV AUC was only 0.62. These results indicate that regardless of which training dataset is used to build a machine learning model, the strong associations between features and the response in a particular cell type do not change significantly.
Dropping one feature from the full feature set in single models
I continued to determine which features are more critical in a specific training dataset by
dropping one feature each time from the full feature set and evaluating the model performance
with AUC measurements. In HepG2-trained models, the nearest gene was the only feature that
impacted the prediction performance after it was dropped (CV Mean AUC = 0.83 w/o the nearest
gene compared with 0.90 in a full feature model). In K562-trained models, p300 showed a
significant impact on the model performance when it was dropped from the full feature set (CV
Mean AUC = 0.78 w/o the p300 binding compared with 0.83 in a full feature model). No apparent
changes in model performance were detected in HCT116 in this experiment.
Columns (left to right): the feature dropped from the full feature set; K562-trained model (Mean CV AUC, Test AUC in HepG2, Test AUC in MCF7); HepG2-trained model (Mean CV AUC, Test AUC in K562, Test AUC in MCF7); HCT116-trained model (Mean CV AUC, Test AUC in HepG2, Test AUC in K562, Test AUC in MCF7).
H3K27ace 0.82 0.91 0.83 0.90 0.79 0.86 0.89 0.90 0.71 0.81
H3K4me1e 0.82 0.90 0.85 0.90 0.80 0.86 0.89 0.91 0.71 0.81
H3K4me3p 0.83 0.91 0.84 0.90 0.81 0.86 0.89 0.90 0.71 0.82
p300 bindinge 0.78 0.89 0.81 0.90 0.74 0.84 - - - -
eQTL Zscore 0.83 0.91 0.86 0.89 0.79 0.86 0.89 0.89 0.72 0.83
eQTL absolute value 0.83 0.89 0.83 0.90 0.80 0.87 0.89 0.90 0.71 0.82
H3K27acp 0.83 0.90 0.84 0.90 0.80 0.87 0.89 0.90 0.72 0.82
nearest gene 0.83 0.83 0.81 0.83 0.80 0.82 0.88 0.88 0.71 0.81
intronic 0.83 0.91 0.84 0.90 0.79 0.89 0.87 0.89 0.71 0.86
H3K27ace * H3K27acp 0.83 0.91 0.84 0.90 0.80 0.86 0.89 0.90 0.72 0.83
H3K27ace * H3K4me3p 0.83 0.90 0.85 0.90 0.82 0.86 0.89 0.90 0.72 0.84
H3K27ace * H3K4me1e * H3K4me3p 0.83 0.91 0.85 0.90 0.81 0.86 0.89 0.91 0.70 0.84
H3K27ace * H3K4me1e * H3K27acp 0.83 0.91 0.85 0.90 0.81 0.87 0.89 0.89 0.71 0.84
H3K4me1e * H3K4me3p 0.82 0.89 0.84 0.90 0.80 0.86 0.89 0.91 0.72 0.82
H3K4me1e * H3K27acp 0.83 0.90 0.82 0.90 0.79 0.87 0.89 0.90 0.74 0.85
eQTL * eQTL absolute coefficient 0.83 0.91 0.85 0.90 0.81 0.87 0.89 0.90 0.71 0.85
H3K27ace * H3K4me1e 0.82 0.91 0.84 0.90 0.80 0.86 0.89 0.90 0.73 0.81
Full Model 0.83 0.91 0.84 0.90 0.80 0.86 0.89 0.90 0.70 0.82
Table 4: Prediction performance in Random Forest models trained by individual cell types with dropping one
feature from the full set. Subscripts of features listed above are abbreviated. Among them, “e” suggests the feature is a
characteristic of the active state of the enhancer, and “p” represents the feature as a characteristic of the active state of the promoter.
Since p300 binding data are missing in HCT116, there are no AUC values for that row in the HCT116-trained columns (marked with -). For each feature, experiments were repeated in the K562-trained, HepG2-trained, and HCT116-trained models. Under the same learner, both Mean CV AUCs and test AUCs for other cell types were obtained and compared with the AUCs from the full-feature models.
Compared with the earlier experiments in which a single feature was added at a time to the predictive models, these new findings filter out interaction terms that previously performed well (such as H3K27ac(enhancer)*H3K4me1(enhancer), H3K27ac(enhancer)*H3K4me1(enhancer)*H3K4me3(promoter), and H3K27ac(enhancer)*H3K4me3(promoter)). It becomes apparent now that p300
binding is the most important feature for prediction performance in the K562-trained model, and
the nearest gene is the most crucial feature in the HepG2-trained model.
Finding the best feature combination in a missing feature model
According to the two experiments adding or dropping one feature each time, I had a general
idea of which features could contribute to the model performance most in HepG2, K562, and
HCT116. At the same time, some features in the full feature set did not show a satisfactory effect, such as H3K4me3(promoter) (AUC = 0.50 in all training models). What are their roles in a machine learning model? How do they contribute to the prediction performance? Next, I tried to remove some weak features and find a feature combination that provides prediction performance as good as the full feature model in a cell-type-specific manner. Despite the expectations I already had, it is still challenging to determine which feature combinations predict better than the full feature models. Because there are up to 17 features in each single training dataset (2^17 = 131,072 possible subsets), it is impractical to enumerate all feature combinations and examine their model performance.
The filter methods for feature selection in the mlr package offered a way to address this problem. With this approach, feature importance is expressed as filter values that indicate each feature's general influence on the prediction model (Figure 1). As expected, I found that the
nearest gene was the feature with the highest importance value in HepG2 (filter value = 0.12), and
the p300 binding was the feature with the highest importance value in K562 (filter value = 0.14).
Figure 1: Feature Importance in single models (A. HepG2, B. K562, C. HCT116). The x-axis in each figure
represents the filter values, which could also be regarded as feature importance values (Ranges: 0-1). A higher importance value
means the feature tested is much more critical for the model’s prediction performance. Except for the missing p300 data in
HCT116, all features were included in the rank of importance values. Specifically, the nearest gene was the feature with the
highest importance value in HepG2-trained models, compared with p300 binding in K562 and H3K4me1*H3K4me3 in HCT116.
For K562, there were eight features with an importance value of zero compared to only five features in HepG2 and four features
in HCT116, respectively.
After the rank of feature importance was generated, an increasing feature method was
applied to explore the possible changes in model performance when different feature combinations
were used. With the decreasing rank of feature importance in each cell type, the most important
feature was added to a balanced random forest model first, and then subsequent ones were added
to the previous model one by one. Different combinations of features were tested through this
method, and the model with the highest AUC value in each training dataset can be identified (Tables 5-7). For example, in K562-trained models (shown in Table 5), compared with the averaged CV AUC of 0.83 in the full feature model, including only the first nine features (Model 9) gave a better AUC value (Mean CV AUC = 0.85). In particular, when only the first eight features were added to the learning model, I obtained the highest AUC value among all combinations (Mean CV AUC = 0.86 in Model 8).
Features ranked by filter importance (importance value in parentheses): p300 binding_e (0.14), H3K27ac_e * H3K27ac_p (0.05), H3K4me1_e * H3K4me3_p (0.04), H3K27ac_e * H3K4me1_e (0.04), H3K27ac_e (0.04), H3K4me1_e (0.04), eQTL averaged coefficient (0.03), H3K27ac_e * H3K4me1_e * H3K4me3_p (0.03), eQTL combined Z score (0.03), eQTL * eQTL absolute coefficient (0.00), H3K27ac_e * H3K4me3_p (0.00), H3K27ac_p (0.00), H3K27ac_e * H3K4me1_e * H3K27ac_p (0.00), H3K4me1_e * H3K27ac_p (0.00), H3K4me3_p (0.00), intronic (0.00), nearest gene (0.00).
Model 1 (first feature above): CV Mean AUC = 0.733
Model 2 (first 2 features): 0.778
Model 3 (first 3 features): 0.834
Model 4 (first 4 features): 0.851
Model 5 (first 5 features): 0.854
Model 6 (first 6 features): 0.852
Model 7 (first 7 features): 0.859
Model 8 (first 8 features): 0.860
Model 9 (first 9 features): 0.851
Model 10 (full feature set): 0.827
Table 5: Prediction performance of missing feature models (K562-trained models). Subscripts of feature names
are abbreviated. Among them, “e” suggests the feature is a characteristic of the active state of the enhancer, and “p” represents
the feature as a characteristic of the active state of the promoter. Feature importance values are shown in parentheses after the feature names. Models 1-9 are nine missing feature models; one feature was added to the previous model at each step based on the rank of importance values. Model 10 is the full feature model. Averaged AUC values from repeated 3-fold cross-validation are listed for each model.
Although it is clear which feature contributes most to model performance and which feature
could increase the model performance in a specific combination of features, it remains an issue to
determine the best size of feature subsets that can improve the learner performance and speed up
the learning process at the same time. The MLR package in R provides a method for tuning the
size of a feature subset. For example, a subset of four features was selected by this approach for
K562-trained models (p300, H3K27acenhancer*H3K4me1enhancer, H3K27acenhancer*H3K27acpromoter,
25
H3K4me1enhancer*H3K4me3promoter). In this case, the Mean CV AUC was 0.85 with the top four
features, and the model performance did not show significant improvement when additional
features were added.
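One simple way to sketch this subset-size choice, reusing the ranking and the auc_by_k values from the earlier feature-selection snippet rather than mlr's built-in tuning of the filter-wrapper size (the 0.01 tolerance is an arbitrary assumption):

# Keep the smallest subset whose Mean CV AUC is within a small tolerance of the best.
best_auc <- max(auc_by_k)
chosen_k <- min(which(auc_by_k >= best_auc - 0.01))
chosen_features <- ranked[1:chosen_k]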
The same approach was also applied to HepG2-trained models and HCT116-trained
models, as shown in Tables 6 and 7.
Features ranked by filter importance (importance value in parentheses): nearest gene (0.12), H3K27ac_e * H3K27ac_p (0.07), intronic (0.06), H3K27ac_e (0.06), H3K27ac_e * H3K4me1_e * H3K27ac_p (0.06), p300 binding_e (0.05), H3K27ac_e * H3K4me3_p (0.05), H3K27ac_e * H3K4me1_e * H3K4me3_p (0.04), H3K27ac_e * H3K4me1_e (0.04), H3K4me1_e * H3K4me3_p (0.03), H3K4me1_e * H3K27ac_p (0.04), H3K4me1_e (0.04), eQTL averaged coefficient (0.00), eQTL * eQTL absolute coefficient (0.00), eQTL combined Z score (0.00), H3K27ac_p (0.00), H3K4me3_p (0.00).
Model 1 (first feature above): CV Mean AUC = 0.826
Model 2 (first 2 features): 0.863
Model 3 (first 3 features): 0.862
Model 4 (first 4 features): 0.875
Model 5 (first 5 features): 0.883
Model 6 (first 6 features): 0.882
Model 7 (first 7 features): 0.880
Model 8 (first 8 features): 0.880
Model 9 (first 9 features): 0.880
Model 10 (first 10 features): 0.882
Model 11 (first 11 features): 0.882
Model 12 (first 12 features): 0.883
Model 13 (full feature set): 0.898
Table 6: Prediction performance of missing feature models (HepG2-trained models). Subscripts of feature names
are abbreviated. Among them, “e” suggests the feature is a characteristic of the active state of the enhancer, and “p” represents
the feature as a characteristic of the active state of the promoter. Feature importance values are shown in parentheses after the feature names. Models 1-12 are 12 missing feature models; one feature was added to the previous model at each step based on the rank of importance values. Model 13 is the full feature model. Averaged AUC values from repeated 3-fold cross-validation are listed for each model.
Features ranked by filter importance (importance value in parentheses): H3K4me1_e * H3K4me3_p (0.14), H3K4me1_e (0.14), H3K4me1_e * H3K27ac_p (0.10), H3K27ac_e * H3K4me1_e * H3K4me3_p (0.10), H3K27ac_e * H3K4me1_e (0.10), H3K27ac_e * H3K4me1_e * H3K27ac_p (0.09), H3K27ac_p (0.08), H3K27ac_e * H3K27ac_p (0.08), nearest gene (0.08), H3K27ac_e * H3K4me3_p (0.08), intronic (0.07), H3K27ac_e (0.07), eQTL averaged coefficient (0.00), eQTL * eQTL absolute coefficient (0.00), eQTL combined Z score (0.00), H3K4me3_p (0.00).
Model 1 (first feature above): CV Mean AUC = 0.790
Model 2 (first 2 features): 0.799
Model 3 (first 3 features): 0.788
Model 4 (first 4 features): 0.783
Model 5 (first 5 features): 0.789
Model 6 (first 6 features): 0.786
Model 7 (first 7 features): 0.779
Model 8 (first 8 features): 0.775
Model 9 (first 9 features): 0.836
Model 10 (first 10 features): 0.841
Model 11 (first 11 features): 0.854
Model 12 (first 12 features): 0.856
Model 13 (full feature set): 0.880
Table 7: Prediction performance of missing feature models (HCT116-trained models). Subscripts of feature names are abbreviated: "e" marks a characteristic of the active state of the enhancer, and "p" marks a characteristic of the active state of the promoter. Feature importance values are shown next to the feature names. Models 1-12 are the 12 missing feature models; at each step one feature was added to the previous model based on the rank of importance values. Model 13 is the full feature model. Averaged AUC values from 3-fold repeated cross-validation are listed on the right.
Repeating the above pipeline in joint models
Previously, only training datasets involving a single cell type were used to estimate feature importance and identify feature subsets that could outperform the full-feature models. Since heterogeneous results were observed across models trained on individual cell types, I assumed that differences among the training datasets influence the prediction results. However, I could not determine whether those differences came from the experimental methods used for data collection or from intrinsic biological differences among cell types. To investigate this question further, I repeated all of the experiments described above on joint datasets generated by pooling the individual cell-type-specific training data.
The same pipeline was applied to two joint training datasets, HepG2+K562 and HepG2+K562+HCT116 (for which no p300 data are available). First, based on the experiments that add only one feature or drop one feature from the full feature set at a time in the two joint training datasets (Tables 8 and 9), the results are more neutral and moderate than those of models trained on individual cell types. Features can still be classified as strong or weak relative to the other features (mainly judged by Mean CV AUC values), but the most robust features in the single models, such as p300 binding and the nearest gene, no longer have as much influence as before.
For instance, when only one feature was added to the HepG2+K562 jointly trained model at a time, the Mean CV AUCs of the models containing only p300 binding or only the nearest gene dropped slightly from their corresponding AUCs in the single models (Mean CV AUC = 0.69 for p300 binding and 0.65 for the nearest gene, compared with 0.73 with p300 binding in the K562-trained model and 0.83 with the nearest gene in the HepG2-trained model). Conversely, in the experiments dropping one feature at a time from the full feature set of the HepG2+K562 data (Table 9), removing either of these two features still lowered the initially high performance of the random forest models, but the Mean CV AUCs did not decrease as much as before (Mean CV AUC = 0.86 without p300 binding; 0.85 without the nearest gene; 0.88 for the full feature model). The comparison between joint and single models suggests that joint models help weaken the influence of individual dominant features and are therefore likely to generalize better to outside test datasets.
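As a rough illustration of how a joint training set can be assembled from the cell-type-specific data, the sketch below pools two placeholder data frames (`hepg2_train`, `k562_train`) that share the same feature columns and a binary outcome `link`, then fits a random forest with the base randomForest package; it is a sketch under these assumptions, not the exact pipeline used here.

```r
# Sketch only: pooling two cell-type-specific training sets into a joint dataset
# and fitting a random forest on the shared feature columns.
library(randomForest)

joint_train <- rbind(
  transform(hepg2_train, cell = "HepG2"),
  transform(k562_train,  cell = "K562")
)

# The cell label is kept only for bookkeeping; it is not used as a predictor.
features <- setdiff(names(joint_train), c("link", "cell"))

set.seed(1)
rf_joint <- randomForest(
  x = joint_train[, features],
  y = factor(joint_train$link),
  ntree = 500,
  importance = TRUE   # store importance measures for later feature ranking
)

importance(rf_joint)  # feature importance estimates from the joint model
```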
Table columns, left to right: the single feature included in the model; for the HepG2+K562-trained model: Mean CV AUC, Test AUC in HepG2, Test AUC in K562, Test AUC in MCF7; for the HepG2+K562+HCT116-trained model: Mean CV AUC, Test AUC in HepG2, Test AUC in K562, Test AUC in MCF7, Test AUC in HCT116.
H3K27ace 0.51 0.50 0.50 0.50 0.62 0.50 0.50 0.50 0.50
H3K4me1e 0.73 0.65 0.57 0.85 0.75 0.67 0.61 0.65 0.79
H3K4me3p 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50
p300 bindinge 0.69 0.63 0.73 0.74
eQTL Zscore 0.58 0.59 0.64 0.57 0.59 0.56 0.67 0.59 0.62
eQTL absolute value 0.59 0.64 0.58 0.59 0.61 0.59 0.56 0.53 0.62
H3K27acp 0.48 0.50 0.42 0.57 0.50 0.45 0.45 0.50 0.46
nearest gene 0.65 0.82 0.54 0.73 0.68 0.82 0.54 0.73 0.73
intronic 0.58 0.68 0.52 0.53 0.61 0.68 0.52 0.53 0.66
H3K27ace * H3K27acp 0.68 0.70 0.72 0.76 0.70 0.70 0.71 0.76 0.69
H3K27ace * H3K4me3p 0.51 0.50 0.50 0.50 0.61 0.50 0.50 0.50 0.50
H3K27ace * H3K4me1e * H3K4me3p 0.69 0.63 0.62 0.64 0.68 0.62 0.68 0.65 0.71
H3K27ace * H3K4me1e * H3K27acp 0.69 0.67 0.69 0.83 0.72 0.67 0.63 0.82 0.69
H3K4me1e * H3K4me3p 0.68 0.61 0.59 0.66 0.71 0.67 0.55 0.52 0.80
H3K4me1e * H3K27acp 0.50 0.56 0.56 0.59 0.65 0.52 0.55 0.47 0.44
eQTL * eQTL absolute coefficient 0.59 0.56 0.63 0.39 0.57 0.46 0.60 0.49 0.57
H3K27ace * H3K4me1e 0.73 0.64 0.61 0.84 0.72 0.64 0.68 0.85 0.72
Full Model 0.88 0.96 0.83 0.87 0.88 0.95 0.73 0.87 0.90
Table 8: Prediction performance of Random Forest models trained on joint training datasets with a single feature. Subscripts of the features in the first column are abbreviated: "e" marks a characteristic of the active state of the enhancer, and "p" marks a characteristic of the active state of the promoter. For HCT116, no p300 data are available, so the corresponding AUC values are missing. For each feature, experiments were repeated in the HepG2+K562-trained and HepG2+K562+HCT116-trained models. Under the same learner, both the averaged AUC from repeated cross-validation and the test AUCs for other cell types were obtained and compared with the AUCs of the full-feature models.
Table columns, left to right: the feature dropped from the full feature set; for the HepG2+K562-trained model: Mean CV AUC, Test AUC in HepG2, Test AUC in K562, Test AUC in MCF7; for the HepG2+K562+HCT116-trained model: Mean CV AUC, Test AUC in HepG2, Test AUC in K562, Test AUC in MCF7, Test AUC in HCT116.
H3K27ace 0.88 0.95 0.84 0.87 0.88 0.95 0.74 0.87 0.90
H3K4me1e 0.88 0.95 0.85 0.86 0.88 0.94 0.75 0.86 0.88
H3K4me3p 0.88 0.96 0.84 0.85 0.89 0.95 0.73 0.87 0.90
p300 bindinge 0.86 0.95 0.76 0.85
eQTL Zscore 0.87 0.93 0.82 0.86 0.88 0.92 0.72 0.87 0.89
eQTL absolute value 0.88 0.96 0.84 0.86 0.88 0.95 0.79 0.85 0.91
H3K27acp 0.88 0.95 0.85 0.87 0.89 0.94 0.73 0.88 0.91
nearest gene 0.85 0.93 0.84 0.81 0.87 0.92 0.74 0.82 0.90
intronic 0.88 0.95 0.84 0.91 0.88 0.95 0.73 0.91 0.89
H3K27ace * H3K27acp 0.88 0.96 0.84 0.86 0.88 0.95 0.73 0.86 0.91
H3K27ace * H3K4me3p 0.88 0.96 0.84 0.87 0.89 0.95 0.74 0.86 0.89
H3K27ace * H3K4me1e * H3K4me3p 0.88 0.96 0.84 0.86 0.88 0.94 0.76 0.87 0.90
H3K27ace * H3K4me1e * H3K27acp 0.88 0.96 0.85 0.87 0.89 0.95 0.73 0.86 0.90
H3K4me1e * H3K4me3p 0.88 0.96 0.84 0.87 0.88 0.94 0.75 0.87 0.89
H3K4me1e * H3K27acp 0.89 0.96 0.84 0.87 0.89 0.95 0.73 0.88 0.88
eQTL * eQTL absolute coefficient 0.88 0.96 0.83 0.86 0.88 0.94 0.73 0.86 0.90
H3K27ace * H3K4me1e 0.87 0.96 0.85 0.87 0.88 0.95 0.75 0.87 0.90
Full Model 0.88 0.96 0.83 0.87 0.88 0.95 0.73 0.87 0.90
Table 9: Prediction performance of Random Forest models trained on joint training datasets after dropping one feature from the full feature set. Subscripts of the listed features are abbreviated: "e" marks a characteristic of the active state of the enhancer, and "p" marks a characteristic of the active state of the promoter. Since p300 binding data are missing for HCT116, there are no AUC values for that feature in the triple model. For each feature, experiments were repeated in the HepG2+K562-trained and HepG2+K562+HCT116-trained models. Under the same learner, both the averaged AUC from 3-fold repeated cross-validation and the test AUCs for other cell types were obtained and compared with the AUCs of the full-feature models.
Similarly, I did not find any significant decrease among the AUC values shown in Table 9 when dropping one feature at a time from the full feature set of the HepG2+K562+HCT116 joint training dataset. There are no p300 binding data in the HepG2+K562+HCT116 jointly trained model, and removing the nearest gene did not produce an apparent AUC drop (Mean CV AUC = 0.87 without the nearest gene; 0.88 for the full feature model).
Notably, some features showed comparatively high Mean CV AUCs in the experiments that add only one feature to the joint models, such as H3K4me1enhancer and some interaction terms, but their influence did not appear in the experiments that drop one feature from the full feature set, so I do not discuss their effect on prediction performance here. It is also worth noting that the two joint models have similar prediction performance (Mean CV AUC = 0.88 with the full feature set in both the HepG2+K562-trained and HepG2+K562+HCT116-trained models). The averaged AUCs from repeated cross-validation and the test AUCs in other datasets are also similar for these two jointly trained models, except when testing in K562. Because the p300 binding data are missing from the HepG2+K562+HCT116-trained models, a relative decrease in test AUC was observed for the K562 test dataset. One likely reason is that, as shown earlier, p300 binding is crucial for predicting enhancer-gene links in K562, so its absence leads to weaker performance on the K562 data.
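The drop-one-feature experiment summarized in Table 9 can be sketched as a simple loop that re-estimates the repeated cross-validated AUC after removing each feature in turn. The sketch below uses the base randomForest and pROC packages, and `joint_train`, `features`, and `link` are the placeholder objects from the pooling sketch above; it illustrates the idea rather than reproducing the exact code behind Table 9.

```r
# Sketch only: mean 3-fold (x3 repeats) CV AUC for the full feature set and for
# each "drop one feature" model on the pooled training data.
library(randomForest)
library(pROC)

cv_auc <- function(dat, feats, folds = 3, reps = 3) {
  aucs <- numeric(0)
  for (r in seq_len(reps)) {
    fold_id <- sample(rep(seq_len(folds), length.out = nrow(dat)))
    for (k in seq_len(folds)) {
      tr  <- dat[fold_id != k, ]
      te  <- dat[fold_id == k, ]
      fit <- randomForest(x = tr[, feats], y = factor(tr$link), ntree = 500)
      # Column 2 assumes the positive class is the second factor level.
      p   <- predict(fit, te[, feats], type = "prob")[, 2]
      aucs <- c(aucs, as.numeric(auc(roc(te$link, p, quiet = TRUE))))
    }
  }
  mean(aucs)
}

set.seed(1)
full_auc <- cv_auc(joint_train, features)

drop_one <- sapply(features, function(f) cv_auc(joint_train, setdiff(features, f)))

# Features whose removal barely changes the AUC are candidates for dropping.
round(sort(full_auc - drop_one, decreasing = TRUE), 3)
```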
The above results demonstrate the value of a joint training dataset, which retains the highly informative features but weakens their individual influence on prediction performance in the machine learning models. As expected, fewer features have filter values of zero in the joint models than in the single models. Comparing feature importance across the five models, the filter values for H3K4me3promoter were always zero, indicating the weak predictive power of H3K4me3promoter regardless of the training dataset. In addition, H3K27acpromoter had a positive filter value only in the HCT116-trained model, and the interaction term between the two eQTL features was positive only in the triple dataset. Apart from these weak features, I also identified features with positive filter values in every training dataset: H3K27acenhancer, H3K4me1enhancer, H3K27acenhancer*H3K27acpromoter, H3K27acenhancer*H3K4me1enhancer*H3K4me3promoter, H3K4me1enhancer*H3K4me3promoter, and H3K27acenhancer*H3K4me1enhancer.
Figure 2: Feature Importance in joint models (A. HepG2+K562, B. HepG2+K562+HCT116) and feature
importance comparison (C. in three single models and two joint models). A and B: The x-axis represents feature
importance values (Ranges: 0-1). A higher importance value means the feature tested is much more critical for the model’s
prediction performance. p300 binding is the feature with the highest importance value in HepG2+K562-trained models, and
H3K27acenhancer*H3K4me1enhancer is the feature with the highest importance value in HepG2+K562+HCT116-trained models
(p300 binding data is not available for this joint training dataset). C: Feature importance values were compared among five
machine learning models (HepG2-trained model, K562-trained model, HCT116-trained model, HepG2+K562-trained model,
and HepG2+K562+HCT116-trained model). The height of each bar indicates the value of feature importance. Different colors
in the above figure represent one of the five training models.
[Figure 2, panel C: bar chart titled "Feature Importance Comparison"; the y-axis shows feature importance (0 to 0.16), and bars are grouped by training model: HepG2, K562, HCT116, HepG2+K562, HepG2+K562+HCT116.]
Feature (in rank order)                      Importance
p300 binding_e                               0.12
H3K27ac_e * H3K4me1_e                        0.10
H3K27ac_e * H3K4me1_e * H3K4me3_p            0.06
H3K27ac_e * H3K27ac_p                        0.06
nearest gene                                 0.06
H3K27ac_e                                    0.05
H3K27ac_e * H3K4me1_e * H3K27ac_p            0.05
H3K4me1_e                                    0.04
H3K4me1_e * H3K4me3_p                        0.04
intronic                                     0.03
eQTL averaged coefficient                    0.03
eQTL combined Z score                        0.03
H3K27ac_e * H3K4me3_p                        0.03
eQTL * eQTL absolute coefficient             0.00
H3K27ac_p                                    0.00
H3K4me1_e * H3K27ac_p                        0.00
H3K4me3_p                                    0.00

Model                  CV Mean AUC
Model 1                0.693
Model 2                0.793
Model 3                0.793
Model 4                0.790
Model 5                0.876
Model 6                0.878
Model 7                0.875
Model 8                0.877
Model 9                0.875
Model 10               0.877
Model 11               0.887
Model 12               0.896
Model 13               0.895
Model 14 (full model)  0.880
Table 10: Prediction performance of missing feature models (HepG2+K562-trained models). Subscripts of feature names are abbreviated: "e" marks a characteristic of the active state of the enhancer, and "p" marks a characteristic of the active state of the promoter. Feature importance values are shown next to the feature names. Models 1-13 are the 13 missing feature models; at each step one feature was added to the previous model based on the rank of importance values. Model 14 is the full feature model. Averaged AUC values from 3-fold repeated cross-validation are listed on the right.
Feature (in rank order)                      Importance
H3K27ac_e * H3K4me1_e                        0.09
H3K27ac_e * H3K4me1_e * H3K27ac_p            0.07
H3K27ac_e * H3K4me1_e * H3K4me3_p            0.07
nearest gene                                 0.07
H3K4me1_e                                    0.07
H3K27ac_e * H3K27ac_p                        0.07
H3K4me1_e * H3K4me3_p                        0.07
H3K27ac_e                                    0.06
intronic                                     0.05
H3K4me1_e * H3K27ac_p                        0.04
H3K27ac_e * H3K4me3_p                        0.04
eQTL combined Z score                        0.02
eQTL averaged coefficient                    0.02
eQTL * eQTL absolute coefficient             0.01
H3K27ac_p                                    0.00
H3K4me3_p                                    0.00

Model                  CV Mean AUC
Model 1                0.717
Model 2                0.742
Model 3                0.754
Model 4                0.832
Model 5                0.865
Model 6                0.861
Model 7                0.869
Model 8                0.868
Model 9                0.864
Model 10               0.862
Model 11               0.862
Model 12               0.888
Model 13               0.888
Model 14               0.889
Model 15 (full model)  0.880
Table 11: Prediction performance of missing feature models (HepG2+K562+HCT116-trained models). Subscripts of feature names are abbreviated: "e" marks a characteristic of the active state of the enhancer, and "p" marks a characteristic of the active state of the promoter. Feature importance values are shown next to the feature names. Models 1-14 are the 14 missing feature models; at each step one feature was added to the previous model based on the rank of importance values. Model 15 is the full feature model. Averaged AUC values from 3-fold repeated cross-validation are listed on the right.
Missing feature models with a greater Mean CV AUC than the corresponding full feature model can again be found consistently (Tables 10 and 11). In the HepG2+K562-trained models, Models 11-13 all performed better than the full feature model (Mean CV AUC = 0.88); likewise, in the HepG2+K562+HCT116-trained models, Models 12-14 had higher AUC values than the full feature model (Mean CV AUC = 0.88). These results confirm that, for single and joint models alike, it is possible to find feature combinations that allow missing feature machine learning models to predict enhancer-gene links better than the full feature models.
Using the filter method, a subset of the eight features with the highest importance (Model 8 in Table 10) was selected for the HepG2+K562-trained models (H3K4me1enhancer, p300, H3K27acenhancer, nearest gene, H3K27acenhancer*H3K4me1enhancer, H3K27acenhancer*H3K27acpromoter, H3K27acenhancer*H3K4me1enhancer*H3K4me3promoter, and H3K27acenhancer*H3K4me1enhancer*H3K27acpromoter), which gives a Mean CV AUC of 0.88, similar to the full-model AUC.
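A corresponding sketch of the filter-based selection step in mlr is shown below; `task` is a placeholder mlr classification task built from the joint training data, analogous to the one in the earlier sketch, and the filter method name again depends on the mlr version. This is an illustration only.

```r
# Sketch only: rank features with an importance-based filter and keep the top 8.
library(mlr)

fv <- generateFilterValuesData(task, method = "randomForest.importance")
plotFilterValues(fv)                     # bar chart of the importance values

task_top8 <- filterFeatures(task, fval = fv, abs = 8)
getTaskFeatureNames(task_top8)           # the eight retained features
```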
Cell-type-specific analysis
In the preceding analysis, I investigated how important each feature is in random forest models built with five different training datasets (three single models and two joint models) and explored missing feature models with better prediction performance than the full feature models. However, given the differences observed in AUC values across the various cell-type-specific training datasets, the question remained whether those differences were caused by cell-type-specific or training-dataset-specific characteristics. Additional experiments were necessary to address this question.
Earlier, all of my models were built on HepG2, K562, and HCT116 data, and model validation was performed on the same single training datasets plus MCF7. In this part of the study, datasets from new cell types were introduced as test datasets to avoid any potential bias from reusing the same cell types in the model testing process. As mentioned in the Methods section, the GM12878, HeLa-S3, SK-N-SH, and H1 cell lines were selected because of their data completeness in the ENCODE database. Four entirely new test datasets covering the full set of PEREGRINE enhancer-gene links (~17 million) were generated using the same feature set as before but without the true positive and true negative enhancer-gene link labels, and the same data cleaning and processing steps used for the training datasets were applied to them. Because of the uncertainty in choosing an appropriate tissue for the H1 cell line, only the GM12878, HeLa-S3, and SK-N-SH cell lines were used for scoring the probability of being a positive enhancer-gene link.
Once the new outside data were obtained, each training model could be applied to all test datasets and their score distributions compared. Based on these scores, I mainly wanted to investigate whether the score distributions differ significantly among training models or among test datasets. My hypothesis was as follows: if the same training model generates significantly different score distributions on different test datasets, the differences are likely caused by cell-type-specific characteristics captured by the machine learning prediction models; if, instead, different training models produce significantly different score distributions on the same test dataset, the differences are likely caused by intrinsic differences among the prediction models, and a more complex model combining information from different cell types (a joint model) would be preferable. Further cell-specific analysis was therefore needed to interpret why different training models lead to different score distributions for enhancer-gene links in new cell types.
From my previous analysis, p300 was an essential feature for predicting enhancer-gene links in K562-trained models and a crucial feature in joint models. However, p300 was not available for the HCT116 training dataset, so the HepG2, K562, and HepG2+K562 joint datasets were first used to train predictive models with the full set of features, and their score distributions were compared in the GM12878, HeLa-S3, and SK-N-SH validation datasets.
My results indicate statistically significant differences in score distributions among training models. As shown in Figure 3, within each validation cell line, the score distributions from the three models were clearly different. By contrast, for the same training model, the scoring differences among the GM12878, HeLa-S3, and SK-N-SH validation datasets were not obvious, especially for the HepG2-trained model (Figure 3B). The HepG2+K562-trained model had a higher median score than the two single-trained models. Statistical tests support these observations: a paired t-test indicated that, within the same validation dataset, the scores from the K562- and HepG2-trained models were significantly different (GM12878: t = 1009.1, p-value < 0.001; HeLa-S3: t = -621.21, p-value < 0.001; SK-N-SH: t = 540.3, p-value < 0.001).
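The comparison above can be reproduced in outline with the sketch below: one outside test set is scored by two trained models and the paired score differences are tested. `rf_hepg2`, `rf_k562`, `gm12878_links`, `features`, and the positive-class label "1" are placeholders for objects assumed to have been built earlier, so this is an illustration rather than the exact analysis code.

```r
# Sketch only: score one outside test set (GM12878) with two trained random forest
# models and compare the paired score distributions.
library(randomForest)

# Predicted probability of being a positive enhancer-gene link; the column name
# "1" assumes the positive class was coded as 1 when the models were trained.
score_hepg2 <- predict(rf_hepg2, gm12878_links[, features], type = "prob")[, "1"]
score_k562  <- predict(rf_k562,  gm12878_links[, features], type = "prob")[, "1"]

# Each link is scored by both models, so the t-test is paired.
t.test(score_hepg2, score_k562, paired = TRUE)

# Visual comparison of the two score distributions.
boxplot(list(HepG2 = score_hepg2, K562 = score_k562),
        ylab = "Predicted probability of a positive link")
```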
Figure 3: Score distributions of three validation datasets under different prediction models. Three machine learning models are used: the HepG2-trained model, the K562-trained model, and the HepG2+K562-trained model. The three validation datasets are the GM12878, HeLa-S3, and SK-N-SH cell types. A) Histograms of scores for the three validation datasets. Within each validation dataset, the score distributions from the three trained models are shown in green (HepG2), yellow (K562), and purple (HepG2+K562). B) Boxplots of scores for the three machine learning models (HepG2, K562, and HepG2+K562). Within each trained model, the three validation cell types are indicated in red (GM12878), green (HeLa-S3), and blue (SK-N-SH). The black points at the top represent outliers, and the line in each box marks the median score of each validation scoring set.
When I investigated the missing feature models, my findings suggested that, compared with models trained on individual cell types, jointly trained models built from more than one cell type behave more similarly to one another: they reduce the weight of a few dominant features in predicting enhancer-gene links and allow more features to contribute to model performance. Because of the marked dissimilarity among the two single models and one joint model in Figure 3, another joint training dataset, HepG2+K562+HCT116, was introduced into the experiment to test whether two joint models would provide more homogeneous results, as expected.
As shown in Figure 4, the two joint models are much more consistent than single models.
It is worth noting that in the boxplot, the HepG2+K562-trained model and
HepG2+K562+HCT116-trained model show nearly the same median values of scores in
GM12878 (red), HeLa-S3 (green), and SK-N-SH (blue).
Figure 4: Boxplots of scores for four different trained models. "Triple" denotes the joint model trained on the HepG2+K562+HCT116 dataset. Within each trained model, the three validation cell types are indicated in red (GM12878), green (HeLa-S3), and blue (SK-N-SH). The black points at the top represent outliers, and the line in each box marks the median score of each validation scoring set.
Chapter 4: Discussion
The most powerful features in predicting enhancer-gene links
Through the missing feature model experiments introduced in Chapter 3, I found two especially important features in the single models: the nearest gene in HepG2 and p300 binding in K562. The experimentally validated enhancer-gene links in the HepG2 training dataset were obtained from careful PubMed searches; a paper and its supplemental data were considered if they seemed likely to include experimental evidence linking an enhancer to at least one target gene in the desired cell type. The true positive enhancer-gene links in the K562 training dataset, in contrast, were gathered from published CRISPRi-FlowFISH experiments (Fulco et al., 2019), in which enhancers within 450 kb of 30 genes were perturbed and the effects on gene expression were then measured.
Past experiments on enhancer function suggest that most enhancers are more likely to regulate their neighboring genes, even though they can also modulate distant genes (Visel et al., 2009b). In other words, the genes nearest to an enhancer are more likely to be modulated by that enhancer than genes located farther away. The nearest gene is therefore the predominant factor in the papers from which the experimental results were collected: most of the experimentally validated positive enhancer-gene links are between enhancers and their nearest genes, which is likely the main reason the nearest gene feature had such high importance for predicting enhancer-gene links in the HepG2 cell type. The CRISPRi-FlowFISH experiments, on the other hand, had already removed the bias produced by the distance between enhancers and their target genes, so this feature did not show a substantial effect in the K562-trained models.
The differences in the experiments used to collect enhancer-gene links also explain the weaker performance of the K562-trained models. Under the same conditions for building random forest models, the models trained on K562 data consistently had lower AUC values than the HepG2-trained and HCT116-trained models. This makes sense because the bias produced by the most predominant factor, the nearest gene feature, had already been eliminated, and p300 binding was instead detected as the most powerful feature in the K562-trained models, likely because CRISPR-based experiments are well suited to probing chromatin features such as p300 binding (Hilton et al., 2015).
To estimate the individual effect of each feature, I designed two experiments: adding only one feature to the training models at a time, and removing one feature at a time from the full feature set. The second approach gave the more intuitive conclusion. When each feature was dropped from the full feature set, only the nearest gene in HepG2 and p300 binding in K562 showed a strong influence on predicting enhancer-gene links. In contrast, when only one feature was included in the training models, some interaction terms achieved higher AUC values than p300 binding in K562 and the nearest gene in HCT116, and H3K4me1enhancer also performed well on its own in the K562-trained and HCT116-trained models. These findings suggest that, aside from the nearest gene and p300 binding, which likely reflect biases introduced by the experimental approaches, the other features contribute to predicting enhancer-gene links but none is individually crucial in the cell-type-specific machine learning models: including only one of them can still give good model performance, yet dropping any one of them from the full feature set does not change the prediction performance significantly.
Viability of missing feature models
My experiments demonstrate the viability of missing feature models for predicting enhancer-gene links. In theory, with 17 features in the full feature models there are 2^17 - 2 = 131,070 possible feature combinations (2^17 is the number of subsets of 17 elements; the empty set and the full set are excluded by the definition of a missing feature model). It is impractical to test the model performance of all of these combinations in mlr. Therefore, I applied the filter method to rank the importance of all features and then tested combinations of the features with high importance values. In this way, I detected missing feature models with better performance than their respective full feature models in the K562, HepG2+K562, and HepG2+K562+HCT116 training datasets. These findings justify performing feature selection and dropping some weak features.
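The size of the search space and of the ranked, nested alternative can be checked with a couple of lines of R; this is a simple arithmetic illustration rather than part of the analysis pipeline.

```r
# Sketch only: exhaustive search over feature subsets vs. the ranked nested strategy.
n_features <- 17
2^n_features - 2   # 131070 non-trivial subsets (empty and full sets excluded)
n_features         # at most 17 nested models when features are added by importance rank
```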
However, no missing feature models with better performance were observed for the HepG2 and HCT116 training datasets. This may simply be due to chance, but it is more likely related to the unbalanced data structure of HepG2 and HCT116. Compared with the K562, HepG2+K562, and HepG2+K562+HCT116 training datasets, the HepG2 and HCT116 training datasets are much more unbalanced, meaning that the proportion of positive to negative links is smaller. This imbalance may cause only subtle changes in model performance across feature subsets, so that no better missing feature models are found.
The application of joint models
Because of the differences among the cell-type-specific training datasets, the joint models showed more similar prediction performance than the single models. Interestingly, despite the lack of p300 binding data in the HepG2+K562+HCT116 training dataset, the HepG2+K562-trained and HepG2+K562+HCT116-trained models had similar performance for both full feature and missing feature models. Only two or three cell types were used in the joint models, and one of the crucial features was missing from the triple dataset, yet these disadvantages did not change their prediction performance. I therefore expect that joint models trained on more cell types would be even more stable for predicting enhancer-gene links. My findings also suggest that, for cell types with missing data or features, a missing feature model would be valuable and could provide cell-specific scores with good performance.
Although the predominant features carry experimental biases, they did not have a decisive impact on prediction performance when included in a joint model. In the HepG2+K562-trained and HepG2+K562+HCT116-trained models, these predominant features did not show the same outsized importance as in the models trained on individual cell types. Jointly trained models therefore appear to dampen the independent influence of dominant features and improve model performance overall. Moreover, the joint models proved helpful again in the cell-type-specific analysis: in contrast to the discrepancy between the single models, the two joint models produced more similar score distributions in each test dataset.
Divergence of training models and score distributions in various test datasets
In the cell-type-specific analysis, the differences in score distributions mainly reflect the training models. In particular, the two single models showed significant differences when compared on the same test dataset, whereas the joint models had more similar score distributions. The likely reason is that the single models are intrinsically different: the nearest gene is dominant in the HepG2-trained models, which creates a disparity with the K562-trained model, where p300 binding is the crucial factor for predicting enhancer-gene links.
In addition, for each training model, differences in score distributions among the test datasets also exist, especially for the K562-trained model. The difference is less apparent for the HepG2-trained model, probably because of the dominant effect of the nearest gene. The cell-type-specific score distributions point to the uniqueness of each cell type.
Limitations
Only three cell types were used to train the machine learning predictive models in my analysis, which limits the comprehensiveness of the data. Generally, the more cell types used in model training, the more comparable and convincing the results become. However, only a limited number of papers demonstrate enhancer function on target genes experimentally, and most of them focus on a few highly cited cell types. Because of the difficulty of collecting experimentally validated enhancer-gene links, only HepG2, K562, and HCT116 were available for training machine learning models in my study. Although I found that CRISPRi-FlowFISH experiments can eliminate the biased influence of the nearest gene feature and are more advanced than traditional experiments, no other accessible data generated with the same approach were found. The HepG2 and HCT116 training datasets are therefore biased from the start, and the impact of the predominant factor, the nearest gene, is difficult to avoid during model training. Despite this, the random forest models still show strong performance in predicting enhancer-gene links.
In addition to the limited number of cell types, there are also fewer experimentally validated true links than desired. Notably, because true positive links are scarce in the training datasets, a large proportion of negative links had to be included to guarantee a sufficient total number of links for training, leading to an unbalanced positive/negative outcome. Both the insufficient number of true links and the unbalanced positive/negative links may introduce error and inaccuracy into the estimates of model performance.
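One standard way to address such an imbalance, consistent with the balanced random forest models mentioned in the Abstract, is to downsample the majority class within each tree. The sketch below uses randomForest's strata/sampsize arguments on placeholder objects (`train`, `features`, `link`) and is only an illustration of the idea, not the exact configuration used in this study.

```r
# Sketch only: a "balanced" random forest that draws equally sized class samples
# for every tree via stratified sampling.
library(randomForest)

y     <- factor(train$link)
n_min <- min(table(y))            # size of the minority (positive) class

set.seed(1)
rf_balanced <- randomForest(
  x        = train[, features],
  y        = y,
  ntree    = 500,
  strata   = y,
  sampsize = c(n_min, n_min)      # equal draws from each class per tree
)
```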
Furthermore, AUC is not an ideal metric for evaluating model performance on an unbalanced training dataset; AUPRC would be a more appropriate criterion in a future study. Nevertheless, the findings of this study remain meaningful because the choice of AUC or AUPRC would not change the comparative results of the missing feature models or the cell-type-specific analysis.
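For completeness, AUPRC could be computed from the cross-validated scores with the PRROC package, as in the hedged sketch below; `scores` and `labels` are placeholder vectors of predicted probabilities and true 0/1 outcomes, not objects from this study.

```r
# Sketch only: area under the precision-recall curve (AUPRC) for imbalanced data.
library(PRROC)

pr <- pr.curve(scores.class0 = scores[labels == 1],   # scores of true positive links
               scores.class1 = scores[labels == 0],   # scores of true negative links
               curve = FALSE)
pr$auc.integral                                       # AUPRC estimate
```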
Bibliography
Adhikary, S., Roy, S., Chacon, J., Gadad, S.S., and Das, C. (2021). Implications of enhancer
transcription and eRNAs in cancer. Cancer Research, canres.4010.4202. 10.1158/0008-5472.can-
20-4010.
Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W.J., Mattick, J.S., and Haussler, D.
(2004). Ultraconserved elements in the human genome. Science 304, 1321-1325.
10.1126/science.1098119.
Blow, M.J., McCulley, D.J., Li, Z., Zhang, T., Akiyama, J.A., Holt, A., Plajzer-Frick, I.,
Shoukry, M., Wright, C., Chen, F., et al. (2010). ChIP-Seq identification of weakly conserved
heart enhancers. Nat Genet 42, 806-810. 10.1038/ng.650.
Cho, K.W.Y. (2012). Enhancers. Wiley Interdisciplinary Reviews: Developmental Biology 1,
469-478. 10.1002/wdev.53.
Consortium, E.P. (2012). An integrated encyclopedia of DNA elements in the human genome.
Nature 489, 57-74. 10.1038/nature11247.
Dutton, D.M., and Conroy, G.V. (1997). A review of machine learning. The Knowledge
Engineering Review 12, 341-367. 10.1017/s026988899700101x.
Emison, E.S., Mccallion, A.S., Kashuk, C.S., Bush, R.T., Grice, E., Lin, S., Portnoy, M.E.,
Cutler, D.J., Green, E.D., and Chakravarti, A. (2005). A common sex-dependent mutation in a
RET enhancer underlies Hirschsprung disease risk. Nature 434, 857-863. 10.1038/nature03467.
Fishilevich, S., Nudel, R., Rappaport, N., Hadar, R., Plaschkes, I., Iny Stein, T., Rosen, N.,
Kohn, A., Twik, M., Safran, M., et al. (2017). GeneHancer: genome-wide integration of
enhancers and target genes in GeneCards. Database (Oxford) 2017. 10.1093/database/bax028.
Fulco, C.P., Nasser, J., Jones, T.R., Munson, G., Bergman, D.T., Subramanian, V., Grossman,
S.R., Anyoha, R., Doughty, B.R., Patwardhan, T.A., et al. (2019). Activity-by-contact model of
enhancer–promoter regulation from thousands of CRISPR perturbations. Nature Genetics 51,
1664-1669. 10.1038/s41588-019-0538-0.
Gao, T., He, B., Liu, S., Zhu, H., Tan, K., and Qian, J. (2016). EnhancerAtlas: a resource for
enhancer annotation and analysis in 105 human cell/tissue types. Bioinformatics 32, 3543-3551.
10.1093/bioinformatics/btw495.
Gao, T., and Qian, J. (2020). EnhancerAtlas 2.0: an updated resource with enhancer annotation
in 586 tissue/cell types across nine species. Nucleic Acids Res 48, D58-D64.
10.1093/nar/gkz980.
Gasperini, M., Tome, J.M., and Shendure, J. (2020). Towards a comprehensive catalogue of
validated and target-linked human enhancers. Nat Rev Genet 21, 292-310. 10.1038/s41576-019-
0209-0.
Heintzman, N.D., Stuart, R.K., Hon, G., Fu, Y., Ching, C.W., Hawkins, R.D., Barrera, L.O., Van
Calcar, S., Qu, C., Ching, K.A., et al. (2007). Distinct and predictive chromatin signatures of
transcriptional promoters and enhancers in the human genome. Nat Genet 39, 311-318.
10.1038/ng1966.
Hilton, I.B., D'Ippolito, A.M., Vockley, C.M., Thakore, P.I., Crawford, G.E., Reddy, T.E., and
Gersbach, C.A. (2015). Epigenome editing by a CRISPR-Cas9-based acetyltransferase activates
genes from promoters and enhancers. Nature Biotechnology 33, 510-517. 10.1038/nbt.3199.
Javierre, B.M., Burren, O.S., Wilder, S.P., Kreuzhuber, R., Hill, S.M., Sewitz, S., Cairns, J.,
Wingett, S.W., Várnai, C., Thiecke, M.J., et al. (2016). Lineage-Specific Genome Architecture
Links Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell 167, 1369-
1384.e1319. 10.1016/j.cell.2016.09.037.
Jiang, Y., Qian, F., Bai, X., Liu, Y., Wang, Q., Ai, B., Han, X., Shi, S., Zhang, J., Li, X., et al.
(2019). SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res 47, D235-
D243. 10.1093/nar/gky1025.
Kim, D., An, H., Shearer, R.S., Sharif, M., Fan, C., Choi, J.O., Ryu, S., and Park, Y. (2019). A
principled strategy for mapping enhancers to genes. Sci Rep 9, 11043. 10.1038/s41598-019-
47521-w.
Kleinjan, D.A., and Van Heyningen, V. (2005). Long-Range Control of Gene Expression:
Emerging Mechanisms and Disruption in Disease. The American Journal of Human Genetics 76,
8-32. 10.1086/426833.
Kulaeva, O.I., Nizovtseva, E.V., Polikanov, Y.S., Ulianov, S.V., and Studitsky, V.M. (2012).
Distant activation of transcription: mechanisms of enhancer action. Mol Cell Biol 32, 4892-4897.
10.1128/MCB.01127-12.
Lettice, L.A. (2003). A long-range Shh enhancer regulates expression in the developing limb and
fin and is associated with preaxial polydactyly. Human Molecular Genetics 12, 1725-1735.
10.1093/hmg/ddg180.
Li, K., Liu, Y., Cao, H., Zhang, Y., Gu, Z., Liu, X., Yu, A., Kaphle, P., Dickerson, K.E., Ni, M.,
and Xu, J. (2020). Interrogation of enhancer function by enhancer-targeting CRISPR epigenetic
editing. Nature Communications 11. 10.1038/s41467-020-14362-5.
Lomvardas, S., Barnea, G., Pisapia, D.J., Mendelsohn, M., Kirkland, J., and Axel, R. (2006).
Interchromosomal Interactions and Olfactory Receptor Choice. Cell 126, 403-413.
10.1016/j.cell.2006.06.035.
Mills, C. (2021). Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links.
Mills, C., Muruganujan, A., Ebert, D., Marconett, C.N., Lewinger, J.P., Thomas, P.D., and Mi,
H. (2020). PEREGRINE: A genome-wide prediction of enhancer to gene relationships supported
by experimental evidence. PLoS One 15, e0243791. 10.1371/journal.pone.0243791.
Mohrs, M., Blankespoor, C.M., Wang, Z.-E., Loots, G.G., Afzal, V., Hadeiba, H., Shinkai, K.,
Rubin, E.M., and Locksley, R.M. (2001). Deletion of a coordinate regulator of type 2 cytokine
expression in mice. Nature Immunology 2, 842-847. 10.1038/ni0901-842.
Ong, C.T., and Corces, V.G. (2011). Enhancer function: new insights into the regulation of
tissue-specific gene expression. Nat Rev Genet 12, 283-293. 10.1038/nrg2957.
Panigrahi, A., and O’Malley, B.W. (2021). Mechanisms of enhancer action: the known and the
unknown. Genome Biology 22. 10.1186/s13059-021-02322-1.
Pennacchio, L.A., Bickmore, W., Dean, A., Nobrega, M.A., and Bejerano, G. (2013). Enhancers:
five essential questions. Nat Rev Genet 14, 288-295. 10.1038/nrg3458.
Schiffner, J., Bischl, B., Lang, M., Richter, J., Jones, Z.M., Probst, P., Pfisterer, F., Gallo, M.,
Kirchhoff, D., Kühn, T., et al. (2016). mlr Tutorial.
Schmidt, D., Wilson, M.D., Ballester, B., Schwalie, P.C., Brown, G.D., Marshall, A., Kutter, C.,
Watt, S., Martinez-Jimenez, C.P., Mackay, S., et al. (2010). Five-vertebrate ChIP-seq reveals the
evolutionary dynamics of transcription factor binding. Science 328, 1036-1040.
10.1126/science.1186176.
Schoenfelder, S., and Fraser, P. (2019). Long-range enhancer–promoter contacts in gene
expression control. Nature Reviews Genetics 20, 437-455. 10.1038/s41576-019-0128-0.
Schoenfelder, S., Javierre, B.M., Furlan-Magaril, M., Wingett, S.W., and Fraser, P. (2018).
Promoter Capture Hi-C: High-resolution, Genome-wide Profiling of Promoter Interactions. J Vis
Exp. 10.3791/57320.
Sur, I., and Taipale, J. (2016). The role of enhancers in cancer. Nat Rev Cancer 16, 483-493.
10.1038/nrc.2016.62.
Visel, A., Blow, M.J., Li, Z., Zhang, T., Akiyama, J.A., Holt, A., Plajzer-Frick, I., Shoukry, M.,
Wright, C., Chen, F., et al. (2009a). ChIP-seq accurately predicts tissue-specific activity of
enhancers. Nature 457, 854-858. 10.1038/nature07730.
Visel, A., Prabhakar, S., Akiyama, J.A., Shoukry, M., Lewis, K.D., Holt, A., Plajzer-Frick, I.,
Afzal, V., Rubin, E.M., and Pennacchio, L.A. (2008). Ultraconservation identifies a small subset
of extremely constrained developmental enhancers. Nat Genet 40, 158-160. 10.1038/ng.2007.55.
Visel, A., Rubin, E.M., and Pennacchio, L.A. (2009b). Genomic views of distant-acting
enhancers. Nature 461, 199-205. 10.1038/nature08451.
Wang, J., Dai, X., Berry, L.D., Cogan, J.D., Liu, Q., and Shyr, Y. (2019). HACER: an atlas of
human active enhancers to interpret regulatory variants. Nucleic Acids Res 47, D106-D112.
10.1093/nar/gky864.
Wang, Z., Zhang, Q., Zhang, W., Lin, J.R., Cai, Y., Mitra, J., and Zhang, Z.D. (2018). HEDD:
Human Enhancer Disease Database. Nucleic Acids Res 46, D113-D120. 10.1093/nar/gkx988.
Abstract
Enhancers are short DNA elements located widely throughout the human genome that modulate the transcription of one or more specific target genes. Enhancer function mainly refers to the connections between enhancers and their target genes within a particular tissue or cell type, which are considered to play important roles in human development and disease. It remains challenging to identify enhancers and determine their functions. Previously, an interactive tool, PEREGRINE, was developed that combines enhancer-gene links across different cell types using publicly available data and allows users to easily access the information for each enhancer-gene link through the PANTHER system. A machine learning method was developed to predict possible enhancer-gene links from a set of selected features and to validate the experimental links by scoring them in a cell-type-specific manner. Building on these findings, in this study I sought to discover the most critical subset of features needed to build a machine learning model that effectively predicts enhancer-gene links with high confidence. Furthermore, by applying balanced random forest models with the full feature set to new test datasets, probabilities of being a positive enhancer-gene link are provided for each candidate cell type. I also investigated whether the score distributions for enhancer-gene links in new test datasets differ among cell types and prediction models.