Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links
(USC Thesis Other)
Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
GENOME-WIDE CHARACTERIZATION OF THE REGULATORY RELATIONSHIPS OF
CELL TYPE-SPECIFIC ENHANCER-GENE LINKS
by
Caitlin Mills
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
BIOSTATISTICS
August 2021
Copyright 2021 Caitlin Mills
ii
Acknowledgments
The journey to complete my doctorate was a challenging and substantial endeavor that
helped me grow tremendously as a student, scientist, and person. Throughout the years, I was
helped by many people without whom this accomplishment would not have been possible. I
would like to recognize them here.
Firstly, I would never have made it were it not for the guidance and assistance of my
primary advisor, Dr. Huaiyu Mi. In the beginning of my PhD, when Dr. Mi plucked me out of my
established path as a bench scientist and convinced me I had what it takes to transfer into the
Biostatistics program, I had no idea what lay ahead of me. My new path turned out to be both
more challenging and more rewarding than I anticipated. I have benefitted time and time again
from Dr. Mi’s vision, support, flexibility, creativity, and scientific expertise. For his patience and
understanding, candor, and dedication, I am especially grateful.
Additionally, special thanks are assuredly due to Dr. Juan Pablo Lewinger, who stepped
forward during a time of intense creative gridlock to provide the methodological direction we had
been searching for. A professor with many responsibilities who already had plenty of student
mentees, he volunteered to mentor me in my project when I am sure he could ill spare the time.
For the countless hours generously given to me and Dr. Mi in the pursuit of scientific progress
and camaraderie, I am deeply grateful.
I would like to thank my outside committee member, Dr. Ite Offringa. Over the years I
have felt continually humbled by the wonderful opportunity to work with Dr. Offringa, a scientist
of rare caliber and a person with many excellent qualities. Although she is possibly the busiest
person I know, I have always admired her remarkable willingness to make time to connect,
inspire, and mentor students who need it most.
I would also like to recognize the leadership and thoughtful guidance of Dr. Kim
Seigmund, who always provided an array of suggestions and options when I encountered an
iii
obstacle, and always advocated for me when I needed it most. Her candor helped me identify
shortcomings in my research, and was the impetus I needed to hold myself to a higher
standard. Her good nature has made her a pleasure to know, and I am so thankful for her
mentoring in all the forms I experienced over the years.
I am incredibly grateful to my other committee members, Dr. David Conti and Dr. Crystal
Marconett, whose invaluable recommendations and contributions to my research allowed me to
achieve this wonderful milestone in my career. I would also like to thank Dr. Paul Thomas, who
served as co-author on all of my publications and offered many helpful suggestions.
I also thank my mother, who raised me to place a great importance on education. She
finished her bachelor’s degree when I was still in high school, making her the first person in my
family to obtain a college degree. Without her love and support, I could never have made it to
where I am now. A strong and intelligent woman whose capacity for hard work and fearless
leadership knows no bounds, she is a role model that I am incredibly fortunate to have. She
taught me to never give up, and has truly led by example. Her contributions toward any
accomplishment of mine are beyond measure.
To my future husband Jonathan, thank you for your boundless love and encouragement
throughout the hard times in this degree when I didn’t know if I would ever make it. Thank you
for supporting me and believing in me no matter how bleakly I viewed my prospects, especially
the countless late nights you spent in the desk beside me at Soto while I prepared for the
screening exam. Thank you for helping me see the light at the end of the tunnel whenever I felt
anxious or depressed, and thank you for loudly celebrating each of my accomplishments.
iv
Table of Contents
Acknowledgments………………………………………………………………………………………...ii
List of Tables……………………………………………………………………………………………...vi
List of Figures…………………………………………………………………………………………….vii
Abstract…………………………………………………………………………………………………..viii
Chapter 1: Introduction…………………………………………………….……………………………..1
Chapter 2: PEREGRINE: A Genome-wide Prediction of Enhancer to Gene Relationships
Supported by Experimental Evidence………………………………………………………..10
Introduction……………………………………………………………………………………...10
Methods………………………………………………………………………………………….13
Gathering the enhancer set……………………………….…………………………..13
Constructing the enhancer-gene links……………………………………………….14
Integrating the PEREGRINE enhancer-gene link data into PANTHER for
interactive online access……………………………………………………..16
Statistical analysis……………………………………………………………………..17
Data availability………………………………………………………………………...17
Results…………………………………………………………………………………………..18
The PEREGRINE enhancer set……………………………………………………...18
Characterizing genome-wide enhancer-gene link relationships in
PEREGRINE….……………………………………………………………….18
Utilizing the PEREGRINE data in PANTHER………………………………...…….28
Discussion……………………………………………………………………………………….37
Chapter 3: Developing a Cell Type-Specific Score for Enhancer-Gene Regulatory Links Using a
Machine Learning Approach…………………………………………………………………..46
Introduction…………………………………………………………………………………...…46
Methods………………………………………………………………………………………….49
Assembling the training dataset……………………………………………………...49
Features……………………………………………………………………………...…53
Model selection and prediction performance……………………………………….55
Scoring new enhancer-gene links……………………………………………………57
Data availability………………………………………………………………………...58
Results…………………………………………………………………………………………..58
Feature selection………………………………………………………………………58
Model generation………………………………………………………………………58
Prediction performance………………………………….…………………………….61
Model selection………………………………………….……………………………..63
PEREGRINE enhancer-gene links have higher cell type-specific scores than
other possible enhancer-gene links in cis….……………………………….64
Discussion……………………………………………………………………………………….67
Using the cell type-specific enhancer-gene link scores……………………………67
Score interpretation……………………………………………………………………68
Perspectives……………………………………………………………………………69
Conclusions…………………………………………………………………………….70
v
Chapter 4: Utilizing the PEACOCK Cell Type-Specific Enhancer-Gene Link Scores for Gene Set
Enrichment Analysis in GWAS Data: A Proof of Concept Application……………………71
Introduction……………………………………………………………………………………...71
Methods………………………………………………………………………………………….74
Gene set enrichment analysis………………………………………………………..74
GWAS data……………………………………………………………………………..75
Incorporation of PEACOCK enhancer-gene regulatory link scores………………75
Adjusting for multiple variants……………………….………………………………..78
Results…………………………………………………………………………………………..80
The effects of including enhancer information on the distribution of genes’ best
variants.………………………………………………………………………...80
The effects of including enhancer information on gene set enrichment analysis
results….……………………………………………………………………….82
Discussion……………………………………………………………………………………….86
Chapter 5: Conclusions…………………………………………………………………………………89
Summary of contributions…………………………….………………………………………..89
Future directions………………………………………………………………………………..91
References………………………………………………………………………………………….……93
vi
List of Tables
Table 1.1 The cell lines used for generating and validating PEACOCK machine learning
scores…………………………………………………..…………………………………………4
Table 2.1 PEREGRINE enhancers……………………………………………………………………14
Table 2.2 Enhancer overlap between sources……………………………………………………….14
Table 2.3 Summary of the PEREGRINE enhancer-gene links dataset…………………………...23
Table 2.4 Distribution of enhancers with each number of target genes…………………………..23
Table 2.5 Mean number of genes linked per enhancer by assay………………………………….24
Table 2.6 Mean number of enhancers linked per gene by assay………………………………….25
Table 3.1 Description of the positive set criteria……………………………………………………..50
Table 3.2 Description of the main effect features……………………………………………………53
Table 3.3 Interaction terms between active enhancer and promoter marks……………………...55
Table 4.1 Top seven pathways negatively enriched in Reactome pathway enrichment analysis
of genes incorporating gene- and enhancer-mapped variant and predicted regulatory link
information………………………………………………………………………………………83
Table 4.2 Top five pathways negatively enriched in Reactome pathway enrichment analysis of
genes incorporating gene-mapped variant information only……………………………….86
vii
List of Figures
Figure 1.1 Training the model and prediction performance………………………………………….5
Figure 1.2 A schematic of this body of work…………………………………………………………...9
Figure 2.1 The number of enhancer-gene links found in each assay……………………………..20
Figure 2.2 Distributions of enhancer-gene links……………………………………………………..21
Figure 2.3 Clustering of genes enriched among PEREGRINE enhancer-gene links as being
linked to fewer enhancers than expected…...……………………………………………….26
Figure 2.4 Viewing enhancer-gene link information on a gene list………………………………...29
Figure 2.5 The schema of PEREGRINE within PANTHER………………………………………...30
Figure 2.6 The enhancer-gene link between EH37E0145522 and GALNT2……………………..34
Figure 2.7 Comparison of PEREGRINE enhancers with HACER and GeneHancer
enhancers……………………………………………………………………………………….40
Figure 3.1 Negative enhancer-gene examples………………………………………………………52
Figure 3.2 The precision-recall curve…………………………………………………………………56
Figure 3.3 Evaluating prediction performance……………………………………………………….57
Figure 3.4 Precision-recall curves of the best performing models on test datasets……………..60
Figure 3.5 Predicted cell type-specific scores for all possible enhancer-gene links in cis………66
Figure 4.1 Description of the Mann-Whitney rank-sum test used for enrichment analysis in
PANTHER……………………………………………………………………………………….75
Figure 4.2 Mapping and weighting GWAS variants…………………………………………………76
Figure 4.3 The distribution of the score-adjusted weighted variant p-values for all genes for
which any variants were attributed to………………………………………………………...79
Figure 4.4 The distribution of the variant p-values for all genes for which any variants were
attributed to……………………………………………………………………………………...82
viii
Abstract
With the advent of genome-wide association studies over the last decade, an explosion
of variant association data has become available to the scientific community. The vast majority
of these trait-associated variants map to enhancers, vital regulatory elements which orchestrate
the recruitment of transcriptional complexes to the promoter of their target genes to upregulate
transcription in a cell type- and timing-dependent manner. Through these popular genome-wide
association studies, variants located within enhancers have implicated thousands of these
powerful regulatory elements with many common genetic diseases, including almost every form
of cancer. Although much is now known about genes and the individual functions of their
products, very little is known about the function of individual enhancers beyond their general
shared properties. Indeed, as the primary function of enhancers is the transcriptional regulation
of their specific target genes, it is crucial to identify the target genes of as many enhancers as
possible, especially those enhancers which are already strongly associated with diseases.
To interpret the disease-associated signals located within enhancer elements, it is
necessary to be able to contextualize enhancers via their biological functions. By predicting
high-quality enhancer-gene regulatory links and then statistically validating a cell type-specific
score for each prediction, this work illustrates how it is possible to interpret enhancer variants
meaningfully within the framework of protein pathways and shared biological processes that are
already well characterized by using their predicted target genes as a bridge. Although the
possible applications for such data using this principle are large in number and for the most part
beyond the scope of this work, we conclude by providing a proof of concept in the form of gene
set enrichment analysis which incorporates enhancer variant information through their target
genes and the cell type-specific score measuring the confidence of each prediction.
1
Chapter 1
Introduction
The human genome is thought to contain up to one million enhancers
1
—short regulatory
elements that primarily function via regulation of their target genes. These regulatory elements
regulate their target genes in a cell type-specific manner by binding transcription factors to
enhance gene expression and recruiting transcriptional machinery to the promoter of the target
gene. This regulation occurs via an intricately arranged network of many-to-many relationships
where enhancers are capable of regulating multiple genes while each gene may be regulated by
multiple enhancers. Regulatory relationships which are active in one cell type may not function
in another different cell type. Enhancers are often near a target gene—indeed, they are even
sometimes located within the intron of their target gene—but they are also often located quite
far away. They can be located over one megabase from their target gene or even regulate a
target gene from another chromosome
2, 3
.
Enhancers are believed to play a vital role in orchestrating the extremely complicated
system of gene expression necessary for human development and continued maintenance of
our differentiated cell populations. When normal enhancer regulation of genes is disrupted,
disease can follow as a result. In fact, enhancers have been associated with many diseases,
including many types of cancer
4, 5
.
Despite their importance, relatively little is known about the specific links between
enhancers and their target genes. Decades of painstaking benchwork and analyses have been
performed at the individual laboratory level to provide many high-confidence enhancer-gene
links in various cell types
6-8
. Although these experimentally validated enhancer-gene links are
valuable, they represent only a small percentage of the total enhancer-gene links at play within
the human genome. They are also not available in any central repository which would make
2
high throughput applications easier, making it difficult to use the existing data in any large-scale
analysis. Additionally, the vast expenses in terms of time and resources make the elucidation of
the rest of all the enhancer-gene links in cell type-specific context via traditional bench
experiments a daunting task. Higher throughput experimental methods for predicting enhancer-
gene links provide a desirable alternative for characterizing this immense regulatory network.
The most commonly used high throughput experimental methods used to predict
enhancer-gene regulatory relationships assay the property of enhancer regulation involving the
enhancer coming into close physical proximity to its target gene via chromatin looping. These
include proximity ligation assays such as 3C, 4C, 5C, Hi-C, promoter-capture HiC, ChIA-PET,
and Hi-ChIP as well as advanced microscopy techniques which are all used to identify the long-
range physical contacts between active enhancers and the promoters of their target genes to
establish a regulatory relationship
9
. Experiments which engineer enhancer-promoter contacts as
well as genome and epigenome editing techniques (e.g., CRISPR-based techniques) are
outstanding approaches for determining enhancer-gene regulatory relationships in specific cell
types
9
. However, genome-wide results are not yet available in a range of cell types. Many
research groups lack the equipment to perform these experiments, presenting a barrier to
gaining this information in new cell types. Thus, computational methods offer an attractive
avenue for predicting enhancer-gene regulatory relationships.
Computational prediction methods have yielded some enhancer-gene link databases
which do not require new experimental data and instead utilize the existing publicly available
data to generate predicted enhancer-gene links
10-15
. However, these databases often do not
provide certain important information related to these enhancer-gene links to the end user. It is
also not always clear what specific types of experimental evidence in which cell types supports
each enhancer-gene link in the download data, an important consideration for many
researchers
11
. Frequently, predicted enhancer-gene links are only available through individual
webpages for each enhancer-gene link or gene
11-13
. Many websites lack an up to date and
3
complete bulk data download file to allow overall analysis of the enhancer-gene link data with
minimal processing
11-13, 15
. Additionally, these databases often do not provide any type of score
for the enhancer-gene links by which the investigator can judge the quality of the presented
enhancer-gene links
10, 12
. When a score is provided, it is not clearly interpretable to the user, as
the score bounds from zero to infinity and does not follow a known statistical distribution
11
. This
makes it difficult for the end user to distinguish between a good or a bad score. Additionally,
some databases do not provide enhancer-gene links in cell type-specific contexts, instead
pooling results from experiments performed in many different cells to link an enhancer to a
potential target gene
11
. Given that enhancer regulation occurs in a very cell type-specific
manner, pooling experimental data without cell type matching may result in many specious
enhancer-gene links when they are assumed to be active in all or particular cellular
environments. Clearly, there is still much room for improvement in the exciting field of predicting
cell type-specific enhancer-gene links.
When new enhancer-gene regulatory relationships are predicted, it is vitally important to
be able to evaluate the quality of the enhancer-gene link with a statistically validated score for
each one. This requires a suitable quantity of enhancer-gene links experimentally validated in
the same cell type for prediction validation purposes. If the predictive model almost always
correctly predicts the experimentally validated enhancer-gene links as true or active, while at the
same time almost always correctly classifying the enhancer-gene links known to not be active in
the same cell type as false links, the model may be considered useful and the resulting scores it
provides to enhancer-gene links may be considered informative. This application lends itself
well to machine learning, a type of artificial intelligence.
Machine learning is the study of how to build systems that improve automatically through
experience and is considered a subset of the field of artificial intelligence
16
. For many
applications, it can be much easier and more feasible to train an algorithm by showing it
examples of desired input-output behavior than to program it manually by anticipating the
4
desired response for all possible inputs
10
. Predicting enhancer-gene regulatory relationships is
certainly one such task, since although the scientific community has a good idea of which
characteristics are important to identifying an enhancer-gene link, it is not always clear how to
weigh these different characteristics simultaneously. There are many different types of machine
learning algorithms and many different ways of implementing machine learning depending on
the task at hand. In the case of enhancer-gene link data, the problem may be viewed in a two-
type classification setting, where given a possible enhancer-gene link with a set of values for
characteristics of interest, an algorithm is trained to classify new possible enhancer-gene links
quickly and accurately as a member of one of the two classes (positive or negative, where
positive indicates that the enhancer-gene link represents a true or active regulatory relationship
in the cellular environment presented and negative indicates that the enhancer-gene link is false
or inactive in that particular cell type). Machine learning classification methods require a set of
training data that the algorithm can use to learn from data, detect patterns, and make fast and
accurate predictions
17
. In this case, the model would require a sample of enhancer-gene links
(instances) from each of the classes and a set of characteristics (features) that would give it
enough information to be able to learn to distinguish between enhancer-gene links from the two
classes. Accordingly, training datasets were assembled for three cell types which were utilized
to train machine learning models that can accurately predict which enhancer-gene links are true
in a given cell type.
Cell line Relevant tissue Disease
HepG2 Liver Hepatocellular carcinoma
HCT116 Colon Colorectal carcinoma
K562 Bone marrow Chronic myeloid leukemia
MCF7 Breast Breast carcinoma
Table 1.1 The cell lines used for generating and validating PEACOCK machine learning scores. Each cell
line is listed with its tissue of origin and what disease it originated from.
After training a machine learning model, it is important to be able to gauge the prediction
performance of the model. There are several ways to accomplish this, but cross-validation and
5
using holdout/outside data as a test set are the main two methods used for assessing the
prediction performance of machine learning models. The preferred method of assessing
prediction performance is generally held to be using outside data which is separate from the
training data and comes from a different source. This means that the algorithm is trained using
the characteristics of the training data to discern patterns differentiating the two classes and
then the newly developed model is tested on how well it applies to new unseen data by
measuring how well it predicts on the outside data. This is a good measure of generalizability
and a good way to infer whether the model has been overfit (trained to predict the training data
so precisely that it will poorly predict on any other data set). Datasets that are used to test the
prediction performance are called the test set, just as the data used to train the model are called
the training set.
Figure 1.1 Training the model and prediction performance. On the left is a set of training data, where each
observation in the data is categorized as “cat” or “not cat” similar to how enhancer-gene links are classed as either
“true” or “not true” in the training data. These examples from each class are then fed to the model for training. After
this process, the algorithm is ready to predict the class of new unseen input data. By inputting test data for which
the classes are known, it is possible to gauge the prediction performance of the model by examining how many of
the observations in the test data it correctly classified (righthand section). In this figure the model classified all
observations correctly.
Sometimes it is not feasible to use outside data as a testing set because there may not
be outside data available from another source. An alternative solution is to partition off a portion
6
of the training data as a first step before any training is done on the model and then use that
holdout data as a testing set later on. This is a less preferable method because the holdout data
may be less independent from the training data than completely outside data from a different
source would be, but this is still a good method for gauging prediction performance. In datasets
of smaller size, it may not be possible to hold out a large enough portion of the training data to
conduct testing while still providing enough training data to train the model well. In cases such
as these, cross-validation is a good solution. Cross-validation partitions the training data into k
equally sized sets (folds) and then iterates through training and testing cycles where in each
iteration, a different fold serves as the holdout data and the remaining folds are combined and
treated as the training data. After training is completed, the holdout fold is used as testing data
and prediction performance metrics are recorded. After each fold has taken a turn as the testing
data, all of the prediction performance metrics are averaged. Looking at these averaged
prediction performance metrics provides a more reliable and less variable measure of prediction
performance, which is more important when the overall dataset is small. To provide an even
more stable measure of prediction performance, sometimes k-fold cross-validation is repeated.
The data is split into k folds of equal size and each fold serves as the testing set until all folds
have served as the testing set. When repeated k-fold cross-validation is used, this process is
then repeated r times. This is useful because even though the folds are still the same size, the
observations are shuffled r times to create a new set of k partitions made up of different
individual observations. This can be repeated many times and all of the prediction performance
metrics averaged at the end to give an even more stable sense of how well the model predicts
on average. This is especially useful when the dataset is too small to allow for many folds, as
was often the case with the enhancer-gene link training data in this work.
When using a classification machine learning method, the model provides a predicted
score bounded from [0,1) of how likely the instance is to be classified in the true class based on
the training the model received. If that number is closer to 1, the model is more likely to classify
7
the instance as true. If the number is closer to 0, the model is more likely to classify the instance
as false. The default classification threshold is usually 0.50, meaning that any observations with
a predicted score higher than 0.50 will be classified as true and the others will be classified as
false. However, rather than establishing a cutoff at the model level, predicted scores were
provided for each link so that the investigator may decide what cutoff is appropriate for their
work. Additionally, the Z-score corresponding to the distribution of the scores of all enhancer-
gene pairs in cis was provided with each raw score to provide a more interpretable value to
each predicted enhancer-gene regulatory relationship. The prediction performance of each
model that predicted these enhancer-gene link scores is also provided so that the investigator
can judge how much confidence they should put into the scores based on how well the
prediction performance of the models were.
After using the final machine learning models believed to have the best prediction
performance to score all enhancer-gene pairs located within 1 Mb of each other as possible
enhancer-gene regulatory relationships, it would be desirable to apply this new information to
existing genome-wide analyses, such as GWAS (genome-wide association studies). Often in
GWAS the majority of disease-associated SNPs are located in non-coding regions, which are
frequently enhancer regions
18
. However, since the majority of enhancers have no known or
even conjectured target genes (excepting the common practice of assuming that the gene
nearest the enhancer may be regulated, which is habitually not the case
19
), a limited amount of
information can be gathered from these signals. Indeed, most downstream analyses ignore
enhancer variants altogether. By applying the enhancer-gene link information scored in a
specific cellular environment via the PEACOCK (Predicted Enhancer Activity in Cis Originating
from Cell-specific Knowledge) method described in Chapter 3 to these studies, it may be
possible to shed light onto previously obscured disease-associated signals by using these
statistically validated enhancer-gene link scores to assign functions to enhancers in a cell type-
specific context.
8
As a first application of these cell type-specific enhancer-gene regulatory relationship
scores, which were provided for all enhancer-gene pairs in cis across the genome, gene set
enrichment analysis was expanded to incorporate enhancer variant data. The PEACOCK
scores for enhancer-gene regulatory relationships in a HCT116 cellular environment were used
to link enhancer variants from a GWAS on colorectal cancer to their putative target genes. After
incorporating the enhancer variants into the overall signal attributed to each gene, gene set
enrichment analysis was performed using Reactome
31
pathways as the gene sets, resulting in
enriched pathways that were shown in a brief search of the scientific literature to have been
previously well-established in the initiation and progression of colorectal cancer. This proof of
concept analysis represents merely the tip of the proverbial iceberg with regard to what
applications are possible using these new enhancer-gene regulatory links.
To summarize, this works begins with an effort to generate high quality enhancer-gene
links. Thus, experimentally supported data was gathered to infer enhancer-gene regulatory
links, which were made available via PEREGRINE
20
and PANTHER
21, 22
, where the research
community has easy access to the enhancer-gene links and supporting experimental evidence
in both bulk download format (PEREGRINE) and using an interface which is friendly to non-
computational biologists who wish to query by gene or variants (PANTHER). Suitable training
data was then gathered for the purpose of using machine learning models to score enhancer-
gene links in a cell type-specific manner. Numerous classification models were developed and
tested, the best performing of which was selected for the purpose of scoring new enhancer-
gene links in available cell lines. These scores were then added to PEREGRINE for four popular
cell lines so that they could become available to the scientific community at large. As an initial
exploratory application, HCT116-specific enhancer-gene link scores were then incorporated via
colorectal cancer GWAS data into gene set enrichment analysis. The PEREGRINE resource,
including the cell type-specific scores available in four cell lines with more to come, will provide
the scientific community with an easily accessible online resource detailing all possible cis
9
enhancer-gene links with their cell type-specific scores. These carefully constructed scores may
then easily be applied to any existing framework relating to enhancers or their target genes,
such as the gene set enrichment analysis of colorectal cancer GWAS data provided here as a
proof of concept.
Figure 1.2 A schematic of this body of work. This diagram describes the flow of the work presented in this
dissertation. The first box represents the motivation for developing PEREGRINE, a collection of predicted
regulatory enhancer-gene links based on publicly available experimental data and described in Chapter 2. The
second box represents the bridge between enhancer-gene pairs and their actual regulatory relationship: their cell
type-specific activities. By developing the PEACOCK cell type-specific scores using machine learning (Chapter 3),
the genome-wide scoring of all possible 17 million enhancer-gene regulatory links was made possible in four
popular cell lines. The third box represents the motivation for the development of cell type-specific enhancer-gene
regulatory predictions: the ability to incorporate enhancer variant information into GWAS data (Chapter 4).
10
Chapter 2
PEREGRINE: A Genome-wide Prediction of Enhancer to
Gene Relationships Supported by Experimental Evidence
Introduction
Enhancers are short regulatory DNA elements that upregulate their target genes in a
tissue- and cell type-specific manner
10
. They may be located locally or distally from their target
genes; they may even be located within the non-coding regions of a target gene. Enhancers
may be found upstream or downstream from their target genes and can function regardless of
orientation
23, 24
. There are thought to be up to one million enhancers in the human genome
which regulate about 20,000 protein-coding genes through a complex network of many-to-many
relationships where enhancers may regulate multiple genes and each gene may be subject to
regulation by multiple enhancers
25, 26
. This transcriptional regulation is widely believed to occur
through enhancer binding of transcription factors and recruitment of transcriptional machinery to
the promoter of the target gene.
27
Enhancers play an important role in establishing cell-specific
identities during differentiation and development through the regulation of gene transcription.
5, 28
When misregulation occurs, it can lead to disease. Indeed, enhancers have been associated
with numerous human diseases, including many cancers
4, 5
.
Despite the importance of enhancer-gene links, relatively little is known about which
target genes are regulated by specific enhancers. The Introduction chapter includes a
discussion on what methods are generally used to experimentally determine the regulatory
relationships enhancers have with their target genes as well as recent efforts to computationally
predict them and the current drawbacks of each. Recently the PEREGRINE (Predicted by
Experimental Results: Enhancer-Gene Relationships Illustrated by a Nexus of Evidence)
11
enhancer-gene links database (www.peregrineproj.org) was introduced
20
, which can be queried
via the PANTHER
21, 22
website (www.pantherdb.org). The PEREGRINE enhancer-gene links
represent a comprehensive set of enhancer-gene predicted regulatory links with accompanying
experimental evidence available via bulk download (PEREGRINE), and also searchable by
variants and putative target gene(s) of interest (PANTHER). Furthermore, since PANTHER is a
comprehensive resource for gene function which provides the most up-to-date functional
annotations to genes, including Gene Ontology
29, 30
, Reactome
31
, and PANTHER Pathways
32
,
these enhancer-gene links provide functional implications to the non-coding regions in the
genome.
To generate high-quality enhancer-gene regulatory links, a set of well-supported
enhancers were amassed from nationally recognized sources which are well-known and trusted
by the scientific community that contained fairly non-redundant enhancer information. Enhancer
data was gathered from ENCODE
26
, Ensembl
33
, FANTOM
34
, and VISTA
35
to generate a list of
putative enhancers which were then linked to protein-coding genes using publicly available
experimental data from Hi-C, ChIA-PET, and eQTL experiments provided by ENCODE and
GTEx
36
. By incorporating several types of experiments that provide information on different
aspects of enhancer activity and function, the PEREGRINE enhancer-gene regulatory links
represent an amalgam of information gathered by examining the various characteristics of the
enhancer-gene regulatory dynamic.
Although enhancers often act on their target genes from great distances, the probability
of a regulatory relationship is thought to drop considerably with increasing genomic distance
between the enhancer and the gene.
11
A megabase of separation is considered to be a practical
upper limit for most enhancer-gene regulatory activity.
37, 38
To this end, topologically associated
domain (TAD) data was incorporated from Hi-C experiments. TADs are spatial subdivisions of
chromatin where physical contacts within the TAD are much more likely than contact between
DNA elements located within separate TADs.
39
TADs are typically a few hundred kilobases to a
12
few megabases of continuous DNA which has been shown to consist of regions folded within
themselves into local compartments where a higher number of chromatin contacts take place.
38
It is believed that the majority of enhancer-gene links are found within the same TAD, which are
fairly stable across cell types and remain largely stable throughout development.
40-42
Enhancers contain multiple transcription factor binding sites which are used to recruit
chromatin remodeling complexes and transcription machinery
27
. This enhancer-protein complex
physically contacts the target gene promoter via chromatin looping to upregulate transcription
43-
47
. Active enhancers engaged in upregulation of their target genes are typically associated with
the presence of H3K27ac marks as well as the presence of RNA polymerase II, the polymerase
responsible for mRNA transcription of genes in humans.
7
In order to assay this function of
enhancer activity on specific target genes, data was incorporated from an experiment designed
to detect genome-wide looping interactions at high resolution. Chromatin Interaction Analysis by
Paired-End Tag sequencing (ChIA-PET) is an assay that is capable of assessing chromatin
interaction frequency through the targeting of DNA regions that are bound to a specific protein
of interest
48
. The protein of interest (in this case RNA polymerase II) is pulled down from cross-
linked fragmented chimeric DNA fragments with an antibody which is specific to the protein of
interest.
49
Sequencing is then performed on the paired fragments to enable the investigator to
examine which DNA regions were interacting with which DNA regions via binding of the protein
of interest.
48
ChIA-PET data were examined to ascertain which enhancers were localized to
which promoters in the presence of RNA polymerase II.
An expression quantitative trait locus (eQTL) is a locus that “explains a fraction of the
genetic variance of a gene expression phenotype. Standard eQTL analysis involves a direction
association test between markers of genetic variation with gene expression levels typically
measured in tens or hundreds of individuals.”
50
Oftentimes eQTL are located within an exon and
may result in a nonsynonymous mutation in the gene product, but eQTL also occur outside of
13
exons and even beyond the gene body.
51, 52
Statistically significant eQTL that mapped to
enhancers were used to link these regulatory elements to their putative target genes.
Many factors must be considered when seeking to link enhancers to their target genes.
They should be located proximally to the promoter of their target gene bound by transcriptional
machinery, which ChIA-PET targeting RNA Polymerase II can assay. Hi-C data can be
informative when enhancers and promoters are located within the same TAD, and especially
when they are thought to be interacting within the same TAD, sometimes seen in the calling of
hierarchical TADs. Additionally, when an enhancer contains a variant that has been shown to
explain part of the variation of gene expression, it implies that the enhancer has a transcriptional
regulatory relationship with the gene. Taken together, all of these data predict many potential
enhancer-gene regulatory links with varying amounts of evidence across many tissue and cell
types. By providing the details of these predicted enhancer-gene links in a cell type-specific
manner via the PANTHER website, a useful and easily accessible network of well-supported
enhancer-gene regulatory links in 78 cell and tissue types has been made available to the
research community. By presenting these enhancer-gene links within the existing framework of
PANTHER, not only will it be possible to extend our understanding of function for those
enhancers, but users will also be able to browse them in the context of function and existing
gene pathways.
Methods
Gathering the enhancer set
Enhancer data was gathered from firmly established and highly utilized sources.
Enhancer coordinate data from four sources comprised the final enhancer set
20
. These sources
include: ENCODE’s catalog of Candidate Regulatory Elements, VISTA, Ensembl, and
FANTOM. Tissue-specific information on the enhancers was not included, but rather a list of
14
enhancers defined by genomic location was the end product. These enhancer coordinates were
the result of experiments performed across various tissues by their respective sources, which
differed somewhat according to each enhancer source. No filtering out of enhancers from these
sources was performed. Instead, the differing enhancer calling pipelines were utilized
simultaneously to maximize the chance of capturing the highest number of enhancers. Each
enhancer was assigned a numeric ID except for enhancers taken from ENCODE’s cCRE which
retained their original alphanumeric IDs assigned by that project. See Tables 2.1 and 2.2 for a
brief summary of enhancer information from these sources. Enhancer overlap for comparison
purposes only was determined using bedtools
53
(an open source software package comprised
of multiple tools for comparing and exploring genomic datasets) using the minimum threshold of
1bp to determine overlap between two elements unless otherwise stated.
Enhancer Source
(Number of enhancers)
Average length
(base pairs)
ENCODE cCRE (991,173) 423
Ensembl (28,239) 662
FANTOM (65,423) 281
VISTA (959) 2037
Total combined (1,085,794) 422
Table 2.1 PEREGRINE enhancers. Summary of the PEREGRINE enhancer set by source of enhancers and
average length of enhancers in base pairs.
ENCODE Ensembl FANTOM VISTA
ENCODE 991,173 (100%) 36,284 (3.6%) 42,457 (4.3%) 1,916 (0.2%)
Ensembl 24,294 (86.0%) 28,239 (100%) 5,234 (18.5%) 92 (0.3%)
FANTOM 40,378 (61.7%) 5,699 (8.7%) 65,423 (100%) 203 (0.3%)
VISTA 763 (79.6%) 75 (7.8%) 144 (15.0%) 959 (100%)
Table 2.2 Enhancer overlap between sources. Numbers in the cells represent the number of enhancers from
the sources in each row that were found to overlap with enhancers from the sources in each column. Percentages
are of the total number of enhancers from the source listed for each row.
Constructing the enhancer-gene links
ChIA-PET
ChIA-PET experimental data were taken from ENCODE’s data matrix for cell lines K562,
MCF7, and HCT116 and processed with the MANGO
49
pipeline to obtain pairs of interacting
15
regions with p-values for each pair. The protein target for the ChIA-PET experiments was RNA
Polymerase II. Bedtools
53
was then used to screen each region for PEREGRINE enhancers and
protein-coding genes. If at least 50% of the enhancer overlapped with the ChIA-PET region, or
at least 50% of the ChIA-PET region overlapped with the enhancer, the ChIA-PET region was
considered to contain the enhancer. Then the enhancer-containing region’s interaction partner
was screened for the promoter of a gene. For purposes of this analysis, the gene’s transcription
start site and the preceding 600bp of its promoter were considered to be the promoter. If at least
50% of a gene’s promoter overlapped with the ChIA-PET region, or at least 50% of the ChIA-
PET region overlapped with the gene’s promoter, the region was considered to contain a gene
capable of upregulation by the enhancer in its ChIA-PET interacting partner region. Thus, the
pairs of interacting ChIA-PET regions containing an enhancer and a promoter according to
these parameters were recorded as enhancer-gene links if the ChIA-PET interaction achieved
significance at the α=0.05 level.
eQTL
eQTL data were downloaded from GTEx for all 48 available tissues if they were
statistically significant (p<0.05). Any eQTL located within the exons of the gene they were
associated with were excluded from analysis. Only protein-coding genes were considered for
this analysis. Bedtools intersect was then used to map eQTL to enhancers. If an eQTL was
located within an enhancer, it was considered linked to the gene influenced by the eQTL.
Individual eQTL were recorded with tissue type, eQTL, p-value, mapped enhancer, and
associated gene.
Hierarchical Topologically Associated Domains
Analyzed Hi-C data was downloaded from PSYCHIC
38
for the 9 available cell types.
Regions were provided that interacted with the promoter of the listed gene with a FDR of <0.01.
16
Bedtools intersect was then used to map PEREGRINE enhancers to these regions. If at least
90% of an enhancer overlapped with one of these regions, it was recorded as linked to the gene
PSYCHIC reported as physically interacting with the region. The cell type and FDR were also
recorded for each enhancer-gene link.
Topologically Associated Domains
Topologically associated domain (TAD) boundary data were downloaded from
ENCODE’s Hi-C experiments for 19 cell types. Bedtools intersect was then used to screen each
region for PEREGRINE enhancers and protein-coding genes. If at least 90% of the enhancer
overlapped with the TAD, the TAD was considered to contain the enhancer. Then the TAD was
screened for the promoter of a gene. For purposes of this analysis, the gene’s transcription start
site and the preceding 600bp of its promoter were considered to be the promoter. If at least 90%
of a gene’s promoter overlapped with the TAD, the TAD was considered to contain a gene
capable of upregulation by the enhancer within the same TAD. Thus, all enhancers contained
within a TAD were linked to all of the genes with promoters located in the same TAD. This
supporting evidence for a potential regulatory relationship was recorded only for enhancer-gene
links already generated from another dataset. This ensured that enhancer-gene links generated
only due to the enhancer and the gene being located within the same TAD (a relatively weak
form of supporting evidence likely to include a disproportionately large amount of false
enhancer-gene links) were not recorded.
Integrating the PEREGRINE enhancer-gene link data into PANTHER for interactive online
access
The enhancer-gene link data is indexed and stored in an Apache Solr
54
database. The
stored data contains gene, enhancer (ID and coordinates), assay (tissue, score), and source
information. PANTHER retrieves the data from the Solr DB through requests by gene,
17
enhancer, or genomic coordinates to a Python Flask REST
55
API server that then
communicates with Solr to return the results. The PANTHER website primarily handles genomic
data and its attributes. The Enhancer REST API is used to retrieve additional information about
enhancers that have been mapped to genes. There are three areas where the enhancer REST
API is utilized:
1. It is queried via SNP i.e. chromosome with start and end position to return list of
associated enhancers and genes. This feature is used when mapping VCF data.
2. It is queried via gene identifier after SNP data has been mapped to genes, to determine
enhancers associated with genes. The same functionality is used to retrieve information
about enhancers for a single gene when displaying gene detail information.
3. It is queried via enhancer ID to determine additional details about an enhancer as well
as the list of genes that it is associated with.
Statistical Analysis
Enrichment analysis was performed using Gene Ontology Biological Processes via the
PANTHER web interface
56
to examine if there were any gene pathways enriched due to having
more or less than the expected number of linked enhancers per gene under the null hypothesis
of the Mann Whitney U test that the two samples come from the same distribution.
Data Availability
The PEREGRINE enhancer-gene links are available at www.peregrineproj.org, and can
be queried via the PANTHER website (www.pantherdb.org). The GitHub repository for this
work, which includes the URLs to all the source data as well as scripts to generate the final
dataset, is available at https://github.com/USCbiostats/PEREGRINE_enhancer_gene_links.
18
Results
The PEREGRINE enhancer set
The PEREGRINE enhancer set consists of 1,085,794 enhancers from ENCODE’s
catalog of candidate Cis-Regulatory Elements, Ensembl, FANTOM, and VISTA with an average
length of 422 bp (Table 2.1). A total of 991,173 non-overlapping enhancers were collected from
ENCODE with an average length of 423 bp. Another 28,239 non-overlapping enhancers were
collected from Ensembl with an average length of 662 bp. Additionally, 65,423 non-overlapping
enhancers were collected from FANTOM with an average length of 281 bp. Finally, 959
enhancers were collected from VISTA with an average length of 2,037 bp. Seven of these
overlapped with another enhancer within the VISTA set. Although the enhancers from each
source were almost perfectly non-overlapping among enhancers from the same source, there
was some overlap between enhancers from different sources. Overlap in this context was
calculated at the minimum threshold of 1 bp. 157,539 PEREGRINE enhancers overlapped with
at least one other enhancer to form 1,002,071 non-overlapping enhancer regions (426 million
base pairs total) from 1,085,794 total enhancers, accounting for ~13% of the human genome.
These overlapping enhancers represent 14.5% of the PEREGRINE enhancer set. For a
breakdown of enhancer overlap by source, see Table 2.2.
Characterizing genome-wide enhancer-gene link relationships in PEREGRINE
Altogether, there were 890,403 enhancer-gene links generated from ChIA-PET, eQTL,
hierarchical TAD, and linear TAD data across 78 cell and tissue types (Fig. 2.1). See Table 2.3
for a breakdown of the number of enhancer-gene links predicted from each assay. These
enhancer-gene links linked 449,627 enhancers representing nearly 181 million bp (~6% of the
genome) to 17,643 genes. Across all enhancer-gene link data, each enhancer was linked to an
average of 2 genes (1.98) and each gene was linked to an average of 50 enhancers (50.47).
19
These averages are about what might be expected for roughly one million enhancers regulating
roughly 20,000 protein-coding genes. Histograms of the number of enhancers per gene and the
number of genes per enhancer are provided in Figure 2.2 as well as Table 2.4 giving the
cumulative percentages for each frequency value. Unsurprisingly, the value with greatest
density in both histograms is 1. Indeed, most enhancers (56%) had only one putative target
gene, and 96% of enhancers had 5 putative target genes or less. The highest number of
putative target genes per enhancer was 34, but this accounted for only one enhancer.
Enhancers with 10 or more putative target genes accounted for less than 1% of the total
enhancers linked to genes in this analysis. When stratified by tissue and cell type, the average
number of putative target genes per enhancer was largely the same, with a range of 1.00-1.95
putative target genes per enhancer from each tissue and a median of 1.31 average putative
target genes per enhancer in each tissue. When enhancer-gene links were stratified by assay
(Table 2.5), the lowest average putative target genes per enhancer was found in ChIA-PET data
(1.17 putative target genes per enhancer), and the highest average from eQTL data (2.02
putative target genes per enhancer).
20
Figure 2.1 The number of enhancer-gene links found in each assay. This Venn diagram (not to scale) shows
the number of enhancer gene-links found in each assay and each combination of assays.
21
a.
22
b.
23
c.
Figure 2.2 Distributions of enhancer-gene links. a. The distribution of the number of putative target genes for
each enhancer. Each bar represents the quantity of enhancers with each given value of linked genes. b. The
distribution of the number of enhancers linked to each gene. Each bar represents the quantity of genes with each
given value of linked enhancers. c. The distribution of the number of total tissues each enhancer-gene link was
found in. Each bar represents the quantity of enhancer-gene links with each given value of tissues giving
supporting evidence of the link.
Enhancer-Gene Links By Assay Number of Links Generated From Each Assay
ChIA-PET 11,402
eQTL 435,973
TAD 855,976
Hierarchical TAD 491,346
Total Enhancer-Gene Links 890,403
Table 2.3 Summary of the PEREGRINE enhancer-gene links dataset. These enhancer-gene links were taken
from datasets across 78 tissues.
Number of target genes Number of Enhancers Cumulative Percentage of Enhancers
1 249,820 0.56
2 99,919 0.78
3 46,024 0.88
4 22,620 0.93
24
5 12,404 0.96
6 6,851 0.97
7 4,097 0.98
8 2,610 0.99
9 1,757 0.99
10 1,179 0.99
11 800 >0.99
12 490 >0.99
13 289 >0.99
14 183 >0.99
15 154 >0.99
16 104 >0.99
17 59 >0.99
18 30 >0.99
19 55 >0.99
20 17 >0.99
21 35 >0.99
22 16 >0.99
23 31 >0.99
24 22 >0.99
25 15 >0.99
26 11 >0.99
27 13 >0.99
28 5 >0.99
29 6 >0.99
30 3 >0.99
31 2 >0.99
32 2 >0.99
33 3 >0.99
34 1 1.0
Table 2.4 Distribution of enhancers with each number of target genes. The most target genes an enhancer
was found to have was 34, with the least amount being 1. Over 56% of enhancers linked to genes in this analysis
were found to only have a single target gene. About 97% of enhancers here were found to have less than six
target genes.
Assay Mean Putative Target Genes per Enhancer
ChIA-PET 1.17
eQTL 2.02
Linear TAD 1.96
Hierarchical TAD 1.56
Table 2.5 Mean number of genes linked per enhancer by assay. Each row lists the mean number of genes that
were linked to each enhancer in the assay listed.
Although the mean number of enhancers linked to each gene was 50.46 across the
entire dataset, most genes had 40 or less linked enhancers (51%). However, the top 12% of
genes had 100 or more linked enhancers, and a single gene (ERI1) was linked to 378
25
enhancers. Interestingly, ERI1 is an evolutionarily conserved exoribonuclease involved in the
regulation of diverse types of RNA to function as an important modulator of epigenetic gene
expression
57
. When stratified by tissue and cell type, the average number of enhancers linked to
each gene was lower in many tissues, with a range of 4.40-50.59 enhancers linked to each
gene from each tissue and a median of 13.03 average enhancers linked to each gene in each
tissue. The average across all data remains higher due to the outsized number of enhancer-
gene links found in the tissues with the highest mean enhancers linked to each gene (Fig 2.2C).
When enhancer-gene links were stratified by assay (Table 2.6), the lowest average of
enhancers linked to each gene was found in ChIA-PET data (5.40 enhancers linked to each
gene), and the highest average came from linear TAD data (50.88 enhancers linked to each
gene). This is likely due to the fact that linking enhancers and genes together based on being
located within the same TAD is the least discriminate way to link an enhancer to a gene of all
methods used in PEREGRINE.
Assay Mean Enhancers per Gene
ChIA-PET 5.40
eQTL 26.97
Linear TAD 50.88
Hierarchical TAD 38.30
Table 2.6 Mean number of enhancers linked per gene by assay. Each row lists the mean number of enhancers
that were linked to each gene in the assay listed.
GO Biological Processes that were statistically significantly enriched for having fewer
enhancers per gene than expected included several processes related to immune function.
Clustering of these genes and their linked enhancers in the genome was also observed. It has
been previously shown that clustering occurs with genes related to immune function
58
, which is
thought to potentially be due to coregulation by shared enhancers
59
. Such a phenomenon could
account for why these groups of genes were linked to less enhancers on average than others.
Several other clustered sets of genes were also observed among the groups of genes linked to
fewer enhancers than expected—the olfactory receptors on chromosomes 14 and 17 and the
26
taste receptors on chromosomes 7 and 12 (Fig 2.3). These clustered genes and their nearby
enhancers linked by PEREGRINE may be good candidates for further research on the
relationship between clustered genes and their potential coregulators.
a.
b.
c.
27
d.
e.
f.
Figure 2.3 Clustering of genes enriched among PEREGRINE enhancer-gene links as being linked to fewer
enhancers than expected. a. Genes with function related to immune system processes shown to cluster. b.
Olfactory receptors on chromosome 14 annotated with as part of the GO biological process “detection of chemical
stimulus involved in sensory perception of smell” GO:0050911 (p<1.0E-10) c. Olfactory receptors on chromosome 17
annotated with as part of the GO biological process “detection of chemical stimulus involved in sensory perception of
smell” GO:0050911 (p<1.0E-10) d. Taste receptors on chromosome 7 annotated with as part of the GO biological
28
process “detection of chemical stimulus involved in sensory perception of taste” GO:0050912 (p<1.0E-10) e. Taste
receptors on chromosome 12 annotated with as part of the GO biological process “detection of chemical stimulus
involved in sensory perception of taste” GO:0050912 (p<1.0E-10) f. Enrichment analysis for GO biological processes
based on number of enhancers linked to each gene.
Out of all 890,403 enhancer-gene links, each link was found in an average of 19.30
tissues. Two enhancer-gene links were found in 71 tissue and cell types, which was the
maximum for any enhancer-gene link in PEREGRINE. In order to examine the patterns between
these data more easily, the number of tissues the enhancer-gene links were found in was
binned in groups of 5. The enhancer-gene links found in 71 tissues were binned with the
enhancer-gene links found in 66-70 tissues. A chi-squared test of association was performed
between the binned total tissues enhancer-gene links were found in and the number of assays
they were supported by, indicating that the number of assays the enhancer-gene links were
supported by was related to the number of tissues the enhancer-genes links were found in
(p<2.2 e-16). The Pearson correlation coefficient between these two variables was 0.39 (p<2.2
e-16), indicating that the more tissues an enhancer-gene link was found in, the more likely it
was to be supported by multiple assays.
Utilizing the PEREGRINE data in PANTHER
A user can access the data in two ways. The first is to retrieve the enhancers linked to
genes of interest. A single gene or a list of genes may be uploaded and viewed by selecting
“Functional Classification viewed in gene list.” This will generate a gene list page with
annotations to the genes (Fig 2.4). Each gene is displayed with a list of its linked enhancers.
Each enhancer will be shown as a hyperlink to the enhancer detail page (Fig 2.5A) for that
enhancer. The enhancer detail page contains information on the experimental evidence (assay,
p-value, and eQTL ID if applicable) supporting each enhancer-gene link that that enhancer is
involved in. The user can also click the gene identifier hyperlink to go to the gene detail page
(Fig 2.5B). Click the “View enhancers” link to view details of all enhancers linked to this gene.
29
The second way is to map the genetic variants to enhancers and retrieve a list of genes that are
regulated by the enhancers. The user can upload a VCF file on the home page, select the VCF
File as list type and check the Search Enhancer Data box.
Figure 2.4 Viewing enhancer-gene link information on a gene list. The PANTHER website is able to take a list
of genes from the user and provide a list of enhancers associated with each gene presented as a hyperlink to
more information about the supporting evidence for each enhancer-gene link as well as its cell and tissue type.
30
a.
31
b.
Figure 2.5 The schema of PEREGRINE within PANTHER. This schema shows what information will be made
available within the PANTHER website. a. Gene detail page. Each gene in PANTHER has a gene detail page
which now includes an Enhancers section (circled in red) with a link to view all enhancers associated with that
gene by PEREGRINE. b. Enhancer detail page. Each enhancer has an enhancer detail page in PANTHER with
the enhancer’s ID, genomic location, original source, and detailed information on the experimental evidence used
by PERGERINE to link that enhancer each of its associated genes.
The PANTHER website (www.pantherdb.org) allows various methods for querying the
PEREGRINE predicted enhancer-gene links. On the homepage, the user may upload a VCF file
with SNPs of interest and return any enhancers or genes the rsIDs map to. For example, a user
may be interested in rs143969848, a rare single-nucleotide variant found in 5.4% of suspected
Lynch syndrome
60
(the most common type of hereditary colon cancer
61
) patients. PANTHER
maps this rsID to enhancer EH37E0652188 (Fig 2.5), which PEREGRINE links to just three
genes (LRRFIP2, MLH1, and EPM2AIP1). All three are associated with Lynch syndrome in
ClinVar
62
, and two of the genes (MLH1 and EPM2AIP1) are overlapping, with MLH1 on the plus
32
strand and EPM2AIP1 on the minus strand. The shorter EPM2AIP1 gene (8.3kb) is completely
within the coordinates of the much larger MLH1 (57.6 kb), though on the opposite strand.
LRRFIP2 is located just 1.7kb away from MLH1, on the minus strand. MLH1 is a tumor
suppressor gene involved in DNA mismatch repair. It is involved in the pathogenesis of Lynch
syndrome as well as endometrial and colorectal carcinomas. The LRRFIP2 protein product
binds to the cytosolic tail of TLR4, resulting in activation of nuclear factor kappa B signaling.
Dysregulation of the nuclear factor kappa B signaling is a common event in many cancer types
which contributes to tumor initiation and progression by driving expression of pro-
proliferative/anti-apoptotic genes.
61
High expression of NF-κB has also been significantly
associated with late stage colorectal cancer.
63
The function of the protein encoded by
EPM2AIP1 is not known. Therefore, it seems that MLH1 and LRRFIP2 and their connection to
the EH37E0652188 enhancer most warrant further investigation. While little is known about the
distal regulation of LRRFIP2, Liu et al
60
recently showed that a 1.8kb region located 35kb
upstream of MLH1 interacted with the MLH1 promoter, displayed enhancer function in luciferase
reporter assays, and statistically significantly altered the expression of MLH1 using CRISPR-
Cas9-mediated deletion of endogenous regions. This region also includes a CTCF-binding
motif, which has been shown to disrupt enhancer activity in SW620 colorectal carcinoma cells.
60
Also within this region lies the entire 770bp enhancer EH37E0652188, the enhancer that
PEREGRINE predicted as regulating MLH1.
The SNP rs2144300 has been statistically significantly associated with HDL cholesterol
levels in humans.
64
PANTHER maps this rsID to enhancer EH37E0145522, which PEREGRINE
links to just one gene, GALNT2 (Fig. 6). This variant is found within the first intron of GALNT2, a
gene strongly associated with HDL cholesterol levels
65
. Roman et al
66
explored the SNPs at this
locus to reveal a 780-bp segment containing rs4846913, rs2144300, and rs6143660 that
displayed allelic differences in regulatory enhancer activity in luciferase assays. They also
showed differential CEBPB binding to rs4846913 using electrophoretic mobility shift assays
33
which they confirmed occurred in a native chromatin context with ChIP assays in two liver
cancer cell lines. Allelic-expression-imbalance assays performed with RNA from primary human
hepatocyte samples and expression-quantitative-trait-locus (eQTL) data confirmed that these
SNPs are associated with increased GALNT2 expression. They proposed that at minimum,
rs4846913 and rs2281721 play key roles in influencing GALNT2 expression at this locus.
Cavalli et al
65
showed that rs4846913 and the neighboring rs2144300 displayed allele specific
enhancer activity and proposed that events occurring at these SNPs influence the transcription
levels of GALNT2. All three SNPs (rs4846913, rs2144300, and rs6143660) in the 780-bp
segment validated by Roman et al fall within EH37E0145522, which is only 813bp long. The
other SNP, rs2281721, did not map to any enhancers in the PEREGRINE set.
34
a.
b.
35
Figure 2.6 The enhancer-gene link between EH37E0145522 and GALNT2. a. Gene detail page. b. Enhancer
detail page.
36
The colorectal cancer risk-associated variant rs2238126 is located within an intron of
ETV6, an ETS family transcription factor. PANTHER maps this variant to a single enhancer,
EH37E0254775. Wang et al
67
showed that the G allele of rs2238126 reduces the binding affinity
of MAX, a transcription factor thought to enhance transcription of ETV6, resulting in significantly
lower mRNA levels of ETV6. They proposed that expression of ETV6 is regulated by the SNP
rs2238126 and that the rs2238126 G allele is associated with an increased risk of colorectal
cancer because of decreased transcription factor MAX binding, resulting in downregulating
ETV6 expression. They also tested a putative enhancer region centering rs2238126 (1kb in
length) for enhancer activity using luciferase assays in HCT116 and SW480 cells and found that
the A allele of rs2238126 conferred statistically significantly higher luciferase expression in both
cell types as compared to the G allele and as compared to a vector with no enhancer region.
Though rs2238126 mapped only to enhancer EH37E0254775 (647 bp), this longer putative
enhancer region also mapped to enhancer 12808 (292 bp), which overlaps almost completely
with EH37E0254775.
Before the release of PEREGRINE, a user could upload variants of interest in a VCF file
to the PANTHER homepage for analysis, and PANTHER would call the variants to their nearest
genes according to a gene flanking region distance set by the user. This means that if a user
was looking for variants within an enhancer that could have a regulatory relationship with a
gene, those enhancer variants would need to be within the flanking region of that gene to
appear in the PANTHER gene list output as associated with that gene. Using the PEREGRINE
data integrated into PANTHER, the user can now select a search for variants called to the
PEREGRINE enhancer regions and obtain a gene list of all of the genes associated with those
variant-containing enhancers. This is especially important when considering that of the
PEREGRINE enhancer-gene links, only 12% involve an enhancer that is within 20kb of its
putative target gene. Indeed, 42% of the PEREGRINE enhancer-gene links comprise a gene
37
and an enhancer that are at least 100kb apart, thus greatly expanding the user’s ability to
examine putative regulatory variants using the PANTHER framework.
Discussion
Enhancers are vital regulatory elements that increase transcription of their target genes
many times over. They play a key role in development and are implicated in many common
diseases, including many cancers. Determining which genes are the target genes of specific
enhancers is key to informing to what extent and how enhancers contribute to disease
pathology. Although significant efforts have been made to successfully elucidate enhancer-gene
regulatory relationships at the bench, these experimental findings represent only a small fraction
of all enhancer-gene regulatory relationships. Additionally, these results are not automatically
deposited into any central repository of known enhancer-gene links. Thus, utilizing these data
on a large scale is laborious. A high throughput method of predicting enhancer-gene links with
good ability is desirable. High throughput experimental methods have helped in this direction,
with good ability to predict enhancer-gene contacts in the cell types the experiments are
conducted in. However, some of these methods are relatively new and therefore data is not yet
widely available in a large range of cell types or on a genome-wide basis. Additionally, a bench
laboratory is necessary to perform these experiments in new cell types, which is a limiting factor
for many analytical groups. Some computationally predicted enhancer-gene link databases
have been developed, which do not rely on new experimental data and instead use publicly
available data to compile predicted enhancer-gene links. However, these methods are limited in
their availability to the scientific community. Most do not offer an up to date bulk downloadable
option for the data to be examined en masse, but instead only make the data available to the
end user via individual webpages for each enhancer, gene, or enhancer-gene link. There are
also varying degrees of the amount of information available regarding what specific evidence in
38
which cell types supports each enhancer-gene link, which may be of great interest to the end
user. The PEREGRINE enhancer-gene links, made available via the PEREGRINE website for
bulk download and via PANTHER website for putative target gene and variant queries,
represent an easily accessible and comprehensive set of enhancer-gene links with
accompanying experimental evidence.
In order to assay the enhancer and promoter binding of the transcription machinery that
enhancers are known to recruit to their target genes, ChIA-PET data was used to identify pairs
of regions containing enhancers and promoters bound to the target protein RNA Polymerase II.
Enhancers are often located within the same topologically associated domain as their target
genes, so Hi-C data was used to link enhancers to genes within the same topologically
associated domain, but these results were only recorded if they supported an enhancer-gene
link that was already found in another experiment. This was done in order to reduce the number
of false positives likely to be incurred by linking every enhancer to every gene within the same
topologically associated domain. Since enhancers are thought to function through achieving
close proximity with their target genes’ promoters, enhancer-gene links were also taken from Hi-
C data where the hierarchy of contacts within topologically associated domains was captured.
Enhancers were also screened for eQTL and linked to any gene which showed statistically
significant differences in expression based on that eQTL. Together, these data assay the
characteristics of enhancer regulation of genes in 78 cell and tissue types to yield 890,403
enhancer-gene links. On average, each gene was linked to 50 enhancers while each enhancer
was linked to 2 putative target genes. Enhancer-gene links which were found in many tissues
were more likely to be supported by more assays (p<2.2e-16) according to a chi-squared test of
association. The Pearson correlation coefficient between the number of assays supporting an
enhancer-gene link and the number of tissues and cell types that link was found in was 0.39
(p<2.2e-16), indicating that enhancer-gene links supported by more assays are more likely to be
found within a wider range of cell and tissue types in these data.
39
Although each gene was linked to an average of 50 enhancers using the PEREGRINE
enhancer set, there is some overlap among the PEREGRINE enhancers (Table 2.2). In order to
determine how much this might be influencing the average enhancers linked to each gene,
analysis was redone using only the enhancers taken from ENCODE, as these are all mutually
exclusive enhancer elements and also account for over 91% of the PEREGRINE enhancer set.
Restricting only to the ENCODE enhancers, each gene was linked to an average of 45
enhancers. Thus, the modest overlap between enhancers in the PEREGRINE set does not
dramatically change the average number of enhancers linked to each gene.
Two enhancer-gene link prediction databases employing strategies most similar to those
used in PEREGRINE, GeneHancer and HACER, were compared to the PEREGRINE enhancer-
gene link database. An analysis of PEREGRINE enhancers compared to the enhancer sets
from HACER and GeneHancer provided in their Data Download sections show that
PEREGRINE enhancers are more comprehensive than either of these sets (Fig 2.7). Of the
1,085,794 enhancers in the PEREGRINE set, only 268,811 (24.8%) overlap with the enhancers
in the HACER set. In contrast, of the 1,685,398 HACER enhancers, 1,644,428 (97.6%) have
overlap with PEREGRINE enhancers. Of the 1,085,794 enhancers in the PEREGRINE set,
461,202 (42.5%) overlap with the enhancers in the GeneHancer set. Conversely, of the 217,695
GeneHancer enhancers that successfully converted to hg19, 180,743 (83.0%) have overlap with
PEREGRINE enhancers. Even when analyses require that overlap between two enhancers
require at least one of the elements to overlap with the other at least 50%, these numbers
remain stable (HACER shifts from 24.8% and 97.6% to 23.7% and 97.4% respectively;
GeneHancer shifts from 42.5% and 83.0% to 40.0% and 81.2% respectively). The reason for
such overlap is most likely due to PEREGRINE using some enhancer data from sources that
are common among both databases (e.g. VISTA, FANTOM). However, the PEREGRINE
enhancer set is larger and not well captured by either HACER or GeneHancer enhancer sets in
part because enhancer data from various sources with mostly non-overlapping enhancer
40
coordinates was utilized to make up the PEREGRINE enhancer set. In fact, the enhancers in
PEREGRINE exhibit less internal overlap than the enhancers in the HACER set.
a.
41
b.
42
c.
43
d.
Figure 2.7 Comparison of PEREGRINE enhancers with HACER and GeneHancer enhancers. Percentages are
in terms of the enhancer set that the percentage labels are in. a. HACER and PEREGRINE enhancer overlap in
terms of percentages of numbers of enhancers from each source. b. HACER and PEREGRINE enhancer clusters
overlapped in terms of percentages of numbers of clusters from each source. c. GeneHancer and PEREGRINE
enhancer overlap in terms of percentages of numbers of enhancers from each source. d. GeneHancer and
PEREGRINE enhancer clusters overlapped in terms of percentages of numbers of clusters from each source.
Using the cluster and merge commands from the bedtools suite to output clusters of
partially overlapping elements, 1,085,794 PEREGRINE enhancers clustered to 1,002,071 non-
overlapping enhancer regions. Performing the same analysis with HACER enhancers resulted
44
in 104,078 clusters of non-overlapping enhancer regions from 1,685,398 enhancers. This high
instance of overlapping among enhancers in the HACER set is in part due to the fact that
HACER reports cell type specificity for each enhancer and names each enhancer from each cell
type uniquely, even if there is very high overlap between them (which possibly indicates that
these elements sometimes refer to the same enhancer in two different cell lines). This means
that instances frequently occur such as enhancer AE_hg19_HCT116_42299, located at
chr1:227297-239062, highly overlapping with enhancer AE_hg19_VCaP_111919, located at
chr1:227356-238615, because AE_hg19_HCT116_42299 is based on data in cell line HCT116
and AE_hg19_VCaP_111919 is based on data in cell line VCaP. These likely refer to the same
enhancer region, but the data from each cell line gives slightly different coordinates for the
enhancer region. GeneHancer enhancers were almost perfectly unique, resulting in non-
overlapping clusters nearly identical to their enhancers with an average length of 1,572 bp.
Another interesting difference between PEREGRINE enhancers compared to HACER
and GeneHancer enhancers relates to the average length of the enhancers in each set. The
average length of HACER and GeneHancer enhancers are much longer than the average
length of PEREGRINE enhancers. The average length of PEREGRINE enhancers is 422 bp.
The average length of PEREGRINE clusters is 434 bp. The average length of HACER
enhancers is 713 bp. The average length of HACER clusters is 3,440 bp. The average length of
GeneHancer enhancers and clusters is 1,572 bp. This indicates that PEREGRINE enhancers
are more closely scaled to the length that most enhancers are thought to be, which is closer to
hundreds of base pairs than to thousands.
These databases, including the PEREGRINE enhancer-gene regulatory links, are limited
by their lack of statistical validation. Due to the lack of a gold standard database of enhancer-
gene regulatory relationships for new predictions to be judged by, it is nearly impossible to
statistically validate predicted enhancer-gene regulatory links which have been generated
across many assays and cell types. Although there are multiple instances of experimentally
45
validated enhancer-gene links from years of benchwork being captured by prediction databases,
little is known about the magnitude of how many spurious enhancer-gene links are included
along with the legitimate ones in these sets of predictions, which would also depend on the
cellular context the predictions were being evaluated in. Future work was driven by the desire to
address this limitation. The focus was on the generation of a statistically validated enhancer-
gene regulatory link score utilizing curated enhancer-gene regulatory relationships taken from
peer-reviewed published studies in a cell type-specific context. The motivation to develop such
a score was to allow researchers to see which enhancer-gene regulatory links are reported with
the highest confidence, which represents a valuable addition to the PEREGRINE enhancer-
gene regulatory links as well as any collection of predicted enhancer-gene regulatory links for
which cell type-specific context is desired.
46
Chapter 3
Developing a Cell Type-Specific Score for Enhancer-Gene
Regulatory Links Using a Machine Learning Approach
Introduction
Enhancers are short (50-2000 bp) DNA regulatory elements that activate the expression
of target genes in a cell type- and timing-specific manner
68
. Enhancers function independently of
orientation and contain transcription factor binding sites which are used to recruit chromatin
remodeling complexes and transcription machinery. The enhancer complex loops over and
physically contacts the target gene promoter to upregulate transcription
47
. Genes can be
targeted by more than one enhancer, and some have been seen to interact with more than ten
enhancers
69-71
. A single enhancer may also regulate more than one target gene, perhaps even
simultaneously
72
. There is great variability in the distance from which enhancers regulate their
target genes. Often an enhancer targets the nearest gene, but enhancers have also been
shown to frequently skip their nearest genes to regulate more distal genes, even at distances of
over a million base pairs away
73, 74
. During cellular differentiation, enhancers play a vital role in
cell fate determination
75
. Most cancer cells require the rogue upregulation of oncogenes, which
is directed by enhancers or by clusters of enhancers termed super-enhancers, previously
described as locus control regions
76-78
. Thousands of disease-associated variants identified by
GWAS have been found to reside in enhancers, implicating them in many deadly diseases
18
.
Although the genomic locations of thousands of enhancers are known, the target genes
of each enhancer in a specific cell type are usually undetermined. Determining which genes are
the regulatory targets of specific enhancers—especially those already associated with
47
disease—is key to furthering our understanding of how enhancer regulatory activities may
function to contribute to disease.
Since it is widely believed that enhancers come into physical contact with the promoters
of their target genes in order to regulate their transcription, some of the most common
experimental methods by which enhancer-gene regulatory associations are identified are via
proximity ligation assays (e.g., 3C, 4C, 5C, Hi-C, promoter-capture HiC, ChIA-PET, Hi-ChIP,
etc.) and advanced microscopy techniques which seek to identify long-range physical contacts
between enhancers and promoters to establish a regulatory relationship
9
. Genome and
epigenome editing techniques as well as experimental engineering of enhancer-promoter
contacts (e.g., CRISPR-based techniques) provide excellent avenues for elucidating specific
enhancer-gene regulatory relationships in specific cell types
9
. However, these datasets remain
unavailable in the majority of cell types. Due to the incredibly cell type-specific nature of
enhancer-gene regulatory relationships, the genome-wide characterization of enhancer-gene
regulatory networks in a large range of specific cell types represents a significant need for
disease research.
Many computational efforts to predict large quantities of enhancer-gene links have
resulted in numerous databases with predictions based on publicly available experimental
data
10-15
. However, these prediction-based datasets lack a cell type-specific validation method to
consistently validate the accuracy of their predictions. There is not enough cell type-matched
data provided in these databases on which target genes are regulated by specific enhancers to
be able to statistically validate any score (when one is provided) of these predicted enhancer-
gene regulatory links, which adds a great degree of uncertainty to such datasets. Although there
are multiple instances of prediction databases capturing some of the enhancer-gene regulatory
relationships which have been experimentally validated through years of rigorous benchwork,
little is known about the magnitude of how many spurious enhancer-gene links are included
along with the legitimate ones in these sets of predictions.
48
This chapter describes a simple yet effective approach (PEACOCK: Predicted Enhancer
Activity in Cis Originating from Cell-specific Knowledge) to predict gene-enhancer regulatory
links using standard machine learning classification algorithms. Peer-reviewed scientific articles
in four cancer cell lines (HepG2, HCT116, K562, and MCF7) were scoured to amass 148
experimentally validated enhancer-gene links to generate cell type-specific positive enhancer-
gene regulatory examples. Publicly available DNA accessibility data in the same cell lines were
leveraged to generate a large collection of negative examples. Machine learning models were
trained separately on the training data generated for each cell type and evaluated on their ability
to identify true or active enhancer-gene regulatory links on separate test data.
For each enhancer-gene link, features that are widely believed to be hallmarks of
enhancer activation of target genes were used to describe the regulatory link. Well-chosen
features enable the algorithm to distinguish meaningful patterns between the positive enhancer-
gene links representing a true regulatory relationship and the negative enhancer-gene links
representing an enhancer-gene pair with no regulatory relationship in a particular cellular
environment. ChIP-seq data from ENCODE targeting epigenetic marks associated with active
enhancers (H3K27ac, H3K4me1, and binding of histone acetyltransferase P300) and actively
regulated promoters (H3K4me3, H3K27ac) were included. A measure of significance (p-value)
and a measure of effect (coefficient) for eQTL from GTEx which mapped to enhancers were
also included as features. Additionally, two binary variables recorded if the enhancer-gene link
consisted of the gene which was located nearest to the enhancer, and whether or not the
enhancer was located in one of the gene’s introns.
The results show that these models perform well both within and across cell types, even
from highly unrelated tissue types. These findings suggest that although the behavior and state
of individual enhancer-gene pairs are highly specific to the particular cell type at hand, the
characteristics of active enhancer regulation of target genes remains consistent across cell
types. This work also shows that it is thus possible to pool observations from multiple cell types
49
to make a larger training dataset as long as within-observation cell type specificity remains
consistent. Using a pooled model trained across multiple cell types, all possible gene-enhancer
regulatory links in cis (~17M) were scored and added to the PEREGRINE database
(www.peregrineproj.org).
Methods
Assembling the training dataset
Positive class datasets
Enhancer-gene links comprising the positive training and test datasets (true active
enhancer-gene links) were obtained from four different cancer cell lines. The training datasets
for HepG2 (liver) and HCT116 (colon) as well as the test dataset for MCF7 (breast) were
obtained through careful PubMed searches. The search terms included the names of the cell
lines and the word “enhancer” and terms related to enhancer-gene linking experiments, such as
“CRISPR” or “luciferase.” The abstracts of the search results were then read to ascertain if the
paper was relevant. If the paper seemed likely to include experimental evidence linking an
enhancer to at least one target gene in the desired cell type, the paper and any necessary
supplemental data were examined to determine if there was enough evidence (based on the
criteria in Table 3.1) to include the enhancer-gene link in the curated list. The training dataset for
a third cell type, K562 (myelogenous leukemia) was obtained from published CRISPRi-
FlowFISH experiments
79
. Positive enhancer-gene links were gathered from all results achieving
statistical significance (p<0.05) in experiments which perturbed enhancers within 450kb of 30
genes and measured the effects on gene expression. The test dataset for K562 was obtained
from published CRISPRi experiments
80
. Positive enhancer-gene links were gathered from all
results achieving statistical significance (FDR<0.05) in experiments where sgRNA-targeted
50
enhancers located within 1 Mb of target genes MYC and GATA1 were shown to significantly
alter their gene expression.
Criteria 1 CRISPR deletion of mutation of the enhancer results in statistically significantly
altered expression of the target gene.
Criteria 2 Luciferase plasmids with the target gene’s promoter were used to show that the
enhancer was important for activation of the gene. Transcription factors that help
upregulate expression of the target gene were shown to interact with the
enhancer. Mutating the transcription factor binding sites within the enhancer
resulted in statistically significantly altered target gene expression.
Criteria 3 SNPs located in an enhancer alter target gene transcription statistically
significantly in an allele-specific manner. The enhancer was shown to interact
physically with the target gene’s promoter. The SNPs were observed in GWAS to
be associated with differential expression of the target gene.
Criteria 4
A transcription factor known to be important to the expression of the target gene
is shown to bind to the enhancer. Knockdown experiments of the transcription
factor are shown to statistically significantly decrease the target gene’s
expression. SNPs located in an enhancer alter target gene transcription
statistically significantly in an allele-specific manner.
Criteria 5
The enhancer was shown to confer inducible expression of the target gene.
Binding motifs were found in the enhancer for a transcription factor that was
shown to greatly alter inducible expression in the presence of an expression
vector for the transcription factor. Deletions within the enhancer of these binding
motifs altered gene expression.
Table 3.1 Description of the positive set criteria. These are the different criteria that enable an enhancer-gene
link from the literature to be accepted into the positive class in the training dataset and used for analysis.
The enhancer coordinates from the above datasets were mapped to the enhancers in
the PEREGRINE enhancer set and the genes were recorded with their PANTHER long IDs.
Only enhancer-gene links with protein-coding genes (recorded by their PANTHER long IDs)
were considered for further analysis. PEREGRINE enhancers were mapped to an enhancer
from the literature if there was at least 33% overlap between their coordinates, which was
calculated using bedtools. For all enhancer-gene links that mapped to at least one PEREGRINE
enhancer, the enhancer-gene link was recorded using the PEREGRINE enhancer ID and the
gene’s PANTHER long ID. These enhancer-gene links made up the positive class of the training
or test dataset for each cell line (HepG2: n=23, HCT116: n=43, K562 training: n=60, K562 test:
n=6, MCF7: n=27).
51
Negative class datasets
Enhancer-gene links comprising the negative training and test datasets were gathered
based on the justification that physically inaccessible DNA could not contain enhancers and
genes that were actively engaged in transcriptional regulatory activity. Although there is reason
to believe that in Drosophila some genes are active in the heterochromatic state
81
, we can find
no evidence to show which genes this might be true for in human cells, if any. However, we
recognize the possibility that such genes may exist and therefore erroneously be included in the
negative set.
Additionally, there are areas of the genome for which euchromatin and heterochromatin
associated marks overlap, which have been documented to be enriched in imprinted genes
82
.
This raises the possibility of heterochromatin marks in certain parts of the genome coinciding
with active genes or enhancers, rendering ChIP-seq and other histone modification-targeting
assays potentially problematic. FAIRE-seq and DNase-seq assay for areas of the genome that
are physically exposed and vulnerable to digestion, offering a potentially more exact
representation of active versus inactive chromatin. Therefore, these DNA accessibility
experiments were utilized.
The bed files containing the peaks from ENCODE for DNase-seq and FAIRE-seq
experiments were examined. Bedtools was used to determine which enhancers from the
PEREGRINE set never overlapped with any of the peaks. Bedtools was also used to determine
which genes (plus a 2kb window on each end) never overlapped with any of the peaks.
Enhancers or genes that met these requirements were considered inaccessible. All of the
possible cis enhancer-gene links (the enhancer and gene could not be more than 1 Mb apart)
which included either an inaccessible enhancer and/or an inaccessible gene were classed as
negative. The negative class comprises these 15,491,123 enhancer-gene links in HepG2 data,
16,304,125 enhancer-gene links in HCT116 data, 13,205,263 enhancer-gene links in K562
data, and 16,215,821 enhancer-gene links in MCF7 data.
52
To generate the most useful training and test datasets, the negative class was randomly
sampled to only include cis enhancer-gene links targeting the genes found in enhancer-gene
links in the positive set. Negative enhancer-gene links included in the training and test sets were
required to target a gene that was found in a positive enhancer-gene link in the same cell type.
For each gene found in any enhancer-gene links in the positive class, 30 negative enhancer-
gene links targeting that same gene were randomly sampled from total the negative sets for
HepG2, HCT116, and MCF7 to comprise the negative class. For K562 training data, 400
enhancer-gene links were selected that were found to be inaccessible and failed to achieve
statistical significance from CRISPR experiments (p>0.10) which examined the effect that
enhancer deletion had on previously selected nearby genes’ expression levels. For K562 test
data, 1,303 enhancer-gene links were selected that failed to achieve statistical significance from
CRISPRi experiments (FDR>0.05) which examined the effect that enhancer perturbation had on
nearby MYC or GATA1 expression levels. It was then confirmed that none of them were in the
positive set. These enhancer-gene links make up the final negative class instances in the
training or test dataset for each cell type.
Figure 3.1 Negative enhancer-gene examples. The diagram illustrates how for each positive enhancer-gene
example (gene A:e4 marked by a blue line), ten enhancers targeting the gene from the positive example (gene A)
were randomly chosen from those in the complete set of negative examples to be used for the training set (Ten
enhancers connected to gene A by a red line). This was done so that the negative examples would come from a
cellular environment similar to the positive examples. This was only performed for HepG2 and HCT116 training
data, and MCF7 testing data, as the K562 training and testing data came from two CRISPR datasets that only
included negative and positive examples within the same ~1Mb long region, already ensuring that the negative
and positive examples came from the same genomic regions.
53
Joint training datasets
Joint training datasets were assembled using enhancer-gene links from more than one
cell type. Positively classed enhancer-gene links from multiple cell types were combined to
make the positive class of enhancer-gene links for joint training sets. The negative class for joint
training sets was assembled in the same way.
Features
Features classically relevant to enhancer activity on target genes were considered for
establishing a feature set for enhancer-gene link data. Features for an enhancer-gene link
include traits that are characteristic of the enhancer, the gene, and the enhancer and the gene
simultaneously. There are nine main features in each cell type specific dataset of 17,354,145
enhancer-gene links. They are described in Table 3.2. All continuous variables were
subsequently standardized. Since the eQTL features are based on GTEx’s cis-eQTL datasets,
only cis enhancer-gene links were included in the dataset, as defined by GTEx, which was 1 Mb
upstream or downstream of each gene’s transcription start site. The bedtools window command
was used to gather all enhancers within 1 Mb of each gene’s transcription start site. These
17,354,145 enhancer-gene links were then attributed features based on the datasets described
in Table 3.2.
Feature (Enhancer,
Gene, or Both)
Description Value
Feature 1: H3K27ac
(Enhancer)
A histone acetylation mark
commonly associated with active
enhancers
0/1 (binary for high value or not
high value of peak of ChIP-seq
peak)
Feature 2: H3K4me1
(Enhancer)
A histone methylation mark
commonly associated with active
enhancers
Positive continuous value (score
of the ChIP-seq peak)
Feature 3: H3K4me3
(Promoter)
A histone methylation mark
commonly associated with the
promoters of genes being actively
enhanced
0/1 (binary for high value or not
high value of peak of ChIP-seq
peak)
Feature 4: P300
binding (Enhancer)
Binding of P300, a histone acetyl
transferase well established as a
Positive continuous value (score
of the ChIP-seq peak)
54
marker of active enhancers
83
, to
the enhancer
Feature 5: eQTL—
combined Z score
eQTL p-values transformed into a
combined Z-score for each
enhancer with eQTL(s) pointing to
the same gene (measure of
statistical significance)
Positive continuous value
Feature 6: nearest
gene
Is the gene in this link the
enhancer’s nearest gene?
0/1 (binary, 0=no, 1=yes)
Feature 7: intronic Is the enhancer located in an
intron of the target gene?
0=enhancer not located in an
intron of the target gene
1=enhancer located in the intron
of the target gene
Feature 8: average of
the absolute values of
eQTL coefficients of
eQTL located within
the enhancer pointing
to the same gene
If multiple eQTL for the same
gene are located in the same
enhancer, what is the average
value of the absolute values of
the coefficients (measure of
effect) of these eQTL?
Positive continuous value
Feature 9: H3K27ac
(Promoter)
A histone acetylation mark
commonly associated with
promoters of genes being actively
enhanced
Positive continuous value (score
of the ChIP-seq peak)
Table 3.2 Description of the main effect features. All of the features listed are from ENCODE datasets from
HepG2, HCT116, K562, and MCF7 cells, with the exception of eQTL data which was only available from GTEx at
the tissue level (liver, colon, whole blood, and breast were used respectively) and P300 data which was
unavailable in HCT116. For any features involving overlap of a ChIP-seq peak signal (H3K27ac, H3K4me1,
H3K4me3, and P300 binding), overlap between the two features (enhancer/gene and peak) was required at a
minimum threshold of 50% overlap for either region.
Interaction terms were created between the features that were only for enhancers and
only for promoters (Table 3.3). This was in an effort to create features in the model that
represented the entire enhancer-gene link, rather than just the enhancer or just the gene.
Interaction terms were also considered that were the result of an interaction effect—the estimate
for a feature’s effect differing by levels of another feature in the model. This was clearly shown
by eQTL * eQTL average absolute coefficient (the interaction of the eQTL combined Z statistic,
a measure of significance, and the average of the absolute values of the eQTL coefficients, a
measure of effect size). Interactions between H3K4me1 and H3K27ac, two active enhancer
marks, were also examined.
55
Promoter marks
Enhancer
marks
H3K27ac H3K4me3
H3K27ac H3K27acenhancer *
H3K27acpromoter
H3K27acenhancer *
H3K4me3promoter
H3K4me1 H3K4me1enhancer *
H3K27acpromoter
H3K4me1enhancer *
H3K4me3promoter
H3K27ac,
H3K4me1
H3K27acenhancer *
H3K4me1enhancer *
H3K27acpromoter
H3K27acenhancer *
H3K4me1enhancer *
H3K4me3promoter
Table 3.3 Interaction terms between active enhancer and promoter marks. The columns are headed by
epigenetic histone marks commonly associated with active genes, while the rows are headed by epigenetic
histone marks commonly associated with active enhancers. The box at the intersection of a row and a column is
filled in with the interaction term which is a product of the active promoter mark described at the top of that column
and the active enhancer mark described at the beginning of that row.
To obtain general insights of how predictive each feature is on its own, univariate
analysis was performed on all features (including interaction terms) in 5-repeat 3-fold cross-
validated logistic regression.
Model selection and prediction performance
To select a predictive model that would be appropriate for this application (modestly
sized dataset with a binary outcome and relatively few features), the prediction performance of
each model with all features was examined. Random forest, flexible discriminant analysis, linear
discriminant analysis, boosting, ridge regression, k-nearest neighbor, and support vector
machines with Gaussian radial, polynomial, linear, hyperbolic tangent, Laplace radial, Bessel,
and ANOVA radial basis kernels were all evaluated in terms of the AUC, AUPRC, precision and
recall levels they achieved on MCF7 and K562 test sets. Models were also evaluated on their
prediction performance using other complementary training sets from HepG2, HCT116, and
K562. No model was ever evaluated using test data that was present during the training of that
model.
To select the final model used for all enhancer-gene link scoring, all models trained on
all cell types and joint datasets were evaluated based on their prediction performance in MCF7
and K562 test sets as well as training sets from any other cell types that were not used in the
56
training of that model. In order to choose a classification threshold that resulted in the best
balance of precision and recall, the F-measure that summarized the harmonic mean of both
measures was optimized.
𝐹 𝑚𝑒𝑎𝑠𝑢𝑟𝑒 =
2 ∗ 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑟𝑒𝑐𝑎𝑙𝑙 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙
The classification threshold corresponding to the maximum F-measure was chosen for
each model and the precision and recall at that classification threshold was evaluated in order to
choose the best model for scoring new enhancer-gene links (Fig 3.2). In cases where there was
more than one model with nearly identical precision and recall values at the classification
threshold, the model with the largest AUPRC was selected (Fig 3.3). The best model was
chosen based on the highest precision and recall values across all test sets. Since the HCT116
dataset did not have the P300 feature, an alternative best model was also developed for use in
any dataset missing the P300 feature.
Figure 3.2 The precision recall curve. This figure shows the precision-recall curve for a random forest model
trained in HepG2 data and tested on MCF7 data. The x-axis shows the recall values while the y-axis shows the
precision values that the model achieves at every different classification threshold. The yellow dot demarcates the
classification threshold at which the F-measure is maximized.
57
Model kernel AUROC AUPRC Best precision Classification
cutoff
Best recall
Figure 3.3 Evaluating prediction performance. This is the table of prediction performance values corresponding to the
models trained on HepG2 data and tested on K562 data. The “best” values for precision and recall are the values that
correspond to the classification cutoff which was chosen by maximizing the F-measure. For each model, the AUROC and
AUPRC is recorded, as well as the classification cutoff with the largest F-measure and the precision and recall levels at
that point.
Scoring new enhancer-gene links
All cis links (n=17,354,145) for which complete features data was available were scored
with the HepG2+K562-trained final model in a cell type-specific manner. All cis links in cell types
(e.g., HCT116) missing only the P300 feature were scored with the K562-trained final model in a
cell type-specific manner. Of these, the scores of all cis PEREGRINE enhancer-gene links
(n=880,946) were examined. Student’s t-test was used to evaluate the difference in means of
the scores of PEREGRINE enhancer-gene links compared to all cis enhancer-gene links in all
cell types.
58
Data availability
The complete set of scored cis enhancer-gene links for all cell types are available in bulk
data download file on the PEREGRINE website (www.peregrineproj.org). Cell type-specific
scores for PEREGRINE enhancer-gene links will also be integrated into the PANTHER web
interface for querying PEREGRINE enhancer-gene link information (www.pantherdb.org).
Results
Feature selection
Characteristics that are classically associated with active enhancer regulation of target
genes were gathered to be used as features (variables) in the predictive models. Thus, each
observation (enhancer-gene regulatory link) has values for each feature that are then used by
the models to predict the outcome (whether or not the enhancer-gene pair actually has a
regulatory relationship in the given cell type). Features for an enhancer-gene link include
features that are classic hallmarks of the active state of the enhancer, the gene, and the
enhancer and the gene simultaneously. There are nine main features in the cell type-specific
datasets of cis enhancer-gene links. They are described in Table 3.2. Interaction terms were
created between the features that were only for enhancers and only for promoters (see
Methods). This was in an effort to create features in the model that represented the entire
enhancer-gene link, rather than just the enhancer or just the gene. This resulted in up to 17
features for each cell type-specific enhancer-gene link dataset.
Model generation
In order to select the model with the best prediction performance, cell type-specific
training sets in K562, HepG2, and HCT116 were each trained using the full feature set in
59
random forest, flexible discriminant analysis, linear discriminant analysis, boosting, ridge
regression, k-nearest neighbor, and support vector machine models with Gaussian radial,
polynomial, linear, hyperbolic tangent, Laplace radial, Bessel, and ANOVA radial basis kernels.
Each cell type-specific trained model was evaluated in terms of the AUC, AUPRC, precision and
recall levels they achieved on MCF7 and K562 test sets. Models were also evaluated on their
prediction performance using other complementary training sets from HepG2, HCT116, and
K562 as test sets as long as they were not present in the training phase.
Interestingly, models that were trained in one cell type were found to perform quite well
when presented test data from a completely unrelated cell type (Fig 3.4), suggesting that
although the specific activities of individual enhancers are very cell type-specific due to the
sometimes dramatic differences in cellular environment from one cell type to another, the overall
patterns of the relevant features of enhancer regulatory activity remain constant across cell
types. Thus, the possible benefit to be had from pooling enhancer-gene link data from more
than one cell type for training a jointly trained model to score enhancer-gene links on new data
from a completely unrelated cell type was explored.
60
a.
b.
c.
d.
Figure 3.4 Precision-recall curves of the best performing models on test datasets. The precision-recall curve
(PRC) is given for each model and all test sets available for that model. The performance at the classification
cutoff point is represented by a black dot on each curve. a. PRCs for the best HepG2-trained model (support
vector machine with a linear kernel) on the K562 and MCF7 test datasets as well as the K562 training set used as
a test set here. b. PRCs for the best K562-trained model (support vector machine with a linear kernel) on the K562
and MCF7 test datasets as well as the HepG2 training set used as a test set here. c. PRCs for the best final model
(support vector machine with a linear kernel) which was trained on the HepG2+K562 joint dataset evaluated on the
K562 and MCF7 test datasets. d. PRCs for the best alternate final model (support vector machine with an ANOVA
radial basis kernel) which was trained on the K562 training set and tested on the K562 and MCF7 test datasets as
well as the HepG2 and HCT116 training sets used as test sets here. This model was based on training data that
61
did not include the P300 feature and is only suitable for scoring enhancer-gene links for which P300 data is
unavailable.
This approach requires only that all data for each instance come from the same cell
type; it does not require every instance in the dataset to come from the same cell type as was
the case with previous single cell-trained analyses. Such a model is trained by pooling the data
from multiple cell types to predict whether an enhancer-gene link is true in a given cell type as
long as the features characterizing each observation are drawn from the same cell type. This
pooled model benefits from added robustness due to a larger training dataset from pooled data
from multiple cell types rather than relying on the more limited number of instances comprising
the training sets in single cell types.
This is not to suggest that cell type specificity is no longer a vital consideration in the
dynamics between enhancers and their target genes. Rather, the features from each instance
will need to be cell type-matched to ensure that the state of an enhancer in one cell type (i.e., its
H3K27ac peak or lack thereof) is not erroneously used to infer targeting of a gene based on
data in another cell type (i.e., it would be inappropriate to examine the interaction between the
enhancer’s H3K27ac peak score in HCT116 cells and the promoter’s H3K4me3 peak score in
HepG2 cells). These pooled models take feature information on an experimentally validated
enhancer-gene link in a given cell type and draw on that cell type-specific information to train
the model on how to classify new enhancer-gene links in the positive class. However, the
instances in the positive class are allowed to comprise experimentally validated enhancer-gene
links from more than one cell type, thus pooling information from across multiple cell types to
build a more robust predictive model.
Prediction performance
There are many useful metrics by which to assess the prediction performance of
different models. Which metric is the best metric to use depends on the application. In this
62
application, there are two classes which need to be classified based on a classification cutoff
which serves as a cut point for observations. Those which score higher than the cutoff are
classified as part of the positive class, while those which score below the cutoff are classified as
members of the negative class. Once classification occurs, it is possible to gauge the prediction
performance as a function of how many errors the model makes in classification, but not all
errors are necessarily considered equal. Due to the fact that the real size of these two classes
(true enhancer-gene regulatory partners and enhancer-gene pairs without a regulatory
relationship in a given cell type) is severely imbalanced, it is necessary to consider which class
the misclassification errors occur in. In this case, it is far preferable to have a false negative than
a false positive, as false positives are much more likely to result in significant waste in time,
money, and energy on the part of bench scientists seeking to further characterize these
predictions, not to mention any incorrect scientific assumptions based on wrongly predicted
enhancer-gene regulatory relationships. Unfortunately, true enhancer-gene links are vanishingly
rare compared to enhancer-gene pairs with no regulatory relationship (In the K562 test set of
1,309 observations, the positive class made up less than 0.5% of total observations), so it is
very possible to have relatively few misclassification errors and still generate many false
positives. Therefore, focus was placed on the misclassification errors involving the positive class
to select the model with the fewest misclassification errors in that class.
Precision is the proportion of predicted positive enhancer-gene links that actually are
true. Recall is the proportion of actual enhancer-gene links with a regulatory relationship that
were correctly classed by the model as true. It is important to select a classification cutoff point
that results in the highest possible precision without coming at the expense of too low a recall
value, and vice versa. By optimizing the F-measure that summarizes the harmonic mean of both
measures, classification cutoffs were chosen for each model at the point with the highest F-
measure. Using this measure of accuracy results in the best balance of precision and recall for
each model.
63
Once the classification cutoff was established for each model, the corresponding
precision and recall were measured for every test set (MCF7 test set, K562 test set, and any
training data from other cell types used only to train other models). For the purpose of selecting
a final model suitable for predicting new enhancer-gene links, these were the metrics of
prediction performance utilized.
Model selection
Since models trained in one cell type were shown to be capable of excellent prediction
performance in test data from other cell types (Fig 3.4), the final model chosen to be used for
predicting new enhancer-gene links was chosen not only based on how well it predicted, but
how consistently it predicted well across all available test sets. Any model that performed poorly
in at least one test set was not considered suitable for selection as the final model. The support
vector machine models with linear kernels were the highest performing models when trained in
HepG2 (Fig 3.4A) and K562 (Fig 3.4B), and these models performed consistently well across all
cell types. Unsurprisingly, among the jointly trained models, the models trained in the
K562+HepG2 joint dataset performed the best across all test sets in a consistent manner (Fig
3.4C). Therefore, the highest performing K562+HepG2 trained model (ridge regression) was
selected as the final model to score enhancer-gene links.
Many popular cell lines contain nearly the full set of features needed for the final model,
but they are missing ChIP-seq data targeting histone acetyltransferase P300, rendering the
K562+HepG2 final model unusable. This was true for the HCT116 training data, so an alternate
final model was developed in an analogous manner for cell types missing P300 data to score
cell lines for which no P300 ChIP-seq data is available. K562 and HepG2 training sets excluding
the P300 feature were generated for this process. Although these models generally performed
worse than the models which were able to utilize the P300 feature, it is still important to be able
to score enhancer-gene links without P300 data. The support vector machine model with an
64
ANOVA radial basis kernel trained using K562 data performed the best across all test sets and
was selected as the alternate final model for use in scoring enhancer-gene links in cell lines for
which no P300 data is available (Fig 3.4D).
In order to make the cell typeispecific scores more interpretable across cell types, the
cell type-specific Z-score is provided for each cis enhancer-gene link in addition to the raw score
[0,1) which should only be used to compare enhancer-gene links scored in the same cell type.
Although this analysis shows that a higher cell type-specific score represents a higher chance of
the enhancer-gene link representing a true regulatory relationship in that cell type, this score
should not be interpreted as a probability. Cell type-specific scores and Z scores for all cis
enhancer-gene regulatory links are available at the PEREGRINE website for bulk download
(www.peregrineproj.org) in four cell types, with more on the way. The cell type-specific Z scores
for PEREGRINE enhancer-gene links will soon be available in the new Enhancer module of the
PANTHER website (www.pantherdb.org) where the user may easily access the supporting
evidence for each PEREGRINE enhancer-gene link, as well as query by target gene and
enhancer location.
PEREGRINE enhancer-gene links have higher cell type-specific scores than other
possible enhancer-gene links in cis
The enhancer-gene link database PEREGRINE, a genome-wide prediction of enhancer
to gene relationships supported by experimental evidence was recently published. These
predicted links were based on publicly available experimental data from ChIA-PET, eQTL, and
Hi-C assays across 78 cell and tissue types. Unfortunately, few of the experiments were
available in the same cell or tissue types as the other assays, making it difficult to amass cell
type-matched evidence for cell type-specific predicted enhancer-gene links. Although p-values
were reported for experimental evidence whenever available, PEREGRINE lacked a statistically
validated cell type-specific score based on gold standard data for its predicted enhancer-gene
65
links, a common shortcoming among enhancer-gene link prediction databases. Using data for
all cis enhancer-gene links, cell type-specific scores generated from the final model (if all
features were available) or the alternate final model (if P300 data was unavailable) were
attributed to all 17M possible enhancer-gene link pairs. The distribution of the scores for
enhancer-gene links found in PEREGRINE was compared to the distribution of the scores for all
other possible cis enhancer-gene links not reported in PEREGRINE (Fig 3.5). The scores
among PEREGRINE links were markedly higher than the scores among the remaining cis
enhancer-gene links that were not reported in PEREGRINE in all cell types. The probability
density plots for non-PEREGRINE links have much higher peaks near zero than the distribution
for PEREGRINE link scores. Welch’s two-sample t-tests show that for all cell types, the means
of PEREGRINE cis enhancer-gene links are statistically significantly different than the means of
the non-PEREGRINE cis enhancer-gene links (p< 2.2e-16), and in all cell types the score
means are higher for PEREGRINE enhancer-gene links. These findings suggest that enhancer-
gene links predicted by PEREGRINE have a much higher chance of representing real
enhancer-gene regulatory relationships than enhancers and genes located less than 1 Mb apart
but not predicted by PEREGRINE.
66
Figure 3.5 Predicted cell type-specific scores for all possible enhancer-gene links in cis. All possible cis
enhancer-gene links are separated by enhancer-gene links predicted in PEREGRINE, and all other possible
enhancer-gene links that were not predicted in PEREGRINE (non-PEREGRINE) for all cell types. The probability
density of the scores in four cell types of the PEREGRINE links in cis versus the distribution of the scores of the
non-PEREGRINE links in cis.
67
Discussion
Using the cell type-specific enhancer-gene link scores
Cell type-specific enhancer-gene link scores can be accessed in various ways. On the
PANTHER website, a user may upload a VCF file with SNPs of interest and return a list of
genes that mapped to the rsIDs provided by the user. The gene list provides information on
each gene, including any enhancers that were linked according to PEREGRINE data. The user
may alternatively wish to map the SNPs of interest directly to any PEREGRINE enhancers,
which will also yield a list of genes that the mapped enhancers were linked to. The cell type-
specific scores for these links are available in 4 cell lines available for download from the
PEREGRINE website, allowing the user to select the cell type-specific score for the cell line
most relevant to their research. For instance, researchers interested in colorectal cancer (CRC)
may be interested in rs58920878, a SNP shown to be statistically significantly (p=0.0035)
associated with increased disease risk (OR: 1.49)
84
. This SNP maps to just one enhancer in the
PEREGRINE set, EH37E0467415, located in the intron of SMAD7. SMAD7 is a negative
feedback regulator of the TGFβ signaling pathway, a pathway containing multiple genes known
to be involved in the progression to CRC
85
. Changes in SMAD7 expression levels have also
been shown to influence the progression of CRC
86
. Further, the silencing of SMAD7 using
antisense RNA inhibits the proliferation of CRC cell lines both in vitro and in vivo after
transplantation into immunodeficient mice
87
. The EH37E0467415 enhancer contains this SNP.
The Z-score for the EH37E0467415-SMAD7 enhancer-gene link in colorectal carcinoma cell line
HCT116 is a formidable 4.03, a Z-score corresponding to the top 1.90% of possible enhancer-
gene pairs in this cell type. Indeed, four functional SNPs (rs6507874, rs6507875, rs8085824
and rs58920878) contained within this enhancer have demonstrated allele-specific enhancer
activity in HCT116, correlate with increased expression of SMAD7 in normal colon epithelial
68
tissues, and were located within sequences that bound to nuclear proteins from CRC cell lines
in an allele-specific manner in electrophoretic mobility shift assays
88
.
Cyclin D1 (CCND1) is an important oncogene that is vital for cell-cycle progression and
is thought to have an important role in breast cancer and other tumors. It is overexpressed in
over 50% of breast cancer tumors
89
. The enhancer EH37E0225350 is located 126 kb upstream
of CCND1 in an intergenic region. The cell type-specific Z-score for the EH37E0225350-CCND1
enhancer-gene link is a staggering 10.0 (corresponding to the top 0.13% of all possible
enhancer-gene links) in the human adenocarcinoma cell line MCF7 which is often used to study
breast cancer. This enhancer is located in a hotspot for ERα binding in MCF7 cells and interacts
with the promoter of CCND1
90
. MCF7 cells treated with CRISPR-mediated deletion of this
enhancer showed dramatically reduced eRNA expression and complete abolishment of CCND1
mRNA activation
91
. CCND1 and EH37E0225350 endogenous expression was shown to be
dependent on estrogen signaling, indicating that the active EH37E0225350 enhancer may be
necessary for the activation of CCND1 expression by estrogen in breast cancer cells
91
. Notably,
the EH37E0225350-CCND1 enhancer-gene link scored much lower in other cell lines unrelated
to breast cancer, such as HepG2 (Z-score: 0.77), HCT116 (Z-score: -0.36), and K562 (Z-score:
-0.36), suggesting that the model has good ability to discriminate between different cellular
environments for the same enhancer-gene pair.
Score interpretation
Although the goal is to predict which enhancer-gene pairs have a true active regulatory
relationship in a specific cell type, there is always some degree of uncertainty with even the
most carefully constructed predictions. Thus, rather than providing a classification of true/false
for potential enhancer-gene regulatory relationships, each enhancer-gene link is provided a cell
type-specific score by the final model. Although this score is bounded by [0,1), it should not be
interpreted as a probability, which has an absolute interpretation regardless of cell type
69
environment. The results of the validation analysis described here show that the higher the
score for an enhancer-gene link, the more likely it is to have a real regulatory relationship.
However, the scores are relative only with regard to the enhancer-gene link scores from the
same cell type, and not to the enhancer-gene link scores for another cell type. Each cell type
yields its own scoring distribution, which makes it inappropriate to consider the same
classification cutoff for scores generated with different cell type-specific data. A cell type-specific
score of 0.42 may be low in one cell type’s scoring distribution, but quite high for another cell
type distribution. Therefore, it is not suitable to compare raw cell type-specific scores outside of
the same cell type. Instead, the Z-score (a measure of how many standard deviations a score is
from the mean of all scores in that cell type) of the cell type-specific scores form a better basis
for comparing the confidence of enhancer-gene links scored in different cell types. Thus, the Z-
score is provided for every enhancer-gene link to make the scoring system more interpretable
for the investigator.
Perspectives
PEACOCK is a simple yet effective approach to providing cell type-specific scores to
predicted enhancer-gene links across many widely studied cell lines. It requires the data from
only a few common assays to generate features for all possible enhancer-gene links made up of
enhancers and genes <1 Mb apart. These scores represent a way for investigators to evaluate
predicted enhancer-gene links from any database or collection of enhancer-gene predictions in
cell types of interest. Although enhancer-gene prediction databases such as PEREGRINE are
based on experimental evidence, it is clear based on the distribution of the cell type-specific
scores for all links predicted in PEREGRINE (Fig 3.5) that many of the enhancer-gene links are
likely not active in certain cell types, demonstrating the need for cell type-specific scoring when
incorporating predicted enhancer-gene regulatory relationships for any application. By utilizing
PEACOCK scores for cell types of interest, investigators could benefit enormously from being
70
able to contextualize predicted enhancer-gene regulatory relationships in disease-relevant
settings.
Conclusions
Enhancers play a vital role in orchestrating the extremely complicated system of gene
expression necessary for human development and continued maintenance of our differentiated
cell populations. When normal enhancer regulation of genes is disrupted, disease can follow as
a result. In fact, enhancers have been associated with many diseases, including many types of
cancer. Despite their importance, relatively few of the cell type-specific regulatory relationships
between enhancers and their target genes have been identified. There have been many efforts
at high throughput predictions of enhancer-gene regulatory links but a statistically validated
scoring system based on gold standard data in a cell type-specific context has been difficult to
accomplish. Here are described the basis for attributing cell type-specific scores for over 17
million enhancer-gene pairs generated by a ridge regression final model trained on
experimentally validated enhancer-gene links curated from peer-reviewed scientific articles. For
cell types missing P300 ChIP-seq data, cell type-specific scores were generated by a support
vector machine alternate final model. These scores are available in four cell types, accessible in
bulk download format from www.peregrineproj.org. Enhancer-gene links previously predicted in
the PEREGRINE database and incorporated in the PANTHER Classification System
(www.pantherdb.org) for gene and variant search will also have cell type-specific scores
attributed to them in the same cell types on the PANTHER platform. These cell type-specific
scores will allow researchers to evaluate the quality of predicted enhancer-gene links in a more
systematic way and to better harness the knowledge regarding enhancer-gene regulatory
relationships. This will provide the research community with an accessible tool for
contextualizing disease-associated variants located within enhancers, facilitating further
investigation of the role of individual enhancers in genetic diseases.
71
Chapter 4
Utilizing the PEACOCK Cell Type-Specific Enhancer-Gene
Link Scores for Gene Set Enrichment Analysis in GWAS
Data: A Proof of Concept Application
Introduction
A pooled machine learning model that would identify high-quality predictions of
enhancer-gene regulatory relationships across cell types is a potentially high impact finding that
can aid the study of enhancers in the context of disease research considerably. It could also
quicken the enhancer-gene regulatory relationship validation process in many laboratories by
informing bench scientists which enhancer-gene regulatory relationships are most likely to be
functionally active in their cell type of interest, allowing them to prioritize enhancer-gene
regulatory links of interest based on the information the model provides. Enhancer-gene
regulatory links with predicted probabilities could also be readily incorporated into genomic
statistical analyses, such as genome-wide association studies (GWAS), helping to unravel the
genetic processes that contribute to disease pathology.
In this vein, an initial application of the cell type-specific enhancer-gene link scores
generated by PEACOCK was performed by incorporating HCT116-specific enhancer-gene link
scores into the existing framework of gene set enrichment analysis offered on the PANTHER
web interface. Since most of the disease-associated variants found in GWAS are located in
non-coding regions, making use of the variants in enhancers with predicted regulatory links to
genes may provide a way to contextualize signals of statistical significance found in non-coding
regions. If the disease-associated signals of variants mapped to enhancers could be used to
inform the disease-associated signal of their predicted target genes in existing approaches
72
where groups of genes (gene sets) serve as the unit of analysis for their relevance to disease,
statistical methods could more accurately reflect the complexities of upstream effectors of gene
function such as enhancers. The justification for this on a biological level requires that we
consider that an enhancer’s function is primarily its transcriptional regulation of its target genes,
and that a disease-association signal which includes variants mapping to enhancers is likely to
be due to that enhancer’s function as it pertains to its role in cell type-specific transcriptional
regulation of its target genes. Therefore, it seems reasonable to incorporate the disease-
association signal of variants mapped to enhancers which are thought to regulate specific target
genes with some quantifiable amount of confidence.
Gene set enrichment analysis is a statistical approach initially developed for use in gene
expression analysis using large amounts of data produced by high-throughput technologies
such as DNA microarrays and RNA-Seq
92
. The motivation for this approach was the need to
address several shortcomings involved with using single genes as the unit of analysis. With the
advent of high dimensional data, large numbers of single gene tests mean that increasingly
severe adjustments for multiple testing corrections take place. This can easily result in subtle
differences in gene expression being swallowed up in the naturally occurring biological
variability of samples
93, 94
. However, when using a set of genes that are biologically related (e.g.,
involved in the same biological process) as a unit of analysis, it may be easier to detect fine but
consistent changes in gene expression patterns
92
. Additionally, large numbers of genes are
believed to be involved in multiple biological activities. A genome-wide analysis of multi-
functional genes published in 2015 reported that 26% of annotated genes in human are multi-
functional
95
. This makes interpretation of differential expression difficult in single gene analysis
when investigating genes that have multiple functions, but when genes are viewed in the
context of their biological processes through gene set analysis, it may aid in avoiding false or
ambiguous conclusions
92
. Indeed, analyzing sets of biologically related genes is the most
intuitive and biologically relevant approach to reducing dimensionality in high-throughput
73
studies, where long lists of hundreds or thousands of differentially expressed genes can be
difficult for investigators to interpret.
Gene set enrichment analysis can also be performed on variant data, such as the output
resulting from genome-wide association studies
96
. GWAS are used to detect associations
between single variants and the phenotype of interest, often a certain disease state (e.g., having
breast cancer or not having breast cancer) or biological trait (e.g., height)
97
. For complex traits,
the effect sizes of GWAS-associated variants are typically quite modest
98
. This can be best
understood in the context of evolutionary selective pressures. Variants with small effect on
disease risk are able to become common because variants with large effect on disease risk
would likely be reduced in the population by natural selection
97
. Thus the commonly observed
situation of GWAS yielding many small effect variants with few or no large effect variants tends
not to be resolved by increased sample sizes
99
. The small effect size of most GWAS-associated
variants can make them difficult to identify and to then biologically interpret
99
. Gene set
enrichment analysis offers a solution to both challenges by creating the potential to detect
associations even when individual variants fail to reach the severe statistical significance levels
required after multiple testing corrections for millions of SNPs and by incorporating a framework
of biological interpretability (utilizing gene sets that contain biologically related genes) baked
right into the statistical analysis
100
.
The vast majority of disease-associated variants identified through GWAS map to non-
coding regions of the genome
101
. Indeed, it has been estimated that only 11% of GWAS-
associated variants lie in coding regions
102
. Therefore, most disease-associated variants likely
are involved in activities that modulate the expression of genes relevant to the disease. Among
these, there are undoubtedly many regulatory elements including enhancers. All that remains is
to find a way to biologically interpret these disease-associated signals.
Just as gene set enrichment uses biologically related sets of genes (e.g., through shared
biological processes or previously annotated protein pathways) to more readily interpret the role
74
of GWAS-associated variants mapping to genes, enhancer variants could be incorporated into
those same gene sets based on their biological relevance to the genes in the gene set, perhaps
through a highly scored regulatory relationship predicted to exist with one of the genes in a
disease-relevant tissue or cell line.
Using this approach, variant data from a GWAS for colorectal cancer was applied to the
HCT116-specific scores for predicted enhancer-gene links generated in PEACOCK. After
incorporating the disease-associated signal of the variants mapped to enhancers with predicted
target genes, a gene list was uploaded into the PANTHER website for enrichment analysis
using Reactome pathways and GO biological processes. Resulting output indicated that
incorporating enhancer variant information using PEACOCK scores in the HCT116 human
colorectal carcinoma cell line showed enrichment for pathways shown to be extremely related to
colorectal cancer disease function and progression. This analysis serves as a proof of concept
for utilizing PEACOCK scores as additional information to be used in existing statistical analysis
of GWAS data.
Methods
Gene set enrichment analysis
The PANTHER website currently makes available a tool for statistical enrichment that
employs the Mann-Whitney rank-sum test to examine a list of genes and numerical data values
for them and determine “whether any ontology class or pathway has numeric values that are
nonrandomly distributed with respect to the entire list of values.”
21
Box 4 is provided in Figure
4.1 taken from the paper announcing the method’s integration into PANTHER for clarity. The
gene list and adjusted p-values were uploaded to the PANTHER website where the “statistical
enrichment test” was selected for analysis in Homo Sapiens. Annotation sets selected to serve
as gene sets were GO biological processes and Reactome pathways.
75
Figure 4.1 Description of the Mann-Whitney rank-sum test used for enrichment analysis in PANTHER. This
figure was taken from Box 4 of the protocol paper published in 2013
21
.
GWAS data
While the numerical data serving as input for this test can be the output from several
different experiments, this analysis used the published p-values from the summary statistics of a
GWAS examining colorectal cancer
103
. This GWAS was based on data from the UK Biobank
using 4,562 cases and 382,756 controls to examine the binary trait of colorectal cancer
(PheCode 153)
104-106
.
Incorporation of PEACOCK enhancer-gene regulatory link scores
The variants provided in the GWAS summary data were mapped to enhancers and
genes using bedtools. Variants were mapped to genes if they were located within the gene body
or within a 2kb flanking region on each end of the gene body. Variants were mapped to
enhancers if they were located within the genomic boundaries of the enhancer (Fig 4.2).
76
Figure 4.2 Mapping and weighting GWAS variants. Blue triangles represent GWAS variants and their genomic
location in relation to gene C (magenta flag) and enhancer 8 (teal pentagon). The 2kb boundary is marked on
either side of gene C with a magenta bar. Blue triangles located either within the gene body of gene C or within the
2kb flanking region are attributed to gene C with a weight of 1. The variants that mapped to enhancer 8 is
attributed to any predicted target genes of enhancer 8 with a weight of F(score) where F() is the empirical CDF of
all enhancer-gene link scores in the cell type that the scores come from and “score” represents the score of each
enhancer-gene link involving enhancer 8. Variants that did not fall within 2 kb of a gene or directly within the
boundaries of an enhancer were not attributed to any gene and therefore were not assigned a weight.
All variants that mapped to an enhancer or gene were attributed to a gene. If the variant
mapped directly to a gene, the variant was attributed to that gene. If the variant mapped to an
enhancer, the variant was attributed to all predicted target genes for that enhancer in HCT116.
Each variant attributed with a gene was then assigned a weight for this attribution. For
variants that mapped directly to genes and were thereby attributed to that gene, the assigned
weight was 1. For variants that were attributed a gene because they mapped to an enhancer
that PEACOCK predicted a regulatory relationship with that gene in HCT116, the assigned
weight was F(x) were F() is the empirical CDF of the distribution of the scores for all enhancer-
gene regulatory predictions in HCT116 (n=17M) and x is the score of the enhancer-gene
regulatory link which formed the basis for attributing the enhancer-mapped variant to the target
gene.
The weight serves as an approximation of the confidence with which a predicted
regulatory relationship between a given enhancer and gene is viewed. Since all predicted
enhancer-gene regulatory links in PEACOCK include all enhancer-gene pairs located <1 Mb
apart, it would be disastrous to apply the association signal from variants mapped to enhancers
to their predicted target genes without taking into account the quality of the prediction (i.e.,
without weighting based on the score). However, because the distribution of all PEACOCK
scores includes every possible enhancer-gene pair (a complete population), the empirical CDF
77
may function as an approximation of a measure of confidence for a given enhancer-gene link
since previous analysis (Chapter 3) showed that true enhancer-gene regulatory relationships
are more likely to have higher PEACOCK scores than enhancer-gene pairs without a regulatory
relationship. Thus, the p-value of the variant mapped to an enhancer is weighted against the
confidence of the predicted enhancer-gene regulatory link as approximated by the empirical
CDF of the scoring distribution.
The p-value for each variant attributed to a gene was adjusted by the weight of the
confidence of its association to the gene. For variants that mapped directly to a gene, that
weight was 1, because there is no inference needed to attribute the variant to the gene it is
located within. For variants that were attributed to a gene based on a predicted enhancer-gene
regulatory link, the weight was the value of the CDF of the distribution of all scores at the value
of the score for this enhancer-gene link. The weights were used to adjust the p-values of the
variants as shown below:
𝑥 𝑖𝑗
=
𝑣 𝑗 𝑤 i
{
𝑤 𝑖 = 1, 𝑣 𝑗 mapped to gene
𝑤 𝑖 = 𝐹 (𝑠 𝑖 ), 𝑣 𝑗 mapped to enhancer
where v j represents the GWAS p-value for the j
th
variant, w i represents the weight of the i
th
gene
attributed to v j, F() represents the empirical CDF function of the distribution of the HCT116 scores,
s i represents the score of the i
th
enhancer-gene regulatory link, and x ij represents the i
th
enhancer-
gene link score-adjusted p-value of the j
th
variant. Thus, for each variant that maps to an enhancer,
there are multiple x for each variant p-value v j; one for each of i genes that its mapped enhancer
is linked to. Since the denominator F(s i) is larger for higher scoring predictions, v j is transformed
into a larger x ij for weak enhancer-gene regulatory link predictions, is penalized to a lesser degree
for the highest scoring enhancer-gene regulatory predictions, and penalized not at all for variants
that map directly to genes. The smaller x ij is, the more significant the variant and the less degraded
that signal is by the score of any involved enhancer-gene regulatory prediction.
78
The purpose of adjusting the p-value of the variant by the score of the predicted
enhancer-gene regulatory link is to render it very unlikely that a variant of even very high
statistical significance comes to be the representative variant for a gene which it is only weakly
linked to. For each gene, only one variant is chosen to represent that gene in the gene set
enrichment analysis. The representative variant is the one with the smallest x ij out of all v j
attributed to that gene. Once the smallest xij was assigned to the gene it represents, an
adjustment was made for the effect of the presence of multiple variants.
Adjusting for multiple variants
Some genes are very large and therefore are more likely to have larger number of
variants mapping within the gene body. Additionally, some genes are located within 1 Mb of
many enhancers, while other genes may be located near only a few. As a result, it is more likely
that a larger gene in an enhancer-dense region will have a smaller x ij due to random chance. In
order to adjust for this effect, the distribution of the sum of the weighted number of variants for
all genes and their log-transformed x ij values were plotted (Fig 4.3).
79
Figure 4.3 The distribution of the score-adjusted weighted variant p-values for all genes for which any
variants were attributed to. Each dot represents a gene. The x-axis shows the sum of the weighted number of
variants attributed to each gene. The y-axis shows the score-adjusted weight variant p-values on the negative log
scale. A LOWESS curve is plotted in blue, and a linear regression line is plotted in red. The purple line shows the
cutoff for statistical significance using a Bonferroni correction for the total number of variants (n=28,146,007)
reported in the GWAS summary statistics. The green line shows the cutoff for statistical significance using the
common significance correction threshold of 5x10E-8.
Just as the p-value of each enhancer-mapped variant attributed to a gene was weighted
by the quality of the enhancer-gene regulatory prediction informing its attribution to the putative
target gene, the number of variants attributed to a gene were also weighted. The rationale for
this was that if a gene is attributed many weakly scored predicted enhancer variants, it should
be penalized less for each of them since the p-value for each was highly penalized. Thus,
80
instead of simply summing the number of variants attributed to each gene to make the multiple-
variant adjustment, the weight of each variant was summed for all variants attributed to each
gene. Thus, gene-mapped variants with a weight of 1 and enhancer-mapped variants attributed
to genes on the basis of very high scoring enhancer-gene link predictions contributed the
highest weight to the sum of each gene’s variant weight total.
Once the sum of the weights of all variants attributed to each gene was plotted against
the representative score-adjusted weighted p-value for each gene, a locally weighted scatterplot
smoothing (LOWESS) curve was fitted. This function served as the source of expected values
that a gene might be expected to have in terms of score-adjusted weight variant p-value for a
given sum of weights of total attributed variants. The adjustment for multiple variants was then
made accordingly:
𝑦 𝑖 = − ln(𝑥 𝑎𝑑𝑗𝑢𝑠𝑡𝑒𝑑 ,𝑖 ) = −ln (𝑥 𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑 ,𝑖 ) − ln (𝑥 𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑 ,𝑖 )
Then the value y i for each of i genes was transformed out of the negative log scale to the value
p i which was then ready to serve as input for the gene set enrichment analysis.
𝑝 𝑖 = 𝑒 −𝑦 𝑖
A list of i genes was constructed as the final output, each with the value p i, representing
the PEACOCK score- and multiple variant-adjusted p-value from the variant chosen to
represent the gene. This file served as input for the gene set enrichment analysis performed
using the PANTHER website.
Results
The effects of including enhancer information on the distribution of genes’ best variants
In order to compare the effect of including enhancer regulatory predictions in gene set
enrichment analysis, a comparison was made to gene set enrichment analysis performed
without the inclusion of enhancer information. Only variants which were mapped directly to
81
genes were included in the analysis, resulting in a new distribution of log-transformed p-values
from the best variants versus the total number of variants mapped to each gene (Fig 4.4). The
number of genes with any attributed variants dropped slightly from 18,616 genes attributed
variant information when incorporating predicted enhancer regulatory information to 18,093
genes with mapped variants without taking enhancer information into account. Nevertheless, the
mean weight-adjusted p-value of the best variant for each gene in the negative log scale
dropped from 8.08 to 5.99 when enhancer information was not included. Additionally, the
number of genes with a weighted p-value achieving statistical significance after correcting for
the total number of variants reported in the GWAS (purple line) dropped from 25 to two when
enhancer information was excluded. No genes achieved statistical significance at the more
stringent adjustment threshold of 5x10E-8 (green line) with analysis using gene-mapped
variants only (Fig 4.4), but when enhancer-mapped variants were added to the analysis, 15
genes exceeded this threshold (Fig 4.3).
82
Figure 4.4 The distribution of the variant p-values for all genes for which any variants were attributed to.
Each dot represents a gene. The x-axis shows the sum of the number of variants attributed to each gene. The y-
axis shows the best variant p-values for each gene on the negative log scale. A LOWESS curve is plotted in blue,
and a linear regression line is plotted in red. The purple line shows the cutoff for statistical significance using a
Bonferroni correction for the total number of variants (n=28,146,007) reported in the GWAS summary statistics.
The green line shows the cutoff for statistical significance using the common significance correction threshold of
5x10E-8.
The effects of including enhancer information on gene set enrichment analysis results
When gene set enrichment analysis is performed, it is best to use sets of genes that are
grouped by some shared biological function or process. Analysis on the PANTHER website was
performed using the annotated gene sets defined as Reactome pathways and GO biological
processes. The output of this analysis is a list of gene sets that have been shown to violate the
83
null hypothesis that their numerical values were drawn randomly from the overall distribution of
values, i.e., that the input values for the genes in these sets are significantly different than the
overall distribution based on all genes in the input file.
The pathways achieving statistical significance were sorted by the magnitude of the
negativity of their differences (i.e., the pathway with a gene list with input values that are most
dramatically shifted towards smaller values than the overall list). This was the direction of effect
most interesting here since the smaller a p-value, the more significant its variant.
For Reactome pathway enrichment analysis of 18,616 genes with weighted p-values
adjusted by PEACOCK score and number of attributed variants, p i, 91 pathways were listed as
negatively enriched, i.e., their values were smaller than those of the overall distribution. The top
seven are listed in Table 4.1. Each of these pathways was then investigated for any known
involvement with colorectal cancer.
Reactome pathway p-value Number of genes
in pathway
Lysosphingolipid and LPA receptors 0.029 13
Toll-like Receptor Cascades 0.031 143
Breakdown of the nuclear lamina 0.035 2
PIWI-interacting RNA (piRNA) biogenesis 0.044 29
Highly sodium permeable acetylcholine nicotinic receptors 0.039 7
Intracellular signaling by second messengers 0.030 279
Synthesis of (16-20)-hydroxyeicosatetraenoic acids (HETE) 0.015 9
Table 4.1 Top seven pathways negatively enriched in Reactome pathway enrichment analysis of genes
incorporating gene- and enhancer-mapped variant and predicted regulatory link information. The left
column names the pathways starting with the most negative enriched. The center column gives the p-value
attributed to the pathway enrichment analysis. The right column lists the number of genes that are members of
each pathway.
Lysophosphatidic acid (LPA) and LPA receptors have been shown to contribute to the
initiation and progression of colorectal cancer
107
. LPA signaling activates inflammatory pathways
which promote tumorigenesis and immune evasion, cancer progression, and metastasis.
Indeed, LPA signaling has a well-known role in many cancers
108
. LPA signaling promotes
cancer progression via the processes of cell proliferation, invasion, adhesion, angiogenesis, and
cell survival
107
.
84
Toll-like receptors (TLRs) recognize inflammatory mediators and play an important role
in immune responses, mediating the inflammatory response
109
. They have been shown to
maintain epithelial barrier homeostasis
110
. TLRs have been identified as a key molecule in
inflammation-driven carcinogenesis
111
. TLRs have also been shown to promote angiogenesis,
promoting cancer progression
112
. Clearly the involvement of TLRs in colorectal cancer is well-
established.
Morphological changes at the nuclear enveloped have been shown to affect the behavior
and the phenotype of tumor cells and may even have a pivotal role in tumor formation
113
. The
Reactome pathway “breakdown of nuclear lamina” shows two gene members, lamin B and
caspase-6. Changes in level or localization of LMNB1 have been associated with various
cancers, including colorectal cancer
114, 115
, lung cancer
114
, liver cancer
116
, ovarian cancer
117
, and
prostate cancer
118
. CASP6 is well-known for its critical role in apoptosis
119
. For successful tumor
growth and spread, cancer cells must escape apoptosis. Mutations of CASP6 in gastric and
colorectal carcinomas have been reported
120
. Taken together, the genes in this pathway seem
to be clearly involved in colorectal cancer.
P-Element induced wimpy testis (PIWI)-interacting RNAs (piRNAs) are small non-coding
RNAs that can bind to PIWI proteins to influence transposon silencing, genome rearrangement,
epigenetic regulation, and protein regulation
121
. Several piRNAs have been implicated in many
cancers, including colorectal cancer, although they are usually not expressed in somatic
tissues
122
. Hundreds of piRNAs have been shown to be differentially expressed in cancer
tumors
123
, and piRNAs are thought to have roles in cancer cell proliferation, apoptosis,
metastasis, and invasion
121
.
Nicotinic acetylcholine receptors (nAChRs) are ligands gated ion channels that are
expressed in the cell membrane
124
. They have been shown to regulate the neurotransmitters
that control the synthesis and release of growth, angiogenic, and neurotrophic factors in cancer
cells, the cancer microenvironment, and distant organs
125
. Smoking has been shown to affect
85
the functions of nAChRs that can stimulate cancer cells while also reducing the functions of
nAChRs that obstruct cancer cells
126
. Smoking is a risk factor for many cancers, including
colorectal cancer, and nAChRs are believed to be the central regulatory module by which the
cellular responses of nicotine contribute to multiple oncogenic signaling pathways
127
.
Additionally, evidence supporting the relationship between colorectal cancer and
intracellular messaging by second messengers
128-130
as well as synthesis of (16-20)-
hydroxyeicosatetraenoic acids
131, 132
was abundant in the literature.
Although there were nearly one hundred pathways identified by Reactome pathway
enrichment analysis as negatively enriched for colorectal cancer, these top results offer the
promise that many of them are important to the disease. The input for this analysis appeared to
be at least somewhat informative of disease initiation and progression. The next step was to
determine if these associated pathways were the result of the included enhancer information.
Reactome pathway enrichment analysis of 18,093 genes with p-values taken only from
variants mapped directly to genes and adjusted according to the number of variants for each
gene was performed. Of the 117 Reactome pathways achieving statistical significance for
negative enrichment, only one (intracellular signaling by second messengers, p=0.020) of the
seven pathways listed in Table 4.1 which were described at length in the literature as important
to colorectal cancer were in the results. However, a look at the first five most negatively
enriched pathways from the analysis using only gene-mapped variant information did show that
some of these pathways may also be important to colorectal cancer (Table 4.2). No evidence
was readily available to connect the formation of the editosome with colorectal cancer, but there
were scientific articles linking intracellular oxygen transport
133, 134
, processing of capped
intronless pre-mRNA
135
, acetycholine regulation of insulin secretion
136
, and processing of
intronless pre-mRNAs
137
to colorectal cancer. Interestingly, none of these pathways was in the
enrichment analysis results that incorporated enhancer prediction information.
86
Reactome pathway p-value Number of genes in pathway
Formation of the Editosome 0.047 8
Intracellular oxygen transport 0.030 3
Processing of Capped Intronless Pre-mRNA 0.015 27
Acetylcholine regulates insulin secretion 0.012 10
Processing of Intronless Pre-mRNAs 0.047 18
Table 4.2 Top five pathways negatively enriched in Reactome pathway enrichment analysis of genes
incorporating gene-mapped variant information only. The left column names the pathways starting with the
most negative enriched. The center column gives the p-value attributed to the pathway enrichment analysis. The
right column lists the number of genes that are members of each pathway.
Results of the gene set enrichment analysis using GO Biological Processes yielded 467
negatively enriched biological processes, and several of them are clearly colorectal cancer-
related, such as DNA damage response, signal transduction by p53 class mediator resulting in
cell cycle arrest (GO:0006977) and positive regulation of establishment of endothelial barrier
(GO:1903142). However, such a large number of processes defies any immediate attempt to
individually investigate their relationships with colorectal cancer. Gene set enrichment using GO
Biological Processes without incorporation of enhancer prediction and variant data in turn
yielded 553 negatively enriched biological processes.
Discussion
The cell type-specific scores of predicted enhancer-gene regulatory associations
generated with PEACOCK (Chapter 3) present an excellent opportunity for reexamining the
published results of GWAS for further statistical analysis that brings this new biological
information into play. Using the existing framework of gene set enrichment analysis via the
Mann-Whitney rank-sum test, Reactome pathway enrichment analysis was performed using the
PANTHER website. Gene values were based originally on the p-values of the most significant
variants called to each gene, after adjusting for the score of any predicted enhancer regulatory
information used to call variants to genes. Gene values were also adjusted for multiple called
variants per gene. An analogous analysis was performed using only gene information.
87
The results from these analyses showed that the Reactome pathways associated with
colorectal cancer were well supported in the literature as having important roles in that disease,
especially so in the top enriched pathways in the analysis which included enhancer information.
There were also few enriched pathways that were common among the gene-only analysis and
the enhancer and gene analysis. This suggests that although gene set enrichment analysis is a
method capable of yielding good information with or without incorporating enhancer information,
adding enhancer information to the analysis may uncover additional pathways with disease
significance that would not have been captured with analysis using gene information alone.
Future work should delve deeper into the potential benefits of applying enhancer-gene
regulatory predictions to GWAS data in gene set enrichment analysis.
The common scenario of an enhancer with a disease-associated signal having more
than one target gene containing disease-associated signals should also be further considered in
future work. This analysis was performed under the assumption that because such an enhancer
has more than one target gene, its effects on gene expression could be multiplicative and
therefore no correction for multiple effects on multiple genes needed to be performed. However,
this should be investigated more thoroughly. The resulting statistical approach should be one
that accurately reflects the biological action of enhancer function.
Although this chapter describes only a first pass at analyzing GWAS data through the
lens of gene set enrichment with the added information provided by PEACOCK scores of
enhancer-gene regulatory predictions, it is an area with great promise. Indeed, as prediction
methods for cell-specific enhancer regulatory action on target genes improve, the mountain of
existing data resulting from previously published GWAS and whole-genome sequencing studies
will shed even more light on the complicated web of regulatory interaction that informs disease
initiation and progression. The ability to properly utilize enhancer-gene links in the statistical
analysis of GWAS will provide lasting potential to illuminate disease pathology and inform
treatment and diagnostic developments.
88
89
Chapter 5
Conclusions
The rapid emergence of GWAS in the last decade has resulted in thousands of disease-
associated variants that are primarily located in non-coding regions, many of which reside in
enhancers. Enhancers are genetic elements that recruit transcriptional machinery to the
promoter of target genes via chromatin looping to modulate transcriptional regulation. They act
in an extremely cell type-specific fashion, with target genes and regulatory activities often
differing by cell type. Genes may be targeted by enhancers located upstream or downstream,
nearby or from more than a megabase away. Determining the regulatory targets of enhancers is
crucial to placing the thousands of disease-associated enhancer variants into their appropriate
biological contexts. In this chapter, the contributions toward this goal of the body of work
outlined in this thesis are summarized. Some of the many possible future directions are
discussed.
Summary of Contributions
Firstly, a collection of experimentally supported enhancer-gene regulatory predictions
was created and presented in PEREGRINE (Predicted by Experimental Results: Enhancer-
Gene Relationships Illustrated by a Nexus of Evidence). This compendium of experimental
results as they relate to enhancer activity of target genes provided nearly 900,000 predicted
enhancer-gene regulatory relationships based on data from 78 cell and tissue types (Chapter 2).
However, these experimental results were not evaluated in any systematic way to determine the
quality of each enhancer-gene link, or even guess at the quantity of correctly predicted
enhancer-gene regulatory relationships. In part this was due to the common shortcoming among
enhancer-gene regulatory prediction databases of not having enough cell-matched data to
capture the differences in enhancer regulatory activity based on cell type. Although
90
PEREGRINE undoubtedly provided many high-quality enhancer-gene regulatory predictions,
the database stood to benefit from applying a clearer cell type-specific context to these
predictions.
Enhancer-gene regulatory relationships cannot be evaluated outside the context of cell
type. The genomic architecture dictating which genes are expressed, which parts of the genome
are physically accessible for protein binding, and which regions of DNA are co-localized with
other regions of DNA can differ dramatically based on cell type. Other factors vital for the
regulatory function of enhancers include the protein expression profile of the cell and the
epigenetic conditions prevailing around the enhancer and any gene it might regulate. These are
also unique to each cell type. Perhaps there is nothing that is more so a product of its
environment than an enhancer.
To ever hope to gain any insight into the regulatory undertakings of an enhancer, the
cellular environment must be properly accounted for. This was the motivation for generating a
set of features that would be able to describe enhancer-gene pairs well enough for a machine
learning classification algorithm to be able to spot a pattern of different traits between enhancer-
gene pairs with actual regulatory connections and enhancer-gene pairs devoid of any regulatory
relationship. In Chapter 3, such a scheme was successful in grading the predictions of
regulatory interaction for enhancer-gene pairs in four cancer cell lines. This approach,
PEACOCK (Predicted Enhancer Activity in Cis Originating from Cell-specific Knowledge),
resulted in cell-specific enhancer-gene regulatory relationship scores for each of the four cell
lines in over 17 million enhancer-gene pairs, which opened the door to the possibility of exciting
statistical applications.
In Chapter 4, a first application of the PEACOCK scores was entered on. Using the
existing framework of a simple gene set enrichment algorithm (the Mann-Whitney rank sum test)
offered on the PANTHER website, HCT116-specific scores were incorporated into a Reactome
pathway enrichment analysis as a proof of concept. The HCT116 scores were applied to the
91
summary data from a GWAS for colorectal cancer where over 18 million variants and their p-
values were reported. Variants were mapped to genes and enhancers, and variant signal was
applied to all predicted target genes of variant-containing enhancers in a manner that adjusted
for the quality of the prediction of enhancer-to-gene regulation. Initial results look quite
promising.
Incorporation of enhancer-gene link prediction scores in the colorectal cancer cell line
HCT116 applied to the variant information provided by a GWAS for colorectal cancer resulted in
higher measures of significance for genes based on the best attributed variant and slightly more
gene information than when variants contained within genes were only used for analysis. The
Reactome pathways enriched with smaller than expected gene score-adjusted p-values
incorporating enhancer regulatory prediction and variant information were amply substantiated
in the literature, with clear ties to colorectal cancer initiation and progression. Interestingly, the
associated pathways were usually not the same as the associated pathways for analysis done
without enhancer information. This suggests that incorporating enhancer information into the
existing framework of gene set enrichment analysis can uncover different but very meaningful
biological findings that would have been missed by gene-exclusive analysis.
Future Directions
The PEACOCK cell type-specific scores are an important development that can aid in
the study of enhancers as they relate to disease research. The implementation of this method
on additional cell lines would do much to improve the potential impact of this work. Close to 100
cell lines in ENCODE have the necessary data to perform scoring using the final models
described in Chapter 3. The work in Chapter 3 is being written into a manuscript to be submitted
for publication, and by the time of its submission more cell lines will be scored. However, future
work will involve additional analysis to score cell lines with incomplete data, perhaps on models
with reduced feature sets.
92
The analysis described in Chapter 4 represents only a first pass application of the
PEACOCK scores. Future analysis will have ample room for improvement on the methodology
used to apply the enhancer variant information to genes in a way that takes into account the
quality of the enhancer-gene link predictions, and could even explore alternate methods of
adjusting for multiple variants per gene. The method by which variant association signal is
reduced to a single value for each gene is also an area that could benefit from exploration.
Using the most significant variant as a representative for each gene is but one of many options,
with its own pros and cons. There are also many more algorithms available for gene set
enrichment analysis besides the Mann-Whitney rank-sum test which could be examined for this
application. And clearly, additional analysis using scores and GWAS data from other diseases
and cell types would make a very worthwhile pursuit.
93
References
1. Gasperini M, Hill AJ, McFaline-Figueroa JL, et al. A Genome-wide Framework for
Mapping Gene Regulation via Cellular Genetic Screens. Cell. Mar 2019;176(6):1516.
doi:10.1016/j.cell.2019.02.027
2. Rubtsov MA, Polikanov YS, Bondarenko VA, Wang YH, Studitsky VM. Chromatin
structure can strongly facilitate enhancer action over a distance. Proc Natl Acad Sci U S A. Nov
2006;103(47):17690-5. doi:10.1073/pnas.0603819103
3. Kulaeva OI, Nizovtseva EV, Polikanov YS, Ulianov SV, Studitsky VM. Distant activation
of transcription: mechanisms of enhancer action. Mol Cell Biol. Dec 2012;32(24):4892-7.
doi:10.1128/MCB.01127-12
4. Corradin O, Scacheri PC. Enhancer variants: evaluating functions in common disease.
Genome Med. 2014;6(10):85. doi:10.1186/s13073-014-0085-3
5. Hnisz D, Abraham BJ, Lee TI, et al. Super-enhancers in the control of cell identity and
disease. Cell. Nov 2013;155(4):934-47. doi:10.1016/j.cell.2013.09.053
6. Enver T, Ebens AJ, Forrester WC, Stamatoyannopoulos G. The human beta-globin
locus activation region alters the developmental fate of a human fetal globin gene in transgenic
mice. Proc Natl Acad Sci U S A. Sep 1989;86(18):7033-7. doi:10.1073/pnas.86.18.7033
7. Attema JL, Bert AG, Lim YY, et al. Identification of an enhancer that increases miR-
200b~200a~429 gene expression in breast cancer cells. PLoS One. 2013;8(9):e75517.
doi:10.1371/journal.pone.0075517
8. Gasperini M, Tome JM, Shendure J. Towards a comprehensive catalogue of validated
and target-linked human enhancers. Nat Rev Genet. Jan 2020;doi:10.1038/s41576-019-0209-0
9. Schoenfelder S, Fraser P. Long-range enhancer-promoter contacts in gene expression
control. Nat Rev Genet. 08 2019;20(8):437-455. doi:10.1038/s41576-019-0128-0
10. Gao T, He B, Liu S, Zhu H, Tan K, Qian J. EnhancerAtlas: a resource for enhancer
annotation and analysis in 105 human cell/tissue types. Bioinformatics. 12 2016;32(23):3543-
3551. doi:10.1093/bioinformatics/btw495
11. Fishilevich S, Nudel R, Rappaport N, et al. GeneHancer: genome-wide integration of
enhancers and target genes in GeneCards. Database (Oxford). 01
2017;2017doi:10.1093/database/bax028
12. Wang J, Dai X, Berry LD, Cogan JD, Liu Q, Shyr Y. HACER: an atlas of human active
enhancers to interpret regulatory variants. Nucleic Acids Res. Jan 2019;47(D1):D106-D112.
doi:10.1093/nar/gky864
13. Jiang Y, Qian F, Bai X, et al. SEdb: a comprehensive human super-enhancer database.
Nucleic Acids Res. 01 2019;47(D1):D235-D243. doi:10.1093/nar/gky1025
14. Gao T, Qian J. EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586
tissue/cell types across nine species. Nucleic Acids Res. 01 2020;48(D1):D58-D64.
doi:10.1093/nar/gkz980
15. Wang Z, Zhang Q, Zhang W, et al. HEDD: Human Enhancer Disease Database. Nucleic
Acids Res. 01 2018;46(D1):D113-D120. doi:10.1093/nar/gkx988
16. Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects.
Science. Jul 2015;349(6245):255-60. doi:10.1126/science.aaa8415
17. Yang W, Fidelis TT, Sun WH. Machine Learning in Catalysis, From Proposal to
Practicing. ACS Omega. Jan 2020;5(1):83-88. doi:10.1021/acsomega.9b03673
18. Javierre BM, Burren OS, Wilder SP, et al. Lineage-Specific Genome Architecture Links
Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell. 11
2016;167(5):1369-1384.e19. doi:10.1016/j.cell.2016.09.037
19. Montefiori LE, Sobreira DR, Sakabe NJ, et al. A promoter interaction map for
cardiovascular disease genetics. Elife. 07 2018;7doi:10.7554/eLife.35788
94
20. Mills C, Muruganujan A, Ebert D, et al. PEREGRINE: A genome-wide prediction of
enhancer to gene relationships supported by experimental evidence. PLoS One.
2020;15(12):e0243791. doi:10.1371/journal.pone.0243791
21. Mi H, Muruganujan A, Casagrande JT, Thomas PD. Large-scale gene function analysis
with the PANTHER classification system. Nat Protoc. Aug 2013;8(8):1551-66.
doi:10.1038/nprot.2013.092
22. Mi H, Muruganujan A, Ebert D, Huang X, Thomas PD. PANTHER version 14: more
genomes, a new PANTHER GO-slim and improvements in enrichment analysis tools. Nucleic
Acids Res. 01 2019;47(D1):D419-D426. doi:10.1093/nar/gky1038
23. Ong CT, Corces VG. Enhancer function: new insights into the regulation of tissue-
specific gene expression. Nat Rev Genet. Apr 2011;12(4):283-93. doi:10.1038/nrg2957
24. Banerji J, Rusconi S, Schaffner W. Expression of a beta-globin gene is enhanced by
remote SV40 DNA sequences. Cell. Dec 1981;27(2 Pt 1):299-308. doi:10.1016/0092-
8674(81)90413-x
25. Consortium EP. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS
Biol. Apr 2011;9(4):e1001046. doi:10.1371/journal.pbio.1001046
26. Consortium EP. An integrated encyclopedia of DNA elements in the human genome.
Nature. Sep 2012;489(7414):57-74. doi:10.1038/nature11247
27. Yao L, Berman BP, Farnham PJ. Demystifying the secret mission of enhancers: linking
distal regulatory elements to target genes. Crit Rev Biochem Mol Biol. 2015;50(6):550-73.
doi:10.3109/10409238.2015.1087961
28. Heintzman ND, Hon GC, Hawkins RD, et al. Histone modifications at human enhancers
reflect global cell-type-specific gene expression. Nature. May 2009;459(7243):108-12.
doi:10.1038/nature07829
29. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology.
The Gene Ontology Consortium. Nat Genet. May 2000;25(1):25-9. doi:10.1038/75556
30. The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing
strong. Nucleic Acids Res. 01 2019;47(D1):D330-D338. doi:10.1093/nar/gky1055
31. Jassal B, Matthews L, Viteri G, et al. The reactome pathway knowledgebase. Nucleic
Acids Res. 01 2020;48(D1):D498-D503. doi:10.1093/nar/gkz1031
32. Mi H, Thomas P. PANTHER pathway: an ontology-based pathway database coupled
with data analysis tools. Methods Mol Biol. 2009;563:123-40. doi:10.1007/978-1-60761-175-2_7
33. Zerbino DR, Achuthan P, Akanni W, et al. Ensembl 2018. Nucleic Acids Res. Jan
2018;46(D1):D754-D761. doi:10.1093/nar/gkx1098
34. Fraser J, Ferrai C, Chiariello AM, et al. Hierarchical folding and reorganization of
chromosomes are linked to transcriptional changes in cellular differentiation. Mol Syst Biol. Dec
2015;11(12):852. doi:10.15252/msb.20156492
35. Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer Browser--a database
of tissue-specific human enhancers. Nucleic Acids Res. Jan 2007;35(Database issue):D88-92.
doi:10.1093/nar/gkl822
36. Consortium G. The Genotype-Tissue Expression (GTEx) project. Nat Genet. Jun
2013;45(6):580-5. doi:10.1038/ng.2653
37. Visel A, Rubin EM, Pennacchio LA. Genomic views of distant-acting enhancers. Nature.
Sep 2009;461(7261):199-205. doi:10.1038/nature08451
38. Ron G, Globerson Y, Moran D, Kaplan T. Promoter-enhancer interactions identified from
Hi-C data using probabilistic models and hierarchical topological domains. Nat Commun. 12
2017;8(1):2237. doi:10.1038/s41467-017-02386-3
39. Acemel RD, Maeso I, Gómez-Skarmeta JL. Topologically associated domains: a
successful scaffold for the evolution of gene regulation in animals. Wiley Interdiscip Rev Dev
Biol. 05 2017;6(3)doi:10.1002/wdev.265
95
40. Ghavi-Helm Y, Klein FA, Pakozdi T, et al. Enhancer loops appear stable during
development and are associated with paused polymerase. Nature. Aug 2014;512(7512):96-100.
doi:10.1038/nature13417
41. Dixon JR, Selvaraj S, Yue F, et al. Topological domains in mammalian genomes
identified by analysis of chromatin interactions. Nature. Apr 2012;485(7398):376-80.
doi:10.1038/nature11082
42. Rao SS, Huntley MH, Durand NC, et al. A 3D map of the human genome at kilobase
resolution reveals principles of chromatin looping. Cell. Dec 2014;159(7):1665-80.
doi:10.1016/j.cell.2014.11.021
43. Marsman J, Horsfield JA. Long distance relationships: enhancer-promoter
communication and dynamic gene transcription. Biochim Biophys Acta. 2012 Nov-Dec
2012;1819(11-12):1217-27. doi:10.1016/j.bbagrm.2012.10.008
44. Blackwood EM, Kadonaga JT. Going the distance: a current view of enhancer action.
Science. Jul 1998;281(5373):60-3. doi:10.1126/science.281.5373.60
45. Bulger M, Groudine M. Looping versus linking: toward a model for long-distance gene
activation. Genes Dev. Oct 1999;13(19):2465-77. doi:10.1101/gad.13.19.2465
46. de Laat W, Klous P, Kooren J, et al. Three-dimensional organization of gene expression
in erythroid cells. Curr Top Dev Biol. 2008;82:117-39. doi:10.1016/S0070-2153(07)00005-1
47. Panigrahi A, O'Malley BW. Mechanisms of enhancer action: the known and the
unknown. Genome Biol. Apr 2021;22(1):108. doi:10.1186/s13059-021-02322-1
48. Fullwood MJ, Ruan Y. ChIP-based methods for the identification of long-range chromatin
interactions. J Cell Biochem. May 2009;107(1):30-9. doi:10.1002/jcb.22116
49. Phanstiel DH, Boyle AP, Heidari N, Snyder MP. Mango: a bias-correcting ChIA-PET
analysis pipeline. Bioinformatics. Oct 2015;31(19):3092-8. doi:10.1093/bioinformatics/btv336
50. Nica AC, Dermitzakis ET. Expression quantitative trait loci: present and future. Philos
Trans R Soc Lond B Biol Sci. 2013;368(1620):20120362. doi:10.1098/rstb.2012.0362
51. Ward LD, Kellis M. Interpreting noncoding genetic variation in complex traits and human
disease. Nat Biotechnol. Nov 2012;30(11):1095-106. doi:10.1038/nbt.2422
52. Maurano MT, Humbert R, Rynes E, et al. Systematic localization of common disease-
associated variation in regulatory DNA. Science. Sep 2012;337(6099):1190-5.
doi:10.1126/science.1222794
53. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic
features. Bioinformatics. Mar 2010;26(6):841-2. doi:10.1093/bioinformatics/btq033
54. Apache Solr. https://lucene.apache.org/solr/
55. Burke K, Conroy K, Horn R, Stratton F, Binet G. Flask-RESTful. https://flask-
restful.readthedocs.io/en/latest/
56. Mi H, Muruganujan A, Huang X, et al. Protocol Update for large-scale genome and gene
function analysis with the PANTHER classification system (v.14.0). Nat Protoc. 03
2019;14(3):703-721. doi:10.1038/s41596-019-0128-8
57. Thomas MF, L'Etoile ND, Ansel KM. Eri1: a conserved enzyme at the crossroads of
multiple RNA-processing pathways. Trends Genet. Jul 2014;30(7):298-307.
doi:10.1016/j.tig.2014.05.003
58. Makino T, McLysaght A. Interacting gene clusters and the evolution of the vertebrate
immune system. Mol Biol Evol. Sep 2008;25(9):1855-62. doi:10.1093/molbev/msn137
59. West AG, Gaszner M, Felsenfeld G. Insulators: many functions, many mechanisms.
Genes Dev. Feb 2002;16(3):271-88. doi:10.1101/gad.954702
60. Liu Q, Thoms JAI, Nunez AC, et al. Disruption of a -35 kb Enhancer Impairs CTCF
Binding and. Clin Cancer Res. 09 2018;24(18):4602-4611. doi:10.1158/1078-0432.CCR-17-
3678
61. Chen J, Stark LA. Aspirin Prevention of Colorectal Cancer: Focus on NF-κB Signalling
and the Nucleolus. Biomedicines. Jul 2017;5(3)doi:10.3390/biomedicines5030043
96
62. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among
sequence variation and human phenotype. Nucleic Acids Res. Jan 2014;42(Database
issue):D980-5. doi:10.1093/nar/gkt1113
63. Wu D, Wu P, Zhao L, et al. NF-κB Expression and Outcomes in Solid Tumors: A
Systematic Review and Meta-Analysis. Medicine (Baltimore). Oct 2015;94(40):e1687.
doi:10.1097/MD.0000000000001687
64. Murray A, Cluett C, Bandinelli S, et al. Common lipid-altering gene variants are
associated with therapeutic intervention thresholds of lipid levels in older people. Eur Heart J.
Jul 2009;30(14):1711-9. doi:10.1093/eurheartj/ehp161
65. Cavalli M, Pan G, Nord H, Wadelius C. Looking beyond GWAS: allele-specific
transcription factor binding drives the association of GALNT2 to HDL-C plasma levels. Lipids
Health Dis. Jan 2016;15:18. doi:10.1186/s12944-016-0183-x
66. Roman TS, Marvelle AF, Fogarty MP, et al. Multiple Hepatic Regulatory Variants at the
GALNT2 GWAS Locus Associated with High-Density Lipoprotein Cholesterol. Am J Hum Genet.
Dec 2015;97(6):801-15. doi:10.1016/j.ajhg.2015.10.016
67. Wang M, Gu D, Du M, et al. Common genetic variation in ETV6 is associated with
colorectal cancer susceptibility. Nat Commun. 05 2016;7:11478. doi:10.1038/ncomms11478
68. Adhikary S, Roy S, Chacon J, Gadad SS, Das C. Implications of enhancer transcription
and eRNAs in cancer. Cancer Res. May 2021;doi:10.1158/0008-5472.CAN-20-4010
69. Schoenfelder S, Furlan-Magaril M, Mifsud B, et al. The pluripotent regulatory circuitry
connecting promoters to their long-range interacting elements. Genome Res. Apr
2015;25(4):582-97. doi:10.1101/gr.185272.114
70. Mifsud B, Tavares-Cadete F, Young AN, et al. Mapping long-range promoter contacts in
human cells with high-resolution capture Hi-C. Nat Genet. Jun 2015;47(6):598-606.
doi:10.1038/ng.3286
71. Osterwalder M, Barozzi I, Tissières V, et al. Enhancer redundancy provides phenotypic
robustness in mammalian development. Nature. 02 2018;554(7691):239-243.
doi:10.1038/nature25461
72. Fukaya T, Lim B, Levine M. Enhancer Control of Transcriptional Bursting. Cell. Jul
2016;166(2):358-368. doi:10.1016/j.cell.2016.05.025
73. Lettice LA, Heaney SJ, Purdie LA, et al. A long-range Shh enhancer regulates
expression in the developing limb and fin and is associated with preaxial polydactyly. Hum Mol
Genet. Jul 2003;12(14):1725-35. doi:10.1093/hmg/ddg180
74. Sanyal A, Lajoie BR, Jain G, Dekker J. The long-range interaction landscape of gene
promoters. Nature. Sep 2012;489(7414):109-13. doi:10.1038/nature11279
75. Rada-Iglesias A, Bajpai R, Swigut T, Brugmann SA, Flynn RA, Wysocka J. A unique
chromatin signature uncovers early developmental enhancers in humans. Nature. Feb
2011;470(7333):279-83. doi:10.1038/nature09692
76. Sengupta S, George RE. Super-Enhancer-Driven Transcriptional Dependencies in
Cancer. Trends Cancer. 04 2017;3(4):269-281. doi:10.1016/j.trecan.2017.03.006
77. Pott S, Lieb JD. What are super-enhancers? Nat Genet. Jan 2015;47(1):8-12.
doi:10.1038/ng.3167
78. Whyte WA, Orlando DA, Hnisz D, et al. Master transcription factors and mediator
establish super-enhancers at key cell identity genes. Cell. Apr 2013;153(2):307-19.
doi:10.1016/j.cell.2013.03.035
79. Fulco CP, Nasser J, Jones TR, et al. Activity-by-contact model of enhancer-promoter
regulation from thousands of CRISPR perturbations. Nat Genet. 12 2019;51(12):1664-1669.
doi:10.1038/s41588-019-0538-0
80. Fulco CP, Munschauer M, Anyoha R, et al. Systematic mapping of functional enhancer-
promoter connections with CRISPR interference. Science. 11 2016;354(6313):769-773.
doi:10.1126/science.aag2445
97
81. Schulze SR, McAllister BF, Sinclair DA, et al. Heterochromatic genes in Drosophila: a
comparative analysis of two genes. Genetics. Jul 2006;173(3):1433-45.
doi:10.1534/genetics.106.056069
82. Wen B, Wu H, Bjornsson H, Green RD, Irizarry R, Feinberg AP. Overlapping
euchromatin/heterochromatin- associated marks are enriched in imprinted gene regions and
predict allele-specific modification. Genome Res. Nov 2008;18(11):1806-13.
doi:10.1101/gr.067587.108
83. Raisner R, Kharbanda S, Jin L, et al. Enhancer Activity Requires CBP/P300
Bromodomain-Dependent Histone H3K27 Acetylation. Cell Rep. Aug 2018;24(7):1722-1729.
doi:10.1016/j.celrep.2018.07.041
84. Baert-Desurmont S, Charbonnier F, Houivet E, et al. Clinical relevance of 8q23, 15q13
and 18q21 SNP genotyping to evaluate colorectal cancer risk. Eur J Hum Genet. Jan
2016;24(1):99-105. doi:10.1038/ejhg.2015.72
85. Markowitz SD, Bertagnolli MM. Molecular origins of cancer: Molecular basis of colorectal
cancer. N Engl J Med. Dec 2009;361(25):2449-60. doi:10.1056/NEJMra0804588
86. Levy L, Hill CS. Alterations in components of the TGF-beta superfamily signaling
pathways in human cancer. Cytokine Growth Factor Rev. 2006 Feb-Apr 2006;17(1-2):41-58.
doi:10.1016/j.cytogfr.2005.09.009
87. Stolfi C, De Simone V, Colantoni A, et al. A functional role for Smad7 in sustaining colon
cancer cell growth and survival. Cell Death Dis. Feb 2014;5:e1073. doi:10.1038/cddis.2014.49
88. Fortini BK, Tring S, Plummer SJ, et al. Multiple functional risk variants in a SMAD7
enhancer implicate a colorectal cancer risk haplotype. PLoS One. 2014;9(11):e111914.
doi:10.1371/journal.pone.0111914
89. Arnold A, Papanikolaou A. Cyclin D1 in breast cancer pathogenesis. J Clin Oncol. Jun
2005;23(18):4215-24. doi:10.1200/JCO.2005.05.064
90. Fullwood MJ, Liu MH, Pan YF, et al. An oestrogen-receptor-alpha-bound human
chromatin interactome. Nature. Nov 2009;462(7269):58-64. doi:10.1038/nature08497
91. Korkmaz G, Lopes R, Ugalde AP, et al. Functional genetic screens for enhancer
elements in the human genome using CRISPR-Cas9. Nat Biotechnol. Feb 2016;34(2):192-8.
doi:10.1038/nbt.3450
92. Maleki F, Ovens K, Hogan DJ, Kusalik AJ. Gene Set Analysis: Challenges,
Opportunities, and Future Research. Front Genet. 2020;11:654. doi:10.3389/fgene.2020.00654
93. Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1alpha-responsive genes involved in
oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. Jul
2003;34(3):267-73. doi:10.1038/ng1180
94. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a
knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad
Sci U S A. Oct 2005;102(43):15545-50. doi:10.1073/pnas.0506580102
95. Pritykin Y, Ghersi D, Singh M. Genome-Wide Detection and Analysis of Multifunctional
Genes. PLoS Comput Biol. Oct 2015;11(10):e1004467. doi:10.1371/journal.pcbi.1004467
96. Zhu X, Stephens M. Large-scale genome-wide enrichment analyses identify new trait-
associated genes and pathways across 31 human phenotypes. Nat Commun. 10
2018;9(1):4361. doi:10.1038/s41467-018-06805-x
97. Guo MH, Hirschhorn JN, Dauber A. Insights and Implications of Genome-Wide
Association Studies of Height. J Clin Endocrinol Metab. 09 2018;103(9):3155-3168.
doi:10.1210/jc.2018-01126
98. Visscher PM, Wray NR, Zhang Q, et al. 10 Years of GWAS Discovery: Biology,
Function, and Translation. Am J Hum Genet. Jul 2017;101(1):5-22.
doi:10.1016/j.ajhg.2017.06.005
99. Sham PC, Purcell SM. Statistical power and significance testing in large-scale genetic
studies. Nat Rev Genet. May 2014;15(5):335-46. doi:10.1038/nrg3706
98
100. Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide
association studies. Nat Rev Genet. Dec 2010;11(12):843-54. doi:10.1038/nrg2884
101. Gusev A, Lee SH, Trynka G, et al. Partitioning heritability of regulatory and cell-type-
specific variants across 11 common diseases. Am J Hum Genet. Nov 2014;95(5):535-52.
doi:10.1016/j.ajhg.2014.10.004
102. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional
implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci
U S A. Jun 2009;106(23):9362-7. doi:10.1073/pnas.0903103106
103. Zhou W, Nielsen JB, Fritsche LG, et al. Efficiently controlling for case-control imbalance
and sample relatedness in large-scale genetic association studies. Nat Genet. 09
2018;50(9):1335-1341. doi:10.1038/s41588-018-0184-y
104. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide
association study of electronic medical record data and genome-wide association study data.
Nat Biotechnol. Dec 2013;31(12):1102-10. doi:10.1038/nbt.2749
105. Sudlow C, Gallacher J, Allen N, et al. UK biobank: an open access resource for
identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med.
Mar 2015;12(3):e1001779. doi:10.1371/journal.pmed.1001779
106. Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep
phenotyping and genomic data. Nature. 10 2018;562(7726):203-209. doi:10.1038/s41586-018-
0579-z
107. Yun CC. Lysophosphatidic Acid and Autotaxin-associated Effects on the Initiation and
Progression of Colorectal Cancer. Cancers (Basel). Jul
2019;11(7)doi:10.3390/cancers11070958
108. Brindley DN. Lysophosphatidic Acid Signaling in Cancer. Cancers (Basel). Dec
2020;12(12)doi:10.3390/cancers12123791
109. Holldack J. Toll-like receptors as therapeutic targets for cancer. Drug Discov Today. Apr
2014;19(4):379-82. doi:10.1016/j.drudis.2013.08.020
110. Li TT, Ogino S, Qian ZR. Toll-like receptor signaling in colorectal cancer: carcinogenesis
to cancer therapy. World J Gastroenterol. Dec 2014;20(47):17699-708.
doi:10.3748/wjg.v20.i47.17699
111. Tang XY, Zhu YQ, Wei B, Wang H. Expression and functional research of TLR4 in
human colon carcinoma. Am J Med Sci. Apr 2010;339(4):319-26.
doi:10.1097/MAJ.0b013e3181cef1b7
112. Belmont L, Rabbe N, Antoine M, et al. Expression of TLR9 in tumor-infiltrating
mononuclear cells enhances angiogenesis and is associated with a worse survival in lung
cancer. Int J Cancer. Feb 2014;134(4):765-77. doi:10.1002/ijc.28413
113. Chow KH, Factor RE, Ullman KS. The nuclear envelope environment and its cancer
connections. Nat Rev Cancer. Feb 2012;12(3):196-209. doi:10.1038/nrc3219
114. Moss SF, Krivosheyev V, de Souza A, et al. Decreased and aberrant nuclear lamin
expression in gastrointestinal tract neoplasms. Gut. Nov 1999;45(5):723-9.
doi:10.1136/gut.45.5.723
115. Alfonso P, Cañamero M, Fernández-Carbonié F, Núñez A, Casal JI. Proteome analysis
of membrane fractions in colorectal carcinomas by using 2D-DIGE saturation labeling. J
Proteome Res. Oct 2008;7(10):4247-55. doi:10.1021/pr800152u
116. Lim SO, Park SJ, Kim W, et al. Proteome analysis of hepatocellular carcinoma. Biochem
Biophys Res Commun. Mar 2002;291(4):1031-7. doi:10.1006/bbrc.2002.6547
117. Bengtsson S, Krogh M, Szigyarto CA, et al. Large-scale proteomics analysis of human
ovarian cancer for biomarkers. J Proteome Res. Apr 2007;6(4):1440-50. doi:10.1021/pr060593y
118. Coradeghini R, Barboro P, Rubagotti A, et al. Differential expression of nuclear lamins in
normal and cancerous prostate tissues. Oncol Rep. Mar 2006;15(3):609-13.
99
119. Zheng M, Karki R, Vogel P, Kanneganti TD. Caspase-6 Is a Key Regulator of Innate
Immunity, Inflammasome Activation, and Host Defense. Cell. 04 2020;181(3):674-687.e13.
doi:10.1016/j.cell.2020.03.040
120. Lee JW, Kim MR, Soung YH, et al. Mutational analysis of the CASP6 gene in colorectal
and gastric carcinomas. APMIS. Sep 2006;114(9):646-50. doi:10.1111/j.1600-
0463.2006.apm_417.x
121. Han YN, Li Y, Xia SQ, Zhang YY, Zheng JH, Li W. PIWI Proteins and PIWI-Interacting
RNA: Emerging Roles in Cancer. Cell Physiol Biochem. 2017;44(1):1-20.
doi:10.1159/000484541
122. Liu Y, Dou M, Song X, et al. The emerging role of the piRNA/piwi complex in cancer. Mol
Cancer. 08 2019;18(1):123. doi:10.1186/s12943-019-1052-9
123. Martinez VD, Vucic EA, Thu KL, et al. Unique somatic and malignant expression
patterns implicate PIWI-interacting RNAs in cancer-type specific biology. Sci Rep. May
2015;5:10423. doi:10.1038/srep10423
124. Wessler I, Kirkpatrick CJ. Acetylcholine beyond neurons: the non-neuronal cholinergic
system in humans. Br J Pharmacol. Aug 2008;154(8):1558-71. doi:10.1038/bjp.2008.185
125. Schuller HM. Is cancer triggered by altered signalling of nicotinic acetylcholine
receptors? Nat Rev Cancer. Mar 2009;9(3):195-205. doi:10.1038/nrc2590
126. Zhao Y. The Oncogenic Functions of Nicotinic Acetylcholine Receptors. J Oncol.
2016;2016:9650481. doi:10.1155/2016/9650481
127. Hecht SS. Tobacco carcinogens, their biomarkers and tobacco-induced cancer. Nat Rev
Cancer. Oct 2003;3(10):733-44. doi:10.1038/nrc1190
128. Fajardo AM, Piazza GA, Tinsley HN. The role of cyclic nucleotide signaling pathways in
cancer: targets for prevention and treatment. Cancers (Basel). Feb 2014;6(1):436-58.
doi:10.3390/cancers6010436
129. Carlson CC, Smithers SL, Yeh KA, Burnham LL, Dransfield DT. Protein kinase A
regulatory subunits in colon cancer. Neoplasia. Oct 1999;1(4):373-8.
doi:10.1038/sj.neo.7900039
130. Li P, Schulz S, Bombonati A, et al. Guanylyl cyclase C suppresses intestinal
tumorigenesis by restricting proliferation and maintaining genomic integrity. Gastroenterology.
Aug 2007;133(2):599-607. doi:10.1053/j.gastro.2007.05.052
131. Alexanian A, Miller B, Roman RJ, Sorokin A. 20-HETE-producing enzymes are up-
regulated in human cancers. Cancer Genomics Proteomics. 2012 Jul-Aug 2012;9(4):163-9.
132. Alexanian A, Sorokin A. Targeting 20-HETE producing enzymes in cancer - rationale,
pharmacology, and clinical potential. Onco Targets Ther. 2013;6:243-55.
doi:10.2147/OTT.S31586
133. Knoll G, Bittner S, Kurz M, Jantsch J, Ehrenschwender M. Hypoxia regulates TRAIL
sensitivity of colorectal cancer cells through mitochondrial autophagy. Oncotarget. Jul
2016;7(27):41488-41504. doi:10.18632/oncotarget.9206
134. Sreevalsan S, Safe S. REACTIVE OXYGEN SPECIES AND COLORECTAL CANCER.
Curr Colorectal Cancer Rep. Dec 2013;9(4):350-357. doi:10.1007/s11888-013-0190-5
135. Danckwardt S, Hentze MW, Kulozik AE. 3' end mRNA processing: molecular
mechanisms and implications for health and disease. EMBO J. Feb 2008;27(3):482-98.
doi:10.1038/sj.emboj.7601932
136. Spindel ER. Muscarinic receptor agonists and antagonists: effects on cancer. Handb
Exp Pharmacol. 2012;(208):451-68. doi:10.1007/978-3-642-23274-9_19
137. Bordonaro M. Crosstalk between Wnt Signaling and RNA Processing in Colorectal
Cancer. J Cancer. 2013;4(2):96-103. doi:10.7150/jca.5470
Abstract (if available)
Abstract
With the advent of genome-wide association studies over the last decade, an explosion of variant association data has become available to the scientific community. The vast majority of these trait-associated variants map to enhancers, vital regulatory elements which orchestrate the recruitment of transcriptional complexes to the promoter of their target genes to upregulate transcription in a cell type- and timing-dependent manner. Through these popular genome-wide association studies, variants located within enhancers have implicated thousands of these powerful regulatory elements with many common genetic diseases, including almost every form of cancer. Although much is now known about genes and the individual functions of their products, very little is known about the function of individual enhancers beyond their general shared properties. Indeed, as the primary function of enhancers is the transcriptional regulation of their specific target genes, it is crucial to identify the target genes of as many enhancers as possible, especially those enhancers which are already strongly associated with diseases. To interpret the disease-associated signals located within enhancer elements, it is necessary to be able to contextualize enhancers via their biological functions. By predicting high-quality enhancer-gene regulatory links and then statistically validating a cell type-specific score for each prediction, this work illustrates how it is possible to interpret enhancer variants meaningfully within the framework of protein pathways and shared biological processes that are already well characterized by using their predicted target genes as a bridge. Although the possible applications for such data using this principle are large in number and for the most part beyond the scope of this work, we conclude by providing a proof of concept in the form of gene set enrichment analysis which incorporates enhancer variant information through their target genes and the cell type-specific score measuring the confidence of each prediction.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Cell-specific case studies of enhancer function prediction using machine learning
PDF
Identification and characterization of cancer-associated enhancers
PDF
Functional characterization of colon cancer risk-associated enhancers: connecting risk loci to risk genes
PDF
Scalable latent factor models for inferring genetic regulatory networks
PDF
Breast epithelial cell type specific enhancers and functional annotation of breast cancer risk loci
PDF
Application of tracing enhancer networks using epigenetic traits (TENET) to identify epigenetic deregulation in cancer
PDF
Generalized linear discriminant analysis for high-dimensional genomic data with external information
PDF
Gene-set based analysis using external prior information
PDF
Functional characterization of colorectal cancer GWAS loci
PDF
Prediction and feature selection with regularized regression in integrative genomics
PDF
High-dimensional regression for gene-environment interactions
PDF
Incorporating prior knowledge into regularized regression
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Hierarchical regularized regression for incorporation of external data in high-dimensional models
PDF
Investigating hematopoietic stem cells via multiscale modeling, single-cell genomics, and gene regulatory network inference
PDF
Leveraging functional datasets of stimulated cells to understand the relationship between environment and diseases
PDF
Adaptive set-based tests for pathway analysis
PDF
Statistical analysis of high-throughput genomic data
PDF
Using genomics to understand the gene selectivity of steroid hormone receptors
PDF
Computational analysis of genome architecture
Asset Metadata
Creator
Mills, Caitlin Merideth LeDoux
(author)
Core Title
Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Degree Conferral Date
2021-08
Publication Date
07/12/2021
Defense Date
06/18/2021
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
enhancer,gene set enrichment analysis,genetics,genomics,machine learning,OAI-PMH Harvest,pathway enrichment analysis,Peacock,PEREGRINE,regulatory,transcriptional regulation
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Mi, Huaiyu (
committee chair
), Conti, David (
committee member
), Lewinger, Juan Pablo (
committee member
), Marconett, Crystal (
committee member
), Offringa, Ite (
committee member
), Siegmund, Kimberly (
committee member
)
Creator Email
caitlimm@usc.edu,cmlmills@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC16252534
Unique identifier
UC16252534
Legacy Identifier
etd-MillsCaitl-9723
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Mills, Caitlin Merideth LeDoux
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
enhancer
gene set enrichment analysis
genetics
genomics
machine learning
pathway enrichment analysis
PEREGRINE
regulatory
transcriptional regulation