Finding Signals in Infinium DNA
Methylation Data
by
Xinhui Wang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
August 2015
Copyright 2015 Xinhui Wang
Dedicated to my husband Rui
and my two lovely kids, Sophia and Grace
Acknowledgements
First of all, I would like to express my deepest gratitude to my advisor Dr.
Kimberly Siegmund. From the very beginning, when I transferred into the Biostatistics major, she was the first professor who recognized my potential in this discipline and granted me
financial support. During the journey of my PhD training, Dr. Siegmund has always been
supportive and encouraging, whenever I encountered challenges in research or had to
deal with difficulties in my personal life. I feel fortunate to have Dr. Siegmund as my
advisor, as the completion of this dissertation work would not have been possible without her
enthusiastic inspiration, exceptional guidance and insightful critiques. Perhaps most
importantly, I learned an enormous amount from the many examples Dr. Siegmund set for what
makes a great biostatistician scientist: critical thinking, perseverance, and independence.
I would also like to thank my other committee members: from the Department of Preventive Medicine, Dr. Stanley P. Azen, for his warm encouragement and inspiration in research; Dr. Susan Groshen, for her valuable suggestions on both my dissertation topics and her continuous support throughout my entire PhD study; Dr. Juan Pablo Lewinger, for his insightful advice on the batch effect correction project; and Dr. Joseph Hacia, Professor in the Department of Biochemistry and Molecular Biology, for his helpful guidance in understanding the biological background of my data and his generosity in sharing his research experience. I am sincerely grateful to all of them, both for their valuable discussions that have helped raise my dissertation work to a higher level and for their candid advice that will benefit my professional career as a biostatistician scientist.
My sincere appreciation also goes to Dr. Jiu-Chiuan Chen, who shared many valuable thoughts and much advice on improving my dissertation writing, and who also taught me how to do better population health science by integrating multidisciplinary perspectives and different research tools. In addition, I want to give my sincere thanks to Dr. Anny H. Xiang, who supported me and granted me great opportunities to work on projects at Kaiser that have clearly broadened my experience and vision in conducting biomedical research.
I would also like to extend my thanks to professors, colleagues and friends at USC and at the Department of Research and Evaluation at Kaiser, particularly Dr. Peter W. Laird and Dr. Toshinori Hinoue for their collaboration on the feature selection paper, and Dr. Towhid Salam, Dr. Yani Lu, Xia Li, Feifei Liu, Zoe Bider and many others for their inspiration, encouragement and friendship.
Last but not least, I would like to give special thanks to my parents and parents-
in-law. I could not have handled multiple tasks in my life without their encouragement
and unselfish support. My greatest thanks also go to my husband and my two lovely kids, whose unconditional love supports me in moving forward.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables viii
List of Figures ix
Abstract xi
Chapter I: Introduction .......................................................................................... 1
1.1 DNA methylation and Illumina HumanMethylation BeadArrays ................ 1
1.2 Cluster analysis .............................................................................................. 2
1.3 Variable selection by filtering ........................................................................ 4
1.4 Batch Effect Correction ................................................................................. 5
Chapter II: Variable filtering methods for Beta-distributed data ...................... 7
2.1 Background and Motivation ......................................................................... 7
2.2 Methods......................................................................................................... 9
2.2.1 Illumina HumanMethylation BeadArrays......................................... 9
2.2.2 Beta Distribution ............................................................................. 10
2.2.3 Filter Methods ................................................................................. 11
2.2.4 Simulation Study ............................................................................. 14
2.2.5 Real Data sets .................................................................................. 16
2.3 Results ......................................................................................................... 19
2.3.1 Colon Cancer Data .......................................................................... 19
2.3.2 Simulation Study ............................................................................. 20
2.3.2.1 Enrichment of Ranked Lists .................................................. 20
2.3.2.2 Cluster Analysis .................................................................... 22
2.3.3 Real Data Application ..................................................................... 24
2.4 Discussion ................................................................................................... 33
2.5 Conclusions ................................................................................................. 37
2.6 Appendix ..................................................................................................... 38
2.6.1 Real Data Sets ................................................................................ 38
2.6.2 Supplemental Figure and Tables .................................................... 40
Chapter III: Comparison of Batch Effect correction methods for Illumina
Infinium DNA methylation data ............................................................................... 44
3.1 Background and Motivation ....................................................................... 44
3.2 Batch Effect Correction Methods ............................................................... 45
3.2.1 Combining Batches (ComBat) ........................................................ 46
3.2.2 Surrogate Variable Analysis (SVA)................................................. 48
3.2.3 Independent SVA (ISVA) ............................................................... 50
3.2.4 RUV-2 and RUV-4 ......................................................................... 52
3.3 Simulation Study ......................................................................................... 55
3.3.1 Simulation Set-up............................................................................ 56
3.3.2 Method Evaluation using Simulated Data ...................................... 65
3.4 Real Data Set............................................................................................... 66
3.5 Results ......................................................................................................... 67
3.5.1 Simulation ....................................................................................... 67
3.5.2 Real Data Analysis .......................................................................... 84
3.6 Discussion ................................................................................................... 87
3.7 Conclusion .................................................................................................. 97
3.8 Appendix ..................................................................................................... 98
Chapter IV: Conclusion and Future Direction ................................................. 106
4.1 Conclusion ................................................................................................. 106
4.2 Future Directions ....................................................................................... 107
Bibliography ....................................................................................................... 110
List of Tables
Table 2.1 Description of feature filtering methods 13
Table 2.2 Description of data sets used in application analysis 16
Table 2.3 Misclassification rate of RPMM cluster analysis to find 2 groups using
different variable filtering methods (top 1000 features) 26
Table 2.4 Mean age in two clusters identified by RPMM using different filtering
methods on blood samples of a normal population 31
Supplemental Table 2.1 Area under the curve (95% confidence interval) for
simulation results in Figure 2.2 40
Supplemental Table 2.2 Adjusted Rand Index of RPMM cluster analysis result
using a variety of filtering methods for multiple data sets 41
Supplemental Table 2.3 Genomic context of the features selected by the top
two filter methods 42
Supplemental Table 2.3A For HM27 platform (N/%) 42
Supplemental Table 2.3B For HM450 platform (N/%) 43
Table 3.1 Comparison of available batch effect correction methods 55
Table 3.2 Simulation scenarios for different Batch effect or different POI effect
(scenarios 0-5) 64
Table 3.3 Simulation scenarios when POI and covariate or POI and Batch have
correlation (Other parameter setting is the same as scenario 1) 64
Table 3.4 Summary of the simulated correlations (Mean±SD) for Scenarios with
varying correlations between POI and covariate or between POI and the Batch
variable 78
Table 3.5 Comparison of enrichment of age associated CpG probes from linear
regression after BE correction by different BEC method (at q value < 0.05 level) 84
List of Figures
Figure 2.1 Smoothed scatter plots showing six filter statistics vs. the mean DNA
methylation (Beta) value (22198 features, 26 colon cancer samples) 19
Figure 2.2 ROC curves for 7 filtering methods (2 groups, 200 informative
features out of 2000, 200 samples, 100 simulated data sets) 22
Figure 2.3 Misclassification rates of RPMM cluster analysis using top filtered
features for 7 filtering methods (2 groups, 200 informative features out of 2000,
200 samples, 100 simulated data sets) 23
Figure 2.4 Heatmaps of RPMM cluster analysis using top 1000 filtered features
by A) TM-GOF or B) SD-b methods using 26 colon cancer samples (data set #1) 28
Figure 2.5 Heatmaps of RPMM cluster analysis using top 1000 filtered features
by TM-GOF (A,C) or SD-b (B,D) methods using 95 kidney cancer-non-cancer
samples (data set #4) 29
Supplemental Figure 2.1 Distribution of statistics in Colon Cancer data set
(data set #1) 40
Figure 3.1 Parameter setting for simulating fixed coefficients 61
Figure 3.2 Data generation work flow in simulation 62
Figure 3.3 Data analyses work flow in simulation 63
Figure 3.4 Density plots of raw simulated methylation data in Scenario 1 with
batch effect simulated 67
Figure 3.5 Density plot (A) and PCA plot (B) of Quantile Normalized raw
simulated methylation data in Scenario 1 68
Figure 3.6 PCA plots of ComBat corrected methylation data or residuals in
Scenario 1 after correction by different BE method 69-70
Figure 3.7 ROC curves of all BEC methods for scenarios with different levels of
POI effect (Scenarios 1-3; 100 simulated data sets) 71
Figure 3.8 ROC curves of all BEC methods for scenarios with different levels of
Batch effect (Scenarios 1, 4, and 5; 100 simulated data sets) 73
Figure 3.9 Boxplots of sensitivity and specificity of all BEC methods for
scenarios with different levels of POI effect (Scenarios 1-3; 100 simulated data
sets) 74
Figure 3.10 Boxplots of sensitivity and specificity of all BEC methods for
scenarios with different levels of Batch effect (Scenarios 1, 4, and 5; 100
simulated data sets) 75
Figure 3.11 ROC curves of all BEC methods for scenarios with different levels
of correlation between POI and covariate (Scenarios 1a, 1b, and 1c; 100
simulated data sets) 76
Figure 3.12 ROC curves of all BEC methods for scenarios with different levels
of correlation between POI and the Batch variable (Scenarios 1d, 1e, and 1f; 100
simulated data sets) 77
Figure 3.13 Boxplots of sensitivity and specificity of all BEC methods for
scenarios with different levels of correlation between POI and covariate
(Scenarios 1, 1a, and 1b; 100 simulated data sets) 79
Figure 3.14 Boxplots of sensitivity and specificity of all BEC methods for
scenarios with different levels of correlation between POI and the Batch variable
(Scenarios 1, 1d, 1e, and 1f; 100 simulated data sets) 80
Figure 3.15 Boxplots of False Positive Rate (FPR) of all BEC methods for
scenarios under the null (Scenario 0; 100 simulated data sets) 83
Figure 3.16 PCA plots of T1D data or residuals after correction by different
BEC method. 85-87
Supplemental Figure 3.1 PCA plots of simulated methylation data or residuals
after correction by different BE method (Scenario 1) 98-99
Supplemental Figure 3.2 PCA plots of T1D data or residuals from linear
regression using ComBat corrected data 100-101
Supplemental Figure 3.3 PCA plots of T1D data, or ComBat corrected data, or
residuals after correction by different BE method and removing 803 probes on
Chromosomes X or Y 102-105
Abstract
DNA methylation plays critical roles in a variety of biological processes, such as aging and carcinogenesis. DNA methylation data from Illumina Infinium platforms are measured as the proportion of total intensities due to methylated alleles. These values are bounded between 0 and 1 and are assumed to follow a Beta distribution. Because of the special properties of this distribution, methods that work well for Gaussian-distributed data, such as gene expression microarray data, may not perform equally well on DNA methylation data.
This dissertation addresses two issues we encounter when dealing with high
throughput DNA methylation data from the Infinium platforms. One is how to perform feature selection for cluster analysis to discover novel sample subgroups; the other is how to correct batch effects that arise when samples are processed and measured in different batches. The methods used in these two processes to strengthen the biological signal of interest are evaluated and compared using both simulated and real data sets.
Chapter I
Introduction
This chapter introduces background material, including the characteristics of the DNA methylation data platform used and the motivations for the topics studied in this dissertation.
1.1 DNA methylation and Illumina HumanMethylation
BeadArrays
DNA methylation is a chemical modification of DNA cytosine at CpG or CpH (H
= A or T or C) sites and can be inherited in cell division. It is widely involved in many
aspects of normal development and carcinogenesis through gene expression regulation
(Bergman and Cedar, 2013; Cantone and Fisher, 2013; Suva et al., 2013). Large-scale
DNA methylation arrays, such as Illumina's BeadArray technology, analyze 27,578 targeted CpGs located in the promoter regions of 14,495 genes on the
HumanMethylation27 platform (HM27), and 482,421 CpGs distributed across promoter,
5’UTR, first exon, gene body, and 3’UTR of 17,820 genes on the HumanMethylation450
platform (HM450) (Bibikova et al., 2011). Overall, the HM450 platform covers 96% of
CpG islands with their island shores and their flanking regions (www.illumina.com).
There are two probe designs used by the HumanMethylation arrays, Infinium I and
Infinium II. For the Infinium I bead design, two bead types are designed for each targeted
cytosine, one for the methylated and one for the unmethylated alleles. For the Infinium II
design, both alleles are measured using a single bead type with elongation using different
fluorescence dyes to indicate DNA methylation state, green for a methylated and red for
an unmethylated nucleotide. All HM27 probes use the Infinium I design, while the HM450 platform uses both designs, with 135,501 Infinium I probes and 350,076 Infinium II probes. Using Illumina's platform, DNA methylation is measured from fluorescence intensities as the ratio of the methylated signal to the sum of the methylated and unmethylated signals. This ratio is commonly known as the beta value, a continuous
variable bounded between zero and one, where zero represents an unmethylated locus
and one represents a fully methylated locus. Under the assumption that the methylated and unmethylated fluorescence intensities approximately follow gamma distributions on the same scale, these beta values follow a beta distribution (Houseman et al., 2008).
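As a concrete sketch of the beta-value computation just described (the intensity numbers here are made up for illustration, not real array data):

```python
import numpy as np

# Hypothetical methylated (M) and unmethylated (U) fluorescence intensities
# for four CpG loci.
M = np.array([120.0, 5000.0, 30.0, 2500.0])
U = np.array([4800.0, 150.0, 25.0, 2500.0])

# Beta value: fraction of total signal due to the methylated allele,
# bounded between 0 (unmethylated) and 1 (fully methylated).
beta = M / (M + U)
```

Equal methylated and unmethylated intensities, as in the last locus, give a beta value of exactly 0.5.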
Our goal is to develop novel methods for the analysis of DNA methylation data
generated using the Illumina HumanMethylation arrays. In Chapter II, we are interested
in developing a variable selection routine for performing cluster analysis. In Chapter III,
we are interested in evaluating methods for batch effect correction.
1.2 Cluster analysis
Cluster analysis is a method of finding latent homogeneous groups in data. It can
be applied to a wide variety of research areas. Psychologists use it to identify
subcategories of diseases (Graham et al., 2013; Machulda et al., 2013; Paykel and
Henderson, 1977; Sar et al., 2010) and marketing researchers use it to analyze customer
characteristics that can benefit future marketing strategy (Geeroms et al., 2008).
Biologists use cluster analysis to analyze high throughput genetic data in order to identify
clustered phenotypes and/or genes with similar expression patterns (Boutros and Okey,
2005; Croner et al., 2009; Ehler et al., 2011; Harding et al., 2010). For high-throughput DNA methylation data, two-dimensional cluster analysis can also be used to classify tumor
subtypes and/or DNA markers (Fernandez-Tajes et al., 2013; Noushmehr et al., 2010;
Wang et al., 2011). For example, using cluster analysis, Weisenberger et al. identified a new 5-gene panel as a "CpG island methylator phenotype" (CIMP) marker in colorectal tumors. When tumors were classified by these markers, nearly all CIMP-positive tumors were found to carry the BRAF mutation, along with several other distinct characteristics (Weisenberger et al., 2006).
There are two approaches to cluster analysis: one is algorithm-based and the other model-based. Examples of algorithm-based methods include hierarchical clustering and K-means clustering (Dougherty et al., 2002). In these methods, samples
are assigned to clusters based on some measure of distance. Model-based methods, on the
other hand, assume the data are from a mixture of parametric distributions (Houseman et
al., 2008; Koestler et al., 2013b; Koestler et al., 2010; McLachlan, 1992). These are often
fit by maximum likelihood approaches. An issue common to all clustering methods is the
problem of variable selection. With technologies available today, we can measure
thousands of molecular features from a single tissue sample. If the main purpose of
cluster analysis is to group the samples, the question is which molecular features are
informative of the underlying subgroups. Several studies have shown that probes that do
not vary across samples are uninformative and can attenuate cluster signals (Bourgon et
al., 2010; Tadesse et al., 2005; Tritchler et al., 2009). Therefore, reducing variable
dimension is recommended in order to improve the clustering results.
There are two types of variable selection methods for cluster analysis. One is
called a “filter” approach which selects probes before the cluster analysis. This is a pre-
processing step that is independent of the cluster algorithm (Dash et al., 2002; Jouve and
Nicoloyannis, 2005). The other approach is called a “wrapper” approach, which
integrates variable selection with cluster analysis and evaluates it through a cluster
algorithm (Goh and Kasabov, 2005; Maugis et al., 2009; Wang and Zhu, 2008). This
approach depends on the clustering algorithm, and sometimes requires the number of
clusters to be pre-specified. In addition, it is computationally intensive for high-dimensional microarray data. Consequently, in our research, we will focus on "filter"
methods for variable selection. These are particularly useful if one's primary interest is a quick visualization of the data.
1.3 Variable selection by filtering
A number of filtering methods have been proposed for gene expression studies.
Some consider probes individually, while others consider them as sets. Several filtering methods based on gene sets have been developed; however, they too are computationally intensive (Dash et al., 2002; Jouve and Nicoloyannis, 2005). In practice, the most commonly used filter methods are based on a single feature. For many microarray studies, cluster
analysis of the samples is applied to the subset of most variant probes across samples
(Christensen et al., 2009; Houseman et al., 2008; Kim et al., 2009b; Noushmehr et al.,
2010; Tothill et al., 2008). For normally distributed data, the variance of the measures (on
the log2 scale) is independent of the mean. However, when the data follow a beta
distribution, as they do for DNA methylation measured as a proportion, the variance is
associated with their mean. The probes with the highest variance are located in the middle of the mean range (mean beta value around 0.5). In this case, a filtering criterion based on the variation of the data will be biased toward picking probes with mean methylation values in the middle of the beta scale. To avoid this bias, we propose
several variance transformation based probe filtering methods that can eliminate the
variance-mean dependence of Infinium DNA methylation data. These newly developed
methods use different summary statistics of the cumulative distribution function, the
variance stabilizing transformation for a Beta distribution (Rocke, 1993).
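The variance-mean dependence described above can be illustrated numerically: under the mean/precision parameterization of the Beta distribution, the variance μ(1−μ)/(φ+1) is maximized at μ = 0.5 for any fixed precision φ (a sketch; the precision value here is arbitrary):

```python
import numpy as np

def beta_variance(mu, phi):
    """Variance of a Beta distribution with mean mu and precision
    phi = alpha + beta."""
    return mu * (1 - mu) / (phi + 1)

phi = 10.0
mus = np.linspace(0.05, 0.95, 19)
variances = beta_variance(mus, phi)

# The most variable probes sit in the middle of the beta scale, which is
# why a plain SD filter is biased toward probes with mean methylation
# near 0.5.
assert abs(mus[np.argmax(variances)] - 0.5) < 1e-9
```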
1.4 Batch Effect Correction
When we generate DNA methylation data using the HumanMethylation platform,
systematic technical effects can be introduced into the final data, reflecting one or more of: experimental conditions (such as different bisulphite conversion efficiencies or different machines), time (such as different days), and/or laboratory personnel differences (such as different technicians). If the samples are randomized, these factors should not be
associated with the biological Phenotype Of Interest (POI). However, if not addressed
appropriately, they may distort the association between the POI and DNA methylation.
Many methods have been developed to solve this issue in the context of gene
expression microarrays (Alter et al., 2000; Gagnon-Bartsch and Speed, 2012; Johnson et
al., 2007; Leek et al., 2012; Leek and Storey, 2007; Teschendorff et al., 2011). Gene expression levels, measured by fluorescence intensity signals, are assumed to follow a Gaussian distribution after log transformation; thus, they satisfy the assumptions of all of these Batch Effect Correction methods. However, DNA methylation
values on the HumanMethylation platform are measured as the ratio of intensities from methylated alleles over the total intensities from both methylated and unmethylated alleles, and are assumed to follow a Beta distribution. Thus, these methods may behave differently.
Here we would like to compare the performance of several commonly suggested batch
effect correction methods on the analysis of DNA methylation data. We evaluate the
methods in a simulation study and apply them to an HM27 data set.
Chapter II
Variable filtering methods for Beta-distributed data
In this chapter, we developed several non-specific variable filtering methods for Beta-distributed data and compared them with feature selection methods commonly used for DNA methylation data. We used simulation and real data to evaluate the
performance of these methods in enriching informative features and identifying sample
clusters.
2.1 Background and Motivation
Non-specific filtering of variables, the selection of a subset of variables based on
a characteristic unrelated to the outcome of interest, is often applied for dimension
reduction in high-dimensional data sets. The approach can be used for both supervised
and unsupervised analysis. For differential expression, non-specific filtering of features
prior to hypothesis testing can increase power to detect differentially expressed genes
(Bourgon et al., 2010). For cluster analysis, it can improve sensitivity of finding disease
clusters (Tadesse et al., 2005). The most common filter methods use a measure of
variance to rank variables, hoping to enrich for features that vary due to biological signal
(Houseman et al., 2008; Kim et al., 2009a; Tothill et al., 2008). In these settings, the
variance of the (possibly log-transformed) data is usually independent of the mean.
However, when studying DNA methylation measured as a proportion, this may not
always be the case, and alternate filter statistics may be preferred.
DNA methylation is an epigenetic mark that varies between different cell types,
correlating with DNA packaging within the cell and facilitating cell-type specific
function. Today, Illumina’s DNA methylation BeadArrays allow for high-throughput
analysis, with their most recent platform measuring hundreds of thousands of targeted
loci for large numbers of samples. On Illumina’s platform, DNA methylation is measured
by the percentage of total fluorescence due to methylation. This value is bounded
between 0 and 1, and can be modelled using a Beta distribution. To perform cluster
analysis on such data, Houseman et al. developed a recursive partitioning mixture model
(RPMM) for Beta-distributed data. They applied a non-specific filter prior to cluster
analysis, using the standard deviation of the Beta values (Houseman et al., 2008).
However, for Beta-distributed data, the variance is a function of the mean, and a standard
deviation filter will bias the selection of most variable features towards those having a
mean near the middle of the 0 to 1 scale. This bias could favour selecting features that
show cell-type specific methylation and be desirable for clustering subgroups of normal
tissue, as CpG methylation at cell-type specific marks can be sensitive to shifts in the
distributions of cells driving the associations of CpG methylation with sample characteristics (e.g. disease state or age) (Houseman et al., 2012; Koestler et al., 2013a;
Reinius et al., 2012). However, at the same time, a bias that favours selecting features
with mean DNA methylation levels near 0.5 can be problematic in studies to discover
cancer subtypes where aberrant DNA methylation may only be observed in a small
fraction of tumors (Noushmehr et al., 2010; Weisenberger et al., 2006). When the subset
of tumors with aberrant profiles is rare, the average DNA methylation level across a set
of tumors for sites that are normally unmethylated is closer to the normal state of 0 than
0.5. At the same time, the average level for sites that are normally methylated is closer to
1. When filtering variables based on standard deviation, clusters having only a few
samples may not separate distinctly from the rest. To decrease the association between
the mean and variance of methylation proportions measured on the Illumina platform, Du
et al. (Du et al., 2010) propose using a logit transformation (on the log2 scale). We
explore alternate transformations that take the Beta-distribution explicitly into account. In
particular, we consider methods making use of the cumulative distribution function, the
variance stabilizing transformation for a Beta distribution (Rocke, 1993).
We compare different filtering methods in a collection of real data sets generated
on either Illumina’s HumanMethylation27 or HumanMethylation450 platform. The
variety of examples considered will allow us to evaluate filter methods across different
data distributions and structures. We find that the properties of the data set, specifically
the fraction of samples in a subtype, or the variation of features within groups, can lead to
very different behaviour of some filtering methods.
2.2 Methods
2.2.1 Illumina HumanMethylation BeadArrays
Illumina’s BeadArray technology analyzes more than 27,000 targeted CpGs on
the HumanMethylation27 (HM27) platform, and over 480,000 on the
HumanMethylation450 (HM450) (Bibikova et al., 2011). Whereas the HM27 array
primarily targets CpGs in promoter regions, the HM450 expands coverage of exons, gene
bodies, and 3′UTRs, targeting sites in 99% of RefSeq genes (Bibikova et al., 2011). At
each targeted position, the quantity of methylated (M) and unmethylated (U) DNA is
measured by fluorescence intensity, and the proportion of methylated target is
summarized by the average Beta value = M/(M + U), a value bounded by 0 and 1. Many
such targets show skewed distributions near the boundary conditions, motivating the use
of a Beta distribution for statistical modelling (Houseman et al., 2008). As not all targeted
sites show variation in proportion of DNA methylation, our goal is to use non-specific
filtering to reduce the dimension of variables in our analysis.
We refer to each targeted CpG site on the array as a feature, and evaluate a
number of methods for ranking features and filtering for dimension reduction. Several of these methods explicitly use parameters from, or variance-stabilizing transformations of, the Beta distribution.
2.2.2 Beta Distribution
For a single feature, we model the distribution of DNA methylation across
independent samples using a Beta distribution. Let $X \sim \mathrm{Beta}(\alpha,\beta)$, with density $f(x) = \frac{1}{B(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1}$, where $B(\alpha,\beta) = \int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du$ and $\alpha,\beta > 0$. The mean and variance are given by $\mu = \frac{\alpha}{\alpha+\beta}$ and $\sigma^2 = \frac{\mu(1-\mu)}{\alpha+\beta+1}$, respectively, with the variance a function of the mean. A useful reparameterization is $\mathrm{Beta}(\mu,\phi)$, with $\phi = \alpha + \beta$ a precision parameter independent of the mean.
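A quick numerical check of these moment formulas against SciPy's closed-form Beta moments (a sketch; the parameter values are arbitrary):

```python
from scipy.stats import beta as beta_dist

alpha, beta_p = 2.0, 8.0
mu = alpha / (alpha + beta_p)                # mean: alpha / (alpha + beta)
var = mu * (1 - mu) / (alpha + beta_p + 1)   # variance is a function of the mean

# SciPy's closed-form moments agree with the formulas above.
assert abs(beta_dist.mean(alpha, beta_p) - mu) < 1e-12
assert abs(beta_dist.var(alpha, beta_p) - var) < 1e-12
```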
Transformations of the data can also lead to independence of the mean and
variance. Du et al. (Du et al., 2010) proposed the M-value, a (log2) logit transformation
of the methylation proportion. However, the true variance stabilizing transformation for the Beta distribution is the cumulative distribution function (CDF), $Y = F_B(X;\alpha,\beta) = \frac{1}{B(\alpha,\beta)} \int_0^x t^{\alpha-1}(1-t)^{\beta-1}\,dt$ (Rocke, 1993). If the original data $X$ follow a Beta distribution, the transformed data $Y$ will follow a Uniform distribution with mean $1/2$ and variance $1/12$. Any lack of fit of a single Beta distribution would suggest that the data arise from a mixture of Betas. We measure lack of fit using the distance of our transformed data from their expected distribution. Two filters below rank features using the CDF-transformed data, $Y = \mathrm{CDF}(X)$.
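The variance-stabilizing property of the CDF transform can be sketched with simulated data (illustrative only; the sample size and parameters are arbitrary):

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
alpha, beta_p = 2.0, 8.0
x = beta_dist.rvs(alpha, beta_p, size=100_000, random_state=rng)

# Probability integral transform: if X ~ Beta(alpha, beta), then
# Y = CDF(X) is Uniform(0, 1), with mean 1/2 and variance 1/12.
y = beta_dist.cdf(x, alpha, beta_p)

assert abs(y.mean() - 0.5) < 0.01
assert abs(y.var() - 1.0 / 12.0) < 0.01
```

If the data instead came from a mixture of Betas, the transformed values would deviate from these uniform moments, which is what the goodness-of-fit filters below exploit.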
2.2.3 Filter Methods
We evaluate a total of eleven filters, eight based on ranking single statistics, and
three based on a weighted score for combining ranks (Table 2.1). Filters 1 through 4 are
commonly used methods today: Filter 1, standard deviation of Beta values (SD-b); Filter
2, standard deviation of M-values (logit-transformed Beta values) (SD-m); Filter 3,
median absolute deviation of Beta values (MAD-b); Filter 4, DIP test, a measure of
unimodality of Beta values (DIP) (Hartigan and Hartigan, 1985).
Filters 5 through 8 are statistics that assume the data derive from a single Beta
distribution. Filter 5, inverse precision (also known as adjusted SD) (Bell et al., 2011),
ranks the data by an estimate of the inverse precision parameter, 1/φ̂ = 1/(α̂ + β̂),
with α and β estimated using method of moments estimators,

α̂ = [x̄²(1−x̄) − x̄s²] / s², β̂ = (1−x̄)[x̄(1−x̄) − s²] / s², (1)

where x̄ and s² are the mean and variance for a given feature. For this method the
features with lower precision have higher variation. Filter 6, Beta-Quantile Goodness-of-Fit
(BQ-GOF), is a comparison of the observed to theoretical quantiles from a Beta
distribution. It sums, for each feature, the absolute difference between corresponding
quantiles from the observed cumulative distribution function and the theoretical one
obtained using the estimated parameters α̂, β̂.
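A minimal Python sketch of the method-of-moments estimators in Equation (1); the Beta(3, 6) test data are hypothetical:

```python
import numpy as np

def beta_mom(x):
    """Method-of-moments estimates (alpha-hat, beta-hat) for one feature,
    following Equation (1), from the sample mean and variance."""
    xbar, s2 = x.mean(), x.var()
    alpha = (xbar**2 * (1 - xbar) - xbar * s2) / s2
    beta = (1 - xbar) * (xbar * (1 - xbar) - s2) / s2
    return alpha, beta

rng = np.random.default_rng(1)
x = rng.beta(3.0, 6.0, size=20000)   # one feature's simulated Beta values
a_hat, b_hat = beta_mom(x)

# Filter 5 ranks features by the inverse precision, 1/(alpha-hat + beta-hat).
print(a_hat, b_hat, 1 / (a_hat + b_hat))
```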
Filters 7 and 8 measure goodness-of-fit on the CDF-transformed data, Y =
CDF(X). Filter 7 is Beta-Transformed Moments Goodness-of-Fit (TM-GOF). The CDF-transformed
data are ranked using the distance of the mean, ȳ, and standard deviation, s_y,
from their expected values. The complete procedure is summarized as follows: 1. For
each feature, estimate α̂, β̂; 2. Compute Y = CDF(X); 3. Compute ȳ and s_y, the mean
and standard deviation of the transformed data; 4. Calculate s_ȳ and s_{s_y}, the standard
deviations of ȳ and s_y across all features; 5. Rank features by their standardized
Euclidean distance

√[ ((ȳ − 1/2)/s_ȳ)² + ((s_y − √(1/12))/s_{s_y})² ].

The features containing a mixture of Betas
will have larger standardized Euclidean distances compared to features that are from a
single Beta distribution. For Filter 8, Beta-Transformed Quantiles Goodness-of-Fit (TQ-GOF),
besides using the mean and SD of the CDF-transformed data as a pair of statistics
to measure lack of fit, we can use the quantile differences between the observed CDF of
Y and the theoretical CDF. Here, we rank the features by the sum of the absolute
differences of the corresponding quantiles. This is similar to the BQ-GOF (Filter 6) except
the quantiles, instead of the cumulative quantiles, are compared for the CDF-transformed
data.
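The five TM-GOF steps can be sketched as follows (in Python rather than R; the mixture used to spike feature 99 is a hypothetical illustration, loosely patterned on a 23% subgroup):

```python
import numpy as np
from scipy import stats

def tm_gof_ranks(X):
    """TM-GOF sketch for an (n_features, n_samples) matrix of Beta values:
    fit each feature by the method of moments (Equation 1), CDF-transform,
    then rank by the standardized Euclidean distance of (mean, SD) of the
    transformed values from their expected center (1/2, sqrt(1/12))."""
    xbar, s2 = X.mean(axis=1), X.var(axis=1)
    alpha = (xbar**2 * (1 - xbar) - xbar * s2) / s2          # step 1
    beta = (1 - xbar) * (xbar * (1 - xbar) - s2) / s2
    Y = stats.beta.cdf(X, alpha[:, None], beta[:, None])     # step 2
    ybar, sy = Y.mean(axis=1), Y.std(axis=1)                 # step 3
    s_ybar, s_sy = ybar.std(), sy.std()                      # step 4
    d = np.hypot((ybar - 0.5) / s_ybar,
                 (sy - np.sqrt(1 / 12)) / s_sy)              # step 5
    return np.argsort(-d)                                    # worst fit first

rng = np.random.default_rng(2)
X = rng.beta(2.0, 8.0, size=(100, 200))            # 99 single-Beta features ...
X[99] = np.concatenate([rng.beta(8.0, 2.0, 46),    # ... and one mixture feature,
                        rng.beta(2.0, 8.0, 154)])  # mimicking a 23% subgroup
print(tm_gof_ranks(X)[0])
```

In this toy example, the mixture feature should rank near the top of the list while the single-Beta features do not.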
Table 2.1 Description of feature filtering methods

1. SD-b: Standard deviation of Beta values. SD = sqrt((1/N) Σ (Xi − mean(X))²), computed on Beta values.
2. SD-m: Standard deviation of M-values. SD = sqrt((1/N) Σ (Xi − mean(X))²), computed on M-values.
3. MAD: Median absolute deviation of Beta values. MAD = median(|Xi − median(X)|).
4. DIP: Measure of unimodality in a sample. DIP is the maximum difference, over all sample points, between the empirical distribution function and the unimodal distribution function that minimizes that maximum difference.
5. Precision: Inverse precision parameter, 1/φ̂, where φ̂ = mean(X)(1 − mean(X))/SD² − 1.
6. BQ-GOF: Beta Quantile Goodness-of-Fit. Calculated by summing the absolute differences, over 25 quantile points, between the empirical distribution function and the expected Beta distribution function.
7. TM-GOF: Transformed Moment Goodness-of-Fit. Calculated as the Euclidean distance between the empirical standardized transformed moments and the expected center of the transformed moments, (1/2, sqrt(1/12)).
8. TQ-GOF: Transformed Quantile Goodness-of-Fit. Calculated by summing the absolute differences, over 25 quantile points, between the empirical cumulative distribution function and the expected cumulative Beta distribution function (the Uniform distribution function).
9. BR: Best rank of the 8 single filter methods. Uses the best (highest) rank value across the 8 single filter methods.
10. AR: Average of the top 2 ranks. Uses the average of the best two rank values of the 8 single filter methods.
11. WAR: Weighted average of the top 4 ranks. Uses the weighted average of the best four rank values of the 8 single filter methods (weights 4:3:2:1).
Filters 9 through 11 are summaries of the ranks from the individual statistics used
above. Filter 9, Best Rank (BR), selects as its statistic the top rank across the eight
statistics; Filter 10, Average Rank (AR), averages the top 2 ranks; and Filter 11,
Weighted Average Rank (WAR), averages the top four ranks using weights 4:3:2:1,
respectively.
2.2.4 Simulation Study
We perform a simulation study to evaluate the ability of the eleven non-specific
filters to enrich a ranked list of features with those informative of subgroups. The data are
simulated from distributions observed in our colon cancer data set (data set #1; see Real
data sets in Appendix). In this data set, 6 out of 26 subjects (23%) contain a
hypermethylation profile known as the CpG island methylator phenotype (CIMP),
determined using a separate technology (Weisenberger et al., 2006).
We simulate DNA methylation data from Beta distributions with
parameters (μ_ij, φ_ij), where i = 1 or 2, for the CIMP and non-CIMP subsets, and j =
1,…,2000 indicates the feature. A random 10% of features are selected to be informative,
with (μ_1j, φ_1j) and (μ_2j, φ_2j) estimated from the CIMP and non-CIMP subgroups,
respectively. For the non-informative features, a single set of parameters (μ_·j, φ_·j) is
estimated from the non-CIMP subgroup only, and used for both subgroups (i = 1, 2). We
simulate 200 samples, considering sample size ratios of 1:9, 1:1, and 9:1. As the feature
characteristics vary between the two groups (e.g., CIMP cancers show higher mean and
variance of measures on average compared to non-CIMP cancers, Supplemental Figure
2.1 A-C), the ratios 1:9 and 9:1 can represent very different scenarios. For 100 replicate
data sets, we rank the 2000 features based on the different filter methods. For each data
set, the same 1800 distributions are used for the non-informative features, and a new
random sample of 200 features is selected for the informative features. For each data set
and each filter statistic, we rank the features by the statistic, and compute sensitivity and
specificity for identifying the 200 differentially methylated features, for feature lists of all
possible lengths. The average sensitivity and specificity is computed over the 100
replicate data sets for each filter, and presented using receiver operating characteristic
(ROC) curves.
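The simulation design can be sketched as follows; the (μ, φ) ranges below are hypothetical stand-ins for the parameter values estimated from the colon cancer data:

```python
import numpy as np

rng = np.random.default_rng(3)
n_features, n_informative = 2000, 200
n1, n2 = 100, 100  # sample size ratio 1:1 (CIMP : non-CIMP)

def beta_params(mu, phi):
    """Convert the (mean, precision) parameterization to (alpha, beta)."""
    return mu * phi, (1 - mu) * phi

# Hypothetical (mu, phi) pairs standing in for estimates from the real data.
n_null = n_features - n_informative
mu1, phi1 = rng.uniform(0.5, 0.9, n_informative), rng.uniform(2, 10, n_informative)
mu2, phi2 = rng.uniform(0.1, 0.5, n_informative), rng.uniform(5, 20, n_informative)
mu0, phi0 = rng.uniform(0.1, 0.9, n_null), rng.uniform(5, 20, n_null)

rows = []
for j in range(n_informative):        # informative: a different Beta per subgroup
    a1, b1 = beta_params(mu1[j], phi1[j])
    a2, b2 = beta_params(mu2[j], phi2[j])
    rows.append(np.concatenate([rng.beta(a1, b1, n1), rng.beta(a2, b2, n2)]))
for j in range(n_null):               # non-informative: one Beta for all samples
    a0, b0 = beta_params(mu0[j], phi0[j])
    rows.append(rng.beta(a0, b0, n1 + n2))
X = np.vstack(rows)                   # features 0..199 are the informative ones
print(X.shape)
```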
In addition, we compare the sample misclassification rates of the different filtering
methods when performing cluster analysis using a Recursive-Partitioning
Mixture Model (RPMM) (Houseman et al., 2008). RPMM was designed for clustering
DNA methylation signatures, and clusters samples using a mixture of Beta distributions
in a recursive partitioning routine. For each filter method, cluster analysis of the samples
is performed on the top 100, 200, and 400 ranked features (5%, 10%, 20%, respectively),
when the true percentage of informative probes was 10% of all features. In the results we
will see that all filter methods performed well when the distributions of the informative
features mimicked the distributions observed in the real data set. We attributed this to a
very strong cluster signal from the subset of informative features. In an attempt to
differentiate the performance of the filter methods, we restricted the distributions of the
informative features to those from a subset of features showing a smaller effect size. We
defined the effect size of a feature by θ̂ = ln[ (μ̂_2j/(1−μ̂_2j)) / (μ̂_1j/(1−μ̂_1j)) ], and sampled 200 informative
features from the subset with |θ̂| ≤ 1, or |θ̂| ≤ 0.5. The 1800 non-informative features
remain unchanged from the earlier ROC-curve evaluation. The limit on the effect size of
the informative features reduced the cluster signal in the data, and resulted in greater
variation in performance among the different filtering approaches.
2.2.5 Real Data sets
Finally, we apply our filtering methods to eight Illumina data sets (Table 2.2). The
data sets were selected to span a variety of biological conditions and include data
generated on the HM450 and HM27 platform.
Table 2.2 Description of data sets used in application analysis

Data set 1: Colon cancer; HM27; source: local; 19,965 probes after preprocessing; 20 non-CIMP vs. 6 CIMP.
Data set 2: Glioblastoma; HM27; source: TCGA, plates 1, 2, 3, 10; 20,549 probes; 74 non-CIMP vs. 12 CIMP.
Data set 3: Glioblastoma; HM450; source: TCGA, plates 79, 111, 130; 374,601 probes; 93 non-CIMP vs. 6 CIMP.
Data set 4: Kidney; HM27; source: TCGA, plate 64; 21,624 probes; 50 KIRC vs. 45 normal.
Data set 5: Kidney; HM450; source: TCGA, all KIRC; 374,708 probes; 283 KIRC vs. 160 normal.
Data set 6: Breast; HM27; source: TCGA, plate 93; 21,787 probes; 37 Infiltrating Ductal Carcinoma vs. 20 normal.
Data set 7: Breast; HM450; source: TCGA, plate 109; 377,853 probes; 56 Infiltrating Ductal Carcinoma vs. 17 normal.
Data set 8: Normal blood; HM450; source: GEO GSE40279, plate 2; 383,911 probes; 84 blood samples.
We selected three cancer-only data sets, four tumor-normal tissue data sets, and
one normal blood data set. The three cancer-only data sets include one colon and two glioblastoma, cancer
types known to have a distinct subtype defined by the CpG island methylator phenotype
(CIMP). The four tumor-normal tissue data sets include two kidney and two breast data
sets. The normal blood data set is selected to evaluate the filter methods in the presence
of the strong quantitative risk factor, age. Further details are provided in Appendix and
Supplemental Materials (Wang et al., 2014). All of the data are anonymized, and this
study did not require institutional review board approval.
We perform RPMM cluster analysis on different lists of filtered features to assess
the ability of different filtering approaches to identify (1) cancer subtypes, (2)
cancer/normal tissue types, or (3) young from old individuals. For each data set we
applied the 11 filtering methods, each time selecting the top 1000 features for RPMM
clustering. We also considered a hybrid variable selection approach, in which we pool the
top 500 features selected from two filter methods above (SD-b and TM-GOF). These
filters are chosen because they each perform well for different biological conditions
studied (see Results below). For data sets 1–7 the misclassification rate was computed by
comparing the top two clusters to the known tissue types. We also evaluated the cluster
agreement between the clusters identified by the RPMM routine (typically varying from
2–6 groups) with our two known tissue classes using the adjusted rand index. For data set
8, we evaluated the differences in mean age between the two major cluster groups. To
evaluate the general effect of filtering the data, we analyzed the HM27 data sets without
any filtering and the HM450 data sets after selecting a random subset of 1000 features.
Feature reduction for the HM450 data was necessary, as the cluster analysis software
required a lower dimensional data set in order to run. Also, since the HM450 clustering
results varied with random selection of features, we repeated the analysis ten times and
report the average misclassification error rate over the ten replicates (data sets #1–#7).
For two filter methods that show good performance (SD-b and TM-GOF, filters
#1 and #7), we report the frequency of the top selected features by genomic context. On
the HM27 array the selected probes are characterized using the UCSC definition of CpG
island and the gene based definitions provided by Illumina: “promoter”, “transcribed
region”, “exonic region”, and “intronic region”. For the HM450 array we stratified four
gene-based categories (Promoter/Exon/Intron/Intergenic) (Weinstein et al., 2014) by their
position relative to a CpG island (hg19 UCSC definition).
All analyses are performed using the R programming language 2.15.2
(http://www.rproject.org). Infinium data were processed using the methylumi package in
Bioconductor, using a combination of Normal-Exponential background correction, dye
bias equalization, and beta-mixture quantile normalization (BMIQ) to remove technical
artifacts (Teschendorff et al., 2013; Triche et al., 2013). With a goal of discovering latent
disease subtypes, we removed features occurring on the X and Y chromosomes, which
would be enriched for sex-related variation, and features with other data quality issues
(e.g., those containing a common SNP within 10 bp of the target CpG that may misrepresent
the DNA methylation level, or mapping to multiple regions of the hg19 genome and lacking target
specificity; common SNPs are defined as having MAF > 0.01 in dbSNP build 135 per the UCSC
snp135common track). After pre-filtering, 22,198 CpG targets remain on the HM27 array,
and 384,310 on the HM450.
2.3 Results
2.3.1 Colon Cancer Data
Figure 2.1 shows the relationships between six filter statistics and mean DNA
methylation level in a study of 26 colon cancer tissues.
Figure 2.1 Smoothed scatter plots showing six filter statistics vs. the mean DNA
methylation (Beta) value (22198 features, 26 colon cancer samples)
A. SD-b: standard deviation on beta values; B. SD-m: standard deviation of M-values; C.
1/Precision: inverse of precision parameter; D. BQ-GOF: Beta Quantile Goodness-Of-Fit;
E. TM-GOF: Transformed Moment Goodness-Of-Fit; F. TQ-GOF: Transformed Quantile
Goodness-Of-Fit. Redline in each figure indicates the median statistic values.
In this collection of heterogeneous cancers (23% CIMP and 77% non-CIMP), we
see a strong relationship between standard deviation and the mean value (Figure 2.1A),
and selecting features with high variation (SD-b) biases the selection to those with mean
near 0.5. This relationship is reduced for the alternate filter statistics (Figures 2.1 B-E).
Therefore, depending on the filter statistic employed, a different set of top ranked
features may be retained for statistical evaluation.
2.3.2 Simulation Study
2.3.2.1 Enrichment of Ranked Lists
The ROC curves from the analysis of simulated data show the ability of the filters
to enrich the top ranking genes with those truly informative of subgroup (Figure 2.2). The
features that enrich the top of the list will show increasing sensitivity under low false-
positives, falling above the diagonal line, and having an area under the curve (AUC)
greater than 0.5. Figure 2.2 shows that for all scenarios considered, filters that combine a
Beta variance stabilization transformation with a goodness-of-fit statistic (TQ-GOF, TM-
GOF, BQ-GOF) appear to enrich the most highly ranked features with ones informative
for cluster subgroup. The filters SD-b, SD-m, and Precision appear non-informative, with
95% confidence intervals for the AUC containing 0.5 (Supplemental Table 2.1). The
figures also show that for the informative filters, the greatest enrichment occurs when the
subgroups have equal sample size. This is to be expected, as equal sample sizes will give
the greatest power to detect differential DNA methylation in supervised analyses.
Between the two analyses with unequal sample sizes, the better discrimination occurs
when the group with larger variance has the larger sample size (Figure 2.2 C & F).
In ROC analyses, another quantity of interest is the partial AUC, the AUC for a
given false positive rate. In this setting, fixing the error rate will give a variable number
of features for different filters. Instead, we select a fixed number of features (p) to discuss
the sensitivity and specificity. This approach reflects how non-specific filtering is
performed in practice. The solid black diagonal line in Figures 2.2 D-F indicates the
estimated sensitivity and specificity levels for the top 100 features. The diagonal line
connects the boundary points indicating the maximum true-positive fraction (y-axis) for 0
false-positives (x-axis) and the maximum false positive fraction for 0 true-positives. In
the simulation, the true number of informative features is always 200 out of 2000. Thus,
the maximum possible true-positive fraction is 0.5, corresponding to all 100 features
selected being true positives (100/200 true-positives and 0/1800 false-positives); the
maximum possible false-positive fraction is 0.056 (100/1800 false-positives and 0/200
true positives). The diagonal lines in Figures 2.2 D-F connect the coordinates for these
boundary points: (0,0.5) and (0.056,0), respectively. We estimate the sensitivity and
specificity for the top ranked 100 features from each filter by the coordinates where the
ROC curve crosses the diagonal line. These intersection points provide a clearer
comparison of enrichment for the top ranking features than is evident when viewing the
entire ROC curve. For subgroups of equal size, the TQ-GOF statistic shows the greatest
sensitivity and specificity in selecting informative features for a fixed number of features.
For unequal-sized subgroups, the methods TM-GOF and BQ-GOF performed
competitively.
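The boundary-point arithmetic for a fixed list of 100 features can be checked with a short sketch (`sens_spec_at_k` is an illustrative helper, not part of the dissertation's code; the scores are hypothetical):

```python
import numpy as np

def sens_spec_at_k(scores, truth, k=100):
    """Sensitivity and specificity when the k top-ranked features are
    called positive; `truth` flags the informative features."""
    selected = np.argsort(-scores)[:k]
    tp = truth[selected].sum()
    sens = tp / truth.sum()
    spec = 1 - (k - tp) / (~truth).sum()
    return float(sens), float(spec)

truth = np.zeros(2000, dtype=bool)
truth[:200] = True   # 200 informative features out of 2000

# Best case: a filter that ranks every informative feature on top.
best = np.where(truth, 1.0, 0.0)
print(sens_spec_at_k(best, truth))   # sensitivity 100/200 = 0.5, specificity 1

# Worst case: all 100 selected features are false positives.
worst = np.where(truth, 0.0, 1.0)
print(sens_spec_at_k(worst, truth))  # false-positive fraction 100/1800 ≈ 0.056
```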
Figure 2.2 ROC curves for 7 filtering methods (2 groups, 200 informative features
out of 2000, 200 samples, 100 simulated data sets)
For each data set the sensitivity and specificity of selecting informative features using the
top ranked list (1–2000 features) are averaged over 100 replications. Figures A-C show
ROC curves for 7 listed filtering methods: SD-b, Precision, SD-m, BQ-GOF, TM-GOF,
TQ-GOF, and BR (best rank) under different sample ratio scenarios: A. Sample size ratio
9:1 (non-CIMP/CIMP); B. Sample size ratio 1:1; C. Sample size ratio 1:9. The bottom
three panels D-F are partial ROC curves obtained from the panels A-C by restricting the
axis ranges to the region relevant to the diagonal line. The solid black diagonal line in
Figures D-F indicates the estimated sensitivity and specificity levels for a list of 100 genes.
2.3.2.2 Cluster Analysis
Applying cluster analysis to the data simulated in Figure 2.2, all methods
performed nearly perfectly (results not shown). Presumably this is due to the selection of
a few features with very large signal between the two cancer subtypes. To introduce
variation in behaviour, we reduced the effect sizes for the informative features in the
simulation (see Methods). Figure 2.3 shows the misclassification error rates from a
cluster analysis of data simulated under these reduced effect sizes. For all scenarios, TM-
GOF and TQ-GOF performed best among all single statistic methods. We see that the
Precision filter, SD-b and SD-m performed worst. Filters MAD and DIP also performed
poorly (data not shown). Regardless of the number of features retained, the Precision
filter was unable to find the correct subgroups in the cluster analysis. For other methods,
the misclassification rates increased as we increased the number of features in the
analysis. Among the three summary methods, BR and AR performed similarly to the best
single filter methods (AR data not shown); the WAR filter did not perform as well (data
not shown). These results are consistent with previous ROC curves (Figure 2.2).
Figure 2.3 Misclassification rates of RPMM cluster analysis using top filtered
features for 7 filtering methods (2 groups, 200 informative features out of 2000, 200
samples, 100 simulated data sets)
Misclassification rates, averaged over 100 simulations, from a cluster analysis performed
using RPMM for the top 100, 200, or 400 features of seven different filtering methods
under different sample size ratios. A*. Sample size ratio 9:1 (non-CIMP/CIMP); B*.
Sample size ratio 1:9; C* & D**. Sample size ratio 1:1. For A*-C*, informative features
have effect sizes smaller than 1; for D**, informative features have effect sizes
smaller than 0.5.
We note that the maximum error rate for each panel in Figure 2.3 depended on the
sample sizes in the two subgroups. When there are no clear clusters in the data, RPMM
tends to find one big cluster. This resulted in a maximum error rate of 10% (=20/200) for
sample size ratios of 1:9 and 9:1 (Figures 2.3 A and B) and an error rate of 50%
(=100/200) when the sample sizes were equal (Figures 2.3 C and D).
2.3.3 Real Data Application
Table 2.3 shows the misclassification rates of RPMM cluster analysis after
variable filtering on a variety of cancer data sets. Error rates are also reported in the
absence of filtering, or after random feature selection. For the colon and glioblastoma
cancer data sets containing CIMP and non-CIMP cancer subtypes, (the colon data set
being the one our simulation study mimicked), TM-GOF and TQ-GOF consistently
showed low misclassification rates, as expected from our simulation study. For the data
sets having both cancer and non-cancer tissues, these same filters performed much worse
than other methods (Table 2.3). For the kidney samples (data sets #4 and #5), all methods
except TM-GOF and TQ-GOF performed well, including no filtering for the HM27 data
and random filtering for the HM450. Interestingly, for the breast tissues (data sets #6 and
#7), SD-b and SD-m performed best. None of the summary-based filter methods ever
performed better than the best single-filter method. They also performed inconsistently
across the different data sets. The hybrid approach that combined an equal number of top
SD-b probes with TM-GOF probes showed some merit. Sometimes the hybrid selection
scheme behaved as well as the best single filter method (Colon data set #1, Kidney data
sets #4, #5), and at other times it resulted in error rates that were intermediate between
the other two (Glioblastoma data set #3, Breast cancer data sets #6, #7). Overall, the TM-
GOF and TQ-GOF methods consistently performed best for identifying the CIMP
subgroup in cancer data, while the SD-b filter performed best at distinguishing cancer
from non-cancer tissue.
Table 2.3 Misclassification rate of RPMM cluster analysis to find 2 groups using different variable filtering methods (top 1000 features)

Tissue types and samples by data set:
  1 Colon cancer (HM27): 20 non-CIMP vs. 6 CIMP
  2 Glioblastoma (HM27): 74 non-CIMP vs. 12 CIMP
  3 Glioblastoma (HM450): 93 non-CIMP vs. 6 CIMP
  4 Kidney (HM27): 50 KIRC vs. 45 non-cancer
  5 Kidney (HM450): 283 KIRC vs. 160 non-cancer
  6 Breast (HM27): 37 breast cancer vs. 20 non-cancer
  7 Breast (HM450): 56 breast cancer vs. 17 non-cancer

Misclassification rate by filter (columns: data sets 1-7):
  No filter          0.31   0.22   NA     0      NA     0.12   NA
  Random*            0.34   0.27   0.40   0.004  0.005  0.12   0.20
  SD-b               0.19   0.07   0.49   0      0.02   0      0.12
  SD-m               0.12   0.07   0.42   0.02   0.03   0.12   0.08
  MAD                0.38   0.35   0.49   0      0.005  0      0.14
  DIP                0.23   0.36   0.45   0      0.005  0      0.14
  Precision          0.08   0      0.10   0.03   0.01   0.11   0.22
  BQ-GOF             0.19   0      0.07   0      0.01   0.25   0.23
  TM-GOF             0.08   0.02   0.06   0.36   0.47   0.44   0.49
  TQ-GOF             0.08   0.03   0.06   0.35   0.47   0.44   0.48
  BR                 0.12   0.02   0.11   0.02   0.02   0.23   0.19
  AR                 0.08   0.06   0.11   0.02   0.02   0.25   0.19
  WAR                0.12   0.07   0.45   0.02   0.01   0.11   0.10
  SD-b + TM-GOF**    0.08   0.07   0.20   0.05   0.01   0.26   0.36

NA = not applicable; too many features for RPMM to run.
* Average from 10 analyses of randomly sampled feature sets.
** Combine top 500 SD-b + top 500 TM-GOF features.
To visualize the different performance of our top filters, TM-GOF and SD-b, we created
heatmaps of the top 1000 filtered features following RPMM clustering. Figures 2.4 and 2.5 show
the clusters identified for the colon cancer data set (data set #1, tumor-only) and the TCGA
kidney data set (data set #4, tumor and non-tumor kidney), respectively. Using TM-GOF in the
colon cancer data, our subcluster identified 4 out of 6 CIMP samples, leaving 2 CIMP samples
misclassified in our first split of the data (Figure 2.4A). (We note that we use the word
“misclassified” loosely, as our definition of CIMP is likely not a gold standard (see Real data
sets in Appendix)). Using SD-b as the filter, the first split identified two clusters more equal in
sample size (11 and 15), misclassifying 5 non-CIMP samples (Figure 2.4 B). Interestingly, in the
next split of the data all 6 CIMP samples separated themselves from the others, appearing
together in one of the four sub-clusters (Figure 2.4 B). Thus the CIMP subtype was found by SD-b
filtering, but only after further sub-clustering. Using the adjusted Rand index, a measure of the co-clustering
of sample pairs by cluster category and tissue label, the TM-GOF filter showed
superiority over the SD-b filter because of the smaller number of clusters estimated by the
clustering method (Supplemental Table 2.2). In Figures 2.4 A and B, 47 out of 1000 features are
shared by the two filter methods. For data set #4 (Figure 2.5), the SD-b filter resulted in the
successful identification of tumor and normal kidney samples at the first division (Figure 2.5 B).
Using TM-GOF, the first cluster division identified a subtype of 4 cancer samples (Figure 2.5 A,
light green) with high DNA methylation in a subset of features. However, a second division of
the data resulted in the separation of non-cancer tissues from the cancer samples (red bar). Thus,
again we found the substructure sought in the cluster analysis, but not until the second division
of the clusters. In Figures 2.5 A and B, only 4 of the 1000 features overlapped. Interestingly,
using different feature selection methods the cluster substructure became similar for both data
sets, even when the first split found different subgroups. This common substructure in the data
set was captured by the adjusted rand index (Supplemental Table 2.2). However, this same
cluster substructure was not reproduced on the HM450 platform.
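The adjusted Rand index used above can be computed directly from the cluster-versus-tissue contingency table; a self-contained sketch (the label vectors are hypothetical, loosely patterned on the 6 CIMP / 20 non-CIMP colon samples):

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: chance-corrected agreement between two
    partitions of the same samples (1 = identical, ~0 = random)."""
    # Contingency table of co-membership counts between the two labelings.
    a_vals, a_inv = np.unique(labels_a, return_inverse=True)
    b_vals, b_inv = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    for i, j in zip(a_inv, b_inv):
        table[i, j] += 1
    comb2 = lambda x: x * (x - 1) / 2          # "n choose 2", elementwise
    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(labels_a))
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Hypothetical labels: 6 CIMP / 20 non-CIMP tissue classes vs. a 4-cluster
# result that isolates CIMP only at a later split (as in Figure 2.4 B).
tissue = np.array([0] * 6 + [1] * 20)
clusters = np.array([2] * 6 + [0] * 10 + [1] * 5 + [3] * 5)
print(adjusted_rand_index(tissue, tissue))    # identical partitions -> 1.0
print(adjusted_rand_index(tissue, clusters))  # partial agreement, between 0 and 1
```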
Figure 2.4 Heatmaps of RPMM cluster analysis using top 1000 filtered features by A) TM-
GOF or B) SD-b methods using 26 colon cancer samples (data set #1)
Rows represent features and columns represent samples; yellow represents high DNA
methylation and blue represents low. The color bars at the top of the columns indicate sample
tissue types (row 1) and clusters (row 2). In row 1 dark and light green indicate CIMP and non-
CIMP tumors, respectively. In row 2 red, yellow, blue and green bars indicate the sample clusters
found after two divisions of clustering using RPMM. In Figure A, the red and yellow clusters are
identified at the second division, and no subdivision of the blue cluster is found. In Figure B, the
red and yellow clusters separate in the second division, as do the blue and green clusters.
Figure 2.5 Heatmaps of RPMM cluster analysis using top 1000 filtered features by TM-
GOF (A,C) or SD-b (B,D) methods using 95 kidney cancer-non-cancer samples (data set #4)
Rows represent features and columns represent samples; yellow represents high DNA
methylation and blue represents low. The color bars at the top of the columns indicate sample
tissue types (row 1) and clusters (row 2). In row 1 dark and light green indicate cancer and non-
cancer samples, respectively. In row 2 red, yellow, blue and green bars indicate the sample
clusters found after two divisions of clustering using RPMM. Figures A & B show heatmaps of
all 95 kidney samples using top 1000 features filtered by TM-GOF or by SD-b method,
respectively. Figures C & D show heatmaps of 50 kidney tumors using top 1000 features filtered
by TM-GOF or by SD-b method, respectively. In C, the blue and green bar clusters are found at
the second separation.
We also asked whether omitting features with outlier values might discover larger
clusters than the small disease subset discovered when filtering using TM-GOF (Figure 2.5 A).
We omitted all features with values greater than or less than the median Beta value ± 3 times the
interquartile range (IQR), excluding a relatively large number of features from the kidney data
set (data set #4). We then selected the top 1000 features using TM-GOF, performed cluster
analysis, and found perfect discrimination of cancer and non-cancer kidney (Figure not shown).
This confirmed that the TM-GOF filter favoured features that identified disease subtypes prior to
its selection of features identifying tissue disease state.
This same removal of features with outliers, followed by filtering the top 1000 features
and cluster analysis, was performed on data sets #5-#7. Each time we found TM-GOF was able
to find disease state clusters at the first division just as well as SD-b (results not shown).
However, for the CIMP cancer data sets (data sets #1-#3), a pre-filtering of features with outliers
(using the median ± 3 × IQR criterion) resulted in equally bad clustering for all filtering methods
(data not shown). These results suggest that broad pre-filtering using this definition removed
features informative for a CIMP classification.
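One reading of this median ± 3 × IQR pre-filter, computed per feature, can be sketched as follows (the Beta(50, 50) data and the planted extreme value are hypothetical):

```python
import numpy as np

def drop_outlier_features(X):
    """Pre-filter sketch: drop any feature (row) containing a value
    outside that feature's median +/- 3 * IQR."""
    med = np.median(X, axis=1, keepdims=True)
    q75, q25 = np.percentile(X, [75, 25], axis=1)
    iqr = (q75 - q25)[:, None]
    keep = ((X >= med - 3 * iqr) & (X <= med + 3 * iqr)).all(axis=1)
    return X[keep], keep

rng = np.random.default_rng(4)
X = rng.beta(50.0, 50.0, size=(500, 95))  # 500 features, 95 samples
X[0, 0] = 0.999                           # plant one extreme value in feature 0
filtered, keep = drop_outlier_features(X)
print(keep[0], filtered.shape[0])         # feature 0 is dropped
```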
Next we asked if we could find the same tumor substructure in the kidney data set if we
had started with the 50 kidney tumors only. In general, this appeared to be the case. Using TM-
GOF as our filter, we identified a distinct subgroup of five tumors, four of which were those
previously identified when clustering the larger set of tumor and non-tumor tissue (Figure 2.5 C).
A subset of 50 features (5%) is shared by the two analyses, selected in the top 1000 features of
both data sets (Figures 2.5 A and C). When using features selected using SD-b, the tumors are
separated into subgroups of size 12 and 38 (Figure 2.5 D). The cluster of five tumors identified
from the TM-GOF filter is a subset of the cluster of 12 identified using the SD-b filter; a subset of 44
of the 1000 features (4.4%) were selected by both filter methods (Figures 2.5 C and D).
Table 2.4 Mean age in two clusters identified by RPMM using different filtering methods on blood samples of a normal population

Filter method      n (group 1)  n (group 2)  Mean age 1  Mean age 2  Difference  t-test p-value
MAD                84           0            59.54       0           NA          NA
DIP                84           0            59.54       0           NA          NA
SD-b               52           32           57.10       63.50       6.40        0.03
SD-m               64           20           57.92       64.70       6.78        0.04
Precision          40           44           57.73       61.18       3.46        0.25
BQ-GOF             62           22           58.21       63.27       5.06        0.09
TM-GOF             75           9            59.11       63.11       4.00        0.38
TQ-GOF             75           9            59.11       63.11       4.00        0.38
BR                 57           27           57.89       63.00       5.11        0.08
AR                 56           28           57.38       63.86       6.48        0.03
WAR                55           29           57.04       64.28       7.24        0.01
SD-b + TM-GOF*     56           28           57.09       64.43       7.34        0.02

* Combine top 500 SD-b + top 500 TM-GOF features.
The last data set analyzed was whole blood (data set #8), selected to compare the
different filter methods when there is structure due to age, a continuous variable. Table 2.4
summarizes the mean age in the top two clusters after filtering the top 1000 features and
performing RPMM in 84 whole blood samples. The number of samples in each cluster varied
considerably depending on the filtering method. Selecting features ranked by SD-b resulted in
two groups approximately similar in size (52 vs 32) whereas selecting features ranked by MAD
resulted in the detection of only a single cluster (n = 84) (Table 2.4). The difference in mean age
between the top two clusters was greatest when using the hybrid SD-b and TM-GOF filtering
method (difference = 7.3 years, p = 0.02) or WAR (difference = 7.2 years, p = 0.01), and the
filters SD-b, SD-m, and AR all recovered differences in mean age around 6.5 years (p = 0.03-
0.04). Two filters using the Beta-distribution variance stabilizing transformation (TM-GOF and
TQ-GOF) tended to find groups most unequal in sample size, but not statistically significantly
associated with age (p > 0.05). Interestingly, the Precision filter found groups with nearly
balanced sample size, but did not discriminate samples by age (difference = 3.5, p = 0.25).
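The cluster-age comparison reduces to a two-sample t-test on the ages in the two clusters; a sketch with simulated ages (the individual ages are hypothetical, patterned on the SD-b cluster sizes and means in Table 2.4; a Welch test is shown, though the dissertation does not specify the t-test variant):

```python
import numpy as np
from scipy import stats

# Simulated ages standing in for the two RPMM clusters found with SD-b
# (sizes 52 and 32; Table 2.4). The age values themselves are hypothetical.
rng = np.random.default_rng(5)
age1 = rng.normal(57.1, 12.0, size=52)
age2 = rng.normal(63.5, 12.0, size=32)

t_stat, p_value = stats.ttest_ind(age1, age2, equal_var=False)  # Welch t-test
print(round(age2.mean() - age1.mean(), 1), round(p_value, 3))
```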
We report the genomic context of the features selected by TM-GOF and SD-b, our two
best performing filter methods (Supplemental Table 2.3 A & B). For the colon cancer and
glioblastoma data sets (#1-3), both filter methods enriched for features in CpG islands, which
makes sense for detecting a CpG island methylator phenotype cancer subtype. In the cancer
versus normal tissue comparisons, the different filters prioritized different feature subsets. For
the HM27 data, SD-b prioritized features from non-CpG island regions while TM-GOF still
prioritized features from CpG island regions. Both filters enriched for features from exonic
regions, with only SD-b from the Kidney data set (#4) preferentially selecting features in
promoters. For the HM450 Kidney cancer vs non-cancer tissue comparison, both filter methods
over-represented non-CpG island features; however, TM-GOF selected more from exons and SD-b
selected more from introns and intergenic regions. The Breast cancer versus non-cancer tissue
(data set #7) showed the greatest variation in enrichment categories by the TM-GOF and SD-b
filter methods. The better performing SD-b filter selected intergenic features, both inside and
outside CpG islands. In general, across all HM450 data sets the TM-GOF filter selected more
features from CpG island promoters than the SD-b filter (Supplemental Table 2.3 B). At the
same time, the SD-b filter selected more features from non-CpG island intergenic regions. Thus
the two filters were sensitive to prioritizing features in different regions of the genome. This
likely explains their different performance for clustering samples from different biological
conditions.
We comment briefly on the effect of HM450 data processing on feature selection. We
present results for data processed using a combination of background correction, dye-bias
correction (Triche et al., 2013) and BMIQ normalization (Teschendorff et al., 2013). We performed
analyses both with and without BMIQ normalization and saw a strong enrichment of design type 1
features prior to BMIQ normalization. Following BMIQ normalization, the distribution of
selected design type 1 and type 2 features aligned more closely with the distribution on the array.
Interestingly, despite the different probe types selected before and after normalization, the
distribution of features by genomic context varied little (results not shown). Thus we believe the
genomic context of a feature is a stronger predictor of feature selection than the platform feature
design type.
2.4 Discussion
We used both simulated and real data to evaluate the performance of a number of
variable filtering methods for cluster analysis, when the variables are proportions that are
bounded on the 0 to 1 scale. Both the simulated and real data show that TM-GOF and TQ-GOF
are the best at identifying a subset of cancers having the CpG island methylator phenotype
(CIMP). The new filters that use a Beta variance stabilizing transformation are very sensitive to
outlier measurements. This may benefit the search for low frequency cancer subtypes that have
extreme values occurring across a large number of features (e.g. CpG island methylation
phenotype), but may not translate to an ability to directly cluster cancer versus normal tissues
well. For clustering cancer versus normal tissue, an outlier removal step was required before the
tissue clusters could be properly recovered. In general, the cancer-based simulation study results
were not generalizable to the clustering of normal tissue, or tumor versus non-tumor tissue,
suggesting that the filter methods are sensitive to the variation in observed DNA methylation
distributions due to the underlying biology.
Overall, SD-b performed very well in the real data examples including normal tissues.
One explanation could be that the SD-b filter enriches for features in regions having cell-type
specific DNA methylation differences. It tended to find groups of approximately equal size,
finding a separation of groups by mean age for the normal blood samples that was statistically
significant (difference = 6.4 years, p = 0.03); in the cancer and non-cancer studies it identified
clusters based on disease state in the first partitioning of samples. Although at first glance it
appeared to perform poorly in detecting the CIMP subtype in the colon cancer data, upon further
partitioning of the data the sub-cluster of interest appeared. It is unknown if this is a coincidence
from the data selected in this study. Although the SD-b filter did not show a high AUC in the
simulation study, it did cluster the samples perfectly using RPMM when we did not set
boundaries on the largest effect size, suggesting that the cluster analysis can be strongly
influenced by a small number of highly informative features.
We found the SD-b and TM-GOF filters tended to prioritize features in different areas of
the genome. For each HM450 data set, TM-GOF selected a higher number of features in CGI
promoters compared to SD-b, while SD-b selected a larger number of features in non-CGI
intergenic regions. The better performance of TM-GOF for detecting a CIMP subtype in cancer,
and SD-b for clustering normal tissues, suggests that features in different regions of the genome
are not equally informative for all biological conditions. A recent study reported a novel
classification of breast cancer using markers of normal-breast epithelial cell subtypes (Santagata
et al., 2014), which might explain our superior classification of cancer versus normal breast using
SD-b, the filter performing best in normal tissues.
The SD-m filter was a close competitor to SD-b, but Precision showed an unexplained
sensitivity to tissue type. Both Precision and SD-m performed slightly better than SD-b in the
analysis of CIMP cancers and nearly as well in clustering cancer and non-cancer kidney and
breast. In the analysis of whole blood, SD-m found clusters that correlated with age, but
Precision did not. Because of this unexplained sensitivity of Precision to non-cancer tissue we do
not recommend its general use.
One filter method not included in our comparison is the arcsine-square-root transformation,
also a good variance stabilization method for data bounded between 0 and 1. Similar to the
logit transformation, the arcsine-square-root transformation can be written as an incomplete
beta function having only one parameter (Rocke, 1993). Thus we would expect a filter based on
its standard deviation to behave similarly to our filter using the logit transformed data (SD-m).
Another statistic that could be used as a filter is the SD ratio,
SD(X) / [mean(X)(1 - mean(X))]. For a Beta-distributed variable X, this ratio is equivalent to
the (inverse) precision method used in this paper (filter #5).
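Under the Beta assumption this equivalence can be made explicit. As a sketch (the square-root form of the denominator below is our choice, made so that the correspondence with the precision is exact), write ν = α + β for the precision of X ~ Beta(α, β):

```latex
\mu = \mathrm{E}(X) = \frac{\alpha}{\alpha+\beta}, \qquad
\mathrm{Var}(X) = \frac{\mu(1-\mu)}{\nu+1}
\quad\Longrightarrow\quad
\frac{\mathrm{SD}(X)}{\sqrt{\mu(1-\mu)}} = \frac{1}{\sqrt{\nu+1}},
```

a strictly decreasing function of ν alone, so ranking features by this ratio is equivalent to ranking them by inverse precision.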
Although the summary-based filtering methods take advantage of the top-ranked single
filters, they are not always more robust than the single-filter methods, because the best rank of a
feature can be driven by a single non-informative filter. Thus, given the different (and somewhat
complementary) characteristics of the features enriched by the SD-b and TM-GOF methods, we
prefer to apply both SD-b and TM-GOF when the main purpose of cluster analysis is to identify
novel subgroups. Our results suggest that SD-b is very robust in enriching for features that
identify large subgroups, while TM-GOF and TQ-GOF are very sensitive in enriching for
features that identify low frequency cancer subtypes with outlying values occurring across a
large number of features (e.g. the CpG island methylator phenotype).
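The hybrid filter (top 500 SD-b plus top 500 TM-GOF features, as in Supplemental Table 2.2) amounts to pooling two ranked lists. A minimal sketch, assuming each filter statistic is stored as one score per feature with larger values ranked higher (function and variable names here are our own, for illustration only):

```python
import numpy as np

def hybrid_top_features(sd_b, tm_gof, k_each=500):
    """Pool the top-ranked features from two filter statistics.

    sd_b, tm_gof: one score per feature (larger = more informative).
    Returns the sorted union of feature indices selected by either filter.
    """
    top_sd = np.argsort(sd_b)[::-1][:k_each]     # top k by SD-b
    top_gof = np.argsort(tm_gof)[::-1][:k_each]  # top k by TM-GOF
    return np.union1d(top_sd, top_gof)           # union, sorted

# Toy example with 4 features and k_each = 2
sd_b = np.array([0.9, 0.1, 0.5, 0.8])    # top 2: features 0 and 3
tm_gof = np.array([0.2, 0.7, 0.9, 0.1])  # top 2: features 2 and 1
selected = hybrid_top_features(sd_b, tm_gof, k_each=2)  # -> [0, 1, 2, 3]
```

Because the union may be smaller than 2 × k_each when the two filters agree, the pooled list size itself is an informal measure of how complementary the filters are.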
We noticed that in the data sets comparing DNA methylation in cancer to non-cancer
tissue, the differences in standard deviation (SD) between sample groups are not symmetrically
distributed: the majority of features have a much higher SD in cancer samples than in normal
samples. In data sets contrasting non-CIMP vs. CIMP cancer subtypes, however, the SD
differences are symmetrically distributed with mean around zero (data not shown). This suggests
that the SD of beta values may be more informative in data sets with large SD differences
between subgroups than in data sets with SD differences balanced around zero.
One limitation of our simulation design is that for each tissue subgroup our measures are
simulated to follow a beta distribution and the best performing filter methods make proper use of
this knowledge. In reality, a mixture of betas might yield a more realistic measure from a
population of mixed cell types. However, instead of simulating this added complexity which
would require additional model assumptions, we chose to look for patterns of behaviour from the
analysis of a variety of real data sets that spanned different biological conditions (e.g. tumor
only, tumor versus normal, or single normal tissue). This evaluation shows: (1) the top two filter
methods, TM-GOF and SD-b, prioritize features from different parts of the genome, (2) TM-
GOF is much more susceptible to outlier measures, and (3) that the underlying biology can drive
their performance.
One question not addressed in this study is the number of features to carry forward to the
cluster analysis. One might plot the filter statistics to see if they show a bimodal distribution,
suggesting subgroups of features with different behaviour. In our experience the statistics are
unimodal, so we tend to use a number of thresholds for feature selection (e.g. the top 1000,
2000, or 5000 features) and evaluate the robustness of our groups across the different feature
lists.
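This threshold-robustness check can be sketched as follows. We use scikit-learn's KMeans as a stand-in for RPMM (which is an R package) and the adjusted Rand index to compare the partitions obtained from different top-k feature lists; all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_stability(beta, filter_stat, thresholds=(1000, 2000, 5000),
                      n_clusters=2, seed=0):
    """Cluster samples on the top-k features for several k and report the
    pairwise adjusted Rand index between the resulting partitions.

    beta: features x samples matrix of beta values.
    filter_stat: one score per feature (larger = more informative).
    """
    order = np.argsort(filter_stat)[::-1]         # features, best first
    labels = []
    for k in thresholds:
        sub = beta[order[:k], :].T                # samples x top-k features
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels.append(km.fit_predict(sub))
    # agreement between every pair of feature lists (1.0 = identical groups)
    return [adjusted_rand_score(a, b)
            for i, a in enumerate(labels) for b in labels[i + 1:]]
```

Pairwise indices near 1 across thresholds indicate the discovered groups are insensitive to the exact feature-list size.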
2.5 Conclusions
We found that two filter statistics, SD-b and TM-GOF, outperform the rest in data
sets of differing nature. We suggest using both: cluster analysis serves the purpose
of class discovery, and the two methods tend to prioritize different sets of features, thus serving
as complementary feature enrichment methods for DNA methylation data.
2.6 Appendix
2.6.1 Real Data Sets
Table 2.2 describes eight data sets used to evaluate the different filtering methods in
conjunction with RPMM cluster analysis. As RPMM requires complete data for all features, we
omitted features having any missing beta values as determined by the detection p-value. Features
were also filtered based on quality control criteria listed in the methods section (e.g. SNP at
target CpG or nearest 10bps in feature body, 15-bp repetitive elements, etc.). The data sets and
total number of features after filtering are listed below.
Data set #1 is HM27 data from 26 colon cancers generated at the USC Epigenome
Center. A previous analysis of these samples using the MethyLight technology found a subset of
6 cancers to have the CpG island methylator phenotype (CIMP). This independent assessment of
CIMP subtype is used to assess our clustering results. Parameter values in our simulation study
were sampled from features in this data set (N=19,965 features).
Data sets #2 to #7 are downloaded from The Cancer Genome Atlas (TCGA) data portal.
We downloaded level 1 data and performed our own data processing as described in the methods
section.
Data set #2 is HM27 data (plates 1, 2, 3, and 10) of 86 glioblastoma cancers (GBM). A
hierarchical cluster analysis on 1362 out of the 1503 CIMP-related features published by
Noushmehr et al. (Noushmehr et al., 2010) identified a subset of 12 samples to have the
glioblastoma CpG island methylator phenotype (G-CIMP). This independent assessment of G-
CIMP subtype, using the clustering method applied by Noushmehr et al. in their original article,
is used to assess our clustering results in the full feature set, as well as after non-specific
filtering. (N=20,549 features total).
Data set #3 is HM450 data (plates 79, 111, and 130) of 99 glioblastoma cancers. Again,
hierarchical cluster analysis of the CIMP-related features provided by Noushmehr et al.
(Noushmehr et al., 2010) allowed us to identify a subset of G-CIMP samples in our data set
(n=6). This independent assessment of G-CIMP subtype is used to assess our clustering results
using the non-specific filtering methods described in this paper (N=374,601 features).
Data set #4 is HM27 data (plate 64) of 50 Kidney Renal Clear-cell (KIRC) cancer and 45
normal kidney tissues (N=21,624 features).
Data set #5 is HM450 data of 283 Kidney KIRC cancers and 160 normal kidney tissues
(N=374,708 features).
Data set #6 is HM27 data (plate 93) of 37 infiltrating ductal carcinomas and 20 normal
breast tissues (N=21,787 features).
Data set #7 is HM450 data (plate 109) of 56 infiltrating ductal carcinomas and 17 normal
breast tissues (N=377,853 features).
Data set #8 is HM450 data of normal whole blood samples from NCBI’s Gene
Expression Omnibus under accession number GSE40279 (Hannum et al., 2013). We restricted
our analysis to the 84 samples on plate 2 with subjects’ ages ranging from 28 to 86 years
(N=383,911 features).
2.6.2 Supplemental Figure and Tables
Supplemental Figure 2.1 Distribution of statistics in Colon Cancer data set (data set #1)
A. The histogram of mean differences between CIMP and non-CIMP samples for 22198 features.
B. The histogram of standard deviation differences between CIMP and non-CIMP samples for
22198 features. C. The histogram of log2(Odds Ratio) for 22198 features. The log2(OR) is
calculated using θ̂ = ln[(μ̂_2j / (1 - μ̂_2j)) / (μ̂_1j / (1 - μ̂_1j))], where μ̂_1j is the mean for
non-CIMP tumors and μ̂_2j is the mean for CIMP tumors. The red vertical line indicates the mean.
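For illustration, the per-feature log2 odds ratio plotted in panel C can be computed as follows (a hypothetical helper, not code used in the analysis):

```python
import math

def log2_odds_ratio(mu1, mu2):
    """log2 odds ratio of mean methylation for one feature.

    mu1: mean beta value in non-CIMP tumors; mu2: mean in CIMP tumors.
    """
    theta = math.log((mu2 / (1 - mu2)) / (mu1 / (1 - mu1)))  # natural-log OR
    return theta / math.log(2)                               # convert to log2

log2_odds_ratio(0.2, 0.8)  # OR = 16, so log2(OR) = 4.0
```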
Supplemental Table 2.1 Area under the curve (95% confidence interval) for simulation
results in Figure 2.2

            Sample ratio of group 1 : group 2
Method      180 non-CIMP : 20 CIMP   100 non-CIMP : 100 CIMP   20 non-CIMP : 180 CIMP
SD-b        0.52 (0.48, 0.57)        0.54 (0.49, 0.58)         0.51 (0.46, 0.56)
SD-m        0.52 (0.48, 0.57)        0.53 (0.48, 0.58)         0.48 (0.44, 0.52)
MAD         0.50 (0.46, 0.54)        0.50 (0.46, 0.54)         0.49 (0.45, 0.54)
DIP         0.49 (0.45, 0.54)        0.50 (0.45, 0.55)         0.51 (0.47, 0.55)
Precision   0.53 (0.48, 0.58)        0.54 (0.49, 0.58)         0.50 (0.45, 0.55)
BQ-GOF      0.56 (0.52, 0.61)        0.63 (0.59, 0.68)         0.57 (0.53, 0.62)
TM-GOF      0.57 (0.53, 0.62)        0.70 (0.66, 0.75)         0.62 (0.57, 0.66)
TQ-GOF      0.59 (0.55, 0.63)        0.73 (0.68, 0.77)         0.63 (0.59, 0.68)
BR          0.58 (0.54, 0.62)        0.69 (0.64, 0.73)         0.63 (0.58, 0.68)
AR          0.58 (0.54, 0.62)        0.70 (0.65, 0.74)         0.64 (0.59, 0.68)
WAR         0.57 (0.53, 0.62)        0.66 (0.62, 0.71)         0.61 (0.56, 0.65)
Supplemental Table 2.2 Adjusted Rand Index of RPMM cluster analysis result using a variety of filtering methods for
multiple data sets

Data set 1: Colon cancer, HM27, 19965 probes after preprocessing, 20 non-CIMP vs. 6 CIMP
Data set 2: Glioblastoma, HM27, 20549 probes after preprocessing, 74 non-CIMP vs. 12 CIMP
Data set 3: Glioblastoma, HM450, 374601 probes after preprocessing, 93 non-CIMP vs. 6 CIMP
Data set 4: Kidney, HM27, 21624 probes after preprocessing, 50 KIRC vs. 45 non-cancer
Data set 5: Kidney, HM450, 374708 probes after preprocessing, 283 KIRC vs. 160 normal
Data set 6: Breast, HM27, 21787 probes after preprocessing, 37 breast cancer vs. 20 normal
Data set 7: Breast, HM450, 377853 probes after preprocessing, 56 breast cancer vs. 17 normal

Method             DS1     DS2     DS3     DS4     DS5     DS6     DS7
No filter          0.12    0.22    NA      0.75    NA      0.71    NA
Filter top 1000 by:
Random*            0.07    0.21    0.003   0.74    0.46    0.64    0.27
SD-b               0.29    0.27    0.03    0.54    0.34    0.53    0.31
SD-m               0.29    0.27    0.08    0.67    0.31    0.49    0.35
MAD               -0.10    0.21    0.02    0.55    0.32    0.48    0.31
DIP                0.45    0.00   -0.04    0.54    0.33    0.52    0.31
Precision          0.32    0.34    0.09    0.52    0.25    0.52    0.29
BQ-GOF             0.27    0.22    0.15    0.88    0.31    0.42    0.27
TM-GOF             0.45    0.16    0.16    0.57    0.15    0.14    0.02
TQ-GOF             0.45    0.16    0.16    0.50    0.14    0.15    0.01
BR                 0.29    0.20    0.13    0.65    0.30    0.47    0.27
AR                 0.33    0.30    0.18    0.63    0.37    0.46    0.26
WAR                0.29    0.27    0.07    0.64    0.27    0.52    0.33
SD-b + TM-GOF**    0.30    0.29    0.08    0.67    0.38    0.46    0.21

* Average adjusted Rand index from 10 analyses of randomly sampled feature sets
** Combines the top 500 SD-b and top 500 TM-GOF features
Supplemental Table 2.3 Genomic context of the features selected by the top two filter methods

A. For HM27 platform (N/%)

                       All CpG   Filtered  Filter by SD-b              Filter by TM-GOF
                       Targets   Set       DS1    DS2    DS4    DS6    DS1    DS2    DS4    DS6
Total probes included  27578     22198     1000   1000   1000   1000   1000   1000   1000   1000
CpG island             41.8%     46.5%     75.4%  59.0%  21.9%  33.8%  74.4%  73.8%  79.6%  66.9%
Promoter region        52.2%     50.4%     39.4%  45.3%  53.9%  47.0%  43.2%  43.4%  38.5%  42.1%
Transcribed region     49.5%     51.3%     63.0%  56.2%  47.0%  53.9%  58.4%  58.0%  62.4%  60.6%
Exonic region          25.4%     26.9%     37.5%  31.5%  27.5%  31.3%  33.5%  31.8%  37.8%  34.2%
Intronic region        74.6%     73.1%     62.5%  68.5%  72.5%  68.7%  66.5%  68.2%  62.2%  65.8%

DS1 = Colon cancer; DS2 = Glioblastoma; DS4 = Kidney cancer vs. normal; DS6 = Breast cancer vs. normal.
B. For HM450 platform (N/%)

                       All CpG   Filtered  Filter by SD-b              Filter by TM-GOF
                       Targets   Set       DS3    DS5    DS7    DS8    DS3    DS5    DS7    DS8
Total probes included  482,421   384310    1000   1000   1000   1000   1000   1000   1000   1000
Type I design          28.1%     29.2%     56.3%  22.2%  36.0%  31.0%  49.4%  27.8%  41.8%  12.8%
Type II design         71.9%     70.8%     43.7%  77.8%  64.0%  69.0%  50.6%  72.2%  58.2%  87.2%
CGI Promoter           21.5%     24.8%     32.1%  5.4%   15.6%  10.4%  55.5%  19.1%  56.7%  16.5%
CGI Exon               2.5%      3.0%      5.4%   2.0%   2.8%   4.3%   3.0%   4.6%   2.9%   2.8%
CGI Intron             2.9%      3.3%      6.3%   2.2%   4.1%   5.1%   4.8%   3.4%   2.6%   2.4%
CGI Intergenic         4.2%      4.8%      14.1%  2.7%   9.0%   12.1%  9.2%   4.5%   4.8%   5.1%
non-CGI Promoter       22.2%     21.6%     16.2%  20.7%  20.2%  19.4%  9.3%   15.3%  14.6%  20.8%
non-CGI Exon           7.0%      7.9%      3.3%   7.3%   5.8%   3.9%   4.3%   11.7%  5.0%   8.1%
non-CGI Intron         20.9%     18.9%     8.3%   33.1%  21.4%  20.4%  9.9%   23.7%  9.5%   27.0%
non-CGI Intergenic     18.7%     15.5%     14.3%  26.6%  21.1%  24.4%  4.0%   17.7%  3.9%   17.3%

DS3 = Glioblastoma; DS5 = Kidney cancer vs. normal; DS7 = Breast cancer vs. normal; DS8 = Normal blood.
Chapter III
Comparison of Batch Effect correction
methods for Illumina Infinium DNA
methylation data
In this chapter, we investigate another important topic in the analysis of Illumina
Infinium DNA methylation data: Batch Effect (BE) correction. We use simulation to evaluate
and compare, on DNA methylation data, the performance of several existing BE correction
methods originally developed for gene expression data. We also apply them to a real DNA
methylation data set to illustrate their behavior in practice.
3.1 Background and Motivation
Technical effects are known to introduce a large amount of noise into microarray studies.
If samples are not randomized with respect to outcome, these effects can bias tests of the
variables of interest. Even when samples are randomized, the increased variation can
overwhelm the biological signal in the data if it is not taken into consideration in the analysis,
reducing the power of the statistical analysis.
A variety of methods have been developed to deal with batch effects in the context of
gene expression microarray experiments. In microarray studies, fluorescence intensity measures
level of expression (Chen et al., 2011), and the batch correction methods assume that the log of
the signal intensities follows a Gaussian distribution. However, in DNA methylation studies the
quantity of interest is usually the proportion of methylated alleles, captured by the proportion of
signal intensity due to methylated alleles. Early reports found that applying batch effect correction
(BEC) can eliminate bias and improve power to discover differentially methylated loci despite
this model misspecification (Sun et al., 2011). Comparisons of some BEC methods have also been
performed on a handful of Illumina DNA methylation (HumanMethylation27) data sets (Sun et al.,
2011; Teschendorff et al., 2011). Here we compare the performance of these commonly
suggested BEC methods on DNA methylation data using computer simulation. We also use
one published HumanMethylation27 (HM27) data set with a known batch effect to evaluate
them. In this study, we apply these BEC methods to the beta values.
3.2 Batch Effect Correction Methods
A variety of methods have been proposed to correct batch effects in microarray analysis.
Early methods include Singular-Value Decomposition (SVD) (Alter et al., 2000), Distance
Weighted Discrimination (DWD) (Benito et al., 2004), and model-based Location and Scale
(L/S) adjustments (Schadt et al., 2001). These have been extended by methods using
empirical Bayes (ComBat (Johnson et al., 2007)) or surrogate variables (SVA, Surrogate
Variable Analysis (Leek et al., 2012; Leek and Storey, 2007); ISVA, Independent Surrogate
Variable Analysis (Teschendorff et al., 2011); and RUV-2 and RUV-4, Removing Unwanted
Variation (Gagnon-Bartsch and Speed, 2012)). We compare the performance of the last five
methods, ComBat, SVA, ISVA, RUV-2 and RUV-4, using simulation. Whereas the first of these
requires the batch to be defined for model adjustment, the latter four do not.
The following notation is used in the descriptions of the BEC methods below.

m is the number of probes and n is the number of samples.
Y is the m x n matrix of DNA methylation data, with component Y_jib (or Y_ji) representing
the methylation value for probe j, sample i (from batch b).
j is the probe indicator, j = 1,...,m.
i is the sample indicator, i = 1,...,n.
b is the batch indicator, b = 1,...,B.
l is the latent factor indicator, l = 1,...,K.
K̂ is the number of estimated surrogate variables.
α is the m x 1 vector of overall DNA methylation, with component α_j representing the
overall DNA methylation of probe j.
X is the n x p design matrix for sample conditions, where p is the number of variables
(including the POI and covariates).
β_j is the 1 x p vector of regression coefficients corresponding to X for probe j.
G is the K x n matrix of latent factors, with component g_li representing the l-th latent factor
of sample i, l = 1,...,K; i = 1,...,n.
L is the m x K matrix of coefficients for G, with component L_jl representing the probe-j-specific
coefficient for g_li, j = 1,...,m; l = 1,...,K.
ε_jib (or ε_ji) is the error term for probe j, sample i (from batch b).
3.2.1 Combining Batches (ComBat)
ComBat uses an empirical Bayes (EB) approach to combine DNA methylation data run
in separate batches (Johnson et al., 2007). Similar to the location and scale (L/S) model proposed
by Schadt et al. (Schadt et al., 2001), this method assumes that batch effects have both additive
and multiplicative errors affecting the signal intensities. In the empirical Bayes implementation
of this model, the error terms for each probe have a batch-level error shared by samples run
within the same batch, with a distribution imposed on the batch-specific estimates. In this
method, batch effects are represented by the L/S model parameters that are estimated via
summarizing batch effect estimates across all probes.
We define an L/S model for the DNA methylation value of probe j, sample i from batch b as

Y_jib = α_j + β_j X_i^T + γ_jb + δ_jb ε_jib

where α_j is the overall DNA methylation for probe j, X_i is the 1 x p design row for sample i,
and β_j is the 1 x p vector of regression coefficients for probe j corresponding to X_i. The error
term ε_jib ~ N(0, σ_j²). Here γ_jb and δ_jb represent the additive and multiplicative batch effects
of batch b for probe j, respectively. Notice that the batch subscript b is specified explicitly in
the model.
There are three steps in applying ComBat.
Step I. Standardize the data so that, after removing the effects of the variables of interest,
all probe residuals in each batch have similar overall mean and variance.

The parameters α_j, β_j, and γ_jb are estimated for each probe using a probe-wise ordinary
least-squares approach, constraining Σ_b n_b γ̂_jb = 0 for all j = 1,...,m. σ_j² is then estimated by

σ̂_j² = (1/N) Σ_{i,b} (Y_jib - α̂_j - β̂_j X_i^T - γ̂_jb)².

We then standardize the data using

Z_jib = (Y_jib - α̂_j - β̂_j X_i^T) / σ̂_j.
Step II. Obtain the EB batch effect parameter estimates using parametric empirical priors.

We assume the standardized data Z_jib ~ N(γ_jb, δ_jb²), with prior distributions on the batch
effect parameters γ_jb ~ N(γ_b, τ_b²) and δ_jb² ~ Inverse Gamma(λ_b, θ_b). The hyperparameters
γ_b, τ_b², λ_b, θ_b are estimated empirically from the standardized data using the method of
moments. The EB estimates of the batch effect parameters γ_jb and δ_jb² are then given by the
conditional posterior means

γ*_jb = (n_b τ̄_b² γ̂_jb + δ*²_jb γ̄_b) / (n_b τ̄_b² + δ*²_jb)

and

δ*²_jb = (θ̄_b + (1/2) Σ_i (Z_jib - γ*_jb)²) / (n_b/2 + λ̄_b - 1).
Step III. Adjust the data for batch effects.

The final EB batch-adjusted data are calculated by

Y*_jib = (σ̂_j / δ̂*_jb)(Z_jib - γ̂*_jb) + α̂_j + β̂_j X_i^T.
The R function developed for this method, ComBat, is included in the sva package.
Note that the ComBat function accepts only categorical variables (factors), including the
batch variable, the POI, and any covariates whose effects we want to estimate.
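To make the location/scale idea concrete, here is a minimal sketch of the non-empirical-Bayes core of Steps I-III (no covariates, no shrinkage of the batch parameters; the full EB version is what the R ComBat function implements, and all names here are our own):

```python
import numpy as np

def ls_batch_adjust(Y, batch):
    """Minimal location/scale batch adjustment (no covariates, no EB shrinkage).

    For each probe: standardize, remove each batch's additive shift and
    rescale its residual spread, then restore the probe's overall mean and SD.
    Y: probes x samples matrix; batch: length-n array of batch labels.
    """
    Y = np.asarray(Y, dtype=float)
    batch = np.asarray(batch)
    alpha = Y.mean(axis=1, keepdims=True)        # overall probe mean (alpha_j)
    sigma = Y.std(axis=1, keepdims=True)         # overall probe SD (sigma_j)
    Z = (Y - alpha) / sigma                      # standardized data (Step I)
    out = np.empty_like(Z)
    for b in np.unique(batch):
        idx = np.where(batch == b)[0]
        gamma = Z[:, idx].mean(axis=1, keepdims=True)  # additive batch effect
        delta = Z[:, idx].std(axis=1, keepdims=True)   # multiplicative effect
        out[:, idx] = (Z[:, idx] - gamma) / delta      # remove both (Step III)
    return out * sigma + alpha                   # back to the original scale
```

After adjustment, each probe's per-batch means and spreads agree, while the probe's overall location is preserved; the EB step in ComBat additionally shrinks the per-probe gamma and delta estimates toward batch-level priors.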
3.2.2 Surrogate Variable Analysis (SVA)
Leek et al. developed a surrogate variable analysis (SVA) method to estimate surrogate
variables for latent expression heterogeneity such as batch effects (Leek and Storey, 2007, 2008).
These surrogate variables are then used in downstream linear regression to correct for unwanted
variation due to batch, and/or other unknown confounders.
The SVA method contains two parts: a first step that detects unmodeled factors and a
second step that constructs surrogate variables. The singular value decomposition (SVD)
is employed at both steps to decompose the data matrix. Two algorithms can be used to
estimate the surrogate variables: the two-step algorithm or the Iteratively Re-Weighted
(IRW) algorithm. The difference between the two lies in the second step, the construction of
surrogate variables. The two-step algorithm estimates each surrogate variable from the subset
of probes statistically associated with the corresponding unmodeled factor identified in the
first step. The IRW algorithm, instead of selecting a probe subset, iteratively uses the
empirical Bayes probability that a probe is associated only with the unmodeled factor as a
weight, and identifies the surrogate variable from the weighted full data set. IRW is the
default setting in the sva package and is used in our study. Below we describe the IRW
algorithm in detail.
This method models the DNA methylation data for probe j, sample i as

Y_ji = α_j + β_j X_i^T + Σ_{l=1}^{K} L_jl g_li + ε_ji

where α_j is the overall DNA methylation, X_i is the 1 x p design row for sample i, and β_j is
the 1 x p vector of regression coefficients for probe j corresponding to X_i. g_li is the l-th
unmodeled factor of sample i, and L_jl is the probe-j-specific coefficient for g_li. The error
term ε_ji is the true probe-specific independent noise, ε_ji ~ N(0, σ_j²).
There are four detailed steps in applying the SVA method with the IRW algorithm.

Step I. Regress the DNA methylation values on the design matrix X and obtain the residuals,
removing the effect of the primary variable on methylation. Form the m x n residual matrix R,
where the (j, i) element of R is r_ji = Y_ji - α̂_j - β̂_j X_i^T. Calculate the SVD of the residual
matrix, R = U D V^T, where the diagonal entries of D are the singular values of R. The columns
of U and V are, respectively, the left- and right-singular vectors for the corresponding singular
values.

Step II. Let e_k = (e_k1, ..., e_kn)^T be the k-th column (eigenvector) of V (k = 1,...,n).
Identify K̂, the number of significant eigenvectors of the residual matrix, using the
permutation-based algorithm of Buja and Eyuboglu (1992).
Step III. For each significant residual eigenvector e_k (k = 1,...,K̂), iteratively obtain the
empirical Bayes estimate of the probability that a probe is associated with this eigenvector but
not with the main variable of interest (Storey et al., 2005), and use it as a weight in an SVD of
the data matrix Y, repeating until the empirical Bayes probability estimates converge. Perform
a final weighted SVD of Y using these final probabilities as weights, and take the right
eigenvector ĥ_k that is most correlated with e_k, k = 1,...,K̂. That is, the linear combination of
unmodeled factors Σ_{l=1}^{K} L_jl g_li = Σ_{k=1}^{K̂} λ_jk ĥ_ki, where ĥ_1, ĥ_2, ..., ĥ_K̂ are
the "surrogate variables".

Step IV. The surrogate variables ĥ_1, ĥ_2, ..., ĥ_K̂ are then applied directly as covariates in
subsequent supervised regression analyses.
The R function developed for this method is sva and is included in the sva package.
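The residual-SVD step at the heart of SVA can be sketched as follows (only the decomposition step; the sva package adds the permutation test for K̂ and the IRW reweighting, and the names below are ours):

```python
import numpy as np

def surrogate_variables(Y, X, n_sv):
    """Estimate candidate surrogate variables by SVD of the residuals.

    Regress each probe on the design matrix X (with intercept), then return
    the top right-singular vectors of the residual matrix as surrogates.
    Y: probes x samples matrix; X: samples x p design matrix.
    """
    Xd = np.column_stack([np.ones(len(X)), X])       # add intercept (alpha_j)
    beta, *_ = np.linalg.lstsq(Xd, Y.T, rcond=None)  # probe-wise least squares
    R = Y - (Xd @ beta).T                            # residual matrix r_ji
    U, D, Vt = np.linalg.svd(R, full_matrices=False)
    return Vt[:n_sv].T                               # samples x n_sv surrogates
```

In a downstream analysis the returned columns would be appended to the design matrix, exactly as Step IV prescribes for the ĥ_k.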
3.2.3 Independent SVA
Independent Surrogate Variable Analysis (ISVA), a variant of SVA, estimates the
surrogate variables as statistically independent variables (Teschendorff et al., 2011). ISVA
relaxes the requirement of finding linearly uncorrelated factors, allowing the variables to be
uncorrelated in a non-linear sense. This is especially relevant for DNA methylation
data, given its non-Gaussian distribution. Both the SVA and ISVA methods assume a linear model
between the phenotype of interest (POI) and the data, with an additive latent confounding
(such as batch) effect. The SVA method uses the SVD to identify unmodeled factors in the
regression residuals that are uncorrelated with the POI on a linear scale. The ISVA method
instead uses independent component analysis (ICA) to identify unmodeled factors from the
regression residuals. ICA is a blind source separation technique that identifies the surrogate
variables as independent, meaning uncorrelated with the POI in a non-linear sense.
The ISVA method follows an algorithm similar to the two-step algorithm of SVA: its
steps I, II, and IV are analogous to those of the IRW algorithm described above, but step III
differs. In step III, ISVA performs ICA on the subset of probes statistically associated with the
significant independent components identified in steps I and II, instead of applying SVD to the
full data set with iteratively estimated weights. The other modifications in ISVA concern the
following three important aspects:
1) ISVA uses ICA instead of SVD to identify independent components, both from the
regression residuals in step I and from the probe subset in step III. For example, in step I,
ICA seeks an orthogonal K̂ x K̂ matrix W that maximizes the statistical independence of the
columns of S' in the residual matrix decomposition R = S A + ε = S' W A' + ε. Here S and S'
are the m x K̂ source matrices (the probe-specific coefficients for A and A'), A and A' are the
K̂ x n "mixing" matrices, and W is an orthogonal K̂ x K̂ transformation matrix; m is the
number of probes and n the number of samples.
2) Random Matrix Theory (RMT) (Plerou et al., 2002), rather than the permutation-based
dimensionality algorithm (Buja and Eyuboglu, 1992) used in SVA, is used to determine K̂,
the number of components for the ICA. RMT estimates the dimensionality by counting the
eigenvalues of the observed data covariance matrix that exceed the theoretical maximum for
a random matrix counterpart (the null), whose eigenvalue density is known (Plerou et al., 2002).
3) ISVA adds a surrogate variable selection step: it keeps the ISVs that are significantly
correlated with known confounding factors and excludes those that correlate more strongly
with the POI.
The R function developed for this method is DoISVA and is included in the isva package.
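A rough sketch of the two ISVA ingredients discussed above, an RMT-style dimensionality estimate followed by ICA on the residual matrix, might look like this (using scikit-learn's FastICA; this omits the probe subsetting and ISV selection steps of DoISVA, and the function names are our own):

```python
import numpy as np
from sklearn.decomposition import FastICA

def isva_components(R, random_state=0):
    """Estimate K with a Marchenko-Pastur threshold, then run ICA.

    R: probes x samples residual matrix. Rows are standardized so the
    random-matrix null has unit variance; K is the number of covariance
    eigenvalues above the theoretical MP upper edge. Returns the probe
    sources S (probes x K) and the per-sample mixing matrix (samples x K),
    whose columns play the role of the independent surrogate variables.
    """
    m, n = R.shape
    Rs = (R - R.mean(axis=1, keepdims=True)) / R.std(axis=1, keepdims=True)
    evals = np.linalg.eigvalsh(Rs.T @ Rs / m)   # sample covariance spectrum
    mp_edge = (1 + np.sqrt(n / m)) ** 2         # MP upper edge, unit variance
    K = max(1, int((evals > mp_edge).sum()))    # "significant" components
    ica = FastICA(n_components=K, random_state=random_state)
    S = ica.fit_transform(Rs)                   # probes x K sources
    return S, ica.mixing_                       # mixing_: samples x K
```

Unlike the SVD used by SVA, the recovered per-sample factors are driven toward statistical independence rather than mere linear uncorrelatedness.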
3.2.4 RUV-2 and RUV-4
Gagnon-Bartsch et al. (unpublished manuscript) recently developed a new method,
Removing Unwanted Variation (RUV-4), updating the previously released RUV-2 method
(Gagnon-Bartsch and Speed, 2012). The original RUV-2 corrects for batch effects in a two-step
procedure: first, performing factor analysis on negative control (NC) probes, and second,
adjusting for the identified factors in the differential DNA methylation analysis. The NC probes
are assumed unaffected by the POI, and the number of latent factors can be chosen empirically.
However, the NC probes may still carry a POI effect if they are not chosen carefully. So as not
to erroneously adjust away signal, the newer RUV-4 method uses a strategy similar to that of
SVA (Leek and Storey, 2007), measuring unwanted variation from the residuals of a regression
on the POI. NC probes are then used to recover unwanted-variation components that may have
been mistakenly removed in the earlier steps, making the method less affected by correlation
between the POI effect and the batch effect.
As in the SVA and ISVA methods, RUV-4 models the m x n DNA methylation value
matrix Y as follows:

Y = α + β X^T + L G^T + ε

where α is the m x 1 vector of overall DNA methylation, X is the n x p design matrix for
sample conditions, and β is the m x p matrix of regression coefficients corresponding to X. G is
the n x K matrix whose columns are unobserved covariates (such as batch effects), and L is the
m x K matrix of coefficients corresponding to G. The error term components ε_ji ~ N(0, σ_j²).
The four steps used in RUV-4 are the following:
Step I. As in the SVA and ISVA methods, RUV-4 first forms an m × n residual matrix R by regressing Y on the design matrix X, where the (j, i) element of R is $r_{ji} = Y_{ji} - \hat{\alpha}_j - \hat{\beta}_j X_i^T$. The residual matrix can also be written as $R = R_X L G^T + R_X \varepsilon = L_0 G_0^T + R_X \varepsilon$, where $R_X$ is the residual operator of the matrix X, defined as $R_X \equiv I - X (X^T X)^{-1} X^T$.
Step II. Perform factor analysis on the residuals from Step I to estimate $\widehat{L_0 G_0^T}$. Note that this is not an estimate of the original $L G^T$, since $L G^T$ in the original model has also been multiplied by the residual operator of X in forming the residuals. The difference between $L G^T$ and $R_X L G^T$ is the component lost in Step I, which must be recovered in the next step.
Step III. Estimate the surrogate factor matrix G on the original scale of the residuals by adding back to $\hat{G}_0$ the component lost in Step I, estimated using the NC probes:

$$\hat{G} \equiv \hat{G}_0 + X\, b_{Y_c X}\, \hat{L}_{0c}^T \left(\hat{L}_{0c} \hat{L}_{0c}^T\right)^{-1},$$

where $b_{Y_c X} \equiv (X^T X)^{-1} X^T Y_c$ is the partial regression coefficient of Y on X, computed using the subset of NC probes only.
Step IV. Adjust for the resulting surrogate factors $\hat{G}$ in subsequent regression analyses.
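The four steps can be sketched with a small NumPy implementation. This is an illustrative simplification and not the ruv package's RUV4 function: it uses a samples-by-probes orientation, a truncated SVD in place of a full factor analysis, and function and variable names of our own choosing.

```python
import numpy as np

def ruv4_sketch(Y, X, nc_idx, k):
    """Illustrative RUV-4-style correction.
    Y: n x m methylation matrix (samples x probes)
    X: n x p design matrix for the phenotype of interest
    nc_idx: indices of negative-control probes
    k: number of unwanted factors
    """
    n = Y.shape[0]
    # Step I: residuals of Y after regressing out X
    P = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix of X
    R = (np.eye(n) - P) @ Y                 # residual matrix
    # Step II: factor analysis on the residuals (SVD as a simple stand-in)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    G0 = U[:, :k] * s[:k]                   # n x k estimated factor scores
    L0 = Vt[:k, :]                          # k x m estimated loadings
    # Step III: recover the component lost in Step I using the NC probes
    bYcX = np.linalg.solve(X.T @ X, X.T @ Y[:, nc_idx])  # p x |NC|
    L0c = L0[:, nc_idx]                     # k x |NC| loadings on NC probes
    G = G0 + X @ bYcX @ L0c.T @ np.linalg.inv(L0c @ L0c.T)
    # Step IV: return the factors to adjust for in downstream regressions
    return G
```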
Selection of NC probes for Illumina Infinium DNA methylation data
In simulation, we can use as NC probes those probes designed to be associated only with the batch variable. This targets the identification of unwanted factors due to technical variation.
In real data analysis, NC probes are usually not easily determined. Spike-in controls or probes from housekeeping (HK) genes can be used as NC probes if available (Gagnon-Bartsch and Speed, 2012). In the unpublished manuscript introducing the RUV4 method, Gagnon-Bartsch et al. suggested an iterative procedure: use an initial set of NC probes in RUV4 to obtain empirical NC probes that are not associated with the POI, then iterate the procedure a few times to refine the empirical set. The initial set of NC probes can be all probes if we believe the probes differentially methylated by the POI are sparse. Caution should be taken after each iteration by carefully checking the resulting empirical set of NC probes and evaluating the quality of the analysis. Analysis methods other than RUV4 can also be used within this iteration procedure.
Alternatively, since we expect the majority of probes with the lowest variances to contain only noise, we can empirically select the lowest-variance probes as NC probes when applying the RUV-2/RUV-4 methods for batch effect correction. This can be better than selecting all probes as NC probes, and it also simplifies the simulation process.
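The empirical lowest-variance selection can be sketched as follows (an illustrative Python/NumPy sketch; the default 10% fraction matches the choice used later in the simulation study, and the function name is ours):

```python
import numpy as np

def lowest_variance_nc(Y, frac=0.10):
    """Return indices of the fraction of probes with the lowest variance.
    Y: n x m matrix (samples x probes)."""
    probe_var = Y.var(axis=0)                  # per-probe variance
    n_nc = max(1, int(frac * Y.shape[1]))      # number of NC probes to keep
    return np.argsort(probe_var)[:n_nc]
```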
The R functions implementing these methods are RUV2 and RUV4, included in the ruv package.
Table 3.1 lists the BEC methods we compare.
Table 3.1 Comparison of available batch effect correction methods

| Method | Algorithm | R package | Function | Concerns |
|--------|-----------|-----------|----------|----------|
| ComBat | Corrects both additive and multiplicative batch effects using an Empirical Bayes (EB) method | sva | ComBat | Batch variable must be pre-specified |
| SVA | Detects latent confounding factors in the residuals and constructs surrogate variables (SVs) using SVD | sva | sva | Problematic if the phenotype of interest (POI) correlates too strongly with unknown confounders |
| ISVA | Detects latent confounding factors in the residuals and constructs independent SVs (ISVs) using ICA; selects as final ISVs those not significantly associated with the POI | isva | DoISVA | Problematic if the POI correlates too strongly with unknown confounders; the two-stage approach may bias the estimated regression coefficients |
| RUV-2 | Performs factor analysis (FA) on negative control (NC) probes to identify unwanted factors | ruv | RUV2 | Problematic if the POI correlates too strongly with the NC genes |
| RUV-4 | Combines elements of SVA and RUV-2: identifies unwanted factors from the residuals and uses NC genes to recover the component lost in forming the residuals | ruv | RUV4 | Problematic if the POI correlates too strongly with the NC genes |
3.3 Simulation Study
We designed a simulation to compare the above-mentioned batch correction methods on data sets with and without batch effects. We then apply these methods to a real HM27 DNA methylation data set (described later). We evaluate the performance of the methods on the beta-value scale.
3.3.1 Simulation Set-up
In order to understand the strengths and weaknesses of the different BEC methods in identifying biological signals that might be hidden by unwanted factors, including batch effects, the simulation of the methylation data set is critical. Figure 3.1 shows the simulation settings for the fixed coefficients, and Figure 3.2 shows the layout of the whole data simulation structure. Below we explain our simulation settings and each simulation step in detail.

We simulate a data set of 2000 CpG probes and 200 samples. The phenotype of interest (POI) is a continuous variable (such as air pollution). We assume there is one known covariate (a continuous variable, such as age) and one known categorical batch effect (a 0/1 variable) in the data. We select 5% of the CpGs (100 CpGs) to come from differentially methylated regions (DMRs) for the POI and 10% (200 CpGs) to come from DMRs for the covariate (these two sets of CpGs may overlap). All CpGs are assumed to be affected by the batch effect (BE). The data simulation largely follows Houseman et al.'s paper (Houseman et al., 2014), except that we do not consider any cell-type mixtures. The design of the parameters for all 2000 CpG probes is shown in Figure 3.1.
A few notations and simulation parameter settings are listed below:
n: number of subjects; here n = 200.
m: number of CpG probes; here m = 2000, of which 5% are differentially methylated for the POI and 10% for the covariate.
K: number of latent factors; here K = 2, representing the effect of processing samples in batches. We model batches of equal size (n = 100 each).
As formulated in Houseman's paper (and similar to the SVA model), for CpG probe j and subject i, we let the DNA methylation be

$$Y_{ji} = \beta_{1j} + \beta_{2j} X_i + \beta_{3j} C_i + \beta_{4j} B_i + \sum_{l=1}^{2} L_{jl}\, g_{li} + \varepsilon_{ji}$$

where $\beta_{1j}$ is the overall DNA methylation for probe j; $X_i$ is the POI, $C_i$ is the covariate, and $B_i$ is the Batch variable for sample i; $\beta_{2j}$, $\beta_{3j}$ and $\beta_{4j}$ are the regression coefficients corresponding to $X_i$, $C_i$ and $B_i$; $g_{li}$ is the l-th latent factor for sample i, and $L_{jl}$ is the probe-j-specific coefficient for $g_{li}$. The error term $\varepsilon_{ji}$ is the true CpG-probe-specific independent noise with mean 0.
In the simulation, the measured methylation value $Y_{ji}$ is modeled as the sum of three parts: 1) a linear regression on the known phenotypes ($\beta_{1j} + \beta_{2j} X_i + \beta_{3j} C_i + \beta_{4j} B_i$), 2) a latent-effect error ($\sum_{l=1}^{2} L_{jl}\, g_{li}$), and 3) a leftover residual ($\varepsilon_{ji}$). See Figure 3.1 for the parameter settings used in simulating the fixed coefficients and Figure 3.2 for the detailed data-generation workflow used in the simulation.
Here are the details of how these three parts are simulated.
I. For the linear regression part, we simulate a 4 × m coefficient matrix Beta from a 3-part mixture model as described below:

$$Beta_{4\times m} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_4 \end{bmatrix} = \begin{bmatrix} \beta_{11} & \cdots & \beta_{1m} \\ \vdots & & \vdots \\ \beta_{41} & \cdots & \beta_{4m} \end{bmatrix}$$

where $\beta_1$ is the intercept, slope $\beta_2$ is for the POI, slope $\beta_3$ is for the covariate, and slope $\beta_4$ is for the Batch variable.
1) The intercept $\beta_1$ is generated as $\mathrm{logit}^{-1}(a)$, where

$$a \sim 0.3\,N(-2.8, 0.7^2) + 0.2\,N(-0.75, 1.4^2) + 0.5\,N(1.5, 0.7^2).$$

2) Each POI-effect slope $\beta_2$ is 0 whenever $a$ arises from the first or third component of the 3-part mixture. For $a$ arising from the second component, the corresponding slope is generated as $-\mathrm{sign}(a + 0.75)\, b_2$, where $b_2 \sim 0.75\,\delta_0 + 0.25\,N(0.05, 0.025^2)$. This yields approximately 5% non-null CpG probes, as in Houseman's paper (Houseman et al., 2014).

3) Each covariate-effect slope $\beta_3$ is generated as $-\mathrm{sign}(a - \mu)\, b_3$, where $b_3 \sim 0.9\,\delta_0 + 0.1\,N(0.1, 0.1^2)$, with $\mu = -2.8$ if $a$ arises from the first component, $\mu = -0.75$ if $a$ arises from the second component, and $\mu = 1.5$ if $a$ arises from the third component.

4) Each Batch-effect slope $\beta_4$ is generated as $N(0.05, 0.1^2)$, independently of how $\beta_1$, $\beta_2$ and $\beta_3$ are generated.

Once this coefficient matrix Beta is generated, we use it for all simulated data sets.
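The coefficient-generation scheme in steps 1)-4) can be sketched as follows (an illustrative Python/NumPy sketch of the mixture design; the original analysis was implemented in R, and the function and variable names here are ours):

```python
import numpy as np

def simulate_beta(m=2000, seed=0):
    """Simulate the 4 x m coefficient matrix from the 3-part mixture."""
    rng = np.random.default_rng(seed)
    # mixture component for each probe, with probabilities 0.3 / 0.2 / 0.5
    comp = rng.choice(3, size=m, p=[0.3, 0.2, 0.5])
    means, sds = np.array([-2.8, -0.75, 1.5]), np.array([0.7, 1.4, 0.7])
    a = rng.normal(means[comp], sds[comp])
    beta1 = 1.0 / (1.0 + np.exp(-a))            # intercept: logit^{-1}(a)
    # POI slope: nonzero only for the second component, with probability 0.25
    b2 = np.where(rng.random(m) < 0.25, rng.normal(0.05, 0.025, m), 0.0)
    beta2 = np.where(comp == 1, -np.sign(a + 0.75) * b2, 0.0)
    # covariate slope: nonzero with probability 0.1, sign set by a - mu
    b3 = np.where(rng.random(m) < 0.1, rng.normal(0.1, 0.1, m), 0.0)
    beta3 = -np.sign(a - means[comp]) * b3
    # Batch slope: normal regardless of the other coefficients
    beta4 = rng.normal(0.05, 0.1, m)
    return np.vstack([beta1, beta2, beta3, beta4])
```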
After the coefficient matrix is simulated, we simulate the 4 × n design matrix

$$design_{4\times n} = \begin{bmatrix} 1 & \cdots & 1 \\ X_1 & \cdots & X_n \\ C_1 & \cdots & C_n \\ B_1 & \cdots & B_n \end{bmatrix}$$

where POI $X_i \sim \mathrm{Unif}(-0.5, 0.5)$, covariate $C_i \sim \mathrm{Unif}(0, 0.5)$, and Batch $B_i = 0$ or $1$ (we assume equal sample sizes in the batches).
The assumed mean methylation values are then calculated as

$$\mu_{m\times n} = Beta_{4\times m}^T * design_{4\times n},$$

whose $(j, i)$ element is $\mu_{ji} = \beta_{1j} + \beta_{2j} X_i + \beta_{3j} C_i + \beta_{4j} B_i$.
As we presume that the biological variation occurs on the linear scale, with error arising from a beta distribution, we then generate beta-distributed values

$$b_{ji} \sim \mathrm{Beta}\left(\tilde{\mu}_{ji}\, q_j,\ (1 - \tilde{\mu}_{ji})\, q_j\right),$$

where $\tilde{\mu}_{ji} = \min\{\max\{\mu_{ji}, 0.001\}, 0.999\}$, $q_j = \max\left\{n^{-1} \sum_{i=1}^{n} \tilde{\mu}_{ji}(1-\tilde{\mu}_{ji})/\theta^2 - 1,\ 1\right\}$, and $\theta = 0.008$. The subject index is $i = 1, \ldots, n$ and the CpG probe index is $j = 1, \ldots, m$.
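Given the mean matrix μ, drawing the beta-distributed values can be sketched as follows (an illustrative Python/NumPy sketch; `mu` denotes the m × n mean matrix from the previous step, and the function name is ours):

```python
import numpy as np

def simulate_beta_values(mu, theta=0.008, seed=0):
    """Draw b_ji ~ Beta(mu~_ji * q_j, (1 - mu~_ji) * q_j) per the text."""
    rng = np.random.default_rng(seed)
    mu_t = np.clip(mu, 0.001, 0.999)           # truncated means mu~_ji
    # per-probe precision q_j, floored at 1
    q = np.maximum(np.mean(mu_t * (1 - mu_t), axis=1) / theta**2 - 1, 1)
    return rng.beta(mu_t * q[:, None], (1 - mu_t) * q[:, None])
```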
II. For the latent error component of factor-analytic form, we simulate it on the logit scale of measurement (M-value). In our simulation setting, the number of latent factors is 2. We generate the elements of the 2 × n latent factor effect matrix $G_{2\times n}$ as standard normal variables, while the corresponding m × 2 CpG-specific latent factor loading matrix $L_{m\times 2}$ is generated as $Z_{m\times 2}\,\Delta$, with the elements of $Z_{m\times 2}$ standard normal and $\mathrm{diag}(\Delta) = (0.25, 0.01)^T$.

The assumed latent error component is then calculated as

$$E_{m\times n} = L_{m\times 2} * G_{2\times n} = (Z_{m\times 2}\,\Delta) * G_{2\times n},$$

whose $(j, i)$ element is $E_{ji} = \sum_{l=1}^{2} L_{jl}\, g_{li}$, where $L_{j1} = 0.25\, z_{j1}$ and $L_{j2} = 0.01\, z_{j2}$.
III. For the last residual part, we assume the error $\varepsilon_{ji} \sim N(0, 0.25^2)$.

Thus, the final simulated DNA methylation value is

$$y_{ji} = 1 - \left[1 + \exp\left\{\log\left(\frac{b_{ji}}{1 - b_{ji}}\right) + E_{ji} + \varepsilon_{ji}\right\}\right]^{-1}.$$
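Putting the three parts together on the logit scale can be sketched as follows (an illustrative Python/NumPy sketch under our notation: `b` is the m × n matrix of beta-distributed values and `E` is the m × n latent error matrix; the function name is ours):

```python
import numpy as np

def assemble_methylation(b, E, eps_sd=0.25, seed=0):
    """Combine beta-distributed values b with latent error E and residual
    noise on the logit (M-value) scale, then map back to beta values."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0, eps_sd, size=b.shape)    # residual eps_ji
    m_value = np.log(b / (1 - b)) + E + eps      # logit(b_ji) + E_ji + eps_ji
    return 1 - 1 / (1 + np.exp(m_value))         # back to the (0, 1) scale
```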
After the data set Y is generated, we follow the data analysis workflow to perform batch effect correction (BEC) and compare the effect of the BEC methods, as illustrated in Figure 3.3. We apply quantile normalization to the simulated beta values to remove variation across samples (Bolstad et al., 2003). We then apply the ComBat, SVA, ISVA, RUV-2, or RUV-4 method to the quantile-normalized (QN) data and compare the data after each of these five BEC methods to the data after QN only. For the SVA and ISVA methods, we let the functions themselves determine the number of surrogate variables. For the RUV2 and RUV4 methods, because the authors suggest estimating the number of unwanted factors "by hand", which is infeasible in simulation, we use the functions' default setting (k = 4). This number is sensible here because our simulation design contains four factors that are unwanted with respect to the POI: the covariate effect, the Batch effect, and the two simulated latent effects.
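The quantile normalization step can be sketched as follows (an illustrative NumPy version of the Bolstad et al. (2003) procedure, ignoring ties; not the exact R implementation used in the analysis):

```python
import numpy as np

def quantile_normalize(Y):
    """Quantile-normalize the columns (samples) of an m x n matrix so that
    every sample shares the same empirical distribution."""
    order = np.argsort(Y, axis=0)               # sort order within each sample
    ranks = np.argsort(order, axis=0)           # rank of each value per sample
    # mean of each rank across samples defines the reference distribution
    mean_quantiles = np.sort(Y, axis=0).mean(axis=1)
    return mean_quantiles[ranks]
```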
The main visualization used to compare these BEC methods is the principal component analysis (PCA) plot. Furthermore, we assess our ability to detect features associated with the POI with and without batch effect correction. Since we know which CpG probes were simulated with differential DNA methylation, we can evaluate the sensitivity and specificity for finding these true positive probes. False Discovery Rate (FDR) adjustment for multiple testing of the linear regressions across all CpG probes is applied (Benjamini and Hochberg, 1995).
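The Benjamini-Hochberg FDR adjustment can be sketched as follows (an illustrative Python version, intended to mirror in spirit R's p.adjust with method "BH"):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for a vector of raw p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)     # p_(k) * m / k
    # enforce monotonicity from the largest p-value downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out
```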
Figure 3.1 Parameter setting for simulating fixed coefficients

[Diagram: the simulated data decompose into a linear regression part, a latent-effect error part, and a leftover residual part, built from the fixed coefficients; subject index i = 1, …, n; CpG probe index j = 1, …, m.]

Figure 3.2 Data generation work flow in simulation
[Figure 3.3 flowchart: the simulated data set (beta values: 2000 probes × 200 samples) is quantile normalized, then analyzed under one of the following:
- No BE correction: linear regression Meth = POI + confounder.
- Simple BE correction: linear regression Meth = POI + confounder + Batch.
- ComBat correction: apply ComBat to the data, with the POI as main exposure and the confounder as covariate; then linear regression on the ComBat-corrected data, Meth.ComBat = POI + confounder.
- SVA or ISVA correction: center each probe to its mean, then apply the SVA (or ISVA) method to find SVs (or ISVs); check the association of the SVs with the POI and potential confounders, removing SVs significantly associated with the POI (p-value cutoff 0.05); then linear regression Meth = POI + SVs (or ISVs).
- RUV-2/RUV-4 correction: center each probe to its mean, then apply the RUV-2/RUV-4 method with negative control probes to find unwanted factors (UFs); then linear regression Meth = POI + UFs.]

Figure 3.3 Data analyses work flow in simulation
Table 3.2 Simulation scenarios for different Batch effect or different POI effect (Scenarios 0-5)

| Parameter | Scenario 0 (null) | Scenario 1 (original setting) | Scenario 2 (larger POI effect) | Scenario 3 (smaller POI effect) | Scenario 4 (larger BE) | Scenario 5 (largest BE) |
|---|---|---|---|---|---|---|
| POI effect | 0 | N(0.05, 0.025^2) | N(0.1, 0.025^2) | N(0.025, 0.025^2) | N(0.05, 0.025^2) | N(0.05, 0.025^2) |
| Batch effect | N(0.05, 0.1^2) | N(0.05, 0.1^2) | N(0.05, 0.1^2) | N(0.05, 0.1^2) | N(0.1, 0.1^2) | N(0.2, 0.1^2) |
| Covariate effect | 0 | N(0.1, 0.1^2) | N(0.1, 0.1^2) | N(0.1, 0.1^2) | N(0.1, 0.1^2) | N(0.1, 0.1^2) |
In total, six scenarios are investigated with different effect sizes for the POI or the Batch variable (Table 3.2). Scenario 0 is the situation under the null hypothesis of no POI effect at all. Scenario 1 is the base scenario, with the parameter settings shown in Figures 3.1 and 3.2. Scenarios 2 and 3 correspond to an increased or decreased POI effect relative to Scenario 1, respectively, and Scenarios 4 and 5 to increased Batch effect sizes (Table 3.2).
Table 3.3 Simulation scenarios when the POI and covariate or the POI and Batch variable are correlated (all other parameter settings as in Scenario 1)

| Scenario | 1 (independent) | 1a (weak) | 1b (moderate) | 1c (strong) | 1d (weak) | 1e (moderate) | 1f (strong) |
|---|---|---|---|---|---|---|---|
| Cor(POI, C) | 0 | 0.2 | 0.5 | 0.9 | 0 | 0 | 0 |
| Cor(POI, B) | 0 | 0 | 0 | 0 | 0.2 | 0.5 | 0.9 |

Scenarios 1a-1c add correlation between the POI and the covariate; Scenarios 1d-1f add correlation between the POI and the Batch variable.
Six more scenarios are investigated to assess the effect of different levels of correlation between the POI and the covariate, or between the POI and the Batch variable (Table 3.3). The Cholesky decomposition method is used to simulate correlated variables (Becker et al., 1988). We first simulate two independent continuous variables as a data matrix; we then specify a correlation matrix, take its Cholesky decomposition, and multiply the two together, obtaining two continuous variables with the specified correlation. To simulate two correlated variables of which one is a dummy variable, we further dichotomize one of the two at its median and record the empirical correlation between the resulting dummy variable and the other continuous variable. We design three levels of correlation (weak, moderate, strong) between the POI and the covariate, or between the POI and the Batch variable (Table 3.3).
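The Cholesky-based simulation of correlated variables can be sketched as follows (an illustrative Python/NumPy sketch; the function name and the option to dichotomize at the median are ours):

```python
import numpy as np

def correlated_pair(n, rho, dichotomize_second=False, seed=0):
    """Simulate two variables with correlation rho via Cholesky decomposition.
    Optionally dichotomize the second at its median (for a 0/1 variable)."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, 2))                   # two independent variables
    C = np.array([[1.0, rho], [rho, 1.0]])        # target correlation matrix
    X = Z @ np.linalg.cholesky(C).T               # impose the correlation
    x, y = X[:, 0], X[:, 1]
    if dichotomize_second:
        y = (y > np.median(y)).astype(int)        # dichotomization attenuates
    return x, y
```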
For all scenarios listed in Tables 3.2 and 3.3, we simulate 100 data sets both with and without (except Scenarios 4 and 5) batch effects, for a total of 22 simulation scenarios.
3.3.2 Method Evaluation using Simulated Data
For all analyses, the data are quantile normalized (QN) as previously described (Figure 3.3). Although the SVA, ISVA, RUV2 and RUV4 methods can handle continuous exposure variables, the ComBat method removes batch effects for categorical exposure variables only; since we simulated a quantitative variable, we dichotomize the POI at the median and use the resulting categorical variable in ComBat. For the RUV2 and RUV4 methods, the theoretical NC probes are those falling in the first and third components of the 3-part mixture distribution of $a$, which lack both POI and covariate effects by design, while the empirically estimated NC probes are chosen as the 10% of probes with the lowest variance.
Scatter plots of the first two principal components (PCs) of the data allow us to visualize whether sample clusters are apparent. We produce PCA plots of the raw data, the QN data, the ComBat-corrected data, or the residuals from linear regression of the QN data on the estimated surrogate variables (unwanted factors). Samples are colored by their Batch, POI, or covariate value (above/below the median), and PCs are plotted before and after BE correction.
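Computing the first two principal components for such plots can be sketched as follows (an illustrative NumPy sketch via the SVD of the probe-centered matrix; not the specific R code used to draw the figures):

```python
import numpy as np

def first_two_pcs(Y):
    """Return the first two principal-component scores of an n x m matrix
    whose rows are samples (each probe is centered to mean zero first)."""
    Yc = Y - Y.mean(axis=0)                     # center each probe
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :2] * s[:2]                     # n x 2 PC scores
```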
We use linear regression to detect features associated with the POI; all linear regressions use the continuous POI variable itself. Models are run unadjusted for the Batch variable, run on the ComBat-adjusted data, or run adjusting in the model for the Batch variable or for the surrogate variables estimated by SVA, ISVA, RUV2, or RUV4. Sensitivity and specificity are calculated for each method according to the true DMR labels of the simulated CpG probes and whether the FDR-adjusted p-value ≤ 0.05 in each simulation. We also plot Receiver Operating Characteristic (ROC) curves from the average sensitivity and specificity across the 100 simulations, estimated at a series of FDR-adjusted p-value thresholds increasing from 0 to 1 in steps of 0.05.
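The sensitivity and specificity calculation at a given FDR-adjusted p-value threshold can be sketched as follows (illustrative Python; `true_dmr` and `p_adj` are hypothetical names for the simulated DMR labels and the adjusted p-values):

```python
import numpy as np

def sens_spec(p_adj, true_dmr, threshold=0.05):
    """Sensitivity and specificity of calling probes with p_adj <= threshold."""
    called = np.asarray(p_adj) <= threshold
    true_dmr = np.asarray(true_dmr, dtype=bool)
    sens = (called & true_dmr).sum() / true_dmr.sum()          # TP / P
    spec = (~called & ~true_dmr).sum() / (~true_dmr).sum()     # TN / N
    return sens, spec
```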
3.4 Real Data Set
We evaluate the performance of the above BEC methods on the type-1 diabetes (T1D) data kindly provided by Dr. Teschendorff (Teschendorff et al., 2011). The data were generated on the HM27 platform, and measurements of 22,486 CpG probes on 187 whole-blood samples from patients with type-1 diabetes were analyzed. This data set served as a validation set for a DNA methylation signature of aging (Teschendorff et al., 2010). The samples are distributed over 17 BeadChips on 3 plates. Possible covariates include age, gender (94 females and 93 males), bisulfite conversion efficiency (BSCE), cohort, and BeadChip. We consider age to be the POI and gender, BSCE and cohort to be covariates. Either BeadChip or plate (the 17 BeadChips are on 3 plates) is used as the Batch variable in the ComBat method. Once more we use PCA plots for data visualization and linear regression for detecting differentially methylated features.
3.5 Results
3.5.1 Simulation
Figure 3.4 shows density plots of the raw data for all samples (Figure 3.4 A), for one sample (Figure 3.4 B), and for the samples separated by batch (Figures 3.4 C&D). After quantile normalization, the density plots become identical (Figure 3.5 A), but the BE still remains, as seen in the PCA plot (Figure 3.5 B).

Figure 3.4 Density plots of the raw simulated methylation data in Scenario 1 with a simulated batch effect
A: Density plot of all samples simulated under Scenario 1. B: Density plot and histogram of one sample from A. C & D: Density plots of all samples simulated within each single batch, respectively.
The PCA plots in Figure 3.6 indicate that ComBat, SVA and RUV2 all successfully removed the BE (panels A). We also see that after correction by the SVA method, the effect of the POI appears clearly as the second principal component of the residuals (Figure 3.6 B2), and the effect of the covariate as the first (Figure 3.6 C2). PCA plots of the residuals after ISVA or RUV-4 correction resemble the SVA plots (see Supplemental Figure 3.1). However, for the RUV2 method with theoretical NC probes, both the POI and covariate effects are apparent in the PCA plot of the residuals (Figures 3.6 A3-C3), while for the RUV2 method with estimated NC probes, only the covariate effect remains (Figures 3.6 A4-C4).

Figure 3.5 Density plot (A) and PCA plot (B) of the quantile-normalized simulated methylation data in Scenario 1
A: Density plot of all samples simulated and quantile normalized under Scenario 1. B: PCA plot of the same data, with samples colored by the Batch variable.
Figure 3.6 PCA plots of the ComBat-corrected methylation data or residuals in Scenario 1 after correction by different BE methods
The data used for the PCA plots are as follows:
A1-C1: simulated data corrected by the ComBat method;
A2-C2: residuals from regressing the simulated data on surrogate variables estimated by the SVA method;
A3-C3: residuals from regressing the simulated data on unwanted factors estimated by the RUV2 method with theoretical NC probes;
A4-C4: residuals from regressing the simulated data on unwanted factors estimated by the RUV2 method with empirical NC probes.
Samples in panels A1-A4 are colored by the Batch variable, in panels B1-B4 by the POI variable dichotomized at the median, and in panels C1-C4 by the covariate variable dichotomized at the median.
Figure 3.7 ROC curves of all BEC methods for scenarios with different levels of POI effect (Scenarios 1-3; 100 simulated data sets)
In total 2000 probes are simulated, 5% of which are DMRs for the POI. Sensitivities and specificities for selecting informative probes at FDR-adjusted p-value thresholds from 0 to 1, in steps of 0.05, are calculated from each simulation and summarized over the 100 simulations. Nine BEC methods are shown: Raw, adj.Batch, ComBat, SVA, ISVA, RUV2, RUV2.est, RUV4, and RUV4.est.
Panels A1-C1: scenarios with BE; panels A2-C2: scenarios without BE.
Panels A1&A2: Scenario 2 (largest POI effect); panels B1&B2: Scenario 1 (moderate POI effect); panels C1&C2: Scenario 3 (smallest POI effect).
Figures 3.7 and 3.8 show the ROC curves for all BEC methods, summarized over 100 simulations with and without BE, for the scenarios with different levels of POI effect (Figure 3.7) or of Batch effect (Figure 3.8). When the POI and the other variables are independent (scenarios in Table 3.2), SVA (bold black line), RUV2 (solid green line) and RUV4 (solid red line) with theoretical NC probes show consistently the highest ROC curves in the presence of batch effects (Figures 3.7 & 3.8, A1-C1). The same holds without BE, except when the POI effect is large, where RUV4 performs slightly worse than the other two methods (Figure 3.7 A2). As expected, empirically selected NC probes in RUV2 are less effective at finding true signals than theoretical NC probes (green dashed vs. solid line), whereas the RUV4 method shows a smaller difference between the two NC probe sets (red dashed vs. solid line). The QN data without any BEC show the lowest ROC curves in all scenarios, whether or not BE exists. ComBat and direct adjustment for the Batch variable perform similarly in all scenarios: below the SVA, RUV2 and RUV4 methods, but above the QN data when BE exists, and the same as the raw method when no BE exists (Figures 3.7 & 3.8). The ISVA method performs similarly to SVA when no BE exists, but slightly worse when BE exists (Figures 3.7 & 3.8).

Figures 3.9 and 3.10 show boxplots of the sensitivities and specificities of these BEC methods over 100 simulations at an FDR-adjusted p-value < 0.05, corresponding to the scenarios in Figures 3.7 and 3.8, respectively.
Figure 3.8 ROC curves of all BEC methods for scenarios with different levels of Batch effect (Scenarios 1, 4, and 5; 100 simulated data sets)
Calculation and presentation are the same as in Figure 3.7.
Panel A1: Scenario 1 (smallest Batch effect); panel B1: Scenario 4 (moderate Batch effect); panel C1: Scenario 5 (largest Batch effect).
Taking a closer look at the FDR-adjusted p-value cutoff of 0.05, the boxplots clearly show that, as expected, the sensitivities of all methods decrease with decreasing POI effect size (Figures 3.9 A1&B1) or with increasing Batch effect size (Figure 3.10 A1). The comparison across all BEC methods is consistent with the ROC plots in Figures 3.7 and 3.8. Specificities for all methods under the scenarios without correlation between the POI and the other variables are all close to 1 (Figures 3.9 A2&B2, Figure 3.10 A2). The ISVA method shows large variance in both sensitivity and specificity when the POI effect size is large (Scenario 2; Figure 3.9), which is also reflected in the large drop of the ISVA ROC curve (bold black dashed line) in Scenario 1 (Figure 3.7 A1).
Figure 3.9 Boxplots of sensitivity and specificity of all BEC methods for scenarios with different levels of POI effect (Scenarios 1-3; 100 simulated data sets)
Panels A1 and B1: sensitivity. Panels A2 and B2: specificity. Panels A1 and A2: scenarios with BE; panels B1 and B2: scenarios without BE. Scenario 2 has the highest POI effect, Scenario 1 a moderate POI effect, and Scenario 3 the lowest. The BEC methods are shown in different shades of gray.
Figure 3.10 Boxplots of sensitivity and specificity of all BEC methods for scenarios with different levels of Batch effect (Scenarios 1, 4, and 5; 100 simulated data sets)
Panels A: sensitivity; panels B: specificity. Scenario 1 has the lowest Batch effect (BE = 0.05, black), Scenario 4 a moderate Batch effect (BE = 0.1, gray), and Scenario 5 the highest Batch effect (BE = 0.2, light gray).
Figure 3.11 ROC curves of all BEC methods for scenarios with different levels of correlation between the POI and the covariate (Scenarios 1a, 1b, and 1c; 100 simulated data sets)
Calculation and presentation are the same as in Figure 3.7.
Panels A1-C1: scenarios with BE; panels A2-C2: scenarios without BE.
Panels A1&A2: Scenario 1a (weak correlation between the POI and the covariate); panels B1&B2: Scenario 1b (moderate correlation); panels C1&C2: Scenario 1c (strong correlation).
Figure 3.12 ROC curves of all BEC methods for scenarios with different levels of correlation between the POI and the Batch variable (Scenarios 1d, 1e, and 1f; 100 simulated data sets)
Calculation and presentation are the same as in Figure 3.7.
Panels A1-C1: scenarios with BE; panels A2-C2: scenarios without BE.
Panels A1&A2: Scenario 1d (weak correlation between the POI and the Batch variable); panels B1&B2: Scenario 1e (moderate correlation); panels C1&C2: Scenario 1f (strong correlation).
Figures 3.11 and 3.12 show the ROC curves for all BEC methods, summarized over 100 simulations with and without BE, for the scenarios with different levels of correlation between the POI and the covariate (Figure 3.11; Scenarios 1a-c in Table 3.3) or between the POI and the Batch variable (Figure 3.12; Scenarios 1d-f in Table 3.3). Figures 3.13 and 3.14 show the corresponding boxplots of sensitivities and specificities over 100 simulations at the FDR-adjusted p-value cutoff of 0.05. Since simulating correlated POI and Batch variables requires a further dichotomization step after simulating two correlated continuous variables (to obtain the 0/1 Batch value), the means and standard deviations of the simulated correlations for Scenarios 1a-c and 1d-f are shown in Table 3.4.
Table 3.4 Summary of the simulated correlations (mean ± SD) for scenarios with varying correlations between the POI and covariate or between the POI and the Batch variable

| Correlation (mean ± SD) | Scenario 1 (independent) | 1a (weak) | 1b (moderate) | 1c (strong) | 1d (weak) | 1e (moderate) | 1f (strong) |
|---|---|---|---|---|---|---|---|
| In design | 0 | 0.2 | 0.5 | 0.9 | 0.2 | 0.5 | 0.9 |
| For the simulated continuous variables | 0.06±0.05 | 0.19±0.07 | 0.50±0.05 | 0.90±0.01 | 0.20±0.07 | 0.49±0.05 | 0.90±0.01 |
| After dichotomizing one variable at the median | NA | NA | NA | NA | 0.13±0.07 | 0.33±0.06 | 0.80±0.02 |

Scenarios 1a-1c add association between the POI and the covariate; Scenarios 1d-1f add association between the POI and the Batch variable.
Figure 3.13 Boxplots of sensitivity and specificity of all BEC methods for scenarios with different levels of correlation between the POI and the covariate (Scenarios 1 and 1a-1c; 100 simulated data sets)
Panels A1 and B1: sensitivity. Panels A2 and B2: specificity. Panels A1 and A2: scenarios with BE; panels B1 and B2: scenarios without BE. Scenario 1 has no correlation between the POI and the covariate (black); Scenarios 1a-1c have increasing levels of correlation between the POI variable and the covariate (decreasing shades of gray).
Figure 3.14 Boxplots of sensitivity and specificity of all BEC methods for scenarios with different levels of correlation between the POI and the Batch variable (Scenarios 1 and 1d-1f; 100 simulated data sets)
Panels A1 and B1: sensitivity. Panels A2 and B2: specificity. Panels A1 and A2: scenarios with BE; panels B1 and B2: scenarios without BE. Scenario 1 has no correlation between the POI and the Batch variable (black); Scenarios 1d-1f have increasing levels of correlation between the POI variable and the Batch variable (decreasing shades of gray).
In the scenarios where correlation exists between the POI and the covariate (Scenarios 1a-c), the main relationships across all BEC methods remain (Figure 3.11). The ROC curves of SVA, RUV2 and RUV4 with theoretical NC probes are highest most of the time, except that the ROC curves of ComBat and direct adjustment for the Batch variable cross over them at the high-specificity end when the correlation between the POI and the covariate is moderate (Figures 3.11 B&C). Besides this crossover, we also observe that the RUV4 method with empirically estimated NC probes outperforms all other methods at the high-specificity end when BE does not exist (Figures 3.11 B2&C2).

From the boxplots of sensitivity and specificity at the FDR-adjusted p-value cutoff of 0.05, we see that the relationships among all BEC methods still hold when correlation between the POI and the covariate exists (Figure 3.13). When strong correlation between the POI and the covariate is introduced, the sensitivities of the raw method, direct adjustment for the Batch variable, and the ComBat method decrease dramatically, with specificities remaining close to 1 both with and without BE (Figure 3.13). On the other hand, the sensitivities of the SVA, ISVA, RUV2 and RUV4 methods increase with the correlation strength, while the specificity drops correspondingly when BE exists (Figures 3.13 A1&A2). The same holds when BE does not exist, except that the RUV2 and RUV4 methods with empirically estimated NC probes show a small decrease in sensitivity as the correlation strengthens. All of this is consistent with the ROC curves at the high-specificity end, which corresponds to the FDR-adjusted p-value cutoff of 0.05 (Figures 3.11 B&C).
When BE exists in the data, the ROC curve of ISVA drops dramatically even when only a weak correlation between the POI and the Batch variable is introduced (Figure 3.12 A1). The ROC curve of SVA follows when the correlation is moderate, and both fall to the diagonal line, like the raw method without BE correction, when the correlation between the POI and the Batch variable is strong (Figures 3.12 A2&A3). The relative ordering of the other 7 BEC methods remains, with the RUV2 and RUV4 methods having the highest ROC curves; the difference between the two sets of NC probes is much smaller for RUV4 than for RUV2 (Figures 3.12 A1-A3). When BE does not exist but the correlation between the POI and the Batch variable increases (for example, when the samples on one BeadChip are older than those on another, but the two BeadChips are handled at the same time under the same experimental conditions, so no Batch effect is assumed), the SVA, ISVA, RUV2 and RUV4 methods maintain the highest ROC curves. However, the ROC curves of the ComBat method and of direct adjustment for the Batch variable drop below that of the raw method as the correlation becomes stronger (Figures 3.12 B2&C2). These results are concordant with the boxplots of sensitivity and specificity at an FDR-adjusted p-value cutoff of 0.05 (Figure 3.14). When BE exists and the correlation between the POI and Batch increases, the sensitivities of all BEC methods other than SVA, ISVA and the raw method decrease while their specificities remain close to 1 (Figures 3.14 A1&A2). Although the sensitivities of the SVA, ISVA and raw methods increase as the correlation between the POI and the Batch variable strengthens, their specificities drop dramatically when BE exists (Figures 3.14 A1&A2). When BE does not exist, all BEC methods perform stably regardless of how strong the correlation between the POI and the Batch variable becomes, except for the ComBat method and direct adjustment for the Batch variable, which show decreased sensitivity when the correlation is strong (Figures 3.14 B1&B2).
The RUV4 method appears to tolerate the correlation between the POI and the Batch variable best, and is least affected by the choice of NC probes, under both conditions, with and without BE.
Simulations under the null hypothesis of no POI effect show that all BEC methods have a mean False Positive Rate (FPR) over 100 simulations of around 0.05, with larger variances observed for the raw method, direct adjustment for the Batch variable, and the ComBat method (Figure 3.15). The RUV4 method shows a slightly elevated FPR when BE exists (mean FPR = 0.06) but much less so when BE does not exist (Figure 3.15).
Figure 3.15 Boxplots of the False Positive Rate (FPR) of all BEC methods under the null (Scenario 0; 100 simulated data sets)
Panel A shows the scenario with BE; Panel B shows the scenario without BE.
The red horizontal line indicates FPR = 0.05, the type I error rate in the simulation setting.
3.5.2 Real data analysis
Figure 3.16 shows the PCA plots of the T1D data and of the residuals after BE correction by the different BEC methods; all probes were used for the PCA plots. There is clearly a Batch effect in the data set (Figure 3.16 A1), and it is corrected by all BEC methods (Figures 3.16 A2-A6). Among them, RUV4 clearly corrects the Batch effect along with all other latent unwanted factors, and yields the clearest separation of samples by the main exposure of interest, age (Figure 3.16 B6). The SVA and ISVA methods manifest the age effect to some extent (Figures 3.16 B3&B4), while the ComBat method, with either plate or BeadChip as the batch variable, and the RUV2 method remove the age effect but leave a clear sex effect in the remaining data (Figures 3.16 B2, C2, B5&C5; Supplemental Figure 3.2).
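PCA plots like those in Figure 3.16 can be produced from the probe-by-sample beta-value matrix via SVD of the probe-centered data. A minimal sketch with random data standing in for the T1D matrix (the dimensions and distribution parameters are illustrative, not those of the real data):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Principal-component scores of the samples.

    X: probes x samples matrix of beta values. Each probe (row) is
    centered across samples, and samples are projected onto the top
    principal components of the centered data.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    # right singular vectors span the sample-space directions of variation
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (Vt[:n_components, :] * s[:n_components, None]).T

rng = np.random.default_rng(0)
beta = rng.beta(2.0, 5.0, size=(2000, 24))   # 2000 probes, 24 samples (illustrative)
scores = pca_scores(beta)
print(scores.shape)                           # (24, 2): one (PC1, PC2) pair per sample
```

Coloring the resulting score pairs by batch, age, or sex reproduces the kind of diagnostic shown in the figure panels.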
Table 3.5 Comparison of the enrichment of age-associated CpG probes from linear regression after BE correction by different BEC methods (at q-value < 0.05)

Linear regression   Crude   Adjust for      ComBat    ComBat
results             model   CF & BeadChip   (plate)   (BeadChip)   SVA*   ISVA*   RUV2   RUV4
In paper            294     440             NA        NA           688    902     NA     NA
My analysis         294     440             534       723          667    877**   500    1405

*Both SVA and ISVA used continuous age as input in functions.
**Multiple applications of ISVA resulted in different numbers of probes enriched.
Table 3.5 shows the enrichment of CpG probes differentially methylated with age, from linear regressions on all 22,486 CpG probes. After BE correction by each method, the number of age-associated CpG probes was calculated at a false discovery rate (q-value) of 0.05. We use the q-value rather than the FDR-adjusted p-value here in order to compare our results directly with those reported in the original paper (Teschendorff et al., 2011). The RUV4 method enriches a much larger number of CpG probes, followed by the ISVA and SVA methods. This is consistent with the PCA plots in Figure 3.16. Although the PCA plots for the two ComBat applications are very similar, ComBat correction using BeadChip as the batch variable does a much better job than using plate, in terms of CpG probe enrichment.
Figure 3.16 PCA plots of the T1D data or of residuals after correction by different BEC methods.
The data used for the PCA plots are as follows:
A1-C1: Raw DNA methylation data;
A2-C2: Data corrected by the ComBat method using BeadChip as the Batch variable;
A3-C3: Residuals from linear regression of the raw data on surrogate variables estimated by the SVA method;
A4-C4: Residuals from linear regression of the raw data on surrogate variables estimated by the ISVA method;
A5-C5: Residuals from linear regression of the raw data on unwanted factors estimated by the RUV2 method with empirical NC probes;
A6-C6: Residuals from linear regression of the raw data on unwanted factors estimated by the RUV4 method with empirical NC probes.
Samples in panels A1-A6 are colored by the plate (Batch) variable.
Samples in panels B1-B6 are colored by the age (POI) variable dichotomized at the median.
Samples in panels C1-C6 are colored by the sex (covariate) variable.
3.6 Discussion
In this comparison study of Batch Effect Correction methods, a total of 8 different BEC methods were compared to no correction across 22 simulation scenarios and in one real data set. We found that the RUV4 method outperformed all others under all simulation scenarios. It also enriched a larger number of CpG probes differentially methylated by age (the POI) and gave a better separation of samples by age group in the real data evaluated. Since the majority of these BEC methods were developed under the assumption of a Gaussian distribution, this systematic evaluation using the beta distribution in simulation serves as a validation for applying these methods to Infinium DNA methylation data.
From the simulation study, we can see that under both conditions, with or without BE, the relative performance of the BEC methods compared is stable when there is no correlation between the POI and the other variables in the data (Figures 3.7-3.10). The ComBat method performed similarly to directly adjusting for the Batch variable in linear regression in all these scenarios: both were better than no BEC when BE existed, but not as good as the other methods. Although we simulated both a Batch effect and latent errors in our data, these two sources of unwanted error were not associated by design. The ComBat method, which uses an Empirical Bayes (EB) approach to estimate the Batch effect by borrowing information across all genes in the same batch, still could not identify these latent unwanted errors effectively. That is why the ComBat method did not show much improvement over directly adjusting for the Batch variable. When there was no BE, these two methods performed similarly to no BEC at all.
The differences between these BEC methods lie mainly in the algorithms they use. Except for the ComBat method, which requires the Batch variable to be specified in advance, all the other BEC methods we studied, namely SVA, ISVA, RUV2 and RUV4, use algorithms that identify unwanted factors without specifying the Batch variable. The SVA, RUV2 and RUV4 methods use the singular value decomposition (SVD) for dimension reduction: by decomposing the data matrix (or residual matrix) and selecting the eigenvectors (dimensions) with the largest eigenvalues, i.e. those that best represent the signals, we can map the data to a lower-dimensional space. In SVA, the surrogate variables are found from the "subset" of probes that account for the majority of the variance in the residual matrix but have very little association with the POI. This probe "subset" can be selected with a two-step algorithm or with the Iteratively Re-Weighted (IRW) algorithm. With the IRW algorithm, which we used in our SVA runs, the probe "subset" is never actually formed; instead, each probe is weighted by the probability that it is associated with the unwanted factors rather than with the factor of interest. Although all probes enter the SVD used to estimate the surrogate variables, only those with high probability have a major impact, so this step in SVA can also be viewed as an unsupervised selection of NC probes. In RUV2 and RUV4, by contrast, the unwanted factors are found by factor analysis (via SVD) performed either on the subset of the data matrix containing only the pre-specified NC probes (RUV2) or on the residual matrix from regressing the full data matrix on the POI (RUV4). RUV4 additionally uses the pre-specified NC probes to recover the component of the unwanted factors that may have been mistakenly removed by the regression on the POI in the earlier step. Although RUV4 is considered a hybrid of the SVA and RUV2 methods, it differs from SVA in two main ways. First, RUV4 uses the pre-specified NC probes directly in its second step, while SVA uses the IRW algorithm to guide the selection of NC probes. Second, RUV4 adds the recovered component directly back to the unwanted factors estimated from the residual matrix in the first step, while SVA estimates the surrogate variables by performing a weighted SVD on all the data and selecting the components significantly correlated with the eigenvectors identified by SVD on the residuals of all the data regressed on the POI.
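The SVD core shared by these methods can be illustrated in a few lines: regress each probe on the POI, then take the leading right singular vectors of the residual matrix as candidate unwanted factors. This sketch shows only that common first step, not the full SVA or RUV4 algorithm (which add IRW weighting and NC-probe recovery, respectively); the simulated dimensions and effect structure are our own illustration:

```python
import numpy as np

def residual_factors(Y, poi, k=2):
    """Leading factors of the residual space after removing the POI.

    Y: probes x samples matrix; poi: length-samples vector.
    Returns a samples x k matrix of candidate unwanted factors.
    """
    n = Y.shape[1]
    X = np.column_stack([np.ones(n), poi])        # intercept + POI design
    # least-squares fit of every probe on the POI, then residuals
    B, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    R = Y - (X @ B).T
    # right singular vectors of the residual matrix span the sample-space
    # directions of remaining (unwanted) variation
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    return Vt[:k].T

rng = np.random.default_rng(1)
poi = rng.normal(size=30)
batch = np.repeat([0.0, 1.0], 15)                 # hidden batch indicator
Y = (np.outer(rng.normal(size=500), poi)          # POI effect on every probe
     + np.outer(rng.normal(size=500), batch)      # batch effect on every probe
     + rng.normal(scale=0.1, size=(500, 30)))     # small latent noise
W = residual_factors(Y, poi, k=1)
print(W.shape)                                    # (30, 1)
```

In this toy setup the leading residual factor tracks the hidden batch indicator, which is exactly the behavior the surrogate-variable step relies on.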
As we know, Infinium DNA methylation data are beta distributed, that is, non-Gaussian. Independent component analysis (ICA), which does not depend on a Gaussian distribution of the data, should therefore be a better fit. Unlike the SVD used in SVA, RUV2 and RUV4, which decomposes the data matrix according to variance, the ICA used in the ISVA method uses kurtosis to decompose the residual matrix (the original data matrix regressed on the POI) into independent components. Kurtosis is the fourth moment of a distribution (mean, variance and skewness are the first three) and measures non-Gaussianity, so it can be used to separate non-Gaussian independent sources. ICA requires the decomposed components to be as independent as possible, and thus uncorrelated in a non-linear fashion.
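As a concrete illustration of kurtosis as a measure of non-Gaussianity: the excess kurtosis (fourth standardized moment minus 3) is near zero for a Gaussian sample and clearly non-zero for a U-shaped beta sample. The sample sizes and distribution parameters below are arbitrary choices for the demonstration:

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(2)
gauss = rng.normal(size=100_000)
bimodal = rng.beta(0.5, 0.5, size=100_000)   # U-shaped, strongly non-Gaussian
print(round(excess_kurtosis(gauss), 1))      # close to 0.0
print(round(excess_kurtosis(bimodal), 1))    # clearly negative (about -1.5 in theory)
```

A deviation of the excess kurtosis from zero, in either direction, is the kind of signal ICA exploits to isolate independent non-Gaussian sources.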
The optimization algorithm used in the ISVA method is FastICA. It decomposes the data matrix into the product of a source matrix and a mixing matrix, where the columns of the source matrix are considered latent factors. To make these latent factors as independent as possible, ICA starts from an initial source matrix and searches for a transformation matrix W that maximizes the statistical independence of the columns of the source matrix. Because the initial source matrix for the residual-matrix factorization is assigned at random, the final source matrix is not uniquely determined. This randomness in estimating independent surrogate variables was more obvious when the POI effect was relatively strong (Scenario 2), where the performance of the ISVA method was quite unstable.
The choice of pre-specified NC probes is critical in the RUV methods. In our simulations, the NC probes were selected either as the probes that do not vary with the POI or the covariate (theoretical NC), or as those with the lowest variance (empirical NC). This choice affects the simulation results, since we treated the covariate and Batch effects, as well as the latent effects, as unwanted factors. Theoretical NC probes capture the Batch effect and the latent effect, but not the covariate effect. Empirical NC probes may capture the Batch, POI, or covariate effect, whichever corresponds to lower variance in the simulated data. This is borne out by the PCA plots for Scenario 1 (Figures 3.6 A3-C3&A4-C4), where RUV2 with empirical NC probes removed the POI effect but left the covariate effect in the residuals. When the POI effect was larger (as in Scenario 3), however, the PCA plots showed that RUV2 with empirical NC probes removed the covariate effect but left the POI effect in the residuals (results not shown). It is noteworthy that the proportion of theoretical NC probes used in simulation (72%) is larger than the proportion of empirical NC probes (10%), so methods using theoretical NC probes can capture the Batch effect and latent error more effectively than those using empirical NC probes.
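Selecting empirical NC probes as described amounts to ranking probes by their variance across samples and keeping the lowest 10%. A minimal sketch with a synthetic matrix (the data and the block of extra-variable probes are illustrative, not our simulation design):

```python
import numpy as np

def empirical_nc_probes(Y, fraction=0.10):
    """Return row indices of the lowest-variance probes in Y
    (probes x samples), for use as empirical negative controls."""
    var = Y.var(axis=1)                       # per-probe variance across samples
    n_nc = max(1, int(fraction * Y.shape[0]))
    return np.argsort(var)[:n_nc]

rng = np.random.default_rng(3)
Y = rng.beta(2, 2, size=(2000, 40))               # illustrative methylation-like data
Y[:100] += rng.normal(scale=0.5, size=(100, 40))  # first 100 probes get extra variation
nc = empirical_nc_probes(Y)
print(len(nc), int(nc.min() >= 100))              # 200 selected; none from the noisy block
```

The selected indices would then be passed to a RUV-type routine as its negative-control set.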
There is more than one way to select NC probes for the RUV methods. On both the HM27 and HM450 platforms, the non-specific fluorescence of Infinium I design beads is measured in the colour channel opposite their design. These so-called out-of-band probes are used in Norm-Exponential background correction (Triche et al., 2013). They are also assumed to be invariant to the POI and affected only by unknown technical effects such as batch effects. Thus, one could pick the 1000 most variable out-of-band probes as NC probes when applying the RUV4 method for batch effect correction. In our simulation, however, no such information was collected, so we selected the probes with the lowest variance as NC probes. Selecting NC probes empirically by lowest variance can be problematic, since DNA methylation data are beta distributed and the mean is associated with the variance: probes with means close to 0 or 1 tend to have lower variance in theory, and these could include probes differentially methylated by the POI when BE is present in the data. This bias toward selecting NC probes with means close to 0 or 1 may explain why RUV2 with empirical NC probes performed slightly worse than RUV2 with theoretical NC probes in almost all scenarios with BE. When BE, the major component of variance across all probes, was absent, probes carrying a POI effect tended to have slightly higher variance than probes with only latent errors at the same mean DNA methylation level, so the empirical NC probes tended to capture less of the POI effect. As a result, RUV2 with empirical NC probes performed similarly to RUV2 with theoretical NC probes when no BE existed.
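The mean-variance relationship invoked here follows from the variance of a Beta(α, β) distribution, which can be written as μ(1-μ)/(α+β+1) with mean μ = α/(α+β): at a fixed concentration α+β, the variance shrinks as the mean approaches 0 or 1. A small numeric check, with the concentration value chosen purely for illustration:

```python
def beta_variance(mean, concentration):
    """Variance of a Beta distribution parameterized by its mean
    mu = alpha/(alpha+beta) and concentration alpha+beta."""
    return mean * (1.0 - mean) / (concentration + 1.0)

# at the same concentration, probes with means near 0 or 1 have lower
# variance than probes with intermediate means
for mu in (0.05, 0.50, 0.95):
    print(mu, round(beta_variance(mu, 10.0), 4))
```

This is exactly why a lowest-variance filter preferentially picks probes with extreme mean methylation.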
From our simulation results we noticed that although we included a covariate effect in the simulated data for all scenarios, it had minimal impact on almost all procedures. Compared to the Batch effect and latent error, which were simulated on all probes, the covariate effect was present in only 10% of probes, and only 0.5% of all probes carried both the POI effect and the covariate effect. Capturing the covariate effect therefore has little impact on surrogate variable estimation, or on the final detection of the POI signal, when BE dominates the variance. This is validated by the observation that, in all scenarios with BE and no correlation between the POI and the other variables, RUV2 with theoretical NC probes performed almost the same as the SVA method, which corrected the Batch effect, latent error, and covariate effect (Figures 3.7-3.10).
The RUV4 method performed similarly with theoretical and empirical NC probes, and was also similar to the SVA method and to RUV2 with theoretical NC probes in most scenarios. This is because RUV4 employs an SVA-like step before using the NC probes, avoiding any POI effect captured in the NC probes (and thus outperforming RUV2), and then uses the NC probes to add back unwanted factors that may have been removed by the regression on the POI in the first step (and thus outperforming simple factor analysis by SVD). Being a hybrid of the RUV2 and SVA methods, RUV4 showed very stable performance across all scenarios, whether or not BE existed, and especially when there was a strong correlation between the POI and the Batch variable (Figures 3.12&3.14).
However, exceptions were seen when the POI effect was very strong (Scenario 2, no BE) or the correlation between the POI and the covariate was strong (Scenario 1c, no BE). The simulation study assumed a continuous POI variable, but we dichotomized it at the median and supplied the categorical variable to the RUV4 function when computing the residuals. This was done to compare all BEC methods under the same conditions, since ComBat can only use the POI as a categorical variable. To see whether this additional step introduced bias in estimating the unwanted factors, we ran a further simulation comparing the performance of RUV2 and RUV4 using the POI dichotomized at the median versus the continuous POI as input to the functions. We found that using the continuous POI in the first, residual-generating step of RUV4 performed much better in the two exception scenarios above, with sensitivity similar to RUV2 with theoretical NC probes, and was as good as using the dichotomized POI in all other scenarios. This suggests that the first SVA-like step in RUV4 is sensitive to how the POI is coded when the POI effect is strong or the POI-covariate correlation is strong. Under both scenarios, regressing the data on the dichotomized POI left a large amount of POI effect in the residuals, which was then kept in the unwanted factors and over-adjusted in the downstream linear regression. Why this was not a problem for RUV4 when the POI effect or the POI-covariate association was weaker needs further investigation.
Neither the SVA nor the ISVA method appeared to suffer from the information loss caused by dichotomizing the POI in the first regression step: they performed relatively consistently in almost all scenarios, except when BE existed and the POI and Batch variables were strongly correlated (Figures 3.12 A1-C1; Figures 3.14 A1&A2). This may be because SVA selects its NC probe "subset" empirically with the IRW algorithm, which reduces the bias of estimating surrogate variables directly from the residuals of all probes (Leek and Storey, 2008). ISVA has a similar probe-subset selection step that mimics the two-step algorithm in SVA (Alter et al., 2000; Gagnon-Bartsch and Speed, 2012; Johnson et al., 2007; Leek et al., 2012; Leek and Storey, 2007; Teschendorff et al., 2011). This NC probe selection step embedded within SVA and ISVA makes them much more tolerant of misspecification of the POI variable in the first regression step.
Whether the POI and Batch variables were correlated had a strong impact on the performance of some BEC methods, but not all. When BE existed and the correlation between the POI and Batch variables was strong (Scenario 1f), the residual matrix contained only a small part of the Batch effect after the data were regressed on the POI. Although both SVA and ISVA have the implicit NC probe selection described above, the BE was so strongly correlated with the POI that the BE remaining in the residuals was too small to guide the selection of NC probes. The surrogate variables identified by SVA or ISVA therefore captured little BE, and the leftover BE in many probes was mistaken for the POI effect. This resulted in spuriously high sensitivity, very low specificity, and ROC curves lying on the diagonal (Figure 3.12 C1; Figures 3.14 A1&A2). Unlike SVA and ISVA, RUV4 performs SVD on all probes to estimate the unwanted factors and then, in a later step, recovers any unwanted factors associated with the POI using the pre-specified NC probes. This latter step allows RUV4 to tolerate the correlation between the POI and Batch variable very well.
When a strong correlation between the POI and Batch variable existed but there was no BE, the BEC methods that identify surrogate variables (unwanted factors) performed, as expected, as if there were no correlation. However, ComBat and direct adjustment for the Batch variable had reduced sensitivity. This may be due to over-correction of the POI-related variance that was associated with Batch, since the Batch variable is used explicitly in these two methods even when no BE is present.
Adding correlation between the POI and the covariate had little impact on the performance of most BEC methods. As expected, when a strong POI-covariate correlation existed, directly adjusting for the Batch variable or using ComBat did remove the Batch effect to some extent, but adjusting for the covariate in the linear regression made things worse (Figures 3.11 A1&B1). The other BEC methods did not have this issue, since they do not use the covariate explicitly, either in estimating the surrogate variables (unwanted factors) from the data or in the linear regressions identifying DMRs by POI.
We note that the real data we evaluated differed considerably from our simulation design. The probes in our simulation followed the density plots of Type II probes on the HM450 platform, whereas the real data come from the HM27 platform, which has a higher proportion of CpG probes from CpG islands, with lower means. Furthermore, our simulation considered only one covariate, differentially methylated at 10% of all probes, while the real data analysis considered three covariates (sex, BSCE, and cohort), which could affect a much larger percentage of probes. The Batch variable in the real data analysis also contained more than two categories, with an unbalanced allocation, and the correlations between age and the Batch variable, or between age and the covariates, were more complex than in our simulation design.
Despite these differences between the simulated and real data, the relative performance of the BEC methods in the real-data case study was similar to that in the simulation study. RUV4 outperformed all others in identifying the signals associated with age. ComBat correction using BeadChip instead of plate as the Batch variable greatly improved the enrichment of age-associated DMR CpG probes, because the ComBat method identified the latent effects associated with BeadChip more effectively than those associated with plate. However, the PCA plots of the ComBat-corrected data did not differ much between using BeadChip and using plate as the Batch variable (Figures 3.16 A2-C2; Supplemental Figures 3.2 A2-C2): both showed no age effect but retained sex as one major component. Even after linear regression of the ComBat-corrected data on the major covariates (sex, BSCE, and cohort), the residuals still retained a strong sex effect and no age effect (Supplemental Figures 3.2 A1-C1&A3-C3).
The strong sex effect observed in the real-data evaluation was mainly due to the 803 probes on chromosomes X or Y in our data set. We therefore removed these 803 probes and regenerated the PCA plots for comparison; the results are shown in Supplemental Figure 3.3. After removing these probes, the PCA plots of the raw data show the Batch effect as the first PC. The most obvious change occurs in the PCA plots of the residuals from the RUV4-corrected data (Supplemental Figures 3.3 A6-C6): although the age effect is still the first PC, it is weaker than in Figure 3.16, where the probes from the X/Y chromosomes were included. This suggests that those probes contained important age-related information, which guided the estimation of unwanted factors by the RUV4 method. The PCA plots of residuals from linear regression on surrogate variables estimated by the SVA, ISVA, and RUV2 methods are similar to the PCA plots using the complete data (Supplemental Figures 3.3 A3-C5). Although the sex effect is not as strong as in Figure 3.16, it is still the first PC in the data corrected by the ComBat method. As with the complete data, the PCA plots of the ComBat-corrected data, and of the residuals from linear regression of the ComBat-corrected data on the other covariates, do not differ much from one another, whether plate or BeadChip is used as the Batch variable (Supplemental Figures 3.3 A2-C2&A7-C9).
3.7 Conclusion
In conclusion, in both the simulation and real-data evaluations, the RUV4 method was the best of all the Batch Effect correction methods compared, whether or not BE existed. Its stable performance across a variety of scenarios makes it preferable for correcting unwanted factors in high-throughput DNA methylation data. We can apply RUV4 to estimate latent unwanted factors and adjust for them when we have data from multiple batches (e.g. different times, plates, or technicians). Since the estimated unwanted factors tend to capture all kinds of latent heterogeneity in the residual space, including some covariate effects, we could also choose to adjust for the estimated unwanted factors instead of adjusting for the covariates themselves explicitly in downstream analysis. However, we should be aware that the unwanted factors identified from residual data could contain important biological heterogeneity that should not be removed. So when we try to isolate signals by removing unwanted variation in Infinium DNA methylation data, the RUV4 method can serve as a good tool, but caution should always be taken.
3.8 Appendix
Supplemental Figure 3.1 PCA plots of simulated methylation data or residuals after correction by different BEC methods (Scenario 1)
The residual data used for the PCA plots are as follows:
A1-C1: Corrected by the ISVA method;
A2-C2: Corrected by RUV4 using theoretical NC probes;
A3-C3: Corrected by RUV4 using empirical NC probes.
Samples in panels A1-A3 are colored by the plate (Batch) variable.
Samples in panels B1-B3 are colored by the age (POI) variable dichotomized at the median.
Samples in panels C1-C3 are colored by the sex (covariate) variable.
Supplemental Figure 3.2 PCA plots of T1D data or residuals from linear regression using ComBat-corrected data
The data used for the PCA plots are as follows:
A1-C1: ComBat-corrected data with plate as the Batch variable;
A2-C2: Residuals from linear regression of ComBat-corrected data with BeadChip as the Batch variable on the covariates (sex, BSCE, and cohort);
A3-C3: Residuals from linear regression of ComBat-corrected data with plate as the Batch variable on the covariates (sex, BSCE, and cohort).
Samples in panels A1-A3 are colored by the plate (Batch) variable.
Samples in panels B1-B3 are colored by the age (POI) variable dichotomized at the median.
Samples in panels C1-C3 are colored by the sex (covariate) variable.
Supplemental Figure 3.3 PCA plots of T1D data, ComBat-corrected data, or residuals after correction by different BEC methods, with the 803 probes on chromosomes X or Y removed
The data used for the PCA plots are as follows:
A1-C1: Raw DNA methylation data;
A2-C2: Data corrected by the ComBat method using BeadChip as the Batch variable;
A3-C3: Residuals from linear regression of the raw data on surrogate variables estimated by the SVA method;
A4-C4: Residuals from linear regression of the raw data on surrogate variables estimated by the ISVA method;
A5-C5: Residuals from linear regression of the raw data on unwanted factors estimated by the RUV2 method with empirical NC probes;
A6-C6: Residuals from linear regression of the raw data on unwanted factors estimated by the RUV4 method with empirical NC probes;
A7-C7: ComBat-corrected data with plate as the Batch variable;
A8-C8: Residuals from linear regression of ComBat-corrected data with BeadChip as the Batch variable on the covariates (sex, BSCE, and cohort);
A9-C9: Residuals from linear regression of ComBat-corrected data with plate as the Batch variable on the covariates (sex, BSCE, and cohort).
Samples in panels A1-A9 are colored by the plate (Batch) variable.
Samples in panels B1-B9 are colored by the age (POI) variable dichotomized at the median.
Samples in panels C1-C9 are colored by the sex (covariate) variable.
Chapter IV
Conclusion and Future Directions
4.1 Conclusion
This dissertation contains two projects: feature selection for cluster analysis, and comparison of Batch effect correction methods in Illumina Infinium DNA methylation data. Both projects aim to identify effective methods for finding signals in Infinium DNA methylation data at different stages of a study.
In the first project, we developed several statistics that exploit the beta-distribution characteristic of Illumina Infinium DNA methylation data to perform feature selection for cluster analysis. Using simulation and real-data evaluation, we compared them with other available statistics, such as the standard deviation of the beta value (SD-beta), in terms of enriching informative probes and identifying the correct subgroups in cluster analysis. We concluded that both TM-GOF and SD-beta are useful under different scenarios, so both should be tried in exploratory cluster analysis.
In the second project, we compared a set of Batch Effect Correction (BEC) methods using simulation and illustrated them on real data. We concluded that, among the commonly used BEC methods, RUV4 performed best in finding POI signals by effectively removing unwanted factors from the DNA methylation data.
4.2 Future Directions
This section focuses mainly on extensions of the BEC method comparison project. In our simulation for the BEC method comparison, we considered equal numbers of samples in 2 batches, with and without BE. We have not considered unbalanced sample sizes or more than two batches, which would be more realistic.
Moreover, the covariate variable in our simulation had only a weak impact on the overall results because of its low occurrence in the design. Real data would be much more complex, with multiple covariates and high heterogeneity in their effects, so it would be interesting to increase, in simulation, the percentage of probes with covariate effects, both among all probes and among the probes differentially methylated by the POI. In our simulation, we assumed that probes differentially methylated by the covariate were equally likely to fall among the POI DMRs and the POI non-DMRs, which left only 10 of the 2000 simulated CpG probes as DMRs for both the POI and the covariate. Thus, when correlation between the POI and the covariate was introduced into the data, the performance of the methods changed little. In reality, CpG probes differentially methylated by the POI (e.g. age) could also be more easily affected by other covariates that we need to consider; in that case, some of our simulation results may not be directly applicable.
In our simulation, we considered only a continuous POI variable. In many settings a categorical POI is also of interest, for example when studying how DNA methylation is associated with disease status. Our results may still be directly applicable, since a 0/1 indicator can be viewed as a special continuous variable taking only two values. To validate this by simulation, we would need to select the distribution parameters carefully when preparing the coefficient matrix used to generate the DNA methylation data matrix.
In our simulation we considered analysing DNA methylation data only on the beta-value scale. Because of the beta-distribution property of the data, and because the majority of these BEC methods were developed under a Gaussian assumption, some researchers have applied BEC methods to M-values (the logit of the beta values) in their studies (Agha et al., 2014). The use of M-values in DNA methylation studies has its pros and cons (Du et al., 2010). We have not yet systematically compared these BEC methods on M-values, which would also be interesting.
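For reference, the M-value is the base-2 logit of the beta value, M = log2(β/(1-β)) (Du et al., 2010). A minimal sketch of the conversion in both directions; the small clipping offset is our own safeguard against beta values of exactly 0 or 1, not part of the definition:

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """M-value: base-2 logit of the beta value."""
    b = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transform back to the beta-value scale."""
    m = np.asarray(m, dtype=float)
    return 2.0 ** m / (2.0 ** m + 1.0)

betas = np.array([0.2, 0.5, 0.8])
m = beta_to_m(betas)
print(np.round(m, 3))                 # a beta value of 0.5 maps to M = 0
print(bool(np.allclose(m_to_beta(m), betas)))
```

Running a BEC method on the M-value matrix and transforming the corrected values back to the beta scale would be one way to set up the comparison described above.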
Correct selection of negative control (NC) probes is critical for the RUV2 and RUV4 methods. As discussed earlier, besides selecting the probes with the lowest variance, NC probes could be chosen directly from the Illumina Infinium platform design when that information is available. Evaluating this choice on further real data would be of great interest.
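The lowest-variance selection rule mentioned above amounts to ranking probes by their across-sample variance. The function name and the size of the control set below are illustrative:

```python
import numpy as np

def pick_negative_controls(meth, n_controls=500):
    """Return indices of the n_controls probes with the smallest
    across-sample variance in a probes-by-samples methylation matrix."""
    return np.argsort(meth.var(axis=1))[:n_controls]

rng = np.random.default_rng(0)
meth = rng.beta(2.0, 2.0, size=(2000, 40))   # toy Beta-value matrix
nc_idx = pick_negative_controls(meth, n_controls=100)
```

Platform-based selection would instead draw on the probe annotation shipped with the array manifest, rather than on an empirical variance ranking.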
Real data almost always come from a mixture of multiple cell types. Several cell-mixture adjustment methods have been developed to estimate the cell-type proportions in samples and adjust for them in analysis (Houseman et al., 2012; Houseman et al., 2014). In those studies, the authors compared their cell-mixture correction methods with SVA and concluded that reference-free cell-mixture adjustment is as good as, and sometimes superior to, SVA when cell-type mixture is present in DNA samples. Batch effect in the real data analyzed in those studies was corrected with ComBat (Houseman and Ince, 2014; Houseman et al., 2014). Our current study did not take cell-type mixture into consideration. From one point of view, however, cell-type mixture can itself be regarded as a set of latent variables, a specific source of unwanted variation associated with the samples. If we further introduced cell mixture into the simulation, it would therefore be interesting to compare the reference-free cell-mixture adjustment methods with the BEC methods, which are supposed to simultaneously correct batch effect and adjust for cell-type mixture. Furthermore, in scenarios with strong correlation between the POI and the batch variable, RUV4 outperformed all other methods, including SVA, in removing unwanted factors. How RUV4 performs when cell-type mixtures are also present in the data likewise deserves further investigation.
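The control-probe idea behind RUV (Gagnon-Bartsch and Speed, 2012) can be sketched in two steps: estimate unwanted factors from negative-control probes, then residualize every probe on those factors. The function below is a deliberately simplified illustration of that idea, not the published RUV2/RUV4 algorithms, and the toy two-batch design is an assumption:

```python
import numpy as np

def ruv_adjust(meth, nc_idx, k=1):
    """Residualize each probe on k unwanted factors estimated from
    negative-control probes (the top-k right singular vectors of the
    centered control submatrix). A simplified sketch of the RUV idea,
    not the published RUV2/RUV4 algorithms."""
    centered = meth - meth.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(centered[nc_idx], full_matrices=False)
    w = vt[:k].T                      # samples x k, orthonormal columns
    fitted = centered @ w @ w.T       # projection onto the factor space
    return meth - fitted

rng = np.random.default_rng(2)
batch = np.repeat([0.0, 1.0], 20)                 # toy two-batch design
loadings = rng.normal(0.2, 0.05, size=(500, 1))   # batch affects all probes
meth = 0.5 + loadings * (batch - 0.5) + rng.normal(0, 0.01, (500, 40))
nc_idx = np.arange(100)        # pretend the first 100 probes are controls
adjusted = ruv_adjust(meth, nc_idx, k=1)
```

In this toy data the estimated factor essentially recovers the batch pattern, so the adjusted probes are nearly uncorrelated with batch; the published RUV4 additionally protects the POI effect when estimating the factors.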
Bibliography
Agha, G., E. A. Houseman, K. T. Kelsey, C. B. Eaton, S. L. Buka, and E. B. Loucks, 2014,
Adiposity is associated with DNA methylation profile in adipose tissue: Int J Epidemiol.
Alter, O., P. O. Brown, and D. Botstein, 2000, Singular value decomposition for genome-wide
expression data processing and modeling: Proc Natl Acad Sci U S A, v. 97, p. 10101-6.
Becker, R. A., J. M. Chambers, and A. R. Wilks, 1988, The New S Language., Wadsworth &
Brooks/Cole.
Bell, D., A. Berchuck, M. Birrer, J. Chien, D. Cramer, F. Dao, R. Dhir, P. DiSaia, H. Gabra,
et al., 2011, Integrated genomic analyses of ovarian carcinoma: Nature, v. 474, p.
609-15.
Benito, M., J. Parker, Q. Du, J. Wu, D. Xiang, C. M. Perou, and J. S. Marron, 2004, Adjustment
of systematic microarray data biases: Bioinformatics, v. 20, p. 105-14.
Benjamini, Y., and Y. Hochberg, 1995, Controlling the false discovery rate: a practical and
powerful approach to multiple testing, J.R.Statist.Soc. B, p. 289-300.
Bergman, Y., and H. Cedar, 2013, DNA methylation dynamics in health and disease: Nat Struct
Mol Biol, v. 20, p. 274-81.
Bibikova, M., B. Barnes, C. Tsan, V. Ho, B. Klotzle, J. M. Le, D. Delano, L. Zhang, G. P.
Schroth, K. L. Gunderson, J. B. Fan, and R. Shen, 2011, High density DNA methylation
array with single CpG site resolution: Genomics, v. 98, p. 288-95.
Bolstad, B. M., R. A. Irizarry, M. Astrand, and T. P. Speed, 2003, A comparison of
normalization methods for high density oligonucleotide array data based on variance and
bias: Bioinformatics, v. 19, p. 185-93.
Bourgon, R., R. Gentleman, and W. Huber, 2010, Independent filtering increases detection
power for high-throughput experiments: Proc Natl Acad Sci U S A, v. 107, p. 9546-51.
Boutros, P. C., and A. B. Okey, 2005, Unsupervised pattern recognition: an introduction to the
whys and wherefores of: Brief Bioinform, v. 6, p. 331-43.
Buja, A., and N. Eyuboglu, 1992, Remarks on parallel analysis, Multivar. Behav. Res., p. 509-
540.
Cantone, I., and A. G. Fisher, 2013, Epigenetic programming and reprogramming during
development: Nat Struct Mol Biol, v. 20, p. 282-9.
Chen, C., K. Grennan, J. Badner, D. Zhang, E. Gershon, L. Jin, and C. Liu, 2011, Removing
batch effects in analysis of expression microarray data: an evaluation of six batch
adjustment methods: PLoS One, v. 6, p. e17238.
Christensen, B. C., E. A. Houseman, C. J. Marsit, S. Zheng, M. R. Wrensch, J. L. Wiemels, H. H.
Nelson, M. R. Karagas, J. F. Padbury, R. Bueno, D. J. Sugarbaker, R. F. Yeh, J. K.
Wiencke, and K. T. Kelsey, 2009, Aging and environmental exposures alter tissue-
specific DNA methylation dependent upon CpG island context: PLoS Genet, v. 5, p.
e1000602.
Croner, R. S., B. Lausen, V. Schellerer, I. Zeittraeger, A. Wein, C. Schildberg, T. Papadopoulos,
A. Dimmler, E. G. Hahn, W. Hohenberger, and W. M. Brueckl, 2009, Comparability of
microarray data between amplified and non amplified RNA in: J Biomed Biotechnol, v.
2009, p. 837170.
Dougherty, E. R., J. Barrera, M. Brun, S. Kim, R. M. Cesar, Y. Chen, M. Bittner, and J. M. Trent,
2002, Inference from clustering with application to gene-expression microarrays: J
Comput Biol, v. 9, p. 105-26.
Du, P., X. Zhang, C. C. Huang, N. Jafari, W. A. Kibbe, L. Hou, and S. M. Lin, 2010,
Comparison of Beta-value and M-value methods for quantifying methylation levels:
BMC Bioinformatics, v. 11, p. 587.
Ehler, M., V. N. Rajapakse, B. R. Zeeberg, B. P. Brooks, J. Brown, W. Czaja, and R. F. Bonner,
2011, Nonlinear gene cluster analysis with labeling for microarray gene expression data:
BMC Proc, v. 5 Suppl 2, p. S3.
Fernandez-Tajes, J., A. Soto-Hermida, M. E. Vazquez-Mosquera, E. Cortes-Pereira, A.
Mosquera, M. Fernandez-Moreno, N. Oreiro, C. Fernandez-Lopez, J. L. Fernandez, I.
Rego-Perez, and F. J. Blanco, 2013, Genome-wide DNA methylation analysis of articular
chondrocytes reveals a cluster: Ann Rheum Dis.
Gagnon-Bartsch, J. A., and T. P. Speed, 2012, Using control genes to correct for unwanted
variation in microarray data: Biostatistics, v. 13, p. 539-52.
Geeroms, N., W. Verbeke, and P. Van Kenhove, 2008, Consumers' health-related motive
orientations and ready meal consumption: Appetite, v. 51, p. 704-12.
Goh, L., and N. Kasabov, 2005, An integrated feature selection and classification method to
select minimum number of variables on the case study of gene expression data: J
Bioinform Comput Biol, v. 3, p. 1107-36.
Graham, C. D., M. R. Rose, M. Hankins, T. Chalder, and J. Weinman, 2013, Separating
emotions from consequences in muscle disease: comparing beneficial and: J Psychosom
Res, v. 74, p. 320-6.
Hannum, G., J. Guinney, L. Zhao, L. Zhang, G. Hughes, S. Sadda, B. Klotzle, M. Bibikova, J. B.
Fan, Y. Gao, R. Deconde, M. Chen, I. Rajapakse, S. Friend, T. Ideker, and K. Zhang,
2013, Genome-wide methylation profiles reveal quantitative views of human aging rates:
Mol Cell, v. 49, p. 359-67.
Harding, P., X. P. Yang, J. Yang, E. Shesely, Q. He, and M. C. LaPointe, 2010, Gene expression
profiling of dilated cardiomyopathy in older male EP4 knockout: Am J Physiol Heart
Circ Physiol, v. 298, p. H623-32.
Houseman, E. A., W. P. Accomando, D. C. Koestler, B. C. Christensen, C. J. Marsit, H. H.
Nelson, J. K. Wiencke, and K. T. Kelsey, 2012, DNA methylation arrays as surrogate
measures of cell mixture distribution: BMC Bioinformatics, v. 13, p. 86.
Houseman, E. A., B. C. Christensen, R. F. Yeh, C. J. Marsit, M. R. Karagas, M. Wrensch, H. H.
Nelson, J. Wiemels, S. Zheng, J. K. Wiencke, and K. T. Kelsey, 2008, Model-based
clustering of DNA methylation array data: a recursive-partitioning algorithm for high-
dimensional data arising as a mixture of beta distributions: BMC Bioinformatics, v. 9, p.
365.
Houseman, E. A., and T. A. Ince, 2014, Normal cell-type epigenetics and breast cancer
classification: a case study of cell mixture-adjusted analysis of DNA methylation data
from tumors, Cancer Inform, v. 13: New Zealand, p. 53-64.
Houseman, E. A., J. Molitor, and C. J. Marsit, 2014, Reference-free cell mixture adjustments in
analysis of DNA methylation data: Bioinformatics.
Johnson, W. E., C. Li, and A. Rabinovic, 2007, Adjusting batch effects in microarray expression
data using empirical Bayes methods: Biostatistics, v. 8, p. 118-27.
Kim, E. Y., S. Y. Kim, D. Ashlock, and D. Nam, 2009a, MULTI-K: accurate classification of
microarray subtypes using ensemble k-means: BMC Bioinformatics, v. 10, p. 260.
Kim, E. Y., S. Y. Kim, D. Ashlock, and D. Nam, 2009b, MULTI-K: accurate classification of
microarray subtypes using ensemble k-means clustering: BMC Bioinformatics, v. 10, p.
260.
Koestler, D. C., B. Christensen, M. R. Karagas, C. J. Marsit, S. M. Langevin, K. T. Kelsey, J. K.
Wiencke, and E. A. Houseman, 2013a, Blood-based profiles of DNA methylation predict
the underlying distribution of cell types: a validation analysis: Epigenetics, v. 8, p. 816-
26.
Koestler, D. C., B. C. Christensen, C. J. Marsit, K. T. Kelsey, and E. A. Houseman, 2013b,
Recursively partitioned mixture model clustering of DNA methylation data using: Stat
Appl Genet Mol Biol, p. 1-16.
Koestler, D. C., C. J. Marsit, B. C. Christensen, M. R. Karagas, R. Bueno, D. J. Sugarbaker, K. T.
Kelsey, and E. A. Houseman, 2010, Semi-supervised recursively partitioned mixture
models for identifying cancer: Bioinformatics, v. 26, p. 2578-85.
Leek, J. T., W. E. Johnson, H. S. Parker, A. E. Jaffe, and J. D. Storey, 2012, The sva package for
removing batch effects and other unwanted variation in high-throughput experiments:
Bioinformatics, v. 28, p. 882-3.
Leek, J. T., and J. D. Storey, 2007, Capturing heterogeneity in gene expression studies by
surrogate variable analysis: PLoS Genet, v. 3, p. 1724-35.
Leek, J. T., and J. D. Storey, 2008, A general framework for multiple testing dependence, Proc
Natl Acad Sci U S A, v. 105: United States, p. 18718-23.
Machulda, M. M., J. L. Whitwell, J. R. Duffy, E. A. Strand, P. M. Dean, M. L. Senjem, C. R.
Jack, Jr., and K. A. Josephs, 2013, Identification of an atypical variant of logopenic
progressive aphasia: Brain Lang.
Maugis, C., G. Celeux, and M. L. Martin-Magniette, 2009, Variable selection for clustering with
Gaussian mixture models: Biometrics, v. 65, p. 701-9.
McLachlan, G. J., 1992, Cluster analysis and related techniques in medical research: Stat
Methods Med Res, v. 1, p. 27-48.
Noushmehr, H., D. J. Weisenberger, K. Diefes, H. S. Phillips, K. Pujara, B. P. Berman, F. Pan, C.
E. Pelloski, E. P. Sulman, K. P. Bhat, R. G. Verhaak, K. A. Hoadley, D. N. Hayes, C. M.
Perou, H. K. Schmidt, L. Ding, R. K. Wilson, D. Van Den Berg, H. Shen, H. Bengtsson,
P. Neuvial, L. M. Cope, J. Buckley, J. G. Herman, S. B. Baylin, P. W. Laird, and K.
Aldape, 2010, Identification of a CpG island methylator phenotype that defines a distinct
subgroup of glioma: Cancer Cell, v. 17, p. 510-22.
Paykel, E. S., and A. J. Henderson, 1977, Application of cluster analysis in the classification of
depression. A: Neuropsychobiology, v. 3, p. 111-9.
Plerou, V., P. Gopikrishnan, B. Rosenow, L. A. Amaral, T. Guhr, and H. E. Stanley, 2002,
Random matrix approach to cross correlations in financial data: Phys Rev E Stat Nonlin
Soft Matter Phys, v. 65, p. 066126.
Reinius, L. E., N. Acevedo, M. Joerink, G. Pershagen, S. E. Dahlen, D. Greco, C. Soderhall, A.
Scheynius, and J. Kere, 2012, Differential DNA methylation in purified human blood
cells: implications for cell lineage and studies on disease susceptibility: PLoS One, v. 7, p.
e41361.
Rocke, D. M., 1993, On the Beta transformation family, TECHNOMETRICS, p. 72-81.
Santagata, S., A. Thakkar, A. Ergonul, B. Wang, T. Woo, R. Hu, J. C. Harrell, G. McNamara, M.
Schwede, A. C. Culhane, D. Kindelberger, S. Rodig, A. Richardson, S. J. Schnitt, R. M.
Tamimi, and T. A. Ince, 2014, Taxonomy of breast cancer based on normal cell
phenotype predicts outcome: J Clin Invest, v. 124, p. 859-70.
Sar, V., O. Taycan, N. Bolat, M. Ozmen, A. Duran, E. Ozturk, and H. Ertem-Vehid, 2010,
Childhood trauma and dissociation in schizophrenia: Psychopathology, v. 43, p. 33-40.
Schadt, E. E., C. Li, B. Ellis, and W. H. Wong, 2001, Feature extraction and normalization
algorithms for high-density oligonucleotide gene expression array data: J Cell Biochem
Suppl, v. Suppl 37, p. 120-5.
Storey, J. D., J. M. Akey, and L. Kruglyak, 2005, Multiple locus linkage analysis of genomewide
expression in yeast: PLoS Biol, v. 3, p. e267.
Sun, Z., H. S. Chai, Y. Wu, W. M. White, K. V. Donkena, C. J. Klein, V. D. Garovic, T. M.
Therneau, and J. P. Kocher, 2011, Batch effect correction for genome-wide methylation
data with Illumina Infinium platform: BMC Med Genomics, v. 4, p. 84.
Suva, M. L., N. Riggi, and B. E. Bernstein, 2013, Epigenetic reprogramming in cancer: Science,
v. 339, p. 1567-70.
Tadesse, M. G., J. G. Ibrahim, R. Gentleman, S. Chiaretti, J. Ritz, and R. Foa, 2005, Bayesian
error-in-variable survival model for the analysis of GeneChip arrays: Biometrics, v. 61, p.
488-97.
Teschendorff, A. E., F. Marabita, M. Lechner, T. Bartlett, J. Tegner, D. Gomez-Cabrero, and S.
Beck, 2013, A beta-mixture quantile normalization method for correcting probe design
bias in Illumina Infinium 450 k DNA methylation data: Bioinformatics, v. 29, p. 189-96.
Teschendorff, A. E., U. Menon, A. Gentry-Maharaj, S. J. Ramus, D. J. Weisenberger, H. Shen,
M. Campan, H. Noushmehr, C. G. Bell, A. P. Maxwell, D. A. Savage, E. Mueller-
Holzner, C. Marth, G. Kocjan, S. A. Gayther, A. Jones, S. Beck, W. Wagner, P. W. Laird,
I. J. Jacobs, and M. Widschwendter, 2010, Age-dependent DNA methylation of genes
that are suppressed in stem cells is a hallmark of cancer: Genome Res, v. 20, p. 440-6.
Teschendorff, A. E., J. Zhuang, and M. Widschwendter, 2011, Independent surrogate variable
analysis to deconvolve confounding factors in large-scale microarray profiling studies:
Bioinformatics, v. 27, p. 1496-505.
Tothill, R. W., A. V. Tinker, J. George, R. Brown, S. B. Fox, S. Lade, D. S. Johnson, M. K.
Trivett, D. Etemadmoghadam, B. Locandro, N. Traficante, S. Fereday, J. A. Hung, Y. E.
Chiew, I. Haviv, D. Gertig, A. DeFazio, and D. D. Bowtell, 2008, Novel molecular
subtypes of serous and endometrioid ovarian cancer linked to: Clin Cancer Res, v. 14, p.
5198-208.
Triche, T. J., Jr., D. J. Weisenberger, D. Van Den Berg, P. W. Laird, and K. D. Siegmund, 2013,
Low-level processing of Illumina Infinium DNA Methylation BeadArrays: Nucleic Acids
Res, v. 41, p. e90.
Tritchler, D., E. Parkhomenko, and J. Beyene, 2009, Filtering genes for cluster and network
analysis: BMC Bioinformatics, v. 10, p. 193.
Wang, S., and J. Zhu, 2008, Variable selection for model-based high-dimensional clustering and
its application to microarray data: Biometrics, v. 64, p. 440-8.
Wang, X., G. H. Kang, M. Campan, D. J. Weisenberger, T. I. Long, W. Cozen, L. Bernstein, A.
H. Wu, K. D. Siegmund, D. Shibata, and P. W. Laird, 2011, Epigenetic subgroups of
esophageal and gastric adenocarcinoma with differential: PLoS One, v. 6, p. e25985.
Wang, X., P. W. Laird, T. Hinoue, S. Groshen, and K. D. Siegmund, 2014, Non-specific filtering
of beta-distributed data: BMC Bioinformatics, v. 15, p. 199.
Weinstein, J., R. Akbani, B. Broom, W. Wang, R. Verhaak, D. McConkey, S. Lerner, M.
Morgan, C. Creighton, C. Smith, D. Kwiatkowski, A. Cherniack, J. Kim, C. Sekhar
Pedamallu, M. Noble, H. Al-Ahmadie, V. Reuter, J. Rosenberg, D. Bajorin, B. Bochner,
D. Solit, T. Koppie, B. Robinson, D. Gordenin, D. Fargo, L. Klimczak, S. Roberts, J. Au,
P. Laird, T. Hinoue, et al., 2014, Comprehensive molecular characterization of
urothelial bladder carcinoma: Nature, v. 507, p. 315-22.
Weisenberger, D. J., K. D. Siegmund, M. Campan, J. Young, T. I. Long, M. A. Faasse, G. H.
Kang, M. Widschwendter, D. Weener, D. Buchanan, H. Koh, L. Simms, M. Barker, B.
Leggett, J. Levine, M. Kim, A. J. French, S. N. Thibodeau, J. Jass, R. Haile, and P. W.
Laird, 2006, CpG island methylator phenotype underlies sporadic microsatellite
instability and: Nat Genet, v. 38, p. 787-93.
Abstract
DNA methylation plays critical roles in a variety of biological processes, such as aging and carcinogenesis. DNA methylation data from Illumina Infinium platforms are measured as the proportion of total intensity due to methylated alleles: values bounded between 0 and 1 that are assumed to follow a Beta distribution. Because of the particularities of this distribution, methods that work well for Gaussian-distributed data, such as gene expression microarrays, may not perform equally well on DNA methylation data.

This dissertation addresses two issues that arise with high-throughput DNA methylation data from the Infinium platforms. One is feature selection for cluster analysis to discover novel sample subgroups; the other is correcting the batch effect introduced when samples are processed and measured in different batches. Methods used in these two processes to strengthen the biological signal of interest are evaluated and compared using both simulation and real data sets.
Asset Metadata
Creator: Wang, Xinhui (author)
Core Title: Finding signals in Infinium DNA methylation data
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Biostatistics
Publication Date: 07/29/2017
Defense Date: 05/27/2015
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: batch effect correction, beta distribution, DNA methylation, feature selection, Infinium, OAI-PMH Harvest
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Siegmund, Kimberly (committee chair); Azen, Stanley P. (committee member); Groshen, Susan L. (committee member); Hacia, Joseph (committee member); Lewinger, Juan Pablo (committee member)
Creator Email: xinhuiwa@gmail.com, xinhuiwa@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-615791
Unique Identifier: UC11302822
Identifier: etd-WangXinhui-3758.pdf (filename); usctheses-c3-615791 (legacy record id)
Legacy Identifier: etd-WangXinhui-3758.pdf
Dmrecord: 615791
Document Type: Dissertation
Rights: Wang, Xinhui
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA