Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Preprocessing and analysis of DNA methylation microarrays
(USC Thesis Other)
Preprocessing and analysis of DNA methylation microarrays
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
PREPROCESSING AND ANALYSIS OF DNA METHYLATION MICROARRAYS
by
Timothy J. Triche, Jr.
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(STATISTICAL GENETICS)
August 2013
Copyright 2013 Timothy J. Triche, Jr.
Dedication
To my wife, Catherine, and my daughter, Isabel, for your love and support.
ii
Acknowledgments
This dissertation is the product of several years of effort by many people, not one.
Though any mistakes in its presentation are my own, the substance of this work could
not exist without the patience, support, and guidance of my committee members. Kim
Siegmund, my chair, undoubtedly bore the brunt of my inexperience in academic writ-
ing, and for her patience and rigor (in particular) I am deeply grateful. Juan Pablo
Lewinger provided my first opportunity as a coauthor during his postdoctoral fellow-
ship, and substantial insights both simple and subtle throughout my graduate career.
Frank Gilliland’s experimental control samples were pivotal in the first paper I could
truly claim to have written. Gareth James patiently explained penalized regression and
convex optimization to me and probed weak points in my understanding of estimation.
Peter W. Laird, my outside member and P.I., supported my efforts both financially and
scientifically, in addition to sharpening my intuition and providing opportunities to par-
ticipate in landmark studies that very few graduate students will ever have, along with
a collegial and dynamic center for research at the USC Epigenome Center. To each of
you I am sincerely and humbly grateful.
During the years taken to complete this work, my family – my wife Catherine, my
daughter Isabel, my father, my mother, my sister Rachel, and my younger brother Mar-
tin – never wavered in their support of my efforts. I cannot thank you enough. The
iii
frustrating pace of progress at times caused me to wonder if there would be an end to
this journey. There are those who can complete a dissertation without a family’s support,
but I am not among them. To my family as well, I am forever indebted.
Last but certainly not least, my fellow graduate students in the Statistical Genet-
ics program (particularly James Baurley and Tommy Ly) and colleagues at the USC
Epigenome Center (particularly Hui Shen, Ben Berman, Dan Weisenberger, Zack Ram-
jan, Toshinori Hinoue, and Moiz Bootwalla) have shaped the way I approach challenges,
evaluate solutions, and communicate results to those I would like to regard as peers.
There are many other people who deserve thanks, but you are the ones who pushed me
to complete this endeavor, many of you traveling the same path.
Thank you all – for your friendships, your love, your encouragement, your criticism,
your intution, and most of all, your support.
iv
Table of Contents
Dedication ii
Acknowledgments iii
List of Figures vi
List of Tables vii
Abstract viii
Introduction 1
Preprocessing of Illumina HumanMethylation BeadArrays 6
Analysis of differential methylation via Beta regression 35
Conclusion 57
References 61
v
List of Figures
1.1 Transfer of a methyl group from S-adenosylmethionine. . . . . . . . 1
1.2 Two classes of transcription start sites in the human genome. . . . . 3
2.3 The importance of background and dye bias correction. . . . . . . . 7
2.4 Infinium methylation probe designs (from Illumina technical note
270-2012-001). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Distribution of control and OOB intensities. . . . . . . . . . . . . . 23
2.6 Distribution of raw and corrected intensities and beta values. . . . . 24
2.7 QQ-plots for norm-exp and gamma-gamma convolutions. . . . . . . 25
2.8 Probe-level standard deviations of PBL samples. At left, beta val-
ues are the proportion of total probe intensity at a locus attributable
to the methylated probe allele; at right, M-values are defined as the
log2-ratio of the methylated to unmethylated probe allele intensity
at a locus. ’Raw’ refers to results obtained without background cor-
rection or dye bias equalization. Other methods (’q5’,’lumi’, etc.)
are described at length in Table 2.1 and were additionally corrected
for red/green dye bias effects. . . . . . . . . . . . . . . . . . . . . . 26
2.9 Probe-level bias in mixture samples. . . . . . . . . . . . . . . . . . 28
2.10 Probe-level bias by probe chemistry. . . . . . . . . . . . . . . . . . 29
2.11 Probe-level RMS error in mixture samples. . . . . . . . . . . . . . . 30
2.12 F-test statistics for (a) HapMap and (b) TCGA AML replicates. . . . 31
3.13 Validated loci associated with age in whole blood, by method. . . . 50
3.14 CD4+ cells: loci validated by any (left) or all (right) methods. . . . . 51
3.15 CD4+ cells: loci uniquely detected by each individual method. . . . 52
vi
List of Tables
2.1 Background correction methods for HumanMethylation arrays . . . 13
3.1 Loci (out of 381552) associated with lineage in blood . . . . . . . . 46
3.2 Realized FDR in simulations, at a nominal< 0:05. . . . . . . . . 47
3.3 Simulated statistical power, by test (inflated FDRs are struck) . . . . 48
3.4 Loci (out of 381552) associated with age in blood & brains . . . . . 49
vii
Abstract
In the first chapter of this dissertation, I review the importance of cytosine methylation
in vertebrate genomes. In the second chapter, I present methods which leverage a design
feature of the most widely used DNA methylation microarrays to ameliorate common
technical artifacts. In the third chapter, I present an investigation of sensitive methods
for detecting differential DNA methylation. In the final chapter, I discuss the impact of
these findings for population-scale studies of DNA methylation.
Taken together, these chapters describe versatile and widely applicable approaches to
study DNA methylation at the scale demanded by modern investigators. If recent years
are any indication, the size and complexity of such studies will only increase. Changes
to genomic sequence are relatively rare and typically deleterious, thus the genomes of
healthy somatic cells from different tissues in an individual are with few exceptions
nearly identical. The epigenomes of these same cells differ radically. Furthermore, the
erasure of paternal DNA methylation marks at conception may be incomplete, offering a
potential route for epigenetic differences to persist across generations as well as tissues.
Unraveling the roles and functional interactions of genomic and epigenomic changes
in health and disease may thus become one of the great scientific and informatic chal-
lenges of our time. This dissertation presents effective analytic and informatic practices,
in an attempt to guide future investigators as they rise to the challenge.
viii
Introduction
DNA methylation at the 5’ position of cytosine (the biological basis of which is shown
in Figure 1.1) is the most readily accessible aspect of what has come to be known as
the epigenome – a collection of chemical modifications which alter the properties or
accessibility of genomic DNA, rather than the sequence of its constituent nucleotide
bases. These modifications appear to play crucial roles in vertebrate development and
disease. The exact nature of these roles remains imperfectly understood. Nonethe-
less, an expanding corpus of research has taken advantage of quantitative and relatively
inexpensive genome-scale assays to characterize DNA methylation across populations
and tissues, with particular interest in the mediation of potentially heritable interactions
between the environment and the genome. So-called “epialleles,” present at birth and
persisting into childhood (Beyan et al. (2012)) suggest that the more plastic nature of
DNA methylation, in contrast to the more permanent nature of DNA mutations, might
play a role in heredity. This evidence is far from conclusive, but it offers a tantaliz-
ing alternative possibility for information to pass from parents to offspring, and thus a
potential explanation for the “missing heredity” problem in genetics (Slatkin (2009)).
Figure 1.1: Transfer of a methyl group from S-adenosylmethionine.
1
Far more conclusive evidence for a conserved role of cytosine methylation in verte-
brates comes from the depletion, and focally clustered retention, of CpG dinucleotides
across vertebrate genomes relative to expectation. The spontaneous deamination of 5-
methylcytosine produces thymine; the resulting change, unlike the uracil base which
results from deamination of unmethylated cytosines, is not detected by DNA repair
enzymes as abnormal. Most cytosines in CG dinucleotides within vertebrate genomes
are methylated; most cytosines outside of CpGs are not. Consequently, the sponta-
neous transversion mutations from methylated cytosines to thymines have, on evolu-
tionary timescales, severely depleted mammalian genomes of CpG dinucleotides. How-
ever, substantial clustered regions of CpG dinucleotides do survive in the mammalian
genome. These stretches of DNA – termed CpG islands – are distinguished from the
genome at large not only by their CpG density, but also by a lack of methylation in most
cells. It is unlikely that this occurred by chance alone. CpG islands possess distinct fea-
tures (Deaton and Bird (2011)) which simplify the regulation of gene transcription and
local chromatin state, and it appears that selective pressure favored the resulting plas-
ticity, especially in complex multicellular organisms. The importance of this selective
advantage in simplifying transcriptional regulation is suggested by its pervasive nature
and the association of these islands with transcriptionally active regions of the genome.
From the perspective of CpG density, there are two distinct classes of transcription
start sites (which typically serve as gene promoter regions) in the human genome. One
class resembles the rest of the genome in that it is CpG-depleted. The other class is
notably CpG-dense, and the transcripts regulated by these regions tend to be ubiqui-
tously expressed in differentiated cells. The latter class has biologically relevant prop-
erties (Deaton and Bird (2011)) that are of particular advantage to more complex organ-
isms such as vertebrates. When I performed an updated analysis of transcription start
2
Figure 1.2: Two classes of transcription start sites in the human genome.
site CpG density, in the style of (Saxonov et al. (2006)), as shown in Figure 1.2, the dis-
covery of additional transcripts over the years since the original publication had altered
the relative proportions of each discovered class somewhat. However, the conclusion
remained sound: a plurality of known transcription start sites (by and large, gene pro-
moters) in the human genome are CpG-rich, and the CpG dinucleotides in these regions
tend to be unmethylated. (CpG-poor transcription start sites tend to abut genes which
are expressed in a tissue-specific pattern, whereas CpG-rich transcription start sites tend
to abut ubiquitously transcribed genes). This enrichment for CpG-dense and largely
unmethylated regions stands in contrast to the rest of the mammalian genome; the per-
sistence of these regions over millions of years is most easily ascribed to the regulatory
plasticity conferred by physical and genomic characteristics associated with their reten-
tion.
3
Cytosine methylation (Borgel et al. (2010)) and related modifications (Koh et al.
(2011)) are in fact so important in development that mammalian embryos lacking the
machinery to process these changes not only die, but typically fail to be born (Koh
et al. (2011)). A substantial fraction of malignancies exhibit striking genome-wide
aberrations in DNA methylation (Lengauer et al. (1997)), some in the absence of con-
sistent patterns of genetic mutations (Griffiths et al. (2010)), bolstering the hypothesis
that aberrant DNA methylation in cancerous cells offers an alternative route to bypass
normal regulation of genes involved in proliferation and mechanisms of tumor suppres-
sion. Certain cancerous cells with distinct epigenetic profiles behave as if they have
reverted to a near-embryonic state (Mihara et al. (2007)), where differentiation appears
to be suppressed by changes in DNA methylation. The methylation of genomic DNA is
widespread and conserved not just in humans or mammals, but in nearly all vertebrates
(Maunakea et al. (2010)). DNA methylation thus plays an important role in develop-
ment and disease. The relatively stable nature of cytosine methylation, and its heri-
tability across cell divisions, holds promise as a potential biomarker for exposures, and,
perhaps, heritable epigenetic differences (Slatkin (2009)) between individuals. How-
ever, the study of such differences has historically been complicated by the difficulty
and expense of genome-scale assays of DNA methylation (Laird (2010)).
In recent years, the repurposing of genotyping arrays to detect methylation at CpG
dinucleotides has ameliorated this concern. In its place we find concerns about reliably
processing hundreds or thousands of assays, each interrogating hundreds of thousands
of loci, with multiple dye chemistries across varying genomic contexts. Reliable experi-
mental inference is made more difficult by technical variations which can dwarf biolog-
ical differences if left uncontrolled. For this reason, my dissertation begins by address-
ing these technical variations using a novel convolution model to correct nonspecific
4
background fluorescence, and a simple (but effective) median-reference scheme for nor-
malizing dye bias which is observed on Illumina HumanMethylation450 beadarrays, the
most popular choice for cost-effective quantitative measurement of DNA methylation at
genomic scale.
5
Preprocessing of Illumina
HumanMethylation BeadArrays
In this chapter, I revisit a novel approach which I proposed for background correc-
tion of Infinium HumanMethylation BeadArrays. My coauthors and I described several
novel approaches to the control of technical variation in nonspecific fluorescence across
BeadArrays, along with a simple median normalization scheme to address issues of dye
bias seen on the HumanMethylation450 platform. By leveraging the vast population
of paired probes on Infinium methylation arrays, we improved sensitivity to measure
nonspecific fluorescence, and showed usable correction of dye bias effects.
Our approach capitalized on a new use for the Infinium I design bead types to mea-
sure non-specific fluorescence in the color channel opposite of their design (Cy3/Cy5).
This provided tens of thousands of features for measuring background fluorescence,
instead of the much smaller number of negative control probes on the platforms
(n=32 for HumanMethylation27 and n=614 for HumanMethylation450, respectively).
My coauthors and I compared the performance of our novel methods with existing
approaches, using technical replicates of both mixture samples and biological sam-
ples, and demonstrated that within- and between-platform artifacts can be substantially
reduced, with concomitant improvement in sensitivity, by our proposed methods. Fig-
ure 2.6 presents an intuitive rationale for both steps; all of the samples in each pane are
identical peripheral blood lymphocyte controls, yet their raw methylation summaries
vary substantially. After correcting two major sources of non-biological variation (the
only sort present in these particular samples), the variation between identical samples is
largely eliminated.
6
Figure 2.3: The importance of background and dye bias correction.
Above, six replicate samples of peripheral blood lymphocytes, run on six different
96-well plates. In the upper left quadrant, the beta values (proportion methylated probe
intensity as a fraction of the total) are shown for 482,421 targeted CpGs, without
background correction or dye bias equalization. In the upper right quadrant, dye bias
equalization has been applied, and in the lower left quadrant, a Gamma additive
convolution has been used to correct nonspecific fluorescence. At lower right, both
corrections have been applied. Note the substantial decrease in replicate variability.
Bisulfite conversion of unmethylated cytosines to uracils, followed by amplification,
allows existing genotyping and sequencing technologies to be repurposed for studies
of DNA methylation. Small (sub-microgram) amounts of DNA from hundreds of sam-
ples can be assayed at thousands of sites throughout the genome using high-density
7
microarrays (Sandoval et al. (2011)), for a fraction of the cost of whole-genome bisul-
fite resequencing. As with all microarray technologies, however, signal processing is an
important consideration in obtaining consistent, reproducible results (Siegmund (2011)).
The Illumina BeadArray platform, in the form of the HumanMethylation27 (HM27)
and HumanMethylation450 (HM450) assays, is widely used to study genomic DNA
methylation. At each targeted cytosine position, the fluorescence intensities of the
methylated and unmethylated signals are measured. DNA methylation level is esti-
mated by the ratio of DNA methylation intensity over the total, M/(M+U), where M and
U denote the average fluorescent signals from the methylated and unmethylated bead
types, respectively. Illumina adds a stabilizing constant of 100 in the denominator, and
calls the ratio the average Beta value. A benefit of Illumina’s HM450 BeadArray design
is that for each probe sequence, a median of 14 beads are randomly distributed on the
array, with each bead containing hundreds of thousands of oligonucleotides. The HM27
measures even more beads per probe. Further, Illumina organizes arrays into 12-sample
BeadChips offering great potential for automation. However, as with the analysis of
gene expression BeadChips, the sequential processing of samples can give rise to spe-
cific types of technical artifacts. Potential issues include the variation in background
fluorescence (Xie et al. (2009)), the positional effects on a single BeadChip (Verdugo
et al. (2009)), and efficiency of bisulfite conversion (Teschendorff et al. (2011)). We
focused on array-to-array variation in non-specific (background) fluorescence.
In microarray studies, background fluorescence appears to contribute an indepen-
dent additive error to the measure of signal. The observed intensity (also known as
foreground, Xf) is modeled as the sum of true signal (Xs) and background (Xb), Xf =
Xs+Xb. An additive bias on the intensity translates to a reduced dynamic range for
the Beta value. The goal of background correction methods is to estimate the true
8
signal from the observed foreground. The primary methods currently available for
HumanMethylation arrays use a simple background subtraction, a method that suffers
from truncation of the data at low intensity signals (e.g. GenomeStudio 2011.1 and
‘lumi’ package in Bioconductor 2.10). The de facto standard for expression microarray
background correction is the normal-exponential convolution, with the desirable fea-
ture of smooth interpolation, rather than truncation, of low intensity signals (Ritchie
et al. (2007)). We evaluated the use of convolution models for background correc-
tion of DNA methylation arrays, considering three models. First, we adapted the pop-
ular normal-exponential convolution model to the two-color BeadArray design; sec-
ond, we introduced a novel Gamma convolution model, a generalization of both the
normal-exponential and Gamma-exponential (Chen et al. (2011)); third, we adapted a
distribution-free approach previously proposed for Affymetrix GeneChips. For each of
the convolution models, we capitalized on the unique design of the Infinium I chemistry
DNA methylation probes, to vastly increase the number of features on the array from
which we measure background fluorescence. We found that convolution models using
the Infinium I chemistry probes to measure background improved the consistency of
measures at identical target sequences from multiple arrays of the same biological sam-
ples, and showed less bias than other methods in estimates of percent DNA methylation,
as determined from calibrated mixture samples.
The following sections describe the Illumina Infinium HumanMethylation BeadAr-
ray platforms used in the experiment, along with the particulars of extracting full signal
data from their scans. I also review the various methods, existing and novel, which we
evaluated in the course of the work, along with the experimental samples which permit
the evaluation, and finally, our results.
9
Methods and materials
The HumanMethylation27 and HumanMethylation450 BeadArrays
The Illumina HM27 array interrogates 27,578 targeted CpGs. Each array includes 144
control bead types, 32 of which are ‘negative’ controls designed to not match any
genomic regions and thus measure background fluorescence. The HM450 array interro-
gates 485,577 targets, of which 482,421 are CpGs (99.3%), 3,091 are CpHs (0.6%), and
65 are SNPs (0.1%). A total of 850 control bead types are assayed, 614 of which are
‘negative’ controls. Six hundred of the negative controls are used by GenomeStudio for
background correction.
The bead types on the array utilize two probe chemistries (Infinium I and Infinium
II) and two color dyes (Cy3-Green/Cy5-Red) (Sandoval et al. (2011)). Infinium I probes
are paired 50-base oligonucleotides, with one terminating directly across from a methy-
lated ‘C’ and the second from an unmethylated ‘C’. The extension base fluorophore
dictates the color channel (Cy3 for ‘C’ and Cy5 for ‘A’ or ‘T’), and is the same for
both the methylated and unmethylated probes within a pair, so that both probes fluo-
resce at the same wavelength. Infinium II probes have a single 49-base oligonucleotide
that hybridizes to the bisulfite-converted target sequence in a non-DNA methylation
dependent fashion upstream of the target CpG. The DNA methylation of the interro-
gated CpG is measured by extension of either a Cy3-linked base for methylated sites
or a Cy5-linked base for unmethylated sites. These probes are designed to fluoresce
at either wavelength (Cy3 for methylated, Cy5 for unmethylated). All probes on the
HM27 array use the Infinium I chemistry; the HM450 array uses a combination of both
(30% Infinium I and 70% Infinium II). The differences between the two probe designs
described above are illustrated in Figure2.4, as provided by the manufacturer. For our
10
Figure 2.4: Infinium methylation probe designs (from Illumina technical note
270-2012-001).
purposes, the most important aspect of the Infinium I probe design is that it is a single-
wavelength design which is scanned at both wavelengths.
As probe pairs using the Infinium I chemistry fluoresce at the same wavelength, all
fluorescence from probes of this design at the wavelength of the opposite fluorophore –
that which is not the extension base – can be used to estimate nonspecific fluorescence
11
across the array. All 27,578 probes on the HM27 array and 135,501 (30%) of the probes
on the HM450 array are of this type, and provide fluorescence signals in the opposite
channel from probe design (“out-of-band”). On the HM27 array, 15,500 Cy3 and 12,078
Cy5 probe pairs yield 31,000 and 24,156 features in each channel, respectively, numbers
in vast excess of the 32 ‘negative’ controls on the platform. The corresponding numbers
for the HM450 array are 92,596 Cy3 and 178,406 Cy5 features for estimating non-
specific fluorescence. We used these sets of features when correcting for background
using the convolution models described below.
Data extraction
The Illumina GenomeStudio software does not provide signal intensities for Infinium I
probes in the opposite channel from the designed fluorophore. Therefore, we enhanced
routines originally implemented by Dr. Keith Baggerly, and later incorporated into the
‘crlmm’ package in Bioconductor (Ritchie et al. (2009)) to recover this “out-of-band”
information from the binary .IDAT files produced by Illumina scanners. These proce-
dures, and all processing methods described below, are available through Bioconductor
in the package ‘methylumi’. A more recent implementation, decoupled from any par-
ticular generation of Infinium array design, is also available as the ’illuminaio’ package,
also in Bioconductor.
Background correction methods
For all background correction methods, probes are corrected within each color chan-
nel, after pooling all types of specific probes assayed in that channel. The green chan-
nel probes include all Cy3 channel Infinium I probe pairs and the methylated cytosine
12
Name Method Controls Distributions
noob Norm-Exp convolution Out-of-band intensities X
b
N(;
2
)
X
s
Exp()
goob Gamma-Gamma convolution Out-of-band intensities X
b
(;)
X
s
(
;)
doob Distribution-free convolution Out-of-band intensities none
normexp Norm-Exp convolution Negative controls X
b
N(;
2
)
X
s
Exp()
dfcm Distribution-free convolution Negative controls none
q5 Subtract fifth percentile Negative controls none
lumi Background subtraction Methylated intensities none
Table 2.1: Background correction methods for HumanMethylation arrays
Infinium II probes on the HM450 arrays. The red channel probes include the Cy5 chan-
nel Infinium I probe pairs and the unmethylated cytosine Infinium II probes. For clarity
of presentation, the color channel information is not noted for the methods detailed
below. I provide a summary of all the background approaches in Table 2.1.
Normal-exponential convolution.
LetX
s
be the fluorescence intensity from signal,X
b
the background intensity, andX
f
the observed total intensity. An additive model for the observed intensity is
X
f
=X
s
+X
b
13
In the normal-exponential formulation, the background intensity is modeled as X
b
Normal(;), and the signal asX
s
Exp(), with parameters to be estimated from
the data and controls. IfX
s
has an exponential distribution, the density ofX
s
is
f
Xs
(x
s
) =
1
e
xs
If the backgroundX
b
N(;
2
), the joint distribution ofX
s
andX
b
is:
f
XsX
b
(x
s
;x
b
;;;) =
1
p
2
2
e
1
2
(x
b
)
2
2
1
e
xs
The transformationX
f
=X
s
+X
b
yields the joint density of the signalX
s
andX
f
:
f
XsX
f
(x
s
;x
f
) =
1
p
2
2
e
1
2
(x
f
xs)
2
2
1
e
xs
If
sf
=x
f
2
=
, integrating overX
s
yields the marginal distribution ofx
f
:
f
X
f
(x
s
;x
f
;;;) =
1
e
x
f
+
1
2
2
2
Z
1
0
1
p
2
2
e
1
2
(xs
sf
)
2
2
dx
s
=
1
e
x
f
+
1
2
2
2
(1 (0;
xs:x
f
;
2
)
The conditional distribution ofX
s
givenX
f
is the joint over the marginal:
f
XsjX
f
(x
s
jx
f
;;;) =
1
p
2
2
e
1
2
(xs
sf
)
2
2
1 (0;
sf
;
2
)
Given the score of the joint distribution relative to
sf
,
@ log(f(x
s
jx
f
))
@
sf
=
1
2
(x
s
sf
)
(0;
sf
;
2
)
1 (0;
sf
;
2
)
14
The expectation of the signalX
s
given the observedX
f
is (per Ritchie et al. (2007))
E[X
s
jX
f
] =
sf
+
2
(0;
sf
;
2
)
1 (0;
sf
;
2
)
This method is described in detail by (Ritchie et al. (2007)). We estimated param-
eters from the background distribution using the control probes, and the signal param-
eter
from the observed foreground intensities with the background mean subtracted.
Using the conditional expectation smoothly interpolated probes with intensities near the
background level. We evaluated the approach using two separate populations of control
probes: a.) negative control probes (‘normexp’) and b.) ‘out-of-band’ Infinium I probes
(‘noob’=Normal-exponential using out-of-band probes) as described previously.
Gamma convolution.
If we free the shape parameter
to vary in the Normal-Exponential convolution, we
haveX
s
Gamma(
;), and the density ofX
s
is
f
Xs
(x
s
) =
1
(
)
x
1
s
e
xs
where (x
s
) is Euler’s Gamma function, (x
s
) =
R
1
0
t
xs1
e
t
dt for allx
s
2Z.
15
If the background is also generalized, so thatX
b
Gamma(;), then
f
XsX
b
(x
s
;x
b
;
;;;) =
1
(
)()
x
1
b
x
1
s
e
xs
x
b
The joint distribution ofX
s
and the observedX
f
=X
s
+X
b
is thus
f
XsX
f
(x
s
;x
f
;
;;;) = (
(
)())
1
(x
f
x
s
)
1
x
1
s
e
xs
x
f
xs
The marginal distribution ofX
f
is
f
X
f
(x
s
;x
f
;
;;;) =
e
x
f
1
f
+1
F
1
[
;
+;x
f
1
1
](
(
+))
so the conditional distribution ofX
s
givenX
f
is the joint over the marginal, or
f
XsjX
f
(x
s
jx
f
) =
e
xs(
1
1
)
x
1
f
(x
f
x
s
)
1
x
1
s
1
F
1
[
;
+;x
f
(
1
1
)]
(
)()
(
+)
1
F
1
(a;b;z) in the denominator is the confluent Kummer hypergeometric function:
1
F
1
(a;b;z) =
1
X
n=0
(a)
n
(b)
n
z
n
n!
=
1
X
n=0
(a +n)(b)
(b +n)(a)
z
n
n(n)
Note that, withB(a;b) indicating the Beta function (a)(b)=(a +b),
f
XsjX
f
(x
s
jx
f
) =
e
xs(
1
1
)
x
1
f
(x
f
x
s
)
1
x
1
s
B(
;)
1
F
1
[
;
+;x
f
(
1
1
)]
16
For x
f
large enough to cause divergence in the denominator, E[X
s
jX
f
]! x
f
E[X
b
]. Elsewhere,E[X
s
jX
f
=x
f
] is well behaved and amenable to quadrature as
E[X
s
jX
f
=x
f
] =
Z
x
f
0
e
t(
1
1
)
x
1
f
(x
f
t)
1
t
1
B(
;)
1
F
1
[
;
+;x
f
(
1
1
)]
tdt
When X
f
' ( + 3
2
), E[X
s
jX
f
] is complicated by very large terms in the
denominator, leading to over- and underflow problems. Fortunately,E[X
s
jX
f
] becomes
indistinguishable fromX
f
E[X
b
] beyond this point.
I developed and implemented this convolution method, with background inten-
sity distribution Gamma(,) and signal abundance distribution Gamma(
,), in
the Bioconductor package ’methylumi’. A fast maximum likelihood estimator
(http://research.microsoft.com/en-us/um/people/minka/papers/minka-gamma.pdf) esti-
mated the parameters from the data, and numerical integration was then performed to
obtain the conditional expectation of the signal given the foreground and background
intensities. To avoid instability in the numerical integration step, we used a simple
heuristic: if an observation is greater than three standard deviations above the mean of
the background, we simply subtracted the background mean. In simulations, differences
between the integral computed to arbitrary precision via the ’mpmath’ package, and the
approximation we use, were so small as to vanish at 64-bit machine precision. We call
this method ‘goob’ for Gamma out-of-band background correction.
Distribution-free convolution model.
For comparison to the above model-based approaches, we implemented a variation of
the distribution-free background correction method described in (Chen et al. (2009)),
which we adapted to utilize control beads or out-of-band fluorescence from the Illumina
Infinium platform.
17
Let
x
s
=
8
>
>
<
>
>
:
x
f
; if x
f
^ + 2^
1 + (x
f
min)(
2^ 1
^ +2^ min
otherwise:
where ^ is the mean of the control bead types, ^ is
p
2 times the sample standard
deviation of the intensities smaller than ^ , a robust measure of standard deviation for
the skewed noise distribution, and min is the minimum intensityx
f
. For smallx
f
, the
correction is a linear interpolation with slope defined by the minimum intensity value,
and the background mean and robust estimate of standard deviation. For largex
f
, the
correction is background subtraction. We evaluated this method using two different
populations of control probes: a.) negative control probes (‘dfcm’=distribution-free
convolution model) and b.) ‘out-of-band’ Infinium I probes (‘doob’=distribution-free
using out-of-band probes).
The convolution models were compared to the two background subtraction meth-
ods currently available for the platform. The first was the method documented in the
most recent release of Illumina’s GenomeStudio software (ver2011.1), subtracting the
5th percentile of the negative control probe intensities on the array. The second was
the method implemented by the ‘lumi’ package in Bioconductor version 2.10, subtract-
ing the mode of the lower half of the distribution of methylated probe intensities. For
both of these approaches the maximum of the background-subtracted value and one was
returned.
All methods employed an offset of 15, which was selected based on earlier results
for gene expression arrays (Shi et al. (2010)), and evaluation of the probe-specific vari-
ance for Beta values from technical replicates. Also, for all HM450 data, we applied
18
an additional global dye-bias equalization step to control for the different average inten-
sities in the red and green channels. This procedure scaled the background corrected
intensities, dividing by the average intensity of the positive control probes in the same
channel, red or green, and multiplying by the average intensity of all positive controls
in a reference array. The reference array was selected as the one with the smallest dif-
ference in average red and average green positive control intensity. This is similar to the
approach implemented in GenomeStudio (ver2011.1), but can select a different array as
the reference. (One potential improvement to this method, which can improve through-
put in the preprocessing of very large experiments, will be discussed in the final chapter
of this dissertation.)
DNA methylation summary measures
We evaluated the effects of different background correction methods on the bias and
variance of DNA methylation estimates. The background-corrected and dye-bias equal-
ized values xs were separated into the intensity estimates, ms and us, for the methylated
and unmethylated cytosines, respectively, and two summaries of DNA methylation were
evaluated: i.) Beta value = ms/(ms+us), a value between 0 and 1, and ii.) M-value =
log2(ms/us), the logit transformed Beta value (=log2(Beta/(1-Beta)) with unrestricted
boundaries (as described in Du et al. (2010)).
Illumina HumanMethylation Data sets
Four data sets were used to compare the benefits of differing background correction
methods for DNA methylation arrays. We used technical replicates to evaluate array-
to-array variation for each probe, and calibrated mixture samples to evaluate bias and
19
root mean squared error in estimating absolute percent DNA methylation, or the cor-
responding M-value. Finally, we evaluated the methods using two real data sets that
contain replicate sample measurements. The real data studies allowed us to evaluate
the technical variation for large numbers of independent samples for two different cell
types.
Technical replicates.
Six technical replicates (1g genomic DNA each, extracted from commercially avail-
able pooled human male peripheral blood lymphocytes, or PBLs) were assayed as part
of a larger experiment (Raby et al. (2011)). Each 96-well plate of the study included
the random placement of one technical replicate of the control PBL. Bisulfite conver-
sion and whole genome amplification of the extracted DNA were performed by plate as
described by (Sandoval et al. (2011)), and samples were run on HM450 arrays.
Calibrated mixture samples.
Four proportions of M.SssI-treated sperm DNA (10%, 35%, 60%, 85%) were prepared
by titrating equal concentrations of treated and mock-treated sperm DNA to achieve near
uniformity of DNA concentration at the desired proportions. Oneg of DNA from each
of the four mixture proportions was then randomly arranged on each of four 96-well
plates and analyzed on the HM450 arrays, as part of the same study as the technical
replicates described above. As the M.SssI enzyme fully methylates DNA, the expected
DNA methylation proportion at normally unmethylated loci should equal the fraction
of M.SssI-treated sperm in the mixture (0.10, 0.35, 0.60, and 0.85). Histograms of
percent DNA methylation for the 10% and 35% mixture samples showed clear bimodal
distributions, allowing us to distinguish normally unmethylated loci from a subset of
20
‘constitutively methylated loci’, loci that would not reflect the proportion of M.SssI-
treated DNA in the mixture. From these bimodal distributions we selected 0.65 as a
cut-off. Probes yielding a median beta value greater than 0.65 in the 10% M.SssI sample
were omitted from all mixture samples as constitutively methylated.
Acute myeloid leukemia patient samples.
As part of The Cancer Genome Atlas (TCGA) project, 1g of DNA from each of 192
acute myeloid leukemia (AML) patient samples was assayed on the Illumina Infinium
HM27 platform. In order to interrogate a greater variety of sites, DNA from the same
patient samples was again assayed on the Illumina Infinium HM450 platform. A total
of 25,978 (94%) of the target sequences on the HM27 array are retained on the HM450
array. As AML is a notoriously diverse disease with significant epigenetic heterogeneity
(Figueroa et al. (2010)), the data allowed us to compare the ability of the various back-
ground correction methods to minimize differences between platforms, without dimin-
ishing biological differences between patients. A one-way ANOV A was conducted
to partition variance between subjects from variance between each subject’s two data
sets (one HumanMethylation27, one HumanMethylation450) and identify methods to
best improve the sensitivity of the Infinium platform for measuring biologic (between-
subject) variation. The precision of the different background correction methods was
assessed by a comparison of the distribution of ANOV A F-statistics. Since we know
SNPs at the CpG site can induce a DNA methylation signal across samples (Byun et al.
(2009)), we excluded probes with SNPs overlapping a CpG site (n=2238, dbSNP build
135). All 384 samples for the AML patients are available as part of TCGA.
21
HapMap sample data.
Jordana Bell and colleagues kindly provided 160 samples of raw .IDAT files of HM27
data for 77 HapMap lymphoblastoid cell lines (Bell et al. (2011)), processed at the Uni-
versity of California, Los Angeles. Seventy-two samples were run in duplicate, four in
triplicate, and one in quadruplicate. We processed these data with the methods presented
above and performed a one-way analysis of variance to compare the sensitivity of the
different approaches on an independent dataset. Again, probes with SNPs occurring at
the CpG site were omitted. The .IDAT files are now available from the Gene Expression
Omnibus as GSE26133.
Results
Distribution of nonspecific fluorescence
Figure 2.5 shows the distribution of intensities for both the negative control probes and
the Infinium I probes in the color channel opposite their designed extension base (‘oob’)
for four PBL replicates, two on the HM27 array (left-hand side) and two on the HM450
array (right-hand side). The two samples on each platform were selected to show the
variation in background intensity that can arise between replicates of the exact same
sample. Note that the number of Infinium I probes that provide information on out-
of-band intensities is orders of magnitude larger than the number of control probes.
This motivated our use of the opposite-channel fluorescence, rather than negative con-
trol probes, for correcting background effects. The mean of the background tended
to be equally well captured by either probe population, but variances from two distri-
butions were considerably different, and therefore warranted individual consideration.
The Gamma model for non-specific fluorescence directly accommodates the skew in
22
Figure 2.5: Distribution of control and OOB intensities.
distributions of ‘oob’ probes, while the nonparametric ‘doob’ procedure makes no dis-
tributional assumptions at all.
Low-level Signal Processing
The substantial technical variation in raw signals from Illumina Infinium Human-
Methylation450 microarrays, and the improvement yielded by low-level processing,
is shown in Figure 2.6. For six replicate samples of peripheral blood lymphocytes
(PBLs), densities of the methylated allele intensities, unmethylated allele intensities,
and Beta values are shown (top row to bottom row), before and after processing (left
and right columns, respectively). Processing included both ‘noob’ background correc-
tion (Normal-exponential convolution using out-of-band probes) and dye-bias equaliza-
tion. A contributing factor to the differences between profiles reflects the differences
in background distribution between identical samples run on different plates. Back-
ground fluorescence has the effect of shrinking the dynamic range of Beta values, and
dye bias variations skew the values of Infinium II probes differentially across samples.
23
Figure 2.6: Distribution of raw and corrected intensities and beta values.
By correcting for both we spread the peaks closer to the extremes of the distribution.
The median pairwise Spearman correlation between Beta (or M-) values of uncorrected
replicates was 0.9798; for the corrected replicates, the median correlation was 0.9897.
The parametric convolution models estimate the signal intensity distribution from
the pool of background subtracted methylated and unmethylated allele intensities (by
color channel). For the PBL sample intensities shown in Figure 2.6, the shape param-
eter for a Gamma distribution, which is equal to one under the exponential model, is
24
Figure 2.7: QQ-plots for norm-exp and gamma-gamma convolutions.
estimated to be less than one under the Gamma convolution. This reflects a signal distri-
bution that is even more highly skewed, and with greater variance, than the exponential
distribution assumed under a Normal exponential convolution.
In Figure 2.7, a quantile-quantile plot shows the observed and estimated distributions
for the foreground and background intensities from a randomly selected PBL sample,
analyzed using a Normal-exponential, or Gamma-gamma model. The estimates from
25
Figure 2.8: Probe-level standard deviations of PBL samples. At left, beta values are the
proportion of total probe intensity at a locus attributable to the methylated probe allele;
at right, M-values are defined as the log2-ratio of the methylated to unmethylated probe
allele intensity at a locus. ’Raw’ refers to results obtained without background
correction or dye bias equalization. Other methods (’q5’,’lumi’, etc.) are described at
length in Table 2.1 and were additionally corrected for red/green dye bias effects.
the Gamma convolution showed a better fit to the observed intensities than the estimates
from the Normal-exponential for the middle 95% of the distribution. A better fit of
background intensity was also seen by modeling a Gamma distribution compared to a
Normal.
Probe-wise variance in replicate samples
Figure 2.8 shows the probe-wise standard deviation of the Beta values, and M-values,
for the different background correction methods. We found that methods using only neg-
ative control probes to measure background have the undesirable property of increasing
the probe-wise standard deviation for the technical replicate PBL samples. The median
26
variance was minimized by the newer convolution methods using the out-of-band inten-
sities to estimate background, with ‘goob’ showing the clear advantage on the Beta-
value scale, while the ‘noob’ procedure performed best on the M-value scale. Linear
modelling of the probe-level variation as a function of the number of overlaid CpG sites
per probe revealed statistically significant (p < 2:2 10
16
) decreases in probe-level
variation with each additional overlaid CpG locus; this relationship was present in both
Infinium I and Infinium II probe designs, albeit with larger magnitude in the former.
Bias, variance, and RMSE as measured from mixture samples
We next quantified the probe-level bias and variance trade offs presented by the various
background correction methods using a panel of samples with titrated mixtures of DNA
methylation. Specifically, we computed the bias as the difference between each sample’s
Beta values and its titrated concentration of M.SssI-treated DNA, or its M-values and the
logit-transformed concentrations. We found that the bias showed a strong dependency
on the concentration of M.SssI-DNA, with pronounced differences towards the extremes
of 10% (mostly unmethylated) and 85% (mostly methylated), as seen in Figure 2.9.
For these concentrations, the bias was most reduced using the convolution mod-
els. We evaluated the probe-level variation in measurements using the sample standard
deviation, whereupon a strong improvement was observed for all background correction
methods in the more numerous Infinium II probes, but Infinium I probes showed slight
increases in probe-level variation, as seen in Figure 2.10. In evaluating the raw data,
the Infinium II probes showed greater variability than Infinium I probes, but after back-
ground correction this was no longer the case. In fact, for mixtures of 10% - 60%, the
variability of the DNA methylation measures for Infinium II probes was lower than the
variability for the Infinium I probes.
27
Figure 2.9: Probe-level bias in mixture samples.
We also tabulated the root mean squared error (RMSE), a function of both vari-
ance and bias, across the different mixing fractions. Near the extremes (where most
measurements in biological samples are expected to fall), we find that the model-based
convolutions using out-of-band probes, ’noob’ and ‘goob’, generally perform best (Fig-
ure RMSE). We note, however, a slight increase in RMSE after background correction
for probes with 35 or 60 percent DNA methylation, with ‘noob’ representing the smaller
increase in RMSE between ‘noob’ and ‘goob’. Thus, the models suffer mildly for probes
that do not have one of their two alleles (M or U) near background. We omitted from
these comparisons the method used by the lumi package as it relies on the assumption
that a large number of probes on the array are unmethylated, which is violated for these
samples containing M.SssI-treated DNA.
28
Figure 2.10: Probe-level bias by probe chemistry.
Performance in replicated HapMap and TCGA samples
In the HapMap cell line replicates (all Infinium I probes on HM27 BeadArrays), all
background correction methods appeared to improve the ability of the platform to detect
biological differences, as judged by the distribution of F statistics (Figure 2.12, left
panel). This illustrates that, regardless of the specific method employed, the differences
between cell lines were more prominent after processing, relative to the differences
within replicates of the same cell line. A comparison of within-subject and between-
subject standard deviations showed ‘goob’ as the method that reduced the within-subject
variation the most. The tendency of lumi’s mode subtraction scheme to increase within-
cell-line variance relative to the raw data is of concern.
In the AML replicate comparison (Figure 2.12, right panel), only the out-of-band
convolution-based methods avoided decreasing sensitivity in the distribution of the F-
statistics, with only ‘noob’ providing a significant improvement in a cross-platform
29
Figure 2.11: Probe-level RMS error in mixture samples.
comparison. When comparing across platforms, 21,497 (91%) of the 23,740 targeted
sequences appearing on both platforms are measured using the differently designed
Infinium II probes on the HM450 platform. As the Infinium II probes have additional
dye bias correction that does not affect Infinium I probes, we stratified the comparison
of within- and between-subject variation by the design of the targeted sequence on the
HM450 array. We observed that the out-of-band convolution methods decreased over-
all and within-subject variance in probes not switching chemistry between platforms
(Infinium I probes). Among the probes that do switch chemistry, ‘noob’ showed greater
between-subject variability than the unprocessed data, and showed the least variance
inflation of the background correction methods compared. The combination of these
two events likely explains the increased ability of the assay to detect biological differ-
ences, as compared to the raw data or other methods under study.
30
Figure 2.12: F-test statistics for (a) HapMap and (b) TCGA AML replicates.
Discussion
We explored three novel deconvolution methods for correcting background fluorescence
on Illumina Infinium DNA methylation arrays, all of which benefited from using the
large number of out-of-band intensities provided by the platform’s design. The out-
of-band intensities of the Infinium I probe pairs provided an extremely large feature
set for measuring the additive noise component. The distribution of the nonspecific
fluorescence of these features was markedly non-Gaussian in many samples, thus we
implemented a flexible Gamma convolution model as a competitor to the well-regarded
Normal-exponential convolution. Both the Normal exponential and Gamma exponen-
tial convolution models are special cases of the latter convolution. We also evaluated a
distribution-free convolution model, in part to assess the importance of model fit as a
component of performance. Our results indicated that deconvolution methods operating
on out-of-band probes, as a group, outperformed subtractive methods and methods using
the designed negative controls on the array. In addition, the model-based deconvolution
methods appeared to perform better than the distribution-free approach. However, no
31
one single model emerged as the uniformly best performing method across all compar-
isons.
Background correction had the greatest impact on signals near background, or cor-
respondingly, DNA methylation proportions near 0 and 1. Indeed, the mixture sam-
ples showed that the greatest reduction in bias and RMSE from low-level processing
appeared at the extremes of the Beta distribution, in mixtures of 10% and 85% M.SssI
treated DNA, especially for Infinium II probes (which also benefitted from dye-bias
equalization). The Gamma convolution showed the greatest reduction in RMSE for
Beta values for these samples, with the Normal exponential a close second. At the same
time, the 35% and 60% mixture samples, corresponding to probes for which both the
methylated and unmethylated intensities are above background, showed a slight infla-
tion of variance from low-level processing. For these mixtures the Normal exponential
convolution showed less variance inflation than the Gamma convolution.
The observation that some probes benefit from processing whereas others suffer
inflated variance suggests that the distribution of the DNA methylation in our sam-
ples may impact the relative overall performance of different processing methods. This
might explain the superior performance of the Gamma convolution for the HapMap
lymphoblastoid cell lines that were processed on the HM27 arrays, with a preponder-
ance of probes in promoter CpG islands measuring low levels of DNA methylation.
For the TCGA AML replicate comparisons, the results differed. As the AML samples
came from subjects harboring significant subclonal populations (Ding et al. (2012)),
and the disease is associated with profound genome-wide changes to DNA methylation
(Akalin et al. (2012)), we hypothesize that the observed shifts away from the extremes
and toward intermediate levels of DNA methylation reduced the impact of background
32
correction. Stark differences between the AML patients were recovered regardless of
the processing method used.
Our background correction and dye-bias equalization processing of the data tended
to increase the dynamic range of the Beta values, most notably for Infinium II probes.
This resembled the empirical spreading of the Infinium II probe distribution proposed
by (Dedeurwaerder et al. (2011)), though the spreading of peaks occurs in our method
as a natural consequence of background correction, and affects both Infinium I and II
designed probes. Further stratification and incorporation of probe sequence characteris-
tics, both those of the out-of-band Infinium I controls and those of the analytic probes of
either design, could yield future improvements, and indeed, at least one group has begun
to implement such methods (Fortin and Labbe, unpublished). The distribution of the
tens of thousands of Infinium I self-controls for measuring background largely matches
the designed negative controls, but provides many times the sample size for estimating
distributional parameters. Additionally, the probe sequences for the paired probes are
available, allowing a number of sequence-based refinements that are not possible with
the negative control probes because their sequences are not made available. The pres-
ence of these probes, and their large number and known sequence properties, may be of
use in supervised normalization procedures, in addition to the uses we demonstrated in
unsupervised processing.
While supervised and unsupervised methods for normalization are of great interest
for these arrays, we performed only the amount of processing necessary to make the
two platforms (HM27 and HM450) comparable for the AML replicates. Better dye-
bias correction and within-array normalization methods are of interest, but they were
beyond the scope of our work. However, we did evaluate our noob and dye-bias equal-
ization procedure in the presence of SWAN (Maksimovic et al. (2012)), a new software
33
package that performs subset quantile normalization, and found that our method always
improved the median pairwise Spearman correlation for all mixture samples and the
PBL replicates. This led us to believe that our methods will stand up to downstream
processing. Another consideration for method improvement is the optimal offset value.
For all signals we used an offset of 15, the offset reported by Shi and colleagues (Shi
et al. (2010)) for gene expression data in their investigation of bias and sensitivity. How-
ever, as the addition of an offset affects Beta values and M-values differently, it could be
that different values would be preferred depending on the choice of scale. This remains
to be explored.
In summary, we found that the use of out-of-band intensities for estimating the
parameters of a convolution model for background correction on Illumina Infinium
methylation microarrays outperformed subtractive approaches for background correc-
tion. However, as the correction appeared to slightly increase variance of probes with
Beta values in the interior (35% and 60% mixture samples), refinements are likely to
see further improvements. Thus, the overall improvement of the method over com-
peting approaches was primarily due to the distribution of beta values in experimen-
tal samples, falling primarily at the extremes of the scale. Data processing using the
Normal exponential model is fast and computationally efficient; the Gamma convo-
lution, in its present implementation (‘goob’), requires approximately 10-15 minutes
per HM450 sample on a 2Ghz Intel processor. Other groups have already begun to
investigate stratified and allele-specific designs, and their quantile regression-based nor-
malization approaches will undoubtedly benefit from leveraging the large population of
out-of-band intensities on these chips.
34
Analysis of differential methylation via
Beta regression
In this chapter, I explore methods for sensitive analysis of differential DNA methylation,
as might be desired for population-based studies of genetic and environmental factors
where multiple interacting factors are believed to influence a phenotype of interest. For
example, a study of asthma in children as a function of both air pollution and genetic
differences could also investigate durable changes in DNA methylation that complement
genetic predictors in shaping gene expression profiles.
In population-scale studies of multifactorial outcomes, improved sensitivity to detect
true associations between DNA methylation changes and exposures or outcomes is of
great interest. I therefore sought to explore the trade-offs involved in faithfully mod-
elling the proportional measurements via maximum likelihood beta regression (Gr¨ un
et al. (2012)), vs. least-squares modelling after logit transformation (Du et al. (2010)),
or a N(0,1)-quantile normalization, or no transformation at all. Beta regression has long
been an item of interest among those who deal in proportions, but its computationally
intensive nature has hampered its adoption for genome-scale modeling of DNA methy-
lation. Independent of computational concerns, however, is the question of whether a
Beta regression model offers superior properties to less intensive methods. The work
documented herein attempts to answer this question.
35
Typical approaches to modelling DNA methylation at individual cytosine residues
include ignoring the heteroskedasticity of proportional (0-1) measurements; N(0,1)-
quantile normalizing the measurementsBell et al. (2011); logit-transforming the mea-
surements and working on a fold-change scale Du et al. (2010), or discretizing pro-
portions into “states” – for example, regions of unmethylated, low/variably methylated,
and highly/completely methylated cytosines – which are presented as surrogates for
biologically relevant changes. The latter discretization of proportions is useful in situ-
ations where regional changes are of primary interest and typically assumes extensive
(although not necessarily deep) coverage (as with whole-genome bisulfite sequencing),
making strong assumptions regarding local correlation on the genome. By contrast, we
focus on regression approaches useful for sensitive modelling of associations at indi-
vidual targeted cytosines, as a function of exposures, which could be continuous (e.g.
age) and typically require sample sizes that would be cost-prohibitive for whole-genome
bisulfite sequencing. We evaluated beta regression and common approaches to linear
models, with or without transformation, for various sample sizes and designs.
Linear models are quite flexible and can be solved with relative ease even for highly
underdetermined data matrices (e.g. via penalized least-squares methods). Similarly,
any generalized linear model can be iteratively solved to find constrained maximum like-
lihood estimates of predictor effects on a response, with certain distributional assump-
tions. Beta regression, as proposed by Ferrari, Cribari-Neto, Smithson and Verkuilen,
provides a regression framework that will be familiar to practitioners comfortable with
generalized linear models. However, the model allows for the coupled mean and vari-
ance terms to both depend on covariates. It is not feasible, with a single uniformly
applied transformation across responses, to simultaneously normalize the proportional
36
measurements and completely decouple its variance from its mean (although transfor-
mations which render the data approximately homoskedastic, e.g. Du et al., 2010, have
proven useful). Thus, directly modelling both terms as a function of the data is an attrac-
tive approach.
In addition, changes in the variability of DNA methylation are also recognized as
predictors of interest for many diseases (Hansen et al. (2011)) and more recently, as
a nearly inevitable side effect of aging (Heyn et al. (2012)). A dispersion (variabil-
ity) parameter is integral to the beta regression model, permitting either or both of the
mean and dispersion to depend upon covariates. Moreover, one of the most popular
biospecimens for population-scale epigenomics work – whole blood – is in fact a com-
plicated mixture of myeloid, lymphoid, and dendritic cells (Naik et al. (2013)), each
with characteristic DNA methylation marks (Hodges et al. (2011)). While cell sorting is
a useful method to reduce this ambiguity, it is not always feasible to separate fractions
(e.g. Beyan et al. (2012)). Greater sensitivity is thus useful for detecting phenotypic
associations of interest, which may only be present in a fraction of cell types. At the
same time, regression analysis allows us to model the influence of other factors, such as
fraction of cell type (e.g. Reinius et al. (2012) and Guintivano et al. (2013)), which may
influence the estimated associations.
We employed publicly available data sets to explore several approaches to mod-
elling variation in DNA methylation. First, we evaluated Beta regression in the con-
text of dichotomous exposure. We used the myeloid/lymphoid classification of sorted
blood cell fractions to evaluate common two-group tests (Student’s t test, the Welch t
test, Mann-Whitney-Wilcoxon) on variously transformed or untransformed data, and to
37
compare the sensitivity of these to that of likelihood ratio tests based on the Beta regres-
sion model, with either a shared estimate of variability (the “Beta-T” test) or a fitted
per-group estimate of variability (the “Beta-Welch” test).
To gauge sensitivity in a large-sample, validated setting, we analyzed a large study of
aging and DNA methylation in whole blood (Hannum et al. (2012)) and split the dataset
into a discovery and a test set. We located another, similar study in whole blood (Horvath
et al. (2012)) to serve as an independent validation set. We also attempted to validate
loci discovered to have age-related differences in the whole blood datasets, using data
from a separate study conducted on sorted brain cell populations (neural and non-neural)
from subjects of varying age and gender (Guintivano et al. (2013)). If probes showing
age-related DNA methylation in whole blood are either (1) all false-positives, or (2)
tissue-specific, we would expect the validation in a separate tissue to only retain false-
positives. Instead, we find the number of validated results vastly exceeds the expected
number of false-positives, allowing us both to compare the number of discovered results
across the different analysis methods as a measure of sensitivity, as well as suggest the
existence of a common subset of probes showing age-related changes across multiple
tissues.
Lastly, as we know blood to be a complex mixture of different cell types that vary
in fraction with age, we evaluated the ability to discover cell type-specific aging that
occurs in only a fraction of cells (Chu et al. (2008)) by comparing whole blood results
to differences between an elderly (103-year-old) person’s CD4+ T cells and those of
a newborn (Heyn et al. (2012)). This provided further support for the discovery and
validation in independent studies of loci with small effect sizes (e.g. differences in beta
less than 0.05).
38
Methods
Models for differential DNA methylation
The ‘betareg’ R package (version 3.0.2), as described in Grun, Kosmidis, and Zeleis
2012, was employed for the Beta regression fits. Instead of the usual formulation,
Beta(a,b), with both a and b greater than 0, mean μ=a/(a+b) and variance
2
= μ(1-
μ)/( a+b+1) as a function of the mean, the reparameterized distribution Beta(;') is
used, where' = a+b is a precision parameter, independent of the mean. For small',
the variance is large, and for large', the variance is small. Briefly, given link functions
g1() and g2(), we model the mean μ and precision' as a function of predictor matrices
X and Z, respectively, to find maximum likelihood estimates of the coefficients and
.
Assume the Beta value for locus j from individual i is y
ij
Beta(
j;
'
j
). For
simplicity, we drop the index j. Then,
g
1
() =
1
=X
T
g
2
(') =
2
=Z
T
with the links for g1() and g2(), being logit() and log(), respectively.
The reparametrized Beta loglikelihood is expressed for responses of N subjects as
l (;'jY ) =
N
X
i=1
l(;'jy
ij
)
such that each individual observation contributes to the loglikelihood the quantity
39
l (;'jy
ij
) = log (') log (') log ((1)') +
(' 1) log (y
ij
) + (((1)') 1) log (1y
ij
)
where is Euler’s Gamma function ((n) = (n-1)!).
We implemented three changes from the default implementation that we found nec-
essary to improve stability. First, the default optimizer for finding maximum likelihood
estimates was changed from BFGS, a quasi-Newton method, (Nocedal & Wright, 2006)
to Nelder-Mead, a slower but derivative-free and (in our comparisons) far more sta-
ble method. Second, we employed the “squeezing” technique suggested by Smith and
Verkuilen (2005) to eliminate 0 and 1 values without altering the rank order of observa-
tions, although this proved to be a minor consideration compared to the choice of opti-
mizer. To squeeze the data we shifted the values around zero and multiplied by 0.99 (=
(Beta – 0.5)*0.99 + 0.5 ). Finally, we used an approach similar to that of ‘limma’ (Smyth
et al., 2004), with predefined design matrices for the null and alternative hypothesis, for
model fitting. Fitting a 2-3 variable model to each CpG site on the HumanMethyla-
tion450 array took approximately 24 hours on a machine with 6 processors and 48GB
of RAM, when a likelihood ratio test was conducted at each locus. Parallelization by
chromosome further decreases the time taken, but further complicates execution. The
modifications employed to perform large numbers (> 300; 000) of Beta regression fits in
a robust, stable fashion, when applied in parallel to datasets such as those studied here,
are provided as Supplemental files 2-3. I expect to release a small R package, combining
the above-mentioned functionality, to the CRAN archive shortly.
40
For linear regression, the ‘limma’ package (Smyth (2004)) was employed, using the
marginal t-test statistics for each coefficient of interest. Models were fit using untrans-
formed and transformed Beta values. Untransformed Beta values, the ratio of intensi-
ties M/(M+U), where M measures methylation and U lack of methylation intensities,
were labelled ‘(B)’ or ‘betas’. We also considered M-values, log2(M/U), labelled ‘(M)’,
owing to their purportedly superior sensitivity, as described in Du et al. 2010. Lastly,
we evaluated the normalizing transformation described in Bell et al (2011), whereby the
observed quantiles at a given locus are replaced with their corresponding values for a
standard Normal distribution (i.e., N(0,1)) labelled ‘(NQ)’.
Simulations
I conducted simulations exploring statistical power and FDR control, comparing Beta
regression with common two-group tests for differential DNA methylation. The results
informed my interpretation of the results observed when fitting many thousands of mod-
els. Simulations proceeded as follows.
Standard two-group tests for location shift – Student’s t, Welch’s t, the Mann-
Whitney-Wilcoxon test – were performed at each round of simulation. I also introduced
two new hypothesis tests for comparison, each a specific case of likelihood ratio tests
between Beta regression models. One model for a two-class comparison of means com-
pares a null model,yBeta(;) against an alternative modelyBeta(+X;),
where the parameter of interest is , andX2f0; 1g indicates group membership. This
is similar to Student’st test in assuming homogeneity of scale, so we call this the “Beta-
T” for brevity. Another comparison takes as the null modelyBeta(; ( +X
)) and
as the alternative,y Beta(( +X); ( +X
)), where the parameter of interest is
again . As in the Welch test, there is no assumption of homogeneous scale between the
41
groups, so we call this the “Beta-Welch” test. Both of these tests were also performed
at each round of simulation.
In order to maximize fidelity to the data, I estimated parameters and effect sizes
directly from a real study of epigenetic changes in colorectal cancer. Colorectal can-
cer is known to involve epigenetic dysregulation in at least some (if not all) cases (e.g.
Lengauer et al. (1997)), thus I could be reasonably assured of significant epigenetic dif-
ferences in samples of this tumor. From the Cancer Genome Atlas project, matched sam-
ples of 24 colorectal tumors and 24 adjacent (putatively normal) tissues from the same
24 patients were selected as representative groups from which to estimate parameters.
I therefore estimated parameters of the Beta distribution for tumor and normal samples
at each of the 22,721 non-SNP, non-sex chromosome, non-repeat probe loci assayed
by the HumanMethylation27k platform. Probes with single nucleotide polymorphisms
(SNPs) near the interrogated CpG dinucleotide were excluded, as were probes on the X
chromosome, and those which mapped either to multiple perfect matches or to repeat
regions of the genome.
The method of moments was used to estimate Beta parameters from the sample
mean y and variance s
2
y
. For a Beta(;) random variable, the method of moments
estimate for the sample mean y; and the estimate for the precision parameter
~
=
( y(1 y)=s
2
y
))-1. The R statistical package (R Development Core Team (2010)) was
then used to simulate (p
0
+p
1
) x (n
0
+n
1
) matrices of simulated Beta values from subsets
of the population, wherep
0
are unaffected loci identically distributed between tumor and
normal simulated observations,p
1
are affected loci distributed differently between tumor
and normal samples,n
0
is the number of samples from simulated normal tissues, andn
1
is the number of samples from simulated tumor tissues. p
1
=p
0
+p
1
was held constant at
0.2 for all simulations presented under Results.
42
Error rate control
We used independent data sets to validate loci associated with age, allowing us to com-
pare sensitivity of the different analysis approaches. For the blood lineage comparisons,
a validation set was not available. For this data set, I therefore permuted the samples,
blocked by subject, to establish that control of the type I error rate was maintained, as
my simulation results had previously suggested it would be.
Datasets
All analyses in this chapter were applied to publicly available data. When available
(in fractionated normal blood cells), raw IDAT files were downloaded and processed
using the ‘methylumi’ package in R to correct for background fluorescence and equalize
within–array dye-bias (Triche TJ Jr, 2013). Where raw data (including control probes)
was not available, GenomeStudio internal controls were typically used to equalize dye
bias and correct background fluorescence prior to uploading data to GEO. The probes
were mapped to genomic coordinates using sequence-verified mappings from the manu-
facturer and the Bioconductor package, ‘FDb.InfiniumMethylation.hg19’, to mask com-
mon SNPs, repeat regions, and sex chromosome probes. The resulting mask is also used
for the Cancer Genome Atlas project HumanMethylation450 data, and excludes probes
with common SNPs at, or up to 15 bases 3’ to, the interrogated locus, as well as probes
with at least 15 bases proximal to the interrogated locus falling into an annotated repeat
region of the genome. Regressions were fit separately for each probe (excluding the
masked loci), and p-values adjusted in each discovery set using the Benjamini-Hochberg
procedure (Benjamini (2010)) to control the FDR at 0.05.
43
A data set of 48 samples of sorted cell fractions (24 myeloid/ 24 lymphoid sam-
ples) from 6 adult male subjects was used to compare methods for testing differen-
tial DNA methylation in two-group designs (GEO series GSE35069) (Reinius et al.
(2012)). Reinius et al. (2012) showed that cell type-specific differences dominated
individual-specific differences in these samples, so we modelled DNA methylation as
a function solely of cell type (myeloid or lymphoid lineage). The raw IDAT files were
downloaded from (http://publications.scilifelab.se/kere j/methylation) and processed as
described above.
A large data set of 656 blood samples from subjects aged 19-101 (GEO series
GSE40279), was used to test for an association between CpG methylation and age, a
quantitative variable. The sample was split into a discovery set of 400 samples and a
test set of the remaining 256. The data subsets were balanced with respect to age and
gender, having a median age of 65 and male/female balance of 205/195 and 133/123 in
the discovery and test subsets, respectively. We fitted models including both age and
gender. Loci with a BH-adjusted p-value of 0.05 or less for age in the discovery set
were evaluated using the test dataset, and those with a (raw) p-value of less than 0.05 in
the test set were considered hits. An additional set of whole blood samples was used for
validation, collected from 89 Dutch subjects, 18 to 65 years old (GEO series GSE41169,
accessions GSM1009660-GSM1009894); the subset of loci significant after testing was
evaluated in this independent cohort, with age and sex as covariates in the models. Loci
with an unadjusted p-value of 0.05 or less were considered validated in the independent
Dutch dataset.
Separately generated HumanMethylation450 data (GEO series GSE30870) from a
newborn male and a 103-year-old subjects’ CD4+ T cells (GEO accessions GSM765861
and GSM765860, respectively) were used to evaluate estimates of DNA methylation
44
differences of the fitted models for the union of all validated loci identified as age-
related in whole-blood, but discovered possibly using different regression models. We
fit models for prediction using the larger study of 656 whole blood samples, combining
the discovery and test sets. CD4+ T cells exhibit characteristics of frequently dividing
blood cells, and as such, are known to display particularly variable DNA methylation
with age (Chu et al. (2008)).
Further, we attempted to validate age-related loci from the initial blood-based com-
parison against a separate dataset (GSE41826) from Guintinavo and colleagues, contain-
ing 58 sorted neuronal and 58 non-neuronal cell populations from the brains of subjects
with and without major depression, all with age, sex, and ethnicity recorded as well
as disease status. We fitted regression models for age with sex and cell subpopulation
(neuronal/non-neuronal) as covariates. Loci with p-values less than 0.05 in the valida-
tion set were tested; those which yielded an unadjusted p-value less than 0.05 in the
brain samples as well were considered validated across the two tissue types.
Results
Table 3.1 presents the number of loci discovered to have differential DNA methylation
in myeloid and lymphoid cell fractions, performed on a data set of 48 samples of sorted
blood cell types from six healthy males. The “Beta-T” and “Beta-Welch” tests were
most sensitive among the single methods compared, discovering 194,767 (51%) and
204,189 (54%) of the tested loci, respectively. However, for these data, the marginal gain
over the number of loci discovered by any other approach was modest. A large number
of loci were identified by all approaches (177,446 loci, 91% of the loci discovered by
“Beta-T” and 87% of those found using “Beta-Welch”). Permutations confirmed that
the error rate was adequately controlled by all methods (Table 3.1). Although there was
45
GSE35069
(24 lymphoid,
24 myeloid)
p
BH
< 0:05
Permutations:
Set 1
p< 0:05
(realized% )
Permutations:
Set 2
p< 0:05
(realized% )
Student’s t-test: B 194,565 (50.9%) 7,059 (1.9%) 17,009 (4.5%)
Student’s t-test: NQ 186,940 (49.0%) 7,226 (1.9%) 20,314 (5.3%)
Student’s t-test: M 194,001 (50.8%) 6,692 (1.8%) 19,057 (5.0%)
Welch’s t-test: B 193,189 (50.6%) 6,833 (1.8%) 16,874 (4.4%)
Welch’s t-test: NQ 186,444 (48.9%) 7,193 (1.9%) 20,231 (5.3%)
Welch’s t-test: M 192,870 (50.5%) 6,589 (1.7%) 18,902 (5.0%)
Mann-Whitney U-test 187,653 (49.2%) 6,582 (1.7%) 21,767 (5.7%)
Beta-T test (shared) 194,767 (51.0%) 6,957 (1.9%) 20,844 (5.4%)
Beta-Welch (indiv.) 204,189 (53.5%) 14,521 (3.8%) 20,928 (5.4%)
Table 3.1: Loci (out of 381552) associated with lineage in blood
variation in empirical error rate between the two permutations, this is likely a function
of the small sample size (6 subjects); the total per-group sample size is within the range
where we expect Beta likelihood ratio tests to establish control of the FDR and improve
power (Tables 3.2 and3.3). Finally, the rank order of the methods for sensitive detection
was not altered when we computed either Benjamini and Hochberg FDR-adjusted p-
values or q-values for multiple comparisons correction.
Additionally, we present the results of simulations which characterized the FDR
control and statistical power of various alternatives.
In Table 3.2 , the realized false discovery rate was tabulated from 1000 simulations
at 1000 loci each, where 20% (200) of the loci in each simulation were drawn separately
from ’tumor’ and ’adjacent’ distributions, while the remaining 80% (800) of the loci
were drawn from a homogeneous distribution with the parameters from ’adjacent’ tissue
only. In the samples from which the simulation parameters were estimated, the variance
46
n
1
vsn
0
t t
w
U t
`
t
w
T
W
10 vs 10 0.04 0.04 0.04 0.04 0.03 0.08 0.12
10 vs 15 0.03 0.03 0.04 0.04 0.03 0.06 0.13
15 vs 10 0.03 0.03 0.03 0.03 0.03 0.05 0.14
15 vs 15 0.04 0.03 0.04 0.04 0.03 0.05 0.07
25 vs 25 0.04 0.04 0.04 0.03 0.03 0.04 0.06
10 vs 40 0.04 0.11 0.04 0.05 0.06 0.04 0.13
40 vs 10 0.05 0.14 0.04 0.04 0.10 0.05 0.15
50 vs 50 0.04 0.06 0.04 0.04 0.05 0.04 0.09
75 vs 25 0.04 0.06 0.03 0.04 0.05 0.04 0.09
25 vs 75 0.04 0.04 0.04 0.04 0.04 0.04 0.05
Table 3.2: Realized FDR in simulations, at a nominal< 0:05.
of methylation in the tumors tended to be larger than that observed in the adjacent tissue
samples, and this is reflected in the simulation results. We compared the FDR control
exhibited by each of nine tests: Student’st-test, the Welch test, the Mann-Whitney test,
Student’st-test on a logit-transformed beta value, the Welch test on a probit-transformed
beta value, the likelihood ratio test of the “Beta-T” described previously, and the like-
lihood ratio test of the “Beta-Welch” test hypothesis. The realized false discovery rate
(false positives / all positives) was tabulated for each. All of the scenarios were simu-
lated simultaneously at the same randomly selected loci in each run for reproducibility
and comparability. 20% of the loci in each scenario were simulated separately from esti-
mated control and case parameters, while the remaining 80% of the loci were simulated
as independent samples from an identically distributed control population.
In Table 3.3, statistical power for the same tests as in Table 3.2 is tabulated via
simulation. Based on these results, we concluded that our sample sizes were within the
range where FDR control could reasonably be expected of the Beta regression-based
likelihood ratio tests, and proceeded to fit more comprehensive models where validation
of the results was possible.
47
Table 3.4 reports the number of discovered and validated CpG targets with associ-
ations between DNA methylation and age, a quantitative exposure, in a large study of
whole blood samples from 656 individuals of varying age (19-101 years of age), with
samples split into discovery and test sets. We compared the number of discovered and
validated loci using different regression approaches, all adjusting for sex as a covariate.
In both the discovery subset, where we judged significance by FDR-adjusted p-values,
and in the test subset, where the loci discovered in screening were judged significant
if they yielded an unadjusted p-value of 0.05 or below, Beta regression yielded more
numerous hits (Table 3.4, columns 1 and 2). In the 89-subject independent validation
dataset (age 18-65, median age of 29, with 24 females and 65 males), Beta regression
yielded the largest number of validated loci (Table 3.4, column 3).
Approximately 57% of the union of all loci that remained significant in this vali-
dation set were found to be associated with age regardless of the method used (Figure
3.13). Eighty-two percent of the union of validated loci were found by Beta regression,
78% by linear regression on the Beta values, 75% by linear regression on the N(0,1)
quantile normalized values, and 71% by linear regression on the M-values. Eighteen
n
1
vsn
0
t t
w
U t
`
t
w
T
W
10 vs 10 0.06 0.04 0.05 0.06 0.04 0.07 0.13
10 vs 15 0.07 0.09 0.07 0.06 0.08 0.07 0.17
15 vs 10 0.11 0.05 0.09 0.10 0.05 0.12 0.13
15 vs 15 0.12 0.10 0.11 0.11 0.10 0.12 0.19
25 vs 25 0.20 0.19 0.18 0.18 0.17 0.18 0.28
10 vs 40 0.09 0.21 0.11 0.08 0.17 0.08 0.27
40 vs 10 0.25 0.08 0.13 0.22 0.06 0.22 0.15
50 vs 50 0.35 0.35 0.29 0.29 0.28 0.28 0.42
75 vs 25 0.39 0.22 0.25 0.32 0.20 0.32 0.31
25 vs 75 0.35 0.35 0.29 0.29 0.28 0.28 0.42
Table 3.3: Simulated statistical power, by test (inflated FDRs are struck)
48
GSE40279
Discovery
Set (N=
400 WB
1
)
GSE40279
Test Set
(N=256 WB†)
(% of column 1)
GSE41169
Validation Set 1
(N=89 WB†)
(% of column 2)
GSE41826
Validation Set 2
(N=116 SBr‡)
(% of column 3)
Linear fit: B 153,220 79,225 (51.7%) 22,952 (29.0%) 15,772 (68.7%)
Linear fit: NQ 149,609 80,929 (54.1%) 22,107 (27.3%) 15,985 (72.3%)
Linear fit: M 136,058 70,846 (52.1%) 20,989 (29.6%) 15,742 (75.0%)
Betareg:,' 160,713 85,278 (53.1%) 24,225 (28.4%) 16,351 (67.5%)
Table 3.4: Loci (out of 381552) associated with age in blood & brains
percent of validated loci were found by only one regression approach:11.5% of vali-
dated loci were found only by beta regression, while 4.3% were found only by linear
regression on N(0,1)-quantile normalized values. Linear regression on beta values and
linear regression on M-values found the fewest loci identified by only one method (1.8%
and 1.4%, respectively) (Figure 3.13).
For the loci validated by the independent data set, we used smooth scatterplots to
compare the differences in DNA methylation observed in two new samples to those pre-
dicted from each model (Figures 3.14 and 3.15). The two new samples were previously
generated HumanMethylation450 data from CD4+ T cells of a 103-year-old man and a
newborn boy. Our predictions were from regression equations fitted using the full 656-
subject Hannum dataset. It is worth noting that the samples in this dataset are whole
blood, a mixture of cell types for which CD4+ T cells is only a fraction. Sorted CD4+
T cells are a subpopulation of blood cells that has been reported to have significantly
greater mitotic age than the bulk of cells in whole blood, with loci showing increased
DNA methylation with age (Chu et al. (2008)).
The results for the union and intersection of loci found by all four methods are shown
in Figure 3.14, with results for loci uniquely found by each method presented in Figure
3.15. All four methods generated virtually identical correlations between observed and
49
Figure 3.13: Validated loci associated with age in whole blood, by method.
predicted differences on their set of validated loci (0.795-0.799). Probes detected by
the intersection of all four methods (n=16917) showed a correlation of 0.81, and those
detected by any method (i.e., the union of results from all methods, n=29533), a correla-
tion of 0.784. When we tabulated the loci which were uniquely detected by a particular
method, those detected uniquely by Beta regression showed a correlation between pre-
dicted and observed differences of 0.664, whereas loci uniquely detected using m-values
showed a correlation of 0.641, N(0,1)-quantile normalized values a correlation of 0.651,
50
Figure 3.14: CD4+ cells: loci validated by any (left) or all (right) methods.
and loci uniquely detected by fitting a linear model to beta values showed an observed-
expected correlation of 0.507. In all cases, the observed differences in CD4+ cells tend
to vastly exceed those predicted from the whole blood samples. This was as expected
(Chu et al. (2008)) and illustrates a situation in which sensitivity to changes in a mixture
of cells can indicate detection of changes in a specific fraction.
We then tested for significant association with aging in sorted brain tissues (neural
and non-neural brain cell populations) in a final analysis of the age-associated, validated
loci discovered in whole blood. The order of methods validating the largest number of
discovered probes was consistent with the previous two analyses (Table 3.4, column 4).
Overall, 61.4% of the loci validated for age association in the whole blood datasets are
also associated with aging in the sorted brain cell populations (nominal p-value less than
0.05), a far higher fraction than 5%, suggesting (1) that the aging effects in whole blood
are real, allowing us to compare sensitivity of different methods, and (2) that a subset of
loci have age-related effects in different tissues. Comparing sensitivity of the methods
we find once more that a substantial fraction of the overall number of loci is recovered
51
Figure 3.15: CD4+ cells: loci uniquely detected by each individual method.
by all methods (here, 14195 loci, or 79.3% of the union of all validation results, are
significant regardless of the method used). An additional 629 loci (3.5%) are uniquely
validated by beta regression, and another 626 loci (3.5%) are uniquely validated by
linear regression on N(0,1) quantile normalized beta values.
52
Discussion
As a relatively young field, genome-scale investigation of DNA methylation (and indeed
epigenetic marks in general) has spawned multiple approaches to assess the significance
of focal differences in DNA methylation (1bp-1kbp), as well as broader concordant
DNA methylation patterns, often associated with chromatin accessibility states. Both
focal and broad changes may be associated with a phenotype of interest, and both are
strongly associated with tissue-specific changes involved in development and disease.
Thus, when concordant regional changes are to be detected by ‘smoothing’, greater
sensitivity to detect focal changes can translate into more sensitivity towards regional
changes, and can help to avoid false negatives when the results are subjected to regional
permutation testing and/or independent replication. I thus sought to characterize the
sensitivity of methods to detect these sorts of focal changes.
My results suggest that, for data from Illumina Infinium methylation chips, Beta
regression offers the greatest sensitivity to detect changes in DNA methylation, and for
the analysis of age-related changes, this result carried through to validation in indepen-
dent datasets. However, whole blood is known to be a mixture of cell types and the
fractions of those cell types vary over age (Houseman et al. (2012)). Therefore it is
expected that a number of the validated loci may not have differences in DNA methyla-
tion with age within individual cell fractions, but rather it is the changing composition
of cells types with age that allow cell-type specific marks to result in different average
DNA methylation levels with aging. Caution is thus needed when interpreting results
from complex cell mixtures.
If, as reported for whole blood, cell type-specific aging is faster in one or more
cell fractions than the majority of cells (Chu M, 2008), one might expect the cell-type
53
specific DNA methylation differences with age to be attenuated in the mixed cell popu-
lation. If the cell showing aging is in the diminishing lymphoid population (e.g. CD4+
cells), one would expect the bias in changing cell fractions to accelerate the attenuation
in signal. I evaluated this by comparing differences in DNA methylation of CD4+ cells
in a male newborn and 103-year-old to the predicted difference from regression models
based on whole blood, observing the attenuated difference expected from modelling the
whole blood mixture. Furthermore, I found a similar distribution of differences across
the different methods, but with more candidate loci flagged by Beta regression, suggest-
ing greater sensitivity.
The data sets I selected for analysis had large sample sizes, as will be typical in an
epidemiologic genome-wide DNA methylation study. My simulation results indicated
that very small (less than 15 samples per group) or unbalanced comparisons (splits more
uneven than approximately 80%/20%) designs, could lead to loss of control of the error
rate for Beta regression. This might be due to instability of the unconstrained maximum
likelihood estimates in small samples, but these issues did not appear to affect the anal-
yses of the large datasets I worked with. For small sample sizes, various approaches
to bias correction and reduction have been proposed (e.g. Gr¨ un, Kosmidis, & Zeileis,
2012), but these can increase computation times by at least an order of magnitude. In
order to establish control of the error rate for the (unvalidated) two-group tests analyzed
in this chapter, having different genomic distribution and potentially different proper-
ties than studied in my simulations, I analyzed two permutations of the samples, ran-
domly switching the group assignments for age-balanced sets of subjects. In general,
my coauthors and I recommended this type of verification for error-rate control when an
independent data set is not available.
54
As always, the investigator must decide what the goal of the experiment or study
should be, and what the most appropriate tools may be to accomplish the goal. Compu-
tational resources and time frames for completion can guide such choices, but ultimately
the nature of a particular study should determine the method(s) of analysis. Exploratory
analyses may be better served by the speed of linear methods, which allow for rapid
iteration of large numbers of models.
By contrast, an additional day of computation may not be a significant consideration
for (e.g.) a cohort study investigating changes in DNA methylation at specific loci in
normal tissues as associated with exposures, where sensitivity might well be a central
consideration, and all resulting findings subject to independent orthogonal validation. In
such situations an investigator may favor sensitivity over speed of computation.
Similarly, when characterizing regional changes at a given significance threshold,
either for a sequential area statistic or by modelling smoothed runs of multiple test
results, additional sensitivity may help to unearth biologically interesting patterns that
are difficult to discern otherwise. Typically, the sequential nature of these regional calls
requires permutation to establish a robust FDR estimate, which provides additional con-
trol of the putative error rate at the regional level.
Conclusions
Based on the results presented in this chapter, I concluded that Beta regression offers
greater sensitivity for the discovery phase of genome-scale DNA methylation studies,
particularly in large sample sizes with potentially confounding per-group effects on dis-
persion. The Beta regression framework appears to draw its enhanced power from the
natural representation of such effects, which is more challenging in a linear model due to
55
the decoupling of the variance from the mean. The ability of the beta regression frame-
work to simultaneously model changes in the mean and dispersion grants a significant
advantage when modelling factors (such as aging) that appear to influence both quanti-
ties. The approximation of homoskedasticity induced by M-value transformation does
not appear sufficient to overcome these differences, especially when complex models
are fit to large cohorts, nor is a N(0,1)-normalizing transformation ultimately superior
to fitting on untransformed beta values. Investigators dissecting complex epigenetic
associations with continuous phenotypes may want to consider beta regression in their
arsenal of tools for discovery, balancing computational convenience and conservative
error rates against the need for sensitivity and flexibility in modelling. By providing a
concrete implementation of a usable beta regression framework for hundreds of thou-
sands of features, as are found on the HumanMethylation450 chip, I sought to ensure
that the appropriate choice will not be complicated for investigators by excessively bur-
densome computational requirements or an absence of usable implementations to build
upon.
56
Conclusion
Relentless improvements in assay technologies, coupled with enormously expanded
computing power and storage resources, have characterized recent years in biology and
bioinformatics. Data generation is, in some cases, no longer the limiting step in exper-
imental designs. Rather, the ability to usefully clean and analyze the resulting data,
generating further hypotheses to test and discoveries to validate, is now a rate-limiting
step in the progress of certain fields.
It is currently cost-prohibitive to perform whole-genome bisulfite sequencing on
large numbers of experimental subjects; even if it were not, the protocols involved are
substantially more time-consuming than microarray protocols, where hundreds of sam-
ples can be processed daily. For the moment, studies of DNA methylation which seek
quantitative measurements across the genomes of diverse subjects, at a cost that can be
borne in population-based studies, will opt for microarrays. Processing DNA methyla-
tion data on the scale currently generated is challenging, and interpretation of the results
remains subtle, not least due to changes in our understaning of DNA methylation in vivo
(e.g.Ko et al. (2010)).
Nonetheless, certain approaches to analysis lend themselves to pipeline-centric pro-
cessing, enabling massive numbers of samples to be processed and the “unreasonable
effectiveness of data” (Halevy et al. (2009)) to reveal patterns not apparent in smaller
datasets. The Cancer Genome Atlas project provides one example. By largely elim-
inating many sources of human error in performing the HumanMethylation450 assay,
and employing the preprocessing scheme described in Chapter 2, which makes few if
any assumptions about experimental designs or subjects, my colleagues at the USC
Epigenome Center and I have extracted useful information from thousands of tumor
57
methylomes, and hundreds of normal adjacent tissues, in the creation of a public
resource which has been used by thousands of investigators. Further improvements
to these approaches can and will be made. By decoupling dye bias equalization into
per-chip adjustment, as suggested in Chapter 2, existing pipelines are not restricted to
processing experiments by batch, and the resulting improvement can effectively remove
dye bias from any chip run on the platform. Computational throughput is then limited
largely by the speed at which data can be transfered to and from hardware.
Each successive study builds upon its predecessors, creating a corpus of specialized
data which can be leveraged by combining it with others. The resulting corpora are too
large for popular memory-resident approaches to analysis. By necessity, I have thus
explored out-of-core approaches to computation (built on a foundation laid by Bigmem-
ory project members) which allow fast, comprehensive analysis of arbitrary subsets, or
useful decompositions of the entire processed corpus, without moving to cumbersome
batching or Map/Reduce systems. Drawing inference across thousands of subjects is
substantially more informative when conducted in an iterative, which cannot readily
occur when the writing of code or the processing of data operates at glacial speeds. At
the same time, forcing annotations to reside in data structures build for genomic coordi-
nates has simplified the task of interfacing microarray results with sequence-based data,
which often provides complementary information. The same approaches made it possi-
ble to evaluate computationally intensive methods, such as Beta regression, in Chapter
3, while integrating the results of multiple large studies to draw firmer conclusions.
In some instances, the results of this integration challenged assumptions that I previ-
ously held, among them the perception of whole-genome bisulfite sequencing as a “gold
standard” for quantitating DNA methylation at base resolution. In the process of assem-
bling the results in Chapter 3, I compared whole-genome bisulfite sequencing results
58
with those of HumanMethylation450 arrays run on the same subjects. While the over-
all correlation of each individual’s results across both assays was similar, varying read
coverage led to substantially poorer correlation of the differences between individuals,
compared to that measured on the array. The ready availability of multiple sources of
data on the same subjects permits such ad hoc experiments to be run by any investigator
who wishes to do so.
Moreover, the data generated by massive projects such as ENCODE Consortium
(2004) and the Reference Epigenome Mapping Consortium can greatly inform the
design, interpretation, and integration of smaller or more specialized studies, which oth-
erwise could not afford access to such results. By integrating reference data from these
large projects with data collected from designed experiments on individual genomes
and epigenomes, individual labs can leverage both public and proprietary resources,
enhancing statistical power and potentially increasing the pace of scientific discover-
ies. Similarly, rapidly accumulating genome-scale DNA methylation data in the Gene
Expression Omnibus (Edgar et al. (2002)) can help establish useful starting directions
for more specific investigations. A common challenge in both cases is addressing tech-
nical artifacts, applying the best-suited tools for analysis, and interpreting the results in
light of existing studies. Eventually, a more scalable and cost-effective approach will
use experiment-agnostic preprocessing methods, such as described above, and permit
network-accessible resources which perform the minimum steps necessary to enable
further analysis. At the same time, better methods to opportunistically access huge cor-
pora out-of-memory could reduce demands for both storage and computational power
on the user’s end, allowing fast, cost-effective access to deposited experimental data
without incurring huge latency and storage concerns.
59
A comprehensive implementation of the above steps does not yet appear imminent.
However, each step towards this ideal represents a further relaxation on the scope and
scale that can be addressed by studies of DNA methylation, whether as a marker for
environmental exposures, an early harbinger of disease, a metric for treatment response,
a target for therapy, or an adjunct to known genetic risk factors. Computational consider-
ations are not necessarily roadblocks to the use of sensitive methods, such as I discussed
in Chapter 3, when parallelization can bring massive computational resources to bear on
analyses.
Ultimately, the utility of DNA methylation arrays, like gene expression arrays before
them, will depend on the creativity of investigators and the veracity of the data they
generate. Gene expression signatures have proven useful as prognostic tools, and early
designs spurred much work in high-dimensional statistics (e.g. Bolstad et al. (2003))
which benefited many unrelated disciplines. If the unfamiliar properties and scale of
DNA methylation data generate similarly useful direct (scientific) and indirect (bioin-
formatic) benefits, the humble microarray will again have served as a pivotal tool in
advancing molecular biology.
60
References
Akalin, A., Garrett-Bakelman, F. E., Kormaksson, M., Busuttil, J., Zhang, L., Khreb-
tukova, I., Milne, T. a., Huang, Y ., Biswas, D., Hess, J. L., Allis, C. D., Roeder, R. G.,
Valk, P. J. M., L¨ owenberg, B., Delwel, R., Fernandez, H. F., Paietta, E., Tallman,
M. S., Schroth, G. P., Mason, C. E., Melnick, A., and Figueroa, M. E. (2012). Base-
pair resolution DNA methylation sequencing reveals profoundly divergent epigenetic
landscapes in acute myeloid leukemia. PLoSgenetics, 8(6):e1002781.
Bell, J. T., Pai, A. a., Pickrell, J. K., Gaffney, D. J., Pique-Regi, R., Degner, J. F., Gilad,
Y ., and Pritchard, J. K. (2011). DNA methylation patterns associate with genetic and
gene expression variation in HapMap cell lines. GenomeBiology, 12(1):R10.
Benjamini, Y . (2010). False Discovery Rates. Genetics, (January):1–7.
Beyan, H., Down, T. a., Ramagopalan, S. V ., Uvebrant, K., Nilsson, A., Holland, M. L.,
Gemma, C., Giovannoni, G., Boehm, B. O., Ebers, G. C., Lernmark, A., Cilio, C. M.,
Leslie, R. D., and Rakyan, V . K. (2012). Guthrie card methylomics identifies tempo-
rally stable epialleles that are present at birth in humans. Genomeresearch.
Bolstad, B. M., Irizarry, R. a., Astrand, M., and Speed, T. P. (2003). A comparison of
normalization methods for high density oligonucleotide array data based on variance
and bias. Bioinformatics(Oxford,England), 19(2):185–93.
Borgel, J., Guibert, S., Li, Y ., Chiba, H., Sch¨ ubeler, D., Sasaki, H., Forn´ e, T., and Weber,
M. (2010). Targets and dynamics of promoter DNA methylation during early mouse
development. NatureGenetics, 42(12):1093–1100.
Byun, H.-M., Siegmund, K. D., Pan, F., Weisenberger, D. J., Kanel, G., Laird, P. W.,
and Yang, A. S. (2009). Epigenetic profiling of somatic tissues from human autopsy
specimens identifies tissue- and individual-specific dna methylation patterns. Human
MolecularGenetics, 18(24):4808–4817.
Chen, M., Xie, Y ., and Story, M. (2011). An Exponential-Gamma Convolution Model
for Background Correction of Illumina BeadArray Data. Communications in statis-
tics: theoryandmethods, 40(17):3055–3069.
61
Chen, Z., McGee, M., Liu, Q., Kong, M., Deng, Y ., and Scheuermann, R. H. (2009).
A Distribution-Free Convolution Model for background correction of oligonucleotide
microarray data. BMCgenomics, 10 Suppl 1:S19.
Chu, M., Siegmund, K. D., Hao, Q.-L., Crooks, G. M., Tavar
˜
A©, S., and Shibata, D.
(2008). Inferring relative numbers of human leucocyte genome replications. British
JournalofHaematology, 141(6):862–871.
Consortium, T. E. P. (2004). The encode (encyclopedia of dna elements) project. Sci-
ence, 306(5696):636–640.
Deaton, A. M. and Bird, A. (2011). CpG islands and the regulation of transcription.
Genes&development, 25(10):1010–22.
Ding, L., Ley, T. J., Larson, D. E., Miller, C. a., Koboldt, D. C., Welch, J. S., Ritchey,
J. K., Young, M. a., Lamprecht, T., McLellan, M. D., McMichael, J. F., Wallis, J. W.,
Lu, C., Shen, D., Harris, C. C., Dooling, D. J., Fulton, R. S., Fulton, L. L., Chen,
K., Schmidt, H., Kalicki-Veizer, J., Magrini, V . J., Cook, L., McGrath, S. D., Vickery,
T. L., Wendl, M. C., Heath, S., Watson, M. a., Link, D. C., Tomasson, M. H., Shannon,
W. D., Payton, J. E., Kulkarni, S., Westervelt, P., Walter, M. J., Graubert, T. a., Mardis,
E. R., Wilson, R. K., and DiPersio, J. F. (2012). Clonal evolution in relapsed acute
myeloid leukaemia revealed by whole-genome sequencing. Nature, 481(7382):506–
10.
Du, P., Zhang, X., Huang, C.-C., Jafari, N., Kibbe, W. a., Hou, L., and Lin, S. M. (2010).
Comparison of Beta-value and M-value methods for quantifying methylation levels
by microarray analysis. BMCBioinformatics, 11(1):587.
Edgar, R., Domrachev, M., and Lash, A. E. (2002). Gene expression omnibus: Ncbi
gene expression and hybridization array data repository. Nucleic Acids Research,
30(1):207–210.
Figueroa, M. E., Lugthart, S., Li, Y ., Erpelinck-verschueren, C., Deng, X., Christos,
P. J., Schifano, E., Booth, J., Putten, W. V ., Skrabanek, L., Campagne, F., Mazum-
dar, M., Delwel, R., Melnick, A., Greally, J. M., Valk, P. J. M., and Lo, B. (2010).
Article DNA Methylation Signatures Identify Biologically Distinct Subtypes in Acute
Myeloid Leukemia. CancerCell, pages 13–27.
Griffiths, E., Gore, D., Mcdevitt, M., Douglas, B., Karp, J. E., and G, J. (2010). Epige-
netic differences in cytogenetically normal versus abnormal acute myeloid leukemia.
pages 1–11.
Gr¨ un, B., Kosmidis, I., and Zeileis, A. (2012). Extended beta regression in R: Shaken,
stirred, mixed, and partitioned. JournalofStatisticalSoftware, 48(11):1–25.
62
Guintivano, J., Aryee, M. J., and Kaminsky, Z. A. (2013). A cell epigenotype specific
model for the correction of brain cellular heterogeneity bias and its application to age,
brain region and major depression. Epigenetics, 8(3):290–302.
Halevy, A., Norvig, P., and Pereira, F. (2009). The Unreasonable Effectiveness of Data.
IEEEIntelligentSystems, 24(2):8–12.
Hannum, G., Guinney, J., Zhao, L., Zhang, L., Hughes, G., Sadda, S., Klotzle, B.,
Bibikova, M., Fan, J.-B., Gao, Y ., et al. (2012). Genome-wide methylation profiles
reveal quantitative views of human aging rates. Molecularcell.
Hansen, K. D., Timp, W., Bravo, H. C., Sabunciyan, S., Langmead, B., McDonald,
O. G., Wen, B., Wu, H., Liu, Y ., Diep, D., Briem, E., Zhang, K., Irizarry, R. a., and
Feinberg, A. P. (2011). Increased methylation variation in epigenetic domains across
cancer types. NatureGenetics, 43(8):768–775.
Heyn, H., Li, N., Ferreira, H. J., Moran, S., Pisano, D. G., Gomez, A., Diez, J.,
Sanchez-Mut, J. V ., Setien, F., Carmona, F. J., et al. (2012). Distinct dna methylomes
of newborns and centenarians. Proceedings of the National Academy of Sciences,
109(26):10522–10527.
Hodges, E., Molaro, A., Dos
ˆ
A Santos, C., Thekkat, P., Song, Q., Uren, P. J., Park, J.,
Butler, J., Rafii, S., McCombie, W., Smith, A., and Hannon, G. (2011). Directional
DNA Methylation Changes and Complex Intermediate States Accompany Lineage
Specificity in the Adult Hematopoietic Compartment. MolecularCell, pages 1–12.
Horvath, S., Zhang, Y ., Langfelder, P., Kahn, R., Boks, M., van Eijk, K., van den Berg,
L., and Ophoff, R. (2012). Aging effects on dna methylation modules in human brain
and blood tissue. GenomeBiology, 13(10):R97.
Houseman, E., Accomando, W., Koestler, D., Christensen, B., Marsit, C., Nelson, H.,
Wiencke, J., and Kelsey, K. (2012). Dna methylation arrays as surrogate measures of
cell mixture distribution. BMCBioinformatics, 13(1):86.
Ko, M., Huang, Y ., Jankowska, A. M., Pape, U. J., Tahiliani, M., Bandukwala, H. S.,
An, J., Lamperti, E. D., Koh, K. P., Ganetzky, R., Liu, X. S., Aravind, L., Agarwal, S.,
Maciejewski, J. P., and Rao, A. (2010). Impaired hydroxylation of 5-methylcytosine
in myeloid cancers with mutant TET2. Nature, pages 1–5.
Koh, K. P., Yabuuchi, A., Rao, S., Huang, Y ., Cunniff, K., Nardone, J., Laiho, A.,
Tahiliani, M., Sommer, C. a., and Mostoslavsky, G. (2011). Tet1 and Tet2 Regu-
late 5-Hydroxymethylcytosine Production and Cell Lineage Specification in Mouse
Embryonic Stem Cells. CellStemCell, 8(2):200–213.
63
Laird, P. W. (2010). Principles and challenges of genome-wide DNA methylation anal-
ysis. Naturereviews.Genetics, 11(3):191–203.
Lengauer, C., Kinzler, K. W., and V ogelstein, B. (1997). DNA methylation and genetic
instability in colorectal cancer cells.ProceedingsoftheNationalAcademyofSciences
oftheUnitedStatesofAmerica, 94(6):2545–50.
Maksimovic, J., Gordon, L., and Oshlack, A. (2012). Swan: Subset-quantile within
array normalization for illumina infinium humanmethylation450 beadchips. Genome
Biology, 13(6):R44.
Maunakea, A. K., Nagarajan, R. P., Bilenky, M., Ballinger, T. J., Fouse, S. D., Johnson,
B. E., Hong, C., Nielsen, C., Zhao, Y ., Turecki, G., Delaney, A., Varhol, R., Thiessen,
N., Shchors, K., Heine, V . M., Rowitch, D. H., Xing, X., Fiore, C., Schillebeeckx,
M., Jones, S. J. M., Haussler, D., Marra, M. A., Hirst, M., Wang, T., and Costello,
J. F. (2010). Conserved role of intragenic DNA methylation in regulating alternative
promoters. Nature, 466(7303):253–257.
Mihara, K., Takihara, Y ., and Kimura, a. (2007). Genetic and epigenetic alterations in
myelodysplastic syndrome. Cytogeneticandgenomeresearch, 118(2-4):297–303.
Naik, S. H., Peri´ e, L., Swart, E., Gerlach, C., van Rooij, N., de Boer, R. J., and Schu-
macher, T. N. (2013). Diverse and heritable lineage imprinting of early haematopoi-
etic progenitors. Nature, 496(7444):229–232.
R Development Core Team (2010). The.
Raby, B., Barnes, K., Beaty, T., Bosco, A., Carey, V ., Castro, M., Cheadle, C., Gilliland,
F., Islam, K., Salam, M., et al. (2011). Asthma bridge: the asthma biorepository for
integrative genomic exploration. American Journal of Respiratory and Critical Care
Medicine, 183(1 MeetingAbstracts):A6189–A6189.
Reinius, L. E., Acevedo, N., Joerink, M., Pershagen, G., Dahl´ en, S.-E., Greco, D.,
S¨ oderh¨ all, C., Scheynius, A., and Kere, J. (2012). Differential DNA methylation
in purified human blood cells: implications for cell lineage and studies on disease
susceptibility. PloSone, 7(7):e41361.
Ritchie, M. E., Carvalho, B. S., Hetrick, K. N., Tavar´ e, S., and Irizarry, R. a. (2009).
R/Bioconductor software for Illumina’s Infinium whole-genome genotyping Bead-
Chips. Bioinformatics(Oxford,England), 25(19):2621–3.
Ritchie, M. E., Silver, J., Oshlack, A., Holmes, M., Diyagama, D., Holloway, A., and
Smyth, G. K. (2007). A comparison of background correction methods for two-colour
microarrays. Bioinformatics(Oxford,England), 23(20):2700–7.
64
Sandoval, J., Heyn, H., Moran, S., Serra-Musach, J., Pujana, M. A., Bibikova, M., and
Esteller, M. (2011). Validation of a DNA methylation microarray for 450,000 CpG
sites in the human genome. epigenetics, 6(1559-2294):692–702.
Saxonov, S., Berg, P., and Brutlag, D. L. (2006). A genome-wide analysis of CpG
dinucleotides in the human genome distinguishes two distinct classes of promoters.
Proceedings of the National Academy of Sciences of the United States of America,
103(5):1412–7.
Shi, W., Oshlack, A., and Smyth, G. K. (2010). Optimizing the noise versus bias trade-
off for Illumina whole genome expression BeadChips. Nucleicacidsresearch, pages
1–11.
Siegmund, K. (2011). Statistical approaches for the analysis of dna methylation microar-
ray data. HumanGenetics, 129(6):585–595.
Slatkin, M. (2009). Epigenetic inheritance and the missing heritability problem. Genet-
ics, 182(3):845–850.
Smyth, G. K. (2004). Linear models and empirical bayes methods for assessing differ-
ential expression in microarray experiments. StatisticalApplicationsinGeneticsand
MolecularBiology3,Article, 3.
Teschendorff, A. E., Zhuang, J., and Widschwendter, M. (2011). Independent Surro-
gate Variable Analysis to deconvolve confounding factors in large-scale microarray
profiling studies. Bioinformatics(Oxford,England), pages 1–10.
Verdugo, R. a., Deschepper, C. F., Mu˜ noz, G., Pomp, D., and Churchill, G. a. (2009).
Importance of randomization in microarray experimental designs with Illumina plat-
forms. Nucleicacidsresearch, 37(17):5610–8.
Xie, Y ., Wang, X., and Story, M. (2009). Statistical methods of background correction
for Illumina BeadArray data. Bioinformatics(Oxford,England), 25(6):751–7.
65
Abstract (if available)
Abstract
In the first chapter of this dissertation, I review the importance of cytosine methylation in vertebrate genomes. In the second chapter, I present methods which leverage a design feature of the most widely used DNA methylation microarrays to ameliorate common technical artifacts. In the third chapter, I present an investigation of sensitive methods for detecting differential DNA methylation. In the final chapter, I discuss the impact of these findings for population-scale studies of DNA methylation. ❧ Taken together, these chapters describe versatile and widely applicable approaches to study DNA methylation at the scale demanded by modern investigators. If recent years are any indication, the size and complexity of such studies will only increase. Changes to genomic sequence are relatively rare and typically deleterious, thus the genomes of healthy somatic cells from different tissues in an individual are with few exceptions nearly identical. The epigenomes of these same cells differ radically. Furthermore, the erasure of paternal DNA methylation marks at conception may be incomplete, offering a potential route for epigenetic differences to persist across generations as well as tissues. ❧ Unraveling the roles and functional interactions of genomic and epigenomic changes in health and disease may thus become one of the great scientific and informatic challenges of our time. This dissertation presents effective analytic and informatic practices, in an attempt to guide future investigators as they rise to the challenge.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Prenatal air pollution exposure, newborn DNA methylation, and childhood respiratory health
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
Genetic epidemiological approaches in the study of risk factors for hematologic malignancies
PDF
Robust feature selection with penalized regression in imbalanced high dimensional data
PDF
Functional DNA methylation changes in normal and cancer cells
PDF
DNA methylation changes in the development of lung adenocarcinoma
PDF
Finding signals in Infinium DNA methylation data
PDF
Air pollution, smoking, and multigenerational DNA methylation Signatures: a study of two southern California cohorts
PDF
Air pollution, mitochondrial function, and growth in children
PDF
Pathway-based analysis of multivariate traits using classical dimensionality reduction methods
PDF
Missing heritability may be explained by the common household environment and its interaction with genetic variation
PDF
Genes and environment in prostate cancer risk and prognosis
PDF
Statistical analysis of high-throughput genomic data
PDF
Identification and characterization of cancer-associated enhancers
PDF
Population substructure and its impact on genome-wide association studies with admixed populations
PDF
Dietary carcinogens and genetic variation in their metabolism: epidemiological studies on the risk of selected cancers
PDF
A linear model for measurement errors in oligonucleotide microarray experiment
PDF
X-linked repeat polymorphisms and disease risk: statistical power and study designs
PDF
DNA methylation and gene expression profiles in Vidaza treated cultured cancer cells
PDF
Effects of chromatin regulators during carcinogenesis
Asset Metadata
Creator
Triche, Timothy J., Jr.
(author)
Core Title
Preprocessing and analysis of DNA methylation microarrays
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Statistical Genetics and Genetic Epidemiology
Publication Date
07/23/2013
Defense Date
06/06/2013
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
DNA methylation,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Siegmund, Kimberly D. (
committee chair
), Gilliland, Frank D. (
committee member
), Laird, Peter W. (
committee member
), Lewinger, Juan Pablo (
committee member
)
Creator Email
tim.triche@gmail.com,ttriche@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-298429
Unique identifier
UC11294882
Identifier
etd-TricheTimo-1829.pdf (filename),usctheses-c3-298429 (legacy record id)
Legacy Identifier
etd-TricheTimo-1829.pdf
Dmrecord
298429
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Triche, Timothy J., Jr.
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
DNA methylation