BAYESIAN HIERARCHICAL MODELS IN GENETIC ASSOCIATION STUDIES

by

Wei Liang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Biostatistics)

December 2013

Copyright 2013 Wei Liang
To Yu, Jiang and Jennifer
Acknowledgements

First of all, I would like to express my deepest appreciation to my mentors, Dr. Duncan Thomas and Dr. David Conti. This dissertation would not have been possible without their strong support through the past five years. With their enthusiasm, I was introduced to the fields of genetic statistics and Bayesian analysis. I would like to sincerely thank them for their great guidance, not only in biostatistics, but also in many aspects of my life and career.

I would like to extend my appreciation to all my committee members: Dr. Kai Wang and Dr. Sergey Nuzhdin. Their questions and suggestions have been very helpful in improving my dissertation. I would also like to thank the faculty of the Department of Biostatistics and Epidemiology, who taught me the fundamentals of biostatistics. Special thanks go to Dr. Graham Casey, Dr. Gary Chen and Christopher Edlund for their helpful discussions and support.

I am very grateful to all my supportive friends, both in the LA area and in many other places. They stood by me in good times and bad. Because of them, I never felt thousands of miles away from home.

Last but not least, I would like to express my gratitude to my parents, Guanglin Liang and Yongzhi Dai, and my sister Hong Liang. Their everlasting love is always the motivation for me to do better in my life.
Table of Contents

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 GENETIC ASSOCIATION STUDIES
    1.1.1 Pre-GWAS era: candidate gene study
    1.1.2 GWAS era: the achievements and shortcomings
    1.1.3 Post-GWAS era: an overview of the topics in the dissertation
  1.2 USING BAYESIAN HIERARCHICAL MODELS TO INCORPORATE EXTERNAL INFORMATION
    1.2.1 Motivation: the development of biological ontologies
    1.2.2 Bayesian hierarchical models
  1.3 OVERVIEW
CHAPTER 2 A BAYESIAN HIERARCHICAL MODEL IN NEXT-GENERATION SEQUENCING WITH CASE-CONTROL POOLS
  2.1 INTRODUCTION
  2.2 METHOD: A BAYESIAN HIERARCHICAL MODEL FOR POOLED NGS DATA
  2.3 SIMULATION STUDIES
    2.3.1 Design of simulation studies
    2.3.2 Determining optimal study design for future studies
  2.4 RESULTS
    2.4.1 Validity and empirical power of simulated scenarios
    2.4.2 Construction of a prediction model
    2.4.3 Predicting power and determining optimal designs for future studies
  2.5 DISCUSSION
CHAPTER 3 BAYESIAN HIERARCHICAL LASSO
  3.1 INTRODUCTION
  3.2 ITERATIVE APPROACHES TO ESTIMATE BAYESIAN HIERARCHICAL LASSO
    3.2.1 Method
    3.2.2 Simulation studies
  3.3 MARGINAL MAP ESTIMATES OF HYPER-PARAMETERS
    3.3.1 Method
    3.3.2 Simulation results of the MCMC procedures (3.3.1.A-3.3.1.C)
  3.4 DISCUSSION
CHAPTER 4 BAYESIAN HIERARCHICAL QUANTILE MODEL TO PRIORITIZE GWAS RESULTS
  4.1 INTRODUCTION
  4.2 A BAYESIAN QUANTILE REGRESSION
  4.3 A BAYESIAN HIERARCHICAL MODEL WITH MULTIPLE QUANTILES
  4.4 SIMULATION STUDIES BASED ON A NORMAL APPROXIMATION
    4.4.1 Design of the simulation study
    4.4.2 Simulation results
  4.5 SIMULATION STUDIES BASED ON MARGINAL GWAS ESTIMATION
    4.5.1 Design of the simulation study
  4.6 AN APPLIED STUDY ON THE COLORECTAL CANCER FAMILY REGISTRY
  4.7 DISCUSSION
CHAPTER 5 FUTURE RESEARCH IN RELATED AREAS
  5.1 VARIABLE SELECTION PROBLEM FOR A LARGE DATASET
  5.3 INTEGRATIVE ANALYSIS ON GENETIC ASSOCIATION STUDIES, ESPECIALLY WITH NEXT-GENERATION SEQUENCING
  5.4 PHARMACOGENOMICS AND PERSONALIZED MEDICINE
APPENDICES
  APPENDIX 1. VARIANT DETECTION: COMPARISON BETWEEN POOLING VS. INDIVIDUAL DESIGNS
BIBLIOGRAPHY
List of Tables

Table 3.1 Null Simulation Results
Table 3.2 Simulation Results with 1 covariate
Table 3.3 Simulation Results with 5 covariates
Table 4.1 2×2 table of causal variants with respect to biofeature
Table 4.2 A quantile regression on an empirical example
Table 4.3 Frequencies of 5 Biofeatures in 44010 Regions
Table 4.4 Mean Estimates β_d(τ_k) and 95% Interval
Table 4.5 Changes in rank for the known hits (genes/regions)
List of Figures

Figure 2.1 DAG of the hierarchical model, focusing on a single marker in a single pool
Figure 2.2 Summary of the validity of study designs of a variant with VAF=0.05
Figure 2.3 Comparison of empirical power in different scenarios and designs
Figure 2.4 General trends of empirical power with respect to the design parameters
Figure 2.5 Prediction of power in semi-simulation studies using 1000 Genomes data
Figure 2.6 Grid search for the optimal pooling design, for a study interested in a variant with VAF=0.03 and OR=2, assuming a fixed cost function, with constraints on cost and sample size
Figure 2.7 The optimal power of pooling designs for a specific study (VAF=0.03, OR=2), with the same sample size constraint, different cost functions but comparable cost constraints with individual designs
Figure 2.8 Power (y-axis) comparison of optimal pooling designs with various cost functions (dashed lines), versus individual designs (solid line), over a range of variant allele frequencies (x-axis)
Figure 3.1 The piecewise likelihood function
Figure 3.2 Iteration trace of the updated values using Newton's method (I=100, P=100)
Figure 3.3 Iteration trace of the updated values using Newton's method (I=100, P=500)
Figure 4.1 An illustration of the effect of biofeature on test statistics
Figure 4.2 Gain of number of causal SNPs in the top 20(100) ranked SNPs
Figure 4.3 Gain of number of causal SNPs in the top 20(100) ranked SNPs
Figure 4.4 Gain of number of causal SNPs in the top 20(100) ranked SNPs
Figure 4.5 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs
Figure 4.6 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs
Figure 4.7 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs
Figure 4.8 Gain of number of causal SNPs in the top 20 ranked SNPs
Figure 4.9 Gain of number of causal SNPs in the top 20 ranked SNPs
Figure 4.10 Gain of number of causal SNPs in the top 200 meta-analysis ranked regions
Abstract

Genetic association studies aim to find the genetic variations that are associated with phenotypes (traits). In the last decade, genome-wide association studies (GWAS) have been proposed and performed in large populations for many common diseases/traits. A number of associated SNPs have been found, but only a limited amount of the heritability of these phenotypes is explained by those findings. We are now in a post-GWAS era, in which we are looking for better ways to explain the GWAS results, to follow up with replication/validation studies, and to find novel genetic components that explain the phenotypes. In the scope of this dissertation, I focus on three different aspects of genetic association studies: I propose a novel study design for genetic association studies using next-generation sequencing, investigate a hierarchical model for variable selection in GWAS analysis, and implement a quantile regression model to prioritize GWAS results.

With its potential to discover a much greater amount of genetic variation, next-generation sequencing is fast becoming an emergent tool for genetic association studies. However, the cost of sequencing all individuals in a large-scale population study is still high in comparison to most alternative genotyping options. While the ability to identify individual-level data is lost (without bar-coding), sequencing pooled samples can substantially lower costs without compromising the power to detect significant associations. We propose a hierarchical Bayesian model that estimates the association of each variant using pools of cases and controls, accounting for the variation in read depth across pools and sequencing error. To investigate the performance of our method across a range of numbers of pools, numbers of individuals within each pool, and average coverage, we undertook extensive simulations varying effect sizes, minor allele frequencies, and sequencing error rates. In general, the number of pools and the pool size have dramatic effects on power, while the total depth of coverage per pool has only a moderate impact. This information can guide the selection of a study design that maximizes power subject to cost, sample size, or other laboratory constraints. We provide an R package (hiPOD: hierarchical Pooled Optimal Design) to find the optimal design, allowing the user to specify a cost function, cost and sample size limitations, and distributions of effect size, minor allele frequency, and sequencing error rate.

In genome-wide association studies of complex disease, the most commonly used analysis tool has been a single-point, one-degree-of-freedom test of association, such as the Cochran-Armitage test. However, testing one SNP at a time does not fully exploit the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. By adopting a Laplace prior distribution for the model parameters, and incorporating external information in the prior distribution for the shrinkage parameter of the Laplace distribution, we aimed to perform variable selection using the specified hierarchical model. We tried two approaches to estimate the hyper-parameters, empirical Bayes and fully Bayesian. The method works when the model size is relatively small, but fails when the model is over-parameterized.

As mentioned above, for most complex diseases, only a moderate amount of heritability has been explained by the SNPs declared genome-wide statistically significant. Rather than a resource-limited approach of increasing the sample size, one alternative is to find causal SNPs within the lower Manhattan: the SNPs that just failed to achieve genome-wide significance. Thus, to improve the efficiency of GWAS results, we propose a Bayesian hierarchical quantile regression model to incorporate external information with the aim of improving the ranking of causal SNPs. The proposed model examines the extremes of the p-value distribution by adopting a Normal-Exponential mixture representation of the asymmetric Laplace distribution as a prior distribution, which enables us to build an efficient Gibbs sampler. Simulation results show that the proposed model improves the ranking of causal SNPs if the external information is informative (associated with the causality of a SNP) and does not decrease the causal SNPs' ranking if the external information is non-informative. We compare this approach to several alternatives, including a filtering framework, and demonstrate that these approaches can worsen the ranking of causal SNPs if the external information is not informative.
Chapter 1 Introduction

1.1 Genetic association studies
For rare Mendelian diseases, a single allele usually determines the presence of disease, and a simple family-based study can be conducted to detect the causal allele. However, for most common diseases, the risk of being affected is usually influenced by multiple genetic and environmental factors. For such diseases, the presence of a high-risk allele may only increase the probability of disease by a small to moderate amount (Risch, 2000). To detect those genetic risk alleles in common and usually complex diseases, genetic association studies have been proposed and performed over the last two decades (Cordell & Clayton, 2005). In this section, we provide an overview of the genetic association studies performed, the contributions made along the way, and the efforts needed in future studies.
1.1.1 Pre-GWAS era: candidate gene study
A genetic association study answers a specific question: is there an association between one or more genetic variations and disease susceptibility (or a quantitative trait)? In the scope of this dissertation, we focus on studies with dichotomous traits, using case-control designs. In these studies, the frequencies of genetic variants in cases are compared with those in controls, and evidence of association is obtained if the frequency difference is significant.
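As an illustration of such a frequency comparison (hypothetical counts, not from this dissertation), a simple chi-square test on a 2×2 table of allele counts serves as a stand-in for the usual single-SNP association tests:

```python
# Hypothetical example: compare minor allele counts between cases and controls
# with a 2x2 chi-square test (a simple stand-in for tests like Cochran-Armitage).
from scipy.stats import chi2_contingency

# rows: cases, controls; columns: minor allele count, major allele count
table = [[240, 1760],   # cases: 240 minor alleles out of 2000 chromosomes
         [180, 1820]]   # controls: 180 minor alleles out of 2000 chromosomes

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Here the minor allele frequency is 12% in cases versus 9% in controls, and the test asks whether that difference is larger than chance would allow.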
The first genetic association studies proposed were candidate gene/allele studies. In those studies, one or more candidate genes (or specific alleles) were hypothesized to be potential causal genes (variants). Genotypes of the candidate genes/alleles were obtained, and statistical analysis was performed to assess their association with the risk of disease. In a review paper published in 2002, Hirschhorn summarized 268 genes containing polymorphisms reported to be associated with 1 of the 133 common diseases or dichotomous traits investigated (J. N. Hirschhorn, Lohmueller, Byrne, & Hirschhorn, 2002). Candidate gene studies constrain the investigation to a few pre-selected alleles, based on strong hypotheses, so they are sometimes criticized for significant selection bias. On the other hand, because of the pre-selection, only a few tests are carried out, which usually leads to sufficient statistical power.
1.1.2 GWAS era: the achievements and shortcomings
With the rapid development of genotyping arrays, it became possible to genotype millions of SNPs at the same time. After the completion of the Human Genome Project and the International HapMap Project (Lander et al., 2001) (The International HapMap Consortium, 2005), due to the linkage disequilibrium structure of the genome, a subset of SNPs can be used as tags to represent most of the common allelic variation across the whole genome. The first genome-wide association study (GWAS) was published in 2005, in which 116,204 tag SNPs were genotyped, and two SNPs were found to be statistically significantly associated with age-related macular degeneration (Klein, 2005). Since then, about 6000 SNPs have been reported for associations with more than 130 traits as of January 2012 (http://www.genome.gov/gwastudies/).
GWAS are agnostic in the sense that they test tag SNPs for common variants across the whole genome. However, due to the large number of tests performed, a very small p-value is required to declare significance after multiple-testing correction.
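As a standard back-of-the-envelope calculation (not specific to this dissertation), a Bonferroni correction over roughly one million effectively independent common-variant tests yields the conventional genome-wide significance threshold:

```python
# Bonferroni correction: to keep the family-wise error rate at alpha across
# m tests, each individual test must reach p < alpha / m.
alpha = 0.05
m = 1_000_000          # approximate number of independent tests in a GWAS
threshold = alpha / m
print(threshold)       # the usual genome-wide significance level, 5e-8
```

This is why a SNP must typically reach p < 5×10⁻⁸ to be declared genome-wide significant.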
Although GWAS have achieved great success in identifying risk alleles, a few pitfalls/shortcomings remain.
First of all, population stratification acts as a confounding factor in GWAS, especially in admixed populations. In admixed populations, false positive or false negative association results are reported if both the disease distribution and the genotype distribution differ among the sub-groups of the population. Many methods have been proposed to correct for population stratification; e.g., Liu et al. investigated the effects of global ancestry and local ancestry, and proposed methodology to correct for the potential confounding effects (Liu, Lewinger, Gilliland, Gauderman, & Conti, 2013).
Secondly, published GWAS results usually overestimate the effect size, as a result of the "winner's curse" (Turnbull et al., 2010). This is because the effect of a common variant is usually weak, so the first statistically significant study is almost certain to have overestimated the true effect of the variant being tested (J. N. Hirschhorn et al., 2002). The most common way to correct for the winner's curse is to increase the sample size, or to perform a meta-analysis of previous studies.
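The inflation can be seen in a small simulation (illustrative only; the effect size and standard error below are made up): among many noisy estimates of the same weak effect, the estimates that cross a significance threshold are, on average, larger than the truth.

```python
# Illustrative winner's curse simulation (made-up numbers): the average effect
# estimate among "significant" studies overstates the true effect.
import random
random.seed(1)

true_beta = 0.10          # weak true log odds ratio
se = 0.05                 # standard error of each study's estimate
z_crit = 1.96             # two-sided 5% significance

estimates = [random.gauss(true_beta, se) for _ in range(100_000)]
significant = [b for b in estimates if abs(b / se) > z_crit]

mean_all = sum(estimates) / len(estimates)
mean_sig = sum(significant) / len(significant)
print(f"mean over all studies: {mean_all:.3f}")          # close to the truth
print(f"mean over significant studies: {mean_sig:.3f}")  # inflated upward
```

Conditioning on significance selects the estimates that happened to land high, which is exactly the bias that replication with larger samples or meta-analysis corrects.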
Thirdly, only a small to moderate proportion of the hereditary variation has been explained by the variants discovered. Many hypotheses have been proposed to explain the missing heritability: kinds of genetic variation other than SNPs, e.g., copy number variations and deletions/insertions/inversions; gene-gene interactions and gene-environment interactions, especially with epigenetic factors; the combined effects of multiple rare variants; as well as many other possibilities.
Last but not least, most of the findings from GWAS are just surrogates for the causal variants. Ideally, we would like to follow up the GWAS results to identify the causal variants, which might lead to the elucidation of the biological mechanisms of the disease. But the best way to follow up a GWAS remains an open research question (Freedman et al., 2011) (Cantor, Lange, & Sinsheimer, 2010) (Ioannidis & Thomas, 2009).
1.1.3 Post-GWAS era: an overview of the topics in the dissertation
As described above, there are many pitfalls and shortcomings in genome-wide association studies, and a lot of work remains to be done. In this section, we focus on a few aspects of post-GWAS study design and analysis issues, as an overview of the questions we try to answer in this dissertation.
The first question we are interested in relates to next-generation sequencing (NGS) studies. As mentioned above, the hereditary variation missing from GWAS might be explained by rare variants across the population. With the rapid development of NGS technology, researchers are able to sequence the whole genomes of subjects and test the association of rare variants with disease. However, the cost of NGS is still too high to perform an association study with a large sample size. Pooling has been one potential method to reduce cost without losing too much power. In chapter 2, we investigate a pooled case-control study design with NGS data, propose a novel method to analyze the data, and provide some advice on designing such a pooled study.
Secondly, most current GWAS data are analyzed marginally, i.e., the association of each SNP with the risk of disease is tested one by one. However, testing one SNP at a time does not utilize the full potential of GWAS to identify multiple causal SNPs. In chapter 3, we investigate a novel method to analyze multiple SNPs simultaneously.
Thirdly, it is of great interest to prioritize GWAS results and identify causal variants. In chapter 4, we propose a novel way to prioritize the results from a GWAS and increase the chance of identifying causal variants.
1.2 Using Bayesian hierarchical models to incorporate external information
In the previous section, we listed a few questions of interest in the scope of this dissertation. To answer these questions, we propose to incorporate external biological information using Bayesian hierarchical models.
1.2.1 Motivation: the development of biological ontologies
A biological ontology is a systematic description of specific biological attributes (Jensen & Bork, 2010). In the last decade, as biological data have exploded, accompanied by the rapid development of systems biology, many biological ontologies have been developed. One of the most widely used is the Gene Ontology (Ashburner et al., 2000), which provides systematic descriptions of gene products using standardized terms, organized in the structure of a directed acyclic graph (DAG). Since the purpose of a genetic association study is to find the genetic variations associated with the risk of a disease/trait, it is natural to incorporate the external information in biological ontologies into the analysis of genetic association studies. Many investigators use such information informally or implicitly when interpreting the results of a genetic association study. A Bayesian hierarchical model provides a natural way to use such external information formally.
1.2.2 Bayesian hierarchical models
The usual approach to analyzing a genetic association study is to compute a p-value in a frequentist framework. However, to incorporate external information, Bayesian methods provide a coherent way to analyze and interpret association study results (Stephens & Balding, 2009). In a genetic association study, we are usually interested in multiple parameters, such as the marginal/joint effect sizes of all or part of the SNPs in a GWAS. These parameters are related to each other, based on the underlying structure of the problem. For example, in genetic association studies with a case-control design, the association of each SNP with disease susceptibility is represented by the odds ratio, as the parameter of interest. It is natural to model the data in a hierarchical way, in which the first-level parameters are given prior distributions in terms of hyper-parameters and external information. Such a hierarchical framework specifies clearly the underlying process linking the external information to the observed data, and computational strategies have been well developed to analyze such a hierarchical model.
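As a schematic of this two-level structure (hypothetical names and annotation values; a sketch, not this dissertation's actual model), a second-stage prior might center each SNP's log odds ratio on a linear combination of its external annotations:

```python
# Schematic two-level prior (hypothetical values): the prior mean of each SNP's
# log odds ratio beta_j is a linear function of external annotations z_j,
# with second-stage coefficients pi_coefs playing the role of hyper-parameters.
import math

def prior_mean(z_j, pi_coefs):
    """First-stage prior mean: E[beta_j] = z_j . pi_coefs."""
    return sum(z * p for z, p in zip(z_j, pi_coefs))

def log_normal_prior(beta_j, z_j, pi_coefs, tau2):
    """Log density of beta_j ~ Normal(z_j . pi_coefs, tau2)."""
    mu = prior_mean(z_j, pi_coefs)
    return -0.5 * math.log(2 * math.pi * tau2) - (beta_j - mu) ** 2 / (2 * tau2)

# Example: one SNP with two annotations (e.g., "in coding region", "conserved")
z_j = [1.0, 0.0]          # annotation vector for SNP j (hypothetical)
pi_hat = [0.2, 0.05]      # second-stage coefficients (hypothetical)
print(prior_mean(z_j, pi_hat))
print(log_normal_prior(0.2, z_j, pi_hat, tau2=0.1))
```

SNPs sharing annotations thereby borrow strength from each other: the external information enters only through the prior, not the likelihood.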
Once the model is specified, we can apply multiple computational methods to estimate the hyper-parameters and model parameters. The two most commonly used approaches are empirical Bayes estimation and fully Bayesian inference. For simplicity, empirical Bayes uses the data to provide point estimates of the hyper-parameters. In a fully Bayesian analysis, we can obtain the complete joint posterior distribution of the hyper-parameters and model parameters. If it is possible to derive the conditional and marginal distributions analytically (e.g., using conjugacy among the hyper-parameter prior distribution, model parameter prior distribution and likelihood), we can directly calculate the posterior distribution, or numerically draw simulations from it. But in most cases, it is not possible to derive the conditional and marginal distributions analytically, and we must rely on more complicated computational methods (e.g., Markov chain Monte Carlo) to draw simulations from the posterior distribution. For a more detailed introduction to Bayesian hierarchical models, see (Gelman, Carlin, Stern, & Rubin, 2004).
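A minimal conjugate illustration (toy numbers, not from this dissertation): with a Normal likelihood and a Normal prior, the posterior is available in closed form, and empirical Bayes can estimate the prior variance from the spread of the observed estimates by method of moments.

```python
# Toy conjugate example: beta_hat_j ~ Normal(beta_j, s2), beta_j ~ Normal(0, tau2).
# Posterior: beta_j | data ~ Normal(shrunk mean, shrunk variance).
# Empirical Bayes: estimate tau2 from the observed estimates' variance.

def posterior(beta_hat, s2, tau2):
    """Closed-form posterior mean and variance under Normal-Normal conjugacy."""
    shrink = tau2 / (tau2 + s2)          # shrinkage factor in [0, 1]
    return shrink * beta_hat, shrink * s2

def eb_tau2(beta_hats, s2):
    """Method-of-moments empirical Bayes: Var(beta_hat) = tau2 + s2."""
    n = len(beta_hats)
    mean = sum(beta_hats) / n
    var = sum((b - mean) ** 2 for b in beta_hats) / n
    return max(var - s2, 0.0)

beta_hats = [0.5, -0.3, 0.1, 0.9, -0.6]  # observed per-SNP estimates (toy)
s2 = 0.04                                 # known sampling variance (toy)
tau2_hat = eb_tau2(beta_hats, s2)
mean, var = posterior(beta_hats[0], s2, tau2_hat)
print(f"EB tau2 = {tau2_hat:.3f}; posterior for first SNP: N({mean:.3f}, {var:.4f})")
```

A fully Bayesian treatment would instead place a prior on tau2 and integrate over it, typically via MCMC when no conjugate form exists.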
1.3 Overview

I will conclude this chapter with an overview of the following chapters.
In chapter 2, to address the question of a pooled case-control design for a genetic association study using next-generation sequencing data, we propose a Bayesian hierarchical model to estimate the association of each variant, accounting for the variation in read depth across pools and sequencing error. We undertook extensive simulations varying effect sizes, minor allele frequencies and sequencing error rates. In general, we found that the number of pools and the pool size have dramatic effects on power, while the total depth of coverage per pool has only a moderate impact. This information can guide the selection of a study design that maximizes power subject to cost, sample size, or other laboratory constraints. We developed an R package (hiPOD: hierarchical Pooled Optimal Design) to find the optimal design, allowing the user to specify a cost function, cost and sample size limitations, and distributions of effect size, minor allele frequency and sequencing error rate.
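To make the pooled data structure concrete, here is a minimal generative sketch (hypothetical parameter values; a simplification of the model developed in chapter 2) of how variant-supporting read counts arise in one pool, with read depth varying across pools and a per-base sequencing error rate:

```python
# Minimal generative sketch of pooled sequencing data (simplified; hypothetical
# parameters): draw variant chromosomes in the pool, then read depth at the
# site, then variant-supporting reads allowing for sequencing error.
import numpy as np
rng = np.random.default_rng(7)

def simulate_pool(n_individuals, vaf, mean_depth, error_rate):
    """Simulate variant-supporting read counts for one pool at one site."""
    chroms = 2 * n_individuals
    n_variant = rng.binomial(chroms, vaf)   # variant chromosomes in the pool
    pool_vaf = n_variant / chroms           # true allele fraction in the pool
    depth = rng.poisson(mean_depth)         # read depth varies across pools
    # a read supports the variant if it samples a variant chromosome and is
    # read correctly, or samples a reference chromosome and is misread
    p_variant_read = pool_vaf * (1 - error_rate) + (1 - pool_vaf) * error_rate
    variant_reads = rng.binomial(depth, p_variant_read)
    return depth, variant_reads

depth, variant_reads = simulate_pool(n_individuals=50, vaf=0.05,
                                     mean_depth=200, error_rate=0.01)
print(f"depth = {depth}, variant-supporting reads = {variant_reads}")
```

The sketch shows why sequencing error matters for rare variants: at a 1% error rate, error reads can be comparable in number to true variant reads when the pool allele fraction is low.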
In chapter 3, by adopting a Laplace prior distribution for the model parameters, and incorporating external information in the prior distribution for the shrinkage parameter of the Laplace distribution, we aimed to perform variable selection using the specified hierarchical model. We tried two approaches to estimate the hyper-parameters, empirical Bayes and fully Bayesian. The method works when the model size is relatively small, but fails when the model is over-parameterized.
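The connection exploited here is a standard one: the posterior mode under a Laplace (double-exponential) prior on the coefficients is the lasso estimate, with the prior's shrinkage parameter playing the role of the lasso penalty. A toy sketch (made-up data; using scikit-learn's generic `Lasso`, not this dissertation's implementation):

```python
# MAP estimation under a Laplace prior corresponds to the lasso: coefficients
# with weak signal are shrunk exactly to zero. Toy data; the penalty (alpha)
# is chosen arbitrarily.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [1.5, -1.0]            # only the first two SNPs are "causal"
y = X @ beta_true + rng.normal(scale=0.5, size=n)

fit = Lasso(alpha=0.1).fit(X, y)       # alpha plays the role of the Laplace
print(np.round(fit.coef_, 2))          # shrinkage parameter
```

The two signal coefficients survive the shrinkage while most null coefficients are set exactly to zero, which is the variable-selection behavior the hierarchical model aims for.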
In chapter 4, to improve the prioritization of GWAS results, we propose a Bayesian hierarchical quantile model to incorporate external information, with a focus on the ranking of causal SNPs. The proposed model simultaneously makes inferences on multiple quantiles. It also adopts a Normal-Exponential mixture representation of the asymmetric Laplace distribution, which is efficient for Gibbs sampling of the posterior distribution. In simulation results, we have demonstrated that the proposed model improves the ranking of causal SNPs if the external information is highly informative (associated with the causality of a SNP), and it does not decrease the causal SNPs' ranking if the external information is non-informative.
In chapter 5, we propose some future extensions of this work.
Chapter 2 A Bayesian Hierarchical Model in Next-Generation Sequencing with Case-Control Pools

2.1 Introduction
The last decade has seen a rapid growth of genome-wide association studies (GWAS). GWAS look for associations between genetic variation (mainly single nucleotide polymorphisms, SNPs) and phenotypes of interest in a hypothesis-free manner. These studies have been so popular that around 6000 SNP associations had been reported for more than 130 traits as of January 2012 (http://www.genome.gov/gwastudies/).
However, only a small to moderate proportion of the hereditary variation has been explained by the variants discovered (Frazer, Murray, Schork, & Topol, 2009). To find the heritability unexplained by GWAS, investigators have begun to look beyond common genetic variants. Based on population genetics, deleterious variants are often under greater selection pressure, and they are more likely to be present at a lower frequency in the current population. There have already been a few published studies indicating the effects of rare variants, solely or in combination, in common diseases (B. Li & Leal, 2008; Morris & Zeggini, 2009; Neale et al., 2011; Quintana, Berstein, Thomas, & Conti, 2011).
To
better
study
the
effects
of
rare
variants
in
association
studies,
technological
advances,
especially
Next
Generation
Sequencing
(NGS),
have
been
rapidly
developing.
Despite
its
efficiency
for
generating
sequencing
reads,
NGS
is
still
relatively
expensive
to
be
used
in
population
based
association
studies.
The
structure
of
the
NGS
data
is
also
completely
different
from
current
genotyping
technologies,
and
the
error
rate
of
NGS
is
relatively
higher
than
current
11
genotyping
technologies
(Shendure
&
Ji,
2008).
Aiming for uncompromised statistical power at much reduced cost, DNA pooling has become a promising design for association studies, and it is especially applicable to NGS data (Lupton et al., 2011; Out et al., 2009; Prabhu & Pe'er, 2009; T. Wang, Pradhan, Ye, Wong, & Rohan, 2011). Prabhu et al. constructed overlapping pools to tackle the problem of detecting the carriers of rare variants (Prabhu & Pe'er, 2009); similar work along these lines includes (Shental & Amir, 2010) and (Erlich et al., 2009). Price et al. also carried out tests for detecting rare-variant associations from pools (Price et al., 2010). Druley et al. developed an efficient algorithm to accurately quantify rare variants from pooled genomic DNA using NGS (Druley et al., 2009). Futschik and Schlotterer investigated statistical properties of methods for the detection and frequency estimation of rare variants in NGS data from pooled samples (Futschik & Schlotterer, 2010). Lee et al. sought the optimal number of individuals in a DNA pool for detecting an allele under a given coverage depth and detection threshold (Lee, Choi, Yan, Lifton, & Zhao, 2011). Focusing on association studies, Kim et al. (Kim et al., 2010) developed a likelihood ratio test for pooled or un-pooled designs, and investigated the design that achieved optimal power. They elegantly considered the structure of NGS data, especially the nature of NGS error. However, some essential parts are still missing for a thorough analysis of such a design: a clear criterion to discriminate SNPs from sequencing error, a thorough examination of the validity of the study design, robust estimation of SNP frequency and effect size, a realistic cost function reflecting the different costs of pool formation and sequencing, and corrections for multiple tests.
Here, we propose a Bayesian framework for addressing these issues. Capturing the exact NGS data structure, we show that our method yields an unbiased estimate not only of the variant frequency in a pool, but also of the effect size of a variant, for a robust test of the significance of the association. Using extensive simulations, we describe the constraints for valid study designs that yield the expected test statistic distribution under the null. Among valid designs, we show that our test statistic has a Normal distribution and that its scale parameter is well predicted by our design parameters. Based on these results, we developed a formula for power calculations incorporating all the design parameters and the assumed genetic model, and extended this approach to the GWAS setting. Using a realistic model for the dependence of NGS costs on the design parameters, we have developed a statistical tool for finding an optimal design subject to the budget limitations, the variant allele frequency (VAF), the effect size of the SNP to be targeted, and the error rate of the NGS instrument. For researchers looking for a cost-efficient association study design, we hope our work will provide useful insights into the design and analysis of such studies in pooled samples with NGS technology.
2.2 Method: A Bayesian hierarchical model for pooled NGS data
In a non-overlapping pooled sequencing study, individuals are randomly allocated into pools within each stratum (e.g., case/control status). At each location we observe the total number of reads X_p and the number of variant reads V_p in pool p. We model V_p as a Binomial distribution with probability θ_p:

V_p ~ Bin(X_p, θ_p),  X_p ~ Pois(λγ_p)    (2.1)
where λ is the average depth of sequencing for each pool and γ_p ~ Gamma(α, 1/α) accounts for variation in DNA capture efficacy; note that this implies that the average depth of sequencing per subject is λ/N_p. V_p/X_p may be used as an estimate of the variant frequency; however, the standard error of this estimate will be greatly underestimated, as it is driven by the coverage and not by the number of individuals within each pool.
Since pool p constitutes DNA samples from N_p individuals, we let Z_p denote the number of chromosomes (out of the 2N_p possible chromosomes) that carry the variant allele in pool p. Thus, ρ_p denotes the variant allele frequency for the individuals in pool p, and Z_p is distributed as:

Z_p ~ Bin(2N_p, ρ_p)    (2.2)
Since each read is generated as an independent Bernoulli process, in the absence of sequencing error the probability of a read being called as a variant is Z_p/2N_p. To account for sequencing error we assume a simple and symmetric error structure in which ε = Pr(read = v | allele = r) = Pr(read = r | allele = v), where v denotes the variant allele and r denotes the reference allele. Thus, in the presence of sequencing error, θ_p is defined as:

θ_p = Pr(read = v, allele = v) + Pr(read = v, allele = r)
    = Pr(read = v | allele = v) Pr(allele = v) + Pr(read = v | allele = r) Pr(allele = r)
    = (1 − ε) Z_p/2N_p + ε (1 − Z_p/2N_p)    (2.3)
To facilitate case-control comparisons across pools, we define the variable W_p as an indicator of the disease status of all individuals within pool p (W_p = 1 if pool p constitutes cases and W_p = 0 if pool p constitutes controls), and we use a logistic model to link the pool-level variant allele frequency with the case-control status of the pool:

logit(ρ_p) = β_0 + β_1 (W_p − W̄)    (2.4)
Here, β_0 is the logit of the variant frequency within the entire sample; β_1 is the log of the odds ratio between cases and controls used for association testing (e.g., with disease or determinants of the VAF); and W̄ is the average of W. For completeness, we tried different prior distributions on the hyperparameters and found that the model is insensitive to the choice of prior, especially in valid study designs. Thus we put relatively non-informative priors on the hyperparameters, and fit the entire model using Markov chain Monte Carlo (MCMC) techniques as implemented in WinBUGS:

β_0 ~ N(0, 10),  β_1 ~ N(0, 10),  ε ~ U[0, 0.05]    (2.5)

Overall, our hierarchical model is summarized in the DAG in Fig 2.1.
Figure 2.1 DAG of the hierarchical model, focusing on a single marker in a single pool
2.3 Simulation studies

2.3.1 Design of simulation studies
To investigate the performance of our hierarchical model and to build a prediction model for power, we performed simulations over a variety of scenarios with different variant allele frequencies (VAF), effect sizes (OR), and sequencing error rates (ε), as well as a range of study design parameters including the number of pools (P, such that there are P/2 case pools and P/2 control pools), the number of individuals per pool (N_p), and the average coverage for each pool (λ).
Given specified values for the sample VAF, OR and ε, we simulated the number of variant-carrying chromosomes in each pool using Eqs. (2.2) and (2.4). Within each pool we simulated the total coverage X_p and the number of variant reads V_p following Eqs. (2.1) and (2.3).
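The generative steps above (Eqs. 2.1–2.4) can be sketched in a few lines. This is a minimal illustration only, not the WinBUGS model itself; the function name and default parameter values are hypothetical choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_pools(P=40, Np=50, lam=160, vaf=0.05, odds_ratio=1.6,
                   eps=0.01, alpha=10.0):
    """Simulate pooled NGS read counts following Eqs. (2.1)-(2.4)."""
    W = np.repeat([1, 0], P // 2)                  # case / control pool indicators
    beta0 = np.log(vaf / (1 - vaf))                # logit of the overall VAF
    beta1 = np.log(odds_ratio)
    # Eq. (2.4): pool-level VAF from the logistic link
    rho = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * (W - W.mean()))))
    Z = rng.binomial(2 * Np, rho)                  # Eq. (2.2): variant chromosomes
    gamma = rng.gamma(alpha, 1.0 / alpha, size=P)  # capture-efficacy variation
    X = rng.poisson(lam * gamma)                   # Eq. (2.1): total reads per pool
    # Eq. (2.3): read-level variant probability with symmetric error eps
    theta = (1 - eps) * Z / (2 * Np) + eps * (1 - Z / (2 * Np))
    V = rng.binomial(X, theta)                     # variant reads per pool
    return W, X, V

W, X, V = simulate_pools()
```

Each replicate in the simulation study corresponds to one such draw of (W, X, V), which is then analyzed with the hierarchical model.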
The simulations used the following specified values: VAF = {0.002, 0.005, 0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.12, 0.15, 0.2}, OR = {1.0, 1.2, 1.6, 2.0}, and ε = {0.005, 0.01, 0.02}. We varied the design parameters with N_p = {1, 2, 5, 20, 40, 60, 80, 100, 150, 200}; P = {2, 4, 10, 40, 80, 120, 160, 200, 250, 300, 350, 400}; and λ = {10, 20, 40, 80, 160, 320, 640, 1280}.
For each specified scenario, we simulated 1000 replicates and analyzed the resulting data with our hierarchical model using WinBUGS and a single chain of 1000 MCMC samples. We summarized the posterior distributions of β_0, β_1 and ε. Hypothesis testing was based on the test statistic B = β̂_1/σ̂_1 for each replicate, where β̂_1 and σ̂_1 are the posterior mean and standard error of the effect size. Under the null hypothesis, we assume that B is distributed as a standard Normal. The empirical power was calculated as the proportion of replicates showing statistical significance to reject the null hypothesis (p-value < 0.05), and the empirical type I error was the empirical power under the null scenarios (where the simulated OR = 1).
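Given the replicate test statistics, the empirical power (and, under the null, the type I error) reduces to a proportion of rejections against the standard-Normal reference. A small sketch, assuming a two-sided test at the stated level; the function name is ours:

```python
import numpy as np
from scipy.stats import norm

def empirical_power(B, alpha=0.05):
    """Fraction of replicate statistics B = beta1_hat / sigma1_hat that
    reject H0: beta1 = 0 against a standard-Normal reference (two-sided)."""
    p = 2 * norm.sf(np.abs(np.asarray(B)))   # two-sided p-values
    return float(np.mean(p < alpha))

# Under a null scenario, the rejection rate should be close to alpha:
null_B = np.random.default_rng(0).standard_normal(100_000)
```

Applying `empirical_power(null_B)` should return roughly 0.05, which is exactly the type-I-error check used to screen scenarios.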
2.3.2 Determining optimal study design for future studies

2.3.2.A Using Simulations to Build a Prediction Model for Power
In valid designs, the test statistic should be distributed as a standard Normal under a null scenario. Thus, we used a Kolmogorov-Smirnov (KS) test to determine whether the test statistics obtained from the hierarchical model under a null scenario for a given study design follow a standard Normal distribution, which is a prerequisite for the study design to be valid. If the corresponding KS test was significant (p-value < 0.05) we removed the study design from the determination of the prediction model.
Thus, in all the valid designs, the test statistic follows an asymptotically Normal distribution summarized by the location parameter of each valid design, denoted µ_d, with B_d ~ N(µ_d, 1). We empirically estimated the location parameter µ_d using B̄_d = R⁻¹ Σ_{r=1}^{R} B_dr, where R is the number of replicates for design d, and B_dr is the test statistic for design d, replicate r.
Using these estimated location parameters, we then built a linear regression model to predict µ_d as a function of all the scenario parameters (VAF, OR and ε) and design parameters (N_p, P and λ). This regression model is then used to predict power for future study designs.
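The regression step can be sketched with ordinary least squares. The toy features and coefficients below are stand-ins for the actual scenario and design covariates, which are assumptions of this sketch rather than the fitted model from the dissertation:

```python
import numpy as np

def fit_prediction_model(features, mu_bar):
    """Least-squares fit of the empirical location parameters mu_d on
    scenario/design covariates (with intercept); returns (coef, R^2)."""
    A = np.column_stack([np.ones(len(mu_bar)), features])
    coef, *_ = np.linalg.lstsq(A, mu_bar, rcond=None)
    fitted = A @ coef
    ss_res = np.sum((mu_bar - fitted) ** 2)
    ss_tot = np.sum((mu_bar - mu_bar.mean()) ** 2)
    return coef, 1 - ss_res / ss_tot

# Toy data standing in for the B_d values of the simulated valid designs:
rng = np.random.default_rng(0)
feats = rng.uniform(size=(200, 3))
mu = 0.5 + feats @ np.array([2.0, -1.0, 0.8]) + rng.normal(0, 0.01, 200)
coef, r2 = fit_prediction_model(feats, mu)
```

In the actual analysis the response is a transformation of µ_d (a cube root, per Section 2.4.2) and the covariates include interactions, but the fitting mechanics are the same.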
2.3.2.B Predicting Power for Future Studies
To predict power for any proposed future study, we first determined whether the study design results in valid inference using our hierarchical modeling analysis and, if deemed valid, we then predicted power using the prediction model constructed from the analysis of the simulated results. For the examples in this paper, we determined whether a proposed study was valid by calculating the probability that a variant will be detected in either the cases or the controls.
For a specific design, we use a threshold on the number of variant reads, V_t, to determine whether a variant is detected. Given the pool size (N_p), average coverage per pool (λ) and number of pools (P), if there were no variant in the pool, the number of variant reads (V_0) would be approximately distributed as V_0 ~ Bin(λ, θ_0 = ε), where we use the average coverage λ as an approximation of the true coverage. If only one chromosome in the pool carries a variant, the number of variant reads (V_1) would be approximately distributed as V_1 ~ Bin(λ, θ_1 = (1 − ε)/2N + ε(2N − 1)/2N).
By comparing these two situations, we calculate two corresponding probabilities: Pr(V_0 > V_t) and Pr(V_1 > V_t). To control type I error with Bonferroni correction for multiple tests at a significance level of 0.05, and to ensure adequate odds that a true variant is detected vs. a false discovery, we chose the smallest V_t that satisfies both Pr(V_0 > V_t) < (0.1/P) and Pr(V_1 > V_t) > 5 Pr(V_0 > V_t). This is a conservative comparison, since Pr(V_1 > V_t) < Pr(V_2 > V_t) < … < Pr(V_2N > V_t), and we are comparing the least distinctive positive scenario vs. the negative scenario.
Next, we used the threshold V_t to calculate the probability of variant detection. If we assume an equal number of case and control pools, the probability of detecting a true variant in both the case pools and the control pools is:

Pr(detection) = {1 − [1 − Pr(V > V_t | N_p, λ, ε)]^{P/2}}²
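The threshold search and the detection probability follow directly from the two Binomial tail probabilities above. A sketch using SciPy; the numeric inputs at the bottom are illustrative values, and the function names are ours:

```python
from scipy.stats import binom

def choose_vt(lam, Np, eps, P):
    """Smallest read-count threshold V_t with Pr(V0 > V_t) < 0.1/P and
    Pr(V1 > V_t) > 5 * Pr(V0 > V_t); returns (V_t, Pr(V1 > V_t))."""
    theta1 = (1 - eps) / (2 * Np) + eps * (2 * Np - 1) / (2 * Np)
    for vt in range(int(lam) + 1):
        p0 = binom.sf(vt, lam, eps)     # Pr(V0 > vt): no variant in the pool
        p1 = binom.sf(vt, lam, theta1)  # Pr(V1 > vt): one variant chromosome
        if p0 < 0.1 / P and p1 > 5 * p0:
            return vt, p1
    return None

def pr_detection(p1, P):
    """Pr(detecting a true variant in both the case and control pools)."""
    return (1 - (1 - p1) ** (P / 2)) ** 2

vt, p1 = choose_vt(lam=160, Np=20, eps=0.01, P=40)
```

For these illustrative inputs the resulting Pr(detection) is well above the 0.9 cutoff used later, so the design would be deemed valid.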
Valid study designs were those in which Pr(detection) > Pr_T, where Pr_T is determined from a comparison of the validity determination using the KS test to the predicted validity using the calculated Pr(detection) for all simulated scenarios.
If a novel study design was determined to be valid, then we predicted the expected non-centrality parameter using the fitted regression model. Let µ_d denote the predicted non-centrality parameter, P_d denote the number of pools in the proposed study design, and q_αd denote the one-sided α-level critical value for the test statistic: Pr[x > q_αd | x ~ N(0, 1)] = α. The expected power for the proposed study design is Λ_d = Pr[x > q_αd | x ~ N(µ_d, 1)]. Using the estimated non-centrality parameter µ̂_d from the prediction model, we predicted the expected power as Λ̂_d = Pr[x > q_αd | x ~ N(µ̂_d, 1)].
We obtained the 95% confidence interval for the non-centrality parameter µ_d, denoted (µ̂_ld, µ̂_ud), such that Pr(µ_d > µ̂_ud) = Pr(µ_d < µ̂_ld) = 0.025. The confidence interval for the predicted power was based on the confidence interval of the non-centrality parameter.
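The power step above is a single Normal tail calculation; a minimal sketch of the one-sided computation (the function name is ours):

```python
from scipy.stats import norm

def predicted_power(mu_hat, alpha=0.05):
    """Power of the one-sided alpha-level test when B ~ N(mu_hat, 1):
    Lambda_hat = Pr[x > q_alpha | x ~ N(mu_hat, 1)]."""
    q = norm.isf(alpha)           # critical value: Pr(x > q | N(0, 1)) = alpha
    return norm.sf(q - mu_hat)    # shift the tail by the non-centrality

# A null design (mu_hat = 0) has power equal to alpha by construction.
```

Evaluating the same function at µ̂_ld and µ̂_ud gives the confidence bounds on the predicted power directly, since power is monotone in the non-centrality parameter.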
To test the prediction model, we also performed an empirically based simulation study using the 1000 Genomes exome alignment data (based on the exome alignment index file at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/exome.alignment.index), including unrelated individuals in both pilot project 2 and pilot project 3. Focusing on a subset of representative SNPs on Chromosome 2 (as shown in Table 2), we extracted the sequencing reads and genotype of each individual, and simulated disease status based on a logistic disease model. Based on the specific study design (i.e., number of pools P = {20, 40, 100, 200}, number of individuals per pool N_p = 1000/P, and average coverage per pool λ = {20, 50}), we randomly sampled affected/unaffected individuals to form case/control pools, and constructed the pool-level sequencing data by summing over a randomly sampled subset of the sequencing reads of each member of the pool. Finally, we applied our Bayesian hierarchical model to analyze the generated pooling association study data. For each scenario, we generated 1000 replicates and compared the empirical power to the power calculated from our prediction function.
2.3.2.C Finding an Optimal Study Design
Given a set of assumed scenario parameters (VAF, OR and ε), we used the following procedure to find an optimal study design:

1) Specify the range of interest on the design parameters (P, N_p and λ) with constraints on study cost (C_T) and sample size (N_T).
2) Use a generalized cost function: C_T = C_E + C_P·P + C_X·P·λ, where C_E is the cost per experiment, including the initiation of the study and sample collection; C_P is the cost per pool, such as sample preparation (and region capture if focused on region-specific sequencing); and C_X is the cost per 1x coverage.
3) Specify a 120x120 grid covering values of P and N_p across the specified range of interest.
4) For each point in the grid, use the corresponding number of pools (P_g) and number of individuals per pool (N_pg) to calculate the maximum coverage λ_g consistent with the constraint on study cost (C_T).
5) Check whether the design (P_g, N_pg) satisfies all of the following conditions:
   a. P_g·N_pg ≤ N_T, to satisfy the sample size constraint;
   b. the Pr(detection) for the specific design, calculated using the maximum coverage λ_g, satisfies Pr(detection) ≥ Pr_T (such designs are deemed valid);
   c. using the estimated prediction model, the predicted power for the study design can be calculated.
6) Perform an exhaustive search over all valid grid points, selecting as optimal the design with the maximum power.
To help investigators find the optimal pooling design, an R package, hiPOD (hierarchical Pooled Optimal Design), is now available in the online CRAN repository.
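The grid search in steps 1)–6) can be sketched as follows. Here `power_model` and `is_valid` are crude placeholders standing in for the fitted prediction model and the Pr(detection) ≥ Pr_T check (both are assumptions of this sketch, not the hiPOD implementation), and the dollar figures echo the example in Section 2.4.3:

```python
import numpy as np

def power_model(P, Np, lam):
    """Placeholder prediction model: power rises with total sequencing effort."""
    return 1 - np.exp(-P * Np * lam / 2e5)

def is_valid(P, Np, lam):
    """Placeholder validity check standing in for Pr(detection) >= Pr_T."""
    return lam / Np >= 2

def optimal_design(C_T=452_915, C_E=18_915, C_P=970, C_X=300, N_T=5000):
    """Exhaustive grid search for the pooling design maximizing predicted
    power under the cost function C_T = C_E + C_P*P + C_X*P*lam."""
    best = (0.0, None)
    for P in range(2, 242, 2):                       # even numbers of pools
        for Np in range(1, 121):
            if P * Np > N_T:                         # sample-size constraint
                continue
            lam = (C_T - C_E - C_P * P) / (C_X * P)  # max affordable coverage
            if lam < 1 or not is_valid(P, Np, lam):
                continue
            pw = power_model(P, Np, lam)
            if pw > best[0]:
                best = (pw, (P, Np, round(lam)))
    return best

power, design = optimal_design()
```

Swapping the two placeholders for the actual fitted regression and detection-probability calculation recovers the procedure implemented in hiPOD.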
2.4 Results

2.4.1 Validity and empirical power of simulated scenarios
In total, we simulated 116,160 scenarios with different study designs and disease models. Not all the scenarios result in valid tests of association. Results from the KS tests for the uniform distribution of type-I error under the null scenarios demonstrated that, in general, study designs with few pools, few individuals within each pool, or low coverage tend to result in invalid tests of association. This was especially true for scenarios dealing with rare variants and small effect sizes. These trends are reflected in Fig 2.2, showing the valid and invalid designs for a variant with VAF = 0.05 and an average coverage of either 40 or 160, respectively. The reason large-N_p designs are often invalid in Fig 2.2 is that the average depth of sequencing per subject is very low for some of them.
Figure 2.2 Summary of the validity of study designs for a variant with VAF = 0.05 (panels A and B)
From all the simulated scenarios we selected a few representative examples to illustrate the relationship between empirical power and the design parameters. In Fig 2.3A and 2.3B, the variant allele frequency is 0.05 with a moderate effect size (OR = 1.6). In such a setting, empirical power increases as either coverage increases or more individuals are included in each pool. The figure also displays the optimal power from a comparable individual sequencing study, which is assumed to have no errors in genotyping calls due to high coverage and to have the same number of individuals sequenced as the number of pools constructed in the pooling design (thus requiring similar sequencing effort). Such a separately constructed individual sequencing analysis has improved power over the single-individual pool analysis, as it eliminates the within-pool uncertainties in the analysis. In this scenario, the pooling design has superior power over individual sequencing, especially as the number of individuals in each pool increases or as the coverage increases. Fig 2.3C, 2.3D and 2.3E show a scenario in which a rare variant (VAF = 0.01) with a large effect (OR = 2.0) is tested. Similar results are obtained, with empirical power increasing as the number of pools increases, the coverage increases, or the number of individuals in each pool increases. In all the plots of Fig 2.3, we include both valid and invalid designs to demonstrate overall trends in power. However, while it is true that increasing the number of individuals per pool increases power, if the average coverage per pool is held constant the study design is more likely to yield an invalid test.
Figure 2.3 Comparison of empirical power in different scenarios and designs (panels A–E)
In general, power increases most with the number of pools, whereas the number of individuals per pool has a moderate effect and the coverage depth per pool has only a limited effect. These trends are seen more clearly in contour plots of power with respect to the design parameters (Fig 2.4).
Figure 2.4 General trends of empirical power with respect to the design parameters (panels A–F)
2.4.2 Construction of a prediction model
To predict power for future study designs we removed all invalid designs and all null scenarios from the simulated data. We confirmed that the test statistic was asymptotically distributed as Normal, empirically summarizing the location parameter µ_d with B̄_d. We subsequently used B̄_d to predict power and found that the predicted power compared very favorably to the empirical power observed, with an R² = 0.998.
In constructing prediction models for the non-centrality parameter of the test statistic, we built separate models for rare variants (VAF < 0.01), less common variants (0.01 < VAF < 0.05), modest-frequency variants (0.05 < VAF < 0.1), and common variants (VAF > 0.1). We transformed the response variable for better model prediction, using a linear regression model for the cube root of the non-centrality parameter as a function of the variant allele frequency, log odds ratio, sequencing error rate, number of pools, number of subjects per pool, average depth of sequencing, and the various transformations of and interactions among these variables listed in Table 1. In general, all four models (for the different strata of VAFs) predict the non-centrality parameter quite accurately (all with R² > 0.97). We summarized the partial R² for each variable in Table 1 for the full model, bolding the two strongest predictors after ln(OR). Across all the design parameters, the average sequencing coverage depth explains the most variation in performance for rare VAFs (VAF 0.002–0.01), whereas the number of pools explains the greatest variation for intermediate and common VAFs (VAF 0.05–0.2). The predicted power is highly correlated (R² = 0.97) with the empirical power.
2.4.3 Predicting power and determining optimal designs for future studies
For future studies with a given design in terms of number of pools, number of individuals within each pool, and coverage, we calculate the probability of variant detection and use this value to decide whether the proposed study is valid. To determine the detection probability cutoff, we used the simulated data scenarios and results to construct a ROC curve comparing these calculated probabilities to the validity determination from the KS test, using the KS test as the gold standard. Overall, the area under the ROC curve was 0.89. From this curve we selected a cutoff of 0.9, yielding a sensitivity of 0.91 and a specificity of 0.74. We used the estimated prediction models to calculate power for any proposed future study design. For each valid study design, power prediction involves two steps: first predict the non-centrality parameter for the test statistic, and then calculate power based on the predicted non-centrality parameter. In the simulation study based on 1000 Genomes data, across a set of representative SNPs with different VAFs and a range of assumed ORs, we first calculated the probability of detection for each study design, and defined those with a probability of detection larger than 0.9 as valid designs. In all the valid designs, we found that the predicted power is in good concordance with the empirical power, although there is a general trend for the prediction model to be conservative (shown in Fig 2.5).
Figure 2.5 Prediction of power in semi-simulation studies using 1000 Genomes data
As an example of finding the optimal design, suppose we have the following values for our cost function: C_E = $18,915, C_P = $970, and C_X = $300, with a total budget of C_T = $452,915 [personal communication, Dave Duggan from TGen]. We would like to know the optimal pooling study design for a maximum available sample of 5,000 and an error rate ε = 0.01. The results are shown in Fig 2.6 for VAF = 0.03 and OR = 2.0. In the contour plot, we see that power increases with the number of pools and the number of individuals per pool, under the constraints of the given costs and sample size. The overlap of the red region (valid study designs) and the blue region (study designs satisfying the cost constraints) defines the region of interest. In this example, the optimal power is 0.42, achieved with 51 pools of 62 individuals each and an average coverage of 25. The top 10 choices of optimal designs are shown in Appendix Table 1.
Figure 2.6 Grid search for the optimal pooling design, for a study interested in a variant with VAF = 0.03 and OR = 2, assuming a fixed cost function, with constraints on cost and sample size
Of course, a pivotal question facing researchers is when to use a pooling study design and when to use individual-level sequencing. To compare these study designs we use a similar cost function, substituting the number of individuals for the number of pools and assuming that 4x coverage is sufficient to generate genotyping calls with no errors. With a total available budget of $452,915 we can individually sequence 200 individuals. For a SNP with a VAF = 0.03 and an OR = 2.0, the expected power is 0.40 (calculated using Quanto (Gauderman, 2002)).
More generally, to compare a pooling design vs. individual-level sequencing, we investigated power over a range of different cost functions. In such a comparison, it is more informative to focus on the relative value of sample preparation cost versus sequencing cost (C_P/C_X), since the experiment-wise cost (C_E) is fixed. As before, we focus on the scenario with a VAF = 0.03, an OR = 2.0, and an error rate ε = 0.01. For the cost function, we held constant C_E = $10,000 and C_P = $1,000 and varied the value of C_X from $50 to $500. For each value of C_X, we calculated the corresponding cost of an individual-level sequencing design (assuming 200 individuals with 4x coverage), and then used this to constrain the search for an optimal pooling study design. The power for an individual design is constant at 0.40; however, the power for a pooled design varies with cost. Fig 2.7 demonstrates that as C_X drops, there is a gain in power for an optimal pooled design, with all pooled designs having greater power than the individual sequencing design.
Figure 2.7 The optimal power of pooling designs for a specific study (VAF = 0.03, OR = 2), with the same sample size constraint and different cost functions, but cost constraints comparable to individual designs
A more general set of comparisons is made over a grid of potential VAFs and ORs, with varying values of the sequencing cost C_X. As shown in Fig 2.8, pooling designs provide improved power over individual designs in most of the scenarios, except when the VAF is very small (VAF = 0.005).
Figure 2.8 Power (y-axis) comparison of optimal pooling designs with various cost functions (dashed lines) versus individual designs (solid line), over a range of variant allele frequencies (x-axis)
However, our assumption of no effect of sequencing errors in individual designs might not hold when the VAF is as small as 0.005. We also examined this comparison in a GWAS setting, where the significance level is set to 5×10⁻⁸. In such a setting, pooling was superior to individual sequencing in all the scenarios we investigated, especially when the sequencing cost is low relative to the other costs. For example, assuming the sequencing cost (C_X) decreases to $50 while the other costs are constant (C_E = $10,000 and C_P = $1,000), for a variant with an odds ratio of 1.2 and allele frequency 0.2, the power from an individual design is close to 0, whereas the optimal pooling design yields power of 0.15; for a variant with an odds ratio of 1.6 and allele frequency 0.1, the power from an individual design is 0.32, whereas the optimal pooling design yields power of 0.9.
2.5 Discussion
We have shown that pooling designs for genetic association studies using next-generation sequencing can be a viable alternative to designs with individual-level sequencing. However, careful consideration of the design parameters is needed to understand the tradeoffs among the number of pools, the number of individuals within each pool, and the total coverage per pool. These tradeoffs must also be understood in the context of the goals of the study (e.g., discovery vs. association testing; rare variants vs. common variants) and the limitations of budget and sample size. For testing association, an increase in the number of pools for case-control comparisons can increase power. In addition, increasing the number of individuals within each pool can result in a corresponding increase in power due to a decrease in the variance of the estimated allele frequency for each pool. However, our simulation study demonstrates that maximum power is not simply a matter of increasing both of these factors until the budget is exhausted and all individuals are included in pools. While increasing the number of individuals within a pool increases the number of potential chromosomes to be sequenced, there must be sufficient total coverage to ensure a high probability that all chromosomes within a pool are read. Moreover, power for testing association with rare variants will be impacted more substantially by low average coverage per chromosome. Thus, under a limited budget and in certain circumstances, there are advantages to decreasing the number of pools while increasing the coverage for each pool. Likewise, increasing the number of individuals within a pool may be counterproductive if the limits of coverage have already been reached. In the extreme, this interplay impacts not only power but also the validity of inference. Too few pools, too few individuals or inadequate coverage can result in unstable estimates and invalid tests of association.
Future studies need more than broad conclusions for guidance. To accomplish this we developed a procedure for determining the optimal study design given the total available budget and sample size, as well as the sequencing error rate, VAF and odds ratio for the targeted causal SNP. In searching for the optimal study design, there are two key steps in the procedure. First, a proposed study design is determined to be valid if the design has a probability of detection greater than 0.9. As with power, too few pools, too few individuals or inadequate coverage result in lower probabilities of detection. However, since the impact of these design parameters is complicated, the ability to calculate probabilities of detection can be useful if the goal of a particular study is discovery rather than association. As highlighted in Appendix 1 and the corresponding figures (Appendix Fig 1, Appendix Fig 2 and Appendix Table 2), pooling designs outperform individual-level sequencing for the discovery of rare variants even when accounting for sequencing error.
The second step in our power estimation procedure uses a regression model for predicting the expected non-centrality parameter and, in turn, the expected power. The results from over 116,000 simulation scenarios were used to construct the prediction model. Overall, the prediction model does very well in predicting the empirical power for the range of scenarios given by the simulations. This approach was used because an exact calculation of the non-centrality parameter is infeasible due to the complexity of our multi-level model and the Bayesian framework in which our estimation is done, and the prediction model allows flexibility in predicting power for novel study designs. However, there will certainly be increased uncertainty when predicting power for study designs outside the simulated ranges of the design parameters. For example, prediction for a pooled GWAS analysis using an α-level of 5×10⁻⁸ will have increased uncertainty, since very few simulated designs resulted in test statistics reaching such significance.
The ability to predict power for a novel pooled study design facilitates comparison to individual-level sequencing designs. Determining which design is most powerful relies heavily on the cost function. Here, we used a simple cost function driven by a sample cost (e.g., cost per pool (or individual)) and a sequencing cost (e.g., cost per unit coverage per pool (or individual)). Under this cost function, the relative value of sample cost versus sequencing cost (C_P/C_X) is most informative for determining which design option is best. Our results indicate that even as sequencing costs drop, pooled designs offer potentially more power in many situations in comparison to cost-equivalent individual designs. This is mainly driven by the sample costs and highlights how the goals of the study can impact design choice.
For targeted sequencing studies, sample costs are often high relative to sequencing costs due to library formation; in such studies pooling may be the best design. In contrast, whole-genome sequencing studies currently have relatively low sample costs with much higher sequencing costs; for these studies, individual-level designs may be more appropriate, especially for rare variants. To present comparisons that would highlight trends, we assumed that 4x coverage was sufficient to obtain genotypes without error from individual-level sequencing. We ignored sequencing error and variation in coverage along the genome, as well as techniques such as imputation that can improve genotype calling from sequencing data. Despite these assumptions, we believe that our approach provides a fair comparison between the design options and clearly presents the influencing factors.
Finally, for testing the marginal association of a variant, our novel hierarchical modeling approach accounts for the number of pools, the number of individuals within each pool, the coverage, and the sequencing error. We have presented the model to test association between cases and controls, but the general framework is applicable to the comparison of any categorical variable that can be conditioned on to construct distinct pools, such as tumor vs. normal. Additionally, one can construct pools by conditioning on more than one variable and then obtain marginal estimates of association for any single variable by simply including the others as covariates in the association model at the pool level. Such an approach may be sufficient for the adjustment of potential confounding due to population stratification, by constructing case and control pools where each pool consists of individuals with similar values for key ancestry components. Sequencing studies have the potential for substantial impact in the discovery and analysis of rare variants. Due to sparse data, most analysis approaches for rare variants have relied on a burden test, where the number of rare variants for each individual is summed and the resulting overall risk index is tested for association (Madsen & Browning, 2009). In its current form, our approach focuses only on testing the marginal association of each variant. However, one could potentially extend our approach to include a test of the difference between the scaled sums of the variant frequencies in case pools vs. control pools. Further research is certainly warranted in this area.
For many studies, there simply is no substitute for individual-level sequencing, as it offers flexibility to test multiple hypotheses (e.g., tests of interaction) or allows for association testing when individuals are correlated, such as in family studies. As sequencing costs continue to drop, the ability to obtain individual-level data may be too enticing to warrant pooled approaches. However, if the goals are discovery or association testing, pooled designs will continue to be a viable option. In addition, there is a rich literature on various two-stage options in which one can potentially use pooling as the first stage to accomplish both discovery of novel variants and ranking of those variants via association tests for second-stage follow-up via genotyping (Y. Zhao & Wang, 2009). New technologies may also facilitate more advanced designs, such as pools of individuals within bar-coded pools all sequenced in a single run (Smith et al., 2010).
Chapter 3 Bayesian Hierarchical Lasso

3.1 Introduction
In genome-wide association studies (GWAS) of common disease, the most widely used analysis tool has been a single-point, one degree of freedom test of association, such as the Cochran-Armitage test (McCarthy et al., 2008). However, testing one SNP at a time does not utilize the full potential of GWAS to identify multiple causal SNPs, which is usually the case for common diseases. Also, by testing multiple SNPs simultaneously, a falsely reported SNP would be weakened or excluded if the true causal SNP with a stronger signal is included in the model. It is not feasible to include all the SNPs at the same time, which would result in an overparameterized model; thus it is necessary to consider some variable selection method.
The lasso (least absolute shrinkage and selection operator) utilizes an ℓ1-norm penalized regression, and thus shrinks some coefficients and sets others to exactly zero (Tibshirani, 1996). It has been used in many statistical areas due to its features both in subset selection and in parameter shrinkage. For example, Wu et al implemented a cyclic coordinate descent algorithm to find the lasso estimates in a logistic regression, and applied their algorithm to a GWAS dataset (Wu & Lange, 2008) (Wu, Chen, Hastie, Sobel, & Lange, 2009).
From a Bayesian point of view, the estimates from penalized regression are equivalent to the maximum a posteriori estimates of a Bayesian model with specific priors on the model parameters. For example, a ridge regression (with an ℓ2-norm penalty) corresponds to a Bayesian model with a Gaussian prior distribution, whereas a lasso regression (with an ℓ1-norm penalty) corresponds to a Bayesian model with a Laplace prior distribution (Tibshirani, 1996).
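This equivalence can be checked numerically in a toy setting (our illustration, not part of the original text): for a single observation y ~ N(β, 1), minimizing the lasso objective (y − β)²/2 + λ|β| is the same as maximizing the posterior under a Laplace(λ) prior, and both give the soft-thresholding rule.

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form lasso solution for a single Gaussian observation."""
    return np.sign(y) * max(abs(y) - lam, 0.0)

def neg_log_posterior(beta, y, lam):
    """-log N(y | beta, 1) - log Laplace(beta | lam), up to constants."""
    return 0.5 * (y - beta) ** 2 + lam * abs(beta)

# Brute-force the MAP on a fine grid and compare with soft thresholding.
y, lam = 1.3, 0.5
grid = np.linspace(-3, 3, 120001)
beta_map = grid[np.argmin([neg_log_posterior(b, y, lam) for b in grid])]
print(soft_threshold(y, lam), round(float(beta_map), 3))
```

Both values agree (here 1.3 − 0.5 = 0.8), illustrating that the ℓ1 penalty and the Laplace log-prior are the same function of β.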
Genkin et al investigated a Bayesian approach to logistic regression using Gaussian or Laplace prior distributions. They also implemented a cyclic coordinate descent algorithm to find the maximum a posteriori estimates, and applied their method to text categorization (Genkin, Lewis, & Madigan, 2007). Along similar lines, Hoggart et al developed a Bayesian model with a modified prior distribution, a Normal Exponential Gamma (NEG) distribution, as a generalization of the Laplace distribution, and applied their method to a real genome-wide dataset (Hoggart, Whittaker, De Iorio, & Balding, 2008). Yuan and Lin also proposed an empirical Bayes model closely related to the lasso regression (Yuan & Lin, 2005a).

In all the studies mentioned above, the shrinkage parameter in the lasso regression, as well as the hyper-parameter in the Bayesian approaches, is estimated by cross-validation. Since there is only one shrinkage parameter for all variables, this is computationally feasible; however, all the variables are shrunk by the same amount. In a GWAS dataset, the variables are the SNPs across the genome, so we have plenty of external information on the variables. For example, for pathway information we could extract information from the Gene Ontology (Ashburner et al., 2000), PANTHER (P. D. Thomas, 2003), and other databases. In our study, we propose a Bayesian hierarchical approach to differentially shrink SNPs by using such external information. In the proposed Bayesian model, the model parameters have different prior distributions, whose hyper-parameters are informed by external information. A few algorithms are also proposed to estimate the hyper-parameters and model parameters.
3.2 Iterative approaches to estimate the Bayesian hierarchical lasso

3.2.1 Method

3.2.1.A Bayesian hierarchical lasso
In a genetic association study with case-control data, let D = {(y_1, x_1), (y_2, x_2), …, (y_n, x_n)}, where y_i = ±1 indicates the case-control status of the i-th subject and x_i is a vector consisting of the genetic variables for the i-th subject. We use a standard logistic regression model on the observed case-control data:

\mathrm{logit}\,\Pr(Y_i = 1 \mid \beta_0, \boldsymbol{\beta}, X_i) = \beta_0 + X_i\boldsymbol{\beta} = \beta_0 + \sum_{j=1}^{P} \beta_j x_{ij}    (3.1)
Our model features variable selection informed by pathway information. To achieve this, we build a hierarchical model with a Laplace (double exponential) prior distribution for \beta_j. Thus the maximum a posteriori estimate of \beta_j tends to be 0 unless the data show enough evidence that \beta_j is not 0. We write \beta_0 separately because we do not want to penalize the intercept term in the model.

\beta_j \sim \mathrm{Laplace}(\lambda_j), \quad \text{i.e.,} \quad \Pr(\beta_j \mid \lambda_j) = \frac{\lambda_j}{2}\exp(-\lambda_j|\beta_j|)    (3.2)

The Laplace distribution of \beta_j has mean 0 and variance 2/\lambda_j^2.
We further model the hyper-parameter \lambda_j based on external bio-feature information Z_j = {z_{j1}, z_{j2}, …, z_{jV}}:

\lambda_j = \exp(-Z_j\boldsymbol{\pi}) = \exp\left(-\sum_{v=1}^{V}\pi_v z_{jv}\right)    (3.3)

Here we used an exponential link to ensure that each \lambda_j is positive; \log(\lambda_j) is inversely linked with Z_j\boldsymbol{\pi}, so that a larger value of Z_j\boldsymbol{\pi} is reflected by a smaller penalty term \lambda_j. We hope that the use of external information helps estimate the hyper-parameters \lambda_j, so that we obtain better variable selection results across all the \beta_j's.
3.2.1.B Iterative algorithms to find the maximum a posteriori estimate
We proposed to iteratively update the model parameters (\beta_0, \boldsymbol{\beta}) = (\beta_0, \beta_1, …, \beta_P) and the hyper-parameters \boldsymbol{\pi} = (\pi_1, …, \pi_V):
1) Assign an initial value to the hyper-parameters \boldsymbol{\pi}, and thus the initial values of the \lambda_j's.
2) Update the model parameters (\beta_0, \boldsymbol{\beta}): find the MAP (maximum a posteriori) estimate of (\beta_0, \boldsymbol{\beta}) based on the current values of the hyper-parameters \boldsymbol{\pi}.
3) Update the hyper-parameters \boldsymbol{\pi}: use Newton's method to find the MAP of \boldsymbol{\pi} conditional on the current values of the model parameters (\beta_0, \boldsymbol{\beta}).
4) Repeat steps 2) and 3) until convergence.
The details of the proposed algorithm are explained in the following part.
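At a high level, the iteration can be sketched as follows (a Python sketch; `map_beta` and `map_pi` stand in for the coordinate-descent and Newton steps detailed in this section, and the toy closures in the example exist only to make the loop runnable):

```python
import numpy as np

def iterate_map(Z, init_pi, map_beta, map_pi, max_iter=100, tol=1e-6):
    """Alternate MAP updates of beta given pi and pi given beta."""
    pi = np.asarray(init_pi, dtype=float)
    for _ in range(max_iter):
        lam = np.exp(-Z @ pi)          # hyper-parameters lambda_j, eq. (3.3)
        beta = map_beta(lam)           # step 2: MAP of beta given pi
        new_pi = map_pi(beta)          # step 3: MAP of pi given beta
        if np.max(np.abs(new_pi - pi)) < tol:
            return new_pi, beta
        pi = new_pi
    return pi, beta

# Toy stand-ins (NOT the real solvers): chosen so that any pi is a fixed
# point, which lets the loop terminate immediately for demonstration.
Z = np.ones((5, 1))
toy_map_beta = lambda lam: 1.0 / lam
toy_map_pi = lambda beta: np.array([np.log(np.mean(np.abs(beta)))])
pi_hat, beta_hat = iterate_map(Z, [0.5], toy_map_beta, toy_map_pi)
```

The real `map_beta` is the cyclic coordinate descent of Section 3.2.1.B, and `map_pi` the Newton update on the conditional posterior of π.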
43
It
is
quite
possible
that
different
initial
values
of
𝝅
matrix
will
lead
to
different
local
maxima
of
our
penalized
likelihood
function.
We
implement
multiple
runs
based
on
different
initial
values
of
the 𝝅
matrix
and
pick
the
highest
of
the
modes
identified.
Given the current values of the hyper-parameters \boldsymbol{\pi}, we adopted the cyclic coordinate descent algorithm to find the MAP estimates of the model parameters (\beta_0, \boldsymbol{\beta}). The conditional distribution \Pr(\boldsymbol{\beta} \mid D, \boldsymbol{\pi}) is proportional to the likelihood times the prior distribution:

\Pr(\boldsymbol{\beta} \mid D, \boldsymbol{\pi}) \propto \prod_{i=1}^{n} \frac{1}{1+\exp(-y_i(\beta_0 + X_i\boldsymbol{\beta}))} \Pr(\boldsymbol{\beta} \mid \boldsymbol{\pi})

For convenience, we work with the log posterior:

\log\Pr(\boldsymbol{\beta} \mid D, \boldsymbol{\pi}) = -\sum_{i=1}^{n}\ln(1+\exp(-y_i(\beta_0 + X_i\boldsymbol{\beta}))) + \ln p(\boldsymbol{\beta}) + C

where C is a constant. With the Laplace prior distribution of \boldsymbol{\beta}, we have:

\log\Pr(\boldsymbol{\beta} \mid D, \boldsymbol{\pi}) = -\sum_{i=1}^{n}\ln(1+\exp(-y_i(\beta_0 + X_i\boldsymbol{\beta}))) + \sum_{j=1}^{P}(\ln\lambda_j - \ln 2 - \lambda_j|\beta_j|)    (3.4)
To optimize the maximum a posteriori estimate over \boldsymbol{\beta} = (\beta_1, \beta_2, …, \beta_P)', we used the cyclic coordinate descent algorithm, which is easy to implement and computationally fast. Cyclic coordinate descent optimizes each variable in turn, holding all other variables constant; thus each step is a one-dimensional optimization problem.
We illustrate the application of the cyclic coordinate descent algorithm in the following. To optimize \beta_j while holding all other variables constant, we write the components of relevance in the log posterior density as:

g(z) = \sum_{i=1}^{n}\ln(1+\exp(-y_i(\beta_0 + X_i\boldsymbol{\beta} + (z-\beta_j)x_{ij}))) + \lambda_j|z|    (3.5)

Thus the \beta_j^{new} that optimizes the posterior distribution is the z value that minimizes the function g(z) above. There is no closed-form solution to this univariate optimization problem, but Newton's method can be applied to calculate \beta_j^{new}:

\beta_j^{new} = \arg\min_z g(z) = \beta_j + \Delta\beta_j^t, \quad \text{where the tentative update is } \Delta\beta_j^t = -\frac{g'(\beta_j)}{g''(\beta_j)}    (3.6)
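For the logistic log-likelihood, g'(z) and g''(z) have simple forms, so the tentative Newton update of eq. (3.6) can be sketched as follows (our sketch; it handles only the β_j ≠ 0 case and omits the trust-region, sign-change, and two-direction safeguards that the text describes):

```python
import numpy as np

def newton_step(beta0, beta, X, y, j, lam_j):
    """Tentative coordinate update for beta_j, eq. (3.6), beta_j != 0.

    With r_i = y_i * (beta0 + X_i beta) and p_i = 1/(1+exp(r_i)):
      g'(beta_j)  = -sum_i y_i x_ij p_i + lam_j * sign(beta_j)
      g''(beta_j) =  sum_i x_ij^2 p_i (1 - p_i)
    (at beta_j = 0 the penalty subgradient needs the two-sided rule instead).
    """
    eta = beta0 + X @ beta
    r = y * eta                                 # margins, y_i = +/-1
    p = 1.0 / (1.0 + np.exp(r))
    g1 = -np.sum(y * X[:, j] * p) + lam_j * np.sign(beta[j])
    g2 = np.sum(X[:, j] ** 2 * p * (1.0 - p))
    return -g1 / g2

# Tiny deterministic example: one binary feature, 3 of 4 subjects positive,
# so the step should push beta_j upward from 0.5 toward its optimum.
X = np.array([[1.0], [1.0], [1.0], [1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
step = newton_step(0.0, np.array([0.5]), X, y, 0, 0.1)
```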
In the optimization process tailored to our model with the Laplace prior distribution, we adopted a few modifications following (Genkin et al., 2007).
1) To avoid large updating steps, we specified a trust region \Delta_j that |\Delta\beta_j| cannot exceed on a single iteration. After applying the trust-region restriction, the actual update of \beta_j is:

\Delta\beta_j = \begin{cases} -\Delta_j & \text{if } \Delta\beta_j^t < -\Delta_j \\ \Delta\beta_j^t & \text{if } |\Delta\beta_j^t| \le \Delta_j \\ \Delta_j & \text{if } \Delta\beta_j^t > \Delta_j \end{cases}    (3.7)

The length of the trust region \Delta_j is adaptively updated at each iteration:

\Delta_j^{new} = \max(2|\Delta\beta_j|, \Delta_j/2)    (3.8)
2) The tentative update is invalid if it would change the sign of \beta_j. If this happens, we simply set \beta_j^{new} to be 0.
3) If \beta_j = 0, we try to update in both directions and see if either direction succeeds. Note that at most one direction can succeed, due to the convexity of the function g(z).
4) To restrain the model size, we applied the generalized Bayesian information criterion in our updating (J. Chen & Chen, 2008). The generalized Bayesian information criterion is:

BIC_\gamma = \sum_{i=1}^{I}\ln(1+\exp(-y_i(\beta_0 + X_i\boldsymbol{\beta}))) + \frac{1}{2}p_{in}\ln(I) + \gamma\ln\binom{P}{p_{in}}    (3.9)

where P is the number of all genetic covariates (\beta's) in the data, p_{in} is the total number of non-zero \beta's in the model, and I is the total number of subjects. In each update of a specific \beta_j, we accept \Delta\beta_j only if the generalized Bayesian information criterion does not increase its value.
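A direct transcription of this criterion (a sketch; the function and variable names are ours) is:

```python
import math
import numpy as np

def ebic(beta0, beta, X, y, gamma):
    """Generalized BIC of eq. (3.9): logistic negative log-likelihood
    plus 0.5 * p_in * ln(I) plus gamma * ln C(P, p_in)."""
    eta = beta0 + X @ beta
    nll = np.sum(np.log1p(np.exp(-y * eta)))    # logistic log-loss, y = +/-1
    p_in = int(np.count_nonzero(beta))          # number of selected variables
    I, P = X.shape
    return nll + 0.5 * p_in * math.log(I) + gamma * math.log(math.comb(P, p_in))

# With the null model (all beta = 0), the criterion reduces to I * ln 2.
val0 = ebic(0.0, np.zeros(10), np.zeros((100, 10)), np.ones(100), 0.5)
```

A coordinate update Δβ_j would then be accepted only if `ebic` evaluated at the updated β does not exceed its current value; larger γ penalizes dense models more heavily through the ln C(P, p_in) term.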
Given the current values of the model parameters \boldsymbol{\beta}, the conditional distribution of \boldsymbol{\pi} is:

\Pr(\boldsymbol{\pi} \mid \boldsymbol{\beta}, Z) \propto \prod_{j=1}^{P}\frac{\lambda_j}{2}\exp(-\lambda_j|\beta_j|), \quad \text{where } \lambda_j = \exp(-Z_j\boldsymbol{\pi}) = \exp\left(-\sum_{v=1}^{V}\pi_v z_{jv}\right)    (3.10)

Thus the log posterior distribution of \boldsymbol{\pi} is:

\log\Pr(\boldsymbol{\pi} \mid \boldsymbol{\beta}, Z) = -P\ln 2 - \sum_{j=1}^{P}\left(Z_j\boldsymbol{\pi} + \exp(-Z_j\boldsymbol{\pi})|\beta_j|\right)    (3.11)

Then the score function for \boldsymbol{\pi}, and specifically its k-th component, is:

u(\pi_k \mid \boldsymbol{\beta}, Z) = \frac{\partial\log\Pr(\boldsymbol{\pi} \mid \boldsymbol{\beta}, Z)}{\partial\pi_k} = -\sum_{j=1}^{P}\left(Z_{jk} - Z_{jk}|\beta_j|\exp(-Z_j\boldsymbol{\pi})\right)    (3.12)

And the Fisher information for \boldsymbol{\pi} (k-th row and r-th column):

info(\pi_{kr} \mid \boldsymbol{\beta}, Z) = -\frac{\partial^2\log\Pr(\boldsymbol{\pi} \mid \boldsymbol{\beta}, Z)}{\partial\pi_k\partial\pi_r} = \sum_{j=1}^{P} Z_{jk} Z_{jr} |\beta_j|\exp(-Z_j\boldsymbol{\pi})    (3.13)
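The score (3.12) and information (3.13) vectorize naturally over all components of π at once; a sketch (the small Z, β, and π values in the example are arbitrary, chosen only for a finite-difference check) is:

```python
import numpy as np

def score_info_pi(pi, beta, Z):
    """Score vector (3.12) and Fisher information matrix (3.13) for pi."""
    lam = np.exp(-Z @ pi)              # lambda_j = exp(-Z_j pi)
    w = np.abs(beta) * lam             # |beta_j| * exp(-Z_j pi)
    score = -Z.T @ (1.0 - w)           # eq. (3.12), all k at once
    info = (Z * w[:, None]).T @ Z      # eq. (3.13): sum_j w_j Z_jk Z_jr
    return score, info

Z = np.array([[1.0, 0.5], [-0.3, 1.0], [0.2, -0.7]])
beta = np.array([0.5, -1.0, 0.25])
pi = np.array([0.1, -0.2])
score, info = score_info_pi(pi, beta, Z)
```

Note that `info` is symmetric and positive semi-definite whenever the β's are not all zero, which is what makes the Newton step below well defined.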
In Newton's method, the tentative update for \pi_k is:

\Delta\pi_k^t = \frac{u(\pi_k \mid \boldsymbol{\beta}, Z)}{info(\pi_{kk} \mid \boldsymbol{\beta}, Z)}    (3.14)

To avoid large updating steps, we also specified a trust region \Delta\Pi_k that |\Delta\pi_k| cannot exceed on a single iteration. After applying the trust-region restriction, the actual update of \pi_k is:

\Delta\pi_k = \begin{cases} -\Delta\Pi_k & \text{if } \Delta\pi_k^t < -\Delta\Pi_k \\ \Delta\pi_k^t & \text{if } |\Delta\pi_k^t| \le \Delta\Pi_k \\ \Delta\Pi_k & \text{if } \Delta\pi_k^t > \Delta\Pi_k \end{cases}    (3.15)

And we updated the trust region \Delta\Pi_k adaptively, too:

\Delta\Pi_k^{new} = \max(2|\Delta\pi_k|, \Delta\Pi_k/2)    (3.16)
3.2.2 Simulation studies

3.2.2.A Design of simulation studies
In our simulation study, we first simulated external information with P items and V bio-features as a (P×V) matrix Z, distributed as i.i.d. Normal(0, 1). Then we specified the values of the hyper-parameters \boldsymbol{\pi} (V×1). The hyper-parameters of the Laplace distribution, \boldsymbol{\lambda} (P×1), are then calculated as in equation (3.3):

\lambda_j = \exp(-Z_j\boldsymbol{\pi}), for j = 1, 2, …, P

The model parameters \boldsymbol{\beta} (P×1) are simulated given the values of \boldsymbol{\lambda} (P×1):

\beta_j \mid \lambda_j \sim \mathrm{Laplace}(0, \lambda_j), for j = 1, 2, …, P

The genetic variables X (I×P) are assumed to be binary:

X_{ij} \sim_{i.i.d.} \mathrm{Bernoulli}(p_j), for i = 1, 2, …, I; j = 1, 2, …, P

And then the outcome variable Y (I×1) is simulated as:

Y_i \sim \mathrm{Bernoulli}(\Pr(Y_i = 1)), for i = 1, 2, …, I

where \mathrm{logit}\,\Pr(Y_i = 1) = (X_i - p_g)\boldsymbol{\beta} = \sum_{j=1}^{P}(X_{ij} - p_j)\beta_j, for i = 1, 2, …, I.

The X matrix is centered at its mean, so we will have roughly the same number of cases and controls in our simulated data.
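The generative scheme above can be sketched as follows (our sketch; note that numpy parameterizes the Laplace distribution by its scale, the inverse of the rate λ_j, and for simplicity the allele frequencies p_j are taken as a single common constant here):

```python
import numpy as np

def simulate(I, P, V, pi, p_freq=0.3, seed=1):
    """Simulate Z, lambda, beta, X, Y from the hierarchical design above."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((P, V))               # external bio-features
    lam = np.exp(-Z @ pi)                         # lambda_j = exp(-Z_j pi)
    beta = rng.laplace(loc=0.0, scale=1.0 / lam)  # Laplace rate lam -> scale 1/lam
    X = rng.binomial(1, p_freq, size=(I, P)).astype(float)
    eta = (X - p_freq) @ beta                     # centered linear predictor
    y01 = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return Z, lam, beta, X, 2 * y01 - 1           # y coded +/-1 as in the model

Z, lam, beta, X, y = simulate(200, 50, 2, np.array([-1.0, 1.0]))
```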
3.2.2.B Results of global null simulations
In the global null simulation studies, we simulate \beta_1 = \beta_2 = ⋯ = \beta_P = 0. We tested a scenario in which the sample size is relatively large, with 1000 individuals (I) and 100 SNPs (P). Based on 100 replications, we summarize the results in Table 3.1. When we applied the generalized Bayesian information penalty (with weight γ) to control the model size, it was possible that no variable was selected. In such cases \boldsymbol{\beta} = 0 and we failed to estimate the hyper-parameters \boldsymbol{\pi} (this happens more frequently when γ is relatively large).
Table 3.1 Null Simulation Results

Table 3.1.A. Null simulation (all model parameters are 0, I = 1000, P = 100**)

γ    | Average number of selected variables | MSE* (selected variables) | Estimated π value | Replicates in which no variable is selected (out of 100)
0    | 57.06 | 0.0029 | -3.157 | 0
0.1  | 75.02 | 0.0026 | -3.264 | 0
0.3  | 6.22  | 0.0179 | -2.061 | 50
0.5  | 2.52  | 0.0522 | -1.486 | 35
1    | 0.07  | 0.1281 | -1.099 | 94

*MSE is a summary of the sum of the variance and squared bias from the simulation study. It is calculated as the square root of the average squared deviation of the estimated value from the true value.
** I is the number of individuals (observations); P is the number of parameters (variables).

Table 3.1.B. Null simulation (all model parameters are 0, I = 1000, P = 1000)

γ    | Average number of selected variables | MSE (selected variables) | Estimated π value | Replicates in which no variable is selected (out of 100)
0    | 488.13 | 0.0031 | -3.114 | 0
0.1  | 688.49 | 0.0027 | -3.239 | 0
0.3  | 44.37  | 0.0394 | -1.920 | 21
0.5  | 4.28   | 0.0904 | -1.236 | 45
1    | 0.05   | 0.1589 | -0.943 | 95

Table 3.1.C. Null simulation (all model parameters are 0, I = 100, P = 1000)

γ    | Average number of selected variables | MSE (selected variables) | Estimated π value | Replicates in which no variable is selected (out of 100)
0    | 266.80 | 0.0115 | -2.506 | 0
0.1  | 15.76  | 0.0810 | -1.624 | 2
0.3  | 16.76  | 0.5196 | -0.411 | 1
0.5  | 2.22   | 1.5039 | 0.074  | 28
1    | 0.04   | 1.000  | 0.000  | 96
3.2.2.C Results of main simulation studies

Simulation study in which π includes an intercept term and 1 covariate term

In this simulation study, the hyper-parameter π = (π_0, π_1) = (-1, 1). The empirical results based on 100 replicates are shown in Table 3.2 below.
Table 3.2 Simulation Results with 1 covariate

Table 3.2.A. 1-covariate simulation ((π_0, π_1) = (-1, 1), I = 1000, P = 100)

γ    | Average number of selected variables | Effect* (not selected) | Effect (selected) | MSE (not selected) | MSE (selected) | Estimated π_0 | Estimated π_1 | Replicates in which 2 or fewer variables are selected (out of 100)
0    | 70.11  | 0.1859 | 0.8111 | 0.1146 | 0.7904 | -1.676 | 1.152 | 0
0.1  | 88.35  | 0.1795 | 0.6829 | 0.1072 | 0.7344 | -1.885 | 1.105 | 0
0.3  | <84.59 | 0.2131 | 0.6947 | 0.1561 | 0.6014 | -1.806 | 1.121 | 1
0.5  | <80.09 | 0.1929 | 0.7216 | 0.1216 | 0.4574 | -1.680 | 1.099 | 2
1    | <69.40 | 0.2152 | 0.7984 | 0.1603 | 0.7149 | -1.560 | 1.151 | 2

*Effect is the average absolute simulated value of the selected (or unselected) variables: Effect = E[|β| | β̂ ≠ 0] for selected variables and Effect = E[|β| | β̂ = 0] for unselected variables.

Table 3.2.B. 1-covariate simulation ((π_0, π_1) = (-1, 1), I = 1000, P = 1000)

γ    | Average number of selected variables | Effect (not selected) | Effect (selected) | MSE (not selected) | MSE (selected) | Estimated π_0 | Estimated π_1 | Replicates in which 2 or fewer variables are selected (out of 100)
0    | 441.88 | 0.3667 | 0.9070 | 0.3570 | 3.5983  | -3.021 | 0.087 | 0
0.1  | 660.95 | 0.3606 | 0.7311 | 0.3554 | 2.5674  | -3.193 | 0.050 | 0
0.3  | 130.04 | 0.3699 | 2.1816 | 0.3691 | 8.7624  | -1.642 | 0.601 | 0
0.5  | 97.98  | 0.4153 | 2.3592 | 0.4900 | 10.7008 | -1.203 | 0.531 | 0
1    | 26.25  | 0.4998 | 4.5258 | 0.8184 | 33.9853 | -0.839 | 0.414 | 0

Table 3.2.C. 1-covariate simulation ((π_0, π_1) = (-1, 1), I = 100, P = 1000)

γ    | Average number of selected variables | Effect (not selected) | Effect (selected) | MSE (not selected) | MSE (selected) | Estimated π_0 | Estimated π_1 | Replicates in which 2 or fewer variables are selected (out of 100)
0    | 257.34 | 0.5250 | 0.8536  | 1.0442 | 5.1973  | -2.421 | 0.009 | 0
0.1  | <17.74 | 0.5618 | 3.2159  | 1.3026 | 41.2613 | -1.285 | 0.259 | 2
0.3  | <15.75 | 0.5765 | 2.6284  | 1.4356 | 34.1257 | -0.406 | 0.212 | 1
0.5  | <4.84  | 0.5876 | 5.0579  | 1.5522 | 97.3689 | 0.078  | 0.171 | 18
1    | <<2.00 | 0.5768 | 16.8679 | 1.5109 | 504.28  | -0.048 | 0.018 | 93
Simulation study in which π includes 2 or more covariates

In this simulation study, the hyper-parameter π = (π_0, π_1, π_2, …, π_5) = (-1, 1, -1, 0, 0, 0.2). The empirical results based on 100 replicates are shown in Table 3.3 below.
Table 3.3 Simulation Results with 5 covariates

Table 3.3.A. 5-covariate simulation ((π_0, π_1, π_2, …, π_5) = (-1, 1, -1, 0, 0, 0.2); I = 1000, P = 1000; based on 100 replicates)

γ    | Number of selected variables | Effect (not selected) | Effect (selected) | MSE (not selected) | MSE (selected)
0    | 410.06 | 0.53 | 1.68  | 1.16 | 27.45
0.1  | 649.18 | 0.53 | 1.26  | 1.28 | 17.95
0.3  | 85.89  | 0.53 | 6.02  | 1.21 | 110.73
0.5  | 66.72  | 0.61 | 6.52  | 1.61 | 129.56
1    | 26.60  | 0.71 | 11.44 | 2.50 | 293.91

γ    | Estimated π_0 | π_1 | π_2 | π_3 | π_4 | π_5 | Replicates in which 6 or fewer variables are selected
0    | -2.79 | 0.20 | -0.20 | 0.01  | 0.01  | 0.01 | 0
0.1  | -3.04 | 0.14 | -0.15 | 0.01  | 0.00  | 0.01 | 0
0.3  | -1.63 | 0.57 | -0.59 | 0     | 0     | 0.11 | 0
0.5  | -1.08 | 0.61 | -0.47 | -0.01 | 0.00  | 0.09 | 2
1    | -0.86 | 0.38 | -0.40 | 0.04  | -0.00 | 0.08 | 6

Table 3.3.B. 5-covariate simulation ((π_0, π_1, π_2, …, π_5) = (-1, 1, -1, 0, 0, 0.2); I = 100, P = 1000; based on 100 replicates)

γ    | Number of selected variables | Effect (not selected) | Effect (selected) | MSE (not selected) | MSE (selected)
0    | 277.02 | 0.79 | 1.57  | 4.07 | 48.95
0.1  | <14.33 | 0.83 | 12.14 | 4.63 | 558.80
0.3  | <11.48 | 0.88 | 10.76 | 5.47 | 642.54
0.5  | 6.67   | 0.91 | 19.90 | 6.47 | 2028.50

γ    | Estimated π_0 | π_1 | π_2 | π_3 | π_4 | π_5 | Replicates in which 6 or fewer variables are selected
0    | -2.40 | 0.04 | -0.03 | -0.00 | 0.01  | -0.01 | 0
0.1  | -1.88 | 0.49 | -0.48 | 0.03  | 0.03  | 0.04  | 10
0.3  | -0.29 | 0.11 | -0.17 | -0.03 | -0.10 | 0.15  | 8
0.5  | 0.14  | 0.28 | -0.16 | 0.40  | -0.07 | -0.9  | 85
Problems of the iterative approach:
1) We do not have a well-justified reason to use the generalized Bayesian information criterion: rather than using it for model selection, we use it to control the model size.
2) Sometimes the model fails to converge, especially when all the β's are shrunk to zero.
3) The approach cannot handle situations where I < P:
   a. It is prone to fail in updating the values of π.
   b. Even when it does not fail, the estimates of π are poor.
3.3 Marginal MAP estimates of hyper-parameters

To solve the problem of non-convergence in updating the hyper-parameters \boldsymbol{\pi} when all or most of the model parameters \boldsymbol{\beta} are shrunk to zero, we propose to first find a marginal estimate of \boldsymbol{\pi} (e.g., the marginal maximum a posteriori estimate), and then use the estimated \boldsymbol{\pi} values to estimate the model parameters \boldsymbol{\beta}.
3.3.1 Method

3.3.1.A Stepwise algorithm to find the maximum a posteriori estimates

1) Find the marginal maximum likelihood estimates of the hyper-parameters \boldsymbol{\pi}. We assume that the model parameters \boldsymbol{\beta} follow Laplace distributions with parameters \boldsymbol{\lambda}. We find the empirical Bayes estimator of the hyper-parameters \boldsymbol{\pi}:

\hat{\boldsymbol{\pi}} = \arg\max_{\pi}\Pr(\boldsymbol{\pi} \mid Y, G, Z) \approx \arg\max_{\pi}\Pr(Y \mid G, Z, \boldsymbol{\pi})

2) Given the marginal MAP estimates of \boldsymbol{\pi}, we use the cyclic coordinate descent algorithm to find the MAP estimates of the model parameters \boldsymbol{\beta}.
3.3.1.B Algorithm to find the marginal MLE of the hyper-parameters π

We write the marginal likelihood of \boldsymbol{\pi} as:

L(\boldsymbol{\pi}) = \Pr(Y \mid G, Z, \boldsymbol{\pi}) = \int \Pr(Y \mid G, \boldsymbol{\beta})\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})\,d\boldsymbol{\beta}    (3.17)

The marginal likelihood of \boldsymbol{\pi} is closely related to the posterior distribution of the model parameters \boldsymbol{\beta}:

\Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi}) = \frac{\Pr(\boldsymbol{\beta}, Y, G, Z, \boldsymbol{\pi})}{\Pr(Y, G, Z, \boldsymbol{\pi})} = \frac{\Pr(Y \mid G, \boldsymbol{\beta})\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})}{\Pr(Y \mid G, Z, \boldsymbol{\pi})} = \frac{\Pr(Y \mid G, \boldsymbol{\beta})\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})}{L(\boldsymbol{\pi})}    (3.18)

Thus the score function with respect to \boldsymbol{\pi} is:

U(\boldsymbol{\pi}) = \frac{d\ln L(\boldsymbol{\pi})}{d\boldsymbol{\pi}} = \frac{1}{L(\boldsymbol{\pi})}\int \Pr(Y \mid G, \boldsymbol{\beta})\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})\frac{d\ln\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})}{d\boldsymbol{\pi}}\,d\boldsymbol{\beta} = \int \Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi})\,u_{\pi}(\boldsymbol{\beta})\,d\boldsymbol{\beta}

where u_{\pi}(\boldsymbol{\beta}) = \frac{d\ln\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})}{d\boldsymbol{\pi}}    (3.19)
Also, the Fisher information with respect to \boldsymbol{\pi}:

Info(\boldsymbol{\pi}) = -\frac{dU(\boldsymbol{\pi})}{d\boldsymbol{\pi}} = \left[\int \Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi})\,u_{\pi}(\boldsymbol{\beta})\,d\boldsymbol{\beta}\right]^{\otimes 2} - \int \Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi})\,u_{\pi}(\boldsymbol{\beta})\otimes u_{\pi}(\boldsymbol{\beta})\,d\boldsymbol{\beta} + \int \Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi})\,info_{\pi}(\boldsymbol{\beta})\,d\boldsymbol{\beta}

where u_{\pi}(\boldsymbol{\beta}) = \frac{d\ln\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})}{d\boldsymbol{\pi}} and info_{\pi}(\boldsymbol{\beta}) = -\frac{d^2\ln\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})}{d\boldsymbol{\pi}^2}    (3.20)

Based on the score and information functions, we can use Newton's method to numerically optimize the marginal likelihood function L(\boldsymbol{\pi}). To achieve that, we need to know the posterior density of \boldsymbol{\beta}, \Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi}), which we propose to approximate by MCMC sampling.
3.3.1.C Approximating the posterior distribution of the model parameters β: MCMC sampling procedure

We used adaptive MCMC sampling to retain a good acceptance rate: the proposal step size is set adaptively. If the acceptance rate is low, the step size is decreased; otherwise, it is increased.
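A minimal version of such an adaptive random-walk Metropolis sampler (our illustration on a one-dimensional standard normal target; the tuning thresholds are arbitrary, and in practice adaptation should diminish over time to preserve the stationary distribution) is:

```python
import numpy as np

def adaptive_metropolis(logpost, x0, n_iter=5000, adapt_every=100, seed=0):
    """Random-walk Metropolis with acceptance-rate-based step-size tuning."""
    rng = np.random.default_rng(seed)
    x, lp = float(x0), logpost(x0)
    step, accepted, samples = 1.0, 0, []
    for t in range(1, n_iter + 1):
        prop = x + step * rng.standard_normal()
        lp_prop = logpost(prop)
        if np.log(rng.random()) < lp_prop - lp:     # Metropolis accept/reject
            x, lp, accepted = prop, lp_prop, accepted + 1
        samples.append(x)
        if t % adapt_every == 0:                    # tune the proposal scale
            rate = accepted / adapt_every
            step *= 0.5 if rate < 0.2 else (2.0 if rate > 0.6 else 1.0)
            accepted = 0
    return np.array(samples)

# Standard normal target: log density -b^2/2 up to a constant.
draws = adaptive_metropolis(lambda b: -0.5 * b * b, 3.0)
```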
We try to sample \boldsymbol{\beta} from its posterior distribution \Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi}). The real problem is that when the dimensionality of \boldsymbol{\beta} is high, and especially when the problem is under-determined (the number of parameters P is larger than the number of observations I), the MCMC sampling often does not converge. We also found that the Fisher information with respect to \boldsymbol{\pi} calculated from the posterior samples of \boldsymbol{\beta} is not always positive definite. There are two possibilities: 1) the MCMC sample of \boldsymbol{\beta} is not representative of the posterior distribution; or 2) the information matrix is truly not positive definite, which means that the marginal likelihood function with respect to \boldsymbol{\pi} is not concave and the MLE does not exist. We emphasized the first possibility by discarding iterations in which the information matrix is not positive definite, as we think this is caused by a poor sampling of \boldsymbol{\beta}.
3.3.1.D Approximating the likelihood function with a Normal distribution
Due to the computational difficulty of achieving convergence in MCMC sampling when the dimensionality is high, we tried to simplify the problem by approximating the likelihood with a Normal distribution. Assuming every component of \boldsymbol{\beta} is independent of the others, the univariate MLEs of \boldsymbol{\beta} are asymptotically normally distributed: \hat\beta_j \sim N(\beta_j, \sigma_j^2). Thus the marginal likelihood with respect to \boldsymbol{\pi}, as in equation (3.17), can be approximated as follows:

L(\boldsymbol{\pi}) = \Pr(\hat{\boldsymbol{\beta}} \mid Z, \boldsymbol{\pi}) = \int \Pr(\hat{\boldsymbol{\beta}} \mid \boldsymbol{\beta}, \boldsymbol{\sigma})\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})\,d\boldsymbol{\beta} = \prod_j \int \Pr(\hat\beta_j \mid \beta_j, \sigma_j)\Pr(\beta_j \mid Z, \boldsymbol{\pi})\,d\beta_j    (3.21)
If the hyper-parameters \boldsymbol{\pi} include only an intercept term, we write the likelihood function as

L(\lambda) = \prod_j l_j(\lambda)    (3.22)

where the piecewise likelihood corresponding to each component of the model parameter \boldsymbol{\beta} is:

l_j(\lambda) = \Pr(\hat\beta_j \mid \lambda) = \int \Pr(\hat\beta_j \mid \beta_j, \sigma_j)\Pr(\beta_j \mid \lambda)\,d\beta_j = \int \frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(\hat\beta_j-\beta_j)^2}{2\sigma_j^2}\right)\frac{\lambda}{2}\exp(-\lambda|\beta_j|)\,d\beta_j

= \frac{\lambda}{2}\exp\left(\frac{1}{2}\lambda^2\sigma_j^2\right)\left\{\frac{1}{2}\exp(\lambda\hat\beta_j)\left[1-\mathrm{erf}\left(\frac{\hat\beta_j+\lambda\sigma_j^2}{\sqrt{2}\sigma_j}\right)\right] + \frac{1}{2}\exp(-\lambda\hat\beta_j)\left[1+\mathrm{erf}\left(\frac{\hat\beta_j-\lambda\sigma_j^2}{\sqrt{2}\sigma_j}\right)\right]\right\}    (3.23)
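The closed form in (3.23) can be checked against brute-force numerical integration of the Normal-Laplace convolution (our verification sketch):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def l_j(lam, b_hat, sigma):
    """Closed form (3.23) of the Normal-Laplace convolution."""
    s2 = sigma * sigma
    t = 0.5 * exp(lam * b_hat) * (1.0 - erf((b_hat + lam * s2) / (sqrt(2.0) * sigma)))
    t += 0.5 * exp(-lam * b_hat) * (1.0 + erf((b_hat - lam * s2) / (sqrt(2.0) * sigma)))
    return 0.5 * lam * exp(0.5 * lam * lam * s2) * t

def l_j_numeric(lam, b_hat, sigma, n=200001, lim=30.0):
    """Direct quadrature of the integral defining l_j(lambda)."""
    b = np.linspace(-lim, lim, n)
    h = b[1] - b[0]
    integrand = (np.exp(-(b_hat - b) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)
                 * 0.5 * lam * np.exp(-lam * np.abs(b)))
    return float(np.sum(integrand) * h)

closed = l_j(1.0, 0.5, 1.0)
numeric = l_j_numeric(1.0, 0.5, 1.0)
```

The two values agree to numerical precision; plotting `l_j` over a grid of λ for small |β̂_j|/σ_j reproduces the non-concavity discussed next.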
This is an analytic form of the marginal likelihood with respect to the hyper-parameter \lambda. Based on this approximate form of the likelihood, we could easily implement Newton's method to find the marginal MLE of \lambda (corresponding to the intercept term of \boldsymbol{\pi}, \pi_0). However, we found that the likelihood function is not always concave, in which case the MLE does not exist. Since the likelihood is the product of all the piecewise likelihoods, let us focus on a single piecewise likelihood l_j(\lambda), as shown in equation (3.23).
To illustrate this, in Fig 3.1 below we show the piecewise likelihood function and its relationship with the values of \hat\beta_j and \sigma_j. With a fixed value of \sigma_j, l_j(\lambda) is non-concave if |\hat\beta_j| is small; and with a fixed value of \hat\beta_j, l_j(\lambda) is non-concave if \sigma_j is large. Thus we would suspect that l_j(\lambda) is concave only in a region where |\hat\beta_j/\sigma_j| is relatively large.
Figure 3.1 The piecewise likelihood function

Fig 3.1.A. The piecewise likelihood function l_j(λ), with σ_j² = 1 fixed and β̂_j² = 0.1, 0.5, 1, 2, 4.
Fig 3.1.B. The piecewise likelihood function l_j(λ), with β̂_j² = 1 fixed and σ_j² = 0.25, 0.5, 1, 2, 3.

We have also tried to simulate the posterior distribution for the same set-up in WinBUGS, and the results are consistent with what we found here. When the effect size is relatively small, the marginal likelihood function is non-concave, and the MLE does not exist.
3.3.2 Simulation results of the MCMC procedures (3.3.1.A-3.3.1.C)
When sampling the posterior distribution of \boldsymbol{\beta}, we implemented the Metropolis algorithm with a burn-in of 12500 iterations and obtained an MCMC sample of 2500. We used the 2500 MCMC samples to approximate the posterior distribution of \boldsymbol{\beta}, based on which we calculated the score and information, and then implemented Newton's method to update \boldsymbol{\pi}.
In this set of examples, we investigated a marginally under-determined model: the number of individuals is I = 100 and the number of parameters is P = 100. We ran 2000 iterations of Newton's method. As shown in Fig 3.2 below, the updated values converged very quickly, and the estimated value is close to the simulated true value (-1).

Figure 3.2 Iteration trace of the updated values using Newton's method (I = 100, P = 100)
In the next set of examples, we investigated situations where the number of parameters exceeds the number of observations (I = 100 and P = 500). We can see that the algorithm has obviously not converged. When we ran the MCMC sampling scheme with a longer burn-in period or a larger sample, the convergence was not improved.

Figure 3.3 Iteration trace of the updated values using Newton's method (I = 100, P = 500)
3.4 Discussion

In this project, we proposed a Bayesian hierarchical lasso model, using a prior distribution on the model parameters informed by external information, and tried two distinct approaches to obtain the maximum a posteriori estimates of the hyper-parameters. Although none of the proposed algorithms worked well, we gained some perspective from this project.
In the machine learning and data mining literature, the shrinkage parameter (e.g., the ℓ1 penalty in the lasso) is usually estimated by cross-validation (Hastie, Tibshirani, & Friedman, 2003). Another way to estimate the shrinkage parameter is the empirical Bayes method, maximizing the posterior distribution of the hyper-parameters (George & Foster, 2000). In this project, we first proposed an iterative algorithm to find the maximum a posteriori estimates of both the model parameters and the hyper-parameters. Since we did not obtain a theoretical proof that such an iterative algorithm will converge, this pilot project is based on simulation and application, without strong theoretical foundations. The algorithm iteratively maximizes two functions: \arg\max_{\beta}\Pr(\boldsymbol{\beta} \mid D, \boldsymbol{\pi}) and \arg\max_{\pi}\Pr(\boldsymbol{\pi} \mid Z, \boldsymbol{\beta}).
In the implementation, we found that the iterative algorithm did not always converge to the MAP. In a detailed analysis, we see that it easily leads to an unstable extreme point. With the Laplace prior on \Pr(\boldsymbol{\beta} \mid \boldsymbol{\pi}), \arg\max_{\beta}\Pr(\boldsymbol{\beta} \mid D, \boldsymbol{\pi}) will set most of the model parameters \boldsymbol{\beta} to exactly zero; in return, when we optimize the hyper-parameters via \arg\max_{\pi}\Pr(\boldsymbol{\pi} \mid Z, \boldsymbol{\beta}), as the model parameters \boldsymbol{\beta} get set to zero or to smaller values, the shrinkage parameters (corresponding to e^{-Z\pi}) are driven to be larger, which in turn shrinks the model parameters \boldsymbol{\beta} even more. Thus the iterative algorithm can easily end in an extreme situation where all model parameters \boldsymbol{\beta} are shrunk to zero and we cannot estimate the hyper-parameters \boldsymbol{\pi}. Heuristically, we adapted the generalized Bayesian information criterion to control the model size, but it does not work well either, especially in simulation studies in which the sample size is less than the parameter-set size.
Another possible extension could be to set a constrained prior distribution on the hyper-parameters \boldsymbol{\pi}, but it would be more computationally complicated, and we did not pursue this path.
In the second part of this project, we proposed an empirical Bayes estimator of the hyper-parameters. We proposed to maximize the marginal likelihood function with respect to the hyper-parameters \boldsymbol{\pi}: \arg\max_{\pi} L(\boldsymbol{\pi}), where L(\boldsymbol{\pi}) = \Pr(Y \mid G, Z, \boldsymbol{\pi}) = \int \Pr(Y \mid G, \boldsymbol{\beta})\Pr(\boldsymbol{\beta} \mid Z, \boldsymbol{\pi})\,d\boldsymbol{\beta}. Along this line of research, we proposed two different estimation methods. In the first method, we utilized an MCMC approach to approximate \Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi}), which is used in the optimization of L(\boldsymbol{\pi}). However, we encountered technical problems when the model parameter space of \boldsymbol{\beta} is too large, which is usually the case for a GWAS dataset. In such circumstances, the MCMC sample may not converge, and so it is a poor approximation to \Pr(\boldsymbol{\beta} \mid Y, G, Z, \boldsymbol{\pi}). In the second method, we approximated the model parameters with independent Normal distributions: \hat\beta_j \sim N(\beta_j, \sigma_j^2). Thus we can obtain a closed-form representation of the marginal likelihood L(\boldsymbol{\pi}). But a detailed analysis shows that this likelihood is not always concave, and the MLE does not always exist.
In this project, we focused on the MAP estimate of the hierarchical Bayesian model, due to computational efficiency and applicability to GWAS datasets. Park & Casella have proposed a full Bayesian lasso model and implemented a Gibbs sampler to sample the posterior distribution (Park & Casella, 2008). Chris Hans investigated some other aspects of the Bayesian treatment of lasso regression (Hans, 2009). But their methods are not computationally efficient, and they implemented only a common shrinkage parameter for all variables.
Finally, there have been some other recent generalizations of the lasso methodology. Yuan and Lin proposed the group lasso, as a combination of ℓ1-norm and ℓ2-norm penalties (Yuan & Lin, 2005b). Meier et al extended the group lasso to logistic regression (Meier & Van De Geer, 2008). These methods can be used to differentially shrink different groups of variables, but may not be as flexible as a hierarchical model on the shrinkage parameters.
Chapter 4 Bayesian Hierarchical Quantile Model to Prioritize GWAS Results

4.1 Introduction
Genome-wide association studies have been a standard method for disease gene/variant discovery in the past decade. Although many variants have been discovered, for most common diseases only a moderate amount of heritability has been explained. To discover additional variants, most studies have relied on the accumulation of additional samples to increase power. While this approach has potential, it comes with substantial logistical and monetary costs. As an alternative, biological knowledge, often publicly available, can be used to prioritize SNPs lacking genome-wide statistical significance and existing in the "lower Manhattan" of a GWAS ranking (Cantor, Lange, & Sinsheimer, 2010). Beyond simply discovering associated SNPs, incorporating biological information has several advantages. SNPs indicated by both association evidence and biological characteristics are potentially more likely to be functional SNPs, with the underlying mechanism suggested by the biological characterization. Furthermore, investigation of the association evidence for groups of SNPs from a GWAS sharing similar biological characteristics can reveal underlying systems or pathway mechanisms. Such methods are referred to as pathway analysis or Gene Set Enrichment Analysis (Wang, Li, & Hakonarson, 2010).

In the present work, we aim (1) to use biological knowledge to provide a more informative ranking of GWAS results for further identification of causal SNPs; (2) to better explain GWAS results in light of biological knowledge; and (3) to gain insight into biological mechanisms that may be driving GWAS associations.

To motivate our approach, suppose that for a particular disease there are 20 causal variants and 1000 loci tested in a GWAS. Also assume there is a biofeature with specificity and sensitivity (0.8, 0.8) with respect to the causality of the variants, as shown below in Table 4.1.
Table 4.1: 2×2 table of causal variants with respect to biofeature

                    Non-causal    Causal
    Non-Biofeature  784           4
    Biofeature      196           16
In such a setup, let us assume that when a non-causal variant is tested, the test statistic (usually the Z-statistic in a GWAS) is distributed as a standard Normal deviate, whereas the test statistic corresponding to a causal variant is distributed as a Normal with unit variance and mean 3.89. For example, in a case-control study with 8850 cases and 8850 controls, a SNP with OR = 1.2 and MAF = 0.05 would have a test statistic of approximately 3.89. Thus one would expect to observe the distribution of test statistics conditional on biofeatures shown in Figure 4.1.
Figure 4.1: An illustration of the effect of biofeature on test statistics
In this example, if we perform a two-sample t-test of the test statistics over the biofeatures (to examine the difference between the biofeature and non-biofeature group means), the difference is not significant (p-value = 0.13). However, we do observe a significant difference in the top percentile (results obtained using the rq() function in the R package quantreg, based on the 90th percentile):
Table 4.2: A quantile regression on an empirical example

                 Coefficient   95% C.I. lower bound   95% C.I. upper bound
    Intercept    2.66          2.46                   2.89
    Biofeature   1.03          0.28                   1.50
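The motivating example above can be reproduced in a short simulation. This is an illustrative sketch, not the analysis behind Table 4.2: the seed, the pure-NumPy shortcut for the quantile fit, and the exact counts are assumptions here, so the numbers will not match the table exactly. With a single binary covariate, quantile regression at the 90th percentile (quantreg::rq(z ~ biofeature, tau = 0.9) in R) essentially reduces to the difference in group 90th percentiles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Recreate the motivating setup: 1000 loci, 20 causal, one biofeature
# with sensitivity = specificity = 0.8 with respect to causality.
causal = np.zeros(1000, dtype=bool)
causal[:20] = True
biofeature = np.where(causal, rng.random(1000) < 0.8, rng.random(1000) < 0.2)

# Test statistics: N(0, 1) for non-causal loci, N(3.89, 1) for causal loci.
z = rng.normal(0.0, 1.0, 1000) + 3.89 * causal

# With one binary covariate, the tau = 0.9 quantile regression fit recovers
# the two group 90th percentiles: intercept and intercept + slope.
intercept = np.quantile(z[~biofeature], 0.9)
slope = np.quantile(z[biofeature], 0.9) - intercept

# Mean difference (what the two-sample t-test compares) vs. tail difference:
mean_diff = z[biofeature].mean() - z[~biofeature].mean()
print(round(mean_diff, 2), round(slope, 2))
```

The tail difference (the quantile-regression slope) is much larger than the mean difference, which is why the t-test misses the biofeature effect while the 90th-percentile regression detects it.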
So we are motivated to use quantile regression where the standard linear means model might fail. In a GWAS, if the effects of a biofeature are concentrated in the top percentiles of the variants tested, it is natural to use quantile regression to model such an effect. In fact, quantile-dependent penetrance was also found in a recently published genetic association study (Williams, 2012). Specifically, to fully utilize the potential of effects on multiple quantiles, we propose to build a hierarchical model using multiple quantiles simultaneously.
4.2 A Bayesian Quantile Regression

Quantile regression has been developed intensively as a methodology focused on the effects on specific percentiles of a distribution (Koenker, 2005; Cade & Noon, 2003). In the following we review the basic statistical properties of quantile regression. To make the equations more compact, we include the intercept 1 at the beginning of the covariate vector x, so that x has length p + 1.
Let Q_Y(τ | x) be the τ-th quantile of Y conditional on X = x (0 ≤ τ ≤ 1). A linear quantile regression model for Q_Y(τ | x) is specified as

    Q_Y(τ | x) = x′β(τ)    (4.1)

where x is the covariate vector of length p + 1 (with the first element the intercept term), and β(τ) is the corresponding coefficient vector.
A quantile model focuses on specific quantiles of the conditional distribution of Y | X. The most popular method to estimate the coefficients in (4.1) was proposed by Koenker and Bassett (1978): minimize the loss function Σ_{i=1}^{n} ρ_τ(y_i − x_i′β(τ)) over the data {(x_i, y_i) : 1 ≤ i ≤ n}, where

    ρ_τ(u) = u[τ − I(u < 0)]    (4.2)
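The check loss in equation (4.2) can be computed directly, and its defining property — that the τ-th sample quantile minimizes the summed loss — can be verified numerically. A minimal sketch; the toy data and the grid search are illustrative assumptions, not part of the dissertation's analysis.

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check loss: rho_tau(u) = u * (tau - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

# For tau = 0.5, rho_0.5(u) = |u| / 2, so the minimizer of the summed loss
# over a constant c is the sample median (any value in [2, 3] here).
y = np.array([1.0, 2.0, 3.0, 10.0])
grid = np.linspace(0.0, 10.0, 1001)
losses = [check_loss(y - c, 0.5).sum() for c in grid]
best_c = grid[int(np.argmin(losses))]
print(best_c)
```

Replacing 0.5 with any other τ makes the minimizer the corresponding sample quantile, which is exactly the estimation principle behind equation (4.2).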
The likelihood above is focused on a single point (the τ-th percentile) of the conditional distribution, which does not do justice to the potential of the model. Here we would like to implement a model that simultaneously investigates a range of percentiles of the distribution:

    Q_Y(τ_k | x) = x′β(τ_k);  0 < τ_k < 1,  k = 1, 2, 3, …, K    (4.3)
Although it is valid to make inference by post-processing the individual quantile regression models across a range of percentiles (for K percentiles: 0 < τ_1 < τ_2 < ⋯ < τ_K < 1), philosophical and practical difficulties exist in drawing inference simultaneously on multiple τ values (Yu & Lu, 2003). Another drawback of post-processing is the inability to connect the inference on multiple percentiles simultaneously with the response variable Y, which is particularly important if we are to set up a hierarchical model where the response variable Y is a variable of interest itself (Tokdar & Kadane, 2011).
From a Bayesian point of view, the minimization approach above is equivalent to maximization of a likelihood function under the asymmetric Laplace error distribution (Yu & Moyeed, 2001; Tsionas, 2003). Following Yu and Moyeed, one can construct the linear model as

    Y = x′β(τ) + ε    (4.4)

where ε has the asymmetric Laplace distribution with density

    f_τ(ε) = τ(1 − τ) exp{−ρ_τ(ε)}

and ρ_τ(·) is defined in equation (4.2). In such a model, maximizing the likelihood gives the same solution as minimizing the loss function in equation (4.2). Note that the parameter τ determines the shape of the distribution, and the τ-th quantile of the distribution is 0.
Kozumi and Kobayashi adopted a mixture representation of the asymmetric Laplace distribution and developed a Gibbs sampling algorithm (Kozumi & Kobayashi, 2011). Let w be a standard Exponential variable and u a standard Normal variable, and let

    ε = θw + δ√w u,  where θ = (1 − 2τ) / [τ(1 − τ)] and δ² = 2 / [τ(1 − τ)]    (4.5)
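The mixture representation in equation (4.5) can be checked by simulation: draws of ε = θw + δ√w·u should have their τ-th quantile at 0, the defining property of the asymmetric Laplace error. A minimal sketch; the choice τ = 0.8 and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 0.8

# Mixture parameters from equation (4.5).
theta = (1 - 2 * tau) / (tau * (1 - tau))
delta = np.sqrt(2 / (tau * (1 - tau)))

# eps = theta * w + delta * sqrt(w) * u with w ~ Exp(1), u ~ N(0, 1).
n = 200_000
w = rng.exponential(1.0, n)
u = rng.normal(0.0, 1.0, n)
eps = theta * w + delta * np.sqrt(w) * u

# The tau-th quantile of the asymmetric Laplace error is 0 by construction,
# so Pr(eps <= 0) should be close to tau.
print(np.mean(eps <= 0))
```

The empirical probability comes out close to 0.8, confirming that the Normal-Exponential mixture reproduces the quantile property of the asymmetric Laplace distribution.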
Then ε has an asymmetric Laplace distribution with the density specified in equation (4.4). To analyze multiple quantiles simultaneously in a Bayesian model, Karthik Sriram et al. proposed a pseudo asymmetric Laplace density (PALD) as follows (Sriram & Ghosh, 2012; Sriram, Ramamoorthi, & Ghosh, 2011):

    PALD_{τ_1, τ_2, …, τ_K}(ε_1, ε_2, …, ε_K) = ∏_{k=1}^{K} exp{−ρ_{τ_k}(ε_k)} = exp{−Σ_{k=1}^{K} ρ_{τ_k}(ε_k)}    (4.6)
where 0 < τ_1 < τ_2 < ⋯ < τ_K < 1, and ρ_τ(·) is defined in (4.2). It is a pseudo density because the normalization constant is not specified in the equation. However, they proved that the posterior is consistent under proper priors.
Using the mixture representation of the asymmetric Laplace distribution in equation (4.5), the PALD can be written as

    ε_k = θ_k w_k + δ_k √(w_k) u_k    (4.7)

where θ_k = (1 − 2τ_k) / [τ_k(1 − τ_k)], δ_k² = 2 / [τ_k(1 − τ_k)], the w_k are i.i.d. standard Exponential variables and the u_k are i.i.d. standard Normal variables for all k. Thus we can write the Normal-Exponential mixture representation of the pseudo-likelihood for the data {(x_i, y_i) : 1 ≤ i ≤ n} as follows:

    P(y | X, β(τ_1), …, β(τ_K)) ∝ ∏_{i=1}^{n} ∏_{k=1}^{K} φ{ [y_i − x_i′β(τ_k) − θ_k w_ik] / (δ_k √(w_ik)) }    (4.8)

where w_ik ~ i.i.d. Exp(1) for all i, k, with θ_k and δ_k² as defined in equation (4.7).
Such a mixture representation is very useful in several respects:

(1) If we place Normal prior distributions on the quantile regression parameters, they are conjugate with our pseudo-likelihood function, which leads to an efficient Gibbs sampling procedure.

(2) As shown in the following section, in a hierarchical model where the first level is a linear model and the second level is a quantile regression model, the conditional distributions and posterior distribution are easily constructed due to the conjugacy.
4.3 A Bayesian Hierarchical Model with Multiple Quantiles
Based on the pseudo asymmetric Laplace density (PALD), a simple hierarchical extension of the quantile regression model is proposed in this section. Suppose the observed marginal test statistics T_1, T_2, …, T_n have a linear relationship with latent variables V_1, V_2, …, V_n, and the latent variables V_1, V_2, …, V_n have a distribution that can be explained by a quantile regression on explanatory variables X_1, X_2, …, X_n. Then we can construct a hierarchical model as follows:
    T_j | V_j ~ Normal(V_j, σ²)

    V_j | X_j ∝ ∏_{k=1}^{K} φ{ [V_j − X_j′β(τ_k) − θ_k W_jk] / (δ_k √(W_jk)) }    (4.9)

where W_jk ~ i.i.d. Exp(1) for all j, k, θ_k = (1 − 2τ_k) / [τ_k(1 − τ_k)] and δ_k² = 2 / [τ_k(1 − τ_k)] for all k.

The prior distributions for the parameters in the model are specified as

    β(τ_k) ~ Normal(0, λ_0 I) for all k    (4.10)

where λ_0 is a hyper-parameter fixed at a non-informative value.
Given the model and the priors, MCMC via Gibbs sampling is straightforward. It becomes simpler when we sample the latent variables V_j along with the other variables. The full conditional distributions of the parameters of interest are as follows.
The full conditional distribution of V_j is

    V_j | T_j, X_j, W_j1, W_j2, …, W_jK ~ Normal(D_j⁻¹ E_j, D_j⁻¹)

where D_j = σ⁻² + Σ_{k=1}^{K} 1 / (δ_k² W_jk), and E_j = T_j / σ² + Σ_{k=1}^{K} [X_j′β(τ_k) + θ_k W_jk] / (δ_k² W_jk)    (4.11)
The full conditional distribution of W_jk⁻¹ is

    W_jk⁻¹ | V_j, X_j, τ_k ~ Inverse-Gaussian(μ_jk, λ_k)

where λ_k = (θ_k² + 2δ_k²) / δ_k², and μ_jk = √(θ_k² + 2δ_k²) / |V_j − X_j′β(τ_k)|    (4.12)
Let β = (β(τ_1), β(τ_2), …, β(τ_K)). The full conditional distribution of β is

    β | T, X, V, W ~ Normal(A_β⁻¹ M_β, A_β⁻¹)

where A_β = diag( Σ_{j=1}^{n} X_j X_j′ / (δ_k² W_jk), k = 1, 2, …, K ) + λ_0⁻¹ I

and M_β = column( Σ_{j=1}^{n} [V_j − θ_k W_jk] X_j / (δ_k² W_jk), k = 1, 2, …, K )    (4.13)
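The three full conditionals above can be assembled into a Gibbs sampler. The sketch below is a simplified reading of equations (4.9)-(4.13), not the dissertation's implementation: the default values of σ², λ_0 and the iteration count are assumptions, the β update is done one quantile block at a time, and the Inverse-Gaussian draw for W_jk⁻¹ uses NumPy's Wald distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_hquantile(T, X, taus, sigma2=1.0, lam0=100.0, iters=500):
    """Gibbs sampler sketch for the hierarchical quantile model; returns
    the posterior mean of the latent variable V (used for ranking SNPs)."""
    n, p = X.shape
    taus = np.asarray(taus, dtype=float)
    K = len(taus)
    theta = (1 - 2 * taus) / (taus * (1 - taus))
    delta2 = 2 / (taus * (1 - taus))
    V = np.asarray(T, dtype=float).copy()
    W = np.ones((n, K))
    beta = np.zeros((K, p))
    V_sum, kept = np.zeros(n), 0
    for it in range(iters):
        for k in range(K):
            # beta(tau_k) | rest: conjugate Normal update (equation 4.13).
            prec = X.T @ (X / (delta2[k] * W[:, [k]])) + np.eye(p) / lam0
            mvec = X.T @ ((V - theta[k] * W[:, k]) / (delta2[k] * W[:, k]))
            cov = np.linalg.inv(prec)
            beta[k] = rng.multivariate_normal(cov @ mvec, cov)
            # W_jk^{-1} | rest: Inverse-Gaussian(mu_jk, lam_k) (equation 4.12).
            resid = np.abs(V - X @ beta[k]) + 1e-10
            lam_k = (theta[k] ** 2 + 2 * delta2[k]) / delta2[k]
            mu = np.sqrt(theta[k] ** 2 + 2 * delta2[k]) / resid
            W[:, k] = 1.0 / rng.wald(mu, lam_k)
        # V_j | rest: Normal(D_j^{-1} E_j, D_j^{-1}) (equation 4.11).
        D = 1.0 / sigma2 + (1.0 / (delta2 * W)).sum(axis=1)
        E = T / sigma2 + ((X @ beta.T + theta * W) / (delta2 * W)).sum(axis=1)
        V = rng.normal(E / D, np.sqrt(1.0 / D))
        if it >= iters // 2:  # keep the second half as posterior draws
            V_sum += V
            kept += 1
    return V_sum / kept
```

Because every conditional is a standard distribution (Normal or Inverse-Gaussian), the sampler needs no tuning; this is the computational payoff of the conjugacy noted in point (1) of the previous section.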
4.4 Simulation Studies Based on a Normal Approximation

4.4.1 Design of the simulation study

Suppose we have d = 1, …, D biofeatures X_d, each with a specific informativeness (specificity and sensitivity) with respect to the causality of a SNP, as shown in the motivation section (Table 4.1 and Figure 4.1). We first simulate causal and non-causal SNPs via an indicator variable:

    I_j ~ Bernoulli(p)    (4.14)
where p is a pre-specified value for the probability of causality in the set of SNPs. Then, based on the causality of each SNP, we simulate the values of X_jd according to how informative the biofeatures are (their specificity and sensitivity). The unobserved latent variable V_j is simulated as the true test statistic (not the effect size, but the statistic corresponding to the p-value if we were to test the SNP in a hypothetical study), which is linearly associated with the causality of the SNP:

    V_j ~ Normal(λ I_j, σ_V²)    (4.15)

The observed variable T_j is then a sampled observation around the true latent variable:

    T_j ~ Normal(V_j, σ²)    (4.16)
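The generative steps (4.14)-(4.16) can be written out directly. A minimal sketch; the values chosen for p, λ, the variances and the sensitivity/specificity are illustrative assumptions, not the dissertation's exact settings.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative settings (assumed here): 1000 SNPs, ~2% causal,
# lambda = 3.89, unit variances, one biofeature with sens = spec = 0.8.
n_snps, p_causal = 1000, 0.02
lam, sigma_v2, sigma2 = 3.89, 1.0, 1.0
sens, spec = 0.8, 0.8

I = rng.random(n_snps) < p_causal                  # causality indicator (4.14)
X = np.where(I, rng.random(n_snps) < sens,         # biofeature given causality
             rng.random(n_snps) < (1 - spec))
V = rng.normal(lam * I, np.sqrt(sigma_v2))         # latent true statistic (4.15)
T = rng.normal(V, np.sqrt(sigma2))                 # observed statistic (4.16)
```

T is then the "data" each ranking method sees, while I and V are held back as the truth against which prioritization is scored.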
Because the causality indicator I is associated with the biofeatures X, and causality mainly affects the tail of the distribution of the observed variable T (the test statistic for a univariate marginal association test), we use a hierarchical quantile regression model focused on the upper quantiles of the conditional distributions, as given in equation (4.9). Our main interest lies in the ability to prioritize the causal SNPs, so we focus on the posterior distribution of the latent variable V. By counting the number of causal SNPs among the top K values of the latent variable, we are able to compare with other popularly used prioritization methods:
(1) A linear hierarchical model (in which the first level is a linear model on the latent variable and the second level is a linear regression on the biofeatures);

(2) A combination of p-value and filters (using the biofeature filtering as well as the ranking of the marginal p-values);

(3) A quantile hierarchical model using multiple quantile points (in which the first level is a linear model on the latent variable and the second level is a quantile regression on the biofeatures).
We first performed a simulation study with only one biofeature. In this study, the biofeature was specified with a fixed sensitivity and specificity (corresponding to the causality of a SNP), and the data were analyzed using the three methods mentioned above. In the Bayesian hierarchical quantile regression, we simultaneously focus on quantiles in the upper tail of the conditional distribution (τ = 0.7, 0.8, 0.9, 0.92, 0.95, 0.98). After we draw the posterior samples of the latent variable, we rank the SNPs by the posterior mean of the latent variable (V). For each scenario, we performed 100 replications. To summarize the results, we count the number of causal SNPs in the top K (20, 50, 100 or 200) ranked SNPs from each of the three ranking methods, and compare them to the corresponding number from the data alone (marginal p-value, or the observed test statistic T).

We also performed simulation studies with 5 homogeneous biofeatures (having the same specificity and sensitivity) and 5 heterogeneous biofeatures (having different specificities and sensitivities).
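The top-K summary used throughout the simulation studies is a simple count. A minimal sketch of that bookkeeping; the function name and the toy arrays in the usage are assumptions for illustration.

```python
import numpy as np

def gain_in_top_k(scores, baseline_scores, causal, k=20):
    """Number of causal SNPs among the top-k SNPs ranked by `scores`
    (e.g. posterior means of V), minus the same count under a baseline
    ranking (e.g. the observed test statistics / p-value order)."""
    top = np.argsort(scores)[::-1][:k]
    top_base = np.argsort(baseline_scores)[::-1][:k]
    return int(causal[top].sum()) - int(causal[top_base].sum())

# Toy usage: the causal SNP is ranked first by the method but last by the
# baseline, so the method gains one causal SNP in the top 1.
causal = np.array([True, False, False, False])
print(gain_in_top_k(np.array([5.0, 1.0, 2.0, 3.0]),
                    np.array([0.0, 3.0, 2.0, 1.0]), causal, k=1))
```

Averaging this gain over the 100 replications of each scenario produces one cell of the heat maps reported below.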
4.4.2 Simulation Results

For prioritization, our major interest lies in the ranking of the causal variants. We simulated 20 causal SNPs out of 1000. In each of the heat maps that follow, we show the difference between the number of causal SNPs in the top K (20, 50 or 100) SNPs from each of the three proposed methods and the corresponding number from a simple p-value ranking, across an array of specificities and sensitivities.
The simulation results with only one biofeature are shown in Figure 4.2. Based on the ranking from a Bayesian hierarchical linear regression model, the number of causal SNPs identified is almost the same as when using the marginal data alone. In general, the filtering method yields significantly more causal SNPs if the biofeature is informative (high sensitivity and specificity), but loses many causal SNPs if the biofeature is not informative (low sensitivity and specificity). The Bayesian hierarchical quantile regression produces a marked gain of causal SNPs if the biofeature is informative, with very little loss if the biofeature is not informative at all. For example, at a specificity and sensitivity of 0.8 (relatively informative), the filtering and quantile regression approaches on average identify 2 more causal SNPs in the top 20 than the p-value ranking alone, whereas the linear regression approach identifies the same number of causal SNPs as the p-value ranking. When the specificity and sensitivity are 0.5 (not informative), on average the filtering approach identifies 3 fewer causal variants in the top 20 than the p-value ranking alone; in contrast, the linear regression and quantile regression approaches identify the same number of causal variants in the top 20 as the p-value ranking for this scenario. We observed the same pattern of gains and losses in the identification of causal variants in the top 100 rankings. These results suggest that the Bayesian hierarchical quantile regression is a safe method to boost the discovery of causal SNPs by incorporating external information from biofeatures.

Figure 4.2: Gain in the number of causal SNPs in the top 20 (100) ranked SNPs (with 1 biofeature). Panels: hierarchical linear regression, bioinformatics-based filtering, and simultaneous quantile regression; axes are specificity (spec) and sensitivity (sens).
As shown in Figure 4.3, we observed similar patterns of identification in simulation studies with 5 homogeneous biofeatures. In these simulation studies, all 5 biofeatures have the same specificity and sensitivity with respect to the causality of a SNP. Because of the additive effects of the biofeatures, the differences in causal SNP identification across approaches are even more distinct. For example, at a specificity and sensitivity of 0.8, the filtering approach identifies 7 more causal SNPs in the top 20 on average compared to the p-value ranking; the quantile regression approach identifies 5 more causal SNPs; and the linear regression approach identifies almost the same number of causal SNPs. On the other hand, when the biofeatures are not informative (specificity and sensitivity of 0.5), the filtering approach identifies 7 fewer causal SNPs than the p-value ranking, whereas the quantile regression and linear regression approaches identify the same number of causal variants as the p-value ranking.

Figure 4.3: Gain in the number of causal SNPs in the top 20 (100) ranked SNPs (with 5 homogeneous biofeatures). Panels: hierarchical linear regression, bioinformatics-based filtering, and simultaneous quantile regression; axes are specificity (spec) and sensitivity (sens).
Next we considered heterogeneous biofeatures, of which three have sensitivity and specificity fixed at 0.8 while the sensitivity and specificity of the other two vary across the scenarios. The results are summarized in Figure 4.4; in these figures, the sensitivity and specificity axes correspond to the two varying biofeatures. Due to the presence of 3 informative biofeatures in all scenarios, the loss of causal SNPs under the bioinformatics filtering method is not as pronounced as above. However, the general pattern remains: in the identification of causal SNPs, the Bayesian hierarchical quantile regression is still more robust than the other two methods.

Figure 4.4: Gain in the number of causal SNPs in the top 20 (100) ranked SNPs (with 5 heterogeneous biofeatures). Panels: hierarchical linear regression, bioinformatics-based filtering, and simultaneous quantile regression; axes are specificity (spec) and sensitivity (sens).
4.5 Simulation Studies Based on Marginal GWAS Estimation

4.5.1 Design of the simulation study

In the previous section, we performed simulations based on the normal approximation of the marginal test statistics. In theory, with large sample sizes, the marginal test statistics should be asymptotically normally distributed, with means equal to the true test statistics. However, the sampling variation of the test statistics varies, which greatly affects the power of different statistical approaches. Thus, in this simulation study, we demonstrate the use of the Bayesian hierarchical machinery directly on a simulated GWAS dataset. As in Section 4.4, each of the D biofeatures is somewhat informative (or non-informative) with respect to the causality of a SNP. As in equation (4.14), we first simulate causal and non-causal SNPs as an indicator variable, and then we simulate the value of each biofeature X_jd according to how informative it is (specificity and sensitivity). The unobserved effect size β_j is simulated according to the causality of the SNP as well:

    β_j = log(1.2) · I_j    (4.17)
Assuming either independent or correlated SNPs, we simulate the genotypes G_i for individual i. Based on a logistic disease model, we then simulate a case-control GWAS dataset:

    Pr(Y_i = 1) = exp((G_i − Ḡ)′β) / [1 + exp((G_i − Ḡ)′β)]    (4.18)
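Equations (4.17)-(4.18) translate directly into a generative simulation. A minimal sketch under assumed settings: the sample size, SNP count, number of causal SNPs and the uniform MAF range are illustrative choices, not the dissertation's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed settings: 2000 individuals, 200 independent SNPs, 4 causal.
n_ind, n_snps = 2000, 200
causal = np.zeros(n_snps, dtype=bool)
causal[:4] = True
beta = np.log(1.2) * causal                        # effect sizes (4.17)

# Independent SNP genotypes under Hardy-Weinberg with random MAFs.
maf = rng.uniform(0.05, 0.5, n_snps)
G = rng.binomial(2, maf, size=(n_ind, n_snps)).astype(float)

# Logistic disease model on mean-centered genotypes (4.18).
eta = (G - G.mean(axis=0)) @ beta
p_case = np.exp(eta) / (1 + np.exp(eta))
Y = rng.random(n_ind) < p_case
```

Centering the genotypes keeps the baseline disease probability near 0.5, so the simulated dataset is approximately balanced between cases and controls before any case-control sampling.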
Given the GWAS dataset, we apply the standard 2-df logistic regression. The observed variable T_j is the marginal test statistic from separate univariate logistic regressions for each SNP j. We use a hierarchical quantile regression model focused on the upper quantiles of the conditional distributions, as given in equation (4.9). As in Section 4.4, we compare this method with the other popularly used methods.
4.5.2 Simulation Results for Independent SNPs

In these simulation studies, we simulated 1, 5 or 20 causal SNPs out of 1000 SNPs, all assumed to be independent of each other. We specified biofeature(s) with different specificities and sensitivities corresponding to the causality of a SNP. In the following, we show heat maps of the increase in the number of causal SNPs in the top 1, 5 or 20 SNPs. In the first study, we simulated only one biofeature. After drawing the posterior samples of the latent variable, we ranked the SNPs by the posterior mean of the latent variable (V). For each scenario (with different sensitivity and specificity), we performed 100 replications. To summarize the results, we count the number of causal SNPs in the top K (1, 5 or 20) ranked SNPs from each of the three ranking methods, and compare them to the corresponding number from the data alone (marginal p-value, or the observed test statistic T).
The results are shown in Figure 4.5. The Bayesian hierarchical linear regression prioritizes significantly more causal SNPs if the biofeature is informative (high specificity and sensitivity), but suffers a significant deficit with non-informative biofeatures (low specificity and sensitivity). Similar patterns are seen for the bioinformatics filtering method. The Bayesian hierarchical quantile regression yields a marked gain of causal SNPs if the biofeature is informative, with very little loss if the biofeature is not informative at all. For example, in the scenarios with 20 causal SNPs and specificity and sensitivity of 0.9, all three approaches increase the number of identified causal SNPs by 4-5 on average, compared to the p-value-only approach. When the specificity and sensitivity are 0.5, the quantile regression approach identifies the same number of causal SNPs, whereas the linear hierarchical and filtering approaches identify 1-2 fewer causal SNPs than the p-value-alone approach on average. These results from a direct GWAS simulation study show that the Bayesian hierarchical quantile regression is a safe method to improve the identification and prioritization of causal SNPs by incorporating external information from biofeatures.

Figure 4.5: Gain in the number of causal SNPs in the top 1, 5 and 20 ranked SNPs (with 1 biofeature and 1, 5 and 20 causal SNPs, respectively). Panels: hierarchical linear regression, bioinformatics-based filtering, and simultaneous quantile regression; axes are specificity (spec) and sensitivity (sens).
We also performed simulation studies with 2 homogeneous or heterogeneous biofeatures. The quantile regression approach again increases the number of causal SNPs identified when the biofeatures are informative, and maintains the same number of identified causal SNPs when the biofeatures are non-informative, whereas the linear regression and filtering approaches suffer a significant loss in the identification of causal SNPs if the biofeatures are not informative. The results are summarized in Figure 4.6 and Figure 4.7.
Figure 4.6: Gain in the number of causal SNPs in the top 1, 5 and 20 ranked SNPs (with 2 homogeneous biofeatures and 1, 5 and 20 causal SNPs, respectively). Panels: hierarchical linear regression, bioinformatics-based filtering, and simultaneous quantile regression; axes are specificity (spec) and sensitivity (sens).
Figure 4.7: Gain in the number of causal SNPs in the top 1, 5 and 20 ranked SNPs (with 2 heterogeneous biofeatures and 1, 5 and 20 causal SNPs, respectively). Panels: hierarchical linear regression, bioinformatics-based filtering, and simultaneous quantile regression; axes are specificity (spec) and sensitivity (sens).
4.5.3 Simulation Results for Correlated SNPs

As in Section 4.5.2, we simulated 1, 5 or 20 causal SNPs out of 1000 SNPs. To mimic the LD structure of genetic data, we assume a simple correlation structure among the SNPs: within an LD region, the SNPs are equally correlated with each other, while SNPs in different LD regions are uncorrelated. We applied the same analysis models to the data, and the results are summarized below. We performed simulations with an LD region size of 5 SNPs and within-region correlations of 0.2 and 0.5; the results are summarized in Figure 4.8 and Figure 4.9, respectively. Specifically, for a within-region correlation of 0.5 with specificity and sensitivity of 0.5 (non-informative biofeature), the linear regression and filtering approaches identify 1 fewer causal SNP in the top 20 ranked SNPs than the p-value-only approach on average, whereas the quantile regression approach identifies almost the same number of causal SNPs on average. When the specificity and sensitivity are 0.9 (informative biofeature), all three approaches increase the number of identified causal SNPs by 4-5. For the quantile regression approach, when the sensitivity is low (0.5), the number of identified causal SNPs increases with increasing specificity; in contrast, at low specificity (0.5), higher sensitivity does not increase the number of identified causal SNPs.
Figure 4.8: Gain in the number of causal SNPs in the top 20 ranked SNPs (with 1 biofeature, LD region size 5 and within-region correlation 0.2).

Figure 4.9: Gain in the number of causal SNPs in the top 20 ranked SNPs (with 1 biofeature, LD region size 5 and within-region correlation 0.5).
We observed similar patterns for the prioritization of causal SNPs with different LD structures and different numbers of biofeatures (e.g., with a region size of 10, a within-region correlation of 0.2 or 0.5, and 2 homogeneous or heterogeneous biofeatures, as shown in the appendix figures).
4.6
An
Applied
Study
on
the
Colorectal
Cancer
Family
Registry
Colorectal cancer is the second leading cause of cancer death in developed countries, with the lifetime risk estimated to be 5% to 6% (Altekruse et al., 2007). To date, genome-wide association studies (GWAS) have identified fourteen low-penetrance genetic variants that together explain approximately 8% of the familial risk of this disease (Peters et al., 2011). The Colorectal Cancer Family Registry (Colon CFR) is an international consortium of six centers in North America and Australia, formed as a resource to support studies on the etiology, prevention, and clinical management of colorectal cancer (Newcomb et al., 2007). The Colon CFR includes both population-based and clinic-based colorectal cancer cases. To date, 8745 population-based families and 1861 clinic-based families are participating in this study (http://www.coloncfr.org/index.php/summary-data).
Despite the large sample sizes, only a few previously reported GWAS hits have been significantly replicated in the Colon CFR, and no new loci have been discovered at genome-wide levels of significance.
Here we propose to use hierarchical quantile regression to prioritize the GWAS results from the Colon CFR, using biofeatures corresponding to the genomic annotations of each SNP. To provide a fair comparison, we used the change in ranking of the SNPs compared with a "gold standard" based on a recent meta-analysis of several colon cancer GWAS studies: MECC (Middle East Cancer Consortium), Colon CFR, the KY Colon Cancer Study, and GECCO (Genetics and Epidemiology of Colorectal Cancer Consortium).
In the meta-analysis, we identified a few variants that pass the genome-wide significance level, serving as the "causal" variants to be detected. We constrained our analysis to the subset of SNPs that passed the nominal significance level (p-value < 0.05) in the meta-analysis. The corresponding SNPs (431,813 in total) in the Colon CFR dataset were used for analysis.
We applied a pruning procedure to the SNP data based on their LD structure, which resulted in 44,010 relatively uncorrelated regions. In the pruning process, we looped through all the SNPs in order of increasing p-value, and defined regions as comprising all SNPs whose correlation exceeded a threshold. As the test statistic for each region, we used the smallest p-value in that region.
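The pruning procedure just described can be sketched as follows; the function name, the correlation lookup, and the toy data are illustrative assumptions, not the implementation used in the analysis.

```python
# Greedy LD pruning sketch: visit SNPs in order of increasing p-value;
# each unassigned SNP seeds a region that absorbs all remaining SNPs
# correlated with it above the threshold. The region's test statistic
# is its smallest p-value (the seed's, since we visit in p-value order).
def prune_by_ld(pvalues, corr, threshold=0.5):
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    assigned, regions = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + [j for j in order
                         if j not in assigned and j != i and corr(i, j) > threshold]
        assigned.update(members)
        regions.append({"snps": members, "min_p": pvalues[i]})
    return regions

# Toy example: SNPs 0 and 2 are in LD; SNP 1 is independent.
pvals = [0.001, 0.2, 0.03]
ld = {(0, 2): 0.9, (2, 0): 0.9}
regions = prune_by_ld(pvals, lambda i, j: ld.get((i, j), 0.0))
# SNPs 0 and 2 collapse into one region keyed by p = 0.001; SNP 1 stands alone.
```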
We utilized information on 5 biofeatures in this analysis: DNase 1 promoter, enhancer, exon, promoter, and TCF7L2 binding site. For each region, we summarized the biofeature information on a present/absent basis: a region's biofeature was coded as 1 if any SNP in the region had the biofeature, and 0 otherwise.
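A minimal sketch of this present/absent summarization (the function and feature names are hypothetical):

```python
# Region-level biofeature indicator: 1 if any SNP in the region carries
# the feature, else 0 -- i.e., a logical OR over the region's SNPs.
def region_biofeatures(snp_features):
    # snp_features: list of dicts, one per SNP, mapping feature name -> 0/1
    features = set().union(*(f.keys() for f in snp_features))
    return {name: int(any(f.get(name, 0) for f in snp_features))
            for name in features}

region = [{"enhancer": 0, "exon": 1},
          {"enhancer": 0, "exon": 0},
          {"enhancer": 0, "exon": 1}]
summary = region_biofeatures(region)
# one SNP in an exon suffices: the region is coded exon = 1, enhancer = 0
```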
We summarize the frequency of the 5 biofeatures across all 44,010 regions in the following table. The rarest biofeature is DNase 1 promoter, associated with less than 2% of the regions, whereas the most common is exon, associated with about 7% of the regions.
Table 4.3 Frequencies of 5 Biofeatures in 44,010 Regions

                 DNase 1 Promoter   Enhancer   Exon    Promoter   TCF7L2 Binding Site
Non-Biofeature   43252              42638      40858   41802      41887
Biofeature       758                1372       3152    2208       2123
With this biofeature information, we applied the hierarchical quantile regression model. We summarize the quantile effects of the 5 biofeatures in Table 4.4 below. Four of the 5 biofeatures are consistently positively associated with the test statistics across the quantiles, with DNase 1 being consistently, but weakly, negatively associated.
Table 4.4 Mean Estimates β_d(τ_k) and 95% Intervals for the Biofeature X_d Effects on Test Statistics at Multiple Quantiles τ_k

Feature X_d          τ_k = 0.70              0.80                    0.90                    0.92                    0.95                    0.98
DNase 1 Promoter     -0.027 (-0.171, 0.115)  -0.027 (-0.175, 0.128)  -0.029 (-0.198, 0.169)  -0.019 (-0.195, 0.183)  -0.014 (-0.217, 0.199)  -0.010 (-0.295, 0.359)
Enhancer             0.100 (0.005, 0.206)    0.101 (0.002, 0.203)    0.097 (-0.021, 0.218)   0.102 (-0.028, 0.237)   0.111 (-0.037, 0.264)   0.111 (-0.075, 0.324)
Exon                 0.045 (-0.027, 0.115)   0.047 (-0.021, 0.119)   0.053 (-0.034, 0.132)   0.053 (-0.042, 0.140)   0.053 (-0.051, 0.160)   0.056 (-0.083, 0.200)
Promoter             0.089 (-0.005, 0.180)   0.088 (-0.008, 0.184)   0.089 (-0.022, 0.212)   0.090 (-0.020, 0.202)   0.088 (-0.035, 0.214)   0.111 (-0.061, 0.303)
TCF7L2 Binding Site  0.040 (-0.045, 0.120)   0.042 (-0.055, 0.133)   0.046 (-0.055, 0.154)   0.044 (-0.064, 0.157)   0.047 (-0.074, 0.359)   0.061 (-0.106, 0.239)
We focused the analysis on the top regions identified by the meta-analysis as the "gold standard". For example, for the top 100 regions from the meta-analysis, we found:
1) The change in rank due to data for these regions was associated with the biofeatures DNase 1 promoter, enhancer, and TCF7L2. For example, SNPs within a DNase 1 promoter region showed a mean rank improvement of 4520 positions in the meta-analysis compared to the crude ranking from the Colon CFR study, indicating that this biofeature is the most informative.

2) The quantile regression analysis improved the rankings of these regions on average by 428 positions compared to the Colon CFR study.

3) Moreover, among those of the 100 regions that showed an improvement in ranking in the meta-analysis compared to the Colon CFR study (i.e., regions indicated by the data to be more likely truly associated), the average ranking improvement from the quantile regression analysis was 1093.

4) Finally, the CCFR analysis placed only 2 of these regions in the top 100, whereas quantile regression placed 3.
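The rank-change summaries above amount to comparisons of the following form; the region labels and ranks here are toy values, not the study's.

```python
# Mean rank improvement of a new ranking over an old one for a set of
# regions: positive values mean the new ranking places the regions
# closer to the top (smaller rank numbers).
def mean_rank_improvement(old_ranks, new_ranks, regions):
    return sum(old_ranks[r] - new_ranks[r] for r in regions) / len(regions)

old = {"A": 500, "B": 1200, "C": 40}
new = {"A": 100, "B": 900, "C": 60}
gain = mean_rank_improvement(old, new, ["A", "B", "C"])
# A improves by 400, B by 300, C worsens by 20 -> mean improvement 680/3
```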
We further visualized the improvements in the following figure. For the top regions identified by the meta-analysis, we show the changes in ranking from the Colon CFR dataset to the quantile regression results. The biofeatures are indicated as dots alongside each region. In general, among the top regions identified by the meta-analysis, those with biofeatures are associated with an improved ranking after applying quantile regression.
Figure 4.10 Gain in the number of causal SNPs in the top 200 meta-analysis ranked regions
We also tested the changes in rank for the known SNPs/regions. According to Peters et al. (2011), 18 regions have been reported to be associated with colon cancer risk, of which 14 were included in the SNP subset used in our analysis. The changes in rank for the known hits are shown in Table 4.5. Compared to the CCFR data, quantile regression improved the ranking of the known hits by 2164 positions on average, including 1 additional known hit in the top 100 (12q13.13) and 2 additional known hits in the top 20,000 (16q22.1/CDH1 and 19q13.1/RHPN2). Admittedly, the meta-analysis does better, with a mean rank improvement of 15,137, including 8 additional known hits in the top 100 and 6 additional in the top 20,000.
[Figure 4.10 panels omitted: rank change in the top 100 and from the top 100 to 200; x-axis: CCFR Data vs. Quantile Model; y-axis: log2(Rank); legend: dnase1, enhancer, exon, promoter, tcf7l2; rank+ / rank−]

But
this highlights our proposal: if you have the resources (logistical and monetary), then collecting more samples is a good way to go; in the absence of such resources, incorporating external information can help.
Table 4.5 Changes in rank for the known hits (genes/regions)

Known Hits (gene/region)   Rank in CCFR Data   Rank in Meta-analysis   Rank in Quantile Regression
6p21                       14613               5541                    10271
8q23.3/EIF3H               35021.5             5                       34042
8q24/MYC                   8436                3                       2810
9p24                       14405               235.5                   14957
11q13.4                    30372.5             27                      28967
11q23                      5934                4                       3852
12q13.13                   369                 783                     53
14q22.2/BMP4               10114               118                     6286
14q22.2/BMP4               11311               6302                    11921
15q13/CRAC1/HMPS/GREM1     26333               1047                    24388
15q13/CRAC1/HMPS/GREM1     9511                5699                    9819
15q13/CRAC1/HMPS/GREM1     15309               7                       12170
16q22.1/CDH1               23764               970                     18186
18q21/SMAD7                1996                1                       2162
19q13.1/RHPN2              20965.5             477                     17134
20p12.3/BMP2               18163               10                      17465
20p12.3/BMP2               28540.5             591                     28581
20q13.33/LAMA5             19184               45                      12321
4.7 Discussion
Many approaches have been proposed for selecting a subset of SNPs from GWAS results for follow-up (Cantor et al., 2010). Most studies rank the p-values and possibly filter on prior information.
In our work, we adapted a Bayesian hierarchical multi-quantile regression model to incorporate biofeature information, and applied it to a GWAS dataset with marginal test statistics. In simulation studies, we found that the model boosts the ranking of causal SNPs when the biofeatures are informative, and preserves the original ranking of causal SNPs when they are not. It is thus more robust and powerful for prioritizing causal SNPs than filtering or linear models.
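As a rough frequentist illustration of why quantile effects matter here: with a binary biofeature, the effect at quantile τ can be read off as a difference of group quantiles, and an informative feature shows a larger effect in the upper tail than at the median. The toy data and the simple order-statistic quantile below are illustrative only and are unrelated to the fitted model.

```python
# Toy test statistics for regions without / with a biofeature; the
# feature enriches the upper tail only (values are illustrative).
without = [0.2, 0.5, 0.8, 1.1, 1.4, 1.7, 2.0, 2.3, 2.6, 2.9]
with_f  = [0.2, 0.5, 0.8, 1.1, 1.4, 1.8, 2.3, 2.8, 3.4, 4.0]

def quantile(xs, tau):
    # crude order-statistic quantile on a sorted copy
    xs = sorted(xs)
    return xs[int(tau * (len(xs) - 1))]

# Biofeature "effect" at tau: difference in the tau-th quantile of the
# test statistics between regions with and without the feature.
effect_90 = quantile(with_f, 0.9) - quantile(without, 0.9)  # upper-tail effect
effect_50 = quantile(with_f, 0.5) - quantile(without, 0.5)  # median effect
# here effect_90 = 0.8 while effect_50 = 0.0: the feature is invisible
# at the median but clearly shifts the upper tail
```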
Here
we
summarize
a
few
advantages
of
our
method:
(1) It
is
a
probabilistic
model,
thus
intuitively
easy
to
interpret
the
results
(2) It
is
easy
to
fit
the
model
and
can
be
applied
to
continuous
biofeatures
(an
important
advantage
over
the
filtering
approach)
(3) It
has
better
performance
in
terms
of
the
ranking
of
causal
SNPs,
especially
when
we
have
multiple
biofeatures
whose
information
quality
(specificity
and
sensitivity
with
respect
to
the
causality
of
a
SNP)
varies.
In the application to the Colon CFR dataset, we found that the quantile regression model improved the ranking of the SNPs/regions identified by the meta-analysis. Four of the 5 biofeatures were positively correlated with the change in rank (from the Colon CFR dataset to the meta-analysis), and quantile regression with the relevant biofeatures improved the ranking of the corresponding regions.
However, at the current stage of development, Bayesian hierarchical quantile regression still faces many limitations and challenges. Due to the computational burden, our method is restricted to moderate-sized datasets in terms of the number of SNPs. Given a complete GWAS dataset with a massive number of SNPs, it is necessary to select a subset of candidate SNPs before applying our method.
In addition, SNPs in LD impact the performance of the approach, as we assumed independence of the variables in the quantile regression model. We therefore proposed to prune the GWAS dataset based on the LD structure, and summarized the biofeature information of each LD block for the hierarchical model.
It is very important in all the proposed approaches to select informative biofeatures. For example, when we perform genetic association studies of colorectal cancer, we are interested in biofeatures associated with the biological mechanisms of colorectal cancer development, such as enhancer or promoter regions. However, it is difficult to predict how informative a biofeature will be, since biological knowledge about the trait of interest is still limited. In this case, hierarchical quantile regression serves as a robust approach for SNP prioritization, as it estimates the quantile effects of each biofeature and imposes little penalty when a biofeature is not informative.
Chapter 5 Future research in related areas
In the scope of this dissertation, we proposed statistical methods to deal with three specific problems in genetic association studies. Most GWAS results are analyzed one SNP at a time, which does not fully exploit the potential of GWAS to identify multiple causal variants. We therefore proposed a Bayesian variable selection framework for GWAS and built a multiple regression model incorporating external biological information. To follow up on a GWAS and search for causal variants/regions, we proposed a Bayesian quantile regression model to prioritize GWAS results. In addition, we specifically considered a case-control study design with DNA pools to be sequenced using next-generation sequencing technology, with improved statistical power to detect causal variants.
In general, genetic association studies are advancing with vastly more data and more complicated structures. Next-generation sequencing will soon be cheap enough for whole-genome sequencing studies on a population basis; discoveries in molecular pathways and biological mechanisms add up to an increasingly complicated system of biological knowledge; innovations in medical monitoring will generate complex biomarker and disease data; and advances in electronic medical record systems make large samples of hospital data easily available for research. With all these developments, comprehensive approaches are needed to integrate multi-dimensional information. Bayesian hierarchical models will be very useful and coherent tools; at the same time, they face many challenges, especially in scaling up to whole-genome analysis. Future research in this area needs to address the following specific issues.
5.1 Variable selection problem for a large dataset
In a whole-genome sequencing study, a dataset could potentially include millions of genomic variations as variables. For example, phase I of the 1000 Genomes Project identified 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions (The 1000 Genomes Project Consortium, 2012). An efficient variable selection approach will greatly alleviate the computational burden, as well as improve the statistical power both for building a good prediction model and for finding the causal variants.
In Chapter 3 of this dissertation, we proposed a Bayesian Lasso with a biology-driven penalty term as a prior distribution. The method presented therein is a first step, as computational and theoretical restrictions limit the number of variables it can handle. As an extension in this direction, an empirical estimator for the Informative Bayesian Lasso might be developed and applied to the full dataset. We might estimate the penalty based on the marginal estimates from the data, and then apply the empirically estimated penalty to construct an empirical maximum a posteriori estimator.
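For intuition on why such an empirical MAP estimator could scale: under an orthonormal design, the Laplace-prior (lasso) MAP estimate reduces to soft thresholding of the marginal estimates, which is cheap enough to apply genome-wide. This is a generic sketch with hypothetical penalties, not the Informative Bayesian Lasso itself.

```python
# Soft thresholding: the lasso / Laplace-prior MAP estimate of a
# coefficient given its marginal least-squares estimate z and a
# penalty lam (exact under an orthonormal design).
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# SNP-specific penalties could be set from prior (biofeature) information:
# informative SNPs get a smaller penalty and are shrunk less.
marginal = [2.5, -0.3, 1.1, -1.8]
penalty  = [0.5,  0.5, 1.5,  0.5]   # hypothetical biology-driven penalties
fitted = [soft_threshold(z, l) for z, l in zip(marginal, penalty)]
# small or heavily penalized estimates are zeroed; the rest are shrunk
```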
Similarly, in Chapter 4, in the context of a fully Bayesian analysis of a study with hundreds of thousands of variables, we need to pre-select a subset of these variables. In the context of GWAS, we proposed to prune the dataset based on the LD structure; for each LD region, we use a summary test statistic and aggregated biofeature information. An alternative approach could be to perform a variable selection procedure on the full data, and then apply the quantile regression model for prioritization.
5.2 Further development of hierarchical quantile regression to specifically model linkage disequilibrium between SNPs
Currently, the hierarchical quantile regression model assumes the SNPs are independent. However, due to linkage disequilibrium, many SNPs are correlated with each other. To model the correlation among SNPs, one might adapt a multivariate regression model within the hierarchical quantile regression: in the first level, we could model the correlation structure among the observed test statistics; in the second level, we could model the effects of the biofeature(s) on the quantiles of the latent variables. We have taken some initial steps in this direction, using the LD structure of the SNPs as an estimate of the correlation structure of the test statistics and estimating the posterior distribution of the latent variables from the multivariate model. The initial results show that the proposed procedure does not improve the prioritization of causal SNPs; more investigation is needed to further develop the multivariate hierarchical quantile regression.
We applied the hierarchical quantile regression model to a GWAS study in Chapter 4. A fine-mapping study genotypes or sequences candidate regions at increased density; it is thus a powerful way of finding causal variants, which are more likely to be directly genotyped/sequenced. Since the biofeature(s) might be more relevant to the causal SNPs, applying the hierarchical quantile regression model in a fine-mapping study may yield better statistical power to find the causal variants. As stated above, we need to construct a multivariate regression framework to deal with the correlation structure in fine-mapping data. Another challenge in fine-mapping studies is the computational burden if all genetic variations in the region are included. To deal with this problem specifically, we can construct an empirical Bayes estimator that accounts for the correlation structure.
5.3 Integrative analysis of genetic association studies, especially with next-generation sequencing
With the rapid development of next-generation sequencing technology, it is tempting to design sequencing-based genetic association studies. However, analyzing sequencing data in an integrative way requires careful thought. First, sequencing data will be less accurate than genotyping data (especially at low coverage), and a probabilistic approach that accounts for sequencing error is often needed. In the pooling design using next-generation sequencing data, we constructed an integrated hierarchical model that accounts for this error structure.
Second, many more genetic variants will be detected from sequencing data, but most of them will be rare. Testing one variant at a time will lead to a severe loss of power due to the low frequency of each variant. In such cases, we may consider group tests of rare variants. For example, Neale et al. proposed the C-alpha test, which tests for deviation in the distribution of a group of variants between cases and controls (Neale et al., 2011).
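For intuition, a much simpler collapsing (burden-style) group test aggregates rare variants into a single carrier indicator per subject and compares carrier frequencies between cases and controls; this sketch is generic and is not the C-alpha test, and the genotypes below are toy values.

```python
# Collapse rare variants: a subject is a "carrier" if they harbor any
# rare variant in the gene/region; then compare carrier counts between
# cases and controls with a 2x2 chi-square statistic.
def carrier(genotypes):
    return int(any(g > 0 for g in genotypes))

def chi2_2x2(a, b, c, d):
    # a, b = carriers / non-carriers among cases;
    # c, d = carriers / non-carriers among controls
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

cases    = [[0, 1, 0], [0, 0, 0], [1, 0, 1], [0, 0, 1]]  # toy genotypes
controls = [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 1, 0]]
a = sum(carrier(g) for g in cases)       # 3 of 4 cases carry a variant
c = sum(carrier(g) for g in controls)    # 1 of 4 controls does
stat = chi2_2x2(a, len(cases) - a, c, len(controls) - c)
```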
Third, most of the variants in a sequencing study will have null effects. It is therefore suitable to apply a quantile regression approach to such datasets, directly testing for causal variants in the top percentiles.
5.4 Pharmacogenomics and Personalized Medicine
Pharmacogenomics analyzes how genetic makeup affects a subject's response to drugs; it deals with the influence of genetic variation on a drug's toxicity or efficacy (Karczewski, Daneshjou, & Altman, 2012). Such approaches promise the advent of personalized medicine, wherein drugs are optimized according to each individual's genetic makeup. Although most genetic variants have small effect sizes in population-based studies, some studies have suggested that genetic makeup may have a bigger effect on drug metabolism.
For example, based on GWAS, while SNPs in the HMGCR gene have only a small effect (~5%) on LDL levels, drugs targeting the encoded protein decrease LDL levels to a much greater extent (~30%) (Altshuler, Daly, & Lander, 2008; Willer et al., 2008). This may be because the effect of an inherited variant is limited by natural selection and pleiotropy, whereas the effect of a drug treatment is not. With more discoveries in the genetic mapping of diseases, personalized medicine should take center stage in the next generation of pharmaceutical development.
Appendices

Appendix 1. Variant detection: comparison between pooling vs. individual designs
We also investigated the use of a pooling design for the discovery of novel variants, based on the probability of detection proposed in the Methods section. As an example, with the same cost function, suppose we were interested in a variant with VAF = 0.01. In an individual sequencing design we could sequence 200 individuals at 4X coverage (again assuming that is enough to yield a perfect sequencing result), giving a probability of discovery of 1 − (1 − 0.01)^200 = 0.866. With the same sequencing cost (and with no more than 4000 individuals), the optimal pooling design would yield a probability of detection of 0.948.
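The individual-design calculation above is easy to reproduce (the pooled-design value of 0.948 comes from the optimization described in the Methods section and is not re-derived here):

```python
# Probability of observing a variant at least once when sequencing n
# independent samples, each carrying it with probability vaf -- the
# 1 - (1 - VAF)^n calculation used in the text.
def detection_probability(vaf, n):
    return 1.0 - (1.0 - vaf) ** n

p = detection_probability(0.01, 200)
# rounds to 0.866, matching the value in the text
```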
The discovery probabilities of different study designs are shown in Appendix Fig 1. A more general comparison between pooling and individual sequencing is made with varying VAFs and cost functions, with the same cost as 4X coverage on 100 individuals. In Appendix Fig 2, we can see that even when the VAF is very small, the optimal pooling design yields a very high detection probability. The optimal designs are summarized in Appendix Table 2.
Appendix Figure 1. The discovery probability of a novel variant (VAF = 0.01) using pooling designs, with a fixed sequencing cost
Appendix Figure 2. Comparison of the variant detection probability of optimal pooling designs with various cost functions, versus individual designs, over a range of variant allele frequencies
Appendix Table 1. Top 5 optimal pooling designs for a study interested in a variant with VAF = 0.03 and OR = 2, with a specified cost function and fixed constraints on sequencing cost and sample size

P    Np   λ    Power
51   62   25   0.424
51   60   25   0.421
51   58   25   0.417
51   56   25   0.313
51   54   25   0.409
Appendix Table 2. Comparison of the variant detection probability of optimal pooling designs with various cost functions, versus individual designs, over a range of variant allele frequencies. The cost function is the same as in Figure 2.7.

C_λ   VAF     Np    λ      P    Detection Probability   Individual Design (same cost)
50    0.001   6     42     39   0.321                   0.095
50    0.003   14    113    18   0.657                   0.260
50    0.005   14    108    19   0.828                   0.394
50    0.008   200   892    3    0.943                   0.552
50    0.01    200   892    3    0.980                   0.634
50    0.03    200   465    5    1.000                   0.952
50    0.05    174   204    11   1.000                   0.994
200   0.001   10    56     15   0.198                   0.095
200   0.003   10    56     15   0.440                   0.260
200   0.005   10    56     15   0.610                   0.394
200   0.008   10    56     15   0.775                   0.552
200   0.01    200   553    2    0.868                   0.634
200   0.03    200   553    2    1.000                   0.952
200   0.05    200   349    3    1.000                   0.994
500   0.001   4     9      55   0.175                   0.095
500   0.003   4     9      55   0.381                   0.260
500   0.005   200   1086   1    0.556                   0.394
500   0.008   200   1086   1    0.757                   0.552
500   0.01    200   1086   1    0.844                   0.634
500   0.03    200   1183   1    0.999                   0.952
500   0.05    200   393    2    1.000                   0.994
Appendix Text 1. The WinBUGS code for the hierarchical model on the pooled NGS data.
model {
  for (p in 1:P) {
    logit(rho[p]) <- beta0 + beta1*(W[p] - m.W)
    C[p] ~ dbin(rho[p], 2*N.p)
    theta[p] <- (C[p]/(2*N.p))*(1 - error) + (1 - C[p]/(2*N.p))*error
    V[p] ~ dbin(theta[p], X[p])
  }
  error ~ dunif(0, 0.05)
  beta0 ~ dnorm(0, 0.01)
  beta1 ~ dnorm(0, 0.01)
  T.beta1 <- step(beta1)
  m.W <- mean(W[1:P])
}
Bibliography
Altekruse,
S.
F.,
Kosary,
C.
L.,
Krapcho,
M.,
Neyman,
N.,
Aminou,
R.,
Waldron,
W.,
et
al.
(2007).
SEER
Cancer
Statistics
Review,
1975-‐2007,
National
Cancer
Institute.
Bethesda,
MD,http://seer.cancer.gov/csr/1975_2007/,
based
on
November
2009
SEER
data
submission,
posted
to
the
SEER
web
site,
2010.
Altshuler,
D.,
Daly,
M.
J.,
&
Lander,
E.
S.
(2008).
Genetic
Mapping
in
Human
Disease.
Science,
322(5903),
881–888.
doi:10.1126/science.1156409
Ashburner,
M.,
Ball,
C.
A.,
Blake,
J.
A.,
Botstein,
D.,
Butler,
H.,
Cherry,
J.
M.,
et
al.
(2000).
Gene
Ontology:
tool
for
the
unification
of
biology.
Nature
Genetics,
25(1),
25.
Cade,
B.
S.,
&
Noon,
B.
R.
(2003).
A
gentle
introduction
to
quantile
regression
for
ecologists.
Frontiers
in
Ecology
and
the
Environment,
1(8),
412–420.
Cantor,
R.
M.,
Lange,
K.,
&
Sinsheimer,
J.
S.
(2010).
Prioritizing
GWAS
results:
A
review
of
statistical
methods
and
recommendations
for
their
application.
The
American
Journal
of
Human
Genetics,
86(1),
6–22.
Chen,
J.,
&
Chen,
Z.
(2008).
Extended
Bayesian
information
criteria
for
model
selection
with
large
model
spaces.
Biometrika,
95(3),
759–771.
doi:10.1093/biomet/asn034
The 1000 Genomes Project Consortium. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 490(7422), 56–65. doi:10.1038/nature11632
Cordell,
H.
J.,
&
Clayton,
D.
G.
(2005).
Genetic
association
studies.
The
Lancet,
366(9491),
1121–
1131.
Druley,
T.
E.,
Vallania,
F.
L.
M.,
Wegner,
D.
J.,
Varley,
K.
E.,
Knowles,
O.
L.,
Bonds,
J.
A.,
et
al.
(2009).
Quantification
of
rare
allelic
variants
from
pooled
genomic
DNA.
Nature
Methods,
6(4),
263–265.
doi:10.1038/nmeth.1307
Erlich,
Y.,
Chang,
K.,
Gordon,
A.,
Ronen,
R.,
Navon,
O.,
Rooks,
M.,
&
Hannon,
G.
J.
(2009).
DNA
Sudoku-‐-‐harnessing
high-‐throughput
sequencing
for
multiplexed
specimen
analysis.
Genome
Research,
19(7),
1243–1253.
doi:10.1101/gr.092957.109
Frazer,
K.
A.,
Murray,
S.
S.,
Schork,
N.
J.,
&
Topol,
E.
J.
(2009).
Human
genetic
variation
and
its
contribution
to
complex
traits.
Nature
Reviews
Genetics,
10(4),
241–251.
doi:10.1038/nrg2554
Freedman,
M.
L.,
Monteiro,
A.
N.
A.,
Gayther,
S.
A.,
Coetzee,
G.
A.,
Risch,
A.,
Plass,
C.,
et
al.
(2011).
Principles
for
the
post-‐GWAS
functional
characterization
of
cancer
risk
loci.
Nature
Publishing
Group,
43(6),
513–518.
doi:10.1038/ng.840
Futschik,
A.,
&
Schlotterer,
C.
(2010).
Massively
Parallel
Sequencing
of
Pooled
DNA
Samples—
The
Next
Generation
of
Molecular
Markers.
Genetics.
Gauderman,
W.
(2002).
Sample
Size
Requirements
for
Association
Studies
of
Gene-‐Gene
Interaction.
American
Journal
of
Epidemiology.
Gelman,
A.,
Carlin,
J.
B.,
Stern,
H.
S.,
&
Rubin,
D.
B.
(2004).
Bayesian
Data
Analysis.
Chapman
&
Hall/CRC.
Genkin,
A.,
Lewis,
D.
D.,
&
Madigan,
D.
(2007).
Large-‐Scale
Bayesian
Logistic
Regression
for
Text
Categorization.
Technometrics,
49(3),
291–304.
doi:10.1198/004017007000000245
George,
E.
I.,
&
Foster,
D.
P.
(2000).
Calibration
and
empirical
Bayes
variable
selection.
Biometrika,
87(4),
731–747.
Hans,
C.
(2009).
Bayesian
lasso
regression.
Biometrika,
96(4),
835–845.
doi:10.1093/biomet/asp047
Hastie,
T.,
Tibshirani,
R.,
&
Friedman,
J.
H.
(2003).
The
Elements
of
Statistical
Learning
(Corrected.).
Springer.
Hirschhorn,
J.
N.,
Lohmueller,
K.,
Byrne,
E.,
&
Hirschhorn,
K.
(2002).
A
comprehensive
review
of
genetic
association
studies.
Genetics
in
Medicine,
4(2),
45–61.
Hoggart,
C.
J.,
Whittaker,
J.
C.,
De
Iorio,
M.,
&
Balding,
D.
J.
(2008).
Simultaneous
Analysis
of
All
SNPs
in
Genome-‐Wide
and
Re-‐Sequencing
Association
Studies.
(P.
M.
Visscher,
Ed.)PLoS
Genetics,
4(7),
e1000130.
doi:10.1371/journal.pgen.1000130.t004
Ioannidis,
J.,
&
Thomas,
G.
(2009).
Validating,
augmenting
and
refining
genome-‐wide
association
signals.
Nature
Reviews
Genetics.
Jensen,
L.
J.,
&
Bork,
P.
(2010).
Ontologies
in
Quantitative
Biology:
A
Basis
for
Comparison,
Integration,
and
Discovery.
PLoS
Biology,
8(5),
e1000374.
doi:10.1371/journal.pbio.1000374.g003
Karczewski,
K.
J.,
Daneshjou,
R.,
&
Altman,
R.
B.
(2012).
Chapter
7:
Pharmacogenomics.
(F.
Lewitter
&
M.
Kann,
Eds.)PLoS
Computational
Biology,
8(12),
e1002817.
doi:10.1371/journal.pcbi.1002817.s001
Kim,
S.
Y.,
Li,
Y.,
Guo,
Y.,
Li,
R.,
Holmkvist,
J.,
Hansen,
T.,
et
al.
(2010).
Design
of
association
studies
with
pooled
or
un-‐pooled
next-‐generation
sequencing
data.
Genetic
Epidemiology,
34(5),
479–491.
doi:10.1002/gepi.20501
Klein,
R.
J.
(2005).
Complement
Factor
H
Polymorphism
in
Age-‐Related
Macular
Degeneration.
Science,
308(5720),
385–389.
doi:10.1126/science.1109557
Koenker, R. (2005). Quantile Regression. Cambridge University Press.
Koenker,
R.,
&
Bassett,
G.,
Jr.
(1978).
Regression
quantiles.
Econometrica:
journal
of
the
Econometric
Society,
33–50.
Kozumi,
H.,
&
Kobayashi,
G.
(2011).
Gibbs
sampling
methods
for
Bayesian
quantile
regression.
Journal
of
Statistical
Computation
and
Simulation,
81(11),
1565–1578.
Lander,
E.
S.,
Linton,
L.
M.,
Birren,
B.,
Nusbaum,
C.,
Zody,
M.
C.,
Baldwin,
J.,
et
al.
(2001).
Initial
sequencing
and
analysis
of
the
human
genome.
Nature,
409(6822),
860–921.
Lee,
J.
S.,
Choi,
M.,
Yan,
X.,
Lifton,
R.
P.,
&
Zhao,
H.
(2011).
On
optimal
pooling
designs
to
identify
rare
variants
through
massive
resequencing.
Genetic
Epidemiology,
35(3),
139–147.
doi:10.1002/gepi.20561
Li,
B.,
&
Leal,
S.
M.
(2008).
Methods
for
detecting
associations
with
rare
variants
for
common
diseases:
application
to
analysis
of
sequence
data.
The
American
Journal
of
Human
Genetics,
83(3),
311–321.
Liu,
J.,
Lewinger,
J.
P.,
Gilliland,
F.
D.,
Gauderman,
W.
J.,
&
Conti,
D.
V.
(2013).
Confounding
and
Heterogeneity
in
Genetic
Association
Studies
with
Admixed
Populations.
American
Journal
of
Epidemiology.
doi:10.1093/aje/kws234
Lupton,
M.
K.,
Proitsi,
P.,
Danillidou,
M.,
Tsolaki,
M.,
Hamilton,
G.,
Wroe,
R.,
et
al.
(2011).
Deep
Sequencing
of
the
Nicastrin
Gene
in
Pooled
DNA,
the
Identification
of
Genetic
Variants
That
Affect
Risk
of
Alzheimer's
Disease.
(R.
Roberts,
Ed.)PLoS
ONE,
6(2),
e17298.
doi:10.1371/journal.pone.0017298.t003
Madsen,
B.
E.,
&
Browning,
S.
R.
(2009).
A
Groupwise
Association
Test
for
Rare
Mutations
Using
a
Weighted
Sum
Statistic.
(N.
J.
Schork,
Ed.)PLoS
Genetics,
5(2),
e1000384.
doi:10.1371/journal.pgen.1000384.t002
McCarthy,
M.
I.,
Abecasis,
G.
R.,
Cardon,
L.
R.,
Goldstein,
D.
B.,
Little,
J.,
Ioannidis,
J.
P.
A.,
&
Hirschhorn,
J.
N.
(2008).
Genome-‐wide
association
studies
for
complex
traits:
consensus,
uncertainty
and
challenges.
Nature
Reviews
Genetics,
9(5),
356–369.
doi:10.1038/nrg2344
Meier,
L.,
&
Van
De
Geer,
S.
(2008).
The
group
lasso
for
logistic
regression.
group.
Morris,
A.
P.,
&
Zeggini,
E.
(2009).
An
evaluation
of
statistical
approaches
to
rare
variant
analysis
in
genetic
association
studies.
Genetic
Epidemiology,
34(2),
188–193.
doi:10.1002/gepi.20450
Neale,
B.
M.,
Rivas,
M.
A.,
Voight,
B.
F.,
Altshuler,
D.,
Devlin,
B.,
Orho-‐Melander,
M.,
et
al.
(2011).
Testing
for
an
Unusual
Distribution
of
Rare
Variants.
(S.
M.
Leal,
Ed.)PLoS
Genetics,
7(3),
e1001322.
doi:10.1371/journal.pgen.1001322.t002
Newcomb,
P.
A.,
Baron,
J.,
Cotterchio,
M.,
Gallinger,
S.,
Grove,
J.,
Haile,
R.,
et
al.
(2007).
Colon
Cancer
Family
Registry:
An
International
Resource
for
Studies
of
the
Genetic
Epidemiology
of
Colon
Cancer.
Cancer
Epidemiology
Biomarkers
&
Prevention,
16(11),
2331–2343.
doi:10.1158/1055-‐9965.EPI-‐07-‐0648
Out,
A.
A.,
van
Minderhout,
I.
J.
H.
M.,
Goeman,
J.
J.,
Ariyurek,
Y.,
Ossowski,
S.,
Schneeberger,
K.,
et
al.
(2009).
Deep
sequencing
to
reveal
new
variants
in
pooled
DNA
samples.
Human
Mutation,
30(12),
1703–1712.
doi:10.1002/humu.21122
Park,
T.,
&
Casella,
G.
(2008).
The
Bayesian
Lasso.
Journal
of
the
American
Statistical
Association,
103(482),
681–686.
doi:10.1198/016214508000000337
Peters,
U.,
Hutter,
C.
M.,
Hsu,
L.,
Schumacher,
F.
R.,
Conti,
D.
V.,
Carlson,
C.
S.,
et
al.
(2011).
Meta-‐analysis
of
new
genome-‐wide
association
studies
of
colorectal
cancer
risk.
Human
Genetics,
131(2),
217–234.
doi:10.1007/s00439-‐011-‐1055-‐0
Peters,
U.,
Jiao,
S.,
Schumacher,
F.
R.,
Hutter,
C.
M.,
Aragaki,
A.
K.,
Baron,
J.
A.,
et
al.
(2013).
Identification
of
Genetic
Susceptibility
Loci
for
Colorectal
Tumors
in
a
Genome-‐Wide
Meta-‐
analysis.
Gastroenterology,
144(4),
799–807.e24.
doi:10.1053/j.gastro.2012.12.020
Prabhu,
S.,
&
Pe'er,
I.
(2009).
Overlapping
pools
for
high-‐throughput
targeted
resequencing.
Genome
Research,
19(7),
1254–1261.
doi:10.1101/gr.088559.108
Price,
A.
L.,
Kryukov,
G.
V.,
de
Bakker,
P.
I.
W.,
Purcell,
S.
M.,
Staples,
J.,
Wei,
L.-‐J.,
&
Sunyaev,
S.
R.
(2010).
Pooled
Association
Tests
for
Rare
Variants
in
Exon-‐Resequencing
Studies.
The
American
Journal
of
Human
Genetics,
86(6),
832–838.
doi:10.1016/j.ajhg.2010.04.005
Quintana,
M.
A.,
Berstein,
J.
L.,
Thomas,
D.
C.,
&
Conti,
D.
V.
(2011).
Incorporating
model
uncertainty
in
detecting
rare
variants:
the
Bayesian
risk
index.
Genetic
Epidemiology,
35(7),
638–649.
doi:10.1002/gepi.20613
Risch,
N.
J.
(2000).
Searching
for
genetic
determinants
in
the
new
millennium.
Nature,
405(6788),
847–856.
doi:10.1038/35015718
Shendure,
J.,
&
Ji,
H.
(2008).
Next-‐generation
DNA
sequencing.
Nature
Biotechnology,
26(10),
1135–1145.
doi:10.1038/nbt1486
Shental,
N.,
&
Amir,
A.
(2010).
Identification
of
rare
alleles
and
their
carriers
using
compressed
se(que)nsing.
Nucleic
Acids
Research.
Smith,
A.
M.,
Heisler,
L.
E.,
St
Onge,
R.
P.,
Farias-‐Hesson,
E.,
Wallace,
I.
M.,
Bodeau,
J.,
et
al.
(2010).
Highly-‐multiplexed
barcode
sequencing:
an
efficient
method
for
parallel
analysis
of
pooled
samples.
Nucleic
Acids
Research,
38(13),
e142–e142.
doi:10.1093/nar/gkq368
Sriram,
K.,
&
Ghosh,
P.
(2012).
Simultaneous
Bayesian
Estimation
of
Multiple
Quantiles
with
an
Extension
to
Hierarchical
Models.
IIM
Bangalore
Research
Paper,
(359).
Sriram,
K.,
Ramamoorthi,
R.,
&
Ghosh,
P.
(2011).
Posterior
Consistency
of
Bayesian
Quantile
Regression
Under
a
Mis-‐Specified
Likelihood
Based
on
Asymmetric
Laplace
Density.
IIM
Bangalore
Research
Paper,
(340).
Stephens,
M.,
&
Balding,
D.
J.
(2009).
Bayesian
statistical
methods
for
genetic
association
studies.
Nature
Reviews
Genetics.
The
International
HapMap
Consortium.
(2005).
A
haplotype
map
of
the
human
genome.
Nature,
437(7063),
1299–1320.
doi:10.1038/nature04226
Thomas,
P.
D.
(2003).
PANTHER:
a
browsable
database
of
gene
products
organized
by
biological
function,
using
curated
protein
family
and
subfamily
classification.
Nucleic
Acids
Research,
31(1),
334–341.
doi:10.1093/nar/gkg115
109
Tibshirani,
R.
(1996).
Regression
shrinkage
and
selection
via
the
lasso.
Journal
of
the
Royal
Statistical
Society.
Series
B
(Methodological),
267–288.
Tokdar,
S.,
&
Kadane,
J.
B.
(2011).
Simultaneous
linear
quantile
regression:
A
semiparametric
bayesian
approach.
Bayesian
Analysis,
6(4),
1–22.
Tsionas,
E.
G.
(2003).
Bayesian
quantile
inference.
Journal
of
Statistical
Computation
and
Simulation,
73(9),
659–674.
doi:10.1080/0094965031000064463
Turnbull,
C.,
Ahmed,
S.,
Morrison,
J.,
Pernet,
D.,
Renwick,
A.,
Maranian,
M.,
et
al.
(2010).
Genome-‐wide
association
study
identifies
five
new
breast
cancer
susceptibility
loci.
Nature
Publishing
Group,
42(6),
504–507.
doi:10.1038/ng.586
Wang,
K.,
Li,
M.,
&
Hakonarson,
H.
(2010).
Analysing
biological
pathways
in
genome-‐wide
association
studies.
Nature
Reviews
Genetics,
11(12),
843–854.
doi:10.1038/nrg2884
Wang,
T.,
Pradhan,
K.,
Ye,
K.,
Wong,
L.-‐J.,
&
Rohan,
T.
E.
(2011).
Estimating
allele
frequency
from
next-‐generation
sequencing
of
pooled
mitochondrial
DNA
samples.
Frontiers
in
genetics,
2,
51.
doi:10.3389/fgene.2011.00051
Willer,
C.
J.,
Sanna,
S.,
Jackson,
A.
U.,
Scuteri,
A.,
Bonnycastle,
L.
L.,
Clarke,
R.,
et
al.
(2008).
Newly
identified
loci
that
influence
lipid
concentrations
and
risk
of
coronary
artery
disease.
Nature
Publishing
Group,
40(2),
161–169.
doi:10.1038/ng.76
Williams,
P.
T.
(2012).
Quantile-‐Specific
Penetrance
of
Genes
Affecting
Lipoproteins,
Adiposity
and
Height.
PLoS
ONE,
7(1),
e28764.
doi:10.1371/journal.pone.0028764.t001
Wu,
T.
T.,
&
Lange,
K.
(2008).
Coordinate
descent
algorithms
for
lasso
penalized
regression.
Annals
of
Applied
Statistics,
2(1),
224–244.
doi:10.1214/07-‐AOAS147
Wu,
T.
T.,
Chen,
Y.
F.,
Hastie,
T.,
Sobel,
E.,
&
Lange,
K.
(2009).
Genome-‐wide
association
analysis
by
lasso
penalized
logistic
regression.
Bioinformatics,
25(6),
714–721.
doi:10.1093/bioinformatics/btp041
110
Yu,
K.,
&
Lu,
Z.
(2003).
Quantile
regression:
applications
and
current
research
areas
-‐
Yu
-‐
2003
-‐
Journal
of
the
Royal
Statistical
Society:
Series
D
(The
Statistician)
-‐
Wiley
Online
Library.
Journal
of
the
Royal
Statistical
Society:
….
Yu,
K.,
&
Moyeed,
R.
A.
(2001).
Bayesian
quantile
regression.
Statistics
&
Probability
Letters,
54(4),
437–447.
Yuan,
M.,
&
Lin,
Y.
(2005a).
Efficient
empirical
Bayes
variable
selection
and
estimation
in
linear
models.
Journal
of
the
American
Statistical
Association,
100(472),
1215–1225.
Yuan,
M.,
&
Lin,
Y.
(2005b).
Model
selection
and
estimation
in
regression
with
grouped
variables.
Journal
of
the
Royal
Statistical
Society:
Series
B
(Statistical
Methodology),
68(1),
49–67.
Zhao,
Y.,
&
Wang,
S.
(2009).
Optimal
DNA
Pooling-‐Based
Two-‐Stage
Designs
in
Case-‐Control
Association
Studies.
Human
Heredity,
67(1),
46–56.
Abstract
Genetic association studies aim to find the genetic variations that are associated with phenotypes (traits). In the last decade, genome-wide association studies (GWAS) have been performed in large populations for many common diseases and traits. A number of associated SNPs have been found, but those findings explain only a limited amount of the heritability of the phenotypes. We are now in a post-GWAS era, in which we seek better ways to explain GWAS results, to follow up with replication and validation studies, and to find novel genetic components that explain the phenotypes. ❧ In this dissertation, I focus on three aspects of genetic association studies: I propose a novel study design for genetic association studies using next-generation sequencing, investigate a hierarchical model for variable selection in GWAS analysis, and implement a quantile regression model to prioritize GWAS results. ❧ With its potential to discover a much greater amount of genetic variation, next-generation sequencing is fast becoming an essential tool for genetic association studies. However, the cost of sequencing all individuals in a large-scale population study remains high compared with most alternative genotyping options. Although the ability to identify individual-level data is lost (without bar-coding), sequencing pooled samples can substantially lower costs without compromising the power to detect significant associations. We propose a hierarchical Bayesian model that estimates the association of each variant using pools of cases and controls, accounting for variation in read depth across pools and for sequencing error. To investigate the performance of our method across a range of numbers of pools, numbers of individuals per pool, and average coverage, we undertook extensive simulations varying effect sizes, minor allele frequencies, and sequencing error rates.
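The pooled-sequencing setting described above can be illustrated with a small generative sketch. This is not the dissertation's hierarchical model or the hiPOD package; it is a generic simulation, under the common assumptions that a pool's true allele count is binomial in the population MAF, per-pool read depth is roughly Poisson around the target coverage, and a read shows the alternate allele either when it comes from an alternate chromosome and is read correctly or from a reference chromosome and is mis-read.

```python
import numpy as np

rng = np.random.default_rng(1)

def pooled_alt_reads(maf, n_chrom, depth, err, n_pools, rng):
    """Simulate alternate-allele read counts for n_pools pools.

    Each pool holds n_chrom chromosomes; the realized pool allele
    frequency is (alternate alleles in the pool) / n_chrom. A read
    reports the alternate allele with probability
    f*(1 - err) + (1 - f)*err, where f is the pool frequency.
    """
    alt_alleles = rng.binomial(n_chrom, maf, size=n_pools)
    pool_freq = alt_alleles / n_chrom
    p_alt_read = pool_freq * (1 - err) + (1 - pool_freq) * err
    depths = rng.poisson(depth, size=n_pools)  # read depth varies by pool
    reads = rng.binomial(depths, p_alt_read)
    return reads, depths

# e.g. 20 pools of 25 diploid individuals, ~200x coverage, 1% error
reads, depths = pooled_alt_reads(maf=0.05, n_chrom=50, depth=200,
                                 err=0.01, n_pools=20, rng=rng)
print(reads.sum() / depths.sum())  # crude pooled allele-frequency estimate
```

Varying `n_pools`, `n_chrom`, and `depth` in such a simulation is one way to see the abstract's design trade-off: the pooled estimate tightens much faster with more pools and larger pools than with deeper coverage per pool.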
In general, the number of pools and the pool size have dramatic effects on power, while the total depth of coverage per pool has only a moderate impact. This information can guide the selection of a study design that maximizes power subject to cost, sample size, or other laboratory constraints. We provide an R package (hiPOD: hierarchical Pooled Optimal Design) to find the optimal design, allowing the user to specify a cost function, cost and sample-size limitations, and distributions of effect size, minor allele frequency, and sequencing error rate. ❧ In genome-wide association studies (GWAS) of complex disease, the most widely used analysis tool has been a single-point, one-degree-of-freedom test of association, such as the Cochran-Armitage trend test. However, testing one SNP at a time does not fully exploit the potential of genome-wide association studies to identify multiple causal variants, a plausible scenario for many complex diseases. By adopting a Laplace prior distribution for the model parameters, and incorporating external information into the prior distribution for the shrinkage parameter of the Laplace distribution, we aimed to perform variable selection with the specified hierarchical model. We tried two approaches to estimating the hyper-parameters: empirical Bayes and fully Bayesian. The method works when the model size is relatively small, but fails when the model is over-parameterized. ❧ As mentioned above, for most complex diseases only a moderate amount of heritability has been explained by the SNPs declared genome-wide statistically significant. Rather than the resource-limited approach of increasing the sample size, one alternative is to find causal SNPs within the lower Manhattan, i.e., the SNPs that just failed to achieve genome-wide significance. Thus, to improve the efficiency of GWAS results, we propose a Bayesian hierarchical quantile regression model that incorporates external information with the aim of improving the ranking of causal SNPs.
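The Laplace prior underlying this variable-selection approach is tractable for Gibbs sampling because it is a scale mixture of normals (the Bayesian Lasso construction of Park and Casella, 2008): drawing a variance from an Exponential(rate = λ²/2) distribution and then a coefficient from a normal with that variance yields a marginal Laplace(λ) coefficient. The sketch below only verifies that identity numerically; λ here plays the role of the shrinkage parameter on which the dissertation places its informative prior.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0        # Laplace rate, i.e. the shrinkage parameter
n = 200_000

# Scale-mixture draw: beta_j | tau_j^2 ~ N(0, tau_j^2),
# tau_j^2 ~ Exponential(rate = lam^2 / 2)
tau2 = rng.exponential(scale=2.0 / lam**2, size=n)
beta = rng.normal(0.0, np.sqrt(tau2))

# Marginally beta ~ Laplace with density (lam/2) * exp(-lam*|beta|):
# sd = sqrt(2)/lam and E|beta| = 1/lam
print(beta.std(), np.sqrt(2) / lam)
print(np.abs(beta).mean(), 1.0 / lam)
```

The practical payoff of the mixture form is that, conditional on the latent variances, the coefficients are Gaussian, so standard conjugate updates apply inside a Gibbs sampler while the marginal prior still induces lasso-style shrinkage.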
The proposed model examines the extremes of the p-value distribution by adopting, as a prior distribution, a Normal-Exponential mixture representation of the asymmetric Laplace distribution, which enables us to build an efficient Gibbs sampler. Simulation results show that the proposed model improves the ranking of causal SNPs when the external information is informative (associated with the causality of a SNP) and does not decrease a causal SNP's ranking when the external information is non-informative. We compare this approach with several alternatives, including a filtering framework, and demonstrate that those approaches can worsen the ranking of causal SNPs when the external information is not informative.
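The Normal-Exponential mixture representation mentioned here is the standard one used for Bayesian quantile regression (due to Kozumi and Kobayashi): an asymmetric Laplace variate with target quantile p can be written as y = μ + θz + τ√z·u with z ~ Exp(1), u ~ N(0, 1), θ = (1 - 2p)/(p(1 - p)), and τ² = 2/(p(1 - p)), so that conditional on z the model is Gaussian and Gibbs updates are conjugate. The sketch below is a generic numerical check of that representation, not the dissertation's sampler; it confirms that the p-th quantile of y sits at μ.

```python
import numpy as np

rng = np.random.default_rng(42)
p, mu, n = 0.9, 3.0, 500_000   # target quantile, location, sample size

# Normal-Exponential mixture of the asymmetric Laplace distribution
theta = (1 - 2 * p) / (p * (1 - p))
tau = np.sqrt(2 / (p * (1 - p)))
z = rng.exponential(1.0, n)    # latent exponential scale
u = rng.normal(0.0, 1.0, n)    # latent Gaussian
y = mu + theta * z + tau * np.sqrt(z) * u

# If the representation is right, P(y <= mu) should be close to p
print(np.mean(y <= mu))
```

In a regression version, μ becomes x'β and the latent z's make the full conditional for β multivariate normal, which is what makes the Gibbs sampler in the abstract efficient.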
Asset Metadata
Creator
Liang, Wei
(author)
Core Title
Bayesian hierarchical models in genetic association studies
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
11/08/2013
Defense Date
07/30/2013
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Bayesian,genetic association study,hierarchical model,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Conti, David V. (committee chair), Thomas, Duncan C. (committee chair), Nuzhdin, Sergey V. (committee member), Wang, Kai (committee member)
Creator Email
neovaliane@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-344907
Unique identifier
UC11295271
Identifier
etd-LiangWei-2144.pdf (filename),usctheses-c3-344907 (legacy record id)
Legacy Identifier
etd-LiangWei-2144.pdf
Dmrecord
344907
Document Type
Dissertation
Rights
Liang, Wei
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA