BAYESIAN HIERARCHICAL MODELS IN GENETIC ASSOCIATION STUDIES

by

Wei Liang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY (Biostatistics)

December 2013

Copyright 2013 Wei Liang
To Yu, Jiang and Jennifer
Acknowledgements

First of all, I would like to express my deepest appreciation to my mentors, Dr. Duncan Thomas and Dr. David Conti. This dissertation would not have been possible without their strong support through the past five years. With their enthusiasm, I was introduced to the area of genetic statistics and Bayesian analysis. I would like to sincerely thank them for their great guidance, not only in biostatistics, but also in many aspects of my life and career.

I would like to extend my appreciation to all my committee members: Dr. Kai Wang and Dr. Sergey Nuzhdin. Their questions and suggestions have been very helpful in improving my dissertation. I would also like to thank the faculty of the Department of Biostatistics and Epidemiology, who taught me the fundamentals of biostatistics. Special thanks go to Dr. Graham Casey, Dr. Gary Chen and Christopher Edlund for their helpful discussions and support.

I am very grateful to all my supportive friends, both in the LA area and in many other places. They stood by me in good times and bad. Because of them, I never felt thousands of miles away from home.

Last but not least, I would like to express my gratitude to my parents, Guanglin Liang and Yongzhi Dai, and my sister Hong Liang. Their everlasting love is always the motivation for me to do better in my life.
Table of Contents

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
  1.1 GENETIC ASSOCIATION STUDIES
    1.1.1 Pre-GWAS era: candidate gene study
    1.1.2 GWAS era: the achievements and shortcomings
    1.1.3 Post-GWAS era: an overview of the topics in the dissertation
  1.2 USING BAYESIAN HIERARCHICAL MODELS TO INCORPORATE EXTERNAL INFORMATION
    1.2.1 Motivation: the development of biological ontologies
    1.2.2 Bayesian hierarchical models
  1.3 OVERVIEW
CHAPTER 2 A BAYESIAN HIERARCHICAL MODEL IN NEXT-GENERATION SEQUENCING WITH CASE-CONTROL POOLS
  2.1 INTRODUCTION
  2.2 METHOD: A BAYESIAN HIERARCHICAL MODEL FOR POOLED NGS DATA
  2.3 SIMULATION STUDIES
    2.3.1 Design of simulation studies
    2.3.2 Determining optimal study design for future studies
  2.4 RESULTS
    2.4.1 Validity and empirical power of simulated scenarios
    2.4.2 Construction of a prediction model
    2.4.3 Predicting power and determining optimal designs for future studies
  2.5 DISCUSSION
CHAPTER 3 BAYESIAN HIERARCHICAL LASSO
  3.1 INTRODUCTION
  3.2 ITERATIVE APPROACHES TO ESTIMATE BAYESIAN HIERARCHICAL LASSO
    3.2.1 Method
    3.2.2 Simulation studies
  3.3 MARGINAL MAP ESTIMATES OF HYPER-PARAMETERS
    3.3.1 Method
    3.3.2 Simulation results of the MCMC procedures (3.3.1.A-3.3.1.C)
  3.4 DISCUSSION
CHAPTER 4 BAYESIAN HIERARCHICAL QUANTILE MODEL TO PRIORITIZE GWAS RESULTS
  4.1 INTRODUCTION
  4.2 A BAYESIAN QUANTILE REGRESSION
  4.3 A BAYESIAN HIERARCHICAL MODEL WITH MULTIPLE QUANTILES
  4.4 SIMULATION STUDIES BASED ON A NORMAL APPROXIMATION
    4.4.1 Design of the simulation study
    4.4.2 Simulation Results
  4.5 SIMULATION STUDIES BASED ON MARGINAL GWAS ESTIMATION
    4.5.1 Design of the simulation study
  4.6 AN APPLIED STUDY ON THE COLORECTAL CANCER FAMILY REGISTRY
  4.7 DISCUSSIONS
CHAPTER 5 FUTURE RESEARCH IN RELATED AREAS
  5.1 VARIABLE SELECTION PROBLEM FOR A LARGE DATASET
  5.3 INTEGRATIVE ANALYSIS ON GENETIC ASSOCIATION STUDIES, ESPECIALLY WITH NEXT-GENERATION SEQUENCING
  5.4 PHARMACOGENOMICS AND PERSONALIZED MEDICINE
APPENDICES
  APPENDIX 1. VARIANT DETECTION: COMPARISON BETWEEN POOLING VS. INDIVIDUAL DESIGNS
BIBLIOGRAPHY
List of Tables

Table 3.1 Null Simulation Results
Table 3.2 Simulation Results with 1 covariate
Table 3.3 Simulation Results with 5 covariates
Table 4.1 2×2 table of causal variants with respect to biofeature
Table 4.2 A quantile regression on an empirical example
Table 4.3 Frequencies of 5 Biofeatures in 44010 Regions
Table 4.4 Mean Estimates β_d(τ_k) and 95% Interval
Table 4.5 Changes in rank for the known hits (genes/regions)
List of Figures

Figure 2.1 DAG of the hierarchical model, focusing on a single marker in a single pool
Figure 2.2 Summary of the validity of study designs of a variant with VAF=0.05
Figure 2.3 Comparison of empirical power in different scenarios and designs
Figure 2.4 General trends of empirical power with respect to the design parameters
Figure 2.5 Prediction of power in semi-simulation studies using 1000 Genomes data
Figure 2.6 Grid search for the optimal pooling design, for a study interested in a variant with VAF=0.03 and OR=2, assuming a fixed cost function, with constraints on cost and sample size
Figure 2.7 The optimal power of pooling designs for a specific study (VAF=0.03, OR=2), with the same sample size constraint, different cost functions but comparable cost constraints with individual designs
Figure 2.8 Power (y-axis) comparison of optimal pooling designs with various cost functions (dashed lines), versus individual designs (solid line), over a range of variant allele frequencies (x-axis)
Figure 3.1 The piecewise likelihood function
Figure 3.2 Iteration trace of the updated values using Newton's method (I=100, P=100)
Figure 3.3 Iteration trace of the updated values using Newton's method (I=100, P=500)
Figure 4.1 An illustration of the effect of biofeature on test statistics
Figure 4.2 Gain of number of causal SNPs in the top 20(100) ranked SNPs
Figure 4.3 Gain of number of causal SNPs in the top 20(100) ranked SNPs
Figure 4.4 Gain of number of causal SNPs in the top 20(100) ranked SNPs
Figure 4.5 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs
Figure 4.6 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs
Figure 4.7 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs
Figure 4.8 Gain of number of causal SNPs in the top 20 ranked SNPs
Figure 4.9 Gain of number of causal SNPs in the top 20 ranked SNPs
Figure 4.10 Gain of number of causal SNPs in the top 200 meta-analysis ranked regions
Abstract

Genetic association studies aim to find the genetic variations that are associated with phenotypes (traits). In the last decade, genome-wide association studies have been proposed and performed in large populations for many common diseases/traits. A number of associated SNPs have been found, but only a limited amount of heritability for these phenotypes is explained by those findings. We are now in the post-GWAS era, in which we are looking for better ways to explain the GWAS results, to follow up with replication/validation studies, and to find novel genetic components that explain the phenotypes.

In the scope of this dissertation, I focused on three different aspects of genetic association studies. I proposed a novel study design for genetic association studies using next-generation sequencing, investigated a hierarchical model for variable selection in GWAS analysis, and implemented a quantile regression model to prioritize GWAS results.

With its potential to discover a much greater amount of genetic variation, next-generation sequencing is fast becoming an important tool for genetic association studies. However, the cost of sequencing all individuals in a large-scale population study is still high in comparison to most alternative genotyping options. While the ability to identify individual-level data is lost (without bar-coding), sequencing pooled samples can substantially lower costs without compromising the power to detect significant associations. We propose a hierarchical Bayesian model that estimates the association of each variant using pools of cases and controls, accounting for the variation in read depth across pools and sequencing error. To investigate the performance of our method across a range of numbers of pools, numbers of individuals within each pool, and average coverages, we undertook extensive simulations varying effect sizes, minor allele frequencies, and sequencing error rates. In general, the number of pools and the pool size have dramatic effects on power, while the total depth of coverage per pool has only a moderate impact. This information can guide the selection of a study design that maximizes power subject to cost, sample size, or other laboratory constraints. We provide an R package (hiPOD: hierarchical Pooled Optimal Design) to find the optimal design, allowing the user to specify a cost function, cost and sample size limitations, and distributions of effect size, minor allele frequency, and sequencing error rate.

In genome-wide association studies (GWAS) of complex disease, the most commonly used analysis tool has been a single-point, one degree of freedom test of association, such as the Cochran-Armitage test. However, testing one SNP at a time does not fully exploit the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. By adapting a Laplace prior distribution for model parameters, and incorporating external information in the prior distribution for the shrinkage parameter of the Laplace distribution, we aimed to perform variable selection using the specified hierarchical model. We tried two approaches to estimate the hyper-parameters, empirical Bayes and fully Bayesian. The method works when the model size is relatively small, but fails when the model is over-parameterized.

As mentioned above, for most complex diseases, only a moderate amount of heritability has been explained by the SNPs declared genome-wide statistically significant. Rather than the resource-intensive approach of increasing the sample size, one alternative is to find causal SNPs within the lower Manhattan plot, that is, the SNPs that just failed to achieve genome-wide significance. Thus, to improve the efficiency of GWAS results, we propose a Bayesian hierarchical quantile regression model to incorporate external information with the aim of improving the ranking of causal SNPs. The proposed model examines the extremes of the p-value distribution by adapting a Normal-Exponential mixture representation of the asymmetric Laplace distribution as a prior distribution, which enables us to build an efficient Gibbs sampler. Simulation results show that the proposed model improves the ranking of causal SNPs if the external information is informative (associated with the causality of a SNP) and does not decrease the causal SNPs' ranking if the external information is non-informative. We compare this approach to several alternatives, including a filtering framework, and demonstrate that these approaches can worsen the ranking of causal SNPs if the external information is not informative.
Chapter 1 Introduction

1.1 Genetic association studies

For rare Mendelian diseases, a single allele usually determines the presence of disease, and a simple family-based study can be conducted to detect the causal allele. However, for most common diseases, the risk of being affected is usually influenced by multiple genetic and environmental factors. For such diseases, the presence of a high-risk allele may only increase the probability of disease by a small to moderate amount (Risch, 2000). To detect those genetic risk alleles in common and usually complex diseases, genetic association studies were proposed and performed in the last two decades (Cordell & Clayton, 2005). In this section, we provide an overview of the genetic association studies performed, the contributions made along the path, and the efforts needed in future studies.
1.1.1 Pre-GWAS era: candidate gene study

A genetic association study answers a specific question: is there an association between one or more genetic variations and disease susceptibility (or a quantitative trait)? In the scope of this dissertation, we focus on studies with dichotomous traits, using case-control designs. In these studies, the frequencies of genetic variants in cases are compared with those in controls, and evidence of association is obtained if the frequency difference is significant.

The first proposed genetic association studies were candidate gene/allele studies. In those studies, one or more candidate genes (or specific alleles) were hypothesized to be potential causal genes (variants). Genotypes of the candidate genes/alleles were obtained, and statistical analysis was performed to analyze their association with the risk of disease. In a review paper published in 2002, Hirschhorn summarized 268 genes that contain polymorphisms reported to be associated with 1 of the 133 common diseases or dichotomous traits investigated (J. N. Hirschhorn, Lohmueller, Byrne, & Hirschhorn, 2002).

Candidate gene studies constrain the investigation to a few pre-selected alleles, based on strong hypotheses; thus they are sometimes criticized for significant selection bias. On the other hand, because of the pre-selection, only a few tests are carried out, which usually preserves sufficient statistical power.
1.1.2 GWAS era: the achievements and shortcomings

With the rapid development of genotyping arrays, it became possible to genotype millions of SNPs at the same time. After the completion of the Human Genome Project and the International HapMap Project (Lander et al., 2001) (The International HapMap Consortium, 2005), due to the linkage disequilibrium structure of the genome, a subset of SNPs can be used as tags to represent most of the common allelic variation across the whole genome. The first genome-wide association study (GWAS) was published in 2005, in which 116,204 tag SNPs were genotyped, and two SNPs were found to be statistically significantly associated with age-related macular degeneration (Klein, 2005). Since then, about 6000 SNPs have been reported for associations with more than 130 traits as of January 2012 (http://www.genome.gov/gwastudies/).

GWAS is agnostic in the sense that it tests tag SNPs for common variants across the whole genome. However, due to the large number of tests performed, a very small p-value is required to declare significance after multiple-testing correction.

Although GWAS have achieved great success in identifying risk alleles, a few pitfalls/shortcomings remain.

First of all, population stratification acts as a confounding factor in GWAS, especially in admixed populations. In admixed populations, false positive or false negative association results are reported if both the disease distribution and the genotype distribution differ among the sub-groups of the population. Many methods have been proposed to correct for population stratification; e.g., Liu et al. investigated the effects of global ancestry and local ancestry, and proposed methodology to correct the potential confounding effects (Liu, Lewinger, Gilliland, Gauderman, & Conti, 2013).

Secondly, the published GWAS results usually overestimate the effect size, as a result of the "winner's curse" (Turnbull et al., 2010). This is because the effect is usually weak for a common variant; thus the first statistically significant study is almost certain to have overestimated the true effect of the variant being tested (J. N. Hirschhorn et al., 2002). The most common way to correct for the winner's curse is to increase the sample size, or to perform meta-analysis of previous studies.

Thirdly, only a small to moderate proportion of the hereditary variation has been explained by the variants discovered. Many hypotheses have been proposed to explain the missing heritability: other kinds of genetic variation than SNPs, e.g., copy number variation and deletion/insertion/inversion; gene-gene interactions and gene-environment interactions, especially with epigenetic factors; the combined effects of multiple rare variants; as well as many other possibilities.

Last but not least, most of the findings from GWAS are just surrogates for the causal variants. Ideally we would like to follow up the GWAS results to identify causal variants, which might lead to the elucidation of the biological mechanisms of the disease. But the best way to follow up a GWAS remains an open research question (Freedman et al., 2011) (Cantor, Lange, & Sinsheimer, 2010) (Ioannidis & Thomas, 2009).
1.1.3 Post-GWAS era: an overview of the topics in the dissertation

As described above, there are many pitfalls and shortcomings in genome-wide association studies, and a lot of work remains to be done. In this section, we focus on a few aspects of post-GWAS study design and analysis issues, as an overview of the questions we try to answer in this dissertation.

The first question we are interested in relates to next-generation sequencing (NGS) studies. As mentioned above, the hereditary variation missed by GWAS might be explained by rare variants across the population. With the rapid development of NGS technology, researchers are able to sequence the whole genome of subjects and test the association of rare variants with disease. However, the cost of NGS is still too high to perform an association study with a large sample size. Pooling has been one potential method to reduce cost without losing too much power. In chapter 2, we investigate a pooled case-control study design with NGS data, propose a novel method to analyze the data, and provide some advice on designing such a pooled study.

Secondly, most current GWAS data are analyzed marginally, i.e., the association of each SNP with the risk of disease is tested one by one. However, testing one SNP at a time does not utilize the full potential of GWAS to identify multiple causal SNPs. In chapter 3, we investigate a novel method to analyze multiple SNPs simultaneously.

Thirdly, it is of great interest to prioritize GWAS results and identify causal variants. In chapter 4, we propose a novel way to prioritize the results from a GWAS and increase the chance of identifying causal variants.
1.2 Using Bayesian hierarchical models to incorporate external information

In the previous section, we listed a few questions of interest in the scope of this dissertation. To answer these questions, we propose to incorporate external biological information using Bayesian hierarchical models.

1.2.1 Motivation: the development of biological ontologies

A biological ontology is a systematic description of specific biological attributes (Jensen & Bork, 2010). In the last decade, as biological data have exploded, accompanied by the rapid development of systems biology, many biological ontologies have been developed. One of the most widely used biological ontologies is the Gene Ontology (Ashburner et al., 2000), which provides a systematic description of gene products using standardized terms, organized in the structure of a directed acyclic graph (DAG). Since the purpose of a genetic association study is to find the genetic variations associated with the risk of a disease/trait, it is natural to incorporate the external information in biological ontologies into the analysis of genetic association studies. Many investigators use such information informally or implicitly when interpreting the results of a genetic association study. A Bayesian hierarchical model is a natural way to formally use such external information.
1.2.2 Bayesian hierarchical models

The usual approach to analyzing a genetic association study is to compute a p-value in a frequentist way. However, to incorporate external information, Bayesian methods provide a coherent way to analyze and interpret association study results in combination with external information (Stephens & Balding, 2009).

In a genetic association study, we are usually interested in multiple parameters, such as the marginal/joint effect sizes of all or part of the SNPs in a GWAS. These parameters are related to each other, based on the underlying structure of the problem. For example, in genetic association studies with a case-control design, the association of each SNP with disease susceptibility is represented by an odds ratio, as the parameter of interest. It is natural to model the data in a hierarchical way, in which the first-level parameters are given prior distributions in terms of hyper-parameters and external information. Such a hierarchical framework specifies clearly the underlying process linking the external information to the observed data, and computational strategies have been well developed to analyze such a hierarchical model.

Once the model is specified, we can apply multiple computational methods to estimate the hyper-parameters and model parameters. The two most commonly used approaches are empirical Bayes estimation and fully Bayesian inference. For simplicity, empirical Bayes uses the data to provide point estimates of the hyper-parameters. In a fully Bayesian analysis, we can obtain the complete joint posterior distribution of the hyper-parameters and model parameters. If it is possible to derive the conditional and marginal distributions analytically (e.g., using conjugacy among the hyper-parameter prior distribution, model parameter prior distribution, and likelihood), we can directly calculate the posterior distribution, or numerically draw simulations from the posterior distribution. But in most cases, it is not possible to analytically derive the conditional and marginal distributions, and we must rely on more complicated computational methods (e.g., Markov chain Monte Carlo) to draw simulations from the posterior distribution. For a more detailed introduction to Bayesian hierarchical models, see (Gelman, Carlin, Stern, & Rubin, 2004).
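To make the contrast between the two estimation strategies concrete, the following R sketch fits a toy normal-normal hierarchy (observed SNP effect estimates with known standard errors) by empirical Bayes. It is only an illustration with simulated data and invented names, not the model used in later chapters; the fully Bayesian alternative is indicated in a comment.

    ## Toy illustration (not from this dissertation): empirical Bayes vs. fully
    ## Bayesian estimation in a normal-normal hierarchy,
    ##   beta_hat_j ~ N(theta_j, s_j^2),  theta_j ~ N(mu, tau^2).
    set.seed(1)
    J <- 50
    s <- rep(0.2, J)                       # known standard errors
    theta <- rnorm(J, 0, 0.3)              # true (simulated) SNP effects
    beta_hat <- rnorm(J, theta, s)         # observed effect estimates

    ## Empirical Bayes: point-estimate (mu, tau) from the marginal likelihood
    ## beta_hat_j ~ N(mu, tau^2 + s_j^2), then shrink each estimate toward mu.
    nll <- function(par) {
      mu <- par[1]; tau <- exp(par[2])
      -sum(dnorm(beta_hat, mu, sqrt(tau^2 + s^2), log = TRUE))
    }
    fit_eb <- optim(c(0, log(0.1)), nll)
    mu_hat <- fit_eb$par[1]; tau_hat <- exp(fit_eb$par[2])
    w <- tau_hat^2 / (tau_hat^2 + s^2)            # shrinkage weights
    theta_eb <- w * beta_hat + (1 - w) * mu_hat   # EB posterior means

    ## A fully Bayesian analysis would instead put priors on (mu, tau) and
    ## sample the joint posterior (e.g., by Gibbs sampling or in WinBUGS),
    ## propagating hyper-parameter uncertainty into the estimates of theta_j.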
1.3 Overview

I will conclude this chapter with an overview of the following chapters.

In chapter 2, to answer the question of a pooled case-control design for genetic association studies using next-generation sequencing data, we proposed a Bayesian hierarchical model to estimate the association of each variant, accounting for the variation in read depth across pools and sequencing error. We undertook extensive simulations varying effect sizes, minor allele frequencies and sequencing error rates. In general, we found that the number of pools and the pool size have dramatic effects on power, while the total depth of coverage per pool has only a moderate impact. This information can guide the selection of a study design that maximizes power subject to cost, sample size, or other laboratory constraints. We developed an R package (hiPOD: hierarchical Pooled Optimal Design) to find the optimal design, allowing the user to specify a cost function, cost and sample size limitations, and distributions of effect size, minor allele frequency and sequencing error rate.

In chapter 3, by adapting a Laplace prior distribution for model parameters, and incorporating external information in the prior distribution for the shrinkage parameter of the Laplace distribution, we aimed to perform variable selection using the specified hierarchical model. We tried two approaches to estimate the hyper-parameters, empirical Bayes and fully Bayesian. The method works when the model size is relatively small, but fails when the model is over-parameterized.

In chapter 4, to improve the prioritization of GWAS results, we propose a Bayesian hierarchical quantile model to incorporate external information, with a focus on the ranking of causal SNPs. The proposed model simultaneously makes inferences on multiple quantiles. It also adapts a Normal-Exponential mixture representation of the asymmetric Laplace distribution, which is efficient for Gibbs sampling of the posterior distribution. In simulation results, we have demonstrated that the proposed model improves the ranking of causal SNPs if the external information is highly informative (associated with the causality of a SNP), and it does not decrease the causal SNP's ranking if the external information is non-informative.

In chapter 5, we propose some future extensions of this dissertation.
Chapter 2 A Bayesian Hierarchical Model in Next-Generation Sequencing with Case-Control Pools

2.1 Introduction
The last decade has seen a rapid growth of genome-wide association studies (GWAS). GWAS look for associations between genetic variation (mainly single nucleotide polymorphisms, SNPs) and phenotypes of interest in a hypothesis-free manner. These studies have been so popular that around 6000 SNP associations had been reported for more than 130 traits as of January 2012 (http://www.genome.gov/gwastudies/). However, only a small to moderate proportion of the hereditary variation has been explained by the variants discovered (Frazer, Murray, Schork, & Topol, 2009). To find the heritability unexplained by GWAS, investigators have begun to look beyond common genetic variants. Based on population genetics, deleterious variants are often under greater selection pressure, and they are more likely to be present at a lower frequency in the current population. There have already been a few published studies indicating the effects of rare variants, solely or in combination, in common diseases (B. Li & Leal, 2008; Morris & Zeggini, 2009; Neale et al., 2011; Quintana, Berstein, Thomas, & Conti, 2011).
To better study the effects of rare variants in association studies, technological advances, especially Next Generation Sequencing (NGS), have been rapidly developing. Despite its efficiency for generating sequencing reads, NGS is still relatively expensive to use in population-based association studies. The structure of NGS data is also completely different from current genotyping technologies, and the error rate of NGS is relatively higher than that of current genotyping technologies (Shendure & Ji, 2008). Aiming for uncompromised statistical power with much reduced cost, DNA pooling has become a promising design for association studies, especially applicable to NGS data (Lupton et al., 2011; Out et al., 2009; Prabhu & Pe'er, 2009; T. Wang, Pradhan, Ye, Wong, & Rohan, 2011).
Prabhu et al. constructed overlapping pools to tackle the problem of detecting the carriers of rare variants (Prabhu & Pe'er, 2009). Similar works along these lines are (Shental & Amir, 2010) and (Erlich et al., 2009). Price et al. also carried out tests for detecting rare variant associations from pools (Price et al., 2010). Druley et al. developed an efficient algorithm to accurately quantify rare variants from pooled genomic DNA using NGS (Druley et al., 2009). Futschik and Schlotterer investigated some statistical properties of methods for detection and frequency estimation of rare variants in NGS data of pooled samples (Futschik & Schlotterer, 2010). Lee et al. tried to find the optimal number of individuals in a DNA pool to detect an allele under a given coverage depth and detection threshold (Lee, Choi, Yan, Lifton, & Zhao, 2011). Focusing on association studies, Kim et al. (Kim et al., 2010) developed a likelihood ratio test for pooled or un-pooled designs, and investigated the design that achieved the optimal power. They elegantly considered the structure of the NGS data, especially the nature of the NGS error. However, there are still some essential parts missing for a thorough analysis of such a design: a clear criterion to discriminate SNPs from sequencing error, a thorough examination of the validity of the study design, robust estimation of SNP frequency and effect size, a realistic cost function reflecting the different costs of pool formation and sequencing, and corrections for multiple tests.
Here, we propose a Bayesian framework for addressing these issues. Capturing the exact NGS data structure, we show that our method yields an unbiased estimate not only of the variant frequency in a pool, but also of the effect size of a variant, enabling a robust test of the significance of the association. Using extensive simulations, we have described the constraints for valid study designs that yield the expected test statistic distribution under the null. Among valid designs, we show that our test statistic has a Normal distribution and that the scale parameter of the test statistic is well predicted by our design parameters. Based on these results, we developed a formula for power calculations incorporating all the design parameters and the assumed genetic model, and extended this approach to the GWAS setting. Using a realistic model for the dependence of NGS costs on the design parameters, we have developed a statistical tool for finding an optimal design subject to budget limitations, the variant allele frequency (VAF), the effect size of the SNP to be targeted, and the error rate of the NGS instrument. For researchers looking for a cost-efficient association study design, we hope our work will provide useful insights on the design and analysis of such studies in pooled samples with NGS technology.
2.2 Method: A Bayesian hierarchical model for pooled NGS data

In a non-overlapping pooled sequencing study, individuals are randomly allocated into pools within each stratum (e.g., case/control status). At each location we observe the total number of reads X_p and the number of variant reads V_p in pool p. We model V_p as a Binomial distribution with probability θ_p:
$$V_p \sim \mathrm{Bin}(X_p,\ \theta_p), \qquad X_p \sim \mathrm{Pois}(\lambda \gamma_p) \qquad (2.1)$$
where λ is the average depth of sequencing for each pool and γ_p ~ Gamma(α, 1/α) accounts for variation in DNA capturing efficacy; note that this implies that the average depth of sequencing per subject is λ/N_p. V_p/X_p may be used as an estimate of the variant frequency; however, the standard error of this estimate will be greatly underestimated, as it is driven by the coverage and not by the number of individuals within each pool. Since pool p constitutes DNA samples from N_p individuals, we let Z_p denote the number of chromosomes (out of the 2N_p possible chromosomes) that carry the variant allele in pool p. Thus, ρ_p denotes the variant allele frequency for the individuals in pool p, and Z_p is distributed as:
$$Z_p \sim \mathrm{Bin}(2N_p,\ \rho_p) \qquad (2.2)$$
Since each read is generated as an independent Bernoulli process, in the absence of sequencing error the probability of a read being called as a variant is Z_p/2N_p. To account for sequencing error we assume a simple and symmetric error structure in which ε = Pr(read=v | allele=r) = Pr(read=r | allele=v), where v denotes the variant allele and r denotes the reference allele. Thus, in the presence of sequencing error, θ_p is defined as:
$$\begin{aligned}
\theta_p &= \Pr(\mathrm{read}=v,\ \mathrm{allele}=v) + \Pr(\mathrm{read}=v,\ \mathrm{allele}=r) \\
&= \Pr(\mathrm{read}=v \mid \mathrm{allele}=v)\Pr(\mathrm{allele}=v) + \Pr(\mathrm{read}=v \mid \mathrm{allele}=r)\Pr(\mathrm{allele}=r) \\
&= (1-\varepsilon)\,\frac{Z_p}{2N_p} + \varepsilon\left(1 - \frac{Z_p}{2N_p}\right)
\end{aligned} \qquad (2.3)$$
To facilitate case-control comparisons across pools, we define the variable W_p as an indicator of the disease status of all individuals within pool p (W_p = 1 if pool p constitutes cases and W_p = 0 if pool p constitutes controls), and we use a logistic model to link the pool-level variant allele frequency with the case-control status of the pool:
$$\mathrm{logit}(\rho_p) = \beta_0 + \beta_1\,(W_p - \bar{W}) \qquad (2.4)$$
Here, β_0 is the logit of the variant frequency within the entire sample; β_1 is the log of the odds ratio between cases and controls used for association testing (e.g., with disease or determinants of the VAF); and W̄ is the average of W. For completeness, we tried different prior distributions on the hyperparameters, and found that the model is insensitive to the prior distribution, especially in valid study designs. Thus we put relatively non-informative priors on the hyperparameters, and fit the entire model using Markov chain Monte Carlo (MCMC) techniques as implemented in WinBUGS:
$$\beta_0 \sim N(0,\ 10), \qquad \beta_1 \sim N(0,\ 10), \qquad \varepsilon \sim U[0,\ 0.05] \qquad (2.5)$$
Overall, our hierarchical model is summarized in the DAG in Fig 2.1.

Figure 2.1 DAG of the hierarchical model, focusing on a single marker in a single pool
 
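To make the data-generating process of Eqs. (2.1)-(2.4) concrete, the following R sketch simulates one pooled case-control data set from the model. It is only our illustration (the actual analysis fits the model by MCMC in WinBUGS), the function and variable names are ours, and the Gamma shape α = 10 is an arbitrary choice since the text does not fix it.

    ## Sketch: simulate pooled NGS read counts from Eqs. (2.1)-(2.4).
    simulate_pools <- function(P, Np, lambda, beta0, beta1, eps, alpha = 10) {
      W     <- rep(c(1, 0), each = P / 2)                  # case (1) / control (0) pools
      rho   <- plogis(beta0 + beta1 * (W - mean(W)))       # Eq (2.4): pool-level VAF
      Z     <- rbinom(P, size = 2 * Np, prob = rho)        # Eq (2.2): variant chromosomes
      gam   <- rgamma(P, shape = alpha, scale = 1 / alpha) # capture-efficiency variation
      X     <- rpois(P, lambda * gam)                      # Eq (2.1): total reads per pool
      theta <- (1 - eps) * Z / (2 * Np) + eps * (1 - Z / (2 * Np))   # Eq (2.3)
      V     <- rbinom(P, size = X, prob = theta)           # Eq (2.1): variant reads
      data.frame(W = W, X = X, V = V)
    }

    ## Example: 40 pools of 25 individuals, 80x average pool coverage,
    ## overall VAF of about 0.05, OR = 1.6, 1% sequencing error.
    set.seed(2)
    dat <- simulate_pools(P = 40, Np = 25, lambda = 80,
                          beta0 = qlogis(0.05), beta1 = log(1.6), eps = 0.01)
    head(dat)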

 

 
2.3 Simulation studies

2.3.1 Design of simulation studies

To investigate the performance of our hierarchical model and to build a prediction model for power, we performed simulations over a variety of scenarios with different variant allele frequencies (VAF), effect sizes (OR), and sequencing error rates (ε), as well as a range of study design parameters including the number of pools (P, thus there are P/2 case pools and P/2 control pools), the number of individuals per pool (N_p), and the average coverage for each pool (λ).
Given specified values for the sample VAF, OR and ε, we simulated the number of variant-carrying chromosomes in each pool using Eqs. (2.2) and (2.4). Within each pool we simulated the total coverage X_p and the number of variant reads V_p following Eqs. (2.1) and (2.3).

The simulations used the following specified values: VAF = {0.002, 0.005, 0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.12, 0.15, 0.2}, OR = {1.0, 1.2, 1.6, 2.0}, and ε = {0.005, 0.01, 0.02}. We varied the design parameters with N_p = {1, 2, 5, 20, 40, 60, 80, 100, 150, 200}; P = {2, 4, 10, 40, 80, 120, 160, 200, 250, 300, 350, 400}; and λ = {10, 20, 40, 80, 160, 320, 640, 1280}. For each specified scenario, we simulated 1000 replicates and analyzed the resulting data with our hierarchical model using WinBUGS and a single chain of 1000 MCMC samples. We summarized the posterior distribution of β_0, β_1 and ε. Hypothesis testing was based on the test statistic $B = \hat{\beta}_1 / \hat{\sigma}_1$ for each replicate, where $\hat{\beta}_1$ and $\hat{\sigma}_1$ are the posterior mean and standard error of the effect size. Under the null hypothesis, we assume that B is distributed as a standard Normal. The empirical power was calculated as the proportion of replicates that showed statistical significance to reject the null hypothesis (p-value < 0.05), and the empirical type I error was the empirical power under the null scenarios (where the simulated OR = 1).
 
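As a small illustration of how the empirical power and type I error are tabulated from the per-replicate test statistics (our own sketch, assuming a two-sided test against the standard Normal):

    ## Sketch: empirical power / type I error from per-replicate test statistics.
    ## B is the vector of posterior-mean / posterior-SD ratios, one per replicate.
    empirical_power <- function(B, alpha = 0.05) {
      pvals <- 2 * pnorm(abs(B), lower.tail = FALSE)   # two-sided p-values vs. N(0, 1)
      mean(pvals < alpha)                              # proportion of rejections
    }
    ## Under a null scenario (OR = 1) this proportion is the empirical type I error;
    ## under an alternative scenario it is the empirical power.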

 
2.3.2 Determining optimal study design for future studies

2.3.2.A Using Simulations to Build a Prediction Model for Power

In valid designs, the test statistic should be distributed as a standard Normal under a null scenario. Thus, we used a Kolmogorov-Smirnov (KS) test to determine whether the test statistics obtained from the hierarchical model under a null scenario for a given study design follow a standard Normal distribution, which is a prerequisite for the study design to be valid. If the corresponding KS test was significant (p-value < 0.05), we removed the study design from the determination of the prediction model. Thus, in all the valid designs, the test statistic follows an asymptotically Normal distribution summarized by the location parameter of each valid design, denoted µ_d, with B_d ~ N(µ_d, 1). We empirically estimated the location parameter µ_d using $\bar{B}_d = R^{-1}\sum_{r=1}^{R} B_{dr}$, where R is the number of replicates for design d, and B_dr is the test statistic for design d, replicate r. Using these estimated location parameters, we then built a linear regression model to predict µ_d as a function of all the scenario parameters (VAF, OR and ε) and design parameters (N_p, P and λ). This regression model is then used to predict power for future study designs.
 
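The two steps above can be sketched in R as follows; this is only our illustration with placeholder data, and since the exact functional form of the regression is not stated here, a simple main-effects model stands in for it.

    ## Sketch of the validity screen and the prediction model (our illustration,
    ## not the hiPOD code; all data below are placeholders).
    ## (1) Validity screen: KS test of null-scenario test statistics against N(0, 1).
    set.seed(3)
    B_null <- rnorm(1000)                               # placeholder null test statistics
    is_valid <- ks.test(B_null, "pnorm")$p.value >= 0.05

    ## (2) Regression of the estimated location parameter on scenario and design
    ## parameters, over the valid designs.
    sims <- data.frame(Bbar   = rnorm(50, 1, 0.3),      # placeholder mean statistics
                       VAF    = runif(50, 0.01, 0.2),
                       OR     = runif(50, 1, 2),
                       eps    = runif(50, 0.005, 0.02),
                       Np     = sample(1:200, 50, replace = TRUE),
                       P      = sample(seq(40, 400, 40), 50, replace = TRUE),
                       lambda = sample(c(40, 80, 160), 50, replace = TRUE))
    fit <- lm(Bbar ~ VAF + OR + eps + Np + P + lambda, data = sims)
    mu_hat <- predict(fit, newdata = data.frame(VAF = 0.03, OR = 2, eps = 0.01,
                                                Np = 25, P = 40, lambda = 80))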
 

 
2.3.2.B Predicting Power for Future Studies

To predict power for any proposed future study, we first determined whether the study design results in valid inference using our hierarchical modeling analysis, and if deemed valid, we then predicted power using the estimated prediction model constructed from the analysis of the simulated results.

For the examples in this paper, we determined whether a proposed study was valid by calculating the probability that a variant will be detected in either the cases or the controls. For a specific design, we use a threshold for the number of variant reads, V_t, to determine whether a variant is detected. Given the pool size (N_p), average coverage per pool (λ) and number of pools (P), if there was no variant in the pool, the number of variant reads (V_0) would be approximately distributed as V_0 ~ Bin(λ, θ_0 = ε), where we use the average coverage λ as an approximation of the true coverage. If only one chromosome in the pool carries a variant, the number of variant reads (V_1) would be approximately distributed as V_1 ~ Bin(λ, θ_1 = (1-ε)/2N + ε(2N-1)/2N). By comparing these two situations, we calculate two corresponding probabilities: Pr(V_0 > V_t) and Pr(V_1 > V_t). To control the type I error with a Bonferroni correction for multiple tests at a significance level of 0.05, and to ensure adequate odds that a true variant is detected versus a false discovery, we chose the smallest V_t that satisfies both Pr(V_0 > V_t) < 0.1/P and Pr(V_1 > V_t) > 5 Pr(V_0 > V_t). This is a conservative comparison, since Pr(V_1 > V_t) < Pr(V_2 > V_t) < … < Pr(V_2N > V_t), and we are comparing the least distinctive positive scenario against the negative scenario.
Next, we used the threshold V_t to calculate the probability of variant detection. If we assume that we have an equal number of case and control pools, the probability of detecting a true variant in both the case pools and the control pools is:

$$\Pr(\mathrm{detection}) = \left\{1 - \left[1 - \Pr(V > V_t \mid N_p, \lambda, \varepsilon)\right]^{P/2}\right\}^2$$
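A minimal R sketch of the threshold selection and detection-probability calculation described above (our illustration; it uses the single-carrier probability Pr(V_1 > V_t), i.e., the conservative case discussed in the text):

    ## Sketch: choose the read threshold Vt and compute Pr(detection).
    detection_prob <- function(Np, lambda, P, eps) {
      theta0 <- eps                                                  # no variant in the pool
      theta1 <- (1 - eps) / (2 * Np) + eps * (2 * Np - 1) / (2 * Np) # one variant chromosome
      ## smallest Vt with Pr(V0 > Vt) < 0.1/P and Pr(V1 > Vt) > 5 Pr(V0 > Vt)
      for (Vt in 0:lambda) {
        p0 <- pbinom(Vt, size = lambda, prob = theta0, lower.tail = FALSE)
        p1 <- pbinom(Vt, size = lambda, prob = theta1, lower.tail = FALSE)
        if (p0 < 0.1 / P && p1 > 5 * p0) {
          ## detection in both case and control pools (conservative: single carrier)
          return((1 - (1 - p1)^(P / 2))^2)
        }
      }
      NA  # no threshold satisfies both conditions for this design
    }

    detection_prob(Np = 25, lambda = 80, P = 40, eps = 0.01)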
Valid study designs were those in which Pr(detection) > Pr_T, where Pr_T is determined from a comparison of the validity determination using the KS test to the predicted validity using the calculated Pr(detection) for all simulated scenarios.
If a novel study design was determined to be valid, then we predicted the expected non-centrality parameter using the fitted regression model. Let µ_d denote the predicted non-centrality parameter, P_d denote the number of pools in the proposed study design, and q_αd denote the one-sided α-level critical value for the test statistic: Pr[x > q_αd | x ~ N(0, 1)] = α. The expected power for the proposed study design is Λ_d = Pr[x > q_αd | x ~ N(µ_d, 1)]. Using the estimated non-centrality parameter $\hat{\mu}_d$ from the prediction model, we predicted the expected power as $\hat{\Lambda}_d$ = Pr[x > q_αd | x ~ N($\hat{\mu}_d$, 1)].

We obtained the 95% confidence interval for the non-centrality parameter µ_d, denoted $\hat{\mu}_{ld}$ and $\hat{\mu}_{ud}$, so that Pr(µ_d > $\hat{\mu}_{ud}$) = Pr(µ_d < $\hat{\mu}_{ld}$) = 0.025. The confidence interval for the predicted power was based on the confidence interval of the non-centrality parameter.
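The power prediction itself is a one-line Normal tail probability; a small R sketch (ours, with illustrative values) of the point prediction and its interval:

    ## Sketch: predicted power from the estimated non-centrality parameter.
    predict_power <- function(mu_hat, alpha = 0.05) {
      q_alpha <- qnorm(1 - alpha)                        # one-sided alpha-level critical value
      pnorm(q_alpha, mean = mu_hat, lower.tail = FALSE)  # Pr[x > q | x ~ N(mu_hat, 1)]
    }
    predict_power(2.2)            # point prediction (illustrative value)
    predict_power(c(1.8, 2.6))    # bounds from the 95% CI of the non-centrality parameter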
To test the prediction model, we also performed an empirically based simulation study using the 1000 Genomes exome alignment data (based on the exome alignment index file in ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/exome.alignment.index), including unrelated individuals in both pilot project 2 and pilot project 3. Focusing on a subset of representative SNPs on Chromosome 2 (as shown in Table 2), we extracted the sequencing reads and genotype of each individual, and simulated the disease status based on a logistic disease model. Based on the specific study design (i.e., number of pools P = {20, 40, 100, 200}, number of individuals per pool N_p = 1000/P, and average coverage per pool λ = {20, 50}), we randomly sampled affected/unaffected individuals to form case/control pools, and constructed the pool-level sequencing data by summing over a randomly sampled subset of the sequencing reads of each member of the pool. Finally, we applied our Bayesian hierarchical model to analyze the generated pooled association study data. For each scenario, we generated 1000 replicates and compared the empirical power to the power calculated from our prediction function.
2.3.2.C Finding an Optimal Study Design

Given a set of assumed scenario parameters (VAF, OR and ε), we used the following procedure to find an optimal study design (a small R sketch of this search follows the list):

1) Specify the range of interest for the design parameters (P, N_p and λ), with constraints on study cost (C_T) and sample size (N_T).
2) Use a generalized cost function:

   C_T = C_E + C_P P + C_X P λ

   where C_E is the cost per experiment, including the initiation of the study and sample collection; C_P is the cost per pool, such as sample preparation (and region capturing if focused on region-specific sequencing); and C_X is the cost per 1x coverage.
3) Specify a 120x120 grid covering values of P and N_p across the specified range of interest.
4) For each point in the grid, use the corresponding number of pools (P_g) and number of individuals per pool (N_pg) to calculate the maximum coverage λ_g consistent with the study cost constraint (C_T).
5) Check whether the design (P_g, N_g) satisfies all of the following conditions:
   a. P_g N_g ≤ N_T, to satisfy the sample size constraint;
   b. Calculate Pr(detection) for the specific design using the maximum coverage λ_g; designs with Pr(detection) ≥ Pr_T are deemed valid study designs;
   c. Using the estimated prediction model, calculate the predicted power for the study design.
6) Perform an exhaustive search over all valid grid points and select the optimal design as the study with the maximum power.
 
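The following R sketch mirrors the grid search in steps 1)-6). It is our illustration rather than the hiPOD implementation, and it reuses the illustrative detection_prob(), predict_power() and regression fit from the earlier sketches.

    ## Sketch of the optimal-design grid search (assumes the placeholder objects
    ## 'fit', detection_prob() and predict_power() defined in the sketches above).
    optimal_design <- function(CT, NT, CE, CP, CX, PrT, VAF, OR, eps,
                               P_range = seq(2, 400, 2), Np_range = 1:200) {
      best <- list(power = -Inf)
      for (P in P_range) {
        for (Np in Np_range) {
          if (P * Np > NT) next                          # 5a: sample-size constraint
          lambda <- (CT - CE - CP * P) / (CX * P)        # step 4: max coverage under cost
          if (lambda <= 0) next
          pd <- detection_prob(Np, floor(lambda), P, eps)
          if (is.na(pd) || pd < PrT) next                # 5b: validity constraint
          mu <- predict(fit, newdata = data.frame(VAF = VAF, OR = OR, eps = eps,
                                                  Np = Np, P = P, lambda = lambda))
          pw <- predict_power(mu)                        # 5c: predicted power
          if (pw > best$power)
            best <- list(P = P, Np = Np, lambda = lambda, power = pw)
        }
      }
      best                                               # step 6: maximum-power design
    }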
To help investigators find the optimal pooling design, an R package, hiPOD (hierarchical Pooled Optimal Design), is now available in the online CRAN repository.
2.4
 Results
 
2.4.1
 Validity
 and
 empirical
 power
 of
 simulated
 scenarios
 
In
 total,
 we
 simulated
 116,160
 scenarios
 with
 different
 study
 designs
 and
 disease
 models.
 Not
 all
 
the
 scenarios
 result
 in
 valid
 tests
 of
 association.
 Results
 from
 the
 KS
 tests
 for
 the
 uniform
 
distribution
 of
 type-­‐I
 error
 under
 the
 null
 scenarios
 demonstrated
 that,
 in
 general,
 study
 designs
 
with
 few
 pools,
 few
 individuals
 within
 each
 pool,
 or
 with
 low
 coverage
 tend
 to
 result
 in
 invalid
 
tests
 of
 association.
 This
 was
 especially
 true
 for
 scenarios
 dealing
 with
 rare
 variants
 and
 small
 
effect
 sizes.
 These
 trends
 were
 reflected
 in
 Fig
 2.2,
 showing
 the
 valid
 and
 invalid
 designs
 for
 a
 
variant
 with
 VAF
 =0.05
 and
 an
 average
 coverage
 of
 either
 40
 or
 160,
 respectively.
 The
 reason
 
large
 N
p

 designs
 are
 often
 invalid
 in
 Fig
 2.2
 is
 that
 the
 average
 depth
 of
 sequencing
 per
 subject
 is
 
very
 low
 for
 some
 of
 these.
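The validity check for a given design can be expressed as a one-sample Kolmogorov–Smirnov test of the null p-values against the uniform distribution; a minimal sketch in R (the p-value vector here is simulated only for illustration):

set.seed(1)
null_pvalues <- runif(1000)              # placeholder for p-values from the null replicates
ks <- ks.test(null_pvalues, "punif")     # test uniformity of the type-I error distribution
valid_design <- ks$p.value > 0.05        # rejection indicates an invalid test of association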
 

 

 

 

 

 

 

 

 

  22
 

 
Figure 2.2 Summary of the validity of study designs for a variant with VAF = 0.05 (panels A and B: average coverage of 40 and 160, respectively)
 
From
 all
 the
 simulated
 scenarios
 we
 selected
 a
 few
 representative
 examples
 to
 illustrate
 the
 
relationship
 between
 empirical
 power
 and
 the
 design
 parameters.
 In
Fig 2.3A and 2.3B,
 the
 variant
 
allele
 frequency
 is
 0.05
 with
 a
 moderate
 effect
 size
 (1.6).
 In
 such
 a
 setting,
 empirical
 power
 
increases
 as
 either
 coverage
 increases
 or
 more
 individuals
 are
 included
 in
 each
 pool.
The figure also displays
 the
 optimal
 power
 from
 a
 comparable
 individual
 sequencing
 study,
 which
 is
 
assumed
 to
 have
 no
 errors
 in
 genotyping
 calls
 due
 to
 high
 coverage,
 and
 have
 the
 same
 number
 
of
 individuals
 sequenced
 as
 the
 number
 of
 pools
 constructed
 in
 the
 pooling
 design
 (thus
 they
 
would
 require
 similar
 sequencing
 efforts).
 Such
 a
 separately
 constructed
 individual
 sequencing
 
analysis
 has
 an
 improved
 power
 over
 the
 single-­‐individual
 pool
 analysis,
 as
 it
 eliminates
 the
 
within-­‐pool
 uncertainties
 in
 the
 analysis.
 In
 this
 scenario,
 the
 pooling
 design
 has
 superior
 power
 
over
 individual
 sequencing,
 especially
 as
 the
 number
 of
 individuals
 in
 each
 pool
 increases
 or
 as
 
the
 coverage
 increases.
Fig 2.3C, 2.3D and 2.3E
 show
 a
 scenario
 where
 a
 rare
 variant
 (VAF
 =
 0.01)
 with
 
a
 large
 effect
 (OR=2.0)
 is
 tested.
 Similar
 results
 are
 obtained
 with
 increased
 empirical
 power
 as
 
the
 number
 of
 pools
 increases,
 the
 coverage
 increases
 or
 as
 the
 number
 of
 individuals
 in
 each
 
pool
 increases.
 In
 all
 the
 plots
 of
 Fig
 2.3,
 we
 include
 both
 valid
 and
 invalid
 designs
 to
 
demonstrate
 overall
 trends
 in
 power.
 
 However,
 while
 it
 is
 true
 that
 increasing
 the
 number
 of
 
individuals
 per
 pool
 increases
 power,
 if
 the
 average
 coverage
 per
 pool
 is
 held
 constant
 the
 study
 
design
 is
 more
 likely
 to
 yield
 an
 invalid
 test.
 

 

 

 

 

 

  24
 

 

 
Figure 2.3 Comparison of empirical power in different scenarios and designs (panels A–E)
 
In
 general,
 power
 increases
 most
 with
 number
 of
 pools,
 whereas
 number
 of
 individuals
 per
 pool
 
has
 a
 moderate
 effect
 and
 the
 coverage
 depth
 per
 pool
 has
 only
 a
 limited
 effect.
 These
 trends
 
are
 more
 clearly
 seen
 in
 contour
 plots
 of
 power
 with
 respect
 to
 the
 design
 parameters
 (Fig
 2.4).
 

 
Figure 2.4 General trends of empirical power with respect to the design parameters (panels A–F)

 
2.4.2
 Construction
 of
 a
 prediction
 model
 
To
 predict
 power
 for
 future
 study
 designs
 we
 removed
 all
 invalid
 designs
 and
 all
 null
 scenarios
 
from
 the
 simulated
 data.
 We
 confirmed
 that
 the
 test
 statistic
 was
 asymptotically
 distributed
 as
 
Normal, empirically summarizing the location parameter $\mu_d$ by its estimated value. We subsequently used this estimate to predict
 power
 and
 found
 that
 the
 predicted
 power
 compared
 very
 favorably
 to
 the
 empirical
 
power
 observed,
 with
 an
 R
2
 
=
 0.998.
 
 

 
In
 constructing
 prediction
 models
 for
 the
 non-­‐centrality
 parameter
 of
 the
 test
 statistics,
 we
 built
 
separate
 prediction
 models
 for
 rare
 variants
 (VAF
 
 <
 0.01),
 less
 common
 variants
 (0.01
 <
 VAF
 <
 
0.05),
 modest
 frequency
 variants
 (0.05
 <
 VAF
 <
 0.1),
 and
 common
 variants
 (VAF
 
 >
 0.1).
 We
 
transformed
 the
 response
 variable
 for
 better
 model
 prediction.
 We
 used
 a
 linear
 regression
 
model
 for
 the
 cube
 root
 of
 the
 non-­‐centrality
 parameter
 as
 a
 function
 of
 the
 variant
 allele
 
frequency,
 log
 odds
 ratio,
 sequencing
 error
 rate,
 number
 of
 pools,
 number
 of
 subjects
 per
 pool,
 
average
 depth
 of
 sequencing,
 and
 the
 various
 transformations
 and
 interactions
 amongst
 these
 
variables
 listed
 in
 Table
 1.

 

 In
 general,
 all
 the
 4
 models
 (for
 different
 strata
 of
 VAFs)
 predict
 the
 
non-­‐centrality
 parameter
 quite
 accurately
 (all
 with
 R
2
 
>
 0.97).
 We
 summarized
 the
 partial
 R
2

 for
 
each
 variable
 in
 Table
 1
 for
 the
 full
 model,
 bolding
 the
 two
 strongest
 predictors
 after
 ln(OR).
 
 
Across
 all
 the
 design
 parameters,
 the
 average
 sequencing
 coverage
 depth
 explains
 the
 most
 
variation
 in
 performance
 involving
 rare
 VAF’s
 (VAF
 0.002-­‐0.01),
 whereas
 the
 number
 of
 pools
 
explains
 the
 greatest
 variations
 involving
 intermediate
 and
 common
 VAF’s
 (VAF
 0.05-­‐0.2).
 The
 
predicted
 power
 is
 highly
 correlated
 (R
2
 
=
 0.97)
 with
 the
 empirical
 power.
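A schematic of this regression step in R, assuming a data frame sims of valid non-null simulation results with one row per scenario; the column names and the use of simple pairwise interactions are illustrative stand-ins for the exact variables and transformations listed in Table 1:

# Cube-root transform of the empirical non-centrality parameter improves linearity
sims$ncp_cuberoot <- sims$ncp_empirical^(1/3)

# One model per VAF stratum; main effects plus pairwise interactions as a stand-in
# for the transformations and interactions of Table 1
fit_rare <- lm(ncp_cuberoot ~ (vaf + log(or) + error_rate +
                               n_pools + n_per_pool + coverage)^2,
               data = subset(sims, vaf < 0.01))
summary(fit_rare)$r.squared   # the fitted models reached R^2 > 0.97 in our simulations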
 

  28
 

 
2.4.3
 Predicting
 power
 and
 determining
 optimal
 designs
 
for
 future
 studies
 
For
 future
 studies
 with
 a
 given
 design
 in
 terms
 of
 number
 of
 pools,
 number
 of
 individuals
 within
 
each
 pool,
 and
 coverage,
 we
 calculate
 the
 probability
 of
 variant
 detection
 and
 use
 this
 value
 to
 
decide
 if
 the
 proposed
 study
 is
 valid.
 To
 determine
 the
 variant
 probability
 cutoff,
 we
 use
 the
 
simulated
 data
 scenarios
 and
 results
 to
 construct
 a
 ROC
 curve
 comparing
 these
 calculated
 
probabilities
 to
 the
 validity
 determination
 from
 the
 KS
 test,
 using
 the
 KS
 test
 as
 the
 gold
 
standard.
 Overall,
 the
 area
 under
 the
 ROC
 curve
 was
 equal
 to
 0.89.
 From
 this
 curve
 we
 selected
 
a
 cutoff
 at
 0.9,
 yielding
 a
 sensitivity
 of
 0.91
 and
 specificity
 0.74.
 

 
We
 used
 the
 estimated
 prediction
 models
 to
 calculate
 power
 for
 any
 proposed
 future
 study
 
design.
 For
 each
 valid
 study
 design,
 power
 prediction
 involves
 2
 steps:
 first
 predict
 the
 non-­‐
centrality
 parameter
 for
 the
 test
 statistic,
 and
 then
 calculate
 power
 based
 on
 the
 predicted
 non-­‐
centrality
 parameter.
 In
 the
 simulation
 study
 based
 on
 1000
 Genomes
 data,
 across
 a
 set
 of
 
representative
 SNPs
 with
 different
 VAFs
 and
 a
 range
 of
 assumed
 ORs,
 we
 calculated
 the
 
probability
 of
 detection
 of
 each
 study
 design
 first,
 and
 define
 those
 with
 a
 probability
 of
 
detection
 larger
 than
 0.9
 as
 valid
 designs.
 In
 all
 the
 valid
 designs,
 we
 found
 that
 the
 predicted
 
power
 is
 in
 good
 concordance
 with
 the
 empirical
 power,
 although
 there
 is
 a
 general
 trend
 for
 
the
 prediction
 model
 to
 be
 conservative
 (shown
 in
 Fig
 2.5).
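The two prediction steps can be written compactly; a sketch assuming a fitted stratum-specific model fit as above and a normally distributed test statistic:

predict_power_from_design <- function(fit, design, alpha = 0.05) {
  # Step 1: predicted non-centrality parameter (the model was fit on the cube-root scale)
  ncp_hat <- predict(fit, newdata = design)^3
  # Step 2: power of a two-sided z-test with this non-centrality parameter
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(-z_crit - ncp_hat) + 1 - pnorm(z_crit - ncp_hat)
}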
 
 

 

  29
 
Figure
 2.5
 Prediction
 of
 power
 in
 semi-­‐simulation
 studies
 using
 1000
 Genomes
 data
 

 

 

 
As
 an
 example
 of
 finding
 the
 optimal
 design,
 suppose
 we
 have
 the
 following
 values
 for
 our
 cost
 
function:
 C
E

 =
 $18,915,
 C
P

 =
 $970,
 and
 C
X

 =
 $300
 and
 a
 total
 budget
 of
 C
T

 =
 $452,915
 [personal
 
communication,
 Dave
 Duggan
 from
 TGen].
 We
 would
 like
 to
 know
 the
 optimal
 pooling
 study
 
design
 for
 a
 maximum
 available
 sample
 of
 5,000
 and
 an
 error
 rate
 ε
 =
 0.01.
 
 The
 results
 are
 
shown
 in
 Fig
 2.6
 for
 VAF
 =
 0.03
 and
 OR
 =
 2.0.
 In
 the
 contour
 plot,
 we
 see
 that
 the
 power
 
increases
 with
 number
 of
 pools
 and
 number
 of
 individuals
 per
 pool,
 under
 the
 constraints
 of
 the
 
given
 costs
 and
 sample
 size.
 The
 overlap
 of
 the
 red
 region
 (valid
 study
 designs)
 and
 blue
 region
 
(study
 designs
 satisfying
 the
 cost
 constraints)
 define
 the
 region
 of
 interest.
 In
 this
 example,
 the
 
optimal
 power
 is
 0.42,
 achieved
 with
 51
 pools
 with
 62
 individuals
 within
 each
 pool,
 and
 an
 
average
 coverage
 of
 25.
 The
 top
 10
 choices
 of
 optimal
 designs
 are
 shown
 in
 Appendix
 Table
 1.
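The reported optimum can be checked directly against the constraints with a small arithmetic verification of the cost function above:

C_E <- 18915; C_P <- 970; C_X <- 300           # cost inputs
P <- 51; N_p <- 62; coverage <- 25             # optimal pooling design from Fig 2.6
C_E + C_P * P + C_X * P * coverage             # 450,885, within the $452,915 budget
P * N_p                                        # 3,162 individuals, within the 5,000 available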
 
 

 
Figure
 2.6
 Grid
 search
 for
 the
 optimal
 pooling
 design,
 for
 a
 study
 interested
 in
 a
 variant
 with
 
VAF=0.03
 and
 OR=2,
 assuming
 a
 fixed
 cost
 function,
 with
 constraints
 on
 cost
 and
 sample
 size
 

 

 

 
Of
 course,
 a
 pivotal
 question
 facing
 researchers
 is
 when
 to
 use
 a
 pooling
 study
 design
 and
 when
 
to
 use
 individual
 level
 sequencing.
 
 To
 compare
 these
 study
 designs
 we
 use
 a
 similar
 cost
 
function,
 substituting
 the
 number
 of
 individuals
 for
 the
 number
 of
 pools
 and
 assuming
 that
 4x
 
coverage
 is
 sufficient
 to
 generate
 genotyping
 calls
 with
 no
 errors.
 With
 a
 total
 available
 budget
 

  31
 
of
 $452,915
 we
 can
 individually
 sequence
 200
 individuals.
 For
 a
 SNP
 with
 a
 VAF
 =
 0.03,
 and
 an
 
OR
 =
 2.0,
 the
 expected
 power
 is
 0.40
 (calculated
 using
 Quanto
 (Gauderman,
 2002)).
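Under the same cost function, the comparable individual-level design exhausts the budget exactly; a quick check:

C_E <- 18915; C_P <- 970; C_X <- 300
n_individuals <- 200; coverage <- 4
C_E + n_individuals * (C_P + C_X * coverage)   # = 452,915, the total budget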
 
 
More
 generally,
 to
 compare
 a
 pooling
 design
 vs.
 individual-­‐level
 sequencing,
 we
 investigated
 
power
 over
 a
 range
 of
 different
 cost
 functions.
 
 In
 such
 a
 comparison,
 it
 is
 more
 informative
 to
 
focus
 on
 the
 relative
 value
 of
 sample
 preparation
 cost
 versus
 sequencing
 cost
 (C
P
/C
X
),
 since
 the
 
experiment-­‐wise
 cost
 is
 fixed
 (C
E
).
 As
 before,
 we
 focus
 on
 the
 scenario
 with
 a
 VAF
 =
 0.03,
 an
 OR
 
=
 2.0,
 and
 an
 error
 rate
 ε
 =
 0.01.
 For
 the
 cost
 function,
 we
 held
 constant
 C
E

 =
 $10,000
 and
 C
P

 =
 
$1,000
 and
 varied
 the
 value
 of
 C
X

 from
 $50
 to
$500.
 For
 each
 value
 of
 C
X
,
 we
 calculated
 the
 
corresponding
 cost
 resulting
 from
 an
 individual-­‐level
 sequencing
 design
 (assuming
 200
 
individuals
 with
 4X
 coverage),
 and
 then
 used
 this
 to
 constrain
 the
 search
 for
 an
 optimal
 pooling
 
study
 design.
 The
 power
 for
 an
 individual
 design
 is
 constant
 at
 0.40.
 However,
 the
 power
 for
 a
 
pooled
 design
 varies
 with
 cost.
 Fig
 2.7
 demonstrates
 that
 as
 C
X

 drops,
 there
 is
 a
 gain
 in
 power
 for
 
an
 optimal
 pooled
 design,
 with
 all
 pooled
 designs
 having
 greater
 power
 than
 the
 individual
 
sequence
 design.
 

 

 

 

 

 

 

 

 

  32
 
Figure
 2.7
 The
 optimal
 power
 of
 pooling
 designs
 for
 a
 specific
 study
 (VAF=0.03,
 OR=2),
 with
 
the
 same
 sample
 size
 constraint,
 different
 cost
 functions
 but
 comparable
 cost
 constraints
 with
 
individual
 designs
 

 

 

 

 

 
A
 more
 general
 set
 of
 comparisons
 is
 made
 over
 a
 grid
 of
 potential
 VAFs
 and
 ORs,
 with
 varying
 
values
 of
 sequencing
 cost
 C
X
.
 As
 shown
 in
 Fig
 2.8,
 pooling
 designs
 provides
 improved
 power
 over
 
individual
 designs
 in
 most
 of
 the
 scenarios,
 except
 when
 VAF
 is
 very
 small
 (VAF
 =
 0.005).
 

 

  33
 

 
Figure
 2.8
 Power
 (y-­‐axis)
 comparison
 of
 optimal
 pooling
 designs
 with
 various
 cost
 function
 
(dashed
 lines),
 versus
 individual
 designs
 (solid
 line),
 over
 a
 range
 of
 variant
 allele
 frequencies
 
(x-­‐axis)
 

 

 
However,
 our
 assumption
 that
 there
 was
 no
 effect
 of
 sequencing
 errors
 in
 individual
 designs
 
might
 not
 be
 true
 when
 VAF
 is
 as
 small
 as
 0.005.
 

 
We
 also
 examined
 this
 comparison
 in
 a
 GWAS
 setting,
 where
 the
 significance
 level
 is
 set
 to
 be
 
5×10
-­‐8
.
 In
 such
 a
 setting,
 pooling
 was
 superior
 to
 individual
 sequencing
 in
 all
 the
 scenarios
 we
 
investigated,
 especially
 when
 the
 sequencing
 cost
 is
 low
 relative
 to
 other
 costs.
 For
 example,
 

  34
 
assuming
 sequencing
 cost
 (C
X
)
 decreases
 to
 $50
 while
 the
 other
 costs
 are
 constant
 (C
E

 =
 $10,000
 
and
 C
P

 =
 $1,000),
 for
 a
 variant
 with
 an
 odds
 ratio
 of
 1.2
 and
 allele
 frequency
 0.2,
 the
 power
 from
 
individual
 design
 is
 close
 to
 0,
whereas
 the
 optimal
 pooling
 design
 yields
 power
 of
 0.15;
 for
 a
 
variant
 with
 an
 odds
 ratio
 of
 1.6
 and
 allele
 frequency
 0.1,
 the
 power
 from
 individual
 design
 is
 
0.32,
whereas
 the
 optimal
 pooling
 design
 yields
 power
 of
 0.9.
 

 
2.5
 Discussion
 
We
 have
 shown
 that
 pooling
 designs
 for
 genetic
 association
 studies
 using
 next-­‐generation
 
sequencing
 can
 be
 a
 viable
 alternative
 to
 designs
 with
 individual
 level
 sequencing.
 However,
 
careful
 consideration
 of
 the
 design
 parameters
 is
 needed
 to
 understand
 the
 tradeoffs
 between
 
the
 number
 of
 pools,
 the
 number
 of
 individuals
 within
 each
 pool,
 and
 the
 total
 coverage
 per
 
pool.
 These
 tradeoffs
 must
 also
 be
 understood
 in
 the
 context
 of
 the
 goals
 of
 the
 study
 (e.g.
 
discovery
 vs.
 association
 testing;
 rare
 variants
 vs.
 common
 variants)
 and
 limitations
 of
 budget
 
and
 sample
 size.
 

 
For
 testing
 association,
 an
 increase
 in
 the
 number
 of
 pools
 for
 case-­‐control
 comparisons
 can
 
increase
 power.
 In
 addition,
 increasing
 the
 number
 of
 individuals
 within
 each
 pool
 can
 result
 in
 a
 
corresponding
 increase
 in
 power
 due
 to
 a
 decrease
 in
 the
 variance
 for
 the
 estimated
 allele
 
frequency
 for
 each
 pool.
 However,
 our
 simulation
 study
 demonstrates
 that
 the
 maximum
 power
 
is
 not
 simply
 a
 matter
 of
 increasing
 both
 of
 these
 factors
 until
 the
 budget
 is
 exhausted
 and
 all
 
individuals
 are
 included
 in
 pools.
 While
 increasing
 the
 number
 of
 individuals
 within
 a
 pool
 
increases
 the
 number
 of
 potential
 chromosomes
 to
 be
 sequenced,
 to
 ensure
 that
 there
 is
 a
 high
 

  35
 
probability
 that
 all
 chromosomes
 within
 a
 pool
 are
 read
 there
 must
 be
 sufficient
 total
 coverage.
 
 
Moreover,
 power
 for
 testing
 association
 with
 rare
 variants
 will
 be
 impacted
 more
 substantially
 
with
 low
 average
 coverage
 per
 chromosome.
 Thus,
 under
 a
 limited
 budget
 and
 in
 certain
 
circumstances,
 there
 are
 advantages
 to
 decreasing
 the
 number
 of
 pools
 while
 increasing
 the
 
coverage
 for
 each
 pool.
 
 Likewise,
 increasing
 the
 number
 of
 individuals
 within
 a
 pool
 may
 be
 
counterproductive
 if
 limits
 in
 coverage
 have
 already
 been
 reached.
 In
 the
 extreme,
 this
 interplay
 
not
 only
 impacts
 power,
 but
 also
 the
 validity
 for
 inference.
 Too
 few
 pools,
 too
 few
 individuals
 or
 
inadequate
 coverage
 can
 result
 in
 unstable
 estimates
 and
 invalid
 tests
 of
 association.
 
 

 
Future
 studies
 need
 more
 than
 broad
 conclusions
 for
 guidance.
 
 To
 accomplish
 this
 we
 
developed
 a
 procedure
 for
 determining
 the
 optimal
 study
 design
 given
 total
 available
 budget
 
and
 sample
 size,
 as
 well
 as
 the
 sequence
 error
 rate,
 VAF
 and
 odds
 ratio
 for
 the
 targeted
 causal
 
SNP.
 
 In
 searching
 for
 the
 optimal
 study
 design,
 there
 are
 two
 key
 steps
 in
 the
 procedure.
 First,
 a
 
proposed
 study
 design
 is
 determined
 to
 be
 valid
 if
 the
 design
 has
 a
 probability
 of
 detection
 
greater
 than
 0.9.
 As
 with
 power,
 too
 few
 pools,
 too
 few
 individuals
 or
 inadequate
 coverage
 
results
 in
 lower
 probabilities
 of
 detection.
 However,
 since
 the
 impact
 of
 these
 design
 
parameters
 is
 complicated,
 the
 ability
 to
 calculate
 probabilities
 of
 detection
 can
 be
 useful
 if
 the
 
goal
 of
 a
 particular
 study
 is
 discovery
 and
 not
 association.
 
 As
 highlighted
 in
 the
 Appendix
 1
 and
 
corresponding
 figures
 (Appendix
 Fig
 1,
 Appendix
 Fig
 2
 and
 Appendix
 Table
 2),
 pooling
 designs
 
outperform
 individual-­‐level
 sequencing
 for
 discovery
 of
 rare
 variants
 even
 when
 accounting
 for
 
sequencing
 error.
 
 The
 second
 step
 in
 our
 power
 estimation
 procedure
 uses
 a
 regression
 model
 
for
 predicting
 the
 expected
 non-­‐centrality
 parameters
 and,
 in
 turn,
 the
 expected
 power.
 
 The
 
results
 from
 over
 116,000
 simulation
 scenarios
 were
 used
 to
 construct
 the
 prediction
 model.
 
 
Overall,
 the
 prediction
 model
 does
 very
 well
 in
 predicting
 the
 empirical
 power
 for
 the
 range
 of
 

  36
 
scenarios
 given
 by
 the
 simulations.
 This
 approach
 was
 used
 since
 an
 exact
 calculation
 of
 the
 
non-­‐centrality
 parameter
 is
 infeasible
 due
 to
 the
 complexity
 of
 our
 multi-­‐level
 model
 and
 the
 
Bayesian
 framework
 in
 which
 our
 estimation
 is
 done,
 and
 the
 prediction
 model
 allows
 for
 
flexibility
 in
 predicting
 power
 for
 novel
 study
 designs.
 However,
 there
 will
 certainly
 be
 an
 
increase
 in
 the
 uncertainty
 when
 predicting
 power
 for
 study
 designs
 not
 within
 the
 simulated
 
ranges
 of
 design
 parameters.
 For
 example,
 prediction
 of
 a
 pooled
 GWAS
 analyses
 using
 a
 α-­‐level
 
of
 5×10
-­‐8

 will
 have
 increased
 uncertainty
 since
 very
 few
 simulated
 designs
 resulted
 in
 test-­‐
statistics
 that
 reach
 such
 significance.
 

 
The
 ability
 to
 predict
 power
 for
 a
 novel
 pooled
 study
 design
 facilitates
 comparison
 to
 individual-­‐
level
 sequencing
 designs.
 
 Determination
 of
 which
 design
 is
 most
 powerful
 relies
 heavily
 on
 the
 
cost
 function.
 Here,
 we
 used
 a
 simple
 cost
 function
 that
 is
 driven
 by
 a
 sample
 cost
 (e.g.
 cost
 per
 
pool
 (or
 individual))
 and
 a
 sequencing
 cost
 (e.g.
 cost
 per
 unit
 coverage
 per
 pool
 (or
 individual)).
 
Under
 this
 cost
 function,
 the
 relative
 value
 of
 sample
 cost
 versus
 sequencing
 cost
 (C
P
/C
X
)
 is
 most
 
informative
 for
 determining
 which
 design
 option
 is
 best.
 Our
 results
 indicate
 that
 even
 as
 
sequencing
 costs
 drop,
 pooled
 designs
 offer
 potentially
 more
 power
 in
 many
 situations
 in
 
comparison
 to
 cost-­‐equivalent
 individual
 designs.
 
 This
 is
 mainly
 driven
 by
 the
 sample
 costs
 and
 
highlights
 how
 the
 goals
 of
 the
 study
 can
 impact
 design
 choice.
 
 For
 targeted
 sequencing
 studies,
 
sample
 costs
 are
 often
 high
 relative
 to
 sequencing
 costs
 due
 to
 library
 formation.
 
 In
 such
 
studies
 pooling
 may
 be
 the
 best
 design.
 
 In
 contrast,
 whole-­‐genome
 sequencing
 studies
 
currently
 have
 relatively
 low
 sample
 costs
 with
 much
 higher
 sequencing
 costs.
 
 For
 these
 studies,
 
individual-­‐level
 designs
 may
 be
 more
 appropriate,
 especially
 for
 rare
 variants.
 
 To
 present
 
comparisons
 that
 would
 highlight
 trends,
 we
 assumed
 that
 4x
 coverage
 was
 sufficient
 to
 obtain
 
genotypes
 without
 error
 from
 individual-­‐level
 sequencing.
 We
 ignored
 sequencing
 error
 and
 

  37
 
variation
 in
 coverage
 along
 the
 genome,
 as
 well
 as
 techniques
 that
 can
 improve
 genotype
 calling
 
from
 sequencing
 data
 such
 as
 imputation.
 Despite
 these
 assumptions
 we
 believe
 that
 our
 
approach
 provides
 a
 fair
 comparison
 between
 the
 design
 options
 and
 clearly
 presents
 the
 
influencing
 factors.
 

 
Finally,
 for
 testing
 the
 marginal
 association
 of
 a
 variant,
 our
 novel
 hierarchical
 modeling
 
approach
 accounts
 for
 the
 number
 of
 pools,
 the
 number
 of
 individuals
 within
 each
 pool,
 the
 
coverage
 and
 the
 sequencing
 error.
 
 We
 have
 presented
 the
 model
 to
 test
 association
 between
 
cases
 and
 controls,
 but
 the
 general
 framework
 is
 applicable
 to
 the
 comparison
 of
 any
 categorical
 
variable
 that
 can
 be
 conditioned
 on
 to
 construct
 distinct
 pools,
 such
 as
 tumor
 vs.
 normal.
 
 
Additionally,
 one
 can
 construct
 pools
 by
 conditioning
 on
 more
 than
 one
 variable
 and
 then
 obtain
 
marginal
 estimates
 of
 association
 for
 any
 single
 variable
 by
 simply
 including
 them
 as
 covariates
 
in
 the
 association
 model
 at
 the
 pool-­‐level.
 
 Such
 an
 approach
 may
 be
 sufficient
 for
 the
 
adjustment
 of
 potential
 confounding
 due
 to
 population
 stratification
 by
 constructing
 case
 and
 
control
 pools
 where
 each
 pool
 consist
 of
 individuals
 with
 similar
 values
 for
 key
 ancestry
 
components.
 
 Sequencing
 studies
 have
 the
 potential
 for
 substantial
 impact
 in
 the
 discovery
 and
 
analysis
 of
 rare
 variants.
 Due
 to
 sparse
 data,
 most
 analysis
 approaches
 for
 rare
 variants
 have
 
relied
 on
 a
 burden
 test
 where
 the
 number
 of
 rare
 variants
 for
 each
 individual
 is
 summed
 and
 a
 
resulting
 overall
 risk
 index
 is
 tested
 for
 association
 (Madsen
 &
 Browning,
 2009).
 In
 its
 current
 
form,
 our
 approach
 focuses
 only
 on
 testing
 the
 marginal
 association
 for
 each
 variant.
 
 However,
 
one
 could
 potentially
 extend
 our
 approach
 to
 include
 a
 test
 of
 the
 difference
 between
 the
 
scaled
 sum
 of
 the
 variant
 frequencies
 in
 case
 pools
 vs.
 control
 pools.
 Further
 research
 is
 
certainly
 warranted
 in
 this
 area.
 
 

  38
 
For
 many
 studies,
 there
 simply
 is
 no
 substitute
 for
 individual-­‐level
 sequencing
 as
 it
 offers
 
flexibility
 to
 test
 multiple
 hypotheses
 (e.g.
 tests
 of
 interaction)
 or
 allows
 for
 association
 testing
 
when
 individuals
 are
 correlated
 such
 as
 in
 family
 studies.
 As
 sequencing
 costs
 continue
 to
 drop
 
the
 ability
 to
 obtain
 individual-­‐level
 data
 may
 be
 too
 enticing
 to
 warrant
 pooled
 approaches.
 
However,
 if
 the
 goals
 are
 discovery
 or
 association
 testing,
 pooled
 designs
 will
 continue
 to
 be
 a
 
viable
 option.
 
 In
 addition,
 there
 is
 a
 rich
 literature
 on
 various
 two-­‐stage
 options
 in
 which
 one
 
can
 potentially
 use
 pooling
 as
 the
 first-­‐stage
 to
 accomplish
 both
 discovery
 of
 novel
 variants
 and
 
ranking
 of
 those
 variants
 via
 association
 tests
 for
 second-­‐stage
 follow
 up
 via
 genotyping
 (Y.
 Zhao
 
&
 Wang,
 2009).
 New
 technologies
 may
 also
 facilitate
 more
 advanced
 designs
 such
 as
 pools
 of
 
individuals
 within
 bar-­‐coded
 pools
 all
 sequenced
 in
 a
 single
 run
 (Smith
 et
 al.,
 2010).
 

 

 

 

 

 

 

 

 

  39
 
Chapter
 3 Bayesian
 Hierarchical
 Lasso
 
3.1
 Introduction
 
In
 genome-­‐wide
 association
 studies
 (GWAS)
 with
 common
 disease,
 the
 most
 used
 analysis
 tool
 
has
 been
 a
 single-­‐point,
 one
 degree
 of
 freedom
 test
 of
 association,
 such
 as
 the
 Cochran-­‐
Armitage
 test
 (McCarthy
 et
 al.,
 2008).
 However,
 testing
 one
 SNP
 at
 a
 time
 does
 not
 utilize
 the
 
full
 potential
 of
 GWAS
 to
 identify
 multiple
 causal
 SNPs,
 which
 is
 usually
 the
 case
 for
 common
 
diseases.
 Also,
 by
 testing
 multiple
 SNPs
 simultaneously,
 a
 falsely
 reported
 SNP
 would
 be
 
weakened
 or
 excluded
 if
 the
 true
 causal
 SNP
 with
 a
 stronger
 signal
 is
 included
 in
 the
 model.
 It
 is
 
not
 feasible
 to
 include
 all
 the
 SNPs
 at
 the
 same
 time,
 which
 would
 result
 in
 an
 
overparameterized
 model;
 thus
 it
 is
 necessary
 to
 consider
 some
 variable
 selection
 method.
 

 
The
 lasso
 (least
 absolute
 shrinkage
 and
 selection
 operator)
 utilizes
 a
 𝒍𝟏
 norm
 penalized
 
regression,
 and
 thus
 shrinks
 some
 coefficients
 and
 sets
 others
 to
 exactly
 zero
 (Tibshirani,
 1996).
 
 
It
 has
 been
 used
 in
 many
 statistical
 areas
 due
 to
 its
 features
 both
 in
 subset
 selection
 and
 in
 
parameter
 shrinkage.
 For
 example,
 Wu
 et
 al
 implemented
 cyclic
 coordinate
 descent
 algorithm
 to
 
find
 the
 lasso
 estimates
 in
 a
 logistic
 regression,
 and
 applied
 their
 algorithm
 in
 a
 GWAS
 dataset
 
(Wu
 &
 Lange,
 2008)
 (Wu,
 Chen,
 Hastie,
 Sobel,
 &
 Lange,
 2009).
 

 
From
 a
 Bayesian
 point
 of
 view,
 the
 estimates
 from
 penalized
 regression
 are
 equivalent
 to
 the
 
maximum
 a
 posteriori
 estimates
 of
 a
 Bayesian
 model,
 with
 specific
 priors
 on
 the
 model
 
parameters.
 For
 example,
 a
 ridge
 regression
 (with
 𝒍𝟐
 norm
 penalty)
 corresponds
 to
 a
 Bayesian
 

  40
 
model
 with
 Gaussian
 prior
 distribution;
 whereas
 a
 lasso
 regression
 (with
 𝒍𝟏
 norm
 penalty)
 
corresponds
 to
 a
 Bayesian
 model
 with
 Laplace
 prior
 distribution
 (Tibshirani,
 1996).
 Genkin
 et
 al
 
investigated
 a
 Bayesian
 approach
 to
 logistic
 regression,
 using
 Gaussian
 or
 Laplace
 prior
 
distributions.
 They
 also
 implemented
 a
 cyclic
 coordinate
 descent
 algorithm
 to
 find
 the
 maximum
 
a
 posteriori
 estimates,
 and
 implemented
 their
 method
 in
 text
 categorization
 (Genkin,
 Lewis,
 &
 
Madigan,
 2007).
Along similar lines,
 Hoggart
 et
 al
 developed
 a
 Bayesian
 model
 with
 modified
 
prior
 distribution
 -­‐
 a
 Normal
 Exponential
 Gamma
 distribution
 (NEG)
 -­‐
 as
 a
 generalization
 of
 the
 
Laplace
 distribution,
 and
 applied
 their
 method
 in
 a
 real
 genome-­‐wide
 dataset
 (Hoggart,
 
Whittaker,
 De
 Iorio,
 &
 Balding,
 2008).
 Yuan
 and
 Lin
 also
 proposed
 an
 empirical
 Bayes
 model
 
closely
 related
 to
 the
 lasso
 regression
 (Yuan
 &
 Lin,
 2005a).
 

 
In
 all
 the
 studies
 mentioned
 above,
 the
 shrinkage
 parameter
 in
 lasso
 regression,
 as
 well
 as
 the
 
hyper-­‐parameter
 in
 the
 Bayesian
 approaches,
 is
 estimated
 by
 cross-­‐validation.
 And
 since
 there
 is
 
only
 one
 shrinkage
 parameter
 for
 all
 variables,
 it
 is
 computationally
 feasible.
 However,
 all
 the
 
variables
 are
 shrunk
 by
 the
 same
 amount.
 In
 a
 GWAS
 dataset,
 the
 variables
 are
 the
 SNPs
 across
 
the
 genome.
 Thus
 we
 have
 plenty
 of
 external
 information
 on
 the
 variables.
 For
 example,
 for
 
pathway
 information,
 we
 could
 extract
 information
 from
 the
 Gene
 Ontology
 (Ashburner
 et
 al.,
 
2000),
 PANTHER
 (P.
 D.
 Thomas,
 2003),
 and
 other
 databases.
 In
 our
 study,
 we
 propose
 a
 Bayesian
 
hierarchical
 approach
 to
 differentially
 shrink
 SNPs
 by
 using
 such
 external
 information.
 In
 the
 
proposed
 Bayesian
 model,
 the
 model
 parameters
 have
 different
 prior
 distributions,
 whose
 
hyper-­‐parameters
 are
 informed
 by
 external
 information.
 A
 few
 algorithms
 are
 also
 proposed
 to
 
estimate
 the
 hyper-­‐parameters
 and
 model
 parameters.
 

  41
 
3.2
 Iterative
 approaches
 to
 estimate
 Bayesian
 
hierarchical
 lasso
 
3.2.1
 Method
 
3.2.1.A
 Bayesian
 hierarchical
 Lasso
 
In
 a
 genetic
 association
 study
 with
 case-­‐control
 data,
 let
 D
 =
 {(y
1
,
 x
1
),
 (y
2
,
 x
2
),
 …,
 (y
n
,
 x
n
)},
 where
 y
i

 
=
 ±1
 indicating
 the
 case-­‐control
 status
 of
 the
 i
th

 subject,
 and
 x
i

 is
 a
 vector
 consisting
 of
 the
 
genetic
 variables
 for
 the
 i
th

 subject.
 

 
We
 use
 a
 standard
 logistic
 regression
 model
 on
 the
 observed
 case-­‐control
 data:
 
$$\mathrm{logit}\,\Pr(Y_i = 1 \mid \beta_0, \boldsymbol\beta, \mathbf X_i) = \beta_0 + \mathbf X_i\boldsymbol\beta = \beta_0 + \sum_{j=1}^{P} \beta_j x_{ij} \qquad (3.1)$$
 

 
A key feature of our model is variable selection informed by pathway information. To achieve this, we build a hierarchical model with a Laplace (double-exponential) prior distribution for each $\beta_j$. The maximum a posteriori estimate of $\beta_j$ therefore tends to be 0 unless the data provide enough evidence that $\beta_j$ is not 0. We write $\beta_0$ separately because we do not want to penalize the intercept term in the model.
$$\beta_j \sim \mathrm{Laplace}(\lambda_j), \qquad \Pr(\beta_j \mid \lambda_j) = \frac{\lambda_j}{2}\exp\left(-\lambda_j |\beta_j|\right) \qquad (3.2)$$
The Laplace distribution of $\beta_j$ has mean 0 and variance $2/\lambda_j^{2}$.
 

 

  42
 
We further model the hyper-parameter $\lambda_j$ based on external bio-feature information $Z_j = \{z_{j1}, z_{j2}, \ldots, z_{jV}\}$:
$$\lambda_j = \exp(-Z_j\boldsymbol\pi) = \exp\Big(-\sum_{v=1}^{V} \pi_v z_{jv}\Big) \qquad (3.3)$$
Here we use an exponential link to ensure that each $\lambda_j$ is positive, and $\log(\lambda_j)$ is inversely related to $Z_j\boldsymbol\pi$, so that a larger value of $Z_j\boldsymbol\pi$ corresponds to a smaller penalty $\lambda_j$. The hope is that the external information improves the estimation of each hyper-parameter $\lambda_j$, and thus the variable selection over all the $\beta_j$'s.
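A minimal R sketch of the pieces defined by (3.1)–(3.3): the external-information link for $\lambda_j$ and the log posterior (up to a constant) that the algorithms below maximize. The data objects (y coded ±1, genotype matrix X, bio-feature matrix Z) are placeholders.

# lambda_j = exp(-Z_j %*% pi): a smaller penalty for SNPs with larger Z_j pi
lambda_from_pi <- function(Z, pi) as.vector(exp(-Z %*% pi))

# Log posterior of (beta0, beta) given pi, up to an additive constant:
# logistic log-likelihood (y coded +/-1) plus the Laplace log-prior on beta
log_posterior_beta <- function(beta0, beta, y, X, lambda) {
  eta <- beta0 + as.vector(X %*% beta)
  loglik   <- -sum(log1p(exp(-y * eta)))
  logprior <- sum(log(lambda) - log(2) - lambda * abs(beta))
  loglik + logprior
}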
 

 
3.2.1.B
 Iterative
 algorithms
 to
 find
 the
 maximum
 a
 posteriori
 estimate
 
We
 proposed
 to
 iteratively
 update
 the
 model
 parameters
 (𝛽
!
,𝜷)
 =
 (𝛽
!
,𝛽
!
,… ,𝛽
!
)
 and
 the
 hyper-­‐
parameters
 𝝅
 =
 (𝜋
!
,𝜋
!
,… ,𝜋
!
).
 
1)
  Assign
 an
 initial
 value
 to
 the
 hyper-­‐parameter
 𝜋,
 thus
 the
 initial
 values
 of  𝜆
!
’s.
 
   
 
2)
  Update
 the
 model
 parameters
 (𝛽
!
,𝜷):
 find
 the
 MAP
 (maximum
 a
 posterior)
 of
 (𝛽
!
,𝜷)
 
based
 on
 the
 current
 values
 of
 hyper-­‐parameter
 𝝅.
 
3)
  Update
 the
 hyper-­‐parameter
 𝝅:
 using
 Newton’s
 method
 to
 find
 the
 MAP
 of
 𝝅
 
conditional
 on
 the
 current
 values
 of
 model
 parameters
 (𝛽
!
,𝜷).
 
4)
  Repeat
 step
 2)
 and
 3)
 until
 convergence.
 

 
The
 details
 of
 the
 proposed
 algorithm
 are
 explained
 in
 the
 following
 part.
 

  43
 
It
 is
 quite
 possible
 that
 different
 initial
 values
 of
 𝝅
 matrix
 will
 lead
 to
 different
 local
 maxima
 of
 
our
 penalized
 likelihood
 function.
 We
 implement
 multiple
 runs
 based
 on
 different
 initial
 values
 
of
 the  𝝅
 matrix
 and
 pick
 the
 highest
 of
 the
 modes
 identified.
 

 
Given
 the
 current
 values
 of
 the
 hyper-­‐parameters
 𝝅,
 we
 adopted
 the
 cyclic
 coordinate
 descent
 
algorithm
 to
 find
 the
 MAP
 estimates
 of
 model
 parameters
 (𝛽
!
,𝜷).
 
 
The conditional distribution $\Pr(\boldsymbol\beta \mid D, \boldsymbol\pi)$ is proportional to the likelihood times the prior:
$$\Pr(\boldsymbol\beta \mid D, \boldsymbol\pi) \propto \prod_{i=1}^{n} \frac{1}{1+\exp\{-y_i(\beta_0 + \mathbf X_i\boldsymbol\beta)\}}\; \Pr(\boldsymbol\beta \mid \boldsymbol\pi)$$
For convenience, we work with the log posterior:
$$\log \Pr(\boldsymbol\beta \mid D, \boldsymbol\pi) = -\sum_{i=1}^{n} \ln\left(1+\exp\{-y_i(\beta_0 + \mathbf X_i\boldsymbol\beta)\}\right) + \ln p(\boldsymbol\beta) + C$$
where $C$ is a constant. With the Laplace prior distribution of $\boldsymbol\beta$, we have:
$$\log \Pr(\boldsymbol\beta \mid D, \boldsymbol\pi) = -\sum_{i=1}^{n} \ln\left(1+\exp\{-y_i(\beta_0 + \mathbf X_i\boldsymbol\beta)\}\right) + \sum_{j=1}^{P} \left(\ln\lambda_j - \ln 2 - \lambda_j|\beta_j|\right) \qquad (3.4)$$
 
To
 optimize
 the
 maximum
 a
 posteriori
 over  𝜷= (𝛽
!
,𝛽
!
,… ,𝛽
!
)′  ,
 we
 used
 cyclic
 coordinate
 
descent
 algorithm.
 It
 is
 easy
 to
 implement
 and
 fast
 in
 computational
 speed.
 Cyclic
 coordinate
 
descent
 algorithm
 optimizes
 each
 variable
 in
 turn,
 holding
 all
 other
 variables
 as
 constant.
 Thus
 it
 
is
 a
 one-­‐dimensional
 optimization
 problem.
 

  44
 
We illustrate the application of the cyclic coordinate descent algorithm as follows. To optimize $\beta_j$ while holding all other variables constant, we write the relevant components of the log posterior density as:
$$g(z) = \sum_{i=1}^{n} \ln\left(1+\exp\{-y_i(\beta_0 + \mathbf X_i\boldsymbol\beta + (z-\beta_j)x_{ij})\}\right) + \lambda_j |z| \qquad (3.5)$$
Thus the $\beta_j^{\mathrm{new}}$ that maximizes the posterior is the value of $z$ that minimizes $g(z)$. There is no closed-form solution to this univariate optimization problem, but Newton's method can be applied to compute $\beta_j^{\mathrm{new}}$:
$$\beta_j^{\mathrm{new}} = \arg\min_z g(z) = \beta_j + \Delta\beta_j^{t}, \quad \text{where the tentative update } \Delta\beta_j^{t} = -\frac{g'(\beta_j)}{g''(\beta_j)} \qquad (3.6)$$
 
In the optimization process tailored to our model with the Laplace prior distribution, we adopted a few modifications following (Genkin et al., 2007).
1) To avoid large updating steps, we specified a trust region $\Delta_j$ that each $|\Delta\beta_j|$ cannot exceed on a single iteration. After applying the trust-region restriction, the actual update of $\beta_j$ is
$$\Delta\beta_j = \begin{cases} -\Delta_j & \text{if } \Delta\beta_j^{t} < -\Delta_j \\ \Delta\beta_j^{t} & \text{if } |\Delta\beta_j^{t}| \le \Delta_j \\ \Delta_j & \text{if } \Delta\beta_j^{t} > \Delta_j \end{cases} \qquad (3.7)$$
The size of the trust region $\Delta_j$ is adaptively updated at each iteration:
$$\Delta_j^{\mathrm{new}} = \max\big(2|\Delta\beta_j|,\ \tfrac{1}{2}\Delta_j\big) \qquad (3.8)$$
2) The tentative update is invalid if it would change the sign of $\beta_j$. If this happens, we simply set $\beta_j^{\mathrm{new}}$ to 0.
3) If $\beta_j = 0$, we try to update in both directions and see if either direction succeeds. Note that at most one direction can succeed, due to the convexity of the function $g(z)$.
4) To control the model size, we applied the generalized Bayesian information criterion in our updating (J. Chen & Chen, 2008). The generalized Bayesian information criterion is:
$$\mathrm{BIC}_\gamma = \sum_{i=1}^{I} \ln\left(1+\exp\{-y_i(\beta_0 + \mathbf X_i\boldsymbol\beta)\}\right) + \tfrac{1}{2}\, p_{nz} \ln(I) + \gamma \ln\binom{P}{p_{nz}} \qquad (3.9)$$
where $P$ is the number of genetic covariates ($\beta$'s) in the data, $p_{nz}$ is the total number of non-zero $\beta$'s in the model, and $I$ is the total number of subjects. In each update of a specific $\beta_j$, we accept $\Delta\beta_j$ only if the generalized Bayesian information criterion does not increase its value.
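A sketch of the one-dimensional update for a single $\beta_j$ in R, assuming y is coded ±1 and lambda holds the $\lambda_j$'s. It follows modifications (1)–(2) above; the both-directions check at $\beta_j = 0$ and the BIC check in (4) are omitted for brevity, and the bookkeeping arguments (eta maintained by the caller, the trust vector) are our own, not taken from the text.

update_beta_j <- function(j, beta, y, X, lambda, trust, eta) {
  # eta = beta0 + X %*% beta, maintained by the caller for speed
  p_i <- 1 / (1 + exp(y * eta))                              # sigma(-y_i * eta_i)
  g1  <- -sum(y * X[, j] * p_i) + lambda[j] * sign(beta[j])  # g'(beta_j), away from 0
  g2  <-  sum(X[, j]^2 * p_i * (1 - p_i))                    # g''(beta_j)
  step <- -g1 / g2                                           # tentative Newton update (3.6)
  step <- max(min(step, trust[j]), -trust[j])                # trust-region restriction (3.7)
  new_bj <- beta[j] + step
  if (beta[j] != 0 && sign(new_bj) != sign(beta[j])) new_bj <- 0   # modification (2)
  trust[j] <- max(2 * abs(new_bj - beta[j]), trust[j] / 2)         # adaptive trust region (3.8)
  list(beta_j = new_bj, trust_j = trust[j])
}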
 
Given the current values of the model parameters $\boldsymbol\beta$, the conditional distribution of $\boldsymbol\pi$ is:
$$\Pr(\boldsymbol\pi \mid \boldsymbol\beta, Z) = \prod_{j=1}^{P} \frac{\lambda_j}{2}\exp(-\lambda_j|\beta_j|), \qquad \text{where } \lambda_j = \exp(-Z_j\boldsymbol\pi) = \exp\Big(-\sum_{v=1}^{V}\pi_v z_{jv}\Big) \qquad (3.10)$$
Thus the log posterior distribution of $\boldsymbol\pi$ is:
$$\log\Pr(\boldsymbol\pi \mid \boldsymbol\beta, Z) = -P\ln 2 - \sum_{j=1}^{P}\Big( Z_j\boldsymbol\pi + \exp(-Z_j\boldsymbol\pi)\,|\beta_j| \Big) \qquad (3.11)$$
Then the score function for $\boldsymbol\pi$, and specifically its k-th component, is:
$$u(\pi_k \mid \boldsymbol\beta, Z) = \frac{\partial \log\Pr(\boldsymbol\pi \mid \boldsymbol\beta, Z)}{\partial \pi_k} = -\sum_{j=1}^{P}\Big( z_{jk} - z_{jk}\,|\beta_j|\,\exp(-Z_j\boldsymbol\pi) \Big) \qquad (3.12)$$
And the Fisher information for $\boldsymbol\pi$ (k-th row, r-th column) is:
$$\mathrm{info}(\pi_{kr} \mid \boldsymbol\beta, Z) = -\frac{\partial^2 \log\Pr(\boldsymbol\pi \mid \boldsymbol\beta, Z)}{\partial \pi_k\,\partial \pi_r} = \sum_{j=1}^{P} z_{jk} z_{jr}\,|\beta_j|\,\exp(-Z_j\boldsymbol\pi) \qquad (3.13)$$
In Newton's method, the tentative update for $\pi_k$ is:
$$\Delta\pi_k^{t} = \frac{u(\pi_k \mid \boldsymbol\beta, Z)}{\sum_{r=1}^{V} \mathrm{info}(\pi_{kr} \mid \boldsymbol\beta, Z)} \qquad (3.14)$$
To avoid large updating steps, we also specified a trust region $\Delta\Pi_k$ that each $|\Delta\pi_k|$ cannot exceed on a single iteration. After applying the trust-region restriction, the actual update of $\pi_k$ is:
$$\Delta\pi_k = \begin{cases} -\Delta\Pi_k & \text{if } \Delta\pi_k^{t} < -\Delta\Pi_k \\ \Delta\pi_k^{t} & \text{if } |\Delta\pi_k^{t}| \le \Delta\Pi_k \\ \Delta\Pi_k & \text{if } \Delta\pi_k^{t} > \Delta\Pi_k \end{cases} \qquad (3.15)$$
And we update the trust region $\Delta\Pi_k$ adaptively, too:
$$\Delta\Pi_k^{\mathrm{new}} = \max\big(2|\Delta\pi_k|,\ \tfrac{1}{2}\Delta\Pi_k\big) \qquad (3.16)$$
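A sketch of this π update in R, following the component-wise form of (3.12)–(3.16) as reconstructed above; the input objects match the earlier sketches and the trust-region bookkeeping is our own.

update_pi <- function(pi, beta, Z, trust_pi) {
  lambda <- as.vector(exp(-Z %*% pi))
  w      <- abs(beta) * lambda                    # |beta_j| * exp(-Z_j pi)
  score  <- -colSums(Z * (1 - w))                 # u(pi_k | beta, Z), eq (3.12)
  info   <- t(Z) %*% (Z * w)                      # info(pi_kr | beta, Z), eq (3.13)
  step   <- score / rowSums(info)                 # tentative component-wise update (3.14)
  step   <- pmax(pmin(step, trust_pi), -trust_pi) # trust-region restriction (3.15)
  trust_pi <- pmax(2 * abs(step), trust_pi / 2)   # adaptive trust region (3.16)
  list(pi = pi + step, trust_pi = trust_pi)
}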
 

 
3.2.2
 Simulation
 studies
 
3.2.2.A
 Design
 of
 simulation
 studies
 
In
 our
 simulation
 study,
 we
 first
 simulated
 external
 information
 with
 P
 items
 and
 V
 bio-­‐features,
 
as
 a
 (PxV)
 matrix
 Z,
 distributed
 as
 i.i.d.
 Normal(0,
 1).
 Then
 we
 specified
 the
 values
 for
 the
 hyper-­‐
parameters
 𝝅
 (Vx1).
 Thus
 the
 hyper-­‐parameters
 of
 Laplace
 distribution
 𝝀
 (Px1)
 are
 calculated
 as:
 
𝜆
!
=𝒁
𝒋
 𝝅,
 for
 j
 =
 1,
 2,
 …,
 P
 
The
 model
 parameters
 𝜷
 (Px1)
 are
 simulated
 given
 the
 values
 of  𝝀
 (Px1):
 
𝛽
!
|𝜆
!
∼ 𝐿𝑎𝑝𝑙𝑎𝑐𝑒  (0, 𝜆
!
 ),
 for
 j=1,
 2,
 …,
 P
 
The
 genetic
 variable
 𝑿
 (IxP)
 is
 assumed
 to
 be
 binary:
 

  47
 
𝑋
!"
∼
!.!.!.
𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖  (𝑝
!
),
 for
 i=1,
 2,
 …,
 I;
 
 
 j=1,
 2,
 …,P
 
And
 then
 the
 outcome
 variable
 𝒀
 (Ix1)
 is
 simulated
 as:
   
   
   
   
   
   
 
𝑌
!
∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖  (𝑃𝑟(𝑌
!
= 1)),
 for
 i=1,
 2,
 …,
 I
 
where
 𝑙𝑜𝑔𝑖𝑡   𝑃𝑟 𝑌
!
= 1 = (𝑿
𝒊
−𝒑
𝒈
)𝜷= (𝑋
!"
−𝑝
!
)𝛽
!
!
!!!
,
 for
 i=1,
 2,
 …,
 I
 
The
 X
 matrix
 is
 centered
 at
 the
 mean,
 so
 we
 will
 have
 roughly
 the
 same
 number
 of
 cases
 and
 
controls
 in
 our
 simulated
 data.
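A minimal R sketch of this simulation. One assumption on our part: the sketch uses the link $\lambda_j = \exp(-Z_j\boldsymbol\pi)$ from equation (3.3) to guarantee $\lambda_j > 0$, and draws Laplace variates as the difference of two exponentials.

set.seed(42)
I <- 1000; P <- 100; V <- 2                    # subjects, SNPs, bio-features (incl. intercept)
Z  <- cbind(1, matrix(rnorm(P * (V - 1)), P))  # external information, first column = intercept
pi <- c(-1, 1)                                 # hyper-parameters
lambda <- as.vector(exp(-Z %*% pi))            # Laplace rate for each beta_j (link from eq 3.3)

rlaplace <- function(n, rate) rexp(n, rate) - rexp(n, rate)   # Laplace(0, rate) draw
beta <- rlaplace(P, lambda)

p_g <- runif(P, 0.1, 0.4)                      # Bernoulli frequencies of the binary genotypes
X   <- matrix(rbinom(I * P, 1, rep(p_g, each = I)), I, P)
eta <- sweep(X, 2, p_g) %*% beta               # centred genotypes => roughly balanced cases/controls
Y   <- rbinom(I, 1, plogis(eta))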
 

 
3.2.2.B
 Results
 of
 global
 null
 simulations
 
In
 the
 global
 null
 simulation
 studies,
 we
 simulate
 𝛽
!
=𝛽
!
=⋯=𝛽
!
= 0.
 
We
 tested
 a
 scenario
 in
 which
 the
 sample
 size
 is
 relatively
 large,
 with
 1000
 individuals
 (I)
 and
 
100
 SNPs
 (P).
 Based
 on
 100
 replications,
 we
 summarize
 the
 results
 in
 Table
 3.1.
 When
 we
 
applied
 the
 generalized
 Bayesian
 information
 penalty
 (with
 weight
 γ)
 to
 control
 the
 model
 size,
 
it
 was
 possible
 that
 no
 variable
 was
 selected.
In such cases, β = 0 and we failed to estimate the hyper-parameters π (this happens more frequently when γ is relatively large).
 
 
Table
 3.1
 Null
 Simulation
 Results
 
Table
 3.1.A.
 Null
 simulation
 (all
 model
 parameters
 are
 0,
 I
 =
 1000,
 P
 =
 100**)
 
𝜸
  Average
 number
 of
 
selected
 variables
 
MSE*
 (selected
 
variables)
 
Estimated
 𝝅
 
value
 
Number
 of
 replicates
 in
 
which
 no
 variable
 is
 
selected
 (out
 of
 100)
 
0
  57.06
  0.0029
  -­‐3.157
  0
 
0.1
  75.02
  0.0026
  -­‐3.264
  0
 
0.3
  6.22
  0.0179
  -­‐2.061
  50
 
0.5
  2.52
  0.0522
  -­‐1.486
  35
 
1
  0.07
  0.1281
  -­‐1.099
  94
 
*MSE
 is
 a
 summary
 of
 the
 sum
 of
 variance
 and
 bias
2

 from
 the
 simulation
 study.
 It
 is
 calculated
 as
 the
 
square
 root
 of
 the
 average
 deviation
 of
 the
 estimated
 value
 from
 the
 true
 value.
 
**
 I
 is
 the
 number
 of
 individuals
 (observations),
 P
 is
 the
 number
 of
 parameters
 (variables).
 

 

  48
 
Table
 3.1.B.
 Null
 simulation
 (all
 model
 parameters
 are
 0,
 I
 =
 1000,
 P
 =
 1000)
 
𝜸
  Average
 number
 of
 
selected
 variables
 
MSE
 (selected
 
variables)
 
Estimated
 𝝅
 
value
 
Number
 of
 replicates
 in
 
which
 no
 variable
 is
 
selected
 (out
 of
 100)
 
0
  488.13
  0.0031
  -­‐3.114
  0
 
0.1
  688.49
  0.0027
  -­‐3.239
  0
 
0.3
  44.37
  0.0394
  -­‐1.920
  21
 
0.5
  4.28
  0.0904
  -­‐1.236
  45
 
1
  0.05
  0.1589
  -­‐0.943
  95
 

 
Table
 3.1.C.
 Null
 simulation
 (all
 model
 parameters
 are
 0,
 I
 =
 100,
 P
 =
 1000)
 
𝜸
  Average
 number
 of
 
selected
 variables
 
MSE
 (selected
 
variables)
 
Estimated
 𝝅
 
value
 
Number
 of
 replicates
 in
 
which
 no
 variable
 is
 
selected
 (out
 of
 100)
 
0
  266.80
  0.0115
  -­‐2.506
  0
 
0.1
  15.76
  0.0810
  -­‐1.624
  2
 
0.3
  16.76
  0.5196
  -­‐0.411
  1
 
0.5
  2.22
  1.5039
  0.074
  28
 
1
  0.04
  1.000
  0.000
  96
 

 
3.2.2.C
 Results
 of
 main
 simulation
 studies
 
Simulation
 study
 in
 which
 𝝅
 includes
 an
 intercept
 term
 and
 1
 covariate
 term
 
In
 this
 simulation
 study,
 the
 hyper-­‐parameter
 𝛑
 =
 (𝝅
𝟎
,𝝅
𝟏
)
 =
 (-­‐1,
 1).
 The
 empirical
 results
 based
 
on
 100
 replicates
 are
 shown
 in
 Table
 3.2
 below.
 
Table
 3.2
 Simulation
 Results
 with
 1
 covariate
 
Table
 3.2.A.
 1-­‐covariate
 simulation
 ((π
!
,π
!
)
 =
 (-­‐1,
 1),
 I=1000,
 P=100)
 
𝜸
  Average
 
number
 of
 
selected
 
variables
 
Effect*
 
(not
 
selected
 
variables)
 
Effect
 
(selected
 
variables
)
 
MSE
 (not
 
selected
 
variables)
 
MSE
 
(selected
 
variables)
 
Estimated
 𝝅
 
value
 
Number
 of
 
replicates
 in
 
which
 2
 or
 less
 
variables
 are
 
selected
 (out
 
of
 100)
 
𝝅
𝟎

  𝝅
𝟏

 
0
  70.11
  0.1859
  0.8111
  0.1146
  0.7904
  -­‐1.676
  1.152
  0
 
0.1
  88.35
  0.1795
  0.6829
  0.1072
  0.7344
  -­‐1.885
  1.105
  0
 
0.3
  <
 84.59
  0.2131
  0.6947
  0.1561
  0.6014
  -­‐1.806
  1.121
  1
 
0.5
  <
 80.09
  0.1929
  0.7216
  0.1216
  0.4574
  -­‐1.680
  1.099
  2
 
1
  <
 69.40
  0.2152
  0.7984
  0.1603
  0.7149
  -­‐1.560
  1.151
  2
 
*Effect is the average absolute simulated value of the unselected (or selected) variables: Effect = E[ |β| | β̂ = 0 ] for variables not selected, and Effect = E[ |β| | β̂ ≠ 0 ] for selected variables.
 

  49
 
Table
 3.2.B.
 1-­‐covariate
 simulation
 ((π
!
,π
!
)
 =
 (-­‐1,
 1),
 I=1000,
 P=1000)
 
𝜸
  Average
 
number
 
of
 
selected
 
variables
 
Effect
 
(not
 
selected
 
variables)
 
Effect
 
(selected
 
variables)
 
MSE
 (not
 
selected
 
variables)
 
MSE
 
(selected
 
variables)
 
Estimated
 𝝅
 
value
 
Number
 of
 
replicates
 in
 
which
 2
 or
 less
 
variables
 are
 
selected
 (out
 
of
 100)
 
𝝅
𝟎

  𝝅
𝟏

 
0
  441.88
  0.3667
  0.9070
  0.3570
  3.5983
  -­‐3.021
  0.087
  0
 
0.1
  660.95
  0.3606
  0.7311
  0.3554
  2.5674
  -­‐3.193
  0.050
  0
 
0.3
  130.04
  0.3699
  2.1816
  0.3691
  8.7624
  -­‐1.642
  0.601
  0
 
0.5
  97.98
  0.4153
  2.3592
  0.4900
  10.7008
  -­‐1.203
  0.531
  0
 
1
  26.25
  0.4998
  4.5258
  0.8184
  33.9853
  -­‐0.839
  0.414
  0
 

 
Table
 3.2.C.
 1-­‐covariate
 simulation
 (((π
!
,π
!
)
 =
 (-­‐1,
 1),
 I=100,
 P=1000)
 
𝜸
  Average
 
number
 of
 
selected
 
variables
 
Effect
 
(not
 
selected
 
variables)
 
Effect
 
(selected
 
variables)
 
MSE
 (not
 
selected
 
variables)
 
MSE
 
(selected
 
variables)
 
Estimated
 𝝅
 
value
 
Number
 of
 
replicates
 in
 
which
 2
 or
 less
 
variables
 are
 
selected
 (out
 
of
 100)
 
𝝅
𝟎

  𝝅
𝟏

 
0
  257.34
  0.5250
  0.8536
  1.0442
  5.1973
  -­‐2.421
  0.009
  0
 
0.1
  <
 17.74
  0.5618
  3.2159
  1.3026
  41.2613
  -­‐1.285
  0.259
  2
 
0.3
  <
 15.75
  0.5765
  2.6284
  1.4356
  34.1257
  -­‐0.406
  0.212
  1
 
0.5
  <
 4.84
  0.5876
  5.0579
  1.5522
  97.3689
  0.078
  0.171
  18
 
1
  <<
 2.00
  0.5768
  16.8679
  1.5109
  504.28
  -­‐0.048
  0.018
  93
 

 
Simulation
 study
 in
 which
 𝝅
 includes
 2
 or
 more
 covariates
 
 
In
 this
 simulation
 study,
 the
 hyper-­‐parameter
 𝛑
 =
 (𝜋
!
,𝜋
!
,𝜋
!
,… ,𝜋
!
)
 =
 (-­‐1,
 1,
 -­‐1,
 0,
 0,
 0.2).
 The
 
empirical
 results
 based
 on
 100
 replicates
 are
 shown
 in
 Table
 3.3
 below.
 
Table
 3.3
 Simulation
 Results
 with
 5
 covariates
 
Table
 3.3.A.
 5-­‐covariate
 simulation
 (𝜋
!
,𝜋
!
,𝜋
!
,… ,𝜋
!
)
 =
 (-­‐1,
 1,
 -­‐1,
 0,
 0,
 0.2)
 
(I=1000,
 P=1000,
 based
 on
 100
 replicates)
 
𝜸
  Number
 of
 
selected
 
variables
 
Effect
 (not
 
selected
 
variables)
 
Effect
 
(selected
 
variables)
 
MSE
 (not
 
selected
 
variables)
 
MSE
 
(selected
 
variables)
 
0
  410.06
  0.53
  1.68
  1.16
  27.45
 
0.1
  649.18
  0.53
  1.26
  1.28
  17.95
 
0.3
  85.89
  0.53
  6.02
  1.21
  110.73
 
0.5
  66.72
  0.61
  6.52
  1.61
  129.56
 
1
  26.60
  0.71
  11.44
  2.50
  293.91
 

 

  50
 

 
𝜸
  Estimated
 𝝅
 value
  Number
 of
 replicates
 in
 which
 
6
 or
 less
 variables
 are
 selected
  𝝅
𝟎

  𝝅
𝟏

  𝝅
𝟐

  𝝅
𝟑

  𝝅
𝟒

  𝝅
𝟓

 
0
  -­‐2.79
  0.20
  -­‐0.20
  0.01
  0.01
  0.01
  0
 
0.1
  -­‐3.04
  0.14
  -­‐0.15
  0.01
  0.00
  0.01
  0
 
0.3
  -­‐1.63
  0.57
  -­‐0.59
  0
  0
  0.11
  0
 
0.5
  -­‐1.08
  0.61
  -­‐0.47
  -­‐0.01
  0.00
  0.09
  2
 
1
  -­‐0.86
  0.38
  -­‐0.40
  0.04
  -­‐0.00
  0.08
  6
 

 
Table
 3.3.B.
 5-­‐covariate
 simulation
 
 (𝜋
!
,𝜋
!
,𝜋
!
,… ,𝜋
!
)
 =
 (-­‐1,
 1,
 -­‐1,
 0,
 0,
 0.2)
 
(I=1000,
 P=100,
 
 based
 on
 100
 replicates)
 
𝜸
  Number
 of
 
selected
 
variables
 
Effect
 (not
 
selected
 
variables)
 
Effect
 
(selected
 
variables)
 
MSE
 (not
 
selected
 
variables)
 
MSE
 
(selected
 
variables)
 
0
  277.02
  0.79
  1.57
  4.07
  48.95
 
0.1
  <14.33
  0.83
  12.14
  4.63
  558.80
 
0.3
  <11.48
  0.88
  10.76
  5.47
  642.54
 
0.5
  6.67
  0.91
  19.90
  6.47
  2028.50
 

 
𝜸   Estimated  𝝅  value   Number  of  replicates  in  which  6  
or  less  variables  are  selected   𝝅
𝟎
  𝝅
𝟏
  𝝅
𝟐
  𝝅
𝟑
  𝝅
𝟒
  𝝅
𝟓
 
0   -­‐2.40   0.04   -­‐0.03   -­‐0.00   0.01   -­‐0.01   0  
0.1   -­‐1.88   0.49   -­‐0.48   0.03   0.03   0.04   10  
0.3   -­‐0.29   0.11   -­‐0.17   -­‐0.03   -­‐0.10   0.15   8  
0.5   0.14   0.28   -­‐0.16   0.40   -­‐0.07   -­‐0.9   85  

 
Problems
 of
 the
 iterative
 approach:
 
 
1) We do not have a well-justified reason to use the generalized Bayesian information criterion in this way: it is a model-selection criterion, but here we use it heuristically to control the model size.
 
2) Sometimes
 the
 model
 fails
 to
 converge,
 especially
 when
 all
 the
 𝜷’s
 are
 shrunk
 to
 
zero.
 
3) Cannot
 handle
 the
 situations
 where
 I
 <
 P
 
a. Prone
 to
 fail
 in
 updating
 the
 values
 of
 𝝅.
 
b. Not
 good
 to
 estimate
 values
 of
 𝝅
 even
 if
 it
 doesn’t
 fail.
 

  51
 
3.3
 Marginal
 MAP
 estimates
 of
 hyper-­‐parameters
 
To
 solve
 the
 problem
 of
 non-­‐convergence
 in
 updating
 the
 hyper-­‐parameter
 𝝅,
 when
 all
 or
 most
 
of
 the
 model
 parameters
 𝜷
 are
 prone
 to
 be
 shrunk
 to
 zero,
 we
 propose
 to
 find
 the
 a
 marginal
 
estimate
 of
 𝝅
 first,
 (e.g.,
 the
 marginal
 maximum
 a
 posterior),
 and
 then
 use
 the
 estimated
 𝝅
 
values
 to
 estimate
 the
 model
 parameters
 𝜷.
 
 

 
3.3.1
 Method
 
3.3.1.A
 Stepwise
 algorithms
 to
 find
 the
 maximum
 a
 posteriori
 estimation
 
1) Find
 the
 marginal
 maximum
 likelihood
 estimates
 of
 the
 hyper-­‐parameters
 𝝅.
 We
 
assume
 that
 the
 model
 parameters
 𝜷
 are
 Laplace
 distribution
 with
 parameter  𝝀.
 We
 
find
 the
 empirical
 Bayes
 estimator
 of
 the
 hyper-­‐parameters
 𝝅:
 
 
$$\hat{\boldsymbol\pi} = \arg\max_{\boldsymbol\pi} \Pr(\boldsymbol\pi \mid Y, G, Z) \approx \arg\max_{\boldsymbol\pi} \Pr(Y \mid G, Z, \boldsymbol\pi)$$
 
2) Given
 the
 marginal
 MAP
 estimates
 of
 𝝅,
 we
 use
 the
 cyclic
 coordinate
 descent
 
algorithm
 to
 find
 the
 MAP
 estimates
 of
 model
 parameters
 𝜷.
 
3.3.1.B
 Algorithm
 to
 find
 the
 marginal
 MLE
 of
 hyper-­‐parameters
 𝝅
 
We write the marginal likelihood of π as:
$$L(\boldsymbol\pi) = \Pr(\mathbf Y \mid \mathbf G, \mathbf Z, \boldsymbol\pi) = \int \Pr(\mathbf Y \mid \mathbf G, \boldsymbol\beta)\,\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)\, d\boldsymbol\beta \qquad (3.17)$$
The marginal likelihood of π is closely related to the posterior distribution of the model parameters β:
$$\Pr(\boldsymbol\beta \mid \mathbf Y, \mathbf G, \mathbf Z, \boldsymbol\pi) = \frac{\Pr(\boldsymbol\beta, \mathbf Y, \mathbf G, \mathbf Z, \boldsymbol\pi)}{\Pr(\mathbf Y, \mathbf G, \mathbf Z, \boldsymbol\pi)} = \frac{\Pr(\mathbf Y \mid \mathbf G, \boldsymbol\beta)\,\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)}{\Pr(\mathbf Y \mid \mathbf G, \mathbf Z, \boldsymbol\pi)} = \frac{\Pr(\mathbf Y \mid \mathbf G, \boldsymbol\beta)\,\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)}{L(\boldsymbol\pi)} \qquad (3.18)$$
Thus, the score function with respect to π is:
$$U(\boldsymbol\pi) = \frac{d \ln L(\boldsymbol\pi)}{d\boldsymbol\pi} = \frac{1}{L(\boldsymbol\pi)}\int \Pr(\mathbf Y \mid \mathbf G, \boldsymbol\beta)\,\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)\,\frac{d\ln\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)}{d\boldsymbol\pi}\, d\boldsymbol\beta = \int \Pr(\boldsymbol\beta \mid \mathbf Y, \mathbf G, \mathbf Z, \boldsymbol\pi)\, u_{\boldsymbol\pi}(\boldsymbol\beta)\, d\boldsymbol\beta$$
$$\text{where } u_{\boldsymbol\pi}(\boldsymbol\beta) = \frac{d\ln\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)}{d\boldsymbol\pi} \qquad (3.19)$$
Also, the Fisher information with respect to π is:
$$\mathrm{Info}(\boldsymbol\pi) = -\frac{dU(\boldsymbol\pi)}{d\boldsymbol\pi} = \Big[\int \Pr(\boldsymbol\beta \mid \mathbf Y, \mathbf G, \mathbf Z, \boldsymbol\pi)\, u_{\boldsymbol\pi}(\boldsymbol\beta)\, d\boldsymbol\beta\Big]^{\otimes 2} - \int \Pr(\boldsymbol\beta \mid \mathbf Y, \mathbf G, \mathbf Z, \boldsymbol\pi)\, u_{\boldsymbol\pi}(\boldsymbol\beta)\otimes u_{\boldsymbol\pi}(\boldsymbol\beta)\, d\boldsymbol\beta + \int \Pr(\boldsymbol\beta \mid \mathbf Y, \mathbf G, \mathbf Z, \boldsymbol\pi)\, \mathrm{info}_{\boldsymbol\pi}(\boldsymbol\beta)\, d\boldsymbol\beta$$
$$\text{where } u_{\boldsymbol\pi}(\boldsymbol\beta) = \frac{d\ln\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)}{d\boldsymbol\pi}, \quad \mathrm{info}_{\boldsymbol\pi}(\boldsymbol\beta) = -\frac{d^2\ln\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)}{d\boldsymbol\pi^2} \qquad (3.20)$$
 
Based
 on
 the
 score
 and
 information
 functions,
 we
 can
 use
 Newton’s
 method
 to
 numerically
 
optimize the marginal likelihood function L(π). To achieve that, we need to know the posterior
 
density
 of
 𝜷,
 𝐏𝐫 𝜷 𝒀,𝑮,𝒁,𝝅 ,
 which
 we
 propose
 to
 approximate
 by
 using
 MCMC
 sampling.
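A sketch of the Monte Carlo approximation to (3.19)–(3.20) for the intercept-only case (π scalar, λ = exp(−π)), assuming beta_draws is a hypothetical matrix of MCMC samples from Pr(β | Y, G, Z, π) with one row per draw:

marginal_score_info <- function(pi, beta_draws) {
  lambda <- exp(-pi)
  s      <- rowSums(abs(beta_draws))      # sum_j |beta_j| for each posterior draw
  P      <- ncol(beta_draws)
  u      <- -P + lambda * s               # u_pi(beta) = d log Pr(beta | pi) / d pi
  info_b <- lambda * s                    # info_pi(beta) = -d^2 log Pr(beta | pi) / d pi^2
  U      <- mean(u)                                  # Monte Carlo estimate of eq (3.19)
  Info   <- mean(u)^2 - mean(u^2) + mean(info_b)     # Monte Carlo estimate of eq (3.20)
  c(score = U, info = Info, newton_step = U / Info)
}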
 

  53
 

 
3.3.1.C
 Approximate
 the
 posterior
 distribution
 of
 model
 parameters
 𝜷:
 
MCMC
 sampling
 procedure
 
We
 used
 adaptive
 MCMC
 sampling
 to
 retain
 a
 good
 acceptance
 rate:
 the
 proposed
 update
 step
 
is
 adaptively
 set,
 if
 the
 acceptance
 rate
 is
 low,
 the
 update
 is
 set
 to
 be
 small;
 otherwise,
 the
 
update
 is
 set
 to
 be
 large.
 

 
We
 try
 to
 sample
 𝜷
 from
 its
 posterior
 distribution
 𝐏𝐫 𝜷 𝒀,𝑮,𝒁,𝝅 ,
 but
 the
 real
 problem
 is,
 
when
 the
 dimensionality
 of
 𝜷
 is
 high,
 especially
 when
 the
 problem
 is
 under-­‐determined,
 which
 
means
 the
 number
 of
 parameters
 (P)
 is
 larger
 than
 the
 number
 of
 observations
 (I),
 the
 MCMC
 
sampling
often does
 not
 converge.
 
 

 
We
 also
 found
 that
 the
 Fisher’s
 information
 with
 respect
 to
 𝝅
 calculated
 by
 the
 posterior
 
samples
 of
 𝜷
 is
 not
 always
 positive
 definite.
 There
 are
 two
 possibilities:
 1)
 The
 MCMC
 sample
 of
 
𝜷
 is
 not
 representative
 of
 the
 posterior
 distribution;
 2)
 The
 information
 matrix
 is
 truly
 not
 
positive
 definite,
 which
 means
 that
 the
 marginal
 likelihood
 function
 with
 respect
 to
 𝝅
 is
 not
 
concave,
 and
 the
 MLE
 does
 not
 exist.
 We
 emphasized
 the
 first
 possibility
 by
 throwing
 away
 
iterations
 when
 the
 information
 matrix
 is
 not
 positive
 definite
 as
 we
 think
 it
 is
 caused
 by
 a
 bad
 
sampling
 of
 𝜷.
 

 
3.3.1.D
 Approximate
 the
 likelihood
 function
 with
 a
 Normal
 distribution
 

  54
 
Due
 to
 the
 computational
 difficulty
 of
 convergence
 in
 MCMC
 sampling,
 when
 the
 dimensionality
 
is
 high,
 we
 tried
 to
 simplify
 the
 problem
 by
 approximating
 the
 likelihood
 with
 Normal
 
distribution.
Assuming every component of β is independent of the others, the univariate MLEs of β are asymptotically normally distributed: $\hat\beta_j \sim N(\beta_j, \sigma_j^2)$. Thus the marginal likelihood with respect to π, as in equation (3.17), can be approximated as follows:
$$L(\boldsymbol\pi) = \Pr(\hat{\boldsymbol\beta} \mid \mathbf Z, \boldsymbol\pi) = \int \Pr(\hat{\boldsymbol\beta} \mid \boldsymbol\beta, \boldsymbol\sigma)\,\Pr(\boldsymbol\beta \mid \mathbf Z, \boldsymbol\pi)\, d\boldsymbol\beta = \prod_j \int \Pr(\hat\beta_j \mid \beta_j, \sigma_j)\,\Pr(\beta_j \mid \mathbf Z, \boldsymbol\pi)\, d\beta_j \qquad (3.21)$$
If the hyper-parameters π include only an intercept term, we write the likelihood function as
$$L(\lambda) = \prod_j l_j(\lambda) \qquad (3.22)$$
where the piecewise likelihood corresponding to each component of the model parameter β is:
$$l_j(\lambda) = \Pr(\hat\beta_j \mid \lambda) = \int \Pr(\hat\beta_j \mid \beta_j, \sigma_j)\,\Pr(\beta_j \mid \lambda)\, d\beta_j = \int \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\Big(-\frac{(\hat\beta_j-\beta_j)^2}{2\sigma_j^2}\Big)\,\frac{\lambda}{2}\exp(-\lambda|\beta_j|)\, d\beta_j$$
$$= \frac{\lambda}{2}\exp\!\Big(\frac{1}{2}\lambda^2\sigma_j^2\Big)\left\{ \frac{1}{2}\exp(\lambda\hat\beta_j)\Big[1-\mathrm{erf}\Big(\frac{\hat\beta_j+\lambda\sigma_j^2}{\sqrt{2}\,\sigma_j}\Big)\Big] + \frac{1}{2}\exp(-\lambda\hat\beta_j)\Big[1+\mathrm{erf}\Big(\frac{\hat\beta_j-\lambda\sigma_j^2}{\sqrt{2}\,\sigma_j}\Big)\Big]\right\} \qquad (3.23)$$
 

 
This
 is
 an
 analytic
 form
 of
 marginal
 likelihood
 with
 respect
 to
 the
 hyper-­‐parameter  𝜆.
 Based
 on
 
the
 approximate
 form
 of
 likelihood,
 we
 could
 easily
 implement
 Newton’s
 method
 to
 find
 the
 
marginal
 MLE
 of  𝜆
 (corresponding
 to
 the
 intercept
 term
 of
 𝝅:  𝜋
!
).
 However,
 we
 found
 the
 
likelihood
 function
 not
 always
 concave,
 in
 which
 case
 the
 MLE
 does
 not
 exist.
Since the likelihood is the product of all the piecewise likelihoods, let us focus on a single piecewise likelihood $l_j(\lambda)$, as shown in equation (3.23) above. To illustrate its behaviour, Fig 3.1 shows the piecewise likelihood function and its relationship with the values of $\hat\beta_j$ and $\sigma_j$. With $\sigma_j$ fixed, $l_j(\lambda)$ is non-concave if $|\hat\beta_j|$ is small; and with $\hat\beta_j$ fixed, $l_j(\lambda)$ is non-concave if $\sigma_j$ is large. Thus we suspect that $l_j(\lambda)$ is concave only in a region where $|\hat\beta_j/\sigma_j|$ is relatively large.
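The closed form (3.23) can be evaluated directly in R by writing erf in terms of the normal CDF (½[1 − erf(x/√2)] = pnorm(−x) and ½[1 + erf(x/√2)] = pnorm(x)); the sketch below also reproduces the qualitative behaviour of Fig 3.1 (the parameter values match the figure legends):

# Piecewise marginal likelihood l_j(lambda) from eq (3.23), via normal CDFs
lj <- function(lambda, beta_hat, sigma) {
  (lambda / 2) * exp(lambda^2 * sigma^2 / 2) *
    ( exp( lambda * beta_hat) * pnorm(-(beta_hat + lambda * sigma^2) / sigma) +
      exp(-lambda * beta_hat) * pnorm( (beta_hat - lambda * sigma^2) / sigma) )
}

lam <- seq(0.01, 5, by = 0.01)
plot(lam, lj(lam, beta_hat = sqrt(0.1), sigma = 1), type = "l",        # beta_hat^2 = 0.1
     xlab = expression(lambda), ylab = expression(l[j](lambda)))       # non-concave for small |beta_hat|
lines(lam, lj(lam, beta_hat = 2, sigma = 1), lty = 2)                  # beta_hat^2 = 4, closer to concave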
 
Figure 3.1 The piecewise likelihood function
Fig 3.1.A. The piecewise likelihood function $l_j(\lambda)$ with $\sigma_j^2 = 1$ fixed and $\hat\beta_j^2 = 0.1, 0.5, 1, 2, 4$
Fig 3.1.B. The piecewise likelihood function $l_j(\lambda)$ with $\hat\beta_j^2 = 1$ fixed and $\sigma_j^2 = 0.25, 0.5, 1, 2, 3$

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
   
 

 
We’ve
 also
 tried
 to
 simulate
 the
 posterior
 distribution
 in
 the
 same
 set-­‐up
 in
 WinBUGS,
 and
 the
 
results
confirm
 what
 we
 found
 here.
 When
 the
 effect
 size
 is
 relatively
 small,
 the
 marginal
 
likelihood
 function
 is
 non-­‐concave,
 and
 the
 MLE
 does
 not
 exist.
 

 
3.3.2
 Simulation
 results
 of
 the
 MCMC
 procedures
 (3.3.1.A-­‐
3.3.1.C)
 
When
 we
 sample
 the
 posterior
 distribution
 of
 𝜷,
 we
 implemented
 the
 Metropolis
 algorithm
 
with
 a
 burn-­‐in
 of
 12500,
 and
 obtained
 an
 MCMC
 sample
 of
 2500.
 We
 used
 the
 2500
 MCMC
 
samples
 to
 approximate
 the
 posterior
 distribution
 of
 𝜷,
 based
 on
 which
 we
 calculated
 the
 score
 
and
 information,
 and
 then
 implemented
 Newton’s
 method
 to
 update  𝝅.
 
 
 

 

  57
 
In
 this
 set
 of
 examples,
 we
 investigated
 a
 marginally
 under-­‐determined
 model:
 number
 of
 
individuals
 I
 =
 100
 with
 the
 number
 of
 parameters
 P
 =
 100.
 We
 ran
 2000
 iterations
 in
 Newton’s
 
method.
 As
 shown
 in
 Fig
 3.2
 below,
 it
 seems
 that
 the
 updated
 values
 converged
 very
 quickly,
 
and
 the
 estimated
 value
 is
 close
 to
 the
 simulated
 true
 value
 (-­‐1).
 

 
Figure
 3.2
 Iteration
 trace
 of
 the
 updated
 values
 using
 Newton's
 method
 (I=100,
 P=100)
 

 

 

 

  58
 
In
 the
 next
 set
 of
 examples,
 we
 investigated
 situations
 where
 the
 number
 of
 parameters
 
exceeds
 the
 number
 of
 observations,
 (I=100
 and
 P=500).
 We
 can
 see
 that
 it
 has
 obviously
 not
 
converged.
 When
 we
 ran
 the
 MCMC
 sampling
 scheme
 with
 a
 longer
 burn-­‐in
 period,
 or
 a
 larger
 
sample,
 the
 convergence
 was
 not
 improved.
 

 
Figure
 3.3
 Iteration
 trace
 of
 the
updated
 values
 using
 Newton's
 method
 (I=100,
 P=500)
 

 

 

 

  59
 
3.4
 Discussion
 
In
 this
 project,
 we
 proposed
 to
 build
 a
 Bayesian
 hierarchical
 lasso
 model,
 using
 some
 external
 
information
 informed
 prior
 distribution
 on
 the
 model
 parameters,
 and
 tried
 two
 distinct
 
approaches
 to
 gain
 the
 maximum
 a
 posteriori
 estimate
 of
 the
 hyper-­‐parameters.
 Although
 none
 
of
 the
 proposed
 algorithms
 worked
 well,
 we
 gained
 some
 perspectives
 in
 this
 project.
 

 
In
 machine
 learning
 and
 data
 mining
 literatures,
 the
 shrinkage
 parameter
 (e.g.,
 the
 𝒍𝟏
 penalty
 in
 
lasso)
 is
 usually
 estimated
 by
 cross-­‐validation
 (Hastie,
 Tibshirani,
 &
 Friedman,
 2003).
 Another
 
way
 to
 estimate
 the
 shrinkage
 parameter
 is
 via
 the
 empirical
 Bayes
 method,
 maximizing
 the
 
posterior
 distribution
 of
 the
 hyper-­‐parameters
 (George
 &
 Foster,
 2000).
 In
 this
 project,
 we
 first
 
proposed
 an
 iterative
 algorithm
 to
 find
 the
 maximum
 a
 posteriori
 estimate
 of
 both
 model
 
parameters
 and
 hyper-­‐parameters.
 Since
 we
 did
 not
 obtain
 the
 theoretical
 proof
 that
 such
 an
 
iterative
 algorithm
 will
 converge,
 this
 pilot
 project
 is
 based
 on
 simulation
 and
 application,
 
without
 strong
 theoretical
 foundations.
 The
 algorithm
 iteratively
 maximizes
 two
 functions:
 
arg
!
maxPr(𝜷|𝑫,𝝅)
 and
 arg
!
maxPr(𝝅|𝒁,𝜷).
 In
 the
 implementation,
 we
 found
 that
 the
 
iterative
 algorithm
 did
 not
 converge
 to
 the
 MAP
 all
 the
 time.
 In
 a
 detailed
 analysis,
 we
 see
 that
 it
 
leads
 to
 a
 non-­‐stable
 extreme
 point
 easily.
 The
 Laplace
 prior
 on
 Pr(𝜷|𝝅),
 arg
!
maxPr(𝜷|𝑫,𝝅)
 
will
 set
 most
 of
 the
 model
 parameters
 𝜷
 to
 exactly
 zeros;
 in
 return,
 when
 we
 try
 to
 optimize
 the
 
hyper-­‐parameters,
 arg
!
maxPr(𝝅|𝒁,𝜷),
 as
 when
 the
 model
 parameters
 𝜷
 gets
 set
 to
 zero
 or
 a
 
smaller
 value,
 the
 shrinkage
 parameter
 (corresponding
 to
 𝑒
!𝒁𝝅
)
 are
 driven
 to
 be
 larger,
 which
 in
 
return
 shrinks
 the
 model
 parameters
 𝜷
 even
 more.
 Thus,
 it
 is
 easily
 that
 the
 iterative
 algorithm
 
will
 result
 to
 extreme
 situations
 where
 all
 model
 parameters
 𝜷
 are
 shrunk
 to
 zero
 and
 we
 can’t
 
estimate
 the
 hyper-­‐parameters
 𝝅.
Heuristically, we adapted the generalized Bayesian information criterion to control the model size, but it does not work well either,
 especially
 in
 simulation
 studies
 in
 which
 the
 
sample
 size
 is
 less
 than
 the
 parameter-­‐set
 size.
 
 Another
 possible
 extension
 could
 be
 to
 set
 a
 
constrained
 prior
 distribution
 on
 the
 hyper-­‐parameter
 𝝅.
 But
 it
 would
 be
 more
 computationally
 
complicated
 and
 we
 didn’t
 pursue
 this
 path.
 

 
In
 the
 second
 part
 of
 this
 project,
 we
 proposed
 an
 empirical
 Bayes
 estimator
 on
 the
 hyper-­‐
parameters.
 We
 proposed
 to
 maximize
 the
 marginal
 likelihood
 function
 with
 respect
 to
 the
 
hyper-­‐parameters
 𝝅:
 arg
!
max𝐿 𝝅 ,
 where
 
𝐿 𝝅 = Pr 𝒀𝑮,𝒁,𝝅 = Pr 𝒀𝑮,𝜷 Pr 𝜷 𝒁,𝝅 𝑑𝜷.
 Along
 this
 line
 of
 research,
 we
 proposed
 
two
 different
 estimation
 methods.
 In
 the
 first
 method,
 we
 utilized
 a
 MCMC
 approach
 to
 
approximate
 the
 Pr 𝜷 𝒀,𝑮,𝒁,𝝅  
 ,
 which
 is
 used
 in
 the
 optimization
 of
 𝐿 𝝅 .
 However,
 we
 
encountered
 technical
 problems
 when
 the
 model
 parameter
 space
 of
 𝜷
 is
 too
 large,
 which
 is
 
usually
 the
 case
 for
 GWAS
 dataset.
 In
 such
 circumstances,
 the
 MCMC
 sample
 may
 not
 converge,
 
and so provides
 a
 poor
 approximation
 to
 Pr 𝜷 𝒀,𝑮,𝒁,𝝅 .
 In
 the
 second
 method,
 we
 approximate
 the
 
model
 parameter
 with
 independent
 Normal
 distribution:
 𝛽
!
~𝑁(𝛽
!
, 𝜎
!
!
).
 Thus
 we
 can
 gain
 a
 
closed
 form
 representation
 of
 the
 marginal
 likelihood
 𝐿 𝝅 .
 But
 a
 detailed
 analysis
 shows
 that
 
such
 likelihood
 is
 not
 always
 concave,
 and
 the
 MLE
 does
 not
 always
 exist.
 

 
In this project, we focused on the MAP estimate of the hierarchical Bayesian model, due to its computational efficiency and applicability to a GWAS dataset. Park and Casella have proposed a fully Bayesian lasso model and implemented a Gibbs sampler to sample from the posterior distribution (Park & Casella, 2008). Hans investigated some other aspects of the Bayesian treatment of lasso regression (Hans, 2009). However, their methods are not computationally efficient, and they implemented only a common shrinkage parameter for all variables.
 
Finally, several other generalizations of the lasso methodology have been proposed recently. Yuan and Lin proposed the group lasso, a combination of the l1-norm and l2-norm penalties (Yuan & Lin, 2005b). Meier et al. extended the group lasso to logistic regression (Meier & Van De Geer, 2008). These methods can be used to differentially shrink different groups of variables, but they may not be as flexible as a hierarchical model on the shrinkage parameters.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
Chapter 4 Bayesian Hierarchical Quantile Model to Prioritize GWAS Results

4.1 Introduction
 
Genome-wide association studies have been a standard method for disease gene/variant discovery in the past decade. Although there has been success in discovering many variants, for most common diseases only a moderate amount of heritability has been explained. To discover additional variants, most studies have relied on the accumulation of additional samples to increase power. While this approach has potential, it comes with substantial logistical and monetary costs. As an alternative, biological knowledge, often publicly available, can be used to prioritize SNPs that lack genome-wide statistical significance and sit in the lower Manhattan of a GWAS ranking (Cantor, Lange, & Sinsheimer, 2010). Beyond simply discovering associated SNPs, incorporating biological information has several advantages. SNPs indicated by both association evidence and biological characteristics are potentially more likely to be functional SNPs, with the underlying mechanism suggested by the biological characterization. Furthermore, investigation of the association evidence for groups of SNPs from a GWAS sharing similar biological characteristics can reveal underlying systems or pathway mechanisms; such methods are referred to as pathway analysis or Gene Set Enrichment Analysis (Wang, Li, & Hakonarson, 2010). In the present work, we aim (1) to use biological knowledge to provide a more informative ranking of GWAS results for further identification of causal SNPs; (2) to better explain GWAS results in light of biological knowledge; and (3) to gain insight into biological mechanisms that may be driving GWAS associations.
 
To motivate our approach, suppose that for a particular disease there are 20 causal variants among 1000 loci tested in a GWAS. Also assume there is a biofeature with specificity and sensitivity of (0.8, 0.8) with respect to the causality of the variants, as shown below in Table 4.1.

Table 4.1 2×2 table of causal variants with respect to biofeature

                      Non-causal    Causal
    Non-Biofeature           784         4
    Biofeature               196        16

 
In such a setup, let us assume that when a non-causal variant is tested, the test statistic (usually the Z-statistic in a GWAS) is distributed as a standard Normal deviate, whereas the test statistic corresponding to a causal variant is distributed as Normal with unit variance and mean 3.89. For example, in a case-control study with 8850 cases and 8850 controls, a SNP with OR = 1.2 and MAF = 0.05 would have an expected test statistic of approximately 3.89. Thus one would expect to observe the distribution of test statistics conditional on the biofeature shown in Figure 4.1:

Figure 4.1 An illustration of the effect of biofeature on test statistics. [Figure: empirical CDFs of the test statistics (beta-Z) for the biofeature and non-biofeature groups.]
 
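The expected test statistic of 3.89 quoted above can be checked with a quick back-of-the-envelope calculation based on allele counts; the sketch below uses the standard Woolf standard error for a log odds ratio and is only meant to verify the order of magnitude.

    # Expected Z statistic for a per-allele test with OR = 1.2, MAF = 0.05,
    # and 8850 cases and 8850 controls (2 * 8850 alleles per group).
    or <- 1.2; maf <- 0.05; n_alleles <- 2 * 8850
    p_ctrl <- maf
    p_case <- (or * p_ctrl / (1 - p_ctrl)) / (1 + or * p_ctrl / (1 - p_ctrl))
    a <- p_case * n_alleles; b <- (1 - p_case) * n_alleles   # case alleles
    c <- p_ctrl * n_alleles; d <- (1 - p_ctrl) * n_alleles   # control alleles
    se <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)                # Woolf SE of log OR
    log(or) / se                                             # approximately 3.89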
In this example, if we perform a two-sample t-test of the test statistics over the biofeature groups (to examine the difference in the biofeature/non-biofeature group means), the difference is not significant (p-value = 0.13). However, we do observe a significant difference in the top percentile (results obtained using the rq() function in the R package quantreg, based on the 90th percentile):
 
Table 4.2 A quantile regression on an empirical example

                  Coefficient    95% C.I. lower bound    95% C.I. upper bound
    Intercept            2.66                    2.46                    2.89
    Biofeature           1.03                    0.28                    1.50
 

 
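A fit of this kind can be reproduced along the following lines; the simulated data mirror Table 4.1 and the mean-shift assumption above, so the exact estimates will differ from Table 4.2, which is based on one particular realization.

    # Quantile regression of simulated test statistics on the biofeature at
    # the 90th percentile, as in the example above.
    library(quantreg)
    set.seed(1)
    causal     <- rep(c(0, 1), c(980, 20))
    biofeature <- rbinom(1000, 1, ifelse(causal == 1, 0.8, 0.2))
    z          <- rnorm(1000, mean = 3.89 * causal, sd = 1)
    summary(rq(z ~ biofeature, tau = 0.9))   # biofeature effect on the
                                             # 90th percentile of z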
So we are motivated to use quantile regression where the standard linear means model might fail. In a GWAS, if the effects of a biofeature are concentrated in the top percentiles of the variants tested, it would be natural to use quantile regression to model such an effect. In fact, quantile-dependent penetrance was also found in a recently published genetic association study (Williams, 2012). Specifically, to fully utilize the potential of effects on multiple quantiles, we propose to build a hierarchical model using multiple quantiles simultaneously.
 
4.2 A Bayesian Quantile Regression
 
Quantile regression has been developed intensively as a methodology focused on effects at specific percentiles of a distribution (Koenker, 2005; Cade & Noon, 2003). In the following we review the basic statistical properties of quantile regression. To make the equations more compact, we include the intercept 1 at the beginning of the covariate vector x. Let Q_Y(τ|x) be the τth quantile of Y conditional on X = x (0 ≤ τ ≤ 1); a linear quantile regression model for Q_Y(τ|x) is specified as

    Q_Y(τ|x) = xᵀβ(τ)                                                    (4.1)

where x is a covariate vector of length p+1 (with the first element as the intercept term), and β(τ) is the corresponding coefficient vector.
 
A quantile model focuses on specific quantiles of the conditional distribution of Y|X. The most popular method for estimating the coefficients in (4.1) was proposed by Koenker and Bassett (1978): minimize the loss function Σ_{i=1}^n ρ_τ(y_i − x_iᵀβ(τ)) over the data {(x_i, y_i): 1 ≤ i ≤ n}, where

    ρ_τ(u) = u[τ − I(u < 0)]                                             (4.2)
 

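For reference, the check (pinball) loss of equation (4.2) is a one-line function; minimizing its sum over residuals reproduces the τth sample or conditional quantile.

    # Check loss of equation (4.2): rho_tau(u) = u * (tau - I(u < 0)).
    rho_tau <- function(u, tau) u * (tau - (u < 0))
    # e.g. optimize(function(q) sum(rho_tau(x - q, 0.9)), range(x))$minimum
    # recovers (approximately) the 90th sample percentile of a vector x.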
 
The likelihood above is focused on a single point (the τth percentile) of the conditional distribution, which does not do justice to the potential of the model. Here we would like to implement a model that simultaneously investigates a range of percentiles of the distribution:

    Q_Y(τ_k|x) = xᵀβ(τ_k);   0 < τ_k < 1,  k = 1, 2, 3, …, K             (4.3)

 
Although it is valid to make inference by post-processing individual quantile regression models across a range of percentiles (for K percentiles, 0 < τ_1 < τ_2 < ⋯ < τ_K < 1), philosophical and practical difficulties exist in drawing inference simultaneously on multiple τ values (Yu & Lu, 2003). Another drawback of post-processing is the inability to connect the inference on multiple percentiles simultaneously with the response variable Y, which is particularly important if we are to set up a hierarchical model where the response variable Y is a variable of interest itself (Tokdar & Kadane, 2011).

 
From a Bayesian point of view, the minimization approach above is equivalent to maximizing a likelihood function under an asymmetric Laplace error distribution (Yu & Moyeed, 2001; Tsionas, 2003). Following Yu and Moyeed, one could construct the linear model as

    Y = xᵀβ(τ) + ε                                                       (4.4)

where ε has the asymmetric Laplace distribution with density

    f_τ(ε) = τ(1 − τ) exp{−ρ_τ(ε)}

where ρ_τ(·) is defined in equation (4.2).
 
In such a model, maximizing the likelihood gives the same solution as minimizing the loss function in equation (4.2). Note that the parameter τ determines the shape of the distribution, and the τth quantile of the distribution is 0. Kozumi and Kobayashi adapted a mixture representation of the asymmetric Laplace distribution and developed a Gibbs sampling algorithm (Kozumi & Kobayashi, 2011). Let w be a standard Exponential variable and u a standard Normal variable, and set

    ε = θw + δ√w · u,   where θ = (1 − 2τ)/(τ(1 − τ)) and δ² = 2/(τ(1 − τ))    (4.5)

Then ε has the asymmetric Laplace distribution with the density specified in equation (4.4).
 

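A quick simulation check of the representation in (4.5): with w ~ Exp(1) and u ~ N(0,1), the τth quantile of θw + δ√w·u should be approximately zero, matching the location of the asymmetric Laplace distribution.

    # Monte Carlo check of the Normal-Exponential mixture in equation (4.5).
    set.seed(1)
    tau   <- 0.9
    theta <- (1 - 2 * tau) / (tau * (1 - tau))
    delta <- sqrt(2 / (tau * (1 - tau)))
    w <- rexp(1e6); u <- rnorm(1e6)
    eps <- theta * w + delta * sqrt(w) * u
    quantile(eps, tau)   # close to 0, as implied by an ALD with tau-quantile 0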
 
To analyze multiple quantiles simultaneously in a Bayesian model, Sriram et al. proposed a pseudo asymmetric Laplace density (PALD) as follows (Sriram & Ghosh, 2012; Sriram, Ramamoorthi, & Ghosh, 2011):

    PALD_{τ_1,τ_2,…,τ_K}(ε_1, ε_2, …, ε_K) = ∏_{k=1}^K exp{−ρ_{τ_k}(ε_k)} = exp{−Σ_{k=1}^K ρ_{τ_k}(ε_k)}    (4.6)

where 0 < τ_1 < τ_2 < ⋯ < τ_K < 1, and ρ_τ(·) is defined in (4.2).

 
It is a pseudo density because the normalization constant is not specified in the equation. However, they proved that the posterior is consistent under proper priors. Making use of the mixture representation of the asymmetric Laplace distribution in equation (4.5), the PALD can be written with

    ε_k = θ_k w_k + δ_k √(w_k) u_k                                        (4.7)

where θ_k = (1 − 2τ_k)/(τ_k(1 − τ_k)), δ_k² = 2/(τ_k(1 − τ_k)), the w_k are i.i.d. standard Exponential variables and the u_k are i.i.d. standard Normal variables for all k.

 
Thus, we can write the Normal-Exponential mixture representation of the pseudolikelihood for the data {(x_i, y_i): 1 ≤ i ≤ n} as follows:

    P(y | X, β(τ_1), …, β(τ_K)) ∝ ∏_{i=1}^n ∏_{k=1}^K φ{ (y_i − x_iᵀβ(τ_k) − θ_k w_ik) / (δ_k √(w_ik)) }

    where w_ik ~ iid Exp(1) ∀ i,k;  θ_k = (1 − 2τ_k)/(τ_k(1 − τ_k)) and δ_k² = 2/(τ_k(1 − τ_k)) ∀ k    (4.8)

 
Such a mixture representation is very useful in several respects:
(1) If we have Normal prior distributions on the quantile regression parameters, they are conjugate with our pseudo-likelihood function, which leads to an efficient Gibbs sampling procedure.
(2) As shown in the following section, in a hierarchical model where the first level is a linear model and the second level is a quantile regression model, the conditional and posterior distributions are easily constructed due to the conjugacy.
 
4.3 A Bayesian Hierarchical Model with Multiple Quantiles
 
Based on the pseudo asymmetric Laplace density (PALD), a simple hierarchical extension of the quantile regression model is proposed in this section. Suppose the observed marginal test statistics T_1, T_2, …, T_n have a linear relationship with latent variables V_1, V_2, …, V_n, and the latent variables V_1, V_2, …, V_n have a distribution that can be explained by a quantile regression on explanatory variables X_1, X_2, …, X_n. Then we can construct a hierarchical model as follows:

    T_j | V_j ~ Normal(V_j, σ²)

    V_j | X_j ∝ ∏_{k=1}^K φ{ (V_j − X_jᵀβ(τ_k) − θ_k W_jk) / (δ_k √(W_jk)) }

    where W_jk ~ iid Exp(1) ∀ j,k;  θ_k = (1 − 2τ_k)/(τ_k(1 − τ_k)) and δ_k² = 2/(τ_k(1 − τ_k)) ∀ k.    (4.9)

The prior distribution for the parameters in the model is specified as

    β(τ_k) ~ Normal(0, λ_β I)   ∀ k,

    where λ_β is a hyper-parameter fixed at a non-informative value.    (4.10)

Based on the model and the priors, MCMC using Gibbs sampling is straightforward. It becomes simpler when we sample the latent variables V_j along with the other variables. The full conditional distributions of the parameters of interest are as follows.
 
The full conditional distribution of V_j is

    V_j | T_j, X_j, W_j1, W_j2, …, W_jK ~ Normal(D_j⁻¹ E_j, D_j⁻¹)

    where D_j = σ⁻² + Σ_{k=1}^K 1/(δ_k² W_jk)  and  E_j = T_j/σ² + Σ_{k=1}^K (X_jᵀβ(τ_k) + θ_k W_jk)/(δ_k² W_jk)    (4.11)
 
The full conditional distribution of W_jk⁻¹ is

    W_jk⁻¹ | V_j, X_j, τ_k ~ Inverse Gaussian(λ_k, μ_jk)

    where λ_k = (θ_k² + 2δ_k²)/δ_k²  and  μ_jk = √(θ_k² + 2δ_k²) / |V_j − X_jᵀβ(τ_k)|    (4.12)
 
Let β = (β(τ_1), β(τ_2), …, β(τ_K)). The full conditional distribution of β is

    β | T, X, V, W ~ Normal(A_β⁻¹ M_β, A_β⁻¹)

    where A_β = diagonal{ Σ_{j=1}^n X_j X_jᵀ/(δ_k² W_jk),  k = 1, 2, …, K } + λ_β⁻¹ I

    and M_β = column{ Σ_{j=1}^n (V_j − θ_k W_jk) X_j/(δ_k² W_jk),  k = 1, 2, …, K }    (4.13)
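
The full conditionals (4.11)-(4.13) translate directly into a Gibbs sampler. The sketch below is a minimal illustration, assuming σ² and λ_β are fixed (e.g., σ² = 1 for Z-statistics and a large λ_β), an intercept column included in X, and statmod::rinvgauss for the inverse-Gaussian draws; it is not the implementation used for the analyses reported here.

    # Minimal Gibbs sampler sketch for the full conditionals (4.11)-(4.13).
    library(statmod)   # rinvgauss(n, mean, shape)
    library(MASS)      # mvrnorm

    gibbs_hqr <- function(T_obs, X, taus, sigma2 = 1, lambda_beta = 100,
                          n_iter = 2000) {
      n <- length(T_obs); p <- ncol(X); K <- length(taus)
      theta  <- (1 - 2 * taus) / (taus * (1 - taus))
      delta2 <- 2 / (taus * (1 - taus))
      V <- T_obs; W <- matrix(1, n, K); beta <- matrix(0, p, K)
      V_store <- matrix(0, n_iter, n)
      for (it in seq_len(n_iter)) {
        XB  <- X %*% beta                              # n x K, X_j' beta(tau_k)
        d2W <- sweep(W, 2, delta2, "*")                # delta_k^2 * W_jk
        # (4.11): latent "true" statistics V_j
        D <- 1 / sigma2 + rowSums(1 / d2W)
        E <- T_obs / sigma2 + rowSums((XB + sweep(W, 2, theta, "*")) / d2W)
        V <- rnorm(n, mean = E / D, sd = sqrt(1 / D))
        # (4.12): mixing weights, drawing W_jk^{-1} from an inverse Gaussian
        for (k in seq_len(K)) {
          lam_k <- (theta[k]^2 + 2 * delta2[k]) / delta2[k]
          mu_jk <- sqrt(theta[k]^2 + 2 * delta2[k]) / abs(V - XB[, k])
          W[, k] <- 1 / rinvgauss(n, mean = mu_jk, shape = lam_k)
        }
        # (4.13): one block of quantile-regression coefficients per tau_k
        for (k in seq_len(K)) {
          prec <- crossprod(X / sqrt(delta2[k] * W[, k])) + diag(1 / lambda_beta, p)
          m    <- crossprod(X, (V - theta[k] * W[, k]) / (delta2[k] * W[, k]))
          Sig  <- solve(prec)
          beta[, k] <- mvrnorm(1, as.vector(Sig %*% m), Sig)
        }
        V_store[it, ] <- V
      }
      list(V = V_store, beta = beta)
    }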

 

 
4.4 Simulation Studies Based on a Normal Approximation

4.4.1 Design of the simulation study
 
Suppose we have d = 1, …, D bio-features X_d, each with a specific informativeness (specificity and sensitivity) with respect to the causality of a SNP, as shown in the motivation section (Table 4.1 and Fig 4.1). We first simulate causal SNPs and non-causal SNPs as an indicator variable:

    I_j ~ Bernoulli(p)                                                    (4.14)

where p is a pre-specified value based on the probability of causality in the set of SNPs.
 
Then, based on the causality of the SNPs, we simulate the values of X_jd according to how informative they are (specificity and sensitivity). The unobserved latent variable V_j is simulated as the true test statistic (not the effect size, but the statistic corresponding to the p-value if we were to test the SNP in a hypothetical study), which is linearly associated with the causality of the SNP:

    V_j ~ Normal(λ I_j, σ_V²)                                             (4.15)

And the observed variable T_j is a sampled observation around the true latent variable:

    T_j ~ Normal(V_j, σ²)                                                 (4.16)
 

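The data-generating process in (4.14)-(4.16), together with biofeatures drawn from their sensitivity and specificity, can be sketched as follows; the default values (20 causal SNPs out of 1000, mean shift 3.89, unit variances) follow the motivating example, while any other settings are illustrative.

    # Sketch of the simulation design in (4.14)-(4.16).
    simulate_data <- function(n_snp = 1000, p_causal = 0.02, lambda = 3.89,
                              sens = 0.8, spec = 0.8, D = 1,
                              sigma_v = 1, sigma = 1) {
      I <- rbinom(n_snp, 1, p_causal)                      # (4.14) causality
      # biofeatures: P(X=1 | causal) = sens, P(X=1 | non-causal) = 1 - spec
      X <- sapply(seq_len(D), function(d)
             rbinom(n_snp, 1, ifelse(I == 1, sens, 1 - spec)))
      V <- rnorm(n_snp, mean = lambda * I, sd = sigma_v)   # (4.15) latent
      T_obs <- rnorm(n_snp, mean = V, sd = sigma)          # (4.16) observed
      list(I = I, X = X, V = V, T = T_obs)
    }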
 
Because the causality indicator I is associated with the bio-features X, and because causality mainly affects the tail of the distribution of the observed variable T (the test statistic from a univariate marginal association test), we use a hierarchical quantile regression model focused on the upper quantiles of the conditional distributions, as in equation (4.11). Our main interest lies in the ability to prioritize the causal SNPs, so we focus on the posterior distribution of the latent variable V. By counting the number of causal SNPs among the top K values of the latent variable, we are able to compare with other popularly used methods of prioritization:
(1) A linear hierarchical model (in which the first level is a linear model on the latent variable and the second level is a linear regression on the bio-features)
(2) A combination of p-values and filters (using the bio-feature filtering as well as the ranking of the marginal p-values)
(3) A quantile hierarchical model using multiple quantile points (in which the first level is a linear model on the latent variable and the second level is a quantile regression on the bio-features)

 
We first performed a simulation study with only 1 biofeature. In this study, the biofeature was specified with a fixed sensitivity and specificity (with respect to the causality of a SNP), and the data were analyzed using the three methods mentioned above. In the Bayesian hierarchical quantile regression, we simultaneously focus on quantiles in the upper tail of the conditional distribution (τ = 0.7, 0.8, 0.9, 0.92, 0.95, 0.98). After drawing the posterior samples of the latent variable, we rank the SNPs by the posterior mean of the latent variable V. For each scenario, we performed 100 replications. To summarize the results, we count the number of causal SNPs in the top K (20, 50, 100 or 200) ranked SNPs from each of the three ranking methods, and compare them to that number from the data alone (the marginal p-value, or the observed test statistic T).
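Putting the two sketches above together, one replicate of this design can be run and summarized roughly as follows (purely illustrative; the burn-in length and other settings are arbitrary choices).

    # Illustrative use of the sketches above: simulate one replicate, fit the
    # multi-quantile model on the upper-tail quantiles, and rank SNPs by the
    # posterior mean of the latent variable V.
    set.seed(42)
    dat  <- simulate_data(n_snp = 1000, p_causal = 0.02, sens = 0.8, spec = 0.8)
    taus <- c(0.7, 0.8, 0.9, 0.92, 0.95, 0.98)
    X    <- cbind(1, dat$X)                      # intercept + biofeature(s)
    fit  <- gibbs_hqr(dat$T, X, taus, n_iter = 2000)
    post_mean_V <- colMeans(fit$V[-(1:500), ])   # drop burn-in
    top20 <- order(post_mean_V, decreasing = TRUE)[1:20]
    sum(dat$I[top20])                            # causal SNPs among the top 20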

 

 
We also performed simulation studies with 5 homogeneous biofeatures (having the same specificity and sensitivity), and with 5 heterogeneous biofeatures (having different specificity and sensitivity).
 
4.4.2 Simulation Results
 
For prioritization, our major interest lies in the ranking of the causal variants. We simulated 20 out of 1000 SNPs as causal. In each of the heat maps that follow, we show the difference between the number of causal SNPs in the top K (20, 50 or 100) SNPs from the 3 proposed methods above and the corresponding number from a simple p-value ranking, over an array of specificity and sensitivity values.
 
The simulation results with only 1 biofeature are shown in Fig 4.2. Based on the ranking from the Bayesian hierarchical linear regression model, the number of causal SNPs is almost the same as when using the marginal data alone. In general, the filtering method yields significantly more causal SNPs if the biofeature is informative (high sensitivity and specificity), but loses many causal SNPs if the biofeature is not informative (low sensitivity and specificity). The Bayesian hierarchical quantile regression produces a marked gain of causal SNPs if the biofeature is informative, with very little loss if the biofeature is not informative at all. For example, at a specificity and sensitivity of 0.8 (relatively informative), the filtering and quantile regression approaches on average identify 2 more causal SNPs in the top 20 than the p-value ranking alone, whereas the linear regression approach identifies the same number of causal SNPs as the p-value ranking. When the specificity and sensitivity are 0.5 (not informative), on average the filtering approach identifies 3 fewer causal variants in the top 20 compared to the p-value ranking alone; in contrast, the linear regression and quantile regression approaches identify the same number of causal variants in the top 20 as the p-value ranking in this scenario. We observed the same pattern of gains and losses in the identification of causal variants in the top 100 rankings. These results suggest that the Bayesian hierarchical quantile regression is a safe method to boost the discovery of causal SNPs by incorporating the external information from biofeatures.
 
Figure 4.2 Gain of number of causal SNPs in the top 20 (100) ranked SNPs (with 1 biofeature). [Heat maps of the gain over p-value ranking for the hierarchical linear regression, bioinformatics-based, and simultaneous quantile regression methods, against specificity (spec) and sensitivity (sens).]
 
As shown in Fig 4.3, we observed similar patterns of identification in the simulation studies with 5 homogeneous biofeatures. In these simulation studies, all 5 biofeatures have the same specificity and sensitivity with respect to the causality of a SNP. Because of the additive effects of the biofeatures, the differences in causal SNP identification between the approaches are even more distinct. For example, at a specificity and sensitivity of 0.8, the filtering approach identifies 7 more causal SNPs in the top 20 list on average compared to the p-value ranking; the quantile regression approach identifies 5 more causal SNPs; and the linear regression approach identifies almost the same number of causal SNPs. On the other hand, when the biofeatures are not informative (specificity and sensitivity of 0.5), the filtering approach identifies 7 fewer causal SNPs than the p-value ranking, whereas the quantile regression and linear regression approaches identify the same number of causal variants as the p-value ranking.
 
Figure 4.3 Gain of number of causal SNPs in the top 20 (100) ranked SNPs (with 5 homogeneous biofeatures). [Heat maps as in Figure 4.2, for the hierarchical linear regression, bioinformatics-based, and simultaneous quantile regression methods.]
 
Next we considered heterogeneous biofeatures, of which three have the sensitivity and specificity fixed at 0.8 while the sensitivity and specificity of the other two vary across the scenarios. The results are summarized in Fig 4.4. In these figures, the sensitivity and specificity shown correspond to the two varying biofeatures. Due to the presence of 3 informative biofeatures in all scenarios, the loss of causal SNPs in the bioinformatics method is not as obvious as above. However, the general pattern remains: in the identification of causal SNPs, the Bayesian hierarchical quantile regression is still more robust than the other two methods.
 
Figure 4.4 Gain of number of causal SNPs in the top 20 (100) ranked SNPs (with 5 heterogeneous biofeatures). [Heat maps as in Figure 4.2; the axes refer to the specificity and sensitivity of the two varying biofeatures.]
4.5 Simulation Studies Based on Marginal GWAS Estimation

4.5.1 Design of the simulation study
 
In the previous section, we performed simulations based on a normal approximation of the marginal test statistics. In theory, with large sample sizes, the marginal test statistics should be asymptotically normally distributed, with means equal to the true test statistics. However, the sampling variation of the test statistics varies, which greatly affects the power of the different statistical approaches. Thus, in this simulation study, we demonstrate the use of the Bayesian hierarchical machinery directly on a simulated GWAS dataset. As in Section 4.4, each of the D bio-features is somewhat informative (or non-informative) with respect to the causality of a SNP. As in equation (4.14) of Section 4.4, we first simulate causal and non-causal SNPs as an indicator variable, and then we simulate the values of the biofeatures X_jd according to how informative they are (specificity and sensitivity). The unobserved effect size β_j is simulated according to the causality of the SNP as well:

    β_j = log(1.2) × I_j                                                  (4.17)
 
Assuming either independent or correlated SNPs, we simulate the genotypes G_i for each individual i. Based on a logistic disease model, we then simulate a case-control GWAS dataset:

    Pr(Y_i = 1) = exp{(G_i − Ḡ)ᵀβ} / [1 + exp{(G_i − Ḡ)ᵀβ}]               (4.18)
 
Based on the GWAS dataset, we can apply the standard 2-df logistic regression. The observed variable T_j is the marginal test statistic from a separate univariate logistic regression for each SNP j. We use a hierarchical quantile regression model focused on the upper quantiles of the conditional distributions, as in Equation (4.11). As in Section 4.4, we compare this method with the other popularly used methods.
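A rough sketch of this data-generating step for independent SNPs is given below; for simplicity it uses a 1-df per-SNP allelic logistic test to produce the marginal Z statistics, and the sample size and allele frequencies are illustrative rather than the settings used in the reported simulations.

    # Sketch of the GWAS simulation in (4.17)-(4.18) with independent SNPs,
    # followed by per-SNP logistic regressions to obtain marginal Z statistics.
    simulate_gwas <- function(n_ind = 4000, n_snp = 1000, p_causal = 0.02,
                              maf = 0.3) {
      I <- rbinom(n_snp, 1, p_causal)
      beta <- log(1.2) * I                                    # (4.17)
      G <- matrix(rbinom(n_ind * n_snp, 2, maf), n_ind, n_snp)
      eta <- scale(G, center = TRUE, scale = FALSE) %*% beta  # centered genotypes
      Y <- rbinom(n_ind, 1, plogis(eta))                      # (4.18)
      z <- apply(G, 2, function(g)
             summary(glm(Y ~ g, family = binomial))$coefficients["g", "z value"])
      list(I = I, T = z)
    }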
 
4.5.2 Simulation Results for Independent SNPs
 
In the simulation studies, we simulated 1, 5 or 20 causal SNPs out of 1000 SNPs. All SNPs are assumed to be independent of each other. We specified biofeature(s) with different specificity and sensitivity with respect to the causality of a SNP. In the following, we show heat maps of the increase in the number of causal SNPs in the top 1, 5 or 20 SNPs.
 
In the first study, we simulated only 1 biofeature. After drawing the posterior samples of the latent variable, we ranked the SNPs by their posterior mean of the latent variable V. For each scenario (with different sensitivity and specificity), we performed 100 replications. To summarize the results, we count the number of causal SNPs in the top K (1, 5 or 20) ranked SNPs from each of the three ranking methods, and compare them to that number from the data alone (the marginal p-value, or the observed test statistic T).
 
The results are shown in Fig 4.5. The Bayesian hierarchical linear regression prioritizes significantly more causal SNPs if the biofeature is informative (high specificity and sensitivity), but suffers a significant deficit with non-informative biofeatures (low specificity and sensitivity). Similar patterns are seen for the bioinformatics filtering method. The Bayesian hierarchical quantile regression yields a marked gain of causal SNPs if the biofeature is informative, with very little loss if the biofeature is not informative at all. For example, in the scenarios with 20 causal SNPs and a specificity and sensitivity of 0.9, all three approaches increase the number of identified causal SNPs by 4~5 on average compared to the p-value-only approach. When the specificity and sensitivity are 0.5, the quantile regression approach identifies the same number of causal SNPs, whereas the linear hierarchical and filtering approaches identify 1~2 fewer causal SNPs than the p-value-alone approach on average. The results from a simulated GWAS dataset directly show that the Bayesian hierarchical quantile regression is a safe method to improve the identification and prioritization of causal SNPs by incorporating the external information from biofeatures.
 
Figure 4.5 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs (with 1 biofeature and 1, 5 and 20 causal SNPs, respectively). [Heat maps of the gain over p-value ranking for the hierarchical linear regression, bioinformatics-based, and simultaneous quantile regression methods, against specificity (spec) and sensitivity (sens).]
 
We performed similar simulation studies with 2 homogeneous or heterogeneous biofeatures. Again, the quantile regression approach increases the number of causal SNPs identified when the biofeatures are informative and maintains the same number of identified causal SNPs when the biofeatures are non-informative, whereas the linear regression and filtering approaches suffer a significant loss in the identification of causal SNPs if the biofeatures are not informative. The results are summarized in Fig 4.6 and Fig 4.7.
 
Figure 4.6 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs (with 2 homogeneous biofeatures and 1, 5 and 20 causal SNPs, respectively). [Heat maps as in Figure 4.5.]
Figure 4.7 Gain of number of causal SNPs in the top 1, 5 and 20 ranked SNPs (with 2 heterogeneous biofeatures and 1, 5 and 20 causal SNPs, respectively). [Heat maps as in Figure 4.5.]
 
4.5.3 Simulation Results for Correlated SNPs
 
As in Section 4.5.2, we simulated 1, 5 or 20 causal SNPs out of 1000 SNPs. To mimic the LD structure in genetic data, we assume a simple correlation structure among the SNPs: within an LD region, the SNPs are equally correlated with each other, while SNPs in different LD regions are uncorrelated. We applied similar analysis models to these data, and the results are summarized below.

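A minimal sketch of genotype simulation with this block-exchangeable structure is given below: SNPs within an LD region share a latent Gaussian factor (correlation ρ on the latent scale, which only approximates the genotype-scale correlation), and regions are independent. The region size, ρ, and MAF are illustrative.

    # Genotypes with block-exchangeable LD; assumes n_snp is a multiple of
    # region_size.  Each haplotype allele is obtained by thresholding a
    # latent Gaussian with within-block correlation rho.
    simulate_ld_genotypes <- function(n_ind = 4000, n_snp = 1000,
                                      region_size = 5, rho = 0.5, maf = 0.3) {
      n_region <- n_snp / region_size
      one_haplotype <- function() {
        do.call(cbind, lapply(seq_len(n_region), function(r) {
          common <- rnorm(n_ind)
          z <- sqrt(rho) * common +
               sqrt(1 - rho) * matrix(rnorm(n_ind * region_size), n_ind)
          (z < qnorm(maf)) * 1        # allele indicator with frequency maf
        }))
      }
      one_haplotype() + one_haplotype()  # genotype = sum of two haplotypes
    }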
 
We performed simulations with LD regions of 5 SNPs and within-region correlations of 0.2 and 0.5; the results are summarized in Fig 4.8 and Fig 4.9, respectively. Specifically, for a within-region correlation of 0.5 with specificity and sensitivity of 0.5 (non-informative biofeature), the linear regression and filtering approaches identify 1 fewer causal SNP in the top 20 ranked SNPs than the p-value-only approach on average, whereas the quantile regression approach identifies almost the same number of causal SNPs on average. When the specificity and sensitivity are 0.9 (informative biofeature), all three approaches increase the number of identified causal SNPs by 4~5. For the quantile regression approach, when the sensitivity is low (0.5), the number of identified causal SNPs increases with increasing specificity; in contrast, at low specificity (0.5), higher sensitivity does not increase the number of identified causal SNPs.
 

 

 

 

 

Figure 4.8 Gain of number of causal SNPs in the top 20 ranked SNPs (with 1 biofeature, with LD structure of region size 5 and correlation 0.2). [Heat maps for the hierarchical linear regression, bioinformatics-based, and simultaneous quantile regression methods, against specificity (spec) and sensitivity (sens).]

Figure 4.9 Gain of number of causal SNPs in the top 20 ranked SNPs (with 1 biofeature, with LD structure of region size 5 and correlation 0.5). [Heat maps as in Figure 4.8.]
We observed similar patterns for the prioritization of causal SNPs with different LD structures and different numbers of biofeatures (e.g., with a region size of 10, a within-region correlation of 0.2 or 0.5, and 2 homogeneous/heterogeneous biofeatures, as shown in the appendix figures).
 
4.6 An Applied Study on the Colorectal Cancer Family Registry
 
Colorectal cancer is the second leading cause of cancer death in developed countries, with the lifetime risk estimated to be 5% to 6% (Altekruse et al., 2007). To date, genome-wide association studies (GWAS) have identified fourteen low-penetrance genetic variants that together explain approximately 8% of the familial association of this disease (Peters et al., 2011). The Colorectal Cancer Family Registry (Colon CFR) is an international consortium of six centers in North America and Australia formed as a resource to support studies on the etiology, prevention and clinical management of colorectal cancer (Newcomb et al., 2007). The Colon CFR includes both population-based and clinic-based colorectal cancer cases. To date, 8745 population-based families and 1861 clinic-based families have participated in this study (http://www.coloncfr.org/index.php/summary-data). Despite the large sample sizes, only a few previously reported GWAS hits have been significantly replicated in the Colon CFR, and no new loci have been discovered at genome-wide levels of significance. Here we propose to use the hierarchical quantile regression to prioritize the GWAS results from the Colon CFR, using biofeatures corresponding to the genomic annotations of each SNP. To provide a fair comparison, we used the change in ranking of the SNPs compared with a "gold standard" based on a recent meta-analysis of several colon cancer GWAS studies: MECC (Middle East Cancer Consortium), the Colon CFR, the KY Colon Cancer Study and GECCO (Genetics and Epidemiology of Colorectal Cancer Consortium). In the meta-analysis, we identified a few variants that pass the genome-wide significance level, serving as the "causal" variants to be detected.
 
We constrained our analysis to the subset of SNPs that passed the nominal significance level (p-value < 0.05) in the meta-analysis. The corresponding SNPs (431,813 SNPs in total) in the Colon CFR dataset were used for the analysis. We applied a pruning procedure to the SNP data based on their LD structure, which resulted in 44,010 relatively uncorrelated regions. In the pruning process, we looped through all the SNPs in order of increasing p-value, and defined regions as comprising all SNPs whose correlation exceeded a threshold. As the test statistic for each region, we used the smallest p-value within that region. We utilized information on 5 biofeatures in this analysis: DNase 1 promoter, enhancer, exon, promoter and TCF7L2 binding site. For each region, we summarized the biofeature information on a present/absent basis: the biofeature of the region was coded as 1 if any SNP in the region had the biofeature, and 0 otherwise.
 
We summarize the frequencies of the 5 biofeatures across all 44,010 regions in the following table. The rarest biofeature is DNase 1 promoter, present in less than 2% of the regions, whereas the most common biofeature is exon, present in about 8% of the regions.
 
Table 4.3 Frequencies of 5 Biofeatures in 44,010 Regions

                      DNase 1 Promoter    Enhancer     Exon    Promoter    TCF7L2 Binding Site
    Non-Biofeature               43252       42638    40858       41802                  41887
    Biofeature                     758        1372     3152        2208                   2123
 

 

 
With this biofeature information, we applied the hierarchical quantile regression model. We summarize the quantile effects of the 5 biofeatures in Table 4.4 below. Four of the 5 biofeatures are consistently positively associated with the test statistics across the quantiles, while DNase 1 is consistently, but weakly, negatively associated.
 
Table 4.4 Mean Estimates β_d(τ_k) and 95% Intervals for the Biofeature X_d Effects on Test Statistics at Multiple Quantiles τ_k

    Feature X_d            τ=0.70            τ=0.80            τ=0.90            τ=0.92            τ=0.95            τ=0.98
    DNase 1 Promoter       -0.027            -0.027            -0.029            -0.019            -0.014            -0.010
                           (-0.171, 0.115)   (-0.175, 0.128)   (-0.198, 0.169)   (-0.195, 0.183)   (-0.217, 0.199)   (-0.295, 0.359)
    Enhancer                0.100             0.101             0.097             0.102             0.111             0.111
                           (0.005, 0.206)    (0.002, 0.203)    (-0.021, 0.218)   (-0.028, 0.237)   (-0.037, 0.264)   (-0.075, 0.324)
    Exon                    0.045             0.047             0.053             0.053             0.053             0.056
                           (-0.027, 0.115)   (-0.021, 0.119)   (-0.034, 0.132)   (-0.042, 0.140)   (-0.051, 0.160)   (-0.083, 0.200)
    Promoter                0.089             0.088             0.089             0.090             0.088             0.111
                           (-0.005, 0.180)   (-0.008, 0.184)   (-0.022, 0.212)   (-0.020, 0.202)   (-0.035, 0.214)   (-0.061, 0.303)
    TCF7L2 Binding Site     0.040             0.042             0.046             0.044             0.047             0.061
                           (-0.045, 0.120)   (-0.055, 0.133)   (-0.055, 0.154)   (-0.064, 0.157)   (-0.074, 0.359)   (-0.106, 0.239)
 
We focused the analysis on the top regions identified by the meta-analysis as the "gold standard". For example, for the top 100 regions from the meta-analysis, we found:
1) The change in rank due to the data for these regions was associated with the biofeatures DNase 1 promoter, enhancer, and TCF7L2. For example, SNPs within a DNase 1 promoter region showed a mean rank improvement of 4520 positions in the meta-analysis compared to the crude ranking from the Colon CFR study, indicating that this biofeature is the most informative.
2) The quantile regression analysis improved the rankings of these regions by 428 positions on average compared to the Colon CFR study.
3) Moreover, among those of the 100 regions that showed an improvement in ranking in the meta-analysis compared to the Colon CFR study (i.e., regions indicated by the data to be more likely truly associated), the average improvement in ranking from the quantile regression analysis was 1093 positions.
4) Finally, the CCFR analysis had only 2 of these regions in its top 100, whereas the quantile regression had 3.
We further plotted the improvements visually in the following figure. For the top regions identified by the meta-analysis, we show the changes in ranking from the Colon CFR dataset to the quantile regression results. The biofeatures are indicated as dots alongside each region. In general, among the top regions identified by the meta-analysis, those with biofeatures are associated with an improved ranking after applying the quantile regression.

Figure 4.10 Gain of number of causal SNPs in the top 200 meta-analysis ranked regions. [Rank changes on a log2 scale from the CCFR data to the quantile model, shown separately for the top 100 and the top 100-200 meta-analysis regions; dots mark the dnase1, enhancer, exon, promoter, and tcf7l2 biofeatures and whether the rank improved (rank+) or worsened (rank-).]
 
We also examined the changes in rank for the known SNPs/regions. According to Peters et al. (2011), 18 regions have been reported to be associated with colon cancer risk, of which 14 regions were included in the subset of SNPs used in our analysis. The changes in rank for the known hits are shown in Table 4.5. Compared to the CCFR data, quantile regression improved the ranking of the known hits by 2164 positions on average, including 1 additional known hit in the top 100 ranking (12q13.13) and 2 additional known hits in the top 20,000 (16q22.1/CDH1 and 19q13.1/RHPN2). Admittedly, the meta-analysis does better, with a mean rank increase of 15,137, including 8 additional known hits in the top 100 and 6 additional in the top 20,000. But this highlights our proposal: if you have the resources (logistically and monetarily), then collecting more samples is a good way to go; in the absence of such resources, incorporation of external information can help.
 
Table 4.5 Changes in rank for the known hits (genes/regions)

    Known Hits (gene/region)      Rank in CCFR Data    Rank in Meta-analysis    Rank in Quantile Regression
    6p21                                      14613                     5541                          10271
    8q23.3/EIF3H                            35021.5                        5                          34042
    8q24/MYC                                   8436                        3                           2810
    9p24                                      14405                    235.5                          14957
    11q13.4                                 30372.5                       27                          28967
    11q23                                      5934                        4                           3852
    12q13.13                                    369                      783                             53
    14q22.2/BMP4                              10114                      118                           6286
    14q22.2/BMP4                              11311                     6302                          11921
    15q13/CRAC1/HMPS/GREM1                    26333                     1047                          24388
    15q13/CRAC1/HMPS/GREM1                     9511                     5699                           9819
    15q13/CRAC1/HMPS/GREM1                    15309                        7                          12170
    16q22.1/CDH1                              23764                      970                          18186
    18q21/SMAD7                                1996                        1                           2162
    19q13.1/RHPN2                           20965.5                      477                          17134
    20p12.3/BMP2                              18163                       10                          17465
    20p12.3/BMP2                            28540.5                      591                          28581
    20q13.33/LAMA5                            19184                       45                          12321
4.7 Discussions
 
Many approaches have been proposed for selecting a subset of SNPs from GWAS results to follow up (Cantor et al., 2010). Most studies rank the p-values and possibly filter on prior information. In our work, we adapted a Bayesian hierarchical multi-quantile regression model to incorporate biofeature information, and applied it to a GWAS dataset through the marginal test statistics. In simulation studies, we found that the Bayesian hierarchical multi-quantile regression model boosts the ranking of causal SNPs when the biofeatures are informative, and preserves the original ranking of causal SNPs when the biofeatures are not informative. It is therefore more robust and powerful for prioritization of causal SNPs than filtering or the use of linear models. Here we summarize a few advantages of our method:
(1) It is a probabilistic model, so the results are intuitively easy to interpret.
(2) The model is easy to fit and can be applied to continuous biofeatures (an important advantage over the filtering approach).
(3) It performs better in terms of the ranking of causal SNPs, especially when we have multiple biofeatures whose information quality (specificity and sensitivity with respect to the causality of a SNP) varies.
 
In the application to the Colon CFR dataset, we found that the quantile regression model improved the ranking of the SNPs/regions identified by the meta-analysis. Four of the 5 biofeatures were positively correlated with the change in rank (from the Colon CFR dataset to the meta-analysis), and the quantile regression with the relevant biofeatures improved the ranking of the corresponding regions.

 
However, at the current stage of development, the Bayesian hierarchical quantile regression still faces many limitations and challenges. Due to the computational burden, our method is restricted to datasets of moderate size in terms of the number of SNPs. Given a complete GWAS dataset with a massive number of SNPs, it is necessary to select a subset of candidate SNPs before we can apply our method. In addition, SNPs in LD affect the performance of the approach, as we assumed independence of the variables in the quantile regression model. We therefore proposed to prune the GWAS dataset based on the LD structure and to summarize the biofeature information of each LD block for the hierarchical model.
It is very important to select informative biofeatures in all of the proposed approaches. For example, when we perform genetic association studies of colorectal cancer, we are interested in biofeatures that are associated with the biological mechanisms of colorectal cancer development, such as enhancer or promoter regions. However, it is difficult to predict how informative a biofeature will be, since biological knowledge about the trait of interest is still limited. In this situation, the hierarchical quantile regression serves as a robust approach for SNP prioritization, as it estimates the quantile effects of each biofeature and pays little penalty when a biofeature is not informative.

 

 

 

 

 
Chapter 5 Future research in related areas
 
Within the scope of this dissertation, we proposed statistical methods to address three specific problems in genetic association studies. Most GWAS results are analyzed one SNP at a time, which does not fully exploit the potential of GWAS to identify multiple causal variants; we therefore proposed a Bayesian variable selection framework for GWAS and built a multiple regression model incorporating external biological information. To follow up on a GWAS and search for causal variants/regions, we proposed a Bayesian quantile regression model to prioritize GWAS results. In addition, we specifically considered a case-control study design with DNA pools sequenced using next-generation sequencing technology, with improved statistical power to detect causal variants.
 
In general, genetic association studies are advancing with vastly more data and more complicated structures. Next-generation sequencing will soon be cheap enough for whole-genome sequencing studies on a population basis; discoveries in molecular pathways and biological mechanisms add up to an increasingly complicated system of biological knowledge; innovations in medical monitoring systems will generate complex biomarker and disease data; and advances in electronic medical record systems make large samples of hospital data easily available for research.
 
With all these developments, comprehensive approaches are needed to integrate multi-dimensional information. Bayesian hierarchical models will be very useful and coherent tools for this purpose; at the same time, they face many challenges, especially in scaling up to whole-genome analyses. Future research in this area needs to deal with the following specific issues.

 
5.1 Variable selection problem for a large dataset
 
 
In a whole-genome sequencing study, a dataset could potentially include millions of genomic variants as variables. For example, phase I of the 1000 Genomes Project identified 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions (Consortium et al., 2012). An efficient variable selection approach would greatly alleviate the computational burden, as well as improve the statistical power both for building a good prediction model and for finding the causal variants.

 
In Chapter 3 of this dissertation, we proposed a Bayesian lasso with a biology-driven penalty term as the prior distribution. The method presented there is a first step, as computational and theoretical restrictions limit the number of variables it can handle. As an extension in this direction, an empirical estimator for the Informative Bayesian Lasso might be developed and applied to the full dataset: we might estimate the penalty based on the marginal estimates from the data, and then apply the empirically estimated penalty to construct an empirical maximum a posteriori estimator.

 
Similarly, for the fully Bayesian analysis in Chapter 4, given a study with hundreds of thousands of variables, we need to pre-select a subset of these variables. In the context of GWAS, we proposed to prune the dataset based on the LD structure; for each LD region, we use a summary test statistic and aggregated biofeature information. An alternative approach would be to perform a variable selection procedure on the full data, and then apply the quantile regression model for prioritization.
 

 
5.2 Further development on hierarchical quantile regression to specifically model linkage disequilibrium between SNPs
 
Currently, the hierarchical quantile regression model assumes that the SNPs are independent. However, due to linkage disequilibrium, many SNPs are correlated with each other. To model the correlation among SNPs, one might adopt a multivariate regression model within the hierarchical quantile regression: in the first level, we could model the correlation structure among the observed test statistics, and in the second level, we could model the effects of the biofeature(s) on the quantiles of the latent variables.

 
We have taken some initial steps in this direction. We used the LD structure of the SNPs as an estimate of the correlation structure of the test statistics, and estimated the posterior distribution of the latent variables based on the multivariate model. The initial results show that this procedure does not improve the prioritization of causal SNPs; more investigation is needed to further develop the multivariate hierarchical quantile regression.
 
We applied the hierarchical quantile regression model to a GWAS in Chapter 4. A fine-mapping study genotypes or sequences the candidate regions at much higher density, and is therefore a powerful way of finding causal variants, which are more likely to be directly genotyped or sequenced. Since the biofeature(s) might be more relevant to the causal SNPs, applying the hierarchical quantile regression model in a fine-mapping study may yield better statistical power to find the causal variants. As stated above, we would need a multivariate regression framework to deal with the correlation structure in fine-mapping data. Another challenge in fine-mapping studies is the computational burden if all the genetic variations in the region are included. To address this problem specifically, we could construct an empirical Bayes estimator that accounts for the correlation structure.
 

 
5.3 Integrative analysis of genetic association studies, especially with next-generation sequencing
 
With the rapid development of next-generation sequencing technology, it is tempting to design sequencing-based genetic association studies. However, more careful thought is needed on how to analyze sequencing data in an integrative way. First, sequencing data will be less accurate than genotyping data (especially at low coverage), so a probabilistic approach that accounts for sequencing error is often needed. For the pooling design using next-generation sequencing data, we constructed an integrated hierarchical model that incorporates the error structure. Second, many more genetic variations will be detected from sequencing data, but most of them will be rare variants. Testing one variant at a time leads to a severe loss of power because of the low frequency of each variant; in such cases, we may consider group tests of rare variants. For example, Neale et al. proposed the C-alpha test, which tests for deviation in the distribution of a group of variants between cases and controls (Neale et al., 2011). Third, most variants will have null effects in a sequencing study, so it is natural to apply a quantile regression approach to such a dataset and directly test the causal variants in the top percentiles.
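For intuition, a simplified C-alpha-style statistic can be written in a few lines (a Python sketch with our own variable names; see Neale et al., 2011 for the exact test and its grouping of variants by total copy number).

import numpy as np
from scipy.stats import binom, norm

def c_alpha_sketch(case_counts, total_counts, p0):
    # case_counts[i]: copies of rare variant i observed in cases
    # total_counts[i]: copies of variant i observed in cases + controls
    # p0: expected fraction of copies in cases under the null,
    #     e.g. n_cases / (n_cases + n_controls)
    y = np.asarray(case_counts, dtype=float)
    n = np.asarray(total_counts, dtype=float)
    # observed squared deviation from the binomial expectation, minus the
    # variance expected under the null, summed over variants
    T = np.sum((y - n * p0) ** 2 - n * p0 * (1 - p0))
    # null variance of T, summing over the possible case counts per variant
    c = 0.0
    for ni in n.astype(int):
        u = np.arange(ni + 1)
        dev = (u - ni * p0) ** 2 - ni * p0 * (1 - p0)
        c += np.sum(dev ** 2 * binom.pmf(u, ni, p0))
    z = T / np.sqrt(c)
    return z, norm.sf(z)  # one-sided p-value for over-dispersion

# toy usage under the null: 30 rare variants, each seen in 4 copies,
# with half of the sample being cases
rng = np.random.default_rng(1)
z, p = c_alpha_sketch(rng.binomial(4, 0.5, size=30), np.full(30, 4), p0=0.5)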
 

 

 
5.4 Pharmacogenomics and Personalized Medicine
 
Pharmacogenomics analyzes how genetic makeup affects a subject's response to drugs; it deals with the influence of genetic variation on a drug's toxicity or efficacy (Karczewski, Daneshjou, & Altman, 2012). Such approaches promise the advent of personalized medicine, in which drugs are optimized according to each individual's genetic makeup.
 

 
Although most genetic variants have small effect sizes in population-based studies, some studies have suggested that genetic makeup may have a larger effect on drug metabolism. For example, based on GWAS results, SNPs in the HMGCR gene have only a small effect (~5%) on LDL levels, whereas drugs targeting the encoded protein decrease LDL levels to a much greater extent (~30%) (Altshuler, Daly, & Lander, 2008; Willer et al., 2008). This may be because the effect of an inherited variant is limited by natural selection and pleiotropy, whereas the effect of a drug treatment is not. With more discoveries mapping genetic variation to disease, personalized medicine should take center stage in the next generation of pharmaceutical development.
 

   
 

 
Appendices

Appendix 1. Variant detection: comparison between pooling vs. individual designs
 
We also investigated the use of a pooling design for the discovery of novel variants, based on the probability of detection proposed in the Methods section. As an example, with the same cost function, suppose we were interested in a variant with VAF = 0.01. In an individual sequencing design we could sequence 200 individuals at 4X coverage (again assuming that this is enough to yield perfect sequencing results), and the probability of discovery would be 1 − (1 − 0.01)^200 = 0.866. With the same sequencing cost (and with no more than 4,000 individuals), the optimal pooling design would yield a detection probability of 0.948. The discovery probabilities of the different study designs are shown in Appendix Figure 1.
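The individual-design arithmetic quoted above is easy to verify with a short Python check; the pooled-design probability depends on the detection model from the Methods section and is not reproduced here.

def discovery_prob(vaf, n_copies):
    # probability that at least one of n_copies sampled copies carries the
    # variant, assuming error-free sequencing (the formula quoted above,
    # which uses n_copies = 200 for the 200-individual example)
    return 1.0 - (1.0 - vaf) ** n_copies

print(discovery_prob(0.01, 200))  # 0.866...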
 
A more general comparison between pooling and individual sequencing was made with varying VAFs and cost functions, with the cost fixed at that of 4X coverage on 100 individuals. Appendix Figure 2 shows that even when the VAF is very small, the optimal pooling design yields a very high detection probability. The optimal designs are summarized in Appendix Table 2.
 
 

 

 

 

 

 

 

 
Appendix Figure 1. The discovery probability of a novel variant (VAF = 0.01) using pooling designs, with a fixed sequencing cost.
 
Appendix Figure 2. Comparison of variant detection probability for optimal pooling designs with various cost functions, versus individual designs, over a range of variant allele frequencies.
 
Appendix Table 1. Top 5 optimal pooling designs for a study interested in a variant with VAF = 0.03 and OR = 2, with a specified cost function and fixed constraints on sequencing cost and sample size.

P    Np   λ    Power
51   62   25   0.424
51   60   25   0.421
51   58   25   0.417
51   56   25   0.313
51   54   25   0.409
 

 

 
Appendix Table 2. Comparison of variant detection probability for optimal pooling designs with various cost functions, versus individual designs, over a range of variant allele frequencies. The cost function is the same as in Figure 2.7.

C_λ    VAF     Np    λ      P    Detection Probability   Individual Design (same cost)
50     0.001   6     42     39   0.321                   0.095
50     0.003   14    113    18   0.657                   0.260
50     0.005   14    108    19   0.828                   0.394
50     0.008   200   892    3    0.943                   0.552
50     0.01    200   892    3    0.980                   0.634
50     0.03    200   465    5    1.000                   0.952
50     0.05    174   204    11   1.000                   0.994
200    0.001   10    56     15   0.198                   0.095
200    0.003   10    56     15   0.440                   0.260
200    0.005   10    56     15   0.610                   0.394
200    0.008   10    56     15   0.775                   0.552
200    0.01    200   553    2    0.868                   0.634
200    0.03    200   553    2    1.000                   0.952
200    0.05    200   349    3    1.000                   0.994
500    0.001   4     9      55   0.175                   0.095
500    0.003   4     9      55   0.381                   0.260
500    0.005   200   1086   1    0.556                   0.394
500    0.008   200   1086   1    0.757                   0.552
500    0.01    200   1086   1    0.844                   0.634
500    0.03    200   1183   1    0.999                   0.952
500    0.05    200   393    2    1.000                   0.994
 

 

 

 

 

 

 
Appendix Text 1. The WinBUGS code for the hierarchical model on the pooled NGS data.
 
model {
  for (p in 1:P) {
    # logistic regression at the pool level: rho[p] is the probability that a
    # chromosome in pool p carries the variant allele, regressed on the
    # centered pool-level covariate W[p] (e.g., the pool's case/control make-up)
    logit(rho[p]) <- beta0 + beta1*(W[p] - m.W)
    # C[p]: latent count of variant chromosomes among the 2*N.p chromosomes in pool p
    C[p] ~ dbin(rho[p], 2*N.p)
    # theta[p]: probability that a sequenced read shows the variant allele,
    # allowing for sequencing error in both directions
    theta[p] <- (C[p]/(2*N.p))*(1-error) + (1 - C[p]/(2*N.p))*error
    # V[p]: observed variant-supporting reads out of X[p] total reads in pool p
    V[p] ~ dbin(theta[p], X[p])
  }

  # priors
  error ~ dunif(0, 0.05)
  beta0 ~ dnorm(0, 0.01)
  beta1 ~ dnorm(0, 0.01)

  # posterior summaries
  T.beta1 <- step(beta1)   # indicator that beta1 >= 0
  m.W <- mean(W[1:P])      # centering constant for the covariate
}
 

 

 

 

 

 

 

 

   
 

 
Bibliography

Altekruse, S. F., Kosary, C. L., Krapcho, M., Neyman, N., Aminou, R., Waldron, W., et al. (2007). SEER Cancer Statistics Review, 1975–2007. National Cancer Institute, Bethesda, MD. http://seer.cancer.gov/csr/1975_2007/, based on November 2009 SEER data submission, posted to the SEER web site, 2010.

Altshuler, D., Daly, M. J., & Lander, E. S. (2008). Genetic Mapping in Human Disease. Science, 322(5903), 881–888. doi:10.1126/science.1156409

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1), 25.

Cade, B. S., & Noon, B. R. (2003). A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment, 1(8), 412–420.

Cantor, R. M., Lange, K., & Sinsheimer, J. S. (2010). Prioritizing GWAS results: A review of statistical methods and recommendations for their application. The American Journal of Human Genetics, 86(1), 6–22.

Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759–771. doi:10.1093/biomet/asn034

The 1000 Genomes Project Consortium. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 490(7422), 56–65. doi:10.1038/nature11632

Cordell, H. J., & Clayton, D. G. (2005). Genetic association studies. The Lancet, 366(9491), 1121–1131.

Druley, T. E., Vallania, F. L. M., Wegner, D. J., Varley, K. E., Knowles, O. L., Bonds, J. A., et al. (2009). Quantification of rare allelic variants from pooled genomic DNA. Nature Methods, 6(4), 263–265. doi:10.1038/nmeth.1307

Erlich, Y., Chang, K., Gordon, A., Ronen, R., Navon, O., Rooks, M., & Hannon, G. J. (2009). DNA Sudoku--harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Research, 19(7), 1243–1253. doi:10.1101/gr.092957.109

Frazer, K. A., Murray, S. S., Schork, N. J., & Topol, E. J. (2009). Human genetic variation and its contribution to complex traits. Nature Reviews Genetics, 10(4), 241–251. doi:10.1038/nrg2554

Freedman, M. L., Monteiro, A. N. A., Gayther, S. A., Coetzee, G. A., Risch, A., Plass, C., et al. (2011). Principles for the post-GWAS functional characterization of cancer risk loci. Nature Genetics, 43(6), 513–518. doi:10.1038/ng.840

Futschik, A., & Schlotterer, C. (2010). Massively Parallel Sequencing of Pooled DNA Samples—The Next Generation of Molecular Markers. Genetics.

Gauderman, W. (2002). Sample Size Requirements for Association Studies of Gene-Gene Interaction. American Journal of Epidemiology.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian Data Analysis. Chapman & Hall/CRC.

Genkin, A., Lewis, D. D., & Madigan, D. (2007). Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics, 49(3), 291–304. doi:10.1198/004017007000000245

George, E. I., & Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87(4), 731–747.

Hans, C. (2009). Bayesian lasso regression. Biometrika, 96(4), 835–845. doi:10.1093/biomet/asp047

Hastie, T., Tibshirani, R., & Friedman, J. H. (2003). The Elements of Statistical Learning (Corrected ed.). Springer.

Hirschhorn, J. N., Lohmueller, K., Byrne, E., & Hirschhorn, K. (2002). A comprehensive review of genetic association studies. Genetics in Medicine, 4(2), 45–61.

Hoggart, C. J., Whittaker, J. C., De Iorio, M., & Balding, D. J. (2008). Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies. (P. M. Visscher, Ed.) PLoS Genetics, 4(7), e1000130. doi:10.1371/journal.pgen.1000130.t004

Ioannidis, J., & Thomas, G. (2009). Validating, augmenting and refining genome-wide association signals. Nature Reviews Genetics.

Jensen, L. J., & Bork, P. (2010). Ontologies in Quantitative Biology: A Basis for Comparison, Integration, and Discovery. PLoS Biology, 8(5), e1000374. doi:10.1371/journal.pbio.1000374.g003

Karczewski, K. J., Daneshjou, R., & Altman, R. B. (2012). Chapter 7: Pharmacogenomics. (F. Lewitter & M. Kann, Eds.) PLoS Computational Biology, 8(12), e1002817. doi:10.1371/journal.pcbi.1002817.s001

Kim, S. Y., Li, Y., Guo, Y., Li, R., Holmkvist, J., Hansen, T., et al. (2010). Design of association studies with pooled or un-pooled next-generation sequencing data. Genetic Epidemiology, 34(5), 479–491. doi:10.1002/gepi.20501

Klein, R. J. (2005). Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science, 308(5720), 385–389. doi:10.1126/science.1109557

Koenker, R. (2005). Quantile Regression. Cambridge University Press.

Koenker, R., & Bassett, G., Jr. (1978). Regression quantiles. Econometrica, 46(1), 33–50.

Kozumi, H., & Kobayashi, G. (2011). Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation, 81(11), 1565–1578.

Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860–921.

Lee, J. S., Choi, M., Yan, X., Lifton, R. P., & Zhao, H. (2011). On optimal pooling designs to identify rare variants through massive resequencing. Genetic Epidemiology, 35(3), 139–147. doi:10.1002/gepi.20561

Li, B., & Leal, S. M. (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83(3), 311–321.

Liu, J., Lewinger, J. P., Gilliland, F. D., Gauderman, W. J., & Conti, D. V. (2013). Confounding and Heterogeneity in Genetic Association Studies with Admixed Populations. American Journal of Epidemiology. doi:10.1093/aje/kws234

Lupton, M. K., Proitsi, P., Danillidou, M., Tsolaki, M., Hamilton, G., Wroe, R., et al. (2011). Deep Sequencing of the Nicastrin Gene in Pooled DNA, the Identification of Genetic Variants That Affect Risk of Alzheimer's Disease. (R. Roberts, Ed.) PLoS ONE, 6(2), e17298. doi:10.1371/journal.pone.0017298.t003

Madsen, B. E., & Browning, S. R. (2009). A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. (N. J. Schork, Ed.) PLoS Genetics, 5(2), e1000384. doi:10.1371/journal.pgen.1000384.t002

McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P. A., & Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics, 9(5), 356–369. doi:10.1038/nrg2344

Meier, L., & Van De Geer, S. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Morris, A. P., & Zeggini, E. (2009). An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology, 34(2), 188–193. doi:10.1002/gepi.20450

Neale, B. M., Rivas, M. A., Voight, B. F., Altshuler, D., Devlin, B., Orho-Melander, M., et al. (2011). Testing for an Unusual Distribution of Rare Variants. (S. M. Leal, Ed.) PLoS Genetics, 7(3), e1001322. doi:10.1371/journal.pgen.1001322.t002

Newcomb, P. A., Baron, J., Cotterchio, M., Gallinger, S., Grove, J., Haile, R., et al. (2007). Colon Cancer Family Registry: An International Resource for Studies of the Genetic Epidemiology of Colon Cancer. Cancer Epidemiology, Biomarkers & Prevention, 16(11), 2331–2343. doi:10.1158/1055-9965.EPI-07-0648

Out, A. A., van Minderhout, I. J. H. M., Goeman, J. J., Ariyurek, Y., Ossowski, S., Schneeberger, K., et al. (2009). Deep sequencing to reveal new variants in pooled DNA samples. Human Mutation, 30(12), 1703–1712. doi:10.1002/humu.21122

Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103(482), 681–686. doi:10.1198/016214508000000337

Peters, U., Hutter, C. M., Hsu, L., Schumacher, F. R., Conti, D. V., Carlson, C. S., et al. (2011). Meta-analysis of new genome-wide association studies of colorectal cancer risk. Human Genetics, 131(2), 217–234. doi:10.1007/s00439-011-1055-0

Peters, U., Jiao, S., Schumacher, F. R., Hutter, C. M., Aragaki, A. K., Baron, J. A., et al. (2013). Identification of Genetic Susceptibility Loci for Colorectal Tumors in a Genome-Wide Meta-analysis. Gastroenterology, 144(4), 799–807.e24. doi:10.1053/j.gastro.2012.12.020

Prabhu, S., & Pe'er, I. (2009). Overlapping pools for high-throughput targeted resequencing. Genome Research, 19(7), 1254–1261. doi:10.1101/gr.088559.108

Price, A. L., Kryukov, G. V., de Bakker, P. I. W., Purcell, S. M., Staples, J., Wei, L.-J., & Sunyaev, S. R. (2010). Pooled Association Tests for Rare Variants in Exon-Resequencing Studies. The American Journal of Human Genetics, 86(6), 832–838. doi:10.1016/j.ajhg.2010.04.005

Quintana, M. A., Berstein, J. L., Thomas, D. C., & Conti, D. V. (2011). Incorporating model uncertainty in detecting rare variants: the Bayesian risk index. Genetic Epidemiology, 35(7), 638–649. doi:10.1002/gepi.20613

Risch, N. J. (2000). Searching for genetic determinants in the new millennium. Nature, 405(6788), 847–856. doi:10.1038/35015718

Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology, 26(10), 1135–1145. doi:10.1038/nbt1486

Shental, N., & Amir, A. (2010). Identification of rare alleles and their carriers using compressed se(que)nsing. Nucleic Acids Research.

Smith, A. M., Heisler, L. E., St Onge, R. P., Farias-Hesson, E., Wallace, I. M., Bodeau, J., et al. (2010). Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Research, 38(13), e142. doi:10.1093/nar/gkq368

Sriram, K., & Ghosh, P. (2012). Simultaneous Bayesian Estimation of Multiple Quantiles with an Extension to Hierarchical Models. IIM Bangalore Research Paper No. 359.

Sriram, K., Ramamoorthi, R., & Ghosh, P. (2011). Posterior Consistency of Bayesian Quantile Regression Under a Mis-Specified Likelihood Based on Asymmetric Laplace Density. IIM Bangalore Research Paper No. 340.

Stephens, M., & Balding, D. J. (2009). Bayesian statistical methods for genetic association studies. Nature Reviews Genetics.

The International HapMap Consortium. (2005). A haplotype map of the human genome. Nature, 437(7063), 1299–1320. doi:10.1038/nature04226

Thomas, P. D. (2003). PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Research, 31(1), 334–341. doi:10.1093/nar/gkg115

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.

Tokdar, S., & Kadane, J. B. (2011). Simultaneous linear quantile regression: A semiparametric Bayesian approach. Bayesian Analysis, 6(4), 1–22.

Tsionas, E. G. (2003). Bayesian quantile inference. Journal of Statistical Computation and Simulation, 73(9), 659–674. doi:10.1080/0094965031000064463

Turnbull, C., Ahmed, S., Morrison, J., Pernet, D., Renwick, A., Maranian, M., et al. (2010). Genome-wide association study identifies five new breast cancer susceptibility loci. Nature Genetics, 42(6), 504–507. doi:10.1038/ng.586

Wang, K., Li, M., & Hakonarson, H. (2010). Analysing biological pathways in genome-wide association studies. Nature Reviews Genetics, 11(12), 843–854. doi:10.1038/nrg2884

Wang, T., Pradhan, K., Ye, K., Wong, L.-J., & Rohan, T. E. (2011). Estimating allele frequency from next-generation sequencing of pooled mitochondrial DNA samples. Frontiers in Genetics, 2, 51. doi:10.3389/fgene.2011.00051

Willer, C. J., Sanna, S., Jackson, A. U., Scuteri, A., Bonnycastle, L. L., Clarke, R., et al. (2008). Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nature Genetics, 40(2), 161–169. doi:10.1038/ng.76

Williams, P. T. (2012). Quantile-Specific Penetrance of Genes Affecting Lipoproteins, Adiposity and Height. PLoS ONE, 7(1), e28764. doi:10.1371/journal.pone.0028764.t001

Wu, T. T., & Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1), 224–244. doi:10.1214/07-AOAS147

Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., & Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6), 714–721. doi:10.1093/bioinformatics/btp041

Yu, K., & Lu, Z. (2003). Quantile regression: applications and current research areas. Journal of the Royal Statistical Society: Series D (The Statistician).

Yu, K., & Moyeed, R. A. (2001). Bayesian quantile regression. Statistics & Probability Letters, 54(4), 437–447.

Yuan, M., & Lin, Y. (2005a). Efficient empirical Bayes variable selection and estimation in linear models. Journal of the American Statistical Association, 100(472), 1215–1225.

Yuan, M., & Lin, Y. (2005b). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.

Zhao, Y., & Wang, S. (2009). Optimal DNA Pooling-Based Two-Stage Designs in Case-Control Association Studies. Human Heredity, 67(1), 46–56.