X-LINKED REPEAT POLYMORPHISMS AND DISEASE RISK:
STATISTICAL POWER AND STUDY DESIGNS
by
Timothy J. Triche, Jr.
A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
December 2008
Copyright 2008 Timothy J. Triche, Jr.
DEDICATION
To my wife, Catherine, for her love and support throughout my graduate studies.
To my parents, for their encouragement, and for believing in me for 33 years.
ACKNOWLEDGEMENTS
I would like to extend my sincere thanks to Dr. Kimberly Siegmund, Dr. W.
James Gauderman, and Dr. Victoria Cortessis for their advice and suggestions in the
preparation of this manuscript. I also thank Dr. Duncan Thomas, Dr. Richard
Watanabe, and Dr. Stanley Azen for their guidance within the Biostatistics and
Statistical Genetics programs. I consider myself very fortunate to have had the
counsel of each person who advised me in the course of my studies.
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER I: INTRODUCTION
CHAPTER II: BACKGROUND
Quantitative markers of disease risk
An autosomal example: CAG repeats and Huntington's disease
An X-linked example: CAG repeats and Kennedy disease
Data structures for family-based studies of sex-linked markers
Quanto: statistical power for studies of gene-environment interaction
CHAPTER III: METHODS
Analytic estimation of statistical power via likelihood
Empiric confirmation of statistical power via simulation
Resampling methods for empiric estimation of power
CHAPTER IV: RESULTS
CHAPTER V: DISCUSSION
CHAPTER VI: CONCLUSION
BIBLIOGRAPHY
APPENDICES
Appendix A: Computation of relative informativeness
LIST OF TABLES
Table 1: Sample size required at 80% power: male case-mother study
Table 2: Statistical power for numeric and categorical codings of repeat lengths
Table 3: Relative efficiency of study designs for X-linked repeat polymorphisms
Table 4: Effects of stratification on sample size for equivalent statistical power
Table 5: Effects of model mis-specification on statistical power
LIST OF FIGURES
Figure 1: Data structures for studies of sex-linked markers of disease risk.
Figure 2: Power curves by study design, predictor, and affected gender.
ABSTRACT
The design and analysis of association studies for repeat polymorphisms have
received scant attention in the literature. We present an analytical power calculation
for studies of such polymorphisms, based on a case-parent design and an X-linked
polymorphism. Existing tools for estimating statistical power in family-based
studies (such as Quanto) presume categorical codings of autosomal loci. We extend
the underlying method to handle quantitative codings of repeat polymorphisms, and
discuss the advantages of doing so. Sample sizes for a conditional logistic regression
analysis of a sex-linked repeat polymorphism in a case-parent design are presented.
Empirical power estimates for quantitative and categorical codings of the same polymorphism
with the same sample size in otherwise identical studies are then compared via
Monte Carlo simulation. The differences in information to be expected from male
and female case-parent, case-sibling, and case-population pairs are discussed. In
addition, the effects of stratification and model mis-specification are explored. The
effects of missing data, and potential approaches for addressing them, are discussed.
CHAPTER I: INTRODUCTION
Statistical tests for association of candidate genetic markers, both autosomal
and sex-linked (Horvath et al., 2000), are well established in the epidemiological
literature (Spielman and Ewens, 1995; Witte et al., 1999). The widely used
transmission-disequilibrium test, or TDT, is a test of the joint null hypothesis of no
association and no linkage disequilibrium at a locus, and is most often used to analyze
case-parent studies of distortion of the transmission probabilities for an allele. The
case-parent study design also lends inherent robustness to confounding by population
stratification (the possibility that a marker for population admixture is inadvertently
characterized as a marker for a disease risk that differs between the admixed
populations). This is not an insignificant concern, as even well-designed case-
control studies have been shown to be susceptible to population stratification
(Freedman et al., 2004) in the absence of dense panels of markers.
The basic formulation of the TDT is as a McNemar $\chi^2$ test for distortion in
the ratio of alleles transmitted to affected children versus those untransmitted. Under
the null hypothesis, transmission of a putative causal variant to an affected offspring
should be no more or less likely than transmission of the alternate allele. The
alternative hypothesis is that causality, or linkage disequilibrium with a causal locus,
is distorting this ratio (for a biallelic locus, the ratio is 0.5 under the null hypothesis),
thus increasing the observed proportion of the marker allele in affected offspring.
Detection and quantification of the distortion thereby provides a means to evaluate
the strength and significance of association between a candidate locus and disease.
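The McNemar form of the test described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the function names and the example counts are hypothetical, and the statistic shown is the standard (b - c)^2 / (b + c) form on one degree of freedom, where b and c count transmissions and non-transmissions of the candidate allele from heterozygous parents.

```python
def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """McNemar-type chi-square (1 df) for transmission distortion.

    transmitted   -- heterozygous parents who passed the candidate allele
                     to an affected child
    untransmitted -- heterozygous parents who passed the alternate allele
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no informative transmissions")
    return (transmitted - untransmitted) ** 2 / informative


def transmission_ratio(transmitted: int, untransmitted: int) -> float:
    """Observed share of informative transmissions carrying the candidate
    allele; the null hypothesis for a biallelic locus puts this at 0.5."""
    return transmitted / (transmitted + untransmitted)
```

With hypothetical counts of 60 transmissions against 40 non-transmissions, the observed ratio is 0.6 and the statistic is 4.0, which exceeds the 0.05 critical value of roughly 3.84 for one degree of freedom.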
This formulation has been adapted to sibships (Horvath and Laird, 1998;
Spielman and Ewens, 1998), pedigree-disequilibrium-based tests (Martin et al.,
2000), multiple markers (Sham and Curtis, 1995), quantitative trait loci (Allison,
1997), X-chromosomal markers (Horvath et al., 2000; Ho et al., 2000),
reconstruction methods for parental genotypes (Knapp 1999), and a generalized
omnibus test (the i-TDT) leveraging both affected and unaffected offspring to
extract maximum statistical power from an arbitrary assembly of families (Guo et al.,
2007). The latter three formulations hint at the motivation for the present topic:
accurate analytic estimation of power under varying assumptions, to maximize the
information extracted from a family-based study of a genetic risk factor or factors.
For the analysis of a simple genotype where an individual may carry either 0,
1, or 2 copies of the proposed risk allele, the TDT is equivalent to conditional
logistic regression (CLR) on the number of risk alleles present, stratified by family
(Self et al., 1991; Abel and Müller-Myhsok 1998). The formulation of a case-parent
trio as one case matched against one (X-linked) or three (autosomal) “pseudo-sibs”
(phantoms composed from the untransmitted parental alleles), lends itself naturally
to analysis as a matched case-control study. This formulation is ideally suited to
conditional logistic regression. Additionally, the conditional logistic regression
framework grants much flexibility in model fitting, allowing the inclusion of
additional covariates and/or interactions, easily accommodating quantitative predictors,
and permitting case-sibling or case-population-control study designs as matched
case-control analyses, without any additional methodological overhead.
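The pseudo-sib construction mentioned above (three pseudo-sibs for an autosomal trio, one for an X-linked male case and his mother) can be illustrated as follows. This is a sketch under simple assumptions: genotypes are represented as tuples of allele labels, parental origin is ignored once a genotype is formed, and both function names are hypothetical.

```python
from itertools import product


def autosomal_pseudo_sibs(mother, father, case):
    """Genotypes the parents could have transmitted but did not.

    Each parent is a 2-tuple of alleles; the four possible offspring
    genotypes are enumerated and one copy of the case genotype is dropped,
    leaving three pseudo-sibs.
    """
    possible = [tuple(sorted(pair)) for pair in product(mother, father)]
    possible.remove(tuple(sorted(case)))  # raises ValueError if incompatible
    return possible


def x_linked_male_pseudo_sib(mother, case_allele):
    """Maternal X allele that was not transmitted to a male case."""
    remaining = list(mother)
    remaining.remove(case_allele)  # raises ValueError if incompatible
    return remaining[0]
```

For two heterozygous parents `("A", "a")` and a heterozygous case, the autosomal function returns the three remaining offspring genotypes; for a mother carrying repeat lengths 21 and 24 and a son carrying 24, the single X-linked pseudo-sib carries 21.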
Tools are widely available to calculate the expected statistical power of a
specified sample size, or the appropriate number of sets to enroll in a study to
achieve a desired statistical power at some minimum effect size, given population
parameters for the prevalence of disease and minor allele frequency under binomial
or multinomial models of allelic distribution. The standalone program Quanto
(http://hydra.usc.edu/gxe/) is one such tool, freely available for download. Quanto
offers an investigator-friendly user interface which permits the calculation of
statistical power and/or sample size in a flexible manner for studies of genetic,
environmental, and gene-environment risks, given population parameters and
exposure frequencies (for environmental risks). While Quanto does not accommodate
quantitative codings of repeat polymorphisms, the likelihood-based methodology
(discussed in further detail within the Methods section) which is used in its
calculation of power can be applied to quantitative codings, and that is the approach
which has been taken here. We developed a package (qpower) for the R statistical
computing environment (R Development 2005) to implement the methods and
perform Monte Carlo simulations to verify the analytical results.
Numerous authors derive the power for a genotypic predictor, given
population parameters for disease and allele frequencies, under the log-additive
penetrance model of the TDT or its CLR equivalent (Abel and Müller-Myhsok,
1998; Knapp, 1999), and for the general case of logistic regression with a binary
predictor (Demidenko 2007), but there appears to be no canonical method for
estimating the power of a repeat polymorphism with arbitrary penetrance. Due to
the desirable properties of the likelihood ratio test, as well as the relatively simple
changes needed to accommodate arbitrary distributions of predictor values,
Gauderman's approach is used (Gauderman 2002). This approach is described, the
resulting estimates corroborated by simulation, and the practical consequences for
studies of quantitative sex-linked genetic markers for disease are discussed. Cost-
efficiency and robustness of various formulations under various disease models are
examined. Existing studies of quantitative repeat polymorphisms, and methods
employed in such studies, are compared with the observed disease risk functions in
an autosomal disorder (Huntington's disease) with similar etiology, and the merits of
a loglinear or logistic approach are discussed. Finally, certain important
considerations for practical studies of this genetic risk model are evaluated, and some
potential solutions for those not addressed in this presentation are described.
CHAPTER II: BACKGROUND
Repeat polymorphisms and disease risk
Trinucleotide repeat polymorphisms are known or implicated as risk factors for
numerous diseases. Perhaps the best-known example of a trinucleotide repeat
polymorphism with an established role in disease progression is that in the gene for
huntingtin (HTT, chr4p16.3), abnormal lengths of which lead to Huntington's disease
(Kieburtz et al., 1994), a progressively degenerative neuromuscular condition.
Another, sex-linked, example of trinucleotide repeat pathology is Kennedy disease,
where an unusually long trinucleotide repeat polymorphism in exon 1 of the
androgen receptor gene on the X chromosome leads to spinal and bulbar muscular atrophy
(La Spada et al., 1991). In both examples, a notable feature of the disease association is
increased severity of the disease with pathologically longer repeat lengths.
An autosomal example: CAG repeats and Huntington's disease
The best-studied example of quantitative genotypic risk for disease may be
Huntington's disease. Not only is the responsible gene (HTT) well-characterized,
but it appears to act independently and with penetrance dependent upon the repeat
length, presenting somewhat of an archetype for trinucleotide repeat disorders.
Furthermore, extreme lengths of the polymorphism are associated with a particularly
severe form of the disease, the so-called Westphal variant (Nance & Myers, 2001)
which afflicts young adults (so-called 'juvenile HD'). The genetic risk model for
Huntington's disease is in many ways consistent with (log-)additive penetrance.
An X-linked example: CAG repeats and Kennedy disease
First identified in 1968, Kennedy disease is a progressive neuromuscular disease
which manifests as muscle cramps and weakness due to degeneration of motor
neurons (Kennedy, 1968). In 1991, Albert La Spada and his colleagues positively
associated the condition with expansion of the CAG repeat in exon 1 of the androgen
receptor gene. The mechanism for pathology is unclear, but it is thought that a gain-
of-function mutation in the androgen receptor, induced by long repeat tracts of 40-50
repeats (the population mean is 22 repeats), may lead to developmental
abnormalities. Though females can be carriers of Kennedy disease, they are rarely
affected, and then only mildly. Only male subjects develop fully manifested Kennedy disease.
Design and analysis for studies of X-linked repeat polymorphisms and disease
The biological significance of the androgen receptor gene on the X chromosome, and
its role in disease, has motivated much study of its features. Among these are a CAG
repeat polymorphism and a GGC repeat polymorphism in exon 1 of the gene, which
have been studied for association with numerous conditions, including testicular
cancer (King et al, 1997), bladder cancer (Gonzalez-Zulueta et al, 1993),
cryptorchidism (Ferlin et al, 2005), prostate cancer (Irvine et al, 1995), Alzheimer's
disease (Lehmann et al, 2003), and aberrant testicular histology (Dakoune-
Goudacelli, 2006). Some of the diseases under study have relatively low incidence,
typically resulting in relatively small samples, such that the most powerful method to
use the available information is quite desirable.
In such studies, the robustness of family-based designs to population stratification,
and the use of a composite test for association and linkage disequilibrium, are
particularly desirable. Further, the recasting of the TDT in a CLR framework allows
for modeling of additional covariates (such as potential interaction between
polymorphic loci, as for example the CAG and GGC repeat tracts in the androgen
receptor gene). Finally, the matched case-parent (or case-sibling) study design
retains robustness to population stratification, a concern due to the differing
occurrence of some conditions among populations. The option of using an efficient
matched case-population-control design also exists. Inasmuch as some mutations in
developmentally critical genes (of which the androgen receptor is one) are thought to
reduce prenatal survival, prevalence-based estimates for congenital pathology are the
norm, while differing lifetime risks for degenerative conditions may suggest an
incidence risk estimate. Conditional logistic regression is appropriate for either case.
Methods to extend the transmission-disequilibrium test to X-linked alleles
have been proposed and implemented, as previously described (Ho et al, 2000), but
none leverage the full information available from a quantitative genotype (such as
CAG repeat lengths), instead focusing on the more common case of categorical allele
codings. The impact that this omission may have upon statistical power has
remained unexplored. Fortunately, methods for estimating statistical power and
sample size requirements in studies of gene-environment interaction (Gauderman
2002) provide a basis for exploration of quantitative codings of genetic predictors,
and the statistical power to be expected for a conditional logistic regression analysis.
Data structures for family-based studies of X-linked markers
The data structures collected in family-based studies of X-linked predictors
for disease risk can be represented as follows (also presented as Figure 1):
• For a male-only trait, such as testicular cancer, studied for association with an X-linked genotype, a case and his mother constitute a full case-parent set (“a”).
• For a female-only trait, such as ovarian cancer, studied for association with an X-linked genotype, a case and her mother constitute a full case-parent set (“b”).
• For a non-sex-limited trait, sets of both types may be represented, such that additional stratification by gender is to be expected in the study population (“c”).
Female cases might be expected to require a case-parent trio for full information, but
as the paternal $X_f$ is transmitted with 100% certainty, it is ascertained from the case.
Figure 1: Data structures for studies of sex-linked markers of disease risk.
3. Quanto: statistical power for studies of gene-environment interaction
Quanto (Gauderman 2002) is a user-friendly Windows application for
computing sample size requirements (or power at a given sample size) in matched
case-control, case-sibling, and case-parent studies focused on genetic, environmental,
gene-gene, and particularly gene-environment effects on disease risk. Given
appropriate parameters for genetic and/or environmental exposures within a
population, Quanto allows an investigator to estimate the requirements to detect a
specified minimum effect size with his or her population parameters, with additional
flexibility in terms of the effect of interest and the matching strategy. As discussed
in the following section, the underlying unified methodology allows for
maximization of the expected likelihood of the parameter(s) of interest, and from this
maximized likelihood, the number of sets required to provide a desired power to
detect the effect size (relative risk or odds ratio, depending upon the study design) is
computed. The sample size is thus a function of the expected loglikelihood
contribution from each set, and the noncentrality parameter of the $\chi^2$ distribution for
the likelihood ratio test under the alternative hypothesis (namely, that the parameter
β of interest is not equal to 0). The investigator need not be aware of any of the
preceding machinery, as Quanto provides for various means of specifying known or
estimated population disease prevalence, Hardy-Weinberg equilibrium frequencies
for a minor allele, and baseline environmental exposure levels within the
investigator's target study population. Sample size estimates from the methodology
used in Quanto agree strongly with the results of simulations conducted herein.
CHAPTER III: METHODS
1. Analytic estimation of statistical power for the likelihood ratio test
The asymptotic approximation for power at a sample size N relies upon the
likelihood ratio test statistic, modeling the expected loglikelihood difference Λ for
each set under H1 and H0, and computing the noncentrality parameter for the $\chi^2$
distribution from N sets as NΛ, yielding the power at a given type I error rate as
$$\Phi\left(\sqrt{N\Lambda} - z_{a/2}\right) + \Phi\left(-\sqrt{N\Lambda} - z_{a/2}\right)$$
for a 2-sided alternative hypothesis, where Φ(u) is the standard normal CDF
evaluated at u, $z_u$ is the (1-u)th standard normal quantile, and a is the Type I error.
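The two-sided power approximation above translates directly into code. The following is a minimal sketch using only the Python standard library; the function name and the per-set value of Λ used in the usage note are illustrative assumptions, not quantities taken from the thesis software.

```python
from math import sqrt
from statistics import NormalDist


def lrt_power(n_sets: int, per_set_lambda: float, a: float = 0.05) -> float:
    """Approximate two-sided power of the likelihood ratio test.

    n_sets         -- number of matched sets N
    per_set_lambda -- expected per-set contribution Lambda
    a              -- Type I error rate
    """
    std_normal = NormalDist()
    z_half_a = std_normal.inv_cdf(1 - a / 2)   # upper a/2 quantile
    root_ncp = sqrt(n_sets * per_set_lambda)   # sqrt(N * Lambda)
    return (std_normal.cdf(root_ncp - z_half_a)
            + std_normal.cdf(-root_ncp - z_half_a))
```

As a usage example, an assumed per-set Λ of 0.04 with 196 sets gives sqrt(NΛ) = 2.8 and a power close to 0.80 at a = 0.05; with N = 0 the expression reduces to the Type I error rate itself.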
Analytical estimation begins with calculation of the expected loglikelihood
contribution for each enrolled set under the null ( β = 0 ) and alternative hypotheses:
$$E[\ln L(\beta)] = E[\ell(\beta)] = \frac{\sum_{g}^{N} \ln[L(\beta; G)]\, f(D \mid G, \alpha, \beta)\, f(g \mid \Omega)}{\sum_{g}^{N} f(D \mid G, \alpha, \beta)\, f(g \mid \Omega)}$$
The component quantities of the expected loglikelihood above are as follows:
• The likelihood for a matched set $G_i$,
$$L(\beta; G_i) = \frac{e^{\beta g_1}}{\sum_{g_j \in G_i} e^{\beta g_j}},$$
consisting of each constituent genotype (case and (pseudo)sib(s) or matched control), g, evaluated at β;
• The penetrance $f(D \mid G) = e^{\alpha + \beta X}$, where X is either the number of risk alleles present,
the value of the repeat length for male X-linked repeat polymorphisms, or the value
for the effective (non-inactivated) copy in female X-linked repeat polymorphisms
(the X chromosomal inactivation process in females is modeled as a Bernoulli trial);
• The transmission function $f(g_1 \mid g_m, g_f, \Omega)$, where Ω denotes either the population
frequency of a categorical allele ($q_A$) or, for quantitative predictors such as repeat
polymorphisms, a vector of population repeat length parameters Ω = (μ, σ).
In the case of CAG repeat lengths in the androgen receptor gene, the population
distribution appears to be normal, with a mean of 22 and a standard deviation of 3
repeats (Davis-Dao, 2007), and lengths between 8 and 35 repeats have been reported
in clinical studies (exclusive of Kennedy disease sufferers, it should be noted).
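Two of the components listed above, the matched-set likelihood and the log-linear penetrance, are simple enough to write out directly. The sketch below is illustrative only; the function names are hypothetical and genotypes are assumed to be numeric (a risk-allele count or a repeat length).

```python
from math import exp


def matched_set_likelihood(beta, case_genotype, comparison_genotypes):
    """Conditional likelihood of a matched set:
    e^(beta * g_case) divided by the sum of e^(beta * g_j) over the case
    and every (pseudo-)sib or matched control in the set."""
    numerator = exp(beta * case_genotype)
    denominator = numerator + sum(exp(beta * g) for g in comparison_genotypes)
    return numerator / denominator


def loglinear_penetrance(alpha, beta, x):
    """Log-linear penetrance f(D | G) = e^(alpha + beta * X)."""
    return exp(alpha + beta * x)
```

At β = 0 every member of a set is equally likely to be the case, so a 1:1 set contributes 1/2 and a 1:3 set contributes 1/4, which is the baseline against which the alternative is compared.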
A log-linear model for genetic risk is assumed for case-parent studies (Gauderman
2002), whereas a logistic model is assumed for case-sibling and case-population-
control designs. For a female-limited disease, or one where both males and females
are affected, the loglikelihood contribution expected from each possible inactivation
combination is tallied and multiplied by its frequency. For example, a case-mother
set with a female case genotype X1X3, maternal genotype X1X2, and pseudo-sib
genotype X2X3 will have four possible combinations of effective genotypes to
compare: (X1, X2), (X3, X2), (X1, X3), and (X3, X3), each equally likely. The
loglikelihood calculation must thus sum and normalize the contributions of each.
(However, see also the discussion regarding model misspecification and its effects.)
Though there is some controversy as to whether X chromosome inactivation is in
fact random and consistent within a given cell type, for the purposes of calculating
expected statistical power, we assume this is in fact the case.
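The averaging over inactivation outcomes described above can be sketched as follows. This is an illustration under the stated assumption of random inactivation, with each combination of effective genotypes weighted equally; the function name is hypothetical, and the numeric repeat lengths used in the usage note stand in for the symbolic X1, X2, and X3 of the example.

```python
from itertools import product
from math import exp, log


def expected_loglik_under_inactivation(beta, case_alleles, pseudo_sib_alleles):
    """Average the 1:1 conditional loglikelihood over every combination of
    effective (non-inactivated) alleles in the case and the pseudo-sib."""
    combos = list(product(case_alleles, pseudo_sib_alleles))
    weight = 1.0 / len(combos)  # each inactivation outcome equally likely
    total = 0.0
    for case_x, sib_x in combos:
        likelihood = exp(beta * case_x) / (exp(beta * case_x) + exp(beta * sib_x))
        total += weight * log(likelihood)
    return combos, total
```

Taking hypothetical lengths X1 = 20, X2 = 22, and X3 = 25, a case carrying (20, 25) and a pseudo-sib carrying (22, 25) yield the four combinations (20, 22), (20, 25), (25, 22), and (25, 25), mirroring the enumeration in the text.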
The expected loglikelihoods computed under H1 ($\ell_1$) and H0 ($\ell_0$) then give
$\Lambda = 2(\ell_1 - \ell_0)$ for each set in the study. The noncentrality parameter C of the $\chi^2$
distribution for the desired power (1-b, where b is the probability of falsely accepting
the null hypothesis at significance a) then provides the basis to compute the sample
size N required to obtain the desired power at a given significance with the proposed
minimum effect size, allele distribution, and disease prevalence as
$$N = \frac{(z_{a/2} + z_b)^2}{\Lambda}.$$
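For the simplest design, a male case and his mother, the whole calculation fits in a short script. The sketch below is not the qpower package; it is a reconstruction under explicit assumptions: repeat lengths follow a normal density evaluated at the integers 8 to 35 and renormalised, the mother's two alleles are independent draws from that distribution, penetrance is log-linear (so the intercept cancels), and every function name is hypothetical.

```python
from math import ceil, exp, log
from statistics import NormalDist


def repeat_pmf(mean=22.0, sd=3.0, lo=8, hi=35):
    """Normal density at integer repeat lengths, renormalised over [lo, hi]."""
    dist = NormalDist(mean, sd)
    weights = {g: dist.pdf(g) for g in range(lo, hi + 1)}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def per_set_lambda(beta, pmf):
    """Lambda = 2 * (l1 - l0) for a male case and the untransmitted maternal allele."""
    numerator = 0.0
    denominator = 0.0
    for g1, p1 in pmf.items():          # allele transmitted to the case
        for g2, p2 in pmf.items():      # untransmitted allele (pseudo-sib)
            # penetrance times genotype probability; e^alpha cancels in the ratio
            weight = exp(beta * g1) * p1 * p2
            loglik = beta * g1 - log(exp(beta * g1) + exp(beta * g2))
            numerator += loglik * weight
            denominator += weight
    l1 = numerator / denominator
    l0 = log(0.5)                       # a 1:1 set contributes ln(1/2) at beta = 0
    return 2.0 * (l1 - l0)


def required_sets(relative_risk, sd, a=0.05, b=0.20):
    """Number of case-mother sets for power 1 - b at two-sided significance a."""
    z = NormalDist()
    lam = per_set_lambda(log(relative_risk), repeat_pmf(sd=sd))
    return ceil((z.inv_cdf(1 - a / 2) + z.inv_cdf(1 - b)) ** 2 / lam)
```

Calling `required_sets(1.1, 3.0)` gives a figure that can be compared with the corresponding entry of Table 1; exact agreement is not guaranteed, since the discretisation and truncation choices here are assumptions rather than documented details of the original implementation.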
For male cases, the loglikelihood contributions of each possible matched set
are summed over the possible maternal genotypes given the population parameters
(for quantitative genotypes, the parameters are the mean and variance of the
genotype, such as repeat lengths). For female cases, the loglikelihood contributions
are additionally summed over the possible “effective” X genotypes. The process of
X inactivation and the observed data most closely resemble a Bernoulli process for
risk prediction purposes; we therefore use a model in which “effective” genotypes (50% probability
of either X chromosome being inactivated) are computed separately, then each given half
weight in the overall term. Female case-sister pairs provide somewhat more
information (relative to case-mother pairs) than their male counterpart, and can be
expected to be 62.5% as informative as case-mother pairs for females. Male case-
brother (or case-sister, insofar as the models show) sets maintain the traditional ratio
of 50% informativeness relative to a case-mother pair (inasmuch as a male sibling
has a 50% chance of receiving the opposite maternal X allele from his brother).
Details of the relative information expected from each type of set, along with the
derivation of these quantities, can be found in Appendix A.
Stratified populations (mixed male and female cases, predicated upon $K_p$ for
family-based designs and Pr(gender) for population-based designs) are modeled by
computing $\Lambda_S$ for each stratum S in the population, then multiplying the expected
proportion of each stratum by its per-set noncentrality contribution and summing, for an overall $\Lambda^*$.
(The same approach used for male/female stratification can be used for any strata.)
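The pooling step for stratified populations amounts to a weighted average of the per-stratum contributions. A minimal sketch, with a hypothetical function name and purely illustrative input values:

```python
def pooled_lambda(strata):
    """Overall per-set Lambda* as the proportion-weighted sum of the
    per-stratum values.

    strata -- iterable of (proportion, per_set_lambda) pairs whose
              proportions sum to 1
    """
    strata = list(strata)
    total = sum(proportion for proportion, _ in strata)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("stratum proportions must sum to 1")
    return sum(proportion * lam for proportion, lam in strata)
```

For example, two equally represented strata with assumed per-set values of 0.04 and 0.02 pool to 0.03, and the required sample size then follows from the same formula as in the unstratified case with Λ replaced by the pooled value.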
In all of the above formulations we assume random mating, Hardy-Weinberg
equilibrium (which is maintained in simulations by employing the Bernoulli model
for X chromosome inactivation), and random X chromosome inactivation in females.
Additionally, we assume that the genotypes within a family are independent,
conditional upon the parental genotypes, which are in turn independent conditional
upon the population genetic parameters of the locus under study (random mating).
When the result of this analytical formulation was used as the sample size for
1000 simulated studies, 801 (or 80.1%) of the simulations conducted with the
proposed 196 family sets returned a statistically significant association (given a 1.1-
fold relative risk increase per unit, and a 0.05 significance level). This indicates
strong agreement between the analytical and simulated approaches. In order to
compare the power of a quantitative coding of a repeat polymorphism with
categorical codings of the same polymorphism, we turn to simulation methods and
empiric power determination.
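The agreement between 801 significant results in 1000 simulations and the 80% target can be put in context by attaching a Monte Carlo interval to the empiric estimate. The sketch below uses a normal-approximation (Wald) interval, which is an assumption made here for illustration and not a method described in the thesis.

```python
from math import sqrt
from statistics import NormalDist


def power_interval(significant, simulations, level=0.95):
    """Normal-approximation interval for an empiric power estimate."""
    p_hat = significant / simulations
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / simulations)
    return p_hat - half_width, p_hat + half_width


# 801 of 1000 simulated studies were significant
low, high = power_interval(801, 1000)
```

With 1000 simulations the standard error of the estimate is about 0.013, so the interval comfortably contains the nominal power of 0.80.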
2. Empiric estimation of statistical power via simulation
To simulate the ascertainment process in the general case of a disease with
some known baseline population risk and known population genetic parameters,
Monte Carlo trials were employed to generate random samples of potential study
participants with specified population parameters – genetic relative risk ratio, repeat
distribution (observed range 8-35, mean of 22, standard deviation of 3, cf. Davis-Dao
2007), and baseline risk within the population at large. The same assumptions
employed in the analytical computation of power are applied to the simulation
approach. This approach consists of enumerating the possible repeat lengths
between the maximum and minimum observed in clinical reports, computing the
probability of each possibility given the observed population mean and variance,
selecting pairs of alleles, and passing alleles to offspring by using a Bernoulli trial
(with p=0.5) to determine which allele was transmitted. For females, an additional
Bernoulli trial is used to determine which of the offspring's two X alleles would be
'inactivated', and the remaining allele is then treated as the risk predictor of interest.
The disease status of each at-risk individual is then modeled as a Bernoulli trial with
probability $p(D \mid g) = e^{\alpha + \beta X}$, with X the expressed ('effective') genotype (for case-sib
and case-control studies, the risk is modeled as $p(D = 1 \mid g = X) = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}$).
Once the specified number of sets has accumulated, the simulated study
population is returned, and conditional logistic regression is used to estimate genetic
risk. The significance of the likelihood ratio test is retained for power estimates.
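A compact version of this simulation for the male case-mother design is sketched below. It departs from the procedure described above in one labeled respect: rather than generating at-risk sons and keeping those who become affected, it samples each set directly from its distribution conditional on the son being affected, which under the log-linear penetrance means drawing the transmitted allele with weights proportional to p(g)e^(βg) and the untransmitted allele from p(g). The one-parameter conditional logistic fit is done by Newton's method on the 1:1 matched likelihood. All function names are hypothetical.

```python
import random
from math import exp, log
from statistics import NormalDist

CHI2_1DF_95 = 3.841458820694124  # 0.95 quantile of chi-square with 1 df


def repeat_weights(mean=22.0, sd=3.0, lo=8, hi=35):
    """Integer repeat lengths and unnormalised normal weights."""
    dist = NormalDist(mean, sd)
    lengths = list(range(lo, hi + 1))
    return lengths, [dist.pdf(g) for g in lengths]


def simulate_differences(n_sets, beta, rng):
    """Transmitted minus untransmitted maternal repeat length, per affected son."""
    lengths, weights = repeat_weights()
    tilted = [w * exp(beta * g) for g, w in zip(lengths, weights)]
    transmitted = rng.choices(lengths, weights=tilted, k=n_sets)
    untransmitted = rng.choices(lengths, weights=weights, k=n_sets)
    return [t - u for t, u in zip(transmitted, untransmitted)]


def loglik(beta, diffs):
    """Conditional loglikelihood for 1:1 sets: sum of -ln(1 + e^(-beta * d))."""
    return -sum(log(1.0 + exp(-beta * d)) for d in diffs)


def fit_beta(diffs, iterations=25):
    """Maximise the conditional loglikelihood by Newton's method."""
    beta = 0.0
    for _ in range(iterations):
        score = sum(d / (1.0 + exp(beta * d)) for d in diffs)
        info = sum(d * d * exp(beta * d) / (1.0 + exp(beta * d)) ** 2 for d in diffs)
        if info == 0.0:
            break
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta


def empiric_power(n_sets, relative_risk, n_studies, seed=1):
    """Share of simulated studies whose likelihood ratio test is significant."""
    rng = random.Random(seed)
    beta = log(relative_risk)
    hits = 0
    for _ in range(n_studies):
        diffs = simulate_differences(n_sets, beta, rng)
        lrt = 2.0 * (loglik(fit_beta(diffs), diffs) - loglik(0.0, diffs))
        hits += lrt > CHI2_1DF_95
    return hits / n_studies
```

A call such as `empiric_power(196, 1.1, 1000)` estimates the power of 196 sets at a 1.1-fold per-repeat relative risk; the estimate varies with the seed and the number of simulated studies.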
Resampling methods for empiric estimation of power
The same computational resources that enable relatively easy simulation of
ascertainment-based study designs and their outcomes (under specified model
parameters) can be leveraged to estimate power at a given significance level by
generating a suitably large and diverse simulated data set, and then extensively
resampling from the larger dataset with a specified sample size N to estimate the
power with N sets of cases and controls. A reasonable simulated dataset might
contain 10000 to 100000 sets of genotypes. From this, the number of possible ways
to choose a sample of 200, or even 2000, matched case-control sets is astronomically large.
The simulated datasets were then stored in a MySQL database for easy retrieval, and
restored into memory as needed, to perform resampling estimates under different
models. The empiric estimates of power reported at each sample size in the
following (Results) section were obtained by drawing at least 1000 samples of the
desired size (in some cases where the approximation appeared unstable, 10000
samples were used). The proportion of sampled datasets where the likelihood ratio
test was significant (again at a specified level) was then reported as the empiric
power at significance level α with sample size N. (The function power.boot was
originally meant to use bootstrap estimates, but resampling with replacement proved
to be simpler in practice.)
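The resampling scheme can be sketched as a small generic routine that draws repeated samples of N sets from a stored pool and records how often a supplied test is significant. To keep the example short, the test plugged in here is a McNemar-type test on the “long allele” coding, not the conditional logistic regression likelihood ratio test used for the reported results; the pool is held in memory rather than in a database, and every name below is hypothetical.

```python
import random
from math import exp, log
from statistics import NormalDist


def simulate_pool(n_sets, beta, rng, mean=22.0, sd=3.0, lo=8, hi=35):
    """Pool of (transmitted, untransmitted) maternal repeat lengths for
    affected sons, sampled conditional on disease under log-linear risk."""
    dist = NormalDist(mean, sd)
    lengths = list(range(lo, hi + 1))
    base = [dist.pdf(g) for g in lengths]
    tilted = [w * exp(beta * g) for g, w in zip(lengths, base)]
    transmitted = rng.choices(lengths, weights=tilted, k=n_sets)
    untransmitted = rng.choices(lengths, weights=base, k=n_sets)
    return list(zip(transmitted, untransmitted))


def long_allele_significant(sets, critical=3.841458820694124):
    """McNemar-type test of whether the longer maternal allele is transmitted
    more often than the shorter one (ties are uninformative)."""
    longer = sum(t > u for t, u in sets)
    shorter = sum(t < u for t, u in sets)
    informative = longer + shorter
    if informative == 0:
        return False
    return (longer - shorter) ** 2 / informative > critical


def resampled_power(pool, n_sets, n_draws, is_significant, rng):
    """Proportion of resampled datasets (drawn with replacement) that the
    supplied test declares significant."""
    hits = 0
    for _ in range(n_draws):
        sample = rng.choices(pool, k=n_sets)
        hits += bool(is_significant(sample))
    return hits / n_draws
```

A typical use would build a pool of 10000 sets once with `simulate_pool(10000, log(1.1), rng)` and then call `resampled_power(pool, 196, 1000, long_allele_significant, rng)`; any other significance test with the same signature can be substituted.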
CHAPTER IV: RESULTS
A disease model was implemented as described in the previous section.
10000 simulated cases were generated under each set of parameters to be evaluated.
Matched controls (pseudo-siblings, unaffected siblings, or population controls) were
generated simultaneously. Predictors were evaluated by conditional logistic
regression with various designs and model assumptions (some intentionally mis-
specified to investigate their effects). Analytic sample size estimates were used to
determine the size of the simulated populations, and the simulated populations were
then used to provide empiric estimates of power for each predictor under each set of
parameters, as described previously. In Figure 2, we plot expected power at various
sample sizes (from 1-500 matched case-control sets) and study designs (case-mother,
case-sibling, case-unrelated) for male cases (left) and female cases (right).
Figure 2. Power curves by study design, predictor, and affected gender
Table 1. Sample size required at 80% power: male case-mother study
Ψ = 1.1 Ψ = 1.2 Ψ = 1.3
σ = 1 1733 sets 477 sets 232 sets
σ = 2 434 sets 122 sets 61 sets
σ = 3 195 sets 57 sets 30 sets
In Table 1, sample size estimates for a male case-mother study of an X-linked repeat
polymorphism are shown across differing population variances and effect sizes.
Table 2. Statistical power for numeric and categorical codings of repeat lengths
Model               Predictor        Analytic N               Empiric power at N
Ψ = 1.1, σ² = 9     Repeat length    196 sets at 0.8 power    0.801
                    Long allele                               0.652
                    Above-mean                                0.611
In Table 2, empiric estimates of statistical power are shown for three possible
codings of repeat polymorphism length using the same study population. The
categorical codings were “Long allele”, indicating whether the longer parental repeat
was transmitted to the case, and “Above-mean”, indicating whether the individual's
CAG repeat polymorphism was longer than the population mean repeat length.
The results shown in Tables 1 and 2 indicate that power at a given minimum effect
size increases with the population variance (in Table 1, the required sample size is
roughly inversely proportional to σ²), which in turn is most effectively
represented by a numeric repeat length, as accommodated by conditional logistic
regression. (The categorical codings correspond to those of a TDT or X-TDT.)
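The three codings compared in Table 2 can be derived from the same pair of observed repeat lengths. The sketch below assumes the male case-mother setting, where the case carries the transmitted maternal allele; the function name is hypothetical, and treating a tie between the two maternal alleles as 0 for the “Long allele” coding is an assumption made here.

```python
def code_predictors(case_repeat, untransmitted_repeat, population_mean=22.0):
    """Numeric and categorical codings of one case-mother set."""
    return {
        # quantitative coding: the repeat length itself
        "repeat_length": case_repeat,
        # 1 if the longer of the two maternal alleles was transmitted
        "long_allele": int(case_repeat > untransmitted_repeat),
        # 1 if the case's repeat exceeds the population mean
        "above_mean": int(case_repeat > population_mean),
    }
```

A case carrying 24 repeats whose mother's other allele has 21 is coded as 24, 1, and 1 respectively; the two binary codings discard how far the lengths differ, which is the information the numeric coding retains.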
To construct Table 3, we generated analytical estimates of the sample size required
for each of several designs, under a log-additive or logistic penetrance model. The
estimates were generated with a 1.1-fold per-unit risk increase as the minimum to
be detected, and the relative efficiency of each design with progressively increasing
population repeat length variance is shown below. Female case-sister pairs are
62.5% as informative as case-mother pairs, while male case-brother pairs are only
50% as informative as case-mother pairs (see Appendix A for details). Matched
unrelated case-control sets are 2% more efficient than case-mother pairs, which
comes at the cost of potential confounding by population stratification. A full
derivation of the information to be expected from the various family-based designs
under random X chromosome inactivation is shown in Appendix A. As the
population repeat length variance increased, the empiric relative efficiency of each
design converged towards the values derived analytically.
Table 3. Relative efficiency of study designs for X-linked repeat polymorphisms
Case gender    Study design    Relative efficiency*
Male           Case-mother     100%
               Case-brother    51%
               Case-control    102%
Female         Case-mother     76%
               Case-sister     51%
               Case-control    79%
* Male case-mother design used as a baseline for comparison across both genders.
Table 4. Effects of stratification on sample size for equivalent statistical power
Study design Affected (case) population Sample size at 80% power
Case-mother 100% male 195 sets (baseline value)
Case-mother 51% male, 49% female 222 sets (14% increase)
Case-mother 100% female 249 sets (28% increase)
In Table 4, results for stratified study populations are shown, using a population
repeat length variance of 9, and a 1.1-fold risk increase per unit of repeat length.
Table 5. Effects of model mis-specification on statistical power
True risk model                        Specified model           Study design          Power loss
Average of X_1, X_2                    Bernoulli (X_1 or X_2)    Female case-mother    60%
10-fold increase if X > (μ_X + σ_X)    Loglinear: e^(α+βX)       Male case-mother      48%
10-fold increase if X > (μ_X + σ_X)    Logistic: expit(α+βX)     Male case-control     49%
In Table 5, the effects of model misspecification upon statistical power are explored.
Misspecification of the X chromosomal inactivation process in females results in a
significant decrease in statistical power at the same sample size (0.3 vs. 0.8).
Another possible misspecification is that of the penetrance function, for example
the use of a log-additive penetrance model in the analysis when the true mechanism
involves a threshold effect. In Table 5, rows 2-3, we supposed that a 10-fold risk
increase exists for individuals having a repeat length greater than one standard
deviation above the population mean, and generated 10,000 cases via this mechanism.
The resulting empirical power is roughly half of what is predicted analytically.
CHAPTER V: DISCUSSION
The results presented suggest that significant statistical power can be gained
by using conditional logistic regression to analyze repeat polymorphisms and disease
association, as compared to categorical codings of the same predictors. However,
the results of model misspecification and the relative difficulty of incorporating
extended or missing information in this type of regression analysis must be
considered. Additive or incomplete penetrance is not inconsistent with known
trinucleotide repeat disorders, and the exact nature of X chromosome inactivation
within a given cell type is a matter of some debate; nevertheless, an investigator must
consider the possible consequences of violating the assumptions presented herein.
The significant expected increase in statistical power (compared to a binomial
formulation of the TDT or X-TDT) is not without potential drawbacks of its own.
In the general case, unbiased misspecifications resulting in estimates with a
correlation $r^2$ to the true underlying behavior will see power attenuated as
$\Lambda^* = \Lambda r^2$.
In the event an investigator suspects such a mismatch, it might be prudent to estimate
$r^2$ empirically from simulations, and use the resulting estimate to generate $\Lambda^*$ for an
analytical estimate of the sample size required, providing some statistical 'insurance'.
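As a rough illustration of this 'insurance' calculation, the Python sketch below simulates repeat lengths, computes the squared correlation between the linear coding used in the analysis and a threshold indicator of the kind considered in Table 5, and inflates an analytical sample size by $1/r^2$. Several things here are assumptions made for illustration and are not taken from the text: the normal distribution and mean of the repeat lengths, the use of this predictor-level $r^2$ as the attenuation factor, and the premise that the noncentrality parameter scales linearly with sample size, so that dividing N by $r^2$ restores the original $\Lambda$. All function names are hypothetical.

```python
import math
import random

def squared_correlation(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def inflated_sample_size(n_analytic, r2):
    """Sample size needed to offset attenuation, assuming Lambda is linear in N."""
    return math.ceil(n_analytic / r2)

rng = random.Random(2008)        # fixed seed so the sketch is reproducible
mu, sigma = 22.0, 3.0            # hypothetical mean; sd 3 matches a variance of 9
lengths = [rng.gauss(mu, sigma) for _ in range(10000)]

# Threshold coding: 1 if the repeat length exceeds mean + 1 sd, else 0.
exceeds = [1.0 if x > mu + sigma else 0.0 for x in lengths]

r2 = squared_correlation(lengths, exceeds)
n_star = inflated_sample_size(195, r2)   # 195 sets is the Table 4 baseline
```

Here `n_star` is the number of sets that would give the misspecified analysis the noncentrality originally planned for 195 sets, under the stated assumptions.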
One observation which arises from the simulations conducted herein is that
case-parent trios are unnecessary for female affected subjects. The genotype of the
father is completely ascertained in any informative mating by the genotype of the
daughter: whichever allele does not belong to the mother is the father's. In cases of
disputed paternity, genotyping the wrong 'father' may in fact reduce statistical power.
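The inference of the paternal allele from a daughter and her mother can be sketched in a few lines of Python. This is a minimal illustration, assuming genotypes are recorded as pairs of repeat lengths with no genotyping error; the function name and the example repeat lengths are hypothetical.

```python
def infer_paternal_allele(daughter, mother):
    """Return the daughter's paternally inherited X allele, or None.

    Both arguments are 2-tuples of repeat lengths. An allele of the daughter
    is a candidate paternal allele whenever her other allele could have come
    from the mother. None is returned when the answer is ambiguous (both
    assignments are possible) or when neither allele matches the mother.
    """
    first, second = daughter
    maternal = set(mother)
    candidates = set()
    if second in maternal:
        candidates.add(first)
    if first in maternal:
        candidates.add(second)
    if len(candidates) == 1:
        return candidates.pop()
    return None

# Daughter carries 21 and 24 repeats; mother carries 21 and 19.
paternal = infer_paternal_allele((21, 24), (21, 19))
# Daughter and mother share both alleles, so the father's is not identifiable.
ambiguous = infer_paternal_allele((21, 19), (21, 19))
```

When the function returns an allele, genotyping the father adds nothing to the analysis for that family, which is the point made above.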
Missing data is not at all unusual for this type of study. In the event that a
parent cannot be genotyped, one or more siblings may be genotyped (Figure 1,
bottom row), providing partial information towards the parental genotype(s).
Missing or imputed information is not addressed by the methods discussed herein.
Moreover, methods that assume information is missing at random may produce bias.
In any given case-parent study, the missing-at-random assumption may be violated by
age-of-onset risk factors, congenital defects which affect fetal survival to
term, diseases or genetic variants reducing parental fertility, or any of a number of
other subtle mechanisms. As an example, one such scenario
involves cryptorchidism, or undescended testes, in males. A known risk factor for
testicular cancer (Muller et al., 1984), it is also thought to be a risk factor for
infertility; conversely, the observation of undescended testes in premature boys
may be due merely to pre-term delivery. Thus the estimation of influences upon
both disease risk and informatively missing information is, in practice, complicated
by myriad interacting factors. If incomplete sets are subject to an informative pattern
of missingness, estimates of association which fail to accommodate this will be biased
as a result. Reconstruction methods for categorical genotypes (Knapp, 1999) might
provide a basis to impute missing data in studies of repeat polymorphisms, though
potentially informative patterns of missingness are not directly addressed.
In an interesting approach to modeling informatively-missing data under a
hierarchical analysis method, Gauderman et al. (1997) employ a Gibbs sampler and
Markov chain Monte Carlo sampling to impute a continuous variable (smoking
history) and employ the results to improve an estimate of genetic relative risk in a
study of gene-environment interaction between smoking and an unidentified
Mendelian locus. Insofar as missing parents may be more likely to die of the disease
under study if they themselves are at elevated risk, the model-based method of
imputation they describe may provide a template for the present formulation. On one
hand, the presence of extended pedigree information and the use of proportional
hazards regression for the analysis both before and after imputation suggest that
significant work will be required to adapt the method, if it is appropriate to the task.
On the other hand, leveraging extended pedigree information would likely improve
the practical usefulness and power of the methods herein. This approach and its
requisite investigation of incorporating extended pedigree data are thus attractive.
CHAPTER VI: CONCLUSION
In studies of repeat polymorphisms and disease risk, statistical power is
improved by making use of the additional information gained from quantitative
(versus categorical) representations of the genotypes. Conditional logistic regression
provides a means to leverage such representations, but the statistical properties of
such an approach had not previously been elucidated. An analytical method for
specifying statistical power under both quantitative and categorical models is here
verified by simulation, and the general framework is amenable to a number of
models for genetic risk and disease transmission. The risk model for the methods
discussed is consistent with existing examples of trinucleotide repeat diseases, and its
statistical properties are attractive for investigators wishing to optimize their study's
cost efficiency. Further work is needed to develop methodology that can
accommodate missing or incomplete data while retaining the benefits of this approach,
especially in scenarios faced by investigators conducting practical studies.
Nonetheless, the improved power obtained by quantitative representations of repeat
polymorphisms suggests that this is a worthwhile focus for continued study.
ALPHABETIZED BIBLIOGRAPHY
Abel L, Müller-Myhsok B. Maximum-likelihood expression of the transmission/disequilibrium test and power considerations. Am J Hum Genet. 1998 Aug;63(2):664-7.
Allison DB. Transmission-disequilibrium tests for quantitative traits. Am J Hum Genet. 1997 Mar;60(3):676-90.
Dakouane-Giudicelli M, Legrand B, Bergere M, Giudicelli Y, Cussenot O, Selva J. Association between androgen receptor gene CAG trinucleotide repeat length and testicular histology in older men. Fertil Steril. 2006 Oct;86(4):873-7.
Davis-Dao CA, Tuazon ED, Sokol RZ, Cortessis VK. Male infertility and variation in CAG repeat length in the androgen receptor gene: a meta-analysis. J Clin Endocrinol Metab. 2007 Nov;92(11):4319-26.
Demidenko E. Sample size determination for logistic regression revisited. Stat Med. 2007 Aug 15;26(18):3385-97.
Ferlin A, Garolla A, Bettella A, Bartoloni L, Vinanzi C, Roverato A, et al. Androgen receptor gene CAG and GGC repeat lengths in cryptorchidism. Eur J Endocrinol. 2005 Mar;152(3):419-25.
Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, et al. Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004 Apr;36(4):388-93.
Gauderman WJ, Morrison JL, Carpenter CL, Thomas DC. Analysis of gene-smoking interaction in lung cancer. 1997;14(2):199-214.
Gauderman WJ. Sample size requirements for association studies of gene-gene interaction. Am J Epidemiol. 2002 Mar 1;155(5):478-84.
Gauderman WJ. Sample size requirements for matched case-control studies of gene-environment interaction. Stat Med 2002 Jan 15; 35-50.
Gonzalez-Zulueta M, Ruppert JM, Tokino K, Tsai YC, Spruck CH, Miyao N, et al. Microsatellite instability in bladder cancer. Cancer Res. 1993 Dec 1;53(23):5620-3.
Guo C, Lunetta KL, DeStefano AL, Ordovas JM, Cupples LA. Informative-transmission disequilibrium test (i-TDT): combined linkage and association mapping that includes unaffected offspring as well as affected offspring. Genet Epidemiol. 2007 Feb;31(2):115-33.
Ho GY, Bailey-Wilson JE. The transmission/disequilibrium test for linkage on the X chromosome. Am J Hum Genet. 2000 Mar;66(3):1158-60.
Horvath S, Laird NM. A discordant-sibship test for disequilibrium and linkage: no need for parental data. Am J Hum Genet. 1998 Dec;63(6):1886-97.
Horvath S, Laird NM, Knapp M. The transmission/disequilibrium test and parental-genotype reconstruction for X-chromosomal markers. Am J Hum Genet. 2000 Mar;66(3):1161-7.
Irvine RA, Yu MC, Ross RK, Coetzee GA. The CAG and GGC microsatellites of the androgen receptor gene are in linkage disequilibrium in men with prostate cancer. Cancer Res. 1995 May 1;55(9):1937-40.
Kennedy WR, Alter M, Sung JH. Progressive proximal spinal and bulbar muscular atrophy of late onset. A sex-linked recessive trait. Neurology 1968 (7): 671–80.
Kieburtz K, MacDonald M, Shih C, et al. Trinucleotide repeat length and progression of illness in Huntington's disease. J. Med. Genet. 1994(11): 872–4.
King BL, Peng HQ, Goss P, Huan S, Bronson D, Kacinski BM, et al. Repeat expansion detection analysis of (CAG)n tracts in tumor cell lines, testicular tumors, and testicular cancer families. Cancer Res. 1997 Jan 15;57(2):209-14.
Knapp M. The transmission/disequilibrium test and parental-genotype reconstruction: the reconstruction-combined transmission/disequilibrium test. Am J Hum Genet. 1999 Mar;64(3):861-70.
La Spada AR, Wilson EM, Lubahn DB, Harding AE, Fischbeck KH. Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature 1991 (352): 77–9.
Lehmann DJ, Butler HT, Warden DR, Combrinck M, King E, Nicoll JAR, et al. Association of the androgen receptor CAG repeat polymorphism with Alzheimer's disease in men. Neuroscience Letters. 2003 Apr 10;340(2):87-90.
Martin ER, Monks SA, Warren LL, Kaplan NL. A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet. 2000 Jul;67(1):146-54.
Muller J, Skakkebaek NE, Nielsen OH, Groem N. Cryptorchidism and testis cancer. Atypical germ cells followed by carcinoma in situ and invasive carcinoma in adulthood. Cancer 1984; 54: 629-634.
Nance MA, Myers RH. Juvenile onset Huntington's disease--clinical and research perspectives. Ment Retard Dev Disabil Res Rev 7 2001 (3): 153–7.
R Development Core Team. R: A language and environment for statistical computing, reference index version 2.2.1. R Foundation for Statistical Computing, Vienna, Austria. 2005. ISBN 3-900051-07-0, URL http://www.R-project.org.
Self SG, Longton G, Kopecky KJ, Liang KY. On estimating HLA/disease association with application to a study of aplastic anemia. Biometrics. 1991 Mar;47(1):53-61.
Sham PC, Curtis D. An extended transmission/disequilibrium test (TDT) for multi-allele marker loci. Ann Hum Genet. 1995 Jul;59(Pt 3):323-36.
Spielman RS, Ewens WJ. A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet. 1998 Feb;62(2):450-8.
Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993 Mar;52(3):506-16.
Weinberg CR, Wilcox AJ, Lie RT. A log-linear approach to case-parent-triad data: assessing effects of disease genes that act either directly or through maternal effects and that may be subject to parental imprinting. Am J Hum Genet. 1998 Apr;62(4):969-78.
Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol. 1999 Apr 15;149(8):693-705.
APPENDIX A: COMPUTATION OF RELATIVE INFORMATIVENESS
The relative information contributions to be expected from male and female case-
mother pairs are shown below, along with male case-brother pairs, female case-sister
pairs, and mixed (brother-sister or sister-brother) pairs. All of the following
calculations assume random X chromosomal inactivation, consistent within a given
cell type of interest, where the inactivation process is modeled as a Bernoulli trial.
For a male case and his mother, the maternal genotype ($X_1X_2$) and the affected
offspring's genotype ($X_1Y$) provide a benchmark of 100% informativeness. Given a
heterozygous mother, this describes a discordant sib/pseudo-sib pair, $X_1Y$ and $X_2Y$.
For a female case and her mother, the maternal genotype ($X_1X_2$) and the affected
offspring's genotype ($X_1X_3$, where $X_3$ represents the paternal X chromosome)
generate a discordant sib/pseudo-sib pair, $X_1X_3$ and $X_2X_3$. However, this pair is only
75% as informative as a corresponding male sib/pseudo-sib pair, because after X
inactivation, we are left with 4 possible “effective” genotype pairings: $X_1$ vs. $X_2$,
$X_1$ vs. $X_3$, $X_3$ vs. $X_2$, and $X_3$ vs. $X_3$. There will always be an uninformative pairing of
“effective” (active) genotypes between the sib/pseudo-sib pair, thus there is a net loss
of ¼ the information as compared to a male case-parent (sib/pseudo-sib) pair.
A male case-brother pair from a heterozygous mother with genotype $X_1X_2$ generates
either a discordant ($X_1$ vs. $X_2$) or concordant ($X_1$ vs. $X_1$) pair, with a 50% chance of
either. Therefore, a male case-brother pair is ½ as informative as a case-mother pair.
A female case-sister pair, however, is 5/8 (62.5%) as informative as a case-mother
pair under a random, consistent model for X chromosome inactivation. This is
because there are 8 possible pairings of “effective” genotypes from the two possible
offspring pairs, given a heterozygous mating type. One set ($X_1X_3$ vs. $X_2X_3$) produces
the “effective” genotype pairs $X_1$ vs. $X_2$, $X_1$ vs. $X_3$, $X_3$ vs. $X_2$, and $X_3$ vs. $X_3$, of which
¾ are informative. The other possible set ($X_1X_3$ vs. $X_1X_3$) produces “effective”
genotype pairings $X_1$ vs. $X_3$, $X_1$ vs. $X_1$, $X_3$ vs. $X_1$, and $X_3$ vs. $X_3$, of which ½ are
informative. Thus, averaging over the two equally likely sets, a female case-sister
pair provides (¾ + ½)/2 = 5/8, or 62.5%, of the information
contained in a case-mother pair. (The additional requirement for the paternal
genotype to be distinct from either of the maternal genotypes, in order for the mating
type to be fully informative, leads to a small decrease in informativeness vs. males.)
Performing the same calculations as above for a male case-sister pair or a female
case-brother pair generates “effective” genotype pairings of $X_1$ vs. $X_2$ and $X_1$ vs. $X_3$
for fully discordant male-female pairs, and $X_1$ vs. $X_1$ and $X_1$ vs. $X_3$ for partly
discordant male-female pairs (completely homozygous mating types are, as always,
uninformative). This leads to a relative informativeness of ¾ for either male case-
sister or female case-brother pairs, as compared to case-mother pairs.
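The counting arguments in this appendix can be reproduced by brute-force enumeration. The Python sketch below is one way to do so, assuming (as the appendix does) a heterozygous mother, a paternal allele distinct from both maternal alleles, and equally likely active alleles and sibling genotypes. The helper names and the string labels for the alleles are hypothetical.

```python
from fractions import Fraction
from itertools import product

def pair_informativeness(case, other):
    """Fraction of equally likely active-allele pairings whose alleles differ."""
    pairings = list(product(case, other))
    informative = sum(a != b for a, b in pairings)
    return Fraction(informative, len(pairings))

def design_informativeness(case, comparisons):
    """Average informativeness over equally likely comparison genotypes."""
    total = sum(pair_informativeness(case, c) for c in comparisons)
    return total / len(comparisons)

# X1 and X2 are the maternal alleles, X3 the paternal allele.
# Males carry one X; females carry two, one of which is active.
male_case = ("X1",)
female_case = ("X1", "X3")
brothers = [("X2",), ("X1",)]             # discordant or concordant brother
sisters = [("X2", "X3"), ("X1", "X3")]    # discordant or concordant sister

results = {
    "male case-mother": design_informativeness(male_case, [("X2",)]),
    "female case-mother": design_informativeness(female_case, [("X2", "X3")]),
    "male case-brother": design_informativeness(male_case, brothers),
    "female case-sister": design_informativeness(female_case, sisters),
    "male case-sister": design_informativeness(male_case, sisters),
    "female case-brother": design_informativeness(female_case, brothers),
}

for design, value in results.items():
    print(f"{design}: {value}")
```

Using exact fractions keeps the results directly comparable with the ¾, ½, and 5/8 values derived above.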
Abstract
The design and analysis of association studies for repeat polymorphisms has received scant attention in the literature. We present an analytical power calculation for studies of such polymorphisms, based on a case-parent design and an X-linked polymorphism. Existing tools for estimating statistical power in family-based studies (such as Quanto) presume categorical codings of autosomal loci. We extend the underlying method to handle quantitative codings of repeat polymorphisms, and discuss the advantages of doing so. Sample sizes for a conditional logistic regression analysis of a sex-linked repeat polymorphism in a case-parent design are presented. Empirical power for quantitative and categorical codings of the same polymorphism with the same sample size in otherwise identical studies are then compared via Monte Carlo simulation. The differences in information to be expected from male and female case-parent, case-sibling, and case-population pairs are discussed. In addition, the effects of stratification and model mis-specification are explored. The effects of missing data, and potential approaches for addressing it, are discussed.
Asset Metadata
Creator: Triche, Timothy J., Jr. (author)
Core Title: X-linked repeat polymorphisms and disease risk: statistical power and study designs
School: Keck School of Medicine
Degree: Master of Science
Degree Program: Biostatistics
Publication Date: 12/07/2008
Defense Date: 11/03/2008
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: conditional logistic regression, family-based genetic association studies, OAI-PMH Harvest, statistical power, x-linked repeat polymorphisms
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Siegmund, Kimberly (committee chair), Cortessis, Victoria Kristence (committee member), Gauderman, W. James (committee member)
Creator Email: tim.triche@gmail.com, ttriche@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m1886
Unique identifier: UC1503334
Identifier: etd-Triche-1908 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-128484 (legacy record id), usctheses-m1886 (legacy record id)
Legacy Identifier: etd-Triche-1908.pdf
Dmrecord: 128484
Document Type: Thesis
Rights: Triche, Timothy J., Jr.
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu