Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
(USC Thesis Other)
Detecting Joint Interactions between Sets of Variables in the Context of
Studies with a Dichotomous Phenotype, with Applications to Asthma
Susceptibility Involving Epigenetics and Epistasis
by
Vladimir Kogan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY in BIOSTATISTICS
Degree conferral date: December 2018
Copyright 2018
I submit this dissertation with gratitude to Joshua Millstein for his kind and patient
mentorship, his counsel and direction in formulating the concepts laid out in this manuscript,
and for enabling a publication that may help my name be recognized. I’m appreciative of all
the assistance and instruction provided by Carrie Breton, who provided me with interesting
data, allowed me to contribute to her research, and enlightened me about the importance of the
natural science aspects of my work. I express my appreciation of Heather Volk, who gave me
the opportunity to apply my knowledge to an important study that led to a publication. I’m
thankful to Kimberly Siegmund who provided advice on crafting this document and helped me
navigate it towards completion. I’m likewise appreciative of Daniel Stram who set an example
for how to be a proper, tolerant, and stoic educator, and aided me in understanding statistical
concepts in genomics that planted the seeds for ideas that facilitated consummation of this
work.
Contents
List of Figures
List of Tables
1. Introduction
2. Prior Literature
3. Statistical Methodology
3.1. Correspondence of within phenotype class covariance and between phenotype class variance
3.2. Case-Only Analysis
3.3. Conceptual Correspondence between Correlation and Odds Ratio or Relative Risk
3.4. Canonical Correlation
4. Canonical Analysis of Set Interactions (CASI)
4.1. Generalization of the CASI method (gCASI)
4.2. gCASI Procedure
4.3. Interpreting CASI via Canonical Correlation: Loadings
4.4. Interpreting CASI via Canonical Correlation: Allocation of Variance
5. Simulations
5.1. Simulation to Evaluate CASI as a Method to Detect Interaction Realized as a Contrast Between Set Correlations Across Cases and Controls
5.1.1. Statistical Power Analysis
5.2. Simulation to Evaluate CASI as a Method for Detecting Multiplicative Interaction
5.2.1. Statistical Power Analysis
6. Application to Real Data Sets
6.1. Application of CASI to Children’s Health Study (CHS) Data
6.1.1. Demographics of Study Participants
6.1.2. CASI Analysis
6.1.3. Follow-up Analysis Using Logistic Regression
6.1.4. Most Significant Pairwise SNP Interactions from Interacting Gene Pairs
6.2. Application of CASI to the Asthma Bio-Repository for Integrative Genomic Exploration (ABRIDGE) Data Set
6.2.1. Asthma BRIDGE (ABRIDGE)
6.2.2. SNPs and DNA Methylation
6.2.3. Demographics of the ABRIDGE Population
6.2.4. CASI Analysis of ABRIDGE Data
6.2.5. Follow-up Analyses in Significant Regions
6.2.6. Pairwise SNP-CpG Interaction Analysis
6.3. Application of the generalized CASI approach (gCASI) to ABRIDGE data
6.3.1. CAMP Expression Data
6.3.2. ABRIDGE Expression Data
6.3.3. Identifying Gene Sets and Gene Interactions
6.3.4. Gene Networks
6.3.5. Application of gCASI to ABRIDGE WB Expression Data
7. Results
7.1. Simulations
7.1.1. CASI as a Method for Detecting Interaction by a Contrast in Set Correlations
7.1.2. CASI as a Method for Detecting Multiplicative Interaction
7.2. Application
7.2.1. CASI Analysis of CHS Data
7.2.2. CASI Analysis of ABRIDGE data
7.2.3. gCASI Analysis of ABRIDGE and CAMP Expression Data
8. Discussion
References
List of Figures
Figure 1. Conceptual illustration showing distribution of genomic variables for case and
control individuals from a population in which the within class correlation between
genomic variable values is 0.9 and a population in which it is 0.2.
Figure 2. Notional grid of intersections between two discretized variables with overlaid ellipse
representing latent bivariate normal distribution. An illustration of the domain of
standard normal density function being discretized by dashed lines into a 2 × 2
contingency table.
Figure 3. Notional grid of intersections between two discretized variables with overlaid ellipse
representing latent bivariate normal distribution. An illustration of the domain of
standard normal density function being discretized by dashed lines into a 3 × 11
contingency table.
Figure 4. FDR plots are snapshots from the entire range of simulations. The green values
shown horizontally indicate number of CASI tests judged significant for each threshold on
the horizontal. For sample size 250 we can glean from a series of FDR estimates the
number of true positive tests that can be identified, hence giving us a measure of power of
the test. The same can be shown for other sample and effect sizes. The top plot shows
small effect size and with the corresponding power quite small given that only
approximately 9 or 10 of the tests are identified correctly as significant at a traditional
level FDR of 0.05. However, as the effect size increases, as shown in 2nd and 3rd figures
from the top nearly or exactly 100% of the true positive tests are correctly identified. All
tests to which we apply CASI are true positive tests because they are comparing samples
from two distinct populations with different correlation structures.
Figure 5. Power plots show power vs. effect size at sample sizes 50, 100, 150, 200, and
Figure 6. The plot depicts consistency of the CASI method by summarizing power results
Figure 7. Factors, $g_1^{\mathrm{CASI}} / g_0^{\mathrm{CASI}}$, on the y-axis representing relative likelihood that the
observed statistic is non-null compared to null. The study is a mixture of tests, arising
from underlying non-null and null relationships with disease. These associations are
assumed to have known rates of occurrence, corresponding to an odds of $p_1 / p_0 =
0.1/0.9$, where $p_1$ and $p_0$ are for non-null and null, respectively. The plot shows factors, relative
likelihoods, ratios of probability or density functions for null and alternative hypotheses,
$g_1^{\mathrm{CASI}} / g_0^{\mathrm{CASI}}$, for a range of FDR values. It is evident from the plot that the factors can
be quite high for relatively modest FDR value such as 0.2, which has a factor of 36, and
extremely high for low FDR of 0.05, which has a factor of 171.
Figure 8. Comparison of power characteristics of set based methods and conventional logistic
regression for sample sizes 2000 and 2500 shows preeminence of CASI and other set-based
methods over logistic regression. The performance of set based methods is
illustrated in the top two figures showing side by side that increasing sample size
improves power for some approaches. CASI achieves the best outcome, nearing a power of
0.2 for sample size of 2000 and 0.3 for sample size of 2500 at FDR of 0.05. The top ten
version of CASI (CASI_top10) and CLD are the next best alternatives to CASI. Peng et al.’s
CCA (CCA (Peng)) approach appears to be the worst in identifying interaction, with the
lowest power that is much less than 0.1 at FDR of 0.05.
Figure 9. Comparison of power characteristics of set based methods and conventional logistic
regression for sample sizes 2000 and 3000 shows preeminence of CASI and other set-
based methods over logistic regression. The performance of set based methods is
illustrated in the top two figures showing side by side that increasing sample size
improves power for some approaches. CASI achieves the best outcome, nearing a power of
0.2 for sample size of 2000 and 0.5 for sample size of 3000 at FDR of 0.05. The top ten
version of CASI (CASI_top10) and CLD are the next best alternatives to CASI for both
sample sizes. Peng et al.’s CCA (CCA (Peng)) method appears to be the worst in identifying
interaction, with the lowest power that is much less than 0.1 at FDR of 0.05.
Figure 10. Comparison of power characteristics of set based methods and conventional
logistic regression for sample sizes 3000 and 3500 shows preeminence of CASI and other
set-based methods over logistic regression. The performance of set based methods is
illustrated in the top two figures showing side by side that increasing sample size
improves power for some approaches. CASI achieves the best outcome, nearing a power of
0.5 for sample sizes of 3000 and 3500 at FDR of 0.05. The top ten version of CASI
(CASI_top10) and CLD are the next best alternatives to CASI for both sample sizes. Peng et
al.’s CCA (CCA (Peng)) method appears to be the worst in identifying interaction, with the
lowest power that is much less than 0.1 at FDR of 0.05.
Figure 11. Comparison of power characteristics of set based methods and conventional
logistic regression for sample sizes 3500 and 4000 shows preeminence of CASI and other
set-based methods over logistic regression. Performance of set based methods is
illustrated in the top two figures showing side by side that increasing sample size
improves power for some approaches. CASI achieves the best outcome, reaching a power
of 0.5 for sample size of 3500 and more than 0.6 for sample size of 4000 at FDR of 0.05.
The top ten version of CASI (CASI_top10) and CLD are the next best alternatives to CASI
for both sample sizes. Peng et al.’s CCA (CCA (Peng)) method appears to be the worst in
identifying interaction, with the lowest power that is much less than 0.1 at FDR of 0.05.
Figure 12. FDR plot for 3741 pairs of genomic regions using Children’s Health Study (CHS)
data. Top 10 significant genomic regions are selected using permutation based FDR
method (Millstein and Volfson, 2013), with significance threshold demarcated by vertical
line tethered at 15.85 on the horizontal axis, which corresponds to selected FDR estimate
of 0.049 (vertical axis) and FDR confidence limits (0.02, 0.12). Values shown horizontally
in green specify the number of genomic regions in each significant subset delimited by
significance thresholds on the horizontal axis.
Figure 13. SNP(CDK2) by SNP(IL1RL1) interactions.
Figure 14. SNP(IL1RL1) by SNP(CDK2) interactions.
Figure 15. SNP(CDK2) by SNP(DENND1B) interactions.
Figure 16. SNP(DENND1B) by SNP(CDK2) interactions.
Figure 17. SNP(DBX1) by SNP(DENND1B) interactions.
Figure 18. SNP(DENND1B) by SNP(DBX1) interactions.
Figure 19. SNP(DENND1B) by SNP(IL2RB) interactions.
Figure 20. SNP(IL2RB) by SNP(DENND1B) interactions.
Figure 21. SNP(HCG23) by SNP(LPIN2) interactions.
Figure 22. SNP(LPIN2) by SNP(HCG23) interactions.
Figure 23. Interaction plot for most significant SNP pair for genes CDK2 and IL1RL1. The
relative shades of gray represent the odds of case vs. control at the specified combination
of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on
the vertical axis within each subset defined by the value of the SNP on the horizontal axis
and case status combination.
Figure 24. Interaction plot for most significant SNP pair for genes CDK2 and DENND1B. The
relative shades of gray represent the odds of case vs. control at the specified combination
of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on
the vertical axis within each subset defined by the value of the SNP on the horizontal axis
and case status combination.
Figure 25. Interaction plot for most significant SNP pair for genes DBX1 and DENND1B. The
relative shades of gray represent the odds of case vs. control at the specified combination
of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on
the vertical axis within each subset defined by the value of the SNP on the horizontal axis
and case status combination.
Figure 26. Interaction plot for most significant SNP pair for genes DENND1B and IL2RB. The
relative shades of gray represent the odds of case vs. control at the specified combination
of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on
the vertical axis within each subset defined by the value of the SNP on the horizontal axis
and case status combination.
Figure 27. Interaction plot for most significant SNP pair for genes HCG23 and LPIN2. The
relative shades of gray represent the odds of case vs. control at the specified combination
of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on
the vertical axis within each subset defined by the value of the SNP on the horizontal axis
and case status combination.
Figure 28. FDR and confidence intervals for 18,178 genomic regions (genes) defined by 2kb
up and downstream of each gene. The value of the CASI statistic (x-axis) has no direct
interpretation other than that larger values are more extreme. 12 significant genomic
regions correspond to a threshold of 4.35 (vertical dashed line) and FDR of 0.05, CI =
(0.024, 0.104). Integer values shown in green specify the number of genomic regions with
CASI statistics at least as extreme as the thresholds specified by the horizontal axis.
Figure 29. SNP and methylation loadings for genomic regions LOC101928523, LHX6, and
SCARNA18 with LD heat maps. Red circles indicate SNPs and blue squares CpG sites. The
horizontal axis indicates base pair (BP) positions in each of the three genomic regions.
Heat maps represent dependencies between SNPs and CpGs (alignment is approximate).
Figure 30. Interaction box plots for SNP-methylation pairs with highest loadings in the top 12
genomic regions.
Figure 31. Ancestry distribution of study participants.
Figure 32. 8 gene sets were identified at an FDR of 0.16 CI: (0.05, 0.53)
Figure 33. Distribution of phenotype from the results.
Figure 34. Expression values and their products (multiplications) vs. asthma severity.
Figure 35. Gene pairs and their corresponding loadings represent extent of involvement in the
gene network perturbation.
Figure 36. Visualization of NAKAJIMA EOSINOPHIL gene network with pairwise gene
interactions shown by edges, representing strength of relationship, linking nodes.
Figure 37. Plot for phenotype vs. canonical variate identify how expression profiles for genes
in a network relate to asthma control and severity.
List of Tables
Table 1
Characteristics of Participants in the CHS Study
Table 2
Results of tests for multiplicative interaction of pairs of SNPs using logistic regression.
This subset was selected because it had the smallest q-value. The SNP pairs all come from
genes HCG23 and LPIN2. Many tests had the same interaction p-values due to complete LD
and were omitted as duplicate information and in the interest of brevity.
Table 3
Top 5 SNP(CDK2) by SNP(IL1RL1) interactions. Gene pair interaction q-value is 0.35.
Table 4
Top 10 SNP(IL1RL1) by SNP(CDK2) interactions. Gene pair interaction q-value is 0.35.
Table 5
Top 3 SNP(CDK2) by SNP(DENND1B) interactions. Gene pair interaction q-value is 0.09.
Table 6
Top 10 SNP(DENND1B) by SNP(CDK2) interactions. Gene pair interaction q-value is 0.09.
Table 7
Top 6 SNP(DBX1) by SNP(DENND1B) interactions. Gene pair interaction q-value is 0.29.
Table 8
Top 10 SNP(DENND1B) by SNP(DBX1) interactions. Gene pair interaction q-value is 0.29.
Table 9
Top 10 SNP(DENND1B) by SNP(IL2RB) interactions. Gene pair interaction q-value is 0.39.
Table 10
Top 10 SNP(IL2RB) by SNP(DENND1B) interactions. Gene pair interaction q-value is 0.39.
Table 11
Top 10 SNP(HCG23) by SNP(LPIN2) interactions. Gene pair interaction q-value is 0.24.
Table 12
Top 10 SNP(LPIN2) by SNP(HCG23) interactions. Gene pair interaction q-value is 0.24.
Table 13
Characteristics of Participants in the ABRIDGE Study.
Table 14
Significant interactions for SNPs and CpGs with loadings > 0.5.
Table 15
Tests of multiplicative interactions (P < 0.05) within LHX6 for SNP-CpG pairs.
Table 16
Tests of multiplicative interactions (P < 0.05) within LOC101928523 for SNP-CpG pairs.
Table 17
Tests of multiplicative interactions (P < 0.05) within SCARNA18 for SNP-CpG pairs.
Table 18
Top ranked gene sets and corresponding phenotypes associated with network
perturbation.
Table 19
Results of tests for multiplicative interaction from regression applied to pairs of
associated genes identified in “NAKAJIMA EOSINOPHIL” gene set network and found to
have significant influence on 6-month asthma control variable. (GPR44 is an alias for
PTGDR2 Gene).
Table 20
Results of tests for marginal effects from regression applied to genes associated with at
least one other gene in “NAKAJIMA EOSINOPHIL” gene set network, and significantly
associated with a 6-month asthma control phenotype. (GPR44 is an alias for PTGDR2 Gene)
Table 21
Results of tests for interaction with respect to 7-day asthma control phenotypes from
regression applied to genes involved in association in “NAKAJIMA EOSINOPHIL” gene set
network and found to vary significantly with 6-month asthma symptom events. Ordering
within each phenotype is based on loadings of gene pairs from analysis with 6-month
phenotype in ABRIDGE WB data set. (GPR44 is an alias for PTGDR2 Gene).
1. Introduction
The discovery and characterization of genes known to have specific mutations as important
risk factors for complex diseases has ushered in a period of searching for genomic elements
involved in various illnesses (Narod and Foulkes, 2004). However, much about these diseases
remains unexplained, and many of the genomic factors involved have yet to be identified
(Ripperger, et al., 2009). The common variant approach to discovery may be insufficient to
uncover new variations that explain a large proportion of the genomic risk. While genome-wide
association studies (GWAS) can be used to identify individual or multiple variants, it is
possible that additional genomic risk exists in the form of combinations of various other
factors, such as DNA binding proteins and methylation of DNA, possibly with other
intermediate elements. There is a desire to considerably advance
our understanding of the genomic architecture of diseases, where genomic architecture is
understood to include sets of DNA sequences and epigenetic components such as
methylation involved in disease. It is also important to characterize their variation in the
population and their effects on the phenotype (Weiss, 1995). We contend that, as
emerging evidence suggests, efforts to elucidate the genomic architecture of diseases must
focus on such underlying complexity.
A widely used approach to revealing the underlying genetics of a disease is to carry out a GWAS
with more than a million SNPs that capture a great deal of the variation in the human
genome by tagging blocks of variants that are in linkage disequilibrium (Hirschhorn and
Daly, 2005; Wang, et al., 2005). This methodology relies on the proposition that scanning
the whole genome for SNP associations without presuppositions, disregarding what we
know about disease biology, will reveal much about a particular disease. Searching for a
strong SNP effect without considering the rest of the genome is reasonable when little is
known about the etiologic factors involved in diseases. However, this tactic might yield
unreliable results (Spitz, et al., 2008). GWAS has been useful in identifying susceptibility
loci, but it is less effective at explaining the genomic architecture of complex disease
and may not account for the majority of genomic risk. Identification of associations is
important, but it is also vital to note that the excess disease risks associated with
discovered susceptibility alleles are small (Ahmed, et al., 2009; Easton and Eeles, 2008;
Easton, et al., 2007; Ripperger, et al., 2009). Achieving success with GWAS is often governed
by assumptions we make about disease complexity, with limitations that include small
effect sizes and lack of reproducibility (Clark, et al., 2005; Jakobsdottir, et al., 2009; Kraft, et
al., 2009).
Complex diseases result from interactions between genetic variants and other genomic
factors. Our understanding of their genetic architectures has grown considerably in recent
decades, driven by keen interest in identifying and understanding the gene-gene
interactions underlying complex diseases using genetic data (Mahdi, et al., 2009; Moore and
Williams, 2009; van der Woude, et al., 2010; Wan, et al., 2010; Zhang and Liu, 2007).
Substantial effort has been expended in finding and characterizing susceptibility genes of
human diseases. However, understanding their network of interactions and how other
genomic factors intervene remains a great challenge. For complex diseases, the relationship
between genomics and phenotype is usually multifaceted. Both theoretically and
methodologically, detecting gene-gene and gene-epigenome interactions remains difficult.
One way of thinking about interaction is in the context of statistical models. In this framework,
interaction is thought of as deviation from additive effects and is formalized in
the theory of linear models that describe the relationship between an outcome phenotype
variable such as disease state and predictor variables such as SNPs. This includes ANOVA,
logistic, polynomial, multinomial, and other generalized linear models. In each, a particular
model is proposed for how we believe the predictors might relate to the outcome, and we
assess the evidence for interaction by comparing the fit of models with
and without an interaction term. Given two genetic factors conjectured to influence disease
risk, perhaps the most straightforward way to test for statistical interaction in a
case-control study is to fit a logistic regression model that includes main effects and the relevant
interaction term and then to test whether the coefficient of the interaction term equals zero.
In established statistical modeling approaches, including linear and generalized linear
models such as logistic regression, the genetic effects are decomposed into main and
interaction effects (Fisher, 1918). The statistical interactions are deviations from the main
effects. Traditional statistical models may not work for sparse data or data with too many
variables. For example, logistic regression models that include interaction terms can fail to
converge when some cells contain too few individuals (Andrew, et al., 2006). Classical
statistical models cannot properly fit the multifactorial relationship between genomic
factors and disease phenotypes. They may not be useful for modeling biological
interactions. For example, in Andrew et al.’s (2006) analysis of bladder cancer data, the
main effects of genetic polymorphisms were not observed and it was unclear whether logistic
regression fit the data well (Andrew, et al., 2006). The failure of convergence may have
been because the logistic regression model itself was invalid. Yet one advantage of
traditional statistical models is that the related theory is very mature: variance partitioning
and ANOVA are standard procedures for data analysis and model selection.
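The regression test of interaction described above can be sketched for the simplest setting of two binary factors. This is a minimal illustration, not the analysis used in this work: the function name and the cell counts are hypothetical, and it relies on the fact that a saturated logistic model on a 2x2 exposure table has closed-form maximum likelihood estimates, so the interaction coefficient is a contrast of cell-specific log odds.

```python
import math

def interaction_wald_test(counts):
    """Wald test of the interaction coefficient in a saturated logistic model.

    counts maps (g1, g2) -> (n_cases, n_controls) for two binary factors.
    The function name and input layout are illustrative assumptions.
    """
    # Log odds of disease within each of the four exposure cells.
    log_odds = {cell: math.log(cases / controls)
                for cell, (cases, controls) in counts.items()}
    # Interaction coefficient: departure from additivity on the log-odds scale.
    beta3 = (log_odds[(1, 1)] - log_odds[(1, 0)]
             - log_odds[(0, 1)] + log_odds[(0, 0)])
    # Large-sample standard error: root of the summed reciprocal cell counts.
    se = math.sqrt(sum(1 / n for pair in counts.values() for n in pair))
    z = beta3 / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return beta3, se, p_value

# Hypothetical case/control counts for each (g1, g2) combination.
example = {(0, 0): (10, 40), (1, 0): (20, 40), (0, 1): (30, 40), (1, 1): (120, 40)}
beta3, se, p_value = interaction_wald_test(example)
print(f"interaction log odds ratio = {beta3:.3f}, SE = {se:.3f}, p = {p_value:.3f}")
```

A cell with zero cases or zero controls makes the estimate undefined here (the logarithm or the division fails), which is the closed-form counterpart of the convergence failures with sparse cells noted above.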
We can think of interactions in a biological sense, on the other hand, as that which happens
at the cellular level and results from physical interactions between biomolecules such as
DNA, RNA and proteins (Bateson, 2002; Moore and Williams, 2009; Wilson, 1902). The
relationship between genomics and disease phenotypes is usually complex, involving
genes, proteins, and other factors. The biological gene-gene and gene-epigenome
interactions are understood as the interdependence between genomic factors, disturbance
of which may cause complex diseases. Thus, biological interaction makes sense and it is
valid in describing the complicated relations between genomic factors and disease
phenotypes. Even in the absence of main effects, the biological gene-gene and gene-
epigenome interactions may exist and can be important (Frankel and Schork, 1996). The
related theory to detect and characterize such interactions is not well-developed. There is a
need to cultivate powerful methods to identify and to interpret the genetic architecture of
complex traits.
Unfortunately, due to complexity in the relationships among genomic factors that give rise
to traits it is not always clear how to define interaction and the challenge is both theoretical
and technical. We tend to apply methods conceived to detect a certain kind of statistical
interaction to biological systems that are not well understood. For example, it is well
known that generalized linear models, including logistic regression in case-control studies,
are widely used to study complex systems that may be nonlinear but could lend themselves
to such approximation (Andrew, et al., 2006). In forming our method, we consider a rather
broad construal of interaction as any statistical variability attributable to conjoint activity
of variables within a biological system. Conjoint activity could be thought of as effect
modification or multiplicative interaction, which in the context of a linear model is also seen as
deviation from additive effects. These are the most common statistical definitions. There
have been many different and changing definitions of interaction in genomics (Brodie III,
2000; Cheverud and Routman, 1995; Cordell, 2002; Cordell, 2009; Miller, 1997; Moore and
Williams, 2005; Phillips, 1998; Phillips, 2008; Snyder, 1935; Tyler, et al., 2009; Wade, 2001;
Wade, et al., 2001). Fisher defined interaction in a statistical fashion as deviation from
additivity in a linear model (Fisher, 1919). This definition of interaction as non-additivity of
genomic effects assessed mathematically is different from the biological definition. The
distinction between biological epistasis and Fisher's statistical epistasis, discussed by
Moore and Williams (2005), is important when considering the
genomic architecture of disease. We should think of biological interaction as occurring at
the cellular level, whereas statistical interaction is a model relating genomics to phenotype
that results from variation in two or more genomic factors. This is important because we
try to draw a biological conclusion from a statistical model that defines a genomic
association. While investigators search for a modern definition of interaction in light of our
knowledge about gene networks and the relationship between genetics and epigenetics,
they still employ the classic definitions provided by Fisher and others (Fisher, 1918; Moore and
Williams, 2005; Phillips, 2008; Wilson, 1902).
A key point to keep in mind when espousing a complex view of genomic architecture is that
we are uncertain about which interaction models are plausible, since we have yet to
systematically appraise genomic models of human disease. Nonetheless, we have begun to
think about what is biologically plausible using computational experiments (Moore,
Boczko and Summar, 2005). A review by Moore and Williams claimed that there already
exist initial thrusts toward constructing the link between biological and statistical
interaction (Moore and Williams, 2005). Biological phenomena are innately complex. It is
therefore important to consider biological plausibility in addition to analytical simplicity in
designing analyses, and we attempt to do this with our approach. When numerous
genomic factors are used in an analysis, it sometimes becomes a high-dimensional or near
high-dimensional problem. If we consider multiple SNPs and other genomic factors
simultaneously we are challenged with a problem that has too many variables for a linear
model. This is the motivation for developing novel statistical methods to detect
and characterize the complex biological gene-gene and gene-epigenome interactions of
complex traits. Traditional statistical models cannot properly fit the relationship
between numerous genomic features and disease phenotypes, and therefore may not be
sufficient for modeling multifarious biological interactions.
It has been argued that interaction is likely to be a pervasive piece of the genetic
architecture of common human diseases (Moore, 2003; Templeton, 2000). The explanation
for this is that interaction is an idea that is common in biological literature because
biomolecular interactions are pervasive in gene regulation and biochemical systems. This
suggests that the relationship between DNA sequence variations and biological outcomes
involves interactions of multiple genes and their products. When data is extensively
interrogated interaction is frequently found. However, significant results commonly do not
replicate in follow-up studies. Interactions may be ubiquitous in human biology, but we
don’t have a way to explain their mechanisms. Evolution has succeeded in giving rise to
strong systems by creating redundant gene networks that are resilient to genomic and
environmental variations (Waddington, 1942). This may be the reason interaction is so
universal within the framework of human disease. What we perceive as disease may be the
consequence of aggregation of multiple alterations in parts of a gene network that is
perturbed beyond its evolved range. Therefore, single variants explain only a small amount of the
risk of a disease. This means that it is important to search for combinations of genetic variants
in populations to understand the alteration of configurations across gene or genomic
networks that occurs as an individual’s phenotype moves into a disease state such as asthma.
In essence, natural selection pushes a population to a condition where a large
portion of people are not ill, and this is achieved through multipart networks that include
ample interaction (Waddington, 1942). The idea of redundancy in genomic networks
provides a way of thinking about why the genetic architecture of common diseases is so
complex, and interaction can be thought of as a network of genomic factors (Tyler, et al.,
2009). Evolution shaped human biology, which may explain why genomic architecture is
likely to be a blend of different kinds of genomic effects including gene-epigenome
interactions (Gibson, 2009).
Currently, readily available methods for detecting interaction in statistical genetics are
taken from the regression framework and focus on detecting gene-gene interaction (GGI)
in genome-wide association studies (GWASs). In this way they are
mostly limited to the single nucleotide polymorphism (SNP) as the unit of
association. While such single-marker approaches to interaction analysis are widely
available, these methods suffer from a severe multiple testing burden and do not take
advantage of existing information, notably the definition of genes as functional units in biological
systems. Weak effects of individual SNPs translate to weak power, and the assumption that a
SNP plays a distinct large role in explaining the heritable variation of common disease has
been questioned (Visscher, et al., 2012). A greater portion of heritability may be accounted
for by epistasis, gene-environment interactions, or, as in one of our applications, the
relationship between SNPs and methylation loci. Finally, in line with the multiple
testing issue there is the question of computational feasibility given the sheer number of
pairwise combinations, SNPxSNP or SNPxMethylation. Any kind of pairwise approach to
testing for interaction suffers from a multiple testing challenge (Cordell, 2009; Marchini,
Donnelly and Cardon, 2005), which is particularly severe if applied on a genome-wide
scale. For example, in a study with one million SNPs and 450,000 CpG sites, 450 billion
tests would be required. Our goal is to address some of these disadvantages, statistical
power in particular, by considering higher level units such as genes or genomic regions in
the analysis.
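The arithmetic behind the genome-wide example above can be checked directly. The sketch below counts the SNP-by-CpG pairs and shows the resulting Bonferroni threshold; the significance level of 0.05 and the figure of 20,000 gene-level sets are assumptions introduced only for comparison and do not come from the data analyzed here.

```python
# Pairwise SNP x CpG testing burden from the example in the text.
n_snps = 1_000_000
n_cpgs = 450_000
alpha = 0.05  # assumed family-wise significance level

n_tests = n_snps * n_cpgs            # every SNP paired with every CpG site
pairwise_alpha = alpha / n_tests     # Bonferroni-corrected per-test threshold

# Hypothetical set-based alternative: one test per gene-level set of variables.
n_gene_sets = 20_000                 # illustrative number of gene-level units
set_based_alpha = alpha / n_gene_sets

print(f"pairwise tests: {n_tests:,}")
print(f"pairwise per-test threshold: {pairwise_alpha:.2e}")
print(f"set-based per-test threshold: {set_based_alpha:.2e}")
```

Under these assumptions the set-based threshold is many orders of magnitude less stringent, which illustrates why moving to higher level units can recover statistical power.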
Our understanding of the genetic architecture of diseases has increased over the preceding
decades. A method that offers promise but has yet to be widely applied to genomic data is
the set-based approach that relies on canonical correlation. In this paper we develop a
canonical correlation-based test statistic for gene-gene and gene-epigenome interactions.
We characterize its properties using simulation. We then apply these methods to asthma
data sets and thereby test their power and identify strong points and limitations. The
application is to two-way set interactions. As an extension of our method to higher order
interaction we demonstrate a generalization of our approach through application to gene
networks. In the generalization we introduce a sparse canonical correlation function due to
the need to deal with high-dimensional data. Through simulation we expound on the test
statistic’s robustness, power, and unbiasedness. Applying it to the asthma data set allows
us to investigate the complex interactions between SNPs, methylation, and asthma
susceptibility. Although not yet widely applied, canonical correlation-based approaches can
be a useful tool for detecting gene-gene and gene-epigenome interactions. We hope the test
procedure we are advancing will add to a growing body of methodologies that will steadily
shed light on the complex architecture of common diseases.
Our method can characterize two types of interaction: the well-established form
referred to as multiplicative, and one that takes the form of covariance, which we show under
ANOVA assumptions to be a deviation from additive effects. While covariance doesn’t fall into
traditional definitions of statistical interaction it is more closely associated with how we
understand biological interaction, those that biologists know to occur at the cellular level as
physical interactions between biomolecules (Bateson, 2002; Moore and Williams, 2009;
Wilson, 1902). In this sense biological interaction can be understood as a form of
interdependence between genomic factors which can lead to disease. This suggests
covariance may be appropriate for measuring biological interaction. We show covariance
to be a form of interaction as under multivariate ANOVA assumptions it can be viewed as
deviation from additivity, and in that context may be valid in describing the complicated
relationships between genomic factors and disease phenotypes. It serves as one of the
inspirations for our method. There is a need to develop powerful methods to identify and
interpret complex genomic architecture of complex disease. Underlying our method is the
canonical correlation function, a linear transformation of variables of interest. It is
commonly used in multivariate statistics to measure correlation between two sets of
random variables. The canonical correlation-based approach we devised is meant to detect
and characterize interactions between genes and other genomic level factors of complex
diseases.
Interaction involving genes and genomic factors plays an important role in the genomic
architecture of diseases and must be characterized if it is to be useful in caring for people’s
health. We propose a method that is meant to detect interaction between genomic
variables under the assumption that the architecture of disease involves linear relationships
between sets of variables because we believe it is ubiquitous in human biology. We provide
an outline of the methodical tool meant for that purpose whose utility includes the
necessary features to detect and characterize interaction in a genomic association study. As
a demonstration of its effectiveness we show simulations and applications to real data and
provide some candidate genes for further investigation.
2. Prior Literature
The classic statistical definition of “interaction” proposed by Fisher (Fisher, 1919) and
developed further by Cockerham (Cockerham, 1954) and Kempthorne (Kempthorne, 1954)
treats the concept as deviation from additive effects. These early researchers argued that if
you consider a genetic trait with two possible alternatives the fact that other factors affect
them to different extent may be regarded as an example of interaction. Deviation from the
additivity of effects may occur between a pair of genomic factors such as SNPs. In the context of
genomics this implies that somatic effects of genomic elements are not additive, and for
this reason the likeness of relatives is partially obfuscated in the statistical aggregate.
Influences of genomic factors on a phenotype are divided for statistical purposes into two
parts, an additive part which reflects the direct effect of genomic elements and gives rise to
the correlations (similarity) among relatives, and a deviation from additive effect which
acts in much the same way as a stochastic variable. Cockerham and Kempthorne extended
this idea by characterizing in greater granularity components of phenotypic variance,
subdividing it into an additive portion resulting from average effects of genes, and
variances attributable to different types of interactions, quantifying the portions
(Cockerham, 1954; Kempthorne, 1954). This way of attributing variability in phenotype to
different sources, including interaction, and computing the variance for each is now termed
variance decomposition, such as the one employed in analysis of variance (ANOVA). The
point is that these were early attempts to lay the groundwork for measuring interaction as
it affects phenotype, and they eventually led to the foundation of the linear model approach.
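As a compact reminder of the decomposition described above, a simplified two-factor formulation in standard textbook notation (the symbols are assumptions and are not taken from the cited papers) can be written as follows, where $\alpha_i$ and $\beta_j$ are the additive effects of the two factors, $(\alpha\beta)_{ij}$ is the interaction effect, and the variance components are assumed independent:

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad
\sigma^2_P = \sigma^2_A + \sigma^2_I + \sigma^2_E
```

Here $\sigma^2_P$ is the phenotypic variance, $\sigma^2_A$ the additive portion, $\sigma^2_I$ the portion attributable to interaction, and $\sigma^2_E$ the residual. In this notation the absence of interaction corresponds to $(\alpha\beta)_{ij} = 0$ for all $i, j$, so that $\sigma^2_I = 0$.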
The shortcomings of the linear model and other parametric statistical methods have driven
the growth of computational approaches such as those from data mining, machine learning,
and neural networks. These analytical methods are conducted under fewer assumptions
about the design and components of the model (Hastie, Tibshirani and Friedman, 2009;
McKinney, et al., 2006; Mitchell, 2011). There is certainly a desire for new methods as
highlighted in recent reviews (Thornton-Wells, Moore and Haines, 2004) and an active
discussion comparing various tactics for detecting statistical interaction (Cordell, 2009;
Motsinger, Ritchie and Reif, 2007). The methods that have been reviewed include
innovative approaches such as combinatorial partitioning (Culverhouse, Klein and
Shannon, 2004; Nelson, et al., 2001), the machine learning technique called random forests
(Bureau, et al., 2005; Lunetta, et al., 2004), and multifactor dimensionality reduction (MDR)
(Moore, 2007).
With traditional methods, brute-force combinatorial assessment of SNPs in a GWAS is not
computationally viable beyond two-way and three-way combinations. Addressing this
problem may require the use of prior
biological and statistical knowledge. Another key challenge to detecting and characterizing
interaction is interpretation. It is a great leap to draw inferences about biological
interactions from a statistical summary of gene-gene or gene-epigenome interaction in a
large representative sample, as it is to draw inferences about public health
from gene networks. Nonetheless, efforts have been made to bridge this divide
(Moore and Williams, 2005).
The early definition of interaction as deviation from additivity in a linear model (Fisher,
1918) plays an important role in the study of genomics because it has a firm theoretical
foundation, it is well understood how to implement it with a wide range of software, and
interpretation is easy. Despite these advantages to using linear models (Cordell, 2002;
Cordell, 2009) they are limited in their ability to explain complicated genomic patterns of
interaction underlying disease (Moore and Williams, 2002). Modeling interactions
requires considering combinations of multiple variables. Modeling
multiple genomic variables simultaneously is challenging because the data are spread sparsely across
many combinations. As such, estimating parameters in a linear model can be
challenging. Linear models are often implemented so that presence of interaction effects is
only considered after independent main effects. This assumes that the predictors will have
main effects. For instance, the focused interaction testing framework (FITF), a powerful
logistic regression approach to detecting interaction presented by Millstein et al. (2006),
conditions on main effects (Millstein, et al., 2006). Often linear models
have greater power to detect main effects than interactions (Lewontin, 2006; Lewontin,
1974; Wahlsten, 1990). In using linear models, we are constrained not by our
understanding of biology but by a statistical apparatus that was not established to test
faithful biological models. As an area of study genomics has preferred Fisher's definition of
interaction and this has led to analytical approaches that are inadequate for modelling real
genomic architecture.
One inspiration for our method is the more powerful approach often conducted in the context
of case-control studies and linear models, the ‘case-only’ analysis (Piegorsch, Weinberg and
Taylor, 1994; Weinberg and Umbach, 2000; Yang, et al., 1999). The case-only approach exploits
the fact that, under certain conditions, an interaction term in the logistic regression
equation corresponds to dependency or correlation between the predictor variables within
the subpopulation of cases. Therefore, a case-only test of interaction can be performed by
testing the null hypothesis that there is no correlation between the pair of predictors in a
sample restricted to cases alone. This test can easily be performed via a variety of methods,
depending on the type of data, including the chi-square test of independence, logistic regression,
multinomial regression, or Pearson’s correlation, among others. The case-only test
assumes that the interacting variables are uncorrelated in the general population, which
can be a problem in some scenarios. This assumption endows
the procedure with higher power compared to logistic case-control analysis (Schmidt and
Schaid, 1999), but it also means that the case-only test is not suitable for pairs of sites on the genome
that are closely linked, in high LD, or correlated for any other reason.
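The case-only idea can be illustrated for two binary factors. The sketch below is an assumption-laden example rather than the procedure of the cited authors: the function name and counts are hypothetical, and the estimate is interpretable as a multiplicative interaction only if the two factors are independent in the source population and the disease is rare.

```python
import math

def case_only_test(case_counts):
    """Case-only test of interaction between two binary factors.

    case_counts maps (g1, g2) -> number of cases with that combination.
    The name and input layout are illustrative assumptions.
    """
    # Odds ratio between the two factors among cases only.
    log_or = math.log(case_counts[(1, 1)] * case_counts[(0, 0)]
                      / (case_counts[(1, 0)] * case_counts[(0, 1)]))
    # Standard error uses the four case cells; no control cells contribute.
    se = math.sqrt(sum(1 / n for n in case_counts.values()))
    z = log_or / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return log_or, se, p_value

# Hypothetical counts of cases for each (g1, g2) combination.
cases = {(0, 0): 10, (1, 0): 20, (0, 1): 30, (1, 1): 120}
log_or, se, p_value = case_only_test(cases)

# For comparison: a case-control analysis with 40 controls per cell (also
# hypothetical) adds four reciprocal control counts to the variance.
case_control_se = math.sqrt(se ** 2 + 4 / 40)
print(f"case-only log OR = {log_or:.3f}, SE = {se:.3f}, p = {p_value:.3f}")
print(f"case-control SE for the same interaction = {case_control_se:.3f}")
```

The smaller standard error of the case-only estimate reflects the gain in power, which is obtained only by taking the independence assumption on trust.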
Since case-only analysis provides a more powerful test for the interaction effect (Piegorsch,
et al., 1994; Weinberg and Umbach, 2000; Yang, et al., 1999), Chapman and Clayton
proposed using a version of the joint test that combines a case-control main effect
component with a case-only interaction component (Chapman and Clayton, 2007). In
studies with environmental factors, correlation and confounding are common. However, in
genetic studies the assumption of independence between unlinked genetic variables is
often fairly reasonable. If we are concerned with pairs of SNPs, they can be very far away
from each other on the same chromosome or on different chromosomes. If other genomic
factors are involved that are not correlated the assumption also holds. A two-stage
procedure is employed sometimes, where a test for correlation between the interacting
factors is conducted, and then based on the result it is determined whether to perform a
case-only or case-control interaction test (Millstein, 2013). However, this procedure has
potential bias (Mukherjee, et al., 2008).
An extension of the regression approach to testing introduced a kernel function into
the random effects of a linear model, as in Larson and Schaid (2013). A
mapping to some Hilbert space is accomplished through the kernel trick, as in Scholkopf
and Smola (2001), which calculates inner products through the
given kernel function and yields a kernel matrix $K_i$. The tests are based on the mixed effects
logistic regression disease model, $\mathrm{logit}(\pi) = X\beta + m_1 + m_2 + m_3$, where
$m_1 \sim N(0, \tau_1^2 K_1)$, $m_2 \sim N(0, \tau_2^2 K_2)$, and $m_3 \sim N(0, \tau_3^2 K_3)$,
and are conducted via score statistics for
marginal, global, and interaction tests. In this case the kernel inner products are
understood as reflecting variability in gene 1 ($m_1$), variability in gene 2 ($m_2$), and
covariance between gene 1 and gene 2 ($m_3$).
Although regression-based tests of interaction would seem most natural given the
definition of interaction as departure from additivity, alternative approaches have been
proposed. For example, Zhao et al. (2006) put forward a test based on the difference
between cases and controls in inter-locus allelic association (Zhao, Jin and Xiong, 2006), an
idea originally suggested by Hoh and Ott (2003) (Hoh and Ott, 2003). These tests rely on
the concept that a contrast in correlation between interacting variables across case-control
status identifies interaction. Hoh and Ott (2003) determined their test to have greater
power than a logistic regression joint test of gene-gene interaction and main effects (Hoh
and Ott, 2003). Mukherjee and Chatterjee proposed an empirical Bayes procedure that uses
a weighted average of the case-control and case-only estimators of interaction, an approach
that exploits the assumption of interacting genes’ independence from the case-only method
while also incorporating controls (Mukherjee, et al., 2008; Mukherjee and Chatterjee,
2008).
Peng et al. (2010) proposed a canonical correlation analysis (CCA) based method
applicable to case-control studies akin to ours (Peng, Zhao and Xue, 2010). The test statistic
is the difference in Fisher-transformed canonical correlations between two genes in cases
vs. controls divided by the square root of its variance, where the variance is estimated using
the bootstrap. The statistic is formed as
$U = (z_D - z_C)/\sqrt{\mathrm{Var}(z_D) + \mathrm{Var}(z_C)}$ and assumed to
be approximately normally distributed under the null. Conceptually, this method is related
to ours as it compares correlation structure in cases and controls. The exact distribution of a
canonical correlation, or of a difference between two, has not been derived, and the accuracy of the
normal approximation has not been characterized. Yuan et al. (2012) (Yuan, et al., 2012)
extended the Peng et al. (2010) (Peng, et al., 2010) method by transforming the data
through a kernel function. This kind of implementation allows measurement of non-linear
association: if correlation exists but is not linear, interaction can still be detected. Kernel
canonical correlation analysis (KCCA) generalizes CCA by first mapping variables $x_i$ and $y_i$
to some Hilbert space (a type of function space), an abstract vector space possessing
the structure of an inner product. Kernel inner product matrices (also known as kernel
Gram matrices) are constructed element-wise, denoted $K_{ij}$, and CCA is then performed on
the images. Analogous to linear CCA, the aim of KCCA is to find canonical vectors in terms of
coefficients $\alpha_i, \beta_i \in \mathbb{R}^m$ as a constrained optimization problem,
$\arg\max_{\alpha_i, \beta_i \in \mathbb{R}^m} [\alpha_i^T K_X K_Y \beta_i]$. Common kernel functions include linear, polynomial, radial
basis function (RBF), sigmoid, identical-by-state and weighted identical-by-state kernels.
The KCCU statistic is constructed as
$U = (kz_D - kz_C)/\sqrt{\mathrm{var}(kz_D) + \mathrm{var}(kz_C)}$ and tested in
reference to the standard normal distribution, $N(0,1)$.
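The linear CCA contrast described above can be sketched as follows. This is an illustrative reading of the statistic as it is summarized here, not the cited authors' code: the function names, the number of bootstrap replicates, the seeds, and the simulated continuous data are all assumptions, and the canonical correlation is computed by the usual QR and SVD route, which requires each centered set to have full column rank.

```python
import numpy as np

def first_canonical_corr(X, Y):
    """Largest canonical correlation between the column sets X and Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the two centered column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx'Qy are the canonical correlations (descending).
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))

def fisher_z(r):
    """Fisher transformation, guarded against r equal to 1."""
    return float(np.arctanh(min(r, 1.0 - 1e-12)))

def bootstrap_var_z(X, Y, rng, n_boot):
    """Bootstrap variance of the Fisher-transformed canonical correlation."""
    n = X.shape[0]
    zs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample individuals with replacement
        zs.append(fisher_z(first_canonical_corr(X[idx], Y[idx])))
    return float(np.var(zs, ddof=1))

def cca_u_statistic(X_case, Y_case, X_ctrl, Y_ctrl, n_boot=200, seed=0):
    """U = (z_D - z_C) / sqrt(Var(z_D) + Var(z_C)), cases (D) vs. controls (C)."""
    rng = np.random.default_rng(seed)
    z_d = fisher_z(first_canonical_corr(X_case, Y_case))
    z_c = fisher_z(first_canonical_corr(X_ctrl, Y_ctrl))
    var_d = bootstrap_var_z(X_case, Y_case, rng, n_boot)
    var_c = bootstrap_var_z(X_ctrl, Y_ctrl, rng, n_boot)
    return (z_d - z_c) / np.sqrt(var_d + var_c)

# Simulated data (hypothetical): the two sets co-vary in cases only.
rng = np.random.default_rng(1)
n = 300
X_case = rng.normal(size=(n, 3))
Y_case = 0.8 * X_case[:, :2] + rng.normal(size=(n, 2))
X_ctrl = rng.normal(size=(n, 3))
Y_ctrl = rng.normal(size=(n, 2))
u = cca_u_statistic(X_case, Y_case, X_ctrl, Y_ctrl)
print(f"U = {u:.2f}")
```

With genotype data coded 0/1/2, bootstrap resamples can produce constant or collinear columns, so a practical implementation would need to handle rank deficiency; that step is omitted from this sketch.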
Another approach that looks at a contrast in correlation across cases and controls uses a
metric that quantifies distance between matrices. Rajapakse et al. (2012) proposed a
quadratic distance-based method (Rajapakse, et al., 2012). The statistic is based on the Nagao
normalized quadratic distance (NQD),
$\delta^2 = \delta(\bar{S}, \bar{T}) = \mathrm{tr}[(\bar{S} - \bar{T}) W^{-1} (\bar{S} - \bar{T}) W^{-1}]$, where
$\bar{S} = \begin{pmatrix} W_{11} & S_{12} \\ S_{21} & W_{22} \end{pmatrix}$,
$\bar{T} = \begin{pmatrix} W_{11} & T_{12} \\ T_{21} & W_{22} \end{pmatrix}$,
$W_{11}$ is the pooled estimate of covariance, and $S$ and $T$
are cases and controls, respectively. The principle behind this is that the difference in
covariance across phenotype levels (cases vs. controls) is measured via distances between
estimates of covariance matrices. Yang et al. (Yang, et al., 2008) proposed a method based
on partitioning of chi-square values that, similar to Zhao et al.’s (2006) test for interaction
between two unlinked loci (Zhao, et al., 2006), also contrasts association between sites
across cases and controls. This method showed greater power than logistic regression
when the factors had no marginal effects.
Information-theoretic or entropy-based approaches for modelling genetic interactions
have also been suggested (Chanda, et al., 2007; Dong, et al., 2008; Kang, et al., 2008; Moore,
et al., 2006). This framework may not offer an advantage over more traditional statistical
approaches as the conditional probability statements utilized by the two tactics are
understood to be equivalent (Zwick, 2004). As an example, the method by J. Li et al. (2015)
(Li, et al., 2015) relies on the idea of information entropy to measure uncertainty. The
greater the uncertainty in variables' distribution, the greater the entropy. A gene-based
information gain method (GBIGM) they suggested is based on the entropy and information
gain theory and uses all SNPs in a gene for detecting GGIs in case-control studies. While
considering two genes, the interaction can be determined by comparing the joint entropy
with individual entropies, which represents genetic contribution to disease by a pair of
genes jointly.
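The comparison of joint and individual entropies can be illustrated with a small example. The sketch below computes the information gain about disease status from each factor alone and from the pair jointly; it is a generic illustration of the entropy idea and not the GBIGM statistic itself, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(y, x):
    """Reduction in the entropy of y after observing x: H(y) - H(y | x)."""
    n = len(y)
    h_conditional = 0.0
    for value, count in Counter(x).items():
        subset = [yi for yi, xi in zip(y, x) if xi == value]
        h_conditional += (count / n) * entropy(subset)
    return entropy(y) - h_conditional

# Toy data (hypothetical): disease status is the exclusive-or of two binary
# factors, so neither factor is informative alone but the pair is.
g1 = [0, 0, 1, 1] * 25
g2 = [0, 1, 0, 1] * 25
y = [a ^ b for a, b in zip(g1, g2)]

ig_g1 = information_gain(y, g1)
ig_g2 = information_gain(y, g2)
ig_joint = information_gain(y, list(zip(g1, g2)))
print(f"IG(g1) = {ig_g1:.3f}, IG(g2) = {ig_g2:.3f}, IG(g1, g2) = {ig_joint:.3f}")
```

The excess of the joint gain over the individual gains is what signals a joint contribution in this toy setting, in the absence of any marginal effect.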
Our method resembles recently published work in some respects and diverges from it in
others. Among the statistics presented above, some are framed in a paradigm
similar to ours, conditioning “predictors” on the “outcome” ($X \mid Y$) and comparing the joint
distribution across phenotype classes. For example, Zhao et al. (2006) derived a test that
conditions inter-locus allelic association on case-control status (Zhao, et al., 2006).
Something comparable is suggested by Hoh and Ott (2003) (Hoh and Ott, 2003). Peng et al.
(2010) proposed a test of the difference in Fisher-transformed canonical correlations
between two sets of SNPs in cases vs. controls (Peng, et al., 2010). Yuan et al. (2012)
(Yuan, et al., 2012) extended the Peng et al. (2010) method (Peng, et al., 2010) by transforming the data
through a kernel function, allowing for measurement of non-linear association.
Rajapakse et al. (2012) proposed a quadratic distance-based method, with a
statistic based on the Nagao normalized quadratic distance (Rajapakse, et al., 2012). Yang et al.
(Yang, et al., 2008) presented a method based on partitioning of chi-square values that,
similar to Zhao et al.’s (2006) test for interaction between two unlinked loci (Zhao, et al.,
2006), also contrasts association between sites across cases and controls. The mentioned
tests require at least some parametric assumptions and/or rely on the central limit theorem,
neither of which is needed in what we propose. The kernel-based methods may have an
advantage in the sense that association between sets of variables, such as SNPs and
methylation loci of interest to us, cannot always be expected to be linear, and kernel
transforms allow for nonlinear correlation.
Our applications focus on asthma. It is a chronic inflammatory disorder of the airways,
characterized by airway hyperresponsiveness and airflow limitation. Asthma prevalence
has increased in recent years (Akinbami, et al., 2012; Bateman, et al., 2008), afflicting 300
million worldwide, and is highly variable across population centers, ranging from 1% to
18% (Asher and Weiland, 1998; Bousquet, 2000; Kroegel, 2009). In the United States more
than 23 million people were diagnosed with asthma as of August 2015, including over 6
million children (Ward, 2013).
Numerous genome-wide association studies (GWAS) have shown a substantial genomic
contribution to the etiology of asthma, with heritability estimates varying between 35%
and 95% (Ober and Yao, 2011; Palmer and Cookson, 2000; Zhang, Moffatt and Cookson,
2012). However, only a fraction of this variation has been explained by specific causal
variants, such as those near Chr17q12-21 (Ferreira, et al., 2011; Halapi, et al., 2010;
Moffatt, et al., 2007; Sleiman, et al., 2008; Tavendale, et al., 2008; Verlaan, et al., 2009).
Epigenetic mechanisms, another source of disease variation, may explain a portion of this
missing heritability (Breton, et al., 2011; Fu, et al., 2012; Reinius, et al., 2013). Epigenomic
alterations, while heritable, may also occur as a response to the endogenous and exogenous
environment, and contribute to asthma pathogenesis (Ferreira, et al., 2010; Reinius, et al.,
2013).
There is evidence to suggest that genetic and epigenetic variation interact synergistically to
affect gene expression (Gutierrez-Arcelus, et al., 2013). In fact, interactions between SNPs
and DNA methylation in the genomic regions of T-helper 2 pathway genes IL4R and others
were found to affect asthma risk (Soto-Ramirez, et al., 2013; Zhang, et al., 2014), possibly
through expression of these genes.
3. Statistical Methodology
In support of our method we argue that interaction can be detected via correlation, where
interaction is seen both as a multiplicative effect, such as that in a logistic model, and as
covariance between effects in a linear model. At this stage we leave that statement vague
and elaborate below. Fisher, Cockerham, and Kempthorne described how phenotype
variance can be divided into components ascribed to marginal effects of genes and due to
their combined activity (epistatic/interaction variance) by quantifying correlation in
phenotype among relatives who have one or both genotypes in common (Fisher, 1918;
Fisher, 1919; Hill and Weir, 2011; Kempthorne, 1954; Silventoinen, et al., 2003; Visscher,
2009; Visscher, Hill and Wray, 2008; Visscher, et al., 2007; Visscher, et al., 2006). In this
way correlation is understood to quantify the degree or probability of commonality in
phenotype induced by genomic factors and their interaction, and in turn allows inference
for how much variability is explained by parts of the genome. If the data consist of
pairs of individuals matched on genotype (siblings with known genotype, for example), then
we can correlate the corresponding pairs of phenotype values; if the values coincide within
matched pairs, the resulting strong correlation indicates that genotype is responsible for a
large portion of the variability in the phenotype. Put more simply, if there is a strong tendency
for two or more people with same genotype to exhibit the same phenotype class then
genetics is highly predictive and therefore explains a great deal of variation in the
phenotype. This idea closely corresponds to a key feature of ANOVA, that the between-
group variance equals the within-group covariance (Anholt and Mackay, 2009; Wang, Xie
and Fisher, 2011). Early genetic studies utilized family groups such as siblings. In that
framework the larger the covariance among members of a family, the larger the fraction of
total variation in the phenotype that is attributed to differences between families’
phenotype means. While using family groups is the classic setting under which this ANOVA
concept is explained, it is applicable to any groups that could be fit to ANOVA, including
groups of cases and controls. The relationship between correlation and variance plays
a role in our procedure and is discussed in the sections below. Our method allows accounting
for correlation between two sets of variables by observing degree of their concordance,
apportionment of variability held in common by two groupings of variables, intersection of
vector spaces in geometric sense. Further, variance measures are assigned to interaction
detected by our method and we determine which specific constituents of the respective
sets are likely involved. Through simulation and application to specific types of data, we
evaluate the effectiveness of our method and characterize the properties of the test.
Application to a data set from the Children’s Health Study (CHS)
identified 10 pairs of candidate genes for conjunct association of pairs of SNP sets with
asthma, and a study of data from a cohort of the Asthma Bio-Repository for Integrative
Genomic Exploration (ABRIDGE) revealed 12 genes engaged in genetic-epigenetic
interaction in relation to asthma. We refer to our method as canonical analysis of set
interactions (CASI).
The method we advance is motivated by the type of data we would like to study. It is
assumed for our purposes that genes viewed as regions of DNA are functional units in an
organism, and are defined a priori based upon some relevant biological criteria, such as
production of certain protein isoforms (Blake, 1979). We argue that what favors our
method most is the line of reasoning that connects work of Fisher, Cockerham, and
Kempthorne to linear model theory, and extends to a more conterminous case-only
procedure for testing interaction. By connecting this line of reasoning to the case-only
method for detecting interaction we posit that within case correlation can be used to detect
multiplicative interaction. The validity of our approach is proven by two types of
simulations. We simulate data based on a contrast in correlation structure between cases
and controls which is used to demonstrate that our test is consistent, power for a fixed non-
null hypothesis increases to one as the sample size increases, and concomitantly sample
size required to reach statistical power of one decreases for larger effect sizes. We also
simulate data by assuming an underlying logistic regression model, thereby demonstrating
our test’s ability to detect multiplicative interaction. Our goal is to address the problem of
testing joint interaction in relation to a dichotomous phenotype. The approach employs
canonical correlation procedure, conditioning joint distribution of two sets of variables on
a phenotype of interest in case-control data. We avoid some of the limitations of
distributional assumptions of conventional regression and parametric set-based
approaches by obviating the premise of a specific statistical model and using a permutation
based false discovery rate (FDR) for assessing statistical significance. The conceptual
precedent for this test is case-only analysis typically used for pairwise interaction.
The principle that underlies our method is that through correlation we can identify
interacting variables. To make our argument we start with linear model notation, with $Y$
the response variable and predictors $X$ as the random effects. We characterize how
variance relates to correlation in linear model context. For a model $Y = \alpha + X\beta + \epsilon$
the multiple correlation of $Y$ and $X' = (X_1, X_2, \ldots, X_p)'$ is the maximum of $\mathrm{Corr}(Y, \alpha + X'\beta)$
over all $\alpha$ and $\beta$. We usually denote the correlation $R$ through $[\mathrm{Corr}(Y, \alpha + X'\beta)]^2 = R^2$,
where $R^2$ is the proportion of variance explained in $Y$ by $X'$. Thus, if the correlation is $R$, we
have accounted by reference to $X'$ for $R^2$ of the variance of $Y$. Our goal is to connect the
concept of allocation of variance (variance decomposition) to correlation, and by
correlation attribute a measure and statistical interpretation to interaction between a pair
of factors. To effectuate this in concrete terms we start by recalling sums of squares in
regression or ANOVA corresponding to each predictor term in a linear model (SSX), all
predictors (SSM), and to the error due to natural variability between individuals (SSE). The
decomposition is expressed as SST = SSM + SSE, where SST is the total sum of squares.
They are sample variances which have corresponding population variances designated as
$\sigma_X^2$, $\sigma_M^2$, $\sigma_E^2$, $\sigma_T^2$, respectively. Correlation can be related to variance as
$\sigma_M^2/(\sigma_M^2 + \sigma_E^2) = \sigma_M^2/\sigma_T^2 = R$, where $\sigma_M^2$ is variability attributed to the model covariates and $\sigma_E^2$ is
measurement error and/or variance due to person differences. If the response variable is
the phenotype then the larger the difference in phenotype with respect to levels of
predictors, the larger the $\sigma_M^2$ relative to $\sigma_E^2$, the stronger the correlation $R$. The same is
achieved if we focus on a model with a single predictor but the notation is modified to
$\sigma_X^2/(\sigma_X^2 + \sigma_E^2) = \sigma_X^2/\sigma_T^2 = R$. In the instance where one-way ANOVA is the model being
applied it is helpful to understand the identity that between-group variance equals the
within-group covariance (Anholt and Mackay, 2009; Wang, et al., 2011). This is expressed as
$\mathrm{Cov}(X_{ij}, X_{i'j})/\sigma_T^2 = \sigma_X^2/\sigma_T^2 = R$ for group $j$ and individuals $i$ and $i'$, and demonstrates that
strength of concordance in phenotype among individuals within group $j$ of predictor $X$
corresponds to magnitude of variability in mean phenotype between groups. This approach
to calculating $R$ relates variance to correlation, and such kinship between variance and
correlation is ubiquitous in statistics. This relationship, which is crucial to how we establish
our method, also holds for canonical correlation and is the basis of our proposed canonical
correlation dependent procedure.
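The identity that between-group variance equals within-group covariance can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes a one-way random-effects model with normally distributed group effects and errors, two members per group, and hypothetical variance components (between-group variance 9, within-group variance 1). The function names are ours.

```python
import random
from statistics import fmean


def covariance(x, y):
    """Covariance as mean product minus product of means."""
    mx, my = fmean(x), fmean(y)
    return fmean(a * b for a, b in zip(x, y)) - mx * my


def simulate_pairs(n_groups, sigma_b, sigma_w, seed=1):
    """Draw two members (i and i') per group j from X_ij = a_j + e_ij."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_groups):
        a_j = rng.gauss(0.0, sigma_b)  # group effect shared by both members
        first.append(a_j + rng.gauss(0.0, sigma_w))
        second.append(a_j + rng.gauss(0.0, sigma_w))
    return first, second


# Hypothetical values: between-group variance 9, within-group variance 1.
x_i, x_iprime = simulate_pairs(n_groups=20000, sigma_b=3.0, sigma_w=1.0)

# Covariance between two members of the same group estimates sigma_B^2.
within_group_cov = covariance(x_i, x_iprime)
# Average of the two marginal variances estimates sigma_T^2.
total_var = 0.5 * (covariance(x_i, x_i) + covariance(x_iprime, x_iprime))
# Ratio estimates sigma_B^2 / sigma_T^2.
ratio = within_group_cov / total_var

print(f"within-group covariance: {within_group_cov:.2f}")
print(f"total variance:          {total_var:.2f}")
print(f"ratio:                   {ratio:.3f}")
```

With these settings the estimated within-group covariance should be close to the between-group variance, and the ratio close to the share of total variance that lies between groups.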
There are two motivations for our method. One utilizes the multivariate ANOVA model to
show that covariance corresponds to how we understand biological interaction, as
reactions or interdependence between genomic factors (Bateson, 2002; Moore and
Williams, 2009; Wilson, 1902). The other is the well-established case-only approach to
detecting multiplicative interaction. Our argument starts by referencing work in genetics
where studies of disease considered correlation in phenotype among siblings in a family
(Cockerham, 1954; Fisher, 1918; Fisher, 1919; Kempthorne, 1954). Using the above
mentioned ANOVA identity, that between-group variance equals the within-group
covariance (Anholt and Mackay, 2009; Wang, et al., 2011), within group phenotype
covariance for siblings was used to estimate association between genetics and disease.
Each family/siblings constituted a group. Strength of correlation within families, within-
group covariance, reflected the magnitude of difference between families, between-group
variance in phenotype means. Large differences, sizable variance, in mean phenotype
across families indicated strong relationship between genetics and phenotype. Inspired by
their work we attempt something in kind. Our groupings will be formed as cases and
controls rather than families, and within-group covariance will be calculated as within case
covariance.
3.1. Correspondence of within phenotype class covariance and
between phenotype class variance
To demonstrate how covariance can be interpreted as interaction we start with the
premise that if we invert the traditional roles of the genomic and case-control status
variables in a linear model we can proceed to use the ANOVA identity that between-group
variance equals the within-group covariance (Anholt and Mackay, 2009; Wang, et al., 2011).
Then we lay out how the within phenotype class covariance, a proxy for correlation, can be
used to quantify variation in the mean of a genomic variable between two phenotype
classes, case and control. This univariate model is then extended to a multivariate setting
where we define a contrast within an independent sampling unit, an individual, for a sum
of two genomic variables. By this we demonstrate that within-phenotype class (within
group) covariance of a combination of two genomic variables expressed via sum is equal to
a sum of between-group variances plus a deviation in the form of covariance, interpreted as
conjoint effect or interaction. The line of reasoning is then extended to the case-only
method where we posit that within case correlation can be used to detect interaction.
In context of early studies in genomics where families constituted different groupings, like
levels of a predictor variable in ANOVA model, measures of phenotypic resemblance among
siblings (correlation among siblings) were used to express the relative magnitudes of
components of phenotypic variance (variance decomposition) (Cockerham, 1954; Fisher,
1918; Kempthorne, 1954). In context of ANOVA this means that we estimate between
predictor level variance by observing the covariance among siblings and partition the total
outcome variance into between predictor levels variance, $\sigma_B^2$, and within predictor levels
variance, $\sigma_W^2$. The relationship between variance and correlation is usually expressed as
$\sigma_B^2/(\sigma_B^2 + \sigma_W^2) = R$, where $R^2$ is the proportion of variance explained by the between
predictor level variability. A property of ANOVA is that any between-group variance equals
the covariance of members within groups (Anholt and Mackay, 2009; Wang, et al., 2011).
Thus, if we take the siblings within a family example, the larger the covariance among
siblings in a family (group) the larger is the fraction, $\sigma_B^2$, of the total variance, $\sigma_B^2 + \sigma_W^2$, that
is attributed to differences between family means. A measure of phenotypic resemblance
among siblings can be considered equivalent to a measure of the similarity among siblings
of the same family for a phenotype compared to random members of a population, or the
same as a measure of the magnitude of difference in phenotype between different families.
The larger the variance between groups of siblings, the greater the difference between
group means, and the more similar are the phenotypes of members of a group.
Before we establish the relationship between correlation and interaction we will show,
using ANOVA, how covariance is used to identify differences in the genome between two
groupings, cases and controls, for a dichotomous disease phenotype, and we start with an
argument that references the above between families formulation. With a dichotomous
disease phenotype ($D$), we aim to find a genomic factor ($G$) with values that are decidedly
different in one class compared to the other. This is most clearly parameterized using
ANOVA. For the sake of argument, consider reversing the traditional roles of predictors ($G$)
and outcome ($D$) in the model. Designate $D$ and $G$ as in the model $G_{ij} = \alpha + D_j + \epsilon_{ij}$, where
$G_{ij}$ is the $i$th observation in the $j$th phenotype class, $\alpha$ is an unobserved overall mean, $D_j$ is
treated as a random effect in the model and shared by all values in class $j$, and $\epsilon_{ij}$ is an
unobserved noise term. The flow of causality, while seemingly violated, is not detrimental
to how the result is understood. A substantial variation in the genomic variable (designated
as the $G$ in model $G_{ij} = \alpha + D_j + \epsilon_{ij}$) with respect to case-control status (designated as the
$D$ in the same model $G_{ij} = \alpha + D_j + \epsilon_{ij}$) would mean that a meaningful association exists
between the disease case status variable, $D$, and genomic variable, $G$. That is, the case-control
status explains a proportion of variance in genomic variable.
The genomic values are grouped based on being measures on individuals in the same
phenotype class rather than phenotypic observations on siblings in a family as was studied
by Fisher and other early researchers in context of intra-class correlation. The total
variance is the sum of the between and within group variances, $\sigma_G^2 = \sigma_T^2 = \sigma_D^2 + \sigma_{G|D}^2$,
where the groups are cases and controls. To understand how this concept relates to
correlation we again reference the fact that between-group variance equals the within-group
covariance (Anholt and Mackay, 2009; Wang, et al., 2011). The genomic variable ($G$) is
treated as if it were a trait value we are trying to identify based on case-control status.
Using the identity $\sigma_D^2/\sigma_G^2 = \sigma_D^2/\sigma_T^2 = R$ and with individuals grouped into either cases or
controls we know that $\mathrm{Cov}(G \mid D_j) = \sigma_D^2 = R \times \sigma_T^2 = R \times \sigma_G^2$, where $\sigma_D^2$ is the between
disease state variance, $\sigma_T^2$ is total variance, which is the same as total variance in the
genomic variable, $\sigma_G^2$. We have used the above-mentioned ANOVA identity that the
between-group variance, $\sigma_D^2$, equals the within-group covariance, $\mathrm{Cov}(G \mid D_j)$ (here, the
covariance among genomic values within group of case and group of controls). If
$\mathrm{Cov}(G \mid D_j)$ is strong then $\sigma_D^2$ is high, which implies that between disease state variance is
high relative to within group variance, $\sigma_{G|D}^2$. Here the genomic values of individuals within
the same disease state are more closely correlated than they are with those from other
states. Disease state could simply mean presence of disease, absence of disease, or a
different level of severity. In case-control studies differences in genomics are observed in
groups of cases and controls. Correlation among genomic values is high within cases but
lower between values in cases and values in controls. To reiterate, on average correlation is
high within each group but not between them. This translates into large differences in
genomics between case and control individuals. The implication here is that it is the
observed correlation among genomic variables within cases and controls that statistically
quantifies the variability in the phenotype. Our claim is that we can identify an association
between genomic variables and a dichotomous phenotype by measuring correlation among
genomic variable values within a phenotype class, cases in particular. The reason we focus
on the cases is that we seek to marry this idea with a well-established case-only method,
which has been proven to be a powerful test for detecting multiplicative interaction
(Cordell, 2009; Gatto, et al., 2004; Piegorsch, et al., 1994; Yang, et al., 1999). As will be
shown below the statistic we developed is formed on the basis of correlation among
variables within the subset of case individuals which is then compared to the correlation
structure in controls.
A strong correlation within the cases is evidence of association under the ANOVA premise
and the motivation for concluding presence of interaction is the case-only quality of the
method. We are familiar with covariance as a measure of the association between two
variables, $X$ and $Y$ for example, defined as the mean product minus the product of means,
$\mathrm{cov}_{XY} = \overline{XY} - \bar{X} \times \bar{Y}$. Defined in this way, the variance is covariance of a variable with itself,
$\sigma_X^2 = \overline{XX} - \bar{X} \times \bar{X}$. Considering this definition, it is not a leap to think of covariance between two
types of variables within phenotype class rather than one. We extend this into a
multivariate setting by studying covariance between two sets of variables through linear
combinations, essentially correlating two variables. This point is useful in expanding
univariate setting where one type of genomic variable is assumed to where two types of
genomic variables are tested for interaction. In fact, our goal is to have a test that identifies
interaction between two sets of variables. If we consider strong correlation among genomic
variables of one type to be effective in identifying significant differences with respect to
case vs. controls status, then correlation between two types of variables can be used in the
same manner.
Using ANOVA as an underlying population model, Figure 1 illustrates through simulation the
stark contrast between a condition where between phenotype class variability is high, or
equivalently within class correlation is high, $\sigma_B^2/(\sigma_B^2 + \sigma_W^2) = 0.9$, and a population where
within class correlation is low, $\sigma_B^2/(\sigma_B^2 + \sigma_W^2) = 0.2$, or comparable to the general populace.
If we consider presence of two types of genomic variables as having an additive effect on
the disease status of individuals with a linear model assumed, then the same reasoning
applied above to a single genomic variable can be applied to two. We use the argument to
identify how a difference in case-control status relates to variability in the linear
combination of genomic variables. The setting corresponds to a special case of multivariate
analysis of variance (MANOVA) test of a contrast within independent sampling unit, a
person for example, conducted by computing a linear combination, a simple sum in this
case, of model means. The model is
$$[G_{1ij} \;\; G_{2ij}] = [1 \;\; 1] \begin{bmatrix} \mu_1 & \mu_2 \\ \alpha_{1j} & \alpha_{2j} \end{bmatrix} + [\epsilon_{1ij} \;\; \epsilon_{2ij}],$$
where $\mu_1$ and $\mu_2$ represent the genomic variables’ grand means across all the individuals in the
sample, $\alpha_{1j}$ and $\alpha_{2j}$ are random effects for their respective genomic variables from the
group means, and $\epsilon_{1ij}$ and $\epsilon_{2ij}$ are random individual within group variations around their
respective means. To characterize the joint association between the two genomic variables
and disease state we observe variance and covariance properties of their sum. The matrix that
defines the appropriate contrast is included in the model as follows:
$$G_{1ij} + G_{2ij} = [\mu_1 + \alpha_{1j} \;\; \mu_2 + \alpha_{2j}] \begin{bmatrix} 1 \\ 1 \end{bmatrix} + [\epsilon_{1ij} \;\; \epsilon_{2ij}] \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \mu_1 + \alpha_{1j} + \mu_2 + \alpha_{2j} + \epsilon_{1ij} + \epsilon_{2ij}$$
Figure 1. Conceptual illustration showing distribution of genomic variables for case and control
individuals from a population in which the within class correlation between genomic variable
values is 0.9 and population in which it is 0.2.
Joint contribution of two variables construed mathematically as their sum has total
variability computed as $\mathrm{Var}(G_{1ij} + G_{2ij}) = \mathrm{Var}(\alpha_{1j}) + \mathrm{Var}(\alpha_{2j}) + 2\,\mathrm{Cov}(\alpha_{1j}, \alpha_{2j})$.
Equivalently, within phenotype class covariance,
$$\begin{aligned}
\mathrm{Cov}(G_{1ij} + G_{2ij},\; G_{1i'j} + G_{2i'j})
&= \mathrm{Cov}(\mu_1 + \alpha_{1j} + \mu_2 + \alpha_{2j} + \epsilon_{1ij} + \epsilon_{2ij},\; \mu_1 + \alpha_{1j} + \mu_2 + \alpha_{2j} + \epsilon_{1i'j} + \epsilon_{2i'j}) \\
&= \mathrm{Cov}(\alpha_{1j}, \alpha_{1j}) + \mathrm{Cov}(\alpha_{2j}, \alpha_{2j}) + 2\,\mathrm{Cov}(\alpha_{1j}, \alpha_{2j}) \\
&= \mathrm{Var}(\alpha_{1j}) + \mathrm{Var}(\alpha_{2j}) + 2\,\mathrm{Cov}(\alpha_{1j}, \alpha_{2j}),
\end{aligned}$$
for the $i$th and $i'$th observations in the $j$th phenotype class. The result is analogous to that obtained
by univariate ANOVA, within class covariance is the same as the sum of between class
variances for the genomic variables in the model plus twice their covariance. A reasonable
interpretation of this result is that variance of a sum of two variables is decomposed into
their respective between-group variances plus the covariance between the two case-
control variables. The total variability in the two genomic variables is equal to the sum plus
a deviation from additivity. In absence of strong main effects, small variances, it is their
covariance that could be considered the culprit in the disease. Differences in genomic
variables’ means separately across case-control status may be insubstantial but their joint
effect, their interaction as would be observed with a large covariance term, may account for
great deal of total variability. The large variability across phenotype classes in the
combined mean will occur due to their joint action, their covariance. As in the univariate
setting magnitude of the within-class covariance reflects the extent of differences in the
genome variable between cases and controls. We attempt to demonstrate that underlying
our method is the idea that addresses the conceptual challenge of how to define the
interactions in biological sense, as in gene-gene interactions for example.
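The decomposition of the within-class covariance of a sum into two between-class variances plus twice their covariance can be illustrated by simulation. This is a sketch under assumptions of our own choosing rather than a reproduction of the analyses reported here: class effects and errors are normal, grand means are set to zero (constants do not affect covariance), many classes are drawn so that the random class effects can be estimated, and the variance components (0.25 each, covariance 0.2) are hypothetical.

```python
import random
from statistics import fmean


def covariance(x, y):
    """Covariance as mean product minus product of means."""
    mx, my = fmean(x), fmean(y)
    return fmean(a * b for a, b in zip(x, y)) - mx * my


rng = random.Random(7)
alpha1, alpha2 = [], []        # class effects for the two genomic variables
sum_i, sum_iprime = [], []     # G1 + G2 for two individuals in the same class

for _ in range(50000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    a1 = 0.5 * z1                      # Var(a1) = 0.25
    a2 = 0.5 * (0.8 * z1 + 0.6 * z2)   # Var(a2) = 0.25, Cov(a1, a2) = 0.2
    alpha1.append(a1)
    alpha2.append(a2)
    # Individuals i and i' share the class effects but have independent errors.
    sum_i.append(a1 + a2 + rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0))
    sum_iprime.append(a1 + a2 + rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0))

within_class_cov = covariance(sum_i, sum_iprime)
main_effects = covariance(alpha1, alpha1) + covariance(alpha2, alpha2)
interaction_part = 2 * covariance(alpha1, alpha2)

print(f"within-class covariance of sums: {within_class_cov:.3f}")
print(f"sum of between-class variances:  {main_effects:.3f}")
print(f"twice the covariance:            {interaction_part:.3f}")
```

In this configuration the covariance term contributes almost as much as the two variances combined, which is the situation described above in which modest main effects coexist with a sizable joint effect.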
We would like to connect this notion of covariance within phenotype class as interaction to
the other motivating idea for our method, the case-only approach to detecting interaction
(Kraft, et al., 2007). In explanations of the case-only method such as in its use for detecting
gene-gene (Yang, et al., 1999) and gene-environment (Murcray, Lewinger and Gauderman,
2009) interactions it is viewed through a regression lens but the idea can be equivalently
construed through correlation. As Fisher and his followers described a way to explain
phenotype through correlation we attempt something consonant. The case-only procedure
reveals presence of multiplicative interaction if an association between the two predictors
can be established within a subset of cases. Consider a pair of variables measured on each
case individual. Their high correlation, $R$, implies that a large proportion of variation in one variable, $R^2$,
is explained by variation in the other, meaning that the total common variance, $R^2 \times \sigma^2$,
in the two variables is high. This in turn implies that having a disease case status is
consistent with having a strong linear relationship between those two measured variables,
they are predictive of disease state.
Extending the case-only method to a setting with sets of variables we aim to demonstrate
that if correlation between interacting factors is strong among cases then this can be
understood as variability held in common by two groupings of variables, intersection of
vector spaces in geometric sense, is high. Where Fisher and others studied correlation in
phenotype between relatives to understand variability accounted by one set of predictors
we study correlation between two sets of predictors to learn whether interaction exists in
relation to a phenotype. In that sense we reverse flow of logic, instead of inferring portion
of variance in phenotype attributable to the genomic factors by learning about phenotype
correlation among individuals we correlate genomic factors within a specific phenotype
class (the disease cases). If there is a tendency for a specific, strong linear correlation
pattern between these genomic factors to emerge among disease cases then this is
evidence that interaction is highly predictive of disease, and a large portion of variability
held in common by variable sets is explained by reference to disease. Stated more plainly, if
individuals with the same phenotype class, disease cases, exhibit the same relationship
between sets of genomic factors then such an association is responsible for large portion of
variability observed in the phenotype. For clarity we point out how the mathematical
machinery switches the roles of variables we would like to relate; we hold phenotype fixed
and calculate variability in the genome variables, essentially predicting strength of
correlation in genomic variables with phenotype. However, we know that the cause-effect
relationship for which we are searching is genomic variables are the cause and phenotype
the outcome. But mathematics is indifferent to that interpretation, direction of the effect,
and while it may seem that we are predicting genomics using phenotype, which we hold
fixed, while treating genomic variables as random, we are doing the reverse.
3.2. Case-Only Analysis
Our method is intended to be applicable to pairs of sets of variables of any type, although our
particular application is to pairs of genomic variables. For consistency, we denote the
interacting variables in this section and the next by G1 and G2. In practice we
would like to demonstrate interaction between two variables and generalize to sets. Case-only is
germane to our task because the main goal is to devise a statistical procedure that detects
interaction between a pair of variables in their effect on a dichotomous phenotype, and the
case-only approach addresses the problem of adequate power and/or small effect size
(Cordell, 2009; Gatto, et al., 2004; Piegorsch, et al., 1994). Piegorsch et al. (1994) argued that the
interaction parameter can be estimated with greater precision (that is, lower variance) using
cases only (for a rare disease), under the assumption that the interacting variables are
independently distributed in the population under study. This is explained with a logit
model, relevant to justifying our method, including in forming a simulation study. The
greater power is owed to the variance of maximum likelihood estimate (MLE) of parameter
for interaction being smaller. Case-only has been shown to be effective under the condition
of binary interacting variables but we would like to demonstrate its relevance to most
other types as well. Underpinnings of the case-only procedure can be a source of guidance
for formulating a test statistic that principally utilizes information in the case data.
Furthermore, it is of overriding interest to extend conceptual underpinnings of this
approach to pair of sets of variables. As we develop our method an overview of
implications of the familiar case-only approach is presented first, as it is the primary
motivation for the proposed test. As already stated, while there may not be a consensus
regarding how interaction is defined in statistics in general, model specific definitions are
well established. For example, interaction in linear models is defined as one variable
serving as an effect modifier of the other or as deviation from additive effects, and the case-
only approach has been widely applied to detect such multiplicative interaction.
While the assumption of logistic regression is often used to justify case-only method as in
Piegorsch et al. (1994) it’s not required to establish its efficacy. The case-only approach can
be justified in terms of departure from multiplicativity of risk ratios (Yang, et al., 1999).
Under assumption of no linkage disequilibrium (LD) and independence between $G_1$ and $G_2$
this is demonstrated with relation
$$\begin{aligned}
OR_{g_1 g_2 \mid case} &= \frac{p_{g_1 g_2} / p_{(g_1-1) g_2}}{p_{g_1 (g_2-1)} / p_{(g_1-1)(g_2-1)}} \\
&= \frac{P_{g_1} \times P_{g_2} \times N \times P(D=1 \mid g_1, g_2) \times P_{(g_1-1)} \times P_{(g_2-1)} \times N \times P(D=1 \mid g_1-1, g_2-1)}{P_{g_1} \times P_{(g_2-1)} \times N \times P(D=1 \mid g_1, g_2-1) \times P_{(g_1-1)} \times P_{g_2} \times N \times P(D=1 \mid g_1-1, g_2)} \\
&= \frac{P(D=1 \mid g_1, g_2) \times P(D=1 \mid g_1-1, g_2-1)}{P(D=1 \mid g_1, g_2-1) \times P(D=1 \mid g_1-1, g_2)} \\
&= \frac{P(D=1 \mid G_1=g_1, G_2=g_2)}{P(D=1 \mid G_1=g_1, G_2=g_2-1)} \Big/ \frac{P(D=1 \mid G_1=g_1-1, G_2=g_2)}{P(D=1 \mid G_1=g_1-1, G_2=g_2-1)} \\
&= \frac{\text{risk ratio for unit change in } G_2 \text{ within } g_1 \text{ stratum}}{\text{risk ratio for unit change in } G_2 \text{ within } (g_1-1) \text{ stratum}} = RR_I
\end{aligned}$$
where $p_{g_1 g_2} = P(G_1 = g_1, G_2 = g_2 \mid D = 1)$, $N$ is the population size, $P_{g_1}$ and $P_{g_2}$ are
marginal proportions of population with genomic values $g_1$ and $g_2$, respectively, and
$P(D = 1 \mid G_1 = g_1, G_2 = g_2)$ is the disease risk associated with variable values $g_1$ and $g_2$.
The formulation for $RR_I$ as the ratio of relative risks is the definition of relative risk for
interaction. It embodies the multiplicative model used for investigating disease etiology,
interpreted as ratio of relative risk of disease associated with unit increase in $G_2$ at level
$G_1 = g_1$ to relative risk of disease associated with unit increase in $G_2$ at level $G_1 = g_1 - 1$.
Under certain simple assumptions that don’t include a model we established that within
case odds ratio is equal to relative risk for interaction.
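The equality between the within-case odds ratio and the relative risk for interaction can be verified with exact arithmetic on a small example. The sketch below assumes two binary factors that are independent in the source population; the marginal frequencies, the baseline risk of 0.01, and the risk multipliers are hypothetical values chosen for illustration, and the function names are ours.

```python
def case_only_odds_ratio(risk, freq1, freq2, g1, g2):
    """Odds ratio between G1 and G2 among cases.

    Under independence, the expected share of cases with (G1=a, G2=b) is
    proportional to P(G1=a) * P(G2=b) * P(D=1 | a, b); the population size
    and P(D=1) cancel in the ratio.
    """
    def cases(a, b):
        return freq1[a] * freq2[b] * risk(a, b)

    return (cases(g1, g2) / cases(g1 - 1, g2)) / (
        cases(g1, g2 - 1) / cases(g1 - 1, g2 - 1)
    )


def rr_interaction(risk, g1, g2):
    """Ratio of relative risks for a unit change in G2 across G1 strata."""
    return (risk(g1, g2) / risk(g1, g2 - 1)) / (
        risk(g1 - 1, g2) / risk(g1 - 1, g2 - 1)
    )


def example_risk(a, b):
    """Hypothetical risk surface: baseline 0.01, main effects 1.5 and 2, interaction 3."""
    return 0.01 * (1.5 ** a) * (2.0 ** b) * (3.0 ** (a * b))


freq_g1 = {0: 0.7, 1: 0.3}  # hypothetical marginal frequencies of G1
freq_g2 = {0: 0.6, 1: 0.4}  # hypothetical marginal frequencies of G2

or_case = case_only_odds_ratio(example_risk, freq_g1, freq_g2, g1=1, g2=1)
rr_i = rr_interaction(example_risk, g1=1, g2=1)
print(f"case-only OR = {or_case:.6f}, RR_I = {rr_i:.6f}")
```

The marginal frequencies cancel only because the two factors are taken to be independent; if `freq1` and `freq2` were replaced by a joint distribution with association between the factors, the two quantities would no longer agree.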
Another way to demonstrate the relationship between relative risk and within case odds
ratio is to explicitly exploit the assumption that joint relative risk is the same as the product
of relative risks associated with each genomic variable separately. Consider the definition
for interaction relative risk taken from above from which we can define
$RR_{XY} = \frac{P(D=1 \mid X, Y)}{P(D=1 \mid (g_1-1), (g_2-1))}$, which gives us the relation
$$\begin{aligned}
\frac{RR_{g_1 g_2}}{RR_{g_1 (g_2-1)} \times RR_{(g_1-1) g_2}}
&= \frac{P(D=1 \mid g_1, g_2)}{P(D=1 \mid (g_1-1), (g_2-1))} \times \frac{P(D=1 \mid (g_1-1), (g_2-1))}{P(D=1 \mid g_1, (g_2-1))} \times \frac{P(D=1 \mid (g_1-1), (g_2-1))}{P(D=1 \mid (g_1-1), g_2)} \\
&= \frac{P(D=1 \mid g_1, g_2) \times P(D=1 \mid (g_1-1), (g_2-1))}{P(D=1 \mid g_1, (g_2-1)) \times P(D=1 \mid (g_1-1), g_2)} = OR_{g_1 g_2 \mid case}.
\end{aligned}$$
When the case-only odds ratio, $OR_{g_1 g_2 \mid case}$, departs from unity the $G_1$ and $G_2$ specific disease rates do
not conform to a multiplicative notion of joint effect, a deviation interpreted as interaction.
Thus, the case-only approach provides an estimate of the ratio of the joint effect ($RR_{g_1 g_2}$)
divided by the product of the individual effects of $G_1$ and $G_2$ independently, $RR_{g_1 (g_2-1)}$ and
$RR_{(g_1-1) g_2}$, and can be regarded as effect measure of $G_1 \times G_2$ interaction.
On the basis of definition for $RR_I$ above as ratio of relative risks it has been shown in prior
work that in context of a logistic regression disease model odds ratio for interacting
variables calculated using cases only, $OR_{g_1 g_2 \mid case}$, can be related to odds ratio for
interaction, $OR_I$ (Khoury and Flanders, 1996; Piegorsch, et al., 1994; Umbach and
Weinberg, 1997). This is shown as
$$OR_{g_1 g_2 \mid case} = \frac{p_{g_1 g_2} / p_{(g_1-1) g_2}}{p_{g_1 (g_2-1)} / p_{(g_1-1)(g_2-1)}} = RR_I \times OR_{g_1 g_2} \approx RR_I,$$
where $p_{g_1 g_2}$ and $RR_I$ are defined above. The approximation, $RR_I \times OR_{g_1 g_2} \approx RR_I$, is
due to the assumption that $G_1$ and $G_2$ are independent in the source population, which
implies $OR_{g_1 g_2} \approx 1$, indicating no association between $G_1$ and $G_2$. Odds ratio and
relative risk are known to be related as
$$OR_I = RR_I \times \frac{P(D=0 \mid G_1=g_1, G_2=g_2-1)}{P(D=0 \mid G_1=g_1-1, G_2=g_2-1)} \Big/ \frac{P(D=0 \mid G_1=g_1, G_2=g_2)}{P(D=0 \mid G_1=g_1-1, G_2=g_2)}.$$
The second term on the right side of the equation is approximately 1 if
the disease risk is small at all levels of both study variables, which requires a low risk for
carriers of the susceptibility variant. We are left with the sought-after relation,
$$OR_{g_1 g_2 \mid case} = \frac{p_{g_1 g_2} / p_{(g_1-1) g_2}}{p_{g_1 (g_2-1)} / p_{(g_1-1)(g_2-1)}} \approx OR_I.$$
In a traditional case-control study, multiplicative interaction is measured as the ratio of
two odds ratios, $OR_I$. By appealing to the logistic regression framework, we have
$$\log\left[\frac{P(D=1 \mid G_1=g_1, G_2=g_2)}{1 - P(D=1 \mid G_1=g_1, G_2=g_2)}\right] = \beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 g_1 \times g_2,$$
and, rearranging the equation,
$$P(D=1 \mid G_1=g_1, G_2=g_2) = \frac{\exp(\beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 g_1 \times g_2)}{1 + \exp(\beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 g_1 \times g_2)}.$$
Define
$$p_{g_1 g_2} = P(G_1=g_1, G_2=g_2 \mid D=1) = \frac{P(D=1 \mid G_1=g_1, G_2=g_2) \times P(G_1=g_1)\,P(G_2=g_2)}{P(D=1)},$$
where the joint probability of $G_1$ and $G_2$ is $P(G_1=g_1)\,P(G_2=g_2)$ due to the assumption of
independence of the interacting variables. Interaction for a 1-unit difference is estimated as a
ratio of odds ratios with $OR_I = \exp(\beta_3)$. We are left with an expression showing estimation
of $OR_I = \exp(\beta_3)$ based on observations in the data that are "cases", $D=1$:
$$OR_I \approx OR_{g_1 g_2 \mid case} = \frac{p_{g_1 g_2} / p_{(g_1-1) g_2}}{p_{g_1 (g_2-1)} / p_{(g_1-1)(g_2-1)}} = \exp(\beta_3) \times \frac{1 + \exp[\beta_0 + \beta_1 (g_1-1) + \beta_2 g_2 + \beta_3 (g_1-1) g_2]}{1 + \exp[\beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 g_1 g_2]} \times \frac{1 + \exp[\beta_0 + \beta_1 (g_1-1) + \beta_2 (g_2-1) + \beta_3 (g_1-1)(g_2-1)]}{1 + \exp[\beta_0 + \beta_1 g_1 + \beta_2 (g_2-1) + \beta_3 g_1 (g_2-1)]}.$$
The interaction odds ratio estimated from
case-only data is equal to $\exp(\beta_3)$, the true $OR_I$, times a factor that is very close to 1 if
the population disease risk is low, $\beta_0 \ll 0$. If the population disease risk is larger ($\beta_0$ is
less negative), then the bias factor is smaller than 1. Additionally, when both the $G_1$ and $G_2$
main effects increase disease risk (i.e., $\beta_1 > 0$, $\beta_2 > 0$) a bias is introduced, which is
discussed extensively in Schmidt and Schaid (1999). Assuming the
previously stated conditions are satisfied (independence between $G_1$ and $G_2$ and low overall
disease risk), multiplicative interaction can be approximated with an odds ratio
($OR_{g_1 g_2 \mid case}$) for association between the interacting variables calculated using only the
cases, $D=1$. For a simulation study, $\beta_1$, $\beta_2$, and $\beta_3$ could be chosen to reflect plausible effect
sizes (e.g., $\beta_1$ might be assumed to be fairly large to model $G$ variables with strong association). In the
interaction presented below we assume a milieu where main effects are quite small
($\beta_1 = \beta_2 = \ln(1.025) \approx 0.025$) and most disease risk is due to interaction (with
possible range $\beta_3 \in [\ln(1.1), \ln(1.3)] = [0.09, 0.26]$). The aim is to determine if
multiplicative interaction in the context of logistic regression can be detected with our method.
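As a numerical illustration of the rare-disease condition, the short sketch below computes the exact case-only odds ratio implied by the logistic model for two binary variables that are independent in the source population, and compares it with $\exp(\beta_3)$. The function names, the marginal frequencies `q1` and `q2`, and the two intercept values are assumptions chosen only for illustration; they are not taken from the simulation study.

```python
import math

def disease_prob(g1, g2, b0, b1, b2, b3):
    """Logistic disease model P(D=1 | G1=g1, G2=g2)."""
    eta = b0 + b1 * g1 + b2 * g2 + b3 * g1 * g2
    return 1.0 / (1.0 + math.exp(-eta))

def case_only_or(b0, b1, b2, b3, q1=0.3, q2=0.4):
    """Exact case-only odds ratio for binary G1 and G2 that are independent
    in the source population, with P(G1=1)=q1 and P(G2=1)=q2 (illustrative)."""
    def p_case(g1, g2):
        # Proportional to P(G1=g1, G2=g2 | D=1); P(D=1) cancels in the ratio.
        pg1 = q1 if g1 == 1 else 1.0 - q1
        pg2 = q2 if g2 == 1 else 1.0 - q2
        return disease_prob(g1, g2, b0, b1, b2, b3) * pg1 * pg2
    return (p_case(1, 1) * p_case(0, 0)) / (p_case(1, 0) * p_case(0, 1))

b1 = b2 = math.log(1.025)   # small main effects
b3 = math.log(1.3)          # interaction odds ratio of 1.3
rare = case_only_or(-6.0, b1, b2, b3)     # low baseline risk
common = case_only_or(-1.0, b1, b2, b3)   # high baseline risk
print(rare, common, math.exp(b3))
```

With a strongly negative intercept the case-only odds ratio is nearly identical to $\exp(\beta_3)$, whereas a less negative intercept pulls it below the true interaction odds ratio.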
3.3. Conceptual Correspondence between Correlation and Odds
Ratio or Relative Risk
Imagine that in measuring the $G_1$ (SNP) or $G_2$ (methylation) variable for a given individual, we
determine its value relative to multiple thresholds, assumed to exist naturally in the
biological system. Depending on where the latent variable level falls among the thresholds,
a number on the discrete scale is assigned. Figures 2 and 3 portray the intersections of two
hypothetical variables. For every set of 4 adjacent intersections we imagine 4 regions of
density mapped to 4 points, each with an associated probability. As an illustration of
the idea, consider one set of intersections in Figure 3. Depicted is an ellipse denoting the
overlay of a bivariate normal distribution of latent variables. Probabilities $P\{1, 0.8\}$, $P\{2, 0.8\}$,
$P\{1, 0.9\}$, and $P\{2, 0.9\}$ denote the proportions that fall in each region defined by the
thresholds and mapped to 4 points. For example, $P\{1, 0.8\}$ is the proportion below the
horizontal threshold for SNP and below the vertical threshold for methylation, and is mapped
to the point at SNP=1, Methylation=0.8.
Among cases, an association between $G_1$ and $G_2$ (SNP and
methylation) for a specific quartet of variable values depicted in Figures 2 and 3 can be
measured by the within-case odds ratio, $OR_{gm \mid case}$, which can in turn be used to approximate
the odds ratio for multiplicative interaction, $OR_I$. However, this is a theoretical exercise if the data
are not in a form that can be used to estimate an odds ratio. If the variables are dichotomous
we could calculate a cross-product ratio or fit a logistic model to estimate interaction. For
example, if we coded $G_1 \in \{0, 1\}$ then an odds ratio for interaction can be estimated based
on $\widehat{OR}_{g_1 g_2 \mid case} = \exp(\beta_1)$ from fitting $\mathrm{logit}[P(G_1 = 1)] = \beta_0 + \beta_1 G_2$ to the data.
Figure 2. Notional grid of intersections between two discretized variables with overlaid ellipse
representing the latent bivariate normal distribution. An illustration of the domain of the standard normal
density function being discretized by dashed lines into a 2 × 2 contingency table.
Figure 3. Notional grid of intersections between two discretized variables with overlaid ellipse
representing the latent bivariate normal distribution. An illustration of the domain of the standard normal
density function being discretized by dashed lines into a 3 × 11 contingency table.
Figures 2 and 3 are meant to illustrate how the odds ratio can be related to correlation and
regression. Estimating the Pearson correlation between two quantitative variables using
information from a 2 × 2 contingency table is one of the oldest problems in statistics
(Pearson, 1900). There is no data transformation that converts an odds ratio or relative
risk exactly into a correlation. However, an odds ratio or relative risk can be transformed to
approximate a product-moment correlation, commonly referred to as a Pearson
correlation, for two quantitative variables in the abstract. Our goal here is not to approximate
the odds ratio but to draw a parallel between two metrics, the odds ratio and the Pearson correlation
coefficient.
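One classical transformation of this kind is the cosine approximation, $\rho \approx \cos\left(\pi / (1 + \sqrt{OR})\right)$, which is exact when both latent normal variables are dichotomized at their medians. The sketch below is offered only to make the parallel concrete; it is not the estimation procedure used in this work, and the function names are illustrative.

```python
import math

def cosine_tetrachoric(a, b, c, d):
    """Approximate the latent correlation from the 2x2 cells [[a, b], [c, d]]
    (counts or proportions) via rho ~ cos(pi / (1 + sqrt(OR)))."""
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))

def median_split_cells(rho):
    """Cell proportions of a bivariate normal with correlation rho when both
    variables are dichotomized at zero (their medians)."""
    concordant = 0.25 + math.asin(rho) / (2.0 * math.pi)
    discordant = 0.5 - concordant
    return concordant, discordant, discordant, concordant

cells = median_split_cells(0.3)
print(cosine_tetrachoric(*cells))  # recovers the latent correlation for a median split
```

For thresholds away from the medians the formula is only approximate, which is why the threshold locations are estimated jointly with the correlation in the model described next.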
The construct is that we consider variables as quantitative but measured on discrete scales.
We visualize this as the existence of latent continuous variables that are observable only on a
discrete scale because the latent variable exceeds, or does not exceed, some unknown
threshold values. Once we know the observed cross-classification proportions,
$p_{mg}$, $p_{(m-0.1)g}$, $p_{m(g-1)}$, and $p_{(m-0.1)(g-1)}$, for a study, it is a simple matter to estimate the
model represented in Figure 2. Specifically, we estimate the locations of the discretization
thresholds and a third parameter, $\rho$, which determines the "fatness" of the ellipse. $\rho$ is the
tetrachoric correlation, and for this example it can be interpreted as the correlation
between SNP values $\in \{1, 2\}$ and methylation values $\in \{0.8, 0.9\}$ before application of
thresholds. The principle of estimating correlation based on odds or risk ratio data involves
searching over combinations of thresholds and $\hat{\rho}$ until values are found for which the expected
proportions in Figures 2 and 3 are as close as possible to the observed proportions. The
parameter value, $\hat{\rho}$, is regarded as an estimate of the true correlation, called the tetrachoric
correlation. The tetrachoric correlation coefficient, $\hat{\rho}$, can be illustrated in reference to
Figure 2 as the solution to the integral equation from the bivariate normal (Divgi, 1979).
The natural extension of the idea of tetrachoric correlation is the polychoric correlation
depicted in Figure 3, used when there are more than two ordered values in the variables'
ranges (Uebersax, 2006). It is a straightforward extension of the model based on the bivariate
normal distribution. The difference is that there are more thresholds, more regions, and
more cells, as in Figure 3. But again, the idea is to find the values for the thresholds and $\hat{\rho}$ that
maximize similarity between the model-expected (in this case bivariate normal) and
observed cross-classification proportions. With variables discretized as in Figures 2 and 3, we can quantify
interaction through the strength of correlation between two variables at each cross-tabulation
in the plane and, more generally, across the entire grid. The goal here is to justify applying
the correlation function to case-only data as a way of detecting interaction.
We have established that the case-only design can be used to estimate interaction under the
assumption that the variables are independent in the general population, and that the
corresponding odds or risk ratio can be regarded as reflecting the strength of departure from
multiplicativity of the effect of two variables on a phenotype. Furthermore, we have shown
that the odds or risk ratio is reflected in a tetrachoric or polychoric correlation, which is closely
related to the Pearson product-moment coefficient. Therefore, correlation used as a measure
of association within case data, in lieu of the odds or risk ratio, can detect interaction
as it is typically defined: deviation from multiplicativity, or modification of the effect of one
variable by another.
3.4. Canonical Correlation
In application we are concerned with describing interaction between two sets of variables,
and building upon the above reasoning we utilize canonical correlation, which
quantifies linear dependency between sets of variables. With two sets of variables
represented by x and y, the principle behind canonical correlation is to find the best
matched pair of linear combinations on the x and y sides, that is, the one yielding the
largest coefficient of correlation, $\rho$. Once we have the best pair, we can ask for the second-best
pair: the linear combination of x data values in the subspace orthogonal to the first
combination on the x side that best correlates with a linear combination of y data values in
the subspace orthogonal to the first combination on the y side. We can proceed like this
until we have exhausted the orthogonal subspaces on either the x or the y side, whichever
comes first. If the number of variables is different for the two sets, $p_1 \neq p_2$, or if the column
ranks are not maximal, then there are "too many columns" on either the x or y side, and the
solution specifies min(rank(x), rank(y)) correlations. So, the number of pairs of canonical
variates is d = min[rank(x), rank(y)], where rank means column rank. A representation of
the type of data set we would typically tackle is shown below:
$$x = \begin{bmatrix} x_{11} & \cdots & x_{1 p_{g_1}} \\ x_{21} & \cdots & x_{2 p_{g_1}} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{n p_{g_1}} \end{bmatrix}, \qquad y = \begin{bmatrix} y_{11} & \cdots & y_{1 p_{g_2}} \\ y_{21} & \cdots & y_{2 p_{g_2}} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{n p_{g_2}} \end{bmatrix}$$
We are tasked with finding the a and b coefficients that form a set of pairs of scores, also
termed canonical variates, and with calculating the canonical correlation as the Pearson product-moment
coefficient, $\rho$, of the linear combinations of x and y, illustrated below.
$$u = \begin{bmatrix} a_1 x_{11} + a_2 x_{12} + \cdots + a_{p_{g_1}} x_{1 p_{g_1}} \\ \vdots \\ a_1 x_{n1} + a_2 x_{n2} + \cdots + a_{p_{g_1}} x_{n p_{g_1}} \end{bmatrix}, \qquad v = \begin{bmatrix} b_1 y_{11} + b_2 y_{12} + \cdots + b_{p_{g_2}} y_{1 p_{g_2}} \\ \vdots \\ b_1 y_{n1} + b_2 y_{n2} + \cdots + b_{p_{g_2}} y_{n p_{g_2}} \end{bmatrix}$$
We denote the matrices of coefficients of the linear combinations by a on the x side, of size
$p_{g_1} \times d$, and b on the y side, of size $p_{g_2} \times d$. If matrices u and v denote the linear
combinations evaluated for each data point, we have $u = xa$, $v = yb$. The n rows of the
matrices u and v are called the canonical variates of the data points. Both u and v have d
columns. The first columns of a and b, denoted $a_1$ and $b_1$, have by definition the single best
linear correlation, so we can write, symbolically,
$$(a_1, b_1) = \arg\max\, \mathrm{corr}[x a_1, y b_1] = \arg\max \left[ \frac{u_1^T v_1}{\sqrt{u_1^T u_1}\,\sqrt{v_1^T v_1}} \right] = \arg\max\, [\rho(a_1, b_1)]$$
where u and v are considered functions of a and b, respectively, and $\rho$ is the canonical
correlation. The correlation between $u_1$ and any other column of $v$, $v_j$ with $j \neq 1$, is zero.
In like manner, all the columns of $u$ and $v$ are cross-orthogonal. The
matrices $u$ and $v$ are orthogonal by their construction in successive orthogonal subspaces.
We can choose to scale the columns of $a$ and $b$ to make $u^T u = 1_d$, $v^T v = 1_d$, where $1_d$ is the
identity matrix in d dimensions. If the coefficients are standardized to unit length then we
can represent all canonical correlations in matrix form,
$$u^T v = D = \begin{bmatrix} \rho_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \rho_d \end{bmatrix}$$
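The construction above can be sketched numerically with a QR decomposition of each centered data matrix followed by an SVD, which is one standard way of computing canonical correlations. This is a minimal sketch assuming both centered matrices have full column rank; the function name and the simulated data are illustrative only.

```python
import numpy as np

def canonical_correlations(x, y):
    """Canonical correlations, coefficient matrices and variates via QR + SVD.

    Assumes the centered x and y have full column rank, so that
    d = min(number of x columns, number of y columns).
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    qx, rx = np.linalg.qr(xc)                  # xc = qx @ rx with orthonormal qx
    qy, ry = np.linalg.qr(yc)
    left, rho, right_t = np.linalg.svd(qx.T @ qy)
    d = min(x.shape[1], y.shape[1])
    a = np.linalg.solve(rx, left[:, :d])       # x-side coefficients
    b = np.linalg.solve(ry, right_t.T[:, :d])  # y-side coefficients
    return rho[:d], a, b, xc @ a, yc @ b

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = np.column_stack([x[:, 0] + rng.normal(size=200), rng.normal(size=200)])
rho, a, b, u, v = canonical_correlations(x, y)
print(np.round(rho, 3))       # d = 2 canonical correlations, largest first
print(np.round(u.T @ v, 3))   # diagonal matrix D of canonical correlations
```

With this scaling the variates satisfy $u^T u = 1_d$ and $v^T v = 1_d$, so $u^T v$ reproduces the diagonal matrix $D$.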
4. Canonical Analysis of Set Interactions (CASI)
We outline a novel approach to measuring and testing set interaction, referred to as
canonical analysis of set interactions (CASI). The procedure we propose is applicable to any
pair of variable sets but to expound on its properties we use notation appropriate to our
application.
To understand the crucial role of 'case' data, as opposed to 'control' data, in endowing our
method with the ability to detect interaction, we should clarify the trade-off between
prevalence and risk associated with the joint effect of predictor variables. Taken in the context of
logistic model assumptions for a fixed disease probability,
$$P(D=1 \mid G_1=g_1, G_2=g_2) = \frac{\exp(\beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 g_1 g_2)}{1 + \exp(\beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 g_1 g_2)},$$
we can see that lower prevalence, $\beta_0$, can translate to stronger
interaction, $\beta_3$, if the distribution of predictors in the population is maintained and $\beta_1 > 0$
and $\beta_2 > 0$ are moderate. In fact, it is the scarcity of the event (disease) in the overall
population which enables this property, whereas abundance withholds it. This is reflected in
the relationship shown in the case-only description above under the assumption of
independence between predictors, as follows.
$$OR_I \approx OR_{g_1 g_2 \mid case} = \frac{p_{g_1 g_2} / p_{(g_1-1) g_2}}{p_{g_1 (g_2-1)} / p_{(g_1-1)(g_2-1)}} = \exp(\beta_3) \times \frac{1 + \exp[\beta_0 + \beta_1 (g_1-1) + \beta_2 g_2 + \beta_3 (g_1-1) g_2]}{1 + \exp[\beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 g_1 g_2]} \times \frac{1 + \exp[\beta_0 + \beta_1 (g_1-1) + \beta_2 (g_2-1) + \beta_3 (g_1-1)(g_2-1)]}{1 + \exp[\beta_0 + \beta_1 g_1 + \beta_2 (g_2-1) + \beta_3 g_1 (g_2-1)]},$$
where
$$p_{g_1 g_2} = P(G_1=g_1, G_2=g_2 \mid D=1) = \frac{P(D=1 \mid G_1=g_1, G_2=g_2) \times P(G_1=g_1)\,P(G_2=g_2)}{P(D=1)}.$$
Multiplicative interaction, $OR_I = \exp(\beta_3)$, will increase if the penetrance, $\beta_0$, of disease in the
population decreases. Therefore, our ability to observe and detect strong interaction
depends on low overall prevalence (consider $\beta_0 \ll 0$) of the phenotype level studied for
association (disease class), and accordingly we calculate the canonical correlation within the
subset of 'cases'.
We outline steps of the CASI procedure below.
Step 1. Divide the data according to cases and controls and identify two variable groupings.
For example, groupings might be SNPs from two different genes, or SNPs and methylation
measures taken from the same gene.
The two variable sets are denoted $g_1$ and $g_2$, and their case and control subsets are divided as
follows.
1 - Subset of cases from data set $g_1$, denoted $g_{1\{cases\}}$
2 - Subset of cases from data set $g_2$, denoted $g_{2\{cases\}}$
3 - Subset of controls from data set $g_1$, denoted $g_{1\{controls\}}$
4 - Subset of controls from data set $g_2$, denoted $g_{2\{controls\}}$
Step 2. Calculate the coefficients of the linear combinations for $g_{1\{cases\}}$ and $g_{2\{cases\}}$,
subject to the maximization constraint defined above in the canonical correlation section.
Using case data only we identify columns of coefficients, $a$ and $b$, and denote $\hat{a}_{cases(1)}$ and
$\hat{b}_{cases(1)}$ (the first columns) as the coefficients for the single best linear correlation estimated from
sample data,
$$\left(\hat{a}_{cases(1)}, \hat{b}_{cases(1)}\right) = \underset{\{a_{(1)},\, b_{(1)}\}}{\arg\max}\; \mathrm{corr}\left[g_{1\{cases\}}\, a_{(1)},\; g_{2\{cases\}}\, b_{(1)}\right] = \arg\max \left[\frac{u_1^T v_1}{\sqrt{u_1^T u_1}\,\sqrt{v_1^T v_1}}\right] \quad \text{for } u = g_{1\{cases\}}\, a \text{ and } v = g_{2\{cases\}}\, b,$$
with components of the relations as defined above.
We have $\hat{u}_{cases(1)} = g_{1\{cases\}}\, \hat{a}_{cases(1)}$ and $\hat{v}_{cases(1)} = g_{2\{cases\}}\, \hat{b}_{cases(1)}$, vectors of size
$n_{cases} \times 1$, which are linear combinations for $g_{1\{cases\}}$ and $g_{2\{cases\}}$, produced with the vectors
of coefficients $\hat{a}_{cases(1)}$ and $\hat{b}_{cases(1)}$, respectively.
We define $\hat{a}'_{cases(1)} = [\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_{p_{g_1}}]$ and $\hat{b}'_{cases(1)} = [\hat{b}_1, \hat{b}_2, \ldots, \hat{b}_{p_{g_2}}]$, for data sets
$$g_{1\{cases\}} = \begin{bmatrix} x_{11} & \cdots & x_{1 p_{g_1}} \\ \vdots & \ddots & \vdots \\ x_{n_{cases} 1} & \cdots & x_{n_{cases} p_{g_1}} \end{bmatrix} \quad \text{and} \quad g_{2\{cases\}} = \begin{bmatrix} y_{11} & \cdots & y_{1 p_{g_2}} \\ \vdots & \ddots & \vdots \\ y_{n_{cases} 1} & \cdots & y_{n_{cases} p_{g_2}} \end{bmatrix}$$
for $p_{g_1}$ variables in the data subset $g_{1\{cases\}}$ and $p_{g_2}$ variables in the data subset $g_{2\{cases\}}$.
The canonical variates are formed as linear combinations for case data for the
best/strongest canonical correlation:
$$\hat{u}_{cases(1)} = \begin{bmatrix} \hat{a}_1 x_{11} + \hat{a}_2 x_{12} + \cdots + \hat{a}_{p_{g_1}} x_{1 p_{g_1}} \\ \hat{a}_1 x_{21} + \hat{a}_2 x_{22} + \cdots + \hat{a}_{p_{g_1}} x_{2 p_{g_1}} \\ \vdots \\ \hat{a}_1 x_{n_{cases} 1} + \hat{a}_2 x_{n_{cases} 2} + \cdots + \hat{a}_{p_{g_1}} x_{n_{cases} p_{g_1}} \end{bmatrix}, \qquad \hat{v}_{cases(1)} = \begin{bmatrix} \hat{b}_1 y_{11} + \hat{b}_2 y_{12} + \cdots + \hat{b}_{p_{g_2}} y_{1 p_{g_2}} \\ \hat{b}_1 y_{21} + \hat{b}_2 y_{22} + \cdots + \hat{b}_{p_{g_2}} y_{2 p_{g_2}} \\ \vdots \\ \hat{b}_1 y_{n_{cases} 1} + \hat{b}_2 y_{n_{cases} 2} + \cdots + \hat{b}_{p_{g_2}} y_{n_{cases} p_{g_2}} \end{bmatrix}$$
and the top canonical correlation formed using case data is
$$\hat{\rho}_{case(1)} = \mathrm{corr}\left[g_{1\{cases\}}\, \hat{a}_{cases(1)},\; g_{2\{cases\}}\, \hat{b}_{cases(1)}\right] = \mathrm{corr}\left[\hat{u}_{cases(1)}, \hat{v}_{cases(1)}\right].$$
This is repeated for all possible correlations until the column space of the smaller variable
set is exhausted, giving $\hat{\rho}_{case(2)}, \hat{\rho}_{case(3)}, \ldots, \hat{\rho}_{case(d)}$.
Step 3. Apply the coefficients calculated for cases to the control data, $g_{1\{controls\}}$ and
$g_{2\{controls\}}$, forming $g_{1\{controls\}}\, \hat{a}_{cases(1)}$ and $g_{2\{controls\}}\, \hat{b}_{cases(1)}$. Calculate Pearson correlations for
the pairs of linear combinations (canonical variates) formed with these coefficients,
$$\hat{\rho}_{control(1)} = \mathrm{corr}\left[g_{1\{controls\}}\, \hat{a}_{cases(1)},\; g_{2\{controls\}}\, \hat{b}_{cases(1)}\right] = \mathrm{corr}\left[\hat{u}_{controls}, \hat{v}_{controls}\right].$$
Likewise for the other canonical correlation values among controls,
$\hat{\rho}_{control(2)}, \hat{\rho}_{control(3)}, \ldots, \hat{\rho}_{control(d)}$.
Step 4. Fisher transform, $\tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{class(k)}}{1 - \hat{\rho}_{class(k)}}\right)$ ($class \in \{case, control\}$, $k \in \{1, 2, \ldots, d\}$), the
correlations for the cases and controls, calculate the differences between those Fisher-transformed
correlations,
$$\tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{case(k)}}{1 - \hat{\rho}_{case(k)}}\right) - \tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{control(k)}}{1 - \hat{\rho}_{control(k)}}\right),$$
and select the difference from
the first and largest case canonical correlation, $\hat{\rho}_{case(1)}$,
$$\tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{case(1)}}{1 - \hat{\rho}_{case(1)}}\right) - \tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{control(1)}}{1 - \hat{\rho}_{control(1)}}\right).$$
We have experimented with using the largest difference across all $\hat{\rho}_{class(k)}$ for our statistic, but
at this point found it to be inferior with respect to statistical power.
Step 5. Permute case-control labels B times (B=100 in our application) and repeat steps 1
through 4 for each permutation. The resulting B permutation statistics, the Fisher-transformed
differences
$$\tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{perm\,case(1)}}{1 - \hat{\rho}_{perm\,case(1)}}\right) - \tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{perm\,control(1)}}{1 - \hat{\rho}_{perm\,control(1)}}\right),$$
are meant to represent the null
distribution of our statistic.
Step 6. Calculate the mean and standard deviation based on the permuted results for each
variable grouping (gene or gene pair in our studies). Scale the Fisher-transformed
differences for the observed and permuted data by subtracting the mean and dividing by the
standard deviation of their corresponding groupings:
$$\text{CASI statistic from observed data} = \frac{\left[\tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{case(1)}}{1 - \hat{\rho}_{case(1)}}\right) - \tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{control(1)}}{1 - \hat{\rho}_{control(1)}}\right)\right] - MEAN_{CASI\ stat.\ from\ permuted}}{S.D._{CASI\ stat.\ from\ permuted}},$$
where, writing $T^{perm}_b = \tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{perm\,case(1)}}{1 - \hat{\rho}_{perm\,case(1)}}\right) - \tfrac{1}{2} \ln\left(\frac{1 + \hat{\rho}_{perm\,control(1)}}{1 - \hat{\rho}_{perm\,control(1)}}\right)$ for the difference obtained in permutation $b \in \{1, \ldots, B\}$,
$$MEAN_{CASI\ stat.\ from\ permuted} = \frac{1}{B} \sum_{b=1}^{B} T^{perm}_b \quad \text{and} \quad S.D._{CASI\ stat.\ from\ permuted} = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left[T^{perm}_b - MEAN_{CASI\ stat.\ from\ permuted}\right]^2}.$$
Step 7. Calculate the false discovery rate (FDR) using FDR functions for permutation-based
estimators from the paper "Computationally efficient permutation-based confidence interval
estimation for tail-area FDR" (Millstein and Volfson, 2013),
$$\widehat{FDR} = \frac{\bar{S}^*}{S} \times \frac{1 - S/m}{1 - \bar{S}^*/m},$$
where $\bar{S}^*$
denotes the average of the count of positive tests for the B permuted datasets, S is the total
number of tests called significant, and m denotes the total number of tests conducted.
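Steps 1 through 6 can be sketched for a single variable grouping as follows. This is a minimal illustration assuming ordinary (non-sparse) canonical correlation computed by QR and SVD; the function names, the simulated data, and the seed are hypothetical and are not the implementation used for the analyses in this work.

```python
import numpy as np

def first_canonical_pair(x, y):
    """First pair of canonical coefficient vectors (a1, b1) via QR + SVD."""
    qx, rx = np.linalg.qr(x - x.mean(axis=0))
    qy, ry = np.linalg.qr(y - y.mean(axis=0))
    left, _, right_t = np.linalg.svd(qx.T @ qy)
    return np.linalg.solve(rx, left[:, 0]), np.linalg.solve(ry, right_t[0])

def fisher_difference(g1, g2, is_case):
    """Steps 1-4: fit on cases, apply to controls, difference of Fisher z."""
    a1, b1 = first_canonical_pair(g1[is_case], g2[is_case])
    rho_case = np.corrcoef(g1[is_case] @ a1, g2[is_case] @ b1)[0, 1]
    rho_control = np.corrcoef(g1[~is_case] @ a1, g2[~is_case] @ b1)[0, 1]
    return np.arctanh(rho_case) - np.arctanh(rho_control)

def casi_statistic(g1, g2, is_case, n_perm=100, seed=0):
    """Steps 5-6: standardize the observed difference by the permutation null."""
    rng = np.random.default_rng(seed)
    observed = fisher_difference(g1, g2, is_case)
    null = np.array([fisher_difference(g1, g2, rng.permutation(is_case))
                     for _ in range(n_perm)])
    return (observed - null.mean()) / null.std(ddof=1)

# Toy data: the two sets are correlated among cases only.
rng = np.random.default_rng(1)
n = 400
is_case = np.arange(n) < n // 2
g1 = rng.normal(size=(n, 3))
g2 = rng.normal(size=(n, 2))
g2[is_case, 0] += 0.8 * g1[is_case, 0]
stat = casi_statistic(g1, g2, is_case)
print(stat)
```

Here `np.arctanh` is the Fisher transformation, and the coefficients fitted on the cases are reused unchanged for the controls, as in Step 3.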
The permutation-based FDR estimation method is designed to compute the FDR when a
permutation-based approach to calculating statistics or p-values has been utilized. The
objective is to identify a subset of groupings (genes or gene pairs in our applications) that
have corresponding statistics more extreme than the permuted results, which are assumed
to represent the null. The significance of the subset is described in terms of the FDR, and
uncertainty in the FDR estimate is quantified by computing a confidence interval (Millstein and Volfson,
2013).
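The point estimate in Step 7 can be written out directly from its definition. The sketch below covers only the point estimate and assumes that a test is called positive when its absolute statistic exceeds a chosen threshold; the confidence interval of Millstein and Volfson (2013) is not reproduced, and the function name and the small input lists are hypothetical.

```python
def permutation_fdr(observed, permuted, threshold):
    """Point estimate FDR = (S*bar / S) * (1 - S/m) / (1 - S*bar/m).

    observed: list of m statistics from the real data.
    permuted: list of B lists, each holding m statistics from one permutation.
    A test is counted as positive when |statistic| > threshold (an assumption).
    """
    m = len(observed)
    s = sum(abs(t) > threshold for t in observed)
    s_star_bar = sum(sum(abs(t) > threshold for t in perm)
                     for perm in permuted) / len(permuted)
    if s == 0:
        return float("nan")  # nothing was called significant at this threshold
    return (s_star_bar / s) * (1 - s / m) / (1 - s_star_bar / m)

observed = [3.1, 0.2, -2.8, 0.5, 1.0]            # m = 5 tests, S = 2 at threshold 2
permuted = [[0.1, 2.5, -0.3, 0.4, 1.1],          # 1 positive test
            [0.2, -0.1, 0.3, 0.9, -1.2]]         # 0 positive tests
fdr = permutation_fdr(observed, permuted, threshold=2.0)
print(fdr)  # (0.5 / 2) * (1 - 2/5) / (1 - 0.5/5)
```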
The current statistic is built upon the Fisher-transformed canonical correlation. We aim to
explore alternative measures. We would like to evaluate other canonical correlation based
statistics inspired by tests from multivariate model theory, including Wilks' Lambda,
$\hat{\Lambda} = \prod_{k=1}^{d} (1 - \hat{\rho}_k^2)$, the Hotelling-Lawley trace, $HLT = \sum_{k=1}^{d} \hat{\rho}_k^2 / (1 - \hat{\rho}_k^2)$, and the Pillai-Bartlett trace,
$PBT = \sum_{k=1}^{d} \hat{\rho}_k^2$. However, these will be left for further work that is beyond the scope of this
dissertation.
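For reference, the three candidate statistics are simple functions of the squared canonical correlations. The sketch below only evaluates the formulas; the function name and the example correlations are illustrative.

```python
import math

def cca_summary_statistics(rhos):
    """Wilks' Lambda, Hotelling-Lawley trace and Pillai-Bartlett trace
    computed from a list of canonical correlations."""
    squared = [r * r for r in rhos]
    wilks_lambda = math.prod(1.0 - v for v in squared)
    hotelling_lawley = sum(v / (1.0 - v) for v in squared)
    pillai_bartlett = sum(squared)
    return wilks_lambda, hotelling_lawley, pillai_bartlett

wilks, hlt, pbt = cca_summary_statistics([0.6, 0.3])
print(wilks, hlt, pbt)
```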
4.1. Generalization of the CASI method (gCASI)
We propose a generalization of the CASI method, which is specifically designed for case-control
studies, to a setting where we have multiple continuous phenotypes. Specifically, to
test the efficacy of this approach we apply it to gene networks in a study of asthma control
phenotypes. We consider this generalization a permutation-based approach to testing gene
network influence on multiple asthma control phenotypes.
The statistical approach we employ is based on the sparse canonical correlation procedure. It
evaluates the relationship between two sets of variables by maximizing the correlation between
linear combinations subject to a constraint that sets certain coefficients to zero, in effect
conducting variable selection as part of the association analysis. We avoid some of the
limitations and questions of distributional assumptions of classical regression by not
presuming a specific statistical model and by using a permutation-based FDR for assessing
statistical significance. We refer to our statistical method as generalized canonical analysis
of set interactions (gCASI).
4.2. gCASI Procedure
We explain how the procedure works by referencing a specific type of data. Consider data
consisting of microarray gene expression profiles (measurements of the expression of a
multitude of genes responsible for cellular function) for asthmatic individuals with
recorded manifestations of acute and chronic asthma symptoms. Gene sets are imported from
the Molecular Signatures Database (MSigDB), and STRING, a repository of known and predicted
protein-protein interactions, is used to identify pairs involved in interaction. The pairs of genes
obtained by the "get_interactions" method of the STRINGdb R package determine which pairs are
included in the gCASI procedure. With these particulars established, the gCASI procedure can be
applied. The basis of
this test corresponds to the approach of testing interaction via a difference in
correlations, reasoning established in prior work (Li, et al., 2015; Peng, Zhao and Xue,
2010; Rajapakse, et al., 2012; Yuan, et al., 2012), extended to a setting with a continuous
phenotype. Observations for individual $i$ are given as $x_{i1}$ and $x_{i2}$ from two variables, $x_1$ and $x_2$,
respectively. A difference in a measure across levels of a phenotype can be formulated for a
continuous set of outcomes by extracting the components,
$\frac{x_{i1} - \bar{x}_1}{s_{x_1}} \times \frac{x_{i2} - \bar{x}_2}{s_{x_2}}$, from the equation
for the Pearson correlation,
$$\hat{\rho} = \frac{1}{n-1} \sum_{i} \left( \frac{x_{i1} - \bar{x}_1}{s_{x_1}} \times \frac{x_{i2} - \bar{x}_2}{s_{x_2}} \right) = \frac{x_1 \cdot x_2}{\|x_1\|\, \|x_2\|} \quad \text{(with } x_1 \text{ and } x_2 \text{ centered)},$$
and correlating the sums
of product terms, $\sum_{h \neq l} \left( \frac{x_{ih} - \bar{x}_h}{s_{x_h}} \times \frac{x_{il} - \bar{x}_l}{s_{x_l}} \right)$, for different variables h and l (the procedure
accommodates many pairs of variables), transposed to match variables of another
category, phenotypes. For clarity, visualize the values of the phenotype arranged vertically for the n
observations beside the product terms for the corresponding individuals. Thereby, through
correlation of linear combinations we characterize how interaction between two sets of
variables is perturbed by one or more phenotypes, variables defined on an ordinal, discrete
or continuous scale.
The product-moment correlation is defined as the normalized average of the products
of pairs of features observed in each individual. As is typical, the data consist of n
observations, each with a known ordinal phenotype level. Here we are interested in
differences in the normalized average of product pairs across classes of phenotype. The
elements summed in the Pearson correlation represent products for a pair of variables, and these
products comprise a column of values in the data frame formulated for correlation of sums. The
possibility of high dimensionality is addressed by application of the sparse canonical
correlation function (Witten, Tibshirani and Hastie, 2009).
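The construction of the product columns can be sketched as follows. This is a minimal illustration in which the function name, the simulated data, and the list of variable pairs are hypothetical; in the application the pairs would come from the interaction database.

```python
import numpy as np

def standardized_products(x, pairs):
    """One column per variable pair (h, l): the product of the two
    standardized variables for each individual."""
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    return np.column_stack([z[:, h] * z[:, l] for h, l in pairs])

rng = np.random.default_rng(4)
x = rng.normal(size=(50, 3))
pairs = [(0, 1), (0, 2), (1, 2)]
products = standardized_products(x, pairs)

# Summing a product column and dividing by n - 1 recovers the Pearson correlation.
r01 = products[:, 0].sum() / (len(x) - 1)
print(r01, np.corrcoef(x[:, 0], x[:, 1])[0, 1])
```

Each row of `products` is the per-individual contribution to the correlation of a pair, which is what allows it to be related to that individual's phenotype values.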
Sparse canonical correlation is employed. Consider denoting Y=Phenotypes, X=Feature
Products. We want to find the best matched pair of linear combinations on the x and y
sides, the one yielding the largest coefficient of correlation.
$$Y = \begin{bmatrix} y_{11} & \cdots & y_{1 p_y} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{n p_y} \end{bmatrix}$$
$$X = \begin{bmatrix} \left(\frac{x_{11} - \bar{x}_{.1}}{sd(x_{.1})}\right) \times \left(\frac{x_{12} - \bar{x}_{.2}}{sd(x_{.2})}\right) & \cdots & \left(\frac{x_{1(p_x-1)} - \bar{x}_{.(p_x-1)}}{sd(x_{.(p_x-1)})}\right) \times \left(\frac{x_{1 p_x} - \bar{x}_{.p_x}}{sd(x_{.p_x})}\right) \\ \vdots & \ddots & \vdots \\ \left(\frac{x_{n1} - \bar{x}_{.1}}{sd(x_{.1})}\right) \times \left(\frac{x_{n2} - \bar{x}_{.2}}{sd(x_{.2})}\right) & \cdots & \left(\frac{x_{n(p_x-1)} - \bar{x}_{.(p_x-1)}}{sd(x_{.(p_x-1)})}\right) \times \left(\frac{x_{n p_x} - \bar{x}_{.p_x}}{sd(x_{.p_x})}\right) \end{bmatrix}$$
We find the a and b coefficients that form a set of pairs of scores and calculate the sparse canonical
correlation as the Pearson correlation of linear combinations of x and y, where some
coefficients are set to 0 by the algorithm. The implication is that variables with coefficients
set to 0 do not contribute to the correlation (relationship) between the sets of variables.
$$U = \begin{bmatrix} a_1 \left(\frac{x_{11} - \bar{x}_{.1}}{sd(x_{.1})}\right) \left(\frac{x_{12} - \bar{x}_{.2}}{sd(x_{.2})}\right) + \cdots + a_{\{p_x, (p_x-1)\}} \left(\frac{x_{1(p_x-1)} - \bar{x}_{.(p_x-1)}}{sd(x_{.(p_x-1)})}\right) \left(\frac{x_{1 p_x} - \bar{x}_{.p_x}}{sd(x_{.p_x})}\right) \\ \vdots \\ a_1 \left(\frac{x_{n1} - \bar{x}_{.1}}{sd(x_{.1})}\right) \left(\frac{x_{n2} - \bar{x}_{.2}}{sd(x_{.2})}\right) + \cdots + a_{\{p_x, (p_x-1)\}} \left(\frac{x_{n(p_x-1)} - \bar{x}_{.(p_x-1)}}{sd(x_{.(p_x-1)})}\right) \left(\frac{x_{n p_x} - \bar{x}_{.p_x}}{sd(x_{.p_x})}\right) \end{bmatrix}$$
$$V = \begin{bmatrix} b_1 y_{11} + \cdots + b_{p_y} y_{1 p_y} \\ \vdots \\ b_1 y_{n1} + \cdots + b_{p_y} y_{n p_y} \end{bmatrix}$$
If the matrices U and V denote the linear combinations evaluated for each data point, we
have: U=Xa; V=Yb. The n rows of the matrices U and V are the canonical variates of the data
points. A straightforward approach would be to compare the sample canonical correlation,
$\hat{\rho}$, to 0. In general it is better to make inference on the Fisher-transformed version of $\hat{\rho}$,
$\mathrm{arctanh}(\hat{\rho})$.
We are looking at differences across phenotype levels to test for interaction. Understanding
the underlying biology requires further study. The approach we propose is a high-dimensional
procedure focused on detecting differences by considering multiple phenotypes and
covariates at a time. A permutation-based FDR is used as an indication of significance. The
advantage over a model-based approach is that parametric methods suffer from model
misspecification and rely on asymptotic, approximate p-values whose approximation grows worse
as we move into the tails. The question addressed is that of testing for joint interaction with
these data. We argue that the most natural test is of equality of correlations across groups or
levels of a phenotype.
The hypothesis test is formulated as follows:
$$H(j, k):\; R_m(j, k) = R(j, k)$$
for each (j,k) pair of features from a predefined set of variables, where $R_m(j, k)$ corresponds
to the (j,k)-th entry of the Pearson product-moment correlation formula for class level m (m
can be ordinal, treated as continuous); that is, the null hypothesis is that the correlation does not
vary across phenotype classes.
To assess significance, we directly estimate the FDR by choosing some threshold for the
statistic, t, and rejecting (calling significant) all sets with $|T_{set}| > t$. Not all interactions called
significant in this way will be truly non-null, and it is important to estimate the FDR for this
cutoff, that is, $FDR = E\left[\frac{\#\ \text{false rejections}}{\#\ \text{total rejections}}\right]$. Now, let $\{1, \ldots, n\}_{perm}$ be some random
permutation of the phenotype values of individuals. With these new labels we calculate
$\mathrm{arctanh}(\hat{\rho}_{perm})$. We permute the data B times and gather a large collection of these null
statistics. Often one is interested in the FDR of the I most significant interactions. In this
case the cut-off, t, is chosen to be the absolute value of the I-th most significant statistic,
denoted T(I). The objective is to identify a subset of positive tests that have corresponding
statistics with a more extreme distribution than the permuted results, which are assumed
to represent the null.
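The permutation step can be sketched as below. To stay self-contained the sketch substitutes ordinary canonical correlation (QR and SVD) for the sparse procedure, so it illustrates only how the null collection of $\mathrm{arctanh}(\hat{\rho}_{perm})$ values is gathered; the function names, the simulated phenotypes and feature products, and the number of permutations are all assumptions.

```python
import numpy as np

def top_canonical_correlation(x, y):
    """Largest canonical correlation between the columns of x and y
    (ordinary CCA, used here in place of the sparse version)."""
    qx, _ = np.linalg.qr(x - x.mean(axis=0))
    qy, _ = np.linalg.qr(y - y.mean(axis=0))
    return np.linalg.svd(qx.T @ qy, compute_uv=False)[0]

def permutation_null(x, y, n_perm=200, seed=0):
    """Observed arctanh(rho) and its null values under permuted phenotype rows."""
    rng = np.random.default_rng(seed)
    observed = np.arctanh(top_canonical_correlation(x, y))
    null = np.array([
        np.arctanh(top_canonical_correlation(x, y[rng.permutation(len(y))]))
        for _ in range(n_perm)
    ])
    return observed, null

# Toy data: 4 columns standing in for feature products, 2 phenotypes.
rng = np.random.default_rng(2)
x = rng.normal(size=(300, 4))
y = rng.normal(size=(300, 2))
y[:, 0] += 0.5 * x[:, 0]          # the first phenotype tracks the first product column
observed, null = permutation_null(x, y)
print(observed, np.quantile(null, 0.95))
```

Permuting whole rows of the phenotype matrix keeps the correlation among phenotypes intact while breaking their link to the feature products.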
4.3. Interpreting CASI via Canonical Correlation: Loadings
We use the notation $X_1$ and $X_2$ to represent the two variable sets engaged in interaction.
Canonical correlation analysis (CCA) is used to find coefficients $\hat{a}_{1\{case\}}$ and $\hat{b}_{1\{case\}}$ for
$X_{1\{case\}}$ and $X_{2\{case\}}$, respectively, that maximize the Pearson correlation between their
corresponding canonical variates. Interpretation of CASI results through the canonical variates,
$\hat{U}_{1\{case\}} = X_{1\{case\}} \hat{a}_{1\{case\}}$ and $\hat{V}_{1\{case\}} = X_{2\{case\}} \hat{b}_{1\{case\}}$, is aided by computing loadings,
the correlations between canonical variates and the original variables in the sets $X_{1\{case\}}$ and
$X_{2\{case\}}$. Once canonical variates are chosen for analysis we can provide an interpretation
in terms of original measured variables by identifying those that correlate most strongly
with this artificial variable. There is no specific guidance on how large a loading must be to be
considered representative of a canonical variate, and therefore involved in the canonical
correlation and interaction, but we have chosen a threshold of 0.5 because it is regarded as
the “operational” definition of a large effect size (Cohen, 1992; Faul, et al., 2007). Identifying
which genomic variables are involved helps direct attention to regions in or near the gene
that might be implicated in the effect being detected.
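The loading computation described above can be sketched in Python as follows. This is a minimal illustration, assuming column-centred data of full column rank; the function names `cca_svd` and `loadings`, the toy data, and the use of a standard SVD-based construction of canonical coefficients are assumptions made for the example rather than the implementation used in this work.

```python
import numpy as np


def cca_svd(x1, x2):
    """Canonical correlation of two data sets via SVD.

    Returns coefficient matrices a, b and the canonical correlations s.
    Assumes both matrices have full column rank and more rows than columns.
    """
    x1 = x1 - x1.mean(axis=0)  # centre columns
    x2 = x2 - x2.mean(axis=0)
    u1, s1, v1t = np.linalg.svd(x1, full_matrices=False)
    u2, s2, v2t = np.linalg.svd(x2, full_matrices=False)
    big_u, s, big_vt = np.linalg.svd(u1.T @ u2)
    a = v1t.T @ np.diag(1.0 / s1) @ big_u     # coefficients for set 1
    b = v2t.T @ np.diag(1.0 / s2) @ big_vt.T  # coefficients for set 2
    return a, b, s


def loadings(x, variates):
    """Pearson correlation of each column of x with each canonical variate."""
    xs = (x - x.mean(axis=0)) / x.std(axis=0)
    vs = (variates - variates.mean(axis=0)) / variates.std(axis=0)
    return xs.T @ vs / x.shape[0]


rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 1))        # hypothetical shared signal
x1 = latent + rng.normal(size=(n, 3))   # toy stand-in for variable set 1
x2 = latent + rng.normal(size=(n, 4))   # toy stand-in for variable set 2

a, b, s = cca_svd(x1, x2)
variates1 = (x1 - x1.mean(axis=0)) @ a
load1 = loadings(x1, variates1)
# Variables treated as representative of the first canonical variate (threshold 0.5)
flagged = np.abs(load1[:, 0]) >= 0.5
```

Each row of `load1` corresponds to an original variable and each column to a canonical variate, so `flagged` marks the variables whose absolute loading on the first variate meets the 0.5 threshold.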
4.4. Interpreting CASI via Canonical Correlation: Allocation of
Variance
Canonical variates, which depend on both genomic variable sets together and are orthogonal
(uncorrelated), also partition the variance. We briefly summarize how singular
value decomposition (SVD) can be applied to a set of variables to quantify the proportion of
variance it explains in another set. Earlier we defined $\hat{U}_{1\{case\}} = X_{1\{case\}} \hat{a}_{1\{case\}}$, which
implies $X_{1\{case\}} = \hat{U}_{1\{case\}} \hat{a}_{1\{case\}}^{-1}$, and the SVDs of $X_{1\{case\}}$ and $X_{2\{case\}}$ are
$X_{1\{case\}} = u_1 s_1 v_1^T$ and $X_{2\{case\}} = u_2 s_2 v_2^T$. Next, forming $u_1^T u_2$ and applying SVD,
$u_1^T u_2 = U S V^T$, we can define $\hat{a}_{1\{case\}} = v_1 s_1^{-1} U$ and $\hat{b}_{1\{case\}} = v_2 s_2^{-1} V$ because it gives us the right
solution to
$$\hat{U}_{1\{case\}}^T \hat{V}_{1\{case\}} = \hat{a}_{1\{case\}}^T X_{1\{case\}}^T X_{2\{case\}} \hat{b}_{1\{case\}} = (U^T s_1^{-1} v_1^T)(v_1 s_1 u_1^T)(u_2 s_2 v_2^T)(v_2 s_2^{-1} V) = U^T (U S V^T) V = S,$$
where S is a diagonal matrix with elements the canonical correlations.
Reversing $\hat{U}_{1\{case\}} = X_{1\{case\}} \hat{a}_{1\{case\}}$, applying the above derivation using SVD, and then using $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ (the canonical variates are orthonormal under this scaling, so $\hat{U}_{1\{case\}}^T \hat{U}_{1\{case\}} = I$), the total variance is
$$\begin{aligned}
\|X_{1\{case\}}\|^2 = \operatorname{tr}\left(X_{1\{case\}}^T X_{1\{case\}}\right)
&= \operatorname{tr}\left(\hat{a}_{1\{case\}}^{-1T}\, \hat{U}_{1\{case\}}^T \hat{U}_{1\{case\}}\, \hat{a}_{1\{case\}}^{-1}\right) \\
&= \operatorname{tr}\left[(v_1 s_1 U)\, \hat{U}_{1\{case\}}^T \hat{U}_{1\{case\}}\, (U^T s_1 v_1^T)\right] \\
&= \operatorname{tr}\left[(U^T s_1)(s_1 U)\right] = \sum_k \sum_i (s_{1kk} U_{ki})^2 .
\end{aligned}$$
The last line can be seen to partition the variance into pieces associated with each
canonical variate, indexed by i, $\sigma_i^2 = \sum_k (s_{1kk} U_{ki})^2$. If the matrices are non-singular
(invertible), $p = q = d$, the sum of the $\sigma_i^2$s will equal the total variance. If
$\operatorname{rank}(X_{2\{case\}}) < \operatorname{rank}(X_{1\{case\}})$, then the total variance in the canonical variates will be less than the total
variance, because the canonical variates won’t span the full range of $X_{1\{case\}}$.
Since the diagonal elements of S, denoted $S_{ii}$, are coefficients of correlation (canonical
correlations), their squares are fractions of variance in $X_{1\{case\}}$ explained by $X_{2\{case\}}$. The
total variance in $X_{1\{case\}}$ explained by $X_{2\{case\}}$ through the ith canonical variate is thus
$\sigma_i^2 S_{ii}^2$, with i = 1, …, d. The total variance of $X_{1\{case\}}$ explained by $X_{2\{case\}}$ through all
canonical variates is $\sum_i \sigma_i^2 S_{ii}^2$, whose value lies between 0 and the total variance of $X_{1\{case\}}$.
Since the canonical correlation is applied to cases it represents the portion of variance in
the phenotype, asthma, attributable to interaction, interplay between two sets of variables,
much like the well-established case only procedure.
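The allocation of variance described above can be checked numerically. The sketch below is illustrative only: it assumes column-centred, full-rank toy data, takes the total variance to be the trace of the cross-product matrix as in the derivation, and uses hypothetical names (`allocate_variance`, `shared`).

```python
import numpy as np


def allocate_variance(x1, x2):
    """Split tr(x1'x1) across canonical variates and weight by squared correlations."""
    x1 = x1 - x1.mean(axis=0)
    x2 = x2 - x2.mean(axis=0)
    u1, s1, _ = np.linalg.svd(x1, full_matrices=False)
    u2, _, _ = np.linalg.svd(x2, full_matrices=False)
    big_u, s, _ = np.linalg.svd(u1.T @ u2)
    # sigma2[i] = sum_k (s1[k] * U[k, i])**2, the piece of variance tied to variate i
    sigma2 = np.sum((s1[:, None] * big_u) ** 2, axis=0)
    total = float(np.sum(s1 ** 2))   # equals tr(x1'x1) for the centred data
    d = len(s)                       # number of canonical correlations
    explained = float(np.sum(sigma2[:d] * s ** 2))
    return sigma2, total, explained


rng = np.random.default_rng(1)
n, p, q = 300, 4, 4
shared = rng.normal(size=(n, 2))  # hypothetical latent structure common to both sets
x1 = shared @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
x2 = shared @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))

sigma2, total, explained = allocate_variance(x1, x2)
fraction_explained = explained / total
```

With p = q the pieces in `sigma2` sum to `total`, and `explained` lies between 0 and `total`, matching the bounds stated in the text.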
In summary, we allocate variability via canonical variates in a pair of variable sets,
$X_{1\{case\}}$ and $X_{2\{case\}}$. The canonical correlation procedure can be seen as apportioning that
variability to canonical variates, and we utilize those corresponding to largest canonical
correlation. Each canonical variate encapsulates a portion of variability in the measured
data in one set represented in the other set. Furthermore, that variability can be quantified.
The idea of allocation of variance helps to understand the connection between case-only
approach and CASI we presented. It enables us to quantify statistical variability in the data
attributable to joint processes of two sets of variables.
5. Simulations
We establish the form of the null and alternative hypotheses for this simulation formally:
$H_0: \rho_{case} = \rho_{control}$
$H_A: \rho_{case} \neq \rho_{control}$
where $\rho_m$ ($m \in \{case, control\}$) is the correlation structure between two sets of
interacting variables within the respective phenotype class, m. This can be embodied in a
correlation matrix and understood as the intersection of linear column spaces spanned by
the two sets. The common space defines statistical variance explained in one set by
another, estimated through decomposition of said matrix.
A few points should be made about the interacting variables, because their ranges affect
the statistical properties of measures based on them. The case-only approach was developed
and employed using the odds ratio measure with categorical predictor variables, and needs
to be extended to discrete and continuous variables, as we are doing. For our intended
application we have SNP variables with domain {0, 1, 2}, coded in reference to minor allele count.
As the above description and interpretation of canonical correlation explains, the CASI
method measures a difference in correlation structure between two class levels of a
phenotype, and by that specific definition we will demonstrate that the test is consistent,
show unbiasedness, and characterize its power. All assertions are proven through
simulation. The CASI statistics are evaluated for statistical significance by computing FDR
using permutation-based estimators proposed by Millstein and Volfson (Millstein and
Volfson, 2013), which yields confidence intervals that account for the number of
permutations conducted as well as dependencies between tests. Significance is assessed
directly using CASI statistics, thus no intermediate p-values are computed.
5.1. Simulation to Evaluate CASI as a Method to Detect Interaction
Realized as a Contrast Between Set Correlations Across Cases
and Controls
As the description of the CASI method makes clear, the statistic measures a difference in
correlation structure between two class levels of a dichotomous phenotype, case and
control. We demonstrate that data arising under such an assumption, two
populations with divergent correlation structures, are indicative of interaction with respect
to a phenotype. Such interaction is detectable by the CASI statistic, as we will demonstrate
by simulation.
We propose a test for identifying interaction between two sets of genomic factors. The aim
is to determine whether our statistic can detect contrast in correlations between two
groups of genomic variables in cases and controls, rather than only between one pair of
factors. The intention is that our multivariate permutation-based statistic is capable of
measuring differences between blocks of pairwise correlations in cases and controls. The
contrast being assessed is based on two groups of genomic variables, for example all
genotyped variants within a gene and methylation measures in the vicinity, or two sets of
SNPs from different genes.
In this simulation the correlations between variables in the sets are formed into blocks
within a correlation matrix. The blocks need not be the same size but are simulated as such.
The idea that interaction can be characterized through a contrast in blocks of correlation,
as the difference in correlation matrices across cases and controls, is not unprecedented
but has only been attempted twice, and serves as a source of guidance and to bolster the
validity of the simulation approach. Rajapakse et al. (2012) and Li et al. (2009) employed
covariance matrices in designing their methods, attempting to address the same problem
as ours, detection of set interaction (Li, et al., 2009; Rajapakse, et al., 2012). In addition the
cogency of this approach is borne out by seeing that if the joint distribution of variables
depends on disease status then they are jointly associated with disease, as described in
Millstein et al. (2006) (Millstein, et al., 2006). Joint distribution can be represented through
a correlation matrix. Consequently, if the correlation matrices of two groups of genomic
variables are different between cases and controls, the groups are interacting in association
with case-control status.
To extend this argument, if the distribution of two groups of genomic factors is the same
when observed separately among cases and controls, neither of these groups by itself is
associated with disease status. However, if at the same time the joint distribution of two
groups of factors varies with disease status, the groups together are linked with the
disease. The idea is that if neither of the groups by itself is associated with disease status,
but their joint distribution does show that they are jointly associated we have
demonstrated that there is an interaction effect of these two groups of genomic factors on
disease. In other words, if correlation differs between cases and controls interaction exists.
This prompts the tack we take with how we set up the simulation. In the underlying model
two different groupings of genomic factors differ between cases and controls, which means
we induce an interaction. Also, the diagonal blocks for the model are constructed for the
simulation so that the correlation within the groupings is the same across case-control
status. This feature of the diagonal blocks represents the fact that the distribution for each group
separately doesn’t vary by disease status; we are testing for interaction only.
The goal of the simulation is to show that the CASI method can be used to detect such
contrasts in correlation, and therefore interaction. For the simulation we start by creating
an underlying correlation structure represented by matrices for the cases and controls. For
each disease status two different sets of genomic variables have a predetermined
correlation structure represented by blocks within a matrix. By construction, elements of
correlation matrix within blocks that represent correlation of the set with itself are
sampled from continuous uniform distribution with a range ∈ [0,1], and are the same for
the case and control. On the other hand, the blocks that contain elements for correlation
between the two different sets are not the same for cases and controls. This is shown in
more detail below using matrix symbols. Those disease status distinguishing elements are
set to 0.1 for the controls and for the cases they range from 0.11 to 0.6. For the range of
case matrices, each paired with a control matrix, 1000 sets of samples are generated in
equal sample sizes for cases and controls. Samples of cases and controls are generated
using the multivariate normal function with the correlation matrix included as the
covariance parameter. This being a permutation-based approach, we produce 100
permutations of the phenotype labels for each sample. The CASI statistics are calculated for
the observed and permuted data, and the permutation-based FDR of Millstein and Volfson
(Millstein and Volfson, 2013) is computed for a range of significance thresholds.
Under this test, if the correlation patterns are different between cases and controls,
we conclude that there is an interaction. A disease-linked interacting factor might be
closely related to others, for example a SNP near more than one genotyped marker in the
region. Therefore, our method examines a set of genetic and epigenetic variables in a
region jointly and can possibly offer greater statistical power than the classical method of
investigating 2-way interactions in univariate logistic regression models.
We attempt to simulate a simplified version of biological data. Let us assume that we are
simulating SNPs to simplify the notation, with groups of SNPs acting in concert based on
biological processes. We model this with a four-block correlation matrix. On the diagonal are a
block of correlations randomly assigned to pairs from one gene set ($SNP_{(gene\ i)} \times SNP_{(gene\ i)}$)
and a block ($SNP_{(gene\ j)} \times SNP_{(gene\ j)}$) from the other gene set, with correlations assigned
likewise. On the off-diagonal are two equi-correlated blocks, one the transpose of the other,
($SNP_{(gene\ i)} \times SNP_{(gene\ j)}$). One way of
interpreting this is as a latent factor model: all the SNPs from both genes correlated in a
block are in turn highly correlated with the same unmeasured latent variables,
hypothetically expressed as weighted sums, the canonical variates that underlie the
canonical correlation procedure. This interpretation is not necessary to understand the
($SNP_{(gene\ i)} \times SNP_{(gene\ j)}$) block as representing correlation structure between two sets of
variables. In the simulations each block within the correlation matrix contains 25 elements
(5 SNPs from gene i x 5 SNPs from gene j for example).
We simulate the SNPs for the healthy controls as jointly Gaussian (multivariate normal)
with 0 mean and correlation matrix
$$\Sigma_{control} = \begin{pmatrix} G_i \times G_i & G_{i(control)} \times G_{j(control)} \\ G_{j(control)} \times G_{i(control)} & G_j \times G_j \end{pmatrix}$$
where $G_i \times G_i$ and $G_j \times G_j$ are 5 × 5 matrices with 1s along the diagonal and $\rho > 0$
randomly assigned from a uniform distribution for all off-diagonal entries. $G_{i(control)} \times G_{j(control)}$
and $G_{j(control)} \times G_{i(control)}$ are 5 × 5 matrices, transposes of each other, with all
elements, $\rho_{control} > 0$, having the same, fixed value (equi-correlated). For the case patients
we again use mean parameters 0, but change the correlation matrix to
$$\Sigma_{case} = \begin{pmatrix} G_i \times G_i & G_{i(case)} \times G_{j(case)} \\ G_{j(case)} \times G_{i(case)} & G_j \times G_j \end{pmatrix}$$
where $G_i \times G_i$ and $G_j \times G_j$ are 5 × 5 matrices with the same element values as for controls.
$G_{i(case)} \times G_{j(case)}$ and $G_{j(case)} \times G_{i(case)}$ are 5 × 5 matrices, transposes of each other, with the
same element values of $\rho_{case} > \rho_{control} > 0$ (equi-correlated).
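The block construction and sampling step can be sketched as below. To keep the matrix positive definite in a short example, the within-gene blocks are assumed to be equi-correlated at a fixed, hypothetical value of 0.5 rather than drawn from a uniform distribution as in the simulation described here; the function name `block_correlation`, the seed, and the chosen between-gene correlations are likewise illustrative.

```python
import numpy as np


def block_correlation(rho_between, rho_within=0.5, size=5):
    """Four-block correlation matrix for two gene sets of `size` SNPs each.

    rho_within is a simplifying assumption (fixed, equi-correlated diagonal blocks).
    """
    within = np.full((size, size), rho_within)
    np.fill_diagonal(within, 1.0)
    between = np.full((size, size), rho_between)
    return np.block([[within, between],
                     [between.T, within]])


rng = np.random.default_rng(42)
sigma_control = block_correlation(0.10)  # rho_control = 0.1 as in the text
sigma_case = block_correlation(0.30)     # one value from the rho_case range

n = 250  # individuals per phenotype class
controls = rng.multivariate_normal(np.zeros(10), sigma_control, size=n)
cases = rng.multivariate_normal(np.zeros(10), sigma_case, size=n)
```

Only the off-diagonal blocks differ between `sigma_case` and `sigma_control`, so each gene set has the same marginal distribution in both classes and any contrast reflects the between-set correlation alone.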
Effect size is expressed as a measure of the “distance” between the two correlation
matrices used for simulating samples. This metric is quantified as the “norm” of the difference
between two matrices. In particular, we use the spectral norm or 2-norm, which is the
largest singular value. If we denote the two matrices as $\Sigma_{case}$ and $\Sigma_{control}$ then the distance
metric, effect size, is $\|\Sigma_{case} - \Sigma_{control}\|$, where $\|\cdot\|$ represents the norm.
Within each class (case and control) and for a range of effect sizes we simulated a sample of
size $n \in \{50, 100, 150, 200, 250\}$ and applied the CASI method. We then estimated the false
discovery rate (FDR) of this method over 1000 trials for each. As we can see in figure 5,
despite considerable volatility for very small effect sizes, CASI performed quite well for
meaningful differences in correlation. To affirm the ability of the test to detect a difference
in correlation structure, for each simulation we set the elements of the $G_{i(case)} \times G_{j(case)}$ and
$G_{i(control)} \times G_{j(control)}$ blocks (along with their transposes) in the correlation matrices to
different values. For each simulation we generated a sample of “case” status individuals
with $\rho_{case} \in \{0.11, 0.12, \ldots, 0.40\}$ and “control” status subjects using $\rho_{control} = 0.1$. The
respective effect sizes, distances between correlation matrices (spectral norms of the
difference between two matrices), are $\in \{0.05, 0.075, 0.10, \ldots, 1.50\}$.
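As a worked check of the effect-size metric, the sketch below computes the spectral norm of the difference between a case and a control correlation matrix that share their diagonal blocks. Because the diagonal blocks cancel, only the equi-correlated 5 × 5 off-diagonal blocks remain, and the norm reduces to 5 × |ρ_case − ρ_control|. The function name `effect_size` is hypothetical.

```python
import numpy as np


def effect_size(rho_case, rho_control=0.10, size=5):
    """Spectral norm of Sigma_case - Sigma_control for shared diagonal blocks."""
    diff = np.zeros((2 * size, 2 * size))
    # Diagonal blocks are identical in cases and controls, so they cancel;
    # only the equi-correlated off-diagonal blocks contribute.
    diff[:size, size:] = rho_case - rho_control
    diff[size:, :size] = rho_case - rho_control
    return float(np.linalg.norm(diff, 2))  # largest singular value


smallest = effect_size(0.11)  # lower end of the rho_case range
largest = effect_size(0.40)   # upper end of the rho_case range
```

Under these assumptions the two calls give approximately 0.05 and 1.50, which agrees with the endpoints of the effect-size range quoted above.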
5.1.1. Statistical Power Analysis
FDR plots in figure 4 are snapshots of the entire range of simulations. For each sample size
we garner from the FDR estimates the number of true positive tests that can be identified,
hence giving us a measure of power of the test. We illustrate with figure 5 plots that for
increasing sample size our test can detect a non-null hypothesis at smaller effect size with
near 100% power. Figure 6 plot shows various effect sizes (vertical axis) and the
corresponding sample sizes (horizontal axis) at which we are able to detect the difference
at near 100% power (consistency). It serves as a summary of figure 5 plots depicting
power analysis, showing the effect size at which nearly 100% power is achieved for each
set of simulations attempted at a particular sample size. The statistical concept of
consistency states that “a consistent test is one for which the power of the test for a
particular alternative hypothesis increases to one as the number of data items increases”
(Espejo, 2004). By this definition, to demonstrate consistency we should fix an effect size
and increase the number of observations until power reaches unity. We reverse this setting
by fixing sample size and increasing effect size until power reaches one. Effectually, this
achieves the same goal, demonstrating that there exists a number of observations and a
corresponding effect size for which we can reject in favor of alternative hypothesis with
close to absolute certainty. We change the perspective but argue the same point, for a large
enough sample size we can achieve statistical power of one for a particular effect size.
Figure 5 plots show, for a range of sample sizes, the effect detectable with power of one. They
show the flip side of the statistical concept of consistency. The jagged power plots for
samples 50, 100, 150, 200, and 250 indicate, as expected, that for larger samples we can
expect to reach power of 100% at smaller effect sizes. The criterion for identifying power is
the proportion of tests identified correctly as true positives at FDR ≤ 0.05. Figure 5 plots
demonstrate that as sample size increases we achieve a statistical power of one in favor of
the alternative hypothesis at smaller effect sizes. Thereby we have demonstrated the effectiveness of
the CASI test in detecting interactions under the assumption of a multivariate normal
distribution. However, we also argue that it is robust to deviation from such an assumption.
CASI will be effective, able to detect interaction, under the varying conditions in which it will
typically be applied, given the strong sensitivity of the test to small effect sizes, provided it is
applied to studies with large samples.
Figure 4. FDR plots are snapshots from the entire range of simulations. The green values shown
horizontally indicate number of CASI tests judged significant for each threshold on the horizontal.
For sample size 250 we can glean from a series of FDR estimates the number of true positive tests
that can be identified, hence giving us a measure of power of the test. The same can be shown for
other sample and effect sizes. The top plot shows small effect size and with the corresponding
power quite small given that only approximately 9 or 10 of the tests are identified correctly as
significant at a traditional level FDR of 0.05. However, as the effect size increases, as shown in 2nd
and 3rd figures from the top nearly or exactly 100% of the true positive tests are correctly
identified. All tests to which we apply CASI are true positive tests because they are comparing
samples from two distinct populations with different correlation structures.
Figure 5. Power plots show power vs. effect size at sample sizes 50, 100, 150, 200, and
250. Sample size designations in the figures refer to the number of individuals in each
phenotype class (case and control), with equal amount in each. Since it’s a permutation-based
approach we utilize FDR of 0.05 as a significance threshold for calculating power. This is done in
lieu of using a threshold that corresponds to type 1 error which would typically be employed in a
parametric setting.
Figure 6. The plot depicts consistency of the CASI method by summarizing power results
from the other plots, showing effect sizes and corresponding sample sizes for which
100% power is achieved by the CASI test.
5.2. Simulation to Evaluate CASI as a Method for Detecting
Multiplicative Interaction
In the following we will substantiate our claim that this test in fact detects multiplicative
interaction. We attempt to address the problems that arise out of analyzing large-scale
studies that involve many simultaneous hypothesis tests. For example, SNPs measured by
microarrays present this question. In the simulation we use false discovery rate (FDR)
methods to carry out power and sample size calculations for this type of problem. A
permutation based FDR approach proposed by Millstein and Volfson (Millstein and Volfson,
2013) allows the FDR analysis to be applied without modeling assumptions. Accuracy of
FDR estimates is assessed via confidence intervals that account for the number of
permutations conducted as well as dependencies between tests; no such dependencies are
assumed in the simulation. The simulations rely on a logistic regression model to evaluate our
methodology’s ability to detect multiplicative interaction. Power is also evaluated using
FDR, by quantifying the likelihood that non-null cases appear or fail to appear
on a list of significant discoveries.
To understand how this simulation works and the resultant power analysis, it
helps to first review the properties of FDR and strategies for interpreting it. We are faced
with many hypotheses that we are meant to consider simultaneously, and hence a large number of test
statistics.
Null hypotheses: $H_1, H_2, \ldots, H_N$
Test statistics: $CASI_1, CASI_2, \ldots, CASI_N$
In our setting the null hypotheses state that there is no relationship between interacting
variables and disease state. The $CASI_i$ are simulated independently. A simple two-category
model underlies the simulation: we assume that the N instances are divided into two
categories, null or non-null, occurring with probabilities $p_0$ or $p_1 = 1 - p_0$. For each
category, null and non-null, we design a Bernoulli probability mass function to
stochastically generate a disease status for individuals in the simulation. The mean for this
function is modeled through the logit link with a linear combination of predictors that differs
depending upon whether it is null or non-null.
Thus, we have a null gene pair defined as one in which the disease state of an individual is a result
of an additive effect of two sets of SNPs. For brevity we compress the notation for gene i
and gene j to $G_i$ and $G_j$, respectively.
Probability of disease in the null gene pair model:
$$f_0 = \frac{\exp\left(\beta_0 + \sum \beta_{SNP(gene\ i)} \times SNP(gene\ i) + \sum \beta_{SNP(gene\ j)} \times SNP(gene\ j)\right)}{1 + \exp\left(\beta_0 + \sum \beta_{SNP(gene\ i)} \times SNP(gene\ i) + \sum \beta_{SNP(gene\ j)} \times SNP(gene\ j)\right)}$$
The non-null gene pair is defined as one in which the disease state of an individual is the result
of interaction between two sets of SNPs and is generated from a logistic model as follows.
Probability of disease in the non-null gene pair model:
$$f_1 = \frac{\exp\left(\beta_0 + \sum \beta_{SNP(gene\ i)} \times SNP(gene\ i) + \sum \beta_{SNP(gene\ j)} \times SNP(gene\ j) + \sum \beta_{Intrctn} \times SNP(gene\ i) \times SNP(gene\ j)\right)}{1 + \exp\left(\beta_0 + \sum \beta_{SNP(gene\ i)} \times SNP(gene\ i) + \sum \beta_{SNP(gene\ j)} \times SNP(gene\ j) + \sum \beta_{Intrctn} \times SNP(gene\ i) \times SNP(gene\ j)\right)}$$
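The two logistic models can be sketched in Python as follows. This is an illustration under stated assumptions, not the simulation code used here: the coefficient values are hypothetical, SNPs are drawn independently and uniformly from {0, 1, 2} as a placeholder, the interaction sum is taken over all SNP pairs across the two genes, and no attempt is made to calibrate prevalence or to resample equal numbers of cases and controls.

```python
import numpy as np


def disease_probability(snp_i, snp_j, beta0, beta_i, beta_j, beta_inter=None):
    """Logistic disease probability; beta_inter=None gives the null (main-effects) model."""
    eta = beta0 + snp_i @ beta_i + snp_j @ beta_j
    if beta_inter is not None:
        # Add sum over k, l of beta_inter[k, l] * snp_i[:, k] * snp_j[:, l]
        eta = eta + np.einsum("nk,kl,nl->n", snp_i, beta_inter, snp_j)
    return 1.0 / (1.0 + np.exp(-eta))


rng = np.random.default_rng(7)
n, m = 1000, 5                                         # individuals, SNPs per gene
snp_i = rng.integers(0, 3, size=(n, m)).astype(float)  # minor allele counts 0, 1, 2
snp_j = rng.integers(0, 3, size=(n, m)).astype(float)
beta0 = -4.0                                           # hypothetical intercept
beta_i = np.full(m, 0.1)                               # hypothetical main effects
beta_j = np.full(m, 0.1)
beta_inter = np.full((m, m), 0.05)                     # hypothetical interaction effects

f0 = disease_probability(snp_i, snp_j, beta0, beta_i, beta_j)
f1 = disease_probability(snp_i, snp_j, beta0, beta_i, beta_j, beta_inter)
status_null = rng.binomial(1, f0)     # Bernoulli disease status, null gene pair
status_nonnull = rng.binomial(1, f1)  # Bernoulli disease status, non-null gene pair
```

Setting every entry of `beta_inter` to zero recovers the null model, which makes explicit that the two models differ only in the product terms.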
We argue that it’s natural to take $f_m$ ($m \in \{0, 1\}$) to be a logistic model given that it’s a well-established
means of estimating multiplicative interaction in studies with a dichotomous
phenotype. The theoretical null, $f_0$, in our setting is a linear model without multiplicative
interaction, only main effects. The theoretical non-null model, on the other hand, includes
product terms, multiplicative interactions. In the simulation, through repeated sampling we
create a population of samples. Each is generated as an outcome of the null, $f_0$, or non-null, $f_1$,
model, and each sample includes cases and controls in equal numbers. In a theoretical and
computational sense these samples are taken from an underlying distribution, a panoply of
gene pairs that exists in a population with a low disease prevalence of 0.05. The universe of
populations from which samples are taken contains either null gene pairs, those that affect
disease through main effects only, or non-null gene pairs, those whose association with
disease involves multiplicative interaction. The sampling occurs in $p_0$ vs. $p_1$ proportions,
respectively, the a priori probabilities, which creates a mixture of null and non-null samples in
that ratio. Once the samples are generated the $CASI_i$ test statistics are calculated. The
goal is to be able to distinguish the test statistics based on non-null samples as more
extreme valued compared to null samples. Specifying the probability density functions
precisely for null and non-null test statistics is not necessary, which suits our
procedure since it is permutation based. However, we do assume that the probability
function for non-null is related to null, with larger contrast in correlations expected to be
observed in the former.
FDR estimation often relies on knowledge about the relative distributions of the statistic
under the null vs. non-null hypothesis. However, it’s possible to conduct inference if the
densities are not specified exactly. This is crucial because our statistics are permutation
based, since we do away with modeling in employing the CASI procedure. While we don’t
have density functions specified we do adopt the view that non-null $CASI_i$ are related to
those calculated from null, being larger, farther from the permutation null in comparison,
at least if multiplicative interaction can indeed be detected by our method. Even though the
distribution of CASI is not characterized, no model being assumed in our setting, permutation-based
statistics have probabilistic properties. Therefore, to lay out our argument we
imagine hypothetical probability distribution functions that null and non-null $CASI_i$ follow,
denoted $g_0$ and $g_1$, respectively. To reiterate, those distributions are not specified precisely
but can be assumed to exist for illustration.
FDR calculation relies on proportions, $p_0$ and $p_1 = 1 - p_0$, the a priori probabilities of a gene
pair being null and non-null, believed known here. In practical applications of large-scale
testing it is safe to assume that $p_0$ is large, ≥ 0.9 (Efron, 2004; Langaas, Lindqvist and
Ferkingstad, 2005). This is sensible because the goal of such studies is to identify a
relatively small set of non-null occurrences, positive associations, among a very large
number of nulls. FDR calculation is mostly insensitive to the value of $p_0$ once we have
assumed it is ≥ 0.9. If we define a mixture density based on $p_m$ and $g_m$ ($m \in \{0, 1\}$) as
$g(CASI) = p_0 g_0(CASI) + p_1 g_1(CASI)$, then the posterior probability that a measure we are
testing is null given the observed CASI, or that a set of measures is null given an observed series of
$CASI_i$’s, gives the definition of FDR, $FDR(CASI) = \Pr(\text{null} \mid CASI) = p_0 g_0(CASI)/g(CASI)$.
It should be evident that $CASI_i$’s further from the null are less likely to be false discoveries,
which is the valuable point communicated by a small $FDR(CASI_i)$: the set of $CASI_{i'} \geq CASI_i$
is likely enriched for true positives.
The literature has not come to consensus on a standard choice of q for FDR when testing,
the equivalent of 0.05 for p-values, but applications to real data might offer some insight.
Simulations and applications suggest a threshold somewhat larger than 0.05, maybe closer
to FDR ≈ 0.20. For an FDR of 0.20, a posterior odds of
$$\frac{\Pr(\text{non-null} \mid CASI)}{\Pr(\text{null} \mid CASI)} = \frac{1 - FDR(CASI)}{FDR(CASI)} = \frac{p_1 g_1(CASI)}{p_0 g_0(CASI)} \approx \frac{0.8}{0.2} = 4$$
gives a factor of
$$\frac{g_1(CASI)}{g_0(CASI)} = \frac{4}{p_1/p_0} = \frac{4}{0.1/0.9} = 36,$$
a probability
ratio which can be interpreted to mean the relative likelihood of observing a statistic from
the non-null distribution compared to null, assuming a small number of non-null (positive
associations for multiplicative interaction) occurrences (10%) in a pool of a large number of
tests.
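The arithmetic above generalizes to any FDR level. The short sketch below, with a hypothetical function name `likelihood_ratio_factor`, converts an FDR value and an assumed null proportion $p_0$ into the factor $g_1(CASI)/g_0(CASI)$ by dividing the posterior odds by the prior odds $p_1/p_0$.

```python
def likelihood_ratio_factor(fdr, p0=0.9):
    """Factor g1/g0 implied by a given FDR level and null proportion p0."""
    p1 = 1.0 - p0
    posterior_odds = (1.0 - fdr) / fdr  # Pr(non-null | CASI) / Pr(null | CASI)
    prior_odds = p1 / p0                # p1 / p0
    return posterior_odds / prior_odds


factor_at_020 = likelihood_ratio_factor(0.20)  # posterior odds 4, prior odds 1/9
factor_at_005 = likelihood_ratio_factor(0.05)  # posterior odds 19, prior odds 1/9
```

With $p_0 = 0.9$ these evaluate to approximately 36 and 171, up to floating-point rounding, the same factors reported for FDR levels of 0.2 and 0.05.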
Figure 7 indicates that FDR thresholds between 0.05 and 0.2 represent very strong levels of
evidence against the null hypothesis. We might justify using FDR in this way at a traditional
level of 0.05 as actually being conservative in guarding against multiple testing fallacies. On
the other hand, increasing the FDR threshold much above 0.20 can bring excessively high
proportions of false discoveries. The 0.05 threshold, used in the simulations as well as
applications, can be interpreted as reflecting a very conservative choice for FDR.
Settling on an optimal threshold from purely statistical considerations could leave
investigators concerned that the list of non-null cases at any FDR omits some of the
possibly important gene pairs. To allow the researcher to employ prior knowledge and
scholarly judgement in interpreting significance and select their own set of gene pairs for
further investigation we convey a full list of FDR(CASIi) values for corresponding gene
pairs, null and non-null. This is particularly imperative for low-powered situations where
we are limited by low sample or small effect size, and unfortunately chance becomes an
issue, exactness in any one resulting set becomes a casualty, and the investigator will have
to use judgement in identifying a reduced list. The FDR interpretation is that a portion of a
subset of gene pairs can be expected to be null, while the rest are genuine non-null
discoveries. This interpretation does not require independence; it is also robust to minor
departures from parametric assumptions, though it does require an exchangeability assumption (Efron
and Tibshirani, 2002). In the simulation independence is assumed, we employ
nonparametric methodology, and exchangeability follows.
The factors, ratios of likelihoods, shown in figure 7, are given to explain and justify the
validity of our approach to evaluating power, calculating power for a range of FDR levels
and summarizing this relationship in plots. From this we learn that even though our
method affords us relatively low power to detect interaction at low FDR, the odds that the
significant subset mostly contains true positives is very high.
Figure 7. Factors, $g_1(\mathrm{CASI})/g_0(\mathrm{CASI})$, on the y-axis representing relative likelihood that the
observed statistic is non-null compared to null. The study is a mixture of tests, arising from
underlying non-null and null relationships with disease. These associations are assumed to have
known rates of occurrence, corresponding to an odds of $p_1/p_0 = 0.1/0.9$, where $p_1$ and $p_0$ are
for non-null and null, respectively. The plot shows factors, relative likelihoods, ratios of probability
or density functions for null and alternative hypotheses, $g_1(\mathrm{CASI})/g_0(\mathrm{CASI})$, for a range of FDR
values. It is evident from the plot that the factors can be quite high for a relatively modest FDR value
such as 0.2, which has a factor of 36, and extremely high for a low FDR of 0.05, which has a factor of 171.
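The two factors quoted in the caption are consistent with Bayes' rule in the two-class model: the posterior odds of non-null, (1 - FDR)/FDR, equal the prior odds $p_1/p_0$ multiplied by the factor. The short Python sketch below checks that arithmetic. It assumes FDR is read as the posterior probability of null for the selected set, and the function name `likelihood_factor` is ours, introduced only for illustration.

```python
def likelihood_factor(fdr, p1=0.1, p0=0.9):
    """Factor g1/g0 implied by an FDR level and the prior proportions.

    p1 and p0 are the assumed proportions of non-null and null tests
    (0.1 and 0.9 in the simulation described in the text).
    """
    posterior_odds = (1.0 - fdr) / fdr  # odds that a selected statistic is non-null
    prior_odds = p1 / p0                # odds of non-null before seeing the statistic
    return posterior_odds / prior_odds


for fdr in (0.2, 0.05):
    print(fdr, round(likelihood_factor(fdr)))
```

Under these assumptions the sketch returns approximately 36 for an FDR of 0.2 and approximately 171 for an FDR of 0.05, matching the caption.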
5.2.1. Statistical Power Analysis
There is an abundance of literature in which the analysis of statistics from microarrays is
focused on controlling Type I error, the false rejection of genuinely null cases
(Dudoit, van der Laan and Pollard, 2004). However, the simulation and the analyses here
follow an alternative approach that regulates significance by FDR, the probability of
incorrectly identifying statistics as arising from non-null measures among a set. FDR
methods enable us to assess power, the probability of detecting genuinely non-null cases.
This section discusses power diagnostics based on FDR, showing through simulation how our
statistics can be used to identify those genuine non-null cases, but also how a study might
identify only a portion of the disease-associated genes rather than all of them. The
emphasis here is on how the CASI statistic can be used to capture interacting sets of
genomic variables.
The hypothetical non-null density $g_1(\mathrm{CASI})$ in the two-class model plays an important role
in assessing power. Power diagnostics are obtained from the comparison of $g_1(\mathrm{CASI})$ with
FDR(CASI). This is understood to mean that the likelihood of capturing non-null
observations relies heavily on the probability that they arise from $g_1(\mathrm{CASI})$ while
controlling the type 1 error rate through FDR(CASI). A visual demonstration of this is
presented for simulation results in figures 8-11 for a range of FDR values. A moderate value
of FDR, perhaps 0.20, suggests good power, with a non-null gene very likely to show up on
a list of interesting candidates for further study. Because we generate a mixture of null and
non-null CASI statistics, power assessment in the simulation is based on the
empirical non-null probability estimates, that is, the non-null counts.
The non-null counts visualized in the power plots allow an intuitive interpretation of how
power is estimated. Suppose that all observed CASI values are placed into K sets of sizes
defined by thresholds of varying stringency with corresponding FDRs. Higher stringency
corresponds to lower FDR values. We can define notation as follows,
$X_k$ = boundary value of the $k$th threshold for $k = 1, 2, \ldots, K$
$Y_k$ = count of $\mathrm{CASI}_i$ above the $k$th significance threshold value.
Since Prob{gene pair $i$ non-null | $\mathrm{CASI}_i \geq \mathrm{CASI}$} = 1 − FDR(CASI), an approximate estimate of
the non-null counts beyond the $k$th significance threshold is [1 − FDR(CASI)] × $Y_k$. In the
empirical approach this is equivalent to counting the non-null gene pairs beyond the $k$th
significance threshold. The proportion of all non-null gene pairs, those exhibiting
interaction, identified from the entire set is an estimate of power.
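The count-based estimate of power described above can be written out in a few lines. The following Python sketch is only an illustration: the function name `estimated_power` and the input numbers are hypothetical, and the total number of non-null gene pairs is assumed to be known, as it is in a simulation.

```python
def estimated_power(y_k, fdr_k, n_nonnull):
    """Approximate power at the k-th significance threshold.

    y_k       -- count of CASI statistics above the threshold
    fdr_k     -- FDR associated with that threshold
    n_nonnull -- total number of non-null gene pairs (known in a simulation)
    """
    nonnull_count = (1.0 - fdr_k) * y_k  # expected non-null pairs among those selected
    return nonnull_count / n_nonnull


# Hypothetical inputs: 20 gene pairs exceed a threshold whose FDR is 0.05,
# out of 100 non-null pairs in the whole collection.
print(estimated_power(20, 0.05, 100))
```

With these made-up inputs about 19 of the 20 selected pairs are expected to be non-null, giving a power estimate of roughly 0.19.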
To assess type 1 error rate control, statistical power, and unbiasedness properties of our
method to detect multiplicative interaction we conducted a simulation study in a logistic
regression framework. In traditional power analysis a non-null model is used to generate a
sample repeatedly, a statistic is calculated each time and compared to a threshold
corresponding to the designated type 1 error rate. Empirical statistical power is calculated
as the proportion of tests that reject the null in favor of alternative hypothesis.
Another, equivalent approach, more suitable when FDR is used to conceptualize the type 1
error rate and to evaluate power under multiple comparisons, is to generate a sample
repeatedly from a mixture of null and non-null models and then calculate empirical
statistical power as the proportion of instances of the statistic arising from the non-null
model that are captured at a given FDR level. Since the method relies on permutation for
calculating a statistic, we do not have a known reference null distribution with which to
determine significance for calculating power. Rather, we estimate FDR based on the empirical
distribution of observed values of the statistic from the multitude of tests by comparing
them to realizations from permuted data. This is akin to generating values from a mixture
distribution of two kinds of logistic models. We decided to have 10% of models in the
mixture non-null, following the 2007 paper by Brad Efron, “Size, Power and False Discovery
Rates”, which used the same proportion in its simulation. The version of FDR in that paper
is parametric, which contrasts with ours, since our method relies on permutation for
calculating a statistic.
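To make the idea of comparing observed statistics with realizations from permuted data concrete, the Python sketch below implements a generic plug-in tail-area FDR estimate. It is not the estimator used in this work, which also supplies confidence limits; the function name `permutation_fdr`, the convention of returning 0 when nothing exceeds the threshold, and the toy numbers are all assumptions made for illustration.

```python
def permutation_fdr(observed, permuted_sets, threshold):
    """Plug-in FDR estimate at a given threshold.

    observed      -- statistics computed on the real data, one per test
    permuted_sets -- list of lists; each inner list holds the same statistics
                     recomputed on one permuted copy of the data
    threshold     -- statistics at or above this value are called significant
    """
    n_observed = sum(1 for s in observed if s >= threshold)
    if n_observed == 0:
        return 0.0  # nothing is called significant (convention chosen here)
    # Average number of permuted (null) statistics reaching the threshold.
    null_counts = [sum(1 for s in perm if s >= threshold) for perm in permuted_sets]
    mean_null = sum(null_counts) / len(permuted_sets)
    return min(1.0, mean_null / n_observed)


# Toy example: two statistics exceed 10 in the observed data, and on average
# half a statistic does so per permutation.
obs = [1, 2, 3, 10, 12]
perms = [[1, 2, 3, 4, 11], [0, 1, 2, 3, 4]]
print(permutation_fdr(obs, perms, 10))
```

In this toy example the estimate is 0.5/2 = 0.25: one quarter of the statistics called significant would be expected to be null.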
When conducting simulations, we are interested in examining power for various “effect
sizes”. Logistic regression is specified via the coefficients, $\beta_0, \beta_X$; the first element is the
intercept (disease background prevalence) and the second consists of a vector of log odds
ratio parameters for the covariates (for example genomic variables such as SNPs). The
overall outcome prevalence in the population of interest is fixed in the simulation at 0.05.
Modifying any given element of $\beta_X$ will automatically modify the overall prevalence, unless
there is a corresponding change in $\beta_0$. In the simulation we adjust the value of $\beta_0$ so that it
minimizes the difference between the target outcome prevalence, 0.05, and the prevalence
induced by the model in conjunction with the assumed marginal exposure and interaction
values. In effect we are generating a population of individuals who carry a gene pair that
affects disease through the specified logistic model, holding prevalence invariant, and then
taking a sample from that population.
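The intercept adjustment described above can be sketched as a one-dimensional root search, since the induced prevalence increases monotonically with the intercept. The Python example below uses bisection on a simulated population. It is a simplified illustration, not the simulation code of this study: it uses a single SNP per gene with a MAF of 0.27 rather than sets of SNPs, and the helper names (`expit`, `calibrate_intercept`, `genotype`) are ours.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def calibrate_intercept(linear_predictors, target=0.05, lo=-20.0, hi=20.0, tol=1e-10):
    """Find the intercept for which the mean disease probability equals `target`.

    linear_predictors -- per-individual sums of coefficient * covariate,
                         excluding the intercept
    """
    def prevalence(b0):
        return sum(expit(b0 + eta) for eta in linear_predictors) / len(linear_predictors)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prevalence(mid) > target:
            hi = mid  # intercept too large, prevalence overshoots
        else:
            lo = mid
    return (lo + hi) / 2.0


random.seed(1)


def genotype(maf):
    """Number of minor alleles from two independent allele draws (assumption)."""
    return sum(random.random() < maf for _ in range(2))


b_snp, b_int = math.log(1.025), math.log(1.2)  # main-effect and interaction log ORs
etas = []
for _ in range(5000):
    gi, gj = genotype(0.27), genotype(0.27)
    etas.append(b_snp * gi + b_snp * gj + b_int * gi * gj)

b0 = calibrate_intercept(etas)
print(round(sum(expit(b0 + eta) for eta in etas) / len(etas), 4))
```

Any change to the main-effect or interaction coefficients changes the linear predictors, so the intercept has to be recalibrated each time to hold the prevalence at the target.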
The most important message conveyed by the figures showing statistical power is that at low
FDR, where it is most crucial, CASI captures enough true positives to exhibit moderate but
sufficient power, whereas traditional logistic regression fails to make any discoveries at
small and moderately sized FDR. The superiority of CASI over other set-based methods is
also apparent at low FDR. For example, at a sample size of 2000, the smallest sample size at
which interaction is detectable for the simulated SNP data, CASI performs best among
set-based methods and its superiority over conventional regression is also apparent. As stated
above, we simulate a collection of genes that is a mixture of null and non-null gene pairs.
A non-null gene pair is defined as one in which the disease state of an individual is the
result of interaction between two sets of SNPs and is generated from a logistic model,
whereas a null gene pair includes a linear combination of main effects only.
The power curves from simulations suggest good power characteristics for set based
methods, CASI in particular, compared to conventional logistic regression for detecting
interaction. While power may seem too low these realizations represent what can be
expected in real studies, and as explained above are actually quite adequate. Most of the
non-null genes will not turn up on a list that corresponds to low FDR but those that do will
be non-null with high likelihood.
Figure 8. Comparison of power characteristics of set based methods and conventional logistic
regression for sample sizes 2000 and 2500 shows preeminence of CASI and other set-based
methods over logistic regression. The performance of set based methods is illustrated in the top
two figures showing side by side that increasing sample size improves power for some approaches.
CASI achieves the best outcome, nearing a power of 0.2 for sample size of 2000 and 0.3 for sample
size of 2500 at FDR of 0.05. The top ten version of CASI (CASI_top10) and CLD are the next best
alternatives to CASI. Peng et al.’s CCA (CCA (Peng)) approach appears to be the worst in identifying
interaction, with the lowest power that is much less than 0.1 at FDR of 0.05.
Figure 9. Comparison of power characteristics of set based methods and conventional logistic
regression for sample sizes 2000 and 3000 shows preeminence of CASI and other set-based
methods over logistic regression. The performance of set based methods is illustrated in the top
two figures showing side by side that increasing sample size improves power for some approaches.
CASI achieves the best outcome, nearing a power of 0.2 for sample size of 2000 and 0.5 for sample
size of 3000 at FDR of 0.05. The top ten version of CASI (CASI_top10) and CLD are the next best
alternatives to CASI for both sample sizes. Peng et al.’s CCA (CCA (Peng)) method appears to be the
worst in identifying interaction, with the lowest power that is much less than 0.1 at FDR of 0.05.
Figure 10. Comparison of power characteristics of set based methods and conventional logistic
regression for sample sizes 3000 and 3500 shows preeminence of CASI and other set-based
methods over logistic regression. The performance of set based methods is illustrated in the top
two figures showing side by side that increasing sample size improves power for some approaches.
CASI achieves the best outcome, nearing a power of 0.5 for sample sizes of 3000 and 3500 at FDR of
0.05. The top ten version of CASI (CASI_top10) and CLD are the next best alternatives to CASI for
both sample sizes. Peng et al.’s CCA (CCA (Peng)) method appears to be the worst in identifying
interaction, with the lowest power that is much less than 0.1 at FDR of 0.05.
Figure 11. Comparison of power characteristics of set based methods and conventional logistic
regression for sample sizes 3500 and 4000 shows preeminence of CASI and other set-based
methods over logistic regression. Performance of set based methods is illustrated in the top two
figures showing side by side that increasing sample size improves power for some approaches. CASI
achieves the best outcome, reaching a power of 0.5 for sample size of 3500 and more than 0.6 for
sample size of 4000 at FDR of 0.05. The top ten version of CASI (CASI_top10) and CLD are the next
best alternatives to CASI for both sample sizes. Peng et al.’s CCA (CCA (Peng)) method appears to be
the worst in identifying interaction, with the lowest power that is much less than 0.1 at FDR of 0.05.
Using simulation to test the ability of the CASI method to detect multiplicative interaction,
we demonstrated its superiority across a range of sample sizes. It performed better than
other set-based methods at low FDR (0.05) and was clearly superior to logistic
regression. For comparison we used three other set-based methods in addition to CASI.
Additional approaches exist in the literature that test for interaction by aggregating
variables into sets. They include types that use very distinct processes, such as
information-theoretic or entropy-based tactics for modelling genetic interactions (Chanda,
et al., 2007; Dong, et al., 2008; Kang, et al., 2008; Moore, et al., 2006). Those we chose for
comparison resemble ours in that they involve a difference in correlation between
interacting variables across cases and controls. We considered including a kernel-based
version of the Peng et al. proposal, but it proved too computationally burdensome.
The comparison of set-based methods includes one designed in a way very similar to CASI,
the difference being that it chooses the largest difference in correlation between cases and
controls out of the top 10 canonical correlations, designated “CASI_top10” in the
figures. Compared to CASI it underperforms in terms of power in the low FDR region,
where it is most crucial, for most sample sizes shown, except 3500 where it appears to
match CASI. Also included for comparison is a procedure designed by Peng et al. (2010),
labeled “CCA (Peng)” in the figures, a canonical correlation analysis (CCA) based method
applicable to case-control studies akin to ours. The CCA statistic is the difference in
Fisher-transformed canonical correlations between two genes in cases vs. controls divided
by the square root of the variance, where the variance is estimated using the bootstrap. CCA
appears to falter across the entire range of FDR levels and sample sizes compared to other
set-based methods, especially CASI. For example, at the lowest shown sample size of 2000 it
achieves power much less than 0.1 at an FDR of 0.05. This gives support to the notion that in
designing a set interaction method like ours the focus should fall on the subset of cases for
estimation, as is done in CASI. This echoes the case-only approach to detecting interaction
and stands in distinct contrast to Peng et al.’s method, which also estimates canonical
correlation within controls (Peng, et al., 2010). Another approach chosen for comparison,
proposed by Rajapakse et al. (2012), looks at a contrast in correlation across cases and
controls using a quadratic distance-based method, a metric that quantifies distance between
covariance matrices. Referred to as “CLD” in the figures, it has inferior power
characteristics compared to CASI consistently across all sample sizes at low FDR and only
appears to gain power at unacceptably high FDR at a few sample sizes.
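For reference, the verbal description of the CCA (Peng) statistic given above can be written compactly as follows. The notation is ours and is offered only as a reading of that description: $r_{\text{cases}}$ and $r_{\text{controls}}$ denote the canonical correlations between the two genes within cases and within controls, and we assume that the bootstrap variance refers to the difference of the transformed correlations.

```latex
z(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r},
\qquad
T_{\mathrm{CCA}} =
\frac{z(r_{\text{cases}}) - z(r_{\text{controls}})}
     {\sqrt{\widehat{\mathrm{Var}}_{\text{boot}}\!\left[\, z(r_{\text{cases}}) - z(r_{\text{controls}}) \,\right]}}
```

Because both correlations enter the statistic, estimation error from the controls contributes to its denominator, whereas CASI concentrates estimation on the cases.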
It is worth reiterating that the simulation is designed to test multiplicative interaction
through the underlying logistic regression model:
Null gene pair is defined as one in which disease state of individual is a result of an additive
effect of two sets of SNPs with probability of disease given by the following model:
$$f_0 = \frac{\exp\left(\beta_0 + \sum \beta_{\mathrm{SNP}(G_i)}\,\mathrm{SNP}(G_i) + \sum \beta_{\mathrm{SNP}(G_j)}\,\mathrm{SNP}(G_j)\right)}{1 + \exp\left(\beta_0 + \sum \beta_{\mathrm{SNP}(G_i)}\,\mathrm{SNP}(G_i) + \sum \beta_{\mathrm{SNP}(G_j)}\,\mathrm{SNP}(G_j)\right)}$$
Non-null gene pair can be viewed as an extension of null relationship between a gene pair
and disease, where the association involves multiplicative interaction in addition to the
additive effects, with probability of disease given as the following model:
$$f_1 = \frac{\exp\left(\beta_0 + \sum \beta_{\mathrm{SNP}(G_i)}\,\mathrm{SNP}(G_i) + \sum \beta_{\mathrm{SNP}(G_j)}\,\mathrm{SNP}(G_j) + \sum \beta_{\mathrm{Intrctn}}\,\mathrm{SNP}(G_i)\,\mathrm{SNP}(G_j)\right)}{1 + \exp\left(\beta_0 + \sum \beta_{\mathrm{SNP}(G_i)}\,\mathrm{SNP}(G_i) + \sum \beta_{\mathrm{SNP}(G_j)}\,\mathrm{SNP}(G_j) + \sum \beta_{\mathrm{Intrctn}}\,\mathrm{SNP}(G_i)\,\mathrm{SNP}(G_j)\right)}$$
Here $\beta_{\mathrm{SNP}(G_i)}$ and $\beta_{\mathrm{SNP}(G_j)}$ quantify the additive effects, $\beta_{\mathrm{Intrctn}}$ measures the interactions
between SNPs of the two genes, and $\beta_0$ represents the background level of the disease,
the average overall prevalence without the effects of specific genes. In the simulations we take
the main effects to be $\exp(\beta_{\mathrm{SNP}}) = 1.025$ in all models, so that the effect of each non-null
disease model is weighted heavily toward one parameter, $\beta_{\mathrm{Intrctn}}$. Note that
$\exp(\beta_{\mathrm{SNP}}) = 1.025$ corresponds to weak main effects. We show results for sample sizes
where the number of cases and controls is equal. The assumed overall disease prevalence
is 5%, a moderately rare disease. We attempted a higher prevalence of 10% for
the simulation, but it resulted in lower power, likely due to the violation of one of the
conditions of the case-only approach, that prevalence be very low. This is in addition to the
assumption of independence of interacting variables. We found that for model interaction
ORs lower than 1.2 ($\exp(\beta_{\mathrm{Intrctn}}) < 1.2$) the power of CASI to recognize interaction effects
was lacking. Smaller sample sizes likewise yielded insufficient power.
In practice we simulated SNP data based on minor allele frequency that is randomly chosen
from a uniform distribution ranging 0.05 to 0.49. 100 non-null and 900 null gene pairs
were generated. After application of CASI the distribution of non-null gene pairs in the
resulting ranked set is evaluated as a means of power diagnostic. As we varied interaction
odds ratio we observed the number of non-null genes identified as significant. The larger
the proportion of non-null genes (out of a total of 100) detected for low values of FDR the
higher the power.
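The data-generating step described above can be sketched as follows. This Python fragment is a hedged illustration rather than the simulation code of this study: it assumes that the two alleles of a genotype are drawn independently (Hardy-Weinberg proportions, consistent with the genotype probabilities used later in this section), and the helper names `draw_maf` and `draw_genotypes` are ours.

```python
import random


def draw_maf(rng):
    """Minor allele frequency drawn uniformly between 0.05 and 0.49."""
    return rng.uniform(0.05, 0.49)


def draw_genotypes(maf, n, rng):
    """Additive coding: number of minor alleles (0, 1 or 2) for n individuals.

    Each genotype is the sum of two independent allele draws (assumption).
    """
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]


rng = random.Random(42)

# 100 non-null and 900 null gene pairs, as in the text.
labels = ["non-null"] * 100 + ["null"] * 900
mafs = [draw_maf(rng) for _ in labels]

# Genotypes of one SNP for a hypothetical population of 2000 individuals.
genotypes = draw_genotypes(mafs[0], 2000, rng)
print(len(labels), min(genotypes), max(genotypes))
```

After a method such as CASI ranks the 1000 gene pairs, the labels make it possible to count how many of the 100 non-null pairs fall above a given FDR threshold.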
To understand how power diagnostics relate sample size to power calculations, we might
ask how many additional subjects in a study would substantially improve the detection rate
of non-null gene pairs, that is, increase power. Consider the number of independent
replicates of $\mathrm{CASI}_i$ available for each gene pair from which the test statistic is formed.
As figures 8-11 illustrate, we would expect improvement in power if the sample size
increased by 25%, from 2000 cases and 2000 controls to 2500 cases and 2500 controls. The
improvement is notable, but an increase of 50% from 2000 to 3000 is more dramatic, and we
could expect something similar from such an expansion in the size of studies with real data.
It is important to characterize how effect size is understood in this context, as we are
attempting to detect multiplicative interaction in the underlying logistic model.
Understanding the effect requires formulating how it is defined. The most direct approach
might be to simply look at the coefficient for interaction in the logistic model,
$\beta_{\mathrm{Intrctn}}$ (or $\exp(\beta_{\mathrm{Intrctn}})$), in the linear model
$$\mathrm{logit}(f_1) = \beta_0 + \sum \beta_{\mathrm{SNP}(G_i)}\,\mathrm{SNP}(G_i) + \sum \beta_{\mathrm{SNP}(G_j)}\,\mathrm{SNP}(G_j) + \sum \beta_{\mathrm{Intrctn}}\,\mathrm{SNP}(G_i)\,\mathrm{SNP}(G_j),$$
where $f_1$ represents probability of disease.
However, the ability of our method to detect interaction also depends on the type of data
involved, SNPs in this case. SNPs were coded as additive in reference to the number of
minor alleles. The SNP data is particularly difficult, a discrete, ordinal type restricted to
zeros, ones, and twos. The value of $\mathrm{SNP}(G_i) \times \mathrm{SNP}(G_j)$ is zero for observations in which at
least one of the two SNPs carries no minor allele. We also learned from varying data types
that introducing continuous variables into the simulation, as was the case when we mimicked
methylation level, allowed detection of lower interaction ORs, less than 1.2. We propose
expressing effect size through conditional ORs because effect modification is the most
widely recognized interpretation of interaction. We use the average minor allele frequency
(MAF) to demonstrate plausible conditional ORs.
The multiplicative interaction in the logistic equation of the non-null model was
constructed for simulation to accept SNP pairs from two genes, each with its own MAF. To
illustrate the effect size reflected in the conditional ORs we make use of an assumption that
the average minor allele frequency (MAF) is 0.27, obtained as the expectation of the uniform
random density, unif(0.05, 0.49), which reflects the range of possible MAFs. Then we
propose that effect size be quantified as
$$\exp\left[\left(\log(\mathrm{OR}_{\mathrm{SNP}}) + \log(\mathrm{OR}_{\mathrm{Intrctn}}) \times \mathrm{SNP}_i\right) \times \mathrm{SNP}_j\right]$$
or
$$\exp\left[\left(\log(\mathrm{OR}_{\mathrm{SNP}}) + \log(\mathrm{OR}_{\mathrm{Intrctn}}) \times \mathrm{SNP}_j\right) \times \mathrm{SNP}_i\right].$$
We start by quantifying the probabilities of the SNP values using the average MAF of 0.27,
SNP=2: $0.27^2 = 0.0729$
SNP=1: $2 \times 0.27 \times 0.73 = 0.3942$
SNP=0: $0.73^2 = 0.5329$
Set the identities $\mathrm{SNP}(G_i) = \mathrm{SNP}_i$ and $\mathrm{SNP}(G_j) = \mathrm{SNP}_j$. All the conditional scenarios are
presented below as a construal of multiplicative interaction in the logistic model. We have
two ways of thinking about multiplicative interaction, either as modification of the effect of
$\mathrm{SNP}_i$ by $\mathrm{SNP}_j$ or as modification of the effect of $\mathrm{SNP}_j$ by $\mathrm{SNP}_i$. There are four possible
$\mathrm{SNP}_i \times \mathrm{SNP}_j$ products, {0, 1, 2, 4}, each with its own probability of occurrence and its
own conditional ORs. Again, as an illustration we use the average MAF of 0.27. However,
since the MAF ranges over [0.05, 0.49], many other conditional ORs are possible.
Probability and estimates of conditional ORs as an interpretation of interaction for the five
possible SNP combinations that would result in SNPi* SNPj=0:
P(SNPi*SNPj=0) = P(SNPi=0)*P(SNPj=0)+2*P(SNPi=0)*P(SNPj=1)+2*P(SNPi=0)*P(SNPj =2)
= (0.73*0.73)*(0.73*0.73) +2*(0.73*0.73)*(2*0.27*0.73) +2*(0.73*0.73)*(0.27*0.27) =
0.7818176
Conditional OR estimates appear very modest when at least one of the SNPs carries no
minor allele:
exp[(log(1.025) + log(1.2) × 0) × 0] = 1
exp[(log(1.025) + log(1.2) × 0) × 1] = 1.025
exp[(log(1.025) + log(1.2) × 1) × 0] = 1
exp[(log(1.025) + log(1.2) × 0) × 2] = 1.050625
exp[(log(1.025) + log(1.2) × 2) × 0] = 1
Probability and estimate of conditional OR as an interpretation of interaction for the one
possible SNP combination that would result in SNPi *SNPj =1:
P(SNPi*SNPj =1) = P(SNPi=1)*P(SNPj=1) = (2*0.27*0.73)*(2*0.27*0.73) = 0.1553936
Conditional OR when both SNPs exhibit one minor allele:
exp[(log(1.025) + log(1.2) × 1) × 1] = 1.23
Probability and estimate of conditional OR as an interpretation of interaction for the two
possible SNP combinations that would result in SNPi*SNPj =2:
P(SNPi*SNPj=2) = 2*P(SNPi=1)*P(SNPj=2) = 2*(2*0.27*0.73)*(0.27*0.27) = 0.05747436
Conditional OR when one of the SNPs exhibits one minor allele and the other exhibits two
minor alleles:
exp[(log(1.025) + log(1.2) × 2) × 1] = 1.476
exp[(log(1.025) + log(1.2) × 1) × 2] = 1.5129
Probability and estimate of conditional OR as an interpretation of interaction for the one
possible SNP combination that would result in SNPi*SNPj =4:
P(SNPi*SNPj =4) = P(SNPi=2)*P(SNPj=2) = (0.27*0.27)* (0.27*0.27) = 0.00531441
Conditional OR when both SNPs exhibit two minor alleles:
exp[(log(1.025) + log(1.2) × 2) × 2] = 2.178576
For edification we note that these probabilities add up to one.
P(SNPi*SNPj ∈ {0,1,2,4}) = 0.7818176 + 0.1553936 + 0.05747436 + 0.00531441 = 1
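The probabilities and conditional ORs listed above can be reproduced with a short script. The Python sketch below enumerates the nine genotype combinations under the same illustrative assumptions as the text (an average MAF of 0.27 for both SNPs, a main-effect OR of 1.025 and an interaction OR of 1.2); the names `geno_prob`, `conditional_or` and `prob_by_product` are ours.

```python
import math
from itertools import product

MAF = 0.27
OR_SNP, OR_INT = 1.025, 1.2

# Genotype probabilities for 0, 1 and 2 minor alleles at the average MAF.
geno_prob = {0: (1 - MAF) ** 2, 1: 2 * MAF * (1 - MAF), 2: MAF ** 2}


def conditional_or(modifier, exposure):
    """exp[(log(OR_SNP) + log(OR_Intrctn) * modifier) * exposure].

    `modifier` is the minor allele count of the SNP that modifies the effect,
    `exposure` is the minor allele count of the SNP whose effect is modified.
    """
    return math.exp((math.log(OR_SNP) + math.log(OR_INT) * modifier) * exposure)


# Probability of each value of the product SNP_i * SNP_j.
prob_by_product = {}
for gi, gj in product(range(3), repeat=2):
    key = gi * gj
    prob_by_product[key] = prob_by_product.get(key, 0.0) + geno_prob[gi] * geno_prob[gj]

for key in sorted(prob_by_product):
    print(key, round(prob_by_product[key], 8))
print(round(conditional_or(1, 1), 6), round(conditional_or(2, 2), 6))
```

Replacing `MAF` with other values in the range 0.05 to 0.49, or using different frequencies for the two SNPs, shows how strongly the probability of a non-zero product, and hence of observing a conditional OR above 1.05, depends on allele frequency.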
The most likely events correspond to a product of zero, with conditional ORs of 1, 1.025,
or 1.050625 and a combined probability of occurrence of 0.78, and to a product of one, with
a conditional OR of 1.23 and a probability of 0.16, assuming an average MAF. The power
evaluation figures were generated from a simulation with logistic regression which included
a multiplicative interaction OR of 1.2. This suggests that higher interaction ORs could also
be identified with the CASI approach. Setting the interaction OR lower did not produce
appreciable results for this type of data. This simulation assumed pairs of sets of SNPs
interacting. In the literature, genome-wide association studies of common diseases have
identified many related SNPs that reach highly significant p-values, showing that most ORs
are less than 1.5 and many less than 1.2 (Hodge and Greenberg, 2016). As it relates to
asthma, Ferreira et al. (2011) found evidence suggesting that the 11q13.5 locus was
associated with allergic asthma with an OR of 1.33 ($p = 7 \times 10^{-4}$), and confirmed that SNPs
in the IL6R and LRRC32 genes are associated with asthma risk at an OR of 1.09 in an analysis
that combined multiple studies ($p = 2.4 \times 10^{-8}$).
Therefore, ORs in the range of the conditional and interaction thresholds underlying the
simulation are plausible, as demonstrated by studies of real data (Ferreira, et al., 2011).
The minor allele frequencies of the SNPs in question are between 0.05 and 0.49. In the
simulations we consider a standard scenario investigated in the literature, where the SNPs
that are interacting are assumed to be observed. In fact, we assume that we can observe
blocks of SNPs, including the SNPs that we generated to be causal. The interactions as we
designed them occur among multiple SNPs through the linear models that include
multiplicative interaction terms. We expect some power increase because of the additional
SNPs, which may be offset by some decrease in power because of multiple testing.
6. Application to Real Data Sets
Asthma is the result of complex interaction between epigenetic factors and genetic variants
that confer susceptibility. Studies of the genetics of asthma have previously been conducted
using linkage designs and candidate gene association studies. The association study design
has been extended from specific candidate genes to a genome-wide approach: the genome-
wide association study (GWAS). Early GWAS for asthma discovered a novel associated
locus on chromosome 17q21 encompassing the genes ORMDL3, GSDMB and ZPBP2. None
of these genes would have been selected in a candidate association study based on current
knowledge of the functions of these genes. Nevertheless, this finding has been consistently
replicated in independent populations of European ancestry and in other ethnic groups.
Thus, chromosome 17q21 seems to be a true asthma susceptibility locus. Other genes that
were identified in more than one GWAS are IL33, RAD50, IL1RL1 and HLA-DQB1.
Additional novel susceptibility genes identified in a single study include DENND1B and
IL2RB. Discovering the causal mechanism behind these associations is likely to yield great
insights into the development of asthma. It is likely that further meta-analyses of asthma
GWAS data from existing international consortia will uncover more novel susceptibility
genes and further increase our understanding of this disease (Akhabir and Sandford,
2011).
Little is known about the role of most asthma susceptibility genes during human lung
development. Genetic determinants for normal lung development are not only important
early in life, but also for later lung function. Data provide insight about the role of asthma
susceptibility genes during lung development and suggest common mechanisms
underlying lung morphogenesis and pathogenesis of respiratory diseases. For example,
among the asthma genes identified in genome wide association studies, IL2RB was
differentially expressed during human lung development (Melen, et al., 2011).
6.1. Application of CASI to Children’s Health Study (CHS) Data
6.1.1. Demographics of Study Participants
Details about participants of the Children’s Health Study (CHS) can be found in Gauderman
et al. (2007) and Salam et al. (2005). Broadly,
it began with recruitment of two cohorts of fourth grade students, in 1993 and 1996, from
schools in 12 southern California communities. The initial goal was to learn about respiratory
health and long-term effects of air pollution. Pulmonary-function data were obtained yearly
by trained field technicians, using consistent equipment and testing protocol, and
diagnoses of asthma occurred during the study period. Children in both cohorts were
followed up for 8 years. A questionnaire provided information on race, ethnic origin,
parental income and education, history of doctor-diagnosed asthma, in-utero exposure to
maternal smoking, personal smoking, socioeconomic status, and household exposure to gas
stoves, pets, and environmental tobacco smoke.
Table 1
Characteristics of Participants in the CHS Study

Characteristic                       Asthma Case        Asthma Control      P-Value
                                     N (%): 423 (19.7)  N (%): 1722 (80.3)
                                     n (%)              n (%)
Gender                                                                      5.63E-04
  Male                               258 (61.0)         887 (51.5)
  Female                             165 (39.0)         835 (48.5)
Race/Ethnicity
  White                              231 (54.6)         854 (49.6)
  Asian                              1 (0.2)            3 (0.2)
  Black or African American          0 (0.0)            3 (0.2)
  Other/Multiple/Uncertain           191 (45.2)         862 (50.0)
Town                                                                        1.12E-04
  Alpine                             33 (1.9)           74 (4.3)
  Anaheim                            16 (0.9)           140 (8.1)
  Glendora                           49 (2.8)           225 (13.1)
  Lake Elsinore                      29 (1.7)           73 (4.2)
  Lake Gregory                       32 (1.9)           83 (4.8)
  Long Beach                         17 (1.0)           80 (4.6)
  Mira Loma                          39 (2.3)           193 (11.2)
  Riverside                          34 (2.0)           132 (7.7)
  San Bernardino                     23 (1.3)           55 (3.2)
  San Dimas                          45 (2.6)           158 (9.2)
  Santa Barbara                      43 (2.5)           239 (13.9)
  Santa Maria                        22 (1.3)           80 (4.6)
  Upland                             41 (2.4)           190 (11.0)
Age (at questionnaire completion)                                           5.08E-03
  5-6                                96 (22.7)          389 (22.6)
  6-7                                200 (42.3)         809 (47.0)
  7-8                                113 (26.7)         506 (29.4)
  8-9                                14 (3.3)           18 (1.0)
6.1.2. CASI Analysis
The National Center for Biotechnology Information (NCBI) reference sequence (RefSeq)
database provides a foundation for uniting DNA sequence data (Pruitt, Tatusova and
Maglott, 2007). It is generated to provide reference standards for multiple purposes
ranging from genome annotation to reporting locations of sequence variation. This
database is utilized by PLINK (Purcell, et al., 2007). Our goal was to perform gene-based
interaction association analyses for 87 genes using hg19 coordinates, for which we used
files provided by PLINK to find chromosome, start and end base pair (BP) positions. The BP
ranges were created by combining overlapping isoforms of the same gene to form a single
full-length version of that gene. The associated ranges were extended by 5kb for both
flanking regions.
We abstracted 87 genes from the genome-wide association studies (GWAS) catalog provided
jointly by the National Human Genome Research Institute (NHGRI) and the European
Bioinformatics Institute (EMBL-EBI) (MacArthur, et al., 2017). The information about this
collection of genes is quality controlled, manually curated, and literature-derived. Genes
included in the analysis were reported and/or mapped by investigators as containing or in
the vicinity of SNPs determined to be associated with asthma. The distinctive nature of GWAS
work is that it focuses on gene discovery, with the goal of minimizing the rate of false
positive discoveries (FDR), a consideration more important than controlling the false
negative rate (type 2 error). A p-value < $5 \times 10^{-8}$ is considered the genome-wide
significance threshold for a conventional GWAS (Barsh, et al., 2012), a cutoff that was used
to obtain a subset of 87 genes. This resulted in a total of $\binom{87}{2} = 3741$ gene pairs
that were tested for interaction.
Figure 12. FDR plot for 3741 pairs of genomic regions using Children’s Health Study (CHS) data.
Top 10 significant genomic regions are selected using permutation based FDR method (Millstein
and Volfson, 2013), with significance threshold demarcated by vertical line tethered at 15.85 on the
horizontal axis, which corresponds to selected FDR estimate of 0.049 (vertical axis) and FDR
confidence limits (0.02, 0.12). Values shown horizontally in green specify the number of genomic
regions in each significant subset delimited by significance thresholds on the horizontal axis.
We identified 10 gene pairs as significant at an FDR of 0.049 (Figure 12). Five exhibited
loadings greater than 0.5 and were therefore considered good candidates for further
exploration using a more traditional approach, logistic regression. These were {CDK2,
IL1RL1}, {CDK2, DENND1B}, {DBX1, DENND1B}, {DENND1B, IL2RB}, and {HCG23, LPIN2},
with rankings 1, 2, 7, 9, and 10, respectively. While one pair, {HCG23, LPIN2},
consistently appeared in the top hits, we attempted to further characterize the other four
gene pairs as well. For each we extracted the SNP pairs that exhibited the most significant
interaction. We present these results by isolating the highest-ranking interaction SNP
within each gene and displaying the corresponding p-value and SNP in the paired gene. These
are shown in figures 13 to 22 along with loadings and LD. We looked at SNP interaction
from both sides for each of the gene pairs: we determined which SNPs within a gene
interacted most strongly with SNPs from its paired gene, and then which SNPs in the paired
gene interacted most strongly with SNPs from the first. The other five pairs were determined
significant by the CASI method but did not exhibit loadings greater than 0.5 among their
constituent SNPs, and were not included in the regression.
These were {FLG, IKZF4}, {HLA-DRA, LRRC32}, {IL5RA, PRMT3}, {MGC45800, PRMT3}, and
{LOC284661, TENM3}, ranked 3, 4, 5, 6, and 8, respectively.
6.1.3. Follow-up Analysis Using Logistic Regression
We applied logistic regression with a multiplicative interaction term to pairs of SNPs that
had loadings greater than 0.5, adjusting for site, gender, race, and age. To adjust for
multiple testing in the regression analysis portion of the study we employed a parametric
analog of the FDR approach, one that does rely on p-values (Millstein, Chen and Breton, 2016;
Millstein and Volfson, 2013). Although no result reached significance at the traditional
FDR level of 0.05 after adjustment for multiple testing, one gene pair, HCG23 and LPIN2,
appeared consistently among the top hits, and we present it as an illustration (Table 2).
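For reference, a logistic model with a multiplicative interaction term of the kind described above can be written as follows. The notation is an assumption introduced here for illustration: $G_{1i}$ and $G_{2i}$ denote the minor allele counts (0, 1, 2) of the two SNPs for subject $i$, $\mathbf{Z}_i$ the adjustment covariates, and $Y_i$ asthma case status.

```latex
\operatorname{logit} P(Y_i = 1)
  = \beta_0 + \beta_1 G_{1i} + \beta_2 G_{2i} + \beta_3 \, G_{1i} G_{2i}
    + \boldsymbol{\gamma}^{\top} \mathbf{Z}_i ,
\qquad H_0 : \beta_3 = 0 .
```

Under this parameterization the interaction coefficient reported in the tables corresponds to $\beta_3$, and the log odds ratio per minor allele of the second SNP, at a fixed value of the first, is $\beta_2 + \beta_3 G_{1i}$.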
Table 2
Results of tests for multiplicative interaction of pairs of SNPs using logistic regression. This subset
was selected because it had the smallest q-value. The SNP pairs all come from genes HCG23 and
LPIN2. Many tests had the same interaction p-values due to complete LD and were omitted as
duplicate information and in the interest of brevity.
Columns: SNP from HCG23; SNP from LPIN2; SNP coefficient from HCG23; SNP coefficient from LPIN2; interaction coefficient; interaction p-value; interaction q-value (FDR).
rs16870123 rs1164 0.41 0.15 -0.61 1.54E-03 2.68E-01
rs17202379 rs1164 0.38 0.14 -0.62 1.82E-03 2.68E-01
rs17202358 rs1164 0.4 0.14 -0.61 1.90E-03 2.68E-01
rs17202365 rs1164 0.41 0.14 -0.6 2.03E-03 2.68E-01
rs17202400 rs1164 0.39 0.14 -0.6 2.18E-03 2.68E-01
rs17208741 rs1164 0.37 0.14 -0.6 2.23E-03 2.68E-01
rs16870123 rs585295 0.45 0.06 -0.53 2.33E-03 2.68E-01
rs67512154 rs1164 0.4 0.14 -0.58 2.57E-03 2.68E-01
rs17208615 rs1164 0.38 0.14 -0.59 2.58E-03 2.68E-01
rs17202393 rs1164 0.37 0.14 -0.59 2.73E-03 2.68E-01
rs17423698 rs1164 0.4 0.14 -0.58 2.74E-03 2.68E-01
rs17208622 rs1164 0.4 0.14 -0.58 2.76E-03 2.68E-01
rs17208769 rs585295 0.44 0.06 -0.49 3.03E-03 2.68E-01
rs17202365 rs585295 0.45 0.05 -0.52 3.05E-03 2.68E-01
rs113485731 rs1164 0.38 0.14 -0.58 3.10E-03 2.68E-01
rs16870123 rs642926 0.44 0.05 -0.51 3.13E-03 2.68E-01
rs114516436 rs585295 0.44 0.06 -0.49 3.24E-03 2.68E-01
rs139950250 rs1164 0.4 0.14 -0.56 3.31E-03 2.68E-01
rs1555114 rs585295 0.43 0.06 -0.49 3.41E-03 2.68E-01
rs3817967 rs585295 0.42 0.06 -0.49 3.42E-03 2.68E-01
rs17202358 rs585295 0.43 0.05 -0.52 3.48E-03 2.68E-01
rs17202407 rs585295 0.42 0.06 -0.5 3.49E-03 2.68E-01
rs72843289 rs585295 0.43 0.06 -0.49 3.56E-03 2.68E-01
rs3817974 rs585295 0.42 0.06 -0.48 3.65E-03 2.68E-01
rs17202379 rs585295 0.41 0.05 -0.52 3.69E-03 2.68E-01
rs35194487 rs585295 0.41 0.06 -0.49 3.72E-03 2.68E-01
rs17208601 rs1164 0.37 0.13 -0.57 3.74E-03 2.68E-01
rs17202218 rs585295 0.41 0.06 -0.49 3.92E-03 2.68E-01
rs17202365 rs642926 0.45 0.05 -0.49 4.01E-03 2.68E-01
rs67512154 rs585295 0.43 0.05 -0.5 4.03E-03 2.68E-01
rs17202400 rs585295 0.42 0.05 -0.51 4.08E-03 2.68E-01
rs17423753 rs585295 0.43 0.06 -0.48 4.11E-03 2.68E-01
rs17208622 rs585295 0.43 0.05 -0.5 4.22E-03 2.68E-01
rs12529049 rs1164 0.39 0.13 -0.54 4.47E-03 2.68E-01
rs17423698 rs585295 0.43 0.05 -0.49 4.58E-03 2.68E-01
rs142592308 rs585295 0.4 0.05 -0.48 4.64E-03 2.68E-01
rs17208741 rs585295 0.39 0.04 -0.5 4.68E-03 2.68E-01
rs17202358 rs642926 0.43 0.05 -0.49 4.70E-03 2.68E-01
rs17208769 rs642926 0.43 0.05 -0.47 4.73E-03 2.68E-01
rs17208615 rs585295 0.41 0.05 -0.5 4.81E-03 2.68E-01
rs139950250 rs585295 0.43 0.05 -0.49 4.88E-03 2.68E-01
rs114516436 rs642926 0.43 0.05 -0.46 5.05E-03 2.68E-01
rs17202379 rs642926 0.41 0.04 -0.49 5.05E-03 2.68E-01
rs67512154 rs642926 0.43 0.04 -0.48 5.31E-03 2.68E-01
rs1555114 rs642926 0.42 0.05 -0.46 5.31E-03 2.68E-01
rs17202393 rs585295 0.4 0.04 -0.49 5.34E-03 2.68E-01
rs3817967 rs642926 0.41 0.05 -0.46 5.36E-03 2.68E-01
rs17202400 rs642926 0.41 0.04 -0.49 5.51E-03 2.68E-01
rs72843289 rs642926 0.42 0.05 -0.46 5.53E-03 2.68E-01
rs17208622 rs642926 0.43 0.04 -0.48 5.53E-03 2.68E-01
rs17202407 rs642926 0.41 0.05 -0.46 5.54E-03 2.68E-01
rs113485731 rs585295 0.41 0.04 -0.49 5.59E-03 2.68E-01
rs3817974 rs642926 0.41 0.05 -0.45 5.67E-03 2.68E-01
rs35194487 rs642926 0.4 0.05 -0.46 5.91E-03 2.68E-01
rs17423698 rs642926 0.43 0.04 -0.47 6.01E-03 2.68E-01
rs140928277 rs585295 0.38 0.05 -0.45 6.09E-03 2.68E-01
rs17202218 rs642926 0.4 0.05 -0.46 6.21E-03 2.68E-01
rs17423753 rs642926 0.42 0.05 -0.45 6.32E-03 2.68E-01
rs139950250 rs642926 0.43 0.04 -0.47 6.35E-03 2.68E-01
We selected a subset of logistic regression results from the follow-up analysis that met the
loading criterion of greater than or equal to 0.5. To produce a visual representation of
loadings and pairwise interactions for SNPs observed in the significant gene pairs we
used a modified version of the snp.plotter R package (Luna and Nicodemus, 2007). From the
subset of regression results with nominal interaction p-values less than or equal to 0.05 we
selected some of the stronger interactions. This left us with five pairs of genes, portrayed
side by side in figures 13 to 22. The intention is to depict segments of both genes and show
the loadings for both (red circles). In addition, we provide a picture of pairwise interactions
(blue squares). To select the most interesting interactions we identified SNPs in the subset
that had previously been determined to lie in or near one of the genes in the pair and
matched them to the SNPs in the paired gene that showed the strongest interactions. SNPs in
complete LD were stripped from the set to remove redundancy, leaving only one
representative to show its best interaction with the paired gene. Essentially, we took one
gene in a pair and found the corresponding top interactions in the paired gene. We then did
the same for the other gene in the pair: we selected its SNPs and plotted their loadings and
only their highest-ranking interactions with the first gene. Each pair of genes is therefore
displayed in two diagrams, with one gene in the pair on the left and the other on the
right. The tables below show the FDR multiple testing adjustment applied to the entire set of
pairwise interactions for SNPs that had loadings greater than or equal to 0.5. Since we
knew a priori which five gene pairs we were investigating, it may be more suitable to apply
the FDR adjustment to one gene pair at a time. The headers of tables 3 to 12 show the
FDR for the single corresponding gene pair.
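The selection step described above, keeping for each SNP in one gene only its strongest interaction with a SNP in the paired gene, could be sketched as follows. This is an illustration rather than the code used for the analysis: the function name is hypothetical, and the rows are a handful of CDK2 by IL1RL1 results copied from Tables 3 and 4, not the full result set.

```python
# Each row: (SNP in CDK2, SNP in IL1RL1, interaction p-value), copied from Tables 3 and 4.
rows = [
    ("rs12817765", "rs11676124", 1.10e-02),
    ("rs12817765", "rs17639215", 1.45e-02),
    ("rs1045435", "rs11676124", 1.97e-02),
    ("rs2069398", "rs17639215", 7.15e-02),
]


def best_partner_per_snp(rows):
    """For every SNP in the first gene keep its lowest-p-value partner (hypothetical helper)."""
    best = {}
    for snp_a, snp_b, p_value in rows:
        if snp_a not in best or p_value < best[snp_a][1]:
            best[snp_a] = (snp_b, p_value)
    # Return the retained pairs ordered from strongest to weakest interaction.
    return sorted(((a, b, p) for a, (b, p) in best.items()), key=lambda r: r[2])


top_pairs = best_partner_per_snp(rows)
for snp_a, snp_b, p_value in top_pairs:
    print(snp_a, snp_b, p_value)
```

Swapping the first two fields of each row gives the view from the other gene, which is how each gene pair ends up with two tables.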
Figure 13. SNP(CDK2) by SNP(IL1RL1) interactions.
Table 3
Top 5 SNP(CDK2) by SNP(IL1RL1) interactions. Gene pair
interaction q-value is 0.35.
Columns: SNP from CDK2; SNP from IL1RL1; interaction coefficient; interaction p-value; interaction q-value.
rs12817765 rs11676124 0.57 1.10E-02 3.12E-01
rs1045435 rs11676124 0.47 1.97E-02 3.12E-01
rs2069398 rs17639215 0.63 7.15E-02 3.12E-01
rs3213122 rs17639215 0.58 9.09E-02 3.12E-01
rs11171705 rs17639215 0.60 9.91E-02 3.12E-01
Figure 14. SNP(IL1RL1) by SNP(CDK2) interactions.
Table 4
Top 10 SNP(IL1RL1) by SNP(CDK2) interactions. Gene pair
interaction q-value is 0.35.
Columns: SNP from CDK2; SNP from IL1RL1; interaction coefficient; interaction p-value; interaction q-value.
rs12817765 rs11676124 0.57 1.10E-02 3.12E-01
rs12817765 rs17639215 0.75 1.45E-02 3.12E-01
rs12817765 rs985523 0.71 2.00E-02 3.12E-01
rs12817765 rs76362690 0.71 2.08E-02 3.12E-01
rs12817765 rs140594705 0.69 2.34E-02 3.12E-01
rs12817765 rs11891827 0.68 2.61E-02 3.12E-01
rs12817765 rs76943877 0.66 3.22E-02 3.12E-01
rs12817765 rs137953693 0.61 4.58E-02 3.12E-01
rs12817765 rs10208293 0.47 4.70E-02 3.12E-01
rs12817765 rs76930359 0.57 6.72E-02 3.12E-01
Figure 15. SNP(CDK2) by SNP(DENND1B) interactions.
Table 5
Top 3 SNP(CDK2) by SNP(DENND1B) interactions. Gene pair
interaction q-value is 0.09.
Columns: SNP from CDK2; SNP from DENND1B; interaction coefficient; interaction p-value; interaction q-value.
rs2069408 rs35624964 0.40 1.05E-02 3.12E-01
rs773108 rs12118513 0.37 2.25E-02 3.12E-01
rs773107 rs2224873 0.36 2.39E-02 3.12E-01
Figure 16. SNP(DENND1B) by SNP(CDK2) interactions.
Table 6
Top 10 SNP(DENND1B) by SNP(CDK2) interactions. Gene pair
interaction q-value is 0.09.
Columns: SNP from CDK2; SNP from DENND1B; interaction coefficient; interaction p-value; interaction q-value.
rs2069408 rs35624964 0.40 1.05E-02 3.12E-01
rs2069408 rs12141028 0.39 1.14E-02 3.12E-01
rs2069408 rs6702421 0.39 1.27E-02 3.12E-01
rs2069408 rs35939830 0.38 1.36E-02 3.12E-01
rs2069408 rs67236816 0.38 1.40E-02 3.12E-01
rs2069408 rs17621130 0.38 1.41E-02 3.12E-01
rs2069408 rs12118913 0.38 1.47E-02 3.12E-01
rs2069408 rs12118513 0.38 1.87E-02 3.12E-01
rs2069408 rs16841842 0.36 2.13E-02 3.12E-01
rs773108 rs2224873 0.36 2.34E-02 3.12E-01
Figure 17. SNP(DBX1) by SNP(DENND1B) interactions.
Table 7
Top 6 SNP(DBX1) by SNP(DENND1B) interactions. Gene pair
interaction q-value is 0.29.
Columns: SNP from DBX1; SNP from DENND1B; interaction coefficient; interaction p-value; interaction q-value.
rs11606156 rs6661330 0.43 1.33E-01 3.12E-01
rs7936427 rs6661330 0.35 2.18E-01 3.12E-01
rs831460 rs78259147 0.30 2.68E-01 3.12E-01
rs138973840 rs6661330 0.30 2.70E-01 3.12E-01
rs75944386 rs78259147 0.33 2.85E-01 3.12E-01
rs55664955 rs6661330 0.25 3.65E-01 3.12E-01
Figure 18. SNP(DENND1B) by SNP(DBX1) interactions.
Table 8
Top 10 SNP(DENND1B) by SNP(DBX1) interactions. Gene pair
interaction q-value is 0.29.
Columns: SNP from DBX1; SNP from DENND1B; interaction coefficient; interaction p-value; interaction q-value.
rs11606156 rs6661330 0.43 1.33E-01 3.12E-01
rs11606156 rs6685897 0.43 1.35E-01 3.12E-01
rs11606156 rs78259147 0.44 1.47E-01 3.12E-01
rs11606156 rs114848122 0.43 1.65E-01 3.12E-01
rs11606156 rs17567921 0.41 1.69E-01 3.12E-01
rs11606156 rs151133986 0.40 1.70E-01 3.12E-01
rs11606156 rs111371365 0.38 1.90E-01 3.12E-01
rs11606156 rs111238846 0.38 1.94E-01 3.12E-01
rs11606156 rs112883537 0.38 1.98E-01 3.12E-01
rs11606156 rs10801621 0.37 2.02E-01 3.12E-01
Figure 19. SNP(DENND1B) by SNP(IL2RB) interactions.
Table 9
Top 10 SNP(DENND1B) by SNP(IL2RB) interactions. Gene pair
interaction q-value is 0.39.
Columns: SNP from DENND1B; SNP from IL2RB; interaction coefficient; interaction p-value; interaction q-value.
rs78259147 rs228974 -0.57 1.98E-02 3.12E-01
rs17567921 rs84459 -0.60 3.13E-02 3.12E-01
rs114848122 rs84459 -0.60 3.30E-02 3.12E-01
rs75007359 rs228974 -0.50 3.43E-02 3.12E-01
rs111371365 rs228974 -0.47 4.29E-02 3.12E-01
rs10922242 rs228974 -0.46 5.22E-02 3.12E-01
rs74541273 rs84459 -0.50 5.86E-02 3.12E-01
rs111238846 rs228974 -0.41 6.46E-02 3.12E-01
rs12094881 rs228974 -0.43 6.58E-02 3.12E-01
rs17641842 rs84459 -0.49 6.69E-02 3.12E-01
Figure 20. SNP(IL2RB) by SNP(DENND1B) interactions.
Table 10
Top 10 SNP(IL2RB) by SNP(DENND1B) interactions. Gene pair
interaction q-value is 0.39.
Columns: SNP from DENND1B; SNP from IL2RB; interaction coefficient; interaction p-value; interaction q-value.
rs78259147 rs228974 -0.57 1.98E-02 3.12E-01
rs78259147 rs228975 -0.57 2.18E-02 3.12E-01
rs17567921 rs84459 -0.60 3.13E-02 3.12E-01
rs17567921 rs228945 -0.57 3.86E-02 3.12E-01
rs17567921 rs84460 -0.55 4.33E-02 3.12E-01
rs75007359 rs1003694 -0.43 1.01E-01 3.12E-01
rs10922242 rs228955 -0.30 1.89E-01 3.12E-01
rs75007359 rs2235330 -0.29 2.84E-01 3.12E-01
rs10922247 rs84458 -0.08 5.62E-01 3.26E-01
rs10922247 rs228942 -0.07 6.13E-01 3.40E-01
Figure 21. SNP(HCG23) by SNP(LPIN2) interactions.
Table 11
Top 10 SNP(HCG23) by SNP(LPIN2) interactions. Gene pair
interaction q-value is 0.24.
Columns: SNP from HCG23; SNP from LPIN2; interaction coefficient; interaction p-value; interaction q-value.
rs16870123 rs1164 -0.61 1.54E-03 2.68E-01
rs17202379 rs1164 -0.62 1.82E-03 2.68E-01
rs17202358 rs1164 -0.61 1.90E-03 2.68E-01
rs116454630 rs1164 -0.60 1.95E-03 2.68E-01
rs17202365 rs1164 -0.60 2.03E-03 2.68E-01
rs17202400 rs1164 -0.60 2.18E-03 2.68E-01
rs17208741 rs1164 -0.60 2.23E-03 2.68E-01
rs144107508 rs1164 -0.60 2.40E-03 2.68E-01
rs67512154 rs1164 -0.58 2.57E-03 2.68E-01
rs17208615 rs1164 -0.59 2.58E-03 2.68E-01
Figure 22. SNP(LPIN2) by SNP(HCG23) interactions.
Table 12
Top 10 SNP(LPIN2) by SNP(HCG23) interactions. Gene pair
interaction q-value is 0.24.
Columns: SNP from HCG23; SNP from LPIN2; interaction coefficient; interaction p-value; interaction q-value.
rs16870123 rs1164 -0.61 1.54E-03 2.68E-01
rs16870123 rs585295 -0.53 2.33E-03 2.68E-01
rs16870123 rs642926 -0.51 3.13E-03 2.68E-01
rs142592308 rs10853287 0.52 1.24E-02 3.12E-01
rs142592308 rs10460009 0.52 1.30E-02 3.12E-01
rs17202407 rs2282636 0.35 2.47E-02 3.12E-01
rs17202407 rs2282635 0.34 2.89E-02 3.12E-01
rs17208769 rs3745012 0.33 3.46E-02 3.12E-01
rs17202407 rs7237795 0.33 3.56E-02 3.12E-01
6.1.4. Most Significant Pairwise SNP Interactions from Interacting Gene Pairs
Interaction between a single pair of SNPs is understood in the classic sense, as effect
modification. Since SNP values are coded 0, 1, and 2, we preface the description of the
follow-up analysis results with an explanation of how such effect modification is understood
in this context. The conditional association between one SNP and the likelihood of asthma is
considered in reference to minor allele count, that is, how the relative odds of case vs.
control status change as the SNP value increases from 0 to 1 and from 1 to 2.
We produced graphical representations of interactions between the most significant SNP
pairs from gene pairs CASI identified as significant. For each of the five interacting gene
pairs selected above we extracted one pair of SNPs. If we take the pair of genes CDK2 and
IL1RL1 in figure 23 as an example, we could think of the graph as showing effect
modification. Values 0, 1, and 2 on the horizontal axis and the same on the vertical axis are
for SNPs rs12817765 and rs11676124, respectively. As stated previously SNP value of 0
means two common alleles (homozygous major allele), 1 means heterozygous (one
common and one minor allele), and 2 means two minor alleles. The relative shades of gray
side by side represent relative likelihood of observing case or control for each combination
of SNP values. For rs12817765 and rs11676124 the comparative odds of case vs. control is
in favor of control phenotype across the entire grid. The side by side shades of gray don’t
appear to vary much. The percentages represent distribution of the SNP minor allele count
at rs11676124 within each subset defined by the value of rs12817765 and case status
combination. If we look at the three conditional distributions of percentages an effect
modification could be discerned. We observe how the distribution of percentages changes
when rs12817765 is homozygous for common allele, then shifts to heterozygous, and then
to 2 minor alleles. Effect modification can be inferred from conditional distribution of
rs11676124 shifting in favor of case status as its minor allele dose increases. This can be
explained as the association between rs11676124 and phenotype (Asthma) being
dependent on the value of rs12817765. The probability shifts towards seeing a case
when rs11676124 carries 1 or 2 minor alleles, conditional on an rs12817765 value of 1
or 2. Consider the weighted average minor allele count for rs11676124 among controls (N) at
rs12817765.Asthma=0.N (0.2754*0+0.4822*1+0.2424*2=0.967), rs12817765.Asthma=1.N
(0.296*0+0.4843*1+0.2197*2=0.9237), and rs12817765.Asthma=2.N
(0.40*0+0.50*1+0.10*2=0.7). Now we compare that to the weighted average when
phenotype falls into the case status category (Y). Weighted average minor allele count for
rs12817765.Asthma=0.Y is 0.3499*0+0.4353*1+0.2149*2=0.8651, for
rs12817765.Asthma=1.Y is 0.1228*0+0.6667*1+0.2105*2=1.0877, and for
rs12817765.Asthma=2.Y is 0.0*0+0.6667*1+0.3333*2=1.333. There is a noticeable shift
which can be described as interaction. For rs12817765=0 the average minor allele count is
lower for ‘Y’ (0.8651) compared to ‘N’ (0.967). However, for rs12817765 ∈ {1, 2} it is higher
for ‘Y’ compared to ‘N’, with 1.0877 vs. 0.9237 for 1 minor allele and 1.333 vs. 0.7 for 2
minor alleles, respectively. The effect of increasing minor allele dose in
rs12817765 is to shift the association between rs11676124 and phenotype from negative,
where an increasing number of minor alleles is protective, to positive, where an increasing
number of minor alleles is a risk factor. The magnitude of the corresponding interaction
p-value is shown in figures 13 and 14 and tables 3 and 4. Figure 13 also shows the location
of SNP rs12817765 and Figure 14 the location of rs11676124. The same can be learned from the
rest of the interaction plots. It bears clarifying our claim that figures 23 to 27 in fact
represent interaction, which we understand to be effect modification. Consider
maximum likelihood estimation, a method for estimating the parameters of a model by
maximizing the likelihood of the observed data; for this type of data the parameters are
expressed as coefficients in a linear model. Each frame within a figure represents the
conditional association between the SNP represented on the vertical axis and the phenotype,
asthma case-control status. The value of the SNP on the vertical axis, conditioned on the
value of the SNP on the horizontal axis, is related to the value of the phenotype. In
colloquial terms, if cases are more likely than controls to exhibit higher minor allele
counts, minor allele counts are positively associated with having asthma; if cases are more
likely to exhibit lower counts, the association is negative. If we observe the panels from
left to right and the association changes in either direction, interaction is indicated. If
a logistic model with an interaction term is fit to the data, a corresponding interaction
effect will be reflected in the estimate of the coefficient of the multiplicative term and
the associated p-value, though it may not be statistically significant.
From figures 24 to 27 we can learn the same as from figure 23 for the other SNP pairs by
quantifying weighted minor allele counts for SNP on the vertical axis conditioned on SNP
on the horizontal axis.
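The weighted averages used throughout this section are expected minor allele counts under each conditional distribution. A minimal sketch of the calculation is given below, using the rs11676124 proportions quoted for the CDK2 and IL1RL1 pair; the function and variable names are hypothetical and chosen only for illustration.

```python
def mean_minor_allele_count(proportions):
    """Expected minor allele count given the proportions with 0, 1, and 2 minor alleles."""
    return sum(count * p for count, p in enumerate(proportions))


# Conditional distributions of rs11676124, keyed by
# (rs12817765 minor allele count, asthma status), as quoted in the text.
distributions = {
    (0, "N"): (0.2754, 0.4822, 0.2424),
    (0, "Y"): (0.3499, 0.4353, 0.2149),
    (1, "N"): (0.2960, 0.4843, 0.2197),
    (1, "Y"): (0.1228, 0.6667, 0.2105),
    (2, "N"): (0.40, 0.50, 0.10),
    (2, "Y"): (0.0, 0.6667, 0.3333),
}

means = {key: mean_minor_allele_count(d) for key, d in distributions.items()}

# Case-minus-control difference within each stratum of rs12817765;
# a change of sign across strata is what the text describes as effect modification.
shift = {g: means[(g, "Y")] - means[(g, "N")] for g in (0, 1, 2)}

for g in (0, 1, 2):
    print(g, round(means[(g, "N")], 4), round(means[(g, "Y")], 4), round(shift[g], 4))
```

The sign of each difference, rather than its size, is what matters for the qualitative reading of the plots: some strata contain very few subjects, so the averages there are unstable.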
For SNP rs35624964 from DENND1B conditioned on rs2069408.Asthma from CDK2 the
weighted averages are:
0.N (0.6543*0+0.3065*1+0.0391*2=0.3847)
0.Y (0.7309*0+0.2556*1+0.0135*2=0.2826)
1.N (0.6297*0+0.3207*1+0.0496*2=0.4199)
1.Y (0.6478*0+0.3145*1+0.0377*2=0.3899)
2.N (0.569*0+0.3966*1+0.0345*2=0.4656)
2.Y (0.4878*0+0.3902*1+0.122*2=0.6342)
The above suggests that effect modification is in the positive direction. When rs2069408
carries no minor allele or only 1 minor allele, the association between the minor allele
dose for rs35624964 and the asthma phenotype is negative. However, the direction of the
relationship between rs35624964 and asthma changes to a positive association when
rs2069408 exhibits 2 minor alleles, with an increasing number of minor alleles indicating
greater likelihood of asthma case rather than control status.
For rs6661330 from DENND1B conditioned on rs11606156.Asthma from DBX1 the
weighted averages are:
0.N (0.8744*0+0.1207*1+0.0049*2=0.1305)
0.Y (0.887*0+0.0989*1+0.0141*2=0.1271)
1.N (0.8546*0+0.1418*1+0.0035*2=0.1488)
1.Y (0.1228*0+0.6667*1+0.2105*2=1.0877)
2.N (0.6667*0+0.2667*1+0.0667*2=0.4001)
2.Y (0.75*0+0.25*1+0.0*2=0.25)
The relationship between rs6661330 and rs11606156.Asthma is not as clear. The
association between rs6661330 and phenotype is in the negative direction for rs11606156
value of 0, in the positive direction for value of 1, and again in the negative direction for
value of 2.
For rs228974 from IL2RB conditioned on rs78259147.Asthma from DENND1B the weighted
averages are:
0.N (0.2972*0+0.4762*1+0.2266*2=0.9294)
0.Y (0.2547*0+0.5416*1+0.2038*2=0.9492)
1.N (0.3317*0+0.5578*1+0.1108*2=0.7794)
1.Y (0.4444*0+0.40*1+0.1556*2=0.7112)
2.N (0.0*0+0.5556*1+0.4444*2=1.4444)
2.Y (1.00*0+0.0*1+0.0*2=0)
The conditional association between rs228974 and phenotype is very close to null when
conditioned on rs78259147 value of 0 and slightly in favor of no disease status (control)
when conditioned on value of 1. However, the association is clearly negative between
rs228974 and phenotype when rs78259147 exhibits 2 minor alleles.
For rs1164 from LPIN2 conditioned on rs16870123.Asthma from HCG23 the weighted
averages are:
0.N (0.5235*0+0.3841*1+0.0924*2=0.5689)
0.Y (0.4951*0+0.4078*1+0.0971*2=0.602)
1.N (0.5013*0+0.4041*1+0.0946*2=0.5933)
1.Y (0.6415*0+0.3113*1+0.0472*2=0.4057)
2.N (0.5312*0+0.3438*1+0.125*2=0.5938)
2.Y (0.875*0+0.125*1+0.0*2=0.125)
For this interaction effect the conditional association between rs1164 and phenotype shifts
from moderately positive to clearly negative as the value of rs16870123 changes from 0 to
1 and then to 2.
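The same shift can be read off the fitted coefficients. As a sketch only, assuming the logistic model with a product term and taking the coefficients from the first row of Table 2 (rs16870123 by rs1164), the log odds ratio per rs1164 minor allele at each value of rs16870123 is the rs1164 main effect plus the interaction coefficient times the rs16870123 count. Covariates are assumed held fixed, and the function name is hypothetical.

```python
import math

# Coefficients from the first row of Table 2 (rs16870123 in HCG23, rs1164 in LPIN2).
beta_lpin2 = 0.15          # main-effect coefficient for rs1164
beta_interaction = -0.61   # coefficient of the rs16870123 x rs1164 product term


def conditional_log_odds_ratio(hcg23_count):
    """Log odds ratio of asthma per rs1164 minor allele at a fixed rs16870123 count."""
    return beta_lpin2 + beta_interaction * hcg23_count


for count in (0, 1, 2):
    log_or = conditional_log_odds_ratio(count)
    # Print the stratum, the log odds ratio, and the corresponding odds ratio.
    print(count, round(log_or, 2), round(math.exp(log_or), 2))
```

The estimate is slightly positive when rs16870123 carries no minor allele and negative otherwise, in line with the direction of the weighted averages above, although the interaction did not reach significance after FDR adjustment.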
Figure 23. Interaction plot for most significant SNP pair for genes CDK2 and IL1RL1. The relative shades of gray represent the odds of
case vs. control at the specified combination of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on the vertical axis within each subset
defined by the value of the SNP on the horizontal axis and case status combination.
Figure 24. Interaction plot for most significant SNP pair for genes CDK2 and DENND1B. The relative shades of gray represent the odds of
case vs. control at the specified combination of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on the vertical axis within each subset
defined by the value of the SNP on the horizontal axis and case status combination.
Figure 25. Interaction plot for most significant SNP pair for genes DBX1 and DENND1B. The relative shades of gray represent the odds of
case vs. control at the specified combination of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on the vertical axis within each subset
defined by the value of the SNP on the horizontal axis and case status combination.
Figure 26. Interaction plot for most significant SNP pair for genes DENND1B and IL2RB. The relative shades of gray represent the odds of
case vs. control at the specified combination of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on the vertical axis within each subset
defined by the value of the SNP on the horizontal axis and case status combination.
Figure 27. Interaction plot for most significant SNP pair for genes HCG23 and LPIN2. The relative shades of gray represent the odds of
case vs. control at the specified combination of minor allele doses from the two SNPs labeled on the horizontal and vertical axes. The
percentages represent distribution of the SNP minor allele count for the SNP specified on the vertical axis within each subset
defined by the value of the SNP on the horizontal axis and case status combination.
Our goal here is to make sense of how these interacting gene pairs can contribute to solving
the puzzle of asthma. We intend that follow-up analyses that focus specifically on these
pairs may provide insights into the genetic architecture of asthma and serve as a basis for
future functional and mechanistic studies.
6.2. Application of CASI to the Asthma Bio-Repository for Integrative
Genomic Exploration (ABRIDGE) Data Set
6.2.1. Asthma BRIDGE (ABRIDGE)
The data used for this study are from the Asthma BioRepository for Integrative Genomic
Exploration (Asthma BRIDGE or ABRIDGE) (N = 1460) (Raby, et al., 2011; Torgerson, et al.,
2011), which is publicly accessible and includes 1500 individuals with asthma and controls
with comprehensive phenotype and genomic data (Breton, et al., 2014). Asthma was
defined via questionnaire by the presence of asthma symptoms, use of an inhaled
bronchodilator at least twice per week, or use of a daily asthma medication for the 6
months before the screening interview (Breton, et al., 2014). Genome-wide SNP and DNA
methylation data were obtained from whole blood samples from 576 participants. Samples were
randomized to avoid confounding by experimental batch. The final data set included 356
asthmatics and 220 non-asthmatic adults, for whom we had complete data.
6.2.2. SNPs and DNA Methylation
SNP data included all Phase 2, Release 21 consensus HapMap variants and were obtained
from studies participating in the EVE consortium (International HapMap, et al., 2007;
Torgerson, et al., 2011). The genotyping platforms, Illumina (1Mv1, 550k, 610k, 650k) and
Affymetrix (500k, 6.0), varied by sample source. Quality control (QC) procedures included
filtration for call rates (> 95%) and tests for agreement with Hardy-Weinberg expectations
applied to SNPs oriented to the plus strand. High-density imputation using the 1000 Genomes
reference panel was performed with the MACH software (Liu, et al., 2013). Data were
checked for consistency in reference alleles and strand orientation, and SNPs were
excluded based on low quality scores and examination of QQ plots comparing the
distribution of association p-values for genotyped and imputed markers (Torgerson, et al.,
2011). SNPs with minor allele frequencies less than 0.05 were excluded.
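The call-rate and minor allele frequency filters can be illustrated with a short sketch. It assumes hard-called genotypes coded as minor allele counts 0, 1, 2 with missing calls stored as None; the function names are hypothetical, and the Hardy-Weinberg and imputation-quality checks described above are not reproduced here.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls (missing calls are coded as None)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.05):
    """Keep a SNP with call rate above 95% and MAF of at least 0.05."""
    return (call_rate(genotypes) > min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


common = [0] * 60 + [1] * 35 + [2] * 5   # MAF = 45 / 200 = 0.225
rare = [0] * 97 + [1] * 3                # MAF = 3 / 200 = 0.015
sparse = [0, 1, 2, None] * 25            # call rate = 0.75

print(passes_qc(common), passes_qc(rare), passes_qc(sparse))
```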
A total of 11 plates were run on the Illumina human 450K methylation platform. Data
preprocessing steps included application of Norm-Exponential (NE) background correction
on the raw methylation data, plate by plate for all 11 plates. HG19 annotations, including
gene symbols, chromosome number, and sequence position, were added via the
Bioconductor package, FDb.InfiniumMethylation.hg19. One sample from a non-asthmatic
participant was excluded due to having an extreme value of 1.0 for all probes. Methylation
values at CpG sites were assessed for outliers, and observations exceeding 3 interquartile
ranges (IQRs) from the 25th and 75th percentiles, per Tukey’s outer fences (Dawson, 2011), were
removed unless they comprised greater than or equal to 5% of the data for that variable.
The 11 NE corrected plates were combined, and four additional samples were excluded as
outliers. Dye bias (DB) correction was done using one sample as reference for all others,
and quantile normalization (QN) was applied to the resulting data set.
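The outlier rule based on Tukey's outer fences could be sketched as below for a single CpG site. This is an illustration under stated assumptions: the function name is hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile definition used in the original pipeline.

```python
import statistics


def remove_extreme_outliers(values, k=3.0, max_outlier_fraction=0.05):
    """Drop values beyond Q1 - k*IQR or Q3 + k*IQR unless they make up >= 5% of the data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = [v < lower or v > upper for v in values]
    # If the flagged values are a sizeable share of the data, keep them all.
    if sum(is_outlier) / len(values) >= max_outlier_fraction:
        return list(values)
    return [v for v, flagged in zip(values, is_outlier) if not flagged]


# One isolated extreme value (1% of the data) is removed.
isolated = [0.50 + 0.001 * i for i in range(99)] + [0.99]
cleaned = remove_extreme_outliers(isolated)

# A cluster of extreme values (10% of the data) is retained.
clustered = [0.50 + 0.001 * i for i in range(90)] + [0.99] * 10
kept = remove_extreme_outliers(clustered)

print(len(cleaned), len(kept))
```

Retaining clusters of extreme values avoids discarding what may be a genuine second mode of methylation, for example one driven by a nearby SNP.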
SNPs and CpG sites were mapped to the nearest gene with positions identified using the
UCSC genome browser for all RefSeq genes. Overlapping isoforms of the same gene were
combined to form a single full-length version. SNPs and CpG sites residing in flanking
regions extending 2kb on either side were included.
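A minimal sketch of the mapping step follows, assuming each gene is summarized by a single merged interval and that a SNP or CpG site is assigned to the nearest gene whose interval, extended by 2kb on either side, contains it. The gene names and coordinates are hypothetical placeholders, not RefSeq annotations.

```python
FLANK_BP = 2000  # 2kb flanking region on either side of the gene


def distance_to_gene(pos, start, end):
    """Distance in base pairs from a position to a gene interval (0 if inside)."""
    if start <= pos <= end:
        return 0
    return start - pos if pos < start else pos - end


def assign_gene(chrom, pos, genes, flank=FLANK_BP):
    """Return the nearest gene within the flank distance, or None if there is none."""
    best = None
    for name, (gene_chrom, start, end) in genes.items():
        if gene_chrom != chrom:
            continue
        d = distance_to_gene(pos, start, end)
        if d <= flank and (best is None or d < best[1]):
            best = (name, d)
    return best[0] if best else None


# Hypothetical merged gene intervals: name -> (chromosome, start, end).
genes = {
    "GENE_A": ("chr1", 10_000, 20_000),
    "GENE_B": ("chr1", 23_000, 30_000),
}

for pos in (15_000, 21_200, 22_500, 35_000):
    print(pos, assign_gene("chr1", pos, genes))
```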
6.2.3. Demographics of the ABRIDGE Population
Of the 576 participants, 62 percent had asthma (Table 13). While males and females were
evenly distributed among cases, there were more females than males among the controls.
Asthma cases had a greater percentage of Hispanics whereas controls had a greater
proportion of African Americans. Asthma prevalence varied by clinic site by
design, with some sites recruiting only cases.
Table 13
Characteristics of Participants in the ABRIDGE Study.
Columns: Characteristic; Asthma Case, n (%), with N (%) = 356 (61.81); Asthma Control, n (%), with N (%) = 220 (38.19).
Gender
Male 169 (47.5) 78 (35.5)
Female 187 (52.5) 142 (64.5)
Race/Ethnicity
European 73 (20.5) 41 (18.6)
Hispanic/Latino 174 (48.9) 65 (29.5)
Black or African American 67 (18.8) 92 (41.8)
Other/Multiple/Uncertain 42 (11.8) 22 (10.0)
Education
Did Not Complete High School 41 (11.5) 20 (9.1)
High School or GED Degree 96 (27.0) 51 (23.2)
Some College 97 (27.2) 74 (33.6)
College Degree 71 (19.9) 44 (20.0)
Post-College Coursework 18 (5.1) 15 (6.8)
Graduate or Professional Degree 18 (5.1) 14 (6.4)
Other 15 (4.2) 2 (0.9)
Site
GRAAD, JH 39 (11.0) 87 (39.5)
CAG/UAC 32 (9.0) 18 (8.2)
CHS/USC 119 (33.4) 115 (52.3)
CARE; Denver NJH, Tucson/UARC, St. Louis/WUC, Madison/UWM 42 (11.8) 0 (0)
MCCAS 124 (34.8) 0 (0)
Age at Methylation Measure
0-10 4 (1.1) 0 (0)
10-20 134 (37.6) 4 (1.8)
20-30 162 (45.5) 143 (65.0)
30-40 23 (6.5) 21 (9.5)
40-50 16 (4.5) 27 (12.3)
50-60 13 (3.7) 21 (9.5)
>60 4 (1.1) 4 (1.8)
Family History of Asthma
Yes 173 (48.6) 51 (23.2)
No 183 (51.4) 169 (76.8)
Age at Diagnosis
0-2 90 (25.3)
2-5 100 (28.1)
5-8 64 (18.0)
8-11 57 (16.0)
11-14 15 (4.2)
14-17 9 (2.5)
>17 21 (5.9)
6.2.4. CASI Analysis of ABRIDGE Data
In this study we apply our method to a different type of pair of candidate interacting
variables: one is an ordinal SNP variable and the other is a continuous methylation
variable. There is evidence to suggest that asthma pathogenesis is affected by both genetic
and epigenetic variation independently, and there is some evidence to suggest that
genetic-epigenetic interactions affect risk of asthma. However, little research has been
done to identify such interactions on a genome-wide scale. We aim to identify genes with
genetic-epigenetic interactions associated with asthma. Using asthma case-control data, we
applied a novel nonparametric gene-centric approach to test for interactions between multiple
SNPs and CpG sites simultaneously in the vicinities of 18,178 genes across the genome. Our
statistical method simultaneously accounts for multiple variations across chromosomal
regions to detect these types of effects on a genome-wide scale.
Note that p-values were not generated even as intermediate statistics because FDR was
estimated directly from the observed and permuted CASI statistics. The estimated FDR and
corresponding confidence intervals (CIs) for a series of increasingly stringent significance
thresholds for the CASI test statistic demonstrate a downward trend with narrow
confidence intervals (Figure 28), indicating that the CASI statistic is informative with
respect to distinguishing observed from permuted data. However, this dynamic does not
continue beyond about 4.35 where a minimum FDR occurs, implying that more extreme
values of the CASI statistic do not correspond to lower rates of false discoveries. The
threshold of 4.35 was therefore used to identify the most statistically significant results,
yielding 12 genes, HOPX, SCARNA18, PF4, STC1, ATF3, OR10K1, UPK1B, LOC101928523,
LHX6, CHMP4B, TPRA1, and LANCL1 (FDR = 0.050, CI = (0.024, 0.104)).
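The idea of estimating FDR directly from observed and permuted statistics, without intermediate p-values, can be illustrated with a simplified sketch. It is not the estimator used in this work: the confidence intervals reported in the text and any further adjustments of the published method are omitted, the function name is hypothetical, and the statistics below are toy numbers rather than CASI results.

```python
def estimate_fdr(observed, permuted_sets, threshold):
    """Average count of permuted statistics at or above the threshold,
    divided by the count of observed statistics at or above it."""
    n_observed = sum(s >= threshold for s in observed)
    if n_observed == 0:
        return None  # nothing is called significant at this threshold
    mean_null = sum(
        sum(s >= threshold for s in perm) for perm in permuted_sets
    ) / len(permuted_sets)
    return min(1.0, mean_null / n_observed)


# Toy data: eight observed statistics and four permuted replicates of the same size.
observed = [5.1, 4.8, 4.6, 4.4, 3.0, 2.5, 2.0, 1.5]
permuted_sets = [
    [4.5, 3.1, 2.9, 2.2, 2.0, 1.8, 1.1, 0.9],
    [3.9, 3.3, 2.7, 2.4, 1.9, 1.6, 1.2, 0.8],
    [4.1, 3.0, 2.8, 2.1, 1.7, 1.5, 1.0, 0.7],
    [3.6, 3.2, 2.6, 2.3, 1.8, 1.4, 1.3, 0.6],
]

fdr = estimate_fdr(observed, permuted_sets, threshold=4.35)
print(fdr)
```

Evaluating such an estimate over a grid of thresholds gives a curve of the kind plotted in Figure 28, from which the threshold with the minimum estimated FDR can be chosen.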
Figure 28. FDR and confidence intervals for 18,178 genomic regions (genes) defined by 2kb up and
downstream of each gene. The value of the CASI statistic (x-axis) has no direct interpretation other
than that larger values are more extreme. 12 significant genomic regions correspond to a threshold
of 4.35 (vertical dashed line) and FDR of 0.05, CI = (0.024, 0.104). Integer values shown in green
specify the number of genomic regions with CASI statistics at least as extreme as the thresholds
specified by the horizontal axis.
6.2.5. Follow-up Analyses in Significant Regions
SNPs and CpG sites can be ranked by their contribution to the test statistic by computing
loadings, that is, their correlations in cases with their corresponding linear combinations,
the canonical variates. SNP-CpG site pairs that were the greatest contributors to statistically significant
tests were identified by selecting those with loadings greater than 0.5, which has been
proposed as the “operational definition” of a large effect size (Cohen, 1992). Using logistic
regression, conventional likelihood ratio tests of multiplicative interactions were
conducted, including one degree of freedom likelihood ratio tests (1-df LRT) for interaction
and three degree of freedom tests (3-df LRT) for the combined main and interaction effects.
Statistical significance was assessed at an FDR level of 0.05, using a parametric
version of the Millstein and Volfson method (Millstein, et al., 2016). Variables found to be significantly
correlated with asthma status (Table 13) were included as adjustment covariates in the
logistic models. While education level was not significant (p-value= 0.14), it was
nevertheless included due to prior evidence of association with asthma. Final models were
adjusted for gender, ancestry, family history of asthma, age at sample collection, and
education level. The ancestry variables were composed of the top two principal
components computed by the software Eigensoft (Patterson, Price and Reich, 2006) using
128 ancestry informative markers (AIMs) (Kosoy, et al., 2009). The primary results do not
include adjustment for site, because two ABRIDGE sites contributed only cases (Table 13).
However, sensitivity analyses were conducted by restricting to sites that included both
cases and controls and including site as an adjustment covariate, and the results did not
change substantially.
6.2.6. Pairwise SNP-CpG Interaction Analysis
Twelve genes, PF4, ATF3, TPRA1, HOPX, SCARNA18, STC1, OR10K1, UPK1B,
LOC101928523, LHX6, CHMP4B, and LANCL1, exhibited statistically significant SNP-CpG
interactions (FDR = 0.05). Of these, three have previously been implicated in asthma risk,
PF4, ATF3, and TPRA1. Follow-up analysis discovered statistically significant pairwise SNP-
CpG interactions for several of these genes, including SCARNA18, LHX6, and
LOC101928523 (P-values = 1.33 × 10⁻⁴, 8.21 × 10⁻⁴, and 1.11 × 10⁻³, respectively).
Pairwise analyses of SNPs and CpG sites from the 12 significant genomic regions with
loadings of 0.5 or greater using logistic regression revealed evidence of pairwise
interactions for 19 top loading pairs (Table 14), corresponding to 4 of the 12 genomic
regions, LOC101928523, SCARNA18, LHX6, and STC1.
Table 14
Significant interactions for SNPs and CpGs with loadings > 0.5.
| SNP | Methylation | Coefficient SNPxMeth Interaction | P-Value 1-df LRT | P-Value 3-df LRT | Gene |
|---|---|---|---|---|---|
| rs67216017(G/GA) | cg14999833 | 19.79 | 2.84E-05 | 1.33E-04 | SCARNA18 |
| rs113665237(A/AT) | cg14999833 | 19.79 | 2.84E-05 | 1.33E-04 | SCARNA18 |
| rs10061690 | cg14999833 | 19.79 | 2.84E-05 | 1.33E-04 | SCARNA18 |
| rs10818651 | cg01363324 | -24.84 | 1.17E-04 | 8.21E-04 | LHX6 |
| rs13301641 | cg07956857 | -8.11 | 1.28E-04 | 1.11E-03 | LOC101928523 |
| rs11792474 | cg07956857 | -7.86 | 1.79E-04 | 1.49E-03 | LOC101928523 |
| rs10818651 | cg21469772 | -6.47 | 2.49E-04 | 3.74E-03 | LHX6 |
| rs10985567 | cg01363324 | -23.97 | 2.99E-04 | 4.35E-04 | LHX6 |
| rs10818651 | cg04282082 | -4.99 | 3.59E-04 | 4.88E-03 | LHX6 |
| rs10818651 | cg13832372 | 12.80 | 3.68E-04 | 1.44E-03 | LHX6 |
| rs10985567 | cg21213617 | -12.88 | 6.19E-04 | 4.51E-03 | LHX6 |
| rs10985567 | cg13832372 | 12.61 | 6.42E-04 | 4.55E-04 | LHX6 |
| rs10985567 | cg21469772 | -6.19 | 8.17E-04 | 4.18E-03 | LHX6 |
| rs989798 | cg04282082 | -4.70 | 1.17E-03 | 7.01E-03 | LHX6 |
| chr8/BP:23713016 (C/CT) | cg16688533 | -3.99 | 1.20E-03 | 3.87E-03 | STC1 |
| rs989798 | cg13832372 | 12.33 | 1.42E-03 | 1.85E-03 | LHX6 |
| rs10818651 | cg21213617 | -11.34 | 1.54E-03 | 1.53E-02 | LHX6 |
| rs989798 | cg21469772 | -5.69 | 1.60E-03 | 1.24E-02 | LHX6 |
| rs10985567 | cg04282082 | -4.61 | 1.73E-03 | 4.51E-03 | LHX6 |
Statistically significant (FDR < 0.05) pairwise interactions (1-df LRT) for top loading SNPs and
CpGs from the 12 genomic regions identified as significant by CASI. Logistic regression models
were adjusted for gender, ancestry/ethnicity, family history of asthma, age at methylation
measure, and education level. Very similar p-values may be indicative of SNPs in high LD.
Although the 3-df tests were significant for most of these pairs, the 1-df tests tended to be
more significant, implying that main effects were not appreciably contributing. Nine
individual SNPs and ten CpG sites in LHX6, three SNPs and two CpG sites in
LOC101928523, and three SNPs and two CpG sites in SCARNA18 met the criteria of having
loadings greater than 0.5 (Figure 29). The SNP-CpG pair with the greatest loadings did not
always elicit a significant result for multiplicative interaction; however, this is not
surprising considering that the CASI statistic is formed from linear combinations of SNPs
and CpGs. Multiple genomic regions, HOPX, PF4, ATF3, UPK1B, CHMP4B, TPRA1, and
LANCL1 did not show evidence of interactions for their top loading SNP-CpG pair. The lack
of statistical significance for individual pairs may indicate that it is necessary to evaluate
joint interactions between multiple SNPs and CpGs in order to have adequate power to
detect the effects.
Figure 29. SNP and methylation loadings for genomic regions LOC101928523, LHX6, and SCARNA18
with LD heat maps. Red circles indicate SNPs and blue squares CpG sites. The horizontal axis
indicates base pair (BP) positions in each of the three genomic regions. Heat maps represent
dependencies between SNPs and CpGs (alignment is approximate).
Additional pairwise interaction analyses were conducted for SNPs and CpGs with loadings
greater than 0.5 underlying LHX6, LOC101928523, and SCARNA18 (Tables 15-17).
Significant interactions were found for LHX6, where the SNPs involved had negative loadings
and were located in the central haplotype block, interacting with CpG sites clustered to the right in a
CpG island (Figure 29 and Table 15). Odds ratios for the main effects of both SNPs and CpGs
in LHX6 were not statistically significant (Table 15). However, interactions for seven SNP-
CpG pairs were significant, with odds ratios (ORs) ranging from 0.53 (0.36, 0.77) to 0.84
(0.74, 0.96) (Table 15). High LD between SNPs as well as dependencies among CpGs
may explain the similarity in interaction effects that is apparent across SNP-CpG pairs. For
LHX6, minor allele dose was associated with increasingly protective effects of methylation
(Table 15). This trend is particularly apparent in the conditional ORs (ORmeth|SNP). For
the most significant interaction in LHX6, individuals with the common ‘GG’ genotype in
rs10818651 had 9% greater odds of asthma for a 5% difference in cg21469772
methylation. Addition of one minor allele ‘A’ reverses the association to a 21% decrease in
the odds of asthma for 5% greater methylation, whereas individuals with the ‘AA’ genotype
had a 43% decrease. Similarly, for the next most significant interaction (rs10985567 and
cg21213617), the odds of asthma for an individual with 5% higher methylation and ‘GG’
genotype (common homozygote) increases by 24% whereas the odds for an individual
with 5% higher methylation and ‘AA’ genotype decreases by 66%.
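The conditional ORs quoted above can be approximately reproduced from the rounded values in Table 15. The following is a minimal sketch, assuming a logistic model in which minor allele dose is coded 0, 1, 2 and enters through a product term with methylation, so that the methylation OR at dose g equals the common-homozygote OR multiplied by the interaction OR raised to the power g. This coding is an assumption that appears consistent with the tabulated values, and the function name is hypothetical.

```python
def conditional_ors(or_reference, or_interaction, max_dose=2):
    """Methylation OR at each minor-allele dose g = 0..max_dose.

    Assumes a log-additive dose coding, so the OR at dose g is
    or_reference * or_interaction ** g.
    """
    return [or_reference * or_interaction ** g for g in range(max_dose + 1)]


# rs10818651 x cg21469772, rounded values taken from Table 15:
# OR for the GG genotype = 1.09, interaction OR = 0.72 (per 5% methylation).
ors = conditional_ors(1.09, 0.72)
for genotype, value in zip(["GG", "AG", "AA"], ors):
    percent_change = (value - 1.0) * 100.0
    print(f"{genotype}: OR = {value:.2f} ({percent_change:+.0f}% odds per 5% methylation)")
```

Because the inputs are already rounded to two decimals, the results can differ from the tabulated conditional ORs in the last digit.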
Figure 30. Interaction box plots for SNP-methylation pairs with highest loadings in the top 12
genomic regions.
Boxplots of methylation values with jittered points provide a visual demonstration of the
relationships underlying the significant interaction effects (Figure 30). For
example, LHX6 median methylation (cg04282082) is clearly greater in individuals with
asthma than controls with the GG genotype (rs10818651), but slightly less within AG
individuals. In general, for LHX6, increased methylation was protective in the presence of a
minor allele but not in the common homozygote (Table 15). In another example, median
methylation of cg16688533 in STC1 is slightly greater in individuals with asthma than in those
without among carriers of the AA genotype (rs9969426), whereas individuals with the GG genotype tend to be
unmethylated and have asthma. Another clear example can be observed for
LOC101928523, where median methylation (cg07956857) is slightly greater in CC
individuals with asthma (rs13301641), approximately equal in TC individuals, but less in
TT individuals with asthma.
In LHX6 the most statistically significant interaction involved the SNP with the greatest
loading (rs10818651, loading = -0.86) (Table 15) that resides within the middle LD block
(between base pair location values 124972042 and 124982500, indicated by vertical bars
in Figure 29), paired with CpG site cg01363324 (loading=0.53). The opposite signs of the
loadings for this pair convey that these variables are negatively correlated among cases
(Figure 29), which is reflected in the interaction pattern. Increasing minor allele dose of the
SNP reflects an association between methylation level and asthma that is increasingly
negative.
SNPs within the LOC101928523 genomic region were confined to a single haplotype block
(Figure 29). The highest loading SNPs, rs13301641 (loading = 0.87), rs11792474 (loading
= 0.87), and rs75088949 (loading = -0.68) as well as the highest loading CpG site,
cg07956857 (loading = -0.89), were the most statistically significant interactions for this
genomic region (Table 16). As with LHX6, there was little evidence of main effects;
however, unlike LHX6, minor allele dose was accompanied by both decreased and
increased ORs. The most significant interaction for LOC101928523, between rs13301641
and cg07956857, involved loadings with opposite signs (Figure 29), which reflected
negative correlation between methylation level and number of minor alleles among
individuals with asthma.
For SCARNA18 there were three SNPs, rs67216017, rs113665237, and rs10061690, all in
high LD, with loadings greater than 0.5. Thus, interaction analysis results are
indistinguishable among the three (Table 17). The loadings for SNPs and CpGs are
concordant (Figure 29), which is consistent with the idea that minor allele dose is
associated with increasingly deleterious effects of methylation on asthma susceptibility.
The range of methylation for cg14999833 is narrow (Figure 30), hence it is appropriate to
estimate the effect over a 1% change. Deletion of allele ‘A’ near rs67216017 or ‘T’ near
rs113665237 appears to have an equivalent effect to presence of minor allele ‘A’ at
rs10061690 on the relationship between methylation at the CpG site cg14999833 and
asthma. With respect to cg14999833, a biologically meaningful effect is observed for a 1%
difference in methylation, in contrast to cg20697188, where a 5% difference is within the
range of the observed data.
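The choice between a 1% and a 5% methylation difference only rescales the reported OR. As a minimal sketch, assuming the interaction coefficients in Table 14 are log-odds per unit change in methylation measured as a proportion (0 to 1), an assumption that appears consistent with the ORs in Tables 15 and 17, the OR for a given difference is the exponential of the coefficient times that difference. The function name is hypothetical.

```python
import math


def odds_ratio_for_change(coefficient, delta):
    """OR for a methylation difference `delta` (as a proportion, e.g. 0.05 = 5%).

    Assumes `coefficient` is a log-odds change per unit methylation proportion.
    """
    return math.exp(coefficient * delta)


# SCARNA18 (cg14999833): interaction coefficient 19.79, 1% difference.
or_scarna18 = odds_ratio_for_change(19.79, 0.01)

# LHX6 (rs10818651 x cg21469772): interaction coefficient -6.47, 5% difference.
or_lhx6 = odds_ratio_for_change(-6.47, 0.05)

print(f"SCARNA18 interaction OR per 1%: {or_scarna18:.2f}")
print(f"LHX6 interaction OR per 5%: {or_lhx6:.2f}")
```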
Table 15
Tests of multiplicative interactions (P < 0.05) within LHX6 for SNP-CpG pairs.
5% difference in methylation

| SNP (Minor/Major alleles) | Methylation | MAF (a) | P-Value Interaction | OR Meth given SNP (95% CI) | OR Interaction (95% CI) | OR SNP (95% CI) | OR Meth. (95% CI) |
|---|---|---|---|---|---|---|---|
| rs10818651 (A/G) | cg21469772 | 0.25 | 2.49E-04 | GG: 1.09 (0.97, 1.23); AG/GA: 0.79 (0.58, 1.07); AA: 0.57 (0.35, 0.94) | 0.72 (0.60, 0.87) | 1.00 (0.66, 1.53) | 0.99 (0.93, 1.07) |
| rs10818651 (A/G) | cg04282082 | 0.25 | 3.59E-04 | GG: 1.08 (0.99, 1.18); AG/GA: 0.84 (0.67, 1.07); AA: 0.66 (0.45, 0.96) | 0.78 (0.67, 0.90) | 1.00 (0.66, 1.53) | 1.01 (0.95, 1.06) |
| rs10985567 (A/G) | cg21213617 | 0.23 | 6.19E-04 | GG: 1.24 (0.99, 1.55); AG/GA: 0.65 (0.35, 1.20); AA: 0.34 (0.13, 0.93) | 0.53 (0.36, 0.77) | 1.27 (0.84, 1.96) | 0.96 (0.83, 1.12) |
| rs10985567 (A/G) | cg21469772 | 0.23 | 8.17E-04 | GG: 1.13 (1.01, 1.27); AG/GA: 0.83 (0.61, 1.13); AA: 0.61 (0.37, 1.00) | 0.73 (0.60, 0.89) | 1.27 (0.84, 1.96) | 0.99 (0.93, 1.07) |
| rs989798 (T/C) | cg04282082 | 0.20 | 1.17E-03 | CC: 1.08 (1.01, 1.17); TC/CT: 0.86 (0.69, 1.07); TT: 0.68 (0.47, 0.98) | 0.79 (0.68, 0.92) | 1.21 (0.78, 1.88) | 1.01 (0.95, 1.06) |
| rs10985567 (A/G) | cg04282082 | 0.23 | 1.73E-03 | GG: 1.11 (1.02, 1.20); AG/GA: 0.88 (0.70, 1.11); AA: 0.70 (0.48, 1.03) | 0.79 (0.68, 0.92) | 1.27 (0.84, 1.96) | 1.01 (0.95, 1.06) |
| rs10985567 (A/G) | cg03363289 | 0.23 | 4.30E-03 | GG: 1.17 (1.06, 1.28); AG/GA: 0.98 (0.78, 1.23); AA: 0.83 (0.58, 1.17) | 0.84 (0.74, 0.96) | 1.27 (0.84, 1.96) | 1.03 (0.97, 1.09) |

SNPs (in region chr9:124962877-124992824, GRCh37/hg19) were selected for presentation that reside in the
central haplotype block and have negative loadings greater than 0.5 in absolute value. Models were adjusted for
gender, ancestry/ethnicity, family history of asthma, age at methylation measure, and education level. CpG sites
clustered near the CpG island toward the right of the gene as shown in Figure 29. ORs reflect a 5% change in
methylation.
(a) Minor Allele Frequency
Table 16
Tests of multiplicative interactions (P < 0.05) within LOC101928523 for SNP-CpG pairs.

| Methylation difference | SNP (Minor/Major alleles) | Methylation | MAF (a) | P-Value Interaction | OR Meth given SNP (95% CI) | OR Interaction (95% CI) | OR SNP (95% CI) | OR Meth. (95% CI) |
|---|---|---|---|---|---|---|---|---|
| 5% | rs13301641 (T/C) | cg07956857 | 0.32 | 1.28E-04 | CC: 1.24 (1.00, 1.53); TC/CT: 0.82 (0.54, 1.27); TT: 0.55 (0.29, 1.05) | 0.67 (0.54, 0.83) | 1.03 (0.76, 1.38) | 0.92 (0.79, 1.06) |
| 5% | rs11792474 (G/A) | cg07956857 | 0.32 | 1.79E-04 | AA: 1.23 (0.99, 1.52); GA/AG: 0.83 (0.54, 1.27); GG: 0.56 (0.30, 1.06) | 0.68 (0.55, 0.84) | 1.02 (0.76, 1.37) | 0.92 (0.79, 1.06) |
| 5% | rs75088949 (A/G) | cg07956857 | 0.16 | 7.52E-03 | GG: 0.81 (0.68, 0.96); AG/GA: 1.15 (0.74, 1.77); AA: 1.62 (0.81, 3.24) | 1.42 (1.09, 1.83) | 0.94 (0.63, 1.39) | 0.92 (0.79, 1.06) |
| 1% | rs13301641 (T/C) | cg13488921 | 0.32 | 1.35E-02 | CC: 0.80 (0.58, 1.12); TC/CT: 1.16 (0.62, 2.19); TT: 1.68 (0.66, 4.27) | 1.45 (1.07, 1.95) | 1.02 (0.76, 1.38) | 1.11 (0.89, 1.37) |
| 1% | rs11792474 (G/A) | cg13488921 | 0.32 | 1.45E-02 | AA: 0.81 (0.58, 1.13); GA/AG: 1.16 (0.62, 2.19); GG: 1.67 (0.66, 4.24) | 1.44 (1.07, 1.94) | 1.02 (0.76, 1.37) | 1.11 (0.89, 1.37) |

SNPs are located on chr9:106760350-106764737 with loadings > 0.5 in absolute value. ORs reflect a 5% or
1% change in methylation as indicated. Models were adjusted for gender, ancestry/ethnicity, family history
of asthma, age at methylation measure, and education level. A 1% difference in methylation is used to
estimate effects where measures fall into a narrower range, and a difference of 5% would not be meaningful
because it falls outside the available data.
(a) Minor Allele Frequency
Table 17
Tests of multiplicative interactions (P < 0.05) within SCARNA18 for SNP-CpG pairs.

| Methylation difference | SNP (Minor/Major alleles) | Methylation | MAF (a) | P-Value Interaction | OR Meth given SNP (95% CI) | OR Interaction (95% CI) | OR SNP (95% CI) | OR Meth. (95% CI) |
|---|---|---|---|---|---|---|---|---|
| 1% | rs67216017 (G/GA) | cg14999833 | 0.27 | 2.84E-05 | GA, GA: 0.96 (0.89, 1.03); GA, G/G, GA: 1.16 (0.98, 1.38); GG: 1.42 (1.08, 1.86) | 1.22 (1.10, 1.35) | 0.83 (0.58, 1.18) | 1.04 (0.98, 1.10) |
| 1% | rs113665237 (A/AT) | cg14999833 | 0.27 | 2.84E-05 | AT, AT: 0.96 (0.89, 1.03); AT, A/A, AT: 1.16 (0.98, 1.38); AA: 1.42 (1.08, 1.86) | 1.22 (1.10, 1.35) | 0.83 (0.58, 1.18) | 1.04 (0.98, 1.10) |
| 1% | rs10061690 (A/G) | cg14999833 | 0.27 | 2.84E-05 | GG: 0.96 (0.89, 1.03); AG/GA: 1.16 (0.98, 1.38); AA: 1.42 (1.08, 1.86) | 1.22 (1.10, 1.35) | 0.83 (0.58, 1.18) | 1.04 (0.98, 1.10) |
| 5% | rs67216017 (G/GA) | cg20697188 | 0.27 | 1.08E-02 | GA, GA: 0.82 (0.66, 1.02); GA, G/G, GA: 1.14 (0.71, 1.81); GG: 1.57 (0.77, 3.20) | 1.38 (1.08, 1.77) | 0.82 (0.58, 1.18) | 0.95 (0.80, 1.13) |
| 5% | rs113665237 (A/AT) | cg20697188 | 0.27 | 1.08E-02 | AT, AT: 0.82 (0.66, 1.02); AT, A/A, AT: 1.14 (0.71, 1.81); AA: 1.57 (0.77, 3.20) | 1.38 (1.08, 1.77) | 0.82 (0.58, 1.18) | 0.95 (0.80, 1.13) |
| 5% | rs10061690 (A/G) | cg20697188 | 0.27 | 1.08E-02 | GG: 0.82 (0.66, 1.02); AG/GA: 1.14 (0.71, 1.81); AA: 1.57 (0.77, 3.20) | 1.38 (1.08, 1.77) | 0.82 (0.58, 1.18) | 0.95 (0.80, 1.13) |

SNPs reside in region chr5:82358023-82362156 with loadings > 0.5 in absolute value. ORs reflect a 5% or 1%
change in methylation as indicated. Models were adjusted for gender, ancestry/ethnicity, family history of asthma,
age at methylation measure, and education level. A 1% difference in methylation is used to estimate effects where
measures fall into a narrower range, and a difference of 5% would not be meaningful because it falls outside the
available data.
(a) Minor Allele Frequency
6.3. Application of the generalized CASI approach (gCASI) to
ABRIDGE data
We apply a non-parametric method for testing set interaction to a data set consisting of
measures of gene expression in subjects with varying levels of asthma severity phenotypes
(asthma control phenotypes, in our application). We begin by identifying associations
between genes through protein activities. This entails utilizing information from a
repository of known and predicted protein-protein interactions called STRING
(http://www.string-db.org). The protein relationships documented include direct
(physical) and indirect (functional) associations. Numerous sources are utilized to build
evidence of their existence, including repositories for results from experimental and
computational prediction studies, and public text collections (text mining). Information is
systematically extracted from PubMed by searching for recurrent co-occurrence of gene
names in abstracts. This search relies on gene names and synonyms parsed from SwissProt
(curated protein sequence database with a description of the function of a protein, its
domain structure, post-translational modifications, variants, a minimal level of redundancy
and a high level of integration with other databases) as well as from organism-specific
databases, and a “benchmarked scoring system based on the frequencies and distributions
of gene names in abstracts" (von Mering, et al., 2005).
Linked genes are identified based on results from published genetic screening (techniques
used to select for individuals who possess a phenotype of interest in a mutagenized
population) and Microarray/RNAseq experiments for detecting differentially expressed
genes. Protein-protein associations are also derived from functional genomics data; co-
regulation of genes across diverse experimental conditions, as measured by using
microarray analysis, can be a predictor of functional associations. Each interaction is
assigned a combined confidence score that integrates various sources of evidence. The
scores for pairs of genes in STRING “correspond to the probability of finding the linked
proteins within the same KEGG pathway" (von Mering, et al., 2005). Although STRING covers a
variety of organisms, we focus on gene relationships relevant in humans.
We utilize the STRINGdb R package to access the STRING database from R and take
advantage of its features and functionalities for identifying pairwise correlations and
plotting protein-protein interaction networks. The STRINGdb package is implemented with
R “reference classes", a coding mechanism in which a method is invoked on an object,
rather than the more familiar R idiom in which a method is applied via a function with
the object passed as an argument. In R syntax we use “$" for
operations such as mapping to STRING ids or retrieving interactions. For example, one
invokes a method for identifying interactions, get_interactions, on an object string_db,
presented in the discussion of analysis steps below, by the expression
string_db$get_interactions.
We start the data analysis by instantiating the STRINGdb reference class object in R using
the syntax string_db<-STRINGdb$new(version="10", species=9606, score_threshold=900).
This constructs the object string_db with the specified STRING version for the human species
and a threshold of 900 (from a range of 0-1000) for the combined score of the interactions,
such that any interaction below that threshold is not loaded in the object. The combined
score is based on 13 measures of association, including homology related factors such as
conserved genomic neighborhood, gene fusion events, and co-occurrence of genes across
genomes of multiple species. Additionally, genomic context predictions, high-throughput
lab experiments, conserved co-expression, automated text-mining, and previous
knowledge in databases are factored into the combined score. The combined score is equal to
or higher than the individual sub-scores, reflecting increased confidence when an
association is supported by several types of evidence. It is computed under the assumption
of independence for the various sources, in a “naive Bayesian fashion” (von Mering, et al.,
2005), with a simple expression from the constituent scores $C_i$:
$S_{\text{combined}} = 1 - \prod_i (1 - C_i)$. It is possible to produce an image of the STRING network for any particular
gene set using the “plot_network” method. The image shows the genes and how they may be
functionally related. At the top of the plot, a p-value is inserted that represents the
probability of observing the same or a greater number of interactions by chance.
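As an illustration of the combined score expression, the sketch below computes one minus the product of the complements of the constituent scores. It assumes the constituent scores are expressed as probabilities between 0 and 1 (the thresholds quoted in the text correspond to these values multiplied by 1000); the function name and the example scores are hypothetical, and any further corrections applied by STRING itself are not reproduced.

```python
from math import prod


def combined_score(sub_scores):
    """Combine independent evidence scores: 1 - prod(1 - C_i).

    Each sub-score is assumed to be a probability in [0, 1].
    """
    if any(not 0.0 <= s <= 1.0 for s in sub_scores):
        raise ValueError("sub-scores must lie between 0 and 1")
    return 1.0 - prod(1.0 - s for s in sub_scores)


# Hypothetical constituent scores for one protein pair.
example_scores = [0.6, 0.7, 0.2]
score = combined_score(example_scores)

# The combined score is never lower than the largest constituent score.
print(f"combined = {score:.3f}, max sub-score = {max(example_scores):.3f}")
print("passes threshold of 900:", round(score * 1000) >= 900)
```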
Indexing through the entire list of gene sets from the Molecular Signatures Database
(MSigDB), as the next step we map the gene names to the STRING database identifiers
(STRING_id) using the “map” method. The map method adds a column of
STRING identifiers for the gene symbols to the data frame that is passed as the first argument. A
warning is expected to be printed showing the percentage of genes that failed to map. The
method applied next, “get_interactions", with the mapped STRING_ids as input, retrieves
pairs of STRING_ids that meet the confidence criterion of a score greater than or equal
to 900. The values of the expression products calculated for these pairs from the data are then
included in the sparse canonical correlation procedure to test for network perturbation.
6.3.1. CAMP Expression Data
The Childhood Asthma Management Program was designed to evaluate whether
continuous, long-term treatment (over a period of four to six years) with either an inhaled
corticosteroid (budesonide) or an inhaled non-corticosteroid drug (nedocromil) safely
produces an improvement in lung growth as compared with treatment for symptoms only
(with albuterol and, if necessary, prednisone, administered as needed) (Group, 1999).
The primary outcome in the study was lung growth, as assessed by the change in forced
expiratory volume in one second (FEV1, expressed as a percentage of the predicted value)
after the administration of a bronchodilator. Secondary outcomes included the degree of
airway responsiveness, morbidity, physical growth, and psychological development. Anti-
inflammatory therapies, such as inhaled corticosteroids or nedocromil, are recommended
for children with asthma, although there is limited information on their long-term use.
6.3.2. ABRIDGE Expression Data
The Asthma BioRepository for Integrative Genomic Exploration (Asthma BRIDGE) is a
NIH/NHLBI-supported initiative to develop a publicly accessible resource consisting of
lymphoblastoid cell lines from asthmatics and controls participating in genetic studies of
asthma (Raby, et al., 2011). The project includes collection of genome-wide data.
6.3.3. Identifying Gene Sets and Gene Interactions
Candidate genes for network perturbation analysis were derived from a curated collection
totaling 4726 gene sets, designated C2, from the Molecular Signatures Database (MSigDB).
Gene sets in C2 are obtained from various sources including “online pathway databases,
publications in PubMed, and knowledge of domain experts" (von Mering, et al., 2005).
It comprises several sub-collections. These are sets
tested for chemical and genetic perturbations (CGP), and those abstracted from canonical
pathways (CP), BioCarta (CP: BIOCARTA), Kyoto Encyclopedia of Genes and Genomes
(CP:KEGG), and Reactome (CP:REACTOME) databases. CGP gene sets represent results
from expression association studies, as documented in PubMed articles, including those
comprising genes induced and repressed by perturbation. CP gene sets are taken from the
pathway databases, “canonical representations of a biological process compiled by domain
experts" (von Mering, et al., 2005). CP: BIOCARTA are derived from the BioCarta pathway
database (http://www.genecarta.com). CP:KEGG are taken from the KEGG pathway
database (http://www.genome.jp/kegg/pathway.html). CP:REACTOME are from the
reactome pathway database (http://www.reactome.org/).
The R package GSA enables import of this C2 gene set collection, available on the GSEA website
(http://software.broadinstitute.org) in the form of a .gmt file, using the GSA.read.gmt function,
which creates a list of vectors containing gene symbol names. These are then matched to
“nuid” designations in the “ExpressionSet" object containing the expression data set for
analysis. A separate “ExpressionSet” was made available for each study group. To give
meaning to these vectors of genes we reference STRING (http://www.string-db.org), a
database of functional protein association networks.
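To make the .gmt import step concrete, here is a minimal Python sketch of reading that format. It is not the GSA package; it assumes the usual .gmt layout of one gene set per line with tab-separated fields (set name, description, then gene symbols). The function name and the two example sets are hypothetical.

```python
def read_gmt(lines):
    """Parse .gmt-formatted lines into a dict: set name -> list of gene symbols."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            # Skip blank or malformed lines (name, description, >= 1 gene expected).
            continue
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


# Hypothetical two-line .gmt content, used in place of a real file.
example_lines = [
    "HYPOTHETICAL_SET_A\thttp://example.org/a\tGENE1\tGENE2\tGENE3\n",
    "HYPOTHETICAL_SET_B\tna\tGENE2\tGENE4\n",
]
gene_sets = read_gmt(example_lines)
for name, genes in gene_sets.items():
    print(name, genes)
```

With a real file, the same function could be applied to an open file handle, since iterating over a text file yields its lines.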
6.3.4. Gene Networks
The Molecular Signatures Database (MSigDB) is a collection of annotated gene sets created
for use in gene set enrichment analysis (GSEA). They are typically employed where the goal
is to determine whether the genes behave in a concordant manner in how they shift
between biological states across phenotype levels. Given an appropriate method it is
possible to evaluate if such variability is statistically significant.
The STRING database aims to enable access to information about a collection of protein
associations. “The associations are derived from high throughput experimental data, from
the mining of databases and literature, and from predictions based on genomic context
analysis. STRING integrates and ranks these associations by benchmarking them against a
common reference set (von Mering, et al., 2005)”, and provides a combined confidence
score for interaction.
Interaction is studied here because we believe a considerable portion of heritability may be
accounted for by gene-gene associations. We propose a set interaction approach to address
both the heavy multiple-testing burden and the separate question of computational
feasibility posed by the sheer number of pairwise combinations. We argue that interaction can
be evaluated as the difference in correlation across phenotype levels, an approach favored in
previous publications (Li, et al., 2015; Peng, et al., 2010; Rajapakse, et al., 2012; Yuan, et al.,
2012).
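The idea of interaction as a difference in correlation across phenotype levels can be illustrated for a single pair of variables. The sketch below is a simplification under stated assumptions: it contrasts ordinary Pearson correlations between cases and controls for one pair, whereas the set-based approach described in this work operates on linear combinations of sets of variables. All data values and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_contrast(x_cases, y_cases, x_controls, y_controls):
    """Difference in correlation between cases and controls for one pair."""
    return pearson(x_cases, y_cases) - pearson(x_controls, y_controls)


# Hypothetical expression values for two genes.
x_cases = [1.0, 2.0, 3.0, 4.0, 5.0]
y_cases = [2.1, 3.9, 6.2, 8.1, 9.8]       # strongly correlated in cases
x_controls = [1.0, 2.0, 3.0, 4.0, 5.0]
y_controls = [5.0, 3.0, 6.0, 2.0, 4.0]    # weakly correlated in controls

contrast = correlation_contrast(x_cases, y_cases, x_controls, y_controls)
print(f"correlation contrast (cases - controls): {contrast:.2f}")
```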
The application of gCASI is motivated by the aim of preventing asthma exacerbation through
the maintenance of optimal asthma control, and by the desire for a better understanding of the
biological underpinnings of exacerbation in order to create more targeted approaches to
effective asthma control. The goal here is to relate differences in asthma control to
interactions in gene expression. We emphasize the importance of gene sets, that is, specific
genes identified as belonging to sets that are largely non-overlapping with each other. We use two
data sets, measures of gene expression collected from whole blood (WB) in the Asthma
BioRepository for Integrative Genomic Exploration (ABRIDGE) and the Childhood Asthma
Management Program (CAMP), to study perturbation of networks of genes related to
asthma responses. Cellular and transcriptional responses during acute asthma
exacerbation are highly complex, and the gene networks perturbed during acute
exacerbation may be different from those that destabilize asthma control and trigger
exacerbation as part of “disrupting daily activities". To determine the genomic basis of
asthma control we performed a large-scale gene expression interaction phenotype
association analysis of asthmatic individuals in two cohorts, from Asthma BRIDGE and
CAMP.
With available clinical phenotype data, we utilized two asthma control scores. The six-month
chronic score is based on four questions about asthma symptom frequency over the previous
six months, covering rescue therapy use, sleep interference, and activity limitation. The
seven-day acute score covers sleep interference, activity limitation, reactive rescue therapy
use, and preventive rescue therapy use. Discovery analysis was first conducted using whole blood (WB) expression
profile data from an Asthma BRIDGE cohort (n = 245). Subsequent replication was conducted in
the independent, non-overlapping CAMP dataset of WB profiles (n = 604). This effort
resulted in the identification and replication of gene sets significantly associated with
variation in acute asthma control, but probably not chronic control. These results
confirmed the important role of certain types of genes in asthma control.
We consider application of the new approach, termed generalized CASI (gCASI), to
ABRIDGE and CAMP asthma control data. The application uses gene sets from the
Molecular Signatures Database (MSigDB). Using the STRINGdb package in R, the associations
between genes can be determined. Expression data are correlated with asthma control
scores. We test how interactions between genes are affected by asthma control
phenotypes. We aim to show that the approach is useful in detecting set interaction. Furthermore,
we conceive of its use as a diagnostic tool as the ultimate objective.
Figure 31. Ancestry distribution of study participants.
6.3.5. Application of gCASI to ABRIDGE WB Expression Data
Four types of asthma control events were studied: sleep awakenings due to cough or
wheezing, coughing or wheezing due to exercise, coughing or wheezing unrelated to
exercise, and bronchodilator use. Each control variable had five ordered levels: (1) “never",
(2) “at least once", (3) “at least once a month", (4) “at least once a week", and (5) “almost
every night".
Figure 32. Eight gene sets were identified at an FDR of 0.16, CI = (0.05, 0.53).
Figure 33. Distribution of phenotype from the results.
Table 18
Top ranked gene sets and corresponding phenotypes associated with network perturbation.
| Gene Set Rank | Gene Set Name | Associated Phenotype |
|---|---|---|
| 1 | NAKAJIMA EOSINOPHIL | Wake due to cough/wheeze in the previous six months |
| 2 | ZHAN MULTIPLE MYELOMA CD2 DN | Albuterol for asthma in the previous six months |
| 3 | LEE NEURAL CREST STEM CELL DN | Cough/wheeze due to exercise in the previous six months |
| 4 | SHEDDEN LUNG CANCER GOOD SURVIVAL A12 | Albuterol for asthma in the previous six months |
| 5 | KENNY CTNNB1 TARGETS DN | Albuterol for asthma in the previous six months |
| 6 | DUTERTRE ESTRADIOL RESPONSE 6HR UP | Cough/wheeze due to exercise in the previous six months |
| 7 | KEGG T CELL RECEPTOR SIGNALING PATHWAY | Wake due to cough/wheeze in the previous six months |
| 8 | REACTOME G2 M CHECKPOINTS | Cough/wheeze due to exercise in the previous six months |
Nakajima Eosinophil network involves mast cells and eosinophils which are believed to
play important roles in evoking allergic inflammation (Nakajima, et al., 2001). Zhan
Multiple Myeloma CD2 DN was found to be involved in defining the molecular basis of
multiple myeloma (Zhan, et al., 2006). Lee Neural Crest Stem Cell DN is understood to
isolate and direct differentiation of neural crest stem cells derived from human embryonic
stem cells (Lee, et al., 2007). Shedden Lung Cancer Good Survival A12 has a role in abetting
gene expression-based survival prediction in lung adenocarcinoma for a multi-site, blinded
validation study (Director's Challenge Consortium for the Molecular Classification of Lung,
et al., 2008). In a study of receptor and secreted targets of Wnt-1/beta-catenin signalling in
mouse mammary epithelial cells, Kenny CTNNB1 Targets DN network was found to play a
role, leading to discovery of HIG2 as a novel non-cell autonomous target of the Wnt
pathway which is potentially involved in human cancer (Kenny, Enver and Ashworth,
2005). Dutertre Estradiol Response 6HR UP was found to be a link in estrogen regulation
and associated with physiopathologically significant alternative promoter in breast cancer
(Dutertre, et al., 2010). KEGG T Cell Receptor Signaling Pathway is implicated in activation
of T lymphocytes, a key event for an efficient response of the immune system (Kanehisa, et
al., 2006). Reactome G2 M Checkpoints contains genes involved in G2/M Checkpoints
(Elledge, 1996). G2/M checkpoints include the checks for damaged DNA, un-replicated
DNA, and checks that ensure that the genome is replicated once and only once per cell
cycle.
7. Results
7.1. Simulations
7.1.1. CASI as a Method for Detecting Interaction by a Contrast in Set
Correlations
We proposed a permutation-based approach and utilized an FDR of 0.05 as the significance
threshold for calculating power in simulations and for application to real data.
This was done in lieu of using a threshold that corresponds to type 1 error, which would
typically be employed in a parametric setting. For a procedure involving FDR as a
significance measure, we recorded, for a range of effect and sample sizes, the number of true
positive tests that could be identified. This allowed us to quantify the power of our test from the
results of the simulation. We illustrated this with plots, demonstrating that at various effect
sizes and corresponding sample sizes we can detect a difference in set correlation between
cases and controls at near 100% power. We provided a summary of the simulation in the
form of a plot depicting the power analysis. This also showed consistency, that is, the power of the
CASI test for a fixed non-null hypothesis increases to one as the number of data items
increases.
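The relationship between a permutation-based FDR threshold and simulated power can be sketched as follows. This is a deliberately simplified illustration, not the estimator used in this work: it assumes the FDR at a threshold is estimated as the average number of permuted statistics at least as extreme as the threshold divided by the number of observed statistics at least as extreme, and it omits refinements such as adjustment for the proportion of true nulls and confidence intervals. All statistics, labels, and function names are hypothetical.

```python
def permutation_fdr(observed, permuted_sets, threshold):
    """Simplified permutation FDR estimate at a given threshold."""
    n_observed = sum(1 for s in observed if s >= threshold)
    if n_observed == 0:
        return 0.0  # no positives called, so no false discoveries by convention
    mean_null = sum(
        sum(1 for s in perm if s >= threshold) for perm in permuted_sets
    ) / len(permuted_sets)
    return min(1.0, mean_null / n_observed)


def empirical_power(observed, is_true_effect, threshold):
    """Fraction of simulated true effects whose statistic reaches the threshold."""
    true_stats = [s for s, flag in zip(observed, is_true_effect) if flag]
    return sum(1 for s in true_stats if s >= threshold) / len(true_stats)


# Hypothetical simulated statistics: the first four tests carry a true effect.
observed = [5.1, 4.8, 4.6, 1.2, 0.9, 2.0, 0.4, 1.7]
is_true_effect = [True, True, True, True, False, False, False, False]
# Hypothetical statistics from two permutations of the phenotype labels.
permuted_sets = [
    [1.1, 0.7, 4.5, 1.9, 0.3, 2.2, 0.8, 1.0],
    [0.6, 1.4, 0.9, 2.1, 1.3, 0.2, 1.8, 0.5],
]
threshold = 4.4

fdr = permutation_fdr(observed, permuted_sets, threshold)
power = empirical_power(observed, is_true_effect, threshold)
print(f"estimated FDR = {fdr:.3f}, empirical power = {power:.2f}")
```

In a simulation, the threshold would be chosen as the least extreme value at which the estimated FDR stays at or below the target (0.05 here), and power would then be read off at that threshold.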
Power plots were provided for sample sizes 50, 100, 150, 200, and 250. All indicated that
for larger samples we can expect to reach power of 100% at smaller effect sizes. Therefore,
we were able to deduce that as sample size increases consistency of the test is
substantiated at smaller effect sizes. We showed this to hold at least under the
assumption of a multivariate normal distribution and contend that under the conditions in which
CASI would typically be applied its effectiveness will not deviate substantially, given the
strong sensitivity of the test to small effect sizes, provided it is applied to studies with large
samples.
In addition, we presented FDR plots, snapshots from the entire range of simulations,
to better explain how the effectiveness of our method was evaluated. This was demonstrated for
a sample size of 250, where the number of true positive tests was shown to be identifiable
from a series of FDR estimates, which gave us a measure of the power of our test.
7.1.2. CASI as a Method for Detecting Multiplicative Interaction
We also evaluated the CASI method for its ability to detect multiplicative interaction. This
approach also allowed comparison of the power characteristics of set-based methods and
conventional logistic regression over a range of sample sizes. The results were illustrated for
sample sizes of 2000, 2500, 3000, 3500, and 4000. We demonstrated the superiority of CASI over
other set-based methods and over conventional logistic regression. We also showed that
for our approach and some other set-based methods power increases with increasing
sample size. CASI achieved the best outcome across the series of sample sizes, nearing a
power of 0.2 for a sample size of 2000 per phenotype category (case vs. control) and 0.3 for a
sample size of 2500 at an FDR of 0.05. The top-ten version of CASI
(CASI_top10) and CLD, the method of Rajapakse et al. (2012), were
the next best alternatives. The CCA approach of Peng et al. (2010), labeled CCA (Peng), appeared to be
the worst of the set-based methods at identifying interaction, with the lowest
power, much less than 0.1 at an FDR of 0.05 for a sample size of 2000. CCA is given special
attention because it is distinguished from the CASI approach by its focus on case-only
data. We point out the results at lower sample sizes because of the frequent difficulty
of gathering a large enough study group. The CLD approach chosen for comparison,
proposed by Rajapakse et al. (2012), looks at a contrast in correlation between cases and
controls using a quadratic distance-based method, a metric that quantifies the distance
between covariance matrices (Rajapakse, et al., 2012). It had inferior power
compared to CASI consistently across all sample sizes at low FDR and only matched it at
unacceptably high FDR at a few sample sizes.
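The general idea of contrasting the correlation between two variable sets across cases and controls, and judging the contrast by permuting phenotype labels, can be sketched as follows. This is a generic stand-in written for illustration; it is neither the CASI statistic nor the quadratic distance of Rajapakse et al. (2012). The statistic (a sum of squared differences between cross-correlation matrices) and all function names are assumptions.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cross_corr(rows, x_idx, y_idx):
    """Correlate every column in x_idx with every column in y_idx."""
    cols = list(zip(*rows))
    return [[pearson(cols[i], cols[j]) for j in y_idx] for i in x_idx]


def contrast_stat(cases, controls, x_idx, y_idx):
    """Sum of squared differences between case and control cross-correlations."""
    r_case = cross_corr(cases, x_idx, y_idx)
    r_ctrl = cross_corr(controls, x_idx, y_idx)
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(r_case, r_ctrl)
        for a, b in zip(row_a, row_b)
    )


def permutation_p(cases, controls, x_idx, y_idx, n_perm=200, seed=0):
    """Permutation p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = contrast_stat(cases, controls, x_idx, y_idx)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = contrast_stat(pooled[:n_cases], pooled[n_cases:], x_idx, y_idx)
        if stat >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Each row is one individual and x_idx and y_idx select the columns belonging to the two sets; a small p-value indicates that the between-set correlation differs by phenotype more than label shuffling would explain.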
In performing the simulation, we varied the interaction odds ratio and observed the proportion of
non-null genes detected at low values of FDR, that is, the calculated power. We observed that
an interaction OR of less than 1.2 was not detectable given the assumptions and data type
we employed. Because the simulation results relate sample size to power, they can be used for
inference in study planning: we can decide how many
individuals to add to a study to reach the desired power.
This is exemplified by the plots generated from this simulation. The figures illustrate
that power is expected to improve as the number of replicates increases. The
larger the increase, the more striking the gain in power, and we could expect something
similar in studies with real data.
We explain the effect by looking at conditional ORs in the underlying logistic regression. The
most likely events given an interaction OR of 1.2 correspond to conditional ORs of 1.025,
1.050625, and 1.23, with probabilities of occurrence of 0.78 for the first two and 0.16 for the
last, assuming an average MAF. This suggests that a higher interaction OR could also
be identified with the CASI approach. This simulation assumed interacting pairs of sets of SNPs.
In the literature, genome-wide association studies of common diseases have
identified many related SNPs that reach highly significant p-values, with most ORs
less than 1.5 and many less than 1.2 (Hodge and Greenberg, 2016). As it relates to
asthma, Ferreira et al. (2011) found evidence suggesting that the 11q13.5 locus was
associated with allergic asthma with an OR of 1.33 (p = 7 × 10⁻⁴), and confirmed that SNPs in the
IL6R and LRRC32 genes are associated with asthma risk at an OR of 1.09 in an analysis that
combined multiple studies (p = 2.4 × 10⁻⁸). Therefore, ORs in the range of the conditional
and interaction thresholds underlying the simulation are plausible, as demonstrated by
studies of real data (Ferreira, et al., 2011).
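The three conditional ORs quoted above can be reproduced arithmetically: 1.025 squared is 1.050625, and 1.025 × 1.2 is 1.23. The sketch below shows one reading that yields these values, assuming a logistic model with a per-allele main-effect OR of 1.025 for each SNP and an interaction OR of 1.2. The main-effect value is inferred from the quoted numbers rather than stated in this section, and the function names are hypothetical.

```python
MAIN_OR = 1.025        # assumed per-allele main-effect OR (inferred, not stated)
INTERACTION_OR = 1.2   # interaction OR used in the simulation


def odds_ratio(g1, g2, main_or=MAIN_OR, interaction_or=INTERACTION_OR):
    """OR relative to genotype (0, 0) under the model
    logit(p) = b0 + b1*g1 + b2*g2 + b3*g1*g2,
    with exp(b1) = exp(b2) = main_or and exp(b3) = interaction_or."""
    return main_or ** (g1 + g2) * interaction_or ** (g1 * g2)


def conditional_or(g2, main_or=MAIN_OR, interaction_or=INTERACTION_OR):
    """Per-allele OR for SNP 1 with the SNP 2 genotype held fixed at g2."""
    return main_or * interaction_or ** g2


print(round(odds_ratio(1, 0), 6))    # one minor allele at one SNP
print(round(odds_ratio(2, 0), 6))    # two minor alleles at one SNP
print(round(conditional_or(1), 6))   # SNP 1 effect when SNP 2 carries one allele
```

Under this reading the interaction only moves the OR away from the near-null main effects when both SNPs carry a minor allele, which is the less probable event at an average MAF.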
7.2. Application
7.2.1. CASI Analysis of CHS Data
We identified 10 gene pairs as significant at an FDR of 0.049. Five exhibited loadings greater
than 0.5 and were therefore considered good candidates for further exploration using a
more traditional approach, logistic regression. These were {CDK2, IL1RL1}, {CDK2,
DENND1B}, {DBX1, DENND1B}, {DENND1B, IL2RB}, and {HCG23, LPIN2}, ranked
1, 2, 7, 9, and 10, respectively. While one pair, {HCG23, LPIN2}, consistently appeared in the
top hits, we attempted to further characterize the other four gene pairs as well. SNPs
identified within those gene pairs were further evaluated for interaction in association with
asthma. Using logistic regression, we tested for multiplicative interaction between SNPs
that had loadings greater than 0.5. We did not find q-values, or the corresponding nominal
p-values, that reached the traditional significance level of 0.05. However, we believe that from the
set of top hits identified we can offer as candidates for further exploration a
selection of five SNP pairs, one from each of the five gene pairs we selected. We also note
that the smallest q-values came from the interacting genes HCG23 and LPIN2, which are therefore
proffered for further research.
Analysis of interaction between rs12817765 within CDK2 and rs11676124 in IL1RL1
determined that rs12817765 modifies the effect of rs11676124 towards risk in association
with asthma. We also ascertained that rs12817765 resides in a flanking region, possibly
within a regulatory sequence. A similar effect is observed for rs35624964 conditioned on
rs2069408, which reside in DENND1B and CDK2, respectively. Increasing minor allele dose
of rs2069408 directs rs35624964 towards being a risk factor for asthma.
rs2069408 is found close to the center of CDK2 whereas rs35624964 is on the edge of
DENND1B, where the regulatory region might reside. An interaction effect is not apparent
between rs6661330 in DENND1B and rs11606156 in DBX1. There is no continuous trend,
so effect modification is not perceived to be in one direction or another. Interaction
between rs228974 from IL2RB and rs78259147 from DENND1B shows the effect of
rs228974 on asthma phenotype to be modified from null to negative. rs228974 is found
close to the center of the gene whereas rs78259147 is at the edge of its gene. Finally, the
conditional relationship between rs16870123 in HCG23 and rs1164 in LPIN2 shows a
negative trend. For an rs16870123 value of 0, the relative likelihood of case vs. control in
relation to rs1164 leans toward an asthma case, but for values of 1 and 2 it reverses to favor
control status.
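Effect modification of the kind described in this paragraph can be inspected by computing the odds ratio for one SNP separately within each genotype of the other. The sketch below is a simplified illustration and not the regression used in this analysis: it dichotomizes SNP A into carriers and non-carriers, adds 0.5 to every cell of each 2 × 2 table to avoid division by zero, and uses hypothetical names throughout.

```python
from collections import defaultdict


def stratified_odds_ratios(records):
    """Odds ratio of case status for carriers vs. non-carriers of SNP A,
    computed separately within each genotype (0, 1, 2) of SNP B.

    records: iterable of (is_case, dose_a, dose_b) tuples.
    A continuity correction of 0.5 is added to every cell.
    """
    # counts[dose_b][carrier][is_case]
    counts = defaultdict(lambda: [[0, 0], [0, 0]])
    for is_case, dose_a, dose_b in records:
        counts[dose_b][int(dose_a > 0)][int(bool(is_case))] += 1

    result = {}
    for dose_b, table in sorted(counts.items()):
        carrier_cases = table[1][1] + 0.5
        carrier_controls = table[1][0] + 0.5
        noncarrier_cases = table[0][1] + 0.5
        noncarrier_controls = table[0][0] + 0.5
        result[dose_b] = (carrier_cases * noncarrier_controls) / (
            carrier_controls * noncarrier_cases
        )
    return result
```

An odds ratio below one in the stratum with no minor alleles of SNP B and above one in the other strata would correspond to the pattern in which the second SNP turns the first into a risk factor.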
The 10 gene pairs identified as significant by the CASI method included five that did not
exhibit loadings within their constituent SNPs greater than 0.5 and were not included in
regression. These were {FLG, IKZF4}, {HLA-DRA, LRRC32}, {IL5RA, PRMT3}, {MGC45800,
PRMT3}, and {LOC284661, TENM3} ranked 3, 4, 5, 6, and 8, respectively. These results may
provide insights into the genetic architecture of asthma and serve as a basis for future
functional and mechanistic studies. In the following we provide a review of the genes we
identified using the CASI procedure, including what is known about them in the literature.
HCG23 (HLA Complex Group 23), an RNA gene affiliated with the non-coding RNA class,
has not been previously associated with asthma or related disease
pathways. HCG23 was paired for interaction with LPIN2. LPIN2 is a protein coding
gene, a key effector in the biosynthesis of lipids. Mutations in the human LPIN2 gene are
associated with inflammatory-based disorders. Depletion of LPIN2 promotes increased
expression of pro-inflammatory genes, while overexpression of LPIN2 reduces the
release of pro-inflammatory factors (Valdearcos, et al., 2012). LPIN2 has also been
implicated in auto-inflammatory bone disorders, hereditary chronic inflammatory
disorders in which bone is the primary inflammatory target (Ferguson and El-Shanti,
2007).
DENND1B (DENN Domain Containing 1B) is a protein coding gene located at the 1q31 locus.
DENND1B is expressed by natural killer cells and dendritic cells. Our analysis shows that it
interacts with CDK2, DBX1, and IL2RB. A genome-wide association study mapped a variant
of DENND1B as one of the genetic risk factors for persistent asthma in North American
children of European and African ancestry (Sleiman, et al., 2010). A study involving CDK2
(cyclin-dependent kinase 2) indicated that resveratrol effectively suppressed the
proliferation of eosinophils from asthmatic patients by regulating protein expression
levels. These findings suggested that resveratrol may be a potential agent for the treatment
of asthma by decreasing the number of eosinophils through CDK2 (Hu, et al., 2016). DBX1
(Developing Brain Homeobox 1) is a protein coding gene. It has been shown in mice to be
necessary for breathing. Additionally, DBX1 neurons are respiratory modulated. Loss of
DBX1 eliminates all glutamatergic neurons from the respiratory ventrolateral
medulla (VLM). DBX1 mutant mice do not express any spontaneous respiratory behaviors
in vivo. DBX1-derived neurons are indicated to be essential for the expression of, and
responsible for the generation of, respiratory behavior both in vitro and in vivo (Gray, et al.,
2010).
IL2RB (Interleukin 2 Receptor Subunit Beta) is a protein coding gene. IL2RB has been
identified by GWAS as underlying asthma and relevant traits. The largest of these was from
the GABRIEL consortium, which discovered that the IL2RB locus was significantly associated
with asthma. This asthma gene is expressed within the respiratory epithelium,
emphasizing the importance of epithelial barriers in causing asthma (Zhang, et al., 2012). A
consortium-based genome-wide association study of asthma was applied to a study population
composed of varied subgroups. Tests of SNP rs2284033 in IL2RB reached significance for
association with childhood-onset asthma, later-onset asthma, severe asthma, and
occupational asthma (Moffatt, et al., 2010).
We also found evidence to suggest that CDK2 interacts with IL1RL1 (IL-1 receptor–like 1).
One group of authors proposes that genetic variation associated with asthma at the IL1RL1 locus
can be dissected into independent signals with distinct functional consequences for a
pathway that is central to asthma pathogenesis (Grotenboer, et al., 2013). Another study found that a
variant of rs10197862 in IL1RL1 that is in low linkage disequilibrium with that reported
previously was associated with asthma risk. This association replicated convincingly in an
independent cohort (Ferreira, et al., 2011). Investigators performed a meta-analysis of
North American genome-wide association studies of individuals with asthma, including
those of European American, African American or African Caribbean, and Latino ancestry.
IL1RL1 was reported to be associated with asthma risk in three of the ethnic groups. These
results suggest that IL1RL1, an asthma susceptibility locus, is robust to differences in
ancestry when sufficiently large sample sizes are investigated, and that ancestry-specific
associations also contribute to the complex genetic architecture of asthma (Torgerson, et
al., 2011). SNP rs1420101 at 2q12 near IL1RL1 reached genome-wide significance and was
found to be associated with asthma in a collection of ten different populations
(Gudbjartsson, et al., 2009). In the prospective birth cohort Prevention and Incidence of
Asthma and Mite Allergy (PIAMA), IL1RL1 SNPs were found to be associated with asthma
prevalence from birth to age 8 years, demonstrating that IL1RL1 polymorphisms are
associated with asthma in childhood (Savenije, et al., 2011). Late-onset wheeze was associated
with two IL1RL1 SNPs (rs10208293 and rs13424006), and IL1RL1 SNPs were nominally
associated with asthma. Both pathologic and genetic approaches support a role
for IL1RL1 in severe asthma, as well as TH2-like asthma, suggesting that targeting this
pathway may have therapeutic benefits (Traister, et al., 2015).
A GWAS of reactive chemicals known to be a common cause of occupational asthma
identified gene variants of TENM3 that could contribute to the pathogenesis of diisocyanate
asthma. Our analysis shows that TENM3 interacts with LOC284661. A strong association has been
found for SNP rs908084 near the TENM3 gene. In characterizing the TENM3 gene as a
potential susceptibility locus for diisocyanate asthma, the study also applied
pathway analysis, which indicated that it is associated with immune pathways (Yucesoy, et
al., 2015).
A genome-wide association study identified bi-allelic markers near MGC45800 involved in
disease susceptibility, including multiple sclerosis, a chronic inflammatory disease of the
central nervous system with autoimmune origin (Cavanillas, et al., 2011). Our analysis
shows that MGC45800 interacts with PRMT3. PRMT3 (Protein Arginine Methyltransferase 3) is a protein
coding gene. In the lung tissue of asthmatic rats, gene expression of PRMT3 was
significantly increased relative to control rats,
suggesting that PRMT3 plays an important role in the post-translational modification
process of asthma-related genes (Sun, et al., 2010). In another study aimed at identifying
PRMT3's involvement in pulmonary inflammation, a rat model for asthma was used. The
inhibition of PRMT3 by arginine methyltransferase inhibitor 1 (AMI-1), a pan-PRMT
inhibitor, in rats with Ag-induced pulmonary inflammation (AIPI) ameliorated pulmonary
inflammation, reduced IL-4 production and humoral immune response, and abrogated
eosinophil infiltration into the lungs (Sun, et al., 2012).
IL5RA (Interleukin 5 Receptor Subunit Alpha) is a protein coding gene. Our analysis shows
that it interacts with PRMT3. An investigation of the association of polymorphisms in IL5RA with
asthma susceptibility found IL5RA to be weakly associated with eosinophil count in
asthmatic patients. Patients with the minor allele of a SNP in IL5RA showed higher IL5RA
expression than those who were homozygous for the major allele. Single nucleotide
polymorphisms and haplotypes of IL5RA might serve as markers for
eosinophil-number phenotypes and help in designing strategies to control diseases such as
asthma (Lee, et al., 2007). Statistical analysis of DNA sequences of IL5RA revealed that one
promoter SNP and one insertion/deletion polymorphism in an intron showed significant
association with the risk of asthma development. These genetic effects on asthma were
more apparent among atopic subjects. The findings suggest that polymorphisms in IL5RA
might be among the genetic risk factors for asthma development, especially in atopic
populations. Another study identified an IL5RA variant/haplotype that provides valuable
information for strategies for the control of asthma (Cheong, et al., 2005).
Our study has identified interaction between HLA-DRA and LRRC32. Investigation of
associations of HLA-DRA polymorphisms with nasal polyposis in asthmatic patients
showed that SNPs and a haplotype were significantly associated with the presence of nasal
polyposis in asthmatic patients. Two HLA-DRA polymorphisms were found to be potential
markers for nasal polyp development in aspirin-tolerant asthma compared with the
aspirin-exacerbated respiratory disease subgroup. Their findings suggest that HLA-DRA
polymorphisms might contribute to nasal polyposis susceptibility in patients with asthma
(Kim, et al., 2012). A GWAS performed in the Chicago Asthma Genetics Study comparing
cases and controls found that the minor allele T of rs2395185 in HLA-DRA is the risk allele for
asthma. The analyses suggested the presence of SNPs in the gene associated with both asthma
and autoimmune diseases (Li, et al., 2012). A GWAS by Li et al. (2010) confirmed
the importance of additional investigation of the HLA-DRA region on chromosome 6p21.3
to delineate its structural complexity and biologic function in the development of asthma
(Li, et al., 2010). The gene HLA-DRA was shown to be associated with asthma in GWASs, and
also to be bio-ontologically enriched for attributes such as ‘molecular/signal transducer
activity’ and ‘immune system process’ (Melen and Pershagen, 2012). A pathway analysis
with the aim of identifying candidate causal mechanisms of asthma found candidate causal
SNPs which provided hypothetical biologic mechanisms involving HLA-DRA. By applying
pathway analysis to asthma GWAS data, the same study found candidate causal SNPs
involving HLA-DRA which may contribute to asthma susceptibility (Song and Lee, 2013).
Our set-based analysis showed that HLA-DRA interacts with LRRC32. A GWAS of
asthma sought to identify variants associated with the disease by comparing
persons with physician-diagnosed asthma with persons without the disease (Ferreira, et al.,
2014). A GWAS of physician-diagnosed asthmatics and controls conducted in Australia
aimed to identify novel genetic variants affecting asthma risk and to provide
novel insights into the molecular mechanisms underlying the disease. As follow-up, loci from
that study were tested and confirmed to associate with asthma risk in replication
cohorts, reaching genome-wide significance in the combined analysis of multiple studies.
Variants identified included rs7130588 on chromosome 11q13.5 near LRRC32. The
11q13.5 locus was significantly associated with atopic status among asthmatics, suggesting
that it is a risk factor for allergic asthma (Ferreira, et al., 2011). Data from Li et al. (2014)
indicated that C11orf30 is likely to be a functional gene in the LRRC32 region. They
identified two distinct mechanisms affecting asthma risk. Eosinophilic or atopic asthma is
associated with up-regulation of its expression, while airway hyperresponsiveness is
affected via mechanisms other than expression (Li, et al., 2014).
Results of one study suggest that FLG mutations are key organ specific factors
predominantly affecting the development of asthma (Weidinger, et al., 2008). Palmer et al.
(2007) argued that filaggrin mutations were significantly associated with
greater disease severity for asthma. Mean FEV1/forced vital capacity of FLG wild-type
individuals was higher than that of individuals carrying either FLG homozygous null allele. The
association of FLG null alleles with all markers of asthma disease severity was also found in
children. FLG mutations are associated with eczema-associated asthma susceptibility, but
also with asthma severity independent of eczema status. FLG status influences controller
and reliever medication requirements in children and young adults with asthma (Palmer,
et al., 2007). Other authors sought to provide a more detailed and conclusive estimate of the risk
for asthma associated with FLG null alleles. Case-control studies were heterogeneous,
whereas family studies yielded more homogeneous results. Their meta-analysis
summarized the strong evidence for FLG mutations as a robust risk factor for asthma
(Rodriguez, et al., 2009). Loss-of-function mutations in FLG were identified as risk factors
for asthma. Authors evaluated the utility of FLG mutations at an early age for the prediction
of asthma. In infants, FLG mutations predicted childhood asthma with a very high
positive predictive value (Marenholz, et al., 2009). Investigators found that FLG mutations
are also linked with later development of asthma. Our analysis provided evidence of
interaction between FLG and IKZF4. The IKZF4 gene encodes the EOS protein. Investigators
measured concentrations of EOS and found elevated levels in the subjects with asthma
compared with control subjects (Durham, et al., 1989). IKZF4s EOS degranulation occurs in
the airway of subjects with moderately symptomatic asthma. Both the FEV1 and the forced
expiratory flow at 50% (FEF50) in patients with symptomatic asthma were significantly
lower than the corresponding values for FEV1 and the FEF50 in the patients with
asymptomatic asthma. Levels and percent EOS were all significantly elevated in
symptomatic compared to asymptomatic patients with asthma (Broide, et al., 1991). They
concluded that there is an influx of active EOSs into the lung of pollen-allergic patients with
asthma during a pollen season (Rak, et al., 1991).
It is necessary to collect and arrange the numerous pieces we identified and learn how they
contribute to the jigsaw puzzle we call asthma, and there are studies which have
contributed to that effort (Matsumoto, Tamari and Saito, 2008).
7.2.2. CASI Analysis of ABRIDGE data
CASI identified 12 genomic regions with evidence of interactions between sets of SNPs and
sets of methylated CpGs in association with asthma. Of these 12 genes, 3 have previously
been implicated in asthma risk or underlying biological pathways related to pathology of
the disease, PF4 (Tutluoglu, et al., 2005), ATF3 (Gilchrist, et al., 2008; Marandi, Farahi and
Hashjin, 2013), and TPRA1 (Bautista, Pellegrino and Tsunozaki, 2013). Three genomic
regions not previously implicated in asthma, LOC101928523, SCARNA18, and LHX6,
contained the most statistically significant pairwise interactions between the considered
individual SNPs and CpGs.
LHX6 (LIM homeobox domain 6) has not been implicated in asthma, however it is a
recognized transcriptional regulator that controls the differentiation and development of
lymphoid cells (Liu, et al., 2013). LHX6 is known to be regulated epigenetically in lung
cancer, where in vitro and in vivo studies found that in normal lung tissue it is readily
expressed but down-regulated or silenced in lung cancer cells in which the gene is hyper-
methylated (Liu, et al., 2013). Other evidence of epigenetic regulation in the vicinity of
LHX6 has been found in head and neck squamous cell carcinomas (HNSCC) (Estécio, et
al., 2006). Similar to the lung cancer study, hyper-methylation of the CpG island in LHX6
was associated with transcriptional silencing of LHX6. These findings suggest that
differential methylation near LHX6 plays a role in lung biology and lends credence to a
potential role in asthma.
The three genes previously implicated in asthma pathogenesis are PF4, ATF3, and TPRA1.
PF4 (Platelet Factor 4) is a protein coding gene which functions as an inhibitor of T-cell
function (Liu, et al., 2005). PF4 activation in the lung is a feature of the late inflammatory
response to antigen challenge and may play an important role in allergic inflammation and
asthma (Averill, et al., 1992). Atf3 (Activating transcription factor 3) is a negative regulator
of allergic inflammation in mice challenged with ovalbumin (Roussel, et al., 2011) and
deficiency in mice leads to the development of significantly increased airway hyper-
responsiveness and pulmonary eosinophilia (Gilchrist, et al., 2008). Significant increases in
ATF3 mRNA have also been observed in patients with mild asthma as compared with non-
asthmatic patients (Roussel, et al., 2011). TPRA1 (transmembrane protein adipocyte
associated 1) is an irritant sensing cation channel expressed in TRPV1-positive, capsaicin-
sensitive chemosensory neurons that innervate various organs, including the airways
(Facchinetti and Patacchini, 2010). Various exogenous chemicals have been described to
activate TRPA1, including agents recognized to trigger and/or worsen asthma such as
diisocyanates, cigarette smoke, acrolein, and chlorine (Bautista, et al., 2013). A potential
role of TRPA1 in mediating allergen-induced asthmatic responses has been described in
ovalbumin-sensitized mice, in which genetic deletion of Trpa1 or pretreatment with a
selective Trpa1 antagonist reduced leukocyte infiltration, decreased cytokine and mucus
production, and almost completely abolished airway hyperactivity (Facchinetti and
Patacchini, 2010). Bessac and Jordt (2008) suggested that TRPA1 may function as an integrator
of immunological stimuli modulating inflammation in the airways (Bessac and Jordt,
2008). Furthermore, chemical irritant-induced activation of TRPA1 may trigger the release
of neuropeptides and chemokines in the airways, thereby exacerbating the cellular and
tissue inflammatory response observed in allergic individuals (Caceres, et al., 2009).
HOPX, OR10K1, UPK1B, CHMP4B, and LANCL1 were identified as significantly associated
with asthma in the tests for joint interactions but require further investigation into their
putative biological functions. There is scant prior evidence of a role for these genes in
pulmonary or immune function, lung disease, or asthma, although HOPX is involved in the
function of regulatory T cells (Jones, et al., 2015).
The CASI procedure is currently not able to explicitly adjust for potential confounding
covariates. However, we compensated by applying a logistic regression model that
included covariates to the same data in the follow-up analysis. Like many conventional
approaches, CASI is designed to detect linear associations, which limits its power for
non-linear relationships. However, the fact that we were able to identify 12 genes with
evidence of interactions using a sample size in the hundreds suggests that more
interactions may be detectable if the sample size is increased substantially. Though the
procedure requires considerable computational resources due to its reliance on
permutations, the required number of permutations is small enough to allow genome-wide
applications. The many challenges of satisfying parametric assumptions on a genome-wide
level would seem to justify the trade-off associated with computational expense of
permutations.
We demonstrated that simultaneous consideration of genomic and epigenomic variation
has the potential to identify genetic risk factors for asthma beyond individual GWAS studies
or epigenetic screens. These results add to existing evidence suggesting a synergy between
genomic and epigenomic variation affecting risk of asthma.
7.2.3. gCASI Analysis of ABRIDGE and CAMP Expression Data
The gCASI procedure was able to identify eight gene sets at an FDR significance level of
0.16 (CI: [0.05, 0.53]). We chose to focus on the highest ranked gene set, the one with the largest
value of the test statistic, Nakajima Eosinophil. The sparse canonical correlation procedure
that underlies our method selected “wake due to cough/wheeze in the previous six
months” as the phenotype involved in perturbation of this gene set. The Nakajima Eosinophil
network encompasses mast cells and eosinophils, which are believed to play an important
role in evoking allergic inflammation (Nakajima, et al., 2001). Allergic inflammation studies
have been able to characterize biologic pathways that lead to initiation and maintenance of
asthma and implicate dysregulated interactions between mucosal epithelia and innate
immune cells as the underlying cause of the disorder (Locksley, 2010).
Below we present a visual guide to the results of gCASI analysis of ABRIDGE data. We
applied gCASI with the goal of identifying gene networks containing interactions associated
with asthma control phenotypes. It was determined that one phenotype was responsible
for the interaction effect we detected in Nakajima Eosinophil, “wake due to cough/wheeze
in the previous six months”.
The expression values and their products shown in figure 34 offer a way to drill down and see how
genes in a single pair interact. Each box demonstrates how gene expression measures and
their products fluctuate with varying levels of asthma severity. It is hard to discern a specific
pattern for all gene pairs, but if we focus on GPR44 and ADORA3 as an example we can see
an upward trend with greater severity of asthma, which reflects the set
interaction effect we identified in the Nakajima Eosinophil gene network.
In figure 35 we observe gene pairs and their corresponding loadings. All the gene pairs
demonstrate strong loadings in relation to their canonical variate. The canonical variate in
this setting serves as the summary of gene expression correlation, a linear combination. For
clarity, recall that the canonical variate, a linear combination of gene expression
correlation components, is formed with coefficients that maximize correlation with the
phenotype canonical variate. The plot demonstrates that GPR44*ADORA3 has the largest
representation in the interaction, but other pairs are not far behind. With sparse canonical
correlation other gene pairs included in the network are zeroed out, which is also how we
obtain the associated phenotype.
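The relation between weights, the canonical variate, and loadings described above can be made concrete with a short sketch. It assumes that each feature column holds one gene-pair product per individual and that the weights come from a sparse canonical correlation fit performed elsewhere; the function names and the toy inputs in the usage note are hypothetical, and this is not the gCASI code.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def canonical_variate(features, weights):
    """Linear combination of the feature columns, one value per individual.

    features: list of rows, one row per individual.
    weights:  one coefficient per column; a sparse fit sets many to zero,
              which removes those columns from the variate.
    """
    return [sum(w * x for w, x in zip(weights, row)) for row in features]


def loadings(features, weights):
    """Correlation of each feature column with the canonical variate."""
    variate = canonical_variate(features, weights)
    return [pearson(column, variate) for column in zip(*features)]
```

For example, loadings(rows, [0.7, 0.3, 0.0]) returns one correlation per column; the third column contributes nothing to the variate because its weight is zero, although its correlation with the variate is still reported.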
Figure 36 is an additional visualization of the Nakajima Eosinophil gene
network, with pairwise gene interactions shown by edges and genes represented by nodes.
This representation informs us which specific genes are understood to be interacting and
that they form a cluster. More precisely, we have learned that this cluster is perturbed by
“wake due to cough/wheeze”. The same genes interact with multiple other genes rather
than being isolated into separate pairs.
The goal of our analysis is to relate gene interactions within a network to a phenotype. Our
procedure identified “wake due to cough/wheeze in the previous six months” as the culprit.
Ultimately, we might consider how gene expression levels relate to phenotype category,
which is demonstrated in figure 37. Gene expression correlations/interactions are
summarized via the canonical variate. The graph relates the value of the canonical variate to
asthma control and severity. One could imagine this relation being used for
prediction or diagnosis: measuring expression levels of the genes resolved to be interacting,
forming the canonical variate using coefficients obtained from application of gCASI to the
ABRIDGE data, and predicting the “wake due to cough/wheeze” phenotype level for an individual
or a group.
As follow-up we applied a linear model to genes within the Nakajima Eosinophil network to
test for pairwise multiplicative interaction, with the ordinal nature of the phenotype variables,
which are treated as the outcome, in mind. Results shown in table 19 are from regression
applied to pairs of associated genes identified in the gene set network to have significant
influence on the 6-month “wake due to cough/wheeze” asthma control variable, and those
shown in table 20 are for marginal effects of genes on the same phenotype.
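A follow-up test of this kind, with a 1 df likelihood ratio test for the interaction term and a 3 df test for the main effects and interaction together, can be sketched as below. The sketch assumes an ordinary least-squares fit with Gaussian errors and the phenotype treated as a numeric outcome; the exact model specification behind tables 19 and 20 is not reproduced here, and the function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(design, y):
    """Least-squares coefficients and residual sum of squares."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
              for row, yi in zip(design, y))
    return beta, rss


def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df = 1 and df = 3)."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1 or df = 3 are supported in this sketch")


def interaction_lrt(g1, g2, y):
    """Return (interaction coefficient, 1 df LRT p-value, 3 df LRT p-value)."""
    n = len(y)
    _, rss_null = ols([[1.0] for _ in y], y)
    _, rss_main = ols([[1.0, a, b] for a, b in zip(g1, g2)], y)
    beta, rss_full = ols([[1.0, a, b, a * b] for a, b in zip(g1, g2)], y)
    lr_1df = n * math.log(rss_main / rss_full)   # interaction only
    lr_3df = n * math.log(rss_null / rss_full)   # main effects and interaction
    return beta[3], chi2_sf(lr_1df, 1), chi2_sf(lr_3df, 3)
```

Here g1 and g2 would hold the expression values of the two genes and y the phenotype level; an ordinal regression would be a natural alternative if the Gaussian assumption is considered too strong.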
Figure 34. Expression values and their products (multiplications) vs. asthma severity.
7.2.3.1. Analysis of “NAKAJIMA EOSINOPHIL” Set
Figure 35. Gene pairs and their corresponding loadings represent extent of involvement in the
gene network perturbation.
Figure 36. Visualization of NAKAJIMA EOSINOPHIL gene network with pairwise gene interactions
shown by edges, representing strength of relationship, linking nodes.
Figure 37. Plot of phenotype vs. canonical variate, identifying how expression profiles for genes in a
network relate to asthma control and severity.
Table 19
Results of tests for multiplicative interaction from regression applied to pairs of associated genes
identified in the “NAKAJIMA EOSINOPHIL” gene set network and found to have significant influence on the
6-month asthma control variable. (GPR44 is an alias for the PTGDR2 gene.)
Phenotype                 gene1*gene2      Coeff. for   P-value, 1 df  P-value, 3 df LRT  Coeff./weights  Loadings (corr.
                                           interaction  LRT for        for main effects   for canonical   with canonical
                                                        interaction    and interaction    variate         variate)
------------------------  ---------------  -----------  -------------  -----------------  --------------  ---------------
Wake due to cough/wheeze  GPR44 * ADORA3   0.22         1.25E-05       2.20E-06           0.69            0.87
Wake due to cough/wheeze  P2RY14 * ADORA3  0.23         6.05E-05       8.62E-06           0.31            0.78
Wake due to cough/wheeze  P2RY14 * ADORA3  0.25         1.55E-05       1.61E-07           0.53            0.70
Wake due to cough/wheeze  P2RY14 * GPR44   0.24         8.24E-05       2.95E-06           0.12            0.64
Wake due to cough/wheeze  ADORA3 * CNR2    0.27         4.11E-05       4.41E-06           0.36            0.57
Table 20
Results of tests for marginal effects from regression applied to genes associated with at least one other
gene in the “NAKAJIMA EOSINOPHIL” gene set network and significantly associated with a 6-month asthma
control phenotype. P-values are Pr(>|t|) from Wald tests. (GPR44 is an alias for the PTGDR2 gene.)
Phenotype | gene1, gene2 | Coeff. Main Effect Gene 1 | Wald Test P-Value, Main Effect Gene 1 | Coeff. Main Effect Gene 2 | Wald Test P-Value, Main Effect Gene 2 | Coeff./Weights for Canonical Variate | Loadings (Correlations with canonical variate)
Wake due to cough/wheeze | GPR44, ADORA3 | 0.17 | 0.016 | 0.22 | 0.00183 | 0.69 | 0.87
Wake due to cough/wheeze | P2RY14, ADORA3 | 0.14 | 0.0629 | 0.22 | 0.00183 | 0.31 | 0.78
Wake due to cough/wheeze | P2RY14, ADORA3 | 0.21 | 0.00246 | 0.22 | 0.00183 | 0.53 | 0.70
Wake due to cough/wheeze | P2RY14, GPR44 | 0.21 | 0.00246 | 0.17 | 0.0164 | 0.12 | 0.64
Wake due to cough/wheeze | ADORA3, CNR2 | 0.22 | 0.00183 | 0.11 | 0.124 | 0.36 | 0.57
7.2.3.2. Substantiating Interaction Analysis in CAMP WB Study Group
We attempted to validate the significant results from tests of association between gene
networks and 6-month asthma control variables determined in the ABRIDGE WB data set.
Using the CAMP WB data, we employed phenotypes constructed based on self-reported
incidents of asthma control events in the preceding seven days.
The acute asthma control scores are based on questionnaire data related to self-reported
frequency of four types of asthma control events: sleep awakening, disruption of daily
activities, rescue bronchodilator use, and preventive bronchodilator use before exercise.
Each phenotype variable represented the number of days on which the given event
occurred. These asthma control phenotypes are meant to reproduce, in substance, the
6-month variables in a 7-day context. They serve as a summary of sleep interference, activity
limitation, reactive rescue therapy use, and preventive rescue therapy use. Results of tests
for interaction with respect to the 7-day asthma control phenotypes are shown in Table 21. We
found them to differ substantially from the results for the 6-month asthma symptom events.
Table 21
Results of tests for interaction with respect to 7-day asthma control phenotypes from regression applied
to genes involved in association in the “NAKAJIMA EOSINOPHIL” gene set network and found to vary
significantly with 6-month asthma symptom events. Ordering within each phenotype is based on
loadings of gene pairs from analysis with the 6-month phenotype in the ABRIDGE WB data set. (GPR44 is an
alias for the PTGDR2 gene.)
Phenotype: event frequency over 7-day period | gene1*gene2 | Coeff. for Intrctn. | P-Value, 1 df LRT for Intrctn. | P-Value, 3 df LRT for Main Effects and Intrctn. | P-Value, 3 df LRT, BH Adjusted
sleep awakening | GPR44 * ADORA3 | 0.006 | 7.20E-01 | 1.24E-02 | 1.91E-02
sleep awakening | P2RY14 * ADORA3 | 0.029 | 7.27E-02 | 1.45E-03 | 3.23E-03
sleep awakening | P2RY14 * ADORA3 | 0.007 | 6.79E-01 | 1.93E-02 | 2.76E-02
sleep awakening | P2RY14 * GPR44 | 0.014 | 4.01E-01 | 8.93E-03 | 1.62E-02
sleep awakening | ADORA3 * CNR2 | -0.004 | 8.24E-01 | 7.32E-03 | 1.46E-02
disruption of daily activities | GPR44 * ADORA3 | -0.009 | 7.16E-01 | 5.07E-01 | 5.63E-01
disruption of daily activities | P2RY14 * ADORA3 | 0.005 | 8.48E-01 | 8.19E-01 | 8.62E-01
disruption of daily activities | P2RY14 * ADORA3 | -0.033 | 2.45E-01 | 4.65E-01 | 5.47E-01
disruption of daily activities | P2RY14 * GPR44 | -0.027 | 3.17E-01 | 4.04E-01 | 5.56E-01
disruption of daily activities | ADORA3 * CNR2 | 0.010 | 7.18E-01 | 9.78E-01 | 9.78E-01
rescue bronchodilator use | GPR44 * ADORA3 | 0.048 | 3.36E-01 | 1.06E-05 | 7.04E-06
rescue bronchodilator use | P2RY14 * ADORA3 | 0.141 | 5.64E-03 | 7.70E-09 | 1.54E-01
rescue bronchodilator use | P2RY14 * ADORA3 | 0.119 | 4.03E-02 | 2.97E-07 | 2.97E-06
rescue bronchodilator use | P2RY14 * GPR44 | 0.090 | 1.02E-01 | 2.01E-05 | 8.34E-06
rescue bronchodilator use | ADORA3 * CNR2 | 0.055 | 3.14E-01 | 2.08E-04 | 8.34E-06
preventive bronchodilator uses before exercise | GPR44 * ADORA3 | -0.003 | 9.44E-01 | 2.42E-02 | 3.22E-02
preventive bronchodilator uses before exercise | P2RY14 * ADORA3 | 0.102 | 1.81E-02 | 6.18E-05 | 2.06E-04
preventive bronchodilator uses before exercise | P2RY14 * ADORA3 | 0.112 | 2.18E-02 | 9.00E-04 | 2.44E-03
preventive bronchodilator uses before exercise | P2RY14 * GPR44 | 0.093 | 4.44E-02 | 9.77E-04 | 2.44E-03
preventive bronchodilator uses before exercise | ADORA3 * CNR2 | 0.031 | 5.06E-01 | 1.11E-01 | 1.86E-02
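For readers unfamiliar with the Benjamini-Hochberg (BH) adjustment named in the last column of Table 21, the standard step-up procedure can be sketched as below. This is a generic illustration; the function name `bh_adjust` and the input p-values are hypothetical, and the example does not attempt to reproduce the values in the table or the family of tests over which the adjustment was made.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative p-values only (not taken from the study).
raw = [0.01, 0.04, 0.03, 0.005]
print([round(v, 4) for v in bh_adjust(raw)])
```

An adjusted value is never smaller than the raw p-value it comes from, and the ordering of the raw p-values is preserved.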
Application of gCASI to ABRIDGE expression data revealed candidate gene sets for joint
association with asthma control phenotypes. It is noteworthy that the combined effects of gene
pairs suggest significant association with awakening from sleep, rescue bronchodilator use,
and preventive bronchodilator use, but not with disruption of daily activities.
Future study of the data and the gCASI approach should include simulation studies to
evaluate whether the method is appropriate for detecting interaction between genes in a
network. Simulations could be constructed by forming nodes and edges of various
strengths for a network. In addition, data could be simulated via linear regression by selecting
correlated pairs of genes (nodes) for inclusion as interaction terms.
8. Discussion
Many conventional methods for discovering disease-susceptibility genes focus on one
genomic region at a time. This works for some illnesses that have a single genomic
component but is inadequate for complex diseases, whose etiology involves multiple genes
or other genomic factors. To enable detection of joint interaction between sets of genomic
variables in association with a disease phenotype, we proposed a new method. It is based on a
contrast in Fisher-transformed sample correlations between cases and controls. A test for
equality of correlations between linear combinations within cases vs. controls is a way to
learn whether interaction is present in relation to disease. Our proposed method avoids the
need for parametric assumptions, and the multiple testing challenge is addressed by
permutation-based FDR.
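The core idea of a case-control contrast in Fisher-transformed correlations can be illustrated with a deliberately simplified sketch. It assumes a single pair of continuous variables rather than linear combinations of two sets, and it produces a single permutation p-value rather than a permutation-based FDR, so it is not the CASI statistic itself. The function names and the simulated data are hypothetical and are not taken from the CASI R package.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_contrast(x, y, is_case):
    """Absolute difference in Fisher z-transformed correlations, cases vs. controls."""
    xc = [a for a, c in zip(x, is_case) if c]
    yc = [b for b, c in zip(y, is_case) if c]
    xk = [a for a, c in zip(x, is_case) if not c]
    yk = [b for b, c in zip(y, is_case) if not c]
    return abs(math.atanh(pearson(xc, yc)) - math.atanh(pearson(xk, yk)))

def permutation_pvalue(x, y, is_case, n_perm=500, seed=0):
    """Permute case/control labels to build a null distribution for the contrast."""
    rng = random.Random(seed)
    observed = fisher_contrast(x, y, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if fisher_contrast(x, y, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: the two variables are correlated in cases only.
rng = random.Random(1)
x, y, is_case = [], [], []
for i in range(200):
    case = i < 100
    a = rng.gauss(0, 1)
    b = (0.8 * a if case else 0.0) + rng.gauss(0, 1)
    x.append(a)
    y.append(b)
    is_case.append(case)

stat, p = permutation_pvalue(x, y, is_case)
print(f"contrast={stat:.3f}, permutation p={p:.4f}")
```

Because the null distribution comes from shuffling labels, no distributional assumption on the variables is needed, which mirrors the nonparametric motivation stated above.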
We believe one reason that our test exhibits superior power is that it aggregates
variables into sets, whereas classical approaches are applied to single pairs of
variables. Simulation results show that the CASI method is more powerful than testing via
logistic regression. Power is higher when we simultaneously consider multiple SNPs within
or near a gene that has causal interacting SNPs among them. This is exemplified by the
scenarios in Figures 8-11. It’s important to note that both causal and non-causal SNPs are
present when applying CASI to the multiplicative interaction model simulation. SNPs generated
through the simulation are chosen probabilistically, using a binomial distribution, for
involvement in interaction, but the CASI method is applied to the entire aggregates. It thereby
demonstrates the ability to discern signal from noise, identifying a subset of
interactions within a larger set. That is, we observe that the introduction of noise doesn’t
wholly detract from its ability to detect interaction. Additional simulation studies are
needed to evaluate how varying the fraction of interacting SNPs in a grouping affects the power
of CASI.
Along the same line of reasoning about distinguishing signal from noise, we note that
the CASI method must exhibit this property because it does not adjust for possible
confounders or important covariates. Aside from the variables of interest, other nuisance
variables may play a role in complex interactions. For example, it seems
reasonable that many SNPs are correlated with ethnicity and gender. In a permutation-based
method proposed by Simon and Tibshirani (2015), the authors adapt
their procedure to deal with nuisance variables by replacing the data in each class,
$level \in \{case, control\}$, that is, $g^{(1)}_{case}$, $g^{(2)}_{case}$, $g^{(1)}_{control}$,
$g^{(2)}_{control}$, with residuals from a projection of the
original data onto the confounder variables. Likewise, we can adapt CASI to deal with these
nuisance variables by using partial correlations. Assume $g_1$ and $g_2$ are our variables of
interest, and $z$ is a vector of potential confounders. Rather than comparing $CanCor(g_1, g_2)$
in cases and controls, we compare the partial correlations. This is done by first regressing
our potential confounders, $z$, out of all the variables of interest (SNPs or methylation in our
application), then running the remainder of the analysis as usual. The residuals are formed
as typically expressed in matrix form (Christensen, 2011):
$$\tilde{G}^{(1)}_{class} = \left[ I - Z_{class} \left( Z_{class}^{T} Z_{class} \right)^{-1} Z_{class}^{T} \right] G^{(1)}_{class}$$
$$\tilde{G}^{(2)}_{class} = \left[ I - Z_{class} \left( Z_{class}^{T} Z_{class} \right)^{-1} Z_{class}^{T} \right] G^{(2)}_{class}$$
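The residualization step described above can be sketched as follows. This is a minimal illustration, assuming that the confounder matrix Z includes an intercept column and that the projection is applied separately within cases and within controls. The residuals are obtained by solving the normal equations rather than forming the projection matrix explicitly, which gives the same result. All function names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b for a small square system (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def residualize(g, z):
    """Return g - Z (Z^T Z)^{-1} Z^T g for one variable g and confounder rows z."""
    k = len(z[0])
    ztz = [[sum(row[i] * row[j] for row in z) for j in range(k)] for i in range(k)]
    ztg = [sum(row[i] * gi for row, gi in zip(z, g)) for i in range(k)]
    beta = solve(ztz, ztg)
    return [gi - sum(b * zi for b, zi in zip(beta, row)) for gi, row in zip(g, z)]

def residualize_by_class(g, z, is_case):
    """Apply the projection separately within cases and within controls."""
    out = [0.0] * len(g)
    for level in (True, False):
        idx = [i for i, c in enumerate(is_case) if c == level]
        res = residualize([g[i] for i in idx], [z[i] for i in idx])
        for i, r in zip(idx, res):
            out[i] = r
    return out

# Toy data: intercept plus one hypothetical confounder (for example, an ancestry score).
z = [[1.0, float(i % 5)] for i in range(20)]
is_case = [i < 10 for i in range(20)]
g = [2.0 + 3.0 * row[1] + (0.5 if i % 2 else -0.5) for i, row in enumerate(z)]
g_tilde = residualize_by_class(g, z, is_case)
print([round(v, 3) for v in g_tilde[:5]])
```

After this step, the residualized variables are uncorrelated with each confounder within each class, so any remaining case-control difference in correlation is not attributable to a linear effect of the confounders.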
It is important to note that for the simulation with an underlying multiplicative logistic model,
we assume a low overall population disease prevalence of 0.05, that is, a rare disease. When
prevalence was increased to 0.10, power decreased markedly, suggesting a violation of the rare
disease assumption necessary for the case-only procedure to work. As our method is
inspired by the case-only approach, it’s plausible that we are observing the same effect.
As already argued in this paper and other publications, if correlations between gene one
and gene two differ in cases vs. controls, there must be an interaction effect of gene
one and gene two on the disease outcome. In the simulation involving logistic regression,
we observed that there is in fact a relation between the difference in the correlations as
recognized by CASI and the magnitude of the interaction ORs in the logistic model. The
coefficients in the logistic model are the logs of the ORs. Larger interaction ORs that we set
in the model were detectable by CASI at lower sample sizes and with greater power, though
we only show an interaction OR of 1.2. This demonstrates that a simulation using a logistic
model as the underlying model is appropriate for our test, in that a correspondence exists
between the interaction OR and the contrast in correlation. If the variables we are testing for
interaction in relation to disease phenotype actually behave in the way embodied by a
logistic model, then CASI can in fact detect such interaction. It’s even possible that a
mathematical relation exists between this model and CASI, even if it’s without closed
form.
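The correspondence between the interaction OR and the case-control contrast in correlation can be explored with a small simulation. The sketch below assumes two independent SNPs coded 0/1/2, no main effects, a baseline risk of 0.05 for non-carriers, and interaction ORs larger than the 1.2 shown in the thesis so that the effect is visible at a modest sample size. These settings and all function names are illustrative assumptions, not the simulation design used in the study.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_case_control(n, maf, b0, b1, b2, or_int, seed=0):
    """Draw n subjects; disease status follows a logistic model with a g1*g2 product term."""
    rng = random.Random(seed)
    b_int = math.log(or_int)  # the interaction coefficient is the log of the interaction OR
    cases, controls = [], []
    for _ in range(n):
        g1 = sum(rng.random() < maf for _ in range(2))  # genotype as a Binomial(2, maf) draw
        g2 = sum(rng.random() < maf for _ in range(2))
        eta = b0 + b1 * g1 + b2 * g2 + b_int * g1 * g2
        risk = 1.0 / (1.0 + math.exp(-eta))
        (cases if rng.random() < risk else controls).append((g1, g2))
    return cases, controls

def correlation_contrast(cases, controls):
    """Correlation of the two SNPs in cases minus the correlation in controls."""
    r_case = pearson([g[0] for g in cases], [g[1] for g in cases])
    r_ctrl = pearson([g[0] for g in controls], [g[1] for g in controls])
    return r_case - r_ctrl

b0 = math.log(0.05 / 0.95)  # baseline log-odds for subjects carrying no minor alleles
for or_int in (1.0, 1.5, 2.0):
    cases, controls = simulate_case_control(40000, 0.3, b0, 0.0, 0.0, or_int, seed=1)
    print(or_int, len(cases), round(correlation_contrast(cases, controls), 3))
```

With an interaction OR of 1 the two SNPs stay essentially uncorrelated in both groups, while larger ORs induce a positive correlation among cases, which is the signal a correlation contrast picks up.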
In the simulation involving difference in correlation matrices we explore interaction
between two sets of loci using a multivariate normal model. The correlation matrices are
used as parameters in this model. There is very low gene-gene dependence in the controls,
as we set the off-diagonal submatrices to 0.1 for those indices. The cases, on the other hand,
are simulated to have the underlying model’s off-diagonal submatrices greater than 0.1.
Simulation results indicate substantial increases in power for small increments in
correlation among the cases. These small differences in correlation between cases and
controls represent the effect size, which is summarized by taking the norm of the difference
between the correlation matrices that serve as parameters of the multivariate normal models.
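A simulation of this type can be sketched as follows. The example assumes two sets of three variables, a constant within-set correlation of 0.3, between-set correlations of 0.1 in controls and 0.3 in cases, and the Frobenius norm as the summary of effect size. The thesis does not specify the norm in this passage and draws the within-set correlations at random, so these are simplifying assumptions, and all function names are hypothetical.

```python
import math
import random

def block_corr(set_size, within, between):
    """Correlation matrix for two variable sets: `within` inside a set, `between` across sets."""
    d = 2 * set_size
    return [[1.0 if i == j else (within if (i < set_size) == (j < set_size) else between)
             for j in range(d)] for i in range(d)]

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_mvn(corr, n, rng):
    """Draw n zero-mean multivariate normal vectors with the given correlation matrix."""
    low = cholesky(corr)
    d = len(corr)
    out = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(d)]
        out.append([sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)])
    return out

def frobenius_diff(a, b):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)

def mean_between_corr(data, set_size):
    """Average sample correlation over all between-set variable pairs."""
    cols = list(zip(*data))
    rs = [pearson(cols[i], cols[j])
          for i in range(set_size) for j in range(set_size, 2 * set_size)]
    return sum(rs) / len(rs)

rng = random.Random(2)
corr_controls = block_corr(3, 0.3, 0.1)
corr_cases = block_corr(3, 0.3, 0.3)
effect = frobenius_diff(corr_cases, corr_controls)
controls = sample_mvn(corr_controls, 2000, rng)
cases = sample_mvn(corr_cases, 2000, rng)
print(round(effect, 4), round(mean_between_corr(cases, 3), 3), round(mean_between_corr(controls, 3), 3))
```

Only the off-diagonal (between-set) blocks differ between the two parameter matrices, so the norm of the difference grows with both the size of the increment and the number of variable pairs involved.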
For situations where genes have multiple SNPs that are in LD, we again turn to the
simulation that looked at the difference in correlation matrices. The submatrices on the
diagonal of the correlation parameter could represent correlations among SNPs within a
gene, that is, an LD structure, if we adopt the interpretation that we are observing interaction
between two genes. We produce two samples, cases and controls, generated by
independent multivariate normal models. The samples can be thought of as arising from LD
structures reflective of the non-zero correlation parameter in a multivariate normal model.
We observe two sets of SNPs, one in cases and one in controls, which exhibit within-set
correlation due to an underlying correlation structure generated from a continuous uniform
distribution on [0,1] used in forming the matrices. This can be interpreted as a
way of modeling LD. It is therefore important that this ubiquitous quality, the LD structure within
genes, is represented in this simulation. Further research is required into this matter since
SNPs within a gene are often in LD, in highly dependent groupings that form haplotypes.
The simulations built on the logistic model generate SNPs independently for each gene. This
model should be extended to allow for LD within genes. The only dependence in the logistic
simulation is in estimation of the variance of coefficients in the linear model.
For one simulation we generate the interaction effect using multivariate normal
distributions with varying correlation matrices, whereas the other uses a logistic model. The
CASI method was able to identify interaction in both. This suggests robustness of our approach
with respect to the interaction model employed and supports its ability to detect varying
types of interactions. As part of our study of interaction we ultimately wish to
distinguish between groups of patients, for example those with and without asthma. By
applying our method to two data sets that are part of large consortiums, we benefited from
an opportunity to assess genome-genome set interactions.
Using our method, we substantiated statistical interaction between unlinked genetic loci in
the CHS study and between two different types of genomic factors, SNPs and methylation, in the
ABRIDGE study. In an investigation such as ours, follow-up analyses should be, and were,
conducted as reasonable next steps in validating performance. Hence, conventional
approaches to detecting interaction in single pairs of genomic variables were applied to
genes that were determined to be significant by a set-based method.
In the CHS example we wished to distinguish between groups of participants, those who
developed asthma and those who did not. Our study population provided a unique
opportunity to assess gene-gene interaction between asthma sufferers and those without
the condition. CASI analysis of CHS data identified 10 gene pairs as significant at an FDR of
0.049. These were {CDK2, IL1RL1}, {CDK2, DENND1B}, {DBX1, DENND1B}, {DENND1B,
IL2RB}, {HCG23, LPIN2}, {FLG, IKZF4}, {HLA-DRA, LRRC32}, {IL5RA, PRMT3},
{MGC45800, PRMT3}, and {LOC284661, TENM3}. Five were considered good candidates
for further exploration using a more traditional approach, logistic regression. SNPs
identified within those gene pairs were further evaluated for pairwise multiplicative
interaction in association with asthma. The smallest q-values came from the interacting
genes HCG23 and LPIN2, which are therefore proffered for further research. While we did not
find q-values or corresponding nominal p-values that reached the traditional significance level
of 0.05 among the top hits, five SNP pairs prominently exhibiting interaction were chosen
for further analysis. Specifically, we characterized the interaction of one SNP pair from
each of the five gene pairs exhibiting high loadings. Further study is required to learn what
is known about these SNPs and genes in the literature.
It should also be noted that some of the genes in the 10 pairs identified as significant by CASI
in the CHS study have previously been implicated in asthma. Mutations in the LPIN2 gene are
associated with inflammatory-based disorders. A genome-wide association study mapped a
variant of DENND1B as one of the genetic risk factors for persistent asthma in North
American children of European and African ancestry. Treatment of CDK2 by resveratrol
indicated that it effectively suppressed the proliferation of eosinophils from asthmatic
patients by regulating protein expression levels. DBX1 has been shown in mice to be
necessary for breathing. Additionally, DBX1 neurons are respiratory modulated. IL2RB has
been identified by GWAS as underlying asthma by the GABRIEL consortium. This asthma
gene is expressed within the respiratory epithelium and is known to be associated with
childhood-onset asthma, later-onset asthma, severe asthma, and occupational asthma. This
association replicated convincingly in other independent cohorts and many studies have
shown it to be linked to asthma pathogenesis. A GWAS study of reactive chemicals known
to be a common cause of occupational asthma identified gene variants of TENM3 as
contributing to the pathogenesis of diisocyanate asthma. Our analysis shows it interacts
with LOC284661. A genome-wide association study identified bi-allelic markers near
MGC45800 involved in disease susceptibility, including chronic inflammatory disease of the
central nervous system with autoimmune origin. It interacts with PRMT3 according to our
data. In lung tissue of asthmatic rats, gene expression of PRMT3 was observed to
increase significantly relative to control rats,
suggesting that PRMT3 plays an important role in the post-translational modification
process of asthma-related genes. Our analysis also shows that IL5RA interacts with PRMT3.
Investigation of the association of polymorphisms in IL5RA with asthma susceptibility
found it to be weakly associated with eosinophil count in asthmatic patients. Investigation
of associations of HLA-DRA polymorphisms with nasal polyposis in asthmatic patients
showed that SNPs and a haplotype were significantly associated with the presence of nasal
polyposis in asthmatic patients. Our study identified interaction between HLA-DRA and
LRRC32.
CASI analysis of ABRIDGE data identified 12 genomic regions with evidence of interactions
between sets of SNPs and sets of methylated CpGs in association with asthma. Of these 12
genes, 3 have previously been implicated in asthma risk or underlying biological pathways
related to pathology of the disease, PF4, ATF3, and TPRA1. From prior literature we know
that PF4 may play an important role in allergic inflammation and asthma. ATF3 deficiency
in mice leads to the development of significantly increased airway hyper-responsiveness
and pulmonary eosinophilia. TPRA1 is an irritant sensing cation channel expressed in
chemosensory neurons that innervate the airways. Three genomic regions not previously
implicated in asthma, LOC101928523, SCARNA18, and LHX6, contained the most
statistically significant pairwise interactions between individual SNPs and CpGs. There is a
lack of prior literature about the possible role of the other genes we identified as having
significant influence on asthma. HOPX, OR10K1, UPK1B, CHMP4B, and LANCL1 deserve
further study.
An extension of CASI was proposed that allows for tests that incorporate many genes (a
network of genes) and identify association with a phenotype by correlating linear
combinations of Pearson correlation components. We refer to this method as gCASI and
demonstrate its effectiveness by applying it to real data. gCASI analysis of ABRIDGE and
CAMP expression data identified eight gene sets at an FDR significance level of
0.16 (CI: [0.05, 0.53]). We chose to focus on the highest ranked, Nakajima Eosinophil. Our
procedure selected “wake due to cough/wheeze in the previous six months” as the
phenotype involved in perturbation of this gene set. The Nakajima Eosinophil network is
believed to play an important role in evoking allergic inflammation and biologic pathways
that lead to initiation and maintenance of asthma.
For all three applications to real data where we identified set interactions, follow-up studies
were warranted to further characterize such associations. To evaluate the performance of CASI
and drill down into the sets to learn where the interaction lies, we applied logistic
regression to pairs of variables. The idea is that detection of interactions between two loci,
two SNPs or a SNP and a CpG locus, can lend support to the proposed CASI and gCASI
statistics.
Though the CASI procedure requires considerable computational resources due to its
reliance on permutations, the required number of permutations is small enough to allow
genome-wide applications. The advantage over the model-based approach is that
parametric methods suffer from model misspecification, with asymptotic approximate p-
values where approximation grows worse as we move into the tails. While calculating test
statistics for numerous blocks of SNPs is computationally intensive, it is practically
achievable by spreading computations over clusters of computers as we did with USC’s
high-performance computing cluster (HPCC). Software implementing the CASI function is
freely available to the public as a downloadable R package at
https://github.com/USCbiostats/CASI. Also, for any proposed study that may be a good
candidate for the CASI method, we would test for interactions between a limited number of
blocks (sets) worthy of attention because of their biological interest. In future
studies we can imagine taking a restricted number of blocks (sets) and comparing them
with other sections of the genome in a sliding window fashion. We demonstrated that
simultaneous consideration of genomic and epigenomic variation has the potential to
identify genetic risk factors for asthma beyond individual GWAS studies or epigenetic
screens. Using our method, we confirmed statistical interaction between multiple pairs of
sets of unlinked loci, examples of gene pairs from same and different chromosomes
showing many statistical interactions that align with current, known biological functions
related to asthma.
Our methods can be extended to test for gene-environment interactions. Instead of
comparing the correlation between two sets of genomic variables we can compare
correlations between a set of SNPs and a set of environmental variables. We can then apply
CASI to detect interaction via differences between cases and controls. An interesting
approach would be to take multi-level categorical environmental variables such as
smoking, pesticide, and pollution, among others. They can be considered as a set of
environmental variables, same as a set of SNPs in one gene is considered jointly.
We proposed an extension of our approach to network perturbation. The details are
described in the generalized CASI (gCASI) section of the manuscript. The generalization of
CASI is meant to allow testing for interaction in association with sets of phenotypes that
are categorized into more than two levels. The application that we employed involved gene
networks where correlation was calculated between pairs of genes’ expression values. We
describe this form of interaction as gene network perturbation. Eight networks were
identified as significant with an FDR of 0.16 (CI: [0.05, 0.53]). These include Nakajima
Eosinophil, Zhan Multiple Myeloma CD2 DN, Lee Neural Crest Stem Cell DN, Shedden Lung
Cancer Good Survival A12, Kenny CTNNB1 Targets DN, Dutertre Estradiol Response 6HR
UP, KEGG T Cell Receptor Signaling Pathway, and Reactome G2 M Checkpoints.
We overcame a limitation of our approach, namely that it is restricted to a dichotomous
outcome, by generalizing it to continuous phenotypes. Applying gCASI to gene networks
generalizes it to third and higher order interactions. We still need to assess power to identify
these higher order interactions and attempt to replicate our findings in other studies. Further
simulation studies could include additional variants that are in LD with the causal
variant to determine their effect on the power of a study.
Novel genomic methods and increased computational capacity have led to a rapid increase in
the rate of discovery of disease genes. While traditional association studies have sought
single locus or individual gene associations, there is a preponderance of evidence that
phenotypes result from complex interactions among large numbers of genetic and epigenetic
factors. The CASI method we proposed allows investigation of relationships among groups
of SNPs in pairs of genes and with other groups of genomic variables. By finding interactions
among networks of genes we may extend our understanding of how the combined behavior
of different parts of the genome gives rise to phenotypes, as well as expand our ability to
predict disease outcome. Detecting interactions among disease-associated SNPs and
epigenetic factors may reveal biological mechanisms that are critical to understanding the
development and progression of a disease state and may provide a powerful and promising
basis for the development of novel therapeutic and diagnostic strategies.
References
Ahmed, S., Thomas, G., Ghoussaini, M., Healey, C.S., Humphreys, M.K., Platte, R., et al. (2009) Newly
discovered breast cancer susceptibility loci on 3p24 and 17q23.2, Nature Genetics, 41, 585-
590.
Akhabir, L. and Sandford, A.J. (2011) Genome-wide association studies for discovery of genes
involved in asthma, Respirology, 16, 396-406.
Akinbami, L.J., Moorman, J.E., Bailey, C., Zahran, H.S., King, M., Johnson, C.A., et al. (2012) Trends in
asthma prevalence, health care use, and mortality in the United States, 2001–2010, NCHS
Data Brief, 94, 1-8.
Andrew, A.S., Nelson, H.H., Kelsey, K.T., Moore, J.H., Meng, A.C., Casella, D.P., et al. (2006)
Concordance of multiple analytical approaches demonstrates a complex relationship
between DNA repair gene SNPs, smoking and bladder cancer susceptibility, Carcinogenesis,
27, 1030-1037.
Anholt, R.R. and Mackay, T.F. (2009) Principles of behavioral genetics. Academic Press.
Asher, M.I. and Weiland, S.K. (1998) The International Study of Asthma and Allergies in Childhood
(ISAAC). ISAAC Steering Committee, Clinical & Experimental Allergy, 28 Suppl 5, 52-66;
discussion 90-51.
Averill, F.J., Hubbard, W.C., Proud, D., Gleich, G.J. and Liu, M.C. (1992) Platelet activation in the lung
after antigen challenge in a model of allergic asthma, American Review of Respiratory
Disease, 145, 571-576.
Aymler, F.R. (1918) The correlation between relatives on the supposition of Mendelian inheritance,
Transactions of the Royal Society of Edinburgh, 52, 399-433.
Barsh, G.S., Copenhaver, G.P., Gibson, G. and Williams, S.M. (2012) Guidelines for genome-wide
association studies, PLoS Genetics, 8, e1002812.
Bateman, E.D., Hurd, S.S., Barnes, P.J., Bousquet, J., Drazen, J.M., FitzGerald, M., et al. (2008) Global
strategy for asthma management and prevention: GINA executive summary, European
Respiratory Journal, 31, 143-178.
Bateson, P. (2002) William Bateson: a biologist ahead of his time, J Genet, 81, 49-58.
Bautista, D.M., Pellegrino, M. and Tsunozaki, M. (2013) TRPA1: A gatekeeper for inflammation,
Annual Review of Physiology, 75, 181-200.
Bessac, B.F. and Jordt, S.E. (2008) Breathtaking TRP channels: TRPA1 and TRPV1 in airway
chemosensation and reflex control, Physiology (Bethesda), 23, 360-370.
Blake, C. (1979) Exons encode protein functional units, Nature, 277, 598-598.
Bousquet, J. (2000) Global initiative for asthma (GINA) and its objectives, Clinical and Experimental
Allergy, 30 Suppl 1, 2-5.
Breton, C.V., Byun, H.M., Wang, X., Salam, M.T., Siegmund, K. and Gilliland, F.D. (2011) DNA
methylation in the arginase-nitric oxide synthase pathway is associated with exhaled nitric
oxide in children with asthma, American Journal of Respiratory and Critical Care Medicine, 184,
191-197.
Breton, C.V., Siegmund, K.D., Joubert, B.R., Wang, X. and Qui, W. (2014) Prenatal Tobacco Smoke
Exposure Is Associated with Childhood DNA CpG Methylation (vol 9, e99716, 2014), PLoS
One, 9.
Brodie III, E.D. (2000) Why evolutionary genetics does not always add up, Epistasis and the
Evolutionary Process, 3-19.
Broide, D.H., Gleich, G.J., Cuomo, A.J., Coburn, D.A., Federman, E.C., Schwartz, L.B., et al. (1991)
Evidence of Ongoing Mast-Cell and Eosinophil Degranulation in Symptomatic Asthma
Airway, Journal of Allergy and Clinical Immunology, 88, 637-648.
Bureau, A., Dupuis, J., Falls, K., Lunetta, K.L., Hayward, B., Keith, T.P., et al. (2005) Identifying SNPs
predictive of phenotype using random forests, Genetic Epidemiology, 28, 171-182.
Caceres, A.I., Brackmann, M., Elia, M.D., Bessac, B.F., del Camino, D., D'Amours, M., et al. (2009) A
sensory neuronal ion channel essential for airway inflammation and hyperreactivity in
asthma, Proceedings of the National Academy of Sciences of the United States of America, 106,
9099-9104.
Cavanillas, M.L., Fernandez, O., Comabella, M., Alcina, A., Fedetz, M., Izquierdo, G., et al. (2011)
Replication of top markers of a genome-wide association study in multiple sclerosis in
Spain, Genes and Immunity, 12, 110-115.
Chanda, P., Zhang, A., Brazeau, D., Sucheston, L., Freudenheim, J.L., Ambrosone, C., et al. (2007)
Information-theoretic metrics for visualizing gene-environment interactions, American
Journal of Human Genetics, 81, 939-963.
Chapman, J. and Clayton, D. (2007) Detecting association using epistatic information, Genetic
Epidemiology, 31, 894-909.
Cheong, H.S., Kim, L.H., Park, B.L., Choi, Y.H., Park, H.S., Hong, S.J., et al. (2005) Association analysis of
interleukin 5 receptor alpha subunit (IL5RA) polymorphisms and asthma, Journal of Human
Genetics, 50, 628-634.
Cheverud, J.M. and Routman, E.J. (1995) Epistasis and its contribution to genetic variance
components, Genetics, 139, 1455-1461.
Christensen, R. (2011) Plane answers to complex questions: the theory of linear models. Springer
Science & Business Media.
Clark, A.G., Boerwinkle, E., Hixson, J. and Sing, C.F. (2005) Determinants of the success of whole-
genome association testing, Genome Research, 15, 1463-1467.
Cockerham, C.C. (1954) An extension of the concept of partitioning hereditary variance for analysis
of covariances among relatives when epistasis is present, Genetics, 39, 859.
Cohen, J. (1992) A power primer, Psychological Bulletin, 112, 155-159.
Cordell, H.J. (2009) Detecting gene-gene interactions that underlie human diseases, Nature Reviews
Genetics, 10, 392-404.
Cordell, H.J. (2002) Epistasis: what it means, what it doesn't mean, and statistical methods to detect
it in humans, Human Molecular Genetics, 11, 2463-2468.
Culverhouse, R., Klein, T. and Shannon, W. (2004) Detecting epistatic interactions contributing to
quantitative traits, Genetic Epidemiology, 27, 141-152.
Dawson, R. (2011) How significant is a boxplot outlier, Journal of Statistics Education, 19, 1-12.
Director's Challenge Consortium for the Molecular Classification of Lung, A., Shedden, K., Taylor,
J.M., Enkemann, S.A., Tsao, M.S., Yeatman, T.J., et al. (2008) Gene expression-based survival
prediction in lung adenocarcinoma: a multi-site, blinded validation study, Nature Medicine,
14, 822-827.
Divgi, D. (1979) Calculation of the tetrachoric correlation coefficient, Psychometrika, 44, 169-172.
Dong, C., Chu, X., Wang, Y., Wang, Y., Jin, L., Shi, T., et al. (2008) Exploration of gene-gene interaction
effects using entropy-based methods, European Journal of Human Genetics: EJHG, 16, 229.
Dudoit, S., van der Laan, M.J. and Pollard, K.S. (2004) Multiple testing. Part I. Single-step procedures
for control of general type I error rates, Statistical Applications in Genetics and Molecular
Biology, 3, Article13.
Durham, S.R., Loegering, D.A., Dunnette, S., Gleich, G.J. and Kay, A.B. (1989) Blood Eosinophils and
Eosinophil-Derived Proteins in Allergic-Asthma, Journal of Allergy and Clinical Immunology,
84, 931-936.
Dutertre, M., Gratadou, L., Dardenne, E., Germann, S., Samaan, S., Lidereau, R., et al. (2010) Estrogen
regulation and physiopathologic significance of alternative promoters in breast cancer,
Cancer Research, 70, 3760-3770.
Easton, D.F. and Eeles, R.A. (2008) Genome-wide association studies in cancer, Human Molecular
Genetics, 17, R109-115.
Easton, D.F., Pooley, K.A., Dunning, A.M., Pharoah, P.D., Thompson, D., Ballinger, D.G., et al. (2007)
Genome-wide association study identifies novel breast cancer susceptibility loci, Nature,
447, 1087-1093.
Efron, B. (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis,
Journal of the American Statistical Association, 99, 96-104.
Efron, B. and Tibshirani, R. (2002) Empirical bayes methods and false discovery rates for
microarrays, Genetic Epidemiology, 23, 70-86.
Elledge, S.J. (1996) Cell cycle checkpoints: preventing an identity crisis, Science, 274, 1664-1672.
Espejo, M.R. (2004) The Oxford dictionary of statistical terms, Journal of the Royal Statistical Society:
Series A (Statistics in Society), 167, 377-377.
Estécio, M., Youssef, E., Rahal, P., Fukuyama, E., Gois-Filho, J., Maniglia, J., et al. (2006) LHX6 is a
sensitive methylation marker in head and neck carcinomas, Oncogene, 25, 5018-5026.
Facchinetti, F. and Patacchini, R. (2010) The rising role of TRPA1 in asthma, The Open Drug
Discovery Journal, 2.
Faul, F., Erdfelder, E., Lang, A.G. and Buchner, A. (2007) G*Power 3: a flexible statistical power
analysis program for the social, behavioral, and biomedical sciences, Behavior Research
Methods, 39, 175-191.
Ferguson, P.J. and El-Shanti, H.I. (2007) Autoinflammatory bone disorders, Current Opinion in
Rheumatology, 19, 492-498.
Ferreira, M.A., Matheson, M.C., Duffy, D.L., Marks, G.B., Hui, J., Le Souëf, P., et al. (2011) Identification
of IL6R and chromosome 11q13.5 as risk loci for asthma, The Lancet, 378, 1006-1014.
Ferreira, M.A., Matheson, M.C., Tang, C.S., Granell, R., Ang, W., Hui, J., et al. (2014) Genome-wide
association analysis identifies 11 risk variants associated with the asthma with hay fever
phenotype, Journal of Allergy and Clinical Immunology, 133, 1564-1571.
Ferreira, M.A., McRae, A.F., Medland, S.E., Nyholt, D.R., Gordon, S.D., Wright, M.J., et al. (2011)
Association between ORMDL3, IL1RL1 and a deletion on chromosome 17q21 with asthma
risk in Australia, European Journal of Human Genetics, 19, 458-464.
Ferreira, M.A., Oates, N.A., van Vliet, J., Zhao, Z.Z., Ehrich, M., Martin, N.G., et al. (2010)
Characterization of the methylation patterns of MS4A2 in atopic cases and controls, Allergy,
65, 333-337.
Ferreira, M.A.R., McRae, A.F., Medland, S.E., Nyholt, D.R., Gordon, S.D., Wright, M.J., et al. (2011)
Association between ORMDL3, IL1RL1 and a deletion on chromosome 17q21 with asthma
risk in Australia (vol 19, pg 458, 2010), European Journal of Human Genetics, 19, 1109-1109.
Fisher, R.A. (1918) The Correlation between Relatives on the Supposition of Mendelian Inheritance,
Transactions of the Royal Society of Edinburgh, 52, 399-433.
Fisher, R.A. (1919) XV.—The correlation between relatives on the supposition of Mendelian
inheritance, Earth and Environmental Science Transactions of the Royal Society of Edinburgh,
52, 399-433.
Frankel, W.N. and Schork, N.J. (1996) Who's afraid of epistasis?, Nature Genetics, 14, 371-373.
Fu, A., Leaderer, B.P., Gent, J.F., Leaderer, D. and Zhu, Y. (2012) An environmental epigenetic study
of ADRB2 5'-UTR methylation and childhood asthma severity, Clinical and Experimental
Allergy, 42, 1575-1581.
Gatto, N.M., Campbell, U.B., Rundle, A.G. and Ahsan, H. (2004) Further development of the case-only
design for assessing gene-environment interaction: evaluation of and adjustment for bias,
International Journal of Epidemiology, 33, 1014-1024.
Gauderman, W.J., Vora, H., McConnell, R., Berhane, K., Gilliland, F., Thomas, D., et al. (2007) Effect of
exposure to traffic on lung development from 10 to 18 years of age: a cohort study, The
Lancet, 369, 571-577.
Gibson, G. (2009) Decanalization and the origin of complex disease, Nature Reviews Genetics, 10,
134-140.
Gilchrist, M., Henderson, W.R., Clark, A.E., Simmons, R.M., Ye, X., Smith, K.D., et al. (2008) Activating
transcription factor 3 is a negative regulator of allergic pulmonary inflammation, Journal of
Experimental Medicine, 205, 2349-2357.
Gray, P.A., Hayes, J.A., Ling, G.Y., Llona, I., Tupal, S., Picardo, M.C., et al. (2010) Developmental origin
of preBotzinger complex respiratory neurons, The Journal of Neuroscience, 30, 14883-
14895.
Grotenboer, N.S., Ketelaar, M.E., Koppelman, G.H. and Nawijn, M.C. (2013) Decoding asthma:
Translating genetic variation in IL33 and IL1RL1 into disease pathophysiology, Journal of
Allergy and Clinical Immunology, 131, 856-865.
Group, C.A.M.P.R. (1999) The childhood asthma management program (CAMP): design, rationale,
and methods, Controlled Clinical Trials, 20, 91-120.
Gudbjartsson, D.F., Bjornsdottir, U.S., Halapi, E., Helgadottir, A., Sulem, P., Jonsdottir, G.M., et al.
(2009) Sequence variants affecting eosinophil numbers associate with asthma and
myocardial infarction, Nature Genetics, 41, 342-347.
Gutierrez-Arcelus, M., Lappalainen, T., Montgomery, S.B., Buil, A., Ongen, H., Yurovsky, A., et al.
(2013) Passive and active DNA methylation and the interplay with genetic variation in gene
regulation, Elife, 2, e00523.
Halapi, E., Gudbjartsson, D.F., Jonsdottir, G.M., Bjornsdottir, U.S., Thorleifsson, G., Helgadottir, H., et
al. (2010) A sequence variant on 17q21 is associated with age at onset and severity of
asthma, European Journal of Human Genetics, 18, 902-908.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd edn. Springer-Verlag, New York, NY, USA.
Hill, W.G. and Weir, B. (2011) Variation in actual relationship as a consequence of Mendelian
sampling and linkage, Genetics Research, 93, 47-64.
Hirschhorn, J.N. and Daly, M.J. (2005) Genome-wide association studies for common diseases and
complex traits, Nature Reviews Genetics, 6, 95-108.
Hodge, S.E. and Greenberg, D.A. (2016) How Can We Explain Very Low Odds Ratios in GWAS? I.
Polygenic Models, Human Heredity, 81, 173-180.
Hoh, J. and Ott, J. (2003) Mathematical multi-locus approaches to localizing complex human trait
genes, Nature Reviews Genetics, 4, 701-709.
Hu, X., Wang, J., Xia, Y., Simayi, M., Ikramullah, S., He, Y.B., et al. (2016) Resveratrol induces cell cycle
arrest and apoptosis in human eosinophils from asthmatic individuals, Molecular Medicine
Reports, 14, 5231-5236.
International HapMap Consortium, Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve, L.L., et al. (2007)
A second generation human haplotype map of over 3.1 million SNPs, Nature, 449, 851-861.
Jakobsdottir, J., Gorin, M.B., Conley, Y.P., Ferrell, R.E. and Weeks, D.E. (2009) Interpretation of
genetic association studies: markers with replicated highly significant odds ratios may be
poor classifiers, PLoS Genetics, 5, e1000337.
Jones, A., Opejin, A., Henderson, J.G., Gross, C., Jain, R., Epstein, J.A., et al. (2015) Peripherally Induced
Tolerance Depends on Peripheral Regulatory T Cells That Require Hopx To Inhibit Intrinsic
IL-2 Expression, Journal of Immunology, 195, 1489-1497.
Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., et al. (2006) From
genomics to chemical genomics: new developments in KEGG, Nucleic Acids Research, 34,
D354-D357.
Kang, G., Yue, W., Zhang, J., Cui, Y., Zuo, Y. and Zhang, D. (2008) An entropy-based approach for
testing genetic epistasis underlying complex diseases, Journal of Theoretical Biology, 250,
362-374.
Kempthorne, O. (1954) The correlation between relatives in a random mating population,
Proceedings of the Royal Society of London B: Biological Sciences, 143, 103-113.
Kenny, P.A., Enver, T. and Ashworth, A. (2005) Receptor and secreted targets of Wnt-1/beta-catenin
signalling in mouse mammary epithelial cells, BMC Cancer, 5, 3.
Khoury, M.J. and Flanders, W.D. (1996) Nontraditional epidemiologic approaches in the analysis of
gene-environment interaction: case-control studies with no controls!, American Journal of
Epidemiology, 144, 207-213.
Kim, J.H., Park, B.L., Cheong, H.S., Pasaje, C.F., Bae, J.S., Park, J.S., et al. (2012) HLA-DRA
polymorphisms associated with risk of nasal polyposis in asthmatic patients, American
Journal of Rhinology and Allergy, 26, 12-17.
Kosoy, R., Nassir, R., Tian, C., White, P.A., Butler, L.M., Silva, G., et al. (2009) Ancestry informative
marker sets for determining continental origin and admixture proportions in common
populations in America, Human Mutation, 30, 69-78.
Kraft, P., Wacholder, S., Cornelis, M.C., Hu, F.B., Hayes, R.B., Thomas, G., et al. (2009) Beyond odds
ratios--communicating disease risk based on genetic profiles, Nature Reviews Genetics, 10,
264-269.
Kraft, P., Yen, Y.C., Stram, D.O., Morrison, J. and Gauderman, W.J. (2007) Exploiting gene-
environment interaction to detect genetic associations, Human Heredity, 63, 111-119.
Kroegel, C. (2009) Global Initiative for Asthma (GINA) guidelines: 15 years of application, Expert
Review of Clinical Immunology, 5, 239-249.
Langaas, M., Lindqvist, B.H. and Ferkingstad, E. (2005) Estimating the proportion of true null
hypotheses, with application to DNA microarray data, Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 67, 555-572.
Larson, N.B. and Schaid, D.J. (2013) A kernel regression approach to gene-gene interaction
detection for case-control studies, Genetic Epidemiology, 37, 695-703.
Lee, G., Kim, H., Elkabetz, Y., Al Shamy, G., Panagiotakos, G., Barberi, T., et al. (2007) Isolation and
directed differentiation of neural crest stem cells derived from human embryonic stem cells,
Nature Biotechnology, 25, 1468-1475.
Lee, J.H., Chang, H.S., Kim, J.H., Park, S.M., Lee, Y.M., Uh, S.T., et al. (2007) Genetic effect of CCR3 and
IL5RA gene polymorphisms on eosinophilia in asthmatic patients, Journal of Allergy and
Clinical Immunology, 120, 1110-1117.
Lewontin, R. (2006) Commentary: Statistical analysis or biological analysis as tools for
understanding biological causes, International Journal of Epidemiology, 35, 536-537.
Lewontin, R.C. (1974) Annotation: the analysis of variance and the analysis of causes, American
Journal of Human Genetics, 26, 400-411.
Li, J., Huang, D., Guo, M., Liu, X., Wang, C., Teng, Z., et al. (2015) A gene-based information gain
method for detecting gene-gene interactions in case-control studies, European Journal of
Human Genetics, 23, 1566-1572.
Li, J., Tang, R., Biernacka, J.M. and De Andrade, M. (2009) Identification of gene-gene interaction
using principal components. BMC Proceedings. BioMed Central, pp. S78.
Li, X., Ampleford, E.J., Howard, T.D., Moore, W.C., Torgerson, D.G., Li, H., et al. (2012) Genome-wide
association studies of asthma indicate opposite immunopathogenesis direction from
autoimmune diseases, Journal of Allergy and Clinical Immunology, 130, 861-868 e867.
Li, X., Howard, T.D., Zheng, S.L., Haselkorn, T., Peters, S.P., Meyers, D.A., et al. (2010) Genome-wide
association study of asthma identifies RAD50-IL13 and HLA-DR/DQ regions, Journal of
Allergy and Clinical Immunology, 125, 328-335 e311.
Li, X., Moore, W.C., Hastie, A.T., Ampleford, E.J., Li, H., Hawkins, G.A., et al. (2014) Deciphering
Functional Variants For Genes Associated With Asthma Susceptibility Using EQTL Of
Bronchial Epithelial Cells And Bronchial Alveolar Lavage. THE GENOME AND ASTHMA IN
2014. American Thoracic Society, pp. A1001-A1001.
Liu, C.Y., Battaglia, M., Lee, S.H., Sun, Q.H., Aster, R.H. and Visentin, G.P. (2005) Platelet factor 4
differentially modulates CD4(+)CD25(+) (regulatory) versus CD4(+)CD25(-)
(nonregulatory) T cells, Journal of Immunology, 174, 2680-2686.
Liu, E.Y., Li, M., Wang, W. and Li, Y. (2013) MaCH-admix: genotype imputation for admixed
populations, Genetic Epidemiology, 37, 25-37.
Liu, W., Jiang, X., Han, F., Li, Y., Chen, H., Liu, Y., et al. (2013) LHX6 acts as a novel potential tumour
suppressor with epigenetic inactivation in lung cancer, Cell Death & Disease, 4, e882.
Locksley, R.M. (2010) Asthma and allergic inflammation, Cell, 140, 777-783.
Luna, A. and Nicodemus, K.K. (2007) snp.plotter: an R-based SNP/haplotype association and
linkage disequilibrium plotting package, Bioinformatics, 23, 774-776.
Lunetta, K.L., Hayward, L.B., Segal, J. and Van Eerdewegh, P. (2004) Screening large-scale
association study data: exploiting interactions using random forests, BMC Genetics, 5, 32.
MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., et al. (2017) The new NHGRI-EBI
Catalog of published genome-wide association studies (GWAS Catalog), Nucleic Acids
Research, 45, D896-D901.
Mahdi, H., Fisher, B.A., Källberg, H., Plant, D., Malmström, V., Rönnelid, J., et al. (2009) Specific
interaction between genotype, smoking and autoimmunity to citrullinated α-enolase in the
etiology of rheumatoid arthritis, Nature Genetics, 41, 1319-1324.
Marandi, Y., Farahi, N. and Hashjin, G.S. (2013) Asthma: beyond corticosteroid treatment, Archives
of Medical Science, 9, 521-526.
Marchini, J., Donnelly, P. and Cardon, L.R. (2005) Genome-wide strategies for detecting multiple loci
that influence complex diseases, Nature Genetics, 37, 413-417.
Marenholz, I., Kerscher, T., Bauerfeind, A., Esparza-Gordillo, J., Nickel, R., Keil, T., et al. (2009) An
interaction between filaggrin mutations and early food sensitization improves the
prediction of childhood asthma, Journal of Allergy and Clinical Immunology, 123, 911-916.
Matsumoto, K., Tamari, M. and Saito, H. (2008) Involvement of eosinophils in the onset of asthma,
Journal of Allergy and Clinical Immunology, 121, 26-27.
McKinney, B.A., Reif, D.M., Ritchie, M.D. and Moore, J.H. (2006) Machine learning for detecting gene-
gene interactions: a review, Applied Bioinformatics, 5, 77-88.
Melen, E., Kho, A.T., Sharma, S., Gaedigk, R., Leeder, J.S., Mariani, T.J., et al. (2011) Expression
analysis of asthma candidate genes during human and murine lung development,
Respiratory Research, 12, 86.
Melen, E. and Pershagen, G. (2012) Pathophysiology of asthma: lessons from genetic research with
particular focus on severe asthma, Journal of Internal Medicine, 272, 108-120.
Miller, W.J. (1997) Dominance, codominance and epistasis, Brazilian Journal of Genetics, 20.
Millstein, J. (2013) Screening-testing approaches for gene-gene and gene-environment interactions
using independent statistics, Frontiers in Genetics, 4, 306.
Millstein, J., Chen, G.K. and Breton, C.V. (2016) cit: hypothesis testing software for mediation
analysis in genomic applications, Bioinformatics, 32, 2364-2365.
Millstein, J., Conti, D.V., Gilliland, F.D. and Gauderman, W.J. (2006) A testing framework for
identifying susceptibility genes in the presence of epistasis, American Journal of Human
Genetics, 78, 15-27.
Millstein, J. and Volfson, D. (2013) Computationally efficient permutation-based confidence interval
estimation for tail-area FDR, Frontiers in Genetics, 4, 179.
Mitchell, J.B. (2011) Informatics, machine learning and computational medicinal chemistry, Future
Medicinal Chemistry, 3, 451-467.
Moffatt, M.F., Gut, I.G., Demenais, F., Strachan, D.P., Bouzigon, E., Heath, S., et al. (2010) A large-scale,
consortium-based genomewide association study of asthma, New England Journal of
Medicine, 363, 1211-1221.
Moffatt, M.F., Kabesch, M., Liang, L., Dixon, A.L., Strachan, D., Heath, S., et al. (2007) Genetic variants
regulating ORMDL3 expression contribute to the risk of childhood asthma, Nature, 448,
470-473.
Moore, J.H. (2003) The ubiquitous nature of epistasis in determining susceptibility to common
human diseases, Human Heredity, 56, 73-82.
Moore, J.H. (2007) Genome-wide analysis of epistasis using multifactor dimensionality reduction:
Feature selection and construction in, Knowledge Discovery and Data Mining: Challenges and
Realities, 17.
Moore, J.H., Boczko, E.M. and Summar, M.L. (2005) Connecting the dots between genes,
biochemistry, and disease susceptibility: systems biology modeling in human genetics,
Molecular Genetics and Metabolism, 84, 104-111.
Moore, J.H., Gilbert, J.C., Tsai, C.T., Chiang, F.T., Holden, T., Barney, N., et al. (2006) A flexible
computational framework for detecting, characterizing, and interpreting statistical patterns
of epistasis in genetic studies of human disease susceptibility, Journal of Theoretical Biology,
241, 252-261.
Moore, J.H. and Williams, S.M. (2005) Traversing the conceptual divide between biological and
statistical epistasis: systems biology and a more modern synthesis, Bioessays, 27, 637-646.
Moore, J.H. and Williams, S.M. (2002) New strategies for identifying gene-gene interactions in
hypertension, Annals of Medicine, 34, 88-95.
Moore, J.H. and Williams, S.M. (2009) Epistasis and its implications for personal genetics, American
Journal of Human Genetics, 85, 309-320.
Motsinger, A.A., Ritchie, M.D. and Reif, D.M. (2007) Novel methods for detecting epistasis in
pharmacogenomics studies, Pharmacogenomics, 8, 1229-1241.
Mukherjee, B., Ahn, J., Gruber, S.B., Rennert, G., Moreno, V. and Chatterjee, N. (2008) Tests for gene-
environment interaction from case-control data: a novel study of type I error, power and
designs, Genetic Epidemiology, 32, 615-626.
Mukherjee, B. and Chatterjee, N. (2008) Exploiting Gene-Environment Independence for Analysis
of Case-Control Studies: An Empirical Bayes-Type Shrinkage Estimator to Trade-Off
between Bias and Efficiency, Biometrics, 64, 685-694.
Murcray, C.E., Lewinger, J.P. and Gauderman, W.J. (2009) Gene-environment interaction in genome-
wide association studies, American Journal of Epidemiology, 169, 219-226.
Nakajima, T., Matsumoto, K., Suto, H., Tanaka, K., Ebisawa, M., Tomita, H., et al. (2001) Gene
expression screening of human mast cells and eosinophils using high-density
oligonucleotide probe arrays: abundant expression of major basic protein in mast cells,
Blood, 98, 1127-1134.
Narod, S.A. and Foulkes, W.D. (2004) BRCA1 and BRCA2: 1994 and beyond, Nature Reviews Cancer,
4, 665-676.
Nelson, M., Kardia, S., Ferrell, R. and Sing, C. (2001) A combinatorial partitioning method to identify
multilocus genotypic partitions that predict quantitative trait variation, Genome Research,
11, 458-470.
Ober, C. and Yao, T.C. (2011) The genetics of asthma and allergic disease: a 21st century
perspective, Immunological Reviews, 242, 10-30.
Palmer, C.N., Ismail, T., Lee, S.P., Terron-Kwiatkowski, A., Zhao, Y., Liao, H., et al. (2007) Filaggrin
null mutations are associated with increased asthma severity in children and young adults,
Journal of Allergy and Clinical Immunology, 120, 64-68.
Palmer, L.J. and Cookson, W.O. (2000) Genomic approaches to understanding asthma, Genome
Research, 10, 1280-1287.
Patterson, N., Price, A.L. and Reich, D. (2006) Population structure and eigenanalysis, PLOS Genetics,
2, e190.
Peng, Q., Zhao, J. and Xue, F. (2010) A gene-based method for detecting gene–gene co-association in
a case–control association study, European Journal of Human Genetics, 18, 582-587.
Phillips, P.C. (1998) The language of gene interaction, Genetics, 149, 1167-1171.
Phillips, P.C. (2008) Epistasis--the essential role of gene interactions in the structure and evolution
of genetic systems, Nature Reviews Genetics, 9, 855-867.
Piegorsch, W.W., Weinberg, C.R. and Taylor, J.A. (1994) Non-hierarchical logistic models and case-
only designs for assessing susceptibility in population-based case-control studies, Statistics
in Medicine, 13, 153-162.
Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2007) NCBI reference sequences (RefSeq): a curated
non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids
Research, 35, D61-65.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., et al. (2007) PLINK: a
tool set for whole-genome association and population-based linkage analyses, American
Journal of Human Genetics, 81, 559-575.
Raby, B., Barnes, K., Beaty, T., Bosco, A., Carey, V., Castro, M., et al. (2011) Asthma bridge: the asthma
biorepository for integrative genomic exploration. American Journal of Respiratory and
Critical Care Medicine. American Thoracic Society, New York, NY, USA.
Rajapakse, I., Perlman, M.D., Martin, P.J., Hansen, J.A. and Kooperberg, C. (2012) Multivariate
detection of gene-gene interactions, Genetic Epidemiology, 36, 622-630.
Rak, S., Bjornson, A., Hakanson, L., Sorenson, S. and Venge, P. (1991) The Effect of Immunotherapy
on Eosinophil Accumulation and Production of Eosinophil Chemotactic Activity in the Lung
of Subjects with Asthma during Natural Pollen Exposure, Journal of Allergy and Clinical
Immunology, 88, 878-888.
Reinius, L.E., Gref, A., Saaf, A., Acevedo, N., Joerink, M., Kupczyk, M., et al. (2013) DNA methylation in
the Neuropeptide S Receptor 1 (NPSR1) promoter in relation to asthma and environmental
factors, PLOS One, 8, e53877.
Ripperger, T., Gadzicki, D., Meindl, A. and Schlegelberger, B. (2009) Breast cancer susceptibility:
current knowledge and implications for genetic counselling, European Journal of Human
Genetics, 17, 722-731.
Rodriguez, E., Baurecht, H., Herberich, E., Wagenpfeil, S., Brown, S.J., Cordell, H.J., et al. (2009) Meta-
analysis of filaggrin polymorphisms in eczema and asthma: robust risk factors in atopic
disease, Journal of Allergy and Clinical Immunology, 123, 1361-1370 e1367.
Roussel, L., Robins, S., Schachter, A., Berube, J., Hamid, Q. and Rousseau, S. (2011) Steroids and
extracellular signal-regulated kinase 1/2 activity suppress activating transcription factor 3
expression in patients with severe asthma, Journal of Allergy and Clinical Immunology, 127,
1632-1634.
Salam, M.T., Millstein, J., Li, Y.F., Lurmann, F.W., Margolis, H.G. and Gilliland, F.D. (2005) Birth
outcomes and prenatal exposure to ozone, carbon monoxide, and particulate matter: results
from the Children's Health Study, Environmental Health Perspectives, 113, 1638-1644.
Savenije, O.E.M., Kerkhof, M., Reijmerink, N.E., Brunekreef, B., de Jongste, J.C., Smit, H.A., et al. (2011)
Interleukin-1 receptor-like 1 polymorphisms are associated with serum IL1RL1-a,
eosinophils, and asthma in childhood, Journal of Allergy and Clinical Immunology, 127, 750-
U394.
Schmidt, S. and Schaid, D.J. (1999) Potential misinterpretation of the case-only study to assess gene-
environment interaction, American Journal of Epidemiology, 150, 878-885.
Scholkopf, B. and Smola, A.J. (2001) Learning with kernels: support vector machines, regularization,
optimization, and beyond. MIT Press.
Silventoinen, K., Sammalisto, S., Perola, M., Boomsma, D.I., Cornes, B.K., Davis, C., et al. (2003)
Heritability of adult body height: a comparative study of twin cohorts in eight countries,
Twin Research and Human Genetics, 6, 399-408.
Simon, N. and Tibshirani, R. (2015) A Permutation Approach to Testing Interactions for Binary
Response by Comparing Correlations Between Classes, Journal of the American Statistical
Association, 110, 1707-1716.
Sleiman, P.M., Annaiah, K., Imielinski, M., Bradfield, J.P., Kim, C.E., Frackelton, E.C., et al. (2008)
ORMDL3 variants associated with asthma susceptibility in North Americans of European
ancestry, Journal of Allergy and Clinical Immunology, 122, 1225-1227.
Sleiman, P.M., Flory, J., Imielinski, M., Bradfield, J.P., Annaiah, K., Willis-Owen, S.A., et al. (2010)
Variants of DENND1B associated with asthma in children, New England Journal of Medicine,
362, 36-44.
Snyder, L.H. (1935) The Principles of Heredity.
Song, G.G. and Lee, Y.H. (2013) Pathway analysis of genome-wide association study on asthma,
Human Immunology, 74, 256-260.
Soto-Ramirez, N., Arshad, S.H., Holloway, J.W., Zhang, H., Schauberger, E., Ewart, S., et al. (2013) The
interaction of genetic variants and DNA methylation of the interleukin-4 receptor gene
increase the risk of asthma at age 18 years, Clinical Epigenetics, 5, 1.
Spitz, M.R., Amos, C.I., Dong, Q., Lin, J. and Wu, X. (2008) The CHRNA5-A3 region on chromosome
15q24-25.1 is a risk factor both for nicotine dependence and for lung cancer, Journal of the
National Cancer Institute, 100, 1552-1556.
Sun, Q., Yang, X., Zhong, B., Jiao, F., Li, C., Li, D., et al. (2012) Upregulated protein arginine
methyltransferase 1 by IL-4 increases eotaxin-1 expression in airway epithelial cells and
participates in antigen-induced pulmonary inflammation in rats, Journal of Immunology,
188, 3506-3512.
Sun, Q.Z., Jiao, F.F., Yang, X.D., Zhong, B., Jiang, M.H., Li, G.L., et al. (2010) [Expression of protein
arginine N-methyltransferases in E3 rat models of acute asthma], Nan Fang Yi Ke Da Xue Xue
Bao, 30, 716-719.
Tavendale, R., Macgregor, D.F., Mukhopadhyay, S. and Palmer, C.N.A. (2008) A polymorphism
controlling ORMDL3 expression is associated with asthma that is poorly controlled by
current medications, Journal of Allergy and Clinical Immunology, 121, 860-863.
Templeton, A.R. (2000) Epistasis and complex traits, Epistasis and the Evolutionary Process, 41-57.
Thornton-Wells, T.A., Moore, J.H. and Haines, J.L. (2004) Genetics, statistics and human disease:
analytical retooling for complexity, Trends in Genetics, 20, 640-647.
Torgerson, D.G., Ampleford, E.J., Chiu, G.Y., Gauderman, W.J., Gignoux, C.R., Graves, P.E., et al. (2011)
Meta-analysis of genome-wide association studies of asthma in ethnically diverse North
American populations, Nature Genetics, 43, 887-892.
Traister, R.S., Uvalle, C.E., Hawkins, G.A., Meyers, D.A., Bleecker, E.R. and Wenzel, S.E. (2015)
Phenotypic and genotypic association of epithelial IL1RL1 to human T(H)2-like asthma,
Journal of Allergy and Clinical Immunology, 135, 92-U160.
Tutluoglu, B., Gurel, C.B., Ozdas, S.B., Musellim, B., Erturan, S., Anakkaya, A.N., et al. (2005) Platelet
function and fibrinolytic activity in patients with bronchial asthma, Clinical and Applied
Thrombosis-Hemostasis, 11, 77-81.
Tyler, A.L., Asselbergs, F.W., Williams, S.M. and Moore, J.H. (2009) Shadows of complexity: what
biological networks reveal about epistasis and pleiotropy, Bioessays, 31, 220-227.
Uebersax, J.S. (2006) Introduction to the tetrachoric and polychoric correlation coefficients,
Retrieved from http://www.john-uebersax.com/stat/tetra.htm.
Umbach, D.M. and Weinberg, C.R. (1997) Designing and analysing case-control studies to exploit
independence of genotype and exposure, Statistics in Medicine, 16, 1731-1743.
Valdearcos, M., Esquinas, E., Meana, C., Pena, L., Gil-de-Gomez, L., Balsinde, J., et al. (2012) Lipin-2
reduces proinflammatory signaling induced by saturated fatty acids in macrophages,
Journal of Biological Chemistry, 287, 10894-10904.
van der Woude, D., Alemayehu, W.G., Verduijn, W., de Vries, R.R., Houwing-Duistermaat, J.J.,
Huizinga, T.W., et al. (2010) Gene-environment interaction influences the reactivity of
autoantibodies to citrullinated antigens in rheumatoid arthritis, Nature Genetics, 42, 814-
816; author reply 816.
Verlaan, D.J., Berlivet, S., Hunninghake, G.M., Madore, A.-M., Larivière, M., Moussette, S., et al. (2009)
Allele-specific chromatin remodeling in the ZPBP2/GSDMB/ORMDL3 locus associated with
the risk of asthma and autoimmune disease, The American Journal of Human Genetics, 85,
377-393.
Visscher, P.M. (2009) Whole genome approaches to quantitative genetics, Genetica, 136, 351-358.
Visscher, P.M., Brown, M.A., McCarthy, M.I. and Yang, J. (2012) Five years of GWAS discovery,
American Journal of Human Genetics, 90, 7-24.
Visscher, P.M., Hill, W.G. and Wray, N.R. (2008) Heritability in the genomics era--concepts and
misconceptions, Nature Reviews Genetics, 9, 255-266.
Visscher, P.M., Macgregor, S., Benyamin, B., Zhu, G., Gordon, S., Medland, S., et al. (2007) Genome
partitioning of genetic variation for height from 11,214 sibling pairs, American Journal of
Human Genetics, 81, 1104-1110.
Visscher, P.M., Medland, S.E., Ferreira, M.A., Morley, K.I., Zhu, G., Cornes, B.K., et al. (2006)
Assumption-free estimation of heritability from genome-wide identity-by-descent sharing
between full siblings, PLOS Genetics, 2, e41.
von Mering, C., Jensen, L.J., Snel, B., Hooper, S.D., Krupp, M., Foglierini, M., et al. (2005) STRING:
known and predicted protein-protein associations, integrated and transferred across
organisms, Nucleic Acids Research, 33, D433-437.
Waddington, C.H. (1942) Canalization of development and the inheritance of acquired characters,
Nature, 150, 563.
Wade, M.J. (2001) Epistasis, complex traits, and mapping genes, Genetica, 112-113, 59-69.
Wade, M.J., Winther, R.G., Agrawal, A.F. and Goodnight, C.J. (2001) Alternative definitions of
epistasis: dependence and interaction, Trends in Ecology & Evolution, 16, 498-504.
Wahlsten, D. (1990) Insensitivity of the analysis of variance to heredity-environment interaction,
Behavioral and Brain Sciences, 13, 109-120.
Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, N.L., et al. (2010) BOOST: A fast approach to
detecting gene-gene interactions in genome-wide case-control studies, American Journal of
Human Genetics, 87, 325-340.
Wang, J., Xie, H. and Fisher, J.F. (2011) Multilevel models: applications using SAS®. Walter de
Gruyter.
Wang, W.Y., Barratt, B.J., Clayton, D.G. and Todd, J.A. (2005) Genome-wide association studies:
theoretical and practical concerns, Nature Reviews Genetics, 6, 109-118.
Ward, B.W. (2013) Prevalence of multiple chronic conditions among US adults: estimates from the
National Health Interview Survey, 2010, Preventing Chronic Disease, 10.
Weidinger, S., O'Sullivan, M., Illig, T., Baurecht, H., Depner, M., Rodriguez, E., et al. (2008) Filaggrin
mutations, atopic eczema, hay fever, and asthma in children, Journal of Allergy and Clinical
Immunology, 121, 1203-1209 e1201.
Weinberg, C.R. and Umbach, D.M. (2000) Choosing a retrospective design to assess joint genetic and
environmental contributions to risk, American Journal of Epidemiology, 152, 197-203.
Weiss, K.M. (1995) Genetic variation and human disease: principles and evolutionary approaches.
Cambridge University Press.
Wilson, E.B. (1902) Mendel's Principles of Heredity and the Maturation of the Germ-Cells, Science,
16, 991-993.
Witten, D.M., Tibshirani, R. and Hastie, T. (2009) A penalized matrix decomposition, with
applications to sparse principal components and canonical correlation analysis,
Biostatistics, 10, 515-534.
Yang, Q., Khoury, M.J., Sun, F. and Flanders, W.D. (1999) Case-only design to measure gene-gene
interaction, Epidemiology, 10, 167-170.
Yang, Y., Houle, A.M., Letendre, J. and Richter, A. (2008) RET Gly691Ser mutation is associated with
primary vesicoureteral reflux in the French-Canadian population from Quebec, Human
Mutation, 29, 695-702.
Yuan, Z., Gao, Q., He, Y., Zhang, X., Li, F., Zhao, J., et al. (2012) Detection for gene-gene co-association
via kernel canonical correlation analysis, BMC Genetics, 13, 83.
Yucesoy, B., Kaufman, K.M., Lummus, Z.L., Weirauch, M.T., Zhang, G., Cartier, A., et al. (2015)
Genome-Wide Association Study Identifies Novel Loci Associated With Diisocyanate-
Induced Occupational Asthma, Toxicological Sciences, 146, 192-201.
Zhan, F., Huang, Y., Colla, S., Stewart, J.P., Hanamura, I., Gupta, S., et al. (2006) The molecular
classification of multiple myeloma, Blood, 108, 2020-2028.
Zhang, H., Tong, X., Holloway, J.W., Rezwan, F.I., Lockett, G.A., Patil, V., et al. (2014) The interplay of
DNA methylation over time with Th2 pathway genetic variants on asthma risk and temporal
asthma transition, Clinical Epigenetics, 6, 8.
Zhang, Y. and Liu, J.S. (2007) Bayesian inference of epistatic interactions in case-control studies,
Nature Genetics, 39, 1167-1173.
Zhang, Y., Moffatt, M.F. and Cookson, W.O. (2012) Genetic and genomic approaches to asthma: new
insights for the origins, Current Opinion in Pulmonary Medicine, 18, 6-13.
Zhao, J., Jin, L. and Xiong, M. (2006) Test for interaction between two unlinked loci, The American
Journal of Human Genetics, 79, 831-845.
Zwick, M. (2004) An overview of reconstructability analysis, Kybernetes, 33, 877-905.
Asset Metadata
Creator: Kogan, Vladimir (author)
Core Title: Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Biostatistics
Publication Date: 10/23/2018
Defense Date: 10/01/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: asthma susceptibility, DNA methylation, integrative genomics, OAI-PMH Harvest, set interaction, SNPs, statistical interactions
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Millstein, Joshua (committee chair), Breton, Carrie (committee member), Volk, Heather (committee member)
Creator Email: vkogan@usc.edu, vladimirkogan@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-90714
Unique identifier: UC11676652
Identifier: etd-KoganVladi-6899.pdf (filename), usctheses-c89-90714 (legacy record id)
Legacy Identifier: etd-KoganVladi-6899.pdf
Dmrecord: 90714
Document Type: Dissertation
Rights: Kogan, Vladimir
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA