Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Missing heritability may be explained by the common household environment and its interaction with genetic variation
(USC Thesis Other)
Missing heritability may be explained by the common household environment and its interaction with genetic variation
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Missing Heritability May Be Explained By The Common Household Environment And Its Interaction With Genetic Variation by Nan Wang A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Statistical Genetics and Genetic Epidemiology) December 2016 Copyright 2016 Nan Wang Acknowledgements I would like to express my gratitude to people who have helped me, in one way or another, in the process of developing this dissertation. First, I feel deeply grateful to my advisor, Dr Watanabe, for his support and guidance along the way of my academic research. It was him who introduced me to the field of statistical genetics and taught me both technical skills and ethical rules about doing scientific research. He always encourages his students to think independently and this dissertation would not be made possible if I hadn’t got such training. I would also like to thank my Committee for their invaluable questions and sug- gestions that helped me think more deeply and thorough about this dissertation. Dr Daniel Stram answered all my questions on linear mixed-effects models. Dr Juan Pablo Lewinger helped me on several technical details and inspired my confidence on this research topic. Dr Norberto Grzywacz raised the most and the toughest questions that helped me make this dissertation a lot more rigorous. Dr Gary Chen helped me know better about high performance computing through our discussions. I also feel very fortunate to have such nice lab mates, Yu-Hsiang Shu, Zhanghua Chen, Jie Ren, and colleagues, Adrienne Mackay, Enrique Trigo. Their help and support made my years at USC full of joy and memories. Last but not least, I would like to express my gratitude to my parents, Yanfang Wang and Zhaoli Qu, as well as my wife Wanqing Yu, for their constant support and love. Having you is always my source of strength to become a better myself. Contents Abstract i Abbreviations ii 1 Research Background 1 1.1 The definition of heritability . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Estimation of heritability . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Analysis of twins . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.2 Parent-offsprint regression . . . . . . . . . . . . . . . . . . . . 5 1.2.3 Variance components approach . . . . . . . . . . . . . . . . . 7 1.3 The problem of missing heritability . . . . . . . . . . . . . . . . . . . 10 1.4 Plausible explanations to missing heritability . . . . . . . . . . . . . . 12 1.5 Other sources of missing heritability: common household environment and its interaction with genetic variation . . . . . . . . . . . . . . . . 15 2 Inflation In Estimating The Additive Variance Under A Misspeci- fied Model 19 2.1 Formulation of the problem . . . . . . . . . . . . . . . . . . . . . . . 19 2.1.1 Model for simulating data . . . . . . . . . . . . . . . . . . . . 20 2.1.2 Model for analyzing data . . . . . . . . . . . . . . . . . . . . . 21 2.2 Model misspecification: theory . . . . . . . . . . . . . . . . . . . . . . 23 2.2.1 The derivation . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.2 The impact of the number of families . . . . . . . . . . . . . . 28 2.2.3 Test the theory using simulations . . . . . . . . . . . . . . . . 29 2.2.4 Predicted bias of the variance terms under misspecified models 32 2.2.5 The effect of the number of children per family . . . . . . . . 42 2.3 Model misspecification: simulations . . . . . . . . . . . . . . . . . . . 44 2.4 Inflated estimation of the additive variance under other scenarios . . 50 2.4.1 A general form interaction between the common household en- vironment and genetic variation . . . . . . . . . . . . . . . . . 50 2.4.2 Genetic interaction under the limiting pathway model . . . . . 53 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3 Unbiased Estimation Of The Additive Variance Under The Correct Model 59 3.1 Algorithms for estimating parameters under the correct model . . . . 59 3.1.1 Implementation of the EM algorithm . . . . . . . . . . . . . . 62 3.1.2 Implementation of the BFGS algorithm . . . . . . . . . . . . . 67 3.2 Test the algorithm via simulations . . . . . . . . . . . . . . . . . . . . 68 3.2.1 Unbiased estimation of the parameters . . . . . . . . . . . . . 68 3.2.2 The impact of the sample size and the number of children per family on standard errors . . . . . . . . . . . . . . . . . . . . . 77 3.3 Relationship to linear mixed-effects model method . . . . . . . . . . . 80 3.4 Choice of the reference allele and the interpretation of the estimated variance components . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4 Hypothesis Testing And Data Application 93 4.1 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.1.1 Beta of interaction method . . . . . . . . . . . . . . . . . . . . 94 4.1.2 Linear mixed-effects method . . . . . . . . . . . . . . . . . . . 95 4.1.3 Type 1 error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.1.4 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.1.5 Relationship between the sample size and power . . . . . . . . 105 4.2 Application: The Framingham Heart Study . . . . . . . . . . . . . . 106 4.2.1 General information about the study . . . . . . . . . . . . . . 106 4.2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 5 Conclusion And Future Directions 135 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 5.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 References 138 Abstract The ‘missing heritably’ problem describes the fact that the explained heritability from identified genetic variation associated with phenotypes/diseases is much smaller than the estimated total heritability from traditional population studies. This dissertation proposed a hypothesis to explain the missing heritably problem. We claimed that the problem may be due to the inflated estimation of the total heritability from population studies with misspecified model. We used theoretical derivation and simulations to illustrate that the heritability would be overestimated when there was common household environment (CHE) and/or its interaction with genetic variation but was not specified in the estimation model. We then proposed a method based on linear mixed-effects model (LMM) to properly account for the CHE and the interaction, which could be used to estimate the heritability without bias. The connection between this method and standard LMM methods was also built. We further proposed hypothesis testing frameworks for model selection. Finally we tested both the estimation and hypothesis testing methods using the Framingham Heart Study data, and found evidence supporting the hypothesis of this dissertation on the source of missing heritability. i Abbreviations CHE common household environment LMM linear mixed-effects models SNP single nucleotide polymorphism MAF minor allele frequency FHS Framingham Heart Study ii Chapter 1 Research Background 1.1. The definition of heritability The modern science of genetics is regarded to start from Gregor Mendel, who discovered the laws of inheritance in the mid-19th century. Despite of the fact that the early studies that applied Mendel’s Laws were focused on the inheritance of dis- crete traits (smoothness of pea seeds, color of flowers, etc.), the need for plant and animal breeding drove the development of quantitative genetics. Ronald Fisher, in his pioneering 1918 paper, discussed the genetic resemblance of quantitative traits between relatives. The percentage of the total variance that can be attributed to a quantity was termed as ‘essential genotype’, which is now called ‘narrow-sense heri- tability’ (h 2 ). Sewall Wright (1920) described the genetic resemblance using a path model, and denoted the path coefficient between genotype and phenotype as h (for heredity), which was then followed by people to denote heritability. The term ‘heri- tability’ was firstly used by Lush (1940) to describe the fraction of observed variation that is caused by difference in hereditary factors. Jacquard (1983) summarized dif- ferent meanings of heritability and clarified some confusions and misinterpretations 1 CHAPTER 1. RESEARCH BACKGROUND on them. Formal definition of heritability requires decomposition of observed variance of a given quantitative trait. In the following context, ‘trait’ indicates ‘quantitative trait’ unless stated otherwise, and ‘trait’ and ‘phenotype’ will be used interchangeably. Following Visscher et al. (2008), an observed trait of interest can be partitioned into components due to genotype (hereditary factors) and environment (non-hereditary factors), Phenotype(P) = Genotype(G) + Environment(E) Then the variance of the trait ( 2 P ) can be similarly decomposed as the sum of the corresponding components, 2 P = 2 G + 2 E (1.1) The assumption underlying equation (1.1) is the absence of both the covariance between genotype and environment ( 2 G;E ) and the interaction between them ( 2 GE ). The genetic variance ( 2 G ) can be further decomposed into additive genetic vari- ance ( 2 A ), dominance genetic variance ( 2 D ), and epistatic genetic variance ( 2 I ), 2 G = 2 A + 2 D + 2 I (1.2) Heritability can herein be defined as the proportion of observed phenotypic vari- ancethatisattributedtototalgeneticeffect(broad-sense, denotedasH 2 )oradditive genetic effect (narrow-sense, denoted as h 2 ). Or written in equations, H 2 = 2 G 2 P and h 2 = 2 A 2 P (1.3) 2 CHAPTER 1. RESEARCH BACKGROUND The environment variance in equation (1.1) can be partitioned into factors that are specific to certain group of people, such as shared environment for siblings ( 2 sib ) or for parent-offspring pairs ( 2 PO ), etc. Usually there will be an individually inde- pendent residual variance component capturing all the left variation that cannot be attributed to any specified factors ( 2 e ). Heritability (especially the narrow-sense h 2 ) plays an important role in genetic studies. h 2 quantifies the relative magnitude of response to selection across genera- tions, the relationship of which can be described using the breeder’s equation (Fisher, 1958; Lynch and Walsh, 1998, p. 50). It also serves as an upper bound of inheritance of traits or diseases, for the sum of linear effects of genes. People refer to h 2 to see if all the genetic factors underlying a given trait or disease have been found. However, there are limitations and confusions on the use ofh 2 , which are mostly related to the assumptions underlying its estimation. 1.2. Estimation of heritability Due to the importance of heritability, people have developed a variety of ap- proaches to estimate it, some of which even were before the emergence of the term. All of these approaches take advantage of the difference on phenotypic correlation or covariance estimates comparing various combinations of relatives, and they all have certain assumptions that are critical to the validity of the estimate of heritability. In the following is brief introduction to several widely implemented approaches of estimating heritability. 3 CHAPTER 1. RESEARCH BACKGROUND 1.2.1. Analysis of twins One of the oldest methods that has been used until today is the analysis of twins, which was initially recognized by Galton (1876) as a way to study the impact of both genetics and environment on a trait. The earliest method was simply based on one-way analysis of variance (ANOVA) (Kempthorne and Osborne, 1961; Lynch and Walsh, 1998, p. 582). For a give pair of twins, we have a linear model, y ij = +b i +e ij wherey ij is the trait of thej th member (j = 1 or 2) of thei th pair, is the population mean, b i is the deviation of pair i to the mean, and e ij is the residual error or each twin from their mean +b i . It is assumed that b i and e ij are independent. For n pairs of twins, the between-pair (B) and within-pair (W) observed mean squares (MS) and their expectations are df MS E(MS) Between pairs n 1 B 2 2 b + 2 e Within pairs n W 2 e If the twins are monozygotic (MZ) twins, given the fact that they have identical genetic background in theory, then an upper bound of broad-sense heritability H 2 can be estimated using the correlation t MZ = 2 b =( 2 b + 2 e ) among twin pairs. If the MS values are used to estimate the components of variance, ^ 2 b = (BW )=2 ^ 2 e = (B +W )=2 ^ 2 b 4 CHAPTER 1. RESEARCH BACKGROUND Therefore the estimate of H 2 is ^ H 2 = ^ t MZ = BW B +W The above estimator will be inflated if the MZ twin pairs share common environ- ment that has impact on the trait directly. To better address this question, dizygotic (DZ) twin pairs have been incorporated into the model, under the assumption that they share the same amount of environment as MZ twins do. The correlation t DZ among DZ twins can be calculated. With further assumption that the underlying genetic factors are purely additive, the total phenotypic variance can be partitioned into additive genetic (A), common environment (C) and residual environment (E) components, which is the ACE model (Visscher, 2004). The narrow-sense heritability h 2 can be estimated using the ‘Falconer’s formula’ (Falconer and Mackay, 1996), ^ h 2 = 2( ^ t MZ ^ t DZ ) The variance components involved in the calculation can also be estimated using maximum likelihood estimate (MLE) methods (Visscher, 2004). However, it won’t be further described here. 1.2.2. Parent-offsprint regression Another widely-used method to estimate h 2 is parent-offspring regression. It also has a long history because it has been a natural question to ask how much the offspring will resemble their parents on a given trait. The discussion on this topic started from Fisher (1918), where he showed that given some assumptions, regression 5 CHAPTER 1. RESEARCH BACKGROUND of offspring on a parent for a trait in a random mating population has an expected value 2 A =2 2 P . Specifically, if each family only includes one offspring and one parent, regress the trait of offspring y o on that of parent y p , y oi = + op y pi +e i (1.4) where y oi and y pi are the trait of offspring and parent for the i th family, is the intercept, op is the regression slope, e i is the residual. The least-square regression gives ^ op = Cov(y o ;y p )=Var(y p ). The expected slope coefficient, according to the covariance equation between relatives Lynch and Walsh (1998, p. 144), is op =E ^ op = (y o ;y p ) 2 (y p ) = 2 A 2 2 P = 1 2 h 2 under the assumption that (1) there is no shared environment causing the resem- blancebetweentheofferingandtheparent, (2)thereisnoadditiveadditiveepistatic effect, (3) the phenotypic variance across the two generations is the same. If both of the parents are included, and we further assume that the phenotypic variance is the same for mothers and fathers, we can instead regress the offspring on the mean of the parents (mid-parent value). Then equation (1.4) can be modified as y oi = + o p y mi +y fi 2 +e i (1.5) wherey mi andy fi are the trait of mothers and fathers. The slope coefficient has the 6 CHAPTER 1. RESEARCH BACKGROUND relationship o p = [y o ; (y m +y f )=2] 2 [(y m +y f )=2] = [(y o ;y m ) +(y o ;y f )]=2 ( 2 P + 2 P )=4 = 2(y o ;y p ) 2 (y p ) = 2 op Thus ^ o p is a direct estimate of h 2 . Using parent-offspring regression has some advantages compared to methods us- ingothercloselyrelatedrelatives, suchastwinsorsiblings(KempthorneandTandon, 1953). Parents and offspring are less likely to share environment that contributes to the trait of interest, whereas full sibs and twins even share a common placental environment. Furthermore, given that parents and offspring share exact one allele at a single locus, the estimate of h 2 will not be biased by dominance genetic effect. 1.2.3. Variance components approach The above two approaches, as well as other analysis on relatives that are based on ANOVA, can only take advantage of a single type of familiar relationship. However, data collected in the wild field or from breeding experiments usually include multiple relationships, e.g. half sibling, grandparent, etc. On the other hand, there are circumstances that the data are collected from studies with imbalanced design, or therearemissingdata. Insuchcircumstances, ordinaryANOVAisunabletocombine the data together. Although there were methods that modified ANOVA to account for the problem (Henderson, 1953), it’s computational cumbersome when there are many different classes. 7 CHAPTER 1. RESEARCH BACKGROUND HartleyandRao(1967)introducedamethodthatusedmaximumlikelihood(ML) to estimate random effects in a ‘general mixed analysis of variance model’, which is now usually referred to as ‘linear mixed-effects model’ (LMM). The method , in contrast to ANOVA, did not have restrictions on the data structure. However, an additional important assumption was added, which was the normality of the residual errors. The model for n observations, for example, can be expressed as y = X + Zu + e (1.6) where y is ann 1 vector of a given trait, is a vector of fixed effects (covariates), u is a vector of random effects, e is a vector of residual random errors, X and Z are design matrices of and u, respectively. Among the quantities, y, X and Z are observed(known), andtherestareunknown. Thevectorsandmatricesallconformin terms of their dimensions. The distributions of the random factors are u N(0; W), and e N(0; R). Then the total variance is V = ZWZ T + R, and the likelihood L is L = (2) n=2 jVj 1=2 exp 1 2 (y X) T V 1 (y X) (1.7) Inthecaseofvariancecomponentsanalysis, thevariancematrix W oftherandom effects is diagonal. In other words, the random effects are independent. Then the first term of V can be expressed as ZWZ T = z 1 z T 1 2 1 + z 2 z T 2 2 2 + + z k z T k 2 k , fork random effects, where z k is the k th column of Z and 2 k the corresponding variance component. On the other hand, the residual errore is usually assumed independent, or R = I 2 e , where I is an nn identity matrix, and 2 e the variance of e. The genetic covariance between two relatives x and y, following (Lynch and 8 CHAPTER 1. RESEARCH BACKGROUND Walsh, 1998, p.144), can be shown to have the form G (x;y) = (2 xy ) n m xy 2 A n D m = 2 xy 2 A + xy 2 D + (2 xy ) 2 2 AA + 2 xy xy 2 AD + 2 xy 2 DD + where xy is the coefficient of kinship, xy is the coefficient of fraternity, and 2 A n D m is variance from additive effect or dominance or all levels of epistasis. Furthermore, xy = 1 + 7 and reduces to 7 in the absence of inbreeding, where the 1 and 7 are two types of condensed coefficients of identity (Jacquard, 1974). In practice, the higher-order epistasis effects are ignored, and usually in human studies the subjects of interest are outbred, therefore the above equation reduces to G (x;y) = 2 xy 2 A + 7 2 D and the variance-covariance matrix is V = 2 2 A + 7 2 D + I 2 e where and 7 are matrices incorporating all across-individual coefficients xy and 7 . The variance components are estimated using maximum likelihood (ML). Rather thanestimatedanalytically,thevariancecomponentsareestimatednumerically,since the ML does not have a close-formed solution. This is also a difference from ANOVA- based approaches. If we denote the estimated additive variance and total phenotypic 9 CHAPTER 1. RESEARCH BACKGROUND variance to be ^ 2 A and ^ 2 P , then the estimated heritability is ^ h 2 = ^ 2 A ^ 2 P It is worth noting again that one important assumption underlying this approach is the normality of the given trait, or trait residuals. Fisher (1918) assumed a poly- genic trait has many contributing loci, each of which only has a very small effect. Lange (1978) gave theoretical proofs of normality of a trait within pedigrees under several assumptions. Besides, normality of traits can be examined in practice (Hop- per and Mathews, 1982), and it can be achieved via data transformation (Boehnke et al., 1986). 1.3. The problem of missing heritability Finding the gene or genes underlying a genetic disease has always been one of the focuses of genetics studies. In the last decades of the 20 th century, linkage analysis achieved remarkable success in finding genetic basic of many human diseases, such as Huntington’s disease (Gusella et al., 1983), breast cancer (Easton et al., 1993; Hall et al., 1990), Alzheimer’s disease (Saunders et al., 1993). However, the success did not extend to complex diseases, such as schizophrenia, bipolar disease, diabetes, etc. Linkage analysis failed to identify genetic loci underlying complex diseases, and it was believed that this was because linkage analysis was underpowered to detect modest genetic effect. Association analysis was raised as the solution to the problem, since it has higher power to detect common genetic loci with relatively low effect (Collins et al., 1997; 10 CHAPTER 1. RESEARCH BACKGROUND Lander, 1996; Risch and Merikangas, 1996). This lead to the so-called ‘common dis- ease, common variant’ hypothesis (Cargill et al., 1999)and the International Hapmap project(Gibbs et al., 2003). Since the first well-designed genome-wide association study (GWAS) in 2007 (Wellcome Trust Case Control Consortium, 2007), it proves to be a huge success in finding genetic susceptibility loci for complex diseases. To date (Nov 2014), there have been 2,024 publications and 14,648 single nucleotide polymorphisms (SNPs) recorded in the NHGRI GWAS Catalog for over 1,100 differ- ent diseases or quantitative traits. Despite of the fact that GWAS’s have discovered many genetic loci underlying eachcomplexdiseaseorquantitativetrait,theproportionofvariationexplainedbyall these variants is limited, which is usually between 0.1 and 0.2. Table 1.1 summarizes for some of the traits (Visscher et al., 2012). Table 1.1: Heritability estimated from population studies and explained by GWAS’s for a selective of complex diseases or traits Disease or trait h 2 population studies h 2 explained by GWAS’s Height 0.8 0.1 BMI 0.4-0.6 0.01-0.02 HDL cholesterol 0.5 0.1 Type 2 diabetes a 0.3-0.6 0.05-0.10 Crohn’s disease a 0.6-0.8 0.1 a Variation in liability scale. There is a huge gap between the heritability estimated from population studies and that explained by all SNPs discovered by GWAS’s for a variety of diseases or traits. This phenomenon is termed as the ‘missing heritability’ problem. Although as the sample size increase, GWAS has higher power to detect more variants in 11 CHAPTER 1. RESEARCH BACKGROUND theory(Visscher et al., 2012), the effect sizes of these yet-to-be-discovered variants are smaller. Therefore, GWAS’s are unlikely to fill the entire gap in phenotypic variation, even with increasing sample sizes. 1.4. Plausible explanations to missing heritability There have been a number of explanations or hypotheses to the question why GWAS do not capture all the heritability so far. One of them is that there are still common variants with moderate effect that have yet to be discovered. As argued in section 1.3, this explanation is becoming less and less promising as the ongoing GWAS’s with larger sample sizes fail to substantially increase the proportion of vari- ation explained. GWAS’s with sample sizes of around 20,000 individuals for lipid phenotypes, including total cholesterol, low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol and triglycerides, still identified SNPs passing statistically significance level that only together explained 5% to 10% of phenotypic variation (Aulchenko et al., 2009; Kathiresan et al., 2009). Yang et al. (2010) showed that a large proportion of heritability of human height could be at- tributable to common variants if all of them were taken into account using a linear mixed effect model, yet another study showed that this approach might not be able to fully explain missing heritability for all possible traits, at least not for BMI (Yang et al., 2011b). Another explanation is that rare variants with larger effect sizes, compared to common variants, might be as important as, or even more important than, common variants in the pathogenesis of complex diseases. The classification of ‘rare’ or ‘com- mon’ variants is relatively arbitrary, with a minor allele frequency (MAF) usually 12 CHAPTER 1. RESEARCH BACKGROUND at >5% for common variants, and a MAF at <0.5% (Manolio et al., 2009) or at <2%-3% (Bodmer and Bonilla, 2008) defined by different researchers. In general, the definition of rare variants should be dependent on the sample size of the study. By design, GWAS is underpowered to capture signals directly from rare variants, nor are rare variants able to be tagged by common variants due to an induced low linkage disequilibrium (LD) caused by the low frequencies of the rare variants (Zeggini et al., 2005). There have been arguments supporting the role of rare variants underlying complex diseases. From an evolutionary point of view, diseases lead to decreased fitness and individuals with diseases are selected against, and therefore variants con- tributing to diseases are unlikely to be common (Pritchard, 2001). Moreover, there are examples of diseases attributable to rare variants with relatively large effect, such as missense variants in APC to the susceptibility for inherited colorectal cancer (Frayling et al., 1998; Laken et al., 1997). Bearing these rationales, new technologies and algorithms for the Next-Generation Sequencing (NGS) studies have been devel- oped. Using NGS, researchers are now able to identify rare variants and examine their contributions to complex diseases. However, there are also arguments against the proposition that rare variants are the primary source of missing heritability. For complex diseases, the postulated effect sizes of rare variants predict a lower sibling recurrence rates than observed (Gibson, 2011). In addition, an simulation analysis using classical coalescence theory showed that the frequency distribution of the most associated allele was inconsistent with the hypothesis, which is referred to as the ‘Synthetic Association’ theory, that a GWAS signal captured by common variants were actually driven by multiple rare variants in LD (Wray et al., 2011). A study on autoimmune-locus rare coding-region variants could only explain a negligible proportion of heritability (Hunt et al., 2013). Similar 13 CHAPTER 1. RESEARCH BACKGROUND conclusions were drawn for a study for autism (Gaugler et al., 2014). In summary, rare variants are not likely to be the sole answer to missing heritability. It has also been postulated that structural variations of the genome might be a source of missing heritability. One of the most common structural variations is copy number variation (CNV). Two studies found strong associations between CNVs with the risk of schizophrenia (International Schizophrenia Consortium, 2008; Ste- fansson et al., 2008). However, there was another study showing that the common CNVs were well tagged by common SNPs, and therefore the effects of these CNVs should have already been indirectly captured by GWAS’s via SNPs (Conrad et al., 2010). Besides, the technologies of scanning CNVs are early in their development, so structural variations will not be a main consideration here. From another perspective, the estimate of heritability from population studies could be biased upwards in the first place. Nonlinear genetic effects and shared en- vironment within relatives, are two factors leading to inflated heritability estimates. Nonlinear genetic effects include dominant genetic effect, gene-gene interaction (epis- tasis) and gene-environment interaction. Dominant genetic effect is usually confounded by gene-gene interaction or shared environment (Neale and Cardon, 1992), thus its contribution to heritability has been poorly characterized in traditional studies. However, theory showed that dominant geneticvariancewassmallcomparedtoadditivevarianceinoutbredpopulations(Hill et al., 2008). Recent genomic technologies provide the means to examine this. Zhu et al. (2015) extended the concept of estimating variance components using genome- wide SNPs (Yang et al., 2010) to estimate dominant variance, and concluded that dominant variance contributed little to the missing heritability. In contrast, Chen et al. (2015) argued that the contribution of dominant variance was often masked 14 CHAPTER 1. RESEARCH BACKGROUND by shared environment and showed dominant variance contributed a considerable amount to total phenotypic variance, with a fraction ranging from 0.14 to 0.49 for the 14 examined traits. As controversial as it is, the contribution of dominant variance requires further investigation. Zuk et al. (2012) showed that gene-gene interaction, if present, might greatly inflate the heritability estimate from population studies, which was referred to as ‘phantom heritability’. It’s been shown that gene-gene interaction, or epistasis, is widely present in a variety of experimental organisms (Bloom et al., 2013; Shao et al., 2008). However, it’s also likely that there are still other sources to the inflation of heritability estimation, including shared environment and gene-environment interac- tion (Manolio et al., 2009; Visscher et al., 2008). These are also the two components that I will show in this dissertation to cause inflation of heritability estimation, and I’ll introduce related research background in detail in section 1.5. After all, the above-mentioned nonlinear genetic effects would not cause miss- ing heritability if they could be captured by GWAS. However, this is not the case, since GWAS is by design poorly powered to detect these nonlinear genetic effects (McCarthy et al., 2008). 1.5. Other sources of missing heritability: common household environment and its interaction with genetic variation To better understand the impact of environment on heritability, it is worth noting that heritability is not constant across populations or across time. Given a time 15 CHAPTER 1. RESEARCH BACKGROUND point, heritability is a property of the population and of the environment that the individualsareexposedto(FalconerandMackay,1996, p. 161). Heritabilitydepends on the relative proportions of each variance component. Allele frequencies of genes may be different across various populations, leading to different amount of genetic variation. On the other hand, if the environment is homogenous, heritability tends to be higher than if it’s heterogeneous, because the variability in the environment is lower. Forexample,HoffmannandParsons(1993)foundthatheritabilityincreasedin stressful environments. Heritability can also change over time, which implicates that the pathogenesis of complex diseases involves environmental factors. For example, the incidence of diabetes in developing Asian countries greatly increased in the past two to three decades (Chan et al., 2009). Theprimaryhypothesisofthisdissertationisthatthecurrentestimateofnarrow- sense heritability h 2 may be inflated due to not taking into account the common household environment (CHE) and its interaction with genetic variation, especially the second. The CHE indicates a combination of environmental factors shared by close in- dividuals living in the same household. It may include unmeasured dietary and/or other lifestyle factors, e.g. smoking, physical activity, stress, etc. Human studies cannot always eliminate the variation due to the CHE, unlike for animals or plants, where the experiment conditions can be controlled. There were studies on traits related to cardiovascular diseases or type 2 diabetes that took into account shared environment and revealed its considerable contribution to total phenotypic variation for certain traits (Freeman et al., 2002; Mitchell et al., 1996). There was also evi- dence in favor of shared environment over either dominant or epistatic genetic effects in explaining the excess phenotypic correlation among relatives (Zaitlen et al., 2013). 16 CHAPTER 1. RESEARCH BACKGROUND However, the interaction between shared environment and genetic variation has not been well acknowledged and characterized. This interaction can be considered as a modifying effect of shared environmental factors on the effects of genetic variation. There are evidences of gene-environment interaction for a variety of complex diseases (Caspi and Moffitt, 2006; Chamaillard et al., 2003; Kilpeläinen et al., 2011). Only if wetakeintoaccounttheinteraction, whentrulypresent, theseparatecontributionsof genes, environment and their interaction can be correctly estimated (Hunter, 2005). There have been previous work that examined whether not specifying the inter- action between shared environment and genetic variation would bias the heritability estimate. Rao and Morton (1974) investigated this question using path analysis on generated data under a single-locus model and concluded that there was not enough power to detect the interaction and therefore heritability would not be apprecia- bly overestimated. In contrast, Lathrope et al. (1984) took similar approach but used a multi-locus model, and found that both heritability and shared environment were overestimated in the presence of the interaction. Despite of the contrasting conclusions, there were some intrinsic flaws of path analysis. For example, the like- lihood function was not correct and the inference was based on invalid assumptions of independence(Karlin et al., 1983). In this dissertation, I used a more straight-forward and better-characterized ap- proach, the variance components framework, to investigate the problem similar to that in Lathrope et al. (1984). That is, will the interaction between shared environ- ment, herein the CHE, and genetic variation give rise to biased heritability estimate? If yes, by which direction and how much? I also derived equations to predict the bias based on the asymptotic maximum likelihood theories of Huber (1967), which makes it possible to quickly examine the bias under a wide variety of parameter settings 17 CHAPTER 1. RESEARCH BACKGROUND without the use of simulations. In addition, I developed a modified variance compo- nents method to estimate heritability, together with other parameters, without bias. Finally, I applied a likelihood ratio test to detect the statistical significance of both the CHE and its interaction with genetic variation, and examined my estimation and inference framework on real data. It’s also of note that there is a misunderstanding from people that it is unnec- essary to consider nonlinear genetic effects such as gene-gene or gene-environment interactions when dealing with narrow-sense heritability, for example Gibson (2011). As Zuk et al. (2012) showed gene-gene interaction would inflate the estimate of her- itability, I will illustrate in Chapter 2 that heritability will be biased upwards if not correctly specifying the CHE and its interaction with genetic variation. It is of note that this work addresses a possible explanation that incorrect assumptions may lead to considerable missing heritability. The model raised in this dissertation may not be necessarily true for every trait related to complex diseases. 18 Chapter 2 Inflation In Estimating The Additive Vari- ance Under A Misspecified Model 2.1. Formulation of the problem For this work the variance components approach coupled with maximum like- lihood (ML) method is used, considering its flexibility in model specification and therefore being readily modified to capture the interaction between the common household environment (CHE) and genetic variation. To illustrate the idea, an ideal source of data comes from nuclear families, since they consist of individual house- holds. The simplest nuclear family, which can be referred to as a ‘trio’, is comprised of a father, a mother and a child. Without loss of generality, trios are used below in the working model. 19 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL 2.1.1. Model for simulating data For a given trait y, we can model it with a linear equation following a similar formulation as equation (1.6) y ij = M X m=1 x ijm m +g ij +H i + int x ijt H i +e ij (2.1) for individual j from family i. x ::m is the m th SNP with an effect size m , out of M SNPs in total. g is the residual additive genetic variance unaccounted for by x,H the CHE, x ::t H the interaction between the t th SNP and the CHE with an effect size of int , ande the individual residual error. For this dissertation, we assume only one SNP has interaction with the CHE. It should be relatively straight-forward to extend this framework to scenarios where there are multiple SNPs that have interaction with the CHE. The data simulation workflow is as following: (1) set the total number of SNPs ,their minor allele frequencies (MAFs, denoted as p m for the m th SNP) and m ’s, randomly sample x i:m for each SNP from binomial distributions binom(2,p m ) for parents of thei th family, and then randomly pass alleles to their child; (2) randomly sample g i from a multivariate normal distribution N(0; G) for the i th family, with G being described in detail in the next paragraph; (3) randomly sample H i from a normal distribution N(0; 2 H ), where 2 H is the variance of the CHE, and then assign individuals from the i th family the same H i ; (4) calculate the interaction effect by multiplying the sampled t th SNP data with the CHE and int ; (5) randomly sample e i from a normal distribution N(0; 2 e ), where 2 e is the variance of the residual error; (6) y ij is then the sum of all the above quantities. 20 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL The total additive variance is the sum of the joint effects of the SNPs not in the analysis model and the directly sampled additive componentg. GivenM SNPs, with the marginal effect size m and the MAF p m for the m th SNP, the expected joint additive variance from these M SNPs is 2 A = M X m=1 2p m (1p m ) 2 m (2.2) It should be noted that both the CHE and its interaction with SNPs can be removed from this simulation framework for a purely additive genetic model. For simplicity, currently we may assume linkage equilibrium and Hardy-Weinberg equi- librium (HWE) across the SNPs, and only one SNP has interaction with the CHE. These restrictions can be loosened and further studied in future analyses. 2.1.2. Model for analyzing data The analysis model follows the mean equation y ij = K X k=1 x ijk k = x ij if there are K SNPs in the model, which may be less than the number of simulated SNPs. The ’s in the above equation are marginal effect sizes of the SNPs. The variance equations are (1) The additive genetic variance G = K 2 A = 2 2 A 21 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL For a trio, for example, if the order of individuals is father, mother and child, then K = 2 6 6 6 6 4 1 0 0:5 0 1 0:5 0:5 0:5 1 3 7 7 7 7 5 (2) The CHE variance a. no interaction with SNPs, H = B 2 H where B is a matrix of blocks of 1’s along the diagonal. For a trio, for example, a single block is 2 6 6 6 6 4 1 1 1 1 1 1 1 1 1 3 7 7 7 7 5 b. having interaction with the t th SNP, H = B 2 H where B is also a matrix of blocks of (1 + int x ::t )(1 + int x ::t ) T along the diagonal, and int is the effect size of the interaction between the t th SNP and the CHE. For the i th trio, the block is 2 6 6 6 6 4 (1 + int x i1t )(1 + int x i1t ) (1 + int x i1t )(1 + int x i2t ) (1 + int x i1t )(1 + int x i3t ) (1 + int x i2t )(1 + int x i1t ) (1 + int x i2t )(1 + int x i2t ) (1 + int x i2t )(1 + int x i3t ) (1 + int x i3t )(1 + int x i1t ) (1 + int x i3t )(1 + int x i2t ) (1 + int x i3t )(1 + int x i3t ) 3 7 7 7 7 5 22 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL (3) The residual variance E = I 2 e where I is the identity matrix. Therefore the total variance is the sum of the above matrices, V = G + H + E Similartothehowthedataissimulated, thematrix Hcanalsobedroppedtoassume a purely additive genetic model during analysis. The above model is actually a special case of the general framework of linear mixed-effects model (LMM), except for int which is introduced specifically for the current problem of interest. The likelihood function is then L(; 2 A ; 2 H ; 2 e ; int ; y) = 1 (2) n=2 jVj 1 2 exp 1 2 (y X) T V 1 (y X) (2.3) where n is the number of individuals. 2.2. Model misspecification: theory 2.2.1. The derivation Huber(1967)provedthatundersomegeneralregularityconditions, themaximum likelihood estimator (MLE) converges to a well-defined limit, even if the probability model is incorrectly specified. Akaike (1973) pointed out that the MLE of the model parameters ^ under the incorrect model converges in probability to the value that minimizes the Kullback-Leibler (KL) divergence of the misspecified model from the 23 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL true model, or minimizes E Y log P true (y;) P false (y; ) (2.4) )@E Y log P true (y;) P false (y; ) . @ = 0 (2.5) The expectation is taken with regard to the true distribution. The conditions for consistencyoftheMLEminimizingtheKLdivergencewereprovidedinWhite(1982). Although not as general as those in Huber (1967), the conditions are general enough for most settings and are more intuitive. Inourcase,theparametersbeingestimatedarethevariancecomponents 2 ’s. As- suming other covariates have been adjusted, namely the trait variable y N(0; V), denote the parameters of the false model (the misspecified model) with an asterisk ‘’, then following equation (2.3), the density functions of the true model and the false model are P true (y; 2 ) = (2) n=2 jVj 1=2 exp( 1 2 y T V 1 y) (2.6) P false (y; 2 ) = (2) n=2 jV j 1=2 exp( 1 2 y T V 1 y) (2.7) where V = n X i=1 M i 2 i ; V = m X j=1 M j 2 j m<n; M j = M i for some i;j (2.8) In other words, the false model is incorrect by specifying less variance components, compared to the true model. Note the condition (2.8) is not necessary for other general problems. 24 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Put(2.6)and(2.7)into(2.5), itcanbeshownthatbysolvingaseriesofequations, tr (I V 1 V)V 1 @V @ 2 j = 0 for each j (2.9) we can compute the values that the MLE ^ 2 j ’s converge to under the false model. Severalequationsusedinthederivationofequation(2.9)regardingmatrixalgebra are listed below. Let random variable x N(; ), and A is a square matrix, t is a scalar, tr(A) is the trace of A,jAj is the determinant of A, log(x) is the natural log of x, then by (Petersen and Pedersen, 2008), (1) E(x T Ax) = tr(A) + T A (2) @ @t log(jAj) = tr(A 1 @ @t A) (3) @ @t tr(A) = tr( @ @t A) (4) @ @t A 1 = A 1 @A @t A 1 The proof of (2.9) is as follow. Proof. Following (2.5) @E Y log P true (y; 2 ) P false (y; 2 ) . @ 2 = 0 25 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL then for each j 0 =@E Y log jV j 1=2 jVj 1=2 exp 1 2 (y T V 1 y y T V 1 y) . @ 2 j =@E Y 1 2 (logjV jlogjVj) + 1 2 (y T V 1 y y T V 1 y) . @ 2 j = 1 2 @ logjV jlogjVj +tr(V 1 V)tr(V 1 V) . @ 2 j = 1 2 tr V 1 @V @ 2 j tr V 1 @V @ 2 j V 1 V = tr (I V 1 V)V 1 @V @ 2 j It is of note that the same equation (2.9) also works when there is interaction between the CHE and genetic variation but the effect size of interaction int is not specified in the analysis model. However, certain modifications must be made, be- cause the interaction effect is influenced by the sampled genotype of the t th SNP that varies across different simulation replications. To overcome this problem, we can instead sum results across different genotype combinations proportionally to the expected probabilities of those combinations. Here we assume HWE, so we can as- sign probabilities to genotype combinations. For example, for a biallelic locus where the two alleles are A and a, if the minor allele frequency is q, and we set p = 1q, then the probabilities corresponding to each family genotype combination are listed in Table 2.1. Note that in this table it does not matter if parent 1 is the father or the mother. Now when using equation (2.9) to predict the bias, we only need to replace the V tobeanexpected V accordingtotheprobabilitiesinTable2.1, with int incorporated 26 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Table 2.1: Family genotype combination and corresponding probabilities Parent 1 Parent 2 Child probability AA AA AA p 4 AA Aa AA 2p 3 q AA Aa Aa 2p 3 q AA aa Aa 2p 2 q 2 Aa Aa AA p 2 q 2 Aa Aa Aa 2p 2 q 2 Aa Aa aa p 2 q 2 Aa aa Aa 2pq 3 Aa aa aa 2pq 3 aa aa aa q 4 in the H matrix. Another problem to consider is that the variance components may be predicted to be less than zero given certain parameter combinations, although they should be positive. As a solution, Lagrange multipliers were added to the original minimization equation (2.4), with a constraint of the variance terms greater than or equal to an arbitrary value, e.g. 0.1, we then have minimizes E Y log P true (y;) P false (y; ) +(0:1 ) Then equation (2.9) is modified to be tr (I V 1 V)V 1 @V @ 2 j j = 0 for each j (2.10) 27 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL In addition, two other necessary conditions are required, j (0:1 2 j ) = 0 (2.11) j 0 (2.12) Solve equation (2.10) to (2.12) should give a reasonable solution to any parameter combinations. Multiple equations from (2.9) need to be solved: one for each variance component. Theequationsarenonlinear, soanonlinearequationsolvernleqslv(v2.2) implemented in R (v3.0.2) was used. Under different parameter settings, the true parameters were put into (2.9), and the values that the parameters would converge to under a misspecified model were then computed. 2.2.2. The impact of the number of families [Methods] Intuitively, thenumberoffamiliesshouldnothaveanimpactontheresult, assum- ing the families are genetically independent. This was tested under four parameter settings, and we compared the estimated variance terms and computation time given 1, 10 or 100 families. The variance terms were arbitrarily set to either 10 or 100. The computation was on a Macbook Pro with a 2.53 GHz Intel Core 2 Duo CPU. [Results] The test result is shown in Table 2.2. As expected, the number of families does not have an impact on the prediction. However, including less families will dramatically speed up the computation. Given this, the following analysis will only include one trio to save computation. 28 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Table 2.2: Prediction and computation time using different numbers of families True parameters Number of families 2 A 2 e Time / ms a 2 A = 100, 1 228.68 61.97 5.52 2 H = 100, 10 228.68 61.97 34.45 2 e = 100 100 228.68 61.97 3449 2 A = 10, 1 144.23 57.94 6.51 2 H = 100, 10 144.23 57.94 34.56 2 e = 100 100 144.23 57.94 4271 2 A = 100, 1 114.78 94.74 4.37 2 H = 10, 10 114.78 94.74 29.72 2 e = 100 100 114.78 94.74 2635 2 A = 100, 1 183.20 0.1 b 11.56 c 2 H = 100, 10 183.20 0.1 b 41.56 c 2 e = 10 100 183.20 0.1 b 6977 c a Mean of 100 evaluations. b Value constrained to be equal or greater than 0.1 using Lagrange multiplier. c Calculation took more time because of the using of Lagrange mul- tiplier. 2.2.3. Test the theory using simulations [Methods] The equation (2.9) can be tested by comparing its prediction against result from simulations. We simulated data following the scheme similar to (2.1), with one dif- ference: the additive genetic effect for each individual was directly sampled from a multivariate normal distribution rather than being the joint effect of multiple SNPs. The simulated data was then analyzed using the classic LMM described in subsec- 29 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL tion 2.1.2. 200 trios were simulated under each parameter setting, which was then repeated for 200 times. Both the simulation and the analysis were implemented using R (v3.0.2). There were three scenarios to examine: (1) 2 A ; 2 H ; 2 e in the true model, and 2 A ; 2 e in the specified model; (2) 2 A ; 2 H ; 2 e ; int in the true model, and 2 A ; 2 e in the specified model; (3) 2 A ; 2 H ; 2 e ; int in the true model, and 2 A ; 2 H ; 2 e in the specified model. Since this comparison is not a prove, for scenarios (2) and (3), it is sufficient to show the results where the variance components were set to be all 100 and the interaction effect int was varied to be one of the values from -5, -2, -1, 1, 2, 5. The minor allele frequency of the SNP having interaction in scenarios (2) and (3) is 0.2. [Results] The results are shown in Table 2.3 to 2.5. It can be seen from Table 2.3 that under a wide range of different parameter combinations, thevaluespredictedbythetheoryconformwelltothesimulations. We can now just use the theory to explore how different parameter combinations affect the estimation under a misspecified model both in a even wider range and at a finer scale. Another important message is that the estimated additive genetic variance under the misspecified model was inflated under all the four different settings. This is further illustrated in subsection 2.2.4. Table 2.4 and 2.5 show that even with the interaction in the true model, the predictions are also consistent with the simulations. There are probably several exceptions where a given variance component is close to the 0, which is the boundary of the parameter space. This is because the distribution of the estimated variance 30 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Table 2.3: The comparison between the theory predication and simula- tions, scenario 1 True VC a Prediction Simulation b 2 A 2 H 2 e 2 A 2 e 2 A 2 e 100 100 100 228.7 62.0 231:7 31:8 61:2 19:7 10 100 100 144.2 57.9 144:5 19:0 57:8 11:5 100 10 100 114.8 94.7 114:1 18:9 95:2 15:7 100 100 10 183.2 0.1 c 181:9 11:6 0:0151 0:0079 d a Variance components b Mean sd c Value constrained to be equal to or greater than 0.1 using La- grange multiplier d Value constrained to be greater than 0 Table 2.4: The comparison between the theory predication and simulations, scenario 2 Truth a Prediction Simulation b int 2 A 2 e 2 A 2 e -5 1018.3 69.0 984:6 189:7 82:7 72:4 -2 233.4 98.3 231:8 38:0 97:9 23:8 -1 181.2 84.2 179:7 26:0 84:8 17:5 1 367.4 36.9 365:1 43:1 37:0 18:8 2 600.3 6.2 582:2 69:1 16:4 20:8 5 1666.5 0.1 c 1648:6 238:3 6:3 22:5 a Variance components 2 A = 2 H = 2 e = 100 b Mean sd c Value constrained to be equal to or greater than 0.1 using Lagrange multiplier 31 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Table 2.5: The comparison between the theory predication and simulations, scenario 3 Truth a Prediction Simulation b int 2 A 2 H 2 e 2 A 2 H 2 e -5 900.0 100.0 100.0 868:5 197:8 98:8 96:4 108:0 84:7 -2 228.0 4.0 100.0 214:5 43:2 12:8 17:5 103:9 25:1 -1 132.0 36.0 100.0 128:4 33:9 37:7 18:9 101:0 20:4 1 132.0 196.0 100.0 131:9 37:8 193:6 34:7 99:4 22:8 2 228.0 324.0 100.0 225:9 57:9 319:5 59:9 100:0 30:2 5 900.0 900.0 100.0 877:6 209:6 883:4 196:7 105:9 88:5 a Variance components 2 A = 2 H = 2 e = 100 b Mean sd term is truncated at 0, which makes the mean of the estimates shift right to a larger value. Generally speaking, this is not a big problem, since when any variance term is close to 0, it’s often too small relative to the other terms and won’t have a large impact on our conclusions on the bias. Two important messages are: (1) the inflation of 2 A increases as the absolute value of int gets bigger; (2) Table 2.5 shows that despite of not specifying int , the residual variance can be estimated without bias. These conclusions will also be further examined in subsection 2.2.4. 2.2.4. Predicted bias of the variance terms under misspecified models [Methods] Toevaluatetheeffectofvariouscombinationsofparametersettingsontheestima- tion, we repeatedly solve equation (2.9) (equation (2.10) if necessary) with different 32 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL parameter settings. For scenario 1 from subsection 2.2.3, since we are interested in the relative bias, we fixed the residual error variance to be 100 and modified the additive genetic vari- ance and the CHE variance, from a value of 10 to 1000 (from 0.1 to 10). The parameter space between 10 and 1000 was divided into 50 grids evenly on the log- arithm scale. For scenario 2 and 3, we set all 3 of the variance terms to be 100, and examine the impact of both the minor allele frequency p of the SNP and the interaction effect size int , with p in [0:01; 0:5] and int in [5; 5]. Here we assumed the SNP followed HWE. The parameter space was divided into 50 grids for each of them. Finally for each scenario, a relative bias was calculated as the ratio between the estimated variance component and the true value. The logarithm of the relative bias was used to clearly show if there is bias and its direction. [Results] The results are shown in 3-D plots and contour plots in Figure 2.1 to Figure 2.7. For scenario 1, the most important message from Figure 2.1 is that within a reasonablylargerangeoftheCHEvariance, whichisfrom0.1to10oftheresidual variance and is therefore from 0.01 to 100 of the additive genetic variance, the estimated additive variance is consistently biased upwards, given the log ratios are all positive. The ratios, on the other hand, depend on the relative magnitude of the true additive and CHE variance components. The ratios change with the same direction as the CHE variance and the opposite direction as the additive variance. This makes sensebecauseasthetrueCHEvarianceincreases, theestimatedadditivevariancewill be more biased upward, whereas when the true additive variance increases, ignoring the CHE variance will contribute relatively less to the bias. 33 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 2.0 log 10 (relative bias of σ A 2 ) log 10 ( σ A 2 / σ e 2 ) log 10 ( σ H 2 / σ e 2 ) (a) −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 log 10 ( σ A 2 /σ e 2 ) log 10 ( σ H 2 / σ e 2 ) 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 0.5 1.0 1.5 (b) Figure 2.1: The upward bias of the estimated additive variance under various parameter combinations for scenario 1. The x and y axes show the log (base=10) of the ratio of either 2 A or 2 H over 2 e . (a) 3-D plot of the log (base=10) of the relative bias (estimated 2 A divided by true 2 A ); (b) contour plot corresponding to (a). 34 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL log 10 ( σ A 2 /σ e 2 ) −1.0 −0.5 0.0 0.5 1.0 log 10 (σ H 2 / σ e 2 ) −1.0 −0.5 0.0 0.5 1.0 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 log 10 (relative bias of σ e 2 ) (a) −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0 log 10 ( σ A 2 /σ e 2 ) log 10 ( σ H 2 /σ e 2 ) −2.5 −2 −1.5 −1 −0.5 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 (b) Figure 2.2: The downward bias of the estimated residual variance under various parameter combinations for scenario 1. The x and y axes show the log (base=10) of the ratio of either 2 A or 2 H over 2 e . (a) 3-D plot of the log (base=10) of the relative bias (estimated 2 e divided by true 2 e ); (b) contour plot corresponding to (a). Results in dark blue were from calculations using Lagrange multipliers to ensure the variance terms to be positive. 35 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL β int −4 −2 0 2 4 p 0.1 0.2 0.3 0.4 0.5 log 10 (relative bias of σ A 2 ) 0.5 1.0 1.5 0.2 0.4 0.6 0.8 1.0 1.2 1.4 (a) −4 −2 0 2 4 0.1 0.2 0.3 0.4 0.5 β int p 0.2 0.4 0.4 0.6 0.6 0.8 0.8 1 1 1.2 1.2 1.4 0.2 0.4 0.6 0.8 1.0 1.2 1.4 (b) Figure 2.3: The upward bias of the estimated additive variance under various parameter combinations for scenario 2. The x and y axes show the interaction effect size int and the minor allele frequencyp. (a) 3-D plot of the log (base=10) of the relative bias (estimated 2 A divided by true 2 A ); (b) contour plot corresponding to (a). 36 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL β int −4 −2 0 2 4 p 0.1 0.2 0.3 0.4 0.5 log 10 (relative bias of σ e 2 ) −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 (a) −4 −2 0 2 4 0.1 0.2 0.3 0.4 0.5 β int p −2.5 −2.5 −2 −2 −1.5 −1.5 −1 −1 −0.5 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 (b) Figure 2.4: The downward bias of the estimated residual variance under various parameter combinations for scenario 2. The x and y axes show the interaction effect size int and the minor allele frequencyp. (a) 3-Dplot of the log(base=10) of the relative bias (estimated 2 e divided by true 2 e ); (b) contour plot corresponding to (a). 37 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL β int −4 −2 0 2 4 p 0.1 0.2 0.3 0.4 0.5 log 10 (relative bias of σ A 2 ) 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 (a) −4 −2 0 2 4 0.1 0.2 0.3 0.4 0.5 β int p 0.1 0.1 0.2 0.2 0.3 0.3 0.4 0.4 0.5 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1 1 1.1 1.1 0.2 0.4 0.6 0.8 1.0 (b) Figure 2.5: The upward bias of the estimated additive variance under various parameter combinations for scenario 3. The x and y axes show the interaction effect size int and the minor allele frequencyp. (a) 3-D plot of the log (base=10) of the relative bias (estimated 2 A divided by true 2 A ); (b) contour plot corresponding to (a). 38 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL β int −4 −2 0 2 4 p 0.1 0.2 0.3 0.4 0.5 log 10 (relative bias of σ H 2 ) −4 −2 0 −5 −4 −3 −2 −1 0 1 (a) −4 −2 0 2 4 0.1 0.2 0.3 0.4 0.5 β int p −3 −2.5 −2 −2 −1.5 −1.5 −1 −1 −0.5 −0.5 0 0 0.5 0.5 1 1 −5 −4 −3 −2 −1 0 1 (b) Figure 2.6: The bias of the estimated CHE variance under various parameter combinations for scenario 3. The x and y axes show the interaction effect size int and the minor allele frequency p. (a) 3-D plot of the log (base=10) of the relative bias (estimated 2 H divided by true 2 H ); (b) contour plot corresponding to (a). 39 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL β int −4 −2 0 2 4 p 0.1 0.2 0.3 0.4 0.5 log 10 (relative bias of σ e 2 ) −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 (a) −4 −2 0 2 4 0.1 0.2 0.3 0.4 0.5 β int p −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 (b) Figure 2.7: The estimated residual variance is generally unbiased under various parameter combinations for scenario 3. The x and y axes show the interaction effect size int and the minor allele frequencyp. (a) 3-D plot of the log (base=10) of the relative bias (estimated 2 e divided by true 2 e ); (b) contour plot corresponding to (a). 40 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Interestingly,fromFigure2.2,theresidualerrorisbiaseddownwards. Anintuitive explanation is that adding the CHE variance matrix H to the residual error matrix E makes the latter more similar to the structure of the additive variance matrix G (refer to subsection 2.1.2). As a result, some of the error variance is absorbed into the additive variance during estimation. For scenario 2, we see from Figure 2.3 and 2.4 that both the inflation of the estimated additive variance and the deflation of the estimated residual variance get worse as either the absolute value of the interaction effect size int or the minor allele frequency p increases. It is of note that a positive int and a negative int of the same absolute value have the same direction yet different magnitudes of effects on the biases. For scenario 3, Figure 2.5 shows similar trends as 2.3, except that the effect of int is symmetric about zero. The trend seems to break when int approaches 5 and p 0.5. Similar phenomenon is also observed for the residual variance shown in Figure 2.7. This happens because the surface of the target function, equation (2.4), becomes more flat and therefore fairly different parameter combinations might have similar function values. As a consequence, the equation solver might find roots that show different trends from those with other parameter combinations. With this said, it can be shown that for those cases when int approaches 5 and p 0.5, setting the residual variance to be the true value, i.e. 100, while keeping the additive and CHE variance terms as predicted, the target function only increases from the minimum by less than 0.1%. Consequently, we conclude that the additive variance is biased upwards when int is nonzero while the residual variance is unbiased. Figure 2.6 shows the CHE is biased upwards when int is positive. However, things become complicated when int is negative. As int decreases, the CHE is at 41 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL first biased downwards but then upwards again. This is related to the structure of the coefficient matrix B of the CHE, i.e. (1 + int x ::t )(1 + int x ::t ) T . When the absolute value of a negative int is small, (1 + int x ijt )(1 + int x ikt ) can be less than 1, the net effect of which is a smaller CHE variance. As int gets more negative, the CHE variance tends to totally vanish at certain point, and then int x ijt int x ijt starts to predominate the product of (1 + int x ijt )(1 + int x ikt ), and therefore the CHE variance increases again as the square of int grows. It is also of note that the more common the SNP is, the more frequent the aforementioned effect, so the point of a vanished CHE variance requires a weaker interaction, i.e. smallerj int j. To sum, when the CHE variance is present but is not specified in the analysis model, the estimated additive genetic variance will be inflated. This inflation will be more dramatic if there is also interaction between the CHE and genetic variation (SNP). This inflation will be reduced if the CHE is specified, yet it can still be remarkable as long as the interaction is non-negligible but is not specified while the SNP is not rare. 2.2.5. The effect of the number of children per family [Methods] So far the investigations have been solely based on trios, but the conclusions can be readily extended to nuclear families with multiple children. For scenario 1 in subsection 2.2.3, we just need to expand the variance component matrices of G, H and E accordingly. For scenario 2 and 3, we need to consider the expect variance matrix as that in subsection 2.2.1, but the probabilities of family genotype combinations can no longer be simply summarized as in Table 2.1. However, with 42 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL the same assumption of HWE, the combinations still follow certain distributions and we can compute the probabilities. More specifically, assuming the two alleles are A anda for the SNP having inter- action with the CHE, there are three cases to consider, depending on the genotypes of the two parents: (1) parent 1: AA, parent 2: Aa (2) parent 1: Aa, parent 2: aa (3) parent 1: Aa, parent 2: Aa Note again that the order of the parents doesn’t matter. The numbers of children having either genotype for case 1 (AA orAa) and 2 (Aa oraa) both follow a binomial distribution with p = 0:5. The number of children having genotype AA, Aa or aa follows a multinomial distribution with p AA =p aa = 0:25 and p Aa = 0:5. We then set 2 A = 2 H = 2 e = 100, int = 1 and p = 0:2, and repeated the pre- dictions done in subsection 2.2.3, with the number of children per family changing from 1 to 10. [Results] Table 2.6 shows that there is still inflation in ^ 2 A and deflation in ^ 2 e for scenarios 1 and 2, although the biases are attenuated as the number of children per family increases. Surprisingly, thebiasesinscenario3remainthesamewhateverthenumber ofchildrenis. Anintuitiveexplanationtothisphenomenonisthatadditionalchildren in the same household don’t provide additional information for differentiating the 3 variance components when all of them are specified in the model. However, it is of note that this result is based on theoretical expectations. In real applications, children in a same household are quite unlikely to be identical on their genetic and 43 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Table 2.6: The impact of the number of children per family on the predictions Number of Scenario 1 Scenario 2 Scenario 3 children ^ 2 A ^ 2 e ^ 2 A 2 e 2 A 2 H 2 e 1 228.68 61.97 367.42 36.86 132.00 196.00 100.00 2 219.41 63.47 338.85 44.54 132.00 196.00 100.00 3 212.53 64.31 318.90 48.77 132.00 196.00 100.00 4 207.29 64.80 304.48 51.30 132.00 196.00 100.00 5 203.20 65.09 293.61 52.94 132.00 196.00 100.00 6 199.91 65.28 285.15 54.05 132.00 196.00 100.00 7 197.22 65.40 278.37 54.84 132.00 196.00 100.00 8 194.98 65.48 272.82 55.43 132.00 196.00 100.00 9 193.08 65.52 268.19 55.86 132.00 196.00 100.00 10 191.46 65.55 264.27 56.20 132.00 196.00 100.00 environmental information. Finally, all the above results were also confirmed by simulations, the result of which are omitted here. In summary, we conclude that the biases caused by model misspecification cannot be removed with practically large family sizes. However, in a real application, we still prefer to include more samples because that may reduce the standard errors of the estimated parameters, which is not able to be reflected by the investigations here. 2.3. Model misspecification: simulations Althoughtheconsequencesofmodelmisspecificationhavebeenextensivelyexam- ined using the theoretical predictions in section 2.2, it is still necessary to investigate the precision of the estimates, i.e. standard errors, as well as how the estimation performs when any of the variance terms approaches zero, which is the parameter 44 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL space boundary. Here we will be focusing on scenario 2 and 3 defined in subsection 2.2.3, since scenario 1 is a special case of scenario 2 when int = 0. [Methods] The simulations followed the scheme of equation (2.1). In addition to the directly sampled additive variance, one SNP was simulated with a fixed marginal effect size = 1:5 and a MAF p = 0:2. This SNP was included in the analysis model, so it did not contribute to the residual additive variance. The effect of the interaction was evaluated under four variance parameter settings as those tested in subsection 2.2.3. For each setting, the interaction effect int was set to be one of the values among -5, -2, -1, 1, 2, 5. Therefore there were 4 6 = 24 settings. Each setting was repeated 1000 times with 200 simulated trios. The analysis model was misspec- ified in manners corresponding to scenario 2 and 3 defined in subsection 2.2.3. The simulated data were analyzed using standard variance components approach based on maximum likelihood. The relative bias was calculated as the ratio between the estimated variance component and the true value. Standard error of each parameter estimate was calculated as the standard deviation of the estimates from the 1000 simulations. [Results] The results are shown from Figure 2.8 to 2.12. InFigure2.8and2.9, alltheestimatedadditivevariancetermsarebiasedupwards while almost all the residual variance terms are biased downwards. Moreover, the biases become more serious as int increases when int > 0, and less serious when int < 0 with a relatively smallj int j but more serious again as int further decreases. 45 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value 1 2 5 10 20 50 100 200 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated σ A 2 True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 not specified Figure 2.8: The estimated additive variance 2 A when 2 A and 2 e are specified. Plotted are means standard errors. Note the y-axis is on log scale. The dashed line corresponds to the true values where the relative bias is 1. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value −2 −1 0 1 2 3 6 7 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated σ e 2 True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 not specified Figure 2.9: The estimated residual error variance 2 e when 2 A and 2 e are spec- ified. Plotted are means standard errors. Note the y-axis is on arithmetic scale. The dashed line corresponds to the true values where the relative bias is 1. 46 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Theses trends are consistent with what we’ve observed in subsection 2.2.4. The light blue line in Figure 2.8 shows that even when 2 H is relatively small compared to 2 A , the inflation in the additive variance can still be remarkable due to the interaction. Besides the biases, the standard errors are satisfactorily small, expect for 2 e with the setting 2 A = 2 H = 100, 2 e = 10 and int <1. This shows that the estimation is less precise and the specified model fits poorly with the simulated data. This could also explain the inconsistencies in the bias of 2 e for this parameter setting: the conclusion is unreliable in the first place given the estimates with low precision. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value 1 2 5 10 20 50 100 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated σ A 2 True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 specified Figure 2.10: The estimated additive variance 2 A when 2 A , 2 H and 2 e are spec- ified. Plotted are means standard errors. Note the y-axis is on log scale. The dashed line corresponds to the true values where the relative bias is 1. In Figure 2.10, the estimated additive variance is inflated for all but several settings when the CHE variance is small compared to the other two terms. The patterns are similar to what has been observed in Figure 2.8, but the inflation is less serious. When 2 H = 10 and int = -2 or -1 or 1, the estimated values are close to 47 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value 0 1 2 3 4 5 6 7 8 9 10 11 12 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated σ H 2 True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 specified Figure 2.11: The estimated CHE variance 2 H when 2 A and 2 e are specified. Plotted are means standard errors. Note the y-axis is on arithmetic scale. The dashed line corresponds to the true values where the relative bias is 1. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value −2 −1 0 1 2 3 4 9 10 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated σ e 2 True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 specified Figure 2.12: The estimated residual error variance 2 e when 2 A and 2 e are specified. Plotted are means standard errors. Note the y-axis is on arithmetic scale. The dashed line corresponds to the true values where the relative bias is 1. 48 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL the truth, but this might be simply due to the fact that 2 H is relatively small in this case. However, as the absolute value of int increases, the upward bias of 2 A is still significant. Figure 2.11 shows that the estimation of the CHE variance itself can be influence by int . The trends are consistent with the theoretical predictions in subsection 2.2.4, in which the bias is positive when either int > 0 orj int j is not too small, and is negative when int is around -1 and -2. All the lines in Figure 2.11 have almost the same shape, which implies that the magnitude of other variance components only have minor impact on how int biases the estimation. The estimation is less precise whenj int j is relatively big or 2 H is relatively small compared to the other variance components. Figure 2.12 shows that under every examined parameter setting, the estimated residualvarianceisnearlyunbiasedforall int exceptforwhenj int j = 5and 2 H = 10. Again the standard errors are huge with those parameter settings, which makes the estimation less reliable and therefore these two outliers are not too problematic. To sum, when the CHE variance and its interaction with SNP(s) are present but are not specified in the analysis model, the estimated additive genetic variance will be inflated. Even if the CHE variance is specified, failing to include int in the model may still lead to inflated additive variance whenever neither the CHE variance nor int is negligible. The inflation may get worse depending on the magnitude and direction of the interaction. 49 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL 2.4. Inflated estimation of the additive variance un- der other scenarios In the previous sections, we discussed the bias of estimating the additive vari- ance introduced by not including a multiplicative interaction between the CHE and genetic variation. Is the additive variance always biased upwards due to model mis- specification? We will examine this question in the following text. Although we don’t numerate all possible scenarios, the investigation here should provide valuable insight into answers for the question. 2.4.1. A general form interaction between the common house- hold environment and genetic variation [Methods] Instead of a multiplicative interaction, a more general form of interaction, was tested using the similar framework as that in subsection 2.2.4. We set the effect sizes of the environment EE and Ee separately for genotypes of either two (EE) or one (Ee) effect alleles. Since we are interested in in the effect of interaction, scenario 2 and 3 from subsection 2.2.3 are of our concern here. Again, we set all 3 of the variance terms to be 100. Both the effect sizes EE and Ee were in [5; 5], with the parameter space divided into 50 grids for each of them. 3 settings of effect al- lele frequency (EAF) of the interaction SNP at 5%, 20% and 40% were examined to evaluate the impact. HWE was assumed here. The relative bias was calculated as the ratio between the estimated variance component and the true value, and the logarithm of the relative bias was used. 50 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL [Results] The results are shown in contour plots in Figure 2.13 and Figure 2.14. β int (Ee) −4 −2 0 2 4 −4 −2 0 2 4 0.35 0.35 0.4 0.4 0.45 0.45 0.5 0.5 0.55 0.6 0.65 0.7 0.75 0.4 0.5 0.6 0.7 β int (EE) (a) −4 −2 0 2 4 −4 −2 0 2 4 0.1 0.2 0.3 0.4 0.5 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1 1.1 0.2 0.4 0.6 0.8 1.0 1.2 β int (EE) β int (Ee) (b) β int (Ee) −4 −2 0 2 4 −4 −2 0 2 4 0.2 0.4 0.6 0.8 1 1 1.2 1.2 1.4 0.2 0.4 0.6 0.8 1.0 1.2 1.4 β int (EE) (c) Figure2.13: Theupwardbiasoftheestimatedadditivevarianceunderageneral form interaction for scenario 2. The x and y axes show the interaction effect sizes Ee and EE for the two genotypes. Shown are contour plots of the log of the relative bias with the SNP effect allele frequency of (a) 5%; (b) 20%; (c) 40%. 51 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL −4 −2 0 2 4 −4 −2 0 2 4 0.05 0.05 0.1 0.1 0.15 0.15 0.2 0.2 0.25 0.25 0.3 0.3 0.35 0.35 0.4 0.4 0.45 0.45 0.5 0.5 0.0 0.1 0.2 0.3 0.4 0.5 β int (Ee) β int (EE) (a) β int (Ee) −4 −2 0 2 4 −4 −2 0 2 4 0.1 0.1 0.2 0.2 0.3 0.3 0.4 0.4 0.5 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.2 0.4 0.6 0.8 β int (EE) (b) −4 −2 0 2 4 −4 −2 0 2 4 0.1 0.1 0.2 0.2 0.3 0.3 0.4 0.4 0.5 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1 1 0.2 0.4 0.6 0.8 1.0 β int (EE) β int (Ee) (c) Figure2.14: Theupwardbiasoftheestimatedadditivevarianceunderageneral form interaction for scenario 3. The x and y axes show the interaction effect sizes Ee and EE for the two genotypes. Shown are contour plots of the log of the relative bias with the SNP effect allele frequency of (a) 5%; (b) 20%; (c) 40%. 52 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL With all the parameter combinations examined for both scenario 2 and 3, the bias in the additive variance is consistently upwards. Another fact is that when the effect allele is relatively rare (EAF 5%), the impact of EE on the bias is weaker than Ee , whereas the opposite is true when the effect allele is common (EAF 40%). This is expected given the expected frequency of homozygotes EE is the square of the EAF under HWE. 2.4.2. Genetic interaction under the limiting pathway model In addition to gene environment interaction, another nonlinear effect is also of interest: genegeneinteraction. Aspecificgeneticinteraction, thelimitingpathway (LP) model, was described in Zuk et al. (2012). However, they only investigated the bias for twin study design and parent-offspring regression. We extend the analysis to our variance components framework, and expect that a similar pattern should be observed. [Methods] Given that variance components model is optimized using either maximum likeli- hood or restricted maximum likelihood, a closed-form expression of the estimated h 2 cannot be derived as what was done in Zuk et al. (2012) for the other two designs. We investigated the problem by simulations using the same settings as in Zuk et al. (2012). According to the LP model for quantitative traits, the observed phenotype P is themaximumofk independentpathwaysvariables,Z i ,inwhichP = max(Z 1 ;Z 2 ;:::;Z k ). Z i variables are i.i.d. Gaussian (for parents), with each being the sum of genetic (A), 53 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL common environmental (C) and residual environmental (E) components, with re- spective variances h 2 pathway , c R (1h 2 pathway ) and (1c R )(1h 2 pathway ). Here c R is the fraction of environmental variance shared by the same household, h 2 pathway is the heritability of each pathway. We set the total variance of each pathway to be 1, thus the additive variance of each pathway 2 A =h 2 pathway . More specifically, for each of the k pathways, we sampled pathway variables A i (p1)andA i (p2)forthetwoparentsfromi.i.d. N(0; 2 A ), andthensampledpathway variablesA i (o)forthechildfromN((A 1 +A 2 )=2; 2 A =2). Wealsosampledthecommon householdenvironmentcomponentC i fromi.i.d. N(0;c R (1h 2 pathway ))foreachfamily and residual environment component E i from i.i.d. N(0; (1c R )(1h 2 pathway )) for each individual. We then had Z i =A i +C i +E i for each individual, and then took P = max(Z 1 ;Z 2 ;:::;Z k ). For the purpose of comparison, we only examined scenarios with parameters k in (1, 2, 3, 5, 10), cr in (0, 0.25, 0.75), h 2 pathway in (0.1, 0.5, 0.9). Simulation was repli- cated for 1,000 times and 10,000 trios were simulated for each. Simulated data were analyzed using an R package ‘pedigreemm’, which is described in detail in Chapter 3. We specified a common environment component in the analysis model, and max- imum likelihood criterion was used for optimization. [Results] Results are shown in the last column of Table 2.7. Other information are taken from the supplementary Table 7 of Zuk et al. (2012) for comparison. 54 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Table 2.7: Estimated heritability under the LP model with different parameters k h 2 path (%) c R (%) h 2 all (%) a h 2(ACE) pop (%) a h 2(VC) pop (%) b 1 10 0 10 10 9.27 1.85 1 50 0 50 50 49.38 1.49 1 90 0 90 90 89.78 1.02 1 10 25 10 10 10.09 1.95 1 50 25 50 50 49.97 1.74 1 90 25 90 90 91.87 2.33 1 10 75 10 10 10.01 0.84 1 50 75 50 50 50.03 1.25 1 90 75 90 90 92.84 4.87 2 10 0 7.3 7.7 6.63 1.83 2 50 0 36.7 45.7 38.96 1.58 2 90 0 66 97.8 75.17 1.24 2 10 25 7.3 8.8 8.56 2.61 2 50 25 36.7 48.9 42.56 1.82 2 90 25 66 99.4 76.74 1.56 2 10 75 7.3 11.3 11.02 1.08 2 50 75 36.7 56.6 49.00 1.46 2 90 75 66 100 79.02 1.85 3 10 0 6 6.4 5.28 1.79 3 50 0 29.8 42.8 33.28 1.61 3 90 0 53.6 100 66.83 1.40 3 10 25 6 8 7.67 2.08 3 50 25 29.8 47.9 38.18 1.92 3 90 25 53.6 100 68.92 1.65 3 10 75 6 12 11.40 1.17 3 50 75 29.8 60.3 47.84 1.43 3 90 75 53.6 100 72.24 1.58 5 10 0 4.5 5 4.00 1.78 5 50 0 22.3 39 26.67 1.73 55 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL Table 2.7 cont.: Estimated heritability under the LP model with different parameters k h 2 path (%) c R (%) h 2 all (%) a h 2(ACE) pop (%) a h 2(VC) pop (%) b 5 90 0 40.2 100 56.93 1.49 5 10 25 4.5 7 6.61 2.16 5 50 25 22.3 46 32.81 2.02 5 90 25 40.2 100 59.55 1.72 5 10 75 4.5 12.8 11.98 1.28 5 50 75 22.3 64.4 46.09 1.61 5 90 75 40.2 100 64.06 1.77 10 10 0 2.9 3.5 2.47 1.61 10 50 0 14.5 33.7 19.35 1.67 10 90 0 26.1 100 44.93 1.54 10 10 25 2.9 5.8 5.19 2.23 10 50 25 14.5 42.8 26.31 2.18 10 90 25 26.1 100 48.00 1.79 10 10 75 2.9 13.6 12.41 1.44 10 50 75 14.5 68.9 43.03 1.78 10 90 75 26.1 100 53.60 2.09 a Took from Supp. Table 7 of Zuk et al. (2012). b Mean standard error for our simulation result. When there is no genetic interaction (k = 1), h 2(VC) pop is estimated without bias. When there is genetic interaction in the form of the LP model (k > 1), h 2(VC) pop is generally biased upwards, compared to the true values h 2 all , with the exception of cases whenh 2 path = 0:1 andc R = 0. Given the fact that most qualitative traits of in- terest in human genetics field have relatively highh 2 and usually have environmental components underlying them, we may argue that genetic interactions in the form of the LP model will introduce upward bias toh 2 estimates derived from using variance components approach. 56 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL On the other hand, the expected h 2(VC) pop is consistently lower than h 2(ACE) pop when k > 1. This means that h 2 estimate from using variance components approach is more likely to have less bias than that from twin studies, should there be a genetic interaction. However, the bias in h 2(VC) pop is still significant compared to h 2 all . 2.5. Discussion In section 2.2 we’ve shown that the estimated additive genetic variance will be inflated if the CHE variance is not correctly specified. In the analysis, we’ve covered a wide enough range of different combinations of parameters, and the results are consistent. Therefore, in a real data analysis, if the CHE variance should be included in the model yet is not, the inflation in the estimated additive variance is inevitable. We’ve also illustrated in both section 2.2 and 2.3 that the CHE variance and its interaction with variants jointly may bias the estimated additive variance more upwards compared to the effect caused by the CHE variance solely. We’ve also shown thattheinteractionitselfmaycausetheinflationeveniftheCHEvarianceisspecified in the analysis model. In both scenarios, the inflation becomes more serious when absolute value of int is relatively big, whichever the direction is. In subsection 2.2.4 we observed situations where negative int may lead to under- estimated CHE variance. This happens when the modification of the CHE on the genetic effect has an opposite direction, which results in a masking effect. When this happens, the CHE or other types of shared environment may not be favored in the specified model. We will examine this question in more detail in Chapter 4. A question of interest is whether the additive variance is always biased upwards due to model misspecification. This is very important because we would like to know 57 CHAPTER 2. INFLATION IN ESTIMATING THE ADDITIVE VARIANCE UNDER A MISSPECIFIED MODEL if missing heritability can be introduced by other forms of model misspecification other than the specific one discussed in this dissertation. In subsection 2.2.4, we showed that the additive variance is generally biased upwards due to two other forms of model misspecification that are of most concern. We may argue that despite of other possible explanations to missing heritability, model misspecification is highly likely to be one of the most important sources. The main conclusion from this chapter is that ignoring the CHE and its interac- tion with genetic variants may bias the estimated additive genetic variance upwards and therefore inflate the estimated narrow-sense heritability. It provides a plausible explanation to the problem of missing heritability, although it may not be the only explanation. On the other hand, this explanation is not necessarily true for all com- plex disease related traits. The conclusion should be drawn based on the evaluation of each possible case in reality. 58 Chapter 3 Unbiased Estimation Of The Additive Vari- ance Under The Correct Model 3.1. Algorithms for estimating parameters under the correct model As has been shown in the previous chapter, the estimation of additive variance is biased upwards when the analysis model is misspecified. In this chapter I’ll show that the bias can be eliminated under the correct model. By ‘correct’, I mean the analysis model follows exactly how the data is simulated. For a real problem, it is impossible to know which model is correct, but we can always determine which model fits the data best. As a quote from Box and Draper (1987) says, “all models are wrong, but some are useful”. Model selection by doing hypothesis testing will be discussed in Chapter 4. In a correct analysis model, all components are specified as is shown in subsection 2.1.2, namely, in addition to the mean model where we specify an intercept and covariates including SNPs, we also have 2 A , 2 H , 2 e and int in the variance model. 59 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL Again, this model is built based on the variance components model, which is a special case of the linear mixed-effects models (LMM) framework. In this model, the random effects are independent of each other. In other words, the variance matrix W from equation (1.6) is diagonal with the variance components along the diagonal. Therefore, the variance matrix of the trait V = ZWZ T + R = M X i=1 2 i Z i Z T i + R (3.1) where M is the number of variance components. The method for estimating the parameters is still the maximum likelihood (ML). Usually we optimize the log like- lihood instead, since the derivatives are often easier to compute. The log likelihood corresponding to expression (2.3) is l(; A ; H ; e ; int ; y) = 1 2 logjVj 1 2 (y X) T V 1 (y X) (3.2) As maximizing the log likelihood does not have an analytical solution for complex problems, it will be solved numerically. Numerical optimization procedures generally include iterative updates of the parameters being estimated. There are a series of iterative approaches for solving ML problems, and two of them that have been widely used for LMM are the Newton-Raphson (NR) method and the expectation maximization (EM) algorithm. There are also a number of methods based on the NR method. For example, the Fisher scoring method replaces the hessian matrix in the parameter update steps with its expectation which is called the Fisher information matrix and is usually easier to compute (Searle et al., 1992, p. 295). There is also a group of quasi-Newton methods, in which the hessian matrix is approximated using 60 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL the first-order derivatives and therefore save computation when the hessian matrix is large. Both the NR method and the EM algorithm have their advantages and disad- vantages, respectively. For example, the NR method is generally faster to reach convergence compared to the EM algorithm; the hessian matrix can be obtained di- rectly from the last iteration of the NR method, from which the standard errors can be estimated, while the hessian matrix needs to be computed separately for the EM algorithm. On the other hand, the EM algorithm is guaranteed to converge while the NR algorithm is not (Lindstrom and Bates, 1988); the EM algorithm is more robust to its starting values than the NR method (Searle et al., 1992, p. 312). An- other advantage of the EM algorithm is that each iteration is guaranteed to go uphill on the likelihood surface till the maximum (Casella and Berger, 2002), whereas the NR algorithm requires a positive definite observed information matrix to keep the likelihood increasing (Battiti, 1992), in which the observed information is just the negative hessian. It often happens in an ill-conditioned problem that the information matrix is not positive definite or even singular. Researchershavebeentakingadvantagesofeachalgorithmsimultaneously. There are applications where the EM algorithm is used first to find a robust starting value and an NR-alike method then follows to reach converge quickly (Yang et al., 2011a; Zhou and Stephens, 2014). In this dissertation, we will also combine the two, ex- cept for that we apply the EM algorithm followed by a quasi-Newton method, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, to estimate the parameters (Broyden, 1970; Shanno, 1970). The most important reason for using the BFGS algorithm instead of a normal NR method or Fisher scoring is that the Fisher infor- mation matrix, which is the expected negative hessian, was found not always positive 61 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL definite in my experiments, whereas the BFGS algorithm guarantees the information matrix to be positive definite at its each update. Some other benefits of the BFGS algorithm are its simplicity of implementation in R and possibly faster convergence. The notations in this chapter follow subsection 2.1.2, and there is an assumption underlying the analysis model that the families are independent. As a result, both the log likelihood and the first derivatives can be calculated individually for each family and then added up, which can save a considerable amount of computation. 3.1.1. Implementation of the EM algorithm FortheEMalgorithm, thedatasetathandisusuallyreferredtoasthe‘incomplete data’. The ‘complete data’, which makes the problem easier to solve, is obtained by expectation. For the current problem, the incomplete data is the observed trait y, and the complete data is the combination of y, the unobserved random effects u and the interaction term int . The rationale is that if the random effects are observed, we can simply compute the maximum likelihood estimate (MLE) of the variance component corresponding to a random effect u i of length q i to be ^ 2 i = u T i u i =q i (3.3) Here we also assume it is a variance components model where all the random effects are independent. Following Searle et al. (1992), the conditional distribution of u i given y is u i jy N 2 i Z T i V 1 (y X); 2 i I q i 4 i Z T i V 1 Z i (3.4) 62 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL where I q i is identity matrix of dimension of q i . The expected value is then E(u i jy) = 2 i Z T i V 1 (y X) (3.5) We can substitute u T i u i in equation (3.3) with its expectation E(u T i u i jy) = 4 i (y X) T V 1 Z i Z T i V 1 (y X) +tr( 2 i I q i 4 i Z T i V 1 Z i ) = 4 i (y X) T V 1 Z i Z T i V 1 (y X) + 2 i q i 4 i tr(V 1 Z i Z T i ) (3.6) Here we used the fact that for a random vector x N(; V) and a compatible square matrix A E(x T Ax) = tr(AV) + T A (3.7) The Z i Z T i in equation (3.6) can be then replaced by the coefficient matrices of variance components, namely, K, B and I, for the calculation of ^ 2 A , ^ 2 H and ^ 2 e , respectively. Besides, the fixed effect can be estimated as ^ = (X T X) 1 X T (y M1 X i=1 Z i u i ) (3.8) if we let the M th variance component be the residual variance. The estimation of int actually uses Fisher scoring instead of the EM algorithm, because it requires the inversion of the random effects, which is a nonlinear function and therefore is unable to compute the expectation. The algorithm is described in detail as follows. 63 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL Step 0. Decide starting values. The ordinary least square (OLS) estimates (0) = (X T X) 1 X T y are used for the fixed effects and the total phenotypic variance divided by the number of variance components are used for each random effect. As we don’t know the direction of interaction, we use both (0) int =1 and (0) int = 1 as starting values. Set the number of iteration m = 0. Step 1 (E step). Calculate the conditional expected values of the sufficient statistics for fixed and random effects. For the m th iteration, , let t (m) i = E(u T i u i jy)j = (m) ; 2 = 2 (m) = 4(m) i (y X (m) ) T (V (m) ) 1 Z i Z T i (V (m) ) 1 (y X (m) ) + 2(m) i q i 4(m) i tr (V (m) ) 1 Z i Z T i (3.9) and by equation (3.5) s (m) = E(y M1 X i=1 Z i u i )j = (m) ; 2 = 2 (m) = y M1 X i=1 2(m) i Z i Z T i (V (m) ) 1 (y X (m) ) = y (V (m) I 2(m) e )(V (m) ) 1 (y X (m) ) = X (m) + 2(m) e (V (m) ) 1 (y X (m) ) (3.10) Step 2 (M step). Maximization of likelihood and estimate parameters of both fixed and random effects. Following (3.6) and (3.8), 2(m+1) i =t (m) i =q i (3.11) 64 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL and (m+1) = (X T X) 1 X T s (m) (3.12) Step 3. Update int . For this purpose we need the first and second derivatives of the log likelihood (3.2) with respect to int . We have S (m) = @l @ (m) int = 1 2 tr (V (m) ) 1 @V (m) @ (m) int ! + 1 2 (y X (m) ) T (V (m) ) 1 @V (m) @ (m) int (V (m) ) 1 (y X (m) ) (3.13) where @V (m) =@ (m) int , similar to matrix B in subsection 2.1.2, is a matrix of blocks of x t (1 + (m) int x t ) T + (1 + (m) int x t )x T t along the diagonal, assuming the interaction is between the t th SNP and the common household environment (CHE). The second derivative is @ 2 l @ (m)2 int = 1 2 tr (V (m) ) 1 @V (m) @ (m) int (V (m) ) 1 @V (m) @ (m) int + (V (m) ) 1 @ 2 V @ (m)2 int ! (y X (m) ) T (V (m) ) 1 @V (m) @ (m) int (V (m) ) 1 @V (m) @ (m) int (V (m) ) 1 (y X (m) ) + 1 2 (y X (m) ) T (V (m) ) 1 @ 2 V @ (m)2 int (V (m) ) 1 ! (y X (m) ) (3.14) 65 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL Using the relationship (3.7), the information can now be computed as I (m) = E @ 2 l @ (m)2 int ! = 1 2 tr (V (m) ) 1 @V (m) @ (m) int (V (m) ) 1 @V (m) @ (m) int + (V (m) ) 1 @ 2 V @ (m)2 int ! +tr (V (m) ) 1 @V (m) @ (m) int (V (m) ) 1 @V (m) @ (m) int (V (m) ) 1 V (m) ! 1 2 tr (V (m) ) 1 @ 2 V @ (m)2 int (V (m) ) 1 V (m) ! = 1 2 tr (V (m) ) 1 @V (m) @ (m) int (V (m) ) 1 @V (m) @ (m) int ! (3.15) We then update (m) int by (m+1) int = (m) int + S (m) I (m) (3.16) Step 4. Increasem by 1 and repeat from step 1 until the maximum number of iterations is reached. Since we have int = 1 or -1 as starting values, after the final iteration, we select parameters giving the higher likelihood. Since the EM algorithm is used to identify a good staring value, the number of iterations is set to be only 5. After all iterations complete, the current parameter valuesareusedasstartingvaluesforthenextoptimizationprocedureusingtheBFGS algorithm. 66 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL 3.1.2. Implementation of the BFGS algorithm The BFGS algorithm was independently developed by Broyden, Fletcher, Gold- farb and Shanno in 1970. The algorithm approximates the hessian matrix rather than directly calculating it, and updates the approximation at each iteration us- ing its current value and the first derivatives. There is an implementation of the algorithm in R’s function constrOptim(), and it was used for this dissertation. For quasi-Newton methods, only first derivatives are needed. For fixed effect vector we have @l @ = X T V 1 (y X) (3.17) And for either 2 or int @l @ = 1 2 tr V 1 @V @ + 1 2 (y X) T V 1 @V @ V 1 (y X) (3.18) where in the above equation can be replaced by 2 A , 2 H , 2 e or int . We already have @l=@ int from equation (3.13), and for the other three, we have @V @ 2 A = K; @V @ 2 H = B; @V @ 2 e = I from subsection 2.1.2. Now we provide the R function constrOptim() with the above derivatives, and constrain the variance components to be positive. Then we run the optimization till convergence to get the final estimates of the parameters. The methods described in this section have been written into R functions. 67 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL 3.2. Test the algorithm via simulations The main goal of this subsection is to examine the estimation performance of the proposed method in terms of bias, precision and computation time, under a wide range of parameter combinations. 3.2.1. Unbiased estimation of the parameters [Methods] Data were simulated exactly as described in section 2.3, and were then analyzed using the R implementation of the proposed methods. In addition, we also repeated the analysis with different minor allele frequencies (MAF) of the interaction SNP at 1%, 5%, 10%, 20% (the original setting), 30%, 40%. Standard error of each parame- ter estimate was calculated as the standard deviation of the estimates from the 1000 replications. Meanwhile, for each replication, we calculated the estimated standard error as the square roots of the negative approximated hessian after the optimization finished. The means and standard deviations of these estimated standard errors were also calculated for comparison. [Results] Figure 3.1 to 3.4 show that generally with all parameter settings, the variance components and the interaction effect size int can be estimated without bias. The only possible exception is that when any of the variance components is 10 which is small compared to the other two, its mean estimate is not close to the truth. This problem also occurs for int when 2 H = 10. However, the corresponding standard 68 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value 0 1 2 3 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated σ A 2 True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 and β int specified Figure 3.1: The estimated additive variance 2 A when all of 2 A , 2 H , 2 e and int are specified. Plotted are means standard errors. The dashed line corresponds to the true values where the relative bias is 1. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value 0 1 2 3 4 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated σ H 2 True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 and β int specified Figure 3.2: The estimated CHE variance 2 H when all of 2 A , 2 H , 2 e and int are specified. Plotted are means standard errors. The dashed line corresponds to the true values where the relative bias is 1. 69 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value 0 1 2 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated σ e 2 True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 and β int specified Figure 3.3: The estimated residual error variance 2 e when all of 2 A , 2 H , 2 e and int are specified. Plotted are means standard errors. The dashed line corresponds to the true values where the relative bias is 1. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● True value −1 0 1 2 3 −5 −2 −1 1 2 5 Beta of interaction β int Relative bias of estimated β int True model settings ● ● ● ● σ A 2 :100 σ H 2 :100 σ e 2 :100 σ A 2 :10 σ H 2 :100 σ e 2 :100 σ A 2 :100 σ H 2 :10 σ e 2 :100 σ A 2 :100 σ H 2 :100 σ e 2 :10 σ H 2 and β int specified Figure 3.4: The estimated effect size of interaction int when all of 2 A , 2 H , 2 e and int are specified. Plotted are means standard errors. The dashed line corresponds to the true values where the relative bias is 1. 70 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL errorsforthesesituationsarehuge,whichindicatesthattheestimatesarenotreliable. In fact, if any of the variance components is too small, other models will be favored over such models using model selection approaches. This will be further illustrated in Chapter 4. The simulation results of different MAF are shown in Table 3.1. We see that under all conditions, the variance components were estimated without bias, and the standard errors are not affected by the MAF of the interaction SNP. However when the MAF is 1%, we see both the estimation accuracy and precision are bad for the interaction effect size int and the marginal fixed effect 1 of the interaction SNP. Moreover, the performance is even worse then the magnitude of the interaction, or j int jislarger(5). AnotherproblemoccurredduringthesimulationswhentheMAF was 1%. Since the SNP was rare, the system was singular for certain replication of thesimulationbychance, andtherefore int wasnotsolvable. Althoughlargersample size can reduce the chance of encountering a singular problem, it is generally not a good idea to use this method to analyze rare SNPs. 71 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL Table 3.1: The impact of the minor allele frequency (MAF) of the interaction SNP on the estimation VC a True Estimated parameter values (mean sd), with SNP MAF (%) int 1 5 10 20 30 40 2 A -5 98:21 28:89 96:21 31:69 98:21 28:89 97:48 27:87 96:91 28:03 97:70 28:95 -2 98:15 33:13 98:78 29:69 98:02 27:30 98:44 25:62 98:06 25:81 98:14 25:34 -1 96:10 32:55 96:16 33:43 94:31 29:35 97:15 26:07 95:98 23:90 97:62 23:70 1 99:88 30:98 97:59 32:19 98:98 33:39 98:56 31:86 100:95 33:07 98:62 34:10 2 97:32 31:97 98:64 31:71 97:88 31:32 97:05 31:21 97:75 31:51 97:98 32:14 5 99:81 32:16 97:54 30:92 97:98 31:19 97:66 31:02 98:55 32:02 96:40 31:57 2 H -5 100:18 20:95 101:66 22:17 100:18 20:95 100:22 21:24 99:98 22:66 100:15 25:10 -2 101:50 22:38 100:56 22:41 100:65 21:60 99:77 23:08 100:38 23:78 99:45 26:16 -1 101:35 21:65 101:83 23:40 100:89 23:42 99:34 25:36 100:50 27:15 98:87 30:56 1 99:55 21:75 100:09 22:13 99:55 22:42 100:49 23:69 101:46 24:67 101:22 25:82 2 102:08 21:50 100:40 21:39 100:22 22:02 101:72 22:38 100:40 24:54 99:58 24:96 5 99:13 21:68 100:40 22:14 101:72 21:72 100:38 21:52 101:64 22:02 101:79 24:60 2 e -5 101:20 19:59 102:50 21:20 101:20 19:59 100:93 20:18 100:97 20:50 101:02 21:20 -2 100:39 20:93 100:36 19:77 100:26 19:13 100:42 19:49 101:20 19:56 100:50 19:68 -1 100:96 20:24 101:27 21:16 102:33 19:74 101:93 19:01 101:57 18:70 101:12 18:33 1 99:63 19:73 100:93 19:90 99:98 20:64 100:14 20:13 98:82 21:02 100:38 22:57 2 101:32 20:00 100:54 20:28 100:81 20:71 100:93 20:94 100:89 20:78 101:07 21:79 5 99:75 20:41 100:95 20:41 100:66 21:33 100:63 20:49 99:94 22:24 101:49 21:80 a Variance components b Marginal effect of the simulated SNP 72 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL Table 3.1 cont.: The impact of the minor allele frequency (MAF) of the interaction SNP on the estimation VC a True Estimated parameter values (mean sd), with SNP MAF (%) int 1 5 10 20 30 40 int -5 5:00 0:48 4:96 0:61 5:00 0:48 5:02 0:46 5:06 0:49 5:06 0:54 -2 1:65 0:78 1:96 0:34 1:99 0:22 2:01 0:19 2:01 0:18 2:03 0:20 -1 0:92 0:56 0:99 0:31 0:98 0:25 0:99 0:17 1:00 0:13 1:00 0:14 1 0:54 1:05 0:96 0:37 1:01 0:29 1:02 0:24 1:01 0:24 1:02 0:24 2 1:39 1:42 1:99 0:47 2:01 0:37 2:01 0:33 2:03 0:34 2:05 0:36 5 3:94 2:76 4:96 0:83 4:98 0:70 5:04 0:64 5:04 0:64 5:06 0:68 1 b -5 1:75 4:69 1:43 6:22 1:75 4:69 1:42 4:03 1:65 3:76 1:51 3:72 -2 1:41 6:27 1:56 2:79 1:61 2:18 1:48 1:82 1:44 1:66 1:49 1:73 -1 1:44 4:96 1:51 2:10 1:54 1:64 1:46 1:30 1:56 1:18 1:48 1:23 1 1:50 8:08 1:53 3:06 1:50 2:28 1:59 1:66 1:50 1:46 1:51 1:36 2 1:47 10:22 1:49 4:43 1:45 3:05 1:54 2:23 1:52 2:00 1:38 1:91 5 1:37 18:44 1:57 8:10 2:00 5:97 1:40 4:64 1:29 4:15 1:40 3:97 a Variance components b Marginal effect of the simulated SNP 73 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL ● ● ● ● ● ● ● ● ● ● ● ● 1.0 1.1 1.2 1.3 0.9 1.0 1.1 1.2 1.3 SD of estimate Estimated SE β int ● ● −5 −2 −1 1 2 5 MAF ● ● ● ● ● ● 0.01 0.05 0.1 0.2 0.3 0.4 Intercept (a) ● ● ● ● ● ● ● ● ● ● ● ● 5 10 15 20 5 10 15 SD of estimate Estimated SE β int ● ● −5 −2 −1 1 2 5 MAF ● ● ● ● ● ● 0.01 0.05 0.1 0.2 0.3 0.4 Slope (b) ● ● ● ● ● ● ● ● ● ● ● ● 24 28 32 36 25.0 27.5 30.0 32.5 SD of estimate Estimated SE β int ● ● −5 −2 −1 1 2 5 MAF ● ● ● ● ● ● 0.01 0.05 0.1 0.2 0.3 0.4 Additive variance (c) ● ● ● ● ● ● ● ● ● ● ● ● 20 25 30 22.5 25.0 27.5 30.0 SD of estimate Estimated SE β int ● ● −5 −2 −1 1 2 5 MAF ● ● ● ● ● ● 0.01 0.05 0.1 0.2 0.3 0.4 CHE variance (d) Figure3.5: Comparisonbetweenthetrueandestimatedstandarderrors. Plotted are means standard deviations of the estimated standard errors against the standard error calculated as the standard deviation of the estimated parameter over 1000 replica- tions. The black line is the 45-degree reference line for y=x. 74 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL ● ● ● ● ● ● ● ● ● ● ● ● 18 20 22 24 19 20 21 22 SD of estimate Estimated SE β int ● ● −5 −2 −1 1 2 5 MAF ● ● ● ● ● ● 0.01 0.05 0.1 0.2 0.3 0.4 Residual variance (a) ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.5 1.0 1.5 0 1 2 SD of estimate Estimated SE β int ● ● −5 −2 −1 1 2 5 MAF ● ● ● ● ● ● 0.01 0.05 0.1 0.2 0.3 0.4 Interaction effect size (b) Figure 3.5 cont.: Comparison between the true and estimated standard errors. Figure 3.5 shows that the estimated standard errors were generally consistent with the standard deviations of the parameter estimates from 1000 replications. The only exception is the interaction effect size int when the SNP MAF p = 0:01 and int > 0, where the estimated standard errors significantly deviate from the 45 degree reference line, and this deviation gets worse as int increases. This further illustrates the idea that the performance of the method deteriorates when the interaction SNP is rare. It is also of note that figure 3.5 confirmed the findings in Table 3.1 that while the standard errors remain nearly the same for the variance components whatever the int or MAF is, the estimated slope 1 and interaction effect size int had larger standard errors if the interaction SNP is more rare or the true int has a big positive value. In contrast, the standard errors for the intercept slightly increases as the interaction SNP gets more common. However, in a real application, we may still 75 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL favor common instead of rare SNPs, because the marginal and interaction effect of the SNP is of more interested than the intercept, and they are more sensitive to the change of the MAF. There were very few cases (8 out of 36,000 replications all together in this in- vestigation) where the approximated hessian had positive values on the diagonal for the additive variance, and therefore the standard error could not be computed. This problem occurred when the estimated additive variance hit the low boundary, i.e. 0, and the information matrix was no longer positive definite. This problem diminished as the sample size increased, which will be examined in subsection 3.2.2. The computation took around 17.8 seconds to analyze 600 individuals on a Mac- book Pro with a 2.53 GHz Intel Core 2 Duo CPU. It is relatively slow because the code was not fully optimized and was written in R. Currently an R implementation is sufficient given our focus is characterizing the new method. The simulations were therefore executed on a high performance computing cluster to take advantage of parallelized computing to save time. We conclude that the proposed estimation method is able to estimate parameters without bias and generally the estimates have good precision given a reasonable sample size. However, care must be taken when interpreting either the marginal or the interaction effect of the interaction SNP if it is rare or if the interaction effect is large. 76 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL 3.2.2. The impact of the sample size and the number of chil- dren per family on standard errors [Methods] We here examine the impact of the number of children per family on the estima- tion precision. Families were simulated with 1, 2 or 3 children, while the total sample sizes were kept the same. More specifically, we compared results with a sample size starting from 600 to 3600 with an increment of 600. As a result, there were 200 to 1200 families with 1 child, 150 to 900 with 2 children, 120 to 720 with 3 children. For this comparison, we set the MAF of the SNP to be 0.2 and the marginal effect size to be 1.5. The three variance components are set as 2 A = 2 H = 2 e = 100 and the int = 1, respectively. Simulation for each setting was repeated 1000 times. Standard error of each parameter estimate was calculated as the standard deviation of the estimates from the 1000 simulations. [Results] The results are shown in Figure 3.6. 77 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1.0 −0.5 0.0 0.5 1.0 600 1200 1800 2400 3000 3600 Sample size Estimated intercept Per family ● ● ● 1 child 2 children 3 children Intercept (a) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 1.0 2.0 3.0 1.5 600 1200 1800 2400 3000 3600 Sample size Estimated β 1 Per family ● ● ● 1 child 2 children 3 children Slope (b) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 60 80 100 120 600 1200 1800 2400 3000 3600 Sample size Estimated σ A 2 Per family ● ● ● 1 child 2 children 3 children Additive Variance (c) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 80 90 100 110 120 130 600 1200 1800 2400 3000 3600 Sample size Estimated σ H 2 Per family ● ● ● 1 child 2 children 3 children CHE Variance (d) Figure 3.6: Parameter estimation with different sample sizes and numbers of children per family. Plotted are means standard errors. Dashed lines show the true parameter values. Note the y-axes do not necessarily start from 0. 78 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 80 90 100 110 120 600 1200 1800 2400 3000 3600 Sample size Estimated σ e 2 Per family ● ● ● 1 child 2 children 3 children Residual Variance (a) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.8 0.9 1.0 1.1 1.2 1.3 600 1200 1800 2400 3000 3600 Sample size Estimated β int Per family ● ● ● 1 child 2 children 3 children Beta of Interaction (b) Figure 3.6 cont.: Parameter estimation with different sample sizes and numbers of children per family. Plotted are means standard errors. Dashed lines show the true parameter values. Note the y-axes do not necessarily start from 0. The estimated parameters are all unbiased under every setting. The standard errors decrease as the sample sizes increase, and the trend of decreasing slows down or nearly stops as the sample size reaches 2400. It is of note that the structure of nuclear families does not have a significant impact on standard errors, as long as the sample sizes are the same. Except for the residual variance, families with one child or trios seem to have the lowest standard errors, despite of the slight differences. As a result, it seems appropriate to use trio as the analysis unit. It also implies that we can combine families with different numbers of children in a real application to gain most information out of the data. In subsection 3.2.1, we observed cases where the information matrix was not positive definite. In the simulations here, we also had 1 such case out of 3000 repli- cations with a sample size of 600 individuals. For those with a sample size equal to 79 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL of greater than 1200, we did not see a single case out of the 15000 replications with the aforementioned problem. Although this is not a proof, it does imply that we need relatively large sample size to stabilize the estimation and increase its accuracy. 3.3. Relationshiptolinearmixed-effectsmodelmethod As was described, the proposed estimation approach is based on LMM. The variance components correspond to random effects in LMM. In the standard LMM framework, models that incorporate random slopes are used to capture cross-level interactions (Snijders, Tom AB and Bosker, 1999). We can show that our proposed method is equivalent to an LMM with both random intercept and random slope under the constraint of perfect correlation between these two random effects. Given the structure of the CHE coefficient matrix in subsection 2.1.2, we can show that for a single trio H = 2 6 6 6 6 4 1 int x 1 1 int x 2 1 int x 3 3 7 7 7 7 5 2 6 4 1 1 1 int x 1 int x 2 int x 3 3 7 5 2 H = 2 6 6 6 6 4 1 1 1 1 1 1 1 1 1 3 7 7 7 7 5 2 H + 2 6 6 6 6 4 x 1 x 1 x 1 x 2 x 2 x 2 x 3 x 3 x 3 3 7 7 7 7 5 int 2 H + 2 6 6 6 6 4 x 1 x 2 x 3 x 1 x 2 x 3 x 1 x 2 x 3 3 7 7 7 7 5 int 2 H + 2 6 6 6 6 4 x 1 x 1 x 1 x 2 x 1 x 3 x 2 x 1 x 2 x 2 x 2 x 3 x 3 x 1 x 3 x 2 x 3 x 3 3 7 7 7 7 5 2 int 2 H (3.19) 80 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL On the other hand, given an LMM with random intercept u 0 and slope u 1 as is in equation (1.6) y = X + Zu + e (3.20) where 2 6 4 u 0 u 1 3 7 5 N(0; W) with W = 2 6 4 2 0 0 1 0 1 2 1 3 7 5 (3.21) with being the correlation between the random intercept and slope. Similarly, for a single trio, the variance introduced by the CHE and its interaction with genetic variation is V H;int = ZWZ T = 2 6 6 6 6 4 1 x 1 1 x 2 1 x 3 3 7 7 7 7 5 2 6 4 2 0 0 1 0 1 2 1 3 7 5 2 6 4 1 1 1 x 1 x 2 x 3 3 7 5 = 2 6 6 6 6 4 1 1 1 1 1 1 1 1 1 3 7 7 7 7 5 2 0 + 2 6 6 6 6 4 x 1 x 1 x 1 x 2 x 2 x 2 x 3 x 3 x 3 3 7 7 7 7 5 0 1 + 2 6 6 6 6 4 x 1 x 2 x 3 x 1 x 2 x 3 x 1 x 2 x 3 3 7 7 7 7 5 0 1 + 2 6 6 6 6 4 x 1 x 1 x 1 x 2 x 1 x 3 x 2 x 1 x 2 x 2 x 2 x 3 x 3 x 1 x 3 x 2 x 3 x 3 3 7 7 7 7 5 2 1 (3.22) Wecanseethatequations(3.19)and(3.22)areequalundertheconditionsof H = 0 and int = 1 =( 0 ), in which = 1 if int > 0 and =1 if int < 0. Alternatively, 81 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL int can be computed from parameters estimated using the LMM method as int = 1 =( 0 ) when =1. = 1 essentially means that as a given family gets a high value on the random intercept — which is the CHE in our case — it also gets a big random slope. =1 is the opposite. Anything between -1 and 1 means that the random components are not always able to predict one another. However, when 6=1, the LMM method is not exactly the same as the proposed method. For a real case, it is more likely that a given SNP only has interaction with a proportion of the CHE, therefore being able to estimate the correlation between the random components is more appealing. There are multiple R packages for analysis based on LMM. For example, ‘nlme’ (Pinheiro et al., 2016) and ‘lme4’ (Bates et al., 2015) are two that have been most widely used. However, the structure of a kinship matrix does not directly fit into the framework of random effects, yet neither ‘nlme’ nor ‘lme4’ allows specification of a customized covariance matrix. Another R package ‘pedigreemm’ (Vazquez et al., 2010) was built upon ‘lme4’ and solved this problem, with this option specified in the estimation function pedigreemm() as following control=lmerControl(check.nobs.vs.nlev="ignore",check.nobs.vs.nRE="ignore") whichdisablesthedefaultbehavior‘lme4’ofcheckingnumberofsubjectsperlevel/group. We will examine parameter estimation using this package next. [Methods] Data were simulated as described in section 2.3, and were then analyzed using either the proposed new method or the R package ‘pedigreemm’ (v0.3-3), which we denoted as ‘beta of interaction’ (BI) approach and LMM, respectively. For the pur- 82 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL pose of comparison, we only examine one situation with the MAF of the interaction SNP being 20% and the beta of interaction being 1. The three variance components were all set to be 100. For a fair comparison, parameters were estimated using ML instead of restricted maximum likelihood (REML), which was the default setting in function pedigreemm(). The simulation was replicated for 1000 times. Standard error of each parameter estimate was calculated as the standard deviation of the estimates from the 1000 replications. [Results] The results were summarized in Table 3.2 and Figure 3.7. Table 3.2: Estimated parameters using standard LMM 2 A 2 e 2 e 1 a 98:11 31:44 101:19 22:28 99:75 20:56 1:56 1:70 a Marginal effect of the simulated SNP From Table 3.2 the estimates were all unbiased. Moreover, the means and stan- dard errors of the estimated parameters were similar to those values in Table 3.1 that corresponded to the same parameter setting. However, Figure 3.7 shows that for some replications the estimated parameters from using the two methods were not consistent. More specifically, when there was a difference, the estimated additive variance was always smaller from using ‘pedigreemm’ than from using the proposed new method in this chapter, while the estimated CHE variance larger. The dif- ferences in other parameters were either positive or negative. Also, the estimated correlation coefficients between the random components were not 1. On the other hand, when the estimates were the same from using both methods, the correlation 83 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL −2.5 0.0 2.5 5.0 7.5 −2.5 0.0 2.5 5.0 7.5 Estimates using BI method Estimates using LMM Slope (a) 0 50 100 150 200 0 50 100 150 200 Estimates using BI method Estimates using LMM Additive Variance (b) 50 100 150 200 40 80 120 160 Estimates using BI method Estimates using LMM CHE Variance (c) 50 75 100 125 150 50 75 100 125 150 Estimates using BI method Estimates using LMM Residual Variance (d) Figure 3.7: Comparison of estimated parameters between the two methods over 1000 replications. The blue dashed lines were the true parameter values. The red line is the 45-degree reference line for y=x. BI method, the new method based on beta of interaction. LMM, linear mixed-effects model implemented in the ‘pedigreemm’ package. 84 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL coefficients were all 1. Checking the maximized likelihood showed that the likelihood was consistently higher from using LMM than from using the BI method, whenever the estimated pa- rameters were different. This is expected because LMM did not put any constraint on the correlation between the random intercept and the random slope, and there- fore this model has one more free parameter than the BI method and is able to fit the simulated data better whenever it’s possible. This is a merit for a model with less constraints. However, it may have lower power than a restricted model if the constraint is reasonable. We will examine this in Chapter 4. Finally, it’s worth noting that ‘pedigreemm’ package ran much faster, since the base optimizer was a wrapper of Powell’s BOBYQA algorithm implemented in For- tran (Bates et al., 2015, 2014; Powell, 2009). 3.4. Choice of the reference allele and the interpreta- tion of the estimated variance components In the previous analyses, we analyzed the same SNP data as was simulated, e.g. we used the same reference allele as those in the simulation. However, in a real situation, there is usually no such information and the minor allele is chosen as the reference allele. Assuming the SNP of interest has alleles A and a. First, A is chosen as the reference allele, and the number of allelesA is the variablex as in an additive model. Then the variance coefficient matrix of the CHE has the form as shown in equation (3.19). Now the reference allele is changed to a, instead, and the number of alleles a 85 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL is denoted as y = 2x. For the BI method, we have (1 + int x)(1 + int x) T = (1 + int (2 y))(1 + int (2 y)) T = (1 + int 2 int y)(1 + int 2 int y) T = (2 int + 1) 2 (1 int 2 int + 1 y)(1 int 2 int + 1 y) T = (2 int + 1) 2 (1 + 0 int y)(1 + 0 int y) T (3.23) where 0 int = int 2 int +1 is the (estimated) effect size for the interaction when a is the reference allele. Because of the coefficient (2 int + 1) 2 in this equation, the estimated CHE variance 2 H 0 = 1 (2 int +1) 2 2 H , where 2 H is the estimated CHE variance whenA is the reference allele. In other words, we get two different estimated values of the CHE variance if different reference alleles are used. However, this will not change either the maximum likelihood or other estimated parameters. Although this is guaranteed by the equality of the two forms as is shown in equation (3.23), we examined it via simulations. The same rationale also applies to the LMM method. The equation (3.22) can be re-expressed as 86 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL V H;int = ZWZ T = 2 6 6 6 6 4 1 2y 1 1 2y 2 1 2y 3 3 7 7 7 7 5 2 6 4 2 0 0 1 0 1 2 1 3 7 5 2 6 4 1 1 1 2y 1 2y 2 2y 3 3 7 5 = 2 6 6 6 6 4 1 1 1 1 1 1 1 1 1 3 7 7 7 7 5 2 0 + 2 6 6 6 6 4 2y 1 2y 1 2y 1 2y 2 2y 2 2y 2 2y 3 2y 3 2y 3 3 7 7 7 7 5 0 1 + 2 6 6 6 6 4 2y 1 2y 2 2y 3 2y 1 2y 2 2y 3 2y 1 2y 2 2y 3 3 7 7 7 7 5 0 1 + 2 6 6 6 6 4 (2y 1 )(2y 1 ) (2y 1 )(2y 2 ) (2y 1 )(2y 3 ) (2y 2 )(2y 1 ) (2y 2 )(2y 2 ) (2y 2 )(2y 3 ) (2y 3 )(2y 1 ) (2y 3 )(2y 2 ) (2y 3 )(2y 3 ) 3 7 7 7 7 5 2 1 = 2 6 6 6 6 4 1 1 1 1 1 1 1 1 1 3 7 7 7 7 5 ( 2 0 + 4 2 1 + 4 0 1 ) + 2 6 6 6 6 4 y 1 y 1 y 1 y 2 y 2 y 2 y 3 y 3 y 3 3 7 7 7 7 5 ( 0 1 2 2 1 )+ 2 6 6 6 6 4 y 1 y 2 y 3 y 1 y 2 y 3 y 1 y 2 y 3 3 7 7 7 7 5 ( 0 1 2 2 1 ) + 2 6 6 6 6 4 y 1 y 1 y 1 y 2 y 1 y 3 y 2 y 1 y 2 y 2 y 2 y 3 y 3 y 1 y 3 y 2 y 3 y 3 3 7 7 7 7 5 2 1 (3.24) in which we let the parameters of using the a reference allele as 0 0 2 = 2 0 + 4 2 1 + 4 0 1 , 0 1 2 = 2 1 and 0 0 0 0 1 = 0 1 2 2 1 . Thiswasalsoexaminedviasimulations. 87 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL [Methods] Data were simulated the same way as were in section 3.4. In addition to analyz- ing the simulated SNP data (x), we examined the data recoded using the alternative allele (y = 2x). [Results] The results of the comparisons for the BI and LMM methods are shown in Figure 3.8 and 3.9. It can be seen that as expected, using different alleles does not change the estimated parameters in general. There are minor differences for few cases, which were likely caused by the optimization process. These results show that using either reference allele gives identical estimates. What is not shown in Figure 3.8 and 3.9 is the CHE variance 2 H or 2 0 . From previous derivation, using different reference allele leads to different estimated values of the CHE variance for both the BI method and the LMM method. This seeming conflict is actually a consequence of the model and is expected. In fact, by introduc- ing the interaction between a SNP and the CHE, we assumes the total phenotypic variance is not constant given different genotype values of the SNP. The estimated CHE variance has a corresponding reference allele as the baseline. Therefore, two reference alleles will lead to two different CHE estimates, simply because the base- line is different. A similar situation in a traditional regression setting is testing the interaction between a SNP with another explanatory variable. Choosing different reference allele has an impact on the effect size estimate. Given the property of non-constant variance, it is no longer appropriate to de- fine the heritability as h 2 = 2 A = 2 P . However, we can always directly compare the 88 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL −5.0 −2.5 0.0 2.5 −2.5 0.0 2.5 5.0 Estimates using reference allele 'A' Estimates using reference allele 'a' Slope (a) 0 50 100 150 200 0 50 100 150 200 Estimates using reference allele 'A' Estimates using reference allele 'a' Additive Variance (b) 50 100 150 50 100 150 Estimates using reference allele 'A' Estimates using reference allele 'a' Residual Variance (c) −0.40 −0.35 −0.30 −0.25 0.4 0.8 1.2 1.6 2.0 Estimates using reference allele 'A' Estimates using reference allele 'a' Beta of Interaction (d) Figure 3.8: Comparison of estimated parameters between using different ref- erence alleles: the BI method. The red line/curve is the expected relationship. BI method, the new method based on beta of interaction. In (d), the expected parameter relationship is y =x=(2x + 1), which gives rise the nonlinear curve. 89 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL −8 −4 0 4 −4 0 4 8 Estimates using reference allele 'A' Estimates using reference allele 'a' Slope (a) 0 50 100 150 200 0 50 100 150 200 Estimates using reference allele 'A' Estimates using reference allele 'a' Additive Variance (b) 50 100 150 50 100 150 Estimates using reference allele 'A' Estimates using reference allele 'a' Residual Variance (c) 100 200 300 100 200 300 Estimates using reference allele 'A' Estimates using reference allele 'a' Random Slope Variance (d) Figure 3.9: Comparison of estimated parameters between using different ref- erence alleles: the LMM method. The red line/curve is the expected relationship. LMM, linear mixed-effects model implemented in the ‘pedigreemm’ package. (d) shows the variance term 2 1 as is in equation (3.24). 90 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL estimated additive variance A across models. We can also calculate h 2 for different genotype values. On the other hand, we may use the estimated total variance from a model without the interaction as an approximated value, if we would really like a single estimate of h 2 . 3.5. Discussion In section 3.2, we showed via simulations that the parameter estimation under the correct analysis model is unbiased using the approach described in section 3.1. The conclusion was further consolidated by the supportive results in simulations testing a series of parameter settings. The standard errors of the estimated parameters were reasonably small when there were only 600 individuals in the analysis. As the sample size increased, the standard errors further decreased. We also illustrated the validity of using trios as our analysis unit, as well as the idea of combining nuclear families of different numbers of children in a real application. All the above results showed that the proposed method in this dissertation performed well in terms of both accuracy and precision. However, this method might not be able to preform well if the interaction SNP is rare, e.g. with a MAF1%. Thecurrentimplementationusesexpectedkinshipcoefficientsacrossfamilymem- bers according to their familiar relationship. If genome-wide SNP data is available, the realized kinship coefficients can then be estimated and therefore may lead to more accurate parameter estimates. However, we need to remove individuals related to two or more independent families, because an important assumption that makes the computation vastly more efficient is that all the families are independent. More specifically, both genetic and environmental factors have to be independent for the 91 CHAPTER 3. UNBIASED ESTIMATION OF THE ADDITIVE VARIANCE UNDER THE CORRECT MODEL families. With that said, we need to correct for it if any environmental factor is shared by the families. To further evaluate this method, we might ask what happens if the specified model does not truly reflect the underlying biological process — in other words, if the specified model is not the ‘true model’? Examples are situations where there are more than one CHE components and the SNP of interest only interacts with one of them, or if the mode of interaction is not multiplicative. Investigation of this problem is also related to model selection, so we will examine it in Chapter 4. This can be regarded as testing the robustness of the method. More importantly, as we’ve shown in subsection 3.3, this method is equivalent to a standard linear mixed-effects model with a constraint of perfect correlation between the random components. If we loosen this constraint, this model might be more robust and be able to fit the data better. This will also be examined in Chapter 4. Another implementation that might be worth doing is replace the ML criterion with REML. ML does not take into account the loss of degree of freedom from estimating fixed effects, and therefore will underestimate the variance components. REMLusesasetorerrorcontrasttoestimatefixedeffectsandanothersetforrandom effects, and the two sets are perpendicular to each other (Patterson and Thompson, 1971). By doing this, the random effects can be estimated without bias. Although the bias from using ML decreases as the sample size becomes larger, there might still be situations when REML is of good use. The REML framework for ordinary LMM is well-developed, but how to incorporate the interaction int still requires further examination. 92 Chapter 4 Hypothesis Testing And Data Application In the previous chapter, a new method has been proposed to estimate parameters without bias under the correct model. However, for a study with real data, no model is ‘correct’. We can only choose between models and determine which one fits the data best. One way of doing this is by hypothesis testing. Statisticians usually use asymptotic tests (a.k.a. large-sample tests) because of the simplicity in the calculations when the sample size approaches infinity. With the advance of new technologies, nowadays the cost of getting genetic data is decreasing and researchers have been collecting data with ever increasing sample sizes. Asymp- totic statistical methods are therefore suitable here to meet our need of formulating the hypothesis testing. In the statistical literature, the likelihood ratio test (LRT, Neyman and Pearson, 1928), the Wald test (Wald, 1943) and the score test (Radhakrishna Rao, 1948) are referred to as the Holy Trinity of asymptotic tests (Rao, 2005). In fact, they are first-orderasymptoticallyequivalent. TheLRTispredominantinlinearmixed-effects model (LMM) compared to the other two methods, given its relative simplicity in testing for random effects. In this dissertation, the hypothesis testing approach is 93 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION therefore based on the LRT, which is described in section 4.1. 4.1. Hypothesis testing 4.1.1. Beta of interaction method For the method proposed in Chapter 3, which we denote as the BI method (beta of interaction), there are two parameters of interest: the common household envi- ronment (CHE) variance 2 H and the interaction effect size int between the CHE and a SNP. The focus of the model selection here is to estimate the additive genetic variance 2 A without bias by specifying the model that best fit the data. Therefore, the null and alternative hypotheses are h 0 : 2 H = 0 and int = 0; h 1 : 2 H > 0 or int 6= 0 (4.1) Note 2 H > 0 in h 1 is one-sided. The CHE variance, similar to other variance com- ponents, is constrained to be non-negative. The three asymptotic tests mentioned previously were originally designed for two-sided alternative hypotheses. These tests are conservative for testing variance components and will lower the test power. Chernoff (1954) discussed the asymp- totic distribution of the likelihood ratio (LR) statistic when the parameter was on the boundary of both the null hypothesis parameter space ( 0 ) and the alternative hypothesis parameter space ( 1 ) but was a point inside = 0 + 1 . Self and Liang (1987) generalized the results to the case where the parameter was on the boundary of , which was then introduced to testing of variance components by Stram and Lee (1994). A key idea is that both 0 and can be approximated by cones C 0 94 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION C , respectively, with vertex both at 0 , which is the parameter under the null dis- tribution. If is regular with boundaries being the axes, the LR statistic simply follows a mixture of chi-square distributions with different degrees of freedom under the null. In our case the null distribution is a 50:50 mixture of 2 1 and 2 2 . Another question of interest is the significance of the interaction. The interaction is based on the presence of the CHE. Therefore, a natural thought is a two-step test where the CHE 2 H is tested first followed by the interaction int providing the first test is significant. However, this testing framework might have low power in certain situations. In subsection 2.2.4, we observed situations where 2 H was underestimated for certain negative int values, i.e. the modification of the CHE on the genetic effect had an opposite direction. When this happens, a stepwise test may fail to identify the CHE in the first step and thereby may lose power. This will be examined in our simulations. To test the interaction, the null and alternative hypotheses are h 0 : int = 0; h 1 : int 6= 0 (4.2) and here the models that we compare include the additive, the CHE and the residual variance. Here the null distribution of the LR statistic should be simply 2 1 . 4.1.2. Linear mixed-effects method As has been shown in section 3.3, a standard LMM with a random intercept and a random slope can also be used to estimate the parameters. Now in addition to the CHE variance 2 H for the random intercept, there is another variance term for the random slope and a covariance between these random components, which we denote 95 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION as 2 int and 12 , respectively. These random components follow the relationship as is shown in equations (3.20) and (3.21). We would also like to use an LRT to test the random components in a similar manner as in the BI method, with the null and alternative hypotheses being h 0 : 2 H = 0; 2 int = 0; and 2 12 = 0; h 1 : 2 H > 0 or 2 int > 0; withj 12 j q 2 H 2 int (4.3) Shouldtherebenoconstrainton 12 , theasymptoticnulldistributionisamixture of 2 1 , 2 2 and 2 3 with mixing probabilities of 1 4 , 1 2 and 1 4 . However, because of the constraint 2 12 H int inh 1 , the parameter space for ( 2 H ; 2 int ; 12 ) underh 1 forms a polar cone with nonlinear boundaries in the direction of 12 . Consequently, the null distribution is not simply a mixture of 2 ’s. Similar problems have been addressed in Han and Chang (2010). On the other hand, a 2-step testing framework does not have this problem. The null distribution of its each step is a regular mixture of 2 ’s: (1) compare models with or without a random intercept, with the null and alter- native hypotheses being h 0 : 2 H = 0; h 1 : 2 H > 0 (4.4) The null distribution is a 50:50 mixture of 2 0 and 2 1 . (2) ifh 0 in (1) is rejected, further compare models with or without a random slope in the presence of the random intercept, with the null and alternative hypotheses 96 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION being h 0 : 2 int = 0; and 12 = 0; h 1 : 2 int > 0; withj 12 j q 2 H 2 int (4.5) The null distribution is a 50:50 mixture of 2 1 and 2 2 . Intheprocessofmodelselection,ifthetestin(4.4)hasasignificantresult,wemay then use the test in (4.5) to examine if the interaction is also significant. However, when the interaction has an opposite direction from the CHE, the standard LMM can’t avoid the problem of losing power in this case either, with the same rationales as previously described. The omnibus test of all random components in (4.3) is therefore still used first, with the comprise of a conservative type I error rate. If the interaction is also of interest, we may then use the test in (4.5). To summarize the procedures, given a dataset, we first use the test in (4.4). If the result is not significant, we perform the omnibus test in (4.3) to make sure we don’t miss the signal due to lack of power to detect a negative interaction effect. If any of the aforementioned test has a significant result, we then use the test in (4.5) to examine the interaction. Given the tests being proposed, I will examine the type I error and the power via simulations. 4.1.3. Type 1 error Depending on the trait, the additive genetic variance 2 A , the CHE variance 2 H and the residual variance 2 e can account for different amount of the total pheno- typic variability. In real studies of glycemic traits or traits related to cardiovascular diseases, these components were observed to account for around 0.05-0.7, 0.05-0.12, 97 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION 0.25-0.7, respectively, in the proportions of the total variation after adjusting for other covariates (Freeman et al., 2002; Mitchell et al., 1996). [Methods] Data were simulated in similar manner as in sections 2.3 or 3.2. For the type 1 error () of the omnibus test in equation (4.1), only 2 A and 2 e were simulated and we compared the maximized log likelihood between model (a) that specified these two variance terms, and model (b) that specified 2 A , 2 H , 2 e and int . The log likelihood came from the R function described in Chapter 3 based on the proposed estimation method, after convergence is reached for each replication. The LR statistic was com- pared to distributions of 2 1 and 2 2 , and the corresponding upper-tail probabilities were divided by 2 and added to give the p value. The type 1 error rate was calculated as the frequency that we observed a significant statistic given certain nominal level, e.g. 0.05. It was assumed that all known covariates have been adjusted, and we examined 2 A , 2 e in the combinations of (1) 10, 90; (2) 40, 60; (3) 70, 30. The total variance was therefore 100. For each parameter setting, 200 trios were simulated and replicated for 1000 times. The type 1 error was also examined for the LMM method implemented in the ‘pedigreemm’ package. To test the hypothesis in equation (4.3), we kept the simu- lation settings being the same, and compared the maximized log likelihood between model (a) that specified 2 A , 2 e , and model (b) that specified 2 A , 2 H , 2 int , 12 and 2 e . The LR statistic was compared to the distributions of 2 1 , 2 2 and 2 3 , and the p value was the sum of the upper-tail probabilities weighted by 1 4 , 1 2 and 1 4 . To test the hypothesis in equations (4.5) and (4.2) for the effect of the interac- tion, we also simulated data with 2 A , 2 H , 2 e in the combinations of (1) 25, 5, 70; 98 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (2) 70,5 25; (3) 10, 20, 70; (4) 55, 20, 25. We then compared the maximized log likelihood between model (a) that specified 2 A , 2 H , 2 int , 2 e , and model (b) that specified 2 A , 2 H , 2 e together with either int as in the BI method or 2 int and 12 as in the LMM method. The LR statistic was compared to distributions of either 2 1 or 0:5 2 1 + 0:5 2 2 ,respectively. [Results] Table 4.1: Type 1 error rate ( level) given different variance components Test 2 A 2 e 2 H BI (%) LMM (%) Omnibus test 10 90 - 3.9 2.5 40 60 - 5.9 2.6 70 30 - 4.9 2.6 25 70 5 6.3 3.9 Test of 70 25 5 4.1 2.7 interaction 10 70 20 5.5 3.5 55 25 20 6.6 4.3 The type 1 error rate was kept at the nominal level (5%) for our BI method for both the omnibus test and the test of interaction. As expected, the type 1 error rate of the LMM method was too conservative (around 2.5%) for the omnibus test. However, it was also too conservative when testing the hypothesis in (4.5) with a small 2 H = 5. This deviation was not by chance. Another simulation with 10,000 replications with the setting 2 A = 25; 2 H = 5; 2 e = 70 showed an alpha level of 3.38% (compare to 3.9% in Table 4.1). The standard error of this estimate was approximately r 0:0338(1 0:0338) 10000 = 0:00181 99 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION And we have a statistic Z = 0:05 0:0338 0:00181 = 8:96 Therefore, it was significantly lower than 5% (P = 3:12 10 19 ). The reason for this deviation is that the null distribution of a mixture of 2 ’s is asymptotic, which is kept true only when the sample size is infinite. In fact, another simulation with 10,000 trios instead of 200 showed an alpha level of 4.39%, which is much closer to 5%, although the difference is still significant (P = 2:91 10 3 ) according to similar calculations as above. 4.1.4. Power [Methods] According to the variance component values from real studies from the previous subsection, we examined 2 A , 2 H , 2 e in the combinations of (1) 25, 5, 70; (2) 70,5 25; (3) 10, 20, 70; (4) 55, 20, 25. One SNP was simulated to have interaction with the CHE and the minor allele frequency (MAF) of the SNP was examined at 5%, 20% or 40%, because as has been shown in subsection 3.2.1, the method may not work well for rare SNPs. The effect size of interaction int was set to be between -5 and 5 at a step of 0.5. The likelihood ratio statistic and p value were calculated as described in the previous subsection. Power was calculated as the frequency that we observed a significant statistic among all the simulations. The level was set to be 0.05. For each parameter setting, 200 trios were simulated and replicated for 1000 times. To illustrate the idea that the omnibus test of 2 H and int has higher power than a stepwise test (denoted as the ‘SW’ method) when int is negative, we also examined a model with 2 A , 2 H , 2 e and compared it to the null model without 2 H . 100 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION This was actually the same hypothesis testing as in equation (4.4). Here we were actually comparing the power of detecting a deviation from pure additive genetic effect. Next, we calculated the power of detecting the interaction as the frequency of passing both the omnibus test and the test of interaction. Meanwhile, data was simulated the same way but was analyzed using the LMM method. Power of the two tests was also examined for the LMM method. [Results] Figure 4.1 shows the result of the BI method. There are several pieces of infor- mation from it. Let’s first focus on solid lines that represent the result of our BI method. First, when int = 0, the full model deteriorates to an ordinary variance compo- nents model with three variance terms. The test reaches the desirable level of 80% power when 2 H accounts for a considerable proportion of the total variance (c,d) but has low power when 2 H is small compared to other variance terms (a,b). Second, comparing (a) to (b) or (c) to (d), the power is not dramatically impacted by the relative magnitudes of 2 A and 2 e but rather by 2 H and int . When 2 H is small, a desirable power can be obtained only if there is significant interaction (largej int j); when 2 H is large, the test has sufficient power to detect interaction of small effect, e.g. int = 0:5. Third, the levels of power are not symmetric about int = 0. For example, in (a) the power is below 20% when int =1 but is around 80% when int = 1 for the setting of MAF=0.4. This happens because the coefficients of the CHE variance are in the form of (1 + int x ::t )(1 + int x ::t ) T as is described in subsection 2.1.2. Given the constant ‘1’ in the expression, the behavior of int cannot be symmetric about 101 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Beta of interaction β int Power method BI SW MAF ● ● ● 5% 20% 40% σ A 2 =25, σ H 2 =5, σ e 2 =70 (a) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Beta of interaction β int Power method BI SW MAF ● ● ● 5% 20% 40% σ A 2 =70, σ H 2 =5, σ e 2 =25 (b) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Beta of interaction β int Power method BI SW MAF ● ● ● 5% 20% 40% σ A 2 =10, σ H 2 =20, σ e 2 =70 (c) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Beta of interaction β int Power method BI SW MAF ● ● ● 5% 20% 40% σ A 2 =55, σ H 2 =20, σ e 2 =25 (d) Figure 4.1: Comparison of power between the BI method for the omnibus test and the SW method under various parameter settings. The horizontal dashed line is 80% power level and the vertical dashed line is where int = 0. BI, the ‘beta of interaction’ method. SW, the ‘stepwise’ method. 102 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION 0. Moreover, when int is -0.5 and -1, genotype data of 2 and 1, respectively, may lead to coefficients of 0, which will result in a smaller contribution of 2 H to the total variance. As a consequence, a simpler model with only 2 A and 2 e may has similar likelihood with the full model, which then leads to a low power to detect 2 H and int . This power loss is more serious if 2 H is small in the first place. In summary, the interaction can be detected only with a large effect size when int < 0. Finally, comparing the three solid lines in each subfigure, the more common the variant is, the high power the test has, only except for situations when int is between -1 and 0. As is expected, it is more likely for a subject to present the interaction effect with common variants than with rare variants, which makes the interaction effect easier to detect. The abnormal behavior when int is between -1 and 0 is due to reasons described in the last paragraph. NowwecomparetheresultsfromusingtheBImethodandfromusingthestepwise (SW) method. When int = 0, the SW method has more power than the BI method, although the difference is not dramatic. This is because the null distribution of the SW method has less degree of freedom compared to that of the BI method. However, when int 6= 0, the SW method generally has less power than the BI method, especially when the MAF is low and int < 0. In summary, this comparison shows the advantage of the BI method over the SW method. Figure 4.2 shows the result of the LMM method. As expected, the power curves of testing the 1 st hypothesis are very close to those of SW method in Figure 4.1. Therefore, the power of testing the 1 st hypothesis is low when 2 H is small, and the power gets lower if the MAF is low and/or int < 0. The ‘power’ of testing the 2 nd hypothesis when int = 0 is actually the type 1 error rate, and therefore it is around 5% in all 4 settings. The power is relatively low 103 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Beta of interaction β int Power MAF ● ● ● 5% 20% 40% method LMM SW σ A 2 =25, σ H 2 =5, σ e 2 =70 (a) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Beta of interaction β int Power MAF ● ● ● 5% 20% 40% method LMM SW σ A 2 =70, σ H 2 =5, σ e 2 =25 (b) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Beta of interaction β int Power MAF ● ● ● 5% 20% 40% method LMM SW σ A 2 =10, σ H 2 =20, σ e 2 =70 (c) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 −5 −4 −3 −2 −1 0 1 2 3 4 5 Beta of interaction β int Power MAF ● ● ● 5% 20% 40% method LMM SW σ A 2 =55, σ H 2 =20, σ e 2 =25 (d) Figure 4.2: Comparison of power between the LMM method for the omnibus test and the SW method under various parameter settings. The horizontal dashed line is 80% power level and the vertical dashed line is where int = 0. LMM, linear mixed-effects model method. SW, the ‘stepwise’ method. 104 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION as the effect size of the interaction is small, but quickly gets higher as the magnitude ofj int j increases. 4.1.5. Relationship between the sample size and power [Methods] Given results in the previous subsection, we picked two parameter settings to examine the impact of sample size on power: (1) int =1 (2) int = 1, and both are with 2 A = 25, 2 H = 5, 2 e = 70. The MAF of the SNP was examined at 5%, 20% or 40%. For each setting, 200 to 1200 trios were simulated at a step of 200. Power was calculated as was described in the previous subsection, and the level was kept at 0.05. Each simulation was replicated for 1000 times. [Results] ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 200 400 600 800 1000 1200 Number of trios Power MAF ● ● ● 5% 20% 40% σ A 2 =25, σ H 2 =5, σ e 2 =70, β int =−1 (a) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 200 400 600 800 1000 1200 Number of trios Power MAF ● ● ● 5% 20% 40% σ A 2 =25, σ H 2 =5, σ e 2 =70, β int =1 (b) Figure 4.3: The relationship between the sample size and power. The horizontal dashed line is 80% power level. 105 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION When int =1, power increases linearly as the sample size increases. However, even with 1200 trios, power is still at a low level of around 40%. Surprisingly, the three curves almost overlap completely, indicating that the MAF does not have much impact on power. When int = 1, on the contrary, power increases more rapidly as the sample size increases. Even for the case of 5% MAF, a 80% power is reached with 1200 trios. Moreover, MAF has a remarkable positive impact on power: the same 80% power can be reached with only 200 or 400 trios when MAF is 40% or 20%, respectively. 4.2. Application: The Framingham Heart Study Winston Churchill once said, ‘However beautiful the strategy, you should occa- sionally look at the results.’ Similarly, models are useless unless they can reflect the reality to a reasonable extent. We examined the method developed in this disserta- tion using data from real studies in the current chapter. The method requires both phenotype and genotype data from nuclear families. The Framingham Heart Study meets this requirement and has a relatively large sample size, which makes it a good dataset for our testing purposes. 4.2.1. General information about the study TheFraminghamHeartStudy(FHS)beganin1948withtherecruitmentofadults from the town of Framingham, Massachusetts. The objective of the study was to identify common factors contributing to cardiovascular disease (CVD) by following participants for 20 years with repeat examinations. Between 1948 and 1953, the study recruited 5,209 participants, known as the Original Cohort. Between 1971 106 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION and 1975 the study enrolled 5,124 second generation individuals from the original participants’ children and the spouses of these children, to participate in similar examinations. These individuals formed the Offspring Cohort. Between 2002 and 2005, the study enrolled 4,095 third generation individuals from the offspring of the second generation, which was referred to as the Generation 3 Cohort. Most Generation 3 participants had both parents enrolled in the Offspring Cohort. For those with only one parent enrolled, the other parent was invited to the study. As of July 2005 the study had recruited 103 New Offspring Spouses. Detailed description about the cohorts and recruitment procedures can be found from Cupples et al. (2009) or from the official website of the study. The Original Cohort underwent physical examination and laboratory tests every 2 years since 1948. The Offspring Cohort had similar examinations every 4 years. The Generation 3 Cohort and New Offspring Spouse Cohort had only completed two similar examinations in 2005 and 2011. In-depth genetic studies in the FHS cohorts began since 1990s. By that time, many Original Cohort individuals had died, and therefore researcher obtained DNA samples from less than 30% of this cohort. In 2007, the study established the FHS SHARe (SNP Health Association Resource) project, for which dense SNP genotyping was performed using Affymetrix chipsets (GeneChip ® Human Mapping 500k Array Set and the 50k Human Gene Focused Panel) in 10,775 samples from the three gener- ationsofparticipants. Afterqualitycontrol(QC)bytheAffymetrixcriteria, checking sex consistency and family structure, there were 9,274 participants in SHARe. Both the phenotype data and genotype data are kept and maintained by the National Center of Biotechnology Information (NCBI) database of Genotypes and Phenotypes (dbGaP, Mailman et al., 2007; Tryka et al., 2014). Access to individual- 107 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION level data has been granted for the purpose of this dissertation. 4.2.2. Methods [Data from the FHS] Nuclear families were selected from the three FHS generations to maximize the number of individuals included in the analyses with criteria of (1) had phenotype data, and (2) had genotype data that passed the original QC. Selected families were either parents from the Original Cohort with children from the Offspring Cohort or parents from the Offspring Cohort or New Offspring Cohort with children from the Generation 3 Cohort. There were several very large families with individuals from all three generation. When there was a tie in the number of individuals between an Original-Offspring combination and an Offspring-Generation3 combination, the latter was selected given the fact the majority of the nuclear families were in this combination. When there was a pair of twins, only one was included. The offspring of the twin being removed were also excluded. To make sure the selected nuclear families were independent, familiar relationship was inferred on the genetic data using KING v1.4 (Manichaikul et al., 2010). When there were individuals with inter-family kinship greater than 0.1, the corresponding family with smaller size was dropped. After the selection, there were 538 independent nuclear families with 2,427 individuals, in which 68 families were an Original-Offspring combination and 470 wereanOffspring-Generation3combination. Thewholesetof12,732individualswith available phenotype data were also used to compute theh 2 estimate as a comparison. Phenotype of interest were height, weight, body mass index (BMI) and fasting glucose. Necessary transformation were applied to reduce the skewness of the data: 108 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION natural logarithm for weight, reciprocal transformation for fasting glucose and BMI, notransformationforheight. Foreachphenotype, asummarystatisticwascomputed on the various numbers of records over the years of examinations: at maximum 30 records for the Original Cohort, 9 for the Offspring Cohort and 2 for the Generation 3 and New Offspring Spouse Cohorts. The median was computed for height, weight, fasting glucose, and the medians were highly correlated with the means (r>0.96) for all phenotypes. Then BMI was calculated using height and weight. In addition, a single observation was extracted from each participant to mimic a cross sectional study as a comparison to the analysis on the summary statistics. These were records from exam 26 (1999-2001) for the Original Cohort, exam 7 (1998-2001) for the Off- spring Cohort, the earliest available record of exam 1 (2002-2005) or 2 (2008-2011) for the Generation 3 and New Offspring Spouse Cohort. We refer to these phenotype data as “overlap" phenotype, given the time of records were almost overlapping. Phenotype data was from dbGaP and was supposed to had passed the original QC by the study. Additional QC was performed to remove outliers, e.g. unrealistic height/weight values. Individuals were labeled as diabetic if he/she had> 2 exams (Original/Offspring Cohorts) or> 1 exam (Generation 3/New Offspring Spouse Co- horts) with fasting glucose > 126 mg/dL for the median phenotypes, or if he/she had 1 exam with fasting glucose> 126 mg/dL for the overlap exams. Genotype data were from the Affymetrix 500k mapping array. SNPs of interest were those mapped to chromosome 1 to 22 and passed the original QC. These QC filters included call rate> 95% and HWE p> 10 6 . To prevent the association re- sult from being driven by extreme values for rare variants, we restricted the analysis on SNPs with an MAF > 5%. These filtering resulted in 353,988 SNPs. Individuals were included with criteria of call rate> 97%, no excess Mendelian errors (< 1000), 109 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION and heterozygosity between 5 standard deviations from the mean. [Analysis] Population substructure were modeled using ADMIXTURE v1.21 (Alexander et al., 2009). Data from HapMap Project (CEU, CHB, JPT, YRI and MEX, n=704) were combined as reference populations. Both the FHS data and HapMap data were confirmed to be aligned to hg18/NCBI build 36. SNPs were pruned to achieve approximate linkage equilibrium using PLINK v1.07 (Purcell et al., 2007), which led to 233,109 SNPs after the pruning. 4 subpopulations were identified based on the criterion of 5-fold cross-validation error. These inferred population proportions were then used as covariates in all tests for association if not specified otherwise. The linear mixed-effects model (LMM) implemented in the R package ‘pedi- greemm’ (v0.3-3) was used to test the proposed method in this dissertation, given its advantage in computation speed. Heritability (h 2 ) was calculated with and without the adjustment of covariates, including age, sex and population proportions. For fasting glucose, BMI was another covariate. For the median fasting glucose, indi- viduals labeled as diabetic were excluded, whereas for the “overlap" fasting glucose, individuals with the “overlap" fasting glucose > 126 mg/dL were excluded. If the phenotype being tested is the median, the median age and BMI were used as the covariates, whereas the “overlap" age and BMI were used for the “overlap" pheno- types. Then a model further including the common household environment (CHE) — a random intercept in the LMM framework — was estimated. These two models were compared using a likelihood ratio test with a null distribution of 1 2 2 0 : 1 2 2 1 . All these calculations were also repeated in the set of 12,732 FHS individuals as a comparison. 110 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION To evaluate the interaction between the CHE and SNPS, we first tested the CHE component by comparing between two models: (1) with additive variance; (2) plus the CHE variance based on (1). This is an approximated test to the test in 4.5, because SNPs were not included as a covariate. This was done to avoid the problem to multiple testing, with the assumption that the marginal effect of each SNP is small so including it or not would not have significant impact on the testing result. Then each SNP was then further added as a covariate, and we compared two models: (2) with additive variance and the CHE variance; (3) based on (2), plus the variance from the interaction between the SNP being evaluated and the CHE. We used the test framework described in 4.1.2 to compare the models. 4.2.3. Results The selected 538 nuclear families had sizes of between 3 and 9 individuals. The characteristics are summarized in Table 4.2, which have similar values to those of the whole FHS sample. The different proportions of individuals from each generation in the selected subset from the whole FHS sample is partly due to the fact that individuals from the Offspring Cohort could be selected both as offspring and as parents. Another reason is that only a less proportion of subjects from the Original Cohort were genotyped, compared to other cohorts. Depending on the cohort, each individualhaddifferentnumbersofexamrecordsforagivenphenotype. Forexample, individuals from the Original Cohort had an average of 16.9 records (range, 6 to 23) for height, whereas the number was 7.9 (range, 1 to 9), 1.8 (range, 1 to 2) and exact 1 for individuals from the Offspring, Generation 3, New Offspring Spouse cohorts, respectively. 111 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION Table 4.2: Characteristics of the subjects, stratified by generation Variable Mean(sd), nuclear families Mean(sd), all Original Cohort N (men%) 136 (50.0) 4,139 (47.2) Age, y 58.6 (5.3) 56.5 (9.6) Height, m 1.65 (0.10) 1.64 (0.09) Weight, kg 71.9 (14.8) 70.6 (13.0) BMI, kg/m 2 26.4 (4.0) 26.3 (4.1) Mean fasting glucose, mg/dL 91.2 (14.5) 89.7 (21.3) Offspring Cohort N (men%) 1,037 (50.5) 4,451 (48.5) Age, y 51.8 (8.8) 48.1 (12.9) Height, m 1.69 (0.09) 1.68 (0.09) Weight, kg 77.1 (16.1) 75.6 (16.1) BMI, kg/m 2 27.0 (4.6) 26.7 (4.7) Mean fasting glucose, mg/dL 100.0 (17.4) 100.9 (20.3) New OffspringSpouse Cohort N (men%) 55 (50.9) 101 (46.5) Age, y 51.1 (16.6) 53.3 (15.8) Height, m 1.67 (0.09) 1.66 (0.08) Weight, kg 78.6 (17.8) 78.1 (15.4) BMI, kg/m 2 28.0 (5.6) 28.3 (5.4) Mean fasting glucose, mg/dL 104.9 (20.4) 104.2 (17.5) Generation 3 Cohort N (men%) 1,199 (47.0) 4,041 (46.7) Age, y 35.8 (10.6) 37.0 (11.2) Height, m 1.71 (0.09) 1.70 (0.09) Weight, kg 79.6 (18.7) 79.8 (18.9) BMI, kg/m 2 27.2 (5.4) 27.4 (5.6) Mean fasting glucose, mg/dL 95.4 (18.9) 95.8 (17.0) 112 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION Table 4.3: Estimated h 2 using different models Trait ^ h 2 (n) a N ^ h 2 (o) b N ^ h 2 (all) c N Height - Crude 0.616 2,404 0.514 2,205 0.610 12,696 - Age, sex adjusted 0.792 2,404 0.757 2,205 0.776 12,696 - Fully adjusted d 0.781 2,404 0.742 2,205 0.839 8,579 Weight (log transformed) - Crude 0.506 2,404 0.409 2,258 0.455 12,696 - Age, sex adjusted 0.549 2,404 0.468 2,258 0.526 12,696 - Fully adjusted d 0.543 2,404 0.464 2,255 0.539 8,577 BMI (inverse) - Crude 0.441 2,404 0.335 2,205 0.453 12,694 - Age, sex adjusted 0.460 2,404 0.400 2,205 0.474 12,694 - Fully adjusted d 0.458 2,404 0.398 2,205 0.487 8,577 Fasting glucose (inverse) - Crude 0.240 2,425 0.073 2,217 0.246 12,685 - Age, sex adjusted 0.287 2,425 0.191 2,217 0.258 12,685 - Age, sex, BMI adjusted 0.253 2,402 0.170 2,189 0.219 12,646 - Fully adjusted e 0.255 2,402 0.169 2,189 0.314 8,574 Fasting glucose (inverse), non-diabetic - Crude 0.244 2,269 0.188 2,083 0.180 11,638 - Age, sex adjusted 0.308 2,269 0.280 2,083 0.199 11,638 - Age, sex, BMI adjusted 0.306 2,251 0.284 2,060 0.174 11,608 - Fully adjusted e 0.306 2,251 0.286 2,060 0.336 7,916 a Computed in selected nuclear families. b Computed in selected nuclear families, using the single exam data in the overlap- ping period. c All individuals with available data. d Adjusted for age, sex and the first 3 out of 4 population proportions. e Also adjusted for BMI. 113 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION A summary of heritability estimates of all phenotypes of interest using model (1) is shown in Table 4.3. The estimates range from the lowest of 0.306 for fasting glucose (non-diabetic individuals), to the highest of 0.781 for height. The estimates from selected nuclear families are similar to those estimated using all FHS subjects. These estimates are also consistent with previous findings (Brown et al., 2003; Liu et al., 2003). The estimates from using the “overlap” phenotypes are all smaller than their counterparts from using the summary statistics, which was also observed in Meigs et al. (2002). This could be due to the fact that the summary statistics are estimates of the true values and are less disturbed by random errors. For example, the estimated total phenotypic variance of BMI is 0.381 whereas the number for the “overlap” BMI is 0.422, and the corresponding additive variance estimates are 0.174 and 0.168, respectively. The results of including the CHE and its interaction with SNPs are summarized by each phenotype as following. [Height] The variance components andh 2 estimates using model (1) to (3) are summarized in Table 4.4. The test of model (2) against (1) was significant (LR statistic 46.3, P = 5:2 10 12 ), so we used the 2 nd test as in equation (4.5) of the stepwise test framework to detect the interaction. For the result of model (3), we selected SNPs with a test P value less than 1 10 5 as potential signals, which results in 10 SNPs. The results for median height from model (1) and (2) show that the inclusion of the CHE dramatically reduces theh 2 estimate. This indicates that a large proportion of variation in height can be explained by the environment (1.25/5.82=0.215) rather than by genetic factors. 114 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION Table 4.4: h 2 estimates of height using different models Trait Median height “Overlap" height Models a (1) (2) (3) b (1) (2) (3) b ^ 2 A 4.27 3.00 2.65 – 2.85 4.29 3.04 2.55 – 2.78 ^ 2 H - 1.25 1.10 – 1.95 - 1.20 1.14 – 2.25 ^ 2 int - - 0.68 – 1.79 - - 0.63 – 1.73 total ^ 2 5.47 5.82 5.82 c 5.52 5.84 5.84 c # SNPs - - 10 - - 10 ^ h 2 0.781 0.515 0.455 – 0.490 0.778 0.521 0.437 – 0.476 a All models adjusted for age, sex and population proportions. Model (1) and (2) did not include SNP as a covariate. b The range of model parameters for the selected SNPs. c Not calculated for model (3) because the total variance depends on genotype. Use that in model (2) as an approximation. For the result of model (3), we first look at the test results across all SNPs. The quantile-quantile (QQ) plot in Figure 4.4 (a) shows that the type 1 error has been well controlled. The manhattan plot in (b) shows that there are several potential signals with P < 1 10 5 . None of the identified SNPs were previously reported to be associated with any phenotypes or diseases. These SNPs are summarized in Table 4.8. Analyses on “overlap" height had similar results as shown in Figure 4.5. The test of model (2) against (1) was significant (LR statistic 36.9, P = 6:1 10 10 ), and we selected 10 SNPs withP < 1 10 5 for model (3). Although SNP rs2253804 reaches genome-wide significance (P = 3:2 10 8 ), it was not previously reported, neither was the nearest gene TMEM92. 115 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (a) (b) Figure 4.4: QQ plot and manhattan plot of testing the interaction between SNPs and the CHE for median height. Two lines in (b): top, 5 10 8 ; bottom, 1 10 5 . 116 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (a) (b) Figure 4.5: QQ plot and manhattan plot of testing the interaction between SNPs and the CHE for “overlap" height. Two lines in (b): top, 5 10 8 ; bottom, 1 10 5 . 117 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION [Weight] The variance components and h 2 estimates are summarized in Table 4.5. The test of model (2) against (1) was significant (LR statistic 6.27, P = 0:0062), so we used the 2 nd test as in equation (4.5) of the stepwise test framework to detect the interaction. For the result of model (3), we selected SNPs with a test P value less than 1 10 5 as potential signals, which results in 19 SNPs. Table 4.5: h 2 estimates of weight using different models Trait Median weight “Overlap" weight Models a (1) (2) (3) b (1) (2) (3) b ^ 2 A 1.80 1.35 1.09 – 1.39 1.68 1.25 0.99 – 1.21 ^ 2 H - 0.32 0 – 1.13 - 0.32 0.04 – 0.96 ^ 2 int - - 0.22 – 1.06 - - 0.25 – 1.21 total ^ 2 3.30 3.35 3.35 c 3.63 3.69 3.69 c # SNPs - - 19 - - 19 ^ h 2 0.545 0.404 0.324 – 0.416 0.463 0.339 0.269 – 0.329 a All models adjusted for age, sex and population proportions. Model (1) and (2) did not include SNP as a covariate. b The range of model parameters for the selected SNPs. c Not calculated for model (3) because the total variance depends on genotype. Use that in model (2) as an approximation. Theresultsfrommodel(1)and(2)showthatthereissomeproportionofvariation in weight that can be attributable to the environment (0.32/3.35=0.096), which reduces the h 2 estimate. For the result of model (3), the QQ plot in Figure 4.6 (a) shows that there is too much deviation from expected. An examination on the residuals as in Figure 4.8 shows that the residuals are skewed, which violates the assumption underlying the analysis model. 118 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (a) (b) Figure 4.6: QQ plot and manhattan plot of testing the interaction between SNPs and the CHE for median weight. Two lines in (b): top, 5 10 8 ; bottom, 1 10 5 . 119 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (a) (b) Figure 4.7: QQ plot and manhattan plot of testing the interaction between SNPs and the CHE for “overlap" weight. Two lines in (b): top, 5 10 8 ; bottom, 1 10 5 . 120 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION Figure 4.8: The distribution of the residuals of testing model (3) on a SNP. The blue curve is a normal density with zero mean and a standard deviation same as that of the residuals. The red line is x=0, which is the mean of the residuals. The manhattan plot in Figure 4.6 (b) shows that there are several potential signals with P < 1 10 5 , but due to the inflation seen from the QQ plot, it is likely that the majority, if not all, of these are false positive results. None of the SNPs were reported in the literature to be associated with any phenotypes or diseases. However, the nearby genes of several signals among the results have been reported to be associated with weight-related traits. For example, multiple SNPs in or near PACRG and DPYD were shown to be associated with BMI (Locke et al., 2015). Another SNP in ZNF385D was shown to be associated with waist-hip ratio under the interaction with recreational physical activity (Velez Edwards et al., 2013). The locus plots around these loci were generated using Locuszoom v1.1 (Pruim et al., 2010) and are shown in Figure 4.9. Analyses on “overlap" weight had similar results as shown in Figure 4.7. The 121 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION rs4709707 0 2 4 6 8 10 - log 10 (p−value) 0 20 40 60 80 100 Recombination rate (cM/Mb) ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● rs4709707 0.2 0.4 0.6 0.8 r 2 PACRG LOC285796 QKI 163.4 163.6 163.8 164 Position on chr6 (Mb) Plotted SNPs (a) rs10875089 0 2 4 6 8 10 - log 10 (p−value) 0 20 40 60 80 100 Recombination rate (cM/Mb) ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● rs10875089 0.2 0.4 0.6 0.8 r 2 DPYD 97.4 97.6 97.8 98 Position on chr1 (Mb) Plotted SNPs (b) Figure 4.9: Locus plot of testing the interaction between SNPs and the CHE for median weight. The LD is from 1000 Genomes CEU panel, build hg18. 122 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION rs7626779 0 2 4 6 8 10 - log 10 (p−value) 0 20 40 60 80 100 Recombination rate (cM/Mb) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● rs7626779 0.2 0.4 0.6 0.8 r 2 VENTXP7 ZNF385D 21.2 21.4 21.6 21.8 Position on chr3 (Mb) Plotted SNPs (a) Figure 4.9 cont.: Locus plot of testing the interaction between SNPs and the CHE for median weight. The LD is from 1000 Genomes CEU panel, build hg18. test of model (2) against (1) was significant (LR statistic 5.26, P = 0:011), and we selected 19 SNPs with P < 1 10 5 for model (3). It is of note that the same SNP rs7626779 located in ZNF385D remains as a signal (P = 7:8 10 7 ). [BMI] The variance components and h 2 estimates are summarized in Table 4.6. The test of model (2) against (1) was not significant (LR statistic 1.15, P = 0:14), so we switched to the omnibus test as in equation (4.3) to detect the deviation from a purely additive model followed by the test of interaction as in equation (4.5). 7 SNPs passes the omnibus test with P<0.05 and got P<1 10 5 in the test of interaction. The results from model (1) and (2) show that there is only a small proportion 123 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION Table 4.6: h 2 estimates of BMI using different models Trait Median BMI “Overlap" BMI Models a (1) (2) (3) b (1) (2) (3) b ^ 2 A 0.174 0.140 0.113 – 0.144 0.167 0.148 0.115 – 0.135 ^ 2 H - 0.024 5:14 10 5 – 0.112 - 0.013 0 – 0.113 ^ 2 int - - 0.024 – 0.104 - - 0.059 – 0.066 total ^ 2 0.380 0.384 0.384 c 0.419 0.420 0.420 c # SNPs - - 7 - - 6 ^ h 2 0.459 0.365 0.295 – 0.375 0.398 0.351 0.274 – 0.321 a All models adjusted for age, sex and population proportions. Model (1) and (2) did not include SNP as a covariate. b The range of model parameters for the selected SNPs. c Not calculated for model (3) because the total variance depends on genotype. Use that in model (2) as an approximation. of variation in BMI that can be explained by the environment (0.024/0.380=0.063). Accordingly, model (2) is not significantly better than model (1). For the result of model (3), the QQ plot in Figure 4.10 (a) shows that the type 1 error rate is well kept. Among the SNPs withP < 1 10 5 , the same SNP rs10875089 in DPYD was also shown to be association with weight. Since the test of model (2) against (1) is not significant, these signals on the interaction needs to be examined further and to be replicated in other cohorts if possible. Analyses on “overlap" BMI had similar results. The test of model (2) against (1) was not significant (LR statistic 0.364, P = 0:27), either. We used the omnibus test and then selected 6 SNPs withP < 110 5 for testing the interaction. Again, these results needs to be treated with care. 124 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (a) (b) Figure 4.10: QQ plot and manhattan plot of testing the interaction between SNPs and the CHE for median BMI. Two lines in (b): top, 5 10 8 ; bottom, 1 10 5 . 125 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (a) (b) Figure 4.11: QQ plot and manhattan plot of testing the interaction between SNPs and the CHE for “overlap" BMI. Two lines in (b): top, 5 10 8 ; bottom, 1 10 5 . 126 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION [Fasting glucose, non-diabetic] The analysis on fasting glucose using model (3) resulted in skewed residuals and led to serious inflation in type 1 error rate. Consequently, the result is only shown for the analysis after excluding individuals that were classified as diabetic. 3 individuals that led to large residuals in testing “overlap" glucose were also removed. The variance components and h 2 estimates are summarized in Table 4.7. The test of model (2) against (1) is significant (LR statistic 13.59, P = 1:1 10 4 ), so we used the 2 nd test as in equation (4.5) of the stepwise test framework to detect the interaction. For the result of model (3), we selected SNPs with a test P value less than 1 10 5 as potential signals, which results in 4 SNPs. Theresultsfrommodel(1)and(2)showthatthereissomeproportionofvariation inmeanfastingglucosethatcanbeexplainedbytheenvironment(0.065/0.681=0.095). Analyses on “overlap" fasting glucose got results that were quite different. The test of model (2) against (1) was significant (LR statistic 3.62, P = 0:029), and we selected 11 SNPs with P < 1 10 5 for model (3). The results of all identified interaction signals (P<1 10 5 ) for the median phe- notypes are summarized in Table 4.8. One SNP with the most significant P value was selected to be shown for SNPs within 30kb. 127 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (a) (b) Figure 4.12: QQ plot and manhattan plot of testing the interaction between SNPs and the CHE for median fasting glucose. Two lines in (b): top, 5 10 8 ; bottom, 1 10 5 . 128 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION (a) (b) Figure 4.13: QQ plot and manhattan plot of testing the interaction between SNPs and the CHE for “overlap" fasting glucose. Two lines in (b): top, 5 10 8 ; bottom, 1 10 5 . 129 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION Table 4.7: h 2 estimates of median fasting glucose using different models Trait Median fasting glucose “Overlap" fasting glucose Models a (1) (2) (3) b (1) (2) (3) b ^ 2 A 0.207 0.109 0.069 – 0.085 0.221 0.170 0.113 – 0.143 ^ 2 H - 0.065 0.027 – 0.153 - 0.030 0 – 0.180 ^ 2 int - - 0.051 – 0.301 - - 0.055 – 0.131 total ^ 2 0.675 0.681 0.681 c 0.792 0.793 0.793 c # SNPs - - 4 - - 11 ^ h 2 0.306 0.160 0.101 – 0.125 0.279 0.215 0.142 – 0.180 a All models adjusted for age, sex, BMI and population proportions. Model (1) and (2) did not include SNP as a covariate. b The range of model parameters for the selected SNPs. c Not calculated for model (3) because the total variance depends on genotype. Use that in model (2) as an approximation. 130 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION Table 4.8: Results of testing the interaction for median phenotypes at P < 1 10 5 Trait SNP P ^ 2 A ^ 2 H ^ 2 int a int ^ h 2 Chr Pos Gene_Up Gene_Down In_Gene RA b RAF c Height rs17643638 2:9810 6 2.74 1.52 1.27 -0.488 -1.87 0.471 2 78609955 496.364 kb from REG3G 238.59 kb from LOC101927967 A 0.132 Height rs7303934 7:6910 6 2.75 1.33 1.30 -0.360 -2.75 0.472 12 59898548 489.747 kb from FAM19A2 None within 500 kb A 0.102 Height rs9634421 9:5810 6 2.65 1.95 0.839 -0.632 -1.04 0.456 13 25207248 ATP8A2 G 0.419 Height rs4901085 9:1410 7 2.78 1.47 0.801 -0.444 -1.66 0.478 14 50709354 67.281 kb from TMX1 77.182 kb from TRIM9 C 0.407 Height rs9913467 3:3910 7 2.76 1.57 1.13 -0.515 -1.64 0.474 17 54870249 LINC01476 G 0.158 Height rs9898731 4:1910 7 2.69 1.10 0.679 -0.203 -3.88 0.462 17 46047373 CACNA1G T 0.409 Height rs228293 1:3710 6 2.75 1.55 1.60 -0.520 -1.95 0.473 17 34185726 PIP4K2B C 0.105 Height rs2253804 4:6610 6 2.85 1.45 1.22 -0.466 -1.97 0.490 17 45710559 TMEM92 C 0.137 Height rs17623659 2:0010 6 2.79 1.58 1.79 -0.641 -1.66 0.480 19 34177368 7.933 kb from LINC01532 23.949 kb from LOC102724958 C 0.077 Weight rs10875089 1:1510 6 1.12 0.497 0.890 -0.737 -1.81 0.333 1 97679462 DPYD A 0.149 Weight rs7639339 5:0910 6 1.15 0.292 0.392 -0.433 -2.67 0.342 3 96468579 387.377 kb from MTHFD2P1 275.72 kb from LINC00879 T 0.446 Weight rs7626779 7:5710 6 1.35 0.00 0.300 NA NA 0.404 3 21588038 ZNF385D, ZNF385D-AS1 T 0.399 Weight rs4709707 1:2410 6 1.13 0.333 0.708 -0.218 -6.70 0.346 6 163681245 DKFZp451B082 A 0.177 Weight rs1542479 2:4510 6 1.27 0.227 0.937 1.000 2.03 0.378 9 92412869 DIRAS2 A 0.051 Weight rs12572355 1:0410 6 1.20 0.304 0.580 -0.155 -8.91 0.358 10 59248955 372.328 kb from IPMK None within 500 kb T 0.183 Weight rs7067834 5:0410 6 1.40 1.13 0.399 -0.986 -0.603 0.416 10 67148082 LINC01515 T 0.499 Weight rs11049713 7:9710 6 1.25 0.158 0.369 0.211 7.22 0.374 12 28602812 None within 500 kb. 8.446 kb from CCDC91 G 0.272 Weight rs10843192 8:1510 6 1.32 0.0405 0.276 0.900 2.91 0.393 12 28552956 CCDC91 A 0.335 Weight rs11160307 9:9510 6 1.23 1.01 0.222 -0.986 -0.476 0.368 14 95606343 C14orf132 T 0.498 Weight rs6124623 7:0910 9 1.10 0.0424 1.06 -0.532 -2.97 0.324 20 42179746 JPH2 G 0.139 BMI rs10875089 6:0610 6 0.113 0.0439 0.104 -0.830 -1.85 0.295 1 97679462 DPYD A 0.149 BMI rs7421418 7:4510 6 0.123 0.0105 0.0412 -0.190 -10.40 0.321 2 50130564 NRXN1 C 0.314 BMI rs6447614 8:0510 6 0.115 0.111 0.0245 -1.000 -0.469 0.301 4 47762714 0.452 kb from TXK 25.767 kb from NIPAL1 T 0.487 BMI rs2616217 8:6910 6 0.122 2:5210 3 0.0351 0.252 14.83 0.318 8 20657617 197.806 kb from LOC102467222 465.368 kb from LZTS1-AS1 G 0.372 BMI rs10813746 5:9410 6 0.144 5:1410 5 0.0506 1.000 31.39 0.375 9 32103319 271.281 kb from ACO1 None within 500 kb A 0.170 BMI rs17834073 4:1110 6 0.122 0.0150 0.0533 0.196 9.65 0.318 19 38658293 PEPD C 0.145 Glucose rs17024443 5:8110 6 0.085 0.095 0.301 -0.853 -2.09 0.125 4 149305529 NR3C2 C 0.0725 Glucose rs3758813 2:9210 6 0.069 0.153 0.103 -0.791 -1.04 0.101 11 44585055 CD82 T 0.4486 Glucose rs12594449 1:3610 6 0.079 0.103 0.296 -0.994 -1.7018 0.116 15 51253705 339.524 kb from WDR72 384.204 kb from ONECUT1 T 0.0708 Glucose rs1493899 7:4610 6 0.085 0.027 0.051 0.041 33.84 0.125 16 53288267 150.395 kb from LOC101927480 326.155 kb from LOC100996345 C 0.4326 a The estimated correlation between the random intercept and random slope. a Reference allele. a Reference allele frequency. 131 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION 4.3. Discussion In the first section of this chapter, a testing framework based on likelihood ratio statistic is introduced. It has generally desirable type 1 error rate and power in the simulations, except for the overly conservative type 1 error rate for the omnibus test in the LMM framework. However, if the test for the CHE is significant by the stepwise test, it is not necessary to use this conservative omnibus test at all. The omnibus is only preferable when the interaction has an opposite direction to the CHE, which may reduce the power of a stepwise test. On the other hand, there have been studies on modeling the distribution of the LR statistic for a linear mixed-effects model using tools other than its asymptotic properties. For example, the bootstrap method was used to approximate the P value of a score test in Sinha (2009). However, there are controversies on the consistency of using bootstrap method to approximate the asymptotic distribution (Drton and Williams, 2011). Given that primary of this dissertation is testing the hypothesis on the source of missing heritability, the bootstrap method was not applied. The test is based on likelihood ratio statistic, which requires optimizing both the null and alternative models. The likelihood ratio statistic can be approximated using a score test or a Wald test (Muggeo and Lovison, 2014), which then saves computation by doing optimization only once. Currently the framework has been based on nuclear families. In real studies, there are usually singletons in addition to families. Although the inclusion of the singletonsisunlikelytohelptheoptimizationprocesstobetterestimateeachvariance component, it is likely that they are going to help estimating the total variance. Simulations can be performed to examine this idea. 132 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION The hypothesis of explaining the missing heritability problem by the CHE and its interaction with genetic variation was tested using data from the Framingham Heart Study. By including the CHE component, the heritability estimates dropped by various degrees, from 9.4% for BMI to 27% for height, although the test for the CHE for BMI was not significant. Moreover, the heritability estimates decrease further by 0% to 6% when including the interaction between the CHE and a single SNP that shows association. These results support the hypothesis, although they need to be confirmed in other cohorts. These discovered SNPs may jointly reduce the heritability estimates further, yet the way of combining the interaction effect is beyond the scope of this dissertation and can be studied in the future. A problem that occurred during the analysis was the potential inflation on type 1 error for testing the interaction. This is caused by violation to the underlying as- sumptions of the model: (1) a nuclear family should be unrelated to other ones; (2) the residuals should be normally distributed. To meet the first assumption, relation- ship checking should be performed using genome-wide data to confirm the specified familial correlations. Another solution is to use the realized kinship matrix gener- ated using genome-wide data to replace the expected kinship matrix inferred using family structure. The biggest consideration under this assumption is by assuming the independence across the families, computations during model optimization can be save dramatically. Therefore, this assumption can be dropped as needed with proper correction of between-family correlations. For the second assumption, certain type of transformation should be done on the phenotype data to make the residuals more normal. One example was that BMI was initially log-transformed and the inflation in type 1 error rate was huge. An exami- nation showed that residuals were skewed, although not severely. After changing to 133 CHAPTER 4. HYPOTHESIS TESTING AND DATA APPLICATION the reciprocal transformation, the inflation was prevented. This implicates that the model may be quite sensitive to this assumption. Further simulations can be done to examine how much impact the violation will have on the result. Despite of the fact that we found several signals showing evidence of association for the interaction, we don’t find many. This may implicate that the effect sizes of the interaction in real data are too small to be identified with sufficient power. Therefore, larger sample sizes are preferred in other studies that aim to identify interactions between the CHE and genetic variation. 134 Chapter 5 Conclusion And Future Directions 5.1. Conclusion This dissertation has elaborated a possible explanation to the problem of missing heritability and has shown supporting evidence from testing the hypothesis on data from a real study. The first part shows, via theoretical derivation and simulations , that the esti- mated heritability will be biased upwards if the underlying model is misspecified by not including the common household environment (CHE) and its interaction with genetic variation. In the second part a method based on linear mixed-effects model (LMM) is introduced to estimate heritability without bias given a correctly specified model, and its connection to a standard LMM is built. In the third part a hypothe- sis testing framework is introduced for model comparison and selection. Finally the proposed estimation and hypothesis testing framework have been tested using data from the Framingham Heart Study (FHS). We conclude that we have evidence supporting that one potential source of miss- ing heritability is from the inflation in the estimation of heritability by not including 135 CHAPTER 5. CONCLUSION AND FUTURE DIRECTIONS the CHE and its interaction with genetic variation, and this bias can be corrected using LMM methods with correctly specified models. 5.2. Future Directions The performance of the proposed model in situations of violations to the under- lying assumptions needs to be further studied. As has been shown in the application on the FHS data, violation to the assumptions of normal distribution of residuals leads to a higher type 1 error rate. Simulation studies can be done to examine the impact of deviation from normality has on the estimation. Factors to check include skewness and kurtosis. For the former, asymmetric distributions can be used in simu- lating the residuals, such as the log normal or chi square distributions. For the latter, distributions with heavier tails can be used such as the Student’s t distribution. An- other perspective on this issue is if the violation to the normality assumption can be worked around by doing transformations on the phenotype. More specifically, I’d like to inspect if semi-parametric methods like the rank-based inverse normal trans- formation can be used here. The bias on estimation, the type 1 and type 2 error rates on hypothesis testing are things to be checked. Another assumption of the model is the independence between nuclear families. Violation to this assumption can also lead to a higher type 1 error rate. Simulations can be done to study the relationship between the inflation of the type 1 error rate and different degrees of correlation across families. Then it can be tested that if the bias can be corrected by including the correlation structure, which is expected to be the case. The idea of using realized kinship matrix can be examined by simulating genome-widedataandusingtheestimatedkinshipmatrixtoaccountforthebetween- 136 CHAPTER 5. CONCLUSION AND FUTURE DIRECTIONS family correlations. Different types of familiar relationship can be simulated, such as parents, grandparents, cousins, or more distant relationships. The correlation matrix will include all families. Then we can look at how much the estimation will be influenced by assuming the independence between distant families, and examine the impact of the threshold of classifying as ‘independent’. The above-mentioned ideas can be further testes using real data, i.e. the FHS data. The identified interaction using the FHS data in this dissertation needs to be con- firmed within other study cohorts. In addition to finding another cohort to replicate the discoveries, we may randomly divide the current FHS data in two sets, and mimic a two stage ‘discovery-replication’ study. By doing this, we can be more confident about the identified signals, but we may also lose some power because of the smaller sample size. We may also use families from the FHS that were not selected into the analysis in this dissertation as a replication cohort, given that we only selected 2,427 individuals out of around 9000 genotyped samples. However, the discovery and replication cohorts are correlated by doing this. The contribution of the CHE to the interaction can be further studied. There are data on specific environment exposures in the FHS data. We can examine which environment factors are clustered by families and test if we can find a genetic by environment (G by E) interaction on them. However, the idea of the CHE is to model the environment that is unable to measure, either because it’s hard to quantify (i.e. stress) or it happened in the past but we did not measure at that time. As a consequence, we may or may not find an G by E interaction that can explain the interaction between the CHE and genetic variation. 137 References Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, Akademiai Kiado, Budapest. Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9):1655–1664. Aulchenko, Y. S., Ripatti, S., Lindqvist, I., Boomsma, D., Heid, I. M., Pramstaller, P. P., Penninx, B. W. J. H., Janssens, A. C. J. W., Wilson, J. F., Spector, T., Martin, N. G., Pedersen, N. L., Kyvik, K. O., Kaprio, J., Hofman, A., Freimer, N. B., Jarvelin, M. R., Gyllensten, U., Campbell, H., Rudan, I., Johansson, A., Marroni, F., Hayward, C., Vitart, V., Jonasson, I., Pattaro, C., Wright, A., Hastie, N., Pichler, I., Hicks, A. A., Falchi, M., Willemsen, G., Hottenga, J.-J., de Geus, E. J. C., Montgomery, G. W., Whitfield, J., Magnusson, P., Saharinen, J., Perola, M., Silander, K., Isaacs, A., Sijbrands, E. J. G., Uitterlinden, A. G., Witteman, J. C. M., Oostra, B. A., Elliott, P., Ruokonen, A., Sabatti, C., Gieger, C., Meitinger, T., Kronenberg, F., Döring, A., Wichmann, H.-E., Smit, J. H., McCarthy, M. I., vanDuijn, C.M., Peltonen, L., andENGAGEConsortium(2009). Lociinfluencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nature genetics, 41(1):47–55. Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting Linear Mixed- Effects Models Using lme4. Journal of Statistical Software, 67(1):1–48. Bates, D., Mullen, K. M., Nash, J. C., and Varadhan, R. (2014). minqa: Derivative- free optimization algorithms by quadratic approximation. Battiti, R. (1992). First- and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method. Neural computation, 4(2):141–166. Bloom, J. S., Ehrenreich, I. M., Loo, W. T., Lite, T.-L. V., and Kruglyak, L. (2013). Finding the sources of missing heritability in a yeast cross. Nature, 494(7436):234– 237. 138 REFERENCES Bodmer, W. and Bonilla, C. (2008). Common and rare variants in multifactorial susceptibility to common diseases. Nature genetics, 40(6):695–701. Boehnke, M., Moll, P. P., Lange, K., Weidman, W. H., and Kottke, B. A. (1986). Univariate and bivariate analyses of cholesterol and triglyceride levels in pedigrees. American journal of medical genetics, 23(3):775–792. Box, G. E. P. and Draper, N. R. (1987). Empirical model-building and response surfaces. John Wiley & Sons Inc. Brown, W. M., Beck, S. R., Lange, E. M., Davis, C. C., Kay, C. M., Langefeld, C. D., Rich, S. S., and Framingham Heart Study (2003). Age-stratified heritability esti- mation in the Framingham Heart Study families. BMC genetics, 4 Suppl 1(Suppl 1):S32. Broyden, C. G. (1970). The Convergence of a Class of Double-rank Minimization Algorithms 1. General Considerations. IMA Journal of Applied Mathematics, 6(1):76–90. Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C. R., Lim, E. P., Kalyanaraman, N., Nemesh, J., Ziaugra, L., Friedland, L., Rolfe, A., Warrington, J., Lipshutz, R., Daley, G. Q., and Lander, E. S. (1999). Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature genetics, 22(3):231–238. Casella, G. and Berger, R. L. (2002). Statistical inference, volume 2. Duxbury Pacific Grove, CA. Caspi, A. and Moffitt, T. E. (2006). Gene-environment interactions in psychiatry: joining forces with neuroscience. Nature reviews. Neuroscience, 7(7):583–590. Chamaillard, M., Philpott, D., Girardin, S. E., Zouali, H., Lesage, S., Chareyre, F., Bui, T. H., Giovannini, M., Zaehringer, U., Penard-Lacronique, V., Sansonetti, P. J., Hugot, J.-P., and Thomas, G. (2003). Gene-environment interaction modu- lated by allelic heterogeneity in inflammatory diseases. Proceedings of the National Academy of Sciences of the United States of America, 100(6):3455–3460. Chan, J.C.N., Malik, V., Jia, W., Kadowaki, T., Yajnik, C.S., Yoon, K.-H., andHu, F. B. (2009). Diabetes in Asia: epidemiology, risk factors, and pathophysiology. JAMA, 301(20):2129–2140. Chen, X., Kuja-Halkola, R., Rahman, I., Arpegård, J., Viktorin, A., Karlsson, R., Hägg, S., Svensson, P., Pedersen, N.L., andMagnusson, P.K.E.(2015). Dominant Genetic Variation and Missing Heritability for Human Complex Traits: Insights 139 REFERENCES fromTwinversusGenome-wideCommonSNPModels. American journal of human genetics, 97(5):708–714. Chernoff, H. (1954). On the distribution of the likelihood ratio. The Annals of Mathematical Statistics. Collins, F. S., Guyer, M. S., and Charkravarti, A. (1997). Variations on a theme: cataloging human DNA sequence variation. Science (New York, N.Y.), 278(5343):1580–1581. Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews, T. D., Barnes, C., Campbell, P., Fitzgerald, T., Hu, M., Ihm, C. H., Kristiansson, K., MacArthur, D. G., Macdonald, J. R., Onyiah, I., Pang, A. W. C., Robson, S., Stirrups, K., Valsesia, A., Walter, K., Wei, J., Wellcome Trust Case Control Consortium, Tyler-Smith, C., Carter, N. P., Lee, C., Scherer, S. W., and Hurles, M. E. (2010). Origins and functional impact of copy number variation in the human genome. Nature, 464(7289):704–712. Cupples, L. A., Heard-Costa, N., Lee, M., Atwood, L. D., and Framingham Heart Study Investigators (2009). Genetics Analysis Workshop 16 Problem 2: the Fram- ingham Heart Study data. BMC proceedings, 3 Suppl 7:S3. Drton, M. and Williams, B. (2011). Quantifying the failure of bootstrap likelihood ratio tests. Biometrika, 98(4):919–934. Easton, D. F., Bishop, D. T., Ford, D., and Crockford, G. P. (1993). Genetic linkage analysisinfamilialbreastandovariancancer: resultsfrom214families.TheBreast Cancer Linkage Consortium. American journal of human genetics, 52(4):678–701. Falconer, D. S. and Mackay, T. F. C. (1996). Introduction to Quantitative Genetics (4th edition). Fisher, R. A. (1918). The Correlation Between Relatives on the Supposition of Mendelian Inheritance. Fisher, R. A. (1958). The genetical theory of natural selection. Frayling, I. M., Beck, N. E., Ilyas, M., Dove-Edwin, I., Goodman, P., Pack, K., Bell, J. A., Williams, C. B., Hodgson, S. V., Thomas, H. J., Talbot, I. C., Bodmer, W. F., and Tomlinson, I. P. (1998). The APC variants I1307K and E1317Q are associatedwithcolorectaltumors,butnotalwayswithafamilyhistory. Proceedings of the National Academy of Sciences of the United States of America,95(18):10722– 10727. 140 REFERENCES Freeman, M. S., Mansfield, M. W., Barrett, J. H., and Grant, P. J. (2002). Heritabil- ity of features of the insulin resistance syndrome in a community-based study of healthy families. Diabetic medicine : a journal of the British Diabetic Association, 19(12):994–999. Galton, F. (1876). The History of Twins, as a Criterion of the Relative Powers of Nature and Nurture. The Journal of the Anthropological Institute of Great Britain and Ireland, 5:391. Gaugler, T., Klei, L., Sanders, S. J., Bodea, C. A., Goldberg, A. P., Lee, A. B., Mahajan, M., Manaa, D., Pawitan, Y., Reichert, J., Ripke, S., Sandin, S., Sklar, P., Svantesson, O., Reichenberg, A., Hultman, C. M., Devlin, B., Roeder, K., and Buxbaum, J. D. (2014). Most genetic risk for autism resides with common variation. Nature genetics, 46(8):881–885. Gibbs, R. A., Belmont, J. W., Hardenbol, P., Willis, T. D., Yu, F., Yang, H., Ch’ang, L.-Y., Huang, W., Liu, B., Shen, Y., Tam, P. K.-H., Tsui, L.-C., Waye, M. M. Y., Wong, J. T.-F., Zeng, C., Zhang, Q., Chee, M. S., Galver, L. M., Kruglyak, S., Murray, S. S., Oliphant, A. R., Montpetit, A., Hudson, T. J., Chagnon, F., Ferretti, V., Leboeuf, M., Phillips, M. S., Verner, A., Kwok, P.-Y., Duan, S., Lind, D. L., Miller, R. D., Rice, J. P., Saccone, N. L., Taillon-Miller, P., Xiao, M., Nakamura, Y., Sekine, A., Sorimachi, K., Tanaka, T., Tanaka, Y., Tsunoda, T., Yoshino, E., Bentley, D. R., Deloukas, P., Hunt, S., Powell, D., Altshuler, D., Gabriel, S. B., Zhang, H., Matsuda, I., Fukushima, Y., Macer, D. R., Suda, E., Rotimi, C. N., Adebamowo, C. A., Aniagwu, T., Marshall, P. A., Matthew, O., Nkwodimmah, C., Royal, C. D. M., Leppert, M. F., Dixon, M., Stein, L. D., Cunningham, F., Kanani, A., Thorisson, G. A., Chakravarti, A., Chen, P. E., Cutler, D. J., Kashuk, C. S., Donnelly, P., Marchini, J., McVean, G. A. T., Myers, S. R., Cardon, L. R., Abecasis, G. R., Morris, A., Weir, B. S., Mullikin, J. C., Sherry, S. T., Feolo, M., Daly, M. J., Schaffner, S. F., Qiu, R., Kent, A., Dunston, G. M., Kato, K., Niikawa, N., Knoppers, B. M., Foster, M. W., Clayton, E. W., Wang, V. O., Watkin, J., Sodergren, E., Weinstock, G. M., Wilson, R. K., Fulton, L. L., Rogers, J., Birren, B. W., Han, H., Wang, H., Godbout, M., Wallenburg, J. C., L’Archevêque, P., Bellemare, G., Todani, K., Fujita, T., Tanaka, S., Holden, A. L., Lai, E. H., Collins, F. S., Brooks, L. D., McEwen, J. E., Guyer, M. S., Jordan, E., Peterson, J. L., Spiegel, J., Sung, L. M., Zacharia, L. F., Kennedy, K., Dunn, M. G., Seabrook, R., Shillito, M., Skene, B., Stewart, J. G., chair, D. L. V., co chair, E. W. C., co chair, L. B. J., Cho, M. K., Duster, T., Jasperse, M., Licinio, J., Long, J. C., Ossorio, P. N., Spallone, P., Terry, S. F., chair, E. S. L., co chair, E. H. L., co chair, D. A. N., Boehnke, M., Douglas, J. A., Hudson, R. R., Kruglyak, L., and Nussbaum, R. L. (2003). The International HapMap Project. Nature, 426(6968):789–796. 141 REFERENCES Gibson, G. (2011). Rare and common variants: twenty arguments. Nature reviews. Genetics, 13(2):135–145. Gusella, J. F., Wexler, N. S., Conneally, P. M., Naylor, S. L., Anderson, M. A., Tanzi, R. E., Watkins, P. C., Ottina, K., Wallace, M. R., and Sakaguchi, A. Y. (1983). A polymorphic DNA marker genetically linked to Huntington’s disease. Nature, 306(5940):234–238. Hall, J. M., Lee, M. K., Newman, B., Morrow, J. E., Anderson, L. A., Huey, B., and King, M. C. (1990). Linkage of early-onset familial breast cancer to chromosome 17q21. Science (New York, N.Y.), 250(4988):1684–1689. Han, S. S. and Chang, J. T. (2010). Reconsidering the asymptotic null distribution of likelihood ratio tests for genetic linkage in multivariate variance components models under complete pleiotropy. Biostatistics, 11(2):226–241. Hartley, H. O. and Rao, J. (1967). Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika. Henderson, C. R. (1953). Estimation of Variance and Covariance Components. Bio- metrics, 9(2):226. Hill, W. G., Goddard, M. E., and Visscher, P. M. (2008). Data and theory point to mainly additive genetic variance for complex traits. PLoS Genetics, 4(2):e1000008. Hoffmann, A. A. and Parsons, P. A. (1993). Evolutionary Genetics and Environmen- tal Stress. Oxford University Press. Hopper, J. L. and Mathews, J. D. (1982). Extensions to multivariate normal models for pedigree analysis. Annals of Human Genetics, 46(Pt 4):373–383. Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstan- dard conditions. In Proceedings of the fifth Berkeley symposium on .... Hunt, K. A., Mistry, V., Bockett, N. A., Ahmad, T., Ban, M., Barker, J. N., Barrett, J. C., Blackburn, H., Brand, O., Burren, O., Capon, F., Compston, A., Gough, S. C. L., Jostins, L., Kong, Y., Lee, J. C., Lek, M., MacArthur, D. G., Mansfield, J. C., Mathew, C. G., Mein, C. A., Mirza, M., Nutland, S., Onengut-Gumuscu, S., Papouli, E., Parkes, M., Rich, S. S., Sawcer, S., Satsangi, J., Simmonds, M. J., Trembath, R. C., Walker, N. M., Wozniak, E., Todd, J. A., Simpson, M. A., Plagnol, V., and van Heel, D. A. (2013). Negligible impact of rare autoimmune- locus coding-region variants on missing heritability. Nature, 498(7453):232–235. Hunter, D. J. (2005). Gene-environment interactions in human diseases. Nature reviews. Genetics, 6(4):287–298. 142 REFERENCES International Schizophrenia Consortium (2008). Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature, 455(7210):237–241. Jacquard, A. (1974). The genetic structure of populations. Jacquard, A. (1983). Heritability: one word, three concepts. Biometrics, 39(2):465– 477. Karlin, S., Cameron, E. C., and Chakraborty, R. (1983). Path analysis in genetic epidemiology: a critique. American journal of human genetics, 35(4):695–732. Kathiresan, S., Willer, C. J., Peloso, G. M., Demissie, S., Musunuru, K., Schadt, E. E., Kaplan, L., Bennett, D., Li, Y., Tanaka, T., Voight, B. F., Bonnycas- tle, L. L., Jackson, A. U., Crawford, G., Surti, A., Guiducci, C., Burtt, N. P., Parish, S., Clarke, R., Zelenika, D., Kubalanza, K. A., Morken, M. A., Scott, L. J., Stringham, H. M., Galan, P., Swift, A. J., Kuusisto, J., Bergman, R. N., Sundvall, J., Laakso, M., Ferrucci, L., Scheet, P., Sanna, S., Uda, M., Yang, Q., Lunetta, K. L., Dupuis, J., de Bakker, P. I. W., O’Donnell, C. J., Chambers, J. C., Kooner, J. S., Hercberg, S., Meneton, P., Lakatta, E. G., Scuteri, A., Schlessinger, D., Tuomilehto, J., Collins, F. S., Groop, L., Altshuler, D., Collins, R., Lathrop, G. M., Melander, O., Salomaa, V., Peltonen, L., Orho-Melander, M., Ordovas, J. M., Boehnke, M., Abecasis, G. R., Mohlke, K. L., and Cupples, L. A. (2009). Common variants at 30 loci contribute to polygenic dyslipidemia. Nature genetics, 41(1):56–65. Kempthorne, O. and Osborne, R. H. (1961). The interpretation of twin data. Amer- ican journal of human genetics. Kempthorne, O. and Tandon, O. B. (1953). The Estimation of Heritability by Re- gression of Offspring on Parent. Biometrics, 9(1):90–100. Kilpeläinen, T. O., Qi, L., Brage, S., Sharp, S. J., Sonestedt, E., Demerath, E., Ahmad, T., Mora, S., Kaakinen, M., Sandholt, C. H., Holzapfel, C., Autenrieth, C. S., Hyppönen, E., Cauchi, S., He, M., Kutalik, Z., Kumari, M., Stančáková, A., Meidtner, K., Balkau, B., Tan, J. T., Mangino, M., Timpson, N. J., Song, Y., Zillikens, M. C., Jablonski, K. A., Garcia, M. E., Johansson, S., Bragg-Gresham, J. L., Wu, Y., van Vliet-Ostaptchouk, J. V., Onland-Moret, N. C., Zimmermann, E., Rivera, N. V., Tanaka, T., Stringham, H. M., Silbernagel, G., Kanoni, S., Feitosa, M. F., Snitker, S., Ruiz, J. R., Metter, J., Larrad, M. T. M., Atalay, M., Hakanen, M., Amin, N., Cavalcanti-Proença, C., Grøntved, A., Hallmans, G., Jansson, J.-O., Kuusisto, J., Kähönen, M., Lutsey, P. L., Nolan, J. J., Palla, L., Pedersen, O., Pérusse, L., Renström, F., Scott, R. A., Shungin, D., Sovio, U., Tammelin, T. H., Rönnemaa, T., Lakka, T. A., Uusitupa, M., Rios, M. S., 143 REFERENCES Ferrucci, L., Bouchard, C., Meirhaeghe, A., Fu, M., Walker, M., Borecki, I. B., Dedoussis, G.V., Fritsche, A., Ohlsson, C., Boehnke, M., Bandinelli, S., vanDuijn, C. M., Ebrahim, S., Lawlor, D. A., Gudnason, V., Harris, T. B., Sørensen, T. I. A., Mohlke, K. L., Hofman, A., Uitterlinden, A. G., Tuomilehto, J., Lehtimäki, T., Raitakari, O., Isomaa, B., Njølstad, P. R., Florez, J. C., Liu, S., Ness, A., Spector, T. D., Tai, E. S., Froguel, P., Boeing, H., Laakso, M., Marmot, M., Bergmann, S., Power, C., Khaw, K.-T., Chasman, D., Ridker, P., Hansen, T., Monda, K. L., Illig, T., Jarvelin, M. R., Wareham, N. J., Hu, F. B., Groop, L. C., Orho-Melander, M., Ekelund, U., Franks, P. W., and Loos, R. J. F. (2011). Physical activity attenuates the influence of FTO variants on obesity risk: a meta-analysis of 218,166 adults and 19,268 children. PLoS medicine, 8(11):e1001116. Laken, S. J., Petersen, G. M., Gruber, S. B., Oddoux, C., Ostrer, H., Giardiello, F. M., Hamilton, S. R., Hampel, H., Markowitz, A., Klimstra, D., Jhanwar, S., Winawer, S., Offit, K., Luce, M. C., Kinzler, K. W., and Vogelstein, B. (1997). Familial colorectal cancer in Ashkenazim due to a hypermutable tract in APC. Nature genetics, 17(1):79–83. Lander, E. S. (1996). The new genomics: global views of biology. Science (New York, N.Y.), 274(5287):536–539. Lange, K. (1978). Central limit theorems of pedigrees. Journal of Mathematical Biology. Lathrope, G. M., Lalouel, J. M., and Jacquard, A. (1984). Path Analysis of Family Resemblance and Gene-Environment Interaction. Biometrics, 40(3):611. Lindstrom, M. J. and Bates, D. M. (1988). Newton-Raphson and EM Algorithms for Linear Mixed-Effects Models for Repeated-Measures Data. Journal of the Ameri- can Statistical Association, 83(404):1014–1022. Liu, X.-Q., Hanley, A. J. G., and Paterson, A. D. (2003). Genetic analysis of common factors underlying cardiovascular disease-related traits. BMC genetics, 4 Suppl 1(Suppl 1):S56. Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C., Vedantam, S., Buchkovich, M. L., Yang, J., Croteau-Chonka, D. C., Esko, T., Fall, T., Ferreira, T., Gustafsson, S., Kutalik, Z., Luan, J., Mägi, R., Randall, J. C., Winkler, T. W., Wood, A. R., Workalemahu, T., Faul, J. D., Smith, J. A., Hua Zhao, J., Zhao, W., Chen, J., Fehrmann, R., Hedman, Å. K., Karjalainen, J., Schmidt, E. M., Absher, D., Amin, N., Anderson, D., Beekman, M., Bolton, J. L., Bragg-Gresham, J. L., Buyske, S., Demirkan, A., Deng, G., Ehret, G. B., Feenstra, B., Feitosa, M. F., Fischer, K., Goel, A., Gong, J., Jackson, A. U., 144 REFERENCES Kanoni, S., Kleber, M. E., Kristiansson, K., Lim, U., Lotay, V., Mangino, M., Mateo Leach, I., Medina-Gomez, C., Medland, S. E., Nalls, M. A., Palmer, C. D., Pasko, D., Pechlivanis, S., Peters, M. J., Prokopenko, I., Shungin, D., Stančáková, A., Strawbridge, R. J., Ju Sung, Y., Tanaka, T., Teumer, A., Trompet, S., van der Laan, S. W., van Setten, J., van Vliet-Ostaptchouk, J. V., Wang, Z., Yengo, L., Zhang, W., Isaacs, A., Albrecht, E., Arnlöv, J., Arscott, G. M., Attwood, A. P., Bandinelli, S., Barrett, A., Bas, I. N., Bellis, C., Bennett, A. J., Berne, C., Blagieva, R., Blüher, M., Böhringer, S., Bonnycastle, L. L., Böttcher, Y., Boyd, H. A., Bruinenberg, M., Caspersen, I. H., Ida Chen, Y.-D., Clarke, R., Daw, E. W., de Craen, A. J. M., Delgado, G., Dimitriou, M., Doney, A. S. F., Eklund, N., Estrada, K., Eury, E., Folkersen, L., Fraser, R. M., Garcia, M. E., Geller, F., Giedraitis, V., Gigante, B., Go, A. S., Golay, A., Goodall, A. H., Gordon, S. D., Gorski, M., Grabe, H.-J., Grallert, H., Grammer, T. B., Gräßler, J., Grönberg, H., Groves, C. J., Gusto, G., Haessler, J., Hall, P., Haller, T., Hallmans, G., Hartman, C. A., Hassinen, M., Hayward, C., Heard-Costa, N. L., Helmer, Q., Hengstenberg, C., Holmen, O., Hottenga, J.-J., James, A. L., Jeff, J. M., Johansson, A., Jolley, J., Juliusdottir, T., Kinnunen, L., Koenig, W., Koskenvuo, M., Kratzer, W., Laitinen, J., Lamina, C., Leander, K., Lee, N. R., Lichtner, P., Lind, L., Lindström, J., Sin Lo, K., Lobbens, S., Lorbeer, R., Lu, Y., Mach, F., Magnusson, P. K. E., Mahajan, A., McArdle, W. L., McLachlan, S., Menni, C., Merger, S., Mihailov, E., Milani, L., Moayyeri, A., Monda, K. L., Morken, M. A., Mulas, A., Müller, G., Müller-Nurasyid, M., Musk, A. W., Nagaraja, R., Nöthen, M. M., Nolte, I. M., Pilz, S., Rayner, N. W., Renström, F., Rettig, R., Ried, J. S., Ripke, S., Robertson, N. R., Rose, L. M., Sanna, S., Scharnagl, H., Scholtens, S., Schumacher, F. R., Scott, W. R., Seufferlein, T., Shi, J., Vernon Smith, A., Smolonska,J.,Stanton,A.V.,Steinthorsdottir,V.,Stirrups,K.,Stringham,H.M., Sundström, J., Swertz, M.A., Swift, A.J., Syvänen, A.-C., Tan, S.-T., Tayo, B.O., Thorand, B., Thorleifsson, G., Tyrer, J. P., Uh, H.-W., Vandenput, L., Verhulst, F. C., Vermeulen, S. H., Verweij, N., Vonk, J. M., Waite, L. L., Warren, H. R., Waterworth, D., Weedon, M. N., Wilkens, L. R., Willenborg, C., Wilsgaard, T., Wojczynski, M. K., Wong, A., Wright, A. F., Zhang, Q., LifeLines Cohort Study, Brennan, E. P., Choi, M., Dastani, Z., and Drong, A. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538):197–206. Lush, J. L. (1940). Intra-sire Correlations Or Regressions of Offspring on Dam as a Method of Estimating Heritability of Characteristics. Lynch, M. and Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sinauer Associates Incorporated. Mailman, M. D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., Bagoutdinov, R., Hao, L., Kiang, A., Paschall, J., Phan, L., Popova, N., Pretel, S., Ziyabari, L., Lee, M., 145 REFERENCES Shao, Y., Wang, Z. Y., Sirotkin, K., Ward, M., Kholodov, M., Zbicz, K., Beck, J., Kimelman, M., Shevelev, S., Preuss, D., Yaschenko, E., Graeff, A., Ostell, J., and Sherry, S. T. (2007). The NCBI dbGaP database of genotypes and phenotypes. Nature genetics, 39(10):1181–1186. Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., and Chen, W.-M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics (Oxford, England), 26(22):2867–2873. Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A., Cho, J. H., Guttmacher, A. E., Kong, A., Kruglyak, L., Mardis, E., Rotimi, C. N., Slatkin, M., Valle, D., Whittemore, A. S., Boehnke, M., Clark, A. G., Eichler, E. E., Gibson, G., Haines, J. L., Mackay, T. F. C., McCarroll, S. A., and Visscher, P. M. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265):747–753. McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioan- nidis, J. P. A., and Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews. Genetics, 9(5):356–369. Meigs, J. B., Panhuysen, C. I. M., Myers, R. H., Wilson, P. W. F., and Cupples, L. A. (2002). A genome-wide scan for loci linked to plasma levels of glucose and HbA(1c) in a community-based sample of Caucasian pedigrees: The Framingham Offspring Study. Diabetes, 51(3):833–840. Mitchell, B. D., Kammerer, C. M., Blangero, J., Mahaney, M. C., Rainwater, D. L., Dyke, B., Hixson, J. E., Henkel, R. D., Sharp, R. M., Comuzzie, A. G., VandeBerg, J. L., Stern, M. P., and MacCluer, J. W. (1996). Genetic and environmental con- tributions to cardiovascular risk factors in Mexican Americans. The San Antonio Family Heart Study. Circulation, 94(9):2159–2170. Muggeo, V. M. R. and Lovison, G. (2014). The “Three PlusOne” Likelihood-Based Test Statistics: Unified Geometrical and Graphical Interpretations. The American Statistician, 68(4):302–306. Neale, M. and Cardon, L. (1992). Methodology for genetic studies of twins and families. Springer Science & Business Media. Neyman, J. and Pearson, E. S. (1928). On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Biometrika, 20A. Patterson, H. D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58(3):545–554. 146 REFERENCES Petersen, K. B. and Pedersen, M. S. (2008). The matrix cookbook. Technical Uni- versity of Denmark. Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., and R Core Team (2016). nlme: Linear and Nonlinear Mixed Effects Models. Powell, M. (2009). The BOBYQA algorithm for bound constrained optimization without derivatives. Cambridge NA Report NA2009/06. Pritchard, J. K. (2001). Are rare variants responsible for susceptibility to complex diseases? American journal of human genetics, 69(1):124–137. Pruim, R. J., Welch, R. P., Sanna, S., Teslovich, T. M., Chines, P. S., Gliedt, T. P., Boehnke, M., Abecasis, G. R., and Willer, C. J. (2010). LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics (Oxford, England), 26(18):2336–2337. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., Maller, J., Sklar, P., de Bakker, P. I. W., Daly, M. J., and Sham, P. C. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. American journal of human genetics, 81(3):559–575. RadhakrishnaRao, C.(1948). Largesampletestsofstatisticalhypothesesconcerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(01):50–57. Rao, C. R. (2005). Score Test: Historical Review and Recent Developments. In Advances in Ranking and Selection, Multiple Comparisons, and Reliability, pages 3–20. Birkhäuser Boston, Boston, MA. Rao, D. C. and Morton, N. E. (1974). Path analysis of family resemblance in the presence of gene-environment interaction. American journal of human genetics, 26(6):767–772. Risch, N. and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science (New York, N.Y.), 273(5281):1516–1517. Saunders, A. M., Strittmatter, W. J., Schmechel, D., George-Hyslop, P. H. S., Pericak-Vance, M. A., Joo, S. H., Rosi, B. L., Gusella, J. F., Crapper-MacLachlan, D. R., Alberts, M. J., Hulette, C., Crain, B., Goldgaber, D., and Roses, A. D. (1993). Association of apolipoprotein E allele 4 with late-onset familial and spo- radic Alzheimer’s disease. Neurology, 43(8):1467–1467. Searle, S. R., Casella, G., and MacCulloch, C. E. (1992). Variance components. New York [etc.]: Wiley. 147 REFERENCES Self, S. G. and Liang, K.-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82(398):605–610. Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimiza- tion. Mathematics of computation, 24(111):647–656. Shao, H., Burrage, L. C., Sinasac, D. S., Hill, A. E., Ernest, S. R., O’Brien, W., Courtland, H.-W., Jepsen, K. J., Kirby, A., Kulbokas, E. J., Daly, M. J., Broman, K. W., Lander, E. S., and Nadeau, J. H. (2008). Genetic architecture of complex traits: large phenotypic effects and pervasive epistasis. Proceedings of the National Academy of Sciences of the United States of America, 105(50):19910–19914. Sinha, S. K. (2009). Bootstrap tests for variance components in generalized linear mixed models. Canadian Journal of Statistics, 37(2):219–234. Snijders, Tom AB and Bosker, R. (1999). Multilevel analysis: an introduction to basic and advanced multilevel modeling. SAGE. Stefansson, H., Rujescu, D., Cichon, S., Pietiläinen, O. P. H., Ingason, A., Steinberg, S., Fossdal, R., Sigurdsson, E., Sigmundsson, T., Buizer-Voskamp, J. E., Hansen, T., Jakobsen, K. D., Muglia, P., Francks, C., Matthews, P. M., Gylfason, A., Hall- dorsson, B.V., Gudbjartsson, D., Thorgeirsson, T.E., Sigurdsson, A., Jonasdottir, A., Jonasdottir, A., Bjornsson, A., Mattiasdottir, S., Blondal, T., Haraldsson, M., Magnusdottir, B. B., Giegling, I., Möller, H.-J., Hartmann, A., Shianna, K. V., Ge, D., Need, A. C., Crombie, C., Fraser, G., Walker, N., Lonnqvist, J., Suvisaari, J., Tuulio-Henriksson, A., Paunio, T., Toulopoulou, T., Bramon, E., Di Forti, M., Murray, R., Ruggeri, M., Vassos, E., Tosato, S., Walshe, M., Li, T., Vasilescu, C., Mühleisen, T. W., Wang, A. G., Ullum, H., Djurovic, S., Melle, I., Olesen, J., Kiemeney, L. A., Franke, B., GROUP, Sabatti, C., Freimer, N. B., Gulcher, J. R., Thorsteinsdottir, U., Kong, A., Andreassen, O. A., Ophoff, R. A., Georgi, A., Rietschel, M., Werge, T., Petursson, H., Goldstein, D. B., Nöthen, M. M., Pel- tonen, L., Collier, D. A., St Clair, D., and Stefansson, K. (2008). Large recurrent microdeletions associated with schizophrenia. Nature, 455(7210):232–236. Stram, D. O. and Lee, J. W. (1994). Variance components testing in the longitudinal mixed effects model. Biometrics, pages 1171–1177. Tryka, K. A., Hao, L., Sturcke, A., Jin, Y., Wang, Z. Y., Ziyabari, L., Lee, M., Popova, N., Sharopova, N., Kimura, M., and Feolo, M. (2014). NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Research, 42(Database issue):D975–9. 148 REFERENCES Vazquez, A. I., Bates, D. M., Rosa, G. J. M., Gianola, D., and Weigel, K. A. (2010). Technical note: An R package for fitting generalized linear mixed models in animal breeding1. Journal of Animal Science, 88(2):497–504. Velez Edwards, D. R., Naj, A. C., Monda, K., North, K. E., Neuhouser, M., Magvan- jav, O., Kusimo, I., Vitolins, M. Z., Manson, J. E., O’Sullivan, M. J., Rampersaud, E., and Edwards, T. L. (2013). Gene-environment interactions and obesity traits among postmenopausal African-American and Hispanic women in the Women’s Health Initiative SHARe Study. Human Genetics, 132(3):323–336. Visscher, P. M.(2004). Power oftheClassical Twin DesignRevisited. Twin Research, 7(05):505–512. Visscher, P. M., Brown, M. A., McCarthy, M. I., and Yang, J. (2012). Five Years of GWAS Discovery. The American Journal of Human Genetics, 90(1):7–24. Visscher, P. M., Hill, W. G., and Wray, N. R. (2008). Heritability in the genomics era — concepts and misconceptions. Nature Publishing Group, 9(4):255–266. Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3):426–482. Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678. White, H. (1982). Maximum likelihood estimation of misspecified models. Econo- metrica: Journal of the Econometric Society, pages 1–25. Wray, N. R., Purcell, S. M., and Visscher, P. M. (2011). Synthetic associations created by rare variants do not explain most GWAS results. PLoS biology, 9(1):e1000579. WRIGHT, S. (1920). The Relative Importance of Heredity and Environment in Determining the Piebald Pattern of Guinea-Pigs. Proceedings of the National Academy of Sciences of the United States of America, 6(6):320–332. Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., Goddard, M. E., and Visscher, P. M. (2010). Common SNPs explain a large proportion of the heritability for human height. Nature genetics, 42(7):565–569. 149 REFERENCES Yang, J., Lee, S. H., Goddard, M. E., and Visscher, P. M. (2011a). GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics, 88(1):76–82. Yang, J., Manolio, T. A., Pasquale, L. R., Boerwinkle, E., Caporaso, N., Cunning- ham, J. M., de Andrade, M., Feenstra, B., Feingold, E., Hayes, M. G., Hill, W. G., Landi, M. T., Alonso, A., Lettre, G., Lin, P., Ling, H., Lowe, W., Mathias, R. A., Melbye, M., Pugh, E., Cornelis, M. C., Weir, B. S., Goddard, M. E., and Visscher, P. M. (2011b). Genome partitioning of genetic variation for complex traits using common SNPs. Nature genetics, 43(6):519–525. Zaitlen, N., Kraft, P., Patterson, N., Pasaniuc, B., Bhatia, G., Pollack, S., and Price, A. L. (2013). Using extended genealogy to estimate components of heritability for 23 quantitative and dichotomous traits. PLoS Genetics, 9(5):e1003520. Zeggini, E., Rayner, W., Morris, A. P., Hattersley, A. T., Walker, M., Hitman, G. A., Deloukas, P., Cardon, L. R., and McCarthy, M. I. (2005). An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated data sets. Nature genetics, 37(12):1320–1322. Zhou, X. and Stephens, M. (2014). Efficient multivariate linear mixed model algo- rithms for genome-wide association studies. Nature Methods, 11(4):407–409. Zhu, Z., Bakshi, A., Vinkhuyzen, A. A. E., Hemani, G., Lee, S. H., Nolte, I. M., van Vliet-Ostaptchouk, J. V., Snieder, H., The LifeLines Cohort Study, Esko, T., Milani, L., Mägi, R., Metspalu, A., Hill, W. G., Weir, B. S., Goddard, M. E., Visscher, P. M., and Yang, J. (2015). Dominance Genetic Variation Contributes Little to the Missing Heritability for Human Complex Traits. American journal of human genetics, 96(3):377–385. Zuk, O., Hechter, E., Sunyaev, S. R., and Lander, E. S. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences of the United States of America, 109(4):1193– 1198. 150
Abstract (if available)
Abstract
The 'missing heritability' problem describes the fact that the explained heritability from identified genetic variation associated with phenotypes/diseases is much smaller than the estimated total heritability from traditional population studies. ❧ This dissertation proposed a hypothesis to explain the missing heritability problem. We claimed that the problem may be due to the inflated estimation of the total heritability from population studies with misspecified model. We used theoretical derivation and simulations to illustrate that the heritability would be overestimated when there was common household environment (CHE) and/or its interaction with genetic variation but was not specified in the estimation model. We then proposed a method based on linear mixed-effects model (LMM) to properly account for the CHE and the interaction, which could be used to estimate the heritability without bias. The connection between this method and standard LMM methods was also built. We further proposed hypothesis testing frameworks for model selection. Finally we tested both the estimation and hypothesis testing methods using the Framingham Heart Study data, and found evidence supporting the hypothesis of this dissertation on the source of missing heritability.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Examining the relationship between common genetic variation, type 2 diabetes and prostate cancer risk in the multiethnic cohort
PDF
Robust feature selection with penalized regression in imbalanced high dimensional data
PDF
Adaptive set-based tests for pathway analysis
PDF
Pathway-based analysis of multivariate traits using classical dimensionality reduction methods
PDF
Dietary carcinogens and genetic variation in their metabolism: epidemiological studies on the risk of selected cancers
PDF
Population substructure and its impact on genome-wide association studies with admixed populations
PDF
Genetic variation in the base excision repair pathway, environmental risk factors and colorectal adenoma risk
PDF
Genes and environment in prostate cancer risk and prognosis
PDF
Ancestral/Ethnic variation in the epidemiology and genetic predisposition of early-onset hematologic cancers
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Preprocessing and analysis of DNA methylation microarrays
PDF
Identifying genetic, environmental, and lifestyle determinants of ethnic variation in risk of pancreatic cancer
PDF
The environmental and genetic determinants of cleft lip and palate in the global setting
PDF
Observed and underlying associations in nicotine dependence
PDF
The role of heritability and genetic variation in cancer and cancer survival
PDF
Characterizing the genetic and environmental contributions to ocular and central nervous system health
PDF
Common immune-related factors and risk of non-Hodgkin lymphomy
PDF
Identification and fine-mapping of genetic susceptibility loci for prostate cancer and statistical methodology for multiethnic fine-mapping
PDF
Genetic association studies of age-related macular degeneration from candidate gene to whole genome
PDF
Genetic epidemiological approaches in the study of risk factors for hematologic malignancies
Asset Metadata
Creator
Wang, Nan
(author)
Core Title
Missing heritability may be explained by the common household environment and its interaction with genetic variation
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Statistical Genetics and Genetic Epidemiology
Publication Date
12/07/2018
Defense Date
05/18/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
common household environment,genetic variation,interaction,missing heritability,OAI-PMH Harvest
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Watanabe, Richard (
committee chair
), Grzywacz, Norberto (
committee member
), Lewinger, Juan Pablo (
committee member
), Stram, Daniel (
committee member
)
Creator Email
sailingwave@gmail.com,wangn@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c16-675733
Unique identifier
UC11336361
Identifier
etd-WangNan-4962.pdf (filename),usctheses-c16-675733 (legacy record id)
Legacy Identifier
etd-WangNan-4962.pdf
Dmrecord
675733
Document Type
Dissertation
Rights
Wang, Nan
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
common household environment
genetic variation
interaction
missing heritability