Two-Step Study Designs in Genetic Epidemiology

by Zhao Yang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (Biostatistics)

August 2016

Copyright 2016 Zhao Yang

Dedicated to my parents, Jingyun Yang and Shangyun Han

Acknowledgements

I would like to thank my dissertation committee members, Dr. Duncan Thomas, Dr. David Conti, Dr. Paul Marjoram, Dr. Christopher Haiman and Dr. Kai Wang, for their advice and help in the completion of this dissertation. I am especially grateful to my mentors, Dr. Thomas and Dr. Conti. This dissertation would not have been possible without their knowledge, insights and encouragement throughout my graduate study. Their enthusiasm and guidance have inspired me not only in the field of biostatistics, but also in many aspects of my life and career. I would also like to thank all my friends who offered me their help and support during my graduate study.

Table of Contents

1 Introduction
1.1 A Brief History of Two-Step Studies
1.2 Two-Step Studies and Genetic Epidemiology
2 Two-Phase Sampling Designs for Case-Control Studies with Latent Variable Models to Investigate Effects of Candidate Genes
2.1 Background
2.1.1 The Use of Biomarkers in Genetic Epidemiology
2.1.2 Interactions Between Candidate Genes and Environmental Factors
2.1.3 Non-Traditional Designs for Gene-Environment Interactions
2.2 Formalization of the Problem
2.3 Methods
2.4 Scenarios of Optimization
2.4.1 Optimization for Estimating a Single Parameter
2.4.2 Optimization for Estimating Multiple Parameters
2.4.3 Robustness to Parameter Values
2.5 Results
2.5.1 Optimization
2.5.2 Evaluation
2.5.3 Estimating Multiple Parameters
2.5.4 Sensitivity to Parameter Values
2.6 Discussion
3 Integrated Analysis of Germline, Omic and Disease
3.1 Background
3.2 Methods
3.2.1 Latent Variable Model Formalization
3.2.2 Joint Estimating Method via EM Algorithm
3.2.3 Sparse Solution
3.2.4 Statistical Testing and Integrated Analysis
3.2.5 Implementation of the Integrated Analysis
3.3 Simulation Study
3.3.1 Factors With Impact on Clustering and Estimation Performance
3.3.2 Statistical Testing: Type I Error Rate and Power
3.3.3 Choosing the Number of Clusters
3.3.4 High Dimensional Genetic and Biomarker Data
3.4 Application to WHI (Women's Health Initiative) Data
3.4.1 Data Description and Analysis
3.4.2 Results
3.5 Discussion
4 Two-Phase and Two-Stage Family-Based Designs for Genetic Epidemiological Studies Using Sequencing Data
4.1 Background
4.1.1 Genome-Wide Association Studies (GWAS)
4.1.2 The Post-GWAS Era
4.2 Formalization of the Problem
4.3 Methods
4.3.1 Variant Prioritization
4.3.2 Design Considerations
4.3.3 Simulation Studies
4.4 Results
4.4.1 Comparing Approaches to Selecting Pedigree Members for Sequencing
4.4.2 Comparing Pedigree and Case-Control Samples in Stage I
4.4.3 Comparing Pedigree and Case-Control Samples in Stage II
4.4.4 Sample Size Allocation Between the Two Stages and Prioritization Cut-Off
4.4.5 Using More Seeds in the Iterative Approach for Sequencing Sample Selection
4.4.6 Application to Colorectal Cancer Family Registry (Colon CFR) Data
4.5 Discussion
5 Conclusions and Future Research Directions
5.1 Conclusions
5.2 Future Research Directions
5.2.1 Use of a Semi-Parametric Model in the Study of the Two-Phase Case-Control Design
5.2.2 Robustness of Optimal Sampling Fractions
5.2.3 Incomplete Biomarker Measurements and Ascertainment Correction
5.2.4 Consideration of More Outcome Types in Integrated Analysis
5.2.5 Stochastic Approach to Selecting Pedigree Members for Sequencing
Appendix
Appendix 1. Derivation of the E-step
Appendix 2. Statistical Model for Binary Outcome
Appendix 3. Factors Impacting Clustering and Estimation Performance: Binary Outcome
Appendix 4. Type I Error Rate and Power for Binary Outcome
Appendix 5. EM Algorithm for Incomplete Biomarker Data
Appendix 6. Ascertainment Correction for Case-Control Sample
Appendix 7. Publication: Two-Phase and Family-Based Designs for Next-Generation Sequencing Studies
Appendix 8. Publication: Two-Stage Family-Based Designs for Sequencing Studies
Bibliography

List of Tables

Table 2.1 Pre-specified true parameter values in (2.8)-(2.11).
Table 2.2 Parameter estimates and standard deviations of different designs when planning the study before ascertaining the Phase I sample with a budget of 3000.
Table 2.3 Parameter estimates and standard deviations of different designs when planning the study after ascertaining the Phase I sample with a fixed budget of 2000 for Phase II.
Table 3.1 Simulation scenarios to explore factors with impact on the performance of clustering and estimation.
Table 3.2 Type I error rate when making inference with or without using outcome data.
Table 3.3 Estimating effects of latent clusters on the outcome.
Table 3.4 Summary statistics of biomarkers in WHI data.
Table 3.5 Mean and standard deviation of seven biomarkers after adjusting for covariates and inverse normal transformation.
Table 3.6 Estimated cluster-specific means and standard errors of seven biomarkers using SNPs, biomarkers and the outcome.
Table 3.7 Estimated cluster-specific means and standard errors of seven biomarkers using only SNPs and biomarkers.
Table 3.8 Estimated effects of clusters on the outcome.
Table 4.1 Prioritizing causal and null variants by the ad hoc approach and the simplified iterative approach in three scenarios.
Table 4.2 Characteristics of pedigree members selected for sequencing by the ad hoc approach and the simplified iterative approach in three scenarios.
Table 4.3 Proportion of causal and null variants that are prioritized using pedigree samples or case-control samples in Stage I in three scenarios.
Table 4.4 Proportion of prioritized causal and null variants that are significant after Bonferroni correction in Stage II using pedigree samples or case-control samples in three scenarios.
Table 4.5 Comparison of the iterative approach using the top 1 vs. 5 seeds.
Table 4.6 Characteristics of Colon CFR pedigrees analyzed.

List of Figures

Figure 2.1 From (Gilliland et al., 1999): the pathway model that used latent processes to relate ambient air pollutants and the response of children's respiratory system.
Figure 2.2 Formalization of the latent variable model.
Figure 2.3 The latent variable model with ascertainment of the case-control sample.
Figure 2.4 Two-phase sampling designs maximizing ARCE without budget constraint (a), minimizing variance with budget constraint (b), and maximizing power with budget constraint (c) for estimating the interaction of G and E when planning the study before ascertaining the Phase I sample using the retrospective likelihood approach.
Figure 2.5 Two-phase sampling designs maximizing ARCE without budget constraint (a), minimizing variance with budget constraint (b), and maximizing power with budget constraint (c) for estimating the effect of the latent variable on disease when planning the study before ascertaining the Phase I sample using the retrospective likelihood approach.
Figure 2.6 Two-phase sampling designs maximizing ARCE without budget constraint (a), minimizing variance with budget constraint (b), and maximizing power with budget constraint (c) for estimating the interaction of G and E when planning the study after ascertaining the Phase I sample using the prospective likelihood approach.
Figure 2.7 Two-phase sampling designs maximizing ARCE without budget constraint (a), minimizing variance with budget constraint (b), and maximizing power with budget constraint (c) for estimating the effect of the latent variable on disease when planning the study after ascertaining the Phase I sample using the prospective likelihood approach.
Figure 2.8 Two-phase sampling designs for estimating the main effects and interaction of G and E when planning the study before ascertaining the Phase I sample using the retrospective likelihood approach.
Figure 2.9 Two-phase sampling designs for estimating the main effects and interaction of G and E when planning the study after ascertaining the Phase I sample using the prospective likelihood approach.
Figure 2.10 Given a range of pre-specified values of the interaction between G and E, optimal Phase II sampling fractions computed by maximizing ARCE (bars), expected ARCE (red dots) and minimum ARCE (blue dots) for estimating this parameter when planning the study before ascertaining the Phase I sample using the retrospective likelihood approach.
Figure 2.11 Given a range of pre-specified values of the effect of the latent variable on Y, optimal Phase II sampling fractions computed by maximizing ARCE (bars), expected ARCE (red dots) and minimum ARCE (blue dots) for estimating this parameter when planning the study before ascertaining the Phase I sample using the retrospective likelihood approach.
Figure 3.1 Joint model integrating germline genomic data, biomarker measurements and the outcome variable.
Figure 3.2 Impact of biomarker mean effects on estimating genetic effects.
Figure 3.3 Impact of structured cluster-specific covariance matrices of biomarkers on estimating genetic effects.
Figure 3.4 BIC of the fitted model using different pre-assumed numbers of underlying clusters.
Figure 3.5 Power of detecting the genetic effect when there are 5 causal SNPs / 5 null SNPs and 2 informative / 2 non-informative biomarkers (left), and when there are 5 causal / 995 null SNPs and 2 informative / 98 non-informative biomarkers (right).
Figure 3.6 Scatter plot of the first and second principal components colored by estimated clusters.
Figure 4.1 Three different simulation scenarios: causal variants have smaller MAFs but larger effects (red); moderate MAFs and moderate effects (green); larger MAFs but smaller effects (blue).
Figure 4.2 Ratio of the number of true causal variants prioritized to the number of null variants prioritized by the ad hoc approach and the simplified iterative approach.
Figure 4.3 Distribution of the number of members sequenced in one pedigree by the simplified iterative approach in scenarios where causal variants have smaller MAFs but larger effects (green); moderate MAFs and moderate effects (red); larger MAFs but smaller effects (blue).
Figure 4.4 Ratio of the number of true causal variants prioritized to the number of null variants prioritized by the pedigree sample and the case-control sample in Stage I.
Figure 4.5 Overall power of detecting causal variants for different combinations of sample size allocation and cut-off for prioritization in Stage I.
Figure 4.6 Proportion of the top 100 prioritized variants being prioritized by randomly selecting 3 members for sequencing per pedigree.
Figure A.1 Impact of biomarker mean effects on estimating genetic effects when the outcome is binary.
Figure A.2 Impact of structured cluster-specific covariance matrices of biomarkers on estimating genetic effects when the outcome is binary.

Abstract

Two-step study designs, including both two-stage and two-phase designs, have been widely used in genetic epidemiology. In this dissertation, after reviewing the development of two-step study designs and their application in various settings, I focused on applying the two-step idea to several problems that are frequently encountered in research in this area.

In the study of complex biological pathways, latent variables are often used to model the underlying mechanism and to analyze the relationships among the factors involved, and biomarkers can serve as measurements of these latent variables. Because of the high cost, it is often infeasible to measure biomarkers on the whole sample. We developed models involving latent variables to investigate the effects of candidate genes, both main effects and interactions with environmental factors, and proposed two-phase sampling designs for case-control studies to increase efficiency. We discussed two approaches to statistical modeling, a retrospective likelihood approach and a prospective likelihood approach. For situations in which the study is planned before or after ascertaining the Phase I sample, we computed optimal designs in scenarios that estimate single or multiple parameters, with or without a fixed budget constraint. In addition, using simulations, we compared the relative performance of the optimal designs with that of other designs.

Technological advances have provided access to various omic data, such as metabolite, expression and somatic profiles, which potentially enable us to gain new insights into the underlying etiologic mechanism of disease. These data are often collected on a subset of a sample, and thus can be viewed as the second step of a study. Analyzing such data involves many challenges, including effect heterogeneity, high dimensionality and incomplete data. We proposed a novel approach for the integrated analysis of germline,
We used a latent variable to relate information from germline genetic data to either a continuous or binary disease outcome, and viewed the omic data as a flawed measure of underlying latent clusters, categorized to simplify interpretation. We used an expectation-maximization algorithm to simultaneously estimate the unobserved latent clusters and model parameters, including genetic effects on the latent cluster and the impact of the cluster on omic patterns and on the disease outcome. Additionally, we incorporated penalized methods for variable selection in a high dimensional setting for both the genetic data and the omic data. Using simulations, we demonstrated the ability of our approach to accurately estimate underlying clusters and their corresponding genetic, omic and disease effects. Moreover, we demonstrated the feasibility of the variable selection to identify genetic and omic factors as both the means and correlational structures were varied. As an example, we applied our approach to a data set from the Women’s Health Initiative. It has been learned that findings of genome-wide association studies can only explain a small proportion of heritability. Research on rare variants has become one of the directions to find the ‘missing heritability’. In spite of the advance of DNA sequencing technology, it is still too expensive to sequence the entire sample in large-scale epidemiological studies. To improve efficiency, we proposed a two-phase and two-stage design using sequencing data to study the association of rare variants with diseases. This design sequences only a subset of Stage I sample and then use Stage I data to prioritize rare variants for subsequent association test in Stage II sample, adjusting for multiple comparison only for prioritized variants. We proposed a score test criterion to prioritize rare variants in Stage I using pedigree data, and an iterative algorithm to select pedigree members for sequencing using existing information from genome-wide association studies. Using simulations, we evaluated the performance of using pedigree sample xiii or case-control sample in Stage I and II, the performance of the iterative algorithm, and other design parameters. With real sequencing data from Colon Cancer Family Registry, we evaluated various Stage I designs by looking at how their results compared with the one that used all available sequencing data. 1 1 Introduction 1.1 A Brief History of Two-Step Studies In epidemiological studies, a major goal is to investigate the relationship between variables of interest, generally called ‘exposures’, and health outcomes. In a genetic study, ‘exposures’ might include genes or interactions. The ideal design for this purpose is a randomized trial, in which investigators randomly assign exposures to study participants, observe and compare their health outcomes. In this way, the distortions to the estimate of their relationship caused by variables other than the variables of interest, or confounders, can be eliminated because the distributions of the confounders in subjects at different exposure levels are assumed equivalent by randomization. Despite the advantage in estimation and inference, randomized trials cannot always be employed in practice. For example, it would not be ethical to assign potentially harmful exposures, such as air pollutants, to study participants, and of course not possible for genes. 
Instead of assigning exposures, it is more appropriate to analyze the exposure status selected by the study participants themselves. Taking a prospective view, investigators can measure the exposure status of a group of people at the beginning of the study and measure their outcomes after following them for a certain time period. Then, after accounting for potential confounders, a similar analysis can be done as if the data were from a randomized trial. This design is often referred to as the cohort design. An investigation of the health effects of cigarette smoking as well as other exposures was one of the early examples of this kind (Doll & Hill, 1956; Doll & Peto, 1976).

The nature of the cohort design raises several shortcomings when it is applied in epidemiological studies. In cohort studies, healthy participants are enrolled and then followed for their outcomes. Many health outcomes of interest, such as chronic diseases, have a very long latency period following the onset of risk factors, which means investigators have to follow the participants for a very long period of time. Moreover, for diseases that are rare in the general population, a very large number of individuals must be followed to attain an adequate sample size. Either of these issues can make the cohort design impractical. Furthermore, when the exposure of interest comprises a diverse range of variables, it is hard to select study participants based on their exposure status in a cohort study.

To overcome these disadvantages of the cohort design, an alternative that takes a retrospective view is the case-control design. In the case-control design, participants are selected based on their outcomes: investigators only need to identify cases of disease and select controls as a random sample from the disease-free population. Historical information on both cases and controls is then collected and analyzed. One of the early examples of the case-control design also concerns the relationship between cigarette smoking and lung cancer (Doll & Hill, 1950). Compared to the cohort design, the case-control design requires a much smaller number of controls to attain statistical power, which creates possibilities for improved data quality and lower cost (Ury, 1975). Furthermore, it has been shown that methods for analyzing data from a cohort study can be employed to analyze data from a case-control study without biasing the major parameters of interest (Prentice & Pyke, 1979).

Despite these benefits, the case-control design can be subject to various biases. For example, selection bias can occur when hospital-based controls are used, which may lead to underestimating the association between a risk factor (e.g. smoking) and the disease, because the prevalence of the risk factor in hospital-based controls may be higher than in unaffected individuals in the general population. Since the exposure history for cases is collected after disease onset, information bias such as recall bias occurs when study participants' memory of their previous exposures is distorted by their disease status. Furthermore, reverse causation occurs when the disease of interest or its treatment influences some exposures. Another possible bias is uncontrolled confounding, which arises when there are unknown or unobservable confounders that are associated with both the exposures and the disease status and can distort the estimated association between them.
A hybrid of the cohort and case-control designs can be used to take advantage of both. Such designs include the 'synthetic retrospective design' (Mantel, 1973), the 'hybrid retrospective design' (Kupper, McMichael, & Spirtas, 1975), the 'case-base' design (Miettinen, 1982), the 'case-cohort' design (Prentice, 1986) and the 'nested case-control' design (Liddell, McDonald, & Thomas, 1977). In the nested case-control design, a subset of participants in a cohort is selected based on their disease outcomes, and their exposure information is compared with that of the cases from the cohort study. In this way, the exposure information is collected on only part of the entire cohort, potentially reducing cost and increasing data quality, and it is not affected by the disease outcome.

In the sampling process of a nested case-control study, cases are first selected from the cohort, and then a certain number of controls are randomly selected from each case's risk set, the set of cohort members still at risk at that time. Within the risk set, the case and controls are often matched on some pre-defined covariates, such as sex and race; this matching is the main difference between the nested case-control design and the case-cohort design, which is unmatched. It has been shown that although testing associations between single exposures and disease with a relatively small number of controls (usually 3 or 4) sampled in this way is nearly as efficient as using the entire cohort (Breslow & Patton, 1979), many more controls may be required to achieve adequate efficiency for more complex analyses (Breslow, Lubin, Marek, & Langholz, 1983).

To improve efficiency in this situation, sophisticated sampling designs have been proposed. One special design, proposed by Langholz and Goldstein, is called counter-matching (Langholz & Goldstein, 1996, 2001). As the name indicates, the sampling strategy differs from traditional matching in that the controls in a risk set are selected so as to increase the variability of the exposure variables of primary interest. An extreme example is to sample a control whose surrogate exposure status is opposite to the case's for one-to-one matching. If the surrogate variables contain information about the exposures of interest, it is intuitive that the counter-matching strategy improves efficiency by producing a more informative sample, because each case-control set is discordant for the exposures of interest.

The nested case-control design described above can be viewed as a two-step process. In the first step, a cohort is formed and preliminary exposures are measured for all participants. In the second step, after the outcome of interest is observed, cases and a subset of controls are sampled from the cohort and additional information on exposures, confounders or modifiers is collected for those selected. This class of study designs has drawn great interest because of its ability to reduce cost while retaining relatively high efficiency.

There are two types of two-step designs: the two-stage design and the two-phase design. In the two-stage design, the second stage is often viewed as a replication of the findings of the first stage, and the two stages use independent samples. In contrast, in the two-phase design, a carefully selected subset of the first-phase sample is used as the sample in the second phase, on which additional, usually expensive, information is obtained.
The sampling strategy can be based on variables measured in the first phase (White, 1982), and many statistical methods and design strategies have been developed.

Robins, Rotnitzky and Zhao considered the conditional mean model, treating the exposures collected at both phases as regressors (Robins, Rotnitzky, & Zhao, 1994). They assumed that the information on exposures collected at the second phase is missing at random, either inadvertently or by design, and that the missingness probabilities are either known or can be parametrically modeled. They used the inverse of the probability of being selected for the second phase to weight each individual in the estimating equations, and proposed a class of semiparametric estimators that are consistent for the parameters of the conditional mean model.

Reilly and Pepe considered the situation in which some variables of interest are missing because of the sampling for the second phase and additional variables, called auxiliary variables, are available for all study participants (Reilly & Pepe, 1995). They used regression models that included only the variables of interest as regressors. While the auxiliary variables were excluded from the regression models, they are thought to be informative about the true values of the missing variables of interest. In their method, the auxiliary variables and the outcome variable were used to form strata from which study participants were randomly sampled for the second phase using stratum-specific sampling probabilities. Individuals in each stratum were then weighted by the inverse of the sampling probability, and a mean score estimating equation based on a weighted regression was used to estimate the parameters of interest. They also showed that the asymptotic variance of the mean score estimator depends on the sampling probabilities. Based on this observation, Reilly derived optimal two-phase study designs under different constraints that might be encountered in practice, such as maximizing precision for a fixed total budget or minimizing the total cost for a desired precision level (Reilly, 1996).

Breslow and Holubkov considered using logistic regression to fit two-phase data with a binary response variable (Breslow & Holubkov, 1997b). They described computational algorithms that permit efficient estimation of regression coefficients using three methods: weighted likelihood, pseudo-likelihood and maximum likelihood. After being stratified based on the binary response variable and covariates available for everyone, study participants were randomly sampled based on stratum-specific probabilities for additional data collection in the second phase. The weighted likelihood estimate is obtained by fitting the logistic model to the subjects in the second phase, applying the inverses of their sampling probabilities as weights. The pseudo-likelihood estimate is based on the conditional probability of having the binary response given the stratum or the variables in the second phase, following methods by Breslow and Cain, and Schill et al (Breslow & Cain, 1988; Schill, Jockel, Drescher, & Timm, 1993). The maximum likelihood estimator is based on the full likelihood of both phases, including the sampling probabilities. These methods were compared by application to data from the National Wilms Tumor Study Group (Breslow & Chatterjee, 1999).
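To make the weighted likelihood idea concrete, the minimal sketch below simulates a two-phase study and fits an inverse-probability-weighted logistic regression by Newton-Raphson. The cohort size, strata and sampling fractions are all hypothetical, and the sketch is only a simplified illustration of the estimators discussed above, not any of the authors' exact implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Phase I: outcome y and a cheap surrogate xs are known for everyone;
# the true exposure x is expensive and measured only in Phase II.
x = rng.binomial(1, 0.3, n)                        # true exposure
xs = np.where(rng.random(n) < 0.8, x, 1 - x)       # error-prone surrogate
y = rng.binomial(1, 1 / (1 + np.exp(2.0 - 1.0 * x)))

# Phase II: Bernoulli sampling within the four (y, xs) strata,
# oversampling the rarer, more informative cells (hypothetical fractions).
pi = np.array([[0.02, 0.10],                       # rows y=0,1; cols xs=0,1
               [0.50, 0.50]])
p_sel = pi[y, xs]
sel = rng.random(n) < p_sel

def ipw_logistic(X, y, w, iters=30):
    """Newton-Raphson solver for the inverse-probability-weighted logistic score."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ b))
        info = (X * (w * mu * (1 - mu))[:, None]).T @ X
        b += np.linalg.solve(info, X.T @ (w * (y - mu)))
    return b

X2 = np.column_stack([np.ones(sel.sum()), x[sel]])
print(ipw_logistic(X2, y[sel], 1 / p_sel[sel]))    # roughly (-2.0, 1.0)
```

Weighting each Phase II subject by the inverse of its stratum sampling probability reweights the subsample back toward the Phase I population, which is why the estimates recover the cohort-level coefficients despite the heavily unbalanced sampling.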
Subsequently, Breslow et al showed that when the inverse probability weighted estimating equation is used to fit two-phase data, precision can be improved by calibrating the weights such that the weighted totals of certain auxiliary variables match the corresponding totals in the original data (Breslow, Lumley, Ballantyne, Chambless, & Kulich, 2009a).

1.2 Two-Step Studies and Genetic Epidemiology

In genetic epidemiology, two-stage studies and two-phase studies have been widely used. In two genetic epidemiological studies by Whittemore and Halpern (Whittemore & Halpern, 1997), designs involving two or more phases were used. Schaid et al used two-phase designs in DNA resequencing studies of Genome-Wide Association Study (GWAS) signals, treating the GWAS as the first phase and the resequencing study as the second phase (Schaid, Jenkins, Ingle, & Weinshilboum, 2013). As for two-stage designs, Thomas et al reviewed the methodological issues in multistage GWAS (D. C. Thomas et al., 2009), in which part of the sample is genotyped using a commercial high-density panel in the first stage, only the most promising SNPs are genotyped using a customized panel on the remainder of the sample in the second stage, and the two stages are analyzed jointly to gain statistical power. Another example they mentioned is DNA pooling, in which the first stage forms small pools of cases and controls and selects SNPs based on the difference in their minor allele frequencies (MAFs), and the second stage genotypes individuals to retest the selected SNPs.

Although there have been many applications of two-step study designs in genetic epidemiology, opportunities and challenges still exist in specific problems. In this dissertation, we will apply the two-step idea to several problems frequently encountered in genetic epidemiological research. The dissertation is organized as follows. Chapter 2 discusses two-phase sampling strategies for case-control studies using latent variable models to study genetic effects on diseases, specifically using biomarkers to study gene-environment interactions. Chapter 3 discusses an integrated analysis using latent clusters and germline, omic and disease data. Chapter 4 discusses two-phase and two-stage strategies for sequencing studies using familial data. Chapter 5 concludes the dissertation and discusses future research directions.

2 Two-Phase Sampling Designs for Case-Control Studies with Latent Variable Models to Investigate Effects of Candidate Genes

2.1 Background

2.1.1 The Use of Biomarkers in Genetic Epidemiology

In an editorial for Cancer Epidemiology, Biomarkers & Prevention, Thomas discussed the complexity of biological pathways and the need to develop study designs and statistical analyses that can incorporate information from multiple disciplines to study the relationships among the factors involved (D. C. Thomas, 2005). One of the challenges he mentioned was that pathway models may not be robust to modeling assumptions, or may not even be identifiable, when information about the intermediate metabolites or kinetic variables is not available. To tackle this problem, he proposed the use of biomarkers as a source of additional data to inform the analysis. One example of a complex biological pathway is the mechanism of the response of children's respiratory system to ambient air pollutants.
To explore the chronic effects of air pollution on children's respiratory health, the Southern California Children's Health Study (CHS) was initiated in 1993 as a prospective study of a cohort of school-aged children recruited from 12 southern California communities. Analysis of the CHS data has found associations between several air pollutants and factors representing children's respiratory health. Deficits in forced expiratory volume in one second (FEV1) were found by linear regression to be statistically significantly associated with exposure to nitrogen dioxide, acid vapor, particulate matter with an aerodynamic diameter of less than 2.5 μm (PM2.5), and elemental carbon, after adjusting for several potential confounders and effect modifiers (Gauderman et al., 2004). Childhood asthma was also found to be associated with traffic-related pollution such as nitrogen dioxide and residential distance to a freeway (Gauderman et al., 2005).

For the response of children's respiratory system to ambient air pollutants, investigators do not have a well-established understanding of the underlying biological systems and thus cannot specify a detailed model based on them. Gilliland et al proposed a pathway model that relates exposures and health outcomes via a latent process of oxidative stress and inflammation, as shown in Figure 2.1 (Gilliland, McConnell, Peters, & Gong, 1999). In this model, both environmental exposures (for example, ozone, nitrogen dioxide, PM2.5, etc.) and genetic susceptibility (for example, genotypes for antioxidant enzymes) have effects on the latent process of oxidative stress and inflammation, which then causes subsequent health outcomes in the respiratory system, both acute and chronic, such as asthma and changes in lung function growth.

Figure 2.1 From (Gilliland et al., 1999): the pathway model that used latent processes to relate ambient air pollutants and the response of children's respiratory system.

Latent processes like those described by Gilliland et al cannot be measured directly. Therefore, to obtain information on the latent processes of interest, investigators use biomarkers in their studies. For example, the fractional exhaled nitric oxide (FeNO) concentration can reflect airway inflammation, and its measurement is not invasive, so it has been widely used as a biomarker in studies of children (Baraldi, de Jongste, & European Respiratory Society/American Thoracic Society Task Force, 2002; Mattes et al., 1999). It has been used to characterize the association between residential traffic-related pollution exposures and airway inflammation in children with asthma in the CHS (Eckel et al., 2011). Further studies of FeNO as a biomarker showed that it depends on the rate of air flow, with high flow providing more information about distal/alveolar sources and low flow providing more information about proximal/airway sources. Thus, a single high-flow FeNO measurement is not a perfect proxy for the distal nitric oxide (NO) concentration, due to distortion by the maximum airway flux, making it necessary to measure FeNO at multiple flow rates and use statistical models to estimate the alveolar NO concentration (Eckel & Salam, 2013). This so-called 'extended NO analysis', i.e. measurement of FeNO at several standardized controlled flow rates, has been used to relate bronchial flux and peripheral NO concentration estimates to respiratory health status in the CHS (Linn et al., 2013).
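As a concrete illustration of how multi-flow measurements can separate the two NO sources, the sketch below uses the common linear two-compartment approximation, in which NO output (FeNO times flow) is roughly linear in flow at moderate-to-high flows, with slope equal to the alveolar concentration and intercept equal to the airway wall flux. The flow rates, parameter values and noise level here are hypothetical, and this is only one simple version of the statistical models referenced above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical FeNO readings (ppb) at several standardized exhalation flows (mL/s),
# generated from the two-compartment approximation FeNO ~ C_alv + J_aw / flow.
flows = np.array([50.0, 100.0, 200.0, 300.0])
c_alv, j_aw = 2.0, 800.0                      # alveolar ppb; airway flux, ppb*mL/s
feno = c_alv + j_aw / flows + rng.normal(0, 0.3, flows.size)

# NO output = FeNO * flow ~ J_aw + C_alv * flow, so a straight-line fit of
# output on flow recovers both parameters from the multi-flow maneuver.
output = feno * flows
slope, intercept = np.polyfit(flows, output, 1)
print(f"estimated alveolar NO: {slope:.2f} ppb; airway flux: {intercept:.0f} ppb*mL/s")
```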
Indeed, as discussed in an editorial in the Journal of Breath Research by Eckel, Baumbach and Hauschild, breath research is evolving rapidly as more advanced technology produces more complex data from air exhaled by humans. The complexity of the data requires more sophisticated statistical methods of analysis in order to improve investigators' ability to translate the huge amount of biomarker data into clinical use (Eckel, Baumbach, & Hauschild, 2014). In another review, Vineis and Chadeau-Hyam thoroughly discussed biomarker analysis at different stages in longitudinal frameworks. They pointed out that it offers great potential for analyzing the roles of different biomarkers in the whole process and helps refine understanding of the causal pathway between exposures and disease risk (Vineis & Chadeau-Hyam, 2011).

Despite the rapid development of the use of biomarkers in epidemiological studies, there are still problems to be addressed. To begin with, if biomarker measurements are taken after the onset of the outcome, as in a case-control design, it is possible that the measured values are influenced by the outcome or its treatment, namely the 'reverse causation' problem. To avoid this problem, cohort, case-cohort or nested case-control designs are more appropriate, because the biomarker measurements or samples can be obtained while all subjects are free of the outcome. Secondly, in many situations it is not feasible to obtain biomarker measurements on every enrolled subject. One constraint is the additional cost of the biomarker measurement, which can exceed the total budget of the study. Another constraint is the time needed for the measurement, which can make measuring everybody take too long. In these situations, two-phase designs that perform biomarker measurements on only a subset of the subjects are favored because they decrease the total amount of money or time needed (D. C. Thomas, 2007). For example, in a two-phase case-control design, a standard case-control sample is ascertained in the first phase. Then, in the second phase, a stratified random sample is selected from the first phase for measurement of additional variables of interest, based on pre-specified probabilities that may depend on both exposure and disease. Finally, data from both phases are analyzed jointly, increasing efficiency by combining information from both phases.

2.1.2 Interactions Between Candidate Genes and Environmental Factors

When discussing the term 'interaction', it should be noted that its meaning depends on the context and the model assumed. According to Yang and Khoury, there are two major definitions of interaction, statistical and biological. In the biological view, the interaction between two or more factors is their joint effect in the same causal mechanism of disease development. It occurs when the effect of one factor on the disease is modified by other factors, which can be estimated statistically by the departure from some null main-effects model, such as an additive or multiplicative model. For example, it can be estimated in regression by the coefficient of the product term of the risk factors under the assumption of a multiplicative model (Q. Yang & Khoury, 1997). There are also statistical models for additive interactions, which use risk differences to evaluate interactions. Besides these two major definitions, there are other types of interactions, such as quantitative interaction, qualitative interaction and public health synergy, reviewed by Thomas.
Based on his review, quantitative interaction occurs when the effects of one factor are in the same direction but differ in magnitude at different levels of other factors; qualitative interaction refers to situations in which the effects go in opposite directions, the effect exists at only one level of the other factor, or the effect exists only in the presence of both factors; public health synergy means that the disease burden attributable to two or more risk factors is greater than the sum of the excess risks from each of them separately (D. Thomas, 2010a, 2010b).

Gene-environment interactions have been playing an important role in genetic epidemiology. Interest in gene-environment interactions is driven by their great potential for public health. According to the review by Thomas in 2010, the possible applications of gene-environment interactions include gaining insight into biological mechanisms, identifying novel genes acting through interactions, understanding heterogeneity of study results caused by differences in exposure distributions, identifying gene-specific environmental risk factors, dissecting the effects of complex mixtures into components metabolized by different genes, informing environmental regulation, predicting individual disease-related outcomes based on modifiable environmental factors, and choosing optimal personalized treatments based on genetic information (D. Thomas, 2010a). In the context of air pollution and children's respiratory health, considering the model in Figure 2.1, exploration of the possible interactions between the environmental and genetic exposures can provide investigators with insights into the mechanism of the biological pathway and a basis for regulating the emission of air pollutants.

Traditional designs such as the case-control design can be used to investigate gene-environment interactions. However, it has been shown that the power for detecting interaction effects can be much lower, or the sample size needed much larger, than that for detecting main effects, unless the population prevalences of the environmental and genetic factors are high enough or the magnitude of the interaction effect is large, which is often not the case in genetic epidemiology. Indeed, Smith and Day showed that the sample size for detecting an interaction effect should be at least 4 times the sample size needed for detecting a main effect of the same magnitude in a 1:1 unmatched case-control study (Smith & Day, 1984).

2.1.3 Non-Traditional Designs for Gene-Environment Interactions

The limitations of applying traditional designs in studies of interaction effects motivate the development of special designs that are especially useful in this situation. The following non-traditional designs each have advantages over the traditional case-control study.

The Case-Only Design

The case-only design is an early development (Begg & Zhang, 1994; Piegorsch, Weinberg, & Taylor, 1994). As the name indicates, only subjects with the disease are needed in case-only studies. For example, consider the interaction of a binary environmental factor ($E = 0,1$) and a binary genotype ($G = 0,1$), and let $\gamma$ denote the interaction effect between $E$ and $G$. Assume a multiplicative interaction effect and use the logit model.
Under the further assumptions that the gene and environmental factor are independent in the underlying population and that the disease is rare, Piegorsch, Weinberg and Taylor showed that the interaction effect can be well approximated by the log odds ratio of the environmental factor across genotypes among cases ($D$), i.e.

$$\gamma = \ln\frac{\Pr(E=1 \mid G=1, D)\,\Pr(E=0 \mid G=0, D)}{\Pr(E=1 \mid G=0, D)\,\Pr(E=0 \mid G=1, D)} \qquad (2.1)$$

They further showed that, under the assumptions of a rare disease and gene-environment independence, case-only studies have better precision, and therefore require a smaller sample size, for estimating gene-environment interactions than traditional case-control studies, because there is no need to estimate the log odds ratio between the gene and the environmental factor in subjects without disease.

Despite the large saving in sample size it offers, the case-only design has some shortcomings. The first is that only the interaction effect can be estimated in case-only studies; the main effects of the gene and environmental factors cannot. Second, the assumption of gene-environment independence needed in case-only studies is often too strong and hard to test.

A good strategy for avoiding overly strong assumptions is to let the data make the decision. Li and Conti used Bayes model averaging (BMA) to combine the case-control design and the case-only design (D. Li & Conti, 2009). In their paper, let $M_1$ denote the case-control model, $M_2$ denote the case-only model and $\sigma$ denote the parameter for the interaction effect. The posterior distribution of $\sigma$ can then be written as

$$\Pr(\sigma \mid Data) = \sum_{k=1}^{2} \Pr(\sigma \mid Data, M_k)\,\Pr(M_k \mid Data) \qquad (2.2)$$

and, letting $\theta_k$ denote all parameters of model $M_k$,

$$\Pr(M_k \mid Data) \propto \int \Pr(Data \mid \theta_k, M_k)\,\Pr(\theta_k \mid M_k)\,d\theta_k \cdot \Pr(M_k) \qquad (2.3)$$

In this way, investigators have the flexibility to use their prior beliefs about gene-environment independence to assign prior probabilities to the two models and obtain the posterior distribution of the interaction parameter. Using simulations, Li and Conti showed that when there is no prior information on gene-environment independence, the BMA approach is more powerful than the traditional case-control study, and when gene-environment independence does not hold, the BMA approach reduces the bias substantially.

Another way to take advantage of the statistical efficiency of case-only designs while controlling the Type I error rate is to combine the estimators from case-only and case-control designs using appropriate weights. Mukherjee et al developed an empirical-Bayes method that combines the case-control and case-only estimators depending on the sample size and the strength of the gene-environment association in the data (Mukherjee, Ahn, Gruber, & Chatterjee, 2012; Mukherjee et al., 2008; Mukherjee & Chatterjee, 2008). The empirical-Bayes estimator they proposed takes the form

$$\hat{\beta}_{EB} = \frac{\hat{\theta}^2}{\hat{\theta}^2 + \hat{\sigma}^2_{CC}}\,\hat{\beta}_{CC} + \frac{\hat{\sigma}^2_{CC}}{\hat{\theta}^2 + \hat{\sigma}^2_{CC}}\,\hat{\beta}_{CO} \qquad (2.4)$$

Here $\hat{\theta}$ is the estimated association between the genetic and environmental factors, and $(\hat{\beta}, \hat{\sigma}^2)$ stand for the estimated interaction effects and the corresponding variance estimates, with the subscripts $EB$, $CC$ and $CO$ standing for empirical-Bayes, case-control and case-only, respectively. By taking a weighted sum of the case-only and case-control estimates, this estimator achieves a trade-off between bias and efficiency. While the case-only estimate helps improve efficiency, the weight on the case-control estimator increases when there is an association between the genetic and environmental factors, thus reducing the overall bias.
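The following minimal sketch computes the case-only, case-control and empirical-Bayes interaction estimates from a hypothetical 2x2 table of (G, E) counts among cases and controls. The counts, and the use of the control-sample G-E log odds ratio for the estimate of θ, are illustrative assumptions rather than the published implementation.

```python
import numpy as np

# Hypothetical 2x2 counts of (G, E): rows G=0,1; columns E=0,1.
cases = np.array([[400.0, 150.0],
                  [120.0,  90.0]])
controls = np.array([[500.0, 160.0],
                     [130.0,  45.0]])

log_or = lambda t: np.log(t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0]))
var_log_or = lambda t: (1.0 / t).sum()          # Woolf variance of a 2x2 log OR

b_co = log_or(cases)                            # case-only estimate, eq. (2.1)
b_cc = log_or(cases) - log_or(controls)         # case-control interaction log OR
v_cc = var_log_or(cases) + var_log_or(controls)
theta = log_or(controls)                        # G-E association in controls

w = theta**2 / (theta**2 + v_cc)                # weight on the case-control estimate
b_eb = w * b_cc + (1 - w) * b_co                # empirical-Bayes combination, eq. (2.4)
print(f"case-only {b_co:.3f}, case-control {b_cc:.3f}, EB {b_eb:.3f}")
```

With a weak G-E association in the controls, most of the weight falls on the efficient case-only estimate; as the association strengthens, the weight shifts toward the unbiased case-control estimate, which is the bias-efficiency trade-off described above.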
The Case-Parent Trio Design

In a case-parent trio study, subjects with the disease (cases) and their parents are genotyped, and the environmental exposures of the cases are evaluated. Gene-environment interactions can be estimated based on the idea that an interaction is suggested when the frequency of transmission of the gene from parents to their affected offspring differs between exposed and unexposed cases (Schaid, 1999). In his paper, Schaid presented two statistical methods for testing this effect: the likelihood method and the transmission disequilibrium test (TDT) method.

Assuming the cases are independent, the likelihood method uses the likelihood ratio below to test the null hypothesis that the genotype relative risks $r_1$ and $r_2$ (for carrying one or two copies of the risk allele) do not differ across the levels of the environmental factor:

$$LR = 2\left[\ln L_E(\hat{r}_{1E}, \hat{r}_{2E}) + \ln L_{\bar{E}}(\hat{r}_{1\bar{E}}, \hat{r}_{2\bar{E}}) - \ln L(\hat{r}_1, \hat{r}_2)\right] \qquad (2.5)$$

where $L_E$ and $L_{\bar{E}}$ denote the likelihoods for exposed and unexposed cases. Asymptotically, this statistic follows a chi-square distribution with 2 degrees of freedom. Furthermore, the degrees of freedom can be reduced to 1 by assuming a specific genetic model, such as a multiplicative model ($r_1 = r$, $r_2 = r^2$), an additive model ($r_1 = r$, $r_2 = 2r - 1$), a dominant model ($r_1 = r_2 = r$) or a recessive model ($r_1 = 1$, $r_2 = r$).

The TDT method first evaluates the probability (denoted by $\pi$) that a heterozygous parent transmits a particular allele to an affected child, and then tests whether this probability differs between exposed and unexposed cases using the following statistic, which asymptotically follows a standard normal distribution:

$$z = \frac{\hat{\pi}_E - \hat{\pi}_{\bar{E}}}{\sqrt{\hat{\pi}(1-\hat{\pi})\left[\frac{1}{n_E} + \frac{1}{n_{\bar{E}}}\right]}} \qquad (2.6)$$

where $n_E$ and $n_{\bar{E}}$ are the numbers of transmissions from heterozygous parents to exposed and unexposed cases, and $\hat{\pi}$ is the pooled transmission probability.

Comparing this approach with the traditional case-control design, Schaid found that the case-parent design can be more efficient for detecting gene-environment interaction in some situations, especially when the gene is rare and the environmental risk factor has a large effect in the absence of the genotype (Schaid, 1999).

However, despite its advantages, the case-parent design still relies on the assumption of gene-environment independence, although in a weaker form, gene-environment independence conditional on parental genotypes (D. C. Thomas, 2000), and it can only test the interaction effect, not the main effects. Another limitation is that it is sometimes not feasible to obtain genotype information for the cases' parents, especially when disease onset is late in life.

Counter-Matching

The lack of controls in the above two types of designs requires the sometimes unrealistic assumption of gene-environment independence and does not allow testing of the main effects along with the interaction. A different approach to increasing efficiency while including controls in the analysis is counter-matching. Introduced by Langholz and Clayton in 1994, the counter-matching design can improve statistical efficiency by selecting a more informative case-control sample from a larger cohort in such a way that the variation of the factors of interest is increased in the case-control set compared to the simple random sample of standard nested case-control studies (Langholz & Clayton, 1994). The inclusion of controls also enables estimation of the main effects. Andrieu et al discussed the use of counter-matching in studies of gene-environment interaction (Andrieu, Goldstein, Thomas, & Langholz, 2001).
In their paper, the following three counter-matching strategies were compared with a standard nested case-control study with three controls per case: 1) a 2-2 case-control design with counter-matching on a surrogate of an environmental factor $E$; 2) a 2-2 case-control design with counter-matching on a surrogate of a gene $G$; 3) a 1-1-1-1 case-control design with counter-matching on surrogates of both $E$ and $G$. It turned out that counter-matching on both $E$ and $G$ is the most efficient design, and that the efficiency depends on the sensitivity and specificity of the surrogates, the frequency of the risk factors ($E$, $G$) and the magnitude of the interaction effect (Andrieu et al., 2001).

The Two-Phase Case-Control Design

While similar to counter-matching in that it also involves a selection from a larger sample, the two-phase case-control design described earlier is more flexible. First, it does not require a matched case-control sample. Second, the sampling probabilities for the second phase can be controlled, providing the opportunity to optimize them to increase the cost-efficiency of the entire study.

Breslow and Holubkov used logistic regression to fit data from two-phase case-control designs (Breslow & Holubkov, 1997a). They compared three approaches: weighted likelihood, pseudo-likelihood and maximum likelihood. In the weighted likelihood approach, the logistic model is fitted using the subjects in Phase II, applying the inverses of their sampling probabilities as weights. In the pseudo-likelihood approach, the conditional probability of being a case given the variables of interest is used for estimation. In the maximum likelihood approach, the full likelihood of both steps, including the sampling probabilities, is used for estimation. Breslow and colleagues later showed that precision can be improved by calibrating the weights using information available on everybody in Phase I when an inverse-probability-weighted estimating equation is used; by reanalyzing data from the Atherosclerosis Risk in Communities (ARIC) study, they demonstrated that such calibration can improve statistical efficiency (Breslow, Lumley, Ballantyne, Chambless, & Kulich, 2009b).

Two-Step Genome-Wide Analysis

On the genome-wide scale, it is even harder to detect gene-environment interactions because of the huge number of SNPs to be tested. To improve efficiency, Murcray, Lewinger and Gauderman introduced a two-step approach for detecting genetic loci involved in gene-environment interactions using a case-control sample. In the first step, a likelihood ratio test for the association between G and E using logistic regression is performed for all SNPs, and the subset of SNPs exceeding a given significance level is selected; in the second step, the SNPs selected in the first step are tested for the interaction effect in the traditional way, adjusting for multiple comparisons only within this subset (Murcray, Lewinger, & Gauderman, 2009). Their simulation studies showed that this two-step approach has more power than the standard one-step approach while preserving the overall Type I error rate.
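The logic of this basic two-step scan can be sketched in a few lines of R. The function below is a hedged illustration, not the authors' implementation; it takes pre-computed screening and interaction p-values per SNP, and the thresholds alpha1 and alpha2 are illustrative names.

```r
# Hedged sketch of two-step G x E screening in the spirit of Murcray et al.
# p_screen: step-1 G-E association p-values for all SNPs;
# p_gxe: interaction p-values for the same SNPs.
two_step_scan <- function(p_screen, p_gxe, alpha1 = 0.05, alpha2 = 0.05) {
  passed <- which(p_screen < alpha1)                      # SNPs surviving step 1
  passed[p_gxe[passed] < alpha2 / length(passed)]         # Bonferroni on subset
}
```

Because the step-1 screening statistic is independent of the step-2 interaction test under the null, correcting only for the screened subset is what allows the power gain without inflating the overall Type I error rate.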
Later on, Murcray et al modified this two-step approach, using a hybrid screening method in the first step that allocates a proportion of the overall genome-wide significance level to each of two tests: the G-E association test and the disease-gene (D-G) association test. They showed that this hybrid two-step approach is powerful and robust to the choice of allocation proportion in most cases, and that the allocation of the significance level in the first step can indeed be optimized to improve power for a given design and parameter setting (Murcray, Lewinger, Conti, Thomas, & Gauderman, 2011). Finally, noting that the G-E and D-G association tests are independent, Gauderman et al proposed a two-degree-of-freedom test using the sum of the two statistics in the first step for screening, followed by a similar test for gene-environment interactions in the second step. They showed by simulation that this approach is more powerful than the previous ones in many circumstances, and its application to a G x Sex scan found two promising SNPs for gene-environment interaction (Gauderman, Zhang, Morrison, & Lewinger, 2013).

2.2 Formalization of the Problem

As reviewed above, it is of interest to investigate complex biological pathways using models involving latent processes; an example is the response mechanism of the respiratory system in Figure 2.1. Both genetic and environmental factors are often involved in such models, so it is also of interest to investigate gene-environment interactions between candidate genes and potential environmental risk factors.

The latent variable model of interest is formalized with a directed acyclic graph (DAG) using the following notation (Figure 2.2). Let $Y$ denote a binary disease outcome, $X$ a variable standing for the latent process that is causal for $Y$, $E$ the environmental factor of interest, $G$ a candidate gene, and $Z$ a biomarker measurement of $X$ that is too expensive to perform on all subjects. Instead, a stratified random sample of the Phase I subjects is measured, with the sampling probabilities chosen separately for each of the strata defined by combinations of $E$, $G$ and $Y$. Let $\alpha$, $\beta$, $\gamma$ and $\sigma$ denote the model parameters. In this model, there is no direct effect of either $E$ or $G$ on $Y$: $E$ and $G$ can be related to $Y$ only through $X$. $Z$ is only a measurement of $X$ and has no causal effect on $Y$ by itself. Furthermore, we assume that variables in boxes ($E$, $G$, $Z$ and $Y$) can be measured directly, while variables in circles ($X$) cannot.

Figure 2.2 Formalization of the latent variable model.

With this model, the major goal is to find sampling strategies, in terms of the Phase II sampling probabilities $S$, that are optimal with respect to economic feasibility and statistical efficiency for estimating the effects of the genetic and environmental factors (both main effects and their interaction) on the latent process, and the effect of the latent process on the disease outcome. In particular, the focus is on two-phase case-control designs, which have the potential to extract more information from a limited sample size by choosing the sampling strategies carefully.

2.3 Methods

Retrospective Likelihood

Consider a two-phase case-control design. In Phase I, a standard case-control sample is ascertained, and $E$ and $G$ are measured on all subjects. In Phase II, a subset of the Phase I sample is selected and $Z$ is measured only on the selected subjects. The full likelihood is then formalized using parametric models and the Markov property of the DAG. Let $N_{e,g,y}$ denote the number of subjects with $E = e$, $G = g$ and $Y = y$ in the Phase I sample, and let $n_{e,g,y,z}$ denote the number of subjects with $E = e$, $G = g$, $Y = y$ and $Z = z$ in the Phase II sample.
Then the log-likelihood of the two-phase case-control sample is

$$l(\theta) = \sum_{e,g,y} N_{e,g,y} \log p(e, g \mid y) + \sum_{e,g,y,z} n_{e,g,y,z} \log p(z \mid e, g, y) \qquad (2.7)$$

The parametric models used for each part of the DAG are listed below in (2.8)-(2.11):

$$p(e, g; \alpha) = \frac{\exp(\alpha_E e + \alpha_G g + \alpha_{EG}\, e g)}{\sum_{e', g'} \exp(\alpha_E e' + \alpha_G g' + \alpha_{EG}\, e' g')} \qquad (2.8)$$

$$p(x \mid e, g; \beta, \sigma_X) = \frac{1}{\sqrt{2\pi}\,\sigma_X} \exp\left\{-\frac{\left(x - \beta_0 - \beta_E e - \beta_G g - \beta_{EG}\, e g\right)^2}{2\sigma_X^2}\right\} \qquad (2.9)$$

$$p(z \mid x; \sigma_Z) = \frac{1}{\sqrt{2\pi}\,\sigma_Z} \exp\left\{-\frac{(z - x)^2}{2\sigma_Z^2}\right\} \qquad (2.10)$$

$$p(y = 1 \mid x; \gamma) = \Phi(\gamma_0 + \gamma_1 x) \qquad (2.11)$$

Maximizing (2.7) yields the maximum likelihood estimates (MLEs) of the parameters. The Fisher information can also be written from this log-likelihood:

$$I(\theta) = -\left\{\sum_{e,g,y} E\left(N_{e,g,y}\right) \frac{\partial^2 \log p(e, g \mid y; \theta)}{\partial\theta\,\partial\theta^T} + \sum_{e,g,y,z} E\left(n_{e,g,y,z}\right) \frac{\partial^2 \log p(z \mid e, g, y; \theta)}{\partial\theta\,\partial\theta^T}\right\} \qquad (2.12)$$

Inverting the Fisher information matrix in (2.12) yields the variance-covariance matrix of the MLEs.

Prospective Likelihood

The retrospective likelihood approach above involves assuming and estimating a parametric model for the distribution of $E$ and $G$, which is not itself of interest; estimating these nuisance parameters decreases the efficiency of the approach. Another shortcoming is the strong parametric assumption on the distribution of $E$ and $G$: when that assumption is not valid, serious bias can be introduced into the overall analysis. In addition, the retrospective likelihood is even more difficult to compute when $E$ and $G$ are continuous. In this sense, using the prospective likelihood is appealing.

In order to use the prospective likelihood in the case-control design, ascertainment needs to be included as a random variable in the DAG (Figure 2.3). Let $A$ denote the indicator variable for ascertainment, assume $p(a \mid y = 1) = \pi_1$ and $p(a \mid y = 0) = \pi_0$, and leave the rest of the model unchanged. The log-likelihood of the two-phase case-control sample becomes

$$l(\theta) = \sum_{e,g,y} N_{e,g,y} \log p(y \mid e, g, a) + \sum_{e,g,y,z} n_{e,g,y,z} \log p(z \mid e, g, y) \qquad (2.13)$$

Here

$$p(y \mid e, g, a) = \frac{p(a \mid y) \int p(x \mid e, g)\, p(y \mid x)\, dx}{\sum_{y'} p(a \mid y') \int p(x \mid e, g)\, p(y' \mid x)\, dx} \qquad (2.14)$$

$$p(z \mid e, g, y) = \frac{\int p(z \mid x)\, p(y \mid x)\, p(x \mid e, g)\, dx}{\int p(y \mid x)\, p(x \mid e, g)\, dx} \qquad (2.15)$$

Note that the term $p(e, g)$ does not appear in the log-likelihood (2.13). Therefore, assuming that the ascertainment probabilities $p(a \mid y)$ for both cases and controls are known by design, it is no longer necessary to model and estimate the marginal distribution of $E$ and $G$, which can potentially increase robustness and efficiency.

Figure 2.3 The latent variable model with ascertainment of the case-control sample.

2.4 Scenarios of Optimization

2.4.1 Optimization for Estimating a Single Parameter

Two major scenarios are considered here. In the first, the study plan is made before ascertaining the Phase I sample; in the second, it is made after ascertaining the Phase I sample. The remainder of this section considers the optimization problems for the two scenarios separately. In both scenarios, the number of cases and the number of controls are assumed to be equal.

Planning the Study Before Ascertaining Phase I Sample

Let $S_{e,g,y}$ denote the fraction of subjects with $E = e$, $G = g$ and $Y = y$ sampled for the Phase II measurement. For simplicity, we assume without loss of generality that the cost per subject in Phase I is one unit and that the cost per subject in Phase II is a multiple of the Phase I cost, denoted by $CR$. Before ascertaining the Phase I sample, we do not know the counts in each of the strata defined by $E$, $G$ and $Y$. Thus, we need to specify a parametric model for the marginal distribution of $E$ and $G$ in order to compute the Fisher information and the variance-covariance matrix of the parameter estimates.
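Concretely, all of the quantities entering that computation come from the component models (2.8)-(2.11). A minimal sketch of them as R density functions is given below; the code and parameter names are ours (matching the symbols used here and in Table 2.1 later), not the dissertation's own implementation.

```r
# Minimal sketch of the component models (2.8)-(2.11).
p_eg <- function(e, g, a_e, a_g, a_eg) {        # joint E, G distribution, (2.8)
  grid <- expand.grid(e = 0:1, g = 0:1)
  denom <- sum(exp(a_e * grid$e + a_g * grid$g + a_eg * grid$e * grid$g))
  exp(a_e * e + a_g * g + a_eg * e * g) / denom
}
p_x_eg <- function(x, e, g, b0, b_e, b_g, b_eg, s_x)   # latent process, (2.9)
  dnorm(x, mean = b0 + b_e * e + b_g * g + b_eg * e * g, sd = s_x)
p_z_x <- function(z, x, s_z) dnorm(z, mean = x, sd = s_z)  # biomarker, (2.10)
p_y_x <- function(x, g0, g1) pnorm(g0 + g1 * x)   # probit disease model, (2.11)
```

With these pieces, $p(e, g \mid y)$, $p(z \mid e, g, y)$ and hence the expected information for any candidate set of sampling fractions can be assembled by numerical integration over $x$.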
Therefore, only the retrospective likelihood approach above can be used in this scenario, since the prospective likelihood approach does not specify the marginal distribution of $E$ and $G$. Using the retrospective likelihood approach, the Fisher information is written as

$$I(\theta) = -N_1\left\{\sum_{e,g,y} p(e, g \mid y) \frac{\partial^2 \log p(e, g \mid y; \theta)}{\partial\theta\,\partial\theta^T} + \sum_{e,g,y,z} S_{e,g,y}\, p(e, g \mid y)\, p(z \mid e, g, y) \frac{\partial^2 \log p(z \mid e, g, y; \theta)}{\partial\theta\,\partial\theta^T}\right\} \qquad (2.16)$$

Here, $N_1$ is the number of cases in the study. Note that we only assume the number of controls equals the number of cases; the cases and controls are not necessarily individually matched. The expected total cost of Phase I and Phase II together is

$$C_{I,II} = N_1\left\{2 + \sum_{e,g,y} S_{e,g,y}\, p(e, g \mid y)\, CR\right\} \qquad (2.17)$$

Optimization Without a Fixed Budget

Suppose the magnitudes of the true parameter values can be specified when the study is being planned. With these values, the expected information, and hence the variance-covariance matrix of the MLEs, can be calculated as a function of the sampling fractions $S$ and the Phase I sample size $2 N_1$. Let $V_{S,N}$ denote the variance of the parameter of interest, obtained from the inverse of the Fisher information matrix. We can then compute the asymptotic relative cost efficiency (ARCE) introduced by Thomas (D. C. Thomas, 2007),

$$ARCE = \frac{C^{0} \cdot V^{0}}{C_{I,II} \cdot V_{S,N}} \qquad (2.18)$$

where the superscript 0 denotes a fixed reference design. Note that the Phase I sample size cancels out, since the expected cost is proportional to $N_1$ while the variance is inversely proportional to it. The optimal sampling fractions $S^*$ are then obtained by maximizing the ARCE, i.e.

$$S^* = \operatorname*{argmax}_S \frac{C^{0} \cdot V^{0}}{C_{I,II} \cdot V_{S,N}} = \operatorname*{argmin}_S C_{I,II} \cdot V_{S,N} \qquad (2.19)$$

The $S^*$ computed in this way is optimal in the sense that it minimizes cost and variance jointly. When there is no budget constraint, $S^*$ attains the minimum variance among designs, with possibly different sampling fractions and Phase I sample sizes, that require the same cost; equivalently, it minimizes the cost among designs that achieve the same level of precision.

Minimizing Variance With a Fixed Budget

To minimize variance given a fixed budget, the $S^*$ computed above is still optimal. To see this, let $B$ denote the fixed budget and assume that we can ascertain as many subjects as needed. The number of cases (and controls) $N_1$ required when using $S^*$ is then obtained by solving

$$B = N_1\left\{2 + \sum_{e,g,y} S^*_{e,g,y}\, p(e, g \mid y)\, CR\right\} \qquad (2.20)$$

This design gives the smallest variance of the parameter of interest under the fixed budget, since $S^*$ minimizes the product of cost and variance regardless of the Phase I sample size, and thus achieves the minimum variance among sampling fraction choices that require the same cost.

Maximizing Power With a Fixed Budget

With an estimated parameter and its variance, a Wald test can be used to test the null hypothesis that the true parameter value is zero. In this case, the $S^*$ computed above to minimize variance also maximizes the power of the Wald test, since the power of the Wald test for a single parameter increases as the variance of the parameter estimate decreases.
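The resulting design search is easy to set up numerically. The sketch below is ours: the variance of the target parameter as a function of the sampling fractions is left abstract (var_fn, assumed to come from inverting (2.16)), and the fractions are kept in (0, 1) by a logit transform so that a generic optimizer can be used.

```r
# Hedged sketch: choose Phase II fractions S (one per E, G, Y stratum) by
# minimizing cost x variance, i.e. maximizing the ARCE as in (2.18)-(2.19).
# var_fn(S) is an assumed wrapper returning the asymptotic variance of the
# target parameter; p_stratum holds p(e, g | y) for the eight strata.
optimal_fractions <- function(var_fn, p_stratum, CR) {
  obj <- function(s_free) {
    S <- plogis(s_free)                  # constrain fractions to (0, 1)
    cost <- 2 + CR * sum(S * p_stratum)  # expected cost per case, as in (2.17)
    cost * var_fn(S)                     # the product is free of N1, per (2.19)
  }
  plogis(optim(rep(0, length(p_stratum)), obj, method = "BFGS")$par)
}
```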
Planning the Study After Ascertaining Phase I Sample

Another scenario is that a Phase I sample is already given and biomarkers are to be measured in Phase II. In this scenario, although the retrospective likelihood approach can still be used, the prospective likelihood approach is preferred: if the study is planned based on a cohort with known ascertainment probabilities, we do not need to model the marginal distribution of $E$ and $G$. With the prospective likelihood approach, the Fisher information can be written as

$$I(\theta) = -\left\{\sum_{e,g,y} N_{e,g,y} \frac{\partial^2 \log p(y \mid e, g, a; \theta)}{\partial\theta\,\partial\theta^T} + \sum_{e,g,y,z} S_{e,g,y}\, N_{e,g,y}\, p(z \mid e, g, y) \frac{\partial^2 \log p(z \mid e, g, y; \theta)}{\partial\theta\,\partial\theta^T}\right\} \qquad (2.21)$$

In this scenario, the total cost of Phase I and Phase II is

$$C = \sum_{e,g,y} N_{e,g,y} + \sum_{e,g,y} S_{e,g,y} \cdot N_{e,g,y} \cdot CR \qquad (2.22)$$

Optimization Without a Fixed Budget

As before, we assume the magnitudes of the true parameter values can be specified when planning the study. The optimal sampling fractions $S^*$ are again obtained by maximizing the ARCE, i.e.

$$S^* = \operatorname*{argmax}_S \frac{C^{0} \cdot V^{0}}{C \cdot V_S} = \operatorname*{argmin}_S C \cdot V_S \qquad (2.23)$$

Here, $C$ is defined in (2.22).

Minimizing Variance With a Fixed Budget

When planning the study after ascertaining the Phase I sample under a fixed budget, simply using the $S^*$ computed above by maximizing the ARCE can lead either to a cost that exceeds the budget, or to a cost below the budget, in which case the best possible performance for that budget is not achieved. A constrained optimization is therefore more appropriate: the variance of the parameter of interest is minimized while keeping the cost from exceeding the fixed budget. Formally, the following constrained optimization problem is solved:

$$\min_S V_S \quad \text{s.t.} \quad C \le B \qquad (2.24)$$

or, equivalently,

$$\min_S V_S \quad \text{s.t.} \quad G = C - B \le 0 \qquad (2.25)$$

Many algorithms exist for this type of optimization problem, such as Constrained Optimization BY Linear Approximations (COBYLA) (Powell, 2007). Here, the COBYLA algorithm implemented in the R package 'nloptr' (Johnson) is used.

Maximizing Power With a Fixed Budget

As when planning the study before ascertaining the Phase I sample, the $S^*$ computed above to minimize variance also maximizes the power of the Wald test for the single parameter.

2.4.2 Optimization for Estimating Multiple Parameters

As in the single parameter case above, three optimizations are considered when designing the study either before or after ascertaining the Phase I sample: optimization without a fixed budget, minimizing variance with a fixed budget, and maximizing power with a fixed budget. When multiple parameters are estimated jointly, e.g. the main effects and the interaction together, the precision of all of the estimated parameters needs to be taken into account; here, the variance-covariance matrix of the parameters of interest is used. When performing optimization without a fixed budget, the ARCE for the multiple parameters of interest is maximized, with the variance in the single-parameter ARCE replaced by either the trace or the determinant of this variance-covariance matrix, depending on whether the primary goal is to estimate the variances as efficiently as possible or both the variances and the covariances. Specifically, we use the trace if the goal is to minimize the variances of all parameters of interest, and the determinant if the goal is to minimize both the variances and the absolute values of the covariances. Similarly, when minimizing variance with a fixed budget, the objective function becomes either the trace or the determinant of this variance-covariance matrix. When maximizing power with a fixed budget, this variance-covariance matrix is used to compute the multi-parameter Wald test statistic and the power of the Wald test.
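For the budget-constrained versions of these problems, whether the objective is a single variance, the trace, or the determinant, the constrained formulation (2.24)-(2.25) can be passed to COBYLA directly. Below is a minimal sketch using the 'nloptr' package named above; obj_fn and cost_fn are assumed wrappers around the chosen objective and the cost (2.22), not functions defined in this chapter.

```r
# Hedged sketch of the constrained design problem (2.24)-(2.25) via COBYLA.
library(nloptr)
constrained_design <- function(obj_fn, cost_fn, budget, k = 8) {
  cobyla(x0  = rep(0.5, k),                    # start from 50% sampling
         fn  = obj_fn,                         # variance, trace, or determinant
         hin = function(S) budget - cost_fn(S),  # nloptr convention: hin(S) >= 0
         lower = rep(0, k), upper = rep(1, k), # fractions stay in [0, 1]
         control = list(maxeval = 2000))
}
```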
2.4.3 Robustness to Parameter Values

As mentioned in previous sections, the evaluation of the log-likelihood depends on the true parameter values when planning the study. However, the true parameter values are generally unknown and rest only on prior beliefs. It is therefore necessary to check the sensitivity of the optimal sampling fractions to changes in the model parameter values, and, if the optimal sampling fractions are indeed sensitive to certain parameter values, to use averaging techniques to choose a robust design.

One such technique is to optimize the worst-case scenario. For example, to maximize the ARCE in a way that is robust to a set of possible true parameter values, the following optimization can be used:

$$S^* = \operatorname*{argmax}_S \min_{\theta \in \Theta} ARCE(S, \theta) \qquad (2.26)$$

Here, $\Theta$ denotes the set of possible true parameter values. Another way is to optimize the expected ARCE under a prior distribution of the unknown parameters, i.e.

$$S^* = \operatorname*{argmax}_S \int ARCE(S, \theta)\, dp(\theta) \qquad (2.27)$$

Since the expected ARCE in (2.27) does not have a closed form, this approach will be computationally intensive if numerical integration is needed.

2.5 Results

2.5.1 Optimization

In this section, a case-control study with equal numbers of cases and controls is considered, and the true parameter values are assumed to be pre-specified as in Table 2.1 when designing the study. Optimal sampling fractions and/or optimal Phase I sample sizes are obtained for each of the scenarios above using the corresponding optimization approaches. Comparing the optimal sampling fractions across the $E$, $G$ and $Y$ strata gives clues about which strata are more informative than others.

Table 2.1 Pre-specified true parameter values in (2.8)-(2.11).

Parameter | $\alpha_E$ | $\alpha_G$ | $\alpha_{EG}$ | $\beta_0$ | $\beta_E$ | $\beta_G$ | $\beta_{EG}$ | $\gamma_0$ | $\gamma_1$ | $\sigma_X$ | $\sigma_Z$
Value | -1.0 | -2.0 | 0.0 | -0.6 | 0.6 | 0.6 | 1.0 | -1.5 | 1.0 | 1.0 | 0.25

Planning the Study Before Ascertaining Phase I Sample

When planning the study before ascertaining the Phase I sample for estimating the interaction of $G$ and $E$, the Phase II sampling fractions that maximize the ARCE without a budget constraint (Figure 2.4 (a)), minimize variance with a budget constraint (Figure 2.4 (b)), or maximize power with a budget constraint (Figure 2.4 (c)) are similar to each other, consistent with the discussion in Section 2.4.1. As the Phase II assay becomes more expensive, the sampling fractions of all eight strata of $E$, $G$ and $Y$ show a generally decreasing trend, with some decreasing more slowly than others, indicating that the corresponding strata are more informative. In particular, the exposed unaffected carriers (E=1, G=1, Y=0) are most informative, while the unexposed non-carriers (E=0, G=0, Y=0 and E=0, G=0, Y=1) are less informative (and, to a lesser extent, the exposed non-carrier cases (E=1, G=0, Y=1)). With an expensive Phase II assay and a budget constraint, only a small sample can be afforded if all Phase I cases and controls are sampled for Phase II, whereas a much larger sample size is possible under optimal Phase II sampling; the optimal Phase II sampling design can therefore achieve much better variance and power (Figure 2.4 (b)-(c)). Similarly, when planning the study before ascertaining the Phase I sample for estimating the effect of the latent variable on disease, the Phase II sampling fractions that maximize the ARCE without a budget constraint (Figure 2.5 (a)), minimize variance with a budget constraint (Figure 2.5 (b)), or maximize power with a budget constraint (Figure 2.5 (c)) are similar to each other.
The bumps in the Phase II sampling fractions in Figure 2.5 (c) are caused by multiple optima of the objective function (power), especially when the budget constraint is more than sufficient to achieve the highest possible power. Among the eight strata of $E$, $G$ and $Y$, the exposed unaffected carriers (E=1, G=1, Y=0) are most informative, as before, and the unexposed affected carriers (E=0, G=1, Y=1) the least informative. Again, the optimal Phase II sampling design achieves much better variance and power than the design in which all Phase I cases and controls are sampled for Phase II (Figure 2.5 (c)-(d)).

Figure 2.4 Two-phase sampling designs maximizing the ARCE without a budget constraint (a), minimizing variance with a budget constraint (b), and maximizing power with a budget constraint (c) for estimating the interaction of $G$ and $E$ when planning the study before ascertaining Phase I sample, using the retrospective likelihood approach.

Figure 2.5 Two-phase sampling designs maximizing the ARCE without a budget constraint (a), minimizing variance with a budget constraint (b), and maximizing power with a budget constraint (c) for estimating the effect of the latent variable on disease when planning the study before ascertaining Phase I sample, using the retrospective likelihood approach.

Planning the Study After Ascertaining Phase I Sample

When planning the study after ascertaining the Phase I sample for estimating the interaction of $G$ and $E$, the Phase II sampling fractions that maximize the ARCE without a budget constraint (Figure 2.6 (a)) differ from those that minimize variance with a budget constraint (Figure 2.6 (b)) or maximize power with a budget constraint (Figure 2.6 (c)). This is caused by the additional constraint imposed by the Phase I sample: unlike planning before ascertainment, once the Phase I sample has been ascertained, its size cannot be changed. Therefore, given a certain budget, the optimal sampling fractions need not coincide with those obtained without a budget constraint. Still, as the Phase II assay becomes more expensive, the sampling fractions of all eight strata of $E$, $G$ and $Y$ show a generally decreasing trend, with some decreasing more slowly than others. Specifically, the exposed unaffected carriers (E=1, G=1, Y=0) are most informative, and the unexposed unaffected non-carriers (E=0, G=0, Y=0) are least informative (Figure 2.6 (a)-(c)). Compared with the design that uses the same sampling fraction for all strata of $E$, $G$ and $Y$, the optimal design with the same Phase II sample size achieves a lower variance and higher power as the Phase II assay becomes more expensive (Figure 2.6 (b)-(c)).

Similarly, when planning the study after ascertaining the Phase I sample for estimating the effect of the latent variable on disease, the Phase II sampling fractions that maximize the ARCE without a budget constraint (Figure 2.7 (a)) differ from those that minimize variance with a budget constraint (Figure 2.7 (b)) or maximize power with a budget constraint (Figure 2.7 (c)). In all Phase I samples, the exposed unaffected carriers (E=1, G=1, Y=0) are most informative (Figure 2.7 (a)-(c)). Compared with the design that uses the same sampling fraction for all strata of $E$, $G$ and $Y$, the optimal design with the same Phase II sample size achieves a lower variance and higher power as the Phase II assay becomes more expensive (Figure 2.7 (b)-(c)).
Figure 2.6 Two-phase sampling designs maximizing the ARCE without a budget constraint (a), minimizing variance with a budget constraint (b), and maximizing power with a budget constraint (c) for estimating the interaction of $G$ and $E$ when planning the study after ascertaining Phase I sample, using the prospective likelihood approach.

Figure 2.7 Two-phase sampling designs maximizing the ARCE without a budget constraint (a), minimizing variance with a budget constraint (b), and maximizing power with a budget constraint (c) for estimating the effect of the latent variable on disease when planning the study after ascertaining Phase I sample, using the prospective likelihood approach.

2.5.2 Evaluation

Simulation studies were used to evaluate the performance of the optimal Phase II sampling fractions (a sketch of the data-generating step is given after Table 2.3). At each replicate, $E$, $G$, $Z$ and $Y$ for a population of 10,000 were generated based on the parametric model (2.8)-(2.11); then, under a budget constraint, Phase I and Phase II samples were drawn according to the different designs, and the parameter of interest was estimated from the samples drawn. Note that, to simplify the estimation problem, each subject in Phase II has two measurements of the biomarker $Z$, which are used to estimate the variance of $Z$ conditional on the latent variable $X$. The log-likelihood of the combined Phase I and Phase II samples, (2.7) or (2.13), was then maximized numerically to obtain the parameter estimates.

When planning the study before ascertaining the Phase I sample, the optimal designs for the interaction of $G$ and $E$ ($\beta_{EG}$) and for the effect on disease ($\gamma_1$) were computed by minimizing the corresponding variance given the budget constraint. The standard deviation of $\hat{\beta}_{EG}$ or $\hat{\gamma}_1$ under its corresponding optimal design is lower than under the other designs (Table 2.2). In addition, even though its Phase I sample size is much smaller than that of the optimal design for $\beta_{EG}$, the optimal design for $\gamma_1$ still yields a smaller standard deviation of $\hat{\gamma}_1$, indicating that the optimal sampling fractions can indeed improve efficiency (Table 2.2). Similarly, when planning the study after ascertaining the Phase I sample, the optimal designs were computed by minimizing the variances given the budget constraint, and the standard deviation of $\hat{\beta}_{EG}$ or $\hat{\gamma}_1$ under the corresponding optimal design is again lower than under the other designs (Table 2.3). Note that in Table 2.3 the standard deviation of $\hat{\gamma}_1$ estimated under the optimal design for $\beta_{EG}$ is very large. This is because several sampling fractions are zero in that situation, which makes it hard to estimate the parameters by numerically optimizing the complicated log-likelihood function.

Table 2.2 Parameter estimates and standard deviations of different designs when planning the study before ascertaining Phase I sample with a budget of 3000.

Design | $\hat{\beta}_{EG}$ (SD) | $\hat{\gamma}_1$ (SD) | Phase I sample size
Optimal design for $\beta_{EG}$ | 1.40 (0.48) | 0.95 (0.24) | 1778
Optimal design for $\gamma_1$ | 1.13 (0.66) | 0.90 (0.17) | 582
Balanced (50% sampling) | 1.11 (0.76) | 0.99 (0.24) | 176
100% Phase II sampling | 0.98 (0.81) | 1.00 (0.23) | 90

Table 2.3 Parameter estimates and standard deviations of different designs when planning the study after ascertaining Phase I sample with a fixed budget of 2000 for Phase II.

Design | $\hat{\beta}_{EG}$ (SD) | $\hat{\gamma}_1$ (SD)
Optimal design for $\beta_{EG}$ | 0.85 (0.15) | 1.97 (13.39)
Optimal design for $\gamma_1$ | 1.16 (0.19) | 0.85 (0.13)
Balanced design | 1.24 (0.16) | 0.97 (0.25)
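As promised above, the data-generating mechanism used in these evaluations is small enough to sketch directly. The code below is ours, hard-coding the Table 2.1 values; it is meant only to make the mechanism (2.8)-(2.11) concrete, not to reproduce the dissertation's simulation code.

```r
# Minimal sketch of one simulated population (Section 2.5.2), with the
# parameter values of Table 2.1.
simulate_population <- function(n = 10000) {
  grid <- expand.grid(e = 0:1, g = 0:1)
  w <- exp(-1.0 * grid$e - 2.0 * grid$g + 0.0 * grid$e * grid$g)  # (2.8)
  k <- sample(1:4, n, replace = TRUE, prob = w / sum(w))
  e <- grid$e[k]; g <- grid$g[k]
  x <- rnorm(n, -0.6 + 0.6 * e + 0.6 * g + 1.0 * e * g, 1.0)  # latent X, (2.9)
  z <- rnorm(n, x, 0.25)                                      # biomarker, (2.10)
  y <- rbinom(n, 1, pnorm(-1.5 + 1.0 * x))                    # disease, (2.11)
  data.frame(e, g, z, y)
}
```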
2.5.3 Estimating Multiple Parameters

When estimating multiple parameters simultaneously, there are two ways to measure variation given the variance-covariance matrix of the estimated parameters: one is the trace, the other the determinant. In this section, under the same settings as in Section 2.5.1, optimal designs are computed using either the trace or the determinant; designs that maximize power for the multivariate Wald test are also computed.

Planning the Study Before Ascertaining Phase I Sample

When planning the study before ascertaining the Phase I sample for estimating the main effects and interaction of $G$ and $E$ simultaneously, the Phase II sampling fractions for the design that maximizes the ARCE(trace), the design that minimizes the trace, and the design that minimizes the determinant are similar to each other (Figure 2.8 (a), Figure 2.8 (c)-(d)). In all Phase I samples, the exposed unaffected carriers are most informative, and the unexposed unaffected non-carriers least informative. The Phase II sampling fractions for the design that maximizes the ARCE(determinant) are not informative (Figure 2.8 (b)), because the value of the objective function (Cost x Determinant) is dominated by the determinant, which is close to zero. As the Phase II assay becomes more expensive, the optimal design achieves higher power than the design in which all Phase I cases and controls are sampled for Phase II (Figure 2.8 (e)). Again, the bumps in the Phase II sampling fractions in Figure 2.8 (e) are caused by multiple optima of the objective function (power), especially when the budget constraint is more than sufficient to achieve the highest possible power.

Figure 2.8 Two-phase sampling designs for estimating the main effects and interaction of $G$ and $E$ when planning the study before ascertaining Phase I sample using the retrospective likelihood approach.

Planning the Study After Ascertaining Phase I Sample

When planning the study after ascertaining the Phase I sample for estimating the main effects and interaction of $G$ and $E$ simultaneously, the Phase II sampling fractions for the design that minimizes the trace with a budget constraint and the design that minimizes the determinant with a budget constraint are similar to each other (Figure 2.9 (c)-(d)). Among the eight strata of $E$, $G$ and $Y$, the exposed unaffected carriers (E=1, G=1, Y=0) are most informative. As the Phase II assay becomes more expensive, the optimal design achieves higher power than the design in which all Phase I cases and controls are sampled for Phase II (Figure 2.9 (e)); to achieve higher power, the exposed carriers are more informative than the other strata. Again, the bumps in the Phase II sampling fractions in Figure 2.9 (e) are caused by multiple optima of the objective function (power), especially when the budget constraint is more than sufficient to achieve the highest possible power.

Figure 2.9 Two-phase sampling designs for estimating the main effects and interaction of $G$ and $E$ when planning the study after ascertaining Phase I sample using the prospective likelihood approach.

2.5.4 Sensitivity to Parameter Values

As mentioned in previous sections, parameter values need to be pre-specified in order to compute the optimal Phase II sampling fractions. It is therefore necessary to examine the sensitivity of the optimized Phase II sampling fractions to different pre-specified parameter values. In this subsection, we use the scenario of planning the study before ascertaining the Phase I sample as an example to explore this question.
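The two robust criteria reappear below as the blue (maximin, (2.26)) and red (expected ARCE, (2.27) with equal weights on a grid of values) reference points in Figures 2.10 and 2.11. A minimal sketch of both optimizations, assuming an arce_fn(S, theta) wrapper that is not defined here:

```r
# Hedged sketch of the robust design criteria (2.26)-(2.27); theta_grid is a
# list of plausible parameter vectors, weighted equally for the expectation.
robust_fractions <- function(arce_fn, theta_grid, k = 8, expected = FALSE) {
  obj <- function(s_free) {
    S <- plogis(s_free)                         # keep fractions in (0, 1)
    vals <- sapply(theta_grid, function(th) arce_fn(S, th))
    if (expected) -mean(vals) else -min(vals)   # optim() minimizes
  }
  plogis(optim(rep(0, k), obj)$par)             # Nelder-Mead: min() is non-smooth
}
```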
Based on the parameter values in Table 2.1, when we vary the value of the parameter for the interaction between $G$ and $E$, i.e. $\beta_{EG}$, while keeping the other parameter values unchanged, the optimal Phase II sampling fractions that maximize the ARCE also change. Specifically, when the interaction between $G$ and $E$ gets stronger, the unaffected carriers (whether exposed or unexposed) become more informative (bars in Figure 2.10). The optimal Phase II sampling fractions that maximize the expected ARCE over the set of possible parameter values are close to the average of the optimal sampling fractions computed at those values (red dots in Figure 2.10). In contrast, the optimal Phase II sampling fractions that maximize the minimum ARCE over the possible parameter values coincide with those obtained at the largest pre-specified parameter value (blue dots in Figure 2.10).

Similarly, based on the parameters in Table 2.1, when we vary the value of the parameter for the effect of the latent variable on $Y$, i.e. $\gamma_1$, while keeping the other parameter values unchanged, the exposed affected carriers become less informative as the effect on $Y$ gets stronger (bars in Figure 2.11). The optimal Phase II sampling fractions that maximize the expected ARCE are close to the average of those computed at the different parameter values (red dots in Figure 2.11), while the optimal sampling fractions that maximize the minimum ARCE coincide with those from the smallest pre-specified parameter value (blue dots in Figure 2.11).

Figure 2.10 Given a range of pre-specified values of the interaction between $G$ and $E$, optimal Phase II sampling fractions computed by maximizing the ARCE (bars), the expected ARCE (red dots) and the minimum ARCE (blue dots) for estimating this parameter when planning the study before ascertaining Phase I sample using the retrospective likelihood approach.

Figure 2.11 Given a range of pre-specified values of the effect of the latent variable on $Y$, optimal Phase II sampling fractions computed by maximizing the ARCE (bars), the expected ARCE (red dots) and the minimum ARCE (blue dots) for estimating this parameter when planning the study before ascertaining Phase I sample using the retrospective likelihood approach.

2.6 Discussion

In this chapter, we used a latent variable to model the relationships between candidate genes, environmental exposures, biomarkers and disease outcomes, representing possible underlying biological pathways. Considering a two-phase case-control study that measures biomarkers in Phase II on only a subset of the Phase I sample, we then optimized the Phase II sampling fractions in various scenarios to achieve the best cost-efficiency in estimating either a single parameter or multiple parameters of interest. In addition, the optimal designs were evaluated using simulations, and the sensitivity of the optimal sampling fractions to the pre-specified parameter values was explored.
When the goal is to estimate multiple parameters simultaneously, the trace and the determinant of the variance-covariance matrix of the estimated parameters are two candidates for the measurement of variation for optimization. By definition, it is proper to use trace as the objective for optimization if the goal is to minimize the sum of the variance of all parameters. Unlike the trace, the determinant also accounts for the covariances. In an experiment (result not shown here), we found that determinant of a variance-covariance matrix is highly correlated with the sum of the variances alone, or with sum of both the variances and the absolute value of all covariances. Therefore, determinant is a reasonable objective for optimization when there are multiple parameters to estimate. In Section 2.5.4, we have shown that optimal sampling fractions are sensitive to pre- specified parameter values, and it is necessary to pre-specify a range of possible parameter values. If a prior distribution of model parameters can be specified, it should be used to compute the expected value of the objective function, which is then used instead for optimization. In this chapter, the two-phase case-control study considered has equal number of cases and controls, which is not necessarily true in many applications. When the number of cases and 49 controls are not set to equal, the method in this chapter can be readily generalized. By introducing an additional parameter, the ratio of the number of controls to the number of cases, the expected information and expected cost can be modified accordingly. In addition, if this ratio is also one of the design parameters that need to be determined, is can also be optimized along with other designs parameters such as the Phase II sampling fractions. 50 3 Integrated Analysis of Germline, Omic and Disease 3.1 Background Technology advancements have made it possible for researchers and practitioners to get access to various omic data on human subjects, such as metabolites, gene expression, somatic profiles, etc. The availability of such data has begun to facilitate new insights into the underlying etiologic mechanism of disease. In a study of 2000 primary breast tumors from breast cancer patients by Curtis et al, integrated analysis of copy number and gene expression data revealed novel subgroups with distinct clinical outcomes, and also provided a novel molecular stratification of the breast cancer population, suggesting possible biological mechanism (Curtis et al., 2012). Despite the great potential, omic data also presents many analytic challenges. The first challenge is effect heterogeneity. Different types of omic data often get involved in the underlying biological mechanism in different ways, thus can have different types of effects. The second challenge is high dimensionality. Omic data such as genomic data and gene expression data often consist of tens of thousands of variables, making integrated analysis both analytically and computationally difficult. Dimension reduction is often accomplished through clustering. Underlying clusters with different characteristics of interest, such as the study of breast tumor (Curtis et al., 2012), are estimated from information in various types of omic data. Shen et al developed a method called iCluster for clustering by integrating different types of omic data via a joint Gaussian latent variable model (Shen, Olshen, & Ladanyi, 2009). 
In iCluster, the Gaussian latent variable model provides a probabilistic framework, and sparse estimates of the cluster-specific parameters are obtained by maximizing the log-likelihood with a lasso-type penalty (Tibshirani, 1996). However, iCluster treats the different types of data in the same way, ignoring their underlying causal relationships.

In this chapter, we present a novel approach to integrated analysis of germline, omic and disease data that leverages their underlying causal relationships. In our approach, a DAG represents the underlying causal relationships among the different types of data, and a joint probabilistic model with a latent variable for clustering integrates the data in the DAG. Unlike other methods, the latent variable in our approach relates the different types of omic data explicitly to the outcome. An EM algorithm is used to estimate the latent variable and the model parameters simultaneously, and penalization methods are used to accommodate high dimensional data. We also present a two-step framework for integrated analysis: the first step focuses on variable selection, and the second step on estimation and inference.

The rest of this chapter is organized as follows. We first introduce the DAG and formalize the joint probabilistic model with the latent clustering variable. Next we develop an EM algorithm, add penalization methods to generate sparse solutions for high dimensional data, and present the two-step framework for integrated analysis. We then evaluate the performance of our approach using extensive simulations: the impact of different types of omic data on the performance of clustering and parameter estimation, the Type I error rate and power of the inference on the effect on the outcome, the ability of our approach to choose the number of clusters, and its ability to handle high dimensional data. As an example, we apply our approach to a data set from the Women's Health Initiative (WHI). We conclude the chapter with a discussion.

3.2 Methods

3.2.1 Latent Variable Model Formalization

Information from germline genomic data, omic data such as biomarker measurements, and the disease is integrated by jointly modeling their relationships through a latent clustering variable, as represented by the DAG in Figure 3.1. In this DAG, $\boldsymbol{G}$ is a mean-centered genomic data vector of dimension $p \times 1$, $X$ is a categorical scalar latent clustering variable, $\boldsymbol{Z}$ is a continuous biomarker measurement vector of dimension $m \times 1$, and $Y$ is a scalar outcome variable; the other circled quantities in the DAG are model parameters. As shown by the arrows in the DAG, the distribution of $X$ depends on $\boldsymbol{G}$ and the parameter $\boldsymbol{\beta}$, the distribution of $\boldsymbol{Z}$ depends on $X$ and the parameters $\boldsymbol{W}$ and $\boldsymbol{\Sigma}$, and the distribution of $Y$ depends on $X$ and the parameter $\boldsymbol{\gamma}$. Note that, conditional on $X$, $\boldsymbol{Z}$ and $Y$ are independent of each other.

Figure 3.1 Joint model integrating germline genomic data, biomarker measurements and the outcome variable.

Integrating information in this way is convenient in several respects. First, the joint likelihood of $X$, $\boldsymbol{Z}$ and $Y$ conditional on $\boldsymbol{G}$ is easily formalized given individual models for $X$ given $\boldsymbol{G}$, $\boldsymbol{Z}$ given $X$, and $Y$ given $X$. Let $f(X \mid \boldsymbol{G}, \boldsymbol{\beta})$, $f(\boldsymbol{Z} \mid X, \boldsymbol{W}, \boldsymbol{\Sigma})$ and $f(Y \mid X, \boldsymbol{\gamma})$ denote the probability density functions (for continuous random variables) or probability mass functions (for discrete random variables) of these conditional distributions.
Then the complete data log-likelihood based on the joint distribution of $X$, $\boldsymbol{Z}$ and $Y$ conditional on $\boldsymbol{G}$ is

$$l(\Theta) = \sum_{i=1}^{n} \log f(X_i \mid \boldsymbol{G}_i, \boldsymbol{\beta}) + \sum_{i=1}^{n} \log f(\boldsymbol{Z}_i \mid X_i, \boldsymbol{W}, \boldsymbol{\Sigma}) + \sum_{i=1}^{n} \log f(Y_i \mid X_i, \boldsymbol{\gamma}) \qquad (3.1)$$

where the subscript $i = 1, \ldots, n$ indexes the study subjects and $\Theta$ is a generic notation for all model parameters, including $\boldsymbol{\beta}$, $\boldsymbol{W}$, $\boldsymbol{\Sigma}$ and $\boldsymbol{\gamma}$.

Second, this model can accommodate high dimensional genomic data and biomarker measurements, because penalization methods such as the LASSO (Tibshirani, 1996) can be applied to the component models separately: (3.1) consists of three parts that do not share model parameters.

Third, flexibility is increased, since assuming different conditional distributions for the random variables leads to different models. For example, assume $X$ has $k$ categories and follows a multinomial distribution conditional on $\boldsymbol{G}$, $\boldsymbol{Z}$ follows a multivariate normal distribution conditional on $X$, and $Y$ follows a normal distribution conditional on $X$:

$$f(X_i = j \mid \boldsymbol{G}_i, \boldsymbol{\beta}) = S(\boldsymbol{\beta}, j, \boldsymbol{G}_i) = \frac{\exp(\boldsymbol{G}_i^T \boldsymbol{\beta}_j)}{\sum_{j'=1}^{k} \exp(\boldsymbol{G}_i^T \boldsymbol{\beta}_{j'})} \qquad (3.2)$$

$$f(\boldsymbol{Z}_i \mid X_i = j, \boldsymbol{W}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}_j|^{-1/2} \exp\left\{-\frac{1}{2}\left(\boldsymbol{Z}_i - \boldsymbol{W}_j\right)^T \boldsymbol{\Sigma}_j^{-1} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j\right)\right\} \qquad (3.3)$$

$$f(Y_i \mid X_i = j, \boldsymbol{\mu}, \boldsymbol{\sigma}) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left\{-\frac{(Y_i - \mu_j)^2}{2\sigma_j^2}\right\} \qquad (3.4)$$

Then the log-likelihood of the joint distribution of $X$, $\boldsymbol{Z}$ and $Y$ conditional on $\boldsymbol{G}$ becomes

$$l(\Theta) = \sum_{i}\sum_{j} x_{ij} \log S(\boldsymbol{\beta}, j, \boldsymbol{G}_i) - \frac{1}{2}\sum_{i}\sum_{j} x_{ij} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j\right)^T \boldsymbol{\Sigma}_j^{-1} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j\right) - \frac{1}{2}\sum_{i}\sum_{j} x_{ij} \log |\boldsymbol{\Sigma}_j| - \sum_{i}\sum_{j} x_{ij} \frac{(Y_i - \mu_j)^2}{2\sigma_j^2} - \sum_{i}\sum_{j} x_{ij} \log \sigma_j \qquad (3.5)$$

where $x_{ij} = I(X_i = j)$ and additive constants have been dropped. Although other distributions could be used to parameterize this joint model, for convenience the remainder of this chapter focuses on this model. The model with a binary $Y$ is presented in the Appendix.

3.2.2 Joint Estimating Method via EM Algorithm

The EM algorithm is used for estimation. The parameters of interest include the clustering variable $X$, the genetic effects $\boldsymbol{\beta}$, the cluster-specific biomarker means $\boldsymbol{W}$, and the mean effects on the outcome, $\boldsymbol{\mu}$. These are estimated jointly from the log-likelihood (3.5), which is the complete data log-likelihood in the EM framework.

The E-step computes the following conditional expectation of (3.5) given the current parameter estimates and the observed data:

$$Q(\Theta) = \sum_{i}\sum_{j} r_{ij} \log S(\boldsymbol{\beta}, j, \boldsymbol{G}_i) - \frac{1}{2}\sum_{i}\sum_{j} r_{ij} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j\right)^T \boldsymbol{\Sigma}_j^{-1} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j\right) - \frac{1}{2}\sum_{i}\sum_{j} r_{ij} \log |\boldsymbol{\Sigma}_j| - \sum_{i}\sum_{j} r_{ij} \frac{(Y_i - \mu_j)^2}{2\sigma_j^2} - \sum_{i}\sum_{j} r_{ij} \log \sigma_j \qquad (3.6)$$

Here $r_{ij} = f\!\left(X_i = j \mid \boldsymbol{G}_i, \boldsymbol{Z}_i, Y_i; \Theta^{(t)}\right)$, which is computed by dividing the joint density of $X$, $\boldsymbol{Z}$ and $Y$ conditional on $\boldsymbol{G}$ by the marginal density of $\boldsymbol{Z}$ and $Y$ conditional on $\boldsymbol{G}$; the detailed derivation is shown in the Appendix. The M-step then updates the parameter estimates by maximizing (3.6), treating the $r_{ij}$ as known. This results in the following estimates of $\boldsymbol{\beta}$, $\boldsymbol{W}$, $\boldsymbol{\Sigma}$, $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$:

$$\boldsymbol{\beta}^{(t+1)} = \operatorname*{argmax}_{\boldsymbol{\beta}} \sum_{i}\sum_{j} r_{ij} \log S(\boldsymbol{\beta}, j, \boldsymbol{G}_i) \qquad (3.7)$$

$$\boldsymbol{W}_j^{(t+1)} = \frac{\sum_i r_{ij} \boldsymbol{Z}_i}{\sum_i r_{ij}} \qquad (3.8)$$

$$\boldsymbol{\Sigma}_j^{(t+1)} = \frac{\sum_i r_{ij} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j^{(t+1)}\right)\left(\boldsymbol{Z}_i - \boldsymbol{W}_j^{(t+1)}\right)^T}{\sum_i r_{ij}} \qquad (3.9)$$

$$\mu_j^{(t+1)} = \frac{\sum_i r_{ij} Y_i}{\sum_i r_{ij}} \qquad (3.10)$$

$$\sigma_j^{(t+1)} = \sqrt{\frac{\sum_i r_{ij} \left(Y_i - \mu_j^{(t+1)}\right)^2}{\sum_i r_{ij}}} \qquad (3.11)$$

Although the maximization in (3.7) has no closed form solution, it can be solved easily and iteratively, since it is a convex optimization: it is equivalent to maximizing the log-likelihood of a multinomial logistic regression with outcome probabilities $r_{ij}$, and can be solved by the multinom function in the R package {nnet} (Venables & Ripley, 2002). With the E-step and M-step formalized above, the algorithm starts from a set of initial parameter values and iterates between the two steps until convergence.
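One M-step of this algorithm fits in a short function. The sketch below is ours: the E-step responsibilities $r$ are taken as given (an n x k matrix), the closed-form updates (3.8)-(3.11) are written directly, and the multinomial part (3.7) is delegated to nnet::multinom with a matrix of fractional responses, in the spirit of the remark above.

```r
# Hedged sketch of one M-step for the model (3.2)-(3.4), continuous outcome.
# G, Z: numeric matrices (n x p and n x m); Y: numeric vector; r: n x k.
library(nnet)
m_step <- function(G, Z, Y, r) {
  k <- ncol(r)
  beta_fit <- multinom(r ~ G, trace = FALSE)   # eq (3.7), fractional responses
  W <- lapply(1:k, function(j) colSums(r[, j] * Z) / sum(r[, j]))        # (3.8)
  Sigma <- lapply(1:k, function(j) {
    D <- sweep(Z, 2, W[[j]])                   # center on the cluster mean
    crossprod(sqrt(r[, j]) * D) / sum(r[, j])                            # (3.9)
  })
  mu <- sapply(1:k, function(j) sum(r[, j] * Y) / sum(r[, j]))           # (3.10)
  sigma <- sapply(1:k, function(j)
    sqrt(sum(r[, j] * (Y - mu[j])^2) / sum(r[, j])))                     # (3.11)
  list(beta = beta_fit, W = W, Sigma = Sigma, mu = mu, sigma = sigma)
}
```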
3.2.3 Sparse Solution

In the case of high dimensional $\boldsymbol{G}$ and $\boldsymbol{Z}$, sparse estimates of $\boldsymbol{\beta}$, $\boldsymbol{W}$ and $\boldsymbol{\Sigma}$ can be obtained by applying penalization methods to the joint estimating method above. To obtain sparse estimates of $\boldsymbol{\beta}$, a LASSO-type penalty (Tibshirani, 1996) is added to the objective function in (3.7), resulting in the following update:

$$\boldsymbol{\beta}^{(t+1)} = \operatorname*{argmax}_{\boldsymbol{\beta}} \left\{\sum_{i}\sum_{j} r_{ij} \log S(\boldsymbol{\beta}, j, \boldsymbol{G}_i) - \rho_{\beta} \sum_{j}\sum_{l} |\beta_{jl}|\right\} \qquad (3.12)$$

To obtain sparse estimates of $\boldsymbol{W}$ and $\boldsymbol{\Sigma}$, following the principle proposed by Witten and Tibshirani (Witten & Tibshirani, 2009), $\boldsymbol{\Sigma}_j$ is first updated by the following maximization, which penalizes the $L_1$ norm of its inverse:

$$\boldsymbol{\Sigma}_j^{(t+1)} = \operatorname*{argmax}_{\boldsymbol{\Sigma}_j} \left\{\log \left|\boldsymbol{\Sigma}_j^{-1}\right| - tr\!\left(\boldsymbol{S}_j \boldsymbol{\Sigma}_j^{-1}\right) - \rho_{\Sigma} \left\|\boldsymbol{\Sigma}_j^{-1}\right\|_1\right\} \qquad (3.13)$$

Here

$$\boldsymbol{S}_j = \frac{\sum_i r_{ij} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j^{(t+1)}\right)\left(\boldsymbol{Z}_i - \boldsymbol{W}_j^{(t+1)}\right)^T}{\sum_i r_{ij}} \qquad (3.14)$$

Then, $\boldsymbol{W}_j$ is updated by minimizing the following penalized distance:

$$\boldsymbol{W}_j^{(t+1)} = \operatorname*{argmin}_{\boldsymbol{W}_j} \left\{\frac{1}{2}\sum_i r_{ij} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j\right)^T \left[\boldsymbol{\Sigma}_j^{(t+1)}\right]^{-1} \left(\boldsymbol{Z}_i - \boldsymbol{W}_j\right) + \rho_W \left\|\left[\boldsymbol{\Sigma}_j^{(t+1)}\right]^{-1} \boldsymbol{W}_j\right\|_1\right\} \qquad (3.15)$$

3.2.4 Statistical Testing and Integrated Analysis

In order to perform hypothesis testing, standard errors of the parameter estimates are estimated asymptotically based on the supplemented EM (SEM) algorithm of Meng and Rubin (Meng & Rubin, 1991). In the SEM framework, using the equations in their paper, we first run the EM algorithm and store the parameter estimates at each iteration. After the EM algorithm has converged, we compute the expectation of the complete data observed information matrix using their equation (2.3.4). Then, starting with the parameter estimates from the first EM iteration, we compute the rate of convergence numerically in an iterative fashion based on their equation (3.3.3). Finally, the asymptotic variance-covariance matrix is computed using their equations (2.3.5) and (2.3.6), and the standard errors are obtained as the square roots of the asymptotic variances of the parameters of interest. Dividing a parameter estimate by its standard error gives a Wald statistic for testing the null hypothesis that the true parameter value is zero.

To integrate information from high dimensional data while retaining the ability to perform statistical testing, we propose the following two-step framework for integrated analysis:

Step 1: Fit the joint model (3.1) using all available (high dimensional) data, incorporating the penalization methods into the EM algorithm as described in Section 3.2.3.

Step 2: Refit the same model by the EM algorithm without penalization, using a (low dimensional) subset of the data that includes only the variables selected in Step 1; estimate the standard errors of the estimates and perform Wald tests.

3.2.5 Implementation of the Integrated Analysis

To clarify the implementation, we first show pseudo code for the EM algorithm without regularization:

1. Choose an initial value $\Theta^{(0)}$.
2. At each iteration:
   a. E-step: compute $r_{ij} = f\!\left(X_i = j \mid \boldsymbol{G}_i, \boldsymbol{Z}_i, Y_i; \Theta^{(t)}\right)$.
   b. M-step: update $\Theta^{(t+1)}$ by:
      i. if $Y$ is continuous, Equations (3.7), (3.8), (3.9), (3.10) and (3.11);
      ii. if $Y$ is binary, Equations (3.7), (3.8), (3.9) and (A.4).
   c. Check convergence by the change in parameter values:
      i. if converged, stop;
      ii. if not converged, return to 2.a.

The pseudo code for the EM algorithm with regularization is the same as above, except that Equations (3.12), (3.13), (3.14) and (3.15) replace Equations (3.7), (3.8) and (3.9). In the EM algorithm, (3.12) is implemented using the glmnet function in the R package {glmnet} (Friedman, Hastie, & Tibshirani, 2010), and (3.13) using the glasso function in the R package {glasso} (Friedman, Hastie, & Tibshirani, 2014).
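In the same spirit, these two penalized updates map onto single calls to those packages. A hedged sketch follows; the tuning parameters rho_b and rho_s are illustrative names for the penalties in (3.12) and (3.13).

```r
# Hedged sketch of the penalized updates (3.12) and (3.13).
library(glmnet)
library(glasso)

# Eq (3.12): L1-penalized multinomial fit of the responsibilities on G;
# glmnet accepts a matrix of class proportions as the multinomial response.
update_beta_sparse <- function(G, r, rho_b)
  glmnet(G, r, family = "multinomial", lambda = rho_b)

# Eq (3.13): graphical lasso on the weighted covariance S_j of (3.14);
# glasso penalizes the inverse and returns the estimated Sigma_j in $w.
update_sigma_sparse <- function(S_j, rho_s) glasso(S_j, rho = rho_s)$w
```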
In order to facilitate the implementation of (3.15), the following re-parameterization is used: $\boldsymbol{W}_j^{(t+1)} = \boldsymbol{\Sigma}_j^{(t+1)} \boldsymbol{\alpha}_j$, where

$$\boldsymbol{\alpha}_j = \operatorname*{argmin}_{\boldsymbol{\alpha}} \left\{\frac{1}{2}\sum_i r_{ij} \left(\boldsymbol{Z}_i - \boldsymbol{\Sigma}_j^{(t+1)}\boldsymbol{\alpha}\right)^T \left[\boldsymbol{\Sigma}_j^{(t+1)}\right]^{-1} \left(\boldsymbol{Z}_i - \boldsymbol{\Sigma}_j^{(t+1)}\boldsymbol{\alpha}\right) + \rho_W \left\|\boldsymbol{\alpha}\right\|_1\right\} \qquad (3.16)$$

This minimization is then carried out with the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) algorithm, as implemented by the lbfgs function in the R package {lbfgs} (Coppola, Stewart, & Okazaki, 2014); a sketch of this update is given after Table 3.1.

3.3 Simulation Study

We used extensive simulations to evaluate the performance of our approach, specifically its Type I error rate and power, and to present a method for choosing the number of clusters. In each replicate of a simulation, a population of 200,000 individuals was randomly generated based on pre-defined population parameters; each individual had 10 genes, 4 continuous biomarkers and a continuous outcome variable. A sample of 2000 individuals was then randomly selected from the population, and the two-step integrated analysis was performed on this sample. In Step 1, variables with estimated effects greater than the average estimated effect in their variable group, or greater than a fixed threshold, were selected. In Step 2, only the variables selected in Step 1 were used for estimation and statistical testing. In all simulations in this section, the true number of latent clusters was 2. Results of the simulation study for a binary outcome are shown in the Appendix.

3.3.1 Factors With Impact on Clustering and Estimation Performance

The simulation scenarios used to explore factors that affect the performance of clustering and estimation are listed in Table 3.1; the first scenario is used as the reference. The performance of our integrated analysis is evaluated by several metrics, including the estimated parameters and their standard errors or confidence intervals, the power for detecting genetic effects, and the area under the curve (AUC) for classification based on the estimated latent clusters.

Table 3.1 Simulation scenarios to explore factors with impact on performance of clustering and estimation.

Gene OR | # informative biomarkers | Diff. in biomarker means | Biomarker covariance | Diff. in outcome means | Outcome OR
2.0 | 2 | 0.8 | Independent* | 0.4 | 2.0
2.0 | 0 | 0.0 | Independent | 0.4 | 2.0
2.0 | 2 | 0.2 | Independent | 0.4 | 2.0
2.0 | 2 | 0.4 | Independent | 0.4 | 2.0
2.0 | 2 | 0.6 | Independent | 0.4 | 2.0
2.0 | 2 | 1.0 | Independent | 0.4 | 2.0
2.0 | 2 | 1.2 | Independent | 0.4 | 2.0
2.0 | 2 | 0.0 | Structured** | 0.4 | 2.0
2.0 | 2 | 0.8 | Structured | 0.4 | 2.0
2.0 | 2 | 0.8 | Independent | 0.0 | 1.0

*The covariance of any pair of biomarkers is zero.
**The covariance of the pair of informative biomarkers is 0.5 and -0.5 respectively, conditioning on the two clusters.
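As flagged in Section 3.2.5, the re-parameterized update (3.16) can be handed to OWL-QN, which handles the $L_1$ term on $\boldsymbol{\alpha}$ internally. The sketch below is ours, not the dissertation's code: only the smooth part of (3.16) and its gradient are coded, and the penalty is passed through the orthantwise_c argument of the lbfgs package.

```r
# Hedged sketch of the W-update (3.15)-(3.16) via OWL-QN in the lbfgs package.
library(lbfgs)
update_W_sparse <- function(Z, r_j, Sigma_j, rho_w) {
  P <- solve(Sigma_j)                          # precision matrix of cluster j
  fn <- function(a) {                          # smooth part of (3.16)
    D <- sweep(Z, 2, as.vector(Sigma_j %*% a))
    0.5 * sum(r_j * rowSums((D %*% P) * D))
  }
  gr <- function(a)                            # gradient; Sigma_j %*% P = I
    -colSums(r_j * sweep(Z, 2, as.vector(Sigma_j %*% a)))
  a_hat <- lbfgs(fn, gr, rep(0, ncol(Z)), invisible = 1,
                 orthantwise_c = rho_w,        # L1 penalty handled by OWL-QN
                 linesearch_algorithm = "LBFGS_LINESEARCH_BACKTRACKING")$par
  as.vector(Sigma_j %*% a_hat)                 # W_j = Sigma_j alpha_j
}
```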
We can see that the power of the integrated analysis is decreased when there is not much information gained from the biomarkers, but increased when more information is gained from the biomarkers. Besides the mean effects, information gained from the covariance matrices of biomarkers also has a substantial impact on the estimation of genetic effects and the latent clusters. In Figure 3.3, when the two biomarkers with cluster-specific mean of zero have different cluster specific covariance structures (Figure 3.3 (b)), the 95% confidence intervals of both estimated causal and null gene effects become smaller (Figure 3.3 (a)), the average power for detecting the causal gene effect is substantially increased (Figure 3.3 (c)), and the AUC of the estimated latent clusters becomes larger (Figure 3.3 (d)). 62 Figure 3.2 Impact of biomarker mean effects on estimating genetic effects. 63 Figure 3.3 Impact of structured cluster-specific covariance matrices of biomarkers on estimating genetic effects. 3.3.2 Statistical Testing: Type I Error Rate and Power Type I error rates for testing the effect of the latent clusters on the outcome are presented in Table 3.2. Including the outcome in the integrated analysis to estimating the latent clusters and model parameters can be seen as including the outcome data in generating the hypothesis that there are pre-specified number of latent clusters that have effects on the outcome. Therefore, the outcome data is used twice in estimating the latent clusters and testing their effects on the outcome, which inflates the Type I error rate (Table 3.2). On the other hand, if we only use the genetic and biomarker data to estimate latent clusters, and then regress the outcome on the estimated latent clusters, the Type I error rate of testing the effects of latent clusters on the outcome is preserved (Table 3.2). 64 Table 3.2 Type I error rate when making inference with or without using outcome data. Use Outcome Data Not Use Outcome Data 0.14 0.06 Therefore, to have a valid test of the effect of the latent clustering variable on the outcome, only genetic data and biomarkers are used in the second step of our integrated analysis to fit the model and estimate the latent clustering variable. Then, the effect of the latent clustering variable on the outcome is estimated by regressing the outcome on the estimated latent clusters. Based on Scenario 0 in Table 3.1, we evaluate the power of this approach when the true effect on Y is 0.4 for continuous Y, and 0.69 for binary Y respectively. In Table 3.3, we can see that the estimated effect of the latent clusters on the outcome is close to the true simulated value, and the test of this effect achieves a very high power. Table 3.3 Estimating effects of latent clusters on the outcome. Simulated value Effect Estimate (SE) Power 0.40 0.39 (0.08) 0.99 3.3.3 Choose the Number of Clusters In our integrated analysis, the number of latent clusters 𝑘 needs to be pre-specified. To select the optimal 𝑘 , the two-step integrated analysis approach is used on the same data set in each replicate to fit the model with a range of pre-specified values for 𝑘 . Then, the Bayesian Information Criterion (BIC) is calculated from the estimated parameters to select the optimal 𝑘 . In a simulation scenario for the continuous outcome where there are 4 simulated latent clusters in 65 the data, the model that pre-specifies 𝑘 to be 4 indeed has the lowest BIC among the models with a range of pre-specified values for 𝑘 from 2 to 6 (Figure 3.4). 
3.3.4 High Dimensional Genetic and Biomarker Data

We keep all the other parameters of Scenario 0 in Table 3.1 unchanged, except that the number of null SNPs is increased from 5 to 995 and the number of non-informative biomarkers from 2 to 98. With the regularization approach used in Step 1 of the integrated analysis, we are still able to detect the causal SNPs, although at lower power, and to preserve the Type I error rate for the null SNPs (Figure 3.5).

Figure 3.5 Power of detecting genetic effects when there are 5 causal and 5 null SNPs, with 2 informative and 2 non-informative biomarkers (left), and when there are 5 causal and 995 null SNPs, with 2 informative and 98 non-informative biomarkers (right).

3.4 Application on WHI (Women's Health Initiative) Data

3.4.1 Data Description and Analysis

The WHI (Women's Health Initiative) is a nation-wide, long-term health study focusing on heart disease, breast and colorectal cancer, and osteoporotic fractures in postmenopausal women (https://www.whi.org). Our data set is a case-control sample nested in the WHI Observational Study cohort, with 989 colorectal cancer cases and 987 controls. The SNPs genotyped in our data are from genes selected for their roles in folate metabolism, and the measured biomarkers also come from the folate metabolism pathway. In total, 286 SNPs and 9 biomarkers were measured on 1664 of these subjects. The 9 biomarkers have different scales and distributions; their summary statistics are listed in Table 3.4. Because of missing values, 86 SNPs, creatinine and methylmalonic acid were excluded from our analysis. Subjects with missing values on the remaining SNPs, biomarkers or covariates were also excluded, leaving 562 cases and 572 controls.

To adjust for covariates, including age, alcohol and supplements of various vitamins and folate, and to make the adjusted biomarkers normally distributed, we applied an inverse normal transformation to the residuals of the multivariate linear regression of the biomarkers on the covariates. These transformed residuals were used as the processed biomarkers; their means and standard deviations are shown in Table 3.5. With the 200 SNPs, the processed biomarkers summarized in Table 3.5 and the case-control status, and assuming there are 2 latent clusters that may have effects on the disease, we used the integrated analysis to fit the model, either including or excluding the outcome in Step 2. A range of penalties for the SNPs and biomarkers was tried in the fitting process, and the penalties with the smallest BIC were used.

Table 3.4 Summary statistics of biomarkers in the WHI data.

Name | Min. | Q1 | Median | Mean | Q3 | Max. | #NA's
Choline | 3.91 | 7.91 | 9.24 | 9.41 | 10.68 | 24.89 | 181
Creatinine | 0.1 | 0.6 | 0.7 | 0.72 | 0.8 | 3.4 | 678
Cysteine | 167.7 | 259.4 | 282.4 | 285.5 | 308 | 621.8 | 42
Folate | 0.9 | 9.26 | 16.12 | 19.66 | 26.06 | 121.9 | 61
RBC Folate | 75.5 | 414.9 | 568.6 | 601.5 | 740.8 | 2090 | 23
Homocysteine | 2.38 | 6.81 | 8.01 | 8.61 | 9.84 | 53.92 | 42
Methylmalonic acid | 51.4 | 121.1 | 152.9 | 176.4 | 199.4 | 1751 | 757
Vit. B6 | 8 | 40.6 | 64.75 | 97.49 | 108.9 | 794.6 | 80
Vit. B12 | 56.36 | 342.4 | 481.4 | 535.3 | 663.3 | 2863 | 63

Table 3.5 Mean and standard deviation of seven biomarkers after adjusting for covariates and inverse normal transformation.

Mean (SD) | Choline | Cysteine | Folate | RBC Fol. | Homocys. | Vit. B6 | Vit. B12
Cases | 0.04 (1.00) | 0.01 (1.00) | -0.07 (1.02) | -0.05 (0.98) | 0.10 (1.00) | -0.04 (1.01) | -0.04 (1.04)
Controls | -0.04 (0.99) | -0.01 (1.00) | 0.07 (0.96) | 0.05 (1.00) | -0.10 (0.98) | 0.04 (0.98) | 0.04 (0.95)
Table 3.5 Mean and standard deviation of seven biomarkers after adjusting for covariates and inverse normal transformation.

  Mean (SD)   Choline        Cysteine       Folate         RBC Fol.       Homocys.       Vit. B6        Vit. B12
  Cases       0.04 (1.00)    0.01 (1.00)    -0.07 (1.02)   -0.05 (0.98)   0.10 (1.00)    -0.04 (1.01)   -0.04 (1.04)
  Controls    -0.04 (0.99)   -0.01 (1.00)   0.07 (0.96)    0.05 (1.00)    -0.10 (0.98)   0.04 (0.98)    0.04 (0.95)

3.4.2 Results

In the fitted model, whether or not the outcome was included in Step 2 of the integrated analysis, none of the 200 SNPs was selected as having an effect on the latent clusters. Among the 7 biomarkers, folate, RBC folate (folate in red blood cells), homocysteine, vitamin B6 and vitamin B12 were informative. Their estimated cluster-specific means, when including or excluding the outcome in Step 2, are shown in Table 3.6 and Table 3.7 respectively. The estimates in Tables 3.6 and 3.7 are very similar, indicating that the outcome is not very informative and that the estimation is primarily driven by the biomarkers alone. Among the 5 informative biomarkers, folate, RBC folate and vitamin B12 have significantly non-zero cluster-specific means. If we use 0.5 as the cut-off for assigning cluster labels from the estimated cluster probabilities, then the plot of the first two principal components (PC) of these three biomarkers (Figure 3.6) shows that the estimated clusters have slightly different means for PC1.

Table 3.6 Estimated cluster-specific means and standard errors of seven biomarkers using SNPs, biomarkers and the outcome.

  Mean (SE)   Choline   Cysteine   Folate         RBC Fol.       Homocys.       Vit. B6        Vit. B12
  Cluster 1   N/A       N/A        -0.26 (0.08)   -0.40 (0.09)   0.05 (0.10)    -0.03 (0.08)   -0.23 (0.09)
  Cluster 2   N/A       N/A        0.08 (0.04)    0.13 (0.04)    -0.02 (0.04)   0.01 (0.04)    0.08 (0.04)

Table 3.7 Estimated cluster-specific means and standard errors of seven biomarkers using only SNPs and biomarkers.

  Mean (SE)   Choline   Cysteine   Folate         RBC Fol.       Homocys.       Vit. B6        Vit. B12
  Cluster 1   N/A       N/A        -0.26 (0.07)   -0.41 (0.08)   0.06 (0.10)    -0.04 (0.07)   -0.24 (0.08)
  Cluster 2   N/A       N/A        0.09 (0.04)    0.13 (0.04)    -0.02 (0.04)   0.01 (0.04)    0.08 (0.04)

Figure 3.6 Scatter plot of the first and second principal components colored by estimated clusters.

When excluding the outcome in Step 2, the estimated effect of the latent clusters on the outcome is very small (Table 3.8), which is expected, since the estimation here is primarily driven by the biomarkers, whose distributions do not differ greatly between cases and controls (Table 3.5).

Table 3.8 Estimated effects of clusters on the outcome.

                          Not Use Disease Status in Step 2    Use Disease Status in Step 2
  Effect estimate (SE)    0.10 (0.22)                         0.11 (0.20)

3.5 Discussion

In this chapter, we developed a method to integrate germline data, omic data and an outcome of interest using a clustering approach. We developed an EM algorithm for estimation, and proposed a 2-step integrated analysis to allow for high dimensional data and statistical testing. Using extensive simulations, we evaluated the impact of information from omic data on the estimation of model parameters, and examined the Type I error rate when using our method to test the effect of the latent clusters on the outcome of interest. As an example, we presented an application of our method to the WHI data. In the first step of our integrated analysis, penalization methods are used to obtain sparse solutions. However, standard statistical testing does not apply appropriately in this situation, since estimates from penalization methods are biased. Thus, only the second step of our integrated analysis involves estimating standard errors and statistical testing for model parameters.
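A minimal sketch of this valid two-step test is given below, assuming a hypothetical estimator `fit_latent_clusters(G, B)` that returns one cluster label per subject from the genetic data G and biomarkers B only, so that the outcome y is not used twice.

```python
import numpy as np
import statsmodels.api as sm

def test_cluster_effect(G, B, y):
    clusters = fit_latent_clusters(G, B)          # Step 2: no outcome data used
    X = sm.add_constant(clusters.astype(float))
    fit = sm.OLS(y, X).fit()                      # regress outcome on clusters
    return fit.params[1], fit.pvalues[1]          # effect estimate and p-value
```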
In Table 3.2 and Table A.1, the inflated Type I error rate results from using the outcome data twice, first to estimate the latent clusters and then to test their effect on the outcome. Therefore, when using this integrated analysis, we have two options depending on the goal of the analysis. If the goal is to estimate the latent clusters and generate hypotheses, we include the outcome in both steps of the integrated analysis. If the goal is to test the effect of the estimated latent clusters on the outcome, we include the outcome in Step 1 for variable selection, but in Step 2 exclude the outcome and use only the other variables selected in Step 1 to estimate the latent clusters, and then analyze the effect of the estimated latent clusters on the outcome of interest.

In this chapter, omic data such as biomarkers are measured on the entire sample. This may not be possible in some situations. For example, where there is reverse causation between the outcome and the biomarkers, only biomarkers measured on affected subjects before the onset of their disease, together with biomarkers measured on unaffected subjects, can be used to avoid bias. In this case, the EM algorithm in our method can be modified to allow for incomplete data on these biomarkers. Detailed formulas for the model and EM algorithm allowing for incomplete biomarker data are given in the Appendix.

The model used in this study is based on the joint distribution conditional on the genetic data. However, this is only valid when the ascertainment of the study sample is independent of the outcome, e.g. when the outcome is continuous and sampling from the population is independent of it. In studies where the ascertainment does depend on the outcome, e.g. case-control studies, the parameters estimated under this model are biased because of the different ascertainment probabilities for cases and controls. To account for ascertainment, assuming the ascertainment probabilities for cases and controls are known or can be well estimated, we can modify the model by adding an ascertainment indicator to the DAG and using the joint distribution conditional on both the genetic data and ascertainment. Detailed formulas for the model and EM algorithm that correct for ascertainment are given in the Appendix.

4 Two-Phase and Two-Stage Family-Based Designs for Genetic Epidemiological Studies using Sequencing Data

4.1 Background

4.1.1 Genome-Wide Association Studies (GWAS)

According to the National Human Genome Research Institute's website (http://www.genome.gov/), a GWAS is an approach by which markers are rapidly scanned across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. This kind of study became possible through the insights gained from the International HapMap Project, a multi-country effort to identify and catalog genetic similarities and differences in human beings (http://hapmap.ncbi.nlm.nih.gov/) (International HapMap, 2003). There are about 3 × 10⁹ base pairs in the entire human genome. The International HapMap Project demonstrated that genetic variants located within a short distance of each other on the human genome are in linkage disequilibrium (LD), so that the genetic variant at one locus can well predict the genetic variants at adjacent loci (typically over distances of 30,000 base pairs of DNA in the human genome) (Hardy & Singleton, 2009).
This finding suggests that genotyping approximately 500,000 carefully selected single-nucleotide polymorphisms (SNPs) as tags could cover most genetic variants with minor allele frequency (MAF) greater than 0.1 (W. Y. Wang, Barratt, Clayton, & Todd, 2005). Therefore, after genotyping those tag SNPs or markers along the entire genome for subjects with and without a certain disease, the genetic variants or regions associated with the disease can be identified by statistical analysis.

Study Designs for GWAS

One of the major problems in GWAS is unrecognized population heterogeneity, which can arise when individuals in the sample have different ethnic origins (Ott, Kamatani, & Lathrop, 2011). If not controlled for, population heterogeneity can distort the estimated association between genetic variants and disease, since it contributes to differences in allele frequencies of particular SNPs between cases and controls beyond any true association of those SNPs with the disease. To avoid such distortion, some investigators have adopted family-based designs for GWAS. By using cases and controls from the same families, the analysis is protected from population heterogeneity, since members of one family usually have the same ethnic origin. Several statistical methods have been proposed to analyze family-based GWAS, including the transmission disequilibrium test (TDT) and its general extension, the family-based association test (FBAT), which have been reviewed by several researchers (N. M. Laird & Lange, 2009; Ott et al., 2011). In addition, Chen and Abecasis used a generalized linear model whose variance-covariance matrix incorporates pedigree information to analyze family-based GWAS. They also used pedigree information to compute expected genotype scores for subjects not genotyped and combined this information with that from genotyped subjects in the final analysis (W. M. Chen & Abecasis, 2007).

In GWAS, every SNP is tested for its association with the disease, so a huge number of tests are performed. Thus, due to the penalty for multiple comparisons, detecting genome-wide significant associations requires a much larger sample size than in candidate gene studies. To achieve convincing statistical support for a disease association, a sample size of more than 4,000 is required if the disease-susceptibility alleles have MAFs of 0.1 and effect sizes of less than an odds ratio of 1.3 (W. Y. Wang et al., 2005). In early GWAS, the cost of genotyping one person with a high-density SNP panel was very high, which often made it infeasible to genotype the entire sample to obtain genotype information on all markers. A more cost-efficient strategy is to use a two-stage design. Thomas et al. reviewed the basic principles of two-stage study designs for GWAS. In the first stage, a proportion of the sample is genotyped using a high-density SNP panel. In the second stage, promising SNPs are genotyped using a customized SNP panel on the remainder of the sample (D. C. Thomas et al., 2009). Information from both stages is then combined in the analysis. This joint analysis has been shown to be more efficient than treating the second stage as a simple replication study (Skol, Scott, Abecasis, & Boehnke, 2006). Design parameters such as the sampling fraction for the first stage and the fraction of SNPs entering the second stage can be optimized with respect to cost or power (Skol, Scott, Abecasis, & Boehnke, 2007).
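A minimal sketch of this kind of joint two-stage analysis follows; the sample-size-weighted combination below follows the general form of the joint statistic of Skol et al. (2006), where z1 and z2 are the hypothetical stage-specific test statistics for a SNP and pi is the fraction of the total sample genotyped in Stage 1.

```python
from math import sqrt
from scipy.stats import norm

def joint_z(z1, z2, pi):
    # Weight each stage by the square root of its sample fraction
    return sqrt(pi) * z1 + sqrt(1.0 - pi) * z2

z = joint_z(z1=2.1, z2=2.5, pi=0.4)
p = 2 * norm.sf(abs(z))   # two-sided p-value, before any multiple-testing penalty
print(z, p)
```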
Another kind of two-stage GWAS is a hybrid of population-based and family-based analysis (Van Steen et al., 2005), which first uses a population-based analysis to screen for promising SNPs and then tests for association by family-based analysis, as reviewed by several investigators (N. M. Laird & Lange, 2009; Murphy, S, & Lange, 2010; D. C. Thomas et al., 2009).

Achievements and Limitations of GWAS

Since their first use in genetic epidemiology, GWAS have made great achievements in the search for associations between genetic variants and diseases. An online catalog of published GWAS has been created and is regularly updated (http://www.genome.gov/gwastudies/) (Welter et al., 2014). As of July 26, 2014, the catalog included 1942 publications and 13750 SNPs for 1113 diseases/traits. GWAS have explained part of the population variation for many complex traits, including Type I and Type II diabetes, obesity (BMI) and Crohn's disease; more examples can be found in Table 1 of the review by Visscher et al. (Visscher, Brown, McCarthy, & Yang, 2012).

In spite of the rich output from GWAS, there are also limitations. SNPs found in GWAS are usually only associated with the traits of interest, rather than being causal themselves. Moreover, although the number of SNPs identified as genome-wide significant by GWAS is by now quite large, they generally explain only a small proportion of the heritability of the traits, leaving the rest unexplained. One example is the study of adult height variation by GWAS. Although height is a highly heritable trait, with about 80% heritability, the more than 50 SNPs identified explain only about 5% of the variation in height despite the very large sample used (Visscher, 2008). SNPs found in a later study of height by Lango Allen et al., using a much larger sample, likewise did not explain a large proportion of the variation (Allen et al., 2010); based on their estimate, the proportion of variance explained by SNPs found in nearly half a million subjects would only reach approximately 15%. Besides the prediction analysis by Lango Allen et al. using SNPs significant in GWAS, a random genetic effect model was also used by other researchers to estimate the proportion of variance in height explained by all SNPs, both significant and non-significant (Visscher, Yang, & Goddard, 2010; J. Yang et al., 2010). Although more of the variance in height is captured by this method, it is still far from the total heritability. A similar phenomenon exists in GWAS of other traits, such as Crohn's disease, Type II diabetes and HDL cholesterol (Manolio et al., 2009). This problem is often referred to as the 'missing heritability' and has been reviewed by many investigators (Galvan, Ioannidis, & Dragani, 2010; Manolio et al., 2009).

4.1.2 The Post-GWAS Era

As discussed above, for many traits only a small proportion of the heritability can be explained by the findings of GWAS. Therefore, in the so-called 'post-GWAS era', finding the 'missing heritability' has become a major goal in genetic epidemiology. There are several possible directions to be taken, such as refining the phenotype, moving from main effects to gene-gene or gene-environment interactions, looking at epigenetic effects, using expression information, using genome annotation, studying structural variation and extending the analysis to rare variants (McCarthy & Hirschhorn, 2008). The remainder of this chapter will focus on the study of rare variants.
Rare Variants

Compared with common variants, rare variants appear much less frequently in the population, with MAFs usually less than 5% or 1%. There have been two major hypotheses with respect to the genetic etiology of complex disease, the 'Common Disease, Common Variant (CDCV)' hypothesis and the 'Common Disease, Rare Variant (CDRV)' hypothesis (Schork, Murray, Frazer, & Topol, 2009). The CDCV hypothesis is the one that GWAS rely on. By design, GWAS can only scan common variants for their association with a disease. Therefore, if rare variants do play an important role in the etiology of the disease, the heritability explained by them will not be identified by GWAS. Indeed, it has been suggested that rare variants may account for a substantial fraction of disease susceptibility (Bodmer & Bonilla, 2008; Cirulli & Goldstein, 2010; Fearnhead et al., 2004; Pritchard, 2001). Analyses from an evolutionary perspective later showed that more than 50% of variants are rare using 5% MAF as the cut-off, and rare variants are predicted to be more likely to be functional and to have larger effect sizes than common SNPs (Gorlov, Gorlova, Frazier, Spitz, & Amos, 2011). Similar results have been found in population genetic simulations, indicating that a large proportion of modern-day rare alleles have a deleterious effect on function and therefore a potential contribution to disease susceptibility (Maher, Uricchio, Torgerson, & Hernandez, 2012). Despite the finding by Wray et al. that rare variants do not seem to explain most GWAS results (Wray, Purcell, & Visscher, 2011), Goldstein pointed out that rare causal variants can still create association signals at common variants (Goldstein, 2011). Such findings provide support for the CDRV hypothesis, and great interest has been drawn to the study of rare variants and their relationship with human diseases.

Next Generation Sequencing (NGS)

Commercial SNP panels used in GWAS can only genotype common variants. Sequencing technologies are required to obtain information on rare variants and enable further research on them. Traditional sequencing technologies, such as the Sanger method, are both expensive and time-consuming, which makes it almost impossible to sequence a large genomic region, let alone the entire genome. Fortunately, recent developments in so-called 'next-generation sequencing (NGS)', or massively parallel sequencing, have greatly reduced both the cost and the time needed, by sequencing up to billions of individual DNA templates at one time, followed by genome sequence alignment, sequence assembly and variant detection. This class of technology has become commercially available from several companies; details can be found in the reviews by Shendure and Ji, and Casey et al. (Casey, Conti, Haile, & Duggan, 2013; Shendure & Ji, 2008). Since then, studies using large amounts of sequencing data have become feasible.

Although NGS has reduced the cost of DNA sequencing by over two orders of magnitude (Shendure & Ji, 2008), its cost is still very high, which often prevents investigators from having a large enough sample size for sufficient statistical power. Therefore, studies using sequencing data need to be well designed to increase their efficiency. Indeed, considering the principles used by NGS, optimal choices of the number of subjects to be sequenced and the coverage depth can be made with respect to different study goals (Sampson, Jacobs, Yeager, Chanock, & Chatterjee, 2011).
Kang and Marjoram also developed algorithms that optimally select a subsample of an existing sample for sequencing, either to maximize the number of new polymorphic sites detected or to improve the efficiency of imputation (Kang & Marjoram, 2012). For targeted resequencing studies, a probability-based approach was developed to select samples such that the yield of rare alleles can be substantially increased (Edwards, Song, & Li, 2011). For whole-genome sequencing studies, an optimized approach using global estimates of kinship was developed to select the subjects best able to represent the founder chromosomes of the population, thus increasing the total yield of alleles from sequencing (Edwards & Li, 2012). Another special design is DNA pooling, in which DNA samples from a group of subjects are pooled and sequenced at one time, thereby reducing the total cost (Sham, Bader, Craig, O'Donovan, & Owen, 2002). Using pooled DNA samples and NGS, allele frequencies can be estimated, making it possible to study genetic associations using pooled data. Much work has been done in this direction. For example, Prabhu and Pe'er used overlapping pools to identify carriers of rare variants (Prabhu & Pe'er, 2009); Wang et al. proposed a statistical procedure to detect disease associations with rare variants through resequencing of pooled DNA (T. Wang, Lin, Rohan, & Ye, 2010); Lee et al. explored the optimal number of individuals in a DNA pool for identifying rare variants through resequencing (Lee, Choi, Yan, Lifton, & Zhao, 2011); and Liang et al. proposed a hierarchical Bayesian model to analyze pooled NGS data and explored the optimal number and size of the pools (Liang, Thomas, & Conti, 2012). Beyond these designs specific to NGS, further developments of study designs for rare variants will be discussed later in this section.

Statistical Methods for Rare Variants

Because of the extremely low MAFs of rare variants and the even larger number of comparisons, traditional single-variant tests that essentially compare allele frequencies across outcome groups are poorly powered and require huge sample sizes, which usually cannot be achieved in practice (Bansal, Libiger, Torkamani, & Schork, 2010; B. Li & Leal, 2008). An extreme example mentioned by Hoffmann et al. (Hoffmann, Marini, & Witte, 2010) is a rare variant with MAF 0.46% in Type I diabetes cases and 0.67% in controls that was detected using a huge sample size of 17,730 (Nejentsev, Walker, Riches, Egholm, & Todd, 2009). Alternatively, methods that treat rare variants in the same genomic region as a group and combine their effects in some fashion for association testing have become popular because of their improved power; methods of this class are referred to hereafter as regional tests. Early examples of regional tests are the 'Cohort Allelic Sums Test (CAST)' (Morgenthaler & Thilly, 2007), the 'Combined Multivariate and Collapsing (CMC) method' (B. Li & Leal, 2008) and the weighted-sum method (Madsen & Browning, 2009). Because they essentially evaluate the overall genetic burden due to rare variants, these methods are often referred to as 'burden tests' (Neale et al., 2011). The CAST compares the sum of the mutant exonic sequences for the cases with that expected for the controls.
The CMC method first collapses rare variants using indicator variables within groups defined by criteria such as MAF, and then uses a multivariate test of the association between the groups and the disease, which can also include common variants as covariates. The weighted-sum method generalizes the CAST by assigning different weights to the rare variants according to their locations in the genome. However, these methods implicitly assume that rare variants in the same region have effects in the same direction and of similar size, allowing neither for large numbers of null variants nor for a mixture of deleterious and protective variants (Hoffmann et al., 2010). To accommodate variants with different directions of effect, Neale et al. proposed a different method using the C-alpha test statistic. Unlike the burden tests above, which essentially test the mean of the rare variant counts, C-alpha tests the difference between the observed variance of the allele counts and that expected under the null hypothesis that rare variants appear randomly, following a binomial distribution, in cases and controls. When there is a mixture of null, deleterious and protective variants in the region being tested, the performance of the C-alpha test is better than that of the burden tests (Neale et al., 2011). Later, Wu et al. proposed the sequence kernel association test (SKAT) as a generalization of the C-alpha test allowing for continuous traits and covariate adjustment. SKAT uses a generalized linear mixed model, combines the variants in the same region using a positive semi-definite kernel function and tests the variance of the random effect of this kernel function (Wu et al., 2011). The generalized linear mixed model allows additional covariates to be adjusted for, and different kernel functions can be used, offering much flexibility in combining the information from the variants (Schaid, 2010a, 2010b). The SKAT-type test has also been generalized by several groups to allow for family data (H. Chen, Meigs, & Dupuis, 2013; Ionita-Laza, Lee, Makarov, Buxbaum, & Lin, 2013; Schaid, McDonnell, Sinnwell, & Thibodeau, 2013; Schifano et al., 2012).

Another class of grouping methods addresses the loss of power when the proportion of causal to null variants in a region is low by constructing a risk index based on multiple rare variants within the region, instead of using all the rare variants in the region. The Bayesian Risk Index incorporates model uncertainty as well as the direction of effects in the selection of variants, and allows for inference at both the group and variant-specific levels (Quintana, Berstein, Thomas, & Conti, 2011). This method has been extended by adding another level of uncertainty to group regions and by integrating external biological variant-specific covariates to inform the selection of associated variants and regions (Quintana et al., 2012).

Study Designs for Rare Variants

As discussed earlier in this chapter, despite the availability of next-generation sequencing technology, it is still not practical to sequence everybody in a large-scale epidemiological study. The low MAFs and the higher penalty from multiple comparisons make it hard to achieve high enough power to detect significant hits for rare variants. Therefore, well-designed studies are needed in research on rare variants.
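To make the contrast between burden-style and variance-component (SKAT-style) tests concrete, the following is a minimal sketch of the two score-type quantities. These are generic illustrations of the ideas above, not the published implementations, which add MAF-based weights, covariate adjustment and asymptotic null distributions.

```python
import numpy as np

def region_statistics(G, y, w):
    # G: n x m array of rare-variant genotype counts for one region,
    # y: binary phenotype array, w: per-variant weights.
    resid = y - y.mean()              # score residuals under the null
    s = G.T @ resid                   # per-variant score contributions
    burden = (w @ s) ** 2             # burden: squared weighted sum --
                                      # loses power if effects have mixed signs
    skat_like = np.sum((w * s) ** 2)  # SKAT-style: weighted sum of squares --
                                      # insensitive to the direction of effects
    return burden, skat_like
```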
Two-phase designs, two-stage designs and family-based designs have drawn great attention in sequencing studies because of the advantages they provide (D. C. Thomas, Yang, & Yang, 2013).

Two-Phase Designs

As discussed in Chapters 1 and 2, in a two-phase design, a proper subset of the Phase I sample is selected for measurement of additional variables that are either too expensive or otherwise unrealistic to measure on the entire Phase I sample. Information from both phases is then combined in the final analysis, and the sampling probabilities can be optimized to increase overall cost-efficiency. Breslow et al. developed the fundamental statistical methods for two-phase designs in a series of papers (for example, see (Breslow & Holubkov, 1997b) and (Breslow et al., 2009b)); a detailed discussion of their methods can be found in previous chapters. A natural application of two-phase designs to sequencing studies is to treat genotyping tag SNPs or performing GWAS on the entire sample as Phase I, and the follow-up targeted sequencing of a well-chosen subset of the Phase I sample as Phase II. Chen, Craiu and Bull used two-phase stratified sampling designs for regional sequencing, in which tag SNPs for candidate genes or regions are genotyped on all subjects in Phase I, a proportion of subjects is selected into Phase II based on genotypes at one or more tag SNPs, and deep sequencing of the region is applied only to the Phase II subjects. The two phases are jointly analyzed using a weighted mean score function (Z. Chen, Craiu, & Bull, 2012). Later, Schaid et al. also applied two-phase designs to targeted follow-up of GWAS hits by NGS. Viewing the GWAS tag SNPs as imperfect surrogates for the underlying causal variants, and expecting the tag SNPs and the causal variants to be correlated, they have the GWAS serve as the first phase and the resequencing study as the second phase, with stratified sampling based on both tag SNP genotypes and case-control status. In simulation studies, they showed that this design improves power compared with sampling stratified only on case-control status, with the degree of improvement depending on the LD between the tag SNP and the causal variants, as well as the effect sizes of the causal variants (Schaid, Jenkins, et al., 2013).

Family-Based Designs

Although not as statistically efficient as designs using unrelated samples, family-based designs have become attractive again with the growing interest in rare variants in genetic epidemiology and the availability of NGS. Zhu et al. proposed affected sibpair designs to detect rare genetic variants in either candidate-gene or genome-wide association analyses (Zhu, Feng, Li, Lu, & Elston, 2010). Using computer simulations, Shi and Rao showed that families showing evidence of linkage in a particular region can be used to enrich for rare and functional variants in a sequencing study design (Shi & Rao, 2011). For sequencing studies, family-based designs have the following possible advantages (D. C. Thomas et al., 2013). Firstly, the co-segregation of disease and causal variants can make them more efficient at prioritizing potential causal variants for subsequent association testing in larger samples. Secondly, their ability to exploit Mendelian inheritance may improve the imputation of rare variants in untested samples (Cheung, Thompson, & Wijsman, 2013).
Thirdly, they can exploit between- and within-family comparisons for better power while being robust to bias from population stratification, as discussed earlier in the study design section for GWAS. Furthermore, a subset of family members can be sequenced initially to prioritize promising variants for subsequent association testing in an independent sample. This type of design is both a two-phase design (subsampling for sequencing) and a two-stage design (an independent sample is used after the first step), and has the potential to improve cost and statistical efficiency (D. C. Thomas et al., 2013). In addition, the power for testing the joint effect of rare variants in a region can be improved by estimating the weights using case-parents and unrelated controls, comparing population control allele frequencies with the allele frequencies in the parents of the cases (Jiang et al., 2014).

4.2 Formalization of the Problem

As discussed above, the 'missing heritability' is expected to be partially explained by rare variants, and advances in sequencing technology have made studies of rare variants in the human genome feasible. However, it is still too expensive to sequence the entire sample in large-scale epidemiological studies, and the limited sample size, low MAFs and high penalty for multiple comparisons make it even harder to achieve enough power to detect associations between rare variants and the trait of interest. Therefore, carefully designed sequencing studies are needed to improve the efficiency of rare variant research.

A hybrid two-phase and two-stage design using pedigrees has such potential (D. C. Thomas et al., 2013). Stage I of this design consists of two phases. In the first phase, pedigrees with one or more affected members are ascertained, and as many members as possible are genotyped using a GWAS panel. In the second phase, a properly selected subset of the members of each pedigree is sequenced, and information from both phases is combined to prioritize, from all the variants discovered by sequencing, a set of rare variants that may be associated with the trait. Then, in Stage II, a different set of people, independent of those used in Stage I, is genotyped using a customized panel covering only the variants prioritized in Stage I, and association tests are performed only for those variants. This design has the following potential advantages. By using a pedigree sample in Stage I, it can exploit the co-segregation of the disease and causal variants and improve the efficiency of prioritizing variants for Stage II (D. C. Thomas et al., 2013). Since only a proportion of the entire sample is sequenced, it can improve cost-efficiency. And because of the much smaller number of comparisons made in Stage II, it can improve the power of the association tests.

In spite of these potential merits, there are still open questions to be answered about this design. Firstly, proper statistical methods need to be developed to prioritize promising variants for Stage II using the information from both phases of Stage I. Secondly, more work is needed to explore the optimal strategy for selecting pedigree members for sequencing. Some guidance has recently been proposed for sequencing choices in pedigrees (Cheung, Marchani Blue, & Wijsman, 2014), but it is based only on optimizing the performance of genotype imputation, involving neither the performance of any association test nor the phenotypes of the pedigree members.
It is also interesting to explore how existing GWAS information (if available) can assist in making the sequencing choices. Thirdly, whether the sample used in Stage II for association testing should consist of pedigrees or unrelated people is not clear. Fourthly, the optimal choices of the cut-off for prioritization and the sample size allocation between Stage I and Stage II need to be explored. The goal of this chapter is to find a comprehensive approach to addressing these open questions within the two-phase and two-stage family-based sequencing design framework. For brevity, the latter three questions will be collectively called the 'design considerations' in the remainder of this chapter.

4.3 Methods

4.3.1 Variant Prioritization

Single Variant Score Test Criterion

Score statistics are easy to compute and do not require the specification of an alternative hypothesis, and thus are natural candidates for prioritizing promising variants. Following the principle of the score contribution computed by Ionita-Laza et al. for the variants shared by pairs of relatives in a family, based on their population frequency and degree of relationship (Ionita-Laza et al., 2011), we extend this idea to incorporate all available phenotype information in a pedigree, including the phenotypes of subjects without sequence data. Suppose each single variant is indexed by the subscript $v$ and each pedigree by the subscript $f$. Let $Y_f$ and $\mu_f$ denote the phenotype vector and the phenotype mean for the members of pedigree $f$, $G_{vf}$ denote the genotype vector of variant $v$ for the members of pedigree $f$, $\Phi_f$ denote the matrix of kinship coefficients for pedigree $f$, and $q_{vf}$ denote the MAF of variant $v$ among the sequenced members of pedigree $f$. Then, the score statistic with the above extension is

$$T_v = \sum_f t_{vf} = \sum_f (Y_f - \mu_f \mathbf{1})^\top \, \Phi_f \, (G_{vf} - q_{vf} \mathbf{1}) \qquad (4.1)$$

In the score statistic above, for members who are not sequenced, the difference between the genotype and the minor allele frequency is set to zero. However, the inclusion of the kinship coefficients for pairs of sequenced and unsequenced members allows their phenotypes to make a contribution. Under the null hypothesis, this score statistic has zero mean and its asymptotic variance is $\sum_f t_{vf}^2$. For each variant discovered, the standardized score test statistic in (4.2) is computed:

$$u_v = \frac{\sum_f t_{vf}}{\sqrt{\sum_f t_{vf}^2}} \qquad (4.2)$$

Variants are then ranked in descending order of this score statistic, and a certain number of top-ranked variants are prioritized for the analysis in Stage II.

Multi-Variant Regional Test Criterion

Recently, kernel-based multi-variant regional tests such as SKAT (Wu et al., 2011) have been generalized to accommodate pedigrees by several groups (H. Chen et al., 2013; Ionita-Laza et al., 2013; Schaid, McDonnell, et al., 2013; Schifano et al., 2012). In the same spirit, the single variant score test criterion can be generalized to prioritize multi-variant regions for Stage II, using both the phenotype information on all pedigree members and the sequencing data on the members selected for sequencing. This generalization only involves combining the score statistics of the variants in the same region $R$, and the resulting score test statistic is

$$T_R = \frac{\sum_f \sum_{v \in R} t_{vf}}{\sqrt{\sum_f \sum_{v \in R} t_{vf}^2}} \qquad (4.3)$$

The score test statistic in (4.3) currently does not use the LD structure in the region; incorporating information from the LD structure could potentially improve its performance.
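A minimal sketch of the single-variant prioritization statistic in (4.1)-(4.2) follows, assuming per-pedigree arrays: Y (phenotypes of all members), Phi (kinship matrix), G (genotype counts) and a boolean indicator of who was sequenced; the reconstruction of (4.2) above is the form implemented here.

```python
import numpy as np

def pedigree_score(Y, Phi, G, q, sequenced):
    # t_{vf}: contribution of one pedigree to the score for one variant;
    # unsequenced members contribute zero genotype information, but their
    # phenotypes still enter through the kinship matrix Phi.
    centered_g = np.where(sequenced, G - q, 0.0)
    return (Y - Y.mean()) @ Phi @ centered_g

def variant_statistic(pedigrees):
    # pedigrees: list of (Y, Phi, G, q, sequenced) tuples, one per family f
    t = np.array([pedigree_score(*p) for p in pedigrees])
    return t.sum() / np.sqrt((t ** 2).sum())   # standardized score u_v
```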
4.3.2 Design Considerations

Sequencing Sample Selection

In the second phase of Stage I, a subset of pedigree members is chosen for sequencing. The sequencing sample selection strategy needs to be carefully developed to obtain good performance in discovering and prioritizing causal variants.

Ad hoc Approach

The first approach is to sequence a certain number of members in each pedigree, pre-specifying the characteristics of the members to be sequenced, such as whether they are affected, the relationships among them, etc. Since this approach does not use any statistical technique, it is called the ad hoc approach. For example, in each pedigree, select for sequencing 2 cases that are at least second-degree relatives, and 1 control.

Iterative Approach

The second approach is an iterative approach that incorporates existing GWAS hits in the selection process. Specifically, this approach optimizes the sequencing sample selection by maximizing the sum of the statistic used for prioritization over the existing GWAS hits. In other words, the existing GWAS data are treated as less dense sequencing data and used to train the sequencing sample selection strategy. Since GWAS hits are assumed to be in LD with the causal variants, a sequencing sample selection strategy that is optimal for prioritizing the GWAS hits is expected to perform well for prioritizing causal variants.

With the idea described above, and in the spirit of the joint-prioritized selection algorithm of Cheung et al. (Cheung et al., 2014), the following iterative algorithm using the existing GWAS data as 'sequence' data is proposed for sequencing sample selection. Assume that the total number of subjects in all pedigrees is 𝑛, from which 𝑚 subjects will be selected for sequencing. To select the first two subjects, all pairs of subjects are considered in turn: each pair's GWAS data are retained while the remaining 𝑛 − 2 subjects' SNPs are hidden, and a chosen statistic (e.g. the sum of the score statistics over the GWAS hits of interest) is calculated based on the SNP data of these two subjects and the phenotype data of the whole pedigree. The pairs of subjects are ranked in descending order of this score, and the top 𝑠 pairs are chosen as the 'seeds' for the second step. At the second step, for each seed chosen in the first step, the remaining 𝑛 − 2 subjects are explored in the same way as before; the 𝑠(𝑛 − 2) trios are ranked in descending order of their scores and the top 𝑠 trios are chosen as the new 'seeds' for the third step. At the third step, for each seed chosen in the second step, the remaining 𝑛 − 3 subjects are explored in the same way; the 𝑠(𝑛 − 3) quartets are ranked in descending order of their scores and the top 𝑠 quartets are chosen as the new 'seeds' for the next step. The algorithm proceeds in this fashion, step by step, until there are 𝑚 subjects in each of the 'seeds'; the top-seeded 𝑚 subjects are then chosen for sequencing.

Simplified Iterative Approach

The iterative approach described above can be very computationally expensive for large sample sizes. When 𝑛 = 6600, 𝑚 = 900 and 𝑠 = 10, the number of iterations at the first step is about 2.2 × 10⁷, the total number of iterations at the other steps is about 5.3 × 10⁷, and thus the total number of iterations is about 7.5 × 10⁷. To reduce the computational complexity, instead of keeping the top 𝑠 seeds, we keep only the top seed at each step. This reduces the complexity of the above setting to about 2.3 × 10⁷.
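A minimal sketch of this simplified (single-seed, greedy) selection follows; keeping the top 𝑠 subsets at each step instead of one would recover the beam-style search described above. `score_subset` is a hypothetical function evaluating a candidate sequencing subset, e.g. the sum of prioritization statistics over the existing GWAS hits.

```python
from itertools import combinations

def greedy_select(subjects, m, score_subset):
    # Seed with the best-scoring pair (all pairs are evaluated once)
    best = max(combinations(subjects, 2),
               key=lambda pair: score_subset(set(pair)))
    chosen = set(best)
    while len(chosen) < m:
        # Add the single subject that most increases the criterion
        nxt = max((s for s in subjects if s not in chosen),
                  key=lambda s: score_subset(chosen | {s}))
        chosen.add(nxt)
    return chosen
```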
Keeping only the top seed, however, potentially degrades the performance of the iterative approach.

Choice Between Pedigree and Unrelated Samples in Stage I and Stage II

Although the case-control design using unrelated individuals has become the standard approach in the GWAS era, because of its greater efficiency than pedigree data for the analysis of common variants, the co-segregation of affected members and causal rare variants within pedigrees can make pedigree samples more powerful for variant prioritization in Stage I and for the association test in Stage II. Simulation studies are used to compare the performance of pedigree versus unrelated samples for the Stage I and Stage II analyses separately.

Sample Size Allocation and Prioritization Cut-Off

Given a fixed total sample size (combining Stages I and II) and a particular sequencing sample selection strategy, different proportions of the total sample size allocated to Stage I and Stage II can result in different statistical efficiencies. Different cut-offs for prioritization can also influence the overall performance. We used a grid search through simulation studies to explore the optimal combination of sample size allocation and prioritization cut-off.

4.3.3 Simulation Studies

Data Generation

In the simulation studies, the genetic data are drawn from a single population of 10,000 haplotypes of length 250Kb generated by the COSI program using its best-fitting model (Schaffner et al., 2005). This population contains 5227 unique variants, 4630 of which have MAF less than 0.05.

Selecting Causal Variants

At each replicate, K rare variants (MAF less than 0.05 but greater than 0.01) were randomly designated as causal. The probability of being causal for each rare variant was assumed to be a function of its MAF:

$$\Pr(\mathrm{Causal}_v \mid \mathrm{MAF}_v) = \zeta_0 \cdot \exp\!\left(\zeta_1 \cdot \log\frac{\mathrm{MAF}_v}{0.05}\right) \cdot \exp\!\left(-\frac{\mathrm{MIN}(\mathrm{distance}_v)}{0.5 \cdot V}\right) \qquad (4.4)$$

By using a negative $\zeta_1$, variants with lower MAF are more likely to be causal. $\mathrm{MIN}(\mathrm{distance}_v)$ is the minimum distance of the variant to other causal variants, and $V$ is the simulated window for clustering of causal variants. The log relative risks of the causal variants were then assigned as

$$\beta_v = \exp\!\left(\sigma \cdot U_v + \mu_0 + \mu_1 \cdot \log\frac{\mathrm{MAF}_v}{0.05}\right) \qquad (4.5)$$

where $U_v \sim \mathrm{Unif}(0, 1)$.

Generating Pedigree Samples

We considered a fixed pedigree structure comprising 22 members in each 3-generation pedigree, with 2 children in each nuclear family. At each replicate, we randomly sampled a pair of haplotypes from the haplotype population for each founder in each pedigree, and then randomly dropped the founders' genotypes through the rest of the pedigree. For each pedigree member, we computed his or her disease risk based on the genotype and the risk model in (4.5), and randomly assigned disease status accordingly. A pedigree was retained if it had 4 or more cases, and this process continued until the desired number of pedigrees was achieved. We then randomly selected a subset of pedigrees as the Stage I sample. For each common SNP, we used the TDT to test the association between 𝐺 and 𝑌 in these Stage I pedigrees. If the number of significant SNPs after adjusting for multiple comparisons was less than 1, we discarded all the pedigree samples at this replicate and generated new pedigrees from the beginning; otherwise, we stored the significant SNPs from the Stage I pedigree samples.
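A minimal sketch of the gene-dropping step used above to generate pedigree genotypes is given below, assuming a hypothetical `pedigree` dictionary mapping each member to its (father, mother), with None for founders: founders receive haplotype pairs drawn from the simulated population, and each non-founder inherits one randomly chosen haplotype from each parent.

```python
import random

def gene_drop(pedigree, haplotype_pool):
    haplos = {}
    for member, parents in pedigree.items():   # assumes parents precede children
        if parents is None:                    # founder: draw from the population
            haplos[member] = (random.choice(haplotype_pool),
                              random.choice(haplotype_pool))
        else:                                  # non-founder: one haplotype per parent
            father, mother = parents
            haplos[member] = (random.choice(haplos[father]),
                              random.choice(haplos[mother]))
    return haplos
```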
Generating Case-Control Samples

At each replicate, we randomly chose pairs of haplotypes from the haplotype population and assigned disease status using the same disease model as for the pedigree samples, continuing in this manner until the required numbers of cases and controls were achieved. We then selected a subset of the case-control samples as the Stage I sample. For each common SNP, we used logistic regression to test the association between 𝐺 and 𝑌. If the number of significant SNPs after adjusting for multiple comparisons was less than 1, we discarded all the case-control samples at this replicate and generated new case-control samples from the beginning; otherwise, we stored the significant and nominally significant SNPs from the Stage I case-control samples.

Analysis

Stage I Pedigree Samples

We first selected pedigree members for sequencing in the Stage I pedigree samples using either the ad hoc approach or the simplified iterative approach. In the simplified iterative approach, only the SNPs that were significant in the Stage I pedigree sample and not highly correlated (𝑅² < 0.25) with other significant SNPs were used. We then computed the single variant score statistic (4.2) for each rare variant, ranked the rare variants discovered in descending order of the score statistic, and prioritized the top 𝛼 proportion of them for Stage II.

Stage I Case-Control Samples

Using the SNPs that were nominally significant in the Stage I case-control samples and not highly correlated with other nominally significant SNPs, we used logistic regression to construct a risk index for all Stage I case-control samples. We then stratified these samples into six strata, defined by risk index level (high [top quartile], medium [2nd and 3rd quartiles] and low [bottom quartile]) crossed with case-control status, and randomly sampled an equal number of individuals from each of the six strata for sequencing. We then fit a logistic regression model for each rare variant using these sequenced individuals. Finally, we ranked the rare variants discovered in ascending order of their p-values, and prioritized the top 𝛼 proportion of them for Stage II.

Stage II Pedigree Samples

For each rare variant prioritized in Stage I, we used the TDT to test the association between the rare variant and disease, adjusting only for the number of comparisons among the variants prioritized in Stage I.

Stage II Case-Control Samples

Likewise, for each rare variant prioritized in Stage I, we used logistic regression to test its association with disease, adjusting only for the number of comparisons among the variants prioritized in Stage I.

Comparing Case-Control and Pedigree Samples in Stage I and Stage II

We used simulation studies to compare the performance of case-control samples and pedigree samples in Stage I and in Stage II separately. Specifically, in Stage I we compared the performance of prioritizing causal rare variants using either case-control samples or pedigree samples; in Stage II, we used separate simulation studies to compare the performance of testing the association between rare variants and disease using either case-control samples or pedigree samples. The strategies with the better performance in each stage were then combined into the final overall strategy, which was used subsequently to explore the sample size allocation between the two stages and the prioritization cut-off in Stage I.
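As an illustration of the risk-index-stratified selection described above for the Stage I case-control samples, here is a minimal sketch, assuming hypothetical arrays `snps` (genotypes at the nominally significant SNPs) and `y` (case-control status): build a logistic-regression risk index, cut it into low / medium / high strata by quartiles, cross with case-control status, and sample equally from the six cells.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def select_for_sequencing(snps, y, per_stratum, seed=0):
    # Risk index: linear predictor from a logistic regression on the SNPs
    risk = LogisticRegression().fit(snps, y).decision_function(snps)
    q1, q3 = np.quantile(risk, [0.25, 0.75])
    level = np.where(risk < q1, "low", np.where(risk > q3, "high", "medium"))
    df = pd.DataFrame({"level": level, "case": y})
    # Equal-size random sample from each of the 3 x 2 strata
    sampled = df.groupby(["level", "case"]).sample(n=per_stratum, random_state=seed)
    return sampled.index.to_numpy()   # row indices of subjects chosen for sequencing
```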
4.4 Results

4.4.1 Comparing Approaches to Selecting Pedigree Members for Sequencing

Simulation studies in three scenarios were used to compare the performance of the ad hoc and the simplified iterative approach. In the first scenario, the simulated causal variants were assumed to have smaller MAFs but larger effects; in the second, more variants with moderate MAFs and moderate effects were simulated; and in the third, more of the simulated variants were assumed to have larger MAFs but smaller effects. The relationships between MAF and the probability of being causal, and between MAF and effect size, are illustrated in Figure 4.1. Note that the effect size in Figure 4.1 only represents the non-random part, but it should be sufficient for illustration purposes.

Figure 4.1 Three different simulation scenarios: causal variants have smaller MAFs but larger effects (red); moderate MAFs and moderate effects (green); larger MAFs but smaller effects (blue).

In each of the three scenarios in Figure 4.1, 300 pedigrees with 22 members each were generated as the Phase I sample at each replicate. The ad hoc approach and the simplified iterative approach were then used to select 900 pedigree members for sequencing. In the ad hoc approach, 2 cases that are at least second-degree relatives, and 1 control, were randomly selected from each pedigree. With the sequencing data, rare variants were then prioritized based on the corresponding statistics described in Section 4.3.1.

Table 4.1 Prioritizing causal and null variants by the ad hoc approach and the simplified iterative approach in three scenarios.

  Scenario                    Approach               Prop. Causal Variants    Prop. Null Variants
                                                     Prioritized              Prioritized
  Large Effect, Low MAF       Ad hoc                 62.2%                    5.4%
                              Simplified iterative   46.0%                    5.0%
  Moderate Effect and MAF     Ad hoc                 52.1%                    5.5%
                              Simplified iterative   31.4%                    4.9%
  Small Effect, High MAF      Ad hoc                 36.9%                    5.7%
                              Simplified iterative   29.4%                    4.8%

In Table 4.1, as the true causal variants have larger effects but lower MAFs, their chance of being prioritized by either the ad hoc approach or the simplified iterative approach increases. Comparing the two approaches in prioritizing variants in Stage I, the ad hoc approach prioritizes a higher proportion of causal variants than the simplified iterative approach in all three scenarios, although it also prioritizes a slightly higher proportion of null variants. It should be noted that the vast majority of variants prioritized at this stage are not causal. Therefore, it is also informative to look at the ratio of the number of causal variants prioritized to the number of null variants prioritized. In Figure 4.2, this ratio is also larger for the ad hoc approach than for the simplified iterative approach. Thus, these simulations show that, in Stage I, both approaches provide a much higher chance of prioritizing causal variants than null variants in all three scenarios, and that the ad hoc approach performs better than the simplified iterative approach.

Figure 4.2 Ratio of the number of true causal variants prioritized to the number of null variants prioritized by the ad hoc approach and the simplified iterative approach.

Table 4.2 Characteristics of pedigree members selected for sequencing by the ad hoc approach and the simplified iterative approach in three scenarios.
  Scenario                    Approach               # Affected / # Unaffected
  Large Effect, Low MAF       Ad hoc                 600 / 300
                              Simplified iterative   145 / 755
  Moderate Effect and MAF     Ad hoc                 600 / 300
                              Simplified iterative   95 / 805
  Small Effect, High MAF      Ad hoc                 600 / 300
                              Simplified iterative   115 / 785

When comparing the members selected for sequencing by these two approaches in Stage I, we can see that their selection strategies differ. In Table 4.2, although the total number of members selected for sequencing is the same for both approaches, the number of affected members selected by the simplified iterative approach is much smaller than that selected by the ad hoc approach in all three scenarios, which can negatively affect performance, since affected members are more likely to be carriers of the causal variants and thus more informative.

The distributions of the number of members sequenced per pedigree by the simplified iterative approach are shown in Figure 4.3. Unlike the ad hoc approach, the simplified iterative approach does not fix the number of members sequenced in each pedigree. In all three scenarios, while a large portion of pedigrees has 1 to 4 members sequenced, many pedigrees have more members sequenced; in particular, several pedigrees have all 22 members sequenced. This phenomenon can be explained by the greedy nature of the simplified iterative algorithm, which proceeds with only the top seed to the next iteration and thus tends to settle on a selection that is only locally optimal. In Figure 4.3, we can also see that when the simulated causal variants have larger MAFs but smaller effects, there are more pedigrees with only one member sequenced, indicating that more unrelated subjects are sequenced in this situation, whereas when the simulated causal variants have smaller MAFs but larger effects, there are more pedigrees with two or more members sequenced. This observation confirms the merit of family-based designs in the study of rare variants.

Figure 4.3 Distribution of the number of members sequenced per pedigree by the simplified iterative approach in scenarios where causal variants have smaller MAFs but larger effects (green); moderate MAFs and moderate effects (red); larger MAFs but smaller effects (blue).

4.4.2 Comparing Pedigree and Case-Control Samples in Stage I

Simulation studies in the scenarios of Figure 4.1 were used to compare pedigree and case-control samples in Stage I. In each of the three scenarios, 300 pedigrees with 22 members each, and 3300 cases and 3300 controls, were generated as Stage I samples at each replicate. For the pedigree sample, the same ad hoc approach as in Section 4.4.1 was used to select 900 pedigree members for sequencing; for the case-control sample, 150 subjects were randomly sampled from each of the three risk index strata in cases and in controls, in the manner described in Section 4.3.3 (thus 900 subjects were sequenced in total for the case-control sample). With the sequencing data, rare variants were then prioritized using the pedigree sample or the case-control sample in the corresponding ways described in the analysis subsection of Section 4.3.3.

Table 4.3 Proportion of causal and null variants prioritized using pedigree samples or case-control samples in Stage I in three scenarios.
  Scenario                    Sample          Prop. Causal Variants    Prop. Null Variants
                                              Prioritized              Prioritized
  Large Effect, Low MAF       Case-Control    77.7%                    11.7%
                              Pedigree        64.7%                    5.4%
  Moderate Effect and MAF     Case-Control    75.7%                    11.8%
                              Pedigree        51.2%                    5.6%
  Small Effect, High MAF      Case-Control    67.7%                    11.7%
                              Pedigree        32.6%                    5.7%

In Table 4.3, we can see that in all three scenarios, although the proportion of causal variants prioritized using the case-control sample is higher than that using the pedigree sample, the proportion of null variants prioritized using the case-control sample is double that using the pedigree sample. Thus, comparing the ratio of the number of true positives to the number of false positives among prioritized variants (Figure 4.4), the pedigree sample performs better in Stage I by achieving a better balance of true positive and false positive rates, while using the case-control sample in Stage I prioritizes too many null variants relative to causal variants, which reduces power in Stage II through the adjustment for more multiple comparisons. In Table 4.3, we can also see that as the simulated causal variants have lower MAFs but larger effects, the performance of the pedigree sample in Stage I becomes better, which again confirms the merit of family-based designs in the study of rare variants.

Figure 4.4 Ratio of the number of true causal variants prioritized to the number of null variants prioritized by the pedigree sample and the case-control sample in Stage I.

4.4.3 Comparing Pedigree and Case-Control Samples in Stage II

Similarly, simulation studies in the scenarios of Figure 4.1 were used to compare pedigree and case-control samples in Stage II. In each of the three scenarios, 700 pedigrees with 22 members each, or 7700 cases and 7700 controls, were generated as Stage II samples at each replicate. We assumed that all causal variants and a fixed number of randomly selected null variants had been prioritized in Stage I, and that genotype information on these prioritized variants was available for all subjects in Stage II. With these genotype data, the variants were then tested for association with the outcome using the pedigree sample or the case-control sample in the corresponding ways described in the analysis subsection of Section 4.3.3.

Table 4.4 Proportion of prioritized causal and null variants significant after Bonferroni correction in Stage II using pedigree samples or case-control samples in three scenarios.

  Scenario                    Sample          Prop. Causal Variants    Prop. Null Variants
                                              Significant              Significant
  Large Effect, Low MAF       Case-Control    81.2%                    1.7%
                              Pedigree        61.8%                    2.1%
  Moderate Effect and MAF     Case-Control    55.3%                    0.9%
                              Pedigree        43.4%                    1.0%
  Small Effect, High MAF      Case-Control    27.6%                    0.4%
                              Pedigree        21.4%                    0.4%

In Table 4.4, we can see that in all three scenarios, the proportion of prioritized causal variants found significant using the case-control sample is higher than that using the pedigree sample. In addition, the proportion of prioritized null variants found significant using either sample is below 5% in all three scenarios. Thus, the case-control design performs better for testing single-variant associations in Stage II.

4.4.4 Sample Size Allocation Between the Two Stages and Prioritization Cut-Off

Simulations were used to explore the influence of the sample size allocation between the two stages and the prioritization cut-off in Stage I on the performance of testing associations between variants and the outcome.
Based on the results in Sections 4.4.2 and 4.4.3, a two-stage design that uses a pedigree sample and the ad hoc approach for sequencing selection in Stage I, and a case-control sample in Stage II, is expected to perform better. Considering such designs in the scenario where causal variants are assumed to have larger effects but lower MAFs, a range of proportions of the total sample size (20%, 40%, 60% and 80%) allocated to Stage I, and a range of proportions (10%, 20%, 30% and 40%) of discovered variants prioritized in Stage I, were used in simulations. In total, 16 such two-stage designs with different combinations of these two parameters were considered. The overall power of detecting a significant association between a single variant and the outcome in each of these 16 designs is shown by the heat map in Figure 4.5.

In Figure 4.5, we can see that different combinations of sample size allocation and prioritization cut-off yield different overall power. When the proportion of variants prioritized in Stage I is too large, the overall power becomes lower because of the larger penalty for multiple comparisons in Stage II. On the other hand, when this proportion is too small, the overall power is not very high either, since the power of prioritizing causal variants is lower in this situation. The overall power is also influenced by the sample size allocated to Stage I. When the Stage I sample size is too small, the overall power is low because of the low power of prioritizing causal variants; when the Stage I sample size becomes larger, the overall power increases because the power of prioritizing causal variants increases while Stage II remains well powered.

Figure 4.5 Overall power of detecting causal variants for different combinations of sample size allocation and cut-off for prioritization in Stage I.

4.4.5 Using More Seeds in the Iterative Approach for Sequencing Sample Selection

We used a simulation in the scenario where causal variants are assumed to have smaller effects but higher MAFs as an example to explore the effect of using different numbers of top seeds in the iterative approach for sequencing sample selection in Stage I. When the number of top seeds was increased from 1 to 5, more affected members were selected and the proportion of causal variants prioritized increased, while the proportion of null variants prioritized also increased slightly, leaving the ratio of the number of true positives to the number of false positives unchanged (Table 4.5).

Table 4.5 Comparison of the iterative approach using the top 1 vs. 5 seeds.

  Top N Seeds Used                        1            5
  Prop. Causal Variants Prioritized       29.4%        31.7%
  Prop. Null Variants Prioritized         4.8%         5.2%
  # True Positive / # False Positive      0.013        0.013
  # Affected / # Unaffected Sequenced     115 / 785    183 / 717

4.4.6 Application to Colorectal Cancer Family Registry (Colon CFR) Data

The Colon CFR Cohort is an international consortium formed as a resource to support studies on colorectal cancer (http://www.coloncfr.org/), in which 10,662 families (62,353 individuals in total) have been recruited (http://epi.grants.cancer.gov/CFR/about_colon.html, accessed 05/28/2016). In a subset of data from the Colon CFR, we have colorectal cancer status measured on 22,920 individuals from 356 pedigrees.
4.4.6 Application to Colorectal Cancer Family Registry (Colon CFR) Data

The Colon CFR Cohort is an international consortium formed as a resource to support studies on colorectal cancer (http://www.coloncfr.org/), in which 10,662 families (62,353 individuals in total) have been recruited (http://epi.grants.cancer.gov/CFR/about_colon.html, accessed 05/28/2016). In a subset of data from the Colon CFR, we have colorectal cancer status measured on 22,920 individuals from 356 pedigrees. Among these 356 pedigrees, 339 have at least one pedigree member sequenced on 11 replicated regions identified by previous GWASs as associated with colorectal cancer in GWASeq, a targeted re-sequencing follow-up to GWAS (Salomon et al., 2016). In total, 1,006 pedigree members were sequenced. The regions sequenced are defined as the sequence on either side of a focal GWAS SNP, with their lengths determined by the local LD structure around the focal SNP. A detailed summary of the regions sequenced can be found in Table 2 of Salomon et al. (2016). We selected and analyzed the 8 pedigrees with at least 10 sequenced members as an example of using a family-based design to prioritize variants for future association testing, and of empirically evaluating approaches to selecting pedigree members for sequencing. Characteristics of these 8 pedigrees are summarized in Table 4.6.

Table 4.6 Characteristics of Colon CFR pedigrees analyzed.

Pedigree   Number of Affected (Sequenced)   Number of Unaffected (Sequenced)
1          14 (2)                           124 (9)
2          31 (13)                          281 (14)
3          23 (6)                           123 (6)
4          5 (3)                            64 (7)
5          19 (5)                           236 (11)
6          20 (8)                           166 (11)
7          19 (6)                           77 (5)
8          26 (6)                           132 (14)
Total      157 (49)                         1203 (77)

First, we used sequencing data from all 126 sequenced members (49 affected and 77 unaffected), together with phenotype data and pedigree information from all 1,360 pedigree members (both sequenced and unsequenced) in Table 4.6, to compute the score statistic (4.2) for each of the variants discovered. The 13,608 variants discovered were then sorted in descending order of their score statistics, and the top 25% (3,402) were prioritized. Next, a simulation study was used to evaluate the performance of four different ad hoc approaches to prioritizing variants. At each replicate, a given number of affected members of at least second-degree relationship and a given number of unaffected members of at least second-degree relationship were randomly selected in each pedigree. Sequencing data from only the selected members, together with phenotype data and pedigree information from all pedigree members, were then used in the same manner to compute score statistics for each of the variants discovered and to prioritize the top 25% of variants with the largest score statistics.

Figure 4.6 Proportion of the top 100 prioritized variants that remain prioritized when 3 randomly selected members per pedigree are sequenced.

From the results of 100 simulation replicates, a proportion of the top 100 variants prioritized using all sequencing data can still be prioritized by sequencing only 3 members in each of the 8 pedigrees. Moreover, this proportion increases as more affected members are sequenced.
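The evaluation just described reduces to a simple ranking-and-overlap computation. The following is a minimal sketch; the score arrays are placeholders for the dissertation's score statistic (4.2), which is not reproduced here.

```python
import numpy as np

def top_fraction(scores, frac=0.25):
    """Indices of the top `frac` of variants by score (descending)."""
    k = int(len(scores) * frac)
    return set(np.argsort(scores)[::-1][:k])

def overlap_top100(full_scores, subset_scores):
    """Proportion of the top 100 variants under the full sequencing data
    that are still prioritized (top 25%) when the scores are recomputed
    using only a random subset of sequenced members."""
    top100 = set(np.argsort(full_scores)[::-1][:100])
    return len(top100 & top_fraction(subset_scores, 0.25)) / 100
```

Averaging `overlap_top100` over random draws of 3 members per pedigree gives the proportion plotted in Figure 4.6.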
4.5 Discussion

In this chapter, we discussed the use of two-phase and two-stage designs in the study of association between genetic variants and a disease of interest using sequencing data. In the first stage, sequencing only a subset (a second-phase sample) of the study sample reduces the sequencing cost. We also proposed an iterative algorithm to select pedigree members for sequencing using pre-existing GWAS data. Using simulations in three scenarios with different characteristics of the causal variants, we evaluated the performance of this selection algorithm and compared the performance of using either a case-control sample or a pedigree sample in Stage I and Stage II. We also explored the influence of the prioritization cut-off and of the sample size allocation between the two stages on the power to detect causal variants. Besides the work presented in this chapter, we also published two papers on family-based designs and two-phase designs in sequencing studies; they are attached in the Appendix.

The results in this chapter confirm the merit of the pedigree sample for separating causal variants from null variants by exploiting the co-segregation of causal variants and disease within pedigrees, especially when the causal variants tend to be rare but have large effects on the disease. In Stage II, the case-control approach is still recommended because of its higher power. In this chapter, different designs were compared with respect to their performance in single-variant analysis; similar comparisons could also be made for multi-variant region analysis.

The number of top seeds used plays an important role in the performance of the iterative algorithm. A small number may make the algorithm too "greedy" and fail to select a good subset of pedigree members for sequencing; increasing the number of top seeds can improve performance. However, using a large number of top seeds may not be practical because of the computational cost of the additional seeds. Ideally, optimal performance could be achieved by considering every possible seed at each iteration, but this is equivalent to the brute-force solution, which is computationally infeasible. An alternative algorithm is to iteratively select a fixed number of members for sequencing in each pedigree. With this algorithm, however, the score remains 1 because of standardization if each pedigree is considered separately; and if all pedigrees are considered at once, it is computationally expensive to evaluate all possible combinations when selecting the first member. Recently, Wang et al. used a simulated annealing approach for sequencing selection (M. Wang, Jakobsdottir, Smith, & McPeek, 2016). Such a stochastic approach could reduce the computational cost of the iterative algorithm.

5 Conclusions and Future Research Directions

5.1 Conclusions

In this dissertation, we have discussed different types of two-step study designs in genetic epidemiology, and have addressed several interesting questions that arise when applying such designs in different settings. In Chapter 1, we reviewed the earlier development of two-phase and two-stage designs, and how the two-step idea has been applied to several statistical problems and study designs encountered in genetic epidemiology in order to improve efficiency. In Chapter 2, with appropriate statistical methods, we discussed two-phase case-control designs with latent variable models, focusing on the optimization of such designs in different scenarios. In Chapter 3, we developed a novel integrated analysis of germline, omic and disease data, focusing on integrating information from different sources to estimate subgroups of subjects with different risks of the outcome. In Chapter 4, we discussed two-phase and two-stage family-based designs for sequencing studies, focusing on comparing the performance of different designs and exploring the merit of family-based designs in the study of rare variants. Beyond the scope of this dissertation, many interesting questions remain to be addressed; we discuss some of them in the following section.
5.2 Future Research Directions

5.2.1 Use of Semi-Parametric Models in the Study of Two-Phase Case-Control Designs

In Chapter 2, we used only parametric models to develop the statistical method. Parametric models require corresponding assumptions about the distributions of the variables in the model; it is not easy to check whether these assumptions hold in practice, and when they are violated the parameter estimates may be biased. Therefore, to obtain a more robust method, it is desirable to use models that make fewer such assumptions, such as semi-parametric and non-parametric models. We could specify the conditional distribution of the latent variable X given the environmental exposure E and the candidate gene G non-parametrically, keeping the other parts of the model unchanged. In this way, we no longer need to assume that X is normally distributed conditional on E and G, an assumption that cannot be checked since X is not directly observed in the data. We could obtain a non-parametric maximum likelihood estimate of this conditional distribution using an EM algorithm to maximize over the points of support and the corresponding probabilities for X conditional on E and G (DerSimonian, 1986; Laird, 1978). Using this non-parametric estimate to replace the corresponding part of the prospective likelihood approach in Chapter 2, leaving everything else unchanged, gives rise to a semi-parametric approach: the conditional distribution of X given E and G is specified non-parametrically, while the conditional distributions of Z given X and of Y given X are specified parametrically. Furthermore, it may be possible to estimate the latent variable X and the model parameters simultaneously by an EM algorithm using the combined data from Phase I and Phase II.

5.2.2 Robustness of Optimal Sampling Fractions

In Chapter 2, we saw that the optimal sampling fractions for Phase II are sensitive to pre-specified parameter values, and that optimizing the expectation of the objective function may lead to a more robust optimal solution. In that analysis, however, we varied the value of only a single parameter while keeping the others fixed; the optimal sampling fractions probably depend on the values of the other unknown parameters as well. Thus, a more systematic way of exploring a larger parameter space is needed to find the optimal design, for example by averaging the design criterion over a prior on the unknown parameters, as sketched below. In addition, in some situations we may not be completely ignorant about some parameters. For example, we may already know the marginal effects of E and G on Y from the Phase I data, and thus the approximate magnitudes of the products β_E γ and β_G γ. This information could provide guidance on the plausible range of combinations of these parameters, and help narrow the parameter space to explore.
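The following Python sketch illustrates the "expected objective" idea mentioned above: a grid search over sampling fractions that minimizes the design criterion averaged over draws from a prior on the unknown parameters. The criterion `objective` is a hypothetical stand-in for the asymptotic variance computed in Chapter 2, which is not reproduced here.

```python
import numpy as np

def robust_fractions(objective, frac_grid, param_draws):
    """Pick Phase II sampling fractions minimizing the *expected* design
    criterion over prior draws of the unknown parameters.

    objective   : callable (fractions, params) -> criterion (lower is better);
                  an assumption of this sketch, not the chapter's exact formula
    frac_grid   : iterable of candidate sampling-fraction tuples
    param_draws : draws from a prior over the unknown parameters
    """
    best, best_val = None, np.inf
    for fracs in frac_grid:
        val = np.mean([objective(fracs, p) for p in param_draws])
        if val < best_val:
            best, best_val = fracs, val
    return best, best_val

# Toy criterion: a variance proxy penalizing small strata (illustrative only)
rng = np.random.default_rng(0)
draws = rng.lognormal(0.0, 0.5, size=50)            # prior draws for one parameter
grid = [(f1, 1 - f1) for f1 in np.linspace(0.1, 0.9, 17)]
crit = lambda f, p: p / f[0] + 1.0 / f[1]           # stand-in for Var(beta_hat)
print(robust_fractions(crit, grid, draws))
```

Replacing the single-parameter prior with a joint prior over several parameters extends the same idea to the larger parameter space discussed above, at the cost of more draws per grid point.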
5.2.3 Incomplete Biomarker Measurements and Ascertainment Correction

In Chapter 3, it is currently assumed that the entire study sample has biomarker measurements. This assumption will not hold in several situations: there could be missingness in the dataset, and when the disease or its treatment can influence the values of the measured biomarkers (i.e., 'reverse causation'), only biomarkers measured on controls and incident cases can be used to avoid bias. In this situation, we can modify our EM algorithm as follows: separate the subjects without biomarker measurements from those with biomarker measurements in the complete-data log-likelihood; compute the expected latent cluster memberships separately for subjects with and without biomarker measurements to obtain the expected complete-data log-likelihood in the E-step; and update the model parameters by maximizing the expected complete-data log-likelihood in the M-step (details can be found in the Appendix). This modified EM algorithm should enable us to fit the model with incomplete biomarker measurements, although there may be computational difficulties when too many subjects are missing biomarkers.

The probability model currently used in Chapter 3 is based on the joint distribution of the variables conditional on the genetic data. It is not appropriate for case-control sampling, since it does not account for the differential ascertainment of cases and controls; applied directly to case-control data, it yields biased parameter estimates. Therefore, modification is needed before it can be used in this situation. To take ascertainment into account, we can add to the DAG an additional variable representing ascertainment, which depends only on the outcome Y. If the ascertainment probabilities for cases and controls can be assumed known, then the complete-data log-likelihood can be written based on the joint probability of the other variables conditional on both the genetic data and ascertainment. With this complete-data log-likelihood, the remaining steps of the estimation process can be derived following the same principles.

5.2.4 Consider More Outcome Types in Integrated Analysis

In the current model in Chapter 3, the outcome of interest can be either continuous or binary. In practice, other outcome types are also worth considering. For example, time-to-event outcomes are of interest, since they would allow joint estimation of the underlying subgroups with different survival for a disease, together with the variables that play important roles in determining which subgroup a subject belongs to. Beyond the outcome, we can also include different types of variables in other parts of the model: for example, gene expression or methylation data as the "biomarkers", and/or copy number variation as the genetic component.

5.2.5 Stochastic Approaches to Select Pedigree Members for Sequencing

In Chapter 4, we selected members from pedigrees for sequencing in a way that improves the performance of prioritizing causal variants over null variants. Essentially, this can be viewed as an optimization problem with an objective function (e.g., the score statistic for the GWAS hits) and parameters to optimize (e.g., an indicator of sequencing for each pedigree member). For large pedigrees, and even for a moderate number of members to be selected for sequencing, the number of possible selections is enormous, making the brute-force method infeasible. In this situation, a stochastic approach such as simulated annealing could be a good alternative for reducing the computational burden of reaching the optimal solution; a sketch is given below.
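The sketch below shows one generic simulated annealing formulation of this selection problem. The objective `score_fn` is again a placeholder for the design score (e.g., a score statistic computed from GWAS data); the move set, cooling schedule and parameters are illustrative choices, not a prescription.

```python
import math
import random

def anneal_select(members, score_fn, n_select, n_iter=5000, t0=1.0, cool=0.999):
    """Simulated annealing over sequencing selections of fixed size.
    A move swaps one selected member for one unselected member; worse moves
    are accepted with a temperature-dependent probability, letting the
    search escape local optima that trap a greedy algorithm."""
    current = set(random.sample(members, n_select))
    cur_val = score_fn(current)
    best, best_val, t = set(current), cur_val, t0
    for _ in range(n_iter):
        out = random.choice(sorted(current))
        inn = random.choice([m for m in members if m not in current])
        cand = (current - {out}) | {inn}
        val = score_fn(cand)
        # Metropolis acceptance: always take improvements, sometimes worse moves
        if val > cur_val or random.random() < math.exp((val - cur_val) / t):
            current, cur_val = cand, val
            if cur_val > best_val:
                best, best_val = set(current), cur_val
        t *= cool   # geometric cooling schedule
    return best, best_val
```

Each iteration evaluates only one candidate selection, so the cost scales with the number of iterations rather than with the combinatorial number of possible selections.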
Appendix

Appendix 1. Derivation of the E-step

In the E-step, the expectation of Equation (3.5) is computed by replacing the unobserved $x_{ij}$ with its expected value $r_{ij}$:

$$ r_{ij} = f(X_i = j \mid \mathbf{G}_i, \mathbf{Z}_i, Y_i; \Theta^{(t)}) = \frac{f(X_i = j \mid \mathbf{G}_i; \Theta^{(t)})\, f(\mathbf{Z}_i \mid X_i = j; \Theta^{(t)})\, f(Y_i \mid X_i = j; \Theta^{(t)})}{\sum_{j'} f(X_i = j' \mid \mathbf{G}_i; \Theta^{(t)})\, f(\mathbf{Z}_i \mid X_i = j'; \Theta^{(t)})\, f(Y_i \mid X_i = j'; \Theta^{(t)})} \tag{A.1} $$

Appendix 2. Statistical Model for Binary Outcome

When the outcome is binary, the last two terms of Equations (3.5) and (3.6) are replaced by (A.2) and (A.3) below, respectively:

$$ \sum_i \sum_j x_{ij} Y_i \log \sigma(\gamma_j) + \sum_i \sum_j x_{ij} (1 - Y_i) \log\{1 - \sigma(\gamma_j)\} \tag{A.2} $$

$$ \sum_i \sum_j r_{ij} Y_i \log \sigma(\gamma_j) + \sum_i \sum_j r_{ij} (1 - Y_i) \log\{1 - \sigma(\gamma_j)\} \tag{A.3} $$

Here, $\sigma(\gamma_j) = 1/(1 + e^{-\gamma_j})$. The parameter $\boldsymbol{\gamma}$ is updated in the M-step by:

$$ \gamma_j^{(t+1)} = \log\frac{p_j}{1 - p_j}; \qquad p_j = \frac{\sum_i r_{ij} Y_i}{\sum_i r_{ij}} \tag{A.4} $$

Appendix 3. Factors Impacting Clustering and Estimation Performance: Binary Outcome

Figure A.1 Impact of biomarker mean effects on estimating genetic effects when the outcome is binary.

Figure A.2 Impact of structured cluster-specific covariance matrices of biomarkers on estimating genetic effects when the outcome is binary.

Appendix 4. Type I Error Rate and Power for Binary Outcome

Table A.1 Type I error rate when making inference with or without using Y (binary).

Use Outcome Data   Not Use Outcome Data
0.14               0.05

Table A.2 Estimating effects of latent clusters on the outcome (binary).

Simulated Value   Effect Estimate (SE)   Power
0.69              0.63 (0.16)            0.97

Appendix 5. EM Algorithm for Incomplete Biomarker Data

When a subset of the sample does not have measured biomarkers, the log-likelihood can be constructed as:

$$ l(\Theta) = l_C(\Theta) + l_{IC}(\Theta) \tag{A.5} $$

Here, $C$ stands for "complete" and $IC$ for "incomplete": the first term on the right-hand side is the log-likelihood from subjects with complete data, and the second is the log-likelihood from subjects without measured biomarkers. Assume the outcome is binary and apply the statistical model in Appendix 2. These two terms can then be written explicitly as (A.6) and (A.7):

$$ l_C(\Theta) = \sum_{i \in C} \sum_j x_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) - \frac{1}{2} \sum_{i \in C} \sum_j x_{ij} (\mathbf{Z}_i - \mathbf{W}_j)' \boldsymbol{\Sigma}_j^{-1} (\mathbf{Z}_i - \mathbf{W}_j) - \frac{1}{2} \sum_{i \in C} \sum_j x_{ij} \log |\boldsymbol{\Sigma}_j| + \sum_{i \in C} \sum_j x_{ij} Y_i \log \sigma(\gamma_j) + \sum_{i \in C} \sum_j x_{ij} (1 - Y_i) \log\{1 - \sigma(\gamma_j)\} \tag{A.6} $$

$$ l_{IC}(\Theta) = \sum_{i \in IC} \sum_j x_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) + \sum_{i \in IC} \sum_j x_{ij} Y_i \log \sigma(\gamma_j) + \sum_{i \in IC} \sum_j x_{ij} (1 - Y_i) \log\{1 - \sigma(\gamma_j)\} \tag{A.7} $$

As in the situation of complete biomarker data, in (A.6) and (A.7) $x_{ij}$ is the only unobserved variable in the data set. To estimate the model parameters, in the E-step we evaluate, for subjects with and without biomarker data respectively, the following two expressions $r_{ij}$ and $q_{ij}$ as the expected values of $x_{ij}$ in (A.6) and (A.7):

$$ r_{ij} = f(X_i = j \mid \mathbf{G}_i, \mathbf{Z}_i, Y_i; \Theta^{(t)}) = \frac{f(X_i = j \mid \mathbf{G}_i; \Theta^{(t)})\, f(\mathbf{Z}_i \mid X_i = j; \Theta^{(t)})\, f(Y_i \mid X_i = j; \Theta^{(t)})}{\sum_{j'} f(X_i = j' \mid \mathbf{G}_i; \Theta^{(t)})\, f(\mathbf{Z}_i \mid X_i = j'; \Theta^{(t)})\, f(Y_i \mid X_i = j'; \Theta^{(t)})} \tag{A.8} $$

$$ q_{ij} = \frac{f(X_i = j \mid \mathbf{G}_i; \Theta^{(t)})\, f(Y_i \mid X_i = j; \Theta^{(t)})}{\sum_{j'} f(X_i = j' \mid \mathbf{G}_i; \Theta^{(t)})\, f(Y_i \mid X_i = j'; \Theta^{(t)})} \tag{A.9} $$

With (A.8) and (A.9), the model parameters are updated in the M-step as follows:

$$ \mathbf{W}_j^{(t+1)} = \frac{\sum_{i \in C} r_{ij} \mathbf{Z}_i}{\sum_{i \in C} r_{ij}} \tag{A.10} $$

$$ \boldsymbol{\Sigma}_j^{(t+1)} = \frac{\sum_{i \in C} r_{ij} \big(\mathbf{Z}_i - \mathbf{W}_j^{(t+1)}\big)\big(\mathbf{Z}_i - \mathbf{W}_j^{(t+1)}\big)'}{\sum_{i \in C} r_{ij}} \tag{A.11} $$

$$ \gamma_j^{(t+1)} = \log\frac{p_j}{1 - p_j}; \qquad p_j = \frac{\sum_{i \in C} r_{ij} Y_i + \sum_{i \in IC} q_{ij} Y_i}{\sum_{i \in C} r_{ij} + \sum_{i \in IC} q_{ij}} \tag{A.12} $$

$$ \boldsymbol{\beta}^{(t+1)} = \arg\max_{\boldsymbol{\beta}} \Big\{ \sum_{i \in C} \sum_j r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) + \sum_{i \in IC} \sum_j q_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) \Big\} \tag{A.13} $$
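As a concrete illustration of the E-step in (A.8)-(A.9) and the weighted M-step mean update (A.10), here is a minimal Python sketch. The array `prior` stands in for the genetic model $f(X_i = j \mid \mathbf{G}_i)$, i.e., $S(\boldsymbol{\beta}, j, \mathbf{G}_i)$, which is not spelled out in this appendix; all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(prior, Y, Z, means, covs, gamma, has_biomarker):
    """Responsibilities r_ij (subjects with biomarkers) and q_ij (without),
    following the structure of (A.8)-(A.9).

    prior         : (n, k) array standing in for f(X_i = j | G_i)
    Y             : (n,) binary outcomes; Z : (n, d) biomarkers (rows for
                    subjects without biomarkers can be any placeholder,
                    e.g. zeros; they are masked out below)
    means, covs   : length-k lists of cluster means / covariance matrices
    gamma         : (k,) cluster-specific logits for f(Y = 1 | X = j)
    has_biomarker : (n,) boolean mask
    """
    n, k = prior.shape
    p_y = 1.0 / (1.0 + np.exp(-np.asarray(gamma)))      # sigma(gamma_j)
    lik_y = np.where(Y[:, None] == 1, p_y, 1.0 - p_y)   # f(Y_i | X_i = j)
    w = prior * lik_y                                    # q_ij numerator
    lik_z = np.ones((n, k))
    for j in range(k):                                   # f(Z_i | X_i = j)
        lik_z[:, j] = multivariate_normal.pdf(Z, means[j], covs[j])
    w = np.where(has_biomarker[:, None], w * lik_z, w)   # r_ij numerator
    return w / w.sum(axis=1, keepdims=True)              # normalize per subject

def m_step_means(resp, Z, has_biomarker):
    """Weighted cluster means using complete-data subjects only, as in (A.10)."""
    rc, Zc = resp[has_biomarker], Z[has_biomarker]
    return (rc.T @ Zc) / rc.sum(axis=0)[:, None]
```

The covariance update (A.11) and the logit update (A.12) follow the same weighted-average pattern, with `resp` rows for incomplete-data subjects entering only the outcome terms.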
Appendix 6. Ascertainment Correction for Case-Control Samples

Assume the ascertainment probabilities are known for cases and controls, e.g., in a case-cohort or nested case-control study. Let $A$ denote ascertainment, and let $\pi_1$ and $\pi_0$ denote the ascertainment probabilities for cases and controls, respectively. When all cases and controls have measured biomarkers, the ascertainment-corrected likelihood is constructed as follows:

$$ L_A(\Theta) = \prod_i f(X_i, \mathbf{Z}_i, Y_i \mid \mathbf{G}_i, A_i = 1) = \prod_i \frac{f(A_i = 1 \mid Y_i)\, f(X_i \mid \mathbf{G}_i)\, f(\mathbf{Z}_i \mid X_i)\, f(Y_i \mid X_i)}{\pi_1 \sum_X f(X \mid \mathbf{G}_i)\, f(Y = 1 \mid X) + \pi_0 \sum_X f(X \mid \mathbf{G}_i)\, f(Y = 0 \mid X)} \tag{A.14} $$

After omitting the terms not involving unknown parameters, the log-likelihood is written as:

$$ l_A(\Theta) = l(\Theta) - \sum_i \log \Big\{ \pi_1 \sum_X f(X \mid \mathbf{G}_i)\, f(Y = 1 \mid X) + \pi_0 \sum_X f(X \mid \mathbf{G}_i)\, f(Y = 0 \mid X) \Big\} \tag{A.15} $$

On the right-hand side of (A.15), the first term is the log-likelihood used in Chapter 3 without considering ascertainment; the second term corrects for ascertainment. Note that the second term does not depend on the latent variable $X$. Thus, the EM algorithm in this case is very similar to that in Chapter 3; the only difference is in the M-step, where the parameters $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ need to be updated as:

$$ \boldsymbol{\beta}^{(t+1)}, \boldsymbol{\gamma}^{(t+1)} = \arg\max_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \Big\{ \sum_i \sum_j r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) + \sum_i \sum_j r_{ij} Y_i \log \sigma(\gamma_j) + \sum_i \sum_j r_{ij} (1 - Y_i) \log\{1 - \sigma(\gamma_j)\} - \sum_i \log \Big[ \pi_1 \sum_X f(X \mid \mathbf{G}_i) f(Y = 1 \mid X) + \pi_0 \sum_X f(X \mid \mathbf{G}_i) f(Y = 0 \mid X) \Big] \Big\} \tag{A.16} $$

If biomarker data are incomplete for some cases, and we make the additional assumption that the ascertainment probability for these cases is $\pi_1^*$, then the ascertainment-corrected log-likelihood becomes:

$$ l_A(\Theta) = (\mathrm{A.5}) - \sum_{i \in C} \log \Big[ \pi_1 \sum_X f(X \mid \mathbf{G}_i) f(Y = 1 \mid X) + \pi_0 \sum_X f(X \mid \mathbf{G}_i) f(Y = 0 \mid X) \Big] - \sum_{i \in IC} \log \Big[ \pi_1^* \sum_X f(X \mid \mathbf{G}_i) f(Y = 1 \mid X) + \pi_0 \sum_X f(X \mid \mathbf{G}_i) f(Y = 0 \mid X) \Big] \tag{A.17} $$

Then, in the M-step:

$$ \boldsymbol{\beta}^{(t+1)}, \boldsymbol{\gamma}^{(t+1)} = \arg\max_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \Big\{ \sum_{i \in C} \sum_j r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) + \sum_{i \in IC} \sum_j q_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) + \sum_{i \in C} \sum_j r_{ij} \big[ Y_i \log \sigma(\gamma_j) + (1 - Y_i) \log\{1 - \sigma(\gamma_j)\} \big] + \sum_{i \in IC} \sum_j q_{ij} \big[ Y_i \log \sigma(\gamma_j) + (1 - Y_i) \log\{1 - \sigma(\gamma_j)\} \big] - \sum_{i \in C} \log \Big[ \pi_1 \sum_X f(X \mid \mathbf{G}_i) f(Y = 1 \mid X) + \pi_0 \sum_X f(X \mid \mathbf{G}_i) f(Y = 0 \mid X) \Big] - \sum_{i \in IC} \log \Big[ \pi_1^* \sum_X f(X \mid \mathbf{G}_i) f(Y = 1 \mid X) + \pi_0 \sum_X f(X \mid \mathbf{G}_i) f(Y = 0 \mid X) \Big] \Big\} \tag{A.18} $$

Computational challenges should be expected in solving (A.16) and (A.18).
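The ascertainment correction term in (A.15) is straightforward to compute once the cluster prior and the case probabilities per cluster are available. A minimal sketch follows; `prior` and `p_case_given_x` are assumptions of this sketch, standing in for $f(X \mid \mathbf{G}_i)$ and $f(Y = 1 \mid X)$ respectively.

```python
import numpy as np

def ascertainment_log_correction(prior, p_case_given_x, pi1, pi0):
    """Per-subject correction term of (A.15):
    log{ pi1 * sum_x f(x|G_i) f(Y=1|x) + pi0 * sum_x f(x|G_i) f(Y=0|x) }.

    prior          : (n, k) array standing in for f(X = j | G_i)
    p_case_given_x : (k,) array standing in for f(Y = 1 | X = j)
    pi1, pi0       : known ascertainment probabilities for cases / controls
    """
    s1 = prior @ p_case_given_x            # sum_x f(x|G_i) f(Y=1|x)
    s0 = prior @ (1.0 - p_case_given_x)    # sum_x f(x|G_i) f(Y=0|x)
    return np.log(pi1 * s1 + pi0 * s0)     # one value per subject

# The corrected log-likelihood subtracts the sum of this vector from l(Theta).
```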
Appendix 7. Publication: Two-Phase and Family-Based Designs for Next-Generation Sequencing Studies

REVIEW ARTICLE
published: 13 December 2013
doi: 10.3389/fgene.2013.00276

Two-phase and family-based designs for next-generation sequencing studies

Duncan C. Thomas*, Zhao Yang and Fan Yang
Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA

Edited by: Xuefeng Wang, Harvard University, USA
Reviewed by: Jigang Zhang, Tulane University, USA; William C. L. Stewart, Columbia University, USA
*Correspondence: Duncan C. Thomas, Department of Preventive Medicine, University of Southern California, 1450 Biggy Street, NRT 2502, Los Angeles, CA 90089-9601, USA. e-mail: pdthomas@usc.edu

The cost of next-generation sequencing is now approaching that of early GWAS panels, but it is still out of reach for large epidemiologic studies, and the millions of rare variants expected pose challenges for distinguishing causal from non-causal variants. We review two types of designs for sequencing studies: two-phase designs for targeted follow-up of genomewide association studies using unrelated individuals; and family-based designs exploiting co-segregation for prioritizing variants and genes. Two-phase designs subsample subjects for sequencing from a larger case-control study jointly on the basis of their disease and carrier status; the discovered variants are then tested for association in the parent study. The analysis combines the full sequence data from the substudy with the more limited SNP data from the main study. We discuss various methods for selecting this subset of variants and describe the expected yield of true positive associations in the context of an on-going study of second breast cancers following radiotherapy. While the sharing of variants within families means that family-based designs are less efficient for discovery than sequencing unrelated individuals, the ability to exploit co-segregation of variants with disease within families helps distinguish causal from non-causal ones. Furthermore, by enriching for family history, the yield of causal variants can be improved, and use of identity-by-descent information improves imputation of genotypes for other family members. We compare the relative efficiency of these designs with those using unrelated individuals for discovering and prioritizing variants or genes for testing association in larger studies. While associations can be tested with single variants, power is low for rare ones. Recent generalizations of burden or kernel tests for gene-level associations to family-based data are appealing. These approaches are illustrated in the context of a family-based study of colorectal cancer.

Keywords: sequencing, two-phase sampling design, family-based study, rare variant association, breast neoplasms, colorectal cancer

INTRODUCTION
In the early days of genomewide association studies (GWAS), the cost of commercial high-density genotyping panels was prohibitive for the large-scale epidemiologic studies needed to detect the modest relative risks (RRs) now known to be associated with most common variants for complex diseases (Hindorff et al., 2009). Hence, investigators turned to multi-stage designs, in which only a sample of subjects were genotyped on such platforms and a generous selection of the most significant associations were then tested on an independent sample using custom genotyping techniques. The final analysis typically combined the data from both stages, with a final significance level chosen to ensure genomewide significance after allowing for the number of variants tested in the second. The basic principles were based on a series of papers written before the GWAS era (Satagopan et al., 2002, 2004; Satagopan and Elston, 2003), and subsequent work showed how to optimize the allocation of sample size and first-stage critical values in the GWAS context (Wang et al., 2006; Skol et al., 2007). In particular, Skol et al. (2006) showed that this joint analysis was more powerful than treating the design as discovery followed by independent replication, despite various high-profile journals' requirements for an independent replication study (Panagiotou et al., 2012). Although this became the conventional GWAS design throughout the first decade of the 21st century, rapidly declining costs of commercial GWAS chips have made it feasible for many studies to obtain genome-wide coverage on all available subjects in a single stage (Thomas et al., 2009a,b). The cost of custom genotyping for large numbers of hand-picked SNPs was often comparable to standard high-density panels, and having more subjects with genome-wide data allowed for more informative analysis of interactions, subgroups, pleiotropic effects, etc. For a general review of multi-stage designs in genetics, see Elston et al. (2007).

As we entered the "post-GWAS" era, the focus began to shift toward rare variants and the use of next-generation sequencing (NGS) technologies that could in principle (given a large enough sample size and deep enough sequencing) uncover all the genetic variation in a region, not just the common SNPs that have been used to tag the unknown causal variants.
In part, this interest stemmed from the increasing recognition that common variants were accounting for only a relatively small proportion of the total heritability of most complex diseases (Manolio et al., 2009; Schork et al., 2009). Amongst other possible explanations for the "missing heritability," rare variants have been proposed, based on an evolutionary argument (Gorlov et al., 2011) or empirical evidence (Bodmer and Bonilla, 2008) that their effect sizes could be larger, although recent whole-exome sequencing studies have cast some doubt on this hypothesis (e.g., Heinzen et al., 2012). Furthermore, since rare variants tend not to be well tagged by common ones (Duan et al., 2013), use of conventional GWAS panels would tend to miss associations with rare variants. Cost currently precludes application of NGS to whole-genome sequencing on a large scale, so clever study design has again become important (Thomas et al., 2009a,b). One of the first uses of NGS was for targeted follow-up of GWAS hits, for which an alternative to two-stage designs, known as two-phase designs, is a natural choice. These differ from the two-stage designs described above in that the set of subjects chosen for expensive data collection (e.g., NGS) are a proper subset of a larger epidemiologic study rather than an independent sample, and that this subset is selected on the basis of information already available on the full study (Whittemore and Halpern, 1997; Thomas et al., 2004; Yang and Thomas, 2011). In the case of NGS, this could involve stratification jointly on disease status and carrier status of the associated variant(s). While this would tend to induce a spurious association between any variants in LD with the GWAS SNPs and disease even under the null hypothesis that they are not causal, this bias can be avoided by adjusting for the sampling fractions, and additional information available in the full study can also be incorporated. The basic principles were developed in a series of seminal papers by Norman Breslow with various colleagues (see Breslow and Holubkov, 1997b; Breslow and Chatterjee, 1999; Scott et al., 2007; Breslow et al., 2009b, for summaries of this work). Recently, Schaid et al. (2013a) have provided an excellent discussion of the use of this approach for targeted follow-up of GWAS hits by NGS. However, for whole genome or whole exome sequencing studies, there would be no point in selecting individuals based on whether they carried a specific polymorphism, except to eliminate those known to be carrying a known major mutation.

Most GWAS for discovering common variants associated with disease traits have been conducted using a case-control design with unrelated controls. Not only are unrelated individuals easier to identify and enroll than are entire families (particularly multiple-case families), but the statistical efficiency for discovery or association testing per subject genotyped is typically higher using unrelated controls than using unaffected siblings or other relatives (Witte et al., 1999). However, with the growing interest in rare variants and the availability of NGS, there has been a resurgence of interest in using family-based designs (Zhu et al., 2010; Feng et al., 2011; Ionita-Laza and Ottman, 2011; Shi and Rao, 2011). Family-based designs may have other advantages that outweigh their loss of statistical efficiency.
By exploiting information about co-segregation, they may be more efficient at prioritizing potentially causal variants from non-causal ones for subsequent testing for association with disease in larger samples. The ability to exploit Mendelian inheritance may also improve the imputation of rare variants in untested samples (Li et al., 2009; Cheung et al., 2013). Finally, family-based designs can exploit both between- and within-family comparisons in a two-step analysis for better power while being robust to bias from population stratification (Lange et al., 2003; Van Steen et al., 2005; Feng et al., 2007; Murphy et al., 2008). In this paper, we focus on the first of these advantages, using a design that sequences a subset of family members initially, ranks the discovered variants in terms of their likelihood of being associated with the trait using the phenotype information on the entire family, and then tests for association in an independent sample. In this sense, the design has elements of both two-phase and two-stage designs, in that the sequencing set is a proper subset of a larger family-based study and that an independent sample is used for replication or combined analysis.

One consequence of the new focus on rare variants is the need for novel analysis strategies, because testing associations individually with every variant would have very little power due to the large multiple comparisons penalty and their rarity. In a sample of, say, size 200, one might identify about 20 million variants. Most of these are likely to be unrelated to disease, and genotyping all of them for a large case-control association study would be neither feasible nor statistically efficient, so some means of identifying those most likely to be causal is needed. Furthermore, under some models of disease causation, multiple variants in a causal gene (or pathway) could affect its function, so aggregating variants within genes may also improve power. To address this need, a host of "burden" tests have been developed based on counts of rare variants, weighted in various fashions (see Asimit and Zeggini, 2010; Cirulli and Goldstein, 2010; Basu and Pan, 2011; Bacanu et al., 2012; Thomas, 2012, for reviews). However, these are ill-suited to the situation where a region contains both deleterious and protective variants (Hoffmann et al., 2010). A random effects model that focuses instead on the variance of risk across variants rather than their mean might therefore be more powerful. The first of this type was the C-alpha test (Neyman and Scott, 1966; Neale et al., 2011), which tests for overdispersion of case-control ratios, conditional on the total number of variants. The Sequence Kernel Association Test [SKAT (Wu et al., 2011; Lee et al., 2012)], based on a general linear mixed model, tests for association between similarity of phenotypes and similarity of multi-locus genotypes across all pairs of subjects. See Schaid (2010a,b) for a general review of the basic statistical foundations of such tests and various choices of kernel functions for genetic applications. Recently, this class of methods has been extended to family studies (Huang et al., 2010; Schifano et al., 2012; Chen et al., 2013; Ionita-Laza et al., 2013; Schaid et al., 2013b). Hierarchical modeling approaches offer another approach to the analysis of rare variants, allowing formal incorporation of external information for prioritization.

A variety of methods for incorporating genomic context, functional, or pathway annotation data have been discussed in GWAS contexts (reviewed by Cantor et al., 2010; Thompson et al., 2013).
Examples of prior information might include loci previously reported, pathway or genomic annotation, expression QTL or other functional assays, etc. (Rebbeck et al., 2004; Bush et al., 2009; Karchin, 2009; Nicolae et al., 2010; Wang et al., 2010; Freedman et al., 2011; San Lucas et al., 2012; Minelli et al., 2013). Filtering on such variables has become a popular strategy, but risks eliminating many causal variants whose potential significance has not yet been recognized, or loading up the list of prioritized variants with too many non-causal ones based on irrelevant information. The weighted False Discovery Rate (Roeder et al., 2006; Wakefield, 2007; Whittemore, 2007) and Gene Set Enrichment Analysis (Chasman, 2008; Holden et al., 2008) require specification of weights in advance, and there is no obvious way to combine multiple filters. The hierarchical modeling approach described below is more flexible, allowing the weights given to various biofeatures to be determined empirically, based on their observed correlation with disease associations across the ensemble of all variants.

An example of a two-phase design is the Women's Environmental Cancer and Radiation Epidemiology (WECARE) Study of the risk of second breast cancers among survivors of a first breast cancer, focusing on radiation dose to the contralateral breast (Stovall et al., 2008; Langholz et al., 2009), various genes involved in DNA damage response pathways (Begg et al., 2008; Concannon et al., 2008; Borg et al., 2010; Malone et al., 2010; Capanu et al., 2011; Quintana et al., 2011; Brooks et al., 2012; Quintana et al., 2012; Reiner et al., 2013), and their interactions (Bernstein et al., 2010, 2013); a GWAS is also currently in progress. The design is a nested case-control study, with two controls matched to each case on age and year of diagnosis of the first cancer and study center, and "counter-matched" on radiotherapy for treatment of the first cancer (Bernstein et al., 2004). As an illustration of the two-phase design, we are currently performing whole genome sequencing on a subsample of 201 subjects and whole exome sequencing on several hundred more, drawn from the 701 cases and 1399 controls, stratified jointly by case-control status and risk predictors: age at first cancer, family history (FH), radiation treatment, and time since exposure.
As an example of a family-based design, we are currently performing deep targeted resequencing of 11 replicated regions identified by previous GWASs as associated with colorectal cancer (CRC), using ~4200 samples drawn from the Colon Cancer Family Registries (C-CFR). The C-CFR is an international collaboration of registries of families ascertained through CRC in various ways, some population-based, some from high-risk genetic clinics, some including population controls or control families (Newcomb et al., 2007). To date, 10,662 CRC families have been enrolled, totaling 62,353 individuals, with genetic samples available on 5113 cases and 9196 unaffected family members or population controls with epidemiologic risk factor information, and FH data on many more (http://epi.grants.cancer.gov/CFR/about_colon.html, accessed 3/8/13). For the purpose of comparing different designs, we have selected some samples from multiple-case families and some from unrelated cases or controls in various ways. Ultimately these data would be used to compare designs empirically in terms of the yield of significant findings by subsampling from these real data (e.g., to assess whether a lower depth of sequencing, narrower regions, fewer subjects, or subjects targeted in different ways would have sufficed).

The aim of this paper is to review recent developments in methods for the design and analysis of NGS studies, with a particular focus on two-phase and family-based designs, and to illustrate the various issues with simulated data and applications to power calculations and preliminary data from these two studies.

RESULTS

TWO-PHASE SAMPLING FOR TARGETED RESEQUENCING
Suppose one has already completed a large case-control GWAS of unrelated individuals, in which one or more tag SNPs have been found to be strongly associated with a particular disease. It is unlikely that the associated SNPs would themselves be causal; more likely they are simply in LD with the truly causal variant(s). The aim of a targeted resequencing study is therefore to exhaustively re-sequence the region to identify all variants in the hopes of discovering these causal ones. Two key decisions are required: how to select the subsample to be sequenced; and how to prioritize the variants found in this subsample.

Approaches to prioritization of variants
One obvious method of prioritization would be on the basis of novelty, i.e., to focus attention on variants that have not been seen previously (or only rarely) among population controls. With the growing catalog of sequence variants in public databases like the 1000 Genomes Project, many causal variants are likely already to have been discovered, and those that are novel are likely to be so rare that there would be very little power testing their association individually with disease. Furthermore, most novel variants are likely to be neutral. Nevertheless, the discovery of a novel association with disease is important, irrespective of whether or not the existence of the variant has been previously reported, but a discovery of a novel variant and its association with disease is particularly noteworthy. The same applies to filtering based on differences in allele frequencies between cases and controls within the sequencing subset (Yang and Thomas, 2011). Thus, some investigators have decided not to sequence controls, but this could be ill advised if cases are sequenced in populations not well represented in public databases or on platforms with different discovery characteristics (e.g., depth of coverage, quality control filtering).

Under the hypothesis that a gene may harbor multiple variants any of which could affect function, or that a critical pathway could be affected by polymorphism in any of the genes in it, a strategy that aggregates across multiple related variants may be helpful. Methods section Simulation of Gene- and Pathway-level Prioritization in the WECARE Study describes a simulation of this strategy based on the WECARE study; results are discussed in the application below.

Hierarchical modeling (Greenland, 2000) entails adding a second-level model for the effect estimates of each variant, allowing the magnitude of the effects, their probability of being non-null, or their covariances to depend upon external information ("prior covariates") (Conti and Witte, 2003; Hung et al., 2004; Chen and Witte, 2007; Hung et al., 2007; Lewinger et al., 2007; Conti et al., 2009; Hoffmann et al., 2010; Capanu and Begg, 2011; Capanu et al., 2011).
Hierarchical models involve a first (subject)-level model for individuals' phenotypes $Y_i$ as a function of a vector of genotypes $G_i = (G_{iv})$ at loci $v$ and corresponding regression coefficients $\beta = (\beta_v)$, e.g., a general linear model of the form $f[E(Y_i)] = G_i'\beta$, and a second (variant)-level model for the distribution of these regression coefficients as a function of prior covariates $Z_v$, e.g., a linear regression model of the form $E(\beta_v) = Z_v'\pi$, or perhaps a model for their variances or covariances (Thomas et al., 2009a,b). This approach also has the advantage of allowing for the uncertainty about which effects should be included in the model within a Bayes model averaging framework, e.g., by modeling the probability that a variant has no effect as $\mathrm{logit}[\Pr(\beta_v = 0)] = Z_v'\alpha$ (Quintana et al., 2012; Quintana and Conti, 2013). It also allows the data to determine the optimal weights for the various prior covariates rather than having to specify them a priori; these papers show how the gain in power from including covariates with high predictive value for classifying causality of variants is offset by very little loss of power from including covariates with low predictive value (since they are usually assigned very little weight), in contrast to filtering approaches, which can lead to substantial loss of power if either sensitivity or specificity is low.

Joint analysis of SNP and sequence data
The subset of subjects from the substudy with the full sequence data would probably not provide adequate power for testing associations with disease directly, either variant-by-variant or by any of the aggregation methods described above. How then might one take advantage of the data from the much larger GWAS from which the substudy subjects were selected? Three basic strategies are possible: (i) by genotyping; (ii) by imputation; or (iii) by joint analysis.

The first of these is the simplest, but most expensive. One simply does custom genotyping in the main study of the prioritized variants. Under the hypothesis that any causal variants are likely to have been discovered by sequencing and that they survived prioritization, then the genotype data for the main study should be sufficient and associations can be tested directly, with appropriate correction for the effective number of independent tests performed (Conneely and Boehnke, 2007). A final analysis can then include a test of whether the novel variants account for the original SNP association (Yang and Thomas, 2011).

Imputation has become a standard approach for GWAS analysis, so that typically several million common variant associations are tested by combining the study data from the SNP panel with population distributions of all common and uncommon variants from such databases as the HapMap and 1000 Genomes projects (Asimit and Zeggini, 2012; Howie et al., 2012). For each variant not on the GWAS panel, one computes the expected allelic dosage and uses this as the covariate in a logistic regression model; this strategy is known to be superior to simply using the most likely genotype, in part because it correctly allows for the uncertainty in the imputation (Stram et al., 2003). Whether this strategy would be viable for rare variants is still unknown, but there are two reasons for concern. First, the strategy relies on linkage disequilibrium, and rare variants tend to have weaker LD than common ones (Duan et al., 2013). Second, it also relies on having sufficiently large reference panels, which would not include newly discovered variants.
Joint analysis of the full sequence data on the subsample and the SNP data on the main study is the most powerful approach and, like imputation, does not involve any further genotyping costs. In their series of seminal papers on two-phase studies, Breslow et al. describe three basic analysis approaches: pseudo-likelihood (PL), weighted likelihood (WL), and semi-parametric likelihood (Breslow and Cain, 1988; Breslow and Zhao, 1988; Cain and Breslow, 1988; Breslow and Holubkov, 1997a,b; Breslow and Chatterjee, 1999; Breslow et al., 2003, 2009a,b; Breslow and Wellner, 2007). The simplest of these is the WL approach, so for simplicity we confine our discussion here to this one (Methods section Likelihoods for Joint Analysis of Two-phase Studies). The basic idea is based on Horvitz-Thompson estimating equations, which use the score function derived from the likelihood for a logistic regression of disease status in the substudy data alone, weighting each subject's contribution inversely by their sampling probabilities, $\sum_i [Y_i - p_i(\beta)]\, W_i\, G_i = 0$, where $p_i(\beta) = \mathrm{expit}(G_i'\beta)$ and $W_i = N_{s_i}/n_{s_i}$, $s_i$ being the sampling stratum to which subject $i$ belongs and $N_s$ and $n_s$ the main study and substudy sample sizes respectively. While simple in concept, the disadvantage is that the only information used from the main study is the stratum sample sizes. A refinement of this approach is to replace the empiric weights based on the realized sample sizes by predicted weights based on a logistic regression of sampling probabilities on additional covariates available for all main study participants. Recent papers show how this basic approach can be stabilized by using "calibrated weights" without requiring assumptions about the validity of an imputation model, using influence residuals (Breslow et al., 2009a,b). The utility of this approach for targeted follow-up of GWAS hits is discussed in Schaid et al. (2013a).
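[Editorial illustration added to this reprint: the weighted-likelihood estimating equation above can be solved with a short Newton-Raphson routine. The sketch below is a minimal, self-contained version; a production analysis would use survey-weighted GLM software with design-based standard errors.]

```python
import numpy as np

def wl_logistic(G, Y, W, n_iter=25):
    """Horvitz-Thompson weighted-likelihood logistic regression, solving
    sum_i [Y_i - expit(G_i' beta)] * W_i * G_i = 0 by Newton-Raphson.

    G : (n, p) covariate/genotype matrix for the sequenced substudy
    Y : (n,) case-control indicators
    W : (n,) inverse sampling fractions N_s / n_s for each subject's stratum
    """
    beta = np.zeros(G.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-G @ beta))            # p_i(beta)
        score = G.T @ (W * (Y - mu))                     # weighted score
        info = (G * (W * mu * (1.0 - mu))[:, None]).T @ G  # weighted information
        beta += np.linalg.solve(info, score)
    return beta
```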
Optimization of sampling fractions
As with two-stage designs, it is theoretically possible to optimize the choice of sampling fractions, subject to a constraint on total cost, but in practice this requires knowledge of the true values of various model parameters (causal allele frequencies, RRs, LD with the GWAS SNPs, etc.). Fortunately, the design is often relatively insensitive to these parameters, so that a balanced design in which the various strata are represented by equal numbers $n_s$ in the subsample may be nearly optimal (Reilly and Pepe, 1995; Reilly, 1996). In their article on the application of this design to targeted follow-up of GWAS hits, for example, Schaid et al. (2013a) do not address optimization, but recommend the balanced design.

As with two-stage designs, the basic idea is either to maximize power subject to a constraint on total cost, or to minimize the cost required to attain a target power. If the only cost is sequencing the subsample, then it is sufficient to optimize the proportional allocation of substudy subjects across strata. If instead one is designing both the main study and substudy de novo, or if custom follow-up genotyping of the main study is planned, then the relative sample sizes of the two phases also need to be considered. In either case, there are likely to be multiple hypotheses being tested, so optimization of power for a specific type of variant may be less helpful than a global optimization. For this purpose, we have previously considered Asymptotic Relative Cost Efficiency, a quantity inversely proportional to the total cost times the variance of the parameter of interest, combining main and substudy data (Thomas, 2007), but more recently, in the context of designs for sequencing using DNA pooling, we aimed to optimize power subject to a constraint on total cost (Liang et al., 2012). We adopt a similar approach to optimize designs for testing the 1 degree of freedom Madsen and Browning (2009) rare variant burden test (Methods section Optimization of Two-phase Studies), but this could be easily extended to maximize power for multi-dimensional hypotheses. Here, we summarize a small simulation study to illustrate the potential of two-phase designs (Methods section Simulation of Two-phase Designs).

Results using the simulated sequence data are shown in Table 1 for the full cohort (the "ideal" results if the entire sample could have been sequenced) compared with two-phase analyses using (1) imputation, (2) the Horvitz-Thompson WL approach with sample weights (Horvitz and Thompson, 1952); (3) the Breslow-Cain PL (Breslow and Cain, 1988); and (4) the Breslow-Holubkov semi-parametric estimator (Breslow and Holubkov, 1997a,b). The top half of the table provides results for the Madsen-Browning index including all variants present in the parent case-control study, while the bottom half is limited to those seen at least once in the subsample; for the latter, the risk index would include different variants for the different designs, so point estimates are not comparable. The last line gives the average estimate, empirical standard deviation of estimates across replicates, and the estimated non-centrality parameter (NCP) for the Wald test if no subsampling were done. Generally, the imputation approach was the least efficient for all designs, followed closely by the WL estimator, while the semi-parametric one was the most efficient. Except for the WL, the optimal sampling design was also the most efficient; the inefficiency of the WL in this case seems to be due to some small strata receiving very large weights (for example, 500/2 in the low-risk case stratum compared with 500/214 in the high-risk case stratum; see Footnote b of Table 1). In earlier simulations (not shown) we found relatively little inflation in the variance estimates or changes in the point estimates as the number of strata increases (although the number of replicates that failed to converge increased). Further research in the case of many sparse strata would be helpful, as well as on such issues as the size of the region to be sequenced and the depth of sequencing.

Table 1 | Parameter estimates (SEs) [Wald Z-tests] for the simulated two-phase sequencing data using the imputation, weighted likelihood, Breslow-Cain pseudo-likelihood, and Breslow-Holubkov semiparametric maximum likelihood estimators.

                                        Subsample design
Analysis method          Case-control          Balanced(a)           Optimal(b)

All 1422 rare variants in the full study (47 causal)
Imputation               1.69 (0.96) [1.76]    1.75 (0.86) [2.03]    1.63 (0.79) [2.06]
Weighted likelihood      1.88 (0.96) [1.96]    1.89 (0.91) [2.08]    1.72 (1.13) [1.52]
Pseudolikelihood         1.88 (0.96) [1.96]    2.03 (0.97) [2.09]    2.22 (1.00) [2.22]
Semiparametric ML        2.12 (0.98) [2.16]    2.22 (0.99) [2.24]    2.24 (1.00) [2.24]
Full study               1.80 (0.69) [2.61]

Variants discovered in the substudy only
Average number
discovered (causal)      653 (44)              719 (45)              697 (44)
Imputation               1.66 (0.95) [1.75]    1.73 (0.86) [2.01]    1.64 (0.80) [2.05]
Weighted likelihood      2.34 (1.01) [2.32]    1.87 (0.93) [2.01]    1.73 (1.12) [1.54]
Pseudolikelihood         2.35 (1.02) [2.33]    2.01 (1.00) [2.01]    2.27 (0.97) [2.34]
Semiparametric ML        2.56 (1.06) [2.42]    2.19 (1.02) [2.15]    2.29 (0.97) [2.36]

Empirical mean estimates and standard deviations are computed from 1000 replicates with 2000 cases and 2000 controls showing association with at least one GWAS SNP, subsampling 600 subjects, 50 causal rare variants. These results are contrasted across three sampling designs. Coefficients are in units of log RR per Madsen-Browning rare variant summary index divided by 1000; for consistency across designs, all rare variants are included in the index in the top portion of the table; the bottom portion includes just those discovered in the substudy, so point estimates are not comparable across sampling methods. All estimates are adjusted for the risk index.
(a) 100 subjects from each of the 6 strata.
(b) Numbers of subjects in the subsample are fixed across replicates at (2, 20, 214) cases and (74, 116, 174) controls, stratified into 3 groups of risk index from low to high, based on overall optimization for all replicates combined.

Application to the WECARE study of contralateral breast cancers
To illustrate the potential yield from a sequencing substudy, we consider whole genome sequencing of a subset of 200 genetically enriched subjects, with the intent of following up a subset of discovered variants by testing their associations in a larger study of 700 cases and 1400 controls. The sample sizes used for illustration derive from the WECARE study (Bernstein et al., 2004) described above. The 201 subjects in the top part of Table S1 were selected by prioritizing young age, positive FH, cases over controls, and among cases, those who received radiotherapy (and for these, longer latency). These samples are currently being whole-genome sequenced at an average depth of coverage of 30×. We used simulation to address the following questions:
1. What is the anticipated yield of variants discovered one or more times in this sample, as a function of population MAF and RR?
2. Of those discovered, what proportion would be novel (not in the 1000 Genomes Project), what proportion would be truly causal, and both novel and causal?
3. Among the discovered variants in each category (of MAF, RR, causality, and number of times seen in each series), what is the power for testing association in the main study, after Bonferroni adjustment for the number of markers tested?
4. Putting all these together, what is the expected overall yield of novel, causal discoveries?
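[Editorial illustration added to this reprint: a rough back-of-envelope version of the yield and power calculations behind questions 1 and 3. The binomial discovery model and the normal approximation to the per-allele log-OR are simplifying assumptions of this sketch, not the paper's Methods.]

```python
import numpy as np
from scipy import stats

def discovery_prob(maf, n_seq, min_copies=1):
    """P(variant seen at least `min_copies` times among 2*n_seq sequenced
    chromosomes), treating the allele count as Binomial(2*n_seq, maf)."""
    return 1.0 - stats.binom.cdf(min_copies - 1, 2 * n_seq, maf)

def stage2_power(log_rr, maf, n_case, n_ctrl, n_tests, alpha=0.05):
    """Approximate power of a per-allele test at Bonferroni level
    alpha/n_tests, with Var(log-OR) ~ (1/n_case + 1/n_ctrl) / (2*maf*(1-maf))."""
    se = np.sqrt((1.0 / n_case + 1.0 / n_ctrl) / (2 * maf * (1 - maf)))
    z_crit = stats.norm.ppf(1 - alpha / n_tests / 2)
    return stats.norm.sf(z_crit - np.abs(log_rr) / se)

# e.g., a variant with MAF 0.5% and RR 2, followed up among 10,000 tests:
print(discovery_prob(0.005, 200),
      stage2_power(np.log(2), 0.005, 700, 1400, 1e4))
```

Summing the product of these two probabilities over the simulated spectrum of MAFs and RRs gives expected yields of the kind reported below.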
Restricting to those seen morethanonceconsiderablyreducesthenumberofvariantspri- oritized, as does eliminating those never or seldom reported in the 1000 Genomes Project sample, but also eliminates most of thetrulycausalvariants.If,however,thegoalistoidentify at least someofthenovelcausalvariantswithadequatepowertotestthem forassociationinthemainstudy,thenthisdesignmightstilldis- cover something in the range of 5–20 causal variants out of the total 1600 simulated, depending on the specific criteria used for prioritization,andofcoursedependinguponthetruesimulation modelparameters. We also simulated gene- and pathway-level burden tests (Methods section Simulation of Gene- and Pathway-level Prioritization in the WECARE study; Table3). These show a modestimprovementinpoweratthehigherlevelsofaggregation, butpowerisstilllowwiththesesamplesizes.Thesimulatedcausal variantsarepredominantlyrare,soonlyabout16%ofcausalones are even discovered in this small sample, setting an upper bound forpowerforsinglevarianttests.Ofthe43discoveredcausalvari- ants (on average across 100 replicates), 19 are prioritized and 3 Table 2 | Expected total number of discovered variants prioritized and expectednumberofthesethatarecausal,byminimumnumberof copiesinthesequencingsampleandmaximumnumberofcopiesin 1000 Genomes Project data. Maximumcopies Minimumcopiesin in1000GP sequencingsample c=1c=2c=3 NUMBEROFVARIANTSPRIORITIZED c ′ =01.5M 113K 10K c ′ =12.6M265K30K c ′ =23.4M418K57K NUMBEROFPRIORITIZEDVARIANTSTHATARECAUSAL c ′ =041 34 27 c ′ =1113 97 79 c ′ =2192 168 140 EXPECTEDYIELDOFSIGNIFICANTLYASSOCIATEDCAUSAL VARIANTS FROM SECOND STAGE* c ′ =00.7 1.0 1.6 c ′ =12.1 2.9 4.2 c ′ =23.8 5.1 7.0 *Bonferroni corrected α= 0.05 (i.e., in addition to these causal variants, 0.05 non-causal variants are expected to be declared significant). of these are found to be significantly associated in the full study sample,foranaveragepowerof1.1%.Ofcoursethecorrespond- ingproportionsweremuchsmallerfornullvariants,yieldingonly 3falsepositivesintotaloutof31million.(Theelevated“falsepos- itive” rate for single variant tests compared with the target 0.05 is due to null variants in strong LD with other causal variants.) Similar comparisons yielded 2.9, 5.7, and 4.6% power for gene- regions, genes, and pathways respectively, with type I error rates at or below the target level. The improvement at the region and gene levels probably reflects the increasing benefit from pooling similar variants, while the failure of the pathway burden test to yield even better power may be due to an increasing proportion of truly null genes or variants diluting the effect of the positive ones. These results are based on prioritization at each level at α 1 = ! 0.01/pwherepisthenumberoftests(variants,genes,etc. discovered in the subsample); while optimization of these values is possible, the results seem to be relatively insensitive across a broad range of choices. The specific results are somewhat more sensitive to the specific model parameters, only one being pre- sented here, but the general patterns remained consistent across allvaluesweconsidered. 
Sequencing is still underway, but preliminary results from the first 93 samples from Table S1 suggest that the majority of subjects (mainly contralateral cases with early onset and/or family-history positive subjects) carry at least one functionally significant, clinically relevant or predicted disease-causing mutation, based on external annotation criteria including the Human Genome Mutation Database (Stenson et al., 2009), ClinVar [http://www.ncbi.nlm.nih.gov/clinvar], and MutationTaster (Schwarz et al., 2010), in both known and unknown breast cancer candidate genes and pathways; >50% carry at least two and 10% carry three or more. The next step is to see whether these variants are differentially distributed between WECARE cases and (unilateral breast cancer) controls or population rates, whether they are associated with radiotherapy (suggesting an interaction effect), and to test these variants in the full WECARE study sample.

FAMILY-BASED DESIGNS FOR PRIORITIZATION
Several investigators have recently reported approaches to efficient selection of individuals for sequencing in family-based designs. Cheung et al. (2013) described an approach for targeted sequencing of regions exploiting already available linkage information to optimize imputation to other family members, but without using phenotype information, whereas Wang et al. (2013) described an approach for whole genome sequencing using phenotype and kinship information. We simulated various family-based designs for whole genome sequencing to address the following questions:
1. What criteria should be used to select families and members for sequencing substudies?
2. What criteria should be used to prioritize variants for subsequent association testing?
3. How do family-based designs for sequencing compare with those using unrelated individuals in terms of probability of discovering novel variants, classifying variants by their likelihood of being causal, and power for testing association with disease?

We consider a two-phase design that uses a subset of individuals from a family study chosen on the basis of their phenotypes and relationships to each other for discovery and screening, followed by association testing in the full pedigrees. If enough family-based samples are available, replication using additional family-based samples is preferable to using a different sampling scheme, because one would like the spectrum of variants (e.g., MAFs and RRs) being tested in the second stage to be comparable to those discovered in the first.
We compare the relative efficiency of this family-based design with a conventional two-stage case-control design with comparable costs.

We evaluated these designs by simulating 4-generation pedigrees with 22 members each. We sampled haplotypes from the same simulated population described earlier and randomly dropped genes through the pedigrees, generating phenotypes with randomly selected rare variants as causal with the same RR distribution, and retaining those pedigrees with some minimum number of cases (Methods section Simulation of Family-Based Designs). To address the first question, families with various numbers of affected individuals were ascertained, and from each of these we selected individuals to sequence in various ways (e.g., two cases of at least second-degree relationship to each other and one unaffected individual). We then tabulated the following statistics for causal and non-causal variants by the relationship of the cases to each other:

• Rule-based criterion: the number of families for which all affected members carry the variant and no unaffected ones do among the subset sequenced;
• Likelihood ratio (LR) criterion: the ratio of the retrospective likelihoods under the simulated penetrances and allele frequencies vs. the null penetrance (the average rate in the ascertained families); while similar to the lod score used in linkage analysis, here a single-locus likelihood is used to test association with a directly observed variant, not markers in LD with an unobserved locus;
• Bayes factor (BF) criterion: similar to the likelihood ratio, but based on the marginal probabilities under the simulated prior distributions of penetrances and allele frequencies;
• Score test criterion: the score test derived from the retrospective likelihood, evaluated under the null hypothesis.

(See Methods section Family-Based Criteria for Prioritization of Variants.) The score test was evaluated both at the single-variant and the regional level, the latter using the family-based SKAT tests (Schifano et al., 2012; Chen et al., 2013; Ionita-Laza et al., 2013; Schaid et al., 2013b). The other tests were used only for ranking variants individually. The score test essentially relates the phenotypes of the entire pedigree to the genotypes of those who have been sequenced, using the inverse of the kinship matrix to weight them. If linkage information is available, then a direct test of association with imputed genotypes is possible (Cheung et al., 2013), allowing for residual phenotypic correlations.

Figure 1 shows the mean score statistics per family for those with a total of 4 affected individuals in which either an affected sib pair with an affected first cousin or a discordant sib pair with an affected cousin have been sequenced. These were derived for an 11-member sub-pedigree for which exhaustive enumeration of all possible genotypes and phenotypes was feasible. As expected, variants with the largest scores for causal variants (top panel) and the highest probabilities of being causal (bottom panel) were those where both cases were carriers and (if sequenced) the control was not. Having the control affected somewhat lowers the average score, but not as much as having an additional case being a carrier increases it, essentially because we are considering a relatively rare disease (population prevalence 5%).

This trade-off is explored further in Figure 2 for various types of relatives sequenced. Here, we fix the total number of individuals being sequenced at 100 across the designs being compared (e.g., 25 pedigrees with four members each sequenced, 33 with 3, 50 with 2 each, or 100 singletons).
Although, as expected, having more families with fewer individuals sequenced increases the absolute discovery probabilities (not shown), the relative difference comparing causal and null variants goes in the other direction, and the relative probability of prioritization also increases with more subjects per pedigree and fewer pedigrees sequenced, yielding a considerable increase in the overall relative probability of discovery and prioritization.

FIGURE 1 | Mean scores for causal variants (top panel) and ratio of frequencies of causal to non-causal variants (bottom panel) in simulated 11-member pedigrees with at least 4 affected members. In each panel, results are shown for a design sequencing an affected sib pair and affected cousin by the number of carriers of the variant allele (left) or an affected first cousin pair and an unaffected sib by the number of carriers among cases and controls (right).

Comparing designs with two individuals sequenced per pedigree shows little difference in the probability of discovery across the relationships among the pairs (Figure 2), but shows that the conditional probability of prioritization given discovery, and the joint probability of discovery and prioritization, are better for more distant relatives.

FIGURE 2 | Relative probabilities of discovery, prioritization, and both between causal vs. null variants for different criteria for selecting members for sequencing in simulated 11-member pedigrees with at least 4 affected members. Top panel, all designs; bottom panel, detail for designs with only two members sequenced. (Codes for top panel: S, sib; C, cousin; 2, first cousin once removed; U, uncle; G, grandparent; P, parent; upper case, affected; lower case, unaffected; hyphen, affected but not sequenced.)

Obviously, the more stringent the cutoff for any of these criteria, the fewer the variants that would be prioritized, but non-causal variants tend to be eliminated much faster than causal variants (Figure S2), so the challenge is to choose a threshold that minimizes the false positive proportion, subject to the total number of variants that can be tested in subsequent replication efforts. The relative performance of the various prioritization criteria is illustrated in Figure 3 as receiver operating curves, varying these thresholds. Although the BF criterion is the best overall in terms of the area under the curve, it is the most computationally intensive; the score test is nearly as good and much faster to compute.

FIGURE 3 | Receiver operating curves comparing different prioritization schemes.

RELATIVE EFFICIENCY OF FAMILY- vs. POPULATION-BASED DESIGNS

We compared the power of a two-stage family-based design with that of a conventional case-control design. The overall power for any design with independent tests in the two stages is simply the product of the powers for the two stages (Methods section Calculation of Power for Two-Stage Designs). The probabilities of discovery, prioritization, and replication are illustrated in Table 4 for a range of design parameters (total sample sizes, proportions allocated to stage 1, numbers of copies required to be judged a discovery, and the minimum threshold required
for prioritization) under two different cost structures. (The full range of choices considered is shown in Figure S3.)

Table 4 | Some near-optimal multi-stage family-based and case-control designs. (The first row of each block is the one with the highest ARCE among those investigated; the second is the one with better power among those with similar costs.)

Total      Proportion    Minimum     Criterion for     Causal          Total          Power, all      Total cost   ARCE^d
sample     to stage 1    copies for  prioritization^b  discovered (%)  prioritized^c  causal (novel)  (millions)   (×1000)
size^a     (%)           discovery                                                    variants
FAMILY-BASED DESIGNS (COSTS: $1000/FAMILY, $5000/SEQUENCE, 5¢/GENOTYPE)
1800       30            12          3.0               38              3,591          17% (9%)        $12.3        0.252
2400       30            14          3.0               13              3,894          21% (13%)       $16.8        0.241
FAMILY-BASED DESIGNS (COSTS: $5000/FAMILY, $1000/SEQUENCE, 5¢/GENOTYPE)
2100       50            12          4.0               56              346            19% (12%)       $13.8        0.264
2400       50            12          4.0               59              378            21% (13%)       $15.8        0.260
CASE-CONTROL DESIGNS (COSTS: $100/SUBJECT, $5000/SEQUENCE, 5¢/GENOTYPE)
7000       20            12          0.001             62              5,502          16% (9%)        $17.8        0.171
9000       20            14          0.001             65              5,862          20% (12%)       $23.1        0.164
CASE-CONTROL DESIGNS (COSTS: $500/SUBJECT, $1000/SEQUENCE, 5¢/GENOTYPE)
6000       40            16          0.0001            70              823            17% (11%)       $8.1         0.416
7000       40            16          0.0001            74              912            20% (13%)       $9.4         0.407

a Number of 22-member pedigrees for family-based designs; number of cases and number of controls for case-control designs.
b Minimum score test for family-based designs; minimum p-value for case-control designs.
c Assuming 1000 causal variants out of a total of 20 million.
d Total number of true positives, inversely weighted by the square root of MAF, divided by total cost.

Obviously, as the number of copies required for discovery is raised or the threshold for prioritization is raised, fewer variants in total would be prioritized, leading to a less stringent multiple comparisons penalty; but at some point the overall power decreases because too many of the truly causal variants are either not discovered or not prioritized. Although the overall power (the proportion of all causal variants discovered, prioritized, and replicated) for any of these designs is only about 20% (for a sample size of 1000 families), the majority of those not found are either very rare or have very small effect sizes. Of particular interest are the numbers of novel variants (those not in the 1000 Genomes Project database) that are discovered, prioritized, and replicated. Since these too are predominantly rare, power for them is even lower, but depending upon the total number that actually exist, they could still represent a substantial yield of true positive findings.

These comparisons are provided for the "optimal" designs of each type and one alternative design that yields better power at modestly larger cost. Assuming costs of $1000 per family for enrollment and obtaining pedigree phenotypes, $100 per subject enrolled in a case-control design, $5000 per whole-genome sequence, and $0.05 per subject-genotype, the optimal family-based design turns out to require 540 pedigrees in stage I and
1260 in stage II, at a critical value of the score test for prioritizing variants of 3.0; this yields 167 true-positive replicated associations out of 1000 simulated, at a cost of $12M. The corresponding optimal case-control design would require 1400 case-control pairs in stage I and 5600 in stage II, at a critical value of α_1 = 0.001 in stage I, for a yield of 159 true-positive replicated associations at a cost of $18M (about 2/3 the cost-efficiency of the family-based design). Of course, with different cost ratios these optimal designs would change, as illustrated in Table 4 for the case where enrollment costs are 5 times larger and sequencing costs only $1000 per whole genome. In this instance, case-control designs turn out to be the more cost-efficient. For this simulated MAF distribution, only 175 of the 1000 causal variants would be novel, but in every situation considered, the power for discovering such rare variants is still more than half that for all causal variants. No striking differences were seen between the spectra of RRs and MAFs discovered, prioritized, and replicated by the two types of designs with sample sizes chosen to yield similar overall power.

APPLICATION TO THE COLORECTAL CANCER FAMILY REGISTRIES DATA

Included in the Colorectal Cancer Family Registries are a few large families for whom all available family members have been either genotyped for previously replicated GWAS SNPs or typed for whole-exome sequence variants. We analyzed one of these: a large Australian pedigree comprising 145 individuals with a total of 7 colorectal, 1 Lynch syndrome, and 9 other cancer cases. Genotypes for 32 GWAS-associated SNPs were available for 49 of the members, including 5 of the CRC and Lynch cases and 4 of the other cancers. These data were used to illustrate the effect of subsampling. We selected individuals under various criteria, calculated LR, BF, and score statistics using only the SNP data for these selected individuals [but all the phenotype information (Visscher and Duffy, 2006)], and compared these results to those from the complete genotype data to see which criteria best distinguish variants that are "truly" associated (based on the complete data) from "false" positives.

Figure 4 shows the correlation of each statistic computed using all available genotype data (as the "gold standard") with those using only the subsample of genotypes (averaging over 10 replicate subsamples). For the CRC and Lynch syndrome phenotype, we compared subsets of 1–4 cases and 0–3 controls out of the 5 available cases and 44 available controls. These showed little improvement in correlation for any of the test statistics from adding more than about 2 cases. Adding the genotypes for one or two unaffected members somewhat improved the correlation for LRs and BFs when there were 2 or more cases, but surprisingly worsened the correlation when only a single case was included; this may simply reflect instability due to the small number of cases in total. Results were somewhat more stable when all 9 cancer cases with available genotypes were considered, allowing comparisons of larger subsamples of cases. Adding more than about 3 cases did not materially improve the correlation, and adding 1–3 controls improved the correlations only modestly, again reducing them when a single control was added to a single case. The bottom panel shows that the more distant relative pairs were more informative about distinguishing apparently associated from non-associated SNPs (based on the complete data).
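The subsampling comparison just described can be summarized schematically as below. This is our illustration rather than the analysis code; random selection of members stands in for the phenotype-based selection rules actually used, and stat is a placeholder for any of the LR, BF, or score criteria.

```python
# Schematic of the subsample-vs-complete-data comparison: recompute a per-SNP
# statistic from a reduced genotype matrix and correlate it with the "gold
# standard" computed from the full data, averaging over replicate subsamples.
import numpy as np

def subsample_correlation(G, Y, stat, keep_ids_fn, n_reps=10, seed=0):
    """G: members x SNPs genotypes; Y: phenotypes for ALL members;
    stat(G_masked, Y) -> per-SNP statistic (must tolerate masked rows);
    keep_ids_fn(rng) -> indices of members whose genotypes are retained."""
    rng = np.random.default_rng(seed)
    gold = stat(G, Y)                       # statistic from complete genotypes
    cors = []
    for _ in range(n_reps):
        keep = keep_ids_fn(rng)
        mask = np.isin(np.arange(G.shape[0]), keep)[:, None]
        G_sub = np.where(mask, G, np.nan)   # hide genotypes of unselected members
        cors.append(np.corrcoef(gold, stat(G_sub, Y))[0, 1])
    return float(np.mean(cors))
```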
Because only common variants were available in the Australian data, we also performed similar simulations on 15 large pedigrees that had previously been included in a linkage scan (Cicek et al., 2012) and had whole exome data available on 2–3 CRC cases from each. Not surprisingly, in this small data set, no genome-wide significant associations were found by the score test with any of the 359,744 single nucleotide variants (SNVs) called at least once (a third of these were called only 3 or fewer times) or with 100-SNV bins with the regional score SKAT test, nor were the regional tests particularly correlated with the maximum single-SNV tests or with prior annotation. Additional simulations (not shown) based on the real sequence data and simulated phenotypes (conditional on the total number of cases in each pedigree) confirmed that 15 families would be far too few to find any significant causal effects, even if IBD information were used. The design of a larger NGS study is described in the concluding section.

DISCUSSION

One of the advantages of family-based designs is that Mendelian inconsistencies can be used to check for genotyping errors (Pompanon et al., 2005). This has, of course, long been recommended as a routine quality control check in linkage and family-based GWAS studies. This advice becomes even more important when dealing with NGS data because of its inherently higher error rate as a function of depth of sequencing and the quality control filters applied (Faye et al., 2013). Further research on approaches to using pedigree information to improve variant calling would be helpful.

In a similar vein, Mendelian inheritance could be exploited for improved imputation of variants in unsequenced family members. One obvious way to proceed would be to first use standard imputation procedures with external reference populations (Howie et al., 2012), treating each subject with GWAS SNP data as independent, to obtain preliminary genotype probabilities at the sequenced variants. These could then be combined with the observed genotype calls for the sequenced family members, using Mendelian transmission probabilities, to obtain refined genotype probabilities (Burdick et al., 2006). While proceeding variant-by-variant in this manner is relatively straightforward, it fails to take LD patterns among the variants into account, but a similar strategy could be applied to haplotypes (Cheung et al., 2013). Such a two-step approach was used in simulations for the Genetic Analysis Workshop 18 (http://www.gaworkshop.org/gaw18/index.html). Ideally, a unified approach that would integrate the two sources of information in a single step would be preferable. In addition to imputation, identity-by-descent information could be used to inform the selection of subjects for sequencing (Cheung and Wijsman, 2013) and directly as a local genetic similarity kernel in family-based SKAT tests.

Homozygosity mapping in families has proven to be a valuable technique for mapping recessive alleles (Kruglyak et al., 1995; Chahrour et al., 2012). Design issues for sequencing studies for such traits are likely to be somewhat different from those considered here and would be a useful avenue for further research.

FIGURE 4 | Correlation across 32 GWAS SNPs between the statistics computed from the complete genotype data and those computed using only the genotypes for various subsets of members; top left: 5 genotyped CRC and Lynch syndrome cases; top right: 9 cases of any cancer.
Bottom: prioritization statistics by degree of relationship for apparently associated or unassociated variants based on the complete data. Data from a single 145-member Australian pedigree with a total of 8 CRC or Lynch syndrome cases and 15 cases of any cancer, and a total of 49 subjects genotyped.

Another possibility is to use a two-step analysis of the same data, exploiting between-family comparisons to prioritize variants and within-family comparisons to test the most promising ones, in the spirit of Van Steen et al. (2005). Because these two tests are independent, one then need only correct the significance level for the number of variants passed to the second step. For quantitative traits, regression of the offspring phenotypes on the mean of the parents' genotypes provides a simple first-step test. For disease traits, one would have to include control trios, nuclear families with varying proportions of affected offspring, or external control individuals to have the variability in phenotypes needed for the first-step test. Practically, however, this approach would require having access to the DNA for case-parent trios (which might not be available for late-onset diseases like cancer) and sequencing of the parents rather than the cases for the first stage; not only would this double the sequencing costs over a more conventional design that sequences only the cases, but it might seem counter-intuitive, since the parents may not themselves be affected (even though at least one of each pair must carry any variant that the case does). One could of course reverse the two steps, but this would require sequencing the entire trio in the first step, and the final inference would not be robust to population stratification. In further simulation studies (not shown), we found that use of external controls tends to be more powerful than between-family comparisons for the first step, but is more susceptible to population stratification bias; this is not a threat to validity if used in the first step, but could reduce power if too many false positives are passed to the second step, inflating the multiple testing penalty. The two-step analysis approach consistently yielded better power than a two-stage case-control design, however.

Two-stage and two-phase designs are also amenable to considerable cost-efficiency gains by using DNA pooling techniques (Sham et al., 2002) in the first stage, thereby allowing one to sequence many more subjects than would be feasible if one were to sequence individuals. Of course, only aggregate allele frequency information (Huang et al., 2010), not individual genotypes, is then available [unless one uses molecular bar-coding techniques (Craig et al., 2008)], but this can still be used for discovery of novel variants (Lee et al., 2011) or case-control comparisons of pool allele frequencies (Johnson, 2007; Macgregor et al., 2008; Zhao and Wang, 2009). Further cost-efficiencies are possible by constructing pools of pools, with bar-coding of the sub-pools (Smith et al., 2010). Optimization of designs using DNA pooling has been described by Liang et al. (2012), but extension to family-based studies remains a challenge (Lee, 2005).

We conclude by describing how these considerations influenced the design of a planned whole-exome sequencing study within the colorectal CFR.
We are planning a three-stage family-based design, in which the first stage would use already available sequence information for prioritizing about 1000 genes. This would be followed by two stages of replication, each with probands from about 1000 multiple-case families and 1000 controls. The first stage would exploit existing control data from the 1000 Genomes Project, while the second and third stages would use individually matched population controls. Because our hypothesis is that causal genes may harbor multiple rare variants (not necessarily the same across families), the two replication stages would perform full resequencing of the entire coding and flanking regions of the prioritized genes: 1000 genes in stage 2, 100 genes in stage 3. For the same reason, we have decided to use a family-based design for all three stages, since variants discovered in multiple-case families may not be well represented in unselected series of population-based cases. After analysis of the sequencing data from each stage, additional genotyping of the prioritized variants would be done on all other available family members for analyses using a conditional segregation analysis (Hopper et al., 1999). Three criteria would be used for prioritization at every stage: a family-based test of co-segregation with disease, for each variant separately and for entire genes; a gene-based test of association comparing cases and controls; and filtering based on bioinformatic predictors. The first of these uniquely exploits the information available from a family-based design and can be used to rank genes on the probability that they carry at least one causal variant, using an aggregate assessment of the impact of all rare variants in the gene. The three comparisons would be unified through hierarchical modeling, in which both the family-based and case-control comparisons would be incorporated in the likelihood for the first (individual)-level model, and the bioinformatic predictors would be incorporated in the second (variant)-level model. Similar issues are currently being discussed in the design of a large-scale sequencing study for the WECARE project. Since this is not a family-based study, the key decision there is how best to select the subset for sequencing in a two-phase design.

METHODS

All simulations were based on the same population of 10,000 haplotypes of length 250 kb generated by the COSI program (Schaffner et al., 2005), with the population history parameters provided in their Table 1. This population contained 5125 unique variants, of which 4557 had minor allele frequencies (MAF) < 0.05, 95% less than 0.01, and 79% less than 0.001.

SIMULATION OF TWO-PHASE DESIGNS

We postulated a disease model involving multiple rare variants drawn from the simulated haplotype population, with the probability of having any effect and the expected size of the effect depending inversely on the MAF (Figure S1). We then sampled pairs of haplotypes at random from this population, computed their risk under this model, and assigned case-control status, continuing in this manner until the target number of 1000 cases and 1000 controls was obtained for the parent GWAS, and tested these data for association with all common SNPs (MAF > 5%). If a significant association was found with one or more SNPs, the replicate was retained for the sequencing substudy.

For the subsample, we first constructed a risk index based on a multiple logistic regression of disease state on all non-redundant GWAS SNPs and stratified the phase 1 subjects into three strata of high, medium, and low risk (with cutpoints at the 25th and 75th percentiles).
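A minimal sketch of this risk-index construction, assuming a dosage matrix for the non-redundant GWAS SNPs; the variable names and the use of scikit-learn are our choices for illustration, not the study's implementation:

```python
# Build the phase 1 sampling risk index: logistic regression of case-control
# status on the GWAS SNPs, then stratify subjects at the 25th and 75th
# percentiles of the fitted linear predictor.
import numpy as np
from sklearn.linear_model import LogisticRegression

def risk_strata(snp_dosages, disease):
    """snp_dosages: subjects x SNPs matrix; disease: 0/1 outcome vector."""
    fit = LogisticRegression(max_iter=1000).fit(snp_dosages, disease)
    index = snp_dosages @ fit.coef_.ravel() + fit.intercept_[0]
    lo, hi = np.percentile(index, [25, 75])
    strata = np.where(index <= lo, "low",
                      np.where(index >= hi, "high", "medium"))
    return strata, index
```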
We compared case-control, balanced, and optimal sampling of 600 subjects in total out of the available 4000 (Methods section Optimization of Two-Phase Studies). For these subjects, we retained all variants, including the causal ones but also many more irrelevant ones.

Finally, we conducted a joint analysis of both phases. We tested association with the Madsen and Browning (2009) index (the number of rare variants, weighted inversely by the square root of their allele frequencies) as the rare-variant covariate of interest, treated as continuous, using the WL, PL, and semi-parametric likelihood methods described in Methods section Likelihoods for Joint Analysis of Two-Phase Studies. We found that the risk index used for sampling was a confounder of the Madsen-Browning rare variant index, due to LD among the variants included in each, due to differences between the weights in the Madsen-Browning index and the simulated weights, and due to having used the disease status to construct the risk index, so all results were adjusted for the sampling risk index. This entire process was repeated 1000 times.

LIKELIHOODS FOR JOINT ANALYSIS OF TWO-PHASE STUDIES

Following the general notation used by Breslow and Holubkov (1997a,b), we let V represent a set of GWAS SNPs in a region found to be associated with disease Y, and X represent the causal variant(s) in LD with the GWAS SNPs, to be discovered by sequencing the region in a subsample of subjects. For this purpose, we defined the causal variable X to be the Madsen-Browning index. The imputation strategy entails simply fitting a regression model for X|V to the substudy data, and then using \hat{X}(V) as the covariate for Y|X in the full study. Proper inference would, however, require that the uncertainty in the imputation be taken into account in the analysis of the main study data. We now describe a formal likelihood approach to accomplish this.

If the first stage is a case-control sample, then the full likelihood would be the retrospective probability

L_1(\alpha,\beta,\gamma) = \Pr(V \mid Y) = \prod_{i=1}^{N} \sum_{x} \Pr(V_i, X_i = x \mid Y_i)
  = \prod_{i=1}^{N} \frac{ p_\gamma(V_i) \sum_x p_\beta(Y_i \mid x)\, p_\alpha(x \mid V_i) }{ \sum_\nu p_\gamma(\nu) \sum_x p_\beta(Y_i \mid x)\, p_\alpha(x \mid \nu) }

and the likelihood for the second-stage sample S would be

L_2(\alpha,\beta) = \Pr(X \mid Y, V, S) = \prod_{j \in S} \frac{ p_\beta(Y_j \mid X_j)\, p_\alpha(X_j \mid V_j) }{ \sum_x p_\beta(Y_j \mid x)\, p_\alpha(x \mid V_j) }

The full likelihood would then be L(\theta) = L_1(\alpha,\beta,\gamma)\, L_2(\alpha,\beta), where \theta = (\alpha,\beta,\gamma). In practice, however, both V and X are highly multidimensional, and we wish to avoid having to specify their joint LD distribution parametrically. When we do not assume functional forms for p_\alpha(X|V) and p_\gamma(V), the likelihood above becomes suitable for semiparametric maximum likelihood (SPML) estimation. Both Breslow and Holubkov (1997a,b) and Scott et al. (2007) have developed profile likelihoods by maximizing out the high-dimensional parameters p_\alpha(X|V) and p_\gamma(V), with different parameterizations.
We followed a recent formulation of the problem from Scott et al., in which the estimating equations for β and the constraints for the nuisance parameters π are described in a "log-likelihood"

l^*(\beta,\pi) = \sum_{y,v,x} n_{yvx} \log p^*_{yv}(x;\beta,\pi) + \sum_{y,v} \bar{N}_{yv} \log \pi_{yv} - \sum_{y,v} n_{yv+} \log\left( N_{+v}\,\pi_{yv} - \bar{N}_{yv} \right)

where \pi_{1v} = 1 - \pi_{0v}, \bar{N}_{yv} = N_{yv} - n_{yv}, and

p^*_{1v}(x;\beta,\pi) = \operatorname{expit}\left[ x'\beta + \log\left( N_{+v} - \frac{\bar{N}_{1v}}{\pi_{1v}} \right) - \log\left( N_{+v} - \frac{\bar{N}_{0v}}{\pi_{0v}} \right) \right]

By iterating between maximizing a logistic likelihood with fixed offsets containing π and updating π using its constraint equations, we can obtain semiparametric (SPML) efficient estimates of β. The SPML approach has the advantage of being flexible about the distribution of the covariates X and V while retaining good efficiency. However, the derivation of the semiparametric estimating equations was complex enough that it appeared only after two approximation methods, the WL and the PL, had been published. The WL approach weights the individual score functions from model p_β(Y|X) inversely proportionally to the sampling probabilities. In our implementation, the original Horvitz-Thompson weights N_{yv}/n_{yv} were used, although an improvement might be to use predicted weights 1/Pr(S = 1|Y, V, Z) that could incorporate auxiliary information Z from the full cohort (say, from a logistic model), or better yet, the calibrated weights described in Breslow et al. (2009b). The PL approach was first developed in Breslow and Cain (1988), representing an alternative that uses first-phase information. Following their seminal paper, we work with a PL based on p_{β,δ}(Y|X, V, S), which incorporates the parameter of interest β and the nuisance log-odds δ_v for Y = 1 in stratum V = v. They insert estimates \hat{\delta}_v = log(N_{1v}/N_{0v}) into the PL and then solve for β. Schill et al. (1993) proposed to estimate δ and β simultaneously; this method, although not implemented here, has been reported to yield similar results to the Breslow and Cain (1988) version.

OPTIMIZATION OF TWO-PHASE STUDIES

Our general strategy for optimization of any of these two-phase designs aims to solve the following problem. Suppose we have collected phase I data on N subjects and seek to selectively assemble data on at most n subjects based on the available information. The available information from phase I, mainly observations of Y and V, is summarized by the cell sizes N_{yv}. What we wish to optimize are the cell-specific sampling fractions, denoted s_{yv} = n_{yv}/N_{yv}. A natural choice of objective function might be to aim for more precise parameter estimates per unit cost, for example using the Asymptotic Relative Cost Efficiency (Thomas, 2007). However, while this goal can be readily achieved for a scalar parameter, as in Reilly (1996), it is less clear when more than one parameter is estimated. In this work, we chose our objective function to be the non-centrality parameter of the likelihood ratio test for H_0: \tilde{\beta} = 0 vs. H_1: \tilde{\beta} \neq 0, with \tilde{\beta} being the subset of interest within β. This objective is equivalent to a linear combination of information matrix entries, and is thus a good summary of the standard error estimates. We denote the entire parameter vector, including β and the other nuisance parameters, as θ. It has been shown (Self et al., 1992; Brown et al., 1999) that the non-centrality parameter can be computed as \lambda = 2 E_{\theta_A}[\, l(\theta_A) - l(\hat{\theta}_0) \,], where l(.) is the log-likelihood function, \theta_A is the true parameter vector under the alternative hypothesis, and \hat{\theta}_0 is the parameter vector that maximizes E_{\theta_A}[l(\theta)] under the null hypothesis. A slightly different form, \lambda' = \lambda - \nu, with ν being the degrees of freedom of the test, is also reasonable. In this particular problem, we used the l^*(.)
shown in the previous section in place of l(.). It has been shown (Scott and Wild, 1989; Scott et al., 2007) that the profile log-likelihood l^*(.) is amenable to standard likelihood ratio tests.

This problem setting requires N_{yv}, Pr(X|Y, V), and β_0 (the true value of β) to be either known or pre-specified. To obtain these quantities, we simulated 1000 data sets as described in section Simulation of Two-Phase Designs and considered these data sets as the underlying "super-population." We then used the estimates of Pr(X|Y, V) and β_0 and the average values of N_{yv} from this super-population as the input to the optimization procedure. Hence we used a fixed optimal design for all simulation replicates, using the SPML. Compared to solving the optimization problem for each replicate, this strategy represents a solution to the "expected" problem.

CALCULATION OF THE EXPECTED YIELD OF SINGLE-VARIANT TESTS IN THE WECARE STUDY

The estimates in Table 2 were derived using only the MAF distribution from the simulated haplotype population. The expected carrier probabilities in each age, FH, and disease stratum were computed from the assumed distribution of RRs over a grid of MAF and RR values using standard Mendelian inheritance methods. We computed the probability of observing c copies of a variant in the subsample from the Poisson distribution for each MAF and RR bin, summing the expected counts over all sampling strata, and in a similar manner, we computed the probability of observing c′ copies among 1000 population controls. The total yield of discovered novel variants is then the sum over these bins of the number of variants in each bin times the probability of seeing >c and <c′ copies.

For association testing in the main study, we computed the NCP for the Mantel-Haenszel test (stratified by age and FH) for each bin and then computed power by reference to the cumulative normal distribution, with Bonferroni correction for either the total number of discovered variants or only the number of novel variants. These are again summed over all RR and MAF bins to estimate the expected yield for the various values of c and c′ given in Table 2.

SIMULATION OF GENE- AND PATHWAY-LEVEL PRIORITIZATION IN THE WECARE STUDY

To compare the power of single-variant and burden tests, we selected variants from the haplotypes in a multi-level fashion as follows. We defined 100 pathways p, each comprising 1–20 genes g, further subdivided into three regions r (e.g., "exons," "introns and promoter regions," and "more distant enhancer regions"). Starting and ending locations of each gene and its sub-regions were selected at random, and all variants v within these regions were included. Pathways, genes, and variants were selected as causal with probabilities π_P, π_G, and π_v respectively, where π_v for variant v depends upon the type of region and its MAF. Thus, a variant has a causal effect only if all three levels are designated as causal. Each causal variant was assigned a log RR β_v as a sum of pathway-, gene-, and variant-level effects, each being the absolute value of a normal deviate with zero mean and variance σ²_P, σ²_G, and σ²_v, respectively, with σ²_v also depending upon the type of region and MAF. We drew two haplotypes at random for each potential subject and computed the genetic log RR as the sum of the β_v's for each variant they carried. Subjects were assigned at random to an age stratum and then to disease (unaffected, unilateral, bilateral) and FH strata with probabilities depending upon their age and genetic RR.
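A minimal sketch of this multi-level assignment for a single gene, using the parameter values from the footnote to Table 3; the region codes r(v) ∈ {1, 2, 3}, the MAFs q_v, and the function names are assumed inputs chosen for illustration:

```python
# Hierarchical causal-variant assignment: a variant is causal only if its
# pathway, its gene, and the variant itself are all designated causal; its
# log RR is a sum of half-normal pathway-, gene-, and variant-level effects.
import numpy as np

rng = np.random.default_rng(2013)
SIG2_P, PI_P, SIG2_G, PI_G = 1.0, 0.125, 0.25, 0.25
ETA = {1: -1.0, 2: -1.5, 3: -2.0}     # variant variance model, by region type
ZETA = {1: -0.5, 2: -1.5, 3: -2.5}    # variant causality model, by region type

def half_normal(sig2):
    return abs(rng.normal(0.0, np.sqrt(sig2)))

def gene_log_rrs(regions, mafs):
    """regions: region type r(v) for each variant; mafs: MAF q_v for each."""
    pathway_causal = rng.random() < PI_P
    gene_causal = rng.random() < PI_G
    b_P = half_normal(SIG2_P) if pathway_causal else 0.0
    b_G = half_normal(SIG2_G) if gene_causal else 0.0
    betas = []
    for r, q in zip(regions, mafs):
        sig2_v = np.exp(ETA[r] - 0.25 * np.log(q / 0.01))
        pi_v = 1.0 / (1.0 + np.exp(-(ZETA[r] - 0.25 * np.log(q / 0.01))))
        causal = pathway_causal and gene_causal and rng.random() < pi_v
        betas.append((b_P + b_G + half_normal(sig2_v)) if causal else 0.0)
    return np.array(betas)
```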
This simulation process was continued until the target number of subjects in each age, FH, and disease stratum was obtained, and a random subset of these was designated as the sequencing sample. In the real WECARE study, we prioritized the youngest cases, those with a positive FH, and radiotherapy subjects with the longest latency for sequencing. In this way, we selected 201 subjects out of the total of 2199 available for sequencing; the distribution of the entire study sample and the sequencing subsample by age, FH, and laterality is provided in Table S1.

For the analysis, we scanned the subsample to identify all variants seen at least twice. All single variants that were seen more frequently than expected (by an amount depending on the number of comparisons) based on the general population MAFs, and similarly all pathways, genes, or regions that were seen more frequently than expected, were prioritized. These were tested for case-control association in the main study, using a Cochran-Mantel-Haenszel (CMH) test, stratified by age and FH, with Bonferroni adjustment for the number of comparisons at each level.

SIMULATION OF FAMILY-BASED DESIGNS

Family-based simulations used a fixed pedigree structure of 4 generations with two offspring in each generation, for a total of 22 members in 7 nuclear families. We sampled two haplotypes at random for each of the founders from the simulated haplotype population and dropped them at random without recombination through the non-founders. As before, we chose causal variants and their RRs depending upon MAF, computed the genetic log RR as the sum of the β_v's for each causal variant a subject carries, and assigned disease status accordingly, adjusting the intercept to yield a population prevalence of 5%. Families with the required number of cases (set to 4 for most of the results reported here) were retained, and the process continued until 1000 such families were ascertained. Various criteria were used to select a subset of family members whose genotypes were to be retained for analysis (the "sequencing subset," e.g., two affected individuals of at least second-degree relationship and one unaffected member), while retaining the phenotype information for the entire pedigree.

For each of the causal variants and a random sample of the non-causal variants, we computed the likelihood ratio, Bayes factor, and score statistics (described below) and tabulated these values for different configurations of genotypes among the sequenced members and their relationships to each other. For the rule-based prioritization, we also tabulated the number of variants found in at least f_min families and the number of these that were prioritized by having the target genotype configuration (e.g., both cases being carriers and the control not). The distributions of causal and non-causal variants for each criterion are shown in Figure S2 as a function of the threshold for prioritization; plotting one curve against the other yields the ROCs displayed in Figure 3.

FAMILY-BASED CRITERIA FOR PRIORITIZATION OF VARIANTS

Rule-based criterion

Variants were classified on the basis of the number of families in which all sequenced cases carried the variant and any sequenced controls did not.

Likelihood ratio criterion

Following the principles described in Petersen et al. (1998), we estimated the probability that any particular variant is causal under a given genetic model by accumulating likelihood ratio contributions (comparing the likelihood of the data under the alternative hypothesis that a particular variant is causal to that under the null hypothesis that it is not causal) across families.
Letting Y denote the phenotypes of all family members (including those not sequenced), G^{obs}_v the observed sequence data, β_v the genetic RR, and q_v the minor allele frequency for variant v, the likelihood ratio is

\mathrm{LR}_v = \frac{\Pr(G^{\mathrm{obs}}_v \mid Y; \beta_v, q_v)}{\Pr(G^{\mathrm{obs}}_v \mid q_v)} = \frac{\Pr(G^{\mathrm{obs}}_v, Y)}{\Pr(G^{\mathrm{obs}}_v)\,\Pr(Y)}

where \Pr(G^{\mathrm{obs}}_v, Y) = \sum_{G^{\mathrm{unobs}}_v} \Pr(Y \mid G_v)\,\Pr(G_v). These calculations were done evaluating the likelihood under the simulated RR and minor allele frequency for each variant under the alternative hypothesis, and under the induced marginal population risk for the null hypothesis.

Bayes factor criterion

The likelihood ratio criterion requires a maximization of the likelihood under the alternative hypothesis, which can be unstable for rare variants. To avoid this, Petersen et al. (1998) compute a Bayes factor by averaging over a prior distribution of MAF and RR. Bayes factors are computed as the ratio of the marginal probabilities of the joint genotypes of the sampled individuals under the true model to that under the null. Of course, we do not know the true values of either β or q, or even their true probability distributions, so we used the simulated probability distributions for Pr(q) and Pr(β|q), averaging over random samples of 100 parameter values drawn from their null and alternative distributions.

Score test criterion

We computed the score statistic T_v for a single variant v, derived from a multivariate logistic model for the phenotypes of the entire pedigree and the genotypes of the sequenced subset, as

T_v = \sum_f t_{fv} = \sum_f (Y_f - p_f \mathbf{1})' K_f^{-1} (G_{fv} - q_{fv}\mathbf{1}) = \sum_f \Big[ \sum_{i \in N_f} \sum_{j \in S_f} (Y_{fi} - p_f)\, K^{ij}_f\, (G_{fjv} - q_{fv}) \Big]

where Y_f is the vector of phenotypes for family f, p_f is the family-specific disease prevalence, K_f is the kinship matrix (K^{ij}_f denoting the (i, j) element of its inverse), G_{fv} is the vector of genotypes for variant v, and q_{fv} is the mean of G_v among the sequenced members. The G_{fv} − q_{fv} deviations are set to zero for untyped individuals, but the inclusion of the kinship terms for typed-untyped pairs allows their phenotypes to contribute. This statistic has mean zero under the null hypothesis and asymptotic variance \mathrm{var}(T_v) = \sum_f t_{fv}^2. For the purpose of prioritizing variants, we used the score test T_v^2/\mathrm{var}(T_v) for each variant and selected the top-ranked ones at some cutoff. This provides a pure within-family comparison, but those families for which all sequenced individuals are non-carriers, or all are carriers, become uninformative. A more powerful test that exploits both between- and within-family information replaces p_f and q_{fv} by the corresponding means over all families. The regional (SKAT) test for all variants in region R is simply \big[\sum_f \big(\sum_{v \in R} t_{fv}^2\big)\big]^2 \big/ \sum_f \big(\sum_{v \in R} t_{fv}^2\big)^2.

CALCULATION OF POWER FOR TWO-STAGE DESIGNS

Using the family-based simulation described above, we tabulated the average number of families in the simulated sample in which each variant was seen at least once, and the NCP as the mean of the simulated score statistics, by bins of MAF and RR. To compare designs with different stage 1 sample sizes and discovery thresholds, we rescaled the numbers of families carrying a given variant by the ratio of the proposed and simulated sample sizes and recomputed the probability of discovery by reference to the Poisson distribution in each bin. In a similar manner, we computed the probability of prioritization at stage one at threshold λ_min in each bin by rescaling the NCPs by the ratio of sample sizes and referring them to the non-central chi-square distribution.
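A minimal sketch of this per-bin calculation; the sample-size ratios, threshold values, and the function name are illustrative assumptions rather than the study's code:

```python
# Per-(MAF, RR)-bin yield for the two-stage family design: discovery from the
# Poisson distribution, prioritization at score threshold lambda_min from the
# non-central chi-square, and replication at a Bonferroni-corrected stage 2
# critical value, with NCPs rescaled by sample-size ratios.
from scipy.stats import chi2, ncx2, poisson

def bin_yield(n_variants, mean_families, ncp_sim, ratio1, ratio2,
              c_min=2, lambda_min=3.0, n_tests2=1000, alpha=0.05):
    """mean_families: simulated mean number of families carrying the variant;
    ncp_sim: simulated mean score statistic (NCP) for the bin;
    ratio1, ratio2: proposed/simulated sample-size ratios for stages 1 and 2."""
    p_disc = poisson.sf(c_min - 1, mean_families * ratio1)    # seen >= c_min times
    p_prior = ncx2.sf(lambda_min, df=1, nc=ncp_sim * ratio1)  # passes score cutoff
    crit2 = chi2.ppf(1 - alpha / n_tests2, df=1)              # Bonferroni, stage 2
    p_repl = ncx2.sf(crit2, df=1, nc=ncp_sim * ratio2)
    return n_variants * p_disc * p_prior * p_repl
```

Summing this quantity over all MAF and RR bins gives the expected yield of causal variants discovered, prioritized, and replicated, as described next.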
To extend our simulation results to the whole genome, we multiplied the predicted number of simulated null variants meeting our discovery and prioritization criteria by 20,000,000/1000. The total number of variants carried forward to stage 2 is then simply the sum over all MAF and RR bins of the product of the number of variants in the population times the probabilities of discovery and prioritization. Power for stage two was computed in a similar manner, by rescaling the NCPs by the stage 2 sample size and referring them to the non-central chi-square distribution, with Bonferroni correction for the number of tests carried forward. Thus, the yield of causal variants discovered, prioritized, and replicated is the sum over all MAF and RR bins of the number of variants in the population times the probabilities of discovery, prioritization, and replication. In each MAF bin, we also computed the Poisson probability that a variant would not have been seen at least twice in 1000 population controls, and computed the power for novel variants in a similar manner.

Calculations for case-control designs were similar, except that no simulation was required. The probability of discovery could be computed directly from the Poisson distribution in the combined case and control sample, and the NCPs for the chi-square test for allelic association were computed in the usual way for a 2×2 contingency table of allele counts by case-control status.

ACKNOWLEDGMENTS

Supported in part by NIH grants U01 HG005927, U19 CA148107, R01 ES019876, R01 CA129639, P30 CA014089, and P30 ES07048. The authors are grateful to Drs. Paul Marjoram and David Conti for many helpful discussions, and to the investigators from the Colorectal Cancer Family Registries (Dr. Steve Thibodeau, P.I. of the proposed sequencing substudy) and the WECARE Study (Dr. Jonine Bernstein, P.I.) for the studies used for illustration, particularly Dr. David Duggan for the summary of preliminary data from the WECARE Study.

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fgene.2013.00276/abstract

Figure S1 | Simulation parameters for models 1 and 2: top, probability of causality and mean RR for causal variants as a function of MAF; bottom, frequency distribution of non-causal and causal variants as a function of MAF.

Figure S2 | Yield of prioritized variants as a function of the number of families required for prioritization, the minimum Bayes factor, and the minimum score.

Figure S3 | Two-stage designs using Bayes factors for prioritization. Top panel, varying the number of variants required for discovery (1–4) and the minimum BF for prioritization; bottom panel, detail of the left-most portion, varying the sample size for replication N = 20 (×2) 10880. The colors indicate the proportions of simulated causal variants that are discovered, prioritized, and replicated.

Table S1 | Sample sizes used to illustrate power calculations for a two-phase design for the WECARE study (note that the actual sequencing sample is further stratified by radiotherapy and latency for the purpose of studying gene-radiation interactions, factors not considered in these simulations). UBC, unilateral breast cancer (controls); CBC, contralateral (second asynchronous) breast cancer (cases).

REFERENCES

Asimit, J., and Zeggini, E. (2010). Rare variant association analysis methods for complex traits. Annu. Rev. Genet. 44, 293–308. doi: 10.1146/annurev-genet-102209-163421

Asimit, J., and Zeggini, E. (2012).
Imputation of rare variants in next generation association studies. Hum. Hered. 74, 196–204. doi: 10.1159/000345602

Bacanu, S. A., Nelson, M. R., and Whittaker, J. C. (2012). Comparison of statistical tests for association between rare variants and binary traits. PLoS ONE 7:e42530. doi: 10.1371/journal.pone.0042530

Basu, S., and Pan, W. (2011). Comparison of statistical tests for disease association with rare variants. Genet. Epidemiol. 35, 606–619. doi: 10.1002/gepi.20609

Begg, C. B., Haile, R. W., Borg, A., Malone, K. E., Concannon, P., Thomas, D. C., et al. (2008). Variation of breast cancer risk among BRCA1/2 carriers. JAMA 299, 194–201. doi: 10.1001/jama.2007.55-a

Bernstein, J. L., Haile, R. W., Stovall, M., Boice, J. D. Jr., Shore, R. E., Langholz, B., et al. (2010). Radiation exposure, the ATM gene, and contralateral breast cancer in the Women's Environmental Cancer and Radiation Epidemiology study. J. Natl. Cancer Inst. 102, 475–483. doi: 10.1093/jnci/djq055

Bernstein, J. L., Langholz, B., Haile, R. W., Bernstein, L., Thomas, D. C., Stovall, M., et al. (2004). Study design: evaluating gene-environment interactions in the etiology of breast cancer – the WECARE study. Breast Cancer Res. 6, R199–214. doi: 10.1186/bcr771

Bernstein, J. L., Thomas, D. C., Shore, R. E., Robson, M., Boice, J. D. Jr., Stovall, M., et al. (2013). Contralateral breast cancer after radiotherapy among BRCA1 and BRCA2 mutation carriers: a WECARE study report. Eur. J. Cancer 49, 2979–2985. doi: 10.1016/j.ejca.2013.04.028

Bodmer, W., and Bonilla, C. (2008). Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet. 40, 695–701. doi: 10.1038/ng.f.136

Borg, A., Haile, R. W., Malone, K. E., Capanu, M., Diep, A., Torngren, T., et al. (2010). Characterization of BRCA1 and BRCA2 deleterious mutations and variants of unknown clinical significance in unilateral and bilateral breast cancer: the WECARE study. Hum. Mutat. 31, E1200–E1240. doi: 10.1002/humu.21202

Breslow, N., and Cain, K. (1988). Logistic regression for two-stage case-control data. Biometrika 75, 11–20. doi: 10.1093/biomet/75.1.11

Breslow, N., McNeney, B., and Wellner, J. A. (2003). Large sample theory for semiparametric regression models with two-phase, outcome-dependent sampling. Ann. Stat. 31, 1110–1139. doi: 10.1214/aos/1059655907

Breslow, N. E., and Chatterjee, N. (1999). Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Appl. Stat. 48, 457–468. doi: 10.1111/1467-9876.00165

Breslow, N. E., and Holubkov, R. (1997a). Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J. R. Stat. Soc. B 59, 447–461. doi: 10.1111/1467-9868.00078

Breslow, N. E., and Holubkov, R. (1997b). Weighted likelihood, pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Stat. Med. 16, 103–116.

Breslow, N. E., Lumley, T., Ballantyne, C. M., Chambless, L. E., and Kulich, M. (2009a). Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Stat. Biosci. 1, 32. doi: 10.1007/s12561-009-9001-6

Breslow, N. E., Lumley, T., Ballantyne, C. M., Chambless, L. E., and Kulich, M. (2009b). Using the whole cohort in the analysis of case-cohort data. Am. J. Epidemiol. 169, 1398–1405. doi: 10.1093/aje/kwp055

Breslow, N. E., and Wellner, J. A. (2007).
Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand. J. Stat. 34, 86–102. doi: 10.1111/j.1467-9469.2006.00523.x

Breslow, N. E., and Zhao, L. P. (1988). Logistic regression for stratified case-control studies. Biometrics 44, 891–899. doi: 10.2307/2531601

Brooks, J. D., Teraoka, S. N., Reiner, A. S., Satagopan, J. M., Bernstein, L., Thomas, D. C., et al. (2012). Variants in activators and downstream targets of ATM, radiation exposure, and contralateral breast cancer risk in the WECARE study. Hum. Mutat. 33, 158–164. doi: 10.1002/humu.21604

Brown, B. W., Lovato, J., and Russell, K. (1999). Asymptotic power calculations: description, examples, computer code. Stat. Med. 18, 3137–3151. doi: 10.1002/(SICI)1097-0258(19991130)18:22<3137::AID-SIM239>3.0.CO;2-O

Burdick, J. T., Chen, W. M., Abecasis, G. R., and Cheung, V. G. (2006). In silico method for inferring genotypes in pedigrees. Nat. Genet. 38, 1002–1004. doi: 10.1038/ng1863

Bush, W. S., Dudek, S. M., and Ritchie, M. D. (2009). Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac. Symp. Biocomput. 368–379. doi: 10.1142/9789812836939_0035

Cain, K., and Breslow, N. (1988). Logistic regression analysis and efficient design for two-stage studies. Am. J. Epidemiol. 128, 1198–1206.

Cantor, R. M., Lange, K., and Sinsheimer, J. S. (2010). Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am. J. Hum. Genet. 86, 6–22. doi: 10.1016/j.ajhg.2009.11.017

Capanu, M., and Begg, C. B. (2011). Hierarchical modeling for estimating relative risks of rare genetic variants: properties of the pseudo-likelihood method. Biometrics 67, 371–380. doi: 10.1111/j.1541-0420.2010.01469.x

Capanu, M., Concannon, P., Haile, R. W., Bernstein, L., Malone, K. E., Lynch, C. F., et al. (2011). Assessment of rare BRCA1 and BRCA2 variants of unknown significance using hierarchical modeling. Genet. Epidemiol. 35, 389–397. doi: 10.1002/gepi.20587

Chahrour, M. H., Yu, T. W., Lim, E. T., Ataman, B., Coulter, M. E., Hill, R. S., et al. (2012). Whole-exome sequencing and homozygosity analysis implicate depolarization-regulated neuronal genes in autism. PLoS Genet. 8:e1002635. doi: 10.1371/journal.pgen.1002635

Chasman, D. I. (2008). On the utility of gene set methods in genomewide association studies of quantitative traits. Genet. Epidemiol. 32, 658–668. doi: 10.1002/gepi.20334

Chen, G. K., and Witte, J. S. (2007). Enriching the analysis of genomewide association studies with hierarchical modeling. Am. J. Hum. Genet. 81, 397–404. doi: 10.1086/519794

Chen, H., Meigs, J. B., and Dupuis, J. (2013). Sequence kernel association test for quantitative traits in family samples. Genet. Epidemiol. 37, 196–204. doi: 10.1002/gepi.21703

Cheung, C. Y. K., Thompson, E. A., and Wijsman, E. M. (2013). GIGI: an approach to effective imputation of dense genotypes on large pedigrees. Am. J. Hum. Genet. 92, 504–516. doi: 10.1016/j.ajhg.2013.02.011

Cheung, C. Y. K., and Wijsman, E. M. (2013). "Design matters! A statistical framework to guide sequencing choices in pedigrees (IGES abstract #9)," in International Genetic Epidemiology Society, eds C. Greenwood, J. Loreno Bermejo, B. Fridley, J. Houwing-Duistermaat, A. Paterson, S. Shete, et al. (Chicago, IL: Genetic Epidemiology), 3–4.

Cicek, M. S., Cunningham, J. M., Fridley, B. L., Serie, D. J., Bamlet, W. R., Diergaarde, B., et al. (2012). Colorectal cancer linkage on chromosomes 4q21, 8q13, 12q24, and 15q22. PLoS ONE 7:e38175.
doi: 10.1371/journal.pone.0038175

Cirulli, E. T., and Goldstein, D. B. (2010). Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 11, 415–425. doi: 10.1038/nrg2779

Concannon, P., Haile, R. W., Borresen-Dale, A. L., Rosenstein, B. S., Gatti, R. A., Teraoka, S. N., et al. (2008). Variants in the ATM gene associated with a reduced risk of contralateral breast cancer. Cancer Res. 68, 6486–6491. doi: 10.1158/0008-5472.CAN-08-0134

Conneely, K. N., and Boehnke, M. (2007). So many correlated tests, so little time! Rapid adjustment of P values for multiple correlated tests. Am. J. Hum. Genet. 81, 1158–1168. doi: 10.1086/522036

Conti, D. V., and Witte, J. S. (2003). Hierarchical modeling of linkage disequilibrium: genetic structure and spatial relations. Am. J. Hum. Genet. 72, 351–363. doi: 10.1086/346117

Conti, D. V., Lewinger, J. P., Swan, G. E., Tyndale, R. F., Benowitz, N. L., and Thomas, P. D. (2009). "Using ontologies in hierarchical modeling of genes and exposures in biologic pathways," in Phenotypes and Endophenotypes: Foundations for Genetic Studies of Nicotine Use and Dependence, ed G. E. Swan (Bethesda, MD: NCI Tobacco Control Monographs), 539–584.

Craig, D. W., Pearson, J. V., Szelinger, S., Sekar, A., Redman, M., Corneveaux, J. J., et al. (2008). Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods 5, 887–893. doi: 10.1038/nmeth.1251

Duan, Q., Liu, E. Y., Auer, P. L., Zhang, G., Lange, E. M., Jun, G., et al. (2013). Imputation of coding variants in African Americans: better performance using data from the exome sequencing project. Bioinformatics 29, 2744–2749. doi: 10.1093/bioinformatics/btt477

Elston, R. C., Lin, D., and Zheng, G. (2007). Multistage sampling for genetic studies. Annu. Rev. Genomics Hum. Genet. 8, 327–342. doi: 10.1146/annurev.genom.8.080706.092357

Faye, L. L., Machiela, M. J., Kraft, P., Bull, S. B., and Sun, L. (2013). Re-ranking sequencing variants in the post-GWAS era for accurate causal variant identification. PLoS Genet. 9:e1003609. doi: 10.1371/journal.pgen.1003609

Feng, T., Elston, R. C., and Zhu, X. (2011). Detecting rare and common variants for complex traits: sibpair and odds ratio weighted sum statistics (SPWSS, ORWSS). Genet. Epidemiol. 35, 398–409. doi: 10.1002/gepi.20588

Feng, T., Zhang, S., and Sha, Q. (2007). Two-stage association tests for genome-wide association studies based on family data with arbitrary family structure. Eur. J. Hum. Genet. 15, 1169–1175. doi: 10.1038/sj.ejhg.5201902

Freedman, M. L., Monteiro, A. N., Gayther, S. A., Coetzee, G. A., Risch, A., Plass, C., et al. (2011). Principles for the post-GWAS functional characterization of cancer risk loci. Nat. Genet. 43, 513–518. doi: 10.1038/ng.840

Gorlov, I. P., Gorlova, O. Y., Frazier, M. L., Spitz, M. R., and Amos, C. I. (2011). Evolutionary evidence of the effect of rare variants on disease etiology. Clin. Genet. 79, 199–206. doi: 10.1111/j.1399-0004.2010.01535.x

Greenland, S. (2000). Principles of multilevel modelling. Int. J. Epidemiol. 29, 158–167. doi: 10.1093/ije/29.1.158

Heinzen, E. L., Depondt, C., Cavalleri, G. L., Ruzzo, E. K., Walley, N. M., Need, A. C., et al. (2012). Exome sequencing followed by large-scale genotyping fails to identify single rare variants of large effect in idiopathic generalized epilepsy. Am.
J. Hum. Genet. 91, 293–302. doi: 10.1016/j.ajhg.2012.06.016

Hindorff, L. A., Sethupathy, P., Junkins, H. A., Ramos, E. M., Mehta, J. P., Collins, F. S., et al. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. U.S.A. 106, 9362–9367. doi: 10.1073/pnas.0903103106

Hoffmann, T. J., Marini, N. J., and Witte, J. S. (2010). Comprehensive approach to analyzing rare genetic variants. PLoS ONE 5:e13584. doi: 10.1371/journal.pone.0013584

Holden, M., Deng, S., Wojnowski, L., and Kulle, B. (2008). GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics 24, 2784–2785. doi: 10.1093/bioinformatics/btn516

Hopper, J. L., Southey, M. C., Dite, G. S., Jolley, D. J., Giles, G. G., McCredie, M. R. E., et al. (1999). Population-based estimate of the average age-specific cumulative risk of breast cancer for a defined set of protein-truncating mutations in BRCA1 and BRCA2. Cancer Epidemiol. Biomarkers Prev. 8, 741–747.

Horvitz, D., and Thompson, D. (1952). A generalization of sampling without replacement from a finite population. J. Am. Stat. Assoc. 47, 663–685. doi: 10.1080/01621459.1952.10483446

Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., and Abecasis, G. R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959. doi: 10.1038/ng.2354

Huang, Y., Hinds, D. A., Qi, L., and Prentice, R. L. (2010). Pooled versus individual genotyping in a breast cancer genome-wide association study. Genet. Epidemiol. 34, 603–612. doi: 10.1002/gepi.20517

Hung, R. J., Baragatti, M., Thomas, D., McKay, J., Szeszenia-Dabrowska, N., Zaridze, D., et al. (2007). Inherited predisposition of lung cancer: a hierarchical modeling approach to DNA repair and cell cycle control pathways. Cancer Epidemiol. Biomarkers Prev. 16, 2736–2744. doi: 10.1158/1055-9965.EPI-07-0494

Hung, R. J., Brennan, P., Malaveille, C., Porru, S., Donato, F., Boffetta, P., et al. (2004). Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiol. Biomarkers Prev. 13, 1013–1021.

Ionita-Laza, I., Lee, S., Makarov, V., Buxbaum, J. D., and Lin, X. (2013). Family-based association tests for sequence data, and comparisons with population-based association tests. Eur. J. Hum. Genet. 21, 1158–1162. doi: 10.1038/ejhg.2012.308

Ionita-Laza, I., and Ottman, R. (2011). Study designs for identification of rare disease variants in complex diseases: the utility of family-based designs. Genetics 189, 1061–1068. doi: 10.1534/genetics.111.131813

Johnson, T. (2007). Bayesian method for gene detection and mapping, using a case and control design and DNA pooling. Biostatistics 8, 546–565. doi: 10.1093/biostatistics/kxl028

Karchin, R. (2009). Next generation tools for the annotation of human SNPs. Brief. Bioinform. 10, 35–52. doi: 10.1093/bib/bbn047

Kruglyak, L., Daly, M., and Lander, E. (1995). Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. Am. J. Hum. Genet. 56, 519–527.

Lange, C., Demeo, D., Silverman, E. K., Weiss, S. T., and Laird, N. M. (2003). Using the noninformative families in family-based association tests: a powerful new testing strategy. Am. J. Hum. Genet. 73, 801–811. doi: 10.1086/378591

Langholz, B., Thomas, D. C., Stovall, M., Smith, S. A., Boice, J. D. Jr., Shore, R. E., et al. (2009).
Statistical methods for analysis of radiation effects with tumoranddoselocation-specificinformationwithapplicationtotheWECARE study of asynchronous contralateral breast cancer. Biometrics 65, 599–608. doi: 10.1111/j.1541-0420.2008.01096.x Lee, J. S., Choi, M., Yan, X., Lifton, R. P., and Zhao, H. (2011). On optimal pooling designs to identify rare variants through massive resequencing. Genet. Epidemiol.35,139–147.doi:10.1002/gepi.20561 Lee,S.,Emond,M.J.,Bamshad,M.J.,Barnes,K.C.,Rieder,M.J.,Nickerson,D.A., etal.(2012).Optimalunifiedapproachforrare-variantassociationtestingwith applicationtosmall-samplecase-controlwhole-exomesequencingstudies.Am. J.Hum.Genet.91,224–237.doi:10.1016/j.ajhg.2012.06.007 Lee, W. C. (2005). A DNA pooling strategy for family-based association studies. Cancer Epidemiol. Biomarkers Prev. 14, 958–962. doi: 10.1158/1055-9965.EPI- 04-0503 Lewinger,J.P.,Conti,D.V.,Baurley,J.W.,Triche,T.J.,andThomas,D.C.(2007). Hierarchical Bayes prioritization of marker associations from a genome-wide association scan for further investigation. Genet. Epidemiol. 31, 871–882. doi: 10.1002/gepi.20248 Li, Y., Willer, C., Sanna, S., and Abecasis, G. (2009). Genotype imputation. Annu. Rev. Genomics Hum. Genet. 10, 387–406. doi: 10.1146/annurev.genom.9.081307.164242 Liang,W.E.,Thomas,D.C.,andConti, D.V.(2012).Analysis andoptimaldesign for association studies using next-generation sequencing with case-control pools.Genet.Epidemiol.36,870–881.doi:10.1002/gepi.21681 Macgregor, S., Zhao, Z. Z., Henders, A., Nicholas, M. G., Montgomery, G. W., andVisscher,P.M.(2008).Highlycost-efficientgenome-wideassociationstud- ies using DNA pools and dense SNP arrays. Nucleic Acids Res. 36, e35. doi: 10.1093/nar/gkm1060 Madsen, B. E., and Browning, S. R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 5:e1000384. doi: 10.1371/journal.pgen.1000384 Malone, K. E., Begg, C. B., Haile, R. W., Borg, A., Concannon, P., Tellhed, L., et al. (2010). Population-based study of the risk of second primary contralat- eral breast cancer associated with carrying a mutation in BRCA1 or BRCA2. J.Clin.Oncol.28,2404–2410.doi:10.1200/JCO.2009.24.2495 Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., et al. (2009). Finding the missing heritability of complex diseases. Nature 461,747–753.doi:10.1038/nature08494 Minelli,C.,DeGrandi,A.,Weichenberger,C.X.,Gögele,M.,Modenese,M.,Attia, J., et al. (2013). Importance of different types of prior knowledge in select- ing genome-wide findings for follow-up. Genet. Epidemiol. 37, 205–213. doi: 10.1002/gepi.21705 Murphy, A., Weiss, S. T., and Lange, C. (2008). Screening and replication using the same data set: testing strategies for family-based studies in which all probands are affected. PLoS Genet. 4:e1000197. doi: 10.1371/journal.pgen. 1000197 Neale, B. M., Rivas, M. A., Voight, B. F., Altshuler, D., Devlin, B., Orho-Melander, M.,etal.(2011).Testingforanunusualdistributionofrarevariants.PLoSGenet. 7:e1001322.doi:10.1371/journal.pgen.1001322 Newcomb, P. A., Baron, J., Cotterchio, M., Gallinger, S., Grove, J., Haile, R., et al. (2007). Colon cancer family registry: an international resource for studies of the genetic epidemiology of colon cancer. Cancer Epidemiol. Biomarkers Prev. 16,2331–2343.doi:10.1158/1055-9965.EPI-07-0648 Neyman,J.,andScott,E.(1966).Ontheuseofc(alpha)optimaltestsofcomposite hypotheses.Bull.Int.Stat.Inst.41,477–497. Nicolae, D. L., Gamazon, E., Zhang, W., Duan, S., Dolan, M. 
E., and Cox, N. J. (2010). Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet. 6:e1000888. doi: 10.1371/jour- nal.pgen.1000888 Panagiotou, O. A., Ioannidis, J. P. A., and Project, F. T. G.-W. S. (2012). What should the genome-wide significance threshold be? Empirical replication of borderline genetic associations. Int. J. Epidemiol. 41, 273–286. doi: 10.1093/ije/ dyr178 Frontiers in Genetics|StatisticalGeneticsandMethodology December 2013 | Volume 4 | Article 276 | 18 143 Thomas et al. Designs for sequencing studies Petersen, G. M., Parmigiani, G., and Thomas, D. (1998). Missense mutations in diseasegenes:aBayesianapproachtoevaluatecausality.Am.J.Hum.Genet.62, 1516–1524.doi:10.1086/301871 Pompanon, F., Bonin, A., Bellemain, E., and Taberlet, P. (2005). Genotyping errors: causes, consequences and solutions. Nat. Rev. Genet. 6, 847–859. doi: 10.1038/nrg1707 Quintana, M. A., Berstein, J. L., Thomas, D. C., and Conti, D. V. (2011). Incorporating model uncertainty in detecting rare variants: the Bayesian risk index.Genet.Epidemiol.35,638–649.doi:10.1002/gepi.20613 Quintana,M.A.,andConti,D.V.(2013).IntegrativevariableselectionviaBayesian modeluncertainty.Stat.Med.32,4928–4953.doi:10.1002/sim.5888 Quintana,M.A.,Schumacher,F.R.,Casey,G.,Bernstein,J.L.,Li,L.,andConti,D. V. (2012). Incorporating prior biologic information for high-dimensional rare variantassociationstudies.Hum.Hered.74,184–195.doi:10.1159/000346021 Rebbeck, T. R., Spitz, M., and Wu, X. (2004). Assessing the function of genetic variantsincandidategeneassociationstudies.Nat.Rev.Genet.5,589–597.doi: 10.1038/nrg1403 Reilly, M. (1996). Optimal sampling strategies for two-stage studies. Am. J. Epidemiol.143,92–100.doi:10.1093/oxfordjournals.aje.a008662 Reilly, M., and Pepe, M. S. (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82, 299–314. doi: 10.1093/biomet/82.2.299 Reiner, A. S., John, E. M., Brooks, J. D., Lynch, C. F., Bernstein, L., Mellemkjaer, L., et al. (2013). Risk of asynchronous contralateral breast cancer in noncar- riers of BRCA1 and BRCA2 mutations with a family history of breast cancer: a report from the Women’s environmental cancer and radiation epidemiology study.J.Clin.Oncol.31,433–439.doi:10.1200/JCO.2012.43.2013 Roeder, K., Bacanu, S. A., Wasserman, L., and Devlin, B. (2006). Using linkage genome scans to improve power of association in genome scans. Am. J. Hum. Genet.78,243–252.doi:10.1086/500026 San Lucas, F. A., Wang, G., Scheet, P., and Peng, B. (2012). Integrated annotation and analysis of genetic variants from next-generation sequencing studies with varianttools.Bioinformatics28,421–422.doi:10.1093/bioinformatics/btr667 Satagopan, J. M., and Elston, R. C. (2003). Optimal two-stage genotyping in population-based association studies. Genet. Epidemiol. 25, 149–157. doi: 10.1002/gepi.10260 Satagopan, J. M., Venkatraman, E. S., and Begg, C. B. (2004). Two-stage designs forgene-diseaseassociationstudieswithsamplesizeconstraints.Biometrics60, 589–597.doi:10.1111/j.0006-341X.2004.00207.x Satagopan, J. M., Verbel, D. A., Venkatraman, E. S., Offit, K. E., and Begg, C. B. (2002). Two-stage designs for gene-disease association studies. Biometrics 58, 163–170.doi:10.1111/j.0006-341X.2002.00163.x Schaffner,S.F.,Foo,C.,Gabriel,S.,Reich,D.,Daly,M.J.,andAltshuler,D.(2005). Calibrating a coalescent simulation of human genome sequence variation. GenomeRes.15,1576–1583.doi:10.1101/gr.3709305 Schaid, D. J. (2010a). 
Genomic similarity and kernel methods i: advancements by buildingonmathematicalandstatisticalfoundations.Hum.Hered.70,109–131. doi:10.1159/000312641 Schaid, D. J. (2010b). Genomic similarity and kernel methods ii: methods for genomicinformation.Hum.Hered.70,132–140.doi:10.1159/000312643 Schaid, D. J., Jenkins, G. D., Ingle, J. N., and Weinshilboum, R. M. (2013a). Two-phase designs to follow-up genome-wide association signals with DNA resequencingstudies.Genet.Epidemiol.37,229–238.doi:10.1002/gepi.21708 Schaid, D. J., McDonnell, S. K., Sinnwell, J. P., and Thibodeau, S. N. (2013b). Multiple genetic variant association testing by collapsing and kernel methods withpedigreeorpopulationstructureddata.Genet.Epidemiol.37,409–418.doi: 10.1002/gepi.21727 Schifano, E. D., Epstein, M. P., Bielak, L. F., Jhun, M. A., Kardia, S. L. R., Peyser, P. A.,etal.(2012).SNPsetassociationanalysisforfamilialdata.Genet.Epidemiol. 36,797–810.doi:10.1002/gepi.21676 Schill, W., Jockel, K. H., Drescher, K., and Timm, J. (1993). Logistic analysis in case-control studies under validation sampling. Biometrika 80, 339–352. doi: 10.1093/biomet/80.2.339 Schork,N.J.,Murray,S.S.,Frazer,K.A.,andTopol,E.J.(2009).Commonvs.rare allelehypothesesforcomplexdiseases.Curr.Opin.Genet.Dev.19,212–219.doi: 10.1016/j.gde.2009.04.010 Schwarz, J. M., Rodelsperger, C., Schuelke, M., and Seelow, D. (2010). MutationTasterevaluatesdisease-causingpotentialofsequencealterations.Nat. Methods7,575–576.doi:10.1038/nmeth0810-575 Scott, A. J., Lee, A. J., and Wild, C. J. (2007). On the Breslow-Holubkov estimator. Lifetime Data Anal. 13, 545–563. doi: 10.1007/s10985-007- 9054-0 Scott, A. J., and Wild, C. J. (1989). Likelihood ratio tests in case/control studies. Biometrika76,806–809.doi:10.1093/biomet/76.4.806 Self, S. G., Mauritsen, R. H., and Ohara, J. (1992). Power calculations for like- lihood ratio tests in generalized linear models. Biometrics 48, 31–39. doi: 10.2307/2532736 Sham, P., Bader, J. S., Craig, I., O’Donovan, M., and Owen, M. (2002). DNA Pooling: a tool for large-scale association studies. Nat. Rev. Genet. 3, 862–871. doi:10.1038/nrg930 Shi, G., and Rao, D. C. (2011). Optimum designs for next-generation sequencing to discover rare variants for common complex disease. Genet. Epidemiol. 35, 572–579.doi:10.1002/gepi.20597 Skol, A. D., Scott, L. J., Abecasis, G. R., and Boehnke, M. (2006). Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38, 209–213. doi: 10.1038/ ng1706 Skol, A. D., Scott, L. J., Abecasis, G. R., and Boehnke, M. (2007). Optimal designs for two-stage genome-wide association studies. Genet. Epidemiol. 31, 776–788. doi:10.1002/gepi.20240 Smith, A. M., Heisler, L. E., St Onge, R. P., Farias-Hesson, E., Wallace, I. M., Bodeau, J., et al. (2010). Highly-multiplexed barcode sequencing: an efficient methodforparallelanalysisofpooledsamples.NucleicAcidsRes.38,e142.doi: 10.1093/nar/gkq368 Stenson,P.D.,Mort,M.,Ball,E.V.,Howells,K.,Phillips,A.D.,Thomas,N.S.,etal. (2009). The human gene mutation database: 2008 update. Genome Med. 1, 13. doi:10.1186/gm13 Stovall, M., Smith, S. A., Langholz, B. M., Boice, J. D. Jr., Shore, R. E., Andersson, M.,etal.(2008).Dosetothecontralateralbreastfromradiotherapyandriskof secondprimarybreastcancer intheWECAREstudy. Int. J. Radiat. Oncol. Biol. Phys.72,1021–1030.doi:10.1016/j.ijrobp.2008.02.040 Stram,D.O.,Pearce,C.L.,Bretsky,P.,Freedman,M.,Hirschhorn,J.N.,Altshuler, D., et al. (2003). 
Modeling and E-M estimation of haplotype-specific relative risksfromgenotypedataforacase-controlstudyofunrelatedindividuals.Hum. Hered.55,179–190.doi:10.1159/000073202 Thomas, D. C. (2007). Multistage sampling for latent variable models. Lifetime DataAnal.13,565–581.doi:10.1007/s10985-007-9061-1 Thomas, D. C. (2012). Some surprising twists on the road to discovering the con- tribution of rare variants to complex diseases. Hum. Hered. 74, 113–117. doi: 10.1159/000347020 Thomas,D.C.,Casey,G.,Conti,D.V.,Haile,R.W.,Lewinger,J.P.,andStram,D.O. (2009a). Methodological issues in multistage genome-wide association studies. Stat.Sci.24,414–429.doi:10.1214/09-STS288 Thomas, D. C., Conti, D. V., Baurley, J., Nijhout, F., Reed, M., and Ulrich, C. M. (2009b). Use of pathway information in molecular epidemiology. Hum. Genomics4,21–42.doi:10.1186/1479-7364-4-1-21 Thomas, D., Xie, R., and Gebregziabher, M. (2004). Two-stage sampling designs for gene association studies. Genet. Epidemiol. 27, 401–414. doi: 10.1002/gepi. 20047 Thompson,J.R.,Gögele,M.,Weichenberger,C.X.,Modenese,M.,Attia,J.,Barrett, J.H.,etal.(2013).SNPprioritizationusingabayesianprobabilityofassociation. Genet.Epidemiol.37,214–221.doi:10.1002/gepi.21704 Van Steen, K., McQueen, M. B., Herbert, A., Raby, B., Lyon, H., Demeo, D. L., et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat. Genet. 37, 683–691. doi: 10.1038/ ng1582 Visscher, P. M., and Duffy, D. L. (2006). The value of relatives with phenotypes but missing genotypes in association studies for quantitative traits. Genet. Epidemiol.30,30–36.doi:10.1002/gepi.20124 Wakefield, J. (2007). A Bayesian measure of the probability of false discov- ery in genetic epidemiology studies. Am. J. Hum. Genet. 81, 208–227. doi: 10.1086/519024 Wang, H., Thomas, D. C., Pe’er, I., and Stram, D. O. (2006). Optimal two-stage genotyping designs for genome-wide association scans. Genet. Epidemiol. 30, 356–368.doi:10.1002/gepi.20150 Wang, K., Li, M., and Hakonarson, H. (2010). ANNOVAR: functional annotation ofgeneticvariantsfromhigh-throughputsequencingdata.NucleicAcidsRes.38, e164.doi:10.1093/nar/gkq603 www.frontiersin.org December 2013 | Volume 4 | Article 276 | 19 144 Thomas et al. Designs for sequencing studies Wang, M., Jakobsdottir, J., Smith, A. V., Gudnason, V., and McPeek, M. S. (2013). “Optimal selection of individualsforgenotypingingeneticassociation studies with related individuals (IGES abstract #5),” in International Genetic EpidemiologySociety(Chicago,IL:GeneticEpidemiology). Whittemore, A. S. (2007). A Bayesian false discovery rate for multiple testing. J.Appl.Stat.34,1–9.doi:10.1080/02664760600994745 Whittemore,A.S.,andHalpern,J.(1997).Multi-stagesamplingingeneticepidemi- ology.Stat.Med.16,153–167. Witte,J.S.,Gauderman,W.J.,andThomas,D.C.(1999).Asymptoticbiasandeffi- ciencyincase-controlstudiesofcandidategenesandgene-environmentinterac- tions:basicfamilydesigns.Am.J.Epidemiol.149,693–705.doi:10.1093/oxford- journals.aje.a009877 Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Rare-variant associationtestingforsequencingdatawiththesequencekernelassociationtest. Am.J.Hum.Genet.89,82–93.doi:10.1016/j.ajhg.2011.05.029 Yang, F., and Thomas, D. C. (2011). Two-stage design of sequencing studies for testing association with rare variants. Hum. Hered. 71, 209–220. doi: 10.1159/000328193 Zhao, Y., and Wang, S. (2009). Optimal DNA pooling-based two-stage designs in case-control association studies. Hum. Hered. 67, 46–56. 
Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Received: 28 September 2013; paper pending published: 29 October 2013; accepted: 19 November 2013; published online: 13 December 2013.

Citation: Thomas DC, Yang Z and Yang F (2013) Two-phase and family-based designs for next-generation sequencing studies. Front. Genet. 4:276. doi: 10.3389/fgene.2013.00276

Copyright © 2013 Thomas, Yang and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).

Appendix 8. Publication: Two-Stage Family-Based Designs for Sequencing Studies

Two-stage family-based designs for sequencing studies

Zhao Yang, Duncan C Thomas*
Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089-9234, USA
*Correspondence: dthomas@usc.edu

From Genetic Analysis Workshop 18, Stevenson, WA, USA, 13-17 October 2012

Abstract

The cost of next-generation sequencing is now approaching that of the first generation of genome-wide single-nucleotide genotyping panels, but this is still out of reach for large-scale epidemiologic studies with tens of thousands of subjects. Furthermore, the anticipated yield of millions of rare variants poses serious challenges for distinguishing causal from noncausal variants for disease. We explore the merits of using family-based designs for sequencing substudies to identify novel variants and prioritize them for their likelihood of causality. While the sharing of variants within families means that family-based designs may be less efficient for discovery than sequencing of a comparable number of unrelated individuals, the ability to exploit cosegregation of variants with disease within families helps distinguish causal from noncausal ones. We introduce a score test criterion for prioritizing discovered variants in terms of their likelihood of being functional. We compare the relative statistical efficiency of 2-stage versus 1-stage family-based designs by application to the Genetic Analysis Workshop 18 simulated sequence data.

Background

Most genome-wide association studies for discovering common variants associated with disease traits have been conducted using a case-control design with unrelated controls. Not only are unrelated individuals easier to identify and enroll than are entire families (particularly multiple-case families), but the statistical efficiency per subject genotyped is typically higher using unrelated controls than using unaffected siblings or other relatives [1]. However, with the growing interest in rare variants and the availability of next-generation sequencing technology, there has been a resurgence of interest in using family-based designs [2-5].
Although family-based designs are less efficient for discovering novel variants than designs using unrelated individuals with the same total number of subjects, they may have other advantages that outweigh this limitation. Specifically, by exploiting information about cosegregation with disease within families, they may be more efficient at prioritizing potentially causal variants over noncausal ones for subsequent testing for association with disease in larger samples. The ability to exploit mendelian inheritance may also substantially improve the imputation of rare variants in untested samples [6]. Finally, family-based designs can exploit both between-family and within-family comparisons in various 2-stage designs for better power while remaining robust to bias from population stratification [7-11]. The net result could be improved power for the ultimate goal of discovering novel associations with disease. We focus here on the first of these advantages.

Methods

Study designs

We consider a 2-stage design using individuals from a family study for discovery of novel variants and screening, followed by association testing in an independent data set using linear regression of the phenotype on the genotype. A subset of family members is selected for sequencing, preferentially sampling those with the most extreme phenotypes. We rank all the variants identified in these individuals using a novel score test and select those most likely to be causal for genotyping and association testing in the replication data set. These designs are compared with a single-stage family-based design in which all available members are sequenced [12].

Methods of prioritization

Bayes factor criterion. Following the principles described in Petersen et al [13], the probability that any particular variant is causal under a given genetic model can be estimated by accumulating likelihood ratio contributions across families (the ratio of the likelihood of the data under the alternative hypothesis that a particular variant is causal to the likelihood under the null hypothesis that it is not). The likelihood ratio requires a specific alternative hypothesis to be tested against the null, and estimation of its parameters (the allele frequency and relative risk) is likely to be highly unstable for rare variants. To avoid this, Petersen et al [13] compute a Bayes factor (BF) by averaging over a prior distribution of allele frequencies and relative risks. BFs are computed as the ratio of the conditional probabilities of the joint genotypes of the sampled individuals under the causal model to that under the null; a toy numeric sketch of this averaging idea follows.
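The following minimal Python sketch illustrates only the averaging idea, not the actual joint-genotype BF of Petersen et al: it scores a single variant by its carrier count among a set of affected relatives, treats the population carrier frequency p0 as known, ignores cosegregation structure within the pedigree (which the real BF exploits), and assumes a hypothetical flat prior over a grid of relative risks.

import numpy as np
from scipy import stats

def toy_bayes_factor(n_affected, n_carriers, p0, rr_grid, rr_prior):
    # Null: affected relatives carry the variant at the population
    # carrier frequency p0.
    lik_null = stats.binom.pmf(n_carriers, n_affected, p0)
    # Alternative: carrier probability among affecteds under relative
    # risk r is r*p0 / (r*p0 + 1 - p0), so the baseline risk cancels;
    # average the binomial likelihood over the prior on r.
    p1 = rr_grid * p0 / (rr_grid * p0 + 1.0 - p0)
    lik_alt = np.sum(rr_prior * stats.binom.pmf(n_carriers, n_affected, p1))
    return lik_alt / lik_null

rr_grid = np.linspace(1.0, 20.0, 200)                  # hypothetical RR support
rr_prior = np.full(rr_grid.size, 1.0 / rr_grid.size)   # flat prior over the grid
print(toy_bayes_factor(n_affected=8, n_carriers=5, p0=0.01,
                       rr_grid=rr_grid, rr_prior=rr_prior))

A BF well above 1 indicates that the observed sharing is better explained by a causal effect than by chance carriage at frequency p0.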
Score test criterion. A simpler alternative to the BF calculations for distinguishing causal from noncausal variants was described by Ionita-Laza et al [14]. Because score tests are computed under the null hypothesis, they do not require specification of an alternative hypothesis distribution of minor allele frequencies (MAFs) and relative risks (RRs) for causal alleles. Ionita-Laza et al compute a score contribution for variants shared by each pair of relatives, based on their population frequency and degree of relationship, add these scores over all families, and compare the total to an approximate null mean and variance. Here we explore an extension of this basic idea to incorporate all available phenotype information in a pedigree, including the phenotypes of subjects without sequence data. We compute the score statistic

T_v = \sum_f t_{fv} = \sum_f (Y_f - \mu_f \mathbf{1})' \Phi_f^{-1} (G_{fv} - q_v \mathbf{1}) = \sum_f \sum_{i \in N_f} \sum_{j \in S_f} (Y_{fi} - \mu_f) \, \Phi_{fij}^{-1} \, (G_{fjv} - q_v),

where Y_f is the vector of phenotypes for family f; \mu_f is the family-specific mean phenotype; \Phi_f is the kinship matrix, that is, the matrix of kinship coefficients; G_{fv} is the vector of genotypes for variant v; q_v is its minor allele frequency; and N_f and S_f denote the phenotyped and sequenced members of family f, respectively. The G_{fv} - q_v deviations are set to zero for untyped individuals, but the inclusion of the kinship terms for typed-untyped pairs allows their phenotypes to contribute. This statistic has mean zero under the null hypothesis and asymptotic variance var(T_v) = \sum_f t_{fv}^2. Very similar tests have recently been described by Schifano et al [15] and Chen et al [16]. For the purpose of prioritizing variants, it is sufficient to calculate the score test T_v^2 / var(T_v) for each variant and select the top-ranked ones at some cutoff. In other simulations, we have found this statistic to be highly correlated with the BF, to show nearly as good discrimination between causal and noncausal variants, and to be computationally much faster.
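The following numpy sketch computes this statistic for one variant. The per-family data layout is hypothetical, missing genotypes of unsequenced members are coded as NaN, and each family's sample mean is used for \mu_f, as in the definition above.

import numpy as np

def score_test(families, q_v):
    # families: list of dicts, one per family f, with
    #   'Y'   : (n,) array of phenotypes for all n pedigree members
    #   'Phi' : (n, n) kinship matrix for those members
    #   'G'   : (n,) minor-allele counts; np.nan if unsequenced
    # q_v: minor allele frequency of the variant
    T, var_T = 0.0, 0.0
    for fam in families:
        y_dev = fam['Y'] - fam['Y'].mean()       # Y_f - mu_f
        g_dev = np.nan_to_num(fam['G'] - q_v)    # zero for untyped members
        # t_fv = (Y_f - mu_f 1)' Phi_f^{-1} (G_fv - q_v 1); the kinship terms
        # let phenotyped-but-unsequenced members contribute through their
        # sequenced relatives.
        t_fv = y_dev @ np.linalg.solve(fam['Phi'], g_dev)
        T += t_fv
        var_T += t_fv ** 2
    return T ** 2 / var_T                        # rank variants by this value

Variants would then be ranked by this statistic and the top-ranked ones carried forward to the second stage.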
Application to Genetic Analysis Workshop 18 data

We compared the various design and analysis alternatives on a subset of the simulated Genetic Analysis Workshop 18 (GAW18) data. Based on the "answers" provided, we chose to focus on the MAP4 region of chromosome 3, which contains 15 functional variants having the strongest associations with both diastolic blood pressure (DBP) (6.5% of the total phenotypic variance) and systolic blood pressure (SBP) (7.8%). These 15 variants spanned a broad range of MAFs and effect sizes, individually accounting for anywhere from less than 0.3% up to 2.8% of the variance. We selected all the variants in the region from 200 kilobases (kb) upstream to 100 kb downstream of the transcription start site (1151 variants in total). For comparison, we selected six 300-kb regions at random from those on chromosome 3 that harbored no functional variants for either trait and included all variants in these regions (6195 variants in total). For simplicity, we used the most likely genotypes for the imputed individuals, although the expected allele dosages would have been better. We also limited our initial exploration to 1000 variants (all 15 functional variants and 985 of the null variants).

For each trait, we preprocessed the phenotype data using a general linear mixed model to extract the intercept and slope coefficients for age and their variances for each individual, after adjustment for gender and current hypertension treatment, with random effects for family. These estimates were then treated as the phenotypes in the genetic analysis of each variant individually, using linear regression.

For the single-stage design, we analyzed the associations for all 959 individuals using the quantitative transmission disequilibrium test with mating-type means (QTDT_M) [17] and tabulated the proportion of functional and nonfunctional variants that were associated at 0.05 significance after Bonferroni correction. We also tried the FBAT-rare procedure [18,19], which is similar except that the test is performed at the nuclear family level rather than at the individual offspring level; because of the small number of informative nuclear families, the variance estimator from this test was less stable. Because no residual within-family dependency was simulated, the QTDT_M test is valid, so we present the results only from this test.

For the 2-stage design, we first selected 2, 4, or 6 members of each pedigree for whom sequencing data were available, excluding those in the subset of maximally unrelated individuals and the closely related full sib and parent-offspring pairs. We used a logistic function of the squared rank deviation from the family's median phenotype to select these members at random. We used the sequence data on only these sampled individuals (along with the complete phenotype data) to compute the score test for each variant. In the second stage, we tested the association of the top-ranked variants in the data set of unrelated individuals using linear regression of the phenotype on the genotype, with Bonferroni adjustment for only the prioritized variants. We varied the threshold for prioritization from 0.5% to 16% of the top-ranked variants. Because of the computational burden, we restricted these analyses to phenotype replicates 1 to 5 and analyzed each replicate using 20 random subsets of members' sequence data.
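The two sampling and testing steps might be sketched as follows. The logistic constants a and b are hypothetical (the text does not give them), and the exclusion rules described above are omitted for brevity.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2012)

def select_members(Y, n_select, a=-4.0, b=0.05):
    # Stage 1 sampling: the selection weight is a logistic function of the
    # squared rank deviation from the family's median phenotype, so
    # members with extreme phenotypes are preferentially sequenced.
    ranks = stats.rankdata(Y)
    dev2 = (ranks - np.median(ranks)) ** 2
    w = 1.0 / (1.0 + np.exp(-(a + b * dev2)))
    return rng.choice(len(Y), size=n_select, replace=False, p=w / w.sum())

def replication_stage(Y_unrel, G_unrel, prioritized):
    # Stage 2: linear regression of phenotype on genotype in the set of
    # unrelated individuals, Bonferroni-corrected only for the variants
    # prioritized in stage 1.
    cutoff = 0.05 / len(prioritized)
    return [v for v in prioritized
            if stats.linregress(G_unrel[:, v], Y_unrel).pvalue < cutoff]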
Results

Table 1 summarizes the results for 985 null and 15 functional variants in the MAP4 region for DBP and SBP measurements, using the baseline observation, intercept, and slope parameters as the phenotype. The results for baseline and intercepts were generally similar and somewhat stronger for SBP than for DBP, so subsequent analyses are presented only for SBP intercepts. The mean score statistics showed a clear gradient between negative, null, and positive variants, albeit with substantial overlap between their distributions (SDs about 1.0). Power was very low for both the 1-step and 2-step procedures, and generally the 1-step procedure yielded higher power (13.3% vs. 4.0% when restricted to the top 100 prioritized variants) despite the larger multiple testing penalty. Slope estimates showed opposite effects from intercepts and were generally very weak, as might be expected since no effects on slopes were simulated, only a shift in level. Extending this to all 6195 null variants lowered power for both 1-stage and 2-stage results, as expected, but the mean scores for the null variants were then very close to zero.

Table 1. Mean score tests for the complete pedigrees for protective, null, and deleterious variants, along with the proportion of the top 100 variants prioritized using only the related members and replicated at α = 0.05/100 using only the unrelated members, and the proportion of variants significant at α = 0.05/1000 in a single-stage QTDT_M test.

                          Mean score test             Proportion of variants (%)
                        (simulated effect)      Prioritized    Replicated   1-stage QTDT_M
Phenotype  Parameter     -      Null     +      Null   +/-     Null  +/-     Null   +/-
DBP        Baseline    -1.42   -0.25  +0.42     10.2   14.7     1.8   5.3     1.7    9.3
DBP        Intercept   -1.36   -0.32  +0.24     10.1   17.3     1.4   4.0     3.5    8.0
DBP        Slope       +0.70   +0.14  -0.12     10.2   13.3     1.7   0.0     3.3    4.0
SBP        Baseline    -1.40   -0.35  +0.20     10.2   16.0     0.3   2.7     1.7    9.3
SBP        Intercept   -1.34   -0.31  +0.37     10.1   17.3     0.0   4.0     2.0   13.3
SBP        Slope       +0.85   +0.21  +0.11     10.2   12.0     0.5   0.0     2.3    0.0

The "+" and "-" represent variants with positive and negative associations with the phenotypes, respectively. If higher blood pressure is assumed to confer more risk, "+" corresponds to deleterious variants and "-" to protective variants. In total, 6 deleterious variants, 9 protective variants, and 961 noncausal variants were discovered and tested.

Most of the 15 functional variants in MAP4 were either very rare or had weak effects. Figure 1 compares the 1-stage and 2-stage results, varying the number of individuals whose sequence data were used for prioritization. The 2 variants accounting for the largest variance (2.79% and 1.49%) were significant in the QTDT_M using all the pedigree members in a single stage in 4 of the 5 replicates analyzed (80% power), while 1 variant with only weak effects (0.05% of variance) was significant in 2 of 5 replicates; 2 other variants with relatively large effects (1.43% and 1.11%) were not significant in any of the 5 replicates. The first and second strongest variants were prioritized 36%, 42%, and 55% of the time using 2, 4, and 6 subjects' sequence data, respectively, and the majority of these were replicated. Three other variants (one with the third largest effect and two with very weak effects) were prioritized with relatively large probabilities, but none were ever replicated.

[Figure 1. Comparison of 1-stage and 2-stage designs: the proportion of times each of the 15 functional variants in the MAP4 gene (sorted by % variance explained, in descending order) was prioritized and was significant, using the QTDT_M for the 1-stage procedure with all members and the score test for prioritizing the top 100 variants with 2, 4, or 6 randomly selected related members sequenced. SBP intercepts model only.]

Discussion

With only 15 causal variants (most of them with very small effects) in the subset of variants we analyzed, we cannot reliably compare the power of the various designs, although for these data the 2-stage procedure did not seem to perform better than the 1-stage procedure. This may be partly a result of the small size of the replication sample of unrelated individuals (N = 157). As a test, we expanded the data set by combining the unrelated individuals from 5 replicates for the second step, and the power for the joint test rose substantially (results not shown). Despite the lower power of the 2-stage approach, its cost is much lower because only a small number of individuals need to be fully sequenced and the genotyping required for replication is much cheaper. Because the costs of sequencing and targeted genotyping are rapidly changing and could be quite different for individual- and family-based designs, we have not addressed cost-efficiency; see [12] and [20] for discussion of optimization of sampling fractions for studies of independent individuals using individual and pooled sequencing under cost constraints. We arbitrarily fixed the total number of variants to be prioritized at 100, but the general principles for design of 2-stage designs [21-24] could be applied to optimize the allocation of sample sizes across the 2 stages and the threshold for prioritizing variants, subject to cost constraints.
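As a toy numeric illustration of how the prioritization threshold trades first-stage inclusion against the second-stage multiple-testing penalty, consider the following sketch; every distribution and constant in it is hypothetical and unrelated to the GAW18 data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def two_stage_power(n_variants=1000, n_causal=15,
                    k_grid=(5, 20, 50, 100, 160),
                    mu_causal=1.5, n_rep=2000):
    # Causal variants get prioritization scores ~ N(mu_causal, 1) and
    # nulls ~ N(0, 1); a causal variant is "discovered" if it ranks in
    # the top k AND passes a stage-2 test whose power shrinks with the
    # Bonferroni cutoff 0.05/k.
    stage2_z = 2.5                               # hypothetical stage-2 signal
    for k in k_grid:
        crit = stats.norm.isf(0.05 / k / 2)      # two-sided Bonferroni cutoff
        p2 = stats.norm.sf(crit - stage2_z)      # per-variant stage-2 power
        wins = 0.0
        for _ in range(n_rep):
            scores = rng.normal(0.0, 1.0, n_variants)
            scores[:n_causal] += mu_causal       # first n_causal are causal
            top = np.argsort(scores)[-k:]
            wins += np.intersect1d(top, np.arange(n_causal)).size * p2
        print(f"top {k:4d}: expected causal discoveries = {wins / n_rep:.2f}")

two_stage_power()

Making the threshold k smaller raises the per-variant power in stage 2 but lowers the chance that a causal variant is carried forward at all, which is the tradeoff discussed next.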
As this threshold becomes more restrictive, fewer variants will be selected, lowering power for the first stage; but because the penalty for multiple testing will be smaller, power for the second stage will be improved. A similar tradeoff applies to the allocation of sample size between the 2 stages. As a preliminary exploration, we varied this threshold between 0.5% and 16% of variants, and the overall power increased monotonically with the threshold. Of course, the number of false positives also increased, but at a much lower rate, so the false discovery rate dropped with increasing threshold and number of members sequenced. Still, the false discovery rate is very large, so a further replication stage would be needed to weed out the false positives.

It is also possible that more appropriate adjustment for time-dependent treatment data would improve power for all these analyses (although it is unlikely to affect the relative performance of the 1-stage and 2-stage designs). Because treatment is itself related to blood pressure, it is both a confounder and an intermediate variable on a causal pathway, so neither ignoring it nor covariate adjustment is appropriate. This problem was extensively discussed at Genetic Analysis Workshop 13 [25]; see references [26-29] for discussion of several better approaches. As a rough test of this hypothesis, we reran the analysis using the simulated effect sizes for gender and treatment, and the results (not shown) were essentially unchanged.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ZY participated in statistical analysis. DCT developed the statistical method. Both authors participated in drafting and revising the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

Supported in part by NIH grants U19 CA148107, U01 HG005927, R01 ES019876, R21 ES020794, and P30 CA014089. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. The authors are grateful to John Morrison for programming help with extraction of the selected data from the massive GAW18 data sets. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

Published: 17 June 2014

References
1. Witte JS, Gauderman WJ, Thomas DC: Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol 1999, 149:693-705.
2. Ionita-Laza I, Ottman R: Study designs for identification of rare disease variants in complex diseases: the utility of family-based designs. Genetics 2011, 189:1061-1068.
3. Shi G, Rao DC: Optimum designs for next-generation sequencing to discover rare variants for common complex disease. Genet Epidemiol 2011, 35:572-579.
4. Feng T, Elston RC, Zhu X: Detecting rare and common variants for complex traits: sibpair and odds ratio weighted sum statistics (SPWSS, ORWSS). Genet Epidemiol 2011, 35:398-409.
5. Zhu X, Feng T, Li Y, Lu Q, Elston RC: Detecting rare variants for complex traits using family and unrelated data. Genet Epidemiol 2010, 34:171-187.
6. Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Annu Rev Genomics Hum Genet 2009, 10:387-406.
7. Murphy A, Weiss ST, Lange C: Screening and replication using the same data set: testing strategies for family-based studies in which all probands are affected. PLoS Genet 2008, 4:e1000197.
8. Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, Demeo DL, Murphy A, Su J, Datta S, Rosenow C, et al: Genomic screening and replication using the same data set in family-based association testing. Nat Genet 2005, 37:683-691.
9. Feng T, Zhang S, Sha Q: Two-stage association tests for genome-wide association studies based on family data with arbitrary family structure. Eur J Hum Genet 2007, 15:1169-1175.
10. Lange C, DeMeo D, Silverman EK, Weiss ST, Laird NM: Using the noninformative families in family-based association tests: a powerful new testing strategy. Am J Hum Genet 2003, 73:801-811.
11. Wason JM, Dudbridge F: A general framework for two-stage analysis of genome-wide association studies and its application to case-control studies. Am J Hum Genet 2012, 90:760-773.
12. Yang F, Thomas DC: Two-stage design of sequencing studies for testing association with rare variants. Hum Hered 2011, 71:209-220.
13. Petersen GM, Parmigiani G, Thomas D: Missense mutations in disease genes: a Bayesian approach to evaluate causality. Am J Hum Genet 1998, 62:1516-1524.
14. Ionita-Laza I, Makarov V, Yoon S, Raby B, Buxbaum J, Nicolae DL, Lin X: Finding disease variants in mendelian disorders by using sequence data: methods and applications. Am J Hum Genet 2011, 89:701-712.
15. Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SLR, Peyser PA, Lin X: SNP set association analysis for familial data. Genet Epidemiol 2012, 36:797-810.
16. Chen H, Meigs JB, Dupuis J: Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol 2013, 37:196-204.
17. Gauderman WJ: Candidate gene association analysis for a quantitative trait, using parent-offspring trios. Genet Epidemiol 2003, 25:327-338.
18. Yip W, De G, Raby BA, Laird N: Identifying causal rare variants of disease through family-based analysis of Genetics Analysis Workshop 17 data set. BMC Proc 2011, 5(Suppl 9):S21.
19. De G, Yip W, Ionita-Laza I, Laird N: Rare variant analysis for family-based design. PLoS One 2013, 8:e48495.
20. Liang W, Thomas DC, Conti DV: Analysis and optimal design for association studies using next-generation sequencing with case-control pools. Genet Epidemiol 2012, 36:870-881.
21. Skol AD, Scott LJ, Abecasis GR, Boehnke M: Optimal designs for two-stage genome-wide association studies. Genet Epidemiol 2007, 31:776-788.
22. Wang H, Thomas DC, Pe'er I, Stram DO: Optimal two-stage genotyping designs for genome-wide association scans. Genet Epidemiol 2006, 30:356-368.
23. Thomas DC, Casey G, Conti DV, Haile RW, Lewinger JP, Stram DO: Methodological issues in multistage genome-wide association studies. Stat Sci 2009, 24:414-429.
24. Satagopan JM, Elston RC: Optimal two-stage genotyping in population-based association studies. Genet Epidemiol 2003, 25:149-157.
25. Almasy L, Cupples LA, Daw EW, Levy D, Thomas D, Rice JP, Santangelo S, MacCluer JW: Genetic Analysis Workshop 13: introduction to workshop summaries. Genet Epidemiol 2003, 25(Suppl 1):S1-S4.
26. Levy D, DeStefano AL, Larson MG, O'Donnell CJ, Lifton RP, Gavras H, Cupples LA, Myers RH: Evidence for a gene influencing blood pressure on chromosome 17. Genome scan linkage results for longitudinal blood pressure phenotypes in subjects from the Framingham heart study. Hypertension 2000, 36:477-483.
27. Gauderman WJ, Macgregor S, Briollais L, Scurrah K, Tobin M, Park T, Wang D, Rao S, John S, Bull S: Longitudinal data analysis in pedigree studies. Genet Epidemiol 2003, 25(Suppl 1):S18-S28.
28. Bickeböller H, Barrett JH, Jacobs KB, Rosenberger A: Modeling and dissection of longitudinal blood pressure and hypertension phenotypes in genetic epidemiological studies. Genet Epidemiol 2003, 25(Suppl 1):S72-S77.
29. Cui JS, Hopper JL, Harrap SB: Antihypertensive treatments obscure familial contributions to blood pressure variation. Hypertension 2003, 41:207-210.

doi:10.1186/1753-6561-8-S1-S32
Cite this article as: Yang and Thomas: Two-stage family-based designs for sequencing studies. BMC Proceedings 2014, 8(Suppl 1):S32.
Abstract
Two-step study designs, including both two-stage and two-phase designs, have been widely used in genetic epidemiology. In this dissertation, after reviewing the development of two-step study designs and their application in various settings, I focus on applying the two-step idea to several problems frequently encountered in this area of research.

In the study of complex biological pathways, latent variables are often used to model the underlying mechanism and to analyze the relationships among the factors involved, and biomarkers can serve as measurements of these latent constructs. Because of their high cost, however, it is often infeasible to measure biomarkers on the whole sample. We developed latent variable models to investigate the effects of candidate genes, both their main effects and their interactions with environmental factors, and proposed two-phase sampling designs for case-control studies to increase efficiency. We discussed two approaches to statistical modeling, a retrospective likelihood approach and a prospective likelihood approach. For studies planned either before or after the Phase I sample is ascertained, we computed optimal designs for estimating single or multiple parameters, with or without a fixed budget constraint, and, using simulations, we compared the relative performance of the optimal designs with that of alternative designs.

Technological advances have provided access to various omic data, such as metabolite, expression, and somatic profiles, which can potentially yield new insights into the underlying etiologic mechanisms of disease. These data are often collected on only a subset of a sample and can therefore be viewed as the second step of a study. Analyzing them poses several challenges, including effect heterogeneity, high dimensionality, and incomplete data. We proposed a novel approach for the integrated analysis of germline, omic, and disease data. We used a latent variable to relate information from germline genetic data to either a continuous or a binary disease outcome, and viewed the omic data as a flawed measure of underlying latent clusters, categorized to simplify interpretation. We used an expectation-maximization (EM) algorithm to simultaneously estimate the unobserved latent clusters and the model parameters, including the genetic effects on the latent cluster and the cluster's impact on omic patterns and on the disease outcome. We also incorporated penalized methods for variable selection in high-dimensional settings for both the genetic and the omic data. Using simulations, we demonstrated that our approach accurately estimates the underlying clusters and their corresponding genetic, omic, and disease effects, and that the variable selection identifies relevant genetic and omic factors as both the means and the correlation structures are varied. As an example, we applied our approach to a data set from the Women's Health Initiative.
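To make the alternation at the heart of the EM algorithm concrete, the following minimal Python sketch fits a two-cluster Gaussian mixture to a single continuous omic feature, alternating between computing cluster responsibilities (E-step) and responsibility-weighted parameter updates (M-step). It is a toy analogue, not the dissertation's implementation: the full model additionally links germline genotypes and the disease outcome to the latent cluster, and the function and variable names here are illustrative only.

```python
import numpy as np

def em_two_cluster(x, n_iter=100, seed=0):
    """Toy EM for a two-cluster Gaussian mixture on one continuous
    feature x (1-D array). Illustrative sketch only; the full model
    also ties genotypes and disease status to the latent cluster."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.array([0.5, 0.5])                  # mixing weights
    mu = rng.choice(x, size=2, replace=False)  # initial cluster means
    var = np.full(2, x.var())                  # initial cluster variances
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) that each
        # subject belongs to each cluster, given current parameters.
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of all parameters.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = resp.T @ x / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, resp
```

The responsibilities computed in the E-step play the same role as the posterior cluster-membership probabilities in the integrated germline-omic-disease model, where they also feed the penalized variable-selection step.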
Findings from genome-wide association studies explain only a small proportion of heritability, and research on rare variants has become one direction in the search for this 'missing heritability'. Despite advances in DNA sequencing technology, it remains too expensive to sequence an entire sample in large-scale epidemiological studies. To improve efficiency, we proposed a combined two-phase and two-stage sequencing design to study the association of rare variants with disease. The design sequences only a subset of the Stage I sample, uses the Stage I data to prioritize rare variants for subsequent association testing in the Stage II sample, and adjusts for multiple comparisons only among the prioritized variants. We proposed a score test criterion for prioritizing rare variants in Stage I using pedigree data, and an iterative algorithm that uses existing genome-wide association information to select pedigree members for sequencing. Using simulations, we evaluated the performance of pedigree and case-control samples in Stages I and II, the performance of the iterative algorithm, and other design parameters. With real sequencing data from the Colon Cancer Family Registry, we evaluated various Stage I designs by comparing their results with those obtained using all available sequencing data.
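As a rough illustration of the Stage I screen, the sketch below ranks variants by a simple score-type statistic computed from case-control data and carries only the top-ranked variants forward. It is a deliberately simplified analogue: the dissertation's criterion is a pedigree-based score test, and the variance estimate, function name, and simulated data here are all hypothetical.

```python
import numpy as np

def prioritize_variants(G, y, n_keep=20):
    """Hypothetical Stage I screen: rank rare variants by a crude
    score-type statistic and keep the top n_keep for Stage II testing.
    G: (n_subjects, n_variants) matrix of minor-allele counts (0/1/2).
    y: (n_subjects,) binary disease indicator."""
    r = y - y.mean()                         # centered outcome
    U = G.T @ r                              # per-variant score contribution
    V = (G.astype(float) ** 2).T @ (r ** 2)  # crude empirical variance
    z = U / np.sqrt(np.maximum(V, 1e-12))    # guard against monomorphic variants
    keep = np.argsort(-np.abs(z))[:n_keep]   # indices of the most promising variants
    return keep, z

# Toy usage: screen 1,000 simulated rare variants in 500 subjects,
# then carry only the top-ranked ones to Stage II (purely illustrative).
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.01, size=(500, 1000))
y = rng.binomial(1, 0.5, size=500)
keep, z = prioritize_variants(G, y)
```

Because Stage II then tests only n_keep variants, its multiple-comparison correction divides the significance level by n_keep rather than by the full number of sequenced variants, which is the source of the design's efficiency gain.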
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Bayesian hierarchical models in genetic association studies
Integrative analysis of multi-view data with applications in epidemiology
Two-stage genotyping design and population stratification in case-control association studies
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
Novel statistical and computational methods for analyzing genome variation
Latent unknown clustering with integrated data (LUCID)
Hierarchical approaches for joint analysis of marginal summary statistics
Post-GWAS methods in large scale studies of breast cancer in African Americans
Using multi-level Bayesian hierarchical model to detect related multiple SNPs within multiple genes to disease risk
Characterization and discovery of genetic associations: multiethnic fine-mapping and incorporation of functional information
Prediction and feature selection with regularized regression in integrative genomics
Identification and fine-mapping of genetic susceptibility loci for prostate cancer and statistical methodology for multiethnic fine-mapping
Modeling mutational signatures in cancer
Efficient two-step testing approaches for detecting gene-environment interactions in genome-wide association studies, with an application to the Children’s Health Study
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
The role of genetic ancestry in estimation of the risk of age-related degeneration (AMD) in the Los Angeles Latino population
Statistical analysis of high-throughput genomic data
Bayesian model averaging methods for gene-environment interactions and admixture mapping
High-dimensional regression for gene-environment interactions
Observed and underlying associations in nicotine dependence
Asset Metadata
Creator
Yang, Zhao
(author)
Core Title
Two-step study designs in genetic epidemiology
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
07/24/2018
Defense Date
06/16/2016
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
integrated analysis, OAI-PMH Harvest, optimization, rare variants, study designs
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Thomas, Duncan (committee chair), Conti, David (committee member), Haiman, Christopher (committee member), Marjoram, Paul (committee member), Wang, Kai (committee member)
Creator Email
yang19@usc.edu, zhao.y029@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-281483
Unique identifier
UC11281212
Identifier
etd-YangZhao-4631.pdf (filename), usctheses-c40-281483 (legacy record id)
Legacy Identifier
etd-YangZhao-4631.pdf
Dmrecord
281483
Document Type
Dissertation
Rights
Yang, Zhao
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA