Power and sample size calculations for nested case-control studies
POWER AND SAMPLE SIZE CALCULATIONS FOR NESTED CASE-CONTROL STUDIES
by Wei Ye

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BIOSTATISTICS)

August 2010

Copyright 2010 Wei Ye

Dedication

Dedicated to my family for their love, support and sacrifice.

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Bryan Langholz, for his profound knowledge and wonderful guidance. I am also very thankful for his emotional support and great patience. I have been fortunate to learn from him. I am also greatly indebted to Dr. Jim Gauderman, who has given me plenty of insightful advice and generously shared his valuable experience and computer programs with me. Dr. Jacek Pinski deserves my special thanks for stepping into the role of the outside committee member and for his time and help. I also owe a huge debt of gratitude to Dr. Susan Groshen for her mentoring and support all these years and for serving on my guidance committee. She has been my role model in my career as a statistician. I greatly appreciate the valuable advice from my other committee members, Dr. Stanley Azen and Dr. Larry Goldstein.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Rationale for Nested Case-Control Studies
Chapter 3: A Few Nested Case-Control Sampling Designs
  3.1 Simple Random Sampling
  3.2 Counter-matching
  3.3 Quota Sampling
  3.4 Counter-Matching with Additional Randomly Sampled Controls
Chapter 4: Proportional Hazards Model for the Analysis of Nested Case-Control Designs
  4.1 Notation and Model
  4.2 A Few Nested Case-Control Designs
    4.2.1 Full Cohort
    4.2.2 Simple Random Sampling
    4.2.3 Counter-Matching
    4.2.4 Quota Sampling
    4.2.5 Counter-Matching with Additional Randomly Sampled Controls
Chapter 5: Power and Sample Size Calculation in General
  5.1 Likelihood Based Tests
  5.2 Calculation of Power and Sample Sizes
  5.3 About a Popular Sample Size Formula
Chapter 6: Power and Sample Size Calculations for WECARE Study
  6.1 Introduction
  6.2 WECARE Study
  6.3 Method for Full Simulation
  6.4 Method for Calculation
    6.4.1 Expected Information
    6.4.2 Expected LRT Statistic
  6.5 Calculation of Power and Sample Sizes for WECARE Study
    6.5.1 Main Effect of Radiation as a Binary Variable
    6.5.2 Main Effect of Radiation Dose as a Continuous Variable
    6.5.3 Interaction Between Mutation of a Single Gene and Radiation Dose
    6.5.4 Results
Chapter 7: Power and Sample Size Calculations for Endometrial Cancer Study
  7.1 Study on Endometrial Cancer Risk in Women Diagnosed with Endometrial Hyperplasia
  7.2 Methods for Calculation
    7.2.1 Expected Information and LRT Statistic in CM+SRS Design
    7.2.2 Expected Information and LRT Statistic in QS+Stopping Design
    7.2.3 Method for Calculating Expected Information and LRT Statistic
  7.3 Calculation of Power in Endometrial Cancer Study
    7.3.1 Interaction between EH Type and a Continuous Secondary Exposure in CM+SRS Design
    7.3.2 Interaction between EH Type and a Continuous Secondary Exposure in QS+Stopping Design
    7.3.3 Results
Chapter 8: Summary and Discussion
Bibliography

List of Tables

Table 3.1: 1:1 CM design
Table 3.2: General CM design
Table 6.1: Distribution of the subjects in a risk set
Table 6.2: Asymptotic limits of the proportion of the subjects in a risk set
Table 6.3: Statistical power for main effect of radiation therapy as a binary variable. Pr(RRT=1)=0.40. PP=PN=0.9. Number of risk sets=200.
Table 6.4: Statistical power for main effect of radiation therapy as a binary variable. Pr(RRT=1)=0.10. PP=PN=0.98. Number of risk sets=200.
Table 6.5: Statistical power for main effect of radiation therapy as a continuous variable. Pr(RRT=1)=0.4. PP=PN=0.9. $RT \sim \chi^2(4)$ when RT>0. Number of risk sets=700.
Table 6.6: Statistical power for gene by radiation dose interaction for SRS 1:2. $RT \sim \chi^2(4)$ when RT>0. $\beta_G = 0$. $\beta_{RT} = 0.158$. Pr(RRT=1)=0.4. PP=PN=0.90. Number of risk sets=700.
Table 6.7: Statistical power for gene by radiation dose interaction for CM 1:2. $RT \sim \chi^2(4)$ when RT>0. $\beta_G = 0$. $\beta_{RT} = 0.158$. Pr(RRT=1)=0.4. PP=PN=0.90. Number of risk sets=700.
Table 6.8: Statistical power for gene by radiation dose interaction for CM 2:1. $RT \sim \chi^2(4)$ when RT>0. $\beta_G = 0$. $\beta_{RT} = 0.158$. Pr(RRT=1)=0.4. PP=PN=0.90. Number of risk sets=700.
Table 7.1: Statistical power for interaction between EH type and a normal covariate (A) in CM+SRS 2:2:2 design. $A \sim N(0,1)$ for EH=SH. $A \sim N(0.5,1)$ for EH=CH. $\beta_{EH} = 1.8$. $\beta_A = 0.455$. Number of risk sets=500.
Table 7.2: Statistical power for interaction between EH type and a log-normal covariate (B) in CM+SRS 2:2:2 design. $B \sim$ log-normal(0, 0.25) for EH=SH. $B \sim$ log-normal(0.5, 0.25) for EH=CH. $\beta_{CH} = 1.8$. $\beta_B = 0.4$. Number of risk sets=1000.
Table 7.3: Statistical power for main effect of EH type in QS+stopping design. Number of risk sets=200.
Table 7.4: Statistical power for interaction between EH and a normal covariate (A) in QS+stopping design. $A \sim N(0,1)$ for EH=SH. $A \sim N(0,1)$ for EH=CH. $\beta_{CH} = 1.8$. $\beta_A = 0.455$. Number of risk sets=500.
Table 7.5: Statistical power for interaction between EH and a log-normal covariate (B) in QS+stopping design. $B \sim$ log-normal(0, 0.25) for EH=SH. $B \sim$ log-normal(0.5, 0.25) for EH=CH. $\beta_{CH} = 1.8$. $\beta_B = 0.4$. Number of risk sets=1000.

List of Figures

Figure 6.1: Comparison of simulated power for testing gene-radiation dose interaction between SRS 1:2 and CM 1:2
Figure 6.2: Comparison between null-variance formula and corrected formula for calculating sample sizes in SRS 1:2 with a binary covariate

Abstract

Power and sample size calculations can greatly facilitate the design of nested case-control studies, yet few studies have addressed this issue. In this dissertation we develop unified approaches for determining power and sample sizes for nested case-control studies based on the established asymptotic theory for the Cox proportional hazards model. We assess the accuracy of these methods for practical use by comparing them with full Monte Carlo simulation at realistic sample sizes. The information method and the SMO method we propose are shown to work well in a wide range of practical situations, which include various complex nested case-control study designs (simple random sampling, counter-matching, counter-matching with additional randomly sampled controls, and quota sampling with stopping), various categorical and continuous covariate distributions, and univariate and multivariate models. These methods are flexible enough to handle high dimensional data and complex covariate distributions, which makes it possible to extend them to other nested case-control study designs.

Chapter 1 Introduction

Power and sample size calculations are essential for determining study feasibility and cost. As study questions and study designs become more complex, such calculations require more sophistication than is generally available in standard statistical packages.
Few researchers have addressed power and sample size calculations in nested case-control studies, although the theory for the analysis of these designs has been well established (Goldstein and Langholz, 1992; Borgan, Goldstein, and Langholz, 1995). There is a need for approaches to these calculations that accommodate situations seen in the practice of designing nested case-control studies. In this dissertation, we intend to develop methods for power and sample size calculations that can be applied to various nested case-control designs and a wide range of practical situations based on the established asymptotic theory.

Chapter 2 Rationale for Nested Case-Control Studies

In cohort studies, disease rates across different levels of factors of interest can be reliably compared (Breslow, 1987; Rothman and Greenland, 1998). Exposure data is collected before the disease occurs, which is the appropriate temporal sequence for a cause-effect relationship. Selection and information biases, common in standard case-control studies, can be avoided in cohort studies. Multiple outcomes related to a specific exposure can be studied. However, for rare diseases, large numbers of subjects and/or long follow-up may be required to accumulate enough cases in a cohort study. It is often expensive to collect or process information on all subjects in the cohort.

On the other hand, standard case-control studies are relatively inexpensive compared to cohort studies and can be conducted over relatively short time periods (Breslow, 1980). Because cases and controls are selected separately, they are particularly useful for studying rare diseases. It is possible to study multiple potential causes of disease in standard case-control studies. However, standard case-control studies are prone to biases and do not permit the calculation of incidence rates.
A nested case-control study is a case-control study nested in a cohort study and can retain some of the advantages of both the cohort design and the standard case-control design (Langholz, 2005a; Ernster, 1994). Disease outcome is available for all subjects in the cohort, which helps define risk sets. At any time, the risk set consists of all the subjects who are still at risk. In nested case-control sampling, all (or a portion of) the failures (cases) are kept and only a sample of the nonfailures (controls) is selected from all the nonfailures in the risk set at each failure time. Usually, only a limited amount of information is collected on all the subjects in the cohort, and detailed covariate information is collected only for the subjects selected in the nested case-control study. The variables collected on all subjects in the cohort are usually either "relatively cheap to obtain at enrollment but from which important information can be derived or those that are subject to retrospective biases" (Li, 2004). The information collected on the subjects selected in the nested case-control study can be more detailed exposure information, confounding factors or effect modifiers, other potential exposures, or results of biological tests which are expensive to obtain (Li, 2004; Langholz and Goldstein, 1996). The statistical analysis is based only on the subset of subjects selected in the nested case-control study. Thus, the cost of data collection and/or the computational burden of data analysis is greatly reduced. When the disease is rare, the statistical efficiency lost due to excluding some controls is relatively small because the contribution of a control is negligible compared to that of a case. Nested case-control studies are much more cost-efficient than cohort studies for assessing associations between exposures and disease.
Compared with standard case-control studies, nested case-control studies can reduce temporal ambiguity because data on exposure is more likely to be collected before the occurrence of the disease (Ernster, 1994). For example, in a nested case-control study investigating the association between a biological marker and a disease, serum specimens are usually collected at the beginning of the cohort study, well before the diagnosis of the disease. In a nested case-control study where there is selection or information bias, it is possible to assess the potential magnitude of the bias by comparing the participants and non-participants (Langholz, 2005a). In addition, absolute risks can often be reliably estimated (Langholz, 2005a). Depending on the purpose of the study and the information available in the cohort, researchers can choose from a wide range of design options for a nested case-control study (Borgan et al., 1995; Langholz, 2007; Langholz and Goldstein, 1996).

Chapter 3 A Few Nested Case-Control Sampling Designs

Since the first nested case-control design, the simple random sampling design, was proposed by Thomas et al. (1977), a wide range of novel nested case-control designs and a unified theory for the analysis of data from these designs have been developed (Borgan et al., 1995; Goldstein and Langholz, 1992; Langholz and Borgan, 1995; Langholz and Clayton, 1994; Langholz, 2007; Langholz and Goldstein, 1996) and applied in many studies. The following are a few sampling schemes for selecting controls. In each of the designs, if confounding factors need to be controlled, controls can be sampled within the same stratum as the case, where the stratum is defined by matching factors such as age, gender, race/ethnicity, calendar year, etc.
3.1 Simple Random Sampling

The most commonly used method for selecting controls is simple random sampling (SRS), in which $m-1$ controls are randomly sampled without replacement from the risk set at the failure time for each case (Thomas, 1977; Oaks, 1981). Relative to the full cohort with an infinite number of controls, the efficiency of this design is $(m-1)/m$ for testing the association between disease and a single exposure variable which is not associated with disease (Ury, 1975). This means that if four controls are selected for each case ($m = 5$), the simple random sampling design can reach as much as 80% of the full cohort efficiency.

3.2 Counter-matching

When an exposure variable or an exposure-related variable is known for all cohort members and additional information is to be collected on a sample, a counter-matching (CM) design, which uses the available covariate information in the sampling process, may be more advantageous than an SRS design (Langholz and Clayton, 1994; Langholz and Borgan, 1995). In the simplest scenario, where the counter-matching variable, $C$, has two strata and one control is matched to one case (1:1 CM design), the control is sampled from the $C$ stratum opposite to that of the case (Table 3.1), which motivated the name "counter-matching" (Langholz, 2005b). A case and a control with the same exposure status (a concordant pair) do not contribute any information to the estimation of the risk ratio for the exposure-disease association (Rothman and Greenland, 1998). By counter-matching the controls on $C$, which is either the exposure or an exposure-related covariate, the number of discordant case-control pairs is greatly increased. It has been shown analytically and empirically that the CM design increases statistical efficiency compared to the SRS design when the counter-matching variable is sufficiently correlated with the exposure of interest (Langholz, 2005b; Steenland and Dedden, 1997). If it is not, the efficiency can be less than that of the SRS design.
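The two control-selection schemes described so far can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names, the representation of a risk set as a list of subject identifiers, and the `stratum` lookup for the binary counter-matching variable are all hypothetical and are not taken from the dissertation.

```python
import random

def sample_srs(case, risk_set, m, rng=random):
    """Simple random sampling: keep the case and draw m - 1 controls
    without replacement from the other members of the risk set."""
    controls = [s for s in risk_set if s != case]
    return [case] + rng.sample(controls, m - 1)

def sample_cm_1to1(case, risk_set, stratum, rng=random):
    """1:1 counter-matching on a binary variable C (hypothetical `stratum`
    dict): the single control comes from the stratum opposite to the case's."""
    opposite = [s for s in risk_set
                if s != case and stratum[s] != stratum[case]]
    return [case, rng.choice(opposite)]

def srs_relative_efficiency(m):
    """Efficiency (m - 1) / m of SRS relative to the full cohort,
    for a sampled risk set of size m (one case plus m - 1 controls)."""
    return (m - 1) / m
```

With four controls per case the sampled risk set has size $m = 5$, so `srs_relative_efficiency(5)` reproduces the 80% figure quoted in Section 3.1. The sketch ignores matching strata and time-varying risk sets, which a real sampling program would need to handle.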
In a general CM design, if the counter-matching variable has $L$ strata and the case is from stratum $j$, then $m_l$ controls are sampled from each stratum $l \in \{1, 2, \ldots, L\}$, except for stratum $j$, from which $m_j - 1$ controls are sampled, with $\sum_l m_l = m$.

Table 3.1: 1:1 CM design

             Case   Control
  C = 0       1       0
  C = 1       0       1

Table 3.2: General CM design

  Sampling stratum   1       2       ...   j           ...   L       Total
  Case               0       0       ...   1           ...   0       1
  Controls           $m_1$   $m_2$   ...   $m_j - 1$   ...   $m_L$   $\sum_l m_l - 1$

Compared to other alternatives, the generalization in Table 3.2 permits unbiased estimation of the exposure-disease association. CM designs are very useful in the following situations (Langholz and Goldstein, 1996):

(i). A crude exposure surrogate is available on all subjects in the cohort. The effect of the more detailed exposure variable needs to be assessed (e.g. Bernstein, 2004).
(ii). The exposure variable is available on everyone. The goal is to investigate the interaction between the exposure and another covariate (e.g. Bernstein, 2004; Tokunaga et al., 1994).
(iii). The exposure variable is available on everyone. The goal is to investigate the effect of the exposure variable after controlling for some confounding factors.

3.3 Quota Sampling

Suppose that the risk set is partitioned into targeted and non-targeted subjects. In quota sampling, controls are sampled until a pre-specified number of subjects, say $m_1$, in the targeted group have been obtained (Langholz, 2006; Borgan et al., 1995). Thus, the sampled risk sets will have the same number of targeted subjects and vary in the number of non-targeted subjects. If the case is in the targeted group, he or she is counted as one of the $m_1$. The targeted group can be the exposed (when the exposure is dichotomized) or the rarest or the most important level of the exposure (if the exposure has multiple levels).
When the exposure is rare, quota sampling ensures that enough exposed subjects are collected in the sample and almost always produces a sampled risk set that is discordant in terms of exposure and disease status.

3.4 Counter-Matching with Additional Randomly Sampled Controls

Suppose that the primary exposure is available on everyone in the cohort, that information on a secondary exposure is collected on the sample, and that the goal is to investigate both exposures. SRS is not efficient for estimating the main effect of the primary exposure. If a CM design based on the primary exposure is used instead, the main effect of the secondary exposure is estimated less efficiently than under the SRS design. A compromise is to first counter-match on the primary exposure and then pick additional controls by simple random sampling, which should have at least the efficiency of the CM design for the primary exposure and at least the efficiency of the SRS design for the secondary exposure (Langholz and Goldstein, 1996).

Chapter 4 Proportional Hazards Model for the Analysis of Nested Case-Control Designs

The following is a brief summary of the theory developed by Borgan et al. (1995) and Goldstein and Langholz (1992) based on the Cox proportional hazards model. For detailed proofs, please see Borgan et al. (1995) and Goldstein and Langholz (1992).

4.1 Notation and Model

In the Cox proportional hazards model for cohort data, let $N_i(t)$, $Y_i(t)$ and $Z_i(t)$ be the counting, censoring and covariate processes for the $i$th subject, respectively, with time $t \in [0, \tau]$. $N_i(t)$ counts the number of failures for subject $i$ up to time $t$. The filtration $\mathcal{H}_t$ is the cohort history, including failure time, censoring and covariate information up to time $t$. $Y_i(t)$ and $Z_i(t)$ are left-continuous and adapted. The intensity process for $N_i(t)$ is assumed to be

$$\lambda_i(t) = Y_i(t)\,\lambda_0(t)\exp(\beta_1^T Z_i(t)),$$

where $\lambda_0(t)$ is a nonnegative baseline intensity and $\beta_1$ is a $p$-dimensional vector of regression parameters at their true values.
In counting process notation, the partial likelihood used for estimation of the regression parameters for cohort data is

$$L(\beta) = \prod_{s \in [0,\tau]} \prod_{i \in R(s)} \left\{ \frac{Y_i(s)\exp(\beta^T Z_i(s))}{\sum_{l \in R(s)} Y_l(s)\exp(\beta^T Z_l(s))} \right\}^{\Delta N_i(s)}. \quad (4.1)$$

The contribution of each risk set to the partial likelihood is the conditional probability that the $i$th subject is the failure given that there is a failure in the risk set $R(t)$.

In the extension to nested case-control data, the sampling of the controls is superimposed onto this model, which leads to the counting process $N_{(i,r)}(t)$, the number of times in $[0,t]$ that subject $i$ fails and $r$ is the chosen sampled risk set. $\mathcal{H}_t$ is then augmented to $\mathcal{F}_t$, which includes both the cohort history and the sampling information up to time $t$. The intensity process $\lambda_{(i,r)}(t)$ for $N_{(i,r)}(t)$ is given by

$$\lambda_{(i,r)}(t) = \lambda_i(t)\,\pi_t(r \mid i) = Y_i(t)\,\lambda_0(t)\exp(\beta_1^T Z_i(t))\,\pi_t(r \mid i),$$

where $\pi_t(r \mid i)$ is the conditional probability that $r$ is chosen as the sampled risk set given that $i$ is the failure at time $t$. By Bayes' theorem, the conditional probability that subject $i$ fails given that $r$ is the sampled risk set is

$$\pi_t(i \mid r) = \frac{\lambda_{(i,r)}(t)}{\lambda_r(t)} = \frac{Y_i(t)\exp(\beta_1^T Z_i(t))\,\pi_t(r \mid i)}{\sum_{l \in r} Y_l(t)\exp(\beta_1^T Z_l(t))\,\pi_t(r \mid l)}.$$

The partial likelihood for inference on the regression parameters in a nested case-control study is

$$L(\beta) = \prod_{s \in [0,\tau]} \prod_{r \in \mathcal{P}} \prod_{i \in r} \left\{ \frac{Y_i(s)\exp(\beta^T Z_i(s))\,\pi_s(r \mid i)}{\sum_{l \in r} Y_l(s)\exp(\beta^T Z_l(s))\,\pi_s(r \mid l)} \right\}^{\Delta N_{(i,r)}(s)}, \quad (4.2)$$

where $\mathcal{P}$ is the power set of $\{1, 2, \ldots, n\}$, with $n$ being the number of subjects in the cohort. The estimator of $\beta$, $\hat{\beta}$, is obtained by maximizing (4.2).
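To make the structure of (4.2) concrete, the following Python sketch evaluates the log partial likelihood for a scalar, time-fixed covariate. It is a minimal illustration rather than the dissertation's software: the function name and the data layout (each sampled risk set as a list of `(z, pi, is_case)` tuples, with every listed member at risk) are assumptions made here. The value `pi` stands for $\pi_t(r \mid i)$ and only needs to be known up to a factor common to the members of the same sampled risk set, since such a factor cancels between numerator and denominator.

```python
import math

def log_partial_likelihood(beta, sampled_risk_sets):
    """Log of (4.2) for a scalar covariate.

    sampled_risk_sets: list of sampled risk sets, one per failure time.
    Each risk set is a list of (z, pi, is_case) tuples, where z is the
    covariate value, pi is proportional to pi_t(r | i), and is_case marks
    the subject who failed.
    """
    total = 0.0
    for risk_set in sampled_risk_sets:
        # Weighted relative risks of all members of the sampled risk set.
        terms = [pi * math.exp(beta * z) for z, pi, _ in risk_set]
        denom = sum(terms)
        for term, (_, _, is_case) in zip(terms, risk_set):
            if is_case:
                total += math.log(term / denom)
    return total
```

For a single sampled risk set with one case and two controls that all carry equal `pi`, the function returns $\log(1/3)$ at `beta = 0`, which is the conditional probability that any particular member is the failure when the covariate has no effect.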
After canceling out the factors common to all the $\pi_s(r \mid i)$, $i \in r$, and writing $w_i(s,r)$ for what remains, (4.2) becomes

$$L(\beta) = \prod_{s \in [0,\tau]} \prod_{r \in \mathcal{P}} \prod_{i \in r} \left\{ \frac{Y_i(s)\exp(\beta^T Z_i(s))\,w_i(s,r)}{\sum_{l \in r} Y_l(s)\exp(\beta^T Z_l(s))\,w_l(s,r)} \right\}^{\Delta N_{(i,r)}(s)}. \quad (4.3)$$

The vector of score functions and the observed information matrix are

$$U(\beta) = \frac{\partial}{\partial \beta}\log L(\beta) = \int_0^\tau \sum_{r \in \mathcal{P}} \sum_{i \in r} \{Z_i(s) - E_r(\beta,s)\}\,dN_{(i,r)}(s)$$

and

$$I(\beta) = -\frac{\partial^2}{\partial \beta^2}\log L(\beta) = \int_0^\tau \sum_{r \in \mathcal{P}} V_r(\beta,s)\,dN_r(s),$$

respectively, where $N_r(s) = \sum_{i \in r} N_{(i,r)}(s)$,

$$E_r(\beta,t) = \frac{S_r^{(1)}(\beta,t)}{S_r^{(0)}(\beta,t)}, \quad (4.4)$$

$$V_r(\beta,t) = \frac{S_r^{(2)}(\beta,t)}{S_r^{(0)}(\beta,t)} - E_r(\beta,t)^{\otimes 2},$$

with

$$S_r^{(\gamma)}(\beta,t) = \sum_{i \in r} Y_i(t)\,Z_i(t)^{\otimes \gamma}\exp(\beta^T Z_i(t))\,\pi_t(r \mid i), \quad \gamma = 0, 1, 2. \quad (4.5)$$

By standard counting process theory, (4.2) has basic likelihood properties under the necessary regularity conditions, meaning that when evaluated at the true parameter, $\beta_1$, its score vector has expectation zero and the covariance matrix of the score equals the expected information matrix. It was proven that, under conditions 1-4 in Borgan et al. (1995), $\hat{\beta}$ is consistent for $\beta_1$ and $\sqrt{n}(\hat{\beta} - \beta_1) \xrightarrow{D} N(0, \Sigma^{-1})$ as $n \to \infty$, where $\Sigma$ may be estimated consistently by $n^{-1} I(\hat{\beta})$. For $(\gamma, \eta) = (0, 2)$ and $(\gamma, \eta) = (2, 0)$, let

$$Q^{(\gamma,\eta)}(\beta_1,t) = \frac{1}{n}\sum_{r \in \mathcal{P}} E_r(\beta_1,t)^{\otimes \gamma}\,S_r^{(\eta)}(\beta_1,t) \xrightarrow{p} q^{(\gamma,\eta)}(\beta_1,t)$$

as $n \to \infty$. Then the asymptotic expected information is

$$\Sigma = \int_0^\tau \left[q^{(0,2)}(\beta_1,t) - q^{(2,0)}(\beta_1,t)\right]\lambda_0(t)\,dt. \quad (4.6)$$

The asymptotic expected information formulas, which are essential for calculating power and sample sizes, are given in the next section for some nested case-control designs under the independent and identically distributed (iid) condition.

4.2 A Few Nested Case-Control Designs

In this section, it is assumed that the at-risk indicators, covariate processes and classification variables $(Y_i(t), Z_i(t), C_i(t))$, $i = 1, 2, \ldots, n$, are independent copies of $(Y(t), Z(t), C(t))$. $C_i(t)$ indicates the counter-matching stratum for subject $i$ in the CM design and whether subject $i$ is targeted or non-targeted in the quota sampling design. In what follows, the expectations are taken over the distribution of $Z(t)$ and $C(t)$ conditioned on $Y(t) = 1$.
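The quantities $E_r$ and $V_r$ in (4.4) and (4.5), and the way they enter the score and the observed information, can be illustrated with a short Python sketch for a scalar, time-fixed covariate. The function names and the data layout are hypothetical, and the Newton-Raphson loop is a generic way of maximizing a concave log partial likelihood, not a description of the software used in the dissertation.

```python
import math

def risk_set_moments(beta, risk_set):
    """Return (E_r, V_r) for one sampled risk set.

    risk_set: list of (z, w) pairs, where z is the covariate and w is the
    weight (proportional to pi_t(r | i)) of each at-risk member.
    """
    s0 = sum(w * math.exp(beta * z) for z, w in risk_set)
    s1 = sum(w * z * math.exp(beta * z) for z, w in risk_set)
    s2 = sum(w * z * z * math.exp(beta * z) for z, w in risk_set)
    e = s1 / s0
    return e, s2 / s0 - e * e

def score_and_information(beta, data):
    """Score U(beta) and observed information I(beta).

    data: list of (z_case, risk_set) pairs, one per failure time, where
    risk_set includes the case itself.
    """
    score, info = 0.0, 0.0
    for z_case, risk_set in data:
        e, v = risk_set_moments(beta, risk_set)
        score += z_case - e
        info += v
    return score, info

def newton_mle(data, beta=0.0, steps=25):
    """Newton-Raphson iterations beta <- beta + U(beta) / I(beta).
    Assumes the information is positive, i.e. at least one sampled risk
    set has members with different covariate values."""
    for _ in range(steps):
        score, info = score_and_information(beta, data)
        beta += score / info
    return beta
```

As a check on the sketch, consider 1:1 matched sets with a binary covariate and unit weights: three sets in which only the case is exposed and one set in which only the control is exposed. The maximizer is then $\log 3$, the familiar ratio of discordant pairs.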
4.2.1 Full Cohort

The full cohort is a special case of nested case-control designs, in which the entire risk set is sampled with probability 1. Then $\pi_t(r \mid i) = 1$ for all $i \in R(t)$, and the partial likelihood for nested case-control designs in (4.2) becomes the usual partial likelihood for cohort data in (4.1).

4.2.2 Simple Random Sampling

For simple random sampling,

$$\pi_t(r \mid i) = \binom{n(t)-1}{m-1}^{-1} I(i \in r,\ r \subset R(t),\ |r| = m).$$

It is the same for each subject in the sampled risk set and therefore drops out of (4.2). Thus, the partial likelihood of the simple random sampling design takes the same form as that of the full cohort.

$$Q^{(\gamma,\eta)}(\beta_1,t) = \frac{1}{n}\sum_{r \subset R(t),\,|r|=m} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\pi_t(r \mid i)$$

$$= \frac{n(t)}{n}\binom{n(t)}{m}^{-1}\sum_{r \subset R(t),\,|r|=m} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{1}{m}$$

and

$$Q^{(\gamma,\eta)}(\beta_1,t) \xrightarrow{p} q^{(\gamma,\eta)}(\beta_1,t) = p(t)\,E\left[E_u(\beta_1,t)^{\otimes \gamma}\sum_{i \in u} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{1}{m}\right]$$

under the iid condition as $n \to \infty$, where $u = \{1, \ldots, m\}$ and $\frac{n(t)}{n} \xrightarrow{p} p(t)$ as $n \to \infty$. The asymptotic expected information for the simple random sampling design is

$$\Sigma = E\left\{\int_0^\tau p(t)\,\frac{1}{m}\sum_{i \in u}\left[Z_i(t) - E_u(\beta_1,t)\right]^{\otimes 2}\exp(\beta_1^T Z_i(t))\,\lambda_0(t)\,dt\right\}.$$

4.2.3 Counter-Matching

Let $C_i(t)$ indicate the counter-matching stratum for subject $i$, with $C_i(t) \in G$ and $G = \{1, 2, \ldots, L\}$. Let $n_l(t)$ be the number of subjects in the $l$th counter-matching stratum, $R_l(t)$. In each risk set, $m_l$ subjects are sampled from $R_l(t)$, except for the case's stratum, $R_{C_i(t)}(t)$, in which $m_{C_i(t)} - 1$ subjects are sampled from the $n_{C_i(t)} - 1$ controls. Therefore,

$$\pi_t(r \mid i) = \left[\prod_{l \in G}\binom{n_l(t)}{m_l}\right]^{-1}\frac{n_{C_i(t)}}{m_{C_i(t)}}\,I(i \in r,\ r \subset R(t),\ |r \cap R_l(t)| = m_l,\ l \in G).$$

Thus, in the partial likelihood for the counter-matching design, the weight for a subject from a given stratum is the inverse of the proportion of the stratum sampled, i.e.

$$w_i(t,r) = \frac{n_{C_i(t)}}{m_{C_i(t)}}.$$

Let $\mathcal{P}_m(t) = \{r : r \subset R(t),\ |r \cap R_l(t)| = m_l,\ l \in G\}$.
$$Q^{(\gamma,\eta)}(\beta_1,t) = \frac{1}{n}\sum_{r \in \mathcal{P}_m(t)} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\pi_t(r \mid i)$$

$$= \frac{1}{n}\left[\prod_{l=1}^{L}\binom{n_l(t)}{m_l}\right]^{-1}\sum_{r \in \mathcal{P}_m(t)} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{n_{C_i(t)}}{m_{C_i(t)}}$$

and

$$Q^{(\gamma,\eta)}(\beta_1,t) \xrightarrow{p} q^{(\gamma,\eta)}(\beta_1,t) = p(t)\,E\left[\tilde{E}_u(\beta_1,t)^{\otimes \gamma}\sum_{i \in u} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{p_{C_i(t)}}{m_{C_i(t)}}\right]$$

under the iid condition as $n \to \infty$, where $u = \{1, \ldots, \sum_{l=1}^{L} m_l\}$ and $C_j = 1$ for $j \in \{1, \ldots, m_1\}$, $C_j = 2$ for $j \in \{m_1 + 1, \ldots, m_1 + m_2\}$, $\ldots$, and $C_j = L$ for $j \in \{\sum_{l=1}^{L-1} m_l + 1, \ldots, \sum_{l=1}^{L} m_l\}$. $\tilde{E}_u(\beta_1,t)$ is the same as $E_r(\beta_1,t)$ in (4.4) except that $\pi_t(r \mid i)$ is replaced by $w_i(t,r) = p_{C_i(t)}/m_{C_i(t)}$. Here $\frac{n(t)}{n} \xrightarrow{p} p(t)$ and $\frac{n_{C_i(t)}}{n(t)} \xrightarrow{p} p_{C_i(t)}$ as $n \to \infty$. The asymptotic expected information for the counter-matching design is

$$\Sigma = E\left\{\int_0^\tau p(t)\sum_{i \in u}\left[Z_i(t) - \tilde{E}_u(\beta_1,t)\right]^{\otimes 2}\exp(\beta_1^T Z_i(t))\,\frac{p_{C_i(t)}}{m_{C_i(t)}}\,\lambda_0(t)\,dt\right\}.$$

4.2.4 Quota Sampling

The size of the sampled risk set required to obtain $m_1$ targeted subjects, $m(t)$, given the target status of the case, follows a negative hypergeometric distribution: $\Pr[|\tilde{R}(t)| = m(t) \mid C_i(t)]$ is

$$\frac{\binom{m(t)-2}{m_1-2}\binom{n(t)-m(t)}{n_1(t)-m_1}}{\binom{n(t)-1}{n_1(t)-1}}\,I(|r \cap R_1(t)| = m_1) \quad \text{if } C_i(t) = 1$$

and

$$\frac{\binom{m(t)-2}{m_1-1}\binom{n(t)-m(t)}{n_1(t)-m_1}}{\binom{n(t)-1}{n_1(t)}}\,I(|r \cap R_1(t)| = m_1) \quad \text{if } C_i(t) = 0,$$

where $C_i(t)$ indicates the target status of the case. Selecting a particular set given the size of the set, $m(t) = m_0(t) + m_1$, is equivalent to counter-matching which selects $m_0(t)$ subjects in the non-targeted group and $m_1$ subjects in the targeted group:

$$\Pr\left[\tilde{R}(t) = r \mid |\tilde{R}(t)| = m(t),\ C_i(t)\right] = \left[\prod_{l=0}^{1}\binom{n_l(t)}{m_l}\right]^{-1}\frac{n_{C_i(t)}}{m_{C_i(t)}}.$$

The probability of picking a particular set $r$ given that $i$ is the case is the product of these two probabilities:

$$\pi_t(r \mid i) = \frac{1}{\binom{n_0(t)}{m_0(t)}\binom{n_1(t)}{m_1}}\cdot\frac{\binom{m(t)-1}{m_1-1}\binom{n(t)-m(t)}{n_1(t)-m_1}}{\binom{n(t)}{n_1(t)}}\cdot\frac{n(t)}{m(t)-1}\cdot\frac{m_1 - C_i(t)}{m_1}\,I(i \in r,\ r \subset R(t),\ |r \cap R_1(t)| = m_1).$$

Thus, the weight in the partial likelihood is 1 if the subject is non-targeted and $(m_1 - 1)/m_1$ if the subject is targeted. Let $\mathcal{P}_m(t) = \{r : r \subset R(t),\ |r \cap R_1(t)| = m_1,\ |r| = m\}$.
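$$Q^{(\gamma,\eta)}(\beta_1,t) = \frac{1}{n}\sum_{m(t)=m_1}^{\infty}\ \sum_{r \in \mathcal{P}_m(t)} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\pi_t(r \mid i)$$

$$= \frac{1}{n}\sum_{m(t)=m_1}^{\infty}\left[\frac{\binom{m(t)-1}{m_1-1}\binom{n(t)-m(t)}{n_1(t)-m_1}}{\binom{n(t)}{n_1(t)}}\cdot\frac{n(t)}{m(t)-1}\right]\left[\frac{1}{\binom{n_0(t)}{m_0(t)}\binom{n_1(t)}{m_1}}\sum_{r \in \mathcal{P}_m(t)} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{m_1 - C_i(t)}{m_1}\right].$$

Let $\frac{n_0(t)}{n(t)} \xrightarrow{p} p_0(t)$ and $\frac{n_1(t)}{n(t)} \xrightarrow{p} p_1(t)$ as $n \to \infty$. Since

$$\frac{\binom{m(t)-1}{m_1-1}\binom{n(t)-m(t)}{n_1(t)-m_1}}{\binom{n(t)}{n_1(t)}} \xrightarrow{p} \binom{m(t)-1}{m_1-1}\,p_0(t)^{m_0(t)}\,p_1(t)^{m_1}$$

and

$$\frac{1}{\binom{n_0(t)}{m_0(t)}\binom{n_1(t)}{m_1}}\sum_{r \in \mathcal{P}_m(t)} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{m_1 - C_i(t)}{m_1} \xrightarrow{p} E\left[E_u(\beta_1,t)^{\otimes \gamma}\sum_{i \in u} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{m_1 - C_i(t)}{m_1}\right]$$

as $n \to \infty$ under the iid condition, where $u = \{1, \ldots, m_0(t) + m_1\}$, $C_j = 0$ for $j \in \{1, \ldots, m_0(t)\}$ and $C_j = 1$ for $j \in \{m_0(t) + 1, \ldots, m_0(t) + m_1\}$, it follows that

$$Q^{(\gamma,\eta)}(\beta_1,t) \xrightarrow{p} q^{(\gamma,\eta)}(\beta_1,t) = p(t)\sum_{m(t)=m_1}^{\infty}\frac{1}{m(t)-1}\binom{m(t)-1}{m_1-1}\,p_0(t)^{m_0(t)}\,p_1(t)^{m_1}\,E\left[E_u(\beta_1,t)^{\otimes \gamma}\sum_{i \in u} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{m_1 - C_i(t)}{m_1}\right]$$

as $n \to \infty$ under the iid condition. The asymptotic expected information for the quota sampling design is

$$\Sigma = E\left\{\int_0^\tau p(t)\sum_{m(t)=m_1}^{\infty}\frac{1}{m(t)-1}\binom{m(t)-1}{m_1-1}\,p_0(t)^{m_0(t)}\,p_1(t)^{m_1}\sum_{i \in u}\left[Z_i(t) - E_u(\beta_1,t)\right]^{\otimes 2}\exp(\beta_1^T Z_i(t))\,\frac{m_1 - C_i(t)}{m_1}\,\lambda_0(t)\,dt\right\}.$$

4.2.5 Counter-Matching with Additional Randomly Sampled Controls

Suppose that $m$ subjects are counter-matched, with $m_l$ from stratum $l$, and then $\tilde{m}$ subjects are randomly sampled, with $\tilde{m}_l$ subjects that happen to be picked from stratum $l$. Then

$$\pi_t(r \mid i) = \left[\frac{\prod_l\binom{n_l(t)-m_l}{\tilde{m}_l}}{\binom{n(t)-m}{\tilde{m}}}\right]\left[\prod_l\binom{n_l(t)}{m_l+\tilde{m}_l}\right]^{-1}\frac{n_{C_i(t)}}{m_{C_i(t)}+\tilde{m}_{C_i(t)}}\,I(i \in r,\ r \subset R(t),\ |r \cap R_l(t)| = m_l+\tilde{m}_l,\ l \in G).$$

The weight in (4.2) is

$$w_i(t,r) = \frac{n_{C_i(t)}}{m_{C_i(t)}+\tilde{m}_{C_i(t)}}.$$

Let $\nu \in \{(\nu_1, \ldots, \nu_L) : m_l \le \nu_l = m_l + \tilde{m}_l,\ \sum_{l=1}^{L}\nu_l = m + \tilde{m}\}$ represent the compositions of the sampled risk set. $\{\nu_1, \ldots, \nu_L\}$ is fixed within each composition and varies from composition to composition. Let $\mathcal{P}_\nu(t) = \{r : r \subset R(t),\ |r \cap R_l(t)| = \nu_l,\ l \in G\}$, which includes all subsets of the risk set with $\nu_l$ subjects in stratum $l$.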
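$$Q^{(\gamma,\eta)}(\beta_1,t) = \frac{1}{n}\sum_{\nu}\ \sum_{r \in \mathcal{P}_\nu(t)} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\pi_t(r \mid i)$$

$$= \frac{1}{n}\sum_{\nu}\left[\frac{\prod_l\binom{n_l(t)-m_l}{\tilde{m}_l}}{\binom{n(t)-m}{\tilde{m}}}\right]\left[\prod_l\binom{n_l(t)}{m_l+\tilde{m}_l}^{-1}\sum_{r \in \mathcal{P}_\nu(t)} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{n_{C_i(t)}}{m_{C_i(t)}+\tilde{m}_{C_i(t)}}\right].$$

Since

$$\frac{\prod_l\binom{n_l(t)-m_l}{\tilde{m}_l}}{\binom{n(t)-m}{\tilde{m}}} \xrightarrow{p} \binom{\tilde{m}}{\tilde{m}_1, \ldots, \tilde{m}_L}\prod_{l \in G} p_l(t)^{\tilde{m}_l}$$

and

$$\frac{1}{n}\left[\prod_l\binom{n_l(t)}{m_l+\tilde{m}_l}^{-1}\sum_{r \in \mathcal{P}_\nu(t)} E_r(\beta_1,t)^{\otimes \gamma}\sum_{i \in r} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{n_{C_i(t)}}{m_{C_i(t)}+\tilde{m}_{C_i(t)}}\right] \xrightarrow{p} p(t)\,E\left[\tilde{E}_u(\beta_1,t)^{\otimes \gamma}\sum_{i \in u} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{p_{C_i(t)}}{m_{C_i(t)}+\tilde{m}_{C_i(t)}}\right]$$

as $n \to \infty$ under the iid condition, where $u = \{1, \ldots, m + \tilde{m}\}$ and $C_j = 1$ for $j \in \{1, \ldots, m_1 + \tilde{m}_1\}$, $C_j = 2$ for $j \in \{m_1 + \tilde{m}_1 + 1, \ldots, m_1 + \tilde{m}_1 + m_2 + \tilde{m}_2\}$, $\ldots$, and $C_j = L$ for $j \in \{\sum_{l=1}^{L-1}(m_l + \tilde{m}_l) + 1, \ldots, \sum_{l=1}^{L}(m_l + \tilde{m}_l)\}$, and where $\tilde{E}_u(\beta_1,t)$ is the same as $E_r(\beta_1,t)$ in (4.4) except that $\pi_t(r \mid i)$ is replaced by $p_{C_i(t)}/(m_{C_i(t)}+\tilde{m}_{C_i(t)})$, it follows that

$$Q^{(\gamma,\eta)}(\beta_1,t) \xrightarrow{p} q^{(\gamma,\eta)}(\beta_1,t) = p(t)\sum_{\nu}\binom{\tilde{m}}{\tilde{m}_1, \ldots, \tilde{m}_L}\prod_{l \in G} p_l(t)^{\tilde{m}_l}\,E\left[\tilde{E}_u(\beta_1,t)^{\otimes \gamma}\sum_{i \in u} Z_i(t)^{\otimes \eta}\exp(\beta_1^T Z_i(t))\,\frac{p_{C_i(t)}}{m_{C_i(t)}+\tilde{m}_{C_i(t)}}\right]$$

as $n \to \infty$. The asymptotic expected information for counter-matching with additional simple random sampling is

$$\Sigma = E\left\{\int_0^\tau p(t)\sum_{\nu}\binom{\tilde{m}}{\tilde{m}_1, \ldots, \tilde{m}_L}\prod_{l \in G} p_l(t)^{\tilde{m}_l}\sum_{i \in u}\left[Z_i(t) - \tilde{E}_u(\beta_1,t)\right]^{\otimes 2}\exp(\beta_1^T Z_i(t))\,\frac{p_{C_i(t)}}{m_{C_i(t)}+\tilde{m}_{C_i(t)}}\,\lambda_0(t)\,dt\right\}.$$

Chapter 5 Power and Sample Size Calculation in General

Two common approaches for obtaining power and sample sizes are full simulation and calculation based on the asymptotic distribution of the test statistic (Brown et al., 1999; Schoenfeld and Borenstein, 2005). In a full simulation, a large number of hypothetical data sets are generated based on the specified model and parameters under the alternative hypothesis. The statistical test is applied to each data set. The power is estimated by the percentage of data sets in which the null hypothesis is rejected. Simulation can be used to accurately obtain power given a sample size if the number of data sets generated is large.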
However, it is usually slow and takes more time than the calculation method, even with modern computers. In addition, sample sizes cannot be determined directly from power by simulation; at least a few iterations are needed to compute a sample size from a target power. Furthermore, in complex situations, specialized knowledge is required to write a program for a full simulation.

Calculation of power and sample sizes is usually based on the asymptotic distributions of the likelihood based test statistics. There are three commonly used likelihood based tests: the likelihood ratio test (LRT), the score test (ST), and the Wald test (WT), which are asymptotically equivalent to first order under $H_0$ and under local alternative hypotheses. The generality of likelihood theory makes power and sample size calculations based on these tests applicable to a wide range of models. Both power and sample sizes can be calculated quickly using the calculation method, which allows a rapid comparison of different designs.

5.1 Likelihood Based Tests

This section follows Cox and Hinkley (1974). Let $Z = (Z_1, \ldots, Z_n)$ be a vector of independent, identically distributed random variables with a density $f(Z; \theta)$, where $\theta$ is a vector of unknown parameters. The log-likelihood at $\theta$ is

$$L(Z; \theta) = \sum_{i=1}^{n}\log f(Z_i; \theta).$$

$\theta$ can be partitioned into $(\beta, \psi)$, where $\beta$ and $\psi$ are $p \times 1$ and $q \times 1$ vectors, respectively. The hypothesis to be tested is $H_0: \beta = \beta_0$ versus $H_1: \beta \neq \beta_0$, with $\psi$ being the nuisance parameters. Let $U$, $I$ and $\mathcal{I}$ be the observed score, observed information and expected information, respectively. The Wald test statistic is given by

$$WT = (\hat{\beta} - \beta_0)^T\left[I^{\beta\beta}(\hat{\theta})\right]^{-1}(\hat{\beta} - \beta_0),$$

where $I^{\beta\beta}(\hat{\theta})$ is the upper left $p \times p$ sub-matrix of the inverse of the observed information evaluated at $\hat{\theta}$. The score test statistic is given by

$$ST = U_\beta(\hat{\theta}_0)^T\,I^{\beta\beta}(\hat{\theta}_0)\,U_\beta(\hat{\theta}_0),$$

where $U_\beta$ is the score for $\beta$ and $\hat{\theta}_0$ denotes the maximum likelihood estimate of $\theta$ with $\beta$ held fixed at $\beta_0$.
The likelihood ratio test statistic is given by
\[ LRT = 2[L(\hat\theta) - L(\hat\theta_0)]. \]
Following standard likelihood theory, under regularity conditions, WT, ST and LRT are all asymptotically distributed according to a central $\chi^2$ distribution with $p$ degrees of freedom under $H_0$, where $p$ is the dimension of $\beta$. For any fixed significance level $a$, the power of the tests approaches 1 as $n \to \infty$ for any fixed alternative. For local alternatives of the form $\beta = \beta_0 + \delta/\sqrt{n}$, where $\delta$ is a constant, the tests have non-trivial power properties, which makes the asymptotic power calculation meaningful. Under local alternative hypotheses, the test statistics have a limiting non-central chi-square distribution with $p$ degrees of freedom. The expected information matrix $\mathcal{I}$, evaluated at $(\beta_1, \gamma_1)$ under $H_1: \beta = \beta_1$, is partitioned in correspondence to $\theta = (\beta, \gamma)$ into
\[ \begin{pmatrix} \mathcal{I}_{\beta\beta} & \mathcal{I}_{\beta\gamma} \\ \mathcal{I}_{\gamma\beta} & \mathcal{I}_{\gamma\gamma} \end{pmatrix}. \]
The noncentrality parameter is
\[ \lambda = [\beta_1 - \beta_0]^T \, [\mathcal{I}_{\beta\beta} - \mathcal{I}_{\beta\gamma} \mathcal{I}_{\gamma\gamma}^{-1} \mathcal{I}_{\gamma\beta}] \, [\beta_1 - \beta_0], \tag{5.1} \]
which is actually the expected WT statistic under $H_1$.

5.2 Calculation of Power and Sample Sizes

When power is calculated given a sample size, it usually involves the following steps (Brown, 1999; Greenland, 1985):

(i). Specify the model, the values of the parameters under $H_1$ and the probability distributions of the covariates.

(ii). Find the critical value, $c$, based on the specified significance level, $a$, from the central $\chi^2$ distribution under $H_0$, so that $c = \chi^2_{1-a}(p)$, where $\chi^2_{1-a}(p)$ is the $(1-a)$th quantile of the central chi-square distribution with $p$ degrees of freedom.

(iii). Calculate the noncentrality parameter under $H_1$, $\lambda$, using the information specified in step (i) and the sample size.

(iv). Power under $H_1: \beta = \beta_1$ is the probability that a random variable which follows a non-central $\chi^2$ distribution with $p$ degrees of freedom and noncentrality parameter $\lambda$ exceeds $c$.

Alternatively, the sample size can be calculated based on a specified power.
After step (ii) described above, find the noncentrality parameter, $\lambda$, such that the critical value $c$ from step (ii) satisfies $c = \chi^2_{1-b}(p, \lambda)$, where $b$ denotes power and $\chi^2_{1-b}(p, \lambda)$ is the $(1-b)$th quantile of the noncentral chi-square distribution with $p$ degrees of freedom and noncentrality parameter $\lambda$. Then, calculate the contribution of each observation to the noncentrality parameter, $\lambda_s$. Under the iid condition, $\lambda = N\lambda_s$, since log-likelihood values, expectations and derivatives all add across observations. The sample size is therefore calculated using $N = \lambda / \lambda_s$.

Earlier research on power and sample size calculations was mostly for specialized models, such as chi-square type tests for contingency tables (Guenther, 1977; Greenland, 1985) and tests on proportions (Fleiss, 1981). Much of the research on sample size calculations for case-control studies was based on tests on proportions, assuming a single dichotomized exposure (Breslow and Day, 1980; Schlesselman, 1982; Parker and Bregman, 1986; Dupont, 1988; Novikoff, 2005). In recent years, more work has been done on power and sample size calculations using likelihood-based tests, which can be applied in more general settings. The key step is the computation of the noncentrality parameter. It has been shown that (5.1) can be used to reliably estimate power and sample sizes in many studies (Brown, 1999; Demidenko, 2007a; Demidenko, 2007b; Schoenfeld and Borenstein, 2005). In simple situations such as logistic regression with a binary covariate (Demidenko, 2007a), a closed-form formula is available for the expected information. In complex situations numerical integration may be needed to compute the expected information. To avoid complex computation, some researchers used approximate expressions for the expected information under restrictive assumptions, which limited the use of their methods. For example, Whittemore (1981) assumed small response probabilities in calculating power and sample sizes for logistic regression models.
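The power and sample size steps of Section 5.2 can be sketched in a few lines of code. The following is a minimal illustration using only the Python standard library, not the implementation used in this work: the function names are hypothetical, the noncentral chi-square distribution is evaluated as a Poisson mixture of central chi-square distributions, and quantiles are found by bisection.

```python
import math


def chi2_cdf(x, df):
    """Central chi-square CDF via the power series of the incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= h / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(h) - h - math.lgamma(a))


def ncx2_cdf(x, df, nc, max_terms=2000):
    """Noncentral chi-square CDF as a Poisson(nc / 2) mixture of central CDFs."""
    half = nc / 2.0
    weight = math.exp(-half)
    total = 0.0
    for j in range(max_terms):
        total += weight * chi2_cdf(x, df + 2 * j)
        weight *= half / (j + 1)
        if j > half and weight < 1e-16:
            break
    return total


def chi2_quantile(prob, df):
    """Step (ii): critical value by bisection on the central CDF."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_from_nc(nc, df=1, alpha=0.05):
    """Step (iv): probability that a noncentral chi-square exceeds the critical value."""
    crit = chi2_quantile(1.0 - alpha, df)
    return 1.0 - ncx2_cdf(crit, df, nc)


def nc_for_power(power, df=1, alpha=0.05):
    """Noncentrality parameter that yields the requested power (bisection)."""
    crit = chi2_quantile(1.0 - alpha, df)
    lo, hi = 0.0, 1.0
    while 1.0 - ncx2_cdf(crit, df, hi) < power:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 1.0 - ncx2_cdf(crit, df, mid) < power:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_size(power, nc_per_obs, df=1, alpha=0.05):
    """N = (required noncentrality) / (contribution of one observation)."""
    return math.ceil(nc_for_power(power, df, alpha) / nc_per_obs)
```

For example, `power_from_nc(N * nc_per_obs)` gives the power for a sample of size `N`, and `sample_size(0.80, nc_per_obs)` inverts the relation; `nc_per_obs` is the per-observation contribution, which has to be supplied by the information method or the expected LRT statistic.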
When several continuous covariates are involved, Schoenfeld and Borenstein (2005) simplified the calculation of the expected information for logistic regression and Cox regression models by converting multivariate integration to univariate integration. In the literature, using (5.1) to calculate power and sample sizes is called the information method (INFO).

Self et al. (1992) proposed computing the noncentrality parameter for the LRT statistic for generalized linear models (GLM) by equating the expected value of a noncentral chi-square random variable to an approximation of the expected value of the LRT statistic:
\[ 2E[L(\hat\beta, \hat\gamma) - L(\beta_0, \hat\gamma_0)] = p + \lambda = 2E[L(\hat\beta, \hat\gamma) - L(\beta_1, \gamma_1)] - 2E[L(\beta_0, \hat\gamma_0) - L(\beta_0, \gamma_0)] + 2E[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)], \tag{5.2} \]
where $\gamma_0$ is the solution of $\lim_{n \to \infty} n^{-1} E[U(\beta_0, \gamma)] = 0$. They showed that the first term in (5.2), $2E[L(\hat\beta, \hat\gamma) - L(\beta_1, \gamma_1)]$, is approximately $p + q$ and that the second term in (5.2) is usually very close to $q$, where $p$ and $q$ are the dimensions of $\beta$ and $\gamma$, respectively. Therefore, the noncentrality parameter is
\[ \lambda \approx 2E[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)]. \]
In the literature, this approach is called the SMO method, after the initials of the authors of Self et al. (1992). Gauderman (2002a and 2002b) developed a computer program that calculates power and sample sizes for gene-environment and gene-gene interactions using the conditional logistic regression model or the log-linear model. He used numerical methods to compute $2E[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)]$. O'Brien and Shieh (1998) proposed using exemplary data to calculate $2E[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)]$ for GLMs. The idea is to first create an expected data set specified under $H_1$ and then analyze it exactly as we would analyze the actual data using regular statistical software. The resulting LRT statistic is used as an approximation of the noncentrality parameter. Suppose there are $J$ possible unique configurations of the covariates, denoted by $x_1, x_2, \ldots, x_J$, and $K$ possible values of the response, $y_1, y_2, \ldots, y_K$.
A matrix with $KJ$ records is created:
\[ \begin{pmatrix} y_1 & x_1 & w_{11} & count_{11} \\ \vdots & \vdots & \vdots & \vdots \\ y_K & x_1 & w_{K1} & count_{K1} \\ \vdots & \vdots & \vdots & \vdots \\ y_1 & x_J & w_{1J} & count_{1J} \\ \vdots & \vdots & \vdots & \vdots \\ y_K & x_J & w_{KJ} & count_{KJ} \end{pmatrix} \]
For prospective data, $w_{kj} = \Pr(y_k, x_j) = \Pr(Y = y_k \mid X = x_j)\,\pi(X = x_j)$ and $count_{kj} = N w_{kj}$, with $N$ being the total number of observations in the expected data and $\pi(X = x_j)$ being the relative proportion of configuration $x_j$ in the design. $\Pr(Y = y_k \mid X = x_j)$ is defined by the model used (O'Brien and Shieh, 1998; O'Brien, 2002). For retrospective studies such as case-control studies,
\[ w_{kj} = \Pr(X = x_j \mid Y = y_k)\,\pi(Y = y_k) = \frac{\Pr(Y = y_k \mid X = x_j)\Pr(X = x_j)}{\sum_{i=1}^{J} \Pr(Y = y_k \mid X = x_i)\Pr(X = x_i)}\,\pi(Y = y_k), \]
where $\pi(Y = y_k)$ is the sampling proportion for $y_k$ and $\Pr(X = x_j)$ is the relative frequency of configuration $x_j$ in the population (Longmate, 2001; Gauderman 2002a and 2002b). The accuracy of this method increases with the number of observations in the expected data set. This method cannot be applied to continuous covariates.

The exemplary data approach developed by Lyles et al. (2007) seems to be an improvement over O'Brien's. In their approach, $w_{kj} = \Pr(Y = y_k \mid X = x_j)$. The desired GLM is fit using $w_{kj}$ as the weight in the "Weight" statement, which is available in many procedures in standard statistical software, such as Proc Logistic and Proc Genmod in SAS. Continuous covariates are handled by representing their distributions using the expected quantiles
\[ x_j = F^{-1}\!\left(\frac{j - 0.375}{N + 0.25}\right), \tag{5.3} \]
where $F$ is the cumulative distribution function of the continuous variable. Using $w_{kj} = \Pr(Y = y_k \mid X = x_j)$ as the weight makes it possible to calculate the noncentrality parameter with fewer observations in the expected data set than required by O'Brien's method while maintaining the accuracy. The exemplary data approaches avoid complex mathematical computation and can be applied in a unified way.
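As an illustration of the exemplary data idea with a continuous covariate, the sketch below builds the expected quantiles of (5.3) and the weighted records $(y, x, w)$ with $w = \Pr(Y = y \mid X = x)$. It is a hedged example only: the standard normal covariate, the logistic model and its coefficients (`-1.0`, `0.5`) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def expected_quantiles(inv_cdf, n_points):
    """Expected quantiles x_j = F^{-1}((j - 0.375) / (N + 0.25)), j = 1..N, as in (5.3)."""
    return [inv_cdf((j - 0.375) / (n_points + 0.25)) for j in range(1, n_points + 1)]


def exemplary_records(x_values, prob_y1):
    """One record per (response value, covariate value), weighted by Pr(Y = y | X = x)."""
    records = []
    for x in x_values:
        p = prob_y1(x)
        records.append((1, x, p))        # y = 1 with weight Pr(Y = 1 | x)
        records.append((0, x, 1.0 - p))  # y = 0 with weight Pr(Y = 0 | x)
    return records


def logistic_prob(x, intercept=-1.0, slope=0.5):
    """Assumed model under H1 (illustrative coefficients, not taken from the text)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Assumed covariate distribution: standard normal, represented by 20 quantiles.
quantiles = expected_quantiles(NormalDist().inv_cdf, 20)
records = exemplary_records(quantiles, logistic_prob)
```

The resulting records would then be passed, with the third column as the weight, to whatever weighted model-fitting routine is used for the analysis.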
We will explore this idea in power and sample size calculations for nested case-control studies.

Other researchers calculated power and sample sizes based on the score test (Self and Mauritsen, 1988; Lachin, 2008; Sinha and Mukerjee, 2006; Schoenfeld, 1983). Although the Wald test, the score test and the likelihood ratio test are asymptotically equivalent under local alternatives, they may behave differently when the alternatives are far from the null. For example, for certain parameterizations in logistic regression or log-linear models, the power of the Wald test decreases as the alternative value moves away from the null (Hauck and Donner, 1977; Vaeth, 1985). Some researchers (Shieh, 2005; Demidenko, 2007a) suggested that the test used for power and sample size calculations should be the same as the one that will be used in the analysis.

The power and sample size calculations described above are based on asymptotic distributions and only approximate the actual power. The methods for computing the noncentrality parameters are often approximate as well. It is important to evaluate the accuracy of the calculation methods by comparing the calculated power with the actual power, which can be estimated by full simulation, when the sample sizes are moderate or small.

5.3 About a Popular Sample Size Formula

For calculating the sample size in a univariate model using the Wald test, the following formula is often provided:
\[ N = \frac{[Z_{1-a/2}\,\mathcal{I}_s(\beta_0)^{-1/2} + Z_b\,\mathcal{I}_s(\beta_1)^{-1/2}]^2}{(\beta_1 - \beta_0)^2} \tag{5.4} \]
for a two-sided test at significance level $a$ and with power $b$ (Breslow and Day, 1987; Whittemore, 1981; Wilson and Gordon, 1986; Bull, 1993), where $\mathcal{I}_s$ is the expected information for a single observation. $Z_{1-a/2}$ and $Z_b$ are the $(1-a/2)$th and $b$th quantiles of the standard normal distribution, respectively.
Without loss of generality, it is assumed that the true parameter value under $H_1$, $\beta_1$, is greater than the null value $\beta_0$ in the discussion that follows. For a two-sided test, the power is $\Pr(|Z| > Z_{1-a/2})$ under the alternative distribution. (5.4) is obtained by assuming that $\Pr(Z < -Z_{1-a/2})$ is negligible when the power is large enough. Demidenko (2007a) pointed out that (5.4) is actually a mistake that has been overlooked and showed why. According to Demidenko (2007a), it would have been correct if the Z test statistic were
\[ Z = \frac{\sqrt{n}(\hat\beta - \beta_0)}{\sqrt{\mathcal{I}_s(\beta_0)^{-1}}}. \tag{5.5} \]
Under $H_1: \beta = \beta_1$, $Z$ in (5.5) has the limiting distribution $N\!\left(\frac{\sqrt{n}(\beta_1 - \beta_0)}{\sqrt{\mathcal{I}_s(\beta_0)^{-1}}},\ \frac{\mathcal{I}_s(\beta_0)}{\mathcal{I}_s(\beta_1)}\right)$. Since power $= \Pr(Z > Z_{1-a/2})$,
\[ \frac{Z_{1-a/2} - \sqrt{n}(\beta_1 - \beta_0)/\sqrt{\mathcal{I}_s(\beta_0)^{-1}}}{\sqrt{\mathcal{I}_s(\beta_0)/\mathcal{I}_s(\beta_1)}} = -Z_b, \]
which leads to (5.4). However, in commonly used modern statistical software, such as SAS and S-plus,
\[ Z = \frac{\sqrt{n}(\hat\beta - \beta_0)}{\sqrt{\mathcal{I}_s(\hat\beta)^{-1}}} \tag{5.6} \]
is used as the test statistic. Under $H_1: \beta = \beta_1$, (5.6) has the limiting distribution $N\!\left(\frac{\sqrt{n}(\beta_1 - \beta_0)}{\sqrt{\mathcal{I}_s(\beta_1)^{-1}}},\ 1\right)$. Then,
\[ Z_{1-a/2} - \frac{\sqrt{n}(\beta_1 - \beta_0)}{\sqrt{\mathcal{I}_s(\beta_1)^{-1}}} = -Z_b, \]
which leads to (5.7):
\[ N = \frac{(Z_{1-a/2} + Z_b)^2}{(\beta_1 - \beta_0)^2}\,\mathcal{I}_s(\beta_1)^{-1}. \tag{5.7} \]
(5.7) is equivalent to what would have been obtained using the information method described in Section 5.2. In a multivariate model, when $\theta = (\beta, \gamma)$ with $\beta$ being tested and $\gamma$ as the nuisance parameters, the sample size formula is
\[ N = \frac{(Z_{1-a/2} + Z_b)^2}{(\beta_1 - \beta_0)^2}\,\mathcal{I}_s^{\beta\beta}(\beta_1, \gamma_1), \tag{5.8} \]
where $\mathcal{I}_s^{\beta\beta}(\beta_1, \gamma_1)$ is the upper left $p \times p$ submatrix of the inverse of the expected information matrix evaluated at $(\beta_1, \gamma_1)$ (Demidenko, 2007b). However, Shieh (2001 and 2005) argued that
\[ N = \frac{[Z_{1-a/2}\,\mathcal{I}_s^{\beta\beta}(\beta_0, \gamma_0)^{1/2} + Z_b\,\mathcal{I}_s^{\beta\beta}(\beta_1, \gamma_1)^{1/2}]^2}{(\beta_1 - \beta_0)^2} \tag{5.9} \]
should be used instead of (5.8). We could not identify the theoretical basis of Shieh's method, even though his simulations showed that (5.9) improved the accuracy of power and sample size calculations over (5.8) in his examples.
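To see how much (5.4) and (5.7) can differ in practice, the two formulas can be evaluated side by side. The sketch below is only an illustration: the function names are hypothetical and the per-observation information values passed in are assumed inputs that would have to come from the model at hand.

```python
import math
from statistics import NormalDist


def n_formula_5_4(beta0, beta1, info0, info1, alpha=0.05, power=0.80):
    """Formula (5.4): uses the single-observation information at both beta0 and beta1."""
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_b = NormalDist().inv_cdf(power)
    numerator = (z_a / math.sqrt(info0) + z_b / math.sqrt(info1)) ** 2
    return numerator / (beta1 - beta0) ** 2


def n_formula_5_7(beta0, beta1, info1, alpha=0.05, power=0.80):
    """Formula (5.7): uses only the single-observation information at beta1."""
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_b = NormalDist().inv_cdf(power)
    return (z_a + z_b) ** 2 / ((beta1 - beta0) ** 2 * info1)
```

When the information is the same under the null and the alternative, the two formulas coincide; they diverge as the two information values move apart, which is the situation the discussion above is concerned with.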
Based on Demidenko's arguments, (5.9) is probably not only unnecessary but also incorrect if (5.6) is used as the test statistic. When (5.6) instead of (5.5) is the test statistic, the power or the sample size should have nothing to do with the variance of the test statistic under $H_0$.

Chapter 6 Power and Sample Size Calculations for WECARE Study

6.1 Introduction

Lachin (2008) described a method for calculating power and sample sizes in the conditional logistic regression model based on the distribution of the score test, which can be applied to matched case-control studies and to the simple random sampling design of nested case-control studies. However, he used $\mathcal{I}(\beta_0)$ as an approximation of $\mathcal{I}(\beta_1)$, which is a poor approximation unless $\beta_1$ is very close to $\beta_0$. His calculation of $\mathcal{I}(\beta_0)$ for a continuous covariate is also oversimplified, as it does not specify the distribution of the covariate. The method can only be applied to univariate models. As far as we know, no other researchers have discussed power and sample size calculations for nested case-control studies. In this chapter and the next, we use the unified asymptotic theory described in Chapter 4 to develop methods of power and sample size calculation that can be applied to various complex nested case-control designs.

Our goals are to develop methods of calculating power and sample sizes based on the Wald test and the likelihood ratio test, compare the ease of implementing these methods, and assess their performance by comparing them with full simulation under different scenarios and for various designs.

The scenarios include:
a. Univariate categorical covariate (balanced or skewed).
b. Univariate continuous covariate (symmetrical or skewed).
c. Interaction between two covariates.

The designs include:
a. Simple random sampling.
b. Counter-matching.
c. Quota sampling.
d. Counter-matching with additional randomly sampled controls.
6.2 WECARE Study

We use the WECARE study (Women's Environmental Cancer and Radiation Epidemiology; Bernstein et al., 2004) as an example to illustrate the results of the proposed methods for SRS and CM designs. Some of the questions considered are real questions in the study and others are hypothetical.

The WECARE cohort consists of 34,000 women who were diagnosed with a first primary breast cancer before age 55 between 1985 and 2000 and registered in five population-based cancer registries. Breast cancer is often treated with radiation therapy in addition to surgical treatment and chemotherapy. Mutations in breast cancer susceptibility genes such as ATM, BRCA1 and BRCA2 can increase the risk of developing breast cancer. The hypothesis of this study is that gene mutation carriers are more likely than noncarriers to develop a second breast cancer after radiation therapy for the first breast cancer. Cases are the women with asynchronous bilateral breast cancer. Controls had unilateral breast cancer and were individually matched with cases on date and age at diagnosis of the first breast cancer, race and registry region. The time scale that defines the risk sets is years since the first breast cancer.

In addition to the matching variables and the information needed to define the risk sets, a covariate that indicates whether a patient received radiation treatment for the first primary breast cancer is available for all subjects in the cohort. Though not always correct, the radiation treatment status recorded in the registry (RRT) is highly correlated with the actual radiation treatment status (RT). A nested case-control design is proposed because it is costly to determine the genotypes of the breast cancer susceptibility genes and to quantify the radiation dose received by the contralateral breast. Budget constraints dictate that only two controls per case can be used for this study.
The choices for the nested case-control study are SRS 1:2; CM 1:2, using RRT as the counter-matching variable with one subject in the stratum with no RRT and two subjects in the stratum with RRT; or CM 2:1, with two subjects in the stratum with no RRT and one subject in the stratum with RRT. Calculating power or sample sizes based on the Wald test or the likelihood ratio test allows quick comparisons between these designs and helps in choosing among them.

6.3 Method for Full Simulation

In order to evaluate the accuracy of the proposed calculation methods, the results are compared to those from full simulation. Full simulation is performed using Langholz's method (Langholz, 2007), in which a fixed number of independent risk sets are generated to form the cohort. Each risk set consists of 100 subjects. Risk set subjects are generated with the covariate values needed for sampling and for analysis, based on the specified distributions of these covariates. In each risk set $R$, a single case is randomly generated, with subject $i$ selected with probability $\exp(Z_i^T\beta_1) / \sum_{j \in R} \exp(Z_j^T\beta_1)$. The nested case-control study consists of the sampled case-control sets selected using the sampling design of interest. The simulation is repeated for 5000 trials. The nested case-control data are analyzed in each trial. The simulated power is taken as the percentage of trials in which the null hypothesis is rejected.

6.4 Method for Calculation

As in Section 4.1, it is assumed that the at-risk indicators, covariate processes and classification variables $(Y_i(t), Z_i(t), C_i(t))$, $i = 1, 2, \ldots, n$, for all subjects in a risk set are independent copies of $(Y(t), Z(t), C(t))$. The sample size relevant in a nested case-control study is the number of sampled risk sets. In calculating power and sample sizes, it is essential to compute the contribution of a sampled risk set to the expected information (for the information method) or the expected LRT statistic (for the SMO method).
For simplicity, the covariate distributions, the asymptotic expected information and the asymptotic expected LRT statistic are assumed to be identical for all risk sets. The noncentrality parameter is then $\lambda = N\lambda_s$, where $N$ is the number of sampled risk sets and $\lambda_s$ is the expected contribution from each sampled risk set. Power and sample sizes can be calculated as discussed in Section 5.2.

6.4.1 Expected Information

The asymptotic expected information for each design given in Section 4.1 is the total information for the nested case-control study. The asymptotic expected information per risk set is needed to calculate power and sample sizes using the information method. Let $\mathcal{I}_{\tilde R} = E[I_{\tilde R}]$ be the asymptotic expected information for a sampled risk set, where $\tilde R$ is the random sampled risk set and $I_{\tilde R}$ is the observed information for $\tilde R$. Then,
\[ \mathcal{I}_{\tilde R} = E[I_{\tilde R}] = \sum_{r \subset R} I_r \Pr(\tilde R = r) = \sum_{r \subset R} I_r \sum_{i \in r} \Pr(D = i)\,\pi(r \mid i). \]
For a full risk set,
\[ \mathcal{I}_R = \lim_{n \to \infty} I_R = \lim_{n \to \infty} \left[\frac{S_R^{(2)}}{S_R^{(0)}} - \left(\frac{S_R^{(1)}}{S_R^{(0)}}\right)^2\right] = \lim_{n \to \infty} \left[\frac{S_R^{(2)}/n}{S_R^{(0)}/n} - \left(\frac{S_R^{(1)}/n}{S_R^{(0)}/n}\right)^2\right] = \frac{E[Z^2 \exp(Z^T\beta_1)]}{E[\exp(Z^T\beta_1)]} - \left(\frac{E[Z \exp(Z^T\beta_1)]}{E[\exp(Z^T\beta_1)]}\right)^2 \tag{6.1} \]
by the Strong Law of Large Numbers (SLLN).

For $1 : m-1$ SRS, the asymptotic expected information per risk set is
\[ \mathcal{I}_{\tilde R} = \sum_{r \subset R} I_r \Pr(\tilde R = r) = \sum_{r \subset R} I_r \sum_{i \in r} \Pr(D = i)\,\pi(r \mid i) = \sum_{r \subset R} I_r \sum_{i \in r} \frac{\exp(Z_i^T\beta_1)}{\sum_{j \in R} \exp(Z_j^T\beta_1)} \binom{n-1}{m-1}^{-1} I(|r| = m) \]
\[ = \frac{1}{\frac{1}{n}\sum_{j \in R} \exp(Z_j^T\beta_1)}\,\frac{1}{m} \binom{n}{m}^{-1} \sum_{r \subset R} I_r\, S_r^{(0)}\, I(|r| = m) \ \xrightarrow{p}\ \frac{E[I_u(\beta_1)\, S_u^{(0)}(\beta_1)]}{m\,E[\exp(Z^T\beta_1)]} \tag{6.2} \]
as $n \to \infty$ by SLLN and Lemma 2 in Borgan et al. (1995), where $u = \{1, \ldots, m\}$. Derivations of (6.1) and (6.2) are taken from Bryan Langholz's personal communication.
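For the CM design, the asymptotic expected information per risk set is
\[ \mathcal{I}_{\tilde R} = \sum_{r \subset R} I_r \Pr(\tilde R = r) = \sum_{r \subset R} I_r \sum_{i \in r} \Pr(D = i)\,\pi(r \mid i) = \sum_{r \subset R} I_r \sum_{i \in r} \frac{\exp(Z_i^T\beta_1)}{\sum_{j \in R} \exp(Z_j^T\beta_1)} \Big[\prod_{l \in G} \binom{n_l(t)}{m_l}\Big]^{-1} \frac{n_{C_i}(t)}{m_{C_i}(t)}\, I(|r \cap R_l(t)| = m_l) \]
\[ = \frac{1}{\frac{1}{n}\sum_{j \in R} \exp(Z_j^T\beta_1)} \Big[\prod_{l \in G} \binom{n_l(t)}{m_l}\Big]^{-1} \sum_{r \subset R} I_r \sum_{i \in r} \exp(Z_i^T\beta_1)\, \frac{n_{C_i}(t)\, n^{-1}}{m_{C_i}(t)}\, I(|r \cap R_l(t)| = m_l) \ \xrightarrow{p}\ \frac{E[I_u(\beta_1)\, \tilde S_u^{(0)}(\beta_1)]}{E[\exp(Z^T\beta_1)]} \tag{6.3} \]
as $n \to \infty$ by SLLN and Lemma 2 in Borgan et al. (1995), where $\tilde S_u^{(0)}(\beta_1) = \sum_{i \in u} \exp(Z_i^T\beta_1)\, p_{C_i}(t)/m_{C_i}(t)$ is the weighted sum over the sampled set, $u = \{1, \ldots, \sum_{l=1}^{L} m_l\}$ and $C_j = 1$ for $j \in \{1, \ldots, m_1\}$, $C_j = 2$ for $j \in \{m_1 + 1, \ldots, m_1 + m_2\}$, $\ldots$, and $C_j = L$ for $j \in \{\sum_{l=1}^{L-1} m_l + 1, \ldots, \sum_{l=1}^{L} m_l\}$.

In simple settings, such as the SRS design with only one binary covariate and the CM design with the counter-matching variable and the covariate of interest both being binary, closed-form formulas can be obtained for the asymptotic expected information. The noncentrality parameter can then be calculated easily. Let the numbers of subjects in a risk set be distributed as in Table 6.1 and the asymptotic limits of the proportions of the subjects in a risk set be denoted as in Table 6.2.

Table 6.1: Distribution of the subjects in a risk set

         Z = 0     Z = 1     Total
C = 0    n_00      n_01      n_0.
C = 1    n_10      n_11      n_1.
Total    n_.0      n_.1      n

Table 6.2: Asymptotic limits of the proportion of the subjects in a risk set

         Z = 0     Z = 1     Total
C = 0    π_00      π_01      π_0.
C = 1    π_10      π_11      π_1.
Total    π_.0      π_.1      1

Let $\phi = \pi_{11}/\pi_{\cdot 1}$ and $\psi = \pi_{00}/\pi_{\cdot 0}$ be the sensitivity and specificity of $C$ for predicting $Z$. For a full risk set with a binary $Z$, the asymptotic expected information is
\[ \mathcal{I}_R = \lim_{n \to \infty} I_R = \lim_{n \to \infty} \left[\frac{S_R^{(2)}}{S_R^{(0)}} - \left(\frac{S_R^{(1)}}{S_R^{(0)}}\right)^2\right] = \lim_{n \to \infty} \frac{n_{\cdot 0}\, n_{\cdot 1} \exp(\beta_1)}{[n_{\cdot 0} + n_{\cdot 1} \exp(\beta_1)]^2} = \frac{\pi_{\cdot 0}\, \pi_{\cdot 1} \exp(\beta_1)}{[\pi_{\cdot 0} + \pi_{\cdot 1} \exp(\beta_1)]^2}. \]
For $1 : m-1$ SRS with a binary $Z$, the asymptotic expected information is
\[ \mathcal{I}_{\tilde R} = \frac{1}{m}\, \frac{1}{\pi_{\cdot 0} + \pi_{\cdot 1} \exp(\beta_1)} \sum_{k=0}^{m} \binom{m}{k} \pi_{\cdot 1}^{k}\, \pi_{\cdot 0}^{(m-k)} \left\{ \frac{k \exp(\beta_1)(m-k)}{(m-k) + k \exp(\beta_1)} \right\} \tag{6.4} \]
(Bryan Langholz's personal communication).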
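The closed form (6.4) can be checked numerically against the general expression (6.2) by averaging $I_u S_u^{(0)}$ over randomly generated sampled risk sets. The sketch below is a minimal illustration under assumed inputs: a single binary covariate, hypothetical function names, and a 1-degree-of-freedom power calculation based on the normal distribution.

```python
import math
import random
from statistics import NormalDist


def info_full_binary(p1, beta):
    """Asymptotic expected information of a full risk set for a binary Z."""
    p0, e = 1.0 - p1, math.exp(beta)
    return p0 * p1 * e / (p0 + p1 * e) ** 2


def info_srs_binary(p1, beta, m):
    """Closed form (6.4) for 1:(m-1) SRS; k is the number of exposed subjects in the set."""
    p0, e = 1.0 - p1, math.exp(beta)
    total = 0.0
    for k in range(m + 1):
        prob = math.comb(m, k) * p1 ** k * p0 ** (m - k)
        total += prob * k * e * (m - k) / ((m - k) + k * e)
    return total / (m * (p0 + p1 * e))


def info_srs_monte_carlo(draw_z, beta, m, n_draws, rng):
    """Monte Carlo estimate of (6.2): E[I_u S_u^(0)] / (m E[exp(beta Z)])."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(n_draws):
        z = [draw_z(rng) for _ in range(m)]
        w = [math.exp(beta * zi) for zi in z]
        s0 = sum(w)
        s1 = sum(zi * wi for zi, wi in zip(z, w))
        s2 = sum(zi * zi * wi for zi, wi in zip(z, w))
        numerator += s2 - s1 * s1 / s0   # equals I_u * S_u^(0)
        denominator += s0 / m            # estimates E[exp(beta Z)]
    return numerator / (m * denominator)


def power_1df(nc, alpha=0.05):
    """Power of a 1-df chi-square test with noncentrality nc."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    root = math.sqrt(nc)
    return NormalDist().cdf(root - z) + NormalDist().cdf(-root - z)


# Illustrative values (assumptions): Pr(Z = 1) = 0.4, rate ratio 1.4, 200 risk sets, 1:2 SRS.
beta1 = math.log(1.4)
closed = info_srs_binary(0.4, beta1, m=3)
approx = info_srs_monte_carlo(lambda r: 1 if r.random() < 0.4 else 0,
                              beta1, 3, 200_000, random.Random(1))
power_srs = power_1df(200 * closed * beta1 ** 2)
```

The same Monte Carlo routine accepts any covariate generator through `draw_z`, so a continuous covariate can be handled by swapping in a different sampler, at the cost of simulation error that shrinks with `n_draws`.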
The asymptotic expected information for 1:1 CM with a binary $C$ and a binary $Z$ is:
\[ \mathcal{I}_{\tilde R} = \frac{\pi_{\cdot 0}\, \pi_{\cdot 1} \exp(\beta_1)}{\pi_{\cdot 0} + \pi_{\cdot 1} \exp(\beta_1)} \left[ \frac{\phi\,\psi}{\pi_{0\cdot} + \pi_{1\cdot} \exp(\beta_1)} + \frac{(1 - \phi)(1 - \psi)}{\pi_{0\cdot} \exp(\beta_1) + \pi_{1\cdot}} \right], \tag{6.5} \]
where $\phi = \pi_{11}/\pi_{\cdot 1}$ and $\psi = \pi_{00}/\pi_{\cdot 0}$ are the sensitivity and specificity of $C$ for predicting $Z$ (Bryan Langholz's personal communication).

When some covariates follow continuous distributions, e.g. the normal distribution or the $\chi^2$ distribution, numerical integration can be used for exact computation of the expectations in (6.1), (6.2) and (6.3). Here, we propose using Monte Carlo (MC) integration to compute these expectations approximately, which provides a unified approach for both categorical and continuous covariates. Compared to numerical integration, MC integration can be implemented easily under complex settings such as multiple continuous covariates or mixed distributions of categorical and continuous covariates. It is also less time-consuming than full simulation. The results of the proposed approach are reported in Tables 6.3 through 6.8.

6.4.2 Expected LRT Statistic

Let the covariate vector $Z$ be partitioned into $X$ and $Y$, which are the covariates to be tested and the other covariates, respectively. The noncentrality parameter calculated by the SMO method for a full risk set is
\[ 2E[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)] = 2 \sum_{i \in R} \Big\{ \big[X_i\beta_1 + Y_i\gamma_1 - \log \sum_{j \in R} \exp(X_j\beta_1 + Y_j\gamma_1)\big] - \big[X_i\beta_0 + Y_i\gamma_0 - \log \sum_{j \in R} \exp(X_j\beta_0 + Y_j\gamma_0)\big] \Big\} \frac{\exp(X_i\beta_1 + Y_i\gamma_1)}{\sum_{j \in R} \exp(X_j\beta_1 + Y_j\gamma_1)} \]
\[ = 2 \left\{ \frac{\frac{1}{n}\sum_{i \in R} [(X_i\beta_1 + Y_i\gamma_1) - (X_i\beta_0 + Y_i\gamma_0)] \exp(X_i\beta_1 + Y_i\gamma_1)}{\frac{1}{n}\sum_{i \in R} \exp(X_i\beta_1 + Y_i\gamma_1)} - \log \frac{\frac{1}{n}\sum_{j \in R} \exp(X_j\beta_1 + Y_j\gamma_1)}{\frac{1}{n}\sum_{j \in R} \exp(X_j\beta_0 + Y_j\gamma_0)} \right\} \]
\[ \xrightarrow{p}\ 2 \left\{ \frac{E\{[(X\beta_1 + Y\gamma_1) - (X\beta_0 + Y\gamma_0)] \exp(X\beta_1 + Y\gamma_1)\}}{E[\exp(X\beta_1 + Y\gamma_1)]} - \log \frac{E[\exp(X\beta_1 + Y\gamma_1)]}{E[\exp(X\beta_0 + Y\gamma_0)]} \right\} \tag{6.6} \]
as $n \to \infty$ by SLLN.
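For a univariate model with no nuisance parameters, the limit in (6.6) reduces to expectations over the covariate distribution, which are simple sums when the covariate is discrete. The sketch below illustrates that special case only; the function names, the discrete `support` representation and the illustrative parameter values are assumptions, and the power step uses a 1-degree-of-freedom test.

```python
import math
from statistics import NormalDist


def lrt_noncentrality_full(support, beta1, beta0=0.0):
    """Per-risk-set limit of (6.6) for one covariate; support is a list of (z, Pr(Z = z))."""
    e1 = sum(p * math.exp(beta1 * z) for z, p in support)
    e0 = sum(p * math.exp(beta0 * z) for z, p in support)
    cross = sum(p * (beta1 - beta0) * z * math.exp(beta1 * z) for z, p in support)
    return 2.0 * (cross / e1 - math.log(e1 / e0))


def power_1df(nc, alpha=0.05):
    """Power of a 1-df chi-square test with noncentrality nc."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    root = math.sqrt(nc)
    return NormalDist().cdf(root - z) + NormalDist().cdf(-root - z)


# Illustrative values (assumptions): binary covariate with Pr(Z = 1) = 0.4, rate ratio 1.4.
binary_support = [(0, 0.6), (1, 0.4)]
nc_per_set = lrt_noncentrality_full(binary_support, math.log(1.4))
power_full = power_1df(200 * nc_per_set)
```

Multiplying `nc_per_set` by the number of risk sets gives the noncentrality parameter used in the power calculation; with nuisance parameters the constrained value of $\gamma_0$ would have to be obtained first.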
For $1 : m-1$ SRS, the noncentrality parameter for a sampled risk set is:
\[ 2E[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)] = E[E(LRT \mid \tilde R, \beta_1)] = \sum_{r \subset R} E(LRT \mid r, \beta_1) \Pr(\tilde R = r) = \sum_{r \subset R} E(LRT \mid r, \beta_1) \sum_{i \in r} \Pr(D = i)\,\pi(r \mid i) \]
\[ = \sum_{r \subset R} E(LRT \mid r, \beta_1) \sum_{i \in r} \frac{\exp(X_i\beta_1 + Y_i\gamma_1)}{\sum_{j \in R} \exp(X_j\beta_1 + Y_j\gamma_1)} \binom{n-1}{m-1}^{-1} I(|r| = m) \]
\[ = \frac{1}{\frac{1}{n}\sum_{j \in R} \exp(X_j\beta_1 + Y_j\gamma_1)}\, \frac{1}{m} \binom{n}{m}^{-1} \sum_{r \subset R} E(LRT \mid r, \beta_1)\, S_r^{(0)}(\beta_1)\, I(|r| = m) \ \xrightarrow{p}\ \frac{E[E(LRT \mid u, \beta_1)\, S_u^{(0)}(\beta_1)]}{m\,E[\exp(X\beta_1 + Y\gamma_1)]} \tag{6.7} \]
by SLLN and Lemma 2 in Borgan et al. (1995), where $u = \{1, \ldots, m\}$ and
\[ E[LRT \mid r, \beta_1] = 2E\{[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)] \mid r\} = 2 \sum_{i \in r} \Big\{ \big[X_i\beta_1 + Y_i\gamma_1 - \log \sum_{j \in r} \exp(X_j\beta_1 + Y_j\gamma_1)\big] - \big[X_i\beta_0 + Y_i\gamma_0 - \log \sum_{j \in r} \exp(X_j\beta_0 + Y_j\gamma_0)\big] \Big\}\, \pi(i \mid r, \beta_1, \gamma_1), \tag{6.8} \]
with $\pi(i \mid r, \beta_1, \gamma_1) = \exp(X_i\beta_1 + Y_i\gamma_1) / \sum_{j \in r} \exp(X_j\beta_1 + Y_j\gamma_1)$.

For the CM design, the noncentrality parameter for a sampled risk set is:
\[ 2E[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)] = E[E(LRT \mid \tilde R, \beta_1)] = \sum_{r \subset R} E(LRT \mid r, \beta_1) \Pr(\tilde R = r) = \sum_{r \subset R} E(LRT \mid r, \beta_1) \sum_{i \in r} \Pr(D = i)\,\pi(r \mid i) \]
\[ = \sum_{r \subset R} E(LRT \mid r, \beta_1) \sum_{i \in r} \frac{\exp(X_i\beta_1 + Y_i\gamma_1)}{\sum_{j \in R} \exp(X_j\beta_1 + Y_j\gamma_1)} \Big[\prod_{l \in G} \binom{n_l(t)}{m_l}\Big]^{-1} \frac{n_{C_i}(t)}{m_{C_i}(t)}\, I(|r \cap R_l(t)| = m_l) \]
\[ = \frac{1}{\frac{1}{n}\sum_{j \in R} \exp(X_j\beta_1 + Y_j\gamma_1)} \Big[\prod_{l \in G} \binom{n_l(t)}{m_l}\Big]^{-1} \sum_{r \subset R} E(LRT \mid r, \beta_1) \sum_{i \in r} \exp(X_i\beta_1 + Y_i\gamma_1)\, \frac{n_{C_i}(t)\, n^{-1}}{m_{C_i}(t)}\, I(|r \cap R_l(t)| = m_l) \ \xrightarrow{p}\ \frac{E[E(LRT \mid u, \beta_1)\, \tilde S_u^{(0)}(\beta_1)]}{E[\exp(X\beta_1 + Y\gamma_1)]}, \tag{6.9} \]
by SLLN and Lemma 2 in Borgan et al. (1995), where $\tilde S_u^{(0)}(\beta_1) = \sum_{i \in u} \exp(X_i\beta_1 + Y_i\gamma_1)\, p_{C_i}(t)/m_{C_i}(t)$, $u = \{1, \ldots, \sum_{l=1}^{L} m_l\}$ and $C_j = 1$ for $j \in \{1, \ldots, m_1\}$, $C_j = 2$ for $j \in \{m_1 + 1, \ldots, m_1 + m_2\}$, $\ldots$, and $C_j = L$ for $j \in \{\sum_{l=1}^{L-1} m_l + 1, \ldots, \sum_{l=1}^{L} m_l\}$, and
\[ E[LRT \mid r, \beta_1] = 2E\{[L(\beta_1, \gamma_1) - L(\beta_0, \gamma_0)] \mid r\} = 2 \sum_{i \in r} \Big\{ \big[(X_i\beta_1 + Y_i\gamma_1) + \log\tfrac{p_{C_i}(t)}{m_{C_i}(t)} - \log \sum_{j \in r} \exp(X_j\beta_1 + Y_j\gamma_1) \tfrac{p_{C_j}(t)}{m_{C_j}(t)}\big] - \big[(X_i\beta_0 + Y_i\gamma_0) + \log\tfrac{p_{C_i}(t)}{m_{C_i}(t)} - \log \sum_{j \in r} \exp(X_j\beta_0 + Y_j\gamma_0) \tfrac{p_{C_j}(t)}{m_{C_j}(t)}\big] \Big\}\, \pi(i \mid r, \beta_1, \gamma_1), \tag{6.10} \]
with
\[ \pi(i \mid r, \beta_1, \gamma_1) = \frac{\exp(X_i\beta_1 + Y_i\gamma_1)\, p_{C_i}(t)/m_{C_i}(t)}{\sum_{j \in r} \exp(X_j\beta_1 + Y_j\gamma_1)\, p_{C_j}(t)/m_{C_j}(t)}. \]
For a univariate model, the expectations in (6.6), (6.7) and (6.9) can be calculated by MC integration in a similar way as described in Section 6.4.1. For a multivariate model with nuisance parameters, it is essential to calculate $\gamma_0$. We propose using a method similar to that of Lyles et al. (2007) to calculate $\gamma_0$. An expected data set which consists of the sampled risk sets with all possible covariate profiles is first created. Then, the expected data set is analyzed using Proc Phreg in SAS, with the probability for each covariate profile being the weight in the Weight statement. The analysis is run twice, once with the full model and once under the constraint of the null hypothesis. The estimates for the nuisance parameters in the constrained model are used as $\gamma_0$. The accuracy of the calculated $\gamma_0$ can be validated by comparing the estimates for the parameter of interest and the nuisance parameters in the full model with the true parameters under $H_1$, $\beta_1$ and $\gamma_1$. For a continuous covariate, we use (5.3) to represent its distribution. We found that 20 quantiles are sufficient for getting accurate estimates of the parameters. It is worth noting that both O'Brien and Shieh (1998) and Lyles et al.
(2007) used the LRT statistic output by SAS when analyzing the expected data set as the noncentrality parameter. We found that it is not accurate enough, probably because SAS only gives three decimal places. We propose inserting the estimated $\gamma_0$ into (6.6), (6.7) and (6.9) and then calculating the expectations as in Section 6.4.1.

6.5 Calculation of Power and Sample Sizes for WECARE Study

6.5.1 Main Effect of Radiation as a Binary Variable

Suppose that we are only interested in the main effect of radiation therapy as a binary variable (RT vs. no RT) in the absence of gene mutation status. According to the registry data, about 40% of the subjects in the cohort were recorded as having had radiation therapy (RRT=1), and the positive predictive (PP) and negative predictive (PN) probabilities of RRT for RT are about 90%. Table 6.3 shows the results for testing the main effect of radiation therapy based on Pr(RRT=1)=0.4 and PP=PN=0.90. In order to evaluate the accuracy of the proposed information method when the covariate distribution is relatively imbalanced, the situation where Pr(RRT=1)=0.1 and PP=PN=0.98 is also studied (Table 6.4). The simulation and calculation are based on 200 risk sets.

Table 6.3: Statistical power for main effect of radiation therapy as a binary variable. Pr(RRT=1)=0.40. PP=PN=0.9. Number of risk sets=200.

RR      All     SRS 1:2   SRS 1:3   CM 1:2   CM 2:1
Full simulation (LRT)
1.4     65.6    49.0      53.8      57.8     59.3
1.6     91.3    77.2      81.7      86.5     87.1
1.8     98.6    92.5      94.7      96.2     97.0
2.0     99.8    97.8      98.9      99.4     99.5
Full simulation (ST)
1.4     65.9    49.1      54.0      58.5     59.7
1.6     91.6    77.2      81.8      86.8     87.3
1.8     98.7    92.5      94.7      96.2     97.0
2.0     99.8    97.8      98.9      99.5     99.5
Full simulation (WT)
1.4     65.8    48.8      53.8      58.3     59.4
1.6     91.4    77.1      81.8      86.7     87.2
1.8     98.7    92.4      94.7      96.2     97.0
2.0     99.8    97.8      98.9      99.5     99.5
SMO method
1.4     66.1    49.1      53.8      58.7     60.1
1.6     91.3    77.1      81.7      86.2     87.2
1.8     98.6    92.0      94.7      96.8     97.2
2.0     99.8    97.7      98.8      99.4     99.5
INFO method
1.4     66.3    48.7      53.6      59.2     60.2
1.6     91.2    76.4      81.4      86.6     87.0
1.8     98.5    91.4      94.4      96.9     96.9
2.0     99.8    97.3      98.6      99.4     99.4

Table 6.4: Statistical power for main effect of radiation therapy as a binary variable. Pr(RRT=1)=0.10. PP=PN=0.98. Number of risk sets=200.

RR      All     SRS 1:2   SRS 1:3   CM 1:2   CM 2:1
Full simulation (LRT)
1.7     77.1    60.5      65.6      70.8     74.4
1.8     84.5    68.0      72.5      77.7     80.8
1.9     89.0    75.9      79.2      85.6     88.5
2.0     94.6    81.6      85.2      89.5     92.1
Full simulation (ST)
1.7     80.0    61.7      67.4      75.0     77.2
1.8     86.4    69.0      74.3      80.9     83.0
1.9     90.7    76.9      80.2      87.3     90.1
2.0     95.2    82.2      86.5      90.7     93.1
Full simulation (WT)
1.7     79.1    61.3      66.9      74.6     77.0
1.8     86.2    68.5      73.8      80.5     82.8
1.9     90.1    76.3      79.9      87.3     90.1
2.0     95.0    82.0      86.1      90.6     93.1
SMO method
1.7     78.1    58.7      64.1      71.1     74.5
1.8     86.9    68.4      73.9      80.7     83.6
1.9     92.5    76.6      82.0      87.8     90.1
2.0     95.9    83.2      88.0      92.7     94.4
INFO method
1.7     82.5    59.2      65.9      76.3     78.9
1.8     90.5    68.9      75.7      85.6     87.6
1.9     95.2    77.0      83.5      91.9     93.3
2.0     97.8    83.2      89.1      95.8     96.6

6.5.2 Main Effect of Radiation Dose as a Continuous Variable

Suppose that we are interested in testing the effect of radiation dose in the absence of gene susceptibility. Based on previous data, a $\chi^2$ distribution with 4 degrees of freedom, $\chi^2(4)$, can be used to reasonably approximate the dose distribution for the subjects who were actually given RT (Bernstein, 2004).
The rate ratios for the main effect of radiation dose in Table 6.5 are those obtained when the 95th centile of the $\chi^2$ distribution (approximately equivalent to a 2 Gy dose) is compared with no radiation. The simulations and calculations in this section and the next are based on 700 risk sets, which is the number of risk sets available in the real study.

Table 6.5: Statistical power for main effect of radiation therapy as a continuous variable. Pr(RRT=1)=0.4. PP=PN=0.9. RT $\sim \chi^2(4)$ when RT>0. Number of risk sets=700.

RR at 2 Gy*    All     SRS 1:2   SRS 1:3   CM 1:2   CM 2:1
Full simulation (LRT)
1.4            75.1    57.9      62.8      65.2     61.8
1.5            89.5    75.5      80.1      82.6     79.4
1.6            96.6    87.6      90.7      91.8     90.0
1.7            99.5    96.7      98.0      98.3     97.5
Full simulation (ST)
1.4            76.2    58.6      63.5      66.4     62.3
1.5            90.2    76.0      80.9      83.3     79.8
1.6            96.9    87.9      91.1      92.1     90.2
1.7            99.5    96.8      98.2      98.3     97.6
Full simulation (WT)
1.4            76.2    58.4      63.4      66.3     62.1
1.5            90.2    75.8      80.8      83.3     79.6
1.6            96.9    87.7      91.0      92.1     90.1
1.7            99.5    96.7      98.1      98.3     97.5
SMO method
1.4            77.0    59.0      64.2      66.5     63.7
1.5            90.8    75.8      80.5      82.8     79.6
1.6            97.2    87.3      90.9      92.8     90.4
1.7            99.3    94.4      96.4      97.3     96.3
INFO method
1.4            79.7    59.0      65.2      67.7     63.8
1.5            92.9    76.0      81.0      84.1     80.2
1.6            98.2    87.7      91.6      93.6     90.3
1.7            99.7    94.6      96.8      97.8     96.3

*RR at 2 Gy = rate ratio when 2 Gy radiation dose is compared with no radiation.

6.5.3 Interaction Between Mutation of a Single Gene and Radiation Dose

Here, we assume that there is no main effect of gene mutation, i.e., mutation carriers have no additional risk compared with non-carriers when there is no radiation exposure. The 2 Gy rate ratio for the main effect of radiation dose is taken to be 1.5 in non-carriers. Gene mutation is assumed to be independent of radiation dose. The results for proportions of mutation carriers ranging from 1% to 20% are presented in Tables 6.6-6.8.

Table 6.6: Statistical power for gene by radiation dose interaction for SRS 1:2. RT $\sim \chi^2(4)$ when RT>0. $\beta_G = 0$. $\beta_{RT} = 0.158$. Pr(RRT=1)=0.4. PP=PN=0.90. Number of risk sets=700.

Int RR at 2 Gy*   Full simu. (LRT)   Full simu. (WT)   SMO     INFO
Prevalence of gene carriers = 20%
1                 4.88               4.62              5.00    5.00
2.25              65.4               65.0              66.6    65.2
2.5               77.7               76.9              78.0    75.0
2.75              86.9               86.1              86.2    84.5
3                 92.6               92.0              91.6    89.3
Prevalence of gene carriers = 10%
1                 5.16               4.48              5.00    5.00
2.5               56.0               53.8              56.3    54.3
3                 73.8               72.1              74.5    70.1
3.5               85.5               84.0              86.3    82.1
4                 92.8               91.6              93.0    89.5
Prevalence of gene carriers = 5%
1                 6.10               4.68              5.00    5.00
3                 48.6               44.7              49.6    44.6
4                 70.8               66.7              73.6    65.3
5                 88.3               85.2              87.4    78.7
6                 95.0               92.8              95.2    87.4
Prevalence of gene carriers = 3%
1                 5.82               3.40              5.00    5.00
4                 52.1               44.8              52.6    46.8
6                 77.6               73.0              81.6    70.2
8                 90.5               87.1              94.2    83.4
10                96.5               94.5              97.9    90.1
Prevalence of gene carriers = 2%
1                 6.21               2.56              5.00    5.00
6                 63.4               52.4              65.3    54.0
8                 79.5               69.8              83.9    67.9
10                89.1               82.9              92.4    76.8
12                93.5               89.2              96.6    82.9
Prevalence of gene carriers = 1%
1                 5.88               1.04              5.00    5.00
15                76.1               54.7              88.8    61.6
20                85.9               70.6              96.6    71.0
25                91.2               77.5              98.9    76.8
30                93.3               83.5              99.6    80.6

*Int RR at 2 Gy = rate ratio when 2 Gy radiation dose is compared with no radiation.

Table 6.7: Statistical power for gene by radiation dose interaction for CM 1:2. RT $\sim \chi^2(4)$ when RT>0. $\beta_G = 0$. $\beta_{RT} = 0.158$. Pr(RRT=1)=0.4. PP=PN=0.90. Number of risk sets=700.

Int RR at 2 Gy*   Full simu. (LRT)   Full simu. (WT)   SMO     INFO
Prevalence of gene carriers = 20%
1                 5.22               5.06              5.00    5.00
2.25              66.6               66.3              69.2    68.5
2.5               79.7               79.4              80.6    79.6
2.75              88.9               88.7              88.7    87.0
3                 92.9               92.7              93.5    92.5
Prevalence of gene carriers = 10%
1                 5.22               4.62              5.00    5.00
2.5               58.9               58.6              58.9    58.0
3                 75.9               75.3              77.8    75.7
3.5               86.6               86.1              88.7    86.9
4                 93.6               93.3              94.9    92.7
Prevalence of gene carriers = 5%
1                 5.94               4.46              5.00    5.00
3                 50.9               48.2              51.6    49.9
4                 72.7               71.0              77.0    72.2
5                 88.6               87.5              90.4    85.5
6                 94.4               93.3              96.6    92.6
Prevalence of gene carriers = 3%
1                 5.97               3.83              5.00    5.00
4                 51.5               47.7              57.4    52.3
6                 77.9               75.2              86.5    77.1
8                 90.8               89.2              96.6    89.1
10                95.4               94.8              99.2    94.5
Prevalence of gene carriers = 2%
1                 5.85               2.96              5.00    5.00
6                 63.0               57.3              69.9    60.8
8                 77.9               73.2              85.9    75.3
10                89.3               86.5              94.1    83.6
12                93.3               90.9              97.8    89.2
Prevalence of gene carriers = 1%
1                 5.25               1.81              5.00    5.00
15                74.7               63.3              91.1    71.9
20                84.6               75.4              97.4    80.6
25                88.1               81.6              99.5    85.7
30                92.2               88.0              1       89.1

*Int RR at 2 Gy = rate ratio when 2 Gy radiation dose is compared with no radiation.

Table 6.8: Statistical power for gene by radiation dose interaction for CM 2:1. RT $\sim \chi^2(4)$ when RT>0. $\beta_G = 0$. $\beta_{RT} = 0.158$. Pr(RRT=1)=0.4. PP=PN=0.90. Number of risk sets=700.

Int RR at 2 Gy*   Full simu. (LRT)   Full simu. (WT)   SMO     INFO
Prevalence of gene carriers = 20%
1                 4.84               4.62              5.00    5.00
2.25              59.2               57.7              61.8    60.5
2.5               75.7               74.0              73.5    71.1
2.75              82.2               80.7              82.3    80.0
3                 87.5               86.9              88.6    86.0
Prevalence of gene carriers = 10%
1                 5.24               4.76              5.00    5.00
2.5               51.9               48.8              52.4    49.9
3                 69.2               66.7              70.0    64.6
3.5               81.6               79.2              82.7    76.9
4                 89.2               87.5              90.2    85.6
Prevalence of gene carriers = 5%
1                 5.62               4.16              5.00    5.00
3                 44.3               38.1              45.4    41.1
4                 66.6               60.2              68.3    60.6
5                 83.5               78.4              84.0    74.4
6                 90.8               88.2              92.5    83.3
Prevalence of gene carriers = 3%
1                 5.92               3.21              5.00    5.00
4                 47.0               37.7              48.8    42.1
6                 73.4               65.9              78.7    64.2
8                 88.0               83.9              90.9    77.3
10                95.0               92.1              96.6    85.3
Prevalence of gene carriers = 2%
1                 5.89               2.32              5.00    5.00
6                 56.3               42.4              60.9    48.4
8                 72.6               60.7              76.8    61.2
10                84.9               78.3              91.0    70.6
12                91.0               84.0              94.1    77.0
Prevalence of gene carriers = 1%
1                 6.29               0.78              5.00    5.00
15                71.0               43.5              82.9    56.6
20                82.5               57.2              94.0    65.6
25                85.2               65.9              98.4    71.4
30                91.2               72.0              99.0    75.3

*Int RR at 2 Gy = rate ratio when 2 Gy radiation dose is compared with no radiation.

6.5.4 Results

For testing the main effect of radiation therapy as a binary variable, the likelihood ratio test, the score test and the Wald test agree with one another very well in terms of the power obtained by full simulation when Pr(RRT=1)=0.4 and PP=PN=0.9 (Table 6.3). In this setting, both the information method and the SMO method perform extremely well, with the calculated power almost always within 1% of the simulated power. In Table 6.4, where RRT and RT become more imbalanced (Pr(RRT=1)=0.1, PP=PN=0.98), the agreement among the three tests by full simulation is still good, except that the power of the score test or the Wald test is slightly higher than that of the likelihood ratio test in the CM 1:2 and CM 2:1 designs. Table 6.4 also shows that the information method overestimates the power most of the time, but by no more than 5%, compared with the power simulated by the Wald test when Pr(RRT=1)=0.1 and PP=PN=0.98. The SMO method performs slightly better than the information method in this setting. When the main effect of the continuous radiation dose is tested, the three tests are again very similar to one another in full simulation, and the power calculated using the information method and the SMO method agrees with the simulated power (Table 6.5).
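As a rough illustration of how an information-based (Wald-type) power calculation works for a single parameter, the sketch below converts a per-risk-set expected information into power through the normal approximation. It is a minimal sketch under stated assumptions, not the implementation used to produce the tables: the function name `wald_power` and the information value 0.05 are hypothetical, and the test is assumed to be two-sided with one degree of freedom.

```python
from math import log, sqrt
from statistics import NormalDist

def wald_power(beta1, info_per_risk_set, n_risk_sets, alpha=0.05):
    """Approximate power of a two-sided 1-df Wald test of H0: beta = 0.

    Assumes the estimator is asymptotically normal with variance
    1 / (n_risk_sets * info_per_risk_set), evaluated at beta1.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Square root of the noncentrality parameter n * I(beta1) * beta1^2.
    shift = abs(beta1) * sqrt(n_risk_sets * info_per_risk_set)
    # P(|Z + shift| > z_crit) for a standard normal Z.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical inputs: log rate ratio log(1.5), 700 risk sets, and an
# assumed expected information of 0.05 per risk set.
print(round(wald_power(log(1.5), 0.05, 700), 3))
# Under the null (beta1 = 0) the power reduces to the nominal level alpha.
print(round(wald_power(0.0, 0.05, 700), 3))
```

With one degree of freedom this normal-tail expression is equivalent to the upper tail of a noncentral chi-square distribution, which is why the calculated power equals 5.00% when the rate ratio is 1.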
Comparing the simulated results for testing the gene by radiation dose interaction (Tables 6.6-6.8), we found that the likelihood ratio test and the Wald test have similar power when the prevalence of gene carriers is 20%. As the prevalence of mutation carriers decreases, the likelihood ratio test rejects $H_0$ more often than the Wald test, and the difference between the two tests increases. When the prevalence of mutation carriers is only 1%, the power of the likelihood ratio test can be 20 percentage points or more higher than that of the Wald test (for example, 76.1% vs. 54.7% for SRS 1:2 at an interaction rate ratio of 15; Table 6.6).

As shown in Figures 6.1(a)-6.1(f), by the Wald test, $H_0$ is almost always more likely to be rejected in CM 1:2 than in SRS 1:2. The difference between the two designs increases as the gene mutation becomes rarer. By the likelihood ratio test, CM 1:2 is slightly more powerful than SRS 1:2 when the prevalence of gene mutation carriers is 5% or higher, whereas the power of CM 1:2 appears to be equal to or even slightly lower than that of SRS 1:2 when the prevalence of gene mutation carriers is below 5% (Figure 6.1). CM 2:1 is always less powerful than CM 1:2 and SRS 1:2 by both the likelihood ratio test and the Wald test.

Tables 6.6-6.8 show that under $H_0$, the test size changes slightly for the likelihood ratio test and decreases greatly for the Wald test as the proportion of gene carriers decreases, which indicates the failure of the asymptotic theory when the sample size is small. When gene carriers are rare, the chi-square critical values do not accurately reflect the significance level of the tests, especially the Wald test. It seems that the departure from the asymptotic theory at small sample sizes has resulted in what is described above: the difference in the probability of rejecting $H_0$ between the likelihood ratio test and the Wald test, and the difference in the behavior of these two tests when CM 1:2 is compared with SRS 1:2 for rare genes.
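To judge whether a simulated test size differs from the nominal 5% by more than simulation noise, one can compare it with the binomial Monte Carlo standard error. The sketch below is illustrative only: it assumes 5000 simulated trials per scenario (the number stated for the Chapter 7 simulations; whether the same number applies here is an assumption), and the function names are hypothetical.

```python
from math import sqrt

def mc_standard_error(p, n_trials):
    """Binomial standard error of a simulated rejection proportion."""
    return sqrt(p * (1 - p) / n_trials)

def z_from_nominal(observed, nominal=0.05, n_trials=5000):
    """Standardized distance of an observed test size from the nominal level."""
    return (observed - nominal) / mc_standard_error(nominal, n_trials)

# Test sizes reported for SRS 1:2 at 1% carrier prevalence (Table 6.6).
for label, size in [("LRT", 0.0588), ("Wald", 0.0104)]:
    print(label, round(z_from_nominal(size), 1))
```

Under the 5000-trial assumption, the standard error at the nominal level is about 0.3 percentage points, so a Wald test size of 1.04% lies far outside what simulation noise alone would produce.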
Agresti (2002) claimed that the likelihood ratio test is more powerful than the Wald test for small sample sizes, without providing supporting evidence. Whether this is true needs to be investigated by comparing the two tests when their type I error rates are kept the same. In our simulation, the smaller changes in test size for the likelihood ratio test than for the Wald test indicate that the likelihood ratio test is more stable than the Wald test. Agresti (2002) made a similar claim. It is unknown whether this claim is generally true.

For testing the gene-radiation dose interaction, when the gene mutation is common (10% or 20% of the subjects in the risk set are carriers), the power calculated by the information method and the SMO method is close to the power simulated by the corresponding test, with all differences within 3%. The accuracy of the information method and the SMO method degrades when the proportion of carriers is 5% or smaller. The calculated power is closer to the simulated power when the same test is used in simulation and calculation than when different tests are used, even when the calculated power and the simulated power do not agree with each other.

Figure 6.1: Comparison of simulated power for testing gene-radiation dose interaction between SRS 1:2 and CM 1:2

In Figures 6.2a through 6.2c, SRS 1:2 with a binary covariate is used as an example to illustrate the difference between the null variance formula (5.4) and the corrected formula (5.7) described in Section 5.3 in calculating sample sizes for a two-sided test of $H_0: \beta = \beta_0$ vs. $H_1: \beta = \beta_1$ with $a = 0.05$ and $b = 0.90$.
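The contrast between a sample size formula that uses the null variance throughout and one that keeps the null and alternative variances separate can be illustrated numerically. The sketch below assumes the two formulas take their familiar textbook forms; formulas (5.4) and (5.7) are not reproduced in this section and may differ in detail, so the function names, the formula forms and the variance values are all assumptions made for illustration. Here `var0` and `var1` stand for the inverse expected information per risk set under the null and the alternative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def _z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)

def n_null_variance(var0, beta0, beta1, alpha=0.05, power=0.90):
    """Risk sets needed when the null variance is used for both hypotheses."""
    delta = beta1 - beta0
    return (_z(1 - alpha / 2) + _z(power)) ** 2 * var0 / delta ** 2

def n_two_variance(var0, var1, beta0, beta1, alpha=0.05, power=0.90):
    """Risk sets needed when null and alternative variances are kept separate."""
    delta = beta1 - beta0
    return (_z(1 - alpha / 2) * sqrt(var0) + _z(power) * sqrt(var1)) ** 2 / delta ** 2

# Hypothetical per-risk-set variances; the null variance is the larger one.
var0, var1 = 6.0, 5.0
print(ceil(n_null_variance(var0, 0.0, 0.4)))
print(ceil(n_two_variance(var0, var1, 0.0, 0.4)))
```

Under these assumed forms the null-variance calculation gives the larger answer whenever `var0 > var1` and the smaller answer whenever `var0 < var1`; the two coincide when the variances are equal.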
The null variance formula overestimates sample sizes whenI 1 ~ R ( 0 )>I 1 ~ R ( 1 ) and underestimates sample sizes when I 1 ~ R ( 0 )<I 1 ~ R ( 1 ). Based on (5.4), I 1 ~ R ( 0 )>I 1 ~ R ( 1 ) when 1 [( :0 + :1 exp( 1 )] [ :0 exp( 1 ) 2 + exp( 1 ) + :1 exp( 1 ) 1 + 2 exp( 1 ) ]< 1 for SRS 1:2 with a binary covariate. In conclusion, when the sample size is large enough, the Wald test and the LR test are equivalent in terms of actual power and the power calculation methods based on these two tests also provide results similar to each other and to the simulated power. However, when nuisance parameters are present, the information method is easier to implement since it does not require the calculation of 0 . When the sample size is small and the asymptotic theory does not hold true, neither the information method nor the LRT method gives satisfactory results. If the asymptotic tests are used anyway to calculate the power and sample sizes even if the sample size is small, it is better to use the same test as the one which will be used in the data analysis than to use a dierent test. 55 Figure 6.2: Comparison between null-variance formula and corrected formula for calcu- lating sample sizes in SRS 1:2 with a binary covariate 56 Chapter 7 Power and Sample Size Calculations for Endometrial Cancer Study 7.1 Study on Endometrial Cancer Risk in Women Diag- nosed with Endometrial Hyperplasia The example we use here is based on a real study on risk of endometrial cancer in women diagnosed with endometrial hyperplasia (EH) (Lacey, 2008). The goal of this study is to "investigate whether type of EH is a predictor, either alone or in conjunction with other factors, of endometrial cancer risk" (Lacey, 2008). The cohort consists of women diagnosed with EH from a single health care provider with the outcome being diagnosis of endometrial carcinoma. Cases are women who were diagnosed with endometrial carcinoma at least one year after diagnosis of EH. 
Controls are women who were diagnosed with EH and remained at risk at the time of diagnosis of endometrial carcinoma for their matched cases. The time scale that denes the risk sets is years since the diagnosis of EH. 57 The types of EH classied based on the severity of glandular crowding and nuclear atypia include simple hyperplasia (SH), complex hyperplasia(CH) and atypical hyperpla- sia (AH). Here, we only consider two types, SH and CH, assuming that CH is rare. When CH is rare, randomly selected controls would almost certainly have included few controls with CH. Counter-matching with additional randomly sampled controls (CM+SRS) or quota sampling with stopping design (QS+stopping) is proposed depending on the study goal and whether or not the EH type is available for all subjects in the cohort. 7.2 Methods for Calculation 7.2.1 Expected Information and LRT Statistic in CM+SRS Design Let 2f( 1 ;:::; L ) : m l l = m l + ~ m l ; P L l=1 l = m + ~ mg represent the com- positions of the sampled risk set, where m l and ~ m l are the numbers of subjects counter- matched and randomly sampled, respectively at stratum l. f 1 ;:::; L g is xed in each composition and vary from composition to composition. Let P (t) =I(rR(t);jr\R l (t)j = l ;l2G), which includes all subsets of the risk set with l subjects at stratum l. 58 The expected information for a sampled risk set is I ~ R = X X rP (t) I r Pr( ~ R =r) = X X rP (t) I r X i2r exp(Z T i 1 ) P j2R exp(Z T j 1 ) [ Q l n l (t) m l ~ m l n(t) m ~ m ][ Y l n l (t) m l + ~ m l ] 1 n C i (t) m C i (t) + ~ m C i (t) I(i2r;rR(t);jr\R l (t)j = m l + ~ m l ;l2G) = X [ Q l n l (t) m l ~ m l n(t) m ~ m ] 1 1 n P j2R exp(Z T j 1 ) [ Y l n l (t) m l + ~ m l ] 1 X rP (t) I r X i2r exp(Z T i 1 ) p C i (t) m C i (t) + ~ m C i (t) I(i2r;rR(t);jr\R l (t)j = m l + ~ m l ;l2G) ! 
X ~ m ~ m 1 ;:::; ~ m L Y l2G p ~ m l l (t) E[I u ( 1 ) P i2u exp(Z T i 1 ) p C i (t) m C i (t) + ~ m C i (t) ] E[exp(Z T 1 )] asn!1 by SLLN and lemma 2 of Borgan et al. (1995), where u =f1;:::; m + ~ mg and C j = 1 forj2f1;:::; m 1 + ~ m 1 g,C j = 2 forj2f m 1 + ~ m 1 + 1;:::; m 1 + ~ m 1 + m 2 + ~ m 2 g, :::, and C j =L for j2f P L1 l=1 ( m l + ~ m l ) + 1;:::; P L l=1 ( m l + ~ m l )g; since Q l ( n l (t) m l ~ m l ) ( n(t) m ~ m ) ]! ~ m ~ m 1 ;:::; ~ m L Q l2G p ~ m l l (t) as n!1. 59 The expected LRT test statistic is: 2E[L( 1 ; 1 )L( 0 ; 0 )] =E[E(LRT ~ R; 1 )] = X X rP (t) E(LRT r; 1 )Pr( ~ R =r) = X X rP (t) E(LRT r; 1 ) X i2r exp(X i 1 +Y i 1 ) P j2R exp(X j 1 +Y j 1 ) [ Q l n l (t) m l ~ m l n(t) m ~ m ][ Y l n l (t) m l + ~ m l ] 1 n C i (t) m C i (t) + ~ m C i (t) I(i2r;rR(t);jr\R l (t)j = m l + ~ m l ;l2G) ! X ~ m ~ m 1 ;:::; ~ m L Y l2G p ~ m l l (t) E[E(LRT u; 1 ) P i2u exp(X T i 1 +Y T i 1 ) p C i (t) m C i (t) + ~ m C i (t) ] E[exp(X 1 +Y 1 )] (7.1) 60 asn!1 by SLLN and lemma 2 of Borgan et al. 
(1995), where u =f1;:::; m + ~ mg and C j = 1 forj2f1;:::; m 1 + ~ m 1 g,C j = 2 forj2f m 1 + ~ m 1 + 1;:::; m 1 + ~ m 1 + m 2 + ~ m 2 g, :::, and C j =L for j2f P L1 l=1 ( m l + ~ m l ) + 1;:::; P L l=1 ( m l + ~ m l )g: E[LRTjr; 1 ] = 2Ef[L( 1 ; 1 )L( 0 ; 0 )]jrg = 2 X i2r f[(X i 1 +Y i 1 ) +log( p C i (t) m C i (t) + ~ m C i (t) ) log X j2r exp(X j 1 +Y j 1 ) p C j (t) m C i (t) + ~ m C i (t) ] [(X i 0 +Y i 0 ) + log( p C i (t) m C i (t) + ~ m C i (t) ) log X j2r exp(X j 0 +Y j 0 ) p C j (t) m C i (t) + ~ m C i (t) ]g(ijr; 1 ; 1 ) = 2 X i2r f[(X i 1 +Y i 1 ) +log( p C i (t) m C i (t) + ~ m C i (t) ) log X j2r exp(X j 1 +Y j 1 ) p C j (t) m C i (t) + ~ m C i (t) ] [(X i 0 +Y i 0 ) + log( p C i (t) m C i (t) + ~ m C i (t) ) log X j2r exp(X j 0 +Y j 0 ) p C j (t) m C i (t) + ~ m C i (t) ]g exp(X i 1 +Y i 1 ) p C i (t) m C i (t) + ~ m C i (t) P j2r exp(X j 1 +Y j 1 ) p C j (t) m C i (t) + ~ m C i (t) : (7.2) 7.2.2 Expected Information and LRT Statistic in QS+Stopping Design Quota sampling design was discussed in Sections 3.3 and 4.2.4. In reality, the sampling is often stopped after a specied number of subjects have been selected whether the quota has been met or not. Langholz (2005c) discussed about the control selection probability and the partial likelihood in batch quota sampling with stopping. The design presented here is a special case of batch quota sampling with stopping with the batch size being 1. 61 Let m max be the maximum size of the sampled risk set. If the quota has been met before the size of the sampled risk set exceedsm max , the probability of picking a particular set r given that i is the case is t (rji) = 1 n 0 (t) m 0 (t) n 1 (t) m 1 m(t)1 m 1 1 n(t)m(t) n 1 (t)m 1 n(t) n 1 (t) n(t) m(t) 1 m 1 C i (t) m 1 I(i2r;rR(t);jr\R 1 (t)j =m 1 ;jrjm max ) as discussed in Section 4.2.4. 
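The sampling scheme introduced at the start of this subsection (quota sampling that stops at a maximum risk set size, with batch size 1) can be sketched as a short simulation. This is a minimal sketch rather than the procedure used in the thesis: the function name and arguments are hypothetical, and it assumes that the quota counts targeted subjects in the sampled risk set including the case.

```python
import random

def sample_qs_stopping(case_is_target, control_is_target, quota, max_size, rng):
    """Sample one risk set by quota sampling with stopping (batch size 1).

    Controls are drawn one at a time without replacement until the sampled
    set (case included) holds `quota` targeted subjects, reaches `max_size`,
    or the controls run out. Returns the sampled control indices and a flag
    saying whether the quota was met.
    """
    order = list(range(len(control_is_target)))
    rng.shuffle(order)                       # random sampling order
    n_target = int(case_is_target)
    sampled = []
    for idx in order:
        if n_target >= quota or 1 + len(sampled) >= max_size:
            break                            # quota met or size cap reached
        sampled.append(idx)
        n_target += int(control_is_target[idx])
    return sampled, n_target >= quota

# Hypothetical risk set: 500 eligible controls, 5% with the targeted exposure.
rng = random.Random(1)
controls = [rng.random() < 0.05 for _ in range(500)]
sampled, quota_met = sample_qs_stopping(False, controls, quota=2, max_size=10, rng=rng)
print(1 + len(sampled), quota_met)
```

When the quota is met, the last control drawn is always a targeted subject, which is what distinguishes the selection probabilities in that case from those of a set that simply reached the maximum size.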
If the quota has not been met and the sampling stops when the size of the sampled risk set reachesm max , thenM 1 2f0; 1;:::;m 1 1g withM 1 being the number of targeted subjects actually selected and m 1 the number of targeted subjects required. The prob- ability that there are M 1 targeted subjects in r of size m max follows a hypergeometric distribution: Pr(j ~ Rj =m max ;j ~ R 1 j =M 1 C i (t)) = [ n 0 (t) M 0 n 1 (t) M 1 ] 1 n C i (t) M C i (t) n(t)1 mmax1 The probability of selecting a particular set given the size of the set being m max is Pr ~ R(t) =r j ~ R(t)j =m max ;C i (t) = [ n 0 (t) M 0 n 1 (t) M 1 ] 1 n C i (t) M C i (t) : 62 The probability of picking a particular setr given thati is the case is the product of these two probabilities: t (rji) = 1 n(t)1 mmax1 I(i2r;rR(t);jr\R 1 (t)j<m 1 ;jrj =m max ): Therefore, the subjects are weighted with one if the quota is not met. Let I 1 =f( 0 ; 1 ) : 0 = 0; 1;:::;m max m 1 ; 1 =m 1 g represents the compositions of the sampled risk set where the quota has been met and I 2 =f( 0 ; 1 ) : 0 =m max 1 ; 1 = 0; 1;:::;m 1 1g represents the compositions of the sample risk set where the quota stops before the quota is met. Let P m 1 (t) =I(rR(t);jr\R 1 (t)j =m 1 ;jrj =m =m 1 ;:::;m max ) and P M 1 2 (t) =I(rR(t);jr\R 1 (t)j =M 1 = 0; 1;:::;m 1 1;jrj =m max ) 63 I ~ R = mmax X m=m 1 X rP m 1 I r Pr( ~ R =r) + m 1 1 X M 1 =0 X rP M 1 2 I r Pr( ~ R =r) = mmax X m=m 1 X rP m 1 I r X i2r exp(Z T i 1 ) P j2R exp(Z T j 1 ) 1 n 0 (t) m 0 (t) n 1 (t) m 1 m1 m 1 1 n(t)m n 1 (t)m 1 n(t) n 1 (t) n(t) m 1 m 1 C i (t) m 1 I(i2r;rR(t);jr\R 1 (t)j =m 1 ;jrjm max ) + m 1 1 X M 1 =0 X rP M 1 2 I r X i2r exp(Z T i 1 ) P j2R exp(Z T j 1 ) 1 n 0 M 0 n 1 (t) M 1 n 0 (t) M 0 n 1 (t) M 1 n(t)1 mmax1 I(i2r;rR(t);jr\R 1 (t)j<m 1 ;jrj =m max ) ! 
mmax X m=m 1 m 1 m 1 1 p m 0 0 p m 1 1 E[I u ( 1 ) P i2u exp(Z T i 1 ) m 1 C i (t) m 1 ] (m 1)E[exp(Z T 1 )] + m 1 1 X M 1 =0 m max M 1 p M 0 0 p M 1 1 E[I v ( 1 ) P i2v exp(Z T i 1 )] m max E[exp(Z T 1 )] asn!1 by SLLN and lemma 2 in Borgan et al. (1995), whereu =f1;:::;m 0 +m 1 g and C j = 0 forj2f1;:::;m 0 g andC j = 1 forj2fm 0 +1;:::;m 0 +m 1 g andv =f1;:::;M 0 + M 1 g and C j = 0 for j2f1;:::;M 0 g and C j = 1 for j2fM 0 + 1;:::;M 0 +M 1 g; since ( m1 m 1 1 )( n(t)m n 1 (t)m 1 ) ( n(t) n 1 (t) ) ! m1 m 1 1 p m 0 0 p m 1 1 and ( n 0 (t) M 0 )( n 1 (t) M 1 ) ( n(t) mmax ) ! mmax M 1 p M 0 0 p M 1 1 as n!1. 64 2E[L( 1 ; 1 )L( 0 ; 0 )] =E[E(LRT ~ R; 1 )] = mmax X m=m 1 X rP m 1 E(LRT r; 1 )Pr( ~ R =r) + m 1 1 X M 1 =0 X rP M 1 2 E(LRT r; 1 )Pr( ~ R =r) = mmax X m=m 1 X rP m 1 E(LRT r; 1 ) X i2r exp(X i 1 +Y i 1 ) P j2R exp(X j 1 +Y j 1 ) 1 n 0 (t) m 0 (t) n 1 (t) m 1 m(t)1 m 1 1 n(t)m(t) n 1 (t)m 1 n(t) n 1 (t) n(t) m(t) 1 m 1 C i (t) m 1 I(i2r;rR(t);jr\R 1 (t)j =m 1 ;jrjm max ) + m 1 1 X M 1 =0 X rP M 1 2 E(LRT r; 1 ) X i2r exp(X i 1 +Y i 1 ) P j2R exp(X j 1 +Y j 1 ) 1 n 0 M 0 n 1 (t) M 1 n 0 (t) M 0 n 1 (t) M 1 n(t)1 mmax I(i2r;rR(t);jr\R 1 (t)j<m 1 ;jrj =m max ) ! mmax X m=m 1 m 1 m 1 1 p m 0 0 p m 1 1 E[E(LRT u; 1 )( 1 ) P i2u exp(X i 1 +Y i 1 ) m 1 C i (t) m 1 ] (m 1)E[exp(X 1 +Y 1 )] + m 1 1 X M 1 =0 m max M 1 p M 0 0 p M 1 1 E[E(LRT v; 1 ) P i2v exp(X i 1 +Y i 1 )] E[m max exp(X 1 +Y 1 )] (7.3) asn!1 by SLLN and lemma 2 in Borgan et al. (1995), whereu =f1;:::;m 0 +m 1 g and C j = 0 forj2f1;:::;m 0 g andC j = 1 forj2fm 0 +1;:::;m 0 +m 1 g andv =f1;:::;M 0 + M 1 g and C j = 0 for j2f1;:::;M 0 g and C j = 1 for j2fM 0 + 1;:::;M 0 +M 1 g; since ( m1 m 1 1 )( n(t)m n 1 (t)m 1 ) ( n(t) n 1 (t) ) ! m1 m 1 1 p m 0 0 p m 1 1 and ( n 0 (t) M 0 )( n 1 (t) M 1 ) ( n(t) mmax ) ! mmax M 1 p M 0 0 p M 1 1 as n!1. 
65 E[LRTjr; 1 ] = 2Ef[L( 1 ; 1 )L( 0 ; 0 )]jrg = 2 X i2r f[(X i 1 +Y i 1 ) +log( m 1 C i (t) m 1 ) log X j2r exp(X j 1 +Y j 1 ) m 1 C i (t) m 1 ] [(X i 0 +Y i 0 ) + log( m 1 C i (t) m 1 ) log X j2r exp(X j 0 +Y j 0 ) m 1 C i (t) m 1 ]g (ijr; 1 ; 1 ) = 2 X i2r f[(X i 1 +Y i 1 ) +log( m 1 C i (t) m 1 ) log X j2r exp(X j 1 +Y j 1 ) m 1 C i (t) m 1 ] [(X i 0 +Y i 0 ) + log( m 1 C i (t) m 1 ) log X j2r exp(X j 0 +Y j 0 ) m 1 C i (t) m 1 ]g exp(X i 1 +Y i 1 ) m1Ci(t) m1 P j2r exp(X j 1 +Y j 1 ) m1Ci(t) m1 ; (7.4) if the quota has been met. E[LRTjr; 1 ] = 2Ef[L( 1 ; 1 )L( 0 ; 0 )]jrg = 2 X i2r f[(X i 1 +Y i 1 ) log X j2r exp(X j 1 +Y j 1 )] [(X i 0 +Y i 0 ) log X j2r exp(X j 0 +Y j 0 )]g(ijr; 1 ; 1 ) = 2 X i2r f[(X i 1 +Y i 1 ) log X j2r exp(X j 1 +Y j 1 )] [(X i 0 +Y i 0 ) log X j2r exp(X j 0 +Y j 0 )]g exp(X i 1 +Y i 1 ) P j2r exp(X j 1 +Y j 1 ) ; (7.5) if the quota has not been met. 66 7.2.3 Method for Calculating Expected Information and LRT Statistic The expected information for CM+SRS and QS+stopping designs is calculated using the MC integration, which is similar to what was used for SRS design and CM design. We used the exemplary data approach to calculate 0 in the expected LRT statistics for SRS design and CM design. It would be dicult to create an exemplary data set for CM+SRS or QS+stopping design, which may involve more subjects than SRS and CM designs. For the integration part in calculating 0 in the expected LRT statistics for CM+SRS and QS+stopping designs, we considered the MC integration, the classical numerical interation and the quasi-monte carlo (QMC) integration and decided to use the QMC integration. Classical numerical integration, such as Gaussian quadrature, is not suitable for high dimensional data because the number of quadrature points increases exponentially as the dimensionality increases, which makes the computation too intensive to be feasible. 
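As an illustration of the quasi-Monte Carlo approach chosen in this section, the sketch below builds Halton points from radical inverses, maps them to a normal distribution through the inverse cumulative distribution function, and averages an integrand over them. It is written in Python with the standard library only and is not the R code used in the thesis; the function names and the test integrand $E[\exp(\beta Z)]$ with $Z \sim N(0,1)$ (whose exact value is $\exp(\beta^2/2)$) are chosen purely for illustration.

```python
from math import exp
from statistics import NormalDist

def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

def halton(n_points, bases=(2, 3)):
    """First n_points of the Halton sequence, one coordinate per base."""
    return [tuple(radical_inverse(i, b) for b in bases)
            for i in range(1, n_points + 1)]

def qmc_mean_exp(beta, n_points):
    """QMC estimate of E[exp(beta * Z)] for Z ~ N(0, 1)."""
    inv_cdf = NormalDist().inv_cdf
    points = halton(n_points, bases=(2,))
    # Transform uniform Halton points to N(0, 1) and average the integrand.
    return sum(exp(beta * inv_cdf(u)) for (u,) in points) / n_points

beta = 0.455  # illustrative coefficient
for n in (256, 1024, 4096):
    # Increase the number of points until the estimate stabilizes.
    print(n, qmc_mean_exp(beta, n), exp(beta ** 2 / 2))
```

Printing the estimate for an increasing number of points mirrors the stopping rule of adding points until the result is stable.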
Both the MC and QMC methods approximate the integral of interest by averaging the integrand over a set of points selected in the defined space. The number of points to be evaluated in the MC or QMC methods does not change with the number of dimensions. The disadvantage of the MC method is its slow convergence. Instead of random points, the QMC method uses deterministic points. These deterministic points have been shown to be more evenly distributed than random points, which makes the QMC method converge faster than the MC method (Fang and Wang, 1994). Many sequences are available for generating the points in the QMC method; we use the Halton sequence (Halton, 1960) because it is the most frequently used. First, we generate Halton sequence points on the interval (0,1) and then transform them to points from the distribution of interest using the inverse cumulative distribution function. To determine the number of points required, we increase the number of points until the result is stable. The expectations in calculating the expected LRT statistic are computed using the QMC method as well. The maximization algorithm used is the Newton-Raphson method. The "runif.halton" function in the "fOptions" package and the "maxNR" function in the "maxLik" package in R are used to generate the Halton sequence and to perform the Newton-Raphson iteration, respectively.

The results of the calculation methods are compared with the results of full simulation, which is done in the same way as for the SRS and CM designs. The number of trials in full simulation is 5000.

7.3 Calculation of Power in Endometrial Cancer Study

7.3.1 Interaction between EH Type and a Continuous Secondary Exposure in CM+SRS Design

Suppose that the EH type is available in the computerized records for all subjects in the cohort. The goal is to investigate the interaction between the EH type and a secondary exposure. A CM+SRS design (2:2:2) is proposed.
First, the controls are counter-matched so that two subjects from each of the two EH types are selected, including the case. CM 2:2 ensures that two subjects with CH are selected. Then, two additional controls are randomly chosen from the remaining controls, which may increase the power for detecting the interaction between the primary and secondary exposures. In order to evaluate the performance of the proposed methods under various covariate distributions, we assume the prevalence of CH to be 20%, 5% or 2%. The secondary exposure is assumed to be normally or log-normally distributed. We consider situations where the primary and secondary exposures are correlated with each other. When the secondary exposure is normally distributed, it follows N(0,1) and N(0.5,1) for SH and CH, respectively. When the secondary exposure is log-normally distributed, it follows log-normal(0, 0.25) and log-normal(0.5, 0.25) for SH and CH, respectively. The results are shown in Tables 7.1 and 7.2, respectively. To compare the CM+SRS design with the SRS and CM designs, and to illustrate the power calculation in SRS and CM designs with normally or log-normally distributed covariates, we also show the results for the SRS 1:5 and CM 4:2 designs in Tables 7.1-7.2. The simulation and calculation are based on 500 and 1000 risk sets when the secondary exposure is normal and log-normal, respectively.

7.3.2 Interaction between EH Type and a Continuous Secondary Exposure in QS+Stopping Design

Suppose that the EH type is not available in the computerized records and that it is prohibitively expensive to obtain the information for all cohort subjects by reviewing the medical records. A quota sampling design seems to be a good choice, because only the medical records of sampled subjects are reviewed. However, at most 10 subjects can be included in each risk set due to budget constraints. Therefore, we propose using the QS+stopping design.
We randomly sample the controls and check the EH type of each subject sampled. We stop the sampling either when two subjects with CH have been selected and the total number of subjects selected (including both the case and the controls) is 10 or fewer, or when no or only one subject with CH has been selected and the total number of subjects reaches 10. We calculate the power for the main effect of the EH type alone based on 200 risk sets (Table 7.3). As for the CM+SRS design, we also calculate the power for the interaction between the EH type and a continuous secondary exposure (Tables 7.4 and 7.5), using the same parameters, covariate distributions and number of risk sets as for the CM+SRS design.

7.3.3 Results

When the secondary exposure is normally distributed, the information method and the SMO method agree very well with each other and with full simulation in calculating the power for the interaction between the EH type and the secondary exposure in the CM+SRS design, regardless of the prevalence of CH (Table 7.1). When the secondary covariate is log-normal and CH is common (20%), both methods perform well in calculating the power for the interaction between the two exposures in the CM+SRS design (Table 7.2). When the secondary exposure is log-normal and CH is rare (5% or lower), the power calculated by the SMO method is within 3% of the simulated power, while the information method overestimates the simulated power by up to almost 6% (Table 7.2).

For the interaction between the EH type and a normally or log-normally distributed secondary exposure in the QS+stopping design, the agreement between the two calculation methods and between the calculated power and the simulated power is very good, with the differences within 2% most of the time (Tables 7.4 and 7.5). Table 7.3 shows that for the QS+stopping design, the power for the main effect of the EH type alone calculated by the information method becomes higher than that calculated by the SMO method, and the test size of the Wald test in full simulation decreases, as the prevalence of CH decreases, indicating a departure from the asymptotic theory when CH is rare.

Two patterns remain unexplained. First, for the QS+stopping design, the agreement between the information method and the SMO method is good for the interaction between the EH type and the secondary exposure, but not for the main effect of the EH type alone. Second, for the interaction between the EH type and the log-normal secondary exposure, the agreement between the two calculation methods is good for the QS+stopping design, but not for the CM+SRS design.

For SRS 1:5, the results of the information method and the SMO method are close to each other and to the simulated power for the interaction between the EH type and the secondary exposure (normal or log-normal) most of the time, except that the difference between the two calculation methods reaches 7% when only 2% of the cohort subjects are diagnosed with CH and the secondary exposure is normally distributed. Both calculation and full simulation show that SRS 1:5 has lower power than CM+SRS 2:2:2 to detect the interaction effect. The power of CM 4:2 is very similar to that of CM+SRS 2:2:2. The two calculation methods also perform as well for CM 4:2 as for CM+SRS 2:2:2.

Table 7.1: Statistical power for interaction between EH type and a normal covariate (A) in CM+SRS 2:2:2 design. A $\sim$ N(0,1) for EH=SH. A $\sim$ N(0.5,1) for EH=CH. $\beta_{EH} = 1.8$. $\beta_A = 0.455$. Number of risk sets=500.

Prev. of CH   Int RR*   Full simu. (LRT)   Full simu. (WT)   SMO     INFO
CM+SRS 2:2:2
20%           2         53.7               53.8              52.9    52.5
              3         90.3               90.2              89.9    89.8
5%            3         61.6               61.4              61.9    61.9
              5         93.6               93.5              93.8    93.3
2%            4         53.5               53.4              53.3    53.1
              8         92.5               92.4              92.2    91.5
SRS 1:5
20%           2         50.6               49.9              49.7    49.4
              3         88.3               88.1              87.4    86.8
5%            3         50.2               48.3              49.7    48.0
              5         84.2               83.2              83.5    80.3
2%            4         38.0               33.9              37.9    36.0
              8         75.2               71.5              74.3    67.2
CM 4:2
20%           2         52.9               53.0              52.6    52.3
              3         90.2               90.2              89.7    89.3
5%            3         62.4               61.9              61.9    61.5
              5         93.1               93.0              93.6    93.6
2%            4         51.2               50.8              52.9    53.3
              8         90.1               89.9              91.8    92.5

*Int RR = rate ratio for EH $\times$ A interaction when the 95th percentile in N(0,1) is compared with the 5th percentile.

Table 7.2: Statistical power for interaction between EH type and a log-normal covariate (B) in CM+SRS 2:2:2 design. B $\sim$ log-normal(0,0.25) for EH=SH. B $\sim$ log-normal(0.5,0.25) for EH=CH. $\beta_{CH} = 1.8$. $\beta_B = 0.4$. Number of risk sets=1000.

Prev. of CH   Int RR*   Full simu. (LRT)   Full simu. (WT)   SMO     INFO
CM+SRS 2:2:2
20%           2.5       57.9               57.5              57.4    58.9
              4         89.9               89.9              89.7    90.2
5%            2.5       46.9               47.6              49.3    53.1
              4         84.3               84.6              88.5    90.7
2%            3         47.0               48.6              47.3    54.5
              5         86.2               87.3              88.5    91.9
SRS 1:5
20%           2.5       57.5               56.7              56.2    54.9
              4         90.1               89.6              88.4    87.6
5%            2.5       37.5               36.2              37.7    37.0
              4         74.7               73.4              73.2    70.7
2%            3         30.1               26.7              29.1    27.9
              5         63.1               58.8              58.2    57.0
CM 4:2
20%           2.5       60.3               59.8              57.6    58.7
              4         92.0               91.7              89.8    90.1
5%            2.5       49.2               49.8              49.4    52.6
              4         88.9               88.9              88.6    90.7
2%            3         47.0               48.8              47.4    52.1
              5         86.9               87.4              88.5    92.1

*Int RR = rate ratio for EH $\times$ B interaction when the 99th percentile in log-normal(0,0.25) is compared with the 1st percentile.

Table 7.3: Statistical power for main effect of EH type in QS+stopping design. Number of risk sets=200.

Prev. of CH   Int RR    Full simu. (LRT)   Full simu. (WT)   SMO     INFO
50%           1         4.86               4.72              5.00    5.00
              1.5       61.5               62.4              61.2    62.1
              1.8       89.8               90.4              89.9    90.4
20%           1         4.58               4.48              5.00    5.00
              1.5       55.5               57.5              56.6    59.1
              1.8       89.5               90.4              88.7    90.9
5%            1         5.42               4.56              5.00    5.00
              1.8       49.2               53.0              50.5    55.3
              2.3       82.0               84.2              84.8    89.6
2%            1         5.14               3.18              5.00    5.00
              2.3       51.1               56.5              51.4    58.1
              3.2       81.9               84.9              86.8    92.4

Table 7.4: Statistical power for interaction between EH and a normal covariate (A) in QS+stopping design. A $\sim$ N(0,1) for EH=SH. A $\sim$ N(0,1) for EH=CH. $\beta_{CH} = 1.8$. $\beta_A = 0.455$. Number of risk sets=500.

Prev. of CH   Int RR*   Full simu. (LRT)   Full simu. (WT)   SMO     INFO
20%           2.2       61.9               61.1              60.1    59.4
              3.2       90.8               90.6              90.6    90.1
5%            3         54.1               52.7              54.2    53.6
              5.5       90.9               90.3              91.6    90.1
2%            5         56.8               54.0              56.2    53.7
              10        89.4               88.0              89.4    85.5

*Int RR = rate ratio for EH $\times$ A interaction when the 95th percentile in N(0,1) is compared with the 5th percentile.

Table 7.5: Statistical power for interaction between EH and a log-normal covariate (B) in QS+stopping design. B $\sim$ log-normal(0,0.25) for EH=SH. B $\sim$ log-normal(0.5,0.25) for EH=CH. $\beta_{CH} = 1.8$. $\beta_B = 0.4$. Number of risk sets=1000.

Prev. of CH   Int RR*   Full simu. (LRT)   Full simu. (WT)   SMO     INFO
20%           2.5       58.4               57.6              56.2    57.4
              4         91.9               91.2              88.9    88.4
5%            2.8       54.4               52.8              52.2    54.2
              4.8       92.2               91.6              90.5    90.8
2%            4         56.3               54.6              55.2    56.2
              7         88.9               87.3              89.2    88.3

*Int RR = rate ratio for EH $\times$ B interaction when the 99th percentile in log-normal(0,0.25) is compared with the 1st percentile.

Chapter 8: Summary and Discussion

The information method and the SMO method developed in this study are very general and flexible. They can be used successfully in a wide range of practical situations, including various complex nested case-control designs, univariate and multivariate models, and covariates with binary and various continuous distributions. The application of these methods will greatly facilitate the design of nested case-control studies.

When the covariates are symmetrical or balanced, both the information method and the SMO method can reliably estimate the power. In these settings, the information method is preferred, since it does not require calculating the nuisance parameter value under the null hypothesis and is easier to implement than the SMO method.
When the covariates are skewed or imbalanced, the two methods may not agree with each other, indicating the failure of asymptotic theory. In practice, it is helpful to perform both methods and compare them when the validity of the asymptotic theory is in doubt. Inconsistency between the two methods may suggest a departure from the asymptotic distributions of the test statistics, in which case the full simulation approach should be used instead to obtain the power or sample size.

Because Monte Carlo and quasi-Monte Carlo integration handle high-dimensional data and complex covariate distributions in a straightforward way, the proposed methods can be extended to nested case-control study designs that are even more complex than the ones discussed in this thesis, for example, the design actually used in the endometrial cancer study, counter-matching within batch quota sampled data (Lacey, 2008).

We have investigated a few common covariate distributions that may be encountered in practice. The continuous covariate distributions include normal, log-normal and chi-square distributions. We have only looked at binary covariates as an example of categorical variables. The methods developed for binary covariates should be straightforward to extend to polytomous covariates.

The proposed methods can be used to compare the efficiencies and power of different nested case-control designs. In previous work (Langholz and Borgan 1995; Li, 2004), the efficiencies of the simple random sampling design and the counter-matching design have been compared analytically for categorical covariates, but not for continuous covariates. No research has been done to compare the QS+stopping design and the CM+SRS design with other designs. Such comparisons will not only help to understand the designs better, but can also further advocate the use of the proposed methods and the developed software.
Our simulations show that the likelihood ratio test has a more stable test size than the Wald test when the covariates are skewed or imbalanced (Tables 6.7-6.8 and 7.3). More work needs to be done to ascertain whether this behavior is common and whether the likelihood ratio test is to be preferred when exposure is rare or the distribution is highly skewed. It will also be interesting to investigate further how skewed covariate distributions and small sample sizes affect the distributions of the test statistics.

Bibliography

[1] Agresti, A. (2002). Categorical Data Analysis. John Wiley and Sons, Hoboken, NJ.
[2] Bernstein, J.L., Langholz, B., Haile, R.W., Bernstein, L., Thomas, D.C., Stovall, M., Malone, K.E., Lynch, C.F., Olsen, J.H., Anton-Culver, H., Shore, R.E., Boice, J.D. Jr., Berkowitz, G.S., Gatti, R.A., Teitelbaum, S.L., Smith, S.A., Rosenstein, B.S., Borrensen-Dale, A., Concannon, P. and Thompson, W.D. (2004). Breast Cancer Research. 6:199-214.
[3] Borgan, O., Goldstein, L. and Langholz, B. (1995). Methods for the analysis of sample cohort data in the Cox proportional hazards model. The Annals of Statistics. 23:1749-1778.
[4] Breslow, N.E. and Day, N.E. (1980). Statistical Methods in Cancer Research. Volume I: The Analysis of Case-control Studies. Oxford University Press, New York.
[5] Breslow, N.E. and Day, N.E. (1987). Statistical Methods in Cancer Research. Volume II: The Design and Analysis of Cohort Studies. Oxford University Press, New York.
[6] Brown, B.W., Lovato, J. and Russell, K. (1999). Asymptotic power calculations: description, examples, computer code. Statistics in Medicine. 18:3137-3151.
[7] Bull, S.B. (1993). Sample size and power determination for a binary outcome and an ordinal exposure when logistic regression analysis is planned. American Journal of Epidemiology. 137:676-684.
[8] Cox, D.R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B, Methodological. 34:187-220.
[9] Cox, D.R.
and Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall, London, UK.
[10] Demidenko, E. (2007a). Sample size determination for logistic regression revisited. Statistics in Medicine. 26:3385-3397.
[11] Demidenko, E. (2007b). Sample size and optimal design for logistic regression with binary interaction. Statistics in Medicine. 27:36-46.
[12] Dupont, W.D. (1988). Power calculations for matched case-control studies. Biometrics. 44:1157-1168.
[13] Ernster, V.L. (1994). Nested case-control studies. Preventive Medicine. 23:587-590.
[14] Fang, K.T. and Wang, Y. (1994). Number-theoretic Methods in Statistics. Chapman and Hall, London, UK.
[15] Fleiss, J.L. (1987). Statistical Methods for Rates and Proportions, 2nd edition. Wiley, New York.
[16] Foppa, I. and Spiegelman, D. (1997). Power and sample size calculations for case-control studies of gene-environment interactions with a polytomous exposure variable. American Journal of Epidemiology. 146:596-604.
[17] Gauderman, W.J. (2002a). Sample size requirements for matched case-control studies of gene-environment interaction. Statistics in Medicine. 21:35-50.
[18] Gauderman, W.J. (2002b). Sample size requirements for association studies of gene-gene interaction. American Journal of Epidemiology. 155:478-483.
[19] Goldstein, L. and Langholz, B. (1992). Asymptotic theory for nested case-control sampling in the Cox regression model. Annals of Statistics. 20:1903-1928.
[20] Greenland, S. (1985). Power, sample size, and smallest detectable effect determination for multivariate studies. Statistics in Medicine. 4:117-127.
[21] Guenther, W.C. (1977). Power and sample size for approximate chi-square tests. American Statistician. 31:83-85.
[22] Halton, J. (1960). On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerical Mathematics. 2:84-90.
[23] Hauck, W.W. and Donner, A. (1977). Wald test as applied to hypotheses in logit analysis.
Journal of the American Statistical Association. 72:851-853.
[24] Lacey, J.V. (2008). Endometrial carcinoma risk among women diagnosed with endometrial hyperplasia: the 34-year experience in a large health plan. British Journal of Cancer. 98:45-53.
[25] Lachin, J.M. (2008). Sample size evaluation for a multiply matched case-control study using the score test from a conditional logistic (discrete Cox PH) regression model. Statistics in Medicine. 27:2509-2523.
[26] Langholz, B. (2005a). Case-Control Study, Nested. In Encyclopedia of Biostatistics, Second Edition. John Wiley and Sons, Chichester, UK. pp. 646-655.
[27] Langholz, B. (2005b). Counter-matching. In Encyclopedia of Biostatistics, Second Edition. John Wiley and Sons, Chichester, UK. pp. 1248-1254.
[28] Langholz, B. (2005c). Batch quota sampling. Biostatistics Division Technical Report. Department of Preventive Medicine, University of Southern California.
[29] Langholz, B. (2006). Quota sampling of controls in matched case-control studies. Biostatistics Division Technical Report. Department of Preventive Medicine, University of Southern California.
[30] Langholz, B. (2007). Use of cohort information in the design and analysis of case-control studies. Scandinavian Journal of Statistics. 34:120-136.
[31] Langholz, B. and Borgan, O. (1995). Counter-matching: a stratified nested case-control sampling method. Biometrika. 82:69-79.
[32] Langholz, B. and Clayton, D. (1994). Sampling strategies in nested case-control studies. Environmental Health Perspectives. 102 (Suppl. 8):47-51.
[33] Langholz, B. and Goldstein, L. (1996). Risk set sampling in epidemiologic cohort studies. Statistical Science. 11:35-53.
[34] Li, Y. (2004). Counter-matching in Nested Case-control Studies: Design and Analytic Issues. Ph.D. Dissertation. University of Southern California.
[35] Longmate, J.A. (2001). Complexity and power in case-control association studies. American Journal of Human Genetics. 68:1229-1237.
[36] Lyles, R.H., Lin, H.M. and Williamson, J.M. (2007). A practical approach to computing power for generalized linear models for nominal, count or ordinal responses. Statistics in Medicine. 26:1632-1648.
[37] Novikov, I., Oberman, B. and Freedman, L. (2005). Modification of the computational procedure in Parker and Bregman's method of calculating sample size from matched case-control studies with a dichotomous exposure. Biometrics. 61:1123-1127.
[38] Oakes, D. (1981). Survival times: Aspects of partial likelihood (with discussion). International Statistical Review. 49:235-264.
[39] O'Brien, R.G. (2002). Sample size analysis in study planning. Unpublished paper presented at Joint Statistical Meetings, New York City, NY.
[40] O'Brien, R.G. and Shieh, G. (1998). A simpler method to compute power for likelihood ratio tests in generalized linear models. Unpublished paper presented at Joint Statistical Meetings, Dallas, TX.
[41] Parker, R.A. and Bregman, D.J. (1986). Sample size for individually matched case-control studies. Biometrics. 42:919-926.
[42] Rothman, K.J. and Greenland, S. (1998). Modern Epidemiology. Lippincott Williams and Wilkins Publishers, Philadelphia, PA.
[43] Sahai, H. and Khurshid, A. (1996). Formulae and tables for the determination of sample sizes and power in clinical trials for testing differences in proportions for the two-sample design: a review. Statistics in Medicine. 15:1-21.
[44] Schlesselman, J.J. (1982). Case-control Studies. Oxford University Press, New York.
[45] Schoenfeld, D.A. (1983). Sample-size formula for the proportional-hazards regression model. Biometrics. 39:499-503.
[46] Schoenfeld, D.A. and Borenstein, M. (2005). Calculating the power or sample size for the logistic and proportional hazards models. Journal of Statistical Computation and Simulation. 75:771-785.
[47] Self, S.G. and Mauritsen, R.H. (1988). Power/sample size calculations for generalized linear models. Biometrics. 44:79-86.
[48] Self, S.G., Mauritsen, R.H. and Ohara, J. (1992). Power calculations for likelihood ratio tests in generalized linear models. Biometrics. 48:31-39.
[49] Shieh, G. (2000). On power and sample size calculations for likelihood ratio tests in generalized linear models. Biometrics. 56:1192-1196.
[50] Shieh, G. (2001). Sample size calculations for logistic and Poisson regression models. Biometrika. 88:1193-1199.
[51] Shieh, G. (2005). On power and sample size calculations for Wald tests in generalized linear models. Journal of Statistical Planning and Inference. 128:43-59.
[52] Sinha, S. and Mukherjee, B.O. (2006). A score test for determining sample size in matched case-control studies with categorical exposure. Biometrical Journal. 48:35-53.
[53] Steenland, K. and Deddens, J.A. (1997). Increased precision using counter-matching in nested case-control studies. Epidemiology. 8:238-242.
[54] Thomas, D.C. (1977). Addendum to a paper by Liddel, F.D.K., McDonald, J.C. and Thomas, D.C. Journal of the Royal Statistical Society, Series A. 140:483-485.
[55] Tokunaga, M., Land, C.E., Tokuoka, S., Nishimori, I., Soda, M. and Akiba, S. (1994). Incidence of female breast cancer among atomic bomb survivors, 1950-1985. Radiation Research. 138:209-223.
[56] Ury, H. (1975). Efficiency of case-control studies with multiple controls per case: continuous or dichotomous data. Biometrics. 31:643-649.
[57] Vaeth, M. (1985). On the use of Wald's test in exponential families. International Statistical Review. 53:199-214.
[58] Whittemore, A.S. (1981). Sample size for logistic regression with small response probability. Journal of the American Statistical Association. 76:27-32.
[59] Wilson, S.R. and Gordon, I. (1986). Calculating sample size in the presence of confounding variables. Applied Statistics. 35:307-313.
Abstract
Power and sample size calculations can greatly facilitate the design of nested case-control studies, yet few studies have addressed this issue. In this dissertation we develop unified approaches for determining power and sample sizes for nested case-control studies based on the established asymptotic theory for the Cox proportional hazards model. We assess the accuracy of these methods for practical use by comparing them with full Monte Carlo simulation at realistic sample sizes. The information method and the SMO method we propose are shown to work well in a wide range of practical situations, which include various complex nested case-control study designs (simple random sampling, counter-matching, counter-matching with additional randomly sampled controls, and quota sampling with stopping), various categorical and continuous covariate distributions, and univariate and multivariate models. These methods handle high dimensional data and complex covariate distributions flexibly, which makes it possible to extend them to other nested case-control study designs.
Asset Metadata
Creator
Ye, Wei (author)
Core Title
Power and sample size calculations for nested case-control studies
Contributor
Electronically uploaded by the author (provenance)
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
08/08/2010
Defense Date
06/21/2010
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
Cox regression model,likelihood ratio test,nested case-control study,OAI-PMH Harvest,Power,sample size,Wald test
Language
English
Advisor
Langholz, Bryan (committee chair), Gauderman, W. James (committee member), Pinski, Jacek (committee member)
Creator Email
weiye1@yahoo.com,wye@med.usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3347
Unique identifier
UC171038
Identifier
etd-Ye-3936 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-376880 (legacy record id),usctheses-m3347 (legacy record id)
Legacy Identifier
etd-Ye-3936.pdf
Dmrecord
376880
Document Type
Dissertation
Rights
Ye, Wei
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu