Counter-matching in nested case-control studies: Design and analytic issues
COUNTER-MATCHING IN NESTED CASE-CONTROL STUDIES: DESIGN AND ANALYTIC ISSUES

by Yu-Fen Li

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements of the Degree DOCTOR OF PHILOSOPHY (BIOMETRY)

August 2004

Copyright 2004 Yu-Fen Li

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 3145236

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

UMI Microform 3145236. Copyright 2004 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346

Dedication

To my parents

Acknowledgements

I would like to express my appreciation to my committee members for their precious time and comments. I am indebted to Drs. Bryan Langholz and Frank Gilliland, whose competent research and career advice was invaluable. Their generosity with their time and their patience are deeply appreciated. Drs. Gauderman, Xiang, and Goldstein served on my committee and they deserve special thanks for their efforts. Appreciation is also expressed to Dr.
Daniel Stram for giving me many valuable suggestions on the dissertation. Last, but not least, this work could not have been accomplished without the support of my best friend Susan and my husband Chien-Kuo. Their understanding, patience, and excellent editorial comments are greatly appreciated.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
2 Nested Case-Control Studies
  2.1 Cohort Studies: Rationale for Nested Case-Control Studies
  2.2 Sampling on Exposure
3 Sampling Designs and Notation
  3.1 Notation and Model
  3.2 Random Sampling
    3.2.1 Control of Confounding in Random Sampling
  3.3 Counter-Matching
    3.3.1 Control of Confounding in Counter-Matching
4 Comparison of Efficiency of Different Sampling Schemes for Nested Case-Control Studies
  4.1 Review: Efficiency Comparisons for Sampling Schemes
    4.1.1 Efficiency for Main Effect Detection
    4.1.2 Efficiency for Interaction Detection
  4.2 Efficiency - 1:1 Counter-Matching versus 1:1 Simple Random Sampling
5 Design Issues for Counter-Matched Nested Case-Control Studies
  5.1 Factors Affecting the Efficiency of Counter-Matched Designs
    5.1.1 Exposure Prevalence
    5.1.2 Exposure-Disease Association
    5.1.3 Sample Size
    5.1.4 X - Z Relationship
    5.1.5 Sample Allocation
  5.2 Simulation Study I: Categorical Z
    5.2.1 Methods for Simulation Study I
    5.2.2 Results of Simulation Study I
  5.3 Simulation Study II: Factors that Influence Efficiency
    5.3.1 Methods for Simulation Study II
    5.3.2 Results of Simulation Study II
6 Analytic Issues for Counter-Matched Nested Case-Control Studies
  6.1 Model Fitting under Complex Sampling
  6.2 Descriptive Statistics
  6.3 Missing Data and Non-Participation
    6.3.1 The Nature of Missing Data
    6.3.2 Accounting for Missing Data in Analysis
    6.3.3 Accounting for Non-Participation in Analysis
  6.4 Disease Subtype and Covariate-Defined Subset Analysis
7 Analytic Issues: Data Example
  7.1 Data Example: the Early Asthma Risk Factor Study (EARS)
  7.2 Analytic Issues
    7.2.1 Model Fitting
    7.2.2 Descriptive Statistics
    7.2.3 Non-Participation
    7.2.4 Outcome with Subtypes: Subtype Analysis
8 Polytomous Outcomes in Counter-Matched Nested Case-Control Studies
  8.1 Methods for Analyzing Polytomous Outcomes
    8.1.1 Nominal Outcomes
    8.1.2 Ordinal Outcomes
  8.2 Polytomous Logistic Regression Model under Complex Sampling
    8.2.1 Likelihood: Full Study Base
    8.2.2 Conditional Likelihood for Random Sampling
    8.2.3 "Less" Conditional Likelihood for Random Sampling
    8.2.4 Two Conditional Likelihood Approaches for Complex Sampling
9 Polytomous Outcomes: EARS Example and Simulation Studies
  9.1 EARS Example
  9.2 Simulation Studies Comparing Polytomous and Pairwise Approaches for Counter-Matching
10 Summary and Discussion
  10.1 Summary
  10.2 Future Research
References
Appendix
A Simulation Description
  A.1 Categorical Covariates
  A.2 Polytomous Outcomes
B SAS Macro Code for Simulation Studies
  B.1 Logistic Regression for Counter-Matched Data
  B.2 Polytomous Logistic Regression for Counter-Matched Data
C Derivation of Information for Conditional Polytomous Likelihood Model

List of Tables

5.1 Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a binary exposure Z, when the odds ratio (OR) is 1 ($\beta = 0.00$). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

5.2 Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a binary exposure Z, when the odds ratio (OR) is 2 ($\beta = 0.69$). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

5.3 Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a binary exposure Z, when the odds ratio (OR) is 4 ($\beta = 1.39$).
The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

5.4 Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z with a linear trend effect on the outcome, when the odds ratio (OR) is 1 per unit of Z ($\beta = 0.00$). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

5.5 Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z with a linear trend effect on the outcome, when the odds ratio (OR) is 2 per unit of Z ($\beta = 0.69$). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

5.6 Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z with a linear trend effect on the outcome, when the odds ratio (OR) is 4 per unit of Z ($\beta = 1.39$). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

5.7 Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z, when the odds ratio is 1 for Z = 1 ($\mathrm{OR}_1 = 1$, $\beta_1 = 0.00$) and 4 for Z = 2 ($\mathrm{OR}_2 = 4$, $\beta_2 = 1.39$). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

5.8 Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z, when the odds ratio is 2 for Z = 1 ($\mathrm{OR}_1 = 2$, $\beta_1 = 0.69$) and 3 for Z = 2 ($\mathrm{OR}_2 = 3$, $\beta_2 = 1.10$). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

7.1 Examples of descriptive statistics under counter-matching, compared to the descriptive statistics in the cohort.

7.2 Estimated odds ratios (OR) and 95% confidence intervals (CI) of in utero exposure to maternal smoking on asthma with and without the consideration of non-participation, the Early Asthma Risk factors Study (EARS).

7.3 Estimated odds ratios (ORs) and 95% confidence intervals (CI) of in utero exposure to maternal smoking for early and late onset asthma in the full cohort and EARS case-control sample with and without weight adjustments.

9.1 Estimated odds ratios (ORs) and 95% confidence intervals (CI) of in utero exposure to maternal smoking for early and late onset asthma in the full cohort and EARS case-control sample using the pairwise and polytomous approaches.

9.2 Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n=600) compared to the full study base (n=1,500) with 5% type I and 15% type II cases, when the odds ratio is 1 for type I case ($\mathrm{OR}_1 = 1$, $\beta_1 = 0$) and the odds ratio is 4 for type II case ($\mathrm{OR}_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

9.3 Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n=600) compared to the full study base (n=1,500) with 10% type I and 10% type II cases, when the odds ratio is 1 for type I case ($\mathrm{OR}_1 = 1$, $\beta_1 = 0$) and the odds ratio is 4 for type II case ($\mathrm{OR}_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

9.4 Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n=600) compared to the full study base (n=1,500) with 15% type I and 5% type II cases, when the odds ratio is 1 for type I case ($\mathrm{OR}_1 = 1$, $\beta_1 = 0$) and the odds ratio is 4 for type II case ($\mathrm{OR}_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

9.5 Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n=600) compared to the full study base (n=1,500) with 5% type I and 15% type II cases, when the odds ratio is 2 for type I case ($\mathrm{OR}_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 2 for type II case ($\mathrm{OR}_2 = 2$, $\beta_2 = 0.69$).
Based on 1,000 trials.

9.6 Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n=600) compared to the full study base (n=1,500) with 10% type I and 10% type II cases, when the odds ratio is 2 for type I case ($\mathrm{OR}_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 2 for type II case ($\mathrm{OR}_2 = 2$, $\beta_2 = 0.69$). Based on 1,000 trials.

9.7 Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n=600) compared to the full study base (n=1,500) with 5% type I and 15% type II cases, when the odds ratio is 2 for type I case ($\mathrm{OR}_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 4 for type II case ($\mathrm{OR}_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

9.8 Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n=600) compared to the full study base (n=1,500) with 10% type I and 10% type II cases, when the odds ratio is 2 for type I case ($\mathrm{OR}_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 4 for type II case ($\mathrm{OR}_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

9.9 Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n=600) compared to the full study base (n=1,500) with 15% type I and 5% type II cases, when the odds ratio is 2 for type I case ($\mathrm{OR}_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 4 for type II case ($\mathrm{OR}_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.
List of Figures

3.1 1:m frequency matched sampling scheme.

3.2 A 1:1 matched nested case-control study using random sampling and counter-matched sampling for the control sampling by exposure X.

3.3 Counter-matched sampling on a binary exposure X for $|d|$ controls from a cohort with $|d|$ cases.

3.4 Counter-matched sampling scheme on exposure X with stratum-specific sampling margins $m_l|D|$, where $l = 1$ to $L$.

4.1 Relative efficiencies of 1:1 counter-matching compared to 1:1 random sampling with different sensitivity and specificity of the counter-matching variable and the exposure of interest. a) The 3D plot with the sensitivity and specificity as the X-axis and Y-axis and relative efficiency as the Z-axis; b) the contour plot with the relative efficiency as the contour lines.

5.1 Efficiencies of 1:1 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is one. a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

5.2 Efficiencies of 1:1 random sampling and 5 counter-matching designs, relative to the full study base, when the odds ratio (OR) is two.
a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

5.3 Efficiencies of 1:1 random sampling and 5 counter-matching designs, relative to the full study base, when the odds ratio (OR) is four. a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

5.4 Efficiencies of 1:2 random sampling and 5 counter-matching designs, relative to the full study base, when the odds ratio (OR) is one. a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

5.5 Efficiencies of 1:2 random sampling and 5 counter-matching designs, relative to the full study base, when the odds ratio (OR) is two. a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%.
Based on 1,000 trials.

5.6 Efficiencies of 1:2 random sampling and 5 counter-matching designs, relative to the full study base, when the odds ratio (OR) is four. a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

5.7 Efficiencies of 1:3 random sampling and 5 counter-matching designs, relative to the full study base, when the odds ratio (OR) is one. a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

5.8 Efficiencies of 1:3 random sampling and 5 counter-matching designs, relative to the full study base, when the odds ratio (OR) is two. a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

5.9 Efficiencies of 1:3 random sampling and 5 counter-matching designs, relative to the full study base, when the odds ratio (OR) is four. a) The exposure prevalence, $\Pr(Z = 1)$, was 20%, and b) the exposure prevalence was 50%.
Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

5.10 An example of counter-matching when no controls are sampled in one of the counter-matched sampling strata. The sample allocation is 33% unexposed subjects in the counter-matched sample, and the sample size is twice the number of cases with the odds ratio (OR) = 2, $\Pr(Z = 1) = 0.2$ and 90% sensitivity and specificity.

6.1 The EM algorithm.

7.1 An example of the analytic data set for one case-control set in the Early Asthma Risk factors Study (EARS). X: the counter-matched variable. Z: exposure collected in the EARS. CC: the pseudo case-control indicator, 1 for the observed case set and 0 for the rest. New covariate Z: the sum of Z values in that case set. logw: the logarithm of risk weight.

7.2 Weight adjustments and the missing indicator adjustments to account for non-participation, shown for one case-control set in the Early Asthma Risk factor Study. Observations in the rectangle are non-participating subjects.

8.1 An example of hierarchical polytomous responses.

Abstract

Cohort studies are the foundation of epidemiologic investigation; however, they are generally very costly to conduct, especially if detailed data are collected for all subjects.
The most common alternative for collection of detailed data is to conduct a nested case-control study. Counter-matching is a newer sampling option utilizing a nested case-control design and has the potential of improving efficiency by maximizing the number of case-control pairs discordant on exposure. However, the efficiency gained in the counter-matched design is dependent upon several design factors including the sample size, the exposure prevalence, the exposure-disease association, the relationship between the counter-matching variable and the exposure of interest, and the ratio of unexposed to exposed subjects in the counter-matched sample. Simulation studies were used to evaluate these design factors that influence the efficiency of counter-matching. The simulation studies show that counter-matching is more efficient than random sampling when the binary counter-matching variable has high sensitivity and specificity (at least 75% for both) for a binary exposure of interest, especially for a rare exposure associated with a high risk of disease. As the sample size increases, the ratio of unexposed to exposed subjects in the counter-matched sample has less impact on the efficiency of counter-matching.

There was a need for further study of practical issues in using counter-matching in nested case-control studies, including descriptive statistics, non-participation of subjects, subgroup analysis, and polytomous outcomes. An approach to addressing
each of the analytic issues is proposed along with a counter-matched study example to illustrate and evaluate the appropriateness of the approaches for this study. Because the covariate distribution in the counter-matched data was distorted in the sampling process, we could use sampling proportions to estimate the number of controls in the cohort from which the sample arose.
For the analytic issues of non-participation and subgroup analysis, an appropriate adjustment on risk weights in the likelihood was needed to obtain unbiased estimates. Two proposed conditional likelihood methods for polytomous logistic regression were shown to be valid and equally efficient. Counter-matching is a sampling strategy for nested case-control studies that increases efficiency, does not introduce bias, and may be a good choice for sampling under certain conditions.

Chapter 1

Introduction

Cohort studies are an important epidemiologic design [11]. Their importance arises because cohort studies are less prone to bias than other epidemiologic designs. The reduction in potential bias is due to the fact that the study base is well-defined and the exposure is determined before the outcome. Although cohort studies provide the basis for strong inferences about exposure-outcome relationships, this design has a number of inherent challenges, including the need for large numbers of subjects and the long follow-up periods necessary for investigations of most disease outcomes. The considerable size required and the collection of detailed information for the entire cohort make cohort studies very expensive and limit the number that can be conducted given finite available resources. Efficient study designs that build on the strength of cohort studies are needed to allow a large number of important public health and clinical issues to be investigated.

One approach to maximizing the efficiency of the cohort design is to collect basic information from all members of a large cohort and then to sample members
The nested case-control study has been the most frequently applied sampling design to improve the efficiency of cohort studies. In this approach, sampling is based on outcome status w ith more detailed exposure assessment conducted for cases and random ly sampled controls. Although this sampling method has been successfully employed to reduce the cost of cohort studies, little attention has been given to whether the efficiency could be further improved by using additional inform ation for the entire cohort study base. One potential source of inform ation that could be used for sampling is the exposure inform ation that is available for the entire cohort; however, the conventional wisdom has been that sampling based on exposure status produces biased estimates of the exposure-outcome relationship. Recent theoretical developments have shown that sampling on exposure status can be accomplished w ith o u t bias and has the potential to substantially improve sampling efficiency [40, 39, 71, 41]. Although the improved efficiency of sampling based on both exposure and outcome status provides a compelling m otivation to adopt this sampling approach, a number of issues arise in the design and analysis of such nested case-control studies. The overall objective of this thesis is to investigate solutions to these challenges and provide guidance for the design and analysis of nested case-control studies that sample based on exposure status. This dissertation presents considerations of the rationale for nested case-control studies and design options focusing on counter matched designs (Chapter 2). We provide the context for the research through presentation of the theoretical framework for counter-matching (Chapter 3) and 2 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. assessment of the efficiency of competing sampling strategies (Chapter 4). 
We then review the design issues for counter-matched nested case-control studies and provide guidelines when and how counter-matching can be applied to increase the efficiency and validity of nested studies (Chapter 5). We then review analytic issues that arise in summarizing counter-matching case-control data and provide an approach to addressing non-participation, missing data and subgroup analyses (Chapter 6). The Early Asthm a Risk-factor Study (EARS) provides a data example to illustrate these approaches (Chapter 7). Because asthma can be viewed as a polytomous outcome we extend methods for binary outcome data to the polytomous setting and again use the EARS and simulated data to illustrate and evaluate the methods (Chapter 8 and Chapter 9). Finally, we provide conclusions and an overall summary regarding the use of counter-matching in nested case-control studies (Chapter 10). 3 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 2 N ested Case-Control Studies The nested case-control study has been the most frequently applied sampling design to improve the efficiency and to reduce the cost of cohort studies. The rationale and options for nested case-control studies are discussed here. 2.1 Cohort Studies: Rationale for N ested Case- Control Studies Cohort studies have several advantages including providing 1) a clear tem poral sequence of exposure and disease, and 2) an opportunity to study m ultiple outcomes related to a specific exposure. In 2001, Sir Richard D oll reviewed the history of cohort studies [20, 21], and pointed out th a t the term of cohort study was first introduced by Frost in 1935. Cohort studies were in itia lly called prospective studies, but now are distinguished as prospective cohort studies and retrospective cohort studies by the tim ing of collection of exposure and actual inform ation is or was 4 Reproduced with permission of the copyright owner. 
Further reproduction prohibited without permission. obtained (at tim e the study is started or sometime in the past). Cohort studies have become essential tools for epidemiologic research. However, in cohort studies, it is often not to collected detailed inform ation on all variables of interest. Generally, there are three issues that motivate collecting additional detailed inform ation. Issue 1 Better Exposure Inform ation To reduce misclassification or to assess a dose-response effect on the outcome, a binary exposure variable X (e.g. smoking) may be insufficient. A dditional data collection (e.g. pack years) is needed to m inimize misclassification and to provide detailed exposure inform ation. Therefore, the first m otivation is the need to collect additional detailed exposure inform ation. Issue 2 Better Control of Confounding Observational studies are potentially subject to the effect of extraneous fac tors, confounders, that may distort the findings of the studies. For example, cigarette smoking is a potential confounder of the relationship between radon exposure and lung cancer. The binary inform ation for exposure X (i.e. yes/no to smoking) is inadequate to control for confounding because the binary in form ation does not allow assignment of homogenous categories (e.g. light, medium, and heavy smokers) when we are interested in the effect of radon ex posure on lung cancer. Therefore, a second m otivation is to collect improved covariate inform ation. Issue 3 Sampling for Biologic tests 5 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. When we examine biological parameters as risk factors such as genetic vari ants, We want to maximize the power for the assessment of genetic variants and/or gene-environment interaction and minimize the cost on genotype and exposure ascertainments. 
The usual strategy is to collect less detailed information on covariates in the cohort, followed by the collection of more detailed information on exposure in a sample. However, in the cohort, it is ideal to collect factors that are 1) subject to retrospective biases, and 2) relatively cheap to obtain at enrollment but from which important information can be derived.

2.2 Sampling on Exposure

The traditional sampling approach for nested case-control studies is to sample on outcome status, such as random sampling (and frequency matching for control of confounding). In other words, we simply sample controls randomly from the risk sets (matching on confounders). However, in a matched case-control study, concordant case-control pairs do not contribute any information [62]. Therefore, the resources spent collecting data from concordant case-control pairs are underutilized.

Since the study is nested in the cohort, there is additional information available for the entire cohort. Recent methodological developments have shown that using the data gathered as part of the cohort study in the sampling strategy can further improve efficiency [39, 40, 41, 71]. Since cases and controls that are discordant on exposure will contribute the most to the analysis, maximizing the number of discordant pairs and minimizing the number of concordant pairs could potentially improve efficiency. This is the intent of "counter"-matching on exposure in the sampling process. We can utilize information on exposure X which is available in the cohort (e.g. yes/no to smoking) to maximize the discordancy in exposure among cases and controls for the collection of more detailed information on Z (e.g. pack years) or a closely related exposure.
Adjusting for the effects of confounding factors is important in observational epidemiologic studies, and is usually applied in the design stage by matching or stratifying the sample of study subjects, or in the analysis stage by stratified or multivariate analyses [37, 62, 68]. We have the chance to address confounding in both the design and analysis stages in nested case-control studies by: 1) matching on the confounders, if no additional information is needed, such as age and gender; 2) matching on the surrogates of confounders, if additional information is needed; or 3) adjusting for the confounder in the analysis stage using newly collected data.

In addition to maximizing the efficiency of collecting more detailed information about exposure, this novel sampling strategy has the potential to increase the power to detect interactions. Sampling on the surrogate variables of factors involved in the interaction of interest would have the potential to enhance the power of the nested case-control study [6].

In summary, nested case-control studies are an important method for analytical epidemiologic research. Improvements in sampling strategies, such as sampling on exposure as well as outcome, can enhance the efficiency of the sampling method.

Chapter 3

Sampling Designs and Notation

Statistical sampling is the fundamental method of inferring information about the cohort without measuring the entire cohort. When sampling from a cohort is needed for a nested case-control study, there are several sampling schemes available for control and case sampling, including

Method 1 Simple Random Sampling

Simple random sampling is a sampling procedure which assures that each element in the population has an equal chance of being selected.
This constitutes the most basic form of random sampling. This sampling method does not directly include the control of confounding and cannot guarantee an adequate overlap in the distribution of covariates (e.g. age and gender) of subjects in the case and control groups.

Matching of cases and controls is a technique that is used to control for confounding at the design stage. Matching may be done on an individual basis (pair matching) or on a group basis (frequency matching). In frequency matching, control sampling involves dividing a population into mutually exclusive strata or subgroups and then taking a simple random sample in each stratum or subgroup. One should try to select stratification criteria which maximize the between-strata variance among the variables of interest and which minimize the within-strata variance. However, this approach may be inefficient if the matching is unnecessary. For example, for a rare exposure, most case-control pairs are unexposed and do not contribute information to the analysis.

Method 2 Counter-Matching

Counter-matching is a newer sampling strategy based on sampling using exposure status as well as case-control status, and was first proposed by Langholz and Clayton in 1994 [40]. Counter-matching extends the idea of frequency matching, but "counter" matches cases and controls on their exposure status. For example, in a 1:1 counter-matched design, one control is selected with the "opposite" exposure status of the case. The purpose of counter-matching is to maximize the number of discordant case-control pairs, which contribute information to the analysis. Counter-matching can be implemented on an individual basis as well as a group basis. As with frequency matching, counter-matching can be conducted within each confounding stratum when one considers confounders.
Therefore, counter-matching within matching strata improves the efficiency of exposure effect estimation as well as controlling for confounding.

3.1 Notation and Model

The following is a summary of the approach of Langholz and Goldstein [41]. In this dissertation, we extend this approach to consider some analytic issues, including non-participation, subgroup analysis, and polytomous outcomes in the context of counter-matching. We now briefly outline the assumed disease model, and derive the likelihood for the full study base and for a sample selected from the full study base.

Disease Model and Likelihood

Consider a study base of $n$ individuals and let $\mathcal{R} = \{1, \ldots, n\}$ index the individuals. For individual $i$ with disease status $D$, where $D = 0$ denotes the non-diseased and $D = 1$ the diseased state, and covariates $Z_i$, suppose the probability of disease status $D$ satisfies the following proportional odds model,

\[
\Pr(D = 1 \mid Z_i) = \frac{\lambda\, r(Z_i; \beta)}{1 + \lambda\, r(Z_i; \beta)} = \frac{\lambda r_i}{1 + \lambda r_i},
\quad \text{and} \quad
\Pr(D = 0 \mid Z_i) = \frac{1}{1 + \lambda\, r(Z_i; \beta)} = \frac{1}{1 + \lambda r_i},
\tag{3.1}
\]

where $\lambda$ is the baseline odds and $r_i = r(Z_i; \beta)$ is the odds ratio associated with covariates $Z_i$. If a logistic model is assumed then one would have $\lambda = \exp(\alpha)$ and $r_i = \exp(Z_i \beta)$. Since our model (3.1) always conditions on the covariates $Z_i$, we suppress them in the following to simplify the notation.

Let $\mathbf{D}$ be the set of indices for diseased individuals, i.e. the case set. Then, the probability of a given case set $\mathbf{D} = \mathbf{d}$ is

\[
\Pr(\mathbf{D} = \mathbf{d})
= \prod_{i \in \mathbf{d}} \frac{\lambda r_i}{1 + \lambda r_i} \prod_{j \in \mathcal{R} \setminus \mathbf{d}} \frac{1}{1 + \lambda r_j}
= \prod_{i \in \mathbf{d}} \lambda r_i \prod_{j \in \mathcal{R}} \frac{1}{1 + \lambda r_j}
= \lambda^{|\mathbf{d}|}\, r_{\mathbf{d}}\, Q_{\mathcal{R}}(\lambda, \beta),
\]

where $|\mathbf{d}|$ is the number of subjects in $\mathbf{d}$, $r_{\mathbf{d}} = \prod_{i \in \mathbf{d}} r_i$ and $Q_{\mathcal{R}}(\lambda, \beta) = \prod_{j \in \mathcal{R}} \frac{1}{1 + \lambda r_j}$. The likelihood for estimation of $\lambda$ and $\beta$ for the study base is given by

\[
L_{full}(\lambda, \beta) = \lambda^{|\mathbf{D}|}\, r_{\mathbf{D}}(\beta)\, Q_{\mathcal{R}}(\lambda, \beta).
\tag{3.2}
\]

The term $Q_{\mathcal{R}}(\lambda, \beta)$ is independent of case-control status and does not contribute to the likelihood when considering sampling from the full study base, e.g. equation (3.3) below.

Sampling

For simplicity, we consider a sample of all cases and selected controls. Let $\tilde{\mathcal{R}}$ be the sampled case-control set and $\pi(\mathbf{r} \mid \mathbf{d})$ be the probability that $\mathbf{r}$ is the sampled case-control set given the case set $\mathbf{d}$. Then, applying Bayes' theorem to the model (3.1), we have

\[
\Pr(\mathbf{D} = \mathbf{d} \mid \tilde{\mathcal{R}} = \mathbf{r})
= \frac{\Pr(\tilde{\mathcal{R}} = \mathbf{r} \mid \mathbf{D} = \mathbf{d}) \Pr(\mathbf{D} = \mathbf{d})}{\sum_{\mathbf{s} \subset \mathbf{r}} \Pr(\tilde{\mathcal{R}} = \mathbf{r} \mid \mathbf{D} = \mathbf{s}) \Pr(\mathbf{D} = \mathbf{s})}
= \frac{\lambda^{|\mathbf{d}|}\, r_{\mathbf{d}}\, Q_{\mathcal{R}}\, \pi(\mathbf{r} \mid \mathbf{d})}{\sum_{\mathbf{s} \subset \mathbf{r}} \lambda^{|\mathbf{s}|}\, r_{\mathbf{s}}\, Q_{\mathcal{R}}\, \pi(\mathbf{r} \mid \mathbf{s})}
= \frac{\lambda^{|\mathbf{d}|}\, r_{\mathbf{d}}\, \pi(\mathbf{r} \mid \mathbf{d})}{\sum_{\mathbf{s} \subset \mathbf{r}} \lambda^{|\mathbf{s}|}\, r_{\mathbf{s}}\, \pi(\mathbf{r} \mid \mathbf{s})}.
\]

Therefore, the likelihood for the case-control set is given by

\[
L(\lambda, \beta) = \frac{\lambda^{|\mathbf{D}|}\, r_{\mathbf{D}}(\beta)\, \pi(\tilde{\mathcal{R}} \mid \mathbf{D})}{\sum_{\mathbf{s} \subset \tilde{\mathcal{R}}} \lambda^{|\mathbf{s}|}\, r_{\mathbf{s}}(\beta)\, \pi(\tilde{\mathcal{R}} \mid \mathbf{s})}.
\tag{3.3}
\]

This likelihood accommodates quite general sampling through specification of the corresponding case-control set selection probabilities $\pi(\tilde{\mathcal{R}} \mid \mathbf{s})$. Because of a cancellation of common factors in the selection probabilities $\pi(\tilde{\mathcal{R}} \mid \mathbf{s})$, the likelihood used to fit the data can reduce to the form

\[
L(\lambda, \beta) = \frac{\lambda^{|\mathbf{D}|}\, r_{\mathbf{D}}(\beta)\, w_{\tilde{\mathcal{R}}}(\mathbf{D})}{\sum_{\mathbf{s} \subset \tilde{\mathcal{R}}} \lambda^{|\mathbf{s}|}\, r_{\mathbf{s}}(\beta)\, w_{\tilde{\mathcal{R}}}(\mathbf{s})},
\tag{3.4}
\]

where $w_{\tilde{\mathcal{R}}}(\mathbf{s})$ are called "risk weights." The total likelihood from case-control sets sampled from a cohort is simply the product of likelihood contributions of the form of equation (3.4). More details on the sampling methods are discussed in the following sections.

3.2 Random Sampling

Simple random sampling is an approach that can be easily accomplished and explained to others. Suppose we sample $m$ from the $n - |\mathbf{d}|$ controls. The selection (sampling) probability for a case-control set $\mathbf{r}$ of size $m + |\mathbf{d}|$ containing $\mathbf{d}$ is

\[
\pi(\mathbf{r} \mid \mathbf{d}) = \binom{n - |\mathbf{d}|}{m}^{-1}.
\]

Since the sampling probability for each possible control set is the same, it does not contribute to the likelihood, as the risk weights $w_{\mathbf{r}}(\mathbf{d})$ are identically equal to one.

3.2.1 Control of Confounding in Random Sampling

Simple random sampling lacks the control of confounding. Frequency matching has been proposed as a means to balance the distribution of potential confounders among cases and controls. In this commonly used design, controls are randomly selected in numbers proportional to the number of cases, usually within strata defined by age group or some other demographic variables, as shown in Figure 3.1. Suppose we sample $m|\mathbf{d}|$ from the $n - |\mathbf{d}|$ controls in a matching stratum in a 1:m frequency matching design (as shown in Figure 3.1), where we select $m$ controls per case. Then, the sampling probability for this matching stratum $\mathbf{r}$ of size $(m + 1) \times |\mathbf{d}|$ containing $\mathbf{d}$ is

\[
\pi(\mathbf{r} \mid \mathbf{d}) = \binom{n - |\mathbf{d}|}{m|\mathbf{d}|}^{-1},
\]

and the conditional likelihood is given by

\[
L(\beta)
= \frac{\lambda^{|\mathbf{D}|}\, r_{\mathbf{D}}(\beta) \binom{n - |\mathbf{D}|}{m|\mathbf{D}|}^{-1}}{\sum_{\mathbf{s} \subset \tilde{\mathcal{R}} : |\mathbf{s}| = |\mathbf{D}|} \lambda^{|\mathbf{s}|}\, r_{\mathbf{s}}(\beta) \binom{n - |\mathbf{s}|}{m|\mathbf{s}|}^{-1}}
= \frac{r_{\mathbf{D}}(\beta)}{\sum_{\mathbf{s} \subset \tilde{\mathcal{R}} : |\mathbf{s}| = |\mathbf{D}|} r_{\mathbf{s}}(\beta)}.
\tag{3.5}
\]

a. Cohort: Case-Control Status by Confounder C

                Matching Stratum
                C = 1           ...   C = L           Total
    Cases       |D_1|           ...   |D_L|           |D|
    Controls    n_1 - |D_1|     ...   n_L - |D_L|     n - |D|
    Total       n_1             ...   n_L             n

b. Frequency Matching on Confounder C

                Matching Stratum
                C = 1           ...   C = L           Total
    Cases       |D_1|           ...   |D_L|           |D|
    Controls    m|D_1|          ...   m|D_L|          m|D|
    Total       (m+1)|D_1|      ...   (m+1)|D_L|      (m+1)|D|

Figure 3.1: 1:m frequency matched sampling scheme.

For example, in a 1:3 frequency matched study ($m = 3$), if 5 of the cases are between 20 and 30 years of age, then 15 of the controls are sampled from this age stratum. The ratio ($m$) of sampled controls to cases in each stratum can be extended, of course, to be stratum-specific, i.e. $m_l$, $l = 1, \ldots, L$.
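The 1:m frequency-matched selection described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the dissertation: the function name `frequency_matched_controls` and its arguments are hypothetical, and the cohort is assumed to be given as two parallel lists (a stratum label and a 0/1 case indicator per subject).

```python
import random
from collections import defaultdict


def frequency_matched_controls(stratum, is_case, m, seed=None):
    """Sample m controls per case within each matching stratum.

    stratum : list of matching-stratum labels (e.g. age group), one per subject
    is_case : list of 0/1 disease indicators, one per subject
    m       : number of controls sampled per case (the 1:m ratio)
    Returns the sorted indices of the sampled controls.
    """
    rng = random.Random(seed)
    n_cases = defaultdict(int)      # |D_l| for each stratum l
    controls = defaultdict(list)    # indices of the n_l - |D_l| controls
    for i, (c, d) in enumerate(zip(stratum, is_case)):
        if d:
            n_cases[c] += 1
        else:
            controls[c].append(i)

    sampled = []
    for c, cases_in_c in n_cases.items():
        need = m * cases_in_c       # m|D_l| controls for this stratum
        if need > len(controls[c]):
            raise ValueError(
                f"stratum {c!r}: need {need} controls, only {len(controls[c])} available"
            )
        # simple random sample without replacement within the stratum
        sampled.extend(rng.sample(controls[c], need))
    return sorted(sampled)


# Hypothetical cohort: 5 cases and 40 controls aged 20-30, 2 cases and 10 controls aged 30-40.
stratum = ["20-30"] * 45 + ["30-40"] * 12
is_case = [1] * 5 + [0] * 40 + [1] * 2 + [0] * 10
picked = frequency_matched_controls(stratum, is_case, m=3, seed=1)
print(sum(1 for i in picked if stratum[i] == "20-30"))  # 15 controls for the 5 cases
```

A stratum-specific ratio $m_l$ could be accommodated by passing a dictionary of ratios instead of a single `m`; that extension is left out to keep the sketch short.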
Since the sampling probabilities are uniform within a set, they do not contribute to the likelihood, as the risk weights are identically equal to one. Equation (3.5) is the usual conditional likelihood for "simple" unmatched case-control data.

3.3 Counter-Matching

For case and control sampling in nested case-control studies, investigators often use random sampling or frequency matching as an unbiased sampling strategy. When the exposure is rare, random samples would have to be large for any meaningful inference on the relation between the outcome and exposure. Therefore, simple random sampling and frequency matching may not be the most statistically efficient methods of sampling. In order to increase efficiency in observational studies, Langholz and Clayton [40] proposed a novel design, counter-matching. Using the information collected in the cohort may increase the efficiency of the sampling method, as illustrated below.

Consider a 1:1 matched nested case-control study. We know that concordant pairs in matched case-control studies contribute no information to the odds ratio estimation [62]. Therefore, there is an opportunity to improve on the traditional random sampling method (Figure 3.2.a) through a strategy that counter-matches each case and control on exposure and thereby minimizes the number of concordant pairs. For example, as shown in Figure 3.2.b, in a 1:1 counter-matched nested case-control study, we select one control from the risk set that has the "opposite" exposure status to the case, based on the surrogate (crude) exposure data available for the entire cohort. Hence, the sampling design is named "counter"-matching.

We can generalize 1:1 counter-matching to multiple cases and controls per case-control set. Let us assume that the total number of cases in the cohort is $|\mathbf{d}|$ and counter-matching is based on a binary exposure $X$.
One first decides the marginal totals in the exposed ($X = 1$) and the unexposed ($X = 0$) strata as $m_1|\mathbf{d}|$ and $m_0|\mathbf{d}|$, respectively. This implies that the total number of controls sampled from the cohort is $(m - 1)|\mathbf{d}|$, where $m = m_1 + m_0$. When $m = 2$ and $m_1 = m_0 = 1$, the cell counts for cases and controls in counter-matching (Figure 3.3.b) are matched with the "opposite" counter-matching strata (i.e. $|\mathbf{d}_1|$ cases but $|\mathbf{d}_0|$ controls in the exposed stratum; $|\mathbf{d}_0|$ cases but $|\mathbf{d}_1|$ controls in the unexposed stratum). Therefore, we have $|\mathbf{d}_1|$ unexposed controls counter-matched to $|\mathbf{d}_1|$ exposed cases.

Counter-matching can be implemented in an even more general way, as shown in Figure 3.4. Assume that the counter-matched variable $X \in \{1, \ldots, L\}$ is known for all subjects in the study base (or cohort). The marginal total in the sampling stratum $l$ is fixed to a value proportional to the number of cases, $m_l|\mathbf{d}|$ (see Figure 3.4).

a. Random sampling

                X = 1   X = 0   Total               X = 1   X = 0   Total
    Case          1       0       1       Case        0       1       1
    Control       ?       ?       1       Control     ?       ?       1
    Total                         2       Total                       2

b. Counter-matching

                X = 1   X = 0   Total               X = 1   X = 0   Total
    Case          1       0       1       Case        0       1       1
    Control       0       1       1       Control     1       0       1
    Total         1       1       2       Total       1       1       2

Figure 3.2: A 1:1 matched nested case-control study using random sampling and counter-matched sampling for the control sampling by exposure X.

a. Cohort: Case-Control Status by Exposure X

                X = 1           X = 0           Total
    Case        |d_1|           |d_0|           |d|
    Control     n_1 - |d_1|     n_0 - |d_0|     n - |d|
    Total       n_1             n_0             n

b. Counter-Matching on Exposure X

                X = 1           X = 0           Total
    Case        |d_1|           |d_0|           |d|
    Control     |d_0|           |d_1|           |d|
    Total       |d|             |d|             2|d|

Figure 3.3: Counter-matched sampling on a binary exposure X for |d| controls from a cohort with |d| cases.
Then, $m_l|\mathbf{d}| - |\mathbf{d}_l|$ controls are randomly sampled without replacement from the $n_l - |\mathbf{d}_l|$ controls in stratum $l$. The control selection using counter-matching can be characterized by

\[
\pi(\mathbf{r} \mid \mathbf{d}) = \prod_{l=1}^{L} \binom{n_l - |\mathbf{d}_l|}{m_l|\mathbf{d}| - |\mathbf{d}_l|}^{-1}
\]

for $\mathbf{r}$ with $|\mathbf{r}_l| = m_l|\mathbf{d}|$, $l = 1, \ldots, L$, with likelihood, obtained by substituting these selection probabilities into equation (3.3),

\[
L(\beta) = \frac{r_{\mathbf{D}}(\beta) \prod_{l=1}^{L} \binom{n_l - |\mathbf{D}_l|}{m_l|\mathbf{D}| - |\mathbf{D}_l|}^{-1}}{\sum_{\mathbf{s} \subset \tilde{\mathcal{R}} : |\mathbf{s}| = |\mathbf{D}|} r_{\mathbf{s}}(\beta) \prod_{l=1}^{L} \binom{n_l - |\mathbf{s}_l|}{m_l|\mathbf{D}| - |\mathbf{s}_l|}^{-1}}.
\tag{3.6}
\]

3.3.1 Control of Confounding in Counter-Matching

In a random sampling design, matching can be used to control confounding. The control of confounding in a counter-matched design can be undertaken in the same manner: controls in a counter-matched design are matched to cases on the confounders as well as counter-matched to cases on exposure. In other words, one can apply the counter-matched sampling scheme shown in Figure 3.4 to each of the strata defined by the confounders.

a. Cohort: Case-Control Status by Exposure X

                Counter-matched Stratum
                X = 1             ...   X = L             Total
    Cases       |D_1|             ...   |D_L|             |D|
    Controls    n_1 - |D_1|       ...   n_L - |D_L|       n - |D|
    Total       n_1               ...   n_L               n

b. Counter-Matching on Exposure X

                Counter-matched Stratum
                X = 1             ...   X = L             Total
    Cases       |D_1|             ...   |D_L|             |D|
    Controls    m_1|D| - |D_1|    ...   m_L|D| - |D_L|    sum_l (m_l|D| - |D_l|)
    Total       m_1|D|            ...   m_L|D|            sum_l m_l|D|

Figure 3.4: Counter-matched sampling scheme on exposure X with stratum-specific sampling margins $m_l|\mathbf{D}|$, where $l = 1$ to $L$.

Summary

Counter-matching is a newer sampling option utilizing a nested case-control design and has the potential of improving efficiency by maximizing the number of case-control pairs discordant on exposure. The control of confounding can be incorporated in a counter-matched design as well.
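As a rough illustration of the group counter-matched scheme of Figure 3.4, the following Python sketch draws $m_l|\mathbf{D}| - |\mathbf{D}_l|$ controls from each stratum of $X$ and evaluates the corresponding selection probability as a product of inverse binomial coefficients. The function name `counter_matched_controls`, its arguments and the toy cohort are assumptions made for illustration only; they do not come from the dissertation. `math.comb` requires Python 3.8 or later.

```python
import random
from math import comb


def counter_matched_controls(x, is_case, margins, seed=None):
    """Counter-matched control sampling with fixed stratum margins.

    x       : list of sampling-stratum labels (surrogate exposure X), one per subject
    is_case : list of 0/1 disease indicators, one per subject
    margins : dict mapping each stratum label l to m_l, so that stratum l
              ends up with m_l * |D| sampled subjects (cases plus controls)
    Returns (sorted indices of sampled controls, selection probability).
    """
    rng = random.Random(seed)
    total_cases = sum(is_case)                       # |D|
    sampled, prob = [], 1.0
    for label, m_l in margins.items():
        in_stratum = [i for i, xi in enumerate(x) if xi == label]
        cases_l = [i for i in in_stratum if is_case[i]]
        controls_l = [i for i in in_stratum if not is_case[i]]
        need = m_l * total_cases - len(cases_l)      # m_l|D| - |D_l|
        if need < 0 or need > len(controls_l):
            raise ValueError(f"stratum {label!r}: cannot sample {need} controls")
        sampled.extend(rng.sample(controls_l, need))
        prob /= comb(len(controls_l), need)          # one factor of the product
    return sorted(sampled), prob


# Hypothetical cohort: exposed stratum has 3 cases and 4 controls,
# unexposed stratum has 1 case and 12 controls, so |D| = 4.
x = [1] * 7 + [0] * 13
is_case = [1, 1, 1, 0, 0, 0, 0] + [1] + [0] * 12
picked, prob = counter_matched_controls(x, is_case, {1: 1, 0: 1}, seed=0)
print(len(picked))  # 4 controls: 1 exposed and 3 unexposed
```

With $m_1 = m_0 = 1$ each stratum ends up with $|\mathbf{D}| = 4$ sampled subjects, so the 3 exposed cases are balanced by 3 unexposed controls and the single unexposed case by 1 exposed control, mirroring Figure 3.3.b.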
Chapter 4

Comparison of Efficiency of Different Sampling Schemes for Nested Case-Control Studies

Counter-matching was first introduced by Langholz and Clayton in 1994 [40]. The goal of counter-matching is to maximize the number of discordant case-control pairs. Counter-matching is an attractive option for nested case-control studies because it is theoretically more efficient [39, 40]. Some researchers have shown, by means of simulation studies, the superior efficiency of counter-matching compared to other sampling methods in detecting the main effect of exposure and the interaction with other factors [6, 41, 71]. In Section 4.2 we show that 1:1 counter-matching can be at most twice as efficient as the 1:1 simple random sampling method.

4.1 Review: Efficiency Comparisons for Sampling Schemes

The superior efficiency of counter-matching compared to other sampling methods has been shown in terms of the detection of the main effect of exposure [41, 71] and the detection of gene-environment interactions [6].

4.1.1 Efficiency for Main Effect Detection

Langholz and Goldstein [41] have compared the efficiencies of frequency matching, counter-matching, and other sampling methods. Using simulation studies with exposure odds ratios ranging from one to four, Langholz and Goldstein found that both frequency matching and counter-matching showed no evidence of bias using conditional logistic likelihood analyses. Furthermore, Langholz and Goldstein found that counter-matching is more efficient than frequency matching when the covariates are well correlated with the counter-matched exposure variable (with sensitivity and specificity both set to 0.9). For example, the simulation results showed that the empirical efficiency of the 1:1 counter-matching design relative to the 1:1 frequency matching design was 1.5 under the null.
Steenland and Deddens [71] also showed in their simulation studies that a 1:3 counter-matching design is approximately equivalent to random sampling using 10 controls per case.

4.1.2 Efficiency for Interaction Detection

Andrieu et al. [6] investigated the efficiency of counter-matching in studies of gene-environment interactions compared to random sampling. Andrieu et al. evaluated the efficiency and feasibility of counter-matching designs for the estimation of different multiplicative gene-environment interaction effects as well as both genetic and environmental main effects. They considered three different counter-matched designs in the simulation studies, in which three controls were selected for each case by: 1) counter-matching with the case on a surrogate of the environmental factor (yes/no) such that there are two exposed and two unexposed subjects in the case-control set; 2) counter-matching with the case on a surrogate of the genetic factor (e.g. family history) such that there are two subjects with and two without a family history of the disease in the case-control set; and 3) counter-matching on surrogates of both the environmental and genetic factors such that there is one subject in each of the four combinations. The efficiencies of the counter-matching designs were compared to the full cohort and the classical 1:3 nested case-control design (random sampling and no matching). Andrieu et al. found that the most important factors that determine the efficiency of a counter-matching design are the sensitivity and specificity of the surrogates and the frequencies of the risk factors of interest. Under different main effect and interaction scenarios, counter-matching is more appropriate than random sampling for the investigation of gene-environment interactions, especially those involving rare factors.
Counter-matching on both surrogates of the gene and environment factors of interest appears to be the most efficient design for detecting gene-environment interactions. As the environmental and genetic factors become rarer, counter-matching on both factors becomes more efficient.

Bernstein et al. [9] applied counter-matching to a multi-center population-based case-control study to investigate gene-radiation interactions in second primary breast cancer. Bernstein et al. performed simulation studies to make power calculations for detecting a gene by radiation dose interaction under various study designs. They considered three different study designs in the simulation studies: 1) a standard 1:2 nested case-control study with two controls per case; 2) a counter-matched design having the two controls counter-matched with the case on a binary exposure status (treated and untreated) such that there is one subject from the "untreated" and two from the "treated" stratum; and 3) a counter-matched design having the two controls counter-matched with the case on a binary exposure status (treated and untreated) such that there is one subject from the "treated" and two from the "untreated" stratum. In the simulation studies, the proportion of gene carriers varied from 0.5% to 10%. Bernstein et al. found that the power advantage of counter-matching is greatest with rare genetic mutations, which agrees with the finding of Andrieu et al. [6].

Lack of power is a well known problem in case-control studies of gene-environment interactions. The work of Andrieu et al. [6] and Bernstein et al. [9] shows that counter-matching improves the power of interaction detection in nested case-control studies.
In this dissertation, we focus on the investigation of the main effect of the exposure of interest. The following discussion and investigations on the design and analytic issues for counter-matching are for the main effect of exposure only.

4.2 Efficiency - 1:1 Counter-Matching versus 1:1 Simple Random Sampling

In this section, we illustrate how the relation (sensitivity and specificity) between the covariates and the counter-matched exposure variable affects the efficiency of the counter-matched design by comparing 1:1 counter-matching to 1:1 random sampling. In terms of the simplest setup, 1:1 matched case-control studies, we demonstrate why counter-matching can be an option to increase efficiency and power. We assume the probability of disease for individual $i$ given a covariate $Z_i$ follows the logistic model:

\[
\Pr(D = 1 \mid Z_i) = \frac{e^{\alpha + Z_i \beta}}{1 + e^{\alpha + Z_i \beta}},
\quad \text{and} \quad
\Pr(D = 0 \mid Z_i) = \frac{1}{1 + e^{\alpha + Z_i \beta}}.
\tag{4.1}
\]

Then, applying equation (3.4) with $\lambda = \exp(\alpha)$ and $r_i = \exp(Z_i \beta)$ to the 1:1 nested case-control study, the likelihood for a case-control set $\tilde{\mathcal{R}} = \{i, j\}$ with case $i$ and control $j$ is given by

\[
L_{sample}(\alpha, \beta) = \frac{e^{\alpha + Z_i \beta}\, w_i}{e^{\alpha + Z_i \beta}\, w_i + e^{\alpha + Z_j \beta}\, w_j} = \frac{1}{1 + k\, e^{(Z_j - Z_i)\beta}},
\]

where $w_i$ is the "risk weight" corresponding to the sampling scheme used for the control sampling and $k = w_j / w_i$ is the ratio of the two weights. For random sampling the risk weights are equal to 1, while for counter-matching the risk weights depend on the case's exposure ($X$) status.

For design purposes, it is important to know when counter-matching is more efficient than random sampling in a nested case-control study. Comparing the efficiency of counter-matching with random sampling can be used to assess the gain in power for counter-matching. The greater the expected information, the smaller the variance of $\hat{\beta}$ and the greater the efficiency for a given sampling plan. The expected information is calculated under the null hypothesis, $H_0: \beta = 0$, as follows.

1) 1:1 counter-matching (CM)

\[
I_{CM} = -E\left[\frac{\partial^2 \log L_{CM}}{\partial \beta^2}\right]_{\beta = 0}
= E_{CM}\left[(Z_j - Z_i)^2\, \frac{k}{(1 + k)^2}\right]
\le \frac{1}{4}\, E_{CM}(Z_j - Z_i)^2,
\tag{4.2}
\]

because the quantity $\frac{k}{(1+k)^2}$ has the maximum 1/4 at $k = 1$. One interesting property of the quantity $\frac{k}{(1+k)^2}$ is that it is invariant to the exposure status of the case in the case-control set, i.e. $\frac{k}{(1+k)^2} = \frac{1/k}{(1 + 1/k)^2}$.

2) 1:1 random sampling (RS)

\[
I_{RS} = -E\left[\frac{\partial^2 \log L_{RS}}{\partial \beta^2}\right]_{\beta = 0}
= \frac{1}{4}\, E_{RS}(Z_j - Z_i)^2.
\tag{4.3}
\]

To be more efficient than random sampling, counter-matching needs to have greater expected information. It appears that $I_{CM}$, equation (4.2), cannot be greater than $I_{RS}$, equation (4.3), because the quantity $\frac{k}{(1+k)^2}$ has the maximum 1/4 at $k = 1$; however, the advantage of counter-matching is the gain in the expectation term by minimizing the number of concordant pairs through sampling.

For counter-matching, if the correlation between the counter-matching variable $X$ and the newly collected covariate $Z$ is $\pm 1$, then $E_{CM}(Z_j - Z_i)^2 = 1$. Let $\Pr(Z = 1) = \pi$ and $\Pr(Z = 0) = 1 - \pi$ among controls. Then, $k = \pi / (1 - \pi)$ if $X = 0$ or $k = (1 - \pi) / \pi$ if $X = 1$, and in either case $\frac{k}{(1+k)^2} = \pi(1 - \pi)$. After some calculations, we obtain

\[
I_{CM} = \pi(1 - \pi), \quad \text{and} \quad
I_{RS} = \frac{1}{4}\, E_{RS}(Z_j - Z_i)^2 = \frac{1}{4}\left[2\pi(1 - \pi)\right] = \frac{1}{2}\pi(1 - \pi).
\]

Therefore, 1:1 counter-matching is twice as efficient as 1:1 random sampling under the null hypothesis, $H_0: \beta = 0$, and the most favorable condition, in which the correlation between $X$ and $Z$ is $\pm 1$. On the other hand, if the correlation between $X$ and $Z$ is 0, i.e. $E_{CM}(Z_j - Z_i)^2 = E_{RS}(Z_j - Z_i)^2$, then counter-matching cannot be more efficient than random sampling.
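The null-hypothesis calculation above can be checked numerically. The short Python sketch below evaluates the factor $k/(1+k)^2$ and the two expected informations for the perfectly correlated case ($I_{CM} = \pi(1-\pi)$ and $I_{RS} = \pi(1-\pi)/2$); the function names are hypothetical and the code simply restates the formulas in this section rather than adding any new result.

```python
def info_weight(k):
    """k / (1 + k)^2, the factor multiplying (Z_j - Z_i)^2 in the null information."""
    return k / (1.0 + k) ** 2


def expected_info_cm(pi):
    """1:1 counter-matching when X and Z are perfectly correlated.

    Every pair is discordant, so E_CM (Z_j - Z_i)^2 = 1, and the weight ratio
    is k = pi / (1 - pi) (or its reciprocal, which gives the same factor).
    """
    k = pi / (1.0 - pi)
    return 1.0 * info_weight(k)


def expected_info_rs(pi):
    """1:1 random sampling: k = 1 and E_RS (Z_j - Z_i)^2 = 2 * pi * (1 - pi)."""
    return 2.0 * pi * (1.0 - pi) * info_weight(1.0)


for pi in (0.05, 0.2, 0.5):
    ratio = expected_info_cm(pi) / expected_info_rs(pi)
    print(pi, round(ratio, 6))  # the ratio is 2 for every prevalence pi
```

The invariance property noted above corresponds to `info_weight(k) == info_weight(1 / k)` up to floating-point rounding, which is why the exposure status of the case does not matter in `expected_info_cm`.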
Using sensitivity ($\delta$) and specificity ($\eta$) to express the relationship of $X$ and $Z$, Langholz and Goldstein [41] show that the efficiency of 1:1 counter-matching relative to the full study base is $[\delta\eta + (1 - \delta)(1 - \eta)]$. Because the efficiency of 1:1 random sampling relative to the full study base is 1/2, the efficiency of 1:1 counter-matching relative to 1:1 random sampling is $2[\delta\eta + (1 - \delta)(1 - \eta)]$. As shown in Figure 4.1, 1:1 counter-matching is more efficient than 1:1 random sampling when either 1) both sensitivity and specificity are greater than 0.5 (i.e. the first quadrant in Figure 4.1.b); or 2) both sensitivity and specificity are less than 0.5 (i.e. the third quadrant in Figure 4.1.b). However, 1:1 counter-matching is less efficient than 1:1 random sampling when exactly one of sensitivity and specificity is less than 0.5 and the other is greater than 0.5. An intuitive explanation for the equal efficiency when both sensitivity and specificity are 0.5 is that counter-matching becomes a random sampling process, such as frequency matching, when both sensitivity and specificity are 50%. The impact of sensitivity and specificity on the efficiency of counter-matching is independent of the exposure prevalence.

Higher sensitivity and specificity correspond to a stronger positive correlation, while lower sensitivity and specificity imply a stronger negative correlation. Therefore, the superior efficiency of counter-matching depends on the strength of the X-Z relationship, expressed by either the correlation or sensitivity plus specificity.

Summary

Counter-matching has the potential to improve the efficiency of the investigations of both main effects and interactions. In a 1:1 case-control study, counter-matched sampling can be at most twice as efficient as the simple random sampling method under the null hypothesis.
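The relative efficiency expression $2[\delta\eta + (1-\delta)(1-\eta)]$ quoted in this section is easy to tabulate. The Python sketch below is only an illustration of that formula under the null hypothesis; the function name `relative_efficiency_cm_vs_rs` is hypothetical.

```python
def relative_efficiency_cm_vs_rs(sensitivity, specificity):
    """Efficiency of 1:1 counter-matching relative to 1:1 random sampling
    under the null: 2 * [delta * eta + (1 - delta) * (1 - eta)]."""
    delta, eta = sensitivity, specificity
    return 2.0 * (delta * eta + (1.0 - delta) * (1.0 - eta))


# Tabulate a few sensitivity/specificity combinations.
for delta, eta in [(1.0, 1.0), (0.9, 0.9), (0.5, 0.5), (0.9, 0.4), (0.2, 0.3)]:
    print(delta, eta, round(relative_efficiency_cm_vs_rs(delta, eta), 2))
```

Since $2[\delta\eta + (1-\delta)(1-\eta)] - 1 = (2\delta - 1)(2\eta - 1)$, the value exceeds 1 exactly when sensitivity and specificity lie on the same side of 0.5, which matches the first and third quadrants of Figure 4.1.b.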
Figure 4.1: Relative efficiencies of 1:1 counter-matching compared to 1:1 random sampling with different sensitivity and specificity of the counter-matching variable and the exposure of interest: a) the 3D plot with the sensitivity and specificity as the X-axis and Y-axis and relative efficiency as the Z-axis; b) the contour plot with the relative efficiency as the contour lines.

Chapter 5

Design Issues for Counter-Matched Nested Case-Control Studies

Counter-matching has great potential to increase efficiency in nested case-control studies. However, the efficiency gained in a counter-matched design is dependent upon several factors that must be addressed before employing a counter-matched design.

5.1 Factors Affecting the Efficiency of Counter-Matched Designs

There are several factors that affect the efficiency of a counter-matched design to detect the main effect of exposure Z. These factors include the:

1) prevalence of exposure Z (exposure prevalence),
2) risk of exposure Z for disease (exposure-disease association),
3) total number of individuals sampled from the study base (sample size),
4) relationship between the counter-matched variable X and exposure Z (X-Z relationship), and
5) sample allocation of the counter-matched variable X (sample allocation).

5.1.1 Exposure Prevalence

One influential factor on the efficiency of counter-matching is the prevalence of exposure. It has been shown that counter-matching appears to be more appropriate than most traditional sampling methods for rare exposures [6].
For a rare exposure, most case-control pairs under random sampling are unexposed and do not contribute information to the analysis.

5.1.2 Exposure-Disease Association

The magnitude of risk from exposure Z or the dose-response relationship between exposure Z and disease also plays a role in the performance of a sampling scheme.

5.1.3 Sample Size

It is well known that greater efficiency is gained if more individuals are sampled from the study base. Although sampling a large number of subjects is desirable, the number of individuals that can be sampled in a nested case-control study is limited by the available resources. The problem is therefore how to make the best use of the resources and obtain an adequate sample size.

5.1.4 X-Z Relationship

In Section 4.2, we show that the stronger the X-Z relationship (either positive or negative), the greater the efficiency of a counter-matched design. The impact of the X-Z relationship on the efficiency of group counter-matching has not been clarified. It would be of interest to find the threshold of sensitivity and specificity of X for Z, so that we would know in practice when counter-matching is a good option. We would like to understand the impact of the X-Z relationship on the efficiency of a group counter-matched design in the context of the other factors described previously.

5.1.5 Sample Allocation

One important characteristic of counter-matching is the specification of a predetermined number of individuals in different categories of the counter-matched variable X (called 'sample allocation'). For instance, for a binary X, one needs to determine the numbers of exposed and unexposed controls to be sampled from the study base. Unlike many other design factors, the sample allocation can be different in different design situations.
However, it is not clear how the sample allocation influences the efficiency of counter-matching and how it compares to other factors.

In order to understand how the factors (sample size, X-Z relationship, sample allocation, exposure prevalence, exposure-disease association) affect the efficiency of a counter-matched design, we conducted simulation studies and compared the results with the full study base and random sampling. For simplicity, only a binary disease outcome and a binary counter-matched variable X are considered in the simulation studies. It should be noted, however, that counter-matching can be extended to exposures with multiple categories. We evaluate the efficiency of counter-matching for a categorical exposure (Simulation Study I), and then use a binary exposure Z to assess the impact of each of the five factors on efficiency (Simulation Study II). To date there are no published reports of simulation work beyond binary exposures, and it is important to understand how the factors influence efficiency of counter-matched designs for practical study designs.

The simulation results presented here are based on 1,000 trials and are presented in terms of relative efficiencies. The relative efficiency for a parameter $\beta$ in a sampling design (either random sampling or counter-matching) is defined as the empirical variance of the parameter estimate, $\mathrm{Var}(\hat{\beta})$, for the study base divided by that for the sampling design. The sample size for a trial (study base) is 3,000 (300 in each of 10 strata). Case-control and exposure Z status were generated based on the assigned scenario, including the overall disease probability, prevalence of exposure Z, and exposure-disease association. Then, the counter-matching variable X was assigned on the basis of the given X-Z relationship in terms of sensitivity and specificity. Controls for the different designs were selected from the same simulated study base of size 3,000 according to the setups of the designs, including the sample size for both the random sampling and counter-matched designs and the different sample allocations for the counter-matched designs.

5.2 Simulation Study I: Categorical Z

It has been demonstrated that counter-matching can be an efficient sampling scheme for nested case-control studies using binary exposures, X (the sampling variable, which is available for the whole cohort) and Z (new information collected in the nested case-control study) [39, 40, 71]. In practice, it is likely that more detailed information is desired in the nested case-control study. Recently, Bernstein et al. [9] performed a simulation study of the power to detect a radiation dose-response (i.e. a continuous Z variable) in a counter-matched design in which cases and controls are counter-matched on a yes/no exposure (i.e. binary X). Bernstein et al. found that counter-matching provides much higher power than using randomly sampled controls under the scenarios considered. Here, let us assume X was collected as yes/no for an exposure in the cohort study. Now, we want to assess the dose-response relationship in a nested case-control study and want to collect the dose information Z: low, medium, or high. In this simulation study, we evaluate how counter-matching performs when X is binary but Z is categorical.

We consider a categorical Z with three levels (Z = 0, 1, or 2). Because our goal is to investigate the efficiency of counter-matching for a categorical Z, we simply assume the counter-matched sampling stratum variable X has 90% sensitivity and 90% specificity for the exposure variable Z.
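Two steps of the simulation setup described above, assigning X from Z with a given sensitivity and specificity and computing a relative efficiency from empirical variances, can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's code: the function names `assign_x` and `relative_efficiency`, the seed, and the rule that any Z > 0 counts as exposed are all hypothetical choices made for the example.

```python
import random
import statistics


def assign_x(z, sensitivity, specificity, rng):
    """Draw a binary surrogate X for each exposure value in z.

    Assumption: any Z > 0 counts as exposed, so
    pr(X = 1 | Z > 0) = sensitivity and pr(X = 0 | Z = 0) = specificity.
    """
    x = []
    for zi in z:
        if zi > 0:
            x.append(1 if rng.random() < sensitivity else 0)
        else:
            x.append(0 if rng.random() < specificity else 1)
    return x


def relative_efficiency(full_estimates, design_estimates):
    """Empirical variance of the estimates from the full study base,
    divided by the empirical variance from the sampling design."""
    return statistics.variance(full_estimates) / statistics.variance(design_estimates)


rng = random.Random(2004)  # arbitrary seed, for reproducibility only
# One study base of 3,000 individuals with a 20% prevalence of binary Z.
z = [1 if rng.random() < 0.20 else 0 for _ in range(3000)]
x = assign_x(z, sensitivity=0.90, specificity=0.90, rng=rng)

# Empirical sensitivity and specificity of X for Z in this study base.
x_among_exposed = [xi for xi, zi in zip(x, z) if zi > 0]
x_among_unexposed = [xi for xi, zi in zip(x, z) if zi == 0]
emp_sensitivity = sum(x_among_exposed) / len(x_among_exposed)
emp_specificity = 1 - sum(x_among_unexposed) / len(x_among_unexposed)
print(emp_sensitivity, emp_specificity)
```

In a full simulation, `relative_efficiency` would be applied to the 1,000 per-trial estimates of $\beta$ from the study base and from each sampling design; fitting those models is outside the scope of this sketch.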
Each study base consists of 3,000 individuals equally distributed in 10 strata (i.e. 300 in each stratum). The overall probability of the disease is 10%. The sample size is fixed at three times the number of cases ($3|D|$), and the efficiency of counter-matching is compared with 1 (case):2 (controls) random sampling (for the same sample size). For counter-matching, five different scenarios of sample allocation are tested. Since the sample size is $3|D|$, the margins for stratum X = 0 (denoted by $m_0|D|$) and for stratum X = 1 (denoted by $m_1|D|$) are varied over $m_0/m_1$ = 2/1, 1.8/1.2, 1.5/1.5, 1.2/1.8, and 1/2. In our setup with 3,000 individuals in the study base and a 10% disease probability, the sample size is approximately 900, three times the number of cases. For counter-matching, the number of unexposed (X = 0) individuals including cases is then 600, 540, 450, 360, and 300 for $m_0/m_1$ = 2/1, 1.8/1.2, 1.5/1.5, 1.2/1.8, and 1/2, respectively.

In the next section, we consider three different types of exposure effects of Z, with values 0, 1, and 2, on the binary outcome D: 1) Z is treated as binary (e.g. unexposed and exposed); 2) Z has a linear trend effect on the outcome; 3) each level of Z has its own risk.

5.2.1 Methods for Simulation Study I

Binary Z. First, the simplest scenario is a binary outcome and a binary exposure Z. In the generation of the cohort, we have
1) sensitivity and specificity set to 0.9, i.e. pr(X = 1 | Z = 1) = 0.9 and pr(X = 0 | Z = 0) = 0.9;
2) three risk association scenarios: OR = 1, 2, and 4;
3) two exposure prevalence scenarios: pr(Z = 1) = 0.20 and pr(Z = 1) = 0.50.

Categorical Z: a linear trend assumed. When a linear trend is assumed, there is only one parameter for Z in the model, which is the logarithm of the odds ratio (OR) per unit of Z. Under this scenario, the cohort is generated with
1) sensitivity and specificity both set to 0.9, i.e. pr(X = 1 | Z > 0) = 0.9 and pr(X = 0 | Z = 0) = 0.9;
2) three risk association scenarios: OR = 1, 2, or 4 per unit of Z;
3) two exposure prevalence scenarios: pr(Z = 1) = pr(Z = 2) = 0.10 and pr(Z = 1) = pr(Z = 2) = 0.25.

Categorical Z: each level of Z has its own risk. Another relationship between the categorical exposure Z and the outcome D that we consider is to have different odds ratios for each level of Z. We need to set up two odds ratios, $OR_1$ for Z = 1 compared to Z = 0, and $OR_2$ for Z = 2 compared to Z = 0. In the generation of the cohort, the parameters are
1) sensitivity and specificity both set to 0.9, i.e. pr(X = 1 | Z > 0) = 0.9 and pr(X = 0 | Z = 0) = 0.9;
2) two sets of risk association: (a) $OR_1$ = 1.0 and $OR_2$ = 4.0; (b) $OR_1$ = 2.0 and $OR_2$ = 3.0;
3) two exposure prevalence scenarios: pr(Z = 1) = pr(Z = 2) = 0.10 and pr(Z = 1) = pr(Z = 2) = 0.25.

5.2.2 Results of Simulation Study I

In all scenarios, the sample size is the same for counter-matching and random sampling, $3|D|$, where $|D|$ is the number of cases. The counter-matched sampling stratum variable X has 90% sensitivity and 90% specificity for the exposure variable Z. The overall probability of disease is 10%. The results are based on 1,000 trials. The efficiencies of counter-matching and random sampling (two controls per case) are relative to the full cohort (study base). We also display the point estimate $\hat{\beta}$ and the standard error (SE) of the logarithm of the odds ratio (OR) in the result tables.

Binary Z. For binary X and Z under the setup of OR = 1, 2, or 4, the simulation results comparing the cohort, random sampling (two controls per case), and counter-matching (Table 5.1, Table 5.2, and Table 5.3) show no evidence of biased estimates of the odds ratio parameters.
There is little difference in the efficiencies of counter-matching with different sample allocations when OR = 1 for either rare (pr(Z = 1) = 0.2) or common (pr(Z = 1) = 0.5) exposures. When the odds ratio is away from the null, sampling more from the exposed stratum can gain efficiency, especially for the rare exposure. For example, when OR = 4 and the exposure probability is 0.2, the efficiency compared to the cohort is 0.81 if 66% of the sample is unexposed (i.e. $m_0/m_1$ = 2/1), while the relative efficiency is 0.91 if 33% of the sample is unexposed (i.e. $m_0/m_1$ = 1/2). Under the assumption of 90% sensitivity and 90% specificity, counter-matching is superior to random sampling in efficiency in all scenarios of the exposure prevalence and the exposure-disease association.

Table 5.1: Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a binary exposure Z, when the odds ratio (OR) is 1 ($\beta$ = 0.00). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

| Sampling Method | $m_0/m_1$† | $\hat{\beta}$ (SE) | Relative Efficiency |
|---|---|---|---|
| pr(Z = 1) = 0.2 | | | |
| Full Cohort | ----- | -0.01 (0.16) | 1.00 |
| Random Sampling | ----- | -0.01 (0.19) | .75 |
| Counter-Matching | 2/1 | -0.01 (0.17) | .91 |
| | 1.8/1.2 | -0.01 (0.17) | .92 |
| | 1.5/1.5 | -0.01 (0.17) | .92 |
| | 1.2/1.8 | -0.01 (0.17) | .92 |
| | 1/2 | -0.01 (0.17) | .90 |
| pr(Z = 1) = 0.5 | | | |
| Full Cohort | ----- | 0.00 (0.12) | 1.00 |
| Random Sampling | ----- | -0.00 (0.14) | .71 |
| Counter-Matching | 2/1 | 0.00 (0.13) | .87 |
| | 1.8/1.2 | 0.00 (0.13) | .88 |
| | 1.5/1.5 | 0.00 (0.13) | .87 |
| | 1.2/1.8 | 0.00 (0.13) | .87 |
| | 1/2 | 0.00 (0.13) | .86 |

† Sampling stratum margins: unexposed ($m_0|D|$)/exposed ($m_1|D|$)

Table 5.2: Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a binary exposure Z, when the odds ratio (OR) is 2 ($\beta$ = 0.69). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

| Sampling Method | $m_0/m_1$† | $\hat{\beta}$ (SE) | Relative Efficiency |
|---|---|---|---|
| pr(Z = 1) = 0.2 | | | |
| Full Cohort | ----- | 0.69 (0.14) | 1.00 |
| Random Sampling | ----- | 0.69 (0.17) | .67 |
| Counter-Matching | 2/1 | 0.69 (0.15) | .85 |
| | 1.8/1.2 | 0.69 (0.14) | .88 |
| | 1.5/1.5 | 0.69 (0.14) | .90 |
| | 1.2/1.8 | 0.69 (0.14) | .90 |
| | 1/2 | 0.69 (0.14) | .91 |
| pr(Z = 1) = 0.5 | | | |
| Full Cohort | ----- | 0.70 (0.13) | 1.00 |
| Random Sampling | ----- | 0.69 (0.15) | .74 |
| Counter-Matching | 2/1 | 0.70 (0.14) | .88 |
| | 1.8/1.2 | 0.70 (0.13) | .89 |
| | 1.5/1.5 | 0.70 (0.13) | .89 |
| | 1.2/1.8 | 0.69 (0.13) | .88 |
| | 1/2 | 0.69 (0.13) | .88 |

† Sampling stratum margins: unexposed ($m_0|D|$)/exposed ($m_1|D|$)

Table 5.3: Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a binary exposure Z, when the odds ratio (OR) is 4 ($\beta$ = 1.39). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

| Sampling Method | $m_0/m_1$† | $\hat{\beta}$ (SE) | Relative Efficiency |
|---|---|---|---|
| pr(Z = 1) = 0.2 | | | |
| Full Cohort | ----- | 1.39 (0.13) | 1.00 |
| Random Sampling | ----- | 1.38 (0.16) | .64 |
| Counter-Matching | 2/1 | 1.38 (0.14) | .81 |
| | 1.8/1.2 | 1.38 (0.14) | .87 |
| | 1.5/1.5 | 1.38 (0.14) | .90 |
| | 1.2/1.8 | 1.38 (0.13) | .91 |
| | 1/2 | 1.38 (0.13) | .91 |
| pr(Z = 1) = 0.5 | | | |
| Full Cohort | ----- | 1.40 (0.15) | 1.00 |
| Random Sampling | ----- | 1.39 (0.17) | .79 |
| Counter-Matching | 2/1 | 1.39 (0.16) | .87 |
| | 1.8/1.2 | 1.39 (0.16) | .90 |
| | 1.5/1.5 | 1.39 (0.15) | .92 |
| | 1.2/1.8 | 1.39 (0.15) | .91 |
| | 1/2 | 1.39 (0.15) | .90 |

† Sampling stratum margins: unexposed ($m_0|D|$)/exposed ($m_1|D|$)

Categorical Z: a linear trend assumed. When the underlying odds ratio is 1, the relative efficiencies of all counter-matching designs are higher than that of random sampling (Table 5.4). There is no dramatic difference in efficiencies among different sample allocations for counter-matching. The relative efficiency is between 0.87 and 0.89 for a rare exposure and between 0.82 and 0.89 for a common exposure. However, when the odds ratio is away from the null value (i.e. when OR = 2 or OR = 4), the sample allocation in counter-matching affects the efficiency (Table 5.5 and Table 5.6). When the exposure is less common, pr(Z = 1) = pr(Z = 2) = 0.10, and the odds ratio is 2, all five counter-matching designs with different sample allocations are superior to the random sampling design, and the more subjects sampled from the exposed stratum, the higher the relative efficiency of counter-matching. Nevertheless, when the exposure is relatively common, pr(Z = 1) = pr(Z = 2) = 0.25, the random sampling design is not necessarily less efficient than the counter-matching design. Counter-matching is clearly superior to random sampling only when more exposed individuals are sampled, not when $m_0/m_1$ is 2/1 or 1.8/1.2 (Table 5.5). The same pattern of findings is observed when OR = 4 (Table 5.6).

Categorical Z: each level of Z has its own risk. We consider two different risk patterns for the categorical exposure Z: 1) the odds ratio is 1 when Z = 1 (i.e. $OR_1$ = 1) and the odds ratio is 4 when Z = 2 (i.e. $OR_2$ = 4); 2) the odds ratio is 2 when Z = 1 (i.e. $OR_1$ = 2) and the odds ratio is 3 when Z = 2 (i.e. $OR_2$ = 3).
For both odds ratio scenarios, all counter-matching designs are more efficient than random sampling when Z is less common (pr(Z = 1) = pr(Z = 2) = 0.10; see upper half of Table 5.7 and Table 5.8). Moreover, when the exposure is relatively common, pr(Z = 1) = pr(Z = 2) = 0.25, the random sampling design is more efficient only than the counter-matching design with the margin $m_0/m_1$ = 2/1 (see lower half of Table 5.7 and Table 5.8).

Table 5.4: Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z with a linear trend effect on the outcome, when the odds ratio (OR) is 1 per unit of Z ($\beta$ = 0.00). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

| Sampling Method | $m_0/m_1$† | $\hat{\beta}$ (SE) | Relative Efficiency |
|---|---|---|---|
| pr(Z = 1) = 0.1 and pr(Z = 2) = 0.1 | | | |
| Full Cohort | ----- | -0.00 (0.11) | 1.00 |
| Random Sampling | ----- | -0.00 (0.10) | .74 |
| Counter-Matching | 2/1 | -0.00 (0.10) | .87 |
| | 1.8/1.2 | -0.00 (0.10) | .88 |
| | 1.5/1.5 | -0.00 (0.10) | .89 |
| | 1.2/1.8 | -0.00 (0.10) | .89 |
| | 1/2 | -0.01 (0.10) | .88 |
| pr(Z = 1) = 0.25 and pr(Z = 2) = 0.25 | | | |
| Full Cohort | ----- | -0.00 (0.07) | 1.00 |
| Random Sampling | ----- | -0.00 (0.09) | .77 |
| Counter-Matching | 2/1 | -0.00 (0.08) | .82 |
| | 1.8/1.2 | -0.00 (0.08) | .85 |
| | 1.5/1.5 | -0.00 (0.08) | .89 |
| | 1.2/1.8 | -0.00 (0.08) | .87 |
| | 1/2 | -0.00 (0.08) | .87 |

† Sampling stratum margins: unexposed ($m_0|D|$)/exposed ($m_1|D|$)

Table 5.5: Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z with a linear trend effect on the outcome, when the odds ratio (OR) is 2 per unit of Z ($\beta$ = 0.69). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

| Sampling Method | $m_0/m_1$† | $\hat{\beta}$ (SE) | Relative Efficiency |
|---|---|---|---|
| pr(Z = 1) = 0.1 and pr(Z = 2) = 0.1 | | | |
| Full Cohort | ----- | 0.69 (0.08) | 1.00 |
| Random Sampling | ----- | 0.69 (0.10) | .60 |
| Counter-Matching | 2/1 | 0.69 (0.09) | .77 |
| | 1.8/1.2 | 0.69 (0.08) | .83 |
| | 1.5/1.5 | 0.69 (0.08) | .87 |
| | 1.2/1.8 | 0.69 (0.08) | .91 |
| | 1/2 | 0.69 (0.08) | .91 |
| pr(Z = 1) = 0.25 and pr(Z = 2) = 0.25 | | | |
| Full Cohort | ----- | 0.69 (0.07) | 1.00 |
| Random Sampling | ----- | 0.69 (0.09) | .72 |
| Counter-Matching | 2/1 | 0.69 (0.09) | .66 |
| | 1.8/1.2 | 0.69 (0.08) | .73 |
| | 1.5/1.5 | 0.69 (0.08) | .80 |
| | 1.2/1.8 | 0.69 (0.08) | .82 |
| | 1/2 | 0.69 (0.08) | .84 |

† Sampling stratum margins: unexposed ($m_0|D|$)/exposed ($m_1|D|$)

Table 5.6: Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z with a linear trend effect on the outcome, when the odds ratio (OR) is 4 per unit of Z ($\beta$ = 1.39). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

| Sampling Method | $m_0/m_1$† | $\hat{\beta}$ (SE) | Relative Efficiency |
|---|---|---|---|
| pr(Z = 1) = 0.1 and pr(Z = 2) = 0.1 | | | |
| Full Cohort | ----- | 1.39 (0.08) | 1.00 |
| Random Sampling | ----- | 1.39 (0.11) | .52 |
| Counter-Matching | 2/1 | 1.39 (0.10) | .62 |
| | 1.8/1.2 | 1.39 (0.09) | .71 |
| | 1.5/1.5 | 1.39 (0.08) | .82 |
| | 1.2/1.8 | 1.39 (0.08) | .88 |
| | 1/2 | 1.39 (0.08) | .88 |
| pr(Z = 1) = 0.25 and pr(Z = 2) = 0.25 | | | |
| Full Cohort | ----- | 1.39 (0.09) | 1.00 |
| Random Sampling | ----- | 1.38 (0.11) | .70 |
| Counter-Matching | 2/1 | 1.38 (0.12) | .54 |
| | 1.8/1.2 | 1.38 (0.11) | .63 |
| | 1.5/1.5 | 1.39 (0.10) | .73 |
| | 1.2/1.8 | 1.39 (0.10) | .77 |
| | 1/2 | 1.39 (0.10) | .80 |

† Sampling stratum margins: unexposed ($m_0|D|$)/exposed ($m_1|D|$)

Summary. Based on these simulation results, counter-matching is generally more efficient than random sampling across the scenarios considered (varied odds ratios and exposure prevalences) based on a binary sampling variable X, whether Z is binary or categorical, with a linear trend or not. Using the same sample size, we can obtain a higher efficiency by sampling more individuals from the exposed stratum, especially when the underlying odds ratio is away from the null and when the exposure is less common.

5.3 Simulation Study II: Factors that Influence Efficiency

In Simulation Study II, we consider a binary counter-matched variable X and a binary exposure Z. To evaluate the impact on efficiency of each factor discussed in Section 5.1, we sequentially vary the factors one at a time. The factors varied in the simulation are
1) prevalence of exposure Z (exposure prevalence),
2) risk of exposure Z for disease (exposure-disease association),
3) total number of individuals sampled from the study base (sample size),
4) relationship between the counter-matched variable X and exposure Z (X-Z relationship), and
5) the distribution of the counter-matched variable X in the sample (sample allocation).
The last two factors apply only to the counter-matching designs, because random sampling does not use any exposure information.
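To give a sense of how many design scenarios these five factors generate, the following Python sketch enumerates them using the factor levels stated in Section 5.3.1. It assumes the factors are fully crossed, which is how the 25 counter-matching designs per figure panel in the results appear to be organized; the variable names are illustrative and not the author's.

```python
from itertools import product

# Factor levels as given in Section 5.3.1.
prevalences = [0.20, 0.50]        # pr(Z = 1)
odds_ratios = [1, 2, 4]           # exposure-disease association
size_multiples = [2, 3, 4]        # total sample size / number of cases
se_sp_pairs = [(0.90, 0.90), (0.75, 0.75), (0.60, 0.60),
               (0.60, 0.90), (0.90, 0.60)]   # (sensitivity, specificity)
pct_unexposed = [33, 40, 50, 60, 66]         # % with X = 0 in the sample

# Counter-matching designs depend on all five factors.
cm_scenarios = list(product(prevalences, odds_ratios, size_multiples,
                            se_sp_pairs, pct_unexposed))

# Random sampling ignores X, so only the first three factors apply.
rs_scenarios = list(product(prevalences, odds_ratios, size_multiples))

print(len(cm_scenarios))  # 450
print(len(rs_scenarios))  # 18
```

Each of the 18 random-sampling scenarios serves as the reference for 25 counter-matching designs (five X-Z relationships by five sample allocations).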
Table 5.7: Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z, when the odds ratio is 1 for Z = 1 ($OR_1$ = 1, $\beta_1$ = 0.00) and 4 for Z = 2 ($OR_2$ = 4, $\beta_2$ = 1.39). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

| Method | $OR_1$ = 1: $\hat{\beta}_1$ (SE) | Eff | $OR_2$ = 4: $\hat{\beta}_2$ (SE) | Eff |
|---|---|---|---|---|
| pr(Z = 1) = 0.1 and pr(Z = 2) = 0.1 | | | | |
| Full Cohort | -0.02 (0.22) | 1.00 | 1.38 (0.15) | 1.00 |
| Random Sampling | -0.02 (0.26) | .75 | 1.39 (0.20) | .56 |
| Counter-Matching† (2/1) | -0.02 (0.24) | .84 | 1.39 (0.18) | .70 |
| (1.8/1.2) | -0.02 (0.24) | .87 | 1.38 (0.17) | .77 |
| (1.5/1.5) | -0.02 (0.23) | .91 | 1.38 (0.17) | .84 |
| (1.2/1.8) | -0.02 (0.23) | .92 | 1.38 (0.16) | .88 |
| (1/2) | -0.02 (0.23) | .91 | 1.38 (0.16) | .90 |
| pr(Z = 1) = 0.25 and pr(Z = 2) = 0.25 | | | | |
| Full Cohort | -0.01 (0.19) | 1.00 | 1.39 (0.14) | 1.00 |
| Random Sampling | -0.01 (0.21) | .84 | 1.39 (0.17) | .69 |
| Counter-Matching† (2/1) | -0.00 (0.21) | .79 | 1.39 (0.18) | .59 |
| (1.8/1.2) | -0.01 (0.21) | .84 | 1.39 (0.17) | .68 |
| (1.5/1.5) | -0.01 (0.20) | .87 | 1.39 (0.16) | .78 |
| (1.2/1.8) | -0.01 (0.20) | .89 | 1.39 (0.15) | .80 |
| (1/2) | -0.01 (0.20) | .90 | 1.38 (0.15) | .83 |

Eff = Relative efficiency
† Sampling stratum margins: unexposed ($m_0|D|$)/exposed ($m_1|D|$), shown as ($m_0/m_1$)

Table 5.8: Results of simulation studies comparing efficiency of the full cohort to random sampling and counter-matching with different sample allocation for a categorical exposure Z, when the odds ratio is 2 for Z = 1 ($OR_1$ = 2, $\beta_1$ = 0.69) and 3 for Z = 2 ($OR_2$ = 3, $\beta_2$ = 1.10). The counter-matched sampling stratum variable X has 90% sensitivity and specificity for the exposure variable Z. The overall probability of disease is 10%. Based on 1,000 trials.

| Method | $OR_1$ = 2: $\hat{\beta}_1$ (SE) | Eff | $OR_2$ = 3: $\hat{\beta}_2$ (SE) | Eff |
|---|---|---|---|---|
| pr(Z = 1) = 0.1 and pr(Z = 2) = 0.1 | | | | |
| Full Cohort | 0.69 (0.17) | 1.00 | 1.09 (0.16) | 1.00 |
| Random Sampling | 0.69 (0.21) | .67 | 1.10 (0.20) | .63 |
| Counter-Matching† (2/1) | 0.69 (0.20) | .77 | 1.10 (0.19) | .75 |
| (1.8/1.2) | 0.68 (0.19) | .82 | 1.09 (0.18) | .80 |
| (1.5/1.5) | 0.69 (0.19) | .87 | 1.09 (0.18) | .86 |
| (1.2/1.8) | 0.69 (0.19) | .87 | 1.09 (0.17) | .88 |
| (1/2) | 0.69 (0.19) | .86 | 1.09 (0.17) | .88 |
| pr(Z = 1) = 0.25 and pr(Z = 2) = 0.25 | | | | |
| Full Cohort | 0.69 (0.16) | 1.00 | 1.10 (0.14) | 1.00 |
| Random Sampling | 0.68 (0.18) | .78 | 1.10 (0.17) | .71 |
| Counter-Matching† (2/1) | 0.69 (0.19) | .73 | 1.10 (0.18) | .63 |
| (1.8/1.2) | 0.69 (0.18) | .80 | 1.10 (0.17) | .73 |
| (1.5/1.5) | 0.69 (0.17) | .85 | 1.10 (0.16) | .81 |
| (1.2/1.8) | 0.69 (0.17) | .87 | 1.10 (0.16) | .84 |
| (1/2) | 0.68 (0.17) | .87 | 1.10 (0.16) | .85 |

Eff = Relative efficiency
† Sampling stratum margins: unexposed ($m_0|D|$)/exposed ($m_1|D|$), shown as ($m_0/m_1$)

5.3.1 Methods for Simulation Study II

In this section, the detailed plans of Simulation Study II are presented. The overall probability of the disease is 10% and each study base consists of 3,000 individuals.

Exposure prevalence. The prevalence of exposure Z, pr(Z = 1), is 20% and 50% for a less common and a common exposure, respectively.

Exposure-disease association. Three different configurations for the risk of exposure Z for disease, odds ratio (OR) = 1, 2, and 4, are tested in the simulation studies.

Sample size. We varied the total sample size of a nested case-control study from two to four times the number of cases. Based on this setup, we have approximately 300 cases in each study base, as the overall probability of the disease is 10%.
Thus, we can compare the efficiencies of three counter-matching designs with different sample sizes to 1:1, 1:2, and 1:3 random sampling, respectively, on the basis of using the same number of subjects in the counter-matching and random sampling designs. For example, in a 1:3 random sampling design, we would have 300 cases and 900 randomly sampled controls for a total of 1,200 individuals. The corresponding counter-matched sample would have exactly the same 300 cases and 900 controls, but the exposure composition of the 900 controls depends on the predetermined sample allocation.

X-Z relationship. The X-Z relationship does not affect the efficiency of random sampling because random sampling is blind to the X status, while the X-Z relationship does affect the efficiency of counter-matching (see Sections 3.2 and 3.3). The relationship between the counter-matched variable X and exposure Z can be expressed in terms of the sensitivity (SE) and specificity (SP). We consider five scenarios:
1) SE = pr(X = 1 | Z = 1) = 0.90 and SP = pr(X = 0 | Z = 0) = 0.90;
2) SE = 0.75 and SP = 0.75;
3) SE = 0.60 and SP = 0.60;
4) SE = 0.60 and SP = 0.90;
5) SE = 0.90 and SP = 0.60.
The negative relationship between X and Z is symmetric about SE = 0.50 and SP = 0.50 (as shown in Figure 4.1), so we only consider the positive X-Z relationship in the simulations. These five scenarios cover varied levels of the X-Z relationship, such as high, medium, and low SE and SP, as well as two different combinations of high and low SE and SP.

Sample allocation. In random sampling designs, the sample allocation does not affect efficiency because the number of exposed and unexposed subjects is not fixed in random sampling (only the number of cases and controls). The issue of sample allocation is specific to counter-matching.
For each of three fixed sample sizes, we investigate five scenarios of sample allocation for counter m atching such that the proportions of unexposed individuals (i.e. X = 0 ) in the counter-matched sample are 33%, 40%, 50%, 60%, and 6 6 %, respectively. 53 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. For example, to be compatible w ith an 1:3 random sampled sample w ith 1,200 individuals (300 cases and 900 controls), we sample controls such th a t the to ta l numbers of unexposed individuals including cases are 400, 480, 600, 720, and 800, respectively. 5.3.2 Results of Simulation Study II Because of the large number of to ta l combinations by five factors, the sim ulation re sults are first grouped by the to ta l sample size and the risk of exposure Z to disease. For each scenario (presented in Figure 5.1 - Figure 5.9) of the three sample size and three exposure risk combinations, the relative efficiencies are divided in two sections for the less common and common Z , respectively. Each section comprises the rel ative efficiency of random sampling and 25 relative efficiencies of counter-matching designs made up of five sample allocation and five different X — Z relationship. If a bar (the relative efficiency of counter-matching) is higher than the horizontal refer ence line (the relative efficiency of random sampling), it implies counter-matching is more efficient than random sampling under th a t scenario on the basis of the same sample size. Figure 5.1 to 5.3 are for the samples which consist of 300 cases and 300 controls from a cohort of size 3,000. When there is a strong X — Z relationship w ith both the sensitivity and specificity equal to 0.75 or 0.90, counter-matching designs w ith different sample allocations are more likely to be efficient than random sampling. Counter-matching w ith 50% allocation of exposed and unexposed subjects in the 54 Reproduced with permission of the copyright owner. 
Further reproduction prohibited without permission. sample is more efficient than random sampling for p r(Z = 1)0.2 and 0.5, except when the odds ratio is 4, 90% SE, and 60% SP (Figure 5.3). Figure 5.4 to 5.6 are for the samples which consist of 300 cases and 600 controls. When both the sensitivity and specificity equal to 0.75 or 0.90, counter-matching designs w ith different sample allocations are all more efficient than random sampling for all three risk scenarios. Figure 5.7 to 5.9 are for the samples which consist of 300 cases and 900 controls. When both the sensitivity and specificity equal to 0.90, counter-matching designs w ith different sample allocations are all more efficient than random sampling for all three risk scenarios. We can also evaluate the influence of each factor on efficiency. Exposure prevalence Andrieu et al. [6 ] have shown th a t counter-matching may be more appropriate than most traditional sampling methods for rare exposures. Comparing section a and b in Figure 5.1 - Figure 5.9, the efficiencies of counter matching designs do not vary considerably when the odds ratio is away from the null. B y contrast, random sampling has a greater efficiency for a common exposure than for a rare exposure if the odds ratio is away from the null, especially when the odds ratio is 4. Exposure-disease association The im pact of exposure-disease association on efficiency varies by exposure prevalence. Efficiency decreases as the exposure-disease association increases for the rare exposure for both random sampling and counter matching, but the magnitude of the change is greater for random sampling. By 55 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
contrast, the efficiency of random sampling increases as the exposure-disease asso ciation increases for the common exposure, while the efficiency of counter-matching is about the same when the exposure-disease association increases for the common exposure. For instance, for p r(Z = 1) = 0 .2 , when both the sensitivity and speci ficity are 0.90 and the sample size is 4 times the number of cases, the efficiency of counter-matching w ith 50% allocation is 0.94, 0.93, and 0.93 when the odds ratio is 1, 2, and 4, respectively; while the efficiency of random sampling decreases from 0.83, 0.77, and 0.71 when the odds ratio is 1, 2, and 4, respectively (Figure 5.7.a - Figure 5.9.a) For p r(Z — 1) = 0.5, the efficiency of counter-matching w ith 50% allocation is 0.93 when the odds ratio is 1, 2, and 4, but the efficiency of random sampling increases from 0.83, 0.83, and 0.87 when the odds ratio is 1, 2, and 4, respectively (Figure 5.7.b - Figure 5.9.b). This implies th a t counter-matching is more efficient for a less common exposure w ith a high exposure-disease association, but the efficiency of random sampling increases when the exposure is common and the exposure-disease association is higher. Sample size For both random sampling and counter-matching, it is well known that having more individuals sampled from the study base increases efficiency. We want to know how the efficiency changes w ith increasing sample size. M iettinen [53] pointed out that the statistical power of an in dividually matched study can be increased by selecting more controls per case, but the additional gain decreases sharply and the gain is very small when the case/control ratio exceeds 4. We observe 56 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the same pattern of reduced efficiency gain w ith increasing overall sample size for both random sampling and counter-matching. 
For instance, when the underlying OR is 1 and the exposure is less common (e.g. 20%), the relative efficiency of random sampling rises from 0.57 to 0.75 (a 31.6% increase) as the number of controls per case increases from 1 to 2 (Figure 5.1.a and Figure 5.4.a), but it rises by only 10.7%, to 0.83, when the number of controls per case increases from 2 to 3 (Figure 5.7.a).

To understand the impact of sample size on the efficiency of counter-matching, the exposure-disease association, the exposure prevalence, the sample allocation, and the X-Z relationship all need to be considered. Using equal allocation of exposed and unexposed individuals and a strong X-Z relationship (both the sensitivity and specificity are 90%) as an example, the relative efficiency of counter-matching is 0.87, 0.92, and 0.94 for sample sizes of 2, 3, and 4 times the number of cases, respectively, when the odds ratio is 1 and the exposure prevalence is 20%. The relative efficiencies of counter-matching under these scenarios are all higher than those of random sampling.

X-Z relationship. As shown in Figures 5.1 to 5.9, the stronger the X-Z relationship, the greater the efficiency of a counter-matched design. This is because the discordancy in exposure Z between cases and controls is obtained by counter-matching on X. However, when the counter-matched sampling stratum variable X has a 60% sensitivity and a 60% specificity for the exposure variable Z, the efficiency of counter-matching is less likely to be higher than that of random sampling.
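One way to read the five sensitivity/specificity scenarios used in the figures is to translate them into predictive values of X for Z at the two exposure prevalences considered. The sketch below applies Bayes' rule only; the function name `predictive_values` and the `SCENARIOS` list are illustrative choices, not part of the simulation code used in this dissertation.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) of the surrogate X for the true exposure Z.

    PPV = pr(Z = 1 | X = 1) and NPV = pr(Z = 0 | X = 0), by Bayes' rule.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# (sensitivity, specificity) pairs shown on the x-axis of Figures 5.1 to 5.9
SCENARIOS = [(0.90, 0.90), (0.75, 0.75), (0.60, 0.60), (0.60, 0.90), (0.90, 0.60)]

for prevalence in (0.20, 0.50):
    for se, sp in SCENARIOS:
        ppv, npv = predictive_values(se, sp, prevalence)
        print(f"pr(Z=1)={prevalence:.2f} SE={se:.2f} SP={sp:.2f} "
              f"PPV={ppv:.3f} NPV={npv:.3f}")
```

The printed table only restates the scenarios on a different scale; it does not by itself predict the efficiency of any design.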
[Figure 5.1 (bar charts): OR = 1, sample size is twice the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Y-axis: relative efficiency. X-axis groups (sensitivity, specificity): (0.90, 0.90), (0.75, 0.75), (0.60, 0.60), (0.60, 0.90), (0.90, 0.60). Bars: 33%, 40%, 50%, 60%, and 66% unexposed in the counter-matched sample. Horizontal line: RS.]

Figure 5.1: Efficiencies of 1:1 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is one. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

[Figure 5.2 (bar charts): OR = 2, sample size is twice the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Same axes, bars, and RS line as Figure 5.1.]

Figure 5.2: Efficiencies of 1:1 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is two. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

[Figure 5.3 (bar charts): OR = 4, sample size is twice the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Same axes, bars, and RS line as Figure 5.1.]

Figure 5.3: Efficiencies of 1:1 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is four. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

[Figure 5.4 (bar charts): OR = 1, sample size is 3 times the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Same axes, bars, and RS line as Figure 5.1.]

Figure 5.4: Efficiencies of 1:2 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is one. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

[Figure 5.5 (bar charts): OR = 2, sample size is 3 times the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Same axes, bars, and RS line as Figure 5.1.]

Figure 5.5: Efficiencies of 1:2 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is two. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

[Figure 5.6 (bar charts): OR = 4, sample size is 3 times the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Same axes, bars, and RS line as Figure 5.1.]

Figure 5.6: Efficiencies of 1:2 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is four. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

[Figure 5.7 (bar charts): OR = 1, sample size is 4 times the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Same axes, bars, and RS line as Figure 5.1.]

Figure 5.7: Efficiencies of 1:3 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is one. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

[Figure 5.8 (bar charts): OR = 2, sample size is 4 times the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Same axes, bars, and RS line as Figure 5.1.]

Figure 5.8: Efficiencies of 1:3 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is two. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

[Figure 5.9 (bar charts): OR = 4, sample size is 4 times the number of cases. Panels: a) pr(Z = 1) = 0.20; b) pr(Z = 1) = 0.50. Same axes, bars, and RS line as Figure 5.1.]

Figure 5.9: Efficiencies of 1:3 random sampling and five counter-matching designs, relative to the full study base, when the odds ratio (OR) is four. a) The exposure prevalence, pr(Z = 1), was 20%, and b) the exposure prevalence was 50%. Solid horizontal line: relative efficiency of random sampling (RS). Vertical bar: relative efficiency of counter-matching. X-axis: different scenarios of sensitivity (SE) and specificity (SP) of the counter-matching variable and exposure Z. The overall probability of disease was 10%. Based on 1,000 trials.

a. One stratum in the simulated cohort

               X = 1             X = 0
           Z = 1   Z = 0     Z = 1   Z = 0     Total
Case         13       3         1      32        49
Control      40      20        10     181       251
Total            76               224           300

b. Counter-matching on X

               X = 1             X = 0
           Z = 1   Z = 0     Z = 1   Z = 0     Total
Case         13       3         1      32        49
Control      34      17         0       0        51
Total            67                33           100

Figure 5.10: An example of counter-matching when no controls are sampled in one of the counter-matched sampling strata.
The sample allocation is 33% unexposed subjects in the counter-matched sample, and the sample size is twice the number of cases, with odds ratio (OR) = 2, pr(Z = 1) = 0.2, and 90% sensitivity and specificity.

One interesting observation in the simulation studies (Figure 5.3, Figure 5.6, and Figure 5.9) is that, when the OR is 4, a higher sensitivity (SE) appears to be more important than a higher specificity (SP), as seen by comparing the last two groups (SE = 0.60 and SP = 0.90 versus SE = 0.90 and SP = 0.60). The efficiency of counter-matching is reduced when a poor positive predictive value breaks the discordancy in exposure Z between cases and controls, especially when the unexposed proportion in the sample is 60% or 66%.

Sample allocation. As with the X-Z relationship, sample allocation is only applicable to counter-matching designs, because random sampling does not use the information on X in the sampling. When the sample size is twice the number of cases (i.e. one control per case; Figure 5.1, Figure 5.2, and Figure 5.3), the sample allocation markedly affects the efficiency of counter-matching, so that the efficiency fluctuates widely in the range of 0.2 to 0.8. The big jumps in efficiency observed for designs with 33% or 66% unexposed subjects in the sample are due to the sparse number of controls given the sample allocation. For example, in Figure 5.2.a, we observe a big deficit in the efficiency of counter-matching when the sample allocation is 33% unexposed subjects in the sample. As shown in Figure 5.10, no controls are sampled in the unexposed stratum because the number of unexposed cases is equal to the total number needed for the unexposed stratum (i.e. 49 x 2 x 1/3, approximately 33). Therefore, the efficiency loss here arises because the sample allocation is not applicable when the sample size is just twice the number of cases.
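The arithmetic behind Figure 5.10 can be sketched as follows. This is a minimal illustration, assuming that the target size of each X stratum is the rounded share of the total sample and that all cases are retained; the function name `controls_to_sample` and that rounding rule are assumptions made for illustration, not the simulation code used here.

```python
def controls_to_sample(cases_by_stratum, sample_size, prop_unexposed):
    """Number of controls to draw from each X stratum.

    cases_by_stratum: {0: cases with X = 0, 1: cases with X = 1}
    sample_size:      total number of cases plus controls in the sample
    prop_unexposed:   target proportion of X = 0 subjects in the sample
    """
    # Assumed rule: round the unexposed target, give the remainder to X = 1.
    target = {0: round(sample_size * prop_unexposed)}
    target[1] = sample_size - target[0]

    needed = {}
    for stratum, n_target in target.items():
        n_controls = n_target - cases_by_stratum[stratum]
        if n_controls < 0:
            # The cases alone already exceed the stratum target.
            raise ValueError(f"allocation infeasible in stratum X = {stratum}")
        needed[stratum] = n_controls
    return needed


# Figure 5.10: 16 cases with X = 1, 33 cases with X = 0, total sample of 100.
print(controls_to_sample({1: 16, 0: 33}, sample_size=100, prop_unexposed=0.33))
print(controls_to_sample({1: 16, 0: 33}, sample_size=100, prop_unexposed=0.50))
```

With 33% allocation the X = 0 stratum is already filled by its 33 cases, leaving no room for controls, whereas the 50% allocation leaves controls to be drawn from both strata.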
The impact of sample allocation becomes less substantial as the sample size increases to 3 or 4 times the number of cases (Figures 5.4 to 5.9).

The influence of sample allocation on efficiency varies by the X-Z relationship. In most scenarios with 'concordant' sensitivity and specificity (both equal to 0.90, 0.75, or 0.60), we observe that the 50% allocation shows the highest efficiency among the different proportions of unexposed individuals. However, the efficiency monotonically increases or decreases when the sensitivity and specificity are 'not concordant' (SE = 0.60 and SP = 0.90; SE = 0.90 and SP = 0.60). Sampling more unexposed individuals enhances the efficiency when the specificity is much higher (SE = 0.60 and SP = 0.90), whereas selecting more exposed individuals increases the efficiency when the sensitivity is much higher (SE = 0.90 and SP = 0.60).

The 50% sample allocation between the unexposed and exposed strata appears to be a good choice. In fact, the 50% allocation assures that counter-matching is more efficient than random sampling in almost all scenarios. This 'balanced' (50/50) allocation performs better than an allocation in which cases and sampled controls share the same exposure proportion. However, when the sample size is larger (e.g. three or four times the number of cases), the impact of different sample allocations on efficiency becomes less substantial.

Summary. We provide the methods and results for a total of 450 counter-matching scenarios obtained by varying five factors. To date, no comparably intensive comparison of counter-matching with random sampling has been reported. The results of these simulations can provide investigators with information on the efficiency of each scenario and may be helpful in designing studies.
In summary, the simulation studies demonstrate that
1) the efficiency gain diminishes with increasing overall sample size for both random sampling and counter-matching,
2) the stronger the X-Z relationship, the greater the efficiency of counter-matched designs,
3) the 50% allocation can almost guarantee that counter-matching is more efficient than random sampling,
4) the efficiencies of counter-matching designs do not vary markedly by the exposure prevalence; however, random sampling is more efficient for a common exposure than for a rare exposure,
5) counter-matching is more efficient for a less common exposure with a high risk to the disease, but the efficiency of random sampling increases when the exposure is more common and the risk is higher.

In conclusion, based on the simulation studies presented in this dissertation, we recommend that investigators choose counter-matching rather than random sampling when a variable X that is available for the whole cohort has a high sensitivity and specificity (at least 75% for both) for the exposure variable Z of main interest in the nested case-control study, especially for a rare exposure with a high risk to the disease. When the sample size is small, such as two times the number of cases, 50% allocation of exposed and unexposed for counter-matching is the optimal choice. As the sample size increases, the allocation of exposed and unexposed individuals in the sample is less critical.

Chapter 6
Analytic Issues for Counter-Matched Nested Case-Control Studies

Counter-matching is a promising sampling strategy that has been shown to be more efficient than traditional sampling methods.
However, when one uses counter-matching to gain efficiency, one also has to address a number of issues in the analysis stage. Because of the requirement of the risk weights in the likelihood, one must make substantial computational efforts to obtain the weights. Besides the computational issue, several practical issues must be considered, including 1) descriptive statistics under a non-random sampling method, 2) non-participation of subjects and missing data, and 3) subtype (defined by the outcome categories) and stratification (defined by covariates) analysis. In Chapter 6, we will address these analytic issues and propose solutions, and a data example will be presented to illustrate them.

6.1 Model Fitting under Complex Sampling

As shown in Chapter 3, equation (3.4), the likelihood for counter-matching is not independent of the risk weights inherent in the non-random sampling. Because of the requirement of the risk weights in the likelihood, one must make substantial computational efforts to obtain the risk weights.

A computation strategy for a log-linear risk model was pointed out by Langholz and Goldstein [41]. The conditional likelihood under complex sampling, i.e. equation (3.4), when the odds ratio term is log-linear, i.e. $r_D(\beta) = \prod_{i \in D} \exp(\beta Z_i)$, has the form

$$L_{\text{sample}}(\beta) = \frac{\prod_{i \in D} \exp(\beta Z_i)\, w_D}{\sum_{s} \prod_{i \in s} \exp(\beta Z_i)\, w_s} = \frac{\exp\left[\log(w_D) + \beta \sum_{i \in D} Z_i\right]}{\sum_{s} \exp\left[\log(w_s) + \beta \sum_{j \in s} Z_j\right]}, \qquad (6.1)$$

which is comparable to the standard conditional logistic model with covariates $\sum_{j \in s} Z_j$ and "offset" $\log(w_s)$. Therefore, the analysis may be performed using standard conditional logistic likelihood software which allows for an "offset" in the model, e.g.
PROC PHREG in SAS/STAT software [67]. Epidemiological studies for assessing risk factors often use logistic regression, so this computation strategy, equation (6.1), is very helpful in practice. However, before applying this strategy we need to prepare an analytic data set which includes the new covariates ($\sum_{j \in s} Z_j$, the sum of covariates) and the risk weights ($w_s$) of a case set $s$. The data example in Chapter 7 will further illustrate the application of this computation strategy.

6.2 Descriptive Statistics

Obtaining descriptive statistics in a case-control study is straightforward under simple sampling. Under complex sampling such as counter-matching, however, the covariate distributions based on the cases and selected controls do not represent the study base, especially for covariates correlated with the counter-matching variable.

For purposes of describing the study base, we provide a method to estimate the covariate distributions using the counter-matched sample. To facilitate interpretation of the distribution, the number of controls in the full study base must be estimated using the corresponding sampling proportions. Each control is weighted by the inverse of the sampling proportion based on the control's exposure status. To illustrate this approach, consider the setup shown in Figure 3.4. One control in exposure stratum l represents a number of study-base controls equal to the inverse of the sampling proportion for that stratum. In the model fitting, the risk weights in the likelihood account for the sampling, so we can obtain unbiased parameter estimates. For descriptive statistics we need to estimate the number of controls by means of the sampling probability as well.
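The weighting just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stratum counts and the covariate ("smoker") are hypothetical numbers invented for the example, and `weighted_distribution` is an illustrative helper name rather than the procedure applied in Chapter 7.

```python
from collections import defaultdict


def weighted_distribution(sampled_controls, n_base, n_sampled):
    """Estimate the study-base distribution of a covariate among controls.

    sampled_controls: list of (stratum, covariate_value) for sampled controls
    n_base[stratum]:    number of controls in the study base in that stratum
    n_sampled[stratum]: number of controls sampled from that stratum
    Each control is weighted by the inverse of its sampling proportion.
    """
    totals = defaultdict(float)
    for stratum, value in sampled_controls:
        weight = n_base[stratum] / n_sampled[stratum]
        totals[value] += weight
    grand_total = sum(totals.values())
    return {value: w / grand_total for value, w in totals.items()}


# Hypothetical counter-matched sample (all numbers are made up):
# stratum X = 1: 60 controls in the study base, 30 sampled (weight 2)
# stratum X = 0: 190 controls in the study base, 19 sampled (weight 10)
n_base = {1: 60, 0: 190}
n_sampled = {1: 30, 0: 19}
sample = ([(1, "smoker")] * 20 + [(1, "non-smoker")] * 10
          + [(0, "smoker")] * 4 + [(0, "non-smoker")] * 15)

print(weighted_distribution(sample, n_base, n_sampled))
```

In this made-up sample the unweighted proportion of smokers is 24/49, about 0.49, while the weighted estimate is 80/250 = 0.32, which shows how strongly the over-sampled stratum can distort a naive description.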
The data example shown in Table 7.1 in Chapter 7 demonstrates the problem of directly using the counter-matched sample to describe the study base, and provides the opportunity to compare the proposed method with the real study-base distribution of covariates.

6.3 Missing Data and Non-Participation

In a nested case-control study, we sample a predetermined number of controls and/or cases from the study base. However, no matter what kind of sampling is adopted, it is expected that not all the sampled controls and cases will be recruited for the nested case-control study. The most common reasons for non-participation include refusals, loss of contact, and death. Moreover, it is not uncommon for some covariate information to be missing for participants. In the standard analytic approach, non-participants and participants with missing covariate data are often simply ignored, with an underlying assumption that missingness is random and independent. As discussed below, this assumption may not be valid in all circumstances.

6.3.1 The Nature of Missing Data

In case-control studies, missing data can depend on 1) case-control status only, 2) exposure status only, 3) both case-control and exposure status, or 4) some other complex missing mechanism; or missing data may not depend on any of these factors.

The issue of missing data has been studied by Rubin [63], who defined three mechanisms of missing data. The definitions are as follows.

Definition: Missing completely at random (MCAR). The probability of the observed missing data pattern, given the complete data, does not depend on the unobserved and observed values. In notation, pr(missing | complete data) = pr(missing).
Definition: Missing at random (MAR). The probability of the observed missing data pattern, given the complete data, does not depend on the (unobserved) values of the missing items. In notation, pr(missing | complete data) = pr(missing | observed data).

Definition: Not missing at random (NMAR). The probability of the observed missing data pattern, given the complete data, does depend on the (unobserved) values of the missing items.

Fuchs [22] and Little [45] proposed likelihood-based tests of MCAR for contingency tables and multivariate data, respectively. Chen and Little [14] generalized Little's method [45] to the generalized estimating equations setting using a Wald-type test for the MCAR assumption. In the context of two-phase design studies, Zhou et al. [77] proposed a likelihood ratio test for the MAR assumption and a bootstrapped likelihood ratio test for finite samples. We next consider how these missing mechanisms affect approaches to handling missing information in the analysis of case-control data.

6.3.2 Accounting for Missing Data in Analysis

It is appropriate to simply ignore the process that causes missing data if they are 'missing completely at random' (MCAR), as parameter estimates will be unbiased. In practice, it is usually difficult to meet the MCAR assumption. Missing at random (MAR) is an assumption that is made more frequently, but it is not always tenable either. Therefore, it is important to consider the missing mechanism when missing data exist. There are many varied methods for analyses with missing data, and it is an active research area.
We classify them into five approaches to handling missing data:
1) deletion
2) use of missing indicators
3) imputation
4) maximum likelihood
5) Bayesian methods

The Deletion Approach. The deletion approach includes complete-case (CC, also known as listwise deletion) analysis and available-case (AC, also known as pairwise deletion) methods. One can use subjects with complete data (CC) or use all available subjects (AC). Deletion is the most common method and is the standard treatment offered in most statistical packages because the analysis is easy to conduct without any extra effort to impute data. Deletion, in fact, implicitly assumes that the retained subjects are like a random subsample (i.e. MCAR). However, deletion wastes information by deleting subjects, and has the potential to introduce bias and give misleading statistical inference when there is no good evidence that the missing data are MCAR or MAR. Nevertheless, CC estimates are unbiased if missingness does not depend on the outcomes [46].

The Missing Indicator Approach. To accommodate missing covariate values, we can add a "missing indicator" and assign the covariate the referent value. In this way, we can still utilize the subjects' other available covariate data instead of wasting them by deleting the subjects from the analysis. For example, if an individual has a missing value on covariate Z (coded 0 = no (reference) and 1 = yes), then this individual would be assigned a value of 1 for the missing indicator and a value of 0 for covariate Z. Subjects without missing information on Z would be assigned a value of 0 for the missing indicator and keep their value of covariate Z. The missing indicator approach has been suggested by Anderson et al. [5], Cohen and Cohen [15], Huberman and Langholz [31], and Miettinen [54].
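The coding rule in the example above can be written out as a short sketch. The function name `missing_indicator_code` and the use of `None` to mark a missing value are illustrative assumptions, not a reference to any particular software.

```python
def missing_indicator_code(values, reference=0):
    """Recode a covariate with missing values using a missing indicator.

    values: covariate values, with None marking a missing value
    Returns a list of (covariate, missing_indicator) pairs:
      missing  -> (reference value, 1)
      observed -> (observed value, 0)
    """
    coded = []
    for value in values:
        if value is None:
            coded.append((reference, 1))
        else:
            coded.append((value, 0))
    return coded


# Z coded 0 = no (reference), 1 = yes; the second subject has Z missing.
print(missing_indicator_code([1, None, 0]))
```

Both recoded columns would then enter the regression model together, so that subjects with missing Z still contribute their other covariates.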
However, other researchers [27, 34] have raised the concern that the missing indicator approach does not produce unbiased parameter estimates in the context of logistic regression or simple regression models. We think the missing indicator method should not be completely proscribed, so we will examine the use of the missing indicator approach in the counter-matched data example in Chapter 7.

The Imputation Approach. Imputation is a useful approach that assigns values for missing data, thus allowing analyses to be conducted as if the data set were complete. However, imputation methods should be used cautiously, as a naive or unprincipled imputation method may create more problems than it solves [35]. There are several approaches to imputing missing values, e.g. mean substitution, regression methods, and multiple imputation. Multiple imputation, proposed by Rubin [63, 64, 65], has been used and studied intensively during the last two decades. Multiple imputation uses a Markov Chain Monte Carlo (MCMC) technique in which the missing values are replaced by a few (usually a small number of) repeated imputations. A key requirement of multiple imputation is that the missing data be "missing at random" (MAR).

The Maximum Likelihood Approach. Conventional methods for missing data, such as deletion and regression imputation, may result in biased estimates of parameters, while the maximum likelihood and multiple imputation approaches are approximately unbiased and efficient [4]. The principle of the maximum likelihood approach is fairly simple, but it is computationally complex.

A number of the approaches to addressing missing data are maximum likelihood based. Generalized estimating equations (GEE) [44] are valid with data missing completely at random (MCAR) but not necessarily with data missing at random (MAR).
Breslow and Cain [10] introduced a pseudo conditional likelihood (PCL) approach for two-stage case-control studies with a binary response and discrete covariates, which is applicable when the missingness is MAR. Wang and Wang [75] extended the PCL approach of Breslow and Cain [10] to continuous covariates. Weighted likelihood [24, 76] has been shown to be applicable when missingness is either MCAR or MAR [61]. The mean score method [60, 57], motivated by the EM algorithm [19], can be used for MAR data.

EM Algorithm. There are a number of ways to obtain maximum likelihood estimators, and one of the most common approaches is the Expectation-Maximization (EM) algorithm [19]. The EM algorithm can be used for missing data in either independent or dependent variables. This algorithm is based on iterations of expectation (E-step) and maximization (M-step) to estimate an unknown θ by θ_EM, as shown in Figure 6.1. This method has been performed to impute missing subtype outcomes using available information [69, 70]. The EM algorithm also assumes that the missing data are missing at random. Missing data are fractionally assigned under θ^(p), i.e. according to current model predictions, to create pseudocomplete data (E-step). In the following M-step, the resulting pseudocomplete data are analyzed to provide new estimates. Then, the new model estimates, θ^(p+1), are used for new fractional assignments of the missing data (another E-step). These iterations continue until convergence is met.

Bayesian Approach. For small samples, Bayesian approaches are more desirable than the maximum likelihood methods, which rely on asymptotic results. In his review [46], Little noted that Bayesian approaches have been applied more to missing dependent variables than to missing independent variables. Bayesian approaches are
Bayesian approaches are more sophisticated but, in summary, missing data are treated as parameters and predicted using the posterior predictive distribution [73, 58, 13, 47].

Figure 6.1: The EM algorithm
  Start: any suitable initial value $\theta^{(0)}$.
  E-step: Estimate the sufficient statistics $t$ of $\theta$ by finding $t^{(p)} = E[t \mid y, \theta^{(p)}]$, where $\theta^{(p)}$ is the value of $\theta$ at cycle $p$.
  M-step: Determine $\theta^{(p+1)}$ as the solution of the likelihood equations $E[t \mid \theta] = t^{(p)}$.
  Repeat the E-step and M-step until convergence, which yields $\hat{\theta}_{EM}$.

Software. The calculations for the missing data approaches are usually complex, with the exception of the deletion and the missing indicator approaches. The more complex approaches are not commonly used in epidemiology, no doubt due to their complexity and the lack of packaged software to apply them. However, software packages are available for the purpose of addressing missing data, and some of them are in the public domain:
1) Maximum likelihood approach: Mx is a program available to the public at http://views.vcu.edu/mx.
2) Multiple imputation:
   (a) A multiple imputation program, known as NORM, is available for public download at http://www.stat.psu.edu/~jls/.
   (b) PROC MI and PROC MIANALYZE are found in the SAS package [67].
   (c) Other missing data software information is available at http://www.multiple-imputation.com/.
In the future there will be even more software available to accommodate the more sophisticated approaches. In this dissertation, we only consider the simpler methods, the deletion and the missing indicator approaches, and leave the application of the more sophisticated approaches to counter-matched data as future work.
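The E-step and M-step cycle summarized in Figure 6.1 can be illustrated with a small Python sketch. It is not taken from the dissertation: the function name `em_two_by_two`, the toy setting (a 2x2 table of binary X by binary Y in which some records lack X and others lack Y, with missingness assumed ignorable) and the convergence tolerance are all assumptions chosen for illustration. The E-step fractionally assigns the incomplete records to cells according to the current estimate, and the M-step re-estimates the cell probabilities from the resulting pseudocomplete table.

```python
import numpy as np


def em_two_by_two(full, x_only, y_only, tol=1e-10, max_iter=1000):
    """EM estimate of the cell probabilities theta[x, y] of a 2x2 table.

    full   : 2x2 counts with both X and Y observed
    x_only : length-2 counts with X observed and Y missing
    y_only : length-2 counts with Y observed and X missing
    (hypothetical toy inputs; missingness is assumed to be ignorable)
    """
    full = np.asarray(full, dtype=float)
    x_only = np.asarray(x_only, dtype=float)
    y_only = np.asarray(y_only, dtype=float)
    n = full.sum() + x_only.sum() + y_only.sum()

    theta = np.full((2, 2), 0.25)  # any suitable initial value
    for _ in range(max_iter):
        # E-step: fractionally assign incomplete records to cells using
        # the conditional probabilities implied by the current theta.
        p_y_given_x = theta / theta.sum(axis=1, keepdims=True)
        p_x_given_y = theta / theta.sum(axis=0, keepdims=True)
        pseudo = (full
                  + x_only[:, None] * p_y_given_x
                  + y_only[None, :] * p_x_given_y)
        # M-step: maximum likelihood estimate from the pseudocomplete table.
        new_theta = pseudo / n
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta
        theta = new_theta
    return theta


if __name__ == "__main__":
    estimate = em_two_by_two([[40, 10], [20, 30]], x_only=[12, 8], y_only=[5, 15])
    print(np.round(estimate, 4))
```

With no incomplete records the procedure returns the observed cell proportions after a single update, which is a convenient sanity check on the two steps.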
6.3.3 Accounting for Non-Participation in Analysis

Non-participation (non-response) is one type of missing data. It is considered the most difficult type to address because the data are completely missing. However, non-participation is common in practice and must be addressed, as selection bias may stem from it. The reasons for non-participation are often reported in published articles, but non-participation is rarely addressed in the analysis phase of a study. Accounting for non-participation in the analysis of counter-matched data is important because, as we show later, failing to adjust the risk weights appropriately for non-participation can introduce bias.

In this dissertation, we focus on the deletion and the missing indicator methods to address non-participation in the analysis of nested case-control studies under counter-matching. It is reasonable to use these two approaches, as they are the simplest and most transparent methods and are easily presented in publications. The other, more sophisticated methods will be investigated in future research.

In population-based case-control studies, nonparticipants must be deleted (i.e. the deletion method) because we do not even know who they are (e.g. when a random digit dial (RDD) method is used). Deletion is the most common practice in the analysis of non-participation. However, the naive deletion method discards all non-participating subjects from the analysis, which wastes other available information if the study is a nested case-control study. Moreover, under counter-matching, the simple deletion method without any further adjustment does not provide unbiased estimates. We provide a data example in Chapter 7 to illustrate this method.
Alternative methods are needed for nested case-control studies to correct the bias of the simple deletion method and to use other covariate information that is available from the cohort data. To assess alternative methods, we first formalize the process of non-participation by assuming Bernoulli recruitment for each subject and treating it as a part of the sampling process. This leads to an approach based on an adjustment of the risk weights, which provides unbiased estimates for counter-matched data with non-participation. Second, we adopt the missing indicator method to address non-participation, which makes use of the other covariate information available in the cohort data. In the next sections, we present these approaches to non-participation.

Bernoulli Recruitment. Langholz and Goldstein [41] proposed the likelihood for case and multistage sampling. In this dissertation, we extend this idea and consider the recruitment of cases and sampled controls in a counter-matched design as the second "stage" of sampling. In this approach, we assume Bernoulli recruitment for cases and controls (a 'missing completely at random' assumption). We consider cases and controls to be "sampled" in two successive stages. Let $D$ denote the recruited cases and $\bar{D}$ those not recruited. Then, an appropriate likelihood is based on the probability extended from $\mathrm{pr}(D = d \mid \mathcal{R} = r)$ to a two-stage sampling scheme, with $\mathcal{R}_1 = r_1$ as the sampled risk set at the first stage and $\mathcal{R}_2 = r_2$ as the recruited risk set at the second stage, plus the information that we know who the non-participating (not 'sampled' at the second stage) cases are, i.e. $\bar{D} = \bar{d}$, and that the total number of cases $|\mathbf{D}|$ is known.
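For $r_2 \subset r_1$ and $\bar{d} = \mathbf{d} \setminus d$, where $\mathbf{d} = d \cup \bar{d}$ denotes the realized set of all cases $\mathbf{D}$,

\[
\begin{aligned}
&\mathrm{pr}(D = d \mid \mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d}, |\mathbf{D}|) \\
&\quad= \frac{\mathrm{pr}(\mathbf{D} = \mathbf{d})\, \mathrm{pr}(\mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d} \mid \mathbf{D} = \mathbf{d})}
{\sum_{s \subset r_2 : |s| = |\mathbf{D}| - |\bar{d}|} \mathrm{pr}(\mathbf{D} = s \cup \bar{d})\, \mathrm{pr}(\mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d} \mid \mathbf{D} = s \cup \bar{d})} \\
&\quad= \frac{\lambda^{|d| + |\bar{d}|}\, r_d\, r_{\bar{d}}\, \mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d})\, \mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = \mathbf{d}, r_1)}
{\sum_{s \subset r_2 : |s| = |\mathbf{D}| - |\bar{d}|} \lambda^{|s| + |\bar{d}|}\, r_s\, r_{\bar{d}}\, \mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d})\, \mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = s \cup \bar{d}, r_1)} \\
&\quad= \frac{\lambda^{|\mathbf{D}|}\, r_d\, \mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d})\, \mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = \mathbf{d}, r_1)}
{\sum_{s \subset r_2 : |s| = |d|} \lambda^{|\mathbf{D}|}\, r_s\, \mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d})\, \mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = s \cup \bar{d}, r_1)} \\
&\quad= \frac{r_d\, \mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d})\, \mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = \mathbf{d}, r_1)}
{\sum_{s \subset r_2 : |s| = |d|} r_s\, \mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d})\, \mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = s \cup \bar{d}, r_1)},
\end{aligned}
\tag{6.2}
\]

where $\mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = s \cup \bar{d}, r_1)$ is the probability of the final case-control set $r_2$ with $s$ as the recruited case set, given that $s \cup \bar{d}$ is the set of all cases.

We apply the assumption that recruitment, the second stage of sampling, is based on independent Bernoulli trials with subject-specific participation probability $p_i$, where $i \in r_1$, which is a more general assumption than an equal participation probability. One can then write the probability of the second stage of sampling as

\[
\mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = \mathbf{d}, r_1) = \prod_{i \in r_2} p_i \prod_{j \in r_1 \setminus r_2} (1 - p_j).
\]

Then, equation (6.2) reduces to

\[
\begin{aligned}
&\mathrm{pr}(D = d \mid \mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d}, |\mathbf{D}|) \\
&\quad= \frac{r_d\, \mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d}) \prod_{i \in r_2} p_i \prod_{j \in r_1 \setminus r_2} (1 - p_j)}
{\sum_{s \subset r_2 : |s| = |\mathbf{D}| - |\bar{d}|} r_s\, \mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d}) \prod_{i \in r_2} p_i \prod_{j \in r_1 \setminus r_2} (1 - p_j)} \\
&\quad= \frac{r_d\, \mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d})}
{\sum_{s \subset r_2 : |s| = |d|} r_s\, \mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d})}.
\end{aligned}
\tag{6.3}
\]

In the following, we consider two sampling methods, simple random sampling and counter-matching.

Simple Random Sampling. As shown in Section 3.1, when $m$ controls are sampled by simple random sampling, the sampling probability is

\[
\mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d}) = \pi(r_1 \mid \mathbf{D} = \mathbf{d}) = \binom{n - |\mathbf{d}|}{m}^{-1} I(|r_1| = m + |\mathbf{d}|).
\]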
Under simple random sampling, equation (6.3) becomes

\[
\mathrm{pr}(D = d \mid \mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d}, |\mathbf{D}|) = \frac{r_d}{\sum_{s \subset r_2 : |s| = |d|} r_s}.
\]

In fact, this simple ("unweighted") likelihood works for other sampling schemes as long as the risk weights (or sampling probabilities) are the same within case-control sets (or within each stratum, as in frequency matching). This implies that ignoring non-participation in the analysis is valid when the likelihood is not a function of the risk weights, as with simple random sampling and frequency matching. We examine this approach for counter-matched sampling in the following paragraphs.

Counter-Matching. Under counter-matching, equation (6.3) becomes

\[
\mathrm{pr}(D = d \mid \mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d}, |\mathbf{D}|)
= \frac{r_d \prod_l \binom{n_l - (|d_l| + |\bar{d}_l|)}{m_l|\mathbf{d}| - (|d_l| + |\bar{d}_l|)}^{-1}}
{\sum_{s \subset r_2 : |s| = |d|} r_s \prod_l \binom{n_l - (|s_l| + |\bar{d}_l|)}{m_l|\mathbf{d}| - (|s_l| + |\bar{d}_l|)}^{-1}}.
\tag{6.4}
\]

Equation (6.4) shows that the number of non-participating cases (i.e. $|\bar{d}_l|$) in each sampling stratum $l$ must be considered in analyzing counter-matched data with non-participation. If non-participating subjects are simply deleted from the analysis, i.e. the likelihood is computed with the $|\bar{d}_l|$ terms omitted from the weights (equation (6.5)), the weights are misspecified and the effect estimates are biased (as shown in Table 7.2.b in Chapter 7). The approach based on equation (6.4) is labeled 'weight adjustments'.

The second approach we consider for non-participation is the missing indicator method. All cases and controls in the nested case-control study are used in the analysis, since the case-control status and some other covariate information are known for these subjects from the cohort data. The procedure is to assign a missing indicator to non-participating individuals, similar to the method for subjects with missing data illustrated in Section 6.3.2. Therefore, other covariate information from the cohort data can also be used in the analysis.
An example of a counter-matched design that uses the missing indicator adjustment to address non-participation is presented in Chapter 7.

6.4 Disease Subtype and Covariate-Defined Subset Analysis

The effect of an exposure on disease may differ in magnitude across subgroups. To distinguish how the subgroups are formed, we label the analysis of subtypes defined by the outcome as 'subtype analysis' and the analysis of strata defined by covariates as 'subset analysis'. In the subtype analysis, one might like to investigate the association of each subtype of a disease with an exposure by comparing each subtype with the disease-free group. The purpose of the subset analysis is to investigate effect modification. For example, we estimate the effect of an exposure on the outcome within each gender group to address gender as an effect modifier. Although the purposes of the subtype and subset analyses differ, the approach in both is to fit separate logistic models for the subgroups of interest, defined either by the outcome (subtype analysis) or by the covariates (subset analysis). In other words, the analysis method for the subtype and subset analyses is the same, so we use 'subgroups' to denote both the subtypes of the outcome and the strata of covariates (or effect modifiers).

Following the idea of multistage sampling [41], as was used for non-participation in Section 6.3.3, let $D$ denote the cases in the subgroup and $\bar{D}$ those cases not in the subgroup. We can treat subjects not in the subgroup as non-participating in the analysis. Subgrouping will be the second stage of 'sampling'. With $\mathbf{d} = d \cup \bar{d}$ the set of all cases, one can write the probability of the second stage of sampling as

\[
\mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = \mathbf{d}, r_1) = I(r_2 \subset r_1 \text{ and } r_2 \subset \text{subgroup}).
\]
Then, equation (6.2) reduces to

\[
\begin{aligned}
&\mathrm{pr}(D = d \mid \mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d}, |\mathbf{D}|) \\
&\quad= \frac{r_d\, \mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d})\, \mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = \mathbf{d}, r_1)}
{\sum_{s \subset r_2 : |s| = |\mathbf{D}| - |\bar{d}|} r_s\, \mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d})\, \mathrm{pr}(r_2, \bar{d} \mid \mathbf{D} = s \cup \bar{d}, r_1)} \\
&\quad= \frac{r_d\, \mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d})\, I(r_2 \subset r_1 \text{ and } r_2 \subset \text{subgroup})}
{\sum_{s \subset r_2 : |s| = |\mathbf{D}| - |\bar{d}|} r_s\, \mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d})\, I(r_2 \subset r_1 \text{ and } r_2 \subset \text{subgroup})} \\
&\quad= \frac{r_d\, \mathrm{pr}(r_1 \mid \mathbf{D} = \mathbf{d})}
{\sum_{s \subset r_2 : |s| = |\mathbf{D}| - |\bar{d}|} r_s\, \mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d})},
\end{aligned}
\tag{6.6}
\]

for all $r_2 \subset r_1$ and $r_2 \subset$ subgroup, where $\mathbf{d} = d \cup \bar{d}$ is the set of all cases. We apply this approach to subgroup analysis under simple random sampling and counter-matching in the following.

Subgroup Analysis under Simple Random Sampling. For simple random sampling, each individual in the study base has an equal chance of being selected, so the sampling probability $\mathrm{pr}(r_1 \mid \mathbf{D} = s \cup \bar{d})$ does not depend on $s$ and cancels from equation (6.6). Then, equation (6.6) reduces to

\[
\mathrm{pr}(D = d \mid \mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d}, |\mathbf{D}|) = \frac{r_d}{\sum_{s \subset r_2 : |s| = |d|} r_s}.
\]

This simple ("unweighted") likelihood is appropriate for other sampling schemes, such as frequency matching, as long as the sampling probability is the same for each individual in the case-control set. However, when the sampling probability is not uniform across case-control sets, as in counter-matching, the simple method is not suitable.

Subgroup Analysis under Counter-Matching. In this section, we examine whether the simple stratification method is appropriate for counter-matched data. Under counter-matching, equation (6.6) becomes

\[
\mathrm{pr}(D = d \mid \mathcal{R}_1 = r_1, \mathcal{R}_2 = r_2, \bar{D} = \bar{d}, |\mathbf{D}|)
= \frac{r_d \prod_l \binom{n_l - (|d_l| + |\bar{d}_l|)}{m_l|\mathbf{d}| - (|d_l| + |\bar{d}_l|)}^{-1}}
{\sum_{s \subset r_2 : |s| = |d|} r_s \prod_l \binom{n_l - (|s_l| + |\bar{d}_l|)}{m_l|\mathbf{d}| - (|s_l| + |\bar{d}_l|)}^{-1}},
\tag{6.7}
\]

for all $r_2 \subset r_1$ and $r_2 \subset$ subgroup. Equation (6.7) shows that the number of cases not in the subgroup (i.e. $|\bar{d}_l|$) in each sampling stratum $l$ must be considered in analyzing counter-matched data. If cases not in the subgroup are ignored in the analysis, i.e. the likelihood is computed with the $|\bar{d}_l|$ terms omitted from the weights (equation (6.8)), the weights are misspecified and the effect estimates are biased. In Chapter 7, we provide a data example to illustrate the consequences of not adjusting the weights in the analysis.

Summary. Counter-matching has been shown to be more efficient than traditional sampling methods, but there are a number of analytic issues to consider. We have provided solutions to some of the analytic issues that arise after applying counter-matching to a nested case-control study. The issues encompassed descriptive statistics for counter-matched data, non-participation of subjects, and subgroup analysis, including the subtype analysis (a pairwise approach for a polytomous outcome) and the subset analysis (for considering effect modification). Because the covariate distribution in the counter-matched data was distorted by the sampling process, we used sampling probabilities to estimate the number of controls in the cohort from which the sample arose. For non-participation and subgroup analysis, we needed to adjust the risk weights in the likelihood to accommodate the number of cases who did not participate in the study or who were not in the subgroups of interest, because the likelihood for counter-matching is not independent of the risk weights.

Chapter 7
Analytic Issues: Data Example

Counter-matching has been shown to be more efficient than traditional sampling methods, but there are a number of analytic issues to consider. We proposed methods for addressing these analytic issues for counter-matched nested case-control studies in Chapter 6.
In this chapter, we provide a counter-matched data example to illustrate the approaches proposed for each of the analytic issues. The Early Asthma Risk factors Study (EARS) will serve as the data example. The analytic issues include 1) model fitting, 2) descriptive statistics, 3) non-participation of subjects, and 4) subgroup analysis. The results of applying the proposed approaches to the data example are presented in this chapter, which starts with an introduction to the EARS.

7.1 Data Example: the Early Asthma Risk Factor Study (EARS)

Asthma has emerged as a major worldwide public health problem [32, 33, 7, 59]. The number of people with asthma in the U.S. has more than doubled in the last 20 years, with the most notable increases among preschool children [59]. Based on the 2001 National Health Interview Survey, it was estimated that 113.4 per 1,000 Americans had been diagnosed with asthma by a physician in their lifetime and, among children ages 5-17, the lifetime prevalence was 144.2 per 1,000 [28]. The asthma burden poses a serious public health challenge that is difficult to address because modifiable targets for intervention have yet to be firmly established. Tobacco smoke exposure is an important determinant of childhood asthma occurrence. Secondhand smoke (SHS) is causally related to exacerbations of asthma and increased school absenteeism [16]. Accumulating evidence indicates that maternal smoking is associated with new onset asthma, which may be mediated in part by in utero exposure to maternal smoking [18, 72, 25].

The Children's Health Study (CHS), a cohort study of the effects of air pollution on children's respiratory health, offers an opportunity to investigate the effect of in utero tobacco smoke on the occurrence of asthma and wheezing during childhood.
A total of 6,259 children were recruited into the CHS from public school classrooms in grades 4, 7, and 10 in 12 Southern California communities that were selected based on historical measurements of air quality, demographic similarities, and cooperative school districts. In early 1993, the parents or guardians of each participating student provided written informed consent and completed a written questionnaire. In 1996, a second group of grade 4 students was recruited from schools in the same 12 communities and completed the same protocol. At study entry, parents or guardians completed questionnaires covering demographic information, history of respiratory illness and its associated risk factors, exposure information, and household characteristics. History of in utero exposure to maternal smoking was assessed as exposed or unexposed, and secondhand smoke exposure was estimated by the number of household smokers.

In previous publications from the CHS [25, 26, 48], we reported that in utero exposure to maternal smoking was independently associated with an increased risk of childhood asthma. Based on these findings and other existing evidence, we hypothesized that smoking cessation before or early in pregnancy reduces the risk of asthma, and that heavier smoking during pregnancy is associated with higher risk.

We obtained the information on in utero tobacco smoke exposure from a yes/no question in the main cohort questionnaire. We were unable to investigate the effects of smoking before pregnancy, cessation, or intensity during pregnancy because we lacked information on smoking habits. Therefore, we conducted a case-control study nested within the CHS cohort to collect more detailed information about pre-pregnancy tobacco smoke exposure as well as the timing, intensity, and cessation of smoking during pregnancy. This nested case-control study (EARS) did not attempt to contact mothers of children in the second group, who were recruited in 1996, because these children were still under active follow-up as part of the CHS cohort study. This reduced the eligible cohort to 4,244.

Cases were children reported to have been diagnosed with asthma by a physician prior to age 5 (n=338), based on the answers in the CHS questionnaire. Asthma-free controls (n=570) were counter-matched to cases on in utero exposure to maternal smoking, based on the binary questionnaire responses at CHS cohort entry, within each of the grade, sex, and community of residence subgroups of the cohort (the potential confounding variables). Within each subgroup, the total numbers of subjects (cases and controls) within each sampling stratum, i.e. in utero exposure status, were 1) a multiple of the total number of cases within the subgroup; 2) chosen to yield approximately equal numbers of exposed and unexposed subjects; and 3) increased to account for anticipated non-participation. After excluding the subgroups with no cases, the size of the study population was reduced to 4,082.

Telephone interviews were completed for 691 of the 908 cases and sampled controls. The participation rate was 72.3% (n=412) for controls and 82.5% (n=279) for cases.

7.2 Analytic Issues

Based on the study design of the EARS and practical issues, new computing and methodological approaches for analyzing the data were required. The EARS data serve as an example to illustrate the analytic issues and proposed approaches for 1) model fitting, 2) descriptive statistics, 3) non-participation, and 4) subtype analysis.

7.2.1 Model Fitting

For complex sampling such as counter-matching, the conditional likelihood is not independent of the risk weights inherited from the sampling process, so the model fitting involves the risk weights. If a log-linear risk model (e.g. a logistic model) is assumed, then we can use the computation strategy, equation 6.1, pointed out by Langholz and Goldstein [41].

The analysis of counter-matched data can be performed using standard conditional logistic likelihood software that allows for an 'offset' in the model. The offsets are the logarithms of the counter-matching risk weights. However, the analytic data set, which includes the new covariates (i.e. the sum of covariates among cases) and the risk weights, needs to be prepared before applying this computation strategy.

We use a case-control set in the EARS as an example to show how the analytic data set is prepared. Consider a counter-matched case-control set with 3 exposed ($m_1|d| = 3$) and 3 unexposed ($m_0|d| = 3$) subjects from a stratum (defined by gender, grade, and community of residence) with a total of 5 exposed ($n_1 = 5$) and 17 unexposed ($n_0 = 17$) children (see Figure 7.1.a). Because 2 out of the 6 children are cases, the number of possible case sets is $\binom{6}{2} = 15$ for this case-control set (see Figure 7.1.b). One and only one of them is the observed case set, which is the first one in the example, and it has the 'pseudo' case-control indicator (CC) equal to 1. The new covariate Z in the likelihood, equation (7.1), is the sum of the Z values of the cases in that combination.

a. One counter-matched case-control set from a stratum with a total of 5 exposed (X=1) and 17 unexposed (X=0) children

  ID   Asthma   X   Z
   1      1     0   0
   2      1     0   0
   3      0     1   1
   4      0     1   1
   5      0     1   0
   6      0     0   0

b. The analytic data set for this case-control set

  Combination #   Case ID   CC   New covariate Z   logw
        1           1, 2     1          0          3.81404
        2           1, 3     0          1          2.24543
        3           1, 4     0          1          2.24543
        4           1, 5     0          0          2.24543
        5           1, 6     0          0          3.81404
        6           2, 3     0          1          2.24543
        7           2, 4     0          1          2.24543
        8           2, 5     0          0          2.24543
        9           2, 6     0          0          3.81404
       10           3, 4     0          2          1.20397
       11           3, 5     0          1          1.20397
       12           3, 6     0          1          2.24543
       13           4, 5     0          1          1.20397
       14           4, 6     0          1          2.24543
       15           5, 6     0          0          2.24543

Figure 7.1: An example of the analytic data set for one case-control set in the Early Asthma Risk factors Study (EARS). X: the counter-matched variable. Z: exposure collected in the EARS. CC: the pseudo case-control indicator, 1 for the observed case set and 0 for the rest. New covariate Z: the sum of Z values in that case set. logw: the logarithm of the risk weight.

Recall that the conditional likelihood for counter-matching is

\[
L_{sample}(\beta)
= \frac{\prod_{j \in D} \exp(\beta Z_j) \left[ \prod_{l=0}^{1} \binom{n_l}{|D_l|} \Big/ \binom{m_l|d|}{|D_l|} \right]}
{\sum_{s} \prod_{j \in s} \exp(\beta Z_j) \left[ \prod_{l=0}^{1} \binom{n_l}{|s_l|} \Big/ \binom{m_l|d|}{|s_l|} \right]}
= \frac{\exp\left[\log(w_D) + \beta \sum_{j \in D} Z_j\right]}
{\sum_{s} \exp\left[\log(w_s) + \beta \sum_{j \in s} Z_j\right]},
\tag{7.1}
\]

where

\[
\log(w_s) = \log\left[ \prod_{l=0}^{1} \binom{n_l}{|s_l|} \Big/ \binom{m_l|d|}{|s_l|} \right].
\]

Therefore, the logw corresponding to the case set {1, 2} is

\[
\log(w_{s=\{1,2\}}) = \log\left[ \frac{\binom{17}{2}}{\binom{3}{2}} \times \frac{\binom{5}{0}}{\binom{3}{0}} \right] = \log\left(\frac{136}{3}\right) = 3.81404,
\]

with $|s_1| = 0$ and $|s_0| = 2$. Consider another case set, {1, 3}. The logw for this case set is

\[
\log(w_{s=\{1,3\}}) = \log\left[ \frac{17}{3} \times \frac{5}{3} \right] = 2.24543,
\]

with $|s_1| = 1$ and $|s_0| = 1$. More detailed information about the weight calculations used when preparing the analytic data set 'pseudo' with the SAS macro 'makepseudo' is given in Appendix B.1.

All analyses can be undertaken using PROC PHREG in SAS/STAT® software [67].
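The preparation of the analytic data set in Figure 7.1.b can be sketched in Python as follows. This is only an illustration and not the 'makepseudo' macro of Appendix B.1: the function names `log_weight` and `make_pseudo` are hypothetical, and the weight is assumed to take the form $\prod_l \binom{n_l}{|s_l|} / \binom{m_l|d|}{|s_l|}$, which reproduces the logw values printed in the figure. The sketch enumerates the 15 candidate case sets, flags the observed one, sums Z over each candidate case set, and computes the log risk weight.

```python
from itertools import combinations
from math import comb, log

# Case-control set of Figure 7.1.a: ID -> (X, Z); IDs 1 and 2 are the observed cases.
subjects = {1: (0, 0), 2: (0, 0), 3: (1, 1), 4: (1, 1), 5: (1, 0), 6: (0, 0)}
observed_cases = (1, 2)
n = {0: 17, 1: 5}  # n_l: children in each sampling stratum of the cohort stratum
m = {0: 3, 1: 3}   # m_l|d|: subjects in each sampling stratum of the case-control set


def log_weight(case_set):
    """Log risk weight of a candidate case set (assumed form, see lead-in)."""
    total = 0.0
    for stratum in n:
        s_l = sum(1 for i in case_set if subjects[i][0] == stratum)
        total += log(comb(n[stratum], s_l) / comb(m[stratum], s_l))
    return total


def make_pseudo():
    """Build one row per candidate case set, in the order of Figure 7.1.b."""
    rows = []
    candidates = combinations(sorted(subjects), len(observed_cases))
    for number, case_set in enumerate(candidates, start=1):
        rows.append({
            "combination": number,
            "case_ids": case_set,
            "CC": int(case_set == observed_cases),       # pseudo case-control indicator
            "Z": sum(subjects[i][1] for i in case_set),   # new covariate Z
            "logw": log_weight(case_set),
        })
    return rows


if __name__ == "__main__":
    for row in make_pseudo():
        print(row["combination"], row["case_ids"], row["CC"], row["Z"],
              round(row["logw"], 5))
```

Each row then corresponds to one record of the pseudo data set, with logw supplied as the offset in the conditional logistic model.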
For example, the analysis for exposure Z is carried out by %fit(Z), where the SAS macro %fit is

    %macro fit(indvar);
    proc phreg data=pseudo nosummary;
      model setno*CC(0) = &indvar / offset=logw risklimits;
      strata setno;
    run;
    %mend;

Data set pseudo is the analytic data set for all case-control sets. Variable CC is the pseudo case-control indicator. Variable setno is the set number (or stratum number) in the data set, because counter-matching in the Early Asthma Risk Factor Study (EARS) was conducted within 96 strata. Variable logw in the syntax 'offset=logw' is the logarithm of the risk weights.

7.2.2 Descriptive Statistics

The descriptive statistics of counter-matched samples are not representative of the study base distribution of covariates, especially for covariates correlated with the counter-matched exposure. The distribution of covariates in the counter-matched data set is distorted by the sampling process.

In the counter-matched EARS data, there were 338 cases and 570 controls. The distributions of covariates among the counter-matched controls do not represent the distributions of covariates among the controls in the study base. In the most obvious example, the percentage with in utero exposure to maternal smoking is enriched by the sampling process, because this is the counter-matching variable and we set its proportion in the counter-matched sample to 50%. Therefore, the proportion exposed among the counter-matched controls is much higher than the proportion exposed among the controls in the study base. After the adjustment discussed in Section 6.2, the estimated percentage with in utero exposure to maternal smoking in the entire control group is 19% (not 65%), and the real proportion for this cohort is 19% (Table 7.1).
The more strongly a covariate is related to the counter-matching variable, the greater the distortion of its distribution. Secondhand smoke exposure is related to in utero exposure to maternal smoking, so controls exposed to secondhand smoke were oversampled in the counter-matched sample (63%) compared with the estimated percentage (39%) and the real proportion (40%) in the study base. On the other hand, because gender is not related to in utero exposure to maternal smoking, the percentage of boys among the counter-matched controls (57%) is close to the percentage estimated using sampling probabilities (48%) and the percentage in the study base (48%). These results imply that computing descriptive statistics and covariate distributions directly from the counter-matched sample is not appropriate. An adjustment using the sampling probability is needed.

Table 7.1: Examples of descriptive statistics under counter-matching, compared to the descriptive statistics in the cohort.

                                             Cases (%)    CM (a)         Estimated (b)   Controls (c)
  Covariate                                               Controls (%)   Controls (%)    in Cohort (%)
                                             (n=338)      (n=570)        (n=3744)        (n=3744)
  In utero exposure to maternal smoking       86 (25%)    371 (65%)       727 (19%)       727 (19%)
  Secondhand smoke exposure                  143 (42%)    358 (63%)      1443 (39%)      1509 (40%)
  Boys                                       214 (63%)    327 (57%)      1811 (48%)      1811 (48%)

  (a) Counter-matched controls in the nested case-control study
  (b) Using sampling probability to estimate the numbers in the full study base
  (c) The numbers of controls in the full study base

7.2.3 Non-Participation

There are many different approaches available for addressing missing data in the analysis of epidemiologic data. Non-participation is not uncommon in practice, but it receives less attention than missing data. A number of approaches have been proposed for addressing missing data, and we classified them into five approaches in Section 6.3.2. Currently, the deletion method is the best practical alternative for addressing non-participation in the analysis. In nested case-control studies, we have cohort data to use for the non-participants. We can choose to delete them or to include them with missing indicators.

Here we examine two missing data methods (the deletion and missing indicator approaches) for the EARS data because they are simpler and less computationally intensive and, importantly, because they can be fit with standard statistical software. If the deletion and missing indicator approaches perform adequately, they will provide simple and transparent methods for substantive data analysis.

The goal of the EARS was to collect additional information on the children's early life exposures by interviewing the children's mothers. After controls were counter-matched to the 338 cases on in utero exposure to maternal smoking, there were 570 controls identified from the cohort. When recruitment was finished, 59 (17%) cases and 158 (28%) controls had not been interviewed, leaving 279 cases and 412 controls.

Simple Deletion. We tested the performance of the deletion method for non-participation in counter-matched data. The naive deletion method discards all non-participating subjects from the analysis without any further adjustment. This simple deletion method resulted in a biased estimate, OR=1.8 (95% CI 1.2-2.6) in the EARS (Table 7.2.b), versus OR=1.4 (95% CI 1.1-1.9) estimated in the full study base (Table 7.2.a). Furthermore, the confidence interval is wider with the simple deletion method than with the full study base. These results indicate that the simple deletion method is inappropriate under counter-matching.

Table 7.2: Estimated odds ratios (OR) and 95% confidence intervals (CI) of in utero exposure to maternal smoking on asthma with and without the consideration of non-participation, the Early Asthma Risk factors Study (EARS)

      Data         Method addressing non-participation    OR (95% CI)
  a.  Study Base                                          1.4 (1.1-1.9)
  b.  EARS         Simple deletion                        1.8 (1.2-2.6)
  c.  EARS         Deletion with weight adjustments       1.5 (1.1-2.1)
  d.  EARS         Missing indicator adjustments          1.5 (1.1-2.1)

Deletion with Weight Adjustments. In Figure 7.2.a, one case-control set (set number 61) from the EARS is shown to demonstrate the weight adjustment procedure. Let CC be the case-control status, X the counter-matching variable, and Z the new covariate collected in the counter-matched sample. The non-participating subjects with missing Z values (marked by a question mark) in the rectangle are deleted from the analysis, but the weights need to account for the number of non-participating cases, $|\bar{d}_0|$ for unexposed (X=0) cases and $|\bar{d}_1|$ for exposed (X=1) cases, as in equation (6.4). In this example, there are three non-participating cases. Among these 3 cases, one was unexposed and two were exposed to maternal smoking during pregnancy (the variable X), so we have $|\bar{d}_0| = 1$ and $|\bar{d}_1| = 2$. The appropriate risk weight for this case-control set is

\[
\prod_l \binom{n_l - (|s_l| + |\bar{d}_l|)}{m_l|\mathbf{d}| - (|s_l| + |\bar{d}_l|)}^{-1}
= \binom{29 - (1 + 1)}{6 - (1 + 1)}^{-1} \binom{7 - (1 + 2)}{7 - (1 + 2)}^{-1}.
\]

The bias resulting from the simple deletion method can be corrected by suitable weight adjustments.
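The weight adjustment for set 61 can be checked numerically with the short Python sketch below. It is an illustration only: the function name `adjusted_weight` is hypothetical, the values of $n_l$ and $m_l|\mathbf{d}|$ are read from Figure 7.2.a (29 and 6 for X=0, 7 and 7 for X=1), and the weight is assumed to have the form $\prod_l \binom{n_l - (|s_l| + |\bar{d}_l|)}{m_l|\mathbf{d}| - (|s_l| + |\bar{d}_l|)}^{-1}$ evaluated over candidate case sets drawn from the six participating subjects.

```python
from itertools import combinations
from math import comb

# Values read from Figure 7.2.a for case-control set 61 (an assumption of this sketch).
n = {0: 29, 1: 7}      # n_l: cohort subjects in sampling stratum l
m = {0: 6, 1: 7}       # m_l|d|: subjects sampled from stratum l at the first stage
d_bar = {0: 1, 1: 2}   # non-participating cases in stratum l
participants = {1: 0, 2: 1, 3: 0, 4: 0, 5: 1, 6: 1}  # participating ID -> X
observed_cases = (1, 2)


def adjusted_weight(case_set):
    """Risk weight of a candidate case set, adjusted for non-participating cases."""
    weight = 1.0
    for stratum in n:
        s_l = sum(1 for i in case_set if participants[i] == stratum)
        removed = s_l + d_bar[stratum]
        weight /= comb(n[stratum] - removed, m[stratum] - removed)
    return weight


# One adjusted weight per candidate case set of the same size as the observed one.
weights = {s: adjusted_weight(s)
           for s in combinations(sorted(participants), len(observed_cases))}

if __name__ == "__main__":
    for case_set, weight in weights.items():
        print(case_set, weight)
```

For the observed case set (one unexposed and one exposed participating case) the product is $\binom{27}{4}^{-1}\binom{4}{4}^{-1} = 1/17550$. Because all 7 exposed subjects in this stratum were sampled, the exposed-stratum factor equals 1 for every candidate case set.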
Missing Indicator Adjustments. An alternative to deletion with weight adjustments is to use missing indicators and to replace missing values by a referent value, as discussed in Section 6.3.2. The difference from weight adjustments is that all subjects in the risk set are included in the analysis (Figure 7.2.b). MISS is a missing indicator, and this dummy variable enters the model as a covariate in addition to Z. Since all sampled subjects are in the data set for analysis, we do not need to adjust the weights. In the following, we use a data example to examine whether the missing indicator method provides unbiased estimates, alongside the deletion method with weight adjustments.

a. Deletion with Weight Adjustments

ID  SETNO  CC  X  Z  n_l  m_l  |d_0|  |d_1|
 1   61     1  0  1   29    6     1      2     included in the data set for analysis
 2   61     1  1  1    7    7     1      2     included in the data set for analysis
 3   61     0  0  0   29    6     1      2     included in the data set for analysis
 4   61     0  0  0   29    6     1      2     included in the data set for analysis
 5   61     0  1  1    7    7     1      2     included in the data set for analysis
 6   61     0  1  1    7    7     1      2     included in the data set for analysis
 7   61     1  0  ?   29    6                  non-participating case
 8   61     1  1  ?    7    7                  non-participating case
 9   61     1  1  ?    7    7                  non-participating case
10   61     0  0  ?   29    6                  non-participating control
11   61     0  0  ?   29    6                  non-participating control
12   61     0  1  ?    7    7                  non-participating control
13   61     0  1  ?    7    7                  non-participating control

Non-participating cases: |d_0| = 1 and |d_1| = 2.

b. Missing Indicator Adjustments (all rows included in the data set for analysis)

ID  SETNO  CC  X  Z  MISS  n_l  m_l
 1   61     1  0  1   0     29    6
 2   61     1  1  1   0      7    7
 3   61     0  0  0   0     29    6
 4   61     0  0  0   0     29    6
 5   61     0  1  1   0      7    7
 6   61     0  1  1   0      7    7
 7   61     1  0  0   1     29    6
 8   61     1  1  0   1      7    7
 9   61     1  1  0   1      7    7
10   61     0  0  0   1     29    6
11   61     0  0  0   1     29    6
12   61     0  1  0   1      7    7
13   61     0  1  0   1      7    7

Figure 7.2: Weight adjustments and missing indicator adjustments to account for non-participation, shown for one case-control set from the Early Asthma Risk factors Study. Observations 7 to 13 (inside the rectangle in the original figure) are non-participating subjects.
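The data preparation behind the two panels of Figure 7.2 can be sketched in a few lines. This is only an illustration under stated assumptions: the tuple layout `(ID, CC, X, Z)`, the use of `None` for a missing Z, and the helper names `deletion_dataset` and `missing_indicator_dataset` are hypothetical and are not taken from the EARS analysis code. The sketch covers the record handling only, not the risk-weight computation of equation (6.4).

```python
# Case-control set 61 as read from Figure 7.2; Z is None for non-participants.
# Tuple layout (an assumption for this sketch): (ID, CC, X, Z)
SET_61 = [
    (1, 1, 0, 1), (2, 1, 1, 1), (3, 0, 0, 0), (4, 0, 0, 0),
    (5, 0, 1, 1), (6, 0, 1, 1),
    (7, 1, 0, None), (8, 1, 1, None), (9, 1, 1, None),
    (10, 0, 0, None), (11, 0, 0, None), (12, 0, 1, None), (13, 0, 1, None),
]


def deletion_dataset(records):
    """Drop non-participants and count deleted cases by exposure stratum.

    Returns the retained records and a dict {0: |d_0|, 1: |d_1|} holding the
    number of non-participating cases with X=0 and X=1, which the weight
    adjustment needs.
    """
    kept = [r for r in records if r[3] is not None]
    deleted_cases = [r for r in records if r[3] is None and r[1] == 1]
    d = {x: sum(1 for r in deleted_cases if r[2] == x) for x in (0, 1)}
    return kept, d


def missing_indicator_dataset(records, referent=0):
    """Keep every record; set missing Z to the referent value and flag it."""
    return [
        (rid, cc, x, referent if z is None else z, int(z is None))
        for rid, cc, x, z in records
    ]


kept, d = deletion_dataset(SET_61)
with_miss = missing_indicator_dataset(SET_61)
print(len(kept), d)                           # retained rows and deleted-case counts
print(sum(row[4] for row in with_miss))       # number of rows flagged MISS=1
```

In the deletion data set only the six participants remain and the counts in `d` feed the adjusted weights; in the missing-indicator data set all thirteen rows remain and `MISS` is added to the model as a covariate.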
Comparisons of Approaches. We estimated the effect of in utero exposure to maternal smoking on asthma using 1) simple deletion; 2) deletion with weight adjustments; and 3) missing indicator adjustments, and compared the results with those from the full study base. Ideally, we should obtain the same (or at least a very close) effect estimate for the counter-matching variable from both the full study base and the counter-matched data. The comparison of the three approaches with the full study base is summarized in Table 7.2. The estimated odds ratio (OR) using the simple deletion approach (OR=1.8 (95% CI 1.2-2.6), Table 7.2.b) is larger than the OR estimated using the full study base (OR=1.4 (95% CI 1.1-1.9), Table 7.2.a). The OR is very close to the OR estimated in the study base if we take non-participation into account, either by weight adjustments (OR=1.5 (95% CI 1.1-2.1), Table 7.2.c) or by missing indicator adjustments (OR=1.5 (95% CI 1.1-2.1), Table 7.2.d). Moreover, the confidence intervals for the weight adjustments and the missing indicator adjustments are narrower than that for the simple deletion method. This example confirms that some adjustment for non-participation is required when data are collected using complex sampling such as counter-matching.

7.2.4 Outcome with Subtypes: Subtype Analysis

The exposure effect on disease may differ in magnitude in different subgroups defined by the outcome (the subtype analysis) or by covariates (the subset analysis). The analysis method for the subtype and subset analyses is the same, although the underlying reasons for the two analyses are different. Because the proposed method applies to both, we use the subtype analysis as an example.
This is a restricted analysis using pairwise comparisons for subtypes of the outcome. Later, in Chapters 8 and 9, we will use another approach, polytomous logistic regression. The subtypes considered here are defined by age (early or late) at asthma onset. In EARS, 76.3% of asthma cases were diagnosed by 3 years of age (i.e. early onset). We hypothesized that in utero smoke exposure would have a stronger effect on early onset asthma than on late onset asthma.

As discussed in Section 6.4, to perform a subtype analysis in counter-matched data, the weights in the analysis need to be adjusted for the number of cases not in the subtype; otherwise, the results will be biased. For example, when estimating the exposure effect on early onset asthma, relative to no asthma, we adjust the risk weights by considering the number of children with late onset asthma.

Relative to no asthma, using the naive approach, the OR estimate of in utero smoke exposure on asthma is 1.8 (95% CI 1.0-3.3) for early onset and 0.3 (95% CI 0.1-3.2) for late onset asthma (Table 7.3.b). The OR estimate for early onset asthma using the naive approach shows no evidence of bias, while the OR estimate for late onset asthma is biased (compared to 0.8 (95% CI 0.5-1.6) using the cohort data, Table 7.3.a). Moreover, the confidence intervals of both ORs are wider than those using the cohort data.

Table 7.3: Estimated odds ratios (ORs) and 95% confidence intervals (CI) of in utero exposure to maternal smoking for early and late onset asthma in the full cohort and in the EARS case-control sample with and without weight adjustments.

                             Asthma
   Data      Method      Early Onset OR (95% CI)   Late Onset OR (95% CI)
a. Cohort    Pairwise*   1.7 (1.3-2.1)             0.8 (0.5-1.6)
b. EARS      Naive†      1.8 (1.0-3.3)             0.3 (0.1-3.2)
c. EARS      Adjusted‡   1.8 (1.2-2.5)             1.0 (0.5-1.9)

* no asthma as the common reference group
† simple, with no weight adjustments
‡ subtype analysis with weight adjustments

After applying the proposed weight adjustments, the OR estimates for both early and late onset asthma are not biased (Table 7.3.c compared to Table 7.3.a). The OR of in utero smoke exposure on early onset asthma is 1.8, compared to 1.7 in the cohort data, and the OR of in utero smoke exposure on late onset asthma is 1.0, compared to 0.8 in the cohort data. Furthermore, the confidence intervals are narrower than those estimated using the naive method and are similar to the confidence intervals using the cohort data.

In this example, the results of the subtype analysis using the naive and the adjusted methods are compared to the OR estimates using the full cohort data, which is the gold standard. Using the counter-matched data is equivalent to using the full cohort data to estimate the effect of the counter-matching variable, because the risk weights carry the information on the marginal distribution of the counter-matching variable from the cohort to the counter-matched sample. In fact, this is why counter-matching can provide unbiased estimates. Therefore, the foundation of the analysis of counter-matched data is the use of appropriate weights.

Summary. Counter-matching is more efficient than traditional sampling methods, although a number of analytic issues are inherent in the sampling process. Because the likelihood for counter-matching is not independent of the risk weights, we need to adjust the risk weights to address the analytic issues discussed in Chapter 6. The approaches proposed in this chapter provide solutions to these analytic issues and are shown to be appropriate.
In the next chapters, we consider a polytomous approach to the analysis of subtypes of outcome for a counter-matched design.

Chapter 8

Polytomous Outcomes in Counter-Matched Nested Case-Control Studies

Logistic regression techniques are widely used for estimating the effects of risk factors on a dichotomous outcome. However, the extension to a polytomous response has received less attention. This is not due to a lack of interest in polytomous responses; rather, it results from the tradition of using a pairwise approach that models several dichotomous outcomes. In the EARS, we were interested in the effect of smoking exposure on age at asthma onset, so we defined a polytomous outcome as no asthma, early onset asthma, and late onset asthma. Thus, an approach to modelling polytomous outcomes for counter-matched studies was needed.

A polytomous outcome is sometimes purely categorical, although it may be on an ordinal scale. The conceptualization and treatment of the polytomous dependent variable are the main issues that determine the method of analysis. A polytomous outcome may be conceptualized as:

1) points along a linear scale (e.g. 1, 2, 3, ...) or some arbitrary weighting scheme (e.g. 1, 2.5, 5, 10, ...);
2) a nominal dependent variable compared with a common level, a baseline; or
3) a series of events, such that each subsequent outcome depends on the attainment of the prior one.

Because regular regression models can be used for the first type of polytomous outcome, this dissertation will focus on the last two types. Given that the analysis methods differ for nominal and ordinal polytomous outcomes, we consider each separately in the following two sections (Sections 8.1.1 and 8.1.2).
8.1 Methods for Analyzing Polytomous Outcomes

The methods of analysis for polytomous outcomes depend on the type of the polytomous dependent variable (e.g. nominal or ordinal). Therefore, this section describes some available analysis methods by the type of polytomous outcome: nominal or ordinal.

8.1.1 Nominal Outcomes

Case-control studies may have more than one control group to compare with a single case group, or multiple case groups to compare with a single control group. In studies where the multiple case or control groups are nominal, an extended logistic regression model is needed for testing whether the relative odds vary with outcomes.

Logit and Probit Models. Logit and probit models allow a mixture of categorical and continuous independent variables with respect to a polytomous dependent variable. Although both models (logit and probit) could be considered, the probit model is difficult to employ in practice, making it a less attractive alternative.

1. Multinomial Logit Models. A widely used functional form for discrete probabilities is the multinomial logit (MNL) model, termed a conditional logit model by McFadden [52]. For subject $i$ and response choice $j$, let $x_{ij}$ denote the values of the explanatory variables. The MNL model is defined as

$$P(j \mid x_{i}, C_i) = \frac{\exp(\beta' x_{ij})}{\sum_{k \in C_i} \exp(\beta' x_{ik})}, \tag{8.1}$$

the response probability given the set of response choices $C_i$ for subject $i$ and the explanatory variables $x_{i}$. It is widely known that a potentially important drawback of the MNL model is the independence from irrelevant alternatives (IIA) property, named by Luce [49]. The IIA property states that the odds of choosing one alternative over another do NOT depend on the other alternatives in the choice set or on the values of the explanatory variables for those alternatives.
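The IIA property follows directly from the form of equation (8.1), and a small numerical sketch may make this concrete. The function name `mnl_probabilities`, the coefficient values, and the three alternatives below are made up for illustration; only the functional form comes from (8.1).

```python
import math


def mnl_probabilities(beta, choice_set):
    """Response probabilities under the MNL form of equation (8.1).

    beta       : list of coefficients
    choice_set : dict mapping each alternative j to its covariate vector x_ij
    """
    scores = {
        j: math.exp(sum(b * x for b, x in zip(beta, xs)))
        for j, xs in choice_set.items()
    }
    total = sum(scores.values())
    return {j: s / total for j, s in scores.items()}


# Hypothetical coefficients and covariates, chosen only for illustration.
beta = [0.5, -1.0]
full_set = {"A": [1.0, 0.2], "B": [0.3, 0.1], "C": [2.0, 1.5]}
reduced_set = {"A": full_set["A"], "B": full_set["B"]}  # alternative C removed

p_full = mnl_probabilities(beta, full_set)
p_reduced = mnl_probabilities(beta, reduced_set)

# The individual probabilities change when C is removed,
# but the odds of A versus B do not: this is the IIA property.
odds_full = p_full["A"] / p_full["B"]
odds_reduced = p_reduced["A"] / p_reduced["B"]
print(odds_full, odds_reduced)
```

Because the denominator of (8.1) cancels in any ratio of two probabilities, the odds of A versus B equal $\exp\{\beta'(x_{iA} - x_{iB})\}$ whatever else is in the choice set.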
The IIA property is not a realistic assumption in practical circumstances.

2. Multinomial Probit Models. An alternative specification to MNL models that does not require the IIA assumption is the multinomial probit (MNP) model [29, 36, 55]. However, the use of the MNP model has been limited by its complexity: multivariate normal integrals must be evaluated to estimate the unknown parameters. Therefore, we do not consider this class of models.

Polytomous and Pairwise Logistic Regression. Polytomous logistic regression models have not been commonly used in the analysis of epidemiologic data. Researchers tend to employ a simplified analytic method in which each response category is individually compared with a baseline category using regular logistic models. We refer to this approach as a pairwise logistic regression analysis. The subtype analysis discussed in Chapters 6 and 7 is a pairwise approach. The pairwise approach, however, does not fully utilize the available information in the data. The polytomous approach allows all the available information to be used simultaneously.

Polytomous versus Pairwise Logistic Regression. Both the polytomous and pairwise methods have their own advantages and limitations. For a case-control study with multiple case or control groups, Liang and Stewart [43] stated that a polytomous approach is generally more efficient than a pairwise approach, based on simulation studies for a single case group accompanied by two control groups from different populations. They also considered missing data in the simulations by examining different missing percentages. They recommend the polytomous
approach because it is more efficient than the pairwise approach, even when 40% of the case-control sets are incomplete.

Liang and Stewart [43] also pointed out a problem with the pairwise approach when several case subgroups are studied. One can simply compare two case groups in the absence of a control group to test whether risk factors differ between these two case subgroups. However, if, for instance, the odds ratio is different from one, it is not possible to determine whether the factor is associated with an increased risk in one case group, a decreased risk in the other case group, or both.

By contrast, Begg and Gray [8] studied the asymptotic relative efficiencies of the pairwise approach relative to the polytomous method, and they preferred the pairwise approach, which they called "individualized regressions", because it is computationally less cumbersome. However, Begg and Gray [8] agreed that jointly testing the parameters from different models requires additional computations on the joint covariance matrix.

Neither study considered issues related to sampling. An evaluation of the polytomous and pairwise approaches in the context of counter-matched sampling is reported in Section 9.2 of this dissertation.

8.1.2 Ordinal Outcomes

Although our main interest in this dissertation is nominal polytomous outcomes, studies also collect ordered outcome data. It has been shown that there are substantial advantages in using an ordinal procedure over the multinomial (polytomous) logistic model when the response variable is ordinal [12]. Ordered polytomous regression has been discussed by several researchers, including McCullagh [50] and Agresti [1, 2, 3], and a summary of several logit and probit models is outlined here.
Let $G$ denote the cumulative distribution function (cdf) of a random variable. Then, the cumulative link model is specified as

$$G^{-1}[F_j(x)] = \theta_j - \beta' x,$$

linking the cumulative probabilities $F_j$ to the linear predictor through the link function $G^{-1}$. This model fits a common slope $\beta$; it is a parallel lines regression model based on the cumulative probabilities of the response categories because of the ordinal nature of the data. In a logit model, the link function is the natural log of the odds, $G^{-1}(u) = \ln[u/(1-u)]$. In a probit model, the link function is the inverse of the standard normal cumulative distribution function, $G^{-1}(u) = \Phi^{-1}(u)$, where $\Phi(\cdot)$ is the standard normal cdf. The term "probit" was coined in the 1930's by Chester Bliss and stands for probability unit [66]. When the outcome is ordinal, ordered logit and cumulative probit models are most commonly used.

Ordered Logit Models. This class of logit models includes adjacent-categories logits [2, 3], continuation-ratio logits [23], and cumulative logits [50]. Let $\pi_1(x), \ldots, \pi_J(x)$ denote the response probabilities (for a $J$-level response variable, $Y$) at value $x$ for a set of explanatory variables, i.e. $\pi_j(x) = \Pr(Y = j \mid x)$.

1. Adjacent-Categories Logits. Based on modelling the logits for adjacent categories pairwise, the adjacent-categories logit models are

$$L_j(x) = \log\left[\frac{\pi_j(x)}{\pi_{j+1}(x)}\right] = \theta_j + \beta' x, \quad j = 1, \ldots, J-1,$$

where $\beta$ represents the log odds ratio for $Y = j$ versus $Y = j+1$ per unit change in $x$. These adjacent-categories logits are equivalent to the baseline-category logits $L_j^*(x) = \log[\pi_j(x)/\pi_J(x)]$ through the relationship $L_j^* = \sum_{k=j}^{J-1} L_k$.

2. Continuation-Ratio Logits. Based on modelling the logits for each category compared with all higher categories, the continuation-ratio logits are defined as

$$\log\left[\frac{\pi_j(x)}{\pi_{j+1}(x) + \cdots + \pi_J(x)}\right] = \theta_j + \beta' x, \quad j = 1, \ldots, J-1.$$

When the logit link function is replaced by the complementary log-log link $G^{-1}(u) = \log[-\log(1-u)]$, the resulting model is the Cox proportional hazards model [17], which is widely used in survival analysis. One can also define the continuation-ratio logits in terms of $Y = j$ versus $Y < j$. However, Hosmer and Lemeshow [30] pointed out that these two parameterizations are not equivalent.

McCullagh and Nelder [51] suggested that the continuation-ratio logit model is appropriate when the response is nested or hierarchical (Figure 8.1). The response categories there are alive, death from causes other than cancer, and death from cancers other than lung cancer. In this hierarchical structure, the size of the 'risk set' is reduced level by level. The risk set starts at size $n$ and becomes $n - y_1$ once the $y_1$ subjects are removed at 'Stage 1'. Corresponding to the dichotomy at each level, using the continuation-ratio logit models, we can model total mortality in relation to exposure $x$ at Stage 1, cancer mortality as a proportion of total mortality in relation to exposure at Stage 2, and so on.

3. Cumulative Logits. The cumulative logits are

$$\log\left[\frac{\pi_1(x) + \cdots + \pi_j(x)}{\pi_{j+1}(x) + \cdots + \pi_J(x)}\right], \quad j = 1, \ldots, J-1.$$

Cumulative logit models can be viewed as using $G^{-1}(u) = \log[u/(1-u)]$ in the cumulative link model. McCullagh [50] considered a proportional odds model based on cumulative logits, assuming that $\beta$ does not depend on $j$ (the "proportional odds" assumption),

$$\log\left[\frac{\pi_1(x) + \cdots + \pi_j(x)}{\pi_{j+1}(x) + \cdots + \pi_J(x)}\right] = \theta_j - \beta' x. \tag{8.2}$$

[Figure 8.1 (tree diagram). Study base, n. Stage 1: Alive, y1 | Dead, n-y1. Stage 2: Other causes, y2 | Cancer, n-y1-y2. Stage 3: Other cancers, y3 | Lung cancer, n-y1-y2-y3. Stage 4: Other subtypes, y4 | Small cell carcinoma, n-y1-y2-y3-y4.]

Figure 8.1: An example of hierarchical polytomous responses.

The proportional odds model is, in fact, equivalent to the linear logistic model for a binary response with outcomes $Y \le j$ and $Y > j$, for fixed $j$. Moreover, Läärä and Matthews [38] have explicitly shown that when the complementary log-log link is used, the proportional odds and the continuation-ratio models are identical. If the proportional odds assumption is not valid, one can relax the model as

$$\log\left[\frac{\pi_1(x) + \cdots + \pi_j(x)}{\pi_{j+1}(x) + \cdots + \pi_J(x)}\right] = \theta_j - \beta_j' x.$$

This latter model can be used to test the assumption of proportionality in $\beta$. When this assumption does not hold, extended models (partial proportional odds models), which allow non-proportional odds for a subset of the explanatory variables, were proposed by Peterson and Harrell [56].

Ordered Probit Models. Another class of models is the ordered probit models, which Agresti [3] named cumulative probit models. These models apply a different link function than the ordered logit models. Using the standard normal cdf for $G$ gives the ordered probit model, a natural generalization of the binary probit model to ordinal response categories. A typical use of probit models is to analyze dose-response data in medical studies. For the ordered probit models, we would have

$$\begin{aligned}
\pi_1(x) &= \Phi(-\beta' x) \\
\pi_2(x) &= \Phi(\theta_2 - \beta' x) - \Phi(-\beta' x) \\
\pi_3(x) &= \Phi(\theta_3 - \beta' x) - \Phi(\theta_2 - \beta' x) \\
&\;\;\vdots \\
\pi_J(x) &= 1 - \Phi(\theta_{J-1} - \beta' x).
\end{aligned}$$
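To make the cumulative link construction concrete, the following sketch computes the category probabilities $\pi_j(x)$ as differences of cumulative probabilities $F_j(x) = G(\theta_j - \beta' x)$, once with the logistic cdf (cumulative logit) and once with the standard normal cdf (ordered probit). The function names and the numerical values of the thresholds, $\beta$, and $x$ are illustrative assumptions, not estimates from any data set; the first threshold is set to zero to match the ordered probit display above.

```python
import math


def logistic_cdf(u):
    """G for the cumulative logit model."""
    return 1.0 / (1.0 + math.exp(-u))


def normal_cdf(u):
    """G for the ordered (cumulative) probit model: the standard normal cdf."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def category_probabilities(thresholds, beta, x, cdf):
    """Category probabilities pi_1, ..., pi_J from a cumulative link model.

    thresholds : increasing cut-points theta_1 < ... < theta_{J-1}
    beta, x    : coefficient and covariate vectors of equal length
    cdf        : the distribution function G
    """
    eta = sum(b * v for b, v in zip(beta, x))
    # F_j(x) = G(theta_j - beta'x) for j < J, and F_J(x) = 1.
    cumulative = [cdf(t - eta) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for f in cumulative:
        probs.append(f - previous)
        previous = f
    return probs


# Hypothetical three-category example with theta_1 = 0.
thresholds = [0.0, 1.0]
beta = [0.5]
x = [1.0]

probit = category_probabilities(thresholds, beta, x, normal_cdf)
logit = category_probabilities(thresholds, beta, x, logistic_cdf)
print(probit)
print(logit)
```

With $\theta_1 = 0$ the first probit probability is $\Phi(-\beta' x)$ and the last is $1 - \Phi(\theta_{J-1} - \beta' x)$, as in the display above; swapping the cdf is the only change needed to move between the logit and probit versions.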
Besides the models for analyzing polytomous outcomes summarized in this section, there are other approaches, such as Bayesian techniques, but they will not be presented in this dissertation. We seek an approach to modelling polytomous outcomes for the counter-matched EARS data. The polytomous logistic regression model is an intuitive extension of regular logistic regression. In the next section, we consider polytomous logistic regression models for counter-matched data.

8.2 Polytomous Logistic Regression Model under Complex Sampling

In this section of the dissertation, we focus on multinomial outcomes using polytomous logistic regression. First, we review the likelihood of the polytomous logistic regression in the full study base (Section 8.2.1). Next, we derive two conditional likelihood models under simple sampling (Sections 8.2.2 and 8.2.3), and finally we examine multinomial outcomes under complex sampling (Section 8.2.4).

8.2.1 Likelihood: Full Study Base

The polytomous logistic regression model for a multinomial outcome is an intuitive extension of the regular logistic regression for a dichotomous outcome. In this section we review the likelihood of the polytomous logistic regression in the full study base, together with the score and information derivations under the null hypothesis that an exposure has no effect on any outcome category relative to the reference category.

Suppose an outcome has $r + 1$ mutually exclusive categories, where category 0 is considered the reference category. Assume the probability that an individual $i$ is in disease category $j$ ($j = 0, \ldots, r$) given a covariate $Z_i = (Z_{i1}, \ldots, Z_{ip})$ is

$$\Pr(D = j \mid Z_i) = \frac{\exp(\alpha_j + Z_i\beta_j)}{\sum_{k=0}^{r} \exp(\alpha_k + Z_i\beta_k)}, \tag{8.3}$$

where $D = 0$ denotes disease-free and $\alpha_0 = 0 = \beta_0$. For simplicity, we consider two case groups (i.e.
$r = 2$) accompanied by a single control group. Let $D_j$ denote the set of subjects with disease type $j$ ($j = 1$ and 2) and $D = D_1 \cup D_2$. With $\mathcal{R}$ denoting the full study base, the disease probability is given by

$$\begin{aligned}
\Pr(D = d) &= \prod_{i \in d} \Pr(D = j \mid Z_i) \prod_{i \in \mathcal{R} \setminus d} \Pr(D = 0 \mid Z_i) \\
&= \prod_{j=1}^{2} \prod_{i \in D_j} \frac{\exp(\alpha_j + Z_i\beta_j)}{1 + \sum_{k=1}^{2} \exp(\alpha_k + Z_i\beta_k)} \prod_{i \in \mathcal{R} \setminus D} \frac{1}{1 + \sum_{k=1}^{2} \exp(\alpha_k + Z_i\beta_k)} \\
&= Q_n(\alpha, \beta) \prod_{j=1}^{2} \prod_{i \in D_j} \exp(\alpha_j + Z_i\beta_j),
\end{aligned} \tag{8.4}$$

with

$$Q_n(\alpha, \beta) = \prod_{i \in \mathcal{R}} \frac{1}{1 + \sum_{k=1}^{2} \exp(\alpha_k + Z_i\beta_k)}.$$

From the disease probability, equation (8.4), the log-likelihood, score, and information of the polytomous logistic regression model in the full study base can be obtained after some calculations. Let $|D_j|$ denote the number of subjects with disease type $j$. Then the log-likelihood of the model is

$$\begin{aligned}
\ell &= \sum_{i=1}^{n} \sum_{j=1}^{2} (\alpha_j + Z_i\beta_j)\, I(i \in D_j) - \sum_{i=1}^{n} \log\Big[1 + \sum_{k=1}^{2} \exp(\alpha_k + Z_i\beta_k)\Big] \\
&= \sum_{i \in D_1} (\alpha_1 + Z_i\beta_1) + \sum_{i \in D_2} (\alpha_2 + Z_i\beta_2) - \sum_{i=1}^{n} \log\Big[1 + \sum_{k=1}^{2} \exp(\alpha_k + Z_i\beta_k)\Big].
\end{aligned} \tag{8.5}$$

Consequently, writing $\Delta_i = 1 + e^{\alpha_1 + Z_i\beta_1} + e^{\alpha_2 + Z_i\beta_2}$, the score and information elements are given by

$$\begin{aligned}
\frac{\partial \ell}{\partial \alpha_j} &= |D_j| - \sum_{i=1}^{n} \frac{e^{\alpha_j + Z_i\beta_j}}{\Delta_i}, && j = 1 \text{ and } 2; \\
\frac{\partial \ell}{\partial \beta_j} &= \sum_{i \in D_j} Z_i - \sum_{i=1}^{n} \frac{Z_i e^{\alpha_j + Z_i\beta_j}}{\Delta_i}, && j = 1 \text{ and } 2; \\
\frac{\partial^2 \ell}{\partial \alpha_j \partial \alpha_j} &= -\sum_{i=1}^{n} \frac{e^{\alpha_j + Z_i\beta_j}\,(1 + e^{\alpha_k + Z_i\beta_k})}{\Delta_i^2}, && j, k = 1 \text{ and } 2,\ j \ne k; \\
\frac{\partial^2 \ell}{\partial \beta_j \partial \beta_j} &= -\sum_{i=1}^{n} \frac{Z_i^2 e^{\alpha_j + Z_i\beta_j}\,(1 + e^{\alpha_k + Z_i\beta_k})}{\Delta_i^2}, && j, k = 1 \text{ and } 2,\ j \ne k; \\
\frac{\partial^2 \ell}{\partial \alpha_j \partial \beta_j} &= -\sum_{i=1}^{n} \frac{Z_i e^{\alpha_j + Z_i\beta_j}\,(1 + e^{\alpha_k + Z_i\beta_k})}{\Delta_i^2}, && j, k = 1 \text{ and } 2,\ j \ne k; \\
\frac{\partial^2 \ell}{\partial \alpha_j \partial \beta_k} &= \sum_{i=1}^{n} \frac{Z_i e^{\alpha_j + Z_i\beta_j}\, e^{\alpha_k + Z_i\beta_k}}{\Delta_i^2}, && j, k = 1 \text{ and } 2,\ j \ne k; \\
\frac{\partial^2 \ell}{\partial \alpha_1 \partial \alpha_2} &= \sum_{i=1}^{n} \frac{e^{\alpha_1 + Z_i\beta_1}\, e^{\alpha_2 + Z_i\beta_2}}{\Delta_i^2}; \\
\frac{\partial^2 \ell}{\partial \beta_1 \partial \beta_2} &= \sum_{i=1}^{n} \frac{Z_i^2 e^{\alpha_1 + Z_i\beta_1}\, e^{\alpha_2 + Z_i\beta_2}}{\Delta_i^2}.
\end{aligned}$$

Then, assuming the $Z_i$ are independently and identically distributed (iid), the information under the "null" ($H_0: \beta_1 = \beta_2 = 0$) is given by

$$I_n = -\frac{1}{n} E\left[\frac{\partial^2 \ell}{\partial \theta\, \partial \theta^T}\right] = \frac{1}{c_0^2}
\begin{pmatrix}
e^{\alpha_1}(1 + e^{\alpha_2}) & -e^{\alpha_1}e^{\alpha_2} & EZ\, e^{\alpha_1}(1 + e^{\alpha_2}) & -EZ\, e^{\alpha_1}e^{\alpha_2} \\
-e^{\alpha_1}e^{\alpha_2} & e^{\alpha_2}(1 + e^{\alpha_1}) & -EZ\, e^{\alpha_1}e^{\alpha_2} & EZ\, e^{\alpha_2}(1 + e^{\alpha_1}) \\
EZ\, e^{\alpha_1}(1 + e^{\alpha_2}) & -EZ\, e^{\alpha_1}e^{\alpha_2} & EZ^2 e^{\alpha_1}(1 + e^{\alpha_2}) & -EZ^2 e^{\alpha_1}e^{\alpha_2} \\
-EZ\, e^{\alpha_1}e^{\alpha_2} & EZ\, e^{\alpha_2}(1 + e^{\alpha_1}) & -EZ^2 e^{\alpha_1}e^{\alpha_2} & EZ^2 e^{\alpha_2}(1 + e^{\alpha_1})
\end{pmatrix}
= \begin{pmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{pmatrix},$$

where $\theta^T = (\alpha_1, \alpha_2, \beta_1, \beta_2)$ and $c_0 = 1 + e^{\alpha_1} + e^{\alpha_2}$. The inverse information for $\beta_1$ and $\beta_2$, denoted $I^{-1}_{full}$, is

$$I^{-1}_{full} = (I_{22} - I_{21} I_{11}^{-1} I_{12})^{-1} = \frac{c_0}{\mathrm{Var}\,Z}
\begin{pmatrix}
\dfrac{1 + e^{\alpha_1}}{e^{\alpha_1}} & 1 \\
1 & \dfrac{1 + e^{\alpha_2}}{e^{\alpha_2}}
\end{pmatrix}. \tag{8.6}$$

Later, we will compare the efficiency under the null of the conditional likelihood model we propose with equation (8.6), which is the variance matrix under the null using the full study base.

8.2.2 Conditional Likelihood for Random Sampling

In this section, we propose two different conditional likelihood models for a nested case-control study sampled from the study base. The likelihood models are derived for simple sampling methods such as random sampling and for complex sampling methods such as counter-matching.

Consider a nested case-control study randomly sampled from the study base.
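Again, assume there is one control group and two types of disease groups in the study base. Let $D_j$ denote the set of subjects with disease type $j$, $j = 0, 1, 2$. After the sampling process, each $D_j$ is divided as $D_j = \tilde{D}_j \cup \bar{D}_j$, where $\tilde{D}_j$ denotes the set of individuals with disease type $j$ 'sampled' for the case-control study, and $\bar{D}_j$ denotes the rest of the individuals with disease type $j$. Then, the sampled risk set $\tilde{\mathcal{R}}$ is $\tilde{D}_0 \cup \tilde{D}_1 \cup \tilde{D}_2$. Let $|s|$ stand for the size of a set $s$. A conditional likelihood can hence be based on the probability

$$\begin{aligned}
&\Pr(\tilde{D}_0, \tilde{D}_1, \tilde{D}_2 \mid \bar{D}_0, \bar{D}_1, \bar{D}_2, |\tilde{D}_0|, |\tilde{D}_1|, |\tilde{D}_2|) \\
&\quad= \frac{\Pr(\tilde{D}_0, \tilde{D}_1, \tilde{D}_2 \mid D_0, D_1, D_2) \times \Pr(D_0, D_1, D_2)}{\sum_{cond} \Pr(s_0, s_1, s_2 \mid \bar{D}_0 \cup s_0, \bar{D}_1 \cup s_1, \bar{D}_2 \cup s_2) \times \Pr(\bar{D}_0 \cup s_0, \bar{D}_1 \cup s_1, \bar{D}_2 \cup s_2)} \\
&\quad= \frac{\binom{|D_1|}{|\tilde{D}_1|}^{-1} \binom{|D_2|}{|\tilde{D}_2|}^{-1} \binom{|D_0|}{|\tilde{D}_0|}^{-1} \dfrac{\prod_{i \in D_1} e^{\alpha_1 + Z_i\beta_1} \prod_{i \in D_2} e^{\alpha_2 + Z_i\beta_2}}{\prod_{i \in \mathcal{R}} (1 + e^{\alpha_1 + Z_i\beta_1} + e^{\alpha_2 + Z_i\beta_2})}}{\sum_{cond} \binom{|\bar{D}_1| + |s_1|}{|s_1|}^{-1} \binom{|\bar{D}_2| + |s_2|}{|s_2|}^{-1} \binom{|\bar{D}_0| + |s_0|}{|s_0|}^{-1} \dfrac{\prod_{i \in \bar{D}_1 \cup s_1} e^{\alpha_1 + Z_i\beta_1} \prod_{i \in \bar{D}_2 \cup s_2} e^{\alpha_2 + Z_i\beta_2}}{\prod_{i \in \mathcal{R}} (1 + e^{\alpha_1 + Z_i\beta_1} + e^{\alpha_2 + Z_i\beta_2})}} \\
&\quad= \frac{\exp\big(\beta_1 \sum_{i \in \tilde{D}_1} Z_i + \beta_2 \sum_{i \in \tilde{D}_2} Z_i\big)}{\sum_{cond} \exp\big(\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\big)},
\end{aligned} \tag{8.7}$$

where $cond$ is the condition that $s_k : |s_k| = |\tilde{D}_k|$, $k = 0, 1, 2$. Consequently, the log-likelihood equals

$$\beta_1 \Big(\sum_{i \in \tilde{D}_1} Z_i\Big) + \beta_2 \Big(\sum_{i \in \tilde{D}_2} Z_i\Big) - \log\Big\{ \sum_{s_k : |s_k| = |\tilde{D}_k|,\, k = 0, 1, 2} \exp\Big[\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\Big] \Big\}.$$

Let $\mathrm{Var}\,Z = E\{(Z_i - EZ_i)^2\}$. After some calculations (see Appendix C), with $r$ denoting the size of the sampled risk set and $d_j$ the number of sampled subjects with disease type $j$, the information under the null $H_0: \beta_1 = \beta_2 = 0$ is given by

$$I = \mathrm{Var}\,Z \begin{pmatrix} \dfrac{d_1 (r - d_1)}{r} & -\dfrac{d_1 d_2}{r} \\ -\dfrac{d_1 d_2}{r} & \dfrac{d_2 (r - d_2)}{r} \end{pmatrix},
\qquad
\frac{1}{n} I = \frac{r}{n} \mathrm{Var}\,Z \begin{pmatrix} \dfrac{d_1}{r}\dfrac{r - d_1}{r} & -\dfrac{d_1}{r}\dfrac{d_2}{r} \\ -\dfrac{d_1}{r}\dfrac{d_2}{r} & \dfrac{d_2}{r}\dfrac{r - d_2}{r} \end{pmatrix}
\longrightarrow p\, \mathrm{Var}\,Z \begin{pmatrix} p_1 (1 - p_1) & -p_1 p_2 \\ -p_1 p_2 & p_2 (1 - p_2) \end{pmatrix},$$

where $p$ = sampling proportion, and $p_j$ = type $j$ proportion in the sample, $j = 1, 2$. Therefore, the inverse information for $\beta_1$ and $\beta_2$ equals

$$\Sigma = \frac{1}{p\, \mathrm{Var}\,Z\, (1 - p_1 - p_2)} \begin{pmatrix} \dfrac{1 - p_2}{p_1} & 1 \\ 1 & \dfrac{1 - p_1}{p_2} \end{pmatrix}. \tag{8.8}$$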
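The closed form in equation (8.8) can be checked numerically by multiplying it with the limiting information matrix $p\,\mathrm{Var}\,Z\,[\,\cdot\,]$ and confirming that the product is the identity. The sketch below does this in plain Python; the function names and the values of $p$, $p_1$, $p_2$, and $\mathrm{Var}\,Z$ are arbitrary choices for illustration, not quantities from the EARS.

```python
def information(p, p1, p2, var_z):
    """Limiting information for (beta_1, beta_2) under the null."""
    c = p * var_z
    return [
        [c * p1 * (1.0 - p1), -c * p1 * p2],
        [-c * p1 * p2, c * p2 * (1.0 - p2)],
    ]


def inverse_information(p, p1, p2, var_z):
    """Closed-form inverse, as in equation (8.8)."""
    scale = 1.0 / (p * var_z * (1.0 - p1 - p2))
    return [
        [scale * (1.0 - p2) / p1, scale],
        [scale, scale * (1.0 - p1) / p2],
    ]


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


# Arbitrary illustrative values: 30% sampling, 20% type 1 and 10% type 2.
p, p1, p2, var_z = 0.3, 0.2, 0.1, 2.0
product = matmul2(information(p, p1, p2, var_z),
                  inverse_information(p, p1, p2, var_z))
print(product)  # should be the 2x2 identity up to rounding

# Holding p1 and p2 fixed, sampling a fraction p inflates the variance by 1/p.
full = inverse_information(1.0, p1, p2, var_z)
sampled = inverse_information(p, p1, p2, var_z)
print(sampled[0][0] / full[0][0])
```

The last line illustrates how the factor $1/p$ enters (8.8) when the type proportions are unchanged by sampling, which is the situation of simple random sampling from every outcome type.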
Now, we check the efficiencies for three simple sampling setups based on the conditional likelihood model: 1) sampling the full study base; 2) simple random sampling; and 3) sampling equal numbers from each outcome level.

Example 1: sampling the full study base. If we sample the whole study base, the sampling proportion is $p = 1$, and $p_j = \dfrac{e^{\alpha_j}}{1 + e^{\alpha_1} + e^{\alpha_2}} = \dfrac{e^{\alpha_j}}{c_0}$, $j = 1, 2$. Hence, (8.8) becomes

$$\Sigma = \frac{c_0}{\mathrm{Var}\,Z} \begin{pmatrix} \dfrac{1 + e^{\alpha_1}}{e^{\alpha_1}} & 1 \\ 1 & \dfrac{1 + e^{\alpha_2}}{e^{\alpha_2}} \end{pmatrix},$$

which is the same as $I^{-1}_{full}$ from the "unconditional likelihood" (8.6). This implies that the conditional likelihood approach is as efficient as the unconditional likelihood under the null $H_0: \beta_1 = \beta_2 = 0$.

Example 2: simple random sampling. Suppose we sample $(100 \times p)\%$ from each type (i.e. $(100 \times p)\%$ from the entire study base), so that $p_j = e^{\alpha_j}/c_0$ as in Example 1. Equation (8.8), the inverse information for $\beta_1$ and $\beta_2$, is

$$\Sigma = \frac{1}{p\, \mathrm{Var}\,Z\, (1 - p_1 - p_2)} \begin{pmatrix} \dfrac{1 - p_2}{p_1} & 1 \\ 1 & \dfrac{1 - p_1}{p_2} \end{pmatrix}
= \frac{c_0}{p\, \mathrm{Var}\,Z} \begin{pmatrix} \dfrac{1 + e^{\alpha_1}}{e^{\alpha_1}} & 1 \\ 1 & \dfrac{1 + e^{\alpha_2}}{e^{\alpha_2}} \end{pmatrix}.$$

This means the efficiency for simple random sampling is $p$, the sampling proportion, under the null $H_0: \beta_1 = \beta_2 = 0$.

Example 3: sampling equal numbers from each outcome level. If we sample equal numbers, with overall sampling proportion $p$, from each of the three outcome types, we have $p_j = \frac{1}{3}$, $j = 1, 2$. Then, the inverse information for $\beta_1$ and $\beta_2$ is

$$\Sigma = \frac{3}{p\, \mathrm{Var}\,Z} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.$$

8.2.3 "Less" Conditional Likelihood for Random Sampling

In Section 8.2.2, we conditioned on the number of cases of each type in the likelihood derivation for random sampling. This conditional likelihood model, equation (8.7), for a polytomous outcome with one control and two case groups is simply an extension of the conditional logistic likelihood model for a dichotomous outcome.
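For a small sampled risk set, the conditional log-likelihood that follows from equation (8.7) can be evaluated by brute force, enumerating every way of relabelling the sampled subjects that keeps the number of each outcome type fixed. The sketch below assumes a single scalar covariate per subject and uses made-up data; the function name `conditional_loglik` is hypothetical, and the enumeration is only feasible for small sets.

```python
import math
from itertools import combinations


def conditional_loglik(beta1, beta2, z, labels):
    """Conditional log-likelihood based on equation (8.7), scalar covariate.

    z      : covariate values of the sampled subjects
    labels : outcome type of each sampled subject (0 = control, 1 or 2 = case type)
    """
    idx = range(len(z))
    d1 = [i for i in idx if labels[i] == 1]
    d2 = [i for i in idx if labels[i] == 2]
    observed = beta1 * sum(z[i] for i in d1) + beta2 * sum(z[i] for i in d2)

    # Sum over all (s1, s2) with |s1| = |D1~| and |s2| = |D2~|;
    # the remaining subjects form s0, so |s0| = |D0~| automatically.
    denominator = 0.0
    for s1 in combinations(idx, len(d1)):
        chosen = set(s1)
        rest = [i for i in idx if i not in chosen]
        for s2 in combinations(rest, len(d2)):
            denominator += math.exp(
                beta1 * sum(z[i] for i in s1) + beta2 * sum(z[i] for i in s2)
            )
    return observed - math.log(denominator)


# Made-up sampled set: three controls, two type-1 cases, one type-2 case.
z = [0, 1, 1, 0, 1, 0]
labels = [0, 1, 2, 0, 1, 0]

print(conditional_loglik(0.0, 0.0, z, labels))
print(conditional_loglik(0.5, -0.2, z, labels))
```

At $\beta_1 = \beta_2 = 0$ every relabelling is equally likely, so the value is minus the log of the number of admissible relabellings; the intercepts $\alpha_1$ and $\alpha_2$ never appear because they cancel when the type counts are held fixed.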
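Here, we consider a different approach that relaxes the condition so as to fix only the total number of cases, whatever their type; we label it the "less" conditional likelihood method. We conjecture that this less conditional likelihood method might be more efficient than the previous conditional likelihood model, so later, in Chapter 9, we will use a data example and simulation studies to compare the efficiencies of the two models.

The "less" conditional likelihood model is based on the probability

$$\begin{aligned}
&\Pr(\tilde{D} \mid \bar{D}_0, \bar{D}_1, \bar{D}_2, |\tilde{D}|) \\
&\quad= \frac{\Pr(\tilde{D}_0, \tilde{D}_1, \tilde{D}_2 \mid D_0, D_1, D_2) \times \Pr(D_0, D_1, D_2)}{\sum_{cond} \Pr(s_0, s_1, s_2 \mid \bar{D}_0 \cup s_0, \bar{D}_1 \cup s_1, \bar{D}_2 \cup s_2) \times \Pr(\bar{D}_0 \cup s_0, \bar{D}_1 \cup s_1, \bar{D}_2 \cup s_2)} \\
&\quad= \frac{\binom{|D_1|}{|\tilde{D}_1|}^{-1} \binom{|D_2|}{|\tilde{D}_2|}^{-1} \binom{|D_0|}{|\tilde{D}_0|}^{-1} \dfrac{\prod_{i \in D_1} e^{\alpha_1 + Z_i\beta_1} \prod_{i \in D_2} e^{\alpha_2 + Z_i\beta_2}}{\prod_{i \in \mathcal{R}} (1 + e^{\alpha_1 + Z_i\beta_1} + e^{\alpha_2 + Z_i\beta_2})}}{\sum_{cond} \binom{|\bar{D}_1| + |s_1|}{|s_1|}^{-1} \binom{|\bar{D}_2| + |s_2|}{|s_2|}^{-1} \binom{|\bar{D}_0| + |s_0|}{|s_0|}^{-1} \dfrac{\prod_{i \in \bar{D}_1 \cup s_1} e^{\alpha_1 + Z_i\beta_1} \prod_{i \in \bar{D}_2 \cup s_2} e^{\alpha_2 + Z_i\beta_2}}{\prod_{i \in \mathcal{R}} (1 + e^{\alpha_1 + Z_i\beta_1} + e^{\alpha_2 + Z_i\beta_2})}} \\
&\quad= \frac{[\exp(\alpha_1)]^{|\tilde{D}_1|} [\exp(\alpha_2)]^{|\tilde{D}_2|} \exp\big(\beta_1 \sum_{i \in \tilde{D}_1} Z_i + \beta_2 \sum_{i \in \tilde{D}_2} Z_i\big)}{\sum_{cond} [\exp(\alpha_1)]^{|s_1|} [\exp(\alpha_2)]^{|s_2|} \exp\big(\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\big)} \\
&\quad= \frac{[\exp(\alpha_1 - \alpha_2)]^{|\tilde{D}_1|} \exp\big(\beta_1 \sum_{i \in \tilde{D}_1} Z_i + \beta_2 \sum_{i \in \tilde{D}_2} Z_i\big)}{\sum_{cond} [\exp(\alpha_1 - \alpha_2)]^{|s_1|} \exp\big(\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\big)} \\
&\quad= \frac{[\exp(\gamma)]^{|\tilde{D}_1|} \exp\big(\beta_1 \sum_{i \in \tilde{D}_1} Z_i + \beta_2 \sum_{i \in \tilde{D}_2} Z_i\big)}{\sum_{cond} [\exp(\gamma)]^{|s_1|} \exp\big(\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\big)},
\end{aligned} \tag{8.9}$$

where $\gamma = \alpha_1 - \alpha_2$, $cond$ is the condition that $s_0, s_1, s_2 : |s_0| + |s_1| + |s_2| = |\tilde{D}_0| + |\tilde{D}_1| + |\tilde{D}_2|$, and $\sum_{cond}$ means summing over all $s_k$ satisfying this condition. Comparing equation (8.7) in Section 8.2.2 with equation (8.9), there is one additional parameter, $\gamma$, to estimate using this less conditional likelihood approach.

8.2.4 Two Conditional Likelihood Approaches for Complex Sampling

Next, we consider using the polytomous logistic regression model in a case-control study with complex sampling such as counter-matching.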
Using the same notation, let $D_j = \tilde D_j \cup \bar D_j$ denote the set of individuals with disease type $j$, $j = 0, 1, 2$, where $\tilde D_j$ is the set of individuals with disease type $j$ "sampled" for the case-control study and $\bar D_j$ is the rest of the individuals with disease type $j$. Then the sampled risk set $\tilde R$ is $\tilde D_0 \cup \tilde D_1 \cup \tilde D_2$. A conditional likelihood can hence be based on the probability

$$
\begin{aligned}
&\operatorname{pr}\bigl(\tilde D_0,\tilde D_1,\tilde D_2 \mid \bar D_0,\bar D_1,\bar D_2,|\tilde D_0|,|\tilde D_1|,|\tilde D_2|\bigr) \\
&= \frac{\operatorname{pr}(\tilde D_0,\tilde D_1,\tilde D_2 \mid D_0,D_1,D_2)\,\operatorname{pr}(D_0,D_1,D_2)}{\sum_{\text{cond}} \operatorname{pr}(s_0,s_1,s_2 \mid \bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)\,\operatorname{pr}(\bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)} \\
&= \frac{\operatorname{pr}(\tilde D_0,\tilde D_1,\tilde D_2 \mid D_0,D_1,D_2)\,\dfrac{\prod_{i\in D_1} e^{\alpha_1+Z_i\beta_1}\prod_{i\in D_2} e^{\alpha_2+Z_i\beta_2}}{\prod_{i\in R}\bigl(1+e^{\alpha_1+Z_i\beta_1}+e^{\alpha_2+Z_i\beta_2}\bigr)}}{\sum_{\text{cond}} \operatorname{pr}(s_0,s_1,s_2 \mid \bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)\,\dfrac{\prod_{i\in \bar D_1\cup s_1} e^{\alpha_1+Z_i\beta_1}\prod_{i\in \bar D_2\cup s_2} e^{\alpha_2+Z_i\beta_2}}{\prod_{i\in R}\bigl(1+e^{\alpha_1+Z_i\beta_1}+e^{\alpha_2+Z_i\beta_2}\bigr)}} \\
&= \frac{\operatorname{pr}(\tilde D_0,\tilde D_1,\tilde D_2 \mid D_0,D_1,D_2)\,\prod_{k=1}^{2}\prod_{i\in D_k} e^{\alpha_k+Z_i\beta_k}}{\sum_{\text{cond}} \operatorname{pr}(s_0,s_1,s_2 \mid \bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)\,\prod_{k=1}^{2}\prod_{i\in \bar D_k\cup s_k} e^{\alpha_k+Z_i\beta_k}} \\
&= \frac{\operatorname{pr}(\tilde D_0,\tilde D_1,\tilde D_2 \mid D_0,D_1,D_2)\,\prod_{k=1}^{2}\prod_{i\in \tilde D_k} e^{Z_i\beta_k}}{\sum_{\text{cond}} \operatorname{pr}(s_0,s_1,s_2 \mid \bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)\,\prod_{k=1}^{2}\prod_{i\in s_k} e^{Z_i\beta_k}},
\end{aligned}
\tag{8.10}
$$

where cond is the condition that the $s_k$ satisfy $|s_k| = |D_k| - |\bar D_k| = |\tilde D_k|$, $k = 0, 1, 2$. If equation (8.10) is made "less" conditional, so that only the total number over the case types is fixed (i.e., $\sum_{k=0}^{2}|s_k| = \sum_{k=0}^{2}|\tilde D_k|$), then, following the same idea as in Section 8.2.3, the "less" conditional likelihood equals

$$
\begin{aligned}
&\operatorname{pr}\bigl(\tilde D \mid \bar D_0,\bar D_1,\bar D_2,|\tilde D|\bigr) \\
&= \frac{\operatorname{pr}(\tilde D_0,\tilde D_1,\tilde D_2 \mid D_0,D_1,D_2)\,\operatorname{pr}(D_0,D_1,D_2)}{\sum_{\text{cond}} \operatorname{pr}(s_0,s_1,s_2 \mid \bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)\,\operatorname{pr}(\bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)} \\
&= \frac{\operatorname{pr}(\tilde D_0,\tilde D_1,\tilde D_2 \mid D_0,D_1,D_2)\,\dfrac{\prod_{i\in D_1} e^{\alpha_1+Z_i\beta_1}\prod_{i\in D_2} e^{\alpha_2+Z_i\beta_2}}{\prod_{i\in R}\bigl(1+e^{\alpha_1+Z_i\beta_1}+e^{\alpha_2+Z_i\beta_2}\bigr)}}{\sum_{\text{cond}} \operatorname{pr}(s_0,s_1,s_2 \mid \bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)\,\dfrac{\prod_{i\in \bar D_1\cup s_1} e^{\alpha_1+Z_i\beta_1}\prod_{i\in \bar D_2\cup s_2} e^{\alpha_2+Z_i\beta_2}}{\prod_{i\in R}\bigl(1+e^{\alpha_1+Z_i\beta_1}+e^{\alpha_2+Z_i\beta_2}\bigr)}} \\
&= \frac{\operatorname{pr}(\tilde D_0,\tilde D_1,\tilde D_2 \mid D_0,D_1,D_2)\,[\exp(\alpha_1-\alpha_2)]^{|\tilde D_1|}\exp\bigl(\sum_{k=1}^{2}\beta_k\sum_{i\in\tilde D_k} Z_i\bigr)}{\sum_{\text{cond}} \operatorname{pr}(s_0,s_1,s_2 \mid \bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)\,[\exp(\alpha_1-\alpha_2)]^{|s_1|}\exp\bigl(\sum_{k=1}^{2}\beta_k\sum_{i\in s_k} Z_i\bigr)} \\
&= \frac{\operatorname{pr}(\tilde D_0,\tilde D_1,\tilde D_2 \mid D_0,D_1,D_2)\,[\exp(\gamma)]^{|\tilde D_1|}\exp\bigl[\sum_{k=1}^{2}\beta_k\sum_{i\in\tilde D_k} Z_i\bigr]}{\sum_{\text{cond}} \operatorname{pr}(s_0,s_1,s_2 \mid \bar D_0\cup s_0, \bar D_1\cup s_1, \bar D_2\cup s_2)\,[\exp(\gamma)]^{|s_1|}\exp\bigl[\sum_{k=1}^{2}\beta_k\sum_{i\in s_k} Z_i\bigr]},
\end{aligned}
\tag{8.11}
$$

where $\gamma = \alpha_1 - \alpha_2$, cond is the condition that $s_0, s_1, s_2$ satisfy $|s_0| + |s_1| + |s_2| = |\tilde D_0| + |\tilde D_1| + |\tilde D_2|$, and $\sum_{\text{cond}}$ means summing over all $s_k$ satisfying this condition. Comparing equations (8.7) and (8.11), there is one nuisance parameter, $\gamma$, to estimate using this less conditional likelihood method, if one is not interested in the difference of the baseline odds.

When control sampling is based on the distribution of cases while ignoring the type of each case, the selection probabilities, e.g. $\pi(\tilde R \mid s)$ in (3.3), or the risk weights, e.g. $w_i(s)$ in (3.4), are the same as in the regular dichotomous situation (i.e., one case group and one control group). Using the EARS data as an example, the asthma-free controls were counter-matched with asthma cases on in utero exposure to maternal smoking. The sampling was done before we divided the cases into early onset and late onset groups. Even if we had defined our case types before the counter-matched sampling, the risk weights would not change as long as the information on case types was not used in sampling. Given this observation, the extension of the regular logistic regression (3.4) to polytomous logistic regression is to replace $r(Z_i, \beta) = \exp(Z_i\beta)$ in equation (3.1) by $r(Z_i; \beta_1, \beta_2) = \exp(Z_i\beta_1 + Z_i\beta_2)$.
In Section 3.3, the control selection using counter-matching on a sampling variable with $L$ strata can be characterized by

$$\pi(\tilde R \mid s) = \left[\prod_{l=1}^{L}\binom{n_l - |s_{\cdot l}|}{m_l - |s_{\cdot l}|}\right]^{-1} = \pi(s),$$

where $|s_{\cdot l}|$ is the number of cases of all types (i.e., $s_1$ and $s_2$) in stratum $l$. Then the conditional likelihood (8.10) is given by

$$
\begin{aligned}
L_{\text{cond}}(\beta) &= \frac{\prod_{i\in\tilde D_1}\exp(Z_i\beta_1)\prod_{i\in\tilde D_2}\exp(Z_i\beta_2)\prod_{l=1}^{L}\binom{n_l - |\tilde D_{\cdot l}|}{m_l - |\tilde D_{\cdot l}|}^{-1}}{\sum_{\text{cond}}\prod_{i\in s_1}\exp(Z_i\beta_1)\prod_{i\in s_2}\exp(Z_i\beta_2)\prod_{l=1}^{L}\binom{n_l - |s_{\cdot l}|}{m_l - |s_{\cdot l}|}^{-1}} \\
&= \frac{\exp\bigl[\beta_1\sum_{i\in\tilde D_1} Z_i + \beta_2\sum_{i\in\tilde D_2} Z_i + \log(\pi(\tilde D))\bigr]}{\sum_{\text{cond}}\exp\bigl[\beta_1\sum_{i\in s_1} Z_i + \beta_2\sum_{i\in s_2} Z_i + \log(\pi(s))\bigr]},
\end{aligned}
\tag{8.12}
$$

where $\sum_{\text{cond}}$ means to sum over all possible $s_1$ and $s_2$ such that $|s_k| = |\tilde D_k|$, $k = 1, 2$. Moreover, the less conditional likelihood (8.11) becomes

$$L_{\text{less cond}}(\beta, \gamma) = \frac{[\exp(\gamma)]^{|\tilde D_1|}\prod_{i\in\tilde D_1}\exp(Z_i\beta_1)\prod_{i\in\tilde D_2}\exp(Z_i\beta_2)\,\pi(\tilde D)}{\sum_{\text{less cond}}[\exp(\gamma)]^{|s_1|}\prod_{i\in s_1}\exp(Z_i\beta_1)\prod_{i\in s_2}\exp(Z_i\beta_2)\,\pi(s)},$$

where $\gamma = \alpha_1 - \alpha_2$ and $\sum_{\text{less cond}}$ means summing over all $s_1$ and $s_2$ such that $|s_1| + |s_2| = |\tilde D_1| + |\tilde D_2| = |\tilde D|$.

Summary

We extended the regular logistic regression to handle a polytomous response for counter-matched data, and proposed two conditional likelihood methods for the polytomous logistic regression model.

Chapter 9

Polytomous Outcomes: EARS Example and Simulation Studies

The pairwise approach is commonly used in practice when an outcome has multiple levels. For counter-matching, we proposed in Section 6.4 a subtype analysis method that corrects the bias of the naive pairwise method. The EARS data are again used as an example to compare the pairwise methods with the polytomous methods based on the two conditional likelihood approaches proposed in Chapter 8. The results of this examination are presented in Section 9.1.
In addition, the efficiencies of the two conditional likelihood approaches proposed for the polytomous logistic regression under counter-matching are compared in Section 9.2.

9.1 EARS Example

Table 9.1 compares the results from the cohort data, using the pairwise and polytomous approaches, with the results from the EARS data, using the subtype analysis (which is pairwise) and the two conditional likelihood methods we proposed. The effects of in utero smoke exposure on early and late onset asthma estimated from the cohort data served as the gold standard. For the cohort data, the odds ratio (OR) of in utero smoke exposure for early onset asthma was 1.7 (95% CI 1.3-2.1) and 1.6 (95% CI 1.2-2.1) using the pairwise and polytomous approaches, respectively. The estimated odds ratio of in utero smoke exposure for late onset asthma was 0.8 for both the pairwise and polytomous approaches, while the confidence interval using the polytomous approach was narrower than that using the pairwise approach.

Estimates are very close across the different methods for the EARS data and show no evidence of bias compared to the cohort results. The subtype analysis method we proposed in Section 6.4, labeled the "Adjusted pairwise" method in Table 9.1.c, shows no difference in estimating the effects of in utero smoke exposure on early and late onset asthma (OR = 1.8 and 1.0) compared to the polytomous methods using the two conditional likelihood approaches (OR = 1.7 and 0.9 using the conditional likelihood method, Table 9.1.d; OR = 1.6 and 1.0 using the less conditional likelihood method, Table 9.1.e).

Table 9.1: Estimated odds ratios (ORs) and 95% confidence intervals (CI) of in utero exposure to maternal smoking for early and late onset asthma in the full cohort and the EARS case-control sample using the pairwise and polytomous approaches.

| Data | Method | Early Onset Asthma, OR (95% CI) | Late Onset Asthma, OR (95% CI) |
|---|---|---|---|
| a. Cohort | Pairwise ** | 1.7 (1.3-2.1) | 0.8 (0.5-1.6) |
| b. Cohort | Polytomous | 1.6 (1.2-2.1) | 0.8 (0.4-1.5) |
| c. EARS | Adjusted pairwise ¶ | 1.8 (1.2-2.5) | 1.0 (0.5-1.9) |
| d. EARS | Polytomous † | 1.7 (1.2-2.4) | 0.9 (0.5-1.9) |
| e. EARS | Polytomous ‡ | 1.6 (1.2-2.4) | 1.0 (0.5-2.0) |

** no asthma as the common control group
¶ subtype analysis (pairwise with weight adjustments)
† conditional on the size of each type of case
‡ conditional on the total number of cases

The less conditional likelihood method is relatively computationally expensive. For example, for a set of total size 17 (8 type I and 2 type II cases) in the EARS data, the number of terms in the denominator of the likelihood for the two methods is:

Method 1: conditional on the sizes of each type: $\binom{17}{8} \times \binom{9}{2} = 875{,}160$;

Method 2: conditional on the total number of cases (the less conditional approach): $\binom{17}{10} \times 2^{10} = 19{,}914{,}752$.

However, the less conditional likelihood method did not provide a narrower confidence interval (i.e., a smaller standard error) as expected.

We wanted to test whether in utero smoke exposure has different effects on early onset asthma ($OR_1$) and late onset asthma ($OR_2$). If two pairwise logistic regressions are used, additional computations on the joint covariance matrix of the parameters are needed to test the hypothesis $H_0$: $OR_1 = OR_2$. For the polytomous logistic regression, the Wald test of this hypothesis can be performed by PROC PHREG in SAS/STAT® software [67], and the score test was done while generating the analytic data set (see Appendix B.2).
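The term counts quoted in this section for a set of size 17 can be checked directly. The sketch below is illustrative only and is not the dissertation's SAS code: the function names are hypothetical, and `toy_conditional_likelihood` assumes equal sampling weights (it omits the counter-matching weight term), so it only shows how the denominator is enumerated.

```python
from itertools import combinations
from math import comb, exp


def labelings_fixed_sizes(n, n1, n2):
    """Yield (s1, s2): disjoint index tuples with |s1| = n1 and |s2| = n2 (Method 1)."""
    for s1 in combinations(range(n), n1):
        chosen = set(s1)
        rest = [i for i in range(n) if i not in chosen]
        for s2 in combinations(rest, n2):
            yield s1, s2


def labelings_fixed_total(n, n_cases):
    """Yield (s1, s2) with |s1| + |s2| = n_cases (Method 2, the 'less' conditional sum)."""
    for n1 in range(n_cases + 1):
        yield from labelings_fixed_sizes(n, n1, n_cases - n1)


def n_terms_method1(n, n1, n2):
    """Closed-form number of denominator terms when each case-type size is fixed."""
    return comb(n, n1) * comb(n - n1, n2)


def n_terms_method2(n, n_cases):
    """Closed-form number of denominator terms when only the total case count is fixed."""
    return comb(n, n_cases) * 2 ** n_cases


def toy_conditional_likelihood(z, d1, d2, beta1, beta2):
    """Unweighted conditional likelihood for observed case sets d1 and d2.

    Assumption: all sampling weights are equal, so the weight term drops out.
    """
    def term(s1, s2):
        return exp(beta1 * sum(z[i] for i in s1) + beta2 * sum(z[i] for i in s2))

    denom = sum(term(s1, s2)
                for s1, s2 in labelings_fixed_sizes(len(z), len(d1), len(d2)))
    return term(d1, d2) / denom


if __name__ == "__main__":
    print(n_terms_method1(17, 8, 2))   # 875160
    print(n_terms_method2(17, 10))     # 19914752
```

Enumerating the labelings explicitly is only feasible for small sets; the closed-form counts show why the less conditional sum grows so quickly with the number of cases.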
The Wald and score tests give similar test statistics ($\chi^2$ = 2.56 and 2.59), with a p-value of 0.11 for both.

9.2 Simulation Studies Comparing Polytomous and Pairwise Approaches for Counter-Matching

We proposed two conditional likelihood methods for polytomous logistic regression. One is conditional on the number of each type of case, and the other is conditional on the total number of all cases, called the "less" conditional approach. In Section 9.1, we showed that these two conditional likelihood methods estimate the parameters equally well in the data example. Furthermore, we showed that the pairwise approach with the appropriate weight adjustments offers an acceptable alternative to the two polytomous logistic regression models. In this section, we continue to compare the efficiencies of the two conditional likelihood methods and the pairwise approach using simulation studies.

The objective of the simulation studies in this section is to compare the efficiency of the counter-matching design relative to the full study base when there is a polytomous outcome with one type of control and two different types of cases. For the counter-matched sample, we use three different methods to estimate the risk parameters for each type of case relative to the common control group: the pairwise approach plus the two conditional polytomous logistic regression methods.

The detailed methods of the simulation studies are presented first. Consider a polytomous outcome with one control level (0) and two case levels (1 and 2). For simplicity, only a binary counter-matching variable X and a binary exposure Z are considered in the simulation studies, although counter-matching can be implemented with categorical variables.
The overall probability of disease (type I and type II cases together) is 20%, the exposure prevalence is 20% or 50%, and one study base consists of 1,500 individuals equally distributed in 30 strata. Case-control status and exposure Z status were generated based on the assigned scenario, including the overall disease probability, the prevalence of exposure Z, and the exposure-disease association. The full study base is analyzed using polytomous logistic regression. The simulation is repeated 1,000 times. In the generation of the study base, we have

1) the sensitivity and specificity for the X-Z relationship both set to 0.9, i.e., pr(X = 1 | Z = 1) = 0.9 and pr(X = 0 | Z = 0) = 0.9;

2) three risk scenarios: (a) $OR_1 = 1$ for the type I case and $OR_2 = 4$ for the type II case, (b) $OR_1 = 2$ and $OR_2 = 2$, (c) $OR_1 = 2$ and $OR_2 = 4$;

3) three scenarios of the subtype probabilities: (a) 5% type I cases and 15% type II cases, (b) 10% type I cases and 10% type II cases, (c) 15% type I cases and 5% type II cases.

The sample size is twice the number of cases in the study base, that is, 300 cases and 300 controls, and the counter-matched samples use a 50% allocation of exposed and unexposed. We want to evaluate whether the subtype prevalence affects the efficiency of the pairwise and polytomous approaches for counter-matched data.

The simulation results are summarized in Tables 9.2 to 9.9. We find no evidence that one method is significantly better than the others in terms of efficiency. The subtype prevalence slightly influences the efficiency of the analysis methods. The polytomous approach is more efficient than the pairwise approach on a rare subtype with a prevalence of 5%.
The anticipated higher efficiency of the less conditional method for the polytomous logistic regression is observed when one of the subtypes is rare, e.g. 5%. For example, if the exposure with a prevalence of 20% has the same effect on both subtypes with OR = 2 (Table 9.5), the relative efficiency on the rare subtype (i.e., type I) is 0.91, 0.92, and 0.95 for the pairwise, conditional, and less conditional polytomous logistic regression, respectively, and the efficiency is about the same on the more common subtype (i.e., type II with a prevalence of 15%). It is also observed for the other risk scenarios ($OR_1 = 1$ and $OR_2 = 4$ in Tables 9.2 to 9.4; $OR_1 = 2$ and $OR_2 = 4$ in Tables 9.7 to 9.9) that the polytomous approach is more efficient than the pairwise approach on a rare subtype, whether the odds ratio is 1, 2, or 4 and whether the exposure prevalence is 20% or 50%.

Summary

The simulation studies show little difference in efficiency between the pairwise and polytomous approaches for a polytomous outcome in a counter-matched study. Both approaches are very efficient in a counter-matched study (90% efficiency with 40% of the sample size of the full study base). However, as we pointed out in Section 9.1, using the less conditional likelihood approach for the polytomous logistic regression requires more computing resources because a very large analytic data set must be generated. Therefore, the less conditional likelihood approach might not be a good choice in practice. The advantage of the conditional likelihood approach for the polytomous logistic regression over the pairwise logistic regression is the ability to conduct hypothesis tests on the parameter estimates for the subtypes of cases. In a polytomous logistic regression, the parameters are estimated in the same model, which allows direct inferences on the parameters.
By contrast, parameters in the pairwise approach are estimated in different models. To test a hypothesis involving parameters from different models, one needs additional computations on the joint covariance matrix, and this may be difficult or even impossible.

Table 9.2: Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n = 600) compared to the full study base (n = 1,500) with 5% type I and 15% type II cases, when the odds ratio is 1 for type I cases ($OR_1 = 1$, $\beta_1 = 0$) and the odds ratio is 4 for type II cases ($OR_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

| Method | $OR_1 = 1$: $\hat\beta_1$ (SE) | Efficiency | $OR_2 = 4$: $\hat\beta_2$ (SE) | Efficiency |
|---|---|---|---|---|
| pr(Z = 1) = 0.2 | | | | |
| Full study base | -0.02 (0.34) | 1.00 | 1.40 (0.16) | 1.00 |
| Pairwise | -0.03 (0.35) | .93 | 1.40 (0.17) | .89 |
| Conditional PLR † | -0.03 (0.35) | .95 | 1.40 (0.17) | .89 |
| Less Conditional PLR ‡ | -0.04 (0.35) | .96 | 1.40 (0.17) | .87 |
| pr(Z = 1) = 0.5 | | | | |
| Full study base | 0.01 (0.25) | 1.00 | 1.40 (0.17) | 1.00 |
| Pairwise | 0.01 (0.25) | .94 | 1.40 (0.17) | .92 |
| Conditional PLR † | 0.01 (0.25) | .95 | 1.39 (0.17) | .91 |
| Less Conditional PLR ‡ | 0.00 (0.25) | .94 | 1.40 (0.17) | .91 |

PLR = polytomous logistic regression
† conditional on the size of each type of case
‡ conditional on the total number of cases
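The study-base generation described in Section 9.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's simulation program: the function names are hypothetical, the subtype percentages are treated as marginal probabilities (the text does not say how the baseline odds were set), and strata are assigned by simple rotation so that each of the 30 strata holds 50 individuals.

```python
import random


def marginal_probs(a1, a2, or1, or2, prev_z):
    """Marginal P(type I) and P(type II), where a_j = exp(alpha_j) are baseline odds."""
    m1 = m2 = 0.0
    for z, weight in ((0, 1.0 - prev_z), (1, prev_z)):
        o1 = a1 * (or1 if z else 1.0)
        o2 = a2 * (or2 if z else 1.0)
        denom = 1.0 + o1 + o2
        m1 += weight * o1 / denom
        m2 += weight * o2 / denom
    return m1, m2


def calibrate_baseline_odds(p1, p2, or1, or2, prev_z, iters=200):
    """Find baseline odds whose marginal subtype probabilities equal p1 and p2.

    Assumption: a simple multiplicative fixed-point update is used for calibration.
    """
    a1 = p1 / (1.0 - p1 - p2)
    a2 = p2 / (1.0 - p1 - p2)
    for _ in range(iters):
        m1, m2 = marginal_probs(a1, a2, or1, or2, prev_z)
        a1 *= p1 / m1
        a2 *= p2 / m2
    return a1, a2


def simulate_study_base(n=1500, n_strata=30, prev_z=0.2, p1=0.05, p2=0.15,
                        or1=1.0, or2=4.0, sens=0.9, spec=0.9, seed=0):
    """Generate one study base as a list of dicts with keys stratum, z, x, d."""
    rng = random.Random(seed)
    a1, a2 = calibrate_baseline_odds(p1, p2, or1, or2, prev_z)
    base = []
    for i in range(n):
        z = int(rng.random() < prev_z)            # exposure of interest
        if z:
            x = int(rng.random() < sens)          # pr(X = 1 | Z = 1) = sens
        else:
            x = int(rng.random() >= spec)         # pr(X = 0 | Z = 0) = spec
        o1 = a1 * (or1 if z else 1.0)
        o2 = a2 * (or2 if z else 1.0)
        u = rng.random() * (1.0 + o1 + o2)        # polytomous logistic draw
        d = 0 if u < 1.0 else (1 if u < 1.0 + o1 else 2)
        base.append({"stratum": i % n_strata, "z": z, "x": x, "d": d})
    return base
```

Calling `simulate_study_base()` with the defaults corresponds to the scenario of Table 9.2 with an exposure prevalence of 20%; the other scenarios follow by changing `p1`, `p2`, `or1`, `or2`, and `prev_z`.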
Table 9.3: Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n = 600) compared to the full study base (n = 1,500) with 10% type I and 10% type II cases, when the odds ratio is 1 for type I cases ($OR_1 = 1$, $\beta_1 = 0$) and the odds ratio is 4 for type II cases ($OR_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

| Method | $OR_1 = 1$: $\hat\beta_1$ (SE) | Efficiency | $OR_2 = 4$: $\hat\beta_2$ (SE) | Efficiency |
|---|---|---|---|---|
| pr(Z = 1) = 0.2 | | | | |
| Full study base | 0.00 (0.24) | 1.00 | 1.40 (0.18) | 1.00 |
| Pairwise | 0.00 (0.25) | .91 | 1.40 (0.19) | .91 |
| Conditional PLR † | 0.00 (0.25) | .92 | 1.40 (0.19) | .93 |
| Less Conditional PLR ‡ | -0.01 (0.25) | .92 | 1.41 (0.19) | .89 |
| pr(Z = 1) = 0.5 | | | | |
| Full study base | 0.01 (0.18) | 1.00 | 1.41 (0.21) | 1.00 |
| Pairwise | 0.01 (0.18) | .91 | 1.41 (0.21) | .95 |
| Conditional PLR † | 0.01 (0.18) | .92 | 1.40 (0.21) | .96 |
| Less Conditional PLR ‡ | -0.00 (0.19) | .90 | 1.41 (0.21) | .94 |

PLR = polytomous logistic regression
† conditional on the size of each type of case
‡ conditional on the total number of cases

Table 9.4: Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n = 600) compared to the full study base (n = 1,500) with 15% type I and 5% type II cases, when the odds ratio is 1 for type I cases ($OR_1 = 1$, $\beta_1 = 0$) and the odds ratio is 4 for type II cases ($OR_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

| Method | $OR_1 = 1$: $\hat\beta_1$ (SE) | Efficiency | $OR_2 = 4$: $\hat\beta_2$ (SE) | Efficiency |
|---|---|---|---|---|
| pr(Z = 1) = 0.2 | | | | |
| Full study base | 0.00 (0.19) | 1.00 | 1.39 (0.25) | 1.00 |
| Pairwise | -0.00 (0.20) | .93 | 1.39 (0.27) | .89 |
| Conditional PLR † | -0.00 (0.20) | .94 | 1.39 (0.27) | .91 |
| Less Conditional PLR ‡ | -0.01 (0.20) | .92 | 1.40 (0.27) | .92 |
| pr(Z = 1) = 0.5 | | | | |
| Full study base | 0.01 (0.15) | 1.00 | 1.42 (0.30) | 1.00 |
| Pairwise | 0.00 (0.15) | .91 | 1.41 (0.31) | .95 |
| Conditional PLR † | 0.00 (0.15) | .91 | 1.41 (0.31) | .96 |
| Less Conditional PLR ‡ | -0.00 (0.15) | .90 | 1.42 (0.31) | .96 |

PLR = polytomous logistic regression
† conditional on the size of each type of case
‡ conditional on the total number of cases

Table 9.5: Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n = 600) compared to the full study base (n = 1,500) with 5% type I and 15% type II cases, when the odds ratio is 2 for type I cases ($OR_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 2 for type II cases ($OR_2 = 2$, $\beta_2 = 0.69$). Based on 1,000 trials.

| Method | $OR_1 = 2$: $\hat\beta_1$ (SE) | Efficiency | $OR_2 = 2$: $\hat\beta_2$ (SE) | Efficiency |
|---|---|---|---|---|
| pr(Z = 1) = 0.2 | | | | |
| Full study base | 0.70 (0.28) | 1.00 | 0.70 (0.17) | 1.00 |
| Pairwise | 0.69 (0.29) | .91 | 0.70 (0.18) | .90 |
| Conditional PLR † | 0.69 (0.29) | .92 | 0.70 (0.18) | .90 |
| Less Conditional PLR ‡ | 0.69 (0.28) | .95 | 0.70 (0.18) | .89 |
| pr(Z = 1) = 0.5 | | | | |
| Full study base | 0.71 (0.26) | 1.00 | 0.71 (0.15) | 1.00 |
| Pairwise | 0.71 (0.27) | .94 | 0.70 (0.16) | .90 |
| Conditional PLR † | 0.71 (0.26) | .95 | 0.70 (0.16) | .90 |
| Less Conditional PLR ‡ | 0.71 (0.26) | .96 | 0.70 (0.16) | .90 |

PLR = polytomous logistic regression
† conditional on the size of each type of case
‡ conditional on the total number of cases
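The "Efficiency" columns in these tables compare each counter-matched analysis with the full study base. This section does not restate the formula, so the sketch below assumes the common definition of relative efficiency as a ratio of variances; the function names and the replicate values are hypothetical and purely illustrative.

```python
from statistics import variance


def relative_efficiency(full_estimates, sample_estimates):
    """Empirical variance of the full-study-base estimates divided by the
    empirical variance of the sampled-design estimates (assumed definition)."""
    return variance(full_estimates) / variance(sample_estimates)


def efficiency_from_se(se_full, se_sample):
    """The same ratio computed from two standard errors."""
    return (se_full / se_sample) ** 2


if __name__ == "__main__":
    # Made-up replicate estimates of a log odds ratio from three trials.
    full = [1.0, 2.0, 3.0]
    sampled = [0.0, 2.0, 4.0]
    print(relative_efficiency(full, sampled))   # 0.25
    # Rounded standard errors from a table only approximate its efficiency
    # entry, because the rounding error is squared in the ratio.
    print(efficiency_from_se(0.34, 0.35))
```

With 1,000 simulated trials per scenario, the first function would be applied to the 1,000 pairs of estimates; the second is a quick consistency check against published standard errors.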
Table 9.6: Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n = 600) compared to the full study base (n = 1,500) with 10% type I and 10% type II cases, when the odds ratio is 2 for type I cases ($OR_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 2 for type II cases ($OR_2 = 2$, $\beta_2 = 0.69$). Based on 1,000 trials.

| Method | $OR_1 = 2$: $\hat\beta_1$ (SE) | Efficiency | $OR_2 = 2$: $\hat\beta_2$ (SE) | Efficiency |
|---|---|---|---|---|
| pr(Z = 1) = 0.2 | | | | |
| Full study base | 0.70 (0.20) | 1.00 | 0.70 (0.20) | 1.00 |
| Pairwise | 0.70 (0.21) | .87 | 0.70 (0.20) | .91 |
| Conditional PLR † | 0.70 (0.21) | .88 | 0.70 (0.20) | .93 |
| Less Conditional PLR ‡ | 0.70 (0.21) | .90 | 0.70 (0.20) | .93 |
| pr(Z = 1) = 0.5 | | | | |
| Full study base | 0.71 (0.18) | 1.00 | 0.71 (0.18) | 1.00 |
| Pairwise | 0.71 (0.19) | .91 | 0.70 (0.19) | .93 |
| Conditional PLR † | 0.71 (0.19) | .92 | 0.70 (0.19) | .93 |
| Less Conditional PLR ‡ | 0.70 (0.19) | .91 | 0.70 (0.19) | .94 |

PLR = polytomous logistic regression
† conditional on the size of each type of case
‡ conditional on the total number of cases

Table 9.7: Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n = 600) compared to the full study base (n = 1,500) with 5% type I and 15% type II cases, when the odds ratio is 2 for type I cases ($OR_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 4 for type II cases ($OR_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

| Method | $OR_1 = 2$: $\hat\beta_1$ (SE) | Efficiency | $OR_2 = 4$: $\hat\beta_2$ (SE) | Efficiency |
|---|---|---|---|---|
| pr(Z = 1) = 0.2 | | | | |
| Full study base | 0.69 (0.28) | 1.00 | 1.40 (0.16) | 1.00 |
| Pairwise | 0.69 (0.30) | .92 | 1.39 (0.17) | .87 |
| Conditional PLR † | 0.69 (0.29) | .93 | 1.39 (0.17) | .88 |
| Less Conditional PLR ‡ | 0.69 (0.29) | .94 | 1.40 (0.17) | .87 |
| pr(Z = 1) = 0.5 | | | | |
| Full study base | 0.71 (0.25) | 1.00 | 1.40 (0.17) | 1.00 |
| Pairwise | 0.71 (0.26) | .93 | 1.40 (0.18) | .90 |
| Conditional PLR † | 0.71 (0.26) | .94 | 1.40 (0.18) | .90 |
| Less Conditional PLR ‡ | 0.70 (0.26) | .95 | 1.40 (0.18) | .90 |

PLR = polytomous logistic regression
† conditional on the size of each type of case
‡ conditional on the total number of cases

Table 9.8: Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n = 600) compared to the full study base (n = 1,500) with 10% type I and 10% type II cases, when the odds ratio is 2 for type I cases ($OR_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 4 for type II cases ($OR_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

| Method | $OR_1 = 2$: $\hat\beta_1$ (SE) | Efficiency | $OR_2 = 4$: $\hat\beta_2$ (SE) | Efficiency |
|---|---|---|---|---|
| pr(Z = 1) = 0.2 | | | | |
| Full study base | 0.70 (0.20) | 1.00 | 1.40 (0.18) | 1.00 |
| Pairwise | 0.70 (0.22) | .87 | 1.39 (0.19) | .91 |
| Conditional PLR † | 0.70 (0.21) | .88 | 1.39 (0.19) | .93 |
| Less Conditional PLR ‡ | 0.69 (0.21) | .89 | 1.40 (0.19) | .92 |
| pr(Z = 1) = 0.5 | | | | |
| Full study base | 0.71 (0.18) | 1.00 | 1.40 (0.21) | 1.00 |
| Pairwise | 0.70 (0.19) | .90 | 1.40 (0.21) | .95 |
| Conditional PLR † | 0.70 (0.19) | .92 | 1.40 (0.21) | .94 |
| Less Conditional PLR ‡ | 0.70 (0.19) | .90 | 1.40 (0.21) | .95 |

PLR = polytomous logistic regression
† conditional on the size of each type of case
‡ conditional on the total number of cases
Table 9.9: Relative efficiency of the pairwise logistic regression models and two conditional likelihood methods for the polytomous logistic regression for the counter-matched sample (n = 600) compared to the full study base (n = 1,500) with 15% type I and 5% type II cases, when the odds ratio is 2 for type I cases ($OR_1 = 2$, $\beta_1 = 0.69$) and the odds ratio is 4 for type II cases ($OR_2 = 4$, $\beta_2 = 1.39$). Based on 1,000 trials.

| Method | $OR_1 = 2$: $\hat\beta_1$ (SE) | Efficiency | $OR_2 = 4$: $\hat\beta_2$ (SE) | Efficiency |
|---|---|---|---|---|
| pr(Z = 1) = 0.2 | | | | |
| Full study base | 0.70 (0.17) | 1.00 | 1.40 (0.26) | 1.00 |
| Pairwise | 0.70 (0.18) | .90 | 1.39 (0.27) | .89 |
| Conditional PLR † | 0.70 (0.18) | .91 | 1.39 (0.27) | .92 |
| Less Conditional PLR ‡ | 0.69 (0.18) | .90 | 1.39 (0.26) | .93 |
| pr(Z = 1) = 0.5 | | | | |
| Full study base | 0.70 (0.15) | 1.00 | 1.41 (0.29) | 1.00 |
| Pairwise | 0.70 (0.16) | .90 | 1.41 (0.30) | .94 |
| Conditional PLR † | 0.70 (0.16) | .90 | 1.41 (0.30) | .95 |
| Less Conditional PLR ‡ | 0.70 (0.16) | .89 | 1.41 (0.30) | .97 |

PLR = polytomous logistic regression
† conditional on the size of each type of case
‡ conditional on the total number of cases

Chapter 10

Summary and Discussion

10.1 Summary

Cohort studies are the basis of epidemiologic studies and are useful for studying the effect of exposures on outcomes of interest. Cohort studies have several advantages, including providing 1) a clear temporal sequence of exposure and disease, and 2) an opportunity to study multiple outcomes related to a specific exposure. However, cohort studies are often very expensive to conduct because of the number of participants and the long period of observation required. Moreover, to investigate complex disease etiologies or low-level exposure risk, cohort studies need to be large. Because of these considerations, alternative approaches are needed. One such approach is a nested case-control study.
A nested case-control study is an alternative method that reduces the cost of the cohort study without jeopardizing efficiency or introducing bias. The nested case-control approach samples from the study base based on disease outcomes and collects new information from the sample. In 1994, Langholz and Clayton [40] proposed the novel idea of "counter-matching" on exposure rather than outcome when sampling for nested case-control studies. Counter-matching broke with conventional thinking about sampling. Counter-matching can increase efficiency compared to more traditional sampling methods such as random sampling. Counter-matching can be at most 2 times more efficient than frequency matching in a 1:1 matched case-control study. If the relative efficiency of counter-matching is 1.5 compared to frequency matching, then we can save one third of the resources by using counter-matching rather than frequency matching. Therefore, counter-matching can be considered a sampling option to increase efficiency in nested case-control studies. Although the counter-matched sampling method is promising from a theoretical point of view, a number of practical issues arise in the design and analysis of counter-matched case-control studies.

Counter-matching has great potential for improving efficiency in nested case-control studies. However, the efficiency gained in a counter-matched design depends on several factors, such as the sample size, the exposure prevalence, the exposure-disease association, the relationship between the counter-matching variable and the exposure of interest, and the sample allocation of the counter-matching variable. We have evaluated the impact of these factors on the efficiency of counter-matching by simulation studies.
When the sample size was small, a 50% allocation of exposed and unexposed for counter-matching was the optimal choice. As the sample size increased, the impact of the allocation of exposed and unexposed on efficiency was less critical. Counter-matching was more efficient than random sampling when the counter-matching variable X had high sensitivity and specificity (at least 75% for both) for the exposure variable Z of main interest in the nested case-control study, especially when Z was rare and carried a high risk for the disease. The X-Z relationship is the most important determinant of the efficiency of counter-matching. Therefore, one shortcoming of counter-matching is that if an exposure is not related to the counter-matching variable X, the efficiency of counter-matching could be worse than that of random sampling.

In this dissertation, we have provided solutions to some analytic issues that arise after applying counter-matching to a nested case-control study. The issues encompassed descriptive statistics for counter-matched data, non-participation of subjects, and subgroup analysis, including the subtype analysis (a pairwise approach for a polytomous outcome) and the subset analysis (for considering effect modification). Because the covariate distribution in the counter-matched data was distorted by the sampling process, we used sampling probabilities to estimate the number of controls in the cohort from which the sample arose. For the issues of non-participation and subgroup analysis, we needed to adjust the risk weights in the likelihood appropriately to account for the cases who did not participate in the study or who were not in the subgroups of interest, because the likelihood for counter-matching is not independent of the risk weights.
In case-control studies, we are sometimes interested in an outcome with multiple levels, such as subtypes of a disease. It is common to use a pairwise approach by modelling several dichotomous outcomes. The polytomous logistic regression model is an intuitive extension of the regular logistic regression. Two conditional likelihood polytomous logistic regression methods were proposed for counter-matched data, and were shown to be valid and to have about the same efficiency. The advantage of the polytomous approach over the pairwise approach is the ability to conduct hypothesis tests on the parameter estimates for different subtypes of disease.

In summary, counter-matching can increase efficiency compared to traditional sampling methods, does not introduce bias, and should be considered as a sampling option in nested case-control studies.

10.2 Future Research

Counter-matching is a newer sampling method and has not yet been widely used. In this dissertation, we used intensive simulation studies to show that counter-matching is more efficient than frequency matching with the same sample size under different scenarios of exposure-disease association and exposure prevalence. We also proposed solutions for some analytic issues in counter-matched nested case-control studies. However, opportunities for future research exist and include:

1. Non-Participation and Missing Data. Non-participation of subjects is a frequent source of selection bias. The two methods we used in our data example for considering non-participation of subjects are the deletion approach with weight adjustments and the missing-indicator adjustments. These two methods are simple and easy to perform, with desirable results. However, the deletion approach relies on the assumption of Bernoulli recruitment.
On the other hand, the performance of the missing-indicator method remains unknown. Huberman and Langholz [31] have shown that the missing-indicator method has the advantage of making use of all the data while still preserving the matching in matched case-control studies. Nevertheless, some researchers do not recommend the missing-indicator method because of concerns about bias [27, 42, 34]. A better understanding of the missing-indicator method, and a search for other approaches, can help us decide how to address non-participation. In contrast to non-participation (i.e. complete missingness), missing data refer to incomplete data. It has become more popular to use multiple imputation to "fill in" missing data. Even if MCAR does occur, multiple imputation is potentially more efficient than the deletion approach [74]. Statistical theory about missing data is an active area of research. Future work will examine how multiple imputation and the other methods reviewed in Section 6.3 perform with counter-matched data.

2. Polytomous Outcomes. Fitting a polytomous logistic regression model under a counter-matched design is computationally intensive, especially for the 'less' conditional likelihood method. This computational burden can deter people from the polytomous approach and lead them to use the pairwise approach. However, it is difficult to make inferences on parameters from different models with the pairwise approach. Polytomous logistic regression avoids this problem. Therefore, a computational strategy is warranted for polytomous logistic regression under complex sampling such as counter-matching. Binary logistic regression is widely available in statistical software and is straightforward to apply. The efficiency of the pairwise approach is not lower than that of the polytomous approach.
If a test (or an approximate test) for making inferences on parameters from different models can be determined, the pairwise approach may be an appropriate method of analysis for a polytomous outcome. Therefore, future studies on making inferences on parameters from different models, while accommodating complex sampling such as counter-matching, may be beneficial.

3. Sampling Schemes. When considering a counter-matched design, researchers should understand that the main exposures in the nested case-control study must be highly related to the counter-matched sampling variable in order to retain the superior efficiency of a counter-matched design. Besides counter-matching, there are other, less well-known sampling schemes such as quota sampling and case-cohort sampling. Further research on different sampling methods and their performance in different specialized situations may help investigators design the best study and choose the most efficient sampling method.

References

[1] Agresti, A. Considerations in measuring partial association for ordinal categorical data. Journal of the American Statistical Association, 72:37-45, 1977.
[2] Agresti, A. Analysis of Ordinal Categorical Data, New York: John Wiley & Sons, 1984.
[3] Agresti, A. Categorical Data Analysis, New York: John Wiley & Sons, 1990.
[4] Allison, P.D. Missing Data, Thousand Oaks, CA: Sage Publications, 2001.
[5] Anderson, A.B., Basilevsky, A., and Hum, D.P.J. Missing Data: A Review of the Literature, in Handbook of Survey Research, eds. P.H. Rossi, J.D. Wright, and A. Anderson, New York: Academic Press, pp. 415-492, 1983.
[6] Andrieu, N., Goldstein, A.M., Thomas, D.C., and Langholz, B. Counter-matching in studies of gene-environment interaction: efficiency and feasibility.
American Journal of Epidemiology, 153:265-274, 2001.
[7] Asher, M.I., Barry, D., Clayton, T., Crane, J., D'Souza, W., Ellwood, P., Ford, R.P., Mackay, R., Mitchell, E.A., Moyes, C., Pattemore, P., Pearce, N., and Stewart, A.W. The burden of symptoms of asthma, allergic rhinoconjunctivitis and atopic eczema in children and adolescents in six New Zealand centres: ISAAC Phase One. New Zealand Medical Journal, 114:114-120, 2001.
[8] Begg, C.B. and Gray, R. Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71:11-18, 1984.
[9] Bernstein, J.L., Langholz, B., Haile, R.W., Bernstein, L., Thomas, D.C., Stovall, M., Malone, K.E., Lynch, C.F., Olsen, J.H., Anton-Culver, H., Shore, R.E., Boice, J.D., Berkowitz, G.S., Gatti, R.A., Teitelbaum, S.L., Smith, S.A., Rosenstein, B.S., Borresen-Dale, A.L., Concannon, P., and Thompson, W.D. Study design: Evaluating gene-environment interactions in the etiology
[10] Breslow, N.E. and Cain, K.C. Logistic regression for two stage case-control data. Biometrika, 75:11-20, 1988.
[11] Breslow, N.E. and Day, N.E. Statistical Methods in Cancer Research, Vol. II: The Design & Analysis of Cohort Studies, IARC Scientific Publications, No. 82, New York: Oxford University Press, 1987.
[12] Campbell, M.K. and Donner, A. Classification efficiency of multinomial logistic-regression relative to ordinal logistic-regression. Journal of the American Statistical Association, 84:587-591, 1989.
[13] Chen, C.F. A Bayesian approach to nested missing-data problems, in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, pp. 355-361, 1986.
[14] Chen, H.Y. and Little, R. A test of missing completely at random for generalised estimating equations with missing data. Biometrika, 86:1-13, 1999.
[15] Cohen, J., and Cohen, P.
Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, New York: John Wiley, 1975.
[16] Cook, D.G. and Strachan, D.P. Health effects of passive smoking-10: Summary of effects of parental smoking on the respiratory health of children and implications for research. Thorax, 54:357-366, 1999.
[17] Cox, D.R. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society B, 34:187-220, 1972.
[18] Cunningham, J., O'Connor, G.T., Dockery, D.W., and Speizer, F.E. Environmental tobacco smoke, wheezing, and asthma in children in 24 communities. American Journal of Respiratory & Critical Care Medicine, 153:218-224, 1996.
[19] Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.
[20] Doll, R. Cohort studies: history of the method. I. Prospective cohort studies. Soz Präventivmed, 46:75-86, 2001.
[21] Doll, R. Cohort studies: history of the method. II. Retrospective cohort studies. Soz Präventivmed, 46:152-160, 2001.
[22] Fuchs, C. Maximum likelihood estimation and model selection in contingency tables with missing data. Journal of the American Statistical Association, 77:270-278, 1982.
[23] Fienberg, S.E. The Analysis of Cross-Classified Categorical Data, 2nd ed., Cambridge, MA: MIT Press, 1980.
[24] Flanders, W.D., and Greenland, S. Analytic methods for two-stage case-control studies and other stratified designs. Statistics in Medicine, 10:739-747, 1991.
[25] Gilliland, F.D., Li, Y.F., and Peters, J.M. Effects of maternal smoking during pregnancy and environmental tobacco smoke on asthma and wheezing in children. American Journal of Respiratory & Critical Care Medicine, 163:429-436, 2001.
[26] Gilliland, F.D., Li, Y.F., Dubeau, L., Berhane, K., Avol, E., McConnell, R., Gauderman, W.J., and Peters, J.M. Effects of glutathione S-transferase M1, maternal smoking during pregnancy, and environmental tobacco smoke on asthma and wheezing in children. American Journal of Respiratory & Critical Care Medicine, 166:457-463, 2002.
[27] Greenland, S. and Finkle, W.D. A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology, 142:1255-1264, 1995.
[28] Trends in asthma morbidity and mortality. April 25, 2003. Available at: http://www.lungusa.org/data/asthma/asthma1.pdf. Accessed July 1, 2003.
[29] Hausman, J.A. and Wise, D.A. A conditional probit model for qualitative choice: discrete decisions recognizing interdependence and heterogeneous preferences. Econometrica, 46:403-426, 1978.
[30] Hosmer, D. and Lemeshow, S. Applied Logistic Regression, 2nd ed., New York: John Wiley & Sons, 2000.
[31] Huberman, M. and Langholz, B. Application of the missing-indicator method in matched case-control studies with incomplete data. American Journal of Epidemiology, 150:1340-1345, 1999.
[32] ISAAC Steering Committee. Worldwide variation in prevalence of symptoms of asthma, allergic rhinoconjunctivitis, and atopic eczema: ISAAC. The International Study of Asthma and Allergies in Childhood (ISAAC) Steering Committee. Lancet, 351:1225-1232, 1998.
[33] ISAAC Steering Committee. Worldwide variations in the prevalence of asthma symptoms: the International Study of Asthma and Allergies in Childhood (ISAAC). European Respiratory Journal, 12:315-335, 1998.
[34] Jones, M.P. Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91:222-230, 1996.
[35] Kalton, G. and Kasprzyk, D.
Imputing for Missing Survey Responses. Proceedings of the Section on Survey Research Methods, American Statistical Association, 22-33, 1982.
[36] Keane, M.P. A note on identification in the multinomial probit model. Journal of Business and Economic Statistics, 10:193-200, 1992.
[37] Kleinbaum, D.G., Kupper, L.L., and Morgenstern, H. Epidemiologic Research: Principles and Quantitative Methods, New York: John Wiley & Sons, 1982.
[38] Läärä, E. and Matthews, J.N.S. The equivalence of 2 models for ordinal data. Biometrika, 72:206-207, 1985.
[39] Langholz, B. and Borgan, Ø. Counter-matching: A stratified nested case-control sampling method. Biometrika, 82:69-79, 1995.
[40] Langholz, B. and Clayton, D. Sampling strategies in nested case-control studies. Environmental Health Perspectives, 102 (Suppl. 8):47-51, 1994.
[41] Langholz, B. and Goldstein, L. Conditional logistic analysis of case-control studies with complex sampling. Biostatistics, 2:63-84, 2001.
[42] Li, X., Song, X., and Gray, R.H. Comparison of the missing-indicator method and conditional logistic regression in 1:m matched case-control studies with missing exposure values. American Journal of Epidemiology, 159(6):603-610.
[43] Liang, K.Y. and Stewart, W.F. Polychotomous logistic regression methods for matched case-control studies with multiple case or control groups. American Journal of Epidemiology, 125:720-730, 1987.
[44] Liang, K.Y. and Zeger, S.L. Longitudinal data analysis using generalized linear models. Biometrika, 73:13-22, 1986.
[45] Little, R.J.A. A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83:1198-1202, 1988.
[46] Little, R.J.A. Regression with missing X's: A review. Journal of the American Statistical Association, 87:1227-1237, 1992.
[47] Liu, C.
Bayesian robust multivariate linear regression with incomplete data. Journal of the American Statistical Association, 91:1219-1227, 1996.
[48] London, S.J., Gauderman, W.J., Avol, E., Rappaport, E.B., and Peters, J.M. Family history and the risk of early-onset persistent, early-onset transient, and late-onset asthma. Epidemiology, 12:577-583, 2001.
[49] Luce, R.D. Individual Choice Behavior, New York: John Wiley, 1959.
[50] McCullagh, P. Regression models for ordinal data. Journal of the Royal Statistical Society Series B-Methodological, 42:109-142, 1980.
[51] McCullagh, P. and Nelder, J.A. Generalized Linear Models. London: Chapman & Hall, 1989.
[52] McFadden, D. Conditional Logit Analysis of Qualitative Choice Behavior. Frontiers in Econometrics, Zarembka ed., New York: Academic Press, pp. 105-142, 1974.
[53] Miettinen, O.S. Individual Matching with Multiple Controls in the Case of All-or-None Responses. Biometrics, 25:339-355, 1969.
[54] Miettinen, O.S. Theoretical Epidemiology: Principles of Occurrence Research in Medicine, New York: John Wiley & Sons, 1985.
[55] Natarajan, R., McCulloch, C., and Kiefer, N. A Monte Carlo EM method for estimating multinomial probit models. Computational Statistics and Data Analysis, 34:33-50, 2000.
[56] Peterson, B. and Harrell, F.E. Partial proportional odds models for ordinal response variables. Applied Statistics, 39:205-217, 1990.
[57] Pepe, M.S., Reilly, M., and Fleming, T.R. Auxiliary outcome data and the mean score method. Journal of Statistical Planning and Inference, 42:137-160, 1994.
[58] Press, S.J. and Scott, A.J. Missing variables in Bayesian regression, II. Journal of the American Statistical Association, 71:366-369, 1976.
[59] Redd, S.C. Asthma in the United States: burden and current theories. Environmental Health Perspectives, 110 (Suppl. 4):557-560, 2002.
[60] Reilly, M.
and Pepe, M.S. A mean score method for missing and auxiliary covariate data in regression models. Biometrika, 82:299-314, 1995.
[61] Robins, J.M., Rotnitzky, A., and Zhao, L.P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89:846-866, 1994.
[62] Rothman, K.J. and Greenland, S. Modern Epidemiology, Philadelphia, PA: Lippincott Williams & Wilkins Publishers, 1998.
[63] Rubin, D.B. Inference and missing data. Biometrika, 63:581-592, 1976.
[64] Rubin, D.B. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, 1987.
[65] Rubin, D.B. Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91:473-489, 1996.
[66] Salsburg, D. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, New York: WH Freeman, 2001.
[67] SAS Institute Inc. SAS/STAT® User's Guide, Version 8, Cary, NC: SAS Institute Inc., 1999.
[68] Schlesselman, J.J. and Stolley, P.D. Case Control Studies: Design, Conduct, Analysis, London: Oxford University Press, 1990.
[69] Schroeder, J.C., Olshan, A.F., Dent, R.B., Weinberg, C.R., Yount, B., Cerhan, J.R., Lynch, C.F., Schuman, L.M., Tolbert, P.E., Rothman, N., Cantor, K.P., and Blair, A. A case-control study of tobacco use and other non-occupational risk factors for t(14;18) subtypes of non-Hodgkin's lymphoma (United States). Cancer Causes & Control, 13:159-168, 2002.
[70] Schroeder, J.C. and Weinberg, C.R. Use of missing-data methods to correct bias and improve precision in case-control studies in which cases are subtyped but subtype information is incomplete. American Journal of Epidemiology, 154:954-962, 2001.
[71] Steenland, K. and Deddens, J.A. Increased precision using countermatching in nested case-control studies. Epidemiology, 8:238-242, 1997.
[72] Stein, R.T., Holberg, C.J., Sherrill, D., Wright, A.L., Morgan, W.J., Taussig, L., and Martinez, F.D. Influence of parental smoking on respiratory symptoms during the first decade of life: the Tucson Children's Respiratory Study. American Journal of Epidemiology, 149:1030-1037, 1999.
[73] Swamy, P.A.V.B. and Mehta, J.S. On Bayesian estimation of seemingly unrelated regressions when some observations are missing. Journal of Econometrics, 3:157-169, 1975.
[74] Taylor, J.M., Cooper, K.L., Wei, J.T., Sarma, A.V., Raghunathan, T.E., and Heeringa, S.G. Use of multiple imputation to correct for nonresponse bias in a survey of urologic symptoms among African-American men. American Journal of Epidemiology, 156:774-782, 2002.
[75] Wang, C.Y. and Wang, S. Semiparametric methods in logistic regression with measurement error. Statistica Sinica, 7:1103-1120, 1997.
[76] Zhao, L.P. and Lipsitz, S. Designs and analysis of two-stage studies. Statistics in Medicine, 11:769-782, 1992.
[77] Zhou, X.H., Castelluccio, P., Hui, S.L., and Rodenberg, C.A. Comparing two prevalence rates in a two-phase design study. Statistics in Medicine, 18:1171-1182, 1999.

Appendix A Simulation Description

In the simulation studies, we use a shortcut to speed up the preparation of the analytic data sets. The methods for a categorical exposure Z and for a polytomous outcome are presented in the following sections.

A.1 Categorical Covariates

When the exposure Z is categorical with few levels, we can speed up our simulations by gathering all terms with the same sum of covariates (i.e. the number of subjects with Z equal to a certain value), instead of exhaustively enumerating all possible combinations. The details are discussed in the following sections.
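The grouping idea can be checked numerically. The Python sketch below compares a brute-force sum over every candidate case set with a grouped sum that loops only over the number of subjects taken from each (X, Z) cell and multiplies by the number of sets sharing those counts. It is an illustration under stated assumptions rather than the dissertation's implementation: the weight function (an inverse product of binomial coefficients in the stratum counts), the function names, and the example numbers are all hypothetical.

```python
import math
from itertools import combinations, product

def weight(stratum_counts, n, m):
    """Assumed set weight: inverse of prod_l C(n_l - |s_l|, m_l - |s_l|)."""
    w = 1.0
    for l in n:
        w /= math.comb(n[l] - stratum_counts[l], m[l] - stratum_counts[l])
    return w

def denominator_enumerated(subjects, d, beta1, beta2, n, m):
    """Sum over every size-d subset of the sampled set; subjects are (x, z) pairs."""
    total = 0.0
    for subset in combinations(subjects, d):
        z1 = sum(z == 1 for _, z in subset)  # number with Z = 1 in the subset
        z2 = sum(z == 2 for _, z in subset)  # number with Z = 2 in the subset
        counts = {l: sum(x == l for x, _ in subset) for l in n}
        total += math.exp(beta1 * z1 + beta2 * z2) * weight(counts, n, m)
    return total

def denominator_grouped(subjects, d, beta1, beta2, n, m):
    """Same sum, looping over per-cell counts instead of individual subsets."""
    cells = sorted(set(subjects))
    m_cell = [subjects.count(c) for c in cells]
    total = 0.0
    for s in product(*(range(mc + 1) for mc in m_cell)):
        if sum(s) != d:
            continue
        # Number of distinct subsets that share these cell counts.
        n_subsets = math.prod(math.comb(mc, sc) for mc, sc in zip(m_cell, s))
        z1 = sum(sc for (_, z), sc in zip(cells, s) if z == 1)
        z2 = sum(sc for (_, z), sc in zip(cells, s) if z == 2)
        counts = {l: sum(sc for (x, _), sc in zip(cells, s) if x == l) for l in n}
        total += n_subsets * math.exp(beta1 * z1 + beta2 * z2) * weight(counts, n, m)
    return total

# Hypothetical sampled set: 4 subjects from each of two sampling strata.
subjects = [(0, 0), (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 1), (1, 2)]
n = {0: 40, 1: 8}   # cohort subjects at risk per stratum
m = {0: 4, 1: 4}    # sampled subjects per stratum
full = denominator_enumerated(subjects, 2, 0.3, 0.7, n, m)
fast = denominator_grouped(subjects, 2, 0.3, 0.7, n, m)
```

The two sums agree up to floating-point rounding, while the grouped version iterates over cell-count configurations rather than individual subsets, which is where the saving comes from when the sampled set or the number of cases is large.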
With a categorical $Z$ taking the values 0, 1, and 2, the risk term in our model, which is $r_s(\beta)$ in equation 3.6, can be written as
\[
\prod_{j \in s} \exp(\beta_1 Z_{1j}) \times \prod_{j \in s} \exp(\beta_2 Z_{2j}) = \exp(\beta_1 Z_{1s}) \exp(\beta_2 Z_{2s}), \tag{A.1}
\]
where $Z_1$ is the indicator variable for $Z = 1$ (i.e. $Z_1 = 1$ if $Z = 1$ and $Z_1 = 0$ if $Z = 0$ or 2), $Z_2$ is the indicator variable for $Z = 2$, $Z_{1s} = \sum_{j \in s} Z_{1j}$ and $Z_{2s} = \sum_{j \in s} Z_{2j}$. In other words, $Z_{1s}$ and $Z_{2s}$ are the numbers of subjects with $Z = 1$ and $Z = 2$, respectively. Let
\[
w_s = \left[ \prod_{l=1}^{L} \binom{n_l - |s_l|}{m_l - |s_l|} \right]^{-1}.
\]
Therefore, our model is based on
\[
L(\beta) = \frac{r_D(\beta) \left[ \prod_{l=1}^{L} \binom{n_l - |D_l|}{m_l - |D_l|} \right]^{-1}}{\sum_{s \subset \tilde{\mathcal{R}}:\, |s| = |D|} r_s(\beta)\, w_s}
= \frac{\exp(\beta_1 Z_{1D}) \exp(\beta_2 Z_{2D})\, w_D}{\sum_{s \subset \tilde{\mathcal{R}}:\, |s| = |D|} \exp(\beta_1 Z_{1s}) \exp(\beta_2 Z_{2s})\, w_s}. \tag{A.2}
\]
For simplicity, we consider dichotomous $X$ as well. Let $n_{ij}$, $m_{ij}$, and $s_{ij}$ be the numbers of cohort subjects, counter-matched sampled subjects, and cases with sampling covariate $X = i$ and the new covariate $Z = j$, where $i = 0$ or 1 and $j = 0, 1, 2$. Furthermore, let $s_{\cdot 1} = s_{01} + s_{11}$ and $s_{\cdot 2} = s_{02} + s_{12}$. Then, the denominator of the above likelihood (A.2) can be rewritten as
\[
\begin{aligned}
&\sum_{s \subset \tilde{\mathcal{R}}:\, |s| = |D|} \exp(\beta_1 Z_{1s}) \exp(\beta_2 Z_{2s})\, w_s \\
&= \sum_{s_{\cdot 1}=0}^{|D|} \sum_{s_{\cdot 2}=0}^{|D| - s_{\cdot 1}} \exp(\beta_1 Z_{1s}) \exp(\beta_2 Z_{2s}) \sum_{s_{11}=0}^{s_{\cdot 1}} \sum_{s_{12}=0}^{s_{\cdot 2}} \sum_{s_{00}=0}^{|D| - s_{\cdot 1} - s_{\cdot 2}} \binom{m_{10}}{s_{10}} \binom{m_{11}}{s_{11}} \binom{m_{12}}{s_{12}} \binom{m_{00}}{s_{00}} \binom{m_{01}}{s_{01}} \binom{m_{02}}{s_{02}} \\
&\qquad \times\, w_{s:\, s_{0\cdot} = s_{00}+s_{01}+s_{02},\; s_{1\cdot} = s_{10}+s_{11}+s_{12}} \\
&= \sum_{s_{\cdot 1}=0}^{|D|} \sum_{s_{\cdot 2}=0}^{|D| - s_{\cdot 1}} \exp(\beta_1 Z_{1s}) \exp(\beta_2 Z_{2s}) \sum_{s_{11}=0}^{s_{\cdot 1}} \sum_{s_{12}=0}^{s_{\cdot 2}} \sum_{s_{00}=0}^{|D| - s_{\cdot 1} - s_{\cdot 2}} \binom{m_{1\cdot} + m_{0\cdot}}{s_{1\cdot} + s_{0\cdot}} \\
&\qquad \times\, \frac{\binom{m_{10}}{s_{10}} \binom{m_{11}}{s_{11}} \binom{m_{12}}{s_{12}} \binom{m_{00}}{s_{00}} \binom{m_{01}}{s_{01}} \binom{m_{02}}{s_{02}}}{\binom{m_{1\cdot}}{s_{1\cdot}} \binom{m_{0\cdot}}{s_{0\cdot}}} \times \frac{\binom{m_{1\cdot}}{s_{1\cdot}} \binom{m_{0\cdot}}{s_{0\cdot}}}{\binom{m_{1\cdot} + m_{0\cdot}}{s_{1\cdot} + s_{0\cdot}}}\, w_{s:\, s_{0\cdot},\, s_{1\cdot}}
\end{aligned}
\]
\[
\begin{aligned}
&= \sum_{s_{\cdot 1}=0}^{|D|} \sum_{s_{\cdot 2}=0}^{|D| - s_{\cdot 1}} \exp(\beta_1 Z_{1s}) \exp(\beta_2 Z_{2s}) \sum_{s_{11}=0}^{s_{\cdot 1}} \sum_{s_{12}=0}^{s_{\cdot 2}} \sum_{s_{00}=0}^{|D| - s_{\cdot 1} - s_{\cdot 2}} \binom{m_{1\cdot} + m_{0\cdot}}{s_{1\cdot} + s_{0\cdot}} \\
&\qquad \times\, \frac{\binom{m_{10}}{s_{10}} \binom{m_{11}+m_{12}}{s_{11}+s_{12}}}{\binom{m_{1\cdot}}{s_{1\cdot}}} \cdot \frac{\binom{m_{11}}{s_{11}} \binom{m_{12}}{s_{12}}}{\binom{m_{11}+m_{12}}{s_{11}+s_{12}}} \cdot \frac{\binom{m_{00}}{s_{00}} \binom{m_{01}+m_{02}}{s_{01}+s_{02}}}{\binom{m_{0\cdot}}{s_{0\cdot}}} \cdot \frac{\binom{m_{01}}{s_{01}} \binom{m_{02}}{s_{02}}}{\binom{m_{01}+m_{02}}{s_{01}+s_{02}}} \\
&\qquad \times\, \frac{\binom{m_{1\cdot}}{s_{1\cdot}} \binom{m_{0\cdot}}{s_{0\cdot}}}{\binom{m_{1\cdot} + m_{0\cdot}}{s_{1\cdot} + s_{0\cdot}}}\, w_{s:\, s_{0\cdot},\, s_{1\cdot}},
\end{aligned}
\]
where $s_{01} = s_{\cdot 1} - s_{11}$, $s_{02} = s_{\cdot 2} - s_{12}$, and $s_{10} = |D| - s_{\cdot 1} - s_{\cdot 2} - s_{00}$.

A.2 Polytomous Outcomes

Another simulation study assesses the validity of, and compares, two conditional likelihood analysis methods for a polytomous outcome with, say, three levels 0, 1, and 2, in a counter-matched design. If we generated all denominator terms in the likelihood, it would take a long time and create a huge data set. We can speed up our simulations by gathering all terms with the same sum of covariates (i.e. the number of subjects with $Z$ equal to a certain value), as we do for the categorical covariates discussed in the previous section.

With a three-level outcome, coded as 0, 1, and 2, the likelihood (8.12) can be written as
\[
\begin{aligned}
L_{cond}(\beta) &= \frac{\prod_{i \in D_1} \exp(Z_i \beta_1) \prod_{i \in D_2} \exp(Z_i \beta_2) \left[ \prod_{l=1}^{L} \binom{n_l - |D_l|}{m_l - |D_l|} \right]^{-1}}{\sum_{s_1, s_2 \subset \tilde{\mathcal{R}}:\, |s_1| = |D_1|,\, |s_2| = |D_2|} \prod_{i \in s_1} \exp(Z_i \beta_1) \prod_{i \in s_2} \exp(Z_i \beta_2) \left[ \prod_{l=1}^{L} \binom{n_l - |s_l|}{m_l - |s_l|} \right]^{-1}} \\
&= \frac{\exp(\beta_1 Z_{D_1}) \exp(\beta_2 Z_{D_2})\, w_{D_1, D_2}}{\sum_{s_1, s_2 \subset \tilde{\mathcal{R}}:\, |s_1| = |D_1|,\, |s_2| = |D_2|} \exp(\beta_1 Z_{s_1}) \exp(\beta_2 Z_{s_2})\, w_{s_1, s_2}},
\end{aligned} \tag{A.3}
\]
where $Z_{s_1} = \sum_{j \in s_1} Z_j$, $Z_{s_2} = \sum_{j \in s_2} Z_j$, and $w_{s_1, s_2}$ are the counter-matching weights. In other words, $Z_{s_1}$ and $Z_{s_2}$ are the numbers of subjects with $Z = 1$ in sets $s_1$ and $s_2$, respectively. For simplicity, we consider dichotomous $X$ and $Z$. Let $n_{ij}$ and $m_{ij}$ be the numbers of cohort subjects and counter-matched sampled subjects with sampling covariate $X = i$ and the new covariate $Z = j$, where $i = 0$ or 1 and $j = 0$ or 1.
Furthermore, let $s_{kij}$ be the numbers of type $k$ cases with sampling covariate $X = i$ and the new covariate $Z = j$, where $i = 0$ or 1, $j = 0$ or 1, and $k = 1$ or 2. Let $s_{\cdot 0 \cdot} = s_{000} + s_{001} + s_{100} + s_{101} + s_{200} + s_{201}$ and $s_{\cdot 1 \cdot} = s_{010} + s_{011} + s_{110} + s_{111} + s_{210} + s_{211}$; then
\[
w_{s_1, s_2} = \left[ \binom{n_0 - s_{\cdot 0 \cdot}}{m_0 - s_{\cdot 0 \cdot}} \times \binom{n_1 - s_{\cdot 1 \cdot}}{m_1 - s_{\cdot 1 \cdot}} \right]^{-1}.
\]
Then, the denominator of the above likelihood (A.3) can be rewritten as
\[
\begin{aligned}
&\sum_{s_1, s_2 \subset \tilde{\mathcal{R}}:\, |s_1| = |D_1|,\, |s_2| = |D_2|} \exp(\beta_1 Z_{s_1}) \exp(\beta_2 Z_{s_2})\, w_{s_1, s_2} \\
&= \sum_{s_{1\cdot 1}=0}^{|D_1|} \sum_{s_{2\cdot 1}=0}^{|D_2|} \exp(\beta_1 Z_{s_1}) \exp(\beta_2 Z_{s_2}) \sum_{s_{111}=0}^{s_{1\cdot 1}} \sum_{s_{211}=0}^{s_{2\cdot 1}} \sum_{s_{100}=0}^{s_{1\cdot 0}} \sum_{s_{200}=0}^{s_{2\cdot 0}} \binom{m_{10}}{s_{110}} \binom{m_{11}}{s_{111}} \binom{m_{11} - s_{111}}{s_{211}} \binom{m_{10} - s_{110}}{s_{210}} \\
&\qquad \times\, \binom{m_{00}}{s_{100}} \binom{m_{01}}{s_{101}} \binom{m_{01} - s_{101}}{s_{201}} \binom{m_{00} - s_{100}}{s_{200}}\, w_{s_1, s_2:\, s_{\cdot 0 \cdot},\, s_{\cdot 1 \cdot}} \\
&= \sum_{s_{1\cdot 1}=0}^{|D_1|} \sum_{s_{2\cdot 1}=0}^{|D_2|} \exp(\beta_1 Z_{s_1}) \exp(\beta_2 Z_{s_2}) \sum_{s_{111}=0}^{s_{1\cdot 1}} \sum_{s_{211}=0}^{s_{2\cdot 1}} \sum_{s_{100}=0}^{s_{1\cdot 0}} \sum_{s_{200}=0}^{s_{2\cdot 0}} \binom{m_{1\cdot} + m_{0\cdot}}{|D_1|} \binom{m_{1\cdot} + m_{0\cdot} - |D_1|}{|D_2|} \\
&\qquad \times\, \frac{\binom{m_{1\cdot}}{s_{11\cdot}} \binom{m_{0\cdot}}{s_{10\cdot}}}{\binom{m_{1\cdot} + m_{0\cdot}}{|D_1|}} \cdot \frac{\binom{m_{10}}{s_{110}} \binom{m_{11}}{s_{111}}}{\binom{m_{1\cdot}}{s_{11\cdot}}} \cdot \frac{\binom{m_{00}}{s_{100}} \binom{m_{01}}{s_{101}}}{\binom{m_{0\cdot}}{s_{10\cdot}}} \\
&\qquad \times\, \frac{\binom{m_{1\cdot} - s_{11\cdot}}{s_{21\cdot}} \binom{m_{0\cdot} - s_{10\cdot}}{s_{20\cdot}}}{\binom{m_{1\cdot} + m_{0\cdot} - |D_1|}{|D_2|}} \cdot \frac{\binom{m_{11} - s_{111}}{s_{211}} \binom{m_{10} - s_{110}}{s_{210}}}{\binom{m_{1\cdot} - s_{11\cdot}}{s_{21\cdot}}} \cdot \frac{\binom{m_{01} - s_{101}}{s_{201}} \binom{m_{00} - s_{100}}{s_{200}}}{\binom{m_{0\cdot} - s_{10\cdot}}{s_{20\cdot}}}\, w_{s_1, s_2:\, s_{\cdot 0 \cdot},\, s_{\cdot 1 \cdot}},
\end{aligned}
\]
where $s_{101} = s_{1\cdot 1} - s_{111}$, $s_{201} = s_{2\cdot 1} - s_{211}$, $s_{110} = |D_1| - s_{1\cdot 1} - s_{100}$, and $s_{210} = |D_2| - s_{2\cdot 1} - s_{200}$.

The above example is for the conditional polytomous logistic regression. The same idea applies to the less conditional polytomous logistic regression; one just needs one more loop to run over all possible numbers of type 1 and type 2 cases.
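The same check can be sketched for the two-case-type denominator. The Python example below compares a brute-force sum over all ordered pairs of disjoint sets (s1, s2) with a grouped sum over per-cell counts, where the number of pairs sharing the same counts is a product of terms C(m, a) C(m - a, b). As before, this is only an illustration: the weight function, the function names, and the example numbers are assumptions made for the sketch, not the dissertation's code.

```python
import math
from itertools import combinations, product

def weight(stratum_counts, n, m):
    """Assumed weight: inverse of prod_l C(n_l - k_l, m_l - k_l), k_l = pseudo-cases in stratum l."""
    w = 1.0
    for l in n:
        w /= math.comb(n[l] - stratum_counts[l], m[l] - stratum_counts[l])
    return w

def denominator_enumerated(subjects, d1, d2, beta1, beta2, n, m):
    """Sum over all disjoint (s1, s2) with |s1| = d1 and |s2| = d2; subjects are (x, z) pairs."""
    idx = range(len(subjects))
    total = 0.0
    for s1 in combinations(idx, d1):
        rest = [i for i in idx if i not in s1]
        for s2 in combinations(rest, d2):
            z_s1 = sum(subjects[i][1] for i in s1)  # number with Z = 1 in s1
            z_s2 = sum(subjects[i][1] for i in s2)  # number with Z = 1 in s2
            k = {l: sum(subjects[i][0] == l for i in s1 + s2) for l in n}
            total += math.exp(beta1 * z_s1 + beta2 * z_s2) * weight(k, n, m)
    return total

def denominator_grouped(subjects, d1, d2, beta1, beta2, n, m):
    """Same sum, looping over per-cell counts of type 1 (a) and type 2 (b) pseudo-cases."""
    cells = sorted(set(subjects))
    m_cell = [subjects.count(c) for c in cells]
    total = 0.0
    for a in product(*(range(mc + 1) for mc in m_cell)):
        if sum(a) != d1:
            continue
        for b in product(*(range(mc - ac + 1) for mc, ac in zip(m_cell, a))):
            if sum(b) != d2:
                continue
            # Number of (s1, s2) pairs that share these cell counts.
            n_pairs = math.prod(
                math.comb(mc, ac) * math.comb(mc - ac, bc)
                for mc, ac, bc in zip(m_cell, a, b)
            )
            z_s1 = sum(ac * z for (_, z), ac in zip(cells, a))
            z_s2 = sum(bc * z for (_, z), bc in zip(cells, b))
            k = {l: sum(ac + bc for (x, _), ac, bc in zip(cells, a, b) if x == l) for l in n}
            total += n_pairs * math.exp(beta1 * z_s1 + beta2 * z_s2) * weight(k, n, m)
    return total

# Hypothetical sampled set with binary X and Z: 3 subjects per sampling stratum.
subjects = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
n = {0: 30, 1: 6}
m = {0: 3, 1: 3}
full = denominator_enumerated(subjects, 1, 1, 0.4, -0.2, n, m)
fast = denominator_grouped(subjects, 1, 1, 0.4, -0.2, n, m)
```

Under these assumptions the two functions return the same value up to rounding, and the grouped loop depends only on the number of cells and the case counts, not on the size of the sampled set.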
Appendix B SAS Macro Code for Simulation Studies

We provide the SAS macro code we used for the simulation studies as a reference for interested readers. The code for polytomous logistic regression for counter-matched data (Section B.2) is based on the speedup method presented in Section A.2.

B.1 Logistic Regression for Counter-Matched Data

The following macro code %makepseudo, programmed by Dr. Bryan Langholz, generates a "pseudo" data set for the analysis. The first argument of the macro is the original data set, the second argument is the list of variables in the analysis, and the third argument specifies the counter-matching variable.

%macro makepseudo(dsn,vlist,samstr);
  * get the number of variables in the variable list;
  %let nvars = 0;
  %do %until (%scan(&vlist,%eval(&nvars+1)) lt 0);
    %let nvars = %eval(&nvars+1);
  %end;

  proc datasets; delete pseudot pseudo copy strat; run;

  data copy; set &dsn; samstr = &samstr; run;

  %let done = 0;
  %let psrecs = 0;
  %do %until (&done eq 1);

    * split off the next sampled set (stratum) from the working copy;
    data strat (drop=_a) copy (drop=_a);
      retain _a 0;
      set copy end=eof;
      by setno;
      if _a=0 then do;
        output strat;
        if last.setno then do;
          call symput('setno',trim(left(put(setno,8.))));
          call symput('nstrat',trim(left(put(_n_,8.))));
          call symput('cases',trim(left(put(nfail,8.))));
          if eof then call symput('done','1');
          _a=1;
        end;
      end;
      else output copy;
    run;

    * NOTE: &varlist is not a macro argument; it is assumed here to be
      defined globally with the same contents as &vlist;
    data pseudot (keep=setno cc comb logw &varlist);
      retain i0 0 newrecs 0;
      length comb $ 20;
      * array of covariate variables;
      array vars{&nvars} &varlist;
      * temporary arrays;
      array cct{1000} _temporary_;
      array samstrt{1000} _temporary_;
      array totalt{0:20} _temporary_;
      array msampt{0:20} _temporary_;
      * covariate array;
      array vt{1000,&nvars} _temporary_;
      * count up number in each sampling stratum from a given set of subjects;
      array sscnt{0:20} _temporary_;
      do iss = 0 to 20; sscnt{iss} = 0; end;

      * loop through current stratum, read info into temporary arrays;
      do i = 1 to &nstrat;
        set strat;
        cct{i} = cc;
        samstrt{i} = samstr;
        totalt{samstr} = total;
        msampt{samstr} = msamp;
        do ivar = 1 to &nvars;
          vt{i,ivar} = vars{ivar};
        end;
      end;

      /* Section to create pseudo subjects */
      * generate appropriate do loops for this stratum;
      %do ind = 1 %to &cases;
        do i&ind = i%eval(&ind-1)+1 to %eval(&nstrat-&cases+&ind);
      %end;

      newrecs = newrecs + 1;

      * assign pseudo case-control status (all subjects must be cases);
      cc = 1;
      %do ind = 1 %to &cases;
        if cct{i&ind} eq 0 then cc = 0;
      %end;

      * covariates;
      do ivar = 1 to &nvars;
        vars{ivar} = 0;
        %do ind = 1 %to &cases;
          vars{ivar} = vars{ivar} + vt{i&ind,ivar};
        %end;
      end;

      * weight;
      do iss = 0 to 20; sscnt{iss} = 0; end;
      logw = 0;
      %do ind = 1 %to &cases;
        iss = samstrt{i&ind};
        sscnt{iss} = sscnt{iss} + 1;
        logw = logw + log( (totalt{iss}-sscnt{iss}+1)
                          /(msampt{iss}-sscnt{iss}+1) );
      %end;

      * id for combination;
      comb = ' ';
      %do ind = 1 %to &cases;
        comb = trim(left( comb )) || put(i&ind,2.);
      %end;

      output;

      * close the do loops generated above;
      %do ind = 1 %to &cases;
        end;
      %end;

      call symput('newrecs',trim(left(put(newrecs,10.))));
    run;

    %let psrecs=%eval(&psrecs + &newrecs);

    proc datasets; append base=pseudo data=pseudot; run; quit;

    %put Stratum number &setno;
    %put Number of subjects = &nstrat;
    %put Number of cases = &cases;
    %put Number of pseudo-records= &newrecs;
  %end; /* do until loop */

  %put Total number of records in the pseudo data set = &psrecs;
%mend;

After the data set 'pseudo' is generated by the above macro code, we can use the following SAS statements to conduct the analysis for the variable(s) specified as the argument of the macro %fit.

%macro fit(indvar);
  proc phreg data=pseudo nosummary;
    model setno*cc(0)=&indvar / offset=logw risklimits;
    strata setno;
  run;
%mend;

Variable cc is the case-control indicator (1 for a case and 0 for a control). Variable setno is the set number (or stratum number) in the data set. Variable logw in the syntax 'offset=logw' is the logarithm of the risk weights.

B.2 Polytomous Logistic Regression for Counter-Matched Data

Assume the outcome has three levels: one control group and two types of cases. The test of interest here is the score test of whether a covariate has the same effect on type 1 and type 2 cases relative to the controls, i.e. $H_0: \beta_1 = \beta_2 = \beta$, as discussed in Section 9.2. With the same setup in macro %level3_score, one can generate the pseudo data set called 'polypseudo' for the polytomous logistic regression for counter-matched data. The data set 'pseudo' in the macro %ST is generated by macro %makepseudo in Section B.1, which is for a dichotomous outcome. Therefore, the data set 'pseudo' is used to calculate the common $\beta$ for a covariate under the null, $H_0: \beta_1 = \beta_2 = \beta$, for the score test.

%macro ST(vlist);
  proc phreg data=pseudo nosummary outest=t1;
    model setno*cc(0)= &vlist / offset=logw risklimits;
    strata setno;
  run;

  * store the common estimate under the null in macro variable bhat;
  data commonB;
    set t1;
    if _type_ eq 'PARMS' then bhat = &indvar;
    call symput('bhat',trim(left(put(bhat,8.5))));
  run;

  %level3_score(dataset, case, &varlist, samstr);
%mend;

The macro %level3_score generates the analytic data set for the polytomous logistic regression and calculates the test statistic and the p-value for the score test.

%macro level3_score(dataset,cc,vlist,samstr);
  %let I11= 0; %let I12= 0; %let I22= 0; %let U= 0;

  * get the number of variables in the variable list;
  %let nvars = 0;
  %do %until (%scan(&vlist,%eval(&nvars+1)) lt 0);
    %let nvars = %eval(&nvars+1);
  %end;

  %let varlist=&vlist;
  %let varlist1=;
  %let varlist2=;
  %do i = 1 %to &nvars;
    %let varlist1=%trim(%left(&varlist1)) %trim(%left(%scan(&varlist,&i))_1);
    %let varlist2=%trim(%left(&varlist2)) %trim(%left(%scan(&varlist,&i))_2);
  %end;

  * number of controls, type 1 cases, and type 2 cases per set;
  proc freq data=&dataset;
    table &cc*setno/SPARSE out=nstrat (drop=percent rename=(count=nfailt)) noprint;
  run;
  proc sort data=nstrat; by setno &cc;
  data nfail(keep=setno nfail0 nfail1 nfail2);
    array ccnum nfail0 nfail1 nfail2;
    do over ccnum;
      set nstrat; by setno;
      ccnum=nfailt;
    end;
  run;

  * number of exposed subjects per outcome level and set;
  proc freq data=&dataset;
    table &cc*setno/SPARSE out=nstrat1 (drop=percent rename=(count=exposed)) noprint;
    where &indvar=1;
  run;
  proc sort data=nstrat1; by setno &cc;
  data exposed(keep=setno zd0 zd1 zd2);
    array ccnum zd0 zd1 zd2;
    do over ccnum;
      set nstrat1; by setno;
      ccnum=exposed;
    end;

  proc datasets; delete pseudot copy copy0 strat; run;

  proc sort data=&dataset; by setno;
  proc sort data=nfail; by setno;
  proc sort data=exposed; by setno;

  data copy0;
    merge &dataset nfail exposed;
    by setno;
    if zd1=. then zd1=0;
    if zd2=. then zd2=0;
  run;

  data copy; set copy0; samstr = &samstr; run;

  %let done = 0;
  %let psrecs = 0;
  %do %until (&done eq 1);

    data strat (drop=_a) copy (drop=_a);
      retain _a 0;
      set copy end=eof;
      by setno;
      if _a=0 then do;
        output strat;
        if last.setno then do;
          call symput('setno',trim(left(put(setno,8.))));
          call symput('nstrat',trim(left(put(_n_,8.))));
          call symput('cases1',trim(left(put(nfail1,8.))));
          call symput('cases2',trim(left(put(nfail2,8.))));
          call symput('cases',trim(left(put(nfail,8.))));
          call symput('zd1',trim(left(put(zd1,8.))));
          call symput('zd',trim(left(put(zd1+zd2,8.))));
          if eof then call symput('done','1');
          _a=1;
        end;
      end;
      else output copy;
    run;

    data pseudot (keep=U I11 I12 I22 ST p);
      retain i0 0 newrecs 0 j0 0 delta 0
             z1zdelta z1delta zdelta z1sqdelta zsqdelta 0;
      length comb1 comb2 $ 40;
      * array of covariate variables;
      array vars{&nvars} &vlist;
      array vars1{&nvars} &varlist1;
      array vars2{&nvars} &varlist2;
      * temporary arrays;
      array cct{1000} _temporary_;
      array samstrt{1000} _temporary_;
      array totalt{0:20} _temporary_;
      array msampt{0:20} _temporary_;
      * covariate array;
      array vt{1000,&nvars} _temporary_;
      * count up number in each sampling stratum from a given set of subjects;
      array sscnt{0:20} _temporary_;
      do iss = 0 to 20; sscnt{iss} = 0; end;

      * loop through current stratum, read info into temporary arrays;
      do i = 1 to &nstrat;
        set strat;
        cct{i} = &cc;
        samstrt{i} = samstr;
        totalt{samstr} = total;
        msampt{samstr} = msamp;
        do ivar = 1 to &nvars;
          vt{i,ivar} = vars{ivar};
        end;
      end;

      /* Section to create pseudo subjects */
      * generate appropriate do loops for this stratum;
      *=========== type 1 case ============;
      %do ind = 1 %to &cases1;
        do i&ind = i%eval(&ind-1)+1 to %eval(&nstrat-&cases1+&ind);
      %end;

      * covariates;
      do ivar = 1 to &nvars;
        vars1{ivar} = 0;
        %do ind = 1 %to &cases1;
          vars1{ivar} = vars1{ivar} + vt{i&ind,ivar};
        %end;
      end;

      *=========== type 2 case ============;
      %do ind2 = 1 %to &cases2;
        do j&ind2 = j%eval(&ind2-1)+1 to %eval(&nstrat-&cases2+&ind2);
        %do ind = 1 %to &cases1;
          if j&ind2=i&ind then continue;
        %end;
      %end;

      * covariates;
      do ivar = 1 to &nvars;
        vars2{ivar} = 0;
        %do ind2 = 1 %to &cases2;
          vars2{ivar} = vars2{ivar} + vt{j&ind2,ivar};
        %end;
      end;

      * number of records;
      newrecs = newrecs + 1;

      * id for combination;
      comb1 = ' ';
      %do ind = 1 %to &cases1;
        comb1 = trim(left( comb1 )) || put(i&ind,3.);
      %end;
      comb2 = ' ';
      %do ind2 = 1 %to &cases2;
        comb2 = trim(left( comb2 )) || put(j&ind2,3.);
      %end;

      * assign pseudo case-control status;
      cc = 1;
      %do ind = 1 %to &cases1;
        if cct{i&ind} ~= 1 then cc = 0;
      %end;
      %do ind2 = 1 %to &cases2;
        if cct{j&ind2} ~= 2 then cc = 0;
      %end;

      * weight;
      do iss = 0 to 20; sscnt{iss} = 0; end;
      logw = 0;
      %do ind = 1 %to &cases1;
        iss = samstrt{i&ind};
        sscnt{iss} = sscnt{iss} + 1;
        logw = logw + log( (totalt{iss}-sscnt{iss}+1)
                          /(msampt{iss}-sscnt{iss}+1) );
      %end;
      %do ind2 = 1 %to &cases2;
        iss = samstrt{j&ind2};
        sscnt{iss} = sscnt{iss} + 1;
        logw = logw + log( (totalt{iss}-sscnt{iss}+1)
                          /(msampt{iss}-sscnt{iss}+1) );
      %end;

      ** information component under the null: alpha=beta1-beta2=0 **;
      delta= delta + exp((&indvar._1+ &indvar._2)* &bhat + logw);
      z1zdelta= z1zdelta + &indvar._1*(&indvar._1+ &indvar._2)
                *exp((&indvar._1+ &indvar._2)* &bhat + logw);
      z1delta= z1delta + &indvar._1
                *exp((&indvar._1+ &indvar._2)* &bhat + logw);
      zdelta= zdelta + (&indvar._1+ &indvar._2)
                *exp((&indvar._1+ &indvar._2)* &bhat + logw);
      z1sqdelta=z1sqdelta + &indvar._1*&indvar._1
                *exp((&indvar._1+ &indvar._2)* &bhat + logw);
      zsqdelta= zsqdelta + (&indvar._1+ &indvar._2)*(&indvar._1+ &indvar._2)
                *exp((&indvar._1+ &indvar._2)* &bhat + logw);

      * close the do loops generated above;
      %do ind = 1 %to &cases2;
        end;
      %end;
      %do ind = 1 %to &cases1;
        end;
      %end;

      call symput('newrecs',trim(left(put(newrecs,10.))));

      ** information under the null: alpha=beta1-beta2=0 **;
      I11=&I11+(z1sqdelta*delta-z1delta*z1delta)/(delta*delta);
      I22=&I22+(zsqdelta*delta-zdelta*zdelta)/(delta*delta);
      I12=&I12+(z1zdelta*delta-z1delta*zdelta)/(delta*delta);
      U = &U + &zd1-z1delta/delta;
      if (I11*I22-I12*I12)>0 then do;
        ST=U*U*I22/(I11*I22-I12*I12);
        p=1-probchi(ST,1);
        call symput('ST',trim(left(put(ST,10.5))));
        call symput('p',trim(left(put(p,7.5))));
      end;
      call symput('I11',trim(left(put(I11,10.5))));
      call symput('I22',trim(left(put(I22,10.5))));
      call symput('I12',trim(left(put(I12,10.5))));
      call symput('U',trim(left(put(U,10.5))));
    run;

    %let psrecs=%eval(&psrecs + &newrecs);

    proc datasets; append base=polypseudo data=pseudot; run; quit;

    %put Stratum number &setno;
    %put Number of subjects = &nstrat;
    %put Number of TYPE 1 cases = &cases1;
    %put Number of TYPE 2 cases = &cases2;
    %put Number of pseudo-records= &newrecs;
  %end; /* do until loop */

  %put;
  %put Total number of records in the pseudo case-control data set = &psrecs;
  %put Score Test statistic for (H0: beta1=beta2) = &ST;
  %put p-value of Score Test for (H0: beta1=beta2) = &p;
%mend;

After submitting the above code, the statistic for the score test of $H_0: \beta_1 = \beta_2 = \beta$ and the p-value are displayed in the LOG window. The polytomous logistic regression for the covariate(s) specified as the first argument of macro %polyfit can be performed in SAS by calling macro %polyfit. If there is only one covariate (either binary or continuous), we can further ask for a Wald test of $H_0: \beta_1 = \beta_2$.

%macro polyfit(indvar);
  * get the number of variables in the variable list;
  %let nindvar = 0;
  %do %until (%scan(&indvar,%eval(&nindvar+1)) lt 0);
    %let nindvar= %eval(&nindvar+1);
  %end;

  %let indvarlist=&indvar;
  %let indvar1= ;
  %let indvar2= ;
  %do i = 1 %to &nindvar;
    %let indvar1=%trim(%left(&indvar1)) %trim(%left(%scan(&indvarlist,&i))_1);
    %let indvar2=%trim(%left(&indvar2)) %trim(%left(%scan(&indvarlist,&i))_2);
  %end;
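proc phreg data=polypseudo nosummary;
  model setno*cc(0)= &indvar1 &indvar2/ offset=logw risklimits;
  strata setno;
  %if &nindvar=1 %then %do;
    WALD: test &indvar1=&indvar2; **** a WALD test;
  %end;
run;
%mend;

Appendix C

Derivation of Information for Conditional Polytomous Likelihood Model

In Section 8.2.2, we derive the information for the conditional polytomous likelihood model. The detailed derivation is documented here. The conditional likelihood can be based on the probability

$$\Pr\bigl(D_0, D_1, D_2 \mid \tilde D_0, \tilde D_1, \tilde D_2, |D_0|, |D_1|, |D_2|\bigr).$$

Writing this probability as a ratio whose denominator sums over all configurations $(s_0, s_1, s_2)$ satisfying cond, the sampling-probability factors, the intercepts $\alpha_1$ and $\alpha_2$, and the common normalizing product $\prod_{i \in \mathcal{R}} \bigl(1 + e^{\alpha_1 + Z_i\beta_1} + e^{\alpha_2 + Z_i\beta_2}\bigr)$ cancel between numerator and denominator, leaving

$$\frac{\exp\bigl(\beta_1 \sum_{i \in D_1} Z_i + \beta_2 \sum_{i \in D_2} Z_i\bigr)}{\sum_{\text{cond}} \exp\bigl(\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\bigr)},$$

where cond is the condition that $s_k : |s_k| = |D_k|$, $k = 0, 1, 2$. Consequently, the log-likelihood equals

$$\ell(\beta_1, \beta_2) = \beta_1 \Bigl(\sum_{i \in D_1} Z_i\Bigr) + \beta_2 \Bigl(\sum_{i \in D_2} Z_i\Bigr) - \log\Bigl\{ \sum_{s_k : |s_k| = |D_k|,\, k = 0,1,2} \exp\Bigl(\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\Bigr) \Bigr\}.$$

The corresponding score and information elements are

$$U_1(\beta_1, \beta_2) = \frac{\partial \ell}{\partial \beta_1} = \sum_{i \in D_1} Z_i - \frac{\sum \bigl(\sum_{i \in s_1} Z_i\bigr) A}{\sum A},$$

$$U_2(\beta_1, \beta_2) = \frac{\partial \ell}{\partial \beta_2} = \sum_{i \in D_2} Z_i - \frac{\sum \bigl(\sum_{i \in s_2} Z_i\bigr) A}{\sum A},$$

$$I_{11}(\beta_1, \beta_2) = -\frac{\partial^2 \ell}{\partial \beta_1^2} = \frac{\bigl[\sum \bigl(\sum_{i \in s_1} Z_i\bigr)^2 A\bigr]\bigl(\sum A\bigr) - \bigl[\sum \bigl(\sum_{i \in s_1} Z_i\bigr) A\bigr]^2}{\bigl[\sum A\bigr]^2},$$

$$I_{22}(\beta_1, \beta_2) = -\frac{\partial^2 \ell}{\partial \beta_2^2} = \frac{\bigl[\sum \bigl(\sum_{i \in s_2} Z_i\bigr)^2 A\bigr]\bigl(\sum A\bigr) - \bigl[\sum \bigl(\sum_{i \in s_2} Z_i\bigr) A\bigr]^2}{\bigl[\sum A\bigr]^2},$$

$$I_{12}(\beta_1, \beta_2) = -\frac{\partial^2 \ell}{\partial \beta_1 \partial \beta_2} = \frac{1}{\bigl[\sum A\bigr]^2} \Bigl\{ \Bigl[\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr)\Bigl(\sum_{i \in s_2} Z_i\Bigr) A\Bigr]\Bigl(\sum A\Bigr) - \Bigl[\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr) A\Bigr]\Bigl[\sum \Bigl(\sum_{i \in s_2} Z_i\Bigr) A\Bigr] \Bigr\},$$

where $\sum = \sum_{s_k : |s_k| = |D_k|,\, k = 0,1,2}$ and $A = \exp\bigl(\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\bigr)$.

Some equalities can be applied in the derivation of the information. Let $r = |\mathcal{R}| = |D_0| + |D_1| + |D_2|$ and $d_k = |D_k| = |s_k|$, $k = 0, 1, 2$. Then

$$\sum 1 = \sum_{|s_k| = |D_k| = d_k,\, k = 0,1,2} 1 = \binom{r}{d_1}\binom{r - d_1}{d_2},$$

$$\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr) = \binom{r - 1}{d_1 - 1}\binom{r - d_1}{d_2} \sum_{i \in \mathcal{R}} Z_i,$$

$$\begin{aligned}
\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr)^2 &= \binom{r - d_1}{d_2} \sum_{|s_1| = |D_1|} \Bigl[\sum_{i \in s_1} Z_i^2 + \sum_{i \ne j \in s_1} Z_i Z_j\Bigr] \\
&= \binom{r - d_1}{d_2} \Bigl[\binom{r - 1}{d_1 - 1} \sum_{i \in \mathcal{R}} Z_i^2 + \binom{r - 2}{d_1 - 2} \sum_{i \ne j \in \mathcal{R}} Z_i Z_j\Bigr] \\
&= \binom{r - d_1}{d_2} \Bigl[\binom{r - 1}{d_1 - 1} \sum_{i \in \mathcal{R}} Z_i^2 + \binom{r - 2}{d_1 - 2} \Bigl(\Bigl(\sum_{i \in \mathcal{R}} Z_i\Bigr)^2 - \sum_{i \in \mathcal{R}} Z_i^2\Bigr)\Bigr] \\
&= \binom{r}{d_1}\binom{r - d_1}{d_2} \Bigl[\frac{d_1 (r - d_1)}{r (r - 1)} \sum_{i \in \mathcal{R}} Z_i^2 + \frac{d_1 (d_1 - 1)}{r (r - 1)} \Bigl(\sum_{i \in \mathcal{R}} Z_i\Bigr)^2\Bigr],
\end{aligned}$$

with the analogous expressions for $s_2$, and

$$\begin{aligned}
\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr)\Bigl(\sum_{j \in s_2} Z_j\Bigr) &= \sum \sum_{i \in s_1,\, j \in s_2} Z_i Z_j = \binom{r - 2}{d_1 - 1}\binom{r - d_1 - 1}{d_2 - 1} \sum_{i \ne j \in \mathcal{R}} Z_i Z_j \\
&= \binom{r}{d_1}\binom{r - d_1}{d_2} \frac{d_1 d_2}{r (r - 1)} \Bigl[\Bigl(\sum_{i \in \mathcal{R}} Z_i\Bigr)^2 - \sum_{i \in \mathcal{R}} Z_i^2\Bigr].
\end{aligned}$$

Under the null $H_0 : \beta_1 = \beta_2 = 0$, i.e. $A = \exp\bigl(\beta_1 \sum_{i \in s_1} Z_i + \beta_2 \sum_{i \in s_2} Z_i\bigr) = 1$,

$$\Bigl[\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr)^2 A\Bigr]\Bigl(\sum A\Bigr) - \Bigl[\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr) A\Bigr]^2 = \binom{r}{d_1}^2 \binom{r - d_1}{d_2}^2 \frac{d_1 (r - d_1)}{r (r - 1)} \Bigl[\sum_{i \in \mathcal{R}} Z_i^2 - \frac{1}{r}\Bigl(\sum_{i \in \mathcal{R}} Z_i\Bigr)^2\Bigr],$$

$$\Bigl[\sum \Bigl(\sum_{i \in s_2} Z_i\Bigr)^2 A\Bigr]\Bigl(\sum A\Bigr) - \Bigl[\sum \Bigl(\sum_{i \in s_2} Z_i\Bigr) A\Bigr]^2 = \binom{r}{d_1}^2 \binom{r - d_1}{d_2}^2 \frac{d_2 (r - d_2)}{r (r - 1)} \Bigl[\sum_{i \in \mathcal{R}} Z_i^2 - \frac{1}{r}\Bigl(\sum_{i \in \mathcal{R}} Z_i\Bigr)^2\Bigr],$$

$$\Bigl[\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr)\Bigl(\sum_{i \in s_2} Z_i\Bigr) A\Bigr]\Bigl(\sum A\Bigr) - \Bigl[\sum \Bigl(\sum_{i \in s_1} Z_i\Bigr) A\Bigr]\Bigl[\sum \Bigl(\sum_{i \in s_2} Z_i\Bigr) A\Bigr] = -\binom{r}{d_1}^2 \binom{r - d_1}{d_2}^2 \frac{d_1 d_2}{r (r - 1)} \Bigl[\sum_{i \in \mathcal{R}} Z_i^2 - \frac{1}{r}\Bigl(\sum_{i \in \mathcal{R}} Z_i\Bigr)^2\Bigr].$$

Let $\operatorname{Var} Z = \frac{1}{r - 1}\bigl[\sum_{i \in \mathcal{R}} Z_i^2 - \frac{1}{r}\bigl(\sum_{i \in \mathcal{R}} Z_i\bigr)^2\bigr]$. Then the derived information under the null $H_0 : \beta_1 = \beta_2 = 0$ is given by

$$\begin{pmatrix} I_{11} & I_{12} \\ I_{12} & I_{22} \end{pmatrix} = \begin{pmatrix} \frac{d_1 (r - d_1)}{r} \operatorname{Var} Z & -\frac{d_1 d_2}{r} \operatorname{Var} Z \\ -\frac{d_1 d_2}{r} \operatorname{Var} Z & \frac{d_2 (r - d_2)}{r} \operatorname{Var} Z \end{pmatrix} = r \operatorname{Var} Z \begin{pmatrix} \frac{d_1}{r} \frac{r - d_1}{r} & -\frac{d_1}{r} \frac{d_2}{r} \\ -\frac{d_1}{r} \frac{d_2}{r} & \frac{d_2}{r} \frac{r - d_2}{r} \end{pmatrix} \longrightarrow \rho \operatorname{Var} Z \begin{pmatrix} p_1 (1 - p_1) & -p_1 p_2 \\ -p_1 p_2 & p_2 (1 - p_2) \end{pmatrix},$$

where $\rho$ = sampling proportion, and $p_j$ = type $j$ proportion in the sample, $j = 1, 2$. Then, the inverse information for $\beta_1$ and $\beta_2$ equals

$$\frac{1}{\rho \operatorname{Var} Z} \cdot \frac{1}{1 - p_1 - p_2} \begin{pmatrix} \frac{1 - p_2}{p_1} & 1 \\ 1 & \frac{1 - p_1}{p_2} \end{pmatrix}.$$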
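The closed-form null information derived in this appendix can be checked numerically on a small set. The following is a minimal Python sketch, not part of the original SAS code: it enumerates every configuration (s1, s2) with |s1| = d1 and |s2| = d2, evaluates the information elements with A = 1, and compares them with d1(r - d1)/r Var Z, -d1 d2/r Var Z and d2(r - d2)/r Var Z. The function names and the covariate values are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import isclose


def null_information_bruteforce(z, d1, d2):
    """Return (I11, I12, I22) at beta1 = beta2 = 0 by enumerating all (s1, s2)."""
    idx = range(len(z))
    count = 0  # sum of A over configurations; A = 1 under the null
    t1_sum = t2_sum = 0.0
    t11_sum = t22_sum = t12_sum = 0.0
    for s1 in combinations(idx, d1):
        chosen = set(s1)
        rest = [i for i in idx if i not in chosen]
        t1 = sum(z[i] for i in s1)
        for s2 in combinations(rest, d2):  # s0 is whatever remains
            t2 = sum(z[i] for i in s2)
            count += 1
            t1_sum += t1
            t2_sum += t2
            t11_sum += t1 * t1
            t22_sum += t2 * t2
            t12_sum += t1 * t2
    i11 = (t11_sum * count - t1_sum ** 2) / count ** 2
    i22 = (t22_sum * count - t2_sum ** 2) / count ** 2
    i12 = (t12_sum * count - t1_sum * t2_sum) / count ** 2
    return i11, i12, i22


def null_information_closed_form(z, d1, d2):
    """Return (I11, I12, I22) from the closed-form expressions in terms of Var Z."""
    r = len(z)
    mean = sum(z) / r
    var_z = sum((x - mean) ** 2 for x in z) / (r - 1)
    return (d1 * (r - d1) / r * var_z,
            -d1 * d2 / r * var_z,
            d2 * (r - d2) / r * var_z)


def inverse_scaled_information(p1, p2):
    """Inverse of [[p1(1-p1), -p1 p2], [-p1 p2, p2(1-p2)]] in closed form."""
    c = 1.0 / (1.0 - p1 - p2)
    return [[c * (1.0 - p2) / p1, c], [c, c * (1.0 - p1) / p2]]


if __name__ == "__main__":
    z = [0.5, 1.0, 2.0, 3.5, 4.0, 7.0]  # hypothetical covariate values
    brute = null_information_bruteforce(z, d1=2, d2=1)
    closed = null_information_closed_form(z, d1=2, d2=1)
    agree = all(isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
                for a, b in zip(brute, closed))
    print("brute force:", brute)
    print("closed form:", closed)
    print("agree:", agree)
```

The enumeration grows combinatorially with the set size, so this check is only practical for small r; it is meant to confirm the algebra, not to replace the closed form.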
Asset Metadata
Creator
Li, Yu-Fen (author)
Core Title
Counter-matching in nested case-control studies: Design and analytic issues
School
Graduate School
Degree
Doctor of Philosophy
Degree Program
Biometry
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
biology, biostatistics, OAI-PMH Harvest, statistics
Language
English
Contributor
Digitized by ProQuest (provenance)
Advisor
Gilliland, Frank (committee chair), Langholz, Bryan (committee chair), Gauderman, William James (committee member), Goldstein, Larry (committee member), Xiang, (Dr.) (committee member)
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c16-407271
Unique identifier
UC11340858
Identifier
3145236.pdf (filename),usctheses-c16-407271 (legacy record id)
Legacy Identifier
3145236.pdf
Dmrecord
407271
Document Type
Dissertation
Rights
Li, Yu-Fen
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA
Tags
biology, biostatistics