Sampling strategies based on existing information in nested case control studies
(USC Thesis Other)
Sampling Strategies based on Existing Information in Nested Case Control Studies

by
Yi Luo

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements of the Degree
DOCTOR OF PHILOSOPHY
(Biostatistics)

May 2018

Copyright 2018 Yi Luo

Acknowledgements

At the very beginning, I would like to take this opportunity to express my sincere gratitude to everyone who has helped me in carrying out this thesis. I am grateful to Professor Bryan Langholz, who inspired and directed me. He is unfailingly kind and helpful at all times. I feel blessed to have him as my mentor. I am also extremely thankful to my committee, Professor Kiros T. Berhane, Professor Wendy Jean Mack, Professor Sandrah Proctor Eckel, and Professor Larry Goldstein, for their valuable input and encouragement. I owe my thanks to lecturer Mary Ann Murphy for helping with my English writing skills. My appreciation also goes to all my colleagues and to the people who have helped me and stayed patient with me as I took the process slowly.

Table of Contents

Abstract
1 Introduction
2 Background
    2.1 Data Structure
    2.2 Partial Likelihood with Sampling Weights
    2.3 Colorado Plateau Uranium Miners Cohort
    2.4 Data Realization
3 Literature Review
    3.1 Simple Random Sampling
    3.2 Counter Matching
        3.2.1 Counter Matching on Surrogate Exposure
        3.2.2 Counter Matching with A Covariate
        3.2.3 Counter Matching on Continuous Exposure
4 Case-Based Probability Matching
    4.1 Design
    4.2 Analysis Results Based on Simulated Datasets
        4.2.1 Relative Efficiencies
        4.2.2 Bias
    4.3 Projection of Case Exposure Distribution
    4.4 Discussion
5 Outside Caliper Matching
    5.1 Design
    5.2 Outside Caliper Matching on Exposure
        5.2.1 Effect of the Caliper Size
        5.2.2 Comparison with Simple Random Sampling
        5.2.3 Bias
    5.3 Estimation with Covariates
    5.4 Outside Caliper Matching on Surrogate Exposure
    5.5 Generalization to Other Matching Rates
6 Application on Real World Data
7 Summary
    7.1 Relative Efficiency
    7.2 Bias
    7.3 Design and application
        7.3.1 Simple random sampling
        7.3.2 Counter matching
        7.3.3 Case-based probability matching
        7.3.4 Outside caliper matching
    7.4 Discussion
    7.5 Future Directions
8 Appendix
BIBLIOGRAPHY

List of Figures

4.1 Relative efficiencies of case-based probability matching compared to simple random sampling.
4.2 Relative efficiencies of case-based probability matching and simple random sampling compared to full cohort analysis.
4.3 Amount of bias observed from case-exposure distribution based probability matching and simple random sampling given a large sample size.
4.4 P-values of bias tests from case-exposure distribution based probability matching and simple random sampling.
4.5 Amount of bias observed from case-exposure distribution based probability matching and simple random sampling given a small sample size.
4.6 P-values of bias from case-exposure distribution based probability matching and simple random sampling, reduced sample size.
4.7 Percentage bias in exposure estimation from probability matching and simple random sampling.
4.8 Relative efficiencies of case-exposure distribution based probability matching compared to simple random sampling.
4.9 P-values of the bias tests from probability matching compared to simple random sampling.
5.1 Relative efficiencies of outside caliper matching compared to simple random sampling with different values of $Q$ given $\beta = 0$.
5.2 Relative efficiencies of outside caliper matching with maximum caliper size compared to simple random sampling.
5.3 P-values of the bias test for outside caliper matching with maximum caliper size compared to simple random sampling.
5.4 Relative efficiencies of the exposure estimation with the covariate controlled in the model.
5.5 Relative efficiencies of the covariate estimation with the exposure controlled in the model.
5.6 Relative efficiencies of exposure estimation when a surrogate exposure is used as the matching factor.
7.1 Relative efficiencies of exposure estimation by the exposure coefficient.
7.2 Loess plot of bias in exposure estimation.
7.3 Loess plot of P-value for bias in exposure estimation.
8.1 Matrix.

Abstract

Cohort studies, by their nature, require a large number of subjects or a long follow-up period to observe the outcomes of interest. These requirements make data collection costly. The nested case-control design reduces the cost because it includes only the cases and selected controls. However, this also makes it less efficient than a full cohort design. It is known that the way in which controls are sampled can affect study efficiency [7] [8]. A good choice of control sampling method can reduce not only the sample size but also the variance of the parameter estimates. Both matter especially when the study outcome is rare or the affordable sample size is small.
In this dissertation, we discuss sampling methods specifically designed for situations where the exposure or related information is available for the whole study cohort but, because of constraints such as a small budget or a short data collection period, some covariate information cannot be collected for every cohort member. We demonstrate the merits of choosing an appropriate sampling method by comparing relative efficiencies. These comparisons are made for different types of exposure and under a variety of conditions, including rare outcome and rare exposure situations. Under all conditions tested, we find that control sampling methods carefully chosen to match the study are, as a whole, more efficient in estimating exposure effects than commonly used sampling methods.

Chapter 1 Introduction

In this dissertation, we present four sampling methods that are applicable to nested case-control studies where the exposure or related information is available from the cohort while the covariate information is not. These include two established sampling designs, simple random sampling and counter matching, and two novel ones that we developed, case-based probability matching and outside caliper matching.

Simple random sampling, which we discuss in detail in Section 3.1, is the method most commonly used to reduce expense and simplify data collection. However, this design is known to have much less power in hypothesis testing than a full cohort analysis. To raise the power, we have to increase the sample size, which increases the cost. In other words, simple random sampling sacrifices power to reduce cost.

Unlike simple random sampling, counter matching (Section 3.2) promises both economy and efficiency. This method, first presented in the 1990s by Langholz and Borgan [7], is a stratified sampling method that incorporates exposure-related information into the sampling process.
It has proved its capability and stability in many studies, including the Colorado Uranium Miner Cohort Study (Section 2.3) and the Women’s Environmental Cancer and Radiation Epidemiology (WECARE) Study.

One disadvantage of this advanced matching method is that it is designed for categorical exposures and therefore cannot be applied directly to continuous exposures. Categorizing the exposure is an established way to implement counter matching, but a categorized variable contains much less information than the original continuous exposure.

To better accommodate continuous exposures in the sampling process, we propose and evaluate two new methods, case-based probability matching and outside caliper matching.

In Chapter 4, we describe the case-based probability matching method in detail. This method samples case-control sets using sampling probabilities calculated from the exposure distribution in cases. Samples selected in this way retain a constant portion of the information in the full cohort, which makes the power for testing the exposure effect relatively insensitive to the correlation between the exposure and the outcome.

Chapter 5 introduces outside caliper matching, a method that is similar to counter matching. The two new sampling designs both match cases to controls that have very different exposure characteristics. The unique feature of outside caliper matching is that it samples on the rank order of the exposure rather than on the actual exposure measurement. This makes it compatible with all kinds of exposure variables and therefore more flexible.

Chapter 6 presents an application to real data, using the Colorado Uranium Miner Cohort Study as a real-world example. All four sampling methods are applied and compared.

In the final summary chapter, the advantages and disadvantages of the four sampling methods are compared in different respects.
Practical recommendations for choosing a sampling method and designing studies are provided, with the aim of making efficient use of all available information in the cohort.

Chapter 2 Background

2.1 Data Structure

A nested case-control study usually samples its analysis population from an existing cohort. In an open cohort, participants can join the study at any time after the study starts (entry time), and they leave the study when the study ends or when they decide to quit (censored), or when they develop the outcome of interest (failed). The study follows subjects from their entry time until their failure time or their censoring time, whichever comes first.

During the follow-up period, all subjects who are still in the study cohort are considered at risk, and the group of subjects who are at risk at the same time point or during the same time period constitutes a risk set. If the risk sets are defined by time points, they are called continuous time risk sets. To set these up, we assume that all failures happen at different time points (no ties). For each failure time, a risk set is formed that includes the subject who developed the outcome (the case) and all other subjects who are still at risk (the controls). If the risk sets are defined by time intervals, they are called grouped time risk sets. To construct these, we divide the study period into mutually exclusive time intervals. Within each interval, we count all failures as cases, but we count only participants who are at risk at the beginning of the interval as controls. Because there can be multiple cases and controls in the same risk set, more complicated sampling designs can be applied not only to the controls but also to the cases.

For simplicity, we use continuous time risk sets throughout this dissertation.

2.2 Partial Likelihood with Sampling Weights

Based on the data structure of our study, we choose to use Cox’s proportional hazards model.
This model has been widely used for survival data analysis. It is a semi-parametric model in which the baseline hazard is an arbitrary non-parametric function and the effect of the exposures on the hazard is assumed to have a parametric form. The hazard rate function is defined as

$$\lambda_i(t; z) = \lambda_0(t)\exp\left(\beta^T Z_i(t)\right) \tag{2.1}$$

where $\lambda_0(t)$ is a non-negative baseline hazard at time $t$, and $\beta$ and $Z(t)$ are vectors of regression parameters and covariates, respectively [3]. Given the hazard model (2.1), the partial likelihood was derived by Cox [4] as

$$L(\beta) = \prod_{i=1}^{n} \left\{ \frac{\exp\left[\beta^T Z_i(t_i)\right]}{\sum_{j \in R(t_i)} \exp\left[\beta^T Z_j(t_i)\right]} \right\}^{\delta_i} \tag{2.2}$$

where $R(t_i) = \{j : T_j \ge T_i\}$ is the set of subjects still at risk at time $t_i$, and $\delta_i$, the indicator of an event occurring at time $t_i$, is 1 if the event occurs and 0 otherwise.

The asymptotic properties of this partial likelihood were discussed by Andersen and Gill [1] and later generalized by Borgan, Goldstein and Langholz [13]. The sampling weights were included in the likelihood, which therefore handles different sampling processes more easily. This improvement is accomplished through counting processes.

To incorporate sampling weights, we first set up a counting process $N_{i,r}(t)$ that counts, up to time $t$, the number of occurrences in which $i$ is the case ($\delta_i = 1$) and $\tilde{R}_i = r$ is the sampled risk set. This process is a non-negative, right-continuous function that jumps one unit per event:

$$N_{i,r}(t) = \sum_{j} I\left(t_j \le t,\ (\tilde{D}_{t_j}, \tilde{R}_{t_j}) = (i, r)\right) \tag{2.3}$$

Next, we assume that the survival times are iid with hazard rate $h(t)$. Let $\mathcal{F}_{t^-}$ denote the filtration (history) right before time $t$, and let $Y(t) = \sum_{i=1}^{n} Y_i(t)$ be the process recording the number of individuals at risk right before time $t$, where $Y_i(t)$ equals 1 when subject $i$ is at risk right before time $t$ and 0 otherwise.
We can specify the probability, given the history prior to $t$, of an event occurring at time $t$ with the sampled risk set being $r$ as

$$\Pr\left(t \le t_i < t + dt,\ \delta_i = 1,\ \tilde{R}_i = r \mid \mathcal{F}_{t^-}\right) = h(t)\,\pi_t(r \mid i)\,I(T_i \ge t)\,dt \tag{2.4}$$

where $\pi_t(r \mid i)$ is the conditional probability of obtaining $\tilde{R} = r$ as the sampled risk set given that subject $i$ is the one who failed at time $t$. With equation (2.4), and writing $Y_i(t) = I(T_i \ge t)$, we derive

$$E\left[dN_{i,r}(t) \mid \mathcal{F}_{t^-}\right] = Y_i(t)\,h(t)\,\pi_t(r \mid i)\,dt = \lambda_{i,r}(t)\,dt \tag{2.5}$$

where $dN_{i,r}(t)$ is the increment of the process $N_{i,r}(t)$ over $[t, t + dt)$. Substituting the hazard (2.1) for $h(t)$, we obtain, for a subject at risk at time $t$, the intensity process associated with $N_{i,r}(t)$:

$$\lambda_{i,r}(t) = \lambda_i(t)\,\pi_t(r \mid i) = \lambda_0(t)\exp\left(\beta^T Z_i(t)\right)\pi_t(r \mid i) \tag{2.6}$$

Summing over all subjects in each risk set, we obtain the intensity process associated with the counting process $N_r(t) = \sum_{i \in r} N_{i,r}(t)$ as follows:

$$\lambda_r(t) = \sum_{i \in r} \lambda_{i,r}(t) = \sum_{i \in r} \lambda_0(t)\exp\left(\beta^T Z_i(t)\right)\pi_t(r \mid i) \tag{2.7}$$

Finally, with these two intensity processes (2.6 and 2.7), the partial likelihood becomes

$$L(\beta) = \prod_{t_i} \left\{ \frac{\lambda_{i,\tilde{R}_i}(t_i)}{\lambda_{\tilde{R}_i}(t_i)} \right\}^{\delta_i} \tag{2.8}$$

$$= \prod_{t_i} \left\{ \frac{\exp\left(\beta^T Z_i(t_i)\right)\pi_{t_i}(\tilde{R}_i \mid i)}{\sum_{j \in \tilde{R}_i} \exp\left(\beta^T Z_j(t_i)\right)\pi_{t_i}(\tilde{R}_i \mid j)} \right\}^{\delta_i} \tag{2.9}$$

2.3 Colorado Plateau Uranium Miners Cohort

In 1949, the Colorado Plateau uranium miner cohort study [6] [11] [12] was initiated by the State of Colorado and the U.S. Public Health Service to investigate the effect of radon exposure on uranium miners’ risk of developing lung cancer. It was a large-scale observational study, and its study period lasted for 30 years.

From January 1st, 1952 to December 31st, 1982, the study recruited a total of 3347 Caucasian male miners who had worked in uranium mines in the four-state Colorado Plateau area. These participants joined the study either at the beginning of the study or one month after they started mining. They were followed until death, loss to follow-up, or the end of the study. During follow-up, they were interviewed by U.S. Public Health Service (PHS) physicians at least once.
During the interview, the physicians gathered the participants’ information, including demographics, occupational status, smoking, and radon exposure histories. Radon exposure was recorded as a time-varying cumulative measurement calculated from each miner’s occupational history and the exposure rate of the mines in which he had worked. Smoking was also a time-varying cumulative measurement, calculated from years of smoking and number of packs smoked per day. Both smoking and occupational status were updated yearly until 1969. By the end of the study, 258 lung cancer deaths had occurred.

2.4 Data Realization

In this dissertation, we rely heavily on simulated data to investigate the behavior of each sampling method. Given the available computing capacity and the need to investigate both large and small sample sizes, the simulated data sets are generated at two scales. The first consists of 1000 trials, each containing 100 risk sets of size 100, and represents a typical study scale or a relatively small sample size. The second consists of 100 trials, each containing 1000 risk sets of size 1000, and is used to evaluate the results obtained from asymptotic theory, which assumes that sample sizes approach infinity.

The data realization process starts with the simulation of the exposure variable, $Z$. This variable is generated to follow a predefined exposure distribution. We selected three exposure distributions: Uniform(0,1), Normal(0.5, 1/36), and $\chi^2_{[2]}/6$. We used these three distributions to represent circumstances where 1) people are equally likely to receive all levels of the exposure (uniform); 2) most people are exposed around a median level while a small portion receive extremely high or low exposures (normal); and 3) high exposure rarely happens, and most people receive very low or no exposure at all ($\chi^2$).
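As a rough illustration of the exposure-generation step described above, the following sketch draws exposure values from the three distributions and prints their sample mean and standard deviation. It is a minimal sketch, not the simulation code used in this work: the function name `simulate_exposure` and the seed are hypothetical, NumPy is assumed, Normal(0.5, 1/36) is read as mean 0.5 and variance 1/36, and the third distribution is read as a chi-square variable with 2 degrees of freedom divided by 6.

```python
import numpy as np

def simulate_exposure(dist, size, rng):
    """Draw `size` exposure values Z from one of the three exposure distributions."""
    if dist == "uniform":
        return rng.uniform(0.0, 1.0, size)
    if dist == "normal":
        # Normal(0.5, 1/36): variance 1/36, i.e. standard deviation 1/6
        return rng.normal(0.5, 1.0 / 6.0, size)
    if dist == "chisq":
        # Assumed reading: chi-square with 2 degrees of freedom, divided by 6
        return rng.chisquare(2, size) / 6.0
    raise ValueError(f"unknown distribution: {dist}")

rng = np.random.default_rng(2018)  # arbitrary seed, for reproducibility only
for name in ("uniform", "normal", "chisq"):
    z = simulate_exposure(name, 100_000, rng)
    print(f"{name}: mean={z.mean():.3f}, sd={z.std():.3f}")
```

Passing the random generator in explicitly keeps each simulated trial reproducible, which is convenient when the same cohort has to be resampled under several designs.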
To ensure a relatively fair comparison, the distribution parameters of the three are chosen so that more than 99% of the simulated subjects have an exposure value between 0 and 1.

To check the possible impact of a confounder or covariate on the estimation of the exposure effect, a covariate $C$ is generated together with the exposure $Z$ to follow a bivariate normal distribution with covariance matrix

$$\frac{1}{36}\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$$

where $\rho$ is the pre-defined correlation.

The case-control status is determined from rate ratios calculated under Cox’s proportional hazards model: $\mathrm{RR} = \varphi = \exp(\beta_z z + \beta_c c)$, where $\beta_z$ and $\beta_c$ are pre-defined regression parameters that can be adjusted. We calculate the rate ratio for each subject $i$ and then pick one case per risk set according to the conditional probability of subject $i$ being the case given the risk set $R$, which is $\Pr(D = i \mid R) = \varphi_i / \sum_{j \in R} \varphi_j$.

At the end of the data simulation process, each observation has an exposure variable $Z$, a case-control indicator, a trial number, a risk set number, an individual identifier, and, if needed, a covariate variable $C$.

Chapter 3 Literature Review

In this chapter, we briefly introduce the simple random sampling and counter matching methods. Simple random sampling is the basic and most widely used sampling method. We use it as a baseline to show the level of efficiency that can be gained with a given sample size. Counter matching, by contrast, is an advanced method developed to improve the precision of exposure-related estimates, and we use it to illustrate the highest level of efficiency obtainable from existing sampling methods.

3.1 Simple Random Sampling

The simple random sampling design is very common in traditional nested case-control studies. The analysis population of these studies is sampled from an existing cohort based on the subjects’ disease status.
Ideally, for each time point, all subjects in the cohort who developed the disease of interest are included in the sample as cases, while subjects without the disease are randomly sampled without replacement as controls. This kind of sampling process is called simple random sampling.

The advantage of this design is its simplicity. However, it is not efficient when the exposure is rare. Controls are selected without regard to exposure-related information, so the expected proportions of exposed and unexposed controls in the sample are the same as in the full cohort. Therefore, when the exposure is rare, simple random sampling is unlikely to capture many exposed controls.

The efficiency of the simple random sampling design can be derived from its partial likelihood. Consider a continuous time setting where there is only one case per risk set. Suppose $m - 1$ controls are to be sampled without replacement from the $n - 1$ controls in the full risk set of size $n$. Given that $i$ is the case and $r$ is the case-control sample at time $t_i$, the probability of obtaining the sample $\tilde{R} = r$ is $\binom{n-1}{m-1}^{-1}$ for all $i$. Hence, based on (2.9), the partial likelihood function for the simple random sampling design becomes

$$L(\beta) = \prod_{t_i} \frac{\lambda_0(t_i)\exp\left(\beta^T Z_i(t_i)\right)\pi_{t_i}(\tilde{R}_i \mid i)}{\sum_{j \in \tilde{R}_i} \lambda_0(t_i)\exp\left(\beta^T Z_j(t_i)\right)\pi_{t_i}(\tilde{R}_i \mid j)} \tag{3.1}$$

$$= \prod_{t_i} \frac{\exp\left(\beta^T Z_i(t_i)\right)\binom{n-1}{m-1}^{-1}}{\sum_{j \in \tilde{R}_i} \exp\left(\beta^T Z_j(t_i)\right)\binom{n-1}{m-1}^{-1}} \tag{3.2}$$

$$= \prod_{t_i} \frac{\exp\left(\beta^T Z_i(t_i)\right)}{\sum_{j \in \tilde{R}_i} \exp\left(\beta^T Z_j(t_i)\right)} \tag{3.3}$$

According to equation (3.3), when all the sampled subjects share the same exposure level, they also share exactly the same rate ratio $\varphi(t_i) = \exp\left(\beta^T Z(t_i)\right)$ and baseline hazard $\lambda_0(t_i)$. As a result, the contribution of this sample to the partial likelihood is a constant ($1/m$) and drops out of the score function. In other words, the sample captures no information from the cohort for the estimation of the exposure coefficient ($\beta$). This situation occurs more often with small samples, and thus the
This situation occurs more with small samples, and thus the 10 efficiency of the simple random sampling design with a small sample size is worse than a larger sample. To measure how much better a sampling design is compared to another, we use relative ef- ficiency, the ratio of the variance of the estimated coefficient given by two different samples, to quantify the portion of information a sample obtains from the full cohort. For simple random sampling, the asymptotic relative efficiency of a simple with size m over the full risk set given the null hypothesis of exposure = 0 is equal to (m 1)=m [9]. When the null hypothesis does not hold, the relative efficiency becomes more complicated and is dependent on additional factors including the strength of the exposure effect, the exposure rate in the cohort, and possible confounding effects. To better understand simple random sampling, Langholz and Goldstein [10] applied this method with matching rate 1:1 (1 case vs. 1 control) and 1:3 (1 case vs. 3 controls) to the analysis of the Colorado Plateau uranium miners cohort data(2.3). An excess relative risk model [14] was used in modeling the data, and the analysis showed that radon exposure had a significantly positive relationship with lung cancer. The result also suggested that the relative efficiency of simple random sampling can be lower than the (m- 1)/m efficiency rule when there is a positive relationship between the exposure and the outcome. In later chapters, we present simulation results of simple random sampling for a variety of situations. The relative efficiency level of this sampling design is used as a reference of the minimum acceptable amount we should obtain with a given sample size. 3.2 Counter Matching The idea of counter matching was introduced by Langholz and Borgan [7] in 1995. Unlike most of the matching strategies used for the purpose of adjusting for possible confounders, counter 11 matching pursues a different objective. 
Instead of limiting the variance of the factor being matched on, counter matching tries to maximize that variance so that the effect of the factor can be better estimated. The method takes advantage of the exposure-related information available in the full cohort and deliberately samples across different exposure levels. As a result, the selected case-control sample retains more of the diversity in exposure status present in the full cohort, which increases the efficiency in estimating the exposure coefficient.

In contrast to simple random sampling, the advantage of counter matching is most obvious when the exposure is rare, because it purposely samples rare exposures and does not depend on the sample size to achieve more power for analysis.

Applying counter matching is no more difficult than applying traditional matching. The first step is to define the sampling strata, which are determined by the variable being matched on. The matching variable is usually the exposure itself or a related variable. Second, a sample size is selected for each sampling stratum. The sample sizes are arbitrary, but each should be at least equal to the number of cases in that stratum of the risk set. We first include all cases in the risk set and then randomly sample controls without replacement until we reach the sample size for every sampling stratum.

Under the continuous time setting, we have only one case per risk set. We let $l = 1, \ldots, L$ index the categories of the matching variable and use these categories as the sampling strata. Denote the total number of subjects in stratum $l$ by $n_l$ and the number of cases in stratum $l$ by $D_l$. $D_l$ can only be 0 or 1, and the sum of all $D_l$ in a risk set equals 1, i.e. $\sum_{l=1}^{L} D_l = 1$.
Suppose we include $m_l$ subjects, counting both the case and the controls, in the sample from stratum $l$. Then the probability of sampling the set $\tilde{R} = r$ given that the case is subject $i$ can be written as

$$\pi(r \mid i) = \prod_{l} \binom{n_l - D_l}{m_l - D_l}^{-1} = \prod_{l} \frac{(m_l - D_l)!\,(n_l - m_l)!}{(n_l - D_l)!}$$

Based on equation (2.9), we can write the partial likelihood as

$$L(\beta) = \prod_{t_i} \frac{\lambda_0(t_i)\exp\left(\beta^T Z_i(t_i)\right)\pi_{t_i}(\tilde{R}_i \mid i)}{\sum_{j \in \tilde{R}_i} \lambda_0(t_i)\exp\left(\beta^T Z_j(t_i)\right)\pi_{t_i}(\tilde{R}_i \mid j)} \tag{3.4}$$

$$= \prod_{t_i} \frac{\exp\left(\beta^T Z_i(t_i)\right)\prod_{l} \frac{(m_l - D_l)!}{(n_l - D_l)!}}{\sum_{j \in \tilde{R}_i} \exp\left(\beta^T Z_j(t_i)\right)\prod_{l} \frac{(m_l - S_l)!}{(n_l - S_l)!}} \tag{3.5}$$

where $S_l$ is the number of cases in stratum $l$ if subject $j \in \tilde{R}_i$ is taken to be the case. The factor $(n_l - m_l)!$ does not depend on which subject is the case and therefore cancels between the numerator and the denominator.

3.2.1 Counter Matching on Surrogate Exposure

There are situations where a precise measure of the exposure is not available at the sampling stage. For example, in the Pancreatic Cancer and Occupational Exposure study [5], a cohort of chemical manufacturing workers was studied for a possible effect of chemical exposure on the risk of developing pancreatic cancer. At the time, the exact amount of chemical exposure was not readily available, and extra information was required for a refined exposure assessment. In this kind of study, it has been suggested that, before the exposure has been fully collected, we can instead counter match on a surrogate exposure [8].

A surrogate exposure is a measurement that is related to the exposure of interest. It can be a crude measure of the exposure level or some other closely related factor. We require that all the information the surrogate carries about the outcome be contained in the exposure variable, so that the surrogate is independent of the outcome once the exposure variable is modeled. For example, in the Pancreatic Cancer and Occupational Exposure study, we can use job title or work site histories, which are available from interview records, as surrogates for the chemical exposure.
Because the two factors are directly related to the amount of chemical exposure at work, they are good representatives of a crude chemical exposure level. Moreover, we do not have to include them in the model once we have obtained detailed measurements of the exposure, because neither should be related to pancreatic cancer after we control for the chemical exposure.

Langholz and Clayton [8] showed that counter matching on surrogates can still improve the efficiency in estimating the exposure association of interest. They derived the asymptotic relative efficiency (ARE) for the situation where a dichotomous variable $\tilde{Z}$ is used as a surrogate for a dichotomous exposure $Z$. When the null hypothesis of no relationship between the exposure and the outcome is true, the ARE of a 1:1 counter matched sample relative to a 1:1 simple random sample depends only on the sensitivity $1 - \alpha = \Pr(\tilde{Z} = 1 \mid Z = 1)$ and the specificity $1 - \gamma = \Pr(\tilde{Z} = 0 \mid Z = 0)$ [8]. The ARE can be calculated as

$$2\left[(1 - \alpha)(1 - \gamma) + \alpha\gamma\right] \tag{3.6}$$

By the definition of a surrogate exposure, its sensitivity and specificity in predicting the exposure should both be greater than 50%. Therefore, based on the ARE equation (3.6), even with a surrogate exposure, a counter matched sample always reaches a higher efficiency than a simple random sample under the null hypothesis. We can also see from equation (3.6) that when counter matching is performed on the exposure directly, i.e. $1 - \alpha = 100\%$ and $1 - \gamma = 100\%$, the ARE equals 2, so the efficiency of a 1:1 counter matched sample approaches that of the full cohort, meaning that we can obtain the power of a full cohort analysis with a much smaller subsample. On the other hand, the same equation implies that counter matching on covariates that are not related to the exposure would harm the power for the exposure estimation.

The ARE equation becomes complicated when the exposure coefficient is away from the null.
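To make the behaviour of the null-hypothesis ARE in (3.6) concrete, the short sketch below evaluates it for a few sensitivity and specificity pairs. It assumes that (3.6) reads as twice the sum of sensitivity × specificity and (1 − sensitivity) × (1 − specificity); the function name `are_counter_vs_srs` is hypothetical and the input pairs are illustrative values, not results from any study.

```python
def are_counter_vs_srs(sensitivity, specificity):
    """Null-hypothesis ARE of 1:1 counter matching relative to 1:1 simple
    random sampling, under the assumed reading of equation (3.6)."""
    for value in (sensitivity, specificity):
        if not 0.0 <= value <= 1.0:
            raise ValueError("sensitivity and specificity must lie in [0, 1]")
    return 2.0 * (sensitivity * specificity
                  + (1.0 - sensitivity) * (1.0 - specificity))

# Illustrative inputs: a perfect surrogate, a moderately good one, and an
# uninformative one (sensitivity = specificity = 50%).
for sens, spec in [(1.0, 1.0), (0.9, 0.8), (0.5, 0.5)]:
    print(f"sensitivity={sens}, specificity={spec}, "
          f"ARE={are_counter_vs_srs(sens, spec):.2f}")
```

Under this reading, the ARE minus 1 factors as (2 × sensitivity − 1)(2 × specificity − 1), which is positive whenever both quantities exceed 50%, in line with the statement that a valid surrogate always improves on simple random sampling under the null.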
Langholz and Clayton [8] and Langholz and Borgan [7] tabulated simulation results comparing the relative efficiency of counter matching vs. simple random sampling and of counter matching vs. full cohort analysis for different combinations of sensitivity, specificity, exposure rate, exposure coefficient, and matching rate. Each observation in the simulated cohort contained one outcome variable, one exposure variable, and one surrogate exposure variable. All variables were dichotomous and were fitted in a Cox proportional hazards model.

The simulation results showed that a higher matching rate provides more information for both simple random sampling and counter matching. Counter matching works better than simple random sampling under rare exposures, presumably because it balances the case-control sample to contain a fixed number of exposed and unexposed subjects. The superiority of counter matching over simple random sampling increases substantially when the exposure coefficient differs from zero. It was also found that, in counter matching, a surrogate exposure with greater specificity contributes more to the relative efficiency than one with greater sensitivity.

3.2.2 Counter Matching with A Covariate

Langholz and Clayton [8] also presented simulation results examining the power of counter matching in estimating the exposure effect while controlling for a covariate. Each observation in the simulated cohort contained an outcome, an exposure $Z$, and a covariate $C$. All three were dichotomous variables, and odds ratios were used to measure the pairwise relationships between the outcome, the exposure, and the covariate. Different combinations of the exposure rate $\Pr(Z = 1)$, the covariate exposure rate $\Pr(C = 1)$, and the three pairwise odds ratios were used.
The simulation results confirmed that counter matching is more efficient than simple random sampling when the exposure is rare or the relationship between exposure and outcome is strong. The efficiency in estimating the exposure effect when controlling for covariates was observed to depend on the strength of the relationship between the exposure and the covariate, or between the outcome and the covariate. Although counter matching always kept its superiority over simple random sampling, its efficiency generally decreased as the relationship between the exposure and the covariate became stronger.

However, counter matching causes a loss in the efficiency of the covariate estimation [8]. This loss becomes more severe when the exposure is rare or when the exposure effect is trivial (i.e. close to 0). One way to remedy this is to increase the matching rate. With a 1:3 matching rate, the efficiency of the covariate estimate can be brought back to a level that is only slightly lower than what a simple random sample can provide. Also, this loss of efficiency in the covariate may not be of great importance, because the covariates are still properly controlled in the statistical model, and the coefficient of a confounder, or how precisely it can be estimated, is usually not of research interest.

However, this feature of counter matching again warns us that we should not match on covariates that are not related to the exposure. Counter matching on such variables will in return provide very poor efficiency in the exposure estimation. If there is a secondary exposure, the suggestion is to include it when defining the sampling strata, so as to also optimize efficiency in estimating the second exposure variable.

Counter Matching and Interaction

No matter which variable is used in defining the sampling strata in counter matching, the efficiency in estimating an interaction that involves the matching factor is always found to be higher than that of simple random sampling.
Cologne and Langholz [2] conducted simulations to compare the efficiency in estimating interaction from counter matched samples and simple random samples in a multiplicative excess relative risk model. The simulation results show that counter matching is consistently more efficient than simple random sampling. In addition, counter matching is still found to be more efficient under rare exposure and positive interaction.

3.2.3 Counter Matching on Continuous Exposure

Given the design of counter matching, the controls should be sampled according to sampling strata defined by categories of the exposure or the surrogate exposure. However, when the exposure is continuous, there are no categories to define the sampling strata. How the sampling strata should be defined, and how much efficiency counter matching can retain with these sampling strata, becomes a question.

Langholz and Goldstein [10] suggested two possible ways of categorizing the continuous exposure variable. The first suggestion is to categorize the exposure by the overall exposure distribution. The distribution is drawn from everyone in the cohort, and the sampling strata are defined by the median or quartiles according to the number of controls needed. The other suggestion draws the exposure distribution from cases only, and defines the sampling strata based on the median or quartiles of this case-based exposure distribution.

With the first suggestion, the sample will have equal numbers of subjects in each sampling stratum. The second suggestion balances the number of cases in each stratum. Both suggestions use an empirical distribution, obtained by pooling all risk sets or all cases in the cohort, to approximate the true underlying exposure distribution. This assumes that the true distribution does not change over time, or at least remains unchanged during the study period.

The two categorization methods were tested on the Colorado Plateau Uranium Miners Cohort (Section 2.3).
The cumulative radon exposure was used as the exposure of interest, and the sampling strata were formed based on the empirical case distribution and the empirical full cohort distribution in each age group. The analysis results show that the variation of the exposure estimates from counter matching based on the case exposure distribution is much smaller than that from counter matching based on the full cohort distribution. The results also show that, in a univariate model with only the radon exposure, the efficiency of a 1:1 (1 case matched with 1 control) case-based counter matched sample reached almost the same level as the full cohort. In the comparison simple random sample, a 1:1 sampling rate provided only about half of the efficiency of the full cohort analysis. Based on these comparisons, Langholz and Goldstein [10] recommended setting up the counter matching sampling strata so that there is approximately an equal number of cases in each stratum.

Chapter 4 Case-Based Probability Matching

The counter matching method requires sampling strata, which are usually defined based on the categories of the exposure variable. Therefore, when the exposure is continuous, we have to categorize the variable. In other words, we have to find a categorical surrogate exposure for counter matching. Section 3.2.1 also suggested that efficiency gains are greatest when the surrogate variable is as close as possible to the exposure itself. Although this conclusion is drawn from studies with dichotomous exposures, the same idea should apply to a continuous one.

In the following chapters, we introduce two novel sampling methods that were developed with the inspiration of counter matching. The two methods are designed for both categorical and continuous exposures, and we mainly discuss their application to continuous exposures.

4.1 Design

During categorization, the more levels that are kept, the less information is lost. If we push the number of levels to infinity, then all information from the original variable remains.
When the length of each category is equal, having an infinite number of categories and sampling subjects from the categories is the same as assigning each exposure level a probability and sampling subjects according to their exposure levels. Starting from this idea, we present a new matching method which replaces the sampling strata with sampling probabilities, which makes it possible to sample not only by categorical exposures but also by continuous ones.

To better present the link between this new method and the counter matching method, we start with the sampling probability for counter matching studies with categorical exposures. Let $S_l = S_1, \ldots, S_L$ denote the set of subjects whose exposure measurements, $Z$, fall in the $l$th sampling stratum, and let subject $i$ be the case in the risk set $R$, while $D$ is the set of all cases in the cohort. With the counter matching design, the probability that a subject with exposure value $z_j$ in stratum $S_l$ is sampled can be written as

$$\mathrm{Prob}(Z_{\mathrm{control}} \in S_l) = \begin{cases} 0 & z \in \{S_l : z_i \in S_l\} \\ C \sum_{Z \in S_l} f(z = Z \mid R) & z \in \{S_l : z_i \notin S_l\} \end{cases}$$

where $C = 1 / \left(\sum_{Z} f(z = Z \mid R)\right)$.

As described in Section 3.2.3, there are two ways to define the sampling strata. The first is to define them by the exposure distribution drawn from the full cohort. For example, when $L = 2$, the subjects are grouped into two sets, $S = \{S_1 = [\min(Z), \operatorname{median}(Z \mid R)),\ S_2 = [\operatorname{median}(Z \mid R), \max(Z)]\}$. When the number of sampling strata $L$ is pushed to $\infty$ to accommodate a continuous exposure, the above probability becomes

$$\mathrm{Prob}(Z_{\mathrm{control}} \in S_l) = \mathrm{Prob}(Z_{\mathrm{control}} \in (Z, Z + \Delta Z)) = \begin{cases} 0 & z = Z_i \\ C \int_{Z}^{Z + \Delta Z} f(z \mid R)\,dz & z \neq Z_i \end{cases}$$

This is exactly the same as for a simple random sampling design, where the probability of being picked as a control is the same for all subjects in the risk set.

The other suggestion, which defines the sampling strata by the exposure distribution in cases only, was shown to be more efficient [10] with counter matching.
With this suggestion, two-level sampling strata ($L = 2$) group the subjects into sets defined as $S = \{S_1 = [\min(Z), \operatorname{median}(Z \mid D)),\ S_2 = [\operatorname{median}(Z \mid D), \max(Z)]\}$. When $L$ is set to $\infty$, the probability that a subject with exposure value $z_j$ in stratum $S_l$ is sampled becomes

$$\mathrm{Prob}(Z_{\mathrm{control}} \in S_l) = \mathrm{Prob}(Z_{\mathrm{control}} \in (Z, Z + \Delta Z)) = \begin{cases} 0 & z = Z_i \\ C \int_{Z}^{Z + \Delta Z} f(z \mid D)\,dz & z \neq Z_i \end{cases}$$

where the distribution of the exposure in cases, $f(z \mid D)$, can be derived as

$$f(z \mid D) = \frac{\varphi(z) f_z(z)}{\int_z \varphi(z) f_z(z)}. \qquad (4.1)$$

This indicates that the sampling probability depends on both the exposure level of the subject and the empirical exposure-outcome relationship acquired from the cases.

Given a 1:1 matching rate, where the matched subjects $i$ and $j$ form the sample $r$, we can calculate the sampling weight for the partial likelihood function as

$$w_i = \frac{\pi(r \mid i)}{\pi(r)} = \frac{\dfrac{\varphi_j}{\sum_{k \in R} \varphi_k - \varphi_i}}{\dfrac{\varphi_i}{\sum_{k \in R} \varphi_k} \cdot \dfrac{\varphi_j}{\sum_{k \in R} \varphi_k - \varphi_i} + \dfrac{\varphi_j}{\sum_{k \in R} \varphi_k} \cdot \dfrac{\varphi_i}{\sum_{k \in R} \varphi_k - \varphi_j}} = \frac{\left(\sum_{k \in R} \varphi_k - \varphi_j\right)\left(\sum_{k \in R} \varphi_k\right)}{\varphi_i \left(2 \sum_{k \in R} \varphi_k - \varphi_i - \varphi_j\right)}$$

This equation can be simplified under the assumption that the risk set $R$ is large enough that, for any subject $i \in R$,

$$\varphi_i \ll \sum_{k \in R} \varphi_k \qquad (4.2)$$

Under this assumption, the weight expression simplifies and becomes proportional to the inverse of the probability of subject $i$ being a case:
$$w_i \approx \frac{\sum_{k \in R} \varphi_k}{2 \varphi_i} \propto \frac{1}{\varphi_i}$$

We then write out the expected information as

$$E(I) = \sum_{\tilde{R} \subset R} \left[ \sum_{i \in \tilde{R}} Z_i^2 \frac{\varphi_i w_i}{\sum_{k \in \tilde{R}} \varphi_k w_k} - \left( \sum_{i \in \tilde{R}} Z_i \frac{\varphi_i w_i}{\sum_{k \in \tilde{R}} \varphi_k w_k} \right)^2 \right] \left( \frac{\varphi_i \varphi_j}{\left(\sum_{k \in R} \varphi_k\right)\left(\sum_{k \in R} \varphi_k - \varphi_i\right)} + \frac{\varphi_i \varphi_j}{\left(\sum_{k \in R} \varphi_k\right)\left(\sum_{k \in R} \varphi_k - \varphi_j\right)} \right)$$

$$\approx \sum_{\tilde{R} \subset R} \left[ \sum_{i \in \tilde{R}} \frac{1}{2} Z_i^2 - \left( \sum_{i \in \tilde{R}} \frac{1}{2} Z_i \right)^2 \right] \left( \frac{2 \varphi_i \varphi_j}{\left(\sum_{k \in R} \varphi_k\right)^2} \right) = \frac{1}{2 \left(\sum_{k \in R} \varphi_k\right)^2} \sum_{\{i,j\} \subset R} \left[ (Z_i - Z_j)^2 \varphi_i \varphi_j \right]$$

This is half the information we have from the full cohort, since, given the full cohort partial likelihood (equation 2.9 in Section 2.2), the expected information for a full risk set analysis can be derived as

$$E(I) = \sum_{i \in R} Z_i^2 \frac{\varphi_i}{\sum_k \varphi_k} - \left[ \sum_{i \in R} Z_i \frac{\varphi_i}{\sum_k \varphi_k} \right]^2 = \frac{1}{\left(\sum_k \varphi_k\right)^2} \sum_{\{i,j\} \subset R} \left[ (Z_i - Z_j)^2 \varphi_i \varphi_j \right]$$

Therefore, as long as assumption (4.2) holds, the efficiency of this case-exposure distribution based probability matching design is stable and equal to half that of a full cohort analysis.

4.2 Analysis Results Based on Simulated Datasets

The behavior of this probability matching method is investigated on the simulated data sets described in Section 2.4. The analyses are done using the Cox proportional hazards model and, for simplicity, we included only the exposure variable in the model.

4.2.1 Relative Efficiencies

First of all, we compared the efficiency of probability matching to that of simple random sampling, to see if our new sampling method is at least as good as the basic sampling design. Comparisons were made with three different exposure distributions: uniform, normal, and chi-square. Under each distribution, we further set the log of the hazard ratio, $\beta$, to range from 0 to 3 in steps of 0.1, representing positive exposure-outcome relationships, and from 0 to -3 for negative relationships.

Figure 4.1 shows the results of the comparisons. The figure includes six plots for the three different exposure distributions, each separated by the two directions of the exposure effect, positive or negative.
The Y axis represents the relative efficiency of probability matching as compared to simple random sampling. The X axis represents the log of the hazard ratio.

Figure 4.1: Relative efficiencies of the case-based probability matching as compared to simple random sampling.

From plots A and B in Figure 4.1, where the exposure follows a uniform or normal distribution, we see that the efficiency of probability matching is consistently better than that of simple random sampling. This superiority increases with increasing exposure effect, whether the effect is protective or harmful. The advantage is multiplied in plot C2, where high-level exposure is rare. These conclusions are similar to Langholz and Clayton's findings in their comparison of counter matching vs. simple random sampling for dichotomous exposures [8]. However, this pattern does not hold in situations where a protective exposure is rare or, similarly, a risky exposure is common (plot C1). In these situations, we found that probability matching provides even less information than simple random sampling.

To better understand where the advantage and disadvantage come from, the relative efficiencies of the two sampling methods as compared to the full cohort analysis were calculated. Figure 4.2 plots the relative efficiencies of both sampling methods against different exposure effects. The results from both sampling methods, case-exposure distribution based probability matching (solid dots) and simple random sampling (unfilled squares), are overlaid on the plots. The Y axis of each plot represents the relative efficiency of the two sampling methods as compared to the full cohort analysis. We can see that the relative efficiency of probability matching is almost constant and changes neither with the exposure distribution nor with the strength of the exposure effect.
On the plots, the relative efficiencies against the full cohort are stable at 0.5, which means the efficiency is always half that of a full cohort analysis.

The results of simple random sampling differ by exposure distribution. In plots A1-2 and B1-2, where the exposure variable follows a uniform or normal distribution, the relative efficiency of simple random sampling is always below the 0.5 level, except when there is no relationship at all between the exposure and the outcome (i.e. $\beta = 0$). When there is a relationship, whether positive or negative, the relative efficiency decreases as the strength of the relationship increases. However, this pattern changes when the exposure follows the $\chi^2[2]$ distribution (plots C1 and C2). In plot C1, where the exposure is preventive, simple random sampling exceeds the efficiency of probability matching. In plot C2, where the exposure has a risk effect, the relative efficiency of the simple random sample quickly drops to less than 20% of the full cohort efficiency when the exposure coefficient is about 3.

Figure 4.2: Relative efficiencies of case-based probability matching and simple random sampling as compared to full cohort analysis.

These interesting results indicate that, by weighting each subject based on the case-exposure distribution in the sampling process, it is possible to remove the impact of the exposure effect from the efficiency. In other words, weighting makes probability matching behave like a simple random sample under the alternative hypothesis.

4.2.2 Bias

The weighted partial likelihood from Section 2.2 is asymptotically unbiased [13]. However, it is usually not possible to start with such a huge cohort. To better understand how the sampling method works with smaller sample sizes, we utilized two sets of simulated data that resemble one relatively large and one relatively small study.
The first set includes simulated trials with 1000 risk sets of sample size 1000, and the second, more realistic, set has trials with 100 risk sets of sample size 100.

Figure 4.3 shows the bias observed in simulations with the larger sample size. The bias is calculated as the difference between the true simulation parameter and the estimated exposure coefficient (Bias $= \beta_{\mathrm{true}} - \beta_{\mathrm{estimated}}$). Each dot on the plots represents the average bias observed at the corresponding $\beta$ level. As expected, no trace of bias is observed with this large sample size. The biases on all six plots are very close to zero and are scattered evenly above and below the zero line.

Figure 4.3: Amount of bias observed from case-exposure distribution based probability matching and simple random sampling given large sample size.

Further tests were carried out to check whether the biases are statistically significant. P-values are calculated from a $\chi^2$ test with variance calculated from all simulated trials. Figure 4.4 plots the p-values of the bias tests. Again, no sign of substantial bias is observed. On all six plots, the p-values from both probability matching and simple random sampling are scattered randomly in the range of 0 to 1 for all hazard ratios simulated.

However, bias starts to show up with the smaller sample size. Figure 4.5 presents the average bias observed in the smaller cohort, which has 100 risk sets of sample size 100. Compared to Figure 4.3, the bias from simple random sampling is much more obvious. The pattern in the bias indicates that results from simple random sampling tend to underestimate the exposure effect. This is more obvious when the exposure is rare (Figure 4.5, plots C1 and C2).

At the same time, the bias plots for probability matching do not show differences between sample sizes. Slight deviations are observed in plot C2, indicating underestimation of strong exposure effects where high exposure levels are rare.
These biases were also tested using the $\chi^2$ test, and the p-values are plotted against the exposure effects in Figure 4.6. The p-value plots are consistent with what we have seen in the bias plots. They confirm that the simple random sampling design tends to cause statistically significant bias when the exposure effect is large. This is less obvious when the exposure is normally distributed (plot B), but more pronounced when the exposure follows a uniform or chi-square distribution (plots A and C).

Figure 4.4: P-values of bias tests from case-exposure distribution based probability matching and simple random sampling.

Like the bias plots, the p-value plots also show that estimates from probability matching are generally unbiased, except for the slight deviations observed in plot C2. In addition to what is found in the bias plots, the p-value plots show signs in both plots A1 and A2 that the bias is more likely to be significant when the exposure effect gets stronger. Although this is only observed under a uniform distribution, it is reasonable to believe that probability matching tends to give biased estimators when the exposure effect increases.

Figure 4.5: Amount of bias observed from case-exposure distribution based probability matching and simple random sampling given small sample size.

Combining the findings from Figure 4.5 and Figure 4.6, it can be seen that general bias problems exist when the sample size is small or the exposure effect is relatively strong. This is more severe when the exposure distribution is skewed, as simulated using the $\chi^2[2]$ distribution. Both simple random sampling and probability matching are impacted by these conditions.
However, probability matching appears to be less sensitive to them, and can provide unbiased estimators with relatively higher efficiencies over a wider range of conditions.

Figure 4.6: P-values of bias tests from case-exposure distribution based probability matching and simple random sampling, with reduced sample size.

This advantage of probability matching can be attributed to its design, which utilizes the case-exposure distribution based weighting system, since this is its only difference from simple random sampling. As discussed in Section 4.1, the weight calculation of probability matching is based on assumption (4.2), $\varphi_i \ll \sum_{k \in R} \varphi_k$. Because $\varphi$ is an exponential function, once $\beta$ increases beyond a certain point the hazards of people with high exposure levels become so large that the assumption no longer holds. This is not shown in the plots under the uniform and normal distributions, because with the shape of these two distributions an unreasonably large hazard ratio is needed to see the violation. However, with the skewed $\chi^2[2]$ distribution and a small sample size, it is easier to reach the point where the assumption is violated. This explains why bias is observed at a log hazard ratio of around 2.2 or more in Figure 4.5, plot C2.

Despite the bias observed in the above two figures, the two sampling methods still provide good estimators. Figure 4.7 shows the percentage bias, which compares the amount of bias to the true parameter. From the plots in this figure, we can see that the biases observed in the previous two figures are all within 10% of the true parameter. Compared to the size of the parameter, the deviation in the estimate is small and would not affect the conclusion of the analysis.

Figure 4.7: Percentage bias in exposure estimation from probability matching and simple random sampling.
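The role of assumption (4.2) can be checked numerically for a single risk set. The sketch below is an illustration under stated assumptions, not the simulation code behind the figures: it takes the true $\beta$ as known, draws the control with probability proportional to $\varphi$ among the non-cases as in Section 4.1, and enumerates all pairs $\{i, j\}$ to compare the expected information of 1:1 probability matching with that of the full risk set. The function name `pm_diagnostics`, the risk set size, and the chosen $\beta$ values are hypothetical.

```python
import numpy as np


def pm_diagnostics(z, beta):
    """Return (largest hazard share, PM / full risk-set information ratio)."""
    z = np.asarray(z, dtype=float)
    phi = np.exp(beta * z)                 # relative hazards phi_k = exp(beta * z_k)
    total = phi.sum()
    i, j = np.triu_indices(len(z), k=1)    # all unordered pairs {i, j}
    d2 = (z[i] - z[j]) ** 2

    # Full risk-set information: sum of phi_i phi_j (Z_i - Z_j)^2 / (sum phi)^2
    full = np.sum(phi[i] * phi[j] * d2) / total**2

    # Probability that {i, j} is the sampled case-control pair.
    pi_r = phi[i] * phi[j] / total * (1.0 / (total - phi[i]) + 1.0 / (total - phi[j]))
    # Weighted conditional probability that i, rather than j, is the case.
    p_i = (total - phi[j]) / (2.0 * total - phi[i] - phi[j])
    pm = np.sum(pi_r * p_i * (1.0 - p_i) * d2)

    # phi.max() / total measures how badly assumption (4.2) can fail.
    return phi.max() / total, pm / full


rng = np.random.default_rng(0)
n = 200  # hypothetical risk set size
share_u, ratio_u = pm_diagnostics(rng.uniform(0.0, 1.0, n), beta=1.0)
share_c, ratio_c = pm_diagnostics(rng.chisquare(2, n), beta=2.5)

if __name__ == "__main__":
    print("uniform:    max share", share_u, "information ratio", ratio_u)
    print("chi-square: max share", share_c, "information ratio", ratio_c)
```

When the largest hazard share is small, the information ratio should stay close to one half. When a few highly exposed subjects dominate the sum of hazards, as can happen with a skewed exposure and a large $\beta$, the ratio may drift away from one half.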
4.3 Projection of Case Exposure Distribution

Although this probability matching method appears to have a better and more stable performance than simple random sampling, its application is not as easy. The hardest part is that the method requires a sampling probability calculated for all subjects based on the case exposure distribution. However, this distribution is determined by the exposure coefficient, $\beta$, which in practice is not available at the sampling stage and is the main interest of the study.

Therefore, in order to apply this method, we have to project the $\beta$ of the exposure to calculate the sampling probability. The projection can come from previous literature or studies, from the understanding of all available exposures or surrogate exposures, or simply from the null hypothesis. It is intuitive that better projections should provide better results, but it is also important to know how much harm a bad projection will cause.

Denote the projected exposure coefficient used for sampling as $\beta_s$, and the true exposure coefficient as $\beta$. The difference between $\beta_s$ and $\beta$ is measured by the ratio $\beta_s / \beta$. When this ratio is negative, our projected exposure-outcome relationship is in the wrong direction; when the ratio is 0, the sampling process is exactly the same as simple random sampling.

Figure 4.8: Relative efficiencies of case-exposure distribution based probability matching compared to simple random sampling.

Figure 4.8 shows the relative efficiencies for different values of the ratio of $\beta_s$ to $\beta$, with $\beta$ set to 1 and 2. Among the six plots, the three on the left-hand side have a true $\beta$ of 1, and the three on the right a true $\beta$ of 2. The Y axis shows the relative efficiency. The X axis represents the ratio of the projected to the true coefficient ($\beta_s / \beta$).
The ratio is set to range from -1 to 3, so the projected $\beta_s$ ranges from -1 to 3 when the true $\beta$ is 1, and from -2 to 6 when the true $\beta$ is 2.

Whatever the value of $\beta$, when $\beta_s$ is set to 0 ($\beta_s / \beta = 0$), the relative efficiencies of probability matching and simple random sampling should always be the same. Also, when the ratio is equal to 1, the relative efficiency should always be $1/2$.

From Figure 4.8, we can see that the relative efficiency of probability matching is always below that of simple random sampling when the projected $\beta_s$ is in the wrong direction (ratio $< 0$). When the ratio is positive, the relative efficiency rises above that of simple random sampling, increases to the level of 0.5 or more, and then declines. The point at which the relative efficiency reaches its maximum seems to depend on both the exposure distribution and the level of the exposure effect. A $\beta_s / \beta$ ratio between 0 and 1 seems to be a safe choice that provides better relative efficiencies under all exposure distributions, while larger ratios may work better when the exposure distribution is skewed, like the $\chi^2[2]$ distribution.

However, it is hard to determine how large the projected $\beta_s$ can be without harming efficiency. Under the uniform distribution, the efficiency curve of probability matching drops beneath the line of simple random sampling at a point where the ratio is less than two, while with the $\chi^2[2]$ distribution the curve is still above the line of simple random sampling when the projected $\beta_s$ reaches three times the value of the true $\beta$.

Although the details differ by exposure distribution, probability matching generally works better than simple random sampling as long as the projected $\beta_s$ is between 0 and the true exposure coefficient, or is not too far from it. However, the simulation results also suggest that making a projection without guidance is unwise.
When we have no information about the direction of the exposure coefficient or its scale, it is better to stay on the safe side and apply simple random sampling instead.

Figure 4.9 displays the p-value plots of the bias tests from cohorts generated with small sample sizes. From the figure, we can see that significant bias mostly occurs when the ratio is negative or when the projected $\beta_s$ is far away from the true $\beta$. In addition, the p-value plots suggest that the bias issue is more severe when the true $\beta$ is set to 2. Since it has already been observed that large exposure effects can lead to more significant bias (Section 4.2.2), it is reasonable to predict that, as the true $\beta$ increases, the range of $\beta_s$ that does not increase the bias will get smaller and smaller.

Figure 4.9: P-values of the bias tests from probability matching compared to simple random sampling.

4.4 Discussion

Our case-exposure distribution based probability matching method provides samples that are more informative and powerful than simple random samples in estimating the exposure effect. The method keeps its efficiency stable at half that of a full cohort analysis with a sample size of only 2 observations per risk set. This makes it especially valuable when the exposure of interest is rare or the exposure effect is large.

However, this sampling method has some limitations. First of all, although it works well in extreme conditions, the method does not show much advantage when exposure effects are minor or when higher levels of exposure are not rare. Also, the great efficiency provided by probability matching when the exposure effect is strong is not really needed. These points limit the advantages of this sampling method to situations where high exposure levels are rare.

We also need to consider the possible efficiency loss caused by an imperfect projection of the parameter of interest. Bias is more likely to occur with projected parameters when the exposure distribution is skewed.
Considering the extra steps needed to perform this probability sampling method, and the small benefit we get in return, we recommend simple random sampling for its simplicity.

Chapter 5 Outside Caliper Matching

In the previous chapter, we proposed the case-exposure distribution based probability matching design, which focuses on balancing the exposure distributions of the cases and the sampled controls. We found it helpful in keeping the efficiency level constant regardless of the exposure effect. However, it does not help much in improving the efficiency level when the exposure effect is small. In this chapter, we introduce a new sampling strategy, which we name outside caliper matching, that utilizes the distance between the sampled controls and the cases.

The design of outside caliper matching combines features of both counter matching and caliper matching. From caliper matching it inherits the idea of using the caliper as a dynamic, case-specific cutoff to determine whether a control is similar to or different from the case being matched. From counter matching it adopts the idea of sampling controls that differ from the case, and so samples controls from outside of the caliper.

Since possible confounding effects can be diminished with caliper matching by carefully defining the caliper size and making sure the controls within the caliper can be considered similar to the case, it is reasonable to expect that outside caliper matching can make use of the caliper size to select controls that are different enough from the cases to retain more information on the exposure effect.

5.1 Design

The most important aspect of outside caliper matching is the determination of its caliper size. There are different ways of measuring the size. In this dissertation, we measure caliper size in units of rank order. We define the caliper size as the minimum difference allowed between the cases and their matched controls, and denote it as $K$. By definition, $K$ must be a nonnegative integer.
To apply the caliper, we rank all subjects in the risk set $R$ according to their exposure levels. Given that the risk set has a total of $N$ subjects and the case's exposure rank order is $i$, subjects whose rank orders fall in the range $[\max(0, i-K), \min(N+1, i+K)]$ are covered by the caliper around the case. We refer to this $\min(N, 2K+1)$ wide range as the covered area, and to the calculated $Q$ ($Q = \min(N, 2K+1)/N$) as the percentage of area covered. For simplicity, we consider only situations where the same percentage of area covered is used throughout the sampling process, i.e. we keep $Q$ constant.

The control is randomly sampled from outside of the covered area, which corresponds to the rank ranges $(0, \max(0, i-K))$ and $(\min(N+1, i+K), N+1)$. Note that this can be an empty set if the caliper is too wide. The largest caliper size we can have without leaving any case unmatched is $(N-3)/2$ for an odd $N$, and $(N-2)/2$ for an even $N$. On the other hand, we get a special case when the caliper size is at its smallest. When $K = 0$, outside caliper matching becomes exactly the same as the simple random sampling design.

Since this design forces some controls to be dropped before sampling, the probabilities of being sampled change with the caliper size. For example, when $K = 0$, every control is equally likely to be sampled, but when $K > 0$, controls in the covered area have no chance of being sampled. This inequality caused by the sampling process must be controlled for when we analyze the data. Based on equation (2.9), we can correct for this inequality by adding a weight to the likelihood function. This weight depends on the caliper size and the rank order of the individual.
The weight for a subject with rank $i$, $i = 1, \ldots, N$, can be calculated as

$$w_i = \begin{cases} (N - 2K - 1)^{-1} & i \in [K+1, N-K] \\ (i - K - 1)^{-1} & i \in [N-K+1, N] \\ (N - i - K)^{-1} & i \in [1, K] \end{cases}$$

Given this weight function and the partial likelihood function (2.9), we can derive the expected information of a 1:1 outside caliper matching design, with the sampled case and control having rank orders $i$ and $j$, as

$$E[v] = \sum_{\{i,j\} \subset R} \left[ \frac{Z_i^2 \varphi_i w_i + Z_j^2 \varphi_j w_j}{\varphi_i w_i + \varphi_j w_j} - \frac{(Z_i \varphi_i w_i + Z_j \varphi_j w_j)^2}{(\varphi_i w_i + \varphi_j w_j)^2} \right] \left( \frac{w_i + w_j}{N} \right) I(|i - j| > K) \qquad (5.1)$$

$$= \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2 (w_i + w_j) \varphi_i \varphi_j w_i w_j}{N (\varphi_i w_i + \varphi_j w_j)^2} I(|i - j| > K) \qquad (5.2)$$

From the derived information equation (5.2), we find that, given the smallest caliper size $K = 0$, the weight becomes the same for all subjects and is equal to $\frac{1}{N-1}$. The corresponding expected information is then equal to

$$E[v] = \sum_{\{i,j\} \subset R} \frac{2 (Z_i - Z_j)^2 \varphi_i \varphi_j}{N (N-1) (\varphi_i + \varphi_j)^2} \qquad (5.3)$$

This is the same as for a simple random sampling design.

On the other hand, when we use the largest possible value of $K$ that does not drop any risk set, then under the null hypothesis $\beta = 0$ the expected information for outside caliper matching will always be higher than that of simple random sampling (see the appendix for a detailed proof).

Intuitively, we believe a wider caliper will provide more diversified exposure values and a better efficiency. However, this is complicated to prove when the exposure effect is not null. In the next section, we check the relative efficiency of outside caliper matching against the full cohort with simulated data at different exposure effect levels, with simple random sampling as a comparison.

5.2 Outside Caliper Matching on Exposure

In this section, we continue to use the simulated data sets described in Section 2.4, and continue to use the Cox proportional hazards model for data analyses.
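The sampling rule and weights of Section 5.1 can be sketched in a few lines of code. This is a minimal illustration rather than the program used for the simulations: the function names are hypothetical, ties in exposure are broken arbitrarily by sort order, and a control is treated as eligible when its rank differs from the case's rank by more than $K$, which matches the indicator $I(|i - j| > K)$ in equation (5.1).

```python
import numpy as np


def exposure_ranks(exposure):
    """Rank orders 1..N of the exposure values (ties broken by position)."""
    order = np.argsort(np.asarray(exposure), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks


def max_caliper(n):
    """Largest K leaving every case an eligible control:
    (N - 3) / 2 for odd N and (N - 2) / 2 for even N."""
    return (n - 2) // 2


def eligible_count(rank, n, k):
    """Number of subjects whose rank differs from `rank` by more than k."""
    below = max(rank - k - 1, 0)   # ranks 1 .. rank - k - 1
    above = max(n - rank - k, 0)   # ranks rank + k + 1 .. n
    return below + above


def caliper_weight(rank, n, k):
    """Weight w_i: inverse of the number of eligible controls for rank i."""
    m = eligible_count(rank, n, k)
    if m == 0:
        raise ValueError("caliper too wide: no eligible control for this case")
    return 1.0 / m


def sample_control(exposure, case_idx, k, rng):
    """Draw one control uniformly from outside the caliper around the case."""
    ranks = exposure_ranks(exposure)
    eligible = np.flatnonzero(np.abs(ranks - ranks[case_idx]) > k)
    if eligible.size == 0:
        return None
    return int(rng.choice(eligible))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = rng.normal(size=50)              # hypothetical exposures in one risk set
    k = max_caliper(len(z))
    j = sample_control(z, case_idx=10, k=k, rng=rng)
    print("K =", k, "sampled control index:", j)
```

With $K = 0$ every non-case is eligible and each weight equals $1/(N-1)$, which reproduces simple random sampling.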
5.2.1 Effect of the Caliper Size

To evaluate the effect of the caliper size, we keep the exposure effect coefficient ($\beta$) used to generate the data sets constant, and set the percentage of area covered ($Q$) to range from 0% to 99% in steps of 2%. Then, for each step change in $Q$, we prepare 100 simulation trials, each with 1000 risk sets of size 1000. Both outside caliper matching and simple random sampling are applied to the same simulated data sets with a 1:1 matching rate. Situations where $Q$ is equal to or greater than 100% are not checked, because some cases might then be dropped for lack of an eligible match.

Figure 5.1 shows the relative efficiencies at each value of $Q$ when there is no exposure effect. The line with black dots represents results for outside caliper matching, and the unfilled squares represent simple random sampling.

Figure 5.1: Relative efficiencies of outside caliper matching compared to simple random sampling at different values of $Q$ given $\beta = 0$.

All six plots in the figure show a consistent pattern: the relative efficiency increases with increasing caliper size. The efficiency of outside caliper matching is the same as that of the simple random sampling design when we set $Q$ at 0%, which agrees with the sampling design and the information equation (5.3). After that, the efficiency gradually goes up until $Q$ is around 100%. The highest efficiency, achieved at $Q = 100\%$, varies by exposure distribution and exposure effect level. The efficiency of outside caliper matching approaches that of a full cohort analysis in plot A1. The efficiency of outside caliper matching is always approximately 2 times that of simple random sampling at $Q = 100\%$.

5.2.2 Comparison with Simple Random Sampling

Based on the analysis results from the simulations, we fix the caliper size at its maximum value ($Q = 99\%$) to ensure its best performance, and further compare the two methods at different exposure effect levels.
Figure 5.2 presents the efficiency of outside caliper matching and simple random sampling relative to a full cohort analysis. The Y-axis represents the relative efficiency estimated from Cox's proportional hazards model that includes only the exposure, while the X-axis represents the exposure coefficient. The efficiency of outside caliper matching rises and falls in a pattern similar to that of the simple random sampling design. However, the efficiency of outside caliper matching is always approximately twice that of simple random sampling, regardless of the exposure effect. Given the same sample size, choosing outside caliper matching can double the efficiency.

Figure 5.2: Relative efficiencies of outside caliper matching with maximum caliper size compared to simple random sampling

5.2.3 Bias

In addition to its high efficiency, outside caliper matching is no more likely than simple random sampling to cause bias problems. We observe no sign of bias with the simulated data sets of relatively large sample sizes. When the smaller simulated data sets, with trials of 100 risk sets of size 100, are used, issues similar to those seen in Chapter 4, Figure 4.6 are observed for both outside caliper matching and simple random sampling. The p-values from the bias tests are plotted in Figure 5.3, where the Y-axis represents the p-values from the bias tests at different values of the exposure coefficient (X-axis), which ranges from -3 to 3. The p-values from the outside caliper matching and simple random sampling designs share the same pattern. Both sampling designs tend to provide biased estimates when the exposure effect is large, and the bias gets worse when fewer subjects in the cohort have extreme exposures. As discussed in Chapter 4, this is most likely because, by the nature of random sampling, it is very hard to obtain a sample containing rare extreme values when the sample size is small.
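The 1:1 sampling step compared in Section 5.2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code behind the figures: it assumes a control is eligible when its exposure rank differs from the case's rank by more than K, following the indicator $I(|i-j| > K)$ used in the Appendix, and the function names (exposure_ranks, eligible_controls, sample_control) are hypothetical.

```python
import random


def exposure_ranks(exposures):
    """Map each subject index to its 1-based exposure rank (ties broken by index)."""
    order = sorted(range(len(exposures)), key=lambda s: (exposures[s], s))
    return {subject: rank for rank, subject in enumerate(order, start=1)}


def eligible_controls(ranks, case, K):
    """Subjects whose rank is more than K away from the case's rank (assumed rule)."""
    case_rank = ranks[case]
    return [s for s, r in ranks.items() if s != case and abs(r - case_rank) > K]


def sample_control(ranks, case, K, rng):
    """Draw one control at random from outside the caliper.

    Returns (control, pool_size); control is None when nobody is eligible,
    in which case the risk set could not be used.
    """
    pool = eligible_controls(ranks, case, K)
    if not pool:
        return None, 0
    return rng.choice(pool), len(pool)


if __name__ == "__main__":
    exposures = [0.3, 2.5, 1.1, 0.7, 4.0, 3.2, 0.1, 1.9, 2.8, 5.5]
    ranks = exposure_ranks(exposures)
    rng = random.Random(2018)
    # K = 0 makes every other subject eligible, i.e. simple random sampling.
    print(sample_control(ranks, case=6, K=0, rng=rng))
    # K = N/2 - 1 is the largest caliper considered in the Appendix (N even).
    print(sample_control(ranks, case=6, K=len(exposures) // 2 - 1, rng=rng))
```

The returned pool size is the quantity a sampling weight would be built from. In the largest-caliper case treated in the Appendix, it equals N - i - K for a case of rank i in the lower half of the risk set; for other caliper sizes the exact weight should be taken from equation 5.1 rather than from this sketch.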
Figure 5.3: P-values of bias test for outside caliper matching with maximum caliper size compared to simple random sampling

5.3 Estimation with Covariates

We would like to see whether outside caliper matching also works well in the presence of covariates. A covariate is known to have a negative impact on the efficiency of the exposure estimation [8]. To visualize the impact, we examined situations where a covariate is also included in the model, and the covariate is weakly (correlation = 0.1), moderately (correlation = 0.5), or strongly (correlation = 0.9) correlated with the exposure. The log hazard ratio of the covariate is fixed at 1 or 3, and the log hazard ratio of the exposure ranges from -3 to 3 in steps of 0.1.

Figure 5.4: Relative efficiencies of the exposure estimation with covariate controlled in the model

Figure 5.4 shows plots of the relative efficiency in estimating the exposure effect. The three plots on the left-hand side have a covariate coefficient equal to 1, and the plots on the right-hand side have a covariate coefficient equal to 3. The top two plots are for situations where the covariate is weakly correlated with the exposure, the middle two for a moderate correlation, and the bottom two for a strong correlation.

The efficiency in estimating the exposure is barely affected by the strength of the covariate effect, no matter how strongly the covariate and the exposure are correlated. However, the correlation between the exposure and the covariate itself has a large impact on the efficiency. The stronger the correlation, the less efficient outside caliper matching is. In the extreme case where the covariate is strongly correlated with the exposure, the relative efficiency of the exposure estimation drops so low that it falls below the level of simple random sampling. We do not expect this to happen, since it is more of a collinearity issue.
Figure 5.5: Relative efficiencies of the covariate estimation with exposure controlled in the model

The efficiency in estimating the covariate effect is also checked. Figure 5.5 shows the relative efficiency of the covariate estimation when the exposure effect is controlled in the model. We see that the efficiency in estimating the covariate from outside caliper matching is lower than that from simple random sampling in all situations. Unlike the efficiency in estimating the exposure effect, the relative efficiency of the covariate estimation appears consistent across correlation levels between the exposure and the covariate, and is not affected by different covariate effects.

A similar situation was observed with counter matching [8]. This decrease in the efficiency of covariate estimation is due to ignoring the covariate during the sampling process. This is not a problem when the covariate is a confounder to be controlled for in the model. However, if the covariate is of study interest, or is used as a secondary exposure, then we recommend including the covariate information in the matching process.

5.4 Outside Caliper Matching on Surrogate Exposure

In this section, we further investigate situations where the exposure variable is not available at the matching stage, and another variable has to serve as a surrogate exposure.

As introduced in Section 3.2.1, an ideal surrogate exposure variable should stand in for the exposure variable during the matching process and contribute no additional information to the model when the exposure is also included. However, such ideal surrogates are often not available. Therefore, we instead use variables that are correlated with the exposure as surrogate exposures. Simulations were performed to get a general idea of how strong the correlation needs to be for a variable to be a good candidate surrogate exposure.
Figure 5.6 shows the relative efficiencies of the exposure estimation from samples drawn by outside caliper matching with a surrogate exposure used as the matching factor, with a simple random sampling design as a comparison. The relative efficiencies are plotted against the exposure log hazard ratio, which ranges from -3 to 3 in steps of 0.1. From top to bottom, the plots show situations where there is a low (correlation = 0.1), medium (correlation = 0.5), and high (correlation = 0.9) correlation between the surrogate matching variable and the actual exposure included in the final model.

Figure 5.6: Relative efficiencies of exposure estimation given a surrogate exposure is used as the matching factor

The relative efficiency depends largely on the strength of the correlation between the surrogate matching variable and the exposure. The stronger the correlation, the better the efficiency. When the matching variable is barely correlated with the exposure (plots A1 and A2), the efficiency of outside caliper matching is about 10% worse than that of the simple random sampling design. With a strong correlation of 0.9, the efficiency of outside caliper matching is 30% better than that of simple random sampling. Based on these findings, we suggest that a variable should have a correlation of more than 0.5 with the exposure to be considered a candidate surrogate exposure. This cutoff should be raised if the model contains other covariates or confounders that may reduce the efficiency (see Section 5.3).

5.5 Generalization to Other Matching Rates

The outside caliper matching methods illustrated above are all described with a 1:1 matching ratio, but they can easily be generalized to 1:m matching by defining m equally spaced calipers and sampling one control from between each pair of neighboring calipers.

However, we do not recommend generalizing outside caliper matching to a grouped-time setting where ties are allowed in event times.
Although theoretically we can still put a caliper on each case in the risk set and sample controls from outside all of the calipers, it is very likely that these calipers would cover all or most of the exposure rank orders, leaving no or too few controls available for sampling.

Chapter 6
Application on Real World Data

In this chapter, we present analysis results from applying each of the sampling methods introduced above to the Colorado Plateau uranium miner study cohort. The uranium miner cohort was assembled mainly to study the possible effects of uranium mining exposure on lung cancer rates. A detailed description of this study can be found in Section 2.3.

Cox's proportional hazards model is used and, to illustrate the sampling methods, the analysis models consider only the radon exposure and the smoking history. The cumulative radon exposure is scaled to units of 1 thousand working level months (WLMs), and is used as the exposure of interest. The cumulative smoking amount is scaled to units of 10 thousand packs, and is treated as a possible confounder or effect modifier in the analysis.

All sampling methods introduced in this dissertation, including simple random sampling, counter matching, probability matching, and outside caliper matching, are applied to the same cohort for comparison.

Counter matching is applied with the radon exposure variable dichotomized. Two sets of cutoff points are determined. One set is defined using the exposure distribution in cases, and the other using the distribution from the full cohort. As suggested by Langholz and Goldstein [10], the radon exposure distributions can differ across age groups. Therefore, the cutoff points are defined separately for the age group < 55 years and the age group ≥ 55 years. The age groups are determined based on the median age of the full cohort (Table 6.1).
Table 6.1: Age and radon distributions from the Colorado Plateau uranium miner study cohort

                          Mean (std.)          Minimum   Median   Maximum
Age                       54.23 (7.76)         34.5      54.5     86.42
Radon (1000 WLMs)
  Age < 55                750.35 (1040.46)     0         387.96   10000
  Age ≥ 55                862 (1217.07)        0         448.51   10000
Radon from cases only
  Age < 55                1873.35 (1403.12)    27        1824     6832.36
  Age ≥ 55                1748.36 (1993.64)    8         1026     10000

Based on Table 6.1, with the full cohort exposure distribution, the cutoffs for radon exposure are set at 400 thousand WLMs and 450 thousand WLMs for the age groups < 55 and ≥ 55 years, respectively. For the case-based exposure distribution, the cutoffs are 1800k WLMs and 1000k WLMs, respectively.

Probability matching is applied with sampling probabilities calculated from two projections of the radon exposure coefficient. One projection is 0.3, which is the estimate obtained from a full cohort analysis. The other projection is 0.1, which is the estimate obtained from an analysis based on the simple random sampling design. The two projections are chosen to represent a perfect projection of the true radon effect and a less than ideal projection from preliminary studies, respectively.

Outside caliper matching is applied with the largest possible caliper size for each risk set.

The following models are fitted: 1) with the cumulative radon exposure only, i.e. hazard ratio (HR) $= \exp(\beta_R R(t))$, where $\beta_R$ is the coefficient for the radon exposure and $R(t)$ is the cumulative radon exposure at time $t$; 2) with smoking history only, HR $= \exp(\beta_S S(t))$, where $\beta_S$ is the coefficient for smoking and $S(t)$ is the cumulative smoking amount at time $t$; and 3) with both the cumulative radon and smoking exposures, HR $= \exp(\beta_R R(t) + \beta_S S(t))$. The results of these models are listed in Table 6.2.
Table 6.2: Estimated coefficients and standard errors, log hazard (S.E.)

                     Full cohort    SRS*           CM (case)*     CM*            PM (0.3)*      PM (0.1)*      OCM*
Univariate model
  Radon (1)          0.29 (0.02)    0.12 (0.03)    0.21 (0.02)    0.15 (0.03)    0.30 (0.03)    0.21 (0.03)    0.30 (0.02)
  Smoking (2)        0.26 (0.06)    0.06 (0.07)    0.003 (0.07)   0.05 (0.07)    -0.06 (0.07)   0.09 (0.06)    -0.44 (0.06)
Bivariate model (3)
  Radon              0.29 (0.02)    0.12 (0.03)    0.21 (0.02)    0.15 (0.03)    0.30 (0.03)    0.21 (0.03)    0.27 (0.02)
  Smoking            0.29 (0.06)    0.05 (0.07)    0.02 (0.07)    0.04 (0.07)    -0.01 (0.07)   0.07 (0.06)    -0.37 (0.06)

* SRS: Simple random sampling
  CM (case): Counter matching based on exposure distribution from cases
  CM: Counter matching based on exposure distribution from full cohort
  PM (0.3): Probability matching with pre-estimation equal to 0.3
  PM (0.1): Probability matching with pre-estimation equal to 0.1
  OCM: Outside caliper matching
(1) HR $= \exp(\beta_R R(t))$
(2) HR $= \exp(\beta_S S(t))$
(3) HR $= \exp(\beta_R R(t) + \beta_S S(t))$

From the results table, we see that all sampling methods that considered the exposure status in the sampling process provide more precise exposure (cumulative radon exposure) estimates than simple random sampling. Outside caliper matching and probability matching with an exact projection provide the estimates closest to the full cohort analysis results. At the same time, outside caliper matching shows a higher efficiency, as it has the smallest standard error. The estimate from probability matching with an exact projection has the largest standard error among all sampling methods.

The estimate of the covariate (cumulative smoking exposure) effect from outside caliper matching is the farthest from the full cohort analysis result, and that from probability matching with an exact projection is the second farthest.

Compared to the other sampling methods, the case-distribution based counter matching and the probability matching with a loose projection are more balanced in providing precise estimates of both the exposure and the covariate effects.
Although they are not the most precise methods in estimating the exposure, they still provide exposure estimates that are closer to the full cohort result than simple random sampling does. At the same time, case-distribution based counter matching and probability matching with a loose projection are able to maintain the efficiency in estimating the covariate. The covariate estimate from the probability matching is even slightly better than that from simple random sampling in our case.

We also notice that the results from counter matching based on the full cohort exposure distribution are similar to those from simple random sampling. The results from this counter matching design are slightly better in the exposure effect estimation, and are almost identical to the results from simple random sampling in the estimation of the covariate.

Chapter 7
Summary

In this dissertation, we introduced two new sampling methods, case exposure distribution based probability matching and outside caliper matching. The two matching designs are developed to generalize the idea of counter matching to better accommodate continuous exposures. We made the argument that, when the exposure is a continuous variable, these designs retain more information from the exposure variable than categorizing does.

In this chapter, we put these new sampling methods and the existing counter matching and simple random sampling designs in perspective, and discuss the advantages and disadvantages of their design and application.

7.1 Relative Efficiency

In terms of the efficiency in estimating the exposure association, outside caliper matching is superior to all the other methods under a one-to-one matching ratio (1 control per case) at the null hypothesis of no exposure effect. Its efficiency reaches almost the level of a full cohort analysis.
The efficiencies in estimating the exposure effect from the counter matching design with the two categorization methods are just below that of outside caliper matching. Both counter matching methods retain about 80% of the efficiency of the full cohort analysis. Probability matching showed the worst performance, with an efficiency as low as that of simple random sampling, only 50% of that of the full cohort analysis.

The efficiency of each sampling design also depends on the level of the exposure effect. Figure 7.1 shows the relative efficiencies of all five sampling methods with the exposure coefficients ranging from -3 to 3.

Figure 7.1: Relative efficiencies of exposure estimation by the exposure coefficient

The efficiencies of outside caliper matching, the cohort-based counter matching, and simple random sampling are much lower when the exposure effect is strong. The two case distribution based sampling designs are less affected by the size of the exposure effect. The case-based counter matching is only slightly affected when the exposure effect is extreme, and the probability matching is completely independent of the exposure effect. Because of this characteristic, the relative efficiency of the case-based counter matching design is always higher than or equal to that of the cohort-based counter matching, and that of probability matching is always higher than or equal to that of simple random sampling.

The efficiency curve of outside caliper matching is similar in shape to those of the cohort-based counter matching design and simple random sampling, but at a much higher efficiency level. Compared to case-based counter matching, outside caliper matching has greater efficiency when the exposure effect is small or moderate. This superiority diminishes when the exposure effect is extreme (odds ratio more than 5).
Since a higher efficiency means better precision, which matters most in testing and estimating small effects, we recommend outside caliper matching as the best choice for high efficiency, with case-based counter matching next.

7.2 Bias

The estimates from all sampling methods are asymptotically unbiased. However, we do observe substantial bias with small sample sizes, large exposure effects, or rare exposures.

Figure 7.2 shows the bias patterns of each sampling method when the sample size is small (100 observations per risk set). The amount of bias increases with increasing exposure effect for all sampling methods. Simple random sampling has the largest amount of bias, and the case-based counter matching design has the smallest.

Figure 7.2: Loess plot of bias in exposure estimation

Figure 7.3 shows the loess lines of the corresponding p-values. From the p-value plots, we see that the bias from simple random sampling starts to become statistically significant when the exposure coefficient is around 1.5 to 2.5, while the other sampling methods in general are not significantly biased unless the exposure effect is extreme (with a coefficient greater than 2.5) or the exposure distribution is heavily skewed (plots C1 and C2).

Figure 7.3: Loess plot of P-value for bias in exposure estimation

7.3 Design and application

7.3.1 Simple random sampling

Among the five sampling methods, the simple random sampling design is the most straightforward and the easiest to apply. Its only requirement is that everyone in the target population is equally likely to be sampled. When the sample size is adequate, this method provides a sample that represents the target population. However, under this design, no exposure-related information is used in the sampling process. The advantage is that no bias correction is needed for this method.
The disadvantage is that it disregards exposure-related information that could be used to improve the exposure effect estimation. Also, a large sample size may be required to reach a given power level when the exposure effect is small.

7.3.2 Counter matching

The idea of counter matching is to sample controls that are opposite to the cases in terms of exposure status. Counter matching was originally designed for categorical exposures. It requires that the exposure, or some exposure-related measurement, is available at the sampling stage. This exposure-related measure should be categorical to serve as a matching factor.

When the exposure or its related measure is continuous, it must be categorized to create a matching factor. Therefore, the choice of the cutoff points for each category becomes critical.

When the cutoff point is defined by the exposure distribution from the full cohort, samples include balanced numbers of subjects in each category. This balance gives counter matching about 20% more relative efficiency than simple random sampling in estimating the exposure.

When the cutoff point is defined by the exposure distribution from cases only, not only are the numbers of subjects balanced by category, but the overall numbers of cases and controls are balanced as well. On top of the 20% gain in relative efficiency, this extra balance prevents the efficiency from decreasing with increasing exposure effect.

From the application aspect, counter matching requires more steps than simple random sampling, in order to generate sampling categories and calculate the sampling weights that correct the bias caused by the matching process. However, considering that we can reach the same power level with a much smaller sample size, the extra work needed to conduct counter matching may be much less than the effort needed to collect information from more subjects.
From the aspect of the sampling design, although both counter matching methods perform better than simple random sampling, categorization is not optimal for counter matching on continuous exposures. By design, we can only restrict the controls to be sampled from categories that differ from that of the case. This does not prevent situations where the matched case and control are from different exposure categories but their actual exposure measurements are very close to each other.

7.3.3 Case-based probability matching

Case-based probability matching emphasizes sampling according to the exposure distribution in cases. This sampling method is similar to case-based counter matching, but it replaces the categorization step by assigning sampling probabilities. The advantage is that it places no restriction on the format of the exposure. The disadvantage is that it requires calculation of the sampling probabilities, which in turn requires a pre-estimate of the exposure effect. Because the efficiency of this sampling method depends heavily on the sampling probabilities, a bad pre-estimate could ruin the study.

The application of the sampling probabilities is also more complicated than for the other sampling methods. More importantly, the gain in efficiency over simple random sampling is not large (its efficiency is 50% of that of a full cohort analysis).

Although this matching method is not a practical one, it highlights an interesting finding: when the case-control samples are selected based on the exposure distribution in cases, they provide stable relative efficiencies that are not affected by the strength of the exposure effect. This characteristic is observed in both the case-based counter matching design and the case-based probability matching design.

7.3.4 Outside caliper matching

Outside caliper matching focuses on the distances between the matched cases and controls. This design does not have restrictions on the format of the exposure.
Because the sampling process depends only on the rank order of the exposure, it also does not require exact measurements of the exposure at the sampling stage. This characteristic benefits studies with exposures that are difficult to measure but for which a crude measure or ranking can easily be obtained. This design also has the highest relative efficiency among all the sampling methods. Its power at a sample size of two (one case and one control per risk set) reaches almost the power of a full cohort analysis when the exposure effect is small.

The application of this sampling method is relatively simple. We only need to put a caliper around the case, where the size of the caliper is half the size of the risk set. A simple random sample from controls outside the caliper is then obtained.

The downside is that, due to this sampling design, a weight calculation is still needed in the data analysis to correct the bias caused by the sampling process. Also, like the other exposure-related sampling methods, this design sacrifices efficiency in estimating the covariates.

7.4 Discussion

Taking all the comparisons together, outside caliper matching and case-based counter matching are considered the best options: they provide relatively high efficiencies and are less likely to have bias problems.

Comparing the two, the efficiency of case-based counter matching is not the highest, but it is more stable and is barely affected by the exposure effect level. While outside caliper matching has the highest efficiency when the exposure has no or moderate effect, its efficiency decreases with increasing exposure effect. The amount of bias is generally smaller with counter matching.

In applications, both designs require extra steps beyond random sampling, and both need to correct sampling bias at the analysis stage. However, for case-based counter matching, it is possible that some risk sets have no eligible control to be matched.
For example, when there is no control at the high exposure level and the case's exposure is at the low level, the risk set will be dropped from the analysis. This is more likely to happen when the exposure effect is large or the risk set is small.

7.5 Future Directions

This dissertation did not cover the performance of each sampling method in estimating exposure-related interaction terms. It is not clear how the relative efficiency of the interaction estimation would differ across sampling methods. Based on the simulation results for the exposure and covariate estimation, factors that could affect interaction estimation may include the exposure distribution, exposure-covariate correlation, exposure effect, covariate effect, and interaction effect. Also of interest is the impact of adding an interaction term on the efficiency of exposure estimation.

The real data analysis could also be improved by finding data that suit Cox's proportional hazards model better. The Colorado Plateau Uranium Miners cohort should actually be analyzed with an excess relative risk model, but it is analyzed here with Cox's proportional hazards model to be consistent with the simulation work in this dissertation. Therefore the real data analysis results may not be valid and cannot be compared to results from the previous literature.

Further research could look for other sampling strategies that combine the design of outside caliper matching with probability matching. In this dissertation, outside caliper matching improved the exposure estimation efficiency by keeping the controls far from the cases, while probability matching kept the exposure estimation efficiency stable regardless of the exposure distribution and the strength of the exposure effect. An ideal combination of these two sampling methods should be able to provide an exposure estimation efficiency close to that of the full cohort.
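The null-case efficiency ordering summarized in this chapter can be checked numerically from the expected-information expressions stated in the Appendix. The sketch below is an illustration under stated assumptions rather than part of the original analysis: the function name expected_information is hypothetical, the risk set size N is assumed even with the largest caliper $K = N/2 - 1$, and the full cohort information per risk set is taken to be the variance of the exposure written as a sum over pairs, $\sum_{\{i,j\}} (Z_i - Z_j)^2 / N^2$, which is a standard identity and not a formula quoted from the text.

```python
def expected_information(exposures):
    """Expected information per risk set under the null for three designs.

    'srs' and 'ocm' follow the pairwise sums stated in the Appendix
    (largest caliper, K = N/2 - 1); 'full' is the exposure variance
    expressed as a pairwise sum (an assumption of this sketch).
    """
    z = sorted(exposures)
    n = len(z)
    if n < 2 or n % 2 != 0:
        raise ValueError("this sketch assumes an even risk set size N >= 2")
    caliper = n // 2 - 1

    full = srs = ocm = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            squared_diff = (z[i] - z[j]) ** 2
            rank_gap = j - i  # |i - j| in rank units

            full += squared_diff / n ** 2
            srs += squared_diff / (2 * n * (n - 1))
            if rank_gap > caliper:
                # w_i * w_j / (w_i + w_j) = 1 / (|i - j| + 1) at the largest caliper
                ocm += squared_diff / (n * (rank_gap + 1))
    return {"full": full, "srs": srs, "ocm": ocm}


if __name__ == "__main__":
    info = expected_information(range(1, 11))  # ten evenly spaced exposures
    for design in ("srs", "ocm"):
        print(design, "relative efficiency:", info[design] / info["full"])
```

Running the sketch on other exposure distributions (for example, skewed ones) gives a quick sense of how close outside caliper matching comes to the full cohort at the null, without repeating the full simulation.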
Chapter 8
Appendix

Proof that the expected information of Outside Caliper Matching with the largest caliper size under the null hypothesis $\beta = 0$,

$$E[v] = \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2 \, w_i w_j}{N (w_i + w_j)} \, I(|i-j| > K),$$

is greater than the expected information of Simple Random Sampling under the null hypothesis,

$$E[v] = \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{2N(N-1)}.$$

Given that $N$ is even, the largest caliper size is $K = \frac{N}{2} - 1$. According to the weight equation 5.1, under the largest caliper size, if we let $i < j$, then $w_i^{-1} = N - i - K$ and $w_j^{-1} = N - K - (N - (j-1)) = j - K - 1$. Thus, after plugging in $K = \frac{N}{2} - 1$, we get, for $|i-j| > K$,

$$\frac{w_i w_j}{w_i + w_j} = \frac{1}{N - 2K + |j-i| - 1} = \frac{1}{|i-j| + 1}.$$

Figure 8.1: Matrix. Panel (a): $\sum_i (\delta_i + \dots + \delta_{a+i})^2$; panel (b): $\sum_i (\delta_i + \dots + \delta_{N-2-a+i})^2$.

Therefore the expected information of Outside Caliper Matching can be rewritten as

$$
\begin{aligned}
E[v] &= \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2 \, w_i w_j}{N (w_i + w_j)} \, I(|i-j| > K) \\
&= \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{N\left(\frac{N}{2} + 1\right)} I(|i-j| = K+1)
 + \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{N\left(\frac{N}{2} + 2\right)} I(|i-j| = K+2)
 + \dots
 + \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{N^2} I(|i-j| = N-1) \\
&> \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{N(N-1)} I(|i-j| = K+1)
 + \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{N(N-1)} I(|i-j| = K+2)
 + \dots
 + \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{N(N-1)} I(|i-j| = N-1) - o(N^{-1}) \\
&= \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{N(N-1)} I(|i-j| > K) - o(N^{-1}),
\end{aligned}
$$

where $o(N^{-1})$ goes to 0 when $N$ goes to infinity.

Let $\delta_i = Z_{i+1} - Z_i$, $i = 1, 2, \dots, N-1$. From Figure 8.1 we can see that, for integer $a$, $a = 0, 1, 2, \dots, K$,

$$\sum_i (\delta_i + \dots + \delta_{N-2-a+i})^2 \ge \sum_i (\delta_i + \dots + \delta_{a+i})^2,$$

and the two sides are equal when $a = K$. If we replace $\delta_i$ with $Z_{i+1} - Z_i$, then we get

$$
\begin{aligned}
& \sum_{\{i,j\}} (Z_i - Z_j)^2 I(|i-j| = N-1-a) \ge \sum_{\{i,j\}} (Z_i - Z_j)^2 I(|i-j| = 1+a) \\
\Rightarrow\; & \sum_{a=0}^{K} \sum_{\{i,j\}} (Z_i - Z_j)^2 I(|i-j| = N-1-a) > \sum_{a=0}^{K-1} \sum_{\{i,j\}} (Z_i - Z_j)^2 I(|i-j| = 1+a) \\
\Rightarrow\; & \sum_{\{i,j\} \subset R} (Z_i - Z_j)^2 I(|i-j| \ge N-1-K = K+1) > \sum_{\{i,j\} \subset R} (Z_i - Z_j)^2 I(|i-j| \le K) \\
\Rightarrow\; & \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{2N(N-1)} I(|i-j| > K) > \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{2N(N-1)} I(|i-j| \le K) \\
\Rightarrow\; & \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{N(N-1)} I(|i-j| > K) > \sum_{\{i,j\} \subset R} \frac{(Z_i - Z_j)^2}{2N(N-1)}.
\end{aligned}
$$

Therefore we have proved that, as $N$ goes to infinity, the expected information of Outside Caliper Matching with the largest caliper size is greater than the expected information of Simple Random Sampling.

BIBLIOGRAPHY

[1] P. K. Andersen and R. D. Gill. Cox's regression model for counting processes: A large sample study. The Annals of Statistics, 10(4):1100–1120, 1982.
[2] John Cologne and Bryan Langholz. Selecting controls for assessing interaction in nested case-control studies. Journal of Epidemiology, 13(4), 2003.
[3] D. R. Cox. Regression models and life tables. Journal of the Royal Statistical Society, 20:187–220, 1972.
[4] D. R. Cox. Partial likelihood. Biometrika, 62:269–276, 1975.
[5] David H. Garabrant, Janetta Held, Bryan Langholz, John M. Peters, and Thomas M. Mack. DDT and related compounds and risk of pancreatic cancer. JNCI, 84:764–771, 1992.
[6] R. W. Hornung and T. J. Meinhardt. Quantitative risk assessment of lung cancer in U.S. uranium miners. Health Physics, 52:417–430, 1987.
[7] Bryan Langholz and Ornulf Borgan. Counter-matching: A stratified nested case-control sampling method. Biometrika, 82:69–79, 1995.
[8] Bryan Langholz and David Clayton. Sampling strategies in nested case-control studies. Environmental Health Perspectives, 102:47–51, 1994.
[9] Bryan Langholz and Larry Goldstein. Asymptotic theory for nested case-control sampling in the Cox regression model. Ann Stat, pages 1903–1928, 1992.
[10] Bryan Langholz and Larry Goldstein. Risk set sampling in epidemiologic cohort studies. Statistical Science, 11:35–53, 1996.
[11] J. H. Lubin, J. D. Boice Jr., C. Edling, R. W. Hornung, G. Howe, E. Kunz, R. A. Kusiak, H. I. Morrison, E. P. Radford, J. M. Samet, M. Tirmarche, A. Woodward, S. X. Yao, and D. A. Pierce. Lung cancer and radon: a joint analysis of 11 underground miners studies. U.S. National Institutes of Health, 1994.
[12] Lundin, Wagoner, and Archer. Radon daughter exposure and respiratory cancer quantitative and temporal aspects: Report from the epidemiological study of United States uranium miners. U.S. Public Health Service, 1971.
[13] Ornulf Borgan, Larry Goldstein, and Bryan Langholz. Methods for the analysis of sampled cohort data in the Cox proportional hazards model. The Annals of Statistics, 23:1748–1778, 1995.
[14] D. Thomas, J. Pogoda, B. Langholz, and W. Mack. Temporal modifiers of the radon-smoking interaction. Health Physics, 66:257–262, 1994.
Abstract
Cohort studies, by their nature, require a large number of subjects or a long follow-up period to observe outcomes of interest. These requirements make data collection a costly process. The nested case-control design is a way to reduce the cost, as it includes only the cases and selected controls. However, this also makes it less efficient than a full cohort design. It is already known that the way in which we sample controls can affect the study efficiency. A good choice of control sampling method can reduce not only the sample size but also the variance of the parameter estimates. These are especially important when the study outcome is rare and/or the affordable sample size is small.

In this dissertation, we discuss sampling methods that are specifically designed for situations where the exposure or exposure-related information is available in the study cohort but, due to issues such as a small budget or a short data collection period, some covariate information cannot be collected for every participant of the cohort. We demonstrate the merits of choosing an appropriate sampling method by comparing relative efficiencies. These comparisons are made under different types of exposure and a variety of conditions, including rare outcome and rare exposure situations. Under all testing conditions, we find that control sampling methods carefully chosen to match the study are, as a whole, more efficient in estimating the exposure effects than generally used sampling methods.
Asset Metadata
Creator: Luo, Yi (author)
Core Title: Sampling strategies based on existing information in nested case control studies
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Biostatistics
Publication Date: 04/09/2018
Defense Date: 03/07/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: nested case control study, OAI-PMH Harvest, sampling
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Langholz, Bryan (committee chair); Berhane, Kiros (committee member); Eckel, Sandrah (committee member); Goldstein, Larry (committee member); Mack, Wendy (committee member)
Creator Email: luoyi@usc.edu, Xellossly@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-1842
Unique identifier: UC11671410
Identifier: etd-LuoYi-6169.pdf (filename), usctheses-c89-1842 (legacy record id)
Legacy Identifier: etd-LuoYi-6169.pdf
Dmrecord: 1842
Document Type: Dissertation
Rights: Luo, Yi
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA