A STUDY OF METHODS FOR MISSING DATA PROBLEMS IN EPIDEMIOLOGIC STUDIES WITH HISTORICAL EXPOSURES

by

Xinbo Zhang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)

May 2009

Copyright 2009 Xinbo Zhang

Dedication

Dedicated to my wife, Jing Wu, and the rest of my family, for their selfless support.

Acknowledgements

I would like to thank Dr. Bryan Langholz for serving as the dissertation committee chair and for the boundless patience and faith he has placed in me; Dr. Myles Cockburn for the opportunity to participate in the prostate cancer-pesticide project, which ultimately led to the topic of this thesis; and Dr. Larry Goldstein, Dr. Dan Stram and Dr. Kiros Berhane for serving as members of the committee and providing invaluable advice in the creation of this thesis.

Table of Contents

Dedication
Acknowledgements
List of Tables
Abstract
Chapter 1: A general review of missing data problems in epidemiologic studies
  1.1 Introduction to missing data problems
  1.2 The missing mechanisms
  1.3 A review of general methods for missing data
  1.4 Missing values in time-dependent covariates and case-control studies
Chapter 2: Background studies with missing data in exposure history
  2.1 Assessment of environmental pesticide exposure
  2.2 The prostate cancer - pesticide pilot study
  2.3 Other examples of missing data in time-dependent covariates
Chapter 3: Literature review
  3.1 Weinberg's study on exposure history with missing gaps
  3.2 Separate time interval method
  3.3 Cox regression and nested case-control sampling
    3.3.1 The partial likelihood
    3.3.2 Counting process approach
    3.3.3 Nested case-control sampling
  3.4 The induced intensity - the early developments
  3.5 Some advanced methods based on partial likelihood manipulation
    3.5.1 The approximate partial likelihood estimator
    3.5.2 Method using auxiliary covariate data
    3.5.3 Regression calibration
Chapter 4: A simulation study comparing single imputation methods and the missing indicator method
  4.1 The missing data methods examined in the simulation
  4.2 The data preparation for the simulation
  4.3 The simulation setting and results
Chapter 5: The method based on missing indicator induced intensity
  5.1 The induced intensity under counting process theory
  5.2 Examples of parametrization
  5.3 Asymptotic relative efficiency with cohort study design
  5.4 Generalized results of asymptotic efficiency
  5.5 Asymptotic efficiency with nested case-control sampling
  5.6 Simulation study to verify the theoretical calculation
Chapter 6: The application to the prostate cancer study
  6.1 The data and methods
  6.2 Results and discussion
Chapter 7: Discussion
  7.1 The importance of the basic assumptions
  7.2 The bridge to logistic regression under case-control sampling
    7.2.1 The link with unconditional logistic regression
    7.2.2 The logistic model under case-control sampling
  7.3 Summary and discussion
References

List of Tables

Table 2.1: Breakdown of missing proportion during 1974-99 at individual level
Table 4.1: Simulation results for 1500 cases and 1500 controls
Table 4.2: Simulation results for 500 cases and 1500 controls
Table 5.1: Efficiency calculation for NCC design
Table 5.2: Efficiency simulation for cohort design
Table 5.3: Efficiency simulation for nested case-control design
Table 5.4: Example of non-ignorable missingness from simulation
Table 6.1: Unadjusted results for average annual exposure during 1974-1999
Table 6.2: Adjusted results for average annual exposure during 1974-1999
Table 6.3: Unadjusted results for average years exposed during 1974-1999
Table 6.4: Adjusted results for average years exposed during 1974-1999
Table 7.1: Simulation example of informative missingness

Abstract

In this thesis we consider methods for a specific missing pattern in which missing values occur across the exposure history of an individual, creating gaps and holes in that history. We propose a missing indicator induced intensity (MIND) method under the rare disease assumption. The idea originated from Prentice (1982), and the theoretical development finds its root within the Cox regression framework under the cohort design, in which the essential part is the parametrization of the induced intensity. The parametrization of the "induced" intensity actually reflects the missing mechanism; therefore the missing mechanism assumptions found in other literature, such as missing completely at random (MCAR) and missing at random (MAR), are no longer required.

The MIND method is compared against simple imputation methods in a Monte Carlo simulation study under a logistic model with case-control sampling. It proves better in terms of bias and efficiency than the single imputation methods considered, and far superior to the complete case analysis method in terms of relative efficiency. The method is shown to reach an asymptotic efficiency equal to the expected non-missing proportion under the cohort design, assuming that the exposure and the missingness, as a pair, are independently and identically distributed across different years. Under the nested case-control sampling design, the asymptotic efficiency varies slightly but nonetheless stays close to that of the cohort design. Under the rare disease assumption, the method can be bridged back to the case-control design based on the logistic model. The method is then applied to the University of Southern California prostate cancer-pesticide pilot study to assess its performance. The MIND method is overall efficient and flexible in solving the missing data problem where the missingness occurs in the exposure history.
The method can be further improved with better parametrization of the induced intensity under complex situations, especially when the exposure is correlated over years.

Keywords: Missing data, case-control study, exposure history, Cox regression, logistic regression, asymptotic efficiency.

Chapter 1
A general review of missing data problems in epidemiologic studies

In the first chapter we provide an introduction to the missing data problems faced in statistical studies; a discussion of the missing mechanisms; a brief review of a collection of widely accepted methods, along with their merits and pitfalls; and finally a description of the missing pattern that will be the focus of this thesis.

1.1 Introduction to missing data problems

Missing data problems are common in epidemiologic studies. In fact, missing data exist to some extent in virtually every study that involves a large amount of manual data collection. In surveys, individuals may not be available to answer a phone call, or they may refuse to respond to certain questions. In longitudinal studies, participants may drop out due to death or relocation. More frequently, a low-quality data collection process may produce out-of-range values or simply blank values.

Missing data are problematic because most statistical theories are built on the assumption of complete data, and most statistical software packages simply discard records with missing values by default, giving the investigator a false sense of security because the program still runs and sometimes generates favorable results. In doing so, however, two potential pitfalls are lurking: power reduction and bias. Missing data can make the sample size smaller than originally planned, leading to failure to prove the study hypothesis when it is in fact true. To counter the power loss, an additional reserve of participants may be required, which in turn raises the study cost. The bias in survey data is often referred to as non-response bias (Barnard and Meng, 1999), which occurs when the participants who answer a survey question differ significantly from those who refuse to answer. In epidemiologic studies, the chance of being missing may be related to both exposure and outcome, which can induce bias if the missing records are simply discarded.

The best way to solve a problem is always to prevent it from happening. Instead of putting faith in missing data methods at the analysis stage, investigators should take precautions as early as the initial study design; a clever design can limit the damage from the missing data problem or even circumvent it completely. During the subsequent data collection phase, investigators can make their best efforts to improve data quality.

However, it is generally inevitable to end up with a final data set with partially missing data, and researchers have to decide how to deal with the missing data in the analysis phase of the study. The reasons for missing data can be numerous, and there is also a great variety of methods, developed over the years, at the investigator's disposal. Generally there is no single best method to fit all situations, due to the complexity and the unknown nature of missing data; hence researchers are often required to develop an ad hoc method for a specific study to reach the most satisfactory results. As summarized in (Little and Rubin, 1987), we would ask the following questions when using an existing method or developing a new one for a study:

Validity: Does the method provide an estimator free of the systematic bias that may be induced by the missing mechanism?
Variance: Is the variance of the estimator under-estimated? Is there a correction to the variance estimation?

Efficiency: How large is the variance relative to the case where no data are missing?

Difficulty: How difficult is the method to implement? Is the trouble worth the gain?

1.2 The missing mechanisms

The first thing a researcher should do to tackle a missing data problem is to posit a reasonable cause for why the missing data are generated, and describe how that cause interacts with the study outcome and other covariates. The reasons, the so-called missing mechanisms, can be numerous: a measuring instrument malfunctioning, patients failing to answer a questionnaire item, input data being out of range, data not being available for a certain study area, and so on. When a missing mechanism is not integrated into the method, the researcher is usually required, in the classic missing data literature, to make an assumption about the missing mechanism, since the validity of the method may rely on it.

The strongest assumption about the missing mechanism is missing completely at random, or MCAR, where the missingness depends on neither the outcome nor the covariates, observed or not. In mathematical language, if we let $M = 1$ when the value of a certain covariate is missing and $M = 0$ when it is observed, $Y$ be the outcome variable, $Z_{obs}$ be the part of the covariates that is observed and $Z_{mis}$ the missing part, then MCAR implies

$E[M] = E[M \mid Y, Z_{obs}, Z_{mis}]$.   (1.1)

While it is easy to partially verify the assumption by comparing the missing proportions among the fully observed variables, it is generally impossible to strictly test whether the probability of missingness is a result of the missing value itself. Researchers frequently make this assumption if the reason for missingness is, under common sense, clearly unrelated to any study variables; for example, a patient who missed a data collection session because of bad weather can be treated as MCAR. Classic non-MCAR examples include people with high incomes tending to refuse to report their income in a survey, or overweight females tending to omit a question regarding weight.

As pointed out in (Little and Rubin, 1987), the MCAR assumption is often too strong. A less restrictive assumption is called missing at random, or MAR. A missing mechanism is called MAR when the probability of missingness may depend on observed values but not on unobserved values, that is,

$E[M \mid Y, Z_{obs}] = E[M \mid Y, Z_{obs}, Z_{mis}]$.   (1.2)

At first glance, the MAR assumption is quite hard to perceive, and it is difficult to give a perfect example. MAR is less restrictive than MCAR in that MCAR requires complete independence between the missingness and the study variables, observed or not, while MAR only requires conditional independence of the missingness from the missing value itself, conditioned on the observed part of the data. The income example given in the MCAR discussion is clearly not MAR, because the probability of missingness depends on the missing income itself. However, suppose we know everyone's socioeconomic status (SES); then the chance of missing income may be considered MAR, because within the same SES level it is reasonable to assume people have the same chance of refusing to report their income. In this example the missingness is explained by the observed SES level, not by the income itself; we can also describe it as an example of stratified MCAR. As with MCAR, it is impossible to strictly test the MAR assumption because the missing data are unknown. Researchers are recommended to use their own discretion to determine the missing mechanism based on the specific study.
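To make the mechanisms concrete (the third mechanism, NMAR, is defined in the next paragraph), the following minimal Python sketch, which is illustrative and not part of the original thesis, generates an income variable and deletes values under each mechanism; only the NMAR deletion leaves a distortion that cannot be diagnosed from the observed data alone.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 100_000
ses = rng.normal(size=n)                 # fully observed covariate (e.g., SES)
income = 2.0 * ses + rng.normal(size=n)  # variable subject to missingness

def observed_mean(values, p_missing):
    """Mask each value with its probability of missingness; return observed mean."""
    miss = rng.random(n) < p_missing
    return values[~miss].mean()

# MCAR: missingness is independent of everything.
m_mcar = observed_mean(income, np.full(n, 0.3))

# MAR: missingness depends only on the fully observed SES.
m_mar = observed_mean(income, 1 / (1 + np.exp(-ses)))

# NMAR: missingness depends on the (unobserved) income itself.
m_nmar = observed_mean(income, 1 / (1 + np.exp(-income)))

print(f"true mean {income.mean():+.3f}")
print(f"MCAR observed mean {m_mcar:+.3f}")  # approximately unbiased
print(f"MAR  observed mean {m_mar:+.3f}")   # biased unless analysis conditions on SES
print(f"NMAR observed mean {m_nmar:+.3f}")  # biased; not fixable from observed data alone
```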
When the missing mechanism is neither MCAR nor MAR, we call it "not missing at random", or NMAR; in a likelihood setting it is also called "non-ignorable". This type of missingness implies that, after conditioning on all the observed information, the missing mechanism still depends on the unobserved value, so one has to incorporate the missing mechanism through a joint model to obtain valid parameter estimates; hence "non-ignorable".

The problem with the missing mechanism is that none of MCAR, MAR, or NMAR can be determined directly from the study data; and in the case of NMAR, only in rare situations can the missing mechanism be modeled explicitly. One may have to resort to past experience or population data.

1.3 A review of general methods for missing data

Complete case analysis

The most common approach to analyzing missing data is probably the complete case analysis (CCA) method, which simply discards all records that contain any missing value used in the analysis, and analyzes the reduced data set with an unmodified method as if the data were complete. This is the default method adopted by most commercially available statistical software packages, such as SAS, STATA, SPSS and S-PLUS. The obvious advantage of this method is its simplicity to carry out. CCA can also provide valid parameter estimates if the missingness is MCAR, and the loss of efficiency is acceptable if missingness is rare. However, the disadvantages are obvious:

1. The method may not provide valid estimates if the missing mechanism is not MCAR.

2. The method can cause serious loss of information if the missing data proportion is high, or if the omission is mainly caused by missing values in nuisance covariates. The latter is particularly common in multivariate analysis, where only a small subset of the records may be complete.

3. The method may break the study design. For example, one missing value in a case-control pair will break the 1:1 matching, and one missing value in a block will break a balanced block design.

Despite the potential problems, the simplicity often outweighs the disadvantages, making CCA one of the most widely used methods.

Single imputation methods

Single imputation is probably the most straightforward method barring complete case analysis. The simple idea behind the technique is to replace the missing value with a mean, or with a draw from an explicitly or implicitly defined distribution, and then analyze the data as if they were complete. There are many ways to realize the draw, depending on how the model and the data are set up; a code sketch after the multiple imputation discussion below illustrates some of these strategies.

Mean imputation: substitute the missing value with the mean of the observed values, or of a subset of the observed values; hence all missing values of the same variable are replaced with the same imputation.

Regression-based imputation: sometimes the variable in which missingness occurs can be predicted reasonably well from other variables in the model; for instance, a taller person tends to weigh more. The idea is to run a regression on the observed relevant values to predict the missing ones, then substitute each missing value with its predicted value. Clearly this method can produce more variation in the imputed values than mean imputation. In fact, mean imputation is no more than a special case of linear regression imputation in which all slopes are 0, meaning the regression contains an intercept only.

Hot deck imputation: this method is often used by the Census Bureau in population surveys (Reilly and Pepe, 1997). Essentially, the draw is made from a pool of fully observed individuals who share similar properties, on other characteristics, with the individual with the missing value.
The scheme used for the draw can be quite delicate. Hot deck imputation sits somewhere between mean imputation and regression imputation, and can be viewed as an empirical realization of regression imputation.

LOCF imputation: the last observation carried forward method is usually used to impute missing values in time-dependent covariates, where the measurement is taken repeatedly over time for the same individual subject. The missing value is replaced with the closest earlier observed measurement, hence "carried forward". Note that the imputation can also be done in the opposite direction, replacing the missing value with the closest observed value occurring after the missing value.

Single imputation methods can be problematic, however, if not treated correctly. Since the missing value is imputed with a fixed value based on the researcher's knowledge and belief about the model, the uncertainty within the data is artificially reduced. Even if the analysis yields an unbiased estimate, the variance calculation generally requires a correction to reflect the loss of the data uncertainty, and the correction is not always easy. Single imputation also requires at least the MAR assumption, otherwise potential bias in the parameter estimation may arise; the chance of bias is particularly high when a large amount of data is missing, or, in the case of the LOCF method, when a single observed value may have a huge influence on the estimates.

Multiple imputation

One problem with the single imputation methods is that the missing value is imputed with a fixed value as if that value were true, thus potentially underestimating the variance of the parameter estimate without a post-analysis adjustment. Motivated by the Bayesian perspective, the multiple imputation (MI) method was first introduced by Rubin (1978) to reflect the uncertainty from imputation which is unaccounted for in single imputation methods. The idea is to first impute each missing value in the data multiple times, say $D$ times, from a predefined distribution. $D$ "complete" data sets then result from the imputation. The analysis is performed separately on each of the imputed data sets, as in the single imputation method, to obtain an estimate of the unknown parameter. The final step is to combine the estimates from the $D$ sets of analyses into one estimate of the parameter, usually by averaging the $D$ estimates. The variance of the combined parameter estimate consists of two parts: one accounts for the variation of the model within each imputation, the other accounts for the variation between imputations.

Multiple imputation relies on MAR to generate unbiased estimates. To answer the question of how many sets of imputations are enough, Rubin and Schenker (1986) showed that even with only two sets of imputations, the method can provide satisfactory coverage of the confidence interval estimates relative to the nominal level with up to 30% missing data. Most statisticians, however, suggest 3-5 sets of imputations (Allison, 2000) after considering the trade-off between complexity and performance.

The advantage of the multiple imputation method is that it incorporates the merits of single imputation and, on top of that, provides more reasonable variance estimates, since it accounts for random variation in the imputation process. The disadvantage is that the random imputation makes it impossible to precisely reproduce an analysis. The method has also been criticized for introducing too much noise into the model by essentially conjuring up data.
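As a concrete reference for the methods above, here is a minimal Python sketch, not from the thesis, of mean imputation, LOCF, and Rubin's combining rule for multiple imputation (combined variance = within-imputation variance + (1 + 1/D) times between-imputation variance).

```python
import numpy as np

def mean_impute(x):
    """Replace NaNs with the mean of the observed values."""
    x = x.copy()
    x[np.isnan(x)] = np.nanmean(x)
    return x

def locf_impute(x):
    """Last observation carried forward along a time-ordered history."""
    x = x.copy()
    for t in range(1, len(x)):
        if np.isnan(x[t]):
            x[t] = x[t - 1]          # a leading NaN (no earlier value) stays missing
    return x

def rubin_combine(estimates, variances):
    """Combine D per-imputation estimates and variances by Rubin's rule."""
    d = len(estimates)
    w = np.mean(variances)           # within-imputation variance
    b = np.var(estimates, ddof=1)    # between-imputation variance
    return np.mean(estimates), w + (1 + 1 / d) * b

history = np.array([0.0, 1.2, np.nan, 3.1, np.nan, 0.8])
print(mean_impute(history))
print(locf_impute(history))
print(rubin_combine([0.51, 0.48, 0.55], [0.010, 0.012, 0.011]))
```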
Missing indicator method

The missing indicator method can be traced back as early as (Cohen and Cohen, 1975). The method can vary from a very simple form to a complex one. In its most naive form, suppose the exposure of interest is the only covariate in the model and is missing for some subjects. Then, instead of discarding the subjects with missing values, one may replace the missing value with 0 and add to the model a missing indicator as a controlling covariate, which is assigned 1 when the exposure is missing and 0 otherwise.

To illustrate the method, consider unconditional logistic regression with a single scalar covariate $Z$. Suppressing the index for subject, the model can be expressed in logistic form when no missingness occurs as

$\mathrm{logit}\,\Pr(D = 1) = \alpha + \beta Z$,   (1.3)

where $\alpha, \beta$ are the parameters and $D$ the dichotomous outcome. When there are missing data in $Z$, let $M$ be the missing indicator for $Z$. The missing indicator induced model then becomes

$\mathrm{logit}\,\Pr(D = 1) = \alpha + \beta Z (1 - M) + \gamma M$.

Note that the induced model retains the same model structure, only adding one covariate; thus the method can be applied directly in most statistical analysis tools without much modification. With a single covariate in the model, it can be shown that the missing indicator method will not provide any gain in efficiency under unconditional logistic regression. However, in the vast majority of epidemiologic studies the model contains multiple covariates, and the missing indicator method can be adjusted accordingly to improve the efficiency of the parameter inference.

In individually matched case-control studies, Huberman and Langholz (1999) recommended the missing indicator method for analysis under conditional logistic regression. When the exposure cannot be observed for all subjects, there are usually two simple approaches: 1) retain the case-control matching and perform conditional logistic regression on the pairs where both case and control are completely observed; or 2) break the matched pairs and analyze the data with unconditional logistic regression. In the same paper, the authors demonstrated that the missing indicator method can be regarded as a compromise between the odds ratios estimated by the two aforementioned approaches; the method can therefore preserve the matching structure and at the same time improve efficiency over the complete-pairs approach.

The missing indicator method can appear in a different form, as a class of stratification methods. Suppose that $Z$ in model (1.3) is categorical and can take only $k$ different values; the analysis can then be done by adding a $(k+1)$-th category to $Z$ to indicate missing status, and following the usual estimating procedure to calculate the odds ratios (OR) with respect to each category. Here the missing category acts as a controlling factor for the other, substantively meaningful ORs. This stratification strategy is more often used for missing values in confounders. For example, if $Z_1$ is the exposure of interest with no missingness and $Z_2$ is a partially observed confounder with $k$ distinct values, then the analysis of the effect of $Z_1$ can be done by stratifying into $k+1$ sub-models, one for each category of $Z_2$ plus the additional missing category.
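A minimal sketch of the missing indicator induced model above, in Python with the statsmodels package (our choice of tool, not the thesis's; the pilot study analyses were done in SAS): the missing exposure is recoded to 0 and the indicator $M$ enters as an additional covariate.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=2)
n = 5_000
z = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-2.0 + 0.7 * z)))   # true model: logit Pr(D=1) = alpha + beta*Z
d = rng.binomial(1, p)

m = rng.random(n) < 0.3                   # MCAR missingness in Z for illustration
z_filled = np.where(m, 0.0, z)            # missing Z replaced by 0

# Design matrix: intercept, Z*(1-M), M -- the missing indicator induced model
X = sm.add_constant(np.column_stack([z_filled * (1 - m), m.astype(float)]))
fit = sm.Logit(d, X).fit(disp=False)
print(fit.params)   # [alpha, beta, gamma]; beta is driven by the observed part of Z
```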
The EM algorithm

First developed in (Dempster et al., 1977), the EM algorithm is an iterative algorithm for finding maximum likelihood estimates in parametric models with missing data. The algorithm begins by assigning an initial value $\theta^{(0)}$ to the unknown parameter $\theta$; then the E-step and the M-step are carried out iteratively until the desired convergence is achieved.

The E-step: The E-step calculates the expectation of the likelihood conditioned on the observed values and the current parameter estimate $\theta^{(n)}$; in doing so, the missing values in the full likelihood function are eliminated. To achieve this, the distribution of the missing values conditional on the observed values must be specified. The calculation is as follows,

$Q(\theta \mid \theta^{(n)}) = E[\, l(\theta; Z) \mid Z_{obs}, \theta^{(n)} \,] = \int l(\theta; Z)\, f(Z_{mis} \mid Z_{obs}; \theta^{(n)})\, dZ_{mis}$.

The M-step: From the E-step, $Q(\theta \mid \theta^{(n)})$ is computable for every $\theta$. The M-step maximizes $Q(\theta \mid \theta^{(n)})$, i.e., finds $\theta^{(n+1)}$ such that

$Q(\theta^{(n+1)} \mid \theta^{(n)}) \ge Q(\theta \mid \theta^{(n)})$, for all $\theta$.

The likelihood increases at the end of every M-step, and since it is bounded, the convergence of $\theta^{(n)}$ is guaranteed, which is one of the biggest advantages of the EM algorithm. However, the computational burden of the EM algorithm is heavy, and the asymptotic properties of the estimate are difficult to come by.
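To illustrate the E/M iteration in a self-contained way, the following Python sketch runs EM on a two-component Gaussian mixture, a standard textbook instance in which the unobserved component labels play the role of $Z_{mis}$; it is an illustration only, not an algorithm used elsewhere in this thesis.

```python
import numpy as np

def em_two_gaussians(x, iters=200):
    """EM for a two-component Gaussian mixture; latent labels act as Z_mis."""
    mu = np.array([x.min(), x.max()])    # crude initial values, theta^(0)
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior membership probabilities for each point
        # (normal density up to a constant, which cancels in the ratio)
        dens = np.stack([
            pi[k] * np.exp(-0.5 * ((x - mu[k]) / sd[k]) ** 2) / sd[k]
            for k in range(2)
        ])
        w = dens / dens.sum(axis=0)
        # M-step: weighted MLE updates maximize Q(theta | theta^(n))
        nk = w.sum(axis=1)
        mu = (w * x).sum(axis=1) / nk
        sd = np.sqrt((w * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        pi = nk / len(x)
    return mu, sd, pi

rng = np.random.default_rng(seed=3)
x = np.concatenate([rng.normal(0, 1, 700), rng.normal(4, 1, 300)])
print(em_two_gaussians(x))   # component means near 0 and 4
```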
1.4 Missing values in time-dependent covariates and case-control studies

The pattern of the missing data can be quite different from study to study, depending on how the relevant data are collected and organized. Generally, the missing pattern is one of the most important considerations when it comes to choosing a method to deal with missing data, since there is no universally best method and every method has its own range of applicability. Readers may refer to (Little and Rubin, 1987), where the authors give a comprehensive overview of missing patterns.

In this dissertation we focus on missing data problems in time-dependent covariate histories. It is a common scenario in epidemiologic studies that the exposure of interest is evaluated over a certain period of time. For example, in the Colorado Plateau uranium miner study of the association between radon exposure and lung cancer deaths (Hornung and Meinhardt, 1987), each miner's radon exposure history was estimated from the job history and mine radon levels on the basis of annual cumulative exposure, for periods as long as 40 years. More often than not, the exposure history can only be partially observed or estimated due to lack of information: the longer the exposure history, the higher the chance that missingness occurs. Several examples are introduced in the next chapter to illustrate this missing data problem; one thing the examples have in common is that the exposure evaluations all depend on the subject's residential history, so an incomplete residential history results in an incomplete exposure history.

With missing holes and gaps in the exposure history, complete case analysis is generally no longer adequate. In the case of a univariate missing data problem, if the missing rate is low and the missing mechanism is MCAR, CCA is an acceptable method at the cost of some efficiency. However, if missing values can occur anywhere in the exposure history, a single missing "hole" in the history results in the subject being discarded from the analysis, likely leaving the analysis far fewer subjects than a univariate missing data problem with a comparable missing rate. One may also argue that discarding a subject with a partially missing exposure history is inefficient, especially if the cost of data collection is prohibitive.

Perhaps the most natural way to study the effect of a time-dependent exposure is to set up a cohort, follow the cohort up, monitor the occurrence of the outcome, and assess the exposure as the study progresses. The analysis can be done using the widely accepted Cox regression (Cox, 1974), which is implemented in most statistical software packages, or using similar derivatives of the Cox proportional hazards model. There has been some literature over the years on missing data methods in Cox regression; we review these methods in Chapter 3. The advantage of the prospective cohort study is that, if well designed, the natural process of following the cohort resembles what is going on in the general population, thereby reducing the chance of bias. However, a large-scale cohort study is not always feasible or efficient if the occurrence rate of the event is low (rare disease) or the cost of collecting covariate information from the study subjects is prohibitive. A case-control design, or a nested case-control design, can be used in these situations to improve efficiency and reduce cost. The case-control design is known to be more prone to bias than the cohort design, and missing data only make the situation worse. There has been limited literature on methods for missing data in case-control studies with time-dependent exposures, which is the main focus of this thesis.

Under a case-control study design, the time-dependent covariate summary of the exposure history is evaluated at the point when the study subject is ascertained. If a single parameter (or a limited number of parameters) is used to describe the association between the exposure and the outcome, it is necessary to specify a way to summarize the exposure history, which reflects the investigator's belief about how the exposure and the outcome are related.

Let $Z^*(t) = \{Z^*_1(t), \ldots, Z^*_p(t)\}$ be a $p$-vector function of the exposure history $\{Z(u),\ u \le t\}$. One must note that although $Z^*(t)$ is evaluated at time $t$, it actually depends on the covariate history prior to $t$. Some examples of exposure summaries include:

- the "exposure at baseline", i.e., $Z^*(t) := Z(0)$; note that the time point need not be 0, it could be any time point in the history;

- the "cumulative exposure", i.e., $Z^*(t) := \int_0^t Z(u)\,du$;

- the "cumulative exposure with latency", i.e., $Z^*(t) := \int_0^{t-v} Z(u)\,du$, where $v$ is a fixed latency effect;

- the "time-weighted cumulative exposure", i.e., $Z^*(t) := \int_0^t u\,Z(u)\,du$, or $Z^*(t) := t \int_0^t Z(u)\,du$.

Among these types of summaries, perhaps the most frequently used is the "cumulative exposure". For example, in the prostate cancer - pesticide study that will be discussed in Chapter 2, the exposure to certain chemicals is calculated over the entire residential history of each participant, and the analysis based on the cumulative exposure is to be compared to the exposure collected at the diagnosis year.
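For a discretized (e.g., yearly) exposure history, the integral summaries above reduce to sums. A minimal Python sketch with illustrative names:

```python
import numpy as np

def cumulative(z):
    """Z*(t) = integral_0^t Z(u) du, for a yearly history z[0..t-1]."""
    return z.sum()

def cumulative_with_latency(z, v):
    """Drop the last v years to allow for a fixed latency effect."""
    return z[: len(z) - v].sum()

def time_weighted(z):
    """Z*(t) = integral_0^t u Z(u) du, down-weighting the earliest years."""
    u = np.arange(len(z))
    return (u * z).sum()

z = np.array([0.0, 1.5, 2.0, 0.0, 3.2])   # five years of annual exposure
print(cumulative(z), cumulative_with_latency(z, v=2), time_weighted(z))
```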
Chapter 2
Background studies with missing data in exposure history

While the general discussion of missing data was the focus of Chapter 1, Chapter 2 describes an ongoing study that motivates the topic of this thesis. The study focuses on the association between prostate cancer and ambient pesticide exposure. A novel method based on a geographic information system (GIS) was developed to estimate an exposure history over each participant's life, and in doing so the missing data in the exposure history emerged. We start in the first section with an overview of various methods for pesticide exposure assessment, followed by a second section giving extensive coverage of the study methods and results; finally we introduce some other studies from the literature that share a similar missing data problem.

2.1 Assessment of environmental pesticide exposure

Conventional methods to assess pesticide exposure

Most epidemiologic studies so far investigating the risk of pesticide exposure have been carried out in occupational groups. For example, Zahm, Ward et al. (1997) investigated the risk of cancer among a pesticide occupational group. In Engel, Checkoway et al. (2001), Parkinson's disease was found to be associated with long-term occupational exposure to pesticides. In Young, Mills et al. (2004), a more complicated estimation method was used: various criteria such as job, crop type and residence were used to create a matrix to evaluate the exposure to triazine herbicides. However, an occupational group is usually small, and thus not suitable to represent the population when investigating pesticide exposure acquired in daily life. Also, much of the data are obtained from self-report or interview, and people usually do not know what they were exposed to, so the data are subject to potential recall bias and accuracy problems.

Another conventional method of pesticide exposure assessment is biological monitoring: testing for the presence of a specific pesticide in the subject's blood, urine or even genes. For example, in Loewenherz, Fenske et al. (1997), organophosphorous pesticides were tested using urine from children to give a more detailed understanding of pesticide pathways for children of agricultural workers. Obviously, biological testing can be very accurate if the testing procedures are correct. However, it can be prohibitively expensive, so that large-scale studies are not economically viable. A poorly designed biological testing scheme can also lead to potential bias, as suggested in (Acquavella, Alexander et al., 2006).

Pesticide Use Report: a GIS approach

In 1972, California established a law that a pesticide-use report (PUR) must be filed whenever a restricted-use pesticide is applied; in 1990 the law was extended to cover all pesticides. The PUR is reported according to the Public Land Survey System (PLSS), a grid system that partitions the United States into units with an area of approximately 1 square mile; each PLSS unit is identified by a unique TRS (Township, Range and Section).
A narrow classi¯cation of exposure limited the de¯nition of exposure to those pesticide applications that fell within the same TRS as the subject's residence. Obviously, this type of exposure assessment is the most primitive because exposure is binary. In case we need to evaluate the impact of pesticide on health issues based on the amount of pesticide the subject actually exposes to, a more detailed assessment method is required. Reynolds,VonBehrenetal. (2005)alsousedPURdatatoevaluatepesticideexposure to detect relation between pesticide exposure and the risk of childhood cancer. However unlike (Bell, Hertz-Picciotto et al., 2001) where the exposure was simpli¯ed to a binary 19 variable, the exposure in this paper was estimated in a more detailed, quanti¯ed level. Because of the nature of PUR report, the basic section was still based on PLSS. First, around each study subject's residence, a half-mile bu®er was created. Considering the PLSS section is roughly an 1 mile by 1 mile square, the bu®er can intersect with at most fouradjacentPLSSsections. Next, theauthorsassignedpesticideuse(inpounds)toeach residence proportional to the percentage area of each section within the bu®. For each residence,theauthorsthensummedtheproportionallyweighedpoundsforallmile-square sections within the bu®er. Finally, the authors divided by the bu®er area (0.79 square miles) to obtain pounds per square mile for each pesticide group and individual pesticide of interest. The PUR data can also be used to assess a group of people, say the people in a census block, to eliminate individual variation in the assessment. For example, in (Gu- nier, Harnly et al., 2001), the author ¯rst calculated the pesticide density distribution in California based on PUR data. Then the spatial relationship between census blocks and PLSS sections was analyzed using GIS software, and then the state wide distribution of pesticide density among block groups was obtained though the spatial relations. Enhance the PUR data with Land-use survey data Utilizing California Pesticide-Use Report to assess the pesticide exposure has recently become popular and successful because it's an objective measurement comparing to self- report and inexpensive once the PUR data and the subjects' residential locations are identi¯ed in the GIS system. However, the limitations are almost obvious. PUR data is based on PLSS, although the acreage applied with pesticide is reported, the exact location is unknown; hence the PUR data can only provide a rough pesticide density 20 estimate across the entire PLSS unit. The PLSS unit, which is roughly 1 square mile, is relatively huge comparing to the 500 - 1000 meters bu®er size around the subject's residence used to determine the exposure. Therefore the exposure estimate based only on PUR might be inaccurate and misclassi¯ed. Rull and Ritz (2003) developed a new estimating method utilizing both PUR data and land use data to narrow the exposed area. The CDWR (2002) performs county-wide, large-scale surveys of land use and crop cover every 7-10 years, which vary by county. Essentially there are 1-3 surveys being done in each year, then move around counties, resulting in roughly 7-10 years between surveys in any give county. In the land-use data, ¯elds, vineyards, orchards and other land-use types are expressed as polygons with individual attribute information (e.g., land use type, acreage). 
The PUR data can also be used to assess a group of people, say the people in a census block, to eliminate individual variation in the assessment. For example, in (Gunier, Harnly et al., 2001), the authors first calculated the pesticide density distribution in California based on PUR data. The spatial relationship between census blocks and PLSS sections was then analyzed using GIS software, and the statewide distribution of pesticide density among block groups was obtained through those spatial relations.

Enhancing the PUR data with land-use survey data

Utilizing the California Pesticide-Use Report to assess pesticide exposure has recently become popular and successful, because it is an objective measurement compared to self-report, and inexpensive once the PUR data and the subjects' residential locations are identified in the GIS system. However, the limitations are almost obvious. PUR data are based on the PLSS: although the acreage treated with pesticide is reported, the exact location is unknown, hence the PUR data can only provide a rough pesticide density estimate across the entire PLSS unit. The PLSS unit, roughly 1 square mile, is relatively huge compared to the 500-1000 meter buffer around the subject's residence used to determine the exposure. Therefore, an exposure estimate based only on PUR might be inaccurate and misclassified.

Rull and Ritz (2003) developed a new estimation method utilizing both PUR data and land-use data to narrow the exposed area. The CDWR (2002) performs county-wide, large-scale surveys of land use and crop cover every 7-10 years, which vary by county. Essentially, 1-3 surveys are done each year, moving around the counties, resulting in roughly 7-10 years between surveys in any given county. In the land-use data, fields, vineyards, orchards and other land-use types are expressed as polygons with individual attribute information (e.g., land-use type, acreage). After merging the PLSS boundaries into the land-use data, the land-use data are linked to PUR data by crop type within each PLSS section. A three-tier link approach is designed to deal with the mismatch caused by data error and time discrepancy, with tier 1 for a perfect match and tier 3 for the weakest match. After the matching, each land-use polygon is associated with densities quantifying the amount of the chemicals applied on that polygon during the year. When a buffer is created around a residence, the amount of exposure during the year can be accurately calculated using the areas of the parts of the polygons that fall within the buffer and the associated densities, much like (Reynolds, Von Behren et al., 2005), but with a "finer" partition. The authors then designed a simulation study using the PUR + LU approach as the gold standard to compare with the PUR-only method described in (Bell, Hertz-Picciotto et al., 2001). The results revealed a substantial attenuation in the OR using the PUR-only method when residential exposure status was based on a larger geographic area, yielding higher sensitivity but low specificity for exposure, or when the pesticide was rarely applied. This means the PUR + LU analysis method can potentially provide more accuracy compared to methods where only PUR is used.

2.2 The prostate cancer - pesticide pilot study

Study design and method

Cockburn, M. C. (2005) proposed a pilot case-control study investigating the association between prostate cancer risk and environmental pesticide exposure in an agricultural area. One of the major aims of the study, other than the association itself, was to show that historical residential data together with PUR + LU data provide a substantial reduction in exposure misclassification compared to estimates based on current residential addresses and PUR + LU information alone. The cases were a randomly selected sample of histologically confirmed prostate cancer cases occurring in Cancer Registry records after August 2005, among non-Latino whites, Latinos and Blacks, with the last known address (usually the address at diagnosis) inside Fresno, Kern or Tulare counties. All the selected cases were between 55 and 74 years of age at diagnosis. Age- and race-frequency-matched population-based controls were obtained from Medicare patients and from random residential parcel selection. Telephone interviews were then conducted with enrolled cases and controls to obtain information on prostate cancer risk factors and residence/occupation histories. Pesticide exposure estimates were obtained using a PUR + LU approach similar to (Rull and Ritz, 2003), combined with the spatio-temporal data from geocoded historical and current residential addresses in terms of latitude/longitude coordinates. The exposure estimation process can become quite complicated; a Visual Basic script running under ArcGIS was developed to ease the computing burden. The script, named GIS-based Residential Ambient Pesticide Exposure Software (GRAPES), is in essence an automated tool that generates the yearly total exposure to a specific chemical when fed a lifetime residential history. The planned sample size for the pilot study was 120 cases and 120 controls. After the initial data collection phase, 136 cases and 137 controls entered the analysis phase.
The missing data problems in the pilot study

The PUR + LU exposure evaluation approach implemented in the GRAPES system, while showing greater potential than the traditional self-report method and the PUR-only method (Rull and Ritz, 2003), is vulnerable to a variety of measurement error and missing value problems because of its complexity. The major causes of these difficulties include:

- The PUR data are gathered through self-reported applications, hence inevitably subject to human error and missingness. For example, the PLSS unit within which a certain chemical is applied is frequently reported with an incomplete coordinate, so the data entry for that application record will be missing from the GRAPES system. Approximately 40% of the entire set of PUR data entries are discarded due to incomplete PLSS coordinates, hence the exposure estimate generated from GRAPES is on average less than the true amount.

- The PUR did not start until 1974; and while the data for recent years, since 1999, are available online, they have yet to be processed for use in the GRAPES system. For these reasons, we can only estimate the exposure history from 1974 to 1999 using the GRAPES system.

- The PUR data and the LU data do not perfectly match each other, since the land-use survey in a county is conducted only every 6-7 years. We can only link the PUR data to the closest available year of LU data. It is possible that the crop type reported in the PUR does not match the land-use type in the same PLSS section in the LU data, due to the time discrepancy. A "three-tier" approach was introduced to estimate the exposure when such a mismatch occurs (Rull and Ritz, 2003).

- The current PUR data are available only for the state of California, and the land-use survey data available to the study cover only Fresno, Kern and Tulare counties. When a subject has a period of his residential history outside the study area, the exposure associated with those addresses cannot be estimated using the PUR + LU method, even if the addresses can be validly geocoded. We call this type of missingness "missing due to out of counties (OOC)".

- The residential history was collected through personal interviews and self-reported questionnaires. Some individuals might fail to recall their residential history precisely, especially if they moved frequently in the past; or, even worse, some might refuse to answer. When an address cannot be obtained, or is not valid for geocoding, we consider the address unobservable and mark it as "missing due to an unknown or invalid address (UA)".

We would like to set the measurement error problems in the pilot study aside for now and stay focused on the missing data; more specifically, we focus on the latter two types of missing data, i.e., missingness due to "out of counties" and "unknown addresses". One striking characteristic of the missing data is that the missingness is completely governed by the residential addresses: if an address is missing, all the exposures associated with that address will also be missing. Thus the missing pattern looks more like stripes than holes. This pattern suggests that the chances of being missing in two consecutive years have a high positive correlation, rather than being completely independent.

The missing mechanism is an important aspect to consider. The three counties (Kern, Fresno and Tulare) are heavily agricultural, so one may expect the ambient concentration of pesticides inside the study areas to be on average higher than elsewhere.
Hence the\OOC"typeofmissingimpliestheunobservedpesticideexposurebelowerthanwhat would be otherwise observed inside the three counties. This association practically denies the MCAR and MAR assumptions. The ethnicity tends to be another covariate that is associated with the chance of missing addresses, because Hispanic tends to be more likely tobeamigrantwhospenthisearlyyearsoutofthethreecounties. Theintra-dependency and NMAR in the missing mechanism could well mean a nightmare in the analysis of the missing data. While keeping these delicate properties in mind, we always starts from the most primitive assumptions as we will do in the later sections. Both types of missing data are prevalent pilot study data where the entire life history of pesticide exposure is to be estimated. Table (2.1) shows the missing data distribution for overall, case-control status and race for the period from 1974 to 1999 where the exposure can be estimated through GRAPES. The individuals containing at least one year of missing exposure (due to either \out of counties" or \unknown address") account for 40:30% of the entire study sample, the UA type of missing (4.40%) is quite low comparing to OOC type of missing (35.90%). The overall missing proportion is higher in cases (30.88%) than in controls (49.70%), but not much di®erences between white and non-whiteparticipants. Forthosecontainingmissingyearsofexposure,theseverityranges 25 from a mere one missing year to almost missing the entire history. We also calculate the missing proportion at the year level, for which each participant had 26 years of exposure during the 74-99 period. The overall missing proportion at the year level is 21.31% with OOC missing 19.85% and UA missing 1.46%, which is around half of that at individual level, hence the individual with at least one missing year has on average around half of the years being missing. Table 2.1: Breakdown of missing proportion during 1974-99 at individual level OOC missing UA missing Total Overall 98 (35.90%) 12 (4.40%) 273 Case 41 (30.15%) 1 (0.73%) 136 Control 57 (41.61%) 11 (8.09%) 137 White 77 (38.31%) 5 (2.49%) 201 Non-White 18 (26.87%) 7 (10.49%) 67 Withtherelativelyhighproportionofmissingdataatsubjectlevel, thecompletecase analysis, which is traditionally adopted in most software packages, is essentially out of consideration because there will not be enough subjects left for analysis after the exclu- sion of the subjects containing missing data; furthermore, it is not wise to discard the entire observation when it is only partially missing since doing so may cause severe loss of information; last but not least, one of the primary aims of the study is to demonstrate the superiority of the exposure history to the diagnosis-year-only exposure. The statistical analysis and results in the pilot study The analysis for the prostate-pesticide pilot case-control study was carried out using unconditionallogisticregressioninSAS9.1.3. Theoutcomeisabinaryvariableindicating whether the individual was diagnosed with prostate cancer at the reference year which is de¯ned as the year of diagnosis for cases; and as the year of interview for controls. 26 Processing through GRAPES, we attempted to establish an exposure history for each individual from calendar year 1900 to year 2006, as long as the residential address was available for the year. 
The yearly exposure was calculated as the yearly total pounds of the chemical applied within the bu®er around the residence (in our case 500 meters in radius), as recorded in PUR data. Wedidnotseparatedi®erentmissingtypesdiscussedearlier,butrathersimplytreated as \missing". The way the missing value was taken care of was simple and straight forward. A valid year is the year where the estimated exposure to the chemical is not missing. Given a certain time interval, the average valid year exposure (AVY) is de¯ned as the total valid year exposure (in pounds) divided by the number of valid years within the time interval. To test the and compare di®erent exposure summaries, we proposed ¯ve types of exposure summaries for various time intervals, these ¯ve AVY variables were modeled separately. 1. Reference (diagnosis) year only. 2. Life time. 3. 1974 to 1999: PUR data is available for this period, hence the exposure estimate during this period is more accurate. 4. Life time up to 10 years prior to reference year: to account for latency e®ect. 5. Life time up to 15 years prior to reference year: to account for longer latency e®ect. Although AVY variables were given in continuous form, it may not be suitable to include them into the model unchanged, because usually over half of the study samples had a zero AVY estimate; and for those who had a positive AVY estimate, the value was 27 often heavily skewed. In our analysis, the AVY variables were categorized into dummy variables in two ways: 1) zero-or-positive binary indicator; and 2) zero-low-high tertiary indicator. Thecut-pointforlowexposureandhighexposureamongthepositiveexposures varied according to di®erent chemicals. We performed the analysis on the sample of 136 cases and 137 controls for Methyl Bromide, Captan, Simazine, and Organochlorine. The reported ORs, if not speci¯ed, are adjusted for age, race and home pesticide exposure. We summarize the analysis results as follows, detailed tables can be obtained from the author. Methyl Bromide: When summarized as binary type exposure, the OR for diagnosis year exposure was 2.49 (1.05, 5.94, p = 0.04); the ORs for exposure summary based on residential history ranging from 1.42 to 1.82 with the OR for 1974-1999 period exposurebeing1.82(1.08,3.06,p=0.03). Whilenotsigni¯cantatthenominal0.05 level, the ORs for exposure 10 years and 15 years before diagnosis were 1.55 (0.92, 2.62, p=0.10)and1.59(0.93, 2.74, p=0.09), respectively. Whenweusedlow-high exposure summary (none, low, high), however, none of the results were signi¯cant. Furthermore, we didn't see an OR increase from low exposure to high exposure. Captan: For binary type exposure, the OR for diagnosis year exposure was 2.24 (0.88, 5.72, p = 0.09). The ORs for history type exposure were fairly close, ranging from 1.42 to 1.64 with the OR for 1974-1999 period exposure being 1.64 (0.96, 2.82, p=0.07). Unlike Mythyl Bromide, the low-high exposure scheme showed a clear trend with increasing Captan exposure: the ORs for low exposure were close to 1 across the board, while the ORs for high exposure were higher than 2 showing a 28 signi¯cant risk. For example, the OR for 1974-1999 period with high exposure was 2.29 (1.23, 4.28, overall p = 0.02). Simazine: Forbinarytypeexposure,theORfordiagnosisyearexposurewas1.90(0.98, 3.68, p = 0.06). However, the ORs for history type exposure, ranging from 1.08 to 1.19, were very close to 1, showing almost no departure from the null (with p values greater than 0.5). 
When low-high exposure summaryis used, we could see a linear trend similar to Captan. For example the OR for low life time exposure is 0.82 (0.47, 1.44) and the OR for high life time exposure was 2.03 (0.95, 4.35). Organochlorine: For binary exposure, the OR for diagnosis year exposure was 1.37 (0.56, 3.33, p = 0.49) which was not signi¯cant and was agreed by life time expo- sure (OR = 1.29 (0.77, 2.16, p = 0.32)). However the 1974-1999, 10 years before diagnosis, and 15 years before diagnosis all showed signi¯cantly higher risk(OR = 1.85 (1.09, 3.11), OR = 1.70 (0.99, 2.93), OR = 1.77 (1.01, 3.08), respectively). This¯nding suggests howrecentyearsexposuregreatly in°uence theresults. When using low-high exposure summary, the history related exposures showed a trend or increasedriskasexposureincreases, exceptfor1974-1999periodexposureforwhich the low OR and high OR are fairly close (1.71 vs 1.97). 2.3 Other examples of missing data in time-dependent co- variate Historical ultraviolet exposure and melanoma study 29 Exposure to ultraviolet (UV) radiation has long been considered as a signi¯cant risk factor for the development of melanoma. However much like the pesticide exposure as- sessment, the reliability of the measurement of historical UV exposure has always been problematic. Traditionalmethodsusuallyinvolveself-reportedtimespentoutdoor, which is subject to recall bias. Others attempted to address the historical UV exposure by eval- uating the average annual hours of sunshine received at each location where the study participants had lived. The problem of this metric is that the relationship between in- coming solar radiation and sunshine hours are inconsistent. Tatalovich, et al. (2006) proposed a GIS-based UV exposure evaluation method. The method combined the typical GIS software package and the ANUSPLIN routine which runs outside the GIS. ANUSPLIN generates many diagnostics in addition to the sur- face coe±cients that summarize the relationship between mean global solar radiation, latitude/longitude and elevation, then estimate solar radiation values at the Digital Ele- vation Model node. This method was demonstrated in (Tatalovich, et al., 2006:2) to have better variance error and more accuracy than the interpolation procedures based on the Thiessen-polygon and Kriging methods. 820 melanoma cases among white non-Latino residents under 65 years old were re- cruited from the cancer registry and 877 controls were selected through neighborhood walking to be individually matched to cases on race, age and gender. The estimates of UV exposure for continental US used county level ANUSPLN model to predict 30-year average daily total global solar radiation as the proxy. The estimates of UV exposure for foreign addresses were obtained from a radiation data set maintained by NASA. One missing data problem in this study was caused by the missing gap in the 30-year residential history. When the missing occurred, the AVY method described in section 2:2 30 wasusedtoimputethemissingvalues. Thistypeofmissingprobleminthisstudywasnot severe, only 6% of the participants had more than two missing residential records. How- ever further investigation on the missing residence records might be worthwhile. 
Another \missing" type in the residential records was not in fact missing but occurred when the participantslivedoutsideofUS.WhentheUSbaseaddresswasmissing,theauthorsused the NASA method to estimate the UV exposure which is considered to be less accurate than the ANUSPLIN method for residences within the US. The authors then compared theORsbetweentheUVestimatesbasedonUSaddressonlyandthosebasedonUSplus overseas address. When the UV exposure level were divided into approximate quartiles, the ORs for US address only were 1, 2.16, 2.59, 4.34 from low to high; while the ORs for US + overseas address were 1, 1.62, 2.64, 6.01. While both signi¯cant, the trend slope forUS+overseasaddressexposurewasclearlysteeperthanforUSaddressonlyexposure. Residential radon exposure and lung cancer study Alavanja,etal. (1999)conductedapopulationbasedcase-controlstudyforassociation between residential radon exposure and lung cancer risk among women living in the state of Missouri, US. Among the 783 female lung cancer cases reported to Missouri Cancer Registry in a 13 months interval starting from January, 1, 1993, 679 were interviewed and had radon measurement made at their homes. Controls were ascertained through a randomized recruiting procedure separated by race and frequency matched to cases by 5-year age strata from those who lived in Missouri. Among 4592 potential controls, 730 were targeted for the interview and radon exposure measurement. The radon exposure was measured for 25 years before diagnosis or interview and the exposure between 5-25 years were considered biologically meaningful for the study. Two 31 radondosimetrymethodswereused. Onewasyearlongindoorairradonmeasurementsin the residence measured by two devices placed in the kitchen and the bedroom. The other wasanewlyappliedtechniquewhichattachedsurfacemonitors(CR-39particledetectors) onto glass objects to directly measure the previous 25 years cumulative radon exposure. Both measured results were converted to annual time-weighted-average exposure by the number of years the residence was occupied by the study subjects. The pattern of missing values in this study is similar to that of the pesticide-prostate cancerstudyinthatthemissingwasmainlycausedbyunknownorinaccessibleresidential history. The authors used single control mean imputation method to address the missing data problem, that is, the missing value was imputed by the mean of the values mea- suredincontrolresidencesforbothmeasuringtechniques. Toavoidexcessiveimputation, subjects who had over 30% of missing values in the exposure history were excluded from the study. In the end, 512 cases and 553 controls were included in the ¯nal analysis. Using multivariate logistic regression, the authors found a signi¯cant trend of association between the surface measurements and lung cancer risk, with the ORs 1.11, 1.32, 3.33 for the exposure strata with increasing radon exposure compared to the lowest exposure baseline stratum. Residential magnetic ¯eld exposure and breast cancer study London, et al. (2003) studied the association between residential magnetic ¯led expo- sureandtherisk ofbreastcancer. Thestudywasdesignedasanestedcase-controlstudy. The multi-ethnic cohort comprised of 52,112 female residents of Los Angeles County en- rolled by the use of selective sampling from licensed drivers. A questionnaire was sent 32 to each of the subject in the cohort to collect residential history and other personal in- formation. 
A total of 751 breast cancer cases were then ascertained from the cohort by examining the associated Surveillance, Epidemiology and End Results (SEER) cancer registry in Los Angeles County and the state-wide cancer registry. To be considered eligible, the cases must have been previously breast cancer free and been diagnosed between June 1993 and January. 702 controls were then randomly sampled from the cohort and frequency matched to cases on self-reported ethnicity. The subjects in the sampled case-control set were interviewed in person, and their magnetic field exposure was then determined.

The magnetic field exposure assessment was done indirectly by means of a surrogate measure known as wiring configuration coding, which was shown in other studies to be a useful surrogate for magnetic fields in North America. Field technicians were sent to perform wire coding on all dwellings where the subjects had lived in Los Angeles County during the 10 years before the reference date (diagnosis date for cases and interview date for controls). The coding results were classified into five categories, from very low to very high. In the end, 872 case addresses and 803 control addresses, for 743 cases and 699 controls, were wire coded.

The analysis used unconditional logistic regression. A variable was created as the number of years lived under each of the five levels of wire coding during the 10 years, and the odds ratios were associated with those five variables, with the lowest level as the baseline. Similar to the previous study examples, the missing value problem was caused by missing addresses or by houses not amenable to wire coding. Among the 743 cases and 699 controls that were wire coded, only 572 cases and 468 controls were successfully wire coded for the entire 10 years.

To address the missing data problem without losing too much information, the authors treated the missing value as a stand-alone category in addition to the established five wire-coding levels. Hence all subjects could be used in the analysis, but now the ORs were adjusted for the number of missing years. The study did not find a significant association between magnetic field exposure and breast cancer risk, with an OR of 0.76 (95% CI = 0.49-1.18) for the very high category relative to the very low. The missing data method used in this study was the missing indicator method, which can be developed theoretically as described in Chapter 5.

Chapter 3

Literature Review

In the previous chapter, we discussed holes and gaps in the exposure history. In most studies that suffer from this problem, researchers chose naive methods such as single imputation to avoid serious loss of power. The existing literature dealing with this type of missing data is sparse, quite disproportionate to the extensive literature on single-point missing data. One reason for the scarcity is of course the complexity brought on by multiple missing data points per observation; another is that single-point missingness is the building block for missing gaps, so researchers tend to focus on the fundamental problem rather than on a specific type of missing data. However, since we usually consider a summary of the exposure history, which in most situations can be regarded as a single data point, a method designed specifically for this problem may prove beneficial.
In this chapter we first review Weinberg's simulation study, which forms the foundation for the simulation in Chapter 4, along with Moulton's work. We then proceed to discuss Cox regression and some missing data methods designed for that model. The reason to introduce these seemingly unrelated theories is to pave the road leading to the induced intensity method, which serves as the core theory for the development in Chapter 5.

3.1 Weinberg's study on exposure history with missing gaps

The general missing data methods discussed in Chapter 1 are not designed specifically for the missing history introduced in Chapter 2; although some methods such as single imputation may be applied to each missing point separately, the performance and consequences for the data analysis remain largely unknown. Weinberg, et al. (1996) studied the missing data issue arising in case-control studies of the association between residential radon exposure and lung cancer, in which the data structure is similar to that of the pesticide-prostate cancer pilot study introduced in Chapter 2. In those studies the cumulative residential exposure requires information on residential history; however, some former residences no longer exist or cannot be located, and some study participants fail to comply with the study guidance. These problems cause missing gaps in the reconstruction of the residential exposure history, which affected over half of the study participants.

The authors carried out a Monte-Carlo simulation to compare several single imputation methods for dealing with missing gaps in the exposure history. The radon exposures over the years were generated from a log-normal distribution. The underlying model in the simulation was an excess relative risk (ERR) model, i.e.,

    \Pr(D \mid Z) = \lambda_0 (1 + \beta_0 Z),

for some $\beta_0$, where $D$ denotes the onset of lung cancer and $Z$ the cumulative residential radon exposure. The single imputation methods used in Weinberg's study will also be used in the simulation study in Chapter 4 of this paper; we summarize them here and provide more details in Chapter 4. The time-weighted average (TWA) exposure is computed from the residences that were measured for the individual, with the average weighted by the time the individual occupied each residence; the AVY method used in the prostate cancer pilot study is the same except that the average is unweighted. Zero imputation (ZI) simply replaces the missing exposures with zeros. Then there are three types of mean imputation methods. Observed mean imputation (OMI) imputes the missing value with the overall average over both cases and controls that are not missing for that year; observed control mean imputation (OCMI) imputes the missing value with the average calculated from controls only; status-specific mean imputation (SSMI) imputes missing values in controls with the average over non-missing controls and missing values in cases with the average over non-missing cases.

From 1000 iterations of a simulated case-control study comprising 1500 cases and 1500 controls over 30 years with 50% missingness, the authors observed that the OCMI and OMI methods performed exceptionally well compared to the other methods when the true $\beta_0$ was set positive, while the TWA and ZI methods significantly attenuated the estimated $\beta_0$ and the SSMI method biased the estimate away from the null. The authors attributed the better performance of the OCMI method to the fact that, under the rare disease assumption, imputation based on the control mean approximately yields a Berkson error model.

The argument that OCMI yields Berkson-type error goes as follows.
Suppressing the index for the individual subject, let $Z_1, \ldots, Z_K$ denote the exposure history for times $1, \ldots, K$; the true cumulative exposure can be written as

    Z = \sum_{j=1}^{K} Z_j .

If for some time points the exposure is not observable, with observability governed by missing indicators $M_1, \ldots, M_K$, the time-specific population means $W_1, \ldots, W_K$ are used instead. The imputed cumulative exposure can be expressed as

    W = \sum_{j=1}^{K} Z_j (1 - M_j) + \sum_{j=1}^{K} W_j M_j = \sum_{j=1}^{K} Z_j + \sum_{j=1}^{K} (W_j - Z_j) M_j .

Thus,

    Z = W + \sum_{j=1}^{K} (Z_j - W_j) M_j .

Then we can calculate the conditional expectation as follows:

    E[Z \mid W] = W + E\Big[\sum_{j=1}^{K} (Z_j - W_j) M_j \,\Big|\, W\Big]
                = W + \sum_{j=1}^{K} E\big[E\{(Z_j - W_j) M_j \mid W, M_j = 1\} \,\big|\, W\big].

It is evident that when $Z_j$ is missing, $Z_j - W_j$ is independent of $W$; therefore the inside expectation simply reduces to $E(Z_j - W_j) = 0$, since $E(Z_j) = W_j$. It then follows that $E[Z \mid W] = W$.

When the disease is rare, the control mean is approximately the same as the population mean; and under the ERR model, we will show in section 3.4 that the imputed model, in which the unobservable exposure is replaced by the population mean, yields a consistent estimate. It is worth noting that this consistency for the true $\beta_0$ under the imputed model does not depend on the distribution of the $Z_j$'s. The consistency generally does not hold under the exponential relative risk model, except when the exposure $Z_j$ has a normal distribution, as discussed in section 3.4.

In the same paper, Weinberg went further to compare methods based on stratified imputation, where stratification variables such as age and occupation were believed to account for a significant part of the variation in the unobserved exposures. The simulation results were similar to those for the non-stratified methods: the stratified OCMI method performed quite well in terms of unbiasedness and efficiency.

3.2 Separate time interval method

The difficulties caused by missing data in case-control analyses with time-dependent exposure are two-fold. One is the potential for excessive missing data in the exposure history: a longer exposure history generally means a higher chance of missing data. The other is summarizing the exposure over the time line, especially when the interest is in the variation of effects across different time intervals. Lawrence and Monique (1991) proposed a semi-parametric approach as a work-around, estimating a coefficient for each time interval, then combining all the estimates into a single summary coefficient and calculating the variance nonparametrically.

The method was illustrated with a French case-control study of the association between reproductive history and cervical cancer. The entire reproductive history, from the beginning of the record up to the time of reference (diagnosis time for cases and interview time for controls), is divided into consecutive time intervals, say $t = 1, \ldots, T$; the intervals need not be of equal length. For each time interval $t$, a logistic coefficient $p$-vector $\beta_t$ is assigned. That is,

    \mathrm{logit}\{\Pr(D_i = 1 \mid Z_{it})\} \equiv \mathrm{logit}(p_{it}) = Z_{it}^T \beta_t ,

where $Z_{it}$ is the $p$-vector of covariates for subject $i$ in time interval $t$. The models are then fitted separately within each time interval, yielding $T$ coefficient vector estimates $\hat\beta_t$, $t = 1, \ldots, T$. The quantity of interest may then be summarized as, for example, $\sum_t \beta_t / T$, $\max_t(\beta_t)$, or another polynomial form representing the time effect. Note that these $T$ estimates will in general be correlated due to within-subject correlation.
A consistent covariance matrix can be given empirically as follows:

    \mathrm{Cov}(\hat\beta_r, \hat\beta_s) = (Z_r^T V_r Z_r)^{-1} Z_r^T V_r^{1/2}\,\mathrm{diag}(r_{ir} r_{is})\,V_s^{1/2} Z_s (Z_s^T V_s Z_s)^{-1},

where $V_r = \mathrm{diag}\{\hat p_{ir}(1 - \hat p_{ir})\}$ and $r_{ir}$ is the Pearson residual $(Y_i - \hat p_{ir}) / \{\hat p_{ir}(1 - \hat p_{ir})\}^{1/2}$.

This method addresses the time-varying effect, and because the model fitting is done within a relatively small time interval, the chance of missingness is much lower. Hence, as long as the missing mechanism is MAR, the complete-case method can be used in each sub-regression, whereas the complete-case method may not be valid when all the data are used at once, because of potentially excessive missingness.

3.3 Cox regression and nested case-control sampling

In the remaining sections of this chapter, we discuss missing data methods for the Cox regression model; it does not hurt to go over the fundamentals of Cox regression and the partial likelihood first. The definitions and notation introduced in this section will be referred to frequently in later sections.

3.3.1 The partial likelihood

Let $T$ be the time between the start of observation and the time of failure. Assume that $T$ is an arbitrary continuous nonnegative random variable with distribution function $F_T(t)$ and density function $f_T(t) = dF_T(t)/dt$. We call $S(t) = P(T > t) = 1 - F_T(t)$ the survival function of $T$. In time-to-event data analysis, the risk of failure, or more generally the occurrence of a certain status, is often represented by the hazard function (the intensity function in the context of counting process theory), defined as

    \lambda(t) = \lim_{\Delta t \to 0+} \frac{1}{\Delta t} \Pr\{t \le T < t + \Delta t \mid T \ge t\}
               = -\Big[\frac{d}{dt} S(t)\Big] \Big/ S(t) = f_T(t) / S(t),

where $f_T(t)$ is the density of $T$. Evidently we have the relation

    S(t) = \exp\{-\Lambda(t)\} = \exp\Big\{-\int_0^t \lambda(u)\,du\Big\},    (3.1)

where $\Lambda(t) = \int_0^t \lambda(u)\,du$ is called the cumulative hazard function.

The key assumption of the Cox model is the following proportionality property:

    \lambda\{t, Z(t)\} = \lambda_0(t)\, e^{Z^T(t)\beta},    (3.2)

where $\lambda_0(t)$ is an unspecified baseline hazard, and $Z(t)$ and $\beta$ are $p$-vector covariates and regression coefficients, respectively. $e^{Z^T(t)\beta}$ is usually referred to as the relative risk. Note that $e^{Z^T(t)\beta}$ can be replaced by a general function $r\{t, Z(t)\}$; for example, the model is called excess relative risk if $r\{t, Z(t)\} = 1 + Z^T(t)\beta$. We assume $Z(t)$ is time independent for the time being. Standard likelihood methods cannot be used to estimate $\beta$ since $\lambda_0$ is unspecified. However, by maximizing a "partial likelihood" function with respect to $\beta$, which is independent of $\lambda_0$ under the proportionality property, the components of $\beta$ can be estimated. We summarize the argument as follows (see Fleming and Harrington 1991, Chapter 4).

The partial likelihood function in the proportional hazards model is based on a conditional probability argument. Suppose $(A_1, B_1), (A_2, B_2), \ldots, (A_K, B_K)$ is a collection of pairs of events. Then the likelihood of all $2K$ events is

    \Pr\{A_K B_K A_{K-1} B_{K-1} \cdots A_1 B_1\}
      = \Big[\prod_{k=2}^{K} \Pr\{A_k B_k \mid A_{k-1} B_{k-1} \cdots A_1 B_1\}\Big] \Pr\{A_1 B_1\}
      = \Big[\prod_{k=2}^{K} \Pr\{A_k \mid B_k A_{k-1} B_{k-1} \cdots A_1 B_1\}\Big] \Pr\{A_1 \mid B_1\}
        \times \Big[\prod_{k=2}^{K} \Pr\{B_k \mid A_{k-1} B_{k-1} \cdots A_1 B_1\}\Big] \Pr\{B_1\}.

When all four terms in the above product depend on a statistical parameter but the last two are neglected, the first two terms form a partial likelihood for that parameter.

First let us consider censored data without tied values in the set of observation times. Set $T'_0 = 0$ and $T'_{L+1} = \infty$.
Let $(k)$ provide the case label for the patient failing at $T'_k$ (thus $T_{(k)} = T'_k$), so the covariates associated with the $L$ failures are $Z_{(1)}, \ldots, Z_{(L)}$. Let $B_k$ be the event recording the observed times of censoring in the interval $[T'_{k-1}, T'_k)$, the case labels associated with the censored times, and the fact that a failure has been observed at $T'_k$. If $A_k$ is the event specifying the label $(k)$ of the case failing at $T'_k$, then the observed data are equivalent to the event $B_1 A_1 \cdots B_L A_L B_{L+1}$, and the likelihood of the data will be

    \Pr\{B_1 A_1 \cdots B_L A_L B_{L+1}\}.

The key assumption in the partial likelihood method for the Cox model is that the events $B_k$ contain little information about the regression parameter $\beta$, so the latter two parts of the likelihood of the $2K$ events can be omitted. Based on this assumption, a reasonable partial likelihood for $\beta$ is therefore

    \Big[\prod_{k=2}^{K} \Pr\{A_k \mid B_k A_{k-1} B_{k-1} \cdots A_1 B_1\}\Big] \Pr\{A_1 \mid B_1\}.

For a fixed $k$, since there are no ties in the observation times, it can be established that

    \Pr\{A_k \mid B_k A_{k-1} B_{k-1} \cdots A_1 B_1\} = \frac{\lambda(T'_k \mid Z_{(k)})}{\sum_{j \in R_k} \lambda(T'_k \mid Z_j)} = \frac{e^{Z_{(k)}^T \beta}}{\sum_{j \in R_k} e^{Z_j^T \beta}},

where $R_k$ is the set of cases at risk at time $T'_k$, that is, $R_k = \{j : X_j \ge T'_k\}$. Thus the partial likelihood for $\beta$ is

    L_1(\beta) = \prod_{k=1}^{L} \frac{e^{Z_{(k)}^T \beta}}{\sum_{j \in R_k} e^{Z_j^T \beta}}.    (3.3)

Consequently the score function $U_1(\beta) = \frac{\partial}{\partial\beta} \log L_1(\beta)$ can be written as

    U_1(\beta) = \sum_{k=1}^{L} \Big\{ Z_{(k)} - \frac{\sum_{j \in R_k} Z_j e^{Z_j^T \beta}}{\sum_{j \in R_k} e^{Z_j^T \beta}} \Big\}.    (3.4)

The maximum partial likelihood estimate of $\beta$ can be obtained by solving the equation $U_1(\beta) = 0$.

When the data contain tied failure times, the argument deriving the partial likelihood can still be applied, although it is more complicated. The partial likelihood (3.3) changes to

    \prod_{k=1}^{L} \frac{e^{H_{(k)}^T \beta}}{\big\{\sum_{j \in R_k} e^{Z_j^T \beta}\big\}^{D_k}},

where $D_k$ is the number of observed failures at $T'_k$ and $H_{(k)}$ is the sum of the covariates of the subjects failing at $T'_k$.

For time-dependent covariates, the partial likelihood can be written as

    L_2(\beta) = \prod_{i=1}^{n} \Big\{ \frac{e^{Z_i^T(X_i)\beta}}{\sum_{j \in R_i} e^{Z_j^T(X_i)\beta}} \Big\}^{\Delta_i},    (3.5)

where $R_i = \{j : X_j \ge X_i\}$ is the risk set and $\Delta_i$ is the censoring indicator. Letting $\hat\beta$ be the value maximizing (3.5), Breslow (1972) proposed an estimate of the underlying cumulative hazard $\Lambda_0(t) = \int_0^t \lambda_0(s)\,ds$:

    \hat\Lambda(t) = \sum_{X_i \le t} \frac{\Delta_i}{\sum_{j \in R_i} e^{Z_j^T(X_i)\hat\beta}}.    (3.6)

3.3.2 Counting process approach

The application of counting processes to Cox regression first appeared in Aalen (1978). For the censored failure time variable $T$, let the counting process $N(t) = I(X \le t, \Delta = 1)$ count the failure of the observed subject, and let $Y(t) = I(X \ge t)$ indicate whether the subject is under observation. The fundamental fact is that

    M(t) = N(t) - \int_0^t Y(s)\lambda(s)\,ds = N(t) - A(t)

is a martingale adapted to the filtration $\{F_t\}_{t \ge 0}$, where $F_t = \sigma\{N(s), I(X \le s, \delta = 0) : 0 \le s \le t\}$. We call the process $A(t)$ the compensator of $N(t)$. That $M(t)$ is a martingale follows from the assumption

    \lambda(t) = -\frac{\partial}{\partial u} \log\big[P\{T \ge u, C \ge t\}\big] \Big|_{u=t} \quad \text{whenever } P\{X > t\} > 0,

which holds in most cases (Fleming and Harrington, 1991, p. 26).

It is natural to state the Cox proportional hazards model in the counting process context. The basic assumption is that

    M(t) = N(t) - \int_0^t Y(s)\, e^{Z^T(s)\beta_0} \lambda_0(s)\,ds

is a martingale adapted to a suitable filtration.
Combining the partial likelihood (3.5) with this notation, the score function for Cox's regression model can be written as

    U(\beta, t) = \sum_{i=1}^{n} \int_0^t Z_i(s)\,dN_i(s) - \int_0^t \frac{\sum_{i=1}^{n} Y_i(s) Z_i(s)\, e^{Z_i(s)^T\beta}}{\sum_{i=1}^{n} Y_i(s)\, e^{Z_i(s)^T\beta}}\,d\bar N(s),    (3.7)

where $N_i$ is the counting process for the $i$-th subject and $\bar N(t) = \sum_{i=1}^{n} N_i(t)$ is still a counting process if no tied failures are allowed. The estimator $\hat\beta$ can be obtained by solving $U(\beta, \infty) = 0$. The corresponding Breslow estimator (3.6) then becomes

    \hat\Lambda(t) = \int_0^t \frac{d\bar N(s)}{\sum_{i=1}^{n} Y_i(s)\, e^{Z_i^T(s)\hat\beta}}.    (3.8)

In their famous 1982 paper, Andersen and Gill proved rigorously that under a set of mild assumptions, $\hat\beta$ is asymptotically normal and $\hat\Lambda(t)$ tends to a Gaussian process in the space $D[0, \infty)$.

3.3.3 Nested case-control sampling

The Cox regression of the previous section is designed with the cohort design in mind, where subjects in the cohort are monitored over time until failures occur. However, if the failure (or disease) rate in the population is low, the study may require a large cohort, and it may be cost inefficient if, by the end of the study, exposure must be evaluated for all the surviving subjects, each of whom contributes little to the power of the analysis. First appearing in the literature in Thomas (1977), the nested case-control (NCC) sampling design has since gained significant popularity due to ever increasing study costs. An NCC study can be seen as a case-control study sampled, based on disease status, from within a cohort study, hence the word "nested". There are numerous sampling strategies available for NCC studies, but the essential idea is the same: keep all the cases in the cohort and sample a suitable number of controls from the non-cases.

Langholz (2005) gave a comprehensive review of NCC sampling theory under proportional hazards models. NCC sampling is very flexible and can easily be adapted to both continuous-time and grouped-time data. With continuous time, we can assume there are no tied failure times. Let $T_1, \ldots, T_L$ be the observed failure times and $r_j(\beta)$ the relative risks evaluated at the respective failure times; the partial likelihood for NCC sampled data takes the form

    \prod_{i=1}^{L} \frac{r_i(\beta)\,\pi_i}{\sum_{j \in \tilde R_i} r_j(\beta)\,\pi_j},    (3.9)

where $\tilde R_i$ is the sampled case-control risk set at failure time $T_i$, $i = 1, \ldots, L$, and $\pi_j$ is the probability of sampling the specific risk set $\tilde R_i$ conditional on $j$ being the case at time $T_i$, i.e.,

    \pi_j = \Pr(\tilde R_i \text{ is sampled} \mid T_j = T_i).

$\pi_j$ is often replaced with a weight $w_j$ proportional to $\pi_j$. Comparing (3.9) to the partial likelihood for the cohort design (3.3), they are similar except that the sum of relative risks is a weighted sum under NCC sampling and $\tilde R_i$ is a subset of the cohort risk set $R_i$. For example, under individually matched case-control sampling where $m - 1$ controls are randomly sampled from the $n - 1$ controls in the cohort risk set, the sampling probability $\pi_j$ is constant across all individuals, simply $1/\binom{n-1}{m-1}$. Thus all the $\pi_j$'s can be regarded as 1, yielding an unweighted partial likelihood of exactly the same form as the likelihood for conditional logistic regression. Therefore, under individually matched sampling and an exponential relative risk, the model can be fitted using standard statistical software such as SAS.
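To make the unweighted form concrete, the following minimal sketch (Python with numpy; the toy data, function name, and grid search are our own illustration, not taken from the literature reviewed here) evaluates the log of (3.9) for individually matched sets with constant $\pi_j$ and locates its maximizer on a grid.

    import numpy as np

    def ncc_loglik(beta, sampled_sets):
        # Log of the NCC partial likelihood (3.9) with constant sampling
        # probabilities pi_j, so the weights cancel (unweighted form).
        # Each element of `sampled_sets` holds the covariate values of one
        # sampled risk set, with the failing subject's value first.
        ll = 0.0
        for z in sampled_sets:
            r = np.exp(beta * z)          # exponential relative risks r_j(beta)
            ll += np.log(r[0] / r.sum())
        return ll

    # toy data: three 1:2 matched sets (one case plus two sampled controls each)
    sets = [np.array([2.1, 0.3, 1.0]),
            np.array([0.0, 0.8, 0.2]),
            np.array([1.5, 1.1, 0.0])]
    grid = np.linspace(-2.0, 2.0, 401)
    print("grid MPLE:", grid[np.argmax([ncc_loglik(b, sets) for b in grid])])

The grid search is only for transparency; in practice the conditional logistic regression routine of any standard package maximizes the same likelihood.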
The consistency and asymptotic normality of the estimate $\hat\beta$ obtained from the NCC partial likelihood are more difficult to prove than in the cohort design. In Andersen and Gill (1982), the crucial ingredient is that the mean, or the covariate-weighted mean, of the relative risks within the risk set converges as the risk set grows. Under the NCC sampling model, however, the risk set often has a fixed number of individuals, for example only 2 under 1:1 matching, so the analogous limiting theory cannot be applied directly. Goldstein and Langholz (1992) noticed that although there is no convergence within a single risk set, convergence can be obtained at the risk-set level as the number of risk sets grows. The asymptotic results achieved in that paper are the basis for the asymptotic efficiency calculations in Chapter 5.

3.4 The induced intensity - the early developments

The concept of induced intensity was first introduced by Prentice (1982) to deal with measurement error problems in Cox regression. The idea later sparked many publications in the context of both measurement error and missing data problems. We will go over what Prentice (1982) discussed in his original article and some follow-up studies in this section, in preparation for a more comprehensive development of the induced intensity theory in Chapters 5 and 6.

Suppressing the subject index $i$, suppose the hazard function for the covariate of interest $Z(t)$ follows Cox's proportional form for some $\beta_0 \in \mathbb{R}$,

    \lambda\{t, Z(t)\} = \lambda_0(t)\, r\{t, Z(t)\} = \lambda_0(t)\exp\{Z(t)\beta_0\},    (3.10)

where for simplicity $Z(t)$ is assumed to be a scalar. Let us assume that $Z(t)$ cannot be accurately measured, so the partial likelihood (3.3) for Cox regression cannot be used directly. Suppose instead that only an observed value (e.g., the measurement) $X(t)$ is available; if we can find the intensity associated with $X(t)$, we may be able to make inference on $\beta_0$ based on the available measurement $X(t)$. An important assumption is that the intensity function is independent of $X(t)$ when $Z(t)$ is available, i.e.,

    \lambda\{t, Z(t), X(t)\} = \lambda\{t, Z(t)\}.    (3.11)

Under this assumption, Prentice (1982) proposed that the induced intensity of $X(t)$ is simply the intensity for $Z(t)$ conditioned on the information available up to time $t$, that is,

    \lambda\{t, X(t)\} = E[\lambda\{t, Z(t)\} \mid T \ge t, \{X(u), u \le t\}]
                      = \lambda_0(t)\, E[\exp\{Z(t)\beta_0\} \mid T \ge t, \{X(u), u \le t\}].    (3.12)

Note that the induced intensity (3.12) still retains the multiplicative form, with relative risk $r\{t, X(t)\} = E[\exp\{Z(t)\beta_0\} \mid T \ge t, \{X(u), u \le t\}]$. Although the induced intensity looks similar to (3.10), the two are quite different. In fact, the expectation part of the induced intensity depends on both $\beta_0$ and $\lambda_0(\cdot)$, because the fact that the individual survived beyond time $t$ is part of the conditioning. To see this, let $Z(t)$ and $X(t)$ not vary with time, and let $f(z \mid x)$ be the conditional density of $Z$ given $X$. Then the density of $z$ conditioned on both $X$ and $T \ge t$ is

    f(z \mid T \ge t, x) \propto \Pr(T \ge t \mid x, z)\, f(x, z) \propto \Pr(T \ge t \mid z)\, f(z \mid x)
                         = \exp\Big\{-\int_0^t e^{z\beta_0}\lambda_0(u)\,du\Big\}\, f(z \mid x).    (3.13)

So the induced relative risk $r(t, X) = \int e^{z\beta_0} f(z \mid T \ge t, x)\,dz$, which is a complicated function of $\beta_0$ and $\lambda_0(\cdot)$.
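The dependence on $\lambda_0(\cdot)$ can be made concrete numerically. The sketch below (Python with numpy; the parameter values and the normal choice of $f(z \mid x)$ are our assumptions for illustration) evaluates $r(t, X)$ from (3.13) on a grid of $z$ values and shows it drifting away from the no-survivorship value $\exp\{x\beta_0 + \sigma^2\beta_0^2/2\}$ as the cumulative baseline hazard $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ grows; for small $\Lambda_0(t)$, i.e., a rare disease, the dependence is negligible.

    import numpy as np

    beta0, sigma, x = 0.5, 1.0, 0.0
    zgrid = np.linspace(-6.0, 6.0, 4001)
    fzx = np.exp(-0.5 * ((zgrid - x) / sigma) ** 2)   # f(z | x): N(x, sigma^2), unnormalized

    def induced_rr(Lambda0_t):
        # f(z | T >= t, x) proportional to exp{-Lambda0(t) e^{z beta0}} f(z | x), as in (3.13);
        # the induced relative risk is the weighted mean of e^{z beta0}
        w = fzx * np.exp(-Lambda0_t * np.exp(beta0 * zgrid))
        return float((np.exp(beta0 * zgrid) * w).sum() / w.sum())

    print("no-survivorship value:", np.exp(x * beta0 + 0.5 * sigma**2 * beta0**2))
    for L in (0.0, 0.01, 0.1, 1.0):                   # cumulative baseline hazard Lambda0(t)
        print("Lambda0(t) =", L, "-> r(t, x) =", round(induced_rr(L), 4))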
The derivation above shows that the specific form of the induced risk is determined by the distributions of $(Z(t), X(t))$ and by the error structure that relates the observed value $X(t)$ to the true covariate $Z(t)$. Pepe, et al. (1989) provided an example illustrating how a closed form of the induced relative risk can be obtained under the exponential relative risk (3.10). Assume that the distribution of the true covariate conditional on the observed value follows the regression model

    Z = E[Z \mid X] + \varepsilon,    (3.14)

where $\varepsilon$ is independent of $X$ and has mean 0. Clearly, when $E[Z \mid X] = X$, (3.14) becomes the so-called Berkson error model. The error term $\varepsilon$ is structured such that $e^{\varepsilon\beta}$ belongs to the three-parameter family of distributions described in Hougaard (1986), which includes the gamma, inverse Gaussian and stable distributions. Under these assumptions the induced relative risk is

    K \cdot \frac{\exp\{E[Z \mid X]\beta_0\}}{\big[\theta_1 + \lambda_0(t)\exp\{E[Z \mid X]\beta_0\}\big]^{\theta_2}},    (3.15)

where $K$ is a multiplicative constant and $\theta_1, \theta_2$ are the parameters of the error distribution. If the relative risk $r\{t, Z(t)\}$ assumes the additive form, also known as excess relative risk, and the error term $\varepsilon$ is normal $N(0, \sigma_0^2)$, the induced relative risk takes the more concise form

    r\{t, X(t)\} = 1 + E[Z \mid X]\beta_0 - \sigma_0^2\,\lambda_0(t)\,\beta_0^2.    (3.16)

When it comes to estimating $\beta_0$ in (3.10) using the induced intensity (3.12), the first thought that comes to mind might be to substitute $r\{t, X(t)\}$ for $r\{t, Z(t)\}$ in the partial likelihood (3.3), if a specific form can be obtained. The fundamental idea of the partial likelihood is to estimate $\beta_0$ without interference from the baseline hazard $\lambda_0(\cdot)$ by canceling $\lambda_0(\cdot)$ out of the proportional form; however, the dependence of the induced relative risk on $\lambda_0(\cdot)$ in general voids this advantage, making direct substitution in the partial likelihood unlikely to work. The work-arounds can be summarized into three categories:

1. Baseline hazard manipulation. Parameterize the baseline hazard $\lambda_0(\cdot)$, then estimate the parameters using the full likelihood or the partial likelihood with the induced intensity; or estimate $\lambda_0(\cdot)$ non-parametrically, e.g., with a Kaplan-Meier-type estimator, and iterate with the partial likelihood to estimate $\beta_0$.

2. Partial likelihood manipulation. Use a non-parametric approach to estimate the partial likelihood contribution in terms of the induced intensity, then estimate $\beta_0$ by maximizing the estimated partial likelihood; or fundamentally modify the partial likelihood structure to incorporate the induced intensity.

3. Induced intensity manipulation. Ignore the dependence on the baseline hazard $\lambda_0(\cdot)$ when calculating the induced intensity, or directly specify the induced intensity so as to obtain an induced relative risk $r\{t, X(t)\}$ that is independent of $\lambda_0(\cdot)$; then follow the usual partial likelihood procedure to estimate $\beta_0$.

Baseline hazard manipulation methods are not preferred, since they reintroduce exactly what the partial likelihood method is trying to avoid. We will review some advanced methods focusing on partial likelihood manipulation; while those are good for certain scenarios, none of them is designed specifically for the missing data problem presented in Chapter 2, where the missingness occurs in the exposure history. Induced intensity manipulation, on the other hand, can be generalized to accommodate many missing data schemes, including missing data within the exposure history under a case-control study design, as will be seen in the remaining chapters.

The major difficulty, as we have seen, is how to eliminate the dependence of the induced relative risk on the baseline hazard $\lambda_0(\cdot)$; Prentice (1982) argued that this dependence can be relaxed in most time-to-event cohort studies where the disease is rare.
Still using the preceding settings, one may notice that the induced relative risk $r\{t, X\} = \int e^{z\beta_0} f(z \mid T \ge t, x)\,dz$ is simply the moment generating function, evaluated at $\beta_0$, of the fully specified distribution of $Z$ conditional on $\{T \ge t, X\}$, which is in turn tied to the error distribution structure. Therefore, one may specify $f(z \mid T \ge t, x)$ to an extent such that the density does not depend on the baseline $\lambda_0(\cdot)$, makes the moment generating function easy to calculate, and reflects our belief about the error distribution structure. If we only want to specify up to $f(z \mid x)$, it is clear from the argument in (3.13) that $f(z \mid T \ge t, x)$ is free of $\lambda_0(\cdot)$ whenever the condition $\Pr\{T \ge t \mid Z\} = \Pr\{T \ge t \mid X\}$ holds. Prentice (1982) pointed out that this condition is met when $\beta_0 = 0$, and is approximately met when $f(z \mid x)$ is a probability mass function or when $\Pr\{T \ge t \mid Z\} \approx 1$ over the follow-up period, which is the rare disease assumption that holds for many cohort studies.

Let us assume from now on that the dependence of the induced relative risk on $\lambda_0(\cdot)$ is negligible. When the error distribution follows the Berkson model with a normal error term, that is,

    f(z \mid T \ge t, x) \sim N(x, \sigma^2),

the induced relative risk is simply the moment generating function of the normal distribution,

    r\{t, X\} = \exp\{X\beta_0 + \tfrac{1}{2}\sigma_0^2\beta_0^2\}.    (3.17)

Under the induced relative risk $r\{t, X\}$, the corresponding partial likelihood (3.3) is identical to the partial likelihood under $r\{t, Z\} = \exp\{Z\beta_0\}$ except that all the $Z_i$'s are replaced by $X_i$'s; this is because the factor $\exp\{\tfrac{1}{2}\sigma_0^2\beta_0^2\}$ cancels out of the partial likelihood. Parameter estimation then follows the same procedure as for the original partial likelihood, and the asymptotic properties of the estimator can be obtained under the set of mild conditions described in Andersen and Gill (1982).

Under the excess relative risk model $r\{t, Z\} = 1 + Z\beta_0$, the induced relative risk becomes even simpler. If the dependency on $\lambda_0(\cdot)$ can be ignored, then under the Berkson error model we may obtain the induced relative risk without specifying the distribution of the error term $\varepsilon$:

    r\{t, X\} = 1 + E[Z \mid X]\beta_0,

which is identical to the original form with $Z$ replaced by $X$. Note also that it differs by a quadratic term from (3.16), which incorporates the dependence on $\lambda_0(\cdot)$.

We would like to point out that ignoring the dependence on the baseline hazard in order to specify a convenient form of the induced intensity does raise the concern of a violation of consistency with the underlying model characterized by the original intensity, which may or may not be minor. Say the true parameter is $\beta_0$ in (3.10); then the induced relative risk (3.17) alters the "true" induced relative risk in which the dependence on $\lambda_0(\cdot)$ is accounted for. Hence the estimator $\hat\beta$ based on (3.17) may be consistent for some $\beta_0^*$ instead of the true parameter $\beta_0$.
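A quick simulation illustrates the cancellation. The following sketch (Python; numpy and scipy are assumed available, and the data-generating values are ours) draws Berkson-type data with a normal error, keeps the disease rare, and solves the cohort score (3.4) using the observed $X$ in place of the true $Z$; by (3.17) both fits should target roughly the same $\beta_0$, up to the rare disease approximation and Monte-Carlo noise.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(1)
    n, beta0, sigma0 = 50000, 0.5, 0.8
    x = rng.normal(size=n)                     # observed value X
    z = x + rng.normal(scale=sigma0, size=n)   # Berkson model: Z = X + eps, E[Z|X] = X
    t = rng.exponential(1.0 / (1e-3 * np.exp(beta0 * z)))  # constant hazard given Z
    time = np.minimum(t, 10.0)                 # administrative censoring; disease is rare
    event = (t <= 10.0).astype(float)

    def score(b, cov):
        # cohort partial likelihood score (3.4) for a scalar covariate
        o = np.argsort(time)
        d, c = event[o], cov[o]
        w = np.exp(b * c)
        s0 = np.cumsum(w[::-1])[::-1]          # risk-set sums of relative risks
        s1 = np.cumsum((w * c)[::-1])[::-1]
        return np.sum(d * (c - s1 / s0))

    print("fit on true Z:", brentq(lambda b: score(b, z), -2.0, 2.0))
    print("fit on X, per (3.17):", brentq(lambda b: score(b, x), -2.0, 2.0))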
3.5 Some advanced methods based on partial likelihood manipulation

3.5.1 The approximate partial likelihood estimator

Lin and Ying (1993) proposed a general method to estimate $\beta_0$ in the Cox model with missing covariate data. The idea is to estimate the conditional expectation $\bar Z(\beta, t)$ from the subjects who have complete observations on all covariate components at time $t$. The estimating function for $\beta_0$ is then taken as the summation, over the uncensored failure times, of the observed value of $Z_i(T_i)$ minus the estimated conditional expectation; if the $j$th component of $Z_i(T_i)$ is missing, the $j$th component of the difference is excluded from the summation. The method differs from the CCA method in that the complete cases are only used for part of the estimation while the non-missing components of a covariate are kept, which may increase efficiency. Because the estimating function is obtained approximately, the parameter estimate is called the approximate partial likelihood estimator (APLE).

Suppose the data are organized in i.i.d. vectors $\{T_i, \Delta_i, Z_i(\cdot), H_{0i}(\cdot), H_i(\cdot)\}$ $(i = 1, \ldots, n)$, where $Z_i(\cdot) = \{Z_{1i}(\cdot), \ldots, Z_{pi}(\cdot)\}'$ may be missing in some components; $H_{0i}(\cdot)$ is an indicator function, and $H_i(\cdot)$ is a $p \times p$ diagonal matrix with indicator functions $\{H_{1i}(\cdot), \ldots, H_{pi}(\cdot)\}$ as the diagonal elements, denoting the observation status of the components of $Z_i(\cdot)$. $H_{0i}(t) = I\{H_{ji}(t) = 1 \text{ for all } j = 1, \ldots, p\}$ determines whether the $i$th subject is included in the estimation of $\bar Z(\beta, t)$, and $H_{ji}(t)$ indicates whether the $i$th subject contributes directly to the $j$th component of the estimating function.

To present the estimating functions, the following notation is needed:

    S^{(r)}_{LY}(\beta, t) = n^{-1} \sum_{i=1}^{n} H_{0i}(t)\, Y_i(t)\exp\{\beta' Z_i(t)\}\, Z_i(t)^{\otimes r}, \quad r = 0, 1, 2;
    s^{(r)}_{LY}(\beta, t) = E\{S^{(r)}_{LY}\}; \quad
    E_{LY}(\beta, t) = S^{(1)}_{LY}(\beta, t) / S^{(0)}_{LY}(\beta, t); \quad
    e_{LY}(\beta, t) = s^{(1)}_{LY}(\beta, t) / s^{(0)}_{LY}(\beta, t).

Then the approximate partial-likelihood score function can be written as

    U_{LY}(\beta) = \sum_{i=1}^{n} \Delta_i H_i(T_i)\{Z_i(T_i) - E_{LY}(\beta, T_i)\}.    (3.18)

The approximate partial likelihood estimator (APLE) $\hat\beta_{LY}$ can be obtained by using the Newton-Raphson algorithm to solve the equation $U_{LY}(\beta) = 0$.

Lin and Ying's method requires the MCAR assumption: conditional on $\{T_i \ge t\}$, the missing indicators $\{H_{ji}(t), j = 0, 1, \ldots, p\}$ are independent of all other random variables. Define $h(t) = E\{H_1(t)\}$ and $h_j(t) = E\{H_{j1}(t)\}$ $(j = 0, 1, \ldots, p)$. To ensure that in large samples the conditional expectation $\bar Z(\beta, t)$ can be reliably estimated from the subjects without missing covariates at time $t$, it must be assumed that $h_0(t) \ge \alpha_0$ for some $\alpha_0 > 0$ and all $t$.

To develop the asymptotic theory for $\hat\beta_{LY}$, the authors set up the following regularity conditions:

A. All covariates have bounded total variation, that is, $\int_0^\infty |dZ_{ji}(t)| + |Z_{ji}(0)| \le K$ for some $K > 0$ and all $i, j$.

B. There exist $k_0 > 0$ and $\eta_0 > 0$ such that for $j = 0, 1, \ldots, p$ and $r = 0, 1$,

    \sup_{|d| \le n^{-k_0}} \Big[\frac{1}{n}\Big|\sum_{i=1}^{n}\{H_{ji}(t) - H_{ji}(t + d)\}\Big| + |h_j(t) - h_j(t + d)|\Big] = o_p(n^{-(1/2) - \eta_0})

and

    \sup_{|d| \le n^{-k_0}} \Big[\frac{1}{n}\Big\|\sum_{i=1}^{n}\{Z_i(t) - Z_i(t + d)\}\Big\| + \big\|s^{(r)}_{LY}(\beta_0, t) - s^{(r)}_{LY}(\beta_0, t + d)\big\|\Big] = o_p(n^{-(1/2) - \eta_0}).

C. $E(T_1^{\theta_0}) < \infty$ for some $\theta_0 > 0$.

Under these assumptions, it can be shown that the random vector $n^{-1/2} U_{LY}(\beta_0)$ is asymptotically normal with mean 0 and covariance matrix $B(\beta_0) = E\{W_1(\beta_0)^{\otimes 2}\}$, where

    W_i(\beta) = \Delta_i H_i(T_i)\{Z_i(T_i) - e_{LY}(\beta, T_i)\}
                 - \int_0^{T_i} \{h(t)/h_0(t)\}\, H_{0i}(t)\exp\{\beta' Z_i(t)\}\{Z_i(t) - e_{LY}(\beta, t)\}\lambda_0(t)\,dt.

The covariance matrix $B(\beta_0)$ can be consistently estimated by $B_n(\beta_0) = n^{-1}\sum_{i=1}^{n} \hat W_i(\beta_0)^{\otimes 2}$, where

    \hat W_i(\beta) = \Delta_i H_i(T_i)\{Z_i(T_i) - E_{LY}(\beta, T_i)\}
                      - n^{-1}\sum_{l=1}^{n} \Delta_l Y_i(T_l) H_{0i}(T_l) H_l(T_l)\exp\{\beta' Z_i(T_l)\}\{Z_i(T_l) - E_{LY}(\beta, T_l)\} / S^{(0)}_{LY}(\beta, T_l).

Next let $A_n(\beta) = -n^{-1}\,\partial U_{LY}(\beta)/\partial\beta$, $A(\beta) = \lim_{n\to\infty} A_n(\beta)$, and $u_{LY}(\beta) = \lim_{n\to\infty} n^{-1} U_{LY}(\beta)$. Suppose that $A(\beta_0)$ is nonsingular; then there exists a solution $\hat\beta_{LY}$ to $U_{LY}(\beta) = 0$ such that the random vector $n^{1/2}(\hat\beta_{LY} - \beta_0)$ is asymptotically normal with mean 0 and covariance matrix $A^{-1}(\beta_0) B(\beta_0) A^{-1}(\beta_0)'$.
A consistent estimator of the limiting covariance matrix is $A_n^{-1}(\hat\beta_{LY}) B_n(\hat\beta_{LY}) A_n^{-1}(\hat\beta_{LY})'$.

In a Monte-Carlo simulation study, the authors compared the APLE against the full-data estimator and the CCA estimator; an improvement in efficiency over CCA was observed. The drawback of the APLE is its dependence on the MCAR assumption, which is often regarded as overly strong.

3.5.2 Method using auxiliary covariate data

A missing data problem is in essence a measurement error problem taken to the extreme, with the least possible information: when a spot in the data is missing, it is "measured" as "unknown" instead of as an actual value. For this very reason, we also include in our review methods that deal with measurement error. Zhou and Pepe (1995) proposed a nonparametric method for mis-measured covariates that utilizes auxiliary data, based on the induced intensity introduced by Prentice (1982).

The validation sample $V$ is a simple random sub-sample of the cohort in which both the true covariate $X$ and the auxiliary covariate $Z$ are observed; in the missing data context the auxiliary can be the missing indicator. In its complement $\bar V$, only the auxiliary is available. Suppose the relative risk in terms of $X_i(\cdot) = \{X_{i1}(\cdot), \ldots, X_{ip}(\cdot)\}$ has the proportional form

    \lambda_i(t) = Y_i(t)\lambda_0(t)\, r\{\beta, X_i(t)\} = Y_i(t)\lambda_0(t)\, r_i(t).

When only the auxiliary variable $Z_i$ is available, the intensity induced by conditioning on $Z_i$ instead of $X_i$ is

    \bar\lambda_i(t) = Y_i(t)\lambda_0(t)\, E[r\{\beta, X_i(t)\} \mid Z_i(t)].

The idea is to use $\lambda_i(t)$ in the validation set, where both $X$ and $Z$ are known, and $\bar\lambda_i(t)$ in the complement set, where only $Z$ is known. The corresponding relative risk can be written as

    R_i(t) = r_i(t)\, I(i \in V) + e_i(t)\, I(i \in \bar V),

where $e_i(t) = E[r\{\beta, X_i(t)\} \mid Z_i(t)]$ is the induced relative risk. The induced partial likelihood on the whole sample then uses the induced relative risk in place of the true relative risk of the usual partial likelihood, namely

    L_{ZP}(\beta) = \prod_{i=1}^{n} \Big\{\frac{R_i(T_i)}{\sum_{j=1}^{n} Y_j(T_i)\, R_j(T_i)}\Big\}^{\Delta_i}.    (3.19)

Now the problem is how to realize $R_i(t)$ in (3.19). As discussed in section 3.4, the induced intensity generally depends on the baseline hazard $\lambda_0(t)$ and does not yield a simple closed form. However, under the rare disease assumption, the authors argued that the induced part $e_i(t)$, an expectation conditional on $Z_i$, can be approximated empirically, and reasonably well, using the data in the validation set when the auxiliary variable $Z_i$ is categorical. The authors proposed the following estimator for $e_i(t)$:

    \hat e_i(t) = \frac{\sum_{j \in V} Y_j(t)\, I\{Z_j(t) = Z_i(t)\}\, r\{\beta, X_j(t)\}}{\sum_{j \in V} Y_j(t)\, I\{Z_j(t) = Z_i(t)\}},    (3.20)

which is simply the average of the relative risks in the validation set over subjects with the same auxiliary value $Z_i$. Replacing $R_i(t)$ with its estimator $\hat R_i(t) = r_i(t)\, I(i \in V) + \hat e_i(t)\, I(i \in \bar V)$, the estimated likelihood is

    \hat L_{ZP}(\beta) = \prod_{i=1}^{n} \Big\{\frac{\hat R_i(T_i)}{\sum_{j=1}^{n} Y_j(T_i)\, \hat R_j(T_i)}\Big\}^{\Delta_i}.    (3.21)

The estimated score function follows as

    \hat U_{ZP}(\beta) = \sum_{i=1}^{n} \Delta_i \Big\{\frac{\hat R^{(1)}_i(T_i)}{\hat R_i(T_i)} - \frac{\sum_{j=1}^{n} Y_j(T_i)\, \hat R^{(1)}_j(T_i)}{\sum_{j=1}^{n} Y_j(T_i)\, \hat R_j(T_i)}\Big\},    (3.22)

where the superscript $(1)$ denotes the first derivative with respect to $\beta$. The estimator $\hat\beta_{ZP}$ can be obtained by solving $\hat U_{ZP}(\beta) = 0$ using Newton-Raphson iteration.

The asymptotic distribution theory of the method can be derived in the case where the auxiliary or the covariates are categorical. In a simulation study, the authors demonstrated significant gains compared with the CCA method. The limitation of the method is that the auxiliary validation data must be discrete, and high-dimensional covariate data may make the method very unstable.
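The estimator (3.20) is straightforward to compute when the auxiliary is categorical. The sketch below (Python with numpy; the variable names and the coarsened auxiliary are our own illustrative assumptions, not Zhou and Pepe's data) averages the relative risks of at-risk validation subjects sharing a subject's auxiliary level.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    x_true = rng.normal(size=n)                  # true covariate X
    z_aux = (x_true > 0).astype(int)             # categorical auxiliary Z (a coarsened X)
    time = rng.exponential(1.0, size=n)          # illustrative follow-up times
    in_valid = rng.random(n) < 0.3               # validation subsample V: X observed

    def e_hat(beta, t, level):
        # estimator (3.20): mean relative risk r(beta, X_j) over validation
        # subjects still at risk at time t whose auxiliary value equals `level`
        m = in_valid & (time >= t) & (z_aux == level)
        return float(np.exp(beta * x_true[m]).mean())

    # induced relative risk for one non-validation subject at its own failure time
    i = int(np.flatnonzero(~in_valid)[0])
    print(e_hat(beta=0.5, t=time[i], level=z_aux[i]))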
3.5.3 Regression Calibration

Wang et al. (1997) proposed a simple and intuitive procedure called "regression calibration" for the Cox model with missing covariates. The idea is to impute the missing covariate with a value estimated from the validation set, after which routine Cox regression analysis may be used on the data with "complete" covariates. The motivation for the method arose from Zhou and Pepe (1995), introduced in the previous section, to handle continuous and high-dimensional data where that method fails.

Let $(T_i, \delta_i)$, $i = 1, \ldots, n$, be the observed times and censoring indicators. The model considered is, as usual, the Cox model

    \lambda\{t, X_i, Z_i\} = \lambda_0(t)\exp\{\beta_1^T X_i + \beta_2^T Z_i\},

where $\beta = (\beta_1^T, \beta_2^T)$ are the regression coefficients to be estimated, $X_i$ is a $k \times 1$ vector of covariates which may be missing for some subjects, $Z_i$ is a vector of covariates that is always observed, and $W_i$ is a surrogate measure for $X_i$, always observable. The method is built on the relation between $Z$, $W$ and $X$ suggested in Prentice (1982):

    \lambda\{t, Z, W\} = \lambda_0(t)\exp\{\beta_2^T Z\}\, E[\exp\{\beta_1^T X\} \mid Y(t) = 1, Z, W].    (3.23)

A general model for the relation between $X$ and $(Z, W)$ assumes that $E(X \mid Z, W) = g(Z, W, A)$, where the matrix $A = (\alpha_1, \ldots, \alpha_k)$ and the vector $\alpha_j = (\alpha_{1j}, \ldots, \alpha_{mj})^T$ is the parameter of the $j$th component of $E(X \mid Z, W)$. The regression calibration estimate of $\beta$ is obtained in two steps (sketched in code below):

1. Estimate the missing $X$ by $g(Z, W, \hat A)$, where $\hat A$ is estimated from the validation data.

2. Apply the Cox regression procedure using covariate data $(X, Z)$ in the validation set and $\{g(Z, W, \hat A), Z\}$ in the non-validation set.

With the missing covariates substituted, the estimating function can be developed. Let $X_i^* = X_i$ if $X_i$ is observed and $X_i^* = g(Z_i, W_i, A)$ if $X_i$ is missing, and assume there exists $\beta^* = (\beta_1^{*T}, \beta_2^{*T})^T$ such that

    \lambda\{t, X_i^*, Z_i\} = \lambda_0(t)\exp\{\beta_1^{*T} X_i^* + \beta_2^{*T} Z_i\} = r(X_i^*, Z_i, \beta^*).

The estimating equations can be written as

    U_n(\beta, \hat A) = n^{-1/2} \sum_{i=1}^{n} \delta_i \Big\{\binom{\hat X_i^*}{Z_i} - \frac{\hat S^{*(1)}(\beta, T_i)}{\hat S^{*(0)}(\beta, T_i)}\Big\},    (3.24)

where $\hat X_i^* = X_i^{\eta_i}\{g(Z_i, W_i, \hat A)\}^{1 - \eta_i}$, i.e., $X_i$ when the observation indicator $\eta_i = 1$ and $g(Z_i, W_i, \hat A)$ otherwise, and

    \hat S^{*(0)} = n^{-1}\sum_{j=1}^{n} Y_j(T_i)\, r(\hat X_j^*, Z_j, \beta), \qquad
    \hat S^{*(1)} = n^{-1}\sum_{j=1}^{n} Y_j(T_i)\binom{\hat X_j^*}{Z_j} r(\hat X_j^*, Z_j, \beta).

The key part of the regression calibration method is obtaining the estimate of $A$; various estimating procedures can be used. Assume the validation sample fraction $n_V / n \to \rho \in (0, 1)$. Let $\mathrm{vec}(A) = (\alpha_{11}, \ldots, \alpha_{m1}, \alpha_{12}, \ldots, \alpha_{m2}, \ldots)$ denote the vectorization of the matrix $A$. The authors assume that there exist an $mk \times mk$ positive definite matrix $Q$ and $m \times k$ matrices $X_i$ such that the estimated $\hat A$ satisfies

    n_V^{1/2}\{\mathrm{vec}(\hat A - A)\} = Q^{-1} n_V^{-1/2} \sum_{i=1}^{n} \eta_i\,\mathrm{vec}\{X_i\,\mathrm{diag}(e_i)\} + o_p(1),

where $e_i = X_i - g(Z_i, W_i, A)$ and $\mathrm{diag}(e_i)$ is the $k \times k$ diagonal matrix with the components of $e_i$ as its diagonal elements.

The advantage of the regression calibration method is that it requires neither normally distributed covariates nor a rare disease assumption, and it can be used with continuous, high-dimensional covariates.
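A minimal one-covariate sketch of the two-step procedure follows (Python; numpy and scipy are assumed, and our simplifications are a scalar $X$, no always-observed $Z$, and a linear calibration model $g(W, A) = a_0 + a_1 W$):

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(2)
    n, beta0 = 20000, 0.4
    x = rng.normal(size=n)                        # true covariate, missing off the validation set
    w = x + rng.normal(scale=0.7, size=n)         # surrogate W, always observed
    valid = rng.random(n) < 0.4                   # validation set: x observed (eta_i = 1)
    t = rng.exponential(1.0 / (0.02 * np.exp(beta0 * x)))
    time = np.minimum(t, 5.0)
    event = (t <= 5.0).astype(float)

    # Step 1: estimate g(W, A) = a0 + a1*W, i.e. E[X | W], from the validation data
    a1, a0 = np.polyfit(w[valid], x[valid], deg=1)
    x_star = np.where(valid, x, a0 + a1 * w)      # calibrated covariate X*

    # Step 2: routine Cox partial-likelihood fit using X*
    def score(b, cov):
        o = np.argsort(time)
        d, c = event[o], cov[o]
        rr = np.exp(b * c)
        s0 = np.cumsum(rr[::-1])[::-1]
        s1 = np.cumsum((rr * c)[::-1])[::-1]
        return np.sum(d * (c - s1 / s0))

    print("regression calibration estimate:", brentq(lambda b: score(b, x_star), -2.0, 2.0))

Regression calibration is approximate, so the fitted value can retain some bias even in large samples, consistent with the consistency caveat noted below.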
In Zhou and Pepe (1995), the relative risk for each subject in the non-validation set is estimated by an empirical weighted average of the relative risks in the validation set. Unlike in Zhou and Pepe (1995), one must realize that the regression calibration model used to impute the missing covariates may vary from case to case, and certain asymptotic properties of the estimated covariates must hold to ensure the asymptotic normality of the estimated coefficient $\hat\beta$. Another issue is that $\hat\beta$ derived with the regression calibration method of this paper may not always be consistent.

Chapter 4

A simulation study comparing single imputation methods and the missing indicator method

In Chapter 2, we introduced in great detail the prostate cancer - pesticide pilot study that serves as the motivational background for this paper. The preliminary results were obtained using a single imputation method, the average valid year method, to handle the missing gaps in the pesticide exposure history. It is natural to ask how the AVY method performs in the case-control setting compared with other missing value methods. In this chapter we carry out a Monte-Carlo simulation to evaluate the performance of various missing data methods. The simulation is largely based on the study by Weinberg, et al. (1996); the difference is that instead of the additive ERR model we consider an exponential relative risk, and we add the missing indicator method to the comparison.

4.1 The missing data methods examined in the simulation

The missing data methods we compare in the simulation should be straightforward, plausible, and easy to implement, so that they could be widely used in practical studies. Five simple imputation methods are chosen for the simulation, along with a missing indicator method. These methods are all compatible with the gaps-in-exposure-history scenario, which is the missing pattern this study focuses on.

- AVY method: This method is what we implemented to analyze the prostate cancer - pesticide pilot study data, covered in detail in Chapter 2. We point out here that although it appears to "ignore the missing part", it is in fact a simple imputation: the AVY method is equivalent to imputing each missing value with the mean of the non-missing values from the individual's own exposure history, then taking the average over the imputed history. We call this imputation strategy "row imputation".

- OMI method: OMI stands for Observed Mean Imputation. For a specific year, we impute the missing value with the mean value over the individuals who have an observed exposure for that year. We call this type of imputation "column imputation", since the imputed values are derived from other individuals, that is, calculated from the columns of the data.

- OCMI method: OCMI stands for Observed Control Mean Imputation. The method is the same as OMI except that the mean is taken over all non-missing controls instead of all study subjects.

- SSMI method: SSMI stands for Status Specific Mean Imputation. For a given year, this method uses the mean over non-missing controls to impute controls with missing exposures, and the mean over non-missing cases to impute cases with missing exposures.

- ZI method: ZI stands for Zero Imputation. As the name suggests, the imputation is done by simply replacing all missing values with zeros. Since this method obviously underestimates the average yearly exposure regardless of disease status, we expect it to underestimate the true coefficient.

- MIND method: MIND stands for Missing INDicator. The missing indicator method can have different implementations.
For the simulation we use the simplest form, which is basically an enhanced ZI method: impute the missing values with zeros, then adjust for the number of missing years by adding a nuisance parameter to the model.

4.2 The data preparation for the simulation

The goal of data generation is to create exposure and missing patterns similar in key characteristics to the motivational study. Before the exposure is created, we first generate a residential history over 30 years, with a probability of 0.1 each year of moving to a new residence, for each simulated subject. We chose the 0.1 moving probability based on practical experience: on average, each individual moves 3 times during the 30-year period. Note that at this stage we do not generate actual addresses; the only thing that matters is whether the subject lives in the same place in any two adjacent years, since the residential history is used only to create missing exposures.

As is evident in the pilot study data, consisting of 136 cases and 137 controls, the cumulative exposure, expressed as the total pounds of a given chemical applied within the disk-shaped buffer, is highly skewed. This observation is in accordance with the common experience that cumulative exposures to airborne contaminants often approximate the log-normal distribution (Armstrong, 1990). Another pattern worth noting is the high proportion of zero exposure estimates, which make up approximately 70% of the entire exposure history in the pilot study data. This can be attributed to how the estimate is generated under the PUR + LU method: GRAPES generates a zero exposure if no PUR report was filed for the specific chemical product during the year.

The exposure pattern in the real data is rather complicated; for example, it could be more reasonable to assume that exposures in two consecutive years, or exposures at the same residence, are similar to some extent. However, an overly complicated exposure-missingness scheme may digress from the main focus and obscure the most meaningful truths we want to uncover, so we balance the design to ensure simplicity of the simulation while retaining important features of the real data. We propose a 30-year exposure history in which the exposures in individual years for the same subject are mutually independent and identically distributed. For each individual year, the exposure has a log-normal distribution with an independent chance of being zero. The exposure history $\{Z_1, \ldots, Z_{30}\}$ is generated as

    Z_k = I_{\{u < 0.3\}}\, e^{2z}, \quad (k = 1, \ldots, 30),

where $u$ denotes a uniform draw on $[0, 1]$ and $z$ a standard normal draw.

Next we consider the missing pattern. Missingness in the simulation is determined randomly at the address level: each address is assigned a missing probability of 0.25. Since missingness occurs on addresses, not on individual yearly exposures, once an address is missing, all the years of exposure associated with it are also missing, creating a streak-like missing pattern. Clearly the missingness over the 30 years of exposure history is not mutually independent: given a missing year, the immediately following year has a 0.9 + 0.1 x 0.25 = 92.5% chance of being missing, because the subject has only a 10% chance of moving in that year. In fact, it can be shown that the missing years have an autoregressive-type correlation matrix.
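The generation scheme just described can be sketched in a few lines (Python with numpy; function and variable names are ours):

    import numpy as np

    rng = np.random.default_rng(3)
    K = 30                                        # years of exposure history

    def simulate_subject():
        moves = rng.random(K) < 0.1               # probability 0.1 of moving each year
        address = np.cumsum(moves)                # id of the address occupied each year
        # yearly exposure: zero with probability 0.7, otherwise log-normal e^{2z}
        z = np.where(rng.random(K) < 0.3, np.exp(2.0 * rng.normal(size=K)), 0.0)
        # missingness acts on whole addresses with probability 0.25,
        # producing streak-like runs of missing years
        lost = rng.random(address.max() + 1) < 0.25
        return z, lost[address]

    exposure, missing = simulate_subject()
    print(np.round(exposure, 2))
    print(missing.astype(int))

Because missingness attaches to addresses, a missing year is followed by another missing year unless the subject moves to an address that happens to be observed, reproducing the 92.5% persistence computed above.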
The dependency among missing years can make the analysis of missing data much more difficult; it will be relaxed first when we attempt the theoretical development in Chapter 5.

How best to summarize a time-varying exposure history for case-control studies is an intriguing topic, as discussed in Chapter 2; for the simulation we take the simple approach of analyzing a one-dimensional parameter for the average yearly exposure. Notice that the average yearly exposure is equivalent to the total cumulative exposure when the time period is the same for all individuals. We assume the occurrence of disease and the exposure follow a logistic model, namely

    \mathrm{logit}\,\Pr(D = 1) = \alpha + \beta \bar Z,

where the dichotomous $D$ is the disease status, $\{Z_k\}$, $k = 1, \ldots, 30$, the exposure history, and $\bar Z = \frac{1}{30}\sum_{k=1}^{30} Z_k$ the average yearly exposure. We then prospectively sample the cases by comparing a random number $r$ drawn from $U[0, 1]$ for each individual to the case probability determined by the logistic model: if the random number satisfies

    r < p_{\mathrm{case}} = \frac{e^{\alpha_0 + \beta_0 \bar Z}}{1 + e^{\alpha_0 + \beta_0 \bar Z}},

the simulated individual is sampled as a case and otherwise discarded; we continue the process until the desired number of cases is reached. Controls are sampled separately in the same manner, except that the control probability is

    p_{\mathrm{control}} = 1 - p_{\mathrm{case}} = 1/(1 + e^{\alpha_0 + \beta_0 \bar Z}).

One disadvantage of prospective sampling is that the sampling time depends greatly on the rarity of the disease, $e^{\alpha_0}$ being the baseline odds. However, $\alpha_0$ is not of interest in a case-control setting, so we fix $\alpha_0 = -4$ throughout the simulations.

4.3 The simulation setting and results

We use two case-control allocations, 1500 cases vs. 1500 controls and 500 cases vs. 1500 controls, for the simulation. The true values of $\beta_0$ are set to 0.4, 0.2, 0.1, 0.05 and 0. The simulation is carried out for 1000 iterations in each setting. The generated data sets are analyzed separately using logistic regression in SAS 9.1.3 (SAS Institute, NC) to compare the FULL, AVY, OMI, OCMI, SSMI, ZI, and MIND methods described above.

For each method-setting combination, we take the arithmetic mean of the estimates from the $N = 1000$ iterations as the empirical expectation,

    \hat E(\hat\beta) = \frac{1}{N}\sum_{i=1}^{N} \hat\beta_i.

The standard deviation is obtained empirically in a similar manner,

    \widehat{SD}(\hat\beta) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \big(\hat\beta_i - \hat E[\hat\beta]\big)^2}.

The FULL method utilizes all the data before missingness is generated; it is the "gold standard" that serves as the basis of comparison for the missing-data methods. The efficiency of a specific method relative to the FULL method is defined as the ratio of the two variances,

    \mathrm{Eff}(\mathrm{Method} \mid \mathrm{FULL}) = \mathrm{Var}_{\mathrm{FULL}} / \mathrm{Var}_{\mathrm{Method}}.

When the "Method" is the FULL method itself, the efficiency is of course 1. For any other missing data method, we expect the efficiency to be less than 1 if the method yields an unbiased estimate. The no-bias requirement is the premise for using relative efficiency as a criterion, because the variance of the estimate is smaller when the mean estimate is closer to 0: if a method biases the estimate toward the null, the resulting efficiency can be greater than 1; if, on the other hand, the bias is an overestimate, the efficiency can be far less than 1. Hence the relative efficiency is only meaningful if the two methods yield similar mean estimates. The results reported in the next section show some efficiencies greater than 1, a demonstration of this bias problem.
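A sketch of one simulation iteration under the FULL method, combining the prospective rejection sampling with the empirical summaries just defined, follows (Python; numpy and statsmodels are assumed available, and statsmodels' logistic regression stands in for the SAS procedure used in the actual study):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    alpha0, beta0, K = -4.0, 0.2, 30

    def zbar():
        # average yearly exposure for one simulated subject (no missingness yet)
        u = rng.random(K)
        return np.where(u < 0.3, np.exp(2.0 * rng.normal(size=K)), 0.0).mean()

    def sample_group(n_needed, want_case):
        kept = []
        while len(kept) < n_needed:            # rejection step of prospective sampling
            zb = zbar()
            p_case = 1.0 / (1.0 + np.exp(-(alpha0 + beta0 * zb)))
            if rng.random() < (p_case if want_case else 1.0 - p_case):
                kept.append(zb)
        return np.array(kept)

    cases, controls = sample_group(1500, True), sample_group(1500, False)
    y = np.r_[np.ones(len(cases)), np.zeros(len(controls))]
    X = sm.add_constant(np.r_[cases, controls])
    fit = sm.Logit(y, X).fit(disp=0)
    print("beta-hat:", fit.params[1], "Wald SE:", fit.bse[1])
    # over 1000 such iterations, Eff(Method | FULL) = Var_FULL / Var_Method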
We also examine the standard error of the estimate from another angle. For each iteration, the Wald standard error is obtained from the SAS output; the average of the 1000 standard errors is then computed as the empirical expectation of the standard error and compared to the standard deviation of the estimator, to check the accuracy of the standard-error calculation under the missing data methods. The coverage of the nominal 95% confidence interval is defined as the percentage of CIs containing the true parameter $\beta$; if a missing-data method is valid, the coverage should be close to 95%. When $\beta > 0$, we also calculate the power as the percentage of iterations in which 0 falls below the lower bound of the nominal 95% CI. Note that when $\beta = 0$, this percentage is actually the type I error rate.

The results are presented in Table 4.1 and Table 4.2. As expected, averaging over 1000 iterations, the FULL method produces an essentially unbiased estimate in all simulation settings. We present the results and comments for the other missing data methods used in the simulation as follows.

- ZI: Since the ZI method imputes zeros at all missing spots, it underestimates the true cumulative exposures across subjects; and since the missingness does not depend on case-control status, on average the ZI method underestimates the true exposure by the proportion missing, namely 25%. The measurement error is non-differential, so we expected attenuation of the true parameter, and the simulation results confirmed this. When the true coefficient $\beta_0$ is positive, the reduction is as large as 17.3% for $\beta_0 = 0.4$ and decreases as the true parameter decreases. Furthermore, the ZI method does not generate false positives on average when $\beta_0 = 0$. These are the typical behaviors in the presence of non-differential measurement error.

- AVY: For this subject-based imputation method, we observed substantial attenuation in estimating the true parameter $\beta_0$ when $\beta_0 > 0$. Unlike the ZI method, where the reduction varies with the true value, the reduction rate under the AVY method remains fairly constant at around 30% over all choices of the true $\beta_0$; we conjecture that the reduction rate is determined by the proportion of missingness in the simulation setting. The reason for the attenuation may be that the average exposure based on the imputed history tends to stretch away from the true average: because the imputed values are determined solely by the observed values for the same subject, when the observed average is large the imputed values tend to be larger than the true values on average, and when the observed average is very small the imputed values tend to be smaller than the true values. This kind of measurement error in the exposure thus tends to make a true positive slope less steep than when the true values are used.

- SSMI: At first glance this method seems reasonable, using the observed case mean to impute missing cases and the observed control mean to impute missing controls. However, the results clearly show that the SSMI estimate is greatly biased away from the null for a positive true parameter $\beta_0$, i.e., it yields a larger parameter estimate. The reason is that the measurement error in this method is differential with respect to case-control status: non-differential measurement error means that, given the true value, the conditional distribution of the measured value is the same regardless of case-control status.
Under the SSMI method with a positive true parameter $\beta_0$, however, given the same true value, the imputed value for a case tends to be larger than the imputed value for a control (because cases have higher exposure on average under the underlying model). This differential error causes the away-from-null bias.

- OCMI: The method based on control-mean imputation yields fairly satisfactory results. There is slight attenuation of the true parameter $\beta_0$, but not by much: the attenuation is within 5% across all values of $\beta_0$ specified in the simulation. The coverage is close to the 95% nominal level, and the efficiency is greater than 80% when $\beta_0$ is large enough. In fact, as proved in (Weinberg, et al. 1996), the imputed value is subject to so-called "Berkson error". Let $Z$ and $X$ be the true and measured values respectively; the error is of Berkson type if the expectation of the true value conditional on the observed value is the observed value itself, i.e., $E[Z \mid X] = X$. It was shown in (Prentice, 1982) that under Berkson error and the excess relative risk $r(Z, \beta_0) = 1 + \beta_0 Z$, the model produces no bias when the true value $Z$ is replaced by its measured value $X$. If the relative risk takes the exponential form used in the simulation and the conditional error $Z \mid X$ is normally distributed, the model still produces an unbiased estimate under Berkson error. However, the non-zero part of the exposure distribution in the simulation follows a log-normal distribution, which is fairly skewed; this explains the slight attenuation.

- OMI: The overall mean imputation method behaves similarly to OCMI, owing to the fact that both methods generate similar imputations. As the 1:1 vs. 1:3 comparison shows, OMI is even closer to OCMI in the 1:3 setting.

- MIND: The missing indicator method in this context seems to yield the best estimates in terms of unbiasedness and efficiency under both the 1:1 and the 1:3 case-control allocations, in comparison with the rival mean imputation methods. For example, with $\beta_0 = 0.2$, the MIND method yields an average estimate of 0.198 and a relative efficiency of 87%. Although the method performs well, there is a slight attenuation of the estimated $\beta_0$ when the true $\beta_0$ is not 0. At first glance it may be surprising that such good results can arise from so simple an implementation of the MIND method, which in the analysis is no more than imputing zeros while adjusting for the number of missing years. In fact, the MIND method is the most naive form of the so-called missing indicator induced intensity. The general theory in the next chapter will explain why the method works and how it can be extended to even more complex situations.
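For reference, the imputation rules compared above admit a compact vectorized sketch (Python with numpy; the names are ours, and the edge cases noted in the comments are assumed away, as they effectively are at the simulated 25% missingness):

    import numpy as np

    def impute_summaries(Z, M, case):
        # Z: n-by-K exposure matrix; M: 1 where the year is missing; case: boolean
        # case status. Returns the average-yearly-exposure covariate under each
        # rule (for MIND, also the missing-year count used as a nuisance covariate).
        # Assumes every subject has at least one observed year and every year has
        # observed cases and observed controls.
        obs = ~M.astype(bool)
        out = {"ZI": np.where(obs, Z, 0.0).mean(axis=1)}
        row_mean = (Z * obs).sum(axis=1) / obs.sum(axis=1)          # AVY "row imputation"
        out["AVY"] = np.where(obs, Z, row_mean[:, None]).mean(axis=1)
        def col_mean(keep):                                         # yearly "column" means
            k = obs & keep[:, None]
            return (Z * k).sum(axis=0) / k.sum(axis=0)
        everyone = np.ones(len(Z), dtype=bool)
        out["OMI"] = np.where(obs, Z, col_mean(everyone)).mean(axis=1)
        out["OCMI"] = np.where(obs, Z, col_mean(~case)).mean(axis=1)
        ssmi_fill = np.where(case[:, None], col_mean(case), col_mean(~case))
        out["SSMI"] = np.where(obs, Z, ssmi_fill).mean(axis=1)
        out["MIND"] = (out["ZI"], M.sum(axis=1))                    # zeros + missing-year count
        return out

In the analysis, each imputed average feeds the logistic model directly; for MIND, the missing-year count enters the same model as an additional nuisance covariate.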
Table 4.1: Simulation results for 1500 cases and 1500 controls

β₀    Method  Estimate  StdDev  StdErr  Coverage  Power   Bias      Efficiency
0.4   Full    0.4001    0.0206  0.0211  96.10%    100.0%  0.02%     100.0%
      MIND    0.3873    0.0235  0.0238  91.40%    100.0%  (3.16%)   87.84%
      ZI      0.3307    0.0196  0.0208  9.70%     100.0%  (17.33%)  105.2%
      AVY     0.2677    0.0209  0.0170  0.00%     100.0%  (33.08%)  98.61%
      OMI     0.3602    0.0209  0.0214  52.90%    100.0%  (9.95%)   98.55%
      OCMI    0.3794    0.0227  0.0236  85.20%    100.0%  (5.16%)   91.01%
      SSMI    0.7201    0.0407  0.0302  0.00%     100.0%  80.02%    50.74%
0.2   Full    0.2002    0.0142  0.0145  94.90%    100.0%  0.11%     100.0%
      MIND    0.1975    0.0163  0.0166  94.50%    100.0%  (1.27%)   87.25%
      ZI      0.1790    0.0144  0.0153  70.10%    100.0%  (10.48%)  98.81%
      AVY     0.1391    0.0146  0.0121  2.00%     100.0%  (30.46%)  97.35%
      OMI     0.1889    0.0149  0.0155  88.70%    100.0%  (5.54%)   95.25%
      OCMI    0.1958    0.0160  0.0166  93.80%    100.0%  (2.09%)   88.69%
      SSMI    0.4512    0.0386  0.0233  0.00%     100.0%  125.6%    36.83%
0.1   Full    0.1010    0.0119  0.0111  94.10%    100.0%  0.96%     100.0%
      MIND    0.1007    0.0132  0.0128  93.60%    100.0%  0.67%     90.21%
      ZI      0.0944    0.0120  0.0121  90.00%    100.0%  (5.60%)   98.63%
      AVY     0.0714    0.0111  0.0094  20.40%    100.0%  (28.60%)  106.8%
      OMI     0.0990    0.0128  0.0125  93.30%    100.0%  (0.98%)   93.06%
      OCMI    0.1003    0.0131  0.0128  93.90%    100.0%  0.33%     90.84%
      SSMI    0.2694    0.0446  0.0201  0.00%     100.0%  169.4%    26.61%
0.05  Full    0.0510    0.0093  0.0092  93.80%    100.0%  1.98%     100.0%
      MIND    0.0510    0.0109  0.0106  94.50%    100.0%  1.98%     85.30%
      ZI      0.0488    0.0103  0.0102  93.30%    100.0%  (2.33%)   90.70%
      AVY     0.0358    0.0087  0.0077  53.00%    100.0%  (28.35%)  107.1%
      OMI     0.0507    0.0108  0.0105  94.00%    100.0%  1.46%     86.27%
      OCMI    0.0509    0.0109  0.0106  94.50%    100.0%  1.86%     85.52%
      SSMI    0.1312    0.0395  0.0164  6.90%     100.0%  162.5%    23.55%
0     Full    0.0004    0.0086  0.0085  97.00%    1.40%   .         100.0%
      MIND    0.0005    0.0105  0.0102  96.50%    1.80%   .         81.89%
      ZI      0.0005    0.0103  0.0100  96.50%    1.90%   .         100.0%
      AVY     0.0004    0.0076  0.0073  96.50%    1.90%   .         112.9%
      OMI     0.0005    0.0105  0.0102  96.50%    1.80%   .         81.92%
      OCMI    0.0005    0.0105  0.0102  96.40%    1.80%   .         81.92%
      SSMI    0.0008    0.0151  0.0106  88.90%    6.20%   .         57.01%
Table 4.2: Simulation results for 500 cases and 1500 controls

β₀    Method  Estimate  StdDev  StdErr  Coverage  Power   Bias      Efficiency
0.4   Full    0.4010    0.0261  0.0259  95.60%    100.0%  0.24%     100.0%
      MIND    0.3892    0.0292  0.0289  93.00%    100.0%  (2.70%)   89.38%
      ZI      0.3433    0.0247  0.0259  41.40%    100.0%  (14.18%)  105.6%
      AVY     0.2635    0.0271  0.0205  0.00%     100.0%  (34.12%)  96.26%
      OMI     0.3882    0.0291  0.0288  92.40%    100.0%  (2.95%)   89.50%
      OCMI    0.3798    0.0276  0.0284  88.30%    100.0%  (5.05%)   94.38%
      SSMI    0.6645    0.0479  0.0359  0.10%     100.0%  66.13%    54.40%
0.2   Full    0.2016    0.0176  0.0174  94.80%    100.0%  0.78%     100.0%
      MIND    0.1988    0.0201  0.0199  94.20%    100.0%  (0.58%)   87.69%
      ZI      0.1835    0.0182  0.0185  83.00%    100.0%  (8.27%)   96.60%
      AVY     0.1378    0.0182  0.0144  5.10%     100.0%  (31.09%)  96.52%
      OMI     0.1985    0.0200  0.0198  94.00%    100.0%  (0.75%)   88.00%
      OCMI    0.1965    0.0197  0.0197  94.20%    100.0%  (1.73%)   89.40%
      SSMI    0.3885    0.0465  0.0266  0.00%     100.0%  94.23%    37.89%
0.1   Full    0.1008    0.0130  0.0132  96.10%    100.0%  0.77%     100.0%
      MIND    0.1002    0.0150  0.0151  95.10%    100.0%  0.24%     86.68%
      ZI      0.0954    0.0141  0.0145  93.20%    100.0%  (4.59%)   92.09%
      AVY     0.0697    0.0133  0.0110  27.20%    100.0%  (30.31%)  97.59%
      OMI     0.1001    0.0150  0.0151  95.10%    100.0%  0.11%     86.90%
      OCMI    0.0998    0.0149  0.0151  95.00%    100.0%  (0.20%)   87.34%
      SSMI    0.2058    0.0452  0.0210  3.50%     100.0%  105.8%    28.84%
0.05  Full    0.0508    0.0115  0.0110  94.80%    100.0%  1.67%     100.0%
      MIND    0.0509    0.0129  0.0127  95.50%    99.40%  1.77%     89.12%
      ZI      0.0492    0.0123  0.0124  94.30%    99.40%  (1.52%)   93.19%
      AVY     0.0350    0.0106  0.0092  57.80%    98.30%  (29.92%)  108.4%
      OMI     0.0509    0.0129  0.0127  95.50%    99.40%  1.70%     89.16%
      OCMI    0.0508    0.0128  0.0127  95.50%    99.50%  1.59%     89.43%
      SSMI    0.1006    0.0364  0.0168  31.30%    99.90%  101.1%    31.47%
0     Full    -0.0026   0.0143  0.0129  96.70%    1.80%   .         100.0%
      MIND    -0.0035   0.0170  0.0153  95.70%    2.00%   .         83.78%
      ZI      -0.0034   0.0167  0.0150  95.90%    1.80%   .         100.0%
      AVY     -0.0027   0.0125  0.0112  95.80%    2.00%   .         114.2%
      OMI     -0.0035   0.0170  0.0153  95.60%    2.10%   .         83.94%
      OCMI    -0.0034   0.0170  0.0153  95.60%    2.10%   .         84.00%
      SSMI    -0.0062   0.0259  0.0163  84.60%    7.40%   .         55.10%

Chapter 5
The Method based on missing indicator induced intensity

The missing indicator induced intensity method was introduced in Chapter 3 through a review of existing publications, and in Chapter 4 we conducted a Monte-Carlo simulation to compare various methods, including the missing indicator method, which can be perceived as a simple form of the induced intensity method. In this chapter a theoretical development is provided for the method based on the missing indicator induced intensity, in order to understand the root of the method and how it may behave under the missing pattern specified in Chapter 1 and Chapter 2.

5.1 The induced intensity under counting processes theory

Suppressing the subject index $i$ unless further specified, suppose we have a time-dependent $\mathbb{R}^p$-vector covariate $Z(t) = \{Z^{(1)}(t), \ldots, Z^{(p)}(t)\}$, $t \in [0, \tau]$, $\tau < \infty$. To make the argument fit into counting process theory, we assume $Z(t)$ is left continuous with right-hand limits. Our interest is the association between $Z(t)$ and the rate of failure. An exposure summary $Z^*(t)$ and a parameter $\beta_0 \in \mathbb{R}^p$ are often used to model the association. $Z^*(t)$ is a function of the exposure history $\{Z(u) : u \le t\}$; we discussed some possible ways the summary can be constructed and provided some examples in Chapter 1.
A Cox regression model accommodating $Z^*(t)$, the associated parameter $\beta_0$ and a baseline relative risk function $\alpha(\cdot)$ assumes that the relative risk has the exponential form
\[
\lambda\{t; Z^*(t), \alpha(\cdot), \beta\} = \alpha(t)\, e^{Z^*(t)\beta}. \tag{5.1}
\]
Now let a counting process $N(t)$ count the number of events up to time $t$, and let $Y(t)$ be the censoring indicator indicating whether the subject is under observation. We assume that the counting process $N(t)$ has an intensity that is an element of the entire model space, i.e.,
\[
\lambda(t; \mathcal{F}) = Y(t)\,\lambda\{t; Z^*(t), \alpha_0(\cdot), \beta_0\} = Y(t)\,\lambda_0(t)\, e^{Z^*(t)\beta_0},
\]
for some $\beta = \beta_0 \in \mathbb{R}^p$ and baseline risk function $\alpha_0(\cdot) = \lambda_0(\cdot)$, under the natural filtration $\mathcal{F}_t = \sigma\{N(u), Y(u), Z(u) : u \le t\}$. We call $\beta_0$ the true value for the data.

In many situations, the exposure history $\{Z(u) : u \le t\}$ may not be completely observed, especially when it spans a long time period; for example, the history may contain gaps or holes. Missingness in the exposure history generally renders the exposure summary $Z^*(t)$ unattainable. Let $M(t) = \{M^{(1)}(t), \ldots, M^{(p)}(t)\}$ be the missing indicator associated with the exposure $Z(t)$ at time $t \ge 0$, i.e., for any $t \ge 0$ and $i \in \{1, \ldots, p\}$, $M^{(i)}(t) = 0$ if $Z^{(i)}(t)$ is observed and $1$ if $Z^{(i)}(t)$ is missing. We also assume that $M(t)$ is left continuous with right-hand limits; a simple yet realistic example is a step function with components jumping between 0 and 1. We then have a larger filtration $\mathcal{G}_t = \sigma\{N(u), Y(u), Z(u), M(u) : u \le t\}$, obtained by incorporating the missing information into $\mathcal{F}_t$, albeit $\mathcal{G}_t$ is not always observable by the investigator.

An important assumption regarding the missing indicator is that the missing indicators do not provide any predictive information for the failure rate when the covariate can be fully observed. More precisely, the intensity of $N(t)$ remains the same under $\mathcal{G}_t$ and $\mathcal{F}_t$, i.e.,
\[
\lambda(t; \mathcal{G}) = \lambda(t; \mathcal{F}) = Y(t)\,\lambda_0(t)\, e^{Z^*(t)\beta_0}. \tag{5.2}
\]
With missing data, all the information the investigator observes up to time $t$ can be written as a new filtration $\mathcal{H}_t = \sigma\{N(u), Y(u), Z(u)\circ(1 - M(u)), M(u) : u \le t\}$, where "$\circ$" denotes the element-wise matrix product. Evidently $\mathcal{H}_t$ is a sub-filtration nested in $\mathcal{G}_t$. We want to find the intensity function of $N(t)$ with respect to the filtration $\mathcal{H}_t$, the reduced information caused by the missing data. The answer lies in the so-called innovation theorem (Aalen, 1978), which ensures that
\[
\lambda(t; \mathcal{H}) = E[\lambda(t; \mathcal{G}) \mid \mathcal{H}_t].
\]
A heuristic proof is as follows:
\[
\lambda(t; \mathcal{H})\,dt = E[dN(t) \mid \mathcal{H}_{t-}] = E\big[E[dN(t) \mid \mathcal{G}_{t-}] \mid \mathcal{H}_{t-}\big] = E[\lambda(t; \mathcal{G})\,dt \mid \mathcal{H}_{t-}].
\]
Applying this argument to the cumulative exposure over the history $\{Z(u) : u \le t\}$, $Z^*(t) = \int_0^t Z(u)\,du$, we have
\[
\begin{aligned}
\lambda(t; \mathcal{H}) &= E\Big[Y(t)\lambda_0(t)\, e^{\{\int_0^t Z(u)\,du\}\beta_0} \,\Big|\, \mathcal{H}_t\Big] \\
&= E\Big[Y(t)\lambda_0(t)\, e^{\{\int_0^t Z(u)\circ(1-M(u))\,du + \int_0^t Z(u)\circ M(u)\,du\}\beta_0} \,\Big|\, \mathcal{H}_t\Big] \\
&= Y(t)\lambda_0(t)\, e^{\{\int_0^t Z(u)\circ(1-M(u))\,du\}\beta_0}\; E\Big[e^{\{\int_0^t Z(u)\circ M(u)\,du\}\beta_0} \,\Big|\, \mathcal{H}_t\Big]. \tag{5.3}
\end{aligned}
\]
This missing indicator-induced intensity clearly shows that when the exposure history $Z(\cdot)$ is completely observed up to time $t$, the intensity is exactly the same as the original intensity under the non-missing filtration $\mathcal{F}_t$; when there is missingness at time $t$, the induced intensity is the expectation of the original intensity conditional on all the information known up to time $t$. We have seen in Section 3.4 that the composition of the expectation part $E[e^{\{\int_0^t Z(u)\circ M(u)\,du\}\beta_0} \mid \mathcal{H}_t]$ can be rather complicated: it depends on the time $t$, the true parameter $\beta_0$, the baseline hazard $\lambda_0(\cdot)$, the covariate history $Z(\cdot)$, the missing history $M(\cdot)$, and the conditional distribution $F_Z(\cdot)$ of $Z(\cdot)$ given the observed filtration $\mathcal{H}_t$.
We can further write
\[
\phi\big(t; \beta_0, Z(\cdot), M(\cdot), \lambda_0(\cdot), F_Z(\cdot)\big) = \log E\Big[e^{\{\int_0^t Z(u)\circ M(u)\,du\}\beta_0} \,\Big|\, \mathcal{H}_t\Big]. \tag{5.4}
\]
Writing the left-hand side of (5.4) as $\phi(\cdot)$, the induced intensity then follows:
\[
\lambda(t; \mathcal{H}) = Y(t)\lambda_0(t)\, e^{\{\int_0^t Z(u)\circ(1-M(u))\,du\}\beta_0 + \phi(t;\, \beta_0,\, Z(\cdot),\, M(\cdot),\, \lambda_0(\cdot),\, F_Z(\cdot))}. \tag{5.5}
\]
$\phi(\cdot)$ is in fact the conditional cumulant generating function under the original Cox model. It is tempting to use the missing indicator-induced model to estimate $\beta_0$; the difficulty is that in general $\phi(\cdot)$ cannot be expressed explicitly. The difficulties are two-fold. First, an explicit form of the induced intensity requires comprehensive knowledge of the distributional structure between the missingness and the exposure; this includes the missing mechanism, which usually cannot be estimated from the data. Second, even if the missingness is completely at random and $\phi(\cdot)$ can somehow be explicitly calculated as a function of $\beta_0$, it generally depends on the baseline risk function $\lambda_0(\cdot)$, which in turn violates the proportionality property at the model level, making parameter estimation much harder and non-standard.

One way to work around these difficulties is to resort to the theory of model building. The idea is much like that in linear regression, where one may consider a multiplicative term to account for an interaction effect, or add quadratic or higher-order terms to model beyond a linear effect; or like that in conventional missing data methods, where one has to assume the missing mechanism to be MCAR or MAR. Nothing prevents us from directly modeling the explicit functional form of $\phi(\cdot)$ under a reasonably specified covariate structure and an assumption on the missing mechanism, as long as the resulting model reflects the association of main interest. This way one may greatly reduce the complexity of the induced intensity, to the extent that it can be specified from the observed data. Most ideally, the specified intensity retains the exponential form (or another well-studied form) so that standard estimating techniques such as partial likelihood can be used directly, without heavy modification or complicated programming. In other words, our approach is to "guess" a new model that incorporates our knowledge and beliefs about the missing data, and the analysis is based on this new model. The model based on the intensity (5.4) only serves as a hint from which the new model can be guessed. The guess cannot be too wild, though: the new model must nonetheless be close to the original model and reduce to exactly the same model when no missing data exist.

5.2 Examples of parametrization

We discussed in the previous section the idea around which the missing indicator-induced intensity can be derived, and we have also seen, in Section 3.4, the role of the baseline risk $\lambda_0(\cdot)$ in determining the functional form of the induced intensity. In this section we give some examples under strong assumptions on the exposure distribution and the missing structure. Unless otherwise specified, we make the following two assumptions throughout the section:

1. Rare disease assumption: namely $\lambda_0(\cdot) \approx 0$. This assumption, as suggested in Prentice (1982), essentially frees the induced intensity of its dependency on $\lambda_0(\cdot)$.

2. Non-informative censoring: the censoring indicator $Y(t)$ is independent of the exposure and its missing status.

Example 1: Independent binary exposure and completely random missingness

Suppressing the index for subject, let $Z = \{Z_1, \ldots, Z_K\}$ denote the exposure history, where the components $Z_k$ are i.i.d. Bernoulli trials with $\Pr\{Z_k = 1\} = p_0$, $k = 1, \ldots, K$.
Let $M = \{M_1, \ldots, M_K\}$ be the corresponding missing indicators, which are also i.i.d. with $\Pr\{M_k = 1\} = \pi_0$, $k = 1, \ldots, K$. Furthermore, we assume $M$ and $Z$ are independent. Normally the length of the history $K$ is determined by the time $t$, since the history is what happened up to time $t$; however, for simplicity, we assume for now that $K$ is a fixed value rather than varying with $t$. This simplification is not totally unrealistic, since one can imagine a scenario where the exposure history is evaluated at the baseline time 0 and the interest is in the total exposure before baseline. Applying (5.4) under these strong assumptions, it follows that
\[
\phi(\cdot) = \log E\Big[e^{\beta_0 \sum_{k=1}^K Z_k M_k} \,\Big|\, M\Big]
= \sum_{k=1}^K \log E\big[e^{\beta_0 Z_k}\big]\, M_k
= \log(1 - p_0 + p_0 e^{\beta_0}) \cdot \sum_{k=1}^K M_k. \tag{5.6}
\]
The corresponding induced intensity becomes
\[
\lambda(t; \mathcal{H}) = Y(t)\lambda_0(t)\exp\Big\{\beta_0 \sum_{k=1}^K Z_k(1 - M_k) + \log(1 - p_0 + p_0 e^{\beta_0}) \sum_{k=1}^K M_k\Big\}. \tag{5.7}
\]
At this point, one must be clear about what is to be estimated. $\beta_0$ is the focus of interest, hence it is always assumed unknown and to be estimated. Depending on the additional information one has, the nuisance parameter $p_0$ may be either an unknown parameter or a fixed known parameter. If $p_0$ is unknown, we can estimate $(\beta_0, p_0)$ in a cohort study design by maximizing the Cox partial likelihood (see Section 3.3.1 for details):
\[
L^*(\beta, p) = \prod_{i=1}^n \prod_{0 \le t \le \tau} \Bigg\{\frac{r_i^*(\beta, p, t)}{\sum_j^n Y_j(t)\, r_j^*(\beta, p, t)}\Bigg\}^{dN_i(t)},
\]
where
\[
r_i^*(\beta, p, t) = \exp\Big\{\beta \sum_{k=1}^K Z_{ik}(1 - M_{ik}) + \log(1 - p + pe^{\beta}) \sum_{k=1}^K M_{ik}\Big\}.
\]
When $p_0$ is unknown, one can introduce another parameter $\eta$ to retain the log-linear property of the original Cox model; let
\[
\eta = \log(1 - p + pe^{\beta}).
\]
The partial likelihood (3.5) in terms of $(\beta, \eta)$ becomes
\[
L(\beta, \eta) = \prod_{i=1}^n \prod_{0 \le t \le \tau} \Bigg\{\frac{r_i(\beta, \eta, t)}{\sum_j^n Y_j(t)\, r_j(\beta, \eta, t)}\Bigg\}^{dN_i(t)}, \tag{5.8}
\]
where
\[
r_i(\beta, \eta, t) = \exp\Big\{\beta \sum_{k=1}^K Z_{ik}(1 - M_{ik}) + \eta \sum_{k=1}^K M_{ik}\Big\}.
\]
It can be shown that the maximum partial likelihood estimators $\hat{\beta}$ obtained from maximizing $L(\beta, \eta)$ and $L^*(\beta, p)$ are the same and have the same variance. If, on the other hand, additional information outside the model is available so that $p_0$ is known, the induced intensity contains only the single unknown parameter $\beta_0$, and the estimate of $\beta_0$ can be obtained by maximizing $L^*(\beta, p)$ with the constraint $p = p_0$. One can then expect higher efficiency than when $p_0$ is unknown, owing to the gain from the additional information.
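The following R sketch (an assumed setup for illustration, not code from this thesis) fits the reparameterized model (5.8) with standard Cox software by entering the observed exposure sum and the missing-year count as two covariates.

    # Minimal sketch: fitting the Example 1 induced-intensity model with the
    # reparametrization eta = log(1 - p + p*exp(beta)). All values illustrative.
    library(survival)

    set.seed(1)
    n <- 20000; K <- 5; p0 <- 0.3; pi0 <- 0.2; beta0 <- 0.2
    Z <- matrix(rbinom(n * K, 1, p0), n, K)    # i.i.d. Bernoulli yearly exposures
    M <- matrix(rbinom(n * K, 1, pi0), n, K)   # MCAR missing indicators
    haz  <- 1e-5 * exp(beta0 * rowSums(Z))     # rare-disease hazard, true model
    time <- rexp(n, rate = haz)
    cens <- runif(n, 0, 2000)
    status <- as.numeric(time <= cens)
    obs <- pmin(time, cens)

    U <- rowSums(Z * (1 - M))   # observed part of the cumulative exposure
    V <- rowSums(M)             # number of missing years

    # Partial likelihood (5.8): beta on the observed sum, eta on the missing count
    fit <- coxph(Surv(obs, status) ~ U + V)
    coef(fit)   # coefficient on U estimates beta0; coefficient on V estimates eta0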
Example 2: A generalization

The assumptions in Example 1 are too strong for real-world studies. First, the exposure distribution may not be as simple as a Bernoulli trial; it is generally preferable to make fewer assumptions on the exposure distribution when building the model (for example, neither the proportional hazards model nor the logistic model requires the exposure distribution to be specified). Second, the MCAR assumption may not hold in reality: the exposure and its missing status can be correlated. In fact, these two assumptions can easily be removed and the parametrization in Example 1 still works. Let us assume that $(Z_1, M_1), \ldots, (Z_K, M_K)$, as paired random variables, are i.i.d.; hence the exposure and the missing status are still independent between different years, but they can be dependent within the same year. Similar to (5.6),
\[
\phi(\cdot) = \log E\Big[e^{\beta_0 \sum_{k=1}^K Z_k M_k} \,\Big|\, M\Big]
= \sum_{k=1}^K \log E\big[e^{\beta_0 Z_k} \mid M_k = 1\big]\, M_k
= \sum_{k=1}^K \phi_k M_k. \tag{5.9}
\]
The i.i.d. assumption ensures that, as long as the moment generating function for $Z_k$ exists, i.e., $|\phi_k| < \infty$, we have $\phi_1 = \cdots = \phi_K$; hence all the $\phi_k$ can be parameterized as a single unknown $\eta$. This is exactly the parametrization that leads to the partial likelihood (5.8). This example demonstrates the flexibility of the induced intensity method: the missing mechanism does not play a significant role in constructing the model.

Example 3: Further generalization with additional confounders

In epidemiologic studies there generally exist potential confounders besides the primary exposure, such as gender, ethnicity and age. They may confound the estimate of relative risk if they both contribute to the risk function and correlate with the exposure, so they are present in the model in most situations. The induced intensity method can be extended to accommodate confounders. Assume we have a subject-level confounder $C$ which is always observed. If the $(Z_k, M_k)$ are i.i.d. over the years, as in Example 2, the task boils down to parameterizing
\[
\phi_k(C) = \log E\big[e^{\beta_0 Z_k} \mid C, M_k = 1\big].
\]
When $C$ is categorical with $m$ possible values, the parametrization is naturally
\[
\phi_k(C) = \sum_{j=0}^{m-1} \eta_j I_{(C = c_j)}.
\]
However, if $m$ is big, one has to pay close attention to over-parametrization, especially in a small-scale study where too many parameters may severely reduce the model efficiency. A strategy to avoid this is to collapse $C$ into fewer categories. If $C$ is a continuous variable within a finite range of values, such as weight or height, $\phi_k(C)$ as a function of $C$ is non-parametric unless its exact closed form is known. We can categorize $C$ so that the problem reduces to the situation previously described, or resort to other model-building techniques, for example polynomial approximation,
\[
\phi_k(C) = \sum_{j=0}^{s} \eta_j C^j.
\]
One can start from $s = 0$ and increase the power by adding more parameters until a satisfactory fit is achieved.

Example 4: An intuitive approach when individual years of exposure are correlated

Recall that we are considering the cumulative exposure over a period of time; the exposures measured at time points $k = 1, \ldots, K$ are likely correlated with each other, so the within-subject independence assumption for $Z$ may be too restrictive. In this example we propose a functional form for the induced intensity without within-subject independence. Suppose the components of $Z = \{Z_1, \ldots, Z_K\}$ share the same marginal distribution, with correlation matrix $\mathrm{Corr}(Z) = V = (\rho_{ij})$. One may intuitively propose that the induced part of the intensity has the form
\[
\begin{aligned}
\phi(\cdot) &\doteq \log E\Big[e^{\beta \sum_{k=1}^K Z_k M_k} \,\Big|\, M_1, \ldots, M_K,\, Z_1(1 - M_1), \ldots, Z_K(1 - M_K)\Big] \\
&\doteq \eta \sum_{k=1}^K M_k + \gamma\, \mathbf{1}_K\, \mathrm{diag}(M)\, V\, \mathrm{diag}(\mathbf{1}_K - M)\, Z^T,
\end{aligned}
\]
where $\mathbf{1}_K$ is the $K$-dimensional unit row vector and $\gamma$ is an unknown parameter. The idea is to construct the induced part of the intensity in such a way that the dependency among $Z = \{Z_1, \ldots, Z_K\}$ is reflected by the additional interaction term. Note that if the $Z_k$ are mutually independent, then $V$ is the identity matrix and $\mathrm{diag}(M)\,V\,\mathrm{diag}(\mathbf{1}_K - M) = 0$, so $\phi(\cdot)$ simply reduces to $\eta \sum_{k=1}^K M_k$. As an example, suppose $K = 3$, $Z_2$ is missing and $Z_1, Z_3$ are observed; then
\[
\mathrm{diag}(M)\, V\, \mathrm{diag}(\mathbf{1}_K - M)
= \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \rho_{11} & \rho_{12} & \rho_{13} \\ \rho_{21} & \rho_{22} & \rho_{23} \\ \rho_{31} & \rho_{32} & \rho_{33} \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} 0 & 0 & 0 \\ \rho_{21} & 0 & \rho_{23} \\ 0 & 0 & 0 \end{pmatrix},
\]
hence
\[
\phi(\cdot) = \log E\big[e^{\beta Z_2} \mid T \ge t,\, M_1 = 0, M_2 = 1, M_3 = 0,\, Z_1, Z_3\big] \doteq \eta + \gamma(\rho_{21} Z_1 + \rho_{23} Z_3).
\]
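As a small numerical illustration of this construction, the R sketch below (with a made-up correlation matrix) evaluates the interaction term for the worked $K = 3$ case.

    # Sketch of the Example 4 interaction term for K = 3 with Z2 missing; the
    # correlation matrix Vc is a hypothetical illustration.
    Vc <- matrix(c(1.0, 0.5, 0.2,
                   0.5, 1.0, 0.5,
                   0.2, 0.5, 1.0), 3, 3)
    M  <- c(0, 1, 0)                 # year 2 missing
    Z  <- c(1.2, 0, 0.7)             # the missing year is entered as 0
    inter <- drop(rep(1, 3) %*% (diag(M) %*% Vc %*% diag(1 - M)) %*% Z)
    inter                            # rho21*Z1 + rho23*Z3 = 0.5*1.2 + 0.5*0.7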
Example 5: An example where the exposure is measured along the observation

In epidemiologic studies with time-dependent exposure, the exposure is rarely measured in a continuous manner, due to practical limitations. More frequently, the exposure is measured intermittently along the time line, and the measured exposure is assumed to remain constant until the next measuring point. Similarly, the missing indicator for the exposure follows the same interval pattern. Consider a univariate covariate of interest $Z(t)$. Suppose the entire study period $(0, \tau]$ is divided into $K$ partitions with cut points $0 = t_0 < \cdots < t_K = \tau$, i.e., $(0, \tau] = \bigcup_{k=1}^K (t_{k-1}, t_k]$. On the $k$th interval $(t_{k-1}, t_k]$, $Z(t)$ equals some random variable $Z_k$. $Z_k$ may or may not be observable, depending on the missing indicator $M_k$ associated with the $k$th interval, which is itself a Bernoulli random variable with $\Pr\{M_k = 1\} = p_k$. We also have a $p$-vector of subject-level covariates $C(t)$, which may include potential confounding factors and is assumed always observable. The cumulative exposure is then
\[
\int_0^t Z(u)\,du = \sum_{k=1}^{K(t)-1} (t_k - t_{k-1})\, Z_k + (t - t_{K(t)-1})\, Z_{K(t)},
\]
where $K(t)$ is the index of the interval that contains $t$.
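A short R sketch of this piecewise-constant summary (the helper cum_exposure and its inputs are illustrative, not part of the thesis):

    # Cumulative exposure under the piecewise-constant measurement scheme:
    # cuts = c(t0, ..., tK); Z[k] is the level on (t_{k-1}, t_k].
    cum_exposure <- function(t, cuts, Z) {
      k <- findInterval(t, cuts, left.open = TRUE)   # index K(t) of t's interval
      if (k <= 1) return((t - cuts[1]) * Z[1])
      sum(diff(cuts[1:k]) * Z[1:(k - 1)]) + (t - cuts[k]) * Z[k]
    }
    cum_exposure(2.5, cuts = 0:5, Z = c(1, 0, 2, 1, 0))  # exposure up to t = 2.5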
Example 6: Multi-valued missing indicator

In the previous examples, the method based on the missing indicator-induced intensity uses a binary missing indicator: 1 means missing and 0 means non-missing. However, this may be an oversimplification in practical studies, where the missingness may be caused by different mechanisms. If the missing values happen to be distributed differently with respect to the missing types, failure to address this issue may become a source of bias. A good example can be found in the pesticide - prostate cancer pilot study: the missing years of exposure (in fact the missing addresses) arise for two reasons, the address being completely unknown, or the address being known but out of the counties (OOC). It is reasonable to assume the pesticide exposure at an unknown address to be higher on average than the exposure at an OOC address, because the unknown addresses have a higher chance of lying inside a rural area than the OOC addresses, which we know for sure are outside the highly exposed study area. Suppose only the exposure $Z(t)$ contains missing values and there are $m$ missing types. We can define a multi-valued missing indicator $M^+(t) = i$, $i = 1, \ldots, m$, if $Z(t)$ is missing with missing type $i$, and $M^+(t) = 0$ if $Z(t)$ is observed. Let $\delta(x)$ be the sign function; then the observable filtration is $\mathcal{H}^+_t = \sigma\{N(u), Y(u), Z(u)\circ(1 - \delta(M^+(u))), M^+(u) : u \le t\}$, and the induced intensity (5.5) extends to
\[
\lambda(t; \mathcal{H}^+) = Y(t)\lambda_0(t)\, e^{\{\int_0^t Z(u)\circ(1 - \delta(M^+(u)))\,du\}\beta}\; E\Big[e^{\{\int_0^t Z(u)\circ\delta(M^+(u))\,du\}\beta} \,\Big|\, \mathcal{H}^+_t\Big].
\]

5.3 Asymptotic relative efficiency with cohort study design

In this section we calculate the asymptotic information for $\beta$ in a cohort study design using the induced intensity approach, as well as the relative efficiency compared to the case when no data are missing. We use the same notation and settings as Example 1 of Section 5.2. Assuming rare disease and non-informative censoring, when $p_0$ is unknown, $\beta_0$ can be estimated by maximizing the partial likelihood
\[
L(\beta, \eta) = \prod_{i=1}^n \prod_{0 \le t \le \tau} \Bigg\{\frac{r_i(\beta, \eta, t)}{\sum_j^n Y_j(t)\, r_j(\beta, \eta, t)}\Bigg\}^{dN_i(t)},
\]
where
\[
r_i(\beta, \eta, t) = \exp\Big\{\beta \sum_{k=1}^K Z_{ik}(1 - M_{ik}) + \eta \sum_{k=1}^K M_{ik}\Big\}.
\]
The corresponding score function follows as
\[
U(\beta, \eta) = \frac{\partial}{\partial(\beta, \eta)} \log L(\beta, \eta)
= \sum_{i=1}^n \int_0^\tau W_i\, dN_i(t) - \int_0^\tau \Bigg\{\frac{\sum_{j=1}^n W_j\, Y_j(t)\, r_j(\beta, \eta, t)}{\sum_{j=1}^n Y_j(t)\, r_j(\beta, \eta, t)}\Bigg\}\, d\bar{N}(t),
\]
where $W_i = \big(\sum_{k=1}^K Z_{ik}(1 - M_{ik}),\; \sum_{k=1}^K M_{ik}\big)^T$.

Before we carry out the calculation, it is worth noting that $\eta = \log(1 - p + pe^{\beta})$ always holds when $p$ is considered an unknown parameter; in particular, at the true value we have
\[
\eta_0 = \log(1 - p_0 + p_0 e^{\beta_0}). \tag{5.10}
\]
The calculation of the asymptotic information matrix for $(\beta, \eta)$ is based on the asymptotic theory in Andersen and Gill (1982), from which we borrow the notation
\[
S^{(r)}(\beta, \eta, t) = \frac{1}{n}\sum_{i=1}^n W_i^{\otimes r}\, Y_i(t)\, r_i(\beta, \eta, t), \qquad r = 0, 1, 2.
\]
We can calculate the limit functions as follows, using the predetermined distribution of $(Z, M)$:
\[
\begin{aligned}
s^{(0)}(\beta, \eta, t) &= \lim_{n\to\infty} S^{(0)}(\beta, \eta, t) = E[Y_1(t)\, r_1(\beta, \eta, t)] \\
&= \big(E[\exp\{\beta Z_{11}(1 - M_{11}) + \eta M_{11}\}]\big)^K \cdot E[Y_1(t)] \\
&= \{(1 - p_0)(1 - \pi_0) + p_0(1 - \pi_0)e^{\beta} + \pi_0 e^{\eta}\}^K \cdot y(t),
\end{aligned}
\]
where $y(t) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n Y_i(t)$. Now let $\varphi(\beta, \eta) = (1 - p_0)(1 - \pi_0) + p_0(1 - \pi_0)e^{\beta} + \pi_0 e^{\eta}$. Clearly $\varphi(\beta_0, \eta_0) = 1 - p_0 + p_0 e^{\beta_0} = e^{\eta_0}$ holds, due to (5.10). For simplicity write $\varphi = \varphi(\beta, \eta)$ and $\varphi_0 = \varphi(\beta_0, \eta_0)$. The rest of the limit functions are obtained by differentiating $s^{(0)}(\beta, \eta, t)$ with respect to $(\beta, \eta)$:
\[
s^{(1)}(\beta, \eta, t) = \frac{\partial}{\partial(\beta, \eta)}\, s^{(0)}(\beta, \eta, t)
= K\varphi^{K-1} \begin{pmatrix} p_0(1 - \pi_0)e^{\beta} \\ \pi_0 e^{\eta} \end{pmatrix} \cdot y(t),
\]
\[
s^{(2)}(\beta, \eta, t) = \frac{\partial^2}{\partial(\beta, \eta)^2}\, s^{(0)}(\beta, \eta, t)
= K\varphi^{K-2}\begin{pmatrix}
\varphi p_0(1 - \pi_0)e^{\beta} + (K-1)p_0^2(1 - \pi_0)^2 e^{2\beta} & (K-1)p_0\pi_0(1 - \pi_0)e^{\beta + \eta} \\
(K-1)p_0\pi_0(1 - \pi_0)e^{\beta + \eta} & \varphi\pi_0 e^{\eta} + (K-1)\pi_0^2 e^{2\eta}
\end{pmatrix} \cdot y(t),
\]
(the lower-right entry follows from the same differentiation as the upper-left, with $\pi_0 e^{\eta}$ in place of $p_0(1-\pi_0)e^{\beta}$),
\[
s^{(1)}(\beta, \eta, t)^{\otimes 2} = K^2\varphi^{2K-2}\begin{pmatrix}
p_0^2(1 - \pi_0)^2 e^{2\beta} & p_0\pi_0(1 - \pi_0)e^{\beta + \eta} \\
p_0\pi_0(1 - \pi_0)e^{\beta + \eta} & \pi_0^2 e^{2\eta}
\end{pmatrix} \cdot y^2(t),
\]
\[
v(\beta, \eta, t) = \frac{s^{(2)}(\beta, \eta, t)}{s^{(0)}(\beta, \eta, t)} - \Bigg\{\frac{s^{(1)}(\beta, \eta, t)}{s^{(0)}(\beta, \eta, t)}\Bigg\}^{\otimes 2}
= K\varphi^{-2}\begin{pmatrix}
\varphi p_0(1 - \pi_0)e^{\beta} - p_0^2(1 - \pi_0)^2 e^{2\beta} & -p_0\pi_0(1 - \pi_0)e^{\beta + \eta} \\
-p_0\pi_0(1 - \pi_0)e^{\beta + \eta} & \varphi\pi_0 e^{\eta} - \pi_0^2 e^{2\eta}
\end{pmatrix}.
\]
According to Andersen and Gill (1982), under certain mild regularity conditions, the asymptotic information matrix for the parameter $(\beta, \eta)$ at the true value $(\beta_0, \eta_0)$ is
\[
\begin{aligned}
\Sigma &= \int_0^\tau v(\beta_0, \eta_0, t)\, s^{(0)}(\beta_0, \eta_0, t)\, \lambda_0(t)\, dt \\
&= K\varphi_0^{K-2}\begin{pmatrix}
\varphi_0 p_0(1 - \pi_0)e^{\beta_0} - p_0^2(1 - \pi_0)^2 e^{2\beta_0} & -p_0\pi_0(1 - \pi_0)e^{\beta_0 + \eta_0} \\
-p_0\pi_0(1 - \pi_0)e^{\beta_0 + \eta_0} & \varphi_0\pi_0 e^{\eta_0} - \pi_0^2 e^{2\eta_0}
\end{pmatrix} \cdot \int_0^\tau y(t)\lambda_0(t)\, dt
= \begin{pmatrix} \sigma_{\beta\beta} & \sigma_{\beta\eta} \\ \sigma_{\eta\beta} & \sigma_{\eta\eta} \end{pmatrix}. \tag{5.11}
\end{aligned}
\]
When all the $Z_{ik}$ can be observed, i.e., $\pi_0 = 0$, the model reduces to having $\beta$ as the only parameter. The asymptotic information for $\beta$ is simply
\[
\begin{aligned}
\sigma_{\beta\beta} &= K\varphi_0^{K-2}\{\varphi_0 p_0(1 - \pi_0)e^{\beta_0} - p_0^2(1 - \pi_0)^2 e^{2\beta_0}\} \cdot \int_0^\tau y(t)\lambda_0(t)\, dt \\
&= K\{1 - p_0 + p_0 e^{\beta_0}\}^{K-2}\, p_0(1 - p_0)e^{\beta_0} \cdot \int_0^\tau y(t)\lambda_0(t)\, dt, \tag{5.12}
\end{aligned}
\]
which is what we call the full cohort information when no missing data are involved. When $\beta_0 > 0$, $\varphi_0 = 1 - p_0 + p_0 e^{\beta_0} > 1$, so the asymptotic information $\sigma_{\beta\beta}$ increases as the length of the exposure history $K$ increases. On the other hand, if $\beta_0 < 0$, $\sigma_{\beta\beta}$ decreases as $K$ increases once $K$ is large enough; this finding is interesting because it contradicts the common belief that more data generally yield greater information for the parameter.

When missing data in the exposure $Z$ exist, with $\pi_0 > 0$, the parameter $\eta$ is no longer negligible. The information component corresponding to $\beta$ can be obtained from the asymptotic information matrix (5.11) as
\[
(\Sigma^{-1})^{-1}_{\beta\beta} = \sigma_{\beta\beta} - \sigma_{\beta\eta}\sigma_{\eta\eta}^{-1}\sigma_{\eta\beta}
= K\varphi_0^{K-1}(1 - \pi_0)\, \frac{p_0(1 - p_0)e^{\beta_0}}{1 - p_0 + p_0 e^{\beta_0}} \cdot \int_0^\tau y(t)\lambda_0(t)\, dt. \tag{5.13}
\]
The relative efficiency compared to the full cohort with complete data then follows by dividing (5.13) by (5.12); it is simply $1 - \pi_0$. Note that $(1 - \pi_0)^K$ is the relative efficiency of the complete case analysis (CCA), which only includes the subjects with no missing spot in the exposure history. It is evident that the induced intensity method always provides better efficiency than the CCA method, since $1 - \pi_0 > (1 - \pi_0)^K$ as long as $K > 1$. When $K = 1$, both methods yield the same relative efficiency, suggesting no efficiency gain for the induced intensity method.
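The two closed-form informations can be checked numerically; the R sketch below (dropping the common integral, which cancels in every ratio) reproduces the $1 - \pi_0$ efficiency of (5.13) relative to (5.12) and the $(1 - \pi_0)^K$ efficiency of CCA.

    # Numerical check (sketch): the common integral over y(t)*lambda0(t) is
    # dropped, since it cancels in every efficiency ratio below.
    info_full <- function(K, p, b) {            # full-data information (5.12)
      phi <- 1 - p + p * exp(b)
      K * phi^(K - 2) * p * (1 - p) * exp(b)
    }
    info_mind <- function(K, p, b, pi0) {       # induced-intensity info (5.13)
      phi <- 1 - p + p * exp(b)
      K * phi^(K - 1) * (1 - pi0) * p * (1 - p) * exp(b) / phi
    }
    info_mind(5, 0.3, 0.2, 0.2) / info_full(5, 0.3, 0.2)   # 0.8 = 1 - pi0
    (1 - 0.2)^5                                 # CCA relative efficiency: 0.328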
Now let us consider the situation where $\eta_0$ is somehow known to be $\log(1 - p_0 + p_0 e^{\beta_0})$. The asymptotic information for $\beta$ is then given by
\[
\begin{aligned}
\sigma_{\beta\beta} &= K\varphi_0^{K-2}\{\varphi_0 p_0(1 - \pi_0)e^{\beta_0} - p_0^2(1 - \pi_0)^2 e^{2\beta_0}\} \cdot \int_0^\tau y(t)\lambda_0(t)\, dt \\
&= K\varphi_0^{K-1}(1 - \pi_0)\, \frac{p_0(1 - p_0)e^{\beta_0}}{1 - p_0 + p_0 e^{\beta_0}} \left\{1 + \frac{\pi_0 p_0 e^{\beta_0}}{1 - p_0}\right\} \int_0^\tau y(t)\lambda_0(t)\, dt. \tag{5.14}
\end{aligned}
\]
Note that the information with known $\eta_0$ differs from the information with unknown $\eta_0$ (5.13) by the factor $1 + \pi_0 p_0 e^{\beta_0}/(1 - p_0)$, so the information with known $\eta_0$ can be regarded as an upper bound for the asymptotic information.

5.4 Generalized results of asymptotic efficiency

As discussed in Example 2, the assumptions under which the previous asymptotic efficiency calculation was carried out are often too strong for real-world situations. In fact the calculation can be generalized while relaxing the model assumptions, as shown in the following theorem.

Theorem 1. Consider the setting in Example 1, but do not specify the distribution for $Z$; instead assume that the moment generating function for $Z_{ik}$ exists. Furthermore, relax the independence assumption between $Z$ and $M$, assuming only that $\{(Z_{ik}, M_{ik});\, i = 1, \ldots, n;\, k = 1, \ldots, K\}$ are i.i.d. Then the asymptotic information for $\beta$ under the partial likelihood (5.8) is given by
\[
(1 - \pi_0)\, K\, (m(\beta_0))^{K-1} \left\{ m_0''(\beta_0) - \frac{(m_0'(\beta_0))^2}{m_0(\beta_0)} \right\} \cdot \int_0^\tau y(t)\lambda_0(t)\, dt, \tag{5.15}
\]
where $m(\beta) = E[e^{\beta Z_{11}}]$, $m_0(\beta) = E[e^{\beta Z_{11}} \mid M_{11} = 0]$ and $m_1(\beta) = E[e^{\beta Z_{11}} \mid M_{11} = 1]$.

Proof. The proof is similar to the asymptotic information calculation in Example 1. Since $\{(Z_{ik}, M_{ik})\}$ are i.i.d., we have
\[
s^{(0)}(\beta, \eta, t) = \lim_{n\to\infty} S^{(0)}(\beta, \eta, t) = \varphi^K(\beta, \eta) \cdot y(t),
\]
where
\[
\begin{aligned}
\varphi(\beta, \eta) &= E\big[e^{\beta Z_{11}(1 - M_{11}) + \eta M_{11}}\big] = E\Big(E\big[e^{\beta Z_{11}(1 - M_{11}) + \eta M_{11}} \mid M_{11}\big]\Big) \\
&= (1 - \pi_0)E\big[e^{\beta Z_{11}} \mid M_{11} = 0\big] + \pi_0 e^{\eta} = (1 - \pi_0)m_0(\beta) + \pi_0 e^{\eta}.
\end{aligned}
\]
In particular, since $e^{\eta_0} = m_1(\beta_0)$, it follows that
\[
\begin{aligned}
\varphi_0 &= \varphi(\beta_0, \eta_0) = (1 - \pi_0)m_0(\beta_0) + \pi_0 m_1(\beta_0) \\
&= (1 - \pi_0)E\big[e^{\beta_0 Z_{11}} \mid M_{11} = 0\big] + \pi_0 E\big[e^{\beta_0 Z_{11}} \mid M_{11} = 1\big] \\
&= E\big(E\big[e^{\beta_0 Z_{11}} \mid M_{11}\big]\big) = E\big[e^{\beta_0 Z_{11}}\big] = m(\beta_0).
\end{aligned}
\]
Similarly,
\[
s^{(1)}(\beta, \eta, t) = \frac{\partial}{\partial(\beta, \eta)}\, s^{(0)}(\beta, \eta, t)
= K\varphi^{K-1} \begin{pmatrix} (1 - \pi_0)m_0'(\beta) \\ \pi_0 e^{\eta} \end{pmatrix} \cdot y(t),
\]
\[
\begin{aligned}
s^{(2)}(\beta, \eta, t) &= \frac{\partial^2}{\partial(\beta, \eta)^2}\, s^{(0)}(\beta, \eta, t) \\
&= K(K-1)\varphi^{K-2}\begin{pmatrix}
(1 - \pi_0)^2(m_0'(\beta))^2 & \pi_0(1 - \pi_0)m_0'(\beta)e^{\eta} \\
\pi_0(1 - \pi_0)m_0'(\beta)e^{\eta} & \pi_0^2 e^{2\eta}
\end{pmatrix} \cdot y(t)
+ K\varphi^{K-1}\begin{pmatrix}
(1 - \pi_0)m_0''(\beta) & 0 \\
0 & \pi_0 e^{\eta}
\end{pmatrix} \cdot y(t),
\end{aligned}
\]
\[
s^{(1)}(\beta, \eta, t)^{\otimes 2} = K^2\varphi^{2K-2}\begin{pmatrix}
(1 - \pi_0)^2(m_0'(\beta))^2 & \pi_0(1 - \pi_0)m_0'(\beta)e^{\eta} \\
\pi_0(1 - \pi_0)m_0'(\beta)e^{\eta} & \pi_0^2 e^{2\eta}
\end{pmatrix} \cdot y^2(t),
\]
\[
v(\beta, \eta, t) = \frac{s^{(2)}}{s^{(0)}} - \Bigg\{\frac{s^{(1)}}{s^{(0)}}\Bigg\}^{\otimes 2}
= K\varphi^{-1}\begin{pmatrix}
(1 - \pi_0)m_0''(\beta) & 0 \\
0 & \pi_0 e^{\eta}
\end{pmatrix}
- K\varphi^{-2}\begin{pmatrix}
(1 - \pi_0)^2(m_0'(\beta))^2 & \pi_0(1 - \pi_0)m_0'(\beta)e^{\eta} \\
\pi_0(1 - \pi_0)m_0'(\beta)e^{\eta} & \pi_0^2 e^{2\eta}
\end{pmatrix}.
\]
Now applying the equalities $\varphi_0 = (1 - \pi_0)m_0(\beta_0) + \pi_0 m_1(\beta_0) = m(\beta_0)$ and $e^{\eta_0} = m_1(\beta_0)$, the asymptotic information for $(\beta, \eta)$ can be obtained:
\[
\begin{aligned}
\Sigma &= \int_0^\tau v(\beta_0, \eta_0, t)\, s^{(0)}(\beta_0, \eta_0, t)\, \lambda_0(t)\, dt \\
&= (1 - \pi_0)K\varphi_0^{K-2}\begin{pmatrix}
m(\beta_0)m_0''(\beta_0) - (1 - \pi_0)(m_0'(\beta_0))^2 & -\pi_0 m_0'(\beta_0)m_1(\beta_0) \\
-\pi_0 m_0'(\beta_0)m_1(\beta_0) & \pi_0 m_0(\beta_0)m_1(\beta_0)
\end{pmatrix} \times \int_0^\tau y(t)\lambda_0(t)\, dt
= \begin{pmatrix} \sigma_{\beta\beta} & \sigma_{\beta\eta} \\ \sigma_{\eta\beta} & \sigma_{\eta\eta} \end{pmatrix}. \tag{5.16}
\end{aligned}
\]
Finally, the information component for $\beta$ is
\[
\begin{aligned}
(\Sigma^{-1})^{-1}_{\beta\beta} &= \sigma_{\beta\beta} - \sigma_{\beta\eta}\sigma_{\eta\eta}^{-1}\sigma_{\eta\beta} \\
&= (1 - \pi_0)K\varphi_0^{K-2}\left\{ m(\beta_0)m_0''(\beta_0) - (1 - \pi_0)(m_0'(\beta_0))^2 - \frac{\pi_0 m_1(\beta_0)}{m_0(\beta_0)}(m_0'(\beta_0))^2 \right\} \cdot \int_0^\tau y(t)\lambda_0(t)\, dt \\
&= (1 - \pi_0)\, K\, (m(\beta_0))^{K-1}\left\{ m_0''(\beta_0) - \frac{(m_0'(\beta_0))^2}{m_0(\beta_0)} \right\} \cdot \int_0^\tau y(t)\lambda_0(t)\, dt. \tag{5.17}
\end{aligned}
\]
As a trivial corollary, the relative efficiency achieved by the induced intensity method against the full data model is still $1 - \pi_0$, the non-missing proportion of the data. On the contrary, the efficiency of the CCA method suffers an even larger loss when the chance of being missing depends on the exposure status; the simulation results later in this chapter show that the efficiency can be less than $(1 - \pi_0)^K$, which is already crippling for a large number of years.

This theorem and Example 2 together demonstrate the advantage of the induced intensity method: a missing mechanism such as MCAR or MAR is no longer a concern, because the missing mechanism has already been factored in when the parametrization of the induced intensity takes place; at the same time, the method achieves an efficiency as high as one can expect.

5.5 Asymptotic efficiency with nested case-control sampling

One of the fundamental assumptions of the missing indicator induced intensity method is the rare disease assumption, which makes a cohort study design inefficient and costly. We may instead consider a 1:(m-1) nested case-control study design to improve cost efficiency. The parametrization of $\phi(\cdot)$ is exactly the same as under the cohort design, since there is no difference in the intensity; the sampling process and the resulting partial likelihood are, on the other hand, different. The calculation of the asymptotic information is considerably harder than in the cohort design, even under assumptions as strong as those in Example 1.

Let $\mathcal{R}_j$, $j = 1, \ldots, n$, be the sampled risk sets, write $U_i = \sum_{k=1}^K Z_{ik}(1 - M_{ik})$ and $V_i = \sum_{k=1}^K M_{ik}$ for the components of $W_i$, and define the summations in a similar manner:
\[
S^{(0)}_{\mathcal{R}_j}(\beta_0, \eta_0) = \sum_{i\in\mathcal{R}_j} \exp\{\beta_0 U_i + \eta_0 V_i\},
\]
\[
S^{(1)}_{\mathcal{R}_j}(\beta_0, \eta_0) = \sum_{i\in\mathcal{R}_j} \begin{pmatrix} U_i \\ V_i \end{pmatrix} \exp\{\beta_0 U_i + \eta_0 V_i\},
\]
\[
S^{(2)}_{\mathcal{R}_j}(\beta_0, \eta_0) = \sum_{i\in\mathcal{R}_j} \begin{pmatrix} U_i^2 & U_iV_i \\ U_iV_i & V_i^2 \end{pmatrix} \exp\{\beta_0 U_i + \eta_0 V_i\}.
\]
Using the asymptotic theory in Goldstein and Langholz (1992), we can derive the asymptotic information matrix as
\[
\Gamma = \int_0^\tau \frac{1}{m}\, E\Bigg\{S^{(2)}_{\mathcal{R}}(\beta_0, \eta_0) - \frac{S^{(1)\otimes 2}_{\mathcal{R}}(\beta_0, \eta_0)}{S^{(0)}_{\mathcal{R}}(\beta_0, \eta_0)}\Bigg\}\, y(t)\lambda_0(t)\, dt,
\]
where we suppress the index $j$ by defining $\mathcal{R} = \{1, \ldots, m\}$. Note the difference from the full cohort information:
\[
\Sigma = \int_0^\tau \Bigg\{s^{(2)} - \frac{s^{(1)\otimes 2}}{s^{(0)}}\Bigg\}\, \lambda_0(t)\, dt.
\]
Under the cohort setting the corresponding summation functions converge to $s^{(0)}$, $s^{(1)}$ and $s^{(2)}$, which have simple expressions thanks to the law of large numbers; this technique does not work for the nested case-control design, however, because the sampled risk set contains a fixed number of subjects. The difficulty is that the expectation must be taken as a whole, and there is no simple closed-form expression even under assumptions as strong as those of Example 1. Nonetheless, it is possible to compute the information matrix numerically for given true values of $\beta_0$. Under the settings of Example 1, the information matrix can be written as
\[
\Gamma = \Gamma^* \int_0^\tau y(t)\lambda_0(t)\, dt, \qquad\text{where}\qquad
\Gamma^* = \frac{1}{m}\, E\Bigg\{S^{(2)}_{\mathcal{R}}(\beta_0, \eta_0) - \frac{S^{(1)\otimes 2}_{\mathcal{R}}(\beta_0, \eta_0)}{S^{(0)}_{\mathcal{R}}(\beta_0, \eta_0)}\Bigg\},
\]
which is the only part relevant to the relative efficiency, since the integral cancels in the ratio. Under the same settings we know $V \sim B(K, \pi_0)$ and $U \mid V \sim B(K - V, p_0)$, hence the joint probability mass function is
\[
\Pr(U = u, V = v) = \Pr(U = u \mid V = v)\Pr(V = v)
= \binom{K}{v}\binom{K - v}{u}\, \pi_0^v (1 - \pi_0)^{K - v}\, p_0^u (1 - p_0)^{K - v - u}.
\]
We denote this probability by $p_{u,v}$.
The calculation can be carried out by enumerating all possible combinations of $(u, v)$ and adding them up, that is,
\[
\Gamma^* = \sum_{(u_1, v_1)} \cdots \sum_{(u_m, v_m)} \frac{1}{m}\Bigg\{S^{(2)}_{\mathcal{R}} - \frac{S^{(1)\otimes 2}_{\mathcal{R}}}{S^{(0)}_{\mathcal{R}}}\Bigg\}\, p_{u_1, v_1} \cdots p_{u_m, v_m}, \tag{5.18}
\]
where each pair ranges over $0 \le v_j \le K$, $0 \le u_j \le K - v_j$.

The theoretical asymptotic efficiency calculation based on (5.18) is programmed in R code (a sketch appears after the discussion below). Table (5.1) shows the results for various settings, consisting of all combinations of the following parameters: the true value $\beta_0 = 0, 0.1$ and $0.2$; a low missing proportion $\pi_0 = 0.2$ and a high missing proportion $\pi_0 = 0.4$; the number of exposures for each subject $K = 1, 2, 5$ and $10$; and finally the number of subjects in each sampled risk set $m = 2, 3$ and $4$ (i.e., 1:1, 1:2 and 1:3 matching).

Table 5.1: Efficiency calculation for NCC design

β₀/π₀    K    1:1 Efficiency   1:2 Efficiency   1:3 Efficiency
0.2/0.2  1    0.800698         0.8003235        0.8001854
         2    0.802079         0.8012887        0.8009249
         5    0.806113         0.8041648        0.8031545
         10   0.812522         0.8088917        0.8068973
0.1/0.2  1    0.8001713        0.8000777        0.8001854
         2    0.800513         0.8003106        0.8009249
         5    0.801531         0.8010082        0.8031545
         10   0.8032056        0.8021681        0.8016376
0/0.2    any  0.8              0.8              0.8
0.2/0.4  1    0.601047         0.6004848        0.6002779
         2    0.603124         0.60193333       0.6013869
         5    0.609231         0.6062645        0.6047359
         10   0.619051         0.6134324        0.6103786
0.1/0.4  1    0.600257         0.6001165        0.6000662
         2    0.6007698        0.6004658        0.6003308
         5    0.6023004        0.6015133        0.6011264
         10   0.6048256        0.6032574        0.6024577
0/0.4    any  0.6              0.6              0.6

As expected from the theoretical calculations, the asymptotic relative efficiency under nested case-control sampling is very close to $1 - \pi_0$, which is exactly the efficiency under the cohort design; in fact the calculated efficiency is slightly higher than $1 - \pi_0$. In general, one would expect an efficiency loss of no less than $\pi_0$ with missing data. One possible explanation for the higher-than-expected gain is that the true value in the induced intensity may be slightly biased toward the null compared to the true value in the original intensity without missing data, resulting in a slightly lower variance for the estimated parameter. Another conjecture is that, since the outcome is always observed, the overall missing proportion of the data is lower than $\pi_0$, hence the efficiency gain. Other trends worth mentioning from the calculation results are: 1) the efficiency improves as $\beta_0$ approaches the null; 2) the efficiency improves as the number of exposure years increases; and 3) the efficiency gets slightly lower as the number of subjects per risk set increases.
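A minimal R sketch of the enumeration (5.18) under the Example 1 setting (an illustration, not the thesis's actual R program):

    # Sketch of (5.18) for a sampled risk set of size m, Example 1 assumptions.
    gamma_star <- function(K, p0, pi0, beta0, m = 2) {
      eta0 <- log(1 - p0 + p0 * exp(beta0))
      # joint pmf p_{u,v}: V ~ Bin(K, pi0), U | V ~ Bin(K - V, p0)
      st <- expand.grid(v = 0:K, u = 0:K)
      st <- st[st$u + st$v <= K, ]
      st$pr <- dbinom(st$v, K, pi0) * dbinom(st$u, K - st$v, p0)
      idx <- as.matrix(expand.grid(rep(list(seq_len(nrow(st))), m)))
      G <- matrix(0, 2, 2)
      for (r in seq_len(nrow(idx))) {
        sel <- idx[r, ]
        W   <- cbind(st$u[sel], st$v[sel])        # the m pairs (U_i, V_i)
        rr  <- exp(beta0 * W[, 1] + eta0 * W[, 2])
        s0  <- sum(rr)
        s1  <- colSums(W * rr)
        s2  <- t(W) %*% (W * rr)
        G   <- G + (s2 - tcrossprod(s1) / s0) / m * prod(st$pr[sel])
      }
      G
    }
    info_beta <- function(G) G[1, 1] - G[1, 2]^2 / G[2, 2]
    G1 <- gamma_star(K = 2, p0 = 0.3, pi0 = 0.2, beta0 = 0.2)
    G0 <- gamma_star(K = 2, p0 = 0.3, pi0 = 0,   beta0 = 0.2)
    info_beta(G1) / G0[1, 1]   # close to 1 - pi0 = 0.8 (compare Table 5.1)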
The reader may have noticed that the efficiency is precisely $1 - \pi_0$ when the true value $\beta_0 = 0$; this is no coincidence. If $\beta_0 = 0$ then $\eta_0 = \log(1 - p_0 + p_0 e^{\beta_0}) = 0$. It follows that
\[
S^{(0)}_{\mathcal{R}} = m, \qquad
S^{(1)}_{\mathcal{R}}(\beta_0, \eta_0) = \sum_{i\in\mathcal{R}}\begin{pmatrix} U_i \\ V_i \end{pmatrix}, \qquad
S^{(2)}_{\mathcal{R}}(\beta_0, \eta_0) = \sum_{i\in\mathcal{R}}\begin{pmatrix} U_i^2 & U_iV_i \\ U_iV_i & V_i^2 \end{pmatrix},
\]
so
\[
\Gamma^* = E\Bigg[\frac{1}{m}\sum_{i\in\mathcal{R}}\begin{pmatrix} U_i^2 & U_iV_i \\ U_iV_i & V_i^2 \end{pmatrix}
- \frac{1}{m^2}\begin{pmatrix} (\sum U_i)^2 & \sum U_i \sum V_i \\ \sum U_i \sum V_i & (\sum V_i)^2 \end{pmatrix}\Bigg]
= E\begin{pmatrix} \overline{U^2} - \bar{U}^2 & \overline{UV} - \bar{U}\bar{V} \\ \overline{UV} - \bar{U}\bar{V} & \overline{V^2} - \bar{V}^2 \end{pmatrix},
\]
where $\overline{U^2} = \frac{1}{m}\sum U_i^2$, $\overline{UV} = \frac{1}{m}\sum U_iV_i$, $\bar{U} = \frac{1}{m}\sum U_i$, and $\overline{V^2}$, $\bar{V}$ are defined similarly. It follows from the independence between subjects that
\[
E\big[\overline{UV} - \bar{U}\bar{V}\big]
= E[UV] - \frac{1}{m^2}\big\{mE[UV] + m(m-1)E[U]E[V]\big\}
= \frac{m-1}{m}\,\mathrm{Cov}(U, V).
\]
Setting $V = U$ and then $U = V$, it follows immediately that
\[
E\big[\overline{U^2} - \bar{U}^2\big] = \frac{m-1}{m}\,\mathrm{Var}(U) \qquad\text{and}\qquad E\big[\overline{V^2} - \bar{V}^2\big] = \frac{m-1}{m}\,\mathrm{Var}(V).
\]
Based on the assumption that $V \sim B(K, \pi_0)$ and $U \mid V \sim B(K - V, p_0)$, a direct calculation shows that
\[
\mathrm{Var}(V) = K\pi_0(1 - \pi_0), \qquad
\mathrm{Var}(U) = Kp_0(1 - \pi_0)(1 - p_0 + p_0\pi_0), \qquad
\mathrm{Cov}(U, V) = -p_0 K\pi_0(1 - \pi_0).
\]
The asymptotic relative efficiency compared to the full data analysis is then
\[
\frac{\mathrm{Var}(U)}{\mathrm{Var}(U)\big|_{\pi_0=0}} - \frac{\mathrm{Cov}^2(U, V)}{\mathrm{Var}(V)\,\mathrm{Var}(U)\big|_{\pi_0=0}}
= \frac{(1 - \pi_0)(1 - p_0 + p_0\pi_0)}{1 - p_0} - \frac{p_0\pi_0(1 - \pi_0)}{1 - p_0} = 1 - \pi_0.
\]

5.6 Simulation study to verify the theoretical calculation

In this section we design a simulation and present its results in order to illustrate the theoretical calculations of asymptotic efficiency from the previous sections. It is worth pointing out the differences from the simulation in Chapter 4. That simulation was designed around unconditional logistic regression under a case-control study design, to compare various methods of dealing with missing data, particularly simple imputations; the exposure and missing structure were fairly complicated in an attempt to stay close to real-world situations. The simulation in this chapter targets the proportional hazards model, and the data structure is considerably simpler, to reflect the conditions under which the efficiency theory was developed.

Since the proportional hazards model deals with time-to-event data, ideally one might simulate a large population and divide the time line into many intervals; a subject would have a chance of failing at each interval based on its absolute risk at the time, and the data-generating process would continue until a certain number of cases is ascertained. However, with the rare disease assumption and a large population, this type of data-generating procedure can be computationally intensive, hence time consuming. We instead use a different approach to approximately generate the cohort data, one that is easy to carry out and allows the nested case-control design to be accommodated under the same framework. Instead of treating the whole population as one risk set, we create $N$ risk sets, each consisting of $M$ individuals, and use $i = 1, \ldots, N$ to index the risk sets and $j = 1, \ldots, M$ to index the individuals. For the $j$th individual in the $i$th risk set, we generate a string of $K$ exposures $Z_{ijk}$, $k = 1, \ldots, K$, which are i.i.d. Bernoulli trials with probability $p_0$. A string of i.i.d. missing indicators $M_{ijk}$, $k = 1, \ldots, K$, is generated the same way with missing probability $\pi_0$. We can now define the relative risk for the $j$th individual in the $i$th risk set as
\[
r_{ij} = e^{\beta_0 \sum_{k=1}^K Z_{ijk}},
\]
and the relative probability
\[
p_{ij} = \frac{r_{ij}}{\sum_{j=1}^M r_{ij}}.
\]
Clearly $\sum_j p_{ij} = 1$. The crucial part of the simulation is that, for each risk set, we sample one and only one case, with sampling probability equal to the respective relative probability $p_{ij}$, i.e., $\Pr\{\text{individual } (i, j) \text{ is a case}\} = p_{ij}$. The procedure therefore yields $N$ cases from $NM$ individuals. There are many advantages to this risk-set based sampling. First, we avoid having to sample cases over time, with $N$ cases guaranteed. Second, there is no need to specify a baseline risk function, so the dependency of the induced relative risk on the baseline risk is no longer a worry. Finally, with relatively large choices of $N$ and $M$, we can satisfy both the rare disease assumption and the convergence conditions required by the asymptotic theory.
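A compact R sketch of this risk-set generation, together with the stratified analysis it feeds (sizes here are scaled down for illustration; the thesis's analyses used SAS PROC PHREG):

    # Risk-set based data generation and the MIND analysis via conditional
    # logistic regression stratified on risk sets. Illustrative sizes only
    # (M = 100 here instead of 500, for speed).
    library(survival)

    set.seed(2)
    N <- 500; M <- 100; K <- 5; p0 <- 0.3; pi0 <- 0.2; beta0 <- 0.2
    one_riskset <- function(i) {
      Z  <- matrix(rbinom(M * K, 1, p0), M, K)
      Mi <- matrix(rbinom(M * K, 1, pi0), M, K)
      r  <- exp(beta0 * rowSums(Z))
      case <- sample.int(M, 1, prob = r / sum(r))  # exactly one case per risk set
      data.frame(set = i, case = as.numeric(seq_len(M) == case),
                 U = rowSums(Z * (1 - Mi)),        # observed cumulative exposure
                 V = rowSums(Mi))                  # number of missing years
    }
    dat <- do.call(rbind, lapply(seq_len(N), one_riskset))
    fit <- clogit(case ~ U + V + strata(set), data = dat)
    coef(fit)    # coefficient on U estimates beta0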
One may notice that the partial likelihood contribution from each risk set under this sampling scheme is exactly the same as the likelihood of a conditional logistic regression under a 1:(M-1) case-control design; it is the large $M$ (usually in the hundreds) that makes the design a good approximation to a real cohort design. It is straightforward to extend the simulation design to accommodate the analysis based on 1:(m-1) nested case-control sampling: we only need to add a second sampling step that samples $m - 1$ controls from the $M - 1$ total controls in each risk set. In this simulation we use completely random sampling, which means every control has the same probability $(m-1)/(M-1)$ of being sampled. The data analysis on the simulated data can be done with the SAS procedure PHREG, stratified on risk sets. One analysis uses the full data while the other uses the data with missing values and the induced relative risk method. The simulation is repeated 1000 times to create 1000 pairs of estimates, from which the empirical variances of the two types of estimators can be obtained, and then the relative efficiency of the induced relative risk method against the full data method.

To demonstrate Theorem 1, we adjust the way exposures and missing indicators are generated. The exposures $Z_{ijk}$ are no longer Bernoulli trials; instead, they are generated from an i.i.d. standard normal distribution. Given $Z_{ijk}$, we generate the missing indicator $M_{ijk}$ as follows: if $Z_{ijk} < 0$ then $M_{ijk} = 0$; otherwise we toss an uneven coin with head probability $2\pi_0$ to determine the value of $M_{ijk}$, 1 for heads and 0 for tails. It is easy to see that the overall probability of $M_{ijk}$ being 1 is still $\pi_0$; however, $M_{ijk}$ is no longer independent of $Z_{ijk}$, although cross-individual independence still holds. (A sketch of this generator follows the settings below.)

Simulation Results

The entire simulation study consists of three rounds. The first round demonstrates the efficiency of the induced-intensity method under the cohort design with the strong assumptions of Example 1; the second round demonstrates the efficiency under the nested case-control design; the third round uses the relaxed assumptions to demonstrate the theorem, showing that the efficiency can be maintained even when the missing indicator and exposure are correlated (although the independence among different years is still in place).

For each simulation with the cohort design, we generate 500 risk sets, each containing 1 case and 499 controls, so the overall disease rate is 0.2%, satisfying the rare disease assumption. Under the individually matched nested case-control design, we simply sample $m - 1$ controls from the 499 controls in each risk set. The simulation is repeated 500 times to obtain the empirical mean and standard deviation of the estimates, the empirical mean of the standard errors, and the efficiency relative to the FULL method, similar to the calculations in the simulation of Chapter 4. The focus of the simulation is the relative efficiency of the MIND method; therefore we include only the FULL, CCA and MIND methods in the simulation results.

Table (5.2) shows the results for the simulation under the cohort design; the simulation assumptions and settings are precisely as described in Example 1. The single-year exposure is set throughout the simulation to be an independent Bernoulli trial with probability 0.3 of being 1; the overall missing probability for each year, $\pi_0$, is 0.2 or 0.4; the number of years of exposure $K$ is either 1 or 5; and finally the true value $\beta_0$ is 0.2, 0.1 or at the null, 0.
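A sketch of the third-round generator just described (illustrative sizes):

    # Exposure-dependent missingness for the Theorem 1 check: only positive
    # exposures can be missing, yet the overall missing rate stays at pi0.
    n <- 1000; K <- 5; pi0 <- 0.2
    Z  <- matrix(rnorm(n * K), n, K)                   # standard normal exposures
    Mi <- (matrix(runif(n * K), n, K) < 2 * pi0) * (Z > 0)
    mean(Mi)    # overall missing proportion is about pi0 = 0.2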
Overall, all the methods provide satisfactory estimates in terms of bias; this is to be expected given the way the data are generated. The MIND method achieves an efficiency around $1 - \pi_0$ regardless of the true $\beta_0$ or the number of years $K$, that is, 80% or 60% depending on the missing proportion. As a comparison, the CCA method achieves $1 - \pi_0$ efficiency with only one year of exposure involved, but its efficiency deteriorates rapidly as the number of exposure years increases to 5. As shown in the theoretical calculation, CCA is expected to have $0.8^5 = 32.8\%$ efficiency with a 20% chance of missingness, or $0.6^5 = 7.8\%$ efficiency with 40% missingness. Obviously, with even more years of exposure considered, the efficiency of the CCA method becomes too low to be practically viable; the efficiency of the MIND method, on the other hand, remains unchanged.

Table (5.3) includes the results of the second-round simulation for NCC sampling. The simulation settings are similar to the cohort design, since each generated data set is used for both the cohort and NCC analyses. The differences are that for the NCC simulation we keep the missing proportion $\pi_0$ at 0.2, and the number of individually matched controls is set to 1 or 3. Again, the estimates from the CCA and MIND methods are unbiased. In terms of efficiency, the simulation of the MIND method confirms the theoretical calculation in Table (5.1): the efficiency is close to 80% for all settings, although because of the variation across individual simulations it is impossible to discern the smaller trends observed in Table (5.1), for example that 5 years of exposure should result in slightly higher efficiency than a single year of exposure. The CCA method performs exceedingly badly in terms of efficiency, which is to be expected, because a single missing value in a risk set excludes the entire risk set from the analysis. The standard deviations of the estimates are considerably higher than those from the cohort analysis under the same simulation settings; this is no surprise, because only 2 or 4 individuals in each risk set are used for the NCC analysis instead of 500 in the cohort analysis. Thus, the simulation validates the calculations provided by the theory.
Table 5.2: Efficiency simulation for cohort design

Settings    Method  Mean Est  Std Dev  Mean Std Err  Efficiency
β₀ = 0.2    FULL    0.1996    0.0911   0.0944        100.0%
K = 1       CCA     0.2017    0.1031   0.1055        78.15%
π₀ = 0.2    MIND    0.2017    0.1030   0.1055        78.30%
β₀ = 0.2    FULL    0.1979    0.0407   0.0422        100.0%
K = 5       CCA     0.1986    0.0728   0.0741        31.21%
π₀ = 0.2    MIND    0.1978    0.0458   0.0472        78.92%
β₀ = 0.1    FULL    0.0961    0.0980   0.0960        100.0%
K = 1       CCA     0.1007    0.1097   0.1073        79.94%
π₀ = 0.2    MIND    0.1006    0.1095   0.1073        80.17%
β₀ = 0.1    FULL    0.1021    0.0431   0.0429        100.0%
K = 5       CCA     0.0991    0.0760   0.0752        32.23%
π₀ = 0.2    MIND    0.1015    0.0479   0.0479        81.26%
β₀ = 0      FULL    -0.0007   0.0977   0.0978        100.0%
K = 1       CCA     -0.0038   0.1092   0.1095        80.01%
π₀ = 0.2    MIND    -0.0040   0.1092   0.1095        80.11%
β₀ = 0      FULL    -0.0021   0.0433   0.0437        100.0%
K = 5       CCA     -0.0028   0.0774   0.0768        31.36%
π₀ = 0.2    MIND    -0.0029   0.0484   0.0489        80.15%
β₀ = 0.2    FULL    0.2024    0.0935   0.0943        100.0%
K = 1       CCA     0.1986    0.1215   0.1223        59.26%
π₀ = 0.4    MIND    0.1986    0.1211   0.1222        59.60%
β₀ = 0.2    FULL    0.2013    0.0414   0.0422        100.0%
K = 5       CCA     0.2018    0.1611   0.1557        6.59%
π₀ = 0.4    MIND    0.2018    0.0530   0.0545        60.96%
β₀ = 0.1    FULL    0.0996    0.0935   0.0960        100.0%
K = 1       CCA     0.0953    0.1210   0.1243        59.77%
π₀ = 0.4    MIND    0.0951    0.1210   0.1242        59.76%
β₀ = 0.1    FULL    0.1006    0.0432   0.0429        100.0%
K = 5       CCA     0.1011    0.1585   0.1567        7.42%
π₀ = 0.4    MIND    0.0985    0.0553   0.0554        60.87%
β₀ = 0      FULL    0.0005    0.0986   0.0978        100.0%
K = 1       CCA     -0.0030   0.1275   0.1266        59.76%
π₀ = 0.4    MIND    -0.0028   0.1273   0.1265        59.91%
β₀ = 0      FULL    -0.0014   0.0451   0.0437        100.0%
K = 5       CCA     -0.0164   0.1683   0.1613        7.19%
π₀ = 0.4    MIND    -0.0016   0.0580   0.0564        60.65%

Table 5.3: Efficiency simulation for nested case-control design

Settings             Method  Mean Est  Std Dev  Mean Std Err  Efficiency
β₀ = 0.2             FULL    0.1957    0.1293   0.1361        100.0%
K = 1, π₀ = 0.2      CCA     0.1977    0.1646   0.1703        61.76%
1:1 matching         MIND    0.1941    0.1455   0.1522        78.97%
β₀ = 0.2             FULL    0.2071    0.1123   0.1099        100.0%
K = 1, π₀ = 0.2      CCA     0.2122    0.1334   0.1286        70.82%
1:3 matching         MIND    0.2110    0.1246   0.1229        81.24%
β₀ = 0.2             FULL    0.2001    0.0625   0.0621        100.0%
K = 5, π₀ = 0.2      CCA     0.2026    0.2126   0.2002        8.65%
1:1 matching         MIND    0.2003    0.0698   0.0693        80.17%
β₀ = 0.2             FULL    0.2003    0.0518   0.0497        100.0%
K = 5, π₀ = 0.2      CCA     0.2053    0.1246   0.1237        17.26%
1:3 matching         MIND    0.2001    0.0577   0.0555        80.45%
β₀ = 0.1             FULL    0.1041    0.1374   0.1373        100.0%
K = 1, π₀ = 0.2      CCA     0.1047    0.1693   0.1722        65.86%
1:1 matching         MIND    0.1020    0.1539   0.1540        79.74%
β₀ = 0.1             FULL    0.1036    0.1166   0.1114        100.0%
K = 1, π₀ = 0.2      CCA     0.1037    0.1347   0.1305        74.86%
1:3 matching         MIND    0.1025    0.1305   0.1247        79.77%
β₀ = 0.1             FULL    0.1004    0.0655   0.0617        100.0%
K = 5, π₀ = 0.2      CCA     0.1142    0.2053   0.1956        10.18%
1:1 matching         MIND    0.1016    0.0733   0.0690        79.80%
β₀ = 0.1             FULL    0.1007    0.0496   0.0499        100.0%
K = 5, π₀ = 0.2      CCA     0.1083    0.1209   0.1233        16.85%
1:3 matching         MIND    0.1018    0.0559   0.0558        78.77%
β₀ = 0               FULL    0.0008    0.1451   0.1388        100.0%
K = 1, π₀ = 0.2      CCA     -0.0003   0.1830   0.1739        62.85%
1:1 matching         MIND    0.0035    0.1619   0.1554        80.31%
β₀ = 0               FULL    0.0046    0.1152   0.1128        100.0%
K = 1, π₀ = 0.2      CCA     0.0009    0.1342   0.1320        73.67%
1:3 matching         MIND    -0.0015   0.1274   0.1264        81.79%
β₀ = 0               FULL    0.0010    0.0600   0.0619        100.0%
K = 5, π₀ = 0.2      CCA     0.0128    0.2125   0.1975        7.97%
1:1 matching         MIND    0.0009    0.0676   0.0694        78.66%
β₀ = 0               FULL    0.0016    0.0532   0.0505        100.0%
K = 5, π₀ = 0.2      CCA     -0.0025   0.1259   0.1238        17.85%
1:3 matching         MIND    0.0015    0.0594   0.0565        80.19%

To demonstrate Theorem 1, we designed an NMAR scheme in which the missingness and the exposure are correlated, although the independence across different years still holds; another difference is that the exposure is no longer binary as in the previous
settings, but instead follows a normal distribution, since the theorem imposes no restriction on the distribution. For each year $k$, we first generate the exposure from a standard normal distribution, $Z_k \sim N(0, 1)$; next, the missing status for the year is defined as
\[
M_k = I_{(u_k < 0.4)} \cdot I_{(Z_k > 0)},
\]
where $u_k$ is uniformly distributed on $[0, 1]$. With this definition, the chance of being missing clearly depends on the exposure. The unconditional missing probability turns out to be $E(M_k) = 0.4 \times 0.5 = 0.2$, the same as the missing proportion used in the previous simulation settings. Table (5.4) includes the result of one simulation with 5 years of cumulative exposure and $\beta_0 = 0.2$.

Table 5.4: Example of non-ignorable missingness from simulation

Settings    Method  Mean Est  Std Dev  Mean Std Err  Efficiency
β₀ = 0.2    FULL    0.199     0.020    0.020         100.0%
K = 5       CCA     0.201     0.038    0.039         26.8%
π₀ = 0.2    MIND    0.199     0.022    0.022         80.6%

The result reveals that both the MIND method and the CCA method work well in terms of bias. When it comes to efficiency, the result confirms what the theorem proves: with the overall missing chance being 0.2, the efficiency of the MIND method is 80.6%, very close to the theoretical 80%. It is interesting that the CCA method only reaches an efficiency of 26.8%, far below the 32.8% expected when the missingness is completely random. This extra loss of efficiency for CCA can be explained by the non-random missingness: since every missing value must have an exposure greater than 0 by construction, the CCA method is equivalent to selectively removing subjects from the study based on their exposure status, leading to an extra loss of study power. For instance, for a binary exposure with equal chances, if we remove all subjects with exposure 1 from the analysis, the efficiency of the analysis decreases to 0, even though the remaining subjects still account for 50% of the original sample size.

Chapter 6
The application on the prostate cancer study

In Chapter 2, we introduced at length the USC prostate cancer - pesticide pilot study in which the missing data pattern was observed. In the chapters that followed, the missing indicator induced intensity (MIND) method was proposed and studied, and simulations were performed to compare the MIND method against some single imputation methods and to calculate the relative efficiency. A natural question to ask is how the MIND method performs in real-world applications. Unlike the simulation study, in which we have the complete data as the gold standard against which to compare the various missing data methods, it is difficult to evaluate the methods through real-world applications. We are, however, still able to examine the results from the real-world application to verify whether trends similar to those in the simulation studies appear; for example, the simulation in Chapter 4 shows that the AVY method may attenuate the odds ratios compared to the MIND method. One concern in applying the method is that the theoretical basis for the MIND method was established under a cohort design and the Cox model; it is necessary to justify the application of the MIND method under a case-control study design and an unconditional logistic model. We leave that argument to Chapter 7.

6.1 The data and methods

The background of the study is the same as introduced in Chapter 2, so we focus on the differences in the data and the method. Since new data have become available, we increase the sample size to 170 cases and 162 controls.
In the original analysis, we calculated the average cumulative exposure and converted it into a binary exposure (non-exposed vs. exposed) or a three-category exposure (non-exposed vs. low vs. high). There are difficulties in applying the missing data methods to this type of exposure summary; for example, a mean imputation will almost guarantee a positive exposure for the imputed year, with the result that the exposure summary will always be categorized as "positive" for anyone with at least one missing year of exposure. In the original analysis we calculated odds ratios for several time intervals, including the diagnosis year only, lifetime, 1974-1999, the 15 years prior to diagnosis and the 10 years prior to diagnosis; however, as stated in Chapter 2, the GIS-based exposure evaluation method can only reliably estimate exposures from 1974 to 1999 with the tools at hand. We therefore focus on the period from 1974 to 1999 only, since our primary interest is not the type of exposure summary but rather the missing data method.

We perform the analysis on the real data for 1974-1999 based on two types of exposure summaries: average annual exposure and average years of being exposed. The average annual exposure is the average amount (in pounds) of chemical applied inside the 500-meter buffer around the individual's residence, which we consider a proxy for the actual ambient exposure. The average years of being exposed, on the other hand, does not measure the actual exposed amount, but rather focuses on long-term effect: under this summary, we consider an individual with 10 exposed years to be more exposed than an individual with only 5 years of exposure, although it is quite possible that the latter was exposed to a higher cumulative amount of the chemical. The average annual exposure is generated using GRAPES and is a non-negative continuous variable, while the average years of exposure is just the average of the dichotomized yearly exposures, hence it falls in the interval $[0, 1]$.

The MIND method is all about the parametrization of the induced intensity. For the analysis we keep using the naive approach from the simulation of Chapter 4, namely
\[
\phi(\cdot) \doteq \frac{1}{K}\sum_{k=1}^K \beta Z_k(1 - M_k) + \eta \sum_{k=1}^K M_k.
\]
Since we consider exposure from 1974 to 1999 only, $K$ is fixed at 26 in the analysis. We also include the single imputation methods considered in Chapter 4 for comparison: the average valid year (AVY) method, the zero imputation (ZI) method, the overall mean imputation (OMI) method, the overall control mean imputation (OCMI) method and the status-specific mean imputation (SSMI) method. The analysis is performed using logistic regression in the same way as in Chapter 4, for Methyl Bromide, Captan, Simazine and Organochlorines. Additional adjustment factors include age, race and home pesticide use.

6.2 Results and discussion

The results of both unadjusted and adjusted analyses are included in Tables (6.1) through (6.4). Overall, the results follow the same trends found in the simulation study of Chapter 4: the AVY and ZI methods tend to yield estimates toward the null (in this case 0); the SSMI method tends to yield estimates away from the null; the estimates from the MIND, OMI and OCMI methods are close to one another, in most cases sitting between the AVY method and
the SSMI method; and the MIND method produces slightly smaller standard errors than OMI and OCMI, which suggests higher efficiency. Since we do not have a gold standard to compare against in the real data application, we avoid judging the methods on whether they are biased or not. The results do, however, validate our previous finding that MIND is overall a better method for dealing with the missing data situation found in the prostate cancer - pesticide study.

The MIND method has many advantages in application to real data. Most notably, the naive parametrization used on the real data is extremely easy to execute: it is no different from the zero imputation method, except that the MIND method includes the number of missing years as an additional covariate. The method generally yields a smaller variance for the estimate, meaning it is more efficient than the other single imputation methods compared in the analysis. The MIND method is also flexible, since there are numerous ways to design a parametrization that best fits the situation, although this advantage cannot be shown in this analysis.

The weakness is that, while flexible, the method makes it difficult to determine the most appropriate parametrization for the induced intensity. As mentioned in the methods section, the most naive parametrization was used, which, according to the examples in Chapter 5, is the natural parametrization when the $(Z_k, M_k)$ pairs are i.i.d. However, the i.i.d. assumption does not hold for the real data. First of all, the missing chances for different years ($M_k$) are not independent, since the address solely determines whether the exposure is missing and an individual may live at the same address for several years or even decades. The exposures over the years ($Z_k$) may not be independent either: the correlation between exposures in consecutive years is certainly stronger than between years a long period apart. Moreover, the exposures may not be identically distributed throughout the years; for example, a certain pesticide product may be replaced by a newer product with a different chemical composition, or may even be banned from commercial use due to negative impacts on human health and the environment. Lastly, other confounding or effect-modifying factors in the model may require something more than the naive parametrization. The added complexity usually means a more complex parametrization (e.g., more parameters), which may in turn make the model less efficient, offsetting the gain from the more "correct" parametrization. Therefore, the violation of the assumption does not necessarily render the naive approach incorrect; it can still be acceptable, just not fully efficient.

Table 6.1: Unadjusted results for average annual exposure during 1974-1999

Chemical         Method  Estimate  StdErr  L95CI   U95CI  p-value
Methyl Bromide*  AVY     0.117     0.133   -0.143  0.377  0.378
                 MIND    0.118     0.140   -0.157  0.393  0.401
                 ZI      0.084     0.138   -0.187  0.354  0.544
                 OMI     0.119     0.142   -0.160  0.397  0.403
                 OCMI    0.112     0.141   -0.165  0.389  0.429
                 SSMI    0.125     0.143   -0.154  0.405  0.379
Captan           AVY     0.410     0.250   -0.080  0.901  0.101
                 MIND    0.546     0.300   -0.042  1.134  0.069
                 ZI      0.461     0.290   -0.108  1.030  0.113
                 OMI     0.550     0.307   -0.053  1.152  0.074
                 OCMI    0.526     0.303   -0.069  1.121  0.083
                 SSMI    0.890     0.369   0.167   1.614  0.016
Simazine         AVY     0.052     0.154   -0.248  0.353  0.733
                 MIND    0.282     0.216   -0.141  0.705  0.192
                 ZI      0.226     0.211   -0.188  0.639  0.285
                 OMI     0.293     0.220   -0.139  0.724  0.184
                 OCMI    0.278     0.218   -0.150  0.705  0.203
                 SSMI    0.380     0.231   -0.073  0.833  0.100
Organochlorine   AVY     0.887     0.262   0.373   1.400  0.001
                 MIND    0.972     0.285   0.414   1.530  0.001
                 ZI      0.822     0.266   0.301   1.343  0.002
                 OMI     0.996     0.297   0.415   1.578  0.001
                 OCMI    0.919     0.287   0.357   1.481  0.001
                 SSMI    1.890     0.434   1.040   2.740  0.000
* The unit of exposure is 100 pounds for Methyl Bromide, 10 pounds for the others.
Table 6.2: Adjusted results for average annual exposure during 1974-1999

Chemical         Method  Estimate  StdErr  L95CI   U95CI  p-value
Methyl Bromide*  AVY     0.067     0.135   -0.198  0.332  0.622
                 MIND    0.047     0.143   -0.234  0.328  0.744
                 ZI      0.025     0.141   -0.252  0.302  0.859
                 OMI     0.048     0.144   -0.235  0.330  0.740
                 OCMI    0.042     0.144   -0.239  0.324  0.768
                 SSMI    0.053     0.145   -0.230  0.336  0.714
Captan           AVY     0.424     0.254   -0.073  0.921  0.094
                 MIND    0.526     0.304   -0.071  1.122  0.084
                 ZI      0.470     0.298   -0.113  1.054  0.114
                 OMI     0.525     0.308   -0.078  1.129  0.088
                 OCMI    0.514     0.306   -0.086  1.115  0.093
                 SSMI    0.828     0.355    0.131  1.524  0.020
Simazine         AVY     0.026     0.157   -0.282  0.334  0.868
                 MIND    0.218     0.218   -0.208  0.645  0.316
                 ZI      0.185     0.215   -0.236  0.606  0.388
                 OMI     0.229     0.220   -0.202  0.661  0.298
                 OCMI    0.220     0.219   -0.210  0.649  0.316
                 SSMI    0.307     0.228   -0.139  0.754  0.178
Organochlorine   AVY     0.821     0.270    0.291  1.351  0.002
                 MIND    0.861     0.292    0.288  1.434  0.003
                 ZI      0.745     0.276    0.205  1.286  0.007
                 OMI     0.866     0.299    0.281  1.452  0.004
                 OCMI    0.817     0.293    0.244  1.390  0.005
                 SSMI    1.705     0.425    0.871  2.539  0.000

* The unit of exposure is 100 pounds for Methyl Bromide, 10 pounds for the others.

Table 6.3: Unadjusted results for average years exposed during 1974-1999

Chemical         Method  Estimate  StdErr  L95CI   U95CI  p-value
Methyl Bromide   AVY     0.678     0.663   -0.621  1.977  0.306
                 MIND    1.039     0.797   -0.524  2.602  0.193
                 ZI      0.646     0.761   -0.845  2.137  0.396
                 OMI     1.039     0.804   -0.537  2.614  0.196
                 OCMI    0.991     0.801   -0.579  2.562  0.216
                 SSMI    1.303     0.816   -0.297  2.902  0.110
Captan           AVY     0.787     0.631   -0.450  2.024  0.212
                 MIND    1.018     0.740   -0.432  2.468  0.169
                 ZI      0.698     0.713   -0.699  2.095  0.328
                 OMI     0.978     0.746   -0.485  2.440  0.190
                 OCMI    0.942     0.744   -0.516  2.400  0.205
                 SSMI    1.385     0.768   -0.120  2.889  0.071
Simazine         AVY     0.466     0.446   -0.409  1.340  0.297
                 MIND    0.958     0.554   -0.127  2.044  0.083
                 ZI      0.700     0.534   -0.346  1.747  0.190
                 OMI     0.976     0.560   -0.122  2.074  0.081
                 OCMI    0.932     0.558   -0.162  2.025  0.095
                 SSMI    1.272     0.574    0.148  2.396  0.027
Organochlorine   AVY     1.532     0.517    0.519  2.545  0.003
                 MIND    2.018     0.609    0.824  3.213  0.001
                 ZI      1.505     0.563    0.401  2.608  0.008
                 OMI     2.001     0.620    0.785  3.217  0.001
                 OCMI    1.891     0.613    0.689  3.092  0.002
                 SSMI    2.795     0.671    1.480  4.110  0.000

Table 6.4: Adjusted results for average years exposed during 1974-1999

Chemical         Method  Estimate  StdErr  L95CI   U95CI  p-value
Methyl Bromide   AVY     0.197     0.688   -1.151  1.544  0.775
                 MIND    0.333     0.829   -1.292  1.959  0.688
                 ZI      0.101     0.791   -1.450  1.651  0.899
                 OMI     0.327     0.832   -1.304  1.958  0.694
                 OCMI    0.294     0.830   -1.332  1.921  0.723
                 SSMI    0.580     0.841   -1.069  2.229  0.490
Captan           AVY     0.483     0.656   -0.803  1.769  0.461
                 MIND    0.522     0.769   -0.986  2.029  0.498
                 ZI      0.319     0.742   -1.134  1.773  0.667
                 OMI     0.471     0.773   -1.044  1.986  0.542
                 OCMI    0.453     0.771   -1.058  1.964  0.557
                 SSMI    0.871     0.791   -0.680  2.422  0.271
Simazine         AVY     0.279     0.463   -0.629  1.187  0.547
                 MIND    0.623     0.578   -0.509  1.755  0.281
                 ZI      0.459     0.559   -0.636  1.554  0.411
                 OMI     0.639     0.583   -0.503  1.781  0.273
                 OCMI    0.610     0.581   -0.528  1.749  0.293
                 SSMI    0.931     0.594   -0.234  2.096  0.117
Organochlorine   AVY     1.286     0.542    0.225  2.347  0.018
                 MIND    1.628     0.639    0.375  2.881  0.011
                 ZI      1.242     0.593    0.079  2.405  0.036
                 OMI     1.598     0.646    0.333  2.864  0.013
                 OCMI    1.522     0.640    0.268  2.775  0.017
                 SSMI    2.397     0.690    1.044  3.749  0.001

Chapter 7: Discussion

In this chapter we wrap up what has been discussed in the previous chapters and provide discussion, additional results, and possible directions for future development on several important aspects of the theory.
The first section includes some discussion of the basic assumptions, namely rare disease and non-informative missingness. The next section attempts to establish a bridge between unconditional logistic regression under the case-control study design and Cox regression under the cohort design. The remaining section discusses the advantages and weaknesses of the method and provides an outlook on future developments.

7.1 The importance of the basic assumptions

The two basic assumptions are the rare disease assumption and the non-informative missingness assumption. The question is how important these assumptions are in developing the missing-indicator induced intensity method.

The non-informative missingness assumption

Non-informative missingness states that, given the missing value, the missing status contains no meaningful information for the analysis. In the context of the induced intensity, it means the intensity function is independent of the missing status once the missing value is known, that is,

    \lambda(t; G) = \lambda(t; F) = Y(t)\,\lambda_0(t)\, e^{Z^*(t)\beta_0}.

While non-informative missingness can be perceived as a much weaker assumption than MCAR and MAR, they do not describe the same mechanism. Suppose we have a non-missing outcome Y and an exposure Z divided into a missing part Z_mis and an observed part Z_obs; let f be the density function and M the missing indicator. The non-informative missingness assumption can be expressed as

    f(Y \mid Z_{mis}, Z_{obs}, M) = f(Y \mid Z_{mis}, Z_{obs}).

Recall that MCAR assumes

    f(M) = f(M \mid Y, Z_{mis}, Z_{obs}),

meaning the missingness does not depend on the data, missing or not. It is clear that MCAR implies the non-informative missingness assumption. On the other hand, MAR assumes

    f(M \mid Y, Z_{obs}) = f(M \mid Y, Z_{mis}, Z_{obs}),

which means the missingness does not depend on the missing part of the data, conditional on the observed part. Unlike MCAR, MAR does not imply non-informative missingness; it is possible that the missingness is MAR but still plays a role in the exposure-outcome relationship even when conditioned on the missing data.

In the last set of simulations of the previous chapter, the missing status and the exposure were designed to be correlated while still maintaining the non-informative missingness assumption. Now, instead of generating the exposure first, we generate the missing status first,

    M_k \sim B(1, 0.2),

so the overall missing proportion is still 20%. The exposure is then created based on the missing status,

    Z_k = M_k + \varepsilon,

where \varepsilon is a standard normal random variable. Under the new design, the missing status is correlated with the exposure and is no longer non-informative. To show this, we retain the remaining settings from the simulation for Table 5.4. The simulation result is shown in Table 7.1; we also include Table 5.4 as the first three rows for a side-by-side comparison.

Table 7.1: Simulation example of informative missingness

Missing Type     Settings      Method  Estimate  StdDev  StdErr  Efficiency
Non-Informative  β_0 = 0.2     FULL    0.199     0.020   0.020   100.0%
                 K = 5         CCA     0.201     0.038   0.039    26.8%
                 π = 0.2       MIND    0.199     0.022   0.022    80.6%
Informative      β_0 = 0.2     FULL    0.201     0.015   0.015   100.0%
                 K = 5         CCA     0.202     0.042   0.044    12.5%
                 π_0 = 0.2     MIND    0.201     0.023   0.023    41.0%

As shown in Table 7.1, the MIND method provides an unbiased estimate even when the non-informative missingness assumption is violated. However, the efficiency of the MIND method relative to the full data method drops dramatically to 41.0% under informative missingness, only half of the efficiency achieved under non-informative missingness.
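For concreteness, here is a minimal sketch of this informative-missingness generating step (Python; the variable names are illustrative, and the outcome model and nested case-control machinery of the full simulation are omitted):

    import numpy as np

    rng = np.random.default_rng(0)
    n, K, pi0 = 100_000, 5, 0.2

    # Draw the missing status first: M_k ~ Bernoulli(0.2), about 20% of years missing.
    M = rng.binomial(1, pi0, size=(n, K))
    # Build the exposure from the missing status: Z_k = M_k + eps, eps ~ N(0, 1),
    # so the missingness itself carries information about the exposure.
    Z = M + rng.normal(size=(n, K))

    # Marginally Var(Z_k) = Var(M_k) + 1 = 0.2*(1 - 0.2) + 1 = 1.16, but among the
    # observed years (M_k = 0) the variance collapses to Var(eps) = 1.
    print(Z.var(), Z[M == 0].var())

The two printed variances (about 1.16 and 1) illustrate the point made next.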
The larger loss of efficiency can be attributed to the loss of variation in the exposure due to missingness. In the informative missingness example (the bottom half of Table 7.1), the variance of the exposure with no missing data is 0.2 × (1 − 0.2) + 1 = 1.16, while the variance reduces to 1 among the non-missing exposures once missing data are involved. This simulation suggests that the MIND method may still be valid when the missing data are informative; however, the method takes a hit in efficiency, and the theoretical calculation provided in Chapter 5 is no longer applicable.

7.2 The bridge to logistic regression under case-control sampling

The motivational prostate cancer - pesticide study is a case-control study analyzed with an unconditional logistic regression model, whereas the missing indicator induced intensity method is developed under the framework of the proportional hazards model. A natural question is how the two types of models are theoretically linked, so that the induced intensity method can be applied to the prostate cancer study without much modification.

7.2.1 The link with unconditional logistic regression

Let D be the binary response variable, with 1 being occurrence of the event and 0 non-occurrence, and let p_Z = Pr{D = 1 | Z}. Recall that logistic regression is a generalized linear model with the logit link function: for some parameters \alpha_0 and \beta_0,

    logit(p_Z) = \alpha_0 + \beta_0 Z,                                              (7.1)

where logit(x) = \log\frac{x}{1-x}. (7.1) can be rewritten as the odds model

    \frac{p_Z}{1 - p_Z} = e^{\alpha_0 + \beta_0 Z}.

Once again the rare disease assumption comes to the rescue. Since

    \frac{p_Z}{1 - p_Z} = p_Z + \frac{p_Z^2}{1 - p_Z} = p_Z + O(p_Z^2),

we have

    p_Z = e^{\alpha + \beta Z} + O(p_Z^2),                                          (7.2)

which means p_Z essentially equals e^{\alpha + \beta Z} as p_Z approaches 0.

Now let us describe p_Z as the probability of failure over the time period (0, 1]. Assuming a constant hazard function \lambda(t) = \lambda_0, t \in (0, 1], the survival curve is simply the standard exponential survival function. In particular, at time 1 we have

    1 - p_Z = Pr{no failure in (0, 1]} = e^{-\int_0^1 \lambda(t)\,dt} = e^{-\lambda_0}.

Applying a Taylor expansion, it follows that

    \lambda(t) = \lambda_0 = -\log(1 - p_Z) = p_Z + O(p_Z^2) = e^{\alpha + \beta Z} + O(p_Z^2).   (7.3)

Comparing (7.2) and (7.3), it is clear that under the rare disease assumption, unconditional logistic regression and Cox regression approximately model the same relationship regarding the exposure Z.
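As a quick numeric check of this equivalence (a sketch; the parameter values are arbitrary), one can compare the logistic probability, its rare-disease approximation (7.2), and the implied constant hazard (7.3):

    import numpy as np

    alpha, beta = -6.0, 0.3                  # arbitrary illustrative parameter values
    for Z in (0.0, 1.0, 2.0):
        p = 1.0 / (1.0 + np.exp(-(alpha + beta * Z)))   # p_Z from the logistic model (7.1)
        approx = np.exp(alpha + beta * Z)               # rare-disease approximation (7.2)
        lam = -np.log(1.0 - p)                          # constant hazard lambda_0 from (7.3)
        print(f"Z={Z}: p={p:.7f} approx={approx:.7f} lambda={lam:.7f}")

All three quantities agree up to terms of order p_Z^2, as the derivation predicts.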
7.2.2 The logistic model under case-control sampling

The unconditional logistic model assumes prospective sampling, which ascertains the cohort first and then identifies the cases and controls at the end. Suppose the cohort size is N = N_0 + N_1 and D is the binary disease (failure) indicator. (7.1) can be written as

    Pr(D = 1 \mid Z) = \frac{e^{\alpha_0 + Z\beta_0}}{1 + e^{\alpha_0 + Z\beta_0}}.              (7.4)

The likelihood function to estimate (\alpha_0, \beta_0) is

    L(\alpha, \beta) = \prod_{i=1}^{N} \frac{e^{(\alpha + Z_i\beta)D_i}}{1 + e^{\alpha + Z_i\beta}}.   (7.5)

Suppose now that n = n_0 + n_1 individuals are sampled, with n_0 controls and n_1 cases drawn randomly from their respective subpopulations of sizes N_0 and N_1. The likelihood (7.5) is no longer correct, because after case-control sampling we need to model the retrospective probability Pr(Z | D) instead of Pr(D | Z). The retrospective likelihood is

    \prod_{i=1}^{n} Pr(Z_i \mid D_i).

For estimation under retrospective sampling, reparameterization is required. The Bayes formula Pr(D | Z) = Pr(Z | D) Pr(D) / Pr(Z) leads to the equality of the prospective and retrospective odds ratios, which is the most important aspect of case-control sampling:

    \frac{Pr(D = 1 \mid Z)/Pr(D = 0 \mid Z)}{Pr(D = 1 \mid Z = 0)/Pr(D = 0 \mid Z = 0)} = \frac{Pr(Z \mid D = 1)/Pr(Z = 0 \mid D = 1)}{Pr(Z \mid D = 0)/Pr(Z = 0 \mid D = 0)}.   (7.6)

With the help of (7.6), Prentice and Pyke (1979) proposed to parameterize the retrospective probability Pr(Z | D) as

    Pr(Z \mid D = d) = c_d\, e^{\gamma(Z) + d \cdot Z\beta_0}, \quad d = 0, 1,                   (7.7)

where \gamma(Z) = \log\{Pr(Z \mid D = 0)/Pr(Z = 0 \mid D = 0)\} for all Z and c_d = c_d(\gamma, d \cdot \beta_0) is a normalization constant. The model (7.7) also has logistic form and is equivalent to the prospective model if \alpha_0 and \gamma(Z) are unrestricted; the difference from the prospective model is that \gamma(Z) can be non-parametric if the sample space of Z is infinite. Further define

    q(Z) = e^{\gamma(Z)}\left\{\frac{n_0}{n} c_0 + \frac{n_1}{n} c_1 e^{Z\beta_0}\right\}.

q(Z) is actually the marginal probability density function of Z under case-control sampling. Then (7.7) can be rewritten as

    Pr(Z \mid D = d) = \frac{e^{(\delta_0 + Z\beta_0) \cdot d}}{1 + e^{\delta_0 + Z\beta_0}} \cdot q(Z)\, n\, n_d^{-1},   (7.8)

where \delta_0 = \log(c_1 n_1) - \log(c_0 n_0). The likelihood function is then proportional to

    \left\{\prod_{i=1}^{n} \frac{e^{(\delta + Z_i\beta) \cdot D_i}}{1 + e^{\delta + Z_i\beta}}\right\}\left\{\prod_{i=1}^{n} q(Z_i)\right\} = L_1 L_2.   (7.9)

Notice that L_1 is a parametric likelihood with exactly the same form as the prospective likelihood (7.5), while L_2 is nonparametric. The details of maximizing (7.9) are provided in Prentice and Pyke (1979): essentially, one first maximizes L_1 to obtain the maximum likelihood estimator (\hat{\delta}, \hat{\beta}), then maximizes L_2 empirically for \hat{q}(Z), the empirical probability function.

These arguments provide the theoretical background justifying the application of the prospective model (7.4) to case-control data, as if the sampling were prospective. The likelihood equations would be exactly the same with \delta in place of \alpha, so that the odds ratio parameter \beta_0, which as suggested by (7.6) is the same for both the prospective and retrospective models, can be reliably estimated. However, the parameter \alpha_0 of the prospective model cannot be estimated in the case-control study. This is expected, because e^{\alpha_0} reflects the disease prevalence among the non-exposed population, which cannot be estimated if the sampling depends on case-control status.
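This identifiability point is easy to reproduce numerically. Below is a small illustrative simulation (a sketch, not one of the dissertation's analyses; the cohort size and parameter values are arbitrary): the prospective logistic fit on case-control data recovers the slope \beta_0, while the intercept estimates \delta_0 rather than \alpha_0.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    N, alpha0, beta0 = 200_000, -5.0, 0.5

    # Prospective cohort with a rare disease generated from model (7.4).
    Z = rng.normal(size=N)
    p = 1.0 / (1.0 + np.exp(-(alpha0 + beta0 * Z)))
    D = rng.binomial(1, p)

    # Case-control sample: all cases plus an equal number of randomly chosen controls.
    cases = np.flatnonzero(D == 1)
    controls = rng.choice(np.flatnonzero(D == 0), size=cases.size, replace=False)
    idx = np.concatenate([cases, controls])

    # Fit the prospective logistic model as if the sampling were prospective.
    fit = sm.Logit(D[idx], sm.add_constant(Z[idx])).fit(disp=0)
    print(fit.params)   # slope is near beta0 = 0.5; the intercept reflects delta_0,
                        # shifted away from alpha0 by the log sampling-fraction ratio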
7.3 Summary and discussion

In the previous chapters, we explored the theoretical background and the application of the missing indicator (MIND) induced intensity method to solve the missing data problem in exposure histories, especially in case-control studies. We summarize the advantages of the MIND method here.

1. The theoretical basis is well established. We started from time-to-event cohort data to develop the theory under the Cox regression framework. From there we extended the method to nested case-control sampling and then established a bridge to the unconditional case-control design. Finally, we verified the method using the real-world data from the motivational background study.

2. Relatively high efficiency. Under a strong condition, it can be shown that the relative efficiency is exactly 1 − π_0 for the full cohort design under Cox regression, where π_0 denotes the overall missing proportion. Under matched nested case-control sampling, the efficiency is still close to 1 − π_0, which is as good as one can expect from a missing data method. The efficiency is much higher than that of the CCA method, especially when the exposure history is long. These results were backed by simulation studies.

3. Relatively simple to implement. In the most simple case, the MIND method is equivalent to imputing the missing values with 0's and then adjusting the model by adding a new variable, the number of missing years. The analysis can be done using readily available statistical software.

4. The method is flexible. As discussed in Chapter 5, it all comes down to how the induced intensity is parameterized. For example, when confounding factors and between-year correlation are involved, we can extend the method by adding more parameters to the functional form, as shown in the examples of Chapter 5.

5. No assumption on the missing mechanism is required. Unlike the classic missing data literature, where such an assumption must be made explicitly, the MIND method has the missing mechanism integrated into the parametric form of the induced intensity once that form is decided. Thus investigators can focus on the parametrization of the induced intensity rather than on the missing mechanism, which is often difficult to verify.

However, just like any other method, the MIND method has its own limitations, leaving room for future improvements.

1. The method relies on the rare disease assumption. The assumption is made in the theoretical development to eliminate the dependency of the induced intensity on the baseline intensity. In doing so, we actually slightly alter the model and the original parameters. If the rare disease assumption is violated, it remains to be seen whether the MIND method is still valid, especially under the case-control sampling design.

2. The parametrization is difficult in complex situations, especially with correlated data and large variation of the exposure over the years. We made some initial attempts in the examples of Chapter 5; however, a more universal way to parameterize the induced intensity is still needed.

3. The performance of the method in complex situations is unknown. The theoretical efficiency calculations and the simulations all assume the exposures are i.i.d. over the years. More evidence may be needed to answer this question.

Overall, this dissertation provides several novel developments, which we list as follows:

1. The MIND method is demonstrated to be superior to various single imputation methods through a Monte Carlo simulation under the logistic model with a case-control sampling design. The simulation can be seen as an extension of Weinberg et al. (1996), where the authors compared only single imputation methods under an excess relative risk model.

2. The MIND method has its theoretical root in Prentice (1982), where the author proposed the original idea of the method to deal with measurement error problems. In this dissertation, we extended the method within a counting process framework to handle missing values in exposure histories. We discussed the assumptions required for the theoretical development and suggested parameterizations of the induced intensity for complex scenarios such as confounding, non-ignorable missingness and correlated exposures.

3. Under the cohort study design and the between-year independence assumption (Theorem 1), we proved the asymptotic relative efficiency of the MIND method to be precisely equal to the non-missing proportion. For the nested case-control design, we used a numerical approach for the calculation. A simulation was designed to demonstrate the theoretical calculation results.

4. We discussed in detail the USC prostate cancer - pesticide pilot study and applied the MIND method to the real data.

REFERENCES

[1] Aalen, O. (1978). "Nonparametric Inference for a Family of Counting Processes." The Annals of Statistics, 1978;6;4: 701-26.
[2] Acquavella, J. F., B. H. Alexander, et al. (2006). "Exposure misclassification in studies of agricultural pesticides: insights from biomonitoring." Epidemiology, 2006;17;1: 69-74.
[3] Alavanja, M. C. R., J. H. Lubin, et al. (1999). "Residential Radon Exposure and Risk of Lung Cancer in Missouri." American Journal of Public Health, 1999;89;7: 1042-1048.
[4] Andersen, P. K., R. D. Gill (1982). "Cox's regression model for counting processes: A large sample study." The Annals of Statistics, 1982;10;4: 1100-1120.
[5] Armstrong, B. G. (1990). "The Effects of Measurement Errors on Relative Risk Regressions." American Journal of Epidemiology, 1990;132;6: 1176-84.
[6] Barnard, J., Meng, X. (1999). "Applications of multiple imputation in medical studies: From AIDS to NHANES." Statistical Methods in Medical Research, 1999;8;1: 17-36.
[7] Cox, D. R. (1972). "Regression models and life-tables (with discussion)." Journal of the Royal Statistical Society, Series B, 1972;34: 187-220.
[8] Dempster, A. P., Laird, N. M., Rubin, D. B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion)." Journal of the Royal Statistical Society, Series B, 1977;39: 1-38.
[9] Engel, L. S., H. Checkoway, et al. (2001). "Parkinsonism and occupational exposure to pesticides." Occupational & Environmental Medicine, 2001;58;9: 582-9.
[10] Fleming, T. R., D. P. Harrington (1991). Counting Processes and Survival Analysis. New York: John Wiley and Sons.
[11] Greenland, S., W. Finkle (1995). "A Critical Look at Methods for Handling Missing Covariates in Epidemiologic Regression Analyses." American Journal of Epidemiology, 1995;142: 1255-64.
[12] Goldstein, L., Langholz, B. (1992). "Asymptotic Theory for Nested Case-Control Sampling in the Cox Regression Model." Annals of Statistics, 1992;20;4: 1903-28.
[13] Gunier, R. B., M. E. Harnly, et al. (2001). "Agricultural pesticide use in California: pesticide prioritization, use densities, and population distributions for a childhood cancer study." Environmental Health Perspectives, 2001;109;10: 1071-8.
[14] Hornung, R. W., T. J. Meinhardt (1987). "Quantitative Risk Assessment of Lung Cancer in U.S. Uranium Miners." Health Physics, 1987;52: 417-30.
[15] Hougaard, P. (1986). "A Class of Multivariate Failure Time Distributions." Biometrika, 1986;73: 671-8.
[16] Langholz, B. (2005). "Case-Control Study, Nested." Encyclopedia of Biostatistics, 2nd Edition, 1: 646-55. Chichester: John Wiley & Sons, Ltd.
[17] Lin, D. Y., Z. Ying (1993). "Cox Regression with Incomplete Covariate Measurements." Journal of the American Statistical Association, 1993;88;424: 1341-9.
[18] Little, R. J. A., Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: John Wiley and Sons.
[19] Loewenherz, C., R. A. Fenske, et al. (1997). "Biological monitoring of organophosphorus pesticide exposure among children of agricultural workers in central Washington State." Environmental Health Perspectives, 1997;105;12: 1344-53.
[20] London, S. J., J. M. Pogoda, et al. (2003). "Residential Magnetic Field Exposure and Breast Cancer Risk in a Multiethnic Cohort Study." American Journal of Epidemiology, 2003;158: 969-80.
[21] Mills, P. K. (1998). "Correlation analysis of pesticide use data and cancer incidence rates in California counties." Archives of Environmental Health, 1998;53;6: 410-3.
[22] Moulton, L. H., Lê, M. G. (1991). "Latency and time-dependent exposure in a case-control study." Journal of Clinical Epidemiology, 1991;44;9: 915-23.
[23] Pepe, M. S., S. G. Self, et al. (1989). "Further Results on Covariate Measurement Errors in Cohort Studies with Time to Response Data." Statistics in Medicine, 1989;8: 1167-78.
[24] Prentice, R. L. (1982). "Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model." Biometrika, 1982;69: 331-42.
[25] Prentice, R. L., R. Pyke (1979). "Logistic disease incidence models and case-control studies." Biometrika, 1979;66;3: 403-11.
[26] Reynolds, P., S. E. Hurley, et al. (2005). "Residential proximity to agricultural pesticide use and incidence of breast cancer in California, 1988-1997." Environmental Health Perspectives, 2005;113;8: 993-1000.
[27] Reynolds, P., J. Von Behren, et al. (2005). "Agricultural pesticide use and childhood cancer in California." Epidemiology, 2005;16;1: 93-100.
[28] Rull, R., B. Ritz (2003). "Historical Pesticide Exposure in California Using Pesticide Reports and Land-Use Survey: An Assessment of Misclassification Error and Bias." Environmental Health Perspectives, 2003;111: 1582-9.
[29] Tatalovich, Z., J. P. Wilson, et al. (2006). "The Objective Assessment of Lifetime Cumulative Ultraviolet Exposure for Determining Melanoma Risk." Journal of Photochemistry and Photobiology B: Biology, 2006;85: 198-204.
[30] Thomas, D. C. (1977). "Addendum to a paper by F. D. K. Liddell, J. C. McDonald and D. C. Thomas." Journal of the Royal Statistical Society, Series A, 1977;140: 483-85.
[31] Weinberg, C. R., E. S. Moledor, et al. (1996). "Imputation for Exposure Histories with Gaps, under an Excess Relative Risk Model." Epidemiology, 1996;7;5: 490-97.
[32] Zahm, S. H., M. H. Ward, et al. (1997). "Pesticides and cancer." Occupational Medicine, 1997;12;2: 269-89.
[33] Zhou, H., M. S. Pepe (1995). "Auxiliary Covariate Data in Failure Time Regression." Biometrika, 1995;82;1: 139-49.