Imputation methods for missing data in growth curve models
(USC Thesis Other)
IMPUTATION METHODS FOR MISSING DATA IN GROWTH CURVE MODELS
by
Betty Lu-Ti Wang
A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BIOMETRY)
May 2000
Copyright 2000 Betty Lu-Ti Wang
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 3018141
Copyright 2000 by Wang, Betty Lu-Ti. All rights reserved.
UMI Microform 3018141. Copyright 2001 by Bell & Howell Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. Bell & Howell Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007
This dissertation, written by Betty Lu-Ti Wang under the direction of her Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY
Dean of Graduate Studies
Date: May 2000
DISSERTATION COMMITTEE

Acknowledgements
I would like to thank my dissertation co-chairs, Professors Sather and Forsythe, for their stimulation, encouragement, constant support, and guidance throughout this research. I would like to thank Professor Azen for his careful review of my dissertation and his valuable suggestions on its presentation. I would also like to thank Professor Mack for giving me confidence throughout my years at USC by awarding me a grade of “A” in the first course I took at USC. I would like to thank Professors Mack and Tavaré for their time and perceptive comments. Finally, I want to thank my husband, Chao, and my son, Albert, for their support at home.
Table of Contents
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS AND DEFINITIONS
ABSTRACT
CHAPTER 1. INTRODUCTION
CHAPTER 2. REVIEW OF STATISTICAL METHODS FOR MISSING DATA ANALYSIS
2.1 Regression Slope Approach for Informative Missing Data
2.2 Joint Likelihood Approach
2.3 Propensity Score Approach for Imputing Missing at Random Data
2.4 Other Approaches for Missing Data Analysis
CHAPTER 3. METHODS OF IMPUTATION
3.1 Introduction
3.2 Methods of Imputation - First Proposal
3.2.1 Methods as Presented in the Dissertation Proposal
3.2.2 Improvement on the Methods Presented in the Dissertation Proposal
3.3 Methods of Imputation - Second Proposal
CHAPTER 4. SIMULATION STUDY
4.1 Introduction
4.2 Simulation of Clinical Trial Data with Missing Observations
4.2.1 Generation of Complete Data
4.2.2 Creation of Missing Data from Complete Data
4.3 Methods for Missing Data Analyses
4.4 Comparisons for the Methods of Missing Data Analyses
4.5 Results of Simulation Study
4.5.1 Sample Data Simulated
4.5.2 Sample Data Imputed
4.5.3 Analyses of Missing Data
CHAPTER 5. DATA THAT DO NOT FOLLOW A LINEAR GROWTH CURVE MODEL
5.1 Application of LGC-PPDS Method to Repeated Measures Data of General Shapes
5.2 Modification of LGC-PPDS Method to Allow Data of General Shapes
5.3 Application of RMG-PPDS Method to Data from Growth Curve Model
CHAPTER 6. DISCUSSION AND CONCLUSIONS
CHAPTER 7. DIRECTION OF FUTURE RESEARCH
REFERENCES

List of Figures
Figure 1. Proportion of Observations Over Time (Data from Linear Growth Curve Model)
Figure 2. Mean Observed Response Over Time (Data from Linear Growth Curve Model)
Figure 3. Response Over Time for MAR (Data from Linear Growth Curve Model) (Blue = Observed, Red = Unobserved)
Figure 4. Response Over Time for MAR (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via LGC-PPDS)
Figure 5. Response Over Time for MAR (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via Lavori)
Figure 6. Response Over Time for MAR (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via LOCF)
Figure 7. Response Over Time for MAR (Data from Linear Growth Curve Model) (Completers)
Figure 8. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Blue = Observed, Red = Unobserved)
Figure 9. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via LGC-PPDS)
Figure 10. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via Lavori)
Figure 11. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via LOCF)
Figure 12. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Completers)
Figure 13. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Blue = Observed, Red = Unobserved)
Figure 14. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Blue = Observed, Red = Imputed via RMG-PPDS)
Figure 15. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Blue = Observed, Red = Imputed via Lavori)
Figure 16. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Blue = Observed, Red = Imputed via LOCF)
Figure 17. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Completers)

List of Tables
Table 1.1a Estimates of Slopes When Data are MCAR (Data from Linear Growth Curve Model)
Table 1.1b Analyses of Slopes When Data are MCAR (Data from Linear Growth Curve Model)
Table 1.2a Estimates of Endpoints When Data are MCAR (Data from Linear Growth Curve Model)
Table 1.2b Analyses of Endpoints When Data are MCAR (Data from Linear Growth Curve Model)
Table 2.1a Estimates of Slopes When Data are MAR (Data from Linear Growth Curve Model)
Table 2.1b Analyses of Slopes When Data are MAR (Data from Linear Growth Curve Model)
Table 2.2a Estimates of Endpoints When Data are MAR (Data from Linear Growth Curve Model)
Table 2.2b Analyses of Endpoints When Data are MAR (Data from Linear Growth Curve Model)
Table 3.1a Estimates of Slopes When Data are IM-Slopes (Data from Linear Growth Curve Model)
Table 3.1b Analyses of Slopes When Data are IM-Slopes (Data from Linear Growth Curve Model)
Table 3.2a Estimates of Endpoints When Data are IM-Slopes (Data from Linear Growth Curve Model)
Table 3.2b Analyses of Endpoints When Data are IM-Slopes (Data from Linear Growth Curve Model)
Table 4.1a Estimates of Slopes When Data are IM-Unobs (Data from Linear Growth Curve Model)
Table 4.1b Analyses of Slopes When Data are IM-Unobs (Data from Linear Growth Curve Model)
Table 4.2a Estimates of Endpoints When Data are IM-Unobs (Data from Linear Growth Curve Model)
Table 4.2b Analyses of Endpoints When Data are IM-Unobs (Data from Linear Growth Curve Model)
Table 5a Estimates of Endpoints With LGC-PPDS Imputation When Data are IM-Unobs (Data from Repeated Measures Model of General Shapes)
Table 5b Analyses of Endpoints With LGC-PPDS Imputation When Data are IM-Unobs (Data from Repeated Measures Model of General Shapes)
Table 6a Estimates of Endpoints When Data are IM-Unobs (Data from General Repeated Measures Model)
Table 6b Analyses of Endpoints When Data are IM-Unobs (Data from General Repeated Measures Model)
Table 7a Estimates of Slopes With RMG-PPDS Imputation (Data from Linear Growth Curve Model)
Table 7b Analyses of Slopes With RMG-PPDS Imputation (Data from Linear Growth Curve Model)
Table 8. Summary of Simulation Results (Data from Linear Growth Curve Model)

Abbreviations and Definitions
Censoring: patient prematurely ending a study, causing missing data.
Censoring Time: time from when a patient is enrolled in a study to (and including) his last observed time point.
Dropout: same definition as for censoring.
Endpoint: population mean at the final time point, which refers to time t = 3 in the simulation study. The endpoint is analyzed by one-way ANOVA after imputation of missing data.
IM: informative missing. The probability of missing is related to unobserved responses or patient-specific effects.
IM-Slopes: missing informative with respect to patient-specific slopes. The probability of missing is related to patient slopes.
IM-Unobs: missing informative with respect to individual unobserved responses. The probability of missing is related to patient unobserved responses.
LGC-PPDS: the imputation method proposed and developed in this dissertation for the linear growth curve model using samples from the posterior predictive distribution.
This Bayesian method was developed after the proposal was written and is independent of PROP-1 and PROP-2.
LMMSE: linear minimum mean squared error. The LMMSE estimator is derived by Wu and Bailey (Chapter 2) for the population slope when data are missing informative with respect to patient-specific slopes.
LMVUB: linear minimum variance unbiased. The LMVUB estimator is derived by Wu and Bailey (Chapter 2) for the population slope when data are missing informative with respect to patient-specific slopes.
MAR: missing at random. The probability of missing is related to observed responses. It is not related to unobserved responses or patient-specific effects.
MCAR: missing completely at random. The probability of missing is independent of both observed and unobserved responses.
PROP-1: the imputation method presented previously in the dissertation proposal.
PROP-2: the imputation method presented previously (PROP-1) and subsequently modified after the dissertation proposal was written.
RMG-PPDS: the imputation method proposed and developed in this dissertation for the repeated measures model of general shapes using samples from the posterior predictive distribution. This Bayesian method was developed after the proposal was written and is independent of PROP-1 and PROP-2.
Survival Time: time from when a patient is enrolled in a study to the time he/she dies.
WLS: weighted least squares.
UWLS: unweighted least squares.

Abstract
Traditional statistical analysis of clinical trial data can be biased or invalidated by missing observations. In this study, we propose an imputation method for missing values based on samples from a Bayesian predictive distribution. We tailor this method to data following two types of repeated measures models.
One model assumes for each subject a linear growth curve that follows a population growth curve model, $y_{i(k)} = T_i \beta_{i(k)} + e_{i(k)}$, where $T_i$ is a design matrix for measurement time, $\beta_{i(k)} \sim N(B_k, \Sigma_\beta)$ is a vector of individual regression coefficients, and $B_k$ is a vector of population means. Another model assumes that each subject’s response follows a general population response with an additive random subject effect, $y_{ij(k)} = \mu + \gamma_{i(k)} + \tau_{j(k)} + \varepsilon_{ij(k)}$, where $\gamma_{i(k)} \sim N(0, \sigma^2)$ is an individual subject effect and $\tau_k$ is a vector of population effects over time. By performing a simulation study, we characterize and compare the Bayesian imputation method with [a] Wu and Bailey’s linear minimum variance unbiased (LMVUB) estimator, [b] Lavori’s sequential imputation, [c] maximum likelihood (ML) analysis of observed data, [d] last observation carried forward (LOCF), and [e] analysis of completers. We use bias, type-I error rate, and power as the criteria for comparison. The results of the simulation study indicate that the Bayesian imputation is comparable or superior to the other methods when data are missing at random (MAR) or missing informative with respect to individual subject slopes or unobserved responses.

CHAPTER 1. INTRODUCTION
The occurrence of missing data is a pervasive problem in clinical data analysis. In longitudinal studies, it often arises because patients drop out of studies, i.e., cease to participate prior to completion of the intended investigational period. This type of missingness is often characterized as a monotonic pattern: when a patient’s responses are missing at assessment time t, they are also missing at all later assessment times, t+1, t+2, etc. Missing data can also occur when patients skip some intermediate visits or refuse to have some assessment done during the intermediate or final visits.
This type of missing data is often characterized as an intermittent pattern, because missing values appear on and off across the intermediate visits and, at the final visit, usually not all the data are missing. The causes and statistical treatment of these two types of missing data problems can be very different. In this dissertation, we focus only on the first type of missing data. The common causes of patients prematurely ceasing to participate in a study can be that (i) patients have recovered from disease; (ii) patients do not experience improvement during the study; (iii) patients experience adverse reactions that may be related to the investigational treatment; (iv) study procedures are unpleasant; (v) patients experience other concurrent health problems; (vi) external reasons arise that seem to be unrelated to the trial procedures or to the progress of the patient; and (vii) patients die. These dropouts may invalidate the statistical results if appropriate methods are not used. As a simplified illustration (Heyting et al., 1992), let $y_i$ correspond to a measure of the severity of illness. Suppose that treatment A is modestly effective in a proportion of the patients, regardless of the initial severity of their illness, while treatment B causes the less severely ill patients to recover and leaves the severely ill patients essentially unaffected. If the recovered patients tend to drop out prior to the final assessment time, a simple-minded treatment comparison of the outcome variable $y_i$ for the completers will unduly favor treatment A. The bias would be accentuated if treatment A caused some of the treatment-resistant patients to drop out due to an unfavorable balance between side effects and improvement in the severity of their illness.
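The illustration above can be made concrete with a small numerical sketch. Everything numeric here is an assumption chosen only for illustration and does not come from Heyting et al. (1992) or from this dissertation: the severity scale, the 40% response rate and 5-point benefit for treatment A, the rule that patients below the median severity recover fully under treatment B, and the 90% dropout rate among recovered patients. The function name `completer_bias_demo` is hypothetical.

```python
import numpy as np

def completer_bias_demo(n=20000, seed=0, p_drop_recovered=0.9):
    """Compare the true A-vs-B difference with the completers-only difference.

    Returns (true_diff, completer_diff), each defined as
    mean final severity under A minus mean final severity under B,
    so a positive value means B leaves patients less ill than A.
    """
    rng = np.random.default_rng(seed)

    # Baseline severity (higher = more ill); same distribution in both arms.
    base = rng.normal(50.0, 10.0, size=n)
    severe = base > 50.0

    # Treatment A: modest benefit in a proportion of patients,
    # regardless of initial severity. No dropout in this arm.
    responds_a = rng.random(n) < 0.4
    final_a = base - 5.0 * responds_a

    # Treatment B: less severely ill patients recover (severity 0),
    # severely ill patients are unaffected.
    final_b = np.where(severe, base, 0.0)

    # Recovered patients in arm B tend to drop out before the final visit.
    dropped_b = (~severe) & (rng.random(n) < p_drop_recovered)

    true_diff = final_a.mean() - final_b.mean()
    completer_diff = final_a.mean() - final_b[~dropped_b].mean()
    return true_diff, completer_diff

true_diff, completer_diff = completer_bias_demo()
print(f"all patients:    A - B = {true_diff:.1f}")
print(f"completers only: A - B = {completer_diff:.1f}")
```

Under these assumed settings, the full-data difference is clearly positive (B is the better treatment), while the completers-only difference changes sign, which is the direction of bias described in the illustration.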
Little and Rubin (1987) classify missing data mechanisms into three types for the survey sampling problem. The first type is called missing completely at random (MCAR), in which the probability of a missing response is independent of response Y and covariate X. The second is called missing at random (MAR), in which the probability of a missing response depends on X but not on Y. These two types are also called ignorable missing because the likelihood contributed by the missing data can be factored out from the total likelihood and ignored in likelihood-based inference. The third type is called non-ignorable missing, in which the probability of a missing response depends on Y and possibly on X as well. The missing data mechanism of this type cannot be ignored in likelihood-based inference. These definitions can also be applied to longitudinal data if one thinks of the previously observed outcomes $y_{t-1}, y_{t-2}, \ldots$ and covariate $x$ as X, and the current unobserved outcome $y_t$ as Y. The third type is often referred to as informative missing (Diggle and Kenward, 1994). Wu and Carroll (1988) extended this definition to the situation where the probability of missing depends on a parameter of the response, e.g., the random slopes of individual patient regressions. The appropriate statistical treatment depends on the missing data mechanism. One method of dealing with missing data is to drop all cases that have any missing data. For longitudinal data, the cases without missing data are those who complete the study. Generally, when the MCAR assumption is true, no special adjustment is needed for the analysis of completers. The obvious merit of the analysis of completers is its simplicity. However, it relies heavily on the MCAR assumption, which may not be satisfied most of the time.
Besides, the analysis of completers does not utilize all patient information and, therefore, is not efficient. For MAR and informative missing mechanisms, imputation and model-based methods have been developed to tackle the missing data problem. Among the general category of imputation procedures, there are single imputation and multiple imputation approaches. The single imputation approach assigns a single value to each missing value through some algorithm; without special adjustments, it therefore cannot reflect sampling variability. To rectify this drawback, a multiple imputation procedure (Rubin and Schenker, 1991; Brand et al., 1994) was developed in which each missing value is replaced by a vector of M ≥ 2 imputed values. The M values are ordered in the sense that M “complete” data sets can be created from the vectors of imputations: replacing each missing value by the first component of its vector creates the first complete data set, replacing each missing value by the second component creates the second complete data set, and so on. Usual complete-data methods are used to analyze each data set. The results are then combined so that between-imputation variation is incorporated (Rubin and Schenker, 1991). The disadvantage of multiple imputation is that it takes more work to impute missing values and analyze the results. Lavori et al. (1995) discuss multiple imputation of dropouts based on patients with the same propensity scores (Rosenbaum and Rubin, 1984). Their algorithm involves very tedious repeated imputation. In fact, when the MAR assumption is true, no special treatment is needed other than the regular maximum likelihood (ML) approach for the model with appropriate covariates.
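The step in which the M complete-data analyses are summarized can be sketched as follows. The text does not spell out the combining formulas, so this sketch assumes the standard combining rules for multiple imputation: the pooled estimate is the average of the M estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name `pool_rubin` and the input numbers are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M complete-data estimates and their variances.

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns (pooled_estimate, total_variance).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2:
        raise ValueError("need at least two imputed data sets")

    q_bar = q.mean()         # pooled point estimate
    u_bar = u.mean()         # within-imputation variance
    b = q.var(ddof=1)        # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, total

# Hypothetical slope estimates from M = 3 imputed data sets.
est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(est, var)
```

A single imputation corresponds to dropping the between-imputation term, which is why it understates the uncertainty unless special adjustments are made.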
As shown by Little and Rubin (1987), the ML analysis is valid under the assumption of the MAR mechanism. The model-based method is another approach to the informative missing data problem. Using this approach, Wu and Carroll (1988) and Wu and Bailey (1989) developed estimation and comparison of rates of change under the linear random-effects model with informative censoring. The probit censoring model of Wu and Carroll combines the linear random-effects model with a probit model for the censoring process, which leads to a tractable estimator using pseudo-ML estimation. DeGruttola and Tu (1995) propose an approach that extends the general random-effects model to the analysis of longitudinal data with informative censoring. Their method permits estimation of parameters describing the relationship between progression and survival times. An expectation-maximization (EM) algorithm is used to compute the ML estimate from the joint distribution of repeated measures and survival times. This algorithm iteratively finds the conditional expectation of the sufficient statistics (E-step) and then estimates the model parameters by substituting these expected values for the sufficient statistics (M-step). Other methods for missing data analysis include (1) analysis of completers, (2) analysis with last observation carried forward (LOCF), (3) analysis with missing values replaced by some special values according to the dropout reasons, and (4) weighted analysis. The first analysis uses patients who complete the study and ignores all patients with missing data. It is the simplest method analytically, but can suffer severe bias when the reason for discontinuation is related to treatment effects.
For example, suppose patients receiving one treatment are doing well and thus have fewer non-adherence problems, while patients receiving another treatment are doing poorly and thus are more often non-compliant. Dropping the non-completers will dilute the treatment difference. The completer analysis gives valid (unbiased) estimates when the missing mechanism is MCAR. However, since it does not use all patient data, the method is not efficient. LOCF analysis uses all patient data: when a missing observation from a non-completer is encountered, it carries forward the last observed value from that patient. It is unbiased only under the unrealistic assumption that a patient’s disease evaluations do not change once the patient drops out of the study. This assumption can be severely violated if the disease progresses rapidly. Generally speaking, the completer analysis can be especially favorable to a placebo group, in the sense of imputing the same clinical improvement (i.e., the mean of completers) to patients who drop out as that reported for patients who complete the study. On the other hand, the LOCF analysis can unrealistically impute the missing endpoint values with atypically poor early results. In clinical trials studying drug efficacy or a patient’s quality of life, a worse score is often assigned to patients who drop out of the study due to death. This method, while having the appeal of being conservative, can be highly biased due to its empirical nature. The weighting procedure estimates the parameters (population mean responses) via weighted averages. It is related to the mean imputation procedure in the sense that it modifies the weights in an attempt to adjust for missingness. The concept of this procedure is that it divides all patients into strata within which patients are relatively homogeneous with respect to their response profile.
Patients who drop out are then selected by stratified random sampling. If the weights are constant within each stratum, then imputing the stratum mean for non-completer (dropout) patients in each stratum and weighting completer patients by the proportion of completers in each stratum will lead to the same estimates of population means, although not to the same estimates of variances unless adjustments are made to the data with means imputed. Detailed information regarding this procedure is given in Section 2.4 of Chapter 2 of this dissertation. In this dissertation, we will develop a procedure to impute missing data. The procedure will allow the missing mechanism to be informative with respect to the random patient effect as well as MAR. Via a simulation study, we will examine the rejection rates of hypothesis tests when the null hypothesis is true and when the alternative hypothesis is true under various modeling situations. We will also examine mean squared errors (MSEs) as well as parameter estimates from this method. In the next chapter, we will review the literature on analysis and imputation methods. Some of these methods are based on modeling, and some are based on imputation procedures. In Chapter 3, we will propose our imputation methods. Chapter 4 will describe the methods for the simulation study and present the simulation results from our imputation methods and from other methods published in the literature. Chapter 5 will extend the method to data that do not follow a linear growth curve model. Chapter 6 will discuss the imputation methods with respect to the simulation results. Based on the results of the simulation, we will present some new ideas and approaches for future research in Chapter 7. Note that missing data also occur in survey sampling, where cross-sectional observations are obtained. There are many similarities in describing and handling missing data between survey sampling and clinical trials.
Little and Rubin (1987) and Little (1988) discussed various methods for handling missing data in survey sampling. Many concepts from their methods and discussions can be generalized to clinical trials. In this dissertation, we will focus only on the methods of handling missing data that are specific to clinical trials.

CHAPTER 2. REVIEW OF STATISTICAL METHODS FOR MISSING DATA ANALYSIS

2.1 Regression Slope Approach for Informative Missing Data

In many clinical trials, the primary outcome variable is the rate of change in a continuous variable measuring physiologic function or disease status. One such study was the Intermittent Positive Pressure Breathing (IPPB) Trial for pulmonary disease (Wu and Carroll, 1988), where the primary outcome was the rate of change in the forced expiratory volume in one second (FEV1). The probability of a patient's early termination due to death might be related to the patient's rate of decline in FEV1. This would cause informative censoring of the FEV1 measures. Inspired by the missing data problem in this study, Wu and Carroll (1988) derived several estimators for the rate of change using the joint likelihood of survival time and repeated measures. The derivations of their estimators are as follows.

For each patient, a repeatedly measured response variable is fitted over time using a simple linear regression model

$y_i = X_i \beta_i + e_i$, for patient $i = 1, 2, \ldots, N$, (2.1)

where $y_i = (y_{i1}, y_{i2}, \ldots, y_{iv_i})^T$ is the response vector, $X_i$ is the $v_i \times 2$ design matrix of the measurement schedule with rows $(1, t_{ij})$, $\beta_i = (\beta_{i1}, \beta_{i2})^T \sim N(B_k, \Sigma_\beta)$ is the vector of individual regression intercepts and slopes, $e_i \sim N(0, \sigma_e^2 I)$ is the vector of model errors,

$B_k = (B_{k1}, B_{k2})^T$ and $\Sigma_\beta = \begin{pmatrix} \sigma_{\beta_1}^2 & \sigma_{\beta_1\beta_2} \\ \sigma_{\beta_1\beta_2} & \sigma_{\beta_2}^2 \end{pmatrix}$

are the vector of population regression coefficients and the covariance matrix, and subscript $v_i$ denotes the last observed time point. The parameter of interest is the population rate of change (slope) $B_{k2}$. Note that the regression coefficients are fixed conditional on each patient, but are random with respect to the population mean.

Let the probability of informative right censoring during a specified time interval $(t_{j-1}, t_j)$ for the $i$th patient be $M(\alpha^T\beta_i \mid t_j)$, where $\alpha = (\alpha_1, \alpha_2)^T$ is a vector of "regression parameters" relating this probability to the independent variable $\beta_i$. The choice of the probability function $M$ might be probit regression, logistic regression, or Cox proportional hazards regression. For example, under probit regression

$M(\alpha^T\beta_i \mid t_j) = \Phi(\alpha^T\beta_i + \alpha_{0j})$,

where $\alpha_{0j}$, for $j = 2, \ldots, J$, are censoring time parameters that are common to all the patients. Subscript $j$ starts from 2 because we require that patients stay in the study at least through $t_1$. Let $\hat\beta_i \mid \beta_i = (X_i^T X_i)^{-1} X_i^T y_i$ be the ordinary least squares (OLS) estimate of $\beta_i$. The marginal likelihood for $B_k$ and $\alpha$ for the $i$th individual can be expressed as

[Equation (2.2), the integral form of $L_i$, is illegible in the source.] (2.2)

where $M_{ij} = M(\alpha^T\beta_i \mid t_j)$ for $i \in k$ ($k = 1, 2$), $j = 1, \ldots, J$, and $M_{i1} = 0$. Here, $C(i, j)$ is an indicator for non-informative censoring in the $j$th interval due to staggered entry, and $Z(i, j)$ is an indicator for informative censoring in the $j$th interval due to death or withdrawal. Parameter $D$ is constant with respect to $\beta_i$, $\alpha$ and $B_k$, $\phi_2$ is the bivariate normal density function, and $\delta_i$ indicates whether patient $i$ completes the study or not.
The first factor under the integration sign in the likelihood function $L_i$ represents the conditional probability distribution of $\hat\beta_i$ given the $i$th patient's true regression parameters $\beta_i$ and his or her censoring indicators $C(i, j)$ and $Z(i, j)$. The second factor is the probability distribution of $\beta_i$. The product of the first and second factors corresponds to the joint distribution of the observed slope and the underlying true slope for each individual. The third factor is the conditional probability that the $i$th patient survived the $(j-1)$th time point and then was non-informatively censored by staggered entry. The fourth factor, $\{(1 - M_{i,j-1}) - (1 - M_{ij})\}$, is the conditional probability that the patient survived the $(j-1)$th time point and then was informatively censored by death or withdrawal. The last factor corresponds to the conditional probability that the $i$th patient survived the entire study, given $\beta_i$. In the likelihood,

$C_{1i} = \sigma_e^2 (X_i^T X_i)^{-1}$ (2.3)

is the variance estimate conditional on each patient, and

$\delta_i = 0$ for completers, $\delta_i = 1$ for censored subjects. (2.4)

[Equation (2.5) is illegible in the source.] (2.5)

Assuming that the primary right censoring (death or withdrawal) process is a probit model, the log-likelihood can be simplified to

$\ln(L_i) = \ln(D) + \ln(A_i) - \tfrac{1}{2}(\hat\beta_i - B_k)^T C_{2i}^{-1} (\hat\beta_i - B_k) + T_i$,

where

$A_i = (2\pi |C_{2i}|^{1/2})^{-1}$, (2.6)

$C_{2i} = C_{1i} + \Sigma_\beta$, (2.7)

[Equation (2.8), the definition of $T_i$ in terms of $C(i, j)$, $Z(i, j)$ and $\Phi(U_{ij})$, is illegible in the source.] (2.8)

$U_{ij} = (\alpha_{0j} + d_i^T C_{3i}\alpha)(1 + \alpha^T C_{3i}\alpha)^{-1/2}$, (2.9)

$d_i = C_{1i}^{-1}\hat\beta_i + \Sigma_\beta^{-1} B_k$, (2.10)

and $C_{3i} = (C_{1i}^{-1} + \Sigma_\beta^{-1})^{-1}$. (2.11)

The marginal likelihood for all patients is the product of the individual likelihoods. The ML estimates of $\alpha_{0j}$, $\alpha$ and $B_k$ can be obtained numerically by simultaneously solving the equations of partial derivatives with respect to these model parameters and $\Sigma_\beta$ and $\sigma_e^2$.
However, instead of going through the derivations of the full log-likelihood function, Wu and Carroll use the method of pseudo-ML estimation (Gong and Samaniego, 1981). They replace $\Sigma_\beta$ and $\sigma_e^2$ by unbiased crude estimators [the estimator formulas are illegible in the source], which are built from

$S_\beta = \sum_{k=1}^{2}\sum_{i \in k} (\hat\beta_i - \hat B_{UWLS,k})(\hat\beta_i - \hat B_{UWLS,k})^T$ and $S_e = \sum_{i=1}^{N} (y_i - X_i\hat\beta_i)^T (y_i - X_i\hat\beta_i)$,

where $N = n_1 + n_2$ is the total number of patients in the two treatment groups, and $\hat B_{UWLS,k} = \frac{1}{n_k}\sum_{i \in k}\hat\beta_i$ is an unweighted least squares (UWLS) estimate of $B_k$ for $k = 1, 2$.

The hypotheses of interest are:

• $H_0$: $\alpha_1 = \alpha_2 = 0$, i.e., the primary right censoring caused by death or withdrawal is not informative with respect to both patient intercept $\beta_{i1}$ and slope $\beta_{i2}$ for $i \in k$, and hence is independent of population $B_k$ for $k = 1, 2$.
• $H_1$: $\alpha_1 \neq 0$ and $\alpha_2 = 0$ (or $\alpha_1 = 0$ and $\alpha_2 \neq 0$), i.e., the primary right censoring is informative with respect to patient intercept $\beta_{i1}$ (or slope $\beta_{i2}$) for $i \in k$, and hence depends on population $B_{k1}$ (or $B_{k2}$) for $k = 1, 2$.
• $H_2$: $\alpha_1 \neq 0$ and $\alpha_2 \neq 0$, i.e., the primary right censoring caused by death or withdrawal is informative with respect to both intercept $\beta_{i1}$ and slope $\beta_{i2}$ for $i \in k$, and hence depends on population $B_k$ for $k = 1, 2$.

When $H_0$ is true, $T_i$ is a constant with respect to $B_k$. The ML estimate of $B_k$ obtained by maximizing the first two factors in the likelihood function is the weighted least squares (WLS) estimate

$\hat B_{WLS,k} = \left(\sum_{i \in k} C_{2i}^{-1}\right)^{-1} \sum_{i \in k} C_{2i}^{-1}\hat\beta_i$,

where

$C_{2i} = \mathrm{var}(\hat\beta_i) = \mathrm{var}(\hat\beta_i \mid \beta_i) + \mathrm{var}(\beta_i) = \sigma_e^2 (X_i^T X_i)^{-1} + \Sigma_\beta$ (2.16)
is the unconditional variance of the parameter estimates for patient $i$. When $H_0$ is true and all individuals have complete observations measured at identical time points, $C_{2i}$ will be the same among individuals, in which case $\hat B_{WLS,k}$ reduces to the unweighted least squares (UWLS) estimate

$\hat B_{UWLS,k} = \frac{1}{n_k}\sum_{i \in k}\hat\beta_i$. (2.17)

When $H_0$ is not true, Wu and Bailey (1989) proposed a linear minimum variance unbiased (LMVUB) estimator and a linear minimum mean squared error (LMMSE) estimator for the population rate of change $B_{k2}$. The LMVUB estimator is obtained from a conditional linear model of individual regression slopes on their censoring (survival) times

$\hat\beta_{i2} = \gamma_{0k} + \gamma_{1k} t_{iv_i} + \varepsilon_i$, (2.18)

where $\gamma_{0k}$ and $\gamma_{1k}$ are unknown coefficients, $\varepsilon_i$ is a normal random error and $t_{iv_i}$ is the censoring time of the $i$th patient. The coefficients $\gamma_{0k}$ and $\gamma_{1k}$ are estimated via the WLS method with weight being the inverse of the unconditional variance of the estimated individual slopes,

$\mathrm{var}(\hat\beta_{i2}) = \sigma_e^2 \left[(X_i^T X_i)^{-1}\right]_{2,2} + \left[\Sigma_\beta\right]_{2,2}$, (2.19)

where subscript 2,2 indicates the element in the 2nd row and 2nd column of the 2x2 matrix. The first term on the right-hand side corresponds to the within-patient variation, and the second term to the between-patient variation. The variance of the individual patient repeated measures model, $\sigma_e^2$, is estimated by pooling the model residuals from all patients together,

$s_e^2 = \sum_{k=1}^{2}\sum_{i \in k} s_i^2 (v_i - 2) \Big/ \sum_{k=1}^{2}\sum_{i \in k} (v_i - 2)$. (2.20)

The between-patient variance is estimated following Vonesh and Carter (1987),

$\hat\Sigma_\beta = S_B - s_e^2 \sum_{k=1}^{2}\sum_{i \in k} \left[1 - z_i^T (Z^T Z)^{-1} z_i\right] (X_i^T X_i)^{-1} \Big/ (N - 2)$, (2.21)

where $S_B = B^T \left[I_N - Z (Z^T Z)^{-1} Z^T\right] B / (N - 2)$ is a covariance matrix of the conditional linear model. It is calculated from the matrix of estimated coefficients from the individual regressions, $B^T = (\hat\beta_1 \cdots \hat\beta_i \cdots \hat\beta_N)$, and the matrix of censoring times, $Z = (z_1 \cdots z_N)^T$ with $z_i = (1, t_{iv_i})^T$.
To ensure that the variance matrix is positive definite, let $V = \sum_{k=1}^{2}\sum_{i \in k} \left[1 - z_i^T (Z^T Z)^{-1} z_i\right] (X_i^T X_i)^{-1} / (N - 2)$, $\hat\lambda = \min(s_e^2, \lambda_1)$, and $\lambda_1$ be the smallest eigenvalue of the matrix $V^{-1/2} S_B V^{-T/2}$; defining $\hat\Sigma_\beta = S_B - \hat\lambda V$ will warrant $\hat\Sigma_\beta > 0$. Here $V^{1/2} V^{T/2} = V$ is the Cholesky decomposition. Note that the formula in Wu and Bailey (1989, p. 945) for $\hat\Sigma_\beta$ does not work properly.

Taking the expectation of the conditional linear model (2.18) and replacing the parameters $\gamma_{0k}$ and $\gamma_{1k}$ by their WLS estimates gives the LMVUB estimator of the slope $B_{k2}$,

$\hat B_{LMVUB,k2} = \hat\gamma_{0k} + \hat\gamma_{1k}\bar t_k = \bar\beta_{w,k2} + \hat\gamma_{1k}(\bar t_k - \bar t_{w,k})$, (2.22)

where $\bar\beta_{w,k2} = \sum_{i \in k} w_i\hat\beta_{i2} / \sum_{i \in k} w_i$ and $\bar t_{w,k} = \sum_{i \in k} w_i t_{iv_i} / \sum_{i \in k} w_i$ are the weighted averages of the patients' regression slopes and censoring times, respectively, with $w_i$ the inverse of the variance in (2.19), and $\bar t_k = \frac{1}{n_k}\sum_{i \in k} t_{iv_i}$ is the unweighted average of censoring times over all patients in group $k$. The associated variance estimate is

$\mathrm{Var}(\hat B_{LMVUB,k2}) = \mathrm{Var}(\bar\beta_{w,k2}) + \mathrm{Var}[\hat\gamma_{1k}(\bar t_k - \bar t_{w,k})] \approx \mathrm{Var}(\bar\beta_{w,k2}) + \mathrm{Var}(\hat\gamma_{1k})(\bar t_k - \bar t_{w,k})^2 + (\hat\gamma_{1k})^2\,\mathrm{Var}(\bar t_k - \bar t_{w,k})$, (2.23)

where the approximation arises because we use only the first-order terms of the Taylor series expansion. Note that in Wu and Bailey (1989), it is assumed that patient censoring times are fixed, and therefore the last term (the expression after the second "+") is ignored by their formula. However, for the general population, the patient censoring time is a random variable. Adding the last term can serve as a variance correction for their formula.

The LMMSE estimator is a weighted average of the regression slopes of the individual repeated measures models,

$\hat B_{LMMSE,k2} = \sum_{j} w_{kj}\bar\beta_{kj}$, (2.24)

where $\bar\beta_{kj}$ is the simple average of regression slopes over all group $k$ patients who were censored at time $t_j$.
A simple average is used here because the variances of the individual slope estimates are assumed equal for patients who are censored at the same time point. The weight $w_{kj}$ is obtained by minimizing the mean squared error of the estimator. It is computed from

$w_k = (w_{k1}\; w_{k2}\; \cdots\; w_{kJ})^T = (\Sigma + \tau\tau^T)^{-1}\mathbf{1} \Big/ \left[\mathbf{1}^T(\Sigma + \tau\tau^T)^{-1}\mathbf{1}\right]$, (2.25)

where $\Sigma$ is a diagonal matrix and $\tau$ is a vector [their entries, and the associated variance formula for the LMMSE estimator, are illegible in the source].

When the primary right censoring (death or withdrawal) is non-informative, the value of $C_{2i}$ (2.16) does not depend on $\beta_i$, so that all four estimates $\hat B_{UWLS,k2}$, $\hat B_{WLS,k2}$, $\hat B_{LMVUB,k2}$, and $\hat B_{LMMSE,k2}$ are consistent and unbiased estimators of the population rate of change, with the UWLS estimate being the least efficient. When the primary right censoring process is informative, the WLS estimate is biased, because $X_i$ (and hence $C_{2i}$) depends on the censoring time. However, the UWLS estimate, LMVUB estimate, and LMMSE estimate are still unbiased.

Wu et al. (1994) compared $\hat B_{UWLS,k2}$, $\hat B_{WLS,k2}$, and $\hat B_{LMVUB,k2}$ in a simulation study. Their results indicate that under both random and informative censoring mechanisms, the hypothesis test based on $\hat B_{UWLS,k2}$ suffers a severe loss of power. However, the test based on $\hat B_{LMVUB,k2}$ performed well when used in conjunction with the bootstrap variance estimator. Wang-Clow et al. (1995) compare, under MCAR, MAR, and informative missingness with respect to individual patient slopes, the performance of the unweighted estimate $\hat B_{UWLS,k2}$, weighted estimate $\hat B_{WLS,k2}$, LMVUB estimate $\hat B_{LMVUB,k2}$, LMMSE estimate $\hat B_{LMMSE,k2}$, ML estimate $\hat B_{ML,k2}$, and unweighted complete-cases estimate $\hat B_{Complete,k2}$ of the population slope. Their results also demonstrate the inefficiency of $\hat B_{UWLS,k2}$.
Among their findings, the ML method is shown to provide a good estimate of variance in all the missing mechanisms examined. However, it is highly biased when missingness is informative. Both $\hat B_{LMVUB,k2}$ and $\hat B_{LMMSE,k2}$ have relatively low biases for the non-informative missing mechanism, but their standard errors are relatively high.

Schluchter (1992) models the patient's random vector of intercept and slope, $\beta_i$, and log-survival time, $T_i^0$, by a trivariate normal distribution,

$\begin{pmatrix} \beta_i \\ T_i^0 \end{pmatrix} \sim N\left( \begin{pmatrix} B \\ \mu_t \end{pmatrix}, \begin{pmatrix} \Sigma_\beta & \sigma_{bt} \\ \sigma_{bt}^T & \sigma_t^2 \end{pmatrix} \right)$, (2.26)

where $B$ and $\Sigma_\beta$ are the mean vector and 2x2 covariance matrix of the patient's regression parameters, $\mu_t$ and $\sigma_t^2$ are the mean and variance of the patient's log-survival time, and $\sigma_{bt}$ is the 2x1 covariance vector between the patient's regression parameters and log-survival time. With the observed response vector $y_i$, survival or censoring time $T_i = \min(T_i^0, C_i)$, and censoring indicator $\delta_i$ for patient $i = 1, \ldots, N$, Schluchter proposes the E-M algorithm to estimate the population parameters and the likelihood ratio statistic for the hypothesis test. Note that the trivariate normal model (2.26) implies that the conditional expectation of the patient's intercept and slope given $T_i^0$ is linear in $T_i^0$:

$E(\beta_i \mid T_i^0) = B + \sigma_{bt}\,\sigma_t^{-2}\,(T_i^0 - \mu_t)$. (2.27)

Thus, Wu and Bailey's (1989) conditional linear model (2.18) fits into Schluchter's trivariate normal model with the survival time on the log scale. However, Wu and Bailey's model requires that all patients are potentially followed for the same length of time, because otherwise patients who are censored due to death and patients who are censored due to short follow-up time will be indistinguishable. Schluchter's model, which has the censoring indicator built in, does not have this restriction.

Mori et al. (1992) applied the empirical Bayes method to the estimation of the population rate of change (slope) when censoring is informative with respect to an individual patient's regression slope.
Assuming known variances, they consider the following two conditional linear models for subject $i$ ($i = 1, \ldots, n$):

$b_{i,ols} \mid \beta_i \sim N(\beta_i, V_i)$, (2.28)

$\beta_i \mid t_{iv_i} \sim N(\gamma_0 + \gamma_1 t_{iv_i}, A)$, (2.29)

where $b_{i,ols}$ is an OLS estimate of the slope, $t_{iv_i}$ is the survival time, $V_i$ is the known variance of $b_{i,ols}$, and $A$ is the known variance of $\beta_i$. The marginal distribution of $b_{i,ols}$ is

$b_{i,ols} \mid \gamma_0, \gamma_1, t_{iv_i} \sim N(\gamma_0 + \gamma_1 t_{iv_i}, V_i + A)$. (2.30)

The posterior distribution of $\beta_i$ given $b_{i,ols}$ is

$\beta_i \mid b_{i,ols}, \gamma_0, \gamma_1, t_{iv_i} \sim N(\beta_i^*, V_i(1 - B_i))$, (2.31)

where $\beta_i^* = (1 - B_i) b_{i,ols} + B_i(\gamma_0 + \gamma_1 t_{iv_i})$ is the posterior mean and $B_i = V_i / (V_i + A)$ is the ratio of the conditional variance (2.28) to the unconditional variance (2.30) of the OLS estimate of the slope. Let

$b = (b_{1,ols}, \ldots, b_{n,ols})^T$, $Z = \begin{pmatrix} 1 & t_{1v_1} \\ \vdots & \vdots \\ 1 & t_{nv_n} \end{pmatrix}$, $\gamma = (\gamma_0, \gamma_1)^T$, $D^{-1} = \mathrm{diag}(V_1 + A, \ldots, V_n + A)$, and $B = \mathrm{diag}(B_1, \ldots, B_n)$

be the parameter vectors and matrices. The empirical Bayes estimate for the slope of the $i$th subject is its posterior mean

$\hat\beta_i^{EB} = (1 - B_i) b_{i,ols} + B_i(\hat\gamma_0 + \hat\gamma_1 t_{iv_i})$, (2.32)

or in matrix form

$\hat\beta^{EB} = (I - B) b + B Z (Z^T D Z)^{-1} Z^T D b$, (2.33)

where $\hat\gamma_0$ and $\hat\gamma_1$ are the weighted least squares estimates, $\hat\gamma = (Z^T D Z)^{-1} Z^T D b$, from the marginal distribution of $b_{i,ols}$. The corresponding covariance matrix is

$\mathrm{var}(\hat\beta^{EB}) = \left[I - B + B Z (Z^T D Z)^{-1} Z^T D\right] D^{-1} \left[I - B + B Z (Z^T D Z)^{-1} Z^T D\right]^T$. (2.34)

Mori et al. suggest using the sample mean of the empirical Bayes estimates of the individual slopes as an estimate of the population rate of change,

$\hat\beta = \frac{1}{n}\mathbf{1}^T\hat\beta^{EB}$, (2.35)

with variance

$\mathrm{var}(\hat\beta) = \frac{1}{n^2}\mathbf{1}^T \mathrm{var}(\hat\beta^{EB})\,\mathbf{1}$. (2.36)

It can be shown that Wu and Bailey's (1989) LMVUB estimator is a special case of (2.35) with $B_i = 1$ for all $i$, or equivalently $A = 0$. Note also that if $B_i = 0$, or equivalently $V_i = 0$ for all $i$, (2.35) reduces to Wu and Carroll's (1988) unweighted estimator $\hat B_{UWLS}$. When the variances $V_i$ and $A$ are unknown, Mori et al. (1992) give an iterative procedure to estimate them. Their simulation study shows that the empirical Bayes estimator of the population slope performs comparably with the LMVUB and LMMSE estimators.

Follmann and Wu (1995) generalize Wu and Carroll's (1988) random effects linear growth curve model (2.1) to allow a link function

$g\{E(y_{ij} \mid \beta_i)\} = \alpha^T w_{ij} + \beta_i^T z_{ij}$, (2.37)

where $w_{ij}$ and $z_{ij}$ are vectors of covariates, $\alpha$ is a vector of fixed parameters, and $\beta_i$ is a vector of random regression coefficients for the $i$th patient ($i = 1, \ldots, N$). Instead of deriving estimates from the joint likelihood function, they provide an algorithm to estimate the parameters approximately.

2.2 Joint Likelihood Approach

In many AIDS studies, a patient's CD4 count is measured periodically until the patient dies or the study is terminated. After the patient dies, his or her CD4 count is no longer available. Because the patient's CD4 count is closely related to mortality, death causes informative (non-ignorable) missingness. The regular ML or restricted ML (REML) approach to the repeated measures analysis of CD4 count is biased and not efficient, because the missing mechanism is neither MCAR nor MAR. Unlike the Lavori et al. (1995) multiple imputation method, DeGruttola and Tu (1995) try to establish a relationship between CD4 count and survival time, so that the repeated measures analysis of CD4 count and the survival analysis are carried out simultaneously by maximizing the joint likelihood function. Let a general repeated measures model for CD4 count be

$y_i = T_i\alpha + Z_i b_i + e_i$, (2.38)
where $T_i$ and $Z_i$ are known design matrices for treatment and repeated measures, $\alpha$ is a vector of unknown treatment effects, $b_i \sim N(0, D)$ is an unknown patient effect, and $e_i \sim N(0, \sigma^2 I)$ is the vector of model residuals. As an illustration, the above repeated measures model can be simplified as

$\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iv_i} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \end{pmatrix} \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \end{pmatrix} + \begin{pmatrix} 1 & t_{i1} \\ 1 & t_{i2} \\ \vdots & \vdots \\ 1 & t_{iv_i} \end{pmatrix} \begin{pmatrix} b_{i1} \\ b_{i2} \end{pmatrix} + e_i$, for $i \in$ treatment group 1, (2.39)

where $\alpha_0$ is the grand mean, $\alpha_1$ and $\alpha_2$ ($= -\alpha_1$) are the effects of treatments 1 and 2, $t_{i1}, \ldots, t_{iv_i}$ are the treatment measurement schedule, $b_{i1}$ and $b_{i2}$ are the regression intercept and slope for the $i$th patient, and $v_i$ is the number of repeated measures patient $i$ has before death or censoring.

When the probability of observing the CD4 count at a given time depends only on the previously observed values of CD4 (i.e., MAR), the ML estimates or the restricted ML estimates (Laird and Ware, 1982) are unbiased (Little and Rubin, 1987). However, when the probability of observing the CD4 count at a given time depends on the true and unobserved CD4 count, the missing mechanism is no longer MAR, and the regular likelihood approach is not appropriate. It tends to overestimate the treatment effect because the very low unobserved CD4 counts are removed by deaths.

In order to take care of the informative missingness of CD4 count due to death, DeGruttola and Tu associate patient effects with their survival times, $x_i$, by the following accelerated life regression (ALR) model

[Equation (2.40) is missing from the source transcript.] (2.40)

where $w_i$ is a known design matrix for treatment and/or covariates, $\zeta$ is an unknown parameter vector, and $r_i \sim N(0, s^2 I)$ is the model residual.
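The overestimation described above can be illustrated with a small simulation. The sketch below is not the DeGruttola and Tu model: it assumes a hypothetical linear link between each patient's slope and log survival time, with made-up parameter values, and compares the mean slope of all patients with the mean slope of those who survive to the end of follow-up.

```python
import random
from statistics import mean

def simulate_informative_dropout(n=5000, seed=0):
    """Compare the mean slope of all patients with that of survivors.

    Hypothetical model (assumed for illustration, not from the source):
      slope              b_i ~ N(-1.0, 0.5^2)
      log survival time  = 1.0 + 1.0 * b_i + N(0, 0.3^2)
    A patient counts as a completer when the log survival time exceeds 0,
    so patients with steeper declines tend to die before the end of study.
    """
    rng = random.Random(seed)
    slopes, completer_slopes = [], []
    for _ in range(n):
        b = rng.gauss(-1.0, 0.5)
        log_survival = 1.0 + 1.0 * b + rng.gauss(0.0, 0.3)
        slopes.append(b)
        if log_survival > 0.0:
            completer_slopes.append(b)
    return mean(slopes), mean(completer_slopes)

all_mean, completer_mean = simulate_informative_dropout()
print(all_mean, completer_mean)
```

Under these assumptions the survivors' mean slope is noticeably less negative than the population mean slope, which is the direction of bias that motivates tying the patient effect to survival time.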
To write down the joint likelihood function of repeated measures and survival times, they made two assumptions: (1) patient censoring is random and non-informative, i.e., patient censoring time does not depend on the survival time; (2) survival time depends only on the parameters that describe the true, but unobserved, trajectory of the CD4 count. The rationale for associating survival time with the patient effect in the repeated measures model is that treatment effects are fixed and identical for all patients. They do not change no matter which patient dies or drops out, or what the reasons for the censoring are. It is the patient effect that determines death and consequently the missing values, and it is this effect that impacts the estimation of treatment effects. Note that the patient effect $b_i$ has its meaning in the repeated measures model, e.g., the slope and intercept of the linear trend. However, it is difficult to interpret its role in the survival time model. Consequently, it is unclear whether the functional form of survival time on the patient effect is reasonable or adequate.

The likelihood contribution from the repeated measures of CD4 counts of patient $i$ can be expressed as (omitting subscript $i$)

$L_1 = P(y_1, y_2, \ldots, y_v \mid b)$ (2.41)
$\quad = P(y_1 \mid b) \cdot P(y_2 \mid y_1; b) \cdots P(y_v \mid y_1, \ldots, y_{v-1}; b)$
$\quad = P(y_1 \mid b) \cdot P(y_2 \mid t_2; b)\,P(t_2 \mid y_1; t_2 < x, b)\,P(y_3 \mid t_3; b)\,P(t_3 \mid y_1, y_2; t_3 < x, b) \cdots$
$\quad = \phi(y_1 \mid b) \prod_{j=2}^{M} \left\{\left[\phi(y_j \mid b)\, g(t_j \mid y_1, \ldots, y_{j-1}; t_j < x)\right]^{I(t_j < x)}\right\}$,

where $x$ is the survival time, $\phi$ is the standard normal density function, $g(t_j \mid y_1, \ldots, y_{j-1}; t_j < x) = P(t_j \mid y_1, \ldots, y_{j-1}; t_j < x, b)$ is the conditional probability that a surviving patient has an observation at time $t_j$, and $M$ is the maximum possible number of measurements.
The likelihood contribution from his or her survival time measurement is

$L_2 = [\phi(x \mid b)]^{\delta}\,[1 - \Phi(c \mid b)]^{1-\delta}$, (2.42)

where $c$ is the censoring time, $\delta = 0$ or $1$ is a censoring indicator, and $\Phi$ is the standard normal cumulative distribution function. Conditional on the patient effect, the joint likelihood contribution is the product of the likelihoods of repeated measures and survival time,

$L = L_1 L_2 = \phi(y_1 \mid b) \prod_{j=2}^{M} \left\{\left[\phi(y_j \mid b)\, g(t_j \mid y_1, \ldots, y_{j-1}; t_j < x)\right]^{I(t_j < x)}\right\} [\phi(x \mid b)]^{\delta}\,[1 - \Phi(c \mid b)]^{1-\delta}$. (2.43)

Assuming $g(t_j \mid y_1, \ldots, y_{j-1}; t_j < x)$ does not contain the model parameters, it can be ignored in the likelihood expression (after taking the partial derivative of the log-likelihood with respect to the model parameters, this term will disappear). The joint conditional likelihood (2.43) can be simplified to

$L' = \phi(y_1 \mid b) \prod_{j=2}^{M} \left\{\phi(y_j \mid b)^{I(t_j < x)}\right\} [\phi(x \mid b)]^{\delta}\,[1 - \Phi(c \mid b)]^{1-\delta} = \phi(y \mid b)\,[\phi(x \mid b)]^{\delta}\,[1 - \Phi(c \mid b)]^{1-\delta}$. (2.44)

The joint unconditional likelihood is thus

$L(b_i, \alpha, \sigma^2, \zeta, s^2, D \mid \delta) = P(y_i, x_i, b_i) = L' P(b_i)$, (2.45)

where $P(b_i) = \phi(b_i \mid D)$ is the density function of the random effect $b_i$. For a patient who died, his or her likelihood contribution is

$L^{(d)} = L(b_i, \alpha, \sigma^2, \zeta, s^2, D \mid \delta = 1) = \phi(y_i \mid b_i, \alpha, \sigma^2)\,\phi(x_i \mid b_i, \zeta, s^2)\,\phi(b_i \mid D)$. (2.46)

For a patient who is censored, his or her likelihood contribution is

$L^{(c)} = L(b_i, \alpha, \sigma^2, \zeta, s^2, D \mid \delta = 0) = \phi(y_i \mid b_i, \alpha, \sigma^2)\,[1 - \Phi(c_i \mid b_i, \zeta, s^2)]\,\phi(b_i \mid D)$. (2.47)

The total observed log-likelihood is

$\sum_{\text{died}} \log \int L^{(d)}\, db_i + \sum_{\text{censored}} \log \int L^{(c)}\, db_i$. (2.48)

The parameters are estimated via an E-M algorithm. Assuming $b_i$ and $x_i$ are known for every patient, the complete-data log-likelihood is written as

$L = \sum_{i} \log\left\{\phi(y_i \mid b_i, \alpha, \sigma^2)\,\phi(x_i \mid b_i, \zeta, s^2)\,\phi(b_i \mid D)\right\}$. (2.49)

The sufficient statistics from the first term are $\sum_i e_i^T e_i$ and $\sum_i Z_i b_i$, from the second term are $\sum_i r_i^2$ and $\sum_i w_i b_i$ [one further statistic is illegible in the source], and from the third term is $\sum_i b_i b_i^T$. The E-step is to find the conditional expectation of the sufficient statistics
$E(\text{sufficient statistics} \mid y_{obs}, \hat\theta)$, (2.50)

where $\hat\theta$ is the parameter estimate from the previous E-M step. The M-step is to perform ML estimation of the model parameters,

$\hat\sigma^2 = E\left(\sum_{i=1}^{N} e_i^T e_i \mid y_{obs}, \hat\theta\right) \Big/ \sum_{i=1}^{N} v_i$, $\quad \hat D = \frac{1}{N} E\left(\sum_{i=1}^{N} b_i b_i^T \mid y_{obs}, \hat\theta\right)$, etc. (2.51)

The ML estimates of the model parameters are obtained by iterating between (2.50) and (2.51) until convergence. For hypothesis testing, the observed-data likelihood

$L_s = \sum_{\text{died}} \log\left\{\phi(y_i \mid \hat\theta)\,\phi(x_i \mid y_i; \hat\theta)\right\} + \sum_{\text{censored}} \log\left\{\phi(y_i \mid \hat\theta)\,[1 - \Phi(c_i \mid y_i; \hat\theta)]\right\}$ (2.52)

is used.

Greenlees et al. (1982) consider the problem of missing responses in the regression analysis

$y_i = x_i\beta + e_i$, (2.53)

where $y_i$ is the response of the $i$th subject, and $x_i$ is a vector of auxiliary variables. Assume that the probability of response is a logistic function of the response and other covariates $z_i$,

$P(R_i = 1 \mid y_i, z_i) = \dfrac{1}{1 + \exp(-\alpha - \gamma y_i - z_i^T\delta)}$, (2.54)

where $R_i = 1$ if $y_i$ is observed, and $0$ if $y_i$ is missing. The factor of the likelihood for each of the respondents is the product of the probability of response and the density function of the response,

$L_i = \dfrac{1}{1 + \exp(-\alpha - \gamma y_i - z_i^T\delta)} \times \dfrac{1}{\sigma}\,\phi\!\left(\dfrac{y_i - x_i\beta}{\sigma}\right)$. (2.55)

For each nonrespondent, the factor of the likelihood is the marginal probability of nonresponse,

$L_j = \displaystyle\int \left[1 - \dfrac{1}{1 + \exp(-\alpha - \gamma y - z_j^T\delta)}\right] \dfrac{1}{\sigma}\,\phi\!\left(\dfrac{y - x_j\beta}{\sigma}\right) dy$. (2.56)

The likelihood function of the entire sample is thus $\prod_{i \in \text{respondents}} L_i \times \prod_{j \in \text{nonrespondents}} L_j$. ML estimates of the parameters of this model are found by numerically maximizing the log of this function with respect to $\alpha$, $\gamma$, $\delta$, $\beta$, and $\sigma$, given $y$ for all respondents and $z$ and $x$ for all respondents and nonrespondents.

Diggle and Kenward (1994) model the dropout process by a logistic regression model

$\mathrm{logit}\{p_k(H_k, y; \beta)\} = \beta_0 + \beta_1 y^{mis} + \sum_{j=2}^{k} \beta_j y_{k+1-j}$, (2.57)
where the first term represents an MCAR process when the latter two are absent, or a regression intercept otherwise, the second term represents informative dropout with respect to the current unobserved response value $y$ (denoted as $y^{mis}$), and the third term represents an MAR process. They then maximize the following joint log-likelihood function to obtain the parameter estimates and test hypotheses,

$L(\theta, \phi, \beta) = L_1(\theta, \phi) + L_2(\beta) + L_3(\theta, \phi, \beta)$, (2.58)

where

$L_1(\theta, \phi) = \sum_{i=1}^{n} \log\{f(y_{i,obs} \mid \theta, \phi)\}$ (2.59)

is the log-likelihood due to the observed responses,

$L_2(\beta) = -\sum_{i}\sum_{k} \log\left\{1 + \exp\left(\beta_0 + \sum_{j} \beta_j y_{i,k+1-j}\right)\right\}$ (2.60)

is the log-likelihood due to the MAR dropout process, and

$L_3(\theta, \phi, \beta) = \sum_{i} \log \displaystyle\int \dfrac{\exp(\beta_1 y^{mis})}{1 + \exp(\beta_1 y^{mis})}\, f(y^{mis} \mid y_{i,obs}, \theta, \phi)\, dy^{mis}$ (2.61)

is the log-likelihood due to the informative dropout process, in which the dropout probability has to be evaluated by numerical integration. Note that $f(\cdot)$ is the normal density function

$f(y \mid \theta, \phi) = (2\pi)^{-v/2}\,|V(\phi)|^{-1/2} \exp\left\{-\tfrac{1}{2}(y - \theta)^T V(\phi)^{-1} (y - \theta)\right\}$.

Both Greenlees et al. (1982) and Diggle and Kenward (1994) assume a logistic model for the missing/dropout process. The difference between the two is that the former method deals with a response variable measured once, while the latter deals with a response variable with repeated measures. Diggle and Kenward's method is thus a generalization of the method of Greenlees et al.

Hogan and Laird (1996) consider a piecewise linear random effects model for analyzing longitudinal data where the multivariate outcome can depend upon the time spent on treatment. They maximize the following joint likelihood function with respect to the model parameters $(\beta, \alpha, \Phi)$:

[Equation (2.62) is illegible in the source.] (2.62)

where $d$ is the survival time, $p(\cdot)$ and $S(\cdot)$ are its probability density and survival functions, $f_1(\cdot)$ and $f_2(\cdot)$ are the multivariate conditional density functions, $\delta$ is an indicator function for a subject who completes the trial ($\delta = 1$) or dies during the trial ($\delta = 0$), and $b$ is a vector of random effects in the piecewise linear model.

Smith and Helms (1995) apply the E-M algorithm to estimate fixed effects and variance components in a crossover trial when censoring is informative.

2.3 Propensity Score Approach for Imputing Missing at Random Data

In the statistical analysis of a treatment group difference, pair matching, stratification, and covariate adjustment are often used to reduce the bias caused by imbalance in a set of confounding covariates. Such stratification or adjustment is straightforward when there is only one covariate, as Cochran (1968) shows that five strata are sufficient to remove over 90% of the bias for many continuous distributions. However, as the number of covariates increases, the number of strata grows exponentially. Direct stratification then becomes virtually impossible, because many strata will contain either treated or control observations, but not both. Rosenbaum and Rubin (1983, 1984) show that the multivariate stratification problem can be reduced to a univariate stratification problem through the propensity score. A propensity score is the conditional probability of assigning a subject to a particular treatment given a set of observed covariates. By fitting a logistic regression model of the group assignment on a set of stratification variables, the propensity score can be estimated for each subject. Thus, instead of stratifying directly on the multivariate data, the new stratification is based on the propensity scores.
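The propensity score stratification just described can be sketched in a few lines of Python. This is an illustration only: the logistic model is fitted by plain gradient ascent (a stand-in for whatever fitting routine one would use in practice), the single covariate and the group labels are hypothetical, and the function names are not from the source.

```python
import math

def fit_logistic(X, y, steps=2000, lr=0.1):
    """Fit P(y=1 | x) = 1 / (1 + exp(-(b0 + b.x))) by gradient ascent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
            resid = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity(beta, xi):
    """Estimated probability of group membership for covariates xi."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

def quintile_strata(scores):
    """Assign each subject to one of five rank-based strata (0 to 4)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * 5 // len(scores)
    return strata

# Hypothetical data: one covariate per subject, 1 = treated, 0 = control.
X = [[0.1], [0.4], [0.5], [0.9], [1.2], [1.5], [1.7], [2.0], [2.3], [2.8]]
y = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

beta = fit_logistic(X, y)
scores = [propensity(beta, xi) for xi in X]
strata = quintile_strata(scores)
print(strata)
```

With several covariates, `X` simply gains columns while the stratification step stays univariate, which is the point of working with the score rather than the covariates themselves.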
Yet, compared to the univariate subclassification, the multivariate stratification based on propensity scores carries the same bias-removal property.

Motivated by the data from the Cross-National Collaborative Panic Study (CNCPS), in which the placebo group has a very high dropout rate compared to the two other treatment groups, Lavori et al. (1995) develop a multiple imputation method under the assumption of an MAR mechanism. The strategy of the imputation is to produce a complete data set based on the observed histories of the response variable and covariates up to the point of non-adherence (dropout). Ideally, after stratifying by the baseline covariates and observed histories of the outcome variables, one can impute the missing values by values sampled randomly from the observed outcomes in the same stratum. This procedure allows multivariate cross-sectional imputation. The difficulties in following this strategy are that (1) patients remained on study for varying times, so the observed histories of outcome vary in length; (2) the outcome history is highly multivariate (multidimensional), which makes it impossible to stratify on the history data directly; and (3) the variability of the estimates cannot be calculated directly from the imputed complete data sets.

The first problem can be solved by sequential imputation. For example, to arrive at imputed week 8 values, we first impute early weeks based on the fully observed baseline data and partially observed early response data. Then we use those 'pseudovalues' to construct 'histories' that form the basis for imputations at subsequent weeks. The second problem can be solved by using the propensity score $\pi(x)$ of a patient who completes or has imputed values up to the time of evaluation, given his or her baseline and response histories $x$.
It can be estimated from a logistic regression model for the probability of the patient having non-missing responses at the time of evaluation. Having computed the propensity score $\pi(x)$, we can form strata based on it, and then select randomly from the stayer subjects (complete up to the particular time of evaluation) to impute values for dropout patients in the same $\pi(x)$ stratum. The imputation based on stratification of the propensity score is unbiased because the property that missingness is independent of the unobserved response given the history $x$ carries over to independence given the propensity score, as shown by Rosenbaum and Rubin (1983).

The third problem is solved via the approximate Bayesian bootstrap (ABB) method (Rubin and Schenker, 1986, 1991; Glynn et al., 1993). This method first randomly samples (with replacement) $n_{obs}$ observed values of the response from the $n_{obs}$ stayers, then randomly samples (with replacement) $n_{mis}$ values from these $n_{obs}$. The purpose of the first sampling is to create the posterior predictive distribution of the missing data given the observed data, and the purpose of the second sampling is to impute the missing values from this distribution. By repeating this ABB procedure multiple times, the variability of the final estimate can be accounted for by the between-sample variation and the within-sample variation. The following flow chart illustrates the ABB method.

[Flow chart: Approximate Bayesian Bootstrap (ABB). The observed values are sampled with replacement; the resulting sample is sampled with replacement again; the missing values are imputed by the values drawn.]

The detailed imputation procedures of Lavori et al. (1995) are described as follows. For each treatment group, at each scheduled time point, a propensity score is computed for each patient.
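The two-stage ABB draw described above can be sketched as follows. This is a minimal illustration for scalar responses within a single stratum; the function names, the stayer values, and the default number of repetitions are assumptions made for the example.

```python
import random

def abb_draw(observed, n_missing, rng):
    """One approximate Bayesian bootstrap draw.

    Stage 1: resample n_obs values with replacement from the observed values.
    Stage 2: draw the n_mis imputed values with replacement from that resample.
    """
    pool = [rng.choice(observed) for _ in range(len(observed))]
    return [rng.choice(pool) for _ in range(n_missing)]

def abb_multiple(observed, n_missing, m=10, seed=0):
    """Repeat the ABB draw m times to obtain m sets of imputed values."""
    rng = random.Random(seed)
    return [abb_draw(observed, n_missing, rng) for _ in range(m)]

# Hypothetical responses of five stayers in one propensity stratum,
# used to impute two dropouts from the same stratum.
stayers = [4.0, 5.5, 6.0, 7.5, 9.0]
imputations = abb_multiple(stayers, n_missing=2, m=10)
for values in imputations:
    print(values)
```

Drawing from the stage 1 resample rather than directly from the stayers is what lets the imputed values vary between repetitions by more than a simple hot-deck draw would.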
Starting from week 3 (assuming patients had at least two follow-up values after baseline), they fit a logistic regression of the propensity (probability) to remain on study through week 3 given the patient's observed trajectory through week 2:

$$\pi_3(x_0, y_{1,obs}, y_{2,obs}) = \Pr(z_3 = 1 \mid x_0, y_{1,obs}, y_{2,obs}), \qquad (2.63)$$

where $x_0$ is the fully observed baseline covariate vector, $y$ is the vector of multivariate responses with subscript "obs" standing for observed and "mis" for missing, and $z$ is the indicator for 'stayer' ($z=1$ if $y$ is observed) or 'dropout' ($z=0$ if $y$ is missing). The subscripts on $y$ and $z$ indicate the particular week of the trial. Having computed the propensity scores, the next step is to order patients by their propensity score, $\pi_3(x_0, y_{1,obs}, y_{2,obs})$. Patients are divided into strata based on quintiles of the $\pi_3$'s. Suppose there are $n_{obs}$ observed responses and $n_{mis}$ missing responses, $n = n_{obs} + n_{mis}$. The ABB creates $M$ ignorable repeated imputations as follows. For $m = 1, \ldots, M$ ($M = 10$), the ABB first creates $n_{obs}$ possible values (vectors) of $y$ by drawing $n_{obs}$ values at random with replacement from the $n_{obs}$ observed values of $y$, and then draws the $n_{mis}$ missing values of $y$ at random with replacement from the $n_{obs}$ possible values. Drawing the $n_{mis}$ missing values from a possible sample of $n_{obs}$ values, rather than from the observed sample of $n_{obs}$ values, generates appropriate between-imputation variability. Note that, because the entire vector of week 3 responses from the selected week 3 stayer is used to fill in the missing responses of the dropout patient, the covariance structure of the multivariate outcome is preserved.

With $M$ "complete" data sets at week 3, they compute $M$ imputations for week 4 in parallel to the above steps. That is, letting $y_{3,mis}^{(m)}$ denote the $m$th imputation of the missing week 3 data, they form the "complete" data set $y_3^{(m)} = (y_{3,obs}, y_{3,mis}^{(m)})$ for $m = 1, \ldots, M$. Then, using the entire observed baseline covariates and $y_3^{(m)}$, they estimate the $m$th propensity to remain on study through week 4,

$$\pi_4^{(m)}(x_0, y_{1,obs}, y_{2,obs}, y_3^{(m)}) = \Pr(z_4 = 1 \mid x_0, y_{1,obs}, y_{2,obs}, y_3^{(m)}), \qquad (2.64)$$

and impute the missing week 4 responses by ABB draws within the quintiles of $\pi_4^{(m)}(x_0, y_{1,obs}, y_{2,obs}, y_3^{(m)})$, once for each $m$ (imputed data set from week 3). This yields $M$ sets of data complete through week 4. They proceed again to impute week 6 missing data based on the complete data of baseline and weeks 1 and 2, and the imputed (partially complete) data of week 4. (Note the measurement schedule is baseline, weeks 1, 2, 3, 4, 6, and 8.) Finally, they impute week 8 missing data based on the complete data of baseline and weeks 1 and 2, and the imputed (partially complete) data of weeks 4 and 6. Thus, they generate $M$ imputed data sets. The following flow chart illustrates the sequential imputation procedure.

[Flow chart: sequential imputation across baseline and weeks 1, 2, 3, and 4, with a propensity score computed before each imputation step.]

For each imputed data set, they compute the univariate parameter estimate $\hat\theta_m$ and its associated variance $U_m$, $m = 1, \ldots, M$, by the usual methods for complete data. The multiple imputation estimate of $\theta$ is the average of the $M$ complete-data estimates (Rubin and Schenker, 1986, 1991),

$$\bar\theta = \frac{1}{M}\sum_{m=1}^{M}\hat\theta_m. \qquad (2.65)$$

The adjusted variance estimate is (Rubin and Schenker, 1986, 1991)

$$T = \bar U + \left(1 + \frac{1}{M}\right)B, \qquad (2.66)$$

where $1 + \frac{1}{M}$ is an adjustment for finite $M$, $\bar U = \frac{1}{M}\sum_{m=1}^{M} U_m$ is the average within-imputation variance, and $B = \frac{1}{M-1}\sum_{m=1}^{M}(\hat\theta_m - \bar\theta)^2$ reflects the between-imputation component.
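The combining rules above are simple arithmetic, and a short Python sketch may make them concrete. It is an illustration only: the function name `combine_imputations` is hypothetical, NumPy is assumed, and the degrees of freedom use the form $v = (M-1)(1+r^{-1})^2$ with $r = (1+M^{-1})B/\bar U$ that accompanies these rules in Rubin and Schenker.

```python
import numpy as np

def combine_imputations(estimates, variances):
    """Pool M complete-data estimates and their variances.

    Returns (pooled estimate, total variance T, degrees of freedom v).
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    theta_bar = estimates.mean()            # average of the M estimates
    u_bar = variances.mean()                # within-imputation variance
    b = estimates.var(ddof=1)               # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b     # adjusted total variance
    r = (1.0 + 1.0 / m) * b / u_bar         # between-to-within ratio
    # With no between-imputation variation the t reference distribution
    # degenerates to the normal, represented here by infinite df.
    df = (m - 1) * (1.0 + 1.0 / r) ** 2 if r > 0 else float("inf")
    return theta_bar, total, df
```

For example, three imputed estimates 1, 2, 3, each with variance 0.5, pool to an estimate of 2 with total variance $0.5 + (4/3)(1) \approx 1.83$.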
The significance test can be constructed as $(\bar\theta - \theta)/\sqrt{T} \sim t_v$ (Rubin and Schenker, 1986, 1991), where the degrees of freedom $v = (M-1)(1+r^{-1})^2$ are based on a Satterthwaite approximation, with $r = (1+M^{-1})B/\bar U$ being the ratio of between- to within-imputation variance. Rubin and Schenker (1991) give the formulae for multivariate responses. In the same article, Lavori et al. apply the multiple imputation method and the completer and LOCF analyses to the CNCPS data. They find that the completer analysis and the LOCF analysis give extremely larger and smaller estimates, respectively, and yield grossly smaller standard errors than the multiple imputation analysis does.

This multiple imputation procedure is now fully implemented in SOLAS statistical software (Solas™ Imputation 2.0). As an improvement to the Lavori et al. (1995) method, the software now gives the option to sample from the closest matching cases in propensity score for each missing value, instead of sampling within fixed quintile groups as in the original method.

Instead of direct imputation, Heyting et al. (1992) consider a weighting method for the estimation of the treatment group mean when the probability of dropout is dependent on the observed responses and covariates, but independent of the unobserved response. They propose that the group mean at time $t$ be estimated by the weighted average of the observed responses at time $t$,

$$\hat\mu_t = \frac{\sum_i r_i \hat\pi_i^{-1} y_{it}}{\sum_i r_i \hat\pi_i^{-1}}, \qquad (2.67)$$

with $r_i = 0$ if patient $i$ is a dropout and $r_i = 1$ if the patient completes the study (i.e., if the value of response $y_{it}$ is observed). The weight is based on $\hat\pi_i$, the estimated propensity score of patient $i$ completing the study given the observed responses and covariates. The rationale of this approach is as follows.
Within each subclass of the population defined by the propensity score, all subjects have similar covariates and response profiles. Of this subclass, a proportion $\pi$ is destined to complete the trial and the remainder is destined to withdraw prematurely. The entire population can be reconstructed by inflating the sample size of completers within each subclass by a factor of $\pi^{-1}$. Despite the simplicity of this method, its properties are not fully understood.

2.4 Other Approaches for Missing Data Analysis

A model-based imputation was recently developed and implemented in SOLAS software (Solas™ Imputation 2.0). The imputed values are sampled from the predictive distribution of responses for the model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon,$$

where the $x$'s are covariates associated with the responses and $\epsilon \sim N(0, \sigma^2)$ is model error. This model is appropriate in survey sampling or cross-sectional analysis when data are MCAR or MAR. For longitudinal observations, to better predict the missing responses at time $t$, one may treat $y_{t-1}, \ldots, y_{t-q}$ as covariates in the calculation of the predictive distribution for $y_t$. The properties of this method are unknown because no simulation study has been performed.

Schafer (1997) describes the EM algorithm and the data augmentation method for the imputation of missing values in normal and categorical data. The imputation software is available from the author.

Gould (1980) argues that, when comparing the effectiveness of treatments, the overall measure of the value of each treatment should be considered, involving not only effects on severity of illness but tolerability as well. He suggests a ranking procedure over the entire trial population as follows: The highest tied ranks are assigned to patients who withdraw cured; For completers, the ranks are assigned according to the observed responses.
The lowest tied ranks are assigned to those patients who withdraw for intolerance or lack of benefit. Patients are excluded from the analysis if they drop out for reasons apparently unrelated to outcome. This method may shed light on how sensitive the analysis results are to the imputation procedure. However, bias is unavoidable because the reason for withdrawal is subjective.

Little and Yau (1996) present a parametric version of imputation for the analysis of intent-to-treat populations. They first fit a linear regression model of the response conditional on the drug dose, the previous responses, etc. They then perform the multiple imputations sequentially, first filling in missing values of $y_{i2}$ as draws from the predictive distribution of $y_{i2}$ given $y_{i1}$, then filling in missing values of $y_{i3}$ as draws from the predictive distribution of $y_{i3}$ given observed or imputed values of $y_{i1}$, $y_{i2}$, and so on. To account for uncertainty in the parameter estimates, model parameters are first drawn from their posterior distribution $N(\hat\beta, \sigma^2 (X^T X)^{-1})$, and the missing values are then recovered by drawing from their predictive distribution $N(X_i\beta, \sigma^2)$ conditional on the drawn parameters (for patient $i = 1, \ldots, n$). Here, $X_i$ is the design matrix for patient $i$, $\hat\beta$ is the vector of regression parameter estimates from model fitting, and $\sigma^2$ is the residual variance, drawn as the model residual sum of squares divided by a $\chi^2$ random variable with the same degrees of freedom. Rubin and Schenker (1986) indicate that this multiple imputation procedure is asymptotically equivalent to the ABB procedure for large sample sizes and large numbers of imputations.

Crawford et al. (1995) compare complete-case analysis, mean imputation, single model-based imputation, and multiple model-based imputation methods.
Their results show that complete-case analysis and mean imputation yield biased estimates of population means when outcome values are missing non-randomly. Mean imputation and single model-based imputation underestimate standard errors by treating imputed values as if they were observed. Among the methods compared, multiple model-based imputation performs best in terms of bias and standard error adjustment.

Diggle (1989) discusses statistical tests for MCAR dropout in the repeated measures context. Ridout (1991) summarizes Diggle's (1989) tests by a logistic regression model of completion status and by tests for the association between dropout and values of regression covariates.

CHAPTER 3. METHODS OF IMPUTATION

3.1 Introduction

In the last chapter, we reviewed many different methods for analyses with missing data. We classified the methods into three categories: (1) the regression slope approach, (2) the joint likelihood approach, and (3) the propensity score with ABB approach. The first approach is appropriate when data are informatively missing with respect to individual patient regression slopes. However, this method does not allow covariate adjustment. The second approach is appropriate for informatively missing data and in theory can be applied to the analysis of virtually any type of outcome variable. However, it depends heavily on the assumed parametric structure of the model, and may not be easily adapted from one model to another. The third approach does not rely on the relationship between the outcome variable and the covariates beyond the form of the logistic regression for the propensity score. Also, this method is appropriate for MAR data.
It may work well when the probability of missingness is related to measurable covariates, e.g., $y_{t-1}$ and the $x$'s. It may not work well in the situation of informative missingness. One example in which the method will not work well is when the missing probability is related to the slope of the subject-specific regression. Moreover, in the case of the Lavori et al. (1995) imputation, the method may break the correlation among observations measured repeatedly within the same dropout patient, because it imputes missing observations of one patient with non-missing observations of other patients. For example, the imputed value for a missing response may come from one patient at time $t_1$ and from another patient at time $t_2$. Thus, the correlation between $t_1$ and $t_2$ is lost for the imputed case. While this may not affect the analysis of the data at a single fixed point in time, it may severely impact the analysis of repeated measures, which uses all the observed and imputed data from each patient.

In this dissertation, two imputation methods are proposed. The first one is based on propensity scores. It was originally presented in the Dissertation Proposal (denoted PROP-1 in this dissertation) and subsequently improved (denoted PROP-2 in this dissertation). The second method (denoted LGC-PPDS in this dissertation, for linear growth curve - posterior predictive distribution sampling) is developed to improve on the first one. It is based on Bayesian sampling from a posterior predictive distribution for the parameters in the linear growth curve model. The modified version of this method (denoted RMG-PPDS, for repeated measures model of general shape - posterior predictive distribution sampling) allows repeated measures models of general shape. The objective of these methods is to maintain the within-patient covariance structure for the imputed data.
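The loss of within-patient correlation described earlier in this section can be illustrated with a small Python simulation. This is a toy sketch under stated assumptions, not part of the dissertation's simulation study: the bivariate normal donors, the correlation value, and the function name `imputed_correlation` are all hypothetical.

```python
import numpy as np

def imputed_correlation(n, rho, same_donor, rng):
    """Correlation between imputed values at two time points.

    Donor responses at (t1, t2) are bivariate normal with correlation rho.
    Each of n dropouts receives a t1 value from one donor and a t2 value
    either from the same donor or from an independently chosen donor.
    """
    cov = [[1.0, rho], [rho, 1.0]]
    donors = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    donor_t1 = rng.integers(0, n, size=n)
    # Independent donor choice at t2 mimics separate cross-sectional draws.
    donor_t2 = donor_t1 if same_donor else rng.integers(0, n, size=n)
    return np.corrcoef(donors[donor_t1, 0], donors[donor_t2, 1])[0, 1]
```

With a large `n`, the same-donor correlation stays close to `rho`, while the correlation from independently chosen donors is close to zero. That is the behaviour the proposed methods aim to avoid by keeping a patient-level component in the imputation model.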
3.2 Methods of Imputation - First Proposal

This imputation is similar to that of Lavori et al. (1995). Instead of fitting on a set of measurable covariates only, we first estimate some non-measurable (non-observable) auxiliary variables by regression or repeated measures models. We then add these auxiliary variables to the covariate list for the logistic regression model. We apply a stepwise variable selection procedure to choose variables that are related to the probability of missingness. Once the propensity score is estimated for each patient from the logistic regression model, we estimate the patient effect within each quintile using a general repeated measures model. We then perform bootstrap re-sampling in conjunction with fitting a repeated measures model to estimate the effects of treatment and time points. Finally, we impute the missing responses by combining the estimated effects of patient, treatment, and time points using the general repeated measures model. We use Rubin and Schenker's (1991) method to adjust the variance for multiple imputations. The procedure is as follows. For the sake of analysis, each patient is required to have at least two measurements.

3.2.1 Methods as Presented in the Dissertation Proposal

To distinguish it from the other methods proposed, this method is denoted "PROP-1" in the rest of this dissertation. For each treatment group, do Steps 1 through 9:

1. Fit a linear growth curve model by the ML method using all the available data from every patient,

$$y_i = Z_i B + Z_i b_i + e_i, \qquad y_i = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iv_i} \end{pmatrix}, \quad Z_i = \begin{pmatrix} 1 & t_{i1} \\ 1 & t_{i2} \\ \vdots & \vdots \\ 1 & t_{iv_i} \end{pmatrix}, \quad b_i = \begin{pmatrix} b_{i1} \\ b_{i2} \end{pmatrix}, \qquad (3.68)$$

where $v_i$ is the last observed time point in patient $i$, $Z_i$ is a design matrix with $t_{i1}, \ldots, t_{iv_i}$ being the schedule of repeated measurements, $B$ is a mean vector of regression coefficients representing the unknown treatment group effect, $b_i \sim N(0, \Sigma_b)$ is a vector of unknown patient-specific regression coefficients, and $e_i \sim N(0, \Sigma_e)$ is a vector of model residuals representing the within-patient variation. It is assumed that $b_i$ and $e_i$ are independent of each other. The estimated patient-specific regression coefficients $\hat b_{i1}$ and $\hat b_{i2}$ are to be used as covariates in computing the propensity score in Steps 3 and 4.

2. Fit a general type of repeated measures model by the ML method using all the available data from every patient,

$$y_i = T_i \alpha + \mathbf{1}\gamma_i + e_i, \qquad T_i = \left( I_{v_i} \mid 0_{v_i \times (M - v_i)} \right), \qquad (3.69)$$

where $M$ is the total number of possible time points, $T_i$ is a design matrix for repeated measurements (the first $v_i$ rows of the $M \times M$ identity matrix), $\alpha$ is a vector of grand mean (representing another type of treatment group effect) and time effects, $\gamma_i \sim N(0, \sigma_\gamma^2)$ is a patient effect, and $e_i \sim N(0, \Sigma_e)$ is a vector of model residuals representing the within-patient variation. It is assumed that $\gamma_i$ and $e_i$ are independent of each other. The estimated patient effect $\hat\gamma_i$ is to be used as a covariate in computing the propensity score in Steps 3 and 4.

3. Fit a logistic regression model on a set of measurable covariates, e.g., $x_i$, $y_{iv_i}$, $y_{i,v_i-1}$, ..., and the patient-specific covariates $\hat b_i$ and $\hat\gamma_i$ estimated from Steps 1 and 2 for all patients,

$$\mathrm{logit}\, P(C_i = 1 \mid x_i, y_i, b_i, \gamma_i) = \lambda x_i + \kappa_1 y_{iv_i} + \kappa_2 y_{i,v_i-1} + \cdots + \eta b_i + \varphi\gamma_i, \qquad (3.70)$$

where $C_i = 1$ when patient $i$ completes the trial and 0 when he drops out of the trial. Use stepwise variable selection to screen variables for association with missingness.

4.
Compute the propensity score of a patient completing the trial using the above fitted logistic regression model,

$$P(C_i = 1 \mid x_i, y_i, b_i, \gamma_i) = \frac{\exp(\lambda x_i + \kappa_1 y_{iv_i} + \kappa_2 y_{i,v_i-1} + \cdots + \eta b_i + \varphi\gamma_i)}{1 + \exp(\lambda x_i + \kappa_1 y_{iv_i} + \kappa_2 y_{i,v_i-1} + \cdots + \eta b_i + \varphi\gamma_i)}. \qquad (3.71)$$

5. Divide patients into five quintile groups according to the ranks of their propensity scores.

Within each quintile group, do Steps 6 through 9:

6. Fit a general repeated measures model $y_i = T_i\alpha + \mathbf{1}\gamma_i + e_i$ and estimate the patient effect $\hat\gamma_i^*$ for each patient.

Repeat Steps 7 through 9 $Q$ times:

7. Sample with replacement a data set of the same size from within each quintile group defined in Step 5.

8. Fit a general repeated measures model $y_i = T_i\alpha + \mathbf{1}\gamma_i + e_i$ using the re-sampled data from the last step and estimate the vector of treatment and time effects $\alpha$.

9. Impute the missing response $y_i^* = (y_{i,v_i+1}, \ldots, y_{iM})^T$ by the model

$$y_i^* = T_i^* \hat\alpha^* + \mathbf{1}\hat\gamma_i^*, \qquad (3.72)$$

where $\hat\alpha^*$ is the estimated treatment and time effects from Step 8 and $\hat\gamma_i^*$ is the estimated individual patient effect from Step 6. The design matrix $T_i^* = \left( 0_{(M - v_i) \times v_i} \mid I_{M - v_i} \right)$ includes only the schedule of measurements that are missing.

10. Repeating the above Steps 1 through 9 for each treatment group generates $Q$ imputed "complete" data sets. This concludes the imputation for missing data.

11. Perform repeated measures ANOVA, or simple ANOVA at one fixed time point, and estimate the parameter of interest $\theta$ for each "complete" data set.

12. Compute the adjusted parameter estimate and variance estimate, respectively, as follows (Rubin and Schenker, 1991),

$$\bar\theta = \frac{1}{Q}\sum_{q=1}^{Q}\hat\theta_q, \qquad (3.73)$$

$$T = \bar U + (1 + Q^{-1})B, \qquad (3.74)$$
where $\hat\theta_q$ and $U_q$ are the parameter and variance estimates, respectively, from Step 11, and $\bar U = \frac{1}{Q}\sum_{q=1}^{Q} U_q$ and $B = \frac{1}{Q-1}\sum_{q=1}^{Q}(\hat\theta_q - \bar\theta)^2$ represent the within-imputation variance and the between-imputation variance, respectively.

13. The significance test is constructed as

$$(\bar\theta - \theta)/\sqrt{T} \sim t_v, \qquad (3.75)$$

where

$$v = (Q - 1)(1 + r^{-1})^2 \qquad (3.76)$$

is the degrees of freedom (Rubin and Schenker, 1991), with $r = (1 + Q^{-1})B/\bar U$ being the between-to-within variance ratio.

A flow chart of the above imputation procedures is shown below. Note that Step 1 computes random subject-specific regression intercepts and slopes in a linear growth curve model. Step 2 computes the random subject effect in a general repeated measures model. Although these two models may themselves be biased, it is our hope that if there are any latent variables relating to the dropout mechanism, they will be represented by these random effects. Steps 3 through 5 compute the propensity score based on the selected covariates. The purpose of these three steps is to find the values of those observed and unobserved variables conditional on which the missing data are ignorable. Akaike's information criterion (Akaike, 1973) or other cross-validation criteria may be used for variable selection. Step 6 fits the repeated measures model within each quintile. Because the missing data are now ignorable, the parameter estimates are unbiased. We estimate the random subject effects from this model. Steps 7 and 8 perform bootstrap re-sampling and estimate treatment and time effects for each re-sampled data set. The missing values are imputed in Step 9 by bringing together the effect of patient from Step 6 and the effects of treatment and time points from Steps 7 and 8.
The rationale for not re-sampling the patient effect is that we want to maintain the within-patient covariance structure for repeated measures (at least to the extent of compound symmetry). Steps 10 through 13 estimate the parameter of interest and adjust the variance of the estimate for multiple imputations.

[Flow chart of the PROP-1 procedure: 0. Data with missing values; 1. Growth curve model: $\hat b_i$; 2. Repeated measures model: $\hat\gamma_i$; 3. Logistic regression model; 4. Compute propensity score; 5. Create quintile groups; 6. Repeated measures model: $\hat\gamma_i^*$; 7. Bootstrap data $Q$ times; 8. Repeated measures model: $\hat\alpha^*$ for each of the bootstrap samples 1, 2, ..., $Q$; 9. Impute missing values $y_i^* = T_i^*\hat\alpha^* + \mathbf{1}\hat\gamma_i^*$, giving complete data sets 1, 2, ..., $Q$; 10. Repeat Steps 1-9 for each treatment group; 11. Estimate $\hat\theta_q$, $U_q$ for each complete data set; 12, 13. Variance adjustment and hypothesis test.]

3.2.2 Improvement on the Methods Presented in the Dissertation Proposal

The simulation study indicates that the method of Section 3.2.1 gives unsatisfactory results with respect to bias and type-I error rate (see PROP-1 in Tables 1.1, 2.1, 3.1 and 4.1). It is discovered that, for informative missingness with respect to slopes, early dropout patients have very steep slopes. However, the estimates of their slopes from the Step 1 model are highly biased because these patients contribute too few data points. An improvement is made by replacing missing $y$ values with their predicted values from a simple linear regression fitted for each individual patient, $y_{ij} = \alpha_i + \beta_i t_{ij} + e_{ij}$. This step is added to the first step. It reduces the bias in the estimate. To improve the type-I error rate, we use Bayesian sampling from the posterior predictive distribution. We replace Steps 6, 8, and 9 by the following.

6.
For each patient, we calculate $T_i = (Z_i^T Z_i)^{-1}$ and estimate the regression coefficient $\hat\beta_i = (Z_i^T Z_i)^{-1} Z_i^T y_i$ based on the model $y_i = Z_i\beta_i + e_i$, where

$$Z_i = \begin{pmatrix} 1 & t_{i1} \\ 1 & t_{i2} \\ \vdots & \vdots \\ 1 & t_{iv_i} \end{pmatrix}.$$

8. Sample the regression coefficient $\beta_i^*$ from its posterior distribution $N(\hat\beta_i, \sigma^2 T_i)$, where $\sigma^2$ is the pooled mean squared error from the individual regression models of all patients, using the re-sampled data from Step 7.

9. Impute the missing response $y_i^* = (y_{i,v_i+1}, \ldots, y_{iM})^T$ by the model

$$y_i^* = Z_i^*\beta_i^* + e_i^*, \qquad (3.77)$$

where $e_i^*$ is a sample from $N(0, \sigma^2)$. The design matrix

$$Z_i^* = \begin{pmatrix} 1 & t_{i,v_i+1} \\ 1 & t_{i,v_i+2} \\ \vdots & \vdots \\ 1 & t_{iM} \end{pmatrix}$$

includes only the schedule of measurements that are missing. This imputation is equivalent to sampling from the posterior predictive distribution of the missing response. Chapter 4 presents the results of the simulation study for this method.

Note that the functional form of the above imputation model is changed to a linear growth curve (random coefficient regression) (3.77) from the previous structure (3.72). This change means that the improved method can only be applied to a linear growth curve model. To distinguish it from the other methods proposed, this improved method is denoted "PROP-2" in the rest of this dissertation.

3.3 Methods of Imputation - Second Proposal

Section 3.2 describes the method of imputation in the earlier proposal. That method is based on sampling from the Bayesian posterior predictive distribution of the outcome variable. However, that method has low power and is biased when data are MAR or informatively missing with respect to unobserved outcome values. (See Chapter 4 for details.) Examining the patient list in each quintile group during simulation revealed that patients are frequently assigned to incorrect quintiles by Step 5 of the procedure.
This is likely caused by overestimation of the patient-specific regression slopes for MAR and underestimation of the slopes for informative missingness with respect to unobserved values. To avoid this problem, and restricting focus to repeated measures data that follow a linear growth curve model, a new method is developed. This method is based on sampling from the Bayesian posterior predictive distribution that is estimated from all data combined. A marginal posterior distribution for the variance components is first obtained from the growth curve model. The model parameters and errors are then sampled from the posterior distribution conditional on the variance components. The missing outcome is imputed by the predicted outcome based on the samples of model parameters and random individual observed variation from the posterior distribution. To distinguish it from the methods proposed in Section 3.2, this new imputation method is denoted LGC-PPDS (linear growth curve model - posterior predictive distribution sampling) in the rest of this dissertation. The steps of the imputation follow.

1. Assume the outcome $y$ follows a linear growth curve (mixed effects) model

$$y = X\beta + Z\gamma + \epsilon, \qquad (3.78)$$

which for subject $i$ reads

$$\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iv_i} \end{pmatrix} = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_{v_i} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_{v_i} \end{pmatrix} \begin{pmatrix} \gamma_{i0} \\ \gamma_{i1} \end{pmatrix} + \begin{pmatrix} \epsilon_{i1} \\ \epsilon_{i2} \\ \vdots \\ \epsilon_{iv_i} \end{pmatrix},$$

where $\beta$ is a "fixed" effect containing the population intercept and slope, $\gamma \sim N(0, G)$ is a random effect containing the individual subject variation around the population effect, and $\epsilon \sim N(0, R)$ is random model error; $v_i \le M$ is the number of time points observed and $M - v_i$ is the number of time points missing for the longitudinal measures of subject $i$. We further assume $R = \sigma^2 I$.

2. Let $\theta$ denote all unknown variance components in $G$ and $R$. Assume $\beta$ has a uniform prior $p(\beta \mid \theta) = 1$ independent of $\gamma$. The joint posterior density of $(\beta, \gamma, \theta)$ is

$$p(\beta, \gamma, \theta \mid y) = p(\beta, \gamma \mid \theta, y)\, p(\theta \mid y). \qquad (3.79)$$

3.
The density for $y$ and $\gamma$ for the linear growth curve model is

$$p(y \mid \beta, \gamma, \theta) = (2\pi)^{-v/2}\, |R(\theta)|^{-1/2} \exp\left\{ -\tfrac{1}{2}(y - X\beta - Z\gamma)^T R(\theta)^{-1} (y - X\beta - Z\gamma) \right\}, \qquad (3.80)$$

$$p(\gamma \mid \theta) = (2\pi)^{-1}\, |G(\theta)|^{-1/2} \exp\left\{ -\tfrac{1}{2}\gamma^T G(\theta)^{-1} \gamma \right\}. \qquad (3.81)$$

4. The conditional posterior density for $\beta$ and $\gamma$ is

$$p(\beta, \gamma \mid y, \theta) = \frac{p(y, \beta, \gamma \mid \theta)}{p(y \mid \theta)} \propto p(y, \beta, \gamma \mid \theta) = p(y \mid \beta, \gamma, \theta)\, p(\beta, \gamma \mid \theta) = p(y \mid \beta, \gamma, \theta)\, p(\beta \mid \theta)\, p(\gamma \mid \theta)$$
$$= (2\pi)^{-v/2}\, |R(\theta)|^{-1/2} \exp\left\{ -\tfrac{1}{2}(y - X\beta - Z\gamma)^T R(\theta)^{-1} (y - X\beta - Z\gamma) \right\} \times (2\pi)^{-1}\, |G(\theta)|^{-1/2} \exp\left\{ -\tfrac{1}{2}\gamma^T G(\theta)^{-1} \gamma \right\}. \qquad (3.82)$$

5. The conditional posterior means are

$$\hat\beta = \left( X^T V(\theta)^{-1} X \right)^{-} X^T V(\theta)^{-1} y, \qquad (3.83)$$

$$\hat\gamma = G(\theta) Z^T V(\theta)^{-1} (y - X\hat\beta), \qquad (3.84)$$

where the superscript "$-$" indicates the generalized inverse of a matrix (i.e., $AA^{-}A = A$) and

$$V(\theta) = Z G(\theta) Z^T + R(\theta). \qquad (3.85)$$

6. Let $\pi(\theta)$ be the marginal prior for $\theta$. The marginal posterior density for $\theta$ is

$$p(\theta \mid y) = \iint p(\theta, \beta, \gamma \mid y)\, d\beta\, d\gamma \qquad (3.86)$$
$$= \frac{\iint p(y, \theta, \beta, \gamma)\, d\beta\, d\gamma}{\iiint p(y, \theta, \beta, \gamma)\, d\beta\, d\gamma\, d\theta} \propto \iint p(y, \beta, \gamma \mid \theta)\, \pi(\theta)\, d\beta\, d\gamma = L(\theta)\pi(\theta),$$

where

$$L(\theta) = \iint p(y, \beta, \gamma \mid \theta)\, d\beta\, d\gamma \qquad (3.87)$$
$$= (2\pi)^{-(M-r)/2}\, |V(\theta)|^{-1/2}\, \left| X^T V(\theta)^{-1} X \right|^{-1/2} \exp\left\{ -\tfrac{1}{2}\left( y - X\hat\beta(\theta) \right)^T V(\theta)^{-1} \left( y - X\hat\beta(\theta) \right) \right\}$$

and

$$\hat\beta(\theta) = \left( X^T V(\theta)^{-1} X \right)^{-} X^T V(\theta)^{-1} y. \qquad (3.88)$$

Here $r = 2$ is the dimension of R(θ). See Searle et al. (1992), pp. 321-325, for details.

7. The choice of prior distribution for $\theta$ is non-informative. It can be

• Jeffreys' prior: $\pi(\theta) \propto |I_R(\theta)|^{1/2}$, the square root of the determinant of the Fisher information matrix, $[I_R(\theta)]_{rs} = \tfrac{1}{2}\,\mathrm{tr}\left\{ [K V^{-1} K^T]^{-1} V_r [K V^{-1} K^T]^{-1} V_s \right\}$, where $K = I - X(X^T X)^{-} X^T$ is the projection matrix. See Wolfinger and Kass (1999) for the derivation, or

• Uniform prior: $\pi(\theta) \propto 1$.

The Jeffreys' prior is a conjugate prior for the normal distribution and is transformation invariant. It has little influence on the posterior distribution.
The uniform prior is not transformation invariant. It has no effect on the posterior distribution. The imputation results from these two priors are not expected to differ for the model parameterization and distribution assumptions discussed in Step 1.

8. With the selected prior distribution $\pi(\theta)$, draw a random sample $\theta^*$ from the marginal posterior distribution $p(\theta \mid y)$ (see Step 6).

9. Conditioning on $\theta^*$ from Step 8, draw random samples $\beta^*$ and $\gamma^*$ from the multivariate normal distribution $p(\beta, \gamma \mid y, \theta^*)$ (see Step 4).

10. For subject $i$ who drops out at time $v_i + 1$, draw random samples $\epsilon^*_{i,v_i+1}, \ldots, \epsilon^*_{iM}$ from $N(0, \sigma_i^{2*})$, where $\sigma_i^{2*} = \sigma^{2*}\,\dfrac{M - 2}{v_i - 2}$. Here $(M - 2)/(v_i - 2)$ is an adjustment for the difference between the number of expected ($M$) and the number of observed ($v_i$) outcomes for each individual patient.

11. For each draw of $\theta^*$, $\beta^*$, and $\gamma^*$, impute the missing $y$ via the equation

$$\begin{pmatrix} y^*_{i,v_i+1} \\ y^*_{i,v_i+2} \\ \vdots \\ y^*_{iM} \end{pmatrix} = \begin{pmatrix} 1 & t_{v_i+1} \\ 1 & t_{v_i+2} \\ \vdots & \vdots \\ 1 & t_M \end{pmatrix} \beta^* + \begin{pmatrix} 1 & t_{v_i+1} \\ 1 & t_{v_i+2} \\ \vdots & \vdots \\ 1 & t_M \end{pmatrix} \gamma_i^* + \begin{pmatrix} \epsilon^*_{i,v_i+1} \\ \epsilon^*_{i,v_i+2} \\ \vdots \\ \epsilon^*_{iM} \end{pmatrix}. \qquad (3.89)$$

This imputation procedure is equivalent to sampling $y$ from its posterior predictive distribution.

12. Repeat Steps 8 - 11 $Q$ times and generate $Q$ imputed data sets. This concludes the imputation for the missing data.

13. Analyze the imputed data sets by the linear growth curve model if the population slope is of interest, or by a one-way ANOVA model if the outcome at only the final time point is of interest.

14. Compute the adjusted parameter estimate and variance estimate, respectively, as follows (Rubin and Schenker, 1991),

$$\bar\theta = \frac{1}{Q}\sum_{q=1}^{Q}\hat\theta_q, \qquad (3.90)$$

$$T = \bar U + (1 + Q^{-1})B, \qquad (3.91)$$

where $\hat\theta_q$ and $U_q$ are the parameter and variance estimates, respectively, from Step 13, and $\bar U = \frac{1}{Q}\sum_{q=1}^{Q} U_q$ and $B = \frac{1}{Q-1}\sum_{q=1}^{Q}(\hat\theta_q - \bar\theta)^2$ represent the within-imputation variance and the between-imputation variance, respectively.

15.
The significance test is constructed as where v = ( Q - 1)(1 + r -1)2 is the degrees of freedom (Rubin and Schenker, 1991) with r = (1 + Q~l )B / U being the between-to-within variance ratio. For missing at random (MAR) in which the missing probability depends on the last observed outcome value, no covariate is necessary in Step 1. However, for informative missing with respect to individual slopes and unobserved outcome values, (3.92.) 54 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. covariate, tv. - the censoring time, is needed in Step 1. The rationale for including this covariate is that it provides additional information in predicting missing outcome because it is highly correlated to individual slopes (Wu and Bailey, 1989) and the outcome variable. The above Bayesian sampling is carried out in SAS using Procedure Proc Mixed (SAS Institute). A flow chart of the above imputation procedure is shown on the next page. Step i constructs the likelihood function for the parameter based on the observed data and underlying growth curve model. Steps ii and iii provide the prior distributions for population parameters (3 (treatment group-specific intercept and slope) and 0 (variance components). Step iv calculates the posterior distribution p(0 | y) for 0. Step v draws a sample 0* from its posterior distribution. Step vi calculates the predictive distribution for e based on the variance sampled from Step v. Step vii draws a sample e* from its predictive distribution. Step viii calculates the posterior distribution p(P, y j y, 0*) conditional on the variance components sampled from Step v. Step ix draws samples P* and y* for their posterior conditional distribution. Step x calculates the imputed value for missing data based on the Bayesian prediction y * = Xp * +Zy * +e *. 55 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. iv. 
[Flow chart of the imputation procedure. Boxes: (i) growth curve model $y = X\beta + Z\gamma + \varepsilon$, with data $Y$; (ii) prior distribution $\pi(\beta \mid \theta) = 1$; (iii) prior distribution $\pi(\theta)$; (iv) posterior distribution of the variance components $\Sigma_\beta$ and $\sigma^2$ conditional on $\{Y_{obs}\}$; (v) draw random sample; (vi) predictive distribution of $\varepsilon$; (vii) draw random sample; (viii) posterior distribution of the population and individual slopes $\beta$ and $\gamma$ conditional on $\{Y_{obs}\}$, $\Sigma_\beta$ and $\sigma^2$; (ix) draw random sample; (x) impute missing $Y$ via $y^* = X\beta^* + Z\gamma^* + \varepsilon^*$.]

CHAPTER 4. SIMULATION STUDY

4.1 Introduction

In this chapter, we compare via simulation study the performance of our imputation methods with the methods of Lavori et al. (1995) and Wu and Bailey (1989). In addition, the ML method for repeated measures analysis is also compared. This latter method applies to the observed data only, without imputation. The data are simulated according to the parameters estimated from the IPPB trial (see Section 2.1 and also Wu and Bailey 1989). Monotonic missing data are created following MCAR, MAR, and informative missing mechanisms.

4.2 Simulation of Clinical Trial Data with Missing Observations

4.2.1 Generation of Complete Data

We simulate complete data from a linear growth curve model

$$\begin{pmatrix} y_{i(k)1} \\ y_{i(k)2} \\ \vdots \\ y_{i(k)J} \end{pmatrix} = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_J \end{pmatrix} \beta_{i(k)} + \begin{pmatrix} \varepsilon_{i(k)1} \\ \varepsilon_{i(k)2} \\ \vdots \\ \varepsilon_{i(k)J} \end{pmatrix} \qquad (4.93)$$

where $y_{i(k)j}$ is the outcome for subject $i = 1, \ldots, n$, group $k = 1, 2$, and time point $j = 1, \ldots, J$; $t_j$ is the time point; $\beta_{i(k)} \sim N(B_k, \Sigma_\beta)$ is a subject-specific regression coefficient; and $\varepsilon_{i(k)j} \sim N(0, \sigma_e^2)$ is model error. The parameter specifications are as follows: $n = 100$; $J = 14$; $t_1, \ldots, t_{14}$ = 0, 0.167, 0.333, 0.5, 0.75, 1.0, 1.25, 1.50, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0; model parameters

$$B_1 = \begin{pmatrix} 960 \\ -45 \end{pmatrix}, \quad B_2 = \begin{pmatrix} 960 \\ -90 \end{pmatrix}, \quad \Sigma_\beta = \begin{pmatrix} 152100 & -12420 \\ -12420 & 8281 \end{pmatrix}, \quad \text{and} \quad \sigma_e^2 = 24000.$$
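A minimal Python sketch of this data-generating step, using only the parameter values listed above, is given here for orientation. It assumes that n = 100 applies to each group (the text does not say whether n is per group or in total); the names `draw_coefficients` and `simulate_group` are hypothetical.

```python
import math
import random

TIMES = [0, 0.167, 0.333, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0]
B = {1: (960.0, -45.0), 2: (960.0, -90.0)}           # population intercept, slope
SIGMA_BETA = ((152100.0, -12420.0), (-12420.0, 8281.0))
SIGMA2_E = 24000.0


def draw_coefficients(mean_vec, cov, rng):
    """Draw one (intercept, slope) pair from a bivariate normal via a 2x2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return mean_vec[0] + l11 * z1, mean_vec[1] + l21 * z1 + l22 * z2


def simulate_group(k, n, rng):
    """Simulate complete outcomes for n subjects in group k under model (4.93)."""
    sd_e = math.sqrt(SIGMA2_E)
    subjects = []
    for _ in range(n):
        b0, b1 = draw_coefficients(B[k], SIGMA_BETA, rng)
        y = [b0 + b1 * t + rng.gauss(0.0, sd_e) for t in TIMES]
        subjects.append({"beta": (b0, b1), "y": y})
    return subjects


rng = random.Random(2000)                            # arbitrary seed
trial = {k: simulate_group(k, 100, rng) for k in (1, 2)}
print(len(trial[1]), len(trial[1][0]["y"]))
```

With these values the population endpoints at t = 3 are 960 - 45(3) = 825 for group 1 and 960 - 90(3) = 690 for group 2.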
These parameters are estimated by Wu and Bailey (1989) from the IPPB trial and used in their simulation study to assess their linear minimum variance unbiased estimator. They are also used by Wang-Clow et al. (1995) in the simulation study for the comparison of several non-imputation based methods. The use of the same parameters enables us to compare our results against theirs.

4.2.2 Creation of Missing Data from Complete Data

Missing data are generated by dropping out subjects from the complete data. The probability of dropout depends on the time point $t_j$, the last observed outcome $y_{i(k),j-1}$, the current unobserved outcome $y_{i(k)j}$, and the subject-specific true regression coefficient $\beta_{i(k)}$:

$$\Pr(\text{dropout at } t_j) = \Phi\left(\alpha_{0j} + \alpha_1 y_{i(k),j-1} + [\alpha_2, \alpha_3]\,\beta_{i(k)} + \alpha_4 y_{i(k)j}\right).$$

To avoid computational problems, all subjects have at least 3 observations. Approximately 50% of subjects are dropped out of the trial at various time points.

Six hundred (600) trials are simulated for each missing mechanism. The choice of 600 trials is consistent with those used in Wu and Bailey (1989) and Wang-Clow et al. (1995). It provides adequate precision for the estimate of the type-I error rate (the 95% confidence interval = [0.033, 0.067] when $\alpha = 0.05$) and, at the same time, does not impose a tremendous computational burden.

4.2.2.1 Missing Completely at Random (MCAR)

The probability of a subject dropping out of the trial at various time points is independent of the outcome and the subject-specific regression coefficient: $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$.

4.2.2.2 Missing at Random (MAR)

The probability of subject $i$ dropping out of the trial at time $t_j$ depends on the last observed outcome $y_{i(k),j-1}$: $\alpha_1 = -3.7 \times 10^{-3}$, and $\alpha_2 = \alpha_3 = \alpha_4 = 0$.
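The probit dropout rule of Section 4.2.2 can be sketched in Python as follows. The intercepts $\alpha_{0j}$ are not listed in this section, so a single constant `alpha0` is used purely as a placeholder assumption, and the rule is applied sequentially at each time point to subjects still in the trial, which is also an assumption. `probit` and `n_observed` are hypothetical helper names. The two coefficient tuples are the MCAR and MAR settings; the other mechanisms only change the tuple.

```python
import math
import random


def probit(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def n_observed(y, beta, alpha0, alpha, rng, min_obs=3):
    """Return v_i, the number of outcomes observed before dropout.

    y      : complete outcomes y_1..y_J for one subject
    beta   : (intercept, slope) of that subject
    alpha0 : placeholder for alpha_0j, assumed constant over j here
    alpha  : (alpha_1, alpha_2, alpha_3, alpha_4)
    """
    a1, a2, a3, a4 = alpha
    for j in range(min_obs, len(y)):      # the first min_obs outcomes are always kept
        eta = alpha0 + a1 * y[j - 1] + a2 * beta[0] + a3 * beta[1] + a4 * y[j]
        if rng.random() < probit(eta):
            return j                      # y[j] and all later outcomes become missing
    return len(y)                         # subject completes the trial


MCAR = (0.0, 0.0, 0.0, 0.0)
MAR = (-3.7e-3, 0.0, 0.0, 0.0)

rng = random.Random(0)                    # arbitrary seed
y = [960.0 - 90.0 * t for t in (0, 0.167, 0.333, 0.5, 0.75, 1.0)]
print(n_observed(y, (960.0, -90.0), alpha0=-1.5, alpha=MAR, rng=rng))
```

Returning the count of observed outcomes keeps the missing pattern monotone by construction: once a subject drops out, no later value is retained.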
4.2.2.3 Missing Informative With Respect to Random Subject Regression Coefficients (IM-Slope)

The probability of subject $i$ dropping out of the trial at time $t_j$ depends on the subject's intercept and slope $\beta_{i(k)}$: $\alpha_2 = -4.6 \times 10^{-3}$ and $\alpha_3 = -14 \times 10^{-3}$, and $\alpha_1 = \alpha_4 = 0$.

4.2.2.4 Missing Informative With Respect to Subject's Current Unobserved Outcome (IM-Unobs)

The probability of subject $i$ dropping out of the trial at time $t_j$ depends on the current unobserved outcome $y_{i(k)j}$: $\alpha_4 = -3.7 \times 10^{-3}$, and $\alpha_1 = \alpha_2 = \alpha_3 = 0$.

4.3 Methods for Missing Data Analyses

The missing data imputations and analyses are carried out by the following methods.

(1) The multiple imputation method based on Bayesian sampling (Chapter 3.3) - denoted LGC-PPDS. The final parameter estimators are obtained from maximizing the likelihood for the linear growth curve model using both observed and imputed data. Ten imputations are performed to allow adjustment for between-imputation variability;

(2) Wu and Bailey's (1989) linear minimum variance unbiased estimator for population slope (Chapter 2.1 with variance modification [equation 2.23]) - denoted Wu & Bailey;

(3) Lavori et al. (1995) multiple imputations method (Chapter 2.3) - denoted Lavori. The final parameter estimators are obtained from maximizing the likelihood for the linear growth curve model using both observed and imputed data. Five imputations are performed to allow adjustment for between-imputation variability;

(4) Maximum likelihood estimators from the linear growth curve model using only observed data (Laird and Ware, 1982) - denoted ML;

(5) Completer method - denoted Completer. This method is equivalent to Method (4) with the exception that only data from subjects who complete the trial are included in the model fitting;

(6) Method of last observation carried forward - denoted LOCF.
This method is equivalent to Method (4) with the exception that all missing data are imputed by LOCF prior to model fitting;

(7) The multiple imputation method based on the early dissertation proposal - denoted PROP-1 (Section 3.2.1) and subsequent improvement - denoted PROP-2 (Section 3.2.2).

All methods impute the missing data to create "complete" data sets with the exception of LMVUB and ML. The "complete" data sets are then analyzed by the repeated measures analysis model. The ML method applies the repeated measures analysis model directly to the data with missing values. For all methods, the population slope ($B_{k2}$) of each treatment group and the difference in slopes are estimated. In addition, the population endpoint ($\mu_k$), or response at the final time point $t = 3$ ($j = 14$), and the difference in endpoints are estimated for all methods except LMVUB, PROP-1 and PROP-2. Variances are adjusted for multiple imputations for the LGC-PPDS, Lavori, and PROP-1 and PROP-2 methods.

4.4 Comparisons for the Methods of Missing Data Analyses

For each method, we estimate the mean response of each treatment group and the mean difference in response between two treatment groups, A and B. We test the null hypothesis that there is no group difference by the t-test

$$t = \frac{\hat{\mu}_A - \hat{\mu}_B}{\sqrt{\widehat{\mathrm{var}}(\hat{\mu}_A - \hat{\mu}_B)}} \qquad (4.94)$$

when the null or alternative hypothesis is true, where $\hat{\mu}_A$ and $\hat{\mu}_B$ are the univariate parameter estimates and $\nu$ is the degrees of freedom of the error variance. Depending on the method of analysis, the degrees of freedom may vary from one method to another. We compare the rejection rates among the seven methods at a significance level of $\alpha = 0.05$. Thus, when the null hypothesis is true, what we compare is the type-I error rates.
When the alternative hypothesis is true, we compare the power of the methods. Another criterion for the comparison is the root mean squared error (RMSE) of the estimates. It is calculated by

$$\mathrm{RMSE}_k = \sqrt{\frac{1}{N_{trial}} \sum_{s=1}^{N_{trial}} \left(\hat{\mu}_k^{(s)} - \mu_k\right)^2} \qquad (4.95)$$

where $k$ = A and B is a group indicator and $N_{trial}$ is the total number of trials simulated. Because MSE can be decomposed into bias and variance, it is a combined measure of the overall accuracy of the estimates.

4.5 Results of Simulation Study

4.5.1 Sample Data Simulated

Complete data were simulated for each trial and then, according to the missing mechanisms, patients were selectively dropped out from trials at some time after the required three observations (i.e., $t > 0.33$). The minimum of 3 observations is required for consistency with the literature and to avoid computational problems. Figure 1 shows a sample plot of the proportion of observations over time after dropping patients according to MCAR, MAR, IM - Slope, and IM - Unobs. In this plot, the complete data were generated according to Section 4.2.1 for the treatment group with population regression coefficient $B = (960, -90)$. For MCAR, the dropout rate was constant. Approximately 55% of patients dropped out of the trials. For MAR, IM - Slope, and IM - Unobs, the dropout rate was relatively constant after time $t = 0.5$. A high dropout rate between $t = 0.33$ and 0.5 was because many patients who would have been dropped out earlier per the probit censoring model were not allowed to by the minimum of 3 observations requirement.

Figure 1. Proportion of Observations Over Time (Data from Linear Growth Curve Model). Curves: MCAR, MAR, IM - Slope, IM - Unobs; x-axis: Time.

Figure 2 plots the mean observed response over time for the same model. As is expected, the mean responses are not biased when data are MCAR.
The means are biased when data are MAR, IM - Slope, and IM - Unobs. The increase in means between $t = 0.33$ and 0.5 was because many patients had very low values of responses or steep individual slopes and would have been dropped out earlier but were not allowed to. After $t = 0.5$, the mean responses remain relatively constant for MAR, IM - Slope, and IM - Unobs. This is because patients were selectively dropped out if their values of responses were low or they had a steep decline in responses.

Figure 2. Mean Observed Response Over Time (Data from Linear Growth Curve Model). Curves: MCAR, MAR, IM - Slope, IM - Unobs; x-axis: Time.

4.5.2 Sample Data Imputed

Figure 3 gives a sample plot of individual patient response profiles over time for a simulated trial ($n = 100$). Only patients in group 1 [$B_1 = (960, -90)$] are plotted. As described by the linear growth curve model (4.93), each patient follows a linear regression line. The individual regression lines follow a normal distribution around the population regression line. In this plot, patients are selectively dropped out using MAR criteria. The blue lines represent the observed responses and the red lines represent missing responses that would have been observed if patients were not dropped out from the trial. The black straight line denotes the population regression line.

Figure 3. Response Over Time for MAR (Data from Linear Growth Curve Model) (Blue = Observed, Red = Unobserved)

Figures 4 - 6 show imputed responses over time after applying the LGC-PPDS, Lavori, and LOCF imputation methods to the missing data in Figure 3. Figure 7 shows the observed responses only for patients who complete the trial.
In each figure, the thick straight line is a reference line indicating population response. For the LGC-PPDS method, the individual profiles, observed and/or imputed, are evenly spread around the population regression line (Figure 4). This suggests that the parameter estimates from the LGC-PPDS imputation will be relatively unbiased. For Lavori's method, after imputation, the individual profiles are spread more to the upper side than to the lower side of the population regression line (Figure 5). This is because Lavori's method replaces the missing value of one patient by the observed value of another patient at the same time point. The range of imputed values is limited to the range of observed values. Bias occurs for this method when most patients with very low values of outcomes drop out. The LOCF method imputes missing values by carrying the last observed values forward (Figure 6). This will inevitably introduce bias in the parameter estimates. For the "Completer" analysis, as most of the individual profiles lie above the population regression line (Figure 7), this method will surely be biased.

Figure 4. Response Over Time for MAR (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via LGC-PPDS)

Figure 5. Response Over Time for MAR (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via Lavori)

Figure 6. Response Over Time for MAR (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via LOCF)
Figure 7. Response Over Time for MAR (Data from Linear Growth Curve Model) (Completers)

Figures 8 - 12 show the individual response profiles over time for patients who are dropped out using IM - Unobs criteria in a simulated trial. These plots confirm the speculations or findings from Figures 3 - 7. It is noted that the imputed values from the LGC-PPDS method are highly variable despite the fact that this method is least biased among all four missing data analyses. The high variability is probably caused by poor model fitting due to the low ratio of the number of observations to the number of missing values.

Figure 8. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Blue = Observed, Red = Unobserved)

Figure 9. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via LGC-PPDS)

Figure 10. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via Lavori)

Figure 11. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Blue = Observed, Red = Imputed via LOCF)

Figure 12. Response Over Time for IM-Unobs (Data from Linear Growth Curve Model) (Completers)

4.5.3 Analyses of Missing Data

The results of simulation studies for all methods are presented in Tables 1 - 4.
All statistics are based on 600 simulations except for Lavori's method, which is based on a smaller number of simulations due to the extensive computational burden. The number of simulations for Lavori's method is indicated in the footnote of each table.

For data missing completely at random (MCAR), all methods, with the exception of LOCF, give approximately unbiased estimates for the population slopes, population endpoints (average responses at the final time point), and differences in population slopes and endpoints (Tables 1.1a, 1.1b, 1.2a and 1.2b). Both PROP-1 and PROP-2 have relatively larger RMSE, indicating that the slope estimates from these two methods are more variable than those of other methods. The type-I error rate is maintained by all methods with the exceptions of the Lavori and PROP-1 methods. PROP-1's type-I error rate is significantly higher and Lavori's type-I error rate is significantly lower than the nominal α-level. Among all methods that maintain the nominal type-I error rate, the LGC-PPDS and ML methods have the smallest RMSE and highest power.

For data missing at random (MAR), only the ML method gives approximately unbiased estimates for the population slopes (Table 2.1a). All other methods give biased estimates for slopes and endpoints (Tables 2.1a and 2.2a). However, the LGC-PPDS method is the least biased among all biased methods and is reasonably close to the correct value. The LGC-PPDS, Wu & Bailey, ML, PROP-1, and PROP-2 methods provide approximately unbiased estimates for the difference in slopes (Table 2.1b). Moreover, the LGC-PPDS method offers approximately unbiased estimates for the difference in endpoints (Table 2.2b). The type-I error rate is maintained by all methods with the exceptions of the Wu & Bailey, Lavori, and PROP-1 methods.
The type-I error rate is slightly lower than the nominal α-level for the Wu & Bailey method. The type-I error rates are significantly higher for both the Lavori and PROP-1 methods. Among all methods that maintain the nominal type-I error rate, the ML and LGC-PPDS methods have the smallest RMSE and highest power. The small RMSE for the LGC-PPDS method indicates that this method is overall less biased and variable than all others with the exception of the ML method.

For data missing informative with respect to patient-specific slopes (IM - Slopes), the LGC-PPDS, Wu & Bailey, PROP-1, and PROP-2 methods give approximately unbiased estimates for the slopes and difference in slopes (Tables 3.1a and 3.1b). Moreover, the LGC-PPDS method is approximately unbiased for the estimates of endpoints and difference in endpoints (Tables 3.2a and 3.2b). Note that the LGC-PPDS estimates for Group 1's slope and endpoint seem to be slightly biased. However, this is probably due to random variation in the simulated data sets because the estimates for Group 2, which ought to be more biased due to a higher dropout rate, are not biased. All other methods are highly biased for the estimation of population slopes, endpoints, and differences in slopes and endpoints. The type-I error rates are maintained by all methods with the exceptions of Wu & Bailey and PROP-1 for the analyses of slopes and LGC-PPDS and LOCF for the analyses of endpoints. The type-I error rate is slightly lower than the nominal α-level for the Wu & Bailey method for the analysis of slopes and the LGC-PPDS method for the analysis of endpoints. The type-I error rate is significantly higher for the PROP-1 method for the analysis of slopes and slightly higher for LOCF for the analysis of endpoints.
The slightly lower type-I error rate in the analysis of endpoints for the LGC-PPDS method may be an artifact because this error rate is well maintained in the analysis of slopes for the same method. Among methods that offer approximately unbiased estimates, the LGC-PPDS method has the lowest RMSE and highest power.

For data that are informative missing with respect to individual unobserved responses (IM - Unobs), all methods are biased for the estimation of individual population slopes and endpoints (Tables 4.1a, 4.1b, 4.2a, and 4.2b). However, the LGC-PPDS and PROP-1 methods are the least biased for these estimates. In addition, the LGC-PPDS, PROP-1, and PROP-2 methods offer approximately unbiased estimates for the difference in slopes (Table 4.1b). Moreover, the LGC-PPDS method provides an approximately unbiased estimate for the difference in endpoints (Table 4.2b). The type-I error rate is maintained by all methods with the exception of PROP-1, which has a significantly higher error rate than the nominal α-level. Among all the approximately unbiased methods, the LGC-PPDS method has the lowest RMSE and highest power.

In summary, among all methods compared for missing data analyses, the LGC-PPDS method is approximately unbiased or least biased for estimates of individual population parameters and their differences. It maintains the type-I error rate and has relatively high power. The ML method is unbiased for MCAR and MAR, and highly biased when data are informative missing with respect to individual slopes and unobserved responses. However, it maintains the type-I error rate and provides the highest power compared to all other methods.
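The summary statistics reported for each method (rejection rate with its 95% confidence interval, and RMSE) can be computed from per-trial results as in the Python sketch below. The normal-approximation (Wald) interval is an assumption; it does reproduce the interval [0.033, 0.067] quoted in Section 4.2.2 for a rate of 0.05 over 600 trials. The function names are hypothetical.

```python
import math


def rejection_rate(rejected):
    """Proportion of simulated trials that rejected H0, with a Wald 95% interval."""
    n = len(rejected)
    rate = sum(1 for r in rejected if r) / n
    half = 1.96 * math.sqrt(rate * (1.0 - rate) / n)
    return rate, (rate - half, rate + half)


def rmse(estimates, truth):
    """Root mean squared error of per-trial estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


# 30 rejections out of 600 trials, i.e. the nominal 5% level.
rate, (lo, hi) = rejection_rate([True] * 30 + [False] * 570)
print(round(rate, 3), round(lo, 3), round(hi, 3))   # 0.05 0.033 0.067
print(rmse([-44.0, -46.0], -45.0))                  # 1.0
```

Under the null hypothesis the rejection rate estimates the type-I error rate; under the alternative it estimates power.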
Table 1.1a Estimates of Slopes When Data are MCAR (Data from Linear Growth Curve Model)

               B12 = -45                         B22 = -90
Method         Estimate  95% CI          RMSE    Estimate  95% CI          RMSE
LGC-PPDS       -45       (-44, -46)      11      -90       (-89, -91)      11
Wu & Bailey    -44       (-43, -45)      14      -89       (-88, -90)      14
Lavori*        -44       (-43, -45)      15      -90       (-89, -91)      14
ML             -45       (-44, -46)      11      -91       (-90, -92)      11
Completer      -45       (-44, -46)      13      -90       (-89, -91)      13
LOCF           -35       (-34, -36)      14      -70       (-69, -71)      22
PROP-1         -47       (-44, -49)      29      -92       (-89, -94)      28
PROP-2         -46       (-44, -48)      21      -90       (-88, -92)      23

* Parameter estimates are based on 500 simulations.

Table 1.1b Analyses of Slopes When Data are MCAR (Data from Linear Growth Curve Model)

               Δ_B = B12 - B22 = 45              Rejection Rate (%) at α = 0.05
Method         Estimate  95% CI          RMSE    Under H0 (95% CI)      Under Ha
LGC-PPDS       46        (44, 47)        15      3.7 (2.2, 5.2)         83.0
Wu & Bailey    45        (44, 47)        20      6.5 (4.5, 8.5)         62.2
Lavori*        46        (45, 47)        20      0.0 (n/a)              23.0
ML             46        (45, 47)        15      5.5 (3.6, 7.4)         83.7
Completer      45        (44, 47)        18      5.3 (3.5, 7.1)         70.7
LOCF           35        (34, 36)        16      3.7 (2.2, 5.2)         78.8
PROP-1         45        (42, 48)        41      13.2 (10.5, 15.9)      34.5
PROP-2         44        (43, 45)        30      5.3 (3.5, 7.1)         34.0

* Parameter estimates and rejection rate under Ha are based on 400 simulations. Rejection rate under H0 is based on 250 simulations.

Table 1.2a Estimates of Endpoints When Data are MCAR (Data from Linear Growth Curve Model)

               μ1 = 825 (B12 = -45)              μ2 = 690 (B22 = -90)
Method         Estimate  95% CI          RMSE    Estimate  95% CI          RMSE
LGC-PPDS       827       (823, 831)      51      689       (685, 693)      52
Lavori*        838       (826, 850)      64      689       (677, 701)      61
Completer      826       (821, 832)      66      688       (682, 693)      68
LOCF           857       (854, 858)      57      751       (747, 755)      79

* Parameter estimates are based on 100 simulations.
Table 1.2b Analyses of Endpoints When Data are MCAR (Data from Linear Growth Curve Model)

               Δ_μ = μ1 - μ2 = 135               Rejection Rate (%) at α = 0.05
Method         Estimate  95% CI          RMSE    Under H0 (95% CI)      Under Ha
LGC-PPDS       138       (132, 144)      73      4.2 (2.6, 5.8)         44.7
Lavori*        149       (133, 165)      83      1.0 (0.2, 3.6)         19.0
Completer      139       (132, 146)      93      6.3 (4.3, 8.3)         33.7
LOCF           106       (101, 112)      74      5.2 (3.4, 7.0)         38.3

* Parameter estimates and rejection rates are based on 100 simulations.

Table 2.1a Estimates of Slopes When Data are MAR (Data from Linear Growth Curve Model)

               B12 = -45                         B22 = -90
Method         Estimate  95% CI          RMSE    Estimate  95% CI          RMSE
LGC-PPDS       -48       (-47, -49)      13      -95       (-93, -96)      15
Wu & Bailey    -117      (-115, -119)    76      -163      (-161, -165)    76
Lavori*        -12       (-10, -14)      38      -47       (-45, -49)      49
ML             -45       (-44, -46)      13      -90       (-89, -91)      14
Completer      -23       (-22, -24)      26      -62       (-61, -63)      32
LOCF           -52       (-51, -53)      11      -80       (-79, -81)      13
PROP-1         2         (-2.7, 6.2)     73      -39       (-34, -44)      81
PROP-2         -115      (-112, -118)    79      -165      (-162, -168)    83

* Parameter estimates are based on 350 simulations.

Table 2.1b Analyses of Slopes When Data are MAR (Data from Linear Growth Curve Model)

               Δ_B = B12 - B22 = 45              Rejection Rate (%) at α = 0.05
Method         Estimate  95% CI          RMSE    Under H0 (95% CI)      Under Ha
LGC-PPDS       46        (45, 47)        19      4.8 (3.1, 6.5)         68.5
Wu & Bailey    45        (43, 48)        30      2.3 (1.1, 3.5)         28.5
Lavori*        33        (30, 36)        32      12.0 (9.0, 15.0)       34.2
ML             45        (44, 46)        18      5.8 (3.9, 7.7)         73.8
Completer      39        (38, 41)        20      5.0 (3.3, 6.7)         53.0
LOCF           28        (27, 29)        21      5.0 (3.2, 6.8)         62.7
PROP-1         41        (34, 47)        85      17.7 (14.6, 20.7)      22.2
PROP-2         50        (45, 54)        51      5.2 (3.4, 6.9)         20.3

* Parameter estimates and rejection rate under Ha are based on 350 simulations. Rejection rate under H0 is based on 450 simulations.
Table 2.2a Estimates of Endpoints When Data are MAR (Data from Linear Growth Curve Model)

               μ1 = 825 (B12 = -45)              μ2 = 690 (B22 = -90)
Method         Estimate  95% CI          RMSE    Estimate  95% CI          RMSE
LGC-PPDS       817       (812, 821)      56      677       (673, 681)      61
Lavori*        928       (922, 934)      121     835       (827, 843)      166
Completer      1136      (1132, 1141)    316     1052      (1047, 1056)    367
LOCF           811       (807, 815)      48      724       (721, 728)      55

* Parameter estimates are based on 100 simulations.

Table 2.2b Analyses of Endpoints When Data are MAR (Data from Linear Growth Curve Model)

               Δ_μ = μ1 - μ2 = 135               Rejection Rate (%) at α = 0.05
Method         Estimate  95% CI          RMSE    Under H0 (95% CI)      Under Ha
LGC-PPDS       140       (133, 146)      82      3.5 (2.0, 5.0)         33.7
Lavori*        93        (73, 112)       108     12.0 (5.4, 18.5)       23.0
Completer      84        (78, 90)        94      5.0 (3.2, 6.8)         19.5
LOCF           86        (81, 91)        80      6.8 (4.8, 8.8)         31.7

* Parameter estimates and rejection rates are based on 100 simulations.

Table 3.1a Estimates of Slopes When Data are IM - Slopes (Data from Linear Growth Curve Model)

               B12 = -45                         B22 = -90
Method         Estimate  95% CI          RMSE    Estimate  95% CI          RMSE
LGC-PPDS       -49       (-48, -50)      14      -90       (-89, -91)      15
Wu & Bailey    -46       (-44, -48)      22      -91       (-89, -93)      26
Lavori*        5         (0, 10)         52      -21       (-16, -26)      71
ML             -11       (-10, -12)      36      -47       (-46, -48)      45
Completer      -19       (-18, -20)      28      -58       (-57, -59)      35
LOCF           -18       (-17, -19)      28      -39       (-35, -43)      52
PROP-1         -40       (-35, -44)      57      -84       (-78, -90)      72
PROP-2         -43       (-40, -45)      37      -90       (-87, -93)      42

* Parameter estimates are based on 400 simulations.
Table 3.1b Analyses of Slopes When Data are IM - Slopes (Data from Linear Growth Curve Model)

               Δ_B = B12 - B22 = 45              Rejection Rate (%) at α = 0.05
Method         Estimate  95% CI          RMSE    Under H0 (95% CI)      Under Ha
LGC-PPDS       41        (39, 42)        20      4.5 (2.8, 6.2)         49.7
Wu & Bailey    45        (42, 47)        33      2.7 (1.4, 4.0)         22.0
Lavori*        26        (23, 29)        28      6.1 (3.8, 8.4)         22.6
ML             36        (34, 37)        19      5.7 (3.8, 7.6)         61.7
Completer      39        (37, 40)        18      6.7 (4.7, 8.7)         59.3
LOCF           22        (21, 23)        26      6.0 (4.1, 7.9)         48.5
PROP-1         44        (37, 52)        94      20.7 (17.4, 23.9)      22.7
PROP-2         47        (43, 51)        55      6.0 (4.1, 7.9)         16.2

* Parameter estimates and rejection rates are based on 400 simulations.

Table 3.2a Estimates of Endpoints When Data are IM - Slopes (Data from Linear Growth Curve Model)

               μ1 = 825 (B12 = -45)              μ2 = 690 (B22 = -90)
Method         Estimate  95% CI          RMSE    Estimate  95% CI          RMSE
LGC-PPDS       811       (806, 815)      56      688       (684, 692)      59
Lavori*        977       (973, 981)      159     887       (881, 891)      204
Completer      1096      (1092, 1100)    275     1029      (1025, 1033)    343
LOCF           904       (900, 907)      89      834       (831, 838)      150

* Parameter estimates are based on 100 simulations.

Table 3.2b Analyses of Endpoints When Data are IM - Slopes (Data from Linear Growth Curve Model)

               Δ_μ = μ1 - μ2 = 135               Rejection Rate (%) at α = 0.05
Method         Estimate  95% CI          RMSE    Under H0 (95% CI)      Under Ha
LGC-PPDS       123       (116, 131)      82      2.5 (1.2, 3.8)         23.8
Lavori*        90        (76, 104)       85      4.0 (0.1, 7.9)         24.0
Completer      67        (61, 72)        97      5.0 (3.2, 6.8)         16.5
LOCF           69        (64, 74)        89      7.5 (5.3, 9.7)         25.0

* Parameter estimates and rejection rates are based on 100 simulations.
Table 4.1a Estimates of Slopes When Data are IM - Unobs (Data from Linear Growth Curve Model)

               B12 = -45                         B22 = -90
Method         Estimate  95% CI          RMSE    Estimate  95% CI          RMSE
LGC-PPDS       -64       (-63, -65)      24      -110      (-109, -111)    26
Wu & Bailey    -19       (-18, -21)      33      -60       (-58, -62)      37
Lavori*        9         (8, 11)         56      -20       (-18, -22)      72
ML             -6        (-5, -7)        41      -45       (-44, -46)      47
Completer      -19       (-18, -20)      29      -57       (-56, -58)      36
LOCF           -12       (-11, -13)      34      -33       (-32, -34)      57
PROP-1         -27       (-23, -32)      61      -71       (-66, -76)      64
PROP-2         -10       (-7, -13)       50      -53       (-50, -56)      53

* Parameter estimates are based on 400 simulations.

Table 4.1b Analyses of Slopes When Data are IM - Unobs (Data from Linear Growth Curve Model)

               Δ_B = B12 - B22 = 45              Rejection Rate (%) at α = 0.05
Method         Estimate  95% CI          RMSE    Under H0 (95% CI)      Under Ha
LGC-PPDS       46        (43, 48)        20      5.7 (3.8, 7.6)         62.8
Wu & Bailey    41        (38, 43)        31      4.5 (2.8, 6.2)         23.2
Lavori*        26        (24, 28)        25      6.3 (3.6, 9.0)         27.3
ML             39        (38, 40)        11      6.0 (4.1, 7.9)         63.3
Completer      38        (37, 40)        21      4.7 (3.0, 6.4)         51.8
LOCF           21        (20, 22)        26      5.8 (3.9, 7.7)         47.7
PROP-1         43        (36, 50)        85      16.7 (13.7, 19.6)      23.0
PROP-2         43        (39, 47)        50      5.7 (3.8, 7.5)         15.8

* Parameter estimates and rejection rates are based on 400 simulations.

Table 4.2a Estimates of Endpoints When Data are IM - Unobs (Data from Linear Growth Curve Model)

               μ1 = 825 (B12 = -45)              μ2 = 690 (B22 = -90)
Method         Estimate  95% CI          RMSE    Estimate  95% CI          RMSE
LGC-PPDS       775       (771, 780)      76      635       (631, 640)      82
Lavori*        990       (980, 1000)     173     895       (882, 908)      216
Completer      1153      (1149, 1157)    332     1077      (1072, 1082)    391
LOCF           923       (920, 926)      106     855       (852, 858)      170

* Parameter estimates are based on 100 simulations.
Table 4.2b Analyses of Endpoints When Data are IM - Unobs (Data from Linear Growth Curve Model)

               Δ_μ = μ1 - μ2 = 135               Rejection Rate (%) at α = 0.05
Method         Estimate  95% CI          RMSE    Under H0 (95% CI)      Under Ha
LGC-PPDS       140       (133, 147)      82      3.8 (3.2, 5.4)         31.8
Lavori*        94        (79, 109)       87      8.0 (2.6, 13.4)        26.0
Completer      76        (70, 82)        97      4.7 (3.0, 6.4)         16.0
LOCF           68        (63, 72)        88      6.7 (4.7, 8.7)         26.5

* Parameter estimates and rejection rates are based on 100 simulations.

CHAPTER 5. DATA THAT DO NOT FOLLOW A LINEAR GROWTH CURVE MODEL

5.1 Application of LGC-PPDS Method to Repeated Measures Data of General Shapes

The results of Chapter 4 indicate that sampling from the Bayesian posterior predictive distribution can be used to impute missing data. However, the entire procedure is based on the assumption that data follow a linear growth curve model (3.78). In practice, the clinical trial data may not follow a linear growth curve. They may follow a non-linear curve, a logistic curve, or a curve of general shapes. In this section, we apply the LGC-PPDS method to repeated measures data of general shapes.

We simulate complete data from a repeated measures model

$$y_{i(k)j} = \gamma_{i(k)} + \tau_{kj} + \varepsilon_{ikj} \qquad (5.96)$$

where $y_{i(k)j}$ is the outcome for subject $i = 1, \ldots, n$ in group $k = 1, 2$ at time point $j = 1, \ldots, J$; $\gamma_{i(k)} \sim N(0, 100^2)$ is the subject effect; $\tau_{k1} = \cdots = \tau_{k10} \sim N(460, 900^2)$, $\tau_{k12} = \tau_{k13} = \tau_{k14} \sim N(180 + \delta_k, 900^2)$, and $\tau_{k11} = (\tau_{k10} + \tau_{k12})/2$ are time effects; $\delta_k = 20$ for $k = 1$ and 0 for $k = 2$ are the treatment group effects; and $\varepsilon_{ikj} \sim N(0, 625^2)$ is model error.

Missing data are generated by dropping out subjects from the complete data. The probability of dropout depends on the time point $t_j$ and the current unobserved outcome $y_{i(k)j}$:

$$\Pr(\text{dropout at } t_j) = \begin{cases} 0.02 & \text{when } 3 < j < 10 \text{ and } y_{i(k)j} < 480 \\ \text{[illegible]} & \text{when } j > 10 \text{ and } y_{i(k)j} < 480 \end{cases}$$

Approximately 50% of subjects are dropped from the trial at various time points.
Figure 13 gives a sample plot of individual patient response profiles over time for a simulated trial (n=100). As in Figure 3, observed responses are in blue and unobserved responses are in red. The population means are also plotted in black for reference.

Figure 13. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Blue = Observed, Red = Unobserved). Horizontal axis: Time.

Tables 5a and 5b give the estimates and hypothesis tests for the endpoints after missing data are first imputed by the LGC-PPDS method. The results from 600 simulated trials indicate that the parameter estimates for the endpoints and for the difference in endpoints are biased. In addition, the type-I error rate is significantly smaller than the nominal α-level, and the power is almost as small as the type-I error rate. These findings suggest that the LGC-PPDS method is not applicable to repeated measures data of general shapes.
Table 5a Estimates of Endpoints With LGC-PPDS Imputation When Data are IM-Unobs (Data from Repeated Measures Model of General Shapes)

| Method | Estimate (μ1 = 200) | 95% CI | RMSE | Estimate (μ2 = 180) | 95% CI | RMSE |
|---|---|---|---|---|---|---|
| LGC-PPDS | 172 | (170, 174) | 30 | 164 | (163, 165) | 20 |

Table 5b Analyses of Endpoints With LGC-PPDS Imputation When Data are IM-Unobs (Data from Repeated Measures Model of General Shapes)

| Method | Estimate (Δμ = μ1 - μ2 = 20) | 95% CI | RMSE | Rejection rate (%) at α=0.05 under H0 (95% CI) | Rejection rate (%) under Ha |
|---|---|---|---|---|---|
| LGC-PPDS | 8 | (7, 9) | 21 | 2.5 (1.3, 3.8) | 4.3 |

5.2 Modification of LGC-PPDS Method to Allow Data of General Shapes

This section modifies the LGC-PPDS method to allow the data to follow a repeated measures model of general shape. For the imputation procedure detailed in Section 3.3, Steps 1, 10, and 11 are modified as follows.

1. Assume outcome $y$ follows a repeated measures model of a general shape

$$y = X\beta + Z\gamma + \varepsilon \qquad (5.97)$$

i.e., for the observed time points of subject $i$,

$$
\begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iv_i} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
1 & 0 & 0 & \cdots & 1 & 0 & \cdots & 0
\end{pmatrix}
\begin{pmatrix} \gamma_i \\ \tau_1 \\ \vdots \\ \tau_{v_i} \\ \tau_{v_i+1} \\ \vdots \\ \tau_M \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \vdots \\ \varepsilon_{iv_i} \end{pmatrix}
$$

where $\beta$ is "fixed," reflecting treatment group and time effects; $\gamma \sim N(0, G)$ is random, reflecting the individual subject effect; and $\varepsilon \sim N(0, R)$ is random model error. $v_i \le M$ is the number of time points observed and $M - v_i$ is the number of time points missing in the repeated measures for subject $i$.

10. For subject $i$ who drops out at time $v_i + 1$, draw random samples $\varepsilon^*_{i,v_i+1}, \varepsilon^*_{i,v_i+2}, \ldots, \varepsilon^*_{i,M}$ from $N(0, \sigma_e^{2*})$.

11. For each draw of $\theta^*$, $\beta^*$, and $\gamma^*$, impute the missing $y$ via equation (5.97), with the drawn values substituted for the parameters and errors.

This new imputation method is denoted as RMG-PPDS (repeated measures model of general shape - posterior predictive distribution). It is applied to the same simulated data as described in Section 5.1. For the simulated trial displayed in Figure 13, Figures 14 - 16 show the imputed responses over time after applying the RMG-PPDS, Lavori, and LOCF methods.
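Steps 10 and 11 amount to adding a drawn subject effect and freshly drawn errors to the drawn time effects for the visits after dropout. The sketch below shows that arithmetic for a single subject and a single posterior draw. It assumes independent errors with a common variance, and the function name `impute_subject` and its argument names are hypothetical; how the draws themselves are obtained (Steps 2 to 9 of Section 3.3) is not shown.

```python
import numpy as np

def impute_subject(y_obs, tau_star, gamma_star, sigma_star, rng):
    """Complete one subject's record from one posterior draw.

    y_obs      : observed outcomes at visits 1..v (length v)
    tau_star   : drawn time effects for all M visits (length M)
    gamma_star : drawn subject effect (scalar)
    sigma_star : drawn error standard deviation (scalar)
    """
    y_obs = np.asarray(y_obs, dtype=float)
    tau_star = np.asarray(tau_star, dtype=float)
    v, M = y_obs.size, tau_star.size
    # Step 10: one error draw for each missing visit v+1..M.
    eps_star = rng.normal(0.0, sigma_star, size=M - v)
    # Step 11: plug the draws into the model for the missing visits.
    y_mis = tau_star[v:] + gamma_star + eps_star
    return np.concatenate([y_obs, y_mis])
```

Repeating the call with a new set of draws for each imputation gives the multiple completed data sets that are analyzed and then combined.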
Figure 17 shows the response profiles only for patients who complete the trial. More red lines lie below the black line in Figure 13, which indicates that more dropouts occur among patients with low outcome values than among patients with high outcome values. Imputation by the RMG-PPDS and Lavori methods seems to alleviate the problem by replacing missing values with values that are comparable to the unobserved true values (Figures 14 and 15). The LOCF and Completers methods are apparently biased because they are based on the observed data only (Figures 16 and 17).

The results from 600 simulated trials are presented in Tables 6a and 6b. For the purpose of comparison, the Lavori, LOCF, and Completers methods are also applied to the same simulated data. Because of its extensive computing time requirement, only 100 trials are simulated for Lavori's method. Both the RMG-PPDS and Lavori methods perform well. They produce approximately unbiased estimates for the endpoints and for the difference in endpoints. They maintain the type-I error rates at the nominal α-level, with the RMG-PPDS method having higher power and smaller RMSE than Lavori's method. Both the LOCF and Completers methods give biased estimates for the endpoints, as expected.

Figure 14. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Blue = Observed, Red = Imputed via RMG-PPDS). Horizontal axis: Time.

Figure 15. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Blue = Observed, Red = Imputed via Lavori). Horizontal axis: Time.

Figure 16.
Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Blue = Observed, Red = Imputed via LOCF). Horizontal axis: Time.

Figure 17. Response Over Time for IM-Unobs (Data from Repeated Measures Model of General Shapes) (Completers). Horizontal axis: Time.

Table 6a Estimates of Endpoints When Data are IM-Unobs (Data from General Repeated Measures Model)

| Method | Estimate (μ1 = 200) | 95% CI | RMSE | Estimate (μ2 = 180) | 95% CI | RMSE |
|---|---|---|---|---|---|---|
| RMG-PPDS | 199 | (198, 200) | 12 | 179 | (178, 180) | 12 |
| Lavori* | 203 | (200, 206) | 14 | 183 | (180, 186) | 14 |
| Completer | 209 | (208, 210) | 17 | 187 | (186, 188) | 17 |
| LOCF | 267 | (266, 268) | 69 | 253 | (252, 254) | 74 |

* Parameter estimates are based on 100 simulations.

Table 6b Analyses of Endpoints When Data are IM-Unobs (Data from General Repeated Measures Model)

| Method | Estimate (Δμ = μ1 - μ2 = 20) | 95% CI | RMSE | Rejection rate (%) at α=0.05 under H0 (95% CI) | Rejection rate (%) under Ha |
|---|---|---|---|---|---|
| RMG-PPDS | 21 | (20, 22) | 17 | 6.8 (4.8, 8.8) | 28.7 |
| Lavori* | 20 | (16, 24) | 19 | 2.0 (0.0, 4.7) | 15.0 |
| Completer | 22 | (20, 23) | 22 | 5.2 (3.4, 7.0) | 16.5 |
| LOCF | 14 | (13, 16) | 21 | 3.7 (2.2, 5.2) | 11.2 |

* Parameter estimates and rejection rates are based on 100 simulations.

5.3 Application of RMG-PPDS Method to Data from Growth Curve Model

In this section, we first apply the RMG-PPDS imputation method (Section 5.2) to missing data that are simulated from the same growth curve model as described in Sections 4.2, 4.5.1, and 4.5.2.
For data that are missing informatively with respect to patient-specific slopes and unobserved outcome values, the patient censoring time is added to the repeated measures model (5.96) as a covariate. The rationale for including this covariate is that it provides additional information for predicting the missing outcome, because it is highly correlated with the individual slopes and with the outcome variable. After the missing data are imputed, we estimate and compare the population slopes using the linear growth curve model (3.78).

The results are presented in Tables 7a and 7b. Except for MCAR and missing informative with respect to slopes (IM-Slopes), the RMG-PPDS imputation method is biased for the estimates of population slopes. For all four missing mechanisms, this method inflates the type-I error rate.

Table 7a Estimates of Slopes With RMG-PPDS Imputation (Data from Linear Growth Curve Model)

| Missing Mechanism | Estimate (β12 = -45) | 95% CI | RMSE | Estimate (β22 = -90) | 95% CI | RMSE |
|---|---|---|---|---|---|---|
| MCAR | -45 | (-45, -45) | 12 | -90 | (-90, -90) | 12 |
| MAR | -31 | (-30, -32) | 19 | -73 | (-72, -74) | 22 |
| IM-Slopes | -48 | (-47, -49) | 13 | -90 | (-89, -91) | 15 |
| IM-Unobs | -65 | (-64, -67) | 26 | -115 | (-114, -117) | 31 |

Table 7b Analyses of Slopes With RMG-PPDS Imputation (Data from Linear Growth Curve Model)

| Missing Mechanism | Estimate (Δβ = β12 - β22 = 45) | 95% CI | RMSE | Rejection rate (%) at α=0.05 under H0 (95% CI) | Rejection rate (%) under Ha |
|---|---|---|---|---|---|
| MCAR | 45 | (45, 45) | 16 | 15.0 (12.1, 17.9) | 92.5 |
| MAR | 42 | (41, 43) | 18 | 24.8 (21.3, 28.3) | 89.2 |
| IM-Slopes | 42 | (41, 44) | 20 | 20.8 (17.6, 24.0) | 80.8 |
| IM-Unobs | 50 | (48, 52) | 23 | 24.5 (21.1, 27.9) | 84.5 |

CHAPTER 6.
DISCUSSION AND CONCLUSIONS

It is desirable to find a statistical method that can apply to as many different types of missing mechanisms as possible. Table 8 summarizes the simulation results in Chapter 4 with respect to bias, type-I error rate, and power for each method of missing data analysis (LGC-PPDS, Wu and Bailey, Lavori, ML, Completers, LOCF, PROP-1, and PROP-2).

Among all methods compared, the LGC-PPDS method performs well under most circumstances. It is approximately unbiased most of the time for the estimates of population slopes, endpoints, and differences in slopes and endpoints when data are missing completely at random, missing at random, or informative missing with respect to subjects' slopes. It is the least biased among all methods for the estimates of individual population slopes and endpoints, and is approximately unbiased for the estimates of the differences in slopes and endpoints, when data are informative missing with respect to unobserved outcomes. It maintains the type-I error rate at the nominal α-level most of the time and has comparable or high power for hypothesis tests. It is applicable to estimation and hypothesis testing for both the population slopes and the endpoints when data follow a linear growth curve model.

Compared to the nominal α-level for the hypothesis test, a significantly smaller type-I error rate is observed when analyzing endpoints for informative missing data with respect to subjects' slopes (Table 3.2b). This is probably due to the multiplicity of hypothesis tests, since the overall type-I error rate is not controlled and this small type-I error rate is not seen in the analysis of slopes for the same missing data (Table 3.1b). Although not presented, there does not seem to be a
noticeable difference in the simulation results between the choices of Jeffreys' prior and uniform prior distributions for the variance/covariance parameters θ. This is expected because both priors are noninformative in the sense that they have little or no effect on the posterior distribution and let the data "speak for themselves" (Wolfinger and Rosner, 1996).

A potential disadvantage of this model-based method is that a few imputed values may not be biologically plausible, although they are consistent with the mathematical model. This may seem awkward when making inference for an individual patient based on his or her imputed values. Nonetheless, this disadvantage does not seem to affect the estimates of population parameters or the comparisons, as these are based on the overall "profiles" of each treatment group.

Wu & Bailey's method performs adequately when data are missing completely at random or informative missing with respect to subjects' slopes. It is biased when data are missing at random or informative with respect to unobserved outcomes, because this method is not designed for these missing mechanisms.

Lavori's method performs poorly under most of the missing data mechanisms simulated. This method is biased for missing at random and for informative missing with respect to subjects' slopes and unobserved outcomes. A likely reason for the bias is that this method can only impute missing values within the range of observed values, because it uses one patient's observed value to replace another patient's missing value. When most of the extreme values are missing, this method is not able to duplicate or create the missing extreme values. Instead, it uses the observed less-extreme values to replace the missing extreme values, as illustrated in Figure 5. The bias may also be attributed to the grouping in quintiles of propensity scores. The grouping in quintiles is suggested by
Cochran (1968) to reduce bias (up to 90%) in the parameter estimate adjusted for covariates. However, it cannot eliminate 100% of the bias in a longitudinal data situation. Compared to the nominal α-level, the significantly smaller and greater type-I error rates found in the analyses of MCAR and MAR data (Tables 1 and 2, respectively) might be related to the insufficient number of imputations for the same missing data (five used for the Lavori method versus ten for the other methods). Because of the extensive computing resources required by this method, only 5 imputations were used to calculate the between-imputation variability and consequently the standard errors for the parameter estimates. This may adversely affect the type-I error rates for the hypothesis tests. In addition, the smaller number of simulations for Lavori's method makes the hypothesis testing estimates (i.e., type-I error rate and power) less precise than for the other methods.

Note that there are both advantages and disadvantages to this propensity score approach. One advantage is that it is unlikely to impute biologically implausible values for patients who drop out of the trial. Therefore, every imputed value is interpretable. Another advantage is that it can easily adapt to response variables of different types (e.g., frequency counts, categorical variables) because it is model-independent. The disadvantage is that the individual imputed values and the overall estimates of population parameters can be biased if a substantial number of extreme responses are missing, as discussed earlier.

Both the LGC-PPDS and Lavori methods use samples from the posterior predictive distribution for the outcome. However, the LGC-PPDS method imputes all missing values within a subject at once by sampling from their (longitudinally) multivariate parametric distribution.
Lavori's method imputes missing values one after another (longitudinally) by sampling from its univariate (cross-sectional) nonparametric distribution. Both methods are based on the approximate Bayesian bootstrap (ABB) (Rubin and Schenker, 1986), with parametric and non-parametric realizations respectively.

The ML method performs well for the parameter estimates when data are missing completely at random or missing at random. This is expected from likelihood theory (Little and Rubin, 1987). This method is highly biased when data are missing informatively with respect to patient-specific slopes and unobserved responses. However, this method maintains the type-I error rate and has the highest power for all types of missing mechanisms.

The Completer and LOCF methods give biased results, as expected.

PROP-1 is the method presented in the early Dissertation Proposal. This method is biased, and it does not maintain the type-I error rates under most circumstances. PROP-2 is an improved version of PROP-1. It is approximately unbiased when data are MCAR or missing informatively with respect to individual slopes. It maintains the type-I error rate. However, this method has relatively low statistical power in hypothesis testing for the treatment group difference.

In practice, the missing mechanism is usually unknown. Although some methods have been proposed to differentiate MCAR from MAR (Ridout, 1991), and MCAR from missing informative with respect to slopes (Wu and Bailey, 1989), they are all based on assumptions that limit their application. It is important to find a statistical method that can apply to as many different types of missing mechanisms as possible. This research demonstrates that, among the methods investigated, the LGC-PPDS method can serve this
purpose. For the types of missing data investigated, the LGC-PPDS method is either approximately unbiased or the least biased for estimates of individual population parameters and their differences. It maintains the type-I error rate and has the second-highest power among all statistical methods compared. The ML method, on the other hand, can serve this purpose only if the statistical comparison is of primary interest. Although it is highly biased when data are informative missing, this method maintains the type-I error rates and provides the highest power compared to all other methods.

The RMG-PPDS method is a modified version of the LGC-PPDS method. Instead of applying to data from a linear growth curve model, this method is applicable to data from a repeated measures model of general shapes. The simulation study indicates that both the RMG-PPDS and Lavori methods give approximately unbiased estimates for the endpoints and for the difference in endpoints. Both methods maintain the type-I error rates at approximately the nominal α-level, with RMG-PPDS having higher statistical power for the hypothesis test.

However, when the LGC-PPDS imputation method is cross-applied to data from a repeated measures model of general shapes, or the RMG-PPDS imputation method is cross-applied to data from a linear growth curve model, the two methods perform very poorly. This suggests that the LGC-PPDS and RMG-PPDS methods are good only for the underlying data models for which they are designed. The biased estimates obtained by the LGC-PPDS method for repeated measures data of general shapes may be explained by model misspecification. The inflated type-I error rates of the RMG-PPDS method for data from a linear growth curve model may be explained by misspecification of the within-patient covariance structure, which results in underestimated standard errors for the parameter estimates.
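The approximate Bayesian bootstrap referred to in this chapter can be illustrated for a single group of comparable patients (for example, one propensity-score stratum at one visit). The sketch below follows the two-stage resampling of Rubin and Schenker (1986); the function name `abb_impute` is hypothetical, and the stratification and visit-by-visit sequencing of Lavori's method are deliberately left out.

```python
import numpy as np

def abb_impute(observed, n_missing, rng):
    """Approximate Bayesian bootstrap for one cell of comparable patients.

    observed  : outcomes of the patients who were observed in this cell
    n_missing : number of missing outcomes to fill in
    """
    observed = np.asarray(observed, dtype=float)
    # Stage 1: resample the observed values with replacement (same size).
    pool = rng.choice(observed, size=observed.size, replace=True)
    # Stage 2: draw the imputed values with replacement from that pool.
    return rng.choice(pool, size=n_missing, replace=True)
```

Every imputed value is a copy of some observed value, so no draw can fall outside the observed range. This is the property the discussion above identifies as a likely source of bias when the extreme responses are the ones that go missing.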
Table 8. Summary of Simulation Results (Data from Linear Growth Curve Model)

| Criteria | LGC-PPDS | Wu & Bailey | Lavori | ML | Completer | LOCF | PROP-1 | PROP-2 |
|---|---|---|---|---|---|---|---|---|
| Estimate @ MCAR | Unbiased | Unbiased | Unbiased | Unbiased | Unbiased | Biased | Unbiased | Unbiased |
| Estimate @ MAR | Slightly Biased | Biased | Biased | Unbiased | Biased | Biased | Biased | Biased |
| Estimate @ IM-Slope | Unbiased | Unbiased | Biased | Biased | Biased | Biased | Biased | Unbiased |
| Estimate @ IM-Unobs | Slightly Biased / Unbiased | Biased | Biased | Biased | Biased | Biased | Biased | Biased |
| Maintain Type-I Error Rate When Unbiased | Yes | Yes | Inflated / Deflated | Yes | Yes | Yes | Inflated | Yes |
| Power When Error Maintained | High | Medium / Low | Low | High | Medium | Medium | N/A | Low |
| Applicable to Analysis @ Endpoint | Yes | No | Yes | Becomes Completer* | Yes | Yes | Yes | Yes |

* Because only patients who have endpoint measurements, i.e., complete the study, are included in the analysis.
N/A = Not applicable.

CHAPTER 7. DIRECTION OF FUTURE RESEARCH

In Chapters 4-6, six hundred (600) simulations are conducted for all methods with the exception of Lavori's method. Because of the limitations of the current computing environment and the extensive computational requirements of Lavori's method, 400 or fewer simulations were performed to characterize this method. Moreover, only 5 imputations were conducted for each of these simulated trials. Although 3 imputations work well in general situations (Rubin and Schenker, 1991; Schafer, 1997), the extent of dropout and the specific missing mechanisms in this research may warrant more than 5 imputations for Lavori's method. This may explain the overly small type-I error rate for MCAR and for missing informative with respect to unobserved outcomes, and the large type-I error rate for MAR, for this method.
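To see why the number of imputations matters for the standard errors, it helps to write out how the completed-data analyses are combined. The sketch below assumes the standard multiple-imputation combining rules (Rubin's rules), in which the between-imputation variance B is inflated by a factor of (1 + 1/m); the function name `pool_rubin` is hypothetical, and the thesis text does not spell out its own combining formula in this section.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their variances.

    Returns the pooled estimate, the within-imputation variance W,
    the between-imputation variance B, and the total variance T.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    w = u.mean()                     # within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b      # total variance
    return q_bar, w, b, t
```

With only a handful of imputations, B is itself estimated from very few values, so the total variance, and with it the rejection rate of the test, can be unstable from one simulated trial to the next.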
To further characterize and compare the properties of the different imputation methods, a larger number of simulations is necessary for all methods. In addition, simulations with varying missing proportions, sample sizes, and covariance structures among the repeated measures will also help this characterization. These simulations would help confirm the findings of the current study and show whether variations in any of these parameters would modify the performance of the methods.

A simulation study was conducted to examine the performance of the Bayesian method for various dropout proportions. The preliminary results from data with 10-30% dropout demonstrate that the Bayesian imputation method remains approximately unbiased for the estimates of population slopes, endpoints, and differences in slopes and endpoints for data that are MCAR, MAR, or missing informatively with respect to slopes. Although this method is biased for the estimates of slopes and endpoints when data are informative missing with respect to unobserved responses, it is approximately unbiased for the estimates of the differences in slopes and endpoints between treatment groups. The type-I error rate is maintained at the nominal α-level, and the power increases as the proportion of dropout decreases. These results are consistent with those presented in Chapter 4, where the dropout rate is approximately 50%.

The results of the simulation studies in Chapters 4 and 5 indicate that the LGC-PPDS method is applicable to data that follow a linear growth curve model and the RMG-PPDS method is applicable to data that follow a repeated measures model of general shapes. However, poor results are observed when these two methods are cross-applied. Further research is warranted to develop an imputation method that is relatively robust to the underlying model for the data of interest.
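The gain from running more simulated trials can be quantified with a simple binomial calculation. The sketch below uses a normal-approximation (Wald) interval for an estimated rejection rate; this interval type is an assumption made for illustration, and the function name `rejection_rate_ci` is hypothetical.

```python
import math

def rejection_rate_ci(rejections, n_sims, z=1.96):
    """Estimated rejection rate with a normal-approximation 95% interval."""
    p = rejections / n_sims
    half_width = z * math.sqrt(p * (1.0 - p) / n_sims)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# A 5% rejection rate estimated from 600 trials versus from 100 trials.
print(rejection_rate_ci(30, 600))
print(rejection_rate_ci(5, 100))
```

For the same underlying rate, the interval from 100 trials is wider than the one from 600 trials by a factor of the square root of 6, roughly 2.4, which is why the type-I error rates reported for Lavori's method are less precisely determined than those for the other methods.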
References

Akaike, H. (1973). "Information theory and extension of the maximum likelihood principle," in Proceedings of the Second International Symposium on Information Theory, Petrov, B. N. and Csaki, F. (Eds.), Akademiai Kiado, 1973, 267-281.

Brand, J., Van Buuren, S., Van Mulligen, E. M., Timmers, T., and Gelsema, E. (1994). "Multiple Imputation as a Missing Data Machine," in The 18th Annual Symposium on Computer Applications in Medical Care, Ozbolt, J. G. (Ed.).

Cochran, W. G. (1968). "The effectiveness of adjustment by subclassification in removing bias in observational studies," Biometrics 24, 295-313.

Crawford, S. L., Tennstedt, S. L., and McKinlay, J. B. (1995). "A comparison of analytic methods for non-random missingness of outcome data," Journal of Clinical Epidemiology 48, 209-219.

DeGruttola, V. and Tu, X. M. (1995). "Modeling progression of CD4-lymphocyte count and its relationship to survival time," Biometrics 50, 1003-1014.

Diggle, P. J. (1989). "Testing for random dropouts in repeated measurement data," Biometrics 45, 1255-1258.

Diggle, P. and Kenward, M. G. (1994). "Informative drop-out in longitudinal data analysis," Applied Statistics 43, 49-93.

Follmann, D. and Wu, M. (1995). "An approximate generalized linear model with random effects for informative missing data," Biometrics 51, 151-168.

Gong, G. and Samaniego, F. J. (1981). "Pseudo maximum likelihood estimation: Theory and application," Annals of Statistics 9, 861-869.

Gould, A. L. (1980). "A new approach to the analysis of clinical drug trials with withdrawals," Biometrics 36, 721-727.

Greenlees, J., Reece, W., and Zieschang, K. D. (1982). "Imputation of missing values when the probability of response depends on the variable being imputed," Journal of the American Statistical Association 77, 251-264.

Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993). "Multiple imputation in mixture models for nonignorable nonresponse with follow-ups," Journal of the American Statistical Association 88, 984-993.

Heyting, A., Tolboom, J. T. B. M., and Essers, J. G. A. (1992). "Statistical handling of drop-outs in longitudinal clinical trials," Statistics in Medicine 11, 2043-2061.

Hogan, J. W. and Laird, N. M. (1996). "Intention-to-treat analyses for incomplete repeated measures data," Biometrics 52, 1002-1017.

Laird, N. M. and Ware, J. H. (1982). "Random-effects models for longitudinal data," Biometrics 38, 963-974.

Lavori, P., Dawson, R., and Shera, D. (1995). "A multiple imputation strategy for clinical trials with truncation of patient data," Statistics in Medicine 14, 1913-1925.

Little, R. J. A. (1988). "Missing data in large surveys," Journal of Business and Economic Statistics 6, 287-301.

Little, R. J. A. (1995). "Modeling the drop-out mechanism in repeated-measures studies," Journal of the American Statistical Association 90, 1112-1121.

Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. John Wiley & Sons, New York.

Little, R. and Yau, L. (1996). "Intent-to-treat analysis of longitudinal studies with drop-outs," Biometrics 52, 1324-1333.

Mori, M., Woodworth, G. G., and Woolson, R. F. (1992). "Application of empirical Bayes inference to estimation of rate of change in the presence of informative right censoring," Statistics in Medicine 11, 621-631.

Ridout, M. S. (1991). "Testing for random dropouts in repeated measurement data," Biometrics 47, 1617-1621.

Rosenbaum, P. R. and Rubin, D. B. (1983). "The central role of the propensity score in observational studies for causal effects," Biometrika 70, 41-55.

Rosenbaum, P. R. and Rubin, D. B. (1984). "Reducing bias in observational studies using subclassification in the propensity score," Journal of the American Statistical Association 79, 516-524.

Rubin, D. B. and Schenker, N. (1986). "Multiple imputation for interval estimation from simple random samples with ignorable nonresponse," Journal of the American Statistical Association 81, 366-374.

Rubin, D. B. and Schenker, N. (1991). "Multiple imputation in health-care data bases: An overview and some applications," Statistics in Medicine 10, 585-598.

SAS Institute Inc. (1997). SAS/STAT Software: Changes and Enhancements through Release 6.12. Cary, NC.

Schluchter, M. (1992). "Methods for the analysis of informatively censored longitudinal data," Statistics in Medicine 11, 1861-1870.

Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance Components. John Wiley & Sons, New York.

Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall, London.

Smith, F. B. and Helms, R. W. (1995). "EM mixed model analysis of data from informatively censored normal distributions," Biometrics 51, 425-436.

Solas Imputation 2.0 (1999). Statistical Solutions, Ltd., Cork, Ireland.

Tsiatis, A., DeGruttola, V., and Wulfsohn, M. (1995). "Modeling the relationship of survival to longitudinal data measured with error: Applications to survival and CD4-lymphocyte counts in patients with AIDS," Journal of the American Statistical Association 90, 27-37.

Vonesh, E. F. and Carter, R. L. (1987). "Efficient inference for random coefficient growth curve models with unbalanced data," Biometrics 43, 617-628.

Wang-Clow, F., Lange, M., Laird, N. M., and Ware, J. H. (1995). "A simulation study of estimators for rates of change in longitudinal studies with attrition," Statistics in Medicine 14, 283-297.

Wolfinger, R. D. and Kass, R. E. (1999). "Bayesian analysis of variance component models via rejection sampling," to be published.

Wolfinger, R. D. and Rosner, G. L. (1996). "Bayesian and frequentist analyses of an in vivo experiment in tumor hemodynamics," in Bayesian Biostatistics, Berry, D. A. and Stangl, D. K. (Eds.), Marcel Dekker, New York.

Wu, M. C. and Bailey, K. R. (1989). "Estimation and comparison of changes in the presence of informative right censoring: Conditional linear model," Biometrics 45, 939-955.

Wu, M. C. and Carroll, R. (1988). "Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process," Biometrics 44, 175-188.

Wu, M. C. and Follmann, D. (1995). "An approximate generalized linear model with random effects for informative missing data," Biometrics 51, 151-168.

Wu, M. C., Hunsberger, S., and Zucker, D. (1994). "Testing for differences in changes in the presence of censoring: parametric and non-parametric methods," Statistics in Medicine 13, 635-646.
Asset Metadata
Creator
Wang, Betty Lu-Ti (author)
Core Title
Imputation methods for missing data in growth curve models
School
Graduate School
Degree
Doctor of Philosophy
Degree Program
Biometry
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
biology, biostatistics,health sciences, public health,OAI-PMH Harvest
Language
English
Contributor
Digitized by ProQuest
(provenance)
Advisor
Forsythe, A. (
committee chair
), Sather, Harland (
committee chair
), Azen, Stanley (
committee member
), Mack, Wendy (
committee member
), Tavare, Simon (
committee member
)
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c16-89007
Unique identifier
UC11339217
Identifier
3018141.pdf (filename),usctheses-c16-89007 (legacy record id)
Legacy Identifier
3018141.pdf
Dmrecord
89007
Document Type
Dissertation
Rights
Wang, Betty Lu-Ti
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA