Using Classification and Regression Trees (CART) and Random Forests to Address
Missing Data
Timothy Hayes
A dissertation presented to The Department of Psychology in partial fulfillment of the requirements for the degree of
Doctor of Philosophy University of Southern California Los Angeles, California May 2017
Table of Contents
Acknowledgements
Abstract
Chapter 1: Introduction
Chapter 1 References
Chapter 1 Figures
Chapter 2: Using Classification and Regression Trees (CART) and Random Forests to Analyze Attrition: Results From Two Simulations
Chapter 2 References
Chapter 3: Should We Impute or Should We Weight? Examining the Performance of Two CART-Based Techniques for Addressing Missing Data in Small Sample Research with Nonnormal Variables
Chapter 3 References
Chapter 3 Tables
Chapter 3 Figures
Chapter 4: Investigating the Performance of CART- and Random Forest-Based Procedures for Dealing with Longitudinal Dropout in Small Sample Designs Under MNAR Missing Data: An Initial Study
Chapter 4 References
Chapter 4 Tables
Chapter 4 Figures
Chapter 5: General Discussion
Chapter 5 References
Technical Appendix A: A Tale of Three Missing Data Mechanisms
Technical Appendix B: Step It On Up: An Overview of Piecewise Functions and Step Functions
Technical Appendix C: Two Notes About Decision Tree Analysis, and a Summary
Technical Appendix D: Everything That You Never Wanted to Know About Inverse Probability Weighting
Technical Appendix E: Latent Growth Curve Models for Fun and Profit
Acknowledgements
I owe a great deal to the many supportive mentors and collaborators who have
contributed to my education, informed this work, and helped me become the researcher I am
today. First and foremost, I owe an enormous debt of sincere gratitude to my two co-advisors,
Jack McArdle and Wendy Wood. To Jack, for fostering my interests and abilities in quantitative
psychology, for being a patient sounding board for simulation ideas, for keeping me focused on
the most important (and, often, the most applied) questions, and for issuing me a passport into a
community of quantitative researchers, collaborators, and friends who have enriched both my
professional and personal life. To Wendy, for your wisdom, guidance, writing advice, and
patient, steadfast support through the ups and downs of graduate school, for encouraging me to
build a research profile that was (and is) uniquely my own, for better or for worse (let’s not think
too hard about which of those two options is more likely to be true … ), and for affording me so
many opportunities to work with exciting collaborators and present my ongoing research to
colleagues and friends around the globe.
I also owe an enormous and heartfelt thanks to my friend and collaborator Satoshi Usami,
for helping me learn the ropes of large scale statistical simulation research back in 2013 when I
first joined the quantitative lab, for patiently explaining material and answering even my most
basic (read: dumbest) questions, and for including me in so many exciting (and ongoing)
theoretical quantitative research projects. I’ve often joked that I attended the “Satoshi School of
Statistical Simulation Research,” and, along those lines, the work presented in this dissertation is
a tribute and testament to the skills I learned from our collaboration.
I have also been fortunate to know many other brilliant colleagues in (and around)
quantitative psychology at USC, and my statistical knowledge has leapt forward light years faster
than it otherwise would have thanks to in-depth discussions in classes and labs, at coffee shops,
and, in many (but not all) cases, over beers and bourbons with the following individuals: Richard
John, Rand Wilcox, Ross Jacobucci, Sarfaraz Serang, Nick Jackson, Devika Dhamija, Eli
Tsukayama, Kelly Peters, Randy Bautista, Addie Timmons, Kevin Petway, Erin Shelton, Xiao
He, John Prindle, Prapti Gautam, and Skye Parral.
Additionally, I wish to thank Craig Enders for his ongoing collaboration, advice, and
guidance on all things missing data-, simulation-, and job market-related. Although we did not
collaborate on the material presented in this dissertation, all of my missing data research has
been informed in countless ways, both directly and indirectly, by the many lessons that I’ve
learned from conducting our recent simulation work evaluating joint-model and chained-
equation multilevel multiple imputation methods. Also, shout out to Zinque coffee shop on
Abbott Kinney for housing literally all of our simulation research meetings.
I also owe a big thanks to Kevin Grimm for his helpful comments on an early draft of
Chapter 4 of this dissertation, and to the editors and peer reviewers at Psychology & Aging and
Computational Statistics & Data Analysis for substantially improving the material and
presentation of the papers featured in Chapters 2 and 3, respectively, with their helpful and
insightful comments and suggestions (I cannot emphasize enough how sincerely I actually mean
this).
These remarks wouldn’t be complete without acknowledging my friend and colleague in
social psychology, Jen Labrecque, for helping me navigate Los Angeles, graduate school, and
USC in general when I first arrived; for providing so much support and encouragement
throughout the years; for sitting and working with me in solidarity while so many of these papers
were written at various hipster cafes (or “hipster cages,” as my iPhone’s autocorrect likes to say)
in Palms and Culver City; and for generally being an amazing friend.
Huge thanks and acknowledgements are also owed to Burcin, David, Saba, Arsalan,
Evan, Joao, Leandro, Jun, Milind, Toni, Forest, and everyone else from the SEP grant who
provided such fun and stimulating interdisciplinary discussions and collaborations over the years.
The SEP research assistantship helped support me during a critical period in my PhD studies, and
I’m fairly confident that our cross-cutting, interdisciplinary research was a key selling point on
the job market that helped me land a position in a highly collaborative department where I think
I’ll be extremely happy (and valued).
Finally, last but not least, my sincere and heartfelt thanks to my committee: Richard,
Daphna, Burcin, Morteza, and Wendy, for improving this work with your suggestions at the
original proposal defense and, of course, for agreeing to subject yourselves to the unspeakable
act of actually reading this document. You are good and selfless people.
Anyway, I’m about 99.99999% certain that nobody actually reads the acknowledgments
sections of dissertations and about 100% certain that nobody will read all the way to the end of
this one. Which means that I could say whatever I want to at this point. I could go on a political
rant, or lay out the details of some crazy conspiracy theory. I could probably even swear.
With that in mind: thank you to the current political climate for providing daily reminders
that there are worse things in the world than this dissertation. And thank you to the secret, cloak-
and-dagger society of masons who control all of the world’s universities from a vault in the
earth’s core. Your monthly offering of grant money, in the form of solid gold bars, is being
delivered to you as we speak, by a virginal undergraduate research assistant in a mine cart.
Alright then. Enough of this ridiculous, self-indulgent crap. Onto the damn dissertation. :)
Abstract
Data mining algorithms such as Classification and Regression Trees (CART) and
Random Forests provide a promising means of exploring potentially complex nonlinear and
interactive relationships between auxiliary covariates and missing data. Recently, two CART-
based missing data methods have been proposed. The first uses CART to create predicted
probabilities of response and form data weights. The second uses CART to multiply impute the
data. The three major papers comprising the body of this dissertation present four simulations
designed to evaluate and compare these methods. In an initial set of two simulations (Chapter 2),
I compare CART-based weighting methods to logistic regression weights and standard multiple
imputation methods when a set of auxiliary variables are related to missingness via a variety of
step-functional (tree-based) forms under MAR missing data. In a follow-up simulation (Chapter
3), I compare CART-based weighting and CART-based multiple imputation methods in both
small and large sample sizes when the functional form of the MAR mechanism is smooth (i.e.,
linear, quadratic, cubic, interactive) and the data are nonnormal. The final simulation study
(Chapter 4) compares the performance of these weighting and imputation methods in small
sample longitudinal trial designs under MNAR missing data. Results suggest that CART-based
weights help reduce parameter bias and increase coverage in small samples, but become
inefficient in large samples. CART-based multiple imputation methods exhibited the reverse
pattern, however, performing poorly in small samples and well in large ones in terms of both bias
and efficiency. These results suggest that CART-based methods may have utility in addressing
missing data, although future research is needed to ascertain the best way to extend CART-based
weights to more complex instances of multivariate missing data.
Chapter 1: Introduction
The three major papers contained in the body of this dissertation represent the current
state of a research program that dates back to late 2014. A year prior, McArdle (2013) proposed
a novel method for dealing with missing data due to longitudinal attrition: using Classification
and Regression Trees (CART, Breiman, Friedman, Olshen, & Stone, 1984), an exploratory data
mining algorithm, to model participant dropout and obtain a set of predicted probabilities that
could be inverted and used as data weights in subsequent analyses, serving to count more heavily
the observed (non-missing, complete data) responses of participants from under-represented
groups in the dataset. Around the same time, two additional papers were published that proposed
using CART-based methods in a different way. Rather than predicting the probability of dropout
and forming data weights, these new methods harnessed the predictive powers of CART to
estimate missing values on each outcome variable and multiply impute the data (Doove, van
Buuren, & Dusseldorp, 2014; Shah, Bartlett, Carpenter, Nicholas, & Hemingway, 2014).
Although these methods employ somewhat different strategies for reducing parameter estimate bias and restoring efficiency and coverage under missing data, their underlying logic is the same. At
the root of this logic is the observation that whenever there are missing data, there are two
analytical models implicitly in play, a substantive model and a missing data model (a distinction
that dates back to the classic work by Rubin, 1976). In practice, this latter model may be actively
acknowledged or, perhaps more often, passively assumed. For example, using complete case
methods such as listwise deletion covertly assumes a completely random missing data
mechanism. Similarly, setting Full Information Maximum Likelihood (FIML, Arbuckle, 1996) as
the default estimator in a structural equation modeling (SEM) analysis tacitly implies that the
variables in one’s substantive model are the only ones that matter in correcting parameter
estimates for the biasing effects of missing data.[1] In each of these cases there is a missing data
model lurking in the shadows, and this is true whenever missing data are found.
In contrast to these passive approaches to addressing missing data, experts agree that it is
preferable to actively search for a set of auxiliary variables – additional variables in one’s
dataset (e.g., demographic measures, baseline predictor variables, additional items and scales,
etcetera) that are either predictive of missing data or are correlated with the variables in one’s
dataset that contain missing data – that might be incorporated into maximum likelihood or
multiple imputation analyses in order to maximally inform estimation (Collins, Schafer, & Kam,
2001). This search for auxiliary variables is not merely a good rule of thumb or “best practice,”
but can be essential. This is because the Missing At Random (MAR, Rubin, 1976) mechanism
underlying most modern missing data techniques assumes not only that missing data are linked
to some subset of the observed variables in one’s dataset, but also that these variables are
actually included in the analysis in some manner (e.g., as either model variables or auxiliary
variables) in order to inform the missing data model (see Technical Appendix A for an overview
of Rubin's (1976) missing data mechanisms). When an analysis fails to include the variables upon
which the MAR mechanism depends, the MAR assumption is violated and the data are
technically Missing Not At Random (MNAR). That is, in the context of the analysis in question,
the missing data are linked to a variable that is not observed – not included in the analysis and
thereby not observed in estimation – similar to situations when the missing outcome values
themselves are the primary predictors of incompleteness (cf. Enders, 2010 for a very accessible
overview of these ideas).
[1] Technically, this is to assume that the data are MAR dependent only on one's model variables. Additionally, FIML estimation assumes that missing data can be estimated using a continuous multivariate normal distribution.
The edict to incorporate auxiliary variables into one’s analyses is well-taken. Yet the
problems remain of which auxiliary variables to choose and how these variables might be related
to the probability of missing data (and/or to participants’ scores on missing variables). This
dilemma is exacerbated by the fact that, while researchers typically hold a variety of a priori
hypotheses about their substantive models, the same cannot be said about the missing data model
in most cases. Thus, whereas substantive model analysis represents a theoretically-driven
confirmatory enterprise, missing data model specification is more commonly an inherently
exploratory endeavor.
Though confirmatory techniques like logistic regression analysis are often appropriated in
the service of predicting which variables in a dataset significantly relate to missing data, the
simple premise of the CART-based approaches evaluated in this dissertation is that, based on its
exploratory nature, the task of modeling missing data may be best handled by explicitly
exploratory methods. These approaches have the potential to unearth important information
about which variables in a dataset may be related to missing data and, perhaps even more
importantly, how they relate to missing data, in a low effort, automated manner that obviates
many of the computational difficulties faced by analysts using more traditional methods. This
becomes especially important when missing data are related to a set of auxiliary variables via
nonlinear or interactive functional forms (or some combination of both).
Examples of potential nonlinear and interactive functional forms are easy to imagine. For
instance, age may be related to dropout in longitudinal studies in a nonlinear fashion, such that
both the youngest and the oldest participants in the study are the most likely to drop out. In this
case, the probability of dropout would be related to age in a U-shaped manner, such that the
youngest and oldest participants had the greatest probabilities of failing to return at future
measurement occasions. If one had access to measures of personal motivation and health, it
would be possible to see that this U-shaped age relationship might be moderated: perhaps among
older adults, dropout probabilities depend largely on health-related factors, whereas among
young adults, dropout probabilities depend upon participants’ degree of personal motivation to
stick with the study. In such cases, the variables in a data set might be related to missing data
through a variety of complex linear-by-linear, nonlinear-by-linear, or nonlinear-by-nonlinear
interactions among the auxiliary variables, and omitting these interactive and nonlinear
relationships in the missing data model would violate the MAR assumption, injecting bias into
the parameter estimates in one’s substantive analysis.
Within the context of a logistic regression analysis predicting a binary missing data
indicator variable, these nonlinear and interactive terms could be represented by a series of
product variables and polynomial transformations. This approach is prone to at least three
difficulties, however. First, even with a small number of potential auxiliary variables, computing
all possible nonlinearities and interactions among the variables becomes an overwhelming and
computationally intensive endeavor, even when only considering lower-order polynomials (e.g.,
only quadratic and cubic terms) and lower order interactions among them (e.g., only two- and
three-way product terms among the various linear, quadratic, and cubic variables).
This difficulty is not necessarily prohibitive, however, as one can easily imagine
automating this tedious computational task by writing a function or macro designed to
automatically compute the various nonlinear and interactive variables required for analysis. But
even if this large set of polynomial and product variables was immediately available as
predictors to be entered into a logistic regression or multiple imputation analysis, the estimation
of such a model may run into additional difficulties, such as severe collinearity issues and
convergence problems (cf. Howard, Rhemtulla, & Little, 2015, who make a similar argument).
Finally, it is possible that the types of nonlinearities and interactions among auxiliary
predictor variables that give rise to missing data may not be best represented by a set of
polynomial and product variables. For instance, the hypothetical example of age interacting with
motivation and health to predict missing data could be represented as a tree diagram, as depicted
in Figure 1.1. Here, it is clear that the predicted probabilities of going missing are greatest for
younger individuals (age < 22) whose scores on a measure of motivation are less than some
cutoff (cutoff1) and for older individuals (age ≥ 65) who score below a cutoff point on a General Health Rating Index (cf. Read, Quinn, & Hoefer, 1987). In mathematical terms, the difference
between the polynomial and product terms that might be modeled by logistic regression and tree-
based diagrams like Figure 1.1 amounts to the difference between a smooth functional form and
a step functional form of the relationship between the predictor variables and the probability of
missing data (see Technical Appendix B for a brief crash course in step functions).
The promise of CART-based missing data techniques is that these algorithms
automatically search through the data and test out all possible nonlinearities and interactions
among a set of predictor variables in determining the decision tree that best predicts a given
outcome variable, such as a missing data indicator. When the true functional form of the missing
data model is a nonlinear, interactive step function (i.e., a tree), these CART-based methods
should be ideally positioned to recover the true relationships between the predictor variables and
the probability of missing data. Furthermore, by using CART to generate a set of data weights or
a set of multiple imputations, analysts obviate the burden of computing large sets of polynomial
and product variables that may serve only to impair the performance of logistic regression
analyses and regression-based imputation techniques by introducing collinearity issues and
convergence problems, as described above.
The three major papers that comprise the body of this dissertation describe four
simulation studies that test these ideas. Each paper was written as a stand-alone piece for a
particular scholarly outlet. Paper 1 (Chapter 2) was published in Psychology & Aging (Hayes,
Usami, Jacobucci, & McArdle, 2015), paper 2 (Chapter 3) is presently under review, following a
“revise and resubmit” editorial decision, at Computational Statistics & Data Analysis, and paper
3 (Chapter 4) is currently under review for inclusion as a book chapter in a forthcoming volume
honoring the career contributions of John J. McArdle. For this reason, the introductory sections
of each and every paper herein describe the procedures and logic associated with these CART-based
missing data methods in at least a moderate level of detail.
Given this redundancy, rather than using this general introduction chapter to present the typical, encyclopedic, dissertationy[2] recounting of the various computational details of the quantitative
techniques under study, I have chosen to relegate the more fine-grained details of these methods
to a set of technical appendices and approach this section of the document with two goals in
mind: (1) to present the foregoing outline of the logic of CART-based missing data methods in
order to explicate the underlying assumptions tacitly implied, but not always explicitly stated, in
the introductions of each of the individual papers that follow, and (2) to provide a short,
anecdotal overview of the logical progression and evolution of this research program, as a whole.
[2] Because some critical readers will undoubtedly protest that "dissertationy" is not a real word, it
bears emphasizing that the construction of odd and unpleasant bastardizations of everyday words
through the application of unorthodox, misplaced suffixes is a venerable, time-honored tradition
in the psychological sciences that, with sufficient practice and persistence, anyone can learn to
engage in with "automaticity."
Having accomplished the first goal in this section, the next section provides a broad overview of
the logic of the three main papers, as a program of research.
Overview of the Logic and Evolution of the Research Program Reported in the Three Main
Papers
Initial goals for the research program. At the outset, it is worth briefly recounting the
stated goals of this research program. At the time this dissertation work was initially proposed,
CART-based multiple imputation methods were relatively unknown and the articles describing
these methods were only just beginning to be published (Doove et al., 2014; Shah et al., 2014).
As a result, the initial proposal for this work did not address CART-based multiple imputation
methods, but instead consisted of plans to test McArdle’s (2013) CART-based inverse
probability weighting methods through a large-scale Monte Carlo simulation study relating sets
of simulated auxiliary variables to the probability of missing data through one of several step
functional forms (e.g., a tree with 1, 2, or 3 nonlinear and interactive splits, as described in the
method section of Chapter 2). The goal was to compare the performance of CART-based
weighting methods to the performance of standard approaches like listwise deletion, standard
(joint model regression; predictive mean matching) multiple imputation, and logistic regression
weights under these unusual missing data mechanisms.
The discussion of this initial proposal at the 2014 proposal defense raised a set of
additional concerns, all of which, broadly speaking, involved questions of how to test these
methods under the kinds of realistic conditions faced by substantive and applied researchers in
real world scenarios. A first concern involved the template models under study. Wherever
possible, it was important to simulate datasets that reflected prototypical data analysis situations
faced by real researchers dealing with missing data. With this in mind, the simulations that
follow examine the performance of these methods in the context of longitudinal panel models
(such as those used by Hawkley, Thisted, Masi, & Cacioppo, 2010, which Chapter 2, Simulation A, uses as a template), tests of moderation in small-sample experimental designs (Chapter 3), and linear growth curve models typically used to analyze the results of randomized longitudinal clinical trials (Chapter 4; cf. Yang & Maxwell, 2014).
A second set of concerns revolved around whether various model assumptions were met
or violated in each simulated dataset. For example, many statistical simulations generate data
from normal (or multivariate normal) distributions, yet normality is rarely observed in real data (Micceri, 1989). Therefore, the simulation described in Chapter 3 evaluates these methods
under both normal and severely nonnormal data. Perhaps even more importantly, CART and its
extensions assume that auxiliary predictors are related to the probability of dropout via a step
functional form (i.e., a tree), but this model may not always be realistic or true in a given dataset.
To address this issue, the simulation described in Chapter 3 assesses the performance of CART-
based methods when the auxiliary variables in a dataset are related to nonresponse on an
outcome variable via a variety of smooth (i.e., linear, quadratic, cubic, and multiplicative-
interaction) functional forms.
Finally, though these methods were developed under the assumption of MAR missing
data, what if the data are MNAR? That is, what if the probability of dropping out of a study is
directly related to the values of the missing outcome variable under study? Following Collins et
al. (2001), it seemed likely that CART-based methods should be able to provide relief under
MNAR to the extent that the auxiliary variables in the dataset were highly correlated with the
missing outcome variable at the time of dropout. The simulation in Chapter 4 provides a direct
test of this hypothesis.
To summarize, each of the three papers in the Chapters that follow tackles a different
broad issue. Chapter 2 examines these methods under MAR missing data and step-functional
relationships between the auxiliary variables and the probability of missing data. Chapter 3
examines these methods under nonnormality and a variety of smooth (rather than step-
functional) MAR mechanisms. Finally, Chapter 4 examines performance under MNAR missing
data.
Brief account of the anecdotal progression of the research program. In this section, I
will briefly highlight some details, predominantly concerning paper 1 (Chapter 2) that may
clarify the progression of the research described in the chapters that follow. As described in the
previous section, Chapter 2 details a simulation study that closely follows the procedures
outlined in the 2014 dissertation proposal. This study produced three important results that have
served to inform every subsequent study in this research program. The latter two of these three
results were only tentatively highlighted in the main paper, due to their preliminary nature at that
time.
First, this study provided strong initial evidence that CART, pruned CART, and random
forest methods perform well in identifying the true missing data model under study (flagging the
correct variables as splitting variables in a tree; displaying the true auxiliary variables as
relatively more important using variable importance measures; see Chapter 2, Simulation A).
This finding was promising, suggesting that these methods may indeed purchase some degree of
accuracy beyond a simple logistic regression, main-effects approach.
Second, surprisingly (at the time), very little bias was found in the regression parameters
of the simulated cross-lagged regression model (Simulation A). However, greater bias was
observed in estimates of the variable means and variances (Simulation B), and CART-based
methods performed well in removing this bias from these estimates. Although initially
perplexing in light of the high percentages (up to 50%) of MAR missing data injected into the
simulation datasets, these results – greater bias in mean and variance estimates than in regression coefficients – were similar to those found by Collins et al. (2001) in their
classic simulations.
Upon reflection, these results make sense, however. In this simulation, the data were
generated so that the variables in the cross-lagged model under study were related to each other
by a set of moderate-to-strong linear relationships. Knowing this, recall that a straight line is
fully determined by two points, assuming that those two points fall squarely on the line. Thus,
even though substantial missing data were injected into the model, and even though the
simulated data points contained error variation and did not fall precisely on a regression line,
enough data remained for the regression coefficients to accurately model the linear relationships
in the data, making these coefficients relatively robust to the effects of missing data in the
simulation.[3] By contrast, the point estimate of a mean is a more fragile quantity. It is well-known
that the finite sample breakdown point of a mean is 1/n, indicating that removing even one
observation from (or adding even one extreme outlying observation to) a sample can potentially
alter the estimate of the mean in dramatic ways (cf. Wilcox, 2012).
[3] Note that this is not to suggest that regression coefficients are always robust to the effects of
missing data. In real-world data, the data-generating mechanism is rarely perfectly linear and
thus the relationships captured by regression coefficients are far more likely to represent the
imposition of a linear model on a set of relationships that are not truly linear. In these more
realistic scenarios, the removal of several data points could have any number of effects on the
resulting parameter estimates.
One need not accept this argument on purely logical grounds, as these principles are quite
easily observed empirically. For example, this can be verified in R (R Core Team, 2013) by
repeatedly running the following code:
# x is a random normal variate with N = 100 observations
x <- rnorm(100)

# y = .5*x plus a normal residual
y <- .5 * x + rnorm(100)

# create a duplicate of y that will receive missing values
yMiss <- y

# randomly inject missing values below the median of y
for (i in 1:length(y)) {
  if (yMiss[i] < median(y)) {
    yMiss[i] <- sample(c(yMiss[i], NA), 1)
  }
}

# true regression coefficients of complete-data y on x
coef(lm(y ~ x))

# regression coefficients of yMiss (with missing data) on x
coef(lm(yMiss ~ x))

# mean of complete-data y
mean(y)

# mean of y with missing data
mean(yMiss, na.rm = TRUE)
Running this code and examining the results, it is apparent that estimates of the sample
mean are substantially more affected by the missing data on simulated y than the estimates of the
regression coefficient of y on x. Noting the comparative fragility of estimates of sample means,
variances, and covariances, the simulation in Chapter 3 examines the effects of missing data on
interactions between an effect-coded experimental grouping variable and a continuous moderator
variable. Similarly, the simulation in Chapter 4 examines structural equation model results from
a latent growth curve model – a model that explicitly models mean trajectories over time –
predicted by an experimental grouping variable.
Finally, a third result emerged from these initial simulations that seemed quite tentative
and preliminary at the time, but that ultimately ended up providing a cornerstone for the research
that followed. To illustrate, Figure 1.2 displays a bar graph plotting the mean percent bias
observed in estimates of the covariance between variables X2 and Y2, which contained missing
data, returned by several different missing data methods at each of three different sample sizes
(N = 100, 250, and 500; see Simulation B from Chapter 2). Examining this plot, a set of trends
is clearly visible. First, the listwise deletion estimates are biased, hovering around 20% bias in
all sample size conditions. Second, all of the various CART-based weighting methods essentially
eliminated this bias. Third, and most importantly, the multiple imputation results are clearly
dependent on sample size, producing the greatest bias at the lowest sample size (N = 100) and
systematically less bias as the sample size condition increased to N = 500. This plot did not
appear in the published Psychology & Aging paper, because it seemed too preliminary to
forcefully highlight these points, but these results clearly seemed to suggest that CART-based
weights performed well even in very low sample sizes, whereas the results produced by multiple
imputation analysis might be more sample size dependent.
In addition to these three results from Chapter 2, one additional event influenced the
simulations that followed. After the publication of this initial work, I became aware of the papers
by Doove et al. (2014) and Shah et al. (2014) describing and evaluating a set of CART-based
multiple imputation methods. These papers demonstrated the promise of CART-based multiple
imputation methods in recovering regression parameters when the data included interaction
terms. Yet, these authors only simulated very large datasets – N = 1,000 in the case of the first
paper and N = 2,000 in the case of the second. Thus, the performance of these CART-based
imputation methods in the kind of small samples commonly obtained in experimental research
was wholly unknown.
As a result of all of these factors, the second set of simulations (described in Chapter 3
and 4) compare CART-based weighting methods to CART-based imputation methods under both
small and larger sample sizes⁴ in analysis models involving sample means and mean structures.
At the outset of the work described in these latter two chapters, it was unclear whether weighting
and imputation methods would perform comparably across all conditions, whether multiple
imputation methods would dramatically outperform weighting methods (a distinct possibility)
across conditions, or whether each method would perform better under certain distinct sets of
conditions. After conducting this series of studies, I believe that the initial answers to these
questions are quite clear in the data.
Summary
CART-based missing data methods represent a relatively easy, automated strategy for
ascertaining potentially complex nonlinear and interactive relationships between a set of
auxiliary predictor variables and the probability of missing data. The program of research
detailed here is informed by a combination of (a) the issues outlined in the original dissertation
proposal, (b) the concerns and additions raised in the original discussion of the proposal, (c) the
interesting and somewhat surprising results of the initial simulations described in Chapter 2, and
(d) the emergence of a new set of CART-based multiple imputation methods to be compared
with McArdle’s (2013) weighting methods. The simulations that follow represent a strong set of
initial steps toward understanding the performance of these methods across a variety of practical
conditions. Nonetheless, as discussed at length in the General Discussion section, future research
⁴ I note, in passing, that prior to adjusting for a peer reviewer's comments, the sample sizes
employed in the simulation in Chapter 4 were N = 60 (30 per cell in a two-cell experiment), N =
100 (50 per cell), N = 200 (100 per cell), and N = 1000 (500 per cell, intended to replicate the
large-sample conditions of prior CART imputation simulations). So these original Ns were, in
many ways, even truer small sample Ns.
will be needed to extend these initial methods in several key ways if they are to be useful in more
complex missing data scenarios commonly faced by researchers in practice.
References
Arbuckle, J. L. (1996). Full information estimation in the presence of incomplete data. In G. A.
Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and
techniques (pp. 243–277). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression
Trees. Pacific Grove, CA: Wadsworth.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351.
http://doi.org/10.1037/1082-989X.6.4.330
Doove, L. L., van Buuren, S., & Dusseldorp, E. (2014). Recursive partitioning for missing data
imputation in the presence of interaction effects. Computational Statistics & Data Analysis,
72, 92–104. http://doi.org/10.1016/j.csda.2013.10.025
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Hawkley, L. C., Thisted, R. A., Masi, C. M., & Cacioppo, J. T. (2010). Loneliness predicts
increased blood pressure: 5-year cross-lagged analyses in middle-aged and older adults.
Psychology and Aging, 25(1), 132–141. http://doi.org/10.1037/a0017805
Hayes, T., Usami, S., Jacobucci, R., & McArdle, J. J. (2015). Using classification and
regression trees (CART) and random forests to analyze attrition: Results from two
simulations. Psychology and Aging, 30(4), 911–929. http://doi.org/10.1037/pag0000046
Howard, W. J., Rhemtulla, M., & Little, T. D. (2015). Using Principal Components as Auxiliary
Variables in Missing Data Estimation. Multivariate Behavioral Research, 50(3), 285–299.
http://doi.org/10.1080/00273171.2014.999267
McArdle, J. J. (2013). Dealing with longitudinal attrition using logistic regression and decision
tree analyses. In J. J. McArdle & G. Ritschard (Eds.), Contemporary issues in exploratory
data mining in the behavioral sciences (pp. 282–311). New York: Routledge.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures.
Psychological Bulletin, 105(1), 156–166. http://doi.org/10.1037/0033-2909.105.1.156
R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria:
R Foundation for Statistical Computing. Retrieved from http://r-project.org/
Read, J. L., Quinn, R. J., & Hoefer, M. A. (1987). Measuring overall health: An evaluation of
three important approaches. Journal of Chronic Diseases, 40, 7S–21S.
http://doi.org/10.1016/S0021-9681(87)80027-9
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
http://doi.org/10.2307/2335739
Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., & Hemingway, H. (2014). Comparison
of random forest and parametric imputation models for imputing missing data using MICE:
A CALIBER study. American Journal of Epidemiology, 179(6), 764–774.
http://doi.org/10.1093/aje/kwt312
Wilcox, R. (2012). Modern Statistics for the Social and Behavioral Sciences: A Practical
Introduction. Boca Raton, FL: CRC Press.
Yang, M., & Maxwell, S. E. (2014). Treatment effects in randomized longitudinal trials with
different types of nonignorable dropout. Psychological Methods, 19(2), 188–210. Retrieved
from http://dx.doi.org/10.1037/a0033804
Figure 1.1: Hypothetical Dropout Mechanism Represented as Tree Diagram
Figure 1.2: Mean Percent Bias of Cov(X2, Y2) by Missing Data Method and N. CART,
PrunedCART, and RandomForest denote CART-based inverse probability weighting methods.
MI indicates Multiple Imputation using predictive mean matching.
Chapter 2: Using Classification and Regression Trees (CART) and Random Forests to
Analyze Attrition: Results From Two Simulations¹

¹ Article published in Psychology & Aging.
Using Classification and Regression Trees (CART) and Random Forests to
Analyze Attrition: Results From Two Simulations
Timothy Hayes
University of Southern California
Satoshi Usami
University of Tsukuba
Ross Jacobucci and John J. McArdle
University of Southern California
In this article, we describe a recent development in the analysis of attrition: using classification and
regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible
machine learning techniques have the potential to capture complex nonlinear, interactive selection
models, yet to our knowledge, their performance in the missing data analysis context has never been
evaluated. To assess the potential benefits of these methods, we compare their performance with
commonly employed multiple imputation and complete case techniques in 2 simulations. These initial
results suggest that weights computed from pruned CART analyses performed well in terms of both bias
and efficiency when compared with other methods. We discuss the implications of these findings for
applied researchers.
Keywords: missing data analysis, attrition, machine learning, classification and regression trees (CART),
longitudinal data analysis
Supplemental materials: http://dx.doi.org/10.1037/pag0000046.supp
A counterfactual is something that is contrary to fact. In an experi-
ment, we observe what did happen when people received a treatment.
The counterfactual is knowledge of what would have happened to
those same people if they simultaneously had not received treatment.
An effect is the difference between what did happen and what would
have happened (Shadish, Cook, & Campbell, 2002, p. 5).
As Shadish et al. (2002) observed in their classic text, counterfactual reasoning is fundamental
to causal inference. The focus of this article is on counterfactual inferences in a different context:
that of missing data caused by attrition. Although the parallel is not typically made transparent,
inferences about missing data take a near-identical form to the more familiar causal inferences
described above. Paraphrasing Shadish et al., in the case of missing data, what we observe is the
sample data, which may contain incompleteness. The counterfactual is what the data—and
particularly our model(s) of interest—would have looked like if there was no incompleteness;
that is, if we had access to all of the data. The effect of incompleteness is the difference between
the results we obtain from our actual sample and the results we would have obtained with access
to the complete data.
Viewed in this way, it seems evident that thinking about the effects of missing data requires
the same set of inferential skills that researchers confidently deploy in a variety of other contexts
on a regular basis. The major difference is that, unlike an experimental treatment condition,
researchers do not have access to an alternative set of complete data that could foster such a
comparison with the incomplete sample in order to assess the effects of incompleteness. As a
result, it is not possible to observe what our model(s) would have looked like if there was no
incompleteness. Instead, this needs to be estimated.
In this article, we assess a new method of estimation under missing data: the use of inverse
probability weights derived from an exploratory classification tree analysis (cf. McArdle, 2013).
The potential utility of this method comes from the promise of exploratory data mining
techniques to uncover and account for complex relationships in the data that other linear
methods might overlook. To evaluate whether this method lives up to its promise, we compare it
with (a) weights derived from logistic regression analysis, and (b) multiple imputation (MI)
methods (Rubin, 1976, 1987). Further, we extend McArdle's (2013) logic by comparing these
methods with probability weights computed using random forest analysis (Breiman, 2001).
We begin by reviewing two well-known methods of handling missing data: complete case
methods and MI. We then describe the
Timothy Hayes, Department of Psychology, University of Southern California; Satoshi
Usami, Department of Psychology, University of Tsukuba; Ross Jacobucci and John J.
McArdle, Department of Psychology, University of Southern California.
The authors thank Craig K. Enders for his invaluable help clarifying a crucial detail of the
coding of the multiple imputation analyses in Simulation A. The authors thank John T. Cacioppo
and Louise C. Hawkley for generously providing us with their data on the CHASRS study, used
in the applied example in Online Supplement A. Additionally, the authors thank Louise Hawkley
for her helpful and informative feedback on a draft of this article.
Correspondence concerning this article should be addressed to John J. McArdle, 3620 South
McClintock Avenue, Seeley G Mudd (SGM) 501, Department of Psychology, University of
Southern California, Los Angeles, CA 90089-1061. E-mail: jmcardle@usc.edu
This document is copyrighted by the American Psychological Association or one of its allied
publishers. This article is intended solely for the personal use of the individual user and is not to
be disseminated broadly.
Psychology and Aging, 2015, Vol. 30, No. 3, 000. © 2015 American Psychological Association.
0882-7974/15/$12.00 http://dx.doi.org/10.1037/pag0000046
logic of using inverse sampling weights to address incomplete
data. Although inverse probability weighting (IPW) has a long
history in survey research (Kish, 1995; Potthoff, Woodbury, &
Manton, 1992) and in the analysis of attrition (Asparouhov, 2005;
McArdle, 2013; Stapleton, 2002), coupling this technique with an
exploratory data mining analysis of the probability of incomplete-
ness is a recent and novel idea (McArdle, 2013). We present three
alternative methods for computing these weights: conventional
logisticregression,classificationandregressiontrees(CART),and
random forest analysis. We then attempt to answer our questions
about the relative benefits of these methods using data from two
simulation studies.
Methods for Handling Incomplete Data
Complete Case Analyses
The simplest thing to do about missing data is, of course, nothing at all,¹ and this is the basis
for complete case methods. In
listwise deletion, any rows in the data set that contain incomplete-
ness are deleted prior to analysis and only complete cases are
analyzed. In pairwise deletion, the data set is subsetted to include
only those variables relevant to a particular analysis, and then
listwise deletion is performed on each pair of variables in the
subsetted data set (that is, cases are not deleted if they contain
incompleteness on variables not relevant to the analysis at hand,
with the standard example being correlation tables computed from
the complete cases on each pair of variables). Complete case
methods implicitly assume that the data are missing completely at random (Rubin, 1976)²—that
is, unrelated to both the missing and observed portions of the data set—and unless this
assumption is met, these methods will result in biased parameter estimates. Even
when incompleteness is caused by a completely random process,
however, deleting cases reduces statistical power, and the extent of
this problem increases as the amount of incompleteness becomes
more severe. In a world in which methods for addressing incom-
pleteness are widely available and easily implemented in common
software packages, complete case analysis should never be the
only analysis performed. However, these methods can serve as a
useful baseline (or control) against which to compare the effects of
statistical adjustments for incompleteness.
Multiple Imputation (MI)
Instead of simply ignoring missing data, researchers might apply an analysis method that
effectively adjusts for the effects of incompleteness. One such method is MI (Rubin, 1987). MI
functions exactly as its name implies: this method imputes new values in place of missing cases,
and it does this multiple times.
Concretely, MI is a simulation-based method that consists of three steps: (a) imputing m data
sets, in which m is typically between 3 and 10,³ (b) performing the data analysis of interest on
each of the m imputed data sets, and, finally, (c) using simple arithmetic formulas to pool the
parameter estimates and standard errors resulting from the m analyses. By analyzing and
aggregating the results from each of the m data sets, MI produces estimates that are less biased
and standard errors that are smaller than those produced by a single imputation alone (for more
information on MI see, e.g., Graham & Schafer, 1999; Rubin, 1987).
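The "simple arithmetic formulas" in step (c) are Rubin's (1987) pooling rules: average the m point estimates, and combine the within- and between-imputation variances into a total sampling variance. A minimal sketch, here in Python for concreteness, with made-up estimates of a regression slope from m = 3 hypothetical imputed data sets (the function name is ours):

```python
import statistics

def pool_estimates(estimates, variances):
    """Pool m per-imputation estimates and their squared standard
    errors using Rubin's (1987) rules."""
    m = len(estimates)
    qbar = statistics.mean(estimates)       # pooled point estimate
    ubar = statistics.mean(variances)       # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance
    total = ubar + (1 + 1 / m) * b          # total sampling variance
    return qbar, total ** 0.5               # pooled estimate and its SE

# hypothetical slope estimates and squared SEs from m = 3 imputations
est, se = pool_estimates([0.48, 0.52, 0.50], [0.010, 0.012, 0.011])
```

Because the between-imputation variance enters the total, the pooled standard error reflects the added uncertainty due to the missing data, which a single imputation would understate.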
Handling Missing Data Using Inverse Probability Weights
An alternative strategy to address incompleteness frames missing data as a sample selection
problem (Asparouhov, 2005; Kish, 1995; McArdle, 2013; Potthoff et al., 1992; Stapleton, 2002).
Understood in this way, missing data results from undersampling the members of certain
subpopulations. For example, perhaps individuals in a certain age group, say individuals with
age greater than 58 years, are less likely to return to the study at Time 2. In this scenario,
individuals in the >58 age group are undersampled relative to individuals with age ≤ 58. In
practice, these probabilities might be estimated from an exploratory analysis, such as a logistic
regression, decision tree analysis, or ensemble method (see the section on "Random forest
analysis") predicting a variable coded 0 for dropout (missing) at Time 2, and 1 for returning
(nonmissing) at Time 2. In order to correct the analysis for these uneven selection probabilities,
researchers can utilize IPW (cf. Kish, 1995; McArdle, 2013; Potthoff et al., 1992; Stapleton,
2002). If p_i represents the probability of selection—that is, the probability of returning to the
study (not dropping out)—for person i at time t + 1, then the inverse probability weight, w_i, is
equal to 1/p_i.
Because estimates of the weighted sample variance are not invariant to scaling, it is important
to choose an appropriate scale for the sample weights. One technique that fits well with the goals
of the current situation is weighting to the relative sample size (Stapleton, 2002).⁴ This scaling is
accomplished by multiplying the raw weights, w_i, by a scaling factor, λ, where

λ = n / Σ_{i=1}^{n} w_i. (1)

This transformation scales the raw weights so that the scaled weights sum to the actual sample
size, n. Weights can readily be incorporated into structural equation models by maximizing

ln(L) = Σ_{i=1}^{n} w_i ln(L_i), (2)

the weighted log likelihood (Asparouhov, 2005). Here, w_i indicate
¹ By "nothing at all" we mean "nothing at all to address incomplete data." Although some
packages, like SPSS, default to complete case methods in most analyses, which do not address
missing data, many structural equation modeling packages, such as Mplus (Muthén & Muthén,
2011), default to full information maximum likelihood for many standard analyses, which does
address missing data. However, even this program defaults to listwise deletion in some cases, as
when data are missing only on the dependent variables (as mentioned later in this article).
² The remainder of the methods discussed in this article were designed for situations in which
incompleteness is related to the values of the observed covariates; that is, when the data are
missing at random (Rubin, 1976). Implications for the missing not at random case are discussed
at the end of this article.
³ However, some researchers now recommend 20 or more imputation data sets, as did Craig
Enders in a recent personal communication.
⁴ Another worthwhile option would have been to use effective weights, which sum to the
effective sample size (Potthoff et al., 1992; Stapleton, 2002). However, in the types of analyses
described in this paper (that is, when applying weights to single-level, rather than multilevel,
data), Mplus automatically rescales the weights so that they sum to the relative sample size (see
Muthén & Muthén, 2011, p. 501).
the weights, which may be rescaled using Equation 1 (that is, w_i here may refer to λw_i if
relative weights are used). Weighted maximum likelihood (WML) estimation is the
computational equivalent of fitting a model to the weighted sample means and covariances using
standard maximum likelihood estimation. When w_i is a vector of unit weights (that is, when
w_i = 1 for all i), this equation reduces to regular maximum likelihood and is equivalent to
listwise deletion with every case receiving the same weight. Thus, it is evident that researchers
cannot ignore or avoid missing data issues by adopting program defaults; even the most basic of
these defaults carries tacit assumptions about the equal probability of selection into the sample.
However, WML has been shown to produce overly short standard errors and confidence
intervals. Instead of WML, pseudo-maximum likelihood (PML) is preferred, using the Huber-
White sandwich estimator to generate the asymptotic covariance matrix (Asparouhov, 2005).
PML is calculated by several robust maximum likelihood estimators offered in Mplus (Muthén
& Muthén, 2011), including maximum likelihood with robust standard errors (MLR), used in the
demonstration below.
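To make the scaling in Equation 1 concrete, the following sketch (in Python, with made-up selection probabilities; the function name is ours) computes raw inverse probability weights and rescales them to the relative sample size:

```python
def scale_to_relative(raw_weights):
    """Rescale raw inverse probability weights w_i = 1/p_i by the
    factor lambda in Equation 1, so the scaled weights sum to n."""
    n = len(raw_weights)
    lam = n / sum(raw_weights)              # scaling factor, lambda
    return [lam * w for w in raw_weights]

# hypothetical selection (return) probabilities for four cases
p = [0.25, 0.50, 0.50, 1.00]
raw = [1 / pi for pi in p]                  # raw weights: 4, 2, 2, 1
scaled = scale_to_relative(raw)             # scaled weights sum to n = 4
```

The rescaling preserves the ratios among the weights (the hardest-to-retain case still counts four times as much as the certain returner); only their overall scale changes.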
Modeling Selection Probabilities
Logistic regression. The standard way to assess the relationship between the variables in a
data set and the probability of incompleteness is by using logistic regression. Logistic regression
relates a set of categorical, ordinal, or continuous predictors to a binary response variable—in
this case, an incompleteness indicator variable, in which 0 = dropout and 1 = return at time
t + 1. If one's goal is to identify correlates of incompleteness to include as auxiliary variables in
an imputation model, one might opt to select those variables that display significant relationships
to incompleteness. If one's goal is to compute sample weights, however, the predicted log odds
can be converted to predicted probabilities using a standard formula, and the predicted
probabilities can then be inverted to form sampling weights.
The logistic regression approach assumes that the predictors exhibit a linear relationship to
the logged odds of the binary response variable, and therefore provides a useful way of assessing
the significance of such linear relationships. Alternatively, it is possible that predictors in the
data set may exhibit complex, unconventional interactions and/or may be related to
incompleteness in a nonlinear fashion. Although it is possible to specify multiplicative linear
interactions and polynomial functions in the regression framework, specifying all such
interactions and nonlinearities among many predictors could result in multicollinearity issues
and may miss important predictive relationships that do not conform to these specified
functional forms. Even if the true relationship could be well captured by linear interactions and
polynomial terms, the correct model specification may be difficult to approximate manually. If
the analyst fails to specify the correct relationship among the covariates and the missing data
indicator, the logistic regression approach may fail to capture important relationships among
predictors and important nonlinear predictors of incompleteness. What is needed, then, is a
technique to identify such interactions and nonlinearities in a systematic, automated manner.
CART provides such a technique.
CART analysis. To identify a model of incompleteness is, by definition, to attempt to
discover a set of auxiliary covariates that may be unimportant with respect to one's a priori
substantive model of interest but that are related in important ways to the probability of attrition
(Enders, 2010; Rubin, 1976). One analytic technique that is particularly well suited to these
exploratory goals is CART (Berk, 2009; Breiman, Friedman, Olshen, & Stone, 1984; Morgan &
Sonquist, 1963; see also Strobl, Malley, & Tutz, 2009, for an excellent, readable introduction
aimed at psychologists). In the context of attrition, a CART analysis seeks to find the values of
the predictor variables that separate the data into groups of people who either (a) drop out, or (b)
return to the data set at time t + 1.⁵
As an example, imagine that the dependent variable is an indicator of incompleteness, coded
0 for missing and 1 for not. Imagine, further, that one particular predictor is the highest level of
education that a participant has achieved, coded with four ordered categories: (a) high school
diploma or GED, (b) bachelor's degree, (c) master's degree, and (d) doctoral degree. The first
thing a CART analysis does is to search for the "split" on this variable that will partition the data
into two homogenous groups—a group of mostly 1s (people who returned to the study) and a
group of mostly 0s (people who dropped out). With four ordinal categories, there are 4 − 1 = 3
possible splits: high school education versus bachelor's, master's, and doctoral; high school and
bachelor's versus master's and doctoral; and high school, bachelor's, and master's versus
doctoral education. CART iteratively tries out each of these potential cut points, subdividing the
data at each possible split and choosing as the best split the one that produces the most
homogenous subgroups.⁶
Once the best split has been identified for every variable, the CART algorithm partitions the
data using the best overall split among these best splits and assigns a predicted class to each
subgroup by majority vote (i.e., a predicted class of 1 for a subgroup containing mostly 1s).
CART repeats this same process on each predictor in the model, identifying the best split by
iteratively trying out all possible splits and settling on the split that produces the greatest
reduction in impurity (or, equivalently, the most homogenous partitions).
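As a concrete illustration of this split search, using one common impurity measure, the Gini index (the counts and class proportions below are invented, and the function names are ours):

```python
def gini(p1):
    """Gini impurity of a binary node with proportion p1 of class 1;
    zero for a pure node, maximal at p1 = .5."""
    return 2 * p1 * (1 - p1)

def split_impurity(n_left, p_left, n_right, p_right):
    """Size-weighted average impurity of the two child nodes created
    by a candidate split -- the quantity CART seeks to minimize."""
    n = n_left + n_right
    return (n_left / n) * gini(p_left) + (n_right / n) * gini(p_right)

# a split yielding a mostly-0 child and a mostly-1 child scores
# better (lower) than one yielding two thoroughly mixed children
good = split_impurity(50, 0.10, 50, 0.90)   # homogenous children
bad = split_impurity(50, 0.45, 50, 0.55)    # mixed children
```

Evaluating this quantity at every candidate cut point on every predictor, and keeping the cut that minimizes it, is precisely the iterative search described above.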
CART proceeds recursively in this fashion until some stopping criterion is reached. Examples
of stopping criteria include creating a prespecified number of nodes, or reaching a point at which
no further reduction in node impurity is possible. If the algorithm is allowed to proceed
indefinitely, the model will eventually find splits that are completely or nearly completely
homogenous but that may have trivial sample sizes. For example, the final split might create two
subgroups of only three people each. Because CART assigns predicted classes by majority vote,
the results of such splits are highly unstable and unlikely to generalize to new samples—it would
only require changing a single case to overturn a majority of two and change the predicted class
for that node.
⁵ In this context, we only discuss using CART to predict binary outcomes. However, we note
that CART can also be used with multicategorical and continuous outcomes (Breiman et al.,
1984).
⁶ In the case of categorical dependent variables, this is mathematically accomplished by
minimizing some measure of node (i.e., subgroup) "impurity," or heterogeneity, such as Bayes
error, cross-entropy, or the Gini index. Each of these functions reaches its minimum value when
the class proportion p is close to either 0 or 1 and its maximum value when p = .5 (cf. Berk,
2009).
Therefore, it is advisable to curb this algorithmic tendency to overfit these fine-grained
idiosyncrasies in the observed data. One might consider accomplishing this using one of two
broad strategies. First, one could consider stopping the tree from growing too large (and thereby
preventing, in theory, the tendency to overfit in response to trivial, unstable partitions in the data)
by setting a minimum sample size, a priori (e.g., all final splits must have at least 20 people in
each node).
Alternatively, one may instead grow a very large tree and subsequently prune it back using
cost-complexity pruning, which tempers the number of partitions by adding a parameter that
penalizes larger, more unstable trees. In the binary classification context, cost-complexity
pruning seeks to identify the nested subtree that minimizes the sum of (a) the risk associated
with a tree of size T, and (b) the penalty for complexity assigned to a tree of size T. Here, risk is
defined in terms of the proportion of misclassified observations of class 0 or 1 (e.g., in a node or
entire tree) weighted by the cost parameters assigned to each type of misclassification, and a
nested subtree is defined as a tree with fewer of the initial partitions than the original large tree.
Hence, the optimal subtree chosen by cost-complexity pruning is a function of the costs of
misclassification errors (the risk) qualified by the penalty associated with tree complexity. Cost-
complexity methods employ cross-validation to set the optimal penalty parameter for pruning,
trying out various values for the parameter and computing the associated risk on the validation
portion(s) of the data set.
Which of these two strategies—stopping or pruning—should researchers prefer? Some
methodologists are equivocal, stating that "in practice, whether one determines tree complexity
by using [penalty parameter] α . . . or an explicit argument to the CART procedure determining
the minimum terminal node sample size, seem to make little difference" (Berk, 2009, p. 130).
Others (e.g., Louppe, 2014), however, caution against employing stopping criteria, such as
minimum node size, arguing that there may be cases in which further splits below the enforced
minimum (e.g., minimum node sizes of less than a stopping rule of N = 20) could potentially
provide benefits in decreasing generalization error. Because this argument is both intuitive and
persuasive, and because investigating the effects of different choices of minimum node size is
tangential to the aims of the present research, in the simulations described here we employ
cost-complexity pruning rather than minimum node size.
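The pruning criterion described above can be written as R_α(T) = R(T) + α|T|: the tree's misclassification risk plus a penalty α for each terminal node. A minimal numeric sketch (in Python, assuming unit misclassification costs and invented counts; the function name is ours):

```python
def penalized_risk(n_misclassified, n_total, alpha, n_terminal_nodes):
    """Cost-complexity criterion R_alpha(T) = R(T) + alpha * |T|,
    assuming unit (equal) misclassification costs."""
    risk = n_misclassified / n_total        # resubstitution risk R(T)
    return risk + alpha * n_terminal_nodes  # plus the complexity penalty

# a larger tree misclassifies fewer cases but pays a larger penalty;
# pruning retains the subtree with the smaller penalized risk
large = penalized_risk(10, 200, alpha=0.01, n_terminal_nodes=8)
small = penalized_risk(18, 200, alpha=0.01, n_terminal_nodes=3)
```

At this α, the three-node subtree wins despite its higher raw error rate; cross-validation is what selects the α at which this trade-off generalizes best.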
The results of a CART analysis are displayed as a tree diagram, as shown in Figure 1. At the
top of the diagram is the "root node," which contains the entire data set. In this example, after
trying out every possible split on all variables, CART chose to partition the data at cut point c1
on predictor x. If x > c1, we proceed visually to the right of the diagram, reaching a new node.
Because CART chose to further partition this subgroup of the data, this interim node is referred
to as an "internal node." This new split occurred on variable z at cut point c3. If x > c1 and
z ≤ c3, we reach Node 3, which is a "terminal node"—that is, a final node that was not split
further. Because the majority of cases in this node were 0s (e.g., dropouts), this node receives a
predicted class of 0. Similarly, if x > c1 and z > c3, we reach terminal Node 4, in which the
majority of individuals were 1s (e.g., returners) and the predicted class is 1.
Splitting the data on both x and z represents an interaction effect—the effect of x on the
predicted probability of returning to the study depends on z. Yet this interaction may be quite
different from those modeled by the usual regression techniques (Aiken & West, 1991). In this
case, z interacts with x only above cut point c1, and it does so not by modifying the simple slope
of a line, but instead by splitting the values of x > c1 into two distinct subgroups (nodes) with
different predicted outcomes.
If, instead, CART had partitioned the subgroup using a different cut point on x, the result
would be a nonlinear step function (Berk, 2009). This pattern can be seen on the left side of the
diagram: If x ≤ c1 but > c2, the predicted class is 1; if x ≤ c1 and ≤ c2, the predicted class is 0.
In this way, by testing each possible split on every variable, the CART algorithm tests all
possible nonlinearities and interactions among all cut points on the predictors.
In addition to assigning a class to each node, CART also computes a predicted probability of “success” (i.e., being classified as a 1, or, in the case of attrition, returning to the study) using the proportion of 1s in each terminal node. For example, if only 25% of individuals in a certain terminal node returned to the study, the predicted probability for this node would be .25. These predicted probabilities can be inverted to create sample weights that might be used to give greater weight to individuals from high-dropout groups who actually returned to the study.
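As a minimal sketch of that inversion, assuming hypothetical terminal nodes and predicted probabilities of return, the weight for each returner is simply the reciprocal of the predicted probability for that case's node:

```python
# Hypothetical terminal nodes and their predicted probabilities of
# returning to the study (the proportion of 1s in each node).
p_return = {"node3": 0.25, "node4": 0.87}

def attrition_weight(node):
    # Inverse-probability weight: returners from high-dropout nodes
    # (low probability of return) receive larger weights.
    return 1.0 / p_return[node]

print(attrition_weight("node3"))  # 4.0
```

In words, a returner from a node where only a quarter of cases returned stands in for four people like them, so weighting them up counteracts the selective dropout in that subgroup.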
Random forest analysis. Although CART has many virtues, it has some limitations. One such limitation is high variance across samples. This means that the tree structure and resulting estimates (e.g., predicted classes and probabilities) are not necessarily stable in new samples. As we have seen, pruning is one method that may address this issue. An alternative is to employ bootstrap methods that repeatedly create new data sets by sampling from the observed data with replacement, fitting the CART model to each bootstrap sample and aggregating the results to determine the most stable features of the tree. Because of their low variance and high predictive accuracy, in many domains the use of CART has largely been supplanted by resampling (“ensemble”) methods that address CART’s potential instability by averaging the results of many trees.
Figure 1. Example tree diagram from classification and regression tree (CART) analysis (cf. Berk, 2009).

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

HAYES, USAMI, JACOBUCCI, AND MCARDLE

Bootstrap methods take many repeated samples from the data with replacement, each time recording the predicted classification for each case. The resulting predicted probabilities are computed as the proportion of times a given case is classified as a 0 or a 1 across the
bootstrap samples. For example, if Case 1 in the data set is classified as a 1 in 900 out of 1,000 bootstrap samples, the predicted probability of this case being assigned a value of 1 is .9. Similarly, the predicted class for the case is assigned by majority vote as 1. This is the basis for bagging (short for “bootstrap aggregation”; Breiman, 1996), an early resampling-based method.7 Random forest analysis (Breiman, 2001) provides an additional benefit: For each split on each bootstrap tree, the algorithm randomly samples a subset of the predictors to be used as candidates for the split. When the predictors in the data set are highly correlated, this procedure addresses potential collinearity issues by giving each of the correlated predictors a chance to be used in different bootstrap trees. As in CART and logistic regression, the predicted probabilities from a random forest analysis can be inverted and scaled to create weights for use in further analyses (Asparouhov, 2005; Kish, 1995; McArdle, 2013; Potthoff et al., 1992; Stapleton, 2002).
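The vote-counting logic, and the scaling of the resulting inverse-probability weights so that they sum to the sample size, can be sketched as follows (a pure-Python illustration; the vote counts and probabilities are invented):

```python
def bagged_prob(votes_for_1, n_trees):
    # Predicted probability = share of bootstrap trees voting "1";
    # the predicted class is assigned by majority vote.
    p = votes_for_1 / n_trees
    return p, int(p >= 0.5)

# The 900-of-1,000 example from the text:
p, cls = bagged_prob(900, 1000)
print(p, cls)  # 0.9 1

# Invert hypothetical P(return) values and rescale the raw weights
# so the scaled weights sum to the number of (returning) cases.
probs = [0.9, 0.5, 0.25, 0.75]
raw = [1 / p for p in probs]
scaled = [w * len(raw) / sum(raw) for w in raw]
print(round(sum(scaled), 6))  # 4.0
```

Rescaling keeps the effective sample size interpretable while preserving the relative up-weighting of cases from high-dropout subgroups.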
The biggest disadvantage of random forests is that the analysis, which aggregates over the results of many bootstrap trees, does not produce a single, easily interpretable tree diagram. However, this method does provide variable importance measures derived from the contribution of each variable to prediction or fit across the bootstrap trees. As a result, variables with high variable importance scores in a missing data analysis may be considered as important missing data correlates.
The Present Research
The promise of using CART and random forest methods to
model missing data is the potential of these methods to better
capture complex selection models than traditional linear methods
such as logistic regression. Yet the relative performance of these
methods has not been assessed in the missing data context. As a
result, whether and when these methods will provide gains over
traditional techniques is unknown. Therefore, in order to assess the performance of these methods, we conducted two statistical simulations. The first, large-scale simulation study assessed the effects of selection model (linear vs. tree with one, two, or three splits) and percent attrition (30% or 50%) on parameter estimates returned by a cross-lagged path model.
Simulation A
Simulation design.
Template model. As a template model for all analyses, we simulated a cross-lagged factor model with two time points. This model is displayed in Figure 2a, which also displays the true population parameters for the structural part of the model. For the sake of simplicity, we set the correlation between X1 and Y1 to zero in the population. We used this factor model to simulate indicators with varying degrees of reliability. In all cases, once the data were generated at a given level of reliability, we averaged the indicators to form composite variables, yielding the analysis model shown in Figure 2b. This analysis model was then fitted using each of the techniques (e.g., MI, CART weights) assessed in the simulation.8
Simulated missing data covariates. In addition to this template model, we simulated three missing data covariates, v, z, and w. These three covariates were uncorrelated with each other and were set to be correlated with both Time 1 variables at rCOV,X1 = rCOV,Y1 = .4. Given the structural expectations of the template model, this resulted in expected correlations of .32 between each covariate and X2 (because the cross between Y1 and X1 was zero) and expected correlations of .52 between each covariate and Y2.
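These expected values follow from simple path tracing (covariate → Time 1 variable → Time 2 variable). A quick arithmetic check, using the structural coefficients reported for Figure 2a:

```python
# Structural coefficients from the template model (Figure 2a).
b_xx, b_yy, b_yx, b_xy = 0.8, 0.8, 0.5, 0.0

# Each covariate correlates .4 with X1 and Y1, which are uncorrelated,
# so the implied covariate-outcome correlations are sums of traced paths.
r_cov_x1 = r_cov_y1 = 0.4
r_cov_x2 = r_cov_x1 * b_xx + r_cov_y1 * b_xy  # the Y1 -> X2 path is zero
r_cov_y2 = r_cov_y1 * b_yy + r_cov_x1 * b_yx

print(round(r_cov_x2, 2), round(r_cov_y2, 2))  # 0.32 0.52
```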
Approach to modeling attrition. In this simulation, we were interested in modeling participant attrition rather than other types of missing data (e.g., selective nonresponse to certain surveys or items). Specifically, we modeled a situation in which participants showed up at Time 1 and then either did or did not return at Time 2. Thus, if a participant dropped out at Time 2, both variables—X2 and Y2—were missing.
Factors varied in the simulation.
Primary factors in the simulation. The two key factors in the present simulation were the selection model and the percent attrition generated. For each simulation cell, we generated j = 200 simulation data sets from a multivariate normal distribution.
Selection model. The most crucial factor varied in the simulation was the structure of the selection model. Once we generated complete data sets from the template model, we generated attrition using either a linear selection model or a tree with one, two, or three splits. Figure 3, Panels a, b, and c, display the structures of these tree-based selection models, respectively. Note that, particularly in the case of the three-split model (Panel c), these figures represent conceptual, missing data generation models for the simulation, and the order of the splits may not necessarily be the same when analyzed using decision tree methods (e.g., the first split on variable v may not result in the biggest partition, and may therefore not be the first split returned by the analysis, despite this data generation model).9
Percent attrition. The percent of attrition modeled in this
simulation was varied to be either 30% or 50% at Time 2 in the
manner described in the next section.
Choice of cut points and methods of simulating selection models under different percentages of attrition. In the linear selection conditions, attrition was predicted by variable v using a smooth
7 A modification of this procedure that is utilized by the random forest algorithm increases the accuracy of the resulting estimates even further by taking advantage of the fact that, for each bootstrap sample, when sampling N rows of the original data with replacement, about one third of the original data, on average, will not be included in the sample (Breiman, 2001). These unsampled cases are referred to as out-of-bag observations. One strategy to increase the predictive accuracy of estimates from bagging analyses is to treat the out-of-bag portion of the data as a validation data set, using the model generated on the bootstrapped data to predict the classes of the out-of-bag observations. After repeating this procedure many times, the final predicted class of a given case is assigned by majority vote as the class most frequently predicted for that case among the out-of-bag samples. Like cross-validation, this improves prediction accuracy by predicting the classes of new observations that were not used to generate the model.
8 Lengthy simulation code (which includes not only R scripts but also multiple MplusAutomation template files, making it unwieldy to include in an appendix for this article) is available from the first author upon request.
9 Decision trees are ordered in terms of successive “best” (most homogeneous) splits. Although we generated data based on these conceptual diagrams, whether or not CART returned the splits in the exact order depicted depended on the partitions created by each split. For example, if the split w ≤ c3 produced the most homogeneous subgroups, it might be considered the first split in a CART diagram, rather than the split on variable v.
MISSING DATA WITH CART
function. To simulate linear selection, we simulated the log odds of attrition using the linear model:

LogOdds(Miss) = β0 + β1v.

We then converted these log odds to probabilities using conventional formulas and used these resulting probabilities to generate missing data. For example, if a certain case had a predicted probability of .6, this case would have a 60% chance of receiving a missing value in the data set and a 40% chance of not receiving a missing value in the data set.
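A minimal sketch of this generation step, assuming v is standard normal (the coefficients below are the 30% attrition values reported next):

```python
import math
import random

def p_missing(v, b0=-0.85, b1=0.3):
    # Convert log odds to a probability via the inverse-logit formula.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * v)))

random.seed(1)
v_values = [random.gauss(0, 1) for _ in range(100_000)]
# Each case is flagged missing with its own predicted probability.
missing = [random.random() < p_missing(v) for v in v_values]
rate = sum(missing) / len(missing)
print(round(rate, 2))  # roughly 0.30
```

Because each case's Bernoulli draw uses its own predicted probability, the realized missingness is a smooth, monotone function of v rather than a set of discrete jumps.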
In the 30% attrition condition, the coefficients used to generate the log odds were β0 = −0.85 and β1 = 0.3. In the 50% attrition condition, the coefficients used to generate the log odds were β0 = 0.03 and β1 = −0.86. These values were chosen based on simulation pretests because of their ability to reliably lead to 30% and 50% missing cases, respectively. We simulated linear attrition in this way, rather than a more conventional manner (e.g., monotonically increasing the probability of attrition in each quartile of the missing data indicator, as in Collins, Schafer, & Kam, 2001), because we wanted the linear selection conditions to truly represent, rather than merely approximate, the kind of smooth, linear function that would be easily captured by logistic regression analysis (but not necessarily by decision tree methods).
The cut points and probabilities displayed in Table 1 correspond to the tree-based selection models from Figure 3. Importantly, the values displayed for the cut points are percentiles of the splitting variable. For example, cut point c1 occurred at the 75th percentile of variable v in the one-split, 30% attrition condition, and at the 50th percentile (median) in the one-split, 50% attrition condition, and so on. We generated these cut points and probabilities of attrition by first hypothesizing tree
Figure 2. Template models used in Simulation A. (a) Population factor model (structural parameters: βXX = 0.8, βYY = 0.8, βYX = 0.5, βXY = 0.0). (b) Composite model used for actual analyses.
structures that might generate 30% or 50% attrition among the
uncorrelated covariates and then empirically adjusting these
values based on simulated pretests.
Secondary factors in the simulation. In addition to these
primary factors, we varied two secondary factors in this simula-
tion: the sample size, N, and the reliability of the covariates.
Sample size, N. In each simulation cell, we generated data sets of three different sizes, N = {100, 250, 500}.
Reliability of the indicator variables. Rogosa (1995) suggests that the more reliable variable is often chosen as a cause in cross-lagged models. In order to investigate this phenomenon in the present missing data context, we varied how reliable the X and Y measures were in our simulated data. In order to simulate different reliabilities among the X and Y variables, we set ρX = {.7, .9} and ρY = {.7, .9}. Fully crossing these factors resulted in four conditions: (a) ρX = ρY = .7; (b) ρX = .7 and ρY = .9; (c) ρX = .9 and ρY = .7; and (d) ρX = ρY = .9. We chose these values because ρ = .9 is generally acknowledged as a high degree of reliability, whereas ρ = .7 is generally acknowledged to be a minimum acceptable level of reliability. Thus, these conditions were designed to represent high and minimum acceptable reliability conditions, reflecting reliability levels typically reported in practice.
Figure 3. Tree structures used to generate attrition in the simulations: (a) one-split condition; (b) two-split
condition; (c) three-split condition.
We generated these reliabilities by varying the size of the uniquenesses in the template factor model (Figure 2a). All factor loadings for all indicators in the template model were set to λ = .8, and the values of the uniquenesses were calculated to return either .7 or .9 reliability among the indicators. Specifically, ψ was set equal to .82 when ρ = .7, and to .21 when ρ = .9. Note that reliabilities refer to indicators of all factors related to a given measure, such that when ρX = .9, the indicators of both X1 and X2 are set at a reliability of .9, and the same is true for indicators of both Y1 and Y2 when the reliability of Y is varied.
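The correspondence between these uniqueness values and the target reliabilities can be verified with the standard formula for the reliability of a unit-weighted composite of k = 3 congeneric indicators (true-score variance over total variance), under the stated λ = .8:

```python
def composite_reliability(lam=0.8, psi=0.82, k=3):
    # Reliability of a k-indicator unit-weighted composite:
    # squared sum of loadings over total composite variance.
    true_var = (k * lam) ** 2
    return true_var / (true_var + k * psi)

print(round(composite_reliability(psi=0.82), 2))  # 0.7
print(round(composite_reliability(psi=0.21), 2))  # 0.9
```

That is, (3 × .8)² / (5.76 + 3 × .82) ≈ .70 and 5.76 / (5.76 + 3 × .21) ≈ .90, matching the two target conditions.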
Models tested in the simulation.
Models applied to the simulated data. Because of the way we
simulated attrition, in which participants either returned or not at
Time 2, we were not able to include full information maximum
likelihood (FIML; Anderson, 1957; Arbuckle, 1996) among the
missing data estimators tested in this simulation. This is because
when data are missing only on the dependent (endogenous) vari-
ables, Mplus (Muthén & Muthén, 2011) automatically applies
listwise deletion to these cases.10
Therefore, we used only six missing data methods to analyze each simulated data set: listwise deletion, MI, and weights generated from (a) logistic regression, (b) CART, (c) CART with cost-complexity pruning, and (d) random forest analysis. We ran CART analyses using R package rpart (Therneau, Atkinson, & Ripley, 2014). In pruned CART conditions, we implemented cost-complexity pruning using the one-standard-error rule, which essentially recognizes that values falling within one standard error of the minimum risk are statistically equivalent and chooses the complexity parameter that produces the smallest, most parsimonious subtree falling within this range (see package documentation). Random forest analyses were conducted using package randomForest (Liaw & Wiener, 2002) with default settings. In passing, we note that these were the same packages and settings employed by Lee, Lessler, and Stuart (2010) in their simulations of machine learning techniques in the propensity-score matching context, although it is unclear whether these authors used the one-standard-error rule for pruning or simply the minimum cross-validated risk.
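The one-standard-error rule itself is simple to state in code. Given cross-validated risk estimates and their standard errors across a sequence of complexity parameters (larger values imply heavier pruning and smaller subtrees, as in rpart's cptable), it selects the most parsimonious model whose risk is within one SE of the minimum. The arrays below are invented for illustration:

```python
def one_se_choice(alphas, cv_risk, cv_se):
    # alphas sorted ascending: larger alpha = heavier pruning = smaller tree.
    i_min = min(range(len(cv_risk)), key=cv_risk.__getitem__)
    threshold = cv_risk[i_min] + cv_se[i_min]
    # All models statistically equivalent to the best one...
    within_one_se = [i for i, r in enumerate(cv_risk) if r <= threshold]
    # ...and, among them, the smallest (most pruned) subtree.
    return alphas[max(within_one_se)]

alphas = [0.00, 0.01, 0.02, 0.05]
cv_risk = [0.20, 0.21, 0.22, 0.40]
cv_se = [0.02, 0.02, 0.02, 0.02]
print(one_se_choice(alphas, cv_risk, cv_se))  # 0.02
```

Here the unpruned tree (alpha = 0.00) minimizes cross-validated risk, but two smaller subtrees fall within one SE of it, so the rule returns the smallest of those rather than the minimum-risk tree.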
In all models besides CART and pruned CART (in which we chose the first single tree and best nested subtree, respectively), we took an inclusive approach to choosing missing data covariates for analysis, as previous research has shown that including more covariates often tends to improve the results of missing data methods (Collins et al., 2001). For this reason, we included all missing data covariates in each of these models in order to enhance their performance as much as possible. That is, we modeled logistic regression weights and imputed data using all three covariates, v, z, and w. Similarly, we used the results of random forest analyses modeling all covariates to create probability weights. That is, although we recorded which variables were flagged as statistically significant and predictively important in the simulation, we utilized all covariates in all final missing data models, regardless of which were flagged in the selection analyses.
With each of these methods, we estimated the full cross-lagged regression model displayed in Figure 2b and assessed each method’s performance in recovering the true parameter values. All structural models were run in Mplus (Muthén & Muthén, 2011) via the MplusAutomation package in R (Hallquist & Wiley, 2014).
Overall design. Given these factors, the overall design of the simulation consisted of a fully crossed 4 (selection: linear, one split, two splits, three splits) × 2 (percent attrition: 30%, 50%) × 2 (ρY = .7, .9) × 2 (ρX = .7, .9) × 3 (N = 100, 250, 500) design, resulting in 96 unique simulation cells. Because each cell was resampled 200 times, this resulted in 96 × 200 = 19,200 simulated data sets.
Dependent measures assessed in the simulation.
Methods used to assess the selection model. In the first part of
the simulation, we tested the performance of several methods for
assessing the selection model: (a) t tests of missing versus non-
missing cases performed on each covariate, (b) logistic regression
analysis predicting the missing-data indicator from all covariates,
(c) CART analysis, (d) pruned CART analysis, and (e) random
forest analysis. The performance of these methods in determining
the true selection model was assessed using two methods: (a) by
recording which variables each selection analysis flagged as sta-
tisticallysignificantorpredictivelyimportant,and(b)byrecording
the classification accuracy returned by each method.
Selection variables flagged. To assess the accuracy of these techniques in recovering the true selection model, we captured whether or not each analysis flagged each covariate as (a) statistically significant (for t tests and logistic regression), (b) a split variable in a tree (for CART and pruned CART analyses), or (c) an important predictor in the random forest analysis, using the standardized classification accuracy measure available in the importance() function in the randomForest package (Liaw & Wiener, 2002). Additionally, we assessed effect size measures (e.g., Cohen’s d) for the t tests and variable importance for the CART models, respectively.
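As an illustrative analogue of that importance-based screening (the article used randomForest's importance() in R; here scikit-learn's impurity-based importances stand in, on invented data in which only v drives dropout):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # columns: v, z, w
returned = (X[:, 0] > 0).astype(int)  # return at Time 2 depends on v only

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, returned)
importance = rf.feature_importances_
# The true missing data correlate (v, column 0) should dominate.
print(importance.argmax())  # 0
```

In practice one would rank the covariates by importance and carry the high-scoring ones into the missing data model, just as the simulation does with its selection analyses.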
Classification accuracy. For the logistic regression, CART/pruning, and random forest analyses, we recorded the classification accuracy returned by each method. Because CART-based methods are prone to overfitting, we hypothesized that these methods would likely return higher classification accuracy rates than all other methods, regardless of selection model. Conversely, because random forest analysis is designed to undercut CART’s tendency to overfit, we hypothesized that this method might be likely to
10 See this Mplus discussion board thread for more information: http://www.statmodel2.com/discussion/messages/22/24.html?1380292912
Table 1
Simulation Parameters for Tree-Based Selection Models

                  One split      Two splits     Three splits
Percent missing:  30%    50%     30%    50%     30%    50%

Cut points
  c1              .75    .50     .80    .60     .40    .50
  c2              —      —       .15    .30     .65    .40
  c3              —      —       —      —       .70    .60
P(Return)
  p1              .87    .70     .40    .30     .88    .75
  p2              .20    .30     .89    .85     .40    .30
  p3              —      —       .60    .25     .90    .70
  p4              —      —       —      —       .30    .25

Note. Cut point values represent percentiles (quantiles) of the observed covariates.
return lower classification accuracy, regardless of selection model
(and regardless of random forest’s actual performance).
Dependent measures used to assess the performance of missing data estimators. To assess the performance of missing data techniques in recovering model parameters, we used four primary measures taken from the prior simulation literature on incomplete data (Enders, 2001; Enders & Bandalos, 2001): percent bias, mean squared error (MSE), efficiency, and statistical rejection rates.
Percent bias. Consistent with Enders’s work (Enders, 2001; Enders & Bandalos, 2001), percent bias was measured using the formula

%Bias = [(θ̂ij − θi) / θi] × 100,   (3)

where θ̂ij indicates the value of the estimated statistic on the jth iteration and θi indicates the true population parameter. The overall bias for a given parameter in a given simulation cell is the average percent bias across the j iterations. Following Muthén and colleagues (Muthén, Kaplan, & Hollis, 1987), values greater than 15% are considered problematic.
Efficiency. Efficiency was simply computed as the empirical
standard deviation of the estimates of each model parameter for
each analysis method across the simulated iterations in each sim-
ulation cell.
MSE. In contrast to percent bias, MSE is simply computed as
the average squared difference of each estimate from the corre-
sponding parameter. As noted by others (Collins et al., 2001;
Enders, 2001; Enders & Bandalos, 2001), MSE incorporates both
bias and efficiency, making it a rough proxy for the “overall
accuracy” of a given method.
Statistical rejection rates. Finally, for all parameters, statisti-
cal rejection rates were recorded in the simulation.
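Under these definitions, the first three measures can be computed directly from the vector of estimates in each cell. A small illustration with invented estimates of a parameter whose true value is .5:

```python
from statistics import mean, pstdev

def percent_bias(estimates, theta):
    # Average signed deviation from the true parameter, in percent.
    return mean((est - theta) / theta * 100 for est in estimates)

def efficiency(estimates):
    # Empirical standard deviation of the estimates across iterations.
    return pstdev(estimates)

def mse(estimates, theta):
    # Mean squared error: combines bias and (in)efficiency.
    return mean((est - theta) ** 2 for est in estimates)

estimates = [0.52, 0.48, 0.55, 0.45]   # invented estimates; theta = .5
print(percent_bias(estimates, 0.5))    # ~0 (unbiased on average)
print(round(mse(estimates, 0.5), 5))   # 0.00145
```

Note that these estimates are unbiased on average yet still incur nonzero MSE, which is exactly why MSE serves as the rough "overall accuracy" summary described above.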
Simulation Results
Selection variables flagged. Based on the selection models included in this simulation, a given technique is considered accurate to the extent that, on average, it (a) flags variable v, but not variables z and w, as an important missing data covariate in the linear and one-split conditions; (b) flags variables v and z, but not w, as important missing data covariates in the two-split conditions; and (c) flags all three covariates as important predictors of missing data in the three-split conditions.
Table 2 displays results for missing data t tests, logistic regression analysis, and CART-based methods. For the t tests and logistic regression, the table displays the average rate at which each variable was flagged as a significant predictor of incompleteness. For the CART methods, the table indicates the average rate at which each variable was included as a split variable in the chosen tree.

Overall, the results are encouraging for all selection model assessment methods. In general, all methods flagged variable v as more important than variables z and w in the linear and one-split models, although the rates for t tests and logistic regression improved with increased sample size in the linear condition. Similarly, all methods seemed to flag variables v and z, but not w, as important missing data correlates in the two-split conditions. Once again, however, the rates at which t tests and logistic regressions flagged w as significant improved with increased sample size. Finally, all methods performed well in identifying the roles of z and w in the three-split conditions, but the t tests and logistic analyses performed poorly in recognizing the role of variable v in this selection model.
The CART and pruning methods performed consistently better
than t tests and logistic regressions in all but the linear selection
model conditions. Whereas CART tended to overfit the data and
flag all three variables as predictors of attrition in the linear
selection model conditions, implementing pruning seemed to curb
this tendency, cutting rates of falsely identifying z and w as
important split variables down by roughly a half for all sample
sizes. Pruning also substantially reduced the rates of flagging
incorrect variables as split variables (e.g., flagging w in the two-
split condition or z in the one-split condition) in virtually all other
conditions. One exception to this trend is that pruning did not
Table 2
Selection Model Identification by Sample Size and Selection Model, Simulation A

              Covariate flagged as significant (rejection rate)    Covariate used in tree
              t tests               Logistic              CART                   CART + Prune
              v     z     w         v     z     w         v      z      w        v      z      w
N = 100
Linear        .612  .041  .052      .611  .054  .042      .914   .710   .698     .876   .396   .387
One split     .944  .052  .058      .943  .049  .047      .997   .428   .408     .989   .065   .063
Two splits    .892  .539  .056      .926  .049  .620      .991   .903   .219     .964   .699   .055
Three splits  .044  .371  .593      .046  .616  .393      .848   .796   .943     .576   .597   .828
N = 250
Linear        .786  .050  .054      .783  .064  .053      .996   .924   .912     .970   .482   .483
One split     .999  .051  .048      .999  .047  .041      1.000  .545   .539     1.000  .033   .034
Two splits    1.000 .922  .044      1.000 .052  .958      1.000  .997   .518     .993   .838   .081
Three splits  .048  .734  .884      .054  .896  .777      .999   .996   1.000    .902   .804   .964
N = 500
Linear        .929  .053  .052      .928  .048  .054      .978   .811   .816     .966   .477   .482
One split     1.000 .053  .060      1.000 .054  .042      1.000  .306   .316     1.000  .018   .018
Two splits    1.000 .996  .044      1.000 .044  .999      1.000  1.000  .484     .999   .911   .070
Three splits  .076  .959  .989      .078  .991  .972      1.000  1.000  1.000    .993   .901   .998

Note. CART = classification and regression trees; Prune = pruned CART analysis. Table entries indicate the percentage of simulated iterations that each variable was either flagged as statistically significant (t tests, logistic regression) or included as a split variable (CART, CART + Prune).
perform quite as well as single-tree CART in flagging all three covariates in the three-split, N = 100 conditions.
Statistical significance of t tests and logistic regression coefficients and inclusion as split variables in tree models are not the only criteria for flagging important predictors. Alternative approaches include examining effect size estimates and extracting variable importance measures. One potential benefit of these approaches is their potential to obviate some of the sample-size dependence found for t tests and logistic regression when decisions were based purely on statistical significance. To examine the merits of these alternative approaches, Table 3 displays mean values of Cohen’s d, McFadden’s pseudo-R2 (as an effect size measure of the overall logistic regression model), and variable importance for CART, pruned CART, and random forest analyses.
Examining Table 3, we see that the true missing data predictor(s) reliably showed much larger effect sizes than the other covariates. Based on these effect size measures, which covariates would an analyst likely include in the missing data model? If anything, these results suggest that use of Cohen’s d rather than statistical significance may result in inclusion of all the covariates. Based on standard cutoffs (e.g., flag any covariate with a |d| > .10), all covariates would be included, on average, in every condition except for the N = 500 cells. However, past research indicates that adopting an inclusive covariate selection strategy is generally not harmful and, in fact, often carries many benefits (Collins et al., 2001).
Interestingly, McFadden’s pseudo-R2 was small across all logistic regression models, despite the inclusion of the true predictor(s) in each case. Here again, one would be well-advised to use an inclusive approach to predictor selection, rather than dismissing the logistic regression results because of the small overall R2.
Additionally, variable importance measures proved to be a sound alternative method for covariate selection when using tree-based methods. In each case, the average importance of the true predictor(s) was much larger than the average importance of the other covariates. This was especially true for the random forest method, in which only the true predictor(s) received high importance scores and the other covariates’ scores were uniformly near zero.
Classification accuracy. As expected, CART and pruning methods consistently returned higher classification accuracy values than logistic regression or random forest analysis across all
selection models and missing data percentages. These results in-
dicate that classification accuracy measures lack diagnostic value
in identifying the true selection model. Based on these results, one
cannot claim that “the CART model had a higher classification
accuracy than the logistic regression analysis; therefore, the true
selection model is most likely a tree.” As we suspected, this
measure says more about the classifier used (in particular, its
tendency to overfit the data or not) than the data being classified.
To conserve space, further details concerning this result are omit-
ted here. However, a full table of classification accuracy rates is
presented in Online Supplement B.
Percent bias. Surprisingly, given the complexity of the selection models employed in the simulation and the high percentages of missing data induced, percent bias was low, overall, among the majority of the parameter estimates. Tables of percent bias relative to the population structural parameters are included in Online Supplement B. To summarize, the most notable results from these tables are as follows: (a) In general, the regression coefficients show negligible bias; and (b) the most bias, however, is observed in the estimate of the Y2 residual.
To provide a broad illustration of these results, Figure 4 displays the marginal means of percent bias for each method under each selection model, aggregated across parameters. Although this figure loses information by collapsing over the different parameters, it is evident that all of the missing data methods display low amounts of bias in all conditions (including, unfortunately, listwise deletion).
Table 3
Effect Size Measures and Variable Importance by Sample Size and Selection Model, Simulation A

              Mean effect size measures            Mean variable importance
              t test |Cohen’s d|    Logistic       CART                      CART + Prune              RF
              v     z     w         pseudo-R2      v      z      w           v      z      w           v      z      w
N = 100
Linear        .563  .167  .170      .086           8.199  3.656  3.703       6.871  2.706  2.722       7.391  −.106  −.112
One split     .940  .175  .173      .177           15.224 3.431  3.257       13.903 1.729  1.687       23.271 .024   −.054
Two splits    .732  .470  .169      .155           11.125 8.117  2.065       10.667 7.485  1.423       20.075 11.710 −.147
Three splits  .166  .366  .508      .093           5.502  5.662  7.591       4.335  4.601  6.613       6.067  6.533  11.192
N = 250
Linear        .548  .105  .108      .070           18.810 8.242  7.850       14.412 5.141  5.007       12.175 −.055  −.037
One split     .927  .107  .105      .155           35.932 6.077  5.835       32.604 1.965  1.969       40.284 .157   −.141
Two splits    .731  .473  .105      .141           26.257 19.352 3.301       24.228 17.860 1.609       36.120 22.149 −.073
Three splits  .106  .354  .496      .077           15.179 13.580 17.005      11.488 10.093 14.539      13.759 12.736 20.681
N = 500
Linear        .546  .075  .074      .065           28.434 8.512  8.618       23.614 5.960  5.998       17.546 −.010  .052
One split     .920  .075  .078      .147           65.234 4.773  4.850       63.561 2.485  2.499       58.864 .011   −.010
Two splits    .737  .467  .074      .137           48.704 35.515 3.732       46.384 33.936 1.609       53.898 33.001 −.045
Three splits  .084  .355  .503      .074           26.227 22.838 29.152      21.728 17.958 26.093      22.843 20.421 31.863

Note. Cohen’s d indicates the mean of the absolute d values across simulated iterations. Random forest variable importance calculated using classification accuracy. CART = classification and regression trees; Prune = pruned CART analysis; RF = random forests.
Relative efficiency. Table 4 presents the relative efficiency of each missing data estimator compared with pruned CART analysis. This ratio is formed by taking the empirical standard deviation of the estimates returned by measure X across the simulated iterations and dividing it by the empirical standard deviation of the pruned CART estimates, that is, SD_MethodX / SD_Prune (cf. Enders & Bandalos, 2001). Using this metric, values > 1 indicate instances in which a given method is less efficient than pruned CART (i.e., pruned CART is more efficient), whereas values < 1 indicate the opposite. Because these results were similar across conditions, we present results for the ρ_X = ρ_Y = .9, N = 500 cells only. Further, we average results across parameters, because efficiency, unlike bias, did not seem to be parameter-specific. In general, pruned CART was more efficient than either single-tree CART or random forest analysis. Listwise deletion and MI, however, consistently outperformed pruning in terms of efficiency. When comparing the efficiency of pruned CART with logistic regression-weighted analyses, the results were more mixed. In the N = 100 conditions, logistic weighting was more efficient across the board, whereas pruning was more efficient in several cells of the N = 250 and N = 500 conditions, especially when the selection models were trees and the percent attrition was high. In general, however, these methods (logistic and pruned weights) displayed similar degrees of efficiency. To the extent that these ratios rose above or sank below 1, it was rarely far, indicating (perhaps surprisingly) that these methods often do not represent a significant tradeoff in terms of efficiency.
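The relative-efficiency ratio described above is straightforward to compute. The article's simulations were run in R and Mplus; the Python fragment below, with made-up estimates, is purely illustrative of the metric itself:

```python
import numpy as np

def relative_efficiency(estimates_x, estimates_prune):
    """SD of method X's estimates across simulated iterations divided by
    the SD of the pruned-CART estimates (SD_MethodX / SD_Prune).
    Values > 1 mean method X is LESS efficient than pruned CART."""
    return np.std(estimates_x, ddof=1) / np.std(estimates_prune, ddof=1)

# Illustrative (made-up) estimates of one parameter across 1,000 iterations
rng = np.random.default_rng(1)
est_listwise = rng.normal(0.4, 0.11, size=1000)  # slightly noisier estimator
est_prune = rng.normal(0.4, 0.10, size=1000)

ratio = relative_efficiency(est_listwise, est_prune)
```

Averaging such ratios across model parameters, as done for Table 4, gives the tabled values.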
MSE ratios. Table 5 displays the ratio of each estimator's MSE over that of pruned CART (cf. Enders, 2001) for each parameter estimated in the model. In this case, values > 1 indicate that pruned CART was more accurate than the comparison method, whereas values < 1 indicate that pruned CART was less accurate than the comparison method. Once again, because of similarity in results across conditions, we display results for the ρ_X = ρ_Y = .9, N = 500 cells here. One result that is worth highlighting is the superior performance of pruned CART relative to random forest analyses on virtually all measures. The performance of pruned CART compared with listwise deletion and logistic regression is more mixed. Pruned CART does appear to have an advantage over logistic regression weights in some conditions, particularly when the percentage of attrition is 50%. Finally, MI performed particularly well here. Although many of the ratios were close to 1, indicating near-identical performance to pruned CART, a few of the ratios were substantially smaller (between .6 and .8), indicating an overall advantage of MI in these conditions.
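The MSE ratio combines variance and squared bias in a single accuracy measure. A minimal sketch of the computation (illustrative only; the article's own code was in R):

```python
import numpy as np

def mse(estimates, truth):
    """Mean squared error of simulated estimates around the true
    parameter value: equivalently, variance plus squared bias."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean((estimates - truth) ** 2)

def mse_ratio(estimates_x, estimates_prune, truth):
    """MSE_MeasureX / MSE_Prune (cf. Enders, 2001). Values > 1 favor
    pruned CART; values < 1 favor the comparison method."""
    return mse(estimates_x, truth) / mse(estimates_prune, truth)
```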
Statistical rejection rates for β_X2Y1. Finally, Table 6 presents statistical rejection rates of β_X2Y1 by sample size, selection, and percent attrition. We display β_X2Y1 rather than the β_Y2X1 cross-lagged path because β_Y2X1 was (correctly) flagged as significant nearly 100% of the time in all conditions. Interestingly, the reliability of X and Y exerted negligible effects on rejection rates, and this factor is therefore not discussed further. MI exhibited slightly higher rejection rates (>.1) under 50% attrition in the N = 100 and N = 250 conditions.
Figure 4. Marginal means of percent bias for parameters in Simulation A.
Table 4
Average Relative Efficiency Across Parameters, Simulation A

| Selection model | Listwise 30% | Listwise 50% | Log 30% | Log 50% | CART 30% | CART 50% | RF 30% | RF 50% | MI 30% | MI 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| N = 100: Linear | .955 | .886 | .954 | .958 | 1.006 | 1.053 | .994 | 1.075 | .942 | .885 |
| N = 100: One split | .788 | .896 | .893 | .959 | 1.010 | 1.095 | 1.001 | 1.197 | .790 | .887 |
| N = 100: Two splits | .880 | .821 | .931 | .982 | 1.016 | 1.049 | 1.032 | 1.168 | .887 | .822 |
| N = 100: Three splits | .893 | .884 | .935 | .943 | 1.018 | 1.073 | 1.006 | 1.081 | .900 | .881 |
| N = 250: Linear | 1.006 | .970 | 1.009 | 1.074 | 1.085 | 1.179 | 1.057 | 1.233 | 1.009 | .971 |
| N = 250: One split | .895 | .971 | 1.069 | 1.072 | 1.152 | 1.167 | 1.331 | 1.348 | .896 | .976 |
| N = 250: Two splits | .902 | .902 | .978 | 1.103 | 1.121 | 1.131 | 1.227 | 1.395 | .910 | .911 |
| N = 250: Three splits | .945 | .957 | 1.009 | .998 | 1.135 | 1.171 | 1.102 | 1.191 | .948 | .946 |
| N = 500: Linear | .940 | .917 | .948 | 1.036 | 1.010 | 1.051 | 1.012 | 1.281 | .937 | .933 |
| N = 500: One split | .816 | .932 | 1.005 | 1.037 | 1.007 | 1.019 | 1.360 | 1.327 | .828 | .937 |
| N = 500: Two splits | .883 | .858 | .955 | 1.042 | 1.052 | 1.019 | 1.208 | 1.507 | .89 | .872 |
| N = 500: Three splits | .867 | .873 | .909 | .899 | 1.040 | 1.009 | 1.076 | 1.276 | .868 | .883 |

Note. Relative efficiency is computed as the efficiency of measure X over the efficiency of pruned CART. Log = logistic regression; CART = classification and regression trees; RF = random forests; MI = multiple imputation.
MISSING DATA WITH CART
Discussion

Several important conclusions can be drawn from the present study. In brief, (a) all methods performed admirably in correctly identifying the population selection model, but CART, pruned CART, and random forest analyses were especially strong; (b) classification accuracy was not especially useful in discriminating between selection models; and (c) of all methods considered here, pruned CART and MI performed extremely well. Perhaps surprisingly to many readers, pruned CART outperformed traditional CART and random forest analysis in terms of both MSE and efficiency.

This study was not without limitations. One troubling fact was the low amount of bias observed for nearly all parameters. This may be indicative of the resilience of regression coefficients, specifically, to missing data in this type of cross-lagged model under these conditions. Nonetheless, the lack of bias observed in many of the parameters assessed in the present simulation undercuts any claims we could make about the benefits of these methods for alleviating bias in the present scenario.

In light of these results, we wondered whether the relatively strong performance of pruned CART over random forest analysis would replicate when estimating different parameters from the regression coefficients modeled in this study. Perhaps point estimates of means and variances would be more affected by the attrition induced by these selection models, and, if so, this could alter the observed pattern of results. We reasoned that even if the direction and strength of a straight regression line proved resilient to missing data, the specific values of item means and variances may not be. This intuition is in line with the results found by Collins et al. (2001), who noted that the effects of their simulated missing data on estimated regression coefficients "appear[ed] surprisingly robust in many circumstances," whereas "the situation [was] different for the estimate of [the mean under missing data], which was affected...in every condition" (p. 341). These authors further noted that variances in their simulations were most affected by their nonlinear selection model, leading us to believe that the same might be true for variable variances under our nonlinear tree-based selection models. Therefore, we decided to conduct a smaller scale simulation to follow up on these lingering questions.

Table 5
MSE Ratios, Simulation A

| Parameter: model | Listwise 30% | Listwise 50% | Log 30% | Log 50% | CART 30% | CART 50% | RF 30% | RF 50% | MI 30% | MI 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| B_X2Y1: Linear | .869 | .907 | .898 | 1.101 | 1.007 | 1.104 | 1.027 | 1.500 | .880 | .922 |
| B_X2Y1: One split | .695 | .837 | 1.198 | 1.011 | 1.006 | .968 | 1.945 | 1.738 | .682 | .863 |
| B_X2Y1: Two splits | .803 | .781 | .958 | 1.132 | 1.087 | 1.071 | 1.660 | 2.203 | .762 | .832 |
| B_X2Y1: Three splits | .752 | .777 | .831 | .816 | 1.081 | 1.021 | 1.124 | 1.574 | .754 | .810 |
| B_Y2X1: Linear | .999 | .944 | .975 | .979 | 1.015 | .998 | .95 | 1.228 | 1.012 | .860 |
| B_Y2X1: One split | 1.054 | 1.043 | 1.021 | .953 | .994 | 1.034 | 1.162 | 1.250 | .897 | .939 |
| B_Y2X1: Two splits | 1.201 | 1.232 | 1.249 | 1.234 | 1.063 | 1.023 | 1.120 | 1.236 | .936 | .932 |
| B_Y2X1: Three splits | 1.117 | .992 | 1.009 | .986 | 1.039 | 1.002 | 1.096 | 1.372 | .985 | .892 |
| B_X2X1: Linear | 1.004 | .971 | 1.006 | .996 | 1.013 | 1.019 | 1.004 | 1.276 | 1.020 | .985 |
| B_X2X1: One split | 1.004 | .977 | 1.036 | .967 | 1.012 | 1.008 | 1.287 | 1.100 | .933 | 1.003 |
| B_X2X1: Two splits | 1.108 | 1.065 | 1.129 | 1.077 | 1.062 | 1.028 | .989 | 1.266 | 1.007 | .973 |
| B_X2X1: Three splits | 1.039 | .972 | .986 | .968 | .993 | .981 | 1.012 | 1.070 | .956 | .994 |
| B_Y2Y1: Linear | 1.009 | 1.022 | .994 | 1.013 | 1.002 | 1.008 | 1.001 | 1.096 | .984 | .957 |
| B_Y2Y1: One split | 1.076 | .987 | 1.035 | .972 | .995 | 1.020 | 1.070 | 1.116 | .938 | .947 |
| B_Y2Y1: Two splits | 1.151 | 1.202 | 1.156 | 1.198 | .996 | 1.020 | .969 | 1.243 | .969 | .966 |
| B_Y2Y1: Three splits | 1.090 | 1.014 | 1.018 | .967 | 1.018 | .999 | 1.017 | 1.225 | .989 | .971 |
| r_Y2X2: Linear | .983 | .963 | .997 | 1.042 | .993 | 1.059 | 1.017 | 1.340 | .961 | .987 |
| r_Y2X2: One split | .861 | .984 | .951 | 1.056 | 1.039 | 1.007 | 1.209 | 1.295 | .899 | .995 |
| r_Y2X2: Two splits | .904 | .869 | .926 | 1.002 | 1.042 | .986 | 1.138 | 1.266 | .937 | .924 |
| r_Y2X2: Three splits | .926 | .898 | .977 | .943 | 1.013 | .991 | 1.084 | 1.185 | .936 | .902 |
| Resid (X2): Linear | .902 | .901 | .917 | 1.047 | 1.025 | 1.043 | 1.016 | 1.538 | .944 | .843 |
| Resid (X2): One split | .733 | .921 | .930 | 1.092 | 1.060 | 1.019 | 1.847 | 1.470 | .697 | .855 |
| Resid (X2): Two splits | .850 | .697 | .946 | .961 | 1.062 | 1.034 | 1.332 | 2.128 | .807 | .684 |
| Resid (X2): Three splits | .840 | .754 | .890 | .751 | 1.051 | 1.068 | 1.131 | 1.367 | .788 | .746 |
| Resid (Y2): Linear | 1.004 | .998 | 1.006 | 1.009 | 1.001 | 1.010 | 1.014 | 1.005 | 1.018 | 1.059 |
| Resid (Y2): One split | .962 | .979 | .965 | 1.018 | .999 | 1.001 | 1.008 | .963 | .994 | 1.035 |
| Resid (Y2): Two splits | 1.018 | .959 | .996 | .963 | 1.005 | .996 | .975 | 1.044 | 1.054 | 1.030 |
| Resid (Y2): Three splits | .985 | .964 | .990 | .970 | .99 | .999 | 1.000 | 1.030 | 1.004 | 1.022 |

Note. ρ_X = ρ_Y = .9, N = 500. MSE ratio is computed as MSE_MeasureX / MSE_Prune. MSE = mean squared error; Log = logistic regression; CART = classification and regression trees; RF = random forests; MI = multiple imputation.
Simulation B

Simulation B extended the logic of Simulation A to a different scenario: Rather than estimating regression coefficients in a path model, we sought to estimate point estimates of the sample statistics at Time 2. In so doing, we extend our results from a model-based framework, in which the missing data estimators are employed to estimate a particular structural model, to a model-free framework. This is reminiscent of large-scale survey research, for which researchers might apply imputation or weighting methods to adjust the estimates of item means, variances, and covariances.

Simulation Design

The design of Simulation B was identical to the previous simulation, with three important changes. First, in this smaller scale simulation, we did not vary the reliability of the indicators, as this factor did not seem to interact with selection or percent attrition in the prior study and was ultimately tangential to our present focus. Instead of simulating the structural model of Figure 2a, then, we directly simulated the covariance structure corresponding to the path model in Figure 2b. Despite the fact that we did not intend to fit a structural model to this data set, we used the expected covariance structure from this model to generate the same correlation structure as the prior simulation (i.e., r_COV,X2 = .32, r_COV,Y2 = .52, as before).

Second, instead of setting the means equal to zero, we employed an expected mean vector that set the means of X equal to 0.5 at both time points, and the means of Y equal to 1 at both time points. After generating structural model expectations, this resulted in expected means of X̄2 = 0.90 and Ȳ2 = 2.05. More important than the specific parameter estimates was the fact that these nonzero values were now more easily amenable to the standardized percent bias measures to be employed in the study (inasmuch as division by 0 was no longer an issue). Additionally, we again set var(X2) = var(Y2) = 1, and, once again, the observed correlation between X2 and Y2 was set to 0.4.

Finally, rather than sending models to Mplus, we estimated all sample statistics in R. We estimated the weighted statistics using the weighted.mean and cov.wt functions. Additionally, we conducted MI using the mice package in R (van Buuren & Groothuis-Oudshoorn, 2011) with default settings. Imputed means, variances, and covariances were computed as the arithmetic mean of the estimates from five imputed data sets.
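For readers outside R, the weighted statistics just described (R's weighted.mean and the default "unbiased" estimate from cov.wt) and the arithmetic-mean pooling of the imputed estimates can be mimicked as follows. This is an illustrative sketch, not the study's actual code:

```python
import numpy as np

def weighted_mean(x, w):
    """Weighted mean, analogous to R's weighted.mean(x, w)."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return float(np.sum(w * x) / np.sum(w))

def weighted_cov(x, y, w):
    """Weighted covariance of x and y, analogous to the default
    ("unbiased") estimate from R's cov.wt: weights are normalized to
    sum to 1 and the denominator is 1 - sum(w_i^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.asarray(w, float) / np.sum(w)
    mx, my = weighted_mean(x, w), weighted_mean(y, w)
    return float(np.sum(w * (x - mx) * (y - my)) / (1.0 - np.sum(w ** 2)))

def pool_mi(estimates):
    """Pool point estimates across imputed data sets by their
    arithmetic mean, as done here for the five mice imputations."""
    return float(np.mean(estimates))
```

With equal weights, weighted_cov reduces to the usual unbiased sample covariance.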
Results

The effect of sample size on MI estimates. In general, the methods employed here were robust to differences in sample size, and therefore only tables from the N = 500 conditions are displayed. One exception is worth mentioning, however. MI proved to be the one estimator that improved steadily from the N = 100 condition to the N = 250 and N = 500 conditions. Although MI displayed minimal bias in estimating the means of X2 and Y2 across sample sizes, the estimates of var(Y2), and especially cov(X2, Y2), dramatically improved in the N = 250 condition compared with the N = 100 condition, with the smallest amount of bias observed in the N = 500 cells. In the interest of space, tables of percent bias can be found in Online Supplement C.

Percent bias. Like Figure 4, Figure 5 displays the marginal means of bias across parameters in Simulation B. This figure illustrates several key points about the overall trends in the data. Most importantly, in this simulation, the tree-based selection models succeeded in introducing a greater amount of bias in listwise estimates than in Simulation A. This circumvents the problematic low-bias "floor" effects observed in the prior study. Here, the beneficial effects of the missing data estimators are in clear evidence: Listwise methods display greater bias across all conditions, and all missing data estimators substantially reduce this bias.
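The standardized percent bias measure used throughout has a simple form; a minimal sketch (argument names hypothetical):

```python
import numpy as np

def percent_bias(estimates, truth):
    """Mean deviation of simulated estimates from the true parameter,
    expressed as a percentage of the true value. A nonzero `truth` is
    required -- this is why Simulation B moved the population means
    away from 0."""
    return float(100.0 * (np.mean(estimates) - truth) / truth)
```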
Table 6
Statistical Rejection Rates for B_X2Y1, Simulation A

| Selection model | Full 30% | Full 50% | Listwise 30% | Listwise 50% | Log 30% | Log 50% | CART 30% | CART 50% | Prune 30% | Prune 50% | RF 30% | RF 50% | MI 30% | MI 50% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N = 100: Linear | .045 | .055 | .060 | .070 | .055 | .095 | .040 | .070 | .050 | .080 | .050 | .075 | .040 | .115 |
| N = 100: One split | .035 | .050 | .035 | .050 | .055 | .085 | .035 | .060 | .070 | .075 | .075 | .095 | .060 | .120 |
| N = 100: Two splits | .095 | .045 | .045 | .040 | .075 | .100 | .060 | .055 | .090 | .140 | .085 | .140 | .110 | .135 |
| N = 100: Three splits | .095 | .045 | .095 | .055 | .095 | .095 | .095 | .050 | .075 | .105 | .075 | .115 | .095 | .170 |
| N = 250: Linear | .045 | .035 | .025 | .050 | .040 | .050 | .045 | .060 | .050 | .050 | .050 | .075 | .040 | .075 |
| N = 250: One split | .055 | .075 | .055 | .065 | .100 | .085 | .080 | .085 | .095 | .100 | .090 | .105 | .110 | .105 |
| N = 250: Two splits | .060 | .060 | .040 | .035 | .055 | .065 | .050 | .060 | .080 | .065 | .100 | .075 | .095 | .125 |
| N = 250: Three splits | .075 | .035 | .085 | .075 | .075 | .085 | .060 | .065 | .110 | .085 | .095 | .105 | .085 | .105 |
| N = 500: Linear | .025 | .055 | .045 | .040 | .050 | .050 | .055 | .055 | .055 | .045 | .055 | .055 | .055 | .055 |
| N = 500: One split | .050 | .075 | .030 | .040 | .065 | .065 | .060 | .045 | .050 | .065 | .050 | .050 | .070 | .065 |
| N = 500: Two splits | .070 | .055 | .090 | .050 | .085 | .045 | .085 | .065 | .085 | .050 | .095 | .055 | .095 | .055 |
| N = 500: Three splits | .035 | .060 | .060 | .060 | .070 | .055 | .060 | .090 | .055 | .070 | .065 | .065 | .055 | .085 |

Note. Log = logistic regression; CART = classification and regression trees; Prune = pruned CART analysis; RF = random forests; MI = multiple imputation.
To examine these results in greater detail, Table 7 displays the percent bias for the N = 500 conditions. Once again, random forest analyses perform well here, but often not quite as well as pruned CART, which tends (though not always) to display lower bias. MI shows similarly strong results in the N = 500 cells (but see the note concerning smaller sample sizes in the previous section, "The effect of sample size on MI estimates"). Pruned CART performs a bit better in many of the var(Y2) and cov(X2, Y2) cells. In the end, though, all of these missing data methods undercut the bias observed among the listwise estimates, and, indeed, these differences largely represent the difference between "strong" and "stronger," rather than the difference between "strong" and "weak" performances.

Relative efficiency. Table 8 displays the results for relative efficiency, once again comparing each method to pruned CART. Here again, pruned CART outperforms random forest analysis and single-tree CART in virtually all cells. The comparisons with logistic regression weights are more thoroughly mixed, with logistic estimates displaying greater efficiency in many cases. Consistent with Simulation A, these results are rarely severe in their discrepancies: In a few cells, pruned CART is substantially more efficient, whereas in a few other cells, the reverse is true. On balance, however, these methods are often in the same range, with values between .90 and 1.10 abounding.

Listwise estimates display greater efficiency than pruned CART in all cells but one. The benefit of this efficiency is diminished, of course, by the fact that these parameter estimates were, in general, biased when compared with those returned by other methods. Finally, MI displays the greatest efficiency of all methods. This is similar to the results of Simulation A.
MSE ratios. Finally, Table 9 displays the MSE ratios of each estimator compared with pruned CART. Several results are worth noting. First, the superior performance of pruned CART relative to random forest analysis is not only evident here but even stronger than it appeared in Simulation A. Second, pruned CART once again outperforms single-tree CART in the majority of cells, although there are exceptions to this rule. Third, once again, the comparisons with logistic regression weights are mixed. Logistic weights seem to do a particularly good job of recovering the item means, whereas pruned CART seems to excel in recovering the variances and covariance of the two variables, particularly under 50% attrition. Fourth, pruned CART outperforms listwise deletion in the majority of cells. When listwise appears superior, we can surmise that this is likely because of the greater efficiency of the estimates. However, in light of the bias displayed in the listwise estimates, it would be ill-advised to dub listwise methods "more accurate" in these cells. Finally, MI, as instantiated by the mice package, excels in all simulation cells. As we have seen, this is likely aided by the method's high efficiency.

Table 7
Percent Bias, N = 500, Simulation B

| Parameter: model | Listwise 30% | Listwise 50% | Log 30% | Log 50% | CART 30% | CART 50% | Prune 30% | Prune 50% | RF 30% | RF 50% | MI 30% | MI 50% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X̄2: Linear | -3.539 | 13.265 | -.561 | -.300 | -1.206 | 1.504 | -1.277 | 3.017 | -.268 | -2.303 | -.602 | .549 |
| X̄2: One split | -11.009 | -11.891 | -.066 | 1.011 | -.228 | -.056 | -.196 | -.148 | 4.392 | 3.425 | -.466 | -.723 |
| X̄2: Two splits | -2.615 | -4.019 | -.524 | -.675 | .466 | -.450 | .902 | -.377 | 1.752 | 1.151 | -.075 | -.416 |
| X̄2: Three splits | -9.912 | -12.207 | -.781 | -.836 | -.067 | -.204 | -.019 | -.686 | 1.328 | 1.498 | -.429 | -1.130 |
| Ȳ2: Linear | -2.029 | 9.398 | .117 | .073 | -.248 | 1.113 | -.276 | 2.176 | .431 | -1.748 | .050 | .351 |
| Ȳ2: One split | -7.623 | -8.479 | .501 | .878 | .324 | -.137 | .365 | -.212 | 3.427 | 2.248 | -.291 | -.371 |
| Ȳ2: Two splits | -1.598 | -2.848 | -.166 | -.102 | .411 | -.291 | .848 | -.232 | 1.555 | 1.384 | .188 | -.226 |
| Ȳ2: Three splits | -6.520 | -8.206 | .027 | .100 | .393 | .226 | .421 | -.108 | 1.322 | 1.821 | .081 | -.325 |
| var(X2): Linear | -1.399 | -1.511 | -1.201 | .163 | -1.306 | .025 | -1.391 | -.415 | -.449 | -.760 | -1.694 | -1.706 |
| var(X2): One split | -3.089 | -1.447 | -.135 | .660 | -.307 | .227 | -.488 | -.097 | .496 | -.257 | -1.070 | -1.998 |
| var(X2): Two splits | -3.605 | -4.119 | -3.287 | -2.269 | .217 | 1.774 | .100 | 1.877 | .763 | 2.117 | -.376 | -1.016 |
| var(X2): Three splits | -2.389 | -.929 | -.341 | .081 | .710 | -.298 | .765 | -.665 | 1.086 | -.832 | -.598 | -1.057 |
| var(Y2): Linear | -1.528 | -4.661 | -1.059 | -.999 | -.758 | -2.315 | -.572 | -3.221 | .904 | -3.285 | -2.132 | -3.859 |
| var(Y2): One split | -8.056 | -4.375 | 2.089 | 1.924 | .766 | -1.429 | .848 | -1.383 | .799 | -2.195 | -2.469 | -3.490 |
| var(Y2): Two splits | -10.246 | -12.876 | -9.938 | -9.393 | .827 | .327 | -.005 | 1.113 | 2.925 | -1.016 | -1.796 | -3.497 |
| var(Y2): Three splits | -6.431 | -2.683 | -1.089 | .974 | 1.053 | .362 | 1.470 | -.183 | 2.164 | .956 | -1.693 | -2.439 |
| cov(X2, Y2): Linear | -1.257 | -7.225 | -.426 | .079 | .039 | -1.873 | .172 | -3.549 | 2.414 | -2.223 | -2.283 | -5.073 |
| cov(X2, Y2): One split | -12.295 | -6.004 | 2.036 | 3.234 | .923 | -1.500 | .976 | -1.685 | 1.942 | -3.757 | -2.415 | -4.309 |
| cov(X2, Y2): Two splits | -16.170 | -19.752 | -15.135 | -13.396 | 1.429 | 1.317 | -.051 | 2.142 | 3.808 | .652 | -1.479 | -5.027 |
| cov(X2, Y2): Three splits | -10.144 | -5.970 | -2.461 | -.719 | 1.393 | -1.654 | 1.905 | -1.877 | 2.766 | -2.675 | -1.766 | -4.876 |

Note. Log = logistic regression; CART = classification and regression trees; Prune = pruned CART analysis; RF = random forests; MI = multiple imputation.

Figure 5. Marginal means of percent bias for parameters in Simulation B.
Discussion

Simulation B replicated and extended the key results of Simulation A in a different analysis context: that of accurately recovering observed sample statistics rather than fitting a structural model. In this context, listwise estimates displayed evident, albeit modest, bias that was successfully reduced by the missing data estimators. In this study, as in Simulation A, pruned CART once again outperformed random forest analysis and, in most cases, single-tree CART methods. The benefits over logistic regression weights were once again varied, with each method outperforming the other in different cells. By applying these estimators in a different analysis context (i.e., retrieving sample statistics rather than model estimates), we can feel more confident that these results are not idiosyncratic to the conditions simulated in Simulation A. By applying these estimators in a different software package (using weighted mean and variance functions in R rather than weighted structural equation modeling using MLR estimation in Mplus), we can feel assured that these results are properties of the weights themselves, not simply of the program used to implement them.
General Discussion

Two simulation studies demonstrated the strong performance of machine learning techniques for computing missing data weights. In both studies, these methods performed comparably with, and in some cases exceeded, the performance of more traditional methods such as logistic regression weights and MI. Across both simulations, pruned CART outperformed single-tree and random forest methods in terms of efficiency and MSE. Though more simulation research is needed on this topic, several preliminary conclusions can be drawn from these results.
All Methods, But Especially Pruned CART and
Random Forests, Excel in Identifying the True
Selection Model

One exciting finding from Simulation A is the strong performance of nearly all selection model identification methods (t tests, logistic regression, CART, pruning, and random forest analysis) in identifying the true selection variables. The performance of t tests and logistic regression was not quite as high as that of the other methods when using significance testing as the main criterion, and the performance of these analyses also depended more highly on sample size. Using effect size measures (e.g., Cohen's d) alleviated this tendency, but could lead to overoptimism concerning how many covariates to include in the selection model. Pruning seemed to alleviate some of CART's tendency to overfit, and mainly seemed to cut spurious selection variables from the tree models. Finally, random forest's variable importance measures were remarkably consistent in prioritizing the true selection variables in the model.

Table 8
Relative Efficiency, N = 500, Simulation B

| Parameter: model | Listwise 30% | Listwise 50% | Log 30% | Log 50% | CART 30% | CART 50% | RF 30% | RF 50% | MI 30% | MI 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| X̄2: Linear | 1.020 | .924 | .985 | .984 | 1.022 | 1.038 | 1.057 | 1.340 | .954 | .816 |
| X̄2: One split | .855 | .930 | 1.158 | .961 | 1.010 | 1.021 | 2.135 | 1.395 | .813 | .811 |
| X̄2: Two splits | .924 | .823 | .925 | .99 | .990 | .997 | 1.177 | 1.650 | .805 | .764 |
| X̄2: Three splits | .912 | .858 | .892 | .824 | 1.017 | .975 | 1.056 | 1.102 | .825 | .727 |
| Ȳ2: Linear | .972 | .907 | .945 | .893 | 1.024 | .985 | 1.020 | 1.226 | .945 | .728 |
| Ȳ2: One split | .833 | .979 | 1.297 | .869 | .982 | .942 | 1.813 | 1.187 | .729 | .691 |
| Ȳ2: Two splits | .920 | .887 | .809 | .815 | .960 | .982 | 1.205 | 1.597 | .749 | .669 |
| Ȳ2: Three splits | .921 | .955 | .854 | .793 | 1.019 | .937 | 1.019 | 1.272 | .801 | .714 |
| var(X2): Linear | .925 | .819 | .952 | 1.042 | 1.034 | 1.055 | 1.037 | 1.301 | .889 | .785 |
| var(X2): One split | .752 | .931 | 1.166 | 1.048 | 1.022 | 1.019 | 1.843 | 1.268 | .735 | .85 |
| var(X2): Two splits | .858 | .769 | .916 | .993 | 1.089 | 1.037 | 1.294 | 1.408 | .867 | .725 |
| var(X2): Three splits | .792 | .855 | .900 | .956 | 1.096 | 1.049 | 1.138 | 1.096 | .792 | .830 |
| var(Y2): Linear | .971 | .971 | .985 | 1.176 | 1.030 | 1.083 | .976 | 1.206 | .809 | .814 |
| var(Y2): One split | .694 | .897 | 1.834 | 1.179 | .983 | .953 | 1.167 | 1.070 | .631 | .678 |
| var(Y2): Two splits | .736 | .681 | .794 | .921 | 1.023 | .962 | 1.173 | 1.015 | .729 | .610 |
| var(Y2): Three splits | .757 | .773 | .909 | .948 | 1.008 | .998 | 1.011 | 1.098 | .707 | .649 |
| cov(X2, Y2): Linear | .936 | .887 | .965 | 1.128 | 1.022 | 1.116 | 1.023 | 1.331 | .848 | .774 |
| cov(X2, Y2): One split | .695 | .900 | 1.575 | 1.098 | 1.016 | .962 | 1.576 | 1.077 | .689 | .723 |
| cov(X2, Y2): Two splits | .789 | .727 | .835 | .939 | 1.037 | .975 | 1.151 | 1.218 | .794 | .635 |
| cov(X2, Y2): Three splits | .733 | .838 | .853 | .967 | 1.026 | 1.036 | 1.001 | 1.108 | .741 | .685 |

Note. Relative efficiency ratios computed over SD_Prune. Log = logistic regression; CART = classification and regression trees; RF = random forests; MI = multiple imputation.
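The variable-importance screening discussed in this section can be illustrated with a toy version of the "one split" selection model. The article's analyses used R with accuracy-based importance; the sketch below uses scikit-learn's impurity-based importances instead, purely for illustration, with hypothetical covariates v, z, and w:

```python
# Toy "one split" selection model: dropout occurs when covariate v
# exceeds a cutoff; z and w are pure noise. Impurity-based importances
# (not the article's accuracy-based measure) are used for simplicity.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
v, z, w = rng.normal(size=(3, n))
dropout = (v > 0.5).astype(int)  # deterministic one-split selection

X = np.column_stack([v, z, w])
rf = RandomForestClassifier(
    n_estimators=200, max_features=None, random_state=0
).fit(X, dropout)
importance = dict(zip(["v", "z", "w"], rf.feature_importances_))
# importance["v"] dominates, mirroring how the true selection
# variable was consistently prioritized in the simulations
```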
The Performance of Tree-Based Weights Under a
Smooth, Linear Selection Model

It is important to draw attention to one thing that did not happen in the simulations: The performance of CART, pruned CART, and random forests did not, in general, deteriorate when the selection model was a smooth linear function. This was not a foregone conclusion; to quote Berk (2009, p. 150), "Even under the best of circumstance, unless the f(X) is a step function, there will be biases in any CART estimates. The question is how serious the biases are likely to be." Although the results of CART may be biased in the sense of approximating rather than capturing the true function, the present simulations suggest that pruned CART weights may be fairly robust under smooth, linear selection models, making this a surprisingly viable candidate as a useful all-purpose method (but see Lee et al., 2010, whose quadratic functions proved problematic for tree-based methods in a different context).
The Performance of Logistic Regression Weights

Throughout many of the simulation cells, pruned CART and logistic regression were "neck and neck," with each method taking turns outperforming the other. One notable exception was in estimating the variances and, especially, the covariance of X2 and Y2 in Simulation B (see Table 7). Under these conditions, logistic regression weights performed considerably worse than pruned CART weights, suggesting that, although logistic weights perform very well under many circumstances, there may be instances in which this does not hold true. Further simulation research is needed to clarify which moderators affect the relative performance of logistic weights over pruned CART weights.
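As a concrete reference point, logistic-regression missing-data weights take the familiar inverse-probability form: model each case's probability of remaining in the sample, then weight completers by its inverse. The sketch below is illustrative only (the article fit its models in R; the variable names and response model here are hypothetical):

```python
# Minimal inverse-probability-weighting sketch with a logistic
# response model. Hypothetical data-generating setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(size=n)
p_respond = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x1)))  # true response model
responded = rng.random(n) < p_respond

# Estimate P(respond | x1), then weight completers by 1 / P-hat
model = LogisticRegression().fit(x1.reshape(-1, 1), responded)
p_hat = model.predict_proba(x1.reshape(-1, 1))[:, 1]
weights = 1.0 / p_hat[responded]
```

Cases judged unlikely to respond receive the largest weights, so the retained sample stands in for the cases lost to attrition.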
The Surprisingly Small Impact of Sample Size

The present simulations seem to suggest that these methods are useful in modeling attrition even in very small samples. This helps to clarify a misconception some practitioners may carry concerning the uses of decision tree methods. Although CART and random forest methods are often invoked in the context of "big data," this reflects the methods' usefulness for the types of prediction problems commonly found in big data scenarios and does not imply that "big" data sets are required to profitably employ tree-based methods (see also Strobl et al., 2009, who make a similar point). As mentioned, MI was the one exception to this rule, performing most strongly in the N = 500 conditions in Simulation B.
Table 9
MSE Ratios, N = 500, Simulation B

| Parameter: model | Listwise 30% | Listwise 50% | Log 30% | Log 50% | CART 30% | CART 50% | RF 30% | RF 50% | MI 30% | MI 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| X̄2: Linear | 1.359 | 3.478 | .933 | .832 | 1.038 | .960 | 1.066 | 1.624 | .877 | .576 |
| X̄2: One split | 3.044 | 3.713 | 1.341 | .943 | 1.021 | 1.042 | 4.923 | 2.183 | .664 | .668 |
| X̄2: Two splits | .979 | .931 | .847 | .986 | .968 | .996 | 1.424 | 2.738 | .638 | .585 |
| X̄2: Three splits | 3.344 | 2.837 | .811 | .685 | 1.035 | .945 | 1.160 | 1.237 | .686 | .543 |
| Ȳ2: Linear | 1.565 | 5.940 | .884 | .569 | 1.047 | .766 | 1.057 | 1.257 | .883 | .385 |
| Ȳ2: One split | 6.089 | 8.160 | 1.685 | .829 | .963 | .885 | 4.340 | 1.909 | .533 | .489 |
| Ȳ2: Two splits | 1.044 | 1.500 | .610 | .662 | .873 | .967 | 1.590 | 2.708 | .524 | .450 |
| Ȳ2: Three splits | 6.295 | 8.569 | .713 | .629 | 1.035 | .882 | 1.238 | 1.994 | .629 | .522 |
| var(X2): Linear | .860 | .694 | .901 | 1.084 | 1.063 | 1.112 | 1.046 | 1.696 | .810 | .645 |
| var(X2): One split | .666 | .887 | 1.357 | 1.101 | 1.044 | 1.039 | 3.392 | 1.607 | .551 | .760 |
| var(X2): Two splits | .933 | .711 | 1.002 | .999 | 1.186 | 1.070 | 1.682 | 1.963 | .754 | .519 |
| var(X2): Three splits | .704 | .736 | .805 | .910 | 1.199 | 1.096 | 1.301 | 1.204 | .628 | .697 |
| var(Y2): Linear | .981 | 1.069 | .985 | 1.241 | 1.066 | 1.101 | .961 | 1.410 | .736 | .748 |
| var(Y2): One split | 1.155 | .984 | 3.386 | 1.400 | .966 | .911 | 1.357 | 1.171 | .458 | .575 |
| var(Y2): Two splits | 1.831 | 1.544 | 1.843 | 1.419 | 1.056 | .918 | 1.482 | 1.029 | .571 | .449 |
| var(Y2): Three splits | .978 | .655 | .820 | .905 | 1.005 | .997 | 1.048 | 1.212 | .518 | .469 |
| cov(X2, Y2): Linear | .884 | .921 | .933 | 1.221 | 1.045 | 1.207 | 1.078 | 1.718 | .747 | .656 |
| cov(X2, Y2): One split | .889 | .892 | 2.484 | 1.223 | 1.031 | .925 | 2.488 | 1.187 | .489 | .564 |
| cov(X2, Y2): Two splits | 1.534 | 1.372 | 1.496 | 1.263 | 1.082 | .946 | 1.375 | 1.470 | .638 | .454 |
| cov(X2, Y2): Three splits | .860 | .782 | .738 | .929 | 1.046 | 1.070 | 1.015 | 1.236 | .553 | .523 |

Note. MSE ratios computed over MSE_Prune. MSE = mean squared error; Log = logistic regression; CART = classification and regression trees; RF = random forests; MI = multiple imputation.
Pruned CART Versus Random Forest Weights

It may seem surprising to find that pruned CART's overall performance exceeded that of random forest analysis. In light of a large body of research suggesting that random forest should nearly always be a preferred method (Hastie, Tibshirani, & Friedman, 2009; Lee et al., 2010), how might these results be understood? We believe that there are (at least) three potential explanations of these results.

First, an important difference between these methods lies in the random forest algorithm's superior ability to handle collinearity among the model predictors. In the present simulations, we included only three covariates that were kept uncorrelated for computational reasons (this helped us more easily generate tree structures that reliably returned 30% and 50% missing cases). In real-world contexts, however, researchers may have many, highly intercorrelated covariates in their data sets. In such contexts, random forest analysis could provide an advantage because of its ability to address collinearity through resampling predictors. Therefore, it is important to extend the present research by simulating data sets with more numerous, correlated missing data covariates in order to examine whether the present results hold or change under these conditions.

Second, in these simulations, we predominantly used tree models to generate missing data. CART and pruned CART are ideal for detecting these piecewise, linear step functions. By contrast, by averaging across the results of many bootstrap trees, random forest methods actually create a smoothed functional form and may not perform as well when the true function is a step function.^11 It is possible, then, that random forest methods could outperform CART and pruned CART when nonlinear and interactive selection models exhibit smooth rather than piecewise step functional forms. Future work should compare the performance of these methods when the missing data correlates exhibit smooth, multiplicative linear interactions (e.g., v × z) and smooth nonlinear functions (e.g., quadratic, cubic).
Third,itispossiblethattheseresultsspeak,atleastinpart,tothe
specific goals and aims of the missing data analysis, which differ
from many common data analysis situations in crucial ways. In
mostdataanalysiscontexts,researchershopethattheirsubstantive
model of interest will generalize to future samples. This ability to
generalizeisonestrengthofrandomforestanalysis.Byclassifying
cases using majority vote, resampling methods like random forest
analysis are designed to capture information that is true across the
majorityofrepeatedsamples.Insodoing,thesemethods“average
out” the parts of each model that are idiosyncratic to each partic-
ular sample. This is how resampling methods reduce the variance
of CART results: by retaining only those elements of the model
that vary the least across repeated samples from the data.
But in missing data analysis, researchers are typically concerned with addressing incompleteness in their own sample, not maximizing their ability to forecast what sorts of people are most likely to have missing data in future samples. In this way, missing data analysis may represent an atypical inferential case. In this context, we do not care whether the observed tree that caused incompleteness in our data will be equally predictive of incompleteness in a future data set, nor do we especially care how closely the chosen split variables resemble those in the population, so long as they help us weight our model estimates in an effective manner. Thus, if the goal is to try to make inferences about what this sample would have looked like without incompleteness, rather than which cases are likely to be incomplete in the next sample, then averaging out idiosyncratic, sample-specific information may impede the true goal. In this case, the priority should be to accurately model correlates of incompleteness in the sample at hand, however idiosyncratic those correlates happen to be.
Pruned CART may be particularly suited to these goals. Al-
though cost-complexity pruning employs cross-validation, a tech-
nique commonly used to assess the generalizability of a model to
new data, it does this to determine the optimal nested subtree that
is a subset of the original, larger, overfit tree produced by CART.
Thus, this technique may serve to curb CART's tendency to overfit in response to trivial blemishes in one's data (e.g., by pruning back nodes based on small numbers of observations) while still utilizing a good amount of sample-specific information. In this way, pruning may represent an optimal middle ground between CART and random forest that serves our purposes well in the missing data context.
Future research is needed to disentangle these three potential explanations. Specifically, simulating covariates that are (a) more highly correlated with one another, as well as (b) related to attrition in a smooth interactive/nonlinear manner, would help determine whether random forest methods excel under these conditions. It would be ideal to conduct these simulations both with a highly controlled design with a smaller number of predictors, as we have done here, and with a larger, more real-world design in which many predictors compete for inclusion in the missing data model. Therefore, strength of covariate intercorrelations (low, moderate, high), smoothness of selection functions (smooth vs. step functions), and number of covariates (many vs. few) are three factors worth exploring in future studies.
Important Future Directions
Although these initial results are promising, it is important for
future research to build on this work in several key ways. First, it
would be beneficial to simulate covariates that are more strongly
correlated with model response variables. In the present study, the
greatest degree of correlation was between the covariates and Y2, set at r = .52. In line with previous research, we believe that
simulating greater covariate-outcome correlations would result in
higher amounts of bias and a greater need for missing data tech-
niques (see, e.g., Collins et al., 2001, who found correlations of .9
between the covariates and Y to be particularly deleterious).
Additionally, we note that although in our simulation we (like
Lee et al., 2010) only included main effects in our logistic regres-
sion analyses in order to most accurately model what researchers
typically do in practice, we agree with those (including our re-
viewers) who argue for the importance of assessing the perfor-
mance of this method when interactions and nonlinearities are
included in the model. This is a logical and important next step in
comparing these two methods. In practice, however, even if this
method performs well, there is one obvious and debilitating draw-
back: Including all interactions and nonlinear terms will undoubt-
edly be cumbersome for researchers who, in many software pack-
ages, would have to compute these numerous multiplicative and
11 We thank our anonymous reviewers for this suggestion.
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
MISSING DATA WITH CART
exponential terms manually. Combined with the potential col-
linearities that could result in such analyses, this may relegate this
approach to the category of “possible but infeasible,” giving the
automated tree algorithms an edge in terms of practical utility.
Another extension of this work would be to simulate longitudinal data with t > 2 time points. We began with t = 2 in the present
studies in order to form missing data weights in the most straight-
forward way: modeling incompleteness at only one time point
(Time 2). With a greater number of time points, more complex
patterns of incomplete data are possible, necessitating decisions
about which time point should be predicted and how the weights
might be formed (e.g., predicting Time 2 vs. the final time point;
averaging weights across time points; or using a multivariate
extension of CART, as in Brodley & Utgoff, 1995; De’ath, 2002).
In addition to providing an avenue to explore CART and random
forest weights in a multivariate context, such simulations would
also afford the possibility of simulating more complex patterns of
incompleteness and incorporating FIML (see, e.g., Arbuckle,
1996; Enders & Bandalos, 2001) among the missing data estima-
tors assessed.
Finally, two additional extensions are required to assess the
performance of these techniques in a comprehensive manner: (a)
examining these methods when the data are missing not at random
(Rubin, 1987; that is, when incompleteness is determined by
individuals’ scores on the endogenous variables themselves), and
(b) examining these methods when the data are non-normal. This
latter condition may be especially interesting, given that weighting
methods, unlike FIML and MI, do not require an assumption of
normality. Thus, it would be interesting to compare these methods
with other missing data techniques previously studied under non-
normality (see e.g., Enders, 2001).
Conclusion and Recommendations for Researchers
We close by attempting to answer what may be the most
important question of all: What can applied researchers take away
from these results? Which methods should they prefer and how
will they know what to use to address their missing data issues?
The present research can offer several suggestions for research-
ers, particularly when dealing with missing data at two time points,
as investigated here: First, although many techniques (i.e., t tests,
logistic regressions) can be successfully used to assess the true
selection model, pruned CART and random forest analysis appear
to perform particularly well. Second, of the machine learning
techniques studied here, pruned CART seems like a strong choice
under the various selection models, sample sizes, and amounts of
incomplete data considered here. Although random forest per-
formed well, the current simulations suggest that this computa-
tionally intensive technique may be overkill in the missing data
analysis context, at least when employing a smaller number of
uncorrelated (or lowly correlated) missing data covariates. This
being said, it is rare in psychology to have sample sizes so large as to make random forest substantially slower than CART, despite its
larger computational demands. Because the cost of trying random
forest analysis tends to be minimal in these practical situations, and
because it is possible that random forest may perform better under
selection models other than the ones simulated here (e.g., smooth
linear interactions; smooth, polynomial functions), it is still worth
trying this method. As one reviewer pointed out, an added benefit
of this technique is that it unburdens the user from having to make
decisions about whether (and how much) to prune. Third, MI’s
overall strong performance depended somewhat on sample size.
This method seems to be a particularly strong choice when dealing with larger samples, especially with N ≥ 500. The usual caveats
apply, however, and MI may be more cumbersome than other
methods when specifying analysis models with explicit or implicit
interactions, multiple group structural models, or hierarchical lin-
ear models (see Enders, 2010, for a very readable discussion).
Finally, an additional major theme of these simulations is that
sometimes selection models exert greater influence on the perfor-
mance of missing data techniques than others. Therefore, in prac-
tice, we recommend that researchers remember the counterfactual
inference discussed in the beginning of this article. Thus, rather
than asking which of several complicated methods for handling
missing data is the one that should be used, researchers can ask
themselves "How stable are my model estimates and results across
analyses that address incompleteness in different ways, under
different but related assumptions?” We believe this is a vastly
better question. Although it would be impractical to try out every
possible missing data technique on every data set, comparing
estimates from one or two recommended methods with estimates
from listwise deletion can be illuminating. For example, when
working with N = 100 and two time points, comparing listwise with pruned CART estimates may be a worthwhile assessment. For N = 500, comparing listwise with pruned CART and MI might be
helpful. In each case, such comparisons can shed important light
on whether the data are relatively affected (as in the means,
variances, and covariances assessed in Simulation B) or relatively
unaffected by attrition (as in the case of the regression coefficients
assessed in Simulation A).
This suggestion should not be misconstrued as an endorsement of running many tests and selectively reporting desirable-looking results. Rather, we believe such comparisons should be shared with, not hidden from, your readers, even if only parenthetically or in technical footnotes (e.g., "The results of missing data Method 1 were near-identical to the results of missing data Method 2. Therefore, Method 2 is relegated to the appendix"). Used responsibly in concert with recommendations from empirical simulation research, we believe this strategy provides a straightforward and incisive way to assess the effects of incompleteness on one's data.
References
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and
interpreting interactions. Thousand Oaks, CA: Sage.
Anderson, T. W. (1957). Maximum likelihood estimates for a multivariate
normal distribution when some observations are missing. Journal of the
American Statistical Association, 52, 200–203. http://dx.doi.org/10
.1080/01621459.1957.10501379
Arbuckle, J. N. (1996). Full information estimation in the presence of
incomplete data. In G. A. Marcoulides & R. E. Schumacker (Eds.),
Advanced structural equation modeling (pp. 243–277). Mahwah, NJ:
Erlbaum.
Asparouhov, T. (2005). Sampling weights in latent variable modeling.
Structural Equation Modeling, 12, 411–434. http://dx.doi.org/10.1207/
s15328007sem1203_4
Berk, R. A. (2009). Statistical learning from a regression perspective. New
York, NY: Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
http://dx.doi.org/10.1007/BF00058655
HAYES, USAMI, JACOBUCCI, AND MCARDLE
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. http://
dx.doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984).
Classification and regression trees: The Wadsworth statistics probabil-
ity series (Vol. 19). Pacific Grove, CA: Wadsworth.
Brodley, C. E., & Utgoff, P. E. (1995). Multivariate decision trees. Ma-
chine Learning, 19, 45–77. http://dx.doi.org/10.1007/BF00994660
Cohen, S. (1988). Perceived stress in a probability sample of the United
States. In S. Spacapan & S. Oskamp (Eds.), The social psychology of
health: Claremont Symposium on Applied Social Psychology (pp. 31–
67). Newbury Park, CA: Sage.
Cohen, S. (2008). Basic psychometrics for the ISEL 12-item scale. Re-
trieved from http://www.psy.cmu.edu/~scohen/
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of
inclusive and restrictive strategies in modern missing data procedures.
Psychological Methods, 6, 330–351. http://dx.doi.org/10.1037/1082-
989X.6.4.330
De’ath, G. (2002). Multivariate regression trees: A new technique for
modeling species–environment relationships. Ecology, 83, 1105–1117.
Enders, C. K. (2001). The impact of nonnormality on full information
maximum-likelihood estimation for structural equation models with
missing data. Psychological Methods, 6, 352–370. http://dx.doi.org/10
.1037/1082-989X.6.4.352
Enders, C. K. (2010). Applied missing data analysis. New York, NY:
Guilford Press.
Enders, C. K., & Bandalos, D. L. (2001). The relative performance of full
information maximum likelihood estimation for missing data in struc-
tural equation models. Structural Equation Modeling, 8, 430–457.
http://dx.doi.org/10.1207/S15328007SEM0803_5
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-
based structural equation models. Structural Equation Modeling: A
Multidisciplinary Journal, 10, 80–100. http://dx.doi.org/10.1207/
S15328007SEM1001_4
Graham, J. W., & Schafer, J. L. (1999). On the performance of multiple
imputation for multivariate data with small sample size. In R. H. Hoyle
(Ed.), Statistical strategies for small sample research (pp. 1–27). Thou-
sand Oaks, CA: Sage.
Hallquist, M., & Wiley, J. (2014). MplusAutomation: Automating Mplus
model estimation and interpretation. Retrieved from http://cran.r-project.org/package=MplusAutomation
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of
statistical learning. New York, NY: Springer-Verlag. http://dx.doi.org/
10.1007/978-0-387-84858-7
Hawkley, L. C., Hughes, M. E., Waite, L. J., Masi, C. M., Thisted, R. A.,
& Cacioppo, J. T. (2008). From social structural factors to perceptions of
relationship quality and loneliness: The Chicago Health, Aging, and
Social Relations Study. The Journals of Gerontology Series B: Psycho-
logical Sciences and Social Sciences, 63, S375–S384. http://dx.doi.org/
10.1093/geronb/63.6.S375
Hawkley, L. C., Lavelle, L. A., Berntson, G. G., & Cacioppo, J. T. (2011).
Mediators of the relationship between socioeconomic status and allo-
static load in the Chicago Health, Aging, and Social Relations Study
(CHASRS). Psychophysiology, 48, 1134–1145. http://dx.doi.org/10
.1111/j.1469-8986.2011.01185.x
Hawkley, L. C., Thisted, R. A., Masi, C. M., & Cacioppo, J. T. (2010).
Loneliness predicts increased blood pressure: 5-year cross-lagged anal-
yses in middle-aged and older adults. Psychology and Aging, 25, 132–
141. http://dx.doi.org/10.1037/a0017805
Kish, L. (1995). Methods for design effects. Journal of Official Statistics,
11, 55–77. Retrieved from http://www.jos.nu/Articles/abstract.
asp?article=11155
Lee, B. K., Lessler, J., & Stuart, E. A. (2010). Improving propensity score
weighting using machine learning. Statistics in Medicine, 29, 337–346.
Liaw, A., & Wiener, M. (2002). Classification and regression by random-
Forest. R News, 2, 12–22.
Louppe, G. (2014). Understanding random forests: From theory to prac-
tice (PhD thesis). University of Liege. Retrieved from http://arxiv.org/
pdf/1407.7502v3.pdf
McArdle, J. J. (2013). Dealing with longitudinal attrition using logistic
regression and decision tree analyses. In J. J. McArdle & J. Ritschard
(Eds.), Contemporary issues in exploratory data mining in the behav-
ioral sciences (pp. 282–311). New York, NY: Routledge.
McArdle, J. J., & Hamagami, F. (1992). Modeling incomplete longitudinal
and cross-sectional data using latent growth structural models. Experi-
mental Aging Research, 18, 145–166. http://dx.doi.org/10.1080/
03610739208253917
Morgan, J. N., & Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58,
415–434. http://dx.doi.org/10.1080/01621459.1963.10500855
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation
modeling with data that are not missing completely at random. Psy-
chometrika, 52, 431–462. http://dx.doi.org/10.1007/BF02294365
Muthén, L. K., & Muthén, B. (2011). Mplus user’s guide (6th ed.). Los
Angeles, CA: Author.
Potthoff, R. F., Woodbury, M. A., & Manton, K. G. (1992). “Equivalent
sample size” and “equivalent degrees of freedom” refinements for in-
ference using survey weights under superpopulation models. Journal of
the American Statistical Association, 87, 383–396.
Ripley, B. (2014). Package "tree." Retrieved from http://cran.r-project.org/web/packages/tree/index.html
Rogosa, D. (1995). Myths and methods: “Myths about longitudinal re-
search” plus supplemental questions. In J. M. Gottman (Ed.), The anal-
ysis of change (pp. 3–66). Mahwah, NJ: Erlbaum.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
http://dx.doi.org/10.1093/biomet/63.3.581
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New
York, NY: Wiley. http://dx.doi.org/10.1002/9780470316696
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and
quasi-experimental designs for generalized causal inference. Boston,
MA: Houghton Mifflin.
Stapleton, L. M. (2002). The incorporation of sample weights into multi-
level structural equation models. Structural Equation Modeling, 9, 475–
502. http://dx.doi.org/10.1207/S15328007SEM0904_2
Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive
partitioning: Rationale, application, and characteristics of classification
and regression trees, bagging, and random forests. Psychological Meth-
ods, 14, 323–348. http://dx.doi.org/10.1037/a0016973
Therneau, T. M., Atkinson, E. J., & Ripley, B. (2014). rpart: Recursive
partitioning and regression trees. Retrieved from http://CRAN.R-project.org/package=rpart
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–67.
Received December 2, 2014
Revision received June 1, 2015
Accepted July 5, 2015
Chapter 3: Should We Impute or Should We Weight? Examining the Performance of Two
CART-Based Techniques for Addressing Missing Data in Small Sample Research with
Nonnormal Variables
Timothy Hayes & John J. McArdle1
Missing data are a prevalent problem in applied research. One challenge of addressing
missing data is that, unlike variables of substantive interest, researchers often lack strong a priori
hypotheses about the factors that may lead to nonresponse in their datasets. Another potential
challenge is that the relations between covariates in the datasets and participants’ probabilities of
nonresponse may be nonlinear and interactive.
One promising new approach to help address these challenges involves employing
exploratory data mining (or machine learning) algorithms to help model potentially complex
relationships between observed covariates and missing data (cf. Hastie, Tibshirani, & Friedman,
2009 for a comprehensive overview of statistical learning). In this paper, we focus on two broad
approaches to utilizing data mining methods to address missing data. The first method uses data
mining techniques to predict participants’ probabilities of nonresponse and form inverse
probability weights. The second method uses data mining to generate predicted values to be used
as imputations to fill in missing cases.
Although each of these methods performed well in initial studies (Doove, van Buuren, &
Dusseldorp, 2014; T. Hayes, Usami, Jacobucci, & McArdle, 2015; Shah, Bartlett, Carpenter,
Nicholas, & Hemingway, 2014), so far we are unaware of any studies that compare their
performance. Furthermore, prior research has not systematically assessed the performance of
1 Manuscript currently in the second round of review at Computational Statistics & Data Analysis (revise and resubmit).
these methods in small samples with nonnormal data. In this paper, we first describe the
exploratory data mining techniques that form the basis for the missing data methods under study.
Then, we describe how these data mining techniques can be applied to missing data problems
using weighting and imputation methods. Next, we highlight important findings and questions
from past research on these methods. Finally, we report and discuss a simulation study designed
to assess the performance of these data mining methods in addressing missing data in small
sample experimental studies, under varying degrees of nonnormality and rates of nonresponse,
when missing data are related to a set of observed covariates through a variety of nonlinear and
interactive Missing At Random (MAR; Rubin, 1976) missing data mechanisms.
Overview of CART, Bagging, and Random Forests
Classification and regression trees (CART; Breiman, Friedman, Olshen, & Stone, 1984)
is a machine learning algorithm that uses the values of a set of observed predictor variables to
split a dataset into homogenous subgroups with respect to a categorical or continuous dependent
variable. A homogenous group in the categorical case is one in which the group members share
the same class membership. In the continuous case, a homogenous group is one in which the
group members share similar values on the continuous outcome, such that the values are closely
centered around a common group mean.
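The splitting criterion just described can be sketched as follows (a minimal Python illustration, not the rpart or tree package implementation, of choosing a single cutpoint for a continuous outcome by minimizing the within-node sum of squares; the data are simulated for illustration):

```python
# Exhaustive search for the best single CART split on one predictor.
import numpy as np

def best_split(x, y):
    """Return the (cutpoint, sse) minimizing total within-node SSE for y."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x_sorted)):
        if x_sorted[i - 1] == x_sorted[i]:   # tied values admit no cutpoint
            continue
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[1]:
            best = ((x_sorted[i - 1] + x_sorted[i]) / 2.0, sse)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=200)                      # a single predictor
y = np.where(x < 2.0, 10.0, 20.0) + rng.normal(0, 1, size=200)
cut, sse = best_split(x, y)                          # recovers a cutpoint near 2.0
```

Each candidate cutpoint partitions the cases into two subgroups, and the chosen split is the one whose subgroups are most tightly centered around their own means.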
Because each split in a CART tree is contingent upon the splits that came before it,
CART trees are inherently conditional. Thus, when CART splits the dataset on the basis of more
than one predictor variable, these variables interact in determining the final subgroups in the
dataset. Likewise, when CART creates multiple splits at different cutpoints on a single variable,
this represents a type of nonlinearity, in which the prediction for that variable is not constant
across levels or values of the variable (as it would be in linear regression, where any single-unit
change in x at any point in the variable would produce a corresponding change in y).
As an illustration of these concepts, Figure 3.1 depicts a tree diagram resulting from a
hypothetical CART analysis. On the right-hand side of the diagram, two variables, CollegeYear
and GPA, interact in predicting y, indicating that college seniors’ scores on the outcome depend
upon whether or not their GPAs are less than 2.0. On the left-hand side of the diagram, there is a
non-linear prediction on the variable CollegeYear, indicating that participants' ultimate grouping in the analysis depends upon whether they are juniors or underclassmen (freshmen + sophomores).
One of the main results produced by CART analyses is a set of predicted values of the
dependent variable. In the case of a continuous dependent variable, the predicted value for any
given case is set equal to the mean of the terminal node (final subgroup) in which it falls. In the
case of a categorical dependent variable, CART produces two kinds of predicted values: the
predicted probability of membership in a given class (e.g., of being classified as a 1 rather than a
0 on a binary outcome) is equal to the proportion of cases in the node who are members of that
class, whereas the predicted class for every case in a final (‘terminal’) node is simply assigned
by majority vote to the class with the most members in the node.
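These node-level predicted values can be illustrated directly (hypothetical terminal-node contents, assumed for illustration rather than taken from a fitted tree):

```python
# Predicted values for cases falling in a terminal node.
import numpy as np
from collections import Counter

# Continuous outcome: the prediction is the node mean.
node_y_continuous = np.array([3.0, 3.5, 4.0, 4.5])
pred_continuous = node_y_continuous.mean()                 # node mean

# Categorical (binary) outcome: two kinds of predictions.
node_y_class = [1, 1, 1, 0, 1]
pred_prob_1 = node_y_class.count(1) / len(node_y_class)    # proportion in class 1
pred_class = Counter(node_y_class).most_common(1)[0][0]    # majority vote
```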
When the CART algorithm is allowed to proceed indefinitely, it will tend to result in very
large trees with few observations in the terminal nodes that tend to overfit the training data and
provide relatively poor prediction to new test datasets. One way to improve the predictive
accuracy of single CART trees is to employ cost-complexity pruning, which adds a penalty term
to the CART equations that imposes a cost proportional to the size of the tree. A second way is
to use ensemble methods such as bagging (short for “bootstrap aggregation”, Breiman, 1996),
which involves taking repeated bootstrap samples of the data (cf. Efron & Tibshirani, 1993) and
growing a large CART tree on each bootstrap sample. When a large number of bootstrap trees
have been grown, the predicted value for each case is, in essence, an average of its predicted
values across the bootstrap trees. An improvement upon this technique is the random forests
(Breiman, 2001) algorithm, which starts out in the same way – by growing a ‘forest’ of bootstrap
trees – but adds an additional feature: the algorithm randomly samples a subset of predictor
variables at each split, allowing highly correlated predictors to contribute differentially across the bootstrap trees, thereby counteracting the effects of multicollinearity.
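The ensemble logic can be sketched in miniature (illustrative Python using one-split "stumps" in place of full trees; this shows bootstrap aggregation with random predictor subsets, not the randomForest package itself, and the data and tuning values are invented):

```python
# Toy random-forest-style ensemble: bootstrap samples + random predictor subsets.
import numpy as np

rng = np.random.default_rng(7)

def fit_stump(X, y, feature_idx):
    """Fit the single best split over the candidate features (min within-node SSE)."""
    best = (None, None, np.inf)                      # (feature, cutpoint, sse)
    for j in feature_idx:
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for i in range(1, len(xs)):
            if xs[i - 1] == xs[i]:                   # tied values: no cutpoint
                continue
            l, r = ys[:i], ys[i:]
            sse = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, (xs[i - 1] + xs[i]) / 2.0, sse)
    j, cut, _ = best
    left = X[:, j] <= cut
    return j, cut, y[left].mean(), y[~left].mean()

def predict_stump(stump, X):
    j, cut, left_mean, right_mean = stump
    return np.where(X[:, j] <= cut, left_mean, right_mean)

n, p = 200, 4
X = rng.normal(size=(n, p))
y = np.where(X[:, 0] < 0, 0.0, 5.0) + rng.normal(0, 1, size=n)  # step in x0

preds = []
for _ in range(30):                                  # grow 30 bootstrap trees
    boot = rng.integers(0, n, size=n)                # bootstrap sample of cases
    mtry = rng.choice(p, size=2, replace=False)      # random subset of predictors
    preds.append(predict_stump(fit_stump(X[boot], y[boot], mtry), X))
forest_pred = np.mean(preds, axis=0)                 # averaging smooths the step
```

Averaging across trees is also what produces the smoothed functional form discussed in the preceding chapter: the final prediction varies gradually rather than jumping at a single cutpoint.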
Using CART and Random Forests to Address Missing Data
Using CART and random forests for multiple imputation. So far, two main
approaches have been proposed for using CART and its ensemble extensions to address missing
data. First, CART could be used to generate imputations to replace missing values on categorical
and continuous variables. Because CART and random forests are univariate procedures, one
imputation algorithm that is especially well-suited to implementing these approaches is Fully Conditional Specification (FCS) or chained equations imputation (van Buuren, 2007; van
Buuren, Brand, Groothuis-Oudshoorn, & Rubin, 2006) which approximates multivariate
distributions of missing values through a series of univariate analyses (e.g., Bayesian
regressions) predicting missing cases on a variable-by-variable basis.
As implemented in the R package mice (van Buuren & Groothuis-Oudshoorn, 2011),
CART imputation essentially grows a CART tree on the data (with any missing values on the
predictor variables initially filled in by randomly sampling from their observed values),
determines the terminal node in which a given missing observation falls, and generates an
imputation by randomly sampling from the observed values of y falling in the same node (Doove
et al., 2014). By randomly sampling from observed values of y in the node, rather than simply
taking the predicted value of y, the algorithm preserves the variability found among observations
in the node, analogous to including a stochastic error term in conventional multiple imputation.
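The donor-sampling step can be sketched as follows (a simplified Python illustration, not the mice implementation; a single known cutpoint stands in for a fitted tree, and all values are invented for illustration):

```python
# Impute each missing y by sampling an observed "donor" value from the
# same terminal node, preserving within-node variability.
import numpy as np

rng = np.random.default_rng(3)
x = np.array([0.2, 0.4, 0.6, 1.2, 1.4, 1.6, 0.3, 1.5])
y = np.array([1.0, 1.2, 0.9, 5.1, 4.8, 5.3, np.nan, np.nan])  # two missing cases

cut = 1.0                                   # terminal nodes: x <= 1 vs. x > 1
node = (x > cut).astype(int)                # node membership for every case

imputed = y.copy()
for i in np.where(np.isnan(y))[0]:
    donors = y[(node == node[i]) & ~np.isnan(y)]   # observed y in the same node
    imputed[i] = rng.choice(donors)                # random draw, not the node mean
```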
Random forest imputation takes this concept one step further by randomly drawing an
imputation for missing case i from the set of complete data observations that fell into the same
terminal node as case i in at least one of the bootstrapped trees. Like its parent algorithm, random
forest imputation forms trees by randomly sampling subsets of the predictors as candidates for
each split, but bagging imputation can be performed if the subset of candidate predictors is set
equal to the total number of predictors in the imputation model (see Doove et al., 2014, for
details about the CART and random forest multiple imputation algorithms).
Using CART and random forests to create inverse probability weights. A second
way to use CART and its extensions to address missing data takes a different tack. Rather than
using CART and random forests to generate imputations, this second method (T. Hayes et al.,
2015; McArdle, 2013) uses categorical implementations of CART and random forests to
estimate predicted probabilities associated with a binary missing data indicator, and these
predicted probabilities are then inverted to form inverse probability weights (Kish, 1995;
Potthoff, Woodbury, & Manton, 1992).
More specifically, CART, pruned CART, or random forest methods predict a binary
response indicator coded 0 if data on the outcome variable are missing and 1 if data on the
outcome are non-missing, and the predicted probabilities of being non-missing are then inverted
to form weights, such that the weight for case i is set equal to w_i = 1/p̂_i, where p̂_i is case i's predicted probability of being non-missing. A terminal node with
a low probability of being classified as a 1 represents a node in which only a few individuals
provided data when most others were missing. Inverting the predicted probability of being
classified as 1 (non-missing), then, has the effect of up-weighting the scores of these individuals
who provided data on y in spite of sharing the same characteristics as individuals who did not
provide data.
Because the weighted sample variance is not invariant to scaling, these weights are
typically rescaled to form relative sample weights by dividing each weight by the mean of the
weights, e.g., w_i(relative) = w_i / w̄. Relative sample weights have a mean of unity and sum to the
sample size, n (Potthoff et al., 1992; Stapleton, 2002). Once relative weights are constructed,
they might be used in subsequent analyses employing a pseudo maximum likelihood (PML)
estimator, such as Maximum Likelihood with Robust standard errors (MLR), which has
performed well in past research when compared to standard weighted maximum likelihood
estimation, which tends to underestimate parameter standard errors, resulting in problematic
coverage rates (thus, researchers are advised to avoid using these weights in the context of
simple weighted ML or weighted least squares analyses, cf. Asparouhov, 2005).
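In code, the weighting and rescaling steps look like this (a minimal Python sketch with invented predicted probabilities, not output from an actual CART or random forest fit):

```python
# Inverse probability weights and their relative-weight rescaling.
import numpy as np

p_hat = np.array([0.9, 0.9, 0.5, 0.25, 0.8])  # P(non-missing) per responding case
w = 1.0 / p_hat                                # inverse probability weights
w_rel = w / w.mean()                           # relative sample weights
# w_rel has mean 1 and sums to n, so the weighted sample variance is
# not distorted by the overall scale of the weights.
```

A case with p̂ = .25 (rare responder in its node) receives weight 4, up-weighting it to stand in for similar cases who did not provide data.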
Previous Simulation Research Evaluating the Performance of CART and Random Forest
Approaches to Addressing Missing Data
So far, only limited research has examined the performance of CART- and Random
Forest-based methods in addressing missing data. The next sections first detail prior research
based on the performance of CART and random forest weighting methods, and then turn to
research based on CART and random forest multiple imputation methods.
Previous research evaluating CART and random forest weights. The use of CART
weights was originally proposed by McArdle (2013), who applied the method to a longitudinal
dataset with promising results, unearthing more interesting relationships among missing data
covariates and explaining more variance than simple logistic regression weighting methods. Two
follow-up simulations by Hayes et al. (2015) compared the performance of CART, pruned
CART, and random forest weights to (a) listwise deletion, (b) weights computed from the
predicted probabilities generated by a logistic regression analysis, and (c) standard
implementations of multiple imputation (L. K. Muthén & Muthén, 2011; van Buuren &
Groothuis-Oudshoorn, 2011) in recovering parameter estimates from a cross-lagged regression
model. Missing data in these simulations were generated via MAR mechanisms in which the
probability of dropout (or nonresponse) was related to the observed covariates via either a linear
missing data model (in the log odds) or a decision tree with one, two, or three splits.
Although quite preliminary, these simulations generated several tentative conclusions.
First, all methods, but particularly decision tree techniques (CART, pruned CART, and random
forests) performed well at identifying the true covariates used in the population-level missing
data models included in the simulation, regardless of the functional form of the missing data
model. Second, CART, pruned CART, and random forest weighting methods performed as well
as standard MI and as well or better than logistic regression weights even under relatively small
sample sizes (N = 100). Third, weights based on pruned CART analysis seemed to perform
slightly better than the other decision tree methods in these initial simulations. Finally, similar to
other missing data simulations (see e.g., Collins, Schafer, & Kam, 2001), larger bias was
observed in the estimates of the sample means and variances than in the cross-lagged regression
coefficients estimated in the simulation.
Although these simulations provided an interesting initial look at these decision tree
methods, finding that they performed just as well as standard multiple imputation under a variety
of conditions, they did not clearly demonstrate situations in which CART, pruned CART, and
random forest weights might outperform standard missing data techniques like multiple
imputation. Furthermore, these simulations had at least three limitations. First, for simplicity, the
covariates used to generate the missing data in the simulations were orthogonal to one another –
a situation that may have disadvantaged the random forest algorithm, since a major strength of
this method is its ability to address collinearity issues by randomly sampling subsets of
predictors. Second, the majority of the missing data models used in these simulations were based
on tree structures, which may have artificially enhanced the apparent performance of the CART
methods by creating conditions under which these methods would easily succeed (although it
should be noted that all of these methods still performed well under a linear missing data model).
Finally, these simulations generated data from multivariate normal distributions, despite the fact
that perfect normality is rarely – if ever – observed in real data (Micceri, 1989).
Previous research evaluating CART and random forest imputation. Two intriguing
recent sets of simulation studies examined the performance of CART and random forest multiple
imputation (MI) methods in recovering parameter estimates from linear and logistic regression
models (Doove et al., 2014) and survival analysis models (Cox regression; Shah et al., 2014).
Both simulations found that CART and random forest multiple imputation performed nearly as
well as standard multiple imputation in estimating main effects in the respective regression
models and exceeded the performance of standard MI in terms of bias, confidence interval
widths, and confidence interval coverage when estimating interaction effects.
One caveat to these results, however, is that both sets of authors simulated datasets with
relatively large sample sizes – N = 1000 in the Doove et al. (2014) simulations and N = 2000 in
the Shah et al. (2014) simulations. Therefore, whereas CART and random forest weighting
methods have shown initial promise in smaller sample settings (e.g., N = 100 and N = 250, cf. T.
Hayes et al., 2015), the performance of CART and random forest multiple imputation in the
small sample context remains unknown.
The Present Research
In the present research, we conducted a simulation study to assess the relative
performance of CART and random forest methods for inverse weighting and multiple imputation
under a set of realistic situations that may face applied researchers. First, many researchers
conducting experiments face practical constraints on the availability of study participants.
Therefore, a major aim of the present simulation was to assess the performance of these methods
in small sample, experimental studies (e.g., N = 125, N = 250).²
Second, in contrast to Hayes et al.'s (2015) tree-based missing data models – which are, at
their essence, step functions – we wished to assess the performance of CART and random forest
methods when missing data were related to the observed covariates through a variety of smooth
functional forms (e.g., linear, quadratic, and cubic functions, as well as multiplicative
interactions of the covariates). Since, under practical circumstances, analysts have no way of
knowing whether the population-level missing data function is smooth and continuous or a
discontinuous step function, it is important to assess the performance of these methods under a
wide array of conditions. Third, unlike the perfectly orthogonal covariates simulated by Hayes et
al., we wanted to examine performance when the missing data covariates are allowed to
correlate freely.
Fourth, we wished to base our template model on experimental research comparing the
means of two groups. Since the estimation of mean differences is often a primary goal of
experimental research, and since estimates of sample means seem especially sensitive to the
biasing effects of missing data (cf. Collins et al., 2001; T. Hayes et al., 2015), we felt that this
situation would provide fertile ground for the assessment of these missing data methods.
[Footnote 2: Additionally, we note that an initial simulation produced the same pattern of results
with sample sizes as low as N = 60 (i.e., 30 participants per cell in a two-group experiment).]
Fifth,
we wanted to incorporate an interaction effect into our data generation model, in which
experimental treatment was moderated by a continuous covariate. This seemed desirable both
because CART and random forest imputation have been shown to perform well in capturing
interaction effects, as described previously (Doove et al., 2014; Shah et al., 2014), and because
this type of analysis model is extremely common in applied research (cf. Aiken & West, 1991).
Finally, we also wished to assess the performance of these techniques under varying degrees of
nonnormality.
Our hope was that, under small sample sizes, nonlinear and interactive (smooth) missing
data models with correlated predictors, an interactive analysis model, and varying degrees of
nonnormality and missing data, these CART and random forest based methods might not only
rival, but perhaps even exceed, the performance of more traditional missing data estimation
techniques such as logistic regression weights and multiple imputation using Bayesian
regression. Additionally, we were interested in comparing CART and random forest weights
with CART and random forest multiple imputation methods to assess the conditions under which
one or the other of these variants provided superior performance.
Method
Template model.
For each simulation cell, we generated 200 datasets following a simple model in which
two experimental groups interacted with a continuous moderator, z, to predict scores on an
outcome variable, y. Additionally, we simulated three missing data covariates, labeled x, v, and
w, respectively.
Specifically, we began by simulating the moderator z by using Fleishman’s (1978) power
method to draw N observations from a distribution with mean 0 and variance 1 having a desired
level of skewness and kurtosis. If Fleish() indicates the Fleishman power function, S = skew, and
K = kurtosis, this may be written:
z = Fleish(μ = 0, σ² = 1, S, K), (1)
where S and K took on sets of values defined below. Note that when S = K = 0, the
Fleishman function produced a standard normal variate.
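Concretely, the Fleishman power method transforms a standard normal draw X into Y = a + bX + cX² + dX³ with a = −c, where b, c, and d are chosen (by solving Fleishman's nonlinear system, not shown here) to produce the target skewness and kurtosis. A minimal Python sketch of the transform step follows; the coefficient defaults correspond only to the normal case S = K = 0:

```python
import random

def fleishman_transform(x, b=1.0, c=0.0, d=0.0):
    """Fleishman (1978) polynomial y = a + b*x + c*x**2 + d*x**3, with a = -c
    so that the transformed variate retains mean zero."""
    a = -c
    return a + b * x + c * x ** 2 + d * x ** 3

# With the defaults (b = 1, c = d = 0), the transform is the identity,
# so a standard normal input stays standard normal -- the S = K = 0 condition.
rng = random.Random(1)
normal_draws = [fleishman_transform(rng.gauss(0.0, 1.0)) for _ in range(5)]
```

For nonnormal conditions, b, c, and d must first be solved from the target S and K; that solver step is omitted here.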
At first, we simulated all N data points to model “control group” participants, with the
function:
y_control = 0.20·z + Fleish(μ = 0, σ² = 1, S, K), (2)
where the intercept was implicitly equal to zero and where the Fleish function served as a
regression residual with the same values of skewness and kurtosis as z. Next, we generated the
missing data correlates, x, v, and w, to be correlated at ρ = .7 with y_control and ρ = .5 with z.
Following the generation of these covariates, we changed half of the observations (that is, N/2
observations) to emulate an experimental treatment group, as:
y_treatment = 0.50 − 0.20·z + Fleish(μ = 0, σ² = 1, S, K). (3)
Thus, in the “treatment” group simulated participants had a mean of 0.50, holding z
constant, and the relationship between z and y changed direction, yielding a disordinal
interaction. Because half of the simulated “participants” received the experimental “treatment,”
the correlation between the covariates and y was altered. Whereas the covariates x, v, and w were
correlated with y_control at ρ = .7, they were only correlated with y_treatment on average at ρ =
−0.1. Because these covariates were highly positively correlated with half of the simulated
observations (the control group observations) and lowly negatively correlated with the other half
of the observations (the treatment group observations), the overall correlation of the covariates
with y shrank to a modest ρ = .3. We felt that this was a realistic scenario, since random
assignment to experimental conditions often undercuts what may be high correlations between
baseline (e.g., pretest) measures and scores on the dependent variable after the experimental
treatment has taken effect. Following this procedure, the missing data covariates, x, v, and w,
were correlated with each other at approximately ρ ≈ .6.
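In Python, the normal-data version of this data-generation scheme (where the Fleishman residuals reduce to standard normal draws) can be sketched as follows. The covariate loadings lam and gam are our own back-calculation from the target correlations (ρ = .5 with z, ρ = .7 with the control-group y), not values taken from the dissertation's code:

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def simulate_template(n, seed=0):
    """Two-group template model with moderator z (normal-data condition only)."""
    rng = random.Random(seed)
    half = n // 2
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]  # residual used to build the control-group y
    # Covariates x, v, w: correlate ~.5 with z, ~.7 with y_control, ~.6 with one another.
    lam = 0.6139                             # loading on u, solved from rho(x, y_control) = .7
    gam = math.sqrt(1.0 - 0.25 - lam ** 2)   # unique part, keeps each covariate's variance at 1
    covs = [[0.5 * zi + lam * ui + gam * rng.gauss(0, 1) for zi, ui in zip(z, u)]
            for _ in range(3)]
    # First half stays "control" (y = 0.20*z + u); the second half is then overwritten
    # as "treatment" (y = 0.50 - 0.20*z + fresh residual), which undercuts the
    # covariate-outcome correlation in that group.
    eff = [-1] * half + [1] * (n - half)
    y = [0.20 * z[i] + u[i] if i < half else 0.50 - 0.20 * z[i] + rng.gauss(0, 1)
         for i in range(n)]
    return z, eff, y, covs
```

With a large n, the sample correlations land near the values reported above: roughly .7 between each covariate and y in the control half, roughly −.1 in the treatment half, and roughly .3 overall.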
Factors Varied in the Simulation
We varied four primary factors in the simulation: the sample size N, the functional form
of the missing data mechanism, the rate of missing data, and the degree of nonnormality in the
data. We discuss each of these simulated factors in turn.
Sample size N. We included four sample sizes, N = {125, 250, 500, 1,000}. We generated
the first three sample size conditions to correspond to low, moderate, and large sample sizes,
respectively. We included the final, extremely large sample size condition (N = 1,000; cell n =
N/2 = 500) in order to replicate the sample size used by Doove et al. (2014).
Functional form of missing data mechanism and percentage of missing data. We
generated missing data on y in the simulation by first simulating the log odds of response (i.e.,
the log odds of having complete data) using one of four functional forms:
LogOdds_linear = β_0L + β_1·x, (4)
LogOdds_quadratic = β_0Q + β_2·x + β_3·x², (5)
LogOdds_cubic = β_0C + β_4·x + β_5·x² + β_6·x³, (6)
and
LogOdds_interaction = β_0I + β_7·x + β_8·v + β_9·x·v, (7)
where the values of the parameters were chosen based on their ability to (a) inject bias
into the analysis models of interest, and (b) generate either (i) approximately 10-13% or (ii)
approximately 30-40% missing data,³ with roughly equal proportions of missing data in the
control and experimental groups (see online supplemental materials for parameter values used in
the simulation and supplemental R code and simulation datasets for exact missing data
percentages observed in each condition). We then transformed these simulated log odds into
probabilities using the standard formula π_i = e^(LogOdds_i) / (1 + e^(LogOdds_i)). Following this
transformation, we simulated a binary response (vs. nonresponse) indicator by sampling 0s and
1s with probabilities (1 − π_i) and π_i, respectively. We used this indicator to assign missing
values on the y variable to any cases with scores of 0.
Following the generation of missing data, we saved both (a) these true probabilities, π_i,
used to generate the response indicator and (b) inverse probability weights (scaled to the relative
sample size) created using these true predicted probabilities.
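Mechanically, each of these steps is simple: a logistic transform of the simulated log odds, a Bernoulli draw for the response indicator, and a rescaling of the respondents' inverse weights. A Python sketch using a linear log-odds model is given below; the coefficients b0 and b1 are illustrative placeholders, not the values used in the simulation:

```python
import math
import random

def make_missing(y, x, b0=0.5, b1=1.0, seed=0):
    """Linear MAR mechanism: LogOdds of response = b0 + b1*x.
    Returns y with None for nonrespondents, the 0/1 response indicator,
    and the true response probabilities."""
    rng = random.Random(seed)
    probs = [math.exp(b0 + b1 * xi) / (1.0 + math.exp(b0 + b1 * xi)) for xi in x]
    resp = [1 if rng.random() < p else 0 for p in probs]
    y_obs = [yi if r == 1 else None for yi, r in zip(y, resp)]
    return y_obs, resp, probs

def true_inverse_weights(resp, probs):
    """Inverse weights 1/p for respondents, rescaled to sum to the
    number of complete cases (the relative sample size)."""
    raw = [1.0 / p for r, p in zip(resp, probs) if r == 1]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]
```

In the simulation proper, the log odds come from equations (4)-(7) rather than this placeholder linear model.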
[Footnote 3: With normal data and ~30% incompleteness, the marginal proportions of missing
data were 0.35, 0.29, 0.31, and 0.35 in the linear, quadratic, cubic, and interaction missing data
mechanisms, respectively. With nonnormal data, these proportions were a bit larger: 0.39, 0.30,
0.35, and 0.41. These proportions were very stable across sample sizes.]
Degrees of nonnormality. Following Enders (2001), we used the Fleishman (1978)
function to include two normality conditions: normal data (S = K = 0), and severely nonnormal
data (S = 3.25, K = 20).⁴
Overall simulation design. The overall simulation design, based on the factors
described, was a 4 (N = 125, 250, 500, 1000) x 4 (missing data model = linear, quadratic, cubic,
interaction) x 2 (normality = normal, severe nonnormal) x 2 (percent missing data: 10%, 30%) =
64 cell design. Because we simulated 200 datasets in each condition, this resulted in 12,800
simulated datasets. However, because the results for percentage of missing data were largely
consistent and predictable (the missing data methods performed similarly in both conditions,
except where noted, but higher percentages of missing data produced greater bias), we simplify
our reporting by focusing on results generated in the 30% missing data conditions, noting where
results in the 10% missing data conditions differed. Interested readers can find full results for
the 10% missing data conditions, paralleling those described here, in the online supplemental
materials.
Analyses Conducted on the Simulated Datasets
Analysis model. Since we simulated our data to mimic a simple experiment with two
groups in which experimental condition interacted with a continuous moderator, we employed
the following analysis model⁵:
[Footnote 4: In pilot simulations, we additionally included mildly nonnormal (S = 1.25, K =
3.50) and moderately nonnormal (S = 2.25, K = 7.00) data conditions. However, because the
results for middling degrees of nonnormality were fairly predictable (producing values in
between those found in the normal and severely nonnormal datasets), we simplified our
simulation by focusing on the normal and severely nonnormal conditions. Results for the
omitted conditions can be obtained by running the supplemental R code accompanying this
article.]
[Footnote 5: Additionally, because experimenters are often interested in assessing mean
differences between groups, we fit an alternative analysis model, y = β_0 + β_d·d, where d
indicated a dummy-coded variable and β_d represented the mean difference between the
experimental and control groups. However, because the results for bias and coverage associated
with this parameter mirrored those of the interaction coefficient, β_ez, we refer interested
readers to the supplemental R code for these additional results.]
y = β_0 + β_e·eff + β_z·z + β_ez·eff·z, (8)
where eff is an effect-coded variable in which −1 = control and +1 = treatment (Aiken & West,
1991; although dummy coding would also have been a valid representation of the experimental
treatment, cf. A. F. Hayes, 2013). In the results reported below, we focus on the parameter β_ez,
which quantifies the interaction of eff and the continuous moderator z. The true population value
of β_ez = −0.20.⁶
[Footnote 6: Over 1,000 test trials, the model in equation (8) produced an average model
R-squared of 0.10, and the inclusion of β_ez produced a change in R-squared of roughly 0.04
compared with a model including only main effects (average R-squared without the interaction =
0.06; see online supplemental R code) on complete data samples. Thus, of the total variance
explained by the model (10 percent), the interaction term β_ez·eff·z explained 40% of that
variance. Although the overall percentage of variance explained was low (partially due to
keeping the residual variance equal to 1, rather than reducing it), we note that it is well known
that many small experimental studies tend to capture phenomena with small-to-medium effect
sizes (cf. Sedlmeier & Gigerenzer, 1989). Despite the small effect size of this model, parameter
estimates and coverage rates were very accurate across sample sizes in the complete datasets,
before missing data were induced, as displayed in the results below.]
Analyses conducted on each simulated dataset. The analyses conducted on the
simulated datasets fell into three main categories: baseline comparison analyses, weighted
analyses, and multiple imputation analyses.
Baseline comparison analyses. First, we analyzed each simulated dataset in three
preliminary ways: (1) analyzing the full datasets with complete cases, to provide a benchmark
for how far the parameter estimates fell from their true population values simply due to the small
sample sizes employed in the critical cells of the simulation, (2) analyzing the missing datasets
with listwise deletion, to provide a benchmark for the degree of bias produced by simply
ignoring the missing data and deleting observations with missing values on y, and (3) analyzing
the missing datasets with listwise deletion using a regression model that included all auxiliary
missing data covariates (x, v, and w) as predictors, to assess the performance of listwise deletion
when important missing data predictors are included in the model.
Although the inclusion of missing data covariates in listwise analyses can sometimes aid
estimation (Groenwold, Donders, Roes, Harrell, & Moons, 2012), the addition of these linear
covariates did not ultimately aid estimation under the nonlinear and interactive missing data
models simulated here. Because the results produced by these two listwise deletion methods
were nearly identical regardless of whether the auxiliary covariates were included, in our
presentation of results we simply report the listwise deletion analyses that included the missing
data covariates (which parallel the other missing data methods, all of which include the
covariates in one form or another) and omit the listwise analyses that did not contain variables x,
v, and w in the regression model. For consistency with the weighted analyses, all three of these
baseline analyses were conducted by specifying the model regression from eq. (8) (and the
alternative mean-difference model described in footnote 5) using the
lavaan package (Rosseel, 2012) in R (R Core Team, 2013).
Weighted analyses. The second set of analyses utilized inverse probability weights
generated by several different methods. In all cases, we conducted weighted analyses using the
lavaan.survey (Oberski, 2014) package in R by applying the lavaan.survey()
function to an sem() analysis in lavaan using the arguments missing = “listwise”,
meanstructure = TRUE, and estimator = “MLR” to estimate the regression of
interest.
For the first type of weighted analysis, we computed inverse weights using the true
population response probabilities that generated missing data in the simulation. The purpose of
this initial analysis was to provide a benchmark for how well weighted analyses might be
expected to perform if the true population probabilities of response were estimated perfectly.
Since weighted analyses do not impute data but, rather, simply reweight cases with complete
data, we felt it was important to include these true population weights to ascertain if this
characteristic would be limiting and ultimately lead to ceiling effects on these methods’
performance.
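The logic of this benchmark is easy to demonstrate in miniature. In the toy example below (our own illustration; the coefficients bear no relation to the simulation's actual parameter values), high-x cases respond less often, so the complete-case mean of y is pulled well below its true value of zero, while reweighting respondents by their true inverse response probabilities essentially restores it:

```python
import math
import random

rng = random.Random(7)
n = 100_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.7 * xi + rng.gauss(0, 1) for xi in x]   # population mean of y is 0

# MAR response model: higher x -> lower probability of observing y
probs = [1.0 / (1.0 + math.exp(-(0.5 - 1.5 * xi))) for xi in x]
resp = [rng.random() < p for p in probs]

obs_y = [yi for yi, r in zip(y, resp) if r]
obs_w = [1.0 / p for p, r in zip(probs, resp) if r]

cc_mean = sum(obs_y) / len(obs_y)                                   # biased downward
ipw_mean = sum(w * yi for w, yi in zip(obs_w, obs_y)) / sum(obs_w)  # approximately 0
```

Because the weights here use the exact probabilities that generated the missingness, this corresponds to the "true population weights" benchmark; estimated weights can only approximate it.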
Additionally, we estimated probabilities and formed inverse weights using four other
methods. First, we used predicted probabilities from a logistic regression analysis as a means of
assessing the performance of weights computed using a standard method. The logistic regression
model incorporated main effects of all predictors, as well as the substantive interaction of interest
(that is, predicting the log odds of response on y from the auxiliary covariates x, v, and w, as well
as the model variables eff, z, and the eff*z interaction). We only included main effects of the
covariates in this logistic model to mirror what we thought analysts would typically do in
practice. Manually computing all possible interactions and nonlinearities up to, for example,
cubic trends would be cumbersome, unwieldy, and ultimately infeasible even with only three
covariates. Furthermore, even if all of these terms were diligently computed, attempting to fit
such a complex model could easily lead to collinearity problems and other estimation issues
arising from the ratio of participants to parameters being estimated.
Following this logistic regression model, we estimated predicted probabilities of response
using CART, pruned CART, and random forest analyses. We used the package rpart
(Therneau, Atkinson, & Ripley, 2014) to fit the CART and pruned CART models. For the
pruned CART models, we estimated the complexity parameter using the 1 standard error rule
described in the rpart package documentation (see also T. Hayes et al., 2015 for further
details). We conducted random forest analyses using the randomForest() function from
package randomForest (Liaw & Wiener, 2002) with default settings. Since CART and
random forests automatically assess interactions and nonlinearities among all model variables,
we simply included the main effects of x, v, w, eff, and z predicting the binary response indicator
(where 0 = missing response on y, 1 = complete response on y). We then inverted all
probabilities to form inverse weights and rescaled them to sum to the relative sample size
(Potthoff et al., 1992; Stapleton, 2002).
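To make the tree step concrete, the heart of a CART response model is an exhaustive search over covariate thresholds for the split that leaves the child nodes most homogeneous in response status; each terminal node's predicted response probability is just its observed response rate. A toy single-split ("stump") version of that search using the Gini criterion is sketched below (an illustration only, not the rpart algorithm):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 response indicators."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def fit_stump(x, r):
    """Single CART split on covariate x predicting binary response r.
    Returns (threshold, p_left, p_right), where the node probabilities are
    the observed response rates on each side of the best split."""
    pairs = sorted(zip(x, r))
    xs = [xi for xi, _ in pairs]
    best_thr, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if xs[i] == xs[i - 1]:
            continue                      # no split between tied x values
        thr = (xs[i] + xs[i - 1]) / 2.0
        left = [ri for _, ri in pairs[:i]]
        right = [ri for _, ri in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_thr, best_score = thr, score
    left = [ri for xi, ri in zip(x, r) if xi <= best_thr]
    right = [ri for xi, ri in zip(x, r) if xi > best_thr]
    return best_thr, sum(left) / len(left), sum(right) / len(right)
```

Real CART recurses on each child node and prunes; random forests repeat this on bootstrap samples of cases and random subsets of predictors, which is what lets these methods pick up interactions and nonlinearities without those terms being specified in advance.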
Multiple imputation analyses. For our final set of analyses, we performed three different
types of multiple imputation using the mice package in R (van Buuren & Groothuis-Oudshoorn,
2011). For all imputation analyses, we generated m = 20 imputed datasets – a number four times
higher than the mice() function’s default of m = 5 (and the same number of imputations
employed by Doove et al., 2014). Additionally, for all imputation models, we included all
auxiliary covariates (cf. Collins et al., 2001) as well as all model variables that related to the
analysis of interest. This means that for the regression depicted in equation (8), the imputation
model included variables x, v, w, eff, z, and the critical eff*z interaction from the substantive
model.
We tested three imputation methods in the simulation. First, we conducted standard
multiple imputation using Bayesian regression with the method = “norm” argument in the
mice() function.⁷ Following this standard imputation analysis, we ran the same imputation
models using single-tree CART multiple imputation (method = "cart") and random forest
multiple imputation (method = "rf").⁸ All analyses were conducted using the lm()
function in base R and pooled using the pool() function in the mice package.
[Footnote 7: Another obvious point of comparison would be mice's default method, predictive
mean matching. However, this method performed poorly in initial pilot simulations, so we opted
to assess Bayesian regression imputation in the hopes that it would produce stronger results.]
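The pool() step applies Rubin's rules: the pooled estimate is the mean of the m per-dataset estimates, and its total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/m). A minimal sketch of that combination step (not mice's implementation, which also tracks degrees of freedom):

```python
import math
from statistics import mean

def rubin_pool(estimates, variances):
    """Combine m point estimates and their squared standard errors via Rubin's rules.
    Returns the pooled estimate and its pooled standard error."""
    m = len(estimates)
    qbar = mean(estimates)                                   # pooled point estimate
    w = mean(variances)                                      # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    total = w + (1.0 + 1.0 / m) * b                          # total variance
    return qbar, math.sqrt(total)
```

For example, rubin_pool([-0.21, -0.19, -0.23], [0.004, 0.005, 0.004]) would pool three hypothetical estimates of the interaction coefficient.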
Summary of “control” analyses used in the simulation. As the foregoing discussion
implies, we took pains to include a variety of “control” analyses in the simulation which might
provide useful points of comparison for assessing the performance of our CART- and random
forest-based methods of interest. In brief, these are: (a) analysis of the full dataset, with no
missing data, (b) analysis using only complete cases via listwise deletion (with covariates x, v,
and w, included as predictors in the regression), (c) weighted analysis using the true population
weights, (d) weighted analysis using weights computed via standard logistic regression methods,
and (e) multiple imputation analysis using standard Bayesian regression imputation.
Outcome Measures Assessed in the Simulation
The primary outcomes assessed in the simulation concerned the bias and confidence
interval (CI) coverage of the key interaction parameter β_ez. Coverage was simply defined as the
proportion of simulation replications in which the confidence interval around a given estimate contained
the true population parameter. Percent bias was computed as:
%Bias = ((θ̂ − θ) / θ) · 100, (9)
where, for each estimation method, θ̂ indicates the estimated parameter of interest in a
given simulation cell (i.e., β̂_ez), and θ indicates the population value of the parameter (−0.20).
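As a quick check on the formula, a one-line Python version of this outcome measure: an estimate of −0.23 against the true value of −0.20 comes out at 15% bias, exactly the conventional cutoff noted below.

```python
def percent_bias(theta_hat, theta):
    """Percent bias of an estimate relative to the true parameter value."""
    return (theta_hat - theta) / theta * 100.0
```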
[Footnote 8: We note that the default settings of these CART-based mice methods conform to
those used in the prior simulations by Doove et al. (2014) and Shah et al. (2014) – namely,
minimum terminal node sizes of 5 observations in CART trees (from Doove et al.) and random
forests of B = 10 bootstrap trees (based on Shah et al.; see ?mice.impute.cart and
?mice.impute.rf in R, with the mice package loaded, for details).]
Following prior research, we note that values of percent bias greater than 15 are considered
problematic (B. Muthén, Kaplan, & Hollis, 1987).
Results
Percent Bias
We will discuss the results for percent bias in two ways. First, we present the average
percent bias by simulation condition and missing data estimator descriptively, using tables and
plots. Second, we discuss the results of a series of significance tests that we performed in order
to assess which missing data methods differed significantly under which conditions.
Percent bias of the interaction parameter. Tables 3.1 and 3.2 display the means and
standard deviations of percent bias of the β_ez interaction coefficient for each combination of
missing data mechanism (linear, quadratic, cubic, and interaction), sample size, and missing data
analysis method for normal data (Table 3.1) and severe nonnormal data (Table 3.2), respectively,
for the 30% missing data conditions. Examining these tables, several trends are apparent. First,
across sample sizes, the various linear, nonlinear, and interactive missing data mechanisms used
in the simulation injected a substantial amount of bias compared with the complete data
estimates.
Second, although absolute levels of bias remained high in a variety of cases, it is apparent
that, of the inverse weighting methods, random forest weights produced the largest decreases in
bias in the majority of conditions, whereas of the multiple imputation methods, CART multiple
imputation produced the largest reductions in bias. The superior performance of random forest
weights to logistic regression weights (including main effects only) and the superior performance
of CART imputation over Bayesian regression (“norm”) imputation is consistent with previous
results by Doove et al. (2014), demonstrating that CART-based multiple imputation improved
upon imputation methods that failed to account for nonlinearity and interactivity in the
imputation model. The fact that CART MI performed somewhat better than random forest MI in
terms of bias also replicates Doove et al.’s simulation results. These general trends – the
superior performance of random forest weights and CART multiple imputation in reducing
bias compared to the listwise estimates – also held true under severe nonnormality. We note that
these general trends were also observed in the 10% missing data conditions.
Two additional trends are worth highlighting. First, examining the results for random
forest weights and CART multiple imputation in Tables 3.1 and 3.2, take note of the standard
deviations observed for these methods. Across sample sizes, the observed standard deviations of
percent bias are higher for random forest weights than they are for CART multiple imputation,
but this difference increases dramatically as the sample sizes increase from N = 125 to N = 1,000.
This is driven by the fact that the multiple imputation methods generally become more and more
efficient as sample size increases, evidenced by comparatively smaller standard deviations,
whereas the weighting methods tend to have large standard deviations regardless of sample size.
Thus, although random forest weights appear to have an advantage in terms of average bias in
many cells, this advantage may be tempered under larger sample sizes by the larger variation in
observed estimates.
Second, examine the relationship between sample size and average percent bias returned
by the random forest weights and CART multiple imputation. As a visual aid, Figure 3.2 presents
the average results returned by these two methods with the listwise deletion and complete data
estimates included as points of comparison. Under normal data (left panel of Figure 3.2),
random forest weights tend to show less overall bias than CART multiple imputation at lower
sample sizes (e.g., N = 125 and N = 250), but the distance between these estimates decreases as
sample size increases. By N = 500, the mean estimates of bias are quite close for the two
methods. Under nonnormal data, however, the difference between the two methods remains
relatively constant across sample size conditions. We note that, although the rest of the key
points emphasized in this section hold true across missing data percentages, this particular trend
– the relationship between missing data method and sample size – was dramatically less
pronounced in the 10% missing data conditions, in which the overall bias observed was
substantially less (see online supplemental materials for 10% missing data results).
Significance tests and effect sizes for the difference between missing data methods
and simulation factors. To explore these results further, we conducted a series of significance
tests examining the differences between key missing data estimators and simulation design
factors. First, within each combination of sample size, normality, and missing data mechanism,
we performed several pairwise comparisons. Specifically, we compared random forest weights
with (1) listwise deletion (including covariates), (2) inverse weights created from the true
response probabilities used in the simulation, (3) logistic regression weights, (4) CART multiple
imputation, and (5) Bayesian regression (norm) imputation. Additionally, we compared CART
multiple imputation with (6) listwise deletion and (7) Bayesian regression imputation. Thus, we
compared the two top-performing CART-based missing data methods (random forest weights
and CART multiple imputation) to each other, as well as listwise deletion results and key
comparison analyses (i.e., comparing random forest weights to weights from standard logistic
regression analysis with main effects; comparing CART multiple imputations to multiple
imputations generated by standard Bayesian regression).⁹
[Footnote 9: Note that we did not attempt to compare missing data mechanisms (e.g., linear,
quadratic, etc.) to each other, since the magnitude of the bias injected by each of these
mechanisms was a somewhat arbitrary function of the coefficients chosen for equations (4)-(7)
and would be different under different values of the coefficients than the ones simulated here.
Instead, our primary aim was simply to show the performance of CART-based missing data
methods under a variety of smooth, nonlinear, and interactive missing data mechanisms under
MAR missing data.]
Because both missing data methods were fit to the same simulated dataset in each
repetition of each simulation cell, we conducted these comparisons using paired t-tests that, in
essence, treated each pair of missing data methods fit to the same dataset as repeated measures.
Because we performed 7 pairwise comparisons within each simulation cell, we Bonferroni
corrected all p-values in each cell by multiplying the observed p-value by the number of
comparisons (i.e., 7) and setting all corrected p-values greater than 1 equal to 1.00, to preserve
correct probability values. Additionally, we computed Cohen's d for each comparison.
Table 3.3 displays the results of these pairwise comparisons for the normal data conditions,
and Table 3.4 displays the results of these comparisons for the nonnormal data conditions. In
examining these tables, the estimates of effect size are particularly helpful in understanding the
pattern of results. Under normality and nonnormality, both random forest weights and CART
multiple imputation perform significantly better than listwise deletion methods under the
majority of conditions, and the effect sizes of these differences increase as the sample size
conditions increase. Furthermore, both of these methods generally performed better than
Bayesian regression imputation, since the linear correlations between the missing data covariates
and the dependent variable were not high enough to provide adequate imputations, given the
interactive structure of the data. Random forest weights also differed significantly from logistic
regression weights in the majority of conditions. However, these differences were smallest in the
linear missing data mechanism conditions and largest under the interactive missing data
mechanism conditions.
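Both adjustments used in these comparisons are simple to compute: each p-value is multiplied by the number of comparisons and capped at 1.00, and the paired Cohen's d is the mean of the per-dataset differences divided by their standard deviation. A quick sketch with the seven-comparison default used here:

```python
from statistics import mean, stdev

def bonferroni(p, n_comparisons=7):
    """Bonferroni-correct a p-value, capping the result at 1.00."""
    return min(1.0, p * n_comparisons)

def cohens_d_paired(a, b):
    """Cohen's d for paired data: mean difference divided by the SD of differences."""
    diffs = [ai - bi for ai, bi in zip(a, b)]
    return mean(diffs) / stdev(diffs)
```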
Thus, both of these CART-based missing data methods appear to perform well when
compared to listwise methods that ignore missing data and standard methods that do not take the
nonlinear and interactive structure of the data adequately into account. But how do these two
CART-based techniques compare to each other? Examining the results for normal data,
displayed in Table 3.3, there is a clear relationship between sample size and effect size: in the
majority of conditions, the effect size of the difference between CART MI and random forest
weighting methods tends to decrease as sample size increases, in contrast to the increasing
effect sizes seen for most of the other comparisons. Under nonnormality, however, this does not appear to be the case: in
Table 3.4, the effect size of the difference between random forest weights and CART MI is
relatively constant across sample sizes in the majority of the conditions (and this is true under
10% missing data and nonnormality as well – see the online supplement for details).
To probe this relationship between sample size and missing data method further, we
subsetted our simulation results dataset to include only the random forest weighting and CART
multiple imputation results for 30% missing data. Then, we performed two multilevel
regressions, one on the normal data cells and one on the nonnormal data cells, predicting percent
bias from (a) a dummy-coded variable representing missing data method (cartMIDummy, where 0 =
random forest weights and 1 = CART multiple imputation), (b) a set of dummy-coded variables
representing the four sample size conditions, with the smallest sample size, N = 125, set as the
reference group, and (c) the interaction of cartMIDummy with each dummy-coded sample size
variable. We collapsed the data across missing data mechanism conditions and allowed random
intercepts within simulation iterations, nested within missing data mechanism conditions.
Table 3.5 presents the results of these multilevel regressions.
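Under this coding, the model-implied percent bias for any method-by-sample-size cell is a simple sum of coefficients, with random forest weights at N = 125 as the reference cell. A small sketch with made-up coefficients (chosen for illustration; they are not the estimates in Table 3.5) shows how the interaction terms are read:

```python
def expected_bias(is_cart_mi, n, coef):
    """Model-implied percent bias under the dummy coding described in the
    text: the reference cell is random forest weights at N = 125."""
    b = coef["intercept"]
    if n != 125:
        b += coef[f"N{n}"]              # sample size main effect
    if is_cart_mi:
        b += coef["cartMI"]             # method main effect
        if n != 125:
            b += coef[f"cartMI:N{n}"]   # method x sample size interaction
    return b

# Hypothetical coefficients, for illustration only.
coef = {"intercept": -20.0, "cartMI": -19.0,
        "N250": 8.0, "N500": 10.0, "N1000": 11.0,
        "cartMI:N250": 6.0, "cartMI:N500": 16.0, "cartMI:N1000": 17.0}

rf_125 = expected_bias(False, 125, coef)    # -20.0 (reference cell)
cart_125 = expected_bias(True, 125, coef)   # -39.0
cart_500 = expected_bias(True, 500, coef)   # -20 - 19 + 10 + 16 = -13.0
```

A positive interaction coefficient thus means the CART MI deficit shrinks at that sample size, which is exactly the pattern the text describes for the normal data conditions.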
For both normal and nonnormal data, the coefficient for the cartMIDummy variable is
significant and negative, suggesting that CART multiple imputation produced more negatively
biased estimates, on average, than random forest weights. However, for the normal data
conditions, missing data method interacted with sample size at each level (N = 250, 500, and
1,000). To decompose these interactions, Figure 3.3 shows the expected values of percent bias
for each missing data method (CART multiple imputation and random forest weights) across
sample sizes for the normal data conditions. The results displayed in this figure mirror those
found in the pairwise comparisons from Table 3.3: across sample sizes, CART multiple
imputation produced more negatively biased estimates than random forest weights. However, the
difference between these missing data methods was greatest at the lower sample sizes (N = 125 and
250) and was greatly reduced by N = 500.
Before proceeding, we note that these results are specific to the 30% missing data
conditions. In the 10% missing data conditions, shown in the online supplemental materials, both
random forest weights and CART multiple imputation methods perform equally well in reducing
bias, across sample sizes, when the data are normal. This represents something akin to a “floor
effect”: with only a small amount of missing data, the overall degree of bias in the listwise
estimates is substantially smaller, and both CART MI and random forest weights remove the majority of
this bias relatively easily, reaching absolute levels of bias near 5% in many conditions, even
under small sample sizes. As a result, there is substantially less room for improvement in the
estimates, even at low Ns, and the sample size effects evident in the 30% missing data
conditions fail to surface.
Coverage
Tables 3.6 and 3.7 display the results for confidence interval coverage and confidence
interval width in the normal data and severe nonnormal data conditions, respectively. First,
examining the listwise deletion results, we see a clear and disturbing trend: as the sample size
increases, coverage rates decrease across missing data mechanisms, with the cubic and
interaction mechanisms showing particularly degraded results. Although this seems
counterintuitive at first (why would increasing the sample size lead to decreased coverage?),
there is a simple reason for this trend. Recalling the percent bias results displayed in Tables 3.1
and 3.2 and Figure 3.2, it is evident that the percent bias returned by listwise methods was
relatively stable across sample sizes. However, the standard errors associated with these
estimates, and the resulting confidence intervals around the estimates, become predictably
smaller at larger sample sizes. Thus, as sample size increases, the same biased estimates are
contained within smaller confidence intervals that are less likely to extend far enough to capture
the true parameter value. Less severe versions of this trend can be seen for the logistic regression
weighting and Bayesian regression (norm) imputation methods as well. Although these methods
fare better than listwise deletion across the majority of conditions, they still show decreased
coverage, with the most dramatic examples occurring under cubic and interactive missing data
generating mechanisms.
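The mechanics of this coverage decay are easy to reproduce: hold the bias of an estimator fixed while its standard error shrinks at the usual 1/sqrt(n) rate, and a nominal 95% interval captures the true value less and less often. A stdlib-only sketch (the bias value and sample sizes are arbitrary, chosen only to make the effect visible):

```python
import random

random.seed(1)

def coverage(n, bias, true_value=0.0, reps=2000):
    """Proportion of nominal 95% CIs containing the true value when the
    point estimate carries a fixed bias and SE = 1 / sqrt(n)."""
    se = n ** -0.5
    hits = 0
    for _ in range(reps):
        est = true_value + bias + random.gauss(0.0, se)
        if est - 1.96 * se <= true_value <= est + 1.96 * se:
            hits += 1
    return hits / reps

cov_small = coverage(n=25, bias=0.2)    # bias is 1 SE: coverage roughly .83
cov_large = coverage(n=400, bias=0.2)   # bias is 4 SEs: coverage near zero
```

At n = 25 the fixed bias equals one standard error and coverage is only modestly degraded; at n = 400 it equals four standard errors and the interval almost never reaches the true value, mirroring the listwise pattern in Tables 3.6 and 3.7.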
Figure 3.4 presents the main results graphically, for random forest weights and CART
multiple imputation, with listwise deletion and complete data estimates shown as reference
points. Under normal data (left-hand panels), both random forest weights and CART multiple
imputation perform substantially better than listwise methods. Between the two methods, random
forest weights tend to be consistently higher in terms of coverage, although in many cells this
difference is slight. Under normal data the coverage levels are relatively stable across sample
sizes, whereas under severe nonnormal data the CART multiple imputation method seems to
somewhat degrade in terms of coverage as sample size increases. For example, under an
interactive missing data generating mechanism, CART MI produces 74% coverage when N =
125 but only 46% coverage when N = 1,000. Contrast this with random forest weighting
methods, which produce 84% coverage at N = 125 and 85% coverage at N = 1,000 under these same
conditions.
Discussion
Across a variety of sample sizes, missing data generating mechanisms, and degrees of
nonnormality, missing data methods based on CART and random forests reduced parameter bias
and increased confidence interval coverage relative to ignoring missing data and using listwise
deletion methods. The results of this simulation extend prior work in several incremental but
important ways. First, when auxiliary missing data covariates (x, v, and w in the simulation) were
allowed to correlate freely, random forest weights provided stronger results than pruned CART
weights, thus providing an important clarification to Hayes et al.’s (2015) previous results with
orthogonal predictors. Although pruning may be sufficient when predictors are uncorrelated,
random forests’ ability to address collinearity appeared quite useful in the present simulation.
Second, the performance of CART-based methods did not appear to be degraded under
nonnormality. To the contrary, if anything, these methods appeared to perform slightly better in
many instances when the simulated data were severely nonnormal. This strong performance
under nonnormality may be aided by both CART’s and, in the case of weighting methods, the
inverse weights’ lack of reliance on strong parametric assumptions (Berk, 2009; Potthoff et al.,
1992).
Third, the current simulation found that CART and random forest multiple imputation
methods did not perform as strongly as random forest weighting methods at reducing bias under
lower sample size conditions (N = 125 and 250) when the missing data rate was high (~30%),
providing what may be an important qualification to prior, large-sample simulations (Doove et
al., 2014; Shah et al., 2014). However, in line with this previous work, multiple imputation
methods performed extremely well – comparably to random forest weighting methods – in the
larger sample size conditions of N = 500 and N = 1,000. Further, with lower rates of missing data
(~10%), both random forest weights and CART multiple imputation succeeded at recovering the
comparatively smaller degree of bias caused by missing data, even in small sample sizes.
The comparatively poorer performance of CART multiple imputation under small sample
sizes and 30% missing data was somewhat surprising to us, considering that we had expected
these CART multiple imputation methods might perform quite well, even in smaller samples.
One potential explanation for these results has to do with the available complete data
observations in the terminal nodes of CART trees in small samples. Recall that CART and
random forest multiple imputation methods generate imputations by sampling complete cases
from available data in the terminal nodes in which they fall in a given CART tree (or ensemble
of trees). It is possible that, at low sample sizes, there may simply not be enough complete data
observations in any given terminal node to provide adequate sampling variability for generating
accurate imputations.
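This donor-sampling step, and the way thin terminal nodes choke off imputation variability, can be sketched in a few lines. The fitted tree is replaced here by a hand-written partition function, since the point is the sampling step rather than the tree fitting (all names and data are illustrative):

```python
import random

random.seed(2)

def node_of(x):
    """Stand-in for a fitted CART tree: map a covariate to a terminal node."""
    return 0 if x < 0.5 else 1

def impute_from_nodes(observed, missing_x, m=5):
    """CART-style imputation: each missing y is filled by drawing a donor y
    from the complete cases falling in the same terminal node."""
    donors = {}
    for x, y in observed:
        donors.setdefault(node_of(x), []).append(y)
    # m imputed datasets, one drawn donor per missing case each time
    return [[random.choice(donors[node_of(x)]) for x in missing_x]
            for _ in range(m)]

observed = [(0.1, 1.0), (0.2, 1.2), (0.9, 5.0)]   # node 1 has ONE donor
imps = impute_from_nodes(observed, missing_x=[0.3, 0.8])
second_case = [draw[1] for draw in imps]
```

Because the second terminal node contains a single complete case, every one of the five imputations for the second missing value is forced to equal that lone donor's score: exactly the degenerate between-imputation variability described above.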
A second potential explanation concerns the quantities that each CART-based method is
attempting to estimate. In the case of CART-based weights, the CART and random forest
algorithms are simply trying to predict a vector of 0s and 1s indicating who was missing versus
nonmissing in the dataset, whereas the CART-based imputations are, in essence, attempting to
predict the means of various subgroups within the data on a continuous outcome. Simply stated,
it is possible that with smaller datasets, it is simply more difficult to fit a tree that accurately
groups individuals around common means on a continuous outcome than it is to fit a tree
grouping people in terms of who gets a 0 and who gets a 1. It is also possible that both of these
explanations are true: under smaller sample sizes, it may be initially harder to fit the “right” tree
and properly group participants into subgroups of individuals whose scores lie close to a
common group mean on a continuous outcome, and then, subsequently, once this tree is fit, there
may be relatively few complete data observations in each node from which to sample in order to
generate accurate imputations with adequate variability. These explanations are only tentative,
however, and further research is needed to better understand the performance of CART and
random forest imputation under smaller sample size conditions.
One other surprising result from the simulations is worth mentioning: the comparatively
poor performance of standard Bayesian regression imputation using method = “norm” in the
R package mice (van Buuren & Groothuis-Oudshoorn, 2011). Although one may wonder if this
has something to do with the small sample size conditions (e.g., N = 125 and 250) employed in
many of our simulation cells, this small-sample performance explanation is unlikely for two
reasons: both because prior research has shown standard imputation methods to perform quite well
in a variety of circumstances, even under small sample conditions (Graham & Schafer, 1999),
and because norm imputation performed poorly even under the large sample size conditions (N
= 500 and 1,000) in the current study.
We think that a more likely explanation traces the poor performance of this standard
multiple imputation method to two features of our simulation. First, because our simulated data
were designed to mimic data from experimental studies, in which scores on the dependent
variable are altered for a random half of the research participants by the administration of an
experimental treatment, the simulated correlations between our auxiliary covariates and y were
quite low (r = .3). Furthermore, due to the nature of the simulated interaction effect, the
correlation between the simulated moderator, z, and the outcome, y, was 0.20 in the “control”
group, –0.20 in the “treatment” group, but zero overall. Thus, both because of the simulated
random assignment to treatment and because of the nature of the disordinal interaction effect,
none of the model or auxiliary variables was especially informative for the
resulting imputation models, even with 1,000 simulated cases.
This explanation makes sense in light of previous work on multiple imputation under
small sample sizes, in which MI performed well when model variables were more highly
correlated (Graham & Schafer, 1999). This is also supported by a recent simulation conducted by
Hayes and McArdle (2017), which showed that Bayesian regression imputation performed quite
well under a variety of MNAR (Rubin, 1976) missing data mechanisms when auxiliary and
model variables were correlated at moderate and high levels (rs of approximately 0.50 and 0.80,
respectively), even in extremely small samples. Interestingly, in this same recent simulation,
CART and random forest imputation methods still struggled under low sample sizes, despite
these larger variable intercorrelations.
Although we believe that this idiosyncratic correlational structure is the main culprit
behind the anomalous poor performance of Bayesian regression imputation, there is a second
feature of this simulation that may have contributed to these results. Because we simulated
missing data on the dependent variable only, this type of missing data pattern falls under the
category of regression with missing ys. Under these conditions, imputed values have been shown
to be particularly uninformative, leading to a proposal for multiply imputing incomplete
predictor variables and ultimately deleting cases with missing y values (multiple imputation then
deletion, von Hippel, 2007). Thus, this particular form of missing data, in which data were
deleted on y only, may have been especially resistant to imputation.
We note, however, that this was also true of Doove et al.’s (2014) simulations, in which
both standard imputation methods and CART-based imputation methods helped recover main
effect and interaction estimates under univariate incompleteness on y. We also believe it is
noteworthy that random forest weights (and other CART-based weighting methods) provided
substantial improvements on complete case (listwise) analyses, suggesting that under the types of
conditions simulated here, reweighting is better than deletion.
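The advantage of reweighting over deletion is easy to reproduce in miniature. Under MAR, complete cases can be reweighted by the inverse of their response probabilities; in the simulations those probabilities are estimated by CART or random forests, but the sketch below simply uses the true probabilities, which are known by construction:

```python
import random

random.seed(3)

# MAR mechanism: response probability depends only on the observed covariate x.
p_obs = {0: 0.9, 1: 0.3}
data = [(x, 2.0 * x + random.gauss(0.0, 0.5))
        for x in [0, 1] * 2000]                    # true E[y] = 1.0

# Delete y according to the response probabilities; keep complete cases.
observed = [(x, y) for x, y in data if random.random() < p_obs[x]]

listwise_mean = sum(y for _, y in observed) / len(observed)
weights = [1.0 / p_obs[x] for x, _ in observed]
ipw_mean = sum(w * y for w, (_, y) in zip(weights, observed)) / sum(weights)
```

The listwise mean settles near 0.5 because cases with x = 1 are underrepresented among complete cases, while the inverse-probability-weighted mean recovers the true value of 1.0.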
Above all else, the current simulation identifies a set of clear initial guidelines about
when each of these methods might prove useful. Under low sample sizes (N = 125 and 250) and
larger amounts of missing data (~30%), random forest weighting appears to provide substantial
performance gains compared to other methods under most conditions. With larger sample sizes,
however, such as N = 500 and N = 1,000, single-tree CART imputation was a clear victor,
producing parameter estimates that were, on average, comparable to random forest weights but
dramatically more efficient (but note CART multiple imputation’s slightly inferior confidence
interval coverage). With lower amounts of missing data (~10%), however, both CART multiple
imputation and random forest weighting methods are strong choices to address missing data
when the missing data generating mechanism is thought to be nonlinear or interactive.
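As a compact restatement, the guidelines above can be phrased as a rough decision rule. This is our paraphrase of the discussion, not a prescription from the dissertation, and the cutoffs simply mirror the simulated conditions:

```python
def suggested_method(n, missing_rate):
    """Rough decision rule distilled from the simulation results:
    with light missingness either approach performed well; with heavy
    missingness, small N favors random forest weighting and large N
    favors (single-tree) CART multiple imputation."""
    if missing_rate <= 0.10:
        return {"rf_weights", "cart_mi"}
    return {"rf_weights"} if n < 500 else {"cart_mi"}
```

Real applications fall between simulated cells, so the returned sets are starting points rather than verdicts.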
Future research is needed to continue evaluating the conditions under which CART-based
weights and CART-based imputations might outperform standard methods in addressing missing
data. In particular, it will be crucial to assess whether altering key tuning parameters in the
CART and random forest imputation algorithms might help these methods to perform as well at
low sample sizes as they appear to perform in large samples. Because the classic
implementations of CART and random forests are univariate procedures, these techniques are
readily incorporated into chained equations imputation and easily extended to multivariate
missing data problems (Doove et al., 2014; Shah et al., 2014; van Buuren & Groothuis-Oudshoorn,
2011). Thus, these methods have great potential to address interactive and nonlinear
structures in missing data patterns more complex than those simulated here. We hope that future
studies refine and build upon the present results, identifying additional situations in which CART
and random forest imputations can be profitably employed to address complex relationships
between observed predictor variables and missing data in small sample research.
References
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions.
Thousand Oaks, CA: Sage.
Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation
Modeling: A Multidisciplinary Journal, 12(3), 411–434.
http://doi.org/10.1207/s15328007sem1203_4
Berk, R. A. (2009). Statistical learning from a regression perspective. New York: Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
http://doi.org/10.1007/BF00058655
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
http://doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and
regression trees. Pacific Grove, CA: Wadsworth.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351.
http://doi.org/10.1037/1082-989X.6.4.330
Doove, L. L., van Buuren, S., & Dusseldorp, E. (2014). Recursive partitioning for missing data
imputation in the presence of interaction effects. Computational Statistics & Data Analysis,
72, 92–104. http://doi.org/10.1016/j.csda.2013.10.025
Efron, B., & Tibshirani, R. (1993). An Introduction to the bootstrap. New York: Chapman &
Hall.
Enders, C. K. (2001). The impact of nonnormality on full information maximum-likelihood
estimation for structural equation models with missing data. Psychological Methods, 6(4),
352–370. http://doi.org/10.1037/1082-989X.6.4.352
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika,
43(4), 521–532. http://doi.org/10.1007/BF02293811
Graham, J. W., & Schafer, J. L. (1999). On the performance of multiple imputation for
multivariate data with small sample size. In R. H. Hoyle (Ed.), Statistical strategies for
small sample research (pp. 1–27). Thousand Oaks, CA: Sage.
Groenwold, R. H. H., Donders, A. R. T., Roes, K. C. B., Harrell, F. E., & Moons, K. G. M.
(2012). Dealing with missing outcome data in randomized trials and observational studies.
American Journal of Epidemiology, 175(3), 210–217. http://doi.org/10.1093/aje/kwr302
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning. New
York: Springer-Verlag.
Hayes, A. F. (2013). An introduction to mediation, moderation, and conditional process
analysis: A regression-based approach. New York: Guilford Press.
Hayes, T., & McArdle, J. J. (2017). Investigating the performance of CART- and random forest-
based procedures for dealing with longitudinal dropout in small sample designs under
MNAR missing data. In Advances in Longitudinal Models For Multivariate Psychology: A
Festschrift for Jack McArdle.
Hayes, T., Usami, S., Jacobucci, R., & McArdle, J. J. (2015). Using Classification and
Regression Trees (CART) and random forests to analyze attrition: Results from two
Simulations. Psychology and Aging, 30(4), 911–929. http://doi.org/10.1037/pag0000046
Kish, L. (1995). Methods for design effects. Journal of Official Statistics, 11(1), 55–77.
Retrieved from http://www.jos.nu/Articles/abstract.asp?article=11155
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3),
12–22.
McArdle, J. J. (2013). Dealing with longitudinal attrition using logistic regression and decision
tree analyses. In Contemporary issues in exploratory data mining in the behavioral sciences
(pp. 282–311). New York: Routledge.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures.
Psychological Bulletin, 105(1), 156–166. http://doi.org/10.1037/0033-2909.105.1.156
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are
not missing completely at random. Psychometrika, 52(3), 431–462.
http://doi.org/10.1007/BF02294365
Muthén, L. K., & Muthén, B. (2011). Mplus user’s guide (6th ed.). Los Angeles, CA: Muthén & Muthén.
Oberski, D. (2014). lavaan.survey: An R package for complex survey analysis of structural
equation models. Journal of Statistical Software, 57(1), 1–27.
http://doi.org/10.18637/jss.v057.i01
Potthoff, R. F., Woodbury, M. A., & Manton, K. G. (1992). “Equivalent sample size” and
“equivalent degrees of freedom” refinements for inference using survey weights under
superpopulation models. Journal of the American Statistical Association, 87(418), 383–396.
http://doi.org/10.1080/01621459.1992.10475218
R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria:
R Foundation for Statistical Computing. Retrieved from http://r-project.org/
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of
Statistical Software, 48(2), 1–36. http://doi.org/10.18637/jss.v048.i02
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
http://doi.org/10.2307/2335739
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the
power of studies? Psychological Bulletin, 105(2), 309–316. http://doi.org/10.1037/0033-2909.105.2.309
Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., & Hemingway, H. (2014). Comparison
of random forest and parametric imputation models for imputing missing data using MICE:
A CALIBER study. American Journal of Epidemiology, 179(6), 764–774.
http://doi.org/10.1093/aje/kwt312
Stapleton, L. M. (2002). The incorporation of sample weights into multilevel structural equation
models. Structural Equation Modeling: A Multidisciplinary Journal, 9(4), 475–502.
http://doi.org/10.1207/S15328007SEM0904_2
Therneau, T., Atkinson, B., & Ripley, B. (2014). rpart: Recursive partitioning and regression
trees [R package].
van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional
specification. Statistical Methods in Medical Research, 16, 219–242.
http://doi.org/10.1177/0962280206074463
van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., & Rubin, D. B. (2006). Fully
conditional specification in multivariate imputation. Journal of Statistical Computation and
Simulation, 76(12), 1049–1064. http://doi.org/10.1080/10629360600810434
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45(3), 1–67.
http://doi.org/10.18637/jss.v045.i03
von Hippel, P. T. (2007). Regression with missing Ys: An improved strategy for analyzing
multiply imputed data. Sociological Methodology, 37(1), 83–117.
http://doi.org/10.1111/j.1467-9531.2007.00180.x
Table 3.1: Means (SDs) of Percent Bias – Normal Data, 30% Missing Data Rate

             Complete        Listwise        True WT         Log WT          CART WT         Prune WT        RF WT           Norm MI         CART MI         RF MI

N = 125
Linear       -0.24 (32.56)   -29.48 (36.58)  -12.50 (47.89)  -11.40 (48.58)  -11.00 (49.79)  -15.24 (47.65)   -2.18 (57.86)  -28.24 (39.82)  -26.98 (40.45)  -36.16 (32.80)
Quadratic    -3.21 (30.98)   -37.89 (33.76)  -16.98 (55.62)  -29.77 (40.82)  -24.32 (56.06)  -25.48 (50.97)  -10.88 (58.58)  -38.15 (36.21)  -32.36 (43.44)  -39.93 (32.11)
Cubic        -0.78 (36.38)   -37.21 (38.32)  -30.36 (49.40)  -30.88 (44.60)  -27.96 (61.21)  -27.03 (60.01)  -23.88 (54.71)  -37.36 (42.13)  -32.72 (45.09)  -42.54 (35.87)
Interaction  -3.99 (30.82)   -70.28 (39.95)  -36.92 (67.23)  -70.40 (43.66)  -48.36 (65.46)  -50.04 (62.68)  -42.37 (57.49)  -73.00 (41.28)  -63.50 (52.59)  -69.83 (36.40)

N = 250
Linear        3.35 (22.50)   -24.23 (25.88)   -0.46 (44.47)   -1.16 (46.84)   -3.71 (37.71)   -8.28 (34.41)    5.03 (49.52)  -23.11 (27.53)   -6.04 (32.90)  -21.89 (26.85)
Quadratic    -0.33 (22.72)   -32.15 (25.11)   -4.64 (51.05)  -23.61 (31.44)  -14.06 (43.84)  -16.14 (36.95)   -2.43 (53.89)  -32.31 (27.68)  -10.51 (32.47)  -25.49 (25.90)
Cubic         0.25 (22.21)   -39.87 (25.33)  -26.64 (45.09)  -30.02 (33.02)  -21.50 (50.86)  -21.70 (45.09)  -16.01 (56.89)  -39.41 (27.14)  -20.98 (35.34)  -34.72 (27.08)
Interaction   0.19 (21.73)   -67.11 (24.69)  -31.23 (54.02)  -65.80 (28.01)  -41.01 (48.17)  -42.56 (48.12)  -32.09 (50.13)  -66.63 (27.97)  -43.66 (39.64)  -54.36 (24.88)

N = 500
Linear        2.79 (15.47)   -23.40 (18.54)   -0.35 (30.93)   -0.54 (30.98)   -3.46 (25.76)   -6.36 (23.98)   11.68 (47.15)  -22.43 (19.50)    0.28 (23.19)  -12.31 (18.95)
Quadratic    -0.86 (16.12)   -33.63 (18.00)   -9.70 (33.69)  -25.62 (22.13)  -16.37 (28.49)  -17.73 (26.81)   -6.92 (42.20)  -33.88 (19.41)   -8.08 (24.63)  -19.30 (19.37)
Cubic        -1.03 (16.33)   -38.92 (18.19)  -24.44 (39.18)  -30.89 (24.23)  -24.48 (33.67)  -24.82 (32.65)  -17.92 (52.35)  -38.94 (19.16)  -14.50 (30.79)  -27.72 (21.88)
Interaction  -0.81 (15.50)   -67.05 (19.18)  -29.30 (43.16)  -65.41 (21.34)  -39.50 (31.91)  -40.24 (31.75)  -26.50 (47.57)  -66.64 (21.53)  -27.95 (30.40)  -44.48 (21.20)

N = 1000
Linear        1.00 (10.88)   -24.22 (12.80)   -2.57 (22.39)   -2.36 (21.92)   -9.63 (16.29)  -10.42 (16.37)    7.58 (40.65)  -24.62 (13.77)   -0.05 (16.77)   -8.21 (14.08)
Quadratic     0.50 (10.99)   -34.01 (12.36)   -6.20 (32.39)  -25.36 (15.18)  -18.88 (16.90)  -19.54 (16.98)    2.58 (41.10)  -33.67 (13.00)   -7.67 (19.03)  -13.99 (14.26)
Cubic         1.17 (11.20)   -37.53 (12.28)  -22.25 (38.84)  -29.94 (16.56)  -24.06 (20.74)  -24.20 (20.84)  -11.27 (60.41)  -37.40 (13.73)  -10.63 (20.17)  -20.47 (15.03)
Interaction   0.24 (10.66)   -66.82 (13.88)  -25.12 (40.58)  -65.79 (15.21)  -41.63 (24.01)  -41.65 (23.90)  -20.80 (36.27)  -66.57 (14.32)  -23.10 (22.62)  -35.68 (15.35)

Note: Complete = complete datasets (no missing data), Listwise = listwise + covariates, True = true weights, Log = logistic weights, CART = Classification and
Regression Trees, Prune = pruned CART, RF = Random Forest, Norm = “norm” Bayesian Regression. WT = inverse weighting columns; MI = multiple imputation columns.
Table 3.2: Means (SDs) of Percent Bias – Severe Nonnormal Data, 30% Missing Data Rate

             Complete        Listwise        True WT         Log WT          CART WT         Prune WT        RF WT           Norm MI         CART MI         RF MI

N = 125
Linear        2.14 (35.56)   -18.93 (39.62)    2.01 (44.32)    2.13 (44.86)    1.34 (46.38)   -0.41 (44.37)    4.25 (44.92)  -18.21 (41.32)   -0.83 (42.23)   -9.97 (38.13)
Quadratic    -0.20 (33.85)   -23.89 (34.01)   -7.37 (43.83)  -19.58 (37.81)  -10.08 (42.91)  -11.74 (39.61)   -1.94 (48.38)  -25.40 (38.15)  -18.78 (37.73)  -24.47 (32.50)
Cubic        -2.39 (34.79)   -29.79 (35.78)   -6.80 (46.68)  -12.84 (38.73)  -10.32 (39.83)   -9.81 (38.25)   -5.87 (41.61)  -28.59 (38.45)  -10.28 (36.95)  -19.53 (35.78)
Interaction   2.98 (33.40)   -45.35 (41.86)  -25.02 (83.13)  -45.87 (44.14)  -21.95 (54.47)  -25.23 (53.11)  -16.06 (61.11)  -47.86 (44.82)  -37.47 (43.69)  -47.16 (34.10)

N = 250
Linear       -1.72 (24.00)   -20.93 (24.67)   -4.45 (27.78)   -3.87 (27.36)   -4.71 (28.10)   -6.69 (27.43)   -0.98 (29.61)  -21.28 (28.21)   -4.55 (26.69)  -10.39 (26.08)
Quadratic     0.67 (22.77)   -24.03 (23.67)   -4.53 (35.59)  -18.81 (26.36)   -6.81 (36.42)   -9.29 (32.82)   -2.01 (39.85)  -24.96 (25.63)  -13.55 (28.59)  -18.12 (24.90)
Cubic        -0.55 (22.23)   -26.50 (23.27)   -5.15 (33.62)   -9.76 (26.34)   -5.16 (29.29)   -7.37 (26.23)    0.78 (32.08)  -26.12 (25.04)   -4.60 (25.86)  -11.95 (24.07)
Interaction   3.43 (20.94)   -50.38 (26.63)  -15.60 (39.68)  -46.98 (27.77)  -26.16 (38.24)  -26.75 (38.05)  -14.49 (41.99)  -49.45 (27.09)  -30.97 (30.76)  -39.87 (23.07)

N = 500
Linear       -1.12 (17.15)   -18.38 (18.07)   -1.66 (20.06)   -1.50 (19.88)   -2.25 (19.93)   -3.51 (19.41)    3.13 (24.26)  -18.71 (19.61)   -1.44 (19.44)   -5.42 (18.69)
Quadratic    -0.68 (16.25)   -25.59 (15.84)   -6.18 (23.34)  -20.49 (17.03)   -7.12 (22.90)   -9.09 (21.60)   -1.93 (25.72)  -26.02 (17.73)  -10.05 (21.12)  -15.31 (17.56)
Cubic        -1.29 (15.42)   -25.87 (15.30)   -5.10 (23.67)  -10.65 (18.22)   -8.33 (18.96)   -8.93 (18.98)    1.16 (26.31)  -26.55 (17.64)   -3.28 (19.59)   -9.23 (17.35)
Interaction   0.03 (15.68)   -50.04 (21.84)  -16.38 (65.77)  -47.70 (22.42)  -25.56 (32.21)  -25.96 (31.92)   -7.93 (57.22)  -50.35 (22.81)  -29.77 (26.84)  -35.33 (19.24)

N = 1000
Linear        0.18 (12.37)   -18.83 (12.25)   -1.15 (13.52)   -1.11 (13.48)   -2.62 (13.09)   -3.21 (13.24)    2.01 (18.19)  -18.46 (13.81)    1.01 (13.31)   -2.91 (12.79)
Quadratic     0.26 (11.55)   -25.20 (11.13)   -5.95 (21.17)  -20.05 (12.47)   -7.79 (16.84)   -8.56 (16.61)    0.11 (25.87)  -25.04 (12.65)   -7.80 (17.30)  -11.36 (13.88)
Cubic         0.44 (10.21)   -24.22 (10.81)   -2.46 (22.15)   -9.01 (12.61)   -7.40 (12.78)   -7.60 (12.83)    3.29 (29.94)  -24.43 (11.61)    0.15 (14.98)   -4.87 (12.71)
Interaction   1.06 (10.55)   -49.40 (14.42)  -17.62 (50.12)  -46.88 (14.97)  -26.00 (27.34)  -26.21 (27.22)   -7.86 (46.91)  -48.85 (15.25)  -23.76 (21.71)  -29.95 (15.59)

Note: Complete = complete datasets (no missing data), Listwise = listwise + covariates, True = true weights, Log = logistic weights, CART = Classification and
Regression Trees, Prune = pruned CART, RF = Random Forest, Norm = “norm” Bayesian Regression. WT = inverse weighting columns; MI = multiple imputation columns.
Table 3.3. Key Pairwise Comparisons of Random Forest Weights and CART MI – 30% Missing Data, Normal Data
Linear Quadratic Cubic Interaction
t p d t p d t p d t p d
N = 125
RF WT vs
Listwise 8.02 < .001 0.53 7.85 < .001 0.52 4.17 < .001 0.27 9.31 < .001 0.53
True TW 3.94 .001 0.19 2.06 .286 0.11 2.90 .029 0.12 -1.72 .604 -0.09
Log WT 3.47 .005 0.17 7.08 < .001 0.34 3.10 .015 0.13 11.21 < .001 0.51
Norm 7.94 < .001 0.50 8.18 < .001 0.52 4.29 < .001 0.27 10.32 < .001 0.58
CART MI 6.82 < .001 0.48 5.92 < .001 0.41 2.53 .085 0.17 6.31 < .001 0.38
CART MI vs
Listwise 1.05 1.00 0.06 2.56 .079 0.14 1.95 .368 0.11 2.69 .054 0.14
Norm 0.54 1.00 0.03 2.94 .026 0.14 2.14 .237 0.11 4.02 .001 0.19
N = 250
RF WT vs
Listwise 9.71 < .001 0.68 8.82 < .001 0.64 6.67 < .001 0.49 11.40 < .001 0.81
True TW 2.28 .167 0.12 0.75 1.00 0.04 4.26 < .001 0.20 -0.34 1.00 -0.02
Log WT 2.47 .099 0.13 7.62 < .001 0.42 5.03 < .001 0.26 11.97 < .001 0.75
Norm 9.77 < .001 0.64 9.19 < .001 0.64 6.73 < .001 0.47 11.22 < .001 0.79
CART MI 3.86 .001 0.25 2.62 .066 0.17 1.35 1.00 0.10 3.47 .004 0.25
CART MI vs
Listwise 10.48 < .001 0.60 13.53 < .001 0.72 10.40 < .001 0.58 10.95 < .001 0.65
Norm 10.42 < .001 0.55 15.37 < .001 0.71 11.02 < .001 0.56 10.68 < .001 0.64
N = 500
RF WT vs
Listwise 11.03 < .001 0.92 9.54 < .001 0.77 6.09 < .001 0.49 12.84 < .001 1.04
True TW 4.49 < .001 0.28 1.28 1.00 0.07 2.41 .118 0.14 1.17 1.00 0.06
Log WT 4.62 < .001 0.29 7.84 < .001 0.49 4.41 < .001 0.27 12.93 < .001 0.96
Norm 10.81 < .001 0.88 9.62 < .001 0.77 6.23 < .001 0.48 13.01 < .001 1.01
CART MI 3.72 .002 0.29 0.44 1.00 0.03 -1.05 1.00 -0.08 0.46 1.00 0.04
CART MI vs
Listwise 18.95 < .001 1.10 19.82 < .001 1.13 14.52 < .001 0.88 22.08 < .001 1.45
Norm 20.02 < .001 1.04 20.84 < .001 1.13 14.99 < .001 0.87 23.32 < .001 1.40
N = 1000
RF WT vs
Listwise 11.81 < .001 0.95 13.11 < .001 1.11 6.26 < .001 0.56 18.7 < .001 1.57
True TW 4.03 .001 0.29 3.53 .004 0.23 2.88 .031 0.21 1.61 .768 0.11
Log WT 3.98 .001 0.28 10.93 < .001 0.78 4.68 < .001 0.37 19.40 < .001 1.47
Norm 12.15 < .001 0.95 13.08 < .001 1.09 6.23 < .001 0.56 18.85 < .001 1.54
CART MI 2.94 .026 0.22 3.87 .001 0.30 -0.15 1.00 -0.01 0.97 1.00 0.07
CART MI vs
Listwise 23.85 < .001 1.58 24.64 < .001 1.54 22.90 < .001 1.5 32.96 < .001 2.18
Norm 25.81 < .001 1.58 26.32 < .001 1.49 22.99 < .001 1.48 34.27 < .001 2.14
Note: Listwise = listwise + covariates, WT = weights, Log = logistic, CART = Classification and Regression Trees, RF = Random
Forests, Norm = Bayesian Regression, MI = multiple imputation, d = Cohen’s d. df = 199. All p-values are Bonferroni corrected.
Table 3.4. Key Pairwise Comparisons of Random Forest Weights and CART MI – 30% Missing Data, Severe Nonnormal Data
Linear Quadratic Cubic Interaction
t p d t p d t p d t p d
N = 125
RF WT vs
Listwise 12.13 < .001 0.54 9.13 < .001 0.49 11.74 < .001 0.61 9.72 < .001 0.52
True TW 1.81 .503 0.05 2.60 .069 0.12 0.57 1.00 0.02 1.99 .332 0.12
Log WT 1.80 .514 0.05 8.46 < .001 0.38 5.81 < .001 0.17 10.69 < .001 0.52
Norm 14.69 < .001 0.51 9.94 < .001 0.52 13.28 < .001 0.56 10.19 < .001 0.56
CART MI 3.26 .009 0.12 6.37 < .001 0.38 2.34 .141 0.11 6.17 < .001 0.39
CART MI vs
Listwise 10.23 < .001 0.44 2.98 .023 0.14 12.38 < .001 0.54 3.24 .010 0.18
Norm 13.79 < .001 0.42 3.75 .002 0.17 14.27 < .001 0.48 4.60 < .001 0.23
N = 250
RF WT vs
Listwise 14.5 < .001 0.71 10.64 < .001 0.60 16.59 < .001 0.92 13.49 < .001 0.98
True TW 3.89 .001 0.12 1.50 .942 0.07 4.25 < .001 0.18 0.52 1.00 0.03
Log WT 3.31 .008 0.10 9.03 < .001 0.45 9.10 < .001 0.34 12.74 < .001 0.87
Norm 17.29 < .001 0.70 11.59 < .001 0.62 18.26 < .001 0.89 13.21 < .001 0.95
CART MI 3.15 .013 0.12 5.42 < .001 0.32 3.69 .002 0.18 5.76 < .001 0.44
CART MI
vs
Listwise 14.01 < .001 0.63 7.38 < .001 0.39 19.20 < .001 0.88 10.67 < .001 0.67
Norm 19.53 < .001 0.61 9.31 < .001 0.42 22.95 < .001 0.84 9.79 < .001 0.63
N = 500
RF WT vs
Listwise 16.77 < .001 0.97 16.47 < .001 1.02 16.47 < .001 1.19 10.96 < .001 0.90
True TW 4.74 < .001 0.21 3.13 .014 0.17 4.43 < .001 0.25 2.65 .061 0.14
Log WT 4.74 < .001 0.20 13.82 < .001 0.79 8.60 < .001 0.49 10.47 < .001 0.85
Norm 18.9 < .001 0.96 16.64 < .001 1.04 17.99 < .001 1.18 10.99 < .001 0.91
CART MI 3.94 .001 0.20 5.50 < .001 0.34 2.87 .032 0.19 5.59 < .001 0.47
CART MI
vs
Listwise 22.01 < .001 0.90 15.64 < .001 0.79 24.83 < .001 1.23 14.47 < .001 0.81
Norm 29.10 < .001 0.88 16.64 < .001 0.80 30.42 < .001 1.23 14.43 < .001 0.82
N = 1000
RF WT vs
Listwise 20.88 < .001 1.27 15.35 < .001 1.16 14.67 < .001 1.06 12.68 < .001 1.14
True TW 3.57 .003 0.19 3.72 .002 0.25 3.51 .004 0.21 2.60 .069 0.20
Log WT 3.51 .004 0.18 12.69 < .001 0.90 7.05 < .001 0.45 12.3 < .001 1.03
Norm 21.20 < .001 1.22 15.91 < .001 1.12 14.91 < .001 1.07 12.54 < .001 1.12
CART MI 1.08 1.00 0.06 4.60 < .001 0.35 1.77 .548 0.12 4.70 < .001 0.42
CART MI
vs
Listwise 37.12 < .001 1.54 18.70 < .001 1.11 32.95 < .001 1.76 21.04 < .001 1.31
Norm 46.26 < .001 1.43 20.69 < .001 1.07 39.25 < .001 1.72 20.81 < .001 1.28
Note: Listwise = listwise + covariates, WT = weights, Log = logistic, CART = Classification and Regression Trees, RF = Random
Forests, Norm = Bayesian Regression, MI = multiple imputation, d = Cohen’s d. df = 199. All p-values are Bonferroni corrected.
Table 3.5
Multilevel Regression of Marginal Percent Bias on Missing Data Method (CART MI vs. RF
Weights) and Sample Size – 30% Missing Data Conditions
Normal Data Severe Nonnormal Data
Coefficient b SE t p b SE t p
Intercept -19.83 7.34 -2.7 .007 -4.91 4.88 -1.01 .315
cartMIDummy -19.06 2.09 -9.11 < .001 -11.94 1.64 -7.26 < .001
N250 8.45 2.09 4.04 < .001 0.73 1.64 0.44 .657
N500 9.91 2.09 4.74 < .001 3.51 1.64 2.14 .033
N1000 14.35 2.09 6.86 < .001 4.29 1.64 2.61 .009
cartMIDummyxN250 10.14 2.96 3.43 .001 2.7 2.33 1.16 .246
cartMIDummyxN500 16.42 2.96 5.55 < .001 2.19 2.33 0.94 .346
cartMIDummyxN1000 14.18 2.96 4.79 < .001 4.95 2.33 2.13 .033
Note: cartMIDummy is a dummy-coded variable where CART MI = 1 and RF Weights = 0.
N250, N500, and N1000 are dummy-coded indicator variables for the N = 250, N = 500, and N
= 1000 sample sizes, respectively. Because each missing data method was fit to the same
datasets, the multilevel regressions were specified allowing random intercepts for simulation
iteration, nested within missing data generation condition (linear, quadratic, cubic,
interaction).
Table 3.6
Confidence Interval Coverage Rates and [CI Widths] by Sample Size (N), Missing Data Estimator, and Function Form of Missing Data Generation Model –
Normal Data, 30% Missing Data Rate
Inverse weighting Multiple imputation
Complete Listwise True Log CART Prune RF Norm CART RF
N = 125
Linear 0.96 [0.25] 0.88 [0.28] 0.90 [0.36] 0.92 [0.37] 0.92 [0.37] 0.92 [0.36] 0.91 [0.38] 0.87 [0.31] 0.82 [0.27] 0.86 [0.30]
Quadratic 0.95 [0.25] 0.82 [0.27] 0.86 [0.35] 0.88 [0.33] 0.87 [0.36] 0.89 [0.36] 0.87 [0.36] 0.84 [0.30] 0.77 [0.27] 0.84 [0.29]
Cubic 0.92 [0.25] 0.81 [0.28] 0.86 [0.34] 0.85 [0.33] 0.88 [0.38] 0.88 [0.38] 0.89 [0.36] 0.83 [0.31] 0.77 [0.26] 0.79 [0.29]
Interaction 0.94 [0.25] 0.54 [0.29] 0.78 [0.38] 0.56 [0.32] 0.76 [0.41] 0.74 [0.40] 0.80 [0.38] 0.58 [0.32] 0.49 [0.26] 0.52 [0.30]
N = 250
Linear 0.94 [0.18] 0.85 [0.20] 0.90 [0.29] 0.88 [0.29] 0.94 [0.29] 0.93 [0.26] 0.90 [0.31] 0.86 [0.22] 0.86 [0.19] 0.91 [0.21]
Quadratic 0.94 [0.18] 0.74 [0.19] 0.85 [0.27] 0.85 [0.23] 0.89 [0.28] 0.92 [0.26] 0.88 [0.29] 0.74 [0.21] 0.80 [0.18] 0.84 [0.21]
Cubic 0.94 [0.18] 0.64 [0.20] 0.78 [0.27] 0.78 [0.24] 0.84 [0.30] 0.86 [0.30] 0.78 [0.28] 0.71 [0.21] 0.74 [0.18] 0.72 [0.21]
Interaction 0.96 [0.18] 0.25 [0.20] 0.74 [0.30] 0.36 [0.22] 0.79 [0.32] 0.79 [0.32] 0.74 [0.28] 0.31 [0.22] 0.54 [0.18] 0.46 [0.21]
N = 500
Linear 0.94 [0.12] 0.75 [0.14] 0.92 [0.21] 0.94 [0.22] 0.94 [0.19] 0.94 [0.18] 0.90 [0.26] 0.80 [0.15] 0.82 [0.13] 0.90 [0.15]
Quadratic 0.94 [0.12] 0.48 [0.14] 0.86 [0.20] 0.74 [0.16] 0.88 [0.20] 0.88 [0.19] 0.84 [0.23] 0.57 [0.15] 0.80 [0.13] 0.81 [0.14]
Cubic 0.94 [0.12] 0.38 [0.14] 0.74 [0.21] 0.68 [0.17] 0.83 [0.23] 0.84 [0.23] 0.76 [0.24] 0.44 [0.15] 0.66 [0.13] 0.68 [0.15]
Interaction 0.97 [0.12] 0.06 [0.14] 0.72 [0.25] 0.09 [0.16] 0.72 [0.25] 0.72 [0.25] 0.74 [0.24] 0.10 [0.16] 0.53 [0.13] 0.40 [0.15]
N = 1000
Linear 0.96 [0.09] 0.55 [0.10] 0.96 [0.16] 0.94 [0.16] 0.91 [0.13] 0.90 [0.13] 0.86 [0.20] 0.60 [0.11] 0.83 [0.09] 0.90 [0.10]
Quadratic 0.96 [0.09] 0.18 [0.10] 0.84 [0.18] 0.61 [0.12] 0.77 [0.13] 0.73 [0.13] 0.86 [0.21] 0.30 [0.10] 0.75 [0.09] 0.79 [0.10]
Cubic 0.96 [0.09] 0.14 [0.10] 0.65 [0.18] 0.54 [0.12] 0.76 [0.16] 0.75 [0.16] 0.70 [0.22] 0.20 [0.11] 0.71 [0.09] 0.68 [0.11]
Interaction 0.96 [0.09] 0.00 [0.10] 0.68 [0.21] 0.00 [0.11] 0.52 [0.18] 0.52 [0.18] 0.69 [0.20] 0.00 [0.11] 0.48 [0.09] 0.31 [0.11]
Note: Complete = complete datasets (no missing data), Listwise = listwise + covariates, True = true weights, Log = logistic weights, CART = Classification
and Regression Trees, Prune = pruned CART, RF = Random Forest, Norm = “norm” Bayesian Regression.
Table 3.7
Confidence Interval Coverage Rates and [CI Widths] by Sample Size (N), Missing Data Estimator, and Function Form of Missing Data Generation Model –
Severe Nonnormal Data, 30% Missing Data Rate
Inverse weighting Multiple imputation
Complete Listwise True Log CART Prune RF Norm CART RF
N = 125
Linear 0.94 [0.27] 0.89 [0.28] 0.90 [0.29] 0.92 [0.30] 0.90 [0.30] 0.91 [0.29] 0.92 [0.30] 0.92 [0.31] 0.90 [0.28] 0.94 [0.29]
Quadratic 0.94 [0.26] 0.91 [0.28] 0.94 [0.30] 0.91 [0.27] 0.92 [0.30] 0.92 [0.29] 0.91 [0.30] 0.88 [0.31] 0.92 [0.28] 0.93 [0.30]
Cubic 0.96 [0.26] 0.88 [0.26] 0.93 [0.29] 0.92 [0.28] 0.94 [0.29] 0.94 [0.28] 0.94 [0.29] 0.91 [0.30] 0.95 [0.27] 0.95 [0.29]
Interaction 0.96 [0.26] 0.78 [0.32] 0.82 [0.36] 0.76 [0.31] 0.88 [0.36] 0.86 [0.36] 0.84 [0.35] 0.80 [0.36] 0.74 [0.29] 0.82 [0.32]
N = 250
Linear 0.96 [0.18] 0.84 [0.19] 0.92 [0.20] 0.92 [0.20] 0.92 [0.20] 0.91 [0.20] 0.92 [0.21] 0.87 [0.21] 0.94 [0.19] 0.92 [0.20]
Quadratic 0.95 [0.18] 0.85 [0.19] 0.96 [0.22] 0.90 [0.19] 0.98 [0.22] 0.96 [0.21] 0.96 [0.22] 0.86 [0.21] 0.89 [0.19] 0.93 [0.22]
Cubic 0.96 [0.18] 0.80 [0.18] 0.92 [0.21] 0.92 [0.19] 0.94 [0.21] 0.94 [0.20] 0.94 [0.22] 0.85 [0.20] 0.90 [0.18] 0.94 [0.19]
Interaction 0.95 [0.18] 0.60 [0.23] 0.87 [0.27] 0.62 [0.23] 0.84 [0.27] 0.83 [0.26] 0.88 [0.28] 0.69 [0.25] 0.65 [0.19] 0.77 [0.22]
N = 500
Linear 0.93 [0.13] 0.79 [0.13] 0.97 [0.14] 0.98 [0.14] 0.94 [0.14] 0.94 [0.14] 0.94 [0.16] 0.83 [0.15] 0.92 [0.13] 0.92 [0.13]
Quadratic 0.96 [0.13] 0.70 [0.13] 0.92 [0.16] 0.80 [0.14] 0.92 [0.16] 0.90 [0.15] 0.94 [0.17] 0.72 [0.15] 0.86 [0.14] 0.94 [0.15]
Cubic 0.96 [0.13] 0.69 [0.13] 0.90 [0.15] 0.88 [0.14] 0.92 [0.14] 0.90 [0.14] 0.91 [0.17] 0.68 [0.14] 0.90 [0.13] 0.90 [0.14]
Interaction 0.96 [0.13] 0.30 [0.16] 0.83 [0.24] 0.36 [0.17] 0.80 [0.21] 0.80 [0.21] 0.84 [0.24] 0.38 [0.17] 0.52 [0.14] 0.60 [0.16]
N = 1000
Linear 0.92 [0.09] 0.65 [0.09] 0.94 [0.10] 0.95 [0.10] 0.94 [0.10] 0.93 [0.10] 0.94 [0.11] 0.74 [0.10] 0.91 [0.09] 0.92 [0.09]
Quadratic 0.95 [0.09] 0.44 [0.09] 0.90 [0.12] 0.62 [0.09] 0.83 [0.11] 0.84 [0.11] 0.91 [0.13] 0.52 [0.10] 0.77 [0.10] 0.88 [0.11]
Cubic 0.96 [0.09] 0.42 [0.09] 0.94 [0.12] 0.89 [0.10] 0.90 [0.10] 0.90 [0.10] 0.88 [0.14] 0.56 [0.10] 0.86 [0.09] 0.90 [0.10]
Interaction 0.95 [0.09] 0.08 [0.11] 0.82 [0.19] 0.08 [0.11] 0.75 [0.17] 0.75 [0.17] 0.85 [0.20] 0.12 [0.12] 0.46 [0.09] 0.46 [0.11]
Note: Complete = complete datasets (no missing data), Listwise = listwise + covariates, True = true weights, Log = logistic weights, CART = Classification
and Regression Trees, Prune = pruned CART, RF = Random Forest, Norm = “norm” Bayesian Regression.
Figure 3.1. Example tree diagram.
Figure 3.2. Percent bias of key estimators (complete data, random forest weights, CART multiple imputation) by normality and missing data generation conditions – 30% missing data. [Figure: panels cross normality (Normal Data, Severe Nonnormal Data) with sample size (N = 125, 250, 500, 1000); x-axis: Percent Bias (−80 to 20); y-axis: Parameter (Linear, Quadratic, Cubic, Interaction); legend: Complete, Listwise + Covariates, RF Weights, CART MI.]
Figure 3.3. Percent bias (expected) as a function of sample size (N) and missing data estimator – 30% missingness, normal data.
Figure 3.4. Confidence interval coverage of key estimators (complete data, random forest weights, CART multiple imputation) by normality and missing data generation conditions – 30% missing data. [Figure: panels cross normality (Normal Data, Severe Nonnormal Data) with sample size (N = 125, 250, 500, 1000); x-axis: Coverage (0.05 to 0.95); y-axis: Parameter (Linear, Quadratic, Cubic, Interaction); legend: Complete, Listwise + Covariates, RF Weights, CART MI.]
Chapter 4: Investigating the Performance of CART- and Random Forest-Based Procedures for Dealing with Longitudinal Dropout in Small Sample Designs Under MNAR Missing Data: An Initial Study¹

Timothy Hayes & John J. McArdle
Quantitative researchers have begun to consider ways of utilizing exploratory data
mining techniques to address missing data. Although many algorithms could conceivably prove
useful for this goal, Classification and Regression Trees (CART; Breiman, Friedman, Olshen, &
Stone, 1984) and Random Forests (Breiman, 2001) seem particularly promising, based on their
ability to model complex, nonlinear, interactive relationships among missing data correlates.
Recently, two strategies have been proposed for using CART and random forests to address
missing data. The first strategy uses CART to predict who in a dataset is most likely to go
missing, with the goal of creating inverse data weights (Hayes, Usami, Jacobucci, & McArdle,
2015; McArdle, 2013). The second strategy uses CART to predict people’s scores on missing
outcome variables, with the goal of generating multiple imputations (Doove, van Buuren, &
Dusseldorp, 2014; Shah, Bartlett, Carpenter, Nicholas, & Hemingway, 2014; van Buuren, 2012).
This chapter provides a brief overview of these two missing data methods and describes a
simulation designed to assess whether CART-based weights and CART-based multiple
imputations can provide relief in small sample studies with outcome-dependent (Missing Not at
Random, or MNAR) missing data (Rubin, 1976). The next section provides a brief conceptual
overview of the aspects of these techniques most relevant to addressing missing data (for
accessible introductions, see Berk, 2009; James, Witten, Hastie, & Tibshirani, 2013; Strobl, Malley, & Tutz, 2009). We then describe how CART and random forests can be utilized to address missing data.

¹ Chapter under revision for inclusion in: Emilio Ferrer, Steven M. Boker, & Kevin J. Grimm (Eds.), Advances in Longitudinal Models for Multivariate Psychology: A Festschrift for Jack McArdle (working title). New York: Taylor & Francis.
Introduction to CART and Random Forests
The goal of a CART analysis is to use the values of a set of observed predictors to split
the dataset into homogeneous (or more pure/less impure) subgroups with respect to a single
outcome variable, y (Breiman et al., 1984). A homogeneous group could either be a group in
which the majority of group members share the same category membership on a categorical
outcome, or a group in which the majority of members share similar scores on a continuous
outcome, such that their scores are tightly clustered around the group’s mean. The results of a
CART analysis are visualized as a tree diagram, as displayed in Figure 4.1. Here, the subgroups
created by CART are depicted as the terminal nodes at the bottom of the tree.
Two aspects of this picture are worth noting. First, because each successive split in the
tree depends on the split that came before it, CART trees are fundamentally interactive and
nonlinear. Thus, on the left-hand side of the diagram we see a moderated effect in which the
prediction for age among people younger than 65 depends upon whether they fall above or below
a cutoff on a depression measure. On the right-hand side of the diagram, we see a nonlinear
effect, in which the prediction for age is not constant across levels of the variable, but differs by
subgroup.
The second aspect to note about Figure 4.1 is that each terminal node receives a predicted value which can be applied to each person in the group. If y is continuous, the predicted value is simply the average value in the node – the group mean, ȳ_node. If y is categorical, however, the predicted value could either be a predicted probability of being classified as a given class, simply defined as the proportion of people in the node who are members of that class, or it could be a predicted class membership that is defined by majority vote as the class with the most members in the node. Thus, if three-quarters of the members of Node 1 are classified as class k, the predicted probability of being class k is equal to .75 and the predicted class in the node is class k.
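These node-level prediction rules can be sketched in a few lines. The following Python sketch is purely illustrative (the analyses in this dissertation used R); the function names are ours:

```python
import numpy as np
from collections import Counter

def node_prediction_continuous(y_node):
    """Continuous y: the predicted value is simply the node mean."""
    return float(np.mean(y_node))

def node_prediction_categorical(y_node):
    """Categorical y: class proportions give predicted probabilities;
    the majority class gives the predicted class membership."""
    counts = Counter(y_node)
    probs = {cls: n / len(y_node) for cls, n in counts.items()}
    majority_class = max(counts, key=counts.get)
    return probs, majority_class

# A node in which three-quarters of members belong to class "A":
probs, pred = node_prediction_categorical(["A", "A", "A", "B"])
print(probs["A"], pred)
```

This reproduces the example above: the predicted probability of class "A" is .75 and the predicted class is "A".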
Left unchecked, CART analyses tend to result in large, unstable trees with very small subgroups that overfit the data (Hastie, Tibshirani, & Friedman, 2009). Two extensions of
CART address this shortcoming and improve the method’s predictive accuracy. Cost-complexity
pruning uses cross-validation methods to identify a complexity parameter that imposes a penalty
on larger trees. Thus, pruned CART results in smaller, more stable subtrees with better predictive
accuracy in new samples.
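As a sketch of the pruning idea, one can compute the path of candidate complexity parameters and select the penalty by cross-validation. This example uses scikit-learn's minimal cost-complexity pruning as a stand-in for R's rpart; the dataset and settings are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Candidate complexity parameters: a larger alpha imposes a larger penalty
# on tree size, yielding a smaller subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Choose the penalty by 5-fold cross-validation, then refit the pruned subtree.
cv_means = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                            X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_means))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())  # pruned tree is no larger
```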
A second extension of CART improves its predictive accuracy by using bootstrapping
(Efron & Tibshirani, 1993). The earliest method, bagging (short for “bootstrap aggregation,”
Breiman, 1996) boostraps the dataset many times, fits a large CART tree to each bootstrapped
dataset, and averages across the results. For continuous outcomes, a case’s predicted value is
simply the mean of the predicted values given to that case across the bootstrapped samples. For
categorical outcomes, a case’s predicted probability of being classified as class k is simply the proportion of times the case was classified as class k across the bootstrap samples, whereas the case’s predicted class is simply the class assigned in the majority of bootstrap trees.²
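Bagging's averaging rule can be sketched directly. The toy example below (Python, with scikit-learn trees standing in for the chapter's R machinery) bootstraps the data, grows one large tree per resample, and aggregates the votes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy binary outcome

n_boot = 50
votes = np.zeros((n_boot, len(y)), dtype=int)
for b in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))           # bootstrap resample
    tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    votes[b] = tree.predict(X)                           # large, unpruned tree

# Predicted probability of class 1 = proportion of bootstrap trees voting 1;
# predicted class = the majority vote across trees.
prob_class1 = votes.mean(axis=0)
pred_class = (prob_class1 >= 0.5).astype(int)
```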
The random forests algorithm (Breiman, 2001) is similar to bagging in that it uses bootstrapping to estimate a “forest” of many trees. However, random forest analysis adds a twist: at each split, only a random subset of the predictors is considered as candidate splitting variables. In this way, over the course of many bootstrap samples, highly correlated predictors each get a chance to contribute to the prediction at each split. Thus, random forests not only bootstraps but also addresses potential collinearity among the predictors.

² A clever modification of this approach recognizes that, on average, about one-third of cases will be left out of each bootstrapped sample. These unsampled cases are termed out-of-bag (OOB) observations. To increase the predictive accuracy of bagging analyses, the predicted values for each case can be calculated using only these OOB observations.
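A minimal random forest sketch (scikit-learn rather than R's randomForest; the data are arbitrary). Here max_features plays the role of the random predictor subset considered at each split, and oob_score computes accuracy from the out-of-bag cases:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# max_features="sqrt": only a random subset of predictors is considered at
# each split; oob_score=True scores each case using only the trees whose
# bootstrap samples left that case out.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=1).fit(X, y)
print(rf.oob_score_)
```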
Using CART and Random Forests to Address Missing Data
Recent proposals suggest two ways that CART and random forests might be used to address
missing data. The first method (Hayes et al., 2015; McArdle, 2013) uses CART and random
forests to predict a binary response (or return) indicator in which missing observations (e.g.,
dropouts from the study) are coded 0 and non-missing observations (e.g., people who returned to
the study and provided responses) are coded 1. This analysis serves to split the data into groups
of participants whose responses are either mostly missing (mostly 0s) or mostly non-missing
(mostly 1s). The resulting predicted probabilities of returning to the study and providing data
(i.e., of receiving a 1) can then be inverted to form data weights that serve to up-weight
individuals in terminal nodes with a low probability of returning to the study and down-weight
individuals in terminal nodes with a high probability of returning (see, e.g., Kish,
1995; Potthoff, Woodbury, & Manton, 1992). Weights are typically rescaled to sum to the
relative sample size (Potthoff et al., 1992; Stapleton, 2002). Weighted structural equation
modeling analyses are then conducted using pseudo-maximum likelihood estimation, typically
maximum likelihood with robust standard errors (MLR; for details, see Asparouhov, 2005;
Hayes et al., 2015; McArdle, 2013).
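The weighting step itself is simple arithmetic. In the sketch below, the return probabilities are hypothetical placeholders standing in for terminal-node proportions from a CART or random forest fit:

```python
import numpy as np

# Hypothetical predicted probabilities of returning to the study, e.g.,
# the proportion of returners in each case's terminal node.
p_return = np.array([0.90, 0.90, 0.50, 0.50, 0.25, 0.25])

weights = 1.0 / p_return                            # inverse probability weights
relative = weights * len(weights) / weights.sum()   # rescale to sum to N

print(relative.sum())  # sums (up to rounding) to the sample size, 6
```

Cases with a low predicted probability of returning receive the largest relative weights, as described above.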
The second method uses CART and random forests to predict the missing outcome
variable, itself (rather than a response indicator), in order to group together participants who
share similar scores and generate imputations for the missing cases (Doove et al., 2014; Shah et
al., 2014). Although it would be easy to simply use the predicted value (e.g., majority class or
node mean) from a CART analysis to serve as an imputed value, this would lead to inadequate
variability, analogous to omitting the stochastic error term in standard regression imputation. To
address variability more appropriately, instead of simply using the predicted value in a given
node, the algorithm implemented in the mice package in R (van Buuren & Groothuis-
Oudshoorn, 2011) chooses imputed values by randomly sampling from the non-missing
observations falling in the same terminal node as a given missing case. Random forest
imputation extends this procedure by randomly sampling from the set of all non-missing
observations that fall in the same terminal node as a given missing data point in any of the
bootstrap trees (for more concrete details of the algorithms used for CART and random forest
imputation, see Doove et al., 2014; van Buuren, 2012). After the imputation phase is complete,
the analysis and pooling phases follow standard procedures (Enders, 2010; Little & Rubin, 1987;
van Buuren, 2012).
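The donor-sampling idea can be sketched for a single imputation. The following is a simplified Python stand-in for the mice algorithm (which also refits the tree within each imputation cycle); the data and tree settings are arbitrary:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=300)
missing = rng.random(300) < 0.3              # flag ~30% of y values as missing

# Grow a tree on the observed cases only.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X[~missing], y[~missing])

# apply() returns each case's terminal-node id; impute each missing y by
# drawing a random *observed* donor from the same terminal node, which
# preserves variability (unlike plugging in the node mean).
obs_leaf = tree.apply(X[~missing])
y_obs = y[~missing]
y_imputed = np.array([rng.choice(y_obs[obs_leaf == leaf])
                      for leaf in tree.apply(X[missing])])
```

Each imputed value is an actually observed score from the same terminal node, so the imputations inherit realistic spread.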
Thus far, little research has investigated the performance of these CART-based missing
data methods. Initial simulations suggest that CART-based weighting methods exhibit strong
performance in identifying true selection model variables and in reducing parameter bias, even in
small sample datasets (e.g., N = 100; Hayes et al., 2015). Additionally, two sets of simulations
demonstrated that, in terms of bias, efficiency, confidence interval width, and coverage, CART-
based multiple imputation methods outperformed traditional multiple imputation when
estimating interaction terms and performed nearly as well in estimating main effects in
regression and survival models (Doove et al., 2014; Shah et al., 2014). However, these studies
only simulated large sample datasets (N = 1,000 and N = 2,000). Thus, the performance of
CART-based multiple imputation in small sample settings remains unknown.
The Present Research
In the present research, we used statistical simulation methods to assess the performance
of CART-based weighting methods and CART-based multiple imputation methods in small
sample, randomized clinical trial settings. Randomized clinical trials provide an interesting
context for studying the missing data methods described in this chapter for at least two reasons.
First, researchers conducting clinical trials often face practical constraints on the number of
individuals available to participate in their studies. This is because clinical trials often target
participants from specialized populations, such as those with specific mental health diagnoses,
who may be difficult to recruit in large numbers.
Second, clinical trials face a very real threat of attrition based on individuals’ scores on
the outcome under study. For example, it is possible that the most depressed individuals – those
whose values on the dependent variable fail to improve or worsen over time – might plausibly
become discouraged by their lack of improvement in the trial and ultimately drop out of the
study. Alternatively, it is possible that those individuals who improve the most as a result of
treatment might feel so much better that they fail to return to the trial because they feel that they
no longer need help. In either case, the distribution of scores on the dependent variable will be
fundamentally altered (truncated at either the top or the bottom, in a pattern that will be
differential by treatment group, with the most depressed patients likely being control patients and
the most improved patients presumably being treatment group patients).
Such Missing Not at Random (MNAR; Rubin, 1976) mechanisms are among the most
deleterious and difficult-to-address forms of missing data (Enders, 2010, 2011; Yang &
Maxwell, 2014). To our knowledge, no prior research has assessed the performance of CART-based weighting and CART-based multiple imputation under MNAR missing data. It is well-established that
auxiliary variables can help provide at least some relief under MNAR, so long as these variables
are highly correlated with y (Collins, Schafer, & Kam, 2001). Building on this logic, we
suspected that these tree-based, greedy algorithms might be particularly well-suited for
identifying patterns in the data that might relate to the MNAR mechanism and ultimately provide
relief from parameter bias.
For these reasons, the present research investigates the performance of CART and
random forest methods for dealing with missing data under small sample-sizes and a variety of
MNAR missing data mechanisms. In the simulation presented below, we employed a growth model designed to mirror a randomized longitudinal clinical trial and included auxiliary variables that
were either moderately correlated or highly correlated with y at the time of dropout. We
conducted our simulation using R statistical software (R Core Team, 2013).
Method
Data Generation Model
For each cell of the simulation, we generated 200 datasets based on the template model
displayed in Figure 4.2. Here, Group indicates a dummy-coded experimental grouping variable
where 0 = “control” and 1 = “treatment” in our fictional clinical trial. Three features of this
template model deserve brief mention. First, the influence of Group on individuals’ starting
levels (random intercepts, I) is set to 0.005 in the model. This is to indicate an influence that is
near zero, in line with our assumption that the treatment group should have zero influence at baseline before the treatment is administered. Second, the mean intercept, γ_00, is set equal to 1.00.³ We chose a starting level of 1 rather than a standardized starting level of 0 in order to avoid division by zero and facilitate calculations of standardized parameter bias, as described below.

³ Technically, the intercept for the control group (dummy-coded 0) is equal to 1.00, whereas the intercept for the treatment group (dummy-coded 1) is equal to 1.00 + 0.005 = 1.005.
Finally, the mean slope value is set to γ_10 = −0.20, indicating a negative trajectory over time among control group participants (coded 0 on the dummy-coded Group variable). For experimental group participants coded 1 on Group, however, the expected simple slope was γ_10 + γ_11 × Group = −0.20 + 0.50 × 1 = 0.30. Thus, whereas the average control group member in this model follows a decreasing linear expected trajectory in which scores on y decrease from a starting level of 1 by 0.20 between each consecutive time point, the average treatment group member follows an increasing linear expected trajectory in which scores on y increase from the starting level of 1 by 0.30 between each consecutive time point. In addition to the model variables pictured in Figure 4.2, we also simulated three auxiliary variables, arbitrarily labeled x, v, and w, that were set to be either highly correlated with y at time 1 or highly correlated with y at time 2, as described below.
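These implied mean trajectories are easy to verify numerically (intercept 1.00; slope −0.20 for control and −0.20 + 0.50 = 0.30 for treatment):

```python
# Expected mean of y at times 0..4 under the template growth model.
intercept, slope_control, treatment_effect = 1.00, -0.20, 0.50

control = [intercept + slope_control * t for t in range(5)]
treatment = [intercept + (slope_control + treatment_effect) * t for t in range(5)]

print(control)    # declines from 1.0 toward 0.2
print(treatment)  # rises from 1.0 toward 2.2
```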
Factors Manipulated in the Simulation
We varied four main factors in the simulation: sample size, type of MNAR mechanism,
functional form of MNAR mechanism, and correlation between the auxiliary variables and y. We
discuss each of these factors in turn.
Sample size. We simulated two sample sizes: N = {60, 200}. This was designed to
approximate either relatively small clinical trials with 30 participants per cell or relatively large
clinical trials with 100 participants per cell.
Characteristics of the MNAR Dropout Mechanism. The main factors varied in the
simulation concerned the nature of the MNAR dropout mechanism. Across simulation cells, we
modeled a relatively simple dropout mechanism, in which approximately 30 percent of
participants dropped out at the second time point. Although critical and attentive readers might
correctly note that this dropout mechanism is quite simplistic and unlikely to occur in real
longitudinal data, we made this choice for a deliberate reason. The weighting methods proposed
by McArdle (2013) and assessed by Hayes et al. (2015) are aimed at using univariate CART and
random forests methods to create a single set of inverse data weights. Although extensions of
these methods to multivariate missing data situations can be readily imagined (e.g., by
employing multivariate CART or survival CART and extensions of these techniques, such as
those proposed by Brodley & Utgoff, 1995; De’ath, 2002; Larsen & Speckman, 2004), as an
initial step in assessing the performance of these weighting techniques, we felt it was important
to assess these methods under straightforward, well-controlled conditions in order to obtain a
useful benchmark for their performance under ideal conditions.⁴ Although the time point of
univariate dropout was held constant at time 2 throughout all simulation cells, we experimentally
varied both the type of MNAR mechanism and the functional form of the MNAR mechanism
used to induce attrition. All dropout mechanisms described below led to approximately a 30
percent dropout rate.
Type of MNAR mechanism. Following Yang and Maxwell (2014), we induced missing
data using either outcome-dependent MNAR, in which dropout depended upon individuals’
scores on y at the time of dropout (time 2 in the present simulation) or slope-dependent MNAR,
in which dropout depended upon individuals’ scores on the latent slope factor.
Functional form of MNAR mechanism. For both types of MNAR attrition, we induced dropout using one of two tree diagrams. Specifically, we simulated dropout according to either a tree with a single split at the 50th percentile of the simulated variable, depicted in Figure 4.3 panel (a), or a tree with two splits, at the 25th and 75th percentiles, as depicted in Figure 4.3 panel (b). Conceptually, the one-split tree represents a situation in which the majority of participants with the lowest scores or the lowest random slopes on the outcome variable dropped out of the study (e.g., the most depressed individuals don’t come back). The two-split tree represents a situation in which both the highest and lowest participants on y (or the random slope of y) dropped out of the study.

⁴ In the absence of such benchmark results, if these CART-based weighting methods were applied directly to multivariate missing data problems and encountered performance problems, it would be difficult to ascertain whether the issues resulted from flaws in the basic methods or flaws in the generalization of these univariate procedures to the multivariate case. For these reasons, then, we chose to prioritize experimental control over ecological realism in the present, initial simulation.
Within any simulation cell, we first used these tree-based models in concert with the
rbinom() function in R to simulate a return indicator in which participants coded 0 dropped
out of the study and participants coded 1 returned to the study. We then induced missing values
for scores on all y variables occurring at time ≥ 2 for simulated participants who received a 0 on
the return indicator.
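That generation step can be sketched as follows (a numpy analogue of the rbinom() call; the return probabilities 0.45 and 0.95 are hypothetical placeholders chosen only to yield roughly 30 percent dropout, not the values used in the simulation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
y2 = rng.normal(size=n)                      # outcome at time 2

# One-split MNAR tree: cases below the median of y2 return with probability
# 0.45; cases above it with probability 0.95 (=> roughly 30% overall dropout).
p_return = np.where(y2 < np.median(y2), 0.45, 0.95)
returned = rng.binomial(1, p_return)         # analogue of R's rbinom()

# Induce missingness at time >= 2 for simulated dropouts (returned == 0).
y2_observed = np.where(returned == 1, y2, np.nan)
print(1 - returned.mean())  # dropout rate near 0.30
```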
Correlation between each auxiliary variable and y. Beyond small sample sizes and
MNAR missing data, randomized longitudinal clinical trials present an additional challenge. The
types of auxiliary variables typically used to help address missing data (cf. Collins et al., 2001)
are often baseline measures assessed at the first time point (e.g., demographic measures, scores
on psychological inventories taken at time 1, etcetera). If all time 1 measures are baseline
assessments – that is, assessments taken before the treatment is administered or before the
treatment is thought to take effect – then the correlation between the baseline auxiliary variables and the outcome variable at later time points will necessarily erode as the experimental treatment begins to take effect.
Thus, the final factor that we varied in the simulations was the correlation between each
auxiliary variable and y. We simulated two scenarios. In the first scenario, each auxiliary
variable had a .80 correlation with y at baseline. Because of the treatment effect modeled in our
study, however, the correlation between each auxiliary variable and y by time 2 was only
approximately r = 0.51 and this correlation decreased over time, ultimately reaching
approximately r = 0.41 by the end of the study. This scenario models the realistic condition in
which baseline measurements are correlated with auxiliary variables in a manner that may not
persist across the experimental trial.
In the second scenario, each auxiliary variable had a .80 correlation with y at time 2.
Under these conditions, this artificially induced correlation remained high (between 0.72 and 0.74)
throughout the remaining time points. However, due to the nature of the model expectations, in
this scenario, each auxiliary variable was only correlated with y at baseline at approximately r =
0.57. This scenario models a situation closer to the ideal situation of highly correlated auxiliary
and outcome variables assessed in prior research (see Collins et al., 2001, study 3). The average
observed correlations among the auxiliary variables and y variables in the simulation are
displayed in Table 4.1.
Overall simulation design. Varying these four factors, the overall simulation design was
a 2 (N = 60 vs. 200) x 2 (MNAR mechanism = y-dependent vs. slope-dependent) x 2 (functional
form of MNAR mechanism = 1 split tree vs. 2 split tree) x 2 (correlation of auxiliary variables
with y at time 2 = .5 vs. .8) design. This led to 16 unique simulation cells and 2 × 2 × 2 × 2 × 200 = 3,200 simulated datasets.
Analyses Conducted on Each Simulated Dataset
We analyzed each simulated dataset using 11 different analyses of interest. All analyses
specified the correct linear growth model structure, as depicted in Figures 4.2 and 4.3 (that is, the
simulation only assessed accuracy and error that resulted from the missing data techniques being
evaluated and did not introduce additional error that would result from model misspecification).
We conducted all SEM analyses of our template model using the lavaan package in R
(Rosseel, 2012). The 11 analyses conducted in the simulation can be grouped into three
categories: baseline comparison analyses, inverse weighting analyses, and multiple imputation
analyses.
Baseline comparison analyses. Three analyses provided initial baseline measures of
performance. First, we analyzed each simulated dataset prior to injecting missing data, thus
providing a measure of how far the simulated data fell from the true population parameters
simply due to the small sample sizes used in the simulation. Next, we analyzed each dataset with
missing data using listwise deletion in order to provide a baseline measure of how harmful each
MNAR mechanism was to the parameter estimates when participants who dropped out of the
study were simply ignored and thrown out of the analysis. Finally, we also analyzed each dataset
using standard Full Information Maximum Likelihood (FIML, with the missing = “fiml”
argument in lavaan), providing a baseline measure of performance for this default, commonly
used missing data method. It should be noted that this basic FIML condition did not include the
auxiliary covariates (x, v, and w), since the purpose of the condition was to mirror what we
suspected many applied researchers would actually do in practice.
Inverse weighting analyses. After conducting the baseline analyses mentioned above,
we conducted a number of weighted analyses. We first transformed all weights into relative
weights before conducting weighted analyses using the package lavaan.survey in R
(Oberski, 2014) with the estimator in the sem() function in lavaan set to MLR (maximum
likelihood with robust standard errors, as recommended by Asparouhov, 2005; and as specified
in Hayes et al., 2015).
Before testing the performance of logistic regression and tree-based approaches to
forming inverse weights, we sought a general measure of the performance of inverse probability
weights under ideal conditions. Toward this aim, in each simulation cell we formed an initial set
of inverse weights from the true population probabilities used to generate attrition in our
simulated data (as depicted in panels a and b of Figure 4.3) and analyzed each dataset using these
true population weights. Although these true population probabilities of attrition would never
actually be known to analysts, the initial analyses of the datasets using these population weights
provide a useful benchmark for the performance of weighting methods, in general. Thus, in all
tables of results presented below, the true weighted analyses conceptually represent the best
possible performance of weighting methods, if all population probabilities of dropout were
perfectly estimated.
Following this initial weighted analysis, we formed weights using the predicted
probabilities generated by (a) logistic regression, (b) CART analysis, (c) pruned CART, and (d)
random forests. All analyses predicted a missing data indicator coded 0 = dropout, 1 = return and
included all covariates (x, v, and w) as well as y at time 1 (in this way, all analyses predicted
dropout at time 2 using all complete variables that occurred prior to time 2). The logistic
regression analysis modeled main effects only, which we believed would be the most common
approach to logistic regression taken by applied researchers with no a priori hypotheses
concerning interactions among the four variables. Random forest analyses used the R package
randomForest (Liaw & Wiener, 2002). CART and pruned CART analyses used the package
rpart (Therneau, Atkinson, & Ripley, 2014) with the same settings described in Hayes et al
(2015).
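The weighting step itself is language-agnostic: predicted probabilities of returning are inverted and then rescaled to relative weights. A minimal Python sketch of this logic (the probabilities shown are invented for illustration and do not come from the simulation):

```python
def relative_inverse_weights(p_return):
    """Form relative inverse probability weights from predicted
    probabilities of returning to the study (complete cases only).

    Each completer is weighted by 1 / p_i; the weights are then
    rescaled so they sum to the number of complete cases."""
    raw = [1.0 / p for p in p_return]   # inverse probability weights
    scale = len(raw) / sum(raw)         # rescaling factor for relative weights
    return [w * scale for w in raw]

# Hypothetical predicted return probabilities for five completers:
probs = [0.9, 0.8, 0.5, 0.4, 0.3]
weights = relative_inverse_weights(probs)
# Completers who resembled dropouts (low predicted probability of
# return) receive the largest weights; the weights sum to 5.
```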
Multiple imputation analyses. We tested three multiple imputation analyses in the
simulation using the mice package in R (van Buuren & Groothuis-Oudshoorn, 2011) using all
variables, including the model variables as well as the covariates x, w, and v in the imputation
models. First, we used standard Bayesian regression imputation by using the method =
“norm” argument in the mice() function.⁵ Additionally, we assessed the performance of
single-tree CART and random forest multiple imputation methods using the method =
“cart” and method = “rf” arguments. In all cases, we set the number of imputations equal
to 20, rather than the default 5, in order to give the imputation methods the best chance to
perform well in the simulation. We also set the argument maxiter equal to 20. Otherwise, we
used default settings for all mice methods. Since, by default, mice does not include analysis
and pooling functions for SEM analysis, the simulation included custom code to fit the SEM
growth model on each imputed dataset using lavaan and pool the estimates and standard errors
using Rubin’s rules (cf. Little & Rubin, 1987; Rubin, 1987).
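The pooling step applied to each set of imputed-data results follows Rubin's rules. A minimal Python sketch of that pooling logic, assuming each analysis returns a point estimate and its squared standard error (the numbers below are invented for illustration):

```python
import math

def pool_rubin(estimates, variances):
    """Pool point estimates and sampling variances from m imputed
    datasets using Rubin's rules (Rubin, 1987)."""
    m = len(estimates)
    qbar = sum(estimates) / m                # pooled point estimate
    wbar = sum(variances) / m                # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = wbar + (1 + 1 / m) * b           # total variance
    return qbar, math.sqrt(total)            # pooled estimate and pooled SE

# Hypothetical estimates and squared SEs from m = 4 imputed datasets:
est, se = pool_rubin([0.48, 0.52, 0.50, 0.46], [0.010, 0.012, 0.009, 0.011])
```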
Outcomes Measured in the Simulation
We assessed each missing data method in terms of its bias and efficiency in returning
accurate growth model parameter estimates. Following previous work (Enders & Bandalos,
2001; Hayes et al., 2015), we assessed percent bias for each estimate by computing
%Bias = ((θ̂ − θ) / θ) × 100    (1)

where θ̂ is the value of the parameter estimate returned by a given analysis of a given simulated
dataset and θ is the true population parameter. We note that values greater than 15% are
considered problematic (Muthén, Kaplan, & Hollis, 1987). We calculated efficiency as the
empirical standard deviation of each growth model parameter estimate across the 200 simulated
trials in each cell.

⁵ Knowledgeable readers may wonder why we used Bayesian regression imputation (method =
“norm” in mice) instead of the default predictive mean matching (method = “pmm” in
mice). We used norm imputation both because it represents a standard approach to multiple
imputation, generally, and because it outperformed pmm in initial simulations. We therefore
present the norm imputation results to demonstrate the stronger performance of this standard
imputation technique.
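Both outcome measures are simple to compute. A minimal Python sketch (the parameter values and replications are invented for illustration):

```python
import statistics

def percent_bias(estimate, truth):
    """Percent bias of a single parameter estimate (Equation 1)."""
    return (estimate - truth) / truth * 100

def efficiency(estimates):
    """Empirical standard deviation of an estimate across simulated trials."""
    return statistics.stdev(estimates)

# Hypothetical: true slope of 0.50 and three replications of one method.
replications = [0.40, 0.55, 0.45]
biases = [percent_bias(e, 0.50) for e in replications]
# Average percent bias across replications; |bias| > 15% flags a problem.
mean_bias = sum(biases) / len(biases)
emp_sd = efficiency(replications)
```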
Results
In the next sections, we present the results for each measure (bias, efficiency) in turn. We
start with percent bias. Note that, across tables of results, we omit the parameter β01 from Figure 4.2,
since this parameter was functionally set to zero in the simulation.
Percent Bias
The primary results of the simulation concern the average percent bias observed in the
parameter estimates of the growth model from Figures 1 and 3. These results are displayed in
Tables 4.2–4.5 for the N = 60, r_xy2 ≈ .5; N = 60, r_xy2 ≈ .8; N = 200, r_xy2 ≈ .5; and
N = 200, r_xy2 ≈ .8 conditions, respectively.⁶ Several trends in these tables are worth
mentioning. First, examining the columns corresponding to listwise deletion, it is clear that the
MNAR mechanisms used in the simulation cells were quite harmful, as indicated by levels of
bias that were often far above 15 percent. The harmful effects of the missing data mechanisms
are confirmed by examination of the FIML columns, where it is easy to see that, in the absence
of further information, standard FIML estimation was unable to remedy the sometimes-dramatic
levels of bias induced by the MNAR dropout (see, for example, the cells corresponding to the
fixed slope β10 under a 1 split S MNAR mechanism, which exhibit bias well over 100 percent in
all simulation cells).
⁶ Because all covariates, x, w, and v, were simulated in exactly the same way, the notation r_xy2
indicates the correlation between any of the interchangeable covariates and y2.
A second interesting, and somewhat surprising, initial observation was the strong
performance of the true population probability weights in all cells of the simulation. Whereas the
estimated probability weights from logistic regression, CART, pruned CART, and random
forests still struggled in absolute terms in a variety of conditions, the true probability weights
provided more consistent, accurate corrections to the parameter estimates than any other missing
data method. This assuages concerns that might be raised about ceiling effects that could exist
for the performance of weighting methods. At least under the conditions studied here, inverse
weights seem to have the potential to perform extremely well when the selection probabilities are
accurately modeled. Continuing to examine standard and baseline measures, it is instructive to
look at the results for Bayesian regression imputation across cells of the simulation (labeled
‘norm’ in the tables, corresponding to its label in the mice package). Consistent with
expectations, standard Bayesian regression imputation performed admirably well in a variety of
simulation cells, although MNAR still created substantial bias in a variety of conditions.
One of the most surprising results is that, across cells, the performance of both CART
and random forest imputation methods was quite poor when compared with the other methods.
Although random forest imputation appeared to fare a bit better than single tree CART
imputation, in general both of these methods performed substantially worse than standard
Bayesian regression imputation and also performed worse than the various weighting methods
assessed across the majority of simulation cells.
Comparing results of weighted analyses based on CART, pruned CART, and random
forest methods, with few exceptions random forest analysis seemed to most consistently provide
the largest reductions in bias across simulation cells, and this was especially evident in the cells
with the most bias. For one strong example, see the β10 coefficient, “1 split, S” cells of Table 4.3
(N = 60, r_xy2 ≈ .8). FIML estimates displayed an average bias of –109.79. Bayesian regression
imputation improved upon this considerably, returning estimates with an average bias of –67.45.
Finally, random forest weights produced the greatest improvement, reducing the average bias to
–47.45. This number is, admittedly, still extremely troubling in absolute terms, but it represents
more than a 50% improvement upon FIML and nearly a 30% improvement upon Bayesian
regression imputation. Similar
patterns are found in other problematic cells throughout the simulation.
Finally, how did random forest weights compare to weights generated from logistic
regression? In general, random forest weights tended to perform either similarly to, or in some
cases much better than, logistic regression weights in terms of bias, with the greatest
improvements occurring when the correlations between the auxiliary variables and y2–y5 were
higher. For some examples, examine Table 4.5. Estimating the same fixed β10 coefficient in the
1 split y condition, logistic regression weights were 65.50 percent biased on average compared to
only 18.75 percent bias observed for random forest weights. Similarly, when estimating the fixed
β11 coefficient in the 1 split y and 2 split y conditions, logistic regression analysis produced
average bias of 18.16 percent and –15.15 percent, respectively, compared to only 2.66 and 1.91
percent for random forest weights.
Efficiency
Table 4.6 displays the efficiency of each estimate in the N = 60 and N = 200
conditions, collapsed over covariate correlations and MNAR mechanisms.
Although the empirical standard deviations of each estimate were quite similar across most
analyses, it is worth pointing out that, by their nature, the weighted analyses were a bit less
efficient, in general, than full-information approaches such as FIML and MI. The most
pronounced differences in efficiency were found in the N = 60 cells of the simulation. Although
weighted estimates were still slightly less efficient at N = 200, the differences were smaller
overall in these larger sample size conditions.
Discussion
In this simulation inverse data weights formed from decision tree analyses (Hayes et al.,
2015; McArdle, 2013) – particularly random forest analysis – were able to substantially alleviate
bias in parameter estimates of the simulated growth curve model, although absolute levels of bias
under the MNAR mechanisms remained problematic in many conditions. Of particular interest,
analyses that used inverse weights formed using the true selection probabilities from the
simulation performed strikingly well, even in the lowest sample size conditions used in the study
(N = 60). Although throwing out data and only using available cases via listwise deletion severely
harmed the growth curve parameter estimates, reweighting those same available cases using the
true population probabilities of return to the study provided a surprising amount of relief. This
suggests that if the population probabilities are modeled accurately, then inverse weighting could
plausibly be a very effective method for dealing with missing data even under extremely
deleterious selection mechanisms.
The best-performing methods in the simulation were random forest weights, closely
followed by Bayesian regression imputation and logistic regression weights. In many conditions
the levels of bias produced by these methods tracked quite closely, but when the auxiliary
variables were more highly correlated with the outcome variables (y2–y5), random forest weights
tended to provide greater relief than the other methods. These improvements in bias did come at
the cost of slightly lower efficiency, particularly in the case of random forest weights, although
the level of inefficiency did not seem severe in the present study.
In addition to providing a broad evaluation of these missing data techniques under small
sample sizes and MNAR attrition, this simulation provided an important insight into the nature
of covariate-outcome correlations at baseline in randomized longitudinal clinical trials. As shown
in Table 4.1, when the simulated covariates were highly correlated with y at baseline, these
correlations quickly dissipated over subsequent measurement occasions as the experimental
treatment took effect, leading the treatment and control groups to diverge. Under these
conditions, none of the missing data methods performed optimally. It was only when the
covariates were artificially simulated to be highly correlated with y at the time of dropout that the
missing data techniques provided the greatest relief.
One surprising result of the simulation was the poor performance of CART and random
forest imputation (Doove et al., 2014; Shah et al., 2014; van Buuren, 2012) under the conditions
studied. Although traditional Bayesian regression imputation performed relatively well, even
under the low sample size conditions modeled here, the CART and random forest imputation
methods produced the most bias out of all of the methods assessed. This seems to suggest that it
is not imputation, per se, that suffers under low sample sizes, but these particular extensions of
multiple imputation.
It is unclear why these methods failed to perform well in the simulation, but one potential
explanation is that at very low sample sizes there may not be adequate variability among
complete cases falling in terminal nodes of CART trees fit to the data. Because CART and
random forest trees generate imputed values by randomly sampling the observed values from the
complete cases falling in each terminal node, if the number of such complete cases is trivially
small (e.g., only a few observations), this might not provide enough variability to generate
realistic imputed values. The results of the weighted analyses provide a potential
counterargument to this theory, however. These methods simply reweighted the complete cases
in the data and returned substantially improved results, suggesting that the variability in the
complete data was not wholly inadequate for providing more accurate estimates.
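The donor-sampling mechanism described in this explanation can be sketched as follows; the terminal-node assignments and observed values are invented, and `random.choice` stands in for the within-node draw used by CART-based imputation:

```python
import random

def cart_donor_impute(node_of_missing, donors_by_node, rng):
    """Impute a missing value by randomly drawing an observed value
    from the complete cases falling in the same CART terminal node.

    With very small samples, a node's donor pool may hold only one
    or two values, limiting the variability of the imputations."""
    donors = donors_by_node[node_of_missing]
    return rng.choice(donors)

rng = random.Random(1)
# Hypothetical terminal nodes: node 2 has several donors, node 3 only one.
donors_by_node = {2: [3.1, 2.8, 3.5, 2.9], 3: [6.0]}
value = cart_donor_impute(3, donors_by_node, rng)  # always 6.0: one donor
```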
It is important for future studies to assess the performance of CART and random forest
imputation under multivariate attrition with similar sample sizes and missing data mechanisms.
These approaches still have much to recommend them, and one benefit of the chained equations
imputation framework is the ease with which successive univariate CART and random forest
analyses can be employed to generate imputations for any number of variables with missing data.
Multivariate extensions of the CART and random forest weighting methods are not as obvious,
however, since these methods must ultimately produce a single set of probabilities that can be
inverted to form weights. The most promising approach to this problem involves assessing
discrete and continuous time versions of survival CART and survival forests which, as their
names imply, bridge decision tree approaches with survival analysis for censored data (see Zhou
& McArdle, 2015 for an accessible and thorough introduction to survival trees and ensembles).
What recommendations can this simulation provide for applied researchers conducting
clinical trials similar to the ones we’ve described here? First, it seems that random forest weights
are a sensible choice for scenarios like the one depicted here, even when sample sizes are quite
low, while Bayesian regression imputation appears to be another very solid choice. However, all
of the methods studied were most helpful when the auxiliary covariates were highly correlated
with the missing outcome variables. Therefore, researchers conducting small sample,
experimental research should be cautious in assuming that variables highly correlated with y at
baseline will necessarily remain highly correlated with successive measurements of y as the
experimental treatment begins to create group differences.
References
Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation
Modeling: A Multidisciplinary Journal, 12(3), 411–434.
http://doi.org/10.1207/s15328007sem1203_4
Berk, R. A. (2009). Statistical learning from a regression perspective. New York: Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
http://doi.org/10.1007/BF00058655
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
http://doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression
Trees. Pacific Grove, CA: Wadsworth.
Brodley, C. E., & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19(1),
45–77. http://doi.org/10.1007/BF00994660
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351.
http://doi.org/10.1037/1082-989X.6.4.330
De’ath, G. (2002). Multivariate regression trees: A new technique for modeling species–environment relationships. Ecology, 83(4), 1105–1117. http://doi.org/10.1890/0012-9658(2002)083[1105:MRTANT]2.0.CO;2
Doove, L. L., van Buuren, S., & Dusseldorp, E. (2014). Recursive partitioning for missing data
imputation in the presence of interaction effects. Computational Statistics & Data Analysis,
72, 92–104. http://doi.org/10.1016/j.csda.2013.10.025
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Enders, C. K. (2011). Missing Not at Random models for latent growth curve analyses.
Psychological Methods, 16(1), 1–16. Retrieved from http://dx.doi.org/10.1037/a0022640
Enders, C. K., & Bandalos, D. L. (2001). The relative performance of full information maximum
likelihood estimation for missing data in structural equation models. Structural Equation
Modeling: A Multidisciplinary Journal, 8(3), 430–457.
http://doi.org/10.1207/S15328007SEM0803_5
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning. New
York: Springer-Verlag.
Hayes, T., Usami, S., Jacobucci, R., & McArdle, J. J. (2015). Using Classification and
Regression Trees (CART) and random forests to analyze attrition: Results from two
Simulations. Psychology and Aging, 30(4), 911–929. http://doi.org/10.1037/pag0000046
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
with applications in R. New York: Springer.
Kish, L. (1995). Methods for design effects. Journal of Official Statistics, 11(1), 55–77.
Retrieved from http://www.jos.nu/Articles/abstract.asp?article=11155
Larsen, D. R., & Speckman, P. L. (2004). Multivariate regression trees for analysis of abundance
data. Biometrics, 60(2), 543–9. http://doi.org/10.1111/j.0006-341X.2004.00202.x
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3),
12–22.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
McArdle, J. J. (2013). Dealing with longitudinal attrition using logistic regression and decision tree analyses. In Contemporary issues in exploratory data mining in the behavioral sciences (pp. 282–311). New York: Routledge.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are
not missing completely at random. Psychometrika, 52(3), 431–462.
http://doi.org/10.1007/BF02294365
Oberski, D. (2014). lavaan.survey: An R package for complex survey analysis of structural
equation models. Journal of Statistical Software, 57(1), 1–27.
http://doi.org/10.18637/jss.v057.i01
Potthoff, R. F., Woodbury, M. A., & Manton, K. G. (1992). “Equivalent sample size” and
“equivalent degrees of freedom” refinements for inference using survey weights under
superpopulation models. Journal of the American Statistical Association, 87(418), 383–396.
http://doi.org/10.1080/01621459.1992.10475218
R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria:
R Foundation for Statistical Computing. Retrieved from http://r-project.org/
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of
Statistical Software, 48(2), 1–36. http://doi.org/10.18637/jss.v048.i02
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
http://doi.org/10.2307/2335739
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., & Hemingway, H. (2014). Comparison
of random forest and parametric imputation models for imputing missing data using MICE:
A CALIBER study. American Journal of Epidemiology, 179(6), 764–774.
http://doi.org/10.1093/aje/kwt312
Stapleton, L. M. (2002). The incorporation of sample weights into multilevel structural equation
models. Structural Equation Modeling: A Multidisciplinary Journal, 9(4), 475–502.
http://doi.org/10.1207/S15328007SEM0904_2
Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale,
application, and characteristics of classification and regression trees, bagging, and random
forests. Psychological Methods, 14(4), 323–348.
http://doi.org/http://dx.doi.org/10.1037/a0016973
Therneau, T., Atkinson, B., & Ripley, B. (2014). rpart: Recursive Partitioning and Regression
Trees.
van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: Chapman &
Hall/CRC.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45(3), 1–67.
http://doi.org/10.18637/jss.v045.i03
Yang, M., & Maxwell, S. E. (2014). Treatment effects in randomized longitudinal trials with
different types of nonignorable dropout. Psychological Methods, 19(2), 188–210. Retrieved
from http://dx.doi.org/10.1037/a0033804
Zhou, Y., & McArdle, J. J. (2015). Rationale and applications of survival tree and survival ensemble methods. Psychometrika, 80(3), 811–833. http://doi.org/10.1007/s11336-014-9413-1
Table 4.1
Average Correlations Between Covariates and y Variables in the Simulation
            x      v      w      y1     y2     y3     y4     y5
r_xy2 ≈ .5
  x       1.00   0.65   0.65   0.81   0.51   0.45   0.41   0.38
  v       0.65   1.00   0.65   0.81   0.51   0.45   0.41   0.38
  w       0.65   0.65   1.00   0.81   0.51   0.45   0.41   0.38
r_xy2 ≈ .8
  x       1.00   0.78   0.78   0.56   0.88   0.74   0.73   0.72
  v       0.78   1.00   0.78   0.56   0.88   0.74   0.73   0.72
  w       0.78   0.78   1.00   0.56   0.88   0.74   0.73   0.72
Note: Results are averaged across sample sizes.
Table 4.2
Percent Bias for Missing Data Methods in N = 60, r_xy2 ≈ .5 Conditions
Inverse weighting methods Multiple imputation methods
Complete Listwise FIML True Log CART Prune RF Norm CART RF
β00
1 split y 0.17 28.07 4.95 0.68 4.76 10.64 11.52 0.68 8.43 11.67 9.52
2 splits y 1.39 7.51 2.52 4.72 7.13 8.07 7.97 8.16 3.52 2.79 2.87
1 split S –1.05 10.00 –0.07 –1.08 0.66 1.82 1.81 –3.91 0.76 2.07 1.52
2 splits S –0.81 0.71 –0.65 0.85 0.72 1.49 1.42 0.19 –0.28 –0.55 –0.28
β10
1 split y –2.64 –88.90 –75.30 –8.88 –72.83 –75.54 –77.32 –65.17 –94.07 –158.36 –115.82
2 splits y 3.56 –9.38 –9.06 2.28 –9.73 –7.76 –8.78 –6.42 –15.62 –43.38 –37.10
1 split S 3.40 –114.42 –108.56 1.62 –109.20 –106.65 –106.23 –104.14 –121.07 –161.13 –142.26
2 splits S 1.97 4.93 5.97 5.99 5.28 5.87 6.92 8.67 1.42 –31.70 -23.71
β11
1 split y 1.60 –3.57 –3.62 –1.64 –1.79 –4.16 –4.17 –3.68 –10.12 –33.04 –26.35
2 splits y –3.68 –13.90 –13.56 -5.34 –14.22 –12.28 –13.00 –12.45 –18.03 –39.54 –35.08
1 split S –0.01 2.61 1.93 4.37 3.29 4.79 5.09 4.56 –4.16 –28.05 –22.00
2 splits S 0.84 1.36 1.69 2.82 1.94 4.45 4.58 4.72 –0.79 –28.85 –21.60
σI²
1 split y –3.79 –13.92 –4.52 –8.02 –4.06 –8.28 –8.34 –7.23 –11.83 –31.49 –37.29
2 splits y –0.92 –31.88 –10.30 –7.10 –31.82 –24.71 –25.85 –13.78 –19.79 –36.92 –43.32
1 split S –3.82 –8.32 –4.43 –9.41 –5.46 –8.32 –7.99 0.20 –9.26 –27.80 –32.63
2 splits S –3.22 –13.62 –5.76 –11.67 –12.60 –10.55 –10.36 –1.58 –10.22 –27.45 –30.57
σS²
1 split y –3.22 –9.52 –9.12 –6.73 –10.65 –10.34 –10.11 –10.65 4.41 –0.38 –18.83
2 splits y –4.78 –22.39 –21.84 –10.38 –23.17 –23.13 –23.41 –23.00 –8.83 –15.84 –30.62
1 split S –3.46 –16.84 –16.72 –6.92 –17.00 –17.04 –17.10 –16.03 –4.22 –8.04 –24.67
2 splits S –3.73 –41.56 –41.38 –11.64 –42.71 –42.87 –42.40 –43.93 –31.36 –32.71 –47.86
σIS
1 split y –1.75 –18.22 –14.69 –1.75 –10.17 –11.14 –11.69 –7.02 –18.67 –42.94 –10.67
2 splits y –2.17 –58.37 –57.67 –9.35 –58.57 –57.35 –57.54 –51.34 –58.18 –81.69 –58.05
1 split S –2.45 –12.74 –11.54 –3.10 –12.75 –13.83 –13.24 –7.40 –14.19 –33.43 –8.96
2 splits S 2.11 –39.07 –36.94 –8.54 –40.11 –38.03 –37.68 –35.38 –36.04 –57.33 –36.76
σe²
1 split y –0.53 –0.54 –0.42 –0.84 1.26 0.75 0.75 2.27 20.68 57.85 105.49
2 splits y –0.32 –1.82 -0.90 –0.32 –2.75 –2.12 –1.98 –0.79 19.58 47.16 88.66
1 split S 0.22 1.11 1.13 0.11 0.11 0.95 0.95 1.14 21.03 57.59 99.22
2 splits S –0.21 –0.12 0.12 –1.48 –0.56 –0.17 –0.13 1.56 21.08 48.93 79.34
Note: norm = Bayesian regression imputation in mice. y indicates the dependent variable. S indicates latent slope.
Table 4.3
Percent Bias for Missing Data Methods in N = 60, r_xy2 ≈ .8 Conditions
Inverse weighting methods Multiple imputation methods
Complete Listwise FIML True Log CART Prune RF Norm CART RF
β00
1 split y –0.18 30.55 4.43 3.53 –13.05 6.22 6.57 –8.95 0.65 –0.23 3.78
2 splits y 1.55 4.92 2.05 0.16 5.82 1.86 2.17 –3.19 1.61 0.12 1.26
1 split S 2.00 14.88 3.45 3.15 –3.19 –2.13 –1.57 –9.78 –1.90 –3.29 –0.16
2 splits S 0.67 –0.52 0.66 –0.07 –0.39 0.40 0.41 –2.11 0.74 –0.31 –0.02
β10
1 split y 0.20 –82.75 –65.23 –1.16 37.34 –17.36 –18.00 24.10 –20.33 –145.22 –57.58
2 splits y –4.42 –12.91 –12.28 5.39 –15.49 –3.70 –5.15 15.09 –4.42 –43.41 –27.21
1 split S 1.22 –117.01 –109.79 –3.38 –53.91 –73.29 –73.61 –47.45 –67.45 –161.53 –101.18
2 splits S 5.19 10.51 9.73 16.89 10.26 9.30 9.26 9.85 5.21 –25.36 –18.01
β11
1 split y 1.39 –0.10 1.11 4.81 18.14 4.72 4.14 1.99 –1.95 –30.16 –19.36
2 splits y –1.25 –11.03 –10.62 2.21 –12.78 –2.61 –4.42 9.05 –3.44 –37.47 –23.31
1 split S 1.02 4.41 4.91 4.47 5.83 4.45 5.23 5.82 3.58 –26.18 –16.62
2 splits S 1.47 3.80 3.34 7.67 4.50 5.09 4.60 5.41 1.63 –26.36 –17.83
σI²
1 split y 1.38 –13.28 0.17 –5.34 8.54 –3.06 –3.48 –8.05 –3.33 –17.42 –32.28
2 splits y –5.55 –36.12 –15.08 –12.28 –36.46 –18.91 –22.82 –5.02 –12.76 –26.67 –43.45
1 split S –3.32 –5.77 –3.07 –6.28 –0.89 –3.35 –3.07 –1.75 –4.70 –17.76 –30.50
2 splits S –4.20 –10.50 –5.72 –8.76 –9.68 –2.05 –3.32 8.88 –3.83 –16.61 –25.80
σS²
1 split y –4.31 –10.86 –10.16 –7.51 –4.13 –7.07 –7.33 –11.50 1.65 2.54 –17.30
2 splits y –1.98 –18.50 –17.91 –5.88 –18.61 –9.05 –10.93 –4.85 1.96 –9.32 –24.66
1 split S –4.16 –17.37 –17.34 –6.88 –7.55 –10.82 –11.27 –10.75 –6.91 –5.20 –22.41
2 splits S –3.58 –40.09 –39.99 –10.08 –40.70 –38.36 –38.49 –34.61 –28.9 –29.04 –43.84
σIS
1 split y 0.73 –18.76 –13.21 –2.02 26.75 3.00 2.03 3.60 –6.11 –65.54 –1.29
2 splits y –0.47 –56.58 –54.50 –8.26 –55.47 –21.25 –28.96 6.20 –18.70 –94.82 –29.84
1 split S –4.36 –16.75 –16.35 –6.53 1.53 –6.64 –5.89 –3.23 –12.03 –60.73 –5.55
2 splits S –2.74 –39.31 –38.20 –9.93 –38.59 –28.93 –29.47 –14.36 –24.41 –71.39 –25.88
σe²
1 split y –1.50 –1.95 –1.74 –1.91 –1.56 –1.59 –1.86 –3.12 15.77 71.41 102.03
2 splits y 0.81 –1.26 –0.28 –0.52 –1.72 –1.05 –1.32 0.53 19.19 56.99 89.92
1 split S 0.65 0.42 0.41 –1.24 1.53 0.95 1.21 2.01 19.57 68.07 93.59
2 splits S –1.71 –1.26 –1.12 –2.12 –1.94 –1.27 –1.10 –0.73 18.22 55.82 71.39
Note: norm = Bayesian regression imputation in mice. y indicates the dependent variable. S indicates latent slope.
Table 4.4
Percent Bias for Missing Data Methods in N = 200, r_xy2 ≈ .5 Conditions
Inverse weighting methods Multiple imputation methods
Complete Listwise FIML True Log CART Prune RF Norm CART RF
β00
1 split y –0.40 28.83 4.49 0.17 1.75 8.80 9.80 0.43 7.76 9.21 8.46
2 splits y –0.70 4.76 0.45 0.18 4.59 3.71 4.57 1.43 1.22 0.81 0.82
1 split S 0.03 12.79 1.25 0.91 1.96 3.53 3.67 –0.91 2.15 2.87 2.42
2 splits S 0.62 1.15 0.79 1.41 1.28 1.35 1.36 1.24 1.01 0.79 0.72
β10
1 split y –4.69 –90.76 –74.67 –4.80 –68.90 –74.38 –74.65 –67.9 –92.02 –139.20 –108.97
2 splits y 3.10 –10.82 –9.94 3.22 –10.75 –8.59 –9.36 –8.51 –15.26 –40.61 –32.65
1 split S –2.66 –125.61 –117.98 –7.12 –118.75 –118.9 –118.8 –116.41 –128.92 –162.61 –146.30
2 splits S 3.80 2.72 2.83 3.52 2.41 4.21 3.18 4.03 –0.62 –29.98 –20.56
β11
1 split y –1.73 –4.99 –4.06 –0.44 –2.28 –2.94 –2.65 –3.03 –9.96 –31.21 –25.20
2 splits y 0.31 –11.66 –11.04 –0.44 –11.62 –10.18 –10.37 –10.25 –15.2 –35.95 –29.30
1 split S –1.81 –1.79 –1.52 –1.99 –1.82 –1.38 –1.48 –1.51 –6.95 –28.55 –22.28
2 splits S 2.54 1.91 1.98 2.79 1.66 2.52 2.25 2.20 –1.20 –25.24 –17.23
σI²
1 split y –1.46 –11.22 –2.34 –1.47 7.77 –3.37 –2.30 2.08 –8.41 –16.26 –23.42
2 splits y 0.11 –30.47 –10.12 –2.65 –30.20 –15.69 –19.66 –5.82 –16.63 –26.40 –33.10
1 split S –0.68 –2.23 –0.49 –0.86 1.21 –1.46 –0.11 3.68 –2.97 –10.67 –17.19
2 splits S –2.30 –8.07 –3.18 –2.77 –7.89 –5.29 –5.03 0.84 –5.12 –12.64 –18.62
σS²
1 split y –0.59 –6.98 –6.68 –2.14 –4.12 –5.12 –5.25 –5.03 –4.37 3.36 –12.74
2 splits y –0.95 –18.90 –18.75 –3.89 –19.11 –18.72 –18.54 –17.86 –15.95 –10.70 –25.25
1 split S –0.38 –13.55 –13.48 –2.58 –12.71 –13.06 –12.66 –12.34 –10.88 –3.98 –19.65
2 splits S –0.69 –36.58 –36.50 –0.88 –36.58 –36.81 –36.62 –36.18 –34.34 –28.27 –41.39
σIS
1 split y 0.13 –17.78 –13.92 –0.25 –2.21 –9.84 –9.94 –6.07 –22.82 –35.24 –10.00
2 splits y 1.26 –56.28 –53.57 –2.49 –56.20 –51.57 –51.98 –47.33 –58.68 –72.99 –54.32
1 split S –0.33 –13.32 –12.54 –1.96 –11.75 –12.82 –11.55 –10.74 –19.04 –29.37 –11.94
2 splits S 0.61 –35.65 –34.17 –0.84 –35.61 –33.60 –33.97 –32.40 –37.14 –45.00 –33.94
σe²
1 split y –0.27 –0.27 –0.19 –0.99 1.16 0.52 0.44 1.18 5.83 20.24 63.52
2 splits y 0.09 -1.62 –0.72 –0.42 –1.87 –0.73 –1.02 0.72 4.52 17.36 55.36
1 split S 0.41 0.24 0.23 –0.15 0.20 –0.21 –0.08 0.42 5.17 16.40 56.23
2 splits S –0.42 –0.62 –0.44 –1.07 –0.85 –0.15 –0.19 –0.30 4.17 15.65 47.94
Note: norm = Bayesian regression imputation in mice. y indicates the dependent variable. S indicates latent slope.
Table 4.5
Percent Bias for Missing Data Methods in N = 200, r_xy2 ≈ .8 Conditions
Inverse weighting methods Multiple imputation methods
Complete Listwise FIML True Log CART Prune RF Norm CART RF
β00
1 split y –0.98 28.24 3.91 –0.66 –27.38 2.41 3.31 –9.71 –0.23 –2.79 0.93
2 splits y –0.05 5.13 0.77 0.28 5.16 0.86 1.00 –3.23 –0.04 –1.52 –0.52
1 split S –0.90 10.94 0.10 –0.75 –7.26 –3.99 –3.43 –12.30 –4.70 –6.60 –4.21
2 splits S –0.80 –1.38 –1.05 –1.72 –0.80 –3.27 –3.15 –4.42 –1.07 –2.46 –2.03
β10
1 split y 2.34 –83.34 –67.81 0.48 65.50 –10.45 –12.05 18.75 –13.91 –94.44 –37.67
2 splits y 2.46 –16.65 –16.09 –4.02 –15.67 –6.64 –6.20 4.80 –4.70 –35.33 –24.52
1 split S 0.79 –120.09 –112.48 –0.28 –52.82 –67.11 –71.72 –47.03 –69.92 –128.66 –91.24
2 splits S 2.97 3.18 2.86 5.16 2.66 6.72 5.41 8.95 2.14 –24.92 –18.11
β_11
1 split y 0.84 –4.34 –3.47 –1.07 18.16 1.42 1.01 2.66 –0.78 –27.03 –15.81
2 splits y –1.47 –15.15 –14.55 –3.77 –15.15 –5.16 –6.49 1.91 –6.61 –32.09 –21.81
1 split S –0.15 1.29 1.24 1.42 1.49 3.93 3.26 3.82 –1.25 –23.82 –15.73
2 splits S 2.96 2.52 2.57 3.55 2.73 6.73 6.10 8.85 2.48 –19.79 –13.96
ψ_I²
1 split y –1.81 –12.02 –2.68 –2.46 35.89 1.87 1.11 –0.07 –3.35 –7.56 –20.28
2 splits y –1.33 –31.84 –11.85 –4.63 –31.84 –7.02 –12.86 7.65 –5.21 –10.43 –27.60
1 split S –0.89 –3.30 –1.18 –1.97 5.23 0.50 1.44 4.12 –0.73 –5.86 –17.23
2 splits S –1.92 –9.22 –2.93 –4.35 –8.91 3.32 –0.19 13.27 1.90 –4.89 –14.81
ψ_S²
1 split y –1.83 –7.27 –7.01 –3.07 11.54 –0.55 –1.17 –1.21 –2.78 –0.19 –10.36
2 splits y –1.26 –16.65 –16.53 –1.12 –16.73 –5.60 –8.27 –0.18 –4.94 –9.62 –16.11
1 split S 0.15 –12.20 –12.08 –0.85 0.74 –5.24 –4.79 –3.01 –8.30 –3.29 –13.58
2 splits S –1.69 –37.90 –37.83 –3.58 –37.72 –32.50 –33.67 –28.78 –32.24 –29.28 –38.27
ψ_IS
1 split y –1.41 –18.49 –14.81 –1.21 62.49 5.07 4.85 5.86 –6.39 –43.70 1.51
2 splits y –2.11 –58.63 –56.57 –4.83 –58.44 –14.33 –25.09 9.45 –18.26 –72.96 –20.55
1 split S 1.12 –11.62 –10.44 0.36 15.03 0.50 1.33 4.97 –4.50 –37.83 1.09
2 splits S 0.21 –37.00 –35.38 –2.50 –36.70 –19.55 –23.72 –7.83 –19.52 –48.30 –20.54
σ_e²
1 split y –0.59 –0.43 –0.33 –0.54 1.31 –0.49 –0.61 –0.8 3.56 26.77 56.59
2 splits y –0.05 –0.94 0.00 0.72 –1.06 0.02 –0.44 0.12 4.27 27.03 61.25
1 split S –0.31 –0.44 –0.42 –0.98 0.79 0.28 0.13 0.73 4.68 23.13 52.33
2 splits S 0.00 –0.15 0.09 0.08 –0.25 0.48 0.22 0.96 5.46 23.13 46.56
Note: norm = Bayesian regression imputation in mice. y indicates the dependent variable. S indicates latent slope.
Table 4.6
Efficiency for Missing Data Methods
Inverse weighting methods Multiple imputation methods
Complete Listwise FIML True Log CART Prune RF Norm CART RF
N = 60
β_00
0.19 0.25 0.19 0.26 0.27 0.26 0.25 0.32 0.20 0.20 0.19
β_10
0.14 0.19 0.18 0.19 0.20 0.20 0.19 0.22 0.18 0.20 0.17
β_11
0.19 0.21 0.21 0.27 0.26 0.25 0.25 0.29 0.21 0.16 0.17
ψ_I²
0.19 0.23 0.20 0.26 0.28 0.26 0.26 0.29 0.21 0.20 0.20
ψ_S²
0.10 0.12 0.12 0.13 0.14 0.14 0.14 0.15 0.13 0.13 0.11
ψ_IS
0.10 0.11 0.12 0.13 0.14 0.13 0.13 0.15 0.11 0.11 0.11
σ_e²
0.04 0.05 0.05 0.05 0.06 0.06 0.06 0.07 0.06 0.13 0.14
N = 200
β_00
0.11 0.17 0.11 0.15 0.20 0.14 0.14 0.17 0.11 0.12 0.11
β_10
0.08 0.14 0.13 0.11 0.15 0.13 0.13 0.14 0.13 0.14 0.12
β_11
0.11 0.12 0.12 0.15 0.16 0.14 0.14 0.15 0.11 0.10 0.10
ψ_I²
0.10 0.14 0.11 0.14 0.24 0.15 0.15 0.17 0.12 0.12 0.12
ψ_S²
0.05 0.08 0.08 0.07 0.11 0.09 0.09 0.10 0.08 0.09 0.08
ψ_IS
0.05 0.07 0.08 0.08 0.13 0.09 0.08 0.10 0.07 0.08 0.07
σ_e²
0.02 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.05 0.06
Note: norm = Bayesian regression imputation in mice.
Figure 4.1: Hypothetical tree diagram from a CART analysis
Figure 4.2: Template Model and Parameter Values Used in the Simulation (all unlabeled paths
are equal to 1)
Figure 4.3: Tree structures used to induce dropout in the simulation study. Panel (a) depicts the
one-split condition, while panel (b) depicts the two-split condition. Predicted probabilities for
terminal nodes represent participants’ probability of returning to the study.
(a)
(b)
Chapter 5: General Discussion
The four simulation studies presented in this dissertation provide initial support for the
utility of CART-based missing data methods under a broad array of conditions. These
simulations demonstrate that CART and random forest methods can provide relief in terms of
bias under both MAR (Chapters 2 and 3) and MNAR missing data (Chapter 4), when the
probabilities of missing data are related to the variables in a dataset via a variety of different
functional forms (i.e., smooth functions and tree-based step functions that were linear, nonlinear,
interactive, or both nonlinear and interactive). Additionally, these methods outperformed
standard logistic regression and multiple imputation methods in many cases under both
normal and severely nonnormal data (Chapter 3). Furthermore, CART and random forests were
more effective than t-tests and logistic regression analyses when it came to identifying which
auxiliary predictor variables were truly related to the probability of missing data (Simulation A,
in Chapter 2). Finally, these simulations provide initial clarification concerning the conditions
under which CART-based weighting methods and CART-based multiple imputation methods
might each be preferred: weights performed better in small samples (Chapters 3 and 4), but
multiple imputation methods performed comparably in terms of bias and were dramatically
more efficient when the sample sizes were large (Chapter 3).
In designing these simulations, a key priority was balancing the overarching goals of each
simulation with experimental control on the one hand and ecological validity and realism on the other. Where possible,
the models under consideration closely matched data analysis models commonly used by
substantive researchers in real situations: cross-lagged regression models (Chapter 2),
moderation analysis assessing the interaction between a continuous moderator and a dummy-coded
experimental treatment variable (Chapter 3), and a simple latent growth curve model in which
experimental treatment group moderated (simulated) participants’ linear trajectories over time
(Chapter 4) – a common analysis model for assessing the results of randomized longitudinal trials
(cf. Yang & Maxwell, 2014). Additionally, because data are rarely, if ever, perfectly normally
distributed (cf. Micceri, 1989), the inclusion of nonnormal data conditions in the simulation
described in Chapter 3 better approximated the kinds of distributions that might be found in real
datasets. Finally, because in many real-world scenarios dropout could be closely related to the
outcome variable under study, the MNAR missing data conditions assessed in Chapter 4
provide an important evaluation of these methods under this prevalent and pernicious missing
data mechanism.
However, other aspects of these simulations sacrificed perfect realism in order to
prioritize another set of goals. For example, all of the missing data mechanisms simulated were
essentially arbitrary, chosen (particularly in the simulations featured in Chapters 3 and 4)
because they were especially harmful to model parameter estimates. In the simulation featured in
Chapter 4, for example, the MNAR mechanisms employed were so harmful that many of the
parameter estimates approached, and in some cases even exceeded, 100% bias! This occurred
because MNAR missing data are particularly harmful (Enders, 2010; Little & Rubin, 1987),
because the percentage of missing data was high (30%), and because dropout was generated at
the most harmful time point – time 2, the first time point after the simulated experimental
treatment began taking effect. Thus, this mechanism represented an extreme scenario and a
stringent test of the missing data methods’ abilities to reduce severe bias. A similar example
occurred in Chapter 3, in which 30% of simulated participants did not respond on an outcome
variable in a simple experimental design. In real-life situations, this exact scenario would be
extremely rare, but these simulation characteristics afforded examination of these missing data
methods under extreme conditions, when model estimates were particularly degraded. For this
reason, it was the relative performance of each missing data method in reducing bias to some degree
under these extreme circumstances, more than the absolute levels of bias, that formed the
primary basis for the discussion of simulation results.
Perhaps the most noticeably artificial aspects of the simulations were the simple
univariate missing data patterns employed. As briefly mentioned in Chapter 4, because the
classic implementations of CART and random forest analysis are univariate procedures in which
a set of predictor variables model individuals’ scores on a single outcome variable, it was
important to conduct an initial set of simulations assessing the performance of these CART-
based missing data methods under simple, univariate missing data. Thus, all simulations
described in this dissertation look at simple univariate dropout and nonresponse patterns – a two
time-point model with dropout at time 2 (Chapter 2), a simple experimental model with
nonresponse on y (Chapter 3), and a longitudinal study with dropout only occurring at time 2
(Chapter 4). Although such simple, univariate missing data scenarios are extremely artificial and
unlikely to occur in practice, these conditions served a crucial purpose in the simulations:
assessing CART and random forest methods under the basic univariate conditions for which they
were designed and thereby eliminating any additional error or bias that might have resulted from
the performance of any particular imperfect multivariate extension of CART.
Recall that, at the outset of this research program, the performance of these CART-based
missing data methods was completely unknown, even under simple scenarios. Indeed, the initial
simulation studies evaluating CART multiple imputation procedures (Doove, van Buuren, &
Dusseldorp, 2014; Shah, Bartlett, Carpenter, Nicholas, & Hemingway, 2014), cited in Chapters 3
and 4, had not been published at the time of the original dissertation proposal that provided the
launching point for this series of studies. Thus, before studying elaborate multivariate extensions
of CART, it seemed prudent to first conduct a series of initial tests of these CART-based
methods under basic conditions in order to assess whether they merited further scrutiny. Now
that we have these initial results in-hand – now that we know that, under basic conditions, these
classic univariate CART methods show some promise in helping ameliorate the harmful effects
of missing data – discerning how best to extend the basic method to more complex and realistic
multivariate missing data scenarios will be an important direction for future study. It is to these,
and other, key future directions that I turn in the next section.
Future Directions
As the foregoing discussion implies, the work on these CART-based missing data
methods has only just begun. Whereas the chained-equations imputation procedure used to
implement CART-based multiple imputation methods (e.g., as implemented in Doove et al.,
2014) is ideally suited to extending univariate CART procedures to multivariate missing data
scenarios, multivariate extensions of CART-based weighting methods are not as straightforward,
since missing data would need to be predicted on many variables and the results somehow
consolidated into a single set of probability weights.
To motivate a discussion of multivariate missing data, imagine a 5 time-point clinical
trial like the one depicted in Chapter 4. Furthermore, imagine that roughly ten percent of
participants drop out of the study at each time point, beginning with time 2. Thus, by time 5,
40% of the participants in the study would have dropped out. How can CART-based weighting
methods be extended to address this more complex situation? One simple approach would be to
predict dropout at only the final time point in the study, at which the most attrition has occurred (as did
McArdle, 2013). Alternatively, each time point could be predicted via separate CART or
random forest analyses, and predicted probabilities could be aggregated in some way (e.g., by
multiplying the predicted probabilities of dropout at each time point). However, like the predict-
the-last-time-point method, this approach would not take the censored structure of the data into
account. Moreover, if only a small number of participants drop out at any time point (e.g., 10%),
the missing data indicator would likely be so severely skewed that CART would fail to
form a tree with even one split (cf. Berk, 2009; Breiman, Friedman, Olshen, & Stone, 1984;
Hastie, Tibshirani, & Friedman, 2009).
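The multiply-the-probabilities approach mentioned above can be sketched in a few lines. The sketch below is a minimal illustration, assuming each completer already has predicted per-wave return probabilities from separate CART or random forest fits (the probabilities shown are invented); as noted, the simple product ignores the censored structure of the data.

```python
def inverse_retention_weights(per_wave_return_probs):
    """Combine per-wave predicted return probabilities into inverse
    probability weights for the completers.

    per_wave_return_probs[i] holds completer i's predicted probability
    of returning at each post-baseline wave (e.g., waves 2 through 5).
    """
    weights = []
    for probs in per_wave_return_probs:
        # Probability of being retained through every wave is the
        # product of the per-wave return probabilities.
        retained = 1.0
        for p in probs:
            retained *= p
        # Up-weight completers who "look like" likely dropouts.
        weights.append(1.0 / retained)
    return weights

# Hypothetical predicted probabilities for three completers across two waves.
w = inverse_retention_weights([[0.9, 0.8], [0.5, 0.5], [0.95, 0.9]])
```

A completer with predicted return probabilities of .5 at each of two waves receives a weight of 1/.25 = 4, standing in for the four similar participants (dropouts included) that person represents in the weighted analysis.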
For these reasons, it seems more desirable to employ survival-analysis-based extensions
of CART and random forests to address multivariate attrition using hazard functions and survival
probabilities that better capture the true nature of the MNAR mechanism (cf. Zhou & McArdle,
2015). Thus, one important line of future research will be to assess the performance of survival
CART and survival forest methods in modeling multivariate longitudinal attrition.¹ Once these
methods are assessed, another important future direction will concern the relative performance of
survival CART methods compared with more standard MNAR missing data analyses such as
selection models (Diggle & Kenward, 1994; Wu & Carroll, 1988) and pattern mixture models
(Hedeker & Gibbons, 1997; Yang & Maxwell, 2014). This comparison should be most
interesting when dropout occurs as a nonlinear or interactive function of both the outcome of
interest and a set of observed predictors. Such scenarios are likely to be extremely common in
real data, as, for example, when participants drop out of a depression trial not only because they
are depressed, but also because they are low in socioeconomic status, lack access to a vehicle,
and do not live close to a public bus line. Thus, although it is probably unrealistic to assume that
¹ Another important facet of this research will concern whether discrete-time or continuous-time
implementations of these survival methods produce stronger results.
attrition is unrelated to one’s dependent variable, it is probably just as unrealistic to assume that
it is only related to the dependent variable and nothing else in the dataset. It is these latter, more
complex missing data mechanisms that CART-based methods stand particularly poised to
unearth.
Beyond distinguishing between univariate and multivariate missing data scenarios,
further distinction can be made between dropout mechanisms, in which participants go missing
from the dataset at a given time point and never return, and intermittent missing data patterns, in
which participants’ scores are only intermittently missing, for example on a certain variable in
the dataset, or at a certain measurement occasion (cf. van Buuren, 2012 for an accessible
overview of these issues). In the case of inverse weighting methods, which up-weight the
complete cases in the dataset, intermittent values cause a potential problem: even one missing
value on one variable (or at one measurement occasion) would cause the entire case to be
deleted from the analysis.
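The logic of up-weighting the complete cases can be seen in a deliberately simple numerical illustration (the groups, response probabilities, and counts below are invented for exposition, not taken from the simulations): a complete-case mean underrepresents the group prone to nonresponse, while weighting each observed case by the inverse of its response probability restores the full sample's composition.

```python
# Two groups of 1,000 simulated participants each: group A (y = 0) responds
# with probability .9; group B (y = 1) responds with probability .5.
# Expected observed cases: 900 from A and 500 from B; true mean of y is 0.5.
observed = [(0.0, 0.9)] * 900 + [(1.0, 0.5)] * 500  # (y, response probability)

# Complete-case mean underrepresents group B and is biased downward.
cc_mean = sum(y for y, _ in observed) / len(observed)

# Inverse probability weighting: each observed case stands in for 1/p
# cases like it (respondents and nonrespondents alike).
ipw_mean = (sum(y / p for y, p in observed) /
            sum(1.0 / p for _, p in observed))

print(round(cc_mean, 3), round(ipw_mean, 3))  # CC mean ≈ 0.357; IPW mean = 0.5
```

The weighted mean recovers the true value exactly here because the response probabilities are known; in practice they must be estimated, which is precisely the job handed to logistic regression, CART, or random forests.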
One potential solution to this problem comes from the MNAR literature, in which
intermittent missingness is often assumed to result from MAR processes whereas participant
dropout is assumed to be MNAR (Enders, 2011). If this assumption approximates the truth, then
perhaps a combination of imputation and weighting methods may be appropriate. First,
intermittent missing data could be imputed using some form of standard Bayesian regression or
CART-based multiple imputation. Then, survival CART or survival forest methods could be
used to generate survival probabilities and produce an appropriate set of inverse weights. Once
this two-step process has been carried out on each of several imputed datasets, results can be
pooled in the usual way.
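“Pooled in the usual way” refers to Rubin’s rules: average the M point estimates and combine within- and between-imputation variance. A minimal sketch (the estimates and variances in the usage line are hypothetical):

```python
def pool_rubin(estimates, variances):
    """Pool one parameter across M imputed datasets via Rubin's rules.

    estimates: the M point estimates (one per imputed dataset).
    variances: the M squared standard errors (within-imputation variances).
    Returns the pooled estimate and its total sampling variance.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                    # pooled point estimate
    w = sum(variances) / m                        # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    t = w + (1 + 1 / m) * b                       # total variance
    return q_bar, t

# Hypothetical estimates of one coefficient from M = 3 imputed-then-weighted datasets.
q_bar, t = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The same pooling applies whether each dataset was analyzed with or without the survival-based weights, which is what makes the two-step impute-then-weight proposal straightforward to combine.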
Finally, an additional area of future research concerns missing data on the predictor
variables. The simulations detailed in this dissertation have all assumed that every participant has
complete data on a set of baseline auxiliary variables. However, this is often not the case, as
when individuals miss their first measurement session or intermittently fail to answer certain
items. Thus, in a variety of realistic situations, researchers have to contend with missing data on
both the auxiliary predictor variables and their model variables of interest. Standard CART and
random forest methods do have built-in procedures for dealing with missing data on the predictors
(Berk, 2009 provides an accessible overview), but it would be interesting and worthwhile to
compare these built-in methods with more standard methods, such as filling in holes in the
predictor variables using some variant of multiple imputation.
As can be seen, the goal of extending CART-based weighting methods to more realistic
circumstances will necessarily involve a host of step-by-step incremental extensions and tests to
discern how best to configure the method to deal most easily and appropriately with
multivariate dropout, intermittent missing data, and missing values on the predictor variables.
These extensions and future directions seem interesting and potentially worthwhile. Yet, other
ways of using the same information from CART and random forest analyses can be envisaged.
Although the studies reported in Chapters 3 and 4 have primarily distinguished between
weighting methods and imputation methods, the more pertinent distinction at the heart of these
algorithms lies between using CART-based methods to predict the probability of missing data (or
dropout) versus using these same methods to predict the values of the outcome variable under
study. Conceived in this manner, the probabilities of dropout generated by a CART analysis
could be used in a variety of ways besides forming the basis for inverse probability weights. For
example, an initial CART analysis could generate predicted probabilities of missing data
resulting from some nonlinear and interactive function of a set of auxiliary variables, and then
these predicted probabilities could, themselves, be used as an auxiliary variable (or set of
auxiliary variables) that could then help to generate more accurate imputations or inform the
likelihood estimation in a FIML analysis. Such approaches would be easily implemented and
would obviate many of the difficulties inherent in dealing with sample weights. These methods
would also come with the added benefit of retaining the full sample size instead of sacrificing
some cases to listwise deletion while strategically reweighting others.
Although these seem to be the most pressing and important future directions, a variety of
other issues need to be investigated. Among these are modeling scenarios involving more varied
combinations of ordinal and categorical predictor and outcome variables, assessing the
performance of CART-based methods in nested, multilevel data, and more systematically
assessing alternative methods of rescaling the sample weights in weighted analyses (cf.
Stapleton, 2002).
Summary and Conclusion
In sum, these CART-based missing data methods seem quite promising in addressing
missing data under a variety of circumstances, particularly when features of the substantive
model and missing data model contain nonlinearities and interactions. Although CART-based
chained equations imputation can be easily applied using existing software (van Buuren &
Groothuis-Oudshoorn, 2011), future work will be needed to extend CART-based weighting
methods to more realistic circumstances facing researchers in practice. It is my sincere hope that
the papers contained in the body of this dissertation prove thought-provoking to others and spark
future research into the intricacies of missing data analysis in small sample designs under
nonlinear and interactive missing data mechanisms.
References
Berk, R. A. (2009). Statistical learning from a regression perspective. New York: Springer.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression
Trees. Pacific Grove, CA: Wadsworth.
Diggle, P., & Kenward, M. G. (1994). Informative drop-out in longitudinal data analysis.
Applied Statistics, 43(1), 49–93. http://doi.org/10.2307/2986113
Doove, L. L., van Buuren, S., & Dusseldorp, E. (2014). Recursive partitioning for missing data
imputation in the presence of interaction effects. Computational Statistics & Data Analysis,
72, 92–104. http://doi.org/10.1016/j.csda.2013.10.025
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Enders, C. K. (2011). Missing Not at Random models for latent growth curve analyses.
Psychological Methods, 16(1), 1–16. Retrieved from http://dx.doi.org/10.1037/a0022640
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning. New
York: Springer-Verlag.
Hedeker, D., & Gibbons, R. D. (1997). Application of random-effects pattern-mixture models for
missing data in longitudinal studies. Psychological Methods, 2(1), 64–78.
http://doi.org/10.1037/1082-989X.2.1.64
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
McArdle, J. J. (2013). Dealing with longitudinal attrition using logistic regression and decision
tree analyses. In Contemporary issues in exploratory data mining in the behavioral sciences
(pp. 282–311). New York: Routledge.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures.
Psychological Bulletin, 105(1), 156–166. http://doi.org/10.1037/0033-2909.105.1.156
Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., & Hemingway, H. (2014). Comparison
of random forest and parametric imputation models for imputing missing data using MICE:
A CALIBER study. American Journal of Epidemiology, 179(6), 764–774.
http://doi.org/10.1093/aje/kwt312
Stapleton, L. M. (2002). The incorporation of sample weights into multilevel structural equation
models. Structural Equation Modeling: A Multidisciplinary Journal, 9(4), 475–502.
http://doi.org/10.1207/S15328007SEM0904_2
van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: Chapman &
Hall/CRC.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45(3), 1–67.
http://doi.org/10.18637/jss.v045.i03
Wu, M. C., & Carroll, R. J. (1988). Estimation and comparison of changes in the presence of
informative right censoring by modeling the censoring process. Biometrics, 44(1),
175–188. http://doi.org/10.2307/2531905
Yang, M., & Maxwell, S. E. (2014). Treatment effects in randomized longitudinal trials with
different types of nonignorable dropout. Psychological Methods, 19(2), 188–210. Retrieved
from http://dx.doi.org/10.1037/a0033804
Zhou, Y., & McArdle, J. J. (2015). Rationale and applications of survival tree and survival
ensemble methods. Psychometrika, 80(3), 811–833. http://doi.org/10.1007/s11336-014-9413-1
Technical Appendix A
A Tale of Three Missing Data Mechanisms
All methods for the treatment of missing data – including those that ignore missing data
completely – carry assumptions about the nature of incompleteness in one’s dataset. Therefore,
in order to appropriately adjust their models to account for the possible effects of missing data,
researchers must first determine, either implicitly or explicitly, a model for incompleteness. To
this end, the gold standard for understanding mechanisms of incompleteness is still Rubin’s
(1976) classic typology.
In order to make sense of Rubin’s mechanisms, imagine a simple analysis scenario with
two variables: a covariate, z, and an outcome, y, with complete data. Further, imagine an
alternative version of y, which we may designate y_miss, that has missing data. That is, where there
are complete data, y_miss = y, but where there are incomplete data, y_miss is missing (denoted, e.g., by
‘*’ or ‘NA’).
In this scenario, y_miss is the variable one typically has in a dataset whereas y is the
complete data version one wishes one had. In our theoretical analysis, the complete-data variable
y can be conceptualized as a latent or true score version of the outcome that contains all of the
true values of y, even where y_miss has missing data. Although analysts do not have access to
latent variable y (which may just as easily be denoted y_complete) in practice, this is a useful quantity
to work with for theoretical purposes. Finally, imagine a missing data indicator variable R that
is coded R = 1 for missing values of y_miss and R = 0 for complete data values of y_miss.¹ In this
simplified example, Rubin’s (1976) missing data mechanisms classify missing data into different
categories based on the nature of the relationships among these four variables: z, y_miss, R, and y.
¹ Note that this missing data indicator is coded in the reverse way of the return/response
indicators used in the CART methods described in the main body of this dissertation.
Presaging the discussion to come, there is one other impactful factor worth mentioning: the
degree of correlation between z and y, which may be written r_zy.
At its core, Rubin’s analysis classifies mechanisms of incompleteness according to where
they fall on two primary dimensions. The first key dimension asks: is incompleteness on
outcome variable y_miss explained by the observed part of the data? This is tantamount to asking
the question “can the probability that R = 1 be predicted from the observed data?” Since
incompleteness on y cannot be accurately predicted from the observed part of y_miss (because R = 0
for all non-missing values of y_miss, leaving no variation in R to predict), this is typically
operationalized in terms of whether incompleteness is predictable from observed covariate z. If
the answer to this question is no, the missing data are said to be Observed at Random (OAR;
Rubin, 1976). That is, incompleteness is random with respect to the observed part of the data
(e.g., the observed covariates). If the answer to this question is yes, incompleteness is not
considered observed at random.
Note that many tests commonly described as distinguishing MCAR from MAR
data, such as predicting R from z in a logistic regression analysis or using R to compare the
means of z among individuals missing on y_miss (i.e., R = 1) and those who are nonmissing on y_miss
(R = 0), are actually tests of the OAR dimension of Rubin’s typology (Rhoads, 2012). As we will
see below, these tests can only ascertain if the observed portion of the data are related to R, but
they are silent on the question of whether the missing portion of the data are related to R.
The second major dimension asks: is incompleteness on y caused by the missing portion
of the data? In the present example, this asks whether incompleteness is related to the
unobserved portion of y_miss – that is, the values of y that the analysis does not have access to due
to missing data. If the answer to this question is no, then the data are said to be Missing at
Random (MAR). That is, in this case, incompleteness is random with respect to the missing
portion of y_miss. If the answer is yes, however, the data are said to be Missing Not at Random
(MNAR).
Rubin’s (1976) classifications result from crossing these two dimensions, as shown in the
table below:
Mechanisms of Incompleteness, Simplified (from Rubin, 1976)

                                          Incompleteness on y related to covariate z?
                                          (R significantly predicted by z)
Incompleteness on y_miss                  No                              Yes
related to y itself?                      (Observed at Random – OAR)      (Not Observed at Random)
(R significantly predicted
by latent/true score y)
  No (Missing at Random – MAR)            MAR + OAR = Missing             MAR only = Missing at
                                          Completely at Random (MCAR)     Random (MAR)
  Yes (Not Missing at Random)             Missing Not at Random (MNAR)

Note: MAR applies only if the covariate z is included in the analysis – that is, if you condition on
z in the substantive analysis of interest.
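The two-by-two structure above reduces to a simple decision rule. The sketch below is a didactic restatement of the table, not a statistical test one could run on real data (the y-related dimension involves the unobserved scores):

```python
def classify_mechanism(incompleteness_related_to_z, incompleteness_related_to_y):
    """Return Rubin's (1976) label for a crossing of the two dimensions.

    incompleteness_related_to_z: is R predicted by the observed covariate z?
    incompleteness_related_to_y: is R predicted by the latent/true score y?
    """
    if incompleteness_related_to_y:
        # Either column of the bottom row: Missing Not at Random.
        return "MNAR"
    if incompleteness_related_to_z:
        # MAR but not OAR: plain Missing at Random (condition on z!).
        return "MAR"
    # MAR + OAR: Missing Completely at Random.
    return "MCAR"
```

Only the bottom row depends on the unobserved scores, which is why MNAR can never be ruled out from the observed data alone.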
This two-by-two table is essentially a contingency table that allows for classification of
combinations of conditions (i.e., data are not MAR or OAR; data are MAR but not OAR,
etcetera). To take an initial example, if incompleteness is both unrelated to observed covariate z
(observed at random) and unrelated to y itself (missing at random), the situation is what Rubin
termed Missing Completely at Random (MCAR), indicating that incompleteness is completely
random with respect to both the observed and missing portions of the variables in the dataset. In
the unlikely scenario that incompleteness is the result of a completely random process (e.g., the
computer randomly deletes a percentage of participants from analysis on the basis of draws from
a random number generator), a complete case analysis would produce unbiased results. In
essence, the MCAR mechanism indicates that z has no relationship to R and y has no relationship
to R – both variables are essentially orthogonal to the probability of missing data. For this reason,
the degree of correlation between z and y makes no difference to the effects of missing data on
the results of an analysis. The missing data process is unrelated to both of these variables.
Now, take a somewhat trickier example. When incompleteness on y is unrelated to y itself
(missing at random) but is related to observed covariate z in the data (not observed at random),
and if, further, this covariate is included in the analysis in some manner, then the designation is
simply Missing at Random, or MAR. It is worth pausing to contemplate what this means. To
enhance our discussion, let’s add one more variable into the mix. Imagine some complete data
independent variable, x, that we wish to use to predict y_miss. Thus the substantive analysis model
of interest may be the regression y_miss = b0 + b1x + [error]. Now, z is an auxiliary covariate that
may or may not be included in the analysis to address missing data.
Here’s where the correlation between z and y becomes important. If z and y are
completely unrelated (orthogonal), then we have a situation where z and R are related, so z is Not
Observed at Random (Not OAR). But because z is unrelated to y, the relationship between z and
R has no bearing on the bias returned in analyses of x predicting y_miss. Thus, in this idealistic
situation, when r_zy = 0, z does not need to be included as an auxiliary covariate, as including this
covariate would do nothing to ameliorate bias in the x → y regression.
However, in most situations, z and y are correlated to some degree, that is, r_zy ≠ 0. In this
scenario, because y and z share variation, and because z is a cause of missingness (z predicts R),
failing to condition on z will leave substantial bias in the regression of y on x. But if the analysis
conditions on z, y will be unrelated to missing data.
Okay, you might be thinking, but what does that mean? If z and y are moderately
correlated – say, r_zy = .6 or r_zy = .8 – then they are very similar variables. As z increases, the
probability of missing data – p(R = 1) – increases, and the same should be true of y. This much is
certainly correct. But although z and y may be correlated in these examples, they are not exactly
the same. Thus, when we hold z constant at some value, say z = 3 (where z may be a Likert-type
variable with discrete values of 1, 2, 3, 4, or 5), then the latent/true values of y should have no
relationship to the probability of missing data.
This can be clarified with an example. Imagine an analysis in which the outcome (y) is a
measure of life satisfaction. Now, imagine that life satisfaction is at least somewhat correlated
with socioeconomic status (SES, variable z in the running example). Finally, imagine that
individuals with lower SES are more likely to have missing data (e.g., to fail to respond to a life
satisfaction question, or to drop out of a longitudinal study). Because missingness is related to
SES, the SES variable is Not Observed at Random. Because SES and life satisfaction are
correlated, the correlation of life satisfaction (if we had all values, including the missing ones)
and missing data – that is, the correlation of y and R, in the running example – would be
significant and negative (the lower the life satisfaction – correlated with SES – the higher the
probability of dropout, or R being 1).
Now, here’s where it gets interesting. Imagine for a second that we had access to all
values on the life satisfaction variable, including the missing ones. Because the data are OAR
with respect to SES but MAR with respect to life satisfaction (that is, it wasn’t the unobserved
values of life satisfaction, themselves, that caused missing data), if we hold SES constant at some
value (say, a value corresponding to relatively low-SES people), the probability of going missing
should be unrelated to life satisfaction. In words: lower SES people may be more likely to have
missing data, but among low SES people, the value of life satisfaction is not predictive of
missing data. If this condition is true, the data are missing at random – random with respect to y
(life satisfaction) when z (SES) is held constant.
In terms of our original four variables, y, z, R, and y_miss, here's what this means. When z is
not OAR and y is MAR, the following would be true. The b_1 coefficient would be statistically
significant in the logistic regression logOdds[R = 1] = b_0 + b_1z; this indicates that z is not OAR.
Similarly, the b_1 coefficient would be statistically significant in the logistic regression
logOdds[R = 1] = b_0 + b_1y, in which y predicts missing data without controlling for z. However, in the
logistic regression logOdds[R = 1] = b_0 + b_1y + b_2z, the b_2 coefficient should be significant,
indicating that z significantly predicts missing data, but the b_1 coefficient should be
nonsignificant, indicating that y is unrelated to the log odds of missing data when z is taken into
account.²
These three hypothetical logistic regression analyses illustrate three important points.
First, the OAR test involves predicting the probability (or, in the example, log odds) of missing
data from z. Second, the relationship between R and y is NOT random or nonsignificant when z
is not taken into account. The importance of this observation cannot be overstated – the
imperative to condition on auxiliary variables provides a major basis for this entire dissertation.
Finally, when z is included in the logistic regression of R on y, the relationship between R and y
goes to zero.
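As a concrete, purely hypothetical illustration of this logic, the running example can be simulated: z causes missingness, y is merely correlated with z, and so y relates to R marginally but not once z is held constant. The function name and all numeric values below are illustrative assumptions, not taken from the text:

```python
import math
import random

def simulate_mar(n=20000, seed=0):
    """Simulate the running example: z causes missingness (R),
    while y is correlated with z but does not itself cause R."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        y = 0.7 * z + rng.gauss(0, 0.714)   # corr(z, y) around .70
        p_r = 1 / (1 + math.exp(-z))        # p(R = 1) increases with z
        r = 1 if rng.random() < p_r else 0
        rows.append((z, y, r))
    return rows
```

Comparing the mean of y between the R = 1 and R = 0 cases reproduces the marginal relationship; repeating the comparison within a narrow band of z values shrinks the difference toward zero, which is exactly the MAR condition described above.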
Two important consequences of this final test bear mentioning. Or yelling. First one: This
final regression, predicting R using both z and true score/latent y is a true test of the MAR
assumption. If true score y is unrelated to R after conditioning on z, the MAR assumption is met
² Note that this is a demonstrative example. Significance testing is, of course, subject to the usual
caveats of sample size, Type I errors, etcetera. But conceptually, this should most often be true,
given adequate sample size to conduct these tests.
and you're good to go, right as rain. However, if y remains significantly related to R, then the
data are MNAR (at least in the context of this analysis), and one of two things is true: either (a)
the values of y itself caused the missing data, or (b) some other, unobserved variable would account
for the relationship between y and R, if only we had it in our dataset.
Here’s the second important consequence, and it is a doozy: the logistic regression test
for MAR just described is contingent upon having access to the latent/true score variable y,
which we never, ever have access to in practice. Instead we only have access to stupid, ugly y_miss,
we do not pass go, we do not collect two hundred dollars, and there is literally no way to assess
whether the MAR assumption is violated or met. Moral of the story: you can only test for OAR
and there is no way to test for MAR. Do not let anybody tell you differently.
Finally, if incompleteness on y is related to the true values of y itself, regardless of
whether other variables in the dataset (i.e., z in this example) are observed at random or not, the
situation is said to be Missing Not at Random (MNAR). Importantly, if incompleteness on y is
caused by y itself, the resulting bias will not necessarily be completely removed by analyses that
correct for the relationship of incompleteness with other covariates. In terms of the logistic
regressions described above – the theoretical ones, which assume that we have access to complete
data/latent/true score y instead of just y_miss – this means that in the final logistic regression, i.e.,
logOdds[R = 1] = b_0 + b_1y + b_2z, the b_1 coefficient will be significant, regardless of the fact that z
is included. However, including z can still help one's substantive analyses. The more highly
correlated y and z are, and the more complete data z has where y_miss contains missing values (in
the running example, z is complete), the more that including z in analyses of y_miss can help
alleviate bias, even if the data are MNAR.
Summary: What in God's Holy Name Are You Blathering About, Dude?
This technical appendix presented a large amount of detail about Rubin's (1976)
mechanisms of missing data. I included this relatively voluminous appendix for two reasons: (1)
because knowing more information in a more in-depth manner is almost always helpful in truly
understanding a subject, and (2) even if you don't believe (1), I wanted to write all of this
information down somewhere that I would be able to easily find it later (i.e., a dissertation
appendix). However, for those who want the CliffsNotes version, here they are:
1. Missing Completely At Random (MCAR) indicates missing data are caused by a
completely random process that is orthogonal to all model variables. In this case, listwise
deletion estimates will be unbiased, but the analysis will be lower in statistical power as
a result of throwing out missing cases.
2. Missing At Random (MAR). In practice, this typically means that missing data are related
to some covariate(s), z_i, in the dataset (not observed at random), but not caused by the
values of y itself (hence, missing at random – MAR). If a substantive analysis conditions
on z_i, the results will be unbiased.
3. Missing Not At Random (MNAR): This means that missing data were caused by the
missing values (e.g., of y) themselves.
Technical Appendix B
Step It On Up: An Overview of Piecewise Functions and Step Functions
This appendix provides a brief overview of piecewise functions and step functions to help
facilitate the discussion of functional form in the proposal. It is easiest to introduce step functions
by first introducing piecewise functions.
Piecewise functions
Psychologists are familiar with several standard functional forms. Virtually every
researcher knows, for example, that the equation for a straight line can be written as
y = b_0 + b_1x, or a parabola can be written with the equation y = b_0 + b_1x + b_2x². Piecewise functions
occur when different equations apply to different parts of the domain of the function. The
equations could have the same functional form, for example linear equations with different
values of b_0 and b_1. The following graph displays a piecewise function:
Note, the Y-axis is labeled p(Y = 1|x) to keep the discussion consistent with the example
of predicting the probability of being classified as a 1.
Here, each piece of the piecewise function is linear, but the values of the intercept and
slope are different at different values of x. This piecewise function can be specified as follows:
y = b_0 + b_1x, where:
If x ≤ 4: b_0 = −0.1, b_1 = 0.2
If 4 < x ≤ 6: b_0 = 0.7, b_1 = 0
If 6 < x ≤ 8: b_0 = 0.1, b_1 = 0.1
If x > 8: b_0 = 1.7, b_1 = −0.1
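This piecewise specification can be evaluated directly in code; a minimal sketch, using the coefficient values just listed:

```python
def piecewise(x):
    """Evaluate the piecewise-linear function specified above:
    each branch applies y = b0 + b1*x with branch-specific coefficients."""
    if x <= 4:
        b0, b1 = -0.1, 0.2
    elif x <= 6:
        b0, b1 = 0.7, 0.0
    elif x <= 8:
        b0, b1 = 0.1, 0.1
    else:
        b0, b1 = 1.7, -0.1
    return b0 + b1 * x
```

Note that these particular coefficients make the pieces meet at the breakpoints (e.g., both the first and second branches give y = 0.7 at x = 4), so this piecewise function happens to be continuous.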
Step functions
With this background in mind, it is easy to define step functions. A step function is a
piecewise function with two changes: (1) its pieces are constant, in contrast to the linear function
used in the example above, and (2) it is discontinuous; there are breaks in the function. One
example of a step function is seen in the plot below:
Here we see that for all x ≤ 3, p(Y = 1|x) = .4; for all 3 < x ≤ 6, p(Y = 1|x) = .7; and for all
x > 6, p(Y = 1|x) = .2. It is easy to understand how this graph represents a nonlinear functional
relationship between x and y. Rather than increasing (or decreasing) in a smooth, monotonic way
as x increases, this function produces very different predicted values across different sections of
x.
It is also easy to recognize that this functional form is the same kind produced by a tree
diagram. This diagram would correspond to a tree structure with two splits, one at x = 3 and another
at x = 6, as shown below:
What about a tree with an interaction? Take, for example, another tree with two splits. But
this time, the second split is on another variable, z:
In this case, we see that when x < 4, the predicted probability is .6, regardless of the value of z,
but when x ≥ 5, the predicted probability depends on whether z is greater than or less than
cutpoint c_1. This complex interactional relationship can be represented graphically with the
following step function:
Here, the interaction is depicted by the color-coding. If z < c_1, the line is blue. If z > c_1,
the line is red. Here we see that when x ≥ 6, these red and blue lines are separate: the predicted
value for blue is .2 and for red is .9, as in the tree diagram. When x < 6, however, the level of z
no longer matters and the two lines are superimposed on top of one another. This is represented
by the purple line on the left-hand side of the graph, since red + blue = purple.
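The two-split tree with the interaction can be sketched as a nested conditional. The text places the x cutpoint between 4 and 5 (the plot description refers to 6) and leaves the z cutpoint c_1 abstract, so both boundary choices below are illustrative assumptions:

```python
def tree_with_interaction(x, z, c1=0.0):
    """Predicted p(Y = 1) from the two-split tree described above.
    c1, the cutpoint on z, is left abstract in the text; the default
    of 0.0 here is an arbitrary placeholder."""
    if x < 4:        # first split: below the x cutpoint, z is ignored
        return 0.6
    if z > c1:       # second split, on z (the "red" branch)
        return 0.9
    return 0.2       # the "blue" branch
```

The nesting mirrors the tree structure: z matters only inside the right-hand branch of the first split, which is precisely what produces the interaction in the step-function plot.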
Technical Appendix C
Two Notes About Decision Tree Analysis, and a Summary
Note 1: Some additional details of node impurity. How is the “best split” defined in a
CART model? As stated, the goal of CART is to identify subgroups of participants with
homogeneous scores on the dependent variable. Following in the footsteps of many other
venerable statistical traditions, the measures used to assess goodness of splits are phrased as a
double-negative: just as the best-fitting SEM model among a set of competing models is the
model with the least misfit, the best, or most homogeneous, split among a set of competing splits
in a CART model is the split that creates two subgroups with the least heterogeneity. In the
context of CART, subgroup heterogeneity is assessed via measures of node impurity.
When the dependent variable in a CART analysis is categorical, impurity is defined in
relation to the proportion of cases in each node belonging to each category. In essence, nodes
that predominantly contain observations sharing the same category membership are less impure
(less heterogeneous/more homogeneous), whereas nodes that contain a random mixture of category
memberships are more impure (more heterogeneous/less homogeneous). With a binary dependent
variable coded 0/1, this means that less impure nodes contain either mostly 0s or mostly 1s,
while more impure nodes contain an even mixture of 0s and 1s.¹
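To make this behavior concrete, one common impurity function, the Gini index, can be computed for a binary node from its class proportions; it is zero for a pure node and maximal at a 50/50 split. A minimal sketch:

```python
def gini(p1):
    """Gini index for a binary node, given the proportion p1 of
    cases in the node belonging to class 1."""
    p0 = 1.0 - p1
    return 1.0 - (p0 ** 2 + p1 ** 2)
```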
Whereas impurity of a categorical dependent variable is defined as heterogeneity of
categories within a node, the impurity of a continuous dependent variable is defined in terms of
the squared distances of each of the scores in a node from the mean of the observations within
that node. Note that this quantitative definition is also, conceptually, a measure of heterogeneity.
¹ Three common functions used in categorical node impurity calculations are the Bayes error, cross-entropy,
and the Gini index (for details, see Berk, 2009; Hastie et al., 2009; James et al., 2013).
If scores on y within a node were completely homogeneous – that is, if all observations in the
node shared the same score, such that all y_i = ȳ – then impurity would be at a minimum of zero,
since all deviations would be zero.
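For continuous outcomes, the impurity just described can be computed directly; a minimal sketch:

```python
def node_impurity(scores):
    """Within-node sum of squared deviations about the node mean
    (the continuous-outcome impurity measure described above)."""
    mean = sum(scores) / len(scores)
    return sum((y - mean) ** 2 for y in scores)
```

As the text notes, a node in which every observation shares the same score has impurity exactly zero.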
Left unattended, the CART algorithm will proceed to create very large trees in which the
terminal nodes contain very few observations. For example, in the dichotomous case, imagine a
node with only two observations, both of class y = 1. This node is highly homogeneous inasmuch
as both cases in the node share the same observed class, but the predictions from this node are
based on only two observations and are likely to be highly unstable when the same model is fit to
new data. Such situations, in which the estimates from statistical models are highly specific to a
given dataset, are instances of what is termed the model overfitting the data. Two approaches
have been proposed to address the problem of overfitting. Early approaches addressed the size of
the tree and the number of observations in each terminal node using stopping criteria and cross-
validation, whereas more recent ensemble methods incorporate bootstrapping to help increase the
predictive accuracy of CART estimates. Because bagging and random forest methods are
described in detail in the main body of this dissertation, here I restrict attention to some
additional details about cost-complexity pruning.
Note 2: Some additional details about cost-complexity pruning. The first attempts to
improve the predictive validity of CART estimates centered around altering the size of the
resulting CART tree, favoring more stable trees with fewer “branches,” each containing a larger
number of observations. One easy way to achieve this is to manually set a minimum size for each
terminal node – for example, no terminal node may have fewer than 10 observations. This
approach essentially imposes an external criterion designed to stop the tree from growing too
large. As intuitive as this approach is, however, it suffers from the flaw that it is impossible to
know, a priori, whether any given stopping rule (e.g., n no lower than 20) will stop the growth of
the tree prematurely. That is, if one stops at a terminal node with 20 observations, for example,
there may be a further split that could improve the predictive accuracy of the model (cf. Berk,
2009; Louppe, 2014).
A second approach to regulating the size of CART trees obviates this difficulty by
growing large trees and then “pruning” them back using a technique termed cost-complexity
pruning. This technique uses cross validation to estimate a penalty parameter for tree complexity
designed to result in a more stable, predictive tree. Using standard notation, if |T| represents the
number of terminal nodes in a given tree being fit to a dataset and α represents the penalty
parameter for tree complexity, then the conceptual formula would take the form:

Overall Error of Tree = Sum of Error in Each Terminal Node + α|T|

The goal of cost-complexity pruning is to estimate a value of alpha that minimizes the
overall error of the tree. It is easy to see how the complexity parameter works in this conceptual
formula. Adding a nonzero value of α|T| to the equation increases the misfit of the overall tree
proportional to the value of alpha and/or the number of terminal nodes, |T|. All else being equal, a
tree with many terminal nodes adds more to the overall error than a tree with fewer terminal
nodes. Cost-complexity pruning uses k-fold cross-validation to estimate alpha, where k is
typically set to 5 or 10. The main purpose of cross-validation in pruning is to obtain an empirical
estimate of which value of alpha minimizes the overall error and thereby increases the predictive
validity of the tree (for details, interested readers can consult Hastie, Tibshirani, & Friedman,
2009; James et al., 2013).
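The conceptual formula above amounts to a one-line function; `node_errors` here is a hypothetical list of per-terminal-node error values, so |T| is just its length:

```python
def penalized_error(node_errors, alpha):
    """Cost-complexity criterion: total error across terminal nodes
    plus alpha times the number of terminal nodes, |T|."""
    return sum(node_errors) + alpha * len(node_errors)
```

With alpha = 0 this reduces to raw training error, which favors large trees; increasing alpha makes each additional terminal node more expensive, which is what pushes the pruned tree toward fewer branches.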
Here, again, the specific computational details depend upon whether the outcome
variable is categorical or continuous. With categorical outcomes, prediction error is defined in
terms of the risk in each terminal node, which is comprised of the proportion of misclassified
cases in each terminal node (e.g., cases given a predicted class of 0 in spite of having scores of 1
on the outcome variable, or vice versa) and a weighting parameter that imposes a cost on each
type of misclassification (e.g., classifying a case as a 0 when it is really a 1, and vice versa). With
continuous outcomes, prediction error is the same as node impurity: the within-node sum of
squared deviations from the node's mean.

Table C1: Categorical vs. Continuous CART

Single-tree CART

Impurity:
Categorical: a function of the class proportions in the node; minimum if all observations are the same class, maximum at a 50/50 split (in the binary case).
Continuous: sum of squared deviations about the node mean.

Predicted values:
Categorical: predicted probability of classification = proportion of cases in the terminal node with that class membership; predicted class membership = class membership of the majority of cases in the terminal node.
Continuous: predicted value = mean of the within-node observations.

Cost-complexity pruning:
Categorical: minimize overall risk = sum[risk in each terminal node] + α|T|.
Continuous: minimize overall error = sum[squared error about the mean in each terminal node] + α|T|.

Ensemble methods

Predicted values:
Categorical: predicted class is assigned by majority vote across bootstrap samples/OOB estimates; predicted probability of class membership is the proportion of bootstrap samples/OOB estimates in which the case receives that class membership as a predicted value.
Continuous: predicted value is the average predicted value across the bootstrap estimates/OOB observations.

Note: OOB = "out-of-bag." α|T| indicates the complexity penalty: the parameter α times the number of terminal nodes, |T|.
I hope that the preceding sections have provided a useful introduction to CART methods
and their ensemble extensions for categorical and continuous dependent variables. As a helpful
aid, Table C1 summarizes some of the main aspects of these techniques for categorical and
continuous outcomes, respectively.
Technical Appendix D
Everything That You Never Wanted to Know About Inverse Probability Weighting
There’s a funny thing that happens when one studies the literature on inverse probability
weighting. One quickly learns that there is no shortage of technical articles and volumes on the
subject, and that these technical treatises tend toward being arcane, abstruse, and not all that
helpful to the uninitiated, leaving readers (like myself) yearning for clearer, more straightforward
exposition of these concepts. And yet, whenever one tries to include clear, straightforward
descriptions of these details in a journal paper or chapter, the editor and reviewers inevitably
come back poo-pooing the material as obvious and trivial and unbefitting of a serious outlet.
Nonetheless, trivial though they may be, these simple derivations and insights can be helpful and
illuminating. So where do they belong? Where can they be archived? In a technical appendix in a
dissertation, that’s where.
The goal of this appendix is to clearly describe a number of basic but important aspects of
the probability weighting methods described in this dissertation. My hope is that others find this
material as useful and interesting as I do. I begin with a brief note on scaling weights to the
relative sample size. Then, I move on to a presentation of population weighting methods (both
unscaled and scaled) and extend these concepts to missing data weights.
A Note on Scaling Weights to the Relative Sample Size
In the main body of this dissertation, I mention a scaling procedure described by
Stapleton (2002, in a particularly clear and helpful article), in which a set of inverse probability
weights is rescaled to sum to the sample size, n. This is called summing to the relative sample
size. The procedure for doing this is simple, and involves dividing by the mean of the weights.
Thus, if the inverse weight for individual i is written w_i, then the relative weight is computed as:

w_i* = w_i / w̄

And it turns out that the sum of the relative weights equals the sample size, n. That is:

Σ_i w_i* = n

Why is this so, one might ask? A simple derivation makes this clear. First, let's rewrite
the sum of the relative weights in terms of the original weights and the mean of the weights, that is:

Σ_i w_i* = Σ_i (w_i / w̄)

Next, recall that the mean of the weights, w̄, is a constant. Thus, by the rules of
constants, this can be taken outside of the summation:

Σ_i (w_i / w̄) = (1/w̄) Σ_i w_i

Now, recall that the mean of the weights is simply the sum of the weights divided by the
number of weights, that is, w̄ = (Σ_i w_i)/n. Thus:

(1/w̄) Σ_i w_i = (n / Σ_i w_i) Σ_i w_i = n

This is why the relative sample size transformation works.

Note also that, once rescaled to the relative sample size, the weights have a mean of
unity. This is because:

(Σ_i w_i*)/n = n/n = 1
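The rescaling just derived amounts to dividing each weight by the mean weight; a minimal sketch:

```python
def rescale_to_relative(weights):
    """Rescale inverse-probability weights to the relative sample size:
    divide each weight by the mean weight, so the rescaled weights
    sum to n and have a mean of 1."""
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]
```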
Some Notes on Population Weighting
In the next sections, I will derive some basic ideas about population weights. Because
weighting methods grow out of large-scale survey methods, in which weights are applied to
attempt to adjust for non-representative sampling, these concepts are important to understand in a
fundamental way. Once established, this understanding is easily extended to the missing data
case.
Initial notation. At the outset, it is important to establish some basic terminology.
Assume there exists some finite population of size N, from which we draw a random sample of
size n. Thus the ratio of sample size to population size (a quantity that will be important later) is
n/N. Further, assume that the population is divided into some set of subgroups. These could be
demographic characteristics, such as identifying as male or female, or being African American,
Caucasian, Latino, Asian/Pacific islander, etcetera. Let the subscript i indicate persons and the
subscript j indicate groups, such that in the population there are i = 1, 2, … , N people and j = 1,
2, … , J groups (often called “strata” in the survey literature).
Further, let:
• N_j indicate the number of people in subgroup (strata) j in the population. Recalling that
N equals the total number of people in the population as a whole, we note that Σ_j N_j = N.
• n_j indicate the number of people sampled from subgroup (strata) j and n equal the total
number of people in the sample as a whole. Thus Σ_j n_j = n.
• Note, as mentioned, that if a simple random sample of n people is taken from a
population of size N, the probability of any one person being randomly sampled is n/N.
• Likewise, if n_j people are randomly sampled from the N_j individuals in population group
j, then the probability of any of these people being selected is p_j = n_j/N_j.
• As will be shown, normalized population weights sum to the sample size, n, and have a
mean of unity. Conversely, unnormalized population weights sum to the population size
N and have a mean weight of N/n.
Unnormalized population weights. Unnormalized sample weights involve ratios of
frequencies (actual numbers of people) in the sample and population, whereas normalized
weights involve proportions (probabilities). This section discusses unnormalized weights.
To aid our exposition, imagine a small company, in some alternative universe, that values
diversity and engages in scrupulously fair hiring practices. At this company, there are N = 100
employees, with 50 male employees and 50 female employees.¹ Now, let's say that
the company administers a workplace satisfaction survey to a random sample of 10 people from the
company (n = 10) and ends up with a sample of 7 males and 3 females. In this case,
the company has undersampled women relative to the population proportions and oversampled
men. One thing one can do in this situation, since the population is known, is use unnormalized
population weights, which can be thought of as a ratio of the units you measured to the units you hoped
that you had measured (cf. McArdle, 2013, for the measured vs. hoped terminology). In this case,
for each group, the unnormalized population weight is built by inverting the probability (really, proportion):

p_j = n_j / N_j

¹ I'm keeping things binary for the sake of this simple weighting example. However, in 2017, it
is undoubtedly more enlightened to say, imagine that 50 employees identify as male and 50
employees identify as female. This binary variable is still oversimplified, of course, but I hope
readers will bear with it for the sake of illustrating a set of statistical points.
Thus, in the example, the unnormalized probability for all men would be p_men = 7/50 and the
unnormalized probability for all women (yes, all women) would be p_women = 3/50. Note, this is technically
more of a set of proportions selected from the population groups than a true set of probabilities
(these p's don't sum to 1). In any event, the inverse of this probability is the weight for group j,
that is:

w_j = 1/p_j = N_j / n_j

In each group, this weight is assigned to all n_j people. Thus, the sum of weights in
each group involves summing the same weight for each person, w_j, a total of n_j
times. Thus, the within-group sum of weights equals n_j w_j. That is,

Σ_(i in j) w_j = n_j w_j

Recalling the definition of the weights, above, however, we see that:

n_j w_j = n_j (N_j / n_j) = N_j

Thus, with unnormalized weights, the sum of the weights assigned to a given group ends
up being the number of people in that group in the population, N_j. In the example, for men, we
have 7 × (50/7) = 50 and for women we have 3 × (50/3) = 50 (there are 50 men
and 50 women in the company).
The sum over the total sample, then, is:

Σ_j n_j w_j = Σ_j N_j = N

Thus, the overall sum of the unnormalized population weights is equal to the population
size, N.

The mean of these weights can be thought of as:

w̄ = (Σ_j n_j w_j)/n = N/n

So, the mean of the unnormalized inverse weights is the inverse of the ratio of elements
drawn from the population to the total size of the population. In the example, this is
100/10 = 10 (the population is 10 times bigger than the sample).

As mentioned above, dividing by the mean of the weights (or, equivalently, multiplying by n over the
sum of the weights) is one way to calculate the normalized weights:

w_j* = w_j / (N/n) = (N_j/n_j)(n/N)
As before. So this shows that the sum of the normalized weights is n and the mean of the
normalized weights is 1 when the unnormalized weights were constructed using sample and
population frequencies.
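The unnormalized and normalized population weights can be sketched together; the dict-based interface below is my own illustrative choice, not anything from the text:

```python
def population_weights(n_j, N_j, normalized=False):
    """Unnormalized weights w_j = N_j / n_j; summed over sampled
    persons, these reproduce the population size N. If normalized,
    rescale so the person-level weights sum to the sample size n."""
    w = {j: N_j[j] / n_j[j] for j in n_j}
    if normalized:
        n = sum(n_j.values())
        total = sum(n_j[j] * w[j] for j in n_j)  # equals the population N
        w = {j: w[j] * n / total for j in w}
    return w
```

In the running example (7 men and 3 women sampled from a 50/50 population of 100), the unnormalized weights are 50/7 and 50/3, and the normalized weights are 5/7 and 5/3, matching the derivations above.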
An alternative route to normalized population weights. Another way to understand
normalized population weights begins with a different set of formulas. Instead of defining the
probability for group j as the ratio of the number of elements we measured to the number of
elements that we hoped to measure, that is, n_j/N_j, we can form a ratio comparing the proportion of
elements we measured to the proportion of elements we hoped to measure. That is, by this new
definition:

p_j = (proportion measured) / (proportion hoped for)

Noting that the proportion of people in group j that we measured in our sample is n_j/n and
the proportion of people in group j who exist in the population is N_j/N, this works out to:

p_j = (n_j/n) / (N_j/N)

Note that even if we didn't know the true population N or N_j, we could form this
probability ratio if we knew the population proportion (e.g., if we knew there were 50%
identified males and 50% identified females in the population but didn't know the true Ns).
Returning to the running example, we have p_men = .7/.5 = 1.4 for males and p_women = .3/.5 = .6 for
females.

Then the inverse weights are:

w_j* = 1/p_j = (N_j/N) / (n_j/n)
Note that it is easy to see, however, that when the proportion of people sampled in the
dataset equals the population proportion, w_j* = 1, and the cases of individuals in this group
would each receive a unit weight in any subsequent analysis. By contrast, anytime the population
proportion of a given demographic group is greater than the proportion sampled, the ratio
(N_j/N)/(n_j/n) will be greater than 1, whereas when the proportion sampled is greater than the
population proportion, the ratio (N_j/N)/(n_j/n) will be less than 1.
Note also that, by the commutative property of multiplication, this can be rearranged to
show the result in a different way:

w_j* = (N_j/n_j)(n/N)

where n/N is the scaling factor we used when normalizing earlier. (The mean of the unnormalized
population weights was N/n, and dividing by this amounts to multiplying by the reciprocal, n/N.) So
we can already see that these weights are normalized.
In any case, the sum of the weights for individuals in group j then becomes:

Σ_(i in j) w_j* = n_j w_j* = n_j (N_j/n_j)(n/N) = n(N_j/N)

This quantity, n(N_j/N), has a special meaning: it is the number of people in the sample
that you would have measured if you had sampled in the same proportion as the population. In the example,
we had a sample n of 10 people. For men, n(N_men/N) = 10 × .5 = 5.
We should have sampled 5 men if men were to be 50% of our sample of 10 people. For women,
n(N_women/N) = 10 × .5 = 5. We also should have sampled 5 women.
(Note that if the population proportions of men and women differed, these two sums would also
differ.)
There is another, slightly back-door way to get to this same quantity. Let's say that we
already know the number of people we should have measured as a function of the population
proportion and our total sample size. For example, here we know that the population proportion
is ½ men and ½ women. So, for both men and women, n(N_j/N) = 10 × .5 = 5. We
should've measured 5. But we didn't. We got 7 men and 3 women. The question is: how do we
apply weights that sum to 5 for men and sum to 5 for women, so that the 7 men in the sample are
only counted as 5 and the 3 women are upweighted to also count as 5?

We want to set up an equation that says:

sum of weights in group j = n(N_j/N)

Recalling that the sum of the weights in group j equals n_j w_j, this equation says:

n_j w_j = n(N_j/N)

Solving for w_j, we get:

w_j = n(N_j/N) / n_j = (N_j/N) / (n_j/n)
This was the normalized population weight shown earlier.

So now we have two ways of thinking of this weight:
1. As the normalized population weight.
2. As the number of people we should have measured in the sample (sample n times
population proportion) spread across (divided by) the number of people we actually
measured (n_j).

Returning to business, the sum of the weights overall is:

Σ_j n_j w_j* = Σ_j n(N_j/N) = n Σ_j (N_j/N) = n

Thus, these already-normalized weights sum to the sample size.

The mean of the weights is then:

(Σ_j n_j w_j*)/n = n/n = 1

So the mean of the weights is unity, as before.

Table D.1 summarizes these various relationships and transformations:
Table D.1. Overview of Weighting Concepts with a 3-Group Example.

Rows of the table, computed for groups j = 1, 2, 3 and their sum:
Frequencies (sample): n_j
Proportions (sample): n_j/n
Frequencies (population): N_j
Proportions (population): N_j/N
Number should have measured in sample: n(N_j/N)
Probability Weighting in the Missing Data Case
In principle, it is straightforward to extend the basic logic of IPW to the missing data
case. If n denotes the full sample size, n_j indicates the number of individuals in some subgroup j
(e.g., all of the depressed individuals in the sample, or all older adults), some of whom drop out
of the study leaving missing data, and n_j(com) is the number of individuals who return to the
study and provide complete data, then the number of these people the researcher hoped to have
measured is everyone in subgroup j in the entire sample, that is n_j, the number of
people actually measured are those who did not drop out, that is n_j(com), and
the ratio of measured to hoped becomes:

p_j = n_j(com) / n_j

And

w_j = 1/p_j = n_j / n_j(com)

Thus it is easy to see that in the missing data case, the probability that we hope to invert
for subgroup j is the proportion of complete cases to total cases among individuals in subgroup j.

Now, here's the kicker: if we only sum over the complete cases:

Σ_(i in com, j) w_j = n_j(com) (n_j / n_j(com)) = n_j

We end up with the original sample size for group j. What if we sum all complete observations?

Σ_j n_j(com) w_j = Σ_j n_j = n

So, the complete data observations in each group sum to the total number of observations we
should have had in the group, and the sum of all complete data observations equals the full
sample size that we should have had.

This suggests that normalizing these weights might not actually be the right idea. Thank
God this observation is buried in an appendix. (Actually, the mean parameter estimates in a
weighted analysis should be invariant to this, but standard errors and coverage may be affected.)
However, the mean of the complete data weights is not equal to unity. Instead, it is:

(Σ_com w_j) / n_com = n / n_com

where n_com = Σ_j n_j(com) is the total number of complete cases. What if we normalized just the complete
data observations by dividing out the complete data mean of the weights?

w_j* = w_j (n_com / n)

Then the mean would be:

(Σ_com w_j*) / n_com = (n_com/n)(n/n_com) = 1

Although the utility of this approach is potentially dubious.
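The missing-data weights described in this section can be sketched as follows (again with an illustrative dict-based interface of my own):

```python
def missingness_weights(n_j, n_j_com):
    """Inverse probability-of-completeness weight for each subgroup j:
    w_j = n_j / n_j(com), the inverse of the observed completion rate."""
    return {j: n_j[j] / n_j_com[j] for j in n_j}
```

Summing these weights over the complete cases in group j recovers n_j, and summing over all complete cases recovers the full sample size n, as derived above.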
Conclusion
Although presented a bit informally, this appendix has detailed a number of important
results and nuances concerning inverse probability weights in both the population (survey)
weighting and missing data weighting cases. It is my hope that these derivations, though simple,
are interesting in their implications for the meaning of these weights under various
circumstances.
Technical Appendix E
Latent Growth Curve Models for Fun and Profit
How might the results of longitudinal randomized clinical trials, as described in Chapter
4, be evaluated? One class of models that has been embraced by researchers using randomized
longitudinal clinical trials is the latent growth curve model (also called the mixed-effects or multilevel
growth model; Meredith & Tisak, 1990; Raudenbush & Bryk, 2002). To illustrate, imagine a
clinical trial with 5 time points. Perhaps this represents measurements at baseline, 3 months, 6
months, 9 months, and 12 months. In this case, y might represent any outcome of interest, such
as scores on a depression inventory.
Assuming linear change over time, the trajectory for individual i at time t might be represented by the following equation:

$y_{it} = \beta_{0i}(\text{constant}) + \beta_{1i}\,\text{Time}_t + e_{it}$  (1)
In words, this equation says that individual i's score on y at time t is equal to a person-specific intercept (times a constant) plus a person-specific slope times time, plus a person-specific residual term. Assuming that time is centered to be zero at the first time point, e.g., time = {0, 1, 2, 3, 4}, this equation states that individual i's score at time 1 is simply equal to his or her intercept plus a residual, his or her score at time 2 is equal to his or her intercept plus one linear slope plus a residual, and so on.

In the multilevel modeling literature, equation (1) is referred to as a level-1 equation (Raudenbush & Bryk, 2002). Building on this foundation, the level-2 equations provide definitions for the meaning of the individual-specific intercepts and slopes. For the linear growth model of our hypothetical clinical trial, the level-2 equations are:
$\beta_{0i} = \beta_{0} + \beta_{2}D + u_{0i}$  (2)
And:
$\beta_{1i} = \beta_{1} + \beta_{3}D + u_{1i}$  (3)
In words, equation 2 states that each individual's intercept is equal to an overall or average intercept, $\beta_{0}$, plus the influence of dummy-coded experimental group (e.g., D is a dummy variable where 0 = placebo, 1 = treatment), plus a level-2 residual, $u_{0i}$. However, it is important to note that if time 1 measurements are baseline measures, taken before the treatment has been administered, it is reasonable to expect the $\beta_{2}D$ term to equal zero, indicating that experimental group should have no effect on initial levels that occur prior to the administration of the treatment. In words, equation 3 states that each individual's slope is equal to an overall or average slope, $\beta_{1}$, plus the influence of dummy-coded experimental group, plus a level-2 residual, $u_{1i}$.
Substituting (2) and (3) into equation (1) yields the composite notation:
$y_{it} = (\beta_{0} + \beta_{2}D + u_{0i})(\text{constant}) + (\beta_{1} + \beta_{3}D + u_{1i})\,\text{Time}_t + e_{it}$
$\phantom{y_{it}} = \beta_{0} + \beta_{1}\,\text{Time}_t + \beta_{2}D + \beta_{3}(D \times \text{Time}_t) + u_{0i} + u_{1i}\,\text{Time}_t + e_{it}$  (4)
Among other things, this composite notation makes clear that a main prediction of this model is a treatment x time interaction quantified by the $\beta_{3}(D \times \text{Time})$ term. This term quantifies the question of greatest interest in randomized clinical trials: are individuals' trajectories over time on the dependent variables altered by the administration of the experimental treatment? Multilevel growth models allow researchers conducting clinical trials to readily test this key hypothesis, while also providing important information about the average slope and intercept across individuals in the trial (so-called fixed effects) as well as individual variability around these average slopes and intercepts (so-called random effects).
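As a concrete, hypothetical illustration of the composite equation, the sketch below simulates data with an artificial treatment x time effect and recovers it with a simple two-stage analogue of the level-1 and level-2 equations (per-person OLS slopes, then a group comparison of those slopes). The effect sizes, variable names, and estimation shortcut are my own choices for illustration, not the simulation design or analysis models used in Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 500, 5
time = np.arange(T)            # {0, 1, 2, 3, 4}, centered at baseline
D = rng.integers(0, 2, n)      # 0 = placebo, 1 = treatment

# Illustrative population values: average intercept/slope, no baseline
# group difference (b2 = 0), and a treatment x time effect b3
b0, b1, b2, b3 = 20.0, -1.0, 0.0, -0.8
u0 = rng.normal(0, 2.0, n)     # person-specific intercept deviations
u1 = rng.normal(0, 0.5, n)     # person-specific slope deviations
e = rng.normal(0, 1.0, (n, T)) # level-1 residuals

# Composite equation (4): fixed part + random part + residual
y = (b0 + b2 * D + u0)[:, None] + (b1 + b3 * D + u1)[:, None] * time + e

# Stage 1 (level-1 analogue): estimate each person's slope by OLS over time
slopes = np.polyfit(time, y.T, deg=1)[0]

# Stage 2 (level-2 analogue): regress slopes on experimental group;
# with a 0/1 dummy this is just the difference in group mean slopes
b3_hat = slopes[D == 1].mean() - slopes[D == 0].mean()
print(b3_hat)  # should land near the true treatment x time effect of -0.8
```

A full mixed-effects fit would estimate the fixed and random effects simultaneously and more efficiently, but the two-stage version makes the logic of the level-1 and level-2 equations transparent.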
Figure E.1: Linear latent growth curve model with 5 time points and dummy-coded predictor
indicating placebo vs. treatment group membership (all unlabeled parameters are equal to 1)
The linear growth model just described is depicted in Figure E.1. In the figure, there are two latent factors (the two large circles) loading onto y at each time point via single-headed arrows. The I factor indicates a latent intercept, which loads onto the scores at each time point at unity (this is the constant in the equations above), and the S factor indicates a latent slope, which loads onto time points 1-5 with fixed values of 0, 1, 2, 3, and 4 – the values of time described above, centered at the first time point. Because both the slope and intercept factors are predicted (via a single-headed arrow) by the dummy-coded D, these factors have disturbances ($d_I$ and $d_S$, respectively). The disturbance variances indicate interindividual variability in individuals' latent intercepts and latent slopes that remains after the slopes and intercepts have been predicted by dummy-coded experimental group.
Abstract
Data mining algorithms such as Classification and Regression Trees (CART) and Random Forests provide a promising means of exploring potentially complex nonlinear and interactive relationships between auxiliary covariates and missing data. Recently, two CART-based missing data methods have been proposed. The first uses CART to create predicted probabilities of response and form data weights. The second uses CART to multiply impute the data. The three major papers comprising the body of this dissertation present four simulations designed to evaluate and compare these methods. In an initial set of two simulations (Chapter 2), I compare CART-based weighting methods to logistic regression weights and standard multiple imputation methods when a set of auxiliary variables is related to missingness via a variety of step-functional (tree-based) forms under MAR missing data. In a follow-up simulation (Chapter 3), I compare CART-based weighting and CART-based multiple imputation methods in both small and large sample sizes when the functional form of the MAR mechanism is smooth (i.e., linear, quadratic, cubic, interactive) and the data are nonnormal. The final simulation study (Chapter 4) compares the performance of these weighting and imputation methods in small sample longitudinal trial designs under MNAR missing data. Results suggest that CART-based weights help reduce parameter bias and increase coverage in small samples, but become inefficient in large samples. CART-based multiple imputation methods exhibited the reverse pattern, however, performing poorly in small samples and well in large ones in terms of both bias and efficiency. These results suggest that CART-based methods may have utility in addressing missing data, although future research is needed to ascertain the best way to extend CART-based weights to more complex instances of multivariate missing data.
Asset Metadata
Creator: Hayes, Timothy (author)
Core Title: Using classification and regression trees (CART) and random forests to address missing data
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Psychology
Publication Date: 03/07/2017
Defense Date: 03/03/2017
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: Cart, classification and regression trees, exploratory data mining, machine learning, missing data, OAI-PMH Harvest, random forests
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Wood, Wendy (committee chair), Becerik-Gerber, Burcin (committee member), Dehghani, Morteza (committee member), John, Richard (committee member), Oyserman, Daphna (committee member)
Creator Email: hayest@usc.edu, timothybhayes@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-346342
Unique identifier: UC11259060
Identifier: etd-HayesTimot-5124.pdf (filename), usctheses-c40-346342 (legacy record id)
Legacy Identifier: etd-HayesTimot-5124.pdf
Dmrecord: 346342
Document Type: Dissertation
Rights: Hayes, Timothy
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA