DISCRIMINATING CHANGES IN HEALTH USING
PATIENT-REPORTED OUTCOMES
by
Jae Kyung Suh
__________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(PHARMACEUTICAL ECONOMICS AND POLICY)
To my family with gratitude and love
ACKNOWLEDGEMENTS
I would like to convey my sincere thanks to my advisor, Jason N. Doctor. As an academic
advisor and a mentor, he supported, advised, and encouraged me throughout my graduate study.
I’ll bring his passion and enthusiasm for research with me on my next step of life.
I would also like to thank my thesis committee, Dr. Mike Nichol, Dr. Neeraj Sood, Dr.
Geoffrey Joyce, and Dr. Hyunsik Roger Moon. I’m grateful to Dr. Nichol and Dr. Sood for their
support and invaluable comments on my study. Special thanks to Dr. Joyce and Dr. Moon for
their fruitful feedback on my research ideas.
I gratefully appreciate Drs. Dennis Fryback, Robert Kaplan, Theodore Ganiats and their
team for providing me the cataract and heart failure data set.
I had the pleasure of sharing my time at USC with classmates and friends. I would like to
express gratitude and love to Sooin Bang and Jiat Ling Poon for not only discussing
research ideas, but also sharing our daily lives, even the tiniest worries. Special thanks to my
beloved friends, Younoh Kim, Susie Chung, Mingon Kim, and Jiyeon Lee, for standing by my
side and for supporting me.
My very special gratitude goes to my family for their endless support and trust. I love you
so much.
Last but not least, thank you my Lord for leading every step of my life.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1. INTRODUCTION
CHAPTER 2. THE BEHAVIORAL ECONOMICS OF THE MINIMALLY IMPORTANT DIFFERENCE
Abstract
2.1. Introduction
2.2. Background
2.2.1. Theoretical Basis of a Minimally Important Difference
2.3. Study 1
2.3.1. Overview
2.3.2. Methods
2.3.3. Results
2.4. Study 2
2.4.1. Overview
2.4.2. Methods
2.4.3. Results
2.5. Study 3
2.5.1. Overview
2.5.2. Methods
2.5.3. Results
2.6. Study 4
2.6.1. Overview
2.6.2. Methods
2.6.3. Results
2.7. Discussion
2.8. References
CHAPTER 3. COMPARATIVE STUDY OF THE PERFORMANCES OF GENERIC AND DISEASE-SPECIFIC MEASURES IN CATARACT AND HEART FAILURE PATIENTS
Abstract
3.1. Introduction
3.2. Methods
3.2.1. Subjects
3.2.2. Health Outcomes Measures
3.2.3. Responder Definitions
3.2.4. Statistical Analysis
3.3. Results
3.3.1. Cataract Patients
3.3.2. Heart Failure Patients
3.4. Discussion
3.5. References
CHAPTER 4. CHOICE OF SELF-RATED HEALTH MEASURES AND MORTALITY
Abstract
4.1. Introduction
4.2. Methods
4.2.1. Data
4.2.2. Measures
4.2.3. Data Analysis
4.3. Results
4.4. Discussion
4.5. References
CHAPTER 5. SUMMARY
APPENDIX
LIST OF TABLES
Table 2.1. Demographic characteristics
Table 2.2. Statistical results from Study 1
Table 2.3. Statistical results from Study 2
Table 2.4. Health states in Study 3
Table 2.5. Statistical results from Study 3
Table 2.6. Statistical results from Study 4
Table 3.1. Missing responses (%) at each time point
Table 3.2. ROC curve analysis in the cataract patients
Table 3.3. ROC curve analysis in the heart failure patients
Table 4.1. Demographic and social characteristics of the sample
Table 4.2. Distribution of global and age-comparative SRH measures by age groups
Table 4.3. Hazard ratios (HRs) of mortality associated with global and age-comparative SRH measures
Table 4.4. Hazard ratios (HRs) of mortality associated with global and age-comparative SRH measures by age
LIST OF FIGURES
Figure 2.1. Logarithmic utility function
Figure 3.1. Comparisons of areas under the ROC curves in the cataract patients
Figure 3.2. Comparisons of areas under the ROC curves in the heart failure patients
ABSTRACT
Patient-reported outcome (PRO) measures are incorporated in health care decision
making not only when too few objective health measures are available but also to consider
what is important to patients about their health and interventions. For instance, to measure the
impact of an intervention in a clinical trial, PRO instruments usually ask patients to compare
their health after the intervention with their health before the intervention, which serves as an
anchor. If patients’ health has changed, patients endorse “better” or “worse” health for their
current health. PRO measures should be able to detect these changes. This three-paper
dissertation draws on three different models that discriminate changes in health based on PRO
measures.
Paper 1 introduces a model that discriminates hypothetical health stimuli. Investigating the
pattern of discriminating one health condition from another can shed light on how the minimally
important difference works. Paper 2 also uses a person-based anchor, but in empirical data. A
model tests the overall performance of generic and disease-specific instruments in terms of how
well the instruments distinguish patients who experienced a change in health from those who did
not. By suggesting a method to find an instrument that can identify
patients who have or have not improved relative to their previous health, this approach can be useful in
selecting an appropriate instrument. Paper 3 presents a model that tests whether self-rated health
can be differentiated from an external anchor, risk of death. Using a single PRO measure of
self-rated health, this model can help to achieve a more complete understanding of the predictive
quality of self-rated health.
CHAPTER 1. INTRODUCTION
Conventionally, in the health care decision-making process, “physician-centered” clinical
endpoints, such as A1C levels of diabetes patients or visual acuity of patients with visual
impairment, have been used as outcome measures. However, these measures were
sometimes limited in their ability to tell whether a patient was satisfied with an intervention or whether a patient
was able to enjoy his/her life after an intervention. For example, a patient with visual impairment
may experience a range of visual deficits, which may also cause difficulties in daily life such as
reading a book, driving or even socializing with other people. If the patient takes a treatment
resulting in visual rehabilitation, then the impact of the treatment will not be limited to visual
acuity. Rather, the impact will be translated into a cascade of improvements in his/her daily life,
which is difficult to measure only with physician-centered outcomes. The effectiveness of the
treatment can be evaluated more comprehensively if the patient’s quality of life after vision
improvement is compared against his/her quality of life with visual impairment. In contrast to
clinical or physical measurements, health or health-related quality of life is subjective and therefore
not easy to quantify. Thus, the need to develop “patient-centered” outcome measurements arose
and studies on these measurements have increased over the last several decades.
Patient-reported outcomes (PROs) are one of the patient-centered outcome measures.
PRO is “a measurement based on a report that comes direct from a patient about the status of a
patient’s health condition” (FDA, 2009). Thus, PROs can provide patients’ perspective on
treatment benefit beyond survival or clinical indicators. To measure PROs, health instruments
have been developed and applied. There are generally two types of health instruments: generic
and disease-specific measures. Generic measures cover overall health by capturing many aspects
of health. Due to this nature of the measures, the scores obtained from these measures can be
compared across different disease populations to support health professionals in decision making.
In contrast, disease-specific instruments focus on collecting disease-related health problems from
a specific population with a given medical condition or disease. Using these health measures,
PROs have been evaluated in clinical or policy studies (Bergner, 1985; Ware et al., 1981) and
recently, the importance of PROs was further highlighted by the Patient-Reported Outcomes
Measurement Information System (PROMIS) which is a program funded by the National
Institutes of Health (NIH) in 2004 (Cella et al., 2007), the guidance issued by the US Food and
Drug Administration (FDA) in 2009 (FDA, 2009), “Guidance for industry: Patient-reported
outcomes (PROs) measures to support claims in product labeling”, and the federal government’s
establishment of the Patient-Centered Outcomes Research Institute (PCORI) in 2010 (Clancy
and Collins, 2010).
To be useful in health care decision making, PRO measures should have acceptable
reliability and validity. Reliability and validity are separate psychometric properties. Reliability
refers to the ability of a measure to yield consistent and stable scores over time when the
construct being measured has not changed. Validity is the degree to which instruments
adequately measure intended concepts and do not measure unintended concepts. Responsiveness
is the ability of a measure to detect change and is usually determined by differences in scores over
time in individuals or groups who have changed with respect to the measurement concepts.
Although FDA guidelines about PRO measures separated responsiveness from validity, some
argued that responsiveness is also an aspect of construct validity, because detecting change
means that instruments are able to measure the underlying concepts in a longitudinal manner
(Hays and Hadorn, 1992; Revicki et al., 2008).
In order to measure the impact of an intervention based on a PRO measure, it is essential
to detect change in health. Thus, patients are usually asked to compare their current health with
their health before the intervention. Researchers are interested in whether or not patients think
their health has improved and if so, how much it has improved. When patients think their current
health has improved, this implies that the patients discriminate current health from previous
health and endorse “better” health at their current health via pairwise comparison. This pairwise
comparison is the basis of Thurstone’s model for the law of comparative judgment. The law of
comparative judgment is an approach to measuring perceived intensity of physical stimuli,
attitude, choices and preference (Thurstone, 1927a). Luce later proposed an advanced model, the
Bradley-Terry-Luce model, which applies pairwise comparison data to scale preference
(Bradley and Terry, 1955; Luce, 1959). The basic idea of the model is that a collection of
stimuli (e.g. health states) can be scaled on an underlying continuum using a series of paired
comparisons (Thurstone, 1927a, b). For instance, given a set of health states, the model assumes
that each state possesses some attributes, and that a preference exists for each of the health states
and for all individuals. Then, individuals can make a choice when health states are shown in a
pair, and from a series of pairwise comparisons health states can be scaled.
Purpose of the Study
PROs are frequently incorporated in health care decision making to consider what is
important to patients about their health and interventions. To measure the impact of an
intervention on patients’ health, patients are often asked to compare two health states, for
instance, before and after the intervention. When comparing two health states, patients consider
one health state as an anchor and distinguish the other state from the anchor. If patients’ health
has improved, a PRO measure has to be able to detect this change. This three-paper dissertation
introduces three models that discriminate changes in health using PRO measures. Models are
developed based on different types of health stimuli. Paper 1 introduces a model that
discriminates hypothetical health conditions. By testing a theoretical model that explains the
smallest important difference that can distinguish two health states, we can reach a more
complete understanding of the minimally important difference. Paper 2 presents a model in
empirical data from cataract and heart failure cohorts. Evaluating the overall performance of generic
and disease-specific instruments in terms of how well the instruments distinguish patients
experiencing a change in health from those without change can help researchers select an
instrument. Both Paper 1 and Paper 2 use person-based anchors to make a decision when two
health stimuli are given. In contrast to Papers 1 and 2, Paper 3 uses an external anchor. Paper 3
introduces a model that examines whether current health can be discriminated from a future
health outcome, mortality. Even when subjects evaluate their current health without any anchor
state, self-rated health was able to discriminate risk of death. Testing the predictive power of
self-rated health with different reference points can help researchers understand the relationship
between a PRO measure that captures a subjective health outcome and an objective health
outcome, mortality.
CHAPTER 2. THE BEHAVIORAL ECONOMICS OF THE
MINIMALLY IMPORTANT DIFFERENCE
Abstract
Background: A minimally important difference (MID) identifies the change in a
health outcome measure necessary for a patient to discriminate an improvement. When
behavioral economic theory affects decisions using MIDs, this leads to specific functional forms
of utility (logarithmic or linear). We tested the type of utility function (logarithmic vs. linear)
underlying MIDs using behavioral economic theory.
Methods: Four studies were performed using different types of health outcomes as stimuli, such
as body weights, recovery periods from influenza, and pairs of health state and survival duration.
Logarithmic utility predicts a constant MID over proportional changes to health outcomes.
Linear utility predicts a constant MID when a fixed amount of health is added to the outcomes in
the choice. One-way or two-way repeated-measures analysis of variance (ANOVA) was
performed to test these different models.
Results: A total of 98, 128, and 119 subjects completed Studies 1a, 1b, and 1c,
respectively. Across the three sub-studies we could not reject the proportional model with
confidence. The linear model, however, was rejected. In Study 2, 117 subjects were included in
the analysis. Although the proportional model was rejected, there appeared to be a trend of
following the proportional model. The alternative linear model in Study 2 was rejected. In Study
3, a total of 97 subjects completed the survey. There were no significant differences among mean
MIDs of three survival durations in each health state. Study 4, in which 117 subjects were included,
rejected the logarithmic model but showed equivocal results with respect to the linear model.
Conclusion: We found that MIDs were likely to be constant over proportional changes in most
cases, indicating that the utility function of the MID is logarithmic. However, MIDs might
behave differently with some health outcomes. Care should be taken when the MID is considered
as a normative policy decision tool.
2.1. Introduction
In recent years, the minimally important difference (MID) has gained prominence in
health outcome and health economic research. Numerous studies have sought to measure score
differences on outcome measures that constitute an MID (Beaton et al., 2001; Deyo and Inui,
1984; Guyatt et al., 2002; Hageman and Arrindell, 1991; Juniper et al., 1994; Osoba et al., 1998).
They have reported MIDs in randomized clinical trials by comparing treatment and control
groups (Cella et al., 2005; Eton et al., 2006) and also in observational studies (Badia et al., 2004;
Khanna et al., 2007). The MIDs were also evaluated to allocate resources or to make medical
decisions using different types of instruments such as generic (Kaplan, 2005; Walters and
Brazier, 2005) or disease-specific instruments (Kupferberg et al., 2005; Kwok and Pope, 2010).
The MID is defined as “the smallest difference in score in the domain of interest which patients
perceive as beneficial and which would mandate, in the absence of troublesome side effects and
excessive cost, a change in the patient’s management” (Jaeschke et al., 1989). There are two
general approaches to estimating the MID: distribution-based and anchor-based methods (Lydick and
Epstein, 1993; Revicki et al., 2006). A distribution-based approach explains MIDs based on the
statistical characteristics of the sample, so this method does not provide direct information on
the MIDs (Crosby et al., 2003). While the distribution-based approach is recommended as a
supportive method for estimating MIDs, an anchor-based approach is used as the primary method
(Hays et al., 2005; Revicki et al., 2008). Anchor-based approaches explore the associations
between scores from PRO instruments and an anchor to estimate MIDs. For the anchors,
patient-perceived change or empirical evidence is often used (Lydick and Epstein, 1993). The
MIDs from the anchor-based approach can be determined longitudinally or cross-sectionally
(Crosby et al., 2003). The conventional approach to evaluating MIDs is longitudinal,
where patients are asked to discriminate health between two time points such as before and after
a treatment. Since, in this method, patients have to recall their previous health to make their
judgments on their current health, this may be susceptible to response shift and recall bias (Ross,
1989). In other words, patients may give different answers over time, not only because their
health has changed due to a treatment, but also because their cognition or perception of health
has changed. Accordingly, this may mask the treatment impact. A cross-sectional approach,
however, estimates MIDs based on between-subjects differences. The MIDs are evaluated by
having patients evaluate their health in relation to others with the same condition (e.g., Do you
think your health is better than those who have the same health problems as you?). Redelmeier and
his colleagues proposed this method and argued that it provided MID
estimates comparable to those of the conventional one (Redelmeier et al., 1996). Unlike the conventional
approach, this can avoid the problems of response shift and recall bias (King, 2011).
There are different ways to interpret the above definition of MIDs. Normative economics
predicts that sensitivity to a commodity like health is determined by its intrinsic value. In
contrast, behavioral economics predicts that sensitivity to a commodity is influenced by the
psychology of perception. If the former determines an MID, then MIDs indicate purely intrinsic
differences in value of health changes. However, if the latter affects an MID, then MIDs reflect,
in part, human cognitive and perceptual limitations in making judgments about value differences.
For health improvements, both theories predict a concave utility. But it is possible to distinguish
them. The normative theory explains concavity through decreasing marginal utility whereas the
behavioral theory explains concavity through “diminishing sensitivity”: changes in a variable
are harder to discriminate the farther the variable is from zero (e.g., 1 lb. and 2 lbs.
weights are discriminable, but 51 lbs. and 52 lbs. are less so). The behavioral theory predicts that
discrimination of a quantity is governed by Weber’s Law: If a quantity is increased by some
factor, the threshold for a minimally important difference also increases by this factor. This leads
to a precise logarithmic function for quantifying outcomes (a proof deriving the logarithmic
function is detailed in the Appendix). While a logarithmic utility function has been appropriate
for some commodities (Keeney and Raiffa, 1993), it has been rejected for health utility
(Miyamoto, 1999; Pliskin et al., 1980). Thus, by testing a logarithmic functional form for MIDs
we can develop a better understanding of what accounts for MIDs, i.e. whether MIDs follow a
behavioral or classical economic theory. To accomplish this we need to establish the behavioral
foundations for such a parametric representation.
Decisions using MIDs imply something about the value of health in the sense that
policies that maximize MIDs reflect some underlying preference rule (e.g., logarithmic utility).
This preference rule will affect patients’ lives in ways that may or may not accord with their
actual preferences. This paper is an effort to achieve a more complete understanding of the MID
measure and its implications with respect to health policy through both analysis and
experimentation. The analysis portion in Section 2.2 (Background) presents the derivation of the
conditions for an implied utility function from discrimination data. The experimental portion in
the Methods and Results sections of each study involves testing those conditions.
2.2. Background
2.2.1. Theoretical Basis of a Minimally Important Difference
Let H be a health continuum and let P be an index of discrimination. For example, for health
states a, b in H, we will say that P(a, b) is the probability that a respondent answers ‘yes’ to the
question “Has [your/a person’s] health, b, improved from when it was a?”. An MID is often
taken as the difference between a and b on some health measure when P(a, b) equals 0.5. If in
discriminating a health improvement the respondent is comparing the value of prior health to
subsequent health, then the problem of representing health values through MIDs amounts to
finding if a scale U exists such that
P(a, b) = P(a′, b′) if and only if U(a) – U(b) = U(a′) – U(b′). (1)
Using discrimination indices to identify scales has a long history (Fechner, 1860; Thurstone,
1927a, b). The problem has been studied in many subsequent papers (Falmagne, 1971; Luce et
al., 1963). If there are behavioral foundations for the function U that can be identified, then the
implications of U in policy making become clear as U will imply a certain risk posture toward
medical treatments and may suggest rules for combining multi-attribute health preferences.
One well-studied behavioral foundation of discrimination is Weber’s Law which states
that P satisfies the following equation:
P(λa, λb) = P(a, b) (for a, b, λ > 0) (2)
Equation 2 requires that discrimination probabilities are equivalent when the comparison
between a and b changes by a constant proportion, λ. The condition is very much like that of
constant proportional time tradeoff (TTO) in the quality-adjusted life year (QALY) literature
because it posits invariance over proportional changes (Doctor and Miyamoto, 2003). To
understand the implications of this condition, imagine a health continuum such as body weight
and the clinical problem of identifying a minimally important weight loss to improve health.
Imagine that your initial weight is 150 lbs. and on a diet you subsequently lose 5 lbs. resulting in
a weight of 145 lbs. Suppose also that given your initial weight of 150 lbs. a change of 5 lbs. is
the smallest weight change that is important to you. Now suppose instead of weighing 150 lbs.
you weigh 300 lbs. and lose 5 lbs. The difference between 300 lbs. and 295 lbs. may not be
critical. Equation 2 would predict this result, because it states that if 5 lbs. of body weight loss,
going from 150 lbs. to 145 lbs., is a minimally important difference then doubling one’s initial
weight would also require doubling the reduction in weight necessary for it to be discriminated
as minimally important. In other words, if losing 5 lbs. at 150 lbs. is minimally important, then at
300 lbs. losing 10 lbs. is minimally important. Equation 2 also underlies many widely used
discrimination scales that have become commonplace in everyday life. Examples include 1) the
Richter scale for seismic activity, 2) the magnitude scale for the brightness of stars, and 3) the decibel
scale for loudness (Uttal, 1973). When Equation 2 holds, U in Equation 1 takes a specific
parametric form, allowing us to understand the policy implications of MIDs. More detailed
derivations and theoretical results are reported in Appendix A.
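As a quick check of the easier direction only (the full derivation is left to Appendix A), the sketch below verifies that a logarithmic U satisfies Equation 2 through Equation 1. The constants k and c are illustrative assumptions introduced here; they are not named in the text.

```latex
% Illustrative check; k > 0 and c are arbitrary constants assumed for this sketch.
\begin{align*}
U(x) &= k \ln x + c, \qquad k > 0, \\
U(\lambda a) - U(\lambda b) &= k \ln(\lambda a) - k \ln(\lambda b)
  = k \ln\frac{a}{b} = U(a) - U(b).
\end{align*}
% By Equation 1, equal utility differences imply equal discrimination
% probabilities, so P(\lambda a, \lambda b) = P(a, b), which is Equation 2.
% Weight example from the text: \ln(150/145) = \ln(300/290), so losing 5 lbs.
% at 150 lbs. is as discriminable as losing 10 lbs. at 300 lbs.
```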
There is, however, a simple alternative to the proportional model: the linear utility model.
Imagine the same case as described above. Under the linear model, if going from your
initial weight of 150 lbs. to 145 lbs. is the smallest meaningful improvement, then going from 300 lbs. to 295 lbs.
is also a minimally important improvement. Thus, the minimally important reduction in weight is
constant at 5 lbs. whether the initial weight is 150 or 300 lbs. We tested both the linear and
proportional models in Study 1. Detailed derivations of the models are given in Appendix A.
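The contrast between the two models can be made concrete with the body weight example. The following is a minimal sketch, not the analysis used in the studies; the function names and arguments are hypothetical and chosen only for illustration.

```python
def mid_proportional(reference_weight, reference_mid, new_weight):
    """Proportional (Weber's Law) model: the MID scales with the starting weight."""
    return reference_mid * new_weight / reference_weight


def mid_linear(reference_weight, reference_mid, new_weight):
    """Linear model: the MID is the same fixed amount at every starting weight."""
    return reference_mid


# Example from the text: a 5 lbs. loss is minimally important at 150 lbs.
print(mid_proportional(150, 5, 300))  # 10.0 lbs. predicted at 300 lbs.
print(mid_linear(150, 5, 300))        # 5 lbs. predicted at 300 lbs.
```

The two models agree at the reference weight and diverge as the starting weight moves away from it, which is what allows the studies to tell them apart.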
This paper addresses the question “How do MIDs value outcomes?” We consider two cases:
1) when the health measure takes its values from sets such as the real numbers, e.g., body weight
(Studies 1 and 2), and 2) when the health measure is constituted by health states and time frames
(e.g., “10 years with back pain”) that are commonly used to estimate QALYs in cost-effectiveness
analysis (Studies 3 and 4). Cases 1 and 2 facilitate a derivation of U in Equation (1), because
improvements are real valued, strictly increasing and continuous.
2.3. Study 1
2.3.1. Overview
In Study 1, we examined the MID utility function using the proportional and linear models
when health outcomes consisted of body weights. Three sub-studies were conducted with
different sets of body weights. Subjects were asked to assume they were in a diet program
and to discriminate the clinically important improvement in weight loss after the
program.
2.3.2. Methods
Research Design
In each sub-study, three weight stimuli were examined (hereafter, “starting weights”): 28,
34, and 40 lbs. in Study 1a; 29, 36, and 43 lbs. in Study 1b; and 30, 36, and 42 lbs. in Study 1c.
For the design of the questions, we defined a second weight stimulus, the “ending weight”.
Subjects compared a starting weight to an ending weight and decided whether the body weight
loss, that is, the difference between the starting and ending weights, was clinically significant.
The ending weight in each question was determined by a bisection procedure, which begins at
half of the starting weight and changes the direction and size of each step according to the
subject’s previous response. The procedure obeys the following rules:
1. On every step, halve the step size.
2. The first step size is half of the starting weight, and the ending weight of the first
question is the starting weight minus the first step size.
3. From the second step onward, the ending weight is obtained by adding the current step size
to, or subtracting it from, the preceding ending weight.
4. The direction of the step is determined by the subject’s previous response. If the subject
answers “yes”, the next ending weight increases; if the subject answers “no”, the
direction is reversed and the next ending weight decreases.
For example, when the starting weight was 150 lbs., the first ending weight would be 75 lbs.
(= starting weight – first step size = 150 – ½*150). If the subject responded that a body weight
loss of 75 lbs. was significant, then the ending weight for the subsequent question would be
112.5 lbs. (= previous ending weight + second step size, in the increasing direction = 75 + ½*75).
If the subject’s answer was no, the next ending weight for this stimulus would be 37.5 lbs.
(= 75 – ½*75). The procedure continued until the starting weight and ending weight had a 50%
chance of being considered clinically important. This random staircase procedure, in which
adjacent questions in the sequence varied between stimuli, was implemented so that subjects
could not fix the smallest meaningful improvement in mind and use it to guide their choices in
the studies (Bleichrodt et al., 2005; Bostic et al., 1990).
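The bisection rules above can be sketched in a few lines. This is a minimal illustration, not the survey software used in the study: the function name `bisection_endings` is hypothetical, and stopping after a fixed list of responses is an assumption made here in place of the study's 50% stopping criterion.

```python
def bisection_endings(starting, responses):
    """Return the sequence of ending values shown to one subject.

    starting:  the starting weight for this stimulus.
    responses: booleans, True meaning "yes, the change is important".
               Using a fixed-length response list is an assumption of this sketch.
    """
    step = starting / 2.0        # rule 2: the first step is half the starting weight
    ending = starting - step     # rule 2: first ending weight
    endings = [ending]
    for said_yes in responses:
        step /= 2.0              # rule 1: halve the step size on every step
        # rule 4: "yes" raises the ending weight, "no" lowers it
        ending = ending + step if said_yes else ending - step
        endings.append(ending)   # rule 3: built from the preceding ending weight
    return endings


print(bisection_endings(150, [True]))
print(bisection_endings(150, [False]))
```

With a starting weight of 150 lbs., a first "yes" gives the 75 and 112.5 lbs. sequence from the worked example, and a first "no" gives 75 and 37.5 lbs.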
Web Service Used to Obtain Survey Responses
We utilized Amazon Mechanical Turk to recruit survey respondents. Amazon
Mechanical Turk is an Amazon Web Service that enables “Requesters” (researchers,
programmers and corporations) located in the United States (U.S.) to use human intelligence to
perform tasks. The Requesters are able to pose these tasks, known as Human Intelligence Tasks
(or HITs), such as answering a survey or evaluating a website for a selling price. Workers can
then browse among existing tasks and complete them for a small monetary payment.
Workers may be located anywhere in the world. However, we restricted workers to those
with internet addresses within the United States. Payments for completing tasks were transferred
to a worker's U.S. bank or Paypal account. Requesters pay 10 percent over the price of
successfully completed HITs to Amazon. For example, if a requester sets a price of US$1 to
complete the survey, the total cost per subject is $1.10 because of Amazon’s 10% fee. By
using Amazon Mechanical Turk, we were able to obtain about 100 responses within a few hours of
posting our requests.
Although the biggest advantages of Mechanical Turk are that data can be obtained
relatively quickly and inexpensively, these advantages lead to concerns about the quality of
the data. Possible reasons for low-quality work are the validity of workers’ behavior and the
low wages paid to workers. However, there is evidence that outputs by Mechanical Turk workers
are similar to those of experts or off-line subjects (Alonso and Mizzaro, 2009; Birnbaum, 2000).
In addition, Mason and Watts found no effect of wages on the quality of work (Mason and Watts,
2010). Economists have also used Amazon Mechanical Turk to replicate classic findings in
experimental economics, suggesting that Mechanical Turk provides valid data for economic
studies (Horton et al., 2010). Nevertheless, we added an attention question so that we could
filter out subjects who did not pay attention to the survey.
Subjects
From Amazon Mechanical Turk, we recruited 100 subjects to participate in Study 1a.
Similarly, 140 and 120 subjects were recruited to Study 1b and 1c, respectively. Subjects were
excluded from analyses if they participated in the same survey twice or more, if they did not
complete the survey, or if they failed to answer the attention question. We set our price at
US$0.50 to complete the survey of 20 questions including five demographic questions. No
identifiable protected health information was collected.
Questionnaire
After accepting the Mechanical Turk HIT, subjects were directed to an online survey.
The format of the question in the survey was as below:
“Imagine you are X lbs. overweight compared to the standard body weight and undergoing a
weight loss treatment. After the treatment, if you lose Y lbs., do you think this is clinically
significant?”
X indicates the starting weight and Y, the ending weight. That is, X took its values from 28, 34, or
40 lbs. in Study 1a, 29, 36, or 43 lbs. in Study 1b, and 30, 36, or 42 lbs. in Study 1c. Based on the
subject’s response, the value of Y for the next question was determined by the bisection
procedure. The final weights, which were the ending weights when the starting and ending
weights had a 50% chance of being considered clinically important, were collected. For analysis,
the ratios of the final weights to their corresponding starting weights were compared in the
logarithmic model. In the linear model, the differences between the final weights and their
corresponding starting weights were computed and compared across the three stimuli.
Statistical Analysis
To assess the significance of the differences in MIDs, a one-way repeated-measures analysis of
variance (ANOVA) with post hoc testing using Scheffe’s multiple comparison method was
carried out. If the sphericity assumption was violated, an adjusted univariate analysis
(Greenhouse-Geisser epsilon (G-G) adjusted F-test) was applied. Grand means of the MIDs across
starting weights were also obtained. The statistical analysis was conducted using SAS 9.1 (SAS
Institute, Cary, North Carolina) and the statistical software R (http://www.r-project.org/).
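The repeated-measures F-test and the Greenhouse-Geisser correction can be illustrated with a minimal sketch. The function name `rm_anova_gg` and the toy data are hypothetical; the analyses in this study were run in SAS and R, not with this code. The sketch computes the unadjusted F statistic and an epsilon estimate from the double-centered sample covariance matrix. The adjusted test multiplies both degrees of freedom by epsilon before the p-value is looked up, a step omitted here.

```python
def rm_anova_gg(data):
    """One-way repeated-measures ANOVA on a list of per-subject rows.

    data: n rows (subjects), each with k values (conditions).
    Returns (F, df1, df2, epsilon), where epsilon is the Greenhouse-Geisser estimate.
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    # Partition the total sum of squares into condition, subject, and error parts.
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_cond - ss_subj
    df1, df2 = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_cond / df1) / (ss_err / df2)

    # Sample covariance matrix of the k conditions.
    cov = [[sum((row[a] - cond_means[a]) * (row[b] - cond_means[b]) for row in data) / (n - 1)
            for b in range(k)] for a in range(k)]
    # Double-center the covariance matrix, then apply the epsilon formula.
    row_mean = [sum(cov[a]) / k for a in range(k)]
    all_mean = sum(row_mean) / k
    centered = [[cov[a][b] - row_mean[a] - row_mean[b] + all_mean for b in range(k)]
                for a in range(k)]
    trace = sum(centered[a][a] for a in range(k))
    sum_sq = sum(v * v for r in centered for v in r)
    epsilon = trace ** 2 / ((k - 1) * sum_sq)
    return f_stat, df1, df2, epsilon


# Toy data: 3 hypothetical subjects, 2 conditions.
print(rm_anova_gg([[1, 2], [2, 4], [3, 3]]))
```

Epsilon lies between 1/(k - 1) and 1; a value of 1 means that no correction for sphericity is needed.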
A power analysis is useful when the model tested holds under the null hypothesis.
Maxwell and Delaney (1990) provided guidelines for determining the sample size needed to
obtain a given statistical power in repeated-measures designs. We let d denote the effect size
in the study, defined to be
\[ d = \frac{\mu_{\max} - \mu_{\min}}{\sigma} \]
where $\mu_{\max}$ and $\mu_{\min}$ are the largest and the smallest mean MIDs among the three
starting weights and $\sigma$ is the standard deviation in the study. The smallest correlation
among the responses to the stimuli must also be identified. In each sub-study we have three
stimuli, that is, three starting weights, and thus three pairwise correlations to consider; the
smallest of these is specified. To achieve an 80% chance of detecting a small effect (d = 0.25)
with a minimum correlation of 0.7 at a significance level of 0.05, the table by Maxwell and
Delaney indicated that 65 subjects were needed.
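As a small illustration of the effect size used in the power analysis, the sketch below computes d as the range of the condition means divided by the standard deviation. The function name and the numbers are hypothetical and were chosen only so that d equals the small effect of 0.25 mentioned above; they are not study data.

```python
def effect_size(means, sd):
    """Effect size d: (largest mean - smallest mean) / standard deviation."""
    return (max(means) - min(means)) / sd


# Hypothetical mean MIDs for three starting weights and a common SD.
d = effect_size([0.30, 0.32, 0.34], sd=0.16)
print(round(d, 2))
```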
2.3.3. Results
A total of 98 subjects completed Study 1a. In Study 1b, 128 subjects completed the
survey while nine subjects who did not complete the survey and three who applied to the survey
twice were excluded. In Study 1c, a total of 119 subjects were included in the analysis, after
excluding one who applied to the survey twice. The demographics of the participants for Study 1
are shown in Table 2.1. In Study 1, the participants tended to be female and White/Caucasian.
The majority of the participants were aged 20-39 years, and most had a college degree or a high
school education or less.
Table 2.1. Demographic characteristics

                        Study 1a      Study 1b      Study 1c      Study 2 & 4   Study 3
N                       98            128           119           117           98
Sex, n (%)
  Female                65 (66.3%)    89 (69.5%)    73 (61.3%)    63 (53.9%)    62 (63.3%)
  Missing               3 (3.1%)      0 (0%)        0 (0%)        0 (0%)        0 (0%)
Age, n (%)
  19 and under          3 (3.1%)      6 (4.7%)      2 (1.7%)      7 (6.0%)      4 (4.1%)
  20-29                 37 (37.8%)    39 (30.5%)    44 (37.0%)    48 (41.0%)    48 (49.0%)
  30-39                 26 (26.5%)    32 (25.0%)    30 (25.2%)    28 (23.9%)    17 (17.3%)
  40-49                 18 (18.4%)    27 (21.2%)    22 (18.5%)    12 (10.3%)    13 (13.3%)
  50-59                 11 (11.2%)    20 (15.6%)    17 (14.3%)    13 (11.1%)    8 (8.2%)
  60 and over           3 (3.1%)      4 (3.1%)      4 (3.4%)      9 (7.7%)      8 (8.2%)
Race/Ethnicity, n (%)
  White/Caucasian       81 (82.7%)    106 (82.8%)   99 (83.2%)    85 (72.6%)    74 (75.5%)
  African American      7 (7.1%)      12 (9.4%)     4 (3.4%)      14 (12.0%)    7 (7.1%)
  Asian                 6 (6.1%)      4 (3.1%)      10 (8.4%)     7 (6.0%)      12 (12.2%)
  Others                4 (4.1%)      6 (4.6%)      6 (5.0%)      11 (9.4%)     5 (5.1%)
Education, n (%)
  ≤ High school         29 (29.6%)    48 (37.5%)    38 (31.9%)    43 (36.8%)    39 (39.8%)
  College degree        50 (51.0%)    57 (44.5%)    59 (49.6%)    61 (52.1%)    44 (44.9%)
  ≥ Master's degree     16 (16.3%)    22 (17.2%)    22 (18.5%)    12 (10.3%)    15 (15.3%)
  Missing               3 (3.1%)      1 (0.8%)      0 (0%)        1 (0.9%)      0 (0%)
Income, n (%)
  < $10,000             16 (16.3%)    17 (13.3%)    26 (21.8%)    28 (23.9%)    21 (21.4%)
  < $40,000             41 (41.8%)    59 (46.1%)    44 (37.0%)    48 (41.0%)    39 (39.8%)
  < $70,000             24 (24.5%)    33 (25.8%)    27 (22.7%)    21 (18.8%)    27 (27.6%)
  < $100,000            9 (9.2%)      13 (10.2%)    14 (11.8%)    12 (10.3%)    8 (8.2%)
  ≥ $100,000            8 (8.2%)      6 (4.7%)      8 (6.7%)      7 (6.0%)      3 (3.1%)
The results of Study 1 are summarized in Table 2.2. The mean MIDs of the three starting
weights (28, 34, and 40 lbs.) in Study 1a were 0.35, 0.34, and 0.33, respectively, and the
differences were not statistically significant (p = 0.066). Scheffe’s post hoc assessment
showed that there were no significant differences between any two mean MIDs. The grand mean
for the three starting weights was 0.34 (95% Confidence Interval, CI: 0.32, 0.35) and the means
of the three starting weights lay within the CI of the grand mean. Study 1b yielded similar
results to Study 1a. The means were 0.33, 0.31, and 0.31 for the starting weights of 29, 36,
and 43 lbs., respectively, and no differences among the means were found, indicating that the
probability of discrimination was constant over proportional changes in body weights
(p = 0.071). The means of the three starting weights were all within the 95% CI of the grand
mean (grand mean: 0.31; 95% CI: 0.30, 0.33). The mean MIDs in Study 1c were 0.34, 0.31, and
0.31 for 30, 36, and 42 lbs., respectively. Although the mean MIDs were statistically different
(p < 0.0001), all three means were within the 95% CI of the grand mean. Thus, except in Study
1c, we could not reject the proportional model, whereas the linear model was rejected in all
three sub-studies. For the linear model, in Study 1a, the three mean MIDs (0.34, 0.20, and 0.22
for the starting weights of 28, 34, and 40, respectively) were statistically different
(p < 0.0001) and did not lie within the 95% CI of the grand mean. Studies 1b and 1c showed
similar results: differences in the mean MIDs were significant and the means were outside the
95% CI of the grand mean.

Table 2.2. Statistical results from Study 1

                    Proportional Model                                 Linear Model
Starting weight     Mean (SD)         p-value*   Grand mean (95% CI)   Mean (SD)       p-value    Grand mean (95% CI)
Study 1a (N = 98)
  28                0.34 (0.16) †‡    0.066      0.34 [0.32, 0.35]     9.52 (4.48)     < 0.0001   5.32 [4.76, 6.16]
  34                0.34 (0.15) ‡§                                     5.60 (5.04)
  40                0.33 (0.15) †§                                     1.12 (6.16)
Study 1b (N = 128)
  29                0.33 (0.14) †‡    0.075      0.31 [0.30, 0.33]     9.57 (4.06)     < 0.0001   4.35 [3.48, 4.93]
  36                0.31 (0.15) ‡§                                     4.35 (5.22)
  43                0.31 (0.15) †§                                     -0.87 (6.67)
Study 1c (N = 119)
  30                0.34 (0.16)       < 0.0001   0.32 [0.30, 0.34]     10.2 (4.80)     < 0.0001   5.40 [4.80, 6.00]
  36                0.31 (0.15) †                                      5.10 (5.70)
  42                0.31 (0.16) †                                      0.90 (6.60)

* Greenhouse-Geisser epsilon adjusted F-test.
†‡§ In each sub-study, means with the same symbols across starting weights are not significantly
different using Scheffe’s multiple comparison method (p < 0.05).
2.4. Study 2
2.4.1. Overview
In Study 2, we modified the case of body weight in Study 1 in a more practical way. The
distinction of Study 2 from Study 1 was that total body weights were used, instead of excess
weights. Subjects were asked to assume they were under a diet program and to discriminate an
important improvement with respect to body weights after the program. Three body weights
were applied as stimuli where all the weights were overweight or obese. We examined whether
MIDs followed the logarithmic or linear models.
2.4.2. Methods
Research Design
In the study, three weight stimuli were tested (hereafter, “starting weights”): 177, 193,
and 225 lbs. As described in 2.3.2. Methods, we also defined the term “ending weights” and the
bisection procedure was performed to determine the ending weight in each question.
Web Service Used to Obtain Survey Responses
Amazon Mechanical Turk was used as described in 2.3.2. Methods.
Subjects
A total of 120 subjects were recruited. Exclusion criteria described in 2.3.2. Methods
were applied as well. We set our price at US$0.30 for completing the survey.
The Questionnaire
The survey consisted of 17 questions including 5 demographic questions. The design of
each question was identical throughout the survey. The wording of the question in the survey
was as below:
“Robert is a 33-year old man who weighs X lbs. He is 5 feet and 10 inches tall. After undergoing
a weight loss treatment, he now weighs Y lbs. Do you think he has made a significant
improvement with respect to his weight?”
X indicates the starting weight and Y, the ending weight. That is, X took its values from 177, 193,
or 225 lbs. Based on the subject’s response, the value of Y for the next question was determined
by the bisection procedure. The final weights, which were the ending weights when the starting
and ending weights had a 50% chance of being considered clinically important, were collected.
To neutralize the effects of weight changes, the final weights were divided by their
corresponding starting weights in the proportional model, and in the linear model the differences
between the starting and final weights were taken for analyses.
Statistical Analysis
To examine differences in MIDs among three starting weights, statistical analysis was
carried out as described in 2.3.2. Methods.
2.4.3. Results
A total of 117 subjects completed Study 2, with three subjects excluded because they did
not complete the survey. The demographic characteristics of the subjects are shown in Table 2.1.
The respondents were more likely to be White/Caucasian and aged 20-29 years. Table 2.3
shows the results of Study 2. The mean MIDs of the three starting weights (177, 193, and 225
lbs.) were 0.94, 0.94, and 0.93, respectively. Although they were statistically different (p =
0.0022), Scheffe’s post hoc assessment showed that two of the three possible pairs did not
differ from each other, and all three mean MIDs lay within the 95% CI of the grand mean MID
(grand mean: 0.94; 95% CI: 0.93, 0.94). However, the result from the linear model was quite
different from that of the proportional model. The three mean MIDs were significantly
different in both the adjusted univariate test and Scheffe’s post hoc test. Also, the three
mean MIDs did not lie within the 95% CI of the grand mean.
Table 2.3. Statistical results from Study 2 (N = 117)

                    Proportional Model                                 Linear Model
Starting weight     Mean (SD)         p-value*   Grand mean (95% CI)   Mean (SD)        p-value    Grand mean (95% CI)
  177               0.94 (0.02) †     0.002      0.94 [0.93, 0.94]     9.85 (4.13)      < 0.0001   12.26 [11.45, 13.07]
  193               0.94 (0.03) †‡                                     11.77 (6.42)
  225               0.93 (0.05) ‡                                      15.16 (10.28)

* Greenhouse-Geisser epsilon adjusted F-test.
†‡§ Means with the same symbols across starting weights are not significantly different using
Scheffe’s multiple comparison method (p < 0.05).
2.5. Study 3
2.5.1. Overview
Study 3 investigated a more general case of health outcomes: health state-survival duration
pairs. We modified time-tradeoff technique in the survey to find the smallest improvement in
survival years under given health states. We asked subjects to suppose that they have limited
years left with some health conditions and are undergoing treatment for the conditions. Subjects
were asked to discriminate clinical improvement between a longer duration of life with health
problems and a shorter duration of life without health problems. Three health states and three
durations were used to generate nine cases. In each health state, we evaluated whether the MIDs
were constant over proportional change in duration of life.
2.5.2. Methods
Research Design
For Study 3, we first had to define health state-duration pairs. Three health states were
extracted from the EuroQol (EQ-5D): 21222, 21122, and 22222 (Table 2.4). The EQ-5D is a generic
preference-based utility measure comprising five dimensions: mobility, self-care, usual
activities, pain/discomfort, and anxiety/depression (EuroQol Group, 1990). Each of the five
dimensions is assessed by a single question on a three-point ordinal scale (no problems, some
problems, and extreme problems). An EQ-5D health state is a combination of one level from each
dimension, so that 243 health states are possible in total. Preference weights for the EQ-5D
health states were obtained using the time trade-off (TTO) technique.
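The count of 243 states and the five-digit state codes can be checked with a short sketch. The variable and function names are illustrative; the dimension order follows the list in the paragraph above, and the level labels are the generic three-point labels, which differ slightly from the dimension-specific wording used in the questionnaire.

```python
from itertools import product

# Dimension order as listed in the text; each digit of a state code is one level.
DIMENSIONS = ["mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression"]
LEVELS = {1: "no problems", 2: "some problems", 3: "extreme problems"}

# Every combination of one level (1-3) per dimension: 3**5 states.
all_states = ["".join(str(level) for level in combo)
              for combo in product((1, 2, 3), repeat=len(DIMENSIONS))]


def decode(state):
    """Map a five-digit EQ-5D code such as '21222' to a level label per dimension."""
    return {dim: LEVELS[int(digit)] for dim, digit in zip(DIMENSIONS, state)}


print(len(all_states))
print(decode("21222"))
```

Decoding 21222 gives "no problems" for self-care and "some problems" on the other four dimensions.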
Similar to Study 1, in each health state, three durations of life were used as stimuli: 10,
20, and 30 years. Throughout this study, we call these duration stimuli “starting years”. We also
defined another term, “ending year”, which was determined by the participant’s previous answer
using the bisection procedure. The rules of the bisection procedure described in Study 1 were
applied in the same manner.
Web Service Used to Obtain Survey Responses
Amazon Mechanical Turk was used as described in 2.3.2. Methods.
Table 2.4. Health states in Study 3

Health state        Description
State 1 (21222)     Some problems in walking about
                    No problems with self-care
                    Some problems with performing usual activities
                    Moderate pains or discomfort
                    Moderate anxiety or depression
State 2 (21122)     Some problems in walking about
                    No problems with self-care
                    No problems with performing usual activities
                    Moderate pains or discomfort
                    Moderate anxiety or depression
State 3 (22222)     Some problems in walking about
                    Some problems with self-care
                    Some problems with performing usual activities
                    Moderate pains or discomfort
                    Moderate anxiety or depression
Subjects
A total of 100 subjects were recruited. Exclusion criteria described in 2.3.2. Methods
were applied as well. We set our price at US$0.30 for completing the survey.
The Questionnaire
The survey consisted of 41 questions including 5 demographic questions. The design of
each question was identical throughout the nine cases. To illustrate with a pair of State 1 and 10
years, subjects were asked to consider the following hypothetical scenario:
“Imagine you are told that you have 10 (X) years left to live with following health conditions:
Some problems in walking about
No problems with self-care
Some problems with performing usual activities
Moderate pains or discomfort
Moderate anxiety or depression
In connection with this, you are also told that you can choose to live these 10 (X) years in the
given health state followed by immediate death or that you can choose a treatment where you will
live for Y years in full health followed by immediate death. Compared to 10 (X) years in the
health state described above, do you think Y years in full health is a clinical improvement?”
X, 10 years in this example, indicated the starting year and Y, the ending year. For each health
state, X took its value among 10, 20, or 30 years and Y was determined by the respondents’
answer to the previous question. When living for the starting number of years in the given
health state and living for the ending number of years in full health had a 50% chance of being
considered an improvement, that ending year was collected. The collected ending years
were divided by the corresponding starting years for analysis.
Statistical Analysis
To examine differences in MIDs among the three starting years in each health state, a
two-way repeated-measures ANOVA with post hoc testing using Scheffe’s multiple comparison
method was conducted. A grand mean of the MIDs in each health state was calculated as well.
2.5.3. Results
A total of 98 subjects completed Study 3 and two subjects did not complete the survey.
The demographic characteristics of the subjects are shown in Table 2.1. The respondents were
more likely female and White/Caucasian and aged 20-29 years old.
Table 2.5. Statistical results from Study 3

                 Health State A                         Health State B                         Health State C
Starting year    Mean (SD)         Grand mean (95% CI)  Mean (SD)         Grand mean (95% CI)  Mean (SD)         Grand mean (95% CI)
  10             0.63 (0.24) †‡    0.61 [0.58, 0.64]    0.66 (0.26) †‡    0.67 [0.64, 0.69]    0.57 (0.25) †‡    0.56 [0.53, 0.59]
  20             0.61 (0.28) ‡§                         0.68 (0.25) ‡§                         0.55 (0.27) ‡§
  30             0.59 (0.27) †§                         0.66 (0.27) †§                         0.57 (0.28) †§

†‡§ In each health state, means with the same symbols across starting years are not significantly
different using Scheffe’s multiple comparison method (p < 0.05).
Two-way repeated measures ANOVA with two within-subject factors resulted in no
significant duration effect (p = 0.59) but significant health state effect (p < 0.0001). In State 1,
the normalized mean MIDs of 10, 20, and 30 years were 0.63, 0.61, and 0.59, respectively (Table
2.5). Scheffe’s post hoc assessment showed that there were no significant differences between
any two mean MIDs and all the mean MIDs for the starting years were within 95% CI of the
grand mean (mean: 0.61; 95% CI: 0.58, 0.64). For State 2, the mean MIDs were 0.66, 0.68, and
0.66 for 10, 20, and 30 years, respectively, and resulted in no statistical differences in Scheffe’s
assessment. All three mean MIDs were in 95% CI of grand mean for State 2 (mean: 0.67; 95%
CI: 0.64, 0.70). Lastly, in State 3, the mean MIDs for the starting years (10, 20, and 30 years)
were 0.57, 0.55, and 0.57, respectively. There were no significant differences between any two
mean MIDs and all the means were within 95% CI of the grand mean (grand mean of 0.57 with
95% CI of 0.54 to 0.60). Therefore, in all health states, the MIDs were constant over proportional
change in duration of life.
2.6. Study 4
2.6.1. Overview
Study 4 also tested health outcomes that are constituted by a health state and time frame.
However, in Study 4, we fixed a health state as influenza and tested the functional form of the
MID across three different stimuli, recovery periods. Subjects were asked to assume that they
had the flu and its recovery period could be shortened by a treatment, Tamiflu. Subjects had to
determine the minimally important number of days that the treatment could shorten. Study 4
examined both the logarithmic and linear models.
2.6.2. Methods
Research Design
Three recovery days were applied as stimuli (hereafter, “starting days”): 7, 13, and 17
days. We also define another term, “ending days”, which was determined by subjects’ previous
answer using the bisection procedure described in 2.3.2. Methods.
Web Service Used to Obtain Survey Responses
Amazon Mechanical Turk was used as described in 2.3.2. Methods.
Subjects
Subjects in Study 4 were identical to the subjects in Study 2 as described in 2.4.2
Methods.
The Questionnaire
The survey contained 17 questions including 5 demographic questions. Subjects were
asked as below:
“A 67-year-old male has fever, chills, and muscle aches accompanied by extreme fatigue. His
doctor says that he has the flu and prescribes him Tamiflu to shorten the duration of recovery days.
Normal recovery time for this flu is X days. If with the treatment his recovery time is Y days, then do you
think this is a significant improvement?”
X indicates the starting day which took its values from 7, 13, or 17 days and Y indicates the
ending day. The value of Y was determined by the subject’s prior response based on the bisection
procedure. The procedure continued until the starting and ending days had a 50% chance of
being considered as a significant improvement due to the flu treatment and the ending days at
that moment were collected for analysis. Those ending days were divided by their corresponding
starting days for the logarithmic model analysis, while they were subtracted from their
corresponding starting days for the linear model analysis.
Statistical Analysis
The differences in MIDs among three starting days were tested based on the statistical
methods described in 2.3.2. Methods.
2.6.3. Results
The study sample of Study 4 was identical to that of Study 2, and the study results are
shown in Table 2.6. The ending days at which the starting and ending days had a 50% chance of
being considered a significant improvement were 4.49, 9.02, and 12.88 days for the starting
days of 7, 13, and 17 days, respectively. The logarithmic model was rejected (p < 0.0001).
Although the linear model was rejected as well, the means for 13 days and 17 days were not
statistically different, indicating that the minimally important reduction in recovery days
was constant when a fixed amount, in this case 4 days, was added to the health outcome
between the 13-day and 17-day cases. Thus, the linear model may hold beyond a threshold
number of sick days.
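The two quantities compared in Study 4 follow directly from the mean ending days reported above. The sketch below recomputes them as a consistency check; the variable names are illustrative, and because the starting day is fixed within each condition, the ratio of the mean ending day to the starting day equals the mean of the individual ratios.

```python
starting_days = [7, 13, 17]
mean_ending_days = [4.49, 9.02, 12.88]  # mean recovery days with treatment, from the text

# Proportional (logarithmic) model: ending day divided by starting day.
ratios = [y / x for x, y in zip(starting_days, mean_ending_days)]
# Linear model: number of recovery days saved by the treatment.
reductions = [x - y for x, y in zip(starting_days, mean_ending_days)]

for x, ratio, saved in zip(starting_days, ratios, reductions):
    print(f"{x:>2} days: ratio = {ratio:.2f}, days saved = {saved:.2f}")
```

The ratios rise with the starting day, while the days saved are close to each other for the 13-day and 17-day conditions, which is the pattern described in the paragraph above.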
Table 2.6. Statistical results from Study 4 (N = 117)

                Recovery days      Proportional Model                              Linear Model
Starting day    with Tamiflu,      Mean (SD)      p-value*   Grand mean (95% CI)   Mean (SD)        p-value    Grand mean (95% CI)
                Mean (SD)
  7             4.49 (1.02)        0.64 (0.15)    <0.0001    0.70 [0.68, 0.71]     2.5 (1.02)       <0.0001    3.53 [3.32, 3.75]
  13            9.02 (1.98)        0.69 (0.15)                                     3.98 (1.98) †
  17            12.88 (2.50)       0.76 (0.15)                                     4.12 (2.50) †

* Greenhouse-Geisser epsilon adjusted F-test.
† Means with the same symbols across starting days are not significantly different using Scheffe’s
multiple comparison method (p < 0.05).
2.7. Discussion
Many approaches to measuring health outcome appeal to our intuition, but fail to
represent our preferences. One such example is the five-year survival with lung cancer, a
measure popular in the 1970s. Medical outcomes researchers knew that a five-year survival
essentially described a cure, because the probability of death from lung cancer was very low after
five years. So five-year survival was coded as a “success”, equal to 1, and less than five year
survival a “failure”, equal to 0. Evaluation of the 5-year survival rate became a standard test to
identify an effective treatment. However, decision analysts realized that while people wanted to
live 5 years, they probably also valued living less than 5 years. The five year survival treated
these shorter life spans as equivalent in value to death. When these analysts measured survival
preference they found that treatment choice as prescribed by preferences for survival was
opposite to that which the five-year survival outcome prescribed (McNeil et al., 1978). The
lesson of the five-year survival fallacy is that when a measurement procedure is not fully
understood, it may lead us down a path that disagrees with our preferences. In that sense, this
study investigated the functional form of the MIDs as a health outcome measure, and our work
suggests that MIDs should also be scrutinized to ensure that the welfare implications of this
approach are desirable.
In our study, we found that the probability of discrimination tended to be constant over
proportional change in body weights but the discriminating probability did not satisfy the
alternative linear model. However, when we extended this derivation of U to health state-
duration pairs, mixed results were found. In the study using the health state-survival duration
pairs with the EQ-5D, MIDs were proportionally constant over duration changes in all tested
health states. But in the flu study, the MIDs were not constant over proportional changes in
recovery days. Rather, the linear model showed equivocal results. Although the linear model was
rejected, the MIDs of two bigger starting days (13 and 17 days) were statistically not different.
Together, MIDs might behave differently with some health outcomes but MIDs were constant
over proportional changes in most cases in this study. This implies that the function of the MID
in Equation (1) satisfies Theorem 1 and Theorem 2, indicating that the utility function for health
measures is a logarithmic function. Therefore, this supports the idea that the utility function
underlying the MID is based on behavioral economic theory.
Logarithmic utility has several undesirable properties for normative policy decision
making, especially in health decisions. First, small changes in health can yield large changes in
health value. That is, people are less sensitive to changes in health remote from their status quo,
while people are more sensitive to changes in health near their status quo. Second, the
logarithmic utility model does not permit worse-than-death states. In the model
(Figure 2.1), as duration goes to zero, which indicates death, the utility of death approaches
negative infinity. It is generally recommended that health measures include health states
worse than death to describe a comprehensive range of preferences (Patrick et al., 1994).
Lastly, the zero condition is violated, which means that an individual’s preferences exist at
death; i.e., people may prefer one health state over another at death. Violations of the zero
condition are implausible in measuring health outcomes (Miyamoto et al., 1998).
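Two of the properties discussed above can be illustrated numerically. The sketch assumes the simplest logarithmic form, U(x) = ln(x), which is one instance of a logarithmic utility and not necessarily the exact parametric form derived in Appendix A; the function names are hypothetical.

```python
import math


def log_utility(x):
    """Illustrative logarithmic utility, U(x) = ln(x), defined for x > 0."""
    return math.log(x)


def utility_change(before, after):
    """Absolute change in logarithmic utility between two values."""
    return abs(log_utility(after) - log_utility(before))


# Equal proportional changes give equal utility changes:
# 150 -> 145 lbs. and 300 -> 290 lbs. are both a reduction of 1/30.
print(utility_change(150, 145), utility_change(300, 290))

# As duration approaches zero, utility decreases without bound.
for years in (1.0, 1e-3, 1e-6):
    print(years, log_utility(years))
```

The first line shows why a 5 lbs. loss at 150 lbs. and a 10 lbs. loss at 300 lbs. are equally discriminable under this form; the loop shows the unbounded decline in utility as duration shrinks toward zero.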
Figure 2.1. Logarithmic utility function
Some limitations of this study should be noted. First, the four studies presented here
studied the functional form of the MID on a limited range of health measures. A broader range of
the stimuli was not tested because it was unlikely that subjects could easily imagine themselves
as being 100 lbs. overweight compared to standard body weight or imagine their lives in 100
years, even though the surveys asked subjects to assume hypothetical scenarios. Also, future
research of the flu study should consider a wider range of the recovery days from the flu to
examine the trend of the linear model. Therefore, care should be taken when extrapolating our
findings. Second, the functional form of U was only tested in the case of gain. Though MIDs are
usually used as an index of discriminating improvement, there is evidence that MIDs depend on
the direction of change, improvement or deterioration (Crosby et al., 2003; Nichol and Epstein,
2008). Studies of how MIDs work in case of loss will help to understand MIDs more completely.
We studied how MIDs value health outcomes in terms of sets derived from real numbers
and health state-duration pairs. We found that MIDs tended to fit the logarithmic utility model
within a limited range, indicating that MIDs may be problematic as a normative policy tool but
may be sufficient as a descriptive tool.
2.8. References
Alonso, O., and Mizzaro, S. (2009). Can we get rid of TREC assessors? using Mechanical Turk
for relevance assessment. Proceedings of the SIGIR 2009 workshop on the future of IR
evaluation, 15-16.
Badia, X., Díez-Pérez, A., Lahoz, R., Lizán, L., Nogués, X., and Iborra, J. (2004). The ECOS-16
questionnaire for the evaluation of health related quality of life in post-menopausal
women with osteoporosis. Health Qual Life Outcomes 2, 41.
Beaton, D.E., Bombardier, C., Katz, J.N., Wright, J.G., Wells, G., Boers, M., Strand, V., and
Shea, B. (2001). Looking for important change/differences in studies of responsiveness.
OMERACT MCID Working Group. Outcome Measures in Rheumatology. Minimal
Clinically Important Difference. J Rheumatol 28, 400-405.
Bergner, M. (1985). Measurement of health status. Med Care 23, 696-704.
Birnbaum, M.H. (2000). Psychological Experiment on the Internet, 1 edn (Elsevier).
Bleichrodt, H., Doctor, J., and Stolk, E. (2005). A nonparametric elicitation of the equity
efficiency trade-off in cost-utility analysis. J Health Econ 24, 655-678.
Bostic, R., Herrnstein, R., and Luce, R.D. (1990). The effect on the preference-reversal phenomenon of
using choice indifferences. Journal of Economic Behavior and Organization 13, 193-212.
Bradley, R.A., and Terry, M.E. (1955). Rank analysis of incomplete block designs: I. The method
of paired comparisons. Biometrika 39, 324-345.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., Reeve, B., Ader, D., Fries, J.F., Bruce,
B., Rose, M., et al. (2007). The Patient-Reported Outcomes Measurement Information
System (PROMIS): progress of an NIH Roadmap cooperative group during its first two
years. Med Care 45, S3-S11.
Cella, D., Yount, S., Sorensen, M., Chartash, E., Sengupta, N., and Grober, J. (2005). Validation
of the Functional Assessment of Chronic Illness Therapy Fatigue Scale relative to other
instrumentation in patients with rheumatoid arthritis. J Rheumatol 32, 811-819.
Clancy, C., and Collins, F.S. (2010). Patient-Centered Outcomes Research Institute: the
intersection of science and health care. Sci Transl Med 2, 37cm18.
Crosby, R., Kolotkin, R., and Williams, G. (2003). Defining clinically meaningful change in
health-related quality of life. J Clin Epidemiol 56, 395-407.
Deyo, R.A., and Inui, T.S. (1984). Toward clinical applications of health status measures:
sensitivity of scales to clinically important changes. Health Serv Res 19, 275-289.
Doctor, J.N., and Miyamoto, J.M. (2003). Deriving quality-adjusted life years (QALYs) from
constant proportional time tradeoff and risk posture condition. Journal of Mathematical
Psychology 47, 557-567.
Eton, D.T., Cella, D., Bacik, J., and Motzer, R.J. (2006). A brief symptom index for advanced
renal cell carcinoma. Health Qual Life Outcomes 4, 68.
Falmagne, J.-C. (1971). Bounded versions of Hölder's theorem with applications to extensive
measurement. Journal of Mathematical Psychology 8, 495-507.
Fechner, G.T. (1860). Elemente der Psychophysik (Leipzig: Breitkopf & Härtel).
Guyatt, G.H., Osoba, D., Wu, A.W., Wyrwich, K.W., Norman, G.R., and Group, (2002).
Methods to explain the clinical significance of health status measures. Mayo Clin Proc 77,
371-383.
Hageman, W., and Arrindell, W. (1991). Establishing clinically significant change: increment of
precision and distinction between individual and group level of analysis. Behav Res Ther
37, 1169-1193.
Hays, R., Farivar, S., and Liu, H. (2005). Approaches and recommendations for estimating
minimally important differences for health-related quality of life measures. COPD 2, 63-
67.
Hays, R.D., and Hadorn, D. (1992). Responsiveness to change: an aspect of validity, not a
separate dimension. Qual Life Res 1, 73-75.
Horton, J., Rand, D.G., and Zeckhauser, R.J. (2010). The Online Laboratory: Conducting
Experiments in a Real Labor Market. SSRN eLibrary.
Jaeschke, R., Singer, J., and Guyatt, G. (1989). Measurement of health status. Ascertaining the
minimal clinically important difference. Control Clin Trials 10, 407-415.
Juniper, E.F., Guyatt, G.H., Willan, A., and Griffith, L.E. (1994). Determining a minimal
important change in a disease-specific Quality of Life Questionnaire. J Clin Epidemiol 47,
81-87.
Kaplan, R.M. (2005). The minimally clinically important difference in generic utility-based
measures. COPD 2, 91-97.
Keeney, R.L., and Raiffa, H. (1993). Decisions with multiple objectives: preferences and value
tradeoffs, 2 edn (Cambridge University Press).
Khanna, D., Furst, D.E., Wong, W.K., Tsevat, J., Clements, P.J., Park, G.S., Postlethwaite, A.E.,
Ahmed, M., Ginsburg, S., Hays, R.D., et al. (2007). Reliability, validity, and minimally
important differences of the SF-6D in systemic sclerosis. Qual Life Res 16, 1083-1092.
King, M.T. (2011). A point of minimal important difference (MID): a critique of terminology
and methods. Expert Rev Pharmacoecon Outcomes Res 11, 171-184.
Kupferberg, D.H., Kaplan, R.M., Slymen, D.J., and Ries, A.L. (2005). Minimal clinically
important difference for the UCSD Shortness of Breath Questionnaire. J Cardiopulm
Rehabil 25, 370-377.
Kwok, T., and Pope, J.E. (2010). Minimally important difference for patient-reported outcomes
in psoriatic arthritis: Health Assessment Questionnaire and pain, fatigue, and global
visual analog scales. J Rheumatol 37, 1024-1028.
Luce, R.D. (1959). Individual choice behavior: a theoretical analysis (New York: Wiley).
Luce, R.D., Bush, R.R., and Galanter, E. (1963). Psychophysical scaling, Vol I (New
York: Wiley).
Lydick, E., and Epstein, R.S. (1993). Interpretation of quality of life changes. Qual Life Res 2,
221-226.
Mason, W., and Watts, D.J. (2010). Financial Incentives and the “Performance of Crowds”.
ACM SIGKDD 11, 100-108.
Maxwell, S.E., and Delaney, H.D. (1990). Designing experiments and analyzing data: a model
comparison perspective (Pacific Grove, Calif.: Brooks/Cole Pub. Co.).
McNeil, B.J., Weichselbaum, R., and Pauker, S.G. (1978). Fallacy of the five-year survival in
lung cancer. N Engl J Med 299, 1397-1401.
Miyamoto, J.M. (1999). Quality-Adjusted Life Years (QALY) Utility Models under Expected
Utility and Rank Dependent Utility Assumptions. J Math Psychol 43, 201-237.
Miyamoto, J.M., Wakker, P.P., Bleichrodt, H., and Peters, H.J. (1998). The Zero-Condition: A
Simplifying Assumption in QALY Measurement and Multiattribute Utility. Management
Science 44, 839-849.
Nichol, M.B., and Epstein, J.D. (2008). Separating gains and losses in health when calculating
the minimum important difference for mapped utility measures. Qual Life Res 17, 955-
961.
Osoba, D., Rodrigues, G., Myles, J., Zee, B., and Pater, J. (1998). Interpreting the significance of
changes in health-related quality-of-life scores. J Clin Oncol 16, 139-144.
Patrick, D.L., Starks, H.E., Cain, K.C., Uhlmann, R.F., and Pearlman, R.A. (1994). Measuring
preferences for health states worse than death. Med Decis Making 14, 9-18.
Pliskin, J.S., Shepard, D.S., and Weinstein, M.C. (1980). Utility functions for life years and
health status. Operations Research 28, 206-224.
Redelmeier, D.A., Guyatt, G.H., and Goldstein, R.S. (1996). Assessing the minimal important
difference in symptoms: a comparison of two techniques. J Clin Epidemiol 49, 1215-
1219.
Revicki, D., Cella, D., Hays, R., Sloan, J., Lenderking, W., and Aaronson, N. (2006).
Responsiveness and minimal important differences for patient reported outcomes. Health Qual
Life Outcomes 4, 70.
Revicki, D., Hays, R.D., Cella, D., and Sloan, J. (2008). Recommended methods for determining
responsiveness and minimally important differences for patient-reported outcomes. J Clin
Epidemiol 61, 102-109.
Ross, M. (1989). Relation of implicit theories to the construction of personal histories.
Psychological Review 96, 341-357.
The EuroQol Group (1990). EuroQol--a new facility for the measurement of health-related
quality of life. The EuroQol Group. Health Policy 16, 199-208.
Thurstone, L.L. (1927a). A law of comparative judgement. Psychological Review 34, 273.
Thurstone, L.L. (1927b). Psychophysical analysis. American Journal of Psychology 34, 368-389.
US Food and Drug Administration (2009). Guidance for Industry, Patient-Reported Outcome
Measures: Use in Medical Product Development to Support Labeling Claims.
Uttal, W.R. (1973). The psychobiology of sensory coding (Harper & Row).
Walters, S.J., and Brazier, J.E. (2005). Comparison of the minimally important difference for
two health state utility measures: EQ-5D and SF-6D. Qual Life Res 14, 1523-1532.
Ware, J.E., Brook, R.H., Davies, A.R., and Lohr, K.N. (1981). Choosing measures of health
status for individuals in general populations. Am J Public Health 71, 620-625.
CHAPTER 3. COMPARATIVE STUDY OF THE
PERFORMANCES OF GENERIC AND DISEASE-SPECIFIC
MEASURES IN CATARACT AND HEART FAILURE PATIENTS
Abstract
Objective: Health measures usually fall into two types: generic and disease-specific instruments.
When the purpose of a study is to apply a sensitive measure, both the sensitivity of a measure to
changes in health and its ability to discriminate between those who improve and those who do
not should be considered. Thus, it is an empirical question whether a disease-specific
measure is more responsive than a generic one. The aim of this study is to evaluate the overall
performance of generic and disease-specific measures in cataract and heart failure patients.
Method: Two disease-specific and five generic measures were administered to cataract and heart
failure patients at baseline, 1 month, and 6 months. The disease-specific measures were the
National Eye Institute Visual Functioning Questionnaire-25 (VFQ-25) for cataract patients and the
Minnesota Living with Heart Failure Questionnaire (MLHF) for heart failure patients. The
generic measures included the Short Form-6D (SF-6D), EuroQol-5D (EQ-5D),
Self-Administered Quality of Well-being Scale (QWB-SA), and two versions of the Health Utilities
Index (HUI2 and HUI3). A self-comparative self-rated health (SRH) measure and clinical indices
such as visual acuity or hormone level were applied as anchors to define responders who
experienced changes in health. Five different definitions of responders were used: patients who
perceived improvement in their current health compared to one year ago (Model 1); patients who
perceived any change, improvement or deterioration, in their current health (Model 2); cataract
patients who achieved “good” vision (monocular or binocular) after cataract surgery (Models 3
and 4); and heart failure patients whose N-terminal prohormone of brain natriuretic peptide
(NT-proBNP) levels were less than 300 pg/ml (Model 5).
We first examined the performance of the generic measures against the disease-specific
measures in each model. Second, to test the impact of using different classifications in the
responder definitions, we compared responsiveness between Model 1 and Model 2 for each
measure. Lastly, by comparing the overall performance of Models 3, 4, and 5 to that of
Model 1, we assessed the impact of different anchors, either subjective or objective, on the
overall performance of the health measures. The overall performance of the measures was
compared based on the areas under the receiver operating characteristic (ROC) curves. Missing
data were imputed using a multiple imputation method.
Result: A total of 362 cataract patients and 150 heart failure patients were included in the
analysis. In the cataract cohort, the disease-specific VFQ-25 showed good responsiveness in
Model 1, as did the HUI2 and HUI3. However, when visual acuity was applied as an anchor,
the VFQ-25 was less responsive than the HUI2, EQ-5D, and SF-6D (p < 0.001). In heart failure
patients, the MLHF and EQ-5D were responsive across the tested models, but the
performance of the measures was lower under the clinical anchor than under the subjective
anchor.
Conclusion: In these cataract and heart failure populations, the generic measures
were as sensitive as the disease-specific measures in most cases. The responsiveness of a health
instrument is influenced by the external anchor used to define responders, the content of the
measures, and the study population.
3.1. Introduction
Patient-reported outcomes (PROs) are of increasing importance in health studies. The
Food and Drug Administration (FDA) recently released guidance on supporting claims in approved
medical product labeling based on PRO measures (FDA, 2009). The guidance stipulates that it is
critical to choose an adequate PRO instrument that matches the concept of interest. For example,
researchers should consider whether an instrument is able to detect change when there is a
known change with respect to the concept of interest, and to distinguish people with the change
from those without the change in the study population. Thus, understanding the characteristics of
the PRO measure is important in the choice of the measure.
Instruments for assessing health status or health-related quality of life usually fall into two
categories: generic and disease-specific instruments. Generic measures are intended to describe
overall health, capturing many aspects of health. Their scores from patients with different
diseases can be compared against each other or against the general population. Therefore, these
instruments are used to aid decision makers in health decisions such as health resource allocation
or economic analysis. Disease-specific instruments, on the other hand, focus on collecting
information on symptoms or disease-related health problems from specific populations with
given medical conditions. While disease-specific measures may provide more detailed
information about symptoms or problems, they may or may not be more sensitive than generic
measures. These disease-specific measures have often demonstrated better responsiveness to
changes in the particular conditions than generic instruments (S et al., 2009; Wiebe et
al., 2003), but this is not always the case (Walsh et al., 2003). However, when the purpose of a
study is to apply a sensitive measurement to detect treatment effects, researchers often presume
that disease-specific measures are more sensitive and use them. Sensitivity depends
not only on the level of granularity of the items, but also on the reliability of the responses to
those items. Thus, it is an empirical question as to whether or not a disease-specific instrument is
more responsive than a generic one.
A cataract is an eye condition that can lead to blindness if untreated. Patients
with a cataract may experience a range of visual deficits such as deterioration in visual acuity
and loss of contrast sensitivity. Consequently, these visual deficits affect a patient’s daily life and
lead to a range of real-world difficulties, from physical activities such as reading a book or
driving to mental concerns such as frustration due to poor vision. Cataract extraction surgery
with lens replacement is a common treatment, which is effective and results in almost immediate
visual rehabilitation. As vision affects overall health, the impact of the surgery can extend
beyond visual acuity. In other words, a unit change in a clinical factor such as visual acuity may
not correspond to a unit change in quality of life or well-being. Thus, to evaluate the sudden but
significant benefit of the treatment, recent clinical trials have assessed not only clinical
outcomes but also PROs, that is, a patient’s discrimination of overall health improvement (Javitt
and Steinert, 2000; Sandoval et al., 2008).
Although the pattern of the disease is different, the impact of heart failure on quality of
life may be similar to that of cataracts. Heart failure is a common but serious condition that affects the
cardiovascular system and has a significant impact on quality of life, comparable to or greater
than other chronic conditions such as arthritis or chronic lung disease (Alonso et al., 2004;
Bennet et al., 2002). Treatment of heart failure focuses mainly on improving symptoms,
which may result in a cascade of improvements in day-to-day living. Unlike those after cataract
surgery, however, improvements after heart failure treatment are often small and may be
transitory (Kaplan et al., 2011).
For both diseases, generic measures have been validated for use and disease-specific
measures have also been developed and validated (Clemons et al., 2003; Lee et al., 2000;
Pressler et al., 2011; Rector and Cohn, 1992; Rosen et al., 2005). There is, however, no gold
standard regarding the choice of instruments. The choice of a PRO measure depends on
the purpose of the research, and if the purpose calls for sensitive measurement, there is a
need to examine the sensitivity of measures with respect to discriminating overall health
improvement.
The aim of this study is to evaluate the performances of the five most widely used
preference-based health-related quality of life measures and two disease-specific measures in
discriminating patient-reported improvement. The five generic measures include the Short Form-
6D (SF-6D), the EuroQol-5D (EQ-5D), the Self-Administered Quality of Well-being Scale
(QWB-SA), and two versions of the Health Utilities Index (HUI2 and HUI3). The two
disease-specific measures are the National Eye Institute Visual Functioning Questionnaire-25
(NEI-VFQ25) for cataract patients and the Minnesota Living with Heart Failure Questionnaire
(MLHF) for heart failure patients.
3.2. Methods
3.2.1. Subjects
The current study utilized data from earlier works (Feeny et al., 2011; Kaplan et al.,
2011) in which two patient populations, cataract and heart failure, were included. The cataract
patients were soon to undergo cataract extraction surgery with lens replacement. Patients with
simultaneous glaucoma, corneal, or vitreoretinal procedures, and patients who were not able to
read large versions of the questionnaires, were not included. Heart failure patients were those
who were newly referred to congestive heart failure clinics. Patients with evidence of heart
failure for at least 3 months, defined as a left ventricular ejection fraction of less than 40%,
were included. Patients classified as class IV in the New York Heart Association system,
patients with a recent (≤6 months) myocardial infarction, unstable angina, or recent (≤3 months)
coronary artery bypass graft surgery, patients on the heart transplant list, and those with recent
(≤3 months) ventricular tachycardia were not included. Responses to the measures were collected
at enrollment and at 1- and 6-month follow-up. Briefly, a total of 536 subjects (376 cataract
patients and 160 heart failure patients) participated in the study. Most of the patients were white
(87% for cataract and 79% for heart failure). The cataract group was predominantly female (59%),
with most being 65 years or older. The heart failure group was predominantly male (67%) and
younger (78% aged less than 65 years). For the statistical analysis, only subjects whose total
scores at baseline and 6 months were available for all the measures and who answered the second
question of the SF-36, the global rating of change, at 6 months were included (hereafter,
complete data).
3.2.2. Health Outcomes Measures
3.2.2.1. Generic Measures
Short Form-6D (SF-6D)
The SF-6D is a preference-based measure based on a subset of items from the SF-36,
which is a multi-item generic health survey intended to measure “general health concepts not
specific to any age, disease, or treatment group” (Ware and Sherbourne, 1992). The SF-6D
version 2 includes 6 attributes: physical functioning, role limitation, social functioning, pain,
mental health, and vitality. Each attribute has 4 to 6 levels (Brazier et al., 2002). The scoring
function is based on standard gamble (SG) preferences elicited from a random sample of
community-dwelling subjects in the UK (Brazier et al., 2002).
Health Utilities Index (HUI)
The HUI is a generic, preference-based health profile whose purpose is to measure health
status and provide utility scores. The HUI consists of two systems: HUI2 and HUI3. The HUI2
includes seven attributes with three to five levels: sensation, mobility, emotion, cognition,
self-care, pain, and fertility. The HUI3, in contrast, consists of eight attributes (vision,
hearing, speech, ambulation, dexterity, emotion, cognition, and pain) with five to six levels,
leading to 972,000 possible health states. The multiplicative scoring functions of both the HUI2
and HUI3 systems were derived using standard gamble and visual analogue scale in random samples
of the Canadian population (Feeny et al., 2002; Torrance et al., 1996).
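The count of 972,000 HUI3 health states can be checked with a short calculation. The sketch below is illustrative only: the per-attribute level counts are an assumption (the text states only "five to six levels"), taken here as six levels for vision, hearing, ambulation, dexterity, and cognition and five for speech, emotion, and pain.

```python
from math import prod

# Assumed number of levels per HUI3 attribute. The text only says
# "five to six levels"; this particular split is an assumption.
HUI3_LEVELS = {
    "vision": 6,
    "hearing": 6,
    "speech": 5,
    "ambulation": 6,
    "dexterity": 6,
    "emotion": 5,
    "cognition": 6,
    "pain": 5,
}


def count_health_states(levels):
    """Number of distinct health states: the product of levels per attribute."""
    return prod(levels.values())


print(count_health_states(HUI3_LEVELS))  # 972000
```

Under this assumed split, the product is 6^5 x 5^3 = 972,000, which agrees with the figure quoted above.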
EuroQol-5D (EQ-5D)
The EQ-5D was developed by the EuroQol Group. The EQ-5D descriptive system uses
five domains (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression) with
three levels per domain (The EuroQol Group, 1990; Rabin and de Charro, 2001). The EQ-5D was
originally validated in Europe, and a scoring algorithm for the US general population is also
available (Persell et al., 2010; Shaw et al., 2005).
Self-Administered Quality of Well-being Scale (QWB-SA)
The QWB-SA evaluates self-reported functioning based on a series of questions that
record limitations in the previous three days. The QWB-SA covers three domains (mobility,
physical activity, and social activity) and, in addition, a series of questions that ask
about the presence or absence of symptom/problem complexes. The QWB was initially
developed for interviewer administration, but the self-administered QWB-SA has been
shown to be highly correlated with the interviewer-administered QWB and to retain
its psychometric properties (Kaplan et al., 1997).
3.2.2.2. Disease-Specific Measures
National Eye Institute Visual Functioning Questionnaire-25 (NEI-VFQ25)
The NEI-VFQ25 captures the influence of vision on dimensions of health-related quality
of life including emotional well-being and social functioning. The NEI-VFQ25 includes 25 items
covering general health, general vision, near vision, distance vision, driving, peripheral vision,
color vision, ocular pain, role limitations, dependency, social function, mental health, and
expectations (Mangione et al., 2001; Mangione et al., 1998). The total score ranges from 0 to 100,
with higher scores indicating better vision.
Minnesota Living with Heart Failure Questionnaire (MLHF)
The MLHF consists of 21 items including symptoms, mental health, social life, fatigue,
appetite, mobility, sleep, sexual activity, work and recreational activities, and side effects of
treatment (Rector, 2005). Overall scores range from 0 to 105, with higher scores indicating
greater impairment.
3.2.3. Responder Definitions
3.2.3.1. Anchor-Based Approach
Calculating a minimum score which can be considered as a small but meaningful change
has been used to interpret PRO results. This score has been referred to as a minimally important
difference (MID), the smallest difference in score in the outcome of interest that informed
patients perceive as important (Schünemann and Guyatt, 2005); more recently, the term MID has
been replaced by the term responder definition. The FDA guidance defines the responder definition
as “a score change in a measure, experienced by an individual patient over a predetermined time
period that has been demonstrated in the target population to have a significant treatment
benefit” (FDA, 2009); the responder definition thus explicitly contains a time dimension. There
are two general approaches to defining a responder: distribution-based and anchor-based methods.
Distribution-based approaches take the statistical characteristics of the sample into account when
establishing the responder. For distribution-based estimates, the effect size (mean change divided
by standard deviation at baseline) and standardized response mean (SRM: mean change divided
by standard deviation of change) are commonly used. However, the distribution-based estimates
do not provide direct information about MIDs, but instead they demonstrate the observed change
in a standardized metric. Thus, the distribution-based approach is recommended as a supportive
method to define a responder (FDA, 2009; Revicki et al., 2008), whereas anchor-based
approaches are recommended as the primary method. Anchor-based approaches generally use
patient ratings of change or scientific evidence to define a responder. In this study, we applied
both types of anchor to test the impact of using different anchors on the overall
performance of the health instruments. As the anchor based on patient ratings of change, a
self-comparative self-rated health (SRH) measure was used in both disease groups. The
self-comparative SRH item, which is part of the SF-36, is a balanced Likert-type scale scored by
patients (Guyatt et al., 1987; Wyrwich and Wolinsky, 2000). The question is worded as follows:
Compared to one year ago, how would you rate your health in general now?
1. Much better now than one year ago
2. Somewhat better now than one year ago
3. About the same as one year ago
4. Somewhat worse now than one year ago
5. Much worse now than one year ago
From this 5-point scale (1-5: from much better to much worse), two definitions of responders
who experienced an MID were applied: patients whose global rating of change was “somewhat
better” or above (Model 1), and patients whose rating was “somewhat better” or above, or
“somewhat worse” or below (Model 2). Model 1, which describes responders as patients who
perceived improvement, is closer to the responder defined by the FDA. However, we also examined
whether meaningful change encompasses worsening as well as improvement. In Model 2, we pooled
the changes regardless of their direction and evaluated to what extent they contributed to the
overall performance of the measures. To account for the difference in the direction of
change, the sign of the score change was reversed for patients who reported “somewhat
worse” or below in Model 2.
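The two subjective responder definitions and the sign reversal used in Model 2 can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study; the function names are hypothetical, and the sketch assumes that a higher score means better health (for a measure such as the MLHF, where higher scores indicate greater impairment, the change would first need to be negated).

```python
def is_responder(srh_rating, model):
    """Classify a patient from the 5-point self-comparative SRH item.

    srh_rating: 1 (much better) to 5 (much worse).
    model: 1 = improvement only, 2 = any change (improvement or worsening).
    """
    if srh_rating not in (1, 2, 3, 4, 5):
        raise ValueError("srh_rating must be an integer from 1 to 5")
    improved = srh_rating <= 2   # "somewhat better" or above
    worsened = srh_rating >= 4   # "somewhat worse" or below
    if model == 1:
        return improved
    if model == 2:
        return improved or worsened
    raise ValueError("model must be 1 or 2")


def score_change(baseline, followup, srh_rating, model):
    """Score change from baseline to follow-up.

    In Model 2 the sign is reversed for patients who reported worsening,
    so that changes in both directions can be pooled.
    """
    change = followup - baseline
    if model == 2 and srh_rating >= 4:
        return -change
    return change


# A patient who rates health "somewhat worse" (4) and drops from 70 to 60:
print(is_responder(4, model=1))          # False
print(is_responder(4, model=2))          # True
print(score_change(70, 60, 4, model=2))  # 10
```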
In addition to the subjective self-comparative SRH anchor, we constructed responder
definitions with objective anchors as well. For the cataract group, we used monocular and
binocular visual acuity as anchors. First, the World Health Organization (WHO) published target
guidelines on the visual outcome of cataract surgery in 1998 and defined a good outcome for the
operated eye as 20/20-20/60 based on the Snellen chart (World Health Organization, 1998).
Therefore, we defined a responder in Model 3 as a patient whose postoperative visual acuity was
20/60 or better at 6 months. Second, patients with binocular vision of 20/40 or better at 6 months
were defined as responders in Model 4. The Beaver Dam Eye study and the Salisbury Eye Evaluation
project suggested that a 20/40 visual acuity is a useful standard in cataract surgery (West et al.,
1997). Also, the US Food and Drug Administration has used visual acuity of 20/40 as a threshold
to evaluate intraocular lenses (IOL) (US Food and Drug Administration, 1999). In the heart failure
group, NT-proBNP, a biomarker of hemodynamic stress of the heart, was used as an anchor. Higher
NT-proBNP levels are associated with a greater likelihood of heart failure. Since a cutoff point
of 300 pg/ml was proposed to rule out the diagnosis of heart failure, we defined responders in
Model 5 as patients whose NT-proBNP level was less than 300 pg/ml at 6 months. Based on the
responder definitions of the models, the score change of each measure between baseline and 6
months was calculated accordingly.
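The three objective responder definitions reduce to simple threshold checks, sketched below. The function names and the representation of visual acuity as a Snellen string such as "20/40" are assumptions made for illustration; how acuity was actually coded in the data set is not described here.

```python
def snellen_denominator(acuity):
    """Parse a Snellen fraction written as '20/XX' and return XX.

    A smaller denominator means better vision (20/20 is better than 20/60).
    """
    numerator, denominator = acuity.split("/")
    if int(numerator) != 20:
        raise ValueError("expected a Snellen fraction of the form 20/XX")
    return int(denominator)


def is_responder_model3(monocular_acuity):
    """Model 3: operated-eye acuity of 20/60 or better at 6 months."""
    return snellen_denominator(monocular_acuity) <= 60


def is_responder_model4(binocular_acuity):
    """Model 4: binocular acuity of 20/40 or better at 6 months."""
    return snellen_denominator(binocular_acuity) <= 40


def is_responder_model5(nt_probnp_pg_ml):
    """Model 5: NT-proBNP below 300 pg/ml at 6 months."""
    return nt_probnp_pg_ml < 300


print(is_responder_model3("20/50"))  # True
print(is_responder_model4("20/50"))  # False
print(is_responder_model5(250.0))    # True
```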
We first compared the performances of the generic measures against the disease-specific
measures within the models. The impact of using different classifications for the responders was
examined by comparing the performances of Model 1 and Model 2 for each health measure.
Lastly, we examined whether different types of anchors, objective or subjective, affected the
overall performance of health measures by comparing Models 3, 4, and 5 against Model 1. The
overall performance of the measures was compared based on the areas under the receiver
operating characteristic (ROC) curves.
3.2.4. Statistical Analysis
In the present study, subjects were not included if their basic demographic information at
baseline, including age, sex, marital status, and education, was missing. The percentage of
missing responses at baseline and 6 months for each measure is provided in Table 3.1.
Table 3.1. Missing responses (%) at each time point
                            Cataract                Heart Failure
Measure                     Baseline    6 months    Baseline    6 months
SF-6D                       5.2%        21.8%       7.3%        32.7%
HUI2                        6.6%        21.5%       4.7%        31.3%
HUI3                        5.8%        21.8%       5.3%        31.3%
QWB-SA                      0.0%        18.5%       0.0%        30.7%
EQ-5D                       1.9%        22.1%       3.3%        30.7%
NEI-VFQ25                   3.9%        21.0%       0.0%        30.7%
Self-comparative SRH        1.7%        19.6%       1.3%        31.3%
Monocular Visual Acuity     4.7%        51.4%       n/a         n/a
Binocular Visual Acuity     5.2%        51.9%       n/a         n/a
NT-proBNP                   n/a         n/a         54.0%       69.3%
In the cataract group, missing responses for all the variables at baseline were less than 7%.
However, the missing responses for the health instruments at 6 months increased to as much as
22%, and the missing responses for visual acuity were about 50%. In the heart failure group,
missing responses for health instruments at baseline were about 7% or less, but those at 6 months
were about 30%. The clinical variable, NT-proBNP, was missing for 54% of patients at baseline and
69% at 6 months. Multiple imputation was performed, since analyzing data without accounting for
incomplete data may bias results (Schafer and Graham, 2002). Although the missing responses
were substantial, Rubin (1987) argued that high efficiency of an estimate can be
achieved by increasing the number of imputations.
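Rubin's argument can be made concrete with his relative-efficiency formula: an estimate based on m imputations has efficiency 1 / (1 + γ/m) relative to one based on infinitely many imputations, where γ is the fraction of missing information. The sketch below is illustrative only. The value γ = 0.5 is an assumed input, not a quantity reported in this study, and γ is not the same thing as the raw percentage of missing responses.

```python
def relative_efficiency(gamma, m):
    """Rubin's (1987) relative efficiency of an estimate based on m imputations.

    gamma: fraction of missing information, between 0 and 1.
    m: number of imputations (positive integer).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie between 0 and 1")
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 / (1.0 + gamma / m)


# Assumed fraction of missing information of 0.5 (illustrative only).
for m in (5, 20, 50):
    print(m, round(relative_efficiency(0.5, m), 3))
```

Even with half of the information missing, efficiency under this formula is already about 0.91 with 5 imputations and about 0.98 with 20, which is the sense in which more imputations offset a high rate of missingness.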
To compare the overall performance of instruments in each disease cohort, we computed
the area under the ROC curve of each measure. It has been shown that the area under the ROC
curve can be taken as a measure of the degree of information the test contains over its entire
score range (Hanley, 1989; Hanley and McNeil, 1983). The ROC curve plots the true positive rate
(sensitivity) against the false positive rate (1-specificity), which illustrates the discriminative
ability of the measure. The area under the curve (AUC) represents the probability that the measure
will correctly rank a randomly chosen responder above a randomly chosen nonresponder, with an area
of 0.5 indicating no discriminating ability and an area of 1.0 indicating perfect discriminability.
We used different cutoffs for the score change in each measure to predict the patient-reported
classification under Model 1 or Model 2 and thereby generate the ROC curves. Comparison of the
AUCs of two instruments was conducted using the bootstrap test described by Robin et al. (2011).
We also evaluated MIDs from the ROC curve analysis (de Boer et al., 2001; Farrar et al., 2010;
Turner et al., 2009; Ward et al., 2000). The MIDs were chosen at the point where the shoulder of
the ROC curve is closest to the upper left corner of the plot.
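The AUC and the choice of the MID from the ROC curve can be illustrated with a small, dependency-free sketch. It is not the code used in the study (the analyses were run in R and SAS). The data are made up, the function names are hypothetical, and the sketch assumes that a patient is predicted to be a responder when the score change is at or above the cutoff.

```python
def auc(responders, nonresponders):
    """AUC as the probability that a responder's score change exceeds a
    nonresponder's, counting ties as one half (Mann-Whitney form)."""
    wins = 0.0
    for r in responders:
        for n in nonresponders:
            if r > n:
                wins += 1.0
            elif r == n:
                wins += 0.5
    return wins / (len(responders) * len(nonresponders))


def roc_points(responders, nonresponders):
    """(cutoff, sensitivity, 1 - specificity) for every observed cutoff.

    A patient is predicted to be a responder if change >= cutoff.
    """
    points = []
    for cutoff in sorted(set(responders) | set(nonresponders)):
        sensitivity = sum(x >= cutoff for x in responders) / len(responders)
        false_pos = sum(x >= cutoff for x in nonresponders) / len(nonresponders)
        points.append((cutoff, sensitivity, false_pos))
    return points


def closest_to_upper_left(points):
    """ROC point with the smallest distance to the corner (0, 1)."""
    return min(points, key=lambda p: (1.0 - p[1]) ** 2 + p[2] ** 2)


# Made-up score changes (illustrative only).
responder_changes = [10, 5, 8, 1, 6]
nonresponder_changes = [0, 2, -2, 3]

print(auc(responder_changes, nonresponder_changes))  # 0.9
cutoff, sens, fpr = closest_to_upper_left(
    roc_points(responder_changes, nonresponder_changes)
)
print(cutoff, sens, fpr)  # 5 0.8 0.0
```

In this toy data set, 18 of the 20 responder-nonresponder pairs are ranked correctly, giving an AUC of 0.9, and a cutoff of 5 points would be taken as the MID.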
All statistical analyses were conducted with the statistical software package R version
2.12.2 and SAS version 9.2.
3.3. Results
In the complete data, there were 362 cataract patients and 150 heart failure patients.
The majority of the patients were white (91% for cataract and 83% for heart failure). The cataract
group was predominantly female (60%), and 68% of the group were 65 years or older. The
heart failure group was predominantly male (69%) and younger (78% aged less than 65 years).
3.3.1. Cataract Patients
The areas under the ROC curves are shown in Table 3.2. First, we compared
the overall responsiveness of the five generic instruments against the NEI-VFQ25
within each model. In Model 1, the NEI-VFQ25 was the most responsive instrument and the
HUI3 showed comparable performance. The SF-6D, EQ-5D, and QWB showed
significantly lower overall performance than the NEI-VFQ25 (Fig 3.1A). In Model 2,
where responders were defined as patients who experienced any change, improvement or
deterioration, the EQ-5D performed better than the NEI-VFQ25, and the
responsiveness of the SF-6D, HUI2, and HUI3 was not significantly different from that of the
disease-specific measure. With an objective anchor such as visual acuity in Model 3 (Fig 3.1B)
and Model 4, the SF-6D demonstrated significantly better responsiveness than the vision-specific
NEI-VFQ25.
Second, we tested the impact of using different responder definitions or different anchors
by comparing Models 2, 3, and 4 against Model 1. When different definitions of responders
were applied (Model 1 vs. Model 2), the overall performances of the NEI-VFQ25, HUI2, HUI3,
and QWB were significantly different. The performances with responders defined as patients
who experienced improvement were statistically better than those with responders defined as
patients who experienced any change. Comparisons of Models 3 and 4 to Model 1 were intended to
examine the impact of using different types of anchors on the overall performance of health
measures. Specifically, Model 1 used a subjective anchor, the self-comparative SRH, whereas
Models 3 and 4 tested objective anchors, monocular and binocular visual acuity, respectively.
The results showed that the performance of health measures with responders defined by Model
1 was better than with responders defined by Model 3 (Fig 3.1C) or Model 4. However,
the SF-6D was the only measure whose overall performance was not statistically different across all three
models at a significance level of 0.01 (Model 1 vs. Model 3: p = 0.60; Model 1 vs. Model 4: p =
0.02).
Table 3.2. ROC curve analysis in cataract cohort
Anchor                    VFQ-25          SF-6D           HUI2            HUI3            EQ-5D           QWB
M1: SRH with improvement
  AUC                     0.654           0.596           0.625           0.635           0.597           0.593
  (95% CI)                (0.634, 0.674)  (0.574, 0.618)  (0.605, 0.646)  (0.614, 0.656)  (0.577, 0.618)  (0.572, 0.614)
  Generic vs. NEI-VFQ25   n/a             z = 4.41***     z = 2.26**      z = 1.46        z = 4.33***     z = 4.72***
M2: SRH with any change
  AUC                     0.572           0.588           0.571           0.569           0.605           0.532
  (95% CI)                (0.552, 0.592)  (0.569, 0.607)  (0.552, 0.590)  (0.550, 0.589)  (0.587, 0.623)  (0.512, 0.551)
  Generic vs. NEI-VFQ25   n/a             z = -1.08       z = 0.08        z = 0.15        z = -2.29*      z = 2.58*
  M1 vs. M2               z = 5.71***     z = 0.53        z = 3.82**      z = 4.55***     z = -0.56       z = 4.27***
M3: Monocular vision
  AUC                     0.513           0.587           0.531           0.500           0.554           0.539
  (95% CI)                (0.488, 0.539)  (0.563, 0.612)  (0.506, 0.556)  (0.476, 0.525)  (0.529, 0.579)  (0.514, 0.563)
  Generic vs. NEI-VFQ25   n/a             z = -4.93***    z = -1.11       z = 0.64        z = -2.47*      z = -1.24
  M1 vs. M3               z = 8.49***     z = 0.52        z = 5.73***     z = 7.95***     z = 2.63**      z = 3.26**
M4: Binocular vision
  AUC                     0.527           0.562           0.503           0.531           0.508           0.507
  (95% CI)                (0.508, 0.537)  (0.543, 0.580)  (0.484, 0.522)  (0.531, 0.550)  (0.489, 0.527)  (0.488, 0.526)
  Generic vs. NEI-VFQ25   n/a             z = -2.94**     z = 2.02*       z = -0.28       z = 1.52        z = 1.58
  M1 vs. M4               z = 8.93***     z = 2.33*       z = 8.59***     z = 7.34***     z = 6.33***     z = 5.98***
* p < 0.05; ** p < 0.01; *** p < 0.0001
Figure 3.1. Comparisons of areas under the ROC curves. 1A: Comparisons of areas under the
ROC curves between the NEI-VFQ25 and the generic health instruments in Model 1; 1B:
Comparisons of areas under the ROC curves between the NEI-VFQ25 and the generic health
instruments in Model 3; 1C: Comparisons of areas under the ROC curves in each measure
between Model 1 and Model 3.
3.3.2. Heart Failure Patients
First, we compared the generic measures against the heart failure-specific
measure with respect to overall performance in Model 1, Model 2, and Model 5.
The results are shown in Table 3.3. Although the EQ-5D showed better
responsiveness than the MLHF in Model 1, the difference was not significant (p = 0.43). We
also found that all the other generic measures, including the SF-6D, QWB, HUI2, and HUI3,
had lower discriminative ability than the MLHF in terms of distinguishing heart failure
patients who perceived that their current health was better than one year ago from those whose
current perceived health was stable or worse (Fig 3.2A). In Model 2, the EQ-5D, SF-6D, and
HUI2 were more responsive than the MLHF. However, the area under the ROC curve of the
MLHF was 0.501, indicating that changes in the MLHF were no better than a random guess at
identifying patients who experienced a meaningful change. When responders were
defined as patients who achieved NT-proBNP levels less than 300 pg/ml (Model 5), the
EQ-5D was the most responsive measure, followed by the MLHF and the HUI2. The overall
performance of the two generic measures was not different from that of the MLHF
(Fig 3.2B).
Second, we compared the responsiveness of each instrument in Models 2 and 5 with that in
Model 1 (Table 3.3). Comparison of Model 2 with Model 1 showed that the
overall performance of all tested instruments was better in Model 1, although the
responsiveness of the HUI2 was not statistically different between the two models. That is, the
health measures were more responsive when responders were defined as patients who
experienced improvement than when responders were those who experienced any meaningful
change regardless of the direction of the change. We found similar results in the comparison of
Model 1 and Model 5. The overall performance of the measures was better in Model 1 than in
Model 5 (Fig 3.2C).
Table 3.3. ROC curve analysis in heart failure cohort
Standard                  MLHF            SF-6D           HUI2            HUI3            EQ-5D           QWB
M1: SRH with improvement
  AUC                     0.636           0.596           0.560           0.585           0.645           0.591
  (95% CI)                (0.616, 0.655)  (0.574, 0.617)  (0.539, 0.580)  (0.565, 0.605)  (0.625, 0.664)  (0.571, 0.612)
  Generic vs. MLHF        n/a             D = 2.65**      D = 5.97***     D = 3.77*       D = -0.79       D = 4.14***
M2: SRH with any change
  AUC                     0.501           0.548           0.540           0.501           0.606           0.493
  (95% CI)                (0.475, 0.526)  (0.526, 0.571)  (0.515, 0.564)  (0.477, 0.525)  (0.583, 0.630)  (0.470, 0.517)
  Generic vs. MLHF        n/a             Z = -2.25*      D = -2.65**     Z = -0.02       Z = -5.30***    Z = 0.351
  M1 vs. M2               D = 8.24***     D = 6.12***     D = 1.25        D = 5.20***     D = 2.45*       D = 6.20***
M5: NT-proBNP
  AUC                     0.566           0.506           0.537           0.515           0.577           0.512
  (95% CI)                (0.545, 0.586)  (0.476, 0.517)  (0.517, 0.558)  (0.494, 0.536)  (0.557, 0.597)  (0.492, 0.533)
  Generic vs. MLHF        n/a             Z = 3.86**      D = 2.33*       Z = 3.16**      Z = -0.63       D = 4.94***
  M1 vs. M5               D = 4.78***     D = 9.90***     D = 1.54        D = 4.72***     D = 4.69***     D = 5.37***
* p < 0.05; ** p < 0.01; *** p < 0.0001
Figure 3.2. Comparisons of areas under the ROC curves. 2A: Comparisons of areas under the
ROC curves between the MLHF and the generic health instruments in Model 1; 2B:
Comparisons of areas under the ROC curves between the MLHF and the generic health
instruments in Model 5; 2C: Comparisons of areas under the ROC curves in each measure
between Model 1 and Model 5.
3.4. Discussion
For an instrument measuring quality of life, its ability to discriminate meaningful change
is of primary importance. The overall performances of the generic and disease-specific
instruments were investigated using ROC curve analysis. We made a head-to-head
comparison of five generic measures (the SF-6D, HUI2, HUI3, QWB-SA, and EQ-5D) and two
disease-specific measures, the VFQ-25 for cataract patients and the MLHF for heart failure
patients. We set five different responder definitions. Using the self-comparative SRH, responders
were defined as patients experiencing important improvement in Model 1 and as patients
experiencing important change, either improvement or deterioration, in Model 2. In contrast to this
subjective anchor, Models 3, 4, and 5 used objective anchors. In Models 3 and 4, which were for
the cataract group, responders were defined based on clinical anchors, namely monocular and
binocular visual acuity. For heart failure, the biomarker NT-proBNP was applied as a clinical
anchor. With these five models, we tested whether there were differences in
responsiveness between generic health measures and disease-specific measures, and whether
responsiveness was affected by the responder definition, either through different classification
rules applied to the same anchor or through the choice of a subjective or an objective anchor. In
the cataract cohort, the NEI-VFQ25 and HUI3 were more responsive than the other generic
measures when the responder was defined based on the subjective self-comparative SRH
measure. However, when the responder was defined based on objective visual acuity, the
overall discriminative ability of the SF-6D was better than that of the vision-specific measure. In
the heart failure cohort, the responsiveness of the MLHF and EQ-5D was better than that of
the other generic measures when responders were defined as patients who experienced meaningful
improvement. However, with responders defined as patients whose NT-proBNP levels were
less than 300 pg/ml, the EQ-5D was significantly more responsive than the heart failure-specific
measure and the other generic measures. In both the cataract and heart failure groups, the
disease-specific measures were not always more responsive to changes in health outcomes than generic
measures. Also, we found that in the heart failure cohort, the overall performance of the HUI2
was less affected by the responder definition, because its performance did not differ
significantly across Models 1, 2, and 5.
In the cataract sample, the overall performance of the NEI-VFQ25 was the highest in Model 1, but its
performance was comparable to that of the HUI3, followed by the HUI2. In
the earlier study (Kaplan et al., 2011), the NEI-VFQ25 was more responsive than the generic
measures and the HUI2 and HUI3 were more responsive than the other generic measures.
Although we did not find statistical differences in the overall performances, the trend of the
responsiveness in the current study was consistent with the previous study. This implies that the
impact of vision is not limited to visual function but also extends to daily living (Espallargues et al., 2005). Moreover,
the HUI2 has a sensory domain and the HUI3 includes a vision dimension, so both can
sufficiently take visual impairment and improvement after the surgery into account. Therefore,
these generic instruments could cover some aspects of quality of life affected by cataracts, which
resulted in responsiveness comparable to that of the vision-specific measure. With the responder
definition based on visual acuity, the disease-specific NEI-VFQ25 was less responsive than the
HUI2, EQ-5D, or SF-6D. This might be attributable to the time points that we used for the
analysis: the benefit of cataract surgery is usually immediate, whereas we used the scores
before the surgery and 6 months after the surgery (Margolis et al., 2002). Therefore, the impact
of the surgery may have been diluted, which subsequently led to the low performance of the NEI-VFQ25.
In the heart failure cohort, the greater responsiveness of the MLHF and EQ-5D was
consistent across the three models tested. However, the responsiveness of the measures was
poorer under the responder definition with the clinical anchor than with the subjective anchor,
which might be owing to the characteristics of the patients. That is, the heart failure patients in
this study were recruited in an outpatient setting and patients’ physical conditions were likely to be
stable. Thus, the clinical anchor might be less discriminative in defining the responders and
nonresponders. Although heart failure affects daily living, the improvement from heart failure
treatment in this study was probably too small to detect (Kaplan et al., 2011).
The results of the present study must be interpreted in light of limitations with the data.
First, there are possible sources of error. Random errors might have affected the
classification of the responders. In other words, if random noise is introduced, some false values
(false positives and false negatives) might be classified as true values. Also, because both the
SRH anchor and the PRO instruments are subjective, they could be interdependent,
which might affect the ROC curve. Second, the anchor question, a global rating of change, was
quite general. The question asked about general health in comparison with health one year ago
rather than in comparison with health after the treatment. It is not clear whether the lack of a
specific reference would affect the responsiveness, but it is something to consider in further
studies. In addition, there was a 6-month gap between the time point that the anchor referred to and
the actual time point used in the analysis. The anchor question (SF36-Q2) at 6 months asked an
individual to compare one’s current health with “one year ago”. Thus, people would compare
their health now with one year ago, while we compared PRO scores at 6 months with baseline,
which yielded a 6-month gap between one-year-ago and baseline. However, we assumed that the
impact of this time gap would be small in both samples. Since previous work has shown that the
PRO scores of cataract patients changed significantly between baseline and 1 month because of
the surgery, and were stable after 1 month (Kaplan et al., 2011), we could assume that there
would not be substantial changes in patients’ health conditions before the surgery. For the heart
failure cohort, although patients’ health outcomes improved gradually, the patients were newly
referred to congestive heart failure clinics and those with severe cardiac conditions were
excluded. Therefore, we assumed that the included patients did not experience any severe events
before enrollment.
In conclusion, we used ROC analyses to assess the overall performance of the SF-6D,
HUI2, HUI3, EQ-5D, and QWB, together with the VFQ-25 for cataract patients and the MLHF for
heart failure patients. In cataract patients, there was no single best instrument, whereas the greater
responsiveness of the MLHF and EQ-5D was consistent across the tested models in the heart failure
patients. Some generic measures could be used alone without the risk of losing disease-specific
information while still maintaining the merits of generic measures. The responsiveness of a health
instrument is influenced by the external anchor used to define responders, the content of the
measure, and the study population.
3.5. References
Alonso, J., Ferrer, M., Gandek, B., Ware, J.E., Aaronson, N.K., Mosconi, P., Rasmussen,
N.K., Bullinger, M., Fukuhara, S., Kaasa, S., et al. (2004). Health-related quality of life
associated with chronic conditions in eight countries: results from the International
Quality of Life Assessment (IQOLA) Project. Qual Life Res 13, 283-298.
Banerjee, S., Samsi, K., Petrie, C.D., Alvir, J., Treglia, M., Schwam, E.M., and del Valle, M.
(2009). What do we know about quality of life in dementia? A review of the emerging
evidence on the predictive and explanatory value of disease specific measures of health
related quality of life in people with dementia. Int J Geriatr Psychiatry 24, 15-24.
Bennett, S.J., Oldridge, N.B., Eckert, G.J., Embree, J.L., Browning, S., Hou, N., Deer, M.,
and Murray, M.D. (2002). Discriminant properties of commonly used quality of life
measures in heart failure. Qual Life Res 11, 349-359.
Brazier, J., Roberts, J., and Deverill, M. (2002). The estimation of a preference-based
measure of health from the SF-36. J Health Econ 21, 271-292.
Clemons, T.E., Chew, E.Y., Bressler, S.B., McBee, W., and AREDS Group
(2003). National Eye Institute Visual Function Questionnaire in the Age-Related Eye
Disease Study (AREDS): AREDS Report No. 10. Arch Ophthalmol 121, 211-217.
de Boer, Y.A., Hazes, J.M., Winia, P.C., Brand, R., and Rozing, P.M. (2001).
Comparative responsiveness of four elbow scoring instruments in patients with
rheumatoid arthritis. J Rheumatol 28, 2616-2623.
Espallargues, M., Czoski-Murray, C.J., Bansback, N.J., Carlton, J., Lewis, G.M., Hughes,
L.A., Brand, C.S., and Brazier, J.E. (2005). The impact of age-related macular
degeneration on health status utility values. Invest Ophthalmol Vis Sci 46, 4016-4023.
Farrar, J.T., Pritchett, Y.L., Robinson, M., Prakash, A., and Chappell, A. (2010). The
clinical importance of changes in the 0 to 10 numeric rating scale for worst, least, and
average pain intensity: analyses of data from clinical trials of duloxetine in pain disorders.
J Pain 11, 109-118.
Feeny, D., Furlong, W., Torrance, G.W., Goldsmith, C.H., Zhu, Z., DePauw, S., Denton,
M., and Boyle, M. (2002). Multiattribute and single-attribute utility functions for the
health utilities index mark 3 system. Med Care 40, 113-128.
Feeny, D., Spritzer, K., Hays, R.D., Liu, H., Ganiats, T.G., Kaplan, R.M., Palta, M., and
Fryback, D.G. (2011). Agreement about Identifying Patients Who Change over Time:
Cautionary Results in Cataract and Heart Failure Patients. Med Decis Making.
Guyatt, G., Walter, S., and Norman, G. (1987). Measuring change over time: assessing
the usefulness of evaluative instruments. J Chronic Dis 40, 171-178.
Hanley, J.A. (1989). Receiver operating characteristic (ROC) methodology: the state of
the art. Crit Rev Diagn Imaging 29, 307-335.
Hanley, J.A., and McNeil, B.J. (1983). A Method of Comparing the Areas under
Receiver Operating Characteristic Curves Derived from the Same Cases. Radiology 148,
839-843.
Javitt, J.C., and Steinert, R.F. (2000). Cataract extraction with multifocal intraocular lens
implantation: a multinational clinical trial evaluating clinical, functional, and quality-of-
life outcomes. Ophthalmology 107, 2040-2048.
Kaplan, R.M., Seiber, W.J., and Ganiats, T.G. (1997). The Quality of Well-being Scale:
comparison of the interviewer-administered version with a self-administered
questionnaire. Psychology & Health 12, 783-791.
Kaplan, R.M., Tally, S., Hays, R.D., Feeny, D., Ganiats, T.G., Palta, M., and Fryback,
D.G. (2011). Five preference-based indexes in cataract and heart failure patients were not
equally responsive to change. J Clin Epidemiol 64, 497-506.
Lee, J.E., Fos, P.J., Zuniga, M.A., Kastl, P.R., and Sung, J.H. (2000). Assessing health
related quality of life in cataract patients: the relationship between utility and health-
related quality of life measurement. Qual Life Res 9, 1127-1135.
Mangione, C.M., Lee, P.P., Gutierrez, P.R., Spritzer, K., Berry, S., Hays, R.D., and NEI
VFQ Field test Investigators, (2001). Development of the 25-item National Eye Institute
Visual Function Questionnaire. Arch Ophthalmol 119, 1050-1058.
Mangione, C.M., Lee, P.P., Pitts, J., Gutierrez, P., Berry, S., and Hays, R.D. (1998).
Psychometric properties of the National Eye Institute Visual Function Questionnaire
(NEI-VFQ). NEI-VFQ Field Test Investigators. Arch Ophthalmol 116, 1496-1504.
Margolis, M.K., Coyne, K., Kennedy-Martin, T., Baker, T., Schein, O., and Revicki, D.A. (2002).
Vision-specific instruments for the assessment of health-related quality of life and visual
functioning: a literature review. Pharmacoeconomics 20, 791-812.
Persell, S., Dolan, N., Friesema, E., Thompson, J., Kaiser, D., and Baker, D. (2010).
Frequency of inappropriate medical exceptions to quality measures. Ann Intern Med 152,
225-231.
Pressler, S.J., Eckert, G.J., Morrison, G.C., Murray, M.D., and Oldridge, N.B. (2011).
Evaluation of the Health Utilities Index Mark-3 in heart failure. J Card Fail 17, 143-150.
Rabin, R., and de Charro, F. (2001). EQ-5D: a measure of health status from the EuroQol
Group. Ann Med 33, 337-343.
Rector, T.S. (2005). A conceptual model of quality of life in relation to heart failure. J
Card Fail 11, 173-176.
Rector, T.S., and Cohn, J.N. (1992). Assessment of patient outcome with the Minnesota
Living with Heart Failure questionnaire: reliability and validity during a randomized,
double-blind, placebo-controlled trial of pimobendan. Pimobendan Multicenter Research
Group. Am Heart J 124, 1017-1025.
Revicki, D., Hays, R.D., Cella, D., and Sloan, J. (2008). Recommended methods for
determining responsiveness and minimally important differences for patient-reported
outcomes. J Clin Epidemiol 61, 102-109.
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.C., and Müller, M.
(2011). pROC: an open-source package for R and S+ to analyze and compare ROC
curves. BMC Bioinformatics 12, 77.
Rosen, P.N., Kaplan, R.M., and David, K. (2005). Measuring outcomes of cataract
surgery using the Quality of Well-Being Scale and VF-14 Visual Function Index. J
Cataract Refract Surg 31, 369-378.
Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys (New York: Wiley).
Sandoval, H.P., Fernández de Castro, L.E., Vroman, D.T., and Solomon, K.D. (2008).
Comparison of visual outcomes, photopic contrast sensitivity, wavefront analysis, and
patient satisfaction following cataract extraction and IOL implantation: aspheric vs
spherical acrylic lenses. Eye (Lond) 22, 1469-1475.
Schafer, J.L., and Graham, J.W. (2002). Missing data: our view of the state of the art.
Psychol Methods 7, 147-177.
Schünemann, H.J., and Guyatt, G.H. (2005). Commentary--goodbye M(C)ID! Hello MID,
where do you come from? Health Serv Res 40, 593-597.
Shaw, J.W., Johnson, J.A., and Coons, S.J. (2005). US valuation of the EQ-5D health
states: development and testing of the D1 valuation model. Med Care 43, 203-220.
The EuroQol Group (1990). EuroQol--a new facility for the measurement of health
related quality of life. The EuroQol Group. Health Policy 16, 199-208.
Torrance, G.W., Feeny, D.H., Furlong, W.J., Barr, R.D., Zhang, Y., and Wang, Q. (1996).
Multiattribute utility function for a comprehensive health status classification system.
Health Utilities Index Mark 2. Med Care 34, 702-722.
Turner, D., Schünemann, H.J., Griffith, L.E., Beaton, D.E., Griffiths, A.M., Critch, J.N.,
and Guyatt, G.H. (2009). Using the entire cohort in the receiver operating characteristic
analysis maximizes precision of the minimal important difference. J Clin Epidemiol 62,
374-379.
US Food and Drug Administration, (1999). FDA Intraocular Lens Guidelines.
US Food and Drug Administration (2009). Guidance for Industry: Patient-Reported
Outcome Measures: Use in Medical Product Development to Support Labeling Claims.
Walsh, T.L., Hanscom, B., Lurie, J.D., and Weinstein, J.N. (2003). Is a condition-specific
instrument for patients with low back pain/leg symptoms really necessary? The
responsiveness of the Oswestry Disability Index, MODEMS, and the SF-36. Spine (Phila
Pa 1976) 28, 607-615.
Ward, M.M., Marx, A.S., and Barry, N.N. (2000). Identification of clinically important
changes in health status using receiver operating characteristic curves. J Clin Epidemiol 53,
279-284.
Ware, J.E., and Sherbourne, C.D. (1992). The MOS 36-item short-form health survey
(SF-36). Conceptual framework and item selection. Med Care 30, 473-483.
West, S.K., Munoz, B., Rubin, G.S., Schein, O.D., Bandeen-Roche, K., Zeger, S.,
German, S., and Fried, L.P. (1997). Function and visual impairment in a population-
based study of older adults. The SEE project. Salisbury Eye Evaluation. Invest
Ophthalmol Vis Sci 38, 72-82.
Wiebe, S., Guyatt, G., Weaver, B., Matijevic, S., and Sidwell, C. (2003). Comparative
responsiveness of generic and specific quality-of-life instruments. J Clin Epidemiol 56,
52-60.
World Health Organization, (1998). Informal consultation on analysis of blindness
prevention outcomes (Geneva: WHO).
Wyrwich, K.W., and Wolinsky, F.D. (2000). Identifying meaningful intra-individual
change standards for health-related quality of life measures. J Eval Clin Pract 6, 39-49.
CHAPTER 4. CHOICE OF SELF-RATED HEALTH MEASURES
AND MORTALITY
Abstract
Objective: Self-rated health (SRH) has been shown to be a good predictor of mortality. However,
there are mixed findings of the association between mortality and SRH measures with different
reference points (i.e., with respect to either global or peer age group). This study evaluated
whether SRH measures with different reference frames influenced the association between SRH
and mortality, and whether there are differential patterns in the association between SRH measures
and mortality by age.
Method: We analyzed data from 2000-2003 Medical Expenditure Panel Survey (MEPS)
respondents in Panel 5-7, aged 45 or over, linked to the National Death Index (NDI) through
2006. Two SRH measures (global and age-comparative SRHs) were applied separately and
concurrently in the analyses. A Cox proportional hazards model, adjusted for demographic and
social characteristics, was used to test mortality prediction by the SRH measures. The areas under
the receiver operating characteristic (ROC) curves were also compared to examine the
contribution of the SRH measures to mortality discrimination.
Result: A total of 12,432 respondents representing yearly weighted population counts of
78,585,942 US residents were included in the analysis. More respondents rated their health as
excellent or very good on the age-comparative SRH measure than on the global SRH measure (p
< 0.0001). In the adjusted models, ‘fair’ and ‘poor’ ratings of both measures were strong
predictors of mortality, and the mortality discrimination of both models was comparable. The
predictive power of ‘poor’ and ‘fair’ ratings weakened in the concurrent model, but they were
still significant. Although the domain analysis by age found that the mortality discrimination of
the age-comparative SRH measure was better in the younger age group (aged 45-54) and that of
the global SRH was better in the older age group (aged 75-84), the differences were not statistically significant.
Conclusion: In this study, we found graded associations between the SRH measures and
mortality regardless of the reference points. Both global and age-comparative SRH measures
demonstrated comparable predictive and discriminative abilities on mortality, even though
respondents rated their health differently depending on the SRH measures used.
4.1. Introduction
Self-rated health (SRH) is a widely used measure of health status in the public health
and social science fields. It is a simple and inexpensive measure and, more importantly, provides
summative information on health, where health is a multidimensional concept including
physical, mental, and social well-being. Thus, many researchers have studied whether
this subjective health measure is related to health outcomes such as health care
utilization, morbidity, and mortality, and have found strong relationships between these outcomes
and the SRH measure (Fylkesnes, 1993; Idler and Benyamini, 1997; Kaplan et al., 1996).
Although it is still not clearly understood why SRH predicts mortality, previous studies
found that there was a graded association between SRH and mortality, i.e. worse SRH was
associated with increased risk of mortality. This association was usually attenuated when other
factors such as age, sex, comorbidities, or health behaviors were controlled for (Greiner et al.,
1999; Hays et al., 1996; Idler et al., 2000; LaRue et al., 1979; Mossey and Shapiro, 1982;
Thomas et al., 1992). In a meta-analysis by DeSalvo et al., it was also shown that question
wording affected the relationship between SRH and mortality (DeSalvo et al., 2006). The
relationship was stronger when an SRH question asked respondents to rate their health compared
to their age peers than when the SRH question was phrased without any frame of reference. The
latter, which is a global SRH measure, is most commonly used (e.g., how would you rate your
health in general / at the present time?). The former is an age-comparative (or age-referential)
SRH that uses a comparative reference point (e.g., compared to other people in your age, how
would you rate your health?).
Most of the previous studies used only one SRH measure to evaluate predictive ability of
SRH on mortality, and there are few studies that tested the influence of different SRH measures
on the association between SRH and mortality. In a 5-year follow-up study among people aged 77
and over, Manderbacka et al. (2003) compared global and age-comparative
SRH measures with 3-point rating scales. Age-comparative SRH was a better predictor of
mortality for males than females both in non-adjusted models and in models where age and both
SRH measures were included. In contrast, Vuorisalmi et al. found in their 5-, 10-, and 20-year
follow-up study in people aged 60 to 89 that only global SRH was associated with mortality in
non-adjusted models but both measures predicted mortality after adjusting for age and other
social and health indicators (Vuorisalmi et al., 2005). In a study that compared global, age-
comparative, and self-comparative (which compares current health to previous health) SRH
measures with 10-year follow-up data in older adults aged 65 and over, global SRH was a better
predictor of mortality than other comparative SRH measures in adjusted models. In the last two
studies, global SRH measures used 5-point scales while age-comparative SRH measures used 3-
point scales.
These mixed findings in the association between mortality and different SRH measures
warrant further investigation. It is also not clear whether using different rating
scales has an effect on the association. Since the number of category points is a critical factor in
rating judgments (Dalal, 1987; Parducci, 1965), the predictive quality of SRH measures on
mortality might be overestimated or underestimated if the number of response categories in the
SRH measures tested was not large enough to observe the graded association of SRH and mortality
or if the SRH measures compared did not have the same number of choices on the rating scales.
Therefore, the aim of this study is to fill the gaps within the literature regarding the predictive
power of SRH measures on mortality. We will assess whether SRH measures with different
reference frames influence the association of SRH and mortality in old age. In addition to SRH
measures at a single time point, we will examine whether change in SRH measures is a better
predictor of mortality.
4.2. Methods
4.2.1. Data
The Medical Expenditure Panel Survey (MEPS) is an annual survey of health care
utilization, expenditure, source of payment, and insurance coverage in the United States (US)
civilian, noninstitutionalized population, using an overlapping panel design (AHRQ, 2013).
Individual data were collected through 5 in-person interviews over 30 months, covering two calendar
years. The rounds of interviews were spaced about 5 to 6 months apart. In the MEPS Household
Component, information on social/demographic characteristics, health care features, health
insurance, and health care expenditures is included. A self-administered questionnaire in both
years includes items on chronic health conditions and health status. In this study, three panels,
Panel 5, 6, and 7 (2000-2003) were used.
The National Health Interview Survey (NHIS) is also an annual survey, conducted by the
National Center for Health Statistics, Centers for Disease Control and Prevention, to obtain
national estimates of health care utilization, health conditions, insurance coverage, and access
(CDC/National Center for Health Statistics, 2013). The MEPS Household Component sample is a
subsample of households included in the previous year’s NHIS. The NHIS is linked to death
certificate data in the National Death Index (NDI), a central computerized index of US death
record information on file in the State vital statistics offices, and this, in turn, permits linkage to
the MEPS. The NHIS-linked mortality files have been completed for survey years 1986 through
2004. The updated NHIS-linked mortality files provide mortality follow-up data from the date of
NHIS interview through December 31, 2006. Mortality is ascertained by a probabilistic match
between the NHIS and NDI death certificate records. In this study, we used the public-use version of the
NHIS-linked mortality file from 2000 to 2006, which includes a limited set of mortality variables
for adult NHIS participants.
In this study, we included respondents aged 45 years or older in their first year of
participation in the MEPS who provided data on both versions of the SRH in both years and for
whom mortality information was available. Although our respondents of interest are those aged
45 years or older, it is not advisable to create an analytic subfile containing only respondents
in the subdomain of interest. Doing so may produce incorrect standard errors because all of the
observations corresponding to a stage of the MEPS sample design may be deleted. Thus, in this
study, we preserved the entire MEPS design structure, which includes respondents aged 18 years
or older, and analyzed the subgroup of respondents aged 45 years or older within the
entire data set.
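The reason for keeping the full design can be illustrated with a small sketch. It is not the SAS procedure used in this study; it is a simplified with-replacement linearization estimator for a weighted domain mean, and the record layout, function name, and toy data are hypothetical. A primary sampling unit (PSU) with no domain members still counts toward the number of PSUs in its stratum when the full file is kept, but disappears when the file is subset first.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (stratum, psu, weight, in_domain, y): hypothetical record layout for this sketch
Record = Tuple[str, str, float, bool, float]


def domain_mean_and_variance(records: Iterable[Record]) -> Tuple[float, float]:
    """Weighted domain mean and its linearized, design-based variance."""
    records = list(records)
    weight_sum = sum(w for _, _, w, d, _ in records if d)
    if weight_sum == 0:
        raise ValueError("no domain members")
    mean = sum(w * y for _, _, w, d, y in records if d) / weight_sum

    # PSU totals of the linearized variable; a PSU with no domain members
    # still appears here, with a total of zero.
    psu_totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, d, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / weight_sum if d else 0.0

    variance = 0.0
    for totals in psu_totals.values():
        n_psu = len(totals)
        if n_psu < 2:
            continue  # simplification: single-PSU strata contribute nothing
        centre = sum(totals.values()) / n_psu
        variance += n_psu / (n_psu - 1) * sum((z - centre) ** 2 for z in totals.values())
    return mean, variance


# Toy data (assumed): one stratum, three PSUs, the third with nobody in the domain.
sample = [
    ("A", "p1", 1.0, True, 1.0),
    ("A", "p1", 1.0, False, 0.0),
    ("A", "p2", 1.0, True, 3.0),
    ("A", "p3", 1.0, False, 5.0),
]
full_mean, full_var = domain_mean_and_variance(sample)
sub_mean, sub_var = domain_mean_and_variance([r for r in sample if r[3]])
print(full_mean, full_var, sub_mean, sub_var)
```

In this toy example the point estimate is identical under both approaches, but the variance differs because the subset file drops a PSU from the stratum.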
4.2.2. Measures
4.2.2.1. Self-Rated Health (SRH)
Two SRH measures were examined in this study: the global SRH was measured with the
question “In general, how would you rate your health?” and the age-comparative SRH asked “In
general, compared to other people in your age, how would you rate your health?”. Both SRH
measures used a 5-point rating scale where 1 indicated ‘excellent’ and 5 indicated ‘poor’. Both
SRH measures were collected in two consecutive years.
4.2.2.2. Demographic and social variables
The demographic and social characteristics tested were age, sex, race/ethnicity (White,
Black, or others), US Census region (Northeast, West, Midwest, or South), education (high
school or less, some college, and college degree or more), household income level based on the
federal poverty level (poor, near poor, low income, middle income, and high income indicating
<100%, 100-124%, 125-199%, 200-399%, or ≥400% of the poverty level, respectively), and
health insurance status (uninsured, any private insurance, or public insurance only).
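As a minimal sketch of the income grouping described above, the following function maps income, expressed as a percentage of the federal poverty level, to the five categories. The function name is hypothetical, and treating non-integer values between the stated bands (for example 124.5%) as belonging to the lower band is an assumption made here, not a rule stated in the source.

```python
def income_category(pct_of_poverty_level: float) -> str:
    """Map income as a percentage of the federal poverty level to a category.

    Cut points follow the text: <100%, 100-124%, 125-199%, 200-399%, >=400%.
    Fractional values between bands fall into the lower band (an assumption).
    """
    if pct_of_poverty_level < 100:
        return "poor"
    if pct_of_poverty_level < 125:
        return "near poor"
    if pct_of_poverty_level < 200:
        return "low income"
    if pct_of_poverty_level < 400:
        return "middle income"
    return "high income"


print([income_category(p) for p in (80, 110, 150, 250, 500)])
```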
4.2.2.4. Outcomes
Mortality was assessed through the NDI from 2000 to the end of December 2006.
Survival was measured in quarters from the time of the health measure self-assessment until the
time of death; respondents who were alive at the end of December 2006 were considered censored.
We used mortality information by individual year.
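The follow-up definition above can be sketched as follows. This is an illustration only: it assumes calendar dates are available for the interview and for death, and the function and constant names are hypothetical rather than taken from the MEPS or NHIS files.

```python
from datetime import date
from typing import Optional, Tuple

CENSOR_DATE = date(2006, 12, 31)  # end of mortality follow-up stated in the text


def quarter_index(d: date) -> int:
    """Number calendar quarters consecutively so that differences count quarters."""
    return d.year * 4 + (d.month - 1) // 3


def follow_up(interview: date, death: Optional[date]) -> Tuple[int, bool]:
    """Return (quarters of follow-up, died) for one respondent.

    A respondent with no death record, or a death after the censoring date,
    is treated as censored at the end of December 2006.
    """
    died = death is not None and death <= CENSOR_DATE
    end = death if died else CENSOR_DATE
    if end < interview:
        raise ValueError("follow-up ends before the interview")
    return quarter_index(end) - quarter_index(interview), died


print(follow_up(date(2001, 2, 15), date(2003, 8, 1)))
print(follow_up(date(2002, 5, 1), None))
```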
4.2.3. Data Analysis
Cox proportional hazard regression models were developed to evaluate the effect of
covariates, including the global and age-comparative SRH measures, on mortality. The Cox
proportional hazard model is the most commonly applied model in medical time-to-event studies.
The model does not make any assumption about the shape of the underlying hazard, which is the
baseline mortality risk in this study. However, it assumes that the hazard for any individual is a
fixed proportion of the hazard for any other individual. The number of years from the interview
until death or censoring was the measure of time used in the models.
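For reference, the standard form of the Cox model makes the proportionality assumption explicit. The notation below is ours and is not taken from the source: $h_0(t)$ is the unspecified baseline hazard, $\mathbf{x}_i$ the covariate vector of respondent $i$, and $\boldsymbol{\beta}$ the coefficient vector.

```latex
h(t \mid \mathbf{x}_i) = h_0(t)\,\exp\!\bigl(\boldsymbol{\beta}^{\top}\mathbf{x}_i\bigr),
\qquad
\frac{h(t \mid \mathbf{x}_i)}{h(t \mid \mathbf{x}_j)}
  = \exp\!\bigl(\boldsymbol{\beta}^{\top}(\mathbf{x}_i - \mathbf{x}_j)\bigr).
```

The ratio does not depend on $t$, which is the proportional hazards assumption. For an SRH category entered as an indicator variable with coefficient $\beta_k$, the hazard ratio relative to the reference category (excellent) is $\exp(\beta_k)$.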
To analyze the associations between SRH measures and mortality, we first tested five
models: an unadjusted model with the global SRH, an unadjusted model with the age-comparative
SRH, a model with the global SRH adjusted for demographic and social variables, an adjusted
model with the age-comparative SRH, and an adjusted model with the global and age-comparative
SRHs together. In addition, we performed domain analysis by age group for each SRH measure.
We also evaluated mortality discrimination using the area under the curve (AUC), calculated
as the c-statistic.
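One common way to compute a c-statistic for censored survival data is Harrell's concordance index, sketched below. This is an assumption about the general technique, not a description of the exact SAS computation used here; the sketch ignores survey weights and tied survival times, and the function name and toy data are hypothetical.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[bool], risk_scores: Sequence[float]
) -> float:
    """Harrell's C: share of comparable pairs in which the respondent who
    died earlier also had the higher risk score (ties in score count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a respondent observed to die
        for j in range(n):
            if times[i] < times[j]:  # i died before j died or was censored
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (assumed): SRH coded 1 (excellent) to 5 (poor) used as the risk score.
times = [2, 4, 6, 8]
events = [True, True, False, True]
srh = [5, 4, 2, 1]
print(concordance_index(times, events, srh))
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination, which is how the AUC values reported in this chapter can be read.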
Analyses were conducted using SAS 9.1 (SAS Institute, Cary, North Carolina),
adjusting for the complex survey design of the MEPS. To obtain accurate estimates from the
complex MEPS data, the estimation weight, sampling strata, and primary sampling unit,
which jointly reflect the MEPS survey design, were applied.
4.3. Results
There were 31,177 eligible adults, aged 18 years and over, who participated in the MEPS
for two consecutive years between 2000 and 2003. Complete data, including the global and
age-comparative SRH measures and mortality information, were available for 24,651 persons. The
analytical sample, which included those aged 45 years or older, contained 12,432 respondents,
representing a yearly weighted population count of 78,585,942 US residents. Of these, 984
respondents died during follow-up. The demographic and social characteristics of the analytic
sample are shown in Table 4.1.
Table 4.1. Demographic and social characteristics of the sample
                                 Panel 5         Panel 6         Panel 7         Significance
                                 (N = 2,713)     (N = 5,673)     (N = 4,046)
Age (SE)
  45-54                          39.77 (1.30)    39.69 (0.89)    38.68 (1.02)    p = 0.87
  55-64                          24.75 (1.12)    25.88 (0.74)    26.23 (0.86)
  65-74                          19.22 (0.99)    18.72 (0.66)    18.28 (0.79)
  75-84                          12.73 (0.86)    12.77 (0.56)    13.53 (0.74)
  ≥ 85                           3.53 (0.37)     2.94 (0.29)     3.29 (0.37)
Race (SE)
  White                          88.18 (1.01)    87.23 (0.69)    86.66 (0.84)    p = 0.04
  Black                          9.65 (0.92)     9.18 (0.52)     9.23 (0.74)
  Others                         2.17 (0.32)     3.59 (0.44)     4.12 (0.40)
Sex (SE)
  Male                           45.37 (0.75)    46.06 (0.57)    46.02 (0.61)    p = 0.70
Education (SE)
  < High school                  21.25 (0.91)    21.36 (0.70)    21.51 (0.78)    p = 0.83
  High school graduate           35.07 (1.14)    34.86 (0.88)    33.54 (0.91)
  ≥ College                      43.67 (1.33)    43.78 (0.92)    44.95 (1.12)
Region
  Northeast                      19.63 (1.42)    19.42 (1.04)    19.48 (1.93)    p = 0.99
  Midwest                        24.75 (0.76)    24.03 (1.11)    23.50 (2.03)
  South                          36.57 (1.97)    36.25 (1.33)    36.37 (2.34)
  West                           19.05 (2.61)    20.30 (1.34)    20.65 (1.96)
Income
  Poor                           9.55 (0.60)     9.62 (0.56)     9.08 (0.57)     p = 0.78
  Near Poor                      3.88 (0.44)     3.95 (0.35)     3.47 (0.34)
  Low income                     11.46 (0.80)    13.24 (0.62)    12.31 (0.66)
  Middle income                  28.40 (1.24)    27.33 (0.78)    28.65 (1.14)
  High income                    46.70 (1.33)    45.85 (1.10)    46.49 (1.22)
Insurance Status
  Any private                    73.40 (1.19)    74.23 (0.90)    74.79 (0.92)    p = 0.78
  Public only                    19.51 (0.95)    18.66 (0.74)    17.97 (0.85)
  Uninsured                      7.09 (0.61)     7.11 (0.43)     7.24 (0.46)
* Values are population-weighted percentages and their standard errors (SEs).
There were no statistically significant differences in age, sex, education level, region, income
level, or insurance status across the three panels, but race/ethnicity differed (p = 0.04). We found
a distributional difference between the global and age-comparative SRH measures (p < 0.0001).
More respondents evaluated their health as excellent with the age-comparative SRH than with
the global SRH (Table 4.2). When the distributions of the two SRH measures were considered by
age group, the differences were even more substantial, particularly in the older groups. In the
age groups of 65-74, 75-84, and 85 and over, 16.1%, 13.6%, and 16.8% of respondents, respectively,
evaluated their health as excellent compared to their age peers, while 8.5%, 6.3%, and 5.3%,
respectively, rated their health as excellent on the global measure.
Table 4.2. Distribution of global and age-comparative SRH measures by age groups
                       Total           45-54           55-64           65-74           75-84           85+
Global SRH
  Excellent            12.18 (0.36)    15.31 (0.61)    13.93 (0.80)    8.49 (0.68)     6.27 (0.81)     5.33 (1.29)
  Very good            33.11 (0.63)    38.73 (0.82)    34.31 (1.18)    28.09 (1.13)    22.85 (1.21)    19.98 (2.09)
  Good                 34.23 (0.54)    30.87 (0.78)    32.44 (1.05)    39.60 (1.09)    38.91 (1.38)    39.31 (2.57)
  Fair                 16.18 (0.39)    11.92 (0.51)    14.70 (0.80)    18.42 (0.81)    26.20 (1.26)    26.45 (2.35)
  Poor                 4.30 (0.22)     3.18 (0.29)     4.63 (0.42)     4.40 (0.51)     5.78 (0.70)     8.93 (1.54)
Age-comparative SRH
  Excellent            19.70 (0.53)    23.54 (0.84)    19.93 (0.98)    16.05 (0.92)    13.59 (1.03)    16.78 (2.71)
  Very good            33.14 (0.59)    34.89 (0.83)    34.33 (1.10)    33.42 (1.14)    27.04 (1.44)    25.33 (2.69)
  Good                 29.80 (0.58)    28.53 (0.87)    28.07 (0.96)    30.29 (1.09)    35.45 (1.47)    33.54 (2.56)
  Fair                 12.86 (0.36)    9.56 (0.44)     12.49 (0.73)    15.42 (0.90)    18.17 (1.12)    19.90 (2.33)
  Poor                 4.50 (0.18)     3.48 (0.27)     5.17 (0.41)     4.83 (0.53)     5.75 (0.55)     4.46 (0.89)
* Values are population-weighted percentages and their standard errors (SEs), by age group.
Table 4.3 reports the hazard ratios for mortality in relation to the global and age-comparative SRH
measures. When the two SRH measures were tested separately, they were equally strong
predictors of mortality in both the unadjusted and adjusted models. In the unadjusted models, good,
fair, and poor SRH ratings were associated with increased mortality risks. However, the
association between SRH measures and mortality diminished when the models were adjusted for
sociodemographic variables. Fair and poor SRH ratings remained associated with increased risks of
mortality in both measures. Poor SRH ratings were the strongest predictors: in the adjusted
models, a poor global rating was associated with a 4.60-fold higher mortality risk and a poor
age-comparative rating with a 4.09-fold higher risk, compared to their respective excellent
ratings (Table 4.3). When the two measures were analyzed concurrently in relation to mortality
risk, poor and fair ratings of both SRH measures still predicted mortality, but their associations
were weaker. The discriminative abilities of the SRH measures were comparable in the adjusted
models (Table 4.3).
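As a reading aid for the intervals in Table 4.3, a reported hazard ratio and its 95% CI can be converted to the log scale on which the Cox model is estimated. The sketch below assumes the interval is symmetric on the log scale and was built with a normal critical value of 1.96; that construction is an assumption, and the function names are hypothetical. It uses the adjusted HR for a poor global rating, 4.60 (2.86, 7.40).

```python
import math


def log_hr_se(lower: float, upper: float, z: float = 1.96) -> float:
    """Standard error of log(HR) implied by a CI that is symmetric on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def wald_z(hr: float, lower: float, upper: float) -> float:
    """Wald z-statistic for the null hypothesis HR = 1."""
    return math.log(hr) / log_hr_se(lower, upper)


se = log_hr_se(2.86, 7.40)
z_stat = wald_z(4.60, 2.86, 7.40)
midpoint = math.sqrt(2.86 * 7.40)  # geometric midpoint of the interval
print(round(se, 3), round(z_stat, 2), round(midpoint, 2))
```

The geometric midpoint of the interval is close to the reported hazard ratio, which is what a log-symmetric interval implies.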
In the adjusted models, we found that sex and age group were also significantly associated
with mortality risk. Thus, we performed domain analyses by age group and found
that the global and age-comparative SRH measures behaved differently across age
groups (Table 4.4 reports only the HRs of the SRH measures from the adjusted models;
sociodemographic covariates are omitted). In all age groups except the oldest, poor
ratings were associated with an increased risk of mortality, and the associations tended to be
stronger as respondents got older. In terms of discriminating mortality, although the AUCs were
not statistically different, a pattern by age was apparent. The age-comparative SRH measure
discriminated mortality better in the youngest group (aged 45-54), whereas the global SRH
measure discriminated mortality better in the older groups.
Table 4.3. Hazard ratios (HRs) of mortality associated with global and age-comparative SRH measures
                        Unadjusted Model                                    Adjusted Model
                        Global                    Age-comparative           Global                    Age-comparative           Global + Age-comparative
Global SRH
  Excellent             1                         -                         1                         -                         1
  Very good             1.29 (0.84, 1.97)         -                         1.09 (0.71, 1.68)         -                         1.00 (0.66, 1.50)
  Good                  2.19 (1.50, 3.19)***      -                         1.35 (0.91, 1.99)         -                         1.16 (0.77, 1.75)
  Fair                  5.02 (3.32, 7.61)***      -                         2.50 (1.63, 3.84)***      -                         1.82 (1.13, 2.93)*
  Poor                  9.12 (5.94, 13.99)***     -                         4.60 (2.86, 7.40)***      -                         2.92 (1.66, 5.14)**
Age-comparative SRH
  Excellent             -                         1                         -                         1                         1
  Very good             -                         1.42 (0.98, 2.04)         -                         1.26 (0.88, 1.78)         1.19 (0.82, 1.72)
  Good                  -                         2.04 (1.47, 2.83)***      -                         1.44 (1.04, 1.97)*        1.20 (0.82, 1.75)
  Fair                  -                         4.31 (2.97, 6.26)***      -                         2.56 (1.80, 3.65)***      1.60 (1.04, 2.48)*
  Poor                  -                         6.40 (4.42, 9.27)***      -                         4.09 (2.81, 5.96)***      1.87 (1.15, 3.06)*
Age
  45-54                 -                         -                         1                         1                         1
  55-64                 -                         -                         2.41 (1.85, 3.15)***      2.46 (1.91, 3.18)***      2.44 (1.89, 3.14)***
  65-74                 -                         -                         3.93 (3.01, 5.13)***      3.96 (3.04, 5.15)***      3.96 (3.04, 5.17)***
  75-84                 -                         -                         8.83 (6.69, 11.67)***     9.12 (6.94, 11.98)***     9.00 (6.81, 11.89)***
  85+                   -                         -                         20.02 (14.92, 26.86)***   22.07 (16.73, 29.10)***   20.86 (15.66, 27.80)***
Sex
  Male                  -                         -                         1                         1                         1
  Female                -                         -                         1.63 (1.39, 1.93)***      1.64 (1.40, 1.93)***      1.645 (1.40, 1.94)***
Race
  White                 -                         -                         1                         1                         1
  Black                 -                         -                         1.16 (0.89, 1.51)         1.12 (0.88, 1.42)         1.14 (0.88, 1.46)
  Others                -                         -                         0.73 (0.41, 1.30)         0.76 (0.43, 1.36)         0.71 (0.39, 1.28)
Education
  < High school         -                         -                         1                         1                         1
  High school graduate  -                         -                         0.95 (0.79, 1.15)         0.90 (0.74, 1.08)         0.95 (0.79, 1.15)
  ≥ College             -                         -                         0.82 (0.66, 1.01)         0.75 (0.61, 0.92)         0.82 (0.67, 1.01)
Region
  Northeast             -                         -                         1                         1                         1
  Midwest               -                         -                         1.03 (0.81, 1.32)         1.02 (0.80, 1.30)         1.04 (0.81, 1.33)
  South                 -                         -                         1.12 (0.88, 1.42)         1.09 (0.87, 1.37)         1.12 (0.88, 1.42)
  West                  -                         -                         1.09 (0.87, 1.38)         1.13 (0.91, 1.41)         1.12 (0.89, 1.40)
Income
  Very Poor             -                         -                         1                         1                         1
  Near Poor             -                         -                         1.20 (0.87, 1.67)         1.12 (0.81, 1.56)         1.16 (0.83, 1.61)
  Low income            -                         -                         1.14 (0.88, 1.49)         1.11 (0.86, 1.44)         1.13 (0.88, 1.47)
  Middle income         -                         -                         1.08 (0.85, 1.39)         1.06 (0.84, 1.33)         1.09 (0.86, 1.39)
  High income           -                         -                         0.95 (0.72, 1.24)         0.91 (0.69, 1.21)         0.96 (0.73, 1.25)
Insurance status
  Uninsured             -                         -                         1                         1                         1
  Any private           -                         -                         0.90 (0.57, 1.41)         0.84 (0.44, 1.29)         0.85 (0.55, 1.31)
  Public only           -                         -                         1.03 (0.66, 1.61)         0.98 (0.64, 1.50)         0.96 (0.63, 1.48)
-2log L                 217068944                 217850292                 211464737                 215714044                 215379500
AUC                     0.69 (0.67, 0.71)         0.66 (0.64, 0.68)         0.81 (0.80, 0.83)         0.81 (0.80, 0.83)         0.80 (0.78, 0.81)
* p < 0.05; ** p < 0.001; *** p < 0.0001.
Values in the parentheses are 95% CI for HRs or 95% CI for AUCs. A hyphen (-) marks a variable that was not included in the model.
Table 4.4. Hazard ratios (HRs) of mortality associated with global and age-comparative SRH measures by age
Age group   SRH measure†      Very good               Good                    Fair                    Poor                    AUC
45-54       Global            0.97 (0.39, 2.40)       0.98 (0.42, 2.30)       1.62 (0.69, 3.81)       3.24 (1.32, 7.97)*      0.75 (0.70, 0.80)
            Age-comparative   1.20 (0.48, 3.04)       1.53 (0.72, 3.25)       3.59 (1.53, 8.42)*      4.38 (1.89, 10.12)**    0.76 (0.71, 0.81)
55-64       Global            1.30 (0.54, 3.18)       1.13 (0.47, 2.74)       2.18 (0.83, 5.72)       3.35 (1.23, 9.09)*      0.73 (0.68, 0.77)
            Age-comparative   1.19 (0.61, 2.32)       1.22 (0.67, 2.22)       2.09 (1.00, 4.34)       3.69 (1.93, 7.05)***    0.73 (0.68, 0.77)
65-74       Global            0.83 (0.32, 2.13)       1.29 (0.56, 2.98)       2.62 (1.07, 6.45)*      5.46 (2.03, 14.67)**    0.70 (0.66, 0.74)
            Age-comparative   1.39 (0.71, 2.70)       1.62 (0.91, 2.87)       3.53 (1.89, 6.62)***    7.33 (3.71, 14.46)***   0.69 (0.65, 0.73)
75-84       Global            1.08 (0.52, 2.24)       1.65 (0.78, 3.48)       3.12 (1.44, 6.76)**     6.95 (3.10, 15.57)***   0.69 (0.65, 0.72)
            Age-comparative   2.01 (1.12, 3.95)       2.74 (1.53, 4.90)***    4.06 (2.23, 7.38)***    6.06 (2.89, 12.69)***   0.66 (0.63, 0.70)
85+         Global            1.24 (0.53, 2.90)       1.07 (0.49, 2.35)       1.63 (0.80, 3.31)       2.00 (0.89, 4.48)       0.66 (0.61, 0.72)
            Age-comparative   0.76 (0.44, 1.32)       0.58 (0.31, 1.09)       1.01 (0.63, 1.62)       0.51 (0.23, 1.15)       0.62 (0.56, 0.68)
* p < 0.05; ** p < 0.001; *** p < 0.0001.
† Reference category is Excellent in each SRH measure.
4.4. Discussion
In the present study, we evaluated the effect of a reference point on the predictive and
discriminative abilities of SRH measures for mortality. Our results showed that both global and
age-comparative SRH measures were strong predictors of mortality regardless of the reference
point. Both measures showed similar patterns in their relationship with mortality: worse
perceived health was associated with a greater risk of mortality. In connection with age, however,
the global and age-comparative SRH measures tended to behave differently. The age-comparative SRH
measure tended to discriminate mortality better than the global measure in the younger group,
while the global measure tended to discriminate better in the older groups.
Our finding that people rated their health more positively with the age-comparative SRH
measure than with the global measure is consistent with previous studies (Idler, 1993;
Vuorisalmi et al., 2006). The distributional differences between the global and age-comparative
SRH measures became clearer as people got older. More people in older groups perceived their
health as “excellent” or “very good” when they compared their health with that of their age peers
than when they rated their health in general without any reference group. This can be explained by
the characteristics of the age-comparative SRH measure. As people get older, the remaining
population is likely to have more comorbidities, and accordingly people will rate their health
more favorably, relative to less healthy peers, when using the age-comparative SRH measure.
People perceived their health positively with the age-comparative SRH measure even though
they had some problems with their health (Johnson et al., 1990). Because of the age sensitivity of
the age-comparative SRH measure, the global SRH measure is recommended as the more appropriate
measure when a study covers a wide age range (Vuorisalmi et al., 2006). Thus, the
reference points of the SRH measures, whether explicit or implicit, affected people’s evaluation
of their health.
There were, however, no differences in predicting mortality between the global and
age-comparative SRH measures despite their distributional differences. This is not consistent with
the findings of previous studies (Sargent-Cox et al., 2010; Vuorisalmi et al., 2005). One reason
could be that age affected the behavior of the two SRH measures differently and subsequently
affected their association with mortality. In the present study, the age-comparative SRH measure
tended to discriminate mortality better in the younger group, as did the global SRH in the older
groups. Because of these mixed results across age ranges, the two SRH measures may show
comparable discriminative abilities overall. A second reason might be our use of SRH measures with
the same number of response choices. Some previous studies argued that the number of response
choices did not affect the reliability and validity of scores (Aiken, 1983; Schutz and Rucker,
1975). However, other studies have provided support for a differential impact of the
number of choices on the reliability, validity, and discriminating power of scores (Loken et al.,
1987; Preston and Colman, 2000). Scales with small numbers of response choices yielded scores that
were generally less valid and less discriminating. Thus, it might not be appropriate to compare the
predictive powers of SRH measures on mortality if the measures had different numbers
of response categories. Our study has the advantage of using SRH measures with the same
scale, which might explain why the two SRH measures were both strong predictors of mortality.
There are some limitations of this study to be noted. First, the follow-up period was
limited. In the present study, six years was the longest follow-up period that the data
permitted, which is not as long an observation window as one might wish for studying mortality.
Second, we used the SRH measures from only the first year of the MEPS. Future
research should consider assessing dynamic changes in the SRH measures from the first year to
the second year in relation to mortality.
In conclusion, we found that SRH measures showed a graded association with mortality,
meaning that worse perceived health was associated with a greater risk of mortality regardless of
the reference point. Although people’s evaluation of their health depended on the reference points
of the SRH measures, both were comparably strong predictors of mortality.
4.5. References
Aiken, L.R. (1983). Number of response categories and statistics on a teacher rating scale.
Educational and psychological measurement 43, 397-401.
Agency for Healthcare Research and Quality. (2013). Medical Expenditure Panel Survey.
Dalal, A. (1987). Contextual effects on category rating scales. Journal of Psychology 12,
481-489.
DeSalvo, K.B., Bloser, N., Reynolds, K., He, J., and Muntner, P. (2006). Mortality prediction
with a single general self-rated health question. A meta-analysis. J Gen Intern Med 21,
267-275.
Fylkesnes, K. (1993). Determinants of health care utilization--visits and referrals. Scand J Soc
Med 21, 40-50.
Greiner, P.A., Snowdon, D.A., and Greiner, L.H. (1999). Self-rated function, self-rated health,
and postmortem evidence of brain infarcts: findings from the Nun Study. J Gerontol B
Psychol Sci Soc Sci 54, S219-222.
Hays, J.C., Schoenfeld, D., Blazer, D.G., and Gold, D.T. (1996). Global self-ratings of health
and mortality: hazard in the North Carolina Piedmont. J Clin Epidemiol 49, 969-979.
Idler, E.L. (1993). Age differences in self-assessments of health: age changes, cohort differences,
or survivorship. Journal of Gerontology 48, S289-S300.
Idler, E.L., and Benyamini, Y. (1997). Self-rated health and mortality: a review of twenty-seven
community studies. J Health Soc Behav 38, 21-37.
Idler, E.L., Russell, L.B., and Davis, D. (2000). Survival, functional limitations, and self-rated
health in the NHANES I Epidemiologic Follow-up Study, 1992. First National Health
and Nutrition Examination Survey. Am J Epidemiol 152, 874-883.
Johnson, R.E., Mullooly, J.P., and Greenlick, M.R. (1990). Morbidity and medical care
utilization of old and very old persons. Health Serv Res 25, 639-665.
Kaplan, G.A., Goldberg, D.E., Everson, S.A., Cohen, R.D., Salonen, R., Tuomilehto, J., and
Salonen, J. (1996). Perceived health status and morbidity and mortality: evidence from
the Kuopio ischaemic heart disease risk factor study. Int J Epidemiol 25, 259-265.
LaRue, A., Bank, L., Jarvik, L., and Hetland, M. (1979). Health in old age: how do physicians'
ratings and self-ratings compare? J Gerontol 34, 687-691.
Loken, B., Pirie, P., Virnig, K.A., Hinkle, R.L., and Salmon, C.T. (1987). The use of 0-10 scales
in telephone surveys. Journal of the Market Research Society 29, 353-362.
Manderbacka, K., Kåreholt, I., Martikainen, P., and Lundberg, O. (2003). The effect of point of
reference on the association between self-rated health and mortality. Soc Sci Med 56,
1447-1452.
Mossey, J.M., and Shapiro, E. (1982). Self-rated health: a predictor of mortality among the
elderly. Am J Public Health 72, 800-808.
Parducci, A. (1965). Category judgment: a range-frequency model. Psychol Rev 72, 407-418.
Preston, C.C., and Colman, A.M. (2000). Optimal number of response categories in rating scales:
Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica
104, 1-15.
Sargent-Cox, K.A., Anstey, K.J., and Luszcz, M.A. (2010). The choice of self-rated health
measures matter when predicting mortality: evidence from 10 years follow-up of the
Australian longitudinal study of ageing. BMC Geriatr 10, 18.
Schutz, H.G., and Rucker, M.H. (1975). A comparison of variable configurations across scale
lengths: An empirical study. Educational and Psychological Measurement 35, 319-324.
Thomas, C., Kelman, H.R., Kennedy, G.J., Ahn, C., and Yang, C.Y. (1992). Depressive
symptoms and mortality in elderly persons. J Gerontol 47, S80-87.
US Department of Health and Human Services, Centers for Disease Control and Prevention,
National Center for Health Statistics. (2013). NHIS linked mortality public use files.
Vuorisalmi, M., Lintonen, T., and Jylhä, M. (2005). Global self-rated health data from a
longitudinal study predicted mortality better than comparative self-rated health in old age.
J Clin Epidemiol 58, 680-687.
Vuorisalmi, M., Lintonen, T., and Jylhä, M. (2006). Comparative vs global self-rated health:
associations with age and functional ability. Aging Clin Exp Res 18, 211-217.
CHAPTER 5. SUMMARY
PROs are essential tools for incorporating patients’ perspectives in health care
decision making. Using PRO measures can facilitate the detection of health problems that might
otherwise be overlooked and can also provide information about the impact of a
treatment. When assessing the impact of treatment, individual change in health or change in
health between groups is usually measured. Thus, it is important not only to use a valid PRO
instrument but also to use a PRO instrument with the ability to measure change in health.
This three-paper dissertation introduced models for discriminating change in health using
PRO measures. Paper 1 evaluated a theoretical model that discriminates minimally important
change in health. This approach can provide a more comprehensive understanding of the minimally
important difference. Paper 2 presented a model that captures meaningful changes and
distinguishes between patients who remain the same and those who improve, using the cataract and
heart failure cohorts as examples. This model can support an empirical approach to choosing an
optimal instrument in clinical trials. Paper 3 extended the concept of “discriminating change in
health” by assigning a future health outcome, risk of death. This approach suggests a useful way
to estimate probabilities of external health outcomes from current health based on PROs.
Taken together, this dissertation provides more insight into PRO measures, particularly when
there is a change in health. Studies on the pattern of minimally important change in health and on
the overall responsiveness of PRO measures would deliver added value for assessing PROs. Testing
the predictive ability of PRO measures suggests further potential applications of PRO measures.
APPENDIX
We assume throughout that MIDs represent a valid index of discrimination. That is, a
person making at least some minimal increment of change on a measure will discriminate such
change. We denote continuous outcomes of health (e.g., weight, survival duration) as
$y_1, y_2, \ldots$ in $Y$, a set of such outcomes. We denote health quality as $q_1, q_2, \ldots$
in $Q$, a set of health states. When $Y$ is a set of survival durations, the Cartesian product
$Q \times Y$ is the set of all health quality and survival duration pairs. For any health state
$q_i$ and survival duration $y_j$ we denote the pair as $(q_i, y_j)$. Let $\succsim$ denote
“at least as preferred as”. We assume $\succsim$ is transitive and complete. We assume that the
function $U$ in Equation 1 is real-valued and represents $\succsim$ in the sense that for any
outcomes $y_1$ and $y_2$ in $Y$, $y_1 \succsim y_2$ if and only if $U(y_1) \geq U(y_2)$, and for
any health state and survival duration pairs $(q_1, y_1)$ and $(q_2, y_2)$,
$(q_1, y_1) \succsim (q_2, y_2)$ if and only if $U(q_1, y_1) \geq U(q_2, y_2)$. In
practice, the domain $Y$ is bounded above and below, to yield a realistic range of values. So as
not to distract from our main argument we do not impose this assumption, but the extension of our
results to such bounded intervals is straightforward.
DEFINITION 1 (Weber’s law). For all outcomes $y_1$, $y_2$, $\lambda y_1$, and $\lambda y_2$ and
pairs $(q_1, y_1)$, $(q_2, y_2)$, $(q_1, \lambda y_1)$, and $(q_2, \lambda y_2)$ and probabilities
$P \in [0,1]$, $P(y_1, y_2) = P(\lambda y_1, \lambda y_2)$ and
$P((q_1, y_1), (q_2, y_2)) = P((q_1, \lambda y_1), (q_2, \lambda y_2))$.
Theoretical Results
THEOREM 1. For all outcomes $y_1$, $y_2$, $\lambda y_1$, and $\lambda y_2$ in $Y$ and
probabilities $P \in [0,1]$, $P(y_1, y_2)$ and $P(\lambda y_1, \lambda y_2)$, the following are
equivalent:
(i) Equation 1 and Weber’s Law (Definition 1) hold.
(ii) The function $U$ in Equation 1 is given by $U(y) = \alpha \log(y) + \beta$.
Proof: The proof is given in Falmagne (1984).
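The easier direction of Theorem 1, that a logarithmic $U$ implies Weber's law, can be checked directly. The sketch below assumes that Equation 1 has the form $P(y_1, y_2) = F(U(y_2) - U(y_1))$, which is how it is used in the proof of Theorem 2, and that $\lambda > 0$ and $y_1, y_2 > 0$ so the logarithms are defined.

```latex
\begin{aligned}
U(\lambda y_2) - U(\lambda y_1)
  &= \alpha \log(\lambda y_2) + \beta - \alpha \log(\lambda y_1) - \beta \\
  &= \alpha \bigl[\log\lambda + \log y_2 - \log\lambda - \log y_1\bigr] \\
  &= \alpha \log y_2 - \alpha \log y_1 \\
  &= U(y_2) - U(y_1),
\end{aligned}
\qquad\text{hence}\qquad
P(\lambda y_1, \lambda y_2)
  = F\bigl(U(\lambda y_2) - U(\lambda y_1)\bigr)
  = F\bigl(U(y_2) - U(y_1)\bigr)
  = P(y_1, y_2).
```

The converse direction, that Weber's law forces the logarithmic form, is the substantive part and rests on the functional-equation argument cited in the proof.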
THEOREM 2. For all ordered pairs $(q_1, y_1)$, $(q_2, y_2)$, $(q_1, \lambda y_1)$, and
$(q_2, \lambda y_2)$ and probabilities $P \in [0,1]$, $P((q_1, y_1), (q_2, y_2))$ and
$P((q_1, \lambda y_1), (q_2, \lambda y_2))$, the following are equivalent:
(i) Equation 1 and Weber’s Law (Definition 1) hold.
(ii) The function $U$ in Equation 1 is given by $U(q_1, y_1) = \log(y_1) + M(q_1)$.
Proof: Choose any ordered pairs $(q_1, y_1)$, $(q_2, y_2)$, $(q_1, \lambda y_1)$, and
$(q_2, \lambda y_2)$ in $Q \times Y$. By the hypothesis of the theorem,
$P((q_1, y_1), (q_2, y_2)) = P((q_1, \lambda y_1), (q_2, \lambda y_2))$ and
$F(U(q_2, y_2) - U(q_1, y_1)) = F(U(q_2, \lambda y_2) - U(q_1, \lambda y_1))$ (i).
Because $F$ is monotonically increasing, it has an inverse, which can be applied to both sides of
the equation in (i), yielding
$U(q_2, y_2) - U(q_1, y_1) = U(q_2, \lambda y_2) - U(q_1, \lambda y_1)$ (ii).
For each $i$ let $U(q_i, y_i) = U(y_i \mid q_i)$, where $U(x \mid q_i)$ is a function conditioning
on $q_i$. We may then rewrite (ii) as
$U(y_2 \mid q_2) - U(y_1 \mid q_1) = U(\lambda y_2 \mid q_2) - U(\lambda y_1 \mid q_1)$.
Now let $\lambda = 1/y_1$, $y^* = y_2/y_1$, and $L(y^*) = U(y^* \mid q_2) - U(1 \mid q_1)$. This
gives us $U(y_2 \mid q_2) - U(y_1 \mid q_1) = U(y^* \mid q_2) - U(1 \mid q_1)$, or
$L(y^*) = U(y_2 \mid q_2) - U(y_1 \mid q_1)$ (iii).
Noting that $y_2 = y_1 \times y^*$, we may substitute in (iii) to get
$U(y_1 \times y^* \mid q_2) = U(y_1 \mid q_1) + L(y^*)$ (iv).
Equation (iv) is a Pexider equation whose only solution is a logarithmic function, where
$U(q_i, y_i) = K(q_i)\log(y_i) + M(q_i)$. We must now show that for all $i$ and $j$,
$K(q_i) = K(q_j)$. Choose any $q_i$ and $q_j$. From
$U(q_i, y_i) - U(q_j, y_j) = U(q_i, \lambda y_i) - U(q_j, \lambda y_j)$ we have
$K(q_i)[\log(\lambda y_i) - \log(y_i)] = K(q_j)[\log(\lambda y_j) - \log(y_j)]$, or
$K(q_i)/K(q_j) = [\log(\lambda y_j) - \log(y_j)]/[\log(\lambda y_i) - \log(y_i)] = \log\lambda/\log\lambda = 1$
(for $\lambda \neq 1$), or $K(q_i) = K(q_j)$. Since the $K(q_i)$ are all equal we may set them to
1, which gives us $U(q_1, y_1) = \log(y_1) + M(q_1)$. Q.E.D.
Alternative linear model in Study 1
To compare with the proportional model in Theorem 1, we suggested another possible
MID function, a linear model. Suppose that for all outcomes $y_1$, $y_2$, $\lambda + y_1$, and
$\lambda + y_2$, and probabilities $P \in [0, 1]$,
$P(y_1, y_2) = P(\lambda + y_1, \lambda + y_2)$ (Assumption 1).
THEOREM 3. For all outcomes $y_1$, $y_2$, $\lambda + y_1$, and $\lambda + y_2$ in $Y$ and
probabilities $P \in [0, 1]$, $P(y_1, y_2)$ and $P(\lambda + y_1, \lambda + y_2)$, the following
are equivalent:
(i) Equation 1 and Assumption 1 hold.
(ii) The function $U'$ in Equation 1 is given by $U'(y_1) = c y_1 + k$, where $k$ is an arbitrary
constant.
Proof: Choose any pairs $(y_1, y_2)$ and $(\lambda + y_1, \lambda + y_2)$ in $Y$. By the
hypothesis of Theorem 3, $P(y_1, y_2) = P(\lambda + y_1, \lambda + y_2)$ and
$F'(U'(y_2) - U'(y_1)) = F'(U'(\lambda + y_2) - U'(\lambda + y_1))$ (v).
Since $F'$ is monotonically increasing, it has an inverse, which can be applied to both sides of
equation (v), yielding
$U'(y_2) - U'(y_1) = U'(\lambda + y_2) - U'(\lambda + y_1)$ (vi).
Rearranging (vi) and subtracting $U'(\lambda)$ from both sides, we may rewrite it as
$U'(\lambda + y_1) - U'(y_1) - U'(\lambda) = U'(\lambda + y_2) - U'(y_2) - U'(\lambda)$.
Since the two sides must always share an identical value, $t$, we have to solve
$U'(\lambda + y_1) - U'(y_1) - U'(\lambda) = t$ (vii).
Let $G(x) = U'(x) + t$, where $G$ is a monotonic function; (vii) is then transformed into
$U'(\lambda + y_1) = U'(y_1) + G(\lambda)$, which is a Pexider equation. By Corollary 10
(Aczél, 1966, p. 142), the solution to $U'(\lambda + y_1) = U'(y_1) + G(\lambda)$ is a linear
function, $U'(y_1) = c y_1 + k$. Q.E.D.
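The contrast between the proportional model (Theorem 1) and the linear model (Theorem 3) can be checked numerically. The sketch below assumes, as in the proofs, that the discrimination probability is a monotonic function of the utility difference, so equal differences imply equal probabilities. The coefficient values and function names are arbitrary choices for illustration.

```python
import math


def log_utility(y: float) -> float:
    """Proportional (Weber) model: U(y) = a*log(y) + b, with a = 2 and b = 1 assumed."""
    return 2.0 * math.log(y) + 1.0


def linear_utility(y: float) -> float:
    """Linear model: U'(y) = c*y + k, with c = 2 and k = 1 assumed."""
    return 2.0 * y + 1.0


def utility_gap(utility, y1: float, y2: float) -> float:
    """Utility difference that drives the discrimination probability."""
    return utility(y2) - utility(y1)


y1, y2, lam = 2.0, 6.0, 5.0

# Scaling both outcomes by lam leaves the log-model gap unchanged.
scale_invariant = math.isclose(
    utility_gap(log_utility, y1, y2), utility_gap(log_utility, lam * y1, lam * y2)
)
# Shifting both outcomes by lam leaves the linear-model gap unchanged.
shift_invariant = math.isclose(
    utility_gap(linear_utility, y1, y2), utility_gap(linear_utility, lam + y1, lam + y2)
)
print(scale_invariant, shift_invariant)
```

Each model fails the other model's invariance: shifting changes the gap under the logarithmic utility, and scaling changes it under the linear utility.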
References
Aczél, J. (1966). Lectures on functional equations and their applications (New York: Academic
Press).
Falmagne, J.-C. (1984). Elements of psychophysical theory (Oxford, New York: Clarendon
Press; Oxford University Press).
Abstract
Patient-reported outcome (PRO) measures are incorporated in health care decision making not only when too few objective health measures are available but also to take into account what matters to patients about their health and interventions. For instance, in a clinical trial measuring the impact of an intervention, PRO instruments usually ask patients to compare their health after the intervention with their health before the intervention, which serves as an anchor. If patients' health has changed, patients rate their current health as "better" or "worse", and PRO measures have to detect these changes. This three-paper dissertation draws on three different models that discriminate changes in health based on PRO measures.
Paper 1 introduces a model that discriminates hypothetical health stimuli. Investigating the pattern of discriminating one health condition from another can shed light on how the minimally important difference works. Paper 2 also uses a person-based anchor, but with empirical data. The model tests the overall performance of generic and disease-specific instruments in terms of how well the instruments distinguish patients who experienced a change in health from those who did not. By suggesting a method to find an instrument that can capture patients who discriminate their current health from their previous health, this approach can be useful in selecting an appropriate instrument. Paper 3 presents a model that tests whether self-rated health can be differentiated against an external anchor, risk of death. Using a single PRO measure of self-rated health, this model can help to achieve a more complete understanding of the predictive quality of self-rated health.
Asset Metadata
Creator
Suh, Jae Kyung (author)
Core Title
Discriminating changes in health using patient-reported outcomes
School
School of Pharmacy
Degree
Doctor of Philosophy
Degree Program
Pharmaceutical Economics and Policy
Publication Date
08/12/2013
Defense Date
08/12/2013
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
behavioral economics,health instrument,minimally important difference,OAI-PMH Harvest,patient-reported outcomes,self-rated health,utility
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Doctor, Jason N. (committee chair), Nichol, Michael B. (committee member), Sood, Neeraj (committee member)
Creator Email
jaekyuns@usc.edu,jksuh81@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-325758
Unique identifier
UC11288029
Identifier
etd-SuhJaeKyun-1999.pdf (filename),usctheses-c3-325758 (legacy record id)
Legacy Identifier
etd-SuhJaeKyun-1999.pdf
Dmrecord
325758
Document Type
Dissertation
Rights
Suh, Jae Kyung
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA