TESTING THE ROLE OF TIME IN AFFECTING UNIDIMENSIONALITY IN
HEALTH INSTRUMENTS
by
Ning Yan Gu
__________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(PHARMACEUTICAL ECONOMICS AND POLICY)
May 2010
Copyright 2010 Ning Yan Gu
DEDICATION
This dissertation is dedicated to my father, who courageously fought a terminal
illness from January 2008 to April 2009 while continuously providing me with the
greatest love and support during my Ph.D. endeavor.
ACKNOWLEDGEMENTS
The completion of this dissertation was accompanied by considerable
guidance and assistance. I owe my deepest gratitude to my mentors, friends and
family.
I would first like to express my sincere appreciation to my advisor, Dr. Jason
N. Doctor, who provided me with valuable guidance during my academic career at USC
and prompted research ideas that enhanced my research ability and my interest in a
future research career.
My appreciation is extended to my committee members Dr. Michael Nichol
and Dr. Kimberly Siegmund. Special thanks to Dr. Michael B. Nichol for his
invaluable support, advice and consistent encouragement during my research at
USC. And, special thanks also to Dr. Kimberly Siegmund for her time and effort on
my dissertation.
Many special thanks are dedicated to Dr. Benjamin M. Craig for his generous
effort in facilitating the presentation of the EQ-5D studies at the EuroQol Group
25th and 26th Annual Plenary Meetings.
My deepest gratitude goes to Dr. Dennis Fryback and his team at the
University of Wisconsin-Madison for providing me with the Beaver Dam data set on the
four measures of the SF-36®. This dissertation could not have been completed without
his generous help.
Further, I sincerely hope this acknowledgement serves as a small measure of my
appreciation to the many experts in the Rasch measurement field. Many thanks to Dr.
Mike Linacre, Dr. Trevor Bond, Dr. Everett Smith, Dr. Richard Smith and Dr.
Edward Wolfe for providing me with continuous and unconditional help with my
technical and conceptual difficulties with the Rasch measurement models.
I would also like to thank the Department of Clinical Pharmacy &
Pharmaceutical Economics & Policy at the University of Southern California (USC)
for providing me with the research opportunity and support. Thanks to the Merck
Fellowship that made my research possible at USC.
Last but not least, I would also like to express my appreciation to my friends
Joanne Wu, Tony Zuo and Liang Yu for their support and friendship during my
study at USC.
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Organization of this dissertation
CHAPTER 2: BACKGROUND, CONCEPTUALIZATION AND MOTIVATION
2.1 Health as a latent variable
2.2 Separating the person measures from the health items
2.3 Rasch model
2.3.1 Rasch model specifications
2.3.2 Rasch in contrast with other measurement models
2.3.3 Rasch model for health measurement
2.3.4 Rasch model requirement: unidimensionality
2.3.5 Rasch model prior applications
2.4 The role of time in unidimensionality
2.5 Aims of this dissertation
CHAPTER 3: METHODS
3.1 Data
3.1.1 MEPS data
3.1.2 BDHOS data
3.1.3 Inclusion/exclusion criteria
3.1.4 Variables
3.2 Health instruments
3.2.1 The EQ-5D
3.2.2 The SF-12v2™ and the SF-36®
3.3 Analytical models
3.3.1 The Rasch rating scale model (RSM)
3.3.2 The Rasch partial credit model (PCM)
3.3.3 The many-facet Rasch model (MFRM)
3.4 Analytical steps
3.4.1 Data matrices
3.5 Goodness-of-fit
3.5.1 Assessing unidimensionality
3.5.2 Assessing measurement invariance
3.6 Assessing productivity of the facets
3.7 Statistical analysis
CHAPTER 4: RESULTS
4.1 Cross-sectional analysis
4.1.1 Sample descriptive
4.1.2 Goodness-of-fit
4.1.3 Differential item functioning (DIF)
4.1.4 Principal Component Analysis of Residuals (PCAR)
4.2 Longitudinal analysis
4.2.1 Goodness-of-fit
4.2.2 Item x time interaction/bias assessments
4.2.3 Validation test
CHAPTER 5: DISCUSSION AND CONCLUSIONS
5.1 Overview
5.2 Discussion
5.2.1 On the cross sectional findings
5.2.2 On the longitudinal findings
5.3 Answering research questions
5.4 Significance, limitations and directions for future study
5.5 Conclusions
BIBLIOGRAPHY
APPENDIX: SAMPLE CONTROL FILES
LIST OF TABLES
Table 1: The SF-12v2™ descriptive system
Table 2: The SF-36® descriptive system
Table 3: Facets and estimates in the longitudinal analysis
Table 4: Analytical steps
Table 5: Descriptive of the MEPS sample
Table 6: Descriptive of the BDHOS sample
Table 7: Cross sectional fit of the EQ-5D items to the RSM
Table 8: Cross sectional fit of the EQ-5D items to the PCM
Table 9: Cross sectional fit of the SF-12v2™ items to the RSM
Table 10: Cross sectional fit of the SF-12v2™ items to the PCM
Table 11: Cross sectional fit of the SF-36® items (n = 1,430)
Table 12: Test gender-related differential item functioning (DIF) in EQ-5D
Table 13: Test gender-related differential item functioning (DIF) in SF-12v2™
Table 14: Test gender-related differential item functioning (DIF) in SF-36®
Table 15: Principal Component Analysis of Residuals (PCAR) of the EQ-5D
Table 16: Principal Component Analysis of Residuals (PCAR) of the SF-12v2™
Table 17: Principal Component Analysis of Residuals (PCAR) of the SF-36®
Table 18: Item fit of the EQ-5D to the FACETS model
Table 19: Item fit of the SF-12v2™ to the FACETS model
Table 20: Time 1 item fit of the SF-36® to the FACETS model
Table 21: Time 2 item fit of the SF-36® to the FACETS model
Table 22: Time 3 item fit of the SF-36® to the FACETS model
Table 23: Time 4 item fit of the SF-36® to the FACETS model
Table 24: Longitudinal item fit of the SF-36® to the FACETS model
Table 25: Fixed all-same χ² test on the EQ-5D
Table 26: Fixed all-same χ² test on the SF-12v2™
Table 27: Fixed all-same χ² test on the SF-36®
Table A.1: An example of the rating scale model control file using the SF-12v2™
Table A.2: An example of 2-facet model specification file using the EQ-5D
Table A.3: An example of 3-facet model specification file using the EQ-5D
LIST OF FIGURES
Figure 1: Indifference curve illustration of the Rasch model
Figure 2: Logistic ogive illustration of dichotomous item
Figure 3: Construct map illustration of the EQ-5D system
Figure 4: Construct map illustration of the SF-12v2™ and the SF-36® system
Figure 5: Cross-sectional Winsteps data matrix
Figure 6: Point-in-time FACETS data matrix
Figure 7: Longitudinal FACETS data matrix
Figure 8: Test gender-related differential item functioning (DIF) in EQ-5D
Figure 9: Test disease-related differential item functioning (DIF) in EQ-5D
Figure 10: Test gender-related differential item functioning (DIF) in SF-12v2™
Figure 11: Test disease-related differential item functioning (DIF) in SF-12v2™
Figure 12: Test gender-related differential item functioning (DIF) in SF-36®
Figure 13: Test arthritis-related differential item functioning (DIF) in SF-36®
Figure 14: Test chronic pain-related differential item functioning (DIF) in SF-36®
Figure 15: Test hypertension-related differential item functioning (DIF) in SF-36®
Figure 16: Item x time interaction/bias of the EQ-5D in ten disease groups
Figure 17: Item x time interaction/bias of the EQ-5D in whole sample (n=2,677)
Figure 18: Item x time interaction/bias of the SF-12v2™ in ten disease groups
Figure 19: Item x time interaction/bias of the SF-12v2™ in whole sample (n=5,151)
Figure 20: Item x time interaction/bias of the SF-36® in twenty-one disease groups
Figure 21: Item x time interaction/bias of the SF-36® in whole sample (n=1,430)
ABSTRACT
Effective allocation of limited resources in a healthcare delivery system
requires valid measurement of health. Valid measurement of health is contingent
upon the validity of the health instrument used. An important property of
measurement in general is “unidimensionality”: Do the numbers assigned to the
qualitative attribute being measured increase with increases in the attribute? Such
a property of a health instrument is important because, without unidimensionality,
it is impossible to discuss improvements in health with treatment or improvements in
health at the population level. In this dissertation, we evaluate unidimensionality
with respect to a model of probability of item response.
In health measurement, unidimensionality has been shown to fail because
mental health and physical health tend to behave independently of one another.
One possibility is that mental health and physical health may reflect two distinctive
health constructs at a point in time. However, given the role of time in health, this
dissertation tests the hypothesis that unidimensionality is more plausible once time
is considered in the model.
We approached our study by investigating the measurement properties of
three widely used health instruments (the EQ-5D, the SF-12v2™ and the SF-36®)
using unidimensional mathematical measurement models, i.e. the Rasch models. We
tested our hypothesis using both U.S. national survey data (the Medical
Expenditure Panel Survey, MEPS) and U.S. community survey data (the Beaver
Dam Health Outcomes Study, BDHOS). We examined the goodness-of-fit of the
items to see whether or not mental health and physical health items fit the
unidimensional model as a single scale. We also examined gender- and disease-
related differential item functioning (DIF) to see if respondents with the same level
of latent health endorsed items with systematic differences due to gender and/or
health condition differences.
Based on our findings, mental health items consistently misfit the
unidimensional model in point-in-time measures in all three health instruments and
across all disease groups. Including time as a parameter in the model improved the
overall model fit in the two instruments from the U.S. national survey (the EQ-5D
and the SF-12v2™), but not in the SF-36® from the U.S. community survey. We
attribute the inconsistent findings to 1) cohort differences between samples extracted
from a U.S. national survey and samples from the Beaver Dam eye study in Beaver
Dam, Wisconsin; 2) data metric differences, as there was a substantially longer
follow-up (10 years) in the Beaver Dam data compared with the shorter follow-up (2
years) in the MEPS; and 3) the missing data problem for a longitudinal measurement
model. Hence, further investigation is warranted given the limited empirical
evidence on the robustness of the Rasch model in dealing with missing data in
longitudinal settings.
Our findings suggest that the unidimensionality requirement in a health
instrument is supported when time is considered in the model. Mental health and
physical health tend to form different scales in point-in-time measures, but they may
be linked through time. Therefore, parameterizing time improves the overall model
fit. It no longer makes sense to interpret health measures cross-sectionally. The
inter-temporal health context is important in health measurement.
CHAPTER 1: INTRODUCTION
1.1 Introduction
Effective allocation of limited resources in a healthcare delivery system
requires valid assessment of health. Methods for assessing the performance of drugs
and other treatments require the valid assessment of health. Health care professionals
and policy makers in the U.S. often measure health outcomes to evaluate the
achievement of goals for improving the health and the well-being of the general
public. Understanding how well these health goals are met requires measures that
reflect the quantities of health. Such quantities are not as easy to define as in physical
measurement and substantive models of person response to items (or health
questions) are needed to facilitate such quantitative measurement. During the past 40
years, health instruments such as generic, disease-specific, or preference-based
health instruments have proliferated and gained increasing acceptance among health
outcomes researchers (Spilker, 1996; Brown & Gordon, 1999).
Different types of health instruments entail different levels of sophistication
and render different interpretations. For instance, generic instruments cover a broad
spectrum of health profiles; disease-specific instruments aim to describe specific
symptom-related health conditions; and preference-based instruments are designed to
yield preference-weighted single indices that can be used for economic studies that
involve trade-offs (Hays, 2005; Fayers & Machin, 2007). The guidance from the US
Food and Drug Administration (FDA) on the development and use of patient-
reported outcomes (PROs) to support medical product labeling claims that
incorporate the patient perspective on the effectiveness of a treatment has shed
new light on the role of health measurement in healthcare decision-making (FDA
Guidance for Industry, 2006).
Virtually all health instruments consist of multiple health items,
predominantly, mental health and physical health items. These items are proxy
indicators of an individual’s underlying health. Each individual has a given level of
health at a specific time point. Nearly all health instruments contend that physical
and mental health item endorsement is primarily governed by an individual's given
health level, where an item “endorsement” refers to the respondent's rating of his or
her perceived health on that particular health item, in terms of the intensity of
health deficits. Whether or not individual health items behave the way they are
expected to underpins the overall quality of a health measurement. Hence,
evaluating the overall measurement quality of a health instrument requires an
evaluation of how well individual items perform in an instrument. Such evaluation
entails definitional and operational difficulties.
Two issues arise with respect to health measurement. First, the concept of
health is latent and health cannot be measured directly. Hence, the measurement
of health requires an explicit measurement theory to quantify such a latent trait.
Different measurement theories may render different approaches to measurement
that vary in their outcomes. These differences may play a significant role in medical
decision making and healthcare delivery systems. Second, the measurement of health
is difficult given that health can be defined differently by experts in different fields.
The underlying concept of health fosters varying perspectives (Klapow, Kaplan &
Doctor, 2002). Therefore, the measurement of health must entail an array of
comprehensive health attributes (items). The prerequisite is that all of these items
share the property of being hypothesized as measurable, in the sense that each can
describe or quantify a distinctive portion of the latent trait of health (Embretson,
1999). Given this latency, less attention has been given to testing a substantive
theory of health survey endorsements; such testing is therefore warranted.
To evaluate whether or not people’s underlying health statuses are truly
represented by the selected set of health items in an existing health instrument
requires a measurement procedure that is capable of detecting whether or not
parameter estimates are free from any aberrant influences (Wright, 1996a; Bond &
Fox, 2007; Linacre 2008a/2008b). Such a measurement requirement underpins the
construct validity of a measurement scale in terms of its discriminative validity
(Tennant et al., 2004; Streiner & Norman, 2004). For example, with a discriminative
measurement device, a patient with arthritis should endorse more problems on the
mobility and pain scales than other patients of a similar age in the general population.
Further, a patient with psychological distress might be expected to report more
mental health related problems.
Not all measurement procedures render such health estimates. For instance, a
substantial amount of health measurement is built on summated raw scores formed by
adding the counts of health events from the items. A “health event” refers to the
patient's endorsement of a health problem. The summated score assumes equal
importance or intensity between health events and hence implies a linear relationship
between the items and the total score. In most empirical settings, such a one-to-one
relationship between the total score and the individual items may not hold (it is non-
linear), as the total may not equal the sum of the parts (Michell, 2005). Therefore,
parameter additivity is limited and estimate concatenation is not permitted. The non-
linearity of the summated scores is often due to the fact that subjective health
perceptions are susceptible to various confounders that pertain to individual
characteristics (Thoits, 1981; Lange et al., 2002; Pallant et al., 2006; Tennant &
Pallant, 2007; Misajon et al., 2008) or simply to the effect of time (Monsaas &
Engelhard, 1993; Chang & Chan, 1995; Mannarini & Lalli, 2008). When there are
differences between groups that affect the total summated raw score, erroneous
inferences on health may result. According to Thurstone (1927, 1928a, 1928b,
1931), the very idea of quantitative measurement implies a linear continuum of some
sort, and a measuring instrument must not be seriously affected in its measuring
function by the object of measurement. Hence, an ideal measure of health is one in
which the estimate of health is obtained separately from the health items, i.e.
parameter separation (discussed in the following). This requires a mathematical
measurement model that is capable of providing such estimation.
The advantages of using a mathematical measurement model are that such a
model 1) makes explicit the exact details of why people choose to endorse items
on a health measure; 2) produces a concrete summary of the empirical situation and
a substantive theory of why people endorse items; and 3) justifies numerical
calculations on health measures and makes clear which calculations lead to
interpretations that are meaningful under the substantive theory (Luce et al., 1990). At
the same time, mathematical estimation of people's health via health items requires
statistically separating persons' health from the health items. If persons and items are
not separated, then measurement results cannot be easily compared across persons, as
each person's score then means something different. Thus, such parameter separation
is fundamental in terms of obtaining accurate estimates of health as well as valid
evaluations of the measurement tools, given the complexity of the latent trait.
Formally, parameter separation can be understood as follows: there is a function, f, from
persons to the real numbers, and a function, g, from items to the real numbers, such that
f − g preserves the order of the probability of endorsing an item (Luce, 2000). As will be
elaborated further in Chapter 2, without such separation, health is not measurable
because it depends on which items are included in the questionnaire and which
persons complete the measure.
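
To make the idea of parameter separation concrete, the following small Python sketch (added for illustration; the person names, item names and logit values are hypothetical, not taken from this dissertation's data) shows that when the endorsement probability depends only on the difference f − g, every item orders the persons identically:

import math

# Hypothetical person measures f(person) and item difficulties g(item), in logits.
# These numbers are illustrative only; they are not estimates from the dissertation data.
persons = {"A": -1.0, "B": 0.5, "C": 2.0}
items = {"mobility": -0.5, "pain": 0.8}

def p_endorse(f_person, g_item):
    """Probability of endorsing an item when it depends only on f - g (logistic link)."""
    return 1.0 / (1.0 + math.exp(-(f_person - g_item)))

# Because p_endorse is a monotone function of f - g, every item orders the persons
# identically: A < B < C on each item.
for item, g in items.items():
    ranked = sorted(persons, key=lambda name: p_endorse(persons[name], g))
    print(item, ranked)   # both items print ['A', 'B', 'C']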
How the functions (f or g) are defined with regard to the measurement scale
determines the meaningfulness of the health measure. Commonly used rating scales in
health instruments, such as Likert scales, are ordinal scales that deliver ordinal
observations. Ordinal scales consist of monotonically increasing functions that are
capable of indicating more or less, in the sense that more is better, but not the
magnitude of how much more or how much less. The meaningfulness of any
measurement outcome is contingent upon the quantitative nature of the measure
(Roberts, 1985; Wright & Linacre, 1989a/1989b). Hence, measurement is more
meaningful when the numerical functions are built on interval scales and are capable of
linear transformations. For example, linear functions of the form f(x) = α + βx with
β > 0 (and analogous linear forms for g) give a quantitative base for generating linear
measures (Roberts, 1985). Such a quantitative base permits parametric arithmetic
operations such as comparisons between means (Stevens, 1946). Arithmetic mean
comparisons using ordinal scales are not capable of, or sensitive enough for,
quantifying changes along the monotonically increasing scale, as such changes are
subject to percentile distributions tied to a specific group (Roberts, 1985). As we will
elaborate later in Chapter 2, the use of an ordinal scale does not give person- or
scale-distribution-free measures, which limits the generalizability of the interpretation
of measured outcomes. Interval scale properties are desired in any empirical
application where the meaningfulness of numerical differences is of interest. For
instance, interval scales are emphasized in cost-utility analysis (CUA) for determining
healthcare resource allocations (Gold et al., 1996). Hence, it is important to use a
mathematical measurement model to derive the desired interval scales from the ordinal
observations, for instance, the Rasch family of mathematical measurement models.
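
As a brief worked note (added here for illustration, not taken from the source), the interval-scale property can be written out explicitly. Under an admissible positive linear transformation, comparisons of differences are preserved:

f(x) = α + βx, β > 0  ⇒  f(x_1) − f(x_2) = β(x_1 − x_2),  and  [f(x_1) − f(x_2)] / [f(x_3) − f(x_4)] = (x_1 − x_2) / (x_3 − x_4).

By contrast, any monotone (ordinal) recoding g preserves only order, g(x_1) > g(x_2) if and only if x_1 > x_2; ratios of differences are generally not preserved, so “how much more” is not quantified.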
The Rasch models are probabilistic latent trait models of item response
(Rasch, 1960/1980; Wright & Masters, 1982; Krantz et al., 1971; Linacre,
2008a/2008b). Rasch models estimate individual person health and item response
simultaneously along a logit (log odds unit) continuum that is common to both
persons and items through a probabilistic mechanism. The aforementioned parameter
separation is the Rasch model feature of specific objectivity, which means that
estimations of person measures are separated from the items statistically (Rasch,
1966).
The Rasch model explicitly requires unidimensionality (Smith, 2004).
Unidimensionality is a strong measurement requirement that is rooted in the
fundamental measurement paradigm (McHorney, 1999; Krantz et al., 1971). In a
health instrument, unidimensionality means that all items in the instrument measure a
single latent trait, i.e. health. All selected items in a health instrument should measure
only health. Hence, the concept of unidimensionality is also the essential requirement
of construct validity (Tennant et al., 2004; Streiner & Norman, 2004). A
unidimensional measurement scale is built on the principle that any attribute of the
measured object can be measured separately (Krantz et al., 1971; Bond & Fox,
2007). An object can be, for instance, a person's health or a student's math
skill. Therefore, obtaining a unidimensional scale is determined by the aforementioned
parameter separation. In the context of health measurement, unidimensionality
means that health is the unitary dominating dimension governing all health items in a
health instrument. Meanwhile, in empirical settings, unidimensionality can be one of
the most difficult measurement requirements to satisfy. Hence, concern about violating
the unidimensionality requirement can be one of the impediments to applying
unidimensional measurement models of item response to empirically collected
health data (Reise, Morizot & Hays, 2007). For instance, in health, unidimensionality
has often failed with respect to the mental and physical health constructs (Gunter et al.,
2006; Pickard et al., 2006; Chang et al., 2007).
There are many potential factors that can confound the unidimensionality in a
health instrument. For instance, commonly noted confounders are individual
characteristic-related factors such as gender, age, ethnicity or various disease
conditions, all of which can affect how people interpret the items at a given time point
(Tennant & McKenna, 1995; Lange et al., 2002; Tennant & Pallant, 2007; Hahn et
al., 2007). Another crucial confounder for unidimensionality is time, which is the
main focus of this dissertation.
Time is a unique element that is essential to all disease prognoses (Klapow,
Kaplan & Doctor, 2002; Kaplan, 2005; Brown, 2005). Coping with any chronic
disease condition is a continuous experience and not a discrete event. For instance, a
patient needs time to recover from an illness; over time, one copes with the
disease better than when it was first diagnosed; a patient with the same health
condition may experience that condition differently from one day to another
due to various reasons or unforeseeable factors; and people may also perceive health
unchangeable. It can be postulated that a time severity effect is embedded in health
behavior. Hence, time plays a significant role in health assessment for clinical
decision-making and resource allocation decisions (Nord, 1999).
Previous research has shown that mental and physical health items in the
same measurement scale behave independently from each other (Gunter et al., 2006;
Pickard et al., 2006; Chang et al., 2007). However, most of this work has been cross-
sectional instead of longitudinal. Hence, adjusting for the confounding effect of time
has not been undertaken in most prior work. Point-in-time health measures may
not be sufficient for capturing the underlying dynamics of health. Although repeated
measures are often used in health measurement, the time severity effect has received
little statistical study in measurement models of item response.
In this dissertation, we intend to examine the measurement properties of
unidimensionality and invariance with respect to time using a family of
unidimensional Rasch measurement models of item response that can separate the
persons and the items statistically. We hypothesize that parameterizing time in a
Rasch model improves the unidimensionality of a health instrument and yields time
invariant health measures.
1.2 Organization of this dissertation
Chapter 2 provides a literature review outlining the background of the latent
variable of health, the concept of parameter separation and its importance to
unidimensionality. This leads to the section discussing the Rasch model
specifications and advantages, as well as its operational difficulties. Emphasis will be
placed on the role of time in affecting unidimensionality. Chapter 2 concludes with
the specific aims of this dissertation. Chapter 3 will outline the methods. We will
provide detailed information on the data sets, the variables and the
inclusion/exclusion criteria. Graphical presentation and tabulation will be used for
conceptualizing the health instruments within the context of health measurement. We
will discuss the mathematical expressions of the specific Rasch models for the
proposed analyses, followed by the analytical steps and the corresponding data
matrices. We will discuss the quality control mechanism for the goodness-of-fit to
the Rasch models and the statistical analysis procedures. Chapter 4 provides results
and findings from both the cross-sectional and the longitudinal analyses.
Tables and figures will be provided to illustrate our main findings. Chapter 5
discusses, summarizes and concludes the study. We will go over the interpretation
of the results and provide implications of the findings. In this chapter, we will also
state the potential significance as well as the limitations of the study, which may serve
as directions for future research. The appendix provides examples of Rasch model
control files and FACETS model specifications that underlie the conceptual framework of this
study.
CHAPTER 2: BACKGROUND, CONCEPTUALIZATION AND
MOTIVATION
2.1 Health as a latent variable
Health is a latent variable that cannot be measured directly. To date, there
are still considerable debates on the underlying definition of health, as experts from
different fields define health differently (Parsons, 1951; Patrick & Erickson, 1993;
Klapow, Kaplan & Doctor, 2002). The World Health Organization (WHO) defined
health as “a state of complete physical, mental, and social well-being, and not merely
the absence of disease or infirmity.” The WHO definition of health contributed some
convergence in understanding health, but did not resolve all ambiguity. There are
debates on whether or not the latent variable of health that an instrument intends to
measure represents a continuum (Klapow, Kaplan & Doctor, 2002).
In health outcomes assessment, health measures almost always involve some
level of health change between two occasions that is attributable to a health care
intervention (Donabedian, 1966). For instance, people may gradually change their
perceptions about their health over the course of a treatment regimen. Rarely do
they maintain a constant perception of their health state. Hence, any point-in-time
measurement of health gives an insufficient evaluation of health status given these
inherent characteristics. Measuring health is to determine where an individual is “located”
along the health continuum (Stucki et al., 1996).
2.2 Separating the person measures from the health items
Along a health continuum, separating the person measures from the health
items enables objective assessment of health (Wright, 1999a/1999b). Take physical
science, for example: physical concepts and measurement methods are different
things. Unidimensionality becomes clear through the following analogy. Suppose
you wanted to measure the mass of objects by examining how far they penetrate a
barrier when traveling at possibly different (but constant) speeds. Physics tells us that
such a measurement method is not a pure measure of mass, because penetration is
affected by the momentum of the object, which equals mass × velocity; measurement
in this case is thus multi-dimensional. To make it unidimensional, the speeds of
all objects would have to be held constant. We can exemplify further using Newton's
second law of motion, F = ma, where the force, F, and the acceleration, a, are both
vectors that move in the same direction. The mass, m, is the given substance of an
object that remains constant for each object. Other factors held constant, all freely
falling objects accelerate uniformly along a straight line regardless of their mass.
Hence, any computation of the force, F, on any object should vary with, and only
with, the difference in their mass, m. In other words, any changes in mass do not
affect the acceleration, a, and only affect the computed force, F. The mass itself, m,
is innate to any object. Hence, m and a are separable and operate independently on F.
The case is similar in intellectual ability assessments: for example,
intelligence tests with various intellectual attributes (questions on math, English or
writing skills) are often used to measure a highly relevant dimension of general
ability (Sattler, 1988/1990, p. 577). Oftentimes the number of correct answers on a
test is counted (summated) to make an inference about a person's intelligence level.
Such an approach can be problematic when there does not exist a one-to-one
relationship between the total score and individual item endorsements (Sattler,
1988/1990; Stucki et al., 1996). When the item endorsement is confounded, the
comparability of the scores between test-takers or groups is limited. Hence, the
underlying latent trait of one's intelligence level should be independent of the items
on a test, as the items merely serve as indicators of the latent trait. They aim to
capture as much information as possible about the latent trait being measured and are
never intended to yield consequential measurement outcomes due to their own
dispositions.
Hence, a unidimensional measurement model of item response that assumes
individual health and item difficulty can be statistically separated for various health
items is called for, such as the aforementioned Rasch models.
2.3 Rasch model
Rasch models are latent trait, probabilistic measurement models that estimate
the probability of the endorsibility of an item (Rasch, 1960/1980; Luce & Tukey,
1964; Perline et al., 1979; Wright & Masters, 1982; Fisher & Wright, 1994; Fischer
& Molenaar, 1995; Tennant & McKenna, 1995; Krantz et al., 1971; Linacre,
2008a/2008b). The Rasch model was first introduced by the Danish mathematician
Georg Rasch (1960/1980) in dichotomous models for intelligence and attainment
tests. It has since been extended to other model specifications and has been increasingly
used in educational, psychological and health sciences (Wright & Stone, 1979;
Wright & Masters, 1982). For instance, amongst various other model specifications
(see Wolfe & Smith, 2007; Wright & Mok, 2000, 2004), the extended developments
of the Rasch model include the Rasch rating scale model (RSM) by Andrich
(1978/1988) for modeling polytomous data and the Rasch partial credit model
(PCM) by Masters (1982), which differs from the RSM by allowing separate
parameterization of the step difficulties for each item. Linacre (1989/1994) further
extended the Rasch model into the many-facet Rasch model (MFRM) for controlling
differential rater severity in test or performance settings.
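
For concreteness, a minimal Python sketch of the RSM and PCM category probabilities is given below (added for illustration; the function names, threshold values and the three-category item are hypothetical and are not output from Winsteps or this dissertation's analyses). It shows the one structural difference noted above: the RSM shares one set of thresholds across items, while the PCM gives each item its own.

import math

def rsm_category_probs(theta, delta_i, taus):
    """Rating scale model: P(X = x) for categories 0..m, sharing one set of
    thresholds (taus) across items; a sketch, not Winsteps output."""
    # cumulative sums of (theta - delta_i - tau_k), with the k = 0 term fixed at 0
    psi = [0.0]
    for tau in taus:
        psi.append(psi[-1] + (theta - delta_i - tau))
    denom = sum(math.exp(v) for v in psi)
    return [math.exp(v) / denom for v in psi]

def pcm_category_probs(theta, item_taus):
    """Partial credit model: identical form, but each item carries its own thresholds."""
    return rsm_category_probs(theta, 0.0, item_taus)

# Illustrative logit values only: a 3-category, EQ-5D-style item.
print(rsm_category_probs(theta=0.5, delta_i=-0.2, taus=[-1.0, 1.0]))
print(pcm_category_probs(theta=0.5, item_taus=[-1.2, 0.8]))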
The desired measurement feature of parameter separation in a Rasch model is
what Georg Rasch (1966) termed specific objectivity. Upon a satisfactory Rasch
model fit, person and item parameter estimates are independent of their sampling
distributions (Wright, 1967; Wright & Masters, 1982; Stenner, 1994; Bond & Fox,
2007). This person- and item-distribution independence allows individual
comparisons of persons as well as of items without regard to the distribution of
other persons and items in the group, which overcomes the major limitation of
traditional classical test theory, as will be discussed later. Therefore, in a Rasch model,
a person’s level on a latent trait and various items describing the latent trait can be
estimated independently yet still compared explicitly to one another (Wright &
Stone, 1979; Wright & Masters, 1982, Acton, 2003). Thus, the measurement features
of the Rasch model meet Thurstone's scaling requirement of independence of the
measurement scale from the object of measurement (Thurstone, 1927, 1928a,
1928b, 1931; Wright, 1989; Hawthorne et al., 2008). Further, through a probabilistic
approach, the Rasch model states that a more able person is more likely to endorse
the items that imply more of the latent trait (Reeve, 2003). In other words, any two
subsets of items should order persons the same way on health and health only, given
that health is the latent trait to be measured.
Such concepts can be illustrated using the commonly used indifference
curves in Figure 1 (a) and (b) below (Krantz et al., 1971), which satisfy all the
pertinent assumptions specified in microeconomics, such as completeness,
transitivity, monotonicity and convexity (Mas-Colell et al., 1995). In both panels of
Figure 1, there are three indifference curves representing varying levels of the
probability of endorsing items at a given health level (IC_1: p = 0.65; IC_2: p = 0.75
and IC_3: p = 0.85, respectively). Panel (a) exemplifies a scenario in which the
indifference curves do not cross, while panel (b) exhibits the opposite case, where
indifference curves do cross one another.
In panel (a), non-crossing indifference curves mean that for a given level of
health, there is a monotonically decreasing probability of endorsing more difficult items
and, likewise, for an item with a given difficulty level, there is a monotonically
increasing probability of endorsing the item as the health level increases. Meanwhile, on
each indifference curve, there are various combinations of health level and item
difficulty that render the same (indifferent) probability of endorsement. For instance,
on the middle indifference curve where p = 0.75, (θ_1, δ_1) ~ (θ_2, δ_2) ~ (θ_3, δ_3),
meaning that there is an equal probability of item endorsement between a patient with
lower health, θ_1, endorsing a less difficult item, δ_1, and a patient with higher health,
θ_2 or θ_3, endorsing more difficult items, δ_2 or δ_3.
Crossing indifference curves exhibited in panel (b) mean that there exist
subsets of items that order persons differently on health. This can be problematic
since, for example, for a patient with health level θ_2, there are multiple probabilities
for that patient to endorse the item with difficulty δ_4. Thus, a model specifying
the probability of item endorsement as a function of person health and item difficulty
fails. On the other hand, healthier patients with health levels θ_3 or θ_4 have a
higher probability of endorsing a more difficult item, δ_6, while having a lower
probability of endorsing a less difficult item, δ_5. Such an endorsement pattern is
counter-intuitive in terms of health behavior, calling into question the validity of the
health measurement.
In measuring health, it is important to have a preserved order in the
probability of item endorsement given the θ's and δ's. The Rasch model detects any
abnormal endorsements as exhibited in Figure 1 panel (b) and preserves the
necessary order as in panel (a).
Figure 1: Indifference curve illustration of the Rasch model
2.3.1 Rasch model specifications
In the context of health, the Rasch model requires that the probability of item
response depends on respondent health, θ_n, and the magnitude of health deficit, δ_ix.
In other words, Rasch models are probabilistic models that predict the endorsibility of
a particular rating scale category on a specific item by a respondent. A Rasch model
can be generally specified as:

P(X_ni = x | θ_n, δ_ix) = f(θ_n − δ_ix)     (1)

where P(X_ni = x | θ_n, δ_ix) is the probability that person n is assigned rating scale
category x on item i, conditioned on person n's health, θ_n, and the measure of item i's
magnitude of health deficit, δ_ix, at category x. The θ's and δ's are not directly
observable, but instead are estimated by the model from data. In some models, the
estimations of the item thresholds, τ_x, are not shown, where τ_x is a component of
δ_ix (e.g. δ_ix = δ_i + τ_x). In Equation (1), the distance between an individual's
estimated health location along the latent continuum and the item difficulty, in terms of
intensity of health deficits, governs the probability that the respondent endorses the
item. For instance, using a simple dichotomous example exhibited in the logistic ogive
shown in Figure 2 below: if θ_n − δ_ix = 0, then p = 0.5; if θ_n − δ_ix > 0, then
p > 0.5; and if θ_n − δ_ix < 0, then p < 0.5. Hence, given its model specification, the
Rasch model specified in Equation (1) is also called a two-facet measurement model,
where the two facets refer to person health, θ_n, and item difficulty, δ_ix, respectively.
Figure 2 below illustrates the logit difference estimation for a dichotomous
item, in which the y-axis is the probability of success, i.e., endorsement of a
dichotomous item, and the x-axis is the logit difference between θ_n and δ_ix. For a
polytomous item, there will be multiple curves, depending on the number of
categories an item has.
Figure 2: Logistic ogive illustration of dichotomous item
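
A minimal Python sketch of the dichotomous case in Equation (1) is shown below (added for illustration; the logit values are hypothetical and not estimates from the MEPS or BDHOS data). It reproduces the behavior described above: p = 0.5 when θ_n − δ_ix = 0, p > 0.5 when the difference is positive, p < 0.5 when it is negative, and any person-item pairs with the same difference lie on the same indifference curve.

import math

def rasch_prob(theta_n, delta_i):
    """Dichotomous Rasch model: probability that person n endorses item i,
    governed solely by the logit difference theta_n - delta_i."""
    return 1.0 / (1.0 + math.exp(-(theta_n - delta_i)))

# Illustrative logit values only.
print(rasch_prob(1.0, 1.0))   # theta - delta = 0  -> p = 0.5
print(rasch_prob(2.0, 1.0))   # theta - delta > 0  -> p > 0.5
print(rasch_prob(0.0, 1.0))   # theta - delta < 0  -> p < 0.5
# Pairs with the same difference lie on one indifference curve:
print(rasch_prob(0.5, -0.5) == rasch_prob(2.5, 1.5))  # True: both differences equal 1.0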
The Rasch model was extended to the many-facet Rasch model (MFRM) by
Linacre (1989/1994, 1994, 2008b) for incorporating more than two facets in the
model. Facet models were primarily designed for controlling differential judge or
rater severity effects in test or performance settings, but are not limited to these. For
instance, time can also be a facet controlled in an MFRM for adjusting the time
severity effect on health measures. Such is the goal of this research. Denoting the
time facet as γ_s, an MFRM that includes time as the third facet can generally be
specified as:

P(X_nis = x | θ_n, δ_ix, γ_s) = f(θ_n − δ_ix − γ_s)     (2)

Similarly, P(X_nis = x | θ_n, δ_ix, γ_s) is the probability that person n is assigned
rating scale category x on item i, conditioned on person n's health, θ_n, and the
measure of item i's magnitude of health deficit, δ_ix, at category x, subject to the
time effect, γ_s. Again, all parameters (the θ's, δ's and γ's) are not directly observable,
but instead are estimated by the model from data.
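
A corresponding sketch of the three-facet form in Equation (2) follows (again illustrative Python with hypothetical values; γ_s denotes the time facet as defined above). Holding the person and item fixed, a more severe time facet lowers the probability of endorsement.

import math

def mfrm_prob(theta_n, delta_i, gamma_s):
    """Three-facet sketch of Equation (2): endorsement probability when a time
    facet gamma_s is subtracted from the logit along with item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta_n - delta_i - gamma_s)))

# Illustrative only: the same person and item at two occasions.
# A positive gamma_s (a "harder" occasion) lowers the endorsement probability.
print(mfrm_prob(theta_n=0.8, delta_i=0.2, gamma_s=0.0))
print(mfrm_prob(theta_n=0.8, delta_i=0.2, gamma_s=0.5))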
The Rasch model specifications provided in Equation (1) and Equation (2)
give a linear relationship between the parameters. This additive property of the
Rasch models implies that all parameters are placed on a linear metric, or latent
continuum, and additively predict the probabilistic outcomes of item endorsements.
The linear relationship between the parameters is important in terms of jointly
placing respondents and items along a common metric governed by a single latent
construct in an orderly way. This allows us to examine the overall distribution of the
respondents' health and the hierarchy of the items, from which we are able to
estimate, for a respondent with a given level of θ_n, the likelihood of him or her
endorsing an item with difficulty δ_ix, depending on the location difference between
the two parameters (θ_n − δ_ix). Hence, the distances (θ_n − δ_ix) preserve the order
of the probability of endorsing an item in a Rasch model, which is underpinned by the
important parameter separation (Luce, 2000).
2.3.2 Rasch in contrast with other measurement models
Compared with the earlier theory of measurement, classical test theory (CTT),
the Rasch model provides quantitative measurement in social science comparable to
that characterized in physical science (Bond & Fox, 2007; Linacre, 2008a/2008b).
Classical test theory (CTT) is also called a true score theory that is based on random
sampling theory (Marcoulides, 1999). The groundwork of CTT was conducted by
Thorndike (1904) and extended based on the true score formula by Spearman (1907).
Classical test theory (CTT) has served as the basis for numerous health instruments.
Detailed descriptions of CTT modeling have been documented in a substantial body of
studies (Yarnold, 1982/1988; Yarnold & Prescott, 1986; Crocker & Algina, 1986;
Hambleton et al., 1991; Arnold, 1996; Schumacker, 2005; DeVellis, 2006). Basically,
in CTT, the true score is defined as the expected value of observed performance within
the context of the measurement scale. For a particular item i, its expected score (X_i)
is expressed in a simple formula: X_i = T_x + ε_i. Hence, the observed score X_i is
determined by the true score on the latent variable, T_x, and the error associated with
that item, ε_i (Spearman, 1907). The error term in CTT, ε_i, is assumed to be a single
source of random measurement error that is assumed to be equal for all individuals and
expected to cancel out as the sample size increases, hence E(ε) = 0.
Such an assumption may be overly restrictive and unrealistic in many empirical
settings (Lord, 1984). The single source of measurement error in CTT introduces an
unpleasant fact in measurement: data collected from one administration using
the same measurement scale on the same group of people may not be reproducible
(Marcoulides, 1999). Given this design, any systematic inconsistencies in the
measurement scale cannot be detected; for instance, a 10% error from a test-
retest analysis (stability over time) is treated essentially the same as a 10% error from
an internal consistency analysis (accuracy within the item domain) (Thompson &
Crowley, 1994). Further, interaction effects cannot be evaluated statistically in
CTT, since the sources of measurement error cannot be differentiated. This is
a serious shortcoming in health measurement.
Further, the theoretical underpinning of CTT builds on reliability theory, such
as reliability indexes or reliability coefficients (Crocker & Algina, 1986; Arnold,
1996). One of the most commonly used reliability coefficients is Cronbach's
coefficient alpha, computed as α = [k/(k − 1)] × (1 − Σσ_i²/σ_y²), where k is the
number of items in the scale; Σσ_i² is the sum of the variances for the items; and
σ_y² is the variance for the scale as a whole (Arnold, 1996). Such a mathematical
expression unconsciously invokes the notion that the desired reliability level could be
achieved superficially if more people were sampled or more items were added to the
scale (Schumacker, 2005; DeVellis, 2006). This makes CTT sample- and test-dependent,
i.e. persons and items cannot be estimated separately in CTT (Hays et al. 2006).
Therefore, the interpretation of each attribute can only be made in the context of the
others, and the comparability of the analysis across groups and over time is limited,
since it depends on who answered the survey and what instruments were used.
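
For readers who want to see the computation spelled out, here is a small Python sketch of Cronbach's alpha (added for illustration; the data matrix is made up and the result is not one of this dissertation's reliability estimates).

import statistics

def cronbach_alpha(item_scores):
    """Cronbach's coefficient alpha from a list of per-item score lists
    (each inner list holds one item's scores across respondents)."""
    k = len(item_scores)
    item_vars = sum(statistics.pvariance(scores) for scores in item_scores)
    totals = [sum(resp) for resp in zip(*item_scores)]   # each respondent's summated score
    total_var = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Four hypothetical items answered by five respondents (ordinal ratings).
items = [
    [1, 2, 2, 3, 4],
    [2, 2, 3, 3, 4],
    [1, 1, 2, 4, 4],
    [2, 3, 3, 4, 5],
]
print(round(cronbach_alpha(items), 3))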
This can be problematic in health measurement. For instance, if a
measurement scale consists of easier items (lower intensity in terms of health
deficits), a patient appears healthier than if the items were harder (higher intensity
in terms of health deficits). Such a problem is exacerbated in repeated measures
(Stucki et al., 1996). The limitations associated with CTT give this measurement
theory limited utility in health measurement. For widely used health instruments
that are theoretically underpinned by CTT (e.g. the SF-12v2™ and the SF-36®),
it is important to evaluate their measurement properties from an
alternative measurement perspective.
Another popular true score theory that is based on and improves upon CTT is
called generalizability theory (G theory). G theory subsumes and extends the precepts
of CTT and utilizes a method that is equivalent to the analysis of variance (ANOVA).
Detailed model specifications of G theory can be found elsewhere (Hoyt, 1941;
Cronbach et al., 1963; Cronbach et al., 1972; Kieffer, 1999; Hoyt & Melby, 1999;
Nocera et al., 2001). Basically, G theory departs from CTT by estimating multiple
sources of measurement error as well as various interaction effects in the same
analysis. Hence, the fundamental concept of reliability in CTT is replaced in G
theory by the broader notion of generalizability, which generalizes a set of observed
scores on person measures to a defined universe of situations (Cronbach et al., 1972;
Shavelson & Webb, 1991). The exact design or definition of the universe, however,
is not clearly defined in G theory (Brennan, 2000; Brennan, 2001, p2). In a way, G
theory can be viewed as a revised true score model and can be expressed as
X = T + ε_s + ε_r, where X and T are the observed score and the true score specified
as in CTT, and ε_s and ε_r represent the systematic and random errors, respectively.
The distinction that G theory makes relative to CTT allows investigators to examine
various sources of measurement error instead of one exclusive random measurement
error (Eason, 1991). These sources of variation are labeled as facets in G theory. This
notion of facets in G theory draws on a basis similar to the aforementioned many-facet
Rasch model (MFRM) (Linacre, 2008b). Facets can be either random or fixed, each
constituting a different degree of generalizability (Shavelson & Webb, 1991; Nocera et
al., 2001). Although G theory deliberately made advancements in measurement by
quantifying various sources of error, the theoretical underpinnings of G theory are
still associated with their own set of limitations.
G theory views the universe of admissible observations as involving the
“object of measurement” and the “facet of measurement” (Shavelson & Webb, 1991),
where the “object of measurement” is the focus of analysis in G theory,
corresponding to person measures. The “facet of measurement,” on the other hand,
refers specifically to the decomposition of sources of variation that may be allocated
to various facets other than person measures. Such a measurement design is
asymmetric, since the primary focus is on the measurement of individual differences:
person measures are considered the real systematic differences and are not considered
to contribute to error; only the variations in facets are (Cardinet et al., 1976; Kieffer,
1999). It may be a common conception that individual differences are the primary
interest for investigators, but it is a misconception that person differences do not
contribute to error (Nocera et al., 2001; Mushquash & O'Connor, 2006). Hence, the
usefulness of G theory is restricted to studies where individual differences are more
of a focus. Just like CTT, G theory has limited utility in health measurement.
The desired measurement properties of parameter separation and preservation
of the order between parameters cannot be achieved using true score theory (Hays et
al., 2000; Chang, 2005). Advancements in measurement led to measurement
models based on item response theory (IRT), including the Rasch models. Features
of different IRT models, their conceptual designs, strengths and weaknesses have
been documented in a substantial body of literature (Hambleton et al., 1991; van der
Linden & Hambleton, 1996; Bock, 1997; Embretson & Reise, 2000; Hays et al.,
2000; Reeve, 2003; Chang, 2005; Hays et al., 2007). Different IRT models have
been increasingly used for various testing systems. For instance, the largest testing
companies in the U.S. and in Europe are using IRT models for computerized
adaptive testing (CAT), test designs, scaling, calibration and the construction of item
banks (Bock, 1997; Hays et al., 2000; Fayers & Machin, 2007). In health outcomes
assessment, a prominent example of IRT application is the Patient-Reported
Outcomes Measurement Information System (PROMIS) project for the construction of
item banks, an NIH (National Institutes of Health) funded project (Fries & Cella,
2005; Reeve et al., 2007; Hays et al., 2007). In addition, there are also ample studies
conducted comparing IRT models with the measurement models under true score
theory, and virtually all study reports were in favor of IRT models (Kolen, 1981;
Hambleton et al., 1991; Hambleton & Jones, 1993; Baker et al., 1997; McHorney et
al., 1997; Fan, 1998; Raczek et al., 1998; Hays et al., 2000; Birbeck et al., 2000;
Hays et al., 2000; Norquist et al., 2003; Reeve, 2003; Fitzpatrick et al., 2004;
Wiberg, 2004; Hays et al., 2006; Martin, 2007). In this study, we will not venture
further into such comparative analysis; rather, we will briefly illustrate the most
commonly used IRT models and elaborate on the distinctive features of the Rasch
models.
There are many different IRT models that can be conceived. Basically, for
dichotomous IRT models, the most commonly used are the one-, two-, and three-
parameter logistic models (1PL, 2PL and 3PL). The 1PL is the one-parameter
model that adjusts for (controls) item difficulty. The 2PL relaxes the item discrimination
assumption and allows an estimated slope that, in the context of health, represents
the intensity of the health deficit. The 3PL allows an additional guessing factor into the
model, a lower-asymptote parameter that keeps the probability of success above zero
even for persons with a lower trait level (Hambleton et al., 1991; van der Linden &
Hambleton, 1996; Embretson & Reise, 2000). For polytomous IRT models, in
addition to the aforementioned Rasch RSM and PCM, the other most commonly
used polytomous IRT models are, for example, the graded-response model (GRM;
Samejima, 1969, 1996), the modified graded response model (M-GRM; Muraki,
1990) and the generalized partial credit model (G-PCM; Muraki, 1992). Briefly, the
GRM is an extension of the 2PL described earlier but accounts for multi-category
items in the scale and hence builds on both the 2PL and the PCM. The M-GRM is a
restricted GRM that facilitates the analysis of a measurement scale with items
having an equal number of response categories. Amongst all IRT models, both the RSM
and the PCM are extensions of the 1PL, and all of them belong to the Rasch family of
measurement models.
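
The nesting of these dichotomous models can be sketched in a few lines of Python (added for illustration; the parameter values are hypothetical). With a = 1 and c = 0 the function reduces to the 1PL (Rasch) form, adding a frees the slope (2PL), and adding c frees the lower asymptote (3PL).

import math

def irf_3pl(theta, b, a=1.0, c=0.0):
    """Item response function of the 3PL; with a = 1, c = 0 it reduces to the 1PL
    (Rasch form), and with c = 0 to the 2PL. Values below are illustrative only."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

theta = 0.0
print(irf_3pl(theta, b=-0.5))                 # 1PL: difficulty only
print(irf_3pl(theta, b=-0.5, a=1.7))          # 2PL: adds a discrimination (slope) parameter
print(irf_3pl(theta, b=-0.5, a=1.7, c=0.2))   # 3PL: adds a lower-asymptote ("guessing") parameter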
It is important to note that Rasch models differ from the other IRT models.
Whereas other IRT models are more appropriate when model fitting is more of a concern,
Rasch models are most applicable when strong justification of the measurement
scale properties is desired (Embretson & Reise, 2000, p. 121). Rasch models
emphasize that the collected data should fit the model instead of modifying the
measurement model in order to obtain satisfactory fit to the collected data (Wolfe &
Smith, 2007; Wright, 1999; Bond & Fox, 2007). Hence, Rasch models are distinctive
from the other IRT models in the sense that Rasch models abide strictly by the
fundamental measurement principles, with the most restrictive assumptions and
requirements (Wolfe & Smith, 2007; Wright, 1999). From this perspective, the
Rasch model is indispensable for both constructing and monitoring measurement
scales such as health instruments (Bond & Fox, 2007).
2.3.3 Rasch model for health measurement
The Rasch models have been increasingly used in health measurement for the
construction of new health measurement scales as well as the monitoring of existing
ones (McHorney & Monahan, 2004; Hays et al., 2000; Tennant et al., 2004; Pickard
et al., 2006). The desired measurement features of the Rasch models specified above
provide a useful gauge for such applications. In a Rasch model, questions with
regard to measurement properties at the item response level can be addressed. For
example, are the items in a measurement scale really measuring health as they are
intended to? Are they measuring another variable in addition to health? If they
are measuring health, what is the quantified portion of each item that is
attributable to the latent trait of health?
There are a number of advantages to using the Rasch model for health
measurement. Some are innate to the Rasch measurement design, while others are
realizable upon a satisfactory Rasch model fit.
First, a Rasch model constructs a quantitative interval scale for both health
ability and item difficulty parameter estimates, separately. The estimated measures
are reported in convenient units called logits (log odds) and are placed along a
vertical ruler representing the latent continuum of health.
Respondents' health abilities and item difficulties are placed side by side along the
vertical ruler. The estimated health measures are arranged such that higher
scores represent better ability. Just as 2 feet is twice as long as 1 foot, a respondent
with a logit of 2 is twice as healthy as a respondent with a logit of 1. Such measurement
capability is important for health measurement, since virtually all health items in a
health instrument are built on some kind of rating scale design, such as a Likert-style
scale. This kind of quantitative comparison could not be achieved under a summated
scoring system using raw observations (Wright & Linacre, 1989a/1989b; Bond &
Fox, 2007). In a rating scale, numerals are assigned to the item categories
representing various levels of health or health deficits. Data collected using rating
scales are qualitative, ordinal scores that lack the cardinal measurement
properties necessary for parametric statistical operations (Stevens, 1946; Wright &
Linacre, 1989a/1989b; Linacre, 1999). Ordinal scores can be indicative of more or less
of an entity, or of the presence or absence of a change, but they are not capable of
quantifying how much more or less and hence are less likely to be reproducible in
empirical settings (Norquist, 2004). As we stated previously, a summated score system
of adding up the counts of health events requires the assumption that there is equal
spacing between the assigned ratings (item categories) and hence invites equal attention
from the respondent to each of the ratings. Such a measurement design is only
legitimized upon a satisfactory Rasch model fit (Wright & Masters, 1982; Linacre 2008a/2008b).
Second, as Rasch model parameter estimation is based on the individual
item response level, it enables increased sensitivity in examining the difference
between what is observed and what is expected. It also enhances the patient's
contribution to treatment evaluation (Wiklund, 2004). Analyzing the endorsement
patterns of individual health items is beneficial clinically, since different items are
indicative of various underlying health concepts. The extent and hierarchy of change
in the health perceived by patients can assist clinicians in determining treatment
priority (Doward & McKenna, 2004; Lin et al. 2005). For example, when making
treatment decisions, it is important for a physician to establish that a patient's physical
illness is not associated with his/her psychological stress. Likewise, a
psychologist needs to know that a patient's psychological distress is not a result of
physical illness (Wyrwich & Wolinsky, 2000; Bradley, 2001; Barbour, 2006). Such
important clinical information may be overlooked using summated scores (Wright &
Masters, 1982; Wright & Feinstein, 1992; Stucki et al., 1996; McHorney et al., 1997;
Raczek et al., 1998).
Third, one salient advantage of using the Rasch model in health measurement is
its ability to accommodate missing data. This is a common advantage among all
measurement models at the item response level. Rasch models are well suited for
tracking health changes and are robust to missing data, such as when the same items have
not been administered at each data collection, thereby minimizing
respondent burden (Hays et al., 2000; Linacre 2008a/2008b). This measurement
design is important in health measurement since missing data have always been a
major concern in longitudinal health studies as a result of sample attrition or
selection bias (Fayers & Machin, 2007; Fairclough, 2005). Theoretically, this is
accomplished because the Rasch models are formulated at the item response level and use
an iterative response-level estimation method, the Unconditional Maximum
Likelihood Estimation (UCON) method (Wright & Masters, 1982; Linacre
2008a/2008b). This allows observations to be missing, i.e., without a rectangular
(complete) pattern, or allows the same observation to be made multiple times. In a Rasch
analysis, only observed data are used in model estimation. For each
observation in turn, its expected value, based on the current set of parameter
estimates, is summed into a marginal expected score, and its observed value is
summed into a marginal observed score for each parameter contributing to the
observation. At the end of each iteration through the available data,
each parameter estimate is adjusted according to the difference between
its observed and expected marginal scores. The iterative process continues until the
marginal expected scores for every parameter approximately equal the marginal
observed scores. In principle, only the observations that relate to a parameter are
required in order to estimate that parameter. In other words, the Rasch model does
not know that any data are missing; it only knows about the data that are present.
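To make this marginal-score logic concrete, the sketch below shows a minimal joint (unconditional) estimation loop for a dichotomous Rasch model in Python that simply skips missing cells. It is an illustrative approximation rather than the Winsteps/Facets implementation: the function name, the Newton-type update and the convergence tolerance are our own choices, and extreme (all-lowest or all-highest) response strings would still need to be removed beforehand, as in our inclusion criteria.

import numpy as np

def rasch_joint_estimate(responses, n_iter=200, tol=1e-4):
    """Illustrative joint (unconditional) estimation for a dichotomous Rasch model.
    `responses` is a persons x items array of 0/1 scores with np.nan marking
    missing cells; only observed cells contribute to the marginal scores."""
    observed = ~np.isnan(responses)
    theta = np.zeros(responses.shape[0])   # person ability estimates (logits)
    delta = np.zeros(responses.shape[1])   # item difficulty estimates (logits)
    for _ in range(n_iter):
        expected = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
        resid = np.where(observed, np.nan_to_num(responses) - expected, 0.0)
        info = np.where(observed, expected * (1.0 - expected), 0.0)
        # Move each parameter by the gap between its marginal observed and
        # marginal expected score, scaled by the information it contributed.
        theta_step = resid.sum(axis=1) / np.maximum(info.sum(axis=1), 1e-9)
        delta_step = -resid.sum(axis=0) / np.maximum(info.sum(axis=0), 1e-9)
        theta, delta = theta + theta_step, delta + delta_step
        delta -= delta.mean()              # center item difficulties to fix the scale origin
        if max(np.abs(theta_step).max(), np.abs(delta_step).max()) < tol:
            break
    return theta, delta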
Objective and invariant estimates of health are contingent upon a satisfactory
Rasch model fit. This refers to the fit of individual items and persons to the Rasch
model, gauged under a set of Rasch model requirements, primarily the
unidimensionality requirement. Violation of the unidimensionality requirement
suggests the possibility of sabotaging the construct validity of the measure. This may
result in biased health estimates that consequently lead to erroneous inferences for
health decision making. The usefulness of the health measures depends on the
stability of health measures that are bias free. Hence, the trustworthiness of a
measurement device is impaired to the extent that it varies with the object of
measurement (Thurstone, 1927/1928a/1928b/1931; Wright, 1989; Bond & Fox,
2007).
Given the complexity of empirical health data, coupled with the difficulty of
satisfying the unidimensionality requirement, some investigators may take the view that the
latent human trait is unquantifiable (Edgerton, 1990; Brown & Gordon, 1999). On
the other hand, given the indispensable advantages associated with the Rasch models
in health measurement, we view that Rasch model fit deserves an alternative
examination, as we plan to do in this dissertation.
2.3.4 Rasch model requirement: unidimensionality
The unidimensionality requirement is fundamental to all measurement
paradigms. It is the primary prerequisite of the Rasch models and is explicitly required
by all measurement models of item response. It is also an implicit requirement in
other measurement theories, such as true score theory (Smith, 1997/2000). Violation
of unidimensionality takes place when there are subsidiary latent trait(s) or
variable(s) governing certain item responses in the scale (Sprangers, 1996; Wright et
al., 2000; Norquist, 2004). If a subsidiary variable is present, the items that are
governed or influenced by the subsidiary trait may behave independently from the
rest of the items in the scale. Therefore, violation of unidimensionality suggests
unwarranted systematic noise that may degrade the measurement tool from its
desired level of accuracy. Satisfying the unidimensionality requirement using
empirical data has its own share of difficulties, largely due to the complexity of
empirical data.
2.3.5 Rasch model prior applications
To date, most Rasch analyses have been limited to cross-sectional studies on
building and refining health measurement scales or to comparing pre- and post-
measures using point-in-time assessments (Embretson, 1991; Wilson, 1992; Smith,
1997; Wolfe & Chiu, 1999a/1999b). Few Rasch studies have been conducted on the
EQ-5D and none are longitudinal. Prieto et al. (2003) used a Rasch rating scale
model (RSM) to test the cross-cultural validity of the EQ-5D in the Schizophrenia
Outpatient Health Outcomes Study (SOHO), using samples from the 10 European
countries that participated in the SOHO study. Pickard et al. (2007) extended the EQ-5D
system from its original 3-level rating scale to a 5-level rating scale and tested its
measurement properties with the Rasch partial credit model (PCM), using samples from
different study centers. A Rasch study on the SF-12v2™ conducted by McBride et al.
(2008) assessed the associations between the physical and mental
health factors of the SF-12v2™ in individuals with alcohol dependence.
Dimensionality of the SF-12v2™ was not assessed in McBride et al. (2008). More
Rasch analyses have been conducted on the SF-36® given its longer
standing in health outcomes assessment, but these too have been limited mainly to cross-sectional
or point-in-time analyses (Haley et al., 1994; Fisher et al., 1997; McHorney et al.,
1997; Prieto et al., 1997; Raczek et al., 1998; Fisher, 1999; Jenkinson et al., 2001;
Hays et al., 2007; Martin et al., 2007; Dellmeijer et al., 2007; Taylor et al., 2007).
Rasch analyses on the 10 items in the physical functioning (PF-10) subscale
demonstrated satisfactory Rasch model fit. However, physical health is just one
aspect of many in defining health. Amongst the Rasch studies on the SF-36®,
McHorney et al. (1997) reported that there are clinical advantages to using the
Rasch-derived linear measures at both individual and group levels. Raczek et al.
(1998) used both the RSM and the PCM in cross-cultural comparisons of the PF-10 items
and reported that Rasch scaling techniques are preferable to Likert-style summated
scores. Specifically, Raczek et al. (1998) reported the robustness of Rasch
modeling against missing data. An earlier study by Prieto et al. (1997) pointed out
the limitation associated with cross-sectional data for detecting clinical changes and
emphasized the importance of Rasch analysis using a longitudinal sample.
In our literature review, we noted three MFRM studies that involved
parameterizing occasion or time as a facet. Monsaas and Engelhard (1993) controlled
for the time effect using the MFRM to explore changes in the home environment of 40
children from west Georgia over a 6-month period using pre- and post-measures.
Chang & Chan (1995) used a Canadian sample of 99 subjects to evaluate a stroke program at a
rehabilitation hospital. In their study, they used both the RSM and a 3-facet Rasch
model for 3 time points spanning 7 weeks; occasion (time) was used as the third facet.
Mannarini & Lalli (2008) modeled time in an MFRM study
evaluating Psychiatric Patient Self-Awareness (PPSA) behavior. The study sample
was 48 adult Italian female psychiatric patients with a mean age of 50, assessed at two medical
visits with an interval ranging from 20 to 88 days. The authors encourage the
inclusion of a time facet in models for investigating patient behavior longitudinally and
state that such an investigation method is helpful in assessing treatment effects and
allocating clinical resources. At the same time, all three of these MFRM studies used
relatively short time spans, ranging from 20 days to 6 months. In health measurement,
however, understanding how mental health and physical health interplay over time
due to a time severity effect, and how such an effect affects the unidimensional model fit,
requires a relatively longer time frame to unfold, especially in chronic disease groups.
Hence, evaluating the fit to a unidimensional model warrants a longitudinal sample.
Hence, there is a void in evaluating the measurement properties of widely used
health instruments by statistically parameterizing time using longitudinal samples
with a relatively longer time frame.
2.4 The role of time in unidimensionality
Time plays a crucial role in health (Kaplan, 2005; Klapow, Kaplan & Doctor,
2002; Fairclough, 2005; Nord et al., 2005). The impact of illness on health
perceptions for chronically ill patients differs from that for those who have temporary,
disruptive illness episodes. Over time, different attributes of health, such as mental
and physical health, may interplay with one another. A considerable number of
studies have demonstrated a strong linkage between mental and physical
health over time (Wells et al., 1988; Hays & Stewart, 1990; Hays et al., 1994; Simon
et al., 1998; Koehler, 2003; Change et al., 2007). As disease prognoses are usually
multi-factorial, even a modest interplay between mental and physical health
may have important implications (Friedman and Booth-Kewley, 1987).
For instance, Sutherland et al. (1982) reported that over time, patients who
suffer chronic illness may become psychologically distressed due to increased frustration
with the long-lasting illness. On the other hand, the effect can also be reversed once
patients learn to adjust and adapt to the chronic illness, ending up with improved
self-perceptions of their health state compared to when it was first diagnosed
(Calman, 1984). Prior findings from the Medical Outcomes Study by Wells et al. (1988)
indicate that patients diagnosed with depressive disorder tend to have worse
physical, social and role functioning, lower perceived health and greater pain,
compared to non-chronically ill patients. In addition, the prior clinical literature has
demonstrated that time 1 depression is predictive of time 2 heart disease (Carney et
al., 2001; Rugulies, 2002; Lett et al., 2004; Matthews et al., 2005; van Melle et al.,
2005; Empana et al., 2006; Whooley, 2006; Carnethon et al., 2007). Specifically, van
Melle et al. (2005) reported that depressive symptoms are significant risk factors for poor
cardiac prognosis, with an excess risk of 2 – 2.5 fold for both mortality and
morbidity. Inspired by the findings of van Melle et al. (2005), Ziegelstein & Thombs
(2005) described the relationship between the brain and the heart as "the twain that
have met," indicating that the status of one affects the other. Further,
depressive symptoms have also been found to predict coronary heart disease (CHD)
in initially healthy people (Carney et al., 2001; Rugulies, 2002; Lett et al., 2004;
Matthews et al., 2005; Whooley, 2006; Empana et al., 2006) and an increased risk of
type 2 diabetes mellitus (Carnethon et al., 2007). Alternatively, there are reports on
how physical ailments can also lead to poor mental health outcomes (Aneshensel et
al., 1984; Barbour et al., 2006; Dew, 2001; Fukunishi, 2001; Jayaram & Gasimir,
2005; Singer et al., 2001; Taylor et al., 1985). Moreover, Dew et al. (2001) found
that poor psychological adjustment to physical illness is a major contributor to reduced
quality-of-life. For instance, for patients who underwent organ transplantation,
psychiatric disorders, particularly depression- or anxiety-related difficulties, have
been reported to be associated with the post-operative period (Mai et
al., 1990; Rodin et al., 1991; Levenson & Olbrisch, 1993; Surman, 1994; Dew et al.,
1997, 2000, 2001; Dew, 1998; Crone & Wise, 1999; Fukunishi, 2001; Singer et al.,
2001; Jayaram & Gasimir, 2005; Barbour et al., 2006). Craven (1990) reported that
as many as 50% of the patients who underwent lung transplantation suffered
psychiatric disorders.
There is a convergence pattern between physical health and mental health. In
a longitudinal study with 4-year follow-up, Hays et al. (1994) found that cross-
sectionally, physical health and mental health were moderately correlated, but
longitudinally, physical health showed a significant positive effect on mental health
while mental health revealed a significant negative effect on physical health.
Importantly, the authors found that the inter-relationship between the two tends to
stabilize over time. The authors pointed out: "In any event, understanding of the link
between self-reported physical and mental health would benefit from longitudinal
research aimed at disentangling their interrelationship." In their quality-of-life
assessment in patients with bladder cancer, Wright & Porter (2007) pointed out that one
of the main limitations of the published literature is that comparison of health
changes is primarily based on point-in-time assessments.
Hence, it is clinically and economically important to differentiate whether
health improvement is truly due to an intervention or is simply a reflection
of patients' adaptation to their chronic illnesses over time (Hardt et al., 2000). As
Smith (1997) stated, "we might expect individuals not only to change their positions
on the underlying continuum as the result of the intervention, we might also expect
individuals to change the ways that they conceptualize or execute the construct being
measured." For instance, Schwartz et al. (2005) discussed, via the concept of response shift,
how people change their internal threshold and conceptualization of health as their experience
of changes in health accumulates over time. How
response shift impacts health perception is beyond the scope of this research and has
been documented in other studies (Howard, 1980; Sprangers & Schwartz, 1999;
Schwartz & Sprangers, 1999; Schwartz et al., 2005). The point is that the presence of
response shift implies that time plays an important role in patients' self-evaluations
of health, and it may challenge the interpretation of the health items during the
course of health measurement.
Therefore, time may affect how items are endorsed. This can lead to
violation of the unidimensionality requirement and may serve as one potential source of
bias preventing a measurement scale from being invariant over time (Wright, 1996b).
This can be problematic because, as one's health status changes over time, the estimated health
ability at the time of assessment should be defined in relation to the underlying
construct and should not vary with the measurement framework (Hambleton &
Jones, 1993). If this were not the case, that is, if the difficulty of an item were not invariant
over some useful domain, then the estimated item difficulty would have little useful
meaning (Wright et al., 2000). Hence, point-in-time measurement of health is useful
but has inadequacies, as the effect of time on the dynamics of health cannot
be captured in point-in-time measures (Simon et al., 1998).
2.5 Aims of this dissertation
The aim of this dissertation is to test the role of time in affecting
unidimensionality in three widely used health instruments using the multi-faceted
Rasch measurement model (MFRM). The assumption underlying this work is that,
although satisfying the fundamental unidimensionality requirement is a concern to
many investigators conducting health measurement with empirical health data,
health measures that abide by the fundamental measurement principles are important for
obtaining probabilistic conjoint measures that are permissible for additive or
summated scores. Our research questions center on:
1. How do the mental health and physical health items interplay in terms of
quantifying the latent variable when they are placed on the same metric or
latent continuum?
2. Is unidimensionality attainable in point-in-time measures in multi-itemed
instruments that consist of both mental health and physical health
questions?
3. Will unidimensionality in a health instrument improve in longitudinal
samples after parameterizing time in the measurement model?
4. Does the preference-based measure differ from the health profile measures in
the attainability of unidimensionality, in both point-in-time measures and
longitudinal settings?
Guided by these research questions, this dissertation moves from point-in-
time analyses to longitudinal evaluations of unidimensionality. We hypothesize that
mental health and physical health items behave independently in point-in-time
measures and that unidimensionality is less likely to hold. We further hypothesize that
unidimensionality is more plausible using a longitudinal sample by parameterizing
time in the model.
CHAPTER 3: METHODS
3.1 Data
This dissertation uses two different sets of data. One is a nationally
representative U.S. data set, the Medical Expenditure Panel Survey (MEPS;
AHRQ website; Cohen, 2000, 2003). The other is a longitudinal U.S. community
cohort data set from the Beaver Dam Health Outcomes Study (BDHOS; Fryback, 1993).
3.1.1 MEPS data
The Medical Expenditure Panel Survey (MEPS) is an annual U.S. national
survey on health care use and expenditures conducted by the US Agency for Healthcare
Research and Quality, using a representative sample of the noninstitutionalized US
civilian population that oversamples blacks and Hispanics. The MEPS data
provide national survey observations from two consecutive years, which enables us
to conduct a two-time-point analysis.
The EQ-5D was included in the MEPS from 2000 to 2003. The first version
of the SF-12 was used from 2000 to 2002, and the survey then switched to the SF-12v2™
from 2003 to the present. The observations from the latest available year on the EQ-5D
and the SF-12v2™ will be extracted from the MEPS for our analysis. Hence, using
the Household Component Full-Year Files, for the preliminary point-in-time analysis,
we will extract the EQ-5D from the 2003 data files and the SF-12v2™ from the 2005
data files. These data files provide information on respondents' sociodemographics,
the item responses on the instruments, as well as medical conditions that are
identifiable using ICD-9-CM codes (Cohen, 2003). For the longitudinal two-time-point
analysis, we will extract the EQ-5D observations from panel 7 (2002-2003) and
the SF-12v2™ observations from panel 9 (2004-2005).
3.1.2 BDHOS data
The BDHOS data are made available to us through the courtesy of researchers
at the University of Wisconsin-Madison, led by Dr. Dennis Fryback. The BDHOS is a
longitudinal cohort study of health status and quality-of-life measures using a
random sample of adults aged 45 years or older. The random sample was drawn from
the Beaver Dam Eye Study cohort, which was developed in 1987 for a population-based
study of eye disease prevalence and risk factors (Fryback et al., 1993). The
longitudinal survey was first initiated in September 1990. The participants are
primarily local residents who are predominantly white (99.3%). A
detailed description of the data collection and the characteristics of the participants
can be found in Fryback et al. (1993).
The main health instrument used in the survey was the Medical Outcomes
Study Short Form-36 (SF-36®) (Ware & Sherbourne, 1992), which was
administered at four different time points. The baseline data (Time 1) consist of
1,430 respondents who completed the survey between January 15, 1991 and July 22,
1992. The second time point (Time 2) survey was administered approximately 2
years after Time 1. The Time 3 survey was administered 8 years after Time 1 and Time
4 data were collected 10 years after the survey was initiated. Hence, the BDHOS data
enable us to test our hypothesis using a much longer span of time.
For the preliminary point-in-time analysis, observations from Time 1 on the
SF-36® will be extracted. For the longitudinal analysis, observations from all four
time points will be used in the "stacked" fashion (see data matrix).
3.1.3 Inclusion/exclusion criteria
For the base year observations, we will include respondents from the MEPS data
who were 18 years or older and had complete responses. The subgroups will
be identified using the primary ICD-9-CM code for the ten most prevalent
chronic conditions. We will exclude respondents who reported perfect scores
(demonstrating ceiling or floor effects) to ensure the necessary stochastic property in a
probabilistic model. All respondents from the BDHOS data were 45 years or older
when the random cohort was formed.
3.1.4 Variables
The variables used in a measurement model of item response are the
health items pertaining to each instrument as well as individual characteristics such
as gender and disease. As patients' gender or disease severity can potentially bias
the health outcomes assessment, this information will be used for assessing
differential item functioning (DIF). Hence, 5 items from the EQ-5D and 12 items from the
SF-12v2™ will be analyzed. For the SF-36®, however, only 35 of the 36 items
will be analyzed in the Rasch model, as the second item (SF2) is a self-evaluated
transition compared with the respondent's health one year ago (Ware & Sherbourne, 1992).
This item is not part of the physical component score (PCS) or mental component
score (MCS) of the instrument (to be discussed again in the methods section) and
hence will not be used in an item response model. For both data sets, time is a
dummy variable. In MEPS, time = 1, 2; in BDHOS, time = 1, 2, 3, 4.
3.2 Health instruments
The three widely used health instruments that we intend to evaluate in this
dissertation are the EQ-5D, the SF-12v2™ and the SF-36®. The EQ-5D is a widely
used preference-based instrument for clinical and economic evaluations that involve
tradeoffs (Rosalind & de Charro, 2001; Dolan, 1997; Shaw et al., 2005;
Williams, 2005). The SF-12v2™ and the SF-36® are both multi-purpose, function-
based, short-form health profile measures that belong to the Medical Outcomes
Study (MOS) Short Form (SF) family.
The three instruments (the EQ-5D, the SF-12v2™ and the SF-36®) share a set
of common traits. They all have their origins rooted in psychometric testing (McColl,
2005; Fayers & Machin, 2007), regardless of their interpretations and applications.
These common traits are: 1) they all consist of multiple health items that aim to
represent a spectrum of health definitions; 2) both mental health and
physical health items are included in each instrument; 3) all health items consist of
polytomous rating scales representing monotonically increasing intensity in terms of the
latent trait; and 4) they are all designed to describe or measure that latent trait: health.
3.2.1 The EQ-5D
The EQ-5D succinctly consists of five items classifying health in terms of
mobility (MO), self-care (SC), usual activities (UA), pain/discomfort (PD) and
anxiety/depression (AD). The rating scale of each of the five EQ-5D items consists of three
response options corresponding to "no problem," "some problem" and "extreme
problem." Numerals 1, 2 and 3 are assigned to the three levels, characterizing
the order of the response levels such that a lower rating reflects less of the health
problem or symptom (Kind, 2005; Dolan & Roberts, 2002). A total of 243 (3^5)
possible five-digit health states can be generated from the EQ-5D descriptive system;
including 'unconscious' and 'immediate death' as two additional states yields 245
health states in total from the EQ-5D. Respondents completing the EQ-5D
questionnaire self-report the degree to which they are experiencing health-related
problems in each domain on the day of administration. Hence, an individual's self-
reported health-related quality of life (HRQoL) can be defined as a combination of
five digits. For instance, an individual with a response vector of "11122" can be
interpreted as having no problems with walking about, self-care or
performing any usual activities, but experiencing moderate pain or discomfort
and being moderately anxious or depressed.
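As a concrete illustration of the descriptive system, the short Python sketch below enumerates the 3^5 = 243 describable states and decodes a five-digit state such as "11122". The dimension and level labels are paraphrased for brevity and are not the official EQ-5D wording.

from itertools import product

DIMENSIONS = ["mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression"]
LEVELS = {"1": "no problems", "2": "some/moderate problems", "3": "extreme problems"}

# Every combination of three levels across five dimensions: 3**5 = 243 states
# (the additional 'unconscious' and 'immediate death' states are handled separately).
all_states = ["".join(levels) for levels in product("123", repeat=5)]
print(len(all_states))   # 243

def describe(state):
    """Map a five-digit EQ-5D state such as '11122' to its dimension levels."""
    return {dim: LEVELS[level] for dim, level in zip(DIMENSIONS, state)}

print(describe("11122"))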
The EQ-5D is primarily designed to yield preference-weighted health indices
for each state of health on the scale; these indices are required to construct quality-adjusted life
years (QALYs), which can be used in economic studies that involve trade-offs, such as
cost-effectiveness analysis (CEA) or cost-utility analysis (CUA) (Fanshel & Bush,
1970; Gold et al., 1996, Feeny, 2005). Much of the attention on this preference-based
instrument is on the quality weighting of the items. Individual item endorsement
underpins the health score for any respondent but has received less study, particularly
from a fundamental measurement perspective. In this study, we aim to address the
endorsement of health-related items on the EQ-5D to identify whether these items tend to
describe an underlying unidimensional construct: "health".
Figure 3 is the construct map illustration of the EQ-5D system. The single
latent construct of health is represented by the vertical directional arrow in the
middle. The respondent's ability level, or health, θ_n, is located on the left-hand side of
the map, with respondent #1 having the best health and respondent #3 having the
worst health (θ_1 > θ_2 > θ_3). The right-hand side of the map lists the three
corresponding response vectors, comprised of one response per EQ-5D item, and the
utility scores based on both the US valuation (Shaw et al., 2005) and the UK
valuation (Dolan, 1997) systems. Patients with lower or decreased health, θ_n, are
expected to be located at the lower level of the map, corresponding to more of the
health problems/symptoms, lower utility levels and higher rating scale categories.
Figure 3: Construct map illustration of the EQ-5D system
Hence, in the EQ-5D system, respondents and response vectors are placed in
hierarchical order along the common metric representing the latent trait of health.
Each row vector represents the responses of an individual with a constructed value
or index. While such values and valuations facilitate economic studies involving
tradeoffs, performance on the elements in the row vector and the vector space (column
vector) should not be ignored, as they underpin the measurement quality for each
respondent and each item, respectively.
3.2.2 The SF-12v2™ and the SF-36®
Both the SF-12v2™ and the SF-36® are widely used for characterizing health
status in clinical research, health policy evaluation, and general population health
surveys (Ware & Sherbourne, 1992; McHorney et al., 1993; Ware et al., 1995; Ware
& Kosinski, 2005; Ware et al., 2007). Just like other generic health profile measures
such as the Sickness Impact Profile (SIP) (Bergner, 1981) and the Nottingham
Health Profile (NHP) (Hunt et al., 1981), both the SF-12v2™ and the SF-36®
are widely used generic health survey questionnaires that tap a broad
spectrum of health profiles (Ware & Sherbourne, 1992; Ware et al., 1995; Ware &
Keller, 1996; Ware & Kosinski, 1996, 2005; Ware et al., 2007). The SF-12v2™ is a
twelve-item instrument and is a subset of the original SF-36® (Ware et al., 1996;
Ware et al., 2007). The detailed instrument development, item selection and scoring rules
of the two instruments are documented elsewhere (Ware & Sherbourne,
1992; Ware & Kosinski, 2005; Ware et al., 2007; Franks et al., 2003; Sengupta et al.,
2004; Kaplan et al., 2005; Brazier & Roberts, 2004, 2006; Brazier et al., 2002).
Both of these instruments consist of eight health profile scales and two
summary scores. The eight health subscales cover a broad spectrum of health
concepts, representing health in terms of physical functioning (PF), role limitations
due to physical health problems (RP), bodily pain (BP), general health perceptions
(GH), vitality (VT), social functioning (SF), role limitations due to emotional
problems (RE) and mental health (MH). The two summary scores distinctly
represent the physical and mental health components (PCS and MCS).
The mental health and physical health summary scores are the primary outcome
measures when using an SF family health profile measure.
Both the SF-12v2™ and the SF-36® are designed based on true score theory
and hence subsume the underlying assumptions pertinent to true score theory.
In terms of the scoring system, the eight scales are derived based on the norm-based
scoring (NBS) system, which delivers standardized health measures based on norms
established using a random national sample. Each scale is
scored to have the same mean (50) and the same standard deviation (10) using a T-
score transformation (Fayers & Machin, 2007, p. 219), expressed as:
T_i = 10 Z_i + 50, \qquad Z_i = \frac{X_i - \bar{Z}_{pop}}{SD_{pop}},

in which \bar{Z}_{pop} is the population mean and SD_{pop} is the population standard deviation. The two summary scores are, in turn,
derived from the eight health scales using factor analysis (Ware & Sherbourne,
1992; Ware et al., 1995; Ware et al., 1996; Ware & Kosinski, 2005; Ware et al.,
2007). The eight health scales and the two composite scores derived using the norm-
based scoring (NBS) system are the primary focus of the SF-36® and the SF-
12v2™ measures. Under the NBS system, the summated scale approach is used
(Wilson, 2005). Hence, all items are assumed to have equal value along the latent
continuum and all response categories in each item are assumed to be evenly spaced
(Brazier et al., 2002). Such a scoring system is subject to exactly the problems we
elaborated previously regarding summated scores, should the fundamental measurement
properties not hold.
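As an aside, the norm-based T-score transformation described above can be sketched as follows; the normative mean and standard deviation used in the example are placeholders rather than published SF norms.

import numpy as np

def norm_based_score(scale_scores, pop_mean, pop_sd):
    """Illustrative norm-based T-score: standardize a scale score against the
    normative population, then rescale to mean 50 and standard deviation 10."""
    z = (np.asarray(scale_scores, dtype=float) - pop_mean) / pop_sd
    return 10.0 * z + 50.0

# A scale score one SD above the (hypothetical) normative mean maps to T = 60.
print(norm_based_score([75.0], pop_mean=65.0, pop_sd=10.0))   # [60.]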
Figure 4 is the construct map illustration of the SF-12v2™ and the SF-36®
system. It shows five hypothetical respondents with five distinct health levels, θ_n,
located on the left-hand side of the map, with respondent #1 having the best
health and respondent #5 having the worst health (θ_1 > θ_2 > θ_3 > θ_4 > θ_5). The right-
hand side of the map is a hypothetical item with five rating scale categories that
range from "none of the time" to "all of the time," with assigned numerals ranging from
1 to 5. Hence, the right-hand side of the figure forms one of the column vectors
among all possible column vectors (items) in a measurement scale. Each element in
the vector represents the item endorsement of a respondent, or a health event, related
to that particular item. The assumed equal spacing and importance between
items/categories are exemplified in Figure 4.
Figure 4: Construct map illustration of the SF-12v2™ and the SF-36® system
Table 1 and Table 2 below provide overviews of the SF-12v2™ and the SF-36®
descriptive systems. In them, we outline the item descriptions, the eight
profile subscales, the two summary scores, as well as the total category levels and the
number of thresholds pertaining to each item. Information on category levels and item
thresholds is applicable in the Rasch model. There are a total of 12 items in the SF-12v2™
and 36 items in the SF-36®.
A total of 4 items in the SF-12v2™ have rating scales that run in the
opposite direction to the other items in the instrument. Similarly, in the SF-36®,
a total of 10 items are coded in the opposite direction, in terms of the
direction of health deficits, compared with the others. Hence, these items will be reverse
coded to have the same direction along the continuum as indicated in Figure 1 and
Figure 2, so that higher ratings represent better health status.
Table 1: The SF-12v2™ descriptive system
Table 2: The SF-36® descriptive system
3.3 Analytical models
For the cross-sectional analyses, we intend to apply both the Rasch rating
scale model (RSM) and the Rasch partial credit model (PCM) using Winsteps
(Linacre, 2008a). For the longitudinal analyses, we will apply the many-facet Rasch
model (MFRM) using Facets (Linacre, 2008b). The specifications of these
models are given below.
3.3.1 The Rasch rating scale model (RSM)
The Rasch rating scale model (RSM) assumes that all items have equal
discrimination and share the same threshold structure, and it describes the
probability that a specific person (n) will respond to a specific item (i) with a specific
rating scale category (x) (Wright & Masters, 1982). In this model, an exponential function is
used to estimate three sets of parameters: person n's health ability, θ_n; the
measure of item i's magnitude of health deficit, δ_i; and the rating scale thresholds, τ_j.
Specifically, the RSM is expressed as:

P(X_{ni} = x) = \frac{\exp\left(\sum_{j=0}^{x} (\theta_n - \delta_i - \tau_j)\right)}{\sum_{k=0}^{m} \exp\left(\sum_{j=0}^{k} (\theta_n - \delta_i - \tau_j)\right)}, \quad x = 0, 1, \ldots, m     (3)

All parameters are defined the same as in Equation (1), except that τ_j denotes the
threshold values and each item has m + 1 rating scale categories. For the null
category, we define Equation (4) below for notational convenience,

\sum_{j=0}^{0} (\theta_n - \delta_i - \tau_j) \equiv 0     (4)

Equation (4) is a crucial specification which permits the right-hand side of
Equation (3) to have a linear relationship between the parameters after the logit
transformation. The logit-linear form of the model is specified in Equation (5), which
demonstrates the linearity, additivity and independence properties of the Rasch
model:

\log\left(\frac{P_{nix}}{P_{ni(x-1)}}\right) = \theta_n - \delta_i - \tau_x     (5)

where,
P_{nix} is the probability that respondent n is assigned rating scale category x on item i
P_{ni(x-1)} is the probability that respondent n is assigned rating scale category x − 1 on item i
θ_n is the health status or HRQoL of respondent n, where n = 1,…,N
δ_i is the difficulty parameter for item i, where i = 1,…,I
τ_x is the threshold (step value) of the rating scale separating category x − 1 from category x (there are m thresholds for the m + 1 categories)
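The following is a minimal Python sketch of Equations (3)-(5), computing the RSM category probabilities for a single person-item encounter. The parameter values are hypothetical and chosen only to illustrate how the shared thresholds enter the model.

import numpy as np

def rsm_category_probs(theta, delta, taus):
    """Illustrative RSM category probabilities for one person-item encounter,
    following Equations (3)-(4); `taus` holds the shared thresholds tau_1..tau_m."""
    steps = theta - delta - np.asarray(taus, dtype=float)
    kernel = np.concatenate(([0.0], np.cumsum(steps)))   # category-0 term defined as 0 (Eq. 4)
    probs = np.exp(kernel - kernel.max())                # stabilize before normalizing
    return probs / probs.sum()

# A person at 0.5 logits meeting a 3-category item of difficulty 0.0, with
# hypothetical thresholds at -1.0 and +1.0 logits:
print(rsm_category_probs(theta=0.5, delta=0.0, taus=[-1.0, 1.0]))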
3.3.2 The Rasch partial credit model (PCM)
The Rasch partial credit model (PCM) is used for the purpose of validation.
Although there is a report on the equivalence of the two models' estimates (Luo, 2007),
differences in item response estimates may sometimes be attributable to the model selection
(Bond & Fox, 2007). For instance, if there are any misfits associated with the
responses to the threshold structure, such misfit information at the item level is not
detected in the RSM. Hence, it is practical to validate the findings by using an alternative
model, such as the PCM, in which thresholds are allowed to vary across items. By doing
so, we are able to investigate whether the order of steps on each item is preserved
as expected. Disordering of categories in a probabilistic model
framework implies that the disordered category never has the greatest probability
of being endorsed, regardless of where the respondent's ability is located on the linear
metric; hence, the category may be redundant (Wright & Masters, 1982; Bond &
Fox, 2007). Further, a category sometimes does not attract enough endorsements,
which is another possible reason for category disordering (Tennant, 2004).
Detecting such problems is warranted, since they reflect unnecessary respondent
burden. The PCM is specified as:
P(X_{ni} = x) = \frac{\exp\left(\sum_{j=0}^{x} (\theta_n - \delta_{ij})\right)}{\sum_{k=0}^{m_i} \exp\left(\sum_{j=0}^{k} (\theta_n - \delta_{ij})\right)}, \quad x = 0, 1, \ldots, m_i     (6)

The separate item thresholds are exhibited through the additional subscript j on the item
difficulty parameter δ_ij, where δ_ij = δ_i + τ_j. Hence, the threshold values are
represented by τ_j, which correspond to each item separately. There are m_i + 1
categories associated with each item i. In the case of the EQ-5D, m_1 = m_2 = … = m_5.
But for the other two health profile measures (the SF-12v2™ and the SF-36®), this may
not be the case. We also define the null category, for the same reason stated
previously for the RSM:

\sum_{j=0}^{0} (\theta_n - \delta_{ij}) \equiv 0     (7)

And the logit-linear form of the PCM after the logit transformation is:

\log\left(\frac{P_{nix}}{P_{ni(x-1)}}\right) = \theta_n - \delta_{ix}     (8)

The model and the parameters are specified the same as under Equation (5), except
that the threshold parameter is not specified separately in the PCM but is part of the
item step parameter, δ_ij.
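A corresponding sketch for the PCM is given below. It reuses the same kernel as the RSM sketch above, but with item-specific step difficulties, and adds a simple check for the step disordering discussed in this section; the step values shown are hypothetical.

import numpy as np

def pcm_category_probs(theta, item_steps):
    """Illustrative PCM category probabilities (Equations (6)-(8)): unlike the RSM,
    each item carries its own step difficulties delta_i1..delta_im_i."""
    steps = theta - np.asarray(item_steps, dtype=float)
    kernel = np.concatenate(([0.0], np.cumsum(steps)))
    probs = np.exp(kernel - kernel.max())
    return probs / probs.sum()

def steps_ordered(item_steps):
    """Flag disordered step estimates: in a well-functioning rating scale the
    estimated step difficulties are expected to advance monotonically."""
    return bool(np.all(np.diff(np.asarray(item_steps, dtype=float)) > 0))

# Hypothetical 5-category item whose third step is estimated below the second,
# a pattern consistent with an under-endorsed middle category:
print(steps_ordered([-2.1, -0.4, -0.6, 1.9]))   # False -> candidate for diagnosis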
3.3.3 The many-facet Rasch model (MFRM)
In this study, we intend to test the MFRM in our longitudinal samples by
including time as the third facet. By doing so, we assume that time has a severity
effect on items that produces systematic bias affecting unidimensionality. The
exponential form of an MFRM is specified below:
P(X_{nis} = x) = \frac{\exp\left(\sum_{j=0}^{x} (\theta_n - \delta_i - \gamma_s - \tau_j)\right)}{\sum_{k=0}^{m} \exp\left(\sum_{j=0}^{k} (\theta_n - \delta_i - \gamma_s - \tau_j)\right)}, \quad x = 0, 1, \ldots, m     (9)

where all parameters are defined in a similar fashion as in Equation (3), except that
there is an additional parameter, γ_s, which is the time facet we include in the model,
where s = 1,…,S for the number of time points in the data. Again, we define the
null category as below:

\sum_{j=0}^{0} (\theta_n - \delta_i - \gamma_s - \tau_j) \equiv 0     (10)

And the logit-linear form of the three-facet model is:

\log\left(\frac{p_{nisx}}{p_{nis(x-1)}}\right) = \theta_n - \delta_i - \gamma_s - \tau_x     (11)
It is possible to evaluate interaction bias between different facets in an
MFRM. As we hypothesize that the time facet plays an important role in
unidimensionality, we are interested in the time-by-item interactions. Hence, an MFRM
can be modified into various forms to meet such needs. Equation (12) below is one
example of an MFRM variation, a 'time-scale' MFRM.

\log\left(\frac{p_{nisx}}{p_{nis(x-1)}}\right) = \theta_n - \delta_i - \gamma_s - \tau_{xs}     (12)

Equation (12) proposes a model for analyzing whether or not the rating scale categories are
utilized differently at different time points, s, by allowing the thresholds τ_xs to vary with
the time point. Models of a similar kind can be conceived by including additional or
alternative facets in the model. Detailed MFRM model descriptions can be found elsewhere (Linacre, 1989/1994;
Linacre & Wright, 1990, 2002; Linacre, 2008b) and Appendix B provides our
conceptual framework on different model specifications.
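To illustrate how the time facet enters the measurement model, the sketch below extends the earlier category-probability function with a γ_s term, following Equations (9)-(11). The parameter values are hypothetical and the function is our own illustration, not part of Facets.

import numpy as np

def mfrm_category_probs(theta_n, delta_i, gamma_s, taus):
    """Illustrative three-facet Rasch model (Equations (9)-(11)): the time facet
    gamma_s enters the log-odds alongside person ability and item difficulty."""
    steps = theta_n - delta_i - gamma_s - np.asarray(taus, dtype=float)
    kernel = np.concatenate(([0.0], np.cumsum(steps)))
    probs = np.exp(kernel - kernel.max())
    return probs / probs.sum()

# The same person-item encounter at two hypothetical occasions: a more "severe"
# occasion (larger gamma) shifts probability toward the lower rating categories.
for gamma in (0.0, 0.8):
    print(gamma, mfrm_category_probs(theta_n=0.5, delta_i=0.0, gamma_s=gamma, taus=[-1.0, 1.0]))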
3.4 Analytical steps
For our preliminary cross-sectional analyses, we will use the Rasch rating
scale and partial credit models (RSM and PCM), as specified in Equation (3) through
Equation (6), to evaluate the unidimensionality of the three health instruments. We
will first enter all items of each health instrument together into the
Rasch models, as if they describe health as a collective group. By doing so, we
examine whether or not there is a unitary dimension governing the
performance of the scale as a whole (unidimensionality). Samples will be stratified
into the top 10 most prevalent chronic disease groups identified in MEPS using ICD-9-
CM codes, for both the EQ-5D and the SF-12v2™. For the Beaver Dam data on the
SF-36®, all respondents were included in the study because they were drawn from the
eye disease study cohort, and the data set does not provide detailed ICD-9-CM information on
chronic diseases. Rather, dummy coding was used to identify respondents as either
having or not having certain chronic diseases based on the survey. A total of 21
chronic health conditions were identified in the Beaver Dam data (Fryback et al.,
1993).
For our longitudinal MFRM analyses, we intend to estimate the main effects
using the model specified in Equation (9). In it, we aim to use composite
observations from all time points and obtain generalized estimates of person
ability, θ_n, item difficulty, δ_i, and the time effect, γ_s.
Comparisons of the parameter estimates between the point-in-time measures and
the longitudinal measures will be made to assess the significance of including time in
the model. We will examine any item bias associated with interaction effects with
time, γ. Except for time, γ_s, gender and health conditions enter the model as
dummy facets with their elements (data entries) anchored at 0. This means that we
are primarily interested in examining how these facets affect the item endorsements,
while direct estimates of these facets themselves are not closely examined.
Table 3: Facets and estimates in the longitudinal analysis
Facet      Label   Role in the model   Effect
Persons    θ_n     Facet 1             Main effect
Items      δ_i     Facet 2             Main effect
Time       γ_s     Facet 3             Interaction / bias effect
In Table 4 below, we illustrate the overall analytical steps. We also outline
the parameters/models to be estimated in each step.
Table 4: Analytical steps
3.4.1 Data matrices
Data layouts in Winsteps and Facets differ from one another. Because
understanding how the data sets are set up for the different analyses facilitates
understanding of the different analytical steps, we present in this section the graphical
data matrices in Figures 5, 6 and 7 below. These data matrices correspond to whether
the analysis will be conducted in Winsteps or in Facets, as specified in Table 4 above.
Winsteps is more user-friendly and capable of generating figures, graphs and other
additional information, while Facets is capable of carrying out our hypothesis testing
using longitudinal samples.
Figure 5 outlines the Winsteps data layout for the RSM and PCM in cross-
sectional studies. Figure 6 outlines the Facets data layout for a 2-facet Rasch model
for point-in-time estimations as part of the longitudinal analyses. Figure 7 outlines
the stacked longitudinal data for the many-faceted Rasch analysis using Facets; a small
illustrative sketch of such a stacked layout is given below.
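A minimal sketch of such a stacked layout, using hypothetical item names and ratings and pandas purely for illustration, is:

import pandas as pd

# One row per person x time point in the source file; item names are hypothetical.
wide = pd.DataFrame({
    "person": [1, 1, 2, 2],
    "time":   [1, 2, 1, 2],
    "item_A": [3, 2, 4, 4],
    "item_B": [2, 2, 5, 4],
})

# "Stack" the occasions: one row per person x time x item, so that responses from
# every time point feed a single set of calibrations with time available as a facet.
stacked = wide.melt(id_vars=["person", "time"], var_name="item", value_name="rating")
print(stacked.sort_values(["person", "time", "item"]).to_string(index=False))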
Figure 5: Cross-sectional Winsteps data matrix
Figure 6: Point-in-time FACETS data matrix
Figure 7: Longitudinal FACETS data matrix
3.5 Goodness-of-fit
Rasch models mainly use fit statistics as the quality control mechanism
for examining the fit of the data to the model (Wright & Linacre, 1994; Glas, 1995;
Smith, 2000; Linacre, 2008a). For items that fit the Rasch model, the fit is an indication
that the items are measuring the latent trait as they are expected to measure it (Linacre,
1998); hence, they can be added together to produce total scores. Misfit to the Rasch
model implies that there are, perhaps, other latent variables significantly affecting the
item endorsement. Misfitting items warrant further diagnostic analysis of the
measurement scale. Violation of unidimensionality implies not only that there are erratic
or noisy items in the scale, but also that the measurement invariance property of the scale is less
likely to hold. This quality control mechanism in the Rasch model increases the
probability of detecting a difference when one truly exists, and increases
statistical power, since misfitting items tend to contribute greater error variance relative
to systematic variance.
Examination of the model fit can be a two-fold process. One part is the
unidimensionality check, carried out by examining the model fit statistics and the Principal
Component Analysis of the Rasch residuals. The other entails assessing the
measurement invariance property across groups or over time (Andrich, 1988; Smith,
2004). The validity of the parameter estimates is evaluated based on the fit statistics of the data
to the model, using a pre-specified set of criteria (Wright & Masters, 1982;
Linacre, 2008a/2008b).
3.5.1 Assessing unidimensionality
Unidimensionality will be examined using item fit statistics. These include the
INFIT/OUTFIT mean square (MNSQ) as well as the INFIT/OUTFIT standardized Z
score (ZSTD) fit statistics. The INFIT/OUTFIT MNSQ tests the hypothesis "do the
data fit the model usefully," while the INFIT/OUTFIT ZSTD fit statistics are the
Student's t-statistic values adjusted to a unit normal value and test the
hypothesis "do the data fit the model perfectly" (Linacre, 2008a/2008b). The
INFIT/OUTFIT MNSQ assesses the magnitude of the misfit corrected for the
sample size; these statistics are the chi-squares divided by their degrees of freedom (d.f.)
(Linacre, 2008a, p445).
INFIT fit statistics are more influenced by unexpected inlying response patterns.
These are often more difficult to diagnose and pose a greater threat to measurement.
OUTFIT fit statistics are more influenced by unexpected outlying observations, such
as lucky guesses or careless mistakes. Hence, large OUTFIT fit statistics are usually
caused by a few aberrant observations. The residuals are calculated as the
difference between the observed and the model-expected responses; hence, for a given
observation by an individual on a particular item, x_ni, the residual can be computed
as y_ni = x_ni − E_ni. Although empirical data will rarely fit the model
perfectly, this study investigates all four fit statistics to identify aberrant item
responses from different perspectives.
There are two potential types of misfit that may be detected from the data.
One is underfit, signifying the presence of noisy parameters that degrade the
quality of the instrument and hence violate unidimensionality. The other is overfit,
indicating Guttman-like (deterministic) response patterns that may
overstate the quality of the instrument (Bond & Fox, 2007; Linacre, 2008). In this
study, we focus on the detection of any underfit that signals a departure from
unidimensionality (Smith, 2004; Bond & Fox, 2007). Hence, we are primarily
interested in identifying "noisy" item(s) that may degrade the quality of the
instrument, using INFIT/OUTFIT MNSQ > 1.4 and |INFIT/OUTFIT ZSTD| > 2 to
indicate item misfit with an approximate 5% Type I error rate (Wright &
Linacre, 1994; Smith, 2004; Linacre, 2008a/2008b). Detailed illustration and
explanation of the mathematical computation of the fit statistics can be found in Wright
& Masters (1982) and Andrich (1988).
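A simplified sketch of how these residual-based mean-squares can be computed for one dichotomous item is given below. It follows the textbook definitions (OUTFIT as the mean squared standardized residual, INFIT as its information-weighted counterpart) and omits the ZSTD transformation applied by the software, so it is illustrative rather than a reproduction of the Winsteps output.

import numpy as np

def infit_outfit_mnsq(observed, expected, variance):
    """Illustrative item-level INFIT/OUTFIT mean-squares from response residuals."""
    resid = observed - expected
    z_squared = resid**2 / variance                # squared standardized residuals
    outfit = z_squared.mean()                      # unweighted: sensitive to outliers
    infit = (resid**2).sum() / variance.sum()      # information-weighted: sensitive to inliers
    return infit, outfit

rng = np.random.default_rng(0)
expected = rng.uniform(0.2, 0.8, size=200)         # model expectations for one item
variance = expected * (1 - expected)
observed = rng.binomial(1, expected).astype(float)
infit, outfit = infit_outfit_mnsq(observed, expected, variance)
print(f"INFIT MNSQ = {infit:.2f}, OUTFIT MNSQ = {outfit:.2f}, "
      f"flagged = {max(infit, outfit) > 1.4}")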
We will use the PCM to estimate the threshold values, τ_j, to examine whether or not
each category in the rating scale has sufficient endorsement and whether the categories are
presented in an orderly fashion as expected. For a redundant category or categories,
the probability of being endorsed may be overshadowed by adjacent categories (Linacre,
2008a/2008b). The step difficulty estimates in the PCM, τ_j, provide a means for such
detection.
Under the unidimensional model approach, the ideal situation is that all
systematic variance is explained by the latent variable, and the residuals are just
random noise (Smith, 2004). The Principal Component Analysis (PCA) of Rasch
residuals quantifies the systematic variation shared by the items and detects whether
or not the data harbor more than just the single latent construct of health. We conduct
the Rasch residual factor analysis to assess whether or not the items in an instrument
load as implied by the existing scoring systems (Linacre, 1995/1999; Wright, 2000).
Based on the rule of thumb, a model is considered satisfactory if the Rasch measures
account for > 50% of the systematic variance (Linacre 2008a/2008b). Amongst the
remaining unexplained variance, the minimum amount of variance needed to identify
a secondary dimension is 5%. We accept eigenvalues > 3 as indicative of
sufficient strength (more than 3 items) embedded in a secondary dimension (see
Linacre, 2008, p. 337). Eigenvalues, also called latent roots, are a special set of
scalars used here to determine the strength of a secondary dimension (Marcus, 1988,
p145).
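As a rough illustration of this residual check, the sketch below extracts the largest eigenvalue of the inter-item correlation matrix of standardized residuals from simulated noise-only data. It is a simplified stand-in for the Winsteps principal-component routine, so the exact decomposition and variance bookkeeping differ from the software's.

import numpy as np

def first_residual_eigenvalue(observed, expected, variance):
    """Illustrative dimensionality check: the largest eigenvalue (in item 'units')
    of the inter-item correlations among standardized Rasch residuals."""
    std_resid = (observed - expected) / np.sqrt(variance)   # persons x items
    corr = np.corrcoef(std_resid, rowvar=False)
    return np.linalg.eigvalsh(corr)[-1]                      # largest eigenvalue

rng = np.random.default_rng(1)
expected = rng.uniform(0.2, 0.8, size=(300, 10))             # 300 persons, 10 items
variance = expected * (1 - expected)
observed = rng.binomial(1, expected).astype(float)
print(first_residual_eigenvalue(observed, expected, variance))  # well below 3 for pure noise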
3.5.2 Assessing measurement invariance
The measurement invariance property is evaluated using differential item
functioning (DIF) in the cross-sectional studies and using item interaction/bias in the
longitudinal studies. Evaluations of the measurement invariance property are built on
the Rasch principle that different respondents with similar levels of ability (i.e.,
health) should have similar probabilities of endorsing the items regardless of
individual characteristics (Wright et al., 2000). If an item measures the same ability
in the same way across groups then, except for random fluctuations, the same
success rate should be found irrespective of the nature of the group. Hence, items
that give different success rates for two or more groups with the same ability level
are said to display DIF (Holland & Wainer, 1993). The problem of DIF is manifested
via differential item responses between groups that are largely related to individual
characteristics such as gender or disease severity. In the absence of randomized clinical
trials, DIF is a useful statistical technique for quantifying group differences in item
response patterns (Fayers & Machin, 2007, p. 177-185). If DIF is detected, a
post hoc data analysis may be performed by treating items that exhibit DIF as
different items for different groups (Tennant & Pallant, 2007).
In cross-sectional analyses, we will conduct the standardized difference test
on item estimates between discrete occasions using Equation (13) below, which
accounts for the modeled standard errors.

D = \frac{\hat{\delta}_m - \hat{\delta}_{m+1}}{\sqrt{[se(\hat{\delta}_m)]^2 + [se(\hat{\delta}_{m+1})]^2}}     (13)

where \hat{\delta} denotes the item difficulty parameters, D is the standardized difference, and
m is the indicator for the time points whose calibrations are compared. Using 5% as our
a priori significance level, we take |D| > 2 to indicate a statistically significant
difference between the two calibrations, which would lead to unstable measures between the
two occasions or two groups (Wright & Masters, 1982; Wolfe & Chiu,
1996a/1996b).
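A worked sketch of Equation (13), using hypothetical calibrations and standard errors, is:

import math

def standardized_difference(d_time1, se_time1, d_time2, se_time2):
    """Illustrative standardized difference between two calibrations of the same
    item (Equation 13), e.g. estimated at two discrete occasions."""
    return (d_time1 - d_time2) / math.sqrt(se_time1**2 + se_time2**2)

# Hypothetical calibrations of one item at two time points:
D = standardized_difference(0.35, 0.08, 0.62, 0.09)
print(round(D, 2), "unstable" if abs(D) > 2 else "stable")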
In longitudinal analyses, Facets models enable the modeling of multiple facets
to account for any bias arising from interaction effects. If there is any significant item
interaction/bias associated with any of the facets, the invariance property of the
measurement scale is also threatened (Linacre, 2008b). Detection of such problems
prompts diagnostic consideration of item/person performances (Bond & Fox, 2007;
Linacre, 2008a/2008b).
3.6 Assessing productivity of the facets
The primary interest of the study is to use the multi-faceted Rasch model
(MFRM) to test the hypothesis of whether or not a set of elements (items) can be
regarded as sharing (quantifying) the same measure after allowing for measurement
error (Linacre, 2008b). Since model estimation with the inclusion of additional
parameter(s) will necessarily change, and most likely improve, the overall model fit due to
the change in model specification, it is crucial that we conduct a validation test on
the significance, or productivity, of the new parameters. In this dissertation, we
plan to use the "fixed all-same" chi-square (χ²) test to validate the productivity of the
facets in the model, particularly the time facet. The statistical formulation of
this chi-square test on the time facet is:
w_s = \frac{1}{SE_s^2}     (14)

where s = 1,…,S for the number of time points in the data. The χ² is specified as:

\chi^2 = \sum_{s} w_s \gamma_s^2 - \frac{\left(\sum_{s} w_s \gamma_s\right)^2}{\sum_{s} w_s}     (15)

where γ_s is the time facet estimate, with d.f. = S − 1.
We will also examine the productivity of the person (θ_n) and item (δ_i) facets. For the
person facet, we substitute θ for γ in Equation (15); for the item facet, we substitute δ for
γ in Equation (15). A detailed explanation of this statistical formulation can be found in
Linacre (2008b, p150-151). The significance of the χ² indicates the productivity of
including the particular facet in the model.
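The sketch below applies Equations (14)-(15), as reconstructed here, to hypothetical time-facet estimates; the weights, test statistic and degrees of freedom follow the formulation above, with scipy used only to obtain the p-value.

import numpy as np
from scipy.stats import chi2

def fixed_all_same_chisq(estimates, standard_errors):
    """Illustrative 'fixed all-same' homogeneity test (Equations 14-15): can the
    elements of a facet (e.g. the S time points) be regarded as sharing one
    common measure, apart from measurement error?"""
    est = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(standard_errors, dtype=float) ** 2              # Equation (14)
    stat = float(np.sum(w * est**2) - np.sum(w * est) ** 2 / np.sum(w))  # Equation (15)
    df = est.size - 1
    return stat, df, chi2.sf(stat, df)

# Hypothetical time-facet estimates (logits) across four occasions:
print(fixed_all_same_chisq([-0.20, -0.05, 0.10, 0.18], [0.04, 0.04, 0.05, 0.05]))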
3.7 Statistical analysis
Parameter calibration in a Rasch model is based on Unconditional
Maximum Likelihood Estimation (UCON) iteration to obtain the estimates,
standard errors and fit statistics (Linacre, 2008a/2008b). The residuals are assumed
to follow the type 1 extreme value (log-Weibull) distribution given the
Rasch model specification (Rasch, 1960/1980; Andrich, 1978/1988). Linear
measures are first generated from the Rasch models. We report the item calibrations
in statistically convenient logits (log-odds units). Since multiple comparisons are
involved, a Bonferroni adjustment is used to adjust the Type I error rate. The a priori
significance level is set at 5%. We performed data cleaning and preparation using
SAS 9.1.3 (SAS Institute Inc., Cary, NC, U.S.A.) and implemented the Rasch models
in the cross-sectional analyses using Winsteps 3.67 (Linacre, 2008a) and the Facets
models in the longitudinal analyses using Facets 3.64 (Linacre, 2008b).
CHAPTER 4: RESULTS
4.1 Cross-sectional analysis
In this section, we report the cross-sectional results; findings from the
longitudinal analyses are reported in section 4.2.
4.1.1 Sample descriptive
Table 5 below gives the sample descriptive statistics for both the EQ-5D and the SF-
12v2™ from the MEPS. After applying the inclusion/exclusion criteria, we obtained a total of
2,677 respondents with complete EQ-5D responses from the 2003 MEPS and 5,151
respondents with complete SF-12v2™ responses from the 2005 MEPS.
EQ-5D respondents had a mean age of 54.3 (±17.13), with 63.91% female,
77.51% white and 16.66% black. SF-12v2™ respondents were slightly younger, with
a mean age of 52.73 (±16.54), 57.56% female, 76.53% white and 17.14% black. For
both the EQ-5D and SF-12v2™ respondents, the most prevalent chronic health
conditions were hypertension (EQ-5D: 22.90%; SF-12v2™: 28.50%), followed by
depression (EQ-5D: 16.10%; SF-12v2™: 10.17%), diabetes (EQ-5D: 14.68%; SF-
12v2™: 14.25%) and back disorder (EQ-5D: 11.69%; SF-12v2™: 10.89%). This
descriptive information is well representative of a U.S. national sample (Table 5).
Table 5: Descriptive of the MEPS sample
EQ-5D (n = 2,677) SF-12v2™ (n = 5,151)
Mean age (SD) 54.3 (17.13) 52.73 (16.54)
n % n %
Female (%) 1711 63.91% 2965 57.56%
Race
White 2075 77.51% 3942 76.53%
Black 446 16.66% 883 17.14%
Asian 74 2.76% 178 3.46%
Other 82 3.06% 148 2.87%
Disease
Hypertension 613 22.90% 1468 28.50%
Depression 431 16.10% 524 10.17%
Diabetes 393 14.68% 734 14.25%
Back disorder 313 11.69% 561 10.89%
Arthropathy 307 11.47% 379 7.36%
Anxiety 165 6.16% 269 5.22%
Asthma 133 4.97% 305 5.92%
Joint disorder 123 4.59% 216 4.19%
Cholesterol 115 4.30% 429 8.33%
Chronic sinusitis 84 3.14% 266 5.16%
Table 6 below gives the sample descriptive statistics from the BDHOS data. A total of
1,430 patients were included in the data set in the base year, with a mean age of 64
years (±10.76); 840 of them were female (58.74%). Ethnicity
information was not collected in the data set since the community population was
predominantly white (over 95%, Fryback, 1993). There were 21 chronic health
conditions identified in this cohort, with overlapping counts in each condition due to
the sampling method during the survey (Fryback, 1993). In other words, in this cohort, each
patient may have more than one health condition reported and may therefore appear in
more than one disease group.
The most prevalent chronic health conditions reported in this cohort were
arthritis (47.97%), followed by other chronic pain (43.92%), hypertension (43.29%)
and back pain (32.73%). Some of the most prevalent health conditions in MEPS, such
as diabetes and depression, were not as prevalent in this community cohort,
suggesting that cohort differences exist between the two samples.
Table 6: Descriptive of the BDHOS sample
SF-36® (N = 1,430, base year)
Mean age (SD) 64.0 (10.76)
n %
Female 840 58.74%
Disease
Arthritis 686 47.97%
Other chronic pain 628 43.92%
Hypertension 619 43.29%
Back pain 468 32.73%
Ulcer 204 14.27%
Migraine 189 13.22%
Goiter 182 12.73%
Sleep disorder 169 11.82%
Neck pain 145 10.14%
Diabetes 142 9.93%
Gout 123 8.60%
Depression 118 8.25%
Angina 116 8.11%
Myocardial infarction (MI) 106 7.41%
Anxiety 78 5.45%
Bronchitis 75 5.24%
Asthma 68 4.76%
Stroke 56 3.92%
Congestive heart failure (CHF) 45 3.15%
Emphysema 45 3.15%
Colitis 45 3.15%
Note:
The total sample size in the BDHOS data is N = 1,430 at the base year (Time 1). There is overlap in
the disease stratification; the stratification is based on the prevalence of each disease category.
4.1.2 Goodness-of-fit
We used both the rating scale model (RSM) and the partial credit model
(PCM) for a thorough evaluation of item properties for all three instruments across
the different disease groups (10 disease groups in MEPS and 21 groups in BDHOS).
We found comparable results between the models and report the item fit
statistics from the RSM due to space limitations. Table 7 through Table 11 below
report the item fit statistics, including item point-biserial information, for the
three instruments.
Table 7 and Table 8 exhibit the EQ-5D item fit to the models. Based on all
four sets of fit statistics (INFIT/OUTFIT MNSQ and INFIT/OUTFIT ZSTD), the
mental health item “anxiety/depression” consistently showed misfit to both models
(INFIT/OUTFIT MNSQ > 1.40 and INFIT/OUTFIT ZSTD > 2.0). After relaxing the
constraint on item thresholds in the PCM, the item “mobility” demonstrated some
misfit in some of the disease groups, such as hypertension, depression, diabetes,
arthropathy and anxiety. All items showed positive point biserial correlations,
indicating that each item measured some proportion of the latent trait as expected.
At the same time, the mental health item “anxiety/depression” had the lowest point
biserial correlation among the five items, suggesting that this item had a lower
correlation with the latent trait than the other items. Hence, it may be
measuring variables other than the latent trait of health, which was
manifested in its misfit to the RSM.
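The misfit criteria above are the conventional residual-based fit statistics. As a sketch of how they are usually defined (not necessarily the exact computation in the estimation software), let z_{ni} = (x_{ni} - E_{ni}) / \sqrt{W_{ni}} be the standardized residual of person n's response to item i, where E_{ni} is the Rasch-expected score and W_{ni} the model variance; then
\text{OUTFIT}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2}, \qquad \text{INFIT}_i = \frac{\sum_{n} W_{ni}\, z_{ni}^{2}}{\sum_{n} W_{ni}}.
Values near 1.0 indicate responses consistent with the model; mean-squares above 1.40 are flagged as misfit here, and the ZSTD statistics are these mean-squares transformed to approximate unit-normal deviates, which is why they are sensitive to sample size.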
76
Table 9 and Table 10 report the SF-12v2™ item fit to the models. The
misfits were predominantly among the MCS items and were mostly brought out by
INFIT/OUTFIT ZSTD, suggesting that a sample size effect may be present.
For instance, by INFIT/OUTFIT MNSQ, the mental health item
“calm and peaceful” was the only item that showed misfit in diseases such as
diabetes and asthma, while the other items showed reasonable fit. The fit of the SF-
12v2™ items to the PCM is better than their fit to the RSM, suggesting
that relaxing the constraint on the item step values gives items a better fit. Item
SF5, which assesses whether patients’ bodily pain interferes with normal
work, showed misfit in some of the disease groups along with the MCS items.
Positive point biserial correlations were found for all of the SF-12v2™ items,
indicating that all SF-12v2™ items were measuring some proportion of the latent
trait.
Table 11 shows the SF-36® item fit to both the RSM and the PCM using the whole
sample (N = 1,430). Misfitting items were again mostly among the MCS items.
Of the PCS items, item 21, “bodily pain magnitude,” and item 35, “expect health
to get worse,” showed severe misfit along with the MCS items. As with the SF-
12v2™, item fit to the PCM was somewhat better than to the RSM,
especially by INFIT/OUTFIT MNSQ. This again suggests a
sample size effect brought out by the INFIT/OUTFIT ZSTD, and indicates that allowing each
item its own thresholds improved the overall
model fit in both health profile measures. We also examined item fit within each
of the 21 disease groups; overall, the pattern of item fit across the 21
disease groups was consistent with that of the whole sample. Given the space limitation, we
report only the item fit using the whole sample. Further, as in the other two
instruments, MCS items were more likely to misfit the models. PCS items SF21 and
SF35 consistently exhibited misfit along with the MCS items. Perhaps the measurement
of bodily pain (SF21) and of subjective expectations about health (SF35) is more in
line with the mental health aspect of health measurement and hence
more susceptible to item bias. In health conditions such as arthritis and chronic or back
pain, items that assess physical functioning (PF) related to climbing stairs (SF7) and to
bending, kneeling, lifting and stooping (SF8) were more likely to show misfit. This
implies that items evaluating such bodily functions were susceptible to
condition-related bias in samples experiencing these diseases.
78
Table 7: Cross sectional fit of the EQ-5D items to the RSM
79
Table 8: Cross sectional fit of the EQ-5D items to the PCM
80
Table 9: Cross sectional fit of the SF-12v2™ items to the RSM
81
Table 9: Continued
82
Table 9: Continued
83
Table 10: Cross sectional fit of the SF-12v2™ items to the PCM
84
Table 10: Continued
85
Table 10: Continued
86
Table 11: Cross sectional fit of the SF-36® items (n = 1,430)
                 Rating Scale Model (RSM)                              Partial Credit Model (PCM)
Item  Summary    INFIT MNSQ  OUTFIT MNSQ  INFIT ZSTD  OUTFIT ZSTD      INFIT MNSQ  OUTFIT MNSQ  INFIT ZSTD  OUTFIT ZSTD
SF1 PCS 0.71 0.76 -9.40 -7.00 0.89 0.90 -3.20 -2.60
SF3 PCS 0.60 0.60 -9.90 -9.90 0.80 0.79 -6.80 -5.50
SF4 PCS 0.51 0.48 -9.90 -9.90 0.83 0.71 -3.90 -3.40
SF5 PCS 0.49 0.46 -9.90 -9.90 0.88 0.58 -2.00 -3.70
SF6 PCS 0.64 0.60 -9.90 -9.90 0.89 0.80 -2.80 -3.00
SF7 PCS 0.47 0.45 -9.90 -9.90 0.91 0.63 -1.20 -2.80
SF8 PCS 0.73 0.72 -7.70 -7.20 0.93 0.87 -2.10 -2.40
SF9 PCS 0.81 0.74 -4.80 -6.10 0.92 0.89 -1.90 -1.20
SF10 PCS 0.53 0.48 -9.90 -9.90 0.81 0.60 -4.00 -3.80
SF11 PCS 0.40 0.39 -9.90 -9.90 0.85 0.44 -2.00 -4.00
SF12 PCS 0.43 0.43 -9.90 -9.90 0.92 0.65 -0.60 -1.70
SF13 PCS 0.90 0.69 -1.60 -3.50 0.89 0.71 -2.00 -2.70
SF14 PCS 0.81 0.68 -5.20 -6.50 0.80 0.71 -6.00 -4.30
SF15 PCS 0.78 0.65 -6.40 -7.70 0.77 0.64 -7.20 -5.70
SF16 PCS 0.79 0.61 -4.60 -6.30 0.79 0.63 -5.10 -4.70
SF17 MCS 0.86 0.55 -1.40 -3.50 0.94 0.56 -0.50 -2.70
SF18 MCS 0.92 0.61 -1.00 -3.80 0.92 0.56 -1.10 -3.60
SF19 MCS 0.91 0.60 -1.00 -3.30 0.94 0.60 -0.60 -2.60
SF20 MCS 0.70 0.65 -8.70 -9.90 0.94 0.78 -1.00 -2.40
SF21 PCS 1.75 1.77 9.90 9.90 1.29 1.34 7.40 7.50
SF22 PCS 0.78 0.78 -6.30 -6.10 0.98 0.88 -0.40 -1.60
SF23 MCS 1.45 1.60 9.90 9.90 0.94 0.91 -1.50 -2.40
SF24 MCS 2.12 1.94 9.90 9.90 1.44 1.58 8.10 8.80
SF25 MCS 1.84 1.21 9.90 2.40 1.12 1.01 1.30 0.10
SF26 MCS 1.52 1.50 9.90 9.90 1.35 1.38 7.50 7.50
SF27 MCS 1.68 1.83 9.90 9.90 1.00 0.97 0.00 -0.60
SF28 MCS 1.61 1.52 9.90 9.90 1.26 1.27 5.30 5.00
SF29 MCS 1.77 1.62 9.90 9.90 1.18 1.16 3.80 2.90
SF30 MCS 1.31 1.24 7.30 5.50 1.19 1.14 4.20 3.00
SF31 MCS 1.21 1.27 5.40 6.40 1.09 1.06 2.00 1.60
SF32 MCS 0.82 0.73 -5.00 -7.30 1.07 1.00 1.10 0.10
SF33 PCS 0.85 0.89 -4.30 -3.10 1.12 1.07 2.60 1.30
SF34 PCS 0.90 1.00 -3.00 0.00 1.08 1.07 1.70 1.50
SF35 PCS 1.22 1.31 6.00 7.80 1.39 1.54 9.60 9.90
SF36 PCS 0.87 1.04 -3.90 1.00 0.83 0.85 -4.20 -3.30
Note:
- Misfit items: INFIT MNSQ > 1.4
- For item descriptions, please refer to Table 2
88
4.1.3 Differential item functioning (DIF)
Table 12 and Figure 8 below examine the gender-related DIF of the EQ-5D.
Items “mobility” and “anxiety/depression” showed significant gender-related DIF
(P < 0.01). For “mobility,” it was more difficult for females than males to endorse
this item (positive DIF contrast: 0.35, P < 0.01). For the mental health item
“anxiety/depression,” it was more difficult for males than females to endorse this item
(negative DIF contrast: -0.27, P < 0.01). These findings mean that males and
females with the same level of health endorsed these EQ-5D items differently
simply because of their gender.
We examined disease-related DIF in Figure 9. Once again,
items “mobility” and “anxiety/depression” showed significant disease-related DIF
(P < 0.01). Interestingly, the depression (Group C) and anxiety (Group I) groups showed more
severe disease-related DIF than the other disease groups (Figure 9). These
findings suggest that patients diagnosed with mental health problems endorsed the
mental health items with greater systematic variance.
89
Table 12: Test gender-related differential item functioning (DIF) in EQ-5D
                    Female                Male
                    DIF Measure (S.E.)    DIF Measure (S.E.)    DIF Contrast    Standardized DIF Difference |D|
Mobility -0.08 (0.06) -0.43 (0.08) 0.35* 3.50**
Self-Care 2.89 (0.09) 2.84 (0.11) 0.05 0.35
Usual Activities -0.60 (0.06) -0.57 (0.08) -0.03 -0.30
Pain/Discomfort -0.60 (0.06) -0.57 (0.08) -0.03 -0.30
Anxiety/Depression -1.55 (0.05) -1.28 (0.07) -0.27* -3.14**
Note:
* Significant after Bonferroni adjustment for Type I error, p < 0.01; Positive DIF means
item is more difficult for female than male and negative DIF means item is more difficult
for male than female
** Significant if |D| > 2 corresponding to the 5% Type I error rate; |D| = Standardized
contrast computed using the standardized difference formula which takes the standard
error into consideration
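The standardized DIF difference referred to in the note is, in its usual form (a sketch rather than the software's exact routine), the DIF contrast divided by the joint standard error of the two group-specific item difficulties:
|D| = \frac{\left| \hat{d}_{F} - \hat{d}_{M} \right|}{\sqrt{SE_{F}^{2} + SE_{M}^{2}}}.
For example, for “mobility,” (-0.08 - (-0.43)) / \sqrt{0.06^{2} + 0.08^{2}} = 0.35 / 0.10 = 3.50, which reproduces the value in Table 12.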
90
Figure 8: Test gender-related differential item functioning (DIF) in EQ-5D
91
Figure 9: Test disease-related differential item functioning (DIF) in EQ-5D
92
Table 13 and Figure 10 below exhibit the gender-related DIF of the SF-
12v2™ items. From Table 13, five SF-12v2™ items demonstrated
gender-related DIF using the standardized DIF difference (SF1, SF2b, SF3a, SF6a
and SF7; |D| > 2). The item with the largest gender-related DIF was the mental health
item SF6a, which asks respondents whether they felt calm and peaceful; it was more
difficult for males to endorse this item than for females (negative
DIF contrast: -0.19, P < 0.04 after Bonferroni adjustment for Type I error). After
standardizing the differences by taking the standard errors into account, more items
showed gender-related DIF. Similar findings are exhibited in Figure 10.
Figure 11 exhibits the disease-related DIF of the SF-12v2™ items.
Significant disease-related DIF was found on all 12 items. In addition, the DIF
measures in the depression (Group C) and anxiety (Group I) groups were more erratic
than in the other disease groups. This finding is consistent with what we
found previously in the EQ-5D responses. Further, the findings suggest that health
measurement using the SF-12v2™ is condition dependent.
93
Table 13: Test gender-related differential item functioning (DIF) in SF-12v2™
        Female                Male
        DIF Measure (S.E.)    DIF Measure (S.E.)    DIF Contrast    Standardized DIF Difference |D|
SF1 0.71 (0.02) 0.84 (0.03) -0.12* -3.33**
SF2a 1.90 (0.03) 1.97 (0.03) -0.08 -1.88
SF2b 2.20 (0.03) 2.11 (0.03) 0.09 2.12**
SF3a -0.36 (0.02) -0.44 (0.03) 0.08 2.21**
SF3b -0.59 (0.03) -0.62 (0.03) 0.03 0.71
SF4a -1.09 (0.03) -1.13 (0.03) 0.04 0.94
SF4b -1.22 (0.03) -1.28 (0.03) 0.06 1.41
SF5 -0.54 (0.03) -0.54 (0.03) 0.00 0.00
SF6a 0.06 (0.02) 0.25 (0.03) -0.19* -5.27**
SF6b 0.59 (0.02) 0.59 (0.03) 0.00 0.00
SF6c -0.66 (0.03) -0.66 (0.03) 0.00 0.00
SF7 -1.02 (0.03) -1.15 (0.03) 0.13* 3.06**
Note:
* Significant after Bonferroni adjustment for Type I error, p < 0.04; Positive DIF means
item is more difficult for female than male and negative DIF means item is more difficult
for male than female
** Significant if |D| > 2 corresponding to the 5% Type I error rate; |D| = Standardized
contrast computed using the standardized difference formula which takes the standard
error into consideration
94
Figure 10: Test gender-related differential item functioning (DIF) in SF-12v2™
95
Figure 11: Test disease-related differential item functioning (DIF) in SF-12v2™
96
Table 14 and Figure 12 below demonstrate the gender-related DIF of the SF-36® items.
Three mental health items, one vitality item and one general
health item showed gender-related DIF (SF24, SF26, SF28, SF29 and SF34). By
standardizing the differences and taking the standard errors into consideration, more
items showed gender-related DIF. The items with the largest DIF were SF24 and
SF26, which ask respondents whether they were a nervous person and whether
they felt calm and peaceful. For both of these items, males had more
difficulty endorsing the items than females.
In terms of disease-related DIF of the SF-36®, again given the space
limitation, we report the three most prevalent disease groups out of the 21
groups (arthritis, n = 686; other chronic pain, n = 628; and hypertension, n = 619).
We created a binary indicator for each disease, comparing respondents who had the
disease (coded 1) with those who did not (coded 0). In all three disease groups, we
found that most of the SF-36® items demonstrated significant disease-related DIF
(Figures 13, 14 and 15; P < 0.001). That is, respondents diagnosed with these
chronic health conditions endorsed the SF-36® items differently from those who did
not experience them, even though they were located at the same level of
health along the latent trait. These findings suggest that the measurement
of health using the SF-36® is also condition-dependent, as with the SF-12v2™.
97
Table 14: Test gender-related differential item functioning (DIF) in SF-36®
98
Figure 12: Test gender-related differential item functioning (DIF) in SF-36®
99
Figure 13: Test arthritis-related differential item functioning (DIF) in SF-36®
100
Figure 14: Test chronic pain-related differential item functioning (DIF) in SF-36®
101
Figure 15: Test hypertension-related differential item functioning (DIF) in SF-36®
102
4.1.4 Principal Component Analysis of Residuals (PCAR)
We performed the Principal Component Analysis of Residuals (PCAR) to
examine the overall explanatory power of the items in each of the three instruments.
In Tables 15, 16 and 17 below, we report 1) the total variance explained by the
dominant dimension, i.e. the Rasch measure; 2) the total variance explained by
persons; 3) the total variance explained by items; 4) the unexplained variance
attributable to the 2nd dimension and 5) the eigenvalue of the 2nd dimension, which
indicates the strength of the secondary dimension. The total variance explained gives
the overall strength of the instrument in measuring health, subdivided into measures
by persons and measures by items. This information is insightful for examining how
person measures (βn’s) and item measures (δi’s) contribute toward the overall strength
of the measures in a particular instrument.
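As an illustration of the key PCAR quantity, the eigenvalue of the residual contrast, the computation can be sketched as follows (a minimal example in Python; the function and argument names are ours, and the actual analyses were run in the dedicated Rasch software, not with this code):
    import numpy as np

    def first_contrast_eigenvalues(observed, expected, model_var):
        """Eigenvalues of the item-by-item correlation matrix of standardized
        Rasch residuals. The largest eigenvalue gives the strength, in item
        units, of the first residual contrast (a possible 2nd dimension)."""
        z = (observed - expected) / np.sqrt(model_var)   # standardized residuals
        corr = np.corrcoef(z, rowvar=False)              # correlate items over persons
        return np.sort(np.linalg.eigvalsh(corr))[::-1]   # descending eigenvalues
This is the sense in which the eigenvalues reported below are interpreted: values below about 3 are treated as noise, while larger values indicate a residual cluster with the strength of several items.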
Table 15 exhibits the PCAR for the EQ-5D. We see that 6 out of 10 disease
groups had > 50% of the total variance explained. The anxiety group had the highest
total variance explained (74.1%) while the joint disorder group had the lowest (43.9%). In most cases,
person measures contributed more than item measures to the total variance explained.
The unexplained variance in the 2nd dimension had eigenvalues < 3 in all disease
groups, meaning that the 2nd dimension embedded in the measure did not have
enough strength to form an independent dimension. Interestingly, the depression and
anxiety groups had the highest total variance explained by the Rasch measure (73.7%
and 74.1%).
103
Table 16 below reports the PCAR of the SF-12v2™. All 10
disease groups had > 50% of the total variance explained by the Rasch measure. The
SF-12v2™ items contributed an amount comparable to that of the persons toward the
overall measure. Eigenvalues in the depression and anxiety groups were > 3 (3.1 and 3.5,
respectively), indicating the presence of a 2nd dimension with the strength of more than
3 items.
Table 17 below exhibits the PCAR for the SF-36®. All disease groups had
> 70% of the total variance explained by the Rasch measure, except the anxiety group
(67.5%). Two things were rather noticeable in the PCAR findings for the SF-36®
instrument. First, across all 21 disease groups, > 60% of the total variance explained
was attributable to the items. Second, high eigenvalues were found in all disease
groups, ranging from 4.7 to 5.8, indicating a strength of 5 to 6 items
embedded in a 2nd dimension.
104
Table 15: Principal Component Analysis of Residuals (PCAR) of the EQ-5D
Table 16: Principal Component Analysis of Residuals (PCAR) of the SF-12v2™
105
Table 17: Principal Component Analysis of Residuals (PCAR) of the SF-36®
106
4.2 Longitudinal analysis
In this section, we report the findings from the longitudinal analysis using
composite data matrices extracted from the MEPS panel data sets and from all 4
time points in the BDHOS data, which has a 10-year follow-up.
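As a schematic of how such a composite (stacked) matrix is laid out (a minimal sketch with hypothetical column names; the actual MEPS and BDHOS variable names differ, and the analyses themselves were run in FACETS rather than with this code), each person-by-item response is repeated once per survey round and tagged with its time point, which later enters the model as the third facet:
    import pandas as pd

    def stack_rounds(rounds, item_cols):
        """rounds: list of wide DataFrames, one per time point, each with a
        'person_id' column and one column per item. Returns one long matrix
        with person, item, response and time columns for a three-facet run."""
        frames = []
        for t, wide in enumerate(rounds, start=1):
            long = wide.melt(id_vars="person_id", value_vars=item_cols,
                             var_name="item", value_name="response")
            long["time"] = t              # the time point becomes the third facet
            frames.append(long)
        return pd.concat(frames, ignore_index=True)
Stacking in this way keeps every person and item on a single frame of reference, so the same person measure and item difficulty are estimated from responses at all time points, with the time facet absorbing systematic period effects.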
4.2.1 Goodness-of-fit
Given the comparable results from INFIT MNSQ and OUTFIT MNSQ, we
report only results using INFIT MNSQ (misfit when > 1.40) for all three instruments.
Table 18 reports the item fit statistics for the EQ-5D at time 1, at time 2, and
longitudinally. For both time 1 and time 2, the mental health item
“anxiety/depression” misfit the model in at least 5 out of 10 disease groups. When we
stacked the data and ran a longitudinal FACETS model by parameterizing time as the
third facet, we achieved unidimensional fit of the EQ-5D items in 8 out of 10 disease
groups using INFIT MNSQ (< 1.40). Hence, in the longitudinal evaluation of the EQ-5D’s
unidimensionality, all disease groups except the joint disorder and anxiety groups
achieved unidimensional fit to the Rasch model.
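The three-facet model referred to here can be written, in a rating-scale form analogous to the cross-sectional RSM (this is the generic many-facet Rasch formulation in the spirit of the FACETS framework, not a new parameterization of our own), as
\log\!\left(\frac{P_{nitk}}{P_{nit(k-1)}}\right) = B_n - D_i - T_t - F_k,
where B_n is the person measure, D_i the item difficulty, T_t the severity associated with time point t, and F_k the category threshold. Because the systematic time effect is absorbed by T_t, the item fit statistics are computed after that effect has been removed from the expected scores, which is what allows the improvement in item fit reported here to be attributed to parameterizing time.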
Table 19 reports the item fit statistics for the SF-12v2™. Model misfit was
predominantly found in the MCS items (SF4a, SF4b, SF6a and SF6c), the same as in
the cross-sectional analyses. At time 2, only SF6a and SF6c
showed misfit, and some of the items marginally misfit the model. Overall,
unidimensional fit was found in half of the disease groups in the repeated measure.
Longitudinal item fit showed some improvement: three mental health items (SF4b,
SF6a and SF6c) showed marginal misfit to the model, and 5 out of 10 disease
groups achieved unidimensional fit. Findings suggest that parameterizing time in the
SF-12v2™ item response estimation yielded less departure from unidimensionality and
improved the overall model fit.
Table 20 to Table 24 below report the item fit statistics of the SF-36® at time
1, time 2, time 3 and time 4, as well as the longitudinal item fit. Across all 21 disease
groups, misfits were again predominantly among the MCS items, except SF21, the item that
assesses respondents’ bodily pain magnitude, and SF35, the item that asks whether
respondents expect their health status to get worse. This misfit pattern was
more pronounced during the later time periods (time 3 and time 4), when some of the
items that fit earlier showed misfit (see Table 22 and Table 23). Using this US
community cohort with 4 time points and a 10-year follow-up, we did not find obvious
improvement in model fit by parameterizing time in the model.
108
Table 18: Item fit of the EQ-5D to the FACETS model
109
Table 19: Item fit of the SF-12v2™ to the FACETS model
110
Table 20: Time 1 item fit of the SF-36® to the FACETS model
111
Table 20: Continued
112
Table 21: Time 2 item fit of the SF-36® to the FACETS model
113
Table 21: Continued
114
Table 22: Time 3 item fit of the SF-36® to the FACETS model
115
Table 22: Continued
116
Table 23: Time 4 item fit of the SF-36® to the FACETS model
117
Table 23: Continued
118
Table 24: Longitudinal item fit of the SF-36® to the FACETS model
119
Table 24: Continued
120
4.2.2 Item x time interaction/bias assessments
Figures 16 to 21 below examine the item x time interactions/biases for each
instrument. In these figures, the x-axis represents the items (2nd facet) and the y-axis the
mean estimate of time (3rd facet). Higher scores on the time estimate mean higher
bias.
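The plotted bias terms can be thought of as local adjustments to the three-facet model: holding the main-effect measures fixed, a bias term b_{it} for item i at time t is chosen so that the residuals for that item-time combination sum to zero (a generic sketch of the interaction analysis, not the program's exact algorithm),
\sum_{n} \left[ x_{nit} - E_{nit}\!\left(B_n - D_i - T_t - b_{it}\right) \right] = 0,
so b_{it} is the extra (or reduced) difficulty, in logits, that item i exhibits at time t beyond what the main effects predict. A larger spread of these estimates at time 1 than at time 2 is what the figures display as greater time-1 bias.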
For the EQ-5D, in most cases, time 1 had greater bias on some of the items,
especially “pain/discomfort” and “anxiety/depression.” Also,
interactions/biases were not consistent across disease groups. For instance, in asthma,
time 1 had much greater bias than time 2 on all items except “mobility.”
In other cases, bias may be greater at time 1 for some items but not others, such as in
the anxiety and diabetes groups. Findings suggest that respondents with different
diseases had different item difficulties across different time points (Figures 16 and 17).
This is consistent with the goodness-of-fit tests and DIF evaluations
illustrated before: the mental health items, and the mental
health groups, tend to behave independently from the physical health items and
physical health groups. Overall, the interaction between EQ-5D items and time
demonstrated some consistent patterns across disease groups, and findings suggest
the EQ-5D items were subject to a similar time effect at both time points. When the
item x time interaction/bias was examined using the whole sample, the items
“pain/discomfort” and “anxiety/depression” showed some convergence toward
the other three items at time 2.
121
Figures 18 and 19 below report the item x time interaction/bias of the SF-
12v2™ items across disease groups. Overall, items at time 1 demonstrated more
obvious bias than at time 2 for almost all items except SF2a and SF2b. This
finding was consistent across all disease groups except joint disorder, for which the
time 2 estimates exhibited a somewhat more erratic pattern than the time 2
estimates in the other 9 disease groups. Hence, items SF2a and SF2b, which assess
physical limitations on moderate activities and on climbing flights of stairs, were more
robust to the time effect than the other items, which were endorsed
differently at time 1 than at time 2. When the whole sample was used to examine
the item x time interaction/bias of the SF-12v2™ items (N = 5,151), it was very
obvious that, except for SF2a and SF2b, the other 10 items in the SF-12v2™ showed
fairly large interaction biases at time 1. Convergent patterns were found at time 2
on all twelve items, and most of the items had very close time estimates at time 2. These
findings suggest that, over time, there was less effect of time on the SF-12v2™
item endorsements.
Figures 20 and 21 illustrate the item x time interaction/bias of the SF-36®
items. Time 1 showed the greatest bias, mostly among the mental health items, and this was
consistent across all disease groups. However, in some diseases such as angina,
asthma, CHF, colitis, emphysema, gout, MI and stroke, there were erratic item
measures at later time points (time 3 or time 4). When the whole sample was used
(Figure 21), similar to what we found for the SF-12v2™ items, item endorsements at
time 1 had the highest biases compared with the item estimates at later time points,
predominantly in the MCS items. Starting at time 2, item estimates with regard to time
were very consistent across all items.
123
Figure 16: Item x time interaction/bias of the EQ-5D in ten disease groups
124
Figure 17: Item x time interaction/bias of the EQ-5D in whole sample (n=2,677)
125
Figure 18: Item x time interaction/bias of the SF-12v2™ in ten disease groups
126
Figure 19: Item x time interaction/bias of the SF-12v2™ in whole sample (n=5,151)
127
Figure 20: Item x time interaction/bias of the SF-36® in twenty-one disease groups
128
Figure 20: Continued
129
Figure 21: Item x time interaction/bias of the SF-36® in whole sample (n=1,430), P<0.001
130
4.2.3 Validation test
As we have stated previously in the methods section, a change in model
specification leads to a change in model estimation, and hence may lead to a change
in model fit. In this section, we conduct a post hoc validation test with regard to this
concern. We intend to test the productivity of the facets included in the model,
especially the additional time facet. Tables 25, 26 and 27 below report the fixed all-
same χ² tests for each health instrument in all disease groups and for all facets
(person facet, item facet and time facet).
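The fixed all-same test is a homogeneity test of whether all elements of a facet (for example, all time points) could plausibly share a single common measure apart from measurement error. In its usual weighted form (a sketch of the standard statistic, not necessarily the exact computation reported by the software), with element measures d_j and standard errors SE_j,
\chi^{2} = \sum_{j} \frac{d_j^{2}}{SE_j^{2}} - \frac{\left(\sum_{j} d_j / SE_j^{2}\right)^{2}}{\sum_{j} 1 / SE_j^{2}}, \qquad df = J - 1,
where J is the number of elements in the facet. A significant value indicates that at least one element differs reliably from the others, which is the sense in which a facet is called “productive” here.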
Table 25 below exhibits the χ² test on the EQ-5D. Both the person and item
facets were significant in all disease groups. Hence, for these two facets, we reject
the null hypothesis and conclude that they were productive to include in
the model. The time facet was significant in 5 out of 10
disease groups (P < 0.01) and insignificant in the diabetes, arthropathy, depression,
anxiety and cholesterol groups (P > 0.01). This indicates that the time facet was
productive in some of the disease groups but not in others, implying that some
diseases are more sensitive to the effect of time than others. Therefore, enabled by the
longitudinal Rasch model with the time facet, we were able to detect that the EQ-5D
items interact with time differently in different disease groups.
Table 26 reports the χ² test on the SF-12v2™. All three facets were significant in
the model (P < 0.001) across all 10 disease groups. The same result was found for
the SF-36® in Table 27: all three facets were productive in the model (P < 0.001).
131
Table 25: Fixed all-same χ² test on the EQ-5D
Table 26: Fixed all-same χ² test on the SF-12v2™
132
Table 27: Fixed all-same χ² test on the SF-36®
133
CHAPTER 5: DISCUSSION AND CONCLUSIONS
5.1 Overview
Valid health measurement is important for health care decision making. For
different types of health instruments, it is important to assess whether or not an
instrument is measuring what it is expected to measure. Given the complexity of
human health, it is essential to understand whether an instrument captures optimal
health information or renders suboptimal measures of health.
In this dissertation, we investigated the measurement properties of three
widely used health instruments using the Rasch family of measurement models in both
cross-sectional and longitudinal data settings, to understand the details of item
endorsement in different groups and at different time points. The fundamental
concept of parameter separation of persons and items in the Rasch models provided
not only the theoretical basis for the meaningfulness of the measured outcome, but
also empirical evidence on person and item estimates in actual health
instruments. Such information is valuable for instrument validation as well as for health
assessments such as clinical trials, where specific information on items and persons
is useful for greater sensitivity in detecting health disparities within and between
groups. Hence, our findings rendered meaningful quantitative interpretations of the
qualitative observations collected from empirical health surveys.
134
The theme of this study centers on the unidimensionality of health instruments
and the investigation of possible factors that might affect the attainability of
unidimensionality. In Chapter 1 and Chapter 2, we discussed the theoretical
basis and empirical importance of unidimensionality as it pertains to our mathematical
measurement model approach. Empirically, unidimensionality has been regarded as
“unachievable” for a unitary health dimension that involves both mental health and
physical health. After examining individual mental health and physical
health item performance in each of the three health instruments and in various
settings, our findings suggest that health, or health-related quality of life (HRQoL), is
unidimensional after all, once time is considered in the model.
5.2 Discussion
Empirical findings from both the preliminary investigation (cross-sectional
analysis) and the hypothesis testing (longitudinal analysis) rendered results that were
largely consistent with our background literature review. At the same time, some
findings warrant further investigation; however, they are beyond the scope of this
work.
135
5.2.1 On the cross sectional findings
From the cross-sectional analyses, item fit and misfit demonstrated
fairly consistent patterns among the three instruments in terms of how mental health
items and physical health items interplay with one another. At the same time, there
were also discrepancies among the three instruments.
First, in all three instruments, mental health items were more likely to misfit
the unidimensional measurement models. This finding is within our expectation, as
mental health and physical health tend to behave independently in point-in-time
measures. One interesting finding from the preliminary study is that items
concerning the magnitude of bodily pain or respondents’ expectations about their
general health status tend to misfit the models along with the mental health items.
Empirically, these two items were grouped together with the PCS items in both
the SF-12v2™ and the SF-36®. Our findings imply that items assessing pain
magnitude and general health expectations behave independently from the rest of the
PCS items and may correlate better with the MCS items. In addition, vitality
items in the SF-12™ and the SF-36® did not show any misfit along with the other
MCS items. Since empirically they belong to the MCS group, these findings question the
merits of the MCS and PCS scales used in a health profile measure. In a separate study,
we conducted a related analysis and found that separating the items in a health
profile measure into distinct MCS and PCS scales may not render the desired outcomes
from the measurement perspective (Gu & Doctor, 2009a; Gu & Doctor, 2009b).
Hence, perhaps removing the two misfitting MH items may be an alternative for
instrument improvement.
Second, differential item functioning (DIF) was often noted on mental health
items. Hence, two equally healthy respondents tend to endorse mental
health items differently, or give systematically different responses, if they were
diagnosed with mental health conditions. Further, findings were more pronounced in
males than in females, suggesting an underlying cultural effect in the sample. Given
the prevalence of gender-related DIF, in a separate study we conducted post hoc
sample-split analyses on DIF-related items and found improved model fit after
purging the gender effect (Gu & Doctor, 2008; Gu, Craig & Doctor, 2008).
Third, enabled by the fundamental parameter separation technique in the
Rasch model estimation, we were able to evaluate distinctly the amount of health
attributed to the persons (βn’s) and to the items (δi’s). For instance, from the
Principal Component Analysis of Residuals (PCAR), we found that the total variance
explained by the Rasch measure in the SF-36® was attributed substantially more to
the items than to the persons. Person measures captured < 14% of the total variance
explained. We also found high eigenvalues across all 21 disease groups (> 4); hence,
the 2nd dimension embedded in the SF-36® had the strength of at least 4 items. Perhaps
there were person traits other than health that should be
considered in the measure.
Finally, all three instruments showed positive point biserial correlations
across all disease groups, indicating that, by this very basic requirement of a health
measure, the items included in these three instruments are measuring some proportion
(more or less) of the latent construct of health, as expected.
5.2.2 On the longitudinal findings
Findings from the longitudinal analyses were focused on testing the
hypothesis that time plays a significant role in unidimensionality. Our investigation
rendered positive and consistent results in the U.S. nationally representative sample
(MEPS). Findings from the U.S. community sample (BDHOS) warrant further
investigation.
From the longitudinal results, we found that at each discrete time point,
mental health items were more likely to show misfit, as we found in the cross-
sectional analyses. Parameterizing time improved the overall model fit of the EQ-5D
and the SF-12v2™ but not of the SF-36®. Because observations on the EQ-5D and the SF-
12v2™ were collected from the U.S. national survey in two consecutive years and
observations on the SF-36® were collected from the U.S. community cohort with a
longer follow-up time, this finding prompted us to investigate and analyze the
differences.
We first noted the obvious cohort-related differences. Since the BDHOS sample
was extracted from the Beaver Dam Eye Study cohort, established in 1987
for a population-based study of eye disease prevalence and risk factors (Fryback et
al., 1993), there were naturally differences between this cohort and the MEPS cohort,
which was based on a U.S. nationally representative sample. For
instance, as seen in Tables 5 and 6 on sample descriptives, the
average age in the BDHOS sample was about 10 years older than the
average age in the MEPS sample. Further, chronic disease prevalence differed
between the two cohorts as well. For example, the leading chronic diseases in the
BDHOS data were arthritis and chronic pain, while the leading chronic diseases in the
MEPS data were hypertension and depression. In addition, about one quarter of the
MEPS sample was non-white, but over 95% of the BDHOS sample was
white (Fryback et al., 1993). Hence, the BDHOS sample consisted
predominantly of older white local residents of Beaver Dam who had eye problems and
suffered mainly from arthritis and chronic pain. This gives us the perspective that
following this cohort for over 10 years on their health survey endorsements may be
subject to cohort vulnerability over time.
Cohort vulnerability entails issues associated with following an
elderly, sicker cohort over a longer period of time (over 10 years). For example,
change in disease severity may be a concern. An elderly cohort diagnosed
with chronic diseases at the base year was more likely to incur either
changes in disease severity or more co-morbidity over time. Further, varying ability
to complete a health survey may be another concern. Since this cohort was
extracted from the Beaver Dam Eye Study, it was very likely that
respondents in this cohort suffered continuously from eye diseases. Failing
eyesight may have affected their ability to endorse items on the SF-36® in
repeated administrations. In addition, there was apparent loss to follow-up in the data
collection. Since respondents in this cohort were more likely to die than the
younger and healthier cohort in the MEPS, missing data were naturally more likely.
In MEPS, we included respondents with complete responses at the base year, and MEPS
had a shorter follow-up period (two consecutive
years); thus, there were fewer missing data at time 2 than in the BDHOS data after the
baseline year. In the BDHOS data, since we had a limited number of respondents from the survey, we
included all respondents from the base year, and the missing data problem in this
cohort after the baseline year was substantial. This study design was rationalized on
the Rasch models’ capability to deal with missing data, as elaborated
previously in the methods section (Linacre, 2008a/2008b). However, to the best of our knowledge, the Rasch model’s
capability to deal with missing data has not been empirically tested for
robustness in longitudinal settings. In other words, there is no empirical
evidence regarding the robustness of the MFRM to missing data when time is
parameterized in the model, especially when the data matrices are constructed such
that there is substantial missingness in responses at the later time points.
Findings on the item x time interaction/bias were consistent across all three health
instruments. Basically, item endorsements at time 1 were more likely to be subject to
the item x time interaction/bias, mostly manifested in the mental health items, especially
for the two health profile measures. From a clinical point of view, if we
think of time 1 as the time a patient has just received his or her diagnosis or
treatment, his or her endorsement of health survey items would have reflected the new
information cognitively as well as physically via the mental health items and physical
health items. From a psychometric point of view, at time 1 a patient may
not have acquired information about the health instrument or health items in the same
way as he or she would have at later time points. Therefore, the gaps exhibited in the previously
reported figures (Figures 17, 19, and 21) may be interpreted clinically as either a
coping effect or adaptation. Psychometrically, this could be explained as a patient
becoming more familiar with the items over time. While the interpretations could
be multifold, our findings on the item x time interaction were consistent across the
three instruments; hence, time plays a significant role in item responses. The
post hoc validation test on the time facet indicated that time was a productive facet to
include in the model.
5.3 Answering research questions
To address our first research question, “how do the mental health and
physical health items interplay in terms of quantifying the latent variable when they
are placed on the same metric or latent continuum?”, we used unidimensional
mathematical measurement models to place mental health and physical health
items on the same metric or latent continuum. In point-in-time measures, the mental
health items tend to behave independently from the physical health items, as if they
form two distinct health constructs. However, longitudinal evaluation of the
intercorrelation between the two demonstrated convergent patterns, suggesting that
mental health and physical health were both measuring the single latent construct of
health. Therefore, they were approaching unidimensionality.
To answer our second research question, “is unidimensionality attainable in
point-in-time measures in multi-itemed instruments that consist of both mental health
and physical health questions?”, our empirical evidence indicates that achieving
unidimensionality in point-in-time measures is unlikely.
Potential confounding factors such as gender, disease and time can affect the
attainability of unidimensionality in health instruments. Improved model fit may be
achieved by purging the effects of potential confounders.
For our third research question, “will unidimensionality in a health
instrument improve in longitudinal samples after parameterizing time in the
measurement model?”, as discussed above, the answer is yes, but only for
some of the health instruments using certain data sets. We found improved
unidimensional fit for the EQ-5D and the SF-12v2™ using the U.S. nationally
representative sample with a two-year follow-up. We did not find improved
unidimensional model fit for the SF-36® using the US community cohort with a ten-
year follow-up. We attributed this outcome to either a sample or a cohort effect, since
respondents differed between the two data sets and the data matrices used for model
estimation differed as well. We intend to dedicate further effort to investigating
these discrepancies.
142
Our fourth research question asks: “does the preference-based measure
differ from the health profile measures in the attainability of unidimensionality in
both point-in-time measures and longitudinal settings?” Our findings rendered both
consistencies and inconsistencies in this regard. Before we purged the
confounding effects, the mental health items in both types of instruments were more
likely to misfit the unidimensional model, and this was consistent across all three
health instruments. Inconsistently, after parameterizing the effect of time, there were
improvements in model fit in the EQ-5D and the SF-12v2™ but not in the SF-36®.
Further, time interacted with the EQ-5D items differently than with the SF-12v2™ or
SF-36® items. Many factors may be
associated with how time interacts with items in different instruments. For instance, each
health instrument specifies a different recall time frame by design. The recall
period for the EQ-5D is “today,” whereas the recall period for both the SF-12v2™ and the
SF-36® is the past 4 weeks (www.proqolid.org). The different time lapses in
subjective health measurement may have different effects on item endorsements.
5.4 Significance, limitations and directions for future study
This dissertation stands to make an important contribution to the literature on the
measurement properties of health survey instruments widely used for health care
decision making. It is the first comprehensive analysis of these three widely applied
health instruments in large U.S. national and community survey samples using the
Rasch family of measurement models. Our hypothesis testing facilitates an
alternative evaluation of the Rasch model requirements using empirical health data and
provides further understanding of the dynamics of health using empirical evidence.
Our hypothesis testing demonstrated that the unidimensionality requirement in a health
instrument can be a by-product of time and that the inter-temporal context of health
evaluation is important. Hence, our study supports the empirical evidence that time
plays a significant role in health, manifested in the interplay between mental health
and physical health, which in turn affects unidimensionality. At the same time, for
broad assessment of health status, the tradeoffs of such investigations at the
expense of achieving unidimensionality need to be considered when point-in-time
measures are used.
There are limitations associated with this study, and some of them
may serve as directions for future studies. First of all, in the Rasch framework, ‘health’ is latent and
cannot be experimentally controlled; therefore, we are limited to goodness-of-fit tests.
Goodness-of-fit tests may lack power to reject; however, this may be balanced out because
Rasch models are strict models that make strong assumptions that are often rejected.
Second, there may be deeper currents underneath what we hypothesized and what
could emerge from the proposed analytical steps. Third, as the understanding of
quantitative health measurement unfolds, more comprehensive and systematic
studies may be needed using alternative measurement models. For instance, there are
other models, apart from the parametric unidimensional Rasch models, that could
be tested using the same data sets. Fourth, there are person
characteristics other than gender and health conditions that could be potential
sources of confounding. Fifth, the model fit criteria we use in this study are based on
a set of pre-established criteria from the existing literature; hence, they are not
conclusive and serve as only one set of cut-off points.
For our future research: 1) we intend to further investigate the inconsistent
findings noted in this study, in particular work on dealing with missing data
in longitudinal measurement models, perhaps using alternative data sets or
different populations; 2) future studies should focus on wider applications of
the approach used in this dissertation; for instance, distribution-independent results
obtained via parameter separation of persons and items may be a
valuable approach in clinical trials where individualized outcomes are of
interest to outcomes assessment; 3) health measures derived from the quantitative
measurement model approach demonstrated in this dissertation may be
further applied in economic studies on health valuation using interval scales; and 4)
other validation test procedures for testing the productivity of different facets in a
longitudinal measurement model should be further explored. For example, split-
sample analysis via an anchoring approach may be an interesting research area for
longitudinal measurement models.
145
5.5 Conclusions
Valid health measures should be free from systematic bias. Empirically,
given the complexity of human health and the definitional and operational
difficulties associated with the measurement of health, quantifying item bias via misfit,
as demonstrated in this study, provides useful diagnostic information on
respondents’ health perception over time and on the measurement invariance property
of the instruments. Most analyses of health measures suggest that unidimensionality fails
because mental health and physical health tend to form different scales.
Based on our findings, qualitative patient-reported outcomes (PRO) at any
point in time may not be sufficient to represent the latent trait that the
instrument is designed to measure. The unidimensionality requirement in a health
instrument is supported when time is considered in the model. Mental health and
physical health tend to form different scales in point-in-time measures, but they may
be linked through time. Therefore, parameterizing time improves the overall model
fit. Our findings suggest that it no longer makes sense to interpret health measures
cross-sectionally; the inter-temporal health context is important in health measurement.
146
BIBLIOGRAPHY
Ackerman TA. A didactic explanation of item bias, item impact, and item validity
from a multidimensional perspective. Journal of Educational Measurement
1992; 29(1): 67-91.
Acton SF. What is good about Rasch measurement? Rasch Measurement
Transactions 2003; 16(4): 902-3.
Adams RJ, Wilson M, Wang WC. The Multidimensional random coefficients
multinomial logit model. Applied Psychological Measurement 1997; 21(1):
1-23.
Agency for Healthcare Research and Quality (AHRQ), Medical Expenditure Panel
Survey (MEPS). http://www.meps.ahrq.gov.
Allen DD, Wilson M. Introducing multidimensional item response modeling in
health behavior and health education research. Health Education Research,
Theory & Practice 2006; 21(suppl): i73-i84.
Andrich, D. A rating formulation for ordered response categories. Psychometrika
1978; 43: 561-573.
Andrich D. Rasch Models for measurement. Newbury Park, CA: Sage. 1988.
Andrich D, Sheridan B. A summary index of multidimensionality in scales
composed of subscales: applications to traditional and Rasch measurement
theory. Paper presented at the 14th International Objective Measurement
Workshop (IOMW 2008).
Aneshensel CS, Frerichs RR, Huba GJ. Depression and physical illness: a
multiwave, nonrecursive causal model. Journal of Health and Social Behavior
1984; 25: 350-371.
Arnold ME. Influences on and limitations of classical test theory reliability
estimates. Research in Schools 1996; 3(2): 61-74.
Baghaei P. Local dependency and Rasch measures. Rasch Measurement
Transactions 2007; 21(3): 1105-1106.
147
Baker JG, Granger CV, Fiedler RC. A brief outpatient functional assessment
measure: validity using Rasch measures. American Journal of Physical
Medicine & Rehabilitation 1997; 76(1): 8-13.
Barbour KA, Blumenthal JA, Palmer SM. Psychosocial issues in the assessment and
management of patients undergoing lung transplantation. Chest 2006; 129:
1367-1374.
Bergner M, Bobbitt RA, Carter WB, Gilson S. The Sickness Impact Profile:
development and final revision of a health status measure. Med Care 1981;
19(8): 787-805.
Birbeck GL, Kim S, Hays RD, Vickrey BG. Quality of life measures in epilepsy:
how well can they detect change over time? Neurology 2000; 54: 1822-7.
Bock RD. A brief history of item response theory. Educational Measurement: Issues
and Practice 1997; 21-33.
Bond TG, Fox CM. Applying the Rasch model: fundamental measurement in the
human sciences, 2nd edition. Mahwah, New Jersey: Lawrence Erlbaum
Associates, 2007.
Bradley C. Importance of differentiating health status from quality of life. The
Lancet 2001; 357: 7-8.
Brazier JE, Ratcliffe J, Salomon JA, Tsuchiya A. Measuring and Valuing Health
Benefits for Economic Evaluation. Oxford University Press, 2007.
Brazier JE, Roberts J. The estimation of a preference-based measure of health from
the SF-12. Med Care 2004; 42(9): 851-859.
Brazier JE, Roberts J. Methods for developing preference-based measures of health.
In. Jones AM. (Ed.), The Elgar Companion to Health Economics. Edward
Elgar, Cheltenham, UK, 2006; pp. 371-81.
Brazier J, Roberts J, Deverill M. The estimation of a preference-based single index
measure for health from the SF-36. Journal of Health Economics 2002;
21:271-292.
Brennan RL. (Mis)conceptions about generalizability theory. Educational
Measurement: Issues & Practice 2000; 19: 5-10.
Brennan RL. Generalizability Theory. New York: Springer. 2001.
148
Briggs DC, Wilson M. An introduction to multidimensional measurement using
Rasch models. In: Smith EV Jr, Smith RM, editors. Introduction to Rasch
Measurement. Maple Grove, MN: JAM Press; 2004. P575-600.
Brooks R. EuroQol: the current state of play. Health Policy 1996; 37: 53-72.
Brown MM, Brown GC, Sharma S. Evidence-Based to Value-Based Medicine.
American Medical Association (AMA) press, 2005.
Brown M, Gordon WA. Quality of life as a construct in health and disability
research. The Mount Sinai Journal of Medicine 1999; 66(3): 160-169.
Calman KC. Quality of life in cancer patients – a hypothesis. Journal of Medical
Ethics 1984; 10: 124-7.
Cardinet J, Tourneur Y, Allal L. The symmetry of generalizability theory:
applications to education measurement. Journal of Educational Measurement
1976; 13: 119-135.
Carnethon MR, Biggs ML, Barzilay JI, Smith NL, Vaccarino V, Bertoni AG, Arnold
A, Siscovick D. Longitudinal association between depressive symptoms and
incident type 2 diabetes mellitus in older adults: the cardiovascular health
study. Arch Intern Med 2007; 167: 802-807.
Carney RM, Blumenthal JA, Stein PK, Watkins L, Catellier D, Berkman LF,
Czajkowski SM, O’Connor C, Stone PH, Freedland KE. Depression, heart
rate variability, and acute myocardial infarction. Circulation 2001; 104: 2024-
2028.
Chang CH. Item response theory and beyond: advances in patient-reported outcomes
measurement. In: Lenderking WR, Revicki DA (Eds.), Advancing Health
Outcomes Research Methods and Clinical Applications. Degnon Associates.
2005.
Chang CH, Wright BD, Cella D, Hays RD. The SF-36 physical and mental health
factors were confirmed in cancer and HIV/AIDS patients. Journal of Clinical
Epidemiology 2007; 60: 68-72.
Chang WC, Chan C. Rasch analysis for outcomes measures: some methodological
considerations. Arch Phys Med Rehabil 1995; 76: 934-9.
149
Cohen SB. Sample design of the 1997 Medical Expenditure Panel Survey Household
Component. Rockville (MD): Agency for Healthcare Research and Quality;
2000. MEPS Methodology Report No. 11. AHRQ Pub. No. 01-0001.
Cohen SB. Design strategies and innovations in the Medical Expenditure Panel
Survey. Med Care 2003; 41(7 Suppl): III5-III12.
Craven J. Psychiatric aspects of lung transplant. Am J Psychiatry 1990; 35: 759-764.
Crocker L, Algina J. Introduction to Classical and Modern Test Theory. Fort Worth:
Harcourt, Brace, Johanovich. 1986.
Cronbach LJ, Glaser GC, Nanda H, Rajaratnum J. The Dependability of Behavioral
Measurement: Theory of Generalizability of Scores and Profiles. New York:
John Wiley. 1972.
Cronbach LJ. Nageswari R, Gleser GC. Theory of generalizability: a liberation of
reliability theory. The British Journal of Statistical Psychology 1963; 16:
137-163.
Crone CC, Wise TN. Psychiatric aspects of transplantation, III: postoperative issues.
Critical Care Nursing 1999; 19: 28-38.
Dellmeijer AJ, de Groot V, Roorda LD, Schepers VP, Lindeman E, van den Berg
LH, Beelen A, Dekker J. Cross-diagnostic validity of the SF-36 physical
functioning scale in patients with stroke, multiple sclerosis and amyotrophic
lateral sclerosis: a study using Rasch analysis. J Rehabil Med 2007; 39: 163-
9.
DeVellis RF. Classical Test theory. Medical Care 2006; 44(11, suppl 3): S50-S68.
Dew MA. Prevalence and risk of depression and anxiety-related disorders during the
first three years after heart transplantation. Psychosomatics 2001; 42: 300-
313.
Dew MA. Psychiatric disorder in the context of physical illness. In: Dohrenwend BP
(Editor), Adversity, Stress and Psychopathology. Oxford University Press,
New York, 1998.
Dew MA, Switzer GE, DiMartini AF: Psychiatric morbidity and organ
transplantation 1997; 64: 1261-1273.
150
Dew MA, Switzer GE, DiMartini AF, et al. Psychosocial assessment and outcomes
in organ transplantation. Prog Transplant 2000; 10: 223-227.
Divgi DR. Does the Rasch Model Really Work for Multiple Choice Items? Not If
You Look Closely. Journal of Educational Measurement 1986; 23(4).
Dolan P, Gudex C, Kind P, William A. The time trade-off method: results from a
general population study. Health Economics 1996; 5: 141-54.
Dolan P. Modeling valuations for EuroQol health states. Medical Care 1997; 35(11):
1095-1108.
Dolan P, Roberts J. Modeling valuation for EQ-5D health states: an alternative
model using difference in valuations. Medial Care 2002; 40(5): 442-6.
Donabedian A. Evaluating the quality of medical care. Milbank Memorial Fund
Quarterly 1966; 44(3): 166-206.
Doward LC, McKenna SP. Defining patient-reported outcomes. Value in Health
2004; 7(Suppl 1): S4-S8.
Drummond MF, Sculpher MJ, Torrance GW, O’Brien BJ, Stoddart GL. Methods for
the Economic Evaluation of Health Care Programmes. Oxford University
Press, 2005.
Eason S. Why generalizability theory yields better results than classical test theory: a
primer with concrete examples. In Thompson B (Ed.) Advances in
Educational Research: Substantive Findings , Methodological Developments.
Vol. 1, pp. 83-98. Greenwich CT: Jai Press. 1991.
Edgerton R. Quality of life from a longitudinal research perspective. In: Schalock R,
Bogale MJ (Eds.), Quality of Life: Perspectives and Issues. Washington
(DC): American Association on Mental Retardation, 1990.
Embretson SE. Implications of a multidimensional latent trait model for measuring
change. In Collins L & Horn J (Eds.), Best Methods for the Analysis of
Change (pp. 184-201). Washington, DC: American Psychological
Association 1991.
Embretson SE. Issues in the measurement of cognitive abilities. In: The New Rules
of Measurement: What Every Psychologist and Educator Should Know.
Embretson SE, Hershberger SL. (Eds.). Lawrence Erlbaum Associates, Inc.,
Publishers. Mahwah, NJ. 1999.
151
Embretson SE, Reise SP. Item Response Theory for Psychologists. Lawrence
Erlbaum Associates, Publishers. Mahwah, New Jersey. 2000.
Empana JP, Jouven X, Lemaitre RN, Sotoodehnia N, Rea T, Raghunathan TE,
Simon G, Siscovick S. Clinical depression and risk of out-of-hospital cardiac
arrest. Arch Intern Med 2006; 166:195-200.
Fairclough D. Analysing longitudinal studies of QoL. In Fayers P, Hays R. (Eds.),
Assessing Quality of Life in Clinical Trials, second edition. Oxford
University Press, 2005.
Fan X. Item response theory and classical test theory: an empirical comparison of
their item/person statistics. Educational and Psychological Measurement
1998; 58(3): 357-381.
Fanshel S, Bush JW. A health-status index and its application to health-services
outcomes. Operations Research 1970; 18(6): 1021-66.
Fayers PM, Machin D. Quality of Life: the Assessment, Analysis and Interpretation
of Patient-Reported Outcomes, 2nd edition. John Wiley & Sons Ltd. 2007.
FDA Guidance for Industry, Patient-Reported Outcome Measures: Use in Medical
Product Development to Support Labeling Claims. February 2006,
Clinical/Medical. http://www.fda.gov/CDER/GUIDANCE/5460dft.pdf
[Accessed online on March 4, 2008].
Feeny D. Preference-based measures: utility and quality-adjusted life years. In
Fayers P, Hays R. (Eds.), Assessing Quality of Life in Clinical Trials, second
edition. Oxford University Press, 2005.
Feeny D. The multi-attribute utility approach to assessing health-related quality of
life. In. Jones AM. (Ed.), The Elgar Companion to Health Economics.
Edward Elgar, Cheltenham, UK, 2006; pp. 359-70.
Feeny DH, Furlong WJ, Torrance GW, Goldsmith CH, Zenglong Z, Depauw S,
Denton M, Boyle M. Multiattribute and single-attribute utility function: the
Health Utility Index Mark 3 system 2002; 40: 113-28.
Fisher WP Jr. Foundations for health status metrology: the stability of MOS SF-36
PF-10 calibrations across samples. Journal of the Louisiana State Medical
Society 1999; 151(11): 566-78. (Abstract).
152
Fisher WP Jr, Eubanks RL, Marier RL. Equating the MOS SF36 and the LSU HIS
physical functioning scales. Journal of Outcome Measurement 1997; 1(4):
329-62.
Fischer GH, Molenaar IW (Eds.). Rasch Models: Foundations, Recent
Developments, and Applications. Springer-Verlag New York, Inc. 1995.
Fisher WP, Wright BD. Introduction to probabilistic conjoint measurement theory
and applications. Int J Educ Res 1994; 21: 559-568.
Fitzpatrick R, Norquist JM, Jenkinson C, Reeves BC, Morris RW, Murray DW. A
comparison of Rasch with Likert scoring to discriminate between patients’
evaluations of total hip replacement surgery. Quality of Life Research 2004;
13: 331-338.
Franks P, Lubetkin EI, Gold MR, Tancredi DJ. Mapping the SF-12 to preference-
based instruments: convergent validity in a low-income, minority population.
Medical Care 2003; 41(11): 1277-83.
Fries JF, Cella D. The promise of PROMIS: using item response theory to improve
assessment patient-reported outcomes. Clin Exp Rheumatol 2005; 23 (Suppl.
39): S53-S57.
Friedman HS, Booth-Kewley S. The “Disease-Prone Personality”: a meta-analytic
view of the construct. American Psychologist 1987; 42(6): 539-555.
Fryback DG, Dasbach EJ, Klein R, Klein BE, Dorn N, Peterson K, Martin PA. The
Beaver Dam health outcomes study: initial catalog of health-state quality
factors. Med Decis making 1993; 13: 89-102.
Fukunishi I. Psychiatric disorders before and after living-related transplantation.
Psychosomatics 2001; 42: 337-343.
Glas CW, Verhelst ND. Testing the Rasch model. In: Fishcher GH, Molenaar IW
(Eds.), Rasch Models: Foundations, Recent Developments and Applications.
Springer-Verlag. New York, Inc. 1995.
Gold MR, Siegel JE, Russell LB, Weinstein MC. Cost-Effectiveness in Health and
Medicine. Oxford University Press, Oxford. 1996.
153
Gu NY, Doctor JN. Rasch rating scale model (RSM) analysis of the EQ-5D using the
2003 Medical Expenditure Panel Survey (MEPS). Podium presentation at the
13th annual ISPOR, Toronto, Canada, May 2008; Value in Health 2008;
11(3): A13.
Gu NY, Craig BM, Doctor JN. Evaluating EQ-5D items using the Rasch models in a
U.S. representative sample. Podium presentation at the EuroQoL 25th Plenary
Meeting, Baveno, Italy. September, 2008.
Gu NY, Doctor JN. Applying the Rasch model to test the merits of physical and
mental summary scores (PCS and MCS) for the SF-12v2™ using a U.S.
national representative sample. Poster presentation at the 3rd Western
Pharmacoeconomic Conference (3rd WPC), Pasadena, California, March
2009a.
Gu NY, Doctor JN. Should an SF-10 replace the SF-12? Poster presentation at the
14th annual ISPOR, Orlando, Florida, May 2009b.
Gudex C. The descriptive system of the EuroQol instrument. In Kind P, Brooks R,
Rabin R. (Eds.), EQ-5D Concepts and Methods: A Developmental History.
Springer, Netherlands 2005; P. 19-27.
Gunter OH, Matschinger H, Konig HH. An item response theory model analysis to
evaluate the dimensionality of the EQ-5D across six countries. Paper
presented at the 2006 International Society for Quality of Life Research
(ISOQOL) conference. http://www.isoqol.org/2006AbstractBook. abstract # 1656.
[Accessed on March 27, 2008].
Hawthorne G, Kensley K, Pallant JF, Mortimer D, Segal L. Deriving utility scores from
the SF-36 health instrument using Rasch analysis. Qual Life Res 2008; 17:
1183-1193.
Hattie J, Krakowski K, Rogers HJ, Swaminathan H. An assessment of Stout’s index
of essential unidimensionality. Applied Psychological Measurement 1996;
20(1): 1-14.
Hunt SM, McKenna SP, McEwen J, Williams J, Papp E. The Nottingham Health
Profile: subjective health status and medical consultations. Social Science &
Medicine 1981; 15A: 221-229.
Hahn EA, Cella D, Dobrez DG, Weiss BD, Du H, Lai JS, Victorson D, Garcia SF.
The impact of literacy on health related quality of life measurement and
outcomes in cancer outpatients. Quality of Life Research 2007; 16: 495-507.
Haley SM, McHorney CA, Ware JE. Evaluation of the MOS SF-36 physical
functioning scale (PF-10): I. unidimensionality and reproducibility of the
Rasch item scale. J Clin Epidemiol 1994; 47(6): 671-84.
Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of Item Response
Theory. New York: Sage publications 1991.
Hambleton RK, Jones RW. Comparison of classical test theory and item response
theory and their applications to test development. Educational Measurement:
Issues and Practice. 1993; 12(3): 38-47.
Hambleton RK. Good practices for identifying differential item functioning. Medical
Care 2006; 44 (11): S182-S188.
Hardt J, Filipas D, Hohenfellner R, Egle UT. Quality of life in patients with bladder
carcinoma after cystectomy: first results of a prospective study. Quality of
Life Research 2000; 9: 1-12.
Hays RD. Generic versus disease-targeted instruments. In Fayers P, Hays R. (Eds.),
Assessing Quality of Life in Clinical Trials, second edition. Oxford
University Press, 2005.
Hays RD, Stewart AL. The structure of self-reported health in chronic disease
patients. Psychological Assessment: A Journal of Consulting and Clinical
Psychology 1990; 2(1):22-30.
Hays RD, Liu H, Spritzer K, Cella D. Item response theory analysis of physical
functioning items in the Medical Outcomes Study. Medical Care 2007; 45:
S32-S38.
Hays RD, Marshall GN, Wang EYI, Sherbourne CD. Four-year cross-lagged
associations between physical and mental health in the medical outcomes
study. Journal of Consulting and Clinical Psychology 1994; 62(3): 441-449.
Hays RD, Morales LS, Reise SP. Item response theory and health outcomes
measurement in the 21st century. Med Care 2000; 38(9): II-28 – II-42.
Hays RD, Brown J, Brown LU, Spritzer KL, Crall JJ. Classical test theory and item
response theory analyses of multi-item scales assessing parents’ perceptions
of their children’s dental care. Medical Care 2006; 44(11 Suppl 3): S60-S68.
Howard GS. Response-shift bias: a problem in evaluating interventions with pre/post
self-reports. Evaluation Review 1980; 4: 93.
Hoyt CJ. Test reliability estimated by analysis of variance. Psychometrika 1941; 6:
153-160.
Hoyt WT, Melby JN. Dependability of measurement in counseling psychology: an
introduction to generalizability theory. Counseling Psychologist 1999; 27:
325-352.
Jayaram G, Casimir A. Major depression and the use of electroconvulsive therapy
(ECT) in lung transplant recipients. Psychosomatics. 2005; 46: 244-249.
Jenkinson C, Fitzpatrick R, Garratt A, Peto V, Stewart-Brown S. Can item response
theory reduce patient burden when measuring health status in neurological
disorders? Results from Rasch analysis of the SF-36 physical functioning
scale (PF-10). J Neurol Neurosurg Psychiatry 2001; 71: 220-4.
Kaplan RM. Measuring quality of life for policy analysis: past, present, and future.
In Lenderking WR, Revicki DA (Eds.), Advancing Health Outcomes
Research Methods and Clinical Applications. Degnon Associates. 2005.
Kaplan RM, Anderson JP. A general health policy model: update and application.
Health Services Research 1988; 23: 203-5.
Kaplan RM, Groessl EJ, Sengupta N, Sieber WJ, Ganiats TG. Comparison of
measured utility scores and imputed scores from the SF-36 in patients with
Rheumatoid arthritis. Medical Care 2005; 43(1): 79-87.
Kieffer KM. Why generalizability theory is essential and classical test theory is often
inadequate. In Thompson B (Ed.), Advances in Social Science Methodology,
Vol. 5, pp. 149-170. Stamford, CT: JAI. 1999.
Keeney RL, Raiffa H. Decisions with Multiple Objectives: Preferences and Value
Tradeoffs. John Wiley & Sons, Inc., 1976; Cambridge University Press,
1993.
Kind P. Values and valuation in the measurement of HRQoL. In Fayers P, Hays R.
(Eds.), Assessing Quality of Life in Clinical Trials, second edition. Oxford
University Press, 2005.
Klapow JC, Kaplan RM, Doctor JN. Measuring health outcomes: applications for
health psychology. In: Boll TJ, Frank RG, Baum A, Wallander JL (Eds.).
Handbook of Clinical Health Psychology, Volume 3. Models and
Perspectives in Health Psychology. American Psychological Association,
Washington, DC. 2002.
Koehler PJ. Freud’s comparative study of hysterical and organic paralyses: how
Charcot’s assignment turned out. Arch Neurol 2003; Vol. 60: 1646-1650.
Kolen MJ. Comparison of traditional and item response theory methods for equating
tests. Journal of Educational Measurement 1981; 18: 1-11.
Krantz DH, Luce RD, Suppes P, Tversky A. Foundations of Measurement, Volume
I, Additive and Polynomial Representations. Dover Publications, Inc.
Mineola, New York, 1971.
Lange R, Thalbourne MA, Houran J, Lester D. Depressive response sets due to
gender and culture-based differential item functioning. Personality and
Individual Differences 2002; 33: 937-954.
Lett HS, Blumenthal JA, Babyak MA, Sherwood A, Strauman T, Robins C, Newman
MF. Depression as a risk factor for coronary artery disease: evidence,
mechanisms and treatment. Psychosomatic Medicine 2004; 66: 305-315.
Levenson JL, Olbrisch ME. Psychiatric aspects of heart transplantation.
Psychosomatics 1993; 34: 114-123.
Lin JH, Wang WC, Sheu CF, Lo SK, Hsueh IP, Hsieh CL. A Rasch analysis of a
self-perceived change in quality of life scale in patients with mild stroke.
Quality of Life Research. 2005; 14: 2259-2263.
Linacre JM. Many-Facet Rasch Measurement. Chicago: MESA Press. 1989/1994.
Linacre JM. Constructing measurement with a many-facet Rasch model. In Wilson
M (Ed.), Objective Measurement: Theory into Practice. Vol. 2, pp. 129-144.
Norwood, NJ: Ablex. 1994.
Linacre JM. Prioritizing misfit indicators. Rasch Measurement Transactions 1995; 9:
422.
Linacre JM. Investigating rating scale category utility. Journal of Outcome
Measurement, 1999; 3(2): 102-122.
Linacre JM, Wright BD. Construction of measures from many-facet data. Journal of
Applied Measurement. 2002; 3(4): 484-509.
Linacre JM. A user’s guide to WINSTEPS: A Rasch-Model Computer Program.
Chicago: MESA Press. 2008a.
Linacre JM. A user’s guide to FACETS: A Rasch Measurement Computer Program.
Chicago: MESA Press. 2008b.
Linacre JM. Standard errors: means, measures, origins and anchor values. Rasch
Measurement Transactions 2005; 19(3): 1030.
Linacre JM. Sample size and item calibration stability. Rasch Measurement
Transactions 1994; 7(4): 328.
Linacre JM, Wright BD, Lunz ME. A facets model for judgmental scoring. Rasch
Research Papers, Explorations & Explanations, 1990.
http://www.rasch.org/memos.htm
Lord FM. Standard errors of measurement at different ability levels. Journal of
Educational Measurement 1984; 21: 239-243.
Luo G. The relationship between the rating scale and partial credit models and the
implication of the disordered thresholds of the Rasch models for polytomous
responses. In: Smith EV Jr. & Smith RM (Eds.) Rasch Measurement:
Advanced and Specialized Applications. JAM Press, Maple Grove,
Minnesota, 2007. p. 181-201.
Luce RD. Utility of Gains and Losses: Measurement-Theoretical and experimental
Approaches. Lawrence Erlbaum Associates, Inc., Publishers. Mahwah, New
Jersey. 2000.
Luce RD, Krantz DH, Suppes P, Tversky A. Foundations of Measurement, Volume
III, Representation, Axiomatization and Invariance. Dover Publications, Inc.,
Mineola, New York. 1990.
Luce RD, Tukey JW. Simultaneous conjoint measurement: a new type of
fundamental measurement. J Math Psychol 1964; 1: 1-27.
Marcoulides GA. Generalizability theory: picking up where the Rasch IRT model
leaves off? In Embretson SE, Hershberger SL. (Eds.) The New Rules of Measurement,
What Every Psychologist and Educator Should Know. Lawrence Erlbaum
Associates, Publishers. Mahwah, New Jersey. 1999.
Marcus M, Minc H. Introduction to Linear Algebra. New York: Dover, 1988. p. 145.
Mai FM, McKenzie FN, Kostuk WJ. Psychosocial adjustment and quality of life
following heart transplantation. Can J Psychiatry 1990; 35: 223-227.
McColl E. Developing questionnaires. In Fayers P, Hays R. (Eds.), Assessing
Quality of Life in Clinical Trials, second edition. Oxford University Press.
2005.
McDonald RP. A basis for multidimensional item response theory. Applied
Psychological Measurement 2000; 24(2): 99-114.
McHorney CA, Ware JE, Raczek AE. The MOS 36-item short-form health survey
(SF-36): II. Psychometric and clinical tests of validity in measuring physical
and mental health constructs. Med Care 1993; 31(3): 247-63.
McHorney CA, Haley SM, Ware JE. Evaluation of the MOS SF-36 physical
functioning scale (PF-10): II. comparison of relative precision using Likert
and Rasch scoring methods. J Clin Epidemiol 1997; 50(4): 451-61.
McHorney CA. Methodological inquiries in health status assessment (Editorial).
Medical Care 1998; 36(4): 445-448.
McHorney CA. Health status assessment methods for adults: past accomplishments
and future challenges. Annu. Rev. Public Health 1999; 20: 309-35.
McHorney CA, Monahan PO. Postscript: applications of Rasch analysis in health
care. Medical Care 2004; 42(Suppl.1): I73-I78.
Mannarini S, Lalli R. Assessing psychiatric patient self-awareness behavior with
many-facet Rasch analysis. Rasch Measurement Transactions 2008, 21(4):
1140-1.
Martin M, Kosinski M, Bjorner JB, Ware JE, MacLean R, Li T. Item response theory
methods can improve the measurement of physical function by combining the
modified health assessment questionnaire and the SF-36 physical function
scale. Quality of Life Research 2007; 16: 647-60.
Mas-Colell A, Whinston MD, Green JR. Microeconomic Theory. Oxford University
Press, Inc. 1995.
Masters GN. A Rasch model for partial credit scoring. Psychometrika 1982; 47(2):
149-174.
Masters GN, Wright BD. The partial credit model. In van der Linden WJ, Hambleton
RK (Eds.) Handbook of Modern Item Response Theory. New York: Springer.
1996.
Matthews SC, Nelesen RA, Dimsdale JE. Depressive symptoms are associated with
increased systemic vascular resistance to stress. Psychosomatic Medicine
2005; 67: 509-513.
McBride O, Adamson G, Bunting BP, McCann S. Assessing the general health of
diagnostic orphans using the short form health survey (SF-12v2): a latent
variable modeling approach. Alcohol and Alcoholism 2008;
http://alcalc.oxfordjournals.org/cgi/content/full/agn083 [accessed December
2008].
Michell J. Measurement in Psychology: A Critical History of a Methodological
Concept. Cambridge University Press. 2005.
Misajon R, Pallant JF, Manderson L, Chirawatkul S. Measuring the impact of health
problems among adults with limited mobility in Thailand: further validation
of the perceived impact of problem profile. Health and Quality of Life
Outcomes 2008; 6:6.
Monsaas JA, Engelhard G. Examining changes in the home environment with the
Rasch measurement model. In Engelhard G & Wilson M (Eds.), Objective
Measurement: Theory into Practice, Volume 3. Norwood, NJ: Ablex. 1996.
Muraki E. A generalized partial credit model: application of an EM algorithm.
Applied Psychological Measurement 1992; 16: 159-176.
Mushquash C, O’Connor B. SPSS and SAS programs for generalizability theory
analysis. Behavior Research Methods 2006; 38(3): 542-547.
Nandakumar R. Assessing essential unidimensionality of real data. Applied
Psychological Measurement 1993; 17(1): 29-38.
Nandakumar R. Traditional dimensionality versus essential dimensionality. Journal
of Educational Measurement 2005; 28(2): 99-117.
Nocera FD, Ferlazzo F, Borghi V. G theory and the reliability of psychophysiological
measures: A tutorial. Psychophysiology 2001; 38: 796-806.
Nord E. Cost-Value Analysis in Health Care: Making Sense out of QALYs.
Cambridge University Press, 1999.
Nord E, Badia X, Rue M, Sintonen H. Hypothetical valuations of health state versus
patients’ self-ratings. In Kind P, Brooks R, Rabin R. (Eds.), EQ-5D Concepts
and Methods: A Developmental History. Springer, Netherlands 2005.
Norquist JM, Fitzpatrick R, Dawson J, Jenkinson C. Comparing alternative Rasch-
based methods vs Raw scores in measuring change in health. Med Care 2004;
42: I-25-I-36.
O’Boyle CA, Hofer S, Ring L. Individualized quality of life. In: Fayers P and Hays
R. (Eds.), Assessing Quality of Life in Clinical Trials: Methods and Practice,
Second Edition. Oxford University Press, 2005.
Parsons T. The Social System. Glencoe, IL: Free Press. 1951.
Patient-reported outcomes and quality of life instruments database, PROQOLID;
www.proqolid.org. Accessed [August, 2009].
Patrick DL, Erickson P. Health Status and Health Policy: Allocating Resources to
Health Care. Oxford University Press, 1993.
Pallant JF, Misajon R, Bennett E, Manderson L. Measuring the impact and distress
of health problems from the individual’s perspective: development of the
perceived impact of problem profile (PIPP). Health and Quality of Life
Outcomes 2006; 4:36.
Perline R, Wright BD, Wainer H. The Rasch model as additive conjoint
measurement. Appl Psychol Meas 1979; 3: 237-256.
Pickard AS, Dalal MR, Bushnell DM. A Comparison of depressive symptoms in
stroke and primary care: applying Rasch models to evaluate the center for
epidemiologic studies-depression scale. Value in Health 2006; 9(1): 59-64.
Pickard AS, Kohlmann T, Janssen MF, Bonsel G, Rosenbloom S, Cella D.
Evaluating equivalency between response systems: application of the Rasch
model to a 3-level and 5-level EQ-5D. Med Care 2007; 45: 812-819.
Prieto L, Alonso J, Ferrer M. Are results of the SF-36 health survey and the
Nottingham health profile similar?: a comparison in COPD patients. J Clin
Epidemiol 1997; 50(4): 463-73.
Prieto L, Novick D, Sacristan JA, Edgell ET, Alonso J. A Rasch model analysis to
test the cross-cultural validity of the EuroQol-5D in the Schizophrenia
Outpatient Health Outcomes Study. Acta Psychiatr Scand 2003; 107 (suppl.
416): 24-29.
Raczek AE, Ware JE, Bjorner JB, Gandek B, Haley SM, Aaronson NK, Apolone G,
Bech P, Brazier JE, Bullinger M, Sullivan M. Comparison of Rasch and
summated rating scales constructed from SF-36 physical functioning items in
seven countries: results from the IQOLA project. J Clin Epidemiol 1998;
51(11): 1203-14.
Rasch G. Probabilistic Models for some Intelligence and Attainment Tests. Chicago.
University of Chicago Press. 1960/1980.
Rasch G. An item analysis which takes individual differences into account. British
Journal of Mathematical and Statistical Psychology. 1966; 19: 49-57.
Reckase MD. The past and future of multidimensional item response theory. Appl
Psychol Meas 1997; 21: 25-36.
Reeve BB. Item response theory modeling in health outcomes measurement. Expert
Rev. Pharmacoeconomics Outcomes Res. 2003; 3(2): 131-45.
Reeve BB, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA, Thissen D,
Revicki DA, Weiss DJ, Hambleton RK, Liu H, Gershon R, Reise SP, Lai JS,
Cella D. Psychometric evaluation and calibration of health-related quality of
life item banks: plans for the Patient-Reported Outcomes Measurement
Information System (PROMIS). Med Care 2007; 45: S22-S31.
Reise SP, Morizot J, Hays RD. The role of the bifactor model in resolving
dimensionality issues in health outcomes measures. Qual Life Res 2007; 16:
19-31.
Roberts FS. Applications of the theory of meaningfulness to psychology. Journal of
Mathematical Psychology 1985; 29: 311-332.
Rodin G, Craven J, Littlefield C. Depression in the Medically Ill: An Integrated
Approach. Brunner/Mazel, New York, 1991.
Rabin R, de Charro F. EQ-5D: a measure of health status from the EuroQol Group.
Annals of Medicine 2001; 33(5): 337-43.
Rugulies R. Depression as a predictor for coronary heart disease: a review and meta-
analysis. Am J Prev Med 2002; 23(1): 51-61.
Samejima F. Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph 1969 No. 17.
Samejima F. The graded response model. In van der Linden WJ & Hambleton RK
(Eds.), Handbook of Modern Item Response Theory. New York: Springer.
1996.
SAS. Statistical Analysis System, Version 9.1.3. SAS Institute Inc., Cary, NC,
U.S.A.
Sattler JM. Assessment of Children, Third Edition. Jerome M. Sattler, Publishers San
Diego, California. 1988/1990.
Schumacker RE. Classical test analysis. Applied Measurement Associates. 2005.
Schwartz CE, Sprangers MA. Methodological approaches for assessing response
shift in longitudinal health-related quality-of-life research. Social Science &
Medicine 1999; 48(11): 1531-48.
Schwartz CE, Sprangers MA, Fayers P. Response shift: you know it’s there but how
do you capture it? Challenges for the next phase of research. In Fayers P,
Hays R. (Eds.), Assessing Quality of Life in Clinical Trials, second edition.
Oxford University Press, 2005.
Sengupta N, Nichol MB, Wu J, Globe D. Mapping the SF-12 to the HUI3 and VAS
in a managed care population. Medical Care 2004; 42(9): 927-37.
Shavelson RJ, Webb NM. Generalizability Theory: A Primer. SAGE Publications,
Thousand Oaks, California, 1991.
Shavelson RJ, Webb NM, Rowley GL. Generalizability theory. American
Psychologist 1989; 44: 922-932.
Shaw JW, Johnson JA, Coons SJ. US valuation of the EQ-5D health states,
development and testing of the D1 valuation model. Med Care 2005; 43: 203-
220.
Simon GE, Revicki DA, Grothaus L, Vonkorff M. SF-36 summary scores: are
physical and mental health truly distinct? Med Care 1998; 36(4): 567-572.
Singer HK, Ruchinskas RA, Riley KC, Broshek DK, Barth JT. The psychological
impact of end-stage lung disease. Chest 2001; 120: 1246-1252.
Smith RM. Pre/post comparisons in Rasch measurement. In Wilson M, Draney K,
and Engelhard GJ (Eds.), Objective Measurement: Theory into Practice,
Volume 4, pp. 297-312. Greenwich, CT: Ablex. 1997.
Smith RM. Fit analysis in latent trait measurement models. Journal of Applied
Measurement 2000; 1(2): 199-218.
Smith EV Jr. Detecting and evaluating the impact of multidimensionality using item
fit statistics and principal component analysis of residuals. In: Smith EV Jr,
Smith RM, editors. Introduction to Rasch Measurement. Maple Grove, MN:
JAM Press; 2004. P575-600.
Spearman C. Demonstration of formulae for true measurement of correlation.
American Journal of Psychology 1907; 18: 161-169.
Spilker B (Ed.). Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd
edition. Philadelphia: Lippincott-Raven, 1996.
Sprangers MA. Response-shift bias: a challenge to the assessment of patients’
quality of life in cancer clinical trials. Cancer Treatment Rev. 1996; 22: S55-
S62.
Sprangers MA, Schwartz CE. Integrating response shift into health-related quality of
life research: a theoretical model. Social Science & Medicine 1999; 48: 1507-
1515.
Stenner AJ. Specific objectivity - local and general. Rasch Measurement
Transactions 1994; 8(3): 374.
Stevens SS. On the theory of scales of measurement. Science, New Series 1946;
103(2684): 677-680.
Streiner DL, Norman GR. Health Measurement Scales: a practical guide to their
development and use, 3rd edition. Oxford University Press, 2004.
Stout WF. A new item response theory modeling approach with applications to
unidimensionality assessment and ability estimation. Psychometrika 1990;
55(2): 293-325.
Stucki G, Daltroy LH, Katz JN, Johannesson M, Liang MH. Interpretation of change
scores in ordinal clinical scales and health status measures: the whole may
not equal the sum of the parts. J Clin Epidemiol 1996; 49(7): 711-7.
Suppes P, Krantz DH, Luce RD, Tversky A. Foundations of Measurement, Volume
II, Geometrical, Threshold, and Probabilistic Representations. Dover
Publications, Inc. Mineola, New York, 2007.
Surman OS: Psychiatric aspects of liver transplantation. Psychosomatics 1994; 35:
297-307.
Sutherland HJ, Llewellyn-Thomas H, Boyd NF, Till JE. Attitudes toward quality of
survival. The concept of “maximum endurable time.” Medical Decision
Making 1982; 2: 299-309.
Taylor SE, Lichtman RR, Wood JV, Bluming AZ, Dosik GM, Leibowitz RL. Illness-
related and treatment-related factors in psychological adjustment to breast
cancer. Cancer 1985; 55: 2506-2513.
Taylor WJ, McPherson KM. Using Rasch analysis to compare the psychometric
properties of the short form 36 physical function score and the health
assessment questionnaire disability index in patients with psoriatic arthritis
and rheumatoid arthritis. Arthritis & Rheumatism (Arthritis Care &
Research) 2007; 57(5): 723-9.
Tennant A. Disordered thresholds: an example from the Functional Independence
Measure. Rasch Measurement Transactions 2004; 17(4): 945-8.
Tennant A, McKenna SP. Conceptualizing and defining outcome. Br J Rheumatol
1995; 34: 899-900.
Tennant A, McKenna SP, Hagell P. Application of Rasch analysis in the
development and application of quality of life instruments. Value in Health
2004; 7(1): S22-S26.
Tennant A, Pallant JF. DIF matters: A practical approach to test if Differential Item
Functioning makes a difference. Rasch Measurement Transactions 2007; 20
(4): 1082-84.
Thoits PA. Undesirable life events and psycho-physiological distress: a problem of
operational confounding. American Sociological Review 1981; 46: 97-109.
Thompson B, Crowley S. When classical measurement theory is insufficient and
generalizability theory is essential. Paper presented at the annual meeting of
the Western Psychological Association, Kailua-Kona, Hawaii. (Eric
Document Reproduction Service No. ED 377 218), 1994.
Thorndike EL. An Introduction to the Theory of Mental and Social Measurements.
New York: John Wiley, 1904.
Thurstone LL. The law of comparative judgment. Psychological Review 1927; 34:
273-86.
Thurstone LL. Attitudes can be measured. American Journal of Sociology. 1928a;
33(4): 529-554.
Thurstone LL. The measurement of opinion. Journal of Abnormal and Social
Psychology. 1928b; 22: 415-430.
Thurstone LL. Measurement of social attitudes. Journal of Abnormal and Social
Psychology. 1931; 26: 249-269.
Torrance GW, Feeny DH, Furlong WJ, Barr RD, Zhang Y, Wang Q. A multi-
attribute utility function for a comprehensive health status classification
system: Health Utilities Index Mark 2. Medical Care 1996; 34: 702-22.
van der Linden WJ, Hambleton RK. Handbook of Modern Item Response Theory.
New York: Springer. 1996.
van Melle JP, Jonge P, Ormel J, Crijns H, van Veldhuisen DJ, Honig A, Schene AH,
van den Berg M. Relationship between left ventricular dysfunction and
depression following myocardial infarction: data from the MIND-IT.
European Heart Journal 2005; 26: 2650-2656.
Ware JE, Keller SD. Interpreting general health measures. In: B. Spilker (Ed.)
Quality of Life and Pharmacoeconomics in Clinical Trials, Second Edition.
Lippincott-Raven Publishers, Philadelphia. 1996.
Ware JE, Kosinski M, Bayliss MS, McHorney CA, Rogers WH, Raczek A.
Comparison of methods for the scoring and statistical analysis of the SF-36
health profile and summary measures: summary of results from the Medical
Outcomes Study. Med Care 1995; 33(4 suppl): AS264-79.
Ware JE, Kosinski M, Keller SD. A 12-item short-form health survey: construction
of scales and preliminary tests of reliability and validity. Medical Care 1996;
34: 220.
Ware JE, Kosinski M. SF-36® Physical & Mental Health Summary Scales: A
Manual for Users of Version 1, Second Edition. QualityMetric Incorporated,
Lincoln, Rhode Island. 2005.
Ware JE, Kosinski M, Turner-Bowker DM, Gandek B. User’s Manual for the
SF-12v2™ Health Survey. QualityMetric Incorporated, Lincoln, Rhode Island,
and Health Assessment Lab, Boston, Massachusetts. 2007.
Ware JE, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I.
Conceptual framework and item selection. Medical Care 1992; 30(6): 473-
83.
Wiberg M. Classical test theory vs. item response theory: an evaluation of the theory
test in the Swedish driving-license test. Education Measurement 2004; 50.
online publication:
http://www.umu.se/edmeas/publikationer/pdf/EM%20no%2050.pdf,
[Accessed March 29, 2008].
Williams A. The EuroQol instrument. In Kind P, Brooks R, Rabin R. (Eds.) EQ-5D
Concepts and Methods: A Developmental History. Springer, Netherlands
2005.
Wells KB, Golding JM, Burnam MA. Psychiatric disorder and limitation in physical
functioning in a sample of the Los Angeles general population. Am J
Psychiatry 1988; 145: 712-717.
Whooley MA. Depression and cardiovascular disease: healing the broken-hearted.
JAMA. 2006; 295(24): 2874-2881.
Wiklund I. Assessment of patient-reported outcomes in clinical trials: the example of
health-related quality of life. Fundamental & Clinical Pharmacology 2004;
18: 351-363.
Wilson M. Measuring changes in the quality of school life. In Wilson M (Ed.),
Objective Measurement: Theory into Practice, Volume 1, pp. 77-96.
Norwood, NJ: Ablex, 1992.
Wilson M. Subscale and summary scales: issues in health-related outcomes. In:
Lipscomb J, Gotay CC, Snyder C. (Eds.) Outcomes Assessment in Cancer:
Measures, Methods and Applications. Cambridge University Press, 2005.
Wolfe EW, Chiu CWT. Measuring pretest-posttest change with a Rasch rating scale
model. Journal of Outcome Measurement 1999a; 3(2): 134-161.
Wolfe EW, Chiu CWT. Measuring change across multiple occasions using the Rasch
rating scale model. Journal of Outcome Measurement 1999b; 3(4): 360-381.
Wolfe EW, Smith, EV. Instrument Development Tools and Activities for Measure
Validation Using the Rasch Models: Part I – Instrument Development Tools.
In Smith, Jr., EV & Smith, RM (Eds.), Rasch Measurement: Advanced and
Specialized Applications (pp. 202-242). Maple Grove, MN: JAM Press.
2007.
World Health Organization. Constitution of the World Health Organization. In Basic
Documents. Geneva: Author. 1948.
Wright BD. Sample-free test calibration and person measurement. MESA Research
Memorandum #1 1967; www.rasch.org/memo1.htm.
Wright BD. Rasch model from Thurstone’s scaling requirements. Rasch
Measurement Transactions 1989; 2: 13-14.
Wright BD. Comparisons require stability. Rasch Measurement Transactions 1996a;
10: 506.
Wright BD. Reliability and separation. Rasch Measurement Transactions 1996b; 9
(4): 472.
Wright BD. Common sense of measurement. Rasch Measurement Transactions
1999a; 13: 704-705.
Wright BD. Fundamental measurement for psychology. In: The New Rules of
Measurement: What Every Psychologist and Educator Should Know.
Embretson SE, Hershberger SL. (Eds.). Lawrence Erlbaum Associates, Inc.,
Publishers. Mahwah, NJ. 1999b.
Wright BD. Conventional factor analysis vs. Rasch residual factor analysis. Rasch
Measurement Transactions 2000; 14: 753.
Wright BD, Huber M, O’Neill T, Linacre JM. The problem of measure invariance.
Rasch Measurement Transactions 2000; 14 (2): 745.
Wright BD, Linacre JM. Observations are always ordinal; measurements, however,
must be interval. Archives of Physical Medicine and Rehabilitation 1989a;
70(12): 857-860.
Wright BD, Linacre JM. Differences between scores and measures. Rasch
Measurement Transactions 1989b; 3(3): 63.
Wright BD, Linacre JM. Reasonable mean-square fit values. Rasch Measurement
Transactions 1994; 8: 370.
Wright BD, Masters GN. Rating Scale Analysis: Rasch Measurement. MESA press,
Chicago. 1982.
Wright BD, Mok M. Rasch models overview. Journal of Applied Measurement
2000; 1(1): 83-106.
Wright BD, Mok MMC. An Overview of the Family of Rasch Measurement Models.
In Smith, Jr., EV & Smith, RM (Eds.), Introduction to Rasch Measurement
(pp. 1-24). Maple Grove, MN: JAM Press. 2004.
Wright BD, Panchapakesan N. A procedure for sample-free item analysis.
Educational and Psychological Measurement 1969; 29: 23-48.
Wright JL, Porter MP. Quality-of-life assessment in patients with bladder cancer.
Urology 2007; 4(3): 147-54.
Wright BD, Stone MH. Best Test Design. Chicago: MESA Press. 1979.
Wright JG, Feinstein AR. A comparative contrast of clinimetric and psychometric
methods for constructing indexes and rating scales. J Clin Epidemiol 1992;
45(11): 1201-8.
Wyrwich KW, Wolinsky FD. Identifying meaningful intra-individual change
standards for health related quality of life measures. Journal of Evaluation in
Clinical Practice 2000; 6(1): 39-49.
Yarnold PR. On testing inter-scale difference scores within a profile. Educational
and Psychological Measurement 1982; 42: 1037-1044.
Yarnold PR. Classical test theory methods for repeated measures N = 1 research
designs. Educational and Psychological Measurement 1988; 48: 913-919.
Yarnold PR, Prescott RV. On testing intra-scale difference scores within a profile.
Proceedings of the 1986 Meetings of SEAIDS, 1986; 16: 147-148.
Ziegelstein RC, Thombs BD. The brain and the heart: the twain meet. European
Heart Journal 2005; 26: 2607-2608.
APPENDIX: SAMPLE CONTROL FILES
Table A.1: An example of the rating scale model control file using the SF-12v2™
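Note: the control file listing in this table did not survive text extraction, so the lines below are only a minimal illustrative sketch of a WINSTEPS rating scale model control file for a 12-item instrument such as the SF-12v2™. The data file name, column positions, response codes, and item labels are placeholders, not the specifications actually used in this dissertation.
; Illustrative WINSTEPS control file (rating scale model); all values are placeholders.
TITLE = "SF-12v2 rating scale model (illustrative)"
DATA = sf12v2.dat        ; hypothetical fixed-column response file
NAME1 = 1                ; person identifier begins in column 1
ITEM1 = 11               ; first item response is in column 11
NI = 12                  ; number of items
XWIDE = 1                ; one column per response
CODES = 12345            ; valid response codes
ISGROUPS = " "           ; all items share one rating scale structure (Andrich RSM)
&END
Item01
Item02
Item03
Item04
Item05
Item06
Item07
Item08
Item09
Item10
Item11
Item12
END NAMES
With ISGROUPS set to a single group, every item is modeled with one common set of category thresholds; setting ISGROUPS=0 instead would give each item its own threshold structure (the partial credit model).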
171
Table A.2: An example of 2-facet model specification file using the EQ-5D
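The table contents are again not reproduced in this transcript; the sketch below indicates the general shape a two-facet (persons by items) FACETS specification file for the EQ-5D might take. The element ranges, labels, and data lines are assumed placeholders rather than the study's actual specifications.
; Illustrative FACETS specification file, two facets (persons, items); placeholders only.
Title = EQ-5D two-facet analysis (illustrative)
Facets = 2               ; facet 1 = persons, facet 2 = items
Positive = 1             ; facet 1 (persons) oriented so higher measure = more of the trait
Models = ?,?,R3          ; any person, any item, rating scale with highest category 3
Labels =
1, Persons
1-500                    ; person elements 1 through 500
*
2, Items
1 = Mobility
2 = Self-care
3 = Usual activities
4 = Pain/discomfort
5 = Anxiety/depression
*
Data =
1, 1-5, 1, 2, 1, 1, 2    ; person 1, items 1 through 5, five EQ-5D responses
2, 1-5, 2, 1, 1, 3, 2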
172
Table A.3: An example of 3-facet model specification file using the EQ-5D
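For the three-facet case only a few specification lines change relative to the two-facet sketch above: a time (occasion) facet is added so that each response is indexed by person, item, and occasion. As before, this is an assumed illustration, not the file used in the dissertation.
; Illustrative changes to the two-facet sketch for a three-facet analysis.
Facets = 3               ; facet 1 = persons, facet 2 = items, facet 3 = occasions
Models = ?,?,?,R3        ; person, item, occasion, shared rating scale
; ... person and item facets defined exactly as in the two-facet sketch ...
3, Occasions
1 = Time 1
2 = Time 2
*
Data =
1, 1-5, 2, 1, 2, 1, 1, 2 ; person 1, items 1 through 5, occasion 2, five responses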
Abstract
Effective allocation of limited resources in the healthcare delivery system requires valid measurement of health, and valid measurement of health is contingent upon the validity of the health instrument used. An important property of measurement in general is “unidimensionality”: do the numbers assigned to the qualitative attribute being measured increase with increases in the attribute? This property of a health instrument matters because, without unidimensionality, it is impossible to discuss improvements in health with treatment or improvements in health at the population level. In this dissertation, we evaluate unidimensionality with respect to a model of the probability of item response.
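For orientation, the family of item response models referred to above can be illustrated with the Andrich rating scale parameterization, stated here in standard notation as a general form rather than as a quotation from the dissertation: the probability that person $n$ responds in category $k$ of item $i$ is
$$P(X_{ni}=k)=\frac{\exp\sum_{j=0}^{k}\left(\theta_n-\delta_i-\tau_j\right)}{\sum_{m=0}^{M}\exp\sum_{j=0}^{m}\left(\theta_n-\delta_i-\tau_j\right)},\qquad k=0,\dots,M,\ \ \tau_0\equiv 0,$$
where $\theta_n$ is the person measure, $\delta_i$ the item difficulty, and the $\tau_j$ are category thresholds shared by all items. Unidimensionality is the requirement that a single $\theta_n$ per person accounts for the responses to every item.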
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
New approaches using probabilistic graphical models in health economics and outcomes research
PDF
Discriminating changes in health using patient-reported outcomes
PDF
Essays on the equitable distribution of healthcare
PDF
Value in health in the era of vertical integration
PDF
Characterization of health outcomes in patients with hemophilia A and B: Findings from psychometric and health economic analyses
PDF
The limits of unidimensional computerized adaptive tests for polytomous item measures
PDF
Treatment of hepatitis C in era of direct acting antivirals (DAAs): selected essays in health economics
PDF
Effects of a formulary expansion on the use of atypical antipsychotics and health care services by patients with schizophrenia in the California Medicaid Program
PDF
The causal-effect of childhood obesity on asthma in young and adolescent children
PDF
Burden of illness in hemophilia A: taking the patient’s perspective
PDF
Assessment of the impact of second-generation antipscyhotics in Medi-Cal patients with bipolar disorder using panel data fixed effect models
PDF
Essays on the economics of infectious diseases
PDF
Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia
PDF
The impact of Patient-Centered Medical Home on a managed Medicaid plan
PDF
Three essays on estimating the effects of government programs and policies on health care among disadvantaged population
PDF
Understanding primary nonadherence to medications and its associated healthcare outcomes: a retrospective analysis of electronic medical records in an integrated healthcare setting
PDF
Mindfulness and resilience: an investigation of the role of mindfulness in post-9/11 military veterans' mental health-related outcomes
PDF
Essays on health and aging with focus on the spillover of human capital
PDF
The role of individual variability in tests of functional hearing
PDF
The dynamic relationship of emerging adulthood and substance use
Asset Metadata
Creator
Gu, Ning Yan (author)
Core Title
Testing the role of time in affecting unidimensionality in health instruments
School
School of Pharmacy
Degree
Doctor of Philosophy
Degree Program
Pharmaceutical Economics
Publication Date
02/01/2010
Defense Date
12/01/2009
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
cross-sectional,DIF,differential item functioning,EQ-5D,health,inter-temporal,item fit,many-FACET model,measurement,misfit,OAI-PMH Harvest,PCAR,principal component analysis of residuals,Rasch partial credit model,Rasch rating scale model,SF-12,SF-36,Time,unidimensionality
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Doctor, Jason N. (
committee chair
), Nichol, Michael B. (
committee member
), Siegmund, Kimberly D. (
committee member
)
Creator Email
gun@usc.edu,ningyan426@hotmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2820
Unique identifier
UC1495143
Identifier
etd-Gu-3414 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-288319 (legacy record id),usctheses-m2820 (legacy record id)
Legacy Identifier
etd-Gu-3414.pdf
Dmrecord
288319
Document Type
Dissertation
Rights
Gu, Ning Yan
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
cross-sectional
DIF
differential item functioning
EQ-5D
inter-temporal
item fit
many-FACET model
measurement
misfit
PCAR
principal component analysis of residuals
Rasch partial credit model
Rasch rating scale model
SF-12
SF-36
unidimensionality