A FUNCTIONAL USE OF RESPONSE TIME DATA IN COGNITIVE ASSESSMENT
by
John Janson Prindle
____________________________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(PSYCHOLOGY)
May 2012
Copyright 2012 John Janson Prindle
Dedication
To my family.
Acknowledgements
I would like to thank Jack for all of his advice and encouragement since we met. The same day we
met Jack dropped me off with John Horn for my first lesson in factor analysis. This, along with a lively
conversation between the two of them over lunch, sparked the events leading up to this body of work.
In addition to their collective wisdom, I would also like to thank Yan Zhou, Kelly Kadlec, Ricardo Reyes,
Kevin Petway, Leslie Owen Pole, Erin Shelton, Randy Paul Bautista, and Kelly Peters for insightful
conversations over the years.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables vi
List of Figures viii
Abstract x
Chapter 1: Introduction 1
References 5
Chapter 2: Stimulus for the Current Study 6
Reacting to Responses 7
Cognitive Ability and Processing Speed 9
Basic Respondent Scoring 13
Scoring with Response Time 14
Shortening the Process 16
Regaining Predictability with Collateral Information 18
Conclusion 21
References to Chapter 2 22
Chapter 3: Measurement Properties Using the Number Series Test 27
Abstract 27
Introduction 27
Methods 34
Results 37
Conclusion 53
References to Chapter 3 57
Chapter 4: Joint Response and Response Time Prediction Model 60
Abstract 60
Introduction 60
Methods 68
Results 73
Conclusion 83
References to Chapter 4 86
Chapter 5: Reduced and Adaptive Testing Methods 88
Abstract 88
Introduction 88
Methods 96
Results 99
Conclusion 107
References to Chapter 5 110
Chapter 6: Adaptive Testing Methods with Response Time 112
Abstract 112
Introduction 112
Methods 121
Results 123
Conclusion 134
References to Chapter 6 137
Chapter 7: Predictive Effects of Demographics 139
Abstract 139
Introduction 139
Methods 148
Results 153
Conclusion 167
References to Chapter 7 170
Chapter 8: General Discussion 173
Measurement Model Results 173
Reduced and Adaptive Results 175
Response Time Results 177
Demographic Results 178
Synthesizing a Conclusion 181
Final Remarks 184
References to Chapter 8 186
Bibliography 188
Appendix 197
Chapter 3 Programs 197
Chapter 4 Programs 207
Chapter 5 Programs 212
List of Tables
Table 3.1: Descriptive Statistics of the American Life Panel, complete
sample size N = 2548 35
Table 3.2: Classical Test Theory and Item Response Theory Fit Statistics 39
Table 3.3: Score Means and Correlations for Fixed Anchors and IRT estimation 42
Table 3.4: Differential Item Functioning Between Males and Females 50
Table 3.5: Regression models with Number Series as the Outcome Variable 52
Table 4.1: Descriptive Statistics of the American Life Panel, complete
sample size N = 2548 69
Table 4.2: Item Response and Response Time Statistics with N = 2548 71
Table 4.3: Comparative Model Fits for CIRT models (N=2548) 73
Table 4.4: Item Response Models with Response Time
Parameters Constrained 74
Table 4.5: Model 4 Response and Response Time Item Statistics
with Parameter Standard Deviations 77
Table 4.6: Covariance Structure of Item Difficulty and Discrimination
Parameters 79
Table 4.7: Covariance Structure of Person ability and Speed Parameters 81
Table 5.1: Descriptive Statistics of the American Life Panel, complete
sample size N = 2548 96
Table 5.2: Item Subsets from Different Item Selection Methods
(Item Difficulty Included for Comparison Purposes) 101
Table 5.3: Sample Statistics of Reduced Form Number Series Tests,
Intercorrelations are Shown for All Subtests 103
Table 5.4: Adaptive testing Schemes compared to Total 30 Item Score
with Sample Statistics and Correlations 106
Table 5.5: R² Predictability of Total Score with varying sizes of Item Sets 107
Table 6.1: Descriptive Statistics of the American Life Panel, complete
sample size N = 2548 121
Table 6.2: Regression Method of Response and Response Time Integration 125
Table 6.3: Enumerated Results for Item Sets with CIRT Scored Models 128
Table 6.4: Scoring Intercorrelations for (A) CIRT Models and
(B) IRT Models with Reduced Item Sets 129
Table 6.5: Scoring Intercorrelations for (A) CIRT Models and
(B) IRT Models with Adaptive Item Sets 134
Table 7.1: Descriptive Statistics of the American Life Panel, complete
sample size N = 2548 149
Table 7.2: Item Subsets from Different Item Selection Methods
(Item Difficulty Included for Comparison Purposes) 153
Table 7.3: Demographic Regression Weights for Predicting IRT Scores
From (A) Reduced and (B) Adaptive Item Sets (N=2073) 154
Table 7.4.A: IRT Scale Score Correlations for 30 Item Scale, Reduced Item
Set Scales and Adaptive Scales 156
Table 7.4.B: Scale Correlations of CIRT Scores and Speededness with
Demographic Variables (N=2458) 158
Table 7.5.A: Demographic Regression Weights for Predicting CIRT Ability Scores
From (A) Reduced and (B) Adaptive Item Sets and Response
Times (N=2086) 159
Table 7.5.B: Demographic Regression Weights for Predicting CIRT Speededness
Scores From (A) Reduced and (B) Adaptive Item Sets and
Response Times (N=2073) 160
Table 7.6: Regression Models with Item Responses, Response Times,
Interactions and Demographics Predicting Full
Scale Scores (N=2011) 163
Table 7.7: Regression Models with CIRT Ability, Speededness and
Demographics Predicting Full Scale Scores (N=2073) 166
List of Figures
Figure 3.1: A histogram plot of Percentage correct raw score
for 30 item Number Series test 40
Figure 3.2: Scatterplot shows the relationship between scores on
two scoring methods 41
Figure 3.3: Scatterplots show the monotonically increasing trend in
(a) set A items and (b) set B items with fixed anchors 43
Figure 3.4: The scatterplots illustrate the range of item difficulties for (a) set A
of the Number Series items and (b) set B of the Number Series items 44
Figure 3.5: Scatterplot of Infit Meansquares for (a) item set A and (b) item set B 46
Figure 3.6: A scatterplot of the test information curve is presented 48
Figure 3.7: The Item Information Curves provide information on the difficulty
of each item with (a) being item 1, (b) item 7, and (c) item 15 49
Figure 4.1: A factor model for 30 items 61
Figure 4.2: A Conditionally Independent Response Time model 64
Figure 4.3: Probability plots of the response time models for selected
Number Series items 78
Figure 4.4: A scatterplot of the estimated ability levels of Model 4 versus
the estimated speededness parameter for participants 80
Figure 4.5: A scatterplot of estimated standard deviation of abilities for
IRT (crosses) and CIRT (dots) over the ability range 82
Figure 5.1: A pictorial representation of the half adaptive methodology 93
Figure 5.2: Distribution of 30 item scores for the ALP sample 100
Figure 5.3: Representation of Block Adaptive Scoring 104
Figure 5.4: A scatter plot of Item Selection Methods Predicting Total Score
on the Number Series task 108
Figure 6.1: The response time distributions for (a) raw responses in seconds
and (b) log transformations of response times are given 119
Figure 6.2: Scatterplots of 30 item scores, half-adaptive scores,
and modified half adaptive scores 131
Figure 6.3: Scatterplot of 30 item scores versus the half-adaptive scores
and modified half-adaptive scores 132
Figure 7.1: Conditionally Independent Response Time model 145
Figure 7.2: Comparison of (a) raw income values with (b) log transformed
income values 150
Figure 7.3: Plots of explained variance without (blue line) and
with (red line) demographics 168
Abstract
The stimulus for the current body of work comes from the desire of researchers to have concise
and accurate cognitive tasks implemented in their surveys. The main purpose of this work was to
introduce collateral information as a way of making up for lost or forgone information when adaptive
frameworks are adopted. The nature of ongoing surveys and the ubiquity of computers provide ample
collateral, or nonintrusive, information which can help improve score accuracy. Information such as how
long a respondent spends on certain items, their age, education, and other characteristics can improve
score prediction beyond simple item responses.
The importance of this work lies in methods that effectively decrease the number of items given to participants while keeping accuracy high despite the loss of information. In the current
study, the Woodcock Johnson – III (WJ-III) Number Series (NS) task was presented with 30 previously
unpublished items as stimuli. First, a couple of scoring models were implemented to test for model fit
and compare the implications of the fit values. Then methods outlined below systematically adjusted
patterns of missingness to mimic reduced and adapted subsets. Once the smaller NS item sets were
delineated, several methods of adding predictive accuracy were tested and compared.
In scoring respondents a traditional Item Response Theory (IRT) model as proposed by Rasch
(1960) was first used to provide evidence for a uni-dimensional scale and obtain baseline statistics for
item difficulty and person abilities. The next model was a Conditionally Independent Response Time
(CIRT) model. The latter model includes a response model as well as a joint response time model for
scoring. It was shown that with the full item set these two models provide identical ability estimates and
item parameters. The response time model of the CIRT framework provides ability scores and
speededness scores based on response time patterns.
Next, focus was placed on effectively decreasing the number of items used in scoring each
respondent. Methods included item reduction, test forms in which the same item sets were used to
score each respondent, and adaptive tests, where each respondent could receive a different item set.
Reduced item sets fared better when item difficulties more closely matched sample ability levels
(r=0.72-0.90). Adaptive item sets were more consistent in measuring ability (i.e. half-adaptive, block
adaptive, fully adaptive), but accuracy was best for the fully adaptive method used (r=0.79-0.91).
The last steps of analysis involved introducing response time and demographic variables as
additional predictors of the 30 item scores. Item response, response times, and response/response time
interactions provided small improvements in explained variance when used as predictors (1-8%). When
CIRT ability and speededness scores were used as predictors, speededness provided limited
improvements (<1%) to prediction. The addition of age, education, and gender to response models
improved explained variance to a moderate degree (1-5%).
In conclusion, we note that the sample had a higher than average ability level for the NS task, and this should color our findings for the methods outlined. The item sets that matched respondent abilities less well were improved more by response time and demographic data. If one can correctly
identify the ability ranges of a sample before administration, then a more focused reduced item set
would be advantageous. Adaptive item sets seem advantageous in a more general testing situation
where ability levels are more variable. The advantage of using collateral information in predicting
cognitive scores is the amount of time saved by omitting items, potentially lowering costs, and allowing
researchers to move onto more tasks if desired. While the improvement due to response time in these
methods was limited with NS, there is a good foundation for other cognitive tasks administered in
computer assisted designs.
Chapter 1: Introduction
When researchers set out to test and score respondents on a given cognitive task, it is assumed the researchers will (1) do their best to accurately score said respondents and (2) do so with all of the
information available. The first point is usually obtained by having a scale which has good reliability and
is targeted at the construct under study. The second point, using all information, is the more
problematic one to address. The amount of information available to a researcher is seemingly
boundless. Much of testing focuses on the item responses of the respondent and turns a blind eye to
cursory information such as response times, education, age, gender, and income. The hypothesis in this body of work is that this information is useful, if not valuable, in aiding researchers to obtain accurate cognitive ability scores. This is all the more the case in adaptive frameworks, where response information is
necessarily lost because some items are chosen not to be presented.
The purpose of this first chapter is to prepare the reader by laying out the structure of the next
sections of work. The sequence of studies builds to form a logical framework in which cognitive tests can
benefit from adaptation and collateral information (Novick & Jackson, 1974). First, the Woodcock-Johnson III Number Series task (WJ-III NS, shortened hereafter to NS) is introduced as a potential beneficiary of the methodology (Woodcock, McGrew, & Mather, 2001). Chapter 2 provides the
necessary background on research regarding response and response time scoring of cognitive ability. It
further provides a path of reasoning for implementing collateral information in regaining selectively
deleted response information.
The analysis of Chapter 3 looks at the scale components of the NS scale and makes the case for a
unidimensional scale with specific ordering of item difficulties. In order to obtain these item parameters
the Item Response Theory (IRT) framework is used to obtain estimates pertaining to items (difficulties)
and estimates of respondent ability. Variations in model parameterization are tested to determine the
most parsimonious model for NS. The overall outcome of this portion was to provide ability estimates of
the respondents. This score is the basis for assessing accuracy in the following analyses and is the typical
score one would receive upon completing the NS test.
Following the establishment of traditional scale properties, in Chapter 4 another, more modern scoring method is undertaken, one that includes item response time information. As the respondents were given
the test, their response times per each item were recorded. This offered the chance to include these
additional pieces of information to help predict the ability scores of the respondents. The model
which included a response function and a response time function has been dubbed a Conditionally
Independent Response Time (CIRT) model (van der Linden, 2007). With the new parameterization of the
joint models each item now has speed characteristics in addition to difficulty characteristics. With the
added speed features, person parameters based on speededness can then be estimated and covaried
with person ability. The CIRT and IRT scores are then compared to one another for agreement when the
full item sets are used.
In Chapter 5 the focus shifts to scoring respondents with fewer items. The fundamental
difference among the item selection methods used is how items are chosen. The first set of item selection methods is referred to as reduced (or short-form) item sets. In these item sets, items are chosen by some standard rule (at random, at fixed intervals, etc.) and each respondent is given the same item set. Then I contrast these with adaptive item sets, where each item set is unique to
the behavior exhibited by the respondent. Respondent behavior dictates which item or items will follow
previous items based on observed performance. There are obvious advantages and disadvantages to
each method, which are highlighted in relation to the performance of these scales to the total scale
scores outlined in Chapters 3 and 4. The objective of these reduced and adaptive item sets is to identify
how well each subset predicted the full scale score.
The next set of analyses, Chapter 6, reintroduces response time as a more prominent factor in
increasing score accuracy in the adaptive testing framework. Several methods of including response
time information are used to improve score accuracy of the reduced and adaptive item sets in an
attempt to regain the full scale score that one would have obtained if all items were used to calculate
scores. The implementation of response time data in scoring is the natural choice in psychological
research that is undertaken in a modern age. The ubiquity of computers in everyday life and certainly in
modern testing provides ample opportunity to accurately record timing of keystrokes and interactions
with computers (Horn, 1979). The concept of collateral information is then extended to include any
other information available to researchers in a similar situation (using surveys and internet to administer
their experiments).
The final analysis in Chapter 7 attempts to bring in demographic information to provide
additional improvement in reduced and adaptive item sets. With the possible improvements in
prediction already noted due to response time, there is the possibility that such respondent
characteristics like age, education, gender, income, and Need for Cognition (NFC; Cacioppo, Petty, &
Kao, 1984) may have some influence on score variations. The stimulus for this portion of analysis comes
from the network of survey datasets as they are designed in studies such as the Health and
Retirement Study (HRS) and the American Life Panel (ALP) in the United States (Kapteyn, 2002; Willis,
2011). These surveys are longitudinal in design and maintain contact with participants for new data
collection. The HRS focuses on the aging population and tries to get repeated measures so as to track
progress over time. The ALP on the other hand puts emphasis on the idea that different surveys can be
administered to a core group and then linked in a growing database. Since these individuals have
provided us the information at one time or another it can be used to help inform our scores provided
such relationships are warranted.
In the final section, Chapter 8, a general discussion of the combined results is presented. The
implementation of differing item subsets with response times and demographic information is assessed
and suggestions for future use are made. The overall theme of this set of analyses is that researchers are
provided with a set of methods with which they can create reduced and adaptive tests that implement
collateral information in order to maintain accuracy. In doing this study semi-experimentally there are
some caveats to note. The first is that the item sets were given to respondents in two sets by order of
anchored difficulty with a filler task designed to eliminate cognitive load for a brief period. This design
provides responses that are arranged by item difficulty and do not take into account possible learning or
item drift during test taking (Bowles & Salthouse, 2003). Second, the data was selectively deleted to
mimic the response patterns associated with each item subset. Our results hold in the IRT framework
and should generalize to other cognitive abilities around which scales have been formed (Wainer,
2000). The final chapter offers alternative designs and factors for examining these theories in depth.
References to Chapter 1
Bowles, R.P. & Salthouse, T.A. (2003). Assessing the age-related effects of proactive interference
on working memory tasks using the Rasch model. Psychology and Aging, 18, 608-615.
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Horn, J.L. (1979). Trends in the measurement of intelligence. Intelligence, 3, 229-240.
Kapteyn, A. (2002). Internet Interviewing and the HRS (grant number 1R01AG020717-01). Rand
Corporation: Santa Monica, CA.
Novick, M. R., & Jackson, P. H. (1974). Statistical methods for educational and psychological
research. New York: McGraw-Hill.
van der Linden, W.J. (2007). A hierarchical framework for modeling speed and accuracy on test
items. Psychometrika, 72, 287-308.
Wainer, H. (Ed.) (2000). Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence
Erlbaum.
Willis, R.J. (2011). Health and Retirement Study. (grant number NIA U01AG009740). University
of Michigan: Ann Arbor, MI.
Woodcock, R.W., McGrew, K.S., & Mather, N. (2001). Woodcock-Johnson III. Rolling Meadows,
IL: Riverside Publishing.
Chapter 2: Stimulus for the Current Study
There are specific aims that one has in modern psychological research. First, there is the desire to have an accurate measure of behavior. Most methods for assessing ability have been promoted as providing a better description of behavior than previous methods. Classical Test Theory (CTT) has ceded some ground to Item Response Theory (IRT). Improving these methods with a model that jointly
looks at response and response time should prove to be helpful in accurately defining ability. The
second aspect of modern research is that of time optimization. In order to use time efficiently the test
cannot go on for extended periods of time, but should still be relatively accurate in pinpointing ability
levels. To bring these points together there is the need to examine time information to accurately
inform ability scores when fewer items are used. Adaptive testing provides the opportunity to assess
ability in a compact period of time (Wainer, 2000). However, when a test is adapted it is necessarily shortened and items are omitted to reduce testing time. This leaves room for error in score computation, which is accepted in adaptive methods. The focus of the current study is on overcoming this deficiency by including information that is obtained as a byproduct of psychological testing. By using information beyond the responses alone, it is hypothesized that some predictability can be regained.
Various scoring methods are used to provide a measure of a given ability that a participant has.
The ability is not directly observable, being indicated only by how the participant responds to the items.
The more difficult the items one answers correctly, the more ability one is inferred to have in the domain the items are purported to measure. It is through this basic paradigm that concepts like math ability or picture vocabulary are measured. A label of high or low ability on these types of tasks is then assigned according to how well respondents do on the items, but the ability itself cannot be observed.
Reacting to Responses
As a functional aspect of research, those who gather data collect far more information than is needed in a given observation period. This is an attempt to “get it while they've got them.” The participants are given more tests and items since their attention has already been gained. There is massive potential to study many objectives in a single study, and many measures can be introduced to provide a good understanding of many behavioral issues. Provided that measures are of good reliability and based on sound experimental processes, current psychological studies of broad human behavior such as the Health and Retirement Study (HRS; Willis, 2011) and the American Life Panel (ALP; Kapteyn, 2002) will provide content for many studies of behavior. Proposing that researchers gather as much information as possible during an experiment promotes a nearly universal but rarely discussed side effect of such studies: a large amount of the collected information goes largely ignored.
With a mountain of information facing researchers in these studies, it seems inevitable that much of the data collected will go unused. Much of this is due to projects and ideas that get pushed back when the main focus of the study requires more attention. Some of the reason for this apparent hoarding of information is an idealized vision of the rate at which work can get done: the idea that once you have the information, you will find the time to analyze it. Some of the data being collected is truly extra, diversionary ideas that some wish there were more time for. But some of this collected information seems like it would be highly useful and often gets thrown into the same category as the lofty side projects.
Of course, priority is given to tasks that answer primary research goals. Much of research in this day is done with a heavy emphasis on time constraints and costs. Given that time is always the bottom line in what is collected (because more important tasks are kept over those that are more trivial), researchers tend to leave response time out of potential analyses. While the importance of response time is not a foreign concept in psychological theories of behavior, implementing time in measurement models of ability has not progressed to an acceptable level.
The mechanics of why response time should be used in scoring models are fairly straightforward. A simple example is that of a typical math exam. Tests can be given to two people, person A and person B, on ten items and provide a score as to how many each got right. This is a typical scoring system that neglects the idea that some people work faster than others. Person A may have finished the test in 10 minutes while person B took 20 minutes. This effectively gives them the same score, even though they each worked at different rates. If these two additionally get the same items correct, then they have scored the same and, under typical response-only models, they would be assigned the same level of ability. To neglect the rate of work is to ask a question other than ability. One could contend that person A should be rewarded for working at a faster rate, or that his ability at the math task is effectively higher than that of person B.
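As a minimal numeric sketch of this example (hypothetical scores and times, not data from the current study), the raw number-correct score alone hides the difference in work rate:

    # Hypothetical 10-item response patterns (1 = correct) and total times in minutes.
    responses_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
    responses_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
    minutes_a, minutes_b = 10.0, 20.0

    for label, resp, minutes in [("A", responses_a, minutes_a), ("B", responses_b, minutes_b)]:
        score = sum(resp)          # number-correct score ignores time
        rate = score / minutes     # correct responses per minute
        print(f"Person {label}: score {score}/10, rate {rate:.2f} correct per minute")
    # Both receive the same score of 7, but person A works twice as fast as person B.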
Time in the above example provides a good view of how two people who perform the same on a task can be distinguished by introducing response time. Time gets at a different aspect of scoring ability, one that helps researchers to distinguish between power and performance
type scores. Cattell (1948) indicates that two possible functions of tests are that they either fall into the
power category, how well could one do given ample time, or the performance category, what can a
respondent give us in a limited period of time. In the math test example person A had too much time
and maybe person B had just enough time. In terms of power and performance, the test has good
power, but doesn’t get at individual differences in response rate.
Response time is a readily available piece of information in modern psychological studies done
with computers (Horn, 1979). The information can be documented and stored alongside participant
responses and does not require researchers to insert additional measures into their studies. Response time really is collateral: a result of the process by which psychological testing is done. Since it is so easy to obtain, time to respond should be integrated as a source of information for ability measurement. To neglect something so readily available would be unbecoming of
the modern investigator.
Cognitive Ability and Processing Speed
The current analyses focus on the Woodcock-Johnson III Number Series (NS) task. As a matter of defining the cognitive ability and putting forth the stimulus for research, information about the structure of cognition, how NS fits in, and how processing speed relates to NS ability needs to be provided. The definition of ability used in the current line of research is that coined by Carroll (1993). Here a
cognitive ability is an ability in which one “processes mental information”, as opposed to a general
ability which is the capability to perform any action including those of mental capacities. A good
example of a cognitive ability is a recall task in which a list of words is stored and then retrieved. Here it
is inferred that the cognitive process to successfully complete the task provides a measure of specific
cognitive ability. In the following sections the term ability is used to mean cognitive ability in order to be
more concise in wording.
Fitting NS into a framework of intelligence requires a little knowledge of the structure of intelligence, so a brief discussion of the development of cognitive ability structures provides background on the matter. The first theories of intelligence viewed the construct as a single dimension
(Spearman, 1904). A general factor of intelligence would build on the ideas of Galton (1883) and Cattell
& Galton (1890), where the philosophical idea of intelligence was first discussed. Subsequent studies
examined the statistical structure of intelligence as a unified mental ability (Burt, 1949; McNemar, 1960;
Vernon, 1950). This view was held, and is still held today, to examine age-related differences and changes in cognitive ability (Deary, Allerhand, & Der, 2009).
A very different but related view of cognitive ability is one of parallel processes. Here abilities fall into equal constructs predefined as factors that one could test for with scales (Garnett, 1919; Guilford, 1959; Thurstone, 1938). These cognitive abilities are specifically outlined as having their own factor structure, with primary mental abilities formed based on behavioral patterns in cognitive testing. The marked difference between a unified intelligence and the
multiple intelligences is that the separate cognitive abilities function equally but independently. In the
unified intelligence theory there is one process that can account for the abilities observed. Both models
make bold statements about the structure of cognitive ability, each in an extreme direction. This is
evident in the next set of models which have taken hold in cognitive ability testing.
In between the unified intelligence theories and the parallel processes theories there sits a
hierarchical theory of intelligence. The stimulus for this model is the connectedness of certain cognitive
abilities as envisioned by Guilford (1959) and Thurstone (1938) in hierarchies of cognitive constructs.
That is, the basic abilities that make up each subtest can be bound together to provide a picture of total functioning within an overall domain. The fluid and crystallized intelligence (Gf-Gc) theory was posed by
Horn & Cattell (1966). In this framework two broad abilities were defined as Gf (fluid ability; ability to
reason with novel information) and Gc (crystallized ability; abilities built up over time, concrete
knowledge). Thus the lower level cognitive abilities fall into categories depending on the type of
overarching ability used to complete the task.
The evolution of Gf-Gc theory into Cattell-Horn-Carroll (CHC) theory by Carroll (1993) reiterated strata of abilities, with broad and narrow abilities defined. Broad abilities are considered to be those that encompass a subset of cognitive abilities to form a larger construct describing what processes function together. Narrow abilities are those that make up the broad abilities. Using the broad ability of fluid intelligence as an example, it is made up of the narrow abilities of reasoning, concept formation, and novel problem solving (Flanagan, 1998; McGrew, 1997). The CHC strata are ordered from lower level narrow
ability (stratum I), to broad abilities (stratum II), to overall intelligence (stratum III). This theory has a
large set of broad abilities currently defined and an even larger set of narrow abilities indicating the
broad abilities (McGrew, 2005).
Embedded in this work is the NS task as alluded to above. This task has the distinction of falling
into the Gf (fluid abilities) broad category and narrow categories of mathematics knowledge and
quantitative reasoning. The nature of the task is to provide a missing link in a sequence of numbers.
There is a pattern that the numbers follow and the blank value should be filled in to complete the
sequence logically. An example of this is: 1, 2, 3, _. One would have to be in possession of mathematics
ability to identify and manipulate numbers. Additionally, being able to manipulate numbers to fill in the
blank with the correct value is important. Overall, the process of completing each item uses the broad
fluid ability to synthesize a response that hasn’t been put forth by a respondent before. The situation is
inherently novel in that these answers are not ideas committed to memory and accessed at a later time.
Number Series is embedded in the fluid ability broad factor and more narrowly defined as
quantitative reasoning and numeric abilities (McGrew, 1997). The task itself was first described by Thurstone & Thurstone (1944), and alternative administration procedures were later presented by McArdle & Woodcock (2009). A sequence of numbers is shown in which there is a blank element for the respondent to fill in. The number that is chosen is supposed to complete the series in a way that produces a logical sequence of numbers. Each item has a different pattern that the respondent must solve.
In addition to fluid and crystallized intelligence there is a broad ability of processing speed. In
the CHC framework it was originally embedded in the broad abilities as speed ability (Gt; Carroll, 1993). Subsequently, Horn & Masunaga (2000) provided a similar ability termed Correct Decision Speed (CDS) to describe the amount of time taken to provide a response. When the narrow definition of CDS is looked at in the context of CHC theory, it can be embedded within the Gt broad ability, which allows for a looser definition of speededness. These works show a quality that is identifiable within a set of defined abilities
that are attributed to respondents. Next the intersection of fluid intelligence and processing speed is
explored.
As an aside, the chance is taken to delineate the difference between speediness and
speededness. Speediness as defined by Wiktionary is “the property of being fast; swiftness” (2012).
Speededness is defined by Wiktionary as “a test characteristic, dictated by the test’s time limits, that
results in a test taker’s score being dependent on the rate at which work is performed as well as the
correctness of the responses. The term is not used to describe tests of speed. Speededness is often an
undesirable characteristic” (2012). Speediness is a trait that we would like to ascribe to respondents
(Horn & Cattell, 1966). Variation in respondent speediness is observed through task speededness.
Speededness is generally used to describe parameters based on ability testing when response times are
used (Klein Entink, 2009). For the most part, discussion is based around speediness as shown through
speededness parameters because the response time information is proposed as a way of improving ability score accuracy. This information is looked at as helpful in providing insight into ability dynamics for
hierarchical frameworks (in contrast to the view that speededness properties are undesirable).
Whether speed of processing has any effect on other broad domains of cognitive ability is the
question of the moment. The focus of much research is on the relationship of processing speed and how it correlates with differences in specific cognitive abilities over age (Conway, Cowan, Bunting, Therriault, & Minkoff, 2002; Nettlebeck, 2011; Salthouse, 1996; Sliwinski & Buschke, 1999). A sequence of studies showed that when a general cognitive ability structure is assumed, there is a significant correlation between reaction time and mental ability (Deary, Allerhand, & Der, 2009; Deary, Der, & Ford, 2001). Deary et al.
(2009) went on to identify effects of age on mental ability and speed of processing and how the dynamic
relationship between the two should be viewed. Hertzog (2011) noted that there are a number of
reasons that cognitive abilities change over age, including “resources like working memory, processing
speed, and inhibitory aspects of attention” (p.182). The interactions of such changes to mental
processes can influence processing speed and cognitive abilities over time (Hertzog, 2008; Salthouse,
1996).
Basic Respondent Scoring
As previously mentioned, there is a desire to get the most out of a participant's time while they are being tested. Simply giving someone a 50 item test provides researchers with ample opportunity to home in on the ability respondents possess for a given construct. These 50 items provide a good definition of the scale of measurement for the construct, much better than, say, 10 or 15 items. The major detriment in having a scale of 50 items is the time it takes to administer such a set of items. The burden on both the test taker and the researcher is lessened if the test is of a shorter length. The major drawback in shortening a set of items is the loss of clarity in scoring such a set. By shortening the set of items that determines the likelihood of success, the ability to accurately assign ability levels to individuals (and to predict their probability of answering an item of a given difficulty correctly) is decreased.
The accuracy of measured ability does decline with fewer items, but does this mean that one cannot recover the loss by integrating other information gathered? Introducing response time as another piece of information, in addition to the response to a given item, increases our ability to get an accurate score of ability. Each item can thus be seen as having a difficulty that is compared to the person's ability level. The basic IRT model takes the form of
2.1. \( P(X_{ij} = 1) = \dfrac{\exp\left[a_i(\theta_j - b_i)\right]}{1 + \exp\left[a_i(\theta_j - b_i)\right]} \)
Here the probability of a correct response to item i for person j is a function of the person ability (\(\theta_j\)), the item difficulty (\(b_i\)), and the item discrimination parameter (\(a_i\)). The input is the list of responses for person j on items i (\(X_{ij}\)). The higher the ability of a respondent compared to the item difficulty, the higher the probability of success. The item discrimination parameter serves as a way of modifying the item's ability to discriminate between high and low ability respondents. This is the model as portrayed by Lord (1952), and when the discrimination parameters are fixed at 1 the Rasch (1960) model is assumed.
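As a minimal computational sketch (not the estimation code used for this dissertation), the two-parameter logistic probability in equation 2.1, with the Rasch special case obtained by fixing every discrimination at 1, could be written as:

    import numpy as np

    def p_correct(theta, b, a=1.0):
        """Probability of a correct response under the 2PL model of equation 2.1.

        theta : person ability
        b     : item difficulty
        a     : item discrimination (a = 1 for every item gives the Rasch model)
        """
        z = a * (theta - b)
        return np.exp(z) / (1.0 + np.exp(z))

    # A respondent of average ability facing an easy, a matched, and a hard item.
    for b in (-1.0, 0.0, 1.0):
        print(f"difficulty {b:+.1f}: P(correct) = {p_correct(theta=0.0, b=b):.2f}")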
It is possible to fix the item difficulties in equation 2.1 in order to test the theory that the order of their difficulties is known. This idea is known as introducing fixed anchors into the IRT model. By anchoring item difficulties, it is asserted that the spacing between any two items on the ability scale is predetermined. In subsequent chapters the fixed-anchor scores are referred to as W-scores and the scores from freely estimated difficulties as M-scores. The W-score name pays tribute to the WJ-III scoring, which is where the fixed item difficulty values come from. The M-scores are named as a counterpoint to the W-scores and offer a way to discern between the two alternatives in defining the source of the item difficulties.
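A hedged sketch of scoring against fixed anchors, assuming hypothetical anchored difficulties rather than the published WJ-III values, might estimate ability by a simple grid-search maximum likelihood under the Rasch model:

    import numpy as np

    def ml_ability(responses, difficulties, grid=np.linspace(-4, 4, 801)):
        """Maximum likelihood ability estimate for a Rasch model with anchored difficulties.

        responses    : 0/1 vector of item responses
        difficulties : anchored item difficulties (fixed, not estimated)
        """
        p = 1.0 / (1.0 + np.exp(-(grid[:, None] - difficulties[None, :])))  # P(correct) at each grid ability
        log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
        return grid[np.argmax(log_lik)]

    # Hypothetical anchored difficulties and one response pattern (easiest items correct).
    anchors = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    pattern = np.array([1, 1, 1, 0, 0])
    print(f"estimated ability: {ml_ability(pattern, anchors):.2f}")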
Scoring with Response Time
In the context of using response time to better score respondents' speediness, some alternatives are introduced and their usefulness is noted. The intent of each of these models is to reinterpret probabilities of correct response with response time taken into account. These models attempt to make use of the typical information available from cognitive testing and add it to the typical scoring models. These early attempts to include response time alongside responses met with varying success. First, response time models that have traditionally been used in psychological studies are examined. Klein Entink (2009) sets the problem up as an issue with the fundamental way experimental research and cognitive assessment have been done. In trying to implement experimental solutions to assessment problems there exists a deficiency. In his text Klein Entink (2009) says experimental research “focuses on elementary cognitive processes related to stimulus discrimination, attention, categorization, or memory retrieval” (p. 42). Models put forth by Ratcliff (1978) and Rouder, Sun, Speckman, Lu, & Zhou (2003) focus on simple choice tasks that model either response times or accuracy separately. It is noted
that these models use within subjects designs over many iterations. In contrast, cognitive assessment
items are not repeated and item parameters along with person parameters are equally important.
As a way of attempting to properly assess cognitive ability several theories and models have
been developed. Scheiblechner (1979) and Maris (1993) proposed similar models to score person
speededness varying in the distribution assumed for the speed parameter. These models neglect the
ability of the person and thus cannot tell us anything about the joint ability and speededness of a
respondent in testing. It is with the next set of models that the opportunity to estimate ability and
speededness together is presented. Roskam (1997) and Verhelst, Verstralen, & Jansen (1997)
introduced comparable models in which response time serves as a third parameter, adding an increase in the probability of answering an item correctly with increased time. A probabilistic model by Wise & DeMars (2006) and Wise & Kong (2005) uses response time to determine if enough effort has been put into each item and adjusts the probability of correct response based on performance as a whole. The issue with
these models is that either ability or speededness is estimated independently (or not at all in some
cases).
A joint model to estimate ability and speededness simultaneously was first presented in work by
van der Linden (2007). This model is hierarchical in nature to deal with the two distributions being
modeled. In this joint model the first level is made up of the response model and the response time
model. The response model takes the form of an IRT model as outlined above. The response time model
is written as
2.2. \( f(t_{ij}) = \dfrac{\alpha_i}{t_{ij}\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2}\left[ \alpha_i\left( \ln t_{ij} - (\beta_i - \tau_j) \right) \right]^2 \right\} \)
This model uses the response time of person j for item i (\(t_{ij}\)) and parameters for item speed intensity (\(\beta_i\)), item discrimination (\(\alpha_i\)), and person speededness (\(\tau_j\)). This is read as saying that the comparison between the person speededness and the item speed intensity determines the expected response time, analogous to how the comparison of ability and difficulty determines the probability of a correct response in the traditional IRT response model. The higher a respondent's speededness relative to the item speed intensities over the course of testing, the greater the ability that can be said to underlie a correct response. The within-person variation in timing is taken into account by including the log transform of response time as an element of equation 2.2. Then, when combined, the two models
form
2.3. \( f(\mathbf{x}_j, \mathbf{t}_j) = \prod_{i} P(X_{ij} = x_{ij})\, f(t_{ij}) \)
The joint distribution of ability scores and speededness scores contributes to the probability of a correct response for an item answered. The covariances between the parameters of the response and response time models are modeled on the second level of the hierarchical model. It is noted that the person parameters (\(\theta_j, \tau_j\)) and item parameters (\(a_i, b_i, \alpha_i, \beta_i\)) are estimated simultaneously to allow the covariance of parameters across models. The advantage over previous models is in the simultaneous estimation of both models, which allows one to compare ability with speed directly.
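As an illustrative sketch of this joint structure, assuming the 2PL response model of equation 2.1 and the lognormal response time model of equation 2.2 with made-up parameter values (not estimates from the current study), the first-level log-likelihood for one respondent could be computed as:

    import numpy as np

    def joint_log_lik(x, t, theta, tau, a, b, alpha, beta):
        """Log-likelihood of responses x and response times t for one person
        under the conditionally independent response/response time structure."""
        # Response part: 2PL probability of a correct answer (equation 2.1).
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        ll_resp = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        # Response time part: lognormal density with time intensity beta,
        # time discrimination alpha, and person speededness tau (equation 2.2).
        z = alpha * (np.log(t) - (beta - tau))
        ll_time = np.sum(np.log(alpha) - np.log(t) - 0.5 * np.log(2 * np.pi) - 0.5 * z**2)
        return ll_resp + ll_time

    # Illustrative values for a 3-item test.
    x = np.array([1, 1, 0])                 # responses
    t = np.array([8.0, 12.0, 30.0])         # response times in seconds
    a, b = np.ones(3), np.array([-1.0, 0.0, 1.0])
    alpha, beta = np.ones(3), np.array([2.0, 2.5, 3.0])
    print(joint_log_lik(x, t, theta=0.3, tau=0.1, a=a, b=b, alpha=alpha, beta=beta))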
While attempting to accurately measure ability it has been noted that response time can be an
important factor in scoring respondents. The process of computer testing is full of information that
comes along with typical testing, and when other information is light (i.e., missing responses) then
response times can be used to help fill in holes. Such is the case when data is selectively deleted as in
reduced and adaptive testing. Here response time can be used when missingness is built into data
collection methods.
Shortening the Process
Adaptive testing refers to the idea that not all items need be given to every test taker and that a
researcher can choose the optimal set of items to give and still get an accurate score. It is acknowledged that in the current study the test was given as a full unit, with all 30 items shown to all respondents. In this way full information is the original format, and data are selectively deleted to fit each item set. These methods are outlined below,
but the statement is made that these item sets were not administered each time but come from the full
item set and are pared down.
The first definition in deliberately deleting items is item reduction or test short forms. These
tests systematically delete items completely from the item pool for all participants. So when the 30 item
test is chopped down to 6 items (20% of the original length) the other 24 items are not in the item pool.
In this format one chooses items at random or at fixed points along the length of the scale of item
difficulty. These methods do not take into account information about the sample and may provide
scores that are not accurate due to poor fit between sample ability and item difficulties. Then there are
methods that take into account sample ability levels. In methods such as choosing items that correlate
the highest with Cronbach’s alpha, have the highest factor loadings, or produce the highest R² in predicting the total item set score, there is potential for higher levels of accuracy (Birnbaum, 1968).
These methods look at the most discriminating items, items that produce the most information and
have the best predictive power. Implementing these item sets is simple because the format does not
change person to person and there is no branching in the methodology. The problem as mentioned
before is that these subsets of items are sample specific because they account for the sample abilities.
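A brief sketch of two of these reduction rules (random selection and fixed points along the difficulty scale), using hypothetical item difficulties rather than the NS item parameters, is:

    import numpy as np

    rng = np.random.default_rng(0)
    difficulties = np.sort(rng.normal(size=30))   # hypothetical difficulties for a 30-item pool
    n_keep = 6                                    # 20% of the original length

    # Rule 1: choose items at random.
    random_set = np.sort(rng.choice(30, size=n_keep, replace=False))

    # Rule 2: choose items at fixed points along the difficulty continuum.
    targets = np.linspace(difficulties.min(), difficulties.max(), n_keep)
    fixed_set = np.array([np.argmin(np.abs(difficulties - t)) for t in targets])

    print("random item set:        ", random_set)
    print("fixed-interval item set:", np.unique(fixed_set))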
Adaptive testing, as it is usually understood, is defined as selecting a set of items in succession based on respondent behavior on previously administered items (Lord, 1970). Green (1983)
positions the advantages of adaptive testing as more accurate and immediate scoring with larger item
pools. The most important point is that all items remain in the pool of potential items and respondent
behavior dictates which item will be presented next out of the complete pool. This is because scores can
be calculated as more items are given and the ability score will become more accurate with each new
item. In the reduced item sets there is no need for special programming of item presentation because everyone receives the same items. In adaptive testing each respondent can receive completely different items because individual behavior dictates the sequence. Which item to start with, how to sequence items, and when to stop are all issues that must be outlined for successful adaptive tests
(Sands, Waters, & McBride, 1997; Wainer, 2000). Typically, testing starts at a mid-level ability and items close to the calculated ability are given, though attempts are made to tailor starting points for testing
(van der Linden & Pashley, 2010). In stopping, more items mean a more accurate score so the standard
error of measurement plays a role in when to stop testing. But also time constraints can lead
researchers to limit the number of items given to each respondent and use this as a stopping rule. While
these extra restrictions make adaptive testing more burdensome than reduced item sets, ability scores
tend to be more accurate because no sample characteristics are needed to determine the best items
before administration.
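A minimal sketch of such an adaptive loop, here a Rasch-based maximum-information item selection with a fixed test length as the stopping rule and a simulated respondent (not the adaptive schemes evaluated in later chapters), is:

    import numpy as np

    rng = np.random.default_rng(1)
    bank = np.linspace(-2.5, 2.5, 30)          # hypothetical difficulties for a 30-item bank
    true_theta = 0.8                           # simulated respondent ability
    grid = np.linspace(-4, 4, 801)

    def ml_theta(resps, diffs):
        """Grid-search maximum likelihood ability estimate under the Rasch model."""
        p = 1.0 / (1.0 + np.exp(-(grid[:, None] - np.array(diffs)[None, :])))
        ll = (np.array(resps) * np.log(p) + (1 - np.array(resps)) * np.log(1 - p)).sum(axis=1)
        return grid[np.argmax(ll)]

    theta_hat, used, resps = 0.0, [], []        # start at mid-level ability
    for _ in range(6):                          # stopping rule: fixed test length of 6 items
        # Select the unused item whose difficulty is closest to the current estimate
        # (the maximum-information item under the Rasch model).
        remaining = [i for i in range(len(bank)) if i not in used]
        item = min(remaining, key=lambda i: abs(bank[i] - theta_hat))
        used.append(item)
        # Simulate the response from the "true" ability.
        p_true = 1.0 / (1.0 + np.exp(-(true_theta - bank[item])))
        resps.append(int(rng.random() < p_true))
        theta_hat = ml_theta(resps, bank[used])
    print("items administered:", used, "final ability estimate:", round(theta_hat, 2))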
Regaining Predictability with Collateral Information
In situations where time is of the most importance, other opportunities to regain accuracy in respondent ability scores must be taken. As foreseen by Horn (1979), computers provide advantages that
were not available a few years ago for research purposes. This information is directly accessible in
survey and typical psychological research. In the ALP the respondents’ continued participation results in
appending new data to the existing data from previous survey participation. This catalogue of data, a
veritable mountain, holds potential information about respondents that would have been foregone with
the methods outlined above for taking items away. In this section two potential sources are
enumerated, response times and demographic information.
Response times seem like natural sources of information about ability in cognitive domains. The
above example of two test takers indicates that response time would provide valuable information
beyond simple response patterns. To regain accuracy, a discussion of how response time benefits this goal is warranted. In a design where items are selectively deleted, a full scale score can be
calculated and used as an outcome to predict with subset item responses. The prediction model is
written as
2.4. \( Y_j = \beta_0 + \sum_{i \in S} \beta_i X_{ij} + e_j \)
This is to say the total score for person j (\(Y_j\)) is a function of an intercept (\(\beta_0\)) and responses to the item subset (\(X_{ij}\), for items i in the subset S). The regression weights for the responses indicate gains made by answering each item
correctly. In this framework it is one more step to add in effects for response times associated with
responses. These terms are added to the previous equation as
2.5. \( Y_j = \beta_0 + \sum_{i \in S} \beta_i X_{ij} + \sum_{i \in S} \gamma_i T_{ij} + e_j \)
With this equation the response times serve as additional predictors for the item subsets (\(T_{ij}\)). Response times can be combined into an overall performance measure or entered as individual item timings. The deviation from
average response time for an item can indicate whether faster or slower response times benefit total score prediction. In a similar line of thought, interactions of the responses and response times are added in the equation
2.6. \( Y_j = \beta_0 + \sum_{i \in S} \beta_i X_{ij} + \sum_{i \in S} \gamma_i T_{ij} + \sum_{i \in S} \delta_i (X_{ij} \times T_{ij}) + e_j \)
Here the interactions (\(X_{ij} \times T_{ij}\), with weights \(\delta_i\)) serve as a way of discriminating whether fast correct responses should be
rewarded. Regression weights indicate if quicker responses increase scores beyond just providing a
correct response. This process provides basic instruction to improve accuracy with additional predictors
available at the time of testing. Further analyses can improve accuracy even more with linkable
databases.
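As a hedged illustration of equations 2.4 through 2.6 on simulated data (not the ALP responses), the three nested regressions can be fit by ordinary least squares and compared on explained variance:

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 500, 6                                        # simulated respondents and subset items
    X = rng.integers(0, 2, size=(n, k)).astype(float)    # item responses (0/1)
    T = np.log(rng.lognormal(mean=2.5, sigma=0.5, size=(n, k)))   # log-transformed response times
    total = X.sum(axis=1) * 2 + rng.normal(scale=1.0, size=n)     # stand-in full scale score

    def r_squared(design, y):
        """R^2 from an ordinary least squares fit with an intercept column."""
        design = np.column_stack([np.ones(len(y)), design])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return 1 - resid.var() / y.var()

    print("responses only (2.4):          ", round(r_squared(X, total), 3))
    print("responses + times (2.5):       ", round(r_squared(np.hstack([X, T]), total), 3))
    print("responses + times + X*T (2.6): ", round(r_squared(np.hstack([X, T, X * T]), total), 3))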
Demographics offer another opportunity to increase score accuracy when items are missing by
design. The relationship of respondent characteristics to cognitive ability is documented by studies
looking at each characteristic. For the purpose of providing compelling evidence for future research, basic characteristics are outlined in the context of the NS task within the CHC intelligence framework. When one thinks of demographics, it is usually of age and gender as providing good information about a
potential respondent. There is a good amount of research relating cognitive differences to age and to
some extent to differences in processing speed (Hertzog, 2008, 2011; Salthouse, 2006). Horn & Cattell
(1967) proposed a theory of cognitive aging in which fluid abilities have an inevitable decline with
advancing age and crystallized abilities maintain through the years. More current research includes
information on numerous domains and changes with age (McArdle, Ferrer-Caja, Hamagami, &
Woodcock, 2002). In terms of gender there are mixed results on the effects of gender on fluid math type
abilities. Some results give males an advantage while others indicate that differences are due to biased
items in testing (Halpern, Beninger, & Straight, 2011; Hedges & Nowell, 1995). Typically, males are given the advantage in mathematics-based abilities and females in vocabulary abilities. Age and
gender differences are used to attempt to account for variation in scores where a deficient (or smaller)
item set may not.
Further demographic variables typically collected with experimentation include education,
income, and for our purposes a time filler between item sets. Education has been linked to cognitive
ability in a positive relationship in a number of studies (Falch & Sandgren, 2011; Hansen, Heckman, &
Mullen, 2004; Winship & Korenman, 1999). The typical idea is that education provides improvement in
cognitive abilities - at least helps in being able to do the tasks that such tests require. Income has a link
with education as shown through other work (Griliches & Mason, 1972; Juster, 1975; Lynn & Vanhanen,
2002; Weede, 2006). This provides stimulus for introducing income as possible predictor of cognitive
ability. To the last predictor, Need for Cognition fills this role (Cacioppo, Petty, & Kao, 1984). This scale
was meant to have no relationship with the NS scale. It measures the degree to which a respondent
enjoys cognitive tasks and actively engages with them. It would be of moderate note if the measure correlates with cognitive ability because it does not measure cognitive ability per se, but whether respondents enjoy such tasks.
Conclusion
A set of theories covering the motivation for the current set of analyses was presented. Importantly, a line of reasoning was created to study a narrow cognitive ability and relate it to speed of response as a proxy for processing speed. The treatment of responses and response times together in a model of ability is the natural move for such a theory of cognitive dynamics. When the aim of research is to efficiently and accurately measure ability, response time is the
easiest way of completing this goal. Collateral information allows researchers to keep tests short,
making use of the byproducts of psychological testing.
The introduction of other respondent characteristics makes the point that there is variation in
person ability based on demographic information. It is not the case that all ages, all educations perform
with equal ability on cognitive tasks. It is conceded that these elements may also aid in the prediction of
respondent ability over and above the responses and response times. Together the elements of
adaptive testing, response times, and demographic indicators create a package of tools with which new,
compact ability tests can be shaped and compared to traditional cognitive ability tests.
References to Chapter 2
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In
F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores, Reading, MA: Addison-Wesley.
Burt, C. (1949). The structure of the mind: A review of the results of factor analysis. British
Journal of Psychology, 19, 176-199.
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Carroll, J.B. (1993). Human Cognitive Abilities: A survey of factor analytic studies. New York:
Cambridge University Press.
Cattell, J.M. & Galton, F. (1890). Mental tests and measurements. Mind, 15, 373-381.
Cattell, R. B. (1943). The measurement of adult intelligence. Psychological Bulletin, 40, 153-193.
Conway, A.R.A., Cowan, N., Bunting, M.F., Therriault, D.J., & Minkoff, S.R.B. (2002). A latent
variable analysis of working memory capacity, short-term memory capacity, processing speed, and
general fluid intelligence. Intelligence, 30, 163-183.
Deary, I.J., Allerhand, M., Der, G. (2009). Smarter in middle age, faster in old age: a cross-lagged
panel analysis of reaction time and cognitive ability over 13 years in the West of Scotland Twenty-07
study. Psychology and Aging, 24, 40-47.
Deary, I.J., Der, G., & Ford, G. (2001). Reaction times and intelligence differences: a population
based cohort study. Intelligence, 29, 389-399.
Falch, T. & Sandgren, S. (2011). The effect of education on cognitive ability. Economic Inquiry,
49, 838-856.
Flanagan, D. P. (2000). Wechsler-based CHC cross-battery assessment and reading achievement:
Strengthening the validity of interpretations drawn from Wechsler test scores. School Psychology
Quarterly, 15(3), 295−329.
Galton, F. (1883). Inquiries into human faculty and its development. Online Galton Archives:
Everyman.
Garnett, J. C. M. (1919). General ability, cleverness and purpose. British Journal of Psychology, 9,
345−366.
Green, B.F. (1983). Adaptive testing by computer. In R.B. Ekstrom (Ed.), Principles of Modern
Psychological Measurement (pp.5-12). San Francisco, CA: Jossey-Bass.
Griliches, Z. & Mason, W.M. (1972). Education, income, and ability. The Journal of Political
Economy, 80, S74-S103.
Guilford, J.P. (1959). Personality. New York: McGraw-Hill.
Halpern, D.F., Beninger, A.S., & Straight, C.A. (2011). Sex differences in intelligence. In R.J.
Sternberg & S.B. Kaufman (Eds.), The Cambridge Handbook of Intelligence (pp. 253-267). New York, NY:
Cambridge University Press.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory.
Newbury Park, CA: Sage Publications.
Hansen, K.T., Heckman, J.J., Mullen, K.J. (2004). The effect of schooling and ability on
achievement test scores. Journal of Econometrics, 121, 39-98.
Hedges, L.V. & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers
of high scoring individuals. Science, 269, 41-45.
Hertzog, C. (1989). Influences of cognitive slowing on age differences in intelligence.
Developmental Psychology, 25, 636 - 651.
Hertzog, C. (2008). Theoretical approaches to the study of cognitive aging: an individual-
differences perspective. In S.M. Hofer & D.F. Alwin (Eds.), Handbook of Cognitive Aging: Interdisciplinary
Perspectives. Thousand Oaks, Ca: Sage.
Hertzog, C. (2011). Intelligence in adulthood. In R.J. Sternberg & S.B. Kaufman (Eds.), The
Cambridge Handbook of Intelligence (pp. 174-190). New York, NY: Cambridge University Press.
Horn, J.L. (1979). Trends in the measurement of intelligence. Intelligence, 3, 229-240.
Horn, J. L., & Cattell, R.B. (1966). Age differences in primary mental ability factors. Journal of
Gerontology, 20, 210-220.
Horn, J. L., & Cattell, R.B. (1966). Refinement and test of the theory of fluid and crystallized
general intelligences. Journal of Educational Psychology, 57, 253-270.
Horn, J. L., & Cattell, R.B. (1967). Age differences in fluid and crystallized intelligence. Acta
Psychologica, 26, 107-129.
Horn, J. L., & Masunaga, H. (2000). New directions for research into aging and intelligence: The
development of expertise. In T.J. Perfect, & E.A. Maylor, Models of cognitive aging (pp. 125-159).
Oxford, England: Oxford University Press.
Juster, F.T. (1975). Education, income, and human behavior. Hightstown, NJ: McGraw-Hill.
Kapteyn, A. (2002). Internet Interviewing and the HRS (grant number 1R01AG020717-01). Rand
Corporation: Santa Monica, CA.
Klein Entink, R.H. (2009). Statistical models for responses and response times. Thesis.
Netherlands: University of Twente.
Lord, F. (1952). A theory of test scores. Psychometric Monograph, 7.
Lynn, R. and Vanhanen, T. (2002). IQ and the Wealth of Nations. Westport, CT: Praeger
Publishers.
Maris, E. (1993). Additive and multiplicative models for gamma distributed random variables,
and their application as psychometric models for response times. Psychometrika, 58, 445–469.
McArdle, J. J., Ferrer-Caja, E., Hamagami, F. & Woodcock, R. W. (2002). Comparative longitudinal
structural analyses of the growth and decline of multiple intellectual abilities over the life
span. Developmental Psychology, 38(1), 115-142.
McGrew, K. S. (1997). Analysis of the major intelligence batteries according to a proposed
comprehensive Gf-Gc framework. In D. P. Flanagan, J. L. Genshaft, & P. L. Harrison (Eds.), Contemporary
intellectual assessment: Theories, tests, and issues (pp. 151-179). New York: Guilford.
McGrew, K. S. (2005). The Cattell–Horn–Carroll theory of cognitive abilities. In D. P. Flanagan, &
P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (2nd ed.,
pp. 136−181). New York: Guilford Press.
McNemar, Q. (1964). Lost: Our intelligence. Why? American Psychologist, 19, 871-882.
Nettlebeck, T. (2011). Basic processes of intelligence. In R.J. Sternberg & S.B. Kaufman (Eds.),
The Cambridge Handbook of Intelligence (pp. 371-393). New York, NY: Cambridge University Press.
Rasch, G. (1960) Probabilistic Models for Some Intelligence and Attainment Tests. Denmark
Paedagogiske Institute, Copenhagen.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Roskam, E.E. (1997). Models for speed and time-limit tests. In W.J. van der Linden & R.K.
Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 187–208). New York: Springer.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian
statistical framework for response time distributions. Psychometrika, 68, 589–606.
Salthouse, T. A. (1996). A processing-speed theory of adult age differences in cognition.
Psychological Review, 103, 403-428.
Salthouse, T.A. (2006). Mental exercise and mental aging: Evaluating the validity of the use it or
lose it hypothesis. Perspectives on Psychological Science, 1, 68-87.
Sands, W. A., Waters, B.K., & McBride, J. R. (Eds.) (1997). Computerized adaptive testing: From
inquiry to operation. Washington, DC: American Psychological Association.
Scheiblechner, H. (1979). Specific objective stochastic latency mechanisms. Journal of
Mathematical Psychology, 19, 18–38.
Sliwinski, M. & Buschke, H. (1999). Cross-sectional longitudinal relationships among age,
cognition, and processing speed. Psychology and Aging, 14, 18-33.
Spearman, C. (1904). General intelligence objectively determined and measured. American
Journal of Psychology, 15, 201-293.
Speededness. (2011) In Wiktionary online. Retrieved March 1 2012, from
http://en.wiktionary.org
Speediness. (2011) In Wiktionary online. Retrieved March 1 2012, from http://en.wiktionary.org
Thurstone, L.L. (1938). Primary mental abilities. Psychometric Monographs, 1, 121.
Van der Linden, W.J. (2007). A hierarchical framework for modeling speed and accuracy on test
items. Psychometrika, 72, 287-308.
Van der Linden, W.J. & Pashley, P.J. (2010). Item selection and ability estimation in adaptive
testing. In W.J. van der Linden and C.A.W. Glas (Eds.), Computerized Adaptive Testing: Theory and
Practice (pp. 1-25). Netherlands: Kluwer Academic Publishers.
Verhelst, N.D., Verstralen, H.H.F.M., & Jansen, M.G. (1997). A logistic model for time-limit tests.
In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 169–
185). New York: Springer-Verlag.
Vernon, P.E. (1950). The structure of human abilities. London: Methuen.
Wainer, H. (Ed.) (2000). Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence
Erlbaum.
Weede, E. (2006). Economic freedom and development: New calculations and interpretations.
Cato Journal, 26, 511-524.
Willis, R.J. (2011). Health and Retirement Study. (grant number NIA U01AG009740). University
of Michigan: Ann Arbor, MI.
Winship, C. & Korenman, S.D. (1997). Does staying in school make you smarter? The effect of
education on IQ in the bell curve. In B. Devlin, S.E. Fienberg, D.P. Resnick, & K. Roeder (Eds.),
Intelligence, Genes, and Success. Scientists Respond to the Bell Curve. New York, NY: Springer.
Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort moderated
IRT model. Journal of Educational Measurement, 43, 19-38.
Wise, S. L., & Kong, X. J. (2005). Response time effort: A new measure of examinee motivation in
computer-based tests. Applied Measurement in Education, 16, 163-183.
Chapter 3: Measurement Properties Using the Number Series Test
Abstract
This chapter looks at traditional scoring methods and item analysis of the American Life Panel
(ALP) Woodcock-Johnson III Number Series (WJ-III NS) 30 item task. The utility and implementation of
Classical Test Theory (CTT) and Item Response Theory (IRT) are presented with item analyses in each
case given. The IRT model is presented as more appropriate for scales such as NS, in which items of differing difficulties measure ability at different points along the scoring scale. Item
analysis examines how well each item fits into the Number Series construct, with calibration and
organization based on item properties. A discussion of the use of scale scores for testing for sample
biases in scoring follows with recommendations on item use in future administrations and potential
discrepancies to be aware of. This chapter provides a basis for the following analyses in which item
difficulties play an important role in item selection.
Introduction
An assumption of much testing is that there is one right way to look at data collected. This one
way will give a definitive solution to the problem posited, with little room for use of alternative methods
of analysis. To propose that there is exactly one way to examine a set of variables would oversimplify
the idea that data can be represented in a number of ways and a researcher should explore these
different avenues. The devil in the detail is that the question a researcher wishes to answer is usually
vague and leaves models for analysis unintentionally wide open. The scoring of a scale is generally the
end goal for many studies, but how the scale is scored depends on what the scale measures, the kinds of
items to be asked, and the range of item difficulties (to name a few). In ability testing, specifically
cognitive abilities, one can think of a couple of standard scoring methods typically used by researchers.
It is proposed that Classical Test Theory (CTT) and Item Response Theory (IRT) are different ways to
describe a test of ability, each with a different idea of how to incorporate the information that
responses provide for scoring persons.
Before interpreting how well a test informs a researcher about a given sample, it is worth discussing test construction. The manual created by the American Psychological Association (APA) together with the American Educational Research Association (AERA) and the Council on Measurement in Education, known in the field of behavioral measurement as "The Standards," outlines good steps to follow for creating solid scales of measurement (AERA, APA, & CME, 1999). The
important thrust of the text is purposeful scale creation and correct implementation of testing where
the scale is used only to make statements about ideas for which it was created. For example, tests that
measure math ability should only be used in that context and not as a proxy for other areas of
educational attainment. These standards are grounded in the IRT view of scale construction.
Such topics of scale construction are mentioned to indicate the importance of structured
planning. This prior planning leads to better constructed and more useful measures. The tests of how
well a measure holds up can be broken down into two basic schools of thought. CTT provides
information about how the group performs on a set of items and attempts to indicate how reliable
these performance ratings are. IRT purports to provide sample free information about the item
difficulties and thus comparable test scores from sample to sample.
Classical Test Theory
The method for interpreting scale scores has long been used to describe person performance on
a given test. The main function of CTT is
(3.1) .
29
This is to say that an observed score, X, is decomposed into a true score, T, and random error, E. The
random error is assumed to be normally distributed with a mean of 0. The random error (E) is further
assumed to be uncorrelated with the true scores (T). To indicate how small the error of measurement is
for a test the Standard Error of Measurement (SEM) is calculated. The SEM is the standard deviation of
the random error which is how close the deviations of error are around the true score (T).
In sampling a group to analyze a scale there is variation around the observed score (X). The
variation in X is broken down as
(3.2) σ²(X) = σ²(T) + σ²(E).
From this equation the reliability of the scale is the proportion of variance in the true score (σ²(T)) compared to the total observed variance (σ²(X)). This ratio is known as the reliability and given as
(3.3) ρ = σ²(T) / σ²(X).
Reliability separates out the random error variance from the variance in scores which is attributed to the
ability of persons. The objective of minimizing SEM is important to both CTT and IRT as will be outlined
below. The accuracy of these models depends on how well the items hold together as a construct
measuring a given ability. The set of equations here attempt to provide estimates of how well one can
describe the pattern of behavior without random error. The ability of CTT to deal with systematic error
is not well established (Kline, 1986).
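As a concrete illustration of these classical quantities, the short sketch below (written in Python rather than the programs used for the dissertation analyses) computes the total score summary, Cronbach's alpha as a lower-bound estimate of reliability, and the SEM implied by that estimate; the response matrix and its dimensions are hypothetical.

```python
import numpy as np

def ctt_summary(responses):
    """Classical summary of a persons-by-items matrix of 0/1 item scores.

    Returns the total-score mean and variance, Cronbach's alpha
    (a lower bound on reliability), and the standard error of
    measurement SEM = SD_X * sqrt(1 - alpha).
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    total = responses.sum(axis=1)                  # observed score X per person
    item_var = responses.var(axis=0, ddof=1).sum() # sum of item variances
    total_var = total.var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_var / total_var)
    sem = np.sqrt(total_var) * np.sqrt(1 - alpha)
    return {"mean": total.mean(), "variance": total_var,
            "alpha": alpha, "SEM": sem}

# Hypothetical usage: a 2548-respondent by 30-item matrix of 0/1 responses.
# rng = np.random.default_rng(0)
# print(ctt_summary(rng.integers(0, 2, size=(2548, 30))))
```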
Item Response Theory
Item Response Theory provides a likelihood estimate of a latent trait for an individual given a set of item parameters on which their performance is rated. Further reviews of the literature are plentiful and highly informative (Hambleton, Swaminathan, & Rogers, 1991; van der Linden & Hambleton, 1997).
Normally calibration studies take up the job of estimating item difficulties and providing person abilities.
Once the model converges to a solution, and the item parameters are stable, the items are used to
estimate person ability levels. The basic assumption of this analysis is that once the items have been
ranked by difficulty and parameters estimated, different people can get different item sets and still be
compared on the same scale.
The estimation of ability comes from the probabilistic model provided by Lord (1952), and
reduced to the Rasch (1960) model as
(3.4) P_ij = exp(θ_j − b_i) / [1 + exp(θ_j − b_i)].
The probability of a correct response for person j on item i (P_ij) is a function of the ability of person j (θ_j) and the difficulty of item i (b_i). Rewritten as the log-odds of a correct response, the function looks like
(3.5) ln(P_ij / (1 − P_ij)) = θ_j − b_i.
The log-odds form shows a linear relationship between the person ability and item difficulty. When the ability of the person is greater than the difficulty of the item, the probability of success is greater than 0.50. The point where the person parameter and item parameter are equal is the point at which the person's ability level is said to lie, the point where they have a 50/50 chance of getting the item right. This model is also known as a one parameter logistic (1PL) model because it only estimates item difficulty without estimating discrimination and guessing parameters (a_i and c_i).
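A direct translation of equation 3.4 into code makes the 50/50 property easy to verify. The following minimal Python sketch is illustrative only; the ability and difficulty values are arbitrary examples, not estimates from the Number Series data.

```python
import math

def rasch_probability(theta, b):
    """P(correct) under the 1PL/Rasch model: exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty the probability is 0.50; an easier
# item (lower b) yields a higher probability for the same person.
print(rasch_probability(0.0, 0.0))   # 0.5
print(rasch_probability(1.0, -1.0))  # about 0.88
```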
In order to consider how well one can capture the ability of the person examined, the information functions are important. At the item level there is an information function for each item, given as
(3.6) I_i(θ) = P_i(θ) Q_i(θ).
The item information function (I_i) is a function of the range of abilities (θ) compared to the item difficulty (b_i). P is the probability of a correct response and Q is the probability of an incorrect response. The maximum value for the function is 0.25, when the two probabilities are equal. So each item gives the maximum amount of information when ability levels match its estimated difficulty. To get an overall sense of the test, the test information function is written as
(3.7) I(θ) = ∑_i I_i(θ).
The test information function is a sum of all of the item information functions over the range of the
scale. This shows that the test as a whole is more informative than any one item by itself and if items are
closely bunched together, then that range of the scale will provide more information than other ranges.
Related to this idea of information is the standard error of measurement (SEM). SEM is a function of the variance of a given ability level (θ̂). The variance of a fitted value of ability is given as
(3.8) σ²(θ̂) = 1 / I(θ̂).
This says that the variance of a given ability level is the reciprocal of the test information function at that ability level. Generalizing over the test information function, the SEM for ability (θ) is
(3.9) SEM(θ) = 1 / √I(θ).
Our measurement of ability is more accurate where the test provides more information, and less accurate at the tails where less information is provided.
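Equations 3.6 through 3.9 can be evaluated for any set of item difficulties. The sketch below, with made-up difficulty values, shows how test information and the corresponding SEM vary over a grid of abilities.

```python
import numpy as np

def test_information(theta, difficulties):
    """Sum of Rasch item information functions I_i(theta) = P_i * Q_i (eqs. 3.6-3.7)."""
    theta = np.atleast_1d(theta)[:, None]                 # abilities as a column
    p = 1.0 / (1.0 + np.exp(-(theta - np.asarray(difficulties))))
    return (p * (1.0 - p)).sum(axis=1)                    # one value per ability level

def sem(theta, difficulties):
    """Standard error of measurement: 1 / sqrt(I(theta)) (eqs. 3.8-3.9)."""
    return 1.0 / np.sqrt(test_information(theta, difficulties))

# Hypothetical difficulties clustered near 0 give the most precision near 0.
abilities = np.linspace(-3, 3, 7)
b = [-2.0, -1.0, 0.0, 0.0, 1.0, 2.0]
print(np.round(sem(abilities, b), 2))
```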
In estimating a person’s ability IRT uses the concept of likelihood. In a Rasch model the
likelihood function only depends on the number of correct responses and not the particular items
answered correctly. The likelihood function is given as
(3.10) L(u | θ, b) = ∏_i P_i(θ)^(u_i) · Q_i(θ)^(1 − u_i).
The likelihood is the joint probability of an observed response pattern given ability and the item difficulties, where u_i is the scored response to item i, P_i(θ) is the probability of a correct response, and Q_i(θ) = 1 − P_i(θ) is the probability of an incorrect response. The response to each item is assumed to be independent of responses on other items given ability, which allows us to multiply the probabilities. Then, to assign an ability level
for a given set of items the algorithm finds the point along the ability spectrum at which the likelihood is
maximized. As mentioned above, different response patterns with the same number of correct
responses do not affect the estimated ability level, but do affect the shape of the likelihood function.
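The maximization described here can be illustrated with a simple grid search over θ; the sketch below assumes the item difficulties are already known, and the response pattern and difficulty values are hypothetical.

```python
import numpy as np

def log_likelihood(theta, responses, difficulties):
    """Rasch log-likelihood of a 0/1 response vector at ability theta (eq. 3.10)."""
    p = 1.0 / (1.0 + np.exp(-(theta - np.asarray(difficulties))))
    u = np.asarray(responses)
    return np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))

def ml_ability(responses, difficulties, grid=np.linspace(-4, 4, 801)):
    """Return the grid point where the likelihood of the response pattern peaks."""
    ll = [log_likelihood(t, responses, difficulties) for t in grid]
    return grid[int(np.argmax(ll))]

# Hypothetical: four correct and two incorrect responses on six items.
print(ml_ability([1, 1, 1, 1, 0, 0], [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]))
```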
The one parameter model can be extended to a two parameter model (2PL) to better fit
response patterns. The difference lies in the ability of each item to discriminate between correct and
incorrect responses for a respondent. When the discrimination parameter is included, the equation
takes the form:
(3.11) P_ij = exp(a_i(θ_j − b_i)) / [1 + exp(a_i(θ_j − b_i))].
This change identifies the discrimination for item i as a_i. Each item has a different slope, which can indicate whether an item is a good discriminator or not. Values of one are assumed for discrimination in the Rasch model.
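Extending the Rasch sketch above to the 2PL form in equation 3.11 only adds the discrimination parameter; setting a_i = 1 recovers the Rasch model. The values shown are arbitrary.

```python
import math

def two_pl_probability(theta, b, a=1.0):
    """P(correct) under the 2PL model: exp(a*(theta - b)) / (1 + exp(a*(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A more discriminating item (larger a) separates abilities near b more sharply.
print(two_pl_probability(0.5, 0.0, a=1.0))  # about 0.62
print(two_pl_probability(0.5, 0.0, a=2.5))  # about 0.78
```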
Ability Testing with the Woodcock-Johnson III
The Number Series test assumes a general structure of ability that is outlined in Gf-Gc theory
proposed by Horn & Cattell (1966). The basic framework is that fluid ability (Gf) and crystallized ability
(Gc) are distinguishable and measurable abilities through a varied battery of psychological tests. Many of these tests have been developed and packaged as the WJ-III by Woodcock, McGrew, & Mather
(2001). These tests measure the factors of Gf-Gc theory by combining tests that overlap in certain
abilities. The WJ-III tests are specifically designed to tap into varying latent cognitive abilities by having a
participant answer items that would require the ability tested. Each test is designed to measure a single
construct and provide an assessment of ability compared to every other person that will take the same
test.
In particular, Number Series is a test that focuses on quantitative reasoning within the Gf ability (Schrank, 2006). The basic task requires a participant to recognize and complete a numerical sequence
when prompted. The test involves the ability to keep a mental number line and manipulate this
representation to find the rule by which the numbers are ordered. Once the rule has been identified
then the participant provides their response in the form of the missing number. The task provides
opportunity to test the scale properties of the Number Series test and compare CTT and IRT methods for
interpretability. The differences in methods highlight different thoughts on how to deal with error in
measurement and make different statements about what makes a cohesive scale. The Number Series
test is one facet of the Gf ability which the current test is used to identify.
In testing fluid ability, the effects of aging and other demographic variables on the measured Number Series ability can be examined. General theory posits that fluid ability changes throughout
the lifespan and has an inverse U-shape that rises through young adulthood, trailing off through later life
(Horn & Cattell, 1967; Schaie & Willis, 1993). For an adult population the only observed effect of age
differences would be the negative trend with age. The growth in fluid ability had already occurred and
one would only expect to see significant declines across ages. Education and gender have also been
found to provide good predictive information for fluid ability. Kaufman, Kaufman, Liu & Johnson (2009)
found that education had a positive effect for fluid ability and males tended to outperform females in
fluid tasks. Similarly, respondents with higher incomes have been shown to outperform those with lower incomes on math ability tests (Ang, Rodgers, & Wänström, 2010). These demographic variables are available in the
ALP dataset and are analyzed against the estimated ability level from the Number Series test.
Hypotheses
There are several hypotheses about how CTT and IRT help explain the use of the Number Series
scale. (1) Classical Test Theory provides a partial look at how well the items fit the one facet scale. (2)
Item Response Theory models provide more accurate results about item fit with the one facet model. (3)
Because the focus of CTT and IRT are different, our focus on person ability will be better suited to IRT
methods. (4) Any scoring method that uses anything less than the full item set will not provide as
accurate a score as the full item set IRT model. (5) Item analyses based on gender will identify items that
tend to favor males. (6) The ability scores provide good estimates of fluid ability that demographic
variables can significantly predict. The first two hypotheses relate to the idea that item fit is determined
in different ways in CTT and IRT. The desire is to show how the 30 item score compares to the subscales
and CTT methods of scoring. Further analyses of demographic influences are examined on the scale
scores.
Methods
The American Life Panel Sample
The data comes from the American Life Panel (ALP; Kapteyn, 2002), a nationally representative
sample of adults 18 years old and above. Simple demographic statistics are provided in Table 3.1. The
sample had a mean age of 48.9 with a range of 18 to 109. The sample was 59.3% female and the
ethnicities are broken down as: White 84.2%; Hispanic 4.6%; Black 6.5%; Asian 1.3%; Other 3.4%. The
average years of education attained was 11.5 years. The average income was $74,320 and for analyses
the log of this value was used to bring extreme incomes closer to the mean. Income was initially coded
as brackets and transformed into a continuous variable based on these brackets. Tests of the sample
compared to the national averages indicate that the ALP sample closely resembles the average
American on the demographic aspects outlined above (Couper, Kapteyn, Schonlau, & Winter, 2007).
Table 3.1. Descriptive Statistics of the American Life Panel,
complete sample size N = 2548
Mean Median Std Dev Min Max
Age 48.9 50 14.8 18 109
Education 11.5 11 2.08 3 16
Income 74230 55000 60549 2500 200000
Female 59.3%
NCPos 59.8 62.5 17.9 0 100
NCNeg 30.6 31.3 17.6 0 100
Ethnicity
-White 84.2%
-Hispanic 4.6%
-Black 6.5%
-Asian 1.3%
-Other 3.4%
Collection of the data begins by sending an email invitation out to about 3,000 active
participants in 49 states and the District of Columbia. All of the participants have reported that they are
comfortable speaking and reading English. Since the survey is administered over the internet, if a
participant did not have a computer and internet access they were given both a computer and internet
access. If they already had both then the cash value of the computer was given. The rate of dropout
from the sample is low, with only about 3 people formally leaving the sample each month. Nonresponders are left in the sample and are only deleted when the dataset is cleaned. Since this cleaning is infrequent, the response rate for a survey should take this group of people into account.
Number Series Task
The items used for the current analysis come from the Woodcock-Johnson III (WJ-III) set of tests
of ability (WJ-III; Woodcock, McGrew, & Mather, 2001). The Number Series task is one of the tasks that
are used to test math ability and problem solving skills. The task involves the participant to provide a
missing number from a sequence of numbers so that the sequence makes a logical pattern. For example:
if the sequence shown is 1, 2, 3, _, then the participant should provide the answer 4 as it makes the
sequence increase by 1 with each successive number. The items were given in order of relative difficulty
with the easiest given first and progressively more difficult items following. There were 15 items per set, and two sets of items were given to each participant. The presentation of set A and set B was counterbalanced across participants. If the correct answer was given for the blank the participant received a 1; if the wrong answer was given, they received a 0.
Need for Cognition Scale
The Need for Cognition scale (NCS) consisting of 8 items of the original 18 was presented
between the NS sets (Cacioppo, Petty, & Kao, 1984). The scale is divided into two factors, positive and
negative. The 4 items that had the highest factor loadings on each factor were chosen to be included for
a total of 8 items. In the NCS participants are asked to answer statements about whether they seek out
and undertake applied cognitive tasks on a 5 point Likert scale, where 1 is very inaccurate and 5 is very
accurate. The positive and negative scale scores were sums of item endorsement and then given a
percentage score between 0 and 100, with the mean of the positive scale 59.8 and the mean for the
negative scale 30.6.
Correcting Item Scores with Response Time
The response time was used to provide two pieces of information for the analysis of correct
scores. The first use was to cut off responses over 120 seconds. After 2 minutes of trying to solve a given
problem, if no answer was given a score of 0 was given for the item. Answers under 120 seconds are
unaffected by this rule. The other rule enforced was that of random guessing by answering too fast. If a
response was input and confirmed in less than 5 seconds, a conditional rule was applied. Right answers were still given credit, since a specific integer was required to earn credit. However, wrong answers were marked as missing, since no effort had been put toward the item and no judgment
could be made about ability. Of all of the responses provided by participants 2.8% were provided in less
than 5 seconds. Of these fast responses 17.8% were incorrectly answered and considered missing.
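These two timing rules can be written as a small scoring function. The sketch below encodes the rules as described (a 120 second cutoff and a 5 second fast-response screen); it is an illustration, not the production scoring code used for the ALP administration.

```python
def score_item(correct, seconds):
    """Apply the response-time rules to one item.

    correct: True/False for the answer given; seconds: confirmed response time.
    Returns 1, 0, or None (None = treated as missing).
    """
    if seconds > 120:
        return 0                          # past the 2 minute limit: scored 0
    if seconds < 5:
        return 1 if correct else None     # fast correct kept, fast wrong set missing
    return 1 if correct else 0

print(score_item(True, 3.2))    # 1: fast but exactly right
print(score_item(False, 3.2))   # None: fast guess, no judgment about ability
print(score_item(False, 130))   # 0: exceeded the time limit
```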
Analyses to be Performed
To compare the scores obtained, CTT and IRT methods were carried out in sequence. The CTT approach yields test reliability and score estimates. Items were analyzed for fit with the test score by examining the point-biserial estimates for each item, and biases due to sample characteristics were examined through item responses by group and regression analysis on the average score. A comparison
of 1PL and 2PL models was done to determine the need for estimating item discrimination parameters
in addition to item difficulties. The IRT analysis found the item difficulties and person abilities on the
latent NS trait. The item fits were analyzed and differential item function (DIF) was used to determine
the stability of scores for demographic variation. The comparison of item performance within the two
methods is explored.
Results
30 Item Classical Estimates
The two 15 item scales are first viewed as a single unit of 30 items to determine how well this
full set of NS items provides a score. Then the two 15 item sets are examined to see how accurate they
are compared to the full item set. Then once the items have been scored, item fits with the test are
examined and decisions about which items fit together as a test of NS ability are made. With CTT there is
a sense of how well people do on the items without assuming any items are harder than others. The
measure of reliability, Cronbach's alpha, is a lower bound on the reliability of the scale given the item responses of the sample. The test was given as two 15 item scales, with items increasing in difficulty from the first item to the fifteenth. The analyses below treat the test both as two distinct parts and as an overall performance. Using these 15 and 30 item scores one can see how reliability and score estimates are affected by different item sets.
The reliability for NS 15 item scale A was , and the reliability for NS 15 item scale B was . The reliability of the combined 30 item NS test was . As would be expected, more items administered will improve the reliability of the scale to a level that most researchers would find acceptable (Nunnally, 1978, p. 245). The classical 30 item scale score mean was and the scale variance was . It is noted that according to this measure of ability the sample average is better than the scale midpoint of 15, which is designed to be the average score of a person with a fifth grade education.
The view of CTT is that items that have a low or negative point bi-serial correlation with the total
raw score should be removed to improve the scale measurement characteristics. The estimates for item
fit in the CTT framework are given in Table 3.2. This table indicates how well the sample did on the item
level and the individual item fit compared to the overall scale score. The point bi-serial value is an
estimate of the correlation between the responses on a given item to the total raw score of the scale
(Osterlind, 1983). A point bi-serial value of 0.20 to 0.80 is considered a good value for an item
(Thorndike & Thorndike, 1994). By this rule one would reexamine the value of including items 1, 2, and 5 from set A and items 1, 2, 3, and 11 from set B. All of these items except 11B are very easy for most of the sample and thus have low variation and do not add much in telling us how scores differentiate. Conversely, item 11B was so hard that barely anyone got it right, so it also has low variance and a low correlation with the total score. The items that provide the most information are those in the middle of the scale, for which the sample percentage correct was about 50%.
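The point-biserial screen applied here (flagging items whose item-total correlation falls outside the 0.20 to 0.80 range) could be computed with a sketch such as the following; the response matrix `data` in the usage comment is hypothetical.

```python
import numpy as np

def point_biserial_flags(responses, low=0.20, high=0.80):
    """Correlate each 0/1 item with the total raw score and flag out-of-range items."""
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    flags = {}
    for i in range(responses.shape[1]):
        r = np.corrcoef(responses[:, i], total)[0, 1]
        flags[i + 1] = (round(r, 2), not (low <= r <= high))
    return flags  # item number -> (point-biserial, flagged?)

# Hypothetical usage on a persons-by-items 0/1 matrix named `data`:
# for item, (r, flagged) in point_biserial_flags(data).items():
#     print(item, r, "review" if flagged else "ok")
```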
Table 3.2. Classical Test Theory and Item Response Theory Fit Statistics
Incorrect Correct Missing % Correct Pt-Biserial Infit MNSQ Outfit MNSQ Difficulty
Item 1A 6 2534 8 99.76 0.06 0.91 0.86 -4.93
Item 2A 17 2523 8 99.33 0.14 0.92 0.88 -3.88
Item 3A 61 2478 9 97.60 0.22 0.94 0.89 -2.51
Item 4A 38 2503 7 98.50 0.21 0.91 0.90 -3.03
Item 5A 15 2524 9 99.41 0.14 0.93 0.90 -4.01
Item 6A 217 2318 13 91.44 0.30 0.96 0.94 -1.00
Item 7A 579 1948 21 77.09 0.42 1.01 1.02 0.44
Item 8A 633 1882 33 74.83 0.50 0.92 0.93 0.60
Item 9A 1111 1390 47 55.58 0.44 1.06 1.12 1.77
Item 10A 1035 1462 51 58.55 0.57 0.92 0.96 1.60
Item 11A 761 1754 33 69.74 0.64 0.84 0.84 0.94
Item 12A 615 1899 34 75.54 0.57 0.93 0.94 0.55
Item 13A 1305 1192 51 47.74 0.53 0.96 1.00 2.20
Item 14A 1467 1036 45 41.39 0.59 0.83 0.81 2.56
Item 15A 2343 185 20 7.32 0.29 0.92 0.94 5.34
Item 1B 21 2521 6 99.17 0.10 0.92 0.88 -3.67
Item 2B 17 2525 6 99.33 0.12 0.91 0.86 -3.89
Item 3B 25 2517 6 99.02 0.18 0.92 0.92 -3.48
Item 4B 116 2421 11 95.43 0.28 0.95 0.93 -1.77
Item 5B 77 2461 10 96.97 0.29 0.92 0.88 -2.24
Item 6B 312 2217 19 87.66 0.41 0.97 0.96 -0.50
Item 7B 510 2017 21 79.82 0.58 0.88 0.89 0.24
Item 8B 628 1891 29 75.07 0.62 0.88 0.88 0.59
Item 9B 755 1766 27 70.05 0.61 0.90 0.93 0.92
Item 10B 387 2136 25 84.66 0.52 0.88 0.89 -0.18
Item 11B 2496 11 41 0.44 -0.01 0.92 0.97 8.46
Item 12B 1009 1522 17 60.13 0.38 1.09 1.16 1.51
Item 13B 1155 1336 57 53.63 0.46 1.00 1.06 1.88
Item 14B 1257 1228 63 49.42 0.60 0.81 0.84 2.12
Item 15B 1813 704 31 27.97 0.53 1.08 1.07 3.36
The distribution of percentage correct is given in Figure 3.1, where the x-axis is the percentage
of items correct out of 30 total items and the y-axis is the frequency of each percentage. The average is
above the halfway point with a negative skew due to the sample being above average on the test.
Figure 3.1. A histogram plot of percentage correct raw score for 30 item Number Series test. The y-axis
indicates the percentage of cases and the x-axis the percentage score observed. The histogram indicates
that the sample was on the higher end of the spectrum and does better than 50% on average, which is
the designed midpoint of the test.
30 Item IRT Model
A comparison of the 1PL and 2PL IRT models was done on the 30 item set to determine the worth of including item discrimination parameters. The 1PL model was found to provide good fit to the data by standard fit statistics. The 2PL model, which freed the item discrimination parameters, used 29 additional degrees of freedom and provided a similar model fit. In both instances the hypothesis of exact fit was rejected (p = 0.000). There are various indexes of fit to inform us on how to interpret these results (Hayduk, Cummings, Boadu, Pazderka-Robinson, & Boulianne, 2007). The model chi-square assesses exact fit of the data to the model, and while there seems to be a good amount of improvement going from the 1PL to the 2PL model, neither is a perfect fit. The CFI provides a measure of fit taking into account the degrees of freedom. It is noted that the CFI values favor the 2PL model, with the caveat that the RMSEA for both models indicates close fit. Because the 1PL model could not be dismissed as poorly fitting, it was used as the model of choice in subsequent analyses for ease of interpretation.
Figure 3.2. Scatterplot shows the relationship between scores on two scoring methods. The Mscore
analysis estimated the item difficulties and person ability levels. The Wscore analysis used fixed anchors
and estimated the person abilities with these values. The fixed anchor values were provided by WJ-III
Norms on a nationally representative sample.
The performance on an item is compared to the difficulty of the item, and a correct answer will
modify a score upwards, whereas a wrong response will push a score down. The resulting W-score is an
approximation of ability based on the relative difficulty of items answered right compared to items
answered wrong. Then the W-score was compared to a score based on the IRT Rasch model which
estimates the item parameters and person ability levels simultaneously. The resulting person scores are
referred to as M-scores. The item difficulties and person ability levels are estimated using a one
parameter logistic model which is then rescaled to be compared directly to the w-score from the
previous procedure.
The 30 item average score for our sample is 535 for the W-score and 521 for the M-score. These
scores correlate with one another with r = 0.89. The high correlation between the methods indicates
that the fixed anchors used in the W-score method are good estimates of the item difficulties. Table 3.3
provides a view of how the scales compare to one another. The scores with the two scoring methods
looked different and a paired samples t-test indicated that the W-scores were significantly higher than
the M-scores (D = 14.1, t(2536) = 88.04, p < 0.001). Notable is the drop in correlation between identical
scoring methods when fewer items are used. This is true of the fixed anchor method and the full IRT
estimation method. Figure 3.2 shows a scatter plot of the two scoring methods with the x-axis and y-axis
being the two item set scores. There appears to be a strong linear trend in scores, as score on one
method increases, the score on the opposing method increases as well.
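The comparison reported here, a Pearson correlation between the two scorings and a paired samples t-test on their difference, can be reproduced for any two score vectors; in the sketch below `w_scores` and `m_scores` are hypothetical arrays standing in for the actual ALP score files.

```python
import numpy as np
from scipy import stats

def compare_scores(w_scores, m_scores):
    """Correlate two scorings of the same people and test their mean difference."""
    w = np.asarray(w_scores, dtype=float)
    m = np.asarray(m_scores, dtype=float)
    r = np.corrcoef(w, m)[0, 1]
    t, p = stats.ttest_rel(w, m)          # paired samples t-test
    return {"r": r, "mean_difference": (w - m).mean(), "t": t, "p": p}

# Hypothetical usage:
# print(compare_scores(w_scores, m_scores))
```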
Table 3.3. Sample Statistics and Correlations of Scale Scores for 30 item, 29 item, 15 item, and BAT scores
Mscore_30 Mscore_29 Mscore_15A W_BAT_A Mscore_15B Mscore_14B W_BAT_B
Mean 520.6 520.7 526.1 541.2 525.1 524.3 534.4
Standard Error 11.4 11.4 13.3 18.9 11.2 11.3 16.6
Mscore_30 1.000 (sym.)
Mscore_29 1.000 1.000
Mscore_15A 0.911 0.912 1.000
W_BAT_A 0.744 0.745 0.797 1.000
Mscore_15B 0.908 0.907 0.661 0.554 1.000
Mscore_14B 0.908 0.908 0.662 0.555 1.000 1.000
W_BAT_B 0.673 0.672 0.473 0.435 0.748 0.746 1.000
Figure 3.3. The scatter plots show the monotonically increasing trend in (a) set A items and (b) set B
items with fixed anchors. The anchors come from the norming sample of the WJ-III and illustrate that
each item is progressively harder from the previous within a set.
Figure 3.4. The scatterplots illustrate the range of item difficulties for (a) set A of the Number Series
items and (b) set B of the Number Series items. The items are listed in order of presentation on the x-
axis and the calculated difficulties are on the y-axis. The general trend in the item difficulties is that
higher difficulty was observed as participants progress through the test. The lack of a monotonically
increasing trend indicates that the items were not in order of difficulty for the sample tested. The
correlation between the fixed anchors and estimated difficulties was r = 0.82.
Figures of the item difficulties are given to show the distribution of item difficulties over the range of Number Series abilities. Comparing them indicates how well the relative order of the calculated item difficulties matches the item difficulties assigned by the WJ norming sample. The items were chosen from an unpublished list of possible items to have a monotonically increasing feature in the fixed anchors (Figure 3.3). The second figure shows the distribution of item difficulties when the item difficulties are free to be estimated (Figure 3.4). The items should have the same general shape because IRT is said to be sample free in terms of item difficulty, but there are a few differences in item ranking. This change in rank ordering is reflected by a lower correlation between the two sets of item difficulties (r = 0.82).
Further testing of the item characteristics is done to provide evidence that one facet is being
measured by the scale. Unidimensionality is important to IRT because of the nature of scoring. The scale
only locates persons higher or lower along a single dimension and does not capture any sideways movement, which would be indicative of another facet. In order to test unidimensionality, the infit for the set of items is examined. Figure 3.5 shows the calculated item infits for all 30 items,
with acceptable ranges being between 0.77 and 1.30 (Adams & Khoo, 1993). A more restrictive range
has been proposed to accept values between 0.83 and 1.20 (Hungi, 1997; Keeves & Alagumalai, 1999).
The values are listed in Table 3.2 to provide exact statistics for determining item infit within the Number
Series facet. All 30 items fall in this range, with items 11A, 14A, and 14B being on the low end but still quite acceptable for good fit. Compared to the CTT approach to item fit, the full item set was kept without
losing items on the easy end of the spectrum because of the sample characteristics.
Figure 3.5. Scatter plot of Infit Meansquares for (a) item set A and (b) item set B. The x-axis labels the
items in order of presentation and the y-axis indicates the estimated Infit values. Values between 0.70
and 1.333 are considered to be good item fits.
The test information curve is presented in Figure 3.6. This curve indicates where the scale is
most accurate along the continuum of person scores. The test provides most accurate scores just above
the zero point on the scale but provides good coverage for the lower end and up to 3 standard
deviations above the center point. The higher portions of the scale are not as accurate. A sample of Item Characteristic Curves (ICCs) is shown in Figure 3.7. These curves indicate, for a given item with a given difficulty (beta), that as a person's ability (theta) increases the probability of getting the item correct increases (blue line) and the probability of getting the item incorrect decreases (red line). As Figure 3.7 shows, each item has a different difficulty, so for a particular item the probability of a correct response depends only on the person's ability. The point at which the ability of the person matches the difficulty of the item is where the probability of getting the item correct is 0.50.
Selecting Out Extreme Items
In assembling a set of items for use with future samples it is important to identify items that do
not perform well within the scale. In our 30 item set of Number Series items, it appears that one of the items is far more difficult than the rest and lies in the upper end of the difficulty range (b_11B = 8.46). This is visible in panel (b) of Figure 3.4, where item 11 is shown to be at the upper end of the spectrum, with the next highest item being item 15 of set A. A further indication that item 11B may not accurately measure Number Series ability is the standard error of its estimates, shown in Table 3.4. The wide range of the estimate means that the item may provide inaccurate estimation of the respondent ability level. To handle the potential bias of using this item in getting an overall estimate of ability, the IRT analysis was repeated with this item removed.
The overall 29 item score is then computed in the same way as outlined above. The 29 item IRT
model is run where item parameters and person parameters are free to be estimated. The 29 item score provides no notable change in person scores for the sample, as shown in Table 3.3 where the 30 item and 29 item complete scale scores are compared. The same is also true of the 15 item versus 14 item
score for the set B items. In comparing the scale correlations of the 15 item and 14 item set B scales it is
found that there are no big deviations between the two scores for respondents. The omission of the
item that seemed to be too hard for the sample did not provide any substantial changes in our
interpretation of how well people performed on the Number Series task.
Figure 3.6. A scatterplot of the test information curve is presented. The proficiency (ability) scale is
shown in log units on the x-axis (the 0 point is analogous to a score of 500 on the transformed scale).
The information for any point on the proficiency scale is shown on the y-axis with higher values
indicating more informative and accurate scores. The Information function of the scale relates to the
Standard Error of the Measurement in equation 3.9, which indicates that portions of the scale with
higher levels of information will have lower standard errors.
Figure 3.7. The Item Characteristic Curves provide information on the difficulty of each item with (a)
being item 1, (b) item 7, and (c) item 15. These curves show for a given proficiency (ability; x-axis) what
the probability of a correct response is (y-axis). The line rising from the bottom is the probability of a
correct response given the associated proficiency.
Table 3.4. Differential Item Functioning Between Males and Females
Male Threshold Female Threshold χ²/DF RMSEA DIFFTEST p-value
Invariant Thresholds -- -- 3787.0 / 867 0.051 -- --
Set A Items
Free Item 1 -2.89 (0.22) -3.20 (0.30) 3791.9 / 866 0.051 1.28 / 1 0.258
Free Item 2 -2.59 (0.15) -2.30 (0.24) 3787.0 / 866 0.051 0.95 / 1 0.329
Free Item 3 -2.11 (0.09) -2.15 (0.24) 3790.8 / 866 0.051 0.03 / 1 0.872
Free Item 4 -2.27 (0.11) -2.05 (0.21) 3791.2 / 866 0.051 0.77 / 1 0.381
Free Item 5 -2.52 (0.14) -2.68 (0.37) 3796.3 / 866 0.052 0.18 / 1 0.675
Free Item 6 -1.36 (0.06) -1.96 (0.19) 3773.5 / 866 0.051 16.9 / 1 0.000
Free Item 7 -0.78 (0.04) -1.13 (0.08) 3775.1 / 866 0.051 22.2 / 1 0.000
Free Item 8 -0.86 (0.05) -0.81 (0.05) 3785.5 / 866 0.051 0.62 / 1 0.432
Free Item 9 -0.27 (0.04) -0.39 (0.05) 3783.5 / 866 0.051 5.44 / 1 0.020
Free Item 10 -0.37 (0.04) -0.40 (0.03) 3785.1 / 866 0.051 0.45 / 1 0.501
Free Item 11 -0.71 (0.04) -0.56 (0.03) 3781.3 / 866 0.051 12.8 / 1 0.001
Free Item 12 -0.85 (0.05) -0.75 (0.04) 3784.0 / 866 0.051 3.65 / 1 0.056
Free Item 13 -0.18 (0.04) -0.12 (0.04) 3784.5 / 866 0.051 1.70 / 1 0.192
Free Item 14 0.00 (0.04) -0.04 (0.04) 3785.2 / 866 0.051 0.80 / 1 0.370
Free Item 15 1.26 (0.05) 3.37 (0.79) 3764.0 / 866 0.051 24.8 / 1 0.000
Set B Items
Free Item 1 -2.52 (0.14) -3.15 (0.51) 3799.1 / 866 0.052 1.96 / 1 0.162
Free Item 2 -2.42 (0.13) -2.27 (0.23) 3792.3 / 866 0.052 0.33 / 1 0.564
Free Item 3 -2.52 (0.14) -2.10 (0.24) 3791.2 / 866 0.051 1.95 / 1 0.163
Free Item 4 -1.71 (0.07) -1.98 (0.18) 3785.5 / 866 0.051 2.54 / 1 0.111
Free Item 5 -1.94 (0.08) -1.81 (0.14) 3787.6 / 866 0.052 0.61 / 1 0.437
Free Item 6 -1.20 (0.05) -1.39 (0.08) 3784.1 / 866 0.051 4.71 / 1 0.030
Free Item 7 -1.01 (0.05) -0.79 (0.04) 3779.3 / 866 0.051 17.3 / 1 0.000
Free Item 8 -0.88 (0.05) -0.65 (0.03) 3777.1 / 866 0.051 24.9 / 1 0.000
Free Item 9 -0.68 (0.04) -0.59 (0.03) 3783.5 / 866 0.051 6.05 / 1 0.014
Free Item 10 -1.23 (0.05) -0.95 (0.04) 3776.9 / 866 0.051 22.1 / 1 0.000
Free Item 11 2.37 (0.12) -4.64 (0.60) 3609.6 / 866 0.050 653 / 1 0.000
Free Item 12 -0.27 (0.04) -0.64 (0.05) 3770.0 / 866 0.051 40.9 / 1 0.000
Free Item 13 -0.24 (0.04) -0.31 (0.04) 3784.6 / 866 0.051 2.21 / 1 0.137
Free Item 14 -0.17 (0.04) -0.21 (0.03) 3784.8 / 866 0.051 0.98 / 1 0.322
Free Item 15 0.34 (0.04) 0.30 (0.06) 3786.6 / 866 0.051 1.28 / 1 0.257
Note: The p-value is the probability of rejecting the null hypothesis of no difference in item threshold between groups based on gender. The WLSMV estimator for the IRT model requires a correction coefficient to do a chi-square difference test.
Differential Item Functioning
The items were then tested to see if groups based on gender found any bias in the items using
differential item functioning (DIF; Holland & Thayer, 1988). The shapes of the item curves were compared
between groups by constraining the item difficulties to be equal and then relaxing this constraint and
comparing the model fits. A significant difference in Chi Square fit for each item indicates that males and
females have different difficulty thresholds. A chi-square difference test using weighted least squares with adjusted means and variances (WLSMV) was performed for the IRT model with categorical data. The results of the DIF analysis are displayed in Table 3.4, with bold rows highlighting significant group differences. In set A, items 6, 7, 11, and 15 all indicate bias due to gender. Items 6 and 7 show
some bias, where females found the items easier than males. However items 11 and 15 were easier for
males than females. In set B items 7, 8, 10, 11, and 12 were found to have gender biases. Items 7, 8, and
10 were easier for males and items 11 and 12 were easier for females. The standard error of item 15A
and 11B indicate that these estimates may not be accurate and the relatively low variation in
performance on these items may be the source of the inaccuracies. The test could be improved to be rid
of gender biases by either reworking the biased items or removing them from future administrations of
the Number Series.
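The DIF tests above were run as threshold-constraint chi-square difference tests within the IRT model. As a lighter-weight illustration of the same idea, the Mantel-Haenszel procedure cited earlier (Holland & Thayer, 1988) compares the odds of a correct response across groups within strata matched on total score. The sketch below assumes the focal group is coded 1 and the reference group 0; it is not the procedure actually used for Table 3.4.

```python
import numpy as np

def mantel_haenszel_chi2(item, group, total_score):
    """Mantel-Haenszel chi-square for one 0/1 item, two groups, matched on total score."""
    item, group, total_score = map(np.asarray, (item, group, total_score))
    num = 0.0       # sum over strata of (A_k - E[A_k])
    var = 0.0       # sum over strata of Var(A_k)
    for s in np.unique(total_score):
        mask = total_score == s
        a = np.sum((group[mask] == 1) & (item[mask] == 1))   # focal, correct
        b = np.sum((group[mask] == 1) & (item[mask] == 0))
        c = np.sum((group[mask] == 0) & (item[mask] == 1))
        d = np.sum((group[mask] == 0) & (item[mask] == 0))
        n = a + b + c + d
        if n < 2:
            continue
        num += a - (a + b) * (a + c) / n
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return (abs(num) - 0.5) ** 2 / var if var > 0 else 0.0   # continuity corrected

# Compared against a chi-square distribution with 1 degree of freedom.
```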
Scoring Biases Due to Sample Characteristics
Multiple regressions were carried out in order to account for variation in observed Number
Series ability scores. These regressions included age, education, income, gender and Need for Cognition
positive and negative scale scores as predictors. The predictor values were centered at the mean values,
the log of income was calculated, and gender was effect coded towards females. This means that the intercept is the expected score for a respondent at the average age, education, log income, and Need for Cognition values, averaged across males and females.
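The predictor coding described in the note to Table 3.5 (age centered at 50 and divided by 10, education centered at 12 and divided by 4, log income, and effect-coded gender) can be expressed as a small transformation. The field names and the +1/−1 coding direction in this sketch are assumptions for illustration.

```python
import math

def code_predictors(age, education, income, female, nc_pos, nc_neg):
    """Return the centered/transformed predictors used in the regression models."""
    return {
        "age": (age - 50) / 10,             # decades from age 50
        "education": (education - 12) / 4,  # four-year blocks from 12 years
        "gender": 1 if female else -1,      # effect coding toward females (assumed +1)
        "income": math.log(income),         # log dollars to pull in extreme values
        "nc_pos": nc_pos,                   # Need for Cognition scale scores (0-100)
        "nc_neg": nc_neg,
    }

print(code_predictors(age=60, education=16, income=55000, female=True,
                      nc_pos=62.5, nc_neg=31.3))
```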
The results of the regression runs are presented in Table 3.5. Both the fixed anchor score (W-
score) as well as the fully estimated model score (M-score) are listed to illustrate any variation in model
prediction between the two scoring methods. The first two models tested the effects of the predictors
without interactions. Model 1 was the effects of the predictors on the M-score and model 2 was the
effect of the predictors on the W-score. The direction of the effects trended in the same direction over
the two scoring methods and all predictor weights were significant. In model 3 the difference was taken
between the two scoring methods to see if our sample characteristics account for differences due to
methodology. Models 1 and 2 showed that the beta weights remain steady across both of the scoring
methods.
Table 3.5. Regression Models with Number Series Score as the
Outcome Variable
Model 1 Model 2 Model 3
Intercept 518.4 (0.23) 531.9(0.33) 13.5 (0.28)
Age -0.11 (0.01) -0.10 (0.02) 0.01 (0.01)
Education 1.23 (0.12) 1.65 (0.17) 0.42 (0.09)
Gender -2.85 (0.46) -3.90 (0.67) -1.05 (0.35)
Income 0.28 (0.04) 0.36 (0.06) 0.08 (0.03)
NCPos 0.11 (0.02) 0.14 (0.02) 0.03 (0.01)
NCNeg -0.04 (0.02) -0.07 (0.02) -0.03 (0.01)
Explained Variance
R² 0.21 0.18 0.22
Notes: Age is centered at 50 and divided by 10, education is
centered at 12 and divided by 4, gender is effect coded towards
females, and income is a log transformation of the dollar value.
There is, however, a noticeable difference in mean scores between the two methods, and in Table 3.5 model 3 the demographics accounted for only a small portion of the explained variance (R² = 0.05). The regression of demographics on the difference score showed that education, income, and the Need for Cognition scales increased the difference and gender decreased it. The item responses were then used to see if some items may have contributed to the difference in scores across the methods. In this model seven items were found to have significant regression weights in predicting score differences (items 3A, 10A, 12A, 13A, and 15A from set A, and items 13B and 14B from set B). The explained variance of the difference model was R² = 0.21. The correlation between the difference in item difficulties (fixed anchors versus calculated difficulties) and the regression weights above was r = 0.34. This amounts to a small trend in which larger differences in difficulty were associated with larger regression weights predicting the difference in scoring methods.
Conclusion
The hypotheses outlined above show that the Number Series scale is consistent with
psychological theory. The first hypothesis stated that CTT and IRT provide different views of scale fit. Analyzing the test characteristics of the Number Series test provided good evidence for the presence of a latent ability to perform the task. As seen in the CTT section, items that had little variability in the
sample had low correlation with the total raw score and were therefore recommended to be dropped in
future testing. The same items were found to be well fitting in the IRT analysis by the infit meansquare
estimates and justification to keep them in the item pool could be made based on these findings.
The second and third hypotheses focus on the results from the IRT analyses. Generally a one
facet scale is recommended for this set of items given the Cronbach’s alpha and the infit estimates. This
unidimensionality is important because it indicates that the scale only taps into one trait and provides
reasonably good reliability in the score given. More importantly, the fixed anchors provide good
estimates of ability when compared to the estimated item difficulties. The use of fixed anchors alludes
to the sample free nature of IRT in that item difficulties estimated from one sample should fit with any
other potential sample. Here, the estimates from the WJ-III norming sample were a good fit for our sample as well. The predictors did not vary in influence between the two scoring methods, and it is concluded that either way of scoring will provide a good relative score on the Number Series ability.
The fourth hypothesis notes that scale scores created using less than the full item set will be less
accurate than the full 30 item set score. This is illustrated well with Table 3.3, in which the 15 item subscales are compared to the 30 item scales based on the same scoring method, and also to the block adaptive (BAT) scoring method. There is a high relationship between the scores within an item subset and the full score, and also between the subsets and the full score of the opposing scoring method. The loss in
correlation comes when subscales are compared to one another between and within scoring methods
(fixed anchors vs. estimated difficulties).
The fifth hypothesis focused on gender biases for items, and biases were found in a few items.
These items with significant DIF should be examined in future administrations to promote fairness in
testing. Previous work has found gender biases to be present in mathematics skills testing, where males
had an easier time with items than females (Doolittle & Cleary, 1987; Kalaycioglu & Berberoglu, 2010).
This analysis provides some evidence for continued gender differences but further testing will indicate
the extent to which items must be edited for future testing.
In testing the last hypothesis, multiple regressions with demographic predictors and all interactions indicated that age, education, income, gender, age x education, and positive Need for
Cognition x negative Need for Cognition were significant predictors of Number Series ability. The age
and education effects were as expected with lower ability for older ages and higher ability for higher
education. Income also provided an increased observed score, and a gender difference was observed
with males having a higher score than females. The age x education interaction shows that respondents with higher education and older age would be expected to have a lower score, while younger respondents with higher education would have a higher score than average. The Need for Cognition interaction is significant, but the effect is small compared to the scale of the scoring.
Most of the main effects are larger than the interaction terms. Further work should be
done to examine the size of the interaction effects. An important note is the fragmented nature of
ability; the fluid process has its own structure of relationships to age, education, etc. (Schaie, 1977;
Schaie, 1978). These relationships are true of the domain and even within the subdomain, where the
Number Series may be effective in measuring fluid ability, but may be slightly different from another
fluid ability task. Additionally, the effect of speed is a constant theme in research on ability measurement over age (Salthouse, 1996; Hertzog, 1989). Whether the idea is that speed of processing is slowing, or
there is a general trend of slowing overall, fluid intelligence declines with age and speed is related to this
(Zimprich & Martin, 2002). The use of response time in this set of analyses was simply to create a
standard for scoring responses, but these values could be used in more complete analyses of ability (van
der Linden, 2007).
The inclusion of a difference score between the M-score and the W-score is an attempt to
determine any bias present in using fixed anchors. Including the anchors as fixed would provide for a
stronger model as long as it can be shown to provide answers similar to those when difficulties are
estimated. Our results provided support for the conclusion that the sample did not contribute to the
large differences seen in scoring. It is concluded that the M-scoring methodology provides a more
accurate measure of ability on the Number Series task because the scoring differences could not be
accounted for by simple demographic information. The W-scores provide biased scores which do not
accurately indicate levels of ability.
It is important to highlight the usefulness of item calibration and analysis in this testing
situation. Item 11 of set B seemed to provide no added benefit when included in the analysis and could
thus be dropped without any sacrifice of accuracy. While the item did not raise any flags in terms of how well it fit in the scale, it did have a particularly high value of difficulty which did not fit with the rest of the items. The suitability of items may be sample dependent, and piloting of the item pool is an
important process in order to weed out items which do not fit the scale as intended. It is recommended
to omit the above item in future testing situations due to the extreme difficulty feature it possesses.
In implementing the Number Series test for use with the ALP the test was transformed from a
paper and pencil task to an interactive computer task. The effect of mode is one that can introduce a
certain level of bias to the results, and this should be examined. Essentially, a bias arises in who is willing to take the test, driven by comfort with technology or general ability to get access (Alderson, 2000; Al-Amri, 2008). Research on cross mode effects in ability testing in psychology is relatively sparse, with few studies examining the equivalence of computerized tests and paper and pencil tests (Wilhelm & Schroeders, 2008). The few
that have compared cross mode tests have shown high correlations between the two mediums, with
any differences in scores attributable to the samples themselves and not the tests (MacIsaac, Cole, Cole,
McCullough, & Maxka, 2002; Potosky & Bobko, 2004; Schroeders & Wilhelm, 2010).
References to Chapter 3
Adams, R.J. and Khoo, S.K. (1993). QUEST: The Interactive Test Analysis System. Australian Council
for Educational Research, Hawthorn, Victoria.
Al-Amri, S. (2008). Computer-based testing vs. paper-based testing: a comprehensive approach
to examining the comparability of testing modes. Essex Graduate Student Papers in Language &
Linguistics, 10, 22-44.
Alderson, J.C. (2000). Technology in testing: the present and the future. System, 28, 593-603.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Ang, S., Rodgers, J.L., & Wänström, L. (2010). The Flynn effect within subgroups in the U.S.:
gender, race, income, education, and urbanization differences in the NLSY-children data. Intelligence,
38, 367-384.
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Couper, M., Kapteyn, A., Schonlau, M., & Winter, J. (2007). Noncoverage and nonresponse in an
internet survey. Social Science Research, 36, 131-148.
Doolittle, A.E. & Cleary, T.A. (1987). Gender-based differential item performance in mathematics
achievement items. Journal of Educational Measurement, 24, 157-166.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory.
Newbury Park, CA: Sage Publications.
Hayduk, L., Cummings, G.G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing!
Testing! One, two three – Testing the theory in structural equation models! Personality and Individual
Differences, 42, 841-50.
Hertzog, C. (1989). Influences of cognitive slowing on age differences in intelligence.
Developmental Psychology, 25, 636 - 651.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel
procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.
Horn, J. L., & Cattell, R.B. (1966). Refinement and test of the theory of fluid and crystallized
general intelligences. Journal of Educational Psychology, 57, 253-270.
Horn, J. L., & Cattell, R.B. (1967). Age differences in fluid and crystallized intelligence. Acta
Psychologica, 26, 107-129.
Hungi, N. (1997). Measuring Basic Skills Across Primary School Years. Unpublished MA thesis,
School of Education, The Flinders University of South Australia, Adelaide.
Kalaycioglu, D.B. & Berberoglu, G. (2010). Differential item functioning analysis of the science
and mathematics items in the university entrance exams in Turkey. Journal of Psychoeducational
Assessment, 29, 1-12.
Kapteyn, A. (2002). Internet Interviewing and the HRS (grant number 1R01AG020717-01). Rand
Corporation: Santa Monica, CA.
Kaufman, A.S., Kaufman, J.C., Liu, X., & Johnson, C.K. (2009). How do Educational Attainment
and Gender Relate to Fluid Intelligence, Crystallized Intelligence, and Academic Skills at Ages 22–90
Years? Archives of Clinical Neuropsychology, 24, 153 – 163.
Keeves, J.P and Alagumalai, S (1999). New Approaches to Measurement. In Masters, G.N and
Keeves, J.P (Eds.) Advances in Measurement in Educational Research and Assessment (pp. 23-42),
Pergamon, Oxford.
Kline, P. (1986). A Handbook of Test Construction: Introduction to Psychometric Design.
Methuen: London.
Lord, F. (1952). A theory of test scores. Psychometric Monograph, 7.
MacIsaac, D., Cole, R., Cole, D., McCullough, L., & Maxka, J. (2002). Standardized testing in
physics via the world wide web. Electronic Journal of Science Education, 6.
Nunnally, J.C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Osterlind, S.J. (1983). Test item bias. Quantitative Applications in the Social Sciences. Newbury
Park, CA: Sage Publications.
Potosky, D., & Bobko, P. (2004). Selection testing via the Internet: practical considerations and
exploratory empirical findings. Personnel Psychology, 57, 1003-1034.
Salthouse, T. A. (1996). A processing-speed theory of adult age differences in cognition.
Psychological Review, 103, 403-428.
Schaie, K. W., & Willis, S. L. (1993). Age-difference patterns of psychometric intelligence in
adulthood: Generalizability within and across ability domains. Psychology and Aging, 8, 44-55.
Schrank, F.A. (2006). Specification of the cognitive processes involved in performance on the
Woodcock-Johnson III (Assessment Service Bulletin No. 7). Itasca, IL: Riverside Publishing.
Schroeders, U. & Wilhelm, O. (2010). Testing reasoning ability with handheld computers,
notebooks, and paper and pencil. European Journal of Psychological Assessment, 26, 284-292.
Thorndike, R.L., & Thorndike, R.M. (1994). Reliability in Educational and Psychological
Measurement. In T. Husen and T.N Postlethwaite (Eds.) The International Encyclopedia of Education, 2nd
edition, (pp. 4981-4995). Oxford: Pergamon.
Van der Linden, W.J. (2007). A hierarchical framework for modeling speed and accuracy on test
items. Psychometrika, 72, 287-308.
Van der Linden, W.J. & Hambleton, R.K. (1997). Handbook of modern item response theory. New
York: Springer-Verlag.
Wilhelm, O. & Schroeders, U. (2008). Computerized ability measurement: some substantive dos
and don'ts. In F.Scheuermann & A. G.Pereira (Eds), Towards a research agenda on computer-based
assessment: challenges and needs for European educational measurement (pp. 76 –84). Luxembourg:
Office for Publications of the European Communities.
Woodcock, R.W., McGrew, K.S., & Mather, N. (2001). Woodcock-Johnson III. Rolling Meadows,
IL: Riverside Publishing.
Zimprich, D., & Martin, M. (2002). Can longitudinal changes in processing speed explain
longitudinal age changes in fluid intelligence? Psychology and Aging, 17, 690–695.
Chapter 4: Joint Response and Response Time Prediction Model
Abstract
This chapter presents the use of a joint response and response time model (Conditionally
Independent Response Time; CIRT) to data collected from the American Life Panel (ALP). The Woodcock
Johnson-III Number Series (WJ-III NS) task was presented to a sample of N = 2548, with 30 items administered in total and response times collected at the item level. The data were submitted to traditional Item Response Theory (IRT) models and CIRT models for a comparison of model fits and parameters. A discussion is given of the utility of such hierarchical models, which capitalize on the idea of “collateral data,” or data collected as a consequence of the testing procedure. The idea of speed-accuracy tradeoffs with
interpretations applied to the current models is also noted as a useful aspect of these models. This
model is put forth as an alternative method to scoring item sets with traditional IRT models. The
potential usefulness of the CIRT model is explored in cognitive testing.
Introduction
The scope of this chapter is to introduce a range of Conditionally Independent Response Time (CIRT) models (van der Linden, 2007a) and to test their fit to empirical data. The traditional role of
the Item Response Theory (IRT) model has been to tease apart the latent person ability on a given
measure from the item difficulties (Rasch, 1960). The distinguishing difference between these two
models is that the CIRT model introduces the use of response time as a way of creating more accurate
estimates of ability among test takers. The utility of such a tool is intriguing for a field that places a
premium on time spent with participants.
Of main importance is the idea that response time data are not hard to come by. The idea of collateral information comes from Novick & Jackson (1974), who identified this type of information as being collected simultaneously with the task responses. The
response times thus contain information about the person ability in conjunction with the responses to
items with which they are attached. This relationship can then help inform estimates of ability when
modeled in a hierarchical framework.
Figure 4.1. A factor model for 30 items. The item loadings can be fixed at 1 or free to vary to test one-parameter versus two-parameter IRT models. Factor scores indicate ability levels and item thresholds the item difficulties.
The CIRT framework improves and builds upon IRT by using the collateral response time information. In a simplistic view, the IRT model is simply joined with a response time model; CIRT builds on IRT based on a set of assumptions about response and response time interactions. These assumptions are outlined in van der Linden (2007a). The first indicates that the
speed of a person is constant through testing. This is in a similar vein to the response function, which
assumes that ability is constant through testing. The second assumption is that of random variables for
responses and response times. Because these variables are random, the next assumption is that the
person and item parameters of the response time models are separable. This follows from the response
model in IRT, where separate person parameters are estimated along with item parameters. These
response and response time parameters are further assumed to be conditionally independent when
ability levels are held constant over items. And finally the population relationships between speed and
accuracy are modeled separately from the person level parameters. This is why the CIRT models have
been proposed to be fit in a hierarchical format, where the person and item characteristics are
separated from the population statistics.
Item Response Theory Model
The IRT model first introduced by Rasch (1960) provides a good framework for comparing
person ability to item difficulty. This concept is introduced as a structural model as shown in Figure 4.1.
The factor is indicated by the items, which all have equal loadings (equal discrimination parameters). In
a categorical response model, the item thresholds indicate the item difficulties; which means that the
factor scores are the ability scores. To define the model the persons are labeled $j = 1, \dots, N$ and the items are listed as $i = 1, \dots, n$. For the IRT model only the responses matter. A simplistic model is the Rasch IRT model or a 1-parameter logistic (1PL) model. The distribution function of the responses is written as

(4.1)  $P(U_{ij} = u_{ij} \mid \theta_j, b_i) = P_i(\theta_j)^{u_{ij}} \left[ 1 - P_i(\theta_j) \right]^{1 - u_{ij}}$.

The response of any item i for a person j is a function of the observed responses $u_{ij}$, solving for ability (θ) and item difficulty (b). The form of the function $P_i(\theta_j)$ is usually given a logistic equation with the form

(4.2)  $P_i(\theta_j) = \dfrac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}$,

which can be reorganized for readability as

(4.3)  $P_i(\theta_j) = \left( 1 + \exp\left[ -(\theta_j - b_i) \right] \right)^{-1}$.
The probability of a correct response is then a direct relationship between the item difficulty and the
person ability. To have more ability than the item difficulty is to increase the odds of getting an item
correct. To have less ability than the item difficulty reverses the outcome. The probability of these
successes on successive items of varying difficulties allows us to estimate person ability and item
difficulty simultaneously. It is from this base that the CIRT model takes off.
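To make the 1PL relationship concrete, a minimal R sketch is given below; the function name p_rasch and the ability and difficulty values are illustrative only and are not part of the Number Series calibration.

# Minimal sketch of the Rasch (1PL) response probability in equation (4.3).
# theta: person ability; b: item difficulty, both on the same logit scale.
p_rasch <- function(theta, b) {
  1 / (1 + exp(-(theta - b)))          # equivalently plogis(theta - b)
}

# A person with theta = 0.5 facing three items of increasing difficulty:
b_items <- c(-1.0, 0.0, 1.5)
round(p_rasch(theta = 0.5, b = b_items), 3)
# 0.818 0.622 0.269: the probability of success falls as difficulty exceeds ability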
Conditional Item Response Theory Model
The framework for the response and response time model has been proposed to be a hierarchical model with two levels of measurement (van der Linden, 2007a). It is acknowledged that variation comes from two separable sources, those at the population level and those at the individual level. This model is displayed in Figure 4.2, with level 1 containing the person and item level parameters and level 2 the population and item domain level parameters. The responses and response times are indicators of the level 1 parameters, for which the person and item parameters are independent of one another. The level 1 parameters are in turn indicators of the level 2 population and item domain parameter means and covariances. It is at this second level that the response and response time parameters are related.
Specifications of the response and response time models are provided below to make clear the
process by which the current model parameters are estimated. Further model types, with specifications
and assumptions, are available from a number of sources (Klein Entink, 2009; van der Linden, 2006; van
der Linden, 2007a).
As was the case with the IRT model, the persons are indicated by $j = 1, \dots, N$ and the items by $i = 1, \dots, n$. There are now two vectors of data based on person performance of a task, one being the response vector $\mathbf{U}_j = (U_{1j}, \dots, U_{nj})$ and the other the collateral information, the response time vector $\mathbf{T}_j = (T_{1j}, \dots, T_{nj})$. The observed instances of these vectors are given as $u_{ij}$ for the responses and $t_{ij}$ for the response times.
Figure 4.2. A Conditionally Independent Response Time model. The two levels of measurement are indicated by the labels on the left of the diagram. The response and response time models, with their person and item parameters, make up level one; the population and item domain parameters (means and covariances) make up level two and create the bridge between the two models, response and response time.
Level 1 Models. For simplicity and according to the development of the Number Series task (Woodcock, McGrew, & Mather, 2001), a Rasch IRT 1PNO model is compared to a 2PNO model as the response model. The general two-parameter model has a distribution function similar to the IRT model,

(4.4)  $P(U_{ij} = u_{ij} \mid \theta_j, a_i, b_i) = P_i(\theta_j)^{u_{ij}} \left[ 1 - P_i(\theta_j) \right]^{1 - u_{ij}}$.

The function $P_i(\theta_j)$ is the probability function typically seen in IRT modeling,

(4.5)  $P_i(\theta_j) = \Phi\left( a_i (\theta_j - b_i) \right)$,

where $\Phi(\cdot)$ denotes the standard normal distribution function. The probability of getting an item correct for person j with ability level (θ) is the comparison of this value to the item difficulty (b), as previously mentioned in the traditional IRT analysis. The addition of an item discrimination parameter (a) affects how well the item discriminates between
test takers of different ability levels. The normal ogive model is used to provide the probability of correct
response in this model. This is comparable to the logistic model, which is usually favored because of its mathematical simplicity.
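As an aside, the closeness of the normal ogive and logistic forms can be shown with a short R sketch; the item parameters a = 1.2 and b = 0.5 are arbitrary, and the 1.7 scaling constant is the usual approximation rather than a quantity estimated in this study.

# Two-parameter normal ogive probability from (4.5) and its logistic counterpart.
p_2pno <- function(theta, a, b) pnorm(a * (theta - b))
p_2pl  <- function(theta, a, b) plogis(1.7 * a * (theta - b))   # 1.7 rescales the logistic toward the ogive

theta <- seq(-3, 3, by = 1)
round(cbind(theta,
            ogive    = p_2pno(theta, a = 1.2, b = 0.5),
            logistic = p_2pl(theta, a = 1.2, b = 0.5)), 3)
# The two probability columns differ by roughly .01 or less across the ability range.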
The additional portion of the CIRT model is the joint probability based on the observed response time. Response time is modeled using a lognormal distribution with the person and item parameters

(4.6)  $T_{ij} \sim f(t_{ij}; \tau_j, \alpha_i, \beta_i)$,

and the function is written as

(4.7)  $f(t_{ij}; \tau_j, \alpha_i, \beta_i) = \dfrac{\alpha_i}{t_{ij}\sqrt{2\pi}} \exp\left\{ -\dfrac{1}{2} \left[ \alpha_i \left( \ln t_{ij} - (\beta_i - \tau_j) \right) \right]^2 \right\}$.

The person parameter $\tau_j$ is the speed of the test taker, $\alpha_i$ is the time discrimination parameter, and $\beta_i$ is the time intensity of the item. The relationship between the item time intensity and person speed is similar to that of the response model: person speed parameters greater than the item time intensity are advantageous. In reading the equation, the within-person variation in timing is taken into account by including the log transform of the response time for an item and comparing the item time intensity with person speed over the test. The equation is bounded above 0 because of the natural distribution of timing values. The stimulus for this model comes from the idea that time is given by the equation of the random variable

(4.8)  $T_{ij} = \exp\left( \beta_i - \tau_j + \alpha_i^{-1} z_{ij} \right)$.

The value of $z_{ij}$ is given a standard normal distribution, so that the log of the response time is normally distributed with mean $\beta_i - \tau_j$ and standard deviation $\alpha_i^{-1}$. The probability that a given response time $T_{ij}$ is less than $t_{ij}$ is

(4.9)  $P(T_{ij} \le t_{ij}) = \Phi\left( \alpha_i \left[ \ln t_{ij} - (\beta_i - \tau_j) \right] \right)$.

From this set of equations, substituting the definition of $T_{ij}$ into the probability equation and rearranging the terms yields the density function outlined in (4.7).
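A small R sketch of the response time density in (4.7) follows; the parameter values (tau = 0.2, alpha = 1.1, beta = 3.4) are illustrative and chosen only to be in the general range of Table 4.5, not taken from any fitted model.

# Lognormal response time density from (4.7), written out directly.
f_rt <- function(t, tau, alpha, beta) {
  (alpha / (t * sqrt(2 * pi))) * exp(-0.5 * (alpha * (log(t) - (beta - tau)))^2)
}

# The same density through R's built-in lognormal, using meanlog = beta - tau
# and sdlog = 1/alpha, which confirms the parameterization used above.
t_sec <- c(10, 30, 60)
f_rt(t_sec, tau = 0.2, alpha = 1.1, beta = 3.4)
dlnorm(t_sec, meanlog = 3.4 - 0.2, sdlog = 1 / 1.1)   # identical values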
The item discrimination parameter allows for variation in response times across items and
provides good fit according to studies (van der Linden, 2006; van der Linden, 2007b; van der Linden &
Guo, 2006). The use of the lognormal distribution is of convention in response time studies because it
typically captures the shape of the distribution of response times well (Fox, Klein Entink, & van der
Linden, 2010). The two level 1 models are now specified and the joint distribution model is set up next.
The vector of person parameters is shown as $\boldsymbol{\xi}_j = (\theta_j, \tau_j)$ and the vector of the item parameters is $\boldsymbol{\psi}_i = (a_i, b_i, \alpha_i, \beta_i)$. The sampling distribution of the conditionally independent pairs $(U_{ij}, T_{ij})$, for $i = 1, \dots, n$, is given as

(4.10)  $f(\mathbf{u}_j, \mathbf{t}_j \mid \boldsymbol{\xi}_j, \boldsymbol{\psi}) = \prod_{i=1}^{n} f(u_{ij}; \theta_j, a_i, b_i)\, f(t_{ij}; \tau_j, \alpha_i, \beta_i)$.

The distribution of scores and response times is the cumulative product of the joint distributions of responses and response times given the person and item parameters.
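A minimal R sketch of the conditional joint likelihood in (4.10) for a single person is given below, assuming the 2PNO response model and the lognormal response time model; every value in the example (responses, times, and parameters) is hypothetical.

# Conditional joint log-likelihood for one person over n items, per (4.10).
joint_loglik <- function(u, t, theta, tau, a, b, alpha, beta) {
  p    <- pnorm(a * (theta - b))                                   # P(correct) per item, eq. (4.5)
  ll_u <- sum(dbinom(u, size = 1, prob = p, log = TRUE))           # response contribution
  ll_t <- sum(dlnorm(t, meanlog = beta - tau, sdlog = 1 / alpha,   # response time contribution, eq. (4.7)
                     log = TRUE))
  ll_u + ll_t                                                      # product in (4.10) becomes a sum of logs
}

# Three hypothetical items: responses, times in seconds, and item parameters.
joint_loglik(u = c(1, 1, 0), t = c(12, 25, 70),
             theta = 0.3, tau = 0.1,
             a = c(0.8, 1.0, 1.2), b = c(-1.5, 0.0, 1.0),
             alpha = c(1.0, 1.1, 0.9), beta = c(2.4, 3.0, 3.6))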
Level 2 Models. The second level models provide a way to model the relationship among the population parameters, $(\theta_j, \tau_j)$, and among the item domain parameters, $(a_i, b_i, \alpha_i, \beta_i)$. The population model estimates
mean person abilities and covariances among the person parameters. Identically the item domain
provides estimates of the means and covariances of the item parameters that come from the domain of
items that could potentially be issued to the population.
The parameters for the level 2 models are defined next. It is important to note that the item
parameters and person parameters are separated. Each set of parameters come from the two level 1
models but at level 2 the models are describing the relationship between the person parameters and
the item parameters separately. The joint distribution of the level 2 person parameters is assumed to be
bivariate normal,

(4.11)  $(\theta_j, \tau_j) \sim MVN(\boldsymbol{\mu}_P, \boldsymbol{\Sigma}_P)$.

The mean vector for the population model is

(4.12)  $\boldsymbol{\mu}_P = (\mu_\theta, \mu_\tau)$,

and the covariance structure is formed as

(4.13)  $\boldsymbol{\Sigma}_P = \begin{pmatrix} \sigma^2_\theta & \sigma_{\theta\tau} \\ \sigma_{\theta\tau} & \sigma^2_\tau \end{pmatrix}$.

The item domain parameters are identified in the same way with the item parameters outlined in the level 1 model. These item domain parameters have a multivariate normal distribution,

(4.14)  $(a_i, b_i, \alpha_i, \beta_i) \sim MVN(\boldsymbol{\mu}_I, \boldsymbol{\Sigma}_I)$.

Similar to the model specification above, the vector of means for the item domain is given as

(4.15)  $\boldsymbol{\mu}_I = (\mu_a, \mu_b, \mu_\alpha, \mu_\beta)$,

and the covariance matrix for the item domain, covering both the response and response time item parameters, is given as

(4.16)  $\boldsymbol{\Sigma}_I = \begin{pmatrix} \sigma^2_a & \sigma_{ab} & \sigma_{a\alpha} & \sigma_{a\beta} \\ \sigma_{ab} & \sigma^2_b & \sigma_{b\alpha} & \sigma_{b\beta} \\ \sigma_{a\alpha} & \sigma_{b\alpha} & \sigma^2_\alpha & \sigma_{\alpha\beta} \\ \sigma_{a\beta} & \sigma_{b\beta} & \sigma_{\alpha\beta} & \sigma^2_\beta \end{pmatrix}$.
The CIRT model, as shown, is currently unidentified and model constraints are required to estimate the parameters. Normal constraints for the IRT response model require that the mean ability and the ability variance are fixed; because the location and scale of $\theta$ are not identified in the response model, the constraints $\mu_\theta = 0$ and $\sigma^2_\theta = 1$ are made. These adjustments are enough to provide estimates for the rest of the model parameters for response and response time. No constraint is necessary for the variance of the speed parameter because the scale of measurement is known for the timing.
The CIRT model outlined above provides a general model to test the utility of response time as
collateral information to the traditional IRT model. It has been shown that response time decreases the
standard error of measurement to give the appearance of more accurate scores (van der Linden, 2008).
The effect is lower standard deviations at the tails of the distributions because the response time model
adds extra information to inform the ability estimates.
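To illustrate the level 2 population model and the identification constraints, a short R sketch is given below; the covariance values loosely echo Table 4.7, the zero mean for speed is adopted purely for the sketch, and nothing here reproduces the actual estimation.

# Drawing person parameters (theta, tau) from the bivariate normal in (4.11)-(4.13),
# with the identification constraints mu_theta = 0 and var(theta) = 1 imposed.
library(MASS)   # for mvrnorm

Sigma_P <- matrix(c(1.00, 0.05,
                    0.05, 0.14), nrow = 2, byrow = TRUE)
persons <- mvrnorm(n = 2548, mu = c(0, 0), Sigma = Sigma_P)
colnames(persons) <- c("theta", "tau")
cor(persons)[1, 2]   # sample correlation between simulated ability and speed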
Hypotheses
The purpose of the current study is to compare the estimates of the two forms of the
measurement models for the Number Series task from the WJ-III. The traditional IRT model offers a way
to examine the responses, ignoring the response times. It is then proposed to use the CIRT model to see
if the response times add any information to the model and how speed relates to ability. The outline of
the hypotheses is done in four steps. The first is that the parameter estimates of item difficulty will be of
similar value between the two modeling methods. The second is that the estimates provided by the CIRT
method will be closer to the approximate ability level of the participants based on the use of response
time data. Third, there should be a positive relationship between item difficulty and item time intensity, and, on a related note, person ability and speededness should be positively related. These hypotheses are based on previous work showing these relationships to be true
(Fox, Klein Entink, & van der Linden, 2010). Lastly, the relationship of age with the model parameters is
explored with particular interest on ability and speededness.
Methods
Sample
The data comes from the American Life Panel (ALP; Kapteyn, 2002), a nationally representative
sample of adults 18 years old and above. Simple demographic statistics are provided in Table 4.1. The
sample had a mean age of 48.9 with a range of 18 to 109. The sample was 59.3% female and the
ethnicities are broken down as: White 84.2%; Hispanic 4.6%; Black 6.5%; Asian 1.3%; Other 3.4%. The
average years of education attained was 11.5 years. The average income was $74,320 and for analyses
the log of the income value was used to bring extreme incomes closer to the mean. Tests of the sample
compared to the national averages indicate that the ALP sample closely resembles the average
American on the demographic aspects outlined above (Couper, Kapteyn, Schonlau, & Winter, 2007).
Table 4.1. Descriptive Statistics of the American Life Panel, complete
sample size N = 2548
N Mean Median Std Dev Min Max
Age 2548 48.9 50 14.8 18 109
Education 2548 11.5 11 2.08 3 16
Income 2123 74230 55000 60549 2500 200000
Female 2548 59.3%
NCPos 2533 59.8 62.5 17.9 0 100
NCNeg 2536 30.6 31.3 17.6 0 100
Ethnicity
-White 84.2%
-Hispanic 4.6%
-Black 6.5%
-Asian 1.3%
-Other 3.4%
Collection of the data begins by sending an email invitation out to about 3,000 active
participants in 49 states and the District of Columbia. All of the participants have reported that they are
comfortable speaking and reading English. Since the survey is administered over the internet, if a
participant did not have a computer and internet access they were given both a computer and internet
access. If they already had both then the cash value of the computer was given. The rate of dropout
from the sample is low, with only about 3 people formally leaving the sample a month.
Number Series Test Procedure
The items used for the current analysis come from the Woodcock-Johnson III (WJ-III) set of tests
of ability (WJ-III; Woodcock, McGrew, & Mather, 2001). The Number Series task is one of the tasks that
are used to test math ability and problem solving skills. The task requires the participant to provide a
missing number from a sequence of numbers so that the sequence makes a logical pattern. For example:
if the sequence shown is 1, 2, 3, _, then the participant should provide the answer 4 as it makes the
sequence increase by 1 with each successive number. The items were given in order of relative difficulty
with the easiest given first and progressively more difficult items following. There were 15 items given per set, and two sets of items were given to each participant. The presentation of set A and set B was
counterbalanced across participants. If the correct answer was given for the blank the participant
received a 1; if the wrong answer was given, they received a 0. Between the two sets of Number Series
items the Need for Cognition scale (NCS) consisting of 8 items of the original 18 was presented
(Cacioppo, Petty, & Kao, 1984). The purpose of the NCS task is to provide a break from the NS task
without being too cumbersome for the participant to perform.
Responses
The items of the NS task are scored in a binary way, 0 or 1. A response of the correct answer will
yield a 1 and an incorrect answer receives a 0. The other alternative is if the item was not answered at
which point a missing value is assigned. A breakdown of the response rates to each of the items is given
in Table 4.2. As shown, the sample provided very few missing values across the item set. Items 13B and
14B had the most missing values with 2.2% and 2.5% missing. The total amount of missing data in the
sample was less than 1% (0.99%). There seems to be a relationship between the average time to
complete a given item and the number of missing values (r=0.69), providing evidence that items which
were too hard for some also took a lot of time. Given that a positive relationship between item difficulty
and the item’s time intensity usually exists, these items may have been too much for some test takers
and thus skipped.
Response Times
Each item was given a time limit of 120 seconds within which the participant could provide an answer and move on to the next item. The minimum possible response time was set at 5 seconds to prevent inaccuracies in scoring due to random guessing. This bounds the responses between 5 and 120 seconds, with varying means and levels of variation as shown in Table 4.2. The most notable feature of the response time data is that some items were answered relatively quickly for their difficulty compared to easier items. Checks were made for items answered incorrectly very quickly; items answered correctly very quickly appear to be legitimate correct answers, given that a specific integer must be entered to receive credit for the item.
N Incorrect Correct Missing % Correct Mean RT Std Dev RT Log(RT) Std Dev Log(RT)
Item 1A 2546 6 2534 8 99.76 10.97 8.37 2.24 0.50
Item 2A 2544 17 2523 8 99.33 9.08 6.82 2.06 0.50
Item 3A 2543 61 2478 9 97.60 13.93 10.18 2.47 0.53
Item 4A 2546 38 2503 7 98.50 12.59 10.77 2.35 0.55
Item 5A 2542 15 2524 9 99.41 8.67 8.34 1.98 0.54
Item 6A 2540 217 2318 13 91.44 26.68 21.33 3.05 0.65
Item 7A 2539 579 1948 21 77.09 40.80 27.95 3.48 0.70
Item 8A 2540 633 1882 33 74.83 41.29 29.68 3.48 0.71
Item 9A 2533 1111 1390 47 55.58 46.12 30.60 3.60 0.72
Item 10A 2527 1035 1462 51 58.55 30.49 23.59 3.18 0.69
Item 11A 2534 761 1754 33 69.74 36.31 20.56 3.05 0.64
Item 12A 2533 615 1899 34 75.54 23.83 18.84 2.95 0.65
Item 13A 2513 1305 1192 51 47.74 31.54 27.08 3.14 0.79
Item 14A 2523 1467 1036 45 41.39 51.16 33.42 3.69 0.76
Item 15A 2532 2343 185 20 7.32 76.61 38.56 4.14 0.72
Item 1B 2540 21 2521 6 99.17 11.11 9.76 2.24 0.51
Item 2B 2537 17 2525 6 99.33 9.15 7.54 2.06 0.50
Item 3B 2536 25 2517 6 99.02 10.51 9.01 2.19 0.52
Item 4B 2533 116 2421 11 95.43 14.40 11.50 2.50 0.54
Item 5B 2533 77 2461 10 96.97 14.22 11.21 2.49 0.54
Item 6B 2528 312 2217 19 87.66 25.38 21.43 2.98 0.69
Item 7B 2529 510 2017 21 79.82 26.32 18.29 3.09 0.59
Item 8B 2520 628 1891 29 75.07 23.42 18.91 2.93 0.65
Item 9B 2526 755 1766 27 70.05 32.41 23.76 3.26 0.65
Item 10B 2529 387 2136 25 84.66 25.92 19.76 3.05 0.62
Item 11B 2518 2496 11 41 0.44 36.02 26.22 3.35 0.68
Item 12B 2536 1009 1522 17 60.13 44.88 28.93 3.61 0.64
Item 13B 2527 1155 1336 57 53.63 57.97 34.15 3.83 0.75
Item 14B 2495 1257 1228 63 49.42 38.80 28.71 3.39 0.76
Item 15B 2518 1813 704 31 27.97 54.00 35.63 3.72 0.79
Table 4.2. Item Response and Response Time Statistics with N = 2548
There is a notable trend in the time to complete each item within its set: for both Set A and Set B, later items take more time to answer. One would assume that the more difficult items require more time, and this initial look reinforces that expectation. The data are then transformed to a log scale to reduce the effects of skewness and to put the data in the form required by the CIRT model.
Planned Analyses
The analyses are broken down into a few basic parts. The first section reports the findings of the
Rasch IRT model of the responses for comparison purposes. These models are fit in R to obtain Bayesian estimates comparable to those obtained in the CIRT analyses. The second part
then analyzes the joint response and response time model. This analysis uses the CIRT R package which
reads in response data, log transformed response times, and covariates as explanatory variables for
ability and speededness. The third part compares how the joint model and Rasch IRT model predict
person abilities. Model fit is compared by using Deviance Information Criteria (DIC) and Bayes Factor
(BF). A MCMC process is used with the Gibbs sampler to estimate model parameters and the next
section provides a review of this process.
Prior Parameter Distributions
The hyperprior distributions for the response and response time models are based on
multivariate normal theory (Fox, Klein Entink, & van der Linden, 2010; Klein Entink, Fox, & van der
Linden, 2009; van der Linden, 2006; 2007). The set of hyperpriors defines the distribution of the
parameters of the model in the MCMC framework. The person and item hyperpriors are defined as
shown in Klein Entink (2008) with multivariate normal distributions for the population parameters, fixed
item parameters to define the response model, and conditional parameters for the response time model
to bound response times to be greater than 0.
Results
IRT Response Model
The results based on the simple IRT model were computed with the response time model values
constrained. This method ensures that the model is nested within the full model from the CIRT analyses. Essentially the response model is allowed to have its parameters estimated freely and the
response time model is not free to be estimated. The probability of correct response is then only
dependent on the response model, with the response time model adding no information to prediction
of success (greater probability of providing a correct response on a given item).
Table 4.3. Comparative Model Fits for CIRT Models (N=2548)
D(hat) D(bar) pd DIC ΔDIC
Model A 423593 425897 2304 428201 --
Model B 422818 425047 2229 427276 925
Model 1 152586 157312 4726 162038 266163
Model 2 151797 156457 4660 161117 921
Model 3 152104 156870 4767 161637 401
Model 4 151318 156006 4688 160694 1344
Notes: Model A was a one parameter IRT model with other parameter
values constrained. Model B advances model A by freeing the
discrimination parameter. Model 1 is a one parameter
response model and one parameter response time model. Model 2 is a
two parameter response model and one parameter response time
model. Model 3 was a one parameter response model and two
parameter response time model. Model 4 was a two parameter response
and response time model. ΔDIC for Model B is the DIC for Model A - DIC
for Model B. The ΔDIC for Model 1 is DIC from Model A - DIC for Model 1.
The ΔDIC for Models 2-4 are subtracted from Model 1.
A sequence of models was tested to determine whether a 1 parameter or 2 parameter model
captured the data better. These models are listed as Model A and Model B in Table 4.3. The table
indicates the D(hat), D(bar), pd, and DIC for a given model. The value of D(hat) is the point estimate of
the deviance over the MCMC values and D(bar) is the posterior mean of the deviance. The value of pd is
the effective number of parameters which is defined as the difference between D(bar) and D(hat). The
Table 4.4. Item Response Models with Response Time Parameters Constrained
Model A (DIC = 428201) Model B (DIC = 427276)
Difficulty (b) Discrimination (a) Difficulty (b) Discrimination (a)
Item 1A -3.60 (0.16) = 1 -2.99 (0.17) 0.33 (0.13)
Item 2A -3.20 (0.11) = 1 -2.85 (0.16) 0.57 (0.12)
Item 3A -2.54 (0.06) = 1 -2.24 (0.08) 0.53 (0.07)
Item 4A -2.81 (0.08) = 1 -2.60 (0.13) 0.66 (0.09)
Item 5A -3.25 (0.11) = 1 -2.92 (0.17) 0.58 (0.11)
Item 6A -1.74 (0.04) = 1 -1.50 (0.04) 0.46 (0.05)
Item 7A -0.96 (0.03) = 1 -0.86 (0.03) 0.58 (0.04)
Item 8A -0.88 (0.03) = 1 -0.83 (0.03) 0.75 (0.04)
Item 9A -0.19 (0.03) = 1 -0.16 (0.03) 0.55 (0.03)
Item 10A -0.29 (0.03) = 1 -0.29 (0.03) 0.86 (0.04)
Item 11A -0.69 (0.03) = 1 -0.83 (0.04) 1.27 (0.06)
Item 12A -0.92 (0.03) = 1 -0.98 (0.04) 1.02 (0.05)
Item 13A 0.07 (0.03) = 1 0.08 (0.03) 0.81 (0.04)
Item 14A 0.29 (0.03) = 1 0.33 (0.03) 1.08 (0.05)
Item 15A 1.89 (0.04) = 1 1.71 (0.06) 0.63 (0.06)
Item 1B -3.05 (0.09) = 1 -2.53 (0.10) 0.33 (0.09)
Item 2B -3.17 (0.11) = 1 -2.73 (0.14) 0.45 (0.11)
Item 3B -3.03 (0.09) = 1 -2.76 (0.15) 0.62 (0.10)
Item 4B -2.17 (0.05) = 1 -1.97 (0.06) 0.60 (0.06)
Item 5B -2.43 (0.06) = 1 -2.31 (0.10) 0.72 (0.08)
Item 6B -1.50 (0.04) = 1 -1.39 (0.04) 0.67 (0.05)
Item 7B -1.11 (0.03) = 1 -1.25 (0.05) 1.14 (0.06)
Item 8B -0.90 (0.03) = 1 -1.07 (0.05) 1.26 (0.07)
Item 9B -0.69 (0.03) = 1 -0.76 (0.04) 1.11 (0.06)
Item 10B -1.34 (0.04) = 1 -1.44 (0.05) 1.03 (0.06)
Item 11B 3.26 (0.11) = 1 2.64 (0.11) -0.11 (0.11)
Item 12B -0.33 (0.03) = 1 -0.28 (0.03) 0.42 (0.03)
Item 13B -0.12 (0.03) = 1 -0.10 (0.03) 0.59 (0.03)
Item 14B 0.02 (0.03) = 1 0.03 (0.03) 1.01 (0.05)
Item 15B 0.77 (0.03) = 1 0.83 (0.04) 1.01 (0.05)
Notes: EAP estimates are provided with standard errors listed in parentheses.
Deviance Information Criterion is then defined as the addition of D(bar) and pd. In this formulation DIC
is how much the data deviates from the model, penalized by the number of parameters estimated. The
effect of the parameter depends on the priors used, where non-informative priors are counted more
than informative ones. The model with the smallest DIC is chosen, as it is the model expected to best reproduce a dataset like the one currently observed.
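The DIC arithmetic just described can be written out in a few lines of R; the function below is a sketch, and the example simply plugs in the summaries reported for Model A rather than an actual deviance chain.

# DIC from a chain of deviance values and the deviance at the posterior means.
dic_summary <- function(deviance_chain, deviance_at_post_means) {
  d_bar <- mean(deviance_chain)       # posterior mean of the deviance, D(bar)
  d_hat <- deviance_at_post_means     # deviance at the posterior means, D(hat)
  p_d   <- d_bar - d_hat              # effective number of parameters
  c(DIC = d_bar + p_d, pD = p_d)
}

# Model A summaries from Table 4.3: D(hat) = 423593, D(bar) = 425897.
dic_summary(deviance_chain = 425897, deviance_at_post_means = 423593)
# DIC = 428201, pD = 2304, matching the table.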
The model fit estimates are provided in Table 4.3 for the full set of Bayesian models run. The first two rows show the fit values for the response-only models (A & B), with response time models constrained. From these fits it was concluded that the best fitting model is Model B. As a general rule of thumb, DIC changes greater than 10 indicate the model with the higher DIC should be excluded, changes between 5 and 10 favor the lower DIC model, and changes less than 5 mean both models should be
examined. In our case the models are shown to be progressively better fitting as constraints were
relaxed.
The Expected A Posteriori (EAP) estimates for the response-only models are given in Table 4.4. Model A item parameters are listed in the first two columns, showing that the item discriminations were held constant over the items for this model. When the item discriminations are freely estimated in Model B, they vary across items, and the standard deviations of the estimates increase slightly as well. Overall the effect of comparing these
two models shows that a two parameter response model is needed to capture the differences in
discrimination amongst the items. These results were then compared to the model fit values in the next
section to describe how modeling response time jointly increased model fit.
Conditional IRT Model
The CIRT model was run in 4 steps to test the effects of adding item parameters to the joint
model. The first model is a 1 parameter response model and 1 parameter response time model, only
including item difficulty and speed intensity. The second model added a discrimination parameter for
the response model and the third model added a discrimination parameter to the response time model.
The fourth model had both discrimination parameters modeled in conjunction with the difficulty
parameters. The model fits are shown in Table 4.3 with the DIC providing model fit comparisons. The
DIC estimates indicate that model 4 fit the best in this sequence of models. Model 1 was a one
parameter response model and one parameter response time model; model 2 was a two parameter
response model and one parameter response time model; model 3 a one parameter response model
and two parameter response time model; and model 4 was a two parameter response model and
response time model.
The tests of model convergence were done to ensure that the number of iterations was
sufficient to provide stable results. As noted previously, two such tests were the Geweke test and the
Heidelberger – Welch tests. These provide evidence for whether the distributions sampled by the MCMC process can be trusted. The Geweke test compared the first 10% of the chain to the last 50% for equal
means on each parameter. No significant values were found in the parameters estimated. The
Heidelberger – Welch test has two parts. The first part tests the chain for stationarity, removing 10% of the sample each time it rejects the null hypothesis of stationarity, until
the null hypothesis has failed to be rejected or 50% of the sample has been removed. Our example
passed both tests and 5,000 iterations were recommended to be removed as burn-in of the 50,000 total
computed.
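Both diagnostics are available in the coda package in R; the sketch below runs them on a simulated chain only to show the calls, not on the chains from the reported models.

# Convergence checks with the coda package on a placeholder chain.
library(coda)

chain <- mcmc(rnorm(50000))   # stand-in for a sampled parameter chain
geweke.diag(chain)            # z-score comparing the first 10% to the last 50% of the chain
heidel.diag(chain)            # stationarity and half-width tests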
Based on the above results the model chosen to best represent the Number Series task is Model
4. From this point there is an attempt to describe the model results in terms of the framework outlined
above. The full set of item parameters is given in Table 4.5 to show the estimated values of the response
and response time models for Model 4. This table shows that the item parameters follow the same
pattern as the constrained models presented in Model A and B. The standard deviation of the estimate
is greater for items at the ends of the scale, where less information is available for estimation of the
Table 4.5. Model 4 Response and Response Time Item Statistics With Parameter
Standard Deviations
b a β α
Item 1A -3.01 (0.17) 0.34 (0.13) 2.24 (0.01) 0.82 (0.02)
Item 2A -2.88 (0.16) 0.59 (0.12) 2.05 (0.01) 0.85 (0.02)
Item 3A -2.24 (0.08) 0.53 (0.07) 2.47 (0.01) 0.95 (0.02)
Item 4A -2.60 (0.12) 0.66 (0.09) 2.35 (0.01) 0.96 (0.02)
Item 5A -2.90 (0.16) 0.56 (0.12) 1.97 (0.01) 0.99 (0.02)
Item 6A -1.50 (0.04) 0.46 (0.05) 3.04 (0.01) 1.07 (0.03)
Item 7A -0.86 (0.03) 0.58 (0.04) 3.47 (0.01) 0.93 (0.03)
Item 8A -0.84 (0.03) 0.75 (0.04) 3.47 (0.01) 1.05 (0.03)
Item 9A -0.16 (0.03) 0.55 (0.03) 3.59 (0.01) 0.93 (0.03)
Item 10A -0.29 (0.03) 0.86 (0.04) 3.17 (0.01) 0.99 (0.03)
Item 11A -0.83 (0.04) 1.27 (0.06) 3.05 (0.01) 1.06 (0.03)
Item 12A -0.99 (0.04) 1.03 (0.06) 2.94 (0.01) 1.09 (0.03)
Item 13A 0.08 (0.03) 0.81 (0.04) 3.12 (0.01) 1.18 (0.04)
Item 14A 0.33 (0.03) 1.08 (0.05) 3.68 (0.01) 0.99 (0.04)
Item 15A 1.71 (0.06) 0.63 (0.06) 4.14 (0.01) 0.69 (0.04)
Item 1B -2.53 (0.11) 0.33 (0.09) 2.24 (0.01) 0.87 (0.02)
Item 2B -2.73 (0.14) 0.46 (0.11) 2.06 (0.01) 0.92 (0.02)
Item 3B -2.76 (0.14) 0.63 (0.10) 2.18 (0.01) 0.96 (0.02)
Item 4B -1.97 (0.07) 0.61 (0.06) 2.49 (0.01) 0.99 (0.02)
Item 5B -2.30 (0.09) 0.72 (0.08) 2.48 (0.01) 0.99 (0.02)
Item 6B -1.39 (0.04) 0.68 (0.05) 2.97 (0.01) 1.20 (0.03)
Item 7B -1.25 (0.05) 1.14 (0.06) 3.08 (0.01) 1.07 (0.02)
Item 8B -1.07 (0.05) 1.27 (0.07) 2.91 (0.01) 1.20 (0.03)
Item 9B -0.76 (0.04) 1.11 (0.06) 3.25 (0.01) 1.17 (0.03)
Item 10B -1.44 (0.05) 1.03 (0.06) 3.04 (0.01) 1.08 (0.03)
Item 11B 2.64 (0.11) -0.12 (0.11) 3.34 (0.01) 1.13 (0.03)
Item 12B -0.28 (0.03) 0.42 (0.03) 3.60 (0.01) 0.98 (0.03)
Item 13B -0.11 (0.03) 0.60 (0.04) 3.83 (0.01) 0.99 (0.04)
Item 14B 0.03 (0.03) 1.01 (0.05) 3.38 (0.01) 1.06 (0.04)
Item 15B 0.83 (0.04) 1.01 (0.05) 3.71 (0.01) 1.08 (0.04)
Note: EAP estimates for Model 4 are provided with parameter standard deviations in
parentheses. DIC = 160694, pd = 4688.
response model. The values for the response time model are estimated much more precisely, but the discrimination estimates (α) do show more variation for items that have higher time intensities (β).
The observed probabilities of the response times were tested against the expected probabilities implied by the distribution used in the above analysis. This provided evidence of how well the lognormal distribution models the response times. Figure 4.3 shows the results of these plots, with the line through the origin indicating no bias in the sampling distribution. The panels for each item show that response time is captured well by the lognormal distribution. Since the plots follow the line through the origin, there is no bias in the distribution modeled (Casella & Berger, 2002).

Figure 4.3. Probability plots of the response time models for selected Number Series items. When the plot lies on the identity line there is a good match between the sample and the prior distribution used for the response time function (log-normal distribution).
The structure of the covariances amongst the item parameters gives a picture of the relationship between the two models in terms of the items. This bridge shows how item difficulty and discrimination relate to the response time qualities the items also exhibited. The covariance structure is outlined in Table 4.6. Item difficulty and item time intensity are strongly related, while the other covariance values are lower. The correlation between the item difficulties and time intensities was r = .87 (p < 0.01); the more difficult items did require more time. The only relationship that seemed to be lacking was that of item difficulty and discrimination, which did not show the positive relationship seen among the other item parameters.
Table 4.6. Covariance Structure of Item Difficulty and Discrimination Parameters

A. Covariance Matrix
      a      b      α      β
a   0.45   0.06   0.05   0.08
b   0.06   2.34   0.07   0.77
α   0.05   0.07   0.36   0.05
β   0.08   0.77   0.05   0.70

B. Correlation Matrix
      a      b      α      β
a   1.00   0.05   0.19   0.22
b   0.05   1.00   0.76   0.87
α   0.19   0.76   1.00   0.84
β   0.22   0.87   0.84   1.00
Figure 4.4. A scatterplot of the estimated ability levels of Model 4 versus the estimated speededness
parameter for participants. The plot shows a small trend and this is reflected by a correlation of r=0.16
(p< 0.01).
The person parameters from the chosen model are now examined with a focus on the relationship between ability and speed. The covariance structure of the person parameters is shown in Table 4.7. The covariance of the speed and ability parameters indicates that there is some relationship that can be explored further. This relationship is shown in Figure 4.4, where speed (y-axis) is plotted as a function of ability (x-axis). There is a good amount of spread on both scales, with a slight upward trend. The correlation between these two estimates for the ALP sample was r = 0.16 (p < 0.01). The value of the person ability variance is fixed as explained above, while the other two estimates are free to vary.
Table 4.7. Covariance Structure of Person Ability and Speed Parameters
          EAP
σ²_θ      1.00
σ_θτ      0.05
σ²_τ      0.14
Comparison of IRT and CIRT
Since the addition of the response time should decrease the error of measurement at the tails of
the ability distributions a comparison of the standard errors for the two parameter IRT model and the
full CIRT model (two response and two response time parameters) was done. The graphical comparison
is shown in Figure 4.5, where the distributions show the standard errors over the ability scale. This graph shows how the estimated ability standard deviations compare between the two models; Model B and Model 4 provide the estimates, effectively comparing how accurate the estimates are over the scale of ability. Plotting them simultaneously tests whether the CIRT and IRT scores lie in the
same space or if one shows more accuracy over the other. The figure shows that the distributions fall in
the same space, indicating that the CIRT model does not increase accuracy in prediction.
To look at the effects of age on the estimates, one would like to know how older participants do compared to younger ones. The goal is to determine whether there is any effect of speed with age and whether this affects age's relationship with ability. The correlation of age with speed is r = -0.41, which is expected given extensive research showing speed advantages for younger people. The correlation of ability with age is much smaller, r = -0.12, but still present. To examine the age-ability relationship with speed taken into account, the partial correlation was calculated as r = -0.06. Once speed has been taken into account, the relationship between age and ability is halved. This provides evidence that some of the decline in ability is due to slowing with advancing age, with a smaller portion of the decline due to other factors.
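The halving of the age-ability relationship can be checked directly from the reported correlations with the usual partial correlation formula; the R lines below only re-use the values already given in the text.

# Partial correlation of age and ability, controlling for speed.
r_age_ability   <- -0.12
r_age_speed     <- -0.41
r_speed_ability <-  0.16

partial_r <- (r_age_ability - r_age_speed * r_speed_ability) /
  sqrt((1 - r_age_speed^2) * (1 - r_speed_ability^2))
round(partial_r, 2)   # -0.06, as reported above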
Figure 4.5. A scatter plot of estimated standard deviation of abilities for IRT (crosses) and CIRT (dots)
over the ability range. The relationship between ability and response speed determines any attenuation
in error of measurement. The low correlation between the person parameters indicates that only
minimal gains in ability measurement can be made. The weak relationship is shown by the distributions
overlapping and the CIRT estimates not sitting just below the IRT estimates.
Conclusions
The current chapter attempted to fit a Conditionally Independent Response Time model and
compare its results with that of a traditional Item Response Theory model. Using the framework laid out
by van der Linden (2007a), the CIRT model was shown to provide a better fit for the data over an IRT
model. One and two parameter response and response time models were tested and a two parameter
response model, as well as a two parameter response time model, were chosen as the best fitting
model. Analyses of the model fits indicate that for items with little variation in responses (where most participants answered correctly or most answered incorrectly) the model did not obtain as precise an estimate of the item parameters. However, the response and response time model showed good fit when checks of the observed versus expected probabilities were explored for biases due to distributional assumptions. Most of the checks for model fit prescribed by Klein Entink (2009) show good results when the ALP Number Series data are used.
The comparison of the IRT and CIRT models showed that modeling of ability and speed was
possible to do simultaneously with Bayesian MCMC techniques. The resulting estimates of person and
item parameters were found to be stationary and conformed well with the distributions specified.
Further examination of ability and speed in the context of the person parameters found that speed
accounted for some of the relationship between age and ability. When the partial effect of speed is
taken out, the relationship between age and ability is attenuated, leaving behind a small effect of aging.
In effect, speed does account for some of the losses seen over age in terms of ability, but there seems to
be another portion of ability unaffected by speed.
With the introduction of speed of response in the measurement model the discussion of the
speed-accuracy tradeoff becomes relevant. The concept of a tradeoff between speed and accuracy, developed through research by Pachella (1974), Pew (1969), Wickelgren (1977), and Wood & Jennings (1976), indicates that in order to improve speed, accuracy must be sacrificed, or vice versa. The traditional sense
is that older adults tend to have slower response patterns when compared to their younger
counterparts in testing situations (Botwinick, 1973). To explain this phenomenon researchers have
posited that slower responses by older adults are observed because there is greater emphasis on
accuracy (Birren, 1964; Botwinick, 1973; Rabbitt, 1968). The hierarchical model presented above
separates the person speededness from the ability level, but no manipulation of pace or answer
checking is done to make such conclusions about the utility of the tradeoff between speed and accuracy.
The results do provide good insight into the age differences of performance when accuracy is
compared to speediness of respondents. The effect of speed has been shown to affect accuracy over
age, with older adults showing slower speeds over younger adults (Salthouse, 1979). Our results seem to
corroborate these findings in that older adults take more time than younger adults, but this difference
does not account for all of the age differences in ability levels. A person asked to work on a set of problems picks a pace that is comfortable for them. The assumption of IRT (and consequently CIRT) is that the speed-accuracy tradeoff is constant for a testing period. Given this assumption, the default speed-accuracy rate a person chooses is maintained throughout testing. The contention then becomes that the natural behavior of a person is observed, and that the individual speed-accuracy tradeoff observed is one of personal preference. One can then examine how these preferences for speed versus accuracy are related to age, as was done above, noting that some (not all) age-related variation in performance is due to speediness.
The purpose of such a study is to determine the usefulness of a joint response and response
time model in measuring cognitive abilities where response time is easily gathered. As a useful and
convenient sample, the ALP Number Series task was used to test this CIRT model for fit and comparison.
The relationship between speed and ability was shown to be on the lower end and would then provide
limited enhancement of ability estimates in the tails of the sample distribution based on results shown
in simulation studies (Fox, Klein Entink, van der Linden, 2010). The analysis and results provide an
outline for implementing and assessing cognitive tasks by incorporating response time. Using this model
is possible in a number of cognitive scale based tests which are used in the WJ-III, such as Picture
Vocabulary and Verbal Analogies.
The argument is such that this model may provide enhancements over the traditional IRT model
in ways outlined above. The most notable feature is that of reduced standard errors (especially at the
ends of the scale). This increase in accuracy can provide a more accurate score estimate when fewer
items are administered, such as in a short form or adaptive scenario. The purpose of this chapter was to
provide the background and tools to implement the CIRT model using a cognitive measure implemented
over the computer. In the case of the Number Series test, this model seems to provide minimal
improvements based on the relationship of the person parameters.
References to Chapter 4
Albert, J.H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs
sampling. Journal of Educational Statistics, 17, 251-269.
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove: Duxbury.
Couper, M., Kapteyn, A., Schonlau, M., & Winter, J. (2007). Noncoverage and nonresponse in an
internet survey. Social Science Research, 36, 131-148.
Fox, J.P., Klein Entink, R.H., & van der Linden, W.J. (2010). Modeling of responses and response
times with the package CIRT. Journal of Statistical Software, 20, 1-14.
Gelman, A., Carlin, J.B., Stern, H.S., & Rubin, D.B. (2004). Bayesian data analysis. Chapman and
Hall / CRC, London.
Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of
posterior moments. In Bayesian Statistics 4 (J.M. Bernardo, J.O. Berger, A.P. Dawid, & A.F.M. Smith,
eds.) 169-193. Oxford Univ. Press.
Heidelberger, P. & Welch, P.D. (1983). Simulation run length control in the presence of an initial
transient. Operations Research, 31, 1109-1144.
Kapteyn, A. (2002). Internet Interviewing and the HRS (grant number 1R01AG020717-01). Rand
Corporation: Santa Monica, CA.
Klein Entink, R.H. (2009). Statistical models for responses and response times. Thesis.
Netherlands: University of Twente.
Klein Entink, R.H., Fox, J.P., & van der Linden, W.J. (2009). A multivariate multilevel approach to
the modeling of accuracy and speed of test takers. Psychometrika, 74, 21-48.
Novick, M. R., & Jackson, P. H. (1974). Statistical methods for educational and psychological
research. New York: McGraw-Hill.
Rasch, G. (1960) Probabilistic Models for Some Intelligence and Attainment Tests. Denmark
Paedagogiske Institute, Copenhagen.
van der Linden, W.J. (2006). A lognormal model for response times on test items. Journal of
Educational and Behavioral Statistics, 31, 181-204.
van der Linden, W.J. (2007a). A hierarchical framework for modeling speed and accuracy on test
items. Psychometrika, 72, 287-308.
van der Linden, W.J. (2007b). Using response times for item selection in adaptive tests. Journal
of Educational and Behavioral Statistics, 32.
van der Linden, W. J. (2008). Using response times for item selection in adaptive testing. Journal
of Educational and Behavioral Statistics, 33, 5-20.
van der Linden, W.J., & Guo, F. (2006). Two Bayesian Procedures for Identifying Aberrant
Response-Time Patterns in Adaptive Testing. Manuscript submitted for publication.
van der Linden, W.J., Klein Entink, R.H., & Fox, J.P. (2010). IRT parameter estimation with
response times as collateral information. Applied Psychological Measurement, 34, 327-347.
Woodcock, R.W., McGrew, K.S., & Mather, N. (2001). Woodcock-Johnson III. Rolling Meadows,
IL: Riverside Publishing.
Chapter 5: Reduced and Adaptive Testing Methods
Abstract
The intention of this chapter is to provide motivation for implementing adaptive testing in
cognitive testing done over the computer. Traditional cognitive tasks have been the hallmark of
cognitive ability measurement and the introduction of the Woodcock-Johnson Number Series task as a
computer task leads us to the possibility of using adaptive methods not easily done with paper and
pencil testing situations. Various forms of reduction techniques were used, reducing the item pool in a
structured way, to calculate the accuracy lost when fewer items are used to predict the complete item set
score. Then the use of adaptive item sets was considered, in which item sets were not fixed over the
sample, to estimate the complete item set score. The process of form reduction and adaptive testing
presents results which outline the process for adapting cognitive tasks and obtaining ability estimates
that mirror traditional ability estimates (i.e., Item Response Theory). The American Life Panel sample
performance on the Number Series task is used as a prototypical analysis of computer aided data
collection methods and scoring procedures.
Introduction
In testing a person’s ability in a particular domain, there is a desire to be accurate in the number
associated with their performance. Traditionally the number in general use has been a sum scale score, in which one can divide the observed score into the true score and the error associated with the measurement. This method neglects the inherent differences in item difficulty, which indicate the likelihood that a person with a given ability can complete the item. While it is not an ideal measure of
ability, it is one way to differentiate persons on a task.
The movement of item response theory (IRT) involves looking at the capability of a set of items
to cover the range of the scale and maintain the unidimensionality it was designed to measure. In terms of dimensionality, a given scale should focus on one construct (Rasch, 1960). Items that do not fit well
with the dimension under investigation should be discarded as they do not provide good information
about the ability of a person. The range of the scale covered by the items is another important factor for
creating a good IRT scale. Items should provide good coverage, so there are not segments or ends of the
scale that are not indicated by the item pool. Scales that neglect this principle do not provide accurate
estimates of ability in the endpoints of possible scores. This would be indicated by larger standard errors
for scores in the aforementioned regions.
Classical Test Adapting Methods
Some methods to obtain scale scores with fewer items are explored and defined. One potential
method is to take a random sample of items from the total potential pool of items. From this random
sample, a baseline is created against which alternative scale performance can be compared. The hope is to improve accuracy in scoring over any randomly selected collection of items by selecting a particular set of items. To compete with the random selection, items should be chosen which best reflect the construct. The first method is to
select items with the highest correlation with the total score. The second method is to choose items that
load highest in a factor analysis of the item pool (Birnbaum, 1968, sect 20.5). These two proposed
methods use the items that best indicate the construct based on sample performance and score
individuals based on the reduced item set. The tests in these reduced forms, or short forms, would still
be given as a linear test where everyone completes each item.
A comparison of how well the subset scores approximate scores created with all potential items is done by calculating the shared variation (R²). If a task that uses fewer items can explain most if not all of the variation in the full item set score, it can be said that the reduced task is a good approximation of person ability. These methods of reducing tests from a larger pool do so by testing everyone with the same subset of items; the reduced tasks are a complete unit with which all participants are tested and scored. Short form tests are contrasted with adaptive testing, in which each person could receive different subsets of items depending on their response pattern as they progress through the test. These differences will be explained in more detail in the coming sections.
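A minimal R sketch of the two short-form selection rules and the R² comparison is given below; the matrix name items, the choice of k, and the use of a single-factor factanal on the raw binary responses are all simplifying assumptions for illustration, not the procedure used later in the chapter.

# Short-form selection from a complete binary item matrix (rows = persons, columns = items).
short_form_r2 <- function(items, k = 10) {
  full_score <- rowSums(items)                      # score using every item

  # Rule 1: retain the k items with the highest item-total correlations.
  item_total <- apply(items, 2, cor, y = full_score)
  pick_cor   <- order(item_total, decreasing = TRUE)[1:k]

  # Rule 2: retain the k items loading highest on a single factor.
  loadings <- abs(factanal(items, factors = 1)$loadings[, 1])
  pick_fa  <- order(loadings, decreasing = TRUE)[1:k]

  # Shared variation (R^2) between each reduced sum score and the full-item score.
  c(R2_item_total = cor(rowSums(items[, pick_cor]), full_score)^2,
    R2_factor     = cor(rowSums(items[, pick_fa]),  full_score)^2)
}

Either rule yields a fixed short form administered to everyone, which is the property that distinguishes these reduced tests from the adaptive procedures discussed next.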
IRT Testing
In order to properly define computer adaptive testing (CAT), the structure of IRT and the role it plays in CAT must first be defined. IRT is a method of scoring individuals on a dimension with a standardized scale. The scale is said to be sample free, and performance on each item is independent of the other items (Rasch, 1960). Computation of person proficiency (ability) scores is traditionally based on maximum likelihood estimation (MLE) or Bayes modal estimation (Wainer & Mislevy, 2000). In recent years the introduction of Markov Chain Monte Carlo (MCMC) methods has provided comparable estimates of the posterior distribution of IRT model parameters (Albert, 1992; Bradlow, Wainer, & Wang, 1999). The two approaches differ in their objectives (a point estimate of a parameter versus the estimated distribution of the parameter), but they arrive at similar estimated values for persons and items. The advantage of IRT estimation of ability is the implied differential treatment of items based on difficulty. In this framework, each item has its own properties indicating how difficult it is and what ability levels should be able to complete it. This approach contrasts with classical sum scoring, which simply adds up correct responses and neglects the fact that certain items are more likely to be answered correctly than others.
Because items have different properties, a distinction is made between calibration and estimation (Wainer, 2000). The first step is calibration, in which a sample of responses is collected for the item pool to calibrate the instrument. This is meant to be a preliminary task, prior to the estimation of ability scores. The idea of calibration works hand in hand with the assumption that item responses are independent of one another: the item difficulties estimated with one sample should be fully applicable to another sample of respondents given the same items. This step creates a set of item difficulties which are then used to estimate person abilities. Estimating person abilities with fixed anchors (fixed values for the item difficulties based on calibration) provides comparable scores across samples. The two steps of IRT person ability estimation allow a pool of potential items to be organized by difficulty and selected for best use. The potential of a calibrated item pool to estimate ability leads to the idea of CAT.
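As a concrete illustration of the estimation step, the sketch below (a minimal Python example under the one-parameter Rasch model, with item difficulties treated as fixed anchors from a prior calibration) computes a maximum likelihood ability estimate and its standard error from a single response pattern. The function names and the Newton-Raphson details are this sketch's own; they are not taken from the dissertation's programs.

```python
import numpy as np

def rasch_p(theta, b):
    """P(correct) under the 1PL (Rasch) model with fixed difficulties b."""
    return 1.0 / (1.0 + np.exp(-(theta - np.asarray(b))))

def estimate_ability(responses, b, n_iter=25):
    """Newton-Raphson MLE of ability with calibrated (fixed-anchor) item
    difficulties b. Assumes a mixed response pattern (at least one correct
    and one incorrect answer); otherwise the MLE does not exist."""
    responses = np.asarray(responses, dtype=float)
    theta = 0.0
    for _ in range(n_iter):
        p = rasch_p(theta, b)
        score = np.sum(responses - p)      # first derivative of the log-likelihood
        info = np.sum(p * (1.0 - p))       # Fisher information
        theta += score / info
    se = 1.0 / np.sqrt(np.sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b))))
    return theta, se

# Hypothetical use: six calibrated items and one mixed response pattern
theta_hat, se_hat = estimate_ability([1, 1, 0, 1, 0, 0],
                                     b=[-1.5, -0.5, 0.0, 0.5, 1.0, 2.0])
```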
Computer Adaptive Testing
The first notion of adapting a test to the individual abilities of the persons being tested was developed by Lord (1970). The basic objective of this line of work was to be able to administer a test to large samples while still making the test suitable for individual differences in ability. It was to take on the role of the test administrator, who would note when items were too difficult and give easier items instead. The name, computer adaptive testing, is itself based on the idea that computers are needed to make the ideal item choice. Using computers allows for immediate feedback to responses and for the use of response times in item selection (next chapter). These guidelines can be implemented with precise rules so that each person is given an equal chance to provide responses that accurately reflect their ability. While not technically necessary for the task, computers offer an efficient way of performing adaptive tests for researchers and others interested in rank ordering ability scores.
The advantages over traditional tests were noted almost immediately: more accurate and immediate scoring, and the possibility of larger item pools (Green, 1983). The work involved in administering and scoring traditional paper and pencil tests can be bypassed by implementing the tests on computers. Because the item sets given to different individuals need not be nested, traditional scoring methods such as the total number of right responses become problematic. For this purpose the theoretical framework of IRT provides a good set of assumptions on which to build individualized tests. Classical scoring assumes that a score results from administering all items to each person, but under IRT this need not be the case, as explained below.
In adaptive testing the main focus of the technique is to provide a good estimate of person ability while maintaining the accuracy one would have if the full item set were used. As noted previously, the items administered in an adaptive test can change from person to person. One could randomize the starting point along the scale for the first item given to participants; alternatively, the middle of the scale can be chosen as a general starting point. From this first item, depending on whether the item is answered right or wrong, an easier or harder item is chosen next. This testing system inherently produces different item sets for different response patterns, which raises the question of whether one can compare scores for persons who were given different items. One study notes that researchers can treat the pattern of missingness as ignorable (Mislevy & Wu, 1996): since items are missing only because they were not presented, there are no non-ignorable reasons for missing responses. Mislevy and Chang (2000) make the case that maximum likelihood estimates will yield the correct value depending on how local independence is defined in the IRT model; when the likelihood is adapted appropriately for adaptive testing, the response pattern probabilities yield correct values.
There are formal steps for implementing CAT, which are outlined here in terms of how a person arrives at each subsequent item in a given sequence. A number of basic steps have been proposed for creating CAT scores; the steps outlined in Wainer (2000) and Sands, Waters, and McBride (1997) provide a good reference point. The first objective is to choose a starting point, that is, the difficulty level along the scale at which respondents should be started. On this point, the most informative item should be given first, which means an item close to the mean ability of the sample in question should be chosen. If test security is an issue, the starting point should be randomized around this level so that the same item is not presented first to every participant. How to start an adaptive test is still under debate, since the ability of a person is unknown when they start a test (van der Linden & Pashley, 2010). Information about a person that may be available in advance, such as simple demographic variables, may provide a way to guess what item difficulty the CAT should begin with.
Figure 5.1. A pictorial representation of the half adaptive methodology. The starting point is predefined as the middle of the scale and depending on whether a response is correct or not, the respondent is moved either halfway up or halfway down the scale. This halving of the scale continues until the ability level is narrowed in on to an acceptable level (stopping rules as defined by the researcher).
[Figure 5.1 diagram: Item 1 is item 15 (scale midpoint); Item 2 is item 7 or 23; Item 3 is item 3, 12, 19, or 27, depending on the preceding responses.]
The next aspect, equally as important as where to start a test, is which item should follow a particular response to the previous item. The issue is brought up many times in discussions of CAT implementation (Meijer & Nering, 1999; Tonidandel, Quinones, & Adams, 2002; van der Linden & Glas, 2010; Weiss, 1982, 1985). Several methods are discussed here to give a flavor of the possibilities for item selection while testing is in progress. One particularly easy method of item progression is the half-step method. A person is started halfway along the difficulty scale; depending on whether they get the first item right or wrong, the next item is drawn from halfway up or halfway down the scale. Each subsequent move is conditional in the same way: after a correct response, the next item lies halfway between the current difficulty and the hardest difficulty still in range; after an incorrect response, it lies halfway between the current difficulty and the easiest difficulty still in range (bounded by the items already answered). A pictorial representation of this concept is shown in Figure 5.1. The middle of a 30 item test is given as item 15 and half steps are made in this framework until 6 items have been administered.
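A minimal sketch of this half-step routing, written as a binary search over item ranks, is given below. It assumes the items are sorted from easiest to hardest and that `answer_fn` is whatever delivers an item and returns 1 (correct) or 0 (incorrect); both names and the exact rounding of the midpoints are this sketch's own choices rather than the dissertation's implementation.

```python
def half_adaptive(ranked_items, answer_fn, n_items=6):
    """Half-step item selection over items ranked from easiest to hardest.

    ranked_items: item identifiers sorted by increasing difficulty.
    answer_fn(item_id): administers the item and returns 1 or 0.
    """
    low, high = 0, len(ranked_items) - 1
    pos = (low + high) // 2            # start at the middle of the scale
    administered = []
    for _ in range(n_items):
        item = ranked_items[pos]
        correct = answer_fn(item)
        administered.append((item, correct))
        if correct:                    # move halfway up toward the hardest item still in range
            low = pos + 1
        else:                          # move halfway down toward the easiest item still in range
            high = pos - 1
        if low > high:                 # range exhausted before the item limit
            break
        pos = (low + high) // 2
    return administered
```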
The fully adaptive testing algorithm is the classic implementation of CAT (Lord, 1980). The process involves iterative calculation of the ability score, updating the score after each response. The person score can be estimated once there is at least one correct and one incorrect response, obtained through whatever initial item distribution process the researcher decides on. One process that provides a good starting point is the half adaptive methodology: it ensures that the ability estimate lies within the range of items already given, because the difficulties of the correctly and incorrectly answered items are known. Once these initial items are given, an interim score is calculated to predict which item is given next. This is done with the IRT analysis defined above, with ability estimated from the items administered up to that point using fixed anchors. The iterative process of calculation and selection then continues until a stopping point is met.
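The sketch below illustrates one way such a loop could look in Python, reusing the `estimate_ability` helper from the earlier sketch. To keep it short, the phase before a mixed (right-and-wrong) pattern is handled with a simple fixed step up or down rather than the half-adaptive routine, and the stopping rule is a fixed item count; both are simplifications of the procedures described in the text.

```python
import numpy as np

def fully_adaptive(difficulties, answer_fn, n_items=6):
    """Fully adaptive item selection: re-estimate ability after each response
    and administer the unused item whose calibrated difficulty is closest to
    the current estimate."""
    difficulties = np.asarray(difficulties, dtype=float)
    remaining = list(range(len(difficulties)))
    used, resp = [], []
    theta = 0.0                                   # provisional start at the scale centre
    for _ in range(n_items):
        nxt = min(remaining, key=lambda i: abs(difficulties[i] - theta))
        remaining.remove(nxt)
        used.append(nxt)
        resp.append(answer_fn(nxt))
        if 0 < sum(resp) < len(resp):             # mixed pattern: the MLE exists
            theta, _ = estimate_ability(resp, difficulties[used])
        elif resp[-1] == 1:                       # all correct so far: step upward
            theta += 1.0
        else:                                     # all incorrect so far: step downward
            theta -= 1.0
    return theta, used, resp
```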
A stopping rule must be implemented to ensure that everyone reaches a comparable level of accuracy in the estimation of their ability. An effective way to stop a test is when a desired level of accuracy is reached, as in a predefined standard error size (Wainer, 2000). This type of stopping rule guarantees a level of accuracy in the predicted score of the person; if the desired standard error is small, then more items will be required to achieve it. Another rule is a time or item limit. Such a limit may be imposed by circumstances outside the researcher's control, so that only a handful of questions can be administered. In this case, the number of items possible depends on the time intensity of the task, and adaptive or highly informative short form tasks are necessary. A limit of this type will increase the standard error of score estimates if fewer items are given than under the standard error stopping rule. Comparing different stopping rules is therefore a necessity in determining the effectiveness of adaptive item sets.
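For the standard-error rule, a small sketch of the stopping check under the Rasch model is shown below; the target standard error of 0.35 and the six-item cap are illustrative values, not those used in the study.

```python
import numpy as np

def ability_se(theta, administered_difficulties):
    """Standard error of the 1PL ability estimate from the Fisher
    information of the items administered so far."""
    b = np.asarray(administered_difficulties, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return 1.0 / np.sqrt(np.sum(p * (1.0 - p)))

def should_stop(theta, administered_difficulties, target_se=0.35, max_items=6):
    """Stop when either the target precision or the item limit is reached."""
    return (ability_se(theta, administered_difficulties) <= target_se
            or len(administered_difficulties) >= max_items)
```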
Hypotheses
The test of accuracy is against the standard score a person would receive had the entire item set been given. In this way, the current chapter shows how the short form and adaptive techniques compare when a person's full response pattern is known. The full item set score is used as the standard for all of the short form and adaptive methods to compare against, since this is the score to be emulated. The first hypothesis of this research is that informed item selection will provide better total score prediction than uninformed selection: fixed reduced form tests (items chosen for higher loadings, etc.) will be more in line with total scores than randomly chosen item sets and equidistant fixed item sets.
The second hypothesis is that adaptive item sets should produce results on par with, or better than, informed reduced item sets. The motivation for this hypothesis stems from the idea that a fixed item set is limited in how well it scores individuals because it cannot account for the items that have been omitted from the set. Adaptive testing allows all items to remain in the pool as potential stimuli, and a person's response pattern dictates which items are presented next.
Methods
Sample
The data come from the American Life Panel (ALP; Kapteyn, 2002), a nationally representative sample of adults 18 years old and above. Simple demographic statistics are provided in Table 5.1. The sample had a mean age of 48.9 with a range of 18 to 109. The sample was 59.3% female, and the ethnic breakdown was: White 84.2%; Hispanic 4.6%; Black 6.5%; Asian 1.3%; Other 3.4%. The average years of education attained was 11.5. The average income was $74,320; for analyses the log of this value was used to bring extreme incomes closer to the mean. Tests of the sample against national averages indicate that the ALP sample closely resembles the average American on the demographic aspects outlined above (Couper, Kapteyn, Schonlau, & Winter, 2007).
Collection of the data begins by sending an email invitation out to about 3,000 active
participants in 49 states and the District of Columbia. All of the participants have reported that they are
comfortable speaking and reading English. Since the survey is administered over the internet, if a
participant did not have a computer and internet access they were given both a computer and internet
Table 5.1. Descriptive Statistics of the American Life Panel, complete sample size N = 2548

                N      Mean    Median   Std Dev    Min      Max
Age            2548    48.9      50       14.8      18      109
Education      2548    11.5      11       2.08       3       16
Income         2123   74230   55000      60549    2500   200000
Female         2548   59.3%
NCPos          2533    59.8    62.5       17.9       0      100
NCNeg          2536    30.6    31.3       17.6       0      100
Ethnicity: White 84.2%; Hispanic 4.6%; Black 6.5%; Asian 1.3%; Other 3.4%
access. If they already had both, then the cash value of the computer was given instead. The rate of dropout from the sample is low, with only about 3 people formally leaving the sample per month. Nonresponders are left in the sample and are only deleted when the dataset is periodically cleaned; since cleaning is infrequent, the response rate for any given survey should take this group of people into account.
Measures
Number Series. The items used for the current analysis come from the Woodcock-Johnson III (WJ-III) set of tests of ability (WJ-III; Woodcock, McGrew, & Mather, 2001). The Number Series task is one of the tasks used to test math ability and problem solving skills. The task requires the participant to provide a missing number from a sequence of numbers so that the sequence makes a logical pattern. For example, if the sequence shown is 1, 2, 3, _, then the participant should provide the answer 4, as it makes the sequence increase by 1 with each successive number. The items were given in order of relative difficulty, with the easiest given first and progressively more difficult items following. There were 15 items per set, and two sets of items were given to each participant. The presentation of set A and set B was counterbalanced across participants. If the correct answer was given for the blank the participant received a 1; if a wrong answer was given they received a 0.
Need for Cognition Scale. The Need for Cognition scale (NCS), consisting of 8 of the original 18 items, was presented between the NS item sets (Cacioppo, Petty, & Kao, 1984). This set of items was placed between the two NS item sets and was meant to be a time filler that did not involve cognitive load. The scale is divided into two factors, positive and negative; the 4 items with the highest factor loadings on each factor were chosen, for a total of 8 items. In the NCS, participants are asked to rate statements about whether they seek out and undertake applied cognitive tasks on a 5 point Likert scale, where 1 is very inaccurate and 5 is very accurate. The positive and negative scale scores were sums of item endorsements, converted to a percentage score between 0 and 100.
Computer Presentation vs. Paper and Pencil
For the Number Series task there is a large difference between the two forms of presentation, paper and pencil versus computer. In the paper and pencil task, items are presented in groups and the respondent states each answer aloud to an interviewer. The interviewer determines whether the answer is correct and when to move on to the next item. There are multiple items per stimulus sheet (3 items), and if no items on a given sheet are answered correctly then the test is ended: since the items get progressively harder, if one complete set of items is unanswerable then it is unlikely that the next set will be answered correctly.
The computer adaptation of this task is much different in that all participants have the
opportunity to see all items, given they go through the entire test. The items are presented one at a
time on the screen with a space for the respondent to type in their numeric response. They control the
pace of the testing by selecting the next button when they are ready for the next item in the set. This
has the benefit of taking out interviewer dependencies in timing. The timing data is assumed to be the
pace of the person and their speed of work in this type of task.
Planned Analyses
The analysis of reduced and adaptive item sets involves a few steps in order to determine the
usefulness of differing methods. The first step is establishing a full scale score to compare the reduced
item set scores to when examining response patterns. Once a prototype score is defined, various scale
reduction techniques were implemented, ones that include fixed items sets for the sample, and others
that are adaptive (meaning the item set does not necessarily have to be identical from person to
person). First the reduced forms are examined (random, equidistant, alpha, factor), followed by the
adaptive item set techniques (BAT, half-adaptive, fully adaptive).
Results
IRT Scoring
In order to have a reference of how good our reduced scales were, the standard to which they
were compared was that of the full scale IRT model score. The 30 item IRT score used all of the items in
the item pool and was equivalent to the score that one would get in a standard cognitive testing
situation. This score was based on the one parameter model (1PL), where items only had a difficulty
parameter, item discriminations were held constant across items (a
i
=1). A graphical representation of
the spread of scores according to the 1PL IRT model is given in Figure 5.2. The sample average on the NS
task was , with a standard deviation of . This places the sample above the average
score of the scale (500), and may provide for skewed results for optimal item sets. Because higher ability
levels were measured, item sets that focus on more difficult items may be selected to more accurately
measure our sample’s ability.
The standard model provides the score that the other scoring methods outlined above attempt to emulate. The use of reduced item scales inevitably creates some loss in predicting the full scale score, but the degree of that loss is what differentiates a good method from a poor one. The following sections interpret the ability of item subsets, by themselves and with the addition of response time information, to predict full scale scores. The merit of these reduced item scales is then compared with and without response time information.
Figure 5.2. Distribution of 30 item scores for the ALP sample. The distribution of scores is skewed, with the mean (M = 520.6) well above the NS scale mean of 500. The sample does well on the test, indicating that items at the lower end of the scale will be less informative than items at the higher end of the scale. For the adaptive methods, items at the higher difficulty levels are expected to be chosen. The red line indicates the mean for the sample as a reference point.
Comparison of Reduction and Adaptive Methods
Form Reduction Techniques. The intent of this first step of the analysis was to display the usefulness of various form reduction techniques relative to the full item set score. One would also like to see how these response focused methods compare to methods in which response time or demographic variables are included, to see whether any gain is noticed. The intended purpose of these reduced item sets is to provide an accurate measure of ability while staying within a time constraint. Each item takes time to administer, and the full 30 item Number Series task takes much longer than the 5 minutes typically allotted to cognitive tasks, which limits how many items one can administer. In situations such as the Health and Retirement Study (HRS), the limit for cognitive tasks is a few minutes and a maximum of six items is given. For this reason the item set is limited to 6 items, to see how much explained variance remains under various item selection methods while staying within a real world constraint.
In the reduced forms of the Number Series task outlined below, each participant is assigned the same item set. As noted above, this is in contrast to the adaptive techniques, which may provide different item sets to participants based on performance on earlier items. Table 5.2 provides item labels and difficulties to compare across the various scoring methods. The first scoring method used 6 random items chosen from the complete item set. A random number generator was used to pick the randomized subset of 6 of the 30 items, and these items were then scored with fixed anchors taken from the complete 30 item IRT analysis. Second, items equally spaced along the Number Series scale were selected, so that 6 equidistant items made up the fixed item set. The scoring for this selection method was the same as the previous one, with scores derived from fixed anchors for the item difficulties calculated in the 30 item IRT analysis. The major distinction between these two item selection methods is the coverage of the scale by the items. In the first method, the random item selection did not take into account whether a specific section of the scale was already represented by a previous item. In the second case, better total score prediction is expected because the items do a better job of covering the difficulty levels of the scale.
Table 5.2. Item Subsets From Different Item Selection Methods (Item Difficulty Included for Comparison Purposes)

          Random            Fixed             Cronbach's α      Factor            Best 6
          Item  Difficulty  Item  Difficulty  Item  Difficulty  Item  Difficulty  Item  Difficulty
Item 1     4A     -3.03      2B     -3.89     11A      0.94      8B      0.59     10A      1.60
Item 2     5A     -4.01      3A     -2.51      9B      0.92     11A      0.94     12A      0.55
Item 3     8A      0.60     10B     -0.18      8B      0.59      7B      0.24     14A      2.56
Item 4    11A      0.94      8A      0.60      7B      0.24      9B      0.92      9B      0.92
Item 5     3B     -3.48      9A      1.77     12A      0.55     10B     -0.18     14B      2.12
Item 6     4B     -1.77     15B      3.36     14B      2.12     14A      2.56     15B      3.36

Notes: Random and Fixed item subsets do not attempt to tailor items to the sample. The Alpha, Factor, and Best R2 items are picked as best for this sample of respondents. If the characteristics of the sample change, such as lower average ability, items with lower difficulties should be chosen. Items are shown in rank order for Alpha and Factor.
Since the items are ordered by difficulty, selecting every fifth item along the scale yielded good coverage compared to random item selection.
The next three methods attempted to provide item subsets that discriminate ability levels better. The items with the highest item-total correlations (the Cronbach's alpha method) were the items that correlated best with the total score for the scale; the rationale is that items most related to the total score provide better estimates than items less related to it. Next, the items with the highest factor loadings in a one factor model were those with higher discrimination values compared to other items. When the scale is assumed to be unidimensional, the items that are most like the factor they represent have the highest loadings on that factor. The last fixed item set for scoring Number Series ability was the combination of 6 of the 30 items that best predicted the total score. A permutation test identified the 6 items with the highest explained variance (R²) of the total score; the full list of possible combinations allows one to determine which sets of items yield maximum predictability versus competing sets. One notable difference between these methods and the fixed equidistant items was that items from the lower end of the scale were not present. The performance of the sample was in the higher regions of the scale scores, indicating that items of higher difficulty would be more informative than lower difficulty items. It can also be noted that methods that take total item performance into account do better than the random and fixed item sets. From this portion of the analysis it becomes clear that some prior knowledge of relative group ability can be influential in reducing item sets. Table 5.3 compares how well the reduced form scoring methods relate to the full score.
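A sketch of this exhaustive best-subset search is shown below (Python, with a plain least-squares R² for each candidate subset). The response matrix and target score are whatever the full analysis produces; looping over all C(30, 6) = 593,775 subsets is feasible but can take a few minutes, which is why this is framed as an offline calibration step rather than something done at test time.

```python
from itertools import combinations
import numpy as np

def subset_r2(X, y):
    """R^2 of an ordinary least-squares regression of y on the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def best_k_items(responses, full_scores, k=6):
    """Exhaustive search for the k-item combination whose responses best
    predict the full-scale score (the 'Best R2' selection method)."""
    best_set, best_r2 = None, -1.0
    for subset in combinations(range(responses.shape[1]), k):
        r2 = subset_r2(responses[:, list(subset)], full_scores)
        if r2 > best_r2:
            best_set, best_r2 = subset, r2
    return best_set, best_r2
```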
In comparing the correlations with the total score, methods that used adaptive techniques performed about the same as methods based on the best predictive item sets. The value of forming a test that can hone in on a person's ability suggests that there is in fact an ideal set of items to ask a person.
Table 5.3. Sample Statistics for Reduced Form Number Series Tests

                 30 item   Random   Fixed   Alpha   Factor   Best R2
Mean              520.6     518.2   518.4   519.1    519.2     518.9
Std. Dev.          11.4       8.5     8.6    12.7     13.3      11.5
Min.              472.7     473.5   485.9   489.9    491.4     497.9
Max.              545.4     524.9   531.0   534.6    533.5     537.5
R2 with IRT-30     1.00      0.53    0.64    0.76     0.73      0.82

Correlations among scores (lower triangle; matrix is symmetric)
30 item            1.00
Random             0.72      1.00
Fixed              0.80      0.67    1.00
Alpha              0.87      0.68    0.61    1.00
Factor             0.85      0.65    0.63    0.90     1.00
Best R2            0.90      0.60    0.70    0.86     0.81      1.00

Notes: The first column of subtest correlations shows how well each subtest identifies with the total score. Higher correlations indicate more nearly identical scores.
The item set fixed over the length of the scale performed better than the random set of items, because with random selection one cannot be sure which portion of the scale (low vs. high end) will be represented. The other scale reduction techniques provided good improvements in correlation with the total 30 item score. These methods showed substantial increases in correlation, and the permutation method of item selection had the highest correlation of all of the fixed item form reduction methods. The means of each scoring method showed a downward bias, and significant differences in mean scores were detected with a repeated measures ANOVA using Wilks' Lambda (F(5, 2543) = 122, p < 0.001). There was a loss in the range of scores, with the lower end of the scale being cut off across all scoring methods. This was a result of the items selected by each reduction technique: by selecting only more difficult items, one loses scale definition for lower ability persons, and the lower score estimates were inflated.
Adaptive Techniques. The next set of scoring methods makes use of adaptive testing; that is, not every person receives the same item set. The block adaptive testing scheme (BATS) is outlined in Figure 5.3. This scheme gives a block of three items spanning the scale, and the responses are used to hone in on the section of the scale at which the person is believed to be, given their previous responses (McArdle & Woodcock, 2010). No correct responses would place the respondent at the bottom of the scale for the second set, while all correct responses would place them at the highest difficulty items for the second set. The purpose of the two stage process is to go from a coarse estimate of ability to a finer tuned estimate. The first set covers a large range of difficulty levels (low, medium, high); using the number of correct responses from the first set, the second set is tailored to the hypothesized ability (low, low medium, high medium, high).
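The two-stage routing can be written in a few lines, as in the Python sketch below. The item numbers used for the first block and for the four second-stage blocks are placeholders keyed loosely to Figure 5.3's layout, not the operational BATS item assignments, and `answer_fn` again stands in for whatever administers an item and scores the response.

```python
def block_adaptive(answer_fn,
                   first_block=(4, 7, 11),
                   second_blocks={0: (1, 2, 3), 1: (5, 6, 8),
                                  2: (9, 10, 12), 3: (13, 14, 15)}):
    """Two-stage block adaptive routing: everyone receives the same coarse
    first block; the number of correct responses (CR) selects one of four
    finer second blocks. Item numbers here are illustrative placeholders."""
    first_responses = [answer_fn(item) for item in first_block]
    cr = sum(first_responses)                 # 0, 1, 2, or 3 correct
    chosen = second_blocks[cr]
    second_responses = [answer_fn(item) for item in chosen]
    return first_block, first_responses, chosen, second_responses
```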
Figure 5.3. Representation of Block Adaptive Scoring. Set 1, which consists of 3 items, is given to all
respondents. The second set of items (3 items again) is varied depending on prior performance. If no
items are answered correctly in the first set, then CR = 0 and items 1, 2, and 3 are asked. The sequence
follows for the other columns of items labeled CR = 1, CR = 2, and CR = 3. The number of items answered
correctly in the next set is counted and a score is provided based on the number of correct responses to
each item set.
[Figure 5.3 diagram: Set 1 consists of items 4, 7, and 11; the four Set 2 blocks range from the easiest items (CR = 0) up to the most difficult items (CR = 3).]
Second, a half adaptive method was tested in which the person was started at the middle of the scale. A correct response moved them halfway up the scale, an incorrect response halfway down. The process repeated for the next item and so on until 6 items had been given to each respondent. This is in contrast with the fully adaptive technique, which determines the next item from the ability calculated on the previous items. Respondents were started at the center of the scale and a half adaptive method was used until at least one item was answered correctly and one incorrectly. At this point an estimate of ability was obtained, an item close to this ability was given, and the ability level was re-estimated. A stopping rule was applied such that testing stopped after 6 items had been administered. The difficulty values for the items were those estimated in the 30 item IRT analysis above. These scoring schemes are compared to the total score from the 30 item set in Table 5.4.
The differing adaptive methods all produced good results in recreating the total score, especially when compared to the random item set from the previous section. The first adaptive method tested was the BAT scoring method, which showed a decent correlation with the total item set score. When the BAT for Set A was compared to the 15 item Set A score, the relationship became even stronger (r_BA = 0.88), and the same was true of BAT B with the Set B items (r_BB = 0.89). The real improvement in total score prediction was found in the half adaptive and fully adaptive scoring methods. When a half adaptive algorithm was used to determine the item sequence, there was a strong correlation with the 30 item score (r_HA = 0.87). The same was true of the fully adaptive scoring method, which had an equally strong correlation (r_FA = 0.88).
Table 5.4. Sample Statistics for Adaptive Number Series Tests

                 30 item   BAT A   BAT B   Half adapt   Full adapt
Mean              520.6    517.6   518.1       517.7        518.2
Std. Dev.          11.4     10.7    10.5        10.3         10.9
Min.              472.7    468.1   471.7       472.3        476.3
Max.              545.4    539.6   544.4       539.9        535.6
R2 with IRT-30     1.00     0.64    0.66        0.74         0.76

Correlations among scores (lower triangle; matrix is symmetric)
30 item            1.00
BAT A              0.80     1.00
BAT B              0.81     0.52    1.00
Half adapt         0.86     0.82    0.68        1.00
Full adapt         0.87     0.85    0.70        0.85         1.00
The last analysis evaluates the usefulness of larger item sets when reduced and adaptive item sets are used, in terms of explained variance (R²) of the total score. In an item-by-item process, the cost of adding items is an accuracy versus time tradeoff. This is illustrated with the comparison of predictive accuracy in Table 5.5. A selection of item reduction methods and the fully adaptive method are compared with the total item set score for 2 to 9 items in a set. The trend is unequivocally increasing as more items are used to score respondents. The trends in predictability hold from the previous analyses: random and fixed item sets do not capture performance as well as the more sample dependent methods, and adaptive testing provided consistently good total score prediction. Also of importance is that the predictive values show decreasing returns: as more items are used, the total score R² goes up at a decreasing rate. Each additional item costs roughly two more minutes of testing in exchange for an improvement in score, but at some point the gains no longer outweigh the added time, or accuracy levels are already sufficiently high.
Table 5.5. R2 Predictability of Total Score with Varying Sizes of Item Sets

                        Number of Items Administered
             2      3      4      5      6      7      8      9
Random     0.05   0.26   0.46   0.58   0.59   0.60   0.68   0.73
Fixed      0.38   0.41   0.46   0.52   0.63   0.70   0.73   0.75
Alpha      0.53   0.60   0.70   0.76   0.77   0.80   0.83   0.84
Factor     0.53   0.59   0.63   0.64   0.73   0.76   0.81   0.85
Adaptive   0.47   0.61   0.68   0.72   0.76   0.77   0.79   0.81
Conclusion
The analyses carried out involved creating a rule for reducing the item set, scoring the reduced item set, and comparing the scores associated with the various response patterns produced. There are essentially two types of processes for reducing the number of items given to a group of respondents. The first involves scale reduction techniques, which shrink the item pool so that every respondent receives the same item set. Reducing the 30 item set in this way ensures that the score obtained per person becomes noisier and less accurate as more items are taken away (Sands, Waters, & McBride, 1997). In the present study, items with higher difficulties provided more information and more accurate scores than the lower difficulty items. This is an artifact of the sample properties, because many individuals were at the higher end of ability. In determining whether an item should be included in a reduced item set, one should consider the target population in order to ensure that items are used effectively. Items that are too difficult, or likewise too easy, will provide little information about respondents and should be dropped in favor of items that are closer to the abilities of the potential sample.
The move to adaptive tests marks a large distinction between the two construction methods. Adaptive tests have the property that item sets need not be the same across respondents. This allows the item pool to be larger than that of a reduced item test, and provides more coverage of the scale than the scale reduction techniques. A number of adaptive testing schemes were implemented and shown to perform at the same level as, or above, the reduced form scales. These adaptive tests were able to hone in on the respondent's ability level by starting with a coarse estimate and refining it as more items were administered.
The recommended number of items necessary for accurate measurement is mainly a matter of the researcher's requirements. In the current study most techniques of scale adaptation achieved accuracy rates (R²) in excess of 50% with as few as 2 items. However, a more conservative accuracy level of 70% is typically needed, and most methods (save for the random and fixed item sets) were able to reach this with 6 items (see Figure 5.4 for a graphical representation of Table 5.5). The hope is that in the coming chapters these baseline accuracy levels can be improved to even higher levels with response time and demographic information.
Figure 5.4. A Scatter plot of Item Selection Methods predicting the Total Score on the Number Series
task.
The plan of the above analysis was to compare different item selection methods under a timing constraint. For this reason all of the methods limited the total number of items to 6 per respondent. This was to mimic real-life constraints one would encounter when implementing tests in
survey designs. The addition of more items would only prove to enhance prediction of the 30 item score
and would be recommended if time permits.
It should also be said that these results come from a quasi-experimental design. The implementation of the study allowed all of the respondents to answer all 30 of the items. In doing this, an expectation was created for the participant, but in the IRT scoring the sequence of item presentation is not assumed to have an effect. For the reduced form methods, the items are scored with 6 item IRT models. The adaptive methods use 30 item IRT models in which responses to items that do not fall in the respondent's administered sequence are treated as missing.
In general the issue of measurement comes down to the accuracy with which estimates are made. Tsutakawa and Johnson (1990) indicate that a sample size of about 500 is needed for estimates to be considered accurate. In determining the effectiveness of the estimates, it was assumed that the sample was representative of the population about which one would like to make inferences. As noted in the results, the sample performed higher than average on the test and would be classified as a special group. Ideally one would like a uniform distribution of abilities in order to effectively estimate item parameters at all points of the scale in the calibration process.
The ALP had fewer people at the lower portions of the scale, leading to poorer estimation of the easier items. This characteristic of the sample also biased the items chosen by the form reduction methods towards higher difficulty levels. For a targeted internet administration, these reduction methods seem to be a good starting point given the technological and general literacy involved in computer use. For a more diverse group of individuals, adaptive methods would likely prove more effective than form reduction methods.
References to Chapter 5
Albert, J.H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs
sampling. Journal of Education Statistics, 17, 251-269.
Berk, R.A. (2008). Statistical learning from a regression perspective. New York: Springer.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In
F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores, Reading, MA: Addison-Wesley.
Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets.
Psychometrika, 64, 153-168.
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Couper, M., Kapteyn, A., Schonlau, M., & Winter, J. (2007). Noncoverage and nonresponse in an
internet survey. Social Science Research, 36, 131-148.
Green, B.F. (1983). Adaptive testing by computer. In R.B. Ekstrom (Ed.), Principles of Modern
Psychological Measurement (pp.5-12). San Francisco, CA: Jossey-Bass.
Kapteyn, A. (2002). Internet Interviewing and the HRS (grant number 1R01AG020717-01). Rand
Corporation: Santa Monica, CA.
Lord, F.M. (1970). Some test theory for tailored testing. In W.H. Holtzman (Ed.), Computer
assisted instruction, testing, and guidance (pp.139-183). New York: Harper and Row.
Meijer, R. R.,& Nering, M. L. (1999). Computerized adaptive testing: Overview and introduction.
Applied Psychological Measurement, 23, 187–194.
Mislevy, R.J. & Chang, H. (2000). Does adaptive testing violate local independence?
Psychometrika, 65, 149-156.
Mislevy & Wu (1996). Inferring examinee ability when some items response are missing
(research Report 88-48-ONR). Princeton, NJ: Educational Testing Service.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.
Sands, W. A., Waters, B.K., & McBride, J. R. (Eds.) (1997). Computerized adaptive testing: From
inquiry to operation. Washington, DC: American Psychological Association.
Shmueli, G., Patel, N.R., & Bruce, P.C. (2007). Data Mining for Business Intelligence. NY: Wiley.
Tonidandel, S., Quinones, M.A., & Adams, A.A. (2002). Computer-adaptive testing: The impact of
test characteristics on perceived performance and test takers’ reactions. Journal of Applied Psychology,
87, 320-332.
Tsutakawa, R.K. & Johnson, J.C. (1990). The effect of uncertainty of item parameter estimation
on ability estimates. Psychometrika, 55, 371-390.
van der Linden, W.J. & Glas, C.A.W. (Eds.) (2010). Computerized Adaptive Testing: Theory and
Practice. Netherlands: Kluwer Academic Publishers.
van der Linden, W.J. & Pashley, P.J. (2010). Item selection and ability estimation in adaptive
testing. In W.J. van der Linden and C.A.W. Glas (Eds.), Computerized Adaptive Testing: Theory and
Practice (pp. 1-25). Netherlands: Kluwer Academic Publishers.
Wainer, H. (Ed.) (2000). Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence
Erlbaum.
Wainer, H. & Mislevy , R.J. (2000). Item response theory, item calibration, and proficiency
estimation. In H. Wainer (Ed.), Computerized Adaptive Testing: A Primer (pp.61-100). Mahwah, NJ:
Lawrence Erlbaum.
Weiss, D.J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied
Psychological Measurement, 4, 473-492.
Weiss, D.J. (1985). Adaptive testing by computer. Journal of Consulting and Clinical Psychology,
53, 774-789.
Woodcock, R.W., McGrew, K.S., & Mather, N. (2001). Woodcock-Johnson III. Rolling Meadows,
IL: Riverside Publishing.
Chapter 6: Adaptive Testing Methods with Response Time
Abstract
The utility of response time in assisting with prediction of complete item set scores is explored
with reduced and adaptive item sets. Reduced item sets are those that are unchanged for the sample,
meaning each item is the same for each person. Adaptive item sets are ones that change depending on
the response pattern as a respondent answers more questions. Because adaptive item sets base the
coming items on prior responses they do not necessarily have to be the same for different response
patterns. Response time is used in a regression format where responses, response time and the
interaction of the two are used as predictors. Then a joint response and response time model is used
which implements Item Response Theory (IRT) with a separate response time model. Finally, a half
adaptive method is updated with response time as a way to further refine ability level. These methods
are explored and outlined as a set of methods for implementing response time in response models.
Introduction
In this chapter the hope is to recommend that response time be integrated into the scoring of items to produce a more accurate score. With the introduction of reduced form and adaptive testing to create comparable scores for Item Response Theory (IRT) type scales, it would be advantageous to portray a respondent's ability accurately using all of the information available from computer testing. The promise of computers in aiding research has long been recognized (Horn, 1979; Novick & Jackson, 1974). One role of computers is to capture information that has always been a part of testing: the response time. Since recording the time to complete items asks nothing extra of respondents, it would behoove researchers to make use of this potentially rich source of information.
Assimilating Previous Chapters for Adapted Tests
The previous chapters have set the stage for this analysis in a number of ways. The scale has been identified as unidimensional, and items have been assigned difficulties and sorted for adaptive use. The adaptive interpretation of the Number Series test has been explored, and a number of reduced and adaptive forms yield fairly accurate estimates of ability when the 30 item IRT model is used as the standard to emulate. In much of computer adaptive testing (CAT), and in IRT in general, the creation and implementation of items into a single scale is the most important aspect of testing (McDonald, 1999). The IRT approach of identifying item parameters separately from person parameters provides a good theoretical basis for discriminating which items fit the scale and which items are best for particular individuals.
Next, the link between the response and response time models was identified. The Conditionally Independent Response Time (CIRT) model provides that the response time model informs the response model and effectively helps decrease variation in scores when the relationship between the two models is stronger than chance (van der Linden, Klein Entink, & Fox, 2010). To the extent that the correlation exceeds r = 0.0, response time is better at predicting person ability. With the Number Series data from the American Life Panel (ALP), the association between speededness and ability was found to be low. This will attenuate the benefit of the joint model in improving score prediction over the response only model, but it is still proposed here as a potential alternative.
The previous chapter interpreted the results of various methods of reduced form and adaptive testing. In order to meet the time restrictions imposed on testing, focused and precise tools for measuring cognitive ability are needed. The previous chapter compared reduced form tests, which give the same reduced item set to the whole sample, with adaptive tests, which give different item sets to different individuals based on performance throughout testing, and provided good evidence on which methods may be preferred over others. These methods were based purely on ability, meaning that response time was not investigated as an indicator of ability. They are useful for providing a fairly accurate measure when time is at a premium. However, is this all that can be done? Can better use not be made of the item response times in order to obtain a more accurate score than responses alone afford? The next sections explore the role of response times in scoring and item selection for the adapted tests.
Integration of Response Time in Adapted Techniques
Regression Method. The desire is to be able to hone in on a respondent's ability with accuracy and speed. The faster researchers can get an accurate score with one measure, the sooner they can move on to another task to gather more information about an individual. With time spent with a person costing a pretty penny, it makes sense to use all of the available information to help create more accurate scores. For the purposes of this chapter the focus is on the collateral information, response times, which is largely ignored in cognitive testing. In the methods proposed below, the item subsets are used as predictors of the full item set total score. To improve on the use of item only information, the times associated with the selected items are introduced in the following techniques.
The first technique proposed is a regression method that emphasizes the utility of the individual pieces of information gathered in testing. In a typical regression equation the item responses are input as predictors of the total score as

6.1.   $y_j = \beta_0 + \sum_{i} \beta_i u_{ij} + e_j$.

The optimal score to predict is the 30 item score, given as $y_j$ for each person j, with an intercept of $\beta_0$. The vector of regression weights $\beta_i$ describes the effect of answering an item i correctly ($u_{ij} = 1$) on the predicted score. The role of this model is to provide a baseline effect of a subset of items in predicting the total score. From here one can estimate the amount of variation in scores that can be predicted by the item subset, ranging from 0.0 to 1.0, defined as
6.2.   $R^2 = \dfrac{\sum_j (\hat{y}_j - \bar{y})^2}{\sum_j (y_j - \hat{y}_j)^2 + \sum_j (\hat{y}_j - \bar{y})^2}$.

The explained variance is a function of the sum of squared differences between the fitted value for person j ($\hat{y}_j$) and the mean score of the sample ($\bar{y}$), and the sum of squared differences between the total score ($y_j$) and the fitted value ($\hat{y}_j$). The closer the total score is to the fitted value, the less error there is in prediction and the higher the $R^2$ is for the predictors on the total score.
The next step is to propose that the item timing could have an effect on the prediction of the total score. This step adds time as a predictor in the regression equation above as

6.3.   $y_j = \beta_0 + \sum_{i} \beta_i u_{ij} + \sum_{i} \gamma_i t_{ij} + e_j$.

Equation 6.3 now defines the total score $y_j$ in terms of both the vector of responses ($u_{ij}$) and the vector of response times ($t_{ij}$). The expectation is that knowing the length of time spent on a given item i by person j will provide information above and beyond the correct/incorrect information in the responses. Interactions of the responses and response times are then added by setting

6.4.   $y_j = \beta_0 + \sum_{i} \beta_i u_{ij} + \sum_{i} \gamma_i t_{ij} + \sum_{i} \delta_i (u_{ij} t_{ij}) + e_j$.
Here the interaction term ($u_{ij} t_{ij}$) is the simple product of the response ($u_{ij}$) and response time ($t_{ij}$) arrays for each person j and each item i in the applied item subset. When the response is coded as a binary variable, 0 or 1, only correct responses carry meaningful information in the interaction variables. The effect of the interaction on the total score is indicated by the sign of the $\delta_i$ weight: weights significantly different from 0 and negative indicate higher scores for faster responses, while weights that are significantly positive indicate higher scores for slower responses.
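A compact Python sketch of these three regressions (responses only, responses plus times, and the full interaction model) is given below. It uses ordinary least squares via `numpy.linalg.lstsq`; the variable names and the use of log response times are assumptions for illustration rather than the dissertation's own code.

```python
import numpy as np

def regression_r2(y, responses, times=None, interaction=False):
    """R^2 from predicting the full-scale score y with a reduced item set:
    responses only (Eq. 6.1), responses plus response times (Eq. 6.3), or
    both plus their products (Eq. 6.4), via ordinary least squares."""
    y = np.asarray(y, dtype=float)
    blocks = [np.asarray(responses, dtype=float)]
    if times is not None:
        t = np.asarray(times, dtype=float)
        blocks.append(t)
        if interaction:
            blocks.append(blocks[0] * t)   # nonzero only for correct answers
    X = np.column_stack([np.ones(len(y))] + blocks)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
```

For a six-item subset, `regression_r2(full_scores, u6)` would correspond to Equation 6.1, `regression_r2(full_scores, u6, logt6)` to Equation 6.3, and `regression_r2(full_scores, u6, logt6, interaction=True)` to Equation 6.4, where `u6` and `logt6` stand for the subset's responses and (log) response times.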
CIRT Method. The next method explored is the hierarchical response/response time model (van der Linden, 2007). The potential of this model lies in the idea that the response function and the response time function are modeled simultaneously, providing extra information about ability. The likelihood is a joint response and response time model, expressed as

6.5.   $L = \prod_{j} \prod_{i} p(u_{ij} \mid \theta_j, b_i)\, f(t_{ij} \mid \tau_j, \beta_i)$.

The responses ($u_{ij}$) and response times ($t_{ij}$) are the observed variables in the two models, with parameters for persons and items as follows. The response model is a typical IRT model with person parameter $\theta_j$ for ability and item parameter $b_i$ for difficulty. The response time model follows a lognormal distribution with person parameter $\tau_j$ for the speed of the test taker and $\beta_i$ for the time intensity of the item. This method is introduced as an alternative to scoring items with item responses alone.
The response model, as stated before, takes the form of the typical IRT model. This is shown as

6.6.   $p(u_{ij} = 1 \mid \theta_j, b_i) = \dfrac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}$.

This particular model is known as the Rasch model and is the simplest member of the family of IRT models (Rasch, 1960). The response pattern is a function of two parameters, the ability level of person j and the difficulty level of item i. The expectation is that the probability of a correct response increases as the ability of the person surpasses the difficulty of the item. The response time model is written as

6.7.   $\ln t_{ij} = \beta_i - \tau_j + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \alpha_i^{-2})$.
In this model the response time $t_{ij}$ is a function of two additional variables for person j and item i: the speededness $\tau_j$ of person j and the time intensity $\beta_i$ of item i. The function can accommodate more variable response time distributions through the parameter $\alpha_i$, which changes the ability of an item to discriminate between slow and quick responders. The quicker a person responds to an item, the higher the probability of correctly responding to that same item. As a joint model, the response time parameters help to inform the response parameters so long as there is a relationship between the person parameters of the two models (a nonzero correlation between $\theta_j$ and $\tau_j$).
The effect of modeling response time within this framework is that the extra information about ability contained in response time can be used to narrow in on ability, beyond just using response patterns (van der Linden, 2008; van der Linden & Pashley, 2010). The implementation of response times with this modeling technique lies in the use of IRT models for scoring. With short form testing and fully adaptive tests in which ability scores are calculated, this has the potential of increasing the accuracy of scores relative to full scale scores (Klein Entink, 2009). The improvement is governed by the correlation between person ability and person speediness, which is examined as a factor in the sections below.
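To show how the two parts combine, the sketch below evaluates the joint log-likelihood of Equations 6.6 and 6.7 for one person at given parameter values (Python; the argument names are this sketch's own, and it works with the normal density of log response times, which differs from the lognormal density of the raw times only by a term that does not involve the parameters). A full CIRT analysis would embed a likelihood like this in an MCMC or marginal maximum likelihood routine rather than evaluate it directly.

```python
import numpy as np

def joint_loglik(u, log_t, theta, tau, b, beta, alpha):
    """Joint log-likelihood for one person under the hierarchical model:
    Rasch responses (Eq. 6.6) plus log response times that are normal with
    mean beta_i - tau_j and standard deviation 1/alpha_i (Eq. 6.7)."""
    u, log_t = np.asarray(u, dtype=float), np.asarray(log_t, dtype=float)
    b, beta, alpha = map(np.asarray, (b, beta, alpha))
    # Response part: Bernoulli log-likelihood under the Rasch model
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    ll_u = np.sum(u * np.log(p) + (1.0 - u) * np.log(1.0 - p))
    # Response time part: normal log-density of log t
    mu, sd = beta - tau, 1.0 / alpha
    ll_t = np.sum(-np.log(sd * np.sqrt(2.0 * np.pi))
                  - 0.5 * ((log_t - mu) / sd) ** 2)
    return ll_u + ll_t
```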
Modified Adaptive Testing Method. This method is based on the idea that adaptive testing techniques follow a pattern of increasing item difficulty when an item is answered correctly and decreasing item difficulty when an item is answered incorrectly. If the timing data were ignored, items would be issued based purely on the response pattern, which already provides fairly good accuracy when adaptive scores are compared to the full item set scores. However, there is room for improvement, and introducing response time in this framework could increase the accuracy of scores.
The first item is again the middle of the scale as a general starting point, and the respondent moves halfway up or down depending on whether this item is answered correctly or incorrectly. However, simply moving halfway up or down neglects the possibility that the respondent is already at, or has even exceeded, their ability threshold. The amount of time spent on a given item can be used to determine whether they are more or less proficient with items of that difficulty. To examine the response times for a given item, one can plot their distribution, as shown in Figure 6.1 for item 15, the halfway point of the Number Series scale. Panel A shows the distribution of raw response times in seconds and Panel B the distribution of log-transformed response times. The responses are split by correctness to see whether the distribution of response times differs between correct and incorrect answers. There is a higher incidence of incorrect responses at 120 seconds, which is the time limit imposed during testing. Had this limit not been imposed, the distribution would tail off further, with individuals taking longer to answer the item without necessarily gaining any advantage from the added exposure to the question.
When a response is provided, the response time along with the response dictates the next item to be given. In the case of a correct response, a fast answer implies good command of the scale at that difficulty and warrants more upward movement than a response at the median of the response times. This follows from the interpretation of the response by response time interaction in the regression equations outlined above. Likewise, a slow correct response indicates the person is closer to their ability level and should not move as far up the scale in item difficulty. Response times are transformed into z-scores, $z_{ij} = (t_{ij} - \bar{t}_i)/s_i$, so that each respondent's z-score is the standardized difference of their response time from the average response time for each item in the item pool. For each item one can therefore determine whether a respondent was comparatively quick or slow. A rule is instituted that response time standard scores less than -0.50 (half a standard deviation faster than average) move the respondent up one additional item beyond the normal half step rule, and extremely quick response times (z <= -2.00) move up two items beyond the normal half adaptive step.
Figure 6.1. The response time distributions for (a) raw responses in seconds and (b) log transformation
of response times are given. The graphical representation is meant to discern if there is a notable
difference in the distribution of response times between correct and incorrect scores. The scaling is
changed between raw time in seconds and the log transformation of the response times. Log
transformations are used in the analyses carried out.
These rules can be implemented in the half adaptive framework with relative ease. When an incorrect response is given, similar rules apply. A slower incorrect response is more indicative of a person who is beyond their ability and should be shifted further down the scale than under response-only adaptive testing algorithms. In this adaptation, a rule was applied whereby response times more than 1.00 standard score above the item mean moved the respondent down one additional item when the item was answered incorrectly.
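A minimal sketch of the resulting adjustment rule is given below; it returns the extra movement, in ranked item positions, to add to the ordinary half-adaptive step. The thresholds come directly from the rules above, while the function name and the convention that positive values mean harder items are this sketch's own.

```python
def rt_step_adjustment(correct, z_time):
    """Extra movement (in item ranks) added to the ordinary half-adaptive step.
    Fast correct answers (z <= -0.50) move one extra item up, extremely fast
    ones (z <= -2.00) two extra; slow incorrect answers (z >= 1.00) move one
    extra item down. Positive values mean more difficult items."""
    if correct:
        if z_time <= -2.00:
            return +2
        if z_time <= -0.50:
            return +1
        return 0
    if z_time >= 1.00:
        return -1
    return 0
```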
Hypotheses
Several hypotheses associated with the methods outlined above are proposed for the analysis of the ALP data. The first is that harder items will necessitate more time to complete. This is expected to show up in two ways. First, higher scores will be associated with more time taken on the more difficult items than on the easier items, giving positive time weights ($\gamma_i$). Items at the lower end of the scale will be answered quickly by most respondents and thus will not provide much information beyond whether the person can produce a right answer or not, while more difficult items require more time to process the information and synthesize a response. This can also be indicated by a positive correlation between item difficulty $b_i$ and item time intensity $\beta_i$. The next hypothesis is that people who answer items correctly and quickly should have higher scores than those who answer correctly but slowly. This is a statement about the interaction term in the regression ($\delta_i$), which would be negative if this hypothesis holds.
In terms of the adaptive methods of testing, it is expected that modulating the movement up or down the scale by response time will provide some improvement over the traditional half-adaptive method. In using response time to govern how far to move up or down, it is hypothesized that scores will match the full scale score better than under half-adaptive testing without response time. Additionally, the distributions of correct and incorrect response times are different and should be treated as such when determining how far to move in a given direction on the difficulty scale.
Methods
Sample
The data come from the American Life Panel (ALP; Kapteyn, 2002), a nationally representative sample of adults 18 years old and above. Simple demographic statistics are provided again in Table 6.1. The sample had a mean age of 48.9 with a range of 18 to 109. The sample was 59.3% female and the ethnic composition was: White 84.2%; Hispanic 4.6%; Black 6.5%; Asian 1.3%; Other 3.4%. The average years of education attained was 11.5 years. The average income was $74,320, and for analyses the log of this value was used to bring extreme incomes closer to the mean. Tests of the sample against national averages indicate that the ALP sample closely resembles the average American on the demographic aspects outlined above (Couper, Kapteyn, Schonlau, & Winter, 2007).
Table 6.1. Descriptive Statistics of the American Life Panel, complete
sample size N = 2548
N Mean Median Std Dev Min Max
Age 2548 48.9 50 14.8 18 109
Education 2548 11.5 11 2.08 3 16
Income 2123 74230 55000 60549 2500 200000
Female 2548 59.3%
NCPos 2533 59.8 62.5 17.9 0 100
NCNeg 2536 30.6 31.3 17.6 0 100
Ethnicity
-White 84.2%
-Hispanic 4.6%
-Black 6.5%
-Asian 1.3%
-Other 3.4%
Collection of the data begins by sending an email invitation to about 3,000 active participants in 49 states and the District of Columbia. All of the participants have reported that they are comfortable speaking and reading English. Since the survey is administered over the internet, a participant who did not have a computer and internet access was given both; those who already had both were given the cash value of the computer. The rate of dropout from the sample is low, with only about 3 people formally leaving the sample each month. Nonresponders are left in the sample and only deleted when the dataset is periodically cleaned, so the response rate for any single survey should take this group into account.
Measures
Number Series. The items used for the current analysis come from the Woodcock-Johnson III (WJ-III) set of tests of ability (WJ-III; Woodcock, McGrew, & Mather, 2001). The Number Series task is one of the tasks used to test math ability and problem solving skills. The task requires the participant to provide a missing number from a sequence of numbers so that the sequence makes a logical pattern. For example: if the sequence shown is 1, 2, 3, _, then the participant should provide the answer 4, as it makes the sequence increase by 1 with each successive number. The items were given in order of relative difficulty, with the easiest given first and progressively more difficult items following. There were 15 items per set and two sets of items were given to each participant. The presentation of set A and set B was counterbalanced across participants. If the correct answer was given for the blank the participant received a 1; if a wrong answer was given they received a 0.
Need for Cognition Scale. The Need for Cognition scale (NCS), consisting of 8 items of the original 18, was presented between the NS item sets (Cacioppo, Petty, & Kao, 1984). This set of items was placed between the two NS item sets and was meant to be a time filler that did not involve cognitive load. The scale is divided into two factors, positive and negative. The 4 items with the highest factor loadings on each factor were chosen, for a total of 8 items. In the NCS participants are asked to rate statements about whether they seek out and undertake applied cognitive tasks on a 5 point Likert scale, where 1 is very inaccurate and 5 is very accurate. The positive and negative scale scores were sums of item endorsements, rescaled to percentage scores between 0 and 100.
Response Times. The response times were the length of time that a stimulus item was presented to the respondent before they moved on to the next item. While not an exact indicator of the time it took a respondent to think of an answer and provide it, this is a good indication of the pace that a respondent takes during the course of testing. As this is a new test whose implementation is still being refined, the response times on the Number Series serve to help improve cognitive testing in internet surveys. The log transform of the time was used to pull in extreme scores, as is typically done with timed response data (Klein Entink, 2009). As shown in Chapter 4 of this text, the lognormal distribution is a good fit for the response time data collected according to the CIRT modeling implementation.
Planned Analyses
The analyses take 4 steps as outlined in the analytic methods above. The first step is to create the full item set score as the standard to which all other scores are compared; this is the score a respondent would traditionally be given if they took all of the items in the item pool. Second, the regression method of incorporating responses and response times is tested. Third, the CIRT model is implemented, with results for the various reduction and adaptive methods compared to the 30 item score. Lastly, the adaptive scoring methods are updated so that response time informs how much to increment item difficulty rather than static rules (the half-adaptive method).
Results
Regression Method
The regression model outlined above was used in a 4 step process to determine the effect of adding response time to the prediction model; the results are reported in Table 6.2. The prediction model has the 30 item total score as the outcome variable and predictors based on responses and response times. The first step, labeled Model A (EQ (6.1)) in Table 6.2, included the responses as the only predictors of the 30 item IRT score. For each of the item selection methods a positive weight is associated with each item response when responses are the only predictors of the total score. The next step, B, added the total time spent on each of the 6 item sets as a predictor for the total score. The results indicated that the total time adds minimally to score variance prediction in all item reduction methods (adding 0.00 to 0.01 to R2). In step C (EQ (6.3)) of the analysis, the individual times that respondents spent on each item were used as predictors of the total score in addition to the item responses. The effect of this addition is complicated because there is not a uniform outcome across the timing weights: the regression weights for the timing variables range in both positive and negative directions, so more time could be indicative of higher scores for some items and lower scores for others. A check of the relationship between the item difficulties and the estimated response time parameters yielded no significant relationship (t11 = 0.046, p = 0.83); there was no significant trend between item difficulty and amount of time taken per item. Step D (EQ (6.4)) included the interactions of responses and response times as predictors in addition to the predictors in Model C. The significant predictor weights for the interactions were negative, meaning that people who answered items correctly at a faster rate had a higher predicted total score.
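For readers who want to reproduce this kind of model comparison, here is a minimal sketch of the four nested models fit by ordinary least squares; the data frame and its column names (score30, u1-u6 for item responses, t1-t6 for log response times, total_time) are hypothetical and are not the dissertation's actual variable names.

```python
# A minimal sketch of the Model A-D regression steps, assuming a pandas data
# frame with hypothetical columns: score30 (30 item IRT score), u1-u6 (item
# responses), t1-t6 (log response times), and total_time.
import pandas as pd
import statsmodels.formula.api as smf

def fit_models(df):
    items = [f"u{i}" for i in range(1, 7)]
    times = [f"t{i}" for i in range(1, 7)]
    inter = [f"u{i}:t{i}" for i in range(1, 7)]   # response x time interactions

    formulas = {
        "A": "score30 ~ " + " + ".join(items),                      # responses only
        "B": "score30 ~ " + " + ".join(items + ["total_time"]),     # + total time
        "C": "score30 ~ " + " + ".join(items + times),              # + item times
        "D": "score30 ~ " + " + ".join(items + times + inter),      # + interactions
    }
    return {name: smf.ols(f, data=df).fit() for name, f in formulas.items()}

# Example usage:
# models = fit_models(df)
# print({name: round(m.rsquared, 2) for name, m in models.items()})
```

Comparing the R2 values across Models A-D in this way mirrors the comparisons reported in Table 6.2.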
Table 6.2. Regression Method of Response and Response Time Integration
Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Total Time
Random A 6.44 (1.40) 5.75 (2.26) 8.38 (0.38) 12.0 (0.36) 8.69 (1.70) 7.77 (0.78)
Random B 5.86 (1.38) 4.63 (2.22) 7.75 (0.38) 11.8 (0.36) 8.01 (1.68) 7.40 (0.77) -0.56 (0.07)
Random C 5.11 (1.39) 4.67 (2.27) 7.89 (0.38) 11.7 (0.36) 7.72 (1.68) 7.03 (0.77) --
Random D 6.44 (5.27) 20.1 (7.61) 16.6 (1.80) 20.8 (1.62) 19.6 (6.23) 8.59 (2.88) --
Fixed A 7.84 (1.65) 8.38 (0.95) 9.77 (0.40) 6.68 (0.34) 6.23 (0.29) 9.55 (0.32)
Fixed B 8.18 (1.64) 8.26 (0.94) 9.50 (0.40) 6.76 (0.33) 6.34 (0.29) 9.64 (0.32) 0.25 (0.06)
Fixed C 5.48 (1.52) 6.30 (0.86) 7.42 (0.38) 5.43 (0.32) 6.13 (0.26) 8.69 (0.30) --
Fixed D 5.85 (4.56) 5.86 (3.27) 10.5 (1.49) 8.70 (1.47) 8.61 (1.31) 15.2 (1.65) --
Alpha A 5.52 (0.31) 4.54 (0.31) 4.94 (0.33) 4.59 (0.34) 5.48 (0.31) 7.41 (0.25)
Alpha B 5.50 (0.30) 4.54 (0.30) 4.85 (0.32) 4.88 (0.34) 5.30 (0.31) 7.34 (0.24) -0.40 (0.04)
Alpha C 5.16 (0.31) 4.34 (0.30) 4.62 (0.32) 4.72 (0.34) 5.38 (0.31) 7.15 (0.25) --
Alpha D 11.4 (1.25) 8.07 (1.29) 6.83 (1.23) 6.99 (1.50) 4.45 (1.24) 11.6 (1.16) --
Factor A 4.88 (0.35) 6.95 (0.30) 3.97 (0.37) 4.67 (0.32) 4.08 (0.40) 7.46 (0.27)
Factor B 4.80 (0.34) 6.94 (0.30) 4.14 (0.37) 4.63 (0.32) 4.31 (0.39) 7.25 (0.27) -0.34 (0.04)
Factor C 4.48 (0.34) 6.29 (0.30) 4.00 (0.36) 4.30 (0.32) 4.22 (0.38) 7.00 (0.26) --
Factor D 8.39 (1.30) 10.9 (1.30) 4.03 (1.59) 8.43 (1.37) 3.94 (1.48) 8.91 (1.47) --
R2 A 5.20 (0.23) 5.93 (0.26) 4.93 (0.24) 6.79 (0.24) 5.43 (0.23) 5.61 (0.25)
R2 B 5.19 (0.23) 5.93 (0.26) 4.93 (0.24) 6.77 (0.24) 5.41 (0.23) 5.63 (0.25) 0.03 (0.03)
R2 C 5.06 (0.22) 5.54 (0.26) 4.76 (0.23) 6.44 (0.24) 5.09 (0.23) 5.58 (0.25) --
R2 D 7.57 (0.98) 5.03 (1.08) 8.34 (1.26) 9.67 (1.09) 9.01 (1.39) 6.27 (1.39) --
Notes: Models A - D for each form reduction method add predictors in order of theoretical usage. Models A include only item responses as a baseline prediction (Item1-Item6). Models B add the total testing time to the prediction model (Total Time), while Models C add the individual item response times as separate predictors instead of the total time (Time1-Time6). Models D include interaction terms for each of the six item responses with response times (Int1-Int6). Regression model parameters are listed with standard errors in parentheses and significant values are shown in bold.
Table 6.2. Continued.
Time 1 Time 2 Time 3 Time 4 Time 5 Time 6 Int 1 Int 2 Int 3 Int 4 Int 5 Int 6 Model R2
Random A 0.53
Random B 0.54
Random C -1.40 (0.39) 0.18 (0.41) -0.25 (0.26) 0.60 (0.30) -1.05 (0.41) -1.75 (0.36) 0.55
Random D -0.60 (1.53) 4.29 (2.05) 1.36 (0.41) 2.17 (0.41) 2.59 (1.94) -1.00 (0.96) -0.51 (1.58) -4.22 (2.09) -2.42 (0.48) -2.97 (0.51) -3.79 (1.96) -0.62 (0.99) 0.56
Fixed A 0.64
Fixed B 0.65
Fixed C -1.76 (0.30) -2.47 (0.29) -0.45 (0.23) -1.02 (0.22) 2.05 (0.22) 2.95 (0.19) 0.71
Fixed D -1.60 (1.32) -2.46 (1.09) 0.36 (0.44) -0.44 (0.34) 2.33 (0.28) 3.25 (0.20) -0.09 (1.35) 0.13 (1.11) -1.06 (0.48) -0.90 (0.39) -0.69 (0.35) -1.73 (0.43) 0.72
Alpha A 0.76
Alpha B 0.77
Alpha C -1.07 (0.22) -0.37 (0.23) -0.78 (0.23) -0.97 (0.26) 0.15 (0.21) 0.48 (0.18) 0.77
Alpha D 0.14 (0.31) 0.47 (0.31) -0.16 (0.33) -0.36 (0.39) -0.08 (0.33) 0.78 (0.21) -1.99 (0.38) -1.14 (0.37) -0.74 (0.39) -0.81 (0.45) 0.27 (0.38) -1.31 (0.32) 0.78
Factor A 0.73
Factor B 0.73
Factor C -0.97 (0.25) -1.21 (0.22) 1.10 (0.27) -0.70 (0.24) 0.67 (0.24) 1.27 (0.19) 0.74
Factor D -0.12 (0.34) -0.29 (0.33) -1.01 (0.41) 0.10 (0.33) 0.78 (0.43) 1.30 (0.21) -1.29 (0.41) -1.49 (0.40) -0.08 (0.48) -1.24 (0.39) 0.03 (0.47) -0.54 (0.38) 0.75
R2 A 0.82
R2 B 0.82
R2 C -0.28 (0.18) -0.84 (0.19) 0.89 (0.17) -0.92 (0.18) 0.29 (0.18) 0.56 (0.18) 0.82
R2 D 0.13 (0.23) -0.96 (0.30) 1.10 (0.20) -0.22 (0.26) 0.62 (0.20) 0.56 (0.20) -0.79 (0.30) 0.16 (0.34) -0.95 (0.32) -1.00 (0.32) -1.12 (0.29) -0.23 (0.35) 0.83
Table 6.2 also notes the explained variance for each model. As noted previously, models with items that better cover the range of respondent abilities will improve model accuracy. For this reason, the Alpha, Factor, and Best R2 models predict more variation in scores than the Random and Fixed item sets. The addition of response times to each model provided little increase in explained variance (on the order of 1%) for all but the Fixed item set. The Fixed item set saw a 6% increase in explained variance when individual item response times were used versus a 1% increase when only the total time was used. Small improvements were also made when the interaction terms were added to the regression equations, but these improvements were 1.0% at best.
CIRT Method
The CIRT method uses Markov Chain Monte Carlo (MCMC) estimation techniques to estimate the distribution of parameters for a sample. The CIRT model was run as outlined in Chapter 4 with fixed anchors for the item difficulties and speed intensities, and the complete 30 item score (EQ 6.6) was compared to the reduced item sets scored with EQ 6.5. When fixed anchors are used in this framework the only parameter estimation that takes place is that of the person parameters (ability θj and speededness τj). Stationarity was established with 11,000 iterations, 1,000 of which were used as burn-in (iterations discarded while the sampler was still converging to the stationary distribution of the parameters; Geweke, 1992). Once stable distribution estimates were established, person parameters could be estimated per respondent and scores compared over the various item reduction methods.
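The scoring step after estimation can be summarized in a short sketch; this is generic MCMC post-processing (not the CIRT package's own interface), assuming the sampler returns an array of person-parameter draws.

```python
# A minimal sketch of the post-processing step described above: discard the
# burn-in draws and use the posterior means of the person parameters as the
# scores. `draws` is assumed to be an array of MCMC samples with shape
# (n_iterations, n_persons).
import numpy as np

def posterior_person_scores(draws, burn_in=1000):
    kept = draws[burn_in:]            # drop iterations before stationarity
    means = kept.mean(axis=0)         # posterior mean ability per respondent
    sds = kept.std(axis=0, ddof=1)    # posterior SD as an uncertainty gauge
    return means, sds

# Example with simulated draws for 3 respondents over 11,000 iterations:
# draws = np.random.normal(loc=[0.5, -1.0, 1.5], scale=0.3, size=(11000, 3))
# theta_hat, theta_sd = posterior_person_scores(draws)
```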
Table 6.3. Enumerated Item Sets for CIRT Scored Models. The response model used is a Rasch IRT model and the response time model is a lognormal distribution. The item difficulty parameter in the response model is analogous to the item intensity parameter in the response time model.
Random Fixed Alpha Factor R2
Item Difficulty Intensity Item Difficulty Intensity Item Difficulty Intensity Item Difficulty Intensity Item Difficulty Intensity
Item 1 4A -2.82 2.35 2B -3.17 2.06 11A -0.69 3.05 8B -0.90 2.92 10A -0.29 3.17
Item 2 5A -3.26 1.97 3A -2.54 2.47 9B -0.69 3.25 11A -0.69 3.05 12A -0.92 2.94
Item 3 8A -0.88 3.47 10B -1.34 3.04 8B -0.90 2.92 7B -1.11 3.08 14A 0.29 3.68
Item 4 11A -0.69 3.05 8A -0.88 3.47 7B -1.11 3.08 9B -0.69 3.25 9B -0.69 3.25
Item 5 3B -3.03 2.18 9A -0.19 3.59 12A -0.92 2.94 10B -1.34 3.04 14B 0.02 3.38
Item 6 4B -2.17 2.49 15B 0.77 3.72 14B 0.02 3.38 14A 0.29 3.68 15B 0.77 3.72
Notes: The item order is determined at random for the Random and R2 item sets. The order of items as displayed for Fixed, Alpha, and Factor is in order of difficulty, total score predictiveness, and factor loading, respectively.
As noted previously, the improvement of CIRT over IRT in terms of ability estimate accuracy depends on the relationship between person speededness and person ability. The relationship between these predicted values for the ALP sample was established as r = 0.18. According to simulation work this is on the lower end of the bounds for effective improvement in score prediction (van der Linden, 2008). With such a small relationship between ability and person speededness one would not expect a large improvement in score prediction from including response time in the prediction model compared to when it is left out; a simple Rasch IRT prediction model could provide essentially the same information as the joint model in EQ (6.5). The item difficulties and time intensities are shown in Table 6.3 for reference when the CIRT model results are presented next.
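For orientation, here is a small numeric illustration of the two response-level models used in this scoring, with the parameter values for item 14A from Table 6.3 (difficulty 0.29, time intensity 3.68); the function names are illustrative, and the reading of the intensity as a mean on the log-seconds scale follows the Chapter 4 setup.

```python
# A minimal sketch of the Rasch response model and the lognormal response time
# model that the CIRT framework combines. theta = person ability, tau = person
# speededness, b = item difficulty, beta = item time intensity.
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_log_time(tau, beta):
    """Mean of the log response time under the lognormal model (beta - tau)."""
    return beta - tau

# Item 14A from Table 6.3: difficulty b = 0.29, time intensity beta = 3.68.
print(round(rasch_prob(0.0, 0.29), 2))                 # 0.43 for an average-ability respondent
print(round(math.exp(expected_log_time(0.0, 3.68))))   # roughly 40 seconds for average speededness
```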
Table 6.4. Scoring Method Intercorrelations for (A) IRT Models and (B)
CIRT Models with Reduced Item Sets
A. IRT
30 item Random Fixed Alpha Factor Best R2
30 item 1.00
sym.
Random 0.72 1.00
Fixed 0.80 0.67 1.00
Alpha 0.87 0.68 0.61 1.00
Factor 0.85 0.65 0.63 0.90 1.00
Best R2 0.90 0.60 0.70 0.86 0.81 1.00
B. CIRT
30 item Random Fixed Alpha Factor Best R2
30 item 1.00
sym.
Random 0.72 1.00
Fixed 0.79 0.67 1.00
Alpha 0.86 0.68 0.59 1.00
Factor 0.85 0.66 0.62 0.92 1.00
Best R2 0.90 0.58 0.68 0.85 0.80 1.00
Notes: The column of subtest correlations with the total score is shown in bold to indicate how well each subtest identifies with the total score. Higher correlations indicate higher rank order agreement. CIRT scores include response time as an input for scoring whereas the IRT scores do not.
The correlations of the CIRT and IRT scores with the 30 item score are given in Table 6.4. A reproduction of the correlations for the standard IRT models used to score the reduced item sets from the previous chapter is included in Panel A of Table 6.4, and the CIRT score correlations with the total score are shown in Panel B. When the two panels are compared, the same relationships hold between the reduced item set scores and the 30 item total score used as the standard to emulate. In fact there is no significant gain or loss when the CIRT modeling method is used versus the IRT modeling method. The joint model does produce a score that is comparable to the response only (IRT) model; however, there is no apparent gain from adding this level of complexity given the reduced item sets used in this analysis. It is important to stress that the utility of such a model is not called into question simply because it does not improve estimation in one particular domain of cognitive ability. What this result tells us is that the timing data do not provide further information about ability estimation in the framework suggested. Other domains have the potential to be influenced by timing data within the CIRT framework, and this implementation exemplifies how to proceed with such an analysis.
Modified Adaptive Testing
This section tests the utility of adding response time to an adaptive framework. A half-adaptive method of selecting items was previously used with good results compared to other methods. The goal here is to introduce response time as a variable adjusting the amount of movement between item difficulties in successive presentations. Modifying rules were explored that govern movement for correct responses with response time taken into account and, likewise, for incorrect responses.
In the half adaptive framework response time can be used to modulate the amount of movement that a correct or incorrect response causes. The item presentation process started with item 15 for all participants. If the respondent was exactly average in their response time they would move halfway up the rest of the scale for a correct response, or halfway down for an incorrect response. Correct responses that were quicker than average increased item difficulty slightly more, moving respondents up one or two additional items depending on the quickness. This creates one of three possible second items presented to respondents depending on the speed of their response. Similarly, incorrect responses that were slow were considered to indicate an item well above the respondent's ability level, and the difficulty was decreased even more by dropping down one more item in the rank of item difficulties. This led to two possibilities for the second item for those who got the first item wrong. The same response time rule was applied to responses to the second item. From there on the half adaptive method was used without response time because of the size of the item pool.
Figure 6.2. Scatterplots of 30 item scores, half-adaptive scores, and modified half-adaptive scores. The
plots indicate that there is little difference in the scores between the half-adaptive method and the
modified half-adaptive method which incorporates response time in item movement for correct and
incorrect responses.
The interpretation of the time modulated half adaptive test is one of minimal difference to the
half adaptive test in Chapter 5. The relationship of the half-adaptive test with the complete 30 item
score was shown to be a correlation of r=0.86. When time is used to move respondents up or down an
additional couple of items the correlation does not change substantially (r=0.84). The lack of
improvement is highlighted by Figure 6.2, where scores obtained through the half-adaptive method and
modified half-adaptive methods were plotted against each other and against the 30 item score. Comparing the bottom two cells in the first column (in the bold box) to see if response time changed the prediction of the 30 item score, it is shown that there is no gain. The overlap of the two scoring methods is further illustrated by Figure 6.3, where the scores from the half-adaptive methods are plotted against the 30 item score. The blue points indicate the half-adaptive scores and the red points the modified half-adaptive scores, and there is a great deal of overlap observed in this plot. While timing did not hinder the ability to retain score accuracy, there was no gain from introducing this extra layer of scoring complexity in this particular cognitive test.
Figure 6.3. Scatterplots of 30 item scores (x axis: 30 Item IRT Scores) versus the half-adaptive scores and modified half-adaptive scores (y axis: Half-Adaptive Methods Scores). The comparisons using the half-adaptive method are in blue and the scores using the modified half-adaptive method are in red. Here the overlap of the scoring methods is much clearer and the correlations are not significantly different. The variation accounted for in the 30 item scores is about the same using either form of the half-adaptive scoring method.
To further examine the response and response time interaction, the CIRT model was used with the adaptive item sets as outlined before. These sets included the block adaptive (BAT-A and BAT-B), half-adaptive, and fully-adaptive items. As shown in Table 6.5, the item sets were analyzed using both IRT and CIRT scoring, and the resulting scores were compared to the 30 item score. Panel A of Table 6.5 reprints the IRT correlations for the adaptive scoring methods, and Panel B gives the corresponding CIRT correlations. In both panels the BAT scores show lower correlations with the 30 item score, while the truly adaptive item sets provide higher correlations with the total score. Of note is the fact that the CIRT correlations were slightly lower than the IRT correlations (by between 0.01 and 0.03), but the trend in correlations between the two scoring methods was identical. The purpose of introducing the CIRT scoring method is to make use of the response time data in a dynamic model of ability, where ability and speededness are estimated in tandem. There was no substantial increase in score correlation when CIRT scoring was implemented versus IRT scoring. That being the case, the two methods produced similar results, so either would provide comparable scores for use in cognitive ability analyses.
Table 6.5. Scoring Method Intercorrelations for (A) IRT Models and
(B) CIRT Models with Adaptive Item Sets
A. IRT
30 item BATA BATB Half-Adapt Adapt
30 item 1.00
sym.
BATA 0.80 1.00
BATB 0.81 0.52 1.00
Half-Adapt 0.86 0.82 0.68 1.00
Adapt 0.87 0.85 0.70 0.85 1.00
B. CIRT
30 item BATA BATB Half-Adapt Adapt
30 item 1.00
sym.
BATA 0.79 1.00
BATB 0.79 0.53 1.00
Half-Adapt 0.85 0.81 0.68 1.00
Adapt 0.86 0.85 0.70 0.84 1.00
Notes: The column of subtest correlations with the total score is shown in bold to indicate how well each subtest identifies with the total score. Higher correlations indicate higher rank order agreement. CIRT scores include response time as an input for scoring whereas the IRT scores do not.
Conclusion
In the first analysis, the method of adding response times as predictors in a regression equation with item responses provided the most utility when the fixed item set was used. This provides some evidence for the utility of response time when a representative reduced item pool is used by researchers. The most gain was made when response times to each item were used as opposed to the total time taken on the reduced item set. In all other cases of reduced item sets, response time made minimal improvements in predicting the variation in total 30 item scores (R2 improvements on the order of 1-2%). The introduction of the interaction served as a way of determining the importance of answering quickly and correctly. The significant interaction weights were negative, which would lead one to conclude that answering slowly but correctly indicates that one is closer to one's threshold of ability than someone who answers quickly and correctly. This idea was implemented in the modified half-adaptive method outlined above. The low strength of these weights suggests that the returns would be limited when timing rules are implemented for half-adaptive testing.
Likewise, one can say that the CIRT method of scoring with reduced item sets was limited due to
the low relationship between scores and timing exhibited above. The correlation between the total time
of testing and the total 30 item score was low to negligible in our sample (r=0.03). When the CIRT model
is implemented there is a relationship between the constructs of ability and speededness as defined in
the response and response time models (r=0.18). As noted previously, to the point of overstating it, this
is considered a lower correlation by simulation studies using the CIRT model and the effect of response
time in improving ability scores is low (Klein Entink, 2009). And as is the case with the regression method
of scoring items with response times included, there is no detriment to including response times; they
merely fail to yield any significant additional information beyond the response only models.
The response time modulated half-adaptive scoring method was introduced with the idea that there may be an interaction between speed of response and ability. As shown in the regression analysis, a significant interaction effect can be detected wherein slower correct responses yield a lower predicted score. This premise would indicate that faster correct responses should be moved up a couple of items in difficulty relative to slower responses. A simple introduction of this rule within the ALP sample provided no significant improvement, but again the scores were comparable to those that used only item responses in the half adaptive testing. The implementation of response time modified adaptive testing could potentially be more effective with larger item pools, where there are more items to cycle through and a two item difference does not move one up or down the scale to such a large degree.
The overall theme that appears when response time is added to the scoring of the Number Series task is that there is no real gain. The other important point is that adding response time to the scoring model did not make the scores any worse than using responses only. This is consistent with previous studies on the usefulness of response times in response models (Fox, Klein Entink, & van der Linden, 2010), which found that scales in which respondent speededness has a higher relationship with the ability measured yield better ability estimation (lower standard errors). Another point of discussion is the utility of the CIRT framework in adaptive testing, as highlighted in van der Linden (2008). The response times in that work are shown to provide good information about test taker ability and even help to detect problematic test taking behavior; for example, respondents who intentionally respond quickly to items to gain an advantage for fast responses can be identified and penalized. The utility of response times in cognitive testing is thus meaningful in a number of ways, and there is no detriment to including this information.
References to Chapter 6
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Couper, M., Kapteyn, A., Schonlau, M., & Winter, J. (2007). Noncoverage and nonresponse in an
internet survey. Social Science Research, 36, 131-148.
Fox, J.P., Klein Entink, R.H., & van der Linden, W.J. (2010). Modeling of responses and response
times with the package CIRT. Journal of Statistical Software, 20, 1-14.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian Statistics 4 (pp. 169-193). Oxford: Oxford University Press.
Horn, J.L. (1979). Trends in the measurement of intelligence. Intelligence, 3, 229-240.
Kapteyn, A. (2002). Internet Interviewing and the HRS (grant number 1R01AG020717-01). Rand
Corporation: Santa Monica, CA.
Klein Entink, R.H. (2009). Statistical models for responses and response times. Doctoral thesis, University of Twente, the Netherlands.
Novick, M. R., & Jackson, P. H. (1974). Statistical methods for educational and psychological
research. New York: McGraw-Hill.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.
McDonald, R.P. (1999). Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum.
van der Linden, W.J., Klein Entink, R.H., & Fox, J.P. (2010). IRT parameter estimation with
response times as collateral information. Applied Psychological Measurement, 34, 327-347.
van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287-308.
van der Linden, W. J. (2008). Using response times for item selection in adaptive testing. Journal
of Educational and Behavioral Statistics, 33, 5-20.
van der Linden, W.J. & Pashley, P.J. (2000). Item selection and ability estimation in adaptive
testing. In W.J. van der Linden & C.A.W. Glas (Eds.), Computerized Adaptive Testing: Theory and
Practice. Boston: Kluwer Academic Press.
Woodcock, R.W., McGrew, K.S., & Mather, N. (2001). Woodcock-Johnson III. Rolling Meadows,
IL: Riverside Publishing.
Chapter 7: Predictive Effects of Demographics
Abstract
This final section of analysis deals with the usefulness of other collateral information available in the American Life Panel (ALP) data set. The utility of including demographic variables in predicting a total score on the Woodcock-Johnson Number Series task, given minimal item and response time information, was explored. Results indicate that there is some utility to adding this information, but mainly at points where there is a drop in accuracy for predicting the total score from a given item subset. It is shown that some demographics provide helpful hints in the direction of adjusting scores while others do not reliably provide good information about the total score.
Introduction
A main focus of much of the text up to this point has been to score respondents accurately using information that is readily available at the time of test administration. The proposed use of response time is a logical step in accounting for individual differences in ability. To the extent that ability and speed are related, speededness adds information that improves our ability to accurately score a person. That is, as long as there is an identifiable reason to include speededness as a factor in scoring, it should be done. In the context of the previous chapters, this means that the cognitive ability required to do Number Series (fluid intelligence in the broadest sense; Horn & Cattell, 1967) and speed are correlated and provide ancillary information that is useful. Researchers discuss this idea in many contexts, but there is a lack of methodical implementation. The previous chapters outlined various methods by which one could implement response times effectively in cognitive testing. In this chapter the aim is to see whether person characteristics add more to the prediction of the total score.
The American Life Panel (ALP; Kapteyn, 2002) is a continuing database of respondents. Once people are in the database, they continue to receive requests to take part in online experiments and surveys. For this reason a great deal of information can be linked between surveys and applied in each subsequent analysis. With this building process, relevant information can be accessed from previous surveys and applied without having to take time from the current survey to ask for the information again. The gains in efficiency allow researchers to put more new content in their surveys and reap potentially greater gains per individual experiment. The only issue that arises in this framework is finding the information that will augment the current study and merging the datasets. The design of the current study allows one to download a data file with person demographics and raw scores and response times on the Number Series items and Need for Cognition items. Simple response models can be implemented to score the items as shown in Chapters 3 and 5. Response and response time models make more efficient use of the data collected to capture a more accurate representation of person ability on the Number Series task, as shown in Chapters 4 and 6.
The estimation of ability from item responses has been the basis of the current work up to this point. A basic outline of the procedure is given by the regression equation

\hat{Y}_j = \beta_0 + \sum_{i=1}^{6} \beta_i U_{ij} + \sum_{i=1}^{6} \gamma_i T_{ij} + \sum_{i=1}^{6} \delta_i (U_{ij} \times T_{ij}) + \epsilon_j .    (7.1)

The full item set score \hat{Y}_j is the outcome variable for person j, where U_{ij} is the response of person j to item i, T_{ij} is the response time of person j to item i, and U_{ij} \times T_{ij} is their interaction. The intercept \beta_0 provides a reference for individuals who get all items wrong and have an average response time on each item. There is some noise, \epsilon_j, to be expected when using item subsets from the complete item set; this error is unaccounted-for variance because less information is available when item responses are deleted and only six of the thirty items (a fifth of the item pool) are used. The previous chapter attempted to remedy this lack of accuracy by introducing response time as a viable source of information on person ability. Now, I look at the use of other incidental information that is easily obtainable and can be worked into a methodical scoring system.
The next evolution in adding response time to a model of ability is the joint response and response time model, the Conditionally Independent Response Time (CIRT) model (van der Linden, 2007). This model is notable for treating the two models simultaneously, estimating the person and item parameters for both models with covariances between them. This parameterization identifies the person parameters as ability and speededness and, analogously for the items, difficulties and time intensities. As indicated in Figure 7.1, the two models each have a level one model indicated by the responses or response times. The bridge between the models is at the second level, where the person parameters are allowed to covary and the item parameters are allowed to covary. Demographic covariates are then included as predictors of the person ability and speededness parameters.
Figure 7.1. Conditionally Independent Response Time model. The joint response and response time
models are shown in a Structural Equation Model format. The responses (Uij) and response times (Tij)
are indicators of the level 1 parameters for each model separately. Then the level 2 parameters link the
person parameters from the response and response time models and likewise with the item parameters.
It is level 2 which provides the link between the two models.
[Diagram: At Level 1, the response indicators Uij load on person ability θj with item parameters (ai, bi), and the response time indicators Tij load on person speededness τj with item parameters (αi, βi). At Level 2, the person parameters (μP, σP) and the item parameters (μI, ΣI) are linked, with the covariates Gender, Age, Education, and Income entering as predictors of the person parameters.]
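Written out, the structure in Figure 7.1 with covariates corresponds to something like the following; the level 1 forms follow van der Linden's (2007) hierarchical framework with a Rasch response model, and the regression coefficients (γ, λ) on the person parameters are notation introduced here purely for illustration.

```latex
% Level 1: Rasch response model and lognormal response time model
P(U_{ij}=1 \mid \theta_j) = \frac{\exp(\theta_j - b_i)}{1+\exp(\theta_j - b_i)},
\qquad
\ln T_{ij} \mid \tau_j \sim N\!\big(\beta_i - \tau_j,\ \alpha_i^{-2}\big).

% Level 2: the person parameters covary, and the covariates in Figure 7.1
% enter as predictors of their means (illustrative coefficients).
\theta_j = \gamma_0 + \gamma_1\,\mathrm{Gender}_j + \gamma_2\,\mathrm{Age}_j
         + \gamma_3\,\mathrm{Edu}_j + \gamma_4\,\mathrm{Income}_j + e_{\theta j},
\qquad
\tau_j = \lambda_0 + \lambda_1\,\mathrm{Gender}_j + \lambda_2\,\mathrm{Age}_j
       + \lambda_3\,\mathrm{Edu}_j + \lambda_4\,\mathrm{Income}_j + e_{\tau j},
\qquad
(e_{\theta j}, e_{\tau j})' \sim N(\mathbf{0}, \Sigma_P).
```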
Cognitive Ability and Processing Speed
There is a need to categorize and organize ideas into units that make sense and can be conveyed easily. To that end it is acknowledged that there are different theories of organization for cognitive abilities. Spearman (1904) provides evidence for a unified theory of ability in which all abilities lie under the label of intelligence: all types of abilities (e.g., memory, math, reading) are part of the same construct and one scale can be used to compare persons. For one who would like to emphasize the idiosyncrasies and strengths of individuals, this does not allow for useful comparisons in psychological research. The pendulum swung back in the other direction in an attempt to explain the observed individual differences among persons, with each ability as its own identifiable construct (Thurstone, 1938). The ideal position on the organization of cognitive ability would seem to lie somewhere between these extreme views, and it came in the form of Fluid and Crystallized Intelligence Theory (Gf-Gc Theory; Horn & Cattell, 1967). The major distinction in Gf-Gc theory is that there are branches into which abilities fall, with fluid intelligence and crystallized intelligence being the first major branches.
Speed of processing falls within the area of cognitive ability, but the major contention is how it fits into the ability framework. Most importantly, whether speed of processing regulates areas of cognitive functioning, and which of these areas are impacted, is a topic of much research (Conway, Cowan, Bunting, Therriault, & Minkoff, 2002; Nettelbeck, 2011; Salthouse, 1996; Sliwinski & Buschke, 1999). In general, the findings point to a positive association between cognitive ability factors and speed of processing factors. A study by Deary, Der, and Ford (2001) showed that there was a significant correlation between reaction time and a British test of mental ability. In a follow up to this study, Deary, Allerhand, and Der (2009) found that when the same individuals were retested after 13 years, the correlation between response time and mental ability held between assessment times. The longitudinal effect found by Sliwinski and Buschke (1999) was found to decrease when compared to the cross-sectional effect. These findings are interpreted as evidence that processing speed can, to some extent, indicate one's mental ability. They suggest that researchers must differentiate between within-time and across-time relationships in order to understand the dynamics of ability and speededness over age. Given that how mental ability constructs are defined may change depending on the framework used, there is potential for stronger or weaker correlations to exist.
Age Effects
As alluded to in Deary et al. (2009), there is concern about the effects of age on processing speed and the combined effects on mental ability. These effects become important for much of later life research, where the emphasis is on successful aging, defined as maintaining an independent and active lifestyle without major declines in cognitive abilities. In terms of aging, Deary's experiments test whether there is a relationship between mental ability and response time over testing periods and at different age ranges. As noted in Hertzog (2011), there are a number of factors that can influence changes in cognitive abilities as one ages, including "resources like working memory, processing speed, and inhibitory aspects of attention" (p. 182). Changes to such processes can influence response time and the ability to perform tasks over time (Hertzog, 2008; Salthouse, 1996).
Fluid and Crystallized Intelligence theory posits that some cognitive abilities grow through the formative years and plateau in old age, while other abilities rise until middle age and then start to decline (Horn & Cattell, 1967; Wechsler, 1939). Whether an ability holds has traditionally been split conceptually, with verbal abilities tending to remain constant and performance/response-synthesizing abilities tending not to hold. These factors and trends have been replicated in numerous studies (see McArdle, Ferrer-Caja, Hamagami, & Woodcock, 2002).
Gender Effects
In addition to age and education, the effects of gender differences in intelligence are important in understanding the effects of nature and nurture on intelligence (Halpern, Beninger, & Straight, 2011). Studying the effects of gender is made difficult because much test construction involves removing and replacing items that show bias between groups (Brody, 1992). Given this, it would follow that for an overall "g" interpretation of intelligence Jensen (1998) found no difference in mean scores between males and females. A potential issue with this aggregate view of intelligence is that variations in the individual abilities are lost, lumping individuals with different strengths together. It provides more insight to do domain specific testing of gender differences to see where these differences lie, if there are any. Another possible reason for gender differences in testing has been associated with sampling. Volkmar, Szatmari, and Sparrow (1993) noted that more males are mentally retarded and, as a result, are underrepresented in tests of cognitive ability. This is reflected in the study by Hyde, Lindberg, Linn, Ellis, and Williams (2007), which showed that more females do in fact take the SATs and have more variation in their scores. It would appear that tests may show bias because of the sampling techniques used. If a bias exists where females show lower scores because they include a wider range of ability levels, then subsequent testing would find the same sample bias if the same sampling techniques are used. Oversampling lower ability males could help overcome such biases, but this would have to be done in subsequent samples too, otherwise the bias would still be observable.
The gender bias in testing is of particular interest because of the cultural impact of testing. In tests such as the SATs and many cognitive tests, ability level is determined and schooling requirements are assessed. Halpern, Beninger, and Straight (2011) note that there are sex differences over the life span: females tend to be favored in reading, writing, and vocabulary tasks, whereas males are favored in spatial tasks. It is also of interest to note that sex differences have been found to remain over time rather than decreasing as individuals age (Hedges & Nowell, 1995). This meta-analytic study seems to indicate that if a bias exists for young adults, it should hold for older adults; there is no evidence for a dynamic relationship of cognitive ability over age due to gender. It should also be noted that gender differences can be overcome with training (Halpern, Beninger, & Straight, 2011): someone who is not proficient in a given task can be trained to perform at a higher standard, provided the training is effective.
Income and Education Effects
Since income and education seem to be linked, it makes sense to talk about them in tandem. This assumption is backed by numerous reports of the link between income and education (Griliches & Mason, 1972; Juster, 1975; Lynn & Vanhanen, 2002; Weede, 2006). These trends are apparent in many countries and replicated over time with additional individual demographic variables that could potentially be of interest (Rindermann, 2008). The body of work indicates that higher education leads to higher income at an aggregate level. From here, establishing links between education and cognitive ability and between income and cognitive ability determines whether these demographic variables are viable sources of variation in cognitive ability scores.
The relationship between cognitive ability and education has also been studied to some extent (Falch & Sandgren, 2011; Hansen, Heckman, & Mullen, 2004; Winship & Korenman, 1999). These studies indicate that education is tied to IQ, a broad definition of cognitive ability similar to "g". In particular, these studies took into account the amount of schooling obtained at the time and followed up with participants to see if earlier estimates of IQ and education matched later scores on these measures. Falch and Sandgren (2011) note that education and ability have a substantial relationship, with about 4 or 5 years of education amounting to an increase in IQ of about one standard deviation (15 points). The relationship between cognitive ability and income has likewise been shown to be positively correlated (Lynn & Vanhanen, 2006), and other research also provides correlational evidence for this account (Gottfredson, 1997, 2003; Jensen, 1998; Schmidt & Hunter, 2004; Strenze, 2007).
Attention then turns to the combined use of education and income with cognitive ability. In a series of studies there was evidence for the predictability of cognitive ability from both education and income for older women (Lee, Buring, Cook, & Grodstein, 2006; Lee, Kawachi, Berman, & Grodstein, 2003). These studies found a relationship between education and cognitive function and decline, whereas income had a weaker, if at all significant, relationship with functioning and decline of cognition. Similar studies of Italian populations have provided substantial relationships between cognitive ability and the demographic variables education and income (Lynn, 2010). In proposing a specific relationship between a fluid intelligence measure and these variables, the term IQ is being dissected into a subcomponent; in much of this work IQ has been used to mean a construct of problem solving and applied knowledge, similar to the function of fluid intelligence.
Hypotheses
One would expect the sample demographics to account for some variation in Number Series scores because of age and education differences. The major expected effect is that education has a positive impact on scores: the higher the education, the higher the score. Income goes along with this hypothesis; with higher incomes being related to higher educations, it is hypothesized that those with higher incomes perform significantly better. With age, some decline in scores is hypothesized because of the well-known decline in fluid intelligence with advancing age. The stated gender differences point to a bias towards males for such a task; based on earlier differential item functioning findings, it is also hypothesized that there is evidence for gender bias, although its extent is not defined in advance. It is expected that there is some attenuation of effects when other variables are accounted for in the models. In particular, the relationship between education and income is of importance, and how this affects predictability on the Woodcock-Johnson Number Series task matters for interpreting scores between persons.
Methods
Sample
The data come from the American Life Panel (ALP; Kapteyn, 2002), a nationally representative sample of adults 18 years old and above. Simple demographic statistics are provided again in Table 7.1. The sample had a mean age of 48.9 with a range of 18 to 109. The sample was 59.3% female and the ethnic composition was: White 84.2%; Hispanic 4.6%; Black 6.5%; Asian 1.3%; Other 3.4%. The average years of education attained was 11.5 years. The average income was $74,320, and for analyses the log of this value was used to bring extreme incomes closer to the mean (Gelman, 2008). Tests of the sample against national averages indicate that the ALP sample closely resembles the average American on the demographic aspects outlined above (Couper, Kapteyn, Schonlau, & Winter, 2007).
The income variable is treated specifically in these analyses. Log transformations were used to
bring extreme values in from the tails. As presented in Figure 7.2, the original distribution of income is
shown in Panel A and the transformed income values in Panel B. This treatment of the income variable is
typical of much research that uses such information in regression and other statistical analyses (Olsson,
2005).
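As a small numeric illustration of the transformation (using the summary values from Table 7.1 rather than the raw data), the log pulls the upper tail of income toward the center of the distribution:

```python
# A small illustration of the income transformation described above.
import numpy as np

# Minimum, median, mean, and maximum income values reported in Table 7.1,
# plus two intermediate values for context.
incomes = np.array([2500, 25000, 55000, 74230, 150000, 200000], dtype=float)
log_incomes = np.log(incomes)

print(np.round(log_incomes, 2))
# [ 7.82 10.13 10.92 11.21 11.92 12.21]  -- the $200,000 earner is now only
# about 1.3 log units above the median rather than roughly $145,000 above it.
```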
Table 7.1 Descriptive Statistics of the American Life Panel, complete
sample size N = 2548
N Mean Median Std Dev Min Max
Age 2548 48.9 50 14.8 18 109
Education 2548 11.5 11 2.08 3 16
Income 2123 74230 55000 60549 2500 200000
Female 2548 59.3%
NCPos 2533 59.8 62.5 17.9 0 100
NCNeg 2536 30.6 31.3 17.6 0 100
Ethnicity
-White 84.2%
-Hispanic 4.6%
-Black 6.5%
-Asian 1.3%
-Other 3.4%
Figure 7.2. Comparison of (a) raw income values with (b) log transformed income values. The skew in income that occurs because of the small number of high earners is dealt with by transforming the income values. A log transformation is typically used in much of economic research (Olsson, 2005).
Collection of the data begins by sending an email invitation to about 3,000 active participants in 49 states and the District of Columbia. All of the participants have reported that they are comfortable speaking and reading English. Since the survey is administered over the internet, a participant who did not have a computer and internet access was given both; those who already had both were given the cash value of the computer. The rate of dropout from the sample is low, with only about 3 people formally leaving the sample each month. Nonresponders are left in the sample and only deleted when the dataset is periodically cleaned, so the response rate for any single survey should take this group into account.
Measures
Number Series. The items used for the current analysis come from the Woodcock-Johnson III (WJ-III) set of tests of ability (WJ-III; Woodcock, McGrew, & Mather, 2001). The Number Series task is one of the tasks used to test math ability and problem solving skills. The task requires the participant to provide a missing number from a sequence of numbers so that the sequence makes a logical pattern. For example: if the sequence shown is 1, 2, 3, _, then the participant should provide the answer 4, as it makes the sequence increase by 1 with each successive number. The items were given in order of relative difficulty, with the easiest given first and progressively more difficult items following. There were 15 items per set and two sets of items were given to each participant. The presentation of set A and set B was counterbalanced across participants. If the correct answer was given for the blank the participant received a 1; if a wrong answer was given they received a 0.
Need for Cognition Scale. The Need for Cognition scale (NCS), consisting of 8 items of the original 18, was presented between the NS item sets (Cacioppo, Petty, & Kao, 1984). This set of items was placed between the two NS item sets and was meant to be a time filler that did not involve cognitive load. The scale is divided into two factors, positive and negative. The 4 items with the highest factor loadings on each factor were chosen, for a total of 8 items. In the NCS participants are asked to rate statements about whether they seek out and undertake applied cognitive tasks on a 5 point Likert scale, where 1 is very inaccurate and 5 is very accurate. The positive and negative scale scores were sums of item endorsements, rescaled to percentage scores between 0 and 100.
Response Times. The response times were the length of time that a stimulus item was presented to the respondent before they moved on to the next item. While not an exact indicator of the time it took a respondent to think of an answer and provide it, this is a good indication of the pace that a respondent takes during the course of testing. As this is a new test whose implementation is still being refined, the response times on the Number Series serve to help improve cognitive testing in internet surveys. The log transform of the time was used to pull in extreme scores, as is typically done with timed response data (Klein Entink, 2009). As shown in Chapter 4 of this text, the lognormal distribution is a good fit for the response time data collected according to the CIRT modeling implementation.
Planned Analyses
Multiple block regressions were performed on the 30 item total score with various item subsets used as predictors. These item subsets were outlined in the previous chapters as reduced item sets and adaptive item sets. The item subsets for the reduced item sets are given in Table 7.2, including the calculated item difficulties to give a sense of the scale coverage of the different reduction methods. The adaptive item sets started with item 15 (the middle of the scale), with the next item chosen based on a correct versus incorrect response; the Block Adaptive methods used a fixed item set followed by an adaptive set of items depending on performance on the first set. The sample demographics are used to identify how much variation can be accounted for simply by knowing person characteristics provided in a typical survey. Then item responses, response times, and the interaction between the two are added to determine whether the demographic relationships (age, education, income, gender) remain the same given the responses provided to each item set.
Each demographic variable is examined in particular for potential influences. In terms of aging, the age and speed relationships were examined while taking into account gender, education, and income effects. Likewise, gender effects were examined with the other variables accounted for, and education and income received the same treatment. These analyses provide insight into the complicated relationship of person demographics with cognitive abilities.
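A minimal sketch of the variable coding that these block regressions assume, following the conventions given later in the notes to Table 7.3, is given below; the data frame and its column names are hypothetical.

```python
# A minimal sketch of the demographic recoding used before entering the
# variables as a block of predictors: age centered at 50 and divided by 10
# (plus its square), education centered at 12 years and divided by 4, gender
# effect coded -0.5 (male) / +0.5 (female), and income log transformed.
import numpy as np
import pandas as pd

def recode_demographics(df):
    out = pd.DataFrame(index=df.index)
    out["age_c"] = (df["age"] - 50) / 10
    out["age_c2"] = out["age_c"] ** 2
    out["female"] = np.where(df["gender"] == "female", 0.5, -0.5)
    out["educ_c"] = (df["education_years"] - 12) / 4
    out["log_income"] = np.log(df["income"])
    out["nc_pos"] = df["nc_pos"]   # Need for Cognition scales kept on their 0-100 metric
    out["nc_neg"] = df["nc_neg"]
    return out
```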
Results
The first set of results examines whether the relationships between the predictor variables and the scores hold across the various scoring methods. These results are displayed in Table 7.3 for reference. In Table 7.3 the first set of weights gives the relationship of the total 30 item score with the demographic variables and the time filler Need for Cognition scores. The IRT model was run as a Rasch model with 30 items predicting ability on the Number Series task. The effect of age was negative, -1.07 points for every decade of age, and there was no quadratic effect of age. The effect of gender was a 3.41 point advantage for males over females. Every 4 years of education increased scores by 5.83 points and
Table 7.2. Item Subsets From Different Item Selection Methods (Item Difficulty Included for Comparison Purposes)
Random Fixed Cronbach's α Factor Best 6
Item Difficulty Item Difficulty Item Difficulty Item Difficulty Item Difficulty
Item 1 4A -3.03 2B -3.89 11A 0.94 8B 0.59 10A 1.60
Item 2 5A -4.01 3A -2.51 9B 0.92 11A 0.94 12A 0.55
Item 3 8A 0.60 10B -0.18 8B 0.59 7B 0.24 14A 2.56
Item 4 11A 0.94 8A 0.60 7B 0.24 9B 0.92 9B 0.92
Item 5 3B -3.48 9A 1.77 12A 0.55 10B -0.18 14B 2.12
Item 6 4B -1.77 15B 3.36 14B 2.12 14A 2.56 15B 3.36
Notes: Random and Fixed item subsets do not attempt to tailor items to the sample. The Alpha, Factor, and Best R2 items are picked as best for this sample of respondents. If the characteristics of the sample change, such as lower average ability, items with lower difficulties should be chosen. Items are shown in rank order of difficulty for the Alpha and Factor item sets. Difficulties are in log-units.
Table 7.3. Demographic Regression Weights for Predicting IRT Scores From (A) Reduced and (B) Adaptive Item Sets (N = 2073)
Outcome Intercept Age Age2 Female Education Income NC-Pos NC-Neg R2
A. Reduced Item Sets
30 Item Score 519.7 (2.85) -1.07 (0.15) -0.07 (0.08) -3.41 (0.46) 5.83 (0.45) -0.18 (0.23) 0.10 (0.02) -0.06 (0.02) 0.21
Random Score 519.0 (2.28) -0.58 (0.12) -0.01 (0.07) -1.89 (0.36) 3.12 (0.36) -0.09 (0.19) 0.04 (0.01) -0.05 (0.01) 0.11
Fixed Score 517.6 (2.25) -0.85 (0.12) -0.02 (0.07) -2.61 (0.36) 3.44 (0.35) -0.07 (0.19) 0.06 (0.01) -0.03 (0.01) 0.14
Alpha Score 518.9 (3.28) -0.84 (0.18) -0.05 (0.10) -3.16 (0.53) 5.81 (0.51) -0.20 (0.27) 0.10 (0.02) -0.08 (0.02) 0.17
Factor Score 518.5 (3.44) -0.79 (0.18) 0.04 (0.10) -3.64 (0.55) 6.22 (0.54) -0.12 (0.28) 0.10 (0.02) -0.08 (0.02) 0.17
Best R2 Score 517.8 (2.99) -1.05 (0.16) -0.01 (0.09) -3.45 (0.48) 5.14 (0.47) -0.13 (0.25) 0.09 (0.02) -0.06 (0.02) 0.17
B. Adaptive Item Sets
BATA Score 516.0 (2.76) -0.74 (0.15) -0.06 (0.08) -2.64 (0.44) 4.37 (0.43) -0.08 (0.23) 0.08 (0.01) -0.03 (0.01) 0.13
BATB Score 519.8 (2.69) -0.84 (0.14) -0.17 (0.08) -2.16 (0.43) 4.02 (0.42) -0.37 (0.22) 0.08 (0.01) -0.04 (0.01) 0.13
Half-Adapt Score 516.0 (2.69) -0.85 (0.14) -0.01 (0.08) -2.70 (0.43) 4.09 (0.42) -0.09 (0.22) 0.08 (0.01) -0.04 (0.01) 0.14
Timed HA Score 517.0 (2.70) -0.84 (0.14) -0.08 (0.08) -2.64 (0.43) 4.54 (0.42) -0.18 (0.22) 0.09 (0.01) -0.04 (0.01) 0.15
Adaptive Score 516.3 (2.80) -0.77 (0.15) -0.08 (0.08) -2.79 (0.44) 4.48 (0.44) -0.18 (0.23) 0.10 (0.02) -0.04 (0.02) 0.15
Notes: The Age variable is centered at age 50 and divided by 10. Female is effect coded -0.5 and +0.5 towards females. Education is centered at 12 years and divided by 4. The Need for Cognition scales (Positive: NC-Pos and Negative: NC-Neg) are scored 0 to 100, with higher scores meaning higher endorsement of the factor. Regression weights are provided with standard errors in parentheses. Significant values are in bold. Outcomes are scaled in W-units.
155
income did not have a significant effect on the total score. The Need for Cognition task was used as filler
that was intended to not put any cognitive load on the respondent. The relationship of the positive
factor on the 30 item score was minimal but significant, a 0.1 point increase for every point increase in
the positive factor score. The negative factor had an even weaker effect, a 0.06 point decrease for every point increase in the negative factor, but this was still significantly different from 0. Overall the model accounted for 21% of the variation in the total score. The next series of results was intended to test whether the relationships with the demographics held across the different scoring methods. The question to be answered was whether, if one subset of items or another were used to score respondents, the demographic relationships would remain the same in predicting Number Series scores.
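To make the coding of the predictors described in the table notes concrete, the regression of the 30 item score on the demographics can be written as an ordinary least squares model. The sketch below is a minimal Python illustration using statsmodels, assuming a hypothetical data frame alp with columns ns_total, age, female, educ_years, income, nc_pos, and nc_neg; it is not the code used to produce the reported estimates.

    import pandas as pd
    import statsmodels.formula.api as smf

    def code_predictors(df: pd.DataFrame) -> pd.DataFrame:
        """Apply the centering and coding described in the table notes (hypothetical column names)."""
        out = df.copy()
        out["age_c"] = (out["age"] - 50) / 10         # age in decades, centered at 50
        out["age_c2"] = out["age_c"] ** 2             # quadratic age term
        out["female_e"] = out["female"] - 0.5         # effect code: -0.5 male, +0.5 female
        out["educ_c"] = (out["educ_years"] - 12) / 4  # education in 4-year units, centered at 12
        return out

    # alp = pd.read_csv("alp_number_series.csv")      # hypothetical input file
    # fit = smf.ols("ns_total ~ age_c + age_c2 + female_e + educ_c + income + nc_pos + nc_neg",
    #               data=code_predictors(alp)).fit()
    # print(fit.summary())                            # weights on the scale of Table 7.3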
Predicting Reduced and Adaptive Scales with Demographic Variables
IRT Scoring. The scale scores from the item reduction techniques were regressed on the demographic variables, and these weights were compared to those outlined in the previous section. Of primary importance is whether the overall direction and significance of each weight is maintained across the various item reduction techniques. The effect of age in predicting reduced item set scores ranged between -0.58 and -1.05. These weights were attenuated relative to the 30 item score model. The quadratic effect of age was not significant in any of the regression models. The effect of being female
versus male ranged between -1.89 and -3.64, where males showed a small advantage just as before.
Here the effects of gender ranged from underestimating to slightly overestimating point differences
between males and females. Again education was found to be important (3.12 to 6.22 points per 4 years
of education) and weights were in the range of the 30 item score weights, but income still showed no
significant influence on the scale score. The Need for Cognition scales again showed minor
improvements for the positive scale (0.04 to 0.10 for 1 point increase) and minor decreases for the
negative scale (-0.03 to -0.08 for 1 point increase). There did not appear to be any bias in weights for
156
30 Item Random Fixed Alpha Factor Best R2 BATA BATB Half-Adapt Timed HA Adaptive
30 Item Score 1.00 sym.
Random Score 0.72 1.00
Fixed Score 0.80 0.67 1.00
Alpha Score 0.87 0.68 0.61 1.00
Factor Score 0.85 0.65 0.63 0.90 1.00
Best R2 Score 0.90 0.60 0.70 0.86 0.81 1.00
BATA Score 0.80 0.62 0.63 0.63 0.62 0.75 1.00
BATB Score 0.81 0.51 0.65 0.71 0.71 0.70 0.52 1.00
Half-Adapt Score 0.86 0.57 0.72 0.70 0.69 0.77 0.82 0.68 1.00
Timed HA Score 0.84 0.58 0.68 0.69 0.69 0.77 0.85 0.66 0.88 1.00
Adaptive Score 0.87 0.59 0.66 0.72 0.71 0.81 0.85 0.70 0.85 0.87 1.00
Table 7.4.A. IRT Scale Score Correlations for 30 item Scale, Reduced Item Set Scales, and Adaptive Scales
Notes: The correlations of each adapted score with the 30 item score are highlighted in the first column. The reduced item sets are
listed first with the adaptive item sets lower in the listings.
157
these predictors compared to the 30 item scale scores. It is noted that there were differences in explained variance that tracked the correlation of each reduced scale with the total 30 item scale score: the more a reduced item score resembled the total 30 item score, the closer the R2 for its regression was to the R2 for the 30 item regression. For reference, the scale correlations are given in Table 7.4.A. Comparing the correlations of the reduced and adaptive scales with the 30 item scale, highlighted in Table 7.4.A, to the explained variances in Table 7.3 shows that the two are highly related: the higher the correlation of a reduced or adaptive scale with the full scale score, the higher the explained variance of the regression with demographics as predictors.
Next, analyses focused on the adaptive item set scores as outcomes for the demographic variables. Starting with the effects of age, there seemed to be a downward bias, where the effect of age was lower than in the 30 item scale score (-1.07 vs. -0.73 to -0.85). The quadratic effect of age was only significant in the BAT-B scoring method, and only marginally so (-0.17, SE = 0.08). The effect of gender also showed a downward bias, with males and females showing a smaller difference in mean scores between groups (-3.41 vs. -2.16 to -3.09). Similarly, there was a bias in the weights for the effects of education (5.83 vs. 4.01 to 4.69) and no effect of income on the scale scores. The Need for Cognition scores provided similar weights to the 30 item scores (positive scale: 0.10 vs. 0.08 to 0.10; negative scale: -0.06 vs. -0.03 to -0.06). The resulting theme in comparing reduced and adaptive item sets is that the reduced item sets show weights that vary above and below the 30 item score weights, whereas demographic weights in the adaptive item sets seem to be biased downward, with differences in age, education, and sex being smaller than when these variables predict the 30 item scale score.
158
CIRT Scoring. Just as before, the scores from each method were regressed on the demographic
variables to test for the stability of the scoring methods. When changing from an IRT scoring process to the CIRT, one would assume the relationships would hold as they do in Table 7.3. With the CIRT framework there is the opportunity to analyze person speededness as a factor beyond just raw response times. The relationship of this speededness with Number Series ability is shown in Table 7.4.B. The correlation between the ability and speededness scores was shown to be r = 0.19. The results of the
analysis of demographics on the person ability and speededness scores are represented in Table 7.5.
Part A of the table explains the effects on ability scores and Part B the effects on Speededness scores,
with the 30 item CIRT score outcome being the first line in each. In the ability regression equation the
relationships of age (-0.74), gender (-2.23), education (3.67), the positive scale of need for cognition (0.06), and the negative scale (-0.04) were all significant. The quadratic effect of age and income both did not significantly predict total scores, and the calculated R2 was 0.21. In the speededness equation only age (-0.11) and education (0.05) were significant predictors. The explained variance for the complete regression was R2 = 0.19, and when age was the only predictor the explained variance remained at this level.
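For reference, the joint model being scored here pairs a response model with a lognormal response time model. The fragment below is a minimal Python sketch of a person-level joint log-likelihood under a Rasch response model and a lognormal response time model in the spirit of van der Linden (2007); the parameter names (theta for ability, tau for speededness, b for item difficulty, lam for item time intensity, sigma for item time spread) and the example values are illustrative, and the reported results were obtained with the MCMC estimation described in Chapter 4.

    import numpy as np
    from scipy import stats

    def joint_loglik(theta, tau, y, logt, b, lam, sigma):
        """Person-level log-likelihood: Rasch responses plus lognormal response times.

        theta: ability, tau: speededness (higher = faster),
        b: item difficulties, lam: item time intensities, sigma: item time SDs."""
        p = 1.0 / (1.0 + np.exp(-(theta - b)))                      # probability of a correct response
        ll_resp = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # response part
        ll_time = np.sum(stats.norm.logpdf(logt, loc=lam - tau, scale=sigma))  # log-time part
        return ll_resp + ll_time

    # Illustrative values for three items and one respondent:
    b = np.array([-1.0, 0.5, 2.0]); lam = np.array([3.0, 3.4, 3.8]); sigma = np.array([0.5, 0.5, 0.5])
    y = np.array([1, 1, 0]); logt = np.log([20.0, 45.0, 80.0])
    print(joint_loglik(theta=0.3, tau=0.1, y=y, logt=logt, b=b, lam=lam, sigma=sigma))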
Score Speed Age Age2 Gender Education Income NC-Pos NC-Neg
30 item Score 1.00 sym.
30 Item Speed 0.19 1.00
Age -0.14 -0.41 1.00
Age2 -0.01 0.05 -0.15 1.00
Gender -0.19 0.00 -0.05 -0.02 1.00
Education 0.32 0.02 0.09 -0.07 -0.05 1.00
Income -0.01 0.00 0.03 -0.02 -0.01 0.01 1.00
NC-Pos 0.33 0.06 -0.08 0.00 -0.18 0.26 0.01 1.00
NC-Neg -0.26 -0.01 0.00 0.09 0.08 -0.25 -0.02 -0.56 1.00
Table 7.4.B. Scale Correlations of CIRT Scores and Speededness with Demographic Variables (N = 2458)
159
Table 7.5.A. Demographic Regression Weights for Predicting CIRT Ability Scores From (A) Reduced and (B) Adaptive Item Sets and Response Times (N = 2086)
Outcome Age Age2 Female Education Income NC-Pos NC-Neg R2
A. Reduced Item Sets
30 Item Ability -0.74 (0.10) -0.04 (0.05) -2.23 (0.29) 3.67 (0.28) -0.10 (0.15) 0.06 (0.01) -0.04 (0.01) 0.21
Random Ability -0.82 (0.08) -0.03 (0.05) -1.33 (0.25) 2.42 (0.25) 0.00 (0.13) 0.03 (0.01) -0.03 (0.01) 0.14
Fixed Ability -0.78 (0.09) -0.01 (0.05) -2.01 (0.27) 2.60 (0.26) -0.04 (0.14) 0.04 (0.01) -0.02 (0.01) 0.15
Alpha Ability -0.69 (0.15) -0.03 (0.08) -2.62 (0.44) 4.66 (0.43) -0.18 (0.23) 0.09 (0.01) -0.07 (0.01) 0.17
Factor Ability -0.61 (0.14) 0.02 (0.08) -2.79 (0.43) 4.83 (0.42) -0.11 (0.22) 0.08 (0.01) -0.06 (0.01) 0.17
Best R2 Ability -0.51 (0.11) -0.01 (0.06) -2.17 (0.32) 3.42 (0.31) -0.09 (0.16) 0.06 (0.01) -0.04 (0.01) 0.16
B. Adaptive Item Sets
BATA Ability -0.50 (0.09) -0.04 (0.05) -1.74 (0.28) 2.80 (0.27) -0.06 (0.14) 0.05 (0.01) -0.02 (0.01) 0.14
BATB Ability -0.61 (0.09) -0.11 (0.05) -1.47 (0.28) 2.58 (0.27) -0.24 (0.14) 0.05 (0.01) -0.03 (0.01) 0.14
Half-Adapt Ability -0.51 (0.10) 0.00 (0.05) -1.85 (0.29) 2.72 (0.28) -0.07 (0.15) 0.05 (0.01) -0.03 (0.01) 0.14
Adaptive Ability -0.51 (0.10) -0.06 (0.06) -1.92 (0.30) 3.04 (0.30) -0.13 (0.16) 0.07 (0.01) -0.02 (0.01) 0.15
Notes: Age variable was centered at age 50 and divided by 10. Female was effect coded -0.5 and +0.5 towards
females. Education was centered at 12 years and divided by 4. The Need for cognition scales (Positive: NC-Pos and
Negative: NC-Neg) were scored 0 to 100 with higher scores meaning higher endorsement of the factor. Regression
weights are provided with standard errors in parentheses. Significant values are in bold. Outcome scaled in W-
units.
160
Table 7.5.B. Demographic Regression Weights for Predicting CIRT Speededness Scores From (A) Reduced and (B) Adaptive Item Sets and Response Times (N = 2073)
Outcome Age Age2 Female Education Income NC-Pos NC-Neg R2 R2 (Age only)
A. Reduced Item Sets
30 item Speed -0.11 (0.00) 0.00 (0.00) -0.02 (0.01) 0.05 (0.01) 0.01 (0.01) 0.00 (0.00) 0.00 (0.00) 0.19 0.19
Random Speed -0.09 (0.00) -0.01 (0.01) -0.02 (0.01) 0.09 (0.01) 0.01 (0.01) 0.00 (0.00) 0.00 (0.00) 0.17 0.13
Fixed Speed -0.08 (0.00) 0.00 (0.00) -0.03 (0.01) 0.03 (0.01) 0.00 (0.01) 0.00 (0.00) 0.00 (0.00) 0.15 0.14
Alpha Speed -0.08 (0.01) 0.00 (0.00) -0.06 (0.01) 0.03 (0.01) 0.01 (0.01) 0.00 (0.00) 0.00 (0.00) 0.14 0.12
Factor Speed -0.08 (0.01) 0.00 (0.00) -0.06 (0.01) 0.04 (0.01) 0.01 (0.01) 0.00 (0.00) 0.00 (0.00) 0.12 0.11
Best R2 Speed -0.10 (0.01) 0.00 (0.00) -0.02 (0.02) -0.09 (0.02) 0.00 (0.01) 0.00 (0.00) 0.00 (0.00) 0.15 0.11
B. Adaptive Item Sets
BATA Speed -0.08 (0.01) 0.00 (0.00) -0.01 (0.02) 0.00 (0.01) 0.00 (0.01) 0.00 (0.00) 0.00 (0.00) 0.12 0.12
BATB Speed -0.07 (0.01) 0.00 (0.00) -0.03 (0.01) -0.02 (0.01) 0.00 (0.01) 0.00 (0.00) 0.00 (0.00) 0.11 0.10
Half Adapt Speed -0.07 (0.01) 0.00 (0.00) 0.01 (0.02) -0.05 (0.02) 0.00 (0.01) 0.00 (0.00) 0.00 (0.00) 0.11 0.09
Adaptive Speed -0.09 (0.01) 0.00 (0.00) -0.02 (0.02) -0.03 (0.02) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.13 0.11
Notes: Age variable was centered at age 50 and divided by 10. Female was effect coded -0.5 and +0.5 towards females. Education was
centered at 12 years and divided by 4. The Need for cognition scales (Positive: NC-Pos and Negative: NC-Neg) were scored 0 to 100
with higher scores meaning higher endorsement of the factor. Regression weights are provided with standard errors in parentheses.
Significant values are in bold. Outcome scaled in log(time) units.
161
As outlined with the IRT scoring, the reduced item sets produced weights that varied around the
estimates for the 30 item ability score. With the adaptive item sets there was an underestimation of the
influence of the demographics on these ability scores. In terms of the speededness scores, the trend was
that age was the constant factor in predicting speededness. Education and gender both had roles that
fluctuated with scoring method, but these effects were neither consistent nor as strong as that of age. In fact, a comparison of the explained variance of models with all demographics against models with age as the only predictor showed that most of the
explained variance could be attributed to age.
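A simple way to reproduce this comparison is to fit the full demographic model and an age-only model to the same speededness scores and compare their R2 values. The sketch below assumes a hypothetical data frame speed_df with columns coded as in the table notes (speed, age_c, age_c2, female_e, educ_c, income, nc_pos, nc_neg).

    import statsmodels.formula.api as smf

    def r2_comparison(speed_df):
        """R-squared from the full demographic model versus an age-only model of speededness."""
        full = smf.ols("speed ~ age_c + age_c2 + female_e + educ_c + income + nc_pos + nc_neg",
                       data=speed_df).fit()
        age_only = smf.ols("speed ~ age_c", data=speed_df).fit()
        return full.rsquared, age_only.rsquared

    # full_r2, age_r2 = r2_comparison(speed_df)
    # A small gap between the two values, as in Table 7.5.B, means age carries
    # most of the demographic information about speededness.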
Predicting Full Scale Score with Item Responses, Response Times, and Demographic Variables
Regression Method with Item Responses. In Chapter 6 the relationship of item responses and response times in predicting the total score from reduced item sets was identified. In this chapter, the relationship between the demographic variables and the total score has been identified. Now a model is introduced in which item responses, response times, and demographics are used together to predict total scores. The motivation is that the additional information in the demographic variables can be used to gain further accuracy in predicting the full scale score beyond the reduced item set information. First the random item set was examined. These results are displayed in Table 7.6, with Model 1 representing prediction of the total score with responses and response times, and Model 2 adding the demographic variables to the previous model's predictors. Only the reduced item sets are treated this way because the amount of missing predictors in the adaptive item sets creates problems for estimating the models.
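The two steps compared in Table 7.6 can be written as nested regressions: Model 1 uses the six item responses, the six log response times, and their interactions, and Model 2 adds the demographic block. The sketch below is a hypothetical Python illustration of how those model formulas can be built; the column names are placeholders, not the variable names used in the actual analysis programs.

    import statsmodels.formula.api as smf

    def build_formulas(n_items=6):
        """Model 1: responses, log times, and response-by-time interactions.
        Model 2: the same terms plus the demographic block."""
        items = [f"y{j}" for j in range(1, n_items + 1)]      # scored responses (0/1)
        times = [f"logt{j}" for j in range(1, n_items + 1)]   # log response times
        inters = [f"y{j}:logt{j}" for j in range(1, n_items + 1)]
        base = " + ".join(items + times + inters)
        demos = "age_c + age_c2 + female_e + educ_c + income + nc_pos + nc_neg"
        return f"score30 ~ {base}", f"score30 ~ {base} + {demos}"

    m1_formula, m2_formula = build_formulas()
    # m1 = smf.ols(m1_formula, data=subset_df).fit()   # corresponds to the "No Demos" rows
    # m2 = smf.ols(m2_formula, data=subset_df).fit()   # corresponds to the "Add Demos" rows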
For the Random item set (A), the improvement in R2 was from 0.56 to 0.60. The weights for item 5, time 2, and interactions 2 and 5 were not significant in Model 2, but were significant in Model 1. The results for the demographic variables mimicked those found in the previous section: age (-0.32), education (2.65), gender (-1.80), and the positive scale of Need for Cognition (0.06) all had significant weights. With the Fixed item set (B), the improvement in R2 was from 0.72 to 0.73. The weight for interaction 5 was not significant in Model 2, but was significant in Model 1. The results for the demographic variables were similar to the previous analysis: age (-0.36), education (1.51), gender (-1.08), and the positive scale of Need for Cognition (0.04) all had significant weights. Next, for the Alpha item set (C), the improvement in R2 was from 0.78 to 0.79. The weights for time 2 and interaction 3 were significant in Model 2, but not in Model 1. The results for the demographic variables were similar to those found previously: age (-0.25), education (1.32), gender (-0.93), and the positive scale of Need for Cognition (0.03) all had significant weights. Then, for the Factor item set (D), the improvement in R2 was from 0.75 to 0.76. The significant weights in Model 2 matched the significant item response and response time weights from Model 1, and the results for the demographic variables were again similar: age (-0.40), education (1.10), gender (-0.85), and the positive scale of Need for Cognition (0.04) all had significant weights. Lastly, with the Best R2 item set (E), the improvement in R2 was from 0.83 to 0.84. The significant Model 2 weights were the same set as in Model 1, and the results for the demographic variables mirrored those found in the previous section: age (-0.31), education (1.26), gender (-0.66), and the positive scale of Need for Cognition (0.03) all had significant weights.
163
R2 Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Time 1 Time 2 Time 3 Time 4 Time 5 Time 6
1. No Demos 0.56 6.44 (5.27) 20.1 (7.61) 16.6 (1.80) 20.8 (1.62) 19.6 (6.23) 8.59 (2.88) -0.60 (1.53) 4.29 (2.05) 1.36 (0.41) 2.17 (0.41) 2.59 (1.94) -1.00 (0.96)
2. Add Demos 0.60 5.00 (5.58) 21.5 (9.06) 16.2 (1.91) 18.6 (1.70) 11.3 (7.17) 7.45 (2.95) -0.83 (1.62) 4.41 (2.44) 1.49 (0.45) 1.91 (0.43) 0.82 (2.24) -0.91 (0.97)
1. No Demos 0.72 5.85 (4.56) 5.86 (3.27) 10.5 (1.49) 8.70 (1.47) 8.61 (1.31) 15.2 (1.65) -1.60 (1.32) -2.46 (1.09) 0.36 (0.44) -0.44 (0.34) 2.33 (0.28) 3.25 (0.20)
2. Add Demos 0.73 8.94 (5.06) 3.72 (3.38) 10.1 (1.62) 9.17 (1.62) 8.12 (1.42) 14.5 (1.85) -0.85 (1.46) -2.94 (1.12) 0.46 (0.47) -0.17 (0.38) 2.10 (0.31) 3.13 (0.23)
1. No Demos 0.78 11.4 (1.25) 8.07 (1.29) 6.83 (1.23) 6.99 (1.50) 4.45 (1.24) 11.6 (1.16) 0.14 (0.31) 0.47 (0.31) -0.16 (0.33) -0.36 (0.39) -0.08 (0.33) 0.78 (0.21)
2. Add Demos 0.79 10.3 (1.35) 9.07 (1.36) 6.49 (1.31) 6.03 (1.60) 4.12 (1.31) 12.5 (1.27) -0.09 (0.33) 0.87 (0.33) -0.10 (0.35) -0.39 (0.41) 0.04 (0.35) 0.66 (0.22)
1. No Demos 0.75 8.39 (1.30) 10.9 (1.30) 4.03 (1.59) 8.43 (1.37) 3.94 (1.48) 8.91 (1.47) -0.12 (0.34) -0.29 (0.33) -1.01 (0.41) 0.10 (0.33) 0.78 (0.43) 1.30 (0.21)
2. Add Demos 0.76 7.34 (1.40) 10.1 (1.40) 3.57 (1.70) 9.38 (1.47) 4.08 (1.61) 8.27 (1.58) -0.05 (0.38) -0.41 (0.35) -0.93 (0.44) 0.51 (0.34) 0.84 (0.47) 1.25 (0.23)
1. No Demos 0.83 7.57 (0.98) 5.03 (1.08) 8.34 (1.26) 9.67 (1.09) 9.01 (1.39) 6.27 (1.39) 0.13 (0.23) -0.96 (0.30) 1.10 (0.20) -0.22 (0.26) 0.62 (0.20) 0.56 (0.20)
2. Add Demos 0.84 7.77 (1.05) 4.66 (1.14) 7.72 (1.36) 9.78 (1.17) 9.60 (1.16) 4.38 (1.53) 0.08 (0.25) -0.85 (0.31) 1.19 (0.21) 0.03 (0.27) 0.61 (0.22) 0.47 (0.21)
Notes: Table shows the effect of item responses and response time only information in predicting full scale score (1), then adds demographics as a second step (2) in
prediction. The estimated weights for each equation are shown to highlight any differences that may arise in estimation of weights when demographics are added to the
multiple regression equation. Outcome scaled in W-units.
Table 7.6. Regression Models with Item Response, Response Times, Interactions and Demographics Predicting Full Scale Scores. (N=2,011)
(A) Random Items
(B) Fixed Score
(C) Alpha Items
(D) Factor Items
(E) Best R2 Items
164
R2 Int 1 Int 2 Int 3 Int 4 Int 5 Int 6 Age Age2 Female Education Income NC-Pos NC-Neg
1. No Demos 0.56 -0.51 (1.58) -4.22 (2.09) -2.42 (0.48) -2.97 (0.51) -3.79 (1.96) -0.62 (0.99) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.60 -0.05 (1.66) -4.55 (2.47) -2.57 (0.51) -2.57 (0.53) -1.86 (2.24) -0.40 (1.02) -0.32 (0.12) -0.05 (0.06) -1.80 (0.33) 2.65 (0.33) -0.16 (0.17) 0.06 (0.01) -0.02 (0.01)
1. No Demos 0.72 -0.09 (1.35) 0.13 (1.11) -1.06 (0.48) -0.90 (0.39) -0.69 (0.35) -1.73 (0.43) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.73 -0.59 (1.49) 0.92 (1.15) -1.12 (0.52) -1.16 (0.43) -0.66 (0.38) -1.71 (0.47) -0.36 (0.10) 0.00 (0.05) -1.08 (0.28) 1.51 (0.28) -0.15 (0.14) 0.04 (0.01) -0.01 (0.01)
1. No Demos 0.78 -1.99 (0.38) -1.14 (0.37) -0.74 (0.39) -0.81 (0.45) 0.27 (0.38) -1.31 (0.32) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.79 -1.71 (0.41) -1.45 (0.39) -0.82 (0.42) -0.64 (0.48) 0.32 (0.41) -1.61 (0.35) -0.25 (0.09) -0.02 (0.04) -0.93 (0.24) 1.32 (0.24) -0.01 (0.12) 0.03 (0.01) 0.01 (0.01)
1. No Demos 0.75 -1.29 (0.41) -1.49 (0.40) -0.08 (0.48) -1.24 (0.39) 0.03 (0.47) -0.54 (0.38) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.76 -1.11 (0.45) -1.28 (0.43) -0.04 (0.51) -1.56 (0.42) -0.17 (0.51) -0.45 (0.40) -0.40 (0.09) -0.09 (0.05) -0.85 (0.26) 1.10 (0.26) -0.08 (0.13) 0.04 (0.01) -0.01 (0.01)
1. No Demos 0.83 -0.79 (0.30) 0.16 (0.34) -0.95 (0.32) -1.00 (0.32) -1.12 (0.29) -0.23 (0.35) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.84 -0.88 (0.31) 0.22 (0.36) -0.85 (0.35) -1.18 (0.34) -1.36 (0.32) 0.13 (0.39) -0.31 (0.08) -0.00 (0.04) -0.66 (0.22) 1.26 (0.22) -0.07 (0.11) 0.03 (0.01) 0.00 (0.01)
(B) Fixed Score
(C) Alpha Items
(D) Factor Items
(E) Best R2 Items
Table 7.6. Continued.
(A) Random Items
165
Regression Method with CIRT Scores. As mentioned above, the CIRT framework allows both the ability and speededness scores to be used as predictors of the total ability score. First, the results for the reduced item sets, as shown in Table 7.7 (A-E), were examined. For the random item set, the explained variance went from 0.53 to 0.58 when demographics were added to ability and speededness as predictors. When demographics were omitted, ability (0.90) and speededness (-1.23) were significant. This relationship was maintained when demographics were added (0.80 and -1.38), and age (-0.21), gender (-1.18), education (1.85), and the positive need for cognition scale (0.04) were significant. When the fixed item set was used, the explained variance went from 0.62 to 0.66 when demographics were added. Again, the ability (0.81) and speededness (-1.51) scores were significant, and age (-0.23), gender (-0.65), education (1.62), and the positive and negative need for cognition scales (0.03 and -0.02) were significant. Next, the alpha item set showed an improvement in explained variance from 0.74 to 0.76 when demographics were added. The significant weights were for ability (0.54), speededness (1.95), age (-0.22), gender (-0.71), education (1.11), and the positive need for cognition scale (0.01). Then the factor item set saw an increase in R2 from 0.72 to 0.73. Significant weights for the factor item set were found for ability (0.54), speededness (1.66), age (-0.28), gender (-0.63), education (0.98), and the positive need for cognition scale (0.02). Finally, the best R2 item set had an increase in R2 from 0.79 to 0.80 when demographics were added as predictors.
166
The next set of regressions dealt with the adaptive item sets and how well demographics aided prediction of the total score once ability and speededness were accounted for, as shown in Table 7.7 (F-I). The gains in explained variance were again largest for the scoring methods whose scores correlate less with the full scale (BAT-A and BAT-B) and smaller for the more tailored item subsets (Half-Adaptive and Fully Adaptive). While the predictive power of the adapted item subset score was highly indicative of total item set performance, the speededness regression weight fluctuated with the type of item subset adaptation used (-1.03 to 0.26). The effect of age was consistent with the previous analyses in that older respondents tended to have lower scores (-0.41 to -0.23). The effect of gender was significant, with females showing a slightly lower score across the item sets (-1.04 to -0.67). The role of education held as well, with more education associated with an increase in scores (1.25 to 1.59).
R2 Ability Speed Age Age2 Female Education Income NC-Pos NC-Neg
1. No Demos 0.53 0.90 (0.02) -1.23 (0.36) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.58 0.80 (0.02) -1.38 (0.37) -0.21 (0.08) -0.03 (0.04) -1.18 (0.21) 1.85 (0.21) -0.09 (0.11) 0.04 (0.01) -0.01 (0.01)
1. No Demos 0.62 0.89 (0.02) -1.37 (0.33) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.66 0.81(0.02) -1.51 (0.34) -0.23 (0.07) -0.03 (0.03) -0.65 (0.19) 1.62 (0.19) -0.07 (0.10) 0.03 (0.01) -0.02 (0.01)
1. No Demos 0.74 0.57 (0.01) 2.39 (0.25) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.76 0.54 (0.01) 1.95 (0.27) -0.22 (0.06) -0.02 (0.03) -0.71 (0.16) 1.11 (0.16) -0.02(0.08) 0.01 (0.00) -0.00 (0.01)
1. No Demos 0.72 0.58 (0.01) 2.14 (0.24) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.73 0.54 (0.01) 1.66 (0.25) -0.28 (0.06) -0.05 (0.03) -0.63 (0.17) 0.98 (0.17) -0.06 (0.09) 0.02 (0.01) -0.01 (0.01)
1. No Demos 0.79 0.86 (0.01) 1.27 (0.18) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.80 0.81 (0.01) 1.04 (0.19) -0.22 (0.05) -0.04 (0.03) -0.46 (0.15) 1.02 (0.14) -0.04 (0.07) 0.02 (0.00) -0.01 (0.01)
1. No Demos 0.68 0.89 (0.01) -0.67 (0.25) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.71 0.81 (0.01) -1.03 (0.25) -0.41 (0.06) -0.01 (0.03) -0.83 (0.18) 1.39 (0.18) -0.06 (0.09) 0.02 (0.01) -0.02 (0.01)
1. No Demos 0.66 0.88 (0.01) 0.46 (0.29) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.69 0.81 (0.01) 0.26 (0.29) -0.23 (0.06) 0.05 (0.03) -1.04 (0.18) 1.59 (0.18) 0.09 (0.09) 0.02 (0.01) -0.2 (0.01)
1. No Demos 0.75 0.91 (0.01) -0.06 (0.21) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.77 0.84 (0.01) -0.28 (0.21) -0.33 (0.05) -0.04 (0.03) -0.67 (0.16) 1.37 (0.15) -0.05 (0.08) 0.02 (0.01) -0.01 (0.01)
1. No Demos 0.75 0.85 (0.01) -0.21 (0.20) =0 =0 =0 =0 =0 =0 =0
2. Add Demos 0.77 0.79 (0.01) -0.54 (0.21) -0.38 (0.06) 0.00 (0.03) -0.72 (0.16) 1.25 (0.16) 0.00 (0.08) 0.01 (0.01) -0.02 (0.01)
Table 7.7. Regression Models with CIRT Ability, Speededness, and Demographics Predicting Full Scale Scores. (N=2,073)
Notes: Table shows the effect of item responses and response time only information in predicting full scale score (1), then adds
demographics as a second step (2) in prediction. The estimated weights for each equation are shown to highlight any differences that may
arise in estimation of weights when demographics are added to the multiple regression equation. Outcome scaled in W-units.
(F) BAT-A Items
(G) BAT-B Items
(H) Half Adapt Items
(I) Adaptive Items
(A) Random Items
(B) Fixed Items
(C) Alpha Items
(D) Factor Items
(E) Best R2 Items
167
The effects of Need for Cognition were inconsistent and provided only minimal improvements or decreases in predicted scores.
Conclusion
The intent of the above analyses was to provide evidence for the inclusion of demographic
variables as predictors of total score performance. The first step was to establish how the demographic
variables related to the ability and speededness scores. In the IRT framework, the ability scores from the
various item selection methods provided outcomes that the demographics would be regressed on. The
relationships found with the 30 item score held through all of the subset scores, with some loss in
accuracy. This is attributed to the fact that scores based on a subset of the full item set will be less accurate, and thus the relationships with the demographics will hold only to the extent that the subset scores reflect the total item set score. This is illustrated by Figure 7.3, where the largest gains in explained variance are for the less informed item subsets that do not attempt to tailor items to participants and whose ability scores do not correlate as well with the total item set ability score.
The next set of results moved to the CIRT framework to look at the constructs of ability and speededness in tandem. When the joint model results were compared to the IRT results, the demographic relationships were very similar across both ability estimates. The speededness factor score was heavily related to the age of the person, which could be shown to account for most of the explained variance. There was a small correlation between ability and speededness, but not to the
extent that has previously been shown to be helpful in improving ability score estimation (van der
Linden, Klein Entink, & Fox, 2010).
168
Figure 7.3. Plots of explained variance without (blue line) and with (red line) demographics. Note the
improvement in R2 is related to the degree to which the item subset ability score is correlated to the
complete item set score. Larger gains are made for lower correlations with total score (i.e., Random,
Fixed and BAT scoring methods).
Then a comparison of scoring methods using raw responses and response times against CIRT ability and speededness estimates was done, which showed that performance on the item subsets, together with item response times, helped in predicting total item set scores. It is noted that in the previous chapter, the addition of
response time to the model provided limited improvement, but larger improvements were made for less
informative item sets. That is, for the item subset scores with a lower correlation to the total item set
score, there is more room for improvement. Both response time and demographics can be useful in getting to a more accurate score if there is uncertainty about the items chosen when doing reduced or adaptive testing.
In regard to the demographics, there is indeed some value in noting that these tests provide similar results across the board. In particular, the differences in performance across ages indicate that there may be some age-related performance differences (Horn & Cattell, 1967). In the CIRT framework, older
169
respondents have a negative relationship with speed: as they report older ages, their speed goes down. The slower performance in older adults matters because speededness was found to be a significant predictor of total score performance for most of the item subsets. Even with this relationship between age and speed accounted for, a negative relationship between age and ability was still present. The mean differences between genders were consistent but small when compared to the scaling of the ability score.
A great deal of time was spent noting the link between education and income. In our data this link was not found. Additionally, there was no impact of income on ability scores. Education did have a strong positive impact, which one would expect given the nature of the task. The sample is actively involved in technology and capable of using computers effectively. This may be one reason for the lack of an income effect; alternatively, the sample being largely restricted to middle-class Americans means that only a limited range of incomes is actually sampled. Perhaps a sample with wider income levels and data with finer income detail would be able to shed light on this point. It would be beneficial to test the effect of income by introducing a more detailed income variable that allows for more variation (instead of values determined by the number of income brackets allowed).
In ending this discussion, it has been shown that there are good reasons to include demographics as predictors in ability estimation: to be sure that the most is extracted from the limited information available from a single data collection point, and to account for the chance that the items may not be ideal for the sample. If items are too easy or too hard, they will not be as informative as items at the sample's ability level. As shown above, this leads to a loss of scoring accuracy, and demographics add more to such less accurate scores than they do to scores that are already accurate.
170
References to Chapter 7
Brody, N. (1992). Intelligence (2
nd
ed.) New York, NY: Academic Press.
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Conway, A.R.A., Cowan, N., Bunting, M.F., Therriault, D.J., & Minkoff, S.R.B. (2002). A latent
variable analysis of working memory capacity, short-term memory capacity, processing speed, and
general fluid intelligence. Intelligence, 30, 163-183.
Couper, M., Kapteyn, A., Schonlau, M., & Winter, J. (2007). Noncoverage and nonresponse in an
internet survey. Social Science Research, 36, 131-148.
Deary, I.J., Der, G., & Ford, G. (2001). Reaction times and intelligence differences: a population
based cohort study. Intelligence, 29, 389-399.
Deary, I.J., Allerhand, M., Der, G. (2009). Smarter in middle age, faster in old age: a cross-lagged
panel analysis of reaction time and cognitive ability over 13 years in the West of Scotland Twenty-07
study. Psychology and Aging, 24, 40-47.
Falch, T. & Sandgren, S. (2011). The effect of education on cognitive ability. Economic Inquiry,
49, 838-856.
Fry, A.F. & Kintsch, W. (1995). Processing speed, working memory, and fluid intelligence:
evidence for a developmental cascade. Psychological Science, 237-241.
Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in
Medicine, 27, 2865-2873.
Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24,
79-132.
Gottfredson, L. S. (2003). G, jobs, and life. In H. Nyborg (Ed.), The scientific study of general
intelligence: Tribute to Arthur R. Jensen. Amsterdam: Pergamon. pp. 293-342.
Griliches, Z. & Mason, W.M. (1972). Education, income, and ability. The Journal of Political
Economy, 80, S74-S103.
Halpern, D.F., Beninger, A.S., & Straight, C.A. (2011). Sex differences in intelligence. In R.J.
Sternberg & S.B. Kaufman (Eds.), The Cambridge Handbook of Intelligence (pp. 253-267). New York, NY:
Cambridge University Press.
Hansen, K.T., Heckman, J.J., & Mullen, K.J. (2004). The effect of schooling and ability on
achievement test scores. Journal of Econometrics, 121, 39-98.
171
Hedges, L.V. & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers
of high scoring individuals. Science, 269, 41-45.
Hertzog, C. (2008). Theoretical approaches to the study of cognitive aging: an individual-
differences perspective. In S.M. Hofer & D.F. Alwin (Eds.), Handbook of Cognitive Aging: Interdisciplinary
Perspectives. Thousand Oaks, Ca: Sage.
Hertzog, C. (2011). Intelligence in adulthood. In R.J. Sternberg & S.B. Kaufman (Eds.), The
Cambridge Handbook of Intelligence (pp. 174-190). New York, NY: Cambridge University Press.
Horn, J.L. & Cattell, R.B. (1967). Age differences in fluid and crystallized intelligence. Acta
Psychologica, 26, 107-129.
Hyde, J.S., Lindberg, S.M., Linn, M.C., Ellis, A.B. and Williams, C.C. (2007). Gender similarities
characterize math performance. Science, 321, 494-495.
Jensen, A.R. (1998). The g factor: The science of mental ability. New York, NY: Praeger.
Juster, F.T. (1975). Education, income, and human behavior. Hightstown, NJ: McGraw-Hill.
Kapteyn, A. (2002). Internet Interviewing and the HRS (grant number 1R01AG020717-01). Rand
Corporation: Santa Monica, CA.
Klein Entink, R.H. (2009). Statistical models for responses and response times. Doctoral thesis,
University of Twente, the Netherlands.
Lynn, R. (2010). In Italy, north-south differences in IQ predict differences in income, education,
infant mortality, stature, and literacy. Intelligence, 38, 93-100.
Lynn, R. & Vanhanen, T. (2002). IQ and the Wealth of Nations. Westport, CT: Praeger Publishers.
Lynn, R. & Vanhanen, T. (2006). IQ and global inequality. Augusta, GA: Summit Books.
Nettlebeck, T. (2011). Basic processes of intelligence. In R.J. Sternberg & S.B. Kaufman (Eds.),
The Cambridge Handbook of Intelligence (pp. 371-393). New York, NY: Cambridge University Press.
Olsson, U. (2005). Confidence intervals for the mean of a log-normal distribution. Journal of
Statistics Education, 13.
Rindermann, H. (2008). Relevance of education and intelligence at the national level for the
economic welfare of people. Intelligence, 36, 127-142.
Salthouse, T.A. (1996). The processing-speed theory of adult age differences in cognition.
Psychological Review, 103, 403-428.
172
Schmidt, F. L., & Hunter, J. (2004). General mental ability in the world of work: Occupational
attainment and job performance. Journal of Personality and Social Psychology, 86, 162-173
Sliwinski, M. & Buschke, H. (1999). Cross-sectional longitudinal relationships among age,
cognition, and processing speed. Psychology and Aging, 14, 18-33.
Spearman, C. (1904). General intelligence, objectively determined and measured. American
Journal of Psychology, 15, 201-293.
Strenze, T. (2007). Intelligence and socioeconomic success: A meta-analytic review of longitudinal
research. Intelligence, 35, 402-426.
Thurstone, L.L. (1938). Primary mental abilities. Chicago, IL: University of Chicago Press.
van der Linden, W.J., Klein Entink, R.H., & Fox, J.P. (2010). IRT parameter estimation with
response times as collateral information. Applied Psychological Measurement, 34, 327-347.
Volkman, F., Szatmari, P., and Sparrow, S. (1993). Sex differences in pervasive developmental
disabilities. Journal of Autism and Developmental Disabilities, 23, 579-591.
Wechsler, D. (1939). Measurement of adult intelligence. Baltimore, MD: Williams & Wilkins.
Weede, E. (2006). Economic freedom and development: New calculations and interpretations.
Cato Journal, 26, 511-524.
Winship, C. & Korenman, S.D. (1997). Does staying in school make you smarter? The effect of
education on IQ in the bell curve. In B. Devlin, S.E. Fienberg, D.P. Resnick, & K. Roeder (Eds.),
Intelligence, Genes, and Success. Scientists Respond to the Bell Curve. New York, NY: Springer.
Woodcock, R.W., McGrew, K.S., & Mather, N. (2001). Woodcock-Johnson III. Rolling Meadows,
IL: Riverside Publishing.
173
Chapter 8: General Discussion
In five analytical steps I outlined a general process for evaluating, adapting and improving a
scale for use in psychological research. In this chapter the results and conclusions from these analyses
are considered and a general direction for future research is reached. While it has been shown that relatively few gains were made in this particular implementation, larger gains may be possible in other cognitive tasks. The desire for accurate and concise measures is a good
motivator for improving and innovating in the usage of response time and other collateral information in
scoring methodologies.
Measurement Model Results
In testing the Woodcock Johnson – III Number Series task for its scale properties some minor
adjustments were made to the existing framework. The purpose of this section of analysis was to provide a
set of methods to analyze and construct a unidimensional scale for use in measuring ability. In
particular, the measurement model was identified with 30 items all fitting the single dimension of fluid
reasoning. This ability was identified as a fluid ability in the context of fluid and crystallized intelligence
(Gf-Gc theory), where the fluid abilities have traditionally shown more age related differences (Horn &
Cattell, 1967). It is for this reason I noted the relationship between ability and age. Though only a small
correlation was found between the two, it was in the direction expected and used in subsequent
sections of analysis. Similarly, the gender of respondents was used to examine for the possibility of item
bias. The traditional sense in cognitive testing is that there is a bias for males to perform better on tests
of math related abilities (Halpern, Beninger, and Straight, 2011). Findings tend to corroborate the work
of Lindberg, Hyde, Petersen, and Linn (2010), where a meta-analysis showed no difference between
males and females in mathematics based abilities. The results indicated that item difficulties do not
174
favor males or females overwhelmingly. Any differences noted are small and varied towards males in
some cases and females in other cases.
In addition to these characteristics the measurement model was put through a couple of
processes to test the parameterization of the items. In the one parameter model each item has a
difficulty and the discrimination parameter is fixed to be equal across items. In a more relaxed model
the discrimination parameters are estimated in conjunction with the item difficulties so that items can vary in their difficulties as well as in their ability to discriminate among person abilities. Allowing the discrimination parameters to be estimated did not result in a greatly better fitting model. In this way, the model looks very similar to that used by the test makers of the Number Series task. I do, however, show a preference for using estimated item difficulties over the fixed anchors provided. The comparison of person scores shows an upward bias with fixed anchors,
overestimating ability by a substantial amount. In this section the models were based on the scored
responses of the respondents. The next section added to the response model with a joint response time
model as an alternative to simple response scores.
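The difference between the two parameterizations can be made concrete with the item response function itself. The short Python sketch below shows the two-parameter form, which reduces to the Rasch (one-parameter) form when every discrimination is fixed at 1; the difficulty and discrimination values are illustrative only.

    import numpy as np

    def p_correct(theta, b, a=1.0):
        """2PL item response function; with a = 1 for every item it reduces to the Rasch (1PL) model."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    theta = 0.5                       # person ability (logits)
    b = np.array([-1.0, 0.0, 1.5])    # illustrative item difficulties
    a = np.array([0.8, 1.0, 1.4])     # illustrative item discriminations

    print(p_correct(theta, b))        # 1PL: common discrimination fixed at 1
    print(p_correct(theta, b, a))     # 2PL: item-specific discriminations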
In the Conditionally Independent Response Time (CIRT) model there is a joint response and
response time model that estimates a distribution of person parameters for ability and speededness
(van der Linden, 2007). The joint likelihood of the two functions allows one to make statements both about the items, in terms of difficulty and time intensity, and about persons, in terms of ability and speededness. The benefit comes from using the response times that respondents take in providing an answer to each item given. This simple addition in computation provides a method to test whether response time adds anything to score prediction beyond correctness of response. Whether there is a benefit to responding quickly with a correct response is shown by whether the probability of a correct response rises with speed. I used the CIRT model as an intermediate step in scoring respondents with the reduced and adaptive item sets.
175
The CIRT model was shown to provide respondent scores identical to those scored with a
traditional IRT model. The estimator used to obtain distribution estimates was a Markov Chain Monte
Carlo (MCMC) method with a Gibbs Sampler (Klein Entink, 2009). This method differs from the Marginal
Maximum Likelihood (MML) estimator in that it provides a distribution of estimates based on a prior distribution specified by the researcher. When compiling the distribution from all of the
iterations, I wanted to reach a stable estimate and show that it was the same shape as the distribution
specified. This was found to be the case with the responses and response times in the CIRT model on the
Number Series items.
The major distinction between the methods of Chapter 3 and Chapter 4 is that the MML
methods attempt to provide point estimates that were used as item difficulties and discrimination parameters when parameterized. The MCMC methods, in contrast, attempt to estimate distributions of
parameters. Because there is an emphasis on picking the correct data shape, one could use the
distribution to obtain a point estimate similar to the MML method. By providing two alternative
methods of model estimation, I could then show that the two methods provide similar results. In the
subsequent sections the IRT and CIRT methods were used to estimate ability of respondents.
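The practical link between the two estimators can be illustrated with a small sketch: once the Gibbs sampler has produced a chain of draws for a person parameter, a point estimate comparable to the MML estimate is simply a summary of that chain. The Python code below is illustrative only and uses a simulated chain in place of actual sampler output.

    import numpy as np

    def summarize_chain(draws, burn_in=500):
        """Posterior mean, SD, and 95% credible interval from a chain of MCMC draws."""
        kept = np.asarray(draws)[burn_in:]             # discard burn-in iterations
        return {"mean": kept.mean(),                   # point estimate comparable to an MML estimate
                "sd": kept.std(ddof=1),                # posterior spread
                "ci95": np.percentile(kept, [2.5, 97.5])}

    # Simulated stand-in for sampler output: a chain around a normal posterior.
    rng = np.random.default_rng(1)
    fake_chain = rng.normal(loc=0.4, scale=0.2, size=3000)
    print(summarize_chain(fake_chain))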
Reduced and Adaptive Results
In introducing item subsets from the 30 item set there is the question of how many questions
should be given to provide researchers with a sufficiently accurate score. This question was answered
for us in the presentation of the problem. In the Health and Retirement Study (HRS) a premium is placed
on time and moving onto a new task quickly. In particular, if a test could be given in five minutes and still
get an accurate score, then this option would be used instead of traditional long form tests. The obvious
answer to this problem is to pick the items that will be most informative and give as many as you can in
this period. In a five minute period it is realistically possible to give 6 items have everyone get through
176
them and provide informed responses. Knowing that the theoretical cap was 6 items, each of the item
subsets consists of 6 items for all participants. This ensured that each person would be given the same
number of attempts and had the same number of stimuli.
The first set of analyses focused on reduced item sets. These item sets, also known as short forms, reduced the 30 item pool down, and each respondent received the same item subset. Methods such as random picks, equally spaced fixed items, items with the highest correlation with Cronbach's alpha, the highest factor loadings, and the highest R2 with the total score were compared. It was shown that the latter three methods provided substantially more accurate score predictions than the first two methods. This can be accounted for because the first two methods placed no weight on sample performance, whereas the latter methods could be influenced by the information curve of the IRT model: items that were more informative were closer to the sample mean on the ability scale (which is also the difficulty scale the items are ordered on). The reduced item sets have an advantage in ease of administration because everyone is presented with the same items. There is no shuffling around to find the next item, because performance is not a factor in item selection.
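As a concrete illustration of the sample-dependent selection rules, the sketch below picks the six items with the highest squared correlation with the total score (the Best R2 rule) from a simulated response matrix; the alpha- and factor-based picks follow the same pattern with a different ranking criterion, and the simulated data are purely illustrative.

    import numpy as np

    def best_r2_items(responses, n_pick=6):
        """Rank items by squared correlation with the total score; keep the top n_pick.

        responses: persons x items array of scored (0/1) responses."""
        responses = np.asarray(responses, dtype=float)
        total = responses.sum(axis=1)
        r2 = np.array([np.corrcoef(responses[:, j], total)[0, 1] ** 2
                       for j in range(responses.shape[1])])
        return np.argsort(r2)[::-1][:n_pick]           # column indices of the selected items

    # Simulated example: 200 respondents, 30 Rasch items of increasing difficulty.
    rng = np.random.default_rng(7)
    theta = rng.normal(size=(200, 1))
    b = np.linspace(-2.0, 3.0, 30)
    sim = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta - b))))
    print(best_r2_items(sim))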
I then moved to the adaptive item subsets which looked at a block adaptive method, half-
adaptive, and fully adaptive methods. The common theme among these methods is that the item sets all
start off with the same item but branch out depending on respondent behavior. While there are slight
variations in accuracy of respondent score, these methods are on par with the more sample dependent
methods of the reduced item sets. In fact, adaptive frameworks can provide scores equally as good as a full scale test if the right parameters for accuracy are set (Sands, Waters, & McBride, 1997; Wainer, 2000). Because each respondent can receive a potentially different item sequence from the next respondent, these contingencies must be coded and accounted for. While this is not a problem for tests administered in a computer format, problems can arise when tests are given with paper and pencil and a test administrator sets the pace. In an increasingly technologically driven environment the last point seems to be moot, suggesting that if researchers are willing to do the work of laying out testing sequences up front, then adaptive tests are attractive in that they provide accurate and quick respondent scores.
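The branching logic that must be laid out in advance follows a simple rule at each step: given a provisional ability estimate, administer the unused item with the greatest information at that estimate. The Python sketch below illustrates this rule under a Rasch model; the difficulty bank and starting item are placeholders, not the operational Number Series parameters.

    import numpy as np

    def next_item(theta_hat, difficulties, used):
        """Index of the unused item with maximum Fisher information at the current ability estimate.

        Under the Rasch model, item information is p * (1 - p)."""
        p = 1.0 / (1.0 + np.exp(-(theta_hat - difficulties)))
        info = p * (1.0 - p)
        info[list(used)] = -np.inf                     # never re-administer an item
        return int(np.argmax(info))

    bank = np.linspace(-3.5, 3.5, 30)                  # placeholder difficulty bank
    given = {14}                                       # suppose a mid-difficulty starter was used
    print(next_item(theta_hat=0.8, difficulties=bank, used=given))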
Response Time Results
In the preceding sections work was done to score and choose items in an efficient and logical
manner (except for the randomly chosen item set). While the scoring methods were shown to make little difference, the item selection methods necessarily alter the accuracy of scores by dropping potentially
useful information about respondent performance. My goal at this stage of analysis is to regain accuracy
in scores through the collateral information provided to the typical researcher. I first wanted to identify
the usefulness of including response times in predicting a total item set score, along with the item
subset responses. In particular, how much accuracy can be added by including response times with
responses to predict the 30 item score? Overall, the response times provided a minimal increase in
accuracy. The largest gain came when response time was added to response in the fixed item set. A
potential reason for such a gain would be that less informative items were used with informative items
and the response times helped fill in gaps. In other cases, most items were informative and thus the
response times did not add much to the prediction of the total score.
In adding response time as a predictor and as an interaction term with response, a couple of
hypotheses are tested. The first is: how does time spent correlate with total score on an item basis?
Some items may require more time than others and this would be shown in this weight. If one item
needs a lot of time to complete successfully then a positive weight would indicate that higher scores are
associated with longer response times. The opposite may be true of easier items where the weight
would be negative and more time spent penalizes the total score. The second involves the examination
178
of the interaction term. If quicker correct responses are desired then one would expect to see negative
regression weights for the interaction terms. The regression weights indicate that the interaction terms
do support our hypothesis that quick correct responses are better. I can then show that the faster one
answers an item correctly their score should be higher than someone that answers it slowly. The
meaning in this is that someone answering an item faster is not being tested at their ability level yet and
harder questions are suggested. The respondent answering slower is being tested close to their ability
level and questions of similar difficulty would be desired.
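The sign pattern can be read directly from the fitted equation for a single item: the item's contribution to the predicted total is the response weight times the response, plus the time weight times log time, plus the interaction weight times their product, so a negative interaction weight means a correct answer contributes less as response time grows. The illustration below uses made-up weights, not estimates from Table 7.6.

    import numpy as np

    def item_contribution(correct, seconds, b_resp=10.0, b_time=1.0, b_int=-2.0):
        """Predicted-score contribution of one response/time pair (made-up weights)."""
        logt = np.log(seconds)
        return b_resp * correct + b_time * logt + b_int * correct * logt

    # With a negative interaction weight, a fast correct answer contributes more
    # to the predicted total than a slow correct answer:
    print(item_contribution(correct=1, seconds=10.0))  # quick and correct
    print(item_contribution(correct=1, seconds=60.0))  # slow and correct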
The minimal gain in including response time in scoring the Number Series test can potentially be
explained with the CIRT model outlined previously. In this model an ability level and speededness are
recorded for each respondent. With each of these estimates calculated, the correlation of the two person scores was computed and found to be r = 0.19. This value is on the lower end of acceptable correlations for
effective use of response time in the CIRT model (Fox, Klein Entink, & van der Linden, 2010). With this
information it is shown that the relationship of response and response time per item is relatively low
within a respondent and researchers would expect little gain when this information is included. In fact,
this is the case when the CIRT model is used to score the reduced and adaptive item sets. The
correlation of scores using IRT scoring and CIRT to the total item set score is close to identical. If
response times improved scoring of the item subsets then the CIRT scores would correlate better with
total scores over the IRT scores of the same item subsets. It is noted that including the response time
data does not hurt prediction accuracy and could be included as a precaution in the event that items are
not chosen effectively for the sample being tested.
Demographic Results
In the final analysis section the case is made for including demographic predictors of the total
score, along with responses and response times. The impetus for this work is that characteristics of a
179
person account for some variation in scores. Some of the differences that exist between respondents
are age related, meaning that differences in scores can be accounted for by knowing one’s age (Horn &
Cattell, 1967). In this sample I found a slight negative relationship with age: as respondents reported older ages, scores dropped by a small, albeit significant, amount. While this is an expected age related
effect, the effect is smaller than shown previously (McArdle, Ferrer, Hamagami, & Woodcock, 2002). It
seems that this group is well versed in the format of the testing and while there is some variation in
scores, most people do very well on the scale. The ALP sample does a survey about every couple of
weeks and so they are used to using the computer to provide information and interact. Active
engagement in an activity does seem to provide benefit within the domain although it does not seem to
generalize to a large extent (McArdle & Prindle, 2008).
Other respondent characteristics tested included education, income, gender, and Need for
Cognition scores. As would be expected, education had a strong positive relationship with Number
Series scores. Those with more education were expected to have higher scores. One could make the
argument that education brings up more training in reasoning and fluid ability, which is how the Number
Series task is categorized. Income was included because it is typically tied to education in a number of
ways (Griliches & Mason, 1972; Juster, 1975; Lynn & Vanhanen, 2002; Weede, 2006). Though there was
a small relationship between the two, respondents income did not provide any additional prediction of
the total score with other demographic characteristics taken into account. The income values lacked
definition because they were originally given as bracket values and translated to numeric values. This
coupled with the fact that most were in households of similar status would work against any accounted
variance income might explain. Gender was used as a possible predictor because of evidence where
males outperform females in cognitive tasks involving math skills (Halpern, Beninger, and Straight,
2011). This was found to be the case and males had slightly higher scores than females.
180
The Need for Cognition task was originally used as filler between Number Series item sets with
no cognitive load (Cacioppo, Petty, & Kao, 1984). To this end, I expected no particular relation with the
Number Series task, but there are some possibilities to account for the significant relationship. The scale
is broken up into two subscales, in which one is coined 'positive' and the other 'negative.' The Positive
scale is associated with higher scores and the negative scale with lower scores. In terms of the design of
the Need for Cognition scale, the positive scale can be viewed as the degree to which respondents enjoy
cognitive tasks and the negative scale how much they avoid cognitive tasks. With this knowledge it
would seem appropriate that those with higher positive need for cognition would do better with the
Number Series task, and those with higher negative need for cognition scores would have lower Number
Series scores. Those with desire to do a task will do better than those who are less engaged.
As noted before, it is suggested that the HRS sample can be considered nationally
representative of the US population when weights were used in internet testing (Couper, Kapteyn,
Schonlau, & Winter, 2007). The implementation of weights in the IRT scoring provides no significant
change in rank order when compared to unweighted scores in the ALP sample. What this indicates is
that testing of the ALP on how well it represents the population is needed in order to correctly identify
deficiencies in sampling. To test this idea, the ALP and HRS samples must be compared to one another.
In a logistic model with sample, ALP or HRS, as the outcome and the demographics as predictors,
significant beta weights indicate sources of bias between the samples. If any particular demographic
variables provide information that differentiates the samples, then the two samples are not equal in composition and cannot both be nationally representative. In their paper, Couper et al. (2007) indicate that the
important barrier in internet testing is access more than sampling. The goal of their work has been to
provide internet to an underrepresented group that would benefit from being given access. It is also
181
important to note that though steps were taken to include this underrepresented group, they were still
under-sampled.
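The proposed comparison can be set up directly as a logistic regression. The sketch below assumes a hypothetical stacked data frame named combined with an indicator column is_alp (1 = ALP, 0 = HRS) and the demographic codings used above; it illustrates the test described here, not an analysis that was actually run.

    import statsmodels.formula.api as smf

    def sample_bias_test(combined):
        """Logistic regression of sample membership (ALP vs. HRS) on the demographics.

        Significant weights flag demographics on which the two samples differ."""
        fit = smf.logit("is_alp ~ age_c + age_c2 + female_e + educ_c + income",
                        data=combined).fit(disp=False)
        return fit.summary()

    # print(sample_bias_test(combined))   # combined is the hypothetical stacked ALP + HRS data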
The ALP provides good information about sampling processes and current composition of the
sample. In the current sample all states are represented except for Hawaii. There are efforts to fill out
the sample, but some of these have been through snowball methods, which do not proceed methodically.
Extra care in sampling regions with little representation would help to bolster the claim of being in line
with national norms demographically. The array of sampling techniques and growth over the years has
led to a diverse use of tactics and theory in sample collection. Care is taken to ensure that as many
respondents answer a given survey as possible. Reminders and extended survey periods allow for
respondents to undertake a survey when they feel they are ready. There is a steady attrition rate of
about 7% a year within the sample, meaning these people do not actively do surveys for a given
calendar year. Given the volume of surveys administered, the response and retention rates seem to be reasonably good for the ALP.
Synthesizing a Conclusion
I now outline how to move forward with the results accumulated in this sequence of analyses.
One must note the utility of response time in this study and its role in future studies. Additionally, steps
to successfully implement response time in cognitive testing must be provided. The lack of overall
improvement in total score prediction is examined in this study and suggestions for similar future
studies are made.
The results outlined above point to the need to have good item selection methods. Whether a
reduced or adaptive item set is chosen, care must be taken to make sure items are able to accurately
score respondents. Once this aspect is addressed, response time helps minimally in terms of the study
design laid out. The natural rate of problem solving, when no specific instruction to respond quickly is given, is not highly related to Number Series ability. In future administrations it may be useful to include more specific instructions with the task so as to elicit response times from responses that are both quick and careful. The current instructions as presented to the respondents in the
responses that are quick and careful. The current instructions as presented to the respondents in the
ALP were:
Some of the problems may be easy but other may be hard. Please try your best to answer all of
the items even if you are not sure of the answer (i.e., there is no reason to stop). There is no
credit for answering quickly – it is more important to answer the item correctly, but it is okay if
you do not know the answer because some of the items are intended to be very difficult. You
can go on the next item at any time.
In order to obtain response times of more focused effort slight modifications can be made to the
instructions. I offer this straightforward interpretation as an alternative to the previous instructions:
You will be given 30 items to complete in no more than one hour. You have a maximum of 2
minutes for each item. If you have not given an answer in this time the item will be scored as
incorrect and the next item will be presented. Work as carefully and quickly as you can to get
each item correct. You may continue onto the next item at any time.
These new instructions make apparent the desire that the participant work at a rate that is speedy yet careful. There is no overt claim that they should work at a recklessly fast rate, nor that they should over-analyze their responses. The desire to get respondents to work at a comfortable and focused rate, so that response times are as close as possible to the time spent thinking about each item, is made clear. Additionally, a case can be made that response time should be mentioned overtly in the instructions. This would indicate to a respondent that their time to respond is taken into account and that their work rate should reflect this set of instructions. This would be written as:
Some of the problems may be easy but other may be hard. Please try your best to answer all of
the items even if you are not sure of the answer (i.e., there is no reason to stop). Your time in
completing the survey is taken into account in calculating a test score. Once you are satisfied
with an answer for an item please move onto the next item. You can go on the next item at any
time.
183
The claim here is that respondents know that their time is factored into how well they do on the scoring
of the test. While it can be claimed that speededness is not the goal in measuring a cognitive ability, it becomes
an undeniable factor in this testing situation. This raises the power versus performance aspect of
cognitive ability (Cattell, 1948) again. A large scale comparison of the variations in performance based
on these instructions can provide insight into the influence of response time in score prediction across
the range of performance tests where speed is a factor to power tests where speed is not taken into
account.
Another point to make is that the relation of the Number Series task to processing speed is possibly not as strong as that of other constructs. It was noted before that Number Series typically has been
identified as a fluid ability, which has been linked to speed of processing (Deary, Allerhand, and Der,
2009). While Number Series is one aspect of fluid intelligence, there is the possibility that other
constructs may better capture the relationship between fluid intelligence and processing speed. That is
to say that the ability and speededness estimates may have a higher correlation and response time may
provide larger increases in score accuracy of reduced and adaptive item sets when predicting the total
score. Such tests could be other quantitatively oriented Woodcock Johnson – III tests such as
Calculation, Picture Naming, and Spatial Decision Tasks. Such tasks require identifying and correctly
stating the answer in response to a stimulus. Calculation and Spatial Decision Tasks require identification
of a set of parameters and identifying the solution. There could be an array of difficulties and
speededness could relate to ability. Picture Naming corresponds to another broad cognitive factor
where response time could improve scoring. When items are relatively easy and abilities are less varied,
response time can vary widely to provide additional information to differentiate respondents.
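As a minimal illustration of this last point, the following R sketch (in the style of the appendix
programs) simulates a single easy Rasch item together with lognormal response times of the kind used
in the Chapter 4 model. All values are made up for illustration; none come from the ALP data.

# Minimal illustrative simulation (assumed values, not ALP data): one easy Rasch
# item given to persons with similar ability; log response times follow a
# lognormal model of the form log(T) = lambda - zeta + error.
set.seed(1)
n      <- 1000
theta  <- rnorm(n, mean = 2.0, sd = 0.3)   # narrow ability range
zeta   <- rnorm(n, mean = 0.0, sd = 1.0)   # speededness
beta   <- -2.0                             # easy item difficulty
lambda <- 3.5                              # item time intensity (log seconds)
p      <- plogis(theta - beta)             # Rasch probability of a correct response
y      <- rbinom(n, 1, p)                  # scored response
logt   <- rnorm(n, mean = lambda - zeta, sd = 0.3)
mean(y)          # accuracy is near ceiling, so responses carry little information
var(logt)        # response times still vary substantially
cor(logt, zeta)  # and are strongly (negatively) related to speededness

In this situation nearly everyone answers correctly, so the responses alone cannot separate
respondents, while the response times still can.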
Finally, as alluded to in Chapter 7, there is room to improve scores with demographics. For that,
it is important to have variables with good definition. The income variable was poorly defined and
could be improved either by using finer brackets or by allowing respondents to report the amount
themselves. Including demographics as predictors produced a modest improvement in the current
study, and the same logic applies here as it did for response times: there is no loss in including these
predictors in addition to responses and response times. If there is a deficiency in the items chosen,
demographics may help recover some of the total score prediction.
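As a sketch of this recovery step (not the exact models reported in Chapter 7), the comparison
amounts to regressing the full-scale IRT score on a subset score with and without the demographic
predictors. The R code below uses the variable names from the appendix programs (mscore_ab,
predfixed, agec, educ_c, e_female, lincome); dat stands in for a merged person-level analysis file that is
assumed to exist.

# Hedged sketch: compare a subset-only model to one that adds demographics.
# 'dat' is assumed to hold the merged person-level variables named below.
m0 <- lm(mscore_ab ~ predfixed, data = dat)
m1 <- lm(mscore_ab ~ predfixed + agec + educ_c + e_female + lincome, data = dat)
summary(m0)$r.squared   # prediction from the 6-item subset score alone
summary(m1)$r.squared   # any recovery attributable to the demographic predictors
anova(m0, m1)           # test of the incremental contribution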
Final Remarks
With the conclusion of this set of studies, some last thoughts are offered to tie up loose ends. I
would like to identify some caveats and provide motivation for future research on responses and
response times. First, respondents have a monetary incentive to complete the surveys presented to
them: they receive $20 for each survey, and most surveys take about 30 minutes to complete. For the
most part the sample appeared to answer sincerely. This is inferred because response times indicated
that harder items took longer to answer than easier ones, and very few respondents simply clicked
through stimulus screens. Still, it is important to keep in mind what motivates respondents to put
sincere effort into an online survey. They are not being observed as in a traditional psychological study,
and one really does not know who is on the other end pressing the keys.
Second, the response patterns were synthesized to mimic the item subsets detailed above.
Each respondent answered the 30 items only once, and those responses were used to construct each
item subset. The synthesized responses and scores are valid under the IRT framework, which assumes
that item responses are locally independent of one another (Lord, 1952; Rasch, 1960). Administering
the item subsets directly, and comparing their performance to the total scale in a similar internet
sample, would provide further evidence. It is also important to note that order of presentation has
been shown to affect the probability of a correct response to an item (Bowles & Salthouse, 2003). This
raises questions about the effectiveness of adapted item sets, and extra care should be taken to
examine item properties in adapted and non-adapted experiments.
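To make the synthesis concrete, the following R sketch rescores a single hypothetical response
vector on the fixed six-item subset from Chapter 5, using the anchored Rasch difficulties listed in the
appendix; no new item administrations are involved, only responses already observed on the full
30-item form.

# Hedged sketch: rescore one respondent on the fixed 6-item subset used in Chapter 5.
# 'resp30' is a made-up 0/1 vector standing in for that respondent's 30 observed items.
kept <- c(17, 3, 25, 8, 9, 30)                     # fixed subset (appendix, Chapter 5)
beta <- c(-3.89, -2.51, -0.18, 0.60, 1.77, 3.36)   # anchored Rasch difficulties
rasch_ml <- function(y, b) {
  # maximize the Rasch log-likelihood in theta with the item difficulties fixed;
  # all-correct or all-incorrect patterns have no finite maximum and would need
  # the random-effects scoring used in the SAS programs
  negll <- function(theta) -sum(y * (theta - b) - log(1 + exp(theta - b)))
  optimize(negll, interval = c(-6, 6))$minimum
}
resp30 <- c(rep(1, 20), rep(0, 10))                # made-up example response vector
theta_subset <- rasch_ml(resp30[kept], beta)
9.1024 * theta_subset + 500                        # same W-score rescaling as the appendix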
The third point is that this set of analyses provides a framework for future studies of responses
and response times. The purpose of the third chapter was to identify the scale and its properties, as
would be done in typical scale development. The fourth chapter added response time data to the
traditional IRT model, and the fifth chapter created item subsets with the intent of accurately
measuring respondent ability. The sixth and seventh chapters attempted to regain the predictability
lost by deleting items, using information readily available to researchers working with survey samples
or information typically collected in psychological studies. The idea of collateral data, data that is
readily available as a byproduct of testing, is borrowed here and implemented in a way that serves the
goal of creating short but accurate cognitive tasks. Of interest are ideas such as those proposed by
McArdle, Grimm, Hamagami, Bowles, and Meredith (2009), where curves of IRT factors were modeled
to predict longitudinal trajectories. The ability and speededness scores can likewise be used in
multifaceted analyses that combine methods to answer complicated research questions.
In the case of this Number Series task, introducing response time as a predictor of ability did
not improve predictability. Researchers would gain little by continuing to press the orthogonal
speededness factor into predicting Number Series ability. What is gained is an extra piece of
information about a person, which can then be used in testing other theories of cognitive ability. As
mentioned before, it would be beneficial to see how a speededness factor relates to other cognitive
abilities, and to traditional tests of processing speed and speediness. These alternative uses recognize
that, while not useful in the current framework, collateral information in the form of response time can
still be of use to researchers studying the dynamics of cognitive ability.
References to Chapter 8
Bowles, R.P. & Salthouse, T.A. (2003). Assessing the age-related effects of proactive interference
on working memory tasks using the Rasch model. Psychology and Aging, 18, 608-615.
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Deary, I.J., Allerhand, M., & Der, G. (2009). Smarter in middle age, faster in old age: a cross-lagged
panel analysis of reaction time and cognitive ability over 13 years in the West of Scotland Twenty-07
study. Psychology and Aging, 24, 40-47.
Fox, J.P., Klein Entink, R.H., & van der Linden, W.J. (2007). Modeling of responses and response
times with the package CIRT. Journal of Statistical Software, 20, 1-14.
Griliches, Z. & Mason, W.M. (1972). Education, income, and ability. The Journal of Political
Economy, 80, S74-S103.
Halpern, D.F., Beninger, A.S., & Straight, C.A. (2011). Sex differences in intelligence. In R.J.
Sternberg & S.B. Kaufman (Eds.), The Cambridge Handbook of Intelligence (pp. 253-267). New York, NY:
Cambridge University Press.
Horn, J. L., & Cattell, R.B. (1967). Age differences in fluid and crystallized intelligence. Acta
Psychologica, 26, 107-129.
Juster, F.T. (1975). Education, income, and human behavior. Hightstown, NJ: McGraw-Hill.
Klein Entink, R.H. (2009). Statistical models for responses and response times. Thesis.
Netherlands: University of Twente.
Lord, F. (1952). A theory of test scores. Psychometric Monograph, 7.
Lynn, R. & Vanhanen, T. (2002). IQ and the Wealth of Nations. Westport, CT: Praeger
Publishers.
McArdle, J. J., Ferrer-Caja, E., Hamagami, F. & Woodcock, R. W. (2002). Comparative longitudinal
structural analyses of the growth and decline of multiple intellectual abilities over the life
span. Developmental Psychology, 38, 115-142.
McArdle, J.J., Grimm, K.J., Hamagami, F., Bowles, R.P., & Meredith, W. (2009). Modeling life-
span growth curves of cognition using longitudinal data with multiple samples and changing scales of
measurement. Psychological Methods, 14, 126-149.
McArdle, J.J. & Prindle, J.J. (2008). A latent change score analysis of a randomized clinical trial in
reasoning training. Psychology and Aging, 23 (4), 702-719.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Denmark
Paedagogiske Institute, Copenhagen.
Sands, W. A., Waters, B.K., & McBride, J. R. (Eds.) (1997). Computerized adaptive testing: From
inquiry to operation. Washington, DC: American Psychological Association.
van der Linden, W.J. (2007). A hierarchical framework for modeling speed and accuracy on test
items. Psychometrika, 72, 287-308.
Wainer, H. (Ed.) (2000). Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence
Erlbaum.
Weede, E. (2006). Economic freedom and development: New calculations and interpretations.
Cato Journal, 26, 511-524.
Bibliography
Adams, R.J. and Khoo, S.K. (1993). QUEST: The Interactive Test Analysis System. Australian
Council for Educational Research, Hawthorn, Victoria.
Al-Amri, S. (2008). Computer-based testing vs. paper-based testing: a comprehensive approach
to examining the comparability of testing modes. Essex Graduate Student Papers in Language &
Linguistics, 10, 22-44.
Albert, J.H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs
sampling. Journal of Education Statistics, 17, 251-269.
Alderson, J.C. (2000). Technology in testing: the present and the future. System, 28, 593-603.
American Education Research Association, American Psychological Association, & Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington,
DC: American Education Research Association.
Ang, S., Rodgers, J.L., & Wänström, L. (2010). The Flynn effect within subgroups in the U.S.:
gender, race, income, education, and urbanization differences in the NLSY-children data. Intelligence,
38, 367-384.
Berk, R.A. (2008). Statistical learning from a regression perspective. New York: Springer.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In
F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores, Reading, MA: Addison-Wesley.
Bowles, R.P. & Salthouse, T.A. (2003). Assessing the age-related effects of proactive interference
on working memory tasks using the Rasch model. Psychology and Aging, 18, 608-615.
Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets.
Psychometrika, 64, 153-168.
Brody, N. (1992). Intelligence (2nd ed.). New York, NY: Academic Press.
Burt, C. (1949). The structure of the mind; A review of the results of factor analysis. British
Journal of Psychology, 19, 176-199.
Cacioppo, J.T., Petty, R.E., & Kao, C.F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48(3), 306-307.
Carroll, J.B. (1993). Human Cognitive Abilities: A survey of factor analytic studies. New York:
Cambridge University Press.
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove: Duxbury.
Cattell, J.M. & Galton, F. (1890). Mental tests and measurements. Mind, 15, 373-381.
Cattell, R. B. (1943). The measurement of adult intelligence. Psychological Bulletin, 40, 153-193.
Conway, A.R.A., Cowan, N., Bunting, M.F., Therriault, D.J., & Minkoff, S.R.B. (2002). A latent
variable analysis of working memory capacity, short-term memory capacity, processing speed, and
general fluid intelligence. Intelligence, 30, 163-183.
Deary, I.J., Allerhand, M., & Der, G. (2009). Smarter in middle age, faster in old age: a cross-lagged
panel analysis of reaction time and cognitive ability over 13 years in the West of Scotland Twenty-07
study. Psychology and Aging, 24, 40-47.
Deary, I.J., Der, G., & Ford, G. (2001). Reaction times and intelligence differences: a population
based cohort study. Intelligence, 29, 389-399.
Doolittle, A.E. & Cleary, T.A. (1987). Gender-based differential item performance in mathematics
achievement items. Journal of Educational Measurement, 24, 157-166.
Falch, T. & Sandgren, S. (2011). The effect of education on cognitive ability. Economic Inquiry,
49, 838-856.
Flanagan, D. P. (2000). Wechsler-based CHC cross-battery assessment and reading achievement:
Strengthening the validity of interpretations drawn from Wechsler test scores. School Psychology
Quarterly, 15(3), 295−329.
Fox, J.P., Klein Entink, R.H., & van der Linden, W.J. (2007). Modeling of responses and response
times with the package CIRT. Journal of Statistical Software, 20, 1-14.
Fry, A.F. & Kintsch, W. (1995). Processing speed, working memory, and fluid intelligence:
evidence for a developmental cascade. Psychological Science, 237-241.
Galton, F (1883). Inquiries into human faculty and its development. Online Galton Archives:
Everyman.
Garnett, J. C. M. (1919). General ability, cleverness and purpose. British Journal of Psychology, 9,
345−366.
Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in
Medicine, 27, 2865-2873.
Gelman, A., Carlin, J.B., Stern, H.S., & Rubin, D.B. (2004). Bayesian data analysis. Chapman and
Hall / CRC, London.
Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of
posterior moments. In J.M. Bernardo, J.O. Berger, A.P. Dawid, & A.F.M. Smith (Eds.), Bayesian Statistics
4 (pp. 169-193). Oxford: Oxford University Press.
Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24,
79-132.
Gottfredson, L. S. (2003). G, jobs, and life. In H. Nyborg (Ed.), The scientific study of general
intelligence: Tribute to Arthur R. Jensen. Amsterdam: Pergamon. pp. 293-342.
Green, B.F. (1983). Adaptive testing by computer. In R.B. Ekstrom (Ed.), Principles of Modern
Psychological Measurement (pp.5-12). San Francisco, CA: Jossey-Bass.
Griliches, Z. & Mason, W.M. (1972). Education, income, and ability. The Journal of Political
Economy, 80, S74-S103.
Guilford, J.P. (1959). Personality. New York: McGraw hill.
Halpern, D.F., Beninger, A.S., & Straight, C.A. (2011). Sex differences in intelligence. In R.J.
Sternberg & S.B. Kaufman (Eds.), The Cambridge Handbook of Intelligence (pp. 253-267). New York, NY:
Cambridge University Press.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory.
Newbury Park, CA: Sage Publications.
Hansen, K.T., Heckman, J.J., & Mullen, K.J. (2004). The effect of schooling and ability on
achievement test scores. Journal of Econometrics, 121, 39-98.
Hayduk, L., Cummings, G.G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing!
Testing! One, two three – Testing the theory in structural equation models! Personality and Individual
Differences, 42, 841-50.
Hedges, L.V. & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers
of high scoring individuals. Science, 269, 41-45.
Heidelberger, P. & Welch, P.D. (1983). Simulation run length control in the presence of an initial
transient. Operations Research, 31, 1109-1144.
Hertzog, C. (1989). Influences of cognitive slowing on age differences in intelligence.
Developmental Psychology, 25, 636 - 651.
Hertzog, C. (2008). Theoretical approaches to the study of cognitive aging: an individual-
differences perspective. In S.M. Hofer & D.F. Alwin (Eds.), Handbook of Cognitive Aging: Interdisciplinary
Perspectives. Thousand Oaks, Ca: Sage.
Hertzog, C. (2011). Intelligence in adulthood. In R.J. Sternberg & S.B. Kaufman (Eds.), The
Cambridge Handbook of Intelligence (pp. 174-190). New York, NY: Cambridge University Press.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel
procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.
Horn, J.L. (1979). Trends in the measurement of intelligence. Intelligence, 3, 229-240.
Horn, J. L., & Cattell, R.B. (1966). Age differences in primary mental ability factors. Journal of
Gerontology, 20, 210-220.
Horn, J. L., & Cattell, R.B. (1966). Refinement and test of the theory of fluid and crystallized
general intelligences. Journal of Educational Psychology, 57, 253-270.
Horn, J. L., & Cattell, R.B. (1967). Age differences in fluid and crystallized intelligence. Acta
Psychologica, 26, 107-129.
Horn, J. L., & Masunaga, H. (2000). New directions for research into aging and intelligence: The
development of expertise. In T.J. Perfect, & E.A. Maylor, Models of Cognitive Aging (pp. 125-159).
Oxford, England: Oxford University Press.
Hungi, N (1997). Measuring Basic Skills Across Primary School Years. Unpublished MA thesis,
School of Education, The Flinders University of South Australia, Adelaide.
Hyde, J.S., Lindberg, S.M., Linn, M.C., Ellis, A.B., & Williams, C.C. (2007). Gender similarities
characterize math performance. Science, 321, 494-495.
Jensen, A.R. (1998). The g factor: The science of mental ability. New York, NY: Praeger.
Juster, F.T. (1975). Education, income, and human behavior. Hightstown, NJ: McGraw-Hill.
Kalaycioglu, D.B. & Berberoglu, G. (2010). Differential item functioning analysis of the science
and mathematics items in the university entrance exams in Turkey. Journal of Psychoeducational
Assessment, 29, 1-12.
Kapteyn, A. (2002). Internet Interviewing and the HRS (grant number 1R01AG020717-01). Rand
Corporation: Santa Monica, CA.
Kaufman, A.S., Kaufman, J.C., Liu, X., & Johnson, C.K. (2009). How do Educational Attainment
and Gender Relate to Fluid Intelligence, Crystallized Intelligence, and Academic Skills at Ages 22–90
Years? Archives of Clinical Neuropsychology, 24, 153 – 163.
Keeves, J.P and Alagumalai, S (1999). New Approaches to Measurement. In Masters, G.N and
Keeves, J.P (Eds.) Advances in Measurement in Educational Research and Assessment (pp. 23-42),
Pergamon, Oxford.
Klein Entink, R.H. (2009). Statistical models for responses and response times. Thesis.
Netherlands: University of Twente.
Klein Entink, R.H., Fox, J.P., & van der Linden, W.J. (2009). A multivariate multilevel approach to
the modeling of accuracy and speed of test takers. Psychometrika, 74, 21-48.
Kline, P. (1986). A Handbook of Test Construction: Introduction to Psychometric Design.
Methuen: London.
Lord, F. (1952). A theory of test scores. Psychometric Monograph, 7.
Lord, F.M. (1970). Some test theory for tailored testing. In W.H. Holtzman (Ed.), Computer
assisted instruction, testing, and guidance (pp.139-183). New York: Harper and Row.
Lynn, R. (2010). In Italy, north-south differences in IQ predict differences in income, education,
infant mortality, stature, and literacy. Intelligence, 38, 93-100.
Lynn, R. & Vanhanen, T. (2002). IQ and the Wealth of Nations. Westport, CT: Praeger Publishers.
Lynn, R. & Vanhanen, T. (2006). IQ and global inequality. Augusta, GA: Summit Books.
MacIsaac, D., Cole, R., Cole, D., McCullough, L., & Maxka, J. (2002). Standardized testing in
physics via the world wide web. Electronic Journal of Science Education, 6.
Maris, E. (1993). Additive and multiplicative models for gamma distributed random variables,
and their application as psychometric models for response times. Psychometrika, 58, 445–469.
McArdle, J. J., Ferrer-Caja, E., Hamagami, F. & Woodcock, R. W. (2002). Comparative longitudinal
structural analyses of the growth and decline of multiple intellectual abilities over the life
span. Developmental Psychology, 38(1), 115-142.
McArdle, J.J., Grimm, K.J., Hamagami, F., Bowles, R.P., & Meredith, W. (2009). Modeling life-
span growth curves of cognition using longitudinal data with multiple samples and changing scales of
measurement. Psychological Methods, 14, 126-149.
McArdle, J.J. & Prindle, J.J. (2008). A latent change score analysis of a randomized clinical trial in
reasoning training. Psychology and Aging, 23 (4), 702-719.
McDonald, R.P. (1999). Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum.
McGrew, K. S. (1997). Analysis of the major intelligence batteries according to a proposed
comprehensive Gf-Gc framework. In D. P. Flanagan, J. L. Genshaft, & P. L. Harrison (Eds.), Contemporary
intellectual assessment: Theories, tests, and issues (pp. 151-179). New York: Guildord.
McGrew, K. S. (2005). The Cattell–Horn–Carroll theory of cognitive abilities. In D. P. Flanagan &
P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (2nd ed.,
pp. 136-181). New York: Guilford Press.
McNemar, Q. (1964). Lost: Our intelligence. Why? American Psychologist, 19, 871-882.
Meijer, R. R., & Nering, M. L. (1999). Computerized adaptive testing: Overview and introduction.
Applied Psychological Measurement, 23, 187–194.
Mislevy, R.J. & Chang, H. (2000). Does adaptive testing violate local independence?
Psychometrika, 65, 149-156.
Mislevy, R.J. & Wu, P.K. (1996). Inferring examinee ability when some item responses are missing
(Research Report 88-48-ONR). Princeton, NJ: Educational Testing Service.
Nettlebeck, T. (2011). Basic processes of intelligence. In R.J. Sternberg & S.B. Kaufman (Eds.),
The Cambridge Handbook of Intelligence (pp. 371-393). New York, NY: Cambridge University Press.
Nunnally, J.C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Olsson, U. (2005). Confidence intervals for the mean of a log-normal distribution. Journal of
Statistics Education, 13.
Osterlind, S.J (1983). Test item bias. Quantitative Applications in the Social Sciences. Newbury
Park, CA: Sage Publications.
Potosky, D., & Bobko, P. (2004). Selection testing via the Internet: practical considerations and
exploratory empirical findings. Personnel Psychology, 57, 1003-1034.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Denmark
Paedagogiske Institute, Copenhagen.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Rindermann, H. (2008). Relevance of education and intelligence at the national level for the
economic welfare of people. Intelligence, 36, 127-142.
Roskam, E.E. (1997). Models for speed and time-limit tests. In W.J. van der Linden & R.K.
Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 187–208). New York: Springer.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian
statistical framework for response time distributions. Psychometrika, 68, 589–606.
Salthouse, T. A. (1996). A processing-speed theory of adult age differences in cognition.
Psychological Review, 103, 403-428.
Salthouse, T.A. (2006). Mental exercise and mental aging: Evaluating the validity of the use it or
lose it hypothesis. Perspectives on Psychological Science, 1, 68-87.
Sands, W. A., Waters, B.K., & McBride, J. R. (Eds.) (1997). Computerized adaptive testing: From
inquiry to operation. Washington, DC: American Psychological Association.
Schaie, K. W., & Willis, S. L. (1993). Age-difference patterns of psychometric intelligence in
adulthood: Generalizability within and across ability domains. Psychology and Aging, 8, 44-55.
Scheiblechner, H. (1979). Specific objective stochastic latency mechanisms. Journal of
Mathematical Psychology, 19, 18–38.
Schmidt, F. L., & Hunter, J. (2004). General mental ability in the world of work: Occupational
attainment and job performance. Journal of Personality and Social Psychology, 86, 162-173
Schrank, F.A. (2006). Specification of the cognitive processes involved in performance on the
Woodcock-Johnson III (Assessment Service Bulletin No. 7). Itasca, IL: Riverside Publishing.
Schroeders, U. & Wilhelm, O. (2010). Testing reasoning ability with handheld computers,
notebooks, and paper and pencil. European Journal of Psychological Assessment, 26, 284-292.
Shmueli, G., Patel, N.R., & Bruce, P.C. (2007). Data Mining for Business Intelligence. NY: Wiley.
Sliwinski, M. & Buschke, H. (1999). Cross-sectional longitudinal relationships among age,
cognition, and processing speed. Psychology and Aging, 14, 18-33.
Spearman, C. (1904). General intelligence objectively determined and measured. American
Journal of Psychology, 15, 201-293.
Speededness. (2011). In Wiktionary online. Retrieved March 1 2012, from
http://en.wiktionary.org
Speediness. (2011). In Wiktionary online. Retrieved March 1 2012, from http://en.wiktionary.org
Strenze, T. (2007). Intelligence and socioeconomic success: A meta-analytic review of
longitudinal research, Intelligence, 35, 402-426.
Thorndike, R.L and Thorndike, R.M (1994). Reliability in Educational and Psychological
Measurement. In T. Husen and T.N Postlethwaite (Eds.) The International Encyclopedia of Education, 2nd
edition, (pp. 4981-4995). Oxford: Pergamon.
Thurstone, L.L. (1938). Primary mental abilities. Psychometric Monographs, 1, 121.
Tonidandel, S., Quinones, M.A., & Adams, A.A. (2002). Computer-adaptive testing: The impact of
test characteristics on perceived performance and test takers’ reactions. Journal of Applied Psychology,
87, 320-332.
Tsutakawa, R.K. & Johnson, J.C. (1990). The effect of uncertainty of item parameter estimation
on ability estimates. Psychometrika, 55, 371-390.
van der Linden, W.J. (2006). A lognormal model for response times on test items. Journal of
Educational and Behavioral Statistics, 31, 181-204.
Van der Linden, W.J. (2007). A hierarchical framework for modeling speed and accuracy on test
items. Psychometrika, 72, 287-308.
van der Linden, W. J. (2008). Using response times for item selection in adaptive testing. Journal
of Educational and Behavioral Statistics, 33, 5-20.
Van der Linden, W.J. & Hambleton, R.K. (Eds.) (1997). Handbook of modern item response
theory. New York: Springer-Verlag.
van der Linden, W.J. & Glas, C.A.W. (Eds.) (2010). Computerized Adaptive Testing: Theory and
Practice. Netherlands: Kluwer Academic Publishers.
van der Linden, W.J., & Guo, F. (2006). Two Bayesian Procedures for Identifying Aberrant
Response-Time Patterns in Adaptive Testing. Manuscript submitted for publication.
Van der Linden, W.J. & Pashley, P.J. (2010). Item selection and ability estimation in adaptive
testing. In W.J. van der Linden and C.A.W. Glas (Eds.), Computerized Adaptive Testing: Theory and
Practice (pp. 1-25). Netherlands: Kluwer Academic Publishers.
Verhelst, N.D., Verstralen, H.H.F.M., & Jansen, M.G. (1997). A logistic model for time-limit tests.
In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 169–
185). New York: Springer-Verlag.
Vernon, P.E. (1950). The structure of human abilities. London: Methuen.
Volkman, F., Szatmari, P., and Sparrow, S. (1993). Sex differences in pervasive developmental
disabilities. Journal of Autism and Developmental Disabilities, 23, 579-591.
Wainer, H. (Ed.) (2000). Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence
Erlbaum.
Wainer, H. & Mislevy , R.J. (2000). Item response theory, item calibration, and proficiency
estimation. In H. Wainer (Ed.), Computerized Adaptive Testing: A Primer (pp.61-100). Mahwah, NJ:
Lawrence Erlbaum.
Wechsler, D. (1939). Measurement of adult intelligence. Baltimore, MD: Williams & Wilkins.
Weede, E. (2006). Economic freedom and development: New calculations and interpretations.
Cato Journal, 26, 511-524.
Weiss, D.J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied
Psychological Measurement, 4, 473-492.
Weiss, D.J. (1985). Adaptive testing by computer. Journal of Consulting and Clinical Psychology,
53, 774-789.
Wilhelm, O. & Schroeders, U. (2008). Computerized ability measurement: some substantive dos
and don'ts. In F.Scheuermann & A. G.Pereira (Eds), Towards a research agenda on computer-based
assessment: challenges and needs for European educational measurement (pp. 76 –84). Luxembourg:
Office for Publications of the European Communities.
Willis, R.J. (2011). Health and Retirement Study. (grant number NIA U01AG009740). University
of Michigan: Ann Arbor, MI.
Winship, C. & Korenman, S.D. (1997). Does staying in school make you smarter? The effect of
education on IQ in the bell curve. In B. Devlin, S.E. Fienberg, D.P. Resnick, & K. Roeder (Eds.),
Intelligence, Genes, and Success. Scientists Respond to the Bell Curve. New York, NY: Springer.
Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort moderated
IRT model. Journal of Educational Measurement, 43, 19-38.
Wise, S. L., & Kong, X. J. (2005). Response time effort: A new measure of examinee motivation in
computer-based tests. Applied Measurement in Education, 16, 163-183.
Zimprich, D., & Martin, M. (2002). Can longitudinal changes in processing speed explain
longitudinal age changes in fluid intelligence? Psychology and Aging, 17, 690–695.
Appendix
The following pieces of computer code outline the steps taken to organize and analyze the data
presented in the previous chapters. The code is organized by chapter and by statistical package. Each
program documents the logic of the analysis, so equivalent procedures written in other programming
languages should produce the same results.
Chapter 3 Programs
These code segments outline the steps to organize and convert the data into a format for
calculating ability scores. The code scores correct and incorrect responses and normalizes response
times to be in line with the survey design. Checks are made to flag responses given without examining
the item content (answering too fast) or without focusing on the task (taking too long). Code is then
presented for scoring respondents in the IRT framework, followed by code to perform differential item
functioning analyses.
SAS Programs
*ALP ANALYSIS* *John Prindle* *June 2011*;
*(1) Drop variables and start with raw dataset;
*(2) Format item scores with response times as inputs;
*(3) CTT variables created;
*(4) IRT analysis;
*(5) Merge data together;
LIBNAME ALP 'C:\Bank\ALP2\Data';
LIBNAME DISS 'C:\Work\Dissertation\Data';
***(1)***;
DATA temp1; SET ALP.ns_2011_01;
KEEP mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15 ns_at1-ns_at15 ns_bt1-ns_bt15
prim_key id attempt highest_a highest_b highest_d highest_s
wscore_a werror_a count_a correct_a wscore_b werror_b count_b
correct_b
wscore_s wscore_d
bata_wscore batb_wscore bats_wscore batd_wscore
age50d10 e_female educ12d4 nc_pos nc_neg fam_income log_income;
RUN;
**Create demo variables for data from transformed variables;
DATA temp1; SET temp1;
age = (age50d10*10)+50;
female = 1; IF e_female < 0 THEN female = 0;
educ = (educ12d4*4)+12;
member = attempt;
DROP attempt;
RUN;
PROC MEANS DATA = temp1;
VAR age; RUN;
***(2)***;
**Examine timing variables to rescore items: >120 seconds is wrong response
and time is capped at 120;
**A wrong response less than 5 seconds is a nonresponse;
DATA temp2; SET temp1;
ARRAY timea {15} ns_at1-ns_at15;
ARRAY timeb {15} ns_bt1-ns_bt15;
ARRAY itema {15} mcorr_a01-mcorr_a15;
ARRAY itemb {15} mcorr_b01-mcorr_b15;
fixya = 0; fixyb = 0;
fixed1a = 0; fixed1b = 0;
fixed2a = 0; fixed2b = 0;
DO i = 1 TO 15;
IF timea(i) < 5 THEN DO;
fixya = fixya + 1;
IF itema(i) = 0 THEN DO; itema(i) = .;
timea(i) = .;
fixed1a = fixed1a + 1;
END;
IF itema(i) = . THEN timea(i) = .;
END;
IF timeb(i) < 5 THEN DO;
fixyb = fixyb + 1;
IF itemb(i) = 0 THEN DO; itemb(i) = .;
timeb(i) = .;
fixed1b = fixed1b + 1;
END;
IF itemb(i) = . THEN timeb(i) = .;
END;
IF timea(i) > 120 THEN DO;
itema(i) = 0;
timea(i)=120;
fixed2a = fixed2a + 1;
END;
IF timeb(i) > 120 THEN DO;
itemb(i) = 0;
timeb(i)=120;
fixed2b = fixed2b + 1;
END;
END;
DROP i;
RUN;
**Check the changed scores. fixed1 indicates number of scores changed due to
not enough time;
**fixed2 is the number of scores changed due to too much time taken;
PROC MEANS SUM; VAR fixya fixyb fixed1a fixed2a fixed1b fixed2b; RUN;
PROC FREQ; TABLE fixed1a fixed2a fixed1b fixed2b; RUN;
***(3)***;
**Create CTT type scores for part of inclusive analysis of paper;
DATA temp3; SET temp2;
ARRAY itema {15} mcorr_a01-mcorr_a15;
ARRAY itemb {15} mcorr_b01-mcorr_b15;
seed = 20110601;
random = ranuni(seed);
correct_a = 0; correct_b = 0;
DO i = 1 TO 15;
correct_a = correct_a + itema(i);
correct_b = correct_b + itemb(i);
END;
correct_t = correct_a + correct_b;
correct_d = correct_a - correct_b;
correct_s = (correct_a + correct_b)/2;
correct_p = (correct_t/30)*100;
RUN;
*****SAVE DATA*****;
DATA DISS.rawscore_201106; SET temp3; RUN;
DATA DISS.rawscore_201107; SET temp3; RUN;
**Check the sum scores of the two scales and the total for correctness;
PROC MEANS; VAR correct_p correct_t correct_d correct_s correct_a correct_b;
RUN;
**Cronbach's alpha computed for scale A, scale B, and the combined scale;
**Nunnally and Bernstein (1994) suggest alpha = 0.70 at least;
PROC CORR DATA = temp3 NOPROB NOCORR NOMISS ALPHA; VAR mcorr_a01-mcorr_a15;
RUN;
PROC CORR DATA = temp3 NOPROB NOCORR NOMISS ALPHA; VAR mcorr_b01-mcorr_b15;
RUN;
PROC CORR DATA = temp3 NOPROB NOCORR NOMISS ALPHA; VAR mcorr_a01-mcorr_a15
mcorr_b01-mcorr_b15; RUN;
*****SAVE FOR ALP*****;
DATA ALP.ns_2011_06_organized; SET temp3; RUN;
***(4)***;
**IRT analysis;
**Set A;
DATA temp_1a;
SET alp.ns_2011_06_organized;
person = ID;
KEEP person member mcorr_A01-mcorr_A15;
RUN;
PROC SORT DATA=temp_1a; BY person member; RUN;
PROC TRANSPOSE DATA=temp_1a OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA temp_1b; SET longForm1;
new = SUBSTR(i, 8, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*10 + 1;
IF member = 2 THEN tempid = person*10 + 2;
IF member = 3 THEN tempid = person*10 + 3;
IF member = 4 THEN tempid = person*10 + 4;
IF member = 5 THEN tempid = person*10 + 5;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=temp_1b METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
ARRAY beta[15] beta1-beta15;
**Anchored Set A item difficulties are assigned here rather than in PARMS so they are
treated as fixed constants and only the ability distribution (m_g and v_g) is estimated;
beta1=-2.91; beta2=-3.27; beta3=-2.97; beta4=-1.23; beta5=-1.77;
beta6=.04; beta7=.82; beta8=1.19; beta9=1.55; beta10=.38;
beta11=8.82; beta12=2.2;
beta13=2.56; beta14=2.82; beta15=4.05;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.7 v_g=2.4;
PREDICT p OUT=predProbA;
PREDICT gscore OUT=personParmA;
ODS OUTPUT ParameterEstimates=itemParmA;
RUN;
DATA ALP.itemparmA; SET itemparmA; RUN;
DATA ALP.personparmA; SET personparmA; RUN;
DATA ALP.predProbA; SET predProbA; RUN;
**Set B;
DATA temp_2a;
SET alp.ns_2011_06_organized;
ARRAY mcorr {1:15} mcorr_b01-mcorr_b15;
ARRAY temp {1:14} temp1-temp14;
DO i = 1 TO 10; temp[i]=mcorr[i]; END;
DO i = 11 TO 14; temp[i]=mcorr[i+1]; END;
DO i = 1 TO 14; mcorr[i]=temp[i]; END;
person = ID;
KEEP person member mcorr_B01-mcorr_B14;
RUN;
PROC MEANS DATA=temp_2a; VAR mcorr_B01-mcorr_B14; RUN;
PROC SORT DATA=temp_2a; BY person member; RUN;
PROC TRANSPOSE DATA=temp_2a OUT=longForm2 NAME=i PREFIX=score;
BY person member;
RUN;
DATA temp_2b; SET longForm2;
new = SUBSTR(i, 8, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*10 + 1;
IF member = 2 THEN tempid = person*10 + 2;
IF member = 3 THEN tempid = person*10 + 3;
IF member = 4 THEN tempid = person*10 + 4;
IF member = 5 THEN tempid = person*10 + 5;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=temp_2b METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
ARRAY beta[14] beta1-beta14;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86 beta1=-3.13 beta2=-3.34 beta3=-3.01 beta4=-1.17
beta5=-1.72
beta6=.06 beta7=.95 beta8=1.34 beta9=1.66 beta10=.47 beta11=2.27
beta12=2.41 beta13=2.94 beta14=4.05;
PREDICT p OUT=predProbB;
PREDICT gscore OUT=personParmB;
ODS OUTPUT ParameterEstimates=itemParmB;
RUN;
DATA ALP.itemparmB14; SET itemparmB; RUN;
DATA ALP.personparmB14; SET personparmB; RUN;
DATA ALP.predProbB14; SET predProbB; RUN;
**Set AB;
DATA temp_3a;
SET alp.ns_2011_06_organized;
person = ID;
ARRAY mcorr {1:15} mcorr_b01-mcorr_b15;
ARRAY temp {1:14} temp1-temp14;
DO i = 1 TO 10; temp[i]=mcorr[i]; END;
DO i = 11 TO 14; temp[i]=mcorr[i+1]; END;
DO i = 1 TO 14; mcorr[i]=temp[i]; END;
KEEP person member mcorr_A01-mcorr_A15 mcorr_B01-mcorr_B14;
RUN;
PROC SORT DATA=temp_3a; BY person member; RUN;
PROC TRANSPOSE DATA=temp_3a OUT=longForm3 NAME=i PREFIX=score;
BY person member;
RUN;
DATA temp_3b; SET longForm3;
new = SUBSTR(i, 8, 2);
alphabet = SUBSTR(i, 7, 1);
item = input(new,8.0);
IF alphabet = 'B' THEN item = item + 15;
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*10 + 1;
IF member = 2 THEN tempid = person*10 + 2;
IF member = 3 THEN tempid = person*10 + 3;
IF member = 4 THEN tempid = person*10 + 4;
IF member = 5 THEN tempid = person*10 + 5;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=temp_3b METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
ARRAY beta[29] beta1-beta29;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.6 v_g=2.5 beta1=-2.91 beta2=-3.27 beta3=-2.97 beta4=-1.23
beta5=-1.77
beta6=.04 beta7=.82 beta8=1.19 beta9=1.55 beta10=.38 beta11=8.82
beta12=2.2
beta13=2.56 beta14=2.82 beta15=4.05 beta16=-2.91 beta17=-3.27 beta18=-
2.97 beta19=-1.23 beta20=-1.77
beta21=.04 beta22=.82 beta23=1.19 beta24=1.55 beta25=.38 beta26=2.2
beta27=2.56 beta28=2.82 beta29=4.05;
PREDICT p OUT=predProbAB;
PREDICT gscore OUT=personParmAB;
ODS OUTPUT ParameterEstimates=itemParmAB;
RUN;
DATA ALP.itemparmAB29; SET itemparmAB; RUN;
DATA ALP.personparmAB29; SET personparmAB; RUN;
DATA ALP.predProbAB29; SET predProbAB; RUN;
***(5)***;
**Merge three IRT scores to the main file;
PROC SORT DATA = ALP.personparmA; BY tempid;
DATA tempmergeA; SET ALP.personparmA;
BY tempid;
id = person;
Mscore_A = 9.1024*pred+500;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id member Mscore_A;
RUN;
PROC MEANS DATA = tempmergeA; VAR Mscore_A; RUN;
PROC SORT DATA = ALP.personparmB; BY tempid;
DATA tempmergeB; SET ALP.personparmB;
BY tempid;
id = person;
Mscore_B = 9.1024*pred+500;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id member Mscore_B;
RUN;
PROC MEANS DATA = tempmergeB; VAR Mscore_B; RUN;
PROC SORT DATA = ALP.personparmAB; BY tempid;
DATA tempmergeAB; SET ALP.personparmAB;
BY tempid;
id = person;
Mscore_AB = 9.1024*pred+500;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id member Mscore_AB;
RUN;
PROC MEANS DATA = tempmergeAB; VAR Mscore_AB; RUN;
DATA temp4;
MERGE alp.ns_2011_06_organized tempmergeA tempmergeB tempmergeAB;
BY id member;
RUN;
*****SAVE DATA*****;
DATA DISS.calcscore_201106; SET temp4; RUN;
*Procedures to analyze descriptives;
PROC CORR DATA = DISS.calcscore_201106; VAR Wscore_A Wscore_B Wscore_S
Mscore_A Mscore_B Mscore_AB; RUN;
PROC MEANS DATA = DISS.calcscore_201106 SUM;
VAR fixed1a fixed1b;
RUN;
PROC CORR DATA = data1;
VAR Wscore_s Wscore_a Wscore_B mscore_ab mscore_a mscore_b nc_pos nc_neg
lincome;
RUN;
PROC MEANS DATA = data1 MEDIAN;
VAR Wscore_s Wscore_a Wscore_B mscore_ab mscore_a mscore_b nc_pos nc_neg
lincome;
RUN;
PROC TTEST DATA = data1;
PAIRED mscore_ab*wscore_s; RUN;
PROC FREQ DATA = DISS.calcscore_201106;
TABLE fixed1a fixed2a; RUN;
DATA data1; SET diss.calcscore_201106;
label correct_p='Raw Score Percentage';
label mscore='Mscore for 30 Item Test';
label wscore='Wscore for 30 Item Test';
wscore_ab = (wscore_a + wscore_b) / 2;
RUN;
PROC UNIVARIATE DATA = data1 NOPRINT;
HISTOGRAM correct_p/VAXISLABEL='Frequency Percentage';
RUN; QUIT;
goptions reset=all;
axis1 label=(a=90 'Wscore for 30 Item Analysis');
axis2 label=('Mscore from 30 Item Analysis');
proc gplot data=data1;
plot Wscore_ab * Mscore_ab / vaxis=axis1 HAXIS=axis2;
run;
quit;
**Save data for mplus DIF test;
DATA _NULL_; SET diss.calcscore_201106;
FILE 'C:\Work\Dissertation\Mplus\scoresAB.dat';
PUT female mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
RUN;
PROC SORT DATA = diss.income; BY id member; RUN;
PROC SORT DATA = diss.calcscore_201106; BY id member; RUN;
DATA data1; MERGE diss.calcscore_201106 diss.income;
BY id member;
wscore_ab = (wscore_a + wscore_b) / 2;
nc_posc = nc_pos-60;
nc_negc = nc_neg-31;
agec = age - 48.9;
educ_c = educ - 11.5;
lincome = log(fam_income)-10.8;
abye = agec*educ_c;
abyf = agec*e_female;
abyp = agec*nc_posc;
abyn = agec*nc_negc;
ebyf = educ_c*e_female;
ebyp = educ_c*nc_posc;
ebyn = educ_c*nc_negc;
fbyp = e_female*nc_posc;
fbyn = e_female*nc_negc;
pbyn = nc_posc*nc_negc;
lbya = lincome*agec;
lbye = lincome*educ_c;
lbyf = lincome*e_female;
lbyp = lincome*nc_posc;
lbyn = lincome*nc_negc;
RUN;
PROC REG DATA = data1;
MODEL mscore_ab = agec educ_c e_female lincome nc_posc nc_negc;
MODEL Wscore_ab = agec educ_c e_female lincome nc_posc nc_negc;
RUN; QUIT;
PROC REG DATA = data1;
MODEL mscore_ab = agec educ_c e_female lincome nc_posc nc_negc abye abyf abyp
abyn ebyf ebyp ebyn fbyn fbyp pbyn lbya lbye lbyf lbyp lbyn;
MODEL Wscore_ab = agec educ_c e_female lincome nc_posc nc_negc abye abyf abyp
abyn ebyf ebyp ebyn fbyn fbyp pbyn lbya lbye lbyf lbyp lbyn;
RUN; QUIT;
Mplus Programs
TITLE: ALP DIF testing
invariant groups
DATA: FILE = scoresAB.dat;
ANALYSIS: DIFFTEST IS difftest.dat;
VARIABLE: NAMES = female itema01-itema15 itemb01-itemb15;
USEV = itema01-itema15 itemb01-itemb15;
MISSING = .;
GROUPING = female (0=m 1=f);
CATEGORICAL = itema01-itemb15;
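! Annotation added for readability; it describes the syntax as written and is not part
! of the original program. All item loadings are fixed at 1 (a Rasch-type model), the
! factor variance is labeled separately in each group (vf1, mvf1, fvf1) so it is freely
! estimated, and the first threshold of itema01 carries the same label (t) in every
! group so it is held equal as the anchor. DIFFTEST reads derivatives saved from a
! previously estimated, less restrictive model to form the chi-square difference test.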
MODEL: F1 BY itema01-itema15@1;
F1 BY itemb01-itemb15@1;
F1 (vf1);
[itema01$1] (t);
MODEL m: F1 BY itema01-itema15@1;
F1 BY itemb01-itemb15@1;
F1 (mvf1);
[itema01$1] (t);
MODEL f: F1 BY itema01-itema15@1;
F1 BY itemb01-itemb15@1;
F1 (fvf1);
[itema01$1] (t);
OUTPUT: SAMPSTAT STD MODINDICES TECH4;
Chapter 4 Programs
This portion of the analysis code focuses on formatting data and running CIRT analyses for
alternative ability score estimation procedures.
SAS Programs
*ALP ANALYSIS* *John Prindle* *July 2011*;
*Use dataset : diss.rawscore_201107;
*(1) Calculate demo stats for Response and RT;
*(2) create file for CIRT analysis;
LIBNAME ALP 'C:\Bank\ALP2\Data';
LIBNAME DISS 'C:\Work\Dissertation\Data';
*(1) demo stats calculated;
PROC MEANS DATA = diss.rawscore_201107;
VAR ns_bt1-ns_bt15; RUN;
DATA temp; set diss.rawscore_201107;
ARRAY time {30} ns_at1-ns_at15 ns_bt1-ns_bt15;
ARRAY ltime {30} ns_lt1-ns_lt30;
DO i = 1 TO 30;
ltime(i) = log(time(i));
END;
RUN;
PROC MEANS DATA = diss.income;
VAR fam_income; RUN;
PROC MEANS DATA = temp;
VAR female age educ nc_pos nc_neg; RUN;
PROC FREQ DATA = diss.rawscore_201107;
TABLE ns_at15*mcorr_a15; RUN;
DATA temp; SET diss.rawscore_201107;
ARRAY nsitem {30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY nstime {30} ns_at1-ns_at15 ns_bt1-ns_bt15;
DO i = 1 TO 30;
IF nsitem(i) = . THEN nsitem(i) = 9;
IF nstime(i) NE . THEN nstime(i) = log(nstime(i));
IF nstime(i) = . THEN nstime(i) = 0;
END;
RUN;
*create empty time data file (all 0s) in a separate dataset so the recoded item data in temp is not overwritten;
DATA temp0;
ARRAY time {30} time1-time30;
DO n = 1 TO 2548;
DO i = 1 TO 30;
time(i)=0;
END;
OUTPUT;
END;
RUN;
DATA _NULL_; SET temp;
FILE 'C:\Work\Dissertation\R\itemsABdot.dat';
PUT mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
RUN;
DATA _NULL_; SET temp;
FILE 'C:\Work\Dissertation\R\timesA.dat';
PUT ns_at1-ns_at15;
RUN;
DATA _NULL_; SET temp;
FILE 'C:\Work\Dissertation\R\timesB.dat';
PUT ns_bt1-ns_bt15;
RUN;
DATA _NULL_; SET temp0;
FILE 'C:\Work\Dissertation\R\times0.dat';
PUT time1-time30;
RUN;
DATA _NULL_; SET temp;
FILE 'C:\Work\Dissertation\R\age.dat';
PUT age;
RUN;
PROC FREQ DATA = temp;
TABLE mcorr_a01-mcorr_a15; RUN;
PROC MEANS DATA = diss.rawscore_201107 SKEW;
VAR ns_at1-ns_at15; RUN;
R Programs
#load in function libraries
library("cirt")
library("coda")
library("boa")
#load in data from hard drive
setwd('C:/Work/Dissertation/R')
scores <- read.table('itemsAB.dat')
timesA <- read.table('timesA.dat')
timesB <- read.table('timesB.dat')
times0 <- read.table('times0.dat')
age <- read.table('age.dat')
times <- cbind(timesA, timesB)
#set up model parms
iter <- 30000
N <- length(scores[,1])
N
K <- length(scores[1,])
K
scores <- as.matrix(scores)
times <- as.matrix(times)
times0 <- as.matrix(times0)
#?estimate
#run models: out1 is the 1PL response model
# out2 is the 2PL response model
out1 <- estimate(scores,times,N,K, iter, PL = 1)
out2 <- estimate(scores,times,N,K, iter, PL = 2)
out3 <- estimate(scores,times,N,K, iter, PL = 1, TM = 1)
out4 <- estimate(scores,times,N,K, iter, PL = 2, TM = 1)
summarize(out1,5000)
summarize(out2,5000)
summarize(out3,5000)
summarize(out4,5000)
#isolate 1PL,
parms1 <- matrix(out1[[22]],nrow = iter,ncol=K)
parms1 <-as.mcmc(parms1)
parms2 <- matrix(out1[[24]],nrow = iter,ncol=K)
parms2 <-as.mcmc(parms2)
#isolate 2PL parameters
parms3 <- matrix(out2[[22]],nrow = iter,ncol=K)
parms3 <-as.mcmc(parms3)
parms4 <- matrix(out2[[24]],nrow = iter,ncol=K)
parms4 <-as.mcmc(parms4)
#1TM models for DIC comparison
parms5 <- matrix(out3[[22]],nrow = iter,ncol=K)
parms5 <-as.mcmc(parms5)
parms6 <- matrix(out3[[24]],nrow = iter,ncol=K)
parms6 <-as.mcmc(parms6)
parms7 <- matrix(out4[[22]],nrow = iter,ncol=K)
parms7 <-as.mcmc(parms7)
parms8 <- matrix(out4[[24]],nrow = iter,ncol=K)
parms8 <-as.mcmc(parms8)
#run CODA analysis with 3 types of analysis
codamenu()
#model fit stats
fitirt(N,K, out2)
#Figure 4.3
par(mfrow=c(3,2))
fitrt(scores,N,K, out2, 1)
fitrt(scores,N,K, out2, 2)
fitrt(scores,N,K, out2, 3)
fitrt(scores,N,K, out2, 4)
fitrt(scores,N,K, out2, 5)
fitrt(scores,N,K, out2, 6)
#Parse out speed and ability person parameters
theta <- matrix(out2[[26]], nrow = N, ncol = 2)
zeta <- matrix(out2[[27]], nrow = N, ncol = 2)
#out6 is not created above; it is assumed to come from a response-only comparison run,
#for example out6 <- estimate(scores, times0, N, K, iter, PL = 2), using the zero-time matrix
compt <- matrix(out6[[26]], nrow = N, ncol = 2)
#Figure 4.4
plot(theta[,1], zeta[,1], xlab = "Estimated Ability", ylab = "Estimated Speed")
cor(theta[,1], zeta[,1])
plot(theta[,1], theta[,2], xlab = "Estimated CIRT Ability", ylab = "Estimated SD")
plot(compt[,1], compt[,2], xlab = "Estimated IRT Ability", ylab = "Estimated SD")
#Figure 4.5
plot(theta[,1],theta[,2], col=2, pch=16, xlab = "Estimated Ability", ylab = "Estimated SD")
points(theta[,1],compt[,2], col=4, pch=3)
# partial effect of age and ability with speed taken out
pcor <- function(v1, v2, v3)
{ c12 <- cor(v1, v2)
c23 <- cor(v2, v3)
c13 <- cor(v1, v3)
partial <- (c12-(c13*c23))/(sqrt(1-(c13^2)) * sqrt(1-(c23^2)))
return(partial) }
pcor(age,theta[,1],zeta[,1])
Chapter 5 Programs
This set of statistical programs presents the methodology for creating reduced and adaptive item sets.
Each of the short form item set methods is outlined, followed by the methods for adaptive item sets.
Ability scores are calculated with the methods outlined in Chapter 3 and Chapter 4. The item subset
scores are then compared to the scores obtained from the full item sets under the IRT and CIRT scoring
schemes.
SAS Programs
TITLE1 'Cognitive Adaptive Testing in ALP: 2 Item Analysis';
TITLE2 'John Prindle, USC, September 2011';
/*
John Prindle
September 2011
Program: RAND ALP code for partial item set scores predicting total score
A set of codes to compare partial item sets to the 30 item score
focus on different item sets through various methods of form reduction.
Random items sets, fixed, BAT, (ALL fixed Item Sets)
*/
LIBNAME ALP 'C:\Work\Dissertation\Data';
/*Procedures to run random 6 items analysis
items: (4, 5, 8, 11, 18, 19)
difficulties: (-3.03, -4.01, .60, 0.94, -3.48, -1.77)
*/
DATA temp; SET alp.calcscore_201106;
RETAIN kept1-kept6;
ARRAY pick {1:30} pick1-pick30;
ARRAY kept {1:6} kept1-kept6;
k = 1;
IF _N_ = 1 THEN DO;
DO i = 1 TO 30;
seed = 20110915;
pick(i)=ranuni(seed);
IF pick(i) <= .2 THEN DO;
kept(k)=i;
k = k + 1;
END;
IF k = 7 THEN i = 30;
END;
END;
ARRAY item {1:6} item1-item6;
ARRAY resp {1:30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY time {1:6} time1-time6;
ARRAY rt {1:30} ns_at1-ns_at15 ns_bt1-ns_bt15;
DO i = 1 TO 6;
item(i)=resp(kept(i));
time(i)=log(rt(kept(i)));
END;
DROP pick1-pick30 i k;
RUN;
DATA temp_a;
SET temp;
person = ID;
KEEP person member item1-item6;
RUN;
PROC TRANSPOSE DATA=temp_a OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA temp_b; SET longForm1;
new = SUBSTR(i, 5, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*1;
IF member = 2 THEN tempid = person*10;
IF member = 3 THEN tempid = person*100;
IF member = 4 THEN tempid = person*1000;
IF member = 5 THEN tempid = person*10000;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=temp_b METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
beta1 = -3.03; beta2 = -4.01; beta3 = 0.60;
beta4 = 0.94; beta5 = -3.48; beta6 = -1.77;
ARRAY beta[6] beta1-beta6;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86;
PREDICT p OUT=predProbRand;
PREDICT gscore OUT=personParmRand;
*ODS OUTPUT ParameterEstimates=itemParmB;
RUN;
PROC SORT DATA = personparmrand; BY tempid;
DATA itemrand; SET personparmrand;
BY tempid;
id = person;
predrand = 9.1024*pred+500;
attempt = member;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id predrand member;
RUN;
PROC SORT DATA = itemrand; BY id member; RUN;
DATA temp1; MERGE alp.calcscore_201106 itemrand temp; RUN;
PROC CORR DATA = temp1; VAR mscore_ab predrand; RUN;
PROC REG DATA = temp1;
MODEL mscore_ab = predrand;
MODEL mscore_ab = predrand time1-time6; RUN; QUIT;
/*Procedures to run fixed 6 item analysis
items: (17, 3, 25, 8, 9, 30)
difficulties: (-3.89, -2.51, -0.18, 0.60, 1.77, 3.36)
*/
DATA temp; SET alp.calcscore_201106;
ARRAY kept {1:6} (17, 3, 25, 8, 9, 30);
ARRAY item {1:6} item1-item6;
ARRAY time {1:6} time1-time6;
ARRAY resp {1:30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY rt {1:30} ns_at1-ns_at15 ns_bt1-ns_bt15;
DO i = 1 TO 6;
item(i)=resp(kept(i));
time(i)=log(rt(kept(i)));
END;
DROP i k;
RUN;
DATA temp_a;
SET temp;
person = ID;
KEEP person member item1-item6;
RUN;
PROC TRANSPOSE DATA=temp_a OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA temp_b; SET longForm1;
new = SUBSTR(i, 5, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*1;
IF member = 2 THEN tempid = person*10;
IF member = 3 THEN tempid = person*100;
IF member = 4 THEN tempid = person*1000;
IF member = 5 THEN tempid = person*10000;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=temp_b METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
beta1 = -3.89; beta2 = -2.51; beta3 = -0.18;
beta4 = 0.60; beta5 = 1.77; beta6 = 3.36;
ARRAY beta[6] beta1-beta6;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86;
PREDICT p OUT=predProbfix;
PREDICT gscore OUT=personParmfix;
RUN;
PROC SORT DATA = personparmfix; BY tempid;
DATA itemfixed; SET personparmfix;
BY tempid;
id = person;
predfixed = 9.1024*pred+500;
attempt = member;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id predfixed member;
RUN;
PROC SORT DATA = itemfixed; BY id member; RUN;
DATA temp2; MERGE temp1 itemfixed; RUN;
PROC CORR DATA = temp2; VAR mscore_ab predrand predfixed; RUN;
/*Highest 6 factor loadings (discrimination)
items: (23, 11, 22, 24, 25, 14)
difficulties: (.59, .94, .24, .92, -0.18, 2.56)
*/
DATA temp; SET alp.calcscore_201106;
ARRAY kept {1:6} (23, 11, 22, 24, 25, 14);
ARRAY item {1:6} item1-item6;
ARRAY time {1:6} time1-time6;
ARRAY resp {1:30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY rt {1:30} ns_at1-ns_at15 ns_bt1-ns_bt15;
DO i = 1 TO 6;
item(i)=resp(kept(i));
time(i)=log(rt(kept(i)));
END;
DROP i k;
RUN;
DATA temp_a;
SET temp;
person = ID;
KEEP person member item1-item6;
RUN;
PROC TRANSPOSE DATA=temp_a OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA temp_b; SET longForm1;
new = SUBSTR(i, 5, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*1;
IF member = 2 THEN tempid = person*10;
IF member = 3 THEN tempid = person*100;
IF member = 4 THEN tempid = person*1000;
IF member = 5 THEN tempid = person*10000;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=temp_b METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
beta1 = 0.59; beta2 = 0.94; beta3 = 0.24;
beta4 = 0.92; beta5 = -0.18; beta6 = 2.56;
ARRAY beta[6] beta1-beta6;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86;
PREDICT p OUT=predProbfact;
PREDICT gscore OUT=personParmfact;
*ODS OUTPUT ParameterEstimates=itemParmB;
RUN;
PROC SORT DATA = personparmfact; BY tempid;
DATA itemfact; SET personparmfact;
BY tempid;
id = person;
predfact = 9.1024*pred+500;
attempt = member;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id predfact member;
RUN;
PROC SORT DATA = itemfact; BY id member; RUN;
DATA temp3; MERGE temp2 itemfact; RUN;
PROC CORR DATA = temp3; VAR mscore_ab predrand predfixed predfact; RUN;
/*Highest 6 correlations with alpha
items: (11, 24, 23, 22, 12, 29)
difficulties: ()
*/
PROC CORR DATA = alp.calcscore_201106 ALPHA;
VAR mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
RUN;QUIT;
DATA temp; SET alp.calcscore_201106;
ARRAY kept {1:6} (11, 24, 23, 22, 12, 29);
ARRAY item {1:6} item1-item6;
ARRAY time {1:6} time1-time6;
ARRAY resp {1:30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY rt {1:30} ns_at1-ns_at15 ns_bt1-ns_bt15;
DO i = 1 TO 6;
item(i)=resp(kept(i));
time(i)=log(rt(kept(i)));
END;
DROP i k;
RUN;
DATA temp_a;
SET temp;
person = ID;
KEEP person member item1-item6;
RUN;
PROC TRANSPOSE DATA=temp_a OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA temp_b; SET longForm1;
new = SUBSTR(i, 5, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*1;
IF member = 2 THEN tempid = person*10;
IF member = 3 THEN tempid = person*100;
IF member = 4 THEN tempid = person*1000;
IF member = 5 THEN tempid = person*10000;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=temp_b METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
beta1 = 0.94; beta2 = 0.92; beta3 = 0.59;
beta4 = 0.24; beta5 = 0.55; beta6 = 2.12;
ARRAY beta[6] beta1-beta6;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86;
PREDICT p OUT=predProbalpha;
PREDICT gscore OUT=personParmalpha;
*ODS OUTPUT ParameterEstimates=itemParmB;
RUN;
PROC SORT DATA = personparmalpha; BY tempid;
DATA itemalpha; SET personparmalpha;
BY tempid;
id = person;
predalpha = 9.1024*pred+500;
attempt = member;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id predalpha member;
RUN;
PROC SORT DATA = itemalpha; BY id member; RUN;
DATA temp4; MERGE temp3 itemalpha; RUN;
PROC CORR DATA = temp4; VAR mscore_ab predrand predfixed predfact predalpha;
RUN;
PROC GPLOT DATA = temp4; PLOT mscore_ab * predrand; RUN;
PROC REG DATA = temp4; MODEL mscore_ab = predrand; RUN;
/*Highest R2 6 items for Mscore_AB
items: (10,12,14,24,29,30)
difficulties: ()
*/
DATA temp; SET alp.calcscore_201106;
ARRAY kept {1:6} (10,12,14,24,29,30);
ARRAY item {1:6} item1-item6;
*ARRAY time {1:6} time1-time6;
ARRAY resp {1:30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
*ARRAY rt {1:30} ns_at1-ns_at15 ns_bt1-ns_bt15;
DO i = 1 TO 6;
item(i)=resp(kept(i));
*time(i)=log(rt(kept(i)));
END;
DROP i k;
RUN;
DATA temp_a;
SET temp;
person = ID;
KEEP person member item1-item6;
RUN;
PROC TRANSPOSE DATA=temp_a OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA temp_b; SET longForm1;
new = SUBSTR(i, 5, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*1;
IF member = 2 THEN tempid = person*10;
IF member = 3 THEN tempid = person*100;
IF member = 4 THEN tempid = person*1000;
IF member = 5 THEN tempid = person*10000;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=temp_b METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
beta1 = 1.60; beta2 = 0.55; beta3 = 2.56;
beta4 = 0.92; beta5 = 2.12; beta6 = 3.36;
ARRAY beta[6] beta1-beta6;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86;
PREDICT p OUT=predProbrs;
PREDICT gscore OUT=personParmrs;
*ODS OUTPUT ParameterEstimates=itemParmB;
RUN;
PROC SORT DATA = personparmrs; BY tempid;
DATA itemrs; SET personparmrs;
BY tempid;
id = person;
predrs = 9.1024*pred+500;
attempt = member;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id predrs member;
RUN;
PROC SORT DATA = itemrs; BY id member; RUN;
DATA temp5; MERGE temp4 itemrs; RUN;
PROC CORR DATA = temp5; VAR mscore_ab predrand predfixed predfact predalpha
predrs; RUN;
**Comparing Means from different scoring methods;
PROC GLM DATA = temp5 ;
*CLASS female;
MODEL mscore_ab predrand predfixed predfact predalpha predrs = / nouni;
REPEATED set 6 / PRINTE ;
*LSMEANS female ;
RUN ;
**Regressions for timing. Must rerun each temp for each dataset;
**to get timing values and then run regressions;
DATA mergeda; MERGE temp4 temp;
ARRAY item {1:6} item1-item6;
ARRAY time {1:6} time1-time6;
ARRAY int {1:6} int1-int6;
tot_time=.;
DO i = 1 TO 6;
int[i]=time[i]*item[i];
IF time[i] NE . THEN DO;
IF tot_time = . THEN tot_time = 0;
tot_time = tot_time + time[i];
END;
END;
RUN;
DATA mergedb; MERGE temp4 temp;
ARRAY item {1:6} item1-item6;
ARRAY time {1:6} time1-time6;
ARRAY int {1:6} int1-int6;
tot_time=.;
DO i = 1 TO 6;
int[i]=time[i]*item[i];
IF time[i] NE . THEN DO;
IF tot_time = . THEN tot_time = 0;
tot_time = tot_time + time[i];
END;
END;
RUN;
DATA mergedc; MERGE temp4 temp;
ARRAY item {1:6} item1-item6;
ARRAY time {1:6} time1-time6;
ARRAY int {1:6} int1-int6;
tot_time=.;
DO i = 1 TO 6;
int[i]=time[i]*item[i];
IF time[i] NE . THEN DO;
IF tot_time = . THEN tot_time = 0;
tot_time = tot_time + time[i];
END;
END;
RUN;
DATA mergedd; MERGE temp4 temp;
ARRAY item {1:6} item1-item6;
ARRAY time {1:6} time1-time6;
ARRAY int {1:6} int1-int6;
tot_time=.;
DO i = 1 TO 6;
int[i]=time[i]*item[i];
IF time[i] NE . THEN DO;
IF tot_time = . THEN tot_time = 0;
tot_time = tot_time + time[i];
END;
END;
RUN;
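/* Hierarchical regressions for each scoring method: the predicted score alone,
   then adding total time, the six item-level times, and finally the
   response-by-time interactions. */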
PROC REG DATA = mergeda;
MODEL mscore_ab = predrand;
MODEL mscore_ab = predrand tot_time;
MODEL mscore_ab = predrand time1-time6;
MODEL mscore_ab = predrand time1-time6 int1-int6;
RUN; QUIT;
PROC REG DATA = mergedb;
MODEL mscore_ab = predfixed;
MODEL mscore_ab = predfixed tot_time;
MODEL mscore_ab = predfixed time1-time6;
MODEL mscore_ab = predfixed time1-time6 int1-int6;
RUN; QUIT;
PROC REG DATA = mergedc;
MODEL mscore_ab = predfact;
MODEL mscore_ab = predfact tot_time;
MODEL mscore_ab = predfact time1-time6;
MODEL mscore_ab = predfact time1-time6 int1-int6;
RUN; QUIT;
PROC REG DATA = mergedd;
MODEL mscore_ab = predalpha;
MODEL mscore_ab = predalpha tot_time;
MODEL mscore_ab = predalpha time1-time6;
MODEL mscore_ab = predalpha time1-time6 int1-int6;
RUN; QUIT;
TITLE1 'Cognitive Adaptive Testing in ALP: 6 Item Analysis';
TITLE2 'John Prindle, USC, September 2011';
/*
Program: RAND ALP code for regressions of items on total score
Purpose: Find the items which predict total score the best.
We want to first use responses then add in response time to see if additional
variance is accounted for. The interaction of the response by response
time is also added in a third step.
*/
LIBNAME ALP 'C:\Work\2011_3_Fall\ALP\Data';
LIBNAME temp 'C:\WORK\Dissertation\Data';
/*
Use a Macro to step through the 30 items.
Create an array of the item responses and regress total 30 item score on
them.
Should be a loop that saves the R^2 to plot against the items.
*/
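/* The DATA step below renames the form A and form B responses and response times
   to response1-response30 and rt1-rt30 and forms the response-by-time
   interactions int1-int30 used in the regression searches. */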
DATA temp; SET temp.calcscore_201106;
ARRAY old1 {30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY old2 {30} ns_at1-ns_at15 ns_bt1-ns_bt15;
ARRAY new1 {30} response1-response30;
ARRAY new2 {30} rt1-rt30;
ARRAY new3 {30} int1-int30;
DO i = 1 TO 30;
new1{i} = old1{i};
new2{i} = old2{i};
new3{i} = old1{i}*old2{i};
END;
*can be updated to retain more of the variables here;
KEEP id member mscore_ab response1-response30 rt1-rt30 int1-int30;
RUN;
*6 Item Response / RT / Int Best Item Models;
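/* Exhaustive search over all six-item subsets of the 30 items (30 choose 6 =
   593,775 regressions): %reg1 fits each model via PROC REG and the nested %DO
   loops retain the best R-square at each level, with the winners written to
   alp.bestr2_6item. */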
%MACRO regression(start=,end=,r=);
%DO h=&start. %TO &end.-5;
%DO i=&h.+1 %TO &end.-4;
%DO j=&i.+1 %TO &end.-3;
%DO k=&j.+1 %TO &end.-2;
%DO l=&k.+1 %TO &end.-1;
%DO z=&l.+1 %TO &end.;
%reg1(a=&h.,b=&i.,c=&j.,d=&k.,e=&l.,f=&z.)
DATA fit&z.; SET fitstats&z.;
fit&z. = nvalue2;
itema&z.=&h.; itemb&z.=&i.; itemc&z.=&j.;
itemd&z.=&k.; iteme&z.=&l.; itemf&z.=&z.;
IF _N_ = 1 THEN OUTPUT;
KEEP fit&z. itema&z. itemb&z. itemc&z.
itemd&z. iteme&z. itemf&z.;
RUN; QUIT;
%END;
%DO z=&l.+1 %TO &l.+1;
DATA bestf&l.; MERGE fit&z.-fit&end.;
ARRAY fits {&z.:&end.} fit&z.-fit&end.;
ARRAY fitf {&z.:&end.} itemf&z.-
itemf&end.;
bestf&l.=fits(&z.); itema&l.=&h.;
itemb&l.=&i.; itemc&l.=&j.;
itemd&l.=&k.; iteme&l.=&l.; itemf&l.=fitf(&z.);
DO x=&z. TO &end.;
IF fits(x)>bestf&l. THEN DO;
bestf&l.=fits(x); itemf&l.=fitf(x);
END; END;
KEEP bestf&l. itema&l. itemb&l. itemc&l.
itemd&l. iteme&l. itemf&l.;
RUN; QUIT;
%END;
%END;
dm 'odsresults; clear;';
dm "log;clear;";
%DO l=&k.+1 %TO &k.+1;
%DO z=&end.-1 %TO &end.-1;
DATA beste&k.; MERGE bestf&l.-bestf&z.;
ARRAY fits {&l.:&z.} bestf&l.-bestf&z.;
ARRAY fitf {&l.:&z.} itemf&l.-itemf&z.;
ARRAY fite {&l.:&z.} iteme&l.-iteme&z.;
beste&k.=fits(&l.); itema&k.=&h.; itemb&k.=&i.;
itemc&k.=&j.;
itemd&k.=&k.;
iteme&k.=fite(&l.); itemf&k.=fitf(&l.);
DO x=&l. TO &z.;
IF fits(x)>beste&k. THEN DO;
beste&k.=fits(x); iteme&k.=fite(x);
itemf&k.=fitf(x);
END; END;
KEEP beste&k. itema&k. itemb&k. itemc&k.
itemd&k. iteme&k. itemf&k.;
RUN; QUIT;
%END;
%END;
%END;
%DO k=&j.+1 %TO &j.+1;
%DO l=&end.-2 %TO &end.-2;
DATA bestd&j.; MERGE beste&k.-beste&l.;
ARRAY fits {&k.:&l.} beste&k.-beste&l.;
ARRAY fitf {&k.:&l.} itemf&k.-itemf&l.;
ARRAY fite {&k.:&l.} iteme&k.-iteme&l.;
ARRAY fitd {&k.:&l.} itemd&k.-itemd&l.;
bestd&j.=fits(&k.); itema&j.=&h.; itemb&j.=&i.;
itemc&j.=&j.;
itemd&j.=fitd(&k.);
iteme&j.=fite(&k.); itemf&j.=fitf(&k.);
DO z=&k. TO &l.;
IF fits(z)>bestd&j. THEN DO;
bestd&j.=fits(z); itemd&j.=fitd(z); iteme&j.=fite(z);
itemf&j.=fitf(z);
END; END;
KEEP bestd&j. itema&j. itemb&j. itemc&j. itemd&j.
iteme&j. itemf&j.;
RUN; QUIT;
%END;
%END;
%END;
%DO j=&i.+1 %TO &i.+1;
%DO k=&end.-3 %TO &end.-3;
DATA bestc&i.; MERGE bestd&j.-bestd&k.;
ARRAY fits {&j.:&k.} bestd&j.-bestd&k.;
ARRAY fitf {&j.:&k.} itemf&j.-itemf&k.;
ARRAY fite {&j.:&k.} iteme&j.-iteme&k.;
ARRAY fitd {&j.:&k.} itemd&j.-itemd&k.;
ARRAY fitc {&j.:&k.} itemc&j.-itemc&k.;
bestc&i.=fits(&j.); itema&i.=&h.; itemb&i.=&i.;
itemc&i.=fitc(&j.);
itemd&i.=fitd(&j.);
iteme&i.=fite(&j.); itemf&i.=fitf(&j.);
DO l=&j. TO &k.;
IF fits(l)>bestc&i. THEN DO;
bestc&i.=fits(l); itemc&i.=fitc(l); itemd&i.=fitd(l);
iteme&i.=fite(l); itemf&i.=fitf(l);
END; END;
KEEP bestc&i. itema&i. itemb&i. itemc&i. itemd&i.
iteme&i. itemf&i.;
RUN; QUIT;
%END;
%END;
%END;
%DO j=&h.+1 %TO &h.+1;
%DO k=&end.-4 %TO &end.-4;
DATA best&h.; MERGE bestc&j.-bestc&k.;
ARRAY fits {&j.:&k.} bestc&j.-bestc&k.;
ARRAY fita {&j.:&k.} itema&j.-itema&k.;
ARRAY fitb {&j.:&k.} itemb&j.-itemb&k.;
ARRAY fitc {&j.:&k.} itemc&j.-itemc&k.;
ARRAY fitd {&j.:&k.} itemd&j.-itemd&k.;
ARRAY fite {&j.:&k.} iteme&j.-iteme&k.;
ARRAY fitf {&j.:&k.} itemf&j.-itemf&k.;
best=fits(&j.); itema=fita(&j.); itemb=itemb&j.;
itemc=itemc&j.;
itemd=itemd&j.; iteme=iteme&j.;
itemf=itemf&j.; model=&h.;
DO l=&j. TO &k.;
IF fits(l)>best THEN DO;
best=fits(l); itema=fita(l); itemb=fitb(l); itemc=fitc(l);
itemd=fitd(l); iteme=fite(l);
itemf=fitf(l);
END;
END;
KEEP model best itema itemb itemc itemd iteme itemf;
RUN; QUIT;
%END;
%END;
%END;
%DO k=&start.+1 %TO &end.-5;
PROC APPEND BASE=best&start. DATA=best&k.; RUN; QUIT;
%END;
DATA alp.bestr2_6item; SET best&start.; best&r.=best; DROP best; RUN;
PROC DATASETS NOLIST; DELETE fit&start.-fit&end.; RUN; QUIT;
PROC DATASETS NOLIST; DELETE fitstats&start.-fitstats&end.; RUN; QUIT;
PROC DATASETS NOLIST; DELETE bestf&start.-bestf&end.; RUN; QUIT;
PROC DATASETS NOLIST; DELETE beste&start.-beste&end.; RUN; QUIT;
PROC DATASETS NOLIST; DELETE bestd&start.-bestd&end.; RUN; QUIT;
PROC DATASETS NOLIST; DELETE bestc&start.-bestc&end.; RUN; QUIT;
%DO k=&start. %TO &end.;
PROC DATASETS NOLIST; DELETE best&k.; RUN; QUIT;
%END;
%MEND regression;
%MACRO reg1(a=,b=,c=,d=,e=,f=);
ODS LISTING CLOSE;
ODS OUTPUT fitstatistics = fitstats&f.;
PROC REG DATA = temp;
MODEL mscore_ab = response&a. response&b. response&c. response&d. response&e.
response&f.;
RUN; QUIT;
ODS OUTPUT CLOSE;
ODS LISTING;
%MEND;
%MACRO reg2(a=,b=,c=,d=,e=,f=);
ODS LISTING CLOSE;
ODS OUTPUT fitstatistics = fitstats&f.;
PROC REG DATA = temp;
MODEL mscore_ab = response&a. response&b. response&c. response&d. response&e.
response&f. rt&a. rt&b. rt&c. rt&d. rt&e. rt&f.;
RUN; QUIT;
ODS OUTPUT CLOSE;
ODS LISTING;
%MEND;
%MACRO reg3(a=,b=,c=,d=,e=,f=);
ODS LISTING CLOSE;
ODS OUTPUT fitstatistics = fitstats&f.;
PROC REG DATA = temp;
MODEL mscore_ab = response&a. response&b. response&c. response&d. response&e.
response&f. rt&a. rt&b. rt&c. rt&d. rt&e. rt&f. int&a. int&b. int&c. int&d.
int&e. int&f.;
RUN; QUIT;
ODS OUTPUT CLOSE;
ODS LISTING;
%MEND;
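/* %reg1 uses the six item responses only, %reg2 adds their response times, and
   %reg3 adds the response-by-time interactions; %regression as written invokes
   %reg1, and the r= argument only labels the saved best R-square variable. */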
%regression(start=1,end=30,r=2)
/*
PROC GPLOT DATA = bestr1;
PLOT best1 * model;
RUN; QUIT;
DATA alp.bestr1_6item; SET bestr1;
RUN;
*/
LIBNAME ALP 'C:\Work\2011_3_Fall\ALP\Data';
LIBNAME temp 'C:\WORK\Dissertation\Data';
/*
Creating Datasets with missing values for the adaptive scoring methods.
1) BAT Scoring
2) Half Adaptive Scoring
3) Fully Adaptive
*/
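/* The DATA step below recreates the renamed response, response time, and
   interaction arrays; the scoring-specific data sets (bata, batb, halfadapt,
   halftime) are then built from the original form A and form B variables. */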
DATA temp; SET temp.calcscore_201106;
ARRAY old1 {30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY old2 {30} ns_at1-ns_at15 ns_bt1-ns_bt15;
ARRAY new1 {30} response1-response30;
ARRAY new2 {30} rt1-rt30;
ARRAY new3 {30} int1-int30;
DO i = 1 TO 30;
new1{i} = old1{i};
new2{i} = old2{i};
new3{i} = old1{i}*old2{i};
END;
*can be updated to retain more of the variables here;
*KEEP id member mscore_ab response1-response30 rt1-rt30 int1-int30;
RUN;
*******1********;
** BATA Scores;
** Items in order of difficulty;
** 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15;
** 1, 5, 2, 4, 3, 6, 7, 12, 8, 11, 10, 9, 13, 14, 15;
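/* Block-adaptive (BAT) scoring, form A: every respondent answers a three-item
   routing block (original items 4, 7, and 10, stored as item7-item9); the number
   correct (CR1) routes them to one of four difficulty-ordered three-item blocks,
   and unadministered items remain missing for the IRT scoring below. */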
DATA bata; SET temp.calcscore_201106;
ARRAY old1 {15} mcorr_a01-mcorr_a15;
ARRAY new1 {15} item1-item15;
DO i = 1 TO 15;
new1[i]=.;
IF old1[i]=. THEN old1[i] = 0;
END;
*Create 2 sets of 3 items, 7,8,9 are the first set for everyone;
new1[7]=old1[4]; new1[8]=old1[7]; new1[9]=old1[10];
CR1 = 0; CR2 = 0;
CR1 = mcorr_a04 + mcorr_a07 + mcorr_a10;
IF CR1 = 0 THEN DO;
new1[1]=old1[1]; new1[2]=old1[5]; new1[3]=old1[2];
CR2 = new1[1] + new1[2] + new1[3];
END;
IF CR1 = 1 THEN DO;
new1[4]=old1[3]; new1[5]=old1[6]; new1[6]=old1[12];
CR2 = new1[4] + new1[5] + new1[6];
END;
IF CR1 = 2 THEN DO;
new1[10]=old1[8]; new1[11]=old1[11]; new1[12]=old1[9];
CR2 = new1[10] + new1[11] + new1[12];
END;
IF CR1 = 3 THEN DO;
new1[13]=old1[13]; new1[14]=old1[14]; new1[15]=old1[15];
CR2 = new1[13] + new1[14] + new1[15];
END;
person = id;
KEEP item1-item15 person member;
RUN; QUIT;
PROC TRANSPOSE DATA=bata OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA bata; SET longForm1;
new = SUBSTR(i, 5, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*1;
IF member = 2 THEN tempid = person*10;
IF member = 3 THEN tempid = person*100;
IF member = 4 THEN tempid = person*1000;
IF member = 5 THEN tempid = person*10000;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=bata METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
beta1 = -4.93; beta2 = -4.01; beta3 = -3.88;
beta4 = -2.51; beta5 = -1.00; beta6 = 0.55;
beta7 = -3.03; beta8 = 0.44; beta9 = 1.60;
beta10 = 0.60; beta11 = 0.94; beta12 = 1.77;
beta13 = 2.20; beta14 = 2.56; beta15 = 5.34;
ARRAY beta[15] beta1-beta15;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86;
PREDICT p OUT=predProbbata;
PREDICT gscore OUT=personParmbata;
RUN;
PROC SORT DATA = personparmbata; BY tempid;
DATA itembata; SET personparmbata;
BY tempid;
id = person;
predbata = 9.1024*pred+500;
attempt = member;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id predbata member;
RUN;
PROC FREQ DATA = itembata; TABLE predbata; RUN; QUIT;
**BATB Scores;
** Items in order of difficulty;
** 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15;
** 2, 1, 3, 5, 4, 6, 10, 7, 8, 9, 12, 13, 14, 15, 11;
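/* Form B mirrors the form A block-adaptive design, with original items 5, 10,
   and 12 serving as the routing block. */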
DATA batb; SET temp.calcscore_201106;
ARRAY old1 {15} mcorr_b01-mcorr_b15;
ARRAY new1 {15} item1-item15;
DO i = 1 TO 15;
new1[i]=.;
IF old1[i]=. THEN old1[i] = 0;
END;
*Create 2 sets of 3 items, 7,8,9 are the first set for everyone;
new1[7]=old1[5]; new1[8]=old1[10]; new1[9]=old1[12];
CR1 = 0; CR2 = 0;
CR1 = mcorr_b05 + mcorr_b10 + mcorr_b12;
IF CR1 = 0 THEN DO;
new1[1]=old1[2]; new1[2]=old1[1]; new1[3]=old1[3];
CR2 = new1[1] + new1[2] + new1[3];
END;
IF CR1 = 1 THEN DO;
new1[4]=old1[4]; new1[5]=old1[6]; new1[6]=old1[7];
CR2 = new1[4] + new1[5] + new1[6];
END;
IF CR1 = 2 THEN DO;
new1[10]=old1[8]; new1[11]=old1[9]; new1[12]=old1[13];
CR2 = new1[10] + new1[11] + new1[12];
END;
IF CR1 = 3 THEN DO;
new1[13]=old1[14]; new1[14]=old1[15]; new1[15]=old1[11];
CR2 = new1[13] + new1[14] + new1[15];
END;
person = id;
KEEP item1-item15 person member;
RUN; QUIT;
PROC FREQ DATA = batb; TABLE CR1 * CR2; RUN;
PROC TRANSPOSE DATA=batb OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA batb; SET longForm1;
new = SUBSTR(i, 5, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*1;
IF member = 2 THEN tempid = person*10;
IF member = 3 THEN tempid = person*100;
IF member = 4 THEN tempid = person*1000;
IF member = 5 THEN tempid = person*10000;
KEEP person item resp member tempid;
RUN;
PROC NLMIXED DATA=batb METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
beta1 = -3.89; beta2 = -3.67; beta3 = -3.48;
beta4 = -1.77; beta5 = -0.50; beta6 = 0.24;
beta7 = -2.24; beta8 = -0.18; beta9 = 1.51;
beta10 = 0.59; beta11 = 0.92; beta12 = 1.88;
beta13 = 2.12; beta14 = 3.36; beta15 = 8.46;
ARRAY beta[15] beta1-beta15;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86;
PREDICT p OUT=predProbbatb;
PREDICT gscore OUT=personParmbatb;
RUN;
PROC SORT DATA = personparmbatb; BY tempid;
DATA itembatb; SET personparmbatb;
BY tempid;
id = person;
predbatb = 9.1024*pred+500;
attempt = member;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id predbatb member;
RUN;
PROC FREQ DATA = itembatb; TABLE predbatb; RUN; QUIT;
*******2********;
** Half Adaptive - No Timing;
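/* Half-adaptive scoring: the 30 items from both forms are reordered from easiest
   to hardest (ha1-ha30); administration starts at the median-difficulty item
   (position 15) and branches up or down this ordering after each response, so
   each respondent is scored only on the items along one branch of the tree. */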
DATA halfadapt; SET temp.calcscore_201106;
ARRAY old1 {30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY new1 {30} ha1-ha30;
ARRAY adpt {30} item1-item30;
DO i = 1 TO 30; adpt[i]=.; IF old1[i]=. THEN old1[i]=0; END;
new1[1]= old1[1]; new1[2]=old1[5]; new1[3]=old1[17];
new1[4]= old1[2]; new1[5]=old1[16]; new1[6]=old1[18];
new1[7]= old1[4]; new1[8]=old1[3]; new1[9]=old1[20];
new1[10]=old1[19]; new1[11]=old1[6]; new1[12]=old1[21];
new1[13]=old1[25]; new1[14]=old1[22]; new1[15]=old1[7];
new1[16]=old1[12]; new1[17]=old1[23]; new1[18]=old1[8];
new1[19]=old1[24]; new1[20]=old1[11]; new1[21]=old1[27];
new1[22]=old1[10]; new1[23]=old1[9]; new1[24]=old1[28];
new1[25]=old1[29]; new1[26]=old1[13]; new1[27]=old1[14];
new1[28]=old1[30]; new1[29]=old1[15]; new1[30]=old1[26];
adpt[15]=new1[15];
IF adpt[15]=1 THEN DO; adpt[23]=new1[23];
IF adpt[23]=1 THEN DO; adpt[27]=new1[27];
IF adpt[27]=1 THEN DO; adpt[29]=new1[29];
IF adpt[29]=1 THEN DO; adpt[30]=new1[30];
adpt[28]=new1[28]; END;
IF adpt[29]=0 THEN DO; adpt[28]=new1[28];
IF adpt[28]=1 THEN adpt[30]=new1[30];
IF adpt[28]=0 THEN adpt[26]=new1[26];
END;
END;
IF adpt[27]=0 THEN DO; adpt[25]=new1[25];
IF adpt[25]=1 THEN DO; adpt[26]=new1[26];
IF adpt[26]=1 THEN adpt[28]=new1[28];
IF adpt[26]=0 THEN adpt[24]=new1[24];
END;
IF adpt[25]=0 THEN DO; adpt[24]=new1[24];
IF adpt[24]=1 THEN adpt[26]=new1[26];
IF adpt[24]=0 THEN adpt[22]=new1[22];
END;
END;
END;
IF adpt[23]=0 THEN DO; adpt[19]=new1[19];
IF adpt[19]=1 THEN DO; adpt[21]=new1[21];
IF adpt[21]=1 THEN DO; adpt[22]=new1[22];
IF adpt[22]=1 THEN adpt[24]=new1[24];
IF adpt[22]=0 THEN adpt[20]=new1[20];
END;
IF adpt[21]=0 THEN DO; adpt[20]=new1[20];
IF adpt[20]=1 THEN adpt[22]=new1[22];
IF adpt[20]=0 THEN adpt[18]=new1[18];
END;
END;
IF adpt[19]=0 THEN DO; adpt[17]=new1[17];
IF adpt[17]=1 THEN DO; adpt[18]=new1[18];
IF adpt[18]=1 THEN adpt[20]=new1[20];
IF adpt[18]=0 THEN adpt[16]=new1[16];
END;
IF adpt[17]=0 THEN DO; adpt[16]=new1[16];
IF adpt[16]=1 THEN adpt[18]=new1[18];
IF adpt[16]=0 THEN adpt[14]=new1[14];
END;
END;
END;
END;
IF adpt[15]=0 THEN DO; adpt[7]=new1[7];
IF adpt[7]=1 THEN DO; adpt[11]=new1[11];
IF adpt[11]=1 THEN DO; adpt[13]=new1[13];
IF adpt[13]=1 THEN adpt[14]=new1[14];
IF adpt[14]=1 THEN adpt[16]=new1[16];
IF adpt[14]=0 THEN adpt[12]=new1[12];
IF adpt[13]=0 THEN adpt[12]=new1[12];
IF adpt[12]=1 THEN adpt[14]=new1[14];
IF adpt[12]=0 THEN adpt[10]=new1[10];
END;
IF adpt[11]=0 THEN DO; adpt[9]=new1[9];
IF adpt[9]=1 THEN adpt[10]=new1[10];
IF adpt[10]=1 THEN adpt[12]=new1[12];
IF adpt[10]=0 THEN adpt[8]=new1[8];
IF adpt[9]=0 THEN adpt[8]=new1[8];
IF adpt[8]=1 THEN adpt[10]=new1[10];
IF adpt[8]=0 THEN adpt[6]=new1[6];
END;
END;
IF adpt[7]=0 THEN DO; adpt[3]=new1[3];
IF adpt[3]=1 THEN DO; adpt[5]=new1[5];
IF adpt[5]=1 THEN DO; adpt[6]=new1[6];
IF adpt[6]=1 THEN adpt[8]=new1[8];
IF adpt[6]=0 THEN adpt[4]=new1[4];
END;
IF adpt[5]=0 THEN DO; adpt[4]=new1[4];
IF adpt[4]=1 THEN adpt[6]=new1[6];
IF adpt[4]=0 THEN adpt[2]=new1[2];
END;
END;
IF adpt[3]=0 THEN DO; adpt[1]=new1[1]; adpt[2]=new1[2];
adpt[4]=new1[4]; END;
END;
END;
person=id;
keep person member item1-item30;
RUN;
PROC TRANSPOSE DATA=halfadapt OUT=longForm1 NAME=i PREFIX=score;
BY person member;
RUN;
DATA halfadapt; SET longForm1;
new = SUBSTR(i, 5, 2);
item = input(new,8.0);
resp = score1;
IF member = . THEN member = 1;
IF member = 1 THEN tempid = person*1;
IF member = 2 THEN tempid = person*10;
IF member = 3 THEN tempid = person*100;
IF member = 4 THEN tempid = person*1000;
IF member = 5 THEN tempid = person*10000;
KEEP person item resp member tempid;
RUN;
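/* Rasch scoring of the half-adaptive responses: the 30 fixed difficulties below
   are the form A and form B values used earlier in this appendix, sorted in
   ascending order to match the easiest-to-hardest item positions. */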
PROC NLMIXED DATA=halfadapt METHOD=GAUSS TECHNIQUE=NEWRAP NOAD QPOINTS=20;
beta1 = -4.93; beta2 = -4.01; beta3 = -3.89;
beta4 = -3.88; beta5 = -3.67; beta6 = -3.48;
beta7 = -3.03; beta8 = -2.51; beta9 = -2.24;
beta10 = -1.77; beta11 = -1.00; beta12 = -0.50;
beta13 = -0.18; beta14 = 0.24; beta15 = 0.44;
beta16 = 0.55; beta17 = 0.59; beta18 = 0.60;
beta19 = 0.92; beta20 = 0.94; beta21 = 1.51;
beta22 = 1.60; beta23 = 1.77; beta24 = 1.88;
beta25 = 2.12; beta26 = 2.20; beta27 = 2.56;
beta28 = 3.36; beta29 = 5.34; beta30 = 8.46;
ARRAY beta[30] beta1-beta30;
e1 = EXP(gscore - beta[item]);
p=e1/(1+e1);
MODEL resp ~ BINARY(p);
RANDOM gscore ~ NORMAL([m_g], [v_g]) SUBJECT = tempid;
PARMS m_g=2.99 v_g=2.86;
PREDICT p OUT=predProbhalf;
PREDICT gscore OUT=personParmhalf;
RUN;
PROC SORT DATA = personparmhalf; BY tempid;
DATA itemhalf; SET personparmhalf;
BY tempid;
id = person;
predhalf = 9.1024*pred+500;
attempt = member;
IF first.tempid = 1 THEN OUTPUT;
IF first.tempid = 0 THEN DELETE;
KEEP id predhalf member;
RUN;
** Half Adaptive - With Timing;
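/* Half-adaptive scoring with timing: responses are reordered as above and each
   log response time is standardized with the fixed means and standard deviations
   below, which appear to be the full-sample log response time statistics. */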
DATA halftime; SET temp.calcscore_201106;
ARRAY old1 {30} mcorr_a01-mcorr_a15 mcorr_b01-mcorr_b15;
ARRAY new1 {30} ha1-ha30;
ARRAY old2 {30} ns_at1-ns_at15 ns_bt1-ns_bt15;
ARRAY new2 {30} rt1-rt30;
ARRAY meant {30} mean1-mean30;
ARRAY stdt {30} stdt1-stdt30;
ARRAY adpt {30} item1-item30;
DO i = 1 TO 30; adpt[i]=.; IF old1[i]=. THEN old1[i]=0; END;
new1[1]= old1[1]; new1[2]=old1[5]; new1[3]=old1[17];
new1[4]= old1[2]; new1[5]=old1[16]; new1[6]=old1[18];
new1[7]= old1[4]; new1[8]=old1[3]; new1[9]=old1[20];
new1[10]=old1[19]; new1[11]=old1[6]; new1[12]=old1[21];
new1[13]=old1[25]; new1[14]=old1[22]; new1[15]=old1[7];
new1[16]=old1[12]; new1[17]=old1[23]; new1[18]=old1[8];
new1[19]=old1[24]; new1[20]=old1[11]; new1[21]=old1[27];
new1[22]=old1[10]; new1[23]=old1[9]; new1[24]=old1[28];
new1[25]=old1[29]; new1[26]=old1[13]; new1[27]=old1[14];
new1[28]=old1[30]; new1[29]=old1[15]; new1[30]=old1[26];
new2[1]= old2[1]; new2[2]=old2[5]; new2[3]=old2[17];
new2[4]= old2[2]; new2[5]=old2[16]; new2[6]=old2[18];
new2[7]= old2[4]; new2[8]=old2[3]; new2[9]=old2[20];
new2[10]=old2[19]; new2[11]=old2[6]; new2[12]=old2[21];
new2[13]=old2[25]; new2[14]=old2[22]; new2[15]=old2[7];
new2[16]=old2[12]; new2[17]=old2[23]; new2[18]=old2[8];
new2[19]=old2[24]; new2[20]=old2[11]; new2[21]=old2[27];
new2[22]=old2[10]; new2[23]=old2[9]; new2[24]=old2[28];
new2[25]=old2[29]; new2[26]=old2[13]; new2[27]=old2[14];
new2[28]=old2[30]; new2[29]=old2[15]; new2[30]=old2[26];
mean1= 2.24; mean2= 1.98; mean3= 2.06; mean4= 2.06; mean5= 2.24; mean6= 2.18;
mean7= 2.35; mean8= 2.47; mean9= 2.48; mean10= 2.49; mean11= 3.05; mean12= 2.97;
mean13= 3.04; mean14= 3.08; mean15= 3.47; mean16= 2.94; mean17= 2.91; mean18= 3.47;
mean19= 3.25; mean20= 3.04; mean21= 3.60; mean22= 3.16; mean23= 3.59; mean24= 3.82;
mean25= 3.35; mean26= 3.11; mean27= 3.66; mean28= 3.70; mean29= 4.13; mean30= 3.34;
stdt1= 0.50; stdt2= 0.54; stdt3= 0.50; stdt4= 0.50; stdt5= 0.51; stdt6= 0.52;
stdt7= 0.55; stdt8= 0.53; stdt9= 0.54; stdt10= 0.54; stdt11= 0.65; stdt12= 0.70;
stdt13= 0.63; stdt14= 0.59; stdt15= 0.71; stdt16= 0.66; stdt17= 0.66; stdt18= 0.71;
stdt19= 0.67; stdt20= 0.65; stdt21= 0.65; stdt22= 0.71; stdt23= 0.73; stdt24= 0.77;
stdt25= 0.81; stdt26= 0.82; stdt27= 0.80; stdt28= 0.82; stdt29= 0.74; stdt30= 0.70;
DO i = 1 TO 30; new2[i]=(log(new2[i])-meant[i])/stdt[i]; END;
person=id;
KEEP person member item1-item30 rt1-rt30;
RUN;
PROC MEANS DATA = halftime; VAR rt1-rt30; RUN;
PROC SORT DATA = itembata; BY id member; RUN;
PROC SORT DATA = itembatb; BY id member; RUN;
PROC SORT DATA = itemhalf; BY id member; RUN;
PROC SORT DATA = temp.calcscore_201106; BY id member; RUN;
Abstract
The stimulus for the current body of work comes from researchers' desire to have concise and accurate cognitive tasks implemented in their surveys. The main purpose of this work was to introduce collateral information as a way of making up for information lost or forgone when adaptive frameworks are adopted. The nature of ongoing surveys and the ubiquity of computers provide ample collateral, or nonintrusive, information that can help improve score accuracy. Information such as how long a respondent spends on particular items, along with age, education, and other characteristics, can improve score prediction beyond the item responses alone.

This work develops methods to effectively decrease the number of items given to participants while keeping accuracy high despite the loss of information. In the current study, the Woodcock-Johnson III (WJ-III) Number Series (NS) task was presented with 30 previously unpublished items as stimuli. First, two scoring models were implemented to test model fit and compare the implications of the fit values. Methods outlined below then systematically adjusted patterns of missingness to mimic reduced and adaptive item subsets. Once the smaller NS item sets were delineated, several methods of adding predictive accuracy were tested and compared.

In scoring respondents, a traditional Item Response Theory (IRT) model as proposed by Rasch (1960) was first used to provide evidence for a unidimensional scale and to obtain baseline statistics for item difficulties and person abilities. The next model was a Conditionally Independent Response Time (CIRT) model, which includes a response model as well as a joint response time model for scoring. With the full item set, these two models provide identical ability estimates and item parameters. The response time portion of the CIRT framework provides ability scores and speededness scores based on response time patterns.

Next, focus was placed on effectively decreasing the number of items used in scoring each respondent. Methods included item reduction, test forms in which the same item sets were used to score each respondent, and adaptive tests, in which each respondent could receive a different item set. Reduced item sets fared better when item difficulties more closely matched sample ability levels (r = 0.72-0.90). Adaptive item sets (half-adaptive, block-adaptive, fully adaptive) were more consistent in measuring ability, with accuracy best for the fully adaptive method (r = 0.79-0.91).

The last steps of the analysis introduced response time and demographic variables as additional predictors of the 30-item scores. Item responses, response times, and response-by-response-time interactions provided small improvements in explained variance when used as predictors (1-8%). When CIRT ability and speededness scores were used as predictors, speededness provided limited improvements (<1%) to prediction. The addition of age, education, and gender to response models improved explained variance to a moderate degree (1-5%).

In conclusion, we note that the sample had a higher than average ability level for the NS task, and this should be kept in mind when interpreting the findings for the methods outlined. The item sets that matched respondent abilities less well benefited more from response time and demographic data. If the ability range of a sample can be identified correctly before administration, a more focused reduced item set would be advantageous. Adaptive item sets seem advantageous in a more general testing situation where ability levels are more variable. The advantage of using collateral information in predicting cognitive scores is the time saved by omitting items, potentially lowering costs and allowing researchers to move on to more tasks if desired. While the improvement due to response time was limited with NS, these methods provide a good foundation for other cognitive tasks administered in computer-assisted designs.
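For reference, the scoring programs in the appendix implement the Rasch response model together with a fixed linear rescaling of the ability estimates; a minimal sketch of both is given below, assuming that the constant 9.1024 and offset 500 used in the code correspond to a WJ-style W metric:

\[
P(X_{pi} = 1 \mid \theta_p, \beta_i) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)},
\qquad
\hat{W}_p = 9.1024\,\hat{\theta}_p + 500,
\]

where \(\theta_p\) is the ability of person p, \(\beta_i\) the difficulty of item i, and \(\hat{\theta}_p\) the empirical Bayes ability prediction returned by PROC NLMIXED.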
Conceptually similar
Latent change score analysis of the impacts of memory training in the elderly from a randomized clinical trial
Evaluating social-cognitive measures of motivation in a longitudinal study of people completing New Year's resolutions to exercise
Improving reliability in noncognitive measures with response latencies
Applying adaptive methods and classical scale reduction techniques to data from the big five inventory
The limits of unidimensional computerized adaptive tests for polytomous item measures
Psychophysiological assessment of cognitive and affective responses for prediction of performance in arousal inducing virtual environments
Later life success of former college student-athletes as a function of retirement from sport and participant characteristics
Identifying diverse pathways to cognitive decline in later life using genetic and environmental factors
Effects of AT-1 receptor blockers on cognitive decline and Alzheimer's disease
Cartographic approaches to the visual exploration of violent crime patterns in space and time: a user performance based comparison of methods
Sources of stability and change in the trajectory of openness to experience across the lifespan
Cognitive functioning following ovarian removal before or after natural menopause
Biometric models of psychopathic traits in adolescence: a comparison of item-level and sum-score approaches
Self-reported and physiological responses to hate speech and criticisms of systemic social inequality: an investigation of response patterns and their mediation…
Essays on the empirics of risk and time preferences in Indonesia
Selectivity for visual speech in posterior temporal cortex
Modeling nitrate contamination of groundwater in Mountain Home, Idaho using the DRASTIC method
Structure and function of the locus coeruleus across the lifespan
Essays in panel data analysis
A data-driven approach to compressed video quality assessment using just noticeable difference