APPLYING ADAPTIVE METHODS AND CLASSICAL SCALE REDUCTION TECHNIQUES TO DATA FROM
THE BIG FIVE INVENTORY
by
Kevin Terrance Petway, II
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF ARTS
(PSYCHOLOGY)
August 2010
Copyright 2010 Kevin Terrance Petway, II
Dedication
To my grandmother, mother, and sister
Acknowledgments
I extend much gratitude towards my advisor, Jack McArdle, for providing me with ample
advice, amusing anecdotes, and worthwhile bits of knowledge to guide me through academia
these past few years. Many thanks to Elizabeth Zelinski as well, for having confidence in my
abilities and allowing me access to her dataset. The humorous conversations never hurt either.
Additional thanks to several friends of mine, without whom completion of this thesis
would have been hindered by procrastination. Firstly, to Erin Shelton for engaging in active
discussions with me re: the properties and complications of various data analytic techniques,
and for being an all-around motivational powerhouse. Secondly, to Sameer ud Dowla Khan, for
humoring me during long nights as I wrote or thought or struggled, and for providing me with
useful suggestions or comments whenever I asked for them. Thirdly, I acknowledge John Prindle
for unwittingly allowing me to use his Master’s thesis as a structural template for my own.
Finally, I am thankful to Leslie Owen for assisting in the creation of a completion schedule and to
Ricardo Reyes for providing the motivation to stick to said schedule.
Table of Contents
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Methods
Chapter 3: Short Form Technique
Chapter 4: IRT-based Adaptive Technique
Chapter 5: CART-based Adaptive Technique
Chapter 6: A Comparison of Alternative Techniques
Chapter 7: BFI Factor Structure
Chapter 8: General Discussion
Bibliography
Appendices:
Appendix A: Sample Statistics from CogUSA
Appendix B: BFI-44
Appendix C: Short Form Item Utilization
Appendix D: IRT Graded Response Model Designs/Paths
Appendix E: Three-Item CART Trees
Appendix F: Example Four-Item and Five-Item CART Tree Descriptions
List of Tables
Table 1: Model fit for each of the reduced scales and the full model
Table 2: DIFFTEST comparison of restrained LVP Models to unrestrained LVP Models
List of Figures
Figure 1: Cronbach's alphas for correlational and factor-analytic techniques
Figure 2: Accuracy of short form techniques relative to observed scores
Figure 3: Accuracy of short form techniques relative to factor scores
Figure 4: Difficulty Histograms
Figure 5: Accuracy of IRT technique relative to observed scores
Figure 6: Accuracy of IRT technique relative to factor scores
Figure 7: Accuracy of each CART condition
Figure 8: Technique performance relative to the observed scores
Figure 9: Technique performance relative to the factor scores
Abstract
Multiple short form approaches, the graded response IRT model (Samejima, 1969), and
CART were used to reduce the Big Five Inventory’s five personality scales (extraversion,
agreeableness, conscientiousness, neuroticism, and openness to experience). Data were taken
from the 2008 CogUSA sample of older adults (Fisher et al., TBP, N = 1427). Two short form
techniques were adopted: (1) an extraction of the highest correlates based on correlational
analysis, and (2) an extraction of the highest loaders based on factor analysis. The graded
response model utilized item difficulties and discrimination to adapt the measure to
participants. Following McArdle (2009), Classification and Regression Trees (CART), which
estimates scores based on regression logic that allows for differential allocation of participants,
was used here for adaptive testing as well. Each method’s predicted scores were correlated
with both observed scores and factor scores. Evidence suggested the two score types did not
line up perfectly due to the measure's factor structure. The correlational approach to short forms
best approximated the observed score while the CART technique that predicted factor scores
(used as the target in the CART analysis) most consistently estimated factor scores. Issues with
the factor structure, problems with the agreeableness scale, time savings, and identification of
the “true” score are discussed.
Chapter 1: Introduction
The Big Five Inventory (BFI) & Scale Reduction
The so-called “big-five” is a moniker for the repeated result of five common factors (or,
more accurately, principal components) of lexically defined dimensions of personality. The
model seemingly originated in the 1950s as part of U.S. Air Force personnel assessments done
by Ernest Tupes and Raymond Cristal (1961). They noted several recurrent personality factors,
but the specifics of their research were muddled within the only report documenting their
findings. Digman (1990) pushed his five factor model of personality (a supposed extension of
Tupes’ and Cristal’s work), which Goldberg later suggested represented the highest level of
personality organization (Goldberg, 1993). These five factors were considered by many to be
representative of most known personality traits, and their introduction provided much needed
order and stability to an area that was riddled with a confusing amalgam of smaller personality
concepts.
While Goldberg and others were demonstrating the value of these five overarching
factors, John, Donahue, and Kentle (1991) created a measure called the Big Five Inventory (BFI)
to address the need for a shorter personality instrument that could measure the big five
personality factors of extraversion, agreeableness, conscientiousness, neuroticism and openness
to experience. Measures then and now are quite lengthy (e.g. Costa and McCrae’s 240-item
NEO PI-R, 1992). John et al. recognized that not all information in the items added to
description accuracy, and based item creation on a set of definitions developed from expert
ratings – the hope was to remove the need to measure individual facets and instead capture
those properties in the overarching big five factor. However, each scale (personality variable)
contains items that relate to the majority of the facets identified by Costa and McCrae, thereby
roughly covering the full range of each variable (John & Srivastava, 1999).
John’s 44-item scale is not a long scale in comparison to more costly and extensive
batteries (the NEO PI-R has 240 items and requires about 60 minutes from participants).
However, even this 44-item inventory could benefit from a size reduction to make it a feasible
measure for use in more time-sensitive contexts (e.g. the telephone). A measure that is too long
(e.g. the NEO PI-R) may be seen by some as too daunting to even start. Respondents may also
experience fatigue or lose interest as they continue filling out a measure. All of these
threaten clarity of the results as they can lead to unwanted forms of selection bias (Robins,
Hendin & Trzesniewski, 2001; Saucier, 1994; Gosling, Rentfrow & Swann, 2003; Knottnerus,
Knipschild & Sturmans, 1989).
The two most often used means of reduction are short forms and adaptive tests. Short
forms are created by selection of only some of the test items, with the same set of items
administered to everyone (Robins, Hendin & Trzesniewski, 2001; Saucier, 1994; Cacioppo, Petty
& Kao, 1984). In contrast, adaptive tests are designed to adjust to the response characteristics
of an individual based on some underlying criteria (Weiss & Kingsbury, 1984).
Short Forms
The creation of short forms is probably the simplest way to reduce item load. The most
common approach is to remove items from a scale that have low item-total correlations
(correlational approach), since in theory these items are less indicative of the true score.
Internal consistency reliability underlies this methodology. One hopes that removing these poor
correlates does not affect the internal consistency of the measure beyond an acceptable (and
often arbitrary) amount.
One of the most commonly used indicators of internal consistency is Cronbach’s alpha
(α), a reliability index which evaluates the intercorrelations among items in a measure. More
specifically, it relies on the average intercorrelation when estimating how reliable a scale is – a
measure with items that are more highly intercorrelated should better measure the construct of
interest. Unfortunately, the average intercorrelation is subject to the same threats as a mean or
correlation in that skewness and outliers can have a detrimental effect. The presence of spurious
(or coincidental) correlations between items interferes with the interpretability of alpha, and
simply increasing the number of items tends to improve alpha.
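As a concrete illustration, here is a minimal sketch of the standard alpha computation in Python (the item matrix is hypothetical; the thesis itself used SAS PROC CORR for these estimates):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_persons x k_items) score matrix.

    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point Likert responses: 6 persons x 4 items
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
])
print(round(cronbach_alpha(responses), 3))
```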
Factor analysis is another way to assess internal consistency, and has become a popular
alternative (or additional step) to reducing measures (factor analytic approach). Factor analysis
is a data reduction method where items are reduced into fewer dimensions (latent variables)
based on shared (or common) variance. An item’s shared variation is split from its unique
variation (that portion that is not shared with other items). The shared variation components of
each item are then combined to create a factor score for each observation. A factor loading (the
strength of each item’s relationship with the latent variable) is thus the correlation between an
adjusted item score based solely on shared variation and the newly created factor score.
Structural validation is another way to think of the factor analytic means of internal consistency.
Short form creation using factor analysis is done by removing the items with the
smallest loadings. Thus, the strongest indicators of this original construct comprise the new,
reduced latent variables.
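A minimal sketch of this selection logic, using scikit-learn's FactorAnalysis on simulated single-factor data (the tool, data, and loadings are illustrative assumptions, not the thesis's Mplus models):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 1))                     # simulated common factor
true_loadings = np.array([0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
items = latent @ true_loadings[None, :] + rng.normal(0, 0.6, size=(300, 8))

# Fit a one-factor model and keep the strongest loaders for the short form
fa = FactorAnalysis(n_components=1).fit(items)
loadings = np.abs(fa.components_[0])
retained = sorted(np.argsort(loadings)[-3:].tolist())  # three-item short form
print("retained item indices:", retained)              # expect the high-loading items
```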
Item Response Theory and the Graded Response Model
The classical test theory (CTT) model (Novick, 1966; Allen & Yen, 2002), from which
ideas like internal consistency and item-total correlations are based, proposes that the observed
score on a scale is a function of the true score and some error. Since the true score is an
unobserved component of the model, CTT is primarily concerned with the ability of a test to
reliably estimate a person’s true score. This true score lacks generalizability across instruments,
a consequence of the score being directly tied to the instrument from which it originates (Zickar,
1998). Additionally, the model unsatisfactorily distinguishes between ability levels. For
instance, a personality test may adequately differentiate those high in neuroticism from those
with average or below average neuroticism, but fail to differentiate those with average
neuroticism from those with below average neuroticism. Item response theory (IRT;
Hambleton, Swaminathan & Rogers, 1991) and its associated models act as a powerful response
to classical test theory, and the theory’s focus on the item allows it to work around many of the
scale-level issues experienced with CTT.
The item response function (IRF), IRT’s foundation, directly relates an item’s
endorsement probability to theta (θ), the latent trait (Zickar, 1998; Fraley et al., 2000). Two
characteristics underlie the function (for a two-parameter logistic item response model, 2PLM):
an item’s difficulty and an item’s discrimination. Item difficulty (b
j
) is the position on the latent
trait an individual would need for a 0.50 probability of endorsing an item in the appropriate
direction. A higher difficulty indicates an item that is endorsed by fewer individuals (those
higher on theta). Lower difficulties are more readily endorsed by individuals, particularly those
with moderate to high thetas. Item discrimination (a
j
) is an item’s ability to distinguish between
people who are close to one another on the trait continuum. The two parameters are
conceptually combined into the index, information, which provides a clear description of how an
item measures the trait on the continuum of theta. Item information curves (IIC) are graphical
representations of item information, where the peak of the curve represents the item's difficulty
and the steepness of the curve’s slope represents its discrimination (Simms & Clark, 2005; Fraley
et al., 2000; Bock, 1972). This allows one to evaluate the effectiveness of each item across the
entire range of theta, and the theory allows for items to be differentially effective – for instance,
in a hypothetical IIC, item 2 appears more informative at lower levels of theta, while item 6
seems more useful at higher levels of theta where it best discriminates.
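A minimal sketch of the 2PLM response function and its information curve (these are the standard IRT formulas; the item parameters below are illustrative, not estimates from the BFI data):

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PLM item response function: P(endorse | theta)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))  # 1.7 approximates the normal ogive

def item_information(theta, a, b):
    """Fisher information for a 2PLM item: I(theta) = (1.7 * a)^2 * P * (1 - P)."""
    p = irf_2pl(theta, a, b)
    return (1.7 * a) ** 2 * p * (1 - p)

theta = np.linspace(-4, 4, 81)
# Illustrative items echoing the text: "item 2" easier (b = -1.5), "item 6" harder (b = 1.0)
for a, b, label in [(1.2, -1.5, "item 2"), (1.5, 1.0, "item 6")]:
    info = item_information(theta, a, b)
    print(label, "information peaks near theta =", round(theta[np.argmax(info)], 1))
```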
Researchers have shied away from using IRT models for data that is not based on a
dichotomous scale or a correct-incorrect multiple choice scale. Most of this can be attributed to
the difficulties associated with evaluating scales that have a less discernable set of “rules” for
analysis. Fortunately, a few models have been developed to deal with polytomous item scales
like Likert-type scales. Samejima’s (1969) graded response model is used fairly often now as a
way to analyze social personality measures in particular, though companies like the Educational
Testing Service have expressed interest in them for computerized adaptive testing purposes.
The graded response model (GRM) follows similar logic to the two-parameter logistic
model, and in fact, Samejima (1969) showed that this model can be generalized to a measure
with polytomous items once any item has been conceptualized as a series of ordered
dichotomous responses. A categorical model for a scale produces a series of thresholds for each
of the scale’s items. These thresholds represent a set of response dichotomies where one
category or a series of categories is evaluated against the remaining categories. For instance,
when there are 4 response options, three thresholds (m – 1, where m is the number of response
options) are produced: (a) response option 1 versus 2, 3, and 4; (b) options 1 and 2 versus
options 3 and 4; and (c) options 1, 2, and 3 versus option 4. This leads to the creation of three
item characteristic curves (ICCs), each of which indicates a response option’s difficulty. Again,
difficulty is the point on the latent trait where there is a 50% chance of endorsing the higher
response option (Samejima, 1997; Fraley et al., 2000). The probability functions calculated after
the ICCs have been formed are known as category response curves. These can be thought of as
information curves for the response options within an item and detail the probability of
selecting any one category at a specific theta. Each item also has its own general information
curve. Additionally, an item produces multiple difficulties but only has one discrimination
parameter.
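A minimal sketch of this logic for a five-category item, assuming Samejima's cumulative formulation: each of the m − 1 = 4 thresholds gets a 2PLM-style boundary curve sharing one discrimination, and category probabilities are differences between adjacent curves (parameters are illustrative):

```python
import numpy as np

def grm_category_probs(theta, a, bs):
    """Category response probabilities under Samejima's graded response model.

    bs holds the m - 1 ordered threshold difficulties for an m-category item;
    a is the single discrimination shared by all of the item's thresholds.
    """
    # P*(k): probability of responding in category k or higher (boundary curves)
    p_star = [1.0] + [1.0 / (1.0 + np.exp(-1.7 * a * (theta - b))) for b in bs] + [0.0]
    # Category probability = difference between adjacent boundary curves
    return np.array([p_star[k] - p_star[k + 1] for k in range(len(bs) + 1)])

# Illustrative 5-point Likert item: one discrimination, four ordered difficulties
probs = grm_category_probs(theta=0.5, a=1.3, bs=[-2.0, -0.8, 0.3, 1.5])
print(probs.round(3), round(probs.sum(), 6))  # probabilities over 5 categories sum to 1
```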
The IRT framework easily allows for the creation of computerized adaptive testing,
where an algorithm can be created to adapt a particular measure to each individual based on
item response patterns. The amount of information that can be obtained from an item
response theory model allows for rather elaborate algorithms, though they can be as simple or
complicated as one likes. The tradeoff for simplicity is often reduced accuracy, though the
specification gained from even limited adaptability may be an improvement over a short form
measure.
Classification and Regression Trees
A truly explorative way to analyze one’s data comes in the form of Classification and
Regression Trees (CART). An increasingly popular alternative to common-use multiple regression
methods (e.g. stepwise), it relies extensively on computer assistance to make more precise
classifications of the sample of interest. After the dependent variable (DV) and independent
variables (IVs) have been defined, a CART program will begin by isolating the best predictor of
the DV. This IV provides the first two-way split in the data according to a predefined splitting
rule. The search is continued for each branch until stopping criteria have been met or until a
final split is no longer possible (both lead to the creation of a “terminal node”). There are a host
of programs that can perform this type of analysis now, such as XLMiner, rpart (and Rattle),
CART Pro, and SAS Enterprise Miner.
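The thesis ran its trees in CART Pro; as an analogous sketch, scikit-learn's DecisionTreeRegressor (hypothetical data, assumed settings) shows the same splitting logic, with greedy two-way splits and a minimum node size as the stopping rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
items = rng.integers(1, 6, size=(500, 8))           # hypothetical 1-5 Likert responses (IVs)
total = items.sum(axis=1) + rng.normal(0, 1, 500)   # DV: noisy full-scale score

# Two-way splits are chosen greedily on the best predictor at each node;
# min_samples_leaf acts as the terminal-node stopping criterion.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=25, random_state=0)
tree.fit(items, total)
print(export_text(tree, feature_names=[f"item{j + 1}" for j in range(8)]))
```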
The benefit of the CART approach is that its precision allows for the creation of adaptive
tests. Each participant can be effectively traced through a set of specific splits, with each set
essentially representing a multiple regression equation. Since CART can produce a plethora of
different split sets depending on complexity restraints, one could obtain as much specification
as needed to place people as accurately as possible on the observed score continuum. Usual
multiple regression methods are limited to a single equation that must represent everyone.
Since CART incorporates many of these “equations”, prediction of a participant’s performance is
not as restricted.
Past Research
Short form techniques have been used for quite some time, while adaptive techniques
have gained steam over the past two decades. Short forms are a natural progression for many
social and personality measures since researchers often notice how unnecessary a host of items
may be, or how impractical long measurement is. For instance, the short version of the NEO-PI-R
(the NEO-FFI) contains only 60 items instead of the original's 240 items, greatly reducing
participant load while presumably capturing the same overarching constructs (Costa & McCrae,
1992).
Recently, Rammstedt and John (2007) designed a 10-item short version of the BFI and
evaluated it in both an English and German sample. They selected two items per big five factor.
In an attempt to cover as wide a range as possible, they selected one positively-worded item
and one negatively-worded item. To improve selection, items that exhibited higher item-total
correlations with the full BFI scales and items that had more clear factor associations (less cross-
loading in factor analysis of all 44 items) were prioritized. Each of the final two-item scales
retained notable levels of reliability (retest) and validity (convergent and structural), but
correlations with the full scale scores varied depending on the personality variable.
Agreeableness experienced the most loss (r² = 0.54) in their samples, to which the researchers
suggested an additional item be added if Agreeableness is of interest. Ultimately, they conclude
that this shortened version is acceptable in the context of time strain, but if time is not limited,
the full version should be used.
Graded response models (GRM) are commonly used to refine measures with
polytomous items based on item response properties. For instance, Fraley, Waller & Brennan
(2000) applied GRM to a pool of 323 items related to adult attachment. GRM allowed them to
modify scales to better measure those at the low end of each attachment dimension with the
level of precision maintained for those in the middle or high ends. After realizing that few items
adequately assessed the low end of the dimensions (difficulties were not low enough), they
relied on discrimination values alone. Simms and Clark (2005) would support this approach
since items with higher discrimination values were overwhelmingly selected early in their IRT
analyses.
There have not been a large number of studies detailing the application of the graded
response model for adaptive testing, but a few studies have noted the conditions under which it
is most effective. Dodd, Koch & Ayala (1989) used simulated data to determine in what ways
manipulation of certain aspects affected the adaptive procedures of the model. They
investigated the size of the item pool, the stopping rule, and the stepsize (along the latent trait’s
continuum). They concluded that a pool of as few as 30 items may be sufficient for
computerized adaptive testing (CAT). In addition, a “stepsize” approach that is variable
performs better than a fixed stepsize approach, and using a stopping rule based on minimizing
the standard error produces estimates that more highly correlate with the full-scale estimates.
This is similar to the fully adaptive test strategy (FAT), used by McArdle (2009) in an attempt to
adapt the Woodcock-Johnson III Number Series test. In this fully adaptive technique,
participants were initially given the item centrally located along the difficulty spectrum. A
correct response moved the participant to a more difficult item halfway up the spectrum while
an incorrect response motivated movement to an easier item halfway down the spectrum. The
technique deviated from this pattern once a participant had given at least one correct and
one incorrect response. Subsequent item selection depended on maximum likelihood
estimation, where the selected item was the one as close to the estimated ability level as
possible. FAT was applied to a test with items that could only be correct or incorrect, but similar
algorithms could be created to fit the graded response model.
Reise and Henson (2000) applied GRM to NEO-PI-R data obtained from an
undergraduate sample (n = 1059). They tried to maintain the facet (subscale) structure of the
measure, so reduction occurred at the facet level, not the big five factor level. That is, all six
facets per big five factor were measured by varying numbers of items per facet, based on results
of the adaptive process. The computerized adaptive testing (CAT) process began after the
appropriate GRM parameters were estimated. With all participants starting off at a latent trait
level of θ = 0, item information was computed for each item. Whichever item had the most
information was the first item administered, and this item was given to everyone. Participant
response to this item modified θ appropriately. With a new θ, item information was computed
again (based on a prior normal distribution of θ), and the most informative item (which was no
longer the same item for everyone) was administered. This process was repeated until all eight
items per facet were used. With only three or four items per facet, true score correlations were
large (r > 0.9). Reise & Henson concluded that while the CAT algorithm performed well, it might
not be necessary because administering the four items (per facet) that provided the most
information produced similar results. Related IRT-based approaches have been taken with
depression scales (e.g. Fliege, Becker, Walter, Bjorner, Klapp & Rose, 2005) and some non-IRT
alternative techniques have been used with other scales (e.g. the MMPI-2 – Forbey & Ben-Porath, 2007; Handel, Ben-Porath & Watt, 1999).
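A minimal sketch of the maximum-information selection loop common to these CAT designs (illustrative dichotomous item parameters; the trait update below is a crude fixed step, a stand-in for the model-based θ estimates Reise and Henson actually used):

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information for a 2PLM item at a given theta."""
    p = 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * p * (1 - p)

def next_item(theta_hat, params, administered):
    """Pick the unadministered item with maximum information at the current theta."""
    best, best_info = None, -1.0
    for j, (a, b) in enumerate(params):
        if j not in administered and info_2pl(theta_hat, a, b) > best_info:
            best, best_info = j, info_2pl(theta_hat, a, b)
    return best

# Illustrative item pool of (a, b) pairs; everyone starts at theta = 0
params = [(1.4, -1.0), (1.1, 0.0), (1.6, 0.8), (0.9, -0.3), (1.3, 1.5)]
theta_hat, administered = 0.0, set()
for step in range(3):
    j = next_item(theta_hat, params, administered)
    administered.add(j)
    endorsed = True                          # stand-in for the participant's response
    theta_hat += 0.5 if endorsed else -0.5   # crude update; real CATs use ML/EAP estimates
    print(f"step {step + 1}: administered item {j}, theta_hat = {theta_hat:.2f}")
```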
The primary focus of the McArdle (2009) analysis was to utilize Classification and
Regression Trees (CART) to formulate an adaptive test, an approach that has scarcely been used.
Since CART programs do the item relegation for the researcher, they provide an easy way to
visualize complex relationships between items and individuals. McArdle (2009) found that
scores estimated using CART far more accurately trended with the observed score than either
the number series short form or the IRT-based adaptive tests. More specifically, scores created
from a CART assessment where participants were given up to seven items achieved a positive
correlation of r = 0.96 with the full scale (i = 47). The FAT strategy produced total score
correlations between r = 0.77 (i = 3) and r = 0.85 (i = 7) while full scale score correlations with the
fixed short forms ranged from r = 0.50 (i = 3) to r = 0.85 (i = 7). CART clearly outperformed the
other techniques at i = 7 without actually requiring seven items for every participant.
Current Research
Each of the above techniques offers the promise of a quicker, but nearly as accurate,
assessment tool, something with many applications in psychological research. Generally, the
adaptive methods are considered an improvement over the short form method, largely because
they allow the test to conform to the individual to varying degrees instead of relying on sample-level prediction. This increases the level of precision. The tradeoff is complexity – a short form
is much easier to implement because it does not necessarily require the use of a computer or a
complex in-person/over-the-phone item adaptive strategy.
Any reduction will motivate a loss of some kind of information. The nature of that
information of course varies depending on the measure. Additionally, one must weigh the pros
and cons when deciding which reduced format is the most appropriate. For some measures, the
benefit of adaptive formats is lost because the gain over short forms in terms of predicting a
person’s true score is small relative to how involved the adaptive process can be. Also, a scale
that is not unidimensional may perform more poorly under adaptive methods. This is especially
true for IRT, where the models depend on a scale’s unidimensionality. Small departures from
this assumption are considered reasonable (Zickar, 1998); however, personality measures tend
to display a greater amount of multidimensionality than would be considered reasonable
because principal components and exploratory factor analysis often identify meaningful factors
(those factors exhibiting eigenvalues greater than 1) with notable item cross-loading. A
satisfactory departure from the assumption of unidimensionality occurs when the primary
factor has an eigenvalue that is much larger than any meaningful secondary factors, or when
there is minimal cross-loading (such that a factor could be interpreted separately from the others).
The short form does not necessarily avoid issues of multidimensionality either. If a
measure unknowingly taps into more than one dimension, the short form methods will likely
isolate a particular dimension, with potentially different results depending on how the short
form is created. Under certain circumstances, known multidimensionality can be dealt with
using alternative forms of the graded response model. For instance, Gibbons et al. (2007)
extended an earlier bifactor model for binary item responses to polytomous items (coined the
bifactor model for graded response data). Applying the model to one’s data allows for the
simultaneous estimation of parameters for an underlying (or central) factor and a subdomain.
This helps account for the dependency between the underlying factor and its subdomains. One
fundamental problem with the model, noticed but unresolved by the researchers,
was the estimation of factor loadings for the subdomains. Estimating loadings for the central
factor and the subdomains at the same time means the subdomains’ loadings describe
associations between the item residuals. To rectify this they obtained loadings for each factor
by running unidimensional IRT models on each subscale, but could not say how this necessity
influenced the overall bifactor model results. Unfortunately, the BFI and other big five
personality measures do not subscribe to the idea of one central factor, so this model cannot be
used with them in its presented form.
Additional issues relate to the findings of Dodd, Koch & Ayala (1989). Each scale of the
BFI would not satisfy the item requirement they identified for adapting each to the graded
response model. CART is not an IRT method, but does follow similar logic in that it traces each
individual along a series of paths based on some predetermined information. With such a small
item base for each scale CART may encounter the same complications as IRT and perform
relatively poorly when predicting the true score.
This research attempts to find the most effective scale reduction method for the Big Five
Inventory. Given John and Srivastava’s (1999) description of the scale creation process, there is
some evidence already that each of the big five scales has some level of multidimensionality.
Since the true score may not be accurately represented by one’s observed score, a predicted
score from each of the methods will be related to both the observed score and the factor score
for each scale. The assumption is that this factor score is a better estimate of the true score
because it incorporates only the shared components of each item, whereas the observed
summed score includes each item’s noise (unrelated) component. The correlational approach
should best approximate a person’s observed score, followed by CART, the graded response
method, and finally the factor analytic approach. The factor analytic approach should
outperform all other methods when correlating predicted scores with factor scores. It is
expected that scores from the adaptive methods will relate more to the factor scores than
predicted scores from the correlational approach. For every technique, models with only a
small number of items – three, four, and five items – will be evaluated.
Chapter 2: Methods
Sample Details
Participants came from a collaborative study of older adults called the Cognition and
Aging in the USA (CogUSA) study, 2008 sample. The first part of the study involved a telephone
interview (N = 1718). These participants were subsequently asked to participate in a face-to-face interview, and n = 1434 complied. The Big Five Inventory (BFI) was administered
during this latter stage as an addition to a large set of Woodcock-Johnson scales. The final
sample size for later evaluation was n = 1427 – this was the number of participants who
responded to at least one portion of the BFI.
CogUSA is considered a representative sample of older adults living in the 48 contiguous
United States (Fisher, Rodgers, Kadlec & McArdle, TBP). Mean age for this sample was 64.3
years, with a mean education of 14.1 years. Females comprised 56% of the sample. There were
no notable demographic differences between the telephone sample and this final sample.
Big Five Inventory
The Big Five Inventory (BFI) is a 44-item personality instrument designed to measure the
big five personality factors: extraversion, agreeableness, conscientiousness, neuroticism, and
openness to experience. The items are not equally spread across these factors. Extraversion
and neuroticism are indicated by eight items each, agreeableness and conscientiousness are
indicated by nine items each, and openness to experience is measured with ten items. Each
item was measured on a five-point Likert-type scale from 1 (disagree strongly) to 5 (agree
strongly), with category 3 representing “neither agree nor disagree”. There is no central factor
of personality in the five-factor literature, so a personality variable is treated as an individual
scale with its own summed score. The measure incorporates a varying number of positively
worded and negatively worded items for each personality variable – the latter items were
rescaled accordingly.
Appendix A further details sample demographics as well as the summed scores for each
of the five personality variables. Appendix B contains the 44-item BFI.
Chapter 3: Short Form Technique
Identification of BFI Short Forms
Short forms are the most common means of scale reduction so it is appropriate to start
the evaluation process with them. The technique is arguably the most basic – the original set of
items is reduced using some criterion to form a newly reduced scale with a fixed set of items.
The means of reduction varies by researcher but two very common approaches involve item-
total correlations and factor loadings. The first approach attempts to maintain the highest
possible Cronbach’s alpha by retaining the items that demonstrate the highest item-total
correlation with the scale score. The second approach removes items that correlate most
poorly with the construct of interest (i.e. those items that exhibit the smallest factor loadings).
Both of these approaches will be tested here.
Cronbach’s coefficient alpha was estimated for the full versions of each personality
scale. A traditional means of applying a Cronbach-dependent approach involves removing items
when they exhibit low item-total correlations, with the lowest correlate being removed first.
Alpha would then be recalculated and the process would repeat. Here, items were removed all
at once to form the three-, four-, and five-item reduced scales. That is, instead of removing one
item at a time and recalculating alpha, the six lowest correlates (for example) were removed at
the same time. Coefficient alphas were estimated for the reduced scales, and summed scores
were created which were correlated with the respective actual (or full-scale) score. Pattern-
evaluating factor analytic models were then carried out for each set of reduced scales (e.g. one
model incorporated all five three-item scales).
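A minimal sketch of this all-at-once variant (hypothetical data; item-total correlations computed against the summed scale score, with every below-threshold item dropped in one pass rather than iteratively):

```python
import numpy as np

def reduce_by_item_total(items: np.ndarray, keep: int) -> list:
    """Keep the `keep` items with the highest item-total correlations.

    All low correlates are dropped at once, with no recalculation of alpha
    or of the correlations between removals.
    """
    totals = items.sum(axis=1)
    r = np.array([
        np.corrcoef(items[:, j], totals)[0, 1]  # item vs summed scale score
        for j in range(items.shape[1])
    ])
    return sorted(np.argsort(r)[-keep:].tolist())

rng = np.random.default_rng(1)
scale = rng.integers(1, 6, size=(200, 8))   # hypothetical 8-item Likert scale
print(reduce_by_item_total(scale, keep=3))  # indices of the 3 retained items
```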
The structural equation modeling software Mplus (Muthén & Muthén, 1998-2007) was
used to evaluate factor models. For the preceding model and all subsequent models, indicators
were treated as categorical instead of continuous. Categorical models typically require larger
sample sizes to account for the loss of information and the estimation of more parameters
(thresholds). However, responses to the items in this measure for this particular sample are
somewhat skewed. Participants tended towards the upper two categories more than the lower
three for “positive” factors (agreeableness, extraversion, conscientiousness, and openness to
experience) while concurrently endorsing the lower group of categories for the “negative”
factor (neuroticism) far more frequently. Since the underlying distribution of the items is not
normal, indicators (items) were treated as categorical and the chosen estimator was the
weighted least squares means and variance (WLSMV) adjustment. WLSMV has demonstrated
greater accuracy in the estimation of standard errors and test statistics when there are
violations of multivariate normality in the data – maximum likelihood often inaccurately
estimates these when violations exist (Flora & Curran, 2004; Yu, 2002).
The full 44-item model of personality underwent evaluation in a pattern-evaluating
factor analytic model, where each latent (personality) variable was indicated by its respective
items. New models with latent variables indicated by three, four and five items were created by
systematically removing the poorest loaders until only the intended number of items per latent
variable remained. Model fit was determined for the reduced models. Cronbach’s alpha was
estimated for each of these reduced scales.
Each of the factor models had an equivalent latent variable path (LVP) model which
appraised each latent variable’s relationship with three demographic variables – age, sex, and
education. Construct validity is strengthened if the relationship between the reduced scale
latent variables and the demographics is close to the relationship between the full scale latent
variables and the demographics. Before analysis, each demographic variable was rescaled. The
mean was subtracted from each participant’s age for centering, and this newly centered age
variable was divided by 10 to indicate decades. Age relationships are thus relative to one
decade, not to a single year. Sex was centered about zero, with males representing the lower
(negative) end and females representing the higher (positive) end. Twelve years were subtracted
from each person’s level of education to center it about high school education. This was
subsequently divided by four to indicate educational periods. A value of 1, for instance,
represented four years of college.
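A minimal sketch of the described rescalings (hypothetical raw values; the specific numeric codes for sex are an assumption, since the text only says it was centered about zero):

```python
def rescale_demographics(age, sex, educ, mean_age):
    """Rescale as described in the text: age in decades from the sample mean,
    sex centered about zero (males negative, females positive), and education
    in four-year periods beyond high school (12 years)."""
    age_dec = (age - mean_age) / 10.0
    sex_c = -0.5 if sex == "M" else 0.5  # assumed coding; any zero-centered pair works
    educ_p = (educ - 12.0) / 4.0         # 16 years -> 1.0 (four years of college)
    return age_dec, sex_c, educ_p

print(rescale_demographics(age=74.3, sex="F", educ=16, mean_age=64.3))
# -> (1.0, 0.5, 1.0): one decade above the mean age, female, four years of college
```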
One large assumption here is that every item relates to the demographic variables in the
same way. Thus, a factor made up of fewer items should retain the beta coefficient properties
of the full model. The Mplus DIFFTEST option was used to evaluate χ² differences here since the
model estimator (WLSMV) did not allow a direct comparison of the chi-square statistic between
two models (Muthén & Muthén, 1998-2007). The beta coefficients that were freely estimated
from the full LVP model were affixed to the regression paths in the reduced model. This model
was the more restrained model. Regression parameter estimates from the unrestrained,
reduced LVP model were compared to the aforementioned restrained model’s fixed parameters
and a chi-square difference (with degrees of freedom) was produced. A poorer fitting model,
one that indicates greater difference between the full model’s betas and the unrestricted
reduced model’s betas, exhibited a higher chi-square value.
Model fit was determined using the following indices: the comparative fit index (CFI),
the root mean square error of approximation (RMSEA), and the weighted root mean square
residual (WRMR). Each index evaluates model performance in a different way. The CFI assesses
incremental fit by comparing the hypothesized model to the more restricted baseline (null) model. The RMSEA
is based on the discrepancy due to approximation function, which is not dependent on sample
data. Finally, the WRMR measures the weighted average difference between sample and
estimated population variances and covariances (Yu, 2002). Changes to the data (e.g. a
reduction in the number of items incorporated into the model) should motivate changes in the
CFI and WRMR more than the RMSEA because the first two are dependent on statistics that are
susceptible to change (e.g. the chi-square statistic or the variance) while the latter is much less
dependent on such statistics. Yu has demonstrated that with sample sizes greater than 250,
these indices perform reasonably well in terms of type-I and type-II error rates when assessed at
the following cutoff points: CFI ≥ 0.95, RMSEA ≤ 0.05, and WRMR ≤ 1.00. Additionally, they are
reasonable indices to use with categorical data.
Short form scores were calculated for each participant by summing responses to the
items that made up each reduced scale. The scores were correlated with both the observed
total score and the factor score to identify the strength of the relationship. Squaring these
correlations provides information regarding how much variance is shared between the scores –
the greater the squared correlation, the more variance the two scores share. This is more useful
than the standard correlation coefficient in gauging the effectiveness of the reduced scale
scores because squared correlations are better judges of the true amount of covariation that
exists between two variables. That is, the correlation coefficient tends to overstate the actual
amount of commonality, and this effect increases as the coefficient approaches zero. Squared
correlations will be used in place of the correlation coefficient when comparing the
effectiveness of any one scale reduction technique with another.
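A minimal numeric illustration of the point (hypothetical short form and full-scale scores):

```python
import numpy as np

short_form = np.array([9, 12, 7, 14, 10, 13, 8, 11])     # hypothetical summed short-form scores
full_scale = np.array([27, 36, 25, 40, 28, 37, 26, 31])  # hypothetical full-scale scores

r = np.corrcoef(short_form, full_scale)[0, 1]
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")  # r^2 is the share of variance actually in common
```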
Evaluation of BFI Short Forms
The full model yielded an RMSEA (ε_α) of 0.101, a WRMR of 2.787, and a CFI of 0.737,
suggesting poor fit to the data. The reduced versions of the model based on the correlational
approach better fit the data according to the WRMR and CFI, but were still poor fits. The RMSEA
(ε_α) estimate stayed relatively equivalent across the three short form lengths in the correlational
approach, but both the CFI and WRMR improved as the number of items decreased. None of
these shifts in fit indices were notably large. The factor analytic approach created models that
best fit the data, with the three-item model outdoing the four- and five-item models. However,
none of these models satisfied the fit criteria. Fit indices for all seven models are in Table 1.
Table 1: Model fit for each of the reduced scales and the full model

  Approach          Number of items per scale   CFI     RMSEA (ε_α)   WRMR
  Maximum (full)    all 44 items                0.737   0.101         2.787
  Correlational     3 items                     0.856   0.104         2.108
  Correlational     4 items                     0.816   0.103         2.360
  Correlational     5 items                     0.805   0.100         2.466
  Factor Analytic   3 items                     0.937   0.067         1.317
  Factor Analytic   4 items                     0.911   0.068         1.535
  Factor Analytic   5 items                     0.872   0.077         1.871
Cronbach coefficient alpha (α), as the other indicator of scale internal consistency, was
calculated for each scale using SAS’ correlation procedure PROC CORR (with requests for alpha).
Though it was expected that the correlational approach would consistently produce higher
coefficient alphas, the results suggest this is not the case. This can be attributed to the item
omission algorithm used for the correlational approach, which did not expressly seek to maximize
the average inter-item correlation. Estimates for extraversion were the only ones that were
larger under the correlational approach for all three reduced scales. Agreeableness showcased
the greatest disparity across the two approaches, which was most noticeable with three items:
under the factor analytic approach, a Cronbach’s alpha of 0.602 was estimated, in comparison to
the 0.497 estimate under the correlational approach. In general, estimates were close between
both approaches, with the correlational approach performing more accurately with five items
and the factor analytic approach outperforming when only three items were used. Refer to
Figure 1 for a comparison of each scale’s coefficient alpha estimate for each condition of i.
Figure 1: Cronbach’s alphas for correlational and factor-analytic techniques
Cronbach’s alpha estimated using SAS procedure PROC CORR, with requests for alpha. Figure 1(a) is for three items, (b) is for four items, and (c) is for five items. [Three bar charts of Cronbach’s Alpha (α), 0.4 to 1.0, by Personality Variable, comparing the All Items, Correlational, and Factor Analytic conditions.]
To ensure that none of the relationships with age, sex, or education were lost in the
scale reduction process, each latent variable was regressed on said demographics. Results of
the DIFFTEST suggested that for both approaches and all item lengths, the beta coefficients
were statistically significantly different overall. The correlational models produced greater χ² values than the factor analytic models (Table 2), suggesting more deviation within the correlational models.
Table 2: DIFFTEST comparison of restrained LVP Models to unrestrained LVP Models

  Approach          Number of items per scale   Difference χ²   Difference df
  Correlational     3 items                     59.6            12
  Correlational     4 items                     53.6            11
  Correlational     5 items                     56.5            11
  Factor Analytic   3 items                     43.3            9
  Factor Analytic   4 items                     42.6            9
  Factor Analytic   5 items                     44.4            9
There were no dramatic shifts in coefficients across approaches or within item
conditions. A few weak relationships (statistically significant but very small) wandered in and
out of statistical significance depending on the personality variable, but strong relationships
remained.
Summed score correlations with the observed total score revealed higher squared
correlations for scales created under the correlational approach than under the factor analytic
approach, and this was true for all three scale lengths. The difference in amount of shared
variance was largest for agreeableness and smallest for openness to experience.
Squared correlations with the estimated factor scores showcased a markedly different
pattern of relationships. For extraversion and conscientiousness, predicted scores from the factor
analytic approach more closely resembled the factor scores than the equivalent scores from the
correlational approach. Neuroticism and agreeableness both exhibited shifts where the factor
analytic scores were better with five items, then worsened or equalized with fewer items,
relative to the correlational scores. Finally, openness to experience factor scores shared more
variation with the correlational scores than the factor analytic scores over all three test lengths,
but the level of divergence was smaller here than when using the observed total score. Figures
2 and 3 compare squared correlations relative to the observed total scores and the factor scores,
respectively.
It is useful to note the effectiveness of a single item when evaluating reduced scales of
varying sizes. Each personality variable’s highest correlate (that is, the item with the highest
item-total correlation), after squaring, exhibited considerably smaller squared correlations when
compared to the three-item condition. The single-item squared correlations are as follows: r² = 0.389 (extraversion), r² = 0.242 (agreeableness), r² = 0.267 (conscientiousness), r² = 0.356 (neuroticism), and r² = 0.377 (openness to experience).
Figure 2: Accuracy of short form techniques relative to observed scores
Figure 2(a) is for three items, (b) is for four items, and (c) is for five items. [Three bar charts of Squared Correlation (r²), 0.40 to 1.00, by Personality Variable, comparing the Correlational and Factor Analytic techniques.]
Figure 3: Accuracy of short form techniques relative to factor scores
Figure 3(a) is for three items, (b) is for four items, and (c) is for five items. [Three bar charts of Squared Correlation (r²), 0.40 to 1.00, by Personality Variable, comparing the Correlational and Factor Analytic techniques.]
There was some item overlap amongst the scales when evaluated across approaches, so
despite overall differences in fit and demographic relationships, certain items were consistent
indicators regardless of approach. Appendix C lists the items for each of the full factors and
displays which items were retained for the short forms, separated by approach. The best item
agreement existed for the openness to experience factor, where the scales ultimately became
one and the same with only three items. All other factors exhibited only one overlapping item at the
three-item level. Observing the primary adjectives associated with each of the short forms
provides an indication of the substantive focus (and bias) of each approach within each
personality variable. For instance, in extraversion under the correlational approach, the items
related to talking behavior are stressed (talkative, quiet, outgoing, assertive, and shy). Contrast
this with the items under the factor analytic approach, which focus more on the energy aspect
(outgoing, energetic, enthusiastic, assertive, and quiet). The primary implication of such a
disparity is that the measure itself may not pick up the personality variables with satisfactory
precision. Chapter 7 more extensively details the problems associated with assuming the
prescribed item-factor structure for this sample.
Chapter 4: IRT-based Adaptive Technique
The Graded Response Model and Item Parameter Estimation
In Mplus, the graded response model (GRM) for polytomous items can be
approximated by running a categorical factor model with maximum likelihood estimation. A
logistic distribution is assumed. Item loadings and thresholds are freely estimable, but the
common factor’s mean is fixed at zero and its variance fixed at one. GRM incorporates the same
two parameters as other IRT models, item difficulty and item discrimination.
Each threshold has its own difficulty which can be calculated using equation (1):
(1) b_ij = β_ij / λ_i

In (1), β_ij indicates an item’s unstandardized threshold; the difficulty is a function of the
estimated threshold and the item’s estimated unstandardized factor loading (λ_i). An item’s
discrimination can be calculated in (2):

(2) a_i = λ_i / 1.7
The unstandardized estimate of the item loading is divided by the conversion factor (1.7 is the
multiplier for the optimal difference in the three parameter logistic model) to produce an
estimate of the discrimination.
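A minimal sketch of equations (1) and (2) applied to hypothetical Mplus-style estimates for a single five-category item:

```python
def grm_parameters(thresholds, loading):
    """Convert an item's unstandardized thresholds and loading to GRM parameters.

    b_ij = beta_ij / lambda_i   (equation 1: one difficulty per threshold)
    a_i  = lambda_i / 1.7       (equation 2: one discrimination per item)
    """
    difficulties = [beta / loading for beta in thresholds]
    discrimination = loading / 1.7
    return difficulties, discrimination

# Hypothetical estimates for one 5-category item (four thresholds, one loading)
bs, a = grm_parameters(thresholds=[-3.1, -1.4, 0.4, 2.2], loading=1.7)
print([round(b, 2) for b in bs], round(a, 2))
```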
Item Difficulties
Responses to the majority of BFI items were skewed towards the negative for all
personality factors except neuroticism. Neuroticism items tended to skew towards the positive.
The spread of difficulties for each personality factor is consistent with this observation. Figure
4 presents difficulty histograms for each of the factors.
Figure 4: Difficulty Histograms
[Histograms of item difficulty estimates along the theta continuum for each of the five personality factors.]
The extraversion histogram revealed unsatisfactory measurement of high extraversion;
however, it exhibits a good spread of items that capture a wide range of theta below the high
end. This is not the case for agreeableness, which has difficulties mostly between thetas of -4.0
and 0.0 – that is, all but four (out of 36) difficulties lie outside of this range . There are no items
that measure above average agreeableness well, which is understandable since most
participants seemed to score highly on this scale. The range below theta 0.0 is wide, but not
uniform like that observed in extraversion, which means certain groups of individuals that fall on
the lower end of the agreeableness spectrum may be poorly assessed relative to other groups.
The majority of conscientiousness item difficulties lie between thetas of -3.5 and 0.5, which is
similar to agreeableness’ pattern of difficulties, though the full range is not as wide. This
produces a difficulty spread that is a little more uniform. Like agreeableness, there are no items
that measure above average or higher conscientiousness well. Neuroticism, as the negative
factor, exhibits similar behavior to conscientiousness, but in the opposing direction. Fewer
people endorsed the higher response options, producing a difficulty histogram where most
difficulties span a theta range of -1.0 to 3.0. The distribution appears more normal
than the previous two factors, suggesting that these items are better at dealing with just below
average to above average neuroticism. The weakest spot is below average to low neuroticism.
Difficulties for openness to experience are distributed more uniformly, mostly between thetas
of -3.0 and 1.0. A recurring theme, these items do not measure above average to high openness
to experience well.
Note that determinations about the distributions of these difficulties depend largely on
the nature of the x-axis. Here, each increment of one from zero on the trait continuum was
considered meaningful, so judgments were made under the assumption that this was the truest
way to perceive each factor’s difficulty histogram. Different bin sizes (increments) may produce
different results. Regardless of the bin, the spread of difficulties and what that spread indicates
is observable, albeit incompletely, from the numerical range of difficulties for each factor.
Establishing the Adaptive Scale
For the four negatively skewed factors, item difficulties associated with the fourth
threshold were used to establish an item pattern by which to form the adaptive scales.
Difficulties were arranged from highest to lowest for the positive factors and lowest to highest
for the negative factor (neuroticism). The fourth threshold represented the highest difficulty
associated with an item, so using the order based on these difficulties should provide the best
approximation of a person’s score. Neuroticism, because of its directionality, required focus on
its first threshold to better estimate those at the lower end.
Identifying the starting item can be complicated. It is perhaps simplest to select the
item centrally located along the difficulty spectrum. An alternate strategy is to use the highest
discriminator (Simms & Clark, 2005). Starting with the highest discriminator may prove effective
here because it identifies the item that best distinguishes its five categories. Since the higher
(or, for neuroticism, lower) categories require added sensitivity, initially separating participants
on the item that most clearly distinguishes along theta is a good place to begin the adaptive
scale process.
From the starting item, participants were sent to either the top or bottom of the
difficulty spectrum. Subsequent items were closer to the central item, with a single step down
or up for those who consistently answered at the high end or low end of possible response
options. After the first split, a notable change in response to the next item led to a two-step
increase or decrease along the difficulty spectrum, reliant upon whether a participant started
high or low at the second split. The number of steps down the spectrum experienced in the
later splits was dependent on agreement between the nature of a participant’s response to the
new item and their response to the discriminator item. If they agreed (e.g. a response of 4 or 5
to the discriminator and to this new item) then the participant moved a single step. If they
disagreed, the participant generally moved two steps in the opposing direction. The
effectiveness of this strategy was limited to the number of items available for a given factor.
The fewer the items, the more difficult it was to adhere strictly to the strategy, since item
repetition was more likely. Some adjustments were necessary to a) account for the varying
position of the discriminator variable, and b) account for the restrictive nature of fewer items.
Additionally, in order to maintain the systematic item decision strategy, the maximum number
of items (i) that could be used for any observation was k_items/2, where k_items represented the total
number of items available. Only openness to experience could utilize five items per person. All
others were restricted to four maximum. Appendix D identifies the possible item paths for each
of the personality variables across conditions of i.
For the four positively oriented factors, the splitting rules were consistent at each split
and were as follows: a response of 1, 2, or 3 led to a “down” split, and a response of 4 or 5 led to
an “up” split. Down and up in this context cannot be interpreted literally since every item after
the initial split was closer to the center. Here they refer to the direction of the participant’s
score relative to the difficulty spectrum. Roughly, the average item score goes down or remains
constant if one is sent down and the average item score goes up or remains constant if one is
sent up. Splitting rules were slightly different for neuroticism, since the lower end of the
spectrum was of more concern. A 1 or 2 for a neuroticism item warranted a “down” split and a
3, 4, or 5 led to an “up” split.
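A simplified sketch of this traversal for a positively oriented factor (illustrative item order and a single-step walk; the thesis's full scheme, with its two-step moves keyed to the discriminator item, is enumerated in Appendix D):

```python
def up_or_down(response, negative_factor=False):
    """Split rule from the text: positive factors split down on 1-3 and up on
    4-5; neuroticism (the negative factor) splits down on 1-2 and up on 3-5."""
    cutoff = 3 if negative_factor else 4
    return "up" if response >= cutoff else "down"

def trace_participant(responses, items_easy_to_hard, start, max_items):
    """Walk one participant along a difficulty-ordered item list, moving one
    step toward harder items on an "up" split and one step toward easier
    items on a "down" split (a simplification of the thesis's path rules)."""
    index, administered = start, []
    for resp in responses[:max_items]:
        administered.append(items_easy_to_hard[index])
        index += 1 if up_or_down(resp) == "up" else -1
        index = max(0, min(len(items_easy_to_hard) - 1, index))  # stay in bounds
    return administered

# Hypothetical 6-item factor ordered easiest to hardest; start near the middle
items = ["i3", "i8", "i1", "i5", "i2", "i7"]
print(trace_participant([4, 5, 2], items, start=2, max_items=3))  # ['i1', 'i5', 'i2']
```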
Squared Correlations
Predicted scores from the IRT technique consistently aligned with the observed scores
more than the factor scores. The discrepancy was surprisingly large – there was roughly a 0.091
difference between the squared correlations for the observed scores and the factor scores. This
implied 9.1% less commonality, on average, with the factor scores. Conscientiousness exhibited
the smallest average discrepancy across scale lengths, while agreeableness displayed the largest
average discrepancy. Figures 5 and 6 provide a comparison of the squared correlations
obtained using the IRT technique relative to observed scores and factor scores, respectively.
Figure 5: Accuracy of IRT technique relative to observed scores
Figure 5(a) is for three items, (b) is for four items, and (c) is for five items. Only one personality variable (openness to experience) had enough items (i = 10) to employ the IRT method for five items given our algorithm. [Three bar charts of Squared Correlation (r²), 0.40 to 1.00, by Personality Variable, comparing the Correlational, Factor Analytic, and IRT techniques.]
37
Figure 6: Accuracy of IRT technique relative to factor scores
Figure 6(a) is for three items, (b) is for four items, and (c) is for five items. Only one
personality variable (openness to experience) had enough items (i = 10) to employ the IRT
method for five items given our algorithm.
[Panels (a)-(c) plot the squared correlation (r²) for each personality variable under the Correlational, Factor Analytic, and IRT techniques; the r² axis runs from 0.40 to 1.00.]
Chapter 5: CART-based Adaptive Technique
CART Procedure
Though CART Pro users can limit the allowable relative error (the ratio of the model's variance to the original sample's variance) and control the appearance of the output tree (e.g., restrict what is shown to four levels), direct manipulation of the number of items that make up each tree branch is beyond the user's control. This is because the CART algorithm attempts to optimize the tree for a particular relative error. Consequently, arriving at a tree with branches of equal length (e.g., entirely three-item paths) is essentially a matter of chance. By default, CART Pro produces a tree of maximum possible size, one that includes as many splits as possible given user restrictions such as the minimum end node N. These trees can be pruned back along their relative error curves to an acceptable relative error or a simpler tree (the full tree is typically very complex and difficult to use for practical purposes).
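For readers without access to CART Pro, the same grow-then-prune workflow can be sketched with an open-source regression tree. The snippet below is a minimal illustration using scikit-learn's cost-complexity pruning as an analogue to pruning along the relative error curve; the data are simulated stand-ins, not the CogUSA items.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated stand-ins: 8 item responses (1-5) and a criterion score.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(500, 8))
y = X.sum(axis=1) + rng.normal(0, 1, 500)

# Grow the maximal tree, then walk back along the pruning path,
# analogous to pruning along CART Pro's relative error curve.
full = DecisionTreeRegressor(min_samples_leaf=20, random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(min_samples_leaf=20, ccp_alpha=alpha,
                                 random_state=0).fit(X, y)
    rel_error = 1 - tree.score(X, y)  # 1 - R^2, akin to relative error
    print(f"alpha={alpha:.3f}  leaves={tree.get_n_leaves()}  "
          f"rel_error={rel_error:.3f}")
```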
The CART technique applied here expands upon the CART-based adaptive test
construction approach employed by McArdle (2009). The current analyses are concerned only
with three-, four-, and five-item scales, so the tree for each scale will be pruned back to trees
that satisfy those criteria, regardless of relative error. Each tree will be selected based on the
majority branch length – for example, the simplest tree which has a majority of three-item paths
will be retained for the three-item condition. This strategy will be replicated for the four- and
five-item conditions.
Both the observed total score and the factor scores can (and will) be predicted using
CART. This allows for a comparison of the produced trees: if different items are selected, or if the splitting patterns differ, that is added evidence that the factor scores and observed scores are not referencing the same construct. Four CART conditions are established here: (1) CART criterion as observed score, with model-predicted scores correlated with observed scores (CART Observed / Data Observed); (2) CART criterion as observed score, with model-predicted scores correlated with factor scores (CART Observed / Data Factor); (3) CART criterion as factor score, with model-predicted scores correlated with observed scores (CART Factor / Data Observed); and (4) CART criterion as factor score, with model-predicted scores correlated with factor scores (CART Factor / Data Factor). CART programs can internally calculate predicted scores for each participant; these scores are then correlated with either the observed score or the factor score, as described above.
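A minimal sketch of the four conditions follows, using simulated stand-ins for the item responses, observed scores, and factor scores (the variable names and data are illustrative only, not the thesis analysis).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def r2(a, b):
    """Squared Pearson correlation between two score vectors."""
    return np.corrcoef(a, b)[0, 1] ** 2

# Toy stand-ins: item responses plus observed and factor scores.
rng = np.random.default_rng(1)
items = rng.integers(1, 6, size=(500, 8))
observed = items.sum(axis=1)
factor = observed + rng.normal(0, 2, 500)  # imperfectly aligned, as in the text

# One tree per CART criterion (max_depth caps the items used per path).
tree_obs = DecisionTreeRegressor(max_depth=3).fit(items, observed)
tree_fac = DecisionTreeRegressor(max_depth=3).fit(items, factor)

for name, pred, data in [
    ("CART Observed / Data Observed", tree_obs.predict(items), observed),
    ("CART Observed / Data Factor",   tree_obs.predict(items), factor),
    ("CART Factor / Data Observed",   tree_fac.predict(items), observed),
    ("CART Factor / Data Factor",     tree_fac.predict(items), factor),
]:
    print(f"{name}: r^2 = {r2(pred, data):.3f}")
```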
CART and Missing Data
The internal CART missing data procedures (e.g., substituting a single value or using the item mean) are considered unsatisfactory, due largely to their tendency to bias estimates. CART Pro defaults to a missing data procedure based on surrogacy: using a score on a different item in place of the missing value (i.e., a surrogate). This algorithm can be rather error-prone (Williams, 2009; Loh, 2008), so the use of more powerful external techniques for missing values has been suggested when dealing with CART Pro and other data mining programs. To handle missing values in this dataset, the SAS multiple imputation procedure (PROC MI) was used. Its Markov chain Monte Carlo (MCMC) method assumes that the data follow a multivariate normal distribution, though the method displays considerable robustness to departures from this assumption if the amount of missing information is not large (Schafer, 1997). The method also assumes that data are missing at random (MAR). The expectation-maximization (EM) algorithm was used to identify maximum likelihood estimates for each parameter, with the iterative process limited to 1,000 iterations. After 10 imputations, a new dataset was formed in which each individual had ten observations; this completed dataset was used when creating the CART models. These "corrected" CART models were subsequently reapplied to the original data.
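The imputation step itself was done in SAS; for illustration, a roughly analogous multiple-imputation workflow can be sketched in Python with scikit-learn's IterativeImputer. This is a stand-in for PROC MI's MCMC method, not the procedure actually used, and the data here are simulated.

```python
import numpy as np
# IterativeImputer is still experimental; this import enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy item matrix with missing values (np.nan); a stand-in for the BFI data.
rng = np.random.default_rng(2)
X = rng.integers(1, 6, size=(200, 8)).astype(float)
X[rng.random(X.shape) < 0.1] = np.nan

# Ten imputations, each drawn with a different seed; stacking them gives
# every participant ten completed records, mirroring the thesis's design.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(10)
]
stacked = np.vstack(imputations)  # 10 completed copies of the dataset
print(stacked.shape)              # (2000, 8)
```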
CART Results
Trees
Visually, the trees produced when predicting factor scores differed from those produced using observed scores, and these differences intensified as the number of items increased. Another observation involved the paths of prediction: there was notable item overlap between the factor score and observed score conditions, but the pattern of prediction changed. For instance, an item that differentiated persons high in openness to experience when using the factor score was used to differentiate those low in openness to experience when using the observed score. This may be due to the lack of agreement on the starting item, which differed between the observed score and factor score conditions for all five personality variables across all three item lengths. Different starting points prompt changes in split criteria that naturally affect the lower levels, and the presence or absence of certain items in one condition versus the other could also be attributed to the starting point. Three-item trees are presented in Appendix E; examples of four-item and five-item trees are described in Appendix F.
Squared Correlations
Figure 7 provides a graphical depiction of the four CART conditions’ trends for the three-, four-, and five-item conditions.
Figure 7: Accuracy of each CART condition
Figure 7(a) is for an average of three items, (b) is for an average of four items, and (c) is
for an average of five items.
[Panels (a)-(c) plot the squared correlation (r²) for each personality variable under the four CART conditions (CART Observed / Data Observed, CART Observed / Data Factor, CART Factor / Data Observed, and CART Factor / Data Factor); the r² axis runs from 0.40 to 1.00.]
Scores were notably better when the score types used for CART and the correlation
were the same (e.g. factor score for CART and factor score for correlation versus factor score for
CART and observed score for correlation). This suggested that the factor scores and the
observed scores were not quite measuring the same thing.
The visible gap between CART factor score conditions (3) and (4) was larger than the gap
between observed score conditions (1) and (2) for all three item lengths.
Chapter 6: A Comparison of Alternative Techniques
Observed Score
As expected, scores predicted using the correlational approach best approximated the
observed scores across all item lengths and all personality variables. The only technique overlap
existed for the openness to experience variable under the three-item length, where the factor
analytic approach and the correlational approach incorporated the same three items. The IRT
technique performed well with extraversion and agreeableness, but faltered for the other three
variables, especially openness to experience. Openness experienced an unexpected drop in
shared variation as the number of items increased, from 61% to about 52%. CART predictions
established using the observed scores (1) were more stable than those from IRT across
variables. For conscientiousness and neuroticism, (1) outperformed all techniques but the
correlational approach, sharing about 67%, 77%, and 85% of the variation in conscientiousness
observed scores (i ≈ 3, 4 and 5, respectively) and about 74%, 80%, and 88% of the variation in
neuroticism observed scores (i ≈ 3, 4, and 5, respectively). Squared correlations were similar for extraversion and agreeableness, but IRT outshone CART there. For openness, the factor analytic approach bested CART (1) under i ≈ 3 and 4 only. CART predictions formed using factor scores as the CART target (2) generally had the least in common with the observed scores. Evident exceptions exist for agreeableness under i ≈ 3 and 4, where the technique outperformed the factor analytic approach (quite dramatically at i ≈ 3). A comparison of all five techniques
for each of the item conditions is presented graphically in Figure 8.
Figure 8: Technique performance relative to the observed scores
Figure 8(a) is for an average of three items, (b) is for an average of four items, and (c) is
for an average of five items.
[Panels (a)-(c) plot the squared correlation (r²) for each personality variable under the Correlational, Factor Analytic, IRT, CART Observed / Data Observed, and CART Factor / Data Observed techniques; the r² axis runs from 0.40 to 1.00.]
Factor Score
Factor score correlations showcased more unusual behavior than those with the observed score, and the original hypotheses are only partially supported by these findings. The factor analytic approach outdid all other techniques across all conditions of i for extraversion only (r² = 0.78, 0.86, and 0.93 for i = 3, 4, and 5, respectively). The approach did well with openness, but its position as most notable was usurped by the correlational approach under four items, and by both the correlational approach and CART (4) under five items. The aforementioned condition, (4), used participant factor scores as targets in CART. Under the three-item condition, the factor analytic approach was poorly associated with agreeableness factor scores (r² = 0.60) relative to the best technique (CART (4), r² = 0.74). IRT displayed the worst overall performance here, with the exception of extraversion. It was remarkably ineffective with openness, even more so than when correlated with observed scores. This is surprising, since the IRT technique depends on the latent metric to define one's performance; the pitfalls of the technique here probably have to do with the algorithm used, an issue discussed in a later section of this paper. Relative to the other techniques, the correlational approach performed best at three items and became a progressively worse option as the number of items increased. It was still consistently better than the final CART technique, (3), which used observed scores as targets; this CART technique and the correlational approach seemed to converge on one another as the number of allowable items increased. The correlational approach performed especially well for neuroticism and openness under i = 4 and 5 (neuroticism r² = 0.85 and 0.87, respectively; openness r² = 0.86 and 0.89, respectively). IRT aside, the correlational approach most deviated from its hypothesized abilities. Finally, CART (4) showed consistently good performance across the personality variables under all three item conditions. If using factor scores, this method would probably be the most appropriate, as it had the best grasp on agreeableness and was a top-tier technique for all other variables (e.g., conscientiousness). Refer to Figure 9 for a graphical comparison of the five methods within each condition i.
Figure 9: Technique performance relative to the factor scores
Figure 9(a) is for an average of three items, (b) is for an average of four items, and (c) is
for an average of five items.
[Panels (a)-(c) plot the squared correlation (r²) for each personality variable under the Correlational, Factor Analytic, IRT, CART Observed / Data Factor, and CART Factor / Data Factor techniques; the r² axis runs from 0.40 to 1.00.]
Chapter 7: BFI Factor Structure
Exploratory Factor Analysis
One of the issues mentioned in Chapter 3 touched upon the idea of construct
definitions. Each personality factor is associated with a set of items, all of which cover some
aspect of the overarching factor. A large problem with the Big Five factors themselves is that
they are intended to reflect very broad concepts. It is no surprise that, in devising the NEO,
Costa & McCrae (1985) established subscales for each personality factor which covered an array
of different personality characteristics. While moderately correlated (and indicators of their
overarching factor), the subscales were not considered one and the same.
Exploratory factor analysis (EFA) allows one to explore the structure of the data. Analysis of the CogUSA dataset revealed possible factor structure alternatives with up to eight factors. The EFA suggested that, for five factors, there would be notable item cross-loading, which is particularly problematic when trying to devise an adaptive measure. Moreover, cross-loading items pose a serious threat to construct validity: if the aim is to create items that successfully indicate one specific construct (e.g., Agreeableness), an item that adequately indicates two or more constructs (e.g., Agreeableness and Extraversion) poorly discriminates between them. On a more practical level, such an item is much more difficult to score. When confirming the structure of the EFAs, the latent variable associated with an item was the one on which the item exhibited its highest loading. After observing the loading pattern of the five-factor EFA (model EFA5), it was assumed that this cross-loading behavior would only worsen as the number of allowable latent variables increased. For this reason, restricting an item to load on only one latent variable should be of more detriment to a model with more latent variables.
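As a sketch of this assignment rule, each item can be attached to the factor on which it loads most strongly, with cross-loading items flagged. The loading matrix and cutoff below are illustrative, not the CogUSA estimates.

```python
import numpy as np

def assign_items(loadings: np.ndarray, cutoff: float = 0.30):
    """Assign each item to its highest-|loading| factor and flag items
    whose |loading| exceeds the cutoff on two or more factors."""
    best_factor = np.abs(loadings).argmax(axis=1)
    cross_loads = (np.abs(loadings) >= cutoff).sum(axis=1) > 1
    return best_factor, cross_loads

# Toy 3-item x 2-factor loading matrix: the third item cross-loads.
L = np.array([[0.70, 0.10],
              [0.05, 0.62],
              [0.45, 0.41]])
print(assign_items(L))  # (array([0, 1, 0]), array([False, False,  True]))
```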
Results of the Exploratory Factor Analysis
Initial assessment of the six, seven, and eight latent variable EFAs involved determining
how many items were attributable to each latent variable after limiting items to the highest
loading condition (coined “meaningful items”). The six-factor model (model EFA6) had a
reasonable sixth factor with four items. The seven-factor model (model EFA7) had a seventh
factor with only one meaningful item, so the seven-factor model was disregarded. The eight-
factor model (model EFA8) had a sixth factor with six items, a seventh factor with three items,
and an eighth factor with three items. In EFA6, all of the items fueling factor six came from the Agreeableness factor. Two items originally belonging to extraversion were also appropriated by other factors: item 11 (“Is full of energy”) went to conscientiousness and item 16 (“Generates a lot of enthusiasm”) went to openness to experience. Model fit obtained from the confirmatory analysis of EFA6 (CFI = 0.729, RMSEA εa = 0.102, and WRMR = 2.785) was only slightly worse overall than that for EFA5 (CFI = 0.737, RMSEA εa = 0.101, and WRMR = 2.787).
EFA8 exhibited notably different appropriation patterns. This time, the original conscientiousness latent variable lost six of its items to latent variable six. The original conscientiousness LV is composed of items that describe diligence (in the context of employment), while the items associated with the sixth factor indicate effectiveness (and organization). The seventh factor's items deal with valuation of the arts, and were taken from the original openness to experience latent variable. Three agreeableness items make up the eighth factor, effectively splitting the agreeableness factor into two components: the positive aspects (e.g., cooperation and helpfulness) and the negative aspects (e.g., being quarrelsome and distant). Finally, the same two items from extraversion were reassigned, though this time both went to openness to experience. Model fit for EFA8 was a little worse than for the previous models (CFI = 0.707, RMSEA εa = 0.105, and WRMR = 2.816).
Chapter 8: General Discussion
Reduction Methods
Short Forms
The correlational and factor analytic approaches are arguably the easiest to conceive,
and the simplest to implement. They performed well here, an unsurprising finding given how they were created. The correlational approach selected the highest correlates with the observed score, so the resulting reduced scores should be well related to this total score. The added benefit of this approach is that it removes some of the extraneous noise contributed by items that do not share as much information with the total score. Additional study of this approach could evaluate whether the more traditional means of item removal described in Chapter 3 produces results different from the ones found here. The algorithm used here chose the items that best represented the correlate as defined by the full scale for each personality variable; readjusting the scale score after each individual item removal may produce different reduced scales, which could measure (slightly) different constructs.
The factor analytic approach, while underperforming in relation to the observed score,
acted as expected when correlated with factor scores. Since item selection was based on how
well an item loaded on the latent construct, this is a consistent finding: the selected items best indicated the construct, and extraneous information (from the poorest loaders) was removed.
A non-ignorable issue arises if the construct in question covers more than a single dimension. This would indicate an observed score or factor score that either equally represents multiple constructs or leans strongly towards one dimension over another. In that case, the shared variation exhibited by reduced scale scores may be deceiving, because it could reflect a relationship between legitimately related constructs rather than the same construct.
IRT
Of all five utilized techniques, the IRT approach could be considered the most involved, requiring both in-depth knowledge of its components and a detailed understanding of how those components can be applied. Consequently, designing the adaptive algorithm can be complicated if IRT programs are not used to do the work. This complication does allow for an array of algorithms, and one could be designed with such complexity that it consistently provides an accurate representation of all participants' scores on a given personality measure, overshadowing even the ascribed merits of CART.
The benefits of IRT for adaptive testing are lost when items are assigned more haphazardly, as they may have been in these analyses. Though selection of the starting item was systematic, the step rules were arbitrary. Progressing from the difficulty outskirts and working
in towards the center of the difficulty spectrum seemed reasonable at the design phase, but in
hindsight may have been inappropriate considering initial findings regarding the difficulty of
items. If trying to maximize the effectiveness of a simple algorithm, relegating everyone to
higher difficulties and then variably lowering them along the difficulty spectrum may have
proved more reliable. A more complex algorithm would incorporate some sort of variable
assignment rule which maximized the correlation between the current score and the total score.
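One simple, non-adaptive reading of that idea is greedy forward selection: repeatedly add the item that most improves the correlation between the running score and the total. The sketch below is illustrative only (simulated data, a hypothetical function), not the thesis algorithm.

```python
import numpy as np

def greedy_items(items: np.ndarray, total: np.ndarray, k: int) -> list:
    """Greedily pick k items whose running sum best correlates with total.

    A sketch of the variable-assignment idea above, not the thesis algorithm.
    """
    chosen: list = []
    current = np.zeros(len(total))
    for _ in range(k):
        candidates = [j for j in range(items.shape[1]) if j not in chosen]
        best = max(candidates,
                   key=lambda j: abs(np.corrcoef(current + items[:, j],
                                                 total)[0, 1]))
        chosen.append(best)
        current += items[:, best]
    return chosen

rng = np.random.default_rng(3)
X = rng.integers(1, 6, size=(300, 8)).astype(float)
print(greedy_items(X, X.sum(axis=1), k=3))  # indices of the three chosen items
```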
Also arbitrary was the response split point. Based loosely on response propensity, splits
occurred at the 3-4 mark, where participants with a 3 or lower went one direction, and
participants who responded with 4s or 5s went a different direction (neuroticism followed a
modified rule since it was a negatively oriented variable with participant response propensities
towards the lower end). Firstly, the use of this rule for all items was a misstep since there were
a good number of items for which the observation of high positive endorsement did not hold.
Secondly, if the intention was to better measure people with high scores on any given
personality variable, a default split at the 4-5 mark would have made more sense, thereby
55
effectively splitting individuals with moderately high scores from those with high scores. The
used algorithm combined these two groups, thus failing to do what was originally intended.
Future research can investigate whether different splitting rules would improve the estimates
for this sample.
It is possible that the IRT technique was doomed from the beginning. With 8 to 10 items
per scale, the BFI did not meet the item minimum of 30 suggested by Dodd, Koch, and De Ayala
(1989). McArdle (2009) had access to 47 items for a single construct, allowing for more
flexibility in the adaptive process. An insufficient item base coupled with a questionable factor
structure may have severely weakened the effectiveness of this technique. In addition, IRT is
generally more effective when there is wider variation in item difficulties. The items of the BFI
exhibited far more clustering at a particular end of the trait continuum, making it more difficult
to distinguish participants based on response propensity.
An important observation to keep in mind is that the IRT technique did poorly in a
relative sense, but with the exception of openness to experience, was rarely the worst
technique. The discussion here relates to deviations from the technique’s expected
performance relative to some of the other techniques (e.g. the correlational approach).
CART
Neither CART procedure consistently outperformed all techniques, an observation
contrary to the findings of McArdle (2009). Having said that, the techniques performed well and
appropriately improved at higher levels of i, an expectation given decreases in relative error for
the more complex CART trees.
CART is only as effective as the items allow it to be. With such a limited pool of items,
there were not many unique prediction paths for participants. CART trees were pruned to certain sizes for two main reasons: (1) to keep the number of items consistent with the other methods, and (2) to prevent the resurgence of items. This latter point is an issue when using CART because of the difficulties item reuse presents for adaptive testing. If allowed to grow to maximal size, the model's predicted scores would produce shared variation greatly exceeding that of the other methods, but the resulting model would be neither feasible nor comprehensible.
One thing the CART and IRT methods could benefit from is the use of priors (and
utilities). Priors increase predictive strength because they promote customization of start points
and can establish more appropriate split criteria. A prior can technically be anything that
provides useful and differentiating information, such as age, ethnicity, or performance on some
related variable such as a cognitive test.
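As a toy illustration of the start-point idea, a prior covariate such as age could route participants to different starting items. The item labels and the age cutoff below are purely hypothetical.

```python
from typing import Optional

def starting_item(age: Optional[int]) -> str:
    """Pick a starting item from a prior covariate (age).

    The item labels and the age cutoff are hypothetical; the point is only
    that a prior can customize the start point of the adaptive sequence.
    """
    if age is None:
        return "B36"  # no prior available: fall back to a default start
    return "B1" if age >= 65 else "B26"

print(starting_item(70))    # 'B1'
print(starting_item(None))  # 'B36'
```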
Closing Thoughts
The “True” Score
If the purpose of scale reduction is to find a time-saving method that best estimates the
observed score, then the results of this study would suggest one go with the correlational
approach to short forms above all other methods. It consistently outperformed the alternatives across all item conditions for all personality variables, and given a sample of similar makeup to the one used here, that should remain the case. Any test size could be used: three items may work if time is the issue, but five items may represent a useful compromise between time savings and accuracy.
As already mentioned, there are limitations in classical test theory which bring into
question how accurate these observed scores are in capturing a participant’s true score on any
given variable. The factor score can be considered a better estimate of the true score because it
does not assume all items are equally important in the development of the construct. If one
relied on the factor score instead of the observed score, the CART based on factor scores would
be the best method. A simulation study would help strengthen this claim by allowing for tests of different sample characteristics that could modify the efficacy of a particular technique. For instance, one could manipulate sample size or participant performance and evaluate how each technique performs under these new conditions.
Factor analyses suggested that the items do not load equally on any construct, so the
observed scores and the factor scores were not going to be perfectly consistent with one
another. However, direct correlation of the score types revealed that, for the most part, they
were not too different from each other: the extraversion r² = 0.94, the agreeableness r² = 0.76, the conscientiousness r² = 0.91, the neuroticism r² = 0.96, and the openness to experience r² = 0.86. Agreeableness seems the most problematic, sharing only 76% of the variation at the full scale level. Extraversion, conscientiousness, and neuroticism all remained above 90%, with neuroticism the most consistent.
Saving Time
One purpose of scale reduction is to conserve time while sacrificing as little as possible
in terms of scale accuracy. The process has the added beneficial outcome of lessening the
burden placed on respondents. Preliminary evaluation of the time commitment needed to complete the BFI is about 4 minutes, or 240 seconds; across the 44 items, this suggests each item requires an average of about 5.5 seconds (240/44 ≈ 5.5) to read and endorse. For some of the personality variables (e.g.
agreeableness), a shift from three to five items presents a notable improvement in accuracy,
regardless of technique. For other variables (e.g. openness to experience), this shift does not
greatly improve accuracy beyond four items. The added time may not be sufficiently useful in
the latter case whereas in the former it might be wise to ask five items over three or four.
Ultimately, the number of items used depends on how much information a researcher is willing
to lose and which aspect of scale reduction (accuracy or time savings) is more important to
them.
Future Study
Given the EFA and the observed score / factor score squared correlation, it may be
worthwhile to better define the agreeableness construct in order to improve its measurement.
Openness could benefit from this treatment as well: it has the most items in the BFI, but it is possible that some of the extra items detract from the construct's central theme.
Of particular interest for future research is the IRT technique. As applied here it was a
relatively ineffective means of scale reduction, but there were a number of already mentioned
factors impeding its success. Despite limitations of the BFI’s item pool size, there is still
evidence that this method could perform admirably with a more advanced algorithm, such as
the FAT approach used by McArdle (2009) or the item information recalculation approach used
by Reise & Henson (2000).
Bibliography
Allen, M. J. & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL:
Waveland Press.
Aluja, A., Garcia, O., Garcia, L. F., & Seisdedos, N. (2004). Invariance of the NEO-PI-R factor
structure across exploratory and confirmatory factor analyses. Personality and
Individual Differences, 38, 1879-1889.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in
two or more nominal categories. Psychometrika, 37, 29-51.
Cacioppo, J. T., Petty, R. E., & Kao, C. F. (1984). The efficient assessment of need for cognition.
Journal of Personality Assessment, 48:3, 306-307.
Caprara, G. V., Barbaranelli, C., Hahn, R., & Comrey, A. L. (2001). Factor analyses of the NEO-PI-
R Inventory and the Comrey Personality Scales in Italy and the United States.
Personality and Individual Differences, 30, 217-228.
Costa, P. T., & McCrae, R. R. (1985). The NEO personality inventory manual. Odessa, FL:
Psychological Assessment Resources.
Costa, P. T., & McCrae, R. R. (1992). The NEO PI-R professional manual. Odessa, FL:
Psychological Assessment Resources, Inc.
Digman, J. M. (1990). Personality structure: emergence of the five-factor model. Annual Review
of Psychology, 41, 417-440.
Dodd, B. G., Koch, W. R., & De Ayala, R. J. (1989). Operational characteristics of adaptive testing
procedures using the graded response model. Applied Psychological Measurement,
13:2, 129-143.
Fisher, G. G., Rodgers, W. L., Kadlec, K. M., & McArdle, J. J. (TBP). Cognition and aging in the
USA: study methods and sample selectivity. Retrieved from:
http://kiptron.usc.edu/publications/merit_pubs.php
Fliege, H., Becker, J., Walter, O. B., Bjorner, J. B., Klapp, B. F., & Rose, M. (2005). Development
of a computer-adaptive test for depression (D-CAT). Quality of Life Research, 14:10,
2277-2291.
Flora, D. B. & Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation
for confirmatory factor analysis with ordinal data. Psychological Methods, 9:4, 466-491.
Forbey, J. D. & Ben-Porath, Y. S. (2007). Computerized adaptive personality testing: a review
and illustration with the MMPI-2 computerized adaptive version. Psychological
Assessment, 19:1, 14-24.
Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item response theory analysis of self-
report measures of adult attachment. Journal of Personality and Social Psychology,
78:2, 350-365.
Gibbons, R. D., Bock, R. D., Hedeker, D., Weiss, D. J., Segawa, E., Bhaumik, D. K., …Stover, A.
(2007). Full-information item bifactor analyses of graded response data. Applied
Psychological Measurement, 31, 4-19.
Goldberg, L. R. (1990). An alternative “description of personality”: the big-five factor structure.
Journal of Personality and Social Psychology, 59:6, 1216-1229.
Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist,
48:1, 26-34.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough,
H. G. (2006). The international personality item pool and the future of public-domain
personality measures. Journal of Research in Personality, 40, 84-96.
Gosling, S. D., Rentfrow, P. J., & Swann Jr., W. B. (2003). A very brief measure of the big-five
personality domains. Journal of Research in Personality, 37:6, 504-528.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response
theory. Newbury Park, CA: Sage Press.
Handel, R. W., Ben-Porath, Y. S., & Watt, M. (1999). Computerized adaptive assessment with
the MMPI-2 in a clinical setting. Psychological Assessment, 11:3, 369-380.
Hoffman, L. (2009). Graded response IRT models in Mplus version 5.2 [PowerPoint slides].
Retrieved from
http://psych.unl.edu/psycrs/948/10c_Graded_Response_IRT_Models_in_Mplus.pdf
John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory – versions 4a and 54.
Berkeley, CA: University of California, Berkeley, Institute of Personality and Social
Research.
John, O. P., Naumann, L. P., & Soto, C. J. (2008). Paradigm Shift to the Integrative Big-Five Trait
Taxonomy: History, Measurement, and Conceptual Issues. In O. P. John, R. W. Robins, &
L. A. Pervin (Eds.), Handbook of personality: Theory and research (pp. 114-158). New
York, NY: Guilford Press.
John, O. P. & Srivastava, S. (1999). The Big-Five trait taxonomy: history, measurement, and
theoretical perspectives. In L. Pervin and O. P. John (Eds.), Handbook of personality:
theory and research (pp. 102-138). New York: Guilford.
Knottnerus, J. A., Knipschild, P. G., & Sturmans, F. (1989). Symptoms and selection bias: the
influence of selection toward specialist care on the relationship between symptoms and
diagnoses. Theoretical Medicine and Bioethics, 10:1, 67-81.
Loh, W. (2008). Classification and regression tree methods. In Encyclopedia of Statistics in
Quality and Reliability (pp. 315-323).
McArdle, J. J. (2009). Adaptive testing of the Number Series test using standard approaches and
a new classification and regression tree approach. The Woodcock-Munoz Research
Foundation Journal, 1.
Muthén, L. K. & Muthén, B. O. (1998-2007). Mplus User’s Guide. Fifth Edition. Los Angeles, CA:
Muthén & Muthén.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of
Mathematical Psychology, 3:1, 1-18.
Rammstedt, B. & John, O.P. (2007). Measuring personality in one minute or less: a 10-item
short version of the Big Five Inventory in English and German. Journal of Research in
Personality, 41, 203-212.
Reise, S. P. (1999). Personality measurement issues viewed through the eyes of IRT. In S. E.
Embretson and S. L. Hershberger (Eds.), the new rules of measurement: what every
psychologist and educator should know (pp. 219-242). New Jersey: Lawrence Erlbaum
Associates.
Reise, S. P. & Henson, J. M. (2000). Computerization and adaptive administration of the NEO PI-
R. Assessment, 7:4, 347-364.
Robins, R. W., Hendin, H. M., & Trzesniewski, K. H. (2001). Measuring global self-esteem:
construct validation of a single-item measure of the Rosenberg Self-Esteem scale.
Personality and Social Psychology Bulletin, 27, 151-161.
Saucier, G. (1994). Mini-markers: a brief version of Goldberg’s unipolar big-five markers. Journal
of Personality Assessment, 63:3, 506-516.
Saucier, G. & Goldberg, L. R. (1998). What is beyond the big five? Journal of Personality, 66:4,
495-524.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph Supplement, 17.
Samejima, F. (1997). Graded response model. In W. J. van der Linden and R. K. Hambleton
(Eds.), Handbook of modern item response theory (pp. 85-100). New York: Springer-
Verlag.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. Boca Raton, FL: Chapman &
Hall/CRC.
Simms, L. J. & Clark, L. A. (2005). Validation of a computerized adaptive version of the schedule
for nonadaptive and adaptive personality (SNAP). Psychological Assessment, 17:1, 28-
43.
Tupes, E. C. & Christal, R. E. (1961). Recurrent personality factors based on trait ratings.
Technical Report ASD-TR-61-97. Lackland Air Force Base, TX: Personnel Laboratory, Air
Force Systems Command.
Waller, N. G. & Reise, S. P. (1989). Computerized adaptive personality assessment: an
illustration with the absorption scale. Journal of Personality and Social Psychology, 57:6,
1051-1058.
Weiss, D. J. & Kingsbury, G. G. (1984). Application of computerized adaptive testing to
educational problems. Journal of Educational Measurement, 21, 361-375.
Wiggins, J. S. & Pincus, A.L. (1992). Personality: structure and assessment. Annual Review of
Psychology, 43, 473-504.
Williams, G. J. (2009). Rattle: a data mining GUI for R. The R Journal, 1/2, 45-55.
Yu, C. Y. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with
binary and continuous outcomes. Doctoral dissertation, University of California, Los
Angeles.
Zhang, B. A CART based sub-pixel method to map spatial and temporal patterns of prairie
pothole lakes with climatic variability. Unpublished manuscript, Ohio State University,
Columbus, Ohio.
Zickar, M. J. (1998). Modeling item-level data with item response theory. Current Directions in
Psychological Science, 7:4, 104-109.
Appendix A: Sample statistics from CogUSA (N = 1427)
Age Females Education Extra Agree Consc Neuro Open
Mean 64.3 56% 14.1 26.8 38.3 36.7 20.0 36.8
SD 10.5 - 2.5 6.6 4.9 5.6 6.6 7.2
Correlations
Age 1.000
Females -0.082 1.000
Education -0.205 -0.066 1.000
Extra -0.039 0.039 0.108 1.000
Agree 0.080 0.197 -0.021 0.169 1.000
Consc 0.032 0.074 0.076 0.274 0.333 1.000
Neuro -0.061 0.089 -0.097 -0.297 -0.367 -0.347 1.000
Open -0.129 -0.057 0.355 0.350 0.070 0.222 -0.184 1.000
Appendix B: BFI-44
Please write a number next to each statement to indicate the extent to which you agree
or disagree with that statement
(1) = disagree strongly
(2) = disagree a little
(3) = neither agree nor disagree
(4) = agree a little
(5) = agree strongly
I am someone who…
1. Is talkative
2. Tends to find fault with others
3. Does a thorough job
4. Is depressed, blue
5. Is original, comes up with new ideas
6. Is reserved
7. Is helpful and unselfish with others
8. Can be somewhat careless
9. Is relaxed, handles stress well
10. Is curious about many different things
11. Is full of energy
12. Starts quarrels with others
13. Is a reliable worker
14. Can be tense
15. Is ingenious, a deep thinker
16. Generates a lot of enthusiasm
17. Has a forgiving nature
18. Tends to be disorganized
19. Worries a lot
20. Has an active imagination
21. Tends to be quiet
22. Is generally trusting
23. Tends to be lazy
24. Is emotionally stable, not easily upset
25. Is inventive
26. Has an assertive personality
27. Can be cold and aloof
28. Perseveres until the task is finished
29. Can be moody
30. Values artistic, aesthetic experiences
31. Is sometimes shy, inhibited
32. Is considerate and kind to almost everyone
33. Does things efficiently
34. Remains calm in tense situations
35. Prefers work that is routine
36. Is outgoing, sociable
37. Is sometimes rude to others
38. Makes plans and follows through with them
39. Gets nervous easily
40. Likes to reflect, play with ideas
41. Has few artistic interests
42. Likes to cooperate with others
43. Is easily distracted
44. Is sophisticated in art, music, or literature
Appendix C: Short Form Item Utilization
Each personality variable is listed in the left hand column. The next two columns
identify each of the items associated with the respective personality variable by item number
and description (adjective). An ‘X’ in a box indicates the item was retained for that particular
condition i (MAX, 5, 4, or 3 items).
Approach Correlational Factor Analytic
Item # Adjective MAX 5 4 3 MAX 5 4 3
Extraversion
1 Talkative X X X X X
6 Reserved X X
11 Energetic X X X X X
16 Enthusiastic X X X X X
21 Quiet X X X X X X
26 Assertive X X X X X X
31 Shy X X X
36 Outgoing X X X X X X X X
Agreeableness
2 Faultfinding X X X X
7 Helpful X X X X X
12 Quarrelsome X X
17 Forgiving X X X X X X X
22 Trusting X X
27 Aloof X X
32 Kind X X X X X X X X
37 Rude X X X X X X
42 Cooperative X X X X X X
Conscientiousness
3 Thorough X X
8 Careless X X X
13 Reliable X X X
18 Disorganized X X X X X
23 Lazy X X X X X X X
28 Persevering X X X X X X X X
33 Efficient X X X X X
38 Follow-through X X X X X X X
43 Distractible X X
Neuroticism
4 Depressed X X X X
9 Relaxed X X X X X X X X
14 Tense X X X
19 Worrisome X X X X X X
24 Coolheaded X X X X X X X
29 Moody X X
34 Calm X X X X X
39 Nervous X X X X X
Openness
5 Original X X X X X X X X
10 Curious X X X X
15 Ingenious X X X
20 Imaginative X X X
25 Inventive X X X X X X X X
30 Cultured X X X X
35 Routine X X
40 Reflective X X X X X X X X
41 Philistine X X
44 Art Savvy X X
Appendix D: IRT Graded Response Model Designs/Paths
Refer to Appendices B and C for item identification.
Extraversion
Three-item
Start: B21. A response of 4-5 leads to B6; a response of 1-3 leads to B36. From B6, a response of 1-3 leads to B16 and a response of 4-5 leads to B31. From B36, a response of 4-5 leads to B16 and a response of 1-3 leads to B11.
Four-item
The first three items follow the three-item paths above. A fourth item, drawn from B1, B11, B16, B26, or B31, is then administered according to the path taken and whether the response to the third item was 1-3 or 4-5.
Agreeableness
Three-item
Start: B32. A response of 4-5 leads to B2; a response of 1-3 leads to B7. From B2, a response of 1-3 leads to B37 and a response of 4-5 leads to B27. From B7, a response of 4-5 leads to B37 and a response of 1-3 leads to B22.
Four-item
The first three items follow the three-item paths above. A fourth item, drawn from B17, B22, B27, B37, or B42, is then administered according to the path taken and whether the response to the third item was 1-3 or 4-5.
Conscientiousness
Three-item
Start: B33. A response of 4-5 leads to B43; a response of 1-3 leads to B13. From B43, a response of 1-3 leads to B38 and a response of 4-5 leads to B18. From B13, a response of 4-5 leads to B23 and a response of 1-3 leads to B3.
Four-item
The first three items follow the three-item paths above. A fourth item, drawn from B3, B8, B18, B23, B28, or B38, is then administered according to the path taken and whether the response to the third item was 1-3 or 4-5.
Neuroticism
Three-item
Start: B19. A response of 1-2 leads to B14; a response of 3-5 leads to B4. From B14, a response of 3-5 leads to B39 and a response of 1-2 leads to B29. From B4, a response of 1-2 leads to B34 and a response of 3-5 leads to B24.
Four-item
The first three items follow the three-item paths above. A fourth item, drawn from B9, B24, B29, B34, or B39, is then administered according to the path taken and whether the response to the third item was 1-2 or 3-5.
Openness to Experience
Three-item
Start: B40. A response of 4-5 leads to B35; a response of 1-3 leads to B10. From B35, a response of 1-3 leads to B41 and a response of 4-5 leads to B44. From B10, a response of 4-5 leads to B5 and a response of 1-3 leads to B30.
Four-item
The first three items follow the three-item paths above. A fourth item, drawn from B5, B15, B20, or B41, is then administered according to the path taken and whether the response to the third item was 1-3 or 4-5.
Five-item
The first four items follow the four-item paths above. A fifth item, most often B25 and otherwise drawn from B5, B15, B20, B30, B41, or B44, is then administered according to the path taken and whether the response to the fourth item was 1-3 or 4-5.
Appendix E: Three-Item CART Trees
Refer to Appendices B and C for item identification.
Observed Scores
The five trees below represent the three-item CART analyses which used each
personality variable’s observed score as the target variable. Relative error for each tree is
presented in parentheses next to the personality variable’s name.
Extraversion (relative error = 0.309)
Agreeableness (relative error = 0.337)
Conscientiousness (relative error = 0.352)
Neuroticism (relative error = 0.274)
Openness to Experience (relative error = 0.328)
Factor Scores
The five trees below represent the three-item CART analyses which used each
personality variable’s factor score as the target variable. Relative error for each tree is
presented in parentheses next to the personality variable’s name.
Extraversion (relative error = 0.272)
Agreeableness (relative error = 0.266)
Conscientiousness (relative error = 0.319)
Neuroticism (relative error = 0.271)
Openness to Experience (relative error = 0.281)
Appendix F: Example Four-Item and Five-Item CART Tree Descriptions
Refer to Appendices B and C for item identification.
For all CART models, scores as described are averages, with associated standard deviations in
parentheses.
Observed Score Example
Four items: extraversion (relative error = 0.216)
This extraversion tree had 15 total terminal nodes.
Participants were split on item 21 (step 1) at the value 3.5. For those who endorsed a
value less than the cutoff, item 26 (step 2) was used as the subsequent split (at 2.5). A value less
than 2.5 on item 26 warranted a split on item 36 (at 2.5). From here, a final split occurred again
with item 21 (at 1.5) for those below 2.5 on item 36. The repeated item indicates an interaction, and the model would have to be modified to use this for adaptive testing purposes (one would not give a participant the same item twice). A response less than 1.5 on item 21 garnered a score of
14.135 (SD = 2.6) while a response greater than 1.5 led to a score of 18.150 (SD = 3.1). Note that
CART returns split points that do not actually represent true response values (i.e., 1, 2, 3, 4, and
5). Returning to item 36, if a response greater than 2.5 was given, the final split occurred with
item 11 (at 3.5). A value less than 3.5 on item 11 produces an estimated score of 19.767 (SD =
3.0), while a value greater than 3.5 leads to an estimated score of 22.945 (SD = 3.2). Back to
step 2, participants who endorsed a value greater than 2.5 on item 26 were subsequently split
on item 1 at 3.5 (step 3). Scoring less than this value led to another split on item 21 (at 1.5),
with a terminal node (#5) for those who responded with less than 1.5. The predicted score for
these participants was 20.610 (SD = 3.5). A response greater than 1.5 led to a final split (at 4.5)
of item 36. The below cutoff score was 23.971 (SD = 2.6) and the above cutoff score was 28.415
(SD = 3.3). From step 3, above cutoff responders continued to item 31, where they were split at
3.5. Above cutoff responders hit a terminal node (#10) and were attributed a score of 30.370
(SD = 2.880). Below cutoff, a final split occurred for item 11 at 3.5. Those who were less than
this cutoff received a score of 24.394 (SD = 2.8) and those who were above scored a 27.849 (SD
= 2.6).
From step 1, endorsement of item 21 at a value greater than 3.5 produced a split on
item 36 at 4.5 (step 4). Responses less than this cutoff were further split on item 36 at 2.5. This
split is not quite the same as the item reuse issue from earlier. One of CART’s limitations is that
it cannot do more than a two-way split. So while a three-way split may have been most
appropriate, the only way to do that here was to have the item re-split. A response less than 2.5
led to a predicted score of 23.667 (SD = 3.4). A response greater than that cutoff (but less than
4.5) produced a score of 30.250 (SD =3.3). From step 4, endorsement of item 36 with a value
greater than 4.5 motivated a later split on item 31 at 4.50. Those who scored above this cutoff
were greeted by the score in the final terminal node (#15), 36.689 (SD = 2.7). Below the cutoff,
a final split occurred for item 1 at 2.5. Low endorsers achieved a score of 27.000 (SD = 3.4) and
high endorsers scored 33.171 (SD = 2.7).
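For clarity, the first few splits described above can also be encoded as a small lookup structure. This is a sketch of the described tree, not the CART Pro output itself, with the unexplored branches elided.

```python
# A nested-dict sketch of the first splits described above. Only the
# leftmost branch is filled in; "..." marks branches elided here (they
# are described in the surrounding text). Node values are the reported
# terminal-node means.
tree = {
    "item": 21, "cut": 3.5,
    "lo": {
        "item": 26, "cut": 2.5,
        "lo": {
            "item": 36, "cut": 2.5,
            "lo": {"item": 21, "cut": 1.5, "lo": 14.135, "hi": 18.150},
            "hi": {"item": 11, "cut": 3.5, "lo": 19.767, "hi": 22.945},
        },
        "hi": "...",  # continues with item 1 (step 3)
    },
    "hi": "...",      # continues with item 36 at 4.5 (step 4)
}

def predict(node, responses):
    """Walk the tree using a dict mapping item number -> response (1-5)."""
    while isinstance(node, dict):
        node = node["lo"] if responses[node["item"]] < node["cut"] else node["hi"]
    return node

print(predict(tree, {21: 2, 26: 1, 36: 3, 11: 4}))  # -> 22.945
```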
Factor Score Example
Five-item: openness to experience (relative error = 0.169)
This openness to experience tree had 27 terminal nodes. Due to complexity, this tree
will be described in an abbreviated form. Several of the paths include repeated items.
Parent Node (Node 1): Item 5 (cut point = 3.5)
1. Terminal Node 1: -2.512 (SD = 0.375). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 < 1.5 → item 15 < 0.5 → item 5 < 0.5
2. Terminal Node 2: -1.774 (SD = 0.361). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 < 1.5 → item 15 < 0.5 → item 5 > 0.5
3. Terminal Node 3: -1.173 (SD = 0.352). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 < 1.5 → item 15 > 0.5
4. Terminal Node 4: -1.623 (SD = 0.305). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 > 1.5 → item 30 < 3.5 → item 15 < 2.5 → item 5 < 1.5 → item 44 < 0.5
5. Terminal Node 5: -0.951 (SD = 0.317). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 > 1.5 → item 30 < 3.5 → item 15 < 2.5 → item 5 < 1.5 → item 44 > 0.5
6. Terminal Node 6: -0.804 (SD = 0.237). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 > 1.5 → item 30 < 3.5 → item 15 < 2.5 → item 5 > 1.5
7. Terminal Node 7: -0.458 (SD = 0.297). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 > 1.5 → item 30 < 3.5 → item 15 > 2.5
8. Terminal Node 8: -0.608 (SD = 0.305). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 > 1.5 → item 30 > 3.5 → item 5 < 2.5
9. Terminal Node 9: -0.080 (SD = 0.295). Path: item 5 < 3.5 → item 25 < 2.5 → item 40 > 1.5 → item 30 > 3.5 → item 5 > 2.5
10. Terminal Node 10: -1.004 (SD = 0.286). Path: item 5 < 3.5 → item 25 > 2.5 → item 30 < 3.5 → item 40 < 2.5 → item 15 < 1.5
11. Terminal Node 11: -0.453 (SD = 0.293). Path: item 5 < 3.5 → item 25 > 2.5 → item 30 < 3.5 → item 40 < 2.5 → item 15 > 1.5
12. Terminal Node 12: -0.244 (SD = 0.293). Path: item 5 < 3.5 → item 25 > 2.5 → item 30 < 3.5 → item 40 > 2.5 → item 20 < 3.5
13. Terminal Node 13: 0.141 (SD = 0.370). Path: item 5 < 3.5 → item 25 > 2.5 → item 30 < 3.5 → item 40 > 2.5 → item 20 > 3.5
14. Terminal Node 14: -0.233 (SD = 0.328). Path: item 5 < 3.5 → item 25 > 2.5 → item 30 > 3.5 → item 40 < 3.5 → item 5 < 2.5
15. Terminal Node 15: 0.136 (SD = 0.290). Path: item 5 < 3.5 → item 25 > 2.5 → item 30 > 3.5 → item 40 < 3.5 → item 5 > 2.5 → item 25 < 3.5
16. Terminal Node 16: 0.674 (SD = 0.305). Path: item 5 < 3.5 → item 25 > 2.5 → item 30 > 3.5 → item 40 < 3.5 → item 5 > 2.5 → item 25 > 3.5
17. Terminal Node 17: 0.621 (SD = 0.330). Path: item 5 < 3.5 → item 25 > 2.5 → item 30 > 3.5 → item 40 > 3.5
18. Terminal Node 18: -1.077 (SD = 0.585). Path: item 5 > 3.5 → item 40 < 3.5 → item 40 < 0.5
19. Terminal Node 19: 0.177 (SD = 0.402). Path: item 5 > 3.5 → item 40 < 3.5 → item 40 > 0.5 → item 41 < 2.5
20. Terminal Node 20: 0.083 (SD = 0.222). Path: item 5 > 3.5 → item 40 < 3.5 → item 40 > 0.5 → item 41 > 2.5 → item 20 < 2.5
21. Terminal Node 21: 0.726 (SD = 0.355). Path: item 5 > 3.5 → item 40 < 3.5 → item 40 > 0.5 → item 41 > 2.5 → item 20 > 2.5
22. Terminal Node 22: 0.420 (SD = 0.352). Path: item 5 > 3.5 → item 40 > 3.5 → item 25 < 3.5 → item 44 < 2.5
23. Terminal Node 23: 0.663 (SD = 0.244). Path: item 5 > 3.5 → item 40 > 3.5 → item 25 < 3.5 → item 44 > 2.5 → item 41 < 3.5
24. Terminal Node 24: 1.167 (SD = 0.217). Path: item 5 > 3.5 → item 40 > 3.5 → item 25 < 3.5 → item 44 > 2.5 → item 41 > 3.5
25. Terminal Node 25: 1.006 (SD = 0.383). Path: item 5 > 3.5 → item 40 > 3.5 → item 25 > 3.5 → item 44 < 2.5
26. Terminal Node 26: 1.259 (SD = 0.356). Path: item 5 > 3.5 → item 40 > 3.5 → item 25 > 3.5 → item 44 > 2.5 → item 15 < 3.5
27. Terminal Node 27: 1.719 (SD = 0.314). Path: item 5 > 3.5 → item 40 > 3.5 → item 25 > 3.5 → item 44 > 2.5 → item 15 > 3.5
Abstract
Multiple short form approaches, the graded response IRT model (Samejima, 1969), and CART were used to reduce the Big Five Inventory’s five personality scales (extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience). Data were taken from the 2008 CogUSA sample of older adults (Fisher et al., TBP, N = 1427). Two short form techniques were adopted: (1) an extraction of the highest correlates based on correlational analysis, and (2) an extraction of the highest loaders based on factor analysis. The graded response model utilized item difficulties and discrimination to adapt the measure to participants. Following McArdle (2009), Classification and Regression Trees (CART), which estimate scores based on regression logic that allows for differential allocation of participants, were used here for adaptive testing as well. Each method’s predicted scores were correlated with both observed scores and factor scores. Evidence suggested the two score types did not line up perfectly, due to the measure’s factor structure. The correlational approach to short forms best approximated the observed score, while the CART technique that predicted factor scores (used as the target in the CART analysis) most consistently estimated factor scores. Issues with the factor structure, problems with the agreeableness scale, time savings, and identification of the “true” score are discussed.