Rasch Modeling of Abstract Reasoning in Project TALENT
Randy Bautista
University of Southern California
Table of Contents
Abstract 3
Introduction 4
Significance of Study 5
Classical Test Theory, Structural Factor Analysis, and Item Response Theory Models 6
Measurement Invariance and Multiple-Group Structural Factor Analysis 12
The Purpose of this Study 13
Research Questions 13
Method 13
Sample Description 13
Measures 14
Procedure 16
Analysis Plan 17
Results 20
The One- versus Two- versus Three-Parameter Model 20
Rasch Model Analysis 21
Testing Measurement Invariance of the Rasch Model 24
Discussion 27
Limitations 30
Future Directions and Conclusion 30
References 32
Appendices
Appendix A: Abstract Reasoning Items 38
Technical Appendix A-1: Mplus code for Rasch Model 42
Technical Appendix A-2: Mplus code for the 2PL Model 43
Technical Appendix B: ltm package code for IRT Models 44
Technical Appendix C-1: Mplus code for Baseline Model 46
Technical Appendix C-2: Mplus code for Metric Invariance Model 47
Technical Appendix C-3: Mplus code for Scalar Invariance Model 48
Abstract
The current study examined 1960 data from Project TALENT (Flanagan et al., 1962), a study
investigating students' talents and career aspirations using a comprehensive battery measuring
cognitive ability and interests. In this paper, we evaluate the psychometric properties of the
abstract reasoning measure, one of several measures used in the battery, because this was not
done when the battery was initially created. One reason to undertake a Rasch analysis is that
the original investigators summed scores for each of the measures and used these summed scores
in further analyses. Initially, we compare the Rasch model to two alternative models, the two-
and three-parameter models, and identify any items that do not fit the Rasch model. We then
examine whether the Rasch model of the abstract reasoning measure is invariant across grade
level and sex. Results indicated that the data fit the Rasch model overall; however, one item did
not fit and was excluded from subsequent analyses. Metric invariance held for grade level and sex.
Keywords: abstract reasoning, Rasch model, item response theory, structural factor
analysis, measurement invariance
Rasch Modeling of Abstract Reasoning in Project TALENT
Measurement is at the heart of psychology and the field of psychometrics is concerned
with measurement and explaining the relationship between observed and unobserved attributes.
Cognitive psychologists have found great utility in psychometrics as a field and as a tool for
measuring the unobserved. In particular, intelligence itself is a theoretical construct since it
cannot be directly observed, but is assumed to exist. As researchers, we can use tests to try to
measure intelligence.
Data from Project TALENT (Flanagan et al., 1962) were collected in 1960 on over 440,000
high school students to investigate how well students’ talents matched up with their career
aspirations (Cooley & Lohnes, 1971; Humphreys, Parsons, & Park, 1979). The original
investigators used over 2,000 questions on 30 different tests to measure an array of aptitudes and
knowledge from the ability to remember grammatical rules to familiarity with home economics.
This study looks at one specific measure, abstract reasoning (AR), and its psychometric
properties, and we do so for several key reasons. First, the data are valuable: the sample is large
and nationally representative, and it would be difficult, costly, and unlikely for a study this large
to be carried out today. Second, a large subsample (N > 10,000) of the original 9th-grade
participants was re-administered the same battery in the 12th grade (Wise, McLaughlin, & Steel,
1979). The AR measure was one of several measures in the aptitude, ability, and achievement
tests. Third, since the creation of the battery, only a few researchers have looked at or evaluated
the psychometric properties of the various measures using techniques that are now more
accessible than they would have been in the past with the advancement of technology and
computers (McArdle, 2010).
The AR measure, similar to Raven’s Progressive Matrices (Raven, Raven, & Court, 2003)
and Cattell Culture Fair Test (Cattell, 1949), is a test used to measure non-verbal reasoning
independent of verbal or learned knowledge. Measures like AR in Project TALENT, Raven’s
Progressive Matrices, and the Cattell Culture Fair Test are often proxies for an individual's fluid
intelligence (Gf; McGrew, 1997). Gf largely differs from crystallized intelligence (Gc), which
tends to encompass knowledge from experiences or prior learning (Horn & Cattell, 1967).
However, in expanding on the Gf-Gc theory, Gf would fall under the same stratum as other
abilities like crystallized intelligence, general memory and learning, broad visual perception,
broad auditory perception, processing speed, etc. (Carroll, 1993). In Project TALENT, the AR
measure is composed of several different types of items. These include figure analogies,
sequences, patterns, or groupings (Flanagan et al., 1962). The items are presented as a series of
figures with a missing portion. The participant is then instructed to select the most logically
correct option among several presented choices.
Significance of the Study
While there is a plethora of studies that apply modern techniques to examine the
psychometric properties of various measures, this was not done with Project TALENT’s AR
measure. The research done here adds to the Project TALENT literature by demonstrating the
use of item response theory (IRT) techniques to evaluate the AR measure and structural factor
analysis (SFA) in establishing test validity via measurement invariance of the AR measure across
grade levels and sex. The items were summed into scores for all of the subscales used in Project
TALENT, and these sum scores were used in later analyses following the completion of the initial
testing. The IRT analysis, and more specifically the Rasch model, will support what was already
assumed, i.e., the use of sum scores in further analyses. If the Rasch model fits, then we are able
to compare participants independently of the measure. The multiple-group SFA will support the
use of the measure with different subgroups. Evaluation of the psychometric properties of the AR
measure will prove useful since efforts have been made to use this measure as part of a cognitive
battery of tests to re-examine the original sample from 1960 and older adults (Prescott et al.,
2012). Unlike Raven’s Progressive Matrices and the Cattell Culture Fair Test, the AR measure in
Project TALENT is now publicly available.
Classical Test Theory, Structural Factor Analysis, and Item Response Theory Models
Classical Test Theory (CTT; McDonald, 1999; Embretson & Reise, 2000) posits that a
person’s observed score (X) on a test or measure is a function of their true score (T) and random
error (E)
\[ X = T + E . \]
It is assumed that the random error is normally distributed with a mean of 0 and is uncorrelated
with the true score. In this model, the unit of analysis is the entire test, which can often just be the
sum of the scores for the items or the mean. The items are treated equally even though that may
not always be the case since CTT does not take into account properties of an item. The goal of
CTT is to quantify reliability, i.e., how much of the observed score is due to the true score and
not error. This can be conceptualized in terms of variances; reliability is the proportion of total
variance attributable to the true score. However, CTT has its shortcomings. In CTT, reliability
indexes are characteristics of the sample and not of the test. It also fails to take into account
properties of an item such as item difficulty and discrimination. Other shortcomings include the
assumption of equal error variances for all persons (Embretson & Reise, 2000). Furthermore, to
make a test more reliable, researchers can increase the number of items. This in itself can be a
nuisance to researchers: administering longer tests of a measure can be costly for both the test
taker and the test administrator.
Fortunately, newer methods have been developed and formulated to account for the basic
limitations of CTT. Modern theoretical frameworks such as factor analysis and item response
theory will be discussed in the next sections.
Structural Factor Analysis (SFA; Harrington, 2009; McArdle, 1996) is a multivariate
technique often used to address questions regarding validity, e.g. “Do these items measure what
they purport to measure?” and questions regarding the structures of theories of constructs.
Similarly, SFA can be used as a form of data reduction. The factor in SFA refers to a latent
(unobserved) variable that accounts for the variation among observed variables. Latent variables
are constructs that cannot be directly measured (e.g., intelligence, depression).
SFA is closely related to exploratory factor analysis (EFA). With EFA, a researcher may
not have a particular theory in mind but seeks to summarize the relationships between sets of
variables. In SFA, the researcher specifies different aspects of the model that they believe reflect
their expectations; it is both hypothesis- and theory-driven (Thompson, 2004; Brown, 2006).
More specifically, researchers can specify the number of factors and whether or not these factors
correlate, what variables load on which factors, and relations between error/residual terms. The
two factor analytic techniques are based on the common factor model and EFA can be thought of
as a less restrictive SFA.
In a one-factor SFA measurement model, the items of a measure are the unit of analysis
and can be represented mathematically as a linear relationship between the latent construct and
observed items, written as
\[ Y_i = \mu_i + \lambda_i F + u_i \]
where Y_i is the observed response for the ith item, \mu_i is the intercept or mean of item i, \lambda_i
is the factor loading of item i on the factor F, and u_i is the unique variance of each variable (often
thought of as being composed of a specific factor and a random residual) and is independent of F
(McDonald, 1999). Typically, means are ignored; however, they can be included in the models.
The factor loadings in this model can be conceptualized similarly to a slope or regression weight
in a linear regression model.
It is possible to examine certain CTT assumptions using an SFA approach. As previously
mentioned, the unit of analysis in CTT is the entire test and thus, every item is assumed to be
equivalent. Constraining the factor loadings to be equal in an SFA model allows us to assess
whether the assumption of equivalence holds.
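As an illustration of this point, the sketch below uses the R package lavaan (not the Mplus setup used in this study) to fit the one-factor model twice, once with all loadings forced equal through a shared label and once with free loadings; comparing the two models is one way to probe the CTT equivalence assumption. The data object dat and items ar1-ar15 are assumed, following the variable names in the technical appendices.

# Illustrative lavaan sketch (assumed data object 'dat' with 0/1 items ar1-ar15).
library(lavaan)

equal_model <- '
  AR =~ l*ar1 + l*ar2 + l*ar3 + l*ar4 + l*ar5 + l*ar6 + l*ar7 + l*ar8 +
        l*ar9 + l*ar10 + l*ar11 + l*ar12 + l*ar13 + l*ar14 + l*ar15
'
free_model <- '
  AR =~ ar1 + ar2 + ar3 + ar4 + ar5 + ar6 + ar7 + ar8 +
        ar9 + ar10 + ar11 + ar12 + ar13 + ar14 + ar15
'

# Treat items as ordered (categorical); std.lv = TRUE fixes the factor variance to 1.
fit_equal <- cfa(equal_model, data = dat, std.lv = TRUE,
                 ordered = paste0("ar", 1:15))
fit_free  <- cfa(free_model,  data = dat, std.lv = TRUE,
                 ordered = paste0("ar", 1:15))

# Comparing the constrained and free models asks whether treating the items
# as equivalent (the CTT assumption) is tenable.
lavTestLRT(fit_equal, fit_free)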
Item Response Theory (IRT) and the Rasch Model. As in any other science, item response
theory (IRT; Embretson & Reise, 2000; de Ayala, 2009; Hambleton, Swaminathan, & Rogers, 1991;
McDonald, 1999) emerged and developed to address the limitations of CTT. IRT encompasses a
set of models in which a person's item responses depend on both the person's latent trait level
and the properties of each item. Often this latent trait is symbolized by the Greek letter θ ("theta";
Lord & Novick, 1968) but, because it is not necessarily estimated, we will term it F here (after
McDonald, 1999). In SFA, the relationship between the observed variables and the latent factors
is assumed to be linear. In IRT, however, the relationship is specified as nonlinear. Historically,
IRT models have been applied to dichotomous response items (e.g., pass or fail), but extensions
to polytomous response data have been made. This study will primarily focus on modeling
dichotomous responses.
Assumptions underlying IRT models are (1) unidimensionality, (2) local independence,
and (3) functional form (de Ayala, 2008). The unidimensionality assumption is the assumption
that the items only measure one single trait or ability. Related to the assumption of
unidimensionality, the assumption of local independence is the idea that after controlling for all
sources of item covariances (i.e., a person’s latent trait score (F)), the item responses should be
uncorrelated. Under this interpretation, the only reason item responses are correlated is the trait
itself. If we cannot support this assumption, a multidimensional model should be fitted instead.
de Ayala (2008) states the functional form assumption as "the data follow the function specified
by the model" (p. 21). All three are testable assumptions.
The Rasch, or one-parameter (1PL), model is the simplest IRT model in which the only
thing that differs between items is their difficulty (Rasch, 1960). Difficulty can often be looked at
as the location of an item on a latent trait. There are two equivalent ways in which the Rasch
model can be represented. The first equation is the odds ratio; the odds ratio is the probability of
answering correctly on item i, 𝑃(𝑌
!
= 1), to the probability of answering incorrectly on item i,
1−𝑃(𝑌
!
= 1), conditional on their trait level (F) and the difficulty of the item (b) written as
𝐿𝑜𝑔
𝑃(𝑌
!
= 1)
1−𝑃(𝑌
!
= 1)
𝐹
!
= 𝐹
!
−𝑏
!
.
Alternatively, this is represented as the probability of a person responding correctly (Y = 1) to item
i conditional on their trait level (F) and the difficulty of the item (b), written as
\[ P(Y_i = 1 \mid F_p) = \frac{e^{(F_p - b_i)}}{1 + e^{(F_p - b_i)}} . \]
The focus here is on the F − b term, since the form e^x / (1 + e^x) is a logistic link function that
relates x to probabilities of Y (Nunnally & Bernstein, 1994; Embretson & Reise, 2000). The F − b
term represents the distance between the person's trait level (F) and the item difficulty (b);
whether a person answers an item correctly depends on this difference, F − b.
Graphically, the second equation is represented as an item characteristic curve (ICC; see Figure 1).
The ICC is characterized by its S-shaped curve, similar to a graphical depiction of a logistic
regression model. In the case of the Rasch model, the figure depicts the probability of answering
that particular item correctly given a participant's trait/ability level.
Figure 1. Example of an Item Characteristic Curve
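A minimal R sketch of such a curve, assuming an arbitrary difficulty of b = 0, is shown below; it simply plots the second equation over a range of trait values.

# Sketch of a Rasch item characteristic curve (as in Figure 1); b is illustrative.
rasch_icc <- function(theta, b) exp(theta - b) / (1 + exp(theta - b))

theta <- seq(-4, 4, length.out = 200)
plot(theta, rasch_icc(theta, b = 0), type = "l",
     xlab = "Trait level (F)", ylab = "P(correct)",
     main = "Rasch ICC, b = 0")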
The two-parameter (2PL) and three-parameter (3PL) models build upon the one-parameter
model by taking into account item discrimination and guessing, respectively. Discrimination, in
this case, refers to how strongly an item is related to the latent trait. Mathematically, the 2PL is
the 1PL model with the addition of a discrimination parameter, a_i, written as
\[ P(Y_i = 1 \mid F_p) = \frac{e^{a_i (F_p - b_i)}}{1 + e^{a_i (F_p - b_i)}} . \]
Thus, the probability of answering item i correctly is not only a function of the distance between
the person's trait level and the item difficulty, but also of how well the item differentiates among
respondents. The 3PL model builds upon the 2PL model by adding a pseudo-guessing parameter,
c_i, usually included as
\[ P(Y_i = 1 \mid F_p) = c_i + (1 - c_i)\,\frac{e^{a_i (F_p - b_i)}}{1 + e^{a_i (F_p - b_i)}} . \]
The pseudo-guessing parameter, c, is the item’s lower asymptote and represents the lower bound
of probability independent of ability.
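The three response functions can be written as nested special cases of one another, which the small R sketch below illustrates with purely hypothetical parameter values (b, a, and c are not estimates from the AR data).

# Sketch: 1PL, 2PL, and 3PL as nested cases of one response function.
irt_prob <- function(theta, b, a = 1, c = 0) {
  c + (1 - c) * plogis(a * (theta - b))      # plogis(x) = e^x / (1 + e^x)
}

theta <- 0.5
irt_prob(theta, b = 0)                        # 1PL/Rasch: a = 1, c = 0
irt_prob(theta, b = 0, a = 1.8)               # 2PL: item-specific discrimination
irt_prob(theta, b = 0, a = 1.8, c = 0.2)      # 3PL: adds a lower asymptote of .20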
A technical term mentioned in the CTT and IRT literature, also closely related to
difficulty and discrimination, is information. Information can be intuitively thought of as
reliability. In both CTT and IRT, information is inversely related to error variance (the standard
error of measurement). Reliability in this case is measurement precision. The distinction between
information in CTT and IRT is that the former assumes information is constant across
participants whereas in IRT, information differs between participants. Each item has its own
information, referred to as item information. Information of the entire measure is referred to as
test information or total test information and is the sum of all the item information functions.
Test information functions help researchers see deficiencies in a measure and can help in
designing a measure with specific psychometric properties (Hambleton, Swaminathan & Rogers,
1991).
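For the Rasch model specifically, item information reduces to P(1 − P), and test information is the sum of these item curves. The R sketch below illustrates this with hypothetical item difficulties (not the PT estimates).

# Sketch of item and test information for the Rasch model.
rasch_info <- function(theta, b) {
  p <- plogis(theta - b)
  p * (1 - p)                                  # item information at each theta
}

theta <- seq(-4, 4, length.out = 200)
b <- c(-2, -1, 0, 1, 2)                        # hypothetical difficulties
test_info <- rowSums(sapply(b, function(bi) rasch_info(theta, bi)))

plot(theta, test_info, type = "l",
     xlab = "Trait level (F)", ylab = "Test information")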
While SFA and IRT may appear to be different, perhaps since they evolved from different
traditions, they represent the same core model. Both are measurement models that try to
quantify unmeasured psychological phenomena. Mathematically, IRT models are essentially
factor models on a transformed dependent variable. In this case, we are no longer predicting Y
but a transformation of Y that accounts for it being binary. Previous researchers have shown that
SFA parameters can easily be converted to equivalent IRT parameters (McDonald, 1999;
Muthén, Kao, & Burstein, 1991; Muthén & Asparouhov, 2002). Difficulty parameters in the IRT
literature correspond to intercepts (thresholds) in SFA, and discrimination parameters in IRT
correspond to slopes, or factor loadings, in SFA. Other researchers have also reported on this link
between SFA and IRT models (Reise, Widaman, & Pugh, 1993; Muthén et al., 1991; Moustaki,
Jöreskog, & Mavridis, 2004).
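A hedged sketch of this conversion is shown below. The formulas are the ones commonly cited for a probit item factor model with a standardized factor and standardized item residuals (the delta parameterization); the exact expressions differ by parameterization and link function, so this is illustrative rather than a reproduction of the study's computations, and the loading and threshold values are hypothetical.

# Hedged sketch of the SFA-to-IRT conversion under the stated assumptions:
#   discrimination a_i = lambda_i / sqrt(1 - lambda_i^2)
#   difficulty     b_i = tau_i / lambda_i
sfa_to_irt <- function(lambda, tau) {
  data.frame(discrimination = lambda / sqrt(1 - lambda^2),
             difficulty     = tau / lambda)
}

# Hypothetical standardized loadings and thresholds for three items:
sfa_to_irt(lambda = c(.45, .60, .70), tau = c(-0.8, 0.1, 0.9))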
Measurement Invariance and Multiple-Group Structural Factor Analysis
The concept of measurement invariance concerns the extent to which the psychometric
properties of the observed indicators are generalizable across groups or time, as in longitudinal
data. For example, we may measure the same construct with the same test over different
occasions or across different groups. If invariance holds, then the observed scores should depend
only on the latent factor and not on the group or occasion. Thus, differences that do arise
represent true variability in the latent construct and are not due to measurement issues (Millsap,
2007). More elegantly stated, Horn and McArdle (1992) referred to measurement invariance as
“whether or not, under different conditions of observing and studying phenomena, measurement
operations yield measures of the same attribute.” Measurement invariance techniques have
applications in different settings, e.g. invariance across age, sex, cultures, languages, etc. Lastly,
measurement invariance is often assumed, but rarely examined, when comparing groups. This
has an important implication: if a measure is not invariant, then group differences could be
confounded by the lack of invariance.
A multiple-group SFA is a method used to compare groups that allows one to evaluate
measurement invariance. A series of progressive steps are involved with testing invariance across
groups. The first step is often to test for configural invariance. The second step tests for metric
invariance, sometimes referred to as weak factorial invariance. The third step tests for scalar
invariance, or strong invariance. The fourth step tests for the equality of indicator residuals, or
strict factorial invariance (Horn & McArdle, 1992; Meredith, 1993).
A measure is said to be configural invariant if the factor structure or pattern holds across
groups. Technically, in the configural model, we have the same number of factors with the same
items for each factor; however, the parameters are allowed to be estimated freely in each group.
The metric invariance model tests for equal factor loadings across groups. Tests for scalar
invariance constrain the intercepts across groups to be equal and similarly, tests for equality of
indicator residuals constrain the item residual variances to be equal across groups. Generally,
these are nested models.
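A sketch of this sequence in R with the lavaan package is shown below; the study itself fit these models in Mplus (see Technical Appendices C-1 to C-3), and the data object, grouping variable, and item names here are assumed for illustration. Note that the sketch follows the generic configural-metric-scalar sequence rather than the Rasch-constrained baseline used in this study.

# Illustrative multiple-group sketch in lavaan (assumed 'dat' with items
# ar1-ar15, item 12 omitted as in the final analyses, and a 'grade' variable).
library(lavaan)

model <- 'AR =~ ar1 + ar2 + ar3 + ar4 + ar5 + ar6 + ar7 + ar8 +
                ar9 + ar10 + ar11 + ar13 + ar14 + ar15'
items <- paste0("ar", c(1:11, 13:15))

configural <- cfa(model, data = dat, group = "grade",
                  ordered = items, std.lv = TRUE)
metric     <- cfa(model, data = dat, group = "grade", ordered = items,
                  std.lv = TRUE, group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "grade", ordered = items,
                  std.lv = TRUE, group.equal = c("loadings", "thresholds"))

# Nested comparisons, analogous to the chi-square difference tests reported later.
lavTestLRT(configural, metric, scalar)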
The Purpose of this Study
The purpose of the study is
(1) to examine the psychometric properties of the PT AR measure using both SFA and
IRT methods, and
(2) to determine how well the 15-item PT AR measure shows properties of measurement
invariance over age (here, grade in school) and sex (as reported). The PT reporting
was done at only one occasion, 1960, so this is a cross-sectional analysis.
Research Questions
(1) How does the Rasch model compare to alternative models, e.g., the 2PL model or a
model incorporating guessing, for the AR items?
(2) Do the AR items fit the restrictive expectations of the Rasch model?
(3) Does the Rasch model exhibit characteristics of measurement invariance across four
grade levels? Across both sexes?
Method
Sample Description
Participants come from a nationwide study called Project TALENT (PT). This study was
conducted in 1960 by a group of collaborative investigators (Flanagan et al., 1962). Around
440,000 students from over 1,000 high schools across the United States make up the entire
sample. Schools and students were chosen in a systematic way that satisfied random sampling
and so that each type and size of school was fairly represented among the schools in the US. A
battery of tests and surveys was administered to students and counselors over a two-day testing
period to assess students' aptitudes, personalities, and vocational interests, as well as school
characteristics.
For the purposes of this study, only four percent (4%) of the participants could be used in
the analyses because individual item-level responses for the measures were only given for four
percent of the sample. This was done because the original investigators had to code scores for all
participants and wanted a base sample with which to check the items using only CTT. The
participants in the four percent were randomly selected from the full sample. Although it is a
smaller sample, this is by far the largest study done at the item level. The participants used in the
present study consist of a total N = 13,277 students and we provide a breakdown in Table 1 by
grade in school and sex.
Table 1. Descriptive Statistics for Sample

                      Male                Female
                      n        Col %      n        Col %      Total
Overall (Row %)       6653     50.1       6624     49.9       13277
Grade
  9th                 1737     26.1       1768     26.7       3505
  10th                1697     25.5       1729     26.1       3405
  11th                1574     23.7       1575     23.8       3149
  12th                1645     24.7       1552     23.4       3197
Measures
Abstract Reasoning. The AR measure is one of several scales used in the PT and was
chosen because it is one of the scales being used in studies presently being conducted.
Although it looks very much like the non-verbal matrices tasks in Cattell's Culture Fair Test
and Raven's Progressive Matrices, the items are distinctly different. The AR measure
consists of several types of items used to measure abstract reasoning: figure analogy, figure
sequence, pattern matrix, and figure grouping items. In the PT testing material, diagrams and
pictures are used instead of verbal material as a way to isolate non-verbal reasoning ability. The
AR measure is often used as a proxy for “reasoning” (Flanagan, 1998) or “fluid intelligence”
(Horn & Cattell, 1967) and this is thought to be the essence of intelligence by many researchers
in this area (Gustafsson, 1984, 1988; Carroll, 1993). All 15 items had five response options with
only one correct response each. Items were then given a “1” for a correct response and a “0” for
an incorrect response. A complete listing of the AR items used in PT appears in Appendix A.
Item Statistics. Table 2 presents traditional item analysis statistics including the
proportion passing for the 4% sample. Item difficulties are the proportion of participants who
passed a particular item. Items 1 and 2 were the two easiest items with proportions correct of
89.7% and 78.7%, respectively, while items 8 and 12 were the most difficult items with
proportions correct of 31.1% and 27.0%, respectively. Item 12 for abstract reasoning had a low
item-total correlation of .10 compared to other item-total correlations. This would suggest that
item 12 is inconsistent with the test as a whole. Overall, the item-total correlations for abstract
reasoning ranged from .10 to .42. The AR scale has an estimated reliability (Cronbach’s alpha) of
.71.
Table 2
Classical Test Theory Statistics

Item    n        n Missing    % Missing    % Correct
1       12678    599          5%           89.73
2       12656    621          5%           78.75
3       12663    614          5%           68.26
4       12659    618          5%           75.73
5       12647    630          5%           52.00
6       12628    649          5%           62.82
7       12660    617          5%           74.49
8       12650    627          5%           31.18
9       12652    625          5%           73.44
10      12600    677          5%           54.90
11      12600    677          5%           61.74
12      12455    822          7%           27.02
13      12465    812          7%           47.41
14      12486    791          6%           44.72
15      12266    1011         8%           35.81
Procedure
In selecting the students and schools, the original investigators used a stratified random
sampling design to obtain a sample large enough to satisfy their research needs. Secondary
schools, including public, parochial, and private, were selected and invited to participate to
obtain an accurate geographical representation of the US at the time. A total of 987 high schools
participated with a 93% response rate. The entire battery was conducted in either a two-day
testing period or four half-day testing periods. The AR measure would have been administered
on the second day or the fourth half-day. Students were allotted 11 minutes to complete the AR
measure; this time did not include the time spent on instructions. Complete details of how
students and schools were chosen along with details regarding construction and administration
of the battery can be found in other sources (Flanagan et al., 1962; Wise et al., 1979; Prescott et
al., 2012).
Analysis Plan
Several models will be looked at to answer the following questions:
(1) How does the 1PL/Rasch model compare to alternative models, e.g., the 2PL model or
a model incorporating guessing, for the AR items?
(2) Do the AR items fit the restrictive expectations of the 1PL/Rasch model?
(3) Does the Rasch model exhibit characteristics of measurement invariance across four
grade levels? Across both sexes?
Research Question 1. We will assess how well the data fit the restrictive expectations of
the Rasch model by analyzing the models in both Mplus v. 6.12 (Muthén & Muthén, 1998-2010)
and the ltm package v. 0.9-9 in R (Rizopoulos, 2006). Results should be the same in both
programs.
The three models to be analyzed in Mplus are the baseline, or null model, the 1PL model,
and the 2PL model. To do so, an item factor analysis technique (i.e., McDonald, 1999) will be used
with weighted least squares means- and variance-adjusted (WLSMV) estimation so that
SFA model fit statistics describe the fit of the item factor models to the tetrachoric correlation
matrix among the items. Models are identified by setting the factor mean to 0 and the factor
variance to 1, such that intercepts and factor loadings are estimated. The baseline model is
specified in the model statement by constraining the factor loadings for each item to be 0, while
item thresholds are estimated. In the 1PL model, the factor loadings are constrained to be equal
to each other and in the 2PL model, the factor loadings for the items are freely estimated (see
Technical Appendix A-1 and A-2 for Mplus code).
To take guessing into consideration, a third alternative model is proposed. It is similar to
the 3PL, except that the discriminations are constrained to the estimate given in the 1PL model
and a .20 guessing parameter is specified for each item. Given that the original AR measure had five
response options with only one correct answer, a participant would have a 20% chance of getting
an item correct purely by guessing. This model can be conceptualized as the Rasch/1PL model
with guessing. The parameters for this model will be estimated using the ltm package since Mplus
v. 6.12 does not currently support the 3PL model where a guessing parameter could be specified
(see Technical Appendix B for R ltm code).
Comparisons between models will be made using the Akaike Information Criterion (AIC;
Akaike, 1987) and the chi-square (χ²) statistic along with the degrees of freedom (df). The AIC is a
parsimony-adjusted index that takes into account the number of model parameters. In the ltm
package, the AIC is calculated using the log-likelihood and the number of parameters estimated
in the model. The model chi-square (χ²) is an absolute goodness-of-fit statistic, which tests the
null hypothesis that the predicted variance-covariance matrix is equal to the observed variance-
covariance matrix. We follow the work of previous researchers to inform us how to interpret
these fit indices (Hooper, Coughlan, & Mullen, 2008; Hayduk, Cummings, Boadu, Pazderka-
Robinson, & Boulianne, 2007).
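For reference, the AIC in that computation is the standard parsimony-adjusted quantity
\[ \mathrm{AIC} = -2\ln L + 2k, \]
where L is the maximized likelihood and k is the number of estimated parameters; smaller values indicate better parsimony-adjusted fit.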
Research Question 2. To assess how well the data fit the Rasch model, Winsteps will be
used. The item difficulties and their corresponding standard errors will be estimated using joint
maximum likelihood estimation (JMLE) in Winsteps v. 3.74.0 (Linacre, 2012). Two fit statistics—
infit and outfit—are also provided alongside the item difficulty estimates. Similar to global model
fit indices given in SFA, infit and outfit assess how well the data fit the Rasch model. We
hypothesize that all items will fit the Rasch model.
Infit and outfit mean square statistics identify how well the data fit the Rasch model. The
infit and outfit mean squares are based on squared standardized residuals between the observed
values and the values predicted by the model (similar to a χ² statistic). However, the infit is an
information-weighted statistic, whereas the outfit is unweighted. The expected value for both
infit and outfit mean squares is 1.0 (de Ayala, 2008). Values greater than 1.0 indicate that the data
underfit the model (i.e., noise and unpredictability); values less than 1.0 indicate that the data
overfit the model (Bond & Fox, 2007). Several researchers have suggested using acceptable cutoff
values of 0.7-1.3 for "run of the mill" multiple-choice tests, which serves the purposes of the AR
measure (Bond & Fox, 2007; Adams & Khoo, 1993).
Research Question 3. We will examine the Rasch model measurement invariance of
abstract reasoning across grades and across sex as a way to investigate bias in the measure. A
multiple-group SFA approach will be used to examine measurement invariance of the Rasch
model across grade levels and sex (see Technical Appendix C-1, C-2, and C-3 for Mplus code).
The following describes the steps in examining invariance of the Rasch model in detail:
(1) Rasch model tested separately in each group;
(2) baseline model: equal factor loadings and thresholds estimated differently across
groups;
(3) metric invariance: equal factor loadings within and across groups, thresholds
estimated differently across groups;
(4) scalar invariance: equal factor loadings and thresholds across groups.
The chi-square (χ²) statistic, the ratio of the chi-square to its degrees of freedom (χ²/df), and the
root mean square error of approximation (RMSEA) are provided. The ratio of the chi-square value
to its degrees of freedom (χ²/df) is an index that can be plotted, where the "penalty line" is based
on a no-intercept regression line (McArdle, 1988; McArdle & Nesselroade, 1994). The penalty
line is a regression line running through the origin (χ² = 0 and df = 0); models below the line are
considered good fitting and models above the line are poor fitting.
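A small R sketch of such a penalty plot is given below, using the grade-level fit statistics later reported in Table 5 purely for illustration; one reading of the penalty line is a no-intercept regression of the models' χ² values on their df.

# Hedged sketch of the penalty-function plot (values from Table 5, for illustration).
fits <- data.frame(model = c("Baseline", "Metric", "Scalar"),
                   chisq = c(3866, 3600, 4479),
                   df    = c(416, 419, 464))

penalty <- lm(chisq ~ 0 + df, data = fits)     # regression through the origin

plot(fits$df, fits$chisq, pch = 19, xlim = c(0, 500), ylim = c(0, 5000),
     xlab = "Degrees of freedom", ylab = "Chi-square")
abline(a = 0, b = coef(penalty))               # models below the line fit well
text(fits$df, fits$chisq, labels = fits$model, pos = 1)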
Results
The One- versus Two- versus Three-Parameter Model
To assess how well the AR data fit the restrictive expectations of the Rasch model, several
models were compared. The baseline model was initially fitted to the data. Next, the 1PL/Rasch
model was estimated and compared to the proposed alternatives: a 2PL model and a 1PL
model with a fixed guessing parameter. Models 0 through 2 were identified and estimated in
Mplus (Muthén & Muthén, 1998-2010) using WLSMV estimation and also in the ltm package
using ML estimation. Model 3 was not possible to fit in Mplus, so it was estimated only with the
ltm package in R; thus
only its AIC is provided in Table 3. Given that the third model was non-nested, AIC was also
used to compare Model 3 with the previous two models (see Table 3).
Table 3
Model Fit Statistics for the 1PL, 2PL, and Rasch with Fixed Guessing Models

Model                      χ²       df     RMSEA (90% CI)      AIC
0. Baseline                28395    105                        229337
1. 1PL/Rasch               4888     104    .060 (.059-.062)    216508
2. 2PL                     695      90     .023 (.021-.025)    214488
3. 1PL/Rasch + Guessing    -        -      -                   216332
The results in Table 3 indicate that all alternatives fit better than the baseline. Many
indices (e.g., the AIC and, although not shown, the BIC) favor the 2PL model. However, while the
2PL model provided the best fit, it does not necessarily represent a meaningful improvement in fit
over the other models (aside from the baseline).
The test information curves for the three models are presented in Figure 2. As previously
stated, information plays the role of reliability in the IRT context; the higher the information, the
more accurate scores will be. The plot shows that the 2PL model has the highest information, but
over a rather restricted ability range, being most accurate when ability is around -1. The
1PL/Rasch + Guessing model peaks around zero, but its peak is lower than the 1PL and 2PL
curves. The 1PL model is most informative when ability is around -1 to 0. All three are less reliable at
higher ability levels.
Figure 2. Test Information Curve Plots for 1PL, 2PL, and 3PL Models
Rasch Model Analysis
Item fit statistics reported in Table 4 include infit and outfit mean square statistics (for
which values around 1.00 are desirable for good fit; de Ayala, 2008). If an item does not fall within
an acceptable infit and outfit mean square cutoff range, then the item can be viewed as misfitting
(Bond & Fox, 2007).
The infit and outfit mean square statistics indicate that the majority of the items fit within
reasonable bounds. The average infit mean square is .99, indicating that the items are behaving as
expected; the infit values for all items ranged from .89 to 1.17. Similarly, the average outfit mean
square is 1.02, although the range of the outfit mean squares was wider, from .78 to 1.56 (see
Table 4). Table 4 also presents each item's difficulty estimate along with its respective standard
error; the discrimination parameter, or factor loading, is 1 for all items.
Table 4
Item Difficulty Estimates, Infit and Outfit Mean Squares

        Difficulty               Infit           Outfit
Item    Estimate     SE          Mean Square     Mean Square
1       -2.22        0.03        0.94            0.88
2       -1.16        0.02        0.89            0.78
3       -0.46        0.02        0.98            0.95
4       -0.94        0.02        0.90            0.84
5        0.42        0.02        0.93            0.91
6       -0.15        0.02        0.91            0.87
7       -0.86        0.02        1.02            1.00
8        1.53        0.02        1.02            1.24
9       -0.79        0.02        0.93            0.87
10       0.27        0.02        1.02            1.01
11      -0.10        0.02        0.97            0.95
12       1.77        0.02        1.17            1.56
13       0.65        0.02        1.00            1.01
14       0.79        0.02        1.13            1.25
15       1.25        0.02        1.09            1.20
Mean                             0.99            1.02
SD                               0.08            0.20
Evaluating Misfitting Items. An item would be seen as fitting the Rasch model if its infit
and outfit statistics fall within the range of 0.7-1.3 (Adams & Khoo, 1993; Bond & Fox, 2007).
Figure 3 shows a scatterplot of both the infit and outfit mean-squares values plotted against the
item number on the x-axis.
Figure 3. Infit and Outfit Mean-Square Scatterplots
Item 12 seems to display the most misfit when compared to the other items in the model.
For further confirmation, two models were examined in Mplus to compare a Rasch model
including item 12 and one without item 12. To examine the Rasch model, the factor loadings
were constrained to be equal (i.e., equal item discriminations). For the model without item 12,
the factor loading for item 12 was fixed to 0. The one-factor model excluding item 12
(χ²(104) = 3636; RMSEA (90% CI) = .052 (.050-.053)) provided better fit than the model including
item 12 (χ²(104) = 4888; RMSEA (90% CI) = .060 (.059-.062)). Given the results from the Rasch
modeling techniques and the results from Mplus, the final model used for the rest of the analyses
excludes item 12.
Furthermore, a correlation was performed to assess the strength of the relationship
between an AR sum score including item 12 and an AR sum score without item 12. The
correlation between the score with item 12 and the score without item 12 was statistically
significant, r(11780) = .99, p < .001. Finally, a regression analysis was performed to predict the
number of items missing per participant from their reasoning ability factor scores (saved from
the Rasch model analysis). The correlation between the number of items missing and reasoning
ability factor scores was not statistically significant, r(13275) = .00, p = .98. The overall regression
was not significant, F(1, 13275) = .00, p = .98.
Testing Measurement Invariance of the Rasch Model
Testing Measurement Invariance Across Grade Level. Table 5 summarizes the model fit
statistics for each of the grade levels and the fit statistics for each invariance model. The metric
invariance model (covariances only) was established by constraining the equal factor loadings to
be invariant across the grade levels while allowing the thresholds to differ between the grades.
The model fit improved from the baseline model even though the chi-square difference test
indicated worse fit when comparing the baseline model to the metric invariance model
(Δχ²(3) = 12, p < .001). The third model, which tests scalar invariance, constrains the equal factor
loadings and the thresholds to be invariant across the grade levels. The model fit decreases from
the previous model. In addition, the chi-square difference test indicated worse fit when
comparing the scalar invariance model to the metric invariance model (Δχ²(45) = 932, p < .001).
Table 5
Grade Level Measurement Invariance Model Fit Statistics

Model           χ²      df     RMSEA (90% CI)       Δχ²     Δdf
9th             1115    104    .054 (.051-.056)
10th            1024    104    .052 (.049-.055)
11th            866     104    .049 (.046-.052)
12th            856     104    .049 (.046-.052)
1. Baseline     3866    416    .051 (.050-.053)
2. Metric       3600    419    .049 (.047-.050)     12      3
3. Scalar       4479    464    .051 (.051-.054)     932     45
The models were also examined using a penalty function plot of the χ² statistic as a
function of each model's respective df (see Figure 4). The scalar invariance model appeared to be
a poor fit to the data. However, the metric invariance model, in which the factor loadings
were constrained to be equal across groups but the thresholds differed across groups, stood out as
the "best" fit to the data.
Figure 4. Penalty Function Plot for Grade Level Rasch Invariance Models
Testing Measurement Invariance Across Sex. Table 6 summarizes the model fit statistics
for men and women separately and for each of the invariance models tested. As in the previous
section, fit is assessed using the χ², the respective df, and the RMSEA.
Results for the measurement invariance models for sex were similar to the models tested
for grade level. The model fit improved from the baseline model to the metric invariance model
(Δχ²(1) = 2, p = .16). The chi-square difference test comparing the metric invariance model to the
scalar invariance model did not indicate that constraining the factor loadings and thresholds to be
equal across sex improved fit (Δχ²(15) = 227, p < .001).
Table 6
Sex Measurement Invariance Model Fit Statistics

Model           χ²      df     RMSEA (90% CI)       Δχ²     Δdf
Men             1703    104    .049 (.047-.051)
Women           2006    104    .054 (.052-.056)
1. Baseline     3710    208    .051 (.050-.053)
2. Metric       3485    209    .050 (.048-.051)     2       1
3. Scalar       4855    224    .057 (.056-.058)     227     15
Similar to the previous section examining the grade level Rasch invariance, the models
were compared using a plot (Figure 5) of each model's χ² statistic and the respective df
(McArdle, 1988). Only the scalar invariance model was above the penalty line, indicating poor fit.
Both the baseline and the metric invariance model were below the penalty line indicating good
fit. However, the metric invariance model did provide the best fit since it was furthest below the
penalty line.
Figure 5. Penalty Function Plot for Sex Rasch Invariance Models
Finally, since the Rasch model was metric invariant across both grade level and sex, a
two-way ANOVA was performed to assess whether participants' total AR scores could be
predicted from grade level, sex, and the interaction between the two. Results are presented in
Table 7. It was hypothesized that the interaction would not be significant at the .05 alpha level. As
predicted, the interaction of grade level and sex was not significant: F(3, 11853) = .74, p = .53,
η² = .00. However, the main effects were statistically significant: for grade level, F(3, 11853) =
101.49, p < .001, η² = .03, and for sex, F(1, 11853) = 29.83, p < .001, η² = .00. Only 2.7% of the
variance in AR scores is explained by the independent variables, with grade level explaining 2.5%
of the variance in AR scores.
Table 7. Two-way Analysis of Variance for Abstract Reasoning Scores

Source               SS        df       MS      F         p
Sex                  255       1        255     29.83     < .001
Grade Level          2599      3        866     101.49    < .001
Sex × Grade Level    19        3        6.3     0.74      .53
Within (Error)       101164    11853    8.5
Total                104037
This lack of a strong main effect for sex means that, although it seems that men had higher AR
total scores across all grade levels and AR total scores increased across grade levels, the
variation within cells was also large (see Table 7).
Discussion
The concept of measurement is often taken for granted in psychology. Many studies are
conducted using measures that may not necessarily be psychometrically sound. If we are to use
these measures in future research studies and future statistical analyses, we should be concerned
about their properties.
The present study evaluated the psychometric properties of the AR measure as it is in
Project TALENT using SFA and IRT techniques. First, we compared the Rasch model with the
2PL and 3PL models. Often, choosing between the three models reflects the researcher’s own
philosophy on whether the model should fit the data or vice versa. Given that PT already had
sum scores for the AR measure for nearly 440,000 participants, it seemed appropriate to fit the
data to the Rasch model and, hopefully, to identify items that do not fit well with the measure.
The goal was to evaluate the measure as it was presented in 1960, not to develop a new scale. We
thought that knowing exactly how far a contemporary, optimal IRT-derived scale would be from
the original total score now in use by most others could be informative in later work.
In evaluating the misfitting items, item 12 displayed the most misfit with an outfit mean-
square value that indicated un-modeled noise. From the simple analysis, item 12 was the most
difficult item in the entire measure. Upon closer inspection, the misfit of this item might be due
to person misfit, i.e., a participant may have answered item 12 correctly while incorrectly
answering the other 14 items perhaps due to lucky random guessing. For this particular item, we
say this misfit and unpredictability is attributable to erratic responses. Whether a participant
answered correctly cannot be predicted accurately by what is known about those participants’
performance on the AR measure overall. Overall, item 12 had the lowest percent of participants
answering correctly (27%) from the initial item statistics and we can infer that it was not due to a
large number of missing responses for item 12.
A suggestion we could make from these results is that, if we were to further develop this AR
scale, it may be best to add more difficult items to cover the full spectrum of ability, since the
existing items were probably too easy for this population. As shown in the test information curve
plots, information in the AR measure was lacking at the higher trait levels. However, removing this item certainly
improved model fit when examined in Mplus. There was a strong relation between the original
total AR score (including item 12) and a score calculated without item 12 (r = .99). This suggests
that using the original total AR scores would not make a distinctive difference in future analyses
compared with using a total score that excludes item 12. Furthermore, inspection of the actual
item (see Appendix A) did not raise any flags in terms of questioning whether it measured another
construct; it simply did not fit in with the rest of the items, which tended to be easier. The finding
that the number of missing items and participants' reasoning factor scores were uncorrelated also
confirmed that missingness on an item was not due to ability.
After removing item 12, we were interested in how well the Rasch model exhibited
properties of measurement invariance across grade level and sex since measurement invariance is
often assumed. Based on the tests of measurement invariance, metric invariance held, but scalar
invariance did not. To recap, the scalar invariance model had equal factor loadings within a
group and across groups and also constrained the thresholds to be equal across groups. The
significant decrease in fit from the metric invariance models in both the grade level and sex
analyses suggested different thresholds should be estimated for each of the groups.
Finally, in assessing whether participants' AR scores could be predicted from sex, grade
level, and their interaction, we reported that the interaction between sex and grade level was not
significant at the .05 alpha level, but that the main effects were significant. However, the main
effects were fairly weak, explaining only 2.7% of the variance in AR scores. Interestingly, the
scores did increase as age (grade level) increased; the sample sizes were also smaller as we go from
grade 9 to grade 10 and so on. While the data we have are not longitudinal, we can speculate about
why this might be. One hypothesis is that compulsory education laws in 1960 required students to
stay in school only until a certain age, after which students could choose to leave school. Thus, the
students who did stay in school did so because they did well in school and may represent a unique
sample with higher ability.
Limitations
Several limitations concerning the data and sample in this study should be noted. The AR
measure was administered in paper format, in which students were given instructions on how to
complete the AR section and then proceeded to the actual items. It is assumed that participants
followed the order in which the items were presented in the test booklet. However, we can also
assume that participants could skip questions and return to them within the time limit given for
that section.
Measurement invariance of the Rasch model was only examined with grade level and sex.
However, the initial Project TALENT study was conducted in 1960, around the time before the
Civil Rights Act of 1964. Schools at this time were clearly segregated (Prescott et al., 2012).
Unfortunately, ethnicity and race were not asked about in the original 1960 sample from which
the data come. However, weights could be used, since school-level information on whether or not
a school was segregated was asked of principals on the school characteristics survey; this could be
a topic of interest for a future study.
Future Directions and Conclusion
The IRT and multiple-group SFA focused on only one measure. Clearly, it would be wise
to examine the other 29 scales in the battery used for Project TALENT since sum scores were also
used for later analyses. The same approach and techniques can be extended to examine their
psychometric properties. We should ask, "How well do the data fit the Rasch model?"
Alternatively, future studies could extend the current analyses by looking into transforming the
AR measure into a fully computer adaptive test (CAT). However, the presentation of the items
should be taken into consideration, since it may affect a participant's probability of answering an
item correctly (Bowles & Salthouse, 2003). Despite these concerns, transforming
the AR measure into a CAT would be beneficial given the cost and time in administering a
battery this large.
Latent measurement models such as SFA and IRT have certainly impacted the field of
psychometrics and our understanding of psychological assessment. In the past, scale development
may have been concerned only with internal consistency reliability. Conducting an IRT analysis
to evaluate a measure has advantages over CTT since its parameters are independent of the
sample.
References
Adams, R. J., & Khoo, S. K. (1993). QUEST: The Interactive Test Analysis System. Hawthorn,
Victoria: Australian Council for Educational Research.
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52(3), 317–332.
Backman, M. E. (1972). Patterns of Mental Abilities: Ethnic, Socioeconomic, and Sex Differences.
Educational Research, 9(1), 1–12.
Bentler, P. M. (1990). Comparative Fit Indexes in Structural Models. Psychological Bulletin,
107(2), 238–246.
Bond, T. G., & Fox, C. M. (2007). Applying The Rasch Model: Fundamental Measurement in the
Human Sciences. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Bowles, R. P., & Salthouse, T. A. (2003). Assessing the age-related effects of proactive interference
on working memory tasks using the Rasch model. Psychology and Aging, 18, 608–615.
Bredemeier, H. C. (1967). The Differential Effectiveness of High Schools with Selected
Characteristics in Producing Cognitive Growth in Different Kinds of Students. New
Brunswick, N.J.
Brown, T. A. (2006). Confirmatory Factor Analysis for Applied Research. (D. A. Kenny, Ed.). New
York, NY: Guilford Press.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J.
S. Long (Eds.), Testing Structural Equation Models (pp. 136–162). Beverly Hills, CA: Sage.
Carroll, J.B. (1993). Human cognitive abilities: A survey of factor-analytical studies. New York,
NY: Cambridge University Press.
Cattell, R. B. (1949). Test of "g": Culture fair. Savoy, IL: Institute for Personality and Ability
Testing.
Clemans, W. V. (1997). John Clemans Flanagan (1906-1996). American Psychologist, 1375–1376.
Cooley, W. W., & Lohnes, P. R. (1971). Multivariate data analysis. New York: Academic Press.
Cressie, N., & Holland, P. W. (1983). Characterizing the manifest probabilities of latent trait
models. Psychometrika, 48(1), 129–141.
Dailey, J. T., & Shaycoft, M. F. (1961). Types of Tests in Project Talent: Standardized Aptitude and
Achievement Tests. Washington, D.C.
de Ayala, R. J. (2009). The Theory and Practice of Item Response Theory. New York, NY: Guilford
Press.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ:
Erlbaum Publishers.
Flanagan, J. C. (1976). Changes in School Levels of Achievement: Project TALENT Ten and
Fifteen Retests. Educational Research, 5(8), 9–12.
Flanagan, J. C. (1979). Findings from Project TALENT. The Educational Forum, 43(4), 489–490.
Flanagan, J. C., Dailey, J. T., Shaycoft, M. F., Gorham, W. A., Orr, D. B., & Goldberg, I. (1962).
Design for a Study of American Youth. Boston, MA: Houghton Mifflin Company.
Gustafsson, J.-E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8,
179-203.
Gustafsson, J.-E. (1988). Hierarchical models of individual differences in cognitive abilities. In R.
J. Sternberg (Ed.)., Advances in the psychology of human intelligence (Vol. 4). Hillsdale,
NJ: Erlbaum.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response
Theory. Newbury Park, CA: Sage.
Harrington, D. (2009). Confirmatory factor analysis. New York, NY: Oxford University Press.
Hayduk, L., Cummings, G. G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007).
Testing! Testing! One, two three – Testing the theory in structural equation models!
Personality and Individual Differences, 42, 841-50.
Hooper, D., Coughlan, J., & Mullen, M. R. (2008). Structural equation modeling: Guidelines for
determining model fit. Journal of Business Research Methods, 6, 53-60.
Horn, J. L., & Cattell, R. B. (1967). Age differences in fluid and crystallized intelligence. Acta
Psychologica, 26, 107–129.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance
in aging research. Experimental aging research, 18(3-4), 117–44.
doi:10.1080/03610739208253916
Humphreys, L. G., Parsons, C. K., & Park, R. K. (1979). Dimensions Involved in Differences
among School Means of Cognitive Measures, 16(2), 63–76.
Linacre, J. M. (2012). Winsteps ® (Version 3.74.0). Beaverton, Oregon: Winsteps.com.
McArdle, J. J. (1988). Dynamic but structural equation modeling of repeated measures data. In J.
R. Nesselroade & R. B. Cattell (Eds.). The handbook of multivariate experimental
psychology (Vol. 2, pp. 561–614). New York: Plenum.
McArdle, J. J. (1990). Principles versus Principals of Structural Factor Analyses. Multivariate
Behavioral Research, 25(1), 81–87.
McArdle, J. J. (1996). Current Directions in Structural Factor Analysis. Current Directions in
Psychological Science, 5(1), 11–18.
McArdle, J. J. (2010, November). Measurement of cognition in Project TALENT research. Paper
presented at the Gerontological Society of American Annual Meeting, New Orleans, LA.
McArdle, J. J., & Nesselroade, J. R. (1994). Structuring data to study development and change. In
S. H. Cohen & H. W. Reese (Eds.), Life-Span Developmental Psychology: Methodological
Innovations (pp. 223–267). Hillsdale, NJ: Erlbaum.
McDonald, R. P. (1999). Test Theory: A Unified Treatment. Mahwah, New Jersey: Lawrence
Erlbaum Associates, Inc.
McGrew, K. S. (1997). Analysis of the major intelligence batteries according to a proposed
comprehensive Gf-Gc framework. Contemporary intellectual assessment: Theories, tests,
and issues (pp. 151–179). New York: Guilford.
Meredith, W. (1993). Measurement Invariance, Factor Analysis and Factorial Invariance.
Psychometrika, 58(4), 525–543.
Millsap, R. E. (2007). Invariance in Measurement and Prediction Revisited. Psychometrika, 72(4),
461–473. doi:10.1007/S11336-007-9039-7
Moustaki, I., Jöreskog, K. G., & Mavridis, D. (2004). Factor models for ordinal variables with
covariate effects on the manifest and latent variables: A comparison of LISREL and IRT
approaches. Structural Equation Modeling, 11, 487–513.
Muthén, B. O., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes:
Multiple-group and growth modeling in Mplus.
Muthén, B. O., Kao, C., & Burstein, L. (1991). Instructional sensitivity in mathematics
achievement test items: Applications of a new IRT-based detection technique. Journal of
Educational Measurement, 28, 1–22.
Muthén, L. K., & Muthén, B. O. (1998-2010). Mplus User’s Guide. Sixth Edition. Los Angeles, CA:
Muthén & Muthén.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric Theory (3rd ed.). New York, NY:
McGraw-Hill.
Prescott, C. A., Achorn, D. L., Kaiser, A., Mitchell, L., McArdle, J. J., & Lapham, S. J. (2013). The
Project TALENT Twin and Sibling Study. Twin Research and Human Genetics, 16(1), 437–448.
doi:10.1017/thg.2012.71
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen:
Danish Institute for Educational Research.
Raven, J., Raven, J. C., & Court, J. H. (2003). Manual for Raven’s Progressive Matrices and
Vocabulary Scales. San Antonio, TX: Harcourt Assessment.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item
response theory: Two approaches for exploring measurement invariance. Psychological
Bulletin, 114, 552–566.
Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory
analyses. Journal of Statistical Software, 17(5), 1–25.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Steiger, J. H., & Lind, J. C. (n.d.). Statistically based tests for the number of common factors.
Sternberg, R. J. (1994). Factor Analysis. In R. J. Sternberg (Ed.), Encyclopedia of Human
Intelligence (pp. 422–430). New York, NY: Macmillan Publishing Company.
Thompson, B. (2004). Exploratory and Confirmatory Factor Analysis: Understanding Concepts
and Applications. Washington, D.C.: American Psychological Association.
Winkler, D. L., & Jolly, J. L. (2011). Project TALENT. Gifted Child Today, 34(2), 34–36.
Wise, L. L., McLaughlin, D. H., & Steel, L. (1979). The Project TALENT Data Bank Handbook.
Palo Alto, California: American Institutes for Research.
Appendix A
Technical Appendix A-1
TITLE: 1PL/Rasch Model
DATA: FILE = pt60clean.dat;
VARIABLE: NAMES = studid grade usregion male voc1-voc21 ar1-ar15
spel1-spel16 cap1-cap34 pun1-pun27 ee1-ee12;
USEVAR = ar1-ar15;
CATEGORICAL = ar1-ar15;
MISSING = .;
IDVARIABLE = studid;
ANALYSIS: ESTIMATOR = WLSMV;
PARAMETERIZATION = THETA;
DIFFTEST = bigger2.dat;
MODEL:
! Factor loadings all constrained equal in 1PL
AR BY ar1-ar15* (loading);
! Item thresholds all estimated
[ar1$1-ar15$1*];
! Factor mean = 0 and variance = 1 for identification
[AR@0]; AR@1;
OUTPUT: STDYX Residual;
SAVEDATA: DIFFTEST = bigger.dat;
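As a rough sketch of what this input estimates (assuming the standard normal-ogive link used by WLSMV under the THETA parameterization, with residual variances fixed at 1), the item response function is

P(ar_j = 1 | \theta) = \Phi(\lambda \theta - \tau_j), with \theta ~ N(0, 1),

where \Phi is the standard normal cumulative distribution function, \tau_j is the threshold (difficulty) of item j, and the single loading \lambda is shared by all 15 items through the (loading) label; the equality constraint on \lambda is what imposes the Rasch/1PL structure.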
Technical Appendix A-2
TITLE: 2PL Model
DATA: FILE = pt60clean.dat;
VARIABLE: NAMES = studid grade usregion male voc1-voc21 ar1-ar15
spel1-spel16 cap1-cap34 pun1-pun27 ee1-ee12;
USEVAR = ar1-ar15;
CATEGORICAL = ar1-ar15;
MISSING = .;
IDVARIABLE = studid;
ANALYSIS: ESTIMATOR = WLSMV;
PARAMETERIZATION = THETA;
MODEL:
! Factor loadings all estimated in 2PL
AR BY ar1-ar15*;
! Item thresholds all estimated
[ar1$1-ar15$1*];
! Factor mean = 0 and variance = 1 for identification
[AR@0]; AR@1;
OUTPUT: STDYX CINTERVAL Residual;
SAVEDATA: DIFFTEST = bigger2.dat;
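The only substantive change from Technical Appendix A-1 is that the equality label on the loadings is dropped, so each item receives its own discrimination. Sketched in the same probit metric, P(ar_j = 1 | \theta) = \Phi(\lambda_j \theta - \tau_j).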
Technical Appendix B
# loading ltm package
library(ltm)
# 1PL/Rasch
ARltm.rasch <- rasch(data.ar, constraint = cbind(ncol(data.ar) + 1,
1))
coef(ARltm.rasch, prob = T, order = T)
# 2PL
ARltm <- ltm(data.ar ~ z1)
print(ARltm)
coef(ARltm, prob = T, order = T)
# 3PL
cnst3 <- cbind(c(1:15), rep(3,15), rep(1,15))
cnst1 <- cbind(c(1:15), rep(1,15), rep(.2,15))
cnst <- rbind(cnst1, cnst3)
# first column - represents the items
# second column - 1 denotes the guessing parameters, 2 denotes the
#   difficulty parameters, 3 denotes the discrimination parameters
# third column - represents the value at which the parameter named in
#   the 2nd column should be fixed
ARltm.3PL <- tpm(data.ar, type = "latent.trait", constraint = cnst)
coef(ARltm.3PL, prob = T, order = T)
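# optional sanity check of the constraint matrix (illustrative addition,
# base R only): each row is (item, parameter code, fixed value), so the
# first 15 rows fix guessing at .2 and the last 15 fix discrimination at 1
head(cnst)
nrow(cnst)  # 30 constraint rows in total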
# comparing 1PL/Rasch vs. 2PL model
anova(ARltm.rasch, ARltm)
# comparing 1PL/Rasch vs. 3PL
anova(ARltm.rasch, ARltm.3PL)
### PLOTTING TIC FOR 1PL, 2PL, 3PL ###
# test information curve for Rasch model
plot(ARltm.rasch, type = "IIC", items = 0, lwd = 2, cex.lab = 1.1,
xlim = c(-4,4), ylim = c(0,4))
par(new = T) # superimposing 2nd plot
# TIC for 2PL
plot(ARltm, type = "IIC", items = 0, lwd = 2, cex.lab = 1.1, xlim =
c(-4,4), ylim = c(0,4), axes = F, main = "", xlab = "", ylab = "", lty
= 2)
par(new = T)
# TIC for Rasch model with guessing
plot(ARltm.3PL, type = "IIC", items = 0, lwd = 3, cex.lab = 1.1, xlim
= c(-4,4), ylim = c(0,4), axes = F, main = "", xlab = "", ylab = "",
lty = 3)
# creating legend
legend(2, 4, c("1PL", "2PL", "Guessing"), lty = c(1:3), cex = .9, bty
= "n")
# lty gives the legend appropriate lines
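A possible extension of this script is to screen item-level fit of the Rasch solution directly in R; the ltm package's item.fit() function returns chi-squared item-fit statistics computed over groups of examinees ordered by ability (a sketch, assuming the default settings are appropriate for these data):
# chi-squared item-fit statistics for the Rasch model, using G = 10
# ability groups (the ltm default); large statistics flag misfitting items
item.fit(ARltm.rasch, G = 10)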
Technical Appendix C-1
TITLE: Measurement Invariance - Grades
Baseline model
DATA: FILE = pt60clean_allgrades.dat;
VARIABLE: NAMES = studid grade usregion male voc1-voc21 ar1-ar15
spel1-spel16
cap1-cap34 pun1-pun27 ee1-ee12;
USEVAR = ar1-ar15;
GROUPING = grade (9=9TH 10=10TH 11=11TH 12=12TH);
CATEGORICAL = ar1-ar15;
MISSING = .;
IDVARIABLE = studid;
ANALYSIS: ESTIMATOR = WLSMV;
PARAMETERIZATION = THETA;
! Baseline model for 9th grade reference group
MODEL:
! 1 factor (loadings held equal within this grade via label L9, free across grades)
Gf BY ar1-ar11* ar13-ar15* (L9);
Gf BY ar12@0;
! Item thresholds all free
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0 by default);
Gf@1; [Gf@0];
MODEL 10TH:
! 1 factor (loadings held equal within this grade via label L10, free across grades)
Gf BY ar1-ar11* ar13-ar15* (L10);
Gf BY ar12@0;
! Item thresholds all free
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0 by default);
Gf@1; [Gf@0];
MODEL 11TH:
! 1 factor (loadings held equal within this grade via label L11, free across grades)
Gf BY ar1-ar11* ar13-ar15* (L11);
Gf BY ar12@0;
! Item thresholds all free
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0 by default);
Gf@1; [Gf@0];
MODEL 12TH:
! 1 factor (loadings held equal within this grade via label L12, free across grades)
Gf BY ar1-ar11* ar13-ar15* (L12);
Gf BY ar12@0;
! Item thresholds all free
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0 by default);
Gf@1; [Gf@0];
OUTPUT: STDYX Residual;
SAVEDATA: DIFFTEST = baseline.dat;
Technical Appendix C-2
TITLE: Measurement Invariance - Grades
Metric Invariance model
DATA: FILE = pt60clean_allgrades.dat;
VARIABLE: NAMES = studid grade usregion male voc1-voc21 ar1-ar15
spel1-spel16
cap1-cap34 pun1-pun27 ee1-ee12;
USEVAR = ar1-ar15;
GROUPING = grade (9=9TH 10=10TH 11=11TH 12=12TH);
CATEGORICAL = ar1-ar15;
MISSING = .;
IDVARIABLE = studid;
ANALYSIS: ESTIMATOR = WLSMV;
PARAMETERIZATION = THETA;
DIFFTEST = baseline.dat;
! Metric invariance model for 9th grade reference group
MODEL:
! 1 factor (loadings now all held EQUAL)
Gf BY ar1-ar11* ar13-ar15* (loading);
Gf BY ar12@0;
! Item thresholds all free
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0);
Gf@1; [Gf@0];
MODEL 10TH:
! 1 factor (loadings now all held EQUAL)
Gf BY ar1-ar11* ar13-ar15* (loading);
Gf BY ar12@0;
! Item thresholds all free
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0);
Gf@1; [Gf@0];
MODEL 11TH:
! 1 factor (loadings now all held EQUAL)
Gf BY ar1-ar11* ar13-ar15* (loading);
Gf BY ar12@0;
! Item thresholds all free
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0);
Gf@1; [Gf@0];
MODEL 12TH:
! 1 factor (loadings now all held EQUAL)
Gf BY ar1-ar11* ar13-ar15* (loading);
Gf BY ar12@0;
! Item thresholds all free
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0);
Gf@1; [Gf@0];
OUTPUT: STDYX Residual;
SAVEDATA: DIFFTEST = equalload_grades.dat;
Technical Appendix C-3
TITLE: Measurement Invariance - Grades
Scalar Invariance model
Equal Thresholds Across Grades
DATA: FILE = pt60clean_allgrades.dat;
VARIABLE: NAMES = studid grade usregion male voc1-voc21 ar1-ar15
spel1-spel16
cap1-cap34 pun1-pun27 ee1-ee12;
USEVAR = ar1-ar15;
GROUPING = grade (9=9TH 10=10TH 11=11TH 12=12TH);
CATEGORICAL = ar1-ar15;
MISSING = .;
IDVARIABLE = studid;
ANALYSIS: ESTIMATOR = WLSMV;
PARAMETERIZATION = THETA;
DIFFTEST = equalload_grades.dat;
! Scalar invariance model for 9th grade reference group
MODEL:
! 1 factor (loadings held equal across grades via the shared label)
Gf BY ar1-ar11* ar13-ar15* (loading);
Gf BY ar12@0;
! Item thresholds estimated here but held equal across grades (not freed below)
[ar1$1-ar15$1*];
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0);
Gf@1; [Gf@0];
MODEL 10TH:
! 1 factor (loadings now all held EQUAL)
Gf BY ar1-ar11* ar13-ar15* (loading);
Gf BY ar12@0;
! Item thresholds all free (HELD EQUAL IF LEFT OFF)
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0);
Gf@1; [Gf@0];
MODEL 11TH:
! 1 factor (loadings now all held EQUAL)
Gf BY ar1-ar11* ar13-ar15* (loading);
Gf BY ar12@0;
! Item thresholds all free (HELD EQUAL IF LEFT OFF)
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0);
Gf@1; [Gf@0];
MODEL 12TH:
! 1 factor (loadings now all held EQUAL)
Gf BY ar1-ar11* ar13-ar15* (loading);
Gf BY ar12@0;
! Item thresholds all free (HELD EQUAL IF LEFT OFF)
! Item residual variances all fixed = 1
ar1-ar15@1;
! Factor Variance (fixed = 1) and Mean (fixed = 0);
Gf@1; [Gf@0];
OUTPUT: STDYX Residual;
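Read together, Technical Appendices C-1 through C-3 specify a nested sequence of constraints on the grade-specific item response functions (sketched here in the same probit metric as above, with the misfitting item ar12 dropped by fixing its loading to zero):

Baseline (C-1): P_g(ar_j = 1 | \theta) = \Phi(\lambda_g \theta - \tau_jg), one common loading per grade g, thresholds free in every grade.
Metric (C-2): \lambda_g = \lambda for all grades; thresholds remain free.
Scalar (C-3): \lambda_g = \lambda and \tau_jg = \tau_j for all grades.

Each step is tested against the previous one through the WLSMV DIFFTEST files (baseline.dat, equalload_grades.dat) saved and read by the three inputs.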
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Improving reliability in noncognitive measures with response latencies
The limits of unidimensional computerized adaptive tests for polytomous item measures
Attrition in longitudinal twin studies: a comparative study of SEM estimation methods
Multivariate biometric modeling among multiple traits across different raters in child externalizing behavior studies
Estimation of nonlinear mixed effects mixture models with individually varying measurement occasions
A functional use of response time data in cognitive assessment
Sources of stability and change in the trajectory of openness to experience across the lifespan
Evaluating social-cognitive measures of motivation in a longitudinal study of people completing New Year's resolutions to exercise
Emergent literacy skills and home literacy environment of Latino children who live in poverty
Asset Metadata
Creator
Bautista, Randy Paul M. (author)
Core Title
Rasch modeling of abstract reasoning in Project TALENT
School
College of Letters, Arts and Sciences
Degree
Master of Arts
Degree Program
Psychology
Publication Date
06/27/2013
Defense Date
05/15/2013
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
abstract reasoning,item response theory,measurement invariance,OAI-PMH Harvest,Rasch model
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
McArdle, John J. (committee chair), Baker, Laura A. (committee member), John, Richard S. (committee member)
Creator Email
rpbautis@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-282116
Unique identifier
UC11293996
Identifier
etd-BautistaRa-1715.pdf (filename),usctheses-c3-282116 (legacy record id)
Legacy Identifier
etd-BautistaRa-1715.pdf
Dmrecord
282116
Document Type
Thesis
Format
application/pdf (imt)
Rights
Bautista, Randy Paul M.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA