Multi-group Multidimensional Classification Accuracy Analysis (MMCAA)
A General Framework for Evaluating the Practical Impact of Partial Invariance
by
Meltem Ozcan
A Thesis Presented to the
FACULTY OF THE USC DORNSIFE COLLEGE OF
LETTERS, ARTS, AND SCIENCES
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF ARTS
(PSYCHOLOGY)
August 2023
Copyright 2023 Meltem Ozcan
Table of Contents
List of Tables
List of Figures
Abstract
Chapter One: Introduction
    Common Factor Model
    Measurement Invariance
        Establishing measurement invariance
        The practical impact of measurement noninvariance
Chapter Two: Multi-group MCAA
    Steps to performing a Multi-group MCAA
        Step 1: Perform measurement invariance testing.
        Step 2: Call PartInv.
        Step 3: Explore proportions selected and AI ratios.
        Step 4: Compare classification accuracy indices across invariance conditions.
        Step 5: Compare observed and expected classification accuracy indices.
Chapter Three: Illustrative Example
    The Data
    UCLA-LS-9
    Step 1: Perform invariance testing
    Step 2: Call PartInv.
    Step 3: Explore proportions selected and AI ratios.
    Step 4: Compare classification accuracy indices across invariance conditions.
    Step 5: Compare observed and expected classification accuracy indices.
        Bringing it together
    Collapsing categories can mask noninvariance
Chapter Four: Discussion and Limitations
References
Appendix A
List of Tables
1 Descriptives for the UCLA-LS-9 composite by group
2 Model fit indices
3 Groups with non-invariant measurement parameters on the UCLA-LS-9
4 UCLA-LS-9 partial invariance model loadings by group
5 UCLA-LS-9 partial invariance model measurement intercepts by group
6 UCLA-LS-9 partial invariance model unique factor variances by group
7 UCLA-LS-9 partial invariance model latent mean estimates by group
8 UCLA-LS-9 partial invariance model latent variance-covariance estimates by group
9 Adverse Impact (AI) ratio computed for every possible reference group (PS_total = .07)
10 Classification accuracy indices (CAI) under partial and strict measurement invariance
11 Observed and expected classification accuracy indices under matched distributions
12 Classification accuracy indices for the two-group scenario
A1 Item-level descriptive statistics by age group
A2 UCLA-LS-9 item correlations using the full sample
A3 UCLA-LS-9 item correlations by age group
List of Figures
1 A confusion matrix for the hypothetical example
2 A path diagram for the partial invariance model for late adolescents (ages 18-25)
3 Classification accuracy indices at different proportions of selection by group
4 The joint bivariate distributions of observed and latent scores for each age group

Abstract
Abstract
Measurement invariance is an important prerequisite for the meaningful and valid comparison
of test scores across individuals with different group membership. Given that tests are often used
in high-stakes contexts (e.g., personnel selection, diagnosis of diseases), the practical impact of
violations of measurement invariance is of great interest to researchers and practitioners alike.
Millsap and Kwok (2004) proposed a framework for investigating the practical impact of
noninvariance on selection accuracy, which has recently been extended to multidimensional tests
by Lai and Zhang (2022) in the Multidimensional Classification Accuracy Analysis (MCAA)
Framework. Existing approaches to evaluating practical impact have mostly considered
measurement invariance across two groups. When a population is made up of multiple
subpopulations (e.g., ethnic groups), groups are often dichotomized for ease of analysis, which may
lead to misleading inferences due to the loss of information and precision. The current paper
introduces a general framework for investigating the practical impact of measurement
noninvariance on the accuracy and fairness of decisions made using a psychometric test made up
of any number of dimensions administered to individuals from any number of subpopulations. We
demonstrate the application and the advantages of the multi-group MCAA through an illustrative
example using data from a published study on the measurement invariance of a loneliness scale
across seven age groups, and show that valuable information is lost if the grouping variable is
collapsed to achieve a binary grouping. We offer guidelines for interpretation. The multi-group
MCAA framework is fully implemented in the R package unbiasr.
Chapter One: Introduction
Psychometric test scores are commonly employed as a proxy for individuals’ relative
standings on otherwise unobservable psychological phenomena in order to make selection or
classification decisions. For example, individuals scoring above a pre-determined threshold on a
diagnostic test may be marked for further screening for a disease, and the top 10% of candidates
on an aptitude test may be selected for employment. In such high-stakes scenarios, it is vital that
decisions are based solely on the constructs of interest, and are not influenced by external factors
such as the race or the age of the test-taker. The notion that items on a test measure the same
construct in each individual in the same way irrespective of group membership or of other
construct-irrelevant conditions is termed measurement invariance (Drasgow, 1984; Mellenbergh,
1989). Observed group level differences in scores can be unambiguously and meaningfully
interpreted as true group differences in the latent variable of interest only if one can establish the
across-groups equivalence of the relationship between the observed and latent variables (Horn &
McArdle, 1992; Yoon & Millsap, 2007). If, however, measurement invariance does not hold, any
inference drawn using the scores must be taken with the proverbial grain of salt, and any
decisions made using the test must be subject to scrutiny.
The criteria for measurement invariance, which will be discussed in detail later, are stringent
and rarely met in practice for all of the items on a test. More commonly, only a subset of items
on a test are found to be measurement invariant, a condition termed partial factorial invariance¹
(Byrne et al., 1989). While partial invariance offers a more realistic approach to model
specification which facilitates the comparison of scores across groups, some ambiguity regarding
the true meaning of score differences persists under partial invariance. This ambiguity calls for an
investigation into not only the degree but also the practical impact of the violations of invariance
on classification outcomes: is there a discrepancy in how individuals with different group
membership are impacted by noninvariance? Do the noninvariant items on the test favor
individuals from certain backgrounds while disadvantaging others, and are individuals from some
groups systematically losing out on opportunities? Clearly, quantifying the practical impact of
noninvariance on classification decisions is a crucial first step to disentangling and addressing any
negative consequences of systematic discrepancies in classification outcomes across groups.

¹ Factorial invariance has been shown to be equivalent to measurement invariance under the common factor model (Horn & McArdle, 1992; Thurstone, 1947).
The current paper introduces a general framework for investigating the practical impact of
measurement noninvariance on classification accuracy on tests with any number of dimensions
administered to any number of groups of individuals. We first review the common factor model to
lay the foundation on which we will build the multi-group multi-category classification accuracy
analysis framework, and synthesize relevant literature on measurement invariance testing and
practical impact. We define classification accuracy and introduce metrics that may be used to
evaluate the performance of a test in producing scores that lead to correct decisions. Then, we
introduce the multi-group multidimensional classification accuracy analysis framework, and
showcase the developed methods through an illustrative example on the measurement invariance
and classification accuracy analysis of the UCLA-9 Loneliness Scale across seven age groups.
Finally, we discuss results, implications, and future directions.
Common Factor Model
The common factor model (Spearman, 1904; Thurstone, 1947) posits that scores observed on
a set of test items (also called indicators) are reflective of the latent (theoretical, unobserved)
constructs that the items were developed to measure. Under the common factor model, the
relationship between M latent constructs and J observed variables is explained by the following
linear equation
$$\mathbf{y}_{ig} = \boldsymbol{\nu}_g + \boldsymbol{\Lambda}_g \boldsymbol{\eta}_{ig} + \boldsymbol{\epsilon}_{ig} \tag{1}$$

where y_ig denotes a J × 1 vector of observed item scores, η_ig is an M × 1 vector of latent factor
scores, ν_g is a J × 1 vector of intercepts, Λ_g is a J × M matrix of factor loadings, and ϵ_ig is a
J × 1 vector of unique factor variables (Lai & Zhang, 2022; Meredith & Teresi, 2006). Here, i and
g indicate the individual (i = 1, ..., N) and the individual's group membership (g = 1, ..., G),
respectively. Latent variables (η) explain the shared variance in observed item scores (y), and
unique factor variables (ϵ) capture the variance in scores unexplained by the latent variables,
subsuming both measurement error (random error) and systematic error. Unique factor variables
are locally independent after conditioning on the latent constructs and are assumed to follow a
normal distribution with E(ϵ) = 0 and variance-covariance matrix Cov(ϵ) = Θ. We assume that
the latent constructs are normally distributed with M × 1 mean vector E(η) = α and M × M
variance-covariance matrix Cov(η) = Ψ. Assuming that the latent and unique factor variables are
uncorrelated (Cor[ϵ, η] = 0), the observed variables follow a normal distribution with mean
E(y_g) = ν_g + Λ_g α_g and variance-covariance matrix Σ_g = Λ_g Ψ_g Λ′_g + Θ_g.
Typically, observed item scores are summed through a weighted or unweighted linear
combination to create a composite score Z on which individuals are compared to other individuals
taking the test (e.g., in school admissions or personnel selection) or to a specific absolute score
(e.g., diagnostic screens, licensing). Regardless of whether the intended use of the test is to assess
performance, qualification, or eligibility, any inference made about the relative standing of
individuals on the attribute(s) being measured is only valid if measurement invariance holds
(Drasgow, 1984; Horn & McArdle, 1992).
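To make these quantities concrete, the following minimal R sketch (with made-up parameter values for a single group with one factor and three items) computes the model-implied moments of the observed scores and of an unweighted composite:

```r
# Hypothetical one-factor (M = 1), three-item (J = 3) example; all values made up.
lambda <- matrix(c(0.8, 0.7, 0.9), nrow = 3)   # J x M factor loadings
nu     <- c(1.0, 1.2, 0.9)                     # J x 1 measurement intercepts
alpha  <- 0.5                                  # latent mean (M x 1)
psi    <- matrix(1)                            # latent variance-covariance (M x M)
theta  <- diag(c(0.4, 0.5, 0.3))               # unique factor variances (J x J)

# Model-implied moments of y: E(y) = nu + Lambda alpha; Sigma = Lambda Psi Lambda' + Theta
mu_y    <- nu + lambda %*% alpha
sigma_y <- lambda %*% psi %*% t(lambda) + theta

# Unweighted composite Z (the sum of item scores): mean and variance follow directly
c_wts <- rep(1, 3)
mu_z  <- sum(c_wts * mu_y)
var_z <- drop(t(c_wts) %*% sigma_y %*% c_wts)
```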
Measurement Invariance
Measurement invariance occurs when measurement operations are equivalent and comparable
across groups such that the test produces equal expected scores for individuals with the same
standing on the latent construct, regardless of construct irrelevant variables such as group
membership (Drasgow, 1984; Mellenbergh, 1989). Namely, under the common factor model, the
distribution of observed scores conditional on latent variables is independent of group membership
such that
$$P(\mathbf{y} \mid \boldsymbol{\eta}, G = g) = P(\mathbf{y} \mid \boldsymbol{\eta}), \quad \forall g, \tag{2}$$

resulting in equal intercepts, factor loadings, and unique factor variances across groups:

$$\boldsymbol{\nu}_g = \boldsymbol{\nu}, \quad \boldsymbol{\lambda}_g = \boldsymbol{\lambda}, \quad \boldsymbol{\theta}_g = \boldsymbol{\theta}, \quad \forall g. \tag{3}$$
The level of invariance indicated by the attainment of (3) is termed strict measurement
invariance, which subsumes (from most to least restrictive) scalar, metric and configural
invariance (Meredith, 1993; Steenkamp and Baumgartner, 1998; Vandenberg and Lance, 2000).
Configural invariance is satisfied if the composition of factors and items within the factors is
equivalent across groups such that the same items make up the factors in each group and the
location of any zero loading is identical across groups. At the configural stage, all measurement
parameters are estimated freely (Millsap, 2012). The lack of a common metric, and therefore a
common meaning, at this stage precludes group level comparisons, which Horn and McArdle
(1992) likened to "comparing apples with beans" (p. 126). The next step, metric (also dubbed
weak factorial or pattern) invariance is achieved if in addition to configural invariance, the factor
loadings are equivalent across groups. Metric invariance implies that the same construct is being
measured across groups or conditions, and that the latent construct has the same meaning across
groups (Horn & McArdle, 1992; Steinmetz et al., 2009; Van de Schoot et al., 2012). At this stage,
it becomes possible to draw comparisons about factor variance and covariances (Meredith &
Teresi, 2006). Scalar (strong factorial) invariance further requires equal item intercepts across
groups. Once scalar invariance is achieved, the comparison and interpretation of group-level
differences in factor means becomes legitimate (Meredith & Teresi, 2006). One caveat is that
because the error variances are not equal across groups at the scalar invariance stage, the latent
variables are measured with different levels of error (Van de Schoot et al., 2012). The final and
most restrictive level of invariance is strict invariance, previously defined in equation (3), which
further requires that unique factor variances and therefore construct reliabilities (Cole & Maxwell,
1985) are equivalent across groups and means that the measurement operations are identical
across groups.
Although scalar invariance is the commonly accepted standard for measurement invariance in
psychology research (Luong & Flake, 2022), it is nevertheless rare in practice that full scalar
invariance, where scalar invariance holds for every item on the scale, is achieved, as traditional
invariance testing procedures rely on tests of exact equality, which may indicate poor fit due to a
single noninvariant item. Under partial invariance (Byrne et al., 1989), some items are allowed to
have measurement parameters that vary across groups, i.e., parameters that are freely estimated,
while other items remain invariant across groups. While there is no agreed upon rule on how
many items can reasonably be freed in a partial invariance model, 20% is a commonly cited rule
of thumb (Dimitrov, 2010).
Establishing measurement invariance
A number of paradigms have been developed to test for measurement invariance, including
but not limited to multiple-group confirmatory factor analysis (MGCFA; Jöreskog, 1969),
exploratory structural equation modeling (ESEM; Marsh et al., 2009), multiple indicators multiple
causes (MIMIC; Muthén, 1989) models, and moderated nonlinear factor analysis (MNLFA;
Bauer, 2017). Traditional approaches to establishing measurement invariance, such as MGCFA,
involve fitting increasingly constrained structural equation models (SEM) to data and
assessing model fit through goodness-of-fit indices (GFI) that compare the covariance structure
implied by the SEM to the observed sample covariance structure, such as the Comparative Fit
Index (CFI) and the Standardized Root Mean-square Residual (SRMR).² Models with more
parameter constraints are built and compared until metrics such as the likelihood ratio test
(LRT) or the magnitude of change in CFI indicate a worse fit of the more constrained model to
the data, at which point the less constrained model is accepted as the final model.
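As a sketch, this sequence of increasingly constrained models can be carried out in lavaan with the group.equal argument; the one-factor model and the variable names below are hypothetical placeholders:

```r
library(lavaan)

# Hypothetical one-factor model; x1-x4 and the grouping variable "group"
# are placeholder names for the analyst's own data (dat).
model <- "f =~ x1 + x2 + x3 + x4"

fit_configural <- cfa(model, data = dat, group = "group")
fit_metric     <- cfa(model, data = dat, group = "group",
                      group.equal = "loadings")
fit_scalar     <- cfa(model, data = dat, group = "group",
                      group.equal = c("loadings", "intercepts"))
fit_strict     <- cfa(model, data = dat, group = "group",
                      group.equal = c("loadings", "intercepts", "residuals"))

# Likelihood ratio tests between adjacent models, plus fit indices
lavTestLRT(fit_configural, fit_metric, fit_scalar, fit_strict)
fitMeasures(fit_scalar, c("cfi", "tli", "rmsea", "srmr"))
```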
When testing for full (metric, scalar, or strict) measurement invariance, the relevant set of
parameters are constrained to be equal across all groups for each item. In comparison, with
partial invariance, the primary task is identifying which items and which sets of parameters need
to be constrained as opposed to freely estimated, and across which groups, which is often done by
sequentially constraining or freeing certain subsets of parameters, and comparing model fit
through tests such as the likelihood ratio test. The larger the number of items on a test and the
larger the number of groups, the more tedious this process becomes. Tools such as pinsearch
(Lai, 2023), which is based on the sequential modification index approach by Yoon and Millsap
(2007), have been developed to help automate the identification of noninvariant items. An
alternate framework, the alignment method (Asparouhov & Muthén, 2014), does away with the
search and specification process entirely to reach an approximate rather than a partial invariance
model. Note that approximate invariance does not equate to full invariance: a degree of
measurement bias, which may have downstream consequences if the test is used for classification
purposes, remains.
² See Schmitt and Kuljanin (2008), Somaraju et al. (2021), and Vandenberg and Lance (2000) for detailed reviews of existing measurement invariance testing procedures.
The practical impact of measurement noninvariance
The steps following the discovery of violations of measurement invariance have received less
attention in the literature. One prominent exception is the selection accuracy analysis framework
proposed by Millsap and Kwok (2004) and further developed by Lai and Zhang (2022), which
focuses on the practical impact of partial measurement invariance in the context of test scores
being used to select or classify individuals using some pre-determined cut-off score or a proportion
of selection. In the selection accuracy analysis framework, indices quantifying the accuracy and
performance of a test are computed for each of the two groups under consideration (dubbed the
reference and focal groups, where the reference group may refer to the majority or the advantaged
group) under full strict invariance and under partial invariance (Millsap & Kwok, 2004). Then,
the indices are compared across groups and across invariance conditions with consideration of the
intended test use (Millsap & Kwok, 2004).
The classification accuracy of a test, or the degree of correspondence between (a) the
individuals who should be classified into a category (or class) based on their standing on the
latent construct(s) and (b) the individuals who actually were classified into that category based
on their observed scores, can be evaluated using indices drawn from signal detection theory
(Gonzalez et al., 2022; Millsap & Kwok, 2004; Swets & Pickett, 1982). For simplicity, we define
the indices in the context of binary classification (e.g., low risk or high risk of disease), but the
definitions hold for multiple classes (e.g., mild, moderate, and severe depression). We can think of
a hypothetical scenario in which a screening test is administered with the goal of identifying
at-risk individuals who would benefit from a health-related or behavioral intervention. One such
example is the Alcohol Use Disorders Identification Test (AUDIT; 1993), which is a commonly
used tool used by health professionals to, among other purposes, determine eligibility for alcohol
consumption reduction interventions. In this scenario, an individual who scores above a certain
latent threshold on the screening test can be thought of as an at-risk individual, while individuals
who do not reach this latent threshold are considered not to be at risk.
Ideally, the test would produce observed scores such that individuals who would be selected
for the intervention based on their latent standing (i.e., the at-risk individuals) would also be
determined eligible for the intervention based on their observed scores. Conversely, we would
want an individual who is not actually at risk to have observed scores that do not meet the
corresponding observed score threshold, leading to a judgment of ineligibility for the intervention.
The screening-in of an at-risk individual (true positive; TP) and the screening-out of an individual
who is not at risk (true negative; TN) reflect accurate decisions. Alternatively, an inaccurate
decision is made if an individual who is not at risk is incorrectly determined to be eligible (false
positive; FP) or an at-risk individual is incorrectly screened out (false negative; FN). These four
possible outcomes may be illustrated in a contingency table (Pearson, 1904) also referred to as an
error or confusion matrix in which the class labels predicted using the test scores are compared
against the actual labels based on the latent standings (see Figure 1 for an illustration).
Millsap and Kwok (2004) further define four composite indices that summarize test
performance. The first index, proportion selected (PS), is the proportion of individuals who were
classified as positive, PS = P(TP)+P(FP), where P() denotes a proportion. In the hypothetical
example, PS is the proportion of individuals who were determined to be at risk and therefore
eligible for the intervention out of all the individuals who were screened. The second index,
success ratio (SR), illustrates the ratio of true positives to all the positive classifications, SR =
P(TP)/PS. SR answers the question "out of all the positive classifications, how many correspond
to the truth?". The third index, sensitivity (SE), can be thought of as the ability of the test to
correctly identify all the individuals who are truly at risk, SE = P(TP) / [P(TP) + P(FN)]. Out
of everyone who is truly at risk, how many did the test successfully capture? The fourth and final
index, specificity, SP = P(TN) / [P(TN) + P(FP)], indicates the flip side of SE: how successful was
the test in screening out the individuals who are not at risk? By comparing PS, SR, SE, and SP
and other relevant indices across groups and invariance conditions, we can gain a better
understanding of the nature and magnitude of the practical impact of noninvariance on
classification accuracy in different groups (Millsap & Kwok, 2004).³

³ Note that different terminology may be used in other contexts for these indices; for instance, SR, SE, and SP are often called precision, recall, and selectivity in computer science and related fields.
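The four indices can be computed directly from the confusion-matrix cell proportions, as in the following minimal sketch (not the unbiasr implementation):

```r
# Classification accuracy indices from confusion-matrix proportions.
# tp, fp, tn, fn are P(TP), P(FP), P(TN), P(FN) and sum to 1.
cai <- function(tp, fp, tn, fn) {
  ps <- tp + fp                  # proportion selected
  c(PS = ps,
    SR = tp / ps,                # success ratio: correct among the selected
    SE = tp / (tp + fn),         # sensitivity: truly at-risk who were selected
    SP = tn / (tn + fp))         # specificity: not-at-risk who were screened out
}

cai(tp = .05, fp = .02, tn = .90, fn = .03)   # hypothetical proportions
```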
A related framework for quantifying the practical impact of measurement bias is the Adverse
Impact (AI) ratio (Nye and Drasgow, 2011; the Ratio of Selection Ratios Index by Stark et al.,
2004). The AI ratio allows researchers to compare and attribute differences in PS across two
groups to measurement bias by assuming that the groups are matched on the underlying trait(s).
The AI ratio for a group of interest is the ratio of the expected PS for that group (PS_Ef,
computed for that group under the assumption that its latent distribution matches that of the
group designated as the reference group) to the observed PS for the reference group (PS_R):

$$\text{AI ratio} = \frac{\text{PS}_{Ef}}{\text{PS}_{R}} \tag{4}$$

(Nye & Drasgow, 2011). If strict measurement invariance holds, we would expect an AI ratio of 1,
and deviations from 1 can be interpreted as resulting from measurement bias. The 'four-fifths
rule' suggests that the focal group suffers from adverse impact if this ratio falls below .80 (Nye &
Drasgow, 2011).
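Because the composite for each group is normally distributed with the model-implied mean and variance given earlier, PS_Ef can be computed by pairing the focal group's measurement parameters with the reference group's latent moments. A minimal sketch (hypothetical inputs; not the unbiasr implementation):

```r
# Expected proportion selected for a focal group whose latent distribution
# (alpha_r, psi_r) is set equal to the reference group's, while retaining the
# focal group's measurement parameters (nu_f, lambda_f, theta_f).
ps_expected <- function(nu_f, lambda_f, theta_f, alpha_r, psi_r, c_wts, z_cut) {
  mu_z  <- sum(c_wts * (nu_f + lambda_f %*% alpha_r))
  var_z <- drop(t(c_wts) %*% (lambda_f %*% psi_r %*% t(lambda_f) + theta_f) %*% c_wts)
  1 - pnorm(z_cut, mean = mu_z, sd = sqrt(var_z))   # P(Z > Z_c)
}

# Equation (4): expected PS for the focal group over the observed PS for the
# reference group (ps_ref computed analogously from the reference parameters).
# ai_ratio <- ps_expected(...) / ps_ref
```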
The selection accuracy analysis framework has gained traction in recent years, with
extensions into tests with binary (Lai et al., 2019) and ordinal (Gonzalez & Pelham, 2021) items.
Noting that selection and classification decisions are rarely made on unidimensional constructs,
Lai and Zhang (2022) extended the method to handle tests made up of multiple dimensions with
potentially unequal item and dimension weights in the Multidimensional Classification Accuracy
Analysis (MCAA) framework. Lai and Zhang (2022) also applied the idea of matching latent
distributions to compute expected indices for the focal group beyond PS_Ef, i.e., SR_Ef, SE_Ef,
and SP_Ef. These expected classification accuracy indices for the focal group are then compared
against the indices observed for the reference group.
One major limitation of the MCAA is the assumption that the test takers belong to one of
two groups. While some grouping variables such as gender (e.g., Lu et al., 2018; Ock et al., 2020;
Reise et al., 2001) conventionally result in two distinct groups, characteristics such as ethnicity,
age, or language may comprise many more distinct subpopulations. Measurement invariance of
test scores have been evaluated across groups characterized by age (e.g., Marsh et al., 2013;
Siedlecki et al., 2008), ethnicity (e.g., Skriner and Chu, 2014), language (e.g., Király et al., 2019;
Thielmann et al., 2020), culture (e.g., Bowden et al., 2016; Di Giunta et al., 2017), test mode (i.e.,
paper or computer-based; e.g., Meade et al., 2007), and device type (i.e., mobile, PC, tablet; e.g.,
Menold and Toepoel, 2022), among others. In such contexts where the grouping variable may be
polychotomous, analyses tend to be limited to an examination of measurement invariance across
two of the groups, or some subgroups may be collapsed to achieve a binary grouping. For
example, in an examination of the impact of partial invariance on classification accuracy in
addiction research, Lai and colleagues (2019) opted to focus on two age groups, ages 12-25 and 26
and older. As we will see later in the illustrative example, focusing on two groups out of many or
collapsing categories to achieve dichotomy may lead to a loss of information, power, and precision,
or to misleading inferences (Strömberg, 1996). To remedy this gap, we establish a
more general framework that allows researchers to investigate the magnitude and practical impact
of partial invariance on tests made up of any number of dimensions administered to individuals
from any number of subpopulations.
Chapter Two: Multi-group MCAA
Let us consider an M-dimensional construct on which we would like to assess the relative
standings of individuals in order to classify them into one of two categories (e.g., selection
and rejection). Ideally, our decisions would be based on composites of the true scores on the M
dimensions (ζ), but as latent variables are unobservable by definition, we may administer a test
and use observed item scores as the basis of decision-making. In practice, this is done by either
specifying a cut-off score (Z_c) and selecting all individuals who score above it (Z_i > Z_c), or
specifying a proportion of selection (PS_total) and selecting the top PS_total of the scorers.
Let us assume that this test is made up of J items, partitions of which tap into the M
dimensions of the construct. We may choose to place differential weights on the items and/or the
M dimensions to reflect in the composite scores the relative importance or relevance of each test
component. Letting c = [c_1, ..., c_J] and y_i = [y_1, ..., y_J] denote the vectors of item weights and
the observed scores of test-taker i (i = 1, ..., N), we can compute the test-taker's composite score
as Z_i = c y_i. If c_1 = ... = c_J = 1, the composite score amounts to the simple addition of the scores
on each item. Under the assumptions previously discussed for the common factor model, the true
scores and the observed scale sums follow a bivariate normal distribution

$$\begin{bmatrix} Z_g \\ \zeta_g \end{bmatrix} \sim N\left(\begin{bmatrix} \mathbf{c}\boldsymbol{\nu}_g + \mathbf{c}\boldsymbol{\Lambda}_g\boldsymbol{\alpha}_g \\ \mathbf{w}\boldsymbol{\alpha}_g \end{bmatrix},\; \begin{bmatrix} \mathbf{c}\boldsymbol{\Lambda}_g\boldsymbol{\Psi}_g\boldsymbol{\Lambda}_g'\mathbf{c}' + \mathbf{c}\boldsymbol{\Theta}_g\mathbf{c}' & \mathbf{c}\boldsymbol{\Lambda}_g\boldsymbol{\Psi}_g\mathbf{w}' \\ \mathbf{w}\boldsymbol{\Psi}_g\boldsymbol{\Lambda}_g'\mathbf{c}' & \mathbf{w}\boldsymbol{\Psi}_g\mathbf{w}' \end{bmatrix}\right) \tag{5}$$

where w = [w_1, ..., w_M] denotes a vector of weights for the M latent dimensions. In the
two-group case, Millsap and Kwok (2004) demonstrated that the latent score cutoff ζ_c can be
computed as the quantile in a mixture of two bivariate normal distributions (5) corresponding to
PS_total (which is either pre-specified by the researcher or computed as the proportion of
individuals selected using the cutoff Z_c). In the multi-group MCAA, the latent score cutoff ζ_c is
computed analogously as the quantile corresponding to PS_total in a mixture of G bivariate
normal distributions, each characterized by the group-specific means and covariances given in (5).
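Numerically, such a cutoff can be found as the root of the mixture tail-probability equation; the sketch below works with the (marginal) latent composite distributions, and all group values are hypothetical:

```r
# Solve for the latent cutoff zeta_c such that the mixture of group-specific
# latent composite distributions places PS_total above the cutoff.
latent_cutoff <- function(ps_total, mix, mu_zeta, sd_zeta) {
  f <- function(z) sum(mix * (1 - pnorm(z, mean = mu_zeta, sd = sd_zeta))) - ps_total
  uniroot(f, interval = range(mu_zeta) + c(-10, 10) * max(sd_zeta))$root
}

# Hypothetical latent means, SDs, and mixing proportions for three groups:
latent_cutoff(ps_total = 0.07,
              mix      = c(.3, .4, .3),
              mu_zeta  = c(0, 0.2, -0.1),
              sd_zeta  = c(1, 1.1, 0.9))
```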
Taking the selections that would have been made using ζ_c as our ground truth and the selections
that were made using Z_c as the predictions, we can build a confusion matrix and evaluate the
performance of the test within each group through the previously introduced classification
accuracy indices (CAI). Comparisons of CAI across groups can give us fine-grained insight into the
nature and the magnitude of the impact of noninvariance on classification decisions. Furthermore,
once the latent distributions of the groups are matched to the latent distribution of a reference
group, Cohen’s h (Cohen, 1988) effect size indices can be computed to gauge the significance of
the discrepancy between the observed CAI for the reference group and the expected CAI for the
focal groups. Cohen’s h effect size of the difference between two proportions is defined as
$$h = 2\arcsin\left(\sqrt{p_1}\right) - 2\arcsin\left(\sqrt{p_2}\right) \tag{6}$$

(Cohen, 1988). In the context of the multi-group MCAA, p_1 and p_2 denote the observed CAI for
the reference group and the expected CAI for a focal group, respectively.
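The computation itself is a one-liner; a minimal sketch is given below (unbiasr ships a cohens_h helper for this purpose, whose exact interface may differ):

```r
# Cohen's h for the difference between two proportions (Equation 6).
cohens_h <- function(p1, p2) 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

cohens_h(.792, .721)   # e.g., two hypothetical sensitivity values
```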
Steps to performing a Multi-group MCAA
We have fully implemented the multi-group MCAA framework in the PartInv function of the
R package unbiasr.⁴ In the following section, we lay out the steps for performing a multi-group
MCAA in unbiasr, after which we demonstrate the application and various advantages of the
framework through an illustrative example.

⁴ https://github.com/mmm-usc/unbiasr
Step 1: Perform measurement invariance testing.
First, measurement invariance testing is performed as usual, arriving at an acceptable final
(partial invariance, approximate invariance, alignment, etc.) model. Estimated model parameters
(factor loadings, measurement intercepts, unique factor variances, latent means, and latent
variance-covariances) are extracted for this model and passed on to PartInv.
Step 2: Call PartInv.
Next follow choices regarding the (a) cut-off score or proportion of selection, (b) mixing
proportions, (c) item weights, (d) latent weights, and (e) reference group.
(a) Determine a cut-off score or proportion of selection. PartInv requires the input
of either a cut-off score or a proportion of selection to determine which individuals will be
selected. For instance, the top X% of candidates may be selected in a norm-referenced personnel
selection scenario, or the X% highest-scoring individuals on a diagnostic test may be selected for
an intervention that is made available to only a subset of individuals due to resource constraints.
Alternatively, a cut-off may be used if, for example, there exists a clinically meaningful threshold
with which to determine eligibility or a diagnosis (e.g., a score of 15 or more on the AUDIT
suggesting moderate to severe alcohol use disorder). Ultimately, a cut-off score corresponds to a
proportion of selection and vice versa, so the decision to use a cut-off score or a proportion of
selection is one of convention and/or practicality.
(b) Choose mixing proportions. By default, PartInv assumes equal sample sizes for the
groups. For unbalanced samples, researchers may choose to provide mixing proportions based on
the proportion of individuals in each group.
(c-d) Choose item and latent weights. By default, PartInv assumes equal weights on
the items and latent factors, but researchers may choose to place more or less weight on certain
items or factors. For example, an employer using the big-five inventory to make hiring decisions
may want to prioritize ’conscientiousness’ over ’openness to experience’, and opt to provide latent
weights reflecting the relative importance of these facets of personality.
(e) Decide on a reference group. A group may be designated as reference based on the
researchers’ hypotheses or aims (e.g., individuals taking the paper-based version of a test may be
designated as reference if the goal is to investigate how individuals taking the test on a computer
and individuals taking the test on a tablet perform relative to the paper-based test-takers).
Alternatively, the majority group, the group that benefits most from noninvariance, or the group
that leads to the greatest number of focal groups suffering from adverse impact may be
designated as reference. To this end, AI ratios are computed for each possible reference group by
reordering the input parameters to PartInv, and AI ratios below 0.8 are tallied. The group that
leads to the greatest number of AI < 0.8 values is designated as reference. If there is a tie, the
group that leads to the most severe cases of adverse impact may be chosen as reference. If there
are no AI ratios below 0.8, the group that leads to the greatest number of focal groups with AI <
1 may be chosen as an alternative. We encourage exploring different reference groups for a more
comprehensive understanding of the magnitude and direction of bias.
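This tallying procedure can be sketched as follows, assuming a matrix of AI ratios has been assembled from repeated PartInv calls (the matrix layout here is hypothetical):

```r
# ai_mat[r, f]: AI ratio for focal group f when group r is the reference
# (diagonal entries equal 1 by construction).
choose_reference <- function(ai_mat) {
  n_adverse <- rowSums(ai_mat < 0.8)     # focal groups with adverse impact
  if (all(n_adverse == 0)) {
    n_adverse <- rowSums(ai_mat < 1)     # fall back to counting any AI below 1
  }
  which.max(n_adverse)   # ties could be broken by severity (smallest AI)
}
```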
PartInv is called with the model parameters extracted in Step 1 and the configurations
outlined above. The function call will return (i) classification accuracy indices computed for each
group, (ii) adverse impact ratios for the focal group(s), (iii) expected classification accuracy
indices for the focal group(s) if their latent distributions matched that of the reference group,
and, if specified, (iv) classification accuracy indices under strict invariance. The user also has the
option of requesting plots for the joint bivariate distributions of observed and latent scores for
each group under partial and strict invariance.
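Schematically, a call might look like the sketch below; the argument names are illustrative placeholders rather than the package's documented interface and should be checked against the unbiasr documentation:

```r
library(unbiasr)

# Hypothetical call; all argument names below are placeholders.
out <- PartInv(propsel = 0.07,         # proportion of selection (or a cut-off)
               pmix    = mix_props,    # mixing proportions for the G groups
               lambda  = lambda_list,  # per-group loadings from Step 1
               nu      = nu_list,      # per-group measurement intercepts
               Theta   = theta_list,   # per-group unique factor variances
               alpha   = alpha_list,   # per-group latent means
               psi     = psi_list,     # per-group latent covariances
               ref     = 1)            # index of the reference group
```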
Step 3: Explore proportions selected and AI ratios.
Proportions selected under partial invariance and strict invariance are examined across
groups to detect any systematic patterns, with particular attention paid to the direction and
magnitude of bias. Was a larger proportion of individuals from a certain group selected compared
to other groups under partial invariance? Does the discrepancy decrease under strict invariance,
and if so, by how much? Alternatively, do the proportions selected appear comparable across
groups under both invariance conditions? AI ratios are examined next to gain clearer insight into
these questions. AI < 1 indicates that the focal group is disadvantaged by the presence of
measurement bias: despite equivalent latent distributions, a smaller proportion of individuals
in the focal group would be expected to be selected compared to the observed PS for the reference
group. AI < 0.8 corresponds to adverse impact by the 'four-fifths' rule (Nye & Drasgow, 2011).
AI = 1 indicates that equal proportions would be selected for the reference and focal groups under
equivalent latent distributions (i.e., that no adverse impact has occurred), and AI > 1 indicates
that, under equivalent latent distributions, a larger proportion would be selected for the focal
group than for the reference group.
Step 4: Compare classification accuracy indices across invariance conditions.
Next, composite indices such as SR, SE, and SP under partial and strict invariance conditions
are examined and compared. Plots of the joint bivariate distributions of observed and latent
scores for each group may be examined to get a better understanding of group differences in raw
classification accuracy indices (TP, FP, TN, and FN), and how these differences change across
invariance conditions.
Step 5: Compare observed and expected classification accuracy indices.
As the final step, observed classification accuracy for the reference group is compared with
expected classification accuracy for the focal group(s). Using the helper function cohens_h
provided in unbiasr, Cohen’s h effect sizes are computed for the discrepancy between the
observed classification accuracy indices for the reference group and the expected classification
accuracy indices for the focal group(s) if their latent distribution(s) matched that of the reference
group. The size and direction of the effects are scrutinized through a practical impact lens. Matching
the groups on their latent distributions enables meaningful across-group comparisons of
classification accuracy indices since any observed discrepancies across groups can now be
attributed to noninvariance and quantified in a consistent metric.
In short, the first two steps are preparatory and consist of extracting and choosing parameter
values to be passed on to PartInv. The following three steps entail the examination of the
function outputs to gather information which is then synthesized to arrive at a comprehensive
assessment of the practical impact of measurement bias on classification accuracy.
Chapter Three: Illustrative Example
We now illustrate the application of the multi-group MCAA framework with data from a
previously published study on the measurement invariance and factor structure of various forms
of the UCLA Loneliness Scale (UCLA-LS) across seven age groups in a community sample in the
U.K. (Panayiotou et al., 2022). We consider a hypothetical scenario in which the UCLA-LS is
used as an initial screener to determine which individuals to further assess for loneliness. Let us
assume, for the sake of simplicity, that all the individuals who meet the threshold in this initial
screener will be eligible for enrollment in one of the multitude of loneliness interventions currently
supported by research (e.g., cognitive behavioral therapy [CBT], counseling, exercise, animal
therapy, technological interventions, etc.; Hoang et al., 2022). Given the serious health risks posed
by loneliness (National Academies of Sciences, Engineering, and Medicine, 2020), it is important
that scores on the screener lead to accurate decisions regardless of group membership (in this
example, defined by age groups). If full measurement invariance cannot be established, two
individuals from different age groups who have the same latent standing on loneliness may receive
systematically different scores (beyond the contribution of random error). It is possible that
individuals from certain age groups may be disproportionately selected for or screened out of
further screening, possibly leading to situations of adverse impact. Our goal is to answer the
question: what is the practical impact of using the scores on the UCLA-LS to select individuals
for further assessment (and, subsequent enrollment for an intervention) for loneliness when the
resulting scores are not necessarily equivalent due to the presence of measurement bias in certain
items?
The Data
The data were initially collected as part of a collaborative research project by the Wellcome
Collection and the BBC which aimed to investigate individuals’ attitudes towards and experiences
of touch in the U.K. (the "Touch Test Project"). Participants were recruited through radio
broadcasts and social media prior to the Covid-19 lockdowns in 2020 and reflect a heavily White
and female non-clinical community sample that was not fully representative of the general U.K.
population in terms of gender and ethnic breakdown (Panayiotou et al., 2022). After the removal
of 4,503 participants due to missing data (see Panayiotou et al., 2022 for additional details), the
total sample size was n = 19,521 (24.3% male, 74.2% female, 0.05% non-binary, 0.89% prefer not
to self-describe/not to say).
In parallel with the example of Panayiotou et al. (2022), we divided participants into the
following seven age groups characterized by distinct formative experiences: late adolescence
(18–25; n = 625), early young adulthood (26–35; n = 1300), late young adulthood (36–45; n =
1846), early middle adulthood (46–55; n = 3800), late middle adulthood (56–65; n = 5919), early
old age (66–75; n = 4991), and middle-oldest old age (76+; n = 1040). Tables illustrating item
means, standard deviations, and correlations by age group can be found in the appendices.
UCLA-LS-9
For our analyses, we focus on the nine-item, three-factor short form of the revised UCLA-LS
(UCLA-LS-9; Hawkley et al., 2005) which was demonstrated to reach full scalar invariance and
was recommended for use in clinical and applied settings as a reliable and structurally valid
diagnostic screener for loneliness (Panayiotou et al., 2022). The first factor of the UCLA-LS-9
(F1) assesses intimate connectedness and consists of items 1, 4, and 8 ("I lack companionship", "I
feel left out", and "I feel isolated from others"). Factor 2 (F2) assesses relational connectedness
with items 3, 5, and 7 ("There are people I can talk to", "There are people I can turn to", and
"There are people I feel close to"). The third and final factor of the UCLA-LS-9 (F3) assesses
collective connectedness and consists of items 2, 6, and 9 ("I feel in tune with the people around
me", "I have a lot in common with the people around me", and "I feel part of a group of friends").
Items are answered on a 4-point scale (never, rarely, sometimes, often). Items 3, 5, 7, 2, 6, and 9
(F2 and F3) are reverse coded such that a lower score on these items is associated with higher
loneliness ratings. Composite scores for each individual are computed by summing the item
scores. Table 1 illustrates mean composite scores and standard deviations by age group.
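The corresponding three-factor structure can be written in lavaan syntax as follows (the item variable names i1-i9 are placeholders for the dataset's actual column names):

```r
# UCLA-LS-9 three-factor structure; i1-i9 are placeholder item names.
ucla_ls9_model <- "
  intimate   =~ i1 + i4 + i8   # F1: intimate connectedness
  relational =~ i3 + i5 + i7   # F2: relational connectedness
  collective =~ i2 + i6 + i9   # F3: collective connectedness
"
```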
Step 1: Perform invariance testing
All analyses were conducted in R (version 4.2.2; R Core Team, 2022). Invariance testing was
performed in a step-wise fashion using the cfa function from the lavaan package (version 0.6.13;
Rosseel, 2012). We used the robust maximum likelihood (MLR) estimator with Huber-White
"sandwich" standard errors and the Yuan-Bentler test statistic to account for the nonnormality of
scores (Savalei & Rosseel, 2022; Yuan & Bentler, 2000). Full information maximum likelihood
(FIML) was specified to use all available information and handle the small number of observations
with partially missing data on the UCLA-LS-9 items (n=39). Table 2 illustrates fit indices for the
models considered.
First, we fit a multigroup CFA without equality constraints to test for configural invariance.
Latent variances in one group were fixed to unity and all latent means were fixed to zero for
identification. Given the sensitivity of the chi-square statistic to large sample sizes (Cheung &
Rensvold, 2002; Putnick & Bornstein, 2016; Van de Schoot et al., 2012), rather than interpreting
the significant scaled chi-square statistic, χ²(168) = 1645.995, p < .001, as a sign of misfit of the
configural model to the data, we examined indices that are more resistant to the influence of
sample size. As CFI = .981 > .95, TLI = .972 > .95, RMSEA = .062 < .08, and
SRMR = .033 < .08 (Cheung & Rensvold, 2002; Hu & Bentler, 1999), we determined that
configural invariance holds.
Then, we fit a second model with equality constraints on the factor loadings to test for metric
invariance. Latent variances in one group were fixed to unity and all latent means were fixed to
zero for identification. The metric model also showed acceptable fit, with CFI = .981 > .95,
TLI = .976 > .95, RMSEA = .057 < .08, and SRMR = .035 < .08. The Satorra-Bentler scaled
chi-square difference test comparing the full configural invariance and the full metric invariance
models was significant, Δχ²(36) = 71.694, p < .001, indicating that the more restricted full
metric invariance model should be rejected. Having rejected full metric invariance, we then
explored a partial invariance model for the data.
pinsearch (Lai, 2023), a freely available R package that automates the partial invariance
model specification process, was used to identify non-invariant items. Table 3 lists the groups for
which item parameters were determined to be non-invariant. We see that while no items reached
strict invariance, items 6 and 9 reached scalar invariance and items 4, 5, and 8 reached metric
invariance. Items 1, 2, 3, and 7 did not reach metric invariance. The f_MACS effect sizes of
item-level noninvariance (illustrated in Table 3) were largest for items 1 and 2, with
f_MACS,1 = 0.166 and f_MACS,2 = 0.144, suggesting that these items contribute the greatest bias.
These f_MACS values correspond to Cohen's d values of d_1 = 0.332 and d_2 = 0.288, which fall
between the commonly used but admittedly subjective thresholds of .2, used to indicate a small
effect size, and .5, used to indicate a medium effect size (Cohen, 1988). As the thresholds of .2, .5,
and .8 for small, medium, and large effect sizes have been criticized as unrealistically large and
demonstrated to be unrepresentative of findings in the field (Bosco et al., 2015), it is possible that
the practical significance of the bias in these items is larger than indicated.
A partial invariance model was fit to the data using the model specification determined by
pinsearch (provided in the supplemental material). The model fit indices are illustrated in Table
2. The partial invariance model demonstrated acceptable fit to the data: CFI = .981 > .95,
TLI = .981 > .95, RMSEA = .051 < .08, and SRMR = .035 < .08. Both AIC and BIC are lowest
for the partial invariance model, favoring it over the full configural invariance and full metric
invariance models.
Tables 4, 5, and 6 illustrate the factor loadings, measurement intercepts, and unique factor
variances under the partial invariance model by age group. Tables 7 and 8 present the latent
mean and latent variance-covariances under the partial invariance model by age group. The factor
means were fixed to zero and factor variances were fixed to one in the first group to achieve model
identifiability. This parameterization facilitates the comparison of factor loadings and intercepts
across groups, while an alternative parameterization with one factor loading and the
corresponding intercept constrained to one and zero respectively would allow for the latent means
to be comparable across groups (Van de Schoot et al., 2012). Note that we discuss only one
possible partial invariance model out of the many alternatives that could be specified through the
selection of a different referent item and a different parameterization. As our primary goal is to
illustrate the use of the multi-group MCAA, the current specification determined by pinsearch
was deemed adequate for the purposes of the current work. However, it is possible that the
specified model is not the best model if the item selected as referent (in our case, item 1) is in fact
noninvariant, as forcing a noninvariant item to be invariant and using this item for calibration
may distort the pattern of noninvariance across the remaining items (Yoon & Millsap, 2007). One
proposed solution to this problem is Yoon and Millsap’s (2007) data-based Monte Carlo approach
in which a partial invariance model is determined without the need for a referent item.
Step 2: Call PartInv.
(a) Determine a cut-off score or proportion of selection. Figure 3 illustrates
classification accuracy indices computed under partial invariance and strict invariance plotted
against different proportions of selection between 0.01 and 0.20. We see that success ratio and
sensitivity increase and specificity decreases as the proportion of selection increases, suggesting
that as the proportion of selection goes up, selection becomes more indiscriminate. As there is no
consensus on a single clinically meaningful cutoff score on the UCLA-LS-9, we explored three
proportions of selection in the plotted range of values which correspond to individuals scoring 1,
1.5, and 2 standard deviations above the mean composite score across the entire sample: 2%, 7%,
and 16%. The larger PS
total
=16% indicates a scenario where we cast a wider net with the initial
screener, while PS
total
=2% may be more pertinent in a scenario where, due to scarcity of
resources, only the highest scoring subset of individuals are selected for further screening. We
observed that the discrepancies in PS across groups were larger when the more stringent
PS
total
=2% was employed to select individuals, while the discrepancies were less drastic when
PS
total
=16% was used. PS
total
=7% was explored as a middle ground of the two scenarios, and is
the only scenario discussed here. Note that this proportion of selection was chosen for illustrative
purposes only and further research by field experts is necessary to determine an appropriate
cutoff for the use of UCLA-LS-9 in applied settings.
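Under a normal approximation for the composite, the three thresholds map onto the stated proportions:

```r
# Proportions scoring more than 2, 1.5, and 1 SD above the mean, respectively.
round(1 - pnorm(c(2, 1.5, 1)), 2)
#> [1] 0.02 0.07 0.16
```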
(b-d) Choose mixing proportions, item weights, and latent weights. We specified
the mixing proportions as the proportion of individuals in each age group,
π = [.032,.067,.095,.195,.303,.256,.053]. Equal weights were placed on the items and latent
factors.
(e) Decide on a reference group. As we did not have any particular hypotheses regarding
the seven age groups, we considered each possible reference group for illustrative purposes. Some
viable choices for the reference group include the late middle adulthood group (56-65), which is
the majority and accounts for about 30% of our sample, or the late adolescence group
(18-25), which leads to the greatest number of focal groups with lower proportions selected. As
can be seen in Table 9, which presents the AI ratios when different groups are chosen as reference,
the choice of late adolescents leads to AI values below 1 in each focal group.
PartInv was called with the configurations specified above and the partial invariance model
parameters extracted in Step 1 to apply the multi-group MCAA framework. The function call
and the parameter specifications can be found in the supplemental material.
Step 3: Explore proportions selected and AI ratios.
From the function output, we see that PS_total = 0.067 (indicating a threshold 1.5 SD above
the mean) corresponds to the observed score cutoff z_c = 25.124 and a latent score cutoff of
ζ_c = 3.341. Table 10 illustrates the proportions selected (PS). If full strict measurement
invariance held, about 11% of late adolescents (18-25), approximately 8% of individuals in early
young, late young, and early middle adulthood (26-35, 36-45, 46-55), 6% of individuals in late
middle adulthood (56-65), and about 5% of individuals in early and middle-oldest old age (66-75
and 76+) would be flagged for further screening. The presence of noninvariant items leads
to some minor changes in PS in each group, with the largest changes being the 0.9% decrease in
PS_26-35 and the 0.7% decrease in PS_18-25. All AI ratios fall below 1 (Table 9), which suggests
that individuals in all age groups are less likely to be selected for further screening for the
loneliness intervention compared to individuals in the 18-25 age group. The lowest AI ratio is
observed in the early young adulthood group (26-35), with AI = 0.834, suggesting that under
matching latent distributions, this group has the lowest PS compared to the late adolescence
group. AI ratios below 1 suggest that individuals in these groups (early young adults, middle
young adults, early middle adulthood, late middle adulthood, early old age, and middle-oldest old
age) would all need to score higher on the UCLA-LS-9 to be selected for further assessment (and
be eligible for the subsequent loneliness intervention) than the late adolescents. This finding is
especially concerning for the older age groups given that these groups are more prone to
experiencing loneliness and closely related conditions such as depression (Lee et al., 2021;
National Academies of Sciences, Engineering, and Medicine, 2020), and might mean that
individuals who are most vulnerable to the severe negative consequences of loneliness may not get
access to an intervention they need due to noninvariance.
Step 4: Compare classification accuracy indices across invariance conditions.
We start by visually examining the latent and observed score distributions under partial
and strict invariance in Figure 4. The largest discrepancy between TP rates under partial
measurement invariance is observed for the late adolescents (18-25) and the two oldest age groups
(66-75 and 76+) such that the TP rates are higher for the late adolescence group (quadrant A).
For FP, the difference between groups appears smaller, with a relatively larger FP for the late
adolescents (quadrant B). The TN rates are smallest for the late adolescence group (quadrant C), with
the greatest discrepancies occurring between this group and the early old age (66-75) and
middle-oldest old age (76+) groups. Conversely, the FN rates appear comparable across the seven
age groups (quadrant D). The discrepancies in the classification accuracy rates are smaller under
strict invariance compared to under partial invariance.
We turn to Table 10 to examine the SR, SE, and SP rates for each group across invariance
conditions. Under partial invariance, SR ranges between .747 and .814 and SE between .721 and
.811. SP rates under partial invariance are .970 or higher for all groups, suggesting that the
test is relatively stronger at screening out individuals who are truly ineligible than at
screening in individuals who are truly eligible. Although the largest PS is observed for late
adolescents (18-25), the lower SR and SP rates for this group indicate issues with the accuracy
of selections in this group. The low SR suggests that a smaller proportion of the late
adolescents who were selected for further screening actually qualified for it, which may be
attributable to the higher FP rate. In a similar vein, SP is slightly lower in the youngest group
than in the other groups, suggesting that the test is not as successful at screening out the
individuals aged 18-25 who do not need further screening.
The discrepancies in classification accuracy indices across groups are blunted if strict
invariance is assumed to hold. Overall, the largest difference between invariance conditions is
the 6.4 percentage point increase in SE_26-35 when strict invariance is assumed, suggesting that
if there were no measurement bias, the test would be more effective at capturing truly eligible
individuals aged 26-35. Millsap and Kwok (2004) suggest that a five percentage point change may
be considered one possible rule of thumb for determining whether observed differences are
practically significant, but emphasize that the intended use of the test is an important
consideration when determining practical significance.
We note that the changes in CAI are not monotone as age increases. For instance, SE is
SE_18-25 = .804 for the youngest group, dips to SE_26-35 = .721 in the second age group, peaks in
the third group at SE_36-45 = .811, dips again in the fourth group (SE_46-55 = .756), increases to
SE_56-65 = .792, and tapers off around .785 at older ages. Similarly, SR is lowest for the two
oldest groups (66-75 and 76+) and the youngest group (18-25), and oscillates up and down across
the age groups in between. These observations may have implications for the study of
change over the lifespan.
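The quadrant probabilities in Figure 4 can be computed directly from the joint bivariate normal distribution of the latent and observed composites. The sketch below uses mvtnorm::pmvnorm with hypothetical moments (not the UCLA-LS-9 model-implied values) to obtain the TP, FP, TN, and FN proportions for one group, and the PS, SR, SE, and SP derived from them.

library(mvtnorm)

# Hypothetical joint distribution of the latent composite (zeta) and the
# observed composite (Z) for one group; in the actual analysis these moments
# are implied by the (partial invariance) factor model parameters.
mu    <- c(0, 17.2)                   # means of (zeta, Z)
Sigma <- matrix(c(1.0,  4.2,
                  4.2, 28.0), 2, 2)   # cov(zeta, Z): corr approx. .79
zeta_c <- 1.0                         # latent cutoff
z_c    <- 25.1                        # observed cut score

# Quadrant probabilities (A: TP, B: FP, C: TN, D: FN, as in Figure 4)
tp <- as.numeric(pmvnorm(c(zeta_c,  z_c), c(Inf,    Inf), mean = mu, sigma = Sigma))
fp <- as.numeric(pmvnorm(c(-Inf,    z_c), c(zeta_c, Inf), mean = mu, sigma = Sigma))
tn <- as.numeric(pmvnorm(c(-Inf,  -Inf),  c(zeta_c, z_c), mean = mu, sigma = Sigma))
fn <- as.numeric(pmvnorm(c(zeta_c, -Inf), c(Inf,    z_c), mean = mu, sigma = Sigma))

c(PS = tp + fp, SR = tp / (tp + fp), SE = tp / (tp + fn), SP = tn / (tn + fp))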
Step 5: Compare observed and expected classification accuracy indices.
Table 11 illustrates the expected CAI for the focal groups if the underlying distributions for
these groups matched the latent distribution of each possible reference group, along with Cohen's
h effect sizes for the discrepancies between the observed CAI for the reference group (presented
in italics) and the expected CAI for the other groups. Overall, the discrepancies with the largest
effect sizes occur when the focal groups' latent distributions are matched to the distributions of
early young adults (26-35) or late young adults (36-45). The largest h values are observed for the
discrepancy in SE between early young adults (26-35) and late young adults (36-45), and the
discrepancy in SR between late adolescents (18-25) and early young adults (26-35). Matching the
younger groups' latent distributions to the older groups' distributions, we compute h_SE = .203
for the discrepancy between the observed SE for late young adults (36-45) and the expected SE for
early young adults (26-35), and h_SR = .200 for the discrepancy between the observed SR for early
young adults (26-35) and the expected SR for late adolescents (18-25). Flipping the reference
group to the younger counterparts gives h_SE = −.205 and h_SR = −.193, respectively. Both sets of
h values reflect a small effect size (Cohen, 1988). This information reinforces our observation
that SR is smaller (and FP is larger) in the late adolescence group (18-25) than in the other
groups, and that this discrepancy is especially meaningful between the two youngest groups of
test-takers, such that the test produces more 'false alarms' for individuals aged 18-25.
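For reference, Cohen's h for two proportions p1 and p2 is h = 2 arcsin(√p1) − 2 arcsin(√p2). A minimal implementation, applied here to the observed SR for the 26-35 group and the expected SR for the 18-25 group from Table 11:

# Cohen's h for the difference between two proportions
cohens_h <- function(p1, p2) 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Observed SR for early young adults (26-35) vs. the expected SR for late
# adolescents (18-25) under matched latent distributions (Table 11)
cohens_h(0.814, 0.731)  # approx. .200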
The differences in effect sizes shown in Table 11 highlight the importance of exploring how the
choice of the reference group changes the expected accuracy indices. For instance, the
discrepancies between observed and expected indices appear relatively minor if the majority group
(the late middle adults; 56-65) is chosen as the reference group. Blindly choosing a reference
group may therefore conceal existing effects. By comparing the results across different choices
of reference group, we can get a sense of the bounds on the extent of the impact of measurement
noninvariance.
Bringing it together
Overall, our investigations indicate that the early young adult, early old age, and
middle-oldest old age groups may need to pass a higher threshold to be selected for further
screening at a similar rate as late adolescents, and may not receive an intervention they need due
to the presence of noninvariance. While the effect sizes of the potential discrepancies are small,
they nonetheless provide valuable information for the next steps after the discovery of partial
invariance. The relatively high FP rate for the late adolescent group has implications for the
allocation of resources. Since the TP and SE indices for the late adolescence group are
relatively high, it appears that the individuals aged 18-25 who would benefit from further
screening are receiving it. The potential problem concerns false alarms: the high FP rate (and
correspondingly low SR) for the youngest group may indicate that older individuals who are
actually eligible are not screened further and do not get the intervention, as resources may be
misused on some individuals aged 18-25. Millsap and Kwok (2004) emphasize that some indices
are more meaningful in certain contexts. For example, in the illustrative example where the goal
is to identify who to further screen and provide an intervention for loneliness, we may be most
interested in SE if we have sufficient resources and our primary concern is to ensure that everyone
who needs the intervention gets it, and we might be less concerned with individuals who are
deemed eligible for the intervention but do not need it (i.e., the FP). However, under resource
constraints (e.g., if existing funds may only support a small number of individuals to attend the
intervention program), SR might be of greater interest. On the other hand, if the test is to be
used to make decisions about the qualifications of a candidate (e.g., licensure for nurses or
driving tests), it may be worthwhile to explore SP as well as SE, as it is important in such
contexts that individuals are correctly rejected if they do not meet the requirements. Equipped
with a better understanding of the potential practical impact of using the UCLA-LS-9 as an
initial screener for loneliness in this sample, and the knowledge that items 1 and 2 contribute
the most measurement bias, researchers or practitioners may opt to drop these items or, if they
consider the size of the discrepancy acceptable, retain them and proceed with the analyses. The
item_deletion_h function in unbiasr allows users to efficiently explore the practical impact of
deleting items on
classification accuracy. One alternative strategy, which we pursued but do not report here in
detail, is to use differential weights such that less weight is placed on items 1 and 2. This
strategy resulted in AI ratios closer to 1, less discrepancy between the observed CAI for the
reference group and the expected CAI for the focal groups when matched on latent distributions,
and less discrepancy between the partial and strict invariance conditions, overall reducing the
negative impact of noninvariance on classification accuracy.
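To illustrate how differential weighting operates, note that under the common factor model a weighted composite w'y has implied mean w'(ν + Λα) in each group. The sketch below uses hypothetical one-factor parameter values (not the UCLA-LS-9 estimates) to show that downweighting items with noninvariant intercepts shrinks the bias that passes into the composite mean.

# Implied mean of a weighted composite under a one-factor model.
# All parameter values below are hypothetical; items 1 and 2 play the role
# of the noninvariant items and are downweighted in w_down.
composite_mean <- function(w, lambda, nu, alpha) sum(w * (nu + lambda * alpha))

lambda <- c(0.6, 0.4, 0.7)          # loadings (invariant)
nu_ref <- c(2.6, 2.0, 2.1)          # reference-group intercepts
nu_foc <- c(2.4, 1.8, 2.1)          # focal group: items 1-2 have biased intercepts

w_unit <- c(1, 1, 1)                # unit-weighted composite
w_down <- c(0.5, 0.5, 1)            # downweight the noninvariant items

# Composite-mean gap between groups at equal latent means (alpha = 0):
# any nonzero gap is pure intercept bias passed into the composite.
gap <- function(w) composite_mean(w, lambda, nu_foc, 0) -
  composite_mean(w, lambda, nu_ref, 0)
gap(w_unit)  # -0.4: full intercept bias enters the unit-weighted composite
gap(w_down)  # -0.2: downweighting items 1 and 2 halves the bias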
Collapsing categories can mask noninvariance
As discussed earlier, grouping variables with more than two categories, such as ethnicity,
language, or age, are often reduced to binary categories for ease of analysis and interpretation.
Such an approach compromises the quality of inferences drawn from the data and can result in the
loss of highly valuable information, which we now demonstrate by collapsing the seven age groups
into two groups, young adults (18-45, n = 7571) and older adults (46+, n = 11950), and repeating
the analyses. Data exploration and invariance testing were performed as before. Classification
accuracy indices and Cohen's h effect sizes for the discrepancies across the two
groups and invariance conditions are reported in Table 12.
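A minimal sketch of the collapsing step, assuming a data frame dat with a seven-level factor age_group (both names are placeholders):

# Collapse the seven age groups into the binary grouping used in Table 12
young_levels <- c("18-25", "26-35", "36-45")
dat$age_binary <- factor(ifelse(dat$age_group %in% young_levels, "18-45", "46+"),
                         levels = c("18-45", "46+"))
table(dat$age_binary)  # check the resulting group sizes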
A total proportion of selection of PS_total = 0.07 results in an observed cut score of
Z_c = 24.997 and a latent score cutoff of ζ_c = 3.736. The AI ratio was computed as 0.987,
suggesting a slightly smaller PS for the focal group (older adults) relative to the reference
group. PS remains the same for both groups under the partial and strict invariance conditions.
The largest differences observed in the CAI under partial versus strict invariance are the 0.007
increase in SR for the young adult group and the 0.006 decrease in SR for the older adult group
when strict invariance is assumed. Similarly, the largest discrepancy between the observed CAI
for the young adult group and the expected CAI for the older adult group is in SR, with Cohen's
h = −.032. The nuances discovered and
discussed in the illustrative example above are missing from these results. Clearly, using a rough
binary grouping leads to the loss of important information about the practical impact of partial
invariance. Our aim in extending the MCAA to the multi-group scenario was to tackle this issue.
The full automation of the multi-group framework in unbiasr makes it easy to remain faithful to
the polychotomous nature of grouping variables, and we hope that it will help researchers
circumvent the pitfalls of dichotomizing variables when examining the impact of measurement
noninvariance on classification accuracy.
Chapter Four: Discussion and Limitations
The multi-group MCAA framework goes one step further than traditional approaches to
measurement invariance testing and provides more fine-grained information about how exactly
each group is impacted by the presence of noninvariance through classification accuracy indices
such as sensitivity and specificity, and metrics such as the adverse impact ratio. Researchers and
practitioners using tests to make classification decisions can utilize the multi-group MCAA
framework to explore not only the direction and magnitude of any adverse impact and potential
discrepancies in classification accuracy, but also examine the effect size of the differences. Future
work should focus on determining whether the .2, .5, and .8 thresholds commonly used for effect
sizes are meaningful in the context of quantifying the significance of any discrepancies between
observed and expected CAI, or whether new thresholds need to be developed to draw accurate
inferences. As previously discussed, these benchmarks have been criticized for being
unrepresentative of findings in the field, and it is possible that a Cohen’s h value of .2 may be
indicative of, for instance, a medium effect size rather than a small effect size. It is also unclear
whether the intended test use should play a role in the conceptualization of small, medium, and
large effect sizes. Using these thresholds too rigidly without further investigation may lead to
existing effects being written off as not meaningful.
Furthermore, while we have attempted to address the nonnormality of the data through the
use of the robust ML (MLR) estimator, MLR may not be sufficient under severe violations of
normality. Future work should investigate the implications for the indices of using different
estimators, such as diagonally weighted least squares (DWLS) or unweighted least squares (ULS)
with a polychoric correlation matrix, and should incorporate functionality to correctly handle
ordinal data.
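For instance, an ordinal treatment of the items in lavaan might look like the following sketch, where the model string, data frame, and item names are placeholders rather than the specification used in this paper:

library(lavaan)

# Hypothetical three-factor specification; y1-y9 are placeholder item names
model <- "
  F1 =~ y1 + y4 + y8
  F2 =~ y3 + y5 + y7
  F3 =~ y2 + y6 + y9
"
fit <- cfa(model, data = dat, group = "age_group",
           estimator = "DWLS",            # diagonally weighted least squares
           ordered = paste0("y", 1:9))    # treat items as ordinal (polychorics)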
While the UCLA-LS-9 example in this paper is intended for illustrative purposes only, the
choice of age as a grouping variable may be subject to criticism, as age is not a naturally
categorical variable. Although one can argue that more information is retained when using seven
age groups rather than two, as we demonstrated in a secondary extension to the example,
transforming interval data into ordinal data nevertheless constitutes a "downgrading of
measurement" (Kim and Frisby, 2019, p. 589). However, there can be a trade-off between the
information lost in this transformation and the information gained from the analysis of
interactions made possible by handling a continuous variable as categorical (Kim and Frisby,
2019). In the context of our example, discretizing the age variable into theoretically supported
categories allowed us to gain insight into the practical impact of measurement noninvariance on
classification accuracy across the seven groups. That being said, it is debatable whether the
homogeneity of responses within groups implied by the assignment of observations to a given group
holds (Panayiotou et al., 2022). Can edge cases (individuals at the boundary of groups, such as
25- and 26-year-olds) truly be considered different enough to fall under different categories?
Future work can explore different categorization strategies to
investigate such questions. It is also important to note that all 20 items of the UCLA-LS were
administered at once (Panayiotou et al., 2022), including the nine-item subset that we used for
our analyses. Scores and response patterns on these nine items may have been influenced by
preceding items, the ordering of items, or the length of the scale as a whole. As such, the
validity of the data collected may be questioned. Another limitation of the illustrative example
is the unequal sample sizes, with the late adolescent (18-25) group accounting for only 3.2% of
the data. We attempted to address this by exploring how the findings change when a bootstrapping
procedure is used to conduct the analyses on randomly drawn samples of equal size from each group.
These analyses are not reported here due to limited space, but they suggest that the discrepancies
observed in classification accuracy indices across groups and invariance conditions were not as
pronounced when the sample sizes were balanced. Simulation studies may be employed in future work
to further explore the relationship between the effect size of discrepancies in classification
accuracy indices and the total sample size and the balance of sample sizes across groups.
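A skeleton of such a balanced bootstrap is sketched below; dat, age_group, and composite are placeholder names, and the group summary statistic stands in for refitting the partial invariance model and recomputing the CAI within each replication.

# Skeleton of the balanced bootstrap: draw equal-sized samples per group.
# `dat`, `age_group`, and `composite` are placeholders; the group means
# stand in for refitting the model and recomputing the CAI per replication.
set.seed(1)
n_min <- min(table(dat$age_group))  # size of the smallest group
B <- 200
boot_stats <- replicate(B, {
  balanced <- do.call(rbind, lapply(split(dat, dat$age_group),
                                    function(g) g[sample(nrow(g), n_min), ]))
  tapply(balanced$composite, balanced$age_group, mean)
})
apply(boot_stats, 1, quantile, probs = c(.025, .975))  # intervals by group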
In this paper, we introduced a general framework to explore the practical impact of
noninvariance on the classification accuracy of tests used in high-stakes contexts, and illustrated
how the developed methods may be employed through functions freely available in the unbiasr
package. The methods introduced here are not specific to partial invariance models and the
framework freely lends itself to use with any model in which a degree of noninvariance is
permitted (e.g., approximate invariance achieved in alignment). The multi-group MCAA provides
researchers and practitioners with tools to independently and efficiently explore the performance
and fairness of tests, and make better informed and fairer decisions while using test scores in
high-stakes contexts.
References
Asparouhov, T., & Muthén, B. (2014). Multiple-group factor analysis alignment. Structural Equation Modeling: A Multidisciplinary Journal, 21(4). https://doi.org/10.1080/10705511.2014.919210
Bauer, D. J. (2017). A more general model for testing measurement invariance and differential item functioning. Psychological Methods, 22(3). https://doi.org/10.1037/met000007
Bosco, F., Aguinis, H., Singh, K., Field, J., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431–449.
Bowden, S., Saklofske, D., van de Vijver, F., Sudarshan, N., & Eysenck, S. (2016). Cross-cultural measurement invariance of the Eysenck Personality Questionnaire across 33 countries. Personality and Individual Differences, 103. https://doi.org/10.1016/j.paid.2016.04.028
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105(3), 456–466. https://doi.org/10.1037/0033-2909.105.3.456
Cheung, G., & Rensvold, R. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9. https://doi.org/10.1207/S15328007SEM0902_5
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cole, D., & Maxwell, S. (1985). Multitrait-multimethod comparisons across populations: A confirmatory factor analytic approach. Multivariate Behavioral Research, 20.
Di Giunta, L., Iselin, A. R., Eisenberg, N., Pastorelli, C., Gerbino, M., Lansford, J. E., Dodge, K. A., Caprara, G. V., Bacchini, D., Uribe Tirado, L. M., & Thartori, E. (2017). Measurement invariance and convergent validity of anger and sadness self-regulation among youth from six cultural groups. Assessment, 24(4). https://doi.org/10.1177/1073191115615214
Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement and Evaluation in Counseling and Development, 43(2).
Drasgow, F. (1984). Scrutinizing psychological tests: Measurement equivalence and equivalent relations with external variables are the central issues. Psychological Bulletin, 95(1). https://doi.org/10.1037/0033-2909.95.1.134
Epskamp, S. (2015). semPlot: Unified visualizations of structural equation models. Structural Equation Modeling: A Multidisciplinary Journal. https://doi.org/10.1080/10705511.2014.937847
Gonzalez, O., Georgeson, A. R., & Pelham, W. E., III. (2022). How accurate and consistent are score-based assessment decisions? A procedure using the linear factor model. Assessment. https://doi.org/10.1177/10731911221113568
Gonzalez, O., & Pelham, W. E. (2021). When does differential item functioning matter for screening? A method for empirical evaluation. Assessment, 28, 446–456. https://doi.org/10.1177/1073191120913618
Hawkley, L. C., Browne, M. W., & Cacioppo, J. T. (2005). How can I connect with thee? Let me count the ways. Psychological Science, 16. https://doi.org/10.1111/j.1467-9280.2005.01617.x
Hoang, P., King, J. A., Moore, S., Moore, K., Reich, K., Sidhu, H., Tan, C. V., Whaley, C., & McMillan, J. (2022). Interventions associated with reduced loneliness and social isolation in older adults: A systematic review and meta-analysis. JAMA Network Open, 5(10). https://doi.org/10.1001/jamanetworkopen.2022.36676
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18(3), 117–144. https://doi.org/10.1080/03610739208253916
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1). https://doi.org/10.1080/10705519909540118
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34(2), 183–202. https://doi.org/10.1007/BF02289343
Kim, S., & Frisby, C. L. (2019). Gaining from discretization of continuous data: The correspondence analysis biplot approach. Behavior Research Methods, 51(2). https://doi.org/10.3758/s13428-018-1161-1
Király, O., Bőthe, B., Ramos-Diaz, J., Rahimi-Movaghar, A., Lukavska, K., Hrabec, O., Miovsky, M., Billieux, J., Deleuze, J., Nuyens, F., Karila, L., Griffiths, M. D., Nagygyörgy, K., Urbán, R., Potenza, M. N., King, D. L., Rumpf, H., Carragher, N., & Demetrovics, Z. (2019). Ten-Item Internet Gaming Disorder Test (IGDT-10): Measurement invariance and cross-cultural validation across seven language-based samples. Psychology of Addictive Behaviors, 33(1). https://doi.org/10.1037/adb0000433
Lai, M. (2023). pinsearch: Specification search for partial factorial invariance [R package]. https://github.com/marklhc/pinsearch; https://marklhc.github.io/pinsearch/
Lai, M., Richardson, G. B., & Mak, H. W. (2019). Quantifying the impact of partial measurement invariance in diagnostic research: An application to addiction research. Addictive Behaviors, 94. https://doi.org/10.1016/j.addbeh.2018.11.02
Lai, M., & Zhang, Y. (2022). Classification accuracy of multidimensional tests: Quantifying the impact of noninvariance. Structural Equation Modeling: A Multidisciplinary Journal. https://doi.org/10.1080/10705511.2021.1977936
Lee, S. L., Pearce, E., Ajnakina, O., Johnson, S., Lewis, G., Mann, F., Pitman, A., Solmi, F., Sommerlad, A., Steptoe, A., Tymoszuk, U., & Lewis, G. (2021). The association between loneliness and depressive symptoms among adults aged 50 years and older: A 12-year population-based cohort study. The Lancet Psychiatry, 8(1). https://doi.org/10.1016/S2215-0366(20)30383-7
Lu, S., Hu, S., Guan, Y., Xiao, J., Cai, D., Gao, Z., Sang, Z., Wei, J., Zhang, X., & Margraf, J. (2018). Measurement invariance of the Depression Anxiety Stress Scales-21 across gender in a sample of Chinese university students. Frontiers in Psychology, 9. https://doi.org/10.3389/fpsyg.2018.02064
Luong, R., & Flake, J. K. (2022). Measurement invariance testing using confirmatory factor analysis and alignment optimization: A tutorial for transparent analysis planning and reporting. Psychological Methods. https://doi.org/10.1037/met0000441
Marsh, H. W., Muthén, B., Asparouhov, T., Lüdtke, O., Robitzsch, A., Morin, A. J., & Trautwein, U. (2009). Exploratory structural equation modeling, integrating CFA and EFA: Application to students' evaluations of university teaching. Structural Equation Modeling, 16(3). https://doi.org/10.1080/10705510903008220
Marsh, H. W., Nagengast, B., & Morin, A. J. (2013). Measurement invariance of big-five factors over the life span: ESEM tests of gender, age, plasticity, maturity, and la dolce vita effects. Developmental Psychology, 49(6). https://doi.org/10.1037/a0026913
Meade, A. W., Michels, L. C., & Lautenschlager, G. J. (2007). Are Internet and paper-and-pencil personality tests truly comparable? An experimental design measurement invariance study. Organizational Research Methods, 10(2).
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13. https://doi.org/10.1016/0883-0355(89)90002-5
Menold, N., & Toepoel, V. (2022). Do different devices perform equally well with different numbers of scale points and response formats? A test of measurement invariance and reliability. Sociological Methods and Research. https://doi.org/10.1177/00491241221077237
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4). https://doi.org/10.1007/BF02294825
Meredith, W., & Teresi, J. A. (2006). An essay on measurement and factorial invariance. Medical Care, 44(11, Suppl 3). https://doi.org/10.1097/01.mlr.0000245438.73837.89
Millsap, R. E. (2012). Statistical approaches to measurement invariance. Routledge.
Millsap, R. E., & Kwok, O. M. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9(1), 93–115. https://doi.org/10.1037/1082-989X.9.1.93
Muthén, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54. https://doi.org/10.1007/BF02296397
National Academies of Sciences, Engineering, and Medicine. (2020). Social isolation and loneliness in older adults: Opportunities for the health care system. The National Academies Press. https://doi.org/10.17226/25663
Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. Journal of Applied Psychology, 96, 966–980. https://doi.org/10.1037/a0022955
Ock, J., McAbee, S. T., Mulfinger, E., & Oswald, F. L. (2020). The practical effects of measurement invariance: Gender invariance in two big five personality measures. Assessment, 27(4).
Panayiotou, M., Badcock, J. C., Lim, M. H., Banissy, M. J., & Qualter, P. (2022). Measuring loneliness in different age groups: The measurement invariance of the UCLA Loneliness Scale. Assessment. https://doi.org/10.1177/10731911221119533
Pearson, K. (1904). On the theory of contingency and its relation to association and normal correlation. Biometric Series, Drapers' Co. Memoirs, London.
Putnick, D. L., & Bornstein, M. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41. https://doi.org/10.1016/j.dr.2016.06.004
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. https://www.R-project.org/
Reise, S. P., Smith, L., & Furr, M. R. (2001). Invariance on the NEO PI-R Neuroticism Scale. Multivariate Behavioral Research, 36(1). https://doi.org/10.1207/S15327906MBR3601_04
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02
Saunders, J. B., Aasland, O. G., Babor, T. F., de la Fuente, J. R., & Grant, M. (1993). Development of the Alcohol Use Disorders Identification Test (AUDIT): WHO collaborative project on early detection of persons with harmful alcohol consumption–II. Addiction, 88(6). https://doi.org/10.1111/j.1360-0443.1993.tb02093.x
Savalei, V., & Rosseel, Y. (2022). Computational options for standard errors and test statistics with incomplete normal and nonnormal data in SEM. Structural Equation Modeling: A Multidisciplinary Journal, 29(2).
Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. Human Resource Management Review, 18(4). https://doi.org/10.1016/j.hrmr.2008.03.003
Siedlecki, K. L., Tucker-Drob, E. M., Oishi, S., & Salthouse, T. A. (2008). Life satisfaction across adulthood: Different determinants at different ages? The Journal of Positive Psychology, 3(3). https://doi.org/10.1080/17439760701834602
Skriner, L., & Chu, B. (2014). Cross-ethnic measurement invariance of the SCARED and CES-D in a youth sample. Psychological Assessment, 26(1).
Somaraju, A. V., Nye, C. D., & Olenick, J. (2021). A review of measurement equivalence in organizational research: What's old, what's new, what's next? Organizational Research Methods. https://doi.org/10.1177/10944281211056524
Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15(2). https://doi.org/10.2307/1412107
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item functioning and differential test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89(3), 497–508. https://doi.org/10.1037/0021-9010.89.3.497
Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25. https://doi.org/10.1086/209528
Steinmetz, H., Schmidt, P., Tina-Booh, A., Wieczorek, S., & Schwartz, S. H. (2009). Testing measurement invariance using multigroup CFA: Differences between educational groups in human values measurement. Quality & Quantity, 43.
Strömberg, U. (1996). Collapsing ordered outcome categories: A note of concern. American Journal of Epidemiology, 144(4). https://doi.org/10.1093/oxfordjournals.aje.a008944
Swets, J. A., & Pickett, R. M. (1982). Evaluation of diagnostic systems: Methods from signal detection theory. New York: Academic Press. https://doi.org/10.1016/B978-0-12-679080-1.X5001-4
Thielmann, I., Akrami, N., Babarović, T., Belloch, A., Bergh, R., Chirumbolo, A., Čolović, P., de Vries, R. E., Dostál, D., Egorova, M., Gnisci, A., Heydasch, T., Hilbig, B. E., Hsu, K. Y., Izdebski, P., Leone, L., Marcus, B., Međedović, J., Nagy, J., ... Lee, K. (2020). The HEXACO-100 across 16 languages: A large-scale test of measurement invariance. Journal of Personality Assessment, 102(5). https://doi.org/10.1080/00223891.2019.1614011
Thurstone, L. L. (1947). Multiple factor analysis. University of Chicago Press.
Van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9(4). https://doi.org/10.1080/17405629.2012.686740
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1).
Yoon, M., & Millsap, R. E. (2007). Detecting violations of factorial invariance using data-based specification searches: A Monte Carlo study. Structural Equation Modeling, 14(3). https://doi.org/10.1080/10705510701301677
Yuan, K. H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociological Methodology, 30(1).
Table 1
Descriptives for the UCLA-LS-9 composite by group
Group M SD N SE 95% CI
18-25 18.852 5.218 625 0.209 [18.442, 19.262]
26-35 17.769 5.119 1300 0.142 [17.491, 18.048]
36-45 17.716 5.413 1846 0.126 [17.469, 17.963]
46-55 17.442 5.281 3800 0.086 [17.275, 17.610]
56-65 17.185 5.325 5919 0.069 [17.049, 17.321]
66-75 16.879 5.128 4991 0.073 [16.737, 17.021]
76+ 16.939 5.154 1040 0.160 [16.625, 17.252]
Table 2
Model fit indices
Model       Scaled χ²(df)    p-value  RMSEA*  RMSEA* 90% CI  CFI*   TLI*   SRMR   AIC         BIC         Scaled Δχ²  Δdf  p-value
Configural  1645.995 (168)   0.000    0.062   [.059, .064]   0.981  0.972  0.033  319278.840  320933.482  -           -    -
Metric      1711.683 (204)   0.000    0.057   [.054, .059]   0.981  0.976  0.035  319295.661  320666.650  71.694      36   0.000
Note. * indicates that robust fit indices were reported, i.e., robust RMSEA, robust CFI, and robust TLI.
Table 3
Groups with non-invariant measurement parameters on the UCLA-LS-9
Factor Item Factor Loadings (λ) Intercepts (ν) Unique Factor Variances (θ) f_MACS
1 1 1, 2, 5, 6, 7 1, 2, 4 0.165
F1 4 - 1, 3, 5 7 0.054
8 - 2, 4 3 0.080
3 5 5 6, 7 0.011
F2 5 - 3 1, 7 0.014
7 3 3 1, 6, 7 0.037
2 1 1, 2, 3, 7 1, 2, 3, 4, 5, 6, 7 0.142
F3 6 - - 1 -
9 - - 1, 2, 6 -
Note. "-" indicates that the item parameter was invariant across all seven groups. No items reached strict
invariance, items 6 and 9 reached scalar invariance and items 4, 5, and 8 reached metric invariance. Items
1, 2, 3, and 7 did not reach metric invariance. The column labeled f_MACS displays the effect size of
noninvariance by item.
Table 4
UCLA-LS-9 partial invariance model loadings by group
18-25 26-35 36-45 46-55 56-65 66-75 76+
Factor Item Est. SE Est. SE Est. SE Est. SE Est. SE Est. SE Est. SE
1 0.512 (0.036) 0.626 (0.018) 0.626 (0.018) 0.626 (0.018) 0.626 (0.018) 0.626 (0.018) 0.626 (0.018)
F1 4 0.591 (0.017) 0.591 (0.017) 0.591 (0.017) 0.591 (0.017) 0.591 (0.017) 0.591 (0.017) 0.591 (0.017)
8 0.809 (0.022) 0.809 (0.022) 0.809 (0.022) 0.809 (0.022) 0.809 (0.022) 0.809 (0.022) 0.809 (0.022)
3 0.673 (0.024) 0.673 (0.024) 0.673 (0.024) 0.673 (0.024) 0.656 (0.024) 0.673 (0.024) 0.673 (0.024)
F2 5 0.682 (0.024) 0.682 (0.024) 0.682 (0.024) 0.682 (0.024) 0.682 (0.024) 0.682 (0.024) 0.682 (0.024)
7 0.461 (0.017) 0.461 (0.017) 0.519 (0.024) 0.461 (0.017) 0.461 (0.017) 0.461 (0.017) 0.461 (0.017)
2 0.482 (0.030) 0.404 (0.018) 0.404 (0.018) 0.404 (0.018) 0.404 (0.018) 0.404 (0.018) 0.404 (0.018)
F3 6 0.603 (0.026) 0.603 (0.026) 0.603 (0.026) 0.603 (0.026) 0.603 (0.026) 0.603 (0.026) 0.603 (0.026)
9 0.677 (0.029) 0.677 (0.029) 0.677 (0.029) 0.677 (0.029) 0.677 (0.029) 0.677 (0.029) 0.677 (0.029)
Note. SE values for each estimate are provided in parentheses. Noninvariant parameters are bolded.
Table 5
UCLA-LS-9 partial invariance model measurement intercepts by group
18-25 26-35 36-45 46-55 56-65 66-75 76+
Factor Item Est. SE Est. SE Est. SE Est. SE Est. SE Est. SE Est. SE
1 2.632 (0.037) 2.466 (0.040) 2.657 (0.031) 2.657 (0.031) 2.788 (0.030) 2.862 (0.031) 2.931 (0.037)
F1 4 2.726 (0.032) 2.656 (0.029) 2.781 (0.031) 2.656 (0.029) 2.703 (0.029) 2.656 (0.029) 2.656 (0.029)
8 2.632 (0.037) 2.421 (0.046) 2.632 (0.037) 2.539 (0.041) 2.632 (0.037) 2.632 (0.037) 2.632 (0.037)
3 1.580 (0.028) 1.580 (0.028) 1.580 (0.028) 1.580 (0.028) 1.577 (0.028) 1.580 (0.028) 1.580 (0.028)
F2 5 1.583 (0.029) 1.583 (0.029) 1.554 (0.030) 1.583 (0.029) 1.583 (0.029) 1.583 (0.029) 1.583 (0.029)
7 1.492 (0.020) 1.492 (0.020) 1.506 (0.025) 1.492 (0.020) 1.492 (0.020) 1.492 (0.020) 1.492 (0.020)
2 1.998 (0.030) 1.903 (0.027) 1.844 (0.025) 1.776 (0.020) 1.776 (0.020) 1.776 (0.020) 1.700 (0.026)
F3 6 2.097 (0.030) 2.097 (0.030) 2.097 (0.030) 2.097 (0.030) 2.097 (0.030) 2.097 (0.030) 2.097 (0.030)
9 2.113 (0.033) 2.113 (0.033) 2.113 (0.033) 2.113 (0.033) 2.113 (0.033) 2.113 (0.033) 2.113 (0.033)
Note. SE values for each estimate are provided in parentheses. Noninvariant parameters are bolded.
Table 6
UCLA-LS-9 partial invariance model unique factor variances by group
18-25 26-35 36-45 46-55 56-65 66-75 76+
Factor Item Est. SE Est. SE Est. SE Est. SE Est. SE Est. SE Est. SE
1 0.576 (0.040) 0.490 (0.023) 0.415 (0.007) 0.456 (0.014) 0.415 (0.007) 0.415 (0.007) 0.415 (0.007)
F1 4 0.320 (0.005) 0.320 (0.005) 0.320 (0.005) 0.320 (0.005) 0.320 (0.005) 0.320 (0.005) 0.357 (0.020)
8 0.205 (0.005) 0.205 (0.005) 0.167 (0.013) 0.205 (0.005) 0.205 (0.005) 0.205 (0.005) 0.205 (0.005)
3 0.067 (0.003) 0.067 (0.003) 0.067 (0.003) 0.067 (0.003) 0.067 (0.003) 0.080 (0.006) 0.088 (0.010)
F2 5 0.104 (0.017) 0.075 (0.002) 0.075 (0.002) 0.075 (0.002) 0.075 (0.002) 0.075 (0.002) 0.112 (0.012)
7 0.259 (0.018) 0.221 (0.005) 0.221 (0.005) 0.221 (0.005) 0.221 (0.005) 0.239 (0.008) 0.252 (0.021)
2 0.336 (0.023) 0.364 (0.016) 0.335 (0.013) 0.308 (0.008) 0.250 (0.006) 0.239 (0.006) 0.209 (0.011)
F3 6 0.277 (0.025) 0.199 (0.004) 0.199 (0.004) 0.199 (0.004) 0.199 (0.004) 0.199 (0.004) 0.199 (0.004)
9 0.515 (0.040) 0.408 (0.023) 0.323 (0.007) 0.323 (0.007) 0.323 (0.007) 0.298 (0.011) 0.323 (0.007)
Note. SE values for each estimate are provided in parentheses. Noninvariant parameters are bolded.
Table 7
UCLA-LS-9 partial invariance model latent mean estimates by group
18-25 26-35 36-45 46-55 56-65 66-75 76+
Factor Est. SE Est. SE Est. SE Est. SE Est. SE Est. SE Est. SE
F1 0 (0) -0.029 (0.062) -0.320 (0.055) -0.249 (0.053) -0.437 (0.051) -0.524 (0.052) -0.543 (0.061)
F2 0 (0) -0.140 (0.048) -0.072 (0.047) -0.121 (0.043) -0.119 (0.042) -0.112 (0.042) -0.099 (0.050)
F3 0 (0) -0.130 (0.058) -0.162 (0.055) -0.201 (0.051) -0.280 (0.050) -0.372 (0.051) -0.334 (0.061)
Note. SE values for each estimate are provided in parentheses.
Table 8
UCLA-LS-9 partial invariance model latent variance-covariance estimates by group
18-25 26-35 36-45 46-55 56-65 66-75 76+
F1 F2 F3 F1 F2 F3 F1 F2 F3 F1 F2 F3 F1 F2 F3 F1 F2 F3 F1 F2 F3
F1 1 0.655 0.744 0.929 0.586 0.688 1.116 0.661 0.820 1.047 0.654 0.773 1.081 0.688 0.810 1.060 0.640 0.724 1.083 0.634 0.757
(0) (0.033) (0.038) (0.062) (0.044) (0.052) (0.070) (0.044) (0.056) (0.061) (0.039) (0.048) (0.061) (0.039) (0.048) (0.061) (0.037) (0.044) (0.073) (0.049) (0.061)
F2 0.655 1 0.700 0.586 0.910 0.673 0.661 0.919 0.733 0.654 0.948 0.728 0.688 0.967 0.748 0.640 0.912 0.670 0.634 0.818 0.678
(0.033) (0) (0.032) (0.044) (0.082) (0.058) (0.044) (0.077) (0.057) (0.039) (0.074) (0.052) (0.039) (0.074) (0.051) (0.037) (0.070) (0.046) (0.049) (0.076) (0.060)
F3 0.744 0.700 1 0.688 0.673 1.095 0.820 0.733 1.160 0.773 0.728 1.136 0.810 0.748 1.119 0.724 0.670 1.019 0.757 0.678 1.090
(0.038) (0.032) (0) (0.052) (0.058) (0.106) (0.056) (0.057) (0.108) (0.048) (0.052) (0.101) (0.048) (0.051) (0.098) (0.044) (0.046) (0.090) (0.061) (0.060) (0.109)
Note. Standard errors for the variance-covariance estimates are provided in parentheses below the estimate.
Table 9
Adverse Impact (AI) ratio computed for every possible reference group (PS_total = .07)
Focal Group
Reference Group 18-25 26-35 36-45 46-55 56-65 66-75 76+
18-25 0.834 0.982 0.881 0.954 0.970 0.971
26-35 1.222 1.192 1.061 1.158 1.178 1.180
36-45 1.028 0.842 0.892 0.970 0.986 0.988
46-55 1.153 0.944 1.123 1.089 1.108 1.110
56-65 1.066 0.865 1.035 0.917 1.018 1.020
66-75 1.053 0.839 1.019 0.893 0.980 1.003
76+ 1.052 0.838 1.015 0.892 0.978 0.997
Note. Row labels indicate the reference group with which the AI ratios of the focal groups were computed.
The AI ratio compares the observed proportion selected (PS) for a reference group to the expected PS for a
focal group after matching the latent distribution of the focal group to the latent distribution of the
reference group. When these PS values are equal (i.e., AI = 1), we can conclude that there is no impact of
noninvariance on PS. AI< 1 indicates that individuals in a focal group were selected at a lower rate
compared to individuals in the reference group due to noninvariance. Conversely, AI> 1 suggests that
individuals in the group designated as the reference were selected at a lower rate compared to individuals in
the focal group of interest. We see in the first row that, after accounting for discrepancies due to any
differences in underlying distributions, PS was lower in all groups in comparison to the late adolescence
group (18-25) such that individuals in these groups were not selected at the same rate and may need to
score higher to be selected for further screening.
Table 10
Classification accuracy indices (CAI) under partial and strict measurement invariance
18-25 26-35 36-45 46-55 56-65 66-75 76+
CAI Partial Strict Partial Strict Partial Strict Partial Strict Partial Strict Partial Strict Partial Strict
PS .113 .106 .073 .082 .082 .078 .071 .076 .067 .065 .054 .052 .056 .053
SR .760 .805 .814 .780 .780 .796 .811 .788 .781 .785 .747 .758 .747 .764
SE .804 .799 .721 .785 .811 .788 .756 .786 .792 .782 .788 .769 .784 .769
SP .970 .977 .985 .980 .980 .983 .986 .983 .984 .985 .986 .987 .985 .987
Note. PS, SR, SE, and SP indicate Proportion Selected, Success Ratio, Sensitivity, and Specificity, respectively. CAI are reported for each age group under partial and strict invariance (PS_total = .07). The discrepancies in CAI across groups are subdued under strict invariance compared to partial invariance, e.g., a smaller proportion of individuals from the late adolescence group (18-25) and larger proportions of individuals from all other groups would be selected if the items were invariant.
Table 11
Observed and expected classification accuracy indices under matched distributions.
Focal Group
18-25 26-35 36-45 46-55 56-65 66-75 76+
Reference CAI h h h h h h h
18-25 PS .113 - .094 .062 .111 .006 .099 .044 .108 .016 .109 .011 .109 .011
SR .760 - .837 -.193 .790 -.073 .826 -.164 .801 -.099 .796 -.086 .791 -.076
SE .804 - .738 .157 .821 -.044 .770 .083 .809 -.012 .816 -.031 .813 -.022
SP .970 - .983 -.087 .974 -.026 .981 -.071 .976 -.039 .975 -.032 .974 -.029
26-35 PS .089 -.059 .073 - .086 -.052 .077 -.017 .084 -.043 .085 -.048 .086 -.048
SR .731 .200 .814 - .765 .122 .803 .029 .775 .096 .770 .109 .765 .121
SE .791 -.163 .721 - .808 −.205 .755 -.076 .796 -.174 .803 -.193 .800 -.184
SP .974 .081 .985 - .978 .056 .983 .015 .979 .045 .979 .051 .978 .054
36-45 PS .069 -.008 .085 .049 .082 - .073 .033 .080 .009 .081 .004 .081 .004
SR .829 .082 .745 -.122 .780 - .818 -.094 .792 -.028 .786 -.014 .781 -.003
SE .725 .037 .796 .203 .811 - .758 .129 .798 .033 .806 .013 .802 .022
SP .987 .026 .977 -.054 .980 - .985 -.039 .982 -.012 .981 -.006 .981 -.002
46-55 PS .067 -.041 .079 .016 .082 -.033 .071 - .077 -.024 .078 -.029 .078 -.030
SR .822 .175 .772 -.029 .738 .094 .811 - .784 .066 .778 .080 .773 .092
SE .723 -.090 .810 .075 .794 -.130 .756 - .797 -.097 .804 -.116 .801 -.107
SP .987 .064 .980 -.014 .977 .039 .986 - .982 .028 .981 .034 .981 .037
56-65 PS .058 -.017 .069 .037 .061 -.009 .071 .022 .067 - .068 -.005 .068 -.005
SR .819 .114 .769 -.095 .808 .030 .732 -.066 .781 - .775 .014 .770 .026
SE .718 .002 .806 .172 .751 -.036 .791 .097 .792 - .800 -.020 .796 -.011
SP .989 .036 .983 -.039 .987 .012 .980 -.026 .984 - .984 .006 .983 .009
66-75 PS .045 -.013 .055 .040 .048 -.005 .053 .026 .056 .005 .054 - .054 -.001
SR .794 .105 .740 -.112 .782 .018 .754 -.083 .700 -.015 .747 - .742 .013
SE .702 .024 .795 .197 .737 -.017 .779 .120 .778 .021 .788 - .784 .009
SP .990 .028 .985 -.042 .989 .006 .986 -.030 .982 -.005 .986 - .985 .003
76+ PS .058 -.017 .069 .037 .061 -.009 .071 .022 .068 -.005 .068 -.005 .056 -
SR .819 .114 .769 -.095 .808 .030 .732 -.066 .775 .014 .770 .026 .747 -
SE .718 .002 .806 .172 .751 -.036 .791 .097 .800 -.020 .796 -.011 .784 -
SP .989 .036 .983 -.039 .987 .012 .980 -.026 .984 .006 .983 .009 .985 -
Note. The table illustrates observed classification accuracy indices (CAI) for the reference group and
expected CAI for the focal groups computed for each possible reference group. Expected CAI for a focal
group refers to the CAI values we would expect to see for that focal group if its latent distribution was
matched to the latent distribution of the reference group. Observed CAI for the reference group are
italicized. PS, SR, SE, and SP indicate Proportion Selected, Success Ratio, Sensitivity, and Specificity
respectively. For each focal group, Cohen’s h values are reported for the discrepancy between the observed
CAI for the reference group and the expected CAI for that focal group. As the underlying distributions are
matched, any discrepancy observed is attributable to measurement bias (noninvariance). Cohen’s h values
greater than or equal to .2 (indicating a small effect size) are bolded. Values are reported for PS_total = .07.
Table 12
Classification accuracy indices for the two group scenario.
18-45 46+
CAI Partial Strict Partial Strict E[46+] h
PS 0.081 0.081 0.063 0.063 0.080 0.004
SR 0.785 0.792 0.782 0.776 0.798 -0.032
SE 0.787 0.789 0.780 0.779 0.789 -0.007
SP 0.981 0.982 0.985 0.985 0.982 -0.010
Note. PS, SR, SE, and SP indicate Proportion Selected, Success Ratio, Sensitivity, and Specificity. CAI are reported for each age group under partial and strict invariance (PS_total = .07). When the seven age groups are collapsed into two groups (18-45 and 46+; reference and focal group, respectively), the discrepancies in CAI observed in Tables 10 and 11 become obscured. Collapsing across groups to achieve a binary grouping leads to a loss of information.
Figure 1
A confusion matrix for the hypothetical example.
Note. The rows indicate the actual classification of an individual based on their latent standing on the
construct of interest (i.e., the ground truth). The columns indicate the classification determined (predicted)
using the observed test scores. Highlighted cells correspond to instances when an accurate decision was
made.
Figure 2
A path diagram for the partial invariance model for late adolescents (ages 18-25)
Note. The dotted lines represent the fixed latent mean and latent variances in the first group. Parameters
that are noninvariant for the late adolescent group were flagged with asterisks (*). The path diagram was
created using semPlot (Epskamp, 2015).
Figure 3
Classification accuracy indices at different proportions of selection by group.
(a) Proportion Selected (PS), partial invariance (b) Proportion Selected (PS), strict invariance
(c) Success Ratio (SR), partial invariance (d) Success Ratio (SR), strict invariance
(e) Sensitivity (SE), partial invariance (f) Sensitivity (SE), strict invariance
(g) Specificity (SP), partial invariance (h) Specificity (SP), strict invariance
Note. The figures illustrate the classification accuracy indices at different proportions of selection (PS_total) under partial and strict invariance by group.
Figure 4
The joint bivariate distributions of observed and latent scores for each age group.
(a) Partial invariance
(b) Strict invariance
Note. The figures illustrate the joint bivariate distributions of observed and latent scores for each age group under partial and strict measurement invariance. Quadrants A, B, C, and D correspond to True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) rates, respectively, shown for age groups 18-25 (late adolescence), 26-35 (early young adulthood), 36-45 (late young adulthood), 46-55 (early middle adulthood), 56-65 (late middle adulthood), 66-75 (early old age), and 76+ (middle-oldest old age).
Appendix A
Table A1
Item-level descriptive statistics by age group.
18-25 26-35 36-45 46-55 56-65 66-75 76+
Factor Item M SD M SD M SD M SD M SD M SD M SD
1 2.632 0.916 2.448 0.907 2.463 0.947 2.499 0.925 2.515 0.910 2.534 0.915 2.591 0.927
F1 4 2.726 0.805 2.640 0.817 2.592 0.848 2.512 0.830 2.444 0.835 2.345 0.827 2.328 0.851
8 2.631 0.932 2.398 0.909 2.371 0.937 2.338 0.942 2.278 0.957 2.207 0.949 2.197 0.965
3 1.585 0.723 1.488 0.697 1.532 0.700 1.498 0.707 1.499 0.695 1.505 0.696 1.503 0.669
F2 5 1.569 0.761 1.480 0.697 1.505 0.704 1.501 0.715 1.503 0.722 1.508 0.713 1.515 0.704
7 1.514 0.670 1.447 0.660 1.469 0.683 1.438 0.646 1.434 0.656 1.429 0.651 1.476 0.663
2 1.998 0.754 1.851 0.750 1.778 0.736 1.708 0.705 1.654 0.656 1.628 0.633 1.564 0.614
F3 6 2.115 0.800 2.033 0.781 1.998 0.792 1.966 0.784 1.925 0.773 1.877 0.756 1.908 0.782
9 2.083 0.988 1.999 0.940 2.004 0.911 1.983 0.917 1.936 0.919 1.852 0.877 1.869 0.915
Table A2
UCLA-LS-9 item correlations using the full sample.
Factor Item 1 Item 4 Item 8 Item 3 Item 5 Item 7 Item 2 Item 6 Item 9
Item 1 1
F1 Item 4 .502 1
Item 8 .600 .656 1
Item 3 .453 .407 .525 1
F2 Item 5 .456 .413 .525 .855 1
Item 7 .413 .351 .465 .622 .622 1
Item 2 .329 .365 .429 .432 .420 .417 1
F3 Item 6 .398 .397 .497 .502 .489 .465 .536 1
Item 9 .408 .403 .502 .512 .506 .470 .442 .647 1
Table A3
UCLA-LS-9 item correlations by age group
Group Item Item 1 Item 4 Item 8 Item 3 Item 5 Item 7 Item 2 Item 6 Item 9
Item 4 .389 1
Item 8 .487 .641 1
Item 3 .399 .408 .510 1
18-25 Item 5 .424 .454 .503 .847 1
Item 7 .358 .329 .439 .592 .579 1
Item 2 .280 .350 .433 .428 .462 .395 1
Item 6 .277 .395 .467 .471 .450 .436 .483 1
Item 9 .218 .419 .470 .426 .408 .414 .394 .545 1
Item 1 1
Item 4 .425 1
Item 8 .538 .619 1
Item 3 .436 .413 .480 1
26-35 Item 5 .428 .413 .478 .861 1
Item 7 .411 .364 .452 .629 .611 1
Item 2 .323 .325 .363 .407 .393 .408 1
Item 6 .332 .387 .456 .485 .463 .470 .494 1
Item 9 .317 .388 .440 .458 .447 .484 .398 .594 1
Item 1 1
Item 4 .550 1
Item 8 .636 .683 1
Item 3 .456 .446 .542 1
36-45 Item 5 .439 .424 .519 .860 1
Item 7 .441 .395 .506 .672 .650 1
Item 2 .317 .348 .413 .430 .422 .441 1
Item 6 .415 .442 .505 .519 .499 .515 .549 1
Item 9 .427 .449 .530 .517 .512 .506 .427 .649 1
Item 1 1
Item 4 .495 1
Item 8 .591 .666 1
Item 3 .454 .407 .523 1
46-55 Item 5 .458 .414 .530 .864 1
Item 7 .431 .371 .459 .638 .630 1
Item 2 .324 .350 .410 .423 .404 .414 1
Item 6 .376 .400 .497 .506 .499 .483 .528 1
Item 9 .397 .409 .503 .519 .514 .484 .435 .649 1
Item 1 1
Item 4 .522 1
Item 8 .623 .654 1
Item 3 .474 .415 .543 1
56-65 Item 5 .479 .418 .547 .867 1
Item 7 .438 .354 .484 .631 .644 1
Item 2 .346 .366 .438 .452 .446 .423 1
Item 6 .430 .405 .520 .523 .510 .476 .551 1
Item 9 .442 .411 .527 .534 .530 .477 .462 .663 1
Item 1 1
Item 4 .520 1
Item 8 .610 .648 1
Item 3 .444 .394 .519 1
66-75 Item 5 .446 .408 .520 .846 1
Item 7 .380 .319 .442 .589 .601 1
Item 2 .347 .360 .430 .427 .416 .415 1
Item 6 .415 .353 .471 .478 .472 .429 .533 1
Item 9 .435 .367 .479 .511 .512 .452 .437 .650 1
Item 1 1
Item 4 .516 1
Item 8 .593 .625 1
Item 3 .429 .381 .514 1
76+ Item 5 .447 .421 .527 .791 1
Item 7 .359 .331 .441 .570 .573 1
Item 2 .343 .366 .463 .490 .450 .425 1
Item 6 .398 .380 .485 .496 .473 .427 .538 1
Item 9 .388 .345 .467 .496 .463 .430 .462 .650 1