The Limits of Unidimensional Computerized Adaptive Tests for Polytomous Item Measures

Kevin T. Petway, II

University of Southern California

Table of Contents

Abstract
Body
  I. Introduction
  II. Study 1: Analyses with Real Data
  III. Study 2: Simulation Study of Measure Properties
  IV. Study 3: Real Data CAT and Multidimensionality
  V. General Discussion
References
Tables
Figures
Appendices
  I. Appendix A
  II. Appendix B
  III. Appendix C

Abstract

This dissertation investigated the usefulness of computerized adaptive tests (CATs) based on Samejima's (1969) unidimensional graded response model (GRM) for polytomous item measures, and had two aims: (1) determine to what extent multidimensionality impacted the accuracy of CAT estimates; and (2) isolate the measure properties impacting the CAT process. The author conducted three studies to address these aims. The first utilized three measures, each with real data, and evaluated them using standard techniques for measure reduction and unidimensional CAT. In addition, the first study related certain measure properties (e.g. number of items) to differences in CAT and short form performance. The second study, a series of simulations, evaluated the relationship between unidimensional CAT accuracy and six properties: item pool size, number of response options, item discriminations, item difficulty (also skew/kurtosis), number of dimensions, and the observed correlation between dimensions. These simulations suggested dimensionality was most important for accuracy, consistent with past literature. Discriminations better explained the mean number of items selected by the CAT, and more than 12 items were recommended for a unidimensional CAT to perform better than the short form methods. The final study compared unidimensional results from Study 1 to dimensions from more valid multidimensional models. Results indicated unidimensional CATs handled minor multidimensionality (e.g. two highly correlated dimensions) well. Greater multidimensionality was detrimental to the CAT process, though good discriminators may have provided some protection when there was evidence of a strong general dimension. The implications of this for multidimensional adaptive tests utilizing presumably unidimensional subscales are discussed.

The Limits of Unidimensional Computerized Adaptive Tests for Polytomous Item Measures

Computers have been available for decades now, but only relatively recent advancements have allowed most people access to quick, powerful, affordable computers in their own homes. For researchers interested in the measurement of some construct, this development was an advantageous one. Increased accessibility meant more experience with computers in the general population, justifying the use of computers as a medium for various testing purposes. Computers became a way to stabilize the test-taking process. For instance, a set of cognitive measures could be administered to participants via computers, potentially removing many of the performance biases attributable to the test administrator (e.g. due to the influence of administrator expectations). Further, computers could be used to streamline testing, ensuring everyone received the same amount of time and feedback.
As technology continued to evolve, the options afforded researchers expanded in tandem. The Internet allowed researchers to interact with people from all over the country and the world, so it was no surprise it became a popular place for those interested in non-cognitive measures. A recent Google search of the term "personality test" resulted in hundreds of links leading to a plethora of personality measures designed by everyone from high-level personality researchers to the layperson. Indeed, personality tests have become quite ubiquitous, found now on almost all popular matchmaking and social networking websites, highlighting both an ever-growing interest in and the perceived importance of non-cognitive measurement to researchers and non-researchers alike. However, this recent trend towards Internet-based test administration stresses the importance of measurement brevity. The lengthier a test, the more likely a participant will choose not to start or complete it (Rammstedt & John, 2007), and this threat is amplified for participants over the Internet due to the medium's more noncommittal nature.

Many researchers have responded to this issue by using simple short forms (i.e. shortened versions of a measure). The practice of making short forms to mitigate problems such as participant fatigue and boredom is nothing new in the testing world, but computers have motivated more extensive use of tests that adapt to the user's response behavior. A notable problem with short forms (and one motivation behind adaptive testing) is the implicit assumption that the selected items best measure everyone in a given sample. Many researchers have found this to be a persistent and somewhat unreasonable assumption (Fliege et al., 2005; Forbey & Ben-Porath, 2007), though any improvement in accuracy or reliability obtained with an adaptive test could be minimal enough relative to a short form to justify the simpler short form.

In adaptive testing, items providing maximal information about a person's trait or behavior are selected sequentially from a pool. The goal is typically to home in on an accurate estimate of the person's score without requiring all items. An added benefit is that, depending on algorithm complexity, participants will not typically receive the same set of items. Computers allow for these more complex item-selection algorithms, leading to what is now called computerized adaptive testing (CAT). Researchers have since found applications for CAT across a wide array of disciplines ranging from educational testing (e.g. the GRE) to clinical assessment, and it is generally assumed CAT, with more comprehensive algorithms, can provide more accurate estimates of a trait, ability, or behavior than short forms of equivalent length (Forbey & Ben-Porath, 2007; Waller & Reise, 1989).

Adaptive tests are currently designed using pre-collected full-measure data, and are typically based on item response theory models. Most applications involve binary items (e.g. true/false measures), or items with a single correct answer amongst multiple incorrect answers (e.g. most academic tests). However, measures with items utilizing response scales with several ordered response categories (i.e. polytomous items) have received a good amount of attention in the CAT literature as well, especially over the last decade.
It is increasingly evident polytomous item measures work just as effectively under a CAT framework, but it is reasonable to assume not every polytomous item measure benefits from a computerized adaptive test. For instance, Reise and Henson (2000) used an IRT-based CAT to shorten the facet scales of the NEO PI-R, and administered this CAT to a large group of college students. Though they noted minimal loss of accuracy (r > .90) with only half of the items from each scale (approximately four items) under CAT, they found simply administering the four items contributing the most information (as a short form) for 23 of the 30 facet scales produced nearly identical results to the adaptive tests, and had the added advantage of simplicity. It may have helped these researchers to know what aspects of their personality measure (e.g. pattern of item discriminations) affected the usefulness of the CAT before administering it to a real sample.

When the CAT algorithm does not appropriately account for the constructs, or the relationships between them, as was the case in the Reise and Henson study, a unidimensional CAT may look no different from (or may erroneously look better than) simpler short forms. This can be particularly troublesome if one is unaware of any multidimensionality in the measure. Reise and Henson used unidimensional CATs instead of a multidimensional CAT for the NEO PI-R, which exhibits multidimensionality by design. They evaluated each of the 30 8-item facets separately, ignoring the observed correlations between them. It is possible a multidimensional adaptive test (MAT) would have proved beneficial by allowing them to account for items that inform more than one facet. However, while MATs offer useful ways to deal with multiple dimensions, they often rely on statistical models ignorant of the dimensionality possibly present in a measure's subscales. Consequently, the subscales are assumed unidimensional even in cases where this assumption does not hold. For a measure like the NEO PI-R, where the facets themselves may not truly be unidimensional constructs, a typical MAT application could still err.

Three studies were carried out here to address the abovementioned issues in a step-wise manner. The first compared various methods of measure reduction using real data for three polytomous item measures. All selected measures were originally designed to tap into a single construct but were actually multidimensional to some extent, and two of these measures were subscales. This improved understanding of the limits of unidimensional CATs (and short forms) when unintended multidimensionality was present, and produced implications that could generalize to MAT applications treating subscales as unidimensional. Three different short forms were created based on traditional approaches and compared to determine which produced the most accurate and reliable short form. In addition, adaptive tests were approximated for each measure. These real data analyses guided the second study, which investigated the factors related to adaptive test accuracy. Through a series of simulations, several properties associated with measures (e.g. dimensionality, item pool size) were evaluated to determine how they interacted with the adaptive process.
The third and final study matched the simulation results to the original measures from the first study to determine how helpful the findings of the second study could be to researchers interested in adapting their own measures. The third study also related unidimensional estimates from the first study to estimates obtained for the dimensions from each measure's multidimensional models. This provided a way to see how well the unidimensional CATs and short forms approximated more accurate representations of the underlying constructs for each measure.

To summarize, these studies had three general goals: (1) determine how severe multidimensionality needed to be before estimates from the unidimensional CAT were invalid, and determine what other aspects of a measure could affect this (if any); (2) determine under what circumstances a short form would function better than or just as well as a CAT, and better understand how measure properties might relate to the various reduction techniques; and (3) establish a preliminary guide to help identify "CAT-friendly" measures using the simulation study results and findings from the third study. Ultimately, the studies could provide researchers useful information to either motivate them to use an approach described and carried out by the author, or facilitate their search for alternate methods.

Background

Scale-level Approaches to Measure Reduction. Traditionally, measure reduction techniques relied on measurement properties associated with a class of psychometric theory often called "Classical Test Theory" (CTT; Novick, 1966; Allen & Yen, 2002). This involved an explicit focus on scale-level reliability, a series of indexes stemming from CTT's basic concept of a true score. Here, a person's observed score, which is obtained from the test, is the aggregation of a person's true score (their actual trait level) and some amount of error (i.e. X = T + e). The error (e) is itself a theoretical aggregation of several different types of noise, some random, some not. The best-case scenario is one where no error exists, thus the true score (T) is simply the one observed (X). The general premise of CTT is straightforward, though idealistically so, as the true score is arguably never observed due to the continued presence of some kind of interference. Relying on the observed score could lead one to misrepresent the construct of interest. Despite this, CTT concepts are still a popular approach to test diminution (e.g. McDonald, 1999; Robins, Hendin & Trzesniewski, 2001; Saucier, 1994).

Researchers often evaluate an item's relationship to the measure's overall score to determine its value. Items with lower item-total correlations can be perceived as poorer indicators of the construct of interest. The poorest correlates might be removed in an effort to have a better measure of the central construct. When shortening a measure, most researchers aim to maintain an arbitrarily "acceptable" level of reliability, usually no smaller than .70 (Cortina, 1993), or about 50% overlap, and it is clear removing items with the lowest item-total correlations tends to work best with reliability indexes.
Factor analysis is another popular method for measure reduction (see McDonald, 1985). Used often for construct validity, factor analysis attempts to explain the relationship between items through latent variables (or factors). These latent variables represent unobserved constructs the items are designed to measure. Items "load" on the factor, and items with larger loadings are assumed more related to (and indicative of) the latent construct (McDonald, 1999). As with the observed total score from above, if one believes the latent variable is what they think it is, removing items with the smallest factor loadings is one way to reduce a measure using factor analysis. The need for a single factor or construct has been prominent in this work (Rasch, 1960).

Item Response Theory. Item Response Theory (IRT; Hambleton, Swaminathan & Rogers, 1991) and Rasch (Rasch, 1966) models offer an alternative to CTT's measure reduction approach. IRT models emphasize item-level information, contrasting CTT's scale-level focus. As Zickar (1998) pointed out, CTT's true score does not generalize across instruments of similar types because the score is tied directly to the instrument from which it originates. In addition, the item and person parameters in CTT are not necessarily invariant across samples (Hambleton & Jones, 1993). The consequence of this is decreased utility of any findings. The prevalence of CTT models can be attributed to ease: IRT models make stronger assumptions, which are more difficult to meet with actual data, whereas CTT models make simpler assumptions at the expense of generalizability. IRT models include sample-independent parameters if the particular model fits the data used. A researcher can assess performance across tests because item responses are tied to a person's trait level and all of the model parameters are presented on the same scale as the trait. Another component of the models, item characteristic functions (or curves), provides researchers a basis for identifying good and bad items, since these functions directly relate an item's endorsement probability to the trait level (Zickar, 1998; Fraley, Waller, & Brennan, 2000; Hambleton & Jones, 1993). The trait (or ability) level is often referred to as theta (θ), an attribution originally made by Lord (1974). In a general three-parameter logistic (3PL) model, the parameters are written as expressed in Equation 1:

P_i(θ) = c_i + (1 − c_i) / (1 + e^(−a_i(θ − b_i))).    (1)

In this general form, Equation 1 applies to any dichotomous item and provides the probability of endorsing the item given θ and the item's discrimination (a), difficulty (b), and pseudo-guessing (c) parameters. The discrimination represents an item's ability to distinguish between people who have very similar θ estimates. In an item information curve (IIC), which shows how informative an item is across the range of θ, an item with a higher discrimination exhibits a taller, more concentrated peak. Difficulty parameters identify the point along the trait continuum at which a person has a (1 + c)/2 probability of endorsing the item with the higher category. In an item characteristic curve (ICC), more difficult items are shifted rightward along the trait continuum, indicating higher trait levels are generally needed to endorse the higher category (Simms & Clark, 2005; Bock, 1972; Fraley et al., 2000). Finally, pseudo-guessing is traditionally an estimated parameter (Han, 2012) representing the height of the lower asymptote of the item ICC. When c = 0, difficulty indicates the theta where there is a 50% chance of selecting the higher category; this is the appropriate definition of difficulty in the one-parameter logistic (1PL) and later-discussed two-parameter logistic (2PL) models. Pseudo-guessing is included to account for the performance of individuals with very low θ estimates (Bock & Aitkin, 1981; Hambleton & Jones, 1993). The 3PL model is usually associated with multiple-choice tests, where guessing is of practical interest.
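To make Equation 1 concrete, the following is a minimal R sketch of a 3PL item characteristic curve; the parameter values (a = 1.5, b = 0.5, c = .2) are arbitrary illustrations rather than estimates from any measure discussed here.

    # 3PL item response function (Equation 1); parameter values are hypothetical
    p_3pl <- function(theta, a, b, c) {
      c + (1 - c) / (1 + exp(-a * (theta - b)))
    }

    theta <- seq(-4, 4, by = 0.1)
    plot(theta, p_3pl(theta, a = 1.5, b = 0.5, c = 0.2), type = "l",
         xlab = "theta", ylab = "P(endorsement)")
    abline(h = (1 + 0.2) / 2, lty = 2)  # at theta = b, P equals (1 + c)/2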
The 2PL model removes the components of Equation 1 associated with the pseudo-guessing parameter (Birnbaum, 1968):

P_i(θ) = 1 / (1 + e^(−a_i(θ − b_i))).    (2)

Equation 2 can be reworked to indicate the logit associated with a particular response:

logit(y = 1) = a_i(θ − b_i).    (3)

Lastly, the most reduced form is the 1PL model, which constrains the discrimination parameter. It is analogous to the Rasch model, though there are subtle differences between the two that will not be discussed here. Both assume the discrimination is constant across items.

Several variants of these models have been created for use with polytomous items. Seemingly the most widespread model, and perhaps best fitting (Maydeu-Olivares, 2005) for this type of item, is Samejima's (1969) graded response model (GRM), an extension of the 2PL model. Samejima was among the first researchers to conceptualize polytomous items as a series of ordered dichotomous responses (McDonald, 1999), thus allowing the two-parameter logistic model to generalize to items with more than two response options. Under the graded response model, each item has K = J − 1 difficulty parameters, where J is the number of response options. For instance, an item with J = 4 response options enumerated j = 1 to 4 produces K = 3 dichotomies: (1) response option 1 versus 2, 3, and 4 (k = 1); (2) response options 1 and 2 versus 3 and 4 (k = 2); and (3) response options 1, 2, and 3 versus 4 (k = 3). The probability of endorsing a given response option is the difference between the probabilities of responding above the two dichotomies that bound it, as indicated by Equation 4:

P_ij(θ) = 1 / (1 + e^(−a_i(θ − b_ik))) − 1 / (1 + e^(−a_i(θ − b_i,k+1))),    (4)

where k indexes the dichotomy just below response option j. The probability of responding above the (implicit) boundary below the lowest category is fixed at 1.0, and the probability of responding above the highest category is fixed at 0.0. While this model identifies multiple item difficulties, there is still only one discrimination parameter per item (Samejima, 1997). By including only a single discrimination per item, the GRM assumes the process used to select a particular response is consistent across the set of response categories (Cohen, Kim, & Baker, 1993). Also, because the GRM assumes truly ordered response categories, any violations of this would likely produce inaccurate estimates.

IRT and Rasch models are usually the basis for adaptive tests because they provide an abundance of item-level information. For polytomous item measures, the graded response model tends to be the most popular. A number of studies have applied Samejima's model to measures of personality and behavior (e.g. Reise & Waller, 1990; Reise & Henson, 2000; Smits, Cuijpers, & van Straten, 2011; Simms et al., 2011; Gibbons et al., 2007; Gibbons et al., 2008; Simms & Clark, 2005), a few of which used the model as the basis for adaptive testing. Their findings suggested use of adaptive tests produced considerable reductions in the number of items necessary with only minor accuracy losses.
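Since all analyses in this dissertation were run in R, a small R sketch may help fix ideas. The function below computes graded response category probabilities per Equation 4 for one hypothetical item with J = 4 options; the discrimination and difficulty values are invented for illustration.

    # GRM category probabilities (Equation 4) for one item; a and b are hypothetical
    grm_probs <- function(theta, a, b) {
      # Boundary curves P*(k): probability of responding above each of the
      # K = J - 1 dichotomies, bracketed by 1 (below lowest) and 0 (above highest)
      p_star <- c(1, 1 / (1 + exp(-a * (theta - b))), 0)
      # Category probabilities are differences of adjacent boundary curves
      p_star[-length(p_star)] - p_star[-1]
    }

    grm_probs(theta = 0.5, a = 1.8, b = c(-1.5, 0, 1.2))  # J = 4 probabilities summing to 1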
Study 1: Analyses with Real Data

Study 1 had multiple purposes. The first involved preliminary analyses of the chosen non-cognitive measures to identify the parameters necessary for short form development and the computerized adaptive tests. In addition, multidimensionality was examined, making these preliminary analyses the first step towards determining what could affect the performance of the CATs and the short forms. The second purpose was to evaluate the measures as computerized adaptive tests (CATs). The CATs provided a means to investigate how specific combinations of measure properties possibly impacted the adaptive process, informing later studies. A common approach to CAT performance assessment is a comparison between the CAT and simpler fixed-length short forms. CATs often complicate the test-taking process because they require, as the name suggests, computers, as well as an efficient algorithm to quickly but accurately administer the test. As a result, the third purpose of Study 1 involved identifying when and why CATs performed better than short forms, in the unidimensional context, to determine the potential value of the CAT for these non-cognitive measures. This worked both ways, as it also allowed for a better understanding of how and why a CAT might underperform relative to short forms. Finally, to make the simulated data in Study 2 as realistic as possible, estimates of discrimination and difficulty parameters from Study 1 were used to guide the values of these parameters in the simulations. Study 1 also allowed for a better understanding of the relationship between the central moments (skew and kurtosis) and the parameters of the graded response model, which was necessary to approximate specific ranges of values for the moments in simulation.

Method

Measures

Traditional Punishments (TP) subscale. The traditional punishments subscale is one of two in the Questions About Justice (QAJ) measure, designed to assess one's beliefs in and acceptance of punishment methods. The traditional punishments subscale focused on harsher, more "traditional", or archaic methods (see Table 1), contrasting "modern" methods emphasizing rehabilitation and reintegration (i.e. the modern punishments subscale). The QAJ itself is a component of a much larger array of measures created by the Your Morals team (YourMorals.org; Ditto et al., n.d.), who have tasked themselves with better understanding moral and political attitudes. Data collection for the measure commenced in 2007 and continues presently, with participant ages ranging from 18 to over 99 years old. The study has attracted participants from all over the world, with about 21% of the QAJ respondents residing outside of the United States at the time of testing.

The QAJ contains 19 items, 12 of which address TP. Traditional punishments are supposed to align with conservative attitudes; as a result, higher scores indicate greater conservatism. All items were measured on a 7-point Likert-type scale. While over 200,000 people completed at least one of the measures on the website, approximately 8,920 responded to the QAJ, with 8,426 remaining after removing cases with missing TP responses. Cronbach's alpha for TP after removing incomplete cases was .830, with an average inter-item correlation of .290 and an average polychoric correlation of .350.
Need for Cognition (NC) Scale. The 18-item need for cognition scale (Cacioppo, Petty & Kao, 1984) is a shortened version of a 34-item scale developed by Cacioppo and Petty (1982). The researchers defined the need for cognition as one's enjoyment of and engagement in cognitively demanding tasks, a re-imagining of Cohen, Stotland, and Wolfe's (1955) original concept. It could simultaneously be considered an indicator of behavior and personality. Every item is measured on a 5-point Likert-type scale, and nine of the 18 items are negatively worded. Refer to Table 2 for a summary of each item.

Data for the NC scale came from two different longitudinal studies of cognition. The first, Cognition and Aging in the USA (CogUSA), had a representative sample of older adults living in the 48 contiguous United States (Fisher, Rodgers, McArdle, & Kadlec, 2010). Since CogUSA focused on older adults, most of the participants were over the age of 50, with an observed age range of 41 to 99. CogUSA's 2008 sample was used here. The second study, called the Virginia Cognitive Aging Project (VCAP; e.g. Salthouse & Siedlecki, 2007), included a cross-sectional pool of new participants residing in or around Charlottesville, Virginia from study years 2001 to 2007. Participants were anywhere between 20 and 102 years old at the time of testing. Initial sample sizes for CogUSA and VCAP were n = 1,434 and 3,560, respectively (N = 4,994). These were combined to form a single, larger data set with a reduced total sample size of N = 4,478 after removing cases with missing data. All analyses with the NC scale used this reduced total sample (i.e. CogUSA and VCAP were not evaluated separately at any point). Cronbach's alpha for the need for cognition (NC) scale was .894 for this overall sample, with an accompanying average inter-item correlation of .320 and an average polychoric correlation of .375.

Mature Personality (MP) subscale. Mature personality is a subscale of the Student Activities Inventory (SAI), a 150-item scale assessing various personality characteristics. Mature personality, measured with 24 of the 150 items, was the largest construct in the SAI. It is most analogous to the construct conscientiousness, which is synonymous with and often presented as diligence, thoroughness, meticulousness, and studiousness. Each item of the MP subscale is measured on a 5-point Likert-type scale, and items 2, 10, 13, 16, and 20 are negatively worded.

Data for the SAI and the MP subscale came from Project TALENT (ProjectTalent.org; Flanagan et al., 1960), a large, nationally representative sample of mostly high school students who were evaluated extensively in 1960. The study coordinators included measures addressing educational competencies, clerical skills, personality, and future aspirations. While Project TALENT is formally a longitudinal study, many of the cognitive and personality measures were given only during the 1960 wave of testing; for instance, SAI data were present only in this wave. Student ages ranged from 13 to 18 years old. The full Project TALENT sample involved N ≈ 400,000 students from over 1,200 high schools all across the country (American Institutes for Research, 2012). A smaller file, which was not publicly released, included only 13,478 students (4% of the original sample) chosen randomly from the full sample.
Unlike the full file, which had only scale-level data, this file contained all of the item-level data coded from Project TALENT, so the present study made use of it instead of the larger, scale-level data file. After removing cases with missing data, the usable sample size dropped to N = 6,897.

Initial investigation of this measure revealed three of the reverse-scored items from this scale were incompatible with the remaining scale items: they all demonstrated very poor inter-item and item-total correlations (less than .1), equivalently poor factor loadings, and incongruent IRT beta coefficients. These three items were removed, as they would invariably have been selected last by any reduction technique and their inclusion would likely have worsened results. Removing them produced a 21-item measure of mature personality. Cronbach's alpha for the MP subscale without these three items was .863, a bit higher than when these items were included (α = .830). Table 3 provides a list of the retained items with their means and standard deviations.

Preliminary Evaluation of Measures

All analyses were conducted using the computer program R. Within R, the packages 'psych' (Revelle, 2012), 'lavaan' (Rosseel, 2012), and 'ltm' (Rizopoulos, 2006) were used. Assessments of reliability were carried out using the alpha function from the 'psych' package. Estimates of Cronbach's alpha, average inter-item correlations, and item-total correlations were obtained from this function's output.

Item response theory analyses were carried out using the grm function of the 'ltm' package. This function was used to estimate parameters for measures with items conforming to the assumptions of Samejima's (1969) graded response model. Discriminations, category difficulties, and information were estimated for each item. Test information could be estimated from the results as well. The approximate trait range within which the measure worked best was determined by isolating the theta minimum and maximum associated with an information value of five. This value represented the amount of information necessary to achieve an approximate IRT reliability estimate of .80 (since IRT reliability is approximately 1 − 1/I, an information value of 5 yields 1 − 1/5 = .80), used here as the minimum.

Evaluation of the stability of the one-factor model for each measure required the use of the cfa function in 'lavaan'. Since these measures produced ordinal data, the mean- and variance-adjusted weighted least squares (WLSMV) estimator was used for more accurate estimation of model parameters (Beauducel & Herzberg, 2006; Flora & Curran, 2004; Yu, 2002). The root mean square error of approximation (RMSEA) and Tucker-Lewis Index (TLI) were used jointly to make decisions about the fit of these structural (i.e. confirmatory) factor analysis (SFA) models.

Exploratory factor analysis (EFA) was carried out using the irt.fa function within the 'psych' package. This was used to determine whether an alternate factor structure was more appropriate for the data. The function returns IRT parameters instead of the conventional factor analysis parameters, though this is a secondary step after a traditional EFA has been run with estimated polychoric correlations. Polychoric correlations are inherent to these procedures because this type of correlation makes more accurate assumptions about ordinal variables than the traditional Pearson's correlation. The number of factors to test is specified by the user prior to running the function.
The output provides a list of items best associated with specific factors (shown based on a user-specified cutoff), factor correlations, and estimates of the TLI and RMSEA. Conventionally, an RMSEA less than or equal to .05 (Browne & Cudeck, 1992), or a TLI greater than or equal to .95, identifies a good-fitting model (Yu, 2002). There have been studies suggesting an RMSEA less than .08 (or even .1) can be used to identify an adequately fitting model (e.g. Browne & Cudeck, 1992; Chen, Curran, Bollen, Kirby, & Paxton, 2008). Similarly, some have suggested a TLI greater than the more lenient cutoff of .9 indicates a sufficiently good model (e.g. Marsh, Hau, & Wen, 2004). Stricter cutoffs were used in the present study (RMSEA ≤ .06 and TLI ≥ .95). For the EFA, an additional requirement was a model with a theoretically feasible factor structure.

Models identified using the exploratory approach described above were then reexamined using structural factor analysis (again, via 'lavaan'). Simple structure was imposed: items demonstrating meaningful loadings on more than one factor were constrained to associate only with the factor on which the item loaded highest. Fit criteria remained consistent with the abovementioned. These multidimensional models were the basis for Study 3, as they were held as the "true" model, so the use of two fit criteria (the RMSEA and the TLI) was a way to strengthen the validity of the model. For example, a model that satisfied both the RMSEA and TLI criteria was considered a truer representation of the data than a model that met only one of the criteria.

As already indicated, cases with missing data were removed from this and the other measures because the various methods of analysis and the measure reduction techniques dealt with missing observations differently. Removing cases with missing data ensured the same number of trait estimates was produced in all situations before comparison. This is further explored in the present study's discussion section.

Calibration and Validation Data for Measure Reduction

Before the reduction techniques were applied to the measures, two subsets of each data set were formed. The first subset represented the calibration sample, and had a fixed sample size of N = 2,500. While an N of 500-1,000 is often considered sufficient for estimation of IRT parameters (e.g. Lord, 1980; Hulin, Lissak, & Drasgow, 1982), CTT statistics are not considered stable across samples; a larger sample size provided more confidence in the CTT-derived statistics. The second subset represented the validation sample, and had a size equal to the remaining N (5,926, 2,278, and 4,397 for TP, NC, and MP, respectively). Initial analyses were run using the calibration sample to identify the parameters necessary to carry out the reduction. For example, item-total correlations for the CTT technique were obtained using the calibration samples. Subsequently, these estimates were used to develop CTT-based short forms with the validation sample. This provided a means of cross-validation for the fixed reduction techniques. The computerized adaptive tests used the validation sample only, but the IRT parameters were defined beforehand using the calibration sample.
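As a condensed sketch of the preliminary analyses and the calibration/validation split just described, the following R lines assume a hypothetical data frame tp holding one measure's ordinal item responses; the function names are those of the 'psych', 'ltm', and 'lavaan' packages cited above, but the specific calls are illustrative rather than the study's actual scripts.

    library(psych); library(ltm); library(lavaan)

    set.seed(1)                          # arbitrary seed for the random split
    cal_idx <- sample(nrow(tp), 2500)    # fixed N = 2,500 calibration sample
    calib <- tp[cal_idx, ]
    valid <- tp[-cal_idx, ]              # remainder forms the validation sample

    rel  <- psych::alpha(calib)          # Cronbach's alpha, item-total correlations
    fit1 <- ltm::grm(calib)              # graded response model parameters

    # One-factor model for ordinal items, estimated with WLSMV
    model <- paste("f1 =~", paste(names(calib), collapse = " + "))
    fit2 <- lavaan::cfa(model, data = calib, ordered = names(calib),
                        estimator = "WLSMV")
    fitMeasures(fit2, c("rmsea", "tli")) # judged against the .06 / .95 cutoffs

    efa <- psych::irt.fa(calib, nfactors = 3)  # EFA based on polychoric correlations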
Measure Reduction Techniques

A series of different short forms were created using the techniques described earlier. In factor analysis, a lone factor indicated by three items represents a just-identified model, thus a three-item short form was developed for all measures and was considered the shortest possible length. Petway (2010) showed reducing measures to a fixed set of three items could, in some cases, lead to decreases in the reliability coefficient in excess of .2, but his results also revealed a Cronbach's alpha greater than .7 could be maintained with this number of items. Other testable short form sizes were determined using previous research. Waller and Reise (1989) adapted a measure of absorption and achieved an item savings of 50% to 75%, dependent upon the stopping rule used. Similarly, Reise and Henson (2000) shortened an already small measure of personality and found a 50% reduction in the number of items produced scores exhibiting a near-perfect correlation with the full scale. These studies suggested something around half of the original item pool was needed to obtain comparable scores. This was six items for TP and nine items for NC. For MP, this would have been about 11 items; however, to keep the numbers consistent across measures, MP had both a nine-item and a 12-item condition. The largest item conditions were nine, 12, and 15 items for TP, NC, and MP, respectively, in line with the more conservative reduction suggested by Simms and Clark (2005). Finally, a six-item condition was also evaluated for NC and MP, as it was a reasonable step from three and nine items, and signified a large drop consistent with the findings of Waller and Reise.

Classical Test Theory (CTT) Short Form. The CTT technique used item-total correlations to determine which items should be retained. This was done in blocks so the association with the total score did not change as a function of removing a single item at a time. For instance, to fulfill the six-item condition for MP, the six items with the highest item-total correlations were separated from the rest of the measure at the same time to form a shorter measure of MP.
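A minimal sketch of this block selection in R follows, reusing the hypothetical data frame from the earlier sketch; it takes the corrected item-total correlations (r.drop) reported by the 'psych' package's alpha function as one reasonable reading of the item-total criterion described above.

    # Keep the k items with the highest item-total correlations, all at once
    ctt_short_form <- function(items, k) {
      r_drop <- psych::alpha(items)$item.stats$r.drop  # corrected item-total r
      keep <- order(r_drop, decreasing = TRUE)[1:k]
      items[, keep, drop = FALSE]
    }

    sf6 <- ctt_short_form(calib, k = 6)
    head(rowSums(sf6))  # CTT trait estimates are sums of the retained items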
Confirmatory Factor Analysis (CFA) Short Form. The CFA technique followed logic nearly identical to CTT's. Items were retained based on the standardized estimates of their factor loadings. Since the highest loaders indicated those items most correlated with the construct of interest, the items with the highest factor loadings were taken from the full measure to produce a short form of the desired size. While in theory this technique could end up with the same set of items as CTT, in practice the assumptions made by factor analysis often change the order. This meant item arrangements could differ, and for very large reductions (e.g. three-item short forms), a difference of one item could produce notable changes.

Item Response Theory (IRT) Short Form. The IRT technique used the concept of information to determine which items should be retained. Estimation of item information (in the two-parameter logistic and graded response models) requires both discriminations and difficulties, and the item information function can be obtained for data satisfying the graded response model using Equation 5:

I_i(θ) = a_i² Σ_k p_ik q_ik.    (5)

Equation 5 provides an estimate of item information at a given theta, where a_i is the discrimination parameter, p_ik is the probability of a "correct" response (in this case, endorsement of the higher responses associated with dichotomy k), and q_ik is the probability of an "incorrect" response (endorsement of the lower responses indicated by k). When there is only a single difficulty parameter (identifying a dichotomous item), Equation 5 simplifies to the general two-parameter logistic model's item information function. Item information provides a way to judge the contribution of an item to the overall test information function, and an item's information is considered independent of other items. Larger information values suggest greater coverage of the observed θ, so when forming a short form using information, items contributing more should be retained over those contributing less if the goal is to retain maximal information.

Equation 5 can also be used to understand the relationship between discrimination, difficulty, and information. A discrimination of 1 is "standard", and does not impact the amount of information an item provides. A discrimination estimate less than 1 thus takes away from item information, while a discrimination estimate greater than 1 can boost information. Discrimination parameters alone could be used for the IRT technique, but because the unstandardized estimates from CFA are linearly related to the discrimination parameters from an IRT model, the results would generally be equivalent to those from the CFA technique. Since item information is an aggregate of both item discrimination and item difficulties, it should allow for more distance from the CFA technique than relying solely on discriminations.
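The sketch below implements Equation 5 as written and ranks a hypothetical item pool by information at θ = 0; evaluating information at a single trait point is an illustrative simplification, not the exact procedure used in this study.

    # Item information per Equation 5: a^2 times the sum of p*q over dichotomies
    grm_info <- function(theta, a, b) {
      p <- 1 / (1 + exp(-a * (theta - b)))  # P(responding above dichotomy k)
      a^2 * sum(p * (1 - p))
    }

    # Hypothetical pool: discriminations and difficulty sets for three items
    a_pool <- c(1.9, 1.2, 0.8)
    b_pool <- list(c(-1, 0, 1), c(-2, -1, 0.5), c(0, 1, 2))
    info0 <- mapply(function(a, b) grm_info(0, a, b), a_pool, b_pool)
    order(info0, decreasing = TRUE)  # retention order for an IRT short form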
Computerized Adaptive Test (CAT), Simulated. An extensive number of CATs were evaluated in this study. The first set consisted of fixed computerized adaptive tests (FCATs), while the second set was variable. The fixed CATs conformed to the item restrictions described for the other reduction techniques. For instance, TP had a six-item fixed CAT, which administered six items to everyone. The important difference between this and the earlier short form techniques was that the items did not have to be the same for everyone. The variable CATs (VCATs) allowed the number of items to vary between people. There were a few determinations needed before the computerized adaptive testing process commenced:

Item Selection. The simplest procedure chooses items at random, which could really only work if all of the items exhibited very similar difficulties and discriminations (assuming the 2PL model or an extension); this never happens unintentionally in practice. Maximum Fisher's Information (MFI), the typical item selection procedure for maximum likelihood estimation, selects the item that maximizes Fisher's information at the current trait estimate. Fisher's information indicates the amount of expected information an item possesses about θ. MFI is a simple, often reliable procedure, but can lead to more error around the resultant estimate of θ than other approaches, and it is particularly problematic for shorter tests (van der Linden, 1998). Maximum Likelihood Weighted Information (MLWI), another procedure based in maximum likelihood, weights the information matrix using the likelihood distribution. This can produce marginal improvements over MFI, though it can be as problematic as MFI because it is more susceptible to bias in resultant estimates of θ (Choi & Swartz, 2009). A variant of this, Maximum Posterior Weighted Information (MPWI), instead weights the information matrix using the posterior distribution, and could thus be considered a more Bayesian approach. It has demonstrated stability across a variety of conditions, though it may underperform relative to some more complex methods when items are dichotomous (van der Linden & Pashley, 2000). Maximum Expected Information (MEI) can be considered a hybrid method that uses observed information (like MFI) to make predictions about the information for remaining items; the item with the highest expected information is subsequently selected. The final method discussed here, Minimum Expected Posterior Variance (MEPV), is notably different from the others mentioned because its intent is to minimize posterior variance instead of maximizing information. It does this by selecting the item that yields the minimum predicted posterior variance, dependent upon the predicted probability of an item response given previous responses. Both MEI and MEPV produced small mean squared standard errors in Choi and Swartz (2009) and generally seemed like the best performers. However, preliminary evaluations of each method, combined with the computational demands of each, suggested the benefits of MEI and MEPV were minimal when compared to the computationally less demanding MPWI. In addition, Penfield (2006) found MEI and MPWI performed similarly when applying both selection methods to the partial credit model. For these reasons, MPWI was used as the selection method for this study.
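A rough sketch of MPWI over a quadrature grid follows, reusing grm_info from the previous sketch; it is one plausible reading of the procedure, not Firestar's generated code.

    # MPWI: weight each remaining item's information curve by the current
    # posterior over theta and administer the item with the largest weighted sum
    select_mpwi <- function(grid, posterior, a_pool, b_pool, remaining) {
      pwi <- sapply(remaining, function(i) {
        info <- sapply(grid, function(t) grm_info(t, a_pool[i], b_pool[[i]]))
        sum(info * posterior)
      })
      remaining[which.max(pwi)]
    }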
Scoring. Maximum likelihood (ML) scoring is perhaps the most common approach to estimating an individual's θ, especially for dichotomous variables, but it is poorly suited to handle polytomous items efficiently. For example, it requires an instance of endorsement and no endorsement, a complicated request with polytomous items due to their non-directional natures. Another scoring method is often required initially until enough information is obtained to use ML, and it is possible the ML estimate does not even exist. For polytomous items, ML scoring also tends to be exponentially slower than other scoring approaches because of the computational demands of its iterative process. Lastly, ML scoring functions best when the number of items is large (e.g. greater than 50; Embretson & Reise, 2000). The largest number of items used in this study was 21, so the ML procedure would have been difficult to utilize.

An alternative procedure is a Bayesian approach called expected a posteriori (EAP) scoring. EAP is often computationally faster than ML because it is not iterative and conceptualizes numerical integrations as a series of approximating summations. In addition, EAP estimation is able to obtain a finite score for individuals even if they have a "perfect" or null score. That is, with a proper prior, an EAP estimate always exists. However, this stresses the need for a properly specified prior. A related estimator, maximum a posteriori (MAP), tends to produce estimates similar to EAP. It does not need a properly specified prior (often a uniform prior will do), and thus behaves similarly to ML estimation in this regard. MAP scores represent the mode of the posterior distribution, while EAP scores are the mean. An important drawback of MAP is the possibility of multimodal posterior distributions, which are not an issue for EAP. Studies by Veerkamp and Berger (1997) and Choi and Swartz (2009) discovered simpler item selection methods (e.g. MFI, MPWI), which were susceptible to more bias and estimation error with the maximum likelihood score, performed similarly to more complex methods (e.g. MEI) when evaluated at the EAP estimate. EAP scoring was used for the CAT simulations (and with the IRT technique) in this study.
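The sketch below illustrates EAP scoring on a quadrature grid, reusing grm_probs from the earlier sketch; the N(0, 1) prior and the -4 to 4 grid anticipate the Firestar settings listed in the next subsection, but the function itself is only illustrative.

    # EAP: the trait estimate is the posterior mean over the quadrature grid,
    # and its standard error is the posterior standard deviation
    eap_score <- function(responses, a, b, grid = seq(-4, 4, by = 0.1)) {
      lik <- sapply(grid, function(t) {
        prod(mapply(function(r, ai, bi) grm_probs(t, ai, bi)[r],
                    responses, a, b))
      })
      post <- dnorm(grid) * lik          # N(0, 1) prior times the likelihood
      post <- post / sum(post)
      est <- sum(grid * post)            # posterior mean (the EAP estimate)
      sem <- sqrt(sum((grid - est)^2 * post))
      c(theta = est, sem = sem)
    }

    eap_score(responses = c(3, 1, 4), a = a_pool, b = b_pool)  # toy example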
Stopping Rule. For the fixed adaptive forms, the CAT stopped when the requested number of items had been selected for each observation. While all individuals consequently had the same number of items, they did not necessarily have the same items. When the number of items is allowed to vary across participants, a common CAT stopping rule is the standard error of measurement (SEM) of θ. For EAP, the standard error estimate usually equals the standard deviation of the posterior distribution, though the standard error of information (the inverse of the square root of test information) is another option. For each measure, two different variable CATs were run. The first stopped selecting items for an individual when their estimated SEM was equal to or less than .37. The second stopped when the individual reached the maximum number of usable items, 12, assuming the SEM stopping rule was not already satisfied. An SEM of .37 is associated with an approximate reliability of .86, a rough average of the reliability coefficients for the three measures. The first variable CAT could be considered "free" because use of the total number of items in the pool was possible if an SEM of .37 was never achieved. The second variable CAT could be considered "restrictive" because participants were limited to 12 items maximum, regardless of the final SEM. Gnambs and Batinic (2011) found 12 items was too few for adaptive testing purposes in their study, so this number was selected as the cap of the restrictive VCAT for verification. Restricting at this level meant the TP subscale had only one variable adaptive test.

Program. While there are several programs and functions designed to simulate CAT, few have functionality for graded response data. One relatively recent computer program, called Firestar (Choi, 2009), was designed to address this, and accurately simulates a CAT with polytomous item data. Choi (2009) developed the program with partial funding from the National Institutes of Health (NIH), and it is provided to users "as is". Though the program has not been used extensively in the literature yet, preliminary investigation of the generated R code by the study author indicated the overall CAT algorithm should produce accurate results under the graded response model. Further, since the program made use of R functions already developed, and because the algorithm logic seemed accurate to the author, the generated code was assumed valid. The program has a user interface allowing the user to define several components of the adaptive process; for instance, a researcher can specify what type of selection method to use, what type of polytomous IRT model to evaluate, the type of estimator to use, and the minimum number of items to select, to name just a few of the many options. Once selections have been made, the program generates R code. This allows for further editing and provides more control over what type of output to produce. The actual CAT is run within the R program.

To simulate CAT for this study, the following settings were specified in the program Firestar (a sketch of how these pieces fit together follows the list):

• Item selection method = maximum posterior weighted information (MPWI)
• The prior distribution was assumed normal, with a mean of 0 and a standard deviation of 1
• First item selected = mean of prior distribution
• Assumed the graded response model
• The posterior distribution determined the standard error estimate
• Maximum standard error = 0.37 (for the variable CATs only)
• Interim thetas were estimated using EAP scoring
• The minimum and maximum theta values used to define the quadrature points were -4.0 and 4.0, respectively, with an interval of 0.1 (equivalent to 81 quadrature nodes)
• Minimum number of items = 3
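Tying the preceding sketches together, the toy loop below approximates the variable-length CAT logic (MPWI selection, EAP scoring, SEM ≤ .37 stop, minimum three and maximum 12 items); it is a conceptual sketch of what the Firestar-generated code does, not that code itself.

    run_vcat <- function(answers, a_pool, b_pool, max_items = 12,
                         sem_stop = 0.37, grid = seq(-4, 4, by = 0.1)) {
      remaining <- seq_along(a_pool)
      given <- integer(0)
      repeat {
        # Posterior over the grid given the items answered so far
        lik <- sapply(grid, function(t) {
          if (length(given) == 0) return(1)
          prod(mapply(function(r, ai, bi) grm_probs(t, ai, bi)[r],
                      answers[given], a_pool[given], b_pool[given]))
        })
        post <- dnorm(grid) * lik
        post <- post / sum(post)
        est <- sum(grid * post)                 # interim EAP estimate
        sem <- sqrt(sum((grid - est)^2 * post)) # posterior SD as the SEM
        if ((length(given) >= 3 && sem <= sem_stop) ||
            length(given) >= max_items || length(remaining) == 0) break
        nxt <- select_mpwi(grid, post, a_pool, b_pool, remaining)
        given <- c(given, nxt)
        remaining <- setdiff(remaining, nxt)
      }
      list(theta = est, sem = sem, items = given)
    }

    run_vcat(answers = c(3, 1, 4), a_pool, b_pool)  # toy three-item pool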
The weighted mean took into account the frequency of item use, and allowed for a comparison with the observed (un-weighted) full-scale mean to determine if any observable patterns existed. Results Traditional Punishments With only 12 items, the TP subscale had the smallest amount of associated test information (44.474), but of the three scales it had the highest amount of average item information (I avg = 3.706). Fit of the single-factor model was poor (RMSEA = .110, TLI = .903). Three factors were recommended by an exploratory factor analysis, which demonstrated acceptable fit as a structural factor analysis model (RMSEA = .066, TLI = .965). Factor correlations were high in the model, with an average of .711, quite a bit higher than those estimated in the EFA (average r = .350), a likely consequence of imposed simple structure. LIMITS OF UNIDIMENSIONAL COMPUTERIZED ADAPTIVE TESTS 29 Items were grouped thematically: the first factor accounted for four items which addressed philosophies underlying punishment and traditional motivations for punishment (items 2, 3, 7, 11); the second factor focused on archaic or older forms of punishment (items 4, 5, 8, 9); and the final factor linked items related to society’s need for punishment (items 1, 6, 10, 12). Cronbach’s alphas for each of these subscales were comparably lower (.798, .693, and .566 for factors 1, 2, and 3, respectively) than the full subscale, but average inter-item correlations were larger than the full value for all but the third factor’s items (r = .499, .372, and .244 for factors one, two, and three, respectively). As noted, the three-factor model fit the data relatively well, but other performance indicators (e.g. alpha) suggested the third factor in particular could be problematic as defined. This factor could represent a set of items which are not as related to each other conceptually but are also not sufficiently related to items from the other factors. The factor correlations would imply this was not quite true; however, the items from the third factor would likely be chosen last in an adaptive testing or short form procedure because the items (excluding item 10) exhibited the weakest loadings and item-total correlations when a unidimensional construct was assumed (see Table 4). TP was the smallest of the three measures, but seemed to perform well in terms of average inter-item correlation coefficients and alpha. The subscale’s unidimensional model fit could improve by simply removing certain misaligned items (e.g. items one and 12). Estimates of item information supported this, suggesting a handful of items (again, one and 12, as well as six) added little to the subscale’s ability to discriminate between people of different trait levels. The measure had the highest average item information of all three measures, and included several (“powerhouse”) items contributing significantly to test information. However, these LIMITS OF UNIDIMENSIONAL COMPUTERIZED ADAPTIVE TESTS 30 items focused the discriminatory power of the measure across a relatively narrow range, indicating the measure best discriminated individuals whose trait levels fell between θ = -1.0 to θ = 3.0 (roughly average to above average levels of “support” for traditional punishments; see Figure 1). The observed trait range in the sample was about -2.5 to 3.0, so a few more items designed to help differentiate those with low levels of this trait might improve reliability in that region. Figure 2 graphically presents the distributions of the three types of scores. 
Refer to Table A1 in Appendix A for TP discrimination and difficulty parameters estimated from a unidimensional IRT model. Fixed Short Forms. Items two and 11 were selected as the first two items by all techniques. For the third item, the CTT and CFA techniques selected item eight, while the IRT technique selected item nine. Item nine’s position as fourth (CFA) and sixth (CTT) item in the six-item short form may stem from how the techniques respond to items with high kurtosis and skew, of which item nine has significant amounts. By nine items all forms were consistent with each other. In general, items with negative kurtosis were selected more often, but the actual amount of kurtosis preferred was not clear from the results. Reliability was good with three items (α range = .775 - .784), and increased somewhat with six items (α range = .842 - .844), but no change in reliability occurred at nine items (α = .844). The addition of three more items added nothing to the reliability of the short form, and their inclusion led to a decrease in the average inter-item correlation. Interestingly, and perhaps consequently, the six- and nine-item short forms had higher estimates of reliability and average inter-item correlations than the full TP subscale. Squared correlations between reduced and full scale scores from the CFA technique were highest among the three techniques across all item conditions, with a squared correlation of .904 LIMITS OF UNIDIMENSIONAL COMPUTERIZED ADAPTIVE TESTS 31 observed with only three items. IRT squared correlations were consistently in between CFA and CTT; however, these correlations leaned progressively closer to CFA such that by nine items there was little difference between CFA and IRT in terms of variance shared with their respective full-scale estimates. Shared variance greater than 90% (r 2 = .942) was obtained with six items for IRT. CTT full-scale scores correlated relatively poorly with the reduced form’s scores, not achieving at least 90% shared variance until nine items (r 2 = .933). Surprisingly, the CTT reduced form’s scores exhibited their largest correlations with the full-scale factor score, and at three items actually had a larger correlation with it (r 2 = .912) than the CFA short form factor scores. Simulated Computerized Adaptive Tests. The fixed-length CATs (FCATs) were consistent with the short forms. Items one, six and 12, never utilized by the short forms, were mostly kept out of the fixed-length CATs as well (though item six was used with some non- negligible frequency in the nine-item FCAT). These items made up three of the four items indicating the third of TP’s three factors in the multidimensional model. They had relatively small factor loadings in both the unidimensional and multidimensional case. Items two and 11 were the most frequent, administered to everyone in all simulated CATs. Following, items three and eight (FCAT-3), then items seven, eight, nine, and 10 (FCAT-6), were used with high frequency 1 in the FCATs. Estimates of reliability were consistent with the short form reliabilities for FCAT-3 and FCAT-6, but FCAT-9 saw improvement not witnessed with the short forms (Figure 3). Reliability increased from .845 (FCAT-6) to .867 (FCAT-9), the latter of which coincided with an SEM of .365 (below the maximum of .370). This could be due to the occasional use of item 1 This list of items is not in exact order of frequency of use. 
Here, the flexibility afforded the CAT led to a surprisingly high squared correlation as well (r² = .990). However, while squared correlations from the FCATs were always higher than those from the IRT and CTT techniques, they remained lower than those from the CFA technique at all item conditions, as can be seen in Figure 4.

The SEM-constrained variable CAT (VCAT-SEM) also administered items two and 11 to everyone. The median number of selected items was six, with a rounded mean of seven. This suggested positive skew, but also provided interesting information about the sample. At least 50% of the sample required no more than six items to return reliable estimates of their trait level, and by extension, an even larger percentage required no more than seven. However, the reliability of VCAT-SEM was .851, with a full-scale r² = .977. The reliability estimate, linked to an SEM of .386, was smaller than the estimate desired by design. This suggested that for some people, even 12 items (the real maximum) were not sufficient to settle on a reasonably reliable estimate of their trait level. Item selection frequencies in VCAT-SEM were ordered in a manner consistent with the CFA technique, and compared to CFA-6 and CFA-9, which sandwiched VCAT-SEM in terms of mean item use, VCAT-SEM did not stand out on either squared correlation or reliability. Lastly, item-use frequencies in VCAT-SEM were high even for the least used item (item 12, administered to 28% of participants), but this is unsurprising considering the overall test SEM did not meet criteria. Three items were administered to over 90% of the sample, and again, no items were administered with less than 25% frequency.

Positively skewed items were used far more often than negatively skewed items (or items with little skew). As with the short forms, the relationship between the CATs and kurtosis was unclear. Finally, the CATs administered items with larger discrimination parameters more often, and preferred items with higher initial difficulties (i.e. b1 values closer to zero).

Need for Cognition

Total test information was much higher here than for the TP subscale, with an estimate of 65.392. However, per item, the average was slightly smaller (I_avg = 3.633). Model fit for the one-factor model was considered mediocre (RMSEA = .096, TLI = .899). Exploratory factor analysis identified two possible factors representing the data. This was supported by a structural factor analysis model, which demonstrated good fit (RMSEA = .055, TLI = .967). The resultant model linked the positively and negatively worded items to separate dimensions, with a factor correlation of .774, a value consistent with the EFA estimate. Estimated separately, the reliabilities were .841 (positive items) and .836 (negative items), reasonably close to the full-scale estimate. Average inter-item correlations also improved (r = .371 for positive items, r = .362 for negative items).

Of the three scales, the NC items exhibited the largest average inter-item correlation coefficients and alpha. This remained true of the alphas estimated for each of the two factors from the two-factor model (though these factors each had more items than any single factor from MP's or TP's multidimensional models). Unlike the other measures, NC did not include especially poor items (though one could consider items 17 and 18 questionable).
Since every item exhibited at least a moderately sized correlation with the construct under assumptions of unidimensionality (refer to Table 5), NC seemed most amenable to an adaptive testing procedure. This belief was bolstered by the large factor correlation in the two-factor model: while two factors better explained the data, the high factor correlation coupled with good loadings meant the total score and theta estimates from scale reduction techniques could remain close to comparable estimates from the two-factor model. NC items provided the most information at average-to-below-average levels of need for cognition. In addition, there was good spread of information across theta, with a range of -3.0 to 2.0 indicating the trait levels at which this measure functioned best (Figure 5). This almost perfectly mapped onto the observed range of trait estimates (about -3.0 to 2.5; visualized in Figure 6), possibly explaining why this measure had the highest overall estimate of reliability. However, some items better designed to differentiate those with high levels of need for cognition would be useful. Table A2 in Appendix A provides NC discrimination and difficulty parameters from an IRT model assuming unidimensionality.

Fixed Short Forms. Across the techniques within each item condition, item selection order differed, but the items ultimately chosen did not change notably between conditions. Only for the six- and nine-item conditions was there variation in item selection. For six items, the CTT technique selected item 14 instead of item 10. With nine items, the CTT technique selected item 13 instead of item three, while the IRT technique incorporated item six instead of item seven. However, the item differences produced negligible changes in alpha and average inter-item correlations. Item selection favored negatively skewed items and items with low kurtosis: as the number of items increased, the mean and median skew moved closer to zero, while the mean and median kurtosis became more negative.

Reliability declined from .894 with all 18 items to .666 with three items, though the three-item estimate was still surprisingly high considering how many items were removed (see Figure 7). Conventionally good levels of alpha were achieved with as few as six items (.789-.799, depending on technique). By 12 items, alpha estimates were close to that of the full scale. The nine-item condition was of particular interest here because the preliminary analyses indicated the positively worded and negatively worded items produced separate factors, each with nine items. Here, a roughly equal number of items were selected from each factor, and the combination exhibited a slightly higher average inter-item correlation and alpha than either of the two nine-item dimensions as defined in the preliminary analyses.

Across all item conditions, CTT scores exhibited weaker squared correlations with the full-scale sum than any other method with their respective full-scale estimates. However, the CTT scores consistently had higher correlations with the CFA full-scale estimate than scores produced using the CFA technique. No such behavior was observed for the IRT estimates, which always had the highest correlations with the IRT full-scale EAP estimate. Squared correlations were robust, with 90% of the variation in full-scale estimates explained by nine items (see Figure 8).
Unfortunately, all of the short forms demonstrated poor model fit.

Simulated Computerized Adaptive Tests. Item selection in the FCATs was consistent with the fixed short forms in that certain items were never selected (items 17 and 18). In addition, items seven, eight, nine, 15, and 16 were seldom used. The VCATs were consistent with each other, utilizing the same top nine items, but the frequencies of use (and thus the order of "importance") differed. All CATs prioritized more negatively skewed items with less negative kurtosis. Higher discriminators were chosen most often, so it is no surprise item two, with the highest discrimination estimate (a = 2.151), was the most frequently selected item. Finally, beta estimates did not seem to impact item selection, as there was no discernible pattern of prioritization.

Of the CATs, FCAT-12 produced the highest reliability estimate (.897), followed by FCAT-9 (.875). VCAT-12 and VCAT-SEM had reliability estimates of .866 and .868, respectively. All of these estimates were consistent with the SEM stopping rule, though only the VCATs were allowed to use it. In general, CAT reliabilities were larger than estimates from the comparable short forms. The mean number of selected items for VCAT-SEM was nine, with a range from five to 18, and a median of eight. VCAT-12 had the same median and a smaller mean of eight. Since the estimate of reliability was sufficiently large for VCAT-12, and since the estimate was not much improved by allowing for up to 18 items, it was safe to assume only a few people actually needed more than 12 items to obtain reasonably accurate estimates of their trait level.

Based on mean and median item selection, the VCATs were most consistent with FCAT-9. Despite FCAT-9 having higher reliability, both of the VCATs produced higher squared correlations with the full-scale estimate (VCAT-12 r² = .944; VCAT-SEM r² = .951) than FCAT-9 (r² = .939), suggesting some benefit, albeit small, to the SEM-based adaptive process. Squared correlations tended to be slightly higher than those from the fixed short forms as well, especially those from the CTT technique. VCAT-SEM administered every item at some point, but five items (about 28% of the NC pool) were given to participants with at least 90% frequency. At the other end, three items were administered to no more than 10% of the sample. There was a decent spread of frequencies here, as the six items following the top five were administered to at least 25% of the sample.

Mature Personality

The estimate of Cronbach's alpha for MP was slightly smaller than that of NC despite MP having a few more items, and only trivially larger than that of TP considering the nine-item difference. This was not surprising considering MP had the smallest average inter-item correlation (r = .238) and polychoric correlation (r = .275). The single-factor model exhibited poor fit, with an RMSEA = .124 and a TLI = .740. Total information for the measure was 69.141, and I_avg was 3.292. Exploratory factor analysis suggested four factors best described the data. Unlike with the other measures, this solution was still a rather poor fit to the data (RMSEA = .082, TLI = .887) when assessed using CFA with simple structure. However, other configurations of this particular set of variables did not exhibit notably better fit, and models with more than four factors were practically invalid (either because of too few indicators on a particular factor, or because several items ended up with negligible loadings on all factors). Despite the questionable fit, this model was held as the best representation of the items given the data.
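The EFA-then-CFA workflow just described can be sketched as follows, assuming the psych and lavaan packages and hypothetical item names i1 through i21; the factor assignment anticipates the item groupings detailed next, and this is not the code used for the reported analyses.

    library(psych)
    library(lavaan)
    fa.parallel(mp, fa = "fa")                         # suggested number of factors
    efa <- fa(mp, nfactors = 4, rotate = "oblimin", fm = "ml")
    efa$Phi                                            # EFA factor correlations
    model <- 'F1 =~ i1 + i2 + i5 + i6 + i7 + i9 + i10
              F2 =~ i14 + i16 + i18 + i20
              F3 =~ i3 + i4 + i8 + i11 + i12 + i13 + i17
              F4 =~ i15 + i19 + i21'
    fit <- cfa(model, data = mp, ordered = names(mp))  # simple-structure CFA
    fitMeasures(fit, c("rmsea", "tli"))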
Factor 1 consisted of items 1, 2, 5, 6, 7, 9, and 10, items concentrating on efficiency and productivity. Factor 2 had four items (14, 16, 18, 20) related to reliability and dependability, while items 3, 4, 8, 11, 12, 13, and 17 indicated factor 3 (thoroughness and completeness). Finally, the fourth factor had only three items (15, 19, and 21), which measured persistence. Factor correlations ranged from .403 to .791, with an average of about .613. Disregarding simple structure, as is the case in EFA, the correlations averaged .35. Cronbach's alpha for each dimension dipped a bit compared to the full-subscale estimate: .772, .717, .700, and .700 for factors one, two, three, and four, respectively; however, all of the average inter-item correlations improved: .330, .397, .432, and .252 for factors one through four, respectively.

Inadequate model fit for the tested models and weak factor indicators (in the single-dimension case) suggested the MP subscale would benefit greatly from a fixed scale reduction technique, removing in particular those items relegated to the fourth dimension. However, the unidimensional model's factor loadings were, overall, more stable for this measure (Table 6) than for the comparable TP subscale, and considering the larger item pool, MP could function as an adaptive test. Though items generally provided less information than in the other measures, the test information function (Figure 9) suggested good, consistent coverage of the trait. The measure functioned best within the range of -2.0 to 4.0 along the trait continuum. However, the peak of the test information curve was at a lower level of information than for the other two measures. The distribution of trait estimates appeared most normally distributed for this measure, and the estimate types were in greater agreement here than with the other measures (Figure 10). The observed range of scores for this measure was about -3.0 to 4.0 averaged across the score types. Taken with the test information function, reliability was of most concern for those with very low levels of the mature personality trait (i.e. less than -2.0). See Table A3 in Appendix A for MP discrimination and difficulty parameters.

Fixed Short Forms. Short forms generally differed by one item. For instance, the six-item form from the CFA technique used item 14 instead of item nine, producing a very small increase in reliability and average inter-item correlation. With nine items, the CTT technique utilized item three instead of item 14, which actually produced lower estimates of reliability and average r. Finally, IRT-12 differed from the other 12-item short forms because it selected item two as its final item instead of item eight. This was not to its benefit, but differences were very small. The three- and 15-item versions developed by each technique utilized the same items. Short forms prioritized items with positive kurtosis, such that average and median kurtosis values became larger negative values as more items were introduced.
Positively skewed items were also prioritized, such that the mean and median skew decreased with more items. However, changes in kurtosis seemed larger and more indicative of selection. The alpha for the three-item forms was .712, and an estimate of reliability nearly equal to that of the full measure was observed at 15 items (α = .861, vs. α = .863 with all 21 items). Squared correlations between the reduced-form estimates and the full-scale estimates were acceptable at the three-item level, falling within the .6-.7 range. Shared variance reached more than 80% as early as six items (for the CFA and IRT techniques), more than 90% by nine items (again, for the CFA and IRT techniques), and more than 95% by twelve items (except for the CTT technique). Consistent with the other measures, the CTT reduced-form estimates generally correlated more highly with the IRT and CFA full-scale estimates than with the CTT full-scale estimate.

Simulated Computerized Adaptive Tests. For all of the FCATs, item selection rarely, if ever, included the following items: one, 13, 15, 17, 18, and 19. This is consistent with the short forms, which excluded all of these items. The reliability of FCAT-3 was similar to the short-form estimates (.711), but this was likely due to the FCAT administering the same three items from the short form to everyone in the sample. Later item conditions showed marked improvement, such that by six items the reliability estimate was .807, noticeably larger than the alpha estimates for the six-item short forms (α = .781-.784). FCAT reliabilities continued to be larger than those of the short forms up through 15 items (Figure 11). However, it was not until FCAT-12 that a fixed adaptive test had a standard error of measurement less than .37, the set maximum for SEM-based CATs. Specifically, FCAT-12 had an SEM = .36, indicating a reliability of .87. Generally, squared correlations were also larger than those from the short forms until 15 items, at which point the squared correlations were equal to the estimate from CFA-15 (r² = .984). Figure 12 compares the performance of the fixed short forms with the fixed CATs.

The VCATs maintained good reliabilities, with VCAT-SEM producing a reliability estimate consistent with the SEM stopping criteria (reliability = .865, with an associated SEM = .367). VCAT-12 had a reliability of .859 (associated with an SEM of .375), which was lower than the desired reliability cutoff. This outcome was observed with the TP subscale, and as with TP it suggested certain individuals needed more than 12 items to obtain sufficiently reliable estimates. VCAT-12 was similar to the FCATs in that a large number of items were rarely if ever used. Only in the VCAT-SEM were items unused by other techniques incorporated, all of which had surprisingly high frequencies of use considering their exclusion from the other CATs. Regardless, eight items (about 38% of the total item pool) were administered to at least 90% of this sample. Considering the mean and median number of items, there was a clear "core" set of items here. Four additional items were given to at least 25% of the participants, while four items were administered to no more than 10% of the sample. VCAT-12 administered 10 items on average, with a median of 10. The minimum and maximum number of items was seven and 12, respectively. In VCAT-SEM, 12 items were administered on average, with a median of 10, a minimum of seven, and a maximum of 21.
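The stopping behavior of the variable-length CATs reduces to a simple rule: administer items until the posterior SEM falls to .37 or the pool is exhausted. A minimal sketch of that rule is below, where se_after is a hypothetical vector of the standard errors observed after each administered item (the real CATs re-estimated theta and its SEM with Choi's Firestar code).

    n_administered <- function(se_after, sem_cut = 0.37) {
      for (k in seq_along(se_after)) {
        if (se_after[k] <= sem_cut) return(k)   # stop once the SEM criterion is met
      }
      length(se_after)                          # otherwise exhaust the item pool
    }
    n_administered(c(0.62, 0.48, 0.41, 0.36, 0.33))   # returns 4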
As with the other measures, the CATs extensively relied on items with larger discrimination estimates, and mostly ignored items with discriminations less than a = 1 except when absolutely necessary. Additionally, the adaptive process favored items that were slightly more difficult at the lower category levels (i.e. negative values closer to zero for b1 and b2; items four and nine are the only exceptions to this). No clear pattern was found at higher categories. This behavior was consistent with the test information curve, which showed coverage was greatest between just below average and well above average levels of the trait.

Discussion

Despite being the smallest measure, with only 12 items, TP had promising results. Its three- and six-item short forms performed better than those of the other measures, and while it did not quite meet the cutoff for reliability, the 12-item CAT did a good job approximating its full-scale estimate with a small number of items overall. Its item pool size perhaps automatically places it in the short form category, but with more discriminating items this measure could possibly function well as an adaptive test. The MP measure had the largest item pool, but NC demonstrated better support for the adaptive process. Since the CATs seemed to prioritize items with high discriminations, TP and MP suffered by having several items with discriminations less than 1 (many of these items were not much better in the multidimensional context either, making them good candidates for removal). NC had only one item fall into this category, so almost all of its items were at least good discriminators. This meant there was more variable item use by the adaptive process. In fact, based on the results of the VCAT-SEMs, NC used a larger proportion of items outside of the "core" items with more than 25% of the sample, relative to MP. Additionally, more items from the NC measure were used with at least 10% of the sample.

Given the above, item pool size was perhaps not as meaningful as some other properties for the CATs. For example, kurtosis values consistently leaned closer to zero across all measures, suggesting items with low kurtosis (positive or negative) were preferable to CATs over items with high kurtosis in either direction. IRT betas, which share a relationship with both skew and kurtosis (though it is not wholly clear from the results), behaved as expected. Generally, items with beta parameters that better approximated the observed test information function were used more often. Though discrimination was clearly important for all measures, for NC it would factor into the selection process more, simply because the betas were not very distinct from one item to the next.

Incomplete Data

Project Talent's mature personality subscale had considerable amounts of incomplete data, and removing cases with any missing responses led to a considerable drop (nearly 50%) in its overall sample size (N = 6,581). The full Student Activities Inventory was long (with 150 items), so participant fatigue or time constraints might explain why so many participants did not respond to some of the MP items. It is also possible the incomplete data was due to the construct(s) of interest, since mature personality is related to conscientiousness.
Those with more "mature" personalities may have been more inclined to complete the measure. However, results of additional analyses (presented in Appendix B) incorporating the missing data suggested removing incomplete cases did not lead to any noteworthy changes. Tables B1-B3 provide comparisons of parameter estimates and item selection behavior with and without incomplete data. The other two measures were much more complete, and in both cases only about 500 cases were removed, a much smaller portion of their overall samples. Also, for both measures, the majority of removed cases were missing every item, not just select items.

The Better Technique

The two primary indices used to judge the effectiveness of these techniques were the reliability coefficients and the full-scale squared correlations. Based on the results of the present study, and relying on the above two indices, one might assume fixed short forms created using the CFA technique specifically were generally better regardless of measure characteristics.

There are several ways to rationalize the above assertion. Firstly, reliability was not much better for the CATs than for the short forms, and in certain scenarios there was no real difference. Kim and Feldt (2010) cast further doubt on differences here. They discovered test-level reliability coefficients estimated within IRT tended to be higher than Cronbach's alpha. In fact, alpha better approximated the lower bound of the IRT coefficient, suggesting alpha was a more conservative indicator of test reliability. If their findings hold for the present study, differences between the alpha estimates and the IRT reliabilities could be considered minimal. Secondly, the squared correlations, which were perhaps a more direct indication of score accuracy, were not notably better for the CATs in most cases. The question then becomes whether any small benefits of the adaptive process warrant the added effort required to set up a computerized adaptive test. Given these results, it would be difficult to justify the effort, especially considering how well the CFA technique (and to a lesser extent, the IRT technique) reproduced its full-scale estimate. Even the CTT technique performed well if one assumes the full-scale factor score is a better indicator of a person's trait than a simple sum. However, with the possible exception of TP, the short forms did not describe good-fitting models, stressing the potential importance of measure dimensionality.

The Issue of Multidimensionality

The results lend support for simpler short form reduction techniques, but the methods underlying all of the techniques (including the CATs) assumed unidimensionality. The preliminary analyses demonstrated all three of the measures exhibited varying degrees of multidimensionality. While there is some level of robustness to multidimensionality in these methods, the point at which this robustness falls apart is not clear and may not be consistent. Moreover, if the full-scale estimate was derived from an incorrect model, a high squared correlation with it and good reliability do not provide much useful information.
Considering the extent of dimensionality identified for these measures in the preliminary analyses (especially TP and MP), assuming a unidimensional trait could have produced highly inaccurate short forms and adaptive tests. Study 2 delved into this further with a series of simulation studies designed to better understand the influence of several measure properties, including multidimensionality.

Study 2: Simulation Study of Measure Properties

Many different aspects of a polytomous item measure can influence the effectiveness of any reduction technique. A large item pool size, for instance, is assumed better for computerized adaptive tests because larger item pools (in theory) allow for a more comprehensive array of items to measure the trait (e.g. Gnambs & Batinic, 2011). However, if many of those items are poor indicators of the primary construct, a smaller item pool in the form of a shortened (but fixed) scale may function just as well as a more elaborate adaptive test. That is, better measurement could be obtained by simply removing unnecessary items. Study 2 addressed this through a series of simulations. The goal was to better understand the results of the simulated CATs from Study 1. In addition to item pool size, the number of response categories, dimensionality, dimension interrelatedness, discrimination range, and item non-normality were investigated. A number of studies detail the effects of non-normal distributions on many CTT parameters, so it is highly reasonable to assume measure reduction approaches based on them suffer from bias. Likewise, several studies have investigated how skewed items negatively impact the estimates in factor analysis models and found definitive results: the more skewed the items, the more inaccurate the estimates of factor loadings (e.g. Owen, 2010). Effects depended on the size of the expected factor loading, with small loadings enlarged and large loadings reduced when skew was at least moderate.

The use of multidimensional IRT models for CAT has become more prominent in recent years (e.g. Frey & Seitz, 2011; Liu, 2007), but it is still not clear how severe multidimensionality needs to be before unidimensional models break down. In other words, at what point are the estimates generated from unidimensional models not sufficiently trustworthy? This was related to another Study 1 issue: Study 1 relied on full-scale estimates to make judgments about the accuracy of the reduction methods, and given the dimensionality characteristics of the real measures used there, this was problematic, since none of the measures satisfactorily conformed to a single dimension. The present study utilized a "true" theta, which allowed for a more faithful inspection of simulated CAT accuracy.

Hypotheses

While many of the univariate tests were exploratory, some had hypothesized outcomes, which are provided below. In general, every property had hypotheses pertaining to mean items and the correlation between the true and CAT thetas (CAT-SIM r). Only the dimensionality conditions were associated with additional hypotheses. The tested dependent variables are described in more detail in the subsequent Method section.

Number of Items (N ITEMS)

Since several different studies have already indicated 12 items were insufficient for CAT, all of the hypotheses for N ITEMS assumed differences were linked solely to disparities between the 12-item condition and the other conditions:
1. There is a relationship between number of available items and CAT-SIM r. That is, the mean for 12 items will be smaller than that of 24, 36, and 48 items, but there will be no differences between 24, 36, and 48 items.
i) Mean(12) ≠ Mean(24) = Mean(36) = Mean(48) (i.e. Mean(12) < Mean(24) = Mean(36) = Mean(48))
2. There is a relationship between number of available items and mean items selected. That is, the mean for 12 items will be smaller than that of 24, 36, and 48 items, but there will be no differences between 24, 36, and 48 items.
i) Mean(12) ≠ Mean(24) = Mean(36) = Mean(48) (i.e. Mean(12) < Mean(24) = Mean(36) = Mean(48))

Number of Response Options (N RESP)

Results of Study 1 did not suggest response scale size contributed to the performance of those measures as CATs, so for the present study it was assumed 5 and 7 response options would behave similarly. Differences were expected between these two conditions and the two-category condition because the dichotomous case represents a more significant restriction of response options:

3. There is a relationship between number of response options and CAT-SIM r. The mean for 2 response options will be smaller than that of 5 and 7 response options, but there will be no difference between 5 and 7 response options.
i) Mean(2) ≠ Mean(5) = Mean(7) (i.e. Mean(2) < Mean(5) = Mean(7))
4. There is a relationship between number of response options and mean items selected. The mean for 2 response options will be larger than that of 5 and 7 response options, but there will be no difference between 5 and 7 response options.
i) Mean(2) ≠ Mean(5) = Mean(7) (i.e. Mean(2) > Mean(5) = Mean(7))

Number of Dimensions (N DIM)

As already discussed, dimensionality has been thoroughly assessed, and many researchers noted dimensionality of any kind could prove detrimental to the adaptive process. For this reason, the hypotheses for N DIM assumed performance worsened as the number of dimensions increased. N DIM had two additional hypotheses linked to indicators of discrepancy and error. Specifically, it was theorized differences in CAT and CTT reliability estimates could be indicative of multidimensionality, due to the sensitivity of Cronbach's alpha (Schmitt, 1996; Cortina, 1993; Cronbach, 1951), and it was assumed CAT and true estimates would become more disparate, as implied by the root mean square error, with more dimensions:

5. There is a relationship between number of dimensions and CAT-SIM r. The means will decrease as the number of dimensions increases.
i) Mean(1) ≠ Mean(2) ≠ Mean(4) (i.e. Mean(1) > Mean(2) > Mean(4))
6. There is a relationship between number of dimensions and the difference in reliability estimates. The mean difference will increase as the number of dimensions increases.
i) Mean(1) ≠ Mean(2) ≠ Mean(4) (i.e. Mean(1) < Mean(2) < Mean(4))
7. There is a relationship between number of dimensions and RMSE. The means will increase as the number of dimensions increases.
i) Mean(1) ≠ Mean(2) ≠ Mean(4) (i.e. Mean(1) < Mean(2) < Mean(4))
8. There is a relationship between number of dimensions and mean items selected. The means will increase as the number of dimensions increases.
i) Mean(1) ≠ Mean(2) ≠ Mean(4) (i.e. Mean(1) < Mean(2) < Mean(4))

Correlation between Dimensions (DIM R)

N DIM was conceptually linked to dimension correlation, so the rationale for the abovementioned N DIM hypotheses generalizes to DIM R. This property was included and held separate from N DIM because it was assumed to have independent contributions.
A correlation of .35 indicated more "severe" multidimensionality than a correlation of .70, and a correlation of 1 was identical to assuming the single-dimension condition of N DIM:

9. There is a relationship between magnitude of dimension correlation and CAT-SIM r. The means will decrease as the correlation decreases.
i) Mean(1) ≠ Mean(.70) ≠ Mean(.35) (i.e. Mean(1) > Mean(.70) > Mean(.35))
10. There is a relationship between magnitude of dimension correlation and the difference in reliability estimates. The mean difference will increase as the correlation becomes smaller.
i) Mean(1) ≠ Mean(.70) ≠ Mean(.35) (i.e. Mean(1) < Mean(.70) < Mean(.35))
11. There is a relationship between magnitude of dimension correlation and RMSE. The means will increase as the correlation decreases.
i) Mean(1) ≠ Mean(.70) ≠ Mean(.35) (i.e. Mean(1) < Mean(.70) < Mean(.35))
12. There is a relationship between magnitude of dimension correlation and mean items selected. The means will increase as the correlation decreases.
i) Mean(1) ≠ Mean(.70) ≠ Mean(.35) (i.e. Mean(1) < Mean(.70) < Mean(.35))

Discrimination Range (Alpha)

Study 1 linked discrimination parameters to mean number of selected items. In general, larger discriminators meant fewer items were needed. However, the link to accuracy was not clear. The number of items to administer was not limited in the variable CATs, so they recovered accuracy (according to CAT-EST r, at least) by selecting additional items. For that reason, in the present study, CAT-SIM r should not differ between conditions:

13. There is no relationship between alpha and CAT-SIM r. The means will not differ between conditions:
i) Mean(A1) = Mean(A2) = Mean(A3) (plus all associated pairwise comparisons)
14. There is a relationship between alpha and mean items selected. The means will differ between conditions, as specified:
i) Mean(A1) < Mean(A2) < Mean(A3) (plus all associated pairwise comparisons)

Difficulties (Skew; Kurtosis; Beta)

Study 1 did not provide much to inform these hypotheses. The specific comparisons below are based on assumptions incorporated into the study design. For example, conditions four and five should not differ from each other because they exhibit the same amount of skew, but in different directions. With a normally distributed true theta (and a normal prior for the CAT), they should perform equivalently. Conversely, there should be a significant difference between conditions one and three because the latter has significantly more kurtosis. In that case, more kurtosis describes a departure from normality not observed in condition one (i.e. condition one is the "ideal" condition given a normal distribution):

15. There is a relationship between beta and CAT-SIM r. Specific comparisons are listed below.
i) Mean(B1) ≠ Mean(B2)
ii) Mean(B1) ≠ Mean(B3)
iii) Mean(B1) ≠ Mean(B4)
iv) Mean(B2) ≠ Mean(B3)
v) Mean(B2) ≠ Mean(B7)
vi) Mean(B4) = Mean(B5)
vii) Mean(B6) = Mean(B7)
16. There is a relationship between beta and mean items selected. Specific comparisons are listed below.
i) Mean(B1) ≠ Mean(B2)
ii) Mean(B1) ≠ Mean(B3)
iii) Mean(B1) ≠ Mean(B4)
iv) Mean(B2) ≠ Mean(B3)
v) Mean(B2) ≠ Mean(B7)
vi) Mean(B4) = Mean(B5)
vii) Mean(B6) = Mean(B7)

Interactions

To assess the models incorporating an interaction between number of dimensions and dimension interrelatedness (N DIM x DIM R), only runs with two and four dimensions were used.
These analyses focused solely on the multidimensional conditions since the goal was to determine how much multidimensionality was detrimental. The final interaction model (N ITEMS x N RESP x BETA) excluded runs with N ITEMS = 12, since it was hypothesized this condition would differ from the others even if the higher conditions behaved similarly to one another. The intention was to better understand the relationship between these conditions when item pool size was not considered a limiting factor.

2) N DIM (DIM = 2 or 4) x DIM R (r = .35 or .7)

1. There is a statistically significant interaction between number of dimensions and dimension correlation with regards to their relationship to CAT-SIM r. The mean difference between four dimensions with r = .35 vs. .70 will be larger than the mean difference between two dimensions with r = .35 vs. .70.
i) Mean(2, .70) - Mean(2, .35) < Mean(4, .70) - Mean(4, .35)
2. There is a statistically significant interaction between number of dimensions and dimension correlation with regards to their relationship to reliability differences. This would indicate a larger discrepancy in reliability differences for the four-dimension conditions than the two-dimension conditions.
i) Mean(2, .70) - Mean(2, .35) < Mean(4, .70) - Mean(4, .35)

3) Mean Items (DV) ~ N ITEMS (ITEMS = 24, 36, or 48) x N RESP x BETA

1. Firstly, this model will show observed differences in beta groups and response options groups depend on the number of items available in the pool. Despite not expecting a main effect of N ITEMS without the 12-item condition, it is still hypothesized these conditions will differentially affect means for the other variables. Secondly, number of responses is expected to moderate the relationship between the beta parameters and mean items selected. Finally, this test will demonstrate the interaction between number of response options and beta conditions varies as a function of the higher item conditions. To summarize:
i) Differences in mean number of selected items for the RESP conditions depend on the number of items available (24, 36, and 48).
ii) Differences in mean number of selected items for the BETA conditions depend on the number of items available (24, 36, and 48).
iii) Differences in mean number of selected items for the BETA conditions depend on the number of response options.
iv) Differences in mean number of selected items for the BETA conditions depend on the number of response options, and this dynamic varies as a function of the number of items available (24, 36, or 48).

Method

Simulation Properties, Conditions, and Parameters

Number of Items. Four item-number conditions were evaluated in this study: 12, 24, 36, and 48 items. The first two conditions roughly coincided with the number of items from the Study 1 measures. Dodd, Koch, and Ayala (1989) considered approximately 30 items a good size for an item pool, so 36 and 48 items were included to represent potentially excessive item pool sizes. Increments of 12 were used to maintain equal spacing of the conditions, and to mirror the conditions around 30.

Discriminations. Three fixed sets of 96 discrimination parameters were randomly generated from a uniform distribution. The first set had the widest range, extending from a = .5 to a = 2.5, and was most like the TP measure. The second had a narrower range (a = 1 to a = 2), consistent with the tighter NC measure.
The third also had a narrower range (a = .5 to a = 1.5), in line with the MP measure. This final condition was considered the most restrictive because there was a higher likelihood of introducing weak discriminators (i.e. items with discriminations less than 1). Past research indicated discriminations played a large role in the CAT process, and Study 1 suggested items with discrimination values under 1 lessened certainty around a person's trait estimate. While 96 parameters were generated per condition, this many estimates were not needed for the CATs. In simulation, alphas were randomly sampled without replacement to acquire the necessary number as dictated by the item-number condition.

Difficulties. Skew and kurtosis largely describe a distribution's appearance (e.g. Stuart & Ord, 1994; DeCarlo, 1997). In IRT, the beta parameters provide information about category endorsement given a theta value, and this endorsement behavior provides information about the distribution of a variable. Sets of betas were established to define a series of skew and kurtosis conditions, seven in total (see Table 7). The real data from Study 1 mostly aligned with conditions one, two, and four, while the other conditions represented alternative distributions plausible for this type of data. While Table 7 suggests all betas in a simulated data set were fixed to the same quantities, adjustments were imposed to introduce necessary variation. The fixed betas were simply a starting point to ensure some level of average skew and kurtosis. A random number drawn from a normal distribution with a mean of zero and a standard deviation of .25 was added to each beta. In conjunction with the discriminations, this allowed certain items a higher chance of selection while maintaining a scale with the desired average item skew and kurtosis implied by the base betas.

The stability of skew and kurtosis was investigated relative to the other properties to make sure varying combinations of those properties did not disturb the desired levels of skew and kurtosis. All of these checks were satisfactory: simulations under almost all conditions and across all properties maintained the desired mean skew and kurtosis linked to each beta condition. It was especially important to make sure varying the discrimination parameters in simulation did not impact these estimates, since the above beta parameters were determined assuming a = 1 for all items. Since discrimination is akin to a slope, it could influence the amount of kurtosis ultimately observed in a simulated item. Varying the alphas did increase the observed range of kurtosis estimates, but it did not notably impact the average skew or kurtosis except when beta condition six was used. Here, the range of kurtosis estimates was significantly influenced, such that average kurtosis was a bit smaller when discriminations much different from a = 1.0 were allowed.
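A condensed sketch of this parameter generation is given below. The values are illustrative: the base betas stand in for one of the Table 7 conditions, and the sort() call is a safeguard against category-threshold crossings that the jittering could otherwise produce.

    set.seed(1)
    a_pool <- runif(96, min = 0.5, max = 2.5)        # widest discrimination condition (TP-like)
    a <- sample(a_pool, size = 24, replace = FALSE)  # draw what a 24-item pool needs
    base_b <- c(-1.5, -0.5, 0.5, 1.5)                # illustrative base betas for one condition
    b <- t(sapply(seq_len(24), function(i) sort(base_b + rnorm(4, mean = 0, sd = 0.25))))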
The first of two correlation conditions, the large condition, was thus defined by this recurring correlation of .7 (associated with shared variance of about 50%). The other condition set the correlation at .35 (approximately 12% shared variance), which was the average estimate associated with the EFA results for MP and TP, and was considered a small correlation. Weakly related factors could more readily be treated as distinct factors, but the presence of multiple dimensions with such correlations in real data could be more problematic for CAT. The dimensionality and correlation conditions produced five possible combinations: (1) a single dimension with an implied correlation of r = 1.00; (2) two dimensions with a correlation of r = .70; (3) two dimensions with a correlation of r = .35; (4) four dimensions with correlations of r = .70; and (5) four dimensions with correlations of r = .35. They were treated as separate variables in analyses. Theta. Trait estimates for each observation were randomly generated from a multivariate normal distribution, with a mean of 0 for each dimension and a correlation matrix as dictated by the selected dimensionality combination. Thetas were fixed so they could be treated as stable LIMITS OF UNIDIMENSIONAL COMPUTERIZED ADAPTIVE TESTS 55 “true” estimates of the trait. Therefore, each of the dimensionality combinations had their own set of true thetas. For each condition, N = 500 thetas were generated. Number of Response Options. Three number of response options conditions were investigated. The first had five categories, consistent with many current measures (including two from Study 1). The second utilized seven categories, while the final condition had only two categories. Since the beta conditions were specified for the five-category condition, these pre- defined betas were adjusted when a different number of categories were evaluated. To simulate two categories, which would produce a single beta parameter, an overall mean of the pre-defined betas was calculated. This created a single difficulty parameter covering a particular beta condition for the two-category condition. Seven categories required expanding the pre-defined beta conditions to include two more estimates. The first addition was the mean of the first two betas, which inevitably fell between the originals. The second was the mean of the last two betas, similarly falling between the two origin betas. Measure Reduction Techniques Computerized Adaptive Test (CAT), Simulated. Choi’s (2009) Firestar program includes data simulation. However, the internal approach does not allow for direct control over the dimensionality characteristics of the simulated data, so a different function to simulate graded response data was created to accomplish this. The data simulation function was aggressively evaluated to ensure pre-specified parameters and estimates were reproducible. See Appendix C for the R code used to simulate data, as well as the R code (from Choi’s program) necessary to simulate the computerized adaptive tests. Within every CAT, data was simulated 10 times for each of the 1260 possible condition combinations, with a sample size of N = 500 for each of the 10 runs. This sample size was LIMITS OF UNIDIMENSIONAL COMPUTERIZED ADAPTIVE TESTS 56 chosen to match the number of predetermined thetas. Each simulated data set was evaluated as a CAT, but results were averaged across runs to produce an overall summary. 
Item selection, the scoring rule, and general program details outlined in Study 1 were retained for the present study.

Classical Test Theory (CTT) Short Form. This study simplified the short form step by performing only one fixed reduction per simulated data set. The CTT technique was selected because it was the most straightforward and is still used extensively instead of (or in addition to) more advanced techniques. The median number of items selected by the CAT determined the size of the CTT short form. Since there was potential for wide variation in the number of selected items, a highly skewed distribution was possible, depending on where most people fell relative to the observed range. The median provided a safe alternative to the mean that was less susceptible to skew. This study retained the CTT item selection process described in Study 1.

Evaluation of Simulation Results

Simulation allowed for a more thorough investigation of CAT results, so many new statistics were computed in Study 2. Ten simulated samples were evaluated per set of conditions, so overall means across samples were estimated and used to summarize the results. Collected information about item use, skew, and kurtosis was consistent with Study 1. Also included from Study 1 were CAT reliability (and its associated SEM estimate) and the squared correlation between the full-scale estimate and the CAT estimate. Only additions or modifications are highlighted below.

Squared Correlations. The true thetas were correlated with both the estimated full-scale theta and the estimated CAT theta, per simulated sample. For conditions with more than one simulated true dimension, the estimated full-scale and CAT thetas were correlated with each dimension and then a mean was computed across the dimensions to describe the overall relationship. This resulted in a single r² associated with the CAT and a single r² associated with the full scale, each describing the average strength of the relationship across all simulated dimensions for a particular set of conditions. The fixed short form summed score had its own set of coefficients. The most important described its relationship with the true theta(s), since this was the one used to determine the real success of the CAT.

Error and Reliability. The first of two error-related indices, called the root mean square (RMS), allowed for an alternate evaluation of the variation present in the CAT estimates:

    \mathrm{RMS} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \theta_{\mathrm{cat},i}^{2} }    (6)

N in Equation 6 is the sample size. RMS values could not be smaller than zero. The farther from zero the RMS, the more implied variance found in the trait estimates. In theory, as the observed trait range widened, or as the distribution's kurtosis or skew became more dramatic, the RMS increased. In this context, a higher RMS was not necessarily bad.

The second index, known as the root-mean-square error (RMSE), was considered a good alternate indicator of accuracy (Willmott et al., 1985; Choi & Swartz, 2009). Like RMS, it could only be positive. The RMSE was calculated similarly to RMS as well; however, instead of simply summing the squared CAT thetas, the squared differences between the simulated true and CAT thetas were summed:

    \mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left(\theta_{\mathrm{sim},i} - \theta_{\mathrm{cat},i}\right)^{2} }    (7)

RMSE described the amount of error associated with the CAT estimates relative to the true estimates, which Equation 7 illustrates. It provided a different way to judge CAT accuracy than RMS, r², or CAT reliability. Larger RMSE values indicated more error (i.e. discrepancy between the CAT and true thetas), so values closer to 0 were preferable.
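In R, Equations 6 and 7 are one-liners; theta_cat and theta_sim below are hypothetical vectors of CAT and true trait values.

    rms  <- sqrt(mean(theta_cat^2))                 # Equation 6
    rmse <- sqrt(mean((theta_sim - theta_cat)^2))   # Equation 7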
Finally, an estimate of Cronbach's alpha was obtained for each CTT short form. CTT reliability was compared to the IRT reliability estimate associated with the CAT to evaluate how different conditions might relate to any incongruity observed between the two estimates.

Analyses

Variables. The first of eight variables of interest was the correlation between the CAT theta and the simulated true theta(s). The second was the correlation between the CAT theta and the estimated full-scale theta. The latter represented the usual comparison (see Study 1), while the former was the more accurate comparison. Since the variance of the correlation coefficient is not constant across the range of its values, Fisher's z transformation (Fisher, 1915) was applied to the pre-squared estimates in order to approximate a distribution with constant variance at all points. The transformation forced the correlations to conform to a z distribution, with values that could be negative or positive. However, because the observed correlations in this study were all positive, their transformed values were also strictly positive. Higher values indicated stronger relationships, consistent with the implications of higher correlation coefficients, while an estimate of zero was analogous to no relationship. These transformed correlations were referred to as CAT-SIM z and CAT-EST z, and described the CAT estimate's relationship with the simulated true theta(s) and the estimated full-scale theta, respectively.
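For reference, Fisher's transformation of a correlation r is

    z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \operatorname{artanh}(r),

which is available in R as atanh(r).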
Instead of evaluating the CAT reliability and alpha reliabilities separately, the difference between the two (REL DIFF) was used. The estimate of alpha was subtracted from the CAT estimate; thus, a positive mean for this difference suggested CAT reliability was larger, while a negative mean suggested alpha was larger. The maximum difference could not exceed an absolute value of 1. RMS and RMSE, discussed earlier, were also included in these analyses.

The five aforementioned variables addressed accuracy, error, and reliability. The final three focused instead on item summarization. The first of the set was mean number of selected items (mean items; MI). As implied, this was simply the mean number of items administered by the CAT. The second variable represented the difference between the mean number of selected items and the median number, calculated by simply subtracting the median number of items from the mean. As with all difference variables, a value of 0 implied no difference, while positive values indicated larger means and negative values identified larger medians. Larger means suggested the distribution of the number of administered items skewed positive, while negative skew was assumed when the median was larger. The last variable was the calculated difference between the weighted kurtosis and the un-weighted kurtosis estimate. Again, the weighted estimate took into account how frequently an item was selected by the CAT. To produce the difference, un-weighted kurtosis was subtracted from weighted kurtosis. Larger values (in either direction) indicated larger discrepancies between the two estimates.

MANOVA. To warrant exploration of specific relationships between the six properties (i.e. the independent variables) and the dependent variables, multivariate ANOVA models were evaluated first. Since there were several dependent variables, MANOVA offered a more powerful way to establish that differences existed somewhere, and provided a basis to test more specific ANOVA models. One MANOVA model was run for each property separately to determine whether multivariate means (centroids) differed between the property's groups. These tests required p < .05 to indicate a statistically significant effect, and did not have any specific hypotheses associated with them.

ANOVA. In addition to tests of univariate models supported by the results of the MANOVA, a series of planned comparisons were investigated to more clearly address the hypotheses. Several interactions were assessed as well to better understand how these properties worked together to affect certain dependent variables. Like the planned comparisons, these were based on a priori hypotheses.
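These models have direct base R analogues; the sketch below assumes a hypothetical run-level data frame named results, with one row per simulated condition-run and columns named after the variables described above.

    # one MANOVA per property, across the eight dependent variables
    summary(manova(cbind(cat_sim_z, cat_est_z, rel_diff, rms, rmse,
                         mean_items, mean_med_diff, kurt_diff) ~ factor(n_items),
                   data = results))
    # follow-up one-way ANOVA for a single dependent variable
    summary(aov(cat_sim_z ~ factor(n_items), data = results))
    # an a priori interaction model
    summary(aov(mean_items ~ factor(n_items) * factor(n_resp) * factor(beta_cond),
                data = results))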
Results

Multivariate Tests

The multivariate models suggested statistically significant mean differences existed for all six independent variables (see Table 8). Particularly large effects, based on the estimates of f² and Cohen's (1988) guidelines, were identified for number of items, number of dimensions, dimension interrelatedness, and discrimination parameters. Since all of the properties had statistically significant results, all possible one-way ANOVA models were evaluated.

Univariate Tests

Given the number of one-way ANOVA models, a stricter alpha level of .001 was used. Thus, to indicate a statistically significant effect, the p-value associated with a test had to be less than .001. This coincided with a Bonferroni correction of the usual alpha level for 50 tests (.05 / 50 = .001), though there were only 48 one-way ANOVAs here (not including planned comparisons or interaction tests). Correlations between the dependent variables were generally small, with an average absolute value equal to .262. There were two notable exceptions. The first was the correlation between RMSE and CAT-SIM z, which was r = -.745, and the second was the correlation between RMS and the difference in kurtosis estimates (r = -.733). Despite these, the correlations between the dependent variables were in general quite minimal, so the selected correction may have been overly conservative. An overly conservative correction can lead to underestimates of power, but it has no bearing on the estimates of effect size.

Number of Items. Number of items exhibited statistically significant relationships with all dependent variables except RMSE (see Table 9 for a summary of results). These results on their own provided incomplete support for the two hypotheses posed earlier, which addressed between-group differences in CAT-SIM z and mean items selected (MI). Further investigation required removing the 12-item condition and re-running the ANOVA models. The new tests returned statistically non-significant results for both CAT-SIM z and MI, suggesting meaningful differences in CAT-SIM z and MI existed only between the 12-item condition and at least one of the other three conditions. The pattern of means for these dependent variables indicated the relationships were consistent with expectations. CAT-SIM z and MI were both smaller for the 12-item condition than for the others, and minimal differences were observed between the other three conditions (see Table 10).

Number of items related most to CAT-EST z. Means suggested the 12-item condition had the highest CAT-EST z, which decreased quite dramatically to 36 items, then remained relatively stable from 36 to 48 items. This is not surprising when viewed in the context of mean items. The mean number of selected items for the smallest item condition was 11.5. If most people received all or almost all of the full scale in the CAT, as they would have with the 12-item condition, the correlation between the CAT and full-scale thetas would be nearly 1. As more items were introduced, the mean number of items increased to a point, but the ratio of mean number of items to total item pool size decreased. Consequently, estimates stemming from larger item pools looked less like the estimated full-scale theta than those from conditions with smaller item pools.

Number of Responses. Like number of items, number of responses was not statistically significantly related to RMSE. It also did not produce significant differences between mean and median number of selected items. While the other effects were statistically significant, they were mostly small, excepting RMS and MI (see Table 11). Mean RMS, an indicator of variation in the CAT estimates, was largest for the five- and seven-category conditions (which were roughly equal) and smallest for the two-category condition (see Table 12).

The main hypotheses associated with number of responses addressed its relationship with CAT-SIM z and MI. The CAT-SIM z effect was small, similar in size to that observed for number of items. Comparing only five and seven response categories indicated no differences, so the small effect observed in the main test was due to the difference between these conditions and the two-category condition. The mean for CAT-SIM z was smaller for two response options, as hypothesized, but again, not by any meaningful amount. The MI effect was considerably larger, and explained about 15.5% of the variance in mean items selected (an unadjusted one-way ANOVA R²; see the ANOVA model that included the interactions stemming from N ITEMS x N RESP x BETA for a more accurate description of this effect). In support of the hypothesis, the mean for two response options was largest, with a rather dramatic drop from two to five response options. Consistent with the more specific hypothesis, five and seven categories exhibited no statistically significant difference in MI, though the seven-category condition used nearly one item less on average.

Number of Dimensions. Mean differences between the dimensionality conditions were in the expected direction. CAT-SIM z was greatest for the one-dimension condition and dropped rather significantly by two dimensions (Table 13). The observed difference between two and
Number of Dimensions. Mean differences between the dimensionality conditions were in the expected direction. CAT-SIM z was greatest for the one-dimension condition and dropped considerably by two dimensions (Table 13). The observed difference between two and four dimensions was not as large, but both the overall ANOVA model and the subset model comparing the two- and four-dimension conditions revealed statistically significant effects (see Table 14). This supported the hypothesis for CAT-SIM z. The hypotheses associated with REL DIFF and RMSE were also supported by the results. The mean difference in reliabilities and RMSE both increased as the number of dimensions increased, and the univariate tests indicated the effects were statistically significant. The effect for REL DIFF persisted even when the one-dimension condition was excluded; however, the size of the effect was proportionally smaller, consistent with larger observed differences from one to more dimensions than from two to four dimensions. Similarly, the effect for RMSE decreased when the one-dimension condition was removed, but remained very large. Mean number of selected items actually decreased from one dimension to four, but the actual numerical changes were quite small. This effect was not statistically significant, as indicated in Table 14, so the hypothesis of differences here was not supported by the results. Additional comparison of the two- and four-dimension conditions was deemed unnecessary.

Dimension Correlations. Since correlations were so highly linked to the number of dimensions property, the two performed similarly (see Tables 15 and 16). The hypotheses for CAT-SIM z, REL DIFF, and RMSE were supported by the results, while the hypotheses associated with MI were not. This highlighted the importance of investigating the interaction between the two- and four-dimension categories and their possible correlation values (r = .35 or .70). The effects for CAT-SIM z, REL DIFF, and RMSE were quite large here, but were certainly redundant with those for dimension number. The correlation between dimensions seemed more detrimental to CAT accuracy (CAT-SIM z) than the number of dimensions, such that having poorly correlated dimensions was worse than just having many dimensions. Accuracy was severely affected with even the mildest set of multidimensionality conditions (2 dimensions, r = .70). CAT-SIM r² averaged .691 (range = .350 - .751) when N DIM = 2 and DIM R = .70, which was quite a bit smaller than that for one dimension (mean r² = .840, range = .448 - .902). The single-dimension case had a noticeably wide range that dipped quite low, but only 24 of the 252 unidimensional combinations had CAT-SIM r² values less than .750.
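Because the next two properties map directly onto the item parameters of Samejima's (1969) graded response model, it may help to restate the model's standard logistic form, where a_i is item i's discrimination (the alpha conditions) and b_ik its category thresholds (the beta conditions):

    P(X_i >= k | theta) = 1 / (1 + exp(-a_i * (theta - b_ik)))

The probability of responding exactly in category k is the difference between adjacent boundary curves, P(X_i >= k | theta) - P(X_i >= k + 1 | theta). Larger a_i values concentrate item information near the thresholds, which is why strongly discriminating items help a CAT satisfy its SEM stopping rule with fewer administrations.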
Discriminations. Study 1 suggested measures with many good discriminators required fewer items during the CAT phase. Study 1 also implied the measure with better discriminators would capture its construct(s) more accurately than the measure with poorer discriminators when the number of selected items was held constant. Here, the mean items and CAT-SIM z conditions worked against each other. In order for CAT-SIM z to differ across alpha conditions, the number of items selected would have to be roughly equal. This could only happen in the CAT if constrained intentionally (or if the item pools were all sufficiently small). Here, the CAT used more items to produce a CAT-SIM z that was relatively unchanged across discrimination conditions (see Table 17). Consequently, the effect was not statistically significant (Table 18), supporting the hypothesis. Mean number of items did change quite a bit, though, and the overall test was statistically significant. All possible comparisons were evaluated. It was hypothesized MI would increase as the discrimination ranges narrowed and their values decreased. The first condition, representing the traditional punishments measure from Study 1, did in fact have the smallest MI, followed by the second condition (need for cognition), and finally the third condition (mature personality). Much like Study 1, the differences were in the correct direction; however, the difference between the second and third conditions was perhaps quite a bit larger in these simulations than the actual results for NC and MP from Study 1 implied, while the differences between condition one and the other conditions were not quite as notable as observed in that study. Pairwise comparisons revealed the small difference between conditions one and two was statistically significant, and both conditions differed to a significant degree from condition three.

Difficulties (Beta; Skew and Kurtosis). There appeared to be a small amount of variation in CAT-SIM z across the beta conditions (Table 19), but as Table 20 shows, there was no statistically significant effect. Pairwise comparisons were consequently not investigated for this dependent variable. There was a statistically significant mean difference in MI between beta conditions, an effect that was relatively small compared to those of the discrimination and number of response options properties in particular, but still meaningful. The pairwise hypotheses were partially supported by the results. Statistically significant differences were observed for condition one vs. condition two, condition one vs. condition four, condition two vs. condition three, and condition six vs. condition seven. This supported the four hypotheses linked to these tests. In general, the statistically significant tests exhibited an average difference of at least two items. In the present study, the largest tested difference occurred between conditions two and three, which were the positive and negative kurtosis (without skew) conditions, respectively. The findings here supported the MANOVA, which suggested the beta property had the smallest overall effect. In general, none of the effect sizes even reached Cohen's (1988) cutoff for a medium-sized effect. This contrasted with all other properties (except number of response options), which had at least one large effect each. Of the remaining tests, CAT-EST z was the only dependent variable not to exhibit statistically significant differences in means. Kurtosis differences, associated with the largest effect, were greatest for the high positive kurtosis conditions. Coupled with the near-zero mean difference for the high negative kurtosis condition, the CATs seemed to prefer items with less positive kurtosis, pulling the distribution away from a potentially leptokurtic shape. Surprisingly, the high positive skew with high positive kurtosis condition had the smallest mean REL DIFF, though further exploration revealed this difference was due to the condition producing the smallest mean estimate of CAT reliability. Excluding the final beta category, observed REL DIFF did not change much from condition to condition. In fact, Cronbach's alpha means for all seven conditions were relatively close to one another, falling within the tight range of .734 to .755.
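Item-level skew and kurtosis of simulated responses can be checked directly; a minimal sketch using the 'psych' package (used elsewhere in this dissertation) follows, where sim_responses is a hypothetical matrix of item responses with one column per item. A check of this kind underlies the beta-condition stability issues discussed under Limitations below.

    # Per-item distribution checks with the 'psych' package.
    library(psych)
    apply(sim_responses, 2, skew)     # per-item skew estimates
    apply(sim_responses, 2, kurtosi)  # per-item (excess) kurtosis estimates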
Tests of Interaction Effects

N DIM x DIM R and CAT-SIM z, REL DIFF. Both of the interaction tests returned statistically significant results (see Tables 21 and 22). The actual interaction effects were small but meaningful. The tests of main effects remained statistically significant despite the interaction, so both number of dimensions and dimension correlation were related to CAT-SIM z and REL DIFF outside of the interaction. As suggested by Figure 13, the mean difference in CAT-SIM z between the two correlation conditions for four dimensions was larger than the mean difference for two dimensions, supporting the hypothesis. Similarly, the difference in REL DIFF between correlation conditions was larger for four dimensions, consistent with the hypothesis. Refer to Figure 14 and Table 23 to clarify the reliability interaction.

N ITEMS x N RESP x BETA and Mean Items (MI). Removal of the 12-item N ITEMS condition nullified the main effects noted in the earlier one-way ANOVAs for both N ITEMS and N RESP, though the N ITEMS result was consistent with the smaller three-group ANOVA tested. The beta property's main effect remained, but was smaller than the earlier estimate. There was no N ITEMS x N RESP interaction, suggesting differences in MI for each N ITEMS condition did not vary across N RESP conditions (Table 24). The other three interaction terms were statistically significant, supporting the associated hypotheses. The effect size for N ITEMS x BETA was the largest of the three, while the effect for N RESP x BETA was smallest. However, they all fell within the small-to-medium range when judged using Cohen's cutoffs. The significant N ITEMS x BETA interaction indicated differences in MI across BETA conditions varied by N ITEMS condition, such that item condition impacted each BETA condition somewhat differently. The significant N RESP x BETA interaction identified a similar dynamic. Figures 15 and 16 provide visual representations of these interactions. Figure 16 highlighted two things worth noting. The first was the disparity produced by beta condition seven: mean differences were most dramatic for that condition. The second interesting observation related to how similar the five and seven response option conditions were. Except for beta condition six, they had identical mean MI. As hypothesized with regard to the main effect of N RESP, the real N RESP story concerned the difference between dichotomous scaling and polytomous scaling, not between various polytomous scale sizes. Figures 17 through 19 show the N RESP x BETA interaction for each N ITEMS condition to illustrate the three-way interaction. From the figures it can be seen how the number of items available in the pool interacts with number of response options and the beta conditions. The differences are most notable for beta conditions four, five, and seven, which were the moderate-to-high skew conditions. The CAT frequently administered the entire item pool for these conditions to obtain an estimate of the standard error of measurement of theta as close to the preset value as possible, explaining the dramatic increase in MI with larger item pools. Strangely, beta condition six did not exhibit the same behavior as the other skew conditions.
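The interaction tests above can be reproduced in outline with base R; again, runs and its columns are hypothetical stand-ins, and plots of the kind summarized in Figures 15 and 16 can be generated with interaction.plot.

    # Factorial ANOVA for mean items selected (MI), with all two- and
    # three-way interactions among the three properties:
    fit <- aov(mi ~ factor(n_items) * factor(n_resp) * factor(beta),
               data = runs)
    summary(fit)

    # A two-way slice of the interaction, one trace line per N RESP level:
    with(runs, interaction.plot(beta, n_resp, mi,
                                xlab = "Beta condition",
                                trace.label = "Response options",
                                ylab = "Mean items selected"))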
Discussion

Summary of Results

Number of Items. As indicated in Table 25, both N ITEMS hypotheses were fully supported by the data. However, the results indicated the statistically significant CAT-SIM z effect was tied to the 12-item condition. When excluded, the effect went away almost entirely. There are ways the 12-item condition could have confounded some results (see below), but here it would seem even with the condition in play the effect was nowhere near large enough to suggest the number of available items really mattered when it came to CAT-SIM z. The property did significantly relate to CAT-EST z, which might explain why past research has continued to push the importance of item pool size. The mean number of selected items was larger with more items in the item pool, but the ranges for mean and median numbers of items revealed some interesting information. Most notably, the lower bound for the 12-item condition was almost always larger than that of the other conditions. While the other conditions were often able to achieve means of seven and eight items, the 12-item condition was usually stuck at a minimum of 10, or approximately 83% of its pool. It would seem the limited pool size forced the administration of items that would otherwise have been ignored (or used later) by the selection process if other items had been available to choose from. Based only on the results of this set of analyses, sticking with a pool size of about 24 items would probably suffice for CAT. This was a bit less than the 30 items suggested by Dodd, Koch, & De Ayala (1989), but was in general agreement with the findings of Gnambs & Batinic (2011).

Number of Response Options. The CAT-SIM z and CAT-EST z effects were both small despite statistical significance. In general, many of the effects tied to number of response options were due to differences between the two-category condition and the other conditions, not between the other conditions themselves. That is, though only formally tested with CAT-SIM z and mean items, if the two-category condition was removed, most of the other dependent variables would not relate to number of responses at all.

Number of Dimensions & Dimension Correlations. All of the associated hypotheses for dimensionality were fully supported by the data, excluding the mean items hypotheses, which were not supported at all. In fact, the mean number of selected items was actually greater for the "ideal" condition (a single dimension) than the others. These two properties demonstrated sizable relationships with RMSE and reliability differences, but their most important contribution seemed to be their relationships with CAT-SIM z. These two properties accounted for a substantial portion of the variance in CAT-SIM z, making dimensionality the most important determinant of this indicator of CAT accuracy. Dimensionality had no statistically significant relationship with CAT-EST z, though, which explains why the measures in Study 1 performed well as CATs despite the presence of significant multidimensionality. As indicated earlier, number of items in the pool was really the main link to CAT-EST z, demonstrating how unreliable item pool size was as an indicator of CAT performance. The N DIM x DIM R interactions were statistically significant. Both of these outcomes were consistent with the idea of compounding effects – that is, the worst of both properties was notably worse than any alternative pairing. The effects were rather small for these interactions, indicating much of the action still lay with the main effects, but the take-home message remained: with more dimensions and smaller correlations between dimensions, the effects looked more severe than would be expected if differences were constant.

Discriminations. The inclusion of discrimination parameters was more confirmatory than exploratory, as many studies have looked into the effects of discrimination already and found similar results.
However, it was necessary to include them because they influenced how realistic the data appeared. It was hypothesized the discrimination parameters would not relate to CAT-SIM z, and this was in fact true based on the results. The assumption was that simulated data sets associated with smaller discriminations could compensate by simply administering more items. These parameters would be important for mean items, but not for CAT-SIM z, as a consequence of design more than anything. Undoubtedly, if the number of selectable items were restricted, there would be a relationship between discriminations and CAT-SIM z. The statistically significant association with mean items was large, the largest of all properties. Study 1's measures linked discrimination to mean items, and this simulation study further supported that link. Here, the first discrimination condition, which was allowed the widest range, did in fact have the smallest number of mean items, an amount that was statistically significantly different from the next condition. The largest differences occurred between the first two conditions and the third condition, which consistently had discrimination values less than 1.

Difficulties (Beta; Skew and Kurtosis). Several of the pairwise hypotheses were supported by the data, but in general, the beta property was poorly related to the dependent variables. The lack of relationship with CAT-EST z made sense, since there was no indication beta parameters (and by extension, skew) really mattered in Study 1. The absence of a relationship with CAT-SIM z was surprising, since it was assumed one would exist due to how variations in skew and kurtosis would affect the simulated data. More specifically, it was theorized the level of skew and kurtosis would become so severe with certain beta conditions that the CAT would be unable to really capture the simulated theta no matter how many items were administered. Mean number of items selected was one area where the beta conditions had some impact, and this was explored further with several interaction tests. The most important indicated the number of items available in the pool influenced the interaction between number of response options and the beta conditions. Essentially, the beta conditions involving high skew and/or high positive kurtosis needed significantly more items from the pool when there were only two response options. These results would suggest skew and kurtosis really do matter, but perhaps only for dichotomous variables (which tend to have higher skew and kurtosis anyway). Under those circumstances, a larger item pool would greatly benefit the researcher.

Accuracy and Reliability

The CAT correlation with the simulated theta could not exceed the correlation between the full-scale estimate and the simulated theta (hereafter EST-SIM). It was important to establish that EST-SIM could be as close to r = 1 as possible under ideal conditions, because if this could not be demonstrated, the simulated data would be difficult to trust. The highest observed EST-SIM correlation (r = .985) occurred for the following combination of properties: N ITEMS = 48, N RESP = 7, N DIM = 1 (with DIM R = 1), ALPHA = 2, and BETA = 6. Ignoring the beta condition (the peculiarity of which is discussed further below), this was expected, and indicated the simulation performed properly.
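For reference, the z-transformed correlations used throughout (CAT-SIM z, CAT-EST z, and the EST-SIM values here) follow Fisher's transformation,

    z = (1/2) ln((1 + r) / (1 - r)),   with inverse   r = (e^(2z) - 1) / (e^(2z) + 1),

so that, for example, r = .90 corresponds to z ≈ 1.472. The transform is commonly chosen because it makes means of and differences between correlations better behaved near the bound of r = 1.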
The correlation (or squared correlation) between the CAT estimate and the full-scale estimate (here CAT-EST) is usually used to judge the accuracy of CATs. However, this study demonstrated the relationship was linked mostly to item pool size (and the discriminations to a lesser extent). The correlation with the simulated (true) theta was entirely a dimensionality issue, and meant Study 1 results were incorrect due to the identified multidimensionality. The present study would place the VCAT-SEM squared correlations from Study 1 somewhere around an average r² = .640 (equivalent to a Fisher's z-transformed pre-squared r of 1.110), considerably smaller than their CAT-EST r² estimates.

Incorrect model specification also had implications for reliability. CAT reliability did not change meaningfully from one dimensionality condition to the next. This might seem strange, but since the SEM was dependent on the CAT's ability to reproduce the estimated theta, this actually made perfect sense. CAT reliability was consequently not a very useful indicator because the estimated theta was inaccurate. Cronbach's alpha estimates were actually more sensitive to changes in dimensionality. The more dimensions present, the smaller the mean estimate of alpha. Additionally, a possible interaction was observed between number of dimensions and dimension correlation. The difference in mean alpha estimates between dimension number conditions was larger for samples with the dimension r = .35 condition. A discrepancy between CAT reliability and Cronbach's alpha with real data could indicate some amount of unaccounted-for multidimensionality. However, in all but the most extreme dimensionality case tested (four dimensions with an average r = .35), mean reliability as assessed with alpha would still be considered good.
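The two reliability indices being contrasted here behave differently by construction. Assuming a standardized trait metric, an SEM-based CAT reliability takes the common form

    reliability_CAT = 1 - SEM²,

so a termination criterion of SEM = .37 targets a reliability near 1 - .37² ≈ .86 regardless of what the estimated theta actually measures. Cronbach's alpha, by contrast,

    alpha = (k / (k - 1)) * (1 - (sum of item variances) / (variance of the total score)),

depends directly on the inter-item covariances and therefore drops as multidimensionality weakens them.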
Limitations

The most notable limitations of the present study concerned issues with item pool size and the beta conditions. In many cases, effects dependent on item pool size (e.g. mean items selected, CAT-EST z) were significant due to the inclusion of the 12-item category. This was explored to some degree in this study by excluding the 12-item condition from the assessment of CAT-SIM z and mean items. The mean number of selected items in the CATs overall was about 15, rounded to the nearest whole number, which was close to the estimates for all item pool sizes except 12 items. As a result, many of the identified differences could be present because the 12-item condition was not able to administer enough items to deal with the various conditions of other properties and obtain a sufficient SEM. It is possible leaving this condition in for other one-way ANOVA models (specifically those not involving number of items) produced distorted results. However, the problems associated with this condition were also reasons to justify its inclusion. It was known beforehand that 12 items were not sufficient for CAT, and the results of the present study support this.

The assessment of skew and kurtosis relied heavily on correct specification of the beta parameters, and initial testing found them to be stable across the various properties' conditions. However, in the simulated data evaluated for this study, some of these conditions turned out to be less stable. For the most part, skew was correct – mean skew was roughly as expected for all conditions. Kurtosis proved problematic. Conditions one, two, and six all demonstrated wild variation in kurtosis, making estimates inconsistent with the condition descriptions. For instance, condition six ended up indicating negative skew (accurate) with no kurtosis (inaccurate), making it identical to condition four. Similarly, condition one produced no skew (accurate) with moderate negative kurtosis (inaccurate), while condition two ended up producing items that should have been associated with condition one (i.e. low skew and kurtosis). Surprisingly, condition seven, arguably the most extreme (along with hypothetical condition six), actually produced items that looked the way they were supposed to. The other conditions were also correct. This modified the interpretation of some results from above (notably, condition two should be read as condition one), but the findings as they stand were still meaningful. The effects would likely be larger with correctly specified conditions two and six (high positive kurtosis).

Some other factors affected how realistic the simulated data looked. For instance, the beta parameters were all relatively similar within a particular condition. Again, this was designed to impose certain ranges of skew and kurtosis on the data, but it came at the price of truly "real" data. Since the simulated thetas were always generated from normal distributions, betas indicating high skew and kurtosis were going to perform more poorly just because they forced the sample estimates away from the true estimates. Trying out trait distributions with some skew or kurtosis might increase the generalizability of these results, as Study 1 indicated the real data measures might not actually conform to a normal distribution. Alternatively (or in addition), it may be informative to vary the type of items within each simulated measure. Lastly, it is possible little additional information is truly gained with seven categories over five, but the lack of observed differences in the present study could be due to how the seven-category condition was formed. Generally, the more extreme categories of a Likert-type item are selected less frequently when one assumes a normally distributed trait. Relative to 5-point Likert-type items, a 7-point Likert-type item would probably have more extreme first and final difficulties to account for its even less frequently selected first and seventh categories. This study manipulated the middle betas instead of the outer two, essentially providing more information within the same difficulty range for the seven-category condition. Instead, widening the difficulty range could prove useful and merits further investigation; the sketch below illustrates the mechanism.
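A rough illustration of this threshold mechanism, assuming a logistic GRM with a single item (a = 1.5, four thresholds, theta ~ N(0, 1)); this is illustrative only, not the study's generating code:

    a <- 1.5
    b <- c(-2, -1, 1, 2)      # category thresholds for a 5-point item
    theta <- rnorm(100000)    # standard normal trait
    # Boundary probabilities P(X >= k | theta), one column per threshold:
    pstar <- sapply(b, function(bk) plogis(a * (theta - bk)))
    # Category probabilities: differences of adjacent boundary curves
    pcat <- cbind(1 - pstar[, 1],
                  pstar[, -4] - pstar[, -1],
                  pstar[, 4])
    round(colMeans(pcat), 3)  # expected proportion in each category

Pushing the outer thresholds further out (e.g. b = c(-3, -1, 1, 3)) makes the first and last categories rarer, which is the dynamic described above for 7-point scales.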
Recommendations

Taken together, these results can inform researchers interested in using CAT with their non-cognitive measures. For instance, five-category response scales are probably sufficient for most research needs, if one intends to adopt the typical questionnaire format with ordered-category response scales (here, this includes dichotomous scales). They certainly perform better than two-category scales, and in this study seven categories offered no benefits. An item pool size somewhere between 15 and 24 items would probably suffice, unless dichotomous items must be used. If so, a larger pool, around 36 items, is advised. The most extreme beta condition, condition seven, was the only one that frequently required the maximum number of items, and it is unlikely this situation occurs in real data unintentionally. Even conditions four and five are unlikely, considering betas of this nature would probably arise because the underlying trait is very much skewed as well. A larger item pool would not hurt, of course, but a researcher at the item creation stage probably would not need to find or create more than 36 items.

Multidimensionality and discrimination are harder to address during the design phase, and affecting them may require quite a bit of diligence on the part of the researcher. It would help to anticipate how participants might respond to an item during its creation, noting its response scale. Discriminations, dimensionality, and difficulties all depend on response behavior to some degree, so identifying vague items or items begging for inconsistent responses would be a good thing to do before administering the test. One goal here is to make the items as clear as possible so responses are accurate. Items can be vague in multiple ways, though. The items need to be clear, but they should not be too general. The main consequence of overly general items would be smaller discriminations, thus requiring more items in the CAT phase. Overly easy and overly difficult items are also possible with non-cognitive measures. One such example would be a statement like, "I am severely depressed". In general settings, this would garner a response of "not at all" or "strongly disagree" from most participants. Assuming the other items in the measure are not like this one, the eventual IRT or factor analysis would suggest this item is not a good indicator of the trait. It would exhibit high skew (and kurtosis), and would have a mean at the low end of the response scale (or high end, depending on where the disagreement response lies).

Multidimensionality can be addressed during the CAT phase as well. If one has already analyzed some of their data and noted multidimensionality is present, there are two directions to go. The first would be to treat each dimension as a separate measure. This is the simpler approach, but without accounting for the relationship between dimensions, significantly more items than necessary would be required. In addition, there may simply not be enough items associated with each dimension for this approach to work. An alternative would be to utilize some kind of multidimensional IRT model, which would allow the researcher to keep the measure together. Multidimensional adaptive testing procedures are more complex than their unidimensional counterparts, though, and as a result they can be more difficult to produce. Unfortunately, according to the results of this study, if there is any amount of multidimensionality present, the unidimensional models do not perform well. A short form may, in these situations, be a good way to go, unless there is a sufficiently large pool of items to allow for by-dimension unidimensional CATs. If so, it would be better if the dimensions were as unrelated to each other as possible.

Future Studies

The number of significant interactions identified in this study suggests other interactions are likely present in these data.
In particular, it would be worthwhile to better understand how the discrimination parameters interact with the betas, number of response options, and number of items properties to explain mean number of items selected. The discrimination parameters had the largest estimated effect on this dependent variable, so it seems reasonable to assume they might have a complex relationship with those properties. With additional exploration, it is possible many new dynamics could be identified that provide a lot of information about any of the dependent variables. A lot of information was collected about the CTT short forms created within each CAT run as well, but only the differences in reliabilities were analyzed in this study. Preliminary analyses run on these data provided a few clues as to how the study properties might affect the z-transformed correlations between the summed score and the simulated theta: (1) the general mean difference between the CTT correlation and the CAT correlation was about .099, which meant a slightly smaller correlation for CTT; (2) dimensionality led to the most noted mean shift, but in this case, the largest difference (.208) was actually for the one-dimension case, while the other conditions had differences ranging from .044 to .100; and (3) fewer items in the pool meant these estimates were going to be closer (again, because more items were being used in general relative to the pool size), so differences were smallest for the 12-item condition. The most interesting of these was the second observation, which suggested incorrect model specification actually made the CTT and CAT results look more similar. Future studies should more thoroughly analyze the short form results, as they provide another test of the usefulness of the CAT.

Study 3: Real Data CAT and Multidimensionality

The findings of Study 2 revealed multidimensionality most affected trait estimates, and showed many common ways to assess CAT performance were inaccurate when multidimensionality was present to any extent. A third study, described in detail below, was proposed to apply the results of Study 2 to the real data used in Study 1. One could argue the simulations from Study 2 presented simplistic versions of real-life situations. Consequently, two checks were essential to extend the generalizability of the Study 2 results: (1) demonstrate true, underlying multidimensionality impacted the real measures as much as the simulated measures; and (2) demonstrate results for each measure were consistent with the best-matched run(s) from the simulation study. With satisfactory evidence of both, the results of Study 2 could more reliably be applied to real measures. Researchers would be able to anticipate the adequacy of their measures for CAT before investing the time and money often required of the approach. In addition, they could use the results to inform decisions during the formation of measures.

Method

Trait Estimates

Three types of "true" trait estimates were derived for each measure based on multidimensional models. All of these were obtained using R's 'mirt' package (Chalmers, 2012), which can fit exploratory and confirmatory multidimensional IRT models. This package was used instead of functions from packages discussed earlier (i.e. 'psych', 'lavaan') because it allowed for EAP estimates. This kept the score type within the same "family" as the CATs.
The first set of estimates was created from an exploratory IRT model specifying the number of dimensions identified in Study 1 for each measure. Factor patterns from these exploratory item factor analyses generally agreed with those from the 'psych' package's irt.fa function. An oblique rotation method, promax, was used to allow for correlated dimensions. The second set of trait estimates was obtained using the confirmatory multidimensional models described in Study 1 (i.e. correlated factors models), while the third set was derived from a third model, the bi-factor model (Gibbons & Hedeker, 1992; Holzinger & Swineford, 1937). Bi-factor models assumed an overarching dimension representing a general construct and several specific dimensions (in this case, the dimensions of the correlated factors models). Any relationship between items occurred via the general dimension, so the specific dimensions were not allowed to correlate with each other. In addition, these specific dimensions did not correlate with the general dimension. Many have noted multidimensionality is more detrimental when evidence for a general dimension, as predicated by models like the bi-factor model, is limited (e.g. Gibbons et al., 2007).

Package 'mirt' provided the following indices to compare model fit for confirmatory multidimensional models: G², an approximation of the chi-square (χ²) statistic (Sokal & Rohlf, 1981), with associated degrees of freedom and p-value; the Akaike information criterion (AIC); and the Bayesian information criterion (BIC). Since the correlated factors models could be considered nested within the bi-factor models, estimates of the AIC, BIC, and G² were compared to determine whether there was a meaningful difference between the two model types. Of the three, the BIC is typically the most conservative because it imposes a potentially stringent per-parameter penalty equal to the natural log of the sample size (Wicherts & Dolan, 2004). By comparison, the AIC imposes a smaller penalty of 2 for each additional parameter. As with the chi-square, the smaller the BIC and AIC, the better the model. These indices were also provided for the exploratory tests, and loose comparisons were made with the confirmatory models. For the present study, changes in all three had to agree before a difference in model fit was considered definitive.

Factor scores were computed using the 'mirt' function fscores. Again, EAP estimates were requested for all models. The confirmatory methods only used items specified to load on a particular dimension to estimate factor scores, while the exploratory method used all items with a non-zero loading. No loading cutoff was specified for the exploratory model (e.g. λ = .3) to ensure the correlated factors model and the exploratory model differed from each other in their relationships with the CAT estimates.
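A condensed sketch of the three model fits and scoring calls in 'mirt' follows, using a hypothetical 12-item, three-dimension measure in the mold of TP; the object names and item-to-dimension assignments are placeholders rather than the study's actual specification.

    library(mirt)

    # Exploratory item factor analysis with the Study 1 dimension count;
    # an oblique (promax) rotation is applied when summarizing the solution.
    efa_fit <- mirt(resp, 3, itemtype = "graded")
    summary(efa_fit, rotate = "promax")

    # Confirmatory correlated factors model via mirt's model syntax:
    cfa_syntax <- mirt.model("
      F1 = 1-4
      F2 = 5-8
      F3 = 9-12
      COV = F1*F2, F1*F3, F2*F3")
    cfa_fit <- mirt(resp, cfa_syntax, itemtype = "graded")

    # Bi-factor model: a general dimension plus orthogonal specifics,
    # indexed by each item's specific-factor assignment.
    bf_fit <- bfactor(resp, c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3))

    # Nested model comparison (AIC/BIC/G2) and EAP factor scores:
    anova(cfa_fit, bf_fit)
    theta_bf <- fscores(bf_fit, method = "EAP")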
The VCAT-SEMs⁴ from Study 1 were used as the basis for this study's analyses. Since reevaluation of the original measures did not require actually re-running the computerized adaptive tests, the CAT EAP estimates obtained from earlier runs were used with the estimates from the above models. For that reason, only the validation samples from Study 1 were utilized in the present study. CAT estimates for each measure were correlated with each of the above models' trait estimates.

Footnote 4: These were the variable computerized adaptive tests (VCAT) that relied on the standard error of measurement (SEM) criterion for CAT termination. The SEM rule for all CAT runs was a maximum SEM of .37. When the SEM rule was prioritized, the CAT terminated for an individual either when an SEM equal to or less than .37 was observed, or when the maximum number of items in the pool had been administered.

The same correlations were obtained for the CTT-based summed scores, CFA-based factor scores, and IRT-based EAP scores. These short forms assumed a number of items equal to the median from each measure's CAT: TP had a median of six, NC had a median of eight, and MP had a median of 10.

Estimates of score deviance (BIAS) and RMSE were obtained for each type of trait estimate, providing two additional ways to evaluate their relationships with the CAT score. BIAS was the estimated mean difference between the score for a particular dimension and the CAT score, and was a straightforward indicator of how different the estimates appeared, on average. Since positive and negative differences could cancel each other out once averaged, the mean of the absolute value of the differences was computed. Larger values indicated greater departure from the dimension scores. Refer to Equation 7 from Study 2 to review how RMSE was calculated.
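In code, the two indices reduce to a couple of lines; theta_cat and theta_dim are hypothetical vectors of CAT and dimension scores on the same metric, and the RMSE is written in its usual root-mean-square form, assumed here to match Equation 7:

    bias <- mean(abs(theta_cat - theta_dim))        # mean absolute deviation
    rmse <- sqrt(mean((theta_cat - theta_dim)^2))   # root mean squared error
    r2   <- cor(theta_cat, theta_dim)^2             # shared variance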
Though all of the aforementioned statistics were estimated for the dimensions of the bi-factor model, the dimension of primary interest was the general dimension. Tables 26 - 28 provide the 14 best-matched Study 2 runs for each measure. Runs were selected based on the combination of properties most aligned with the particular measure. For instance, the traditional punishments (TP) measure exhibited a three-factor structure with high factor correlations, a wide range of discriminations, seven response options, and 12 items. The runs that satisfied this arrangement of conditions were selected. Since there was no three-dimension condition in Study 2, the traditional punishments results were compared to one set of runs for two dimensions, and another set for four dimensions. Need for cognition (NC) had 18 items, which, like the observed dimensionality of the traditional punishments measure, fell between two tested conditions in Study 2. Results from its evaluations were compared to the nearest 12- and 24-item conditions. Lastly, mature personality (MP) had an average factor correlation between r = .35 and r = .70, so it was compared to matched runs from both dimension correlation conditions.

Results and Discussion

Traditional Punishments

The chi-square difference test indicated the bi-factor model was statistically significantly better than the correlated factors model (χ²diff = 152.4, df = 6, p = .000; AICdiff = 140.4; BICdiff = 100.3), a finding supported by the change in BIC. However, the exploratory model technically fit the data best according to all three indices. The difference in fit was statistically significant relative to the correlated factors model (χ²diff = 358.0, df = 6, p = .000; AICdiff = 316.0; BICdiff = 175.6) and the bi-factor model (χ²diff = 205.6, df = 15, p = .000; AICdiff = 175.6; BICdiff = 75.28).
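These differences hang together arithmetically. For nested models, the information criteria differences follow from the likelihood-ratio statistic and the parameter difference:

    AIC_diff = chi²_diff - 2 * df_diff
    BIC_diff = chi²_diff - ln(N) * df_diff

For the first comparison above, 152.4 - 2(6) = 140.4, matching the reported AIC difference; the BIC difference is smaller because each of the six extra bi-factor parameters is charged ln(N) rather than 2.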
Based on the results of the exploratory model (Table 29), the CAT effectively recovered the first dimension. The squared correlation with the first dimension's trait estimates was quite high, mean bias was negligible, and the RMSE matched the lowest found in Table 26. However, the CAT performed poorly when judged using the mean squared correlation, due to much weaker relationships with the measure's second and third dimensions. The CAT's relationship with the bi-factor model's general dimension was a bit better (Table 30). Its r² exceeded the simulated means listed in Table 26, but the RMSE more closely aligned with the four-dimension conditions. The specific dimensions were completely unexplained by these estimates. Surprisingly, the CAT most aligned with the dimensions of the correlated factors model (see Table 31). Mean CAT r² was high, its dimensions exhibited the least amount of BIAS by far, and average RMSE was relatively small. Overall, performance was quite a bit worse than implied by the CAT r² from Study 1 (r² = .977). The summed score did not share as much variance with the true trait estimates as the CAT scores, but the CFA score performed consistently worse. Of the three short forms, the IRT-based form was closest to the CAT. Estimates of reliability were close (CAT reliability = .851 and CTT α = .838), demonstrating a difference smaller than any found in Table 26. Upon inspection of CAT item selection frequencies, the four items that made up the first dimension were among the top six selected items. The top two, both associated with this dimension, were the only two items administered to the entire sample.

CAT and short form correlations with the exploratory model's dimensions looked a bit different from those with the dimensions of the correlated factors model, and this could be due to the former's pattern of loadings. The exploratory item factor analysis indicated one item loaded on the first and second dimensions, suggested another item loaded on the first and third dimensions, and associated a dimension-three item with dimension one instead. The restructuring of the first and third dimensions, and allowing items to cross-load, could explain the observed differences. Use of the bi-factor model's general dimension led to r² estimates much closer to those of the exploratory model's first dimension, supporting the idea of a dominant factor for the measure. Additionally, the four items associated with the first dimension exhibited the highest loadings on the general dimension. The CAT selected these items most frequently, so it was no surprise to see that the estimated CAT r² value based on the bi-factor model's trait estimates was higher than any CAT-SIM r² in Table 26. However, mean CAT r² associated with the exploratory model placed TP at the lower end of the two-dimension condition, or perhaps more appropriately, in a space nestled between two and four dimensions.
Need for Cognition

According to the chi-square test, the bi-factor model was a better fit to the data (χ²diff = 116.0, df = 15, p = .000; AICdiff = 86.00; BICdiff = 0.035), but the BIC suggested there was no real difference. Consequently, the two models were considered equivalent for purposes of evaluation. Unlike TP, the exploratory model for NC exhibited worse fit than the confirmatory models and used more degrees of freedom. The bi-factor CAT r² estimate for the general dimension was very high, bias was very small, and the RMSE was the smallest observed with any of the measures (see Table 32). Similarly, the CAT estimates were highly related to both the first and second dimensions from the correlated factors model (Table 33), though these relationships were not as strong as those observed with the bi-factor model's general dimension. The mean r² actually put performance of the CAT outside of the bounds expected given Table 27. In addition, mean BIAS and RMSE were small. The squared correlation with the exploratory model's first dimension was very close to the CAT correlation with the bi-factor model's general dimension. However, the second dimension was poorly explained by the CAT estimates (Table 34), contrasting with the correlated factors model. This occurred despite agreement between the exploratory and correlated factors models with regard to the dimension correlation (r ≈ .770).

As with TP, all of the short form r² values were consistently smaller than their CAT analogues, with the greatest mean difference tied to the exploratory estimates. Though they did not perform as well as the CAT estimates, the CTT scores did a good job here, maintaining r² values greater than .800 with the trait estimates from the two confirmatory models. However, the IRT scores performed better, and in the exploratory case, they nearly matched the CAT on average. The CFA technique generally underperformed, but its scores were more in line with the exploratory dimensions than the CTT scores. The difference in reliabilities was .032, larger than TP's, but smaller than MP's, and within the range for 24 items (Table 27). Need for cognition performed much better than implied by the simulation studies as well, producing RMSE estimates within or close to the bounds of those listed in Table 27. Even the CTT r² values were much larger than those found in the simulations, suggesting the summed score may be sufficient, especially for the general dimension or the correlated factors model's first dimension. Of course, if a short form were desired with a measure like this, the IRT technique would be the wisest source. Of the three measures, need for cognition performed the best overall as a CAT in the face of multidimensionality. Compared to the CAT r² from Study 1 (r² = .951), the CAT and IRT methods did particularly well here when related to the bi-factor model's general dimension, with r² estimates greater than .90. Unfortunately, the specific dimensions of the bi-factor model were not well explained, though there was a small relationship, particularly for the first specific dimension. The average r² for this first dimension was .158, while the r² for the second was .074. The CAT did a negligibly better job capturing these, followed by the CTT estimates, averaged across both dimensions. Finally, CAT BIAS and RMSE were high for both specific dimensions. The r² values produced using the estimates from the correlated factors model, while not quite as high as with the bi-factor model's general dimension, were good (especially for CAT and IRT), and well within acceptable ranges for most circumstances.

Mature Personality

The bi-factor model exhibited statistically significantly improved fit over the multiple correlated factors model (χ²diff = 1172, df = 11, p = .000; AICdiff = 1150; BICdiff = 1080). The exploratory model had the worst fit, and had fewer remaining degrees of freedom. As a result, the trait estimates from the bi-factor model were held as the most accurate of the three sources, followed by the correlated factors model. According to Tables 35 - 37, squared correlations were low, even for the confirmatory models. Relationships with the CTT-based summed scores were generally lower than the comparable CAT correlations, but on average these differences were not as striking as those observed with TP and NC. In fact, the CTT scores had a slightly stronger relationship with the bi-factor model's general dimension than all other scores. The CFA and IRT scores did a slightly better job of approximating the exploratory and correlated factors models' trait estimates than the CTT scores, but again, these differences were negligible considering the differences observed with the other measures. RMSE estimates were larger than expected when compared to the closest CAT runs from Study 2. For example, the smallest RMSE, associated with the third factor in the multiple correlated factors model, was barely smaller than the largest RMSE found in Table 28. BIAS was large overall as well. The CAT-CTT reliability difference was .041, favoring the CAT. This was consistent with smaller differences being linked to highly correlated dimensions.

Results did not suggest a dominant factor really existed. The performance of the bi-factor model's general dimension was higher than the mean performance found for the exploratory and common factors dimensions, but all of these r² estimates were associated with moderate to substantial amounts of error, as evidenced by large RMSE values. The general dimension CAT r² was actually smaller than that of the first three dimensions from the correlated factors model, which had CAT r² estimates of consistent strength. These results suggest removal of the items associated with the fourth dimension could actually improve overall accuracy (though a considerable amount of error would remain). Unsurprisingly, only one of the three items from that dimension was used in the CAT to any meaningful degree, which may explain the weak relationship. As with the other measures, the CAT did a poorer job of reproducing the exploratory model's estimates, and none of the reduced-form estimates exhibited even small relationships with the specific dimensions of the bi-factor model. Holding the tested bi-factor model as the most accurate representation of the data, the measure actually performed (slightly) better than the simulation associated with the best-performing set of conditions in Table 28 (CAT-SIM r² = .647). If judged using the correlated factors models, the measure would fall somewhere between the two dimension correlation conditions, leaning closer to runs associated with a correlation of .70.

General Discussion

Limitations and Future Study

Conflicting results from Studies 2 and 3 may be due to assumptions underlying the simulated true thetas. The simulated thetas had constant correlations with each other of either .70 or .35. MP exhibited rather wide variation in dimension correlation, while TP and NC had between-dimension correlations greater than .70. For TP and NC, their highly correlated dimensions could explain why their CATs performed as well as they did, relative to expectations motivated by the simulation study. Finally, the bi-factor model was evaluated in Study 3, but data conforming to the assumptions of the bi-factor model were not simulated in Study 2. It is possible Study 2 runs may have looked more congruent with Study 3 results if this model had been investigated in the simulation.
Item wording could not be evaluated in the simulations, as it is dependent on human interpretation, but it could be tested in the creation phase of a measure. This goes beyond whether an item is negatively or positively worded, though that is certainly part of it. Simple changes to one word in a statement, such as removing a modifier, could significantly alter how a participant perceives it. Similarly, the way response options are worded can manipulate how a participant interprets the statement, thereby influencing their response behavior. Use of a different response scale could (notably) alter the observed distribution of a measure's items. These shifts beget larger changes, which could ultimately affect the accuracy and usefulness of a CAT version of the measure.

Perhaps the most significant limitation stems from not developing a measure from scratch based on the results of the simulation. Since a large goal of this study was to identify factors that improve the success of a polytomous item measure as an adaptive test, it would have been beneficial to see if designing a measure as best as possible to capture a promising set of conditions actually produced a CAT-friendly measure. While much was gained from using pre-designed measures with pre-collected data, there was no way to control the measure properties or sample characteristics. Three measures were used here to illustrate how certain factors impact CAT, but in many ways these measures were similar to each other. For example, a measure with a two-category response scale would have provided a useful check of an unrepresented condition from the simulation study. Also, it may have been helpful to evaluate a measure with known multidimensionality by design, particularly one where correlations between dimensions were relatively low. While the simulation study had a low-correlation condition, there was no real data analogue.

The Issue of Accuracy. Evaluation of CAT success rested upon a stable indicator of accuracy. Squared correlations offered a straightforward means of assessment, as they have known bounds and are familiar coefficients. However, squared correlations, like their origin statistic, only provide information about the consistency of participant scores. They do not speak to actual changes in the numeric estimate. For example, a CAT estimate could exhibit a very strong relationship with a true estimate, but every participant's CAT estimate could be shifted towards the higher end of the trait distribution by .5. The result is a CAT that overestimates the trait level of the sample. BIAS was one way to check this. BIAS, as estimated in Study 3, represented the raw, observed mean difference between two sets of trait estimates. This supplied a general picture of how well the CAT (or short form) literally reproduced the true or estimated trait level. RMSE, by extension, captures both this mean shift and the variability around it.
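A brief simulation makes the distinction concrete: a uniformly shifted score is perfectly correlated with the truth yet biased by exactly the shift, something r² alone can never detect. The sample size and the .5 shift below are illustrative values only.

    theta_true <- rnorm(1000)
    theta_cat  <- theta_true + 0.5          # every estimate shifted up by .5
    cor(theta_cat, theta_true)^2            # r2 = 1: consistency is perfect
    mean(abs(theta_cat - theta_true))       # BIAS = .5
    sqrt(mean((theta_cat - theta_true)^2))  # RMSE = .5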
Study 2 provided insight into how some of these indices, each a reasonable measure of accuracy, address it in various ways. The difference in mean CAT-SIM z between the two- and four-dimension conditions was smaller than the mean difference between the r = .35 and r = .70 dimension correlation conditions. However, the mean RMSE difference was significantly larger for the dimension conditions. Skew and kurtosis did not have any notable impact on the squared correlations in the simulation study, and they had only a small effect on RMSE. However, it is very likely they would impact BIAS, since these statistics, via the difficulty and discrimination parameters, largely control the eventual range of trait estimates. A negatively skewed set of item responses, for instance, would likely overestimate most scores for a trait with a truly normal distribution. Study 3 results also suggested item skew could impact the performance of CAT relative to the other methods. There was a good amount of variation in r² for Traditional Punishments and Need for Cognition, but not for Mature Personality. TP and NC items exhibited notable skew, whereas MP items had little skew in general. This could indicate CAT is beneficial when there is at least moderate item skew.

There are seemingly no definitive ways to judge CAT accuracy. For BIAS and RMSE, this is partly because they do not have upper bounds. Closer to zero is better for both, but how close to zero is required to feel comfortable with the estimate may depend largely on the setting. Likewise, values closer to the upper bound of r² = 1 for a squared correlation are better, but how close to 1 likely depends on the needs of the interested party. In general, it is probably best to aim for squared correlations greater than r² = .80. Unfortunately, concerns about accuracy cannot be separated from an incorrectly specified model for the CAT. The correct model is crucial in order to truly trust the estimated full-scale (and subsequently the CAT) trait estimates. Study 3 demonstrated certain scenarios offered some protection, but even then, improved performance could be obtained if a more accurate model was used for the adaptive test.

Time. Many studies investigating computerized adaptive tests discuss the time saved by administering the CAT version of a measure (and/or a short form) instead of the full version. While this is a valuable topic to explore, this dissertation did not address time saved for several practical reasons. Data for the real data measures were mostly collected via paper-and-pencil, so accurate estimates of response time could not be obtained for them. The traditional punishments measure was computer administered, but response times were not provided by the database. A guess could be made as to the time commitment required by each item for these measures, a guess that would assume equivalent time for all items, but depending on the characteristics of the sample, this could paint a radically inaccurate picture of true time saved. Since the item number conditions were linear for the fixed short forms and the fixed CATs, the assumed time equivalency, when assessed relative to reliability, would produce a plot nearly identical to Figures 3, 7, and 11. This is because time would be perfectly correlated with item number, introducing redundancy. One could already discern from the figures that, assuming each item required about 10 seconds, three items (30 seconds) would produce a reliability of X, while six items (60 seconds) would produce a reliability of Y. Item selection is not fixed for CATs like it is for short forms, so two people who would respond to an identical set of items at the same speed could exhibit different response times given the jumbled presentation in a live CAT; however, this cannot be evaluated with simulated CATs.
Time saved is arguably best studied with live (i.e. real) computer administration, where response time can be actively monitored and recorded for later use.

Conclusion

Multidimensional IRT models (MIRTs) have gained a lot of popularity of late, and there have been a number of studies applying them to polytomous item measures (Rijmen, 2010; Gibbons & Hedeker, 1992; Chen, West, & Sousa, 2006; Leue & Beauducel, 2011). Though not as extensive, the use of these models for adaptive testing, referred to as multidimensional adaptive testing (MAT), has been promising. For instance, Weiss & Gibbons (2007) found an adequately fitting bi-factor model for a personality measure with 616 items performed well as the basis for MAT. After live testing, they discovered correlations with the full-scale scores were well over .90, with an approximate length reduction of 80% and a decrease in testing time of about 93 minutes on average. The bi-factor model has proven a popular MIRT model, in part because of its simplified structure – the idea of a general construct that is uncorrelated with a set of specific dimensions is appealing, and is easier to conceptualize as an adaptive test. Frey & Seitz (2011) used a different MIRT model to simulate MAT with real data from the Programme for International Student Assessment (PISA) to investigate the benefits of MAT over CAT and conventional testing. They noted an increase in measurement efficiency and a rather dramatic reduction in the number of needed items, but cautioned against their approach to MAT in high-stakes settings given restrictions inherent to the PISA design.

MAT is still relatively new, but it has proven particularly useful for measures with a large number of dimensions (and items). The results of Study 3 support this. NC managed to account for a large amount of variance in the factors from its multidimensional models. This could be attributed to its two highly correlated factors, an attribution that did not generalize to TP, which also had highly correlated factors. Generally, as the number of dimensions increased, the performance of the unidimensional CAT worsened. In the case of NC, sticking with a GRM-based short form could work almost as well as the CAT. Determining whether to use the adaptive test or not may come down to whether the small gain in score consistency is worth the time, money, and effort potentially required to design and implement the adaptive test. Both TP and MP would benefit from one of the several current MIRT models for adaptive testing, since their dimensionality issues seem to present problems the unidimensional IRT models cannot quite overcome; however, it is possible a MIRT-based short form could work just as well as the MAT with these measures. A presumed unidimensional measure with underlying multidimensionality will likely perform well as a unidimensional CAT when there is evidence of a strong general dimension, and when there is no interest in any of the specific dimensions (of the bi-factor model in particular) once that general dimension is explained. The CATs seemed to perform better than the CTT short form in the presence of at least moderate skew in Study 2, and the CFA and CTT between-method performance differences in Study 3 for TP and NC could be linked to their more skewed items, so CAT may be preferable to a short form when there are a lot of skewed items in a measure (though an IRT short form may also suffice).
Frey and Seitz (2011) used a different MIRT model to simulate MAT with real data from the Programme for International Student Assessment (PISA) to investigate the benefits of MAT over CAT and conventional testing. They noted an increase in measurement efficiency and a rather dramatic reduction in the number of needed items, but cautioned against their approach to MAT in high-stakes settings given restrictions inherent to the PISA design.

MAT is still relatively new, but it has proven particularly useful for measures with a large number of dimensions (and items). The results of Study 3 support this. NC managed to account for a large amount of variance in the factors from its multidimensional models. This could be attributed to its two highly correlated factors, an attribution that did not generalize to TP, which also had highly correlated factors. Generally, as the number of dimensions increased, the performance of the unidimensional CAT worsened. In the case of NC, sticking with a GRM-based short form could work almost as well as the CAT. Determining whether to use the adaptive test may come down to whether the small gain in score consistency is worth the time, money, and effort potentially required to design and implement the adaptive test. Both TP and MP would benefit from one of the several current MIRT models for adaptive testing, since their dimensionality issues seem to present problems the unidimensional IRT models cannot quite overcome; however, it is possible a MIRT-based short form could work just as well as the MAT with these measures.

A presumed unidimensional measure with underlying multidimensionality will likely perform well as a unidimensional CAT when there is evidence of a strong general dimension, and when there is no interest in any of the specific dimensions (of the bi-factor model in particular) once that general dimension is explained. The CATs seemed to perform better than the CTT short form in the presence of at least moderate skew in Study 2, and the CFA and CTT between-method performance differences in Study 3 for TP and NC could be linked to their more skewed items, so CAT may be preferable to a short form when a measure contains many skewed items.⁵

⁵ Though an IRT short form may also suffice.

A unidimensional CAT may also be fine if the goal is to capture only the dominant dimension(s); however, in that case a short form may be better, especially when the measure has fewer than 12 items. MAT is recommended if there seem to be several moderately correlated dimensions, or if the dimensions exhibit a range of correlations. Likewise, if one were interested in specific dimensions despite the presence of a strong general dimension, MAT would be necessary.

Two of the three measures used in the present study, Traditional Punishments and Mature Personality, could be considered dimensions of a larger construct, as they were technically subscales. When evaluated in the context of their parent measures, most multidimensional IRT models used for MAT would disregard the dimensionality of these subscales, focusing instead on the dimensionality present at the higher level. In other words, they would treat each subscale as a unidimensional element. This study suggests a MAT that ignores the subscale structures could produce substantial errors. Future work could investigate how well MAT performs when subscale dimensionality is and is not accounted for by the MAT algorithm.

Despite the many issues with CAT accuracy, and the questions and concerns left for future studies, these findings still provide helpful information researchers can use to judge how their measures might perform as computerized adaptive tests. Notably, aggregating the results across all of the studies will allow researchers to put together a set of profiles they can use to match their own measures to those used here. In conjunction with the simulation results, which can be considered worst-case scenarios (until further investigation can be done with other measures), researchers can form a guide to help them decide: (1) whether a unidimensional CAT for their measure will suffice; (2) what they can do to their measures to increase support for CAT; (3) whether it would be best to go with a MAT approach; and (4) whether some kind of short form would best suit their needs.

References

Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL: Waveland Press.

American Institutes for Research (2012). Project Talent, Base Year Data [ICPSR33341-v1]. doi:http://dx.doi.org/10.3886/ICPSR33341.v1

Beauducel, A., & Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling: A Multidisciplinary Journal, 13(2), 186-203. doi:http://dx.doi.org/10.1207/s15328007sem1302_2

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-472). Reading, MA: Addison-Wesley.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459. doi:http://dx.doi.org/10.1007/BF02293801

Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21(2), 230-258. doi:http://dx.doi.org/10.1177/0049124192021002005
Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42(1), 116-131. doi:http://dx.doi.org/10.1037/0022-3514.42.1.116

Cacioppo, J. T., Petty, R. E., & Kao, C. F. (1984). The efficient assessment of need for cognition. Journal of Personality Assessment, 48(3), 306-307. doi:http://dx.doi.org/10.1207/s15327752jpa4803_13

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29. Retrieved from http://www.jstatsoft.org/v48/i06/

Chen, F., Curran, P. J., Bollen, K. A., Kirby, J., & Paxton, P. (2008). An empirical evaluation of the use of fixed cutoff points in RMSEA test statistic in structural equation models. Sociological Methods & Research, 36(4), 462-494.

Chen, F. F., West, S. G., & Sousa, K. H. (2006). A comparison of bifactor and second-order models of quality of life. Multivariate Behavioral Research, 41(2), 189-225. doi:http://dx.doi.org/10.1207/s15327906mbr4102_5

Chen, S., Hou, L., & Dodd, B. G. (1998). A comparison of maximum likelihood estimation and expected a posteriori estimation in CAT using the partial credit model. Educational and Psychological Measurement, 58, 569-595. doi:http://dx.doi.org/10.1177/0013164498058004002

Choi, S. W. (2009). Firestar: Computerized adaptive testing simulation program for polytomous item response theory models. Applied Psychological Measurement, 33(8), 644-645. doi:http://dx.doi.org/10.1177/0146621608329892

Choi, S. W., & Swartz, R. J. (2009). Comparison of CAT item selection criteria for polytomous items. Applied Psychological Measurement, 33(6), 1-22. doi:http://dx.doi.org/10.1177/0146621608327801

Cohen, A. R., Stotland, E., & Wolfe, D. M. (1955). An experimental investigation of need for cognition. Journal of Abnormal and Social Psychology, 51(2), 291-294. doi:http://dx.doi.org/10.1037/h0042761

Cohen, A. S., Kim, S., & Baker, F. B. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17(4), 335-350. doi:http://dx.doi.org/10.1177/014662169301700402

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104. doi:http://dx.doi.org/10.1037/0021-9010.78.1.98

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334. doi:http://dx.doi.org/10.1007/BF02310555

DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2(3), 292-307. doi:http://dx.doi.org/10.1037/1082-989X.2.3.292

Ditto, P., Graham, J., Haidt, J., Iyer, R., Koleva, S., Motyl, M., & Wojcik, S. (n.d.). [Morality and ideology questionnaires]. Unpublished raw data.

Dodd, B. G., Koch, W. R., & De Ayala, R. J. (1989). Operational characteristics of adaptive testing procedures using the graded response model. Applied Psychological Measurement, 13(2), 129-143. doi:http://dx.doi.org/10.1177/014662168901300202

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates, Inc., Publishers.
Fisher, G. G., Rodgers, W. L., McArdle, J. J., & Kadlec, K. M. (2010). Cognition and aging in the USA: Study methods and sample selectivity. Unpublished manuscript. Retrieved from http://kiptron.usc.edu/publications/merit_pubs.php

Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507-521. doi:http://dx.doi.org/10.2307/2331838

Flanagan, J. C., Dailey, J. T., Shaycoft, M. F., Gorham, W. A., Orr, D. B., & Goldberg, I. (1960). Project TALENT: Monograph series. Pittsburgh, PA: University of Pittsburgh.

Fliege, H., Becker, J., Walter, O. B., Bjorner, J. B., Klapp, B. F., & Rose, M. (2005). Development of a computer-adaptive test for depression (D-CAT). Quality of Life Research, 14(10), 2277-2291. doi:http://dx.doi.org/10.1007/s11136-005-6651-9

Flora, D. B., & Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9(4), 466-491. doi:http://dx.doi.org/10.1037/1082-989X.9.4.466

Forbey, J. D., & Ben-Porath, Y. S. (2007). Computerized adaptive personality testing: A review and illustration with the MMPI-2 computerized adaptive version. Psychological Assessment, 19(1), 14-24. doi:http://dx.doi.org/10.1037/1040-3590.19.1.14

Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item-response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78(2), 350-365. doi:http://dx.doi.org/10.1037/0022-3514.78.2.350

Frey, A., & Seitz, N. (2011). Hypothetical use of multidimensional adaptive testing for the assessment of student achievement in the Programme for International Student Assessment. Educational and Psychological Measurement, 71(3), 503-522. doi:http://dx.doi.org/10.1177/0013164410381521

Gibbons, R. D., Bock, R. D., Hedeker, D., Weiss, D. J., Segawa, E., Bhaumik, D. K., … Stover, A. (2007). Full-information item bifactor analysis of graded response data. Applied Psychological Measurement, 31, 4-19. doi:http://dx.doi.org/10.1177/0146621606289485

Gibbons, R. D., & Hedeker, D. (1992). Full-information item bi-factor analysis. Psychometrika, 57(3), 423-436. doi:http://dx.doi.org/10.1007/BF02295430

Gibbons, R. D., Weiss, D. J., Kupfer, D. J., Frank, E., & Fagiolini, A. (2008). Using computerized adaptive testing to reduce the burden of mental health assessment. Psychiatric Services, 59(4), 361-368. doi:http://dx.doi.org/10.1176/appi.ps.59.4.361

Gnambs, T., & Batinic, B. (2011). Polytomous adaptive classification testing: Effects of item pool size, test termination criterion, and number of cutscores. Educational and Psychological Measurement, 71(6), 1006-1022. doi:http://dx.doi.org/10.1177/0013164410393956

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38-47. doi:http://dx.doi.org/10.1111/j.1745-3992.1993.tb00543.x

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Press.

Han, K. T. (2012). Fixing the c parameter in the three-parameter logistic model. Practical Assessment, Research & Evaluation, 17(1). Retrieved from http://pareonline.net/pdf/v17n1.pdf

Holzinger, K., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2, 41-54. doi:http://dx.doi.org/10.1007/BF02287965
Hulin, C. L., Lissak, R. I., & Drasgow, F. (1982). Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6(3), 249-260. doi:http://dx.doi.org/10.1177/014662168200600301

Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11, 179-188. doi:http://dx.doi.org/10.1007/s12564-009-9062-8

Leue, A., & Beauducel, A. (2011). The PANAS structure revisited: On the validity of the bifactor model in community and forensic samples. Psychological Assessment, 23(1), 215-225. doi:http://dx.doi.org/10.1037/a0021400

Liu, J. (2007). Comparing multi-dimensional and uni-dimensional computer adaptive strategies in psychological and health assessment (Doctoral dissertation). Retrieved from PITTCat Electronic Theses and Dissertations (ETD). (ETD No. 06132007-134603)

Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted responses. Psychometrika, 39(2), 247-264. doi:http://dx.doi.org/10.1007/BF02291471

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Marsh, H. W., Hau, K., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling: A Multidisciplinary Journal, 11(3), 320-341. doi:http://dx.doi.org/10.1207/s15328007sem1103_2

Maydeu-Olivares, A. (2005). Further empirical results on parametric versus non-parametric IRT modeling of Likert-type personality data. Multivariate Behavioral Research, 40(2), 261-279. doi:http://dx.doi.org/10.1207/s15327906mbr4002_5

McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.

Muthén, L. K., & Muthén, B. O. (1998-2011). Mplus user's guide (6th ed.). Los Angeles, CA: Muthén & Muthén.

Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3(1), 1-18. doi:http://dx.doi.org/10.1016/0022-2496(66)90002-2

Owen, L. M. (2010). A comparison of standard and alternative measurement models for dealing with skewed data with applications to longitudinal data on the child psychopathology scale (Master's thesis). Retrieved from ProQuest Digital Dissertations and Theses. (AAT No. 1484236)

Penfield, R. D. (2006). Applying Bayesian item selection approaches to adaptive tests using polytomous items. Applied Measurement in Education, 19(1), 1-20. doi:http://dx.doi.org/10.1207/s15324818ame1901_1

Petway, K. T., II. (2010). Applying adaptive methods and classical scale reduction techniques to data from the Big Five Inventory (Master's thesis). Retrieved from ProQuest Digital Dissertations and Theses. (AAT No. 1479936)

Rammstedt, B., & John, O. P. (2007). Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality, 41, 203-212. doi:http://dx.doi.org/10.1016/j.jrp.2006.02.001

Rasch, G. (1960). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Oxford, England: Nielsen & Lydiche.
Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19(1), 49-57. doi:http://dx.doi.org/10.1111/j.2044-8317.1966.tb00354.x

Reise, S. P., & Henson, J. M. (2000). Computerization and adaptive administration of the NEO PI-R. Assessment, 7(4), 347-364. doi:http://dx.doi.org/10.1177/107319110000700404

Reise, S. P., & Waller, N. G. (1990). Fitting the two-parameter model to personality data. Applied Psychological Measurement, 14(1), 45-58. doi:http://dx.doi.org/10.1177/014662169001400105

Revelle, W. (2012). psych: Procedures for psychological, psychometric, and personality research [R package]. Retrieved from http://cran.r-project.org/web/packages/psych/

Rijmen, F. (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47(3), 361-372. doi:http://dx.doi.org/10.1111/j.1745-3984.2010.00118.x

Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software, 17(5), 1-25. Retrieved from http://www.jstatsoft.org/v17/a5/paper/

Robins, R. W., Hendin, H. M., & Trzesniewski, K. H. (2001). Measuring global self-esteem: Construct validation of a single-item measure and the Rosenberg Self-Esteem scale. Personality and Social Psychology Bulletin, 27, 151-161. doi:http://dx.doi.org/10.1177/0146167201272002

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1-36. Retrieved from http://www.jstatsoft.org/v48/i02/

Salthouse, T. A., & Siedlecki, K. L. (2007). An individual difference analysis of false recognition. American Journal of Psychology, 120(3), 429-458. doi:http://dx.doi.org/10.2307/20445413

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 2).

Saucier, G. (1994). Mini-markers: A brief version of Goldberg's unipolar Big-Five markers. Journal of Personality Assessment, 63(3), 506-516. doi:http://dx.doi.org/10.1207/s15327752jpa6303_8

Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8(4), 350-353. doi:http://dx.doi.org/10.1037/1040-3590.8.4.350

Simms, L. J., & Clark, L. A. (2005). Validation of a computerized adaptive version of the Schedule for Nonadaptive and Adaptive Personality (SNAP). Psychological Assessment, 17(1), 28-43. doi:http://dx.doi.org/10.1037/1040-3590.17.1.28

Simms, L. J., Goldberg, L. R., Roberts, J. E., Watson, D., Welte, J., & Rotterman, J. H. (2011). Computerized adaptive assessment of personality disorder: Introducing the CAT-PD project. Journal of Personality Assessment, 93(4), 380-389. doi:http://dx.doi.org/10.1080/00223891.2011.577475

Smits, N., Cuijpers, P., & van Straten, A. (2011). Applying computerized adaptive testing to the CES-D scale: A simulation study. Psychiatry Research, 188, 147-155. doi:http://dx.doi.org/10.1016/j.psychres.2010.12.001

Sokal, R. R., & Rohlf, F. J. (1969). Biometry: The principles and practice of statistics in biological research. San Francisco, CA: W. H. Freeman.

Stuart, A., & Ord, K. (2010). Kendall's advanced theory of statistics, distribution theory (6th ed.). New York, NY: Wiley-Blackwell.

Thurstone, L. L. (1935). The vectors of mind. Chicago, IL: University of Chicago Press.
van der Linden, W. J. (1998). Bayesian item selection criteria for adaptive testing. Psychometrika, 63(2), 201-216. doi:http://dx.doi.org/10.1007/BF02294775

van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 1-25). Boston: Kluwer Academic.

Veerkamp, W. J. J., & Berger, M. P. F. (1997). Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics, 22(2), 203-226. doi:http://dx.doi.org/10.2307/1165378

Waller, N. G., & Reise, S. P. (1989). Computerized adaptive personality assessment: An illustration with the Absorption scale. Journal of Personality and Social Psychology, 57(6), 1051-1058. doi:http://dx.doi.org/10.1037/0022-3514.57.6.1051

Weiss, D. J., & Gibbons, R. D. (2007). Computerized adaptive testing with the bifactor model. In D. J. Weiss (Ed.), Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing. Retrieved from www.psych.umn.edu/psylabs/CATCentral/

Wicherts, J. M., & Dolan, C. V. (2004). A cautionary note on the use of information fit indices in covariance structure modeling with means. Structural Equation Modeling, 11, 45-50. doi:http://dx.doi.org/10.1207/S15328007SEM1101_3

Willmott, C. J., Ackleson, S. G., Davis, R. E., Feddema, J. J., Klink, K. M., Legates, D. R., … Rowe, C. M. (1985). Statistics for the evaluation and comparison of models. Journal of Geophysical Research, 90(C5), 8995-9005. doi:http://dx.doi.org/10.1029/JC090iC05p08995

Yu, C. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes (Doctoral dissertation). University of California, Los Angeles. Retrieved from statmodel2.com/download/Yudissertation.pdf

Zickar, M. J. (1998). Modeling item-level data with item response theory. Current Directions in Psychological Science, 7(4), 104-109. doi:http://dx.doi.org/10.1111/1467-8721.ep10774739

Table 1
Item Details for Traditional Punishments Subscale

Item  Mean  SD
1. Long jail sentences are useful because they keep offenders from harming future victims  4.598  1.835
2. An eye for an eye is the correct philosophy behind punishing offenders  2.614  1.890
3. Vigilante justice (such as the people of a town taking action on their own) might sometimes be appropriate if the courts let off an obviously guilty criminal on a technicality  2.950  2.043
4. Exiling an offender from his or her community or nation would be an appropriate punishment for shame violations  4.101  2.096
5. Shaming techniques, like making people wear a sign in public, would be appropriate punishments for some crimes (instead of putting the offender in jail)  3.863  2.082
6. It undermines the integrity of society when any crime goes unpunished  4.872  1.779
7. We should look to the punishments used in the early history of our country for guidance in how to respond to crime today  2.230  1.631
8. Public flogging (i.e. whipping) would be an appropriate punishment for some crimes (instead of putting the offender in jail)  2.432  1.964
9. Amputating an offender's body part (such as a finger or toe) would be an appropriate punishment for some crimes (instead of putting the offender in jail)  1.742  1.484
10. Crime is a stain upon society, which must be cleansed by any means  2.714  1.698
11. When a criminal behaves in a sub-human way, he should be treated in a sub-human way  2.591  1.985
12. Victims' voices should be heard as part of the justice process  5.595  1.621

Table 2
Item Details for Need for Cognition Scale

Item  Mean  SD
1. I would prefer complex to simple problems  3.082  1.277
2. I like to have the responsibility of handling a situation that requires a lot of thinking  3.497  1.213
3. *Thinking is not my idea of fun  3.903  1.211
4. *I would rather do something that requires little thought than something that is sure to challenge my thinking abilities  3.742  1.230
5. *I try to anticipate and avoid situations where there is a likely chance I will have to think in depth about something  3.932  1.200
6. I find satisfaction in deliberating hard and for long hours  2.864  1.285
7. *I only think as hard as I have to  3.415  1.306
8. *I prefer to think about small, daily projects to long-term ones  3.217  1.256
9. *I like tasks that require little thought once I've learned them  3.206  1.306
10. The idea of relying on thought to make my way to the top appeals to me  3.521  1.225
11. I really enjoy a task that involves coming up with new solutions to problems  3.936  1.108
12. *Learning new ways to think doesn't excite me very much  3.938  1.160
13. I prefer my life to be filled with puzzles that I must solve  2.863  1.241
14. The notion of thinking abstractly is appealing to me  3.235  1.287
15. I would prefer a task that is intellectual, difficult, and important to one that is somewhat important but does not require much thought  3.393  1.194
16. *I feel relief rather than satisfaction after completing a task that required a lot of mental effort  3.359  1.288
17. *It's enough for me that something gets the job done; I don't care how or why it works  3.783  1.235
18. I usually end up deliberating about issues even when they do not affect me personally  3.243  1.241
Note: * indicates a negatively worded item that required score reversal. Estimates of the mean and standard deviation are for the reverse-scored items.
Table 3
Item Details for Mature Personality Subscale

Item  Mean  SD
1. I make good use of all my time  2.837  1.004
2. I work fast and get a lot done  2.766  1.003
3. When I say I'll do something I get it done  2.311  .991
4. It bothers me to leave a task half done  2.193  1.106
5. I can turn out a lot more work than average  2.845  .997
6. I am hard-working  2.495  .986
7. People consider me an efficient worker  2.547  .971
8. I do my job, even when I don't like it  2.457  1.007
9. I am productive  2.717  .930
10. As soon as I finish one project or assignment, I always have something else I want to begin  2.798  1.195
11. I think that if something is worth starting it's worth finishing  1.990  .949
12. I do things the best I know how, even if no one checks up on me  2.047  .950
13. I lose interest in most projects before I get them done  2.284  1.100
14. People seem to think they can count on me  2.262  .981
15. People consider me persistent  2.918  1.036
16. I am dependable  2.081  .877
17. People have criticized me for leaving things undone  2.320  1.164
18. I am conscientious  2.530  .997
19. I am persistent  2.850  1.038
20. I am reliable  2.153  .873
21. People consider me determined  2.479  .905

Table 4
Item Skew, Kurtosis, Item-Total r, Unidimensional and Multidimensional Factor Loadings, and Information for Traditional Punishments Subscale

Factor  Item  Skew  Kurtosis  r-Total  λ1F  λMD  Info (I)
1  2  .957  -.326  .759  .822  .834  7.334
1  3  .652  -.982  .643  .652  .662  3.608
1  7  1.314  .769  .673  .703  .714  4.642
1  11  1.054  -.246  .754  .828  .840  6.901
2  4  -.131  -1.354  .526  .488  .531  2.166
2  5  -.026  -1.362  .522  .517  .568  2.256
2  8  1.142  -.107  .690  .788  .863  3.639
2  9  2.187  3.880  .632  .775  .839  4.831
3  1  -.439  -.865  .413  .355  .429  1.530
3  6  -.601  -.641  .491  .431  .555  1.945
3  10  .788  -.353  .651  .639  .845  3.931
3  12  -1.234  .838  .333  .233  .298  .690

Table 5
Item Skew, Kurtosis, Item-Total r, Unidimensional and Multidimensional Factor Loadings, and Information for Need for Cognition Scale

Factor  Item  Skew  Kurtosis  r-Total  λ1F  λMD  Info (I)
1  1  -.317  -1.096  .611  .632  .665  3.709
1  2  -.663  -.581  .717  .762  .803  5.744
1  6  -.007  -1.199  .618  .627  .662  3.708
1  10  -.645  -.542  .629  .660  .696  4.001
1  11  -1.074  .447  .653  .707  .748  4.616
1  13  -.015  -1.076  .624  .628  .662  3.772
1  14  -.340  -1.000  .641  .656  .692  3.761
1  15  -.461  -.724  .562  .578  .610  3.454
1  18  -.415  -.935  .364  .323  .347  1.396
2  3  -.879  -.391  .608  .656  .695  3.805
2  4  -.676  -.712  .643  .691  .733  4.286
2  5  -.940  -.218  .644  .702  .746  4.302
2  7  -.313  -1.190  .612  .629  .669  3.422
2  8  -.161  -1.117  .536  .529  .566  2.584
2  9  -.108  -1.254  .605  .618  .658  3.289
2  12  -.950  -.092  .642  .687  .731  4.397
2  16  -.369  -1.068  .563  .554  .591  2.880
2  17  -.780  -.526  .503  .494  .525  2.266

Table 6
Item Skew, Kurtosis, Item-Total r, Unidimensional and Multidimensional Factor Loadings, and Information for Mature Personality Subscale

Factor  Item  Skew  Kurtosis  r-Total  λ1F  λMD  Info (I)
1  1  .352  -.055  .472  .448  .490  2.390
1  2  .161  -.232  .516  .503  .549  2.848
1  5  .146  -.278  .525  .522  .569  3.004
1  6  .351  -.170  .617  .639  .696  4.427
1  7  .360  -.097  .640  .667  .727  5.025
1  9  .144  -.085  .592  .595  .645  4.093
1  10  .110  -.851  .557  .540  .587  2.845
2  14  .601  .137  .544  .595  .663  3.795
2  16  .590  .224  .646  .738  .816  5.809
2  18  .176  -.341  .450  .435  .488  2.429
2  20  .514  .181  .641  .724  .801  5.728
3  3  .472  -.201  .563  .556  .619  3.427
3  4  .735  -.164  .510  .507  .563  2.660
3  8  .347  -.301  .521  .519  .580  3.144
3  11  .774  .141  .563  .580  .648  3.508
3  12  .696  .103  .594  .619  .693  4.038
3  13  .677  -.233  .337  .307  .355  1.388
3  17  .683  -.370  .341  .312  .364  1.346
4  15  .034  -.483  .400  .453  .700  1.797
4  19  .038  -.499  .409  .459  .707  1.867
4  21  .250  -.098  .554  .553  .792  3.575

Table 7
Beta (Difficulty) Conditions and Implied Mean Skew and Kurtosis

ID  Description  Skew  Kurtosis  Betas (Difficulties)
1  Low skew / low kurtosis  -.05  -.35  -2.5, -1.3, 1.5, 3.1
2  Low skew / high pos kurtosis  -.05  1.45  -3.8, -2.0, 2.0, 3.8
3  Low skew / high neg kurtosis  .05  -1.75  -1.25, -.42, .42, 1.25
4  Mod neg skew / low kurtosis  -.90  -.35  -4.0, -2.4, -1.0, 0
5  Mod pos skew / low kurtosis  .90  -.40  0, 1.0, 2.4, 4.0
6  High neg skew / high pos kurtosis  -1.25  1.15  -3.0, -2.3, -.9, 3.6
7  High pos skew / high pos kurtosis  1.45  1.45  .5, 2.0, 3.5, 5.0
Notes: Skew and kurtosis estimates were averages determined using ten simulated data sets (per condition ID). Betas were base estimates for 5 response options.

Table 8
Multivariate Effects by Property

Property (IV)  Effect(a)  F(b)  DFnum  DFden  Sig.(c)  ηp²  f²  Power
N Items  0.914  68.54  24  3753  .000  0.305  0.438  1.000
N Resp  0.457  46.29  16  2502  .000  0.228  0.296  1.000
N Dim  1.459  421.5  16  2502  .000  0.729  2.696  1.000
Dim R  1.019  162.4  16  2502  .000  0.509  1.039  1.000
Alpha  0.732  90.30  16  2502  .000  0.366  0.577  1.000
Beta  0.532  15.20  48  7506  .000  0.089  0.097  1.000
Notes: (a) Pillai's trace. (b) Approximate F-statistic. (c) p < .05 indicates statistical significance.

Table 9
Univariate Effects for Number of Items (N Items) Property (DFnum = 3, DFden = 1256)

Dependent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
CAT-SIM z  2.892  118.3  10.23  .000  0.024  0.024  0.956
CAT-EST z  203.3  0.245  831.3  .000  0.665  1.985  1.000
REL DIFF  0.222  0.007  32.49  .000  0.072  0.078  1.000
RMS  0.184  0.008  22.60  .000  0.051  0.054  1.000
RMSE  0.033  0.025  1.282  .279  0.003  0.003  0.043
Mean Items (MI)  2072  34.29  60.43  .000  0.126  0.144  1.000
Mean vs. Median Items  332.7  6.760  49.21  .000  0.105  0.118  1.000
Diff. Kurtosis  3.397  0.026  131.5  .000  0.239  0.314  1.000
Subset (Pairwise) Comparisons (DFnum = 2, DFden = 942)
CAT-SIM z, it = 24:48  0.055  0.099  0.554  .575  0.001  0.001  0.008
MI, it = 24:48  94.98  45.53  2.086  .125  0.004  0.004  0.067
Notes: (a) p < .001 indicates statistical significance. Boldface values indicate effect sizes > .15.

Table 10
Means of Dependent Variables (DVs) by Number of Items in Pool

Dependent Variable  12 items  24 items  36 items  48 items
CAT-SIM z  1.020  1.114  1.133  1.139
CAT-EST z  3.751  2.656  2.151  1.959
REL DIFF  .051  .074  .100  .110
RMS  0.322  0.355  0.369  0.377
RMSE  0.383  0.392  0.400  0.406
Mean Items (MI)  11.52  15.95  16.79  16.97
Mean vs. Median Items  -0.321  0.511  1.711  1.835
Diff. Kurtosis  -0.017  -0.128  -0.201  -0.258

Table 11
Univariate Effects for Number of Responses (N Resp) Property (DFnum = 3, DFden = 1256)

Dependent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
CAT-SIM z  2.892  118.3  16.21  .000  0.025  0.026  0.981
CAT-EST z  203.3  0.245  16.17  .000  0.025  0.026  0.981
REL DIFF  0.222  0.007  20.16  .000  0.031  0.032  0.997
RMS  0.184  0.008  185.3  .000  0.228  0.295  1.000
RMSE  0.033  0.025  2.851  .058  0.005  0.005  0.122
Mean Items (MI)  2072  34.29  115.0  .000  0.155  0.183  1.000
Mean vs. Median Items  332.7  6.760  2.909  .055  0.005  0.005  0.127
Diff. Kurtosis  3.397  0.026  59.51  .000  0.086  0.095  1.000
Subset (Pairwise) Comparisons (DFnum = 1, DFden = 838)
CAT-SIM z, resp = 5, 7  0.000  86.47  0.000  .986  0.000  0.000  0.001
MI, resp = 5, 7  120.4  19.22  6.262  .013  0.007  0.007  0.214
Notes: (a) p < .001 indicates statistical significance. Boldface values indicate effect sizes > .15.
Table 12
Means of DVs by Number of Response Options

Dependent Variable  2  5  7
CAT-SIM z  1.032  1.136  1.137
CAT-EST z  2.812  2.586  2.489
REL DIFF  .063  .091  .097
RMS  0.293  0.386  0.388
RMSE  0.380  0.402  0.404
Mean Items (MI)  18.76  13.96  13.20
Mean vs. Median Items  0.731  1.181  0.890
Diff. Kurtosis  -0.076  -0.174  -0.203

Table 13
Means of DVs by Number of Dimensions

Dependent Variable  1  2  4
CAT-SIM z  1.596  1.065  0.891
CAT-EST z  2.761  2.635  2.557
REL DIFF  .014  .076  .126
RMS  0.393  0.358  0.335
RMSE  0.187  0.352  0.542
Mean Items (MI)  15.57  15.33  15.16
Mean vs. Median Items  1.119  0.988  0.788
Diff. Kurtosis  -0.146  -0.151  -0.154

Table 14
Univariate Effects for Number of Dimensions (N Dim) Property (DFnum = 2, DFden = 1257)

Dependent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
CAT-SIM z  42.23  0.029  1443  .000  0.697  2.296  1.000
CAT-EST z  3.490  0.724  4.821  .008  0.008  0.008  0.320
REL DIFF  1.074  0.006  190.4  .000  0.233  0.303  1.000
RMS  0.283  0.008  34.75  .000  0.052  0.055  1.000
RMSE  11.36  0.007  1543  .000  0.711  2.455  1.000
Mean Items (MI)  14.62  39.19  0.373  .689  0.001  0.001  0.005
Mean vs. Median Items  10.44  7.533  1.387  .250  0.002  0.002  0.032
Diff. Kurtosis  0.006  0.034  0.183  .832  0.000  0.000  0.003
Subset (Pairwise) Comparisons (DFnum = 1, DFden = 1006)
CAT-SIM z, dim = 2, 4  7.598  0.029  266.2  .000  0.209  0.265  1.000
REL DIFF, dim = 2, 4  0.625  0.006  106.9  .000  0.096  0.106  1.000
RMSE, dim = 2, 4  9.104  0.008  1087  .000  0.519  1.080  1.000
Notes: (a) p < .001 indicates statistical significance. Boldface values indicate effect sizes > .15.

Table 15
Univariate Effects for Dimension Correlations (Dim R) Property (DFnum = 2, DFden = 1257)

Dependent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
CAT-SIM z  49.72  0.017  2866  .000  0.820  4.560  1.000
CAT-EST z  4.518  0.722  6.256  .002  0.010  0.010  0.482
REL DIFF  1.369  0.005  264.9  .000  0.297  0.422  1.000
RMS  0.336  0.008  41.77  .000  0.062  0.066  1.000
RMSE  7.613  0.013  571.0  .000  0.476  0.909  1.000
Mean Items (MI)  14.98  39.19  0.382  .682  0.001  0.001  0.005
Mean vs. Median Items  11.07  7.532  1.469  .231  0.002  0.002  0.035
Diff. Kurtosis  0.006  0.034  0.163  .849  0.000  0.000  0.002
Subset (Pairwise) Comparisons (DFnum = 1, DFden = 1006)
CAT-SIM z, r = .35, .70  22.57  0.014  1652  .000  0.622  1.642  1.000
REL DIFF, r = .35, .70  1.216  0.005  231.4  .000  0.187  0.230  1.000
RMSE, r = .35, .70  1.603  0.016  101.3  .000  0.091  0.101  1.000
Notes: (a) p < .001 indicates statistical significance. Boldface values indicate effect sizes > .15.

Table 16
Means of DVs by Dimension Correlation

Dependent Variable  r = 1.0*  r = .70  r = .35
CAT-SIM z  1.596  1.128  0.829
CAT-EST z  2.761  2.656  2.537
REL DIFF  .014  .066  .136
RMS  0.393  0.362  0.331
RMSE  0.187  0.407  0.487
Mean Items (MI)  15.57  15.33  15.15
Mean vs. Median Items  1.119  0.994  0.782
Diff. Kurtosis  -0.146  -0.151  -0.154
Note: * r = 1.0 is equivalent to 1 dimension.
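For reference, the η² and f² columns in these effect tables are linked by Cohen's (1988) formula f² = η² / (1 - η²); for example, the CAT-EST z row of Table 9 gives 0.665 / (1 - 0.665) ≈ 1.985, which matches the tabled f² value.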
Table 17
Means of DVs by Discrimination (Alpha) Condition

Dependent Variable  1 (a = 0.5 - 2.5)  2 (a = 1.0 - 2.0)  3 (a = 0.5 - 1.5)
CAT-SIM z  1.114  1.125  1.067
CAT-EST z  2.403  2.432  3.053
REL DIFF  .102  .063  .085
RMS  0.406  0.374  0.288
RMSE  0.444  0.405  0.338
Mean Items (MI)  12.43  13.74  19.76
Mean vs. Median Items  1.362  1.148  0.293
Diff. Kurtosis  -0.229  -0.137  -0.088

Table 18
Univariate Effects for Discrimination (Alpha) Property (DFnum = 2, DFden = 1257)

Dependent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
CAT-SIM z  0.404  0.096  4.219  .015  0.007  0.007  0.254
CAT-EST z  56.74  0.639  88.77  .000  0.124  0.141  1.000
REL DIFF  0.165  0.007  23.27  .000  0.036  0.037  0.999
RMS  1.560  0.006  255.8  .000  0.289  0.407  1.000
RMSE  1.206  0.024  51.27  .000  0.075  0.082  1.000
Mean Items (MI)  6412  29.00  221.1  .000  0.260  0.352  1.000
Mean vs. Median Items  134.4  7.336  18.32  .000  0.028  0.029  0.992
Diff. Kurtosis  2.159  0.030  70.83  .000  0.101  0.113  1.000
Subset (Pairwise) Comparisons (DFnum = 1, DFden = 838)
MI, 1 & 2  358.8  20.48  17.52  .000  0.020  0.021  0.812
MI, 1 & 3  11271  33.30  338.9  .000  0.288  0.404  1.000
MI, 2 & 3  7608  33.30  228.6  .000  0.214  0.273  1.000
Notes: (a) p < .001 indicates statistical significance. Boldface values indicate effect sizes > .15.

Table 19
Means of DVs by Difficulty/Skew/Kurtosis (Beta) Condition

Dependent Variable  No SK  Hi K+  Hi K-  Mod S-  Mod S+  S-, K+¹  S+, K+¹
CAT-SIM z  1.138  1.137  1.117  1.075  1.078  1.134  1.034
CAT-EST z  2.604  2.794  2.401  2.606  2.616  2.645  2.738
REL DIFF  .099  .085  .121  .073  .073  .100  .035
RMS  0.371  0.390  0.308  0.346  0.348  0.370  0.358
RMSE  0.391  0.415  0.337  0.400  0.400  0.395  0.431
Mean Items (MI)  13.66  16.26  12.43  15.85  15.83  14.86  18.28
Mean vs. Median Items  1.017  1.540  0.978  0.683  0.694  1.639  -0.017
Diff. Kurtosis  -0.165  -0.212  -0.033  -0.113  -0.114  -0.238  -0.183
Note: ¹ Indicates high skew and kurtosis.

Table 20
Univariate Effects for "Difficulty"/Skew/Kurtosis (Beta) Property (DFnum = 6, DFden = 1253)

Dependent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
CAT-SIM z  0.288  0.095  3.019  .006  0.014  0.014  0.528
CAT-EST z  2.783  0.718  3.874  .001  0.018  0.019  0.726
REL DIFF  0.135  0.007  20.02  .000  0.087  0.096  1.000
RMS  0.122  0.008  15.24  .000  0.068  0.073  1.000
RMSE  0.154  0.025  6.228  .000  0.029  0.030  0.966
Mean Items (MI)  646.7  36.24  17.84  .000  0.079  0.085  1.000
Mean vs. Median Items  57.07  7.300  7.817  .000  0.036  0.037  0.994
Diff. Kurtosis  0.877  0.030  29.40  .000  0.123  0.141  1.000
Subset (Pairwise) Comparisons (DFnum = 1, DFden = 358)
MI, 1 & 2  611.0  23.07  26.48  .000  0.069  0.074  0.966
MI, 1 & 3  135.7  18.54  7.316  .007  0.020  0.020  0.275
MI, 1 & 4  433.4  32.22  13.45  .000  0.036  0.038  0.640
MI, 2 & 3  1323  22.38  59.09  .000  0.142  0.165  1.000
MI, 2 & 7  368.0  51.30  7.174  .008  0.020  0.020  0.266
MI, 4 & 5  0.044  45.80  0.001  .975  0.000  0.000  0.001
MI, 6 & 7  1057  49.04  21.56  .000  0.057  0.060  0.908
Notes: (a) p < .001 indicates statistical significance.

Table 21
Summary of CAT-SIM z ~ N DIM x DIM R (DFnum = 1, DFden = 1004)

Independent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
N Dim (2, 4)  5.287  0.006  899.8  .000  0.258  0.896  1.000
Dim R (.35, .70)  9.056  0.006  1541  .000  0.442  1.535  1.000
N Dim x Dim R  0.245  0.006  41.75  .000  0.012  0.042  0.999
Notes: (a) p < .001 indicates statistical significance. Boldface values indicate effect sizes > .15. Model R² = .838, F(3, 1004) = 1726, p < .001.
Table 22
Summary of REL DIFF ~ N DIM x DIM R (DFnum = 1, DFden = 1004)

Independent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
N Dim (2, 4)  0.611  0.005  134.6  .000  0.110  0.134  1.000
Dim R (.35, .70)  0.310  0.005  68.20  .000  0.056  0.068  1.000
N Dim x Dim R  0.100  0.005  21.91  .000  0.018  0.022  0.918
Notes: (a) p < .001 indicates statistical significance. Boldface values indicate effect sizes > .15. Model R² = .298, F(3, 1004) = 142.4, p < .001.

Table 23
Comparison of CAT Reliability Estimate to CTT Alpha for 2 and 4 N DIM Conditions

N DIM¹  DIM R  CAT reliability  CTT alpha  REL.DIFF  Diff by DIM R
2  .35  .833  .732  .101
4  .35  .835  .664  .171  .070
2  .70  .830  .779  .051
4  .70  .832  .751  .081  .030
Note: ¹ For the one dimension (R = 1) case, mean CAT reliability was .828 and mean alpha was .814, producing a mean difference of .014.

Table 24
Summary of MI ~ N ITEMS x N RESP x BETA (DFnum = 6, DFden = 882)

Independent Variable  MSS  MSSres  F  Sig.(a)  η²  f²  Power
N Items (24, 36, & 48)  3.300  20.30  0.161  .852  0.000  0.000  0.002
N Resp  11.40  20.30  0.561  .571  0.001  0.001  0.009
Beta (B)  86.90  20.30  4.276  .000  0.023  0.029  0.836
N Items x N Resp  11.00  20.30  0.543  .704  0.002  0.002  0.014
N Items x B  158  20.30  7.787  .000  0.085  0.106  1.000
N Resp x B  54.5  20.30  2.682  .001  0.029  0.036  0.851
N Items x N Resp x B  47.8  20.30  2.351  .000  0.052  0.064  0.981
Notes: (a) p < .001 indicates statistical significance. Model R² = .584, F(62, 882) = 19.97, p < .001.

Table 25
Summary of Hypothesis Results for Main Effects

Property  DV  Hypothesis  Outcome
N ITEMS  CAT-SIM z  Mean(12) < Mean(24) = Mean(36) = Mean(48)  FULL SUPPORT
N ITEMS  Mean Items  Mean(12) < Mean(24) = Mean(36) = Mean(48)  FULL SUPPORT
N RESP  CAT-SIM z  Mean(2) < Mean(5) = Mean(7)  FULL SUPPORT
N RESP  Mean Items  Mean(2) < Mean(5) = Mean(7)  FULL SUPPORT
N DIM  CAT-SIM z  Mean(1) > Mean(2) > Mean(4)  FULL SUPPORT
N DIM  REL DIFF  Mean(1) < Mean(2) < Mean(4)  FULL SUPPORT
N DIM  RMSE  Mean(1) < Mean(2) < Mean(4)  FULL SUPPORT
N DIM  Mean Items  Mean(1) < Mean(2) < Mean(4)  NO SUPPORT
DIM R  CAT-SIM z  Mean(1) > Mean(.70) > Mean(.35)  FULL SUPPORT
DIM R  REL DIFF  Mean(1) < Mean(.70) < Mean(.35)  FULL SUPPORT
DIM R  RMSE  Mean(1) < Mean(.70) < Mean(.35)  FULL SUPPORT
DIM R  Mean Items  Mean(1) < Mean(.70) < Mean(.35)  NO SUPPORT
ALPHA  CAT-SIM z  Mean(A1) = Mean(A2) = Mean(A3)  FULL SUPPORT
ALPHA  Mean Items  Mean(A1) < Mean(A2) < Mean(A3)  FULL SUPPORT
BETA  CAT-SIM z  Several  NO SUPPORT
BETA  Mean Items  Several  PARTIAL SUPPORT

Table 26
Best matched Study 2 Samples for Traditional Punishments

N DIM  CAT-SIM r²  SUM-SIM r²  REL DIFF  RMSE  B COND
2  .703  .705  .025  0.377  1
2  .711  .719  -.021  0.434  2
2  .713  .703  .029  0.268  3
2  .701  .670  .022  0.324  4
2  .700  .682  .021  0.325  5
2  .743  .697  .045  0.420  6
2  .670  .626  .031  0.331  7
4  .641  .643  .033  0.523  1
4  .616  .609  .031  0.570  2
4  .626  .548  .165  0.443  3
4  .596  .576  .064  0.449  4
4  .606  .601  .054  0.480  5
4  .635  .602  .071  0.482  6
4  .608  .593  .031  0.551  7
Notes: These runs were matched exactly on discrimination (a = .5 to 2.5), number of items (12), dimension correlation (r = .70), and number of response options (5).
Table 27
Best matched Study 2 Samples for Need for Cognition

N ITEMS  CAT-SIM r²  SUM-SIM r²  REL DIFF  RMSE  B COND
12  .727  .727  -.002  0.316  1
12  .714  .715  -.027  0.367  2
12  .717  .711  .026  0.263  3
12  .693  .669  .017  0.288  4
12  .699  .683  .017  0.287  5
12  .706  .670  -.008  0.339  6
12  .688  .651  -.008  0.315  7
24  .747  .714  .045  0.370  1
24  .739  .725  .006  0.420  2
24  .722  .681  .058  0.277  3
24  .718  .651  .059  0.318  4
24  .709  .657  .019  0.324  5
24  .738  .671  .039  0.364  6
24  .720  .644  .051  0.342  7
Notes: These runs were matched exactly on discrimination (a = 1 to 2), dimensions (2), dimension correlation (r = .70), and number of response options (7).

Table 28
Best matched Study 2 Samples for Mature Personality

DIM R  CAT-SIM r²  SUM-SIM r²  REL DIFF  RMSE  B COND
.35  .403  .397  .134  0.502  1
.35  .379  .373  .141  0.513  2
.35  .402  .368  .185  0.507  3
.35  .405  .374  .154  0.505  4
.35  .400  .394  .128  0.504  5
.35  .402  .385  .122  0.529  6
.35  .386  .367  .130  0.511  7
.70  .647  .639  .044  0.388  1
.70  .624  .612  .047  0.397  2
.70  .640  .595  .092  0.383  3
.70  .621  .601  .056  0.393  4
.70  .635  .611  .053  0.392  5
.70  .629  .599  .040  0.408  6
.70  .620  .586  .050  0.395  7
Notes: These runs were matched exactly on discrimination (a = .5 to 1.5), dimensions (4), number of items (24), and number of response options (5).

Table 29
Traditional Punishments Exploratory Model Results¹

  F1  F2  F3  Mean
CTT r²  .813  .380  .583  .592
CFA r²  .755  .470  .523  .583
IRT r²  .861  .485  .581  .643
CAT r²  .881  .482  .677  .680
BIAS  0.264  1.229  0.421  0.638
RMSE  0.325  1.514  0.520  0.786
Notes: ¹ AIC = 215372, BIC = 216075, G² = 113561 (df = 5724, p = .000).

Table 30
Traditional Punishments Bi-Factor Model Results¹

  F1  F2  F3  General²  Mean
CTT r²  .014  .021  .009  .712  .189
CFA r²  .017  .039  .000  .688  .186
IRT r²  .012  .035  .003  .759  .202
CAT r²  .024  .044  .006  .793  .217
BIAS  0.758  0.819  0.860  0.406  0.711
RMSE  0.947  1.062  1.101  0.474  0.896
Notes: ¹ AIC = 215548, BIC = 216150, G² = 113767 (df = 5739, p = .000). ² For this study, information pertaining to the general factor is most relevant.

Table 31
Traditional Punishments Correlated Factors Model Results¹

  F1  F2  F3  Mean
CTT r²  .865  .729  .677  .757
CFA r²  .843  .780  .545  .723
IRT r²  .914  .815  .664  .797
CAT r²  .943  .854  .674  .824
BIAS  0.176  0.277  0.412  0.288
RMSE  0.224  0.355  0.534  0.371
Notes: ¹ AIC = 215688, BIC = 216250, G² = 113919 (df = 5745, p = .000).

Table 32
Need for Cognition Bi-Factor Model Results¹

  F1  F2  General²  Mean
CTT r²  .183  .053  .882  .373
CFA r²  .131  .082  .876  .363
IRT r²  .144  .085  .921  .383
CAT r²  .172  .076  .945  .398
BIAS  0.714  0.775  0.178  0.556
RMSE  0.893  0.993  0.222  0.703
Notes: ¹ AIC = 104931, BIC = 105532, G² = 71725 (df = 2160, p = .000). ² For this study, information pertaining to the general factor is most relevant.

Table 33
Need for Cognition Correlated Factors Model Results¹

  F1  F2  Mean
CTT r²  .861  .764  .813
CFA r²  .816  .798  .807
IRT r²  .862  .837  .849
CAT r²  .907  .840  .874
BIAS  0.224  0.295  0.260
RMSE  0.291  0.384  0.338
Notes: ¹ AIC = 105017, BIC = 105532, G² = 71841 (df = 2175, p = .000).
Table 34
Need for Cognition Exploratory Model Results¹

  F1  F2  Mean
CTT r²  .861  .352  .607
CFA r²  .865  .414  .640
IRT r²  .915  .452  .684
CAT r²  .938  .439  .689
BIAS  0.203  0.567  0.385
RMSE  0.257  0.723  0.490
Notes: ¹ AIC = 105359, BIC = 105972, G² = 72149 (df = 2158, p = .000).

Table 35
Mature Personality Exploratory Model Results¹

  F1  F2  F3  F4  Mean
CTT r²  .483  .616  .316  .320  .434
CFA r²  .581  .514  .366  .308  .442
IRT r²  .620  .524  .382  .312  .459
CAT r²  .569  .607  .367  .339  .470
BIAS  0.500  0.475  0.616  0.638  0.557
RMSE  0.628  0.596  0.792  0.826  0.711
Notes: ¹ AIC = 203548, BIC = 204583, G² = 155619 (df = 4231, p = .000).

Table 36
Mature Personality Correlated Factors Model Results¹

  F1  F2  F3  F4  Mean
CTT r²  .684  .646  .693  .217  .560
CFA r²  .623  .727  .673  .209  .558
IRT r²  .633  .751  .788  .216  .572
CAT r²  .704  .721  .729  .225  .595
BIAS  0.447  0.438  0.431  0.721  0.509
RMSE  0.539  0.530  0.515  0.942  0.631
Notes: ¹ AIC = 203071, BIC = 203742, G² = 155253 (df = 4288, p = .000).

Table 37
Mature Personality Bi-Factor Model Results¹

  F1  F2  F3  F4  General²  Mean
CTT r²  .001  .003  .000  .000  .665  .134
CFA r²  .001  .001  .038  .001  .650  .138
IRT r²  .001  .050  .001  .000  .641  .138
CAT r²  .001  .020  .000  .000  .661  .137
BIAS  1.163  1.273  1.123  1.263  0.656  1.095
RMSE  1.478  1.552  1.405  1.541  0.750  1.345
Notes: ¹ AIC = 201921, BIC = 202663, G² = 154081 (df = 4277, p = .000). ² For this study, information pertaining to the general factor is most relevant.

Figure 1. Test Information Curve (TIC) for the Traditional Punishments subscale. It most reliably measures those with trait estimates between 0 and 2 (where I >= 10, analogous to a reliability of .9).

Figure 2. Plot displaying the distributions of Traditional Punishments estimates. Z estimates represent z-transformed summed scores, FS is the estimated factor score derived from the confirmatory factor model, and EAP is the expected-a-posteriori estimate of a person's trait level.

Figure 3. Line plot of reliability estimates for the Traditional Punishments subscale from each fixed reduction technique. Reliability did not differ much (if at all) between techniques except when nine items were used. Additionally, reliability increased much less overall from three items to nine items when compared to the other measures (across the same number of items).

Figure 4. Line plot of r² estimates for the Traditional Punishments subscale for each fixed reduction technique. In general, the CFA estimates were larger than all others across all item conditions. Note: these represent correlations with the within-method full-scale estimate. In other words, the CFA r² represents the relationship between the estimate from the CFA reduced form and the estimate from the CFA full form.
Figure 5. Test Information Curve (TIC) for the Need for Cognition scale. It most reliably measures those with trait estimates between -2 and 1.

Figure 6. Plot displaying the distributions of Need for Cognition estimates. Z estimates represent z-transformed summed scores, FS is the estimated factor score derived from the confirmatory factor model, and EAP is the expected-a-posteriori estimate of a person's trait level.

Figure 7. Line plot of reliability estimates for the Need for Cognition measure from each fixed reduction technique. FCAT reliability was consistently higher than all others across all item conditions.

Figure 8. Line plot of r² estimates for the Need for Cognition (NC) measure by reduction method and number of items. The fixed CAT had negligibly higher estimates at all item conditions than the CFA and IRT techniques. Of the three measures, the estimates for NC were most similar.

Figure 9. Test Information Curve (TIC) for the Mature Personality subscale. At no point does the TIC display an estimate of information greater than 10, thus reliability remains less than .9 at all trait levels.

Figure 10. Plot displaying the distributions of Mature Personality estimates. Z estimates represent z-transformed summed scores, FS is the estimated factor score derived from the confirmatory factor model, and EAP is the expected-a-posteriori estimate of a person's trait level.

Figure 11. Line plot of reliability estimates for the Mature Personality subscale from each fixed reduction technique. FCAT had higher estimates of reliability than all other techniques except at three items. The other techniques did not differ from each other, consistent with the other measures. The difference between FCAT reliability and the other reliability estimates was much smaller here than with the Need for Cognition measure when fewer items were used.

Figure 12. Line plot of r² estimates for the Mature Personality subscale, arranged by reduction method and number of items.

Figure 13. Plot of the interaction between number of dimensions and dimension correlation (R), where the y-axis is the z-transformed correlation between the CAT estimate and the true estimate (CAT-SIM r), one index of CAT accuracy. The two correlation conditions exhibit slightly more divergent CAT-SIM r estimates at four dimensions than at two.
Figure 14. Plot of the interaction between number of dimensions and dimension correlation (R), where the y-axis is the difference between the estimate of reliability for the CAT and the estimate of reliability (α) for the CTT reduced form. The two correlation conditions are slightly more divergent at four dimensions than at two.

Figure 15. Plot of the interaction between beta conditions and number of items in the item pool (ITEMS), with mean number of selected items on the y-axis. The plot showcases small variations in mean items across beta conditions relative to ITEMS, with beta condition seven exhibiting notable differences from the other beta conditions.

Figure 16. Plot of the interaction between beta conditions and number of response options (RESP), with mean number of selected items on the y-axis. The moderate-to-high skew conditions (beta conditions four, five, and seven) exhibited the most notable differences in mean items between RESP conditions.

Figure 17. Plot 1 of 3 addressing the interaction between beta conditions, number of response options (RESP), and size of item pool, with mean number of selected items on the y-axis. Here, the relationship between beta and RESP is plotted for an item pool size of 24 items.

Figure 18. Plot 2 of 3 addressing the interaction between beta conditions, number of response options (RESP), and size of item pool, with mean number of selected items on the y-axis. Here, the relationship between beta and RESP is plotted for an item pool size of 36 items.

Figure 19. Plot 3 of 3 addressing the interaction between beta conditions, number of response options (RESP), and size of item pool, with mean number of selected items on the y-axis. Here, the relationship between beta and RESP is plotted for an item pool size of 48 items.
Appendix A

Table A1
Traditional Punishments Discrimination and Difficulty Parameters from Unidimensional Model

Item  a  b1  b2  b3  b4  b5  b6
1  .641  -4.197  -2.631  -1.638  -.568  .872  2.627
2  2.617  -.228  .328  .655  1.024  1.467  1.936
3  1.543  -.491  .167  .499  .862  1.477  2.185
4  .967  -1.949  -1.003  -.504  .059  .901  1.991
5  .998  -1.610  -.778  -.324  .244  1.197  2.282
6  .788  -3.965  -2.603  -1.674  -.865  .397  1.843
7  1.778  -.039  .697  1.077  1.604  2.173  2.780
8  2.041  .103  .596  .826  1.105  1.542  2.021
9  2.105  .700  1.214  1.479  1.759  2.142  2.591
10  1.446  -.697  .152  .765  1.431  2.186  3.012
11  2.672  -.149  .405  .699  .964  1.296  1.700
12  .381  -8.678  -6.869  -5.486  -3.663  -1.386  1.167

Table A2
Need for Cognition Discrimination and Difficulty Parameters from Unidimensional Model

Item  a  b1  b2  b3  b4
1  1.506  -1.451  -.608  .075  1.884
2  2.151  -1.706  -.889  -.409  1.094
3  1.605  -2.531  -1.311  -.835  .235
4  1.723  -2.357  -1.124  -.619  .538
5  1.784  -2.368  -1.344  -.836  .235
6  1.495  -1.306  -.259  .439  2.006
7  1.432  -2.241  -.737  -.227  .986
8  1.081  -2.492  -.727  .127  1.721
9  1.369  -2.094  -.513  .054  1.312
10  1.629  -1.910  -1.135  -.395  1.081
11  1.838  -2.312  -1.472  -.925  .492
12  1.779  -2.436  -1.375  -.839  .314
13  1.477  -1.405  -.349  .569  2.100
14  1.547  -1.637  -.762  -.004  1.415
15  1.379  -2.194  -1.088  -.218  1.444
16  1.229  -2.244  -.856  -.246  1.350
17  1.024  -3.127  -1.567  -.890  .683
18  .627  -3.390  -1.449  -.203  3.152

Table A3
Mature Personality Discrimination and Difficulty Parameters from Unidimensional Model

Item  a  b1  b2  b3  b4
1  .908  -3.099  -.689  1.797  3.040
2  1.046  -2.402  -.550  1.606  3.184
3  1.246  -1.262  .424  2.088  3.511
4  1.103  -.815  .745  2.108  3.324
5  1.082  -2.594  -.638  1.376  3.009
6  1.529  -1.494  .067  1.672  2.847
7  1.684  -1.585  .025  1.556  2.643
8  1.152  -1.599  .137  1.961  3.481
9  1.370  -2.142  -.388  1.534  3.025
10  1.155  -1.692  -.388  1.016  2.429
11  1.292  -.570  .996  2.528  3.904
12  1.461  -.658  .795  2.372  3.474
13  .608  -1.741  1.098  3.103  5.378
14  1.395  -1.120  .535  2.162  3.112
15  .693  -3.614  -1.013  1.549  4.145
16  1.906  -.788  .728  2.258  3.175
17  .614  -1.638  1.002  2.771  4.781
18  .912  -1.954  -.147  2.323  4.209
19  .719  -3.228  -.819  1.674  4.258
20  1.861  -.959  .620  2.217  3.186
21  1.209  -1.889  .071  2.140  3.830

Appendix B
Incomplete Data and Mature Personality

To understand how missing data influenced the parameter and trait estimates for the Mature Personality (MP) subscale, some additional analyses were conducted. Consistent with Study 1, a validation sample was used for the majority of these analyses (N = 10,978, once the calibration sample of 2,500 individuals was accounted for). The full sample (N = 13,478) was used for the initial assessment of multidimensionality (as in Study 1) and to identify the IRT parameters needed for the reassessment of the CAT. There were 277 individuals in the validation sample who did not respond to any items, so they were removed. The final validation sample size was N = 10,701. An EFA using the 'psych' package's irt.fa function revealed that a four-factor structure was best, with the same item-to-factor associations as determined with complete cases (see Study 1 and Table 6).
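A minimal sketch of that kind of call, using the irt.fa function from the 'psych' package (Revelle, 2012), is shown below; the data frame mp_items is a hypothetical stand-in generated only so the call runs, and the arguments simply mirror the four-factor solution described above:

library(psych)

# Hypothetical stand-in for the 21 Mature Personality items: random 4-option
# responses with occasional NAs, mimicking retained incomplete cases.
set.seed(1)
mp_items <- as.data.frame(matrix(
  sample(c(1:4, NA), 500 * 21, replace = TRUE, prob = c(rep(.24, 4), .04)),
  ncol = 21))

efa <- irt.fa(mp_items, nfactors = 4, plot = FALSE)
efa$fa  # factor loadings plus fit indices such as TLI and RMSEA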
EFA model fit was also nearly identical, though the TLI was slightly smaller (.933) and the confidence interval around the RMSEA was tighter with incomplete cases. Re-evaluating the unidimensional GRM with all data, discrimination and difficulty parameters changed slightly (Table B1; compare to Table A3 in Appendix A), producing changes to item information (see Table B2). These changes were not dramatic; however, information changed enough that the order of item selection differed for the IRT method. This did not notably affect the estimates of r² (or reliability) for any of the item conditions in the unidimensional case. For example, the r² for the 12-item IRT condition was .948, down from .954 with complete cases only. Only item 11 experienced a large shift in selection order (moving from the seventh to the eleventh item selected).

Unidimensional and multidimensional factor loadings, as assessed using the structural equation modeling program Mplus (Muthén & Muthén, 1998-2011), did not change notably when incomplete cases were included (refer to Table B2), and the actual order of item selection shifted much as it did for the IRT method (e.g. item 20 was selected before item 16). There was an improvement in model fit with the additional cases (TLI = .903, RMSEA = .069), though the model still did not satisfy the criteria laid out in Study 1. The average factor correlation decreased only slightly, from r = .613 to r = .608.

For the CAT, inclusion of incomplete cases increased the median number of items selected from 10 to 12, and the mean increased from 12 to 13. The CAT r² increased from .968 with complete cases only to .976 with incomplete data. Conversely, CAT reliability decreased from .865 (SEM = .367) to .860 (SEM = .375). The order of item selection frequencies shifted somewhat, but the eight least frequently used items were the same with and without incomplete cases, and the incomplete-data CAT showed a better spread of item selection frequencies. Table B3 compares selection frequencies for the CAT with and without incomplete cases, and displays the presentation order for the IRT method under both circumstances.

Footnote: The R package 'lavaan' currently uses listwise deletion with ordinal data (when it is treated as such) in the presence of incomplete data. Mplus currently uses a pairwise-present method that utilizes any case with at least one response (assuming the WLSMV estimator).

Table B1
Mature Personality (All Cases) Discrimination and Difficulty Parameters from Unidimensional Model

Item     a       b1      b2      b3     b4
 1    0.924  -3.022  -0.765  1.638  2.838
 2    1.065  -2.513  -0.753  1.309  2.733
 3    1.217  -1.297   0.365  2.021  3.420
 4    1.055  -0.914   0.682  2.047  3.360
 5    1.084  -2.726  -0.836  1.074  2.629
 6    1.523  -1.533  -0.013  1.567  2.763
 7    1.624  -1.688  -0.084  1.440  2.520
 8    1.145  -1.633   0.103  1.898  3.334
 9    1.355  -2.280  -0.513  1.403  2.820
10    1.114  -1.822  -0.467  0.953  2.424
11    1.250  -0.652   0.920  2.492  3.810
12    1.394  -0.747   0.788  2.373  3.521
13    0.563  -1.839   1.182  3.309  5.810
14    1.315  -1.253   0.432  2.103  3.070
15    0.672  -3.788  -1.143  1.460  4.136
16    1.801  -0.887   0.649  2.238  3.172
17    0.582  -1.747   0.980  2.850  4.918
18    0.897  -2.077  -0.245  2.194  4.091
19    0.714  -3.347  -0.922  1.564  4.101
20    1.822  -1.037   0.523  2.139  3.087
21    1.188  -1.988  -0.032  2.080  3.769

Note: N = 13,478.
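To make the tabled parameters concrete: under the GRM used throughout (with D = 1.0, as in the Appendix C code), the probability of responding in category k is the difference between adjacent cumulative boundary probabilities, P*(k) = 1 / (1 + exp(-a(θ - b_k))). The short R sketch below is an illustration added here (it is not part of the original appendix) showing that computation for item 1 of Table B1; the CAT reliability figures above likewise follow the 1 - mean(SEM)² conversion used in the Appendix C code (e.g. 1 - .367² ≈ .865).

# Illustrative sketch only (not from the original appendix): GRM category
# probabilities for Table B1, item 1 (a = 0.924; b = -3.022, -0.765, 1.638, 2.838).
grm_probs <- function(theta, a, b, D = 1.0) {
  pstar <- c(1, 1 / (1 + exp(-D * a * (theta - b))), 0)  # cumulative boundary probabilities
  -diff(pstar)  # P(category k) = P*(k-1) - P*(k)
}
round(grm_probs(theta = 0, a = 0.924, b = c(-3.022, -0.765, 1.638, 2.838)), 3)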
Table B2
Unidimensional and Multidimensional Factor Loadings, and IRT Information, for Mature Personality Subscale with and without Incomplete Cases

               Complete Cases (N = 6,897)     Incomplete Cases (N = 13,478)
Factor  Item   λ1F    λMD    Info (I)         λ1F    λMD    Info (I)
         1     .448   .490   2.390            .461   .497   2.398
         2     .503   .549   2.848            .506   .545   2.831
         5     .522   .569   3.004            .522   .561   2.946
         6     .639   .696   4.427            .628   .677   4.377
         7     .667   .727   5.025            .650   .699   4.758
         9     .595   .645   4.093            .594   .637   3.998
  1     10     .540   .587   2.845            .529   .569   2.739
        14     .595   .663   3.795            .577   .650   3.510
        16     .738   .816   5.809            .718   .801   5.410
        18     .435   .488   2.429            .435   .489   2.374
  2     20     .724   .801   5.728            .722   .798   5.539
         3     .556   .619   3.427            .543   .598   3.289
         4     .507   .563   2.660            .495   .546   2.526
         8     .519   .580   3.144            .520   .576   3.079
        11     .580   .648   3.508            .571   .633   3.330
        12     .619   .693   4.038            .600   .666   3.829
        13     .307   .355   1.388            .282   .325   1.262
  3     17     .312   .364   1.346            .298   .347   1.257
        15     .453   .700   1.797            .428   .699   1.729
        19     .459   .707   1.867            .456   .721   1.846
  4     21     .553   .792   3.575            .544   .781   3.495

Note: λ1F = loading from the unidimensional model; λMD = loading from the multidimensional model.

Table B3
Order of Selected Items from First to Last (IRT) or Most Frequent to Least Frequent (CAT)

      Complete Cases (N = 4,397)        Incomplete Cases (N = 10,701)
IRT   CAT   Frequency   % Total         IRT   CAT   Frequency   % Total
16     6      4397      100             20     6     10651      99.5
20     7      4397      100             16     3     10633      99.4
 7    12      4397      100              7     7     10528      98.4
 6    14      4397      100              6    12     10395      97.1
12    16      4397      100              9     9     10342      96.6
 9    20      4397      100             12    11     10234      95.6
11     9      4395      100             14    14      9968      93.2
14    11      4279      97.3            21    16      9305      87.0
21     3      3909      88.9            11     8      8053      75.3
 3    21      2646      60.2             3    20      7437      69.5
 8    10      2458      55.9             8    10      6883      64.3
 2     8      1657      37.7             5     5      5820      54.4
 5     5      1070      24.3             2    21      5225      48.8
10     4      1066      24.2            10     4      4528      42.3
 4     2       837      19.0             4     2      4463      41.7
18    18       598      13.6             1     1      3489      32.6
 1     1       546      12.4            18    13      1824      17.0
19    19       440      10.0            19    15      1481      13.8
15    15       382       8.69           15    18      1480      13.8
13    17       350       7.96           17    17      1082      10.1
17    13       330       7.51           13    19       993       9.28

Appendix C

The R code used to simulate data and to run the CAT is below. The CAT simulation code consists largely of Choi's (2009) Firestar-generated R code; the author made several adjustments to accommodate the needs of the dissertation studies. Firestar generates a significant amount of R code, but much of it was not necessary to run the adaptive tests. What is presented below is a modified version of the truncated form actually used for this dissertation. Note: this may be difficult to follow, as it was copied directly from R; some consolidation was done to reduce the number of lines.

# SIMULATE DATA FUNCTION
simulateCatData <- function(n=5000, ni=48, bcond=NULL, maxCat=5, m=NULL,
                            ndim=1, A=NULL, B=NULL, R=1) {
  require(MASS)  # for mvrnorm
  nbeta = maxCat - 1
  if (ndim > 1) popMean = rep(0, ndim) else popMean = 0
  if (ndim == 1) {
    if (is.null(A) == T) A.new = rep(1, ni)
    else if (is.null(A) == F) {
      if (is.matrix(A) == T || is.data.frame(A) == T) {
        stop('Matrix for A is not allowed when ndim = 1.')
      }
      else if (length(A) > ni) A.new = sample(A, ni, replace=F)
      else if (length(A) < ni) A.new = sample(A, ni, replace=T)
      else A.new = A
    }
  }
  else if (ndim > 1) {
    ni.dim = ni/ndim
    A.new = mat.or.vec(ni, ndim)
    if (is.null(A) == T) {
      for (i in 1:ndim) {
        for (j in ((ni.dim*i)-ni.dim+1):(ni.dim*i)) {
          A.new[j,i] = 1
        }
      }
    }
    else if (is.vector(A, "numeric") == T) {
      count = 0
      if (length(A) == ni) A = A
      else if (length(A) > ni) {
        A = sample(A, ni, replace=F)
        warning('Number of alphas is greater than item number. Sampling w/out replacement.')
      }
      else if (length(A) < ni) {
        A = sample(A, ni, replace=T)
        warning('Number of alphas is less than item number. Sampling w/ replacement.')
      }
      for (i in 1:ndim) {
        for (j in ((ni.dim*i)-ni.dim+1):(ni.dim*i)) {
          count = count + 1
          A.new[j,i] = A[count]
        }
      }
    }
    else A.new = A
  }
  genB <- function(ni, bcond) {
    # bcond = skew/kurtosis condition
    # 1 = low skew/kurtosis (consistent with observed)
    # 2 = low skew/high positive kurtosis (unobserved)
    # 3 = low skew/high negative kurtosis (consistent with observed)
    # 4 = mod neg skew/low kurtosis (partially consistent with observed)
    # 5 = mod pos skew/low kurtosis (partially consistent with observed)
    # 6 = high neg skew/high pos kurtosis (unobserved)
    # 7 = high pos skew/high pos kurtosis (partially consistent with observed)
    if (bcond==1) baseB = c(-2.5,-1.3,1.5,3.1)
    else if (bcond==2) baseB = c(-3.8,-2,2,3.8)
    else if (bcond==3) baseB = seq(-1.25,1.25,length.out=4)
    else if (bcond==4) baseB = c(-4,-2.4,-1,0)
    else if (bcond==5) baseB = c(0,1,2.4,4)
    else if (bcond==6) baseB = c(-3,-2.3,-.9,3.6)
    else if (bcond==7) baseB = seq(.5,5,length.out=4)
    if (maxCat == 2) baseB = mean(baseB)
    if (maxCat == 7) {
      baseB = c(baseB[1], mean(c(baseB[1],baseB[2])), baseB[2], baseB[3],
                mean(c(baseB[3],baseB[4])), baseB[4])
    }
    fullB = rep(baseB, ni)
    for (i in 1:length(fullB)) {
      fullB[i] = fullB[i] + rnorm(1, 0, .25)
    }
    return(matrix(fullB, nrow = ni, ncol = length(baseB), byrow = T))
  }
  if (is.null(bcond) == F) {
    if (maxCat == 5 || maxCat == 2 || maxCat == 7) B = t(genB(ni, bcond))
    else stop('Beta conditions apply only to 2-, 5- and 7-category simulated data')
  }
  else if (is.null(bcond) == T) {
    if (is.null(B) == F) {
      if (maxCat == 2) {
        if (length(B) > ni) B = sample(B, ni, replace = F)
        else if (length(B) < ni) B = sample(B, ni, replace = T)
        else B = B
      }
      else if (maxCat > 2) {
        if (nrow(B) > ni) B = t(B[sample(1:nrow(B), ni, replace = F),])
        else if (nrow(B) < ni) B = t(B[sample(1:nrow(B), ni, replace = T),])
        else B = B
      }
    }
    else {
      stop('You must select a beta condition or provide a beta vector/matrix')
    }
  }
  if (is.null(m) == T) m = mvrnorm(n, popMean, R)
  else if (is.null(m) == F) {
    if (is.null(dim(m)) == F) {
      if (is.data.frame(m) == T) stop('True theta must be a matrix or vector.')
      if (nrow(m) != n) stop('Number of thetas does not equal number of observations')
      if (ncol(m) != ndim) stop('Number of specified dimensions does not equal number of observed dimensions in true theta matrix')
    }
    m = m
  }
  mA = m %*% t(A.new)
  u = runif(n*ni, 0, 1); dim(u) <- c(n, ni)
  data = matrix(1, n, ni)
  t = vector("list", nbeta); lp = vector("list", nbeta)
  z = vector("list", nbeta); d = vector("list", nbeta)
  for (i in 1:nbeta) {
    if (nbeta == 1) t[[i]] = matrix(rep(B, n), nrow = n, byrow = T)
    else t[[i]] = matrix(rep(B[i,], n), nrow = n, byrow = T)
    lp[[i]] = mA - t[[i]]
    #lp[[i]] = mA + t[[i]]
    z[[i]] = 1 / (1 + exp(lp[[i]]))
    #z[[i]] = exp(lp[[i]]) / (1 + exp(lp[[i]]))
    d[[i]] = mat.or.vec(n, ni)
    for (j in 1:n) {
      for (k in 1:ni) {
        if (u[j,k] > z[[i]][j,k])
        #if (u[j,k] < z[[i]][j,k])
        {
          d[[i]][j,k] = 1
        }
      }
    }
    data = data + d[[i]]
  }
  data.list <- list(as.data.frame(data), m, A.new, t(B), maxCat, ndim)
  names(data.list) <- c("Data","m","A","B","NCAT","NDIM")
  return(data.list)
}

# FUNCTIONS RELEVANT FOR ADAPTIVE TESTS (from FIRESTAR, not adjusted)
prep.prob.info <- function(model, D, nq, ni, maxCat, DISC, CB, NCAT, theta) {
  pp <- array(0, c(nq, ni, maxCat))
  matrix.info <- matrix(0, nq, ni)
  if (model==1) {
    for (i in 1:ni) {
      ps <- matrix(0, nq, NCAT[i]+1)
      ps[,1] <- 1
      ps[,NCAT[i]+1] <- 0
      for (k in 1:(NCAT[i]-1)) {
        ps[,k+1] <- 1/(1+exp(-D*DISC[i]*(theta-CB[i,k])))
      }
      for (k in 1:NCAT[i]) {
        pp[,i,k] <- ps[,k]-ps[,k+1]
        matrix.info[,i] <- matrix.info[,i] +
          (D*DISC[i]*(ps[,k]*(1-ps[,k])-ps[,k+1]*(1-ps[,k+1])))^2/pp[,i,k]
      }
    }
  }
  else if (model==2) {
    for (i in 1:ni) {
      cb <- unlist(CB[i,])
      cb <- c(0, cb)
      zz <- matrix(0, nq, NCAT[i])
      sdsum <- 0
      den <- rep(0, nq)
      for (k in 1:NCAT[i]) {
        sdsum <- sdsum+cb[k]
        zz[,k] <- exp(D*DISC[i]*(k*theta-sdsum))
        den <- den+zz[,k]
      }
      AX <- rep(0, nq); BX <- rep(0, nq)
      for (k in 1:NCAT[i]) {
        pp[,i,k] <- zz[,k]/den
        AX <- AX+k^2*pp[,i,k]
        BX <- BX+k*pp[,i,k]
      }
      matrix.info[,i] <- D^2*DISC[i]^2*(AX-BX^2)
    }
  }
  list(pp=pp, matrix.info=matrix.info)
}

calcFullLengthEAP <- function(pp, prior, nExaminees, nq, ni, theta, resp.data) {
  posterior <- matrix(rep(prior, nExaminees), nExaminees, nq, byrow=T)
  for (i in 1:ni) {
    resp <- matrix(resp.data[[paste("R",i,sep="")]], nExaminees, 1)
    prob <- t(pp[,i,resp]) #!
    prob[is.na(prob)] <- 1.0
    posterior <- posterior*prob
  }
  EAP <- posterior %*% theta / rowSums(posterior)
  SEM <- sqrt(rowSums(posterior*(matrix(theta,nExaminees,nq,byrow=T) -
                                 matrix(EAP,nExaminees,nq))^2)/rowSums(posterior))
  return(list(theta=EAP, SE=SEM))
}

# FUNCTION TO SIMULATE ADAPTIVE TESTS
# Helper used below but not defined in the original listing; this is the
# standard tolerance-based check from the R documentation:
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol

simulateAdaptive <- function(data=NULL, n=5000, ni=48, acond=NULL, bcond=NULL,
                             maxCat=5, m=NULL, ndim=1, A=NULL, B=NULL, R=1,
                             maxNI=ni, minNI=1, maxSE=.37, selection.method = 3,
                             runs=10, n.cal=2500, cal.m=NULL, cal.data,
                             suppress=F, outp=T) {
  begtime <- Sys.time()
  message(paste("Start Time: ", begtime, sep=""))
  if (suppress == T) options(warn = -1) else options(warn = 1)
  true.theta = m
  if (is.null(true.theta) == FALSE) {
    if (is.null(dim(true.theta)) == F) {
      if (n/runs != nrow(true.theta)) stop('Number of thetas must equal n / runs')
    }
    else if (is.null(dim(true.theta)) == T) {
      if (n/runs != length(true.theta)) stop('Length of theta must equal n / runs')
    }
  }
  if (is.null(data) == FALSE) {
    if (is.data.frame(data) == F && is.matrix(data) == F) {
      stop('Specified data must be provided as a matrix or data frame.')
    }
    tot.resp.data = data
    n = dim(tot.resp.data)[1]; ni = dim(tot.resp.data)[2]
    maxCat = max(tot.resp.data, na.rm=T) - min(tot.resp.data, na.rm=T) + 1
    if (length(A)==1) A.new = rep(A, ni)
    else if (length(A)>1 && (is.matrix(A)==T || is.data.frame(A)==T)) {
      A.new = rowSums(A)
      message('Assuming unidimensionality. Discriminations collapsed.')
    }
    else A.new = A
    if (length(A.new) != ni) {
      stop('number of alpha parameters does not match number of items')
    }
    if (dim(B)[1] != ni) {
      stop('number of beta parameters does not match number of items')
    }
    item.par = data.frame(A.new, B, maxCat)
    colnames(tot.resp.data) <- c(paste("R",1:ni,sep=""))
    colnames(item.par) <- c("a", paste("cb",1:(maxCat-1),sep=""), "NCAT")
    require(psych)
    ctt.r = alpha(cal.data)$item.stats["r"]
    r = data.frame(1:ni, ctt.r)[order(-ctt.r),]
  }
  else {
    message('No data specified. Data will be simulated.')
    cal.data <- simulateCatData(n.cal, ni, bcond, maxCat, cal.m, ndim, A, B, R)
    require(psych)
    ctt.r = alpha(cal.data[['Data']])$item.stats["r"]
    r = data.frame(1:ni, ctt.r)[order(-ctt.r),]
  }
  # Fixed settings (from the Firestar-generated code):
  model <- 1
  D = 1.0
  eapFullLength <- T
  minTheta = -4; maxTheta = 4
  inc = 0.10
  topN = 1
  interim.Theta = 1
  se.method = 1
  first.item.selection = 1; first.at.theta = 0; first.item = 1
  show.Theta.Audit.Trail = F
  plot.usage = T; plot.info = T; plot.prob = F
  add.final.theta = F
  bank.diagnosis = T
  prior.dist = 1; prior.mean = 0; prior.sd = 1
  theta <- seq(minTheta, maxTheta, inc)
  nq = length(theta)
  if (first.item.selection==2 && first.at.theta>=minTheta && first.at.theta<=maxTheta) {
    start.theta <- first.at.theta
  } else start.theta <- prior.mean
  if (prior.dist==1) {
    prior <- dnorm((theta-prior.mean)/prior.sd)
  } else if (prior.dist==2) {
    prior <- exp((theta-prior.mean)/prior.sd)/(1+exp((theta-prior.mean)/prior.sd))^2
  } else prior <- dnorm(theta)
  if (n/runs < 50 || runs == 1) {
    if (n/10 < 1) warning('Sample size cannot accommodate requested runs. Assuming runs = 1')
    runs = 1; n_sub = n
  }
  else if (is.wholenumber(n/runs) == FALSE) {
    warning('Sample size is not divisible by requested # of runs. Assuming runs = 1')
    runs = 1; n_sub = n
  }
  else {
    runs = runs; n_sub = n / runs
  }
  sub = sample(rep(1:runs, n_sub), n, replace=F)  # randomly shuffles sub
  if (is.null(data) == FALSE) tot.resp.data = cbind(tot.resp.data, sub)
  sem.cat = mat.or.vec(runs,1); rel.cat = sem.cat; bias.cat = sem.cat
  rms.cat = sem.cat; rmsd.cat = sem.cat
  mean.use = sem.cat; median.use = sem.cat; percent.use = sem.cat; full_cat_diff = sem.cat
  case.use.sum.min = sem.cat; case.use.sum.max = sem.cat
  rCAT.FULL = sem.cat; rFULL.SIM = sem.cat; rCAT.SIM = sem.cat
  mean.skew = sem.cat; mean.kurt = sem.cat
  median.skew = sem.cat; median.kurt = sem.cat
  skew.b = sem.cat; kurt.b = sem.cat
  skew.p = sem.cat; kurt.p = sem.cat
  r_square = sem.cat; r_square.p = sem.cat
  rCAT.SUM = sem.cat; rFULL.SUM = sem.cat
  rSIM.SUM = sem.cat; rSUM.FULLSUM = sem.cat
  alphas = sem.cat; ctt.avg_r = sem.cat
  ITEMS = vector("list", runs)
  theta_sim = mat.or.vec(n_sub, runs)
  parms_means = mat.or.vec(runs, (maxCat+2))
  parms_obs_means = parms_means
  parms_obs_medians = parms_means
  for (x in 1:runs) {
    if (is.null(data) == TRUE) {
      sim.output = simulateCatData(n_sub, ni, bcond, maxCat, m, ndim, A, B, R)
      resp.data = sim.output[['Data']]
      sim.theta = sim.output[['m']]  # "true" theta
      if (sim.output[['NDIM']] > 1) {
        A = rowSums(sim.output[['A']])
      } else A = sim.output[['A']]
      item.par = data.frame(A, sim.output[['B']], sim.output[['NCAT']])
      colnames(resp.data) <- c(paste("R",1:ni,sep=""))
      colnames(item.par) <- c("a", paste("cb",1:(maxCat-1),sep=""), "NCAT")
      if (suppress == F) print(paste("Evaluating simulation ",x,sep=""))
      ctt.sum = rowSums(resp.data)
      if (ndim == 1) {
        theta_sim[,x] = sim.theta
      } else theta_sim = NULL
    }
    else if (is.null(data) == FALSE) {
      resp.data = subset(tot.resp.data, sub == x, -sub)
      if (suppress == F) print(paste("Evaluating sample ",x,sep=""))
    }
    nExaminees <- dim(resp.data)[1]
    items.used <- matrix(NA, nExaminees, maxNI)
    selected.item.resp <- matrix(NA, nExaminees, maxNI)
    ni.administered <- numeric(nExaminees)
    theta.CAT <- rep(NA, nExaminees)
    sem.CAT <- rep(NA, nExaminees)
    theta.history <- matrix(NA, nExaminees, maxNI)
    se.history <- matrix(NA, nExaminees, maxNI)
    posterior.matrix <- matrix(NA, nExaminees, nq)
    LH.matrix <- matrix(NA, nExaminees, nq)
    NCAT <- item.par[,"NCAT"]
    DISC <- item.par[,"a"]
    CB <- item.par[paste("cb",1:(maxCat-1),sep="")]
    matrix.prob.info <- prep.prob.info(model, D, nq, ni, maxCat, DISC, CB, NCAT, theta)
    pp <- matrix.prob.info$pp
    matrix.info <- matrix.prob.info$matrix.info
    ext.theta <- calcFullLengthEAP(pp, prior, nExaminees, nq, ni, theta, resp.data)
    calcInfo <- function(th) {
      info <- numeric(ni)
      if (model==1) {
        for (i in 1:ni) {
          if (items.available[i]==TRUE) {
            ps <- numeric(NCAT[i]+1)
            ps[1] <- 1
            ps[NCAT[i]+1] <- 0
            for (k in 1:(NCAT[i]-1)) {
              ps[k+1] <- 1/(1+exp(-D*DISC[i]*(th-CB[i,k])))
            }
            prob <- numeric(NCAT[i])
            for (k in 1:NCAT[i]) {
              prob[k] <- ps[k]-ps[k+1]
              info[i] <- info[i]+(D*DISC[i]*(ps[k]*(1-ps[k])-ps[k+1]*(1-ps[k+1])))^2/prob[k]
            }
          }
        }
      }
      else if (model==2) {
        for (i in 1:ni) {
          if (items.available[i]==TRUE) {
            zz <- numeric(NCAT[i])
            sdsum <- 0; den <- 0
            cb <- unlist(CB[i,]); cb <- c(0,cb)
            for (k in 1:(NCAT[i])) {
              sdsum <- sdsum+cb[k]
              zz[k] <- exp(D*DISC[i]*(k*th-sdsum))
              den <- den+zz[k]
            }
            AX <- 0; BX <- 0
            prob <- numeric(NCAT[i])
            for (k in 1:NCAT[i]) {
              prob[k] <- zz[k]/den
              AX <- AX+k^2*prob[k]
              BX <- BX+k*prob[k]
            }
            info[i] <- D^2*DISC[i]^2*(AX-BX^2)
          }
        }
      }
      return(info)
    }
    calc.Loc.info <- function(th) {
      info <- numeric(ni)
      avg <- function(...) mean(..., na.rm=T)
      loc <- apply(CB, 1, avg)
      for (i in 1:ni) {
        if (items.available[i]==TRUE) {
          p <- 1/(1+exp(-D*DISC[i]*(th-loc[i])))
          q <- 1-p
          info[i] <- D^2*DISC[i]^2*p*q
        }
      }
      return(info)
    }
    calc.LW.info <- function(lk) {
      info <- numeric(ni)
      info <- apply(matrix.info*lk, 2, sum)
      info[items.available==FALSE] <- 0
      return(info)
    }
    calc.PW.info <- function(pos) {
      info <- numeric(ni)
      info <- apply(matrix.info*pos, 2, sum)
      info[items.available==FALSE] <- 0
      return(info)
    }
    calc.Expected.Info <- function(pos, current.theta) {
      info <- numeric(ni)
      for (i in 1:ni) {
        if (items.available[i]==TRUE) {
          ncat <- NCAT[i]
          a <- DISC[i]
          cb <- unlist(CB[i,])
          EAP.k <- numeric(ncat)
          wt <- numeric(ncat)
          for (k in 1:ncat) {
            posterior.k <- pos*pp[,i,k]
            wt[k] <- sum(posterior.k)
            EAP.k[k] <- sum(posterior.k*theta)/sum(posterior.k)
          }
          wt <- wt/sum(wt)
          if (model==1) {
            ps <- numeric(ncat+1)
            ps[1] <- 1
            ps[ncat+1] <- 0
            for (r in 1:ncat) {
              info.r <- 0
              prob <- numeric(ncat)
              for (k in 1:(ncat-1)) {
                ps[k+1] <- 1/(1+exp(-D*a*(EAP.k[r]-cb[k])))
              }
              for (k in 1:ncat) {
                prob[k] <- ps[k]-ps[k+1]
                info.r <- info.r+(D*a*(ps[k]*(1-ps[k])-ps[k+1]*(1-ps[k+1])))^2/prob[k]
              }
              info[i] <- info[i]+wt[r]*info.r
            }
          }
          else if (model==2) {
            cb <- c(0,cb)
            for (r in 1:ncat) {
              prob <- numeric(ncat)
              zz <- numeric(ncat)
              sdsum <- 0; den <- 0
              for (k in 1:ncat) {
                sdsum <- sdsum+cb[k]
                zz[k] <- exp(D*DISC[i]*(k*EAP.k[r]-sdsum))
                den <- den+zz[k]
              }
              AX <- 0; BX <- 0
              for (k in 1:ncat) {
                prob[k] <- zz[k]/den
                AX <- AX+k^2*prob[k]
                BX <- BX+k*prob[k]
              }
              info.r <- D^2*DISC[i]^2*(AX-BX^2)
              info[i] <- info[i]+wt[r]*info.r
            }
          }
        }
      }
      return(info)
    }
    calc.Expected.Var <- function(pos, current.theta) {
      epv <- numeric(ni)
      for (i in 1:ni) {
        if (items.available[i]==TRUE) {
          ncat <- NCAT[i]
          wt <- numeric(ncat)
          EAP.k <- numeric(ncat)
          for (k in 1:ncat) {
            posterior.k <- pos*pp[,i,k]
            EAP.k[k] <- sum(posterior.k*theta)/sum(posterior.k)
            wt[k] <- sum(posterior.k)
            epv[i] <- epv[i]+wt[k]*sum(posterior.k*(theta-EAP.k[k])^2)/sum(posterior.k)
          }
          epv[i] <- 1/epv[i]^2
        }
      }
      return(epv)
    }
    calc.Expected.PW.Info <- function(pos, current.theta) {
      info <- numeric(ni)
      for (i in 1:ni) {
        if (items.available[i]==TRUE) {
          ncat <- NCAT[i]
          wt <- numeric(ncat)
          info.i <- matrix.info[,i]
          info.k <- numeric(ncat)
          for (k in 1:ncat) {
            posterior.k <- pos*pp[,i,k]/sum(pos*pp[,i,k])
            info.k[k] <- sum(info.i*posterior.k)
            wt[k] <- sum(pos*pp[,i,k])
          }
          wt <- wt/sum(wt)
          info[i] <- sum(info.k*wt)
        }
      }
      return(info)
    }
    prep.info.ext.theta <- function() {
      pp <- matrix(0, nExaminees, maxCat)
      matrix.info <- matrix(0, nExaminees, ni)
      if (model==1) {
        for (i in 1:ni) {
          ps <- matrix(0, nExaminees, NCAT[i]+1)
          ps[,1] <- 1
          ps[,NCAT[i]+1] <- 0
          for (k in 1:(NCAT[i]-1)) {
            ps[,k+1] <- 1/(1+exp(-D*DISC[i]*(ext.theta[[1]]-CB[i,k])))
          }
          pp[,1] <- 1-ps[,1]
          pp[,NCAT[i]] <- ps[,NCAT[i]]
          for (k in 1:NCAT[i]) {
            pp[,k] = ps[,k]-ps[,k+1]
            matrix.info[,i] <- matrix.info[,i]+(D*DISC[i]*(ps[,k]*(1-ps[,k])-ps[,k+1]*(1-ps[,k+1])))^2/pp[,k]
          }
        }
      }
      else if (model==2) {
        for (i in 1:ni) {
          cb <- unlist(CB[i,]); cb <- c(0,cb)
          zz <- matrix(0, nExaminees, NCAT[i])
          sdsum <- 0
          den <- rep(0, nExaminees)
          for (k in 1:NCAT[i]) {
            sdsum <- sdsum+cb[k]
            zz[,k] <- exp(D*DISC[i]*(k*ext.theta[[1]]-sdsum))
            den <- den+zz[,k]
          }
          AX <- rep(0, nExaminees); BX <- rep(0, nExaminees)
          for (k in 1:NCAT[i]) {
            pp[,k] <- zz[,k]/den
            AX <- AX+k^2*pp[,k]
            BX <- BX+k*pp[,k]
          }
          matrix.info[,i] <- D^2*DISC[i]^2*(AX-BX^2)
        }
      }
      return(matrix.info)
    }
    select.maxInfo <- function() {
      if (ni.available >= topN) {
        item.selected <- info.index[sample(topN,1)]
      }
      else if (ni.available > 0) {
        item.selected <- info.index[sample(ni.available,1)]
      }
      return(item.selected)
    }
    calcSE <- function(examinee, ngiven, th) {
      info <- 0
      if (model==1) {
        for (i in 1:ngiven) {
          itm <- items.used[examinee,i]
          ps <- numeric(NCAT[itm]+1)
          ps[1] <- 1; ps[NCAT[itm]+1] <- 0
          for (k in 1:(NCAT[itm]-1)) {
            ps[k+1] <- 1/(1+exp(-D*DISC[itm]*(th-CB[itm,k])))
          }
          prob <- numeric(NCAT[itm])
          for (k in 1:NCAT[itm]) {
            prob[k] <- ps[k]-ps[k+1]
            info <- info+(D*DISC[itm]*(ps[k]*(1-ps[k])-ps[k+1]*(1-ps[k+1])))^2/prob[k]
          }
        }
      }
      else if (model==2) {
        info <- 0
        for (i in 1:ngiven) {
          itm <- items.used[examinee,i]
          cb <- unlist(CB[itm,]); cb <- c(0,cb)
          zz <- numeric(NCAT[itm])
          sdsum <- 0; den <- 0
          for (k in 1:NCAT[itm]) {
            sdsum <- sdsum+cb[k]
            zz[k] <- exp(D*DISC[itm]*(k*th-sdsum))
            den <- den+zz[k]
          }
          AX <- 0; BX <- 0
          prob <- numeric(NCAT[itm])
          for (k in 1:NCAT[itm]) {
            prob[k] <- zz[k]/den
            AX <- AX+k^2*prob[k]
            BX <- BX+k*prob[k]
          }
          info <- info+D^2*DISC[itm]^2*(AX-BX^2)
        }
      }
      SEM <- 1/sqrt(info)
      return(SEM)
    }
    calcEAP <- function(examinee, ngiven) {
      LH <- rep(1, nq)
      for (i in 1:ngiven) {
        item <- items.used[examinee,i]
        resp <- resp.data[examinee, paste("R",item,sep="")]
        prob <- pp[,item,resp]
        LH <- LH*prob
      }
      posterior <- prior*LH
      EAP <- sum(posterior*theta)/sum(posterior)
      if (se.method==1) {
        SEM <- sqrt(sum(posterior*(theta-EAP)^2)/sum(posterior))
      } else if (se.method==2) {
        SEM <- calcSE(examinee, ngiven, EAP)
      }
      return(list(THETA=EAP, SEM=SEM, LH=LH, posterior=posterior))
    }
    if (selection.method==3) {  # select by maximum posterior-weighted information
      for (j in 1:nExaminees) {
        critMet <- FALSE
        items.available <- rep(TRUE, ni)
        items.available[is.na(resp.data[j,paste("R",1:ni,sep="")])] <- FALSE
        max.to.administer <- ifelse(sum(items.available)<=maxNI, sum(items.available), maxNI)
        ni.given <- 0
        if (first.item.selection==4) theta.current <- ext.theta$theta[j]
        else theta.current <- start.theta
        posterior <- prior
        while (critMet==FALSE && ni.given<max.to.administer) {
          array.info <- calc.PW.info(posterior)
          ni.available <- sum(array.info>0)
          info.index <- rev(order(array.info))
          item.selected <- select.maxInfo()
          if (ni.given==0) {
            if (first.item.selection==3 && first.item>=1 && first.item<=ni) {
              if (items.available[first.item]==TRUE) {
                item.selected <- first.item
              }
            }
            else if (first.item.selection==2 || first.item.selection==4) {
              array.info <- calcInfo(theta.current)
              info.index <- rev(order(array.info))
              item.selected <- select.maxInfo()
            }
          }
          resp <- resp.data[j,paste("R",item.selected,sep="")]
          prob <- pp[,item.selected,resp]
          posterior <- posterior*prob
          ni.given <- ni.given+1
          items.used[j,ni.given] <- item.selected
          items.available[item.selected] <- FALSE
          selected.item.resp[j,ni.given] <- resp.data[j,paste("R",item.selected,sep="")]
          estimates <- calcEAP(j, ni.given)
          theta.history[j,ni.given] <- estimates$THETA
          se.history[j,ni.given] <- estimates$SEM
          theta.current <- estimates$THETA
          if (ni.given>=max.to.administer || (estimates$SEM<=maxSE && ni.given>=minNI)) {
            critMet <- TRUE
            theta.CAT[j] <- estimates$THETA
            sem.CAT[j] <- estimates$SEM
            LH.matrix[j,] <- estimates$LH
            posterior.matrix[j,] <- estimates$posterior
            ni.administered[j] <- ni.given
          }
        }
        if (show.Theta.Audit.Trail) plot.theta.audit.trail()
      }
    }
    if (selection.method==5) {  # select by minimum expected posterior variance
      for (j in 1:nExaminees) {
        critMet <- FALSE
        items.available <- rep(TRUE, ni)
        items.available[is.na(resp.data[j,paste("R",1:ni,sep="")])] <- FALSE
        max.to.administer <- ifelse(sum(items.available)<=maxNI, sum(items.available), maxNI)
        ni.given <- 0
        if (first.item.selection==4) theta.current <- ext.theta$theta[j]
        else theta.current <- start.theta
        posterior <- prior
        while (critMet==FALSE && ni.given<max.to.administer) {
          array.info <- calc.Expected.Var(posterior, theta.current)
          ni.available <- sum(array.info>0)
          info.index <- rev(order(array.info))
          item.selected <- select.maxInfo()
          if (ni.given==0) {
            if (first.item.selection==3 && first.item>=1 && first.item<=ni) {
              if (items.available[first.item]==TRUE) {
                item.selected <- first.item
              }
            }
            else if (first.item.selection==2 || first.item.selection==4) {
              array.info <- calcInfo(theta.current)
              info.index <- rev(order(array.info))
              item.selected <- select.maxInfo()
            }
          }
          resp <- resp.data[j,paste("R",item.selected,sep="")]
          prob <- pp[,item.selected,resp]
          posterior <- posterior*prob
          ni.given <- ni.given+1
          items.used[j,ni.given] <- item.selected
          items.available[item.selected] <- FALSE
          selected.item.resp[j,ni.given] <- resp
          estimates <- calcEAP(j, ni.given)
          theta.history[j,ni.given] <- estimates$THETA
          se.history[j,ni.given] <- estimates$SEM
          theta.current <- estimates$THETA
          if (ni.given>=max.to.administer || (estimates$SEM<=maxSE && ni.given>=minNI)) {
            critMet <- TRUE
            theta.CAT[j] <- estimates$THETA
            sem.CAT[j] <- estimates$SEM
            LH.matrix[j,] <- estimates$LH
            posterior.matrix[j,] <- estimates$posterior
            ni.administered[j] <- ni.given
          }
        }
        if (show.Theta.Audit.Trail) plot.theta.audit.trail()
      }
    }
    colnames(items.used) <- paste("Item", seq(1:maxNI), sep="")
    final.theta.se <- as.data.frame(cbind(theta.CAT, sem.CAT))
    colnames(final.theta.se) <- c("Theta","SEM")
    # THE REST OF THIS IS STUDY-SPECIFIC USER (AUTHOR) CODE:
    require(psych)
    skew_kurt = describe(resp.data)[,c('skew','kurtosis')]
    itmiss = data.frame(table(items.used)); blank = data.frame(items.used=1:ni)
    itmissn = merge(blank, itmiss, by="items.used", all.x=T)
    itmissn$Freq = ifelse(is.na(itmissn$Freq)==T, 0, itmissn$Freq)
    tot = data.frame(itmissn, skew_kurt)[-1]
    reg = summary(lm(Freq ~ skew + kurtosis, data = tot))
    skew.b[x] = round(reg$coef['skew',1],3)
    skew.p[x] = round(reg$coef['skew',4],3)
    kurt.b[x] = round(reg$coef['kurtosis',1],3)
    kurt.p[x] = round(reg$coef['kurtosis',4],3)
    r_square[x] = round(reg$r.squared,3)
    r_square.p[x] = round(pf(reg$fstatistic[1],reg$fstatistic[2],reg$fstatistic[3],lower.tail=F),3)
    frequency = t(tot['Freq'])
    all_parms = data.frame(tot, DISC, CB)
    parms_means[x,] = (frequency %*% as.matrix(all_parms[-1]))/sum(frequency)
    parms_obs_means[x,] = colMeans(all_parms[-1])
    parms_obs_medians[x,] = apply(all_parms[-1], 2, median)
    if (is.null(data) == FALSE) {
      ITEMS[[x]] = rbind(all_parms[order(-all_parms$Freq),],
                         MEAN = c(NA, parms_means[x,]),
                         OBS_MEAN = c(NA, parms_obs_means[x,]),
                         OBS_MEDIAN = c(NA, parms_obs_medians[x,]))
    }
    rCAT.FULL[x] = round(cor(ext.theta$theta, final.theta.se$Theta)^2, 3)
    case.use.sum.min[x] = min(rowSums(!is.na(items.used)))
    case.use.sum.max[x] = max(rowSums(!is.na(items.used)))
    mean.use[x] = round(mean(rowSums(!is.na(items.used))),0)
    median.use[x] = round(median(rowSums(!is.na(items.used))),0)
    percent.use[x] = round((median.use[x]/ni)*100,2)
    full_var_prop = 100/ni; cat_var_prop = (rCAT.FULL[x]*100)/median.use[x]
    full_cat_diff[x] = round(cat_var_prop - full_var_prop,2)  # gain per item for CAT
    if (full_cat_diff[x] < 0) full_cat_diff[x] = 0
    sem.cat[x] = round(mean(sem.CAT),3)
    rel.cat[x] = round((1 - mean(final.theta.se$SEM)^2),3)
    r_items = mat.or.vec(median.use[x],1)
    if (is.null(data) == TRUE) {
      rCAT.SIM[x] = mean(round(cor(sim.theta, final.theta.se$Theta)^2,3))
      rFULL.SIM[x] = mean(round(cor(ext.theta$theta, sim.theta)^2,3))
      bias.cat[x] = round(mean(sim.theta - final.theta.se$Theta),3)
      rms.cat[x] = round(sqrt(sum(final.theta.se$Theta^2)/n),3)
      rmsd.cat[x] = round(sqrt(sum((sim.theta-final.theta.se$Theta)^2)/n),3)
      for (i in 1:median.use[x]) {
        r_items[i] = r[i,1]
      }
      ctt.data = resp.data[,r_items]
      sum.s = rowSums(ctt.data)
      alphas[x] = round(alpha(ctt.data)[[1]]$raw_alpha,3)
      ctt.avg_r[x] = round(alpha(ctt.data)[[1]]$average_r,3)
      rCAT.SUM[x] = mean(round(cor(final.theta.se$Theta, sum.s)^2,3))
      rFULL.SUM[x] = mean(round(cor(ext.theta$theta, sum.s)^2,3))
      rSIM.SUM[x] = mean(round(cor(sim.theta, sum.s)^2,3))
      rSUM.FULLSUM[x] = mean(round(cor(sum.s, ctt.sum)^2,3))
    }
    else {
      for (i in 1:median.use[x]) {
        r_items[i] = r[i,1]
      }
      ctt.data = resp.data[,r_items]
      sum.s = rowSums(ctt.data)
      alphas[x] = round(alpha(ctt.data)[[1]]$raw_alpha,3)
    }
  }  # closes 'for' loop defining runs
  sk1 = data.frame(rbind(W.Mean = round(colMeans(parms_means),3),
                         UW.Mean = round(colMeans(parms_obs_means),3),
                         UW.Median = round(colMeans(parms_obs_medians),3)))
  sk2 = data.frame(SKEW.b = skew.b, SKEW.p = skew.p, KURT.b = kurt.b,
                   KURT.p = kurt.p, R2 = r_square, R2.p = r_square.p)
  gain = data.frame(CAT.GAIN = full_cat_diff, ITEM_MU = mean.use,
                    ITEM_MED = median.use, MIN = case.use.sum.min,
                    MAX = case.use.sum.max, ITEM_FULL = ni, PERCENT = percent.use)
  if (is.null(data) == TRUE) {
    rMAT = data.frame(CAT.EST = rCAT.FULL, SUM.EST = rFULL.SUM,
                      EST.DIFF = round(rCAT.FULL-rFULL.SUM,3),
                      CAT.SIM = rCAT.SIM, SUM.SIM = rSIM.SUM,
                      SIM.DIFF = round(rCAT.SIM-rSIM.SUM,3),
                      CAT.SUM = rCAT.SUM, EST.SIM = rFULL.SIM,
                      SUM.FULL = rSUM.FULLSUM)
    error = data.frame(SEM = sem.cat, BIAS = bias.cat, RMS = rms.cat, RMSD = rmsd.cat)
    reliab = data.frame(CAT.REL = rel.cat, CTT.ALPHAS = alphas, CTT.AVG_R = ctt.avg_r)
    summ = data.frame(N_TOTAL = n, RUNS = runs, N_SAMPLES = n_sub, B_COND = bcond,
                      N_DIM.SIM = ndim, R_COND = min(R),
                      SEL.METHOD = selection.method, DATA.TYPE = "Simulated Data")
    if (suppress == F) print('Negative values for (EST.DIFF) and (SIM.DIFF) indicate higher correlations for CTT')
  }
  else {
    rMAT = data.frame(CAT.EST = rCAT.FULL)
    error = data.frame(SEM = sem.cat)
    reliab = data.frame(CAT.REL = rel.cat, CTT.REL = alphas)
    summ = data.frame(N_TOTAL = n, RUNS = runs, N_SAMPLES = n_sub,
                      SEL.METHOD = selection.method, DATA.TYPE = "Real Data")
  }
  if (runs > 1) {
    if (suppress == F) {
      rMAT = data.frame(rbind(rMAT, Mean = round(colMeans(rMAT),3),
                              Median = round(apply(rMAT,2,median),3),
                              Discrepancy = round(colMeans(rMAT),3) - round(apply(rMAT,2,median),3)))
      error = data.frame(rbind(error, Mean = round(colMeans(error),3),
                               Median = round(apply(error,2,median),3),
                               Discrepancy = round(colMeans(error),3) - round(apply(error,2,median),3)))
      reliab = data.frame(rbind(reliab, Mean = round(colMeans(reliab),3),
                                Median = round(apply(reliab,2,median),3),
                                Discrepancy = round(colMeans(reliab),3) - round(apply(reliab,2,median),3)))
      gain = data.frame(rbind(gain, Mean = round(colMeans(gain),3),
                              Median = round(apply(gain,2,median),3),
                              Discrepancy = round(colMeans(gain),3) - round(apply(gain,2,median),3)))
      sk2 = data.frame(rbind(sk2, Mean = round(colMeans(sk2),3),
                             Median = round(apply(sk2,2,median),3),
                             Discrepancy = round(colMeans(sk2),3) - round(apply(sk2,2,median),3)))
    }
    else if (suppress == T) {
      rMAT = data.frame(rbind(Mean = round(colMeans(rMAT),3),
                              Median = round(apply(rMAT,2,median),3),
                              Discrepancy = round(colMeans(rMAT),3) - round(apply(rMAT,2,median),3)))
      error = data.frame(rbind(Mean = round(colMeans(error),3),
                               Median = round(apply(error,2,median),3),
                               Discrepancy = round(colMeans(error),3) - round(apply(error,2,median),3)))
      reliab = data.frame(rbind(Mean = round(colMeans(reliab),3),
                                Median = round(apply(reliab,2,median),3),
                                Discrepancy = round(colMeans(reliab),3) - round(apply(reliab,2,median),3)))
      gain = data.frame(rbind(Mean = round(colMeans(gain),3),
                              Median = round(apply(gain,2,median),3),
                              Discrepancy = round(colMeans(gain),3) - round(apply(gain,2,median),3)))
      sk2 = data.frame(rbind(Mean = round(colMeans(sk2),3),
                             Median = round(apply(sk2,2,median),3),
                             Discrepancy = round(colMeans(sk2),3) - round(apply(sk2,2,median),3)))
    }
  }
  colnames(sk1) <- c("Skew","Kurtosis","A",paste("B",1:(maxCat-1),sep=""))
  output = list(SUMMARY = summ, CORRELATIONS = rMAT, RELIABILITY = reliab,
                ERROR = error, GAIN = gain, SK_REG = sk2, ITEM.SUMMARY = sk1)
  if (runs == 1) {
    output = list(SUMMARY = summ, CORRELATIONS = rMAT, RELIABILITY = reliab,
                  ERROR = error, GAIN = gain, SK_REG = sk2, ITEM.SUMMARY = ITEMS[[1]])
  }
  if (outp == T) print(output)
  endtime = Sys.time()
  message(paste("End Time: ", endtime, sep=""))
  if (is.null(data) == FALSE) {
    if (runs == 1) out = list(OUTPUT = output, EST_THETA = ext.theta$theta,
                              CAT_THETA = final.theta.se$Theta, CTT = sum.s)
    else if (runs > 1) out = list(ITEMS, EST_THETA = ext.theta, OUTPUT = output,
                                  TIME = endtime - begtime)
    invisible(out)
  }
  else {
    if (is.null(true.theta) == FALSE) {
      if (suppress == F) print('Provided theta (m) was used as simulated theta for comparisons')
    }
    if (is.null(acond)==T) acond = "NA"
    summ.mod = data.frame(N_ITEMS = ni, A_COND = acond, B_COND = bcond,
                          N_RESP = maxCat, N_DIM.SIM = ndim, R_COND = min(R))
    if (outp == T) {
      if (ndim == 1) {
        colnames(theta_sim) <- paste("simTheta",1:runs,sep="")
        out = list(SIMULATED_THETA = theta_sim, EST_THETA = ext.theta,
                   OUTPUT = output, TIME = endtime - begtime)
      }
      else out = list(EST_THETA = ext.theta, OUTPUT = output, TIME = endtime - begtime)
    }
    else {
      rMAT = rMAT[1,]
      reliab = reliab[1,]
      error = error[1,]
      gain = gain[1,]
      ska = sk1[1,]
      skb = sk1[2,]
      out = data.frame(summ.mod, rMAT, reliab, error, gain, ska, skb)
    }
    invisible(out)
  }
}
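To show how these pieces were meant to fit together, a brief usage sketch follows. It is an added illustration, not part of the original appendix; the argument values simply mirror one Study 2 condition (48 items, five response options, beta condition 1, one dimension), and all other settings follow the defaults in the function signatures above.

# Hypothetical usage sketch (added; not in the original listing)
library(MASS)   # mvrnorm(), used by simulateCatData()
library(psych)  # alpha() and describe(), used by simulateAdaptive()

# Simulate one 48-item, 5-category, unidimensional data set:
sim <- simulateCatData(n = 5000, ni = 48, bcond = 1, maxCat = 5, ndim = 1)
str(sim$Data)

# Run the CAT simulation (10 runs of 500 examinees each, SEM stopping rule of
# .37, selection.method = 3, i.e. maximum posterior-weighted information):
res <- simulateAdaptive(n = 5000, ni = 48, bcond = 1, maxCat = 5, ndim = 1,
                        maxSE = .37, selection.method = 3, runs = 10,
                        suppress = TRUE)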