ROBUST PRINCIPAL COMPONENTS ANALYSIS: IMPROVING INFERENCE ABOUT LARGEST EIGENVALUES IN MULTIVARIATE DATA ANALYSIS USING RESISTANT STATISTICAL METHODS AND BOOTSTRAP RESAMPLING TECHNIQUES

by
Francine L. Fidler

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Psychology)

December 1994
Copyright 1994 Francine L. Fidler

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089-4015

This dissertation, written by Francine L. Fidler under the direction of her Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

DEDICATION
To the memory of my mother, Estelle King (1917-1993), who encouraged me to fulfill my dreams.

ACKNOWLEDGEMENTS
Several people deserve thanks for their assistance in the completion of this work. First, I would like to thank my committee, Dr. Norman Cliff, Dr. Mitchell Earleywine, and Dr. Michael Newcomb, for their valuable ideas and suggestions. A special thanks to Norm Cliff for his careful review of this work and his thought-provoking suggestions. I am especially grateful to Dr. Rand Wilcox, my committee chair, for his long hours of reading and his helpful suggestions. I also wish to thank him for his continued kindness, attentiveness to my research, and always being available for advice. I am grateful to all the nice people in the psychology department office. A special thanks to Pat Edson, Joan Sterling, and Esme Takahashi. They all helped in keeping my spirits high and never stopped supporting me.
I wish to thank all my fellow graduate students and colleagues for their support. Thanks to Jeff Long, Sumiko Takayanagi, Susannah Rousculp, Stuart Wugalter, Sam Posner, John Caruso, Du Feng, Steve Carter, Tara Rose, Stacy Hart, and Jennie Noll. A special thanks to Dr. Laura Baker and Dr. John Horn for including me in their research projects. I thoroughly enjoyed the experience. Finally, thank you Tracy, Artie, and Andy; this project could not have been completed without your unconditional love and support.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS
1. INTRODUCTION
   Statement of the Problem
   Robust Analogues of Principal Components Analysis
   Resistant Statistical Methods
      The Influence Function
      Suggestions for Dealing with Heavy-tails and Outliers
      L-Estimators of Location and Scale
         Winsorization Method
      M-Estimators of Location
      M-Estimators of Scale
         The Biweight Midvariance
         The Biweight Midcovariance
   Confidence Intervals for the Largest Eigenvalue
      Normal Theory
      Bootstrap Methods
2. THE PRESENT STUDY
   Goals of the Present Study
   Research Questions
3. SIMULATION STUDY
   General Method and Data Generation
   Simulation Study 1
      Design and Procedure
      Results
   Simulation Study 2
      Design and Procedure
4. SIMULATION RESULTS
   Probability Coverage
   Bias and Mean Squared Error
   Skewness and Kurtosis
   Miscellaneous Simulation Results
5. EXAMPLES USING EMPIRICAL DATA
   Results
6. DISCUSSION
7. CONCLUSIONS
   Suggestions for Further Research
REFERENCES
APPENDICES
   I. Listing of the Fortran Codes
      Program PARMBOOT
      Program BCBOOT
      Program PCAWIN
      Program PCAUCOV
      Program PCABIWGT

LIST OF TABLES
1.  Some Properties of the g-and-h Distribution
2.  Estimated Probability Coverage for the Winsorized Covariance Matrix, the Usual Covariance Matrix, and the Biweight Midcovariance Matrix Using (3)
3.  Population Values for the Largest Eigenvalue of the Usual Covariance and Biweight Midcovariance Matrices (N = 100,000)
4.  Estimated Probability Coverage for the Largest Eigenvalue Based on the Usual Covariance Using (3) and the Bootstrap for Normal (g = h = 0) and Nonnormal (g = 0, h = .2) Distributions
5.  Estimated Probability Coverage for the Largest Eigenvalue Based on the Usual Covariance with Normal (g = h = 0) and Nonnormal Distributions
6.  Estimated Probability Coverage for the Largest Eigenvalue Based on the Biweight Midcovariance with Normal (g = h = 0) and Nonnormal Distributions
7.  Estimated Bias and Mean Squared Error for the Largest Eigenvalue Based on the Usual Covariance Matrix
8.  Estimated Bias and Mean Squared Error for the Largest Eigenvalue Based on the Biweight Midcovariance Matrix
9.  Skewness and Kurtosis for the Largest Eigenvalue of the g-and-h Distribution (Hoaglin, 1983) Based on the Usual Covariance
10. Skewness and Kurtosis for the Largest Eigenvalue of the g-and-h Distribution (Hoaglin, 1983) Based on the Biweight Midcovariance
11. Results from PCA Using Empirical Data Sets

LIST OF FIGURES
1.  Boxplots of the Distribution of the Largest Eigenvalue for the Usual Covariance, Biweight Midcovariance, and Winsorized Covariance Under Normality (λ1 = 2.5, r = .5)
2.  Boxplots of the Distribution of the Largest Eigenvalue for 1,000 Replications as Compared to the Averaged B = 299 for Each Replication of the Usual Covariance Assuming Normality

ABSTRACT
Within the context of principal components analysis (PCA), one issue that arises is making inferences about the largest eigenvalue (EVAL). Assuming normality, a confidence interval for this quantity is available. One goal of this study is to examine the accuracy of the confidence interval when distributions are nonnormal. It was found that even slight departures from normality result in unsatisfactory probability coverage. Another goal is computing a confidence interval when the usual sample covariance matrix is replaced by robust analogues based on the biweight midcovariance matrix and the Winsorized covariance matrix. The issue of using robust analogues is important because published results indicate that PCA is sensitive to outliers and heavy-tailed distributions, which are common among psychometric measures. Simulations were used to study the accuracy of the probability coverage for the largest EVAL under normality and nonnormality. The number of subjects and the population correlation values were varied. For each situation, 1,000 replications were used to estimate the accuracy of the probability coverage, bias, and mean squared error. A bootstrap method was found which gives good results, but when this same bootstrap procedure was applied to the usual covariance matrix, confidence intervals were less satisfactory.

CHAPTER 1
INTRODUCTION

Statement of the Problem
One goal of interest within the context of principal components analysis (PCA) is making inferences about the largest eigenvalue, and a confidence interval has been derived assuming normality (Anderson, 1963). One purpose of this paper is to examine the accuracy of that confidence interval, especially under violations of normality. Another goal is computing a confidence interval when the usual covariance matrix is replaced with robust analogues based on the Winsorized covariance and the biweight midcovariance. This study examined the distribution of the largest eigenvalue only.

The problem addressed here is twofold. First, inference from PCA requires that the data are normally distributed. The research will examine the implications of inferences from PCA when the distributions are nonnormal (e.g., heavy tails and outliers). Will 95% confidence intervals based on normal theory (Anderson, 1963) provide accurate probability coverage when the data are nonnormal? Second, the research will employ several robust analogues of the usual covariance matrix in PCA in various situations.
The percentile bootstrap procedure (Efron, 1982) will be used to compute the probability coverage for the distribution of the largest eigenvalue. Is the percentile bootstrap effective when used to answer these types of research questions? Under what circumstances do the resistant measures of dispersion suggested in this study provide more accurate probability coverage for both normal and nonnormal distributions?

Robust Analogues of Principal Components Analysis
The use of robust methods in principal components analysis is an important problem because it is known that the conventional approach to PCA is not robust when the assumption of normality is violated (Ammann, 1989; Chen, Gnanadesikan, & Kettenring, 1974; Gnanadesikan & Kettenring, 1972; Gnanadesikan, 1977; Naga & Antille, 1990; Tanaka, 1988; Tanaka, 1989). In particular, outliers and heavy-tailed distributions, even under slight departures from normality, can distort the results obtained from PCA. Moreover, outliers and heavy-tailed distributions appear to be common among many psychometric measures (Micceri, 1989; Sawilowsky & Blair, 1990; Wilcox, 1990). The usual method of PCA is explained below, followed by an overview of the research conducted using robust methods with PCA.

Principal components analysis is an exploratory data reduction technique designed to construct linear combinations (factors) of the original variables that account for as much of the total variance as possible. This widely used multivariate statistical method involves the eigenvalue decomposition of the covariance matrix as a dimension-reduction technique (Cliff, 1987). Principal components analysis is used in various areas of research. For example, psychologists regularly administer multi-item questionnaires resulting in many variables, some of which provide redundant information. In cases like this, PCA can be used quite effectively with correlated variables to obtain a smaller set which maximizes the correlation with the original set of variables. Similarly, psychophysiologists use PCA to reduce the dimensionality of EEG data as a signal-to-noise amplification device (Coles, Gratton, & Fabiani, 1990). Another important application of PCA is to aid in the classification of psychological symptoms into a systematic description of mental disorders (DSM-III-R, 1987). Finally, PCA is used in regression analysis to deal with multicollinearity problems (i.e., imprecision in parameter estimates due to highly correlated independent variables) (Dunteman, 1987).

The methodology for PCA originated with Karl Pearson (1901) as a way of fitting planes to data by orthogonal least squares. He described a robust approach to PCA by fitting a sequence of mutually orthogonal hyperplanes and replacing the criterion of minimizing the sum of squares of perpendicular deviations of the observations from the fitted planes with a more robust criterion. The approach was never tested by Pearson (1901), but Gerig (1974) found it to involve more computationally complex algorithms than those involved in the least squares method, which is not very robust to nonnormality. Hotelling (1933) reintroduced PCA for the specific purpose of analyzing covariance and correlation matrices. Subsequently, researchers began to study the large-sample properties of principal components (i.e., eigenvalues) using techniques such as the Eckart-Young decomposition of data matrices (Morrison, 1990), which will be described later.
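As a concrete illustration of the conventional (nonrobust) computation just described, the following is a minimal sketch in Python rather than the Fortran-77 used for the dissertation's programs; the function and variable names are illustrative only.

```python
import numpy as np

def largest_principal_component(data):
    """Conventional PCA step: eigendecomposition of the usual sample
    covariance matrix; returns the largest eigenvalue and its eigenvector."""
    cov = np.cov(data, rowvar=False)        # p x p sample covariance matrix
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    return evals[-1], evecs[:, -1]

# Illustrative use: n = 100 observations on p = 4 correlated normal variables.
rng = np.random.default_rng(0)
sigma = np.full((4, 4), 0.5) + 0.5 * np.eye(4)   # common correlation .5
x = rng.multivariate_normal(np.zeros(4), sigma, size=100)
l1, w1 = largest_principal_component(x)
```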
It has been demonstrated that PCA is very sensitive to the presence of outliers. Chen, Gnanadesikan, and Kettenring (1974) illustrated the effect of outliers in data from 29 chemical companies using 14 economic variables. They showed that removing one unusual observation from the data changed the correlation among the components from r = 0 to r = .99. Gnanadesikan (1977) was interested in certain interpretational and statistical properties associated with covariance input to PCA. He found that in the presence of outliers, a robustified covariance matrix provided more accurate estimates. Devlin, Gnanadesikan, and Kettenring (1981) used simulations to compare robust procedures for estimating a correlation matrix and its principal components. Their estimators were formed by two methods. The first was to use separate bivariate analyses which estimated each off-diagonal element of the new correlation matrix (R*) separately by a robust correlation coefficient (r*). The second method consisted of simultaneous manipulation of all the variables using iterative techniques such as multivariate trimming (MVT) and M-estimation. Devlin et al. (1981) found that both MVT and M-estimation are especially effective at estimating a correlation matrix and its principal components. On the other hand, their results suggest that M-estimators can break down relatively easily when the outliers are asymmetric and the dimensionality is large.

A somewhat different approach was taken by Li and Chen (1985) in dealing with outliers. They were interested in obtaining both rotational equivariance and a high breakdown point (e.g., 1/2) in elliptic distributions. Rotational equivariance was originally suggested by Huber (1981, p. 200): mimic an eigenvalue determination and find the direction with the smallest or largest robust variance, which leads to an approach where the variance does not change with rotation. A high breakdown point (i.e., 1/2) indicates that one-half of the observations in the data can be made unusually large or unusually small without affecting the results. Li and Chen (1985) used projection pursuit techniques to determine robust principal components. The projection pursuit method determines robust principal components first; then, the estimate-of-dispersion matrix is constructed from the estimated components to ensure that the matrix is automatically positive definite. Li and Chen (1985) found their technique to be quite robust with respect to the presence of outliers in elliptic distributions.

Ammann (1989) combined a rotation technique with an iterative reweighting of the data to obtain a robust eigenvalue decomposition of a covariance matrix (ROPRC). He found that the ROPRC algorithm was quite effective in the presence of outliers and heavy-tailed distributions. Naga and Antille (1990) illustrated that the robustness of an estimator should be considered along with the stability of the components. They computed robust principal components based on Rousseeuw's (1984) minimum volume ellipsoid (MVE) technique. The stability measure is an angular measure between the sample principal components and the population principal components, which is arrived at through a bootstrap method. The results suggest that robustness does not always improve the stability of PCA subspaces. The implication for applied research is that the researcher can choose between robust and nonrobust PCA, depending on the research question at hand.
If the research depends more on stability of subspaces than on a high finite-sample breakdown point, then a nonrobust method should be used. On the other hand, if there are many outliers, and the stability of the components is of secondary concern, robust methods should be used. The research presented in this section indicates that several resistant techniques have had mixed results, but they have been generally successful in improving the results of PCA when the data have highly skewed distributions with extremely heavy tails and outliers. The next section describes the development of resistant statistical methods and suggests the rationale for using them.

Resistant Statistical Methods
Modern robust statistical methods are concerned in part with developing estimation procedures that perform well in spite of outliers and heavy-tailed distributions. The essential requirements of robust estimators are that they be both resistant and efficient. Robustness is a general term which will be used to imply an insensitivity of an estimator to departures from assumptions regarding an underlying distribution. Resistance is a property where drastically changing a small number of observations will not affect the summary statistic (Mosteller & Tukey, 1977). Consequently, a statistic is said to be resistant if a few very unusual observations (outliers) do not have an effect on the outcome.

The finite sample breakdown point (Hodges, 1967; Hampel, 1968) of an estimator is the smallest proportion of the n observations which, when altered, can render it meaningless. More formally, it is the smallest proportion of values that can drive an estimator to infinity. For example, the mean and the variance both have a finite sample breakdown point equal to 1/n, because if the value of a single observation goes to infinity, the sample mean and variance go to infinity as well. In comparison, the sample median has a finite sample breakdown point of approximately 1/2. This means that half the values in the sample can be unusually large without affecting the value of the sample median. Therefore, the sample median is more resistant than the sample mean.

An estimator is considered to have robustness of efficiency over a range of distributions if its variance (or mean squared error) is close to the minimum for each distribution (Goodall, 1983). The relative efficiency of an estimator T is the ratio of the variance of the "best" estimator, var(BEST, F), to the variance of the estimator T, denoted var(T, F) (Rosenberger & Gasko, 1983). The estimators of location and of scale, if correctly specified, will provide more resistant estimates that are more efficient under departures from normality when dealing with heavy-tailed distributions and outliers.

Robust statistical methods have been around for hundreds of years (Stigler, 1973); however, it was only recently that asymptotic results could be examined for their level of efficiency (Pitman, 1949; Hodges & Lehmann, 1956). Tukey (1960) demonstrated the sensitivity of some classical estimators to slight changes in assumptions. Research in robust methods virtually exploded with the advent of high-speed computer technology and with the introduction of ideas from functional analysis (i.e., Huber, 1964; Hampel, 1968). Staudte and Sheather (1990) credit the work of Hampel (1968, 1986), specifically the concepts of the influence function and the breakdown point, for their universal applicability and intuitive appeal in advancing the study of robustness.
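Before turning to the influence function, the resistance contrast described above can be made concrete with a small numerical illustration (a sketch added here, not part of the original study; the values are arbitrary):

```python
import numpy as np

x = np.array([2.1, 2.4, 2.5, 2.6, 2.8, 3.0, 3.1])
x_out = np.append(x, 1000.0)                      # alter the sample with one gross outlier

print(np.mean(x), np.mean(x_out))                 # the mean is dragged toward the outlier
print(np.var(x, ddof=1), np.var(x_out, ddof=1))   # the variance explodes
print(np.median(x), np.median(x_out))             # the median barely moves (resistant)
```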
The Influence Function
The current trend in robust statistical methods is to view measures of location and scale in terms of a functional, T(F), where F is an element in the class of all probability distributions and is mapped into the real line by T(F) (Staudte & Sheather, 1990). T(F) is also quite useful because it leads to a natural nonparametric estimator T(F_n), where F_n is the empirical distribution. Because T assigns a real number to each member of a class of functions like F_n, T is called a "functional". For example, the mean can be written as

$$T(F) = \int x \, dF(x),$$

and the variance is given by

$$T(F) = \int \left( x_1 - \int x_2 \, dF(x_2) \right)^2 dF(x_1),$$

where $X_1$ and $X_2$ are random variables with distribution F. It is very important that a functional be chosen that has a well-behaved (Gateaux) derivative, to ensure that small changes in F, particularly in the tails, do not overly influence the values of T(F) (Wilcox, 1992a). The first derivative of a functional is called the influence function or influence curve (IC). To accomplish this end, IC(x) must be bounded for all possible values of x. It is well known that the IC of the population mean, μ, and variance, σ², are not bounded. For example, the mean has the influence function given by IC(x) = x − μ; therefore, as x → ∞, IC(x) → ∞ also. Consequently, as the first derivative becomes large, the changes in T(F) become large as well.

Suggestions for Dealing with Heavy-tails and Outliers
Two general approaches are frequently taken when dealing with outliers and heavy-tailed distributions in terms of measures of location and scale. The first is to make an adjustment based on a predetermined percentage of the tail of a distribution. This method is used by the trimmed and Winsorized means. The second approach uses an adjustment that is a function of the distribution. The distribution is typically unknown, so the adjustment is determined empirically. Recent research using M-estimators of location forms the basis for the second approach, which will be extended to measures of scale later in this paper.

L-Estimators of Location and Scale
Two types of estimators for both location and scale are typically used and are of interest in this study: L-estimators and M-estimators. L-estimators are linear combinations of order statistics, while M-estimators have maximum likelihood estimators as a special case and are defined below. L-estimators of location include the sample mean, median, trimmed mean, and Winsorized mean as special members.

The trimmed sample mean and the Winsorized mean are both measures of location. They are related, yet different. The trimmed mean represents a measure of location that lies between the sample mean and the median. It performs well whether or not outliers are present. Rosenberger and Gasko (1983) recommend 20% trimming; thus the trimmed mean is computed as follows: the observations are ordered and 20% of the largest and smallest observations are removed. The Winsorized mean plays an important role in the trimmed mean, in an indirect way (Wilcox, in press b). It is preferred when there are violations of assumptions. The Winsorized mean is calculated by ordering the observations and identifying the 20% upper and lower values. It differs from trimming in that the values are not removed; instead they are replaced with the "20%" trimmed values.
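The two sample computations just described can be sketched as follows (an illustrative Python sketch, not the dissertation's Fortran code; the 20% value follows Rosenberger and Gasko, 1983):

```python
import numpy as np

def trimmed_mean(x, gamma=0.20):
    """20% trimmed mean: drop the g = [n*gamma] smallest and g largest order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(gamma * len(x))                  # [n*gamma], greatest integer <= n*gamma
    return x[g:len(x) - g].mean()

def winsorized_mean(x, gamma=0.20):
    """20% Winsorized mean: extreme values are replaced, not removed."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(gamma * len(x))
    x[:g] = x[g]                             # pull the g smallest up to X_(g+1)
    x[len(x) - g:] = x[len(x) - g - 1]       # pull the g largest down to X_(n-g)
    return x.mean()
```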
Winsorization Method
The β-trimmed mean is given by the functional

$$T(F) = \frac{1}{1 - 2\beta} \int_{F^{-1}(\beta)}^{F^{-1}(1-\beta)} x \, dF(x),$$

and its influence function satisfies

$$(1 - 2\beta)\,IC(x) =
\begin{cases}
F^{-1}(\beta) - W(F), & \text{if } x < F^{-1}(\beta) \\
x - W(F), & \text{if } F^{-1}(\beta) \le x \le F^{-1}(1-\beta) \\
F^{-1}(1-\beta) - W(F), & \text{if } x > F^{-1}(1-\beta),
\end{cases}$$

where W(F) is the β-Winsorized mean (Staudte & Sheather, 1990),

$$W(F) = \int_{F^{-1}(\beta)}^{F^{-1}(1-\beta)} x \, dF(x) + \beta\left[F^{-1}(\beta) + F^{-1}(1-\beta)\right],$$

and $F^{-1}(\beta)$ and $F^{-1}(1-\beta)$ are the β and 1−β quantiles. Equivalently, the Winsorized mean is related to the β-trimmed mean through

$$W(F) = (1 - 2\beta)\,T(F) + \beta F^{-1}(\beta) + \beta F^{-1}(1-\beta).$$

The following section describes Winsorized expected values and shows how they lead to estimates of the Winsorized sample variance, covariance, and correlation. For a random variable X with distribution F, the Winsorized expected value of X is

$$E_w(X) = \int_{x_\gamma}^{x_{1-\gamma}} x \, dF(x) + \gamma\left(x_\gamma + x_{1-\gamma}\right),$$

where 0 ≤ γ ≤ .5 is the amount of Winsorization (Staudte & Sheather, 1990), and where $x_\gamma$ and $x_{1-\gamma}$ are the γ and 1−γ quantiles of X. Rosenberger and Gasko (1983) recommend γ = .20, which is equivalent to 20% trimming. The population Winsorized mean is given by $\mu_w = E_w(X)$. The Winsorized variance of X is

$$\sigma_w^2 = \int_{x_\gamma}^{x_{1-\gamma}} (x - \mu_w)^2 \, dF(x) + \gamma\left[(x_\gamma - \mu_w)^2 + (x_{1-\gamma} - \mu_w)^2\right].$$

Let X and Y be any two random variables with joint distribution F, and let g(X, Y) be any function of X and Y. The Winsorized expected value of g(X, Y) is obtained by Winsorizing each variable at its γ and 1−γ quantiles before taking expectations; writing $X_w = \max\{x_\gamma, \min(X, x_{1-\gamma})\}$ and $Y_w = \max\{y_\gamma, \min(Y, y_{1-\gamma})\}$,

$$E_w[g(X, Y)] = E[g(X_w, Y_w)],$$

which can be written out as a sum of integrals over the nine regions of the plane determined by the γ and 1−γ quantiles of X and Y. (This definition was introduced by Wilcox, 1993.) The usual rules of expected values apply. As an example, the Winsorized expected value of a sum of random variables is equal to the sum of the Winsorized expected values (Wilcox, 1994). Furthermore, if X and Y are independent, $E_w(XY) = E_w(X)E_w(Y)$.

The Winsorized variance of X is $E_w[(X - \mu_{wx})^2] = \sigma_{wx}^2$, and the covariance between X and Y is $E_w[(X - \mu_{wx})(Y - \mu_{wy})] = \sigma_{wxy}$. The Winsorized correlation is defined as

$$\rho_w = \frac{\sigma_{wxy}}{\sigma_{wx}\sigma_{wy}},$$

where $\sigma_{wx}$ is the Winsorized standard deviation of X.

It should be noted that Huber (1981) suggests that the trimmed mean be preferred over the Winsorized mean because the influence function of the Winsorized mean has "jumps" at the γ and 1−γ quantiles. However, Winsorized means are being employed here in order to be able to easily derive estimates of the Winsorized correlation (Wilcox, 1992a; 1994). The derivation is easier for the following reason. If $g(X_1, \ldots, X_n)$ is any function of the random sample $X_1, \ldots, X_n$, then the corresponding Winsorized quantity can be estimated with $g(Y_1, \ldots, Y_n)$, where

$$Y_i = \begin{cases} X_{(g+1)}, & \text{if } X_i \le X_{(g+1)} \\ X_i, & \text{if } X_{(g+1)} < X_i < X_{(n-g)} \\ X_{(n-g)}, & \text{if } X_i \ge X_{(n-g)}, \end{cases}$$

g = [nγ], [x] is the greatest integer ≤ x, and $X_{(1)} \le \cdots \le X_{(n)}$ are the usual order statistics. The usual estimate of the Winsorized sample mean is given by

$$\bar{X}_w = \frac{1}{n}\sum_{i=1}^{n} Y_i.$$

The Winsorized sample variance is

$$s_{wx}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{X}_w)^2,$$

and, for paired observations on a second variable whose Winsorized values and Winsorized sample mean are denoted $U_i$ and $\bar{Y}_w$, the Winsorized covariance is estimated with

$$s_{wxy} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{X}_w)(U_i - \bar{Y}_w).$$

The estimate of the Winsorized correlation is

$$r_w = \frac{s_{wxy}}{s_{wx}s_{wy}}.$$
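A compact sketch of these sample quantities (illustrative Python, not the dissertation's Program PCAWIN; 20% Winsorization by default, and numpy's clip is used to implement the Y_i just defined):

```python
import numpy as np

def winsorize(x, gamma=0.20):
    """Return the Winsorized observations Y_i: values below X_(g+1) are set to
    X_(g+1) and values above X_(n-g) are set to X_(n-g), with g = [n*gamma]."""
    x = np.asarray(x, dtype=float)
    order = np.sort(x)
    g = int(gamma * len(x))
    return np.clip(x, order[g], order[len(x) - g - 1])

def winsorized_cov_matrix(data, gamma=0.20):
    """Winsorized covariance matrix: Winsorize each variable marginally, then
    compute the usual covariance matrix (divisor n - 1) of the Winsorized values."""
    data = np.asarray(data, dtype=float)
    w = np.column_stack([winsorize(col, gamma) for col in data.T])
    return np.cov(w, rowvar=False)

def winsorized_corr(x, y, gamma=0.20):
    """Sample Winsorized correlation r_w = s_wxy / (s_wx * s_wy)."""
    c = winsorized_cov_matrix(np.column_stack([x, y]), gamma)
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])
```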
M-Estimators of Location
The class of M-estimators typically used requires an understanding of location-invariance and scale-invariance. Location- and scale-invariance or equivariance are defined by the way an estimator responds to systematic changes in a sample. For example, when an entire sample is shifted by an amount a, it is reasonable to expect that an estimator of location should also shift by the amount a (Goodall, 1983). A location estimator with this property is said to be location-equivariant (Huber, 1981). In symbols, location-equivariance means that

$$T_n(x_1 + a, \ldots, x_n + a) = T_n(x_1, \ldots, x_n) + a,$$

where $T_n$ is an estimator of location. An estimator of location that is location- and scale-equivariant is defined by

$$T_n(bx_1 + a, \ldots, bx_n + a) = bT_n(x_1, \ldots, x_n) + a,$$

where a and b are nonzero constants. A location-invariant estimator of scale, $S_n$, allows for a shift in the sample, a, which leaves the value of $S_n$ unchanged:

$$S_n(x_1 + a, \ldots, x_n + a) = S_n(x_1, \ldots, x_n).$$

If $S_n$ follows changes in scaling, for example,

$$S_n(bx_1, \ldots, bx_n) = |b|\, S_n(x_1, \ldots, x_n),$$

it is said to be scale-equivariant (Huber, 1981).

Let $X_1, \ldots, X_n$ be a random sample and let Ψ(x) be any odd function, so that Ψ(−x) = −Ψ(x). The simplest M-estimator of location is the value $\mu_m$ satisfying

$$\sum_{i=1}^{n} \Psi(X_i - \mu_m) = 0.$$

This measure of location is not scale-equivariant (Rosenberger & Gasko, 1983). This problem can be addressed with an ancillary estimate of scale such as the median absolute deviation statistic (MAD) (Huber, 1981). An M-estimator of location is any quantity $\mu_m$ satisfying

$$\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu_m}{\tau}\right) = 0,$$

where τ is a measure of scale which is used to provide location- and scale-equivariance. The choice Ψ(x) = x yields $\mu_m$ equal to the mean. An appropriate Ψ(x) must be chosen which can deal effectively with outliers and heavy-tailed distributions. Two examples are Huber's (1964) Ψ function, given by

$$\Psi(x) = \max[-k, \min(k, x)],$$

where k is a predetermined constant, and the biweight function,

$$\Psi(x) = x(1 - x^2)^2 \quad \text{when } |x| < 1, \text{ and } 0 \text{ otherwise.}$$

The constant k is involved with determining whether the value of a random variable is unusually large or small.

M-Estimators of Scale
The previous section focused on M-estimators of location. This section describes M-estimators of scale. Listed below are the desirable properties that a measure of scale should possess to be considered a scale estimator. Some desirable properties of scale estimators (Lax, 1985) are: (1) the scale of a sample is nonnegative and is zero only when all of the observations in the sample are identical; (2) the scale is invariant to additive shifts in the location of a sample; and (3) the scale is unaffected by multiplicative shifts. Lax (1985) concludes that if X is the vector of observations and a and b are scalar constants, then S(X) is a scale estimator if

$$S(a + bX) = |b|\, S(X). \qquad (1)$$

Bickel and Lehmann (1976) suggest a fourth property that is desirable for estimators of scale. This property refers to the ordering of the distributions and their variances. For example, if X and Y are symmetrically distributed about 0, X is said to be more dispersed than Y if |X| is stochastically larger than |Y|. The measure of dispersion $\tau_x$ should also be larger than $\tau_y$ whenever X is more dispersed than Y. M-estimators of scale do not meet the fourth criterion, because there does not appear to be a single characteristic of a distribution that implies a useful and complete ordering of all distributions according to their scale (Lax, 1985). In contrast, Winsorized variances meet all four criteria and are considered measures of dispersion. Wilcox (1994) argues that although Bickel and Lehmann (1976) have proposed a desirable property for a measure of dispersion, in applied work certain M-estimators of scale are more flexible and resistant than trimmed or Winsorized variances and should be considered. One such estimator is the biweight midvariance, and it is derived below. Let $X_1, \ldots, X_n$ be a random sample.
Let Ψ be some odd function, let K be a tuning constant, let ω be a measure of scale, let Ψ′(x) be the derivative of Ψ with respect to x, and let

$$V = \frac{X - \theta_x}{K\omega},$$

where $\theta_x$ is the population median of X. Then the midvariance has the general form

$$\tau^2 = \frac{K^2\omega^2\, E[\Psi^2(V)]}{\{E[\Psi'(V)]\}^2}.$$

The biweight midvariance corresponds to Ψ(x) = x(1 − x²)² if |x| < 1, and 0 otherwise. It follows that Ψ′(x) = (1 − x²)(1 − 5x²) when |x| < 1, and Ψ′(x) = 0 otherwise, while ω satisfies P(|X − θ_x| < ω) = 1/2. Lax (1985) and Iglewicz (1983) recommend that K = 9 be used. The parameter ω is estimated with the median absolute deviation statistic,

$$\mathrm{MAD} = \mathrm{MED}\{|X_i - M_x|\},$$

where $M_x = \mathrm{MED}\{X_i\}$ is the usual sample median of the $X_i$'s. MAD has a breakdown point of approximately 1/2 (Maronna, 1976). Letting $\hat\omega = \mathrm{MAD}$, the biweight midvariance is estimated with

$$\hat\tau^2 = \frac{n K^2 \hat\omega^2 \sum \Psi^2(V_i)}{\left[\sum \Psi'(V_i)\right]^2},$$

where $V_i = (X_i - M_x)/(K\hat\omega)$. With the biweight Ψ and K = 9 this becomes

$$s_{bx}^2 = \frac{n \sum a_i (X_i - M_x)^2 (1 - V_i^2)^4}{\left[\sum a_i (1 - V_i^2)(1 - 5V_i^2)\right]^2},$$

where $a_i = 1$ if $|V_i| < 1$ and $a_i = 0$ otherwise.

The Biweight Midvariance
The focus of this section is on the biweight midvariance. A review of the current literature revealed several studies that used the biweight midvariance as a measure of scale in place of the usual variance under quite different situations. Results are presented which support the use of the biweight function to answer various research questions. For example, Shoemaker and Hettmansperger (1982) proposed a method of comparing the biweight midvariances of two independent groups, assuming symmetric distributions. They found that when there are heavy-tailed distributions, the biweight measure outperforms the usual variance with regard to probability coverage. As mentioned previously, the sample variance has a finite sample breakdown point of 1/n, whereas the biweight midvariance appears to have a finite sample breakdown point of 1/2, but a formal proof has not been given (Goodall, 1983). Lax (1985) found the biweight midvariance to have the highest triefficiency (80%) out of the 130 estimators that he examined. In contrast, the sample standard deviation has a triefficiency of 0. He also found that the best estimators (A-estimators) used the biweight weighting function.

Wilcox (1992a) employed the biweight midvariance as a robust analogue of Cronbach's alpha. Cronbach's alpha is not a resistant measure of reliability because it is quite influenced by heavy-tailed distributions. A measure of reliability should itself be reliable. Using simulations, Wilcox (1992a) found that in nonnormal situations the biweight analogue provided slightly more accurate probability coverage than Cronbach's alpha.

Goldberg and Iglewicz (1992) illustrate the use of the biweight function to robustify boxplots. The boxplot is a very useful tool for summarizing univariate data and identifying outliers. The boxplot shows characteristics that derive from the actual data, not from an assumed distributional form. Goldberg and Iglewicz (1992) recommend that boxplots (i.e., "relplots" and "quelplots") based on the biweight represent the best choice for estimating the necessary parameters for obtaining bivariate versions of the boxplot.

Recently, Verboon (1993) used the biweight analogue of the usual variance in nonlinear regression analysis. Simulations compared biweight regression with the usual least squares method, using the jackknife procedure to assess the stability of each solution under the influence of outliers. Verboon (1993) found that the biweight was the most robust when it came to eliminating the influence of outliers. He found the least squares solution stable at the wrong solution (a local minimum).
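For reference, the biweight midvariance estimator derived above can be sketched as follows (an illustrative Python version, not the dissertation's Program PCABIWGT; K = 9 and MAD are used as above):

```python
import numpy as np

def biweight_midvariance(x, K=9.0):
    """Biweight midvariance with tuning constant K = 9 and MAD as the
    ancillary estimate of scale, following the estimator given above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    med = np.median(x)
    mad = np.median(np.abs(x - med))         # median absolute deviation
    v = (x - med) / (K * mad)
    a = np.abs(v) < 1.0                      # a_i = 1 when |V_i| < 1, 0 otherwise
    num = n * np.sum(a * (x - med) ** 2 * (1 - v ** 2) ** 4)
    den = np.sum(a * (1 - v ** 2) * (1 - 5 * v ** 2)) ** 2
    return num / den
```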
In summary, the biweight midvariance has been shown to have a very high triefficiency (Lax, 1985) and a high finite sample breakdown point (Goodall, 1983; Iglewicz, 1983), and it has the advantage of a bounded influence function, which makes it quite attractive as a robust measure of scale (Staudte & Sheather, 1990). If a scale estimator, such as the biweight midvariance, satisfies equation (1), it is considered a reasonable solution and can be used as a measure of dispersion (Iglewicz, 1983; Lax, 1985; Staudte & Sheather, 1990; Wilcox, 1994).

The Biweight Midcovariance
For a random sample from a bivariate normal distribution, let

$$U_i = \frac{X_i - M_x}{9\,\mathrm{MAD}_x}, \qquad V_i = \frac{Y_i - M_y}{9\,\mathrm{MAD}_y},$$

with $a_i = 1$ if $|U_i| < 1$ ($a_i = 0$ otherwise) and $b_i = 1$ if $|V_i| < 1$ ($b_i = 0$ otherwise). The biweight midcovariance between X and Y can be estimated with

$$s_{bxy} = \frac{n \sum a_i b_i (X_i - M_x)(1 - U_i^2)^2 (Y_i - M_y)(1 - V_i^2)^2}{\left[\sum a_i (1 - U_i^2)(1 - 5U_i^2)\right]\left[\sum b_i (1 - V_i^2)(1 - 5V_i^2)\right]}. \qquad (2)$$

The biweight midcovariance, as presented in equation (2), is used in place of the usual covariance in this study. Below, procedures for computing 95% confidence intervals for the distribution of the largest eigenvalue are presented.

Confidence Intervals for the Largest Eigenvalue
Two methods were considered for computing a 1−α confidence interval for the largest eigenvalue when the usual covariance matrix is used. The first is derived under the assumption that the sample is drawn from a multivariate normal population (Anderson, 1951, 1963; Bartlett, 1954; Lawley, 1956, 1963). Knowing the properties of the asymptotic distributions of eigenvalues allows the researcher to test hypotheses and construct confidence intervals for the population component structure when the sample size is sufficiently large.

Normal Theory
Suppose there are n independent observations on a p-dimensional random variable with distribution N(μ, Σ), where Σ is a positive definite covariance matrix and S estimates Σ. Let $l_1 > \cdots > l_p > 0$ and $\lambda_1 > \cdots > \lambda_p > 0$ be the ordered eigenvalues of S and Σ, respectively. The $l_i$'s are biased estimators of the $\lambda_i$'s. The means and variances of the sample eigenvalues, as derived by Lawley (1956), are given by

$$E(l_i) = \lambda_i + \frac{\lambda_i}{n}\sum_{j \ne i}\frac{\lambda_j}{\lambda_i - \lambda_j}$$

and

$$\mathrm{VAR}(l_i) = \frac{2\lambda_i^2}{n}\left[1 - \frac{1}{n}\sum_{j \ne i}\left(\frac{\lambda_j}{\lambda_i - \lambda_j}\right)^2\right],$$

to terms of order $n^{-1}$. A complete derivation is presented in detail, for example, by Anderson (1984) and Muirhead (1982) in their texts. The first method used for computing a 95% confidence interval for the largest population eigenvalue of the covariance matrix is

$$\frac{l_1}{1 + z_{\alpha/2}\sqrt{2/n}} \;\le\; \lambda_1 \;\le\; \frac{l_1}{1 - z_{\alpha/2}\sqrt{2/n}}, \qquad (3)$$

where $l_1$ is the largest sample eigenvalue and $z_{\alpha/2}$ is the upper α/2 quantile of the standard normal distribution (e.g., Anderson, 1963). Even though equation (3) is known to be biased, it is the best estimate currently available.
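A small sketch of interval (3), under the reconstruction given above (illustrative Python rather than the study's Fortran; scipy is used only for the normal quantile):

```python
import numpy as np
from scipy.stats import norm

def normal_theory_ci(data, alpha=0.05):
    """Anderson (1963) large-sample interval (3) for the largest eigenvalue,
    using the fact that l_1 has approximate variance 2*lambda_1^2 / n."""
    x = np.asarray(data, dtype=float)
    n = x.shape[0]
    l1 = np.linalg.eigvalsh(np.cov(x, rowvar=False))[-1]   # largest sample eigenvalue
    z = norm.ppf(1 - alpha / 2)                            # upper alpha/2 normal quantile
    half = z * np.sqrt(2.0 / n)
    return l1 / (1 + half), l1 / (1 - half)
```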
A parametric bootstrap has the strictest assumptions and is used when a distributional form is known. One problem is that the solution may not be generalizable to other situations, because this method involves repeatedly resampling from the same population. If the data are normal and there are no outliers, normal test statistics and methods for constructing confidence intervals should be used. 27 However, the bootstrap results in similar test statistic values compared to the values obtained with the usual methods (Efron, 1982). The nonparametric bootstrap technique is used when parametric assumptions cannot be met, or when the underlying distribution is unknown. In this type of bootstrap, resampling is done with replacement so that each "observation” has an equal chance of being sampled. Bootstrap procedures can be used to estimate standard errors and to construct approximate confidence intervals. Percentile bootstrap methods include techniques for correcting varying orders of bias in estimated confidence intervals. (Efron, 1982, 1987; Le Page & Billard, 1992; Noreen, 1989). A review of recent research has indicated that many studies are employing bootstrap resampling procedures. For example, Rasmussen's (1989) results supported the use of the percentile bootstrap procedure when testing the hypothesis that the population correlation between two random variables, X and Y, are equal to zero. Rasmussen (1989) reported that the percentile bootstrap procedure gave good control over Type I errors given that the marginal distributions of X and Y were normal, and n is not too small. In support, Wilcox (1991) found that using the percentile bootstrap to test whether populations correlation is equal to zero under dependence results in good control over Type I error. However, Wilcox (1991) found that under dependence, such as testing the hypothesis: H0: = o* — ay, the percentile bootstrap technique can be unsatisfactory. 28 Lambert, Wildt, and Durand (1991) examined approximate confidence intervals for factor loadings. Previous research indicates that deciding how many factors to retain involves using rules of thumb that produce interpretations which ignore the presence of sampling variability in estimated factor loadings (e.g. Cliff & Hamburger, 1967; Gorsuch, 1983), but see the clear results of later work (e.g. Cliff & Pennell, 1969; Cliff, 1970). Lambert et al. (1991) present the bootstrap method as a means of approximating the sampling distribution of estimated factor loadings to compute approximate confidence intervals. For completeness, Eaton and Tyler (1991) show how Weilandt's inequality can be applied to the bootstrap distribution of eigenvalues of the sample covariance matrix. It is used in place of Taylor Series expansions to provide more accurate estimates of the third and forth moments of the distribution of eigenvalues. Weilandt's inequality is not being considered in this study. The bootstrap procedure of primary interest in this study is the percentile method. The method is applied as follows. Let (i = l,..., n; k = l,..., p) be a random sample from a p-variate distribution. The first step is to draw a random bootstrap sample of size n. That is, vectors of observations are obtained by randomly sampling rows from with replacement yielding X*lk . Next, the largest eigenvalue is computed for the bootstrap sample, X*. This process is repeated B times yielding X\ , b = 1 B, where B is the number of bootstrap samples. 
The percentile bootstrap method described above was used in the simulation studies to compute 95% confidence intervals for the largest eigenvalue of a covariance matrix, a biweight midcovariance matrix, and a Winsorized covariance matrix under varying situations. The goals of the present study are presented in the next section.

CHAPTER 2
THE PRESENT STUDY

Goals of the Present Study
The study had several major goals:
(1) Use normal-theory methods for the usual covariance (Anderson, 1963) in PCA to assess the minimum number of observations required to obtain accurate probability coverage for the distribution of the largest eigenvalue for varying correlations.
(2) Replace the usual covariance matrix with robust analogues such as the biweight midcovariance and the Winsorized covariance as input to PCA assuming normality, and examine the probability coverage for the three types of input to PCA when normal assumptions are violated (e.g., outliers and heavy tails). Applied researchers would like to know that they are getting a reasonable estimate of the measure of dispersion, especially when there are outliers and heavy-tailed distributions.
(3) Use a percentile bootstrap method to compute approximate 95% confidence intervals for the distribution of the largest eigenvalue. This allows the researcher to assess the accuracy of the estimates without having to make assumptions about the underlying distribution.

Research Questions
In order to address the research goals, the following specific research questions were generated:
(1) How large must the sample size be to obtain accurate 95% confidence intervals using equation (3) for the usual covariance, the Winsorized covariance, and the biweight midcovariance, assuming normality?
(2) Does equation (3) provide accurate 95% confidence intervals under nonnormality for the usual covariance?
(3) How accurate is the probability coverage using bootstrap methods for the usual covariance under normality and nonnormality (extreme heavy-tailedness)?
(4) How effectively does the bootstrap procedure deal with outliers and heavy tails?
(5) How do the resistant covariance matrices compare to the usual covariance matrix under nonnormality with regard to probability coverage, bias, and mean squared error?
(6) Are the robust methods useful in applied research, and how easy are they to implement?

CHAPTER 3
SIMULATION STUDY

General Method and Data Generation
The general procedure is to generate random multivariate normal observations using the IMSL (1989) subroutine RNMVN and then transform the marginals to nonnormal g-and-h distributions (Hoaglin, 1985). The family of g-and-h distributions represents a very wide range of both symmetric and asymmetric distributions. If Z is a standard normal random variable, then the g-and-h distribution is the distribution corresponding to the random variable

$$X = \frac{\exp(gZ) - 1}{g}\exp\!\left(\frac{hZ^2}{2}\right).$$

The case g = h = 0 corresponds to a normal distribution. For the special case g = 0, this last expression is taken to be

$$X = Z\exp\!\left(\frac{hZ^2}{2}\right),$$

which is known as the h distribution (Hoaglin, 1985). The case g = 0 corresponds to a symmetric distribution, and as g increases, skewness increases as well. The tails of the g-and-h distribution get heavier as h increases. Kurtosis increases with g as well as h.
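The data-generation step can be sketched as follows (an illustrative Python version; the original study used the IMSL RNMVN routine in Fortran-77, for which numpy's multivariate normal generator stands in here):

```python
import numpy as np

def g_and_h(z, g=0.0, h=0.0):
    """Transform standard normal values z to the g-and-h distribution."""
    if g == 0.0:
        x = z                                    # the h distribution when g = 0
    else:
        x = (np.exp(g * z) - 1.0) / g
    return x * np.exp(h * z ** 2 / 2.0)

def generate_sample(n, p, rho, g=0.0, h=0.0, seed=None):
    """Multivariate normal observations with common correlation rho, each
    marginal then transformed to a g-and-h distribution."""
    rng = np.random.default_rng(seed)
    sigma = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    z = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    return g_and_h(z, g, h)
```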
Table 1 describes some properties of the g-and-h distribution. In terms of kurtosis, the case g = 0 and h = .2 might seem extreme, but Wilcox (1990) found these characteristics in his data. Because it is unknown how large skewness and kurtosis values might be in applied work, the values for g were taken to be 0, .5, and 1, and the values for h were taken to be 0 and .2. The data in Table 1 indicate that when g = 1 and h = .2, the values for skewness and kurtosis are quite extreme. Using these particular g and h values provides a wide range of situations which may be encountered in applied research.

Table 1
Some Properties of the g-and-h Distribution

  g      h     Skewness      Kurtosis
 0.0    0.0       0.00            3.0
 0.0    0.2       0.00           36.0
 0.5    0.0       1.75            8.9
 0.5    0.2      13.16        42895.9
 1.0    0.0       6.2           114
 1.0    0.2     713          5.6 x 10^14

Simulation Study 1
The first set of simulations was exploratory in nature. They were performed to examine the accuracy of the probability coverage under normality (g = 0, h = 0) using equation (3) for the distribution of the largest eigenvalue based on the usual sample covariance matrix, the biweight midcovariance matrix, and the Winsorized covariance matrix. The simulations in Study 1 were designed to address the first research question posed earlier in the paper and to provide guidelines for the simulation design. The object of the simulation studies was to investigate the distribution of the largest eigenvalue only.

Design and Procedure
The following situations were of interest and served as the independent variables:
(1) Sample size: 25, 51, 101, 151, 201, and 251.
(2) Target population correlation matrices: .1, .5, and .9 as the off-diagonal elements. The population values of the largest eigenvalue for the three target matrices are λ1 = 1.30, 2.50, and 3.70.
(3) Covariance matrix: usual, biweight, and Winsorized.
(4) The number of variables was equal to 4.
During each of the 1,000 replications, a covariance matrix, a biweight midcovariance matrix, and a Winsorized covariance matrix are calculated and then subjected to PCA using the
The Winsorized covariance was unacceptable for all sample sizes. This may be because the Winsorized covariance does not estimate the usual covariance accurately for all correlations. Figure 1 illustrates boxplots of the data from the three measures of dispersion under normality (n=l00, As = 2.5, r = .5). The data generated by the usual covariance produced a boxplot similar to that of the biweight, with the biweight resulting in fewer, but more dispersed outliers. The boxplot for the Winsorized covariance illustrates that the bulk of the data are shifted to the left and there are 37 twice as many outliers to the right of the distribution. Because the Winsorized analogue of the usual sample covariance matrix failed to provide adequate probability coverage under normality for moderate to high correlations, it is not considered further in this study. 38 Table 2 Estimated Probability Coverage for the Winsorized Covariance Matrix, the Usual Covariance Matrix, and the Biweight Midcovariance Matrix using (3) Matrix Usual Biwelght Wtnsor N =25 .1 99.4 98.1 91.3 .5 93.0 92.9 80.1 .9 91.3 91.5 78.4 N=51 .1 97.0 97.6 88.2 .5 93.9 93.1 77.6 .9 92.3 93.0 75.5 N=101 .1 98.2 98.3 93.8 .5 95.8 94.3 87.3 .9 93.8 93.2 79.8 N = 151 .1 98.0 97.2 95.4 .5 95.8 94.3 82.3 .9 94.6 94.4 81.6 N=201 .1 96.8 96.5 94.3 .5 95.0 96.2 80.6 .9 94.9 94.5 82.5 39 Figure I Boxplots of the Distribution of the Largest Eigenvalue for the Usual Covariance. Biweight Midcovariance, and Winsorized Covariance Matrices Under Normality (X = 2.5. r = .51 USUAL I + 1 BIWEIGHT I + I WINSOR I + I + + + + + + + 1.00 1.50 2.00 2.50 3.00 3.50 4.00 EIGENVALUE Simulation Study 2 40 The simulation studies described in this section were used to address research questions 2-5. Simulations were performed to compare the probability coverage for the distribution of the largest eigenvalue using (3) with the probability coverage using the percentile bootstrap method with the usual covariance under normality (g=0,h =0) and nonnormality (g-0,h = .2). Simulations were also performed for situations where both the biweight midcovariance matrix and the usual sample covariance matrix are used as input to PCA, with the percentile bootstrap method used to compute a 95% confidence interval for normal and nonnormal distributions. Simulations were also done to compare a parametric bootstrap procedure that involved resampling (B=299) from the multivariate normal distribution (Efron. 1988). The Fortran code for Program PARMBOOT is included in Appendix I. A bias-correction procedure was tested, but was not very effective for this problem. The Program BCBOOT is listed in Appendix 1 . The following small sample properties were of interest: 95% probability coverage, bias, and mean squared error (MSE). Bias is the expected difference between the largest population eigenvalue and its estimated value. Mean squared error is the expected squared difference between the largest population eigenvalue and its estimated value, which confounds bias with variance. 41 Design and Procedure For both methods of computing confidence intervals, the following conditions were studied: (1) Three sample sizes: 25, 51, and 101. (2) Three values of correlations ,2, .5, and .8 for the common off-diagonal elements of the population matrices. (For g = 0, h = 0: \ 2 = 1.80, Xs — 3.0, and X* = 4.2) (3) Six distributional situations using the g-and-h distribution (g = h = 0, g = 0 and h =.2, g =.5 and h = 0, g = .5 and h =.2, g = 1 and h = 0, g = 1 and h = .2). 
(4) Two covariance matrices: the usual sample covariance and biweight midcovariance. (5) The number of variables was equal to 5. The simulations for each situation were replicated 1,000 times (B = 299) and the resulting values were used to estimate probability coverage, bias, and MSE. Because of the results of Simulation Study 1, the Winsorized covariance was dropped from consideration, the target population correlations were made less extreme (.2, .5, .8), the number of subjects was reduced to 25, 51, and 101, and the number of variables was increased to 5 for the subsequent set of simulations (Simulation Study 2). 42 CHAPTER 4 SIMULATION RESULTS Probability Coverage Table 3 contains the population covariance values for each situation. They were estimated using a population of n - 100,000 with one replication. Table 4 reports the estimated probability coverage employing equation (3) and the percentile bootstrap method when the usual covariance matrix is used. Under normality, (3) and the bootstrap procedure give similar results. However, when g = 0 and h = .2. (3) results in very poor probability coverage. The bootstrap procedure gives improved results, especially for low correlations, but overall it does not provide adequate probability coverage, so results for the other g-and-h distributions are not reported. Figure 2 shows a boxplot of the distribution of the largest eigenvalue under normality for the bootstrap procedure (B = 299 estimates sampled with replacement, averaged for each replication) and compares it to a boxplot of the procedure generating one estimated eigenvalue for each replication (1,000 replications). Boxplots are quite useful in identifying outliers and they are being used here for the purpose of comparing the distribution of the outliers with and without the bootstrap. The boxplots are nearly identical when the population correlations are 43 high (i.e .8), however for correlations equal to .2 and .5, the bootstrap mean value is closer to the population value. The standard errors of the mean are smaller than with 1,000 replications. 
Table 3
Population Values for the Largest Eigenvalue of the Usual Covariance and Biweight Midcovariance Matrices (N = 100,000)

g     h     r     Usual    Biweight
0.0   0.0   .2    1.80     1.83
0.0   0.0   .5    3.01     3.04
0.0   0.0   .8    4.21     4.26
0.0   0.2   .2    3.72     2.41
0.0   0.2   .5    6.18     3.98
0.0   0.2   .8    8.93     5.67
0.5   0.0   .2    2.51     1.84
0.5   0.0   .5    4.20     2.96
0.5   0.0   .8    6.01     4.20
0.5   0.2   .2    6.35     2.22
0.5   0.2   .5    10.48    3.60
0.5   0.2   .8    15.92    5.18
1.0   0.0   .2    7.03     1.57
1.0   0.0   .5    11.77    2.39
1.0   0.0   .8    18.01    3.49
1.0   0.2   .2    36.53    1.80
1.0   0.2   .5    50.84    2.81
1.0   0.2   .8    89.64    4.11

Table 4
Estimated Probability Coverage for the Largest Eigenvalue Based on the Usual Covariance Using (3) and the Bootstrap Method Using (8), for Normal (g = h = 0) and Nonnormal (g = 0, h = .2) Distributions

                (3)                  Bootstrap
(g,h)     (0,0)    (0,.2)      (0,0)    (0,.2)
N = 25
  .2      .994     .891        .955     .959
  .5      .930     .816        .921     .880
  .8      .913     .690        .894     .769
N = 51
  .2      .970     .830        .957     .947
  .5      .939     .773        .932     .879
  .8      .923     .690        .920     .814
N = 101
  .2      .965     .778        .974     .967
  .5      .948     .760        .956     .915
  .8      .941     .666        .941     .840

Figure 2
Boxplots of the Distribution of the Largest Eigenvalue for 1,000 Replications as Compared to the Averaged B = 299 for Each Replication of the Usual Covariance Assuming Normality
[Figure not reproduced: three pairs of boxplots, labeled 1000 and BOOT, on a common EIGENVALUE axis; one pair for each of r = .2 (λ1 = 1.8), r = .5 (λ1 = 3.0), and r = .8 (λ1 = 4.2).]

Table 5 presents the estimated probability coverage for the largest eigenvalue for all distributions when the usual sample covariance matrix is used. The probability coverage under normality (g = h = 0) was fairly accurate when the number of subjects was at least 51. Under nonnormality, the usual covariance fails completely. Table 6 presents the probability coverage for the biweight midcovariance. The biweight provides highly accurate probability coverage, especially under nonnormality, even for sample sizes of 25. In summary, when distributions are normal (g = h = 0), probability coverage is good for both measures of spread. However, for nonnormal conditions, probability coverage is good for the biweight but not for the usual covariance matrix.
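For readers who want to see how these entries are tallied: each coverage value in Tables 4 through 6 is the proportion of 1,000 replications whose 95% interval contained the population value of the largest eigenvalue, and the Appendix I programs take the 6th and 292nd of the 299 sorted bootstrap estimates as the interval limits. The sketch below illustrates that tally in Python; it is an illustration under those assumptions, not the dissertation's code, and the function names are illustrative.

# Illustrative sketch (Python/NumPy): estimating probability coverage as the
# proportion of Monte Carlo replications whose percentile bootstrap interval
# contains the population largest eigenvalue. The 6th and 292nd order
# statistics of the 299 sorted bootstrap values mirror the Appendix I programs.
import numpy as np

def estimate_coverage(pop_lambda, sampler, n, reps=1000, B=299, rng=None):
    """sampler(rng, n) must return an (n x p) data matrix from the target population.
    Reduce reps for a quick check; the full design uses reps = 1000."""
    rng = np.random.default_rng() if rng is None else rng
    hits = 0
    for _ in range(reps):
        data = sampler(rng, n)
        boot = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, n, size=n)
            cov = np.cov(data[idx], rowvar=False)
            boot[b] = np.linalg.eigvalsh(cov)[-1]
        boot.sort()
        lower, upper = boot[5], boot[291]   # 6th and 292nd of 299 sorted values
        if lower < pop_lambda < upper:
            hits += 1
    return hits / reps

# Example sampler for the normal case with 5 variables and common correlation .8,
# for which the population largest eigenvalue is 4.2.
def mvn_sampler(rng, n, p=5, r=0.8):
    sigma = np.full((p, p), r) + (1 - r) * np.eye(p)
    return rng.multivariate_normal(np.zeros(p), sigma, size=n)

# estimate_coverage(4.2, mvn_sampler, n=25) corresponds to one cell of Table 5.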
Table 5
Estimated Probability Coverage for the Largest Eigenvalue Based on the Usual Covariance with Normal (g = h = 0) and Nonnormal Distributions (g-and-h; Hoaglin, 1983)
1,000 replications, 299 bootstrap samples

(g,h)     (0,0)   (0,.2)   (.5,0)       (.5,.2)   (1,0)   (1,.2)
N = 25
  .2      .955    .955     .943         .917      .937    .700
  .5      .921    .880     .853         .760      .793    .595
  .8      .894    .769     .785         .596      .637    .390
N = 51
  .2      .957    .947     .943         .907      .926    .752
  .5      .932    .879     .881         .732      .834    .640
  .8      .920    .814     .827         .642      .690    .436
N = 101
  .2      .974    .967     .959         .939      .948    .833
  .5      .956    .915     .921         .825      .868    .699
  .8      .941    .840     [illegible]  .696      .750    .502
Bradley's (1978) liberal rejection criterion (.925-.975).

Table 6
Estimated Probability Coverage for the Largest Eigenvalue Based on the Biweight Midcovariance with Normal (g = h = 0) and Nonnormal Distributions
1,000 replications, 299 bootstrap samples

(g,h)     (0,0)   (0,.2)   (.5,0)   (.5,.2)   (1,0)   (1,.2)
N = 25
  .2      .968    .940     .960     .925      .923    .875
  .5      .937    .959     .958     .969      .926    .970
  .8      .908    .926     .915     .948      .955    .960
N = 51
  .2      .966    .952     .964     .936      .925    .881
  .5      .945    .962     .952     .958      .962    .962
  .8      .935    .942     .931     .949      .942    .942
N = 101
  .2      .975    .969     .966     .959      .947    .925
  .5      .960    .969     .967     .973      .971    .973
  .8      .951    .957     .952     .954      .939    .965
Bradley's (1978) liberal rejection criterion (.925-.975).

Bias and Mean Squared Error

Estimates of bias and mean squared error (MSE) for the usual sample covariance matrix and the biweight midcovariance matrix are shown in Table 7 and Table 8, respectively. The case where g = 1 and h = .2 gave very poor results, particularly for MSE, so it is not considered further. As can be seen in Table 7 for the usual sample covariance, when g = 1 and h = 0, the values for bias and MSE are relatively large. In general, bias and MSE decreased as the sample size increased. Bias also decreased as the correlation increased. The biweight midcovariance showed far less bias and MSE than the usual covariance across sample sizes, correlations, and distributions. Under normality (g = h = 0), the usual sample covariance matrix (see Table 7) and the biweight midcovariance matrix (see Table 8) give accurate and very similar results with regard to MSE; in particular, as the correlation increased, the MSE increased. The usual sample covariance matrix gives very large values for MSE compared to the biweight midcovariance matrix, especially for very heavy tails: g = .5 and h = .2, and g = 1 and h = 0. The estimates of MSE based on the usual sample covariance matrix are on the order of 100 times larger than the estimates of MSE based on the biweight midcovariance matrix.
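Because the biweight midcovariance is central to the comparisons in Tables 7 and 8, which follow, a sketch of how it is computed for a pair of variables is given below. It follows the computation in Program PCABIWGHT (Appendix I), but the Python function and variable names are illustrative rather than taken from the dissertation.

# Illustrative sketch (Python/NumPy) of the biweight midcovariance for a pair
# of variables, following the computation in PROGRAM PCABIWGHT (Appendix I).
import numpy as np

def biweight_midcovariance(x, y):
    """Biweight midcovariance of two samples of equal length."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    dx = x - np.median(x)
    dy = y - np.median(y)
    # Scale the deviations by 9 times the median absolute deviation.
    u = dx / (9.0 * np.median(np.abs(dx)))
    v = dy / (9.0 * np.median(np.abs(dy)))
    a = (np.abs(u) <= 1.0).astype(float)    # observations beyond the cutoff get zero weight
    b = (np.abs(v) <= 1.0).astype(float)
    top = np.sum(a * b * dx * (1 - u**2) ** 2 * dy * (1 - v**2) ** 2)
    bot = np.sum(a * (1 - u**2) * (1 - 5 * u**2)) * np.sum(b * (1 - v**2) * (1 - 5 * v**2))
    return n * top / bot

def biweight_midcovariance_matrix(data):
    """p x p matrix of biweight midcovariances: the robust input to PCA."""
    p = data.shape[1]
    s = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            s[j, k] = biweight_midcovariance(data[:, j], data[:, k])
    return s

This matrix, rather than the usual sample covariance matrix, is what is handed to the principal components routine in the biweight conditions reported in Tables 6, 8, and 10.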
Table 7
Estimated Bias and Mean Squared Error (MSE) for the Largest Eigenvalue Based on the Usual Covariance Matrix

                     n = 25            n = 51            n = 101
g     h     r        bias     MSE      bias     MSE      bias     MSE
0     0     .2       .42      .35      .20      .14      .11      .06
0     0     .5       .07      .65      .01      .33      .02      .15
0     0     .8      -.10     1.32     -.08      .64     -.02      .31
.5    0     .2      1.03     2.47      .61     1.19      .38      .46
.5    0     .5       .38     3.64      .21     2.06      .17      .99
.5    0     .8      -.03     8.33     -.04     4.59      .04     2.33
0     .2    .2     -1.79    12.62     1.13    12.79      .75     2.59
0     .2    .5       .70    13.24      .40     7.14      .33     3.22
0     .2    .8      -.11    29.77     -.20    17.01     -.08     8.14
.5    .2    .2      5.80   307.00     6.70   404.00     4.66   192.00
.5    .2    .5      3.28   429.00     2.95   293.00     3.03   165.00
.5    .2    .8       .72   764.00      .47   466.00     1.36   387.00
1.0   0     .2      7.01   230.00     5.90   390.00     4.50    85.00
1.0   0     .5      3.99   281.00     3.21   176.00     2.86    98.00
1.0   0     .8      1.13   593.00      .76   327.00     1.18   222.00

Table 8
Estimated Bias and Mean Squared Error for the Largest Eigenvalue Based on the Biweight Midcovariance Matrix

                     n = 25            n = 51            n = 101
g     h     r        bias     MSE      bias     MSE      bias     MSE
0     0     .2       .37      .31      .18      .13      .09      .06
0     0     .5      -.07      .62     -.06      .33     -.01      .16
0     0     .8      -.32     1.44     -.17      .70     -.05      .35
.5    0     .2       .63      .74      .30      .25      .16      .10
.5    0     .5      -.49      .58      .04      .51      .06      .26
.5    0     .8      -.13     2.38     -.10     1.19      .02      .63
0     .2    .2      1.01     1.71      .51      .54      .26      .20
0     .2    .5       .39     2.21      .14      .94      .10      .46
0     .2    .8       .00     4.84     -.04     2.17      .04     1.11
.5    .2    .2      1.30     2.64      .64      .75      .32      .25
.5    .2    .5      -.08      .96      .27     1.07      .18      .50
.5    .2    .8       .32     5.67      .09     2.34      .14     1.20
1.0   0     .2      1.36     2.94      .68      .79      .35      .25
1.0   0     .5       .54     1.38      .41      .99      .24      .43
1.0   0     .8       .61     5.43      .25     2.09      .21     1.03

Skewness and Kurtosis

Table 9 presents the values for skewness and kurtosis for the usual covariance input, and Table 10 presents skewness and kurtosis for the biweight midcovariance input, for all situations in the simulation. For the usual covariance matrix, Table 9 indicates that skewness is large for most nonnormal distributions and kurtosis is extreme (as defined by the g-and-h distribution). The results using the biweight are much smaller for both the third and fourth moments than for the usual covariance matrix. In general, as the number of subjects increased, the values of skewness decreased. In particular, for the biweight under nonnormality, skewness remained quite small (< 2). In contrast, for the usual covariance, skewness values were as large as 10.

Table 9
Skewness and Kurtosis for the Largest Eigenvalue of the g-and-h Distribution (Hoaglin, 1983) Based on the Usual Covariance

                     n = 25            n = 51            n = 101
g     h     r       skew     kurt     skew     kurt     skew     kurt
0     0     .2       .50     -.04      .53      .45      .37      .21
0     0     .5       .42     -.01      .35      .12      .31      .34
0     0     .8       .47      .18      .30      .09      .26      .22
0     .2    .2      4.67    36.09    19.90    52.34     3.29    17.70
0     .2    .5      2.74    13.55     2.75    14.84     1.63     4.94
0     .2    .8      3.58    24.19     3.96    30.34     3.60    27.90
.5    0     .2      1.33     3.05     2.26    14.54      .93     1.62
.5    0     .5      1.24     2.33     1.33     3.67      .97     1.63
.5    0     .8      1.70     5.51     1.50     4.67     1.03     1.58
.5    .2    .2      8.78   123.00    29.55   910.00     7.42    77.95
.5    .2    .5      9.67   127.00     8.48   102.00     5.29    40.19
.5    .2    .8     10.29   152.00     9.84   141.00     8.04    95.49
1.0   0     .2      4.85    39.57    19.97   520.00     4.29    28.38
1.0   0     .5      5.29    44.59     4.89    38.51     3.34    18.22
1.0   0     .8     15.23   318.00     6.82    73.53     6.16    61.27
1.0   .2    .2     15.48   328.00    15.36   319.00    10.94   160.00
1.0   .2    .5     14.72   250.00    12.99   214.00     7.69    75.81
1.0   .2    .8     10.49   155.00    14.34   264.00    13.62   225.00
Table 10
Skewness and Kurtosis for the Largest Eigenvalue of the g-and-h Distribution (Hoaglin, 1983) Based on the Biweight Midcovariance

                     n = 25            n = 51            n = 101
g     h     r       skew     kurt     skew     kurt     skew     kurt
0     0     .2       .55      .07      .55      .55      .39      .26
0     0     .5       .50      .07      .38      .16      .34      .44
0     0     .8       .54      .32      .32      .06      .28      .28
0     .2    .2       .71      .57      .59      .47      .50      .49
0     .2    .5       .79      .57      .58      .48      .51      .94
0     .2    .8       .98     1.32      .58      .45      .52      .72
.5    0     .2       .71      .45      .62      .55      .63      .83
.5    0     .5       .87      .78      .65      .60      .66     1.11
.5    0     .8      1.04     1.57      .63      .58      .55      .90
.5    .2    .2       .82      .56      .72      .80      .75     1.35
.5    .2    .5       .82      .56      .80     1.14      .75     1.76
.5    .2    .8      1.77     1.79      .73      .76      .67     1.60
1.0   0     .2      1.16     1.84      .86     1.39      .72      .97
1.0   0     .5      1.16     1.83      .75     1.22      .53      .88
1.0   0     .8      1.76     4.53     1.06     1.71      .76     1.32
1.0   .2    .2       .74      .95     1.16     1.84      .86     1.39
1.0   .2    .5      1.11     1.84     1.18     1.18      .77     1.42
1.0   .2    .8      1.84     4.82     1.14     2.20      .81     1.78

Miscellaneous Simulation Results

Simulations were done using a parametric bootstrap technique (Program PARMBOOT), which was quite accurate with regard to probability coverage when compared to equation (3) under normality. It was found to be more accurate than the percentile bootstrap, as illustrated by Efron (1988); however, it does not generalize to the situation where the underlying distribution is unknown, and its assumptions are quite rigid because it resamples from the same assumed population. In the case where the distributions are not normal, the bias-corrected bootstrap (Program BCBOOT) was found to be quite inaccurate with regard to probability coverage; therefore, it was not pursued further in this study. The last set of simulations tested the probability coverage of the percentile bootstrap method employing varying bootstrap sample sizes: 199, 299, 399, 599, and 999 for each of the 1,000 replications for the usual covariance. It was of interest to see whether more bootstrap samples meant more accurate estimates. The results illustrated that for the problem at hand, more than 400 bootstrap samples produced highly inaccurate probability coverage. The probability coverage improved as the size of the bootstrap sample decreased, and at B = 299 excellent estimates were obtained.

CHAPTER 5
EXAMPLES USING EMPIRICAL DATA

In order to illustrate the practical implications of using a more resistant measure of scale with PCA, several published data sets were reanalyzed: Fisher's Iris data (N = 51) taken from Morrison (1990); alcohol effects (N = 585) (Earleywine, 1994); EEG responses to photic driving in schizophrenics (N = 23) (Jin, Potkin, Rice, Sramek, Custer, Isenhart, Heh, & Sandman, 1990); and memory in depressed adults (N = 179) and reaction time in depressed adults (N = 179) (Sandman, Barron, Nackoul, Goldstein, & Fidler, 1993). The data were analyzed with BMDP program 4M (Dixon, 1990). The biweight midcovariance matrix was estimated using the Minitab macro bicovm.mtb (Wilcox, in press b), and was subsequently entered into BMDP for analysis.
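A sketch of this two-step procedure in Python is given below; the eigendecomposition stands in for the BMDP 4M step, and, by analogy with VAF, the midvariance accounted for (MVAF) is taken here to be the largest eigenvalue divided by the trace of the biweight midcovariance matrix. That reading of MVAF, like the use of Python in place of Minitab and BMDP, is an assumption made only for illustration.

# Illustrative sketch (Python/NumPy) of the two-step procedure: build a scatter
# matrix, then hand it to a principal components routine. The eigendecomposition
# here stands in for the BMDP 4M step; MVAF is assumed, by analogy with VAF, to
# be the largest eigenvalue divided by the trace of the input matrix.
import numpy as np

def pca_from_scatter(scatter):
    """Eigenvalues (descending) and the proportion of dispersion accounted
    for by the first principal component."""
    eigvals = np.linalg.eigvalsh(scatter)[::-1]
    return eigvals, eigvals[0] / eigvals.sum()

rng = np.random.default_rng(12357)
sigma = np.full((4, 4), 0.5) + 0.5 * np.eye(4)
data = rng.multivariate_normal(np.zeros(4), sigma, size=100)

# Classical PCA input: the usual sample covariance matrix.
eig_usual, vaf = pca_from_scatter(np.cov(data, rowvar=False))

# Robust PCA input: replace np.cov with the biweight midcovariance matrix
# (see the biweight_midcovariance_matrix sketch earlier) and report MVAF
# instead of VAF; nothing else in the PCA step changes.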
Results

Table 11 presents the results from the reanalysis of the empirical data. Table 11 reports the number of subjects in each data set, the number of variables, the values obtained for the largest eigenvalue (EVAL) as well as 95% confidence intervals for each EVAL, and the variance accounted for (VAF). In the case of the biweight midcovariance, the midvariance accounted for (MVAF) is presented. For normally distributed data such as the Iris and alcohol data (* = normal), both the covariance-based and the biweight-based largest eigenvalues resulted in similar variances accounted for and similar magnitudes of eigenvalues. For data with outliers and/or heavy tails (EEG, RT, MEM), the biweight produced smaller eigenvalues than the usual covariance, and the percent of variance/dispersion accounted for using the biweight was much larger. When the data were fairly normally distributed, the results were very similar when employing the biweight midcovariance in place of the usual covariance; in situations where the data are normal, robust methods do not improve the results over the usual methods. The 95% confidence intervals were computed using the percentile bootstrap method with B = 299 and one replication. The only situation where accurate 95% confidence intervals were not obtained was for the Iris data set, which was normally distributed. The result could be due to the bootstrap procedure, but it is difficult to be sure without further computer-intensive procedures.

Table 11
Results from PCA on Empirical Data Sets

                      Usual Covariance                      Biweight Midcovariance
Data    n     VAR     EVAL      95% CI              VAF     EVAL      95% CI              MVAF
ALC*    585   4       474       (417, 537)          45%     489       (429, 550)          45%
EEG     23    4       .77       (.38, 1.26)         62%     .64       (.25, 1.25)         69%
IRIS*   51    4       .60       (.11, .17)          69%     .60       (.13, .21)          69%
MEM     179   6       10.68     (8.89, 12.72)       46%     10.95     (8.96, 13.46)       50%
RT      179   6       563,455   (431,238, 687,393)  90%     518,266   (374,243, 649,994)  91%
* Normally distributed data (i.e., no outliers or heavy tails).

CHAPTER 6
DISCUSSION

The results from Simulation 1 indicate that under normality, (3) provides similar probability coverage for the usual covariance as compared to the biweight midcovariance. The Winsorized covariance provides accurate probability coverage only when the target correlation is low (i.e., r = .1) (see Table 2). Boxplots (Figure 1) illustrate the situation where the population correlation is .5. The population value was computed for the largest eigenvalue and found to equal 2.5. Figure 1 clearly indicates that under normality, the usual covariance and the biweight midcovariance distributions of the largest eigenvalue are similar, whereas the Winsorized covariance underestimates its "population" value and has asymmetric outliers. The rationale for not including the Winsorized covariance matrix in the second set of simulations is twofold. First, the population value for the target matrix was arrived at through a transformation of the distribution (λw / .4122). After the transformation, the probability coverage was accurate for low correlations but was highly inaccurate when r = .5 and r = .9. Thus, the technique seems to be too unreliable to recommend in applied work employing PCA. Second, the Winsorized distribution of the largest eigenvalue resulted in twice as many outliers as compared to the other two. It did not behave well assuming normality; therefore, no relevant inferences could be made under nonnormality.

The results indicate that for light-tailed distributions, (3) used with the usual sample covariance matrix gives reasonably accurate 95% confidence interval coverage. For heavy-tailed distributions, both the percentile bootstrap procedure and (3) can have inadequate probability coverage even with n = 101 (see Table 4). On the other hand, simulations using the percentile bootstrap procedure show that when using the largest eigenvalue based on the biweight midcovariance matrix, reasonably accurate 95% confidence intervals can be obtained, even with n = 25 (see Table 3 and Table 6). The results suggest that under normality, both the biweight midcovariance matrix and the usual sample covariance matrix give accurate estimates in terms of bias and MSE.
Under nonnormality, bias and MSE are generally much larger for the usual sample covariance matrix than for the biweight midcovariance matrix (see Table 7 and Table 8).

It is difficult to say what the most extreme value for skewness and kurtosis is in psychological research. Wilcox (1994) suggests that a skewness of 3 and a kurtosis of 36 are fairly high. Under normality (g = 0, h = 0) the values for skewness and kurtosis for both distributions were between 0 and 1. For the usual covariance, kurtosis values exceeded 36 in most situations where the distributions were nonnormal (see Table 9). For the biweight, the results indicate that for all sample sizes and all distributions, most values for skewness and kurtosis were under 2 (see Table 10). Thus, the resistant matrix produced distributions of estimated eigenvalues with very small values for skewness and kurtosis.

The analysis of the empirical data illustrates that resistant measures of covariance, such as the biweight midcovariance, can yield substantially different results than the usual covariance. For light-tailed distributions with no outliers, the biweight produced eigenvalues that were only slightly different from those of the usual covariance, and the proportion of variance/dispersion accounted for was exactly the same as for the usual covariance. In contrast, for heavy-tailed distributions, or when there were outliers, the usual sample covariance matrix produced inflated largest eigenvalues as compared to the biweight midcovariance matrix (see Table 11), but the proportion of variance/dispersion accounted for was much larger using the biweight midcovariance as input to PCA. This suggests that when the data have heavy tails or outliers, PCA may give a distorted view of the proportion of variance/dispersion accounted for by a factor, and robust methods should be considered.

CHAPTER 7
CONCLUSIONS

The following conclusions can be derived from this study:
1. The sample size necessary to obtain accurate 95% confidence intervals under normality with (3) or with the biweight (see Table 2) is at least 51 observations. The Winsorized covariance results in inaccurate probability coverage for moderate to high population correlations.
2. Equation (3) does not provide accurate probability coverage under nonnormality for the usual covariance matrix.
3. The percentile bootstrap method did not result in greatly improved estimates for the distribution of the largest eigenvalue under normality, and it was not very accurate for all correlations with regard to probability coverage. Equation (3) provided the most accurate results.
4. The percentile bootstrap procedure pulled in the distribution so that the confidence interval was shorter and more accurate. The pattern of outliers seemed very similar overall; thus, the bootstrap procedure alone is not very effective at dealing with the influence of outliers in the data. This study illustrated that there is a very small amount of bias using this method for this problem, and it is recommended.
5. Under normality, the probability coverage, bias, and MSE of the usual covariance matrix are quite acceptable; however, under nonnormality the biweight is far superior to the usual covariance as input to PCA. As the distributions get more heavy-tailed, the biweight provides more accurate probability coverage. The biweight is virtually unaffected by outliers and heavy tails.
6. Robust methods are quite useful in applied work.
The empirical study illustrates with published data that when the data are light-tailed, the results of the usual method can be replicated with the robust method. In the empirical data sets that are heavy-tailed, the biweight results in better estimates than the usual covariance. This is a two-step procedure: first, the biweight midcovariance matrix (Wilcox, in press b) is calculated in Minitab (Ryan & Joiner, 1994) and then entered into a principal components analysis routine in BMDP, SAS, or SPSS statistical software.

Suggestions for Further Research

1. Calculate a more effective Winsorized transformation that can be applied when correlations are high and low.
2. Perform simulations in a similar manner as in this study, but in place of estimating eigenvalues, estimate factor loadings to obtain robust estimates and accurate probability coverage of the distributions of interest. It would be of interest to see the effect of rotation on reproduction of the correlation matrix using a resistant measure of scale. The percentage bend correlation (Wilcox, in press a) may provide a more resistant measure for examining factor loadings.
3. Rule of thumb: If the data are light-tailed, use the usual methods. If there are outliers and heavy-tailed distributions, use the biweight midcovariance.

REFERENCES

American Psychiatric Association (1987). Diagnostic and statistical manual of mental disorders (3rd ed., revised). Washington, D.C.: Author. Ammann, L.P. (1989). Robust principal components. Communications in Statistics - Simulation and Computation. 18. 857-874. Anderson, T.W. (1951). The asymptotic distribution of certain characteristic roots and vectors. Proceedings of the Second Berkeley Symposium in Mathematical Statistics and Probability. Berkeley: University of Berkeley Press. Anderson, T.W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics. 34. 122-148. Anderson, T.W. (1984). An introduction to multivariate statistical analysis (2nd ed.). New York: John Wiley & Sons. Bartlett, M.S. (1954). Approximate confidence intervals. Biometrika. 40. 12-19. Beaton, A.E. & Tukey, J.W. (1974). The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics. 16. 147-185. Bickel, P.J. & Lehmann, E.L. (1976). Descriptive statistics for nonparametric models III. Dispersion. Annals of Statistics. 4. 1139-1158. Chen, H.J., Gnanadesikan, R. & Kettenring, J.R. (1974). Statistical methods for grouping corporations. Sankhya. 36. 1-28. Cliff, N. & Hamburger, C.D. (1967). The study of sampling errors in factor analysis by means of artificial experiments. Psychological Bulletin. 68. 430-445. Cliff, N. & Pennell, R. (1967). The influence of communality, factor strength, and loading size on the sampling distribution of factor loadings. Psychometrika. 32. 309-326. Cliff, N. (1970). The relation between sample and population characteristic vectors. Psychometrika. 35. 163-178. Cliff, N. (1987). Analyzing multivariate data. San Diego: Harcourt Brace Jovanovich Publishers. Coles, M., Gratton, G., & Fabiani, M. (1990). Event-related brain potentials. In J.T. Cacioppo & L.G. Tassinary (Eds.), Principles of psychophysiology (pp. 413-455). Cambridge: Cambridge University Press. Devlin, S.J., Gnanadesikan, R., & Kettenring, J.R. (1981). Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association. 76. 354-362. Diaconis, P.M. & Efron, B.S. (1983). Computer-intensive methods in statistics.
Scientific American. Mav. 116-130. Diciccio, T.J. & Romano, J.P. (1988). A review of bootstrap confidence intervals (with discussion). Journal of the Roval Statistical Society, 50. 338-354. Dixon, W.J. (1990). BMDP statistical software manual. Los Angeles: University of California Press. Duntemann, G.H. (1989). Principal components analysis. London: Sage Publications, Inc. Earleywine, M. (1994). Confirming the factor structure of the anticipated biphasic alcohol effects scale. Alcoholism: Clinical and Experimental Research. 18. 861-866. Eaton, M.L. & Tyler, D.E. (1991). On Wielandt’s inequality and its application to the asymptotic distribution of the eigenvalue of a random symmetric matrix. The Annals of Statistics. 19. 260-271. Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Regional Conference Series in Applied Mathematics. 38. Philadelphia: Society for Industrial and Applied Mathematics. Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association. 82. 171-185. Efron, B. (1990). More efficient bootstrap computations. Journal of the American Statistical Association. 85. 79-89. Efron, B. & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science. 1. 54- 75. 67 Freedman, D.A. & Diaconis, P. (1982). On inconsistent M-estimators. Annals of Statistics. 10. 454-461. Gerig, T.M. (1974). Robust principal components analysis using m-estimators. Unpublished manuscript. Gnanadesikan, R. (1977). Methods for statistical data analysis of multivariate observations. New York: John Wiley. Gnanadesikan, R. & Kettenring, J.R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics. 28. 81-124. Goldberg, K.M. & Iglewicz, B. (1992). Bivariate extensions of the boxplot. Technometrics. 34. 307-320. Goodall, C. (1983). M-estimators of location: An outline of the theory. In D.C. Hoagtin, F. Mosteller, & J. Tukey (Eds.), Understanding robust and exploratory data analysis, pp. 211-246. New York: Wiley. Gross, A.M. (1976). Confidence-interval robustness with long-tailed symmetric distributions. Journal of the American Statistical Association. 71. 409-416. Hall, P. (1986). On the bootstrap simulations required to construct a confidence interval. Annals of Statistics. 14. 1453-1462. Hampel, F.R, (1968). Contributions to the theory of robust estimation. Unpublished doctoral dissertation. University of California, Berkeley. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A. (1986). Robust statistics. New York: Wiley. Hoaglin, D.C., Mosteller, F. & Tukey, J.W. (Eds.) (1983). Understanding robust and exploratory data analysis. New York: Wiley. Hoaglin, D.C. (1985). Summarizing shape numerically: The g-and-h distributions. In D. Hoaglin, F. Mosteller & J.W. Tukey. (Eds.) Exploring data tables, trends and shapes. New York: Wiley. Hodges, J.L. Jr. (1967). Efficiency in normal samples and tolerance of extreme values for some estimates of location. Proceedings of the Fifth Symposium on Mathematical Statistics and Probability. 1. 163-186. 68 Hodges, J.L. Jr. & Lehmann, E.H. (1956). The efficiency of some nonparametric competitors of the t-test. Annals of Mathematical Statistics. 27. 324-335. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology. 24. 417-441. Huber, P.J. (1964). Robust estimation of a location parameter. 
Annals of Mathematical Statistics. 35. 73-101. Huber, P.J. (1977). Robust Covariances. In S.S. Gupta & D.S. Moore (Eds.), Statistical decision theory and related topics II. New York: Academic Press. Huber, P.J. (1981). Robust statistics. New York: Wiley. Iglewicz, B. (1983). Robust scale estimators and confidence intervals for location. In D.C. Hoaglin, F. Mosteller, & J.W. Tukey (Eds.), Understanding robust and exploratory data analysis, pp. 404-431. New York: Wiley. IMSL (1989). Fortran subroutines for statistical analysis. An unpublished work. Houston, TX: IMSL.Inc. Jin, Y., Potkin, G.G., Rice, D,, Sramek, J., Custer, J., I sen hart, R., Heh, C. & Sandman, C. (1990). Abnormal EEG responses to photic stimulation in schizophrenic patients. Schizophrenia Bulletin. 16. 627-634. Lambert, Z.V., Wildt. A.R., & Durand, R.N. (1991). Approximate confidence intervals for factor loadings. Multivariate Behavioral Research. 26. 421-434. Lax, D.A. (1985). Robust estimators of scale: Finite-sample performance in long tailed symmetric distributions. Journal of the American Statistical Association. 736-741. Lawley, D.N. (1956). Tests of significance for the latent roots of covariance and correlation matrices. Biometrika. 43. 128-136. Lawley, D.N. (1963). On testing a set of correlation coefficients for equality. Annals of Mathematical Statistics. 34. 149-151. LePage, R. & Billard, L. (1992). Exploring the limits of the bootstrap. New York: John Wiley & Sons. 69 Li, G. & Chen, Z. (1985). Projection-pursuit approach to robust dispersion and principal components: Primary theory and Monte Carlo. Journal of the American Statistical Association. 80. 759-766. Maronna, R.A. (1976). Robust M-estimators of multivariate location and scatter. Annals of Statistics. 4. 51-67. Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 105. 156-166. Morrison, D.P. (1990). Multivariate statistical methods. 3rd Ed. New York: McGraw-Hill. Mosteller, F. & Tukey, J.W. (1977). Data analysis and regression. Reading MA: Addison-Wesley. Muirhead, R.J. (1982). Aspects of multivariate statistical theory. New York; John Wiley & Sons. Naga, R.A. & Antilie, G. (1990). Stability of robust and non-robust principal components analysis. Computational Statistics & Data Analysis. 10. 169-174. Noreen, E.W, (1989). Computer-intensive methods for testing hypotheses. New York: John Wiley & Sons. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine. 6. 559-572. Pitman, E.J.G. (1949). Lecture notes on nonparametric statistics. New York: Columbia University Press. Quenouille, M. (1949). Notes on bias in estimation. Biometrika. 43. 353-360. Rasmussen, J.L. (1989). Data transformations, Type 1 error rate and power. British Journal of Mathematical and Statistical Psychology. 42. 203-214. Rosenberger, J.L. & Gasko, M. (1983). Comparing location estimators: trimmed means, medians and trimean. In D. Hoaglin, F. Mosteller, & J. Tukey (Eds), Understanding robust and exploratory data analysis, pp. 297-336. New York: Wiley 8 l Sons. Rousseeuw, P. (1984). Least median squares regression. Journal of the American Statistical Association. 79. 871-880. 70 Ryan, B,F. & Joiner, B.L, (1994). Minitab handbook (3rd Ed.) Belmont, CA: Duxbury Press. Sandman, C., Barron.J.L., Nackoul, K., Goldstein, J. & Fidler, P.L. (1993). Memory deficits associated with chronic fatigue immune dysfunction syndrome. Biological Psychiatry. 33. 618-623. Sawilowsky, S.S. & Blair, R.C. 
(1990). A more realistic look at the robustness of independent and dependent samples t tests to departures from population normality. Unpublished technical report. College of Education. Wayne Stale University. Shoemaker, L.H. & Hettmansperger, T.P. (1982). Robust estimates and tests for the one- and two-sample scale problem. Biometrika. 69. 47-53. Staudte, R.G. & Sheather, S J. (1990). Robust estimation and testing. New York: Wiley. Stigler, S.M. (1973). Simon Newcombe, Percy Daniel), and the history of robust estimation, 1885-1920, Journal of the American Statistical Association. 68. 872-879. Tanaka, Y. (1988). Sensitivity analysis in principal component analysis: influence on the subspace spanned by principal components. Communication in Statistics: Theory and Methods. 17. 3157-3175. Tanaka, Y. (1989). Influence functions related to eigenvalue problems which appear in multivariate methods. Communication in Statistics: Theory and Methods. 3991-4010. Tukey, J.W. (1956). Bias and confidence in not quite large samples. Annals of Mathematical Statistics. 29. 614. Tukey, J.W. (1960). A survey of sampling from contaminated normal distributions. In 1 . Olkin, et. al. (Eds.), Contributions to probability and statistics. Stanford. CA: Stanford Press. Verboon, P. (1993). Robust nonlinear regression analysis. British Journal of Mathematical & Statistical Psychology. 46. 77-94. Wilcox, R.R. (1990). Comparing the means of two independent groups. Biometrical Journal. 32. 771-780. 71 Wilcox, R.R. (1991), Bootstrap inferences about the correlation and variances of paired data. British Journal of Mathematical and Statistical Psychology. 44. 379-382. Wilcox, R.R. (1992a). Robust generalizations of classical test reliability and Cronbach’s alpha. British Journal of Mathematical and Statistical Psychology. 4 ^ 239-254. Wilcox, R.R. (1992b). Why can methods for comparing means have relatively low power, and what can you do to correct the problem? Current Directions in Psychological Science. 1(3). 101-105. Wilcox, R.R, (1993). Comparing the biweight midvariances of two independent groups. The Statistician. 45 . 29-35. Wilcox, R.R. (1994). Estimating Winsorized correlations in a univariate or bivariate random effects model. British Journal of Mathematical and Statistical Psychology. 47. 167-183. Wilcox, R.R. (in press a). The percentage bend correlation coefficient. Psvchometrika. Wilcox. R.R. (in press b). Statistics for the Social Sciences. New York: Academic Press. 12 APPENDIX I LISTING OF THE COMPUTER PROGRAMS 73 PROGRAM PARMBOOT C MAIN PROGRAM: A_2558.F - CALCULATES PRINCIPAL COMPONENTS C FOR THE USUAL COV MATRIX - ESTIMATED COV AS RSIGH C N = 25 NVAR=5 r = 8 REPS = 1000 BOOTSTRAP=299 C MV NORMAL DIST- PARAMETRIC BOOTSTRAP C DIMENSION VAR(5),COV(5,5),XB(5),HXB(5),HVAR(5) DIMENSION RVEC(25,5).X(25,5),RSIG(5,5) DIMENSION ECOV(5,5)fEIG(5),RSIGH(5,5) DIMENSION PCT(5),STD(5),EVEC(5,5),LOAD(5,5) DIMENSION BLOB(5),STAT(299),SUMBAR(1000) C REAL*8 DSEED COMMON /R2NCOM/DSEED DATA COV/1.,.8,.8,.8,.8, C 8,1.,.8..8,.8, C .8, .8.1. ,.8, .8. 
C .8,.8,.8,1.,8, C .8..8,.8,.8,1./ JJ =5 N =25 NG = .2*N NGM = N-NG ICOV-O NDF=N-I 1SEED= 12357 CALL RNSET(ISEED) I REP =1000 I BOOT=299 ICOUNT= 0 C C ***ESTIMATE COV WITH RSIG FOR EACH REP C ***DO CHOLESKY FACTORIZATION OF COV CALL CHFAC(JJ,COV.JJ, .00001 ,IRANK.RSIG,JJ) C ***BEGIN REPLICATION LOOP*** DO 1000 IR -U R E P CALL RNMVN(N,JJ,RSIG,JJ,X,N) DO 50 J=1,JJ HXB(J)=0.0 HVAR(J)=0.0 DO 51 JI = l.JJ RSIGH(J,JI)=0.0 51 CONTINUE 50 CONTINUE DO 52 J=1,JJ 74 DO 53 1*1,N HXB(J)=HXB(J)+X(U) 53 CONTINUE HXB(J)=HXB(J)/N 52 CONTINUE DO 54 J=1,JJ DO 55 1 = 1, N HVAR(J)=HVAR(J)+((X(I,J)-HXB(J))**2)/(N-1) 55 CONTINUE 54 CONTINUE DO 56 J=1,JJ DO 57 JW=1,JJ DO 58 1 = 1 .N RSlGH(J,JW)=RSiGH{J,JW) + (X(l,J)-HXB(J))*(X(lJW)-HXB(JW)) 58 CONTINUE 57 CONTINUE 56 CONTINUE DO 61 J = 1,JJ DO 62 1 1 = U J RSIGH( J. JI)« RSIGH(J, JI)/(N-1) 62 CONTINUE 61 CONTINUE C ***BEGIN BOOTSTRAP*** DO 200 IB= 1 ,1 BOOT C INITIALIZE VARIABLES DO 25 M -1.JJ XB(JA)=0.0 VAR(JA)=0.0 DO 24 JB = 1,JJ ECOV(JA.JB) *0.0 24 CONTINUE DO 4 1 = 1. N RVEC(l,JA)-0.0 4 CONTINUE 25 CONTINUE C ***GENERATE MULTIVARIATE NORMAL RANDOM VECTOR OF SIZE N BY JJ CALL RNMVN(N,JJ,RSIGH,JJ,RVEC,N) DO 10 J = 1.JJ DO 350 1 = 1, N XB(J) * XB(J)+RVEC(I,J) 350 CONTINUE XB(J)*XB(J)/N 10 CONTINUE DO 12 J = 1,JJ DO 33 1 = 1,N V AR(J) = V AR(J) + ((RVEC(I,J)-XB(J))**2)/(N-1) 75 33 CONTINUE 12 CONTINUE C CALCULATE COVARIANCE DO 13 J-I.JJ DO 14 IW= l.JJ DO 30 1 = 1 ,N ECOV(J,JW) = ECOV(J,JW)+(RVEC(I,J)-XB(J))*(RVEC(I,JW)-XB(JW)) 30 CONTINUE 14 CONTINUE 13 CONTINUE DO 16 J = 1,JJ DO 15 J I -U J ECOV(J,JI) = ECOV(J,JI)/(N-I) 15 CONTINUE 16 CONTINUE C CALL PRINC(NDF,JJ,ECOV,JJ,ICOV,BLOB,PCT,STDtEVEC,JJ,LOAD,JJ) DO 60 J=1,JJ EIG(J)=BLOB(J) BLOB(J)«0.0 60 CONTINUE STAT(IB)=EIG(1) BOOTSUM= BOOTSUM + STAT(IB) 200 CONTINUE BOOTBAR - BOOTSUM/REA L(IBOOD BOOTSUM =0.0 SUMBAR(IR)= BOOTBAR C **END BOOTSTRAP LOOP C **DO 95 Cl TEST ON 299 STATS (EVALS), STORE IN ICOUNT CALL S VRGN(IBOOT,ST AT, ST AT) CRITL= ST AT(6) CRITU =ST AT(292) IF(CRITL. LT.4.2. AND.CRJTU.GT.4.2)ICOUNT = ICOUNT +1 1000 CONTINUE WRITE(3,918)ICOUNT 918 FORMAT{lX,I5) WRITE(7,919)SUMBAR 919 F0RMAT(1X,F9.4) STOP END 76 PROGRAM PC BOOT C MAIN PROGRAM: ADJ2558.F - CALCULATES PRINCIPAL COMPONENTS C ON THE USUAL COVARIANCE MATRIX- BIAS CORRECTION C N =25 NVAR = 5 r = 8(4.20 pop value) REPS«1000 B=299 C NORMAL DIST ADJ-EVAL-HAT-BOOTBAR C DIMENSION VAR(5),COV(5,5),XB(5),XBAR(5),XVAR(5) DIMENSION RVEC(25,5),RSIG(5,5),U(25,5),UNIF(25) DIMENSION ECOV(5,5),XCOV(5,5),ECORR(5,5),XCORR(5,5) DIMENSION PCT(5),STD(5),EVEC(5,5),LOAD(5,5) DIMENSION BLOB(5) ,STORU(25,5), STAT(299) DIMENSION EIG(5),EVAL(1000),SUMBAR(1000),EIGEN(5) C REAL*8 DSEED COMMON /R2NCOM/DSEED DATA COV/1.,.8,.8,.8,.8, C .8,1.,8,.8,.8, C .8,8,1.,.8,8. C .8,.8,8,1.,.8. C .8,8,8,.8,1./ JJ=5 N =25 G =0.0 H =0.0 NG“ .2*N NGM=N-NG ICOV=0 WNDF=N-NG-1 NDF = N -1 ISEED = 12357 CALL RNSET(ISEED) [REP =1000 I BOOT=299 ICOUNT =0 ADJ=0.0 C C ***DO CHOLESKY FACTORIZATION CALL CHFAC(JJ,COV,JJ,.00001,IRANK,RSIG,JJ) C ***BEG1N REPLICATION LOOP*** DO 1000 IR-l.IREP CALL RNMVN(N,JJ,RSIG,JJ,RVEC,N) C ***BEG1N BOOTSTRAP LOOP DO 200 IB-l.IBOOT C INITIALIZE VARIABLES DO 25 J = 1,JJ 77 XBAR(J)»0.0 XVAR(J)=0.0 XB(J)=0.0 VAR(J)=0.0 DO 24 JI = 1,JJ ECOV(J,JI)=0.0 XCOV(J,JI)=0.0 24 CONTINUE 25 CONTINUE C ***GENERATE RANDOM UNIFORM VECTOR OF SIZE N CALL RNUN(N.UNIF) DO 26 IK = 1,N l = UNIF(IK)*N+l DO 27 J=1,JJ STORU(IK, J)= R VEC(!. 
J) 27 CONTINUE 26 CONTINUE DO 10 J = 1,JJ DO 350 1=1,N XB(J)= XB(J) + STORU(l,J) XBAR(J) = XBAR{J)+RVEC(I, J) 350 CONTINUE XB(J)*=XB(I)/N 10 CONTINUE DO 12 J-L JJ XBAR(J) = XBAR(J)/N DO 33 1 = 1, N VAR(I)= VAR(J)+((STORU(I,J)-XB(J))**2)/(N-l) XVAR(J)=XVAR(J)+((RVEC(I,J)-XBAR(J))**2)/(N-1) 33 CONTINUE 12 CONTINUE C CALCULATE COVARIANCE & WINSORIZED COVARIANCE DO 13 J-L JJ DO 14 JW -LJJ DO 30 I = I ,N ECOV(J.JW) — ECOV(J,JW)+(STORU(I,J)-XB(J))*(STORU(I,JW)-XB(JW)) XCOV(J.JW)=XCOV(J,JW)+(RVEC(I,J)-XBAR(J))*(RVEC(I,JW)-XBAR(JW)) 30 CONTINUE 14 CONTINUE 13 CONTINUE DO 16 J-L JJ DO 15 JI-L JJ ECOV(J,JI)=ECOV(J,JI)/(N-l) XCOV(J,JI) - XCOV(J,JI)/(N-1 ) 15 CONTINUE 16 CONTINUE 78 C** CALCULATE CORRELATION MATRICES C** INITIALIZE VARIABLES DO 17 J=I,JJ DO 18 JI*t,JJ ECORR(J,JI)=0.0 XCORR(J.Jl) *0.0 18 CONTINUE 17 CONTINUE DO 19 J = I,JJ DO 20 J1 = 1 ,JJ ECORR(J.JI) »(ECOV(J,JI))/ C (SQRT(VAR(J))*(SQRT(VAR(JI)))) XCORR(J.JI)=(XCOV(J,JI))/ C (SQRT(XVAR(J))*(SQRT(XVAR(JI)))) 20 CONTINUE 19 CONTINUE C CALL PRJNC(NDF,JJ,ECOV,JJ,ICOV,BLOB,PCT,STD,EVEC.JJ,LOAD,JJ) DO 60 J*=1,JJ EIG(J)=BLOB(J) BLOB(J)~0.0 60 CONTINUE STAT(IB)«EIG(I) BOOTSUM - BOOTSUM+STAT(IB) 200 CONTINUE BOOTBAR= BOOTSUM/REALGBOOT) BOOTSUM =0.0 SUMBAR(IR)=BOOTBAR CALL PRlNC(NDF,JJ,XCOV,JJ,ICOV,BLOB,PCT,STD,EVEC.JJ,LOAD,JJ) DO 62 J-1,JJ EIGEN(J)= BLOB(J) BLOB(J)-0.0 62 CONTINUE EVAL(IR)=EIGEN(1) A DJ = (E V A L(IR)-S UMB AR(IR)) C **DO 95 Cl TEST ON 299 STATS (EVALS), STORE IN ICOUNT CALL SVRGN(lBOOT,STAT,STAT) CRITL= ST AT(6) + ADJ CRITU - ST AT(292)+ ADJ IF(CRITL. LT.4.2. AND.CRITU.GT.4.2)ICOUNT - ICOUNT +1 1000 CONTINUE WRITE(3,918)ICOUNT 918 FORMAT(lX,15) STOP END 79 PROGRAM PCAWIN C MAIN PROGRAM: EVWI005.F - CALCULATES PRINCIPAL COMPONENTS C ON THE WINSORIZED COVARIANCE MATRIX (USUAL COV ALSO) C NOTE N — 100 NVAR =4 REPS =1000 C DIMENSION X(100,4),VAR(4),COV(4,4), XB(4), WXB(4) DIMENSION RVEC(100,4),RSIG(4,4).U(100,4),WVAR(4) DIMENSION ECOV(4,4),WCOV(4,4),ECORR(4,4),WCORR(4,4) DIMENSION EVAL( 1000,4),PCT(4),STD(4),EVEC(4,4) DIMENSION BLOB(4),LOAD(4,4) C REAL*8 DSEED COMMON /R2NCOM/DSEED DATA COV/1 .,.5,.5,.5, C .5,1. ,.5, ,5, C .5,.5,1 .,.5, C .5,.5,.5,1./ JJ =4 N = 100 NG = .2*N NGM =N-NG G =0 H =0 ICOV =0 WNDF=N-NG-1 NDF=N-1 ISEED = I2357 CALL RNSET(ISEED) 1REP= 1000 C CALL CHFAC(JJ,COV,JJ,.00001 ,IRANK,RSIG,JJ) DO 1000 IR-l.IREP C INITIALIZE VARIABLES HERE DO 300 J=1,JJ WXB(1)=0.0 WVAR(J)=0.0 XB(J)=0.0 VAR(J)=0.0 DO 300 il-l,J J ECOV(J,JI)=0.0 WCOV(J,JI)=0.0 300 CONTINUE CALL RNMVN(N,JJ,RSIG,JJ,RVEC,N) DO 10 J=I,JJ C **G AND H DISTRIBUTION 80 DO 350 I-1,N RVEC(I,J)=RVEC(I,J)*EXP(H*RVEC(l,J)**2/2.) C RVEC(I,J)=(EXP(G*RVEC(I,!))-! 
)*EXP(H*RVEC(l.J)**2/2.)/G X(U)=RVEC(I,D XB(J)= XB(J)+ X {1, J) 350 CONTINUE C SORT RVEC INTO RVEC AND CALCULATE WINSORIZED VALUES CALL SVRGN(N,RVEC( 1 ,J),RVEC(I,!)) XB(J)=XB(J)/N BOT~RVEC(NG + LD TOP=RVEC(NGM,J) DO II I= * 1,N lF(X(LJ).LE.BOT)U(LD-BOT IF(X(LJ).GE.TOP)U(I,J)-TOP IF(X(IJ).GT.BOT.AND.X(LD LT.TOP)U(I.J)«X(I,J) WXB(D= WXB(J) + U(I ,D 1 1 CONTINUE 10 CONTINUE DO 12 J-I.JJ WXB(J)= WXB( J)/N DO 33 I-l,N VAR(D= VAR(D+((X(I,J)-XB(D)**2)/(N-I) WVAR(J)= WVAR(J) + ((U(l, J)-WXB( J» * *2)/(N-1 > 33 CONTINUE 12 CONTINUE C C CALCULATE COVARIANCE & WINSORIZED COVARIANCE DO 13 J= l,ll DO 14 JW =IJJ DO 30 1= UN ECOV(LJW)«ECOV(J,JW)+(X(U)-XB<I))*(X(I,lW)-XB(JW)) WCOV(J,iW) = WCOV(J,iW)+(U(LD-WXB(J))*(U(LJW)-WXB(JW)) 30 CONTINUE 14 CONTINUE 13 CONTINUE DO 16 J-U D DO 15 JI-U JJ ECOV(J, JI) - ECOV(J,JI)/(N-1 ) WCOV (J ,11)=W COV( J, JI)/(N-1 ) 1 5 CONTINUE 16 CONTINUE C C CALCULATE CORRELATION MATRICES C INITIALIZE VARIABLES DO 17 J-U D EVAL(IR,J)=0.0 8 1 DO 18 JI = 1,JJ ECORR(J,JI)«0.0 WCORR(J, JI)=0.0 18 CONTINUE 17 CONTINUE DO 19 J = 1,JJ DO 20 JI = 1,JJ ECORR(J.JI) *(ECOV(J,/l))/ C (SQRT(VAR(J))*(SQRT(VAR(JI)))) WCORR(J,JI>=(WCOV(J,JI»/ C (SQRT(WVAR(J))*{SQRT(WVAR(JI)))) 20 CONTINUE 19 CONTINUE C CALL PRINC(NDF,JJ,WCOV,JJJCOV,BLOB,PCT,STD,EVEC,JJ,LOAD,JJ) C DO 76 J = l.iJ EVAL(lR,J) = BLOB(J) 76 CONTINUE 1000 CONTINUE WRITE(6,930) ((EVAL(IR,J). J = 1 ,JJ),IR = 1 ,IREP) 930 FORMAT(lX,4F8.4) STOP END 82 PROGRAM PCAUCOV C MAIN PROGRAM: GHB5I52.F - CALCULATES PRINCIPAL COMPONENTS C USING THE USUAL COV MATRIX -g-AND-h dist to nonnormalize C N = 51 NVAR=5 REPS-1000 B=299 C DIMENSION VAR(5),COV(5,5),XB(5) DIMENSION RVEC(51,5),RSIG(5,5),U(51,5), UNIF(51 ) DIMENSION ECOV(5,5),ECORR(5,5) DIMENSION PCT(5),STD(5),EVEC(5,5),LOAD(5,5) DIMENSION BLOB(5),STORU(51,5),STAT(299) DIMENSION EIG{5),GHVEC(51,5),SUMBAR(IOOO) C REAL*8 DSEED COMMON /R2NCOM/DSEED DATA COV/1.,.2,.2,.2,.2, C .2.I...2..2..2. C .2, .2.1. ,.2..2, C .2,.2,.2,1,,,2, C .2,.2,.2,.2,l./ JJ =5 N = 51 G =0 H = .5 NG = .2*N NGM=N-NG ICOV«0 NDF=N-1 ISEED= 12357 CALL RNSET(1SEED) 1 REP-1000 I BOOT=299 ICOUNT=0 C ***DO CHOLESKY FACTORIZATION CALL CHFAC(JJ,COV,JJ,.00001 ,IRANK,RSIG,JJ) C ***BEGIN REPLICATION LOOP*** DO 1000 IR»1,IREP CALL RNMVN(N,JJ,RSIG,JJ,RVEC,N) DO 70 J = 1 ,JJ DO 71 I-l.N GHVEC(I,J)«RVEC(I,J) RVEC(1,J)=0.0 71 CONTINUE 70 CONTINUE DO 5 J = I,JJ D 06 I-l.N 83 GHVEC(I,J)=GHVEC(I,J)*EXP(H*GHVEC(I,J)**2/2.) C GH V EC(I, J)= (EXP(G*GHVEC(I, J))-1) C *EXP(H *GH VEC(I, J) * *2/2. 
)/G 6 CONTINUE 5 CONTINUE C " ‘" ‘ "'BEGIN BOOTSTRAP LOOP DO 200 IB= 1, I BOOT C INITIALIZE VARIABLES DO 25 J-L JJ XB(J)=0.0 VAR(J)=0.0 DO 24 J1 = 1 ,JJ ECOV(J,JI)=0.0 24 CONTINUE 25 CONTINUE C " ‘ ♦"‘ GENERATE RANDOM UNIFORM VECTOR OF SIZE N CALL RNUN(N.UNIF) DO 26 IK* 1,N I = UNIF(IK)*N + I DO 27 J -t.J J STORU(IK, J) - GH VEC(I.J) 27 CONTINUE 26 CONTINUE DO 10 DO 350 I=LN XB(I) = XB(J) + STORU(I, J) 350 CONTINUE XB<J)-XB(J)/N 10 CONTINUE DO 12 J -l.J J DO 33 I-l.N VAR(J)» VAR(J) +((STORU(I.J)-XB(J))"'*2)/(N-l) 33 CONTINUE 12 CONTINUE C CALCULATE COVARIANCE DO 13 J -l.J J DO 14 JW -l.JJ DO 30 I-l.N ECOV(J.JW)— ECOV(J,JW)+(STORU(I,J)-XB(J)) C *(STORU(LJW)-XB(JW)) 30 CONTINUE 14 CONTINUE 13 CONTINUE DO 16 J-L JJ DO 15 JI-L JJ ECOV(J,JI) = ECOV(J.Jl)/(N-l) 15 CONTINUE 16 CONTINUE C** CALCULATE CORRELATION MATRICES C** INITIALIZE VARIABLES DO 17 J=1,JJ DO 18 Jl— 1,JJ ECORR(J,JI)=0.0 18 CONTINUE 17 CONTINUE DO 19 J = 1,JJ DO 20 II = 1,JJ ECORR(J,JI)=(ECOV{J,JI))/ C (SQRT(VAR(J))*(SQRT(VAR(JI)))) 20 CONTINUE 19 CONTINUE C CALL PRINC(NDF,JJ,ECOV,JJ,lCOV,BLOB,PCT,STD,EVEC.JJ.LOAD.JJ) DO 60 J=1,JJ EIG(J)=BLOB(J) BLOB(J)=0.0 60 CONTINUE STAT(1B) = EIG<1) BOOTSUM = BOOTSUM+STAT(IB) 200 CONTINUE BOOTBAR= BOOTSUM/REAL<IBOOT) BOOTSUM =0.0 SUMBAR(IR)=BOOTBAR C **DO 95 Cl TEST ON 299 STATS (EVALS), STORE IN ICOUNT CALL SVRGN(IBOOT,STAT,STAT) CRITL=ST AT(6) CRITU = STAT(292) IF(CRITL.LT. I 23.AND.CRITU.GT. 1 23)ICOUNT=ICOUNT+1 1000 CONTINUE WRITE(3,918)ICOUNT 918 FORMAT(lX,I5) WRITE(7,919)S UMBAR 919 F0RMAT(1X,F9.4) STOP END PROGRAM PCABiWGHT C MAIN PROGRAM: GHE5155.F - CALC PRIN COMP FOR THE BIWEIGHT MIDCOVARIANCE C *add g and h distribution + unif Bootstrap*4 ' DIMENSION X(51,5),XMED(5), VAR(5),COV(5,5),XB(5),DIFFVEC(51,5) DIMENSION ABDIFF(51,5),MADDIV(51,5),ABMAD(51,5),RVEC(51,5) DIMENSION TOP(51,5),BOT(51,5),TSUM(5).BSUM(5),MAD(5).BIWGT(5) DIMENSION TERM 1(51,5,5),TERM2(51,5),SUM 1 (5.5),RSIG(5,5) DIMENSION SUM2(5),BICOV(5,5),BICORR(5,5),BLOB(5),GHVEC(51.5) DIMENSION EVAL(lOOO,5),PCT(5),STD(5),EVEC(5,5),LOAD(5,5) DIMENSION UN1F(51 ),STORU(51,5),STAT(299),EIG(5),SUMBAR( 1000) C REAL*8 DSEED,MAD,MADDIV COMMON /R2NCOM/DSEED DATA COV/1.,.5,.5,.5,.5, C .5.1,5,.5,.5, C .5,.5,1.,.5,.5, C .5,5,.5.1,5. C .5,5,.5,5.1./ JJ-5 N =51 MN=(N + l)/2 G =0.0 H=0 ICOV=0 NDF = N -1 ISEED= 12357 CALL RNSET(ISEED) 1REP = 1000 1 BOOT = 299 ICOUNT-0 C CALL CHFAC(JJ.COV,JJ, .00001 ,IRANK,RSIG,JJ) C **BEGIN REPLICATION LOOP** DO 1000 IR ** l.IREP CALL RNMVN(N,JJ,RSIG,JJ,RVEC,N) DO 15 J=I,JJ DO 350 1= 1,N RVEC(U)«RVEC(I,J)*EXP(H*RVEC(U)**2/2.) 
C RVEC(],J)«(EXP(G*RVEC(I,J))-l)*EXP(H*RVEC(I,J)**2/2.)/G GHVEC(I.J)-RVEC(I,J) RVEC(I,J)-0.0 350 CONTINUE 15 CONTINUE 86 C **BEGIN BOOTSTRAP LOOP** DO 200 IB-IJBOOT C INITIALIZE VARIABLES HERE DO 300 J = 1.JJ XB(J)«0.0 VAR(J)-0.0 TSUM(J)=0.0 BSUM(J)=0.0 DO 21 I-l.N DIFFVEC(I, J) =0.0 ABDIFF(I,J) — 0.0 21 CONTINUE 300 CONTINUE C **GENERATE RANDOM UNIFORM VECTOR OF SIZE N** CALL RNUN(N.UNIF) DO 26 IK = t ,N I =UNIF(IK)*N+ 1 DO 27 J-1.JJ STORU(IK,J)=GHVEC(U) 27 CONTINUE 26 CONTINUE DO 10 J = I,JJ CALL SVRGN(N,STORU( 1 ,J),X( 1 ,J)) XMED(J)=X(MN,J) DO II I-l.N XB(J) - XB(J) +STORU(l ,J) 11 CONTINUE XB(J)=XB{J)/N DO 1 2 I-l.N VAR(J) - VAR(J) + ((STORU(I,J)-XB(J))**2)/(N-1) 12 CONTINUE C CALCULATE DIFFVEC DO 24 1-1, N DIFFVEC(IJ) - DIFFVEC(I.J)+(STORU(l.J)-XMED{J» ABDIFF(LJ) - ABS(DIFFVEC(I,J)) 24 CONTINUE C SORT ABDIFF CALL SVRGN(N,ABDIFF( 1 ,J),ABDIFF( 1 ,J)) C CALCULATE MAD*9 M AEHJ) - ABDIFF(MN.J)*9 DO 17 I-l.N MADDIV(I,J)=(DIFFVEC(I,J))/(MAD(J)) 17 CONTINUE DO 218 I — I.N ABMAD(I,J) = ABS(MADDIV(LJ)) 218 CONTINUE 87 C COMPUTE FOR ABMAD LE t DO 33 l-l.N IF((ABMAD(LJ).LE.l)) THEN TOP(I,J)-(DIFFVEC(I,J)**2)* C (1-(MADDIV(I,J)*MADDIV(I,J)))**4 BOT(l,J)=<HMADDIV(I,J)*MADDIV(U)))* C (1-(5*(MADDIV([,J)*MADDIV(LJ)))) ENDIF 33 CONTINUE DO 35 NUM=1,N TSUM(J)=TSUM(J)+TOP(NUM,J) BSUM(J)=BSUM(J)+BOT(NUM.J) 35 CONTINUE BIWGT(J) * N *TSUM(J)/BSUM(J)**2 10 CONTINUE C BIWEIGHT MIDCOVARIANCE AND MIDCORRELATION C INITIALIZE VARIABLES HERE DO 223 JA = 1,JJ SUM2(JA)=0.0 DO 224 JB= 1,JJ SUM1(JA,JB)=0.0 SUM2(JB)=0.0 B1C0V(JA,JB) =0.0 BICORR(JA,IB)=0.0 DO 225 1 = 1,N TERM l(I.JA.JB) =0.0 TERM2(I,JA)=0.0 225 CONTINUE 224 CONTINUE 223 CONTINUE C CALCULATE BIWEIGHT MIDCOVARIANCE DO 65 JA = 1,JJ DO 402 I-l.N IF(ABMAD(I,JA).LE.l) THEN TERM2(I,M)=(MMADDIV(I,M)*MADDIV(UA)))* C (1 -(5*(MADDIV(LJ A)*M ADDIV(I,J A)))) ELSE TERM2(I,JA) =0.0 ENDIF 402 CONTINUE DO 66 IB = I ,JJ DO 67 1= I,N IF(ABMAD(I,JA).LE.l.AND.ABMAD(I,JB).LE.l) THEN TERM1(I,JA,JB)=(DIFFVEC(I,JA)*DIFFVEC(LJB))* C ((HMADDIV(I,JA)*MADDIV(I,JA)))**2)* C (l-(MADDIV(I,JB)*MADDIV(I,JB)))**2 88 ELSE TERM 1(I,JA,JB) =0.0 ENDIF 67 CONTINUE 66 CONTINUE 65 CONTINUE DO 50 JA = 1,JJ DO 415 1=1, N SUM2(JA)=SUM2(JA) + TERM2(1,JA) 415 CONTINUE DO 42 JB= I ,JJ DO 43 I«1,N SUM 1(JA,JB)=SUM 1(JA,JB)+TERM1(I,JA,JB) 43 CONTINUE 42 CONTINUE 50 CONTINUE DO 53 JA = I,JJ DO 54 JB= 1 ,JJ BICOV(JA.JB) = N*(SUM1(JA,JB))/(SUM2(JA)*SUM2(IB)) 54 CONTINUE 53 CONTINUE C CALCULATE BIWEIGHT MIDCORRELATION DO 60 JA=1 JJ DO 61 JB= l.IJ BlCORR(JA,iB)=(BICOV(JA.JB»/ C (SQRT(BIWGT(JA))*(SQRT(BIWGT(JB)))) 61 CONTINUE 60 CONTINUE C CALL PRINC(NDF,JJ,BICOV,JJ,ICOV,BLOB,PCT,STD,EVEC,JJ,LOAD,JJ) C DO 76 J-l.JJ EIG(J) = BLOB(J) BLOB(J)=0.0 76 CONTINUE STAT(IB)=EIG(1) BOOTSUM = BOOTSUM +STAT(IB) 200 CONTINUE BOOTBAR= BOOTSUM/REAL(IBOOT) BOOTSUM =0.0 SUMBAR(IR)=BOOTBAR C **95% Cl TEST ON 299 STATS (EIGENVALUES) CALL S VRG N(lBOOT,ST AT, ST AT) CRITL»STAT(6) CRITU =STAT(292) IF(CRITL. LT.2.98. AND.CRJTU.GT.2.98)ICOUNT=ICOUNT+1 89 1000 CONTINUE WRITE(3*930)ICOU NT 930 FORMAT(lX,I5) WRITE(7,969)SUMBAR 969 F0RMAT(1X,F9,4) STOP END