Comparing Skipped Correlations: The Overlapping Case

by

Lai Xu

A Thesis Presented to the
FACULTY OF THE USC DORNSIFE COLLEGE OF LETTERS, ARTS, AND SCIENCES
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF ARTS (Psychology)

May 2021

Copyright 2021 Lai Xu

TABLE OF CONTENTS

List of Tables
Abstract
Introduction
Correlation Methods
    Spearman's rho
    Kendall's tau
    Winsorized Correlation
Skipped Correlation
BCa Confidence Interval
Design of the Simulation Study
Simulation Results
An Illustration
Conclusions
Bibliography
Appendix A
    BCa Method

List of Tables

Table 1: Some properties of the g-and-h distributions
Table 2: Estimated Type I error probabilities, overlapping case, first approximation technique
Table 3: Estimated Type I error probabilities, overlapping case, second approximation technique
Abstract

Consider random variables Y, X₁ and X₂ and let τⱼ be some measure of association between Y and Xⱼ (j = 1, 2). It is well known that Pearson's correlation is highly sensitive to outliers. Kendall's tau, Spearman's rho and the Winsorized correlation deal with outliers among the marginal distributions, but they do not deal with outliers based on the overall structure of the data cloud, which can lead to misleading results. This paper deals with testing H₀: τ₁ = τ₂ when τ is a measure of association that takes into account the overall structure of the data cloud when dealing with outliers. The proposed methods are based on a projection-type skipped correlation (OP correlation) used in conjunction with the bias-corrected and accelerated bootstrap (BCa) method. Comments on related methods are included.

Introduction

Let Y, X₁ and X₂ denote three random variables having some unknown multivariate distribution and let τⱼ be some measure of association between Y and Xⱼ (j = 1, 2). A basic goal is to test

$$H_0: \tau_1 = \tau_2. \qquad (1)$$

In particular, to what extent is it reasonable to make a decision about whether τ₁ is less than or greater than τ₂ based on a random sample?

Numerous methods have been proposed for testing (1). Most of the literature has focused on Pearson's correlation (e.g., Dunn and Clark, 1969; Dunn and Clark, 1971; Hittner, May, and Silver, 2003; Meng, Rosenthal, and Rubin, 1992; Steiger, 1980; Wilcox, 2009; Williams, 1959; Zou, 2007). However, it has long been known that Pearson's correlation is not robust. For example, even an arbitrarily small departure from a bivariate normal distribution can alter its value substantially. Moreover, the usual estimate of Pearson's correlation, r, has a breakdown point of only 1/n, where n is the sample size; that is, even a single outlier can substantially alter its value. More formally, the influence function of Pearson's correlation,

$$\mathrm{IF}(x, y) = xy - \frac{x^2 + y^2}{2}\rho,$$

is unbounded (Devlin, Gnanadesikan, & Kettenring, 1981). The influence function is an important tool for measuring the robustness of a statistical measure (Hampel, Ronchetti, Rousseeuw, & Stahel, 2011); the unbounded influence function of Pearson's correlation indicates a lack of robustness.

Three well-known alternatives to Pearson's correlation are Spearman's rho (Spearman, 1961), Kendall's tau (Kendall, 1938) and the Winsorized correlation (Wilcox, 1993). However, all three of these measures of association only guard against outliers among the marginal distributions. It is readily illustrated that outliers, properly placed, can substantially alter the resulting estimates, because these correlation methods do not deal with outliers in a manner that takes into account the overall structure of the data cloud. A short summary of these three correlations is given in the next section.

[Figure 1: This scatterplot illustrates that outliers, properly placed, can have a large impact on Kendall's tau. The data were generated from normal marginal distributions with σ = 1.5. The two points in the lower right corner are outliers. Kendall's tau is 0.049 when the two outliers are removed. However, when using all data points, Kendall's tau is 0.031 and yields a non-significant result when testing H₀: τ = 0 at the α = 0.05 level via a percentile bootstrap method.]

A general way of improving Spearman's rho, Kendall's tau and the Winsorized correlation is to use a skipped correlation (Wilcox, 2017). Roughly, a skipped correlation estimator, r_p, is obtained by removing outliers using some method that takes into account the overall structure of the data cloud. Then something like Pearson's r, Kendall's tau, Spearman's rho, or the Winsorized correlation is computed based on the remaining data. There are a variety of outlier detection methods that take into account the overall structure of a data cloud (Wilcox, 2017).
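To make the idea concrete, the following minimal R sketch (not from the thesis) shows the generic form of a skipped correlation: an outlier detector that respects the overall data cloud flags points, and the chosen correlation is computed on what remains. The names skipped_cor and detect_fun are illustrative placeholders.

```r
# Illustrative sketch of the skipped-correlation idea: remove points flagged by an
# outlier detector that uses the overall data cloud, then correlate what remains.
# 'detect_fun' is a placeholder for such a detector (e.g., the projection method
# described later), returning the indices of flagged points.
skipped_cor <- function(x, y, detect_fun, method = "kendall") {
  out <- detect_fun(cbind(x, y))             # indices of flagged points
  keep <- setdiff(seq_along(x), out)
  cor(x[keep], y[keep], method = method)     # Pearson, Spearman, or Kendall on the rest
}
```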
A classic method is based on the Mahalanobis distance (Mahalanobis, 1936), which measures the distance between a point x and the sample mean:

$$d^2 = (x - \bar{X})' S^{-1} (x - \bar{X}),$$

where $\bar{X}$ is the sample mean and S is the sample covariance matrix. The Mahalanobis distance can be converted to the Mahalanobis depth as

$$MD(x) = [1 + (x - \bar{X})' S^{-1} (x - \bar{X})]^{-1},$$

which shows that the depth is the reciprocal of the distance plus one. Thus, the closer a point is to the sample mean, as measured by Mahalanobis distance, the larger is its Mahalanobis depth. The Mahalanobis distance method then groups points with the same distance from the mean into respective ellipses. Roughly, the inner ellipse contains about half of the points, and the outer ellipse contains about 97.5% of the points in a scatter plot. Points that lie outside the outer ellipse are declared outliers. However, there are two main concerns with this approach. First, it suffers from masking, which refers to failing to detect outliers due to their very presence; this occurs because the conventional mean and covariance matrix are themselves highly sensitive to outliers. Second, in essence, the Mahalanobis distance is reasonable only when sampling is from an elliptically contoured distribution.

One general strategy for addressing the first concern is to replace the mean and covariance in the Mahalanobis distance with robust measures of location and scatter, which can be achieved by using the minimum volume ellipsoid (MVE) estimator (Leroy & Rousseeuw, 1987). In the bivariate case, the MVE estimator searches for the ellipse containing half of the data that has the smallest area; when there are more than two variables, it searches for the ellipsoid having the smallest volume. The MVE estimator then computes the mean and covariance matrix based on this subset. Furthermore, MVE rescales the covariance matrix to estimate the usual covariance matrix under normality.

An alternative to the MVE estimator that has a relatively high breakdown point is the minimum covariance determinant (MCD) estimator (Rousseeuw, 1984). The MCD estimator searches for the subset of half the data that has the smallest generalized variance. Because it is impractical to consider all subsets of half the data, an algorithm (Rousseeuw & Driessen, 1999) has been proposed to approximate the subset that minimizes the generalized variance. The mean and covariance matrix based on this subset are highly robust in the sense that they achieve the highest possible breakdown point, 0.5, meaning that half of the values must be altered to make the estimators arbitrarily large or small. The Mahalanobis distance can then be computed based on the MCD estimator.

As for the second concern, a general approach is to replace robust analogs of the Mahalanobis distance with methods that measure the depth of a point in a data cloud without assuming that the distribution is elliptically contoured. Examples are halfspace depth (Tukey, 1975), simplicial depth (Liu, 1990), generalizations of other projection-based approaches (Zuo et al., 2003), and many more. More details can be found in Wilcox (2017). No single outlier detection method dominates, but ones that perform relatively well are based on projection-type techniques. The current paper uses two projection-type methods to detect outliers; more details are described in the Skipped Correlation section.
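As a contrast with the projection approach used later, here is a minimal R sketch (not part of the thesis) of the classical Mahalanobis check just described, together with an MCD-based version that is less prone to masking. It flags points whose squared distance exceeds the 0.975 chi-squared quantile, mirroring the 97.5% ellipse above, and uses base R's mahalanobis() with MASS::cov.rob(); the names flag_mahalanobis and dat are illustrative.

```r
# Minimal sketch (not the method used in this thesis): flag outliers with the
# classical Mahalanobis distance and with an MCD-based robust version.
library(MASS)  # for cov.rob()

flag_mahalanobis <- function(x, robust = FALSE) {
  x <- as.matrix(x)
  p <- ncol(x)
  if (robust) {
    est <- cov.rob(x, method = "mcd")        # MCD location and scatter
    ctr <- est$center; S <- est$cov
  } else {
    ctr <- colMeans(x); S <- cov(x)          # classical mean and covariance
  }
  d2 <- mahalanobis(x, center = ctr, cov = S)  # squared distances
  d2 > qchisq(0.975, df = p)                 # TRUE = declared an outlier
}

# Example: two placed outliers, similar in spirit to Figure 1
set.seed(1)
dat <- cbind(X = rnorm(40), Y = rnorm(40))
dat <- rbind(dat, c(2, -2), c(2.2, -2.1))
which(flag_mahalanobis(dat))                 # classical version may miss them (masking)
which(flag_mahalanobis(dat, robust = TRUE))  # MCD version is less prone to masking
```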
One way of testing (1) is via a percentile bootstrap method. Often this approach performs reasonably well when dealing with estimators that are relatively insensitive to outliers. However, preliminary simulations indicated that this is not the case for the situation at hand. The goal in this paper is to report simulation results when using instead a bias-corrected and accelerated (BCa) bootstrap method.

The remainder of the paper is organized as follows. The next section provides descriptions of the three correlation methods used here, namely Spearman's rho, Kendall's tau and the Winsorized correlation, which serve as alternatives to Pearson's r in the skipped correlation. Then the details of the skipped correlation are outlined. This is followed by a description of a basic percentile bootstrap method. The BCa method, which is used here, is an attempt to improve on the basic percentile bootstrap; details on the BCa confidence interval can be found in Appendix A.

Correlation Methods

Spearman's rho

Spearman's rho is Pearson's correlation between the ranked values. Given a set of observed values (X₁, Y₁), ..., (Xₙ, Yₙ), convert X₁, ..., Xₙ and Y₁, ..., Yₙ to ranks, and compute Pearson's correlation based on these ranks. Spearman's rho evaluates the monotonic relationship between variables. If there are no repeated values, a Spearman's rho of +1 or −1 means there is a perfect monotonically increasing or monotonically decreasing relationship.

Kendall's tau

Consider two pairs of observed data (X₁, Y₁) and (X₂, Y₂). If X₁ < X₂ and Y₁ < Y₂, or if X₁ > X₂ and Y₁ > Y₂, the two pairs of observations are said to be concordant. That is, if Y increases as X increases, or if Y decreases as X decreases, they are concordant. On the other hand, if Y increases as X decreases, or Y decreases as X increases, they are discordant. In general, if the ith and jth pairs of points are concordant, let Kᵢⱼ = 1; if they are discordant, let Kᵢⱼ = −1. Kendall's tau is estimated as the average of all Kᵢⱼ for i < j:

$$\hat{\tau} = \frac{2\sum_{i<j} K_{ij}}{n(n-1)}.$$

A Kendall's tau of +1 or −1 indicates a perfectly concordant or discordant relationship between the two variables. A positive Kendall's tau indicates a tendency toward concordance, and when Kendall's tau is negative, the reverse is true.

Winsorized Correlation

A Winsorized correlation is Pearson's correlation based on Winsorized values. Consider a set of observed values (X₁, Y₁), ..., (Xₙ, Yₙ). To Winsorize the observed values, first order X₁, ..., Xₙ and select a certain proportion (e.g., 10%, 20%) of the smallest and largest X values. Instead of trimming these values, set them equal to the smallest or largest value that is not trimmed. Repeat this process for the Y values. The Winsorized correlation is then Pearson's correlation computed on this Winsorized data set. To be more precise, let Xᵢⱼ be a random sample from some bivariate distribution (i = 1, ..., n; j = 1, 2). Let γ (0 ≤ γ ≤ 0.5) denote the amount of Winsorizing and let g = [γn] denote γn rounded down to the nearest integer. Let X_(1)j ≤ X_(2)j ≤ ··· ≤ X_(n)j be the n values in the jth group written in ascending order, and let

$$W_{ij} = \begin{cases} X_{(g+1)j} & \text{if } X_{ij} \le X_{(g+1)j} \\ X_{ij} & \text{if } X_{(g+1)j} < X_{ij} < X_{(n-g)j} \\ X_{(n-g)j} & \text{if } X_{ij} \ge X_{(n-g)j}. \end{cases}$$

The γ-Winsorized correlation is Pearson's correlation based on the Wᵢⱼ values.
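A minimal R sketch of the γ-Winsorized correlation just defined is given below, with the rank-based alternatives shown for comparison; the function wincor_simple is illustrative and is not the Rallfun implementation referenced in this thesis.

```r
# Minimal sketch of the gamma-Winsorized correlation defined above
# (illustrative; not the Rallfun implementation).
wincor_simple <- function(x, y, gamma = 0.2) {
  n <- length(x)
  g <- floor(gamma * n)                      # number Winsorized in each tail
  winsorize <- function(v) {
    vs <- sort(v)
    lo <- vs[g + 1]; hi <- vs[n - g]
    pmin(pmax(v, lo), hi)                    # pull extreme values in to lo/hi
  }
  cor(winsorize(x), winsorize(y))            # Pearson's r on the Winsorized values
}

# For comparison, the rank-based alternatives described above:
set.seed(2)
x <- rnorm(40); y <- 0.5 * x + rnorm(40)
wincor_simple(x, y, gamma = 0.2)
cor(x, y, method = "spearman")
cor(x, y, method = "kendall")
```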
Skipped Correlation

Following Wilcox (2017), a projection-type method is used to detect outliers, which can be described as follows. Consider a random sample and let n denote the sample size. The method begins by finding the center of the data cloud, ζ̂; in this paper, the marginal medians are used for ζ̂. Next, for a fixed point Xᵢ, project all n points onto the line connecting the center ζ̂ and Xᵢ. The immediate goal is to compute the distance of each of the projected points from ζ̂. Let

$$A_i = X_i - \hat{\zeta}, \qquad B_j = X_j - \hat{\zeta},$$

where Aᵢ and Bⱼ are column vectors of length p, and let

$$C_j = \frac{A_i' B_j}{A_i' A_i} A_i, \qquad j = 1, \ldots, n.$$

When projecting the points onto the line between Xᵢ and ζ̂, the projection distance of the jth point from ζ̂ is

$$D_{ij} = \|C_j\| = \sqrt{C_{j1}^2 + \cdots + C_{jp}^2}.$$

Here, a modification of the boxplot rule is used to check for outliers among the Dᵢⱼ values. For fixed i, put Dᵢ₁, ..., Dᵢₙ in ascending order, D_{i(1)} ≤ ··· ≤ D_{i(n)}. The ideal fourths associated with the Dᵢⱼ values are

$$q_1 = (1-h)D_{i(l)} + hD_{i(l+1)}, \qquad q_2 = (1-h)D_{i(k)} + hD_{i(k-1)},$$

where l = [n/4 + 5/12], [·] is the greatest integer function, k = n − l + 1, and

$$h = \frac{n}{4} + \frac{5}{12} - l.$$

A point is declared an outlier if any of its n projections satisfies

$$D_{ij} > M_D + \sqrt{\chi^2_{0.975,p}}\,(q_2 - q_1), \qquad (2)$$

where M_D is the usual sample median based on the Dᵢ₁, ..., Dᵢₙ values and χ²₀.₉₇₅,ₚ is the 0.975 quantile of a chi-squared distribution with p degrees of freedom. Repeating this process for each i, i = 1, ..., n, a point is declared an outlier if any of these projections satisfies Eq. (2).

An alternative criterion for detecting outliers based on the Dᵢⱼ values is to replace the interquartile range (q₂ − q₁) with the median absolute deviation (MAD). Here, MAD is the median of the values |Dᵢ₁ − M_D|, ..., |Dᵢₙ − M_D|. A point is then declared an outlier if any of the n projections satisfies

$$D_{ij} > M_D + \sqrt{\chi^2_{0.975,p}}\,\frac{\mathrm{MAD}}{0.6745}, \qquad (3)$$

where MAD/0.6745 estimates the standard deviation under normality. Let pₙ denote the outside rate per observation, the expected proportion of points declared outliers in a random sample of size n. Although a positive feature of MAD is that it has a higher finite-sample breakdown point (0.5) than the interquartile range (0.25), a negative feature associated with Eq. (3) is that pₙ seems to be less stable as a function of n (Wilcox, 2017). Thus, in this paper, Eq. (2) is used as the outlier detection criterion.

A criticism is that a change in scale can affect which points are declared outliers. Here, this concern is avoided by standardizing the marginal distributions; otherwise the computations remain the same as just described.

The second projection-type technique used to detect outliers is based on the R function "depthProjection" in the R package "DepthProc" (Kosiorowski and Zawadzki, 2020). This function calculates projection depth over randomly selected projections instead of using all possible projections. The R function "out.pro" then converts depth to distance, and outliers are detected based on the interquartile range as described above. An advantage of this second approximation method is that it significantly reduces the execution time when the sample size is large.
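The following simplified R sketch illustrates the check in Eq. (2) for a data matrix x with n rows and p columns. It is an illustration only, not a substitute for Wilcox's Rallfun functions, and the helper names ideal_fourths and outpro_sketch are invented here.

```r
# Simplified sketch of the projection-based outlier check in Eq. (2)
# (illustrative; the thesis relies on Wilcox's Rallfun implementation).
ideal_fourths <- function(d) {
  n <- length(d); ds <- sort(d)
  l <- floor(n / 4 + 5 / 12); h <- n / 4 + 5 / 12 - l; k <- n - l + 1
  q1 <- (1 - h) * ds[l] + h * ds[l + 1]
  q2 <- (1 - h) * ds[k] + h * ds[k - 1]
  c(q1, q2)
}

outpro_sketch <- function(x) {
  x <- scale(x)                              # standardize the marginal distributions
  n <- nrow(x); p <- ncol(x)
  center <- apply(x, 2, median)              # marginal medians as the center
  flagged <- rep(FALSE, n)
  for (i in 1:n) {
    A <- x[i, ] - center
    if (sum(A^2) == 0) next
    # distance of every projected point from the center, along the direction of A
    D <- sapply(1:n, function(j) {
      C <- sum(A * (x[j, ] - center)) / sum(A^2) * A
      sqrt(sum(C^2))
    })
    q <- ideal_fourths(D)
    cutoff <- median(D) + sqrt(qchisq(0.975, p)) * (q[2] - q[1])
    flagged <- flagged | (D > cutoff)        # Eq. (2): flag points exceeding the cutoff
  }
  which(flagged)                             # indices of declared outliers
}
```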
BCa Confidence Interval

Let (Xᵢ₁, Xᵢ₂, Yᵢ), i = 1, ..., n, denote a random sample. A basic percentile bootstrap method (Efron, 1992) is applied as follows.

1. Generate a bootstrap sample by resampling with replacement n points from (X₁₁, X₁₂, Y₁), ..., (Xₙ₁, Xₙ₂, Yₙ), yielding (X*₁₁, X*₁₂, Y*₁), ..., (X*ₙ₁, X*ₙ₂, Y*ₙ).
2. Detect and remove any point that is declared an outlier by the OP method.
3. Compute the correlation coefficient, using one of the three methods described earlier, from the remaining points in this bootstrap sample, yielding ξ*ⱼ, the correlation between Y and Xⱼ (j = 1, 2), and let d* = ξ*₁ − ξ*₂ denote the difference between these two correlation coefficients.
4. Repeat steps 1 to 3 B times and let d*_b (b = 1, ..., B) denote the resulting d* values.
5. Put the d*_b in ascending order, yielding d*₍₁₎ ≤ ··· ≤ d*₍B₎.
6. Based on this bootstrap sampling distribution, the percentile bootstrap confidence interval for ξ₁ − ξ₂ is

$$(d^*_{(l+1)},\; d^*_{(u)}), \qquad (4)$$

where l = αB/2, rounded to the nearest integer, and u = B − l.

The focus here is on the BCa bootstrap confidence interval, a modification of the percentile bootstrap method that incorporates two correction factors dealing with bias and skewness in the distribution of the bootstrap estimates. The construction of a BCa confidence interval follows the same steps; the detailed computations of the relevant quantities are described in Appendix A.

Design of the Simulation Study

Four types of g-and-h distributions are used to generate the simulation data (n = 40), and three variation patterns are used to model heteroscedasticity. Let Z be a random variable having a standard normal distribution. Then

$$W = \begin{cases} \dfrac{\exp(gZ) - 1}{g}\exp(hZ^2/2), & \text{if } g > 0 \\ Z\exp(hZ^2/2), & \text{if } g = 0 \end{cases} \qquad (5)$$

has a g-and-h distribution (Hoaglin, 2006), where g and h are parameters that determine the first four moments. The four distributions considered here are the standard normal distribution (g = h = 0), a symmetric heavy-tailed distribution (g = 0, h = 0.2), an asymmetric distribution with relatively light tails (g = 0.2, h = 0), and an asymmetric distribution with heavy tails (g = h = 0.2).

There are two basic situations when comparing dependent correlation estimates, but only the overlapping case is considered in this paper. Consider the random variables Y, X₁ and X₂ having some unknown multivariate distribution and let ξⱼ be some robust measure of association between Xⱼ and Y (j = 1, 2). For the overlapping case, data are generated according to the model

$$Y = \lambda(X_1, X_2)\,\epsilon, \qquad (6)$$

where ε is an error term and the function λ(X₁, X₂) models heteroscedasticity. More specifically, three choices for λ(X₁, X₂) are used: λ(X₁, X₂) = 1, λ(X₁, X₂) = |X₁| + |X₂| + 1 (the conditional variance of Y is smallest when the X's are close to their means), and λ(X₁, X₂) = 1/(|X₁| + |X₂| + 1) (the conditional variance of Y is largest when the X's are close to their means) (Wilcox, 2016). These three λ functions will be called variation patterns (VP) 1, 2, and 3. The R function rmul (Rallfun, https://dornsifelive.usc.edu/labs/rwilcox/software/) is used to generate values for X₁ and X₂; its arguments g and h allow users to specify the marginal distributions and thus to obtain data from any of the g-and-h distributions mentioned above. Similarly, the error term is generated from a g-and-h distribution, which yields the values of Y.

A bootstrap sample is generated with replacement from the original 40 observations. The difference between the two skipped correlations is denoted d*. This process is repeated 1000 times to obtain the resulting differences d*₁, ..., d*₁₀₀₀ between the two skipped correlations. These values are arranged in ascending order, and a confidence interval is obtained via the bias-corrected and accelerated bootstrap method.
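As an illustration of this design, the sketch below generates one simulated data set. Here ghdist is written directly from Eq. (5) rather than taken from Rallfun, and X₁ and X₂ are generated independently for simplicity, whereas rmul allows a specified correlation structure; gen_overlapping is an illustrative name.

```r
# Minimal sketch of one simulated data set under the design above
# (ghdist follows Eq. (5); the thesis itself uses rmul from Rallfun).
ghdist <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  if (g > 0) (exp(g * z) - 1) / g * exp(h * z^2 / 2) else z * exp(h * z^2 / 2)
}

gen_overlapping <- function(n = 40, g = 0, h = 0, vp = 1) {
  x1 <- ghdist(n, g, h)                      # X1 and X2 generated independently here;
  x2 <- ghdist(n, g, h)                      # rmul allows a specified correlation
  lambda <- switch(vp,
    rep(1, n),                               # VP 1: homoscedastic
    abs(x1) + abs(x2) + 1,                   # VP 2
    1 / (abs(x1) + abs(x2) + 1))             # VP 3
  eps <- ghdist(n, g, h)                     # error term from the same g-and-h family
  data.frame(x1 = x1, x2 = x2, y = lambda * eps)
}

dat <- gen_overlapping(n = 40, g = 0.2, h = 0.2, vp = 2)
```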
Simulation Results

Tables 2 and 3 show the estimated Type I error probabilities for the overlapping case when testing at the 0.05 level, using the two depth approximation techniques, respectively. Due to the high execution time of the skipped correlation with the first depth approximation technique combined with the BCa method, the estimated Type I error rates were based on 1000 replications. As a general guide, the actual level should be between 0.025 and 0.075 when testing at the 0.05 level (Bradley, 1978). Based on the method in Pratt (1968), with 1000 replications, the hypothesis that the actual level is greater than or equal to 0.075 is rejected at the 0.05 level if the estimated level is less than or equal to 0.061. Similarly, the hypothesis that the actual level is less than or equal to 0.025 is rejected if the estimated level is greater than or equal to 0.034.

As can be seen from Table 2, all of the estimates based on the first approximation technique satisfied Bradley's criterion when n = 40. Furthermore, based on Pratt's criterion, the Winsorized correlation in all situations rejected the hypotheses that the actual level is greater than or equal to 0.075 or less than or equal to 0.025. Spearman's rho had one relatively large estimate, 0.069, with VP 2 under the asymmetric, heavy-tailed distribution (g = h = 0.2). Similarly, Kendall's tau had one relatively large estimate, 0.073, with VP 2 under the symmetric, light-tailed distribution (g = h = 0). Additional simulations were run with n = 60 to examine the extent to which increasing the sample size avoids inflation of the Type I error probability. The results show that, after increasing the sample size to 60, the estimate based on Spearman's rho decreased to 0.045; however, the estimate based on Kendall's tau did not change appreciably (0.07).

When using the second depth approximation outlier detection technique, the estimated Type I error rates for all three correlation methods satisfied Bradley's criterion when n = 40. All estimates based on the Winsorized correlation again satisfied Pratt's criterion. Only two estimates were relatively large compared to the rest, namely Spearman's rho (0.066) with VP 2 under the symmetric, heavy-tailed distribution (g = 0, h = 0.2), and Kendall's tau (0.067) under the same distribution (g = 0, h = 0.2) but with VP 1. After increasing the sample size to 60, the estimate for Spearman's rho decreased to 0.058 and that for Kendall's tau to 0.038.

In summary, the two depth approximation techniques gave similar results, with the second technique yielding smaller and more consistent estimates and much faster execution times.
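Conceptually, each estimated Type I error probability above is the proportion of replications in which the confidence interval for the difference between the two skipped correlations excludes zero. The sketch below illustrates that computation; gen_fun and ci_fun are placeholders for a data-generating function (such as gen_overlapping in the earlier sketch) and the full skipped-correlation-plus-BCa procedure, which is not reproduced here.

```r
# Conceptual sketch of how an estimated Type I error probability is obtained:
# the proportion of replications whose confidence interval excludes 0 under H0.
# 'gen_fun' and 'ci_fun' are placeholders, not functions from the thesis.
estimate_type1 <- function(nrep = 1000, alpha = 0.05, gen_fun, ci_fun) {
  rejects <- replicate(nrep, {
    dat <- gen_fun()                          # one simulated data set under H0
    ci <- ci_fun(dat, alpha = alpha)          # c(lower, upper) for the correlation difference
    ci[1] > 0 || ci[2] < 0                    # reject H0 if 0 lies outside the interval
  })
  mean(rejects)                               # estimated Type I error probability
}
# Bradley's (1978) liberal criterion: the estimate should fall in [0.025, 0.075].
```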
An Illustration

To illustrate the methods, data from the Well Elderly 2 study (Clark et al., 2012) are used to estimate confidence intervals involving two independent variables and one dependent variable. The Well Elderly 2 study assessed the effectiveness of a lifestyle intervention program aimed at improving the physical and emotional well-being of older adults. Two explanatory variables, CAR and LSIZ, and one dependent variable, MAPAGLOB, are selected for the illustration. CAR measures the cortisol awakening response, the change between the cortisol level upon awakening and the level 30-45 minutes after waking; LSIZ measures the individual's life satisfaction; and MAPAGLOB is a measure of meaningful activities.

The confidence intervals for the difference between the two skipped correlations based on Spearman's rho, the Winsorized correlation, and Kendall's tau are (0.277, 0.628), (0.241, 0.621) and (0.191, 0.435), respectively, via the first depth approximation technique. The corresponding confidence intervals via the second distance approximation technique are (0.25, 0.623), (0.236, 0.636) and (0.171, 0.436). As can be seen, the confidence intervals are similar across the two outlier detection techniques. However, the execution time when using the second approximation technique is vastly reduced (20 seconds) compared to the first approximation technique (267 seconds).

Conclusions

This paper evaluates the control of the Type I error probability when using skipped correlations to compare correlation coefficients in the overlapping case. BCa confidence intervals for the difference between the two correlations are computed after outliers are removed by one of two projection-type outlier detection techniques. For the first approximation technique, the skipped estimates based on the Winsorized correlation control the Type I error rate relatively well and stably in all situations; after increasing the sample size to 60, the skipped correlations based on Spearman's rho and Kendall's tau perform better. For the second depth approximation technique, all three skipped correlations control the Type I error relatively well in all situations; Spearman's rho and Kendall's tau have some relatively large estimates, but these still satisfy Bradley's criterion. One advantage of the second projection-type outlier detection technique is that it reduces the execution time significantly compared to the first technique.

References

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31(2), 144-152.

Clark, F., Jackson, J., Carlson, M., Chou, C.-P., Cherry, B. J., Jordan-Marsh, M., ... others (2012). Effectiveness of a lifestyle intervention in promoting the well-being of independently living older people: results of the Well Elderly 2 randomised controlled trial. Journal of Epidemiology and Community Health, 66(9), 782-790.

Devlin, S. J., Gnanadesikan, R., & Kettenring, J. R. (1981). Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association, 76(374), 354-362.

Dunn, O. J., & Clark, V. (1969). Correlation coefficients measured on the same individuals. Journal of the American Statistical Association, 64(325), 366-377.

Dunn, O. J., & Clark, V. (1971). Comparison of tests of the equality of dependent correlation coefficients. Journal of the American Statistical Association, 66(336), 904-908.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171-185.

Efron, B. (1992). Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics (pp. 569-593). Springer.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust Statistics: The Approach Based on Influence Functions (Vol. 196). John Wiley & Sons.

Hittner, J. B., May, K., & Silver, N. C. (2003). A Monte Carlo evaluation of tests for comparing dependent correlations. The Journal of General Psychology, 130(2), 149-168.

Hoaglin, D. C. (2006). Summarizing shape numerically: The g-and-h distributions. In Exploring Data Tables, Trends, and Shapes (pp. 461-513).
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81-93.

Leroy, A. M., & Rousseeuw, P. J. (1987). Robust Regression and Outlier Detection. Wiley Series in Probability and Mathematical Statistics.

Liu, R. Y. (1990). On a notion of data depth based on random simplices. The Annals of Statistics, 18(1), 405-414.

Mahalanobis, P. C. (1936). On the generalized distance in statistics.

Meng, X.-L., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111(1), 172-175.

Pratt, J. W. (1968). A normal approximation for binomial, F, beta, and other common, related tail probabilities, II. Journal of the American Statistical Association, 63(324), 1457-1483.

Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79(388), 871-880.

Rousseeuw, P. J., & Driessen, K. V. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3), 212-223.

Spearman, C. (1961). The proof and measurement of association between two things.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2), 245-251.

Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975 (Vol. 2, pp. 523-531).

Wilcox, R. R. (1993). Some results on a Winsorized correlation coefficient. British Journal of Mathematical and Statistical Psychology, 46(2), 339-349.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and nonnormality. Communications in Statistics-Simulation and Computation, 38(10), 2220-2234.

Wilcox, R. R. (2016). Comparing dependent robust correlations. British Journal of Mathematical and Statistical Psychology, 69(3), 215-224.

Wilcox, R. R. (2017). Introduction to Robust Estimation and Hypothesis Testing (4th ed.). San Diego, CA: Academic Press.

Williams, E. J. (1959). The comparison of regression variables. Journal of the Royal Statistical Society: Series B (Methodological), 21(2), 396-399.

Zou, G. Y. (2007). Toward using confidence intervals to compare correlations. Psychological Methods, 12(4), 399-413.

Zuo, Y., et al. (2003). Projection-based depth functions and associated medians. Annals of Statistics, 31(5), 1460-1490.
Appendix A

BCa Method

The bias-corrected and accelerated bootstrap method corrects for both bias and skewness in the distribution of the bootstrap estimates using two factors, z₀ and a (Efron, 1987). The bias-correction factor is estimated as

$$\hat{z}_0 = \Phi^{-1}\!\left(\frac{\#\{\hat{\theta}^*_b < \hat{\theta}\}}{B}\right), \qquad (7)$$

where Φ⁻¹ is the inverse of the standard normal cumulative distribution function and the numerator is the number of bootstrap parameter estimates that are smaller than the original parameter estimate θ̂. In the current case, the difference between the two correlation estimates, d̂, plays the role of the parameter estimate.

The factor a, which corrects for skewness in the distribution of the bootstrap estimates, can be estimated via jackknife resampling. The jackknife estimator uses a leave-one-out resampling technique to obtain the estimates θ̂₍₋ᵢ₎. After the n samples of size n − 1 have been used, the average of these estimates is

$$\hat{\theta}_{(\cdot)} = \sum_{i=1}^{n} \frac{\hat{\theta}_{(-i)}}{n}. \qquad (8)$$

The factor a can then be calculated as

$$\hat{a} = \frac{\sum_{i=1}^{n} \bigl(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(-i)}\bigr)^3}{6\left\{\sum_{i=1}^{n} \bigl(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(-i)}\bigr)^2\right\}^{3/2}}. \qquad (9)$$

With the values of ẑ₀ and â, the adjusted levels α₁ and α₂ can be calculated:

$$\alpha_1 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right), \qquad (10)$$

$$\alpha_2 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right), \qquad (11)$$

where z_{α/2} is the 100(α/2)th percentile of a standard normal distribution. The BCa confidence interval is constructed as (θ̂*₍l₎, θ̂*₍u₎), where l = α₁B and u = α₂B.
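As a minimal illustration of Eqs. (7)-(11), the R sketch below computes a BCa interval for a generic statistic passed in as a function. The name bca_ci is illustrative, and this is not the implementation used for the simulations in this thesis.

```r
# Minimal sketch of a BCa interval following Eqs. (7)-(11) for a generic
# statistic 'stat' of a data frame (illustrative only).
bca_ci <- function(data, stat, B = 1000, alpha = 0.05) {
  n <- nrow(data)
  theta_hat <- stat(data)
  theta_star <- replicate(B, stat(data[sample(n, replace = TRUE), , drop = FALSE]))
  # Bias-correction factor, Eq. (7)
  z0 <- qnorm(mean(theta_star < theta_hat))
  # Acceleration factor via the jackknife, Eqs. (8)-(9)
  theta_jack <- sapply(1:n, function(i) stat(data[-i, , drop = FALSE]))
  theta_dot <- mean(theta_jack)
  a <- sum((theta_dot - theta_jack)^3) / (6 * sum((theta_dot - theta_jack)^2)^(3/2))
  # Adjusted quantile levels, Eqs. (10)-(11)
  z_lo <- qnorm(alpha / 2); z_hi <- qnorm(1 - alpha / 2)
  a1 <- pnorm(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
  a2 <- pnorm(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))
  # Endpoints from the bootstrap distribution at the adjusted levels
  quantile(theta_star, probs = c(a1, a2), names = FALSE)
}

# In the setting of this thesis, 'stat' would be a function that removes the points
# flagged by the OP outlier detection method and returns the difference between the
# two skipped correlations.
```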